<a href="https://colab.research.google.com/github/nickdhollman/MS-BAnDS-Google-Colab/blob/main/Association_Rules_Mining_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

MARKET BASKET ANALYSIS INCLUDES ASSOCIATION RULES

*IN ADDITION TO THE PREDICTIVE BUSINESS ANALYTICS MATERIALS, SEE MATERIALS LOCATED IN THE FOLLOWING DRIVE LOCATION C:\Users\nickd\OneDrive - Oklahoma A and M System\Programming for Data Science\Week 8 - Association Rule Mining FOR MORE INFO ON ASSOCIATION RULE MINING*

TUTORIAL WEBSITE: https://365datascience.com/tutorials/python-tutorials/market-basket-analysis/

DATASET: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset?resource=download

In [None]:
from google.colab import files

uploaded = files.upload()

Saving Groceries_dataset.csv to Groceries_dataset.csv


In [None]:
# Importing libraries
!pip install sidetable
!pip install mlxtend
import numpy as np
import pandas as pd
import sidetable
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules



In [None]:
# Loading the data
groceries = pd.read_csv("Groceries_dataset.csv")

In [None]:
#initial data exploration
print(groceries.head())
print(groceries.shape)
print(groceries.dtypes)

   Member_number        Date   itemDescription
0           1808  21-07-2015    tropical fruit
1           2552  05-01-2015        whole milk
2           2300  19-09-2015         pip fruit
3           1187  12-12-2015  other vegetables
4           3037  01-02-2015        whole milk
(38765, 3)
Member_number       int64
Date               object
itemDescription    object
dtype: object


THREE MAIN METRICS USED WITH THE APRIORI ALGORITHM: SUPPORT, CONFIDENCE, LIFT

DETAILS OF EACH BELOW:

Support(item) = Transactions comprising the item (OR ITEMS) / Total transactions

We use SUPPORT to assess the overall popularity of a given product.

If support = 0.75, this means that the item is present in 75% of purchases.

A high support value indicates that the item appears in a large share of purchases, so marketers may want to focus on it more.

*For the support of a rule (e.g., Bread -> Milk), a value of 0.05 indicates that bread and milk were purchased together in 5% of all purchases.*
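As a minimal sketch of the support formula above, the snippet below counts how often an itemset appears in a small, made-up list of baskets. The `toy_baskets` data and the `support` helper are illustrative assumptions; they are not part of the groceries dataset or of mlxtend.

```python
# Hypothetical toy baskets (not taken from the groceries dataset).
toy_baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk"},
    {"butter"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

bread_support = support({"bread"}, toy_baskets)               # 3 of 5 baskets
bread_milk_support = support({"bread", "milk"}, toy_baskets)  # 2 of 5 baskets
print(bread_support, bread_milk_support)
```

The same helper works for a single item or for a rule's combined itemset, which mirrors the "item (OR ITEMS)" wording in the definition.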

Confidence (Bread -> Milk) = Transactions comprising bread and milk (SUPPORT VALUE OF BREAD -> MILK) / Transactions comprising bread (SUPPORT BREAD ONLY)

Confidence tells us the likelihood of different purchase combinations

Confidence (Bread -> Milk) = ¾ = 0.75. This means that 75% of the customers who bought bread also purchased milk. - GIVEN 4 CUSTOMERS BOUGHT BREAD, 3 PURCHASED MILK & BREAD
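The confidence arithmetic can be sketched in a few lines. The counts come from the bread and milk example; the total of 40 transactions is a hypothetical value, assumed only to show that dividing counts and dividing support values give the same answer.

```python
import math

# Counts taken from the bread and milk example above.
n_bread = 4            # transactions containing bread
n_bread_and_milk = 3   # transactions containing both bread and milk

# Hypothetical total number of transactions, assumed only for illustration.
n_total = 40

# Confidence computed from raw counts.
confidence_from_counts = n_bread_and_milk / n_bread

# Confidence computed from support values; the total cancels out.
support_bread = n_bread / n_total
support_bread_and_milk = n_bread_and_milk / n_total
confidence_from_support = support_bread_and_milk / support_bread

print(confidence_from_counts)
print(math.isclose(confidence_from_counts, confidence_from_support))
```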


Lift measures how much more likely milk is to be purchased when bread is purchased, compared with how often milk is purchased overall.

Lift = Confidence (Bread -> Milk) / Support(Milk) = 0.75/0.10 = 7.5.

(Here, support(milk) = 0.10.)

This means that customers who buy bread are 7.5 times more likely to buy milk than customers in general.

Lift > 1: This indicates that the presence of A increases the likelihood of B occurring. The rule "A → B" is stronger than random chance.

Lift = 1: This indicates that A and B are independent. The occurrence of A does not affect the likelihood of B occurring.

Lift < 1: This indicates that the presence of A decreases the likelihood of B occurring. The rule "A → B" is weaker than random chance.

*LIFT IS SYMMETRIC, SO THE LIFT OF RULE A -> B IS THE SAME AS THE LIFT OF RULE B -> A*

*THIS MEASURES HOW MUCH BETTER THE RULE IS FOR PREDICTION THAN A RANDOM GUESS*
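To make the lift calculation concrete, here is a small sketch on hypothetical baskets built to reproduce the numbers in the bread and milk example (confidence 0.75, support of milk 0.10). The basket list and the three helper functions are assumptions for illustration only. It also checks that swapping antecedent and consequent changes the confidence but not the lift.

```python
import math

# Hypothetical baskets chosen to match the example:
# 50 transactions, 4 with bread, 5 with milk, 3 with both.
baskets = (
    [{"bread", "milk"}] * 3
    + [{"bread"}] * 1
    + [{"milk"}] * 2
    + [{"other"}] * 44
)

def support(itemset):
    """Share of baskets containing every item in `itemset`."""
    return sum(1 for b in baskets if set(itemset) <= b) / len(baskets)

def confidence(antecedent, consequent):
    """Support of both sides together divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

lift_bread_milk = lift({"bread"}, {"milk"})
lift_milk_bread = lift({"milk"}, {"bread"})

print(round(confidence({"bread"}, {"milk"}), 2))
print(round(confidence({"milk"}, {"bread"}), 2))
print(round(lift_bread_milk, 2), round(lift_milk_bread, 2))
```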

*NOTE: THE LIFT CALCULATION SHOWN IN THE TUTORIAL IS WRONG, BUT THE KAGGLE LINK SHOWS THE CORRECT CALCULATION*

In [None]:
df = groceries.copy()

Before we perform market basket analysis, we need to convert this data into a format that can easily be ingested into the Apriori algorithm. In other words, we need to turn it into a tabular structure comprising ones and zeros, as displayed in the bread and milk example above.

To achieve this, first group items that have the same member number and date:

In [None]:
df['single_transaction'] = df['Member_number'].astype(str)+'_'+df['Date'].astype(str)

df.head()
#this is combining the member number and date as a single variable which gives us a unique transaction number

Unnamed: 0,Member_number,Date,itemDescription,single_transaction
0,1808,21-07-2015,tropical fruit,1808_21-07-2015
1,2552,05-01-2015,whole milk,2552_05-01-2015
2,2300,19-09-2015,pip fruit,2300_19-09-2015
3,1187,12-12-2015,other vegetables,1187_12-12-2015
4,3037,01-02-2015,whole milk,3037_01-02-2015


In [None]:
df2 = pd.crosstab(df['single_transaction'], df['itemDescription'])
df2.head()
#this gives us the breakdown of products within each transaction

single_transaction,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000_15-03-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
1000_24-06-2014,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1000_24-07-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_25-11-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_27-05-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
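A small sketch of the same two steps on a made-up mini dataset may help. The `toy` frame below is hypothetical and only reuses the column names of the groceries file. It shows how the combined key identifies a transaction and how `pd.crosstab` produces counts, including a count of 2 when the same item appears twice in one transaction.

```python
import pandas as pd

# Hypothetical mini dataset with the same three columns as the groceries file.
toy = pd.DataFrame({
    "Member_number": [1, 1, 1, 2],
    "Date": ["01-01-2015", "01-01-2015", "01-01-2015", "02-01-2015"],
    "itemDescription": ["whole milk", "whole milk", "yogurt", "soda"],
})

# Same key construction as in the notebook: member number + "_" + date.
toy["single_transaction"] = toy["Member_number"].astype(str) + "_" + toy["Date"]

# One row per transaction, one column per item, cell = number of occurrences.
toy_counts = pd.crosstab(toy["single_transaction"], toy["itemDescription"])
print(toy_counts)
```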


The final data pre-processing step involves encoding all values in the above data frame to 0 and 1.

This means that even if there are multiples of the same item in the same transaction (value > 1), the value will be encoded to 1 since market basket analysis does not take purchase frequency into consideration.

For details on the map function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.map.html#pandas.DataFrame.map - basically, it applies a function to every element of the DataFrame and returns a new DataFrame, without the need for an explicit for loop. (DataFrame.map was added in pandas 2.1; older versions call it applymap.)

In [None]:
def encode(item_freq):
    res = 0
    if item_freq > 0:
        res = 1
    return res

basket_input = df2.map(encode)
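As an alternative sketch, the same 0/1 encoding can be done with a single vectorized comparison instead of an element-wise function. The `counts` table below is a hypothetical stand-in for `df2`; the comparison returns booleans directly, which is the dtype the notebook converts to before running apriori.

```python
import pandas as pd

# Hypothetical count table standing in for df2 (rows = transactions, columns = items).
counts = pd.DataFrame(
    {"whole milk": [2, 0, 1], "yogurt": [0, 0, 3]},
    index=["t1", "t2", "t3"],
)

def encode(item_freq):
    return 1 if item_freq > 0 else 0

# Element-wise version: DataFrame.map in pandas >= 2.1, applymap in older versions.
elementwise = counts.map(encode) if hasattr(counts, "map") else counts.applymap(encode)

# Vectorized version: compare the whole table at once; the result is boolean.
vectorized = counts > 0

print(vectorized)
print((elementwise.astype(bool) == vectorized).all().all())
```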

In [None]:
basket_input.head()

single_transaction,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000_15-03-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
1000_24-06-2014,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1000_24-07-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_25-11-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_27-05-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
print(basket_input.columns)

Index(['Instant food products', 'UHT-milk', 'abrasive cleaner',
       'artif. sweetener', 'baby cosmetics', 'bags', 'baking powder',
       'bathroom cleaner', 'beef', 'berries',
       ...
       'turkey', 'vinegar', 'waffles', 'whipped/sour cream', 'whisky',
       'white bread', 'white wine', 'whole milk', 'yogurt', 'zwieback'],
      dtype='object', name='itemDescription', length=167)


CONVERT TO BOOL TYPES TO IMPROVE PERFORMANCE

In [None]:
basket_input = basket_input.astype(bool)

In [None]:
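# Find itemsets that appear in at least 0.1% of all transactions
frequent_itemsets = apriori(basket_input, min_support=0.001, use_colnames=True)

# Generate rules scored by lift. min_threshold is not set, so mlxtend's default
# (0.8 per its documentation) applies and rules with lift below 1 are still returned.
rules = association_rules(frequent_itemsets, metric="lift")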

rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(bottled water),(UHT-milk),0.060683,0.021386,0.001069,0.017621,0.823954,1.0,-0.000228,0.996168,-0.185312,0.013201,-0.003847,0.033811
1,(UHT-milk),(bottled water),0.021386,0.060683,0.001069,0.05,0.823954,1.0,-0.000228,0.988755,-0.179204,0.013201,-0.011373,0.033811
2,(other vegetables),(UHT-milk),0.122101,0.021386,0.002139,0.017515,0.818993,1.0,-0.000473,0.99606,-0.201119,0.01513,-0.003956,0.058758
3,(UHT-milk),(other vegetables),0.021386,0.122101,0.002139,0.1,0.818993,1.0,-0.000473,0.975443,-0.184234,0.01513,-0.025175,0.058758
4,(UHT-milk),(sausage),0.021386,0.060349,0.001136,0.053125,0.880298,1.0,-0.000154,0.992371,-0.121998,0.014096,-0.007688,0.035976


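Here, the “antecedents” and “consequents” columns show the left-hand and right-hand item(s) of each rule (antecedent -> consequent).

In this example, the first row of the dataset is the rule bottled water -> UHT-milk. Its confidence is 0.0176, so about 1.8% of transactions containing bottled water also contain UHT-milk. Its lift of 0.82 is below 1, which means that buying bottled water makes UHT-milk slightly less likely than average, not more.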

To get the most frequent item combinations in the entire dataset, let’s sort the dataset by support, confidence, and lift:

In [None]:
# retrieve the top 8 rules - sorting by support, then confidence, then lift
rules.sort_values(["support", "confidence","lift"],axis = 0, ascending = False).head(8)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
623,(rolls/buns),(whole milk),0.110005,0.157923,0.013968,0.126974,0.804028,1.0,-0.003404,0.96455,-0.214986,0.055,-0.036752,0.107711
622,(whole milk),(rolls/buns),0.157923,0.110005,0.013968,0.088447,0.804028,1.0,-0.003404,0.97635,-0.224474,0.055,-0.024222,0.107711
695,(yogurt),(whole milk),0.085879,0.157923,0.011161,0.129961,0.82294,1.0,-0.002401,0.967861,-0.190525,0.047975,-0.033206,0.100317
694,(whole milk),(yogurt),0.157923,0.085879,0.011161,0.070673,0.82294,1.0,-0.002401,0.983638,-0.203508,0.047975,-0.016634,0.100317
550,(soda),(other vegetables),0.097106,0.122101,0.009691,0.099794,0.817302,1.0,-0.002166,0.975219,-0.198448,0.046252,-0.02541,0.089579
551,(other vegetables),(soda),0.122101,0.097106,0.009691,0.079365,0.817302,1.0,-0.002166,0.980729,-0.202951,0.046252,-0.019649,0.089579
649,(sausage),(whole milk),0.060349,0.157923,0.008955,0.148394,0.939663,1.0,-0.000575,0.988811,-0.063965,0.042784,-0.011316,0.102551
648,(whole milk),(sausage),0.157923,0.060349,0.008955,0.056708,0.939663,1.0,-0.000575,0.99614,-0.070851,0.042784,-0.003875,0.102551


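The resulting table shows that the four most frequent product combinations among the returned rules are listed below, in order of support (each pair appears twice in the table, once per rule direction):

Rolls/buns and whole milk

Yogurt and whole milk

Soda and other vegetables

Sausage and whole milk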

##### SAMPLE INTERPRETATION FOR FIRST ROW:
antecedent support = 0.11, this means that rolls/buns is present in 11% of purchases

consequent support = 0.1579, this means that whole milk is present in 15.79% of purchases

support = 0.013968, this means that rolls/buns & whole milk are both present in 1.4% of purchases

This is the highest rule support in the table, but at 1.4% the pair still appears in only a small share of purchases.

Confidence (rolls/buns -> whole milk) = (support of both rolls/buns & whole milk (0.013968) / support of rolls/buns (0.11)) = ~0.12697. This means that 12.7% of the customers who bought rolls/buns also purchased whole milk.

Lift = (Confidence (rolls/buns -> whole milk)(0.12697)) / (Support(whole milk)(0.1579)) = 0.80.

Note that lift divides by the support of the consequent (whole milk), not the antecedent. Dividing by the support of rolls/buns (0.11) instead gives about 1.15, which is not the lift.

This means that customers who buy rolls/buns are 0.80 times as likely to buy whole milk as customers in general. Because the lift is below 1, the presence of rolls/buns decreases the likelihood of whole milk occurring. The rule "rolls/buns → whole milk" is weaker than random chance.
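The confidence and lift for this row can be rechecked from the three support values in the table. The sketch below copies those values by hand, so small rounding differences from the table are expected. Leverage is included as well, using its standard definition (rule support minus the product of the two individual supports).

```python
# Values copied from the rolls/buns -> whole milk row of the rules table above.
antecedent_support = 0.110005   # rolls/buns
consequent_support = 0.157923   # whole milk
rule_support = 0.013968         # both together

# Confidence: share of rolls/buns transactions that also contain whole milk.
confidence = rule_support / antecedent_support

# Lift: confidence relative to the overall share of whole milk transactions.
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence.
leverage = rule_support - antecedent_support * consequent_support

print(round(confidence, 4), round(lift, 4), round(leverage, 6))
```

A lift below 1 and a negative leverage tell the same story: the pair occurs together less often than independence would predict.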

In [None]:
# retrieve the top 8 rules - sorting by lift, then support, then confidence
rules.sort_values(["lift", "support", "confidence"],axis = 0, ascending = False).head(8)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
729,"(whole milk, yogurt)",(sausage),0.011161,0.060349,0.00147,0.131737,2.182917,1.0,0.000797,1.082219,0.548014,0.020992,0.075973,0.07805
732,(sausage),"(whole milk, yogurt)",0.060349,0.011161,0.00147,0.024363,2.182917,1.0,0.000797,1.013532,0.576701,0.020992,0.013351,0.07805
728,"(whole milk, sausage)",(yogurt),0.008955,0.085879,0.00147,0.164179,1.91176,1.0,0.000701,1.093681,0.481231,0.015748,0.085657,0.09065
733,(yogurt),"(whole milk, sausage)",0.085879,0.008955,0.00147,0.017121,1.91176,1.0,0.000701,1.008307,0.521727,0.015748,0.008239,0.09065
247,(specialty chocolate),(citrus fruit),0.015973,0.053131,0.001403,0.087866,1.653762,1.0,0.000555,1.038081,0.401735,0.020731,0.036684,0.057141
246,(citrus fruit),(specialty chocolate),0.053131,0.015973,0.001403,0.026415,1.653762,1.0,0.000555,1.010726,0.4175,0.020731,0.010612,0.057141
730,"(sausage, yogurt)",(whole milk),0.005748,0.157923,0.00147,0.255814,1.619866,1.0,0.000563,1.131541,0.384877,0.009065,0.11625,0.132562
731,(whole milk),"(sausage, yogurt)",0.157923,0.005748,0.00147,0.00931,1.619866,1.0,0.000563,1.003596,0.45443,0.009065,0.003583,0.132562


In [None]:
# retrieve the top 8 rules - sorting by confidence, then support, then lift
rules.sort_values(["confidence", "support", "lift"],axis = 0, ascending = False).head(8)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
730,"(sausage, yogurt)",(whole milk),0.005748,0.157923,0.00147,0.255814,1.619866,1.0,0.000563,1.131541,0.384877,0.009065,0.11625,0.132562
712,"(rolls/buns, sausage)",(whole milk),0.005347,0.157923,0.001136,0.2125,1.345594,1.0,0.000292,1.069304,0.258214,0.007007,0.064813,0.109847
724,"(soda, sausage)",(whole milk),0.005948,0.157923,0.001069,0.179775,1.138374,1.0,0.00013,1.026642,0.122281,0.006568,0.025951,0.093273
652,(semi-finished bread),(whole milk),0.00949,0.157923,0.001671,0.176056,1.114825,1.0,0.000172,1.022008,0.103985,0.010081,0.021534,0.093318
718,"(rolls/buns, yogurt)",(whole milk),0.007819,0.157923,0.001337,0.17094,1.082428,1.0,0.000102,1.015701,0.076751,0.00813,0.015459,0.089702
728,"(whole milk, sausage)",(yogurt),0.008955,0.085879,0.00147,0.164179,1.91176,1.0,0.000701,1.093681,0.481231,0.015748,0.085657,0.09065
307,(detergent),(whole milk),0.008621,0.157923,0.001403,0.162791,1.030824,1.0,4.2e-05,1.005814,0.030162,0.008499,0.005781,0.085839
408,(ham),(whole milk),0.017109,0.157923,0.00274,0.160156,1.014142,1.0,3.8e-05,1.002659,0.014188,0.015904,0.002652,0.088754
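The three sorted views rank the same rules by different metrics. Another option is to filter on several metrics at once. The sketch below uses a few rows copied by hand from the tables above; the cut-off values `min_lift` and `min_confidence` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# A few rows copied from the tables above (only the columns needed here).
sample_rules = pd.DataFrame({
    "antecedents": ["rolls/buns", "sausage, yogurt", "rolls/buns, sausage", "ham"],
    "consequents": ["whole milk"] * 4,
    "support": [0.013968, 0.00147, 0.001136, 0.00274],
    "confidence": [0.126974, 0.255814, 0.2125, 0.160156],
    "lift": [0.804028, 1.619866, 1.345594, 1.014142],
})

# Hypothetical cut-offs, chosen only for illustration.
min_lift = 1.1
min_confidence = 0.2

# Keep rules that beat both cut-offs, strongest lift first.
strong = sample_rules[
    (sample_rules["lift"] > min_lift)
    & (sample_rules["confidence"] >= min_confidence)
].sort_values("lift", ascending=False)

print(strong)
```

The same boolean mask could be applied to the full `rules` DataFrame, since it has columns with these names.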


SEQUENCE ANALYSIS VS. ASSOCIATION RULES (QUESTION IN KRITIK ASSIGNMENT):

FROM CHAT GPT:
Sequence Analysis vs. Association Rules

Both sequence analysis and association rules are techniques used in data mining to find patterns, but they focus on different aspects of the data and have different applications. Here's how they relate:

1. Association Rules:
Purpose: Association rules focus on discovering relationships between items that co-occur in a set of transactions. The goal is to find itemsets that appear together frequently.

Example: "If a customer buys bread, they are likely to also buy milk."

Association rules carry no time order. They look for co-occurrences within a transaction but don't account for the order in which the items were purchased.

Key Components:

Antecedent (Left side): The item(s) on the "if" side of the rule (e.g., bread).

Consequent (Right side): The item(s) that are likely to be present when the antecedent is present (e.g., milk).

Metrics like Support, Confidence, and Lift are used to evaluate the strength of the association.

Example Rule: "Bread -> Milk" means that when bread is bought, milk is likely to be bought as well.

2. Sequence Analysis:
Purpose: Sequence analysis goes beyond co-occurrence of items; it focuses on the order or sequence in which events or items occur. It's concerned with finding patterns or frequent subsequences in sequences of items or events that appear in temporal or chronological order.

Example: "A customer buys bread, then milk, then butter."

Sequence analysis considers the order in which these items are purchased. It doesn't just look for "bread and milk together" but "bread followed by milk."

Key Components:

Sequences: Ordered lists of items or events (e.g., "bread → milk → butter").

Sequential Pattern Mining: Identifies subsequences that occur frequently in the dataset.

Applications: Often used in tasks like customer behavior analysis, web page clickstream analysis, and biological sequence analysis.

Relationship Between Sequence Analysis and Association Rules:
Association Rules can be seen as a simpler form of sequence analysis, where the order doesn't matter. In association rule mining, the items just need to co-occur, and there's no consideration of when or in what sequence they appear.

Sequence Analysis is a more complex form of mining that not only looks at which items appear together but also considers the order or temporal aspects. Sequence analysis is especially useful in cases where the temporal sequence is important, like predicting the next item a customer will buy, or studying sequential patterns in time-series data.

Example to Compare Both:
Association Rule: "Bread → Milk" suggests that if someone buys bread, they are likely to buy milk. The arrow gives the direction of the rule, not the order of purchase.

Sequence Analysis: "Bread → Milk → Butter" suggests a specific order: first, the customer buys bread, then milk, and then butter. Sequence analysis would look for such patterns of items occurring in a particular order.

Key Differences:
Order:

Association rules: No order between items.

Sequence analysis: Order matters (sequence of events or items is important).

Context:

Association rules: Typically used for market basket analysis and finding co-occurrence patterns in static transactions.

Sequence analysis: Used when the order of events or time is critical, such as in web page navigation, customer purchase history, or biological sequence analysis.

Example in a Retail Context:
Association Rule Example: You might find a rule like "Bread -> Milk", which says that customers who buy bread are likely to buy milk as well, regardless of when these items are bought.

Sequence Analysis Example: You might find a sequence like "Bread -> Milk -> Butter", which says that after buying bread, customers typically go on to buy milk and then butter in that order, suggesting a pattern of customer behavior over time.

Conclusion:
Association rules are typically applied in contexts where the co-occurrence of items is the primary focus, without concern for the order in which they occur.

Sequence analysis is used when the order or temporal relationship between events is important, such as in time-series or sequential data.

In practice, sequence analysis can be seen as an extension of association rule mining where you add the temporal dimension (or sequence order) to the analysis.
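The difference can be sketched with a few made-up purchase histories. The `histories` list is hypothetical. The code counts item pairs two ways: ignoring order (association-style co-occurrence) and respecting order (a very simple form of sequential pattern counting, not a full sequence-mining algorithm).

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each listed in the order items were bought.
histories = [
    ["bread", "milk", "butter"],
    ["milk", "bread"],
    ["bread", "milk"],
]

unordered = Counter()  # association-style: which items co-occur
ordered = Counter()    # sequence-style: which item comes before which

for history in histories:
    # combinations() keeps list order, so `first` always precedes `later`.
    for first, later in combinations(history, 2):
        ordered[(first, later)] += 1
        unordered[frozenset((first, later))] += 1

print(unordered[frozenset(("bread", "milk"))])
print(ordered[("bread", "milk")], ordered[("milk", "bread")])
```

Bread and milk co-occur in all three histories, but bread comes before milk in only two of them, which is the extra information a sequence analysis keeps.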