<a href="https://colab.research.google.com/github/metehanunal0/Apriori-Market-Basket-Optimisation/blob/main/Apriori.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px

# Exploring Dataset

In [3]:
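# Load the dataset. Note: read_csv treats the first row as the header by default,
# so the first basket (shrimp, almonds, ...) becomes the column names.
# Passing header=None would keep it as a data row.
data = pd.read_csv("Market_Basket_Optimisation.csv")
# Print the shape of the dataset
data.shape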

(7500, 20)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   shrimp             7500 non-null   object 
 1   almonds            5746 non-null   object 
 2   avocado            4388 non-null   object 
 3   vegetables mix     3344 non-null   object 
 4   green grapes       2528 non-null   object 
 5   whole weat flour   1863 non-null   object 
 6   yams               1368 non-null   object 
 7   cottage cheese     980 non-null    object 
 8   energy drink       653 non-null    object 
 9   tomato juice       394 non-null    object 
 10  low fat yogurt     255 non-null    object 
 11  green tea          153 non-null    object 
 12  honey              86 non-null     object 
 13  salad              46 non-null     object 
 14  mineral water      24 non-null     object 
 15  salmon             7 non-null      object 
 16  antioxydant juice  3 non

In [5]:
data.tail()

Unnamed: 0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
7495,butter,light mayo,fresh bread,,,,,,,,,,,,,,,,,
7496,burgers,frozen vegetables,eggs,french fries,magazines,green tea,,,,,,,,,,,,,,
7497,chicken,,,,,,,,,,,,,,,,,,,
7498,escalope,green tea,,,,,,,,,,,,,,,,,,
7499,eggs,frozen smoothie,yogurt cake,low fat yogurt,,,,,,,,,,,,,,,,


In [6]:
# Flatten the table into a single list of items (one entry per cell, NaN included)
transaction = [item for row in data.values for item in row]

In [8]:
# converting to numpy array
transaction = np.array(transaction)

In [9]:
# Convert the array to a pandas DataFrame
df = pd.DataFrame(transaction, columns=["items"])
# Give each item a count of 1 so that a group-by can sum the occurrences
df["incident_count"] = 1
# Drop missing entries (np.array converted NaN to the string "nan")
indexNames = df[df['items'] == "nan"].index
df.drop(indexNames, inplace=True)
# Build a per-item frequency table, sorted from most to least frequent
df_table = df.groupby("items").sum().sort_values("incident_count", ascending=False).reset_index()
# Show the ten most frequent items
df_table.head(10).style.background_gradient(cmap='coolwarm')

Unnamed: 0,items,incident_count
0,mineral water,1787
1,eggs,1348
2,spaghetti,1306
3,french fries,1282
4,chocolate,1230
5,green tea,990
6,milk,972
7,ground beef,737
8,frozen vegetables,715
9,pancakes,713


The output shows that mineral water is the most frequently purchased product, with 1,787 occurrences, followed by eggs and spaghetti.
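
The same kind of tally can be sketched in plain Python with `collections.Counter`. The three baskets below are made-up toy data rather than rows from the dataset, and `None` stands in for the empty cells; this is only an illustration of the counting step, not the notebook's method.

```python
from collections import Counter

# Toy baskets (hypothetical data); None marks an empty cell, like NaN in the CSV
baskets = [
    ["mineral water", "eggs", None],
    ["mineral water", "spaghetti", "eggs"],
    ["mineral water", None, None],
]

# Count every non-empty cell across all baskets
item_counts = Counter(
    item for basket in baskets for item in basket if item is not None
)

# most_common() sorts by count, highest first
top_items = item_counts.most_common(2)
print(top_items)
```

Skipping the empty cells while counting avoids the separate step of deleting the "nan" rows afterwards.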

In [10]:


# Create a bar chart using Plotly Express
fig = px.bar(df_table.head(30), x='items', y='incident_count',
             title='Item Incident Counts',
             labels={'items': 'Items', 'incident_count': 'Incident Count'},
             color='incident_count',
             color_continuous_scale='sunsetdark')

# Customize the layout if needed
fig.update_layout(
    xaxis=dict(tickangle=-45),
    yaxis=dict(title='Incident Count'),
    coloraxis_colorbar=dict(title='Count', tickformat=','),
)

# Show the bar chart
fig.show()

A bar chart represents each item's count as the height of a bar, which makes it easy to compare purchase frequencies. The interactive chart above shows the 30 most frequent items in the dataset.

# Data Preprocessing

Before extracting the most frequent itemsets, the dataset needs to be transformed into a True/False (one-hot) matrix in which rows are transactions and columns are products.

In [13]:
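# Build one list of item names per transaction.
# str() turns missing cells into the string "nan", which the encoder treats as an item;
# that column is discarded later when only the top 50 items are kept.
transaction = [[str(item) for item in row] for row in data.values]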

In [15]:
# importing the required module
from mlxtend.preprocessing import TransactionEncoder
# Initialize the TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(transaction).transform(transaction)
dataset = pd.DataFrame(te_ary, columns=te.columns_)
# Display the encoded dataset
dataset

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7496,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7497,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7498,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
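
To make the encoding step concrete, here is a minimal plain-Python sketch of the idea behind a True/False transaction matrix. It runs on made-up baskets and is not mlxtend's implementation; `one_hot_encode` is a hypothetical helper name introduced only for this illustration.

```python
def one_hot_encode(transactions):
    """Return (sorted item names, boolean rows) for a list of transactions."""
    # Columns are the distinct items, sorted so the layout is deterministic
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        # True where the transaction contains that column's item
        rows.append([item in present for item in columns])
    return columns, rows


# Toy baskets (hypothetical data)
toy_transactions = [["eggs", "milk"], ["milk"], ["eggs", "spaghetti"]]
columns, rows = one_hot_encode(toy_transactions)
print(columns)
for row in rows:
    print(row)
```

Each row has exactly one entry per distinct item, so every transaction ends up with the same width regardless of how many products it contains.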


The encoded dataset currently has 121 columns (features). Extracting the most frequent itemsets from all 121 features would be computationally expensive, so we will start with the 50 most frequent items.

In [16]:
# select top 50 items
first50 = df_table["items"].head(50).values
# Extract Top50
dataset = dataset.loc[:,first50]
# shape of the dataset
dataset.shape

(7500, 50)
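
A rough way to see why trimming the columns helps: the number of candidate itemsets of size k that can be formed from n items is the binomial coefficient C(n, k). Apriori prunes most candidates using the minimum support threshold, so the figures computed below are upper bounds, not the work mlxtend actually performs.

```python
from math import comb

# Upper bound on candidate itemsets of each size, before any support-based pruning
for n_items in (121, 50):
    pairs = comb(n_items, 2)
    triples = comb(n_items, 3)
    print(f"{n_items} items: {pairs} possible pairs, {triples} possible triples")
```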

# Apriori Algorithm

In [17]:
# Note: apyori is installed here but not used below; the analysis relies on mlxtend.
!pip install apyori

Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5954 sha256=38c440ec7ac7f47d93cbc0dfb29a6c36aae6dbe8af86b8af1ee056be8ff3bb9a
  Stored in directory: /root/.cache/pip/wheels/c4/1a/79/20f55c470a50bb3702a8cb7c94d8ada15573538c7f4baebe2d
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2


In [18]:
# Import the required functions
from mlxtend.frequent_patterns import apriori, association_rules

# Extract the frequent itemsets with mlxtend.
# A length column is added to make filtering easier.
frequent_itemsets = apriori(dataset, min_support=0.01, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
# Display the frequent itemsets
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.238267,(mineral water),1
1,0.179733,(eggs),1
2,0.174133,(spaghetti),1
3,0.170933,(french fries),1
4,0.163867,(chocolate),1
...,...,...,...
229,0.010933,"(chocolate, mineral water, ground beef)",3
230,0.011067,"(mineral water, ground beef, milk)",3
231,0.011067,"(frozen vegetables, mineral water, milk)",3
232,0.010533,"(eggs, chocolate, spaghetti)",3


The output shows that mineral water is the dataset's most frequently occurring item, with a support of about 0.238. As a further experiment, we can list all itemsets of length 2 whose support is at least 0.05.

In [19]:
# Show the frequent itemsets of length 2 with a support of at least 0.05
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.05) ]





Unnamed: 0,support,itemsets,length
50,0.050933,"(eggs, mineral water)",2
51,0.059733,"(mineral water, spaghetti)",2
53,0.052667,"(chocolate, mineral water)",2


The output shows that mineral water and spaghetti form the most frequent pair (support of about 0.060), followed by chocolate and mineral water (about 0.053) and eggs and mineral water (about 0.051).

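Similarly, we can inspect the itemsets of length 3. The cell below shows the first three; because the result is not sorted by support, these are not guaranteed to be the three most frequent: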


In [20]:
# Show the first three frequent itemsets of length 3
frequent_itemsets[ (frequent_itemsets['length'] == 3) ].head(3)





Unnamed: 0,support,itemsets,length
217,0.014267,"(eggs, mineral water, spaghetti)",3
218,0.013467,"(eggs, chocolate, mineral water)",3
219,0.013067,"(eggs, mineral water, milk)",3
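
The support values in these tables are simply the fraction of transactions that contain every item of the itemset. A minimal sketch on made-up baskets follows; `support` is a hypothetical helper written for this illustration, not an mlxtend function.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    # A transaction counts as a hit when the itemset is a subset of it
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


# Toy baskets (hypothetical data)
toy_transactions = [
    {"mineral water", "spaghetti"},
    {"mineral water", "eggs"},
    {"mineral water", "spaghetti", "eggs"},
    {"chocolate"},
]

pair_support = support({"mineral water", "spaghetti"}, toy_transactions)
print(pair_support)
```

Because adding an item to an itemset can only remove matching transactions, the support of a larger itemset never exceeds the support of any of its subsets. Apriori relies on this property to discard candidates early.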


# Further Association Rules

In [21]:
# Use lift as the metric to check whether antecedents and consequents are dependent
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules["antecedents_length"] = rules["antecedents"].apply(len)
rules["consequents_length"] = rules["consequents"].apply(len)
rules.sort_values("lift", ascending=False)





Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,antecedents_length,consequents_length
219,(ground beef),(herb & pepper),0.098267,0.049467,0.016000,0.162822,3.291555,0.011139,1.135402,0.772060,1,1
218,(herb & pepper),(ground beef),0.049467,0.098267,0.016000,0.323450,3.291555,0.011139,1.332841,0.732423,1,1
291,"(mineral water, spaghetti)",(ground beef),0.059733,0.098267,0.017067,0.285714,2.907540,0.011197,1.262427,0.697745,2,1
294,(ground beef),"(mineral water, spaghetti)",0.098267,0.059733,0.017067,0.173677,2.907540,0.011197,1.137893,0.727562,1,2
311,(olive oil),"(mineral water, spaghetti)",0.065733,0.059733,0.010267,0.156187,2.614731,0.006340,1.114306,0.661001,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...
61,(low fat yogurt),(eggs),0.076400,0.179733,0.016800,0.219895,1.223453,0.003068,1.051483,0.197749,1,1
123,(escalope),(french fries),0.079333,0.170933,0.016400,0.206723,1.209376,0.002839,1.045116,0.188046,1,1
122,(french fries),(escalope),0.170933,0.079333,0.016400,0.095944,1.209376,0.002839,1.018373,0.208822,1,1
165,(shrimp),(green tea),0.071333,0.132000,0.011333,0.158879,1.203625,0.001917,1.031956,0.182171,1,1


The ***association_rules*** function returns a DataFrame that includes the following columns:

*   **antecedents**: This column contains the itemsets (sets of items) that form the antecedent of the association rule. An antecedent is the "if" part of the rule.

*   **consequents**: This column contains the itemsets that form the consequent of the association rule. The consequent is the "then" part of the rule.

*   **antecedent support**: This column shows the support value (frequency) of the antecedent itemset in the dataset.

*   **consequent support**: This column shows the support value of the consequent itemset in the dataset.

*   **support**: This column shows the support value of the combined antecedent and consequent itemset. The support of an itemset indicates how frequently it appears in the dataset.

*   **confidence**: This column gives the confidence of the association rule, that is, the fraction of transactions containing the antecedent that also contain the consequent. It is calculated as "support(antecedent ∪ consequent) / support(antecedent)", where the union denotes the itemset made of all items from both sides.

*   **lift**: Lift measures the strength of the association between the antecedent and consequent. It indicates how much more often the items in the rule occur together than they would if they were statistically independent. Lift is calculated as "support(antecedent ∪ consequent) / (support(antecedent) * support(consequent))"; a value of 1 corresponds to independence.

*   **leverage**: measures the difference between the observed frequency of the antecedent and consequent occurring together and the frequency that would be expected if they were independent. It is calculated as "support(antecedent ∪ consequent) - support(antecedent) * support(consequent)".

  * A positive leverage value indicates that the items tend to occur together more often than expected by chance.
  * A leverage value close to 0 indicates that the items occur together about as often as expected by chance.
  * A negative leverage value indicates that the items tend to occur together less often than expected by chance.

*   **conviction**: compares how often the antecedent would occur without the consequent if the two were independent with how often it actually does. It is calculated as "(1 - support(consequent)) / (1 - confidence)".

  * Conviction values greater than 1 indicate a positive association between the antecedent and consequent.
  * A conviction value significantly greater than 1 indicates a strong positive association.
  * A conviction value close to 1 indicates that the antecedent and consequent are statistically independent.
  * Conviction values less than 1 suggest a negative association.

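*   **Zhang's metric** (column `zhangs_metric`): measures both association and dissociation on a scale from -1 to 1. It is calculated as the leverage divided by "max(support(antecedent ∪ consequent) * (1 - support(antecedent)), support(antecedent) * (support(consequent) - support(antecedent ∪ consequent)))".

  * A positive value indicates association; the closer it is to 1, the stronger the rule.
  * A value close to 0 indicates that the antecedent and consequent are independent.
  * A negative value indicates dissociation, meaning that the items tend not to occur together.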
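
As a sanity check on these definitions, the metrics can be recomputed by hand from just three support values. The sketch below uses the rounded supports of the rule (ground beef) → (herb & pepper) from the output above; `rule_metrics` is a hypothetical helper, not part of mlxtend, and returning 0 for an undefined Zhang's metric is an assumption made for this illustration.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Compute rule metrics for A -> C from three support values."""
    confidence = support_ac / support_a
    lift = support_ac / (support_a * support_c)
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when the rule never fails (confidence == 1)
    conviction = (
        float("inf") if confidence == 1 else (1 - support_c) / (1 - confidence)
    )
    zhang_denominator = max(
        support_ac * (1 - support_a),
        support_a * (support_c - support_ac),
    )
    # Assumption: report 0 when the denominator vanishes
    zhangs_metric = leverage / zhang_denominator if zhang_denominator else 0.0
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhangs_metric,
    }


# Rounded supports for (ground beef) -> (herb & pepper), taken from the rules table
metrics = rule_metrics(support_a=0.098267, support_c=0.049467, support_ac=0.016000)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Since the inputs are rounded to six decimals, the results should agree with row 219 of the rules table only up to small rounding differences.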










