## Market basket analysis - Apriori and Eclat models

The Groceries dataset is analysed with the Apriori and Eclat models to find which combinations of products can be recommended together.

In [1]:
#Installing apyori
!pip install apyori



In [2]:
#Importing libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import warnings
warnings.filterwarnings('ignore') # We can suppress the warnings

### 1. Exploratory data analysis (EDA)

In [3]:
#Loading the dataset
dataset = pd.read_csv('Groceries_dataset.csv')
dataset.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21/07/2015,tropical fruit
1,2552,05/01/2015,whole milk
2,2300,19/09/2015,pip fruit
3,1187,12/12/2015,other vegetables
4,3037,01/02/2015,whole milk


In [4]:
#checking the shape of the dataset
dataset.shape

(38765, 3)

In [5]:
# Check the data type of each column
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Member_number    38765 non-null  int64 
 1   Date             38765 non-null  object
 2   itemDescription  38765 non-null  object
dtypes: int64(1), object(2)
memory usage: 908.7+ KB


In [6]:
# Before proceeding, convert Date from object to datetime type.
# The source dates are day-first (DD/MM/YYYY); without dayfirst=True an ambiguous
# date such as 05/01/2015 can be read month-first (as 1 May instead of 5 January).
dataset['Date'] = pd.to_datetime(dataset['Date'], dayfirst=True)
dataset.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,2015-07-21,tropical fruit
1,2552,2015-01-05,whole milk
2,2300,2015-09-19,pip fruit
3,1187,2015-12-12,other vegetables
4,3037,2015-02-01,whole milk


In [7]:
# Check for null values (there are none)
dataset.isnull().sum()

Member_number      0
Date               0
itemDescription    0
dtype: int64

In [8]:
# Check summary statistics
dataset.describe()

Unnamed: 0,Member_number
count,38765.0
mean,3003.641868
std,1153.611031
min,1000.0
25%,2002.0
50%,3005.0
75%,4007.0
max,5000.0


In [9]:
# Check for duplicate rows (same member, date and item)
duplicate_rows = dataset[dataset.duplicated()]
print('number of duplicate rows: ', duplicate_rows.shape)

number of duplicate rows:  (759, 3)


In [10]:
# Count the rows before removing the duplicates
dataset.count()

Member_number      38765
Date               38765
itemDescription    38765
dtype: int64

In [11]:
# Drop the duplicates
dataset = dataset.drop_duplicates()
dataset.head(5)

Unnamed: 0,Member_number,Date,itemDescription
0,1808,2015-07-21,tropical fruit
1,2552,2015-01-05,whole milk
2,2300,2015-09-19,pip fruit
3,1187,2015-12-12,other vegetables
4,3037,2015-02-01,whole milk


In [12]:
#checking shape of data after dropping duplicates
dataset.shape

(38006, 3)

In [13]:
# Group the rows by Member_number and Date, joining the items of each basket with a lambda function
dataset1= dataset.groupby(['Member_number', 'Date']).agg({'itemDescription': lambda x: ', '.join(x)}).reset_index()
dataset1.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1000,2014-06-24,"whole milk, pastry, salty snack"
1,1000,2015-03-15,"sausage, whole milk, semi-finished bread, yogurt"
2,1000,2015-05-27,"soda, pickled vegetables"
3,1000,2015-07-24,"canned beer, misc. beverages"
4,1000,2015-11-25,"sausage, hygiene articles"


In [14]:
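# Build the list of transactions: one list of items per member and date.
# Note: splitting on ',' leaves a leading space on every item after the first
# (see the output below), so 'whole milk' and ' whole milk' count as different items.
transactions = [basket.split(',') for basket in dataset1['itemDescription']]
transactions[:2]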

[['whole milk', ' pastry', ' salty snack'],
 ['sausage', ' whole milk', ' semi-finished bread', ' yogurt']]
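
The leading spaces in the output above matter: `'whole milk'` and `' whole milk'` are distinct strings, so the same product can be counted as two different items. A minimal sketch of one way to avoid this, assuming the joined strings look like the ones shown earlier. The helper name `split_basket` is hypothetical, and the results reported in this notebook were produced with the original split, so they would change if the transactions were rebuilt this way.

```python
def split_basket(joined):
    """Split a comma-joined basket string into a list of clean item names."""
    # Strip surrounding whitespace from each item and drop empty fragments.
    return [item.strip() for item in joined.split(',') if item.strip()]


# Two baskets copied from the grouped output shown earlier.
baskets = [
    'whole milk, pastry, salty snack',
    'sausage, whole milk, semi-finished bread, yogurt',
]
clean_transactions = [split_basket(basket) for basket in baskets]
print(clean_transactions)
```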

In [15]:
#Training the model with the dataset

from apyori import apriori
rules = apriori(transactions = transactions, min_support = 0.001, min_confidence = 0.1, min_lift = 1, 
                min_length = 2, max_length = 2)


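# min_support: minimum share of transactions that contain the itemset (transactions with the products / total transactions)
# min_confidence: there is no universal threshold; 0.1 is used here
# min_lift: a lift above 1 means the products appear together more often than expected if they were independent;
#           a value of at least 3 is sometimes required for a rule to be considered relevant, but 1 is used here to keep weaker rules too
# min_length / max_length: minimum/maximum number of items we want in the rule (left and right sides together)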
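
The three thresholds above can be made concrete with a small worked example. This is a minimal sketch in plain Python on a made-up toy list of baskets (the function names and the `toy` data are illustrative assumptions, not part of apyori):

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(lhs and rhs) / support(lhs): how often rhs appears when lhs does."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """confidence(lhs -> rhs) / support(rhs); 1.0 corresponds to independence."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Made-up baskets, for illustration only.
toy = [
    ['milk', 'bread'],
    ['milk', 'yogurt'],
    ['bread', 'sausage'],
    ['milk', 'bread', 'yogurt'],
]
print(support(toy, ['milk', 'bread']))
print(confidence(toy, ['milk'], ['bread']))
print(lift(toy, ['milk'], ['bread']))
```

In this toy data the lift of milk -> bread is below 1, which is the kind of rule a `min_lift` of 1 or more would discard.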

In [16]:
# Collect the rules returned by the apriori function into a list

results = list(rules)

In [17]:
results

[RelationRecord(items=frozenset({' beef', ' whole milk'}), support=0.0012697988371315912, ordered_statistics=[OrderedStatistic(items_base=frozenset({' beef'}), items_add=frozenset({' whole milk'}), confidence=0.1347517730496454, lift=1.5752271719858155)]),
 RelationRecord(items=frozenset({' beef', 'whole milk'}), support=0.0010024727661565194, ordered_statistics=[OrderedStatistic(items_base=frozenset({' beef'}), items_add=frozenset({'whole milk'}), confidence=0.10638297872340427, lift=1.469813952574606)]),
 RelationRecord(items=frozenset({' beverages', 'sausage'}), support=0.0010024727661565194, ordered_statistics=[OrderedStatistic(items_base=frozenset({' beverages'}), items_add=frozenset({'sausage'}), confidence=0.11111111111111112, lift=2.2137890220446814)]),
 RelationRecord(items=frozenset({' cat food', 'whole milk'}), support=0.0010024727661565194, ordered_statistics=[OrderedStatistic(items_base=frozenset({' cat food'}), items_add=frozenset({'whole milk'}), confidence=0.11278195488

**Discussion:** Under this model, the results show that the strongest combination is other vegetables and whole milk.

### 1.2. Apriori model

The data is processed first with the Apriori model; the rules generated above are organised into a DataFrame to check the results.

In [18]:
# Organise the results into a pandas DataFrame

def inspect(results):
    # Each result is (items, support, ordered_statistics). Only the first ordered
    # statistic of each record is read: (items_base, items_add, confidence, lift).
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side',
                                                               'Support', 'Confidence', 'Lift'])
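
`inspect` indexes each record by position and reads only the first entry of `ordered_statistics`. As a hedged alternative, the sketch below uses the field names visible in the printed output (`items`, `support`, `ordered_statistics`, `items_base`, `items_add`, `confidence`, `lift`) and emits one row per rule. The two namedtuples are stand-ins defined here so the example runs without apyori, the function name `flatten` is hypothetical, and the numbers in `example` are made up.

```python
from collections import namedtuple

# Stand-ins mirroring the field names seen in the printed apyori output.
RelationRecord = namedtuple('RelationRecord', ['items', 'support', 'ordered_statistics'])
OrderedStatistic = namedtuple('OrderedStatistic', ['items_base', 'items_add', 'confidence', 'lift'])


def flatten(records):
    """Return one (lhs, rhs, support, confidence, lift) row per ordered statistic."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append((
                ', '.join(sorted(stat.items_base)),  # left-hand side as text
                ', '.join(sorted(stat.items_add)),   # right-hand side as text
                record.support,
                stat.confidence,
                stat.lift,
            ))
    return rows


# Made-up record holding two rules for the same pair of items.
example = [
    RelationRecord(
        items=frozenset({'sausage', 'yogurt'}),
        support=0.0019,
        ordered_statistics=[
            OrderedStatistic(frozenset({'sausage'}), frozenset({'yogurt'}), 0.18, 3.1),
            OrderedStatistic(frozenset({'yogurt'}), frozenset({'sausage'}), 0.11, 3.1),
        ],
    )
]
rows = flatten(example)
print(rows)
```

The rows can be passed to `pd.DataFrame` with the same five column names used above.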


In [19]:
# Display the unsorted results

resultsinDataFrame

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
0,beef,whole milk,0.00127,0.134752,1.575227
1,beef,whole milk,0.001002,0.106383,1.469814
2,beverages,sausage,0.001002,0.111111,2.213789
3,cat food,whole milk,0.001002,0.112782,1.558224
4,chewing gum,whole milk,0.001002,0.111111,1.298872
5,chicken,whole milk,0.001002,0.15625,1.826538
6,citrus fruit,whole milk,0.001871,0.123348,1.441919
7,frankfurter,other vegetables,0.001002,0.1875,2.842515
8,frankfurter,whole milk,0.001002,0.1875,2.191846
9,newspapers,whole milk,0.003208,0.106904,1.477016


In [20]:
# Display the 10 rules with the highest lift
# (a higher lift indicates a stronger association between the two products)
resultsinDataFrame.nlargest(n = 10, columns = 'Lift')

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
23,sausage,yogurt,0.001871,0.184211,3.142921
7,frankfurter,other vegetables,0.001002,0.1875,2.842515
12,sausage,other vegetables,0.001604,0.157895,2.393697
22,sausage,whole milk,0.002005,0.197368,2.307206
2,beverages,sausage,0.001002,0.111111,2.213789
8,frankfurter,whole milk,0.001002,0.1875,2.191846
20,sausage,rolls/buns,0.001537,0.151316,2.189689
21,sausage,soda,0.001337,0.131579,2.130753
19,pork,sausage,0.001002,0.105634,2.104659
13,berries,other vegetables,0.00147,0.135802,2.058776


**Discussion**: After analysing the results, the strongest combination by lift is sausage and yogurt (lift 3.14), followed by frankfurter with other vegetables (lift 2.84). It would be recommended to place the sausage fridges near the vegetables section.


### 1.3. Eclat model evaluation.

The same data is going to be processed using the Eclat model to check the results.

In [21]:
# Training the model with the dataset
# Note: the rules are still generated with apyori's apriori function; the Eclat-style
# view comes from keeping only the support of each itemset in the cells below

from apyori import apriori
rules = apriori(transactions = transactions, min_support = 0.001, min_confidence = 0.1, min_lift = 2,
                min_length = 2, max_length = 2)

# min_support: minimum share of transactions that contain the itemset (transactions with the products / total transactions)
# min_confidence: there is no universal threshold; 0.1 is used here
# min_lift: set to 2 here, so only the stronger associations are kept
# min_length / max_length: minimum/maximum number of items we want in the rule (left and right sides together)

In [22]:
# Collect the rules returned by the apriori function into a list

results = list(rules)

In [23]:
results

[RelationRecord(items=frozenset({' beverages', 'sausage'}), support=0.0010024727661565194, ordered_statistics=[OrderedStatistic(items_base=frozenset({' beverages'}), items_add=frozenset({'sausage'}), confidence=0.11111111111111112, lift=2.2137890220446814)]),
 RelationRecord(items=frozenset({' frankfurter', ' other vegetables'}), support=0.0010024727661565194, ordered_statistics=[OrderedStatistic(items_base=frozenset({' frankfurter'}), items_add=frozenset({' other vegetables'}), confidence=0.1875, lift=2.8425151975683893)]),
 RelationRecord(items=frozenset({' frankfurter', ' whole milk'}), support=0.0010024727661565194, ordered_statistics=[OrderedStatistic(items_base=frozenset({' frankfurter'}), items_add=frozenset({' whole milk'}), confidence=0.1875, lift=2.191845703125)]),
 RelationRecord(items=frozenset({' sausage', ' other vegetables'}), support=0.001603956425850431, ordered_statistics=[OrderedStatistic(items_base=frozenset({' sausage'}), items_add=frozenset({' other vegetables'}),

In [24]:
# Organise the results into a pandas DataFrame, keeping only the product pair and its support

def inspect(results):
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    return list(zip(lhs, rhs, supports))
resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Product 1', 'Product 2', 'Support'])

In [25]:
# Display the 10 product pairs with the highest support
# (one row per pair, which is easy to read)

resultsinDataFrame.nlargest(n = 10, columns = 'Support')

Unnamed: 0,Product 1,Product 2,Support
8,sausage,whole milk,0.002005
9,sausage,yogurt,0.001871
3,sausage,other vegetables,0.001604
6,sausage,rolls/buns,0.001537
4,berries,other vegetables,0.00147
7,sausage,soda,0.001337
0,beverages,sausage,0.001002
1,frankfurter,other vegetables,0.001002
2,frankfurter,whole milk,0.001002
5,pork,sausage,0.001002


**Discussion:** Under this model, the results are quite similar to the previous Apriori analysis: sausage has the strongest support with whole milk (0.0020), followed by sausage and yogurt (0.0019). The third combination is sausage with other vegetables (0.0016).
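
For comparison, the core idea of Eclat is to store, for each item, the set of transaction ids that contain it and to obtain the support of an itemset by intersecting those sets. The following is a minimal sketch of that idea restricted to pairs, matching `max_length = 2` above. It is not the procedure used in this notebook; the function name `eclat_pairs` and the toy baskets are assumptions made for illustration.

```python
from itertools import combinations


def eclat_pairs(transactions, min_support):
    """Return {frozenset({a, b}): support} for all frequent pairs of items."""
    n = len(transactions)
    # Vertical layout: item -> set of ids of the transactions that contain it.
    tidsets = {}
    for tid, basket in enumerate(transactions):
        for item in set(basket):
            tidsets.setdefault(item, set()).add(tid)
    # An infrequent item cannot be part of a frequent pair, so drop it early.
    frequent = {item: tids for item, tids in tidsets.items()
                if len(tids) / n >= min_support}
    pairs = {}
    for a, b in combinations(sorted(frequent), 2):
        # Support of the pair = size of the intersection of the two id sets.
        common = frequent[a] & frequent[b]
        if len(common) / n >= min_support:
            pairs[frozenset((a, b))] = len(common) / n
    return pairs


# Made-up baskets, for illustration only.
toy = [
    ['milk', 'bread'],
    ['milk', 'yogurt'],
    ['bread', 'sausage'],
    ['milk', 'bread', 'yogurt'],
]
frequent_pairs = eclat_pairs(toy, min_support=0.5)
print(frequent_pairs)
```

Unlike the rules table above, this output carries only support, with no confidence or lift, which is the distinguishing feature of an Eclat-style analysis.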