#### What is the Apriori algorithm?

Apriori is a data mining algorithm, often grouped with machine learning methods, for finding frequent itemsets (groups of items that often appear together in the same transaction) and deriving association rules from them. It gives insight into how the items in structured transactional data relate to one another.

Example: recommending products based on the items you have already purchased, as many e-commerce websites do (a recommendation system).
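To make the idea concrete, here is a minimal plain-Python sketch of the level-wise search Apriori performs: count frequent single items, join them into larger candidates, prune any candidate with an infrequent subset, and repeat. The function name `apriori_sketch` and the toy baskets are assumptions made for illustration; this is not how `mlxtend` implements it internally.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical toy baskets
toy = [
    {"keyboard", "mouse"},
    {"keyboard", "mouse", "monitor"},
    {"keyboard"},
    {"mouse", "monitor"},
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are, so most large candidates are discarded without being counted.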

In [1]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori,association_rules

In [2]:
df = pd.read_excel(r'D:\Study Materials\Python\Machine Learning\Association Apriori Method\Online Retail.xlsx')

In [4]:
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [12]:
df.tail()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [6]:
df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [7]:
df.shape

(541909, 8)

In [10]:
df.isnull().any()

InvoiceNo      False
StockCode      False
Description     True
Quantity       False
InvoiceDate    False
UnitPrice      False
CustomerID      True
Country        False
dtype: bool

In [9]:
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

### Data Preprocessing

In [11]:
# Remove leading and trailing whitespace from the Description column
df['Description'] = df['Description'].str.strip()

In [19]:
# Drop the rows that have no invoice number
df.dropna(axis = 0, subset = ['InvoiceNo'], inplace = True)

# Convert the invoice column to string, because some invoice numbers carry a 'C' prefix
df['InvoiceNo'] = df['InvoiceNo'].astype('str')

# Drop the 'C' invoices; in the UCI Online Retail dataset this prefix marks a cancelled invoice
df = df[~df['InvoiceNo'].str.contains('C')]
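The cleaning steps above can be sketched in plain Python on a few rows, which makes it easy to see which records survive. The rows below are hypothetical and `None` stands in for a missing invoice number.

```python
# Hypothetical raw rows: (InvoiceNo, Description); None marks a missing value
raw_rows = [
    (536365, " WHITE METAL LANTERN "),
    ("C536379", "Discount"),
    (None, "POSTAGE"),
    (536370, "POSTAGE"),
]

cleaned = []
for invoice_no, description in raw_rows:
    if invoice_no is None:
        continue                      # no invoice number: drop the row
    invoice_no = str(invoice_no)      # numeric and 'C'-prefixed values become comparable
    if "C" in invoice_no:
        continue                      # 'C' invoice: drop the row
    cleaned.append((invoice_no, description.strip()))

print(cleaned)
```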

In [20]:
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [22]:
len(df['Country'].unique())

38

#### Analyzing the market trend of France

1. Split the data by region (here, France) and pivot it into an invoice-by-product matrix
2. Define a one-hot encoding function to make the data suitable for analysis
3. Apply the encoding to the France basket

In [26]:
# Splitting the data according to region (i.e. France)
market_basket_france = (df[df['Country'] == 'France'] # Keep only the rows where 'Country' is 'France'
                      .groupby(['InvoiceNo','Description'])['Quantity'] # Group by invoice and product, selecting 'Quantity'
                      .sum().unstack() # Sum the quantity per group, then pivot: rows are invoices, columns are products, values are total quantities
                      .reset_index() # Bring 'InvoiceNo' back as a regular column
                      .fillna(0) # Replace NaN (product not on the invoice) with 0
                      .set_index('InvoiceNo')) # Set 'InvoiceNo' as the index again

In [27]:
# Performing One-Hot Encoding to make data suitable for Analysis

def hot_encoding(x):
    # 1 if the product is on the invoice (total quantity of at least 1), otherwise 0
    if x >= 1:
        return 1
    return 0

In [28]:
# Applying One Hot Encoding to France Market
# (DataFrame.applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
basket_encoded = market_basket_france.applymap(hot_encoding)
market_basket_france = basket_encoded
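As a rough illustration of what the pivot and the encoding compute, the same two steps can be written without pandas. The line items below are hypothetical (invoice numbers and product names are chosen only for illustration), and the structure is a sketch of the idea rather than a replacement for the cells above.

```python
from collections import defaultdict

# Hypothetical line items: (invoice number, product description, quantity)
line_items = [
    ("536370", "ALARM CLOCK", 24),
    ("536370", "ALARM CLOCK", 12),
    ("536370", "POSTAGE", 3),
    ("536852", "POSTAGE", 1),
    ("536852", "PICNIC BASKET", 0),
]

# Group by (invoice, description) and sum the quantities
totals = defaultdict(int)
for invoice, description, quantity in line_items:
    totals[(invoice, description)] += quantity

invoices = sorted({inv for inv, _ in totals})
products = sorted({desc for _, desc in totals})

# Pivot: one row per invoice, one column per product, 0 where the product is absent
quantity_matrix = {
    inv: {p: totals.get((inv, p), 0) for p in products} for inv in invoices
}

# Encode: 1 if at least one unit was bought on that invoice, otherwise 0
encoded = {
    inv: {p: int(q >= 1) for p, q in row.items()}
    for inv, row in quantity_matrix.items()
}

for inv in invoices:
    print(inv, encoded[inv])
```

After encoding, the quantities are gone: Apriori only needs to know whether a product was present on an invoice, not how many units were bought.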

In [29]:
market_basket_france.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536852,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536974,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537065,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Building the Model

In [32]:
# Building the model: keep the itemsets that appear in at least 10% of invoices (min_support = 0.1)
frq_items = apriori(market_basket_france, min_support= 0.1, use_colnames= True)



##### Association rule: 
An association rule captures a frequent pattern among a set of items, in the form "if A is bought, B tends to be bought too". Example: a customer who buys a keyboard may also buy a mouse, so placing the two side by side in the store can increase sales.

##### Support: 
Support measures how popular an item (or itemset) is overall: the number of transactions containing it divided by the total number of transactions.

     Support(Keyboard) = (Transactions containing Keyboard) / (Total transactions)

##### Confidence: 
Confidence is the likelihood that item B (mouse) is also bought when item A (keyboard) is bought.

     Confidence(Keyboard→Mouse) = (Transactions containing both Keyboard and Mouse) / (Transactions containing Keyboard)

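##### Lift: 
Lift(Keyboard→Mouse) measures how much more likely a mouse is to be sold when a keyboard is sold, compared with how often a mouse sells overall. It is Confidence(Keyboard→Mouse) divided by Support(Mouse). A lift above 1 means the two items appear together more often than they would if they were independent; a lift close to 1 means there is little or no association.

     Lift(Keyboard→Mouse) = Confidence(Keyboard→Mouse) / Support(Mouse)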
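The three formulas can be checked by hand on a tiny example. The sketch below uses five hypothetical transactions and helper functions (`support`, `confidence`, `lift`) named here only for illustration; they are not part of `mlxtend`.

```python
# Hypothetical toy transactions
transactions = [
    {"keyboard", "mouse"},
    {"keyboard", "mouse", "monitor"},
    {"keyboard"},
    {"mouse", "monitor"},
    {"monitor"},
]

def support(itemset):
    # Share of transactions that contain every item in the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support of both sides together, divided by support of the antecedent
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    # Confidence of the rule divided by support of the consequent
    return confidence(antecedent, consequent) / support(consequent)

print("Support(Keyboard)          =", support({"keyboard"}))
print("Confidence(Keyboard→Mouse) =", round(confidence({"keyboard"}, {"mouse"}), 4))
print("Lift(Keyboard→Mouse)       =", round(lift({"keyboard"}, {"mouse"}), 4))
```

Here the keyboard appears in 3 of 5 transactions, and 2 of those 3 also contain a mouse, so the confidence is 2/3. Dividing by Support(Mouse) = 3/5 gives a lift slightly above 1.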

In [34]:
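# Collect the inferred rules in a DataFrame, keeping the rules with a lift of at least 1
rules = association_rules(frq_items, metric= 'lift', min_threshold= 1)
# Sort by confidence, then by lift, both in descending order
rules = rules.sort_values(['confidence', 'lift'], ascending = [False, False])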

In [36]:
print(rules.head())

                                 antecedents                      consequents  \
40           (SET/6 RED SPOTTY PAPER PLATES)    (SET/6 RED SPOTTY PAPER CUPS)   
42  (SET/6 RED SPOTTY PAPER PLATES, POSTAGE)    (SET/6 RED SPOTTY PAPER CUPS)   
35       (STRAWBERRY LUNCH BOX WITH CUTLERY)                        (POSTAGE)   
27      (ROUND SNACK BOXES SET OF4 WOODLAND)                        (POSTAGE)   
41             (SET/6 RED SPOTTY PAPER CUPS)  (SET/6 RED SPOTTY PAPER PLATES)   

    antecedent support  consequent support   support  confidence      lift  \
40            0.127551            0.137755  0.122449    0.960000  6.968889   
42            0.107143            0.137755  0.102041    0.952381  6.913580   
35            0.122449            0.765306  0.114796    0.937500  1.225000   
27            0.158163            0.765306  0.147959    0.935484  1.222366   
41            0.137755            0.127551  0.122449    0.888889  6.968889   

    leverage  conviction  zhangs_metric  
40

### Interpretation:

1. The first rule suggests that customers who buy "SET/6 RED SPOTTY PAPER PLATES" are highly likely (confidence of 96%) to also buy "SET/6 RED SPOTTY PAPER CUPS". The lift of 6.97 means the two appear together about seven times more often than they would if the purchases were independent, which is a strong positive association.


2. The second rule involves the same products as the first, with "POSTAGE" added to the antecedent. Customers who buy "SET/6 RED SPOTTY PAPER PLATES" and "POSTAGE" are highly likely (confidence of 95%) to also buy "SET/6 RED SPOTTY PAPER CUPS". The lift is again high (6.91), indicating a strong positive association.


3. The third and fourth rules have "POSTAGE" as the consequent. Invoices containing "STRAWBERRY LUNCH BOX WITH CUTLERY" or "ROUND SNACK BOXES SET OF4 WOODLAND" also contain "POSTAGE" about 94% of the time (confidence of 0.9375 and 0.9355). However, "POSTAGE" appears on 76.5% of all invoices, so the lift is only about 1.22. That is a weak positive association: most of the high confidence is explained by how common "POSTAGE" is.


4. The fifth rule is the first rule reversed: customers who buy "SET/6 RED SPOTTY PAPER CUPS" are highly likely (confidence of 89%) to also buy "SET/6 RED SPOTTY PAPER PLATES". The lift is the same as for the first rule (6.97), because lift does not depend on which item is the antecedent.
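As a sanity check, the confidence and lift of a rule can be re-derived from its three support columns alone. The sketch below does this for rows 40 and 35 of the printed rules; the dictionary keys and short rule labels are names chosen here for illustration, and the numbers are copied from the output, so the results agree only up to rounding.

```python
# Support values copied from the printed rules (rows 40 and 35)
rules_printed = {
    "plates -> cups": {
        "antecedent_support": 0.127551,
        "consequent_support": 0.137755,
        "support": 0.122449,
    },
    "lunch box -> postage": {
        "antecedent_support": 0.122449,
        "consequent_support": 0.765306,
        "support": 0.114796,
    },
}

derived = {}
for name, r in rules_printed.items():
    # Confidence = support of the whole rule / support of the antecedent
    confidence = r["support"] / r["antecedent_support"]
    # Lift = confidence / support of the consequent
    lift = confidence / r["consequent_support"]
    derived[name] = (confidence, lift)
    print(f"{name}: confidence={confidence:.4f}, lift={lift:.4f}")
```

Both rules have a confidence above 0.93, yet their lifts differ by a factor of more than five. The difference comes entirely from the consequent support: the cups are on about 14% of invoices, while "POSTAGE" is on about 77%.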