<a href="https://colab.research.google.com/github/jay-madane/ML_clg_labs/blob/main/ml_lab5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning 5
## Analysis by using the Apriori Algorithm

#### Name: Jay Kiran Madane
#### PRN: RBT21CB036
#### Title: Apriori Algorithm for Market Basket Analysis.
#### Aim: To perform Apriori Algorithm on a given dataset.
#### Dataset: Market Basket Optimisation
#### Theory:
The Apriori algorithm mines association rules from transactional data, that is, rules describing how two or more items are related to one another. It is a form of association rule learning: it identifies patterns such as customers who bought product A also tending to buy product B.

The primary objective of the Apriori algorithm is to generate association rules between items. It is a frequent pattern mining technique and is generally run on a database containing a large number of transactions. For example, suppose you go to Big Bazar and buy several products. If the store knows which products are frequently bought together, it can place or promote them accordingly, which helps customers find their products with ease and increases the store's sales performance. In this tutorial, we discuss the Apriori algorithm with examples.
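The strength of a rule A -> B is usually summarised by three measures: support (the fraction of transactions containing both A and B), confidence (support of A and B together divided by the support of A), and lift (confidence divided by the support of B). The following is a minimal sketch on a made-up list of baskets; the data and the helper names `support`, `confidence` and `lift` are illustrative assumptions, not part of the dataset or library used below.

```python
# Toy baskets, invented for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(base, add, transactions):
    """Support of base and add together, divided by the support of base."""
    return support(set(base) | set(add), transactions) / support(base, transactions)

def lift(base, add, transactions):
    """Confidence of base -> add relative to the baseline frequency of add."""
    return confidence(base, add, transactions) / support(add, transactions)

rule_support = support({"milk", "bread"}, baskets)
rule_confidence = confidence({"milk"}, {"bread"}, baskets)
rule_lift = lift({"milk"}, {"bread"}, baskets)
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests the items occur together more often than they would if they were independent, while a lift below 1 (as in this toy rule) suggests the opposite. The `min_lift = 3` threshold used later in this notebook keeps only pairs that co-occur at least three times as often as independence would predict.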





In [None]:
!pip install apyori

Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5955 sha256=d2a7ccdf4f76b05bfe4d00a1493b06689e5990282ecc1706dd2917f9d721110d
  Stored in directory: /root/.cache/pip/wheels/c4/1a/79/20f55c470a50bb3702a8cb7c94d8ada15573538c7f4baebe2d
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2


In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from apyori import apriori

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data = pd.read_csv('/content/drive/MyDrive/Groceries_dataset.csv')
data.shape

(38765, 3)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Member_number    38765 non-null  int64 
 1   Date             38765 non-null  object
 2   itemDescription  38765 non-null  object
dtypes: int64(1), object(2)
memory usage: 908.7+ KB


In [None]:
data.isna().sum()

Member_number      0
Date               0
itemDescription    0
dtype: int64

In [None]:
# Let's split up this line of code: we first count the occurrences of each item in the dataset,
# then sort the counts in descending order and keep the first 10 items, which are the top 10 selling items
x = data['itemDescription'].value_counts().sort_values(ascending=False)[:10]

In [None]:
print("Top 10 frequently sold products")
fig = px.bar(x= x.index, y= x.values)
fig.update_layout(title_text="Top 10 frequently sold products", xaxis_title="Products", yaxis_title="Number of items sold")
fig.show()

Top 10 frequently sold products


In [None]:
# Now let's look at the 10 least selling products
# The only change in the code is to sort the counts in ascending order instead of descending order
y = data['itemDescription'].value_counts().sort_values(ascending=True)[:10]

In [None]:
print("10 least frequently sold products")
fig = px.bar(x= y.index, y= y.values)
fig.update_layout(title_text="10 least frequently sold products", xaxis_title="Products", yaxis_title="Number of items sold")
fig.show()

10 least frequently sold products
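
Both rankings above rely on `value_counts()`, which already returns counts in descending order by default, so the explicit `sort_values` call is optional for the top-10 case. `nlargest` and `nsmallest` express the two selections directly. A small sketch on an invented Series (not the groceries data):

```python
import pandas as pd

# Invented item column, for illustration only.
items = pd.Series(["milk", "bread", "milk", "eggs", "milk", "bread"])

counts = items.value_counts()      # sorted by count, descending, by default
top_two = counts.nlargest(2)       # the two most frequent items
bottom_one = counts.nsmallest(1)   # the least frequent item

print(top_two)
print(bottom_one)
```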


In [None]:
pd.DataFrame(data['Member_number'].value_counts().sort_values(ascending=False))[:10]

Unnamed: 0,Member_number
3180,36
3050,33
2051,33
3737,33
2625,31
3915,31
2433,31
2271,31
3872,30
2394,29


In [None]:
# Let's create a few new columns derived from the Date column of the dataframe
# Extract the year: splitting the date on "-" gives a list whose last element is the year
data["Year"] = data['Date'].str.split("-").str[-1]

# Create a Month-Year column: split the date on "-" and join the second element (month) and the last element (year)
data["Month-Year"] = data['Date'].str.split("-").str[1] + "-" + data['Date'].str.split("-").str[-1]
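
An alternative to string splitting is to parse the column into real dates, which also validates the values. The sketch below assumes the day-month-year layout visible in the data (for example `21-07-2015`) and uses a small stand-in frame named `sample` rather than the full dataset. Note that `Year` comes out as an integer here, whereas the string-splitting version above produces text.

```python
import pandas as pd

# A few rows in the same DD-MM-YYYY layout as the Date column.
sample = pd.DataFrame({"Date": ["21-07-2015", "05-01-2015", "24-06-2014"]})

# Parse once with an explicit format so day and month are not swapped.
parsed = pd.to_datetime(sample["Date"], format="%d-%m-%Y")

sample["Year"] = parsed.dt.year                      # integer year
sample["Month-Year"] = parsed.dt.strftime("%m-%Y")   # text such as "07-2015"
print(sample)
```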

In [None]:
# Plot a bar chart of the number of items sold in each month of each year
month_year_counts = data["Month-Year"].value_counts(ascending=False)
fig1 = px.bar(month_year_counts,
              orientation="v",
              color=month_year_counts,
              labels={'value': 'Count', 'index': 'Date', 'color': 'Meter'})

fig1.update_layout(title_text="Exploring highest sales by date")
fig1.show()

### Implementation of Apriori Algorithm

In [None]:
# Creating a list of names of unique products present in the itemDescription column
products = data['itemDescription'].unique()

In [None]:
products[:10]

array(['tropical fruit', 'whole milk', 'pip fruit', 'other vegetables',
       'rolls/buns', 'pot plants', 'citrus fruit', 'beef', 'frankfurter',
       'chicken'], dtype=object)

In [None]:
# To model the relationships between products we need numerical values, so let's one-hot encode the products
data1=data.copy()
one_hot = pd.get_dummies(data1['itemDescription'])
data1.drop(['itemDescription'], inplace =True, axis=1)
data1 = data1.join(one_hot)
data1.head()

Unnamed: 0,Member_number,Date,Year,Month-Year,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,1808,21-07-2015,2015,07-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2552,05-01-2015,2015,01-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,2300,19-09-2015,2015,09-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1187,12-12-2015,2015,12-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,3037,01-02-2015,2015,02-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [None]:
# Group the data by Member_number and then by Date, and sum the one-hot columns for each product in the products list created earlier.
data2 = data1.groupby(['Member_number', 'Date'])[products].sum()
data2.head(3)

Member_number,Date,tropical fruit,whole milk,pip fruit,other vegetables,rolls/buns,pot plants,citrus fruit,beef,frankfurter,chicken,...,flower (seeds),rice,tea,salad dressing,specialty vegetables,pudding powder,ready soups,make up remover,toilet cleaner,preservation products
1000,15-03-2015,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000,24-06-2014,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000,24-07-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Reset the index of the newly formed dataset and keep only the product columns.
data2 = data2.reset_index()[products]
data2.head()

Unnamed: 0,tropical fruit,whole milk,pip fruit,other vegetables,rolls/buns,pot plants,citrus fruit,beef,frankfurter,chicken,...,flower (seeds),rice,tea,salad dressing,specialty vegetables,pudding powder,ready soups,make up remover,toilet cleaner,preservation products
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Define product_names, which takes one row and, for each product whose value in that row
# is greater than zero, replaces the value with the product's name from the products list.
def product_names(x):
    for product in products:
        if x[product] >0:
            x[product] = product
    return x

# Apply the function to every row of data2.
data2 = data2.apply(product_names, axis=1)
data2.head()

Unnamed: 0,tropical fruit,whole milk,pip fruit,other vegetables,rolls/buns,pot plants,citrus fruit,beef,frankfurter,chicken,...,flower (seeds),rice,tea,salad dressing,specialty vegetables,pudding powder,ready soups,make up remover,toilet cleaner,preservation products
0,0,whole milk,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,whole milk,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Extract the underlying array of values from data2
rows = data2.values
# For each row, keep only the non-zero entries (the product names) as a list; skip rows with no items
transactions = [row[row != 0].tolist() for row in rows if (row != 0).any()]
transactions[0:10]

[['whole milk', 'yogurt', 'sausage', 'semi-finished bread'],
 ['whole milk', 'pastry', 'salty snack'],
 ['canned beer', 'misc. beverages'],
 ['sausage', 'hygiene articles'],
 ['soda', 'pickled vegetables'],
 ['frankfurter', 'curd'],
 ['whole milk', 'rolls/buns', 'sausage'],
 ['whole milk', 'soda'],
 ['beef', 'white bread'],
 ['frankfurter', 'soda', 'whipped/sour cream']]
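
The same list-of-lists structure can also be produced without the one-hot detour, by grouping the original long-format rows and collecting each basket's items. This is a sketch on a small stand-in frame named `toy` that reuses the dataset's three column names. Unlike the pipeline above, it lists items in the order they appear in the rows rather than in column order.

```python
import pandas as pd

# Stand-in purchases using the dataset's column names.
toy = pd.DataFrame({
    "Member_number": [1000, 1000, 1000, 1001, 1001],
    "Date": ["15-03-2015", "15-03-2015", "24-06-2014", "02-01-2015", "02-01-2015"],
    "itemDescription": ["whole milk", "yogurt", "whole milk", "soda", "soda"],
})

# One basket per (member, date); drop_duplicates removes an item repeated within a basket.
baskets = (
    toy.drop_duplicates()
       .groupby(["Member_number", "Date"])["itemDescription"]
       .apply(list)
       .tolist()
)
print(baskets)
```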

In [None]:
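# Now we find the associations between items in the dataset:
# run apriori on the transactions and collect the resulting records into a list.
# Note: apyori's documented options are min_support, min_confidence, min_lift and max_length;
# `target` does not appear to be one of them and is likely ignored.

associations = apriori(transactions, min_support=0.0003, min_confidence=0.05, min_lift=3, max_length=2, target="associations")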
association_results = list(associations)
print(association_results[0])

RelationRecord(items=frozenset({'fruit/vegetable juice', 'liver loaf'}), support=0.00040098910646260775, ordered_statistics=[OrderedStatistic(items_base=frozenset({'liver loaf'}), items_add=frozenset({'fruit/vegetable juice'}), confidence=0.12, lift=3.5276227897838903)])
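
Each result behaves like a named tuple, so its fields can be read by name, and the rule direction is given by `items_base` (antecedent) and `items_add` (consequent). The sketch below rebuilds the record printed above using stand-in `namedtuple` types. These are local definitions that mirror the printed field names so the snippet runs without `apyori`; they are not the library's own classes. It then derives the individual item supports implied by the definitions of confidence and lift.

```python
from collections import namedtuple

# Stand-in types mirroring the field names in the printed record above
# (local definitions for illustration, not imported from apyori).
OrderedStatistic = namedtuple("OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")

record = RelationRecord(
    items=frozenset({"fruit/vegetable juice", "liver loaf"}),
    support=0.00040098910646260775,
    ordered_statistics=[
        OrderedStatistic(
            items_base=frozenset({"liver loaf"}),
            items_add=frozenset({"fruit/vegetable juice"}),
            confidence=0.12,
            lift=3.5276227897838903,
        )
    ],
)

stat = record.ordered_statistics[0]
base = ", ".join(sorted(stat.items_base))
add = ", ".join(sorted(stat.items_add))
print(f"Rule: {base} -> {add}")

# confidence = support(base and add) / support(base)
base_support = record.support / stat.confidence
# lift = confidence / support(add)
add_support = stat.confidence / stat.lift
print(f"support({base}) = {base_support:.5f}")
print(f"support({add}) = {add_support:.5f}")
```

Read this way, roughly 0.33% of baskets contain liver loaf and roughly 3.4% contain fruit/vegetable juice, while 12% of the liver loaf baskets also contain juice, which is about 3.5 times the baseline rate.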


In [None]:
# Iterate through the list of association records
for item in association_results:

    # item[0] is the itemset (a frozenset); unpack it into a list of individual items
    itemset = item[0]
    items = [x for x in itemset]

    # Print the item pair (first value in items to second value in items).
    # Note: a frozenset has no defined order, so the arrow printed here does not
    # necessarily match the rule direction, which is items_base -> items_add in item[2][0].
    print("Rule : ", items[0], " -> " + items[1])

    # Print the support, confidence and lift values of each itemset
    print("Support : ", str(item[1]))
    print("Confidence : ",str(item[2][0][2]))
    print("Lift : ", str(item[2][0][3]))

    print("=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>")

Rule :  fruit/vegetable juice  -> liver loaf
Support :  0.00040098910646260775
Confidence :  0.12
Lift :  3.5276227897838903
=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>
Rule :  pickled vegetables  -> ham
Support :  0.0005346521419501437
Confidence :  0.05970149253731344
Lift :  3.4895055970149254
=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>
Rule :  meat  -> roll products 
Support :  0.0003341575887188398
Confidence :  0.06097560975609757
Lift :  3.620547812620984
=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>
Rule :  misc. beverages  -> salt
Support :  0.0003341575887188398
Confidence :  0.05617977528089888
Lift :  3.5619405827461437
=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>
Rule :  spread cheese  -> misc. beverages
Support :  0.0003341575887188398
Confidence :  0.05
Lift :  3.170127118644068
=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>
Rule :  seasonal products  -> soups
Support :  0.0003341575887188398
Confidence :  0.10416666666666667
Lift :  14.704205974842768
=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=