# About Dataset

# Association Rule Mining

**Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. In other words, it lets retailers identify relationships between the items that people buy.**

**Association rules are widely used to analyze retail basket or transaction data. They are intended to identify strong rules in that data using measures of interestingness such as support, confidence, and lift.**
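The interestingness measures most often used for such rules are support, confidence, and lift. A minimal sketch of how they are computed, assuming a handful of made-up baskets (the item names and baskets below are illustrative and are not taken from the dataset):

```python
# Toy baskets, invented for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Estimate of P(rhs | lhs): support(lhs and rhs together) / support(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

def lift(lhs, rhs, baskets):
    """Confidence divided by the baseline support of rhs."""
    return confidence(lhs, rhs, baskets) / support(rhs, baskets)

print(support({"milk", "bread"}, baskets))
print(confidence({"milk"}, {"bread"}, baskets))
print(lift({"milk"}, {"bread"}, baskets))
```

A lift above 1 suggests the two sides occur together more often than independence would predict, and a lift below 1 suggests the opposite.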

# Apriori Algorithm

**Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases.**

**It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.** 

**The frequent itemsets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.**
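The level-wise idea described above can be sketched in plain Python. This is a simplified illustration on toy data, not the implementation used by the `apyori` package later in the notebook; the function name `frequent_itemsets` and the toy transactions are assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow k-itemsets from frequent (k-1)-itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Toy transactions, invented for illustration.
toy = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 3], [2]]
result = frequent_itemsets(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

In this toy run the pair {1, 3} falls below the threshold, so the triple {1, 2, 3} is pruned without ever being counted. That pruning is what keeps the search tractable on larger item catalogues.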

# Import Libraries

In [13]:
import numpy as np 
import pandas as pd

# Reading Data Using the pandas Library

In [14]:
df=pd.read_csv("/kaggle/input/groceries-dataset/Groceries_dataset.csv")

# Show the First Five Rows of the Data

In [15]:
df.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk


# Exploring dataset

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Member_number    38765 non-null  int64 
 1   Date             38765 non-null  object
 2   itemDescription  38765 non-null  object
dtypes: int64(1), object(2)
memory usage: 908.7+ KB


In [17]:
df.dtypes

Member_number       int64
Date               object
itemDescription    object
dtype: object

# Statistics Info About the Data



In [18]:
df.describe()

Unnamed: 0,Member_number
count,38765.0
mean,3003.641868
std,1153.611031
min,1000.0
25%,2002.0
50%,3005.0
75%,4007.0
max,5000.0


* Find NaN values in the data

In [19]:
df.isnull().sum()

Member_number      0
Date               0
itemDescription    0
dtype: int64

* The output of df.isnull().sum() shows that the data contains no NaN values

In [20]:
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
38760    False
38761    False
38762    False
38763    False
38764    False
Length: 38765, dtype: bool

In [21]:
df.columns.values

array(['Member_number', 'Date', 'itemDescription'], dtype=object)

In [22]:
df.drop_duplicates(subset=None, keep=False, inplace=True)
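With `keep=False`, pandas drops every copy of a duplicated row, including the first one, which is why the row count falls from 38765 to 37274 across this step. Whether that is the intended behaviour depends on the analysis, since a repeated purchase then disappears from the basket altogether. A plain-Python sketch of the difference between `keep=False` and `keep='first'`, using invented rows:

```python
from collections import Counter

# Toy (member, date, item) rows, invented for illustration; the second row repeats the first.
rows = [
    (1000, "01-01-2015", "whole milk"),
    (1000, "01-01-2015", "whole milk"),
    (1000, "01-01-2015", "rolls/buns"),
    (2000, "02-01-2015", "soda"),
]

counts = Counter(rows)

# Analogue of keep=False: discard every row that has a duplicate, originals included.
keep_false = [row for row in rows if counts[row] == 1]

# Analogue of keep='first': retain one copy of each distinct row, in original order.
keep_first = list(dict.fromkeys(rows))

print(len(keep_false))
print(len(keep_first))
```

Here the `keep=False` analogue leaves two rows and loses the milk purchase entirely, while the `keep='first'` analogue leaves three rows and keeps one copy of it.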


In [23]:
df.isna().sum()

Member_number      0
Date               0
itemDescription    0
dtype: int64

In [24]:
df.dropna()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk
...,...,...,...
38760,4471,08-10-2014,sliced cheese
38761,2022,23-02-2014,candy
38762,1097,16-04-2014,cake bar
38763,1510,03-12-2014,fruit/vegetable juice


In [25]:
df.shape

(37274, 3)

# Convert string data to numeric data




In [27]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

df['Date'] = encoder.fit_transform(df['Date'])

Date  = {index : label for index, label in enumerate(encoder.classes_)}

Date 

{0: 0,
 1: 1,
 2: 2,
 3: 3,
 4: 4,
 ...
 110: 110,
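Two points about this encoding are worth noting. First, the identity mapping printed above suggests the cell was re-run after `Date` had already been encoded, so the original date strings cannot be recovered from this dictionary. Second, `LabelEncoder` numbers its classes in sorted order, so for `DD-MM-YYYY` strings the codes follow lexicographic order rather than chronological order. That is harmless when the code is only used as a grouping key, but the codes should not be read as a timeline. A plain-Python sketch of the same sorted-order encoding, using dates from the rows shown earlier:

```python
from datetime import datetime

dates = ["21-07-2015", "05-01-2015", "19-09-2015", "12-12-2015", "01-02-2015", "08-10-2014"]

# Sorted-order encoding, mimicking how a label encoder assigns integer codes.
codes = {label: code for code, label in enumerate(sorted(set(dates)))}
print(codes)

# Chronological order requires parsing the strings as dates first.
chronological = sorted(set(dates), key=lambda s: datetime.strptime(s, "%d-%m-%Y"))
print(chronological)
```

The only 2014 date in this sample receives code 2 rather than 0, which shows that the integer codes do not track time.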

In [28]:
df['itemDescription'] = encoder.fit_transform(df['itemDescription'])

itemDescription  = {index : label for index, label in enumerate(encoder.classes_)}

itemDescription 

{0: 'Instant food products',
 1: 'UHT-milk',
 2: 'abrasive cleaner',
 3: 'artif. sweetener',
 4: 'baby cosmetics',
 5: 'bags',
 6: 'baking powder',
 7: 'bathroom cleaner',
 8: 'beef',
 9: 'berries',
 10: 'beverages',
 11: 'bottled beer',
 12: 'bottled water',
 13: 'brandy',
 14: 'brown bread',
 15: 'butter',
 16: 'butter milk',
 17: 'cake bar',
 18: 'candles',
 19: 'candy',
 20: 'canned beer',
 21: 'canned fish',
 22: 'canned fruit',
 23: 'canned vegetables',
 24: 'cat food',
 25: 'cereals',
 26: 'chewing gum',
 27: 'chicken',
 28: 'chocolate',
 29: 'chocolate marshmallow',
 30: 'citrus fruit',
 31: 'cleaner',
 32: 'cling film/bags',
 33: 'cocoa drinks',
 34: 'coffee',
 35: 'condensed milk',
 36: 'cooking chocolate',
 37: 'cookware',
 38: 'cream',
 39: 'cream cheese ',
 40: 'curd',
 41: 'curd cheese',
 42: 'decalcifier',
 43: 'dental care',
 44: 'dessert',
 45: 'detergent',
 46: 'dish cleaner',
 47: 'dishes',
 48: 'dog food',
 49: 'domestic eggs',
 50: 'female sanitary products',
 51: 

In [33]:
df['itemDescription'].value_counts()


164    2232
102    1760
122    1580
138    1394
165    1238
       ... 
155       5
5         4
4         3
79        1
114       1
Name: itemDescription, Length: 167, dtype: int64

In [35]:
df['Date'].value_counts()


481    96
493    89
183    88
662    86
709    85
       ..
84     24
389    23
17     22
216    22
365    21
Name: Date, Length: 728, dtype: int64

In [36]:
df['Member_number'].value_counts()


3180    34
3737    33
3050    31
2051    29
2394    29
        ..
1701     2
2417     2
4454     1
3197     1
1439     1
Name: Member_number, Length: 3892, dtype: int64

# Grouping the dataset to form a list of products bought by the same customer on the same date


In [37]:
df=df.groupby(['Member_number','Date'])['itemDescription'].apply(lambda x: list(x))

In [38]:
df.head()


Member_number  Date
1000           341     [130, 164, 132, 165]
               562          [164, 105, 128]
               565                 [20, 92]
               597                [130, 73]
               633               [138, 108]
Name: itemDescription, dtype: object
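The grouping step turns one row per purchased item into one list per (member, date) pair. A plain-Python sketch of the same idea is shown here; the rows are laid out to reproduce the first baskets printed above, and their order in the underlying data is an assumption.

```python
from collections import defaultdict

# (member, date_code, item_code) rows; the layout is assumed for illustration.
rows = [
    (1000, 341, 130), (1000, 341, 164), (1000, 341, 132), (1000, 341, 165),
    (1000, 562, 164), (1000, 562, 105), (1000, 562, 128),
    (1000, 565, 20), (1000, 565, 92),
]

baskets = defaultdict(list)
for member, date, item in rows:
    # One basket per (member, date) pair, items kept in row order.
    baskets[(member, date)].append(item)

transactions = list(baskets.values())
print(transactions)
```

Unlike `groupby`, which sorts the group keys by default, the dictionary keeps first-seen order. The two agree here only because the toy rows are already sorted by key.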

# apriori takes a list as input, so the dataset is converted to a list


In [39]:
transactions = df.values.tolist()
transactions[:10]

[[130, 164, 132, 165],
 [164, 105, 128],
 [20, 92],
 [130, 73],
 [138, 108],
 [56, 40],
 [130, 164, 122],
 [164, 138],
 [8, 162],
 [56, 138, 160]]

# Applying the Apriori algorithm


In [43]:
from apyori import apriori
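# Note: apyori appears to read only min_support, min_confidence, min_lift and max_length,
# so min_length is most likely ignored here.
rules = apriori(transactions, min_support=0.0003, min_confidence=0.05, min_lift=2, min_length=2)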
results = list(rules)
results

[RelationRecord(items=frozenset({138, 3}), support=0.00047435115538388563, ordered_statistics=[OrderedStatistic(items_base=frozenset({3}), items_add=frozenset({138}), confidence=0.2413793103448276, lift=2.5552614653935586)]),
 RelationRecord(items=frozenset({9, 35}), support=0.0003388222538456326, ordered_statistics=[OrderedStatistic(items_base=frozenset({35}), items_add=frozenset({9}), confidence=0.05102040816326531, lift=2.316640502354788)]),
 RelationRecord(items=frozenset({151, 15}), support=0.0003388222538456326, ordered_statistics=[OrderedStatistic(items_base=frozenset({151}), items_add=frozenset({15}), confidence=0.07462686567164178, lift=2.117824339839265)]),
 RelationRecord(items=frozenset({20, 84}), support=0.0004065867046147591, ordered_statistics=[OrderedStatistic(items_base=frozenset({84}), items_add=frozenset({20}), confidence=0.12, lift=2.57764192139738)]),
 RelationRecord(items=frozenset({59, 28}), support=0.0004065867046147591, ordered_statistics=[OrderedStatistic(item
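The `min_support` threshold is a fraction of all baskets, so it helps to translate it into an absolute count. The sketch below assumes 14757 baskets. That figure is not stated in the notebook; it is inferred from the reported supports, which are consistent with 5, 6 and 7 occurrences out of 14757.

```python
import math

# Assumption: number of baskets inferred from the printed supports, not stated in the notebook.
n_transactions = 14757
min_support = 0.00030

# Smallest number of baskets an itemset must appear in to pass the threshold.
min_count = math.ceil(min_support * n_transactions)
print(min_count)

# Cross-check against the support value printed for the second rule.
print(5 / n_transactions)
```

Under that assumption, an itemset seen in as few as five baskets passes the threshold, so the resulting rules rest on very small counts and are best read as tentative.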

In [46]:
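def inspect(results):
    # Each record is (items, support, ordered_statistics); only the first
    # ordered statistic and the first item on each side are read.
    lhs         = [tuple(result[2][0][0])[0] for result in results]  # first item of items_base
    rhs         = [tuple(result[2][0][1])[0] for result in results]  # first item of items_add
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

# One row per record: antecedent, consequent, support, confidence, lift.
ordered_results = pd.DataFrame(inspect(results), columns=['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])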
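Because `inspect` reads only the first ordered statistic of each record and the first item on each side, any record that carries several rules or multi-item sides would be partly dropped. A hedged alternative that emits one row per rule is sketched below. The two namedtuples are stand-ins that mirror the record layout printed above and are not imported from `apyori`; the helper name `flatten` is hypothetical.

```python
from collections import namedtuple

# Stand-ins mirroring the printed record layout; not the apyori classes themselves.
OrderedStatistic = namedtuple("OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")

def flatten(records):
    """Yield one row per rule, keeping the full antecedent and consequent sets."""
    for record in records:
        for stat in record.ordered_statistics:
            yield (
                tuple(sorted(stat.items_base)),
                tuple(sorted(stat.items_add)),
                record.support,
                stat.confidence,
                stat.lift,
            )

# First record from the output above, rebuilt with the stand-in types.
sample = [
    RelationRecord(
        items=frozenset({138, 3}),
        support=0.00047435115538388563,
        ordered_statistics=[
            OrderedStatistic(frozenset({3}), frozenset({138}), 0.2413793103448276, 2.5552614653935586)
        ],
    )
]
rows = list(flatten(sample))
print(rows)
```

The rows can be passed to `pd.DataFrame` with the same column names as before. The left and right sides then hold tuples of item codes rather than single codes.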

# Result of the algorithm

In [47]:
ordered_results

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
0,3,138,0.000474,0.241379,2.555261
1,35,9,0.000339,0.051020,2.316641
2,151,15,0.000339,0.074627,2.117824
3,84,20,0.000407,0.120000,2.577642
4,59,28,0.000407,0.058824,2.487275
...,...,...,...,...,...
76,160,165,0.000610,0.214286,2.554293
77,130,138,0.000610,0.191489,2.027122
78,130,165,0.000407,0.230769,2.750777
79,138,130,0.000678,0.121951,2.038091
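The rule table shows integer item codes, which are hard to read on their own. A small sketch of decoding them back to names is given below. It uses only the part of the code-to-name mapping that is visible in the encoder output earlier, so codes beyond the truncated listing fall back to a placeholder; the helper `name` is hypothetical.

```python
# Partial code-to-name mapping copied from the encoder output shown earlier.
item_names = {
    3: "artif. sweetener",
    9: "berries",
    15: "butter",
    20: "canned beer",
    28: "chocolate",
    35: "condensed milk",
}

# First rules from the result table as (lhs, rhs, lift).
rules = [(3, 138, 2.555261), (35, 9, 2.316641), (151, 15, 2.117824)]

def name(code):
    # Fall back to the raw code when the name is not in the partial mapping.
    return item_names.get(code, f"item {code}")

for lhs, rhs, lift in rules:
    print(f"{name(lhs)} -> {name(rhs)} (lift {lift:.2f})")
```

In the notebook itself the full dictionary built from `encoder.classes_` is available as `itemDescription`, so mapping the two code columns through it (for example with `Series.map`) should give the same result for every rule.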
