<img src="https://annalyzin.files.wordpress.com/2016/04/association-rules-network-graph2.png">

<h6><center><a href="https://annalyzin.files.wordpress.com/2016/04/association-rules-network-graph2.png">Source</a></center></h6>
<h1><center>Associate Rule Mining</center></h1>

**Table of Contents:**

1. [Introduction](#Introduction)
2. [Overview: Dataset Description](#Overview)
3. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    * [Individual Feature Visualisation](#Individual-Feature-Visualisation)
    * [Multiple Feature Interactions Visualisation](#Multiple-Feature-Interactions-Visualisation)
4. [Associate Rule Learning](#Associate-Rule-Learning)
    * [Support](#Support)
    * [Confidence](#Confidence)
    * [Lift](#Lift)
    * [Apriori](#Apriori)

# Introduction

This notebook is an introduction to Association Rule Mining in Python, using the [Groceries dataset](https://www.kaggle.com/heeraldedhia/groceries-dataset). We will first go through a brief Exploratory Data Analysis and then implement one of the most popular Association Rule Learning models. The aim is to identify association rules for Market Basket Analysis.

Association Rule Learning (or Associate Rule Mining) is a rule-based machine learning method for discovering how items are associated with each other. Stores use it to figure out which products are bought together, so that they can make targeted offers to different customers, e.g. buy one get one free. Earlier recommendation systems, such as those of Amazon and Netflix, also used it. In this notebook, we will go through the Apriori type of Association Rule Learning model.

*We will use the `apyori` Python package to implement the Apriori model.*

Let’s get started!

**Importing libraries and reading the dataset**

In [6]:
import pandas as pd # for data reading and manipulation
import matplotlib.pyplot as plt # for visualization
import seaborn as sns # for visualization
import numpy as np # for numerical computation

%matplotlib inline

# Overview

In [7]:
# load the groceries-dataset
groceries = pd.read_csv('data/Groceries_dataset.csv', parse_dates=['Date'])
groceries.head()
#groceries.shape

Unnamed: 0,Member_number,Date,itemDescription
0,1808,2015-07-21,tropical fruit
1,2552,2015-05-01,whole milk
2,2300,2015-09-19,pip fruit
3,1187,2015-12-12,other vegetables
4,3037,2015-01-02,whole milk


**Dataset Description**
* Member_number: A unique ID for each customer who bought groceries
* Date: The date on which the customer bought the groceries
* itemDescription: Description of the item that the customer bought

# Exploratory Data Analysis

In the EDA, we will start with individual feature analysis and then explore interactions between multiple features.

**Let's look at the time span covered by the data.**

In [None]:
print("We have the data from",groceries.Date.min(),"to", groceries.Date.max())

In [None]:
# extract the year, month, day and day of the week into new columns
groceries['year'] = groceries['Date'].dt.year
groceries['month'] = groceries['Date'].dt.month
groceries['day'] = groceries['Date'].dt.day
groceries['day_of_week'] = groceries['Date'].dt.day_name()
groceries.head()

## Individual Feature Visualisation

We start by plotting the distribution of each feature individually, before moving on to multi-feature visuals and correlations. Here, we deal with the features one by one.

Let's look at the customers' visiting rate.

In [None]:
plt.rcParams["figure.figsize"] = [13, 7]

color = plt.cm.spring(np.linspace(0, 1, 5))

fig, (ax, ax2) = plt.subplots(ncols=2)

groceries['Member_number'].value_counts().head().plot(kind='bar', color = color, ax=ax, title='Customers who visited the store more often');
ax.set_xlabel("Customer ID")
ax.set_ylabel("Count")
groceries['Member_number'].value_counts(ascending=True).head().plot(kind='bar', color = color, ax=ax2, title='Customers who visited the store less often');
ax2.set_xlabel("Customer ID")
ax2.set_ylabel("Count");

We found that:
* Member 3180 bought the highest number of groceries, followed by members 3050, 2051 and 3737.
* A lot of customers appear only twice in the data (they may be tourists).

Let's look at the date at which customers visited the store.

In [None]:
plt.rcParams["figure.figsize"] = [13, 7]

color = plt.cm.winter(np.linspace(0, 1, 10))

fig, (ax, ax2) = plt.subplots(ncols=2)

groceries['Date'].value_counts().head().plot(kind='bar', color = color, ax=ax, title='Date at which the store got the highest number of visits');
ax.set_xlabel("Date")
ax.set_ylabel("Count")
groceries['Date'].value_counts(ascending=True).head().plot(kind='bar', color = color, ax=ax2, title='Date at which the store got the lowest number of visits');
ax2.set_xlabel("Date")
ax2.set_ylabel("Count");

We found that:
* The largest number of customers visited the store on 21st January 2015, followed by 21st July 2015.
* The fewest customers visited the store on 9th January 2015, followed by 16th March 2015.
* Both the busiest and the quietest dates were recorded in 2015.

Let's look at the total count of each item bought.

In [None]:
plt.rcParams["figure.figsize"] = [13, 7]
color = plt.cm.twilight(np.linspace(0, 1, 10))

fig, (ax, ax2) = plt.subplots(ncols=2)

groceries['itemDescription'].value_counts().head().plot(kind='bar', color = color, ax=ax, title='Most often bought Groceries');
ax.set_xlabel("Grocery items")
ax.set_ylabel("Count")
groceries['itemDescription'].value_counts(ascending=True).head().plot(kind='bar', color = color, ax=ax2, title='Least bought Groceries');
ax2.set_xlabel("Grocery items")
ax2.set_ylabel("Count");

We found that:
* Whole milk is the most frequently bought item, followed by other vegetables and rolls/buns.
* Preservation products and kitchen utensils are the least bought.

Let's look at the day of the month on which customers visited the store.

In [None]:
plt.style.use('fivethirtyeight')
plt.rcParams["figure.figsize"] = [12, 6]
color = plt.cm.ocean(np.linspace(0, 1, 31))

groceries['day'].value_counts().plot(kind='bar', color=color, title='Groceries bought in each day of the month').set(xlabel='Day of the month', ylabel='Count');

We found that:
* The 28th is the day of the month on which the highest number of items is bought.
* The 31st is the day of the month on which the lowest number of items is bought (maybe because people are short on budget at the end of the month, although only seven months have a 31st).

Let's look at the groceries bought in each month.

In [None]:
plt.rcdefaults()
plt.rcParams["figure.figsize"] = [13, 7]
color = plt.cm.autumn(np.linspace(0, 1, 12))

groceries['month'].value_counts().plot(kind='bar', color=color, title='Groceries bought in each month').set(xlabel='Month', ylabel='Count');

We found that:
* The highest number of items is purchased in August.
* The lowest number of items is purchased in September.

Let's look at the groceries bought in each year.

In [None]:
plt.rcdefaults()
plt.rcParams["figure.figsize"] = [13, 7]

groceries['year'].value_counts().plot(kind='bar', title='Groceries bought in each year').set(xlabel='Year', ylabel='Count');

We found that:
* The highest number of customers visited the store in 2015.
* 2015 and 2014 are very close in terms of customer visiting rate.

Let's look at the groceries bought on each day of the week.

In [None]:
groceries['day_of_week'].value_counts().head(15).plot.pie(figsize = (15, 8), explode = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1))

plt.title('Groceries count on each day',fontsize = 20)
plt.xlabel('')
plt.ylabel('')
plt.xticks(rotation = 90)
plt.show()

We found that:
* The customer visiting rate is roughly evenly distributed across the days of the week.

## Multiple Feature Interactions Visualisation

We started by looking at each feature individually. Now let's look at the relations between features. We will first manipulate and visualize the data as a time series. To make this easier, we will aggregate the customer data and the item data into a single dataframe indexed by date.

In [None]:
# let's aggregate the data with date to see more clearly which items are bought on which date
# create a new dataframe and store unique visitors and unique bought items
groceries_time = pd.DataFrame(groceries.groupby('Date')['itemDescription'].nunique().index)
groceries_time['members_count'] = groceries.groupby('Date')['Member_number'].nunique().values
groceries_time['items_count'] = groceries.groupby('Date')['itemDescription'].nunique().values
groceries_time['items'] = groceries.groupby('Date')['itemDescription'].unique().values
groceries_time.set_index('Date',inplace=True)
groceries_time.head()

Let's summarize the data with a density plot to see where the mass of the data is located.

In [None]:
plt.rcParams["figure.figsize"] = [10, 5]

# `fill=True` replaces the deprecated `shade=True` (assumes seaborn >= 0.11)
sns.kdeplot(data=groceries_time['members_count'], fill=True);

It seems that the data is uniformly distributed without any trend. Let's verify that using line plots.

In [None]:
groceries_time['members_count'].plot(figsize=(10, 5), title='Number of members visiting over time');

It appears that the number of members visiting the store stayed more or less steady over the period covered by the data (2014 to 2015). Having explored how the purchases are spread over time, we will now use association rules to find pairs of items that are associated with each other.


# Associate Rule Learning

Association rule learning is a technique for discovering how items are associated with each other. Association can be measured in three common ways.

## Support

Support tells us how popular an itemset is, measured as the proportion of transactions in which the itemset appears. It is measured as follows:

For Movie Recommendation, we calculate it as:
$$
\begin{equation*}
support(M) = \frac{\text{number of user watchlists containing M}}{\text{total number of user watchlists}}
\end{equation*}
$$

whereas for Market Basket Optimization, we calculate it as:
$$
\begin{equation*}
support(I) = \frac{\text{number of transactions containing I}}{\text{total number of transactions}}
\end{equation*}
$$
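
As a rough illustration of the support formula, here is a minimal sketch on a handful of made-up transactions. The toy data and the `support` helper are assumptions for illustration only; they are not taken from the Groceries dataset or from `apyori`.

```python
# Toy transactions (hypothetical data, not from the Groceries dataset)
transactions = [
    {"whole milk", "rolls/buns"},
    {"whole milk", "soda"},
    {"rolls/buns", "soda"},
    {"whole milk", "rolls/buns", "soda"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

milk_support = support({"whole milk"}, transactions)
pair_support = support({"whole milk", "rolls/buns"}, transactions)

print(milk_support)  # 3 of 4 transactions -> 0.75
print(pair_support)  # 2 of 4 transactions -> 0.5
```

Note that the support of a pair can never exceed the support of either of its items, which is the property Apriori relies on to prune candidates.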

## Confidence

Confidence tells us how likely item B is to be purchased when item A is purchased, expressed as {A -> B}. It is measured as follows:

For Movie Recommendation, we calculate it as:
$$
\begin{equation*}
confidence(M_1\rightarrow{M_2}) = \frac{\text{number of user watchlists containing $M_1$ and $M_2$}}{\text{number of user watchlists containing $M_1$}}
\end{equation*}
$$

whereas for Market Basket Optimization, we calculate it as:
$$
\begin{equation*}
confidence(I_1\rightarrow{I_2}) = \frac{\text{number of transactions containing $I_1$ and $I_2$}}{\text{number of transactions containing $I_1$}}
\end{equation*}
$$
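
The confidence formula can be sketched in the same way. The transactions and the `confidence` helper below are hypothetical and only meant to show that confidence is directional: {A -> B} and {B -> A} generally give different values.

```python
# Toy transactions (hypothetical data, not from the Groceries dataset)
transactions = [
    {"whole milk", "rolls/buns"},
    {"whole milk", "rolls/buns"},
    {"whole milk"},
    {"whole milk", "soda"},
    {"soda"},
]

def confidence(antecedent, consequent, transactions):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return n_both / n_antecedent

conf_milk_to_buns = confidence({"whole milk"}, {"rolls/buns"}, transactions)
conf_buns_to_milk = confidence({"rolls/buns"}, {"whole milk"}, transactions)

print(conf_milk_to_buns)  # 2 of the 4 milk baskets contain buns -> 0.5
print(conf_buns_to_milk)  # both buns baskets contain milk -> 1.0
```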

## Lift

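Lift tells us how likely item B is to be purchased when item A is purchased, while controlling for how popular item B is. A lift above 1 means B is more likely to be bought when A is bought, and a lift below 1 means it is less likely. It is measured as follows: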

For Movie Recommendation, we calculate it as:
$$
\begin{equation*}
lift(M_1\rightarrow{M_2}) = \frac{Confidence(M_1\rightarrow{M_2})}{Support(M_2)}
\end{equation*}
$$

whereas for Market Basket Optimization, we calculate it as:
$$
\begin{equation*}
lift(I_1\rightarrow{I_2}) = \frac{Confidence(I_1\rightarrow{I_2})}{Support(I_2)}
\end{equation*}
$$
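
To make the lift formula concrete, here is a small sketch that divides confidence by the support of the consequent. The data and the helper names (`count`, `lift`) are assumptions for illustration, not part of `apyori`.

```python
# Toy transactions (hypothetical data, not from the Groceries dataset)
transactions = [
    {"whole milk", "rolls/buns"},
    {"whole milk", "rolls/buns"},
    {"whole milk"},
    {"whole milk", "soda"},
    {"soda"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def lift(a, b, transactions):
    """lift(a -> b) = confidence(a -> b) / support(b)."""
    confidence = count({a, b}, transactions) / count({a}, transactions)
    support_b = count({b}, transactions) / len(transactions)
    return confidence / support_b

lift_milk_buns = lift("whole milk", "rolls/buns", transactions)
lift_milk_soda = lift("whole milk", "soda", transactions)

print(round(lift_milk_buns, 3))  # above 1: bought together more often than by chance
print(round(lift_milk_soda, 3))  # below 1: bought together less often than by chance
```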

## Apriori

The Apriori algorithm consists of the following steps:

1. Set a minimum support and a minimum confidence.
2. Take all the itemsets (subsets of the transactions) that have a higher support than the minimum support.
3. Take all the rules from these itemsets that have a higher confidence than the minimum confidence.
4. Sort the rules by decreasing lift.
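
The steps above can be sketched in plain Python. This is a simplified illustration restricted to rules between two single items, not the `apyori` implementation: the function name `apriori_pairs`, the toy data and the thresholds are all assumptions, and the filters use "greater than or equal to" for the minimum values.

```python
from itertools import combinations

def apriori_pairs(transactions, min_support, min_confidence):
    """Toy Apriori limited to rules of the form {item A -> item B}."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 2: keep frequent single items, then build candidate pairs only
    # from those items (an infrequent item cannot be part of a frequent pair).
    items = sorted({item for t in transactions for item in t})
    frequent_items = [i for i in items if support({i}) >= min_support]
    frequent_pairs = [
        frozenset(pair)
        for pair in combinations(frequent_items, 2)
        if support(set(pair)) >= min_support
    ]

    # Step 3: derive both rule directions from each pair, filter by confidence.
    rules = []
    for pair in frequent_pairs:
        for lhs in pair:
            (rhs,) = pair - {lhs}
            conf = support(pair) / support({lhs})
            if conf >= min_confidence:
                rule_lift = conf / support({rhs})
                rules.append((lhs, rhs, support(pair), conf, rule_lift))

    # Step 4: sort the rules by decreasing lift.
    return sorted(rules, key=lambda r: r[4], reverse=True)

# Step 1: choose the thresholds (hypothetical values for the toy data).
toy = [
    {"whole milk", "rolls/buns"},
    {"whole milk", "rolls/buns"},
    {"whole milk"},
    {"whole milk", "soda"},
    {"soda"},
]
rules = apriori_pairs(toy, min_support=0.3, min_confidence=0.4)
for lhs, rhs, sup, conf, rule_lift in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}, lift={rule_lift:.2f}")
```

On the toy data only the pair of whole milk and rolls/buns survives the support filter, so the output contains that rule in both directions.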

We will use the `apriori` function from the `apyori` package to implement the Apriori algorithm. It returns the rules together with their association measures, such as support, confidence and lift.

In [None]:
# importing the library
try:
    import apyori
except ImportError:
    !pip install apyori

from apyori import apriori # for association rule learning models

In [None]:
transactions = groceries_time['items'].tolist()

Let's run the algorithm and transform the result into a well-organised pandas dataframe to see which item pairs are most strongly associated.

In [None]:
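# note the spelling of `min_confidence`: apyori reads its options from keyword
# arguments, so a misspelled name is silently ignored
rules = apriori(transactions=transactions, min_support=0.00030, min_confidence=0.01, min_lift=3, min_length=2, max_length=2)
# let's transform them into a list
results = list(rules)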

def inspect(results):
    '''
    function to put the result in well organised pandas dataframe
    '''
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Item #1', 'Item #2', 'Support', 'Confidence', 'Lift'])
resultsinDataFrame.head()

Let's sort all the rules by decreasing lift.

In [None]:
resultsinDataFrame.nlargest(n=10, columns='Lift')

**In the store, people bought liqueur with preservation products, kitchen utensils with prosecco, and preservation products with spices. The store should add deals on preservation products, kitchen utensils and frozen chicken to increase its sales. Thanks, and I hope you enjoyed reading it. Happy coding!**