
# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Mini Project 1: Market Basket Analysis

**DISCLAIMER:** THIS NOTEBOOK IS PROVIDED ONLY AS A REFERENCE SOLUTION NOTEBOOK FOR THE MINI-PROJECT. THERE MAY BE OTHER POSSIBLE APPROACHES/METHODS TO ACHIEVE THE SAME RESULTS.

## Learning Objectives

At the end of the experiment, you will be able to:

* extract summary level insight from a given dataset

* integrate the data and identify the underlying pattern or structure

* understand the fundamentals of market basket analysis

* construct "rules" that provide concrete recommendations for businesses

## Dataset

The dataset chosen for this mini project is the **Instacart Dataset**. It is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, there are between 4 and 100 orders, with the sequence of products purchased in each order. The dataset also includes the time of day and day of week of each order, and the name, aisle, and department of each product. This information is stored across several files.

## Problem Statement


Extract association rules and find groups of frequently purchased items from a large-scale grocery orders dataset.

## Grading = 10 Points

#### Import required packages

In [None]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy
import os
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## **Stage 1**: Data Wrangling

We have five different files:

    - orders.csv
    - order_products__train.csv
    - products.csv
    - aisles.csv
    - departments.csv
	
These files contain the necessary data to solve the problem. Load all the files correctly after inspecting their headers and data records.
	
**Hint:** Use `read_csv` from pandas

In [None]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/CDS/Datasets/Instacart.zip
!unzip -qq Instacart.zip

### Load the data

Load all the given datasets

In [None]:
root = '/content/Instacart/'
orders = pd.read_csv(root + 'orders.csv')
order_products_train = pd.read_csv(root + 'order_products__train.csv')
products = pd.read_csv(root + 'products.csv')
aisles = pd.read_csv(root + 'aisles.csv')
departments = pd.read_csv(root + 'departments.csv')

Alternatively, load every file into a dictionary keyed by file name:

In [None]:
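datasets = {}

# Read from the same folder that is listed, so the paths cannot drift apart
root = '/content/Instacart/'
for i in os.listdir(root):
    if i.endswith('.csv'):
        print(i)
        datasets[i] = pd.read_csv(os.path.join(root, i))

# Sort by file name so the order of the keys is predictable
datasets = dict(sorted(datasets.items()))
datasets.keys()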

In [None]:
names  = list(datasets.keys())
names

### Data Integration (1 point)

As the required data is spread across different files, we need to integrate all five into a single dataframe. Use the identifier columns shared between the dataframes to map the records in different files correctly.

**Example:** `product_id` is available in both `products` dataframe and `order_products__train` dataframe, we can merge these two into a single dataframe based on `product_id`

**Hint:** [pd.merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
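
As a rough illustration of how a key-based merge works, here is a minimal sketch on two tiny made-up frames (the rows are invented, not taken from the Instacart files). The `how="left"` and `validate="many_to_one"` arguments are optional choices made for this sketch: they keep every order line and raise an error if `product_id` turns out to be duplicated on the products side.

```python
import pandas as pd

# Toy stand-ins for order_products__train.csv and products.csv (values are made up)
order_lines = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [10, 20, 10],
})
products_toy = pd.DataFrame({
    "product_id": [10, 20],
    "product_name": ["Banana", "Whole Milk"],
})

# Each order line picks up the name of its product through the shared key
merged = order_lines.merge(
    products_toy, on="product_id", how="left", validate="many_to_one"
)
print(merged)
```

The same pattern can be chained for `aisle_id`, `department_id`, and `order_id`.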

In [None]:
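# Start from the order lines (one row per product in an order)
df1 = datasets['order_products__train.csv']
df1.columns, df1.shape

In [None]:
# Attach product details (name, aisle_id, department_id)
df2 = df1.merge(datasets['products.csv'], on='product_id')
df2.columns, df2.shape

In [None]:
# Attach aisle names
df3 = df2.merge(datasets['aisles.csv'], on='aisle_id')
df3.columns, df3.shape

In [None]:
# Attach order details (user, day of week, hour of day)
df4 = df3.merge(datasets['orders.csv'], on='order_id')
df4.columns, df4.shape

In [None]:
# Attach department names
df5 = df4.merge(datasets['departments.csv'], on='department_id')
df5.columns, df5.shape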

In [None]:
df5.head()

In [None]:
final_df = df5

### Understanding relationships and new insights from the data (3 points)

1.  How many times was each product ordered?

    **Hint:** group orders by product
    

2.  Find the number of orders per department and visualize using an appropriate plot


3.  On which day of the week do customers tend to buy more groceries? Which are the peak hours of shopping?

  * Find the frequency of orders on week days using an appropriate plot
  * Find the frequency of orders during hours of the day using an appropriate plot
  

4. Find the ratio of Re-ordered and Not Re-ordered products and visualize it

5. Plot the heatmap of Re-order ratio of the Day of week vs Hour of day

### Group orders by products and get how many times each product was ordered

In [None]:
g = final_df.product_id.value_counts()
g = pd.DataFrame(g)
g.reset_index(inplace=True)
g.columns = ["product_id","count"]
g_products = g.merge(datasets['products.csv'],on="product_id")
g_products.head()

In [None]:
# DataFrame.plot creates its own figure, so pass figsize to it directly
g_products.head(20).plot(kind="bar", x="product_name", y="count", figsize=(12, 6))
plt.show()

(Banana is the top ordered product)

### Find the number of orders per department

Hint: Groupby
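
"Orders per department" can be read in two ways, and the sketch below contrasts them on a small made-up frame (the rows are invented for illustration). Counting rows gives the number of ordered items per department, while counting distinct `order_id` values gives the number of orders that include at least one item from the department.

```python
import pandas as pd

# Toy order lines (made up): one row per product in an order
toy = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2],
    "department": ["produce", "produce", "dairy eggs", "produce", "snacks"],
})

# Number of ordered items (rows) per department
items_per_dept = toy.groupby("department").size()

# Number of distinct orders that include each department
orders_per_dept = toy.groupby("department")["order_id"].nunique()

print(items_per_dept)
print(orders_per_dept)
```

Calling `value_counts()` on the department column corresponds to the first reading (rows per department).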

In [None]:
g = final_df.department_id.value_counts()
g = pd.DataFrame(g)
g.reset_index(inplace=True)
g.columns = ["department_id","count"]
g_dept = g.merge(datasets['departments.csv'],on="department_id")
g_dept.head(5)

In [None]:
g_dept.plot(kind="bar",x="department",y="count")
plt.show()

### Find the frequency of orders on week days

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="order_dow", data=final_df)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of order by week day", fontsize=15)
plt.show()

### Find the frequency of orders for hours of the day

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="order_hour_of_day", data=final_df)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of order by hour of day", fontsize=15)
plt.show()

### Find the ratio of Re-ordered and Not Re-ordered products and visualize

In [None]:
# Share of reordered (1) vs. not reordered (0) items across all order lines
final_df['reordered'].value_counts(normalize=True)

In [None]:
# sns.distplot is deprecated; histplot draws the same count histograms (no KDE by default)
sns.histplot(final_df[final_df['reordered']==1]['product_id'], label='Reordered')
sns.histplot(final_df[final_df['reordered']==0]['product_id'], label='Not reordered')

plt.legend(prop={'size': 12})
plt.show()

### Plot the heatmap of Re-order ratio of Day of week vs Hour of day

In [None]:
grouped_df = final_df.groupby(["order_dow", "order_hour_of_day"])["reordered"].aggregate("mean").reset_index()
# pandas 2.x requires keyword arguments for DataFrame.pivot
grouped_df = grouped_df.pivot(index='order_dow', columns='order_hour_of_day', values='reordered')

plt.figure(figsize=(12,6))
sns.heatmap(grouped_df)
plt.title("Reorder ratio of Day of week Vs Hour of day")
plt.show()
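
The group-then-pivot step behind the heatmap can be checked by hand on a tiny made-up frame (the values below are invented). Because `reordered` is a 0/1 flag, its mean within each (day, hour) group is the share of reordered items in that cell.

```python
import pandas as pd

# Toy order lines (made up)
toy = pd.DataFrame({
    "order_dow":         [0, 0, 0, 1, 1],
    "order_hour_of_day": [9, 9, 10, 9, 10],
    "reordered":         [1, 0, 1, 1, 0],
})

# Mean of a 0/1 column = reorder ratio of each (day, hour) group
ratio = (
    toy.groupby(["order_dow", "order_hour_of_day"])["reordered"]
    .mean()
    .reset_index()
)

# Rows become days of the week, columns become hours of the day
matrix = ratio.pivot(index="order_dow", columns="order_hour_of_day", values="reordered")
print(matrix)
```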

## **Stage 2:** Create a basket (4 points)

As the dataset contains a huge amount of data, let us take a subset of it to extract the association rules.

**Assumption:** Segment the data by considering the 100 most frequently ordered items. Please note that this is just an assumption; you can consider the 'n' most frequently ordered items as per your choice.

**Hint:**

- Drop the unwanted columns

- Find the frequencies of orders based on the products and  consider the 100 most frequent order items.

    **Hint:** Count the frequencies of orders for each product_id using `groupby()` and `count()` respectively

- Extract the records of 100 most frequent items (which are extracted in previous step) from combined dataframe.

- Create a pivot table with `order_id` as index, `product_name` as columns, and `reordered` as values.

    - set the `order_id` as index using set_index()
    - fill all the nan values with 0

- After performing the above step, there are a lot of zeros in the data. Make sure that any positive value is converted to 1 and any value less than or equal to 0 is set to 0.
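
The steps above can be sketched end to end on a small made-up frame (the rows are invented, and `n = 2` stands in for the 100 items assumed in the text). This sketch counts rows per (order, product) rather than aggregating the `reordered` value itself. That is a choice made here so that a first-time purchase with `reordered == 0` still counts as present in the basket.

```python
import pandas as pd

# Toy order lines (made up): one row per product in an order
lines = pd.DataFrame({
    "order_id":     [1, 1, 2, 3, 3, 3],
    "product_name": ["Banana", "Limes", "Banana", "Banana", "Limes", "Strawberries"],
    "reordered":    [1, 0, 1, 0, 1, 1],
})

# Keep only the n most frequently ordered products (n = 2 here)
top = lines["product_name"].value_counts().head(2).index
subset = lines[lines["product_name"].isin(top)]

# Count rows per (order, product); combinations that never occur become 0
counts = subset.groupby(["order_id", "product_name"]).size().unstack(fill_value=0)

# Any positive count means the product is in the basket
basket_toy = (counts > 0).astype(int)
print(basket_toy)
```

Each row of `basket_toy` is one order and each column one product, which is the one-hot layout that `apriori` from mlxtend expects.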


In [None]:
product_counts = final_df.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})
product_counts = product_counts.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)
product_counts.head(10)

In [None]:
freq_products = list(product_counts.product_id)
freq_products[1:10]

In [None]:
order_products = final_df[final_df.product_id.isin(freq_products)]
order_products.shape

In [None]:
# basket = order_products.groupby(['order_id', 'product_name'])['reordered'].count().unstack().reset_index().fillna(0).set_index('order_id')
# basket.head()

or

In [None]:
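# aggfunc='count' marks every purchased product, including first-time purchases (reordered == 0);
# the default aggfunc (mean) would leave those cells at 0 and drop them from the basket
basket = order_products.pivot_table(index='order_id', columns='product_name', values='reordered', aggfunc='count', fill_value=0)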

In [None]:
# Vectorized encoding: any positive value becomes 1, everything else becomes 0
basket = (basket > 0).astype(int)
basket.head()

## **Stage 3:** Apply Apriori algorithm (2 points)

- As the dataset contains a huge amount of data, let us take a subset of the basket to extract the association rules from it.

  **Assumption:** Segment the basket by considering 100000 records. Please note that this is just an assumption; you can consider 'n' records as per your choice.

  **Hint:** [apriori](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/)

- Find the association rules and make a dataframe
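
To make the rule metrics concrete, here is a minimal pure-Python sketch that computes support, confidence, and lift by hand for one rule. The transactions are made up and the `support` helper is a hypothetical name used only for this illustration; it follows the standard definitions and is not mlxtend's implementation.

```python
# Toy transactions (made up) to show what the rule metrics measure
transactions = [
    {"Banana", "Limes"},
    {"Banana"},
    {"Banana", "Limes", "Strawberries"},
    {"Strawberries"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Rule under study: {Limes} -> {Banana}
antecedent, consequent = {"Limes"}, {"Banana"}

supp_rule = support(antecedent | consequent)   # share of orders with both items
confidence = supp_rule / support(antecedent)   # share of Limes orders that also have Banana
lift = confidence / support(consequent)        # confidence relative to Banana's baseline share

print(supp_rule, confidence, lift)
```

A lift above 1 suggests that the two itemsets appear together more often than they would if they were independent, which is why `lift` with a threshold of 1 is a common filter for rules.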

In [None]:
shortbasket = basket[:100000]

In [None]:
frequent_items = apriori(shortbasket, min_support=0.01, use_colnames=True)
frequent_items.head()

In [None]:
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.sort_values('lift', ascending=False)