# Introduction
This kernel has been created by the [Information Systems Lab](http://islab.uom.gr) at the University of Macedonia, Greece for the needs of the elective course Special Topics of Information Systems I at the [Business Administration](http://www.uom.gr/index.php?tmima=2&categorymenu=2) department of the University of Macedonia, Greece.

In this Instacart notebook, we explore how each customer behaves towards the products they have ordered in the past. <br>
More specifically, we create variables that describe each combination of user (user_id) and product (product_id) found in prior orders.

# Business Insights
* How many times did a customer buy a product in their last 5 orders?
 - Find which products have been bought by the most users in all of their last five orders
* How frequently did a customer buy a product after their first purchase of it?
 - How many times did a customer buy a product?

# Python Skills
* Transform columns with .groupby( )
* Filter orders
* Calculate range of orders for a product

# Packages 
* pandas: .transform( ), .max( ) , .min( ), .hist(cumulative=True), .set_index( )

# Import data into Python
We load the required packages:

In [None]:
import pandas as pd               # for data manipulation
import matplotlib.pyplot as plt   # for plotting 
import seaborn as sns             # an extension of matplotlib for statistical graphics

Moreover, we load the .csv files in DataFrames:

In [None]:
orders = pd.read_csv('../input/orders.csv' )
order_products_prior = pd.read_csv('../input/order_products__prior.csv')
products = pd.read_csv('../input/products.csv')

The data used in this notebook combine the orders and the products purchased by each customer, so we create the **prd** DataFrame:

In [None]:
prd = pd.merge(orders, order_products_prior, on='order_id', how='inner')
prd.head(100)

# 1 How many times did a customer buy a product in their last 5 orders? (times_last5 & times_last5_ratio)

In this business insight, we keep the last five orders of each customer and count how many times each product was bought in them. To achieve this we need to:
* Create a new variable ('order_number_back') which keeps the order_number for each order in reverse order
* Keep only the last five orders for each customer
* Perform a .groupby( ) on users and products to get how many times each customer bought a product.
* Create the following ratio:
![](https://latex.codecogs.com/gif.latex?times%5C%20last%20%5C5%5C%20%28of%5C%20a%5C%20purchased%5C%20product%5C%20from%5C%20a%5C%20user%29%3D%5Cfrac%7BTimes%5C%20a%5C%20user%5C%20bought%5C%20a%5C%20product%5C%20on%5C%20its%5C%20last%5C%205%5C%20orders%7D%7BTotal%5C%20orders%5C%20%3D5%7D)

## 1.1 Create a new variable ('order_number_back') which keeps the order_number for each order in reverse order
In this step we show how we create a reverse order_number for each customer. <br>
Have a look at the orders of customer 1 (user_id == 1)

In [None]:
prd[prd.user_id==1].head(45)

We want to create a new column ('order_number_back') which marks the last order as first, the second from the end as second, and so on. To achieve this, we get the highest order_number (max) for user_id==1 and subtract the order_number of each order from it. Thus, for the last order (order_number == 10), that will be:
<br>
<br>

![order_number_back](https://latex.codecogs.com/png.latex?%5Cdpi%7B200%7D%20%5Ctiny%20%5Cfontsize%7B%20%7D%7Bbaselineskip%7D%20order%5C_number%5C_back%28x%29%3D%20order%5C_number.max%28%29%20-order%5C_number%28x%29%3D10%20-%2010%20%3D%200)

As we want the last order to be marked as first, rather than zeroth, we add 1 and the formula becomes:

![](https://latex.codecogs.com/png.latex?%5Cdpi%7B200%7D%20%5Ctiny%20%5Cfontsize%7B%20%7D%7Bbaselineskip%7D%20order%5C_number%5C_back%28x%29%3D%20order%5C_number.max%28%29%20-order%5C_number%28x%29%3D10%20-%2010%20&plus;1%3D%201)

> Note that order_number.max( ) is a single value, whereas order_number is a 1-D array (column/Series)

By applying the above formula to the orders of user_id == 1 we get the following results:
![](https://i.imgur.com/toda8ay.png)

Now we show how to perform the procedure for all users. We .groupby( ) prd by user_id and select the column order_number. With .transform( ) and the max aggregation we get the highest order_number of each group, repeated on every row of that group, and with minus (-) prd.order_number we subtract the order_number of each row. Finally, we add 1 for the reason mentioned above.

> .transform( ) performs group-specific computations and returns a like-indexed object.

In [None]:
# pass the aggregation by name ('max'); newer pandas versions warn when given the builtin max
prd['order_number_back'] = prd.groupby('user_id')['order_number'].transform('max') - prd.order_number + 1
prd.head(15)
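
The following is a minimal sketch of why .transform( ) is used here rather than a plain aggregation. It relies on a small, made-up DataFrame (`toy` and its values are assumptions for illustration, not part of the Instacart data):

```python
import pandas as pd

# Hypothetical toy data: user 1 placed 3 orders, user 2 placed 2 orders
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# .agg returns one value per group (2 rows here) ...
per_user_max = toy.groupby('user_id')['order_number'].agg('max')

# ... while .transform repeats the group maximum on every original row (5 rows)
row_max = toy.groupby('user_id')['order_number'].transform('max')

# Because row_max is aligned with toy, we can subtract row by row
toy['order_number_back'] = row_max - toy['order_number'] + 1
print(toy)
```

For each user, the most recent order ends up with order_number_back equal to 1.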

Check that the formula has been applied to all users. Here we check the new column for a random user (user_id== 30):

In [None]:
prd[prd.user_id==30].head(10)

## 1.2 Keep only the last five orders for each customer
With the use of order_number_back we can now select to keep only the last five orders of each customer:

In [None]:
prd5 = prd[prd.order_number_back <= 5]
prd5.head(15)

## 1.3 Perform a .groupby( ) on users and products to get how many times each customer bought every product.
Having kept the last 5 orders for each user, we perform a .groupby( ) on user_id & product_id. With .count( ) we get how many times each customer bought a product.

In [None]:
last_five = prd5.groupby(['user_id','product_id'])[['order_id']].count()
last_five.columns = ['times_last5']
last_five.head(10)

So for user_id==1, product 196 has been ordered in all of their last five orders, whereas product 35951 has been ordered only once.

## 1.4 Create the final ratio
To create the final ratio, we simply divide the new column by 5:

In [None]:
last_five['times_last5_ratio'] = last_five.times_last5 / 5
last_five.head(10)
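
To make the counting and the ratio concrete, here is a small sketch on made-up data. The DataFrame `toy5` is a hypothetical stand-in for prd5, with one user and five orders:

```python
import pandas as pd

# Hypothetical stand-in for prd5: product 196 appears in all five orders,
# product 500 appears in two of them
toy5 = pd.DataFrame({
    'user_id':    [1, 1, 1, 1, 1, 1, 1],
    'order_id':   [10, 11, 12, 13, 14, 10, 12],
    'product_id': [196, 196, 196, 196, 196, 500, 500],
})

# Count the rows per (user, product) pair and name the column in one step
toy_last_five = (
    toy5.groupby(['user_id', 'product_id'])[['order_id']]
        .count()
        .rename(columns={'order_id': 'times_last5'})
)

# Divide by the fixed window of 5 orders
toy_last_five['times_last5_ratio'] = toy_last_five['times_last5'] / 5
print(toy_last_five)
```

Here product 196 gets a ratio of 1.0 and product 500 a ratio of 0.4. Note that the denominator is always 5, so customers with fewer than five orders can never reach a ratio of 1.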

## 1.5 Your Turn 📚📝
Have a look at the orders of the customer with user_id == 5:

In [None]:
prd[prd.user_id==5]

As you can see, the customer with user_id==5 has placed 4 orders in total. How can we filter out the customers with fewer than 5 orders?

In [None]:
#Solution 1.1
#prd5_with_five_orders = prd5.groupby('user_id').filter(lambda x: x.order_number_back.max() == 5)
#Solution 1.2
#prd5_with_five_orders = prd5.groupby('user_id').filter(lambda x: x.order_number_back.max() > 4)

In [None]:
#Solution 2.1
#prd5_with_five_orders = prd5.groupby('user_id').filter(lambda x: x.order_number.max() >= 5)
#Solution 2.2
#prd5_with_five_orders = prd5.groupby('user_id').filter(lambda x: x.order_number.max() > 4)

In [None]:
# Solution 3
prd5_with_five_orders = prd5.groupby('user_id').filter(lambda x: x.order_id.nunique() == 5)

Perform a sanity check on your results: the customer with user_id == 5 should no longer appear.

In [None]:
#sanity check
prd5_with_five_orders[prd5_with_five_orders.user_id==5]
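
As a possible alternative to .filter( ) with a lambda, which calls a Python function once per user and may be slow on many groups, the same selection can be sketched with .transform( ) and a boolean mask. The DataFrame `toy5` below is made-up data standing in for prd5:

```python
import pandas as pd

# Hypothetical stand-in for prd5: user 1 has five orders, user 5 has four
toy5 = pd.DataFrame({
    'user_id':  [1] * 5 + [5] * 4,
    'order_id': [10, 11, 12, 13, 14, 20, 21, 22, 23],
})

# Number of distinct orders per user, repeated on every row of that user
orders_per_user = toy5.groupby('user_id')['order_id'].transform('nunique')

# Keep only the rows of users with exactly five orders in the window
toy_with_five = toy5[orders_per_user == 5]
print(toy_with_five)
```

On this toy data only the rows of user 1 remain, which is the same result that the .filter( ) approach of Solution 3 gives.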

## 1.6 Find which products have been bought by the most users on all of their last five orders
To get this insight we keep only the rows that have ratio == 1. On these rows, we perform a .groupby( ) using the product_id and count the number of observations.

In [None]:
last_five_top = last_five[last_five.times_last5_ratio == 1].groupby('product_id')[['times_last5_ratio']].count()
last_five_top.columns = ['total_users']
last_five_top.head()

Now we sort them in descending order by the total_users. We keep the 20 top products, and we reset the index of the DataFrame:

In [None]:
last_five_top = last_five_top.sort_values(by='total_users', ascending=False)
last_five_top = last_five_top.iloc[0:20]
last_five_top = last_five_top.reset_index()
last_five_top
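
The sort-then-slice pattern above can also be written with .nlargest( ). The sketch below uses a made-up table (`toy_top` and its values are assumptions) and keeps the top 2 instead of the top 20:

```python
import pandas as pd

# Hypothetical counts of users per product
toy_top = pd.DataFrame(
    {'total_users': [5, 12, 7, 1]},
    index=pd.Index([10, 20, 30, 40], name='product_id'),
)

# Keep the 2 products with the most users, then turn the index into a column
top2 = toy_top.nlargest(2, 'total_users').reset_index()
print(top2)
```

Both approaches return the rows in descending order of total_users; which one to use is mostly a matter of taste.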

We get the name of the products from the products DataFrame:

In [None]:
# join on product_id explicitly (the only column the two DataFrames share)
last_five_top_names = pd.merge(last_five_top, products, on='product_id', how='left')
last_five_top_names

Finally, we visualize the results:

In [None]:
plt.figure(figsize=(12,8))
# recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=last_five_top_names.total_users, y=last_five_top_names.product_name)
# add label to x-axis
plt.xlabel('Number of users', size=15)
# keep y-axis free of label
plt.ylabel('  ')
#put a title
plt.title('Top 20 products ordered by the most users in all of their last 5 orders', size=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()

# 2 How frequently did a customer buy a product after their first purchase of it?
In this business insight we want to calculate the following ratio for each customer and every product they have purchased.

![Order Ratio](https://latex.codecogs.com/gif.latex?%5Cdpi%7B120%7D%20Order%5C%20Ratio%5C%20%28of%5C%20a%5C%20purchased%5C%20product%5C%20from%5C%20a%5C%20user%29%20%3D%20%5Cfrac%7BTimes%5C%20a%5C%20user%5C%20bought%5C%20a%5C%20product%7D%7BNumber%5C%20of%5C%20orders%5C%20placed%5C%20since%5C%20first%5C%20purchase%7D)

In this way we create a metric that describes how many times a user bought a product out of how many times they had the chance to buy it (starting from their first purchase of it).

To clarify this, we examine user_id 1 and product_id 13032:
- User 1 has made 10 orders in total
- They bought product_id 13032 for the first time in their 2nd order and have bought it 3 times in total.

Then:
The user had 9 opportunities to buy the product (from their 2nd order to their last order).
So they bought it in 3 out of 9 orders, which equals 3/9 = 0.333.

A higher ratio means that the customer has bought the product more frequently since first purchasing it.
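
The arithmetic of this example can be checked in a few lines. The variable names below are made up for illustration and are not used elsewhere in the notebook:

```python
# Values taken from the example of user_id 1 and product_id 13032
example_total_orders = 10   # orders placed by the user
example_first_order = 2     # order number of the first purchase
example_times_bought = 3    # times the product was bought

# Orders in which the user could have bought the product (2nd to 10th)
example_order_range = example_total_orders - example_first_order + 1

example_ratio = example_times_bought / example_order_range
print(example_order_range, round(example_ratio, 3))  # 9 0.333
```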

Before we show how we can create the above ratio we declare the following variables:
* How many times a customer bought a product? ('Times_Bought_N')
* For each product get the total orders placed since its first order ('Order_Range_D')

So our desired ratio is defined as:

![Order_Ratio_user_id_X_product_id](https://latex.codecogs.com/gif.latex?Order%5C_Ratio%5C%28user%5C_id%5C%20%2C%5C%20product%5C_id%29%20%3D%20%5Cfrac%7BTimes%5C_Bought%5C_N%7D%7BOrder%5C_Range%5C_D%7D)

Where Order_Range_D is created through two supportive variables:
* The total number of orders for each customer ('total_orders')
* The order number where the customer bought a product for first time ('first_order_number')

Where
![Order_Range_D](https://latex.codecogs.com/gif.latex?%5Cdpi%7B120%7D%20%5C%20%5C%20%5C%20%5C%20%5C%20Order%5C_Range%5C_D%28user%5C_id%2C%20product%5C_id%29%20%3D%20%5Cnewline%20%3D%5C%20total%5C_orders%28user%5C_id%29%20-%20first%5C_order%5C_number%28user%5C_id%2C%20product%5C_id%29%20&plus;%201)

In the next blocks we show how we create:
1. The numerator 'Times_Bought_N'
2. The supportive variables 'total_orders' & 'first_order_number' 
3. The denominator 'Order_Range_D' with the use of the supportive variables
4. Our final ratio 'Order_Ratio_user_id_X_product_id'

## 2.1 Calculating the numerator
### 2.1.1 How many times a customer bought a product? ('Times_Bought_N')
To answer this question we simply .groupby( ) user_id & product_id and count the order_id values of each group.

In [None]:
times = prd.groupby(['user_id', 'product_id'])[['order_id']].count()
times.columns = ['Times_Bought_N']
times.head()

## 2.2 Calculating the denominator
To calculate the denominator, we first have to calculate the total orders of each user and the first order number for each user and every product purchased.
### 2.2.1 The total number of orders for each customer ('total_orders')
Here we .groupby( ) only by the user_id, we keep the column order_number and we get its highest value with the aggregation function .max()

In [None]:
total_orders = prd.groupby('user_id')[['order_number']].max()
total_orders.columns = ['total_orders']
total_orders.head()

### 2.2.2 The order number where the customer bought a product for first time ('first_order_number')
For first_order_number we .groupby( ) by both user_id & product_id. As we want the order in which a product was purchased for the first time, we select the order_number column and retrieve the earliest order with the .min( ) aggregation function.

In [None]:
first_order_number = prd.groupby(['user_id', 'product_id'])[['order_number']].min()
first_order_number.columns = ['first_order_number']
first_order_number.head()

As our goal is to create the Order_Range_D, we need to merge the total_orders DataFrame with the first_order_number DataFrame. For this reason we turn the MultiIndex of first_order_number into columns:

In [None]:
first_order_number_reset = first_order_number.reset_index()
first_order_number_reset.head()

Now we can successfully merge the DataFrames. As total_orders has one row per user, whereas first_order_number has one row per unique combination of user & product, we perform a right join:

In [None]:
span = pd.merge(total_orders, first_order_number_reset, on='user_id', how='right')
span.head(20)

### 2.2.3 For each product get the total orders placed since its first order ('Order_Range_D')
The denominator can now be created with simple operations between the columns of the span DataFrame:

In [None]:
span['Order_Range_D'] = span.total_orders - span.first_order_number + 1
span.head(30)

### 2.2.4 Create the final ratio Order_Ratio_user_id_X_product_id
####  2.2.4.1 Merge the DataFrames of numerator & denominator
In this stage we merge the times DataFrame, which contains the numerator, with span, which contains the denominator of our desired ratio. **As both variables are derived from the same combinations of users & products, any type of join will keep all the combinations.**

In [None]:
order_ratio = pd.merge(times, span, on=['user_id', 'product_id'], how='left')
order_ratio.head()
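
One way to check the claim that any join type keeps all combinations is to run an outer merge with the `indicator` and `validate` options of pd.merge. The sketch below uses two small made-up DataFrames (`left` and `right` are hypothetical stand-ins for times and span):

```python
import pandas as pd

# Hypothetical stand-ins: both tables hold the same (user, product) pairs
left = pd.DataFrame({
    'user_id': [1, 1, 2], 'product_id': [10, 20, 10],
    'Times_Bought_N': [3, 1, 2],
})
right = pd.DataFrame({
    'user_id': [1, 1, 2], 'product_id': [10, 20, 10],
    'Order_Range_D': [9, 1, 4],
})

# validate raises an error if a pair is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from
check = pd.merge(left, right, on=['user_id', 'product_id'],
                 how='outer', validate='one_to_one', indicator=True)

# True only if every pair exists in both tables
all_matched = (check['_merge'] == 'both').all()
print(all_matched)
```

If every row is marked 'both', then inner, left, right and outer joins all return the same set of pairs.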

####  2.2.4.2 Perform the final division
Now we divide the Times_Bought_N by the Order_Range_D for each user and product.

In [None]:
order_ratio['Order_Ratio_user_id_X_product_id'] = order_ratio.Times_Bought_N / order_ratio.Order_Range_D
order_ratio.head()
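
The whole of section 2 can be sketched end to end on made-up data that mirrors the 3/9 example. The names `toy`, `toy_ratio` and the products 'A' and 'B' are assumptions for illustration, and the sketch uses .map( ) in place of the merge steps shown above:

```python
import pandas as pd

# Hypothetical data: one user with 10 orders; product 'A' is in every order,
# product 'B' is first bought in order 2 and bought 3 times in total
rows = [(1, n, 'A') for n in range(1, 11)] + [(1, n, 'B') for n in (2, 5, 9)]
toy = pd.DataFrame(rows, columns=['user_id', 'order_number', 'product_id'])

# Numerator and first order number per (user, product) pair
g = toy.groupby(['user_id', 'product_id'])['order_number']
toy_ratio = pd.DataFrame({
    'Times_Bought_N': g.count(),
    'first_order_number': g.min(),
}).reset_index()

# Total orders per user, looked up by user_id
toy_ratio['total_orders'] = toy_ratio['user_id'].map(
    toy.groupby('user_id')['order_number'].max()
)

# Denominator and final ratio
toy_ratio['Order_Range_D'] = (
    toy_ratio['total_orders'] - toy_ratio['first_order_number'] + 1
)
toy_ratio['Order_Ratio'] = toy_ratio['Times_Bought_N'] / toy_ratio['Order_Range_D']
print(toy_ratio)
```

Product 'A' gets a ratio of 10/10 = 1.0 and product 'B' a ratio of 3/9, matching the worked example.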

## 2.3 Visualizing the order_ratio for a product
Here we show the distribution of order_ratio for product 24852 (bananas), a product with 73956 unique customers.

In [None]:
plt.figure(figsize=(15,5))
order_ratio[order_ratio.product_id == 24852].Order_Ratio_user_id_X_product_id.hist(bins=50)
plt.xlabel('order_ratio', size=10)
plt.ylabel('Number of customers')
plt.title('The distribution of order_ratio for bananas', size=10)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()

And now we create a cumulative histogram, i.e. an unnormalized Cumulative Distribution Function (CDF), for the same ratio:

In [None]:
plt.figure(figsize=(15,5))
order_ratio[order_ratio.product_id == 24852].Order_Ratio_user_id_X_product_id.hist(cumulative=True, bins=50)
plt.xlabel('order_ratio', size=10)
plt.ylabel('Number of customers')
plt.title('The CDF of order_ratio for bananas', size=10)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()

From this plot we can conclude that about 40,000 customers bought bananas in fewer than half of their orders (counted from the first time they ordered bananas), whereas around 75,000 - 40,000 = 35,000 customers bought bananas in more than half of their orders. Note that we refer only to the customers who bought bananas at least once, not to all 206,209 customers.
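
Figures like these can also be computed directly instead of being read off the plot. The sketch below uses a made-up Series of ratios (`ratios` is an assumption, not the banana data):

```python
import pandas as pd

# Hypothetical order ratios of six customers for one product
ratios = pd.Series([0.1, 0.25, 0.4, 0.5, 0.8, 1.0])

# Number of customers who bought the product in fewer than half of their orders
below_half = (ratios < 0.5).sum()

# The same quantity as a share of all customers of that product
share_below_half = (ratios < 0.5).mean()
print(below_half, share_below_half)  # 3 0.5
```

Applied to the banana ratios, the same two lines would give the exact counts behind the approximate numbers quoted above.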

## 2.4 Setting index
As each combination of user_id & product_id uniquely identifies a row, we set them as the index of our order_ratio DataFrame:

In [None]:
order_ratio = order_ratio.set_index(['user_id','product_id'])
order_ratio.head()
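
With a (user_id, product_id) index, single pairs and whole products can be looked up directly. This is a minimal sketch on made-up data (`toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical ratios indexed by (user_id, product_id)
toy = pd.DataFrame({
    'user_id':     [1, 1, 2],
    'product_id':  [196, 13032, 196],
    'Order_Ratio': [1.0, 1 / 3, 0.5],
}).set_index(['user_id', 'product_id'])

# Ratio of one specific user and product
one_pair = toy.loc[(1, 13032), 'Order_Ratio']

# All users of one product; the product_id level is dropped from the result
one_product = toy.xs(196, level='product_id')
print(one_pair)
print(one_product)
```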