<img src = 'best-black-friday-deals.jpg' width = '500' height= '350' >

## Problem statement:
A retail company, “Wisconn American Dreams Pvt. Ltd.”, wants to understand customer purchase behaviour (specifically, purchase amount) across products of different categories during the Black Friday sale days. They have shared a purchase summary of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from last month. They want to gain insights into customer purchase patterns and other potential ways to increase sales.

# Goal:
Perform EDA on the Black Friday dataset and gain insights into the *purchasing power* of customers for the selected products during the Black Friday sale days.

# Research Questions
* How do customer demographics influence purchase behavior?
    * Investigate whether age, gender, marital status, city type, and stay duration in the current city correlate with the purchase amount.
* What is the impact of product categories on purchase amounts?
    * Analyze how different product categories contribute to the overall purchase amount and identify the categories with the highest sales.
* How does the combination of product categories affect purchase behavior?
    * Explore whether there are common combinations of product categories (Product_Category_1, Product_Category_2, Product_Category_3) that lead to higher purchase amounts.
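
One way the category question above could be approached is to compute each category's share of total purchase amount. The following is a minimal sketch on invented toy rows, not the notebook's actual data; `toy`, `category_total`, and `category_share` are illustrative names.

```python
import pandas as pd

# Toy rows using the dataset's column names; the values are invented.
toy = pd.DataFrame({
    'Product_Category_1': [1, 1, 5, 8, 8, 8],
    'Purchase': [15000.0, 5000.0, 10000.0, 4000.0, 3000.0, 3000.0],
})

# Total purchase amount per category.
category_total = toy.groupby('Product_Category_1')['Purchase'].sum()

# Each category's percentage share of the overall purchase amount, largest first.
category_share = (category_total / category_total.sum() * 100).sort_values(ascending=False)
print(category_share)
```

On the real data, the same pattern would run on the rows where `Purchase` is present.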
    
# Hypothesis:
- **Demographic Hypothesis**: <br> 
Certain demographic segments (e.g., age group, gender) are more likely to spend more on specific product categories. For example, younger customers might spend more on electronics, whereas older customers might focus on home appliances.
- **Product Category Hypothesis**:<br> 
Some product categories are universally popular, leading to higher purchases regardless of the demographic profile. These categories might have products that are essential or have a higher appeal.
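
A rough way to probe the demographic hypothesis is to tabulate the mean purchase amount for each age group and product category. The sketch below uses invented toy rows with the dataset's column names; `toy` and `mean_purchase` are illustrative names, and the numbers carry no meaning.

```python
import pandas as pd

# Toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    'Age': ['18-25', '18-25', '55+', '55+', '18-25', '55+'],
    'Product_Category_1': [1, 5, 1, 5, 1, 5],
    'Purchase': [15000.0, 7000.0, 11000.0, 9000.0, 13000.0, 5000.0],
})

# Mean purchase amount for every (age group, product category) pair.
mean_purchase = toy.pivot_table(
    index='Age',
    columns='Product_Category_1',
    values='Purchase',
    aggfunc='mean',
)
print(mean_purchase)
```

Large differences between rows within a column would be consistent with the demographic hypothesis, while columns that are uniformly high would point toward the product category hypothesis.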

# Approach:

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import random
%matplotlib inline

pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

In [2]:
# Read the CSV files
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Stack train and test rows; ignore_index=True gives the result a fresh 0..n-1 index
merged_df = pd.concat([train_df, test_df], axis=0, ignore_index=True)

In [3]:
merged_df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370.0
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200.0
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422.0
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057.0
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969.0


In [4]:
print(train_df.shape)
print(test_df.shape)
print(merged_df.shape)

(550068, 12)
(233599, 11)
(783667, 12)
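
The shapes show that `test_df` has one column fewer than `train_df`, so the concatenated frame has no `Purchase` value for the test rows. A minimal sketch of tagging each row with its origin before concatenating, so the two parts can be separated again later, could look like this. The `source` column and the toy frames are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df; the test frame has no Purchase column.
train_toy = pd.DataFrame({'User_ID': [1000001, 1000002], 'Purchase': [8370.0, 7969.0]})
test_toy = pd.DataFrame({'User_ID': [1000004]})

# Tag every row with its origin, then stack the frames with a fresh index.
combined = pd.concat(
    [train_toy.assign(source='train'), test_toy.assign(source='test')],
    ignore_index=True,
)

# Rows that came from the training file are the only ones with a Purchase value.
labelled = combined[combined['source'] == 'train']
print(combined)
```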


In [5]:
df = merged_df.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783667 entries, 0 to 783666
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     783667 non-null  int64  
 1   Product_ID                  783667 non-null  object 
 2   Gender                      783667 non-null  object 
 3   Age                         783667 non-null  object 
 4   Occupation                  783667 non-null  int64  
 5   City_Category               783667 non-null  object 
 6   Stay_In_Current_City_Years  783667 non-null  object 
 7   Marital_Status              783667 non-null  int64  
 8   Product_Category_1          783667 non-null  int64  
 9   Product_Category_2          537685 non-null  float64
 10  Product_Category_3          237858 non-null  float64
 11  Purchase                    550068 non-null  float64
dtypes: float64(3), int64(4), object(5)
memory usage: 71.7+ MB


In [6]:
df.describe()

Unnamed: 0,User_ID,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
count,783667.0,783667.0,783667.0,783667.0,537685.0,237858.0,550068.0
mean,1003029.0,8.0793,0.409777,5.366196,9.844506,12.668605,9263.968713
std,1727.267,6.522206,0.491793,3.87816,5.089093,4.12551,5023.065394
min,1000001.0,0.0,0.0,1.0,2.0,3.0,12.0
25%,1001519.0,2.0,0.0,1.0,5.0,9.0,5823.0
50%,1003075.0,7.0,0.0,5.0,9.0,14.0,8047.0
75%,1004478.0,14.0,1.0,8.0,15.0,16.0,12054.0
max,1006040.0,20.0,1.0,20.0,18.0,18.0,23961.0


In [7]:
df.describe(include = 'object')

Unnamed: 0,Product_ID,Gender,Age,City_Category,Stay_In_Current_City_Years
count,783667,783667,783667,783667,783667
unique,3677,2,7,3,5
top,P00265242,M,26-35,B,1
freq,2709,590031,313015,329739,276425


In [8]:
# Drop User_ID as it is redundant
df.drop(columns=['User_ID'], inplace=True)

In [23]:
# Change datatype: strip the '+' from '4+' so Stay_In_Current_City_Years can be cast to int
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].str.replace('+', '', regex = False)
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].astype(int)
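
After this conversion the value 4 stands for "4 or more years". If that distinction matters later, one option is to record it before stripping the `+`. The sketch below works on a toy series containing the five category labels; `stay_is_capped` is a hypothetical name introduced here.

```python
import pandas as pd

# Toy values matching the five labels of Stay_In_Current_City_Years.
stay = pd.Series(['2', '4+', '3', '1', '0'])

# Remember which rows were open-ended ('4+') before the '+' is removed.
stay_is_capped = stay.str.endswith('+')

# Literal (non-regex) replacement of '+', then cast to integer.
stay_years = stay.str.replace('+', '', regex=False).astype(int)
print(stay_years.tolist(), stay_is_capped.tolist())
```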

### Encoding categorical features:

- Gender and Age

In [9]:
# Print the unique values of every object (string) column
for col in df.select_dtypes(include='object').columns:
    print(col, ':')
    print(df[col].unique())
    print('-'*50)
    print()

Product_ID :
['P00069042' 'P00248942' 'P00087842' ... 'P00030342' 'P00074942'
 'P00253842']
--------------------------------------------------

Gender :
['F' 'M']
--------------------------------------------------

Age :
['0-17' '55+' '26-35' '46-50' '51-55' '36-45' '18-25']
--------------------------------------------------

City_Category :
['A' 'C' 'B']
--------------------------------------------------

Stay_In_Current_City_Years :
['2' '4+' '3' '1' '0']
--------------------------------------------------



In [10]:
# Binary-encode Gender (M -> 0, F -> 1) and ordinal-encode the Age brackets (youngest = 1)
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1})
df['Age_Encoded'] = df['Age'].map({
    '0-17': 1,
    '18-25': 2,
    '26-35': 3,
    '36-45': 4,
    '46-50': 5,
    '51-55': 6,
    '55+': 7
})
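
An alternative to a hand-written mapping is an ordered categorical, which keeps the bracket order explicit in one list. This is a sketch on a toy series, assuming the same seven age labels; `age_order` and `age_encoded` are illustrative names.

```python
import pandas as pd

# Age brackets listed from youngest to oldest.
age_order = ['0-17', '18-25', '26-35', '36-45', '46-50', '51-55', '55+']

# Toy sample of age labels.
ages = pd.Series(['0-17', '55+', '26-35', '46-50'])

# Ordered categorical: codes follow the position in age_order (0-based).
age_cat = pd.Categorical(ages, categories=age_order, ordered=True)

# Shift by one so the youngest bracket is 1, matching the mapping above.
age_encoded = pd.Series(age_cat.codes + 1)
print(age_encoded.tolist())
```

A label that is not in `age_order` gets code -1 (0 after the shift) here, whereas `.map` with a dictionary would produce `NaN`, so either approach makes unexpected labels visible.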

In [11]:
# DataFrame of null counts per column and their percentage of all rows
nulls = pd.DataFrame(df.isnull().sum()/df.shape[0]*100, columns = ['percentage_nulls'])
nulls['total_nulls'] = df.isnull().sum()
nulls


Unnamed: 0,percentage_nulls,total_nulls
Product_ID,0.0,0
Gender,0.0,0
Age,0.0,0
Occupation,0.0,0
City_Category,0.0,0
Stay_In_Current_City_Years,0.0,0
Marital_Status,0.0,0
Product_Category_1,0.0,0
Product_Category_2,31.388587,245982
Product_Category_3,69.648078,545809
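
The same summary can be wrapped in a small function so it can be rerun after each cleaning step. This is a sketch with a hypothetical helper named `null_summary`, shown on invented toy data; it relies on the fact that the mean of a boolean null mask is the fraction of nulls.

```python
import numpy as np
import pandas as pd

def null_summary(frame):
    """Hypothetical helper: percentage and count of nulls per column."""
    is_null = frame.isnull()
    return pd.DataFrame({
        'percentage_nulls': is_null.mean() * 100,  # mean of a boolean mask = null fraction
        'total_nulls': is_null.sum(),
    })

# Toy frame with two of the dataset's column names; the values are invented.
toy = pd.DataFrame({
    'Product_Category_1': [3, 1, 12, 12],
    'Product_Category_2': [np.nan, 6.0, np.nan, 14.0],
})
summary = null_summary(toy)
print(summary)
```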


In [12]:
df.head()

Unnamed: 0,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,Age_Encoded
0,P00069042,1,0-17,10,A,2,0,3,,,8370.0,1
1,P00248942,1,0-17,10,A,2,0,1,6.0,14.0,15200.0,1
2,P00087842,1,0-17,10,A,2,0,12,,,1422.0,1
3,P00085442,1,0-17,10,A,2,0,12,14.0,,1057.0,1
4,P00285442,0,55+,16,C,4+,0,8,,,7969.0,7


In [13]:
# Checkpoint: keep a copy of df as it stands after encoding the categorical features
df_copy_till_category_correction = df.copy()

In [14]:
df.isnull().sum()

Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            245982
Product_Category_3            545809
Purchase                      233599
Age_Encoded                        0
dtype: int64

In [18]:
# Series.mode() returns only the most frequent value(s), so each list usually
# holds a single value and the fill below amounts to mode imputation.
top_modes = df['Product_Category_3'].mode().nlargest(3).to_list()
top_modes_2 = df['Product_Category_2'].mode().nlargest(3).to_list()

# Fill missing categories by sampling from those values (vectorized, no row-wise apply)
for col, modes in [('Product_Category_2', top_modes_2), ('Product_Category_3', top_modes)]:
    missing = df[col].isnull()
    df.loc[missing, col] = np.random.choice(modes, size=missing.sum())
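
If the intent is to sample from the three most frequent categories, note that `mode()` yields only the value or values tied for the highest count, whereas `value_counts()` ranks every value by frequency. The sketch below contrasts the two on a toy series; the values and the names `modes` and `top3` are illustrative.

```python
import pandas as pd

# Toy category column with one missing entry; the values are invented.
s = pd.Series([8.0, 8.0, 8.0, 14.0, 14.0, 2.0, None])

# mode() keeps only the most frequent value(s); nlargest(3) then sorts those by value.
modes = s.mode().nlargest(3).to_list()

# value_counts() ranks all values by frequency (nulls excluded by default),
# so the index of the three largest counts holds the three most frequent categories.
top3 = s.value_counts().nlargest(3).index.to_list()

print(modes)
print(top3)
```

Filling with a single mode keeps the column's most common value dominant, while sampling from several frequent values spreads the imputed rows across categories. Either choice changes the category distribution and is worth keeping in mind when reading later plots.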

In [19]:
df.isnull().sum()

Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2                 0
Product_Category_3                 0
Purchase                      233599
Age_Encoded                        0
dtype: int64

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783667 entries, 0 to 783666
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Product_ID                  783667 non-null  object 
 1   Gender                      783667 non-null  int64  
 2   Age                         783667 non-null  object 
 3   Occupation                  783667 non-null  int64  
 4   City_Category               783667 non-null  object 
 5   Stay_In_Current_City_Years  783667 non-null  object 
 6   Marital_Status              783667 non-null  int64  
 7   Product_Category_1          783667 non-null  int64  
 8   Product_Category_2          783667 non-null  float64
 9   Product_Category_3          783667 non-null  float64
 10  Purchase                    550068 non-null  float64
 11  Age_Encoded                 783667 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 71.7+ MB


In [21]:
df.head()

Unnamed: 0,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,Age_Encoded
0,P00069042,1,0-17,10,A,2,0,3,8.0,16.0,8370.0,1
1,P00248942,1,0-17,10,A,2,0,1,6.0,14.0,15200.0,1
2,P00087842,1,0-17,10,A,2,0,12,8.0,16.0,1422.0,1
3,P00085442,1,0-17,10,A,2,0,12,14.0,16.0,1057.0,1
4,P00285442,0,55+,16,C,4+,0,8,8.0,16.0,7969.0,7
