<center>
  <h1 align="center"> Retail Sales Prediction </h1>
</center>

- This dataset comprises purchase transactions captured at a retail store.
- The dataset has 550,068 rows and 12 columns.

Problem Statement: Build a model to predict the purchase amount of a customer for various products.

## Attributes:
| Column ID |         Column Name        | Data type |           Description           | Masked |
|:---------:|:--------------------------:|:---------:|:-------------------------------:|--------|
|     0     |           User_ID          |   int64   |      Unique Id of customer      | False  |
|     1     |         Product_ID         |   object  |       Unique Id of product      | False  |
|     2     |           Gender           |   object  |         Sex of customer         | False  |
|     3     |             Age            |   object  |         Age of customer         | False  |
|     4     |         Occupation         |   int64   |   Occupation code of customer   | True   |
|     5     |        City_Category       |   object  |         City of customer        | True   |
|     6     | Stay_In_Current_City_Years |   object  | Number of years of stay in city | False  |
|     7     |       Marital_Status       |   int64   |    Marital status of customer   | False  |
|     8     |     Product_Category_1     |   int64   |       Category of product       | True   |
|     9     |     Product_Category_2     |  float64  |       Category of product       | True   |
|     10    |     Product_Category_3     |  float64  |       Category of product       | True   |
|     11    |          Purchase          |   int64   |         Purchase amount         | False  |

In [None]:
import pandas as pd
import numpy as np


import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import seaborn as sns
sns.set(style="ticks", color_codes=True)
import plotly.express as px

import matplotlib.pyplot as plt
%matplotlib inline


import warnings 
warnings.filterwarnings("ignore")

### Loading data

In [None]:
df = pd.read_csv('/Purchaseprediction/train.csv')
df.head()

## Exploratory Data Analysis

In [None]:
def check_df(df, head=5):
    print(" SHAPE ".center(70,'#'))
    print('Rows: {}'.format(df.shape[0]))
    print('Columns: {}'.format(df.shape[1]))
    print(" TYPES ".center(70,'#'))
    print(df.dtypes)
    print(" HEAD ".center(70,'#'))
    print(df.head(head))
    print(" TAIL ".center(70,'#'))
    print(df.tail(head))
    print(" INFO ".center(70,'#'))
    df.info()  # info() prints its report directly and returns None, so no print() wrapper
    print(" UNIQUE VALUES ".center(70,'#'))
    print(df.apply(lambda x: len(x.unique())))
    print(" MISSING VALUES ".center(70,'#'))
    print(df.isnull().sum())
    missing_percentage = df.isnull().sum() / df.shape[0] * 100
    print(" Missing value percentage ".center(70,'*'))
    print('\n',missing_percentage)
   


In [None]:
check_df(df)

- The dataset has 550,068 purchase entries with 12 columns
- There are 5,891 users and 3,631 unique products in the dataset
- About 31% of Product_Category_2 values and 69% of Product_Category_3 values are missing
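The missing-value shares can be summarised per column with a small helper. This is a minimal sketch on toy data: `missing_summary` is a hypothetical helper name and the values in `toy` are made up for illustration, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    counts = frame.isnull().sum()
    pct = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "missing_pct": pct})
    # Keep only columns that actually have gaps, worst first
    return summary[summary["missing"] > 0].sort_values("missing_pct", ascending=False)

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 1],
    "Product_Category_2": [2.0, np.nan, 14.0, 6.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 15.0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(df)`.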

In [None]:
## Gender

In [None]:
df.head(5)

In [None]:
fig = px.pie(df,names= df['Gender'],height=400,\
            title='Gender distribution ',\
             labels={'Gender':'Gender'})
fig.show()

- The dataset contains more purchase records from male customers than from female customers
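Because the pie chart counts one entry per transaction row, a customer with many purchases is counted many times. A minimal sketch, on toy data with made-up values, of how transaction counts and distinct customers per gender could be compared side by side:

```python
import pandas as pd

# Toy transactions (illustrative values, not from the real dataset)
toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3, 3],
    "Gender":  ["M", "M", "M", "F", "F", "F"],
})

by_gender = toy.groupby("Gender").agg(
    purchases=("User_ID", "count"),       # one per transaction row
    unique_users=("User_ID", "nunique"),  # one per distinct customer
)
print(by_gender)
```

The two columns can rank the groups differently, so it is worth checking both before describing a split as a split of "users".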

In [None]:
## Age

In [None]:
df['Age'].value_counts()

In [None]:
fig = px.pie(df,names= df['Age'],height=400,\
            title='Age distribution ',\
             labels={'Age':'Age'})
fig.show()

In [None]:
## Occupation

In [None]:
fig = px.pie(df,names= df['Occupation'],height=400,\
            title='Occupation distribution ',\
             labels={'Occupation':'Occupation'})
fig.show()

In [None]:
df_occ_age= df.groupby(['Occupation','Age'], as_index=False).agg({'User_ID':'count'})
df_occ_age

In [None]:
top10=df_occ_age.sort_values(by='User_ID',ascending=False).head(10)
top10

In [None]:
fig = px.bar(top10, x=top10['Occupation'], y=top10['User_ID'],color=top10['Age'], text_auto=True,height=600,width=800,\
             title='Top 10 Occupation and Age distribution by users ',\
             labels={'Age':'Age','Occupation':'Occupation','User_ID':'Count of Users'})
fig.show()

- The largest group in the dataset, counted by purchase records, is occupation category 4 in the 18-25 age group

In [None]:
df_occ_gender= df.groupby(['Occupation','Gender'], as_index=False).agg({'User_ID':'count'})
df_occ_gender.head(5)

In [None]:
top10=df_occ_gender.sort_values(by='User_ID',ascending=False).head(10)
top10

In [None]:
fig = px.bar(top10, x=top10['Occupation'], y=top10['User_ID'],color=top10['Gender'], text_auto=True,height=500,width=700,\
             title='Top 10 Occupation and Gender distribution by users ',\
             labels={'Gender':'Gender','Occupation':'Occupation','User_ID':'Count of Users'})
fig.show()

- By record count, the most common occupation category is 0 for female customers and 4 for male customers

### City_Category

In [None]:
fig = px.pie(df,names= df['City_Category'],height=400,\
            title='City Category distribution ',\
             labels={'City_Category':'City Category'})
fig.show()

In [None]:
df_age_city= df.groupby(['City_Category','Age'], as_index=False).agg({'User_ID':'count'})
df_age_city.head(5)

In [None]:
top10=df_age_city.sort_values(by='User_ID',ascending=False).head(10)
top10

In [None]:
fig = px.bar(top10, x=top10['City_Category'], y=top10['User_ID'],color=top10['Age'], barmode='group',text_auto=True,height=500,width=500,\
             title='Top 10 City_Category and Age distribution by users ',\
             labels={'City_Category':'City_Category','Age':'Age','User_ID':'Count of Users'})
fig.show()

In [None]:
### City_Category and Occupation

In [None]:
df_city_occ= df.groupby(['City_Category','Occupation'], as_index=False).agg({'User_ID':'count'})
df_city_occ.head(5)

In [None]:
top10=df_city_occ.sort_values(by='User_ID',ascending=False).head(10)
top10

In [None]:
fig = px.bar(top10, x=top10['City_Category'], y=top10['User_ID'],color=top10['Occupation'], barmode='group',text_auto=True,height=500,width=700,\
             title='Top 10 City_Category and Occupation distribution by users ',\
             labels={'City_Category':'City_Category','Occupation':'Occupation','User_ID':'Count of Users'})
fig.show()

### Stay_In_Current_City_Years

In [None]:
fig = px.pie(df,names= df['Stay_In_Current_City_Years'],height=400,\
            title='Stay_In_Current_City_Years distribution ',\
             labels={'Stay_In_Current_City_Years':'Stay_In_Current_City_Years'})
fig.show()

### Marital_Status

In [None]:
fig = px.pie(df,names= df['Marital_Status'],height=400,\
            title='Marital_Status distribution ',\
             labels={'Marital_Status':'Marital_Status'})
fig.show()

In [None]:
df_marital_age= df.groupby(['Marital_Status','Age'], as_index=False).agg({'User_ID':'count'})
df_marital_age.head(5)

In [None]:
top10=df_marital_age.sort_values(by='User_ID',ascending=False).head(10)
top10

In [None]:
fig = px.bar(top10, x=top10['Marital_Status'], y=top10['User_ID'],color=top10['Age'], barmode='group',text_auto=True,height=500,width=700,\
             title='Top 10 User count by Marital_Status and Age ',\
             labels={'Age':'Age','Marital_Status':'Marital_Status','User_ID':'Count of Users'})
fig.show()

### Product_Category_1

In [None]:
df_pcat= df.groupby(['Product_Category_1'], as_index=False).agg({'User_ID':'count','Purchase':'sum'})
df_pcat.head(5)

In [None]:
df_pcat

In [None]:
top10=df_pcat.sort_values(by='Purchase',ascending=False).head(10)
top10

In [None]:
fig = px.bar(df_pcat, x=df_pcat['Product_Category_1'],y=df_pcat['Purchase'],text_auto=True,height=600,width=1000,\
             title=' Product_Category_1 Distribution over total purchase ',\
             labels={'Product_Category_1':'Product_Category 1'})
fig.show()

In [None]:
fig = px.bar(top10, x=top10['Product_Category_1'],y=top10['Purchase'], barmode='group',text_auto=True,height=500,width=700,\
             title='Top 10 Product_Category_1 Distribution over total purchase ',\
             labels={'Product_Category_1':'Product_Category 1'})
fig.show()

In [None]:
top10=df_pcat.sort_values(by='User_ID',ascending=False).head(10)
top10

In [None]:
fig = px.bar(top10, x=top10['Product_Category_1'],y=top10['User_ID'], barmode='group',text_auto=True,height=500,width=700,\
             title='Top 10 Product_Category_1 Distribution over number of purchases ',\
             labels={'Product_Category_1':'Product_Category 1','User_ID':'Count of purchase'})
fig.show()

- Product_Category_1 = 1 has the highest total purchase amount in the dataset
- Product_Category_1 = 5 has the highest number of purchases in the dataset
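The two rankings differ because one sums purchase amounts and the other counts rows. A minimal sketch of reading both leaders from a single aggregation, using toy values chosen only for illustration:

```python
import pandas as pd

# Toy purchases (illustrative values, not from the real dataset)
toy = pd.DataFrame({
    "Product_Category_1": [1, 1, 5, 5, 5, 8],
    "Purchase": [15000, 12000, 5000, 6000, 7000, 8000],
})

per_cat = toy.groupby("Product_Category_1").agg(
    n_purchases=("Purchase", "count"),   # number of transactions
    total_purchase=("Purchase", "sum"),  # total amount spent
)

top_by_revenue = per_cat["total_purchase"].idxmax()
top_by_count = per_cat["n_purchases"].idxmax()
print(per_cat)
print(top_by_revenue, top_by_count)
```

In this toy frame a category with few but expensive purchases leads on total amount, while a category with many cheaper purchases leads on count.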

### Purchase

In [None]:
df.head(5)

In [None]:
##### Purchase vs Age

In [None]:
df_page= df.groupby(['Age'], as_index=False).agg({'User_ID':'count','Purchase':'sum'})
df_page.head(5)

In [None]:
fig = px.bar(df_page, x=df_page['Age'],y=df_page['Purchase'], barmode='group',text_auto=True,height=500,width=700,\
             title='Total Purchase across Age groups ',\
             labels={'Purchase':'Purchase','Age':'Age'})
fig.show()

In [None]:
df_page_avg= df.groupby(['Age'], as_index=False).agg({'Purchase':'mean'})
df_page_avg.head(5)

In [None]:
fig = px.bar(df_page_avg, x=df_page_avg['Age'],y=df_page_avg['Purchase'], barmode='group',text_auto=True,height=500,width=700,\
             title='Average purchase amount based on Age',\
             labels={'Purchase':'Purchase','Age':'Age'})
fig.show()

In [None]:
##### Purchase vs Gender

In [None]:
df_pgender= df.groupby(['Gender'], as_index=False).agg({'User_ID':'count','Purchase':'sum'})
df_pgender.head(5)

In [None]:
fig = px.bar(df_pgender, x=df_pgender['Gender'],y=df_pgender['Purchase'], barmode='group',text_auto=True,height=500,width=700,\
             title='Purchase and Gender ',\
             labels={'Purchase':'Purchase','Gender':'Gender'})
fig.show()

In [None]:
df_pgender_avg= df.groupby(['Gender'], as_index=False).agg({'Purchase':'mean'})
df_pgender_avg.head(5)

In [None]:
fig = px.bar(df_pgender_avg, x=df_pgender_avg['Gender'],y=df_pgender_avg['Purchase'], barmode='group',text_auto=True,height=500,width=700,\
             title='Average purchase amount based on Gender',\
             labels={'Purchase':'Purchase','Gender':'Gender'})
fig.show()

In [None]:
##### Purchase vs Occupation

In [None]:
df_pocc= df.groupby(['Occupation'], as_index=False).agg({'User_ID':'count','Purchase':'sum'})
df_pocc.head(5)

In [None]:
fig = px.bar(df_pocc, x=df_pocc['Occupation'],y=df_pocc['Purchase'], barmode='group',text_auto=True,height=500,width=700,\
             title='Purchase and Occupation ',\
             labels={'Purchase':'Purchase','Occupation':'Occupation'})
fig.show()

In [None]:
df_pocc_avg= df.groupby(['Occupation'], as_index=False).agg({'User_ID':'count','Purchase':'mean'})
df_pocc_avg.head(5)

In [None]:
fig = px.bar(df_pocc_avg, x=df_pocc_avg['Occupation'],y=df_pocc_avg['Purchase'], barmode='group',text_auto=True,height=500,width=700,\
             title='Average purchase amount based on Occupation',\
             labels={'Purchase':'Purchase','Occupation':'Occupation'})
fig.show()

In [None]:
##### Purchase vs Stay_In_Current_City_Years

In [None]:
df_stay= df.groupby(['Stay_In_Current_City_Years'], as_index=False).agg({'User_ID':'count','Purchase':'sum'})
df_stay.head(5)

In [None]:
fig = px.bar(df_stay, x=df_stay['Stay_In_Current_City_Years'],y=df_stay['Purchase'], barmode='group',text_auto=True,height=500,width=700,\
             title='Stay_In_Current_City_Years and Purchase ',\
             labels={'Purchase':'Purchase','Stay_In_Current_City_Years':'Stay_In_Current_City_Years'})
fig.show()

In [None]:
df_stay_avg= df.groupby(['Stay_In_Current_City_Years'], as_index=False).agg({'User_ID':'count','Purchase':'mean'})
df_stay_avg.head(5)

In [None]:
fig = px.bar(df_stay_avg, x=df_stay_avg['Stay_In_Current_City_Years'],y=df_stay_avg['Purchase'], barmode='group',text_auto=True,height=500,width=700,\
             title='Average purchase amount based on Stay_In_Current_City_Years',\
             labels={'Purchase':'Purchase','Stay_In_Current_City_Years':'Stay_In_Current_City_Years'})
fig.show()

In [None]:
df['Purchase'].plot(kind='box');

In [None]:
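# Restrict to numeric columns; object columns such as Gender or Age raise an error in pandas >= 2.0
corr = df.corr(numeric_only=True)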
plt.figure(figsize=(14,7))
sns.heatmap(corr, annot=True, cmap='coolwarm')

- Most features are not strongly correlated with one another; the exception is Product_Category_1 and Product_Category_2, which appear to be strongly correlated
- Occupation and Purchase have a positive correlation
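Reading the strongest pair off a heatmap by eye is error-prone, so it can help to rank the pairs directly. A minimal sketch on toy data with made-up values, assuming pandas 1.5 or later for the `numeric_only` argument:

```python
from itertools import combinations

import pandas as pd

# Toy data (illustrative values, not from the real dataset)
toy = pd.DataFrame({
    "Gender": ["M", "F", "M", "F", "M"],  # object column, skipped by numeric_only
    "Product_Category_1": [1, 2, 3, 4, 5],
    "Product_Category_2": [2.0, 4.0, 6.0, 8.0, 11.0],
    "Purchase": [9000, 7000, 8000, 3000, 5000],
})

corr = toy.corr(numeric_only=True)

# Visit each unordered column pair once and keep its absolute correlation
pairs = {(a, b): abs(corr.loc[a, b]) for a, b in combinations(corr.columns, 2)}
strongest = max(pairs, key=pairs.get)
print(strongest, round(pairs[strongest], 3))
```

Ranking by absolute value treats strong negative and strong positive correlations alike; the sign can still be read from `corr.loc[a, b]`.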