### University of San Diego 

### Master of Science, Applied Data Science 

#### Contributors

- Hoori Javadnia
- Salvador Sanchez
- Jacqueline Vo

***

## Data Overview

### The training dataset has 550,068 rows (transactions) and 12 columns (features), as described below:

- User_ID: Unique ID of the user. 
- Product_ID: Unique ID of the product. 
- Gender: Gender of the person making the transaction.
- Age: Age group of the person making the transaction.
- Occupation: Occupation of the user, already encoded as numbers 0 to 20.
- City_Category: Category of the city the user lives in. Cities are grouped into 3 categories: 'A', 'B' and 'C'.
- Stay_In_Current_City_Years: How long the user has lived in the current city.
- Marital_Status: 0 if the user is not married and 1 otherwise.
- Product_Category_1 to _3: Category of the product. All 3 are already encoded as numbers.
- Purchase: Purchase amount.
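
The descriptions above imply a small set of allowed values for several features. A minimal sketch of checking them is shown below. It assumes the value sets listed in the descriptions, plus the `F`/`M` gender codes seen in the data preview; the `sample` frame and the `find_violations` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample rows shaped like the dataset described above
sample = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "City_Category": ["A", "C", "B"],
    "Occupation": [10, 16, 1],
    "Marital_Status": [0, 0, 1],
})

# Allowed values taken from the feature descriptions (assumed to be exhaustive)
ALLOWED = {
    "Gender": {"F", "M"},
    "City_Category": {"A", "B", "C"},
    "Occupation": set(range(21)),  # labels 0 to 20
    "Marital_Status": {0, 1},
}


def find_violations(df, allowed):
    """Return {column: unexpected values} for columns that contradict the description."""
    violations = {}
    for col, values in allowed.items():
        # tolist() converts NumPy scalars to plain Python values before comparing
        unexpected = set(df[col].dropna().unique().tolist()) - values
        if unexpected:
            violations[col] = unexpected
    return violations


violations = find_violations(sample, ALLOWED)
print(violations)
```

An empty dictionary means every checked column matches its description. Running the same helper on the full training frame would flag any unexpected category before modeling.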

***

# Initial imports

In [1]:
#Import packages
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

# Exploratory Data Analysis (EDA)
Rubric: Discussion is very thorough. All aspects of the data exploration and EDA that are relevant to the main project objectives are carefully addressed.
Figures and tables are highly insightful, and are carefully tailored to the project tasks and report.

In [4]:
#Import csv into train/test datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [3]:
#Training dataset
display(train_df.head())
train_df.shape

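| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |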


(550068, 12)

In [6]:
#Test dataset
display(test_df.head())
test_df.shape

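| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000004 | P00128942 | M | 46-50 | 7 | B | 2 | 1 | 1 | 11.0 | NaN |
| 1 | 1000009 | P00113442 | M | 26-35 | 17 | C | 0 | 0 | 3 | 5.0 | NaN |
| 2 | 1000010 | P00288442 | F | 36-45 | 1 | B | 4+ | 1 | 5 | 14.0 | NaN |
| 3 | 1000010 | P00145342 | F | 36-45 | 1 | B | 4+ | 1 | 4 | 9.0 | NaN |
| 4 | 1000011 | P00053842 | F | 26-35 | 1 | C | 1 | 0 | 4 | 5.0 | 12.0 |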


(233599, 11)
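
The two shapes differ by one column (12 versus 11). A minimal sketch of confirming which column is absent from the test set follows. The two one-row frames are rebuilt from the first preview row of each dataset and stand in for `train_df` and `test_df`; the names `train_sample` and `test_sample` are hypothetical.

```python
import pandas as pd

feature_cols = ["User_ID", "Product_ID", "Gender", "Age", "Occupation",
                "City_Category", "Stay_In_Current_City_Years", "Marital_Status",
                "Product_Category_1", "Product_Category_2", "Product_Category_3"]

# One-row stand-ins copied from the first row of each preview above
train_sample = pd.DataFrame(
    [[1000001, "P00069042", "F", "0-17", 10, "A", "2", 0, 3, None, None, 8370]],
    columns=feature_cols + ["Purchase"],
)
test_sample = pd.DataFrame(
    [[1000004, "P00128942", "M", "46-50", 7, "B", "2", 1, 1, 11.0, None]],
    columns=feature_cols,
)

# Columns present in the training frame but not in the test frame
missing_in_test = train_sample.columns.difference(test_sample.columns).tolist()
print(missing_in_test)
```

In the previews, `Purchase` is the only column missing from the test set, which is consistent with it being the value to predict.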

In [7]:
#Data quality report

df = train_df


def as_int(value):
    #Show numeric modes as integers (e.g. 8.0 -> 8); leave labels such as 'M' unchanged
    try:
        return int(value)
    except (TypeError, ValueError):
        return value


#Mode table: most frequent and second most frequent value per feature
#(assumes every feature has at least two distinct values)
#Rows are collected in a list because DataFrame.append was removed in pandas 2.0
rows = []
for col in df.columns:
    freq = df[col].value_counts()
    total = len(df[col])  #includes missing values, so % is of all rows
    rows.append({'Feature': col,
                 'Mode': as_int(freq.index[0]),
                 'Mode Freq.': freq.iloc[0],
                 'Mode %': freq.iloc[0] / total,
                 '2nd Mode': as_int(freq.index[1]),
                 '2nd Mode Freq.': freq.iloc[1],
                 '2nd Mode %': freq.iloc[1] / total})

freqDF = pd.DataFrame(rows).set_index('Feature')

#Nulls, Counts, Cardinality
NullFeatures = round(df.isnull().sum() / df.shape[0], 4)
Count = df.count()
uni = df.nunique()

#Formatting
result = pd.concat([Count, NullFeatures, uni], axis=1)
result.columns = ["Count", "% Miss.", "Card."]
result = pd.concat([result, freqDF], axis=1)
result.style.format({'% Miss.': "{:.1%}",
                     'Mode %': "{:.0%}",
                     '2nd Mode %': "{:.0%}",
                     'Count': "{:,}",
                     'Card.': "{:,}",
                     'Mode Freq.': "{:,}",
                     '2nd Mode Freq.': "{:,}"})

| Feature | Count | % Miss. | Card. | Mode | Mode Freq. | Mode % | 2nd Mode | 2nd Mode Freq. | 2nd Mode % |
|---|---|---|---|---|---|---|---|---|---|
| User_ID | 550,068 | 0.0% | 5,891 | 1001680 | 1,026 | 0% | 1004277 | 979 | 0% |
| Product_ID | 550,068 | 0.0% | 3,631 | P00265242 | 1,880 | 0% | P00025442 | 1,615 | 0% |
| Gender | 550,068 | 0.0% | 2 | M | 414,259 | 75% | F | 135,809 | 25% |
| Age | 550,068 | 0.0% | 7 | 26-35 | 219,587 | 40% | 36-45 | 110,013 | 20% |
| Occupation | 550,068 | 0.0% | 21 | 4 | 72,308 | 13% | 0 | 69,638 | 13% |
| City_Category | 550,068 | 0.0% | 3 | B | 231,173 | 42% | C | 171,175 | 31% |
| Stay_In_Current_City_Years | 550,068 | 0.0% | 5 | 1 | 193,821 | 35% | 2 | 101,838 | 19% |
| Marital_Status | 550,068 | 0.0% | 2 | 0 | 324,731 | 59% | 1 | 225,337 | 41% |
| Product_Category_1 | 550,068 | 0.0% | 20 | 5 | 150,933 | 27% | 1 | 140,378 | 26% |
| Product_Category_2 | 376,430 | 31.6% | 17 | 8 | 64,088 | 12% | 14 | 55,108 | 10% |


<div class="alert alert-block alert-success">
The data quality report shows that each observation represents one product sold in a transaction. The training set has 550,068 observations but only 5,891 distinct users, who purchased from a catalog of 3,631 distinct products.
</div>
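
The totals in the report imply that users and products each appear many times. A minimal sketch of that arithmetic is below, followed by the same per-user count on a toy frame. The toy rows reuse the `User_ID` and `Product_ID` values from the training preview; the variable names are hypothetical.

```python
import pandas as pd

# Totals taken from the data quality report
n_transactions = 550_068
n_users = 5_891
n_products = 3_631

# Average number of transactions per distinct user and per distinct product
avg_per_user = n_transactions / n_users
avg_per_product = n_transactions / n_products
print(f"Transactions per user:    {avg_per_user:.1f}")
print(f"Transactions per product: {avg_per_product:.1f}")

# The same idea on the five preview rows: count rows per user
toy = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000001, 1000001, 1000002],
    "Product_ID": ["P00069042", "P00248942", "P00087842", "P00085442", "P00285442"],
})
per_user = toy.groupby("User_ID").size()
print(per_user)
```

Because each user contributes many rows, a random row-level split could place the same user in both the training and validation sets. Whether that matters depends on the project objective.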

# Data Pre-processing
Rubric: All preprocessing steps are clearly explained.
## Data Splitting

# Data Modeling
Rubric: Predictive modeling methods are well motivated, correctly implemented, and, to the extent appropriate, span the range of methods discussed in this course.
## Model Performance & Hyperparameter Tuning
Rubric: Cross-validation and/or held-out test sets are used in accordance with best practices to assess model performance. Performance metrics are carefully tailored to the project objectives.

# Final Model

## Results
Rubric: All project objectives are fully met, the findings are presented clearly, and the question(s) are technically addressed in the report and presentation.