### Problem Statement
> A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, the purchase amount) across products in different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from last month.
They now want to build a model that predicts a customer's purchase amount for each product, which will help them create personalized offers for customers across different products.

|Variable|Definition|
|:----|:----|
|User_ID|User ID|
|Product_ID|Product ID|
|Gender|Sex of User|
|Age|Age in bins|
|Occupation|Occupation (Masked)|
|City_Category|Category of the City (A,B,C)|
|Stay_In_Current_City_Years|Number of years stay in current city|
|Marital_Status|Marital Status|
|Product_Category_1|Product Category (Masked)|
|Product_Category_2|Product may also belong to another category (Masked)|
|Product_Category_3|Product may also belong to another category (Masked)|
|Purchase|Purchase Amount (Target Variable)|

### Evaluation
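Submissions are scored on the root mean squared error (RMSE), a common general-purpose error metric. Compared to the mean absolute error (MAE), RMSE penalizes large errors more heavily, because each error is squared before averaging:

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual purchase amount, $\hat{y}_i$ is the predicted purchase amount, and $N$ is the number of rows.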
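
To make the difference between the two metrics concrete, here is a minimal sketch on made-up numbers. The helper names `toy_mae` and `toy_rmse` and the data are assumptions for illustration only and are not part of the notebook. Both prediction sets have the same MAE, but the one with a single large miss has twice the RMSE.

```python
import math

def toy_mae(y_true, y_pred):
    """Mean absolute error: the average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_rmse(y_true, y_pred):
    """Root mean squared error: square root of the average of (y - y_hat)^2."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [100, 100, 100, 100]
even_errors = [110, 90, 110, 90]       # four errors of 10 each
one_big_error = [100, 100, 100, 140]   # a single error of 40

print(toy_mae(y_true, even_errors), toy_rmse(y_true, even_errors))      # 10.0 10.0
print(toy_mae(y_true, one_big_error), toy_rmse(y_true, one_big_error))  # 10.0 20.0
```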



In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import numpy as np
import pandas as pd
import pandas_profiling
import seaborn as sns
from matplotlib import pyplot as plt

In [3]:
import scipy.stats as stats
from scipy.stats import chi2_contingency

In [4]:
train_file = Path.cwd().joinpath('datasource/train.csv')
test_file =  Path.cwd().joinpath('datasource/test.csv')

In [5]:
train_df = pd.read_csv(train_file)

In [6]:
test_df = pd.read_csv(test_file)

In [7]:
def extended_describe(dataframe):
    # describe() summary plus per-column null and distinct-value counts
    extended_describe_df = dataframe.describe(include='all').T
    extended_describe_df['null_count'] = dataframe.isnull().sum()
    # len(x.unique()) counts NaN as a distinct value
    extended_describe_df['unique_count'] = dataframe.apply(lambda x: len(x.unique()))
    return extended_describe_df

In [8]:
extended_describe(train_df)

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,null_count,unique_count
User_ID,550068,,,,1003030.0,1727.59,1000000.0,1001520.0,1003080.0,1004480.0,1006040.0,0,5891
Product_ID,550068,3631.0,P00265242,1880.0,,,,,,,,0,3631
Gender,550068,2.0,M,414259.0,,,,,,,,0,2
Age,550068,7.0,26-35,219587.0,,,,,,,,0,7
Occupation,550068,,,,8.07671,6.52266,0.0,2.0,7.0,14.0,20.0,0,21
City_Category,550068,3.0,B,231173.0,,,,,,,,0,3
Stay_In_Current_City_Years,550068,5.0,1,193821.0,,,,,,,,0,5
Marital_Status,550068,,,,0.409653,0.49177,0.0,0.0,0.0,1.0,1.0,0,2
Product_Category_1,550068,,,,5.40427,3.93621,1.0,1.0,5.0,8.0,20.0,0,20
Product_Category_2,376430,,,,9.84233,5.08659,2.0,5.0,9.0,15.0,18.0,173638,18


In [9]:
extended_describe(test_df)

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,null_count,unique_count
User_ID,233599,,,,1003030.0,1726.5,1000000.0,1001530.0,1003070.0,1004480.0,1006040.0,0,5891
Product_ID,233599,3491.0,P00265242,829.0,,,,,,,,0,3491
Gender,233599,2.0,M,175772.0,,,,,,,,0,2
Age,233599,7.0,26-35,93428.0,,,,,,,,0,7
Occupation,233599,,,,8.08541,6.52115,0.0,2.0,7.0,14.0,20.0,0,21
City_Category,233599,3.0,B,98566.0,,,,,,,,0,3
Stay_In_Current_City_Years,233599,5.0,1,82604.0,,,,,,,,0,5
Marital_Status,233599,,,,0.41007,0.491847,0.0,0.0,0.0,1.0,1.0,0,2
Product_Category_1,233599,,,,5.27654,3.73638,1.0,1.0,5.0,8.0,18.0,0,18
Product_Category_2,161255,,,,9.84959,5.09494,2.0,5.0,9.0,15.0,18.0,72344,18


In [10]:
train_df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [14]:
from collections import Counter
def detect_outliers(dataset, noutliers, columns):
    """Flag rows outside the 1.5 * IQR fences of the given columns.

    Returns a Counter mapping each flagged row index to the number of
    columns in which it is an outlier. `noutliers` is currently unused:
    the cells below drop every flagged row.
    """
    outlier_indices = []
    for column in columns:
        # 1st quartile (25%) and 3rd quartile (75%)
        q1, q3 = np.percentile(dataset[column], [25, 75])

        # Interquartile range (IQR)
        iqr = q3 - q1

        # Outlier step
        outlier_step = 1.5 * iqr

        lower_bound = q1 - outlier_step
        upper_bound = q3 + outlier_step

        # Indices of the rows that are outliers for this column
        outlier_list_col = dataset[(dataset[column] < lower_bound) | (
            dataset[column] > upper_bound)].index
        outlier_indices.extend(outlier_list_col)

    return Counter(outlier_indices)

In [15]:
outliers = detect_outliers(train_df,2,["Purchase"])

In [16]:
remove_idx = list(outliers.keys())

In [17]:
train_df.drop(remove_idx,inplace=True,axis =0)
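
The IQR rule applied by `detect_outliers` can be checked by hand on a small array. The sketch below uses made-up values (an assumption for illustration, not data from `train.csv`) and repeats the same steps: compute the quartiles, widen them by 1.5 times the IQR, and flag anything outside the resulting bounds.

```python
import numpy as np

# Made-up data; 100 is the obvious outlier.
values = np.array([10, 12, 11, 13, 12, 11, 100])

q1, q3 = np.percentile(values, [25, 75])   # 11.0 and 12.5
iqr = q3 - q1                              # 1.5
lower_bound = q1 - 1.5 * iqr               # 8.75
upper_bound = q3 + 1.5 * iqr               # 14.75

# Boolean mask of the values outside the fences
outlier_mask = (values < lower_bound) | (values > upper_bound)
print(values[outlier_mask])                # [100]
```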

In [18]:
train_df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [19]:
def get_mapper(df,feature,target):
    return df.groupby([feature])[target].sum().rank(ascending= False).to_dict()

gender_mapper = get_mapper(train_df,'Gender','Purchase')
age_mapper = get_mapper(train_df,'Age','Purchase')
city_category_mapper = get_mapper(train_df,'City_Category','Purchase')
stay_in_mapper = get_mapper(train_df,'Stay_In_Current_City_Years','Purchase')

train_df['Age_TR']= train_df['Age'].map(lambda x:age_mapper.get(x,-1)).astype('int')
train_df['Gender_TR']= train_df['Gender'].map(lambda x:gender_mapper.get(x,-1)).astype('int')
train_df['City_Category_TR']= train_df['City_Category'].map(lambda x:city_category_mapper.get(x,-1)).astype('int')
train_df['Stay_In_Current_City_Years_TR']= train_df['Stay_In_Current_City_Years'].map(lambda x:stay_in_mapper.get(x,-1)).astype('int')

test_df['Age_TR']= test_df['Age'].map(lambda x:age_mapper.get(x,-1)).astype('int')
test_df['Gender_TR']= test_df['Gender'].map(lambda x:gender_mapper.get(x,-1)).astype('int')
test_df['City_Category_TR']= test_df['City_Category'].map(lambda x:city_category_mapper.get(x,-1)).astype('int')
test_df['Stay_In_Current_City_Years_TR']= test_df['Stay_In_Current_City_Years'].map(lambda x:stay_in_mapper.get(x,-1)).astype('int')
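
To see what `get_mapper` produces, the following sketch applies the same `groupby(...).sum().rank(ascending=False)` chain to a tiny made-up frame. The `toy` data and the unseen category `'D'` are assumptions for illustration. The category with the largest total purchase gets rank 1, and any category missing from the mapper falls back to -1.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    'City_Category': ['A', 'B', 'B', 'C', 'C', 'C'],
    'Purchase': [100, 200, 300, 50, 50, 50],
})

# Total purchase per category: A=100, B=500, C=150
totals = toy.groupby('City_Category')['Purchase'].sum()

# Rank the totals in descending order: B -> 1.0, C -> 2.0, A -> 3.0
mapper = totals.rank(ascending=False).to_dict()
print(mapper)

# Encode the column; categories absent from the mapper become -1
encoded = toy['City_Category'].map(lambda x: mapper.get(x, -1)).astype('int')
unseen = mapper.get('D', -1)
print(encoded.tolist(), unseen)
```

Because the ranking uses the sum of purchases, it reflects both how often a category occurs and how much is spent per purchase: here `C` has the smallest individual purchases but still outranks `A`.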

In [20]:
train_count = train_df.groupby(['User_ID'])['User_ID'].count().to_dict()
test_count = test_df.groupby(['User_ID'])['User_ID'].count().to_dict()

In [21]:
train_prd_count = train_df.groupby(['Product_ID'])['Product_ID'].count().to_dict()
test_prd_count =  test_df.groupby(['Product_ID'])['Product_ID'].count().to_dict()

In [22]:
train_df['User_ID_Count']= train_df['User_ID'].map(lambda x:train_count.get(x,-1)).astype('int')
test_df['User_ID_Count']= test_df['User_ID'].map(lambda x:test_count.get(x,-1)).astype('int')

In [23]:
train_df['Product_ID_Count']= train_df['Product_ID'].map(lambda x:train_prd_count.get(x,-1)).astype('int')
test_df['Product_ID_Count']= test_df['Product_ID'].map(lambda x:test_prd_count.get(x,-1)).astype('int')

In [24]:
train_df.fillna(0,inplace=True)
test_df.fillna(0,inplace=True)

In [25]:
test_df_copy = test_df.copy(deep=True)

In [26]:
category_names = list(train_df.select_dtypes(include=['object']).columns)
category_names

['Product_ID', 'Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [27]:
drop_cols =['User_ID','Product_ID','Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [28]:
train_df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,Age_TR,Gender_TR,City_Category_TR,Stay_In_Current_City_Years_TR,User_ID_Count,Product_ID_Count
0,1000001,P00069042,F,0-17,10,A,2,0,3,0.0,0.0,8370,7,2,3,2,35,227
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200,7,2,3,2,35,581
2,1000001,P00087842,F,0-17,10,A,2,0,12,0.0,0.0,1422,7,2,3,2,35,102
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,0.0,1057,7,2,3,2,35,341
4,1000002,P00285442,M,55+,16,C,4+,0,8,0.0,0.0,7969,6,1,2,4,77,203


In [29]:
X = train_df.drop(drop_cols+['Purchase'], axis=1)
y = train_df['Purchase']


In [30]:
X_test = test_df.drop(drop_cols, axis=1)

In [31]:
X.head()

Unnamed: 0,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Age_TR,Gender_TR,City_Category_TR,Stay_In_Current_City_Years_TR,User_ID_Count,Product_ID_Count
0,10,0,3,0.0,0.0,7,2,3,2,35,227
1,10,0,1,6.0,14.0,7,2,3,2,35,581
2,10,0,12,0.0,0.0,7,2,3,2,35,102
3,10,0,12,14.0,0.0,7,2,3,2,35,341
4,16,0,8,0.0,0.0,6,1,2,4,77,203


In [32]:
X_test.head()

Unnamed: 0,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Age_TR,Gender_TR,City_Category_TR,Stay_In_Current_City_Years_TR,User_ID_Count,Product_ID_Count
0,7,1,1,11.0,0.0,4,1,1,2,1,397
1,17,0,3,5.0,0.0,1,1,2,5,27,117
2,1,1,5,14.0,0.0,2,2,1,4,101,75
3,1,1,4,9.0,0.0,2,2,1,4,101,8
4,1,0,4,5.0,12.0,1,2,2,1,40,214


In [33]:
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.8, random_state=42)

In [34]:
import xgboost as xgb

In [35]:
xgb.__version__

'1.1.1'

In [36]:
from yellowbrick.regressor import ResidualsPlot
from yellowbrick.regressor import PredictionError
from yellowbrick.regressor import AlphaSelection
from sklearn.model_selection import learning_curve
from sklearn.metrics import mean_squared_error

def residual_plot(model, X_train, y_train, X_validation, y_validation):
    visualizer = ResidualsPlot(model)
    visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
    visualizer.score(X_validation, y_validation)  # Evaluate the model on the validation data
    visualizer.show()

def prediction_error(model, X_train, y_train, X_validation, y_validation):
    visualizer = PredictionError(model)
    visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
    visualizer.score(X_validation, y_validation)  # Evaluate the model on the validation data
    visualizer.show()

def show_learning_curve(model, X_train, y_train, X_validation, y_validation):
    # X_validation and y_validation are unused; kept so all helpers share one signature
    train_sizes, train_scores, validation_scores = learning_curve(
        model, X_train, y_train.values.ravel(), cv=5)
    # Plot the mean score across the CV folds (the estimator's default score, R^2 for regressors)
    plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.plot(train_sizes, validation_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.legend(loc='best')
    plt.show()



In [40]:
from xgboost.sklearn import XGBRegressor 

In [46]:
xg_reg = XGBRegressor(objective ='reg:squarederror',learning_rate = 0.01, max_depth = 8, n_estimators = 600)

In [47]:
xg_reg.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.01, max_delta_step=0, max_depth=8,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=600, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [49]:
predictions = xg_reg.predict(X_validation)
print(mean_squared_error(y_validation, predictions))
result = np.sqrt(mean_squared_error(y_validation, predictions))
print(result)

7092067.096665568
2663.0935200750214


In [48]:
from sklearn.metrics import mean_squared_error

def rmse(model,X,Y):
    y_ = model.predict(X)
    return mean_squared_error(y_,Y)**0.5
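
A callable with the signature `(estimator, X, y)` can be passed directly as `scoring` to `cross_val_score`, which is how `rmse` is used in the tuning cell. The sketch below shows this on synthetic data; `rmse_scorer` mirrors `rmse` above, while the toy data and the use of `LinearRegression` are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

def rmse_scorer(model, X, Y):
    """Same logic as rmse above: RMSE of the model's predictions on X."""
    y_ = model.predict(X)
    return mean_squared_error(Y, y_) ** 0.5

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=60)

# One RMSE value per fold
scores = cross_val_score(LinearRegression(), X_toy, y_toy,
                         scoring=rmse_scorer, cv=KFold(n_splits=3))
print(scores, scores.mean())
```

Note that this scorer returns a positive error, where lower is better. That suits a minimizer such as hyperopt's `fmin`, but scikit-learn's own search utilities assume that a higher score is better, so they would need a negated version.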

In [50]:
from xgboost import XGBRegressor
from hyperopt import hp,fmin,tpe
from sklearn.model_selection import cross_val_score, KFold

def objective(params):
    params = {
        'n_estimators' : int(params['n_estimators']),
        'max_depth' : int(params['max_depth']),
        'learning_rate' : float(params['learning_rate'])
    }
    
    clf = XGBRegressor(**params,n_jobs=4)
    score = cross_val_score(clf, X_train,y_train, scoring = rmse, cv=KFold(n_splits=3)).mean()
    print("Parmas {} - {}".format(params,score))
    return score

space = {
    'n_estimators': hp.quniform('n_estimators', 50, 1000, 50),
    'max_depth': hp.quniform('max_depth', 4, 16, 4),
    'learning_rate' : hp.uniform('learning_rate',0.05, 0.15) 
}

best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=10)

Parmas {'n_estimators': 500, 'max_depth': 12, 'learning_rate': 0.09693535892492404} - 2564.8486235024398                 
Parmas {'n_estimators': 500, 'max_depth': 4, 'learning_rate': 0.051521087170196204} - 2721.629546184593                  
Parmas {'n_estimators': 400, 'max_depth': 4, 'learning_rate': 0.06142966199924067} - 2724.3638887783986                  
Parmas {'n_estimators': 500, 'max_depth': 12, 'learning_rate': 0.11189253876981672} - 2577.1845724973596                 
Parmas {'n_estimators': 950, 'max_depth': 4, 'learning_rate': 0.13339589247464204} - 2601.746089051155                   
Parmas {'n_estimators': 500, 'max_depth': 16, 'learning_rate': 0.062136946993891606} - 2629.3618151551723                
Parmas {'n_estimators': 450, 'max_depth': 12, 'learning_rate': 0.09402446308665188} - 2557.373388079506                  
Parmas {'n_estimators': 250, 'max_depth': 8, 'learning_rate': 0.061348356030688084} - 2587.634041682488                  
Parmas {'n_estimators': 

In [51]:
best

{'learning_rate': 0.08193765605910798,
 'max_depth': 8.0,
 'n_estimators': 1000.0}
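
`hp.quniform` yields floats, so `best` reports `max_depth` and `n_estimators` as `8.0` and `1000.0`. They need to be cast back to integers before being passed to `XGBRegressor`, just as `objective` does. A minimal sketch follows; `cast_best_params` is a hypothetical helper that is not part of the notebook, and `best_example` simply copies the values shown above.

```python
# Values copied from the `best` output above
best_example = {
    'learning_rate': 0.08193765605910798,
    'max_depth': 8.0,
    'n_estimators': 1000.0,
}

def cast_best_params(best):
    """Hypothetical helper: restore the integer types that hp.quniform drops."""
    return {
        'n_estimators': int(best['n_estimators']),
        'max_depth': int(best['max_depth']),
        'learning_rate': float(best['learning_rate']),
    }

final_params = cast_best_params(best_example)
print(final_params)
```

The resulting dictionary could then be unpacked into the constructor, for example `XGBRegressor(objective='reg:squarederror', **final_params)`, instead of copying the numbers by hand.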

In [55]:
xg_reg = XGBRegressor(objective ='reg:squarederror',learning_rate = 0.08193765605910798, max_depth = 8, n_estimators = 1000)

In [56]:
print(xg_reg)

XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None, gamma=None,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=0.08193765605910798, max_delta_step=None,
             max_depth=8, min_child_weight=None, missing=nan,
             monotone_constraints=None, n_estimators=1000, n_jobs=None,
             num_parallel_tree=None, objective='reg:squarederror',
             random_state=None, reg_alpha=None, reg_lambda=None,
             scale_pos_weight=None, subsample=None, tree_method=None,
             validate_parameters=None, verbosity=None)


In [None]:
xg_reg.fit(X_train,y_train)

In [None]:
prediction_error(xg_reg,X_train, y_train,X_validation, y_validation)

In [None]:
# Predict on the test set with the tuned model fitted above
y_test = xg_reg.predict(X_test)
submission = pd.DataFrame()
submission['Purchase'] = y_test
submission['User_ID'] = test_df['User_ID']
submission['Product_ID'] = test_df['Product_ID']

# Write the submission to a timestamped CSV file
import datetime
FORMAT = '%Y%m%d%H%M%S'
timestamp = datetime.datetime.now().strftime(FORMAT)
filename = "Submission_xgboost_" + timestamp + "_out.csv"
submission.to_csv(filename, index=False)