# Kaggle Competition: Black Friday Sale Prediction

## General Instructions:

__Task__: The objective is to predict the primary product category (Product_Category_1) from the other features of the product. You may also create your own features.<br>

__Metrics__: The evaluation metric for this competition is Accuracy.<br>

__Other metrics (optional)__: While working on the problem, you should also check the precision and recall of your models. This is for your own learning and will not be considered in the evaluation.<br>
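
As a minimal sketch of checking these metrics side by side, the snippet below uses scikit-learn's metric functions on made-up labels. The names `y_true` and `y_pred` and their values are illustrative only and are not taken from the competition data. Macro averaging is one reasonable choice for a multi-class target, not a requirement of the competition.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels, for illustration only
y_true = [1, 1, 5, 5, 8, 8]
y_pred = [1, 5, 5, 5, 8, 1]

# Accuracy is the competition metric; precision and recall are extra diagnostics
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average='macro', zero_division=0)
rec = recall_score(y_true, y_pred, average='macro', zero_division=0)

print(acc, prec, rec)
```

Macro averaging weights every class equally, so a model that ignores a rare category is penalized even when overall accuracy looks fine.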

__Submission Format__<br>
The solution file is a CSV file with two columns: Product_ID and Product_Category_1 (your predicted class).<br>

## _Import necessary libraries_
 - The universally basic libraries are numpy, pandas, and matplotlib.
 - StandardScaler standardizes each feature to zero mean and unit variance.
 - In this project we use the get_dummies method from pandas, which does the job of both LabelEncoder and OneHotEncoder.
 - The machine learning models in use are Random Forest, Bagging, AdaBoost, Voting, Logistic Regression, Decision Tree, SVM, Neural Networks, and XGBoost.
 - The model selection tools are cross-validation and grid search.
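
To illustrate the get_dummies point above, here is a small sketch on a toy frame. The column names mirror the dataset, but the values are invented for the example.

```python
import pandas as pd

# Toy frame; column names mirror the dataset, the values are invented
toy = pd.DataFrame({'City_Category': ['A', 'B', 'A', 'C'],
                    'Purchase': [100, 250, 175, 300]})

# One indicator column is created per distinct value of the listed column;
# columns that are not listed pass through unchanged
encoded = pd.get_dummies(data=toy, columns=['City_Category'])

print(encoded.columns.tolist())
```

The string categories are turned directly into indicator columns, so no separate integer-encoding step is needed.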

In [1061]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

## _Exploring the data_:
 - The data in use consists of the following variables: Gender, Age, Occupation, City_Category, Marital_Status, Stay_In_Current_City_Years, Product_Category_2, Product_Category_3, Purchase, User_ID, Product_ID, Product_Category_1 <br>

 - Our target for this project is Product_Category_1.
 - At first, all of the variables were used as features. However, we discovered that some features hold more predictive power than others. To be more specific, when switching from all variables to a select few (Purchase, Product_Category_2, Product_Category_3), we achieved an accuracy improvement of 6 percent, and when using only Purchase and Product_Category_2, we gained another 1 percent.
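
One way to run this kind of feature-subset comparison is to score the same model with cross-validation on different column lists. The sketch below does so on synthetic data. The column names `informative`, `noise_1`, and `noise_2` and the helper `cv_accuracy` are hypothetical and only stand in for the real features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: one column determines the target, two are pure noise
rng = np.random.RandomState(0)
n = 300
informative = rng.normal(size=n)
toy = pd.DataFrame({'informative': informative,
                    'noise_1': rng.normal(size=n),
                    'noise_2': rng.normal(size=n)})
target = (informative > 0).astype(int)

def cv_accuracy(columns):
    """Mean 5-fold cross-validated accuracy using only the given columns."""
    model = RandomForestClassifier(n_estimators=30, random_state=0)
    scores = cross_val_score(model, toy[columns], target, cv=5, scoring='accuracy')
    return scores.mean()

all_acc = cv_accuracy(['informative', 'noise_1', 'noise_2'])
subset_acc = cv_accuracy(['informative'])
print(all_acc, subset_acc)
```

Comparing the two printed scores shows whether the extra columns help or hurt on this toy problem. The same pattern can be applied to the real feature lists.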


In [1196]:
'''
This is the data with all features.
This cell is commented out; the only data in use
will be Product_Category_2 and Purchase.
'''
# # read in the data
# df = pd.read_csv('black_friday_data_kaggle.csv')

# # one hot encoding the categorical data
# df = pd.get_dummies(data=df, columns=['Gender', 'Age', 'Occupation',
#                                    'City_Category', 'Marital_Status',
#                                    'Stay_In_Current_City_Years',])
# df = df.drop(['Product_Category_2', 'Product_Category_3'], axis = 1)

# #drop missing data
# df = df.dropna()


# # take the median of each feature so that there is exactly one row per product id
# df = df.groupby(['Product_ID', 'Product_Category_1'])[['Gender_F', 'Gender_M', 
#                                                   'Age_0-17', 'Age_18-25', 'Age_26-35','Age_36-45',
#                                                   'Age_46-50', 'Age_51-55', 'Age_55+',
#                                                   'Occupation_0', 'Occupation_1', 'Occupation_2',
#                                                   'Occupation_3', 'Occupation_4', 'Occupation_5',
#                                                   'Occupation_6', 'Occupation_7', 'Occupation_8',
#                                                   'Occupation_9', 'Occupation_10', 'Occupation_11',
#                                                   'Occupation_12', 'Occupation_13', 'Occupation_14',
#                                                   'Occupation_15', 'Occupation_16', 'Occupation_17',
#                                                   'Occupation_18', 'Occupation_19', 'Occupation_20',
#                                                   'City_Category_A', 'City_Category_B', 'City_Category_C', 
#                                                   'Marital_Status_0', 'Marital_Status_1',
#                                                   'Stay_In_Current_City_Years_0', 'Stay_In_Current_City_Years_1',
#                                                   'Stay_In_Current_City_Years_2', 'Stay_In_Current_City_Years_3',
#                                                   'Stay_In_Current_City_Years_4+', 'Purchase']].median()

# df.reset_index(inplace=True)

# #normalize the data
# sc = StandardScaler()
# df.loc[:, ~df.columns.isin(['User_ID', 'Product_ID', 'Product_Category_1'])] = sc.fit_transform(df.loc[:, ~df.columns.isin(['User_ID', 'Product_ID', 'Product_Category_1'])])
# df.shape # (3623, 43)


## _Feature Engineering_

In [1188]:
################## ************* ####################
# read in the data
df = pd.read_csv('black_friday_data_kaggle.csv', index_col=0)

# drop unnecessary variables with little to no predictive power for Product_Category_1
df = df.drop(['Gender', 'Age', 'Occupation',
              'City_Category', 'Marital_Status',
              'Stay_In_Current_City_Years', 'Product_Category_3'], axis = 1)

# Note: in this case we don't drop missing data, since
# doing so would reduce our test set to fewer than 1207 rows

# one hot encoding the categorical data
df = pd.get_dummies(data=df, columns=['Product_Category_2'])


# take the median of each column so that there is exactly one row per product id
df = df.groupby(['Product_ID', 'Product_Category_1']).median()

# reset the index formed by groupby:
# the two index levels, Product_ID and Product_Category_1,
# become ordinary columns again
df.reset_index(inplace=True)

# standardize the feature columns (zero mean, unit variance)
sc = StandardScaler()
feature_cols = ~df.columns.isin(['User_ID', 'Product_ID', 'Product_Category_1'])
df.loc[:, feature_cols] = sc.fit_transform(df.loc[:, feature_cols])
df.shape 

(3623, 21)
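
The aggregation step above can be seen on a tiny example. The frame below is invented for illustration, and `per_product` is a hypothetical name. It shows how several purchase rows for the same product collapse into a single row holding the median.

```python
import pandas as pd

# Invented purchase records: three rows for P1, two rows for P2
toy = pd.DataFrame({
    'Product_ID': ['P1', 'P1', 'P1', 'P2', 'P2'],
    'Product_Category_1': [5, 5, 5, 8, 8],
    'Purchase': [100, 300, 200, 50, 70],
})

# One row per (product, category) pair, holding the median of the other columns
per_product = toy.groupby(['Product_ID', 'Product_Category_1']).median()

# Turn the two group keys back into ordinary columns
per_product.reset_index(inplace=True)

print(per_product)
```

The median is less sensitive than the mean to a few unusually large or small purchases of the same product.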

## _Split the Data_:
The data is split into a training set and a set to predict based on whether Product_Category_1 is -1: rows with -1 are the unlabeled test products, and all other rows are used for training.

In [1189]:
# split the data into train and predicting set based on value of Product_Category_1
train = df.loc[df.Product_Category_1 != -1,:]
x_train = train.loc[:,~train.columns.isin(['Product_ID', 'Product_Category_1'])]
y_train = train.Product_Category_1

test = df.loc[df.Product_Category_1 == -1,:]
x_test = test.loc[:,~test.columns.isin(['Product_ID', 'Product_Category_1'])]
y_test = test.Product_Category_1
y_test.shape #(1207,)

(1207,)

In [1190]:
'''
NOTE TO TEAM: This cell is about the original full dataset.
I use it to test our models. Don't use it in your training;
otherwise, you will get 100% accuracy.
Don't mention this in the write-up either. The purpose of
using it is only to save time from repeatedly uploading files to Kaggle.

It is processed in exactly the same way as the "legitimate" dataset above.

-----------------------------------------------------------------

This is the original dataset. We picked out the
records whose product id appears in the test dataset.
This gives us the true targets of the test set,
which were changed to -1 in the test set.
'''

# original= pd.read_csv('BlackFriday.csv')
# original = pd.get_dummies(data=original, columns=['Gender', 'Age', 'Occupation',
#                                    'City_Category', 'Marital_Status',
#                                    'Stay_In_Current_City_Years',])
# original = original.drop(['Product_Category_2', 'Product_Category_3'], axis = 1)
# original = original.dropna()

# original = original.groupby(['Product_ID', 'Product_Category_1'])[['Gender_F', 'Gender_M', 
#                                                   'Age_0-17', 'Age_18-25', 'Age_26-35','Age_36-45',
#                                                   'Age_46-50', 'Age_51-55', 'Age_55+',
#                                                   'Occupation_0', 'Occupation_1', 'Occupation_2',
#                                                   'Occupation_3', 'Occupation_4', 'Occupation_5',
#                                                   'Occupation_6', 'Occupation_7', 'Occupation_8',
#                                                   'Occupation_9', 'Occupation_10', 'Occupation_11',
#                                                   'Occupation_12', 'Occupation_13', 'Occupation_14',
#                                                   'Occupation_15', 'Occupation_16', 'Occupation_17',
#                                                   'Occupation_18', 'Occupation_19', 'Occupation_20',
#                                                   'City_Category_A', 'City_Category_B', 'City_Category_C', 
#                                                   'Marital_Status_0', 'Marital_Status_1',
#                                                   'Stay_In_Current_City_Years_0', 'Stay_In_Current_City_Years_1',
#                                                   'Stay_In_Current_City_Years_2', 'Stay_In_Current_City_Years_3',
#                                                   'Stay_In_Current_City_Years_4+', 'Purchase']].median()

############ ******** #############


original = pd.read_csv('BlackFriday.csv')

# drop the same variables as in the competition data above
original = original.drop(['Gender', 'Age', 'Occupation',
                          'City_Category', 'Marital_Status',
                          'Stay_In_Current_City_Years', 'Product_Category_3'], axis = 1)

# original = original.dropna()

# one hot encoding the categorical data
original = pd.get_dummies(data=original, columns=['Product_Category_2'])

# take the median so that there is exactly one row per product id
original = original.groupby(['Product_ID', 'Product_Category_1']).median()
original.reset_index(inplace=True)
print(original.shape)

# standardize the feature columns
# (note: fit_transform re-fits the scaler on this dataset)
original_features = ~original.columns.isin(['Product_ID', 'Product_Category_1'])
original.loc[:, original_features] = sc.fit_transform(original.loc[:, original_features])

x_original = original.loc[:, original_features]
y_original = original.Product_Category_1

# true targets and features for the products that appear in the test set
in_test = original.Product_ID.isin(test.Product_ID)
real_y_test = original.loc[in_test, ['Product_Category_1']]
real_x_test = original.loc[in_test, original_features]



(3623, 21)
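
One detail in the cell above: `sc.fit_transform` is called again on `original`, which re-estimates the mean and standard deviation instead of reusing those learned from `df`. A common alternative, sketched below on invented numbers, is to fit the scaler once and call only `transform` on later data, so that both sets share one scale. The arrays and variable names here are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented single-feature data
train_values = np.array([[10.0], [20.0], [30.0]])
later_values = np.array([[20.0], [40.0]])

# Fit once on the training values (mean 20, standard deviation about 8.165)
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_values)

# Reuse the training statistics for the later data
later_scaled = scaler.transform(later_values)

# For comparison: re-fitting on the later data gives a different scale
refit_scaled = StandardScaler().fit_transform(later_values)

print(later_scaled.ravel(), refit_scaled.ravel())
```

With the shared scaler, the value 20 maps to 0 in both sets. After re-fitting, the same value maps to -1, so a model trained on one scale would see shifted inputs.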


## _First Model_:  Random Forest

 - We apply random forest as our first technique due to its conceptual simplicity.
 - First, we use grid search, which provides a systematic way to train our model with many different combinations of hyperparameters.
 - Grid search can also run cross-validation and report the mean accuracy for each combination.
 - From this, we narrow down our choice of hyperparameters and fit the model afterwards.
 - Random forest provides good results, but we will soon get better results from other techniques.
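
The grid search workflow described above can be sketched on synthetic data as follows. The dataset from `make_classification`, the three candidate values for `n_estimators`, and the name `summary` are assumptions made for this example and are not the settings used in the project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the real training data
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Every value in the grid is evaluated with 5-fold cross-validation
param_grid = {'n_estimators': [10, 20, 40]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X, y)

# Mean and spread of the validation accuracy for each candidate
summary = pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score', 'std_test_score']]
print(summary)
print(grid.best_params_)
```

When the mean scores of neighbouring candidates differ by less than their standard deviations, the choice between them is within the noise of cross-validation.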

In [1141]:
num_trees = list(range(60,80))
param_grid = dict(n_estimators = num_trees)
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv = 10, return_train_score=False)
grid.fit(x_train, y_train)
rf_result = pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
rf_result



| | mean_test_score | std_test_score | params |
|---|---|---|---|
| 0 | 0.829341 | 0.05957 | {'n_estimators': 60} |
| 1 | 0.820359 | 0.054512 | {'n_estimators': 61} |
| 2 | 0.829341 | 0.052714 | {'n_estimators': 62} |
| 3 | 0.814371 | 0.055175 | {'n_estimators': 63} |
| 4 | 0.820359 | 0.077548 | {'n_estimators': 64} |
| 5 | 0.832335 | 0.045487 | {'n_estimators': 65} |
| 6 | 0.826347 | 0.064619 | {'n_estimators': 66} |
| 7 | 0.832335 | 0.051307 | {'n_estimators': 67} |
| 8 | 0.823353 | 0.05936 | {'n_estimators': 68} |
| 9 | 0.835329 | 0.060629 | {'n_estimators': 69} |


In [1191]:
# first attempt: fit the random forest
clf_rf = RandomForestClassifier(n_estimators=70,random_state=0)
clf_rf.fit(X=x_train, y=y_train)
cv_scores = cross_val_score(clf_rf, X=x_train, y=y_train, cv = 10, scoring='accuracy')
print(np.mean(cv_scores))
clf_rf.score(real_x_test, real_y_test)



0.8251048935956906


0.6876553438276719

In [1192]:
# predict the data and write result to csv
# pred = clf_rf.predict(x_test)
# result = pd.DataFrame({'Product_ID' : test.Product_ID, 'Product_Category_1' : pred})
# result = result[['Product_ID', 'Product_Category_1']]
# result.to_csv('Prediction.csv', index=False)
# result.shape

(1207, 2)

## _Bagging_
To our surprise, a method that is similar to random forest can actually improve our accuracy due to its versatility.
 - We first use grid search to determine an optimal choice of parameters for the decision tree.
 - Then, once again we use grid search to determine the choice of max_features and n_estimators for the bagging algorithm.
 - Finally, we fit the model to the training data and achieve a high score based on the Kaggle evaluation.
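
As a rough sketch of the idea, the snippet below compares a single decision tree with a bagged ensemble of the same tree using cross-validation. The synthetic data and the specific settings are assumptions for illustration, so the scores say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real training data
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# A single tree, and 50 copies of it trained on bootstrap samples,
# each copy seeing 80% of the features
tree = DecisionTreeClassifier(max_depth=8, random_state=0)
bagged = BaggingClassifier(tree, n_estimators=50, max_features=0.8,
                           random_state=0)

tree_acc = cross_val_score(tree, X, y, cv=5, scoring='accuracy').mean()
bagged_acc = cross_val_score(bagged, X, y, cv=5, scoring='accuracy').mean()
print(tree_acc, bagged_acc)
```

The base tree is passed positionally because the keyword for it was renamed from `base_estimator` to `estimator` in newer scikit-learn releases.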


In [1070]:
clf_dt = DecisionTreeClassifier(random_state=0)

parameters = [{'max_depth': list(np.arange(4,15)),
              'criterion': ['gini', 'entropy']}]

grid_search = GridSearchCV(estimator = clf_dt,
                           param_grid=parameters,
                           scoring='accuracy',
                           cv = 10, n_jobs=-1)

grid_search.fit(x_train, y_train)

rf_result = pd.DataFrame(grid_search.cv_results_)[['mean_test_score', 'std_test_score', 'params']]

rf_result



| | mean_test_score | std_test_score | params |
|---|---|---|---|
| 0 | 0.602649 | 0.027726 | {'max_depth': 4, 'criterion': 'gini'} |
| 1 | 0.746275 | 0.025476 | {'max_depth': 5, 'criterion': 'gini'} |
| 2 | 0.795944 | 0.027529 | {'max_depth': 6, 'criterion': 'gini'} |
| 3 | 0.796772 | 0.017249 | {'max_depth': 7, 'criterion': 'gini'} |
| 4 | 0.801325 | 0.020834 | {'max_depth': 8, 'criterion': 'gini'} |
| 5 | 0.808775 | 0.020007 | {'max_depth': 9, 'criterion': 'gini'} |
| 6 | 0.808361 | 0.026896 | {'max_depth': 10, 'criterion': 'gini'} |
| 7 | 0.817053 | 0.029192 | {'max_depth': 11, 'criterion': 'gini'} |
| 8 | 0.81043 | 0.031391 | {'max_depth': 12, 'criterion': 'gini'} |
| 9 | 0.805877 | 0.027892 | {'max_depth': 13, 'criterion': 'gini'} |


In [1193]:
bg = BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=8))

parameters = [{'max_features': [0.8, 0.9],
              'n_estimators': np.arange(48,53)}]

grid_search = GridSearchCV(estimator = bg,
                           param_grid=parameters,
                           scoring='accuracy',
                           cv = 10, n_jobs=-1)

grid_search.fit(x_train, y_train)

rf_result = pd.DataFrame(grid_search.cv_results_)[['mean_test_score', 'std_test_score', 'params']]

rf_result



| | mean_test_score | std_test_score | params |
|---|---|---|---|
| 0 | 0.86548 | 0.018904 | {'max_features': 0.8, 'n_estimators': 48} |
| 1 | 0.860927 | 0.023222 | {'max_features': 0.8, 'n_estimators': 49} |
| 2 | 0.861755 | 0.023749 | {'max_features': 0.8, 'n_estimators': 50} |
| 3 | 0.856788 | 0.025901 | {'max_features': 0.8, 'n_estimators': 51} |
| 4 | 0.858858 | 0.020134 | {'max_features': 0.8, 'n_estimators': 52} |
| 5 | 0.861755 | 0.023412 | {'max_features': 0.9, 'n_estimators': 48} |
| 6 | 0.854305 | 0.024702 | {'max_features': 0.9, 'n_estimators': 49} |
| 7 | 0.859272 | 0.022137 | {'max_features': 0.9, 'n_estimators': 50} |
| 8 | 0.860099 | 0.024248 | {'max_features': 0.9, 'n_estimators': 51} |
| 9 | 0.860099 | 0.023777 | {'max_features': 0.9, 'n_estimators': 52} |


In [1194]:
# Bagging
bg = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                       max_features=0.8,
                       n_estimators=50,
                       random_state=2)
bg.fit(x_train, y_train)
bg.score(real_x_test, real_y_test)
# the best so far


0.816072908036454

In [1195]:
# seeing that the result is the best so far, we write it to csv
# predict the data and write result to csv
pred = bg.predict(x_test)
result = pd.DataFrame({'Product_ID' : test.Product_ID, 'Product_Category_1' : pred})
result = result[['Product_ID', 'Product_Category_1']]
result.to_csv('Prediction.csv', index=False)
result.shape

(1207, 2)

## _Other techniques_:
We also tried some other models, such as:
- Bagging of Neural Networks
- AdaBoost
- Voting
- XGBoost

To our disappointment, bagging of neural networks did not beat bagging of decision trees. A likely explanation is that bagging mainly reduces variance, so it helps most with unstable learners such as unpruned decision trees, while the small network used here appears to be limited by its own fit rather than by variance. <br>
Boosting with decision trees delivered fairly good results, but not as strong as bagging of trees as shown above. <br>
It is also worth noting that XGBoost appears to give highly accurate predictions. However, the result contains an empty array, which is not acceptable for Kaggle.

In [1086]:
# attempt neural networks
# build nn model
clf_nn = MLPClassifier(solver = 'lbfgs', activation = 'logistic', max_iter=40,
                    hidden_layer_sizes = 10, random_state = 0)
# fit the nn_model
clf_nn.fit(x_train, y_train)

# get real accuracy
print(clf_nn.score(real_x_test, real_y_test))

#apply bagging for neural networks 
bg = BaggingClassifier(base_estimator=clf_nn, max_features=0.9, n_estimators=55)

bg.fit(x_train, y_train)

print(bg.score(real_x_test, real_y_test))

# after submitting to kaggle, we see that even though bagging improves on the plain nn,
# it still does not beat bagging of Decision Trees
# a likely reason is that bagging mainly reduces variance, which benefits
# unstable learners such as deep decision trees more than this small network

0.5642087821043911
0.5932062966031483


In [1090]:
# Boosting
adb = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=50, learning_rate=1)
adb.fit(x_train, y_train)
print(adb.score(real_x_test, real_y_test))


0.7937033968516984


In [1076]:
# Voting 
lr = LogisticRegression()
dt = DecisionTreeClassifier()
svm = SVC(kernel = 'rbf')
evc = VotingClassifier(estimators=[('dt', dt), ('svm', svm), ('lr', lr)])
evc.fit(x_train, y_train)
evc.score(real_x_test, real_y_test)



0.5741507870753936

In [1198]:
# xgboost
clf_xgb = XGBClassifier(max_depth=10, random_state=0)
clf_xgb.fit(x_train, y_train)
print(clf_xgb.score(real_x_test, real_y_test))
print('#########')

0.8044739022369511
#########


