## Problem Statement
Data-driven analytics can give e-commerce companies in-depth insight into the factors that affect product sales and into how customer characteristics shape purchasing habits.

Knowing a customer's gender helps e-commerce companies understand customer demand, habits, concerns, perceptions, and interests.

However, the gender of users is generally unavailable on e-commerce platforms. To address this gap, the aim here is to predict the gender of e-commerce users from their product viewing records.

Data source: PAKDD 2015 Conference

### Data Dictionary
A CSV file containing the product viewing data, with gender as the label.



```
Variable      Definition
session_id    Session ID
startTime     Start time of the session
endTime       End time of the session
ProductList   List of products viewed
gender        (Target) male/female
```




`ProductList` contains the products viewed by the user in the given session. Each product is encoded as its category, sub-category, sub-sub-category, and product ID, separated by slashes (for example `A00002/B00003/C00006/D28435/`). Consecutive products are separated by semicolons.
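
To make the format concrete, here is a minimal sketch of how one `ProductList` string could be split into its levels. The helper name `parse_product_list` and the two-product sample string are illustrative assumptions (the codes are borrowed from sample rows shown in this notebook, but the combination is made up); this is not the parsing used for the features below.

```python
def parse_product_list(product_list):
    """Split a ProductList string into one tuple of level codes per product."""
    products = []
    for item in product_list.split(";"):
        # Each product ends with a trailing slash, so drop empty fragments.
        levels = tuple(part for part in item.split("/") if part)
        if levels:
            products.append(levels)
    return products


# Illustrative session with two viewed products (hypothetical combination).
sample = "A00002/B00003/C00006/D28435/;A00002/B00004/C00018/D10284/"
parsed = parse_product_list(sample)
for category, sub_category, sub_sub_category, product in parsed:
    print(category, sub_category, sub_sub_category, product)
```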

### Evaluation Metric
Submissions are evaluated on accuracy: the fraction of test-set sessions whose predicted gender matches the observed gender.
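
As a quick illustration of the metric, accuracy can be computed by hand as below. The function name `accuracy` and the labels are made up for the example; the competition's own scorer may differ in details such as input validation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the observed label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("accuracy is undefined for empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels purely for illustration.
observed = ["female", "male", "female", "female"]
predicted = ["female", "female", "female", "male"]
score = accuracy(observed, predicted)
print(score)  # 2 correct out of 4
```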


In [0]:
# Importing all the necessary libraries

import pickle
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

In [0]:
# loading train data
# Loading data from the Google drive
train_df = pd.read_csv('/content/drive/My Drive/Kaggle/Janata Hack/Part 2 - E commerce/train_8wry4cB.csv')

In [0]:
# loading test data
test_df = pd.read_csv('/content/drive/My Drive/Kaggle/Janata Hack/Part 2 - E commerce/test_Yix80N0.csv')

In [0]:
# Custom function to extract features from a ProductList string
def find_product_feature(product_list):
    '''
    Return a tuple of features derived from one ProductList entry.

    input argument:
    product_list: ProductList string for one row (one session)

    return:
    count_item: number of products viewed,
    product_category_lv1: category (level A) of the first product,
    product_category_lv2: sub-category (level B) of the first product,
    unique_lv1: number of unique categories (level A) viewed,
    unique_lv2: number of unique sub-categories (level B) viewed,
    most_freq_lv1: most frequently viewed category (level A)
    '''
    if ";" in product_list:
        # Several products in the session: one entry per product
        prd_lst = product_list.split(";")
        count_item = len(prd_lst)
        product_category_lv1 = prd_lst[0].split("/")[0]
        product_category_lv2 = prd_lst[0].split("/")[1]
        sub_category_lv1 = []
        sub_category_lv2 = []
        for item in prd_lst:
            sub_category_lv1.append(item.split("/")[0])
            sub_category_lv2.append(item.split("/")[1])
        unique_lv1 = len(set(sub_category_lv1))
        unique_lv2 = len(set(sub_category_lv2))
        most_freq_lv1 = max(sub_category_lv1, key=Counter(sub_category_lv1).get)
    else:
        # Single product in the session
        lv_lst = product_list.split("/")
        product_category_lv1 = lv_lst[0]
        product_category_lv2 = lv_lst[1]
        count_item = 1
        unique_lv1 = 1
        unique_lv2 = 1
        most_freq_lv1 = product_category_lv1
    return (count_item, product_category_lv1, product_category_lv2, unique_lv1, unique_lv2, most_freq_lv1)
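
Since `split(";")` returns a one-element list when there is no semicolon, the two branches can be collapsed. The sketch below is a compact variant under that assumption; the name `product_features_compact` and the sample session are hypothetical, and this is not the function used to produce the outputs in this notebook.

```python
from collections import Counter


def product_features_compact(product_list):
    """Compact variant of the per-session feature tuple (illustrative)."""
    items = product_list.split(";")
    level_a = [item.split("/")[0] for item in items]
    level_b = [item.split("/")[1] for item in items]
    return (
        len(items),                               # products viewed
        level_a[0],                               # category of first product
        level_b[0],                               # sub-category of first product
        len(set(level_a)),                        # unique categories
        len(set(level_b)),                        # unique sub-categories
        max(level_a, key=Counter(level_a).get),   # most frequent category
    )


# Hypothetical two-product session built from codes seen in the data.
sample = "A00002/B00003/C00006/D28435/;A00002/B00004/C00018/D10284/"
features = product_features_compact(sample)
print(features)
```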

In [0]:
#Feature Extraction:
new_col = ('NumProduct','FirstA','FirstB','UniqueA','UniqueB','MostA')

In [0]:
# Applying custom function on every row of train data

new_col_lst = train_df['ProductList'].apply(lambda x: find_product_feature(x))
new_col_lst

0        (4, A00002, B00003, 1, 1, A00002)
1        (7, A00001, B00009, 1, 1, A00001)
2        (1, A00002, B00001, 1, 1, A00002)
3        (3, A00002, B00004, 1, 1, A00002)
4        (2, A00001, B00001, 1, 1, A00001)
                       ...                
10495    (2, A00002, B00002, 1, 1, A00002)
10496    (1, A00006, B00030, 1, 1, A00006)
10497    (1, A00002, B00002, 1, 1, A00002)
10498    (2, A00003, B00012, 1, 1, A00003)
10499    (4, A00002, B00001, 2, 3, A00002)
Name: ProductList, Length: 10500, dtype: object

In [0]:
# Applying custom function on every row of test data as well

new_test_col_lst = test_df['ProductList'].apply(lambda x: find_product_feature(x))
new_test_col_lst

0       (1, A00002, B00003, 1, 1, A00002)
1       (1, A00002, B00005, 1, 1, A00002)
2       (1, A00002, B00002, 1, 1, A00002)
3       (4, A00002, B00003, 1, 1, A00002)
4       (1, A00002, B00001, 1, 1, A00002)
                      ...                
4495    (3, A00001, B00031, 1, 1, A00001)
4496    (2, A00002, B00002, 1, 1, A00002)
4497    (9, A00002, B00007, 1, 1, A00002)
4498    (1, A00001, B00031, 1, 1, A00001)
4499    (2, A00002, B00002, 1, 1, A00002)
Name: ProductList, Length: 4500, dtype: object

In [0]:
# Creating dataframe from the features for train and test data

new_col_df = pd.DataFrame(new_col_lst.tolist(),columns =new_col)
new_test_col_df = pd.DataFrame(new_test_col_lst.tolist(),columns =new_col)

In [0]:
# Concating the data with original dataframe

data = pd.concat([train_df, new_col_df], axis=1)
test_data = pd.concat([test_df, new_test_col_df], axis=1)

In [0]:
# checking
data.head()

Unnamed: 0,session_id,startTime,endTime,ProductList,gender,NumProduct,FirstA,FirstB,UniqueA,UniqueB,MostA
0,u16159,15/12/14 18:11,15/12/14 18:12,A00002/B00003/C00006/D28435/;A00002/B00003/C00...,female,4,A00002,B00003,1,1,A00002
1,u10253,16/12/14 14:35,16/12/14 14:41,A00001/B00009/C00031/D29404/;A00001/B00009/C00...,male,7,A00001,B00009,1,1,A00001
2,u19037,01/12/14 15:58,01/12/14 15:58,A00002/B00001/C00020/D16944/,female,1,A00002,B00001,1,1,A00002
3,u14556,23/11/14 2:57,23/11/14 3:00,A00002/B00004/C00018/D10284/;A00002/B00004/C00...,female,3,A00002,B00004,1,1,A00002
4,u24295,17/12/14 16:44,17/12/14 16:46,A00001/B00001/C00012/D30805/;A00001/B00001/C00...,male,2,A00001,B00001,1,1,A00001


In [0]:
test_data.head()

Unnamed: 0,session_id,startTime,endTime,ProductList,NumProduct,FirstA,FirstB,UniqueA,UniqueB,MostA
0,u12112,08/12/14 13:36,08/12/14 13:36,A00002/B00003/C00006/D19956/,1,A00002,B00003,1,1,A00002
1,u19725,19/12/14 13:52,19/12/14 13:52,A00002/B00005/C00067/D02026/,1,A00002,B00005,1,1,A00002
2,u11795,01/12/14 10:44,01/12/14 10:44,A00002/B00002/C00004/D12538/,1,A00002,B00002,1,1,A00002
3,u22639,08/12/14 20:19,08/12/14 20:22,A00002/B00003/C00079/D22781/;A00002/B00003/C00...,4,A00002,B00003,1,1,A00002
4,u18034,15/12/14 19:33,15/12/14 19:33,A00002/B00001/C00010/D23419/,1,A00002,B00001,1,1,A00002


In [0]:
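def quater_of_the_day(date_of_element):
    '''
    Return the quarter of the day (6-hour block) for an hour value.

    Hours 1-6 map to 0, 7-12 to 1, 13-18 to 2 and 19-23 to 3.
    Note: the rule expects hours in 1..24, so hour 0 (midnight) maps to -1.
    '''
    if((date_of_element % 6) == 0):
        return (date_of_element // 6) - 1
    else:
        return date_of_element // 6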
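
To see what this rule does for every hour that `dt.hour` can produce (0 to 23), the sketch below tabulates the mapping. It re-implements the same rule under the hypothetical name `quarter_of_day` so the block runs on its own. Note that exact multiples of 6 fall into the previous block, which also sends hour 0 to -1.

```python
def quarter_of_day(hour):
    """Same rule as above: exact multiples of 6 fall into the previous block."""
    if hour % 6 == 0:
        return hour // 6 - 1
    return hour // 6


mapping = {hour: quarter_of_day(hour) for hour in range(24)}
for hour, quarter in mapping.items():
    print(f"hour {hour:2d} -> quarter {quarter}")
```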

In [0]:
def time_feature_extraction(df_object):
    '''
    Extract time features from the dataframe
    '''
    df_object['startTime'] = pd.to_datetime(df_object['startTime'], dayfirst=True)
    df_object['endTime'] = pd.to_datetime(df_object['endTime'], dayfirst=True)
    # Session duration in minutes (float). total_seconds() is used because
    # newer pandas releases may reject astype('timedelta64[m]')
    df_object['duration'] = (df_object['endTime'] - df_object['startTime']).dt.total_seconds() / 60
    df_object['weekday'] = df_object['startTime'].dt.dayofweek
    df_object['hour_24h'] = df_object['startTime'].dt.hour
    df_object['quater_of_the day'] = df_object['hour_24h'].apply(lambda x: quater_of_the_day(x))
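
For a single session, the same time features can be sketched with the standard library alone. The function name `session_time_features` and the format constant `FMT` are assumptions for this example; the timestamps are taken from the first sample row of the training data.

```python
from datetime import datetime

FMT = "%d/%m/%y %H:%M"  # day-first format seen in the raw startTime/endTime columns


def session_time_features(start_text, end_text):
    """Return (duration_minutes, weekday, hour) for one session."""
    start = datetime.strptime(start_text, FMT)
    end = datetime.strptime(end_text, FMT)
    duration_minutes = (end - start).total_seconds() / 60
    # weekday(): Monday is 0 and Sunday is 6, matching pandas' dt.dayofweek
    return duration_minutes, start.weekday(), start.hour


features = session_time_features("15/12/14 18:11", "15/12/14 18:12")
print(features)
```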

In [0]:
time_feature_extraction(data)

In [0]:
time_feature_extraction(test_data)

In [0]:
data.sample(9)

Unnamed: 0,session_id,startTime,endTime,ProductList,gender,NumProduct,FirstA,FirstB,UniqueA,UniqueB,MostA,duration,weekday,hour_24h,quater_of_the day
3223,u12318,2014-11-14 17:31:00,2014-11-14 17:33:00,A00002/B00016/C00044/D00958/;A00002/B00016/C00...,female,2,A00002,B00016,1,1,A00002,2.0,4,17,2
1458,u19007,2014-11-26 22:24:00,2014-11-26 22:25:00,A00002/B00002/C00007/D00582/,female,1,A00002,B00002,1,1,A00002,1.0,2,22,3
10046,u19112,2014-12-13 14:39:00,2014-12-13 14:42:00,A00003/B00026/C00286/D26081/;A00003/B00079/C00...,female,3,A00003,B00026,1,2,A00003,3.0,5,14,2
9854,u23467,2014-12-13 12:47:00,2014-12-13 12:47:00,A00002/B00002/C00002/D00499/,female,1,A00002,B00002,1,1,A00002,0.0,5,12,1
8741,u24105,2014-12-18 15:18:00,2014-12-18 15:23:00,A00002/B00003/C00005/D03126/;A00002/B00004/C00...,female,4,A00002,B00003,2,2,A00002,5.0,3,15,2
10219,u14232,2014-12-09 09:23:00,2014-12-09 09:24:00,A00002/B00001/C00059/D14203/;A00002/B00001/C00...,female,2,A00002,B00001,1,1,A00002,1.0,1,9,1
576,u21218,2014-12-09 13:31:00,2014-12-09 13:31:00,A00001/B00001/C00019/D01508/,male,1,A00001,B00001,1,1,A00001,0.0,1,13,2
2442,u11801,2014-12-02 09:17:00,2014-12-02 09:18:00,A00002/B00002/C00004/D17591/,female,1,A00002,B00002,1,1,A00002,1.0,1,9,1
1644,u18469,2014-12-17 15:14:00,2014-12-17 15:14:00,A00003/B00012/C00028/D00281/,female,1,A00003,B00012,1,1,A00003,0.0,2,15,2


In [0]:
test_data.sample(9)

Unnamed: 0,session_id,startTime,endTime,ProductList,NumProduct,FirstA,FirstB,UniqueA,UniqueB,MostA,duration,weekday,hour_24h,quater_of_the day
778,u13504,2014-12-08 11:06:00,2014-12-08 11:06:00,A00002/B00002/C00003/D14167/,1,A00002,B00002,1,1,A00002,0.0,0,11,1
3177,u22675,2014-12-08 22:03:00,2014-12-08 22:03:00,A00002/B00002/C00003/D22944/,1,A00002,B00002,1,1,A00002,0.0,0,22,3
628,u24410,2014-12-18 10:11:00,2014-12-18 10:11:00,A00002/B00001/C00059/D24006/,1,A00002,B00001,1,1,A00002,0.0,3,10,1
757,u23008,2014-12-09 19:31:00,2014-12-09 19:31:00,A00002/B00002/C00003/D11934/,1,A00002,B00002,1,1,A00002,0.0,1,19,3
951,u24387,2014-12-18 09:23:00,2014-12-18 10:38:00,A00003/B00028/C00086/D31294/;A00003/B00028/C00...,5,A00003,B00028,1,2,A00003,75.0,3,9,1
2556,u14980,2014-11-16 22:11:00,2014-11-16 22:11:00,A00001/B00001/C00019/D04658/,1,A00001,B00001,1,1,A00001,0.0,6,22,3
2411,u15484,2014-12-01 09:33:00,2014-12-01 09:38:00,A00002/B00002/C00007/D16394/;A00002/B00002/C00...,4,A00002,B00002,1,1,A00002,5.0,0,9,1
1740,u16168,2014-12-12 12:21:00,2014-12-12 12:21:00,A00002/B00006/C00015/D01259/,1,A00002,B00006,1,1,A00002,0.0,4,12,1
247,u19002,2014-11-26 21:03:00,2014-11-26 21:03:00,A00002/B00001/C00059/D12608/,1,A00002,B00001,1,1,A00002,0.0,2,21,3


In [0]:
# Dropping columns that are no longer needed

data.drop(['session_id', 'startTime', 'endTime', 'ProductList'], axis=1, inplace=True)
test_data.drop(['session_id', 'startTime', 'endTime', 'ProductList'], axis=1, inplace=True)

In [0]:
data.head()

Unnamed: 0,gender,NumProduct,FirstA,FirstB,UniqueA,UniqueB,MostA,duration,weekday,hour_24h,quater_of_the day
0,female,4,A00002,B00003,1,1,A00002,1.0,0,18,2
1,male,7,A00001,B00009,1,1,A00001,6.0,1,14,2
2,female,1,A00002,B00001,1,1,A00002,0.0,0,15,2
3,female,3,A00002,B00004,1,1,A00002,3.0,6,2,0
4,male,2,A00001,B00001,1,1,A00001,2.0,2,16,2


In [0]:
test_data.head()

Unnamed: 0,NumProduct,FirstA,FirstB,UniqueA,UniqueB,MostA,duration,weekday,hour_24h,quater_of_the day
0,1,A00002,B00003,1,1,A00002,0.0,0,13,2
1,1,A00002,B00005,1,1,A00002,0.0,4,13,2
2,1,A00002,B00002,1,1,A00002,0.0,0,10,1
3,4,A00002,B00003,1,1,A00002,3.0,0,20,3
4,1,A00002,B00001,1,1,A00002,0.0,0,19,3


In [0]:
data['gender'] = data['gender'].astype('category')

In [0]:
# Encoding target column
label_encoder_gender = LabelEncoder()
label_encoder_gender.fit(data['gender'].unique())

LabelEncoder()

In [0]:
# checking number of classes to encode

label_encoder_gender.classes_

array(['female', 'male'], dtype=object)

In [0]:
# Transforming target column

data['encode_gender'] = label_encoder_gender.transform(data['gender'])
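
Conceptually, the label encoding step assigns integers to the sorted distinct labels, which is why `female` becomes 0 and `male` becomes 1. A minimal stdlib sketch of that idea follows; the helper names `fit_label_mapping` and `invert_mapping` are hypothetical, and this is not scikit-learn's implementation.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


def invert_mapping(mapping):
    """Build the reverse lookup used to turn integers back into labels."""
    return {index: label for label, index in mapping.items()}


genders = ["female", "male", "female", "female", "male"]
mapping = fit_label_mapping(genders)
encoded = [mapping[g] for g in genders]
decoded = [invert_mapping(mapping)[i] for i in encoded]
print(mapping, encoded)
```

The reverse lookup plays the role that `inverse_transform` has later, when predictions are turned back into `male`/`female` for the submission file.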

In [0]:
data.head()

Unnamed: 0,gender,NumProduct,FirstA,FirstB,UniqueA,UniqueB,MostA,duration,weekday,hour_24h,quater_of_the day,encode_gender
0,female,4,A00002,B00003,1,1,A00002,1.0,0,18,2,0
1,male,7,A00001,B00009,1,1,A00001,6.0,1,14,2,1
2,female,1,A00002,B00001,1,1,A00002,0.0,0,15,2,0
3,female,3,A00002,B00004,1,1,A00002,3.0,6,2,0,0
4,male,2,A00001,B00001,1,1,A00001,2.0,2,16,2,1


In [0]:
# Dropping gender column
data.drop('gender', axis=1, inplace=True)

In [0]:
data.head()

Unnamed: 0,NumProduct,FirstA,FirstB,UniqueA,UniqueB,MostA,duration,weekday,hour_24h,quater_of_the day,encode_gender
0,4,A00002,B00003,1,1,A00002,1.0,0,18,2,0
1,7,A00001,B00009,1,1,A00001,6.0,1,14,2,1
2,1,A00002,B00001,1,1,A00002,0.0,0,15,2,0
3,3,A00002,B00004,1,1,A00002,3.0,6,2,0,0
4,2,A00001,B00001,1,1,A00001,2.0,2,16,2,1


In [0]:
# Creating the OneHotEncoder object to perform one-hot encoding on df
# (scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`)
one_hot_encoder =  OneHotEncoder(sparse=False, handle_unknown='ignore')

In [0]:
# Target column
y = data.pop('encode_gender')

In [0]:
data.head()

Unnamed: 0,NumProduct,FirstA,FirstB,UniqueA,UniqueB,MostA,duration,weekday,hour_24h,quater_of_the day
0,4,A00002,B00003,1,1,A00002,1.0,0,18,2
1,7,A00001,B00009,1,1,A00001,6.0,1,14,2
2,1,A00002,B00001,1,1,A00002,0.0,0,15,2
3,3,A00002,B00004,1,1,A00002,3.0,6,2,0
4,2,A00001,B00001,1,1,A00001,2.0,2,16,2


In [0]:
y.head()

0    0
1    1
2    0
3    0
4    1
Name: encode_gender, dtype: int64

In [0]:
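# Fit the encoder on the training features and transform them
# (every column, numeric ones included, is treated as categorical)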
data_one_hot_encode = one_hot_encoder.fit_transform(data)

In [0]:
data_one_hot_encode

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       ...,
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [0]:
data_one_hot_encode.shape

(10500, 351)

In [0]:
test_data_ohe = one_hot_encoder.transform(test_data)

In [0]:
test_data_ohe.shape

(4500, 351)
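
The test matrix has the same 351 columns as the training matrix because the encoder reuses the categories learned from the training data, and `handle_unknown='ignore'` encodes unseen values as all zeros. Here is a small stdlib sketch of that behaviour. The helper names and the toy rows are hypothetical (in particular `A00009` is an invented unseen code), and this is a simplification rather than scikit-learn's implementation.

```python
def fit_categories(rows):
    """Collect the sorted distinct values of every column."""
    n_columns = len(rows[0])
    return [sorted({row[i] for row in rows}) for i in range(n_columns)]


def one_hot(row, categories):
    """Encode one row; values not seen during fitting become all zeros."""
    encoded = []
    for value, known in zip(row, categories):
        encoded.extend(1.0 if value == candidate else 0.0 for candidate in known)
    return encoded


train_rows = [("A00002", 1), ("A00001", 2), ("A00002", 4)]
categories = fit_categories(train_rows)
seen = one_hot(("A00001", 4), categories)
unseen = one_hot(("A00009", 4), categories)  # "A00009" never appeared in training
print(seen)
print(unseen)
```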

In [0]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 500, num = 5)]

# Number of features to consider at every split
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 50, num = 5)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid for Random search cv
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

# checking grid
random_grid

{'bootstrap': [True, False],
 'max_depth': [5, 16, 27, 38, 50, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [20, 140, 260, 380, 500]}
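
To get a feel for how much of this grid a randomized search covers, the sketch below counts the possible combinations and compares them with the search settings used in the next cell (`n_iter = 100`, `cv = 3`). The variable names are my own; the values are copied from the grid above.

```python
from math import prod

# Candidate values copied from the grid shown above.
grid = {
    "n_estimators": [20, 140, 260, 380, 500],
    "max_features": ["auto", "sqrt"],
    "max_depth": [5, 16, 27, 38, 50, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

total_combinations = prod(len(values) for values in grid.values())
n_iter, n_folds = 100, 3  # settings used by the randomized search in this notebook
sampled_fraction = n_iter / total_combinations
total_fits = n_iter * n_folds

print(total_combinations, round(sampled_fraction, 3), total_fits)
```

So the search tries roughly one in eleven of the possible combinations, at a cost of 300 model fits.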

In [0]:
# Use the random grid to search for best hyperparameters

# First create the base model to tune
rfc = RandomForestClassifier()

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rfc, 
                               param_distributions = random_grid,
                               n_iter = 100, 
                               cv = 3, 
                               verbose=1, 
                               random_state=42, 
                               n_jobs = -1)

# Fit the random search model
rf_random.fit(data_one_hot_encode, y)



RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [0]:
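# Checking best params of RandomizedSearchCV
# (note: n_estimators=1600 below is not in the grid defined above, so this
# output appears to come from a run with a different grid)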

rf_random.best_params_

{'bootstrap': True,
 'max_depth': None,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 10,
 'n_estimators': 1600}

In [0]:
# Checking Best score achieved
rf_random.best_score_

0.8723809523809525

In [0]:
# Best estimator
rf_random.best_estimator_

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, n_estimators=1600,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [0]:
# Predicting on test data
rf_random.predict(test_data_ohe)

array([0, 0, 0, ..., 0, 1, 0])

In [0]:
# Preparing for submission of results

test_df_for_submission = pd.read_csv('/content/drive/My Drive/Kaggle/Janata Hack/Part 2 - E commerce/test_Yix80N0.csv', usecols=['session_id'])
s1 = pd.Series(test_df_for_submission['session_id'])
s2 = pd.Series(label_encoder_gender.inverse_transform(rf_random.predict(test_data_ohe)), name='gender')
df = pd.concat([s1, s2], axis=1)
df

Unnamed: 0,session_id,gender
0,u12112,female
1,u19725,female
2,u11795,female
3,u22639,female
4,u18034,female
...,...,...
4495,u23966,male
4496,u20527,female
4497,u13253,female
4498,u17094,male


In [0]:
# saving to csv
df.to_csv('submission.csv', index=False)

The model above scored 0.8583 on the public leaderboard and 0.8711 on the private leaderboard.

In [0]:
# Saving the model for future use (the context manager closes the file)
filename = 'finalized_model.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(rf_random, model_file)
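
For completeness, here is a hedged sketch of the full save-and-reload round trip. A plain dictionary stands in for the fitted search object so the block runs without the training data; the temporary path is an assumption for the example.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted search object.
model_stand_in = {"name": "rf_random", "best_score": 0.8723809523809525}

path = os.path.join(tempfile.mkdtemp(), "finalized_model.sav")

# Context managers close the file even if dump/load raises.
with open(path, "wb") as handle:
    pickle.dump(model_stand_in, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == model_stand_in)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.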

In [0]:
# Using an XGBoost classifier with default parameters
xgb_model = XGBClassifier()

In [0]:
stratifiedKFold_validation = StratifiedKFold(random_state=2, n_splits=5, shuffle=True)

In [0]:
val_score = []
for train_ix, test_ix in stratifiedKFold_validation.split(data_one_hot_encode, y):
    train_X, test_X = data_one_hot_encode[train_ix], data_one_hot_encode[test_ix]
    # y is a pandas Series, so select rows by position with iloc
    train_y, test_y = y.iloc[train_ix], y.iloc[test_ix]
    xgb_model_fold = XGBClassifier()
    xgb_model_fold.fit(train_X, train_y)
    # Score each fold once and reuse the value
    fold_score = xgb_model_fold.score(test_X, test_y)
    print(fold_score)
    val_score.append(fold_score)

0.8742857142857143
0.8714285714285714
0.8709523809523809
0.8714285714285714
0.871904761904762
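
The five fold scores are easier to compare with other models once they are summarised. The sketch below does that with the standard library; the numbers are copied from the output above (in the notebook they are also stored in `val_score`).

```python
from statistics import mean, stdev

# Fold accuracies copied from the output above.
fold_scores = [
    0.8742857142857143,
    0.8714285714285714,
    0.8709523809523809,
    0.8714285714285714,
    0.871904761904762,
]

mean_score = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across folds
print(f"mean={mean_score:.4f} std={spread:.4f}")
```

A small spread across folds suggests the estimate is fairly stable with respect to how the data is split.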


In [0]:
xgb_model.fit(data_one_hot_encode, y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [0]:
xgb_model.predict(test_data_ohe)

array([0, 0, 0, ..., 0, 1, 0])

In [0]:
# Preparing for submission of results

s2 = pd.Series(label_encoder_gender.inverse_transform(xgb_model.predict(test_data_ohe)), name='gender')
df = pd.concat([s1, s2], axis=1)
df

Unnamed: 0,session_id,gender
0,u12112,female
1,u19725,female
2,u11795,female
3,u22639,female
4,u18034,female
...,...,...
4495,u23966,male
4496,u20527,female
4497,u13253,female
4498,u17094,male


In [0]:
# saving to csv
df.to_csv('submission.csv', index=False)

In [0]:
random_grid

{'bootstrap': [True, False],
 'max_depth': [1, 2, 5, 10],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [10, 20, 50, 100, 145]}

In [0]:
# Using an XGBoost classifier with default parameters as the base estimator
xgb_model_random = XGBClassifier()

In [0]:
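# Note: random_grid was built for the random forest. Of its keys, only max_depth and
# n_estimators are XGBoost hyperparameters; the others most likely have no effect here.
xgb_random = RandomizedSearchCV(estimator = xgb_model_random, 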
                                param_distributions = random_grid, 
                                n_iter = 100, 
                                cv = 2, 
                                verbose=1, 
                                random_state=2, 
                                n_jobs = -1)
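
One thing worth checking before running this search is how many keys of `random_grid` XGBoost actually knows about. The sketch below compares the grid keys with the parameter names from the `XGBClassifier` repr printed earlier in this notebook. It assumes that keys missing from that repr do not change the fitted model, which I have not verified against the XGBoost source.

```python
from math import prod

# Grid as displayed above (originally written for the random forest).
random_grid = {
    "bootstrap": [True, False],
    "max_depth": [1, 2, 5, 10],
    "max_features": ["auto", "sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [10, 20, 50, 100, 145],
}

# Parameter names copied from the XGBClassifier repr printed earlier in this notebook.
xgb_param_names = {
    "base_score", "booster", "colsample_bylevel", "colsample_bynode",
    "colsample_bytree", "gamma", "learning_rate", "max_delta_step",
    "max_depth", "min_child_weight", "missing", "n_estimators", "n_jobs",
    "nthread", "objective", "random_state", "reg_alpha", "reg_lambda",
    "scale_pos_weight", "seed", "silent", "subsample", "verbosity",
}

recognised = sorted(k for k in random_grid if k in xgb_param_names)
unrecognised = sorted(k for k in random_grid if k not in xgb_param_names)

nominal = prod(len(v) for v in random_grid.values())
effective = prod(len(random_grid[k]) for k in recognised)
print(recognised, unrecognised, nominal, effective)
```

Under that assumption, the 100 sampled candidates can cover at most 20 genuinely different XGBoost configurations, so a grid over XGBoost's own parameters (for example `learning_rate` or `subsample`) would probably use the search budget better.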

In [0]:
xgb_random.fit(data_one_hot_encode, y)

Fitting 2 folds for each of 100 candidates, totalling 200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  8.4min
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:  8.5min finished


RandomizedSearchCV(cv=2, error_score=nan,
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=None,
                                           objective='binary:logistic',
                                           random_state=0, reg_alpha=0,
                                           reg_lambda=1, sc...
                                           seed=None, silent=None, subsample=1,
                                           verbosity=1),
                   iid='deprecated', n_i

In [0]:
xgb_random.best_score_

0.8722857142857143

In [0]:
xgb_random.best_params_

{'bootstrap': True,
 'max_depth': 5,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 100}

In [0]:
xgb_random.predict(test_data_ohe)

array([0, 0, 0, ..., 0, 1, 0])

In [0]:
s2 = pd.Series(label_encoder_gender.inverse_transform(xgb_random.predict(test_data_ohe)), name='gender')
df = pd.concat([s1, s2], axis=1)

In [0]:
df

Unnamed: 0,session_id,gender
0,u12112,female
1,u19725,female
2,u11795,female
3,u22639,female
4,u18034,female
...,...,...
4495,u23966,male
4496,u20527,female
4497,u13253,female
4498,u17094,male


In [0]:
df.to_csv('submission.csv', index=False)

In [0]:
filename1 = '/content/drive/My Drive/Kaggle/finalized_model_xgboost_random.sav'
with open(filename1, 'wb') as model_file:
    pickle.dump(xgb_random, model_file)

In [0]:
# This model scored 85.88% on submission

In [0]:
# Trying a random forest as the base estimator for AdaBoost

In [0]:
clf = RandomForestClassifier(n_estimators=145, 
                             n_jobs=-1, 
                             bootstrap=True, 
                             max_depth=None, 
                             max_features='sqrt', 
                             min_samples_leaf=1, 
                             min_samples_split=10, 
                             warm_start=True, 
                             verbose=0, 
                             class_weight = {1:.82, 0:.1})

bclf = AdaBoostClassifier(base_estimator=clf,
                          n_estimators=10, 
                          random_state=2)

In [0]:
bclf.fit(data_one_hot_encode, y)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=RandomForestClassifier(bootstrap=True,
                                                         ccp_alpha=0.0,
                                                         class_weight={0: 0.1,
                                                                       1: 0.82},
                                                         criterion='gini',
                                                         max_depth=None,
                                                         max_features='sqrt',
                                                         max_leaf_nodes=None,
                                                         max_samples=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                 

In [0]:
bclf.predict(test_data_ohe)

array([0, 0, 0, ..., 0, 1, 0])

In [0]:
# Accuracy on the training data itself (not a held-out estimate)
bclf.score(data_one_hot_encode, y)

0.906

In [0]:
s2 = pd.Series(label_encoder_gender.inverse_transform(bclf.predict(test_data_ohe)), name='gender')
df = pd.concat([s1, s2], axis=1)

In [0]:
df

Unnamed: 0,session_id,gender
0,u12112,female
1,u19725,female
2,u11795,female
3,u22639,female
4,u18034,male
...,...,...
4495,u23966,male
4496,u20527,female
4497,u13253,female
4498,u17094,male


In [0]:
df.to_csv('submission13.csv', index=False)

In [0]:
# This model gave me a score of 0.8577

In [0]:
# Trying out gradient boosting (taking parameter intuition from the previous models)

In [0]:
clf_2 = GradientBoostingClassifier(learning_rate=1, 
                                   random_state=2, 
                                   n_estimators=145, 
                                   verbose=2, 
                                   min_samples_split=16, 
                                   min_samples_leaf=1, 
                                   warm_start=True, 
                                   max_features='sqrt', 
                                   max_depth=1)

In [0]:
clf_2.fit(data_one_hot_encode, y)

      Iter       Train Loss   Remaining Time 
         1           1.0140            0.83s
         2           1.0117            0.83s
         3           0.7828            0.83s
         4           0.7628            0.81s
         5           0.7618            0.80s
         6           0.7576            0.77s
         7           0.7565            0.75s
         8           0.7555            0.74s
         9           0.7550            0.73s
        10           0.7537            0.73s
        11           0.7533            0.72s
        12           0.7510            0.71s
        13           0.7506            0.72s
        14           0.7481            0.71s
        15           0.7465            0.70s
        16           0.7461            0.70s
        17           0.7455            0.69s
        18           0.7443            0.68s
        19           0.7441            0.67s
        20           0.7435            0.65s
        21           0.7431            0.64s
        2

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=1, loss='deviance', max_depth=1,
                           max_features='sqrt', max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=16,
                           min_weight_fraction_leaf=0.0, n_estimators=145,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=2, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=2, warm_start=True)

In [0]:
s2 = pd.Series(label_encoder_gender.inverse_transform(clf_2.predict(test_data_ohe)), name='gender')
df = pd.concat([s1, s2], axis=1)

In [0]:
df.to_csv('submission12.csv', index=False)

In [0]:
# This model gave me my highest score: 0.86
