Major definition

"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another. The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by "Google Analytics" for each page in the e-commerce site. The value of "Bounce Rate" feature for a web page refers to the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.

The value of the "Exit Rate" feature for a specific web page is the percentage of all pageviews of that page that were the last pageview in their session.
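
As a rough illustration of the two definitions above, the sketch below computes bounce and exit rates from a toy pageview log. The log format (one ordered list of page names per session), the helper name `bounce_and_exit_rates`, and the simplification that a bounce is a session with a single pageview are all assumptions made for this example; Google Analytics derives these metrics internally from its own hit data.

```python
from collections import Counter

def bounce_and_exit_rates(sessions):
    """Compute per-page bounce and exit rates (in percent) from toy session logs.

    sessions: list of sessions, each an ordered list of page names.
    Bounce rate of a page: among sessions that start on the page,
    the share that consist of that single pageview.
    Exit rate of a page: among all pageviews of the page,
    the share that are the last pageview of their session.
    """
    entrances, bounces = Counter(), Counter()
    pageviews, exits = Counter(), Counter()
    for pages in sessions:
        if not pages:
            continue
        entrances[pages[0]] += 1
        if len(pages) == 1:
            bounces[pages[0]] += 1
        pageviews.update(pages)
        exits[pages[-1]] += 1
    bounce_rate = {p: 100 * bounces[p] / n for p, n in entrances.items()}
    exit_rate = {p: 100 * exits[p] / n for p, n in pageviews.items()}
    return bounce_rate, exit_rate

# toy data, made up for the illustration
toy_sessions = [
    ["home"],                      # a bounce on "home"
    ["home", "product", "cart"],   # exits on "cart"
    ["product", "home"],           # exits on "home"
    ["home", "product"],           # exits on "product"
]
bounce_rate, exit_rate = bounce_and_exit_rates(toy_sessions)
print(bounce_rate)
print(exit_rate)
```

Note that the two rates use different denominators: bounce rate divides by the sessions that entered on the page, exit rate by all pageviews of the page.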

The "Page Value" feature represents the average value for a web page that a user visited before completing an e-commerce transaction.

The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to end in a transaction. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day, this value is nonzero between February 2 and February 12, zero before and after this period unless the date is close to another special day, and reaches its maximum value of 1 on February 8.
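
The text does not give the exact formula behind "Special Day", only its shape for Valentine's Day (nonzero from February 2 to February 12, peak of 1 on February 8). The sketch below is therefore purely hypothetical: it assumes a piecewise-linear ramp, and the function name `special_day_closeness`, its parameters, and the year used are invented for the illustration.

```python
from datetime import date

def special_day_closeness(visit, first_nonzero, peak, last_nonzero):
    """Hypothetical closeness score in [0, 1] (NOT the dataset's actual formula).

    The score is 0 outside [first_nonzero, last_nonzero], rises linearly to 1
    on `peak`, and falls linearly afterwards.
    """
    if visit < first_nonzero or visit > last_nonzero:
        return 0.0
    if visit <= peak:
        # linear rise, reaching exactly 1 on the peak date
        rise_days = (peak - first_nonzero).days + 1
        return ((visit - first_nonzero).days + 1) / rise_days
    # linear fall after the peak, still nonzero on the last date
    fall_days = (last_nonzero - peak).days + 1
    return ((last_nonzero - visit).days + 1) / fall_days

# Valentine's Day window from the description; the year is arbitrary
window = dict(
    first_nonzero=date(2024, 2, 2),
    peak=date(2024, 2, 8),
    last_nonzero=date(2024, 2, 12),
)
for day in (1, 2, 8, 12, 13):
    print(day, round(special_day_closeness(date(2024, 2, day), **window), 3))
```

The peak sits before February 14 rather than on it, which matches the remark about the gap between order date and delivery date.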

The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.

In [46]:
import numpy as np
import pandas as pd

import os
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# model evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

# classifiers
from sklearn.neighbors import KNeighborsClassifier # KNN
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.tree import DecisionTreeClassifier # decision tree
from sklearn.ensemble import RandomForestClassifier # random forest
from sklearn.ensemble import GradientBoostingClassifier # gradient boosting

from tqdm import tqdm_notebook
from sklearn.svm import NuSVC, SVC
from sklearn.metrics import classification_report
pd.options.display.precision = 15

import lightgbm as lgb
import time
import datetime

import json
import ast
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)

### 1. EDA: per-column missing values, distributions and some plots

In [47]:
def missing_data(data):
    '''
    Summarize missing values per column: count, percentage and dtype.
    Returns a transposed frame with one column per input column.
    '''
    total = data.isnull().sum()
    percent = total / len(data) * 100
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    tt['Types'] = data.dtypes.astype(str)
    return tt.T

def plot_feature_distribution(df1, df2, label1, label2, features):
    '''
    Compare the distributions of numerical features between two labelled
    segments (e.g. Revenue == 0 vs. Revenue == 1) on a 3x3 grid of KDE plots.
    '''
    sns.set_style('whitegrid')
    fig, axes = plt.subplots(3, 3, figsize=(10, 10))

    # the grid holds at most 9 features; zip stops at the shorter sequence
    for ax, feature in zip(axes.flat, features):
        # kdeplot replaces the deprecated distplot(..., hist=False)
        sns.kdeplot(x=df1[feature], label=label1, ax=ax)
        sns.kdeplot(x=df2[feature], label=label2, ax=ax)
        ax.set_xlabel(feature, fontsize=9)
        ax.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        ax.tick_params(axis='y', which='major', labelsize=6)
        ax.legend(fontsize=6)
    plt.show()
    
def count_ctgy_spread(df, ctgy_cols):
    '''
    Count the number of distinct values in each categorical column and
    suggest how to process it based on that cardinality:
    <= 6 -> 'one_hot', 7 to 10 -> 'mid_level', > 10 -> 'encoding'.
    Returns (res, advice): column -> unique count, column -> suggestion.
    '''
    res = {}
    advice = {}
    for col in ctgy_cols:
        val = df[col].nunique()
        if val <= 6:
            advice[col] = 'one_hot'
        elif val <= 10:
            advice[col] = 'mid_level'
        else:
            advice[col] = 'encoding'

        res[col] = val
    print('column:number of unique records')
    return res, advice

In [48]:
# starting point, read in raw data
df = pd.read_csv('online_shoppers_intention.csv')

# cast categorical features that are stored as numbers to object dtype
df['SpecialDay'] = df['SpecialDay'].astype('O')
df['OperatingSystems'] = df['OperatingSystems'].astype('O')
df['Browser'] = df['Browser'].astype('O')
df['Region'] = df['Region'].astype('O')
df['TrafficType'] = df['TrafficType'].astype('O')
df['Revenue'] = df['Revenue'].astype(int)

# convenient vars for columns
feature_cols = df.columns.tolist()
feature_cols.remove('Revenue')
feature_obj_cols = df.select_dtypes('O').columns.tolist()
feature_num_cols = [x for x in feature_cols if x not in feature_obj_cols]

# show missing-value counts, percentages and dtypes per column
missing_data(df)

,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
Total,14,14,14,14,14,14,14,14,0,0,0,0,0,0,0,0,0,0
Percent,0.113544201135442,0.113544201135442,0.113544201135442,0.113544201135442,0.113544201135442,0.113544201135442,0.113544201135442,0.113544201135442,0,0,0,0,0,0,0,0,0,0
Types,float64,float64,float64,float64,float64,float64,float64,float64,float64,object,object,object,object,object,object,object,bool,int64


In [42]:
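# fill missing values in the non-object columns with the column mean
# (assign the result back; in-place fillna on a selected column is unreliable in recent pandas)
for col in df.select_dtypes(exclude='O').columns.tolist():
    df[col] = df[col].fillna(df[col].mean())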

In [44]:
train, test = train_test_split(df, test_size=0.2,random_state = 42)
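
The cells above impute with means computed on the full dataset and only then split it, so the test rows influence the fill values. A leakage-free variant, sketched below, computes the means on the training rows only and reuses them for the test rows. The toy frame, the positional split, and the helper name `impute_with_train_means` are made up for the illustration.

```python
import numpy as np
import pandas as pd

def impute_with_train_means(train, test, cols):
    """Fill NaNs in `cols` of both frames with means computed on `train` only."""
    means = train[cols].mean()
    train_filled = train.copy()
    test_filled = test.copy()
    # DataFrame.fillna with a Series fills each column by its matching label
    train_filled[cols] = train[cols].fillna(means)
    test_filled[cols] = test[cols].fillna(means)
    return train_filled, test_filled, means

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "ExitRates": [0.1, np.nan, 0.3, 0.2, np.nan],
    "PageValues": [0.0, 10.0, np.nan, 20.0, 5.0],
})
toy_train, toy_test = toy.iloc[:3], toy.iloc[3:]
train_filled, test_filled, means = impute_with_train_means(
    toy_train, toy_test, ["ExitRates", "PageValues"]
)
print(means)
print(test_filled)
```

With the real data, the same idea would apply to `train` and `test` from `train_test_split`, provided the split happens before any imputation.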

### Build a benchmark model: off-the-shelf logistic regression (LR)

In [100]:
def benchmark_model(df, target_col):
    '''
    Benchmark model: off-the-shelf logistic regression with default
    hyperparameters, scored by accuracy and weighted F1 on a random 80/20 split.
    Prints both score tables and returns them.
    '''

    # fill missing values in the non-object columns with the column mean
    for col in df.select_dtypes(exclude='O').columns.tolist():
        df[col] = df[col].fillna(df[col].mean())

    all_cols = df.columns.tolist()
    all_cols.remove(target_col)

    X = pd.get_dummies(df[all_cols])
    y = df[target_col]

    # imputing and one-hot encoding before the split leaks a little information
    # from the test rows; acceptable here because this is only a benchmark
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model_names = ['LR']
    acc = []
    f1 = []

    # using default hyperparameter settings
    lr_clf = LogisticRegression()

    clf_list = [lr_clf]
    clf_dict = dict(zip(model_names, clf_list))

    for model in model_names:
        clf = clf_dict[model]
        # fit on the training split, score on the held-out split
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        f1.append(round(f1_score(y_test, y_pred, average='weighted')*100, 2))
        acc.append(round(accuracy_score(y_test, y_pred)*100, 2))

    accuracy_record = pd.DataFrame({'Model': model_names, 'acc': acc})
    accuracy_record.set_index('Model', inplace=True)
    accuracy_record.loc['avg'] = accuracy_record.mean()

    F1_record = pd.DataFrame({'Model': model_names, 'f1': f1})
    F1_record.set_index('Model', inplace=True)
    F1_record.loc['avg'] = F1_record.mean()

    print(accuracy_record)
    print('\n')
    print(F1_record)
    return accuracy_record, F1_record

In [51]:
ac = benchmark_model(df,target_col='Revenue')

                      acc
Model                    
LR     88.159999999999997
avg    88.159999999999997


                       f1
Model                    
LR     86.739999999999995
avg    86.739999999999995
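
To help read the two scores side by side, the sketch below computes accuracy and a weighted F1 by hand: F1 is computed per class and averaged with weights equal to each class's share of the true labels, which is what `average='weighted'` does in `f1_score`. The labels are toy values invented for the example, and the always-predict-0 baseline is an assumption chosen to show how accuracy can look good on imbalanced labels while weighted F1 drops.

```python
def weighted_f1(y_true, y_pred):
    """F1 per class, averaged with weights equal to each class's share of y_true."""
    classes = sorted(set(y_true))
    total = len(y_true)
    score = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 is defined as 0 when the class is never predicted correctly
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = sum(t == c for t in y_true)
        score += f1 * support / total
    return score

# toy labels: 8 sessions without purchase, 2 with purchase
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10  # a baseline that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, weighted_f1(y_true, y_pred))
```

A benchmark is easier to interpret when it is compared against such a majority-class baseline on the same split.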


### Feature engineering

The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to end in a transaction. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day, this value is nonzero between February 2 and February 12, zero before and after this period unless the date is close to another special day, and reaches its maximum value of 1 on February 8.

In [62]:
cols_variety, advice = count_ctgy_spread(df,df.select_dtypes(include=('O')).columns)

column:number of unique records


In [None]:
from math import exp
def decayed_tte(x):
    '''
    time to event
    x input in [0, 1], rescaled output exp(x - 1) in [exp(-1), 1]
    '''
    # a small x means the visit is far from a special day (little impact);
    # x = 1 means it is very close to the special day
    return exp(x-1)

def get_likelihood_feature(df, feature, target):
    '''
    Per-category conversion ratio for a single column:
    (# rows with target == 1) / (# rows with target == 0).
    This is the odds of conversion rather than a probability.
    Categories missing from either class are dropped (inner join).
    Returns a dict mapping category -> ratio.
    '''
    non_rev = df.loc[df[target] == 0, feature].value_counts().rename('count_non_rev')
    rev = df.loc[df[target] == 1, feature].value_counts().rename('count_rev')
    counts = pd.concat([non_rev, rev], axis=1, join='inner')
    counts['conversion_prob'] = counts['count_rev'] / counts['count_non_rev']
    return counts['conversion_prob'].to_dict()

In [76]:
# compute the ratios on the training rows only, then map them onto Month
mth_prob = get_likelihood_feature(train, 'Month', 'Revenue')
train['mth_conv_prob'] = train['Month'].map(mth_prob)

In [119]:
train['special_day_scaled'] = train['SpecialDay'].apply(lambda x:decayed_tte(x))
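
To make the effect of this rescaling concrete, the sketch below evaluates `decayed_tte` on a few inputs. The grid of inputs is an assumption chosen to span [0, 1]. Since the output is exp(x - 1), an input of 0 maps to about 0.37 rather than to 0, and an input of 1 maps to exactly 1.

```python
from math import exp

def decayed_tte(x):
    """Rescale a closeness score x in [0, 1] to exp(x - 1)."""
    return exp(x - 1)

# illustrative inputs spanning [0, 1]
inputs = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
outputs = [decayed_tte(x) for x in inputs]
for x, y in zip(inputs, outputs):
    print(f"SpecialDay={x:.1f} -> {y:.3f}")
```

The mapping is monotonic, so it preserves the ordering of the original values while compressing their range.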

In [115]:
train = train.drop(columns='special_day_scaled')

In [61]:
advice

{'SpecialDay': 'one_hot',
 'Month': 'mid_level',
 'OperatingSystems': 'mid_level',
 'Browser': 'encoding',
 'Region': 'mid_level',
 'TrafficType': 'encoding',
 'VisitorType': 'one_hot'}
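
One way to act on this dictionary is to group the columns by suggested method, so that each group can be passed to its own preprocessing step. The sketch below does only the grouping; the helper name `split_by_advice` is hypothetical, and what to do with the `mid_level` group is left open because the notebook does not say.

```python
def split_by_advice(advice):
    """Group column names by the processing method suggested for them."""
    groups = {}
    for col, method in advice.items():
        groups.setdefault(method, []).append(col)
    return groups

# the advice dictionary shown in the output above
advice = {
    'SpecialDay': 'one_hot',
    'Month': 'mid_level',
    'OperatingSystems': 'mid_level',
    'Browser': 'encoding',
    'Region': 'mid_level',
    'TrafficType': 'encoding',
    'VisitorType': 'one_hot',
}
groups = split_by_advice(advice)
for method, cols in groups.items():
    print(method, cols)
```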

### Dealing with categorical features

In [81]:
import category_encoders as ce
'''
Random forest (scikit-learn) and XGBoost (by default) cannot consume categorical features directly; convert them to numbers first.
LightGBM and CatBoost can take categorical features as input directly.
For LightGBM, label-encode the categories so it can search for the optimal split itself; this tends to work better than one-hot encoding.

* Ordinal features (with a latent order): label encoding
* No order and only a few categories (<4): one-hot encoding
* No order and more than 4 categories: target encoding (mean/likelihood/impact encoding)
'''
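
The three options in the notes above can be sketched with plain pandas. This is an illustration under assumptions, not the `category_encoders` implementation: the toy frame is invented, the helper name `target_encode` is hypothetical, and the additive smoothing formula is one common choice among several.

```python
import pandas as pd

def target_encode(train, col, target, smoothing=10.0):
    """Smoothed mean (target) encoding fitted on `train` only.

    Each category is mapped to a blend of its own target mean and the global
    mean; categories with few rows are pulled towards the global mean.
    """
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)
    mapping = weight * stats["mean"] + (1 - weight) * global_mean
    return mapping.to_dict(), global_mean

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "VisitorType": ["New", "Returning", "Returning", "New", "Returning", "Other"],
    "TrafficType": [1, 2, 2, 3, 1, 2],
    "Revenue": [1, 0, 0, 1, 1, 0],
})

# one-hot encoding: few categories, no order
one_hot = pd.get_dummies(toy["VisitorType"], prefix="VisitorType")

# label encoding: integer codes per category
label_codes = toy["TrafficType"].astype("category").cat.codes

# target encoding: unordered column with many categories
mapping, global_mean = target_encode(toy, "TrafficType", "Revenue", smoothing=1.0)
# categories unseen during fitting fall back to the global mean
encoded = toy["TrafficType"].map(mapping).fillna(global_mean)
print(one_hot.columns.tolist(), label_codes.tolist(), encoded.tolist())
```

Fitting the target encoding on the training rows only matters here, because the mapping is built from the label itself.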

In [14]:
# Dealing with numerical features
normed_df = df.copy()

# standardize every numerical column to zero mean and unit variance;
# StandardScaler scales each column independently, so one call covers them all
scaler = StandardScaler()
normed_df[feature_num_cols] = scaler.fit_transform(normed_df[feature_num_cols])
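
`StandardScaler` standardizes each column as (x - mean) / std, using the population standard deviation (`ddof=0`). The sketch below reproduces that with pandas and, as an assumption about how one might avoid leaking test statistics, fits the mean and standard deviation on training rows only. The toy frames and the helper names `fit_standardizer` and `apply_standardizer` are made up for the illustration.

```python
import pandas as pd

def fit_standardizer(train, cols):
    """Return per-column mean and population std (ddof=0) computed on `train`."""
    return train[cols].mean(), train[cols].std(ddof=0)

def apply_standardizer(df, cols, mean, std):
    """Standardize `cols` of `df` with previously fitted statistics."""
    out = df.copy()
    out[cols] = (df[cols] - mean) / std
    return out

# toy data, not taken from the real dataset
toy_train = pd.DataFrame({"ProductRelated": [1.0, 3.0, 5.0],
                          "ExitRates": [0.1, 0.2, 0.3]})
toy_test = pd.DataFrame({"ProductRelated": [7.0], "ExitRates": [0.2]})
cols = ["ProductRelated", "ExitRates"]

mean, std = fit_standardizer(toy_train, cols)
train_scaled = apply_standardizer(toy_train, cols, mean, std)
test_scaled = apply_standardizer(toy_test, cols, mean, std)
print(train_scaled)
print(test_scaled)
```

After scaling, the training columns have zero mean and unit variance; the test rows are expressed in the same units but need not have those properties.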