**About the competition**

In this competition, we're challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.

**Objectives of the Notebook**

In this notebook we will go through the features of the dataset, clean the data with Pandas preprocessing to make it ready for a baseline LightGBM model, and finally apply the LightGBM model to predict the outcome.

> This is my first notebook where I have tried to present my work in a documented manner. Please point out any mistakes you notice. Thank you. :)

**Inspirations of the Notebook**

Some parts of the notebook are inspired by the following notebooks:
* [Simple Exploration+Baseline - GA Customer Revenue](https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue) by [SRK](https://www.kaggle.com/sudalairajkumar)
* [2 - Quick study: LGBM, XGB and Catboost [LB: 1.66]](https://www.kaggle.com/julian3833/2-quick-study-lgbm-xgb-and-catboost-lb-1-66) by [Julian](https://www.kaggle.com/julian3833)


In [None]:
import numpy as np
import time
import gc
import json
import os
from datetime import datetime
from pandas import json_normalize  # top-level import, available in pandas 1.0 and later
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error
sns.set_style("dark")

In [None]:
def load_df(csv_path='../input/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     parse_dates=['date'],
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df

train_df = load_df()
test_df = load_df("../input/test.csv")

**About the Features**

* fullVisitorId - A unique identifier for each user of the Google Merchandise Store.
* channelGrouping - The channel via which the user came to the Store.
* date - The date on which the user visited the Store.
* device - The specifications for the device used to access the Store.
* geoNetwork - This section contains information about the geography of the user.
* sessionId - A unique identifier for this visit to the store.
* socialEngagementType - Engagement type, either "Socially Engaged" or "Not Socially Engaged".
* totals - This section contains aggregate values across the session.
* trafficSource - This section contains information about the Traffic Source from which the session originated.
* visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
* visitNumber - The session number for this user. If this is the first session, then this is set to 1.
* visitStartTime - The timestamp (expressed as POSIX time).
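
The notes on `visitId` and `visitStartTime` above can be sketched on a few made-up rows. This is only an illustration: the values are invented, and the column names `session_key` and `visit_start` are hypothetical helpers that do not exist in the dataset.

```python
import pandas as pd

# Toy rows with invented values; only the first three column names come from the dataset.
sessions = pd.DataFrame({
    "fullVisitorId": ["0001", "0001", "0002"],
    "visitId": [1472830385, 1472880147, 1472830385],
    "visitStartTime": [1472830385, 1472880147, 1472830390],
})

# visitId can repeat across users, so pair it with fullVisitorId
# to get a key that identifies a single session.
sessions["session_key"] = sessions["fullVisitorId"] + "_" + sessions["visitId"].astype(str)

# visitStartTime is POSIX time in seconds; convert it to a UTC timestamp.
sessions["visit_start"] = pd.to_datetime(sessions["visitStartTime"], unit="s", utc=True)

print(sessions[["session_key", "visit_start"]])
```

Keeping `fullVisitorId` as a string (as `load_df` does) matters here, because the concatenation would otherwise fail or lose leading zeros.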

In [None]:
train_df.columns

**Preprocessing**

First we will merge the train and test dataset to make the basic operations on the whole dataset easier.

In [None]:
train_df['train_or_test'] = 'train'
test_df['train_or_test'] = 'test'
test_df['totals.transactionRevenue'] = np.nan
df = pd.concat([train_df, test_df], sort=False, ignore_index=True)
del train_df
del test_df
gc.collect()

In [None]:
df.shape

In [None]:
df.head(10)

Based on the time-series nature of the data, I have created some new features which might be helpful to get new insights in the future.

In [None]:
df['year'] = df.date.dt.year
df['month'] = df.date.dt.month
df['dayofmonth'] = df.date.dt.day
df['dayofweek'] = df.date.dt.dayofweek
df['dayofyear'] = df.date.dt.dayofyear
df['weekofyear'] = df.date.dt.isocalendar().week.astype(int)  # ISO week number; dt.weekofyear is gone in pandas 2.x
df['is_month_start'] = (df.date.dt.is_month_start).astype(int)
df['is_month_end'] = (df.date.dt.is_month_end).astype(int)
df['quarter'] = df.date.dt.quarter
df['week_block_num'] = [int(x) for x in np.floor((df.date - pd.to_datetime('2012-12-31')).dt.days/7) + 1]
df['quarter_block_num'] = (df['year'] - 2013) * 4 + df['quarter']

In [None]:
df.describe(include="all")

Many of the columns contain a single constant value across all training and test examples. They carry no information for the model, so removing them reduces memory use and speeds up training and prediction.
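
A small sketch of why the check needs care, on made-up data. It assumes that flag-like columns such as `totals.bounces` hold one value where the flag is set and `NaN` elsewhere; the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "constant": ["a", "a", "a"],
    "flag_or_missing": ["1", np.nan, "1"],  # assumed to behave like totals.bounces
    "varied": ["x", "y", "x"],
})

# Ignoring NaN, the flag column looks constant and would be dropped by mistake.
single_valued = [c for c in toy.columns if toy[c].nunique(dropna=True) == 1]

# Counting NaN as a value, only the truly constant column remains.
truly_constant = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

print(single_valued)
print(truly_constant)
```

With `dropna=True` the flag column has to be taken out of the drop list by hand, which is what the cell below does for `totals.bounces` and `totals.newVisits`.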

In [None]:
dropcols = [c for c in df.columns if df[c].nunique(dropna=True)==1]
dropcols.remove('totals.bounces')
dropcols.remove('totals.newVisits')
print(dropcols)
df.drop(dropcols,axis=1,inplace=True,errors='ignore')

The following 4 features look like integer counts, and their missing cells should most probably be 0.

In [None]:
int_cols = ['totals.bounces','totals.hits','totals.pageviews','totals.newVisits']
# Fill missing cells with 0, then cast to integers (the np.int alias no longer exists in recent NumPy)
df[int_cols] = df[int_cols].fillna(0).astype(int)

Next, let us look at which columns still have missing values and what fraction of the rows they affect.

In [None]:
null_df = df.isnull().sum().reset_index()
null_df[0] = null_df[0] / df.shape[0]
null_df[null_df[0] > 0]

These features have a similar reason for their missing values, so I fill the missing cells with an appropriate placeholder for each column.

In [None]:
cols = ['trafficSource.adwordsClickInfo.adNetworkType','trafficSource.adwordsClickInfo.gclId','trafficSource.adwordsClickInfo.slot','trafficSource.adContent']
df[cols] = df[cols].fillna("No_Ad")
# trafficSource.adContent is already filled with "No_Ad" through `cols` above
df['trafficSource.adwordsClickInfo.page'] = df['trafficSource.adwordsClickInfo.page'].fillna(0)
df['trafficSource.referralPath'] = df['trafficSource.referralPath'].fillna("No_Path")

In [None]:
df.describe(include=["O"]).T

I am categorizing the object features in two different groups. The group having many categorical values will be used for Label Encoding and the other having only a few categorical values will be used for One Hot Encoding.
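
As a toy illustration of the difference between the two encodings (the values are made up, and the column names are shortened versions of the real ones):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "country": ["India", "Japan", "India", "Brazil"],
    "deviceCategory": ["desktop", "mobile", "desktop", "tablet"],
})

# Label encoding: a single integer column, with codes assigned in sorted order of the labels.
toy["country"] = LabelEncoder().fit_transform(toy["country"].astype(str))

# One-hot encoding: one indicator column per category.
toy = pd.get_dummies(toy, columns=["deviceCategory"])

print(toy)
```

Label encoding keeps the frame narrow when a column has thousands of distinct values, while one-hot encoding avoids implying an order among a handful of categories.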

In [None]:
cat_many_label_cols = ["channelGrouping", "device.browser", "device.operatingSystem", 
            "geoNetwork.city", "geoNetwork.continent", 
            "geoNetwork.country", "geoNetwork.metro",
            "geoNetwork.networkDomain", "geoNetwork.region", 
            "geoNetwork.subContinent", "trafficSource.adContent", 
            "trafficSource.adwordsClickInfo.gclId", 
            "trafficSource.adwordsClickInfo.page", 
            "trafficSource.campaign",
            "trafficSource.keyword", "trafficSource.medium", 
            "trafficSource.referralPath", "trafficSource.source"]

cat_few_label_cols = ["device.deviceCategory","trafficSource.adwordsClickInfo.adNetworkType",
                     "trafficSource.adwordsClickInfo.slot"]

Performing Label Encoding and One Hot Encoding of the above grouped features.

In [None]:
for col in cat_many_label_cols:
    print(col)
    lbl = LabelEncoder()
    lbl.fit(list(df[col].values.astype('str')))
    df[col] = lbl.transform(list(df[col].values.astype('str')))
    
df = pd.get_dummies(df,columns=cat_few_label_cols)

In [None]:
df.shape

The target variable also has many missing values in the training dataset. The reason is that those sessions did not generate any revenue, so the missing values are set to zero.

In [None]:
df["totals.transactionRevenue"].fillna(0,inplace=True)
df["totals.transactionRevenue"] = df["totals.transactionRevenue"].astype(float)

Now I think most of the data cleaning and feature engineering is over. I will get back the original train and test DataFrame from the merged DataFrame and then extract a validation dataset from the last part of the time sequence.

In [None]:
train_df = df[df.train_or_test=='train']
test_df = df[df.train_or_test=='test'].drop('totals.transactionRevenue',axis=1)
val_df = train_df[train_df['date']>datetime(2017,5,31)]
print(train_df.shape)
print(val_df.shape)
print(test_df.shape)

In [None]:
dropcols = ['fullVisitorId','sessionId','visitId']
train_x = train_df.drop(dropcols,axis=1)
test_x = test_df.drop(dropcols,axis=1)

In [None]:
dev_x = train_x[train_x['date']<=datetime(2017,5,31)]
val_x = train_x[train_x['date']>datetime(2017,5,31)]
dev_y = np.log1p(dev_x["totals.transactionRevenue"].values)
val_y = np.log1p(val_x["totals.transactionRevenue"].values)
# Reassign instead of dropping in place, since dev_x and val_x are slices of train_x
dev_x = dev_x.drop(["totals.transactionRevenue","date","train_or_test"],axis=1)
val_x = val_x.drop(["totals.transactionRevenue","date","train_or_test"],axis=1)
test_x = test_x.drop(["date","train_or_test"],axis=1)

**Modeling**

Now we are ready to apply machine learning models on the dataset. I will use the gradient boosting framework LightGBM, as it is fast and generally accurate on large tabular datasets.

In [None]:
lgb_params = {
        "objective" : "regression",
        "metric" : "rmse", 
        "num_leaves" : 1024,
        'max_depth': 16,  
        'max_bin': 255,
        "min_child_samples" : 100,
        "learning_rate" : 0.005,
        'verbose': 0,
        "bagging_fraction" : 0.7,
        "feature_fraction" : 0.7,
        "bagging_freq" : 5,  # bagging_fraction only takes effect when bagging_freq > 0
        "bagging_seed" : 2018
    }

In [None]:
dtrain = lgb.Dataset(dev_x, label=dev_y)
dvalid = lgb.Dataset(val_x, label=val_y)

In [None]:
evals_results = {}
print("Training the model...")

start = datetime.now()
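# Early stopping, logging and metric recording go through callbacks (LightGBM 3.3 and later)
lgb_model = lgb.train(lgb_params, 
                 dtrain, 
                 valid_sets=[dtrain, dvalid], 
                 valid_names=['train','valid'], 
                 num_boost_round=1000,
                 callbacks=[lgb.early_stopping(stopping_rounds=70),
                            lgb.log_evaluation(period=50),
                            lgb.record_evaluation(evals_results)])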
print("Total time taken : ", datetime.now()-start)

In [None]:
pred_test_lgb = lgb_model.predict(test_x, num_iteration=lgb_model.best_iteration)
pred_val_lgb = lgb_model.predict(val_x, num_iteration=lgb_model.best_iteration)

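Now we evaluate the model on the validation dataset. Since this is a regression problem, the score is the RMSE of the log revenue, computed per user after summing the revenue over each user's sessions.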
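
A minimal sketch of that user-level score on made-up numbers. The helper `user_level_rmse` and its argument names are hypothetical; it assumes the model predicts `log1p` of the session revenue, as the notebook's target does.

```python
import numpy as np
import pandas as pd

def user_level_rmse(visitor_ids, actual_revenue, predicted_log_revenue):
    """RMSE between log1p of summed actual and summed predicted revenue per visitor."""
    frame = pd.DataFrame({
        "fullVisitorId": visitor_ids,
        "actual": actual_revenue,
        # Clip negative predictions, then undo log1p so sessions can be summed in revenue units.
        "predicted": np.expm1(np.clip(predicted_log_revenue, 0, None)),
    })
    per_user = frame.groupby("fullVisitorId")[["actual", "predicted"]].sum()
    error = np.log1p(per_user["actual"]) - np.log1p(per_user["predicted"])
    return float(np.sqrt(np.mean(error ** 2)))

ids = ["a", "a", "b"]                      # visitor "a" has two sessions
actual = np.array([100.0, 50.0, 0.0])      # invented session revenues
perfect_score = user_level_rmse(ids, actual, np.log1p(actual))
zero_score = user_level_rmse(ids, actual, np.zeros(3))
print(perfect_score, zero_score)
```

The predictions are converted back with `expm1` before summing because the log of a sum is not the sum of the logs; the log is applied again only after the per-user totals are formed.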

In [None]:
from sklearn import metrics
pred_val_lgb[pred_val_lgb<0] = 0
val_pred_df = pd.DataFrame({"fullVisitorId":val_df["fullVisitorId"].values})
val_pred_df["transactionRevenue"] = val_df["totals.transactionRevenue"].values
val_pred_df["PredictedRevenue"] = np.expm1(pred_val_lgb)
val_pred_df = val_pred_df.groupby("fullVisitorId")[["transactionRevenue", "PredictedRevenue"]].sum().reset_index()
print(np.sqrt(metrics.mean_squared_error(np.log1p(val_pred_df["transactionRevenue"].values), np.log1p(val_pred_df["PredictedRevenue"].values))))

Now plotting the feature importance graph as calculated by the above LightGBM model.

In [None]:
fold_importance_df = pd.DataFrame()
fold_importance_df["feature"] = val_x.columns
fold_importance_df["importance"] = lgb_model.feature_importance()
plt.figure(figsize=(18,20))
sns.barplot(x='importance',y='feature',data=fold_importance_df.sort_values(by="importance", ascending=False))

Now applying the model on the test dataset and trying to generate submission files.

In [None]:
train_id = train_df["fullVisitorId"].values
test_id = test_df["fullVisitorId"].values
sub_df = pd.DataFrame({"fullVisitorId":test_id})
pred_test_lgb[pred_test_lgb<0] = 0
sub_df["PredictedLogRevenue"] = np.expm1(pred_test_lgb)
sub_df = sub_df.groupby("fullVisitorId")["PredictedLogRevenue"].sum().reset_index()
sub_df.columns = ["fullVisitorId", "PredictedLogRevenue"]
sub_df["PredictedLogRevenue"] = np.log1p(sub_df["PredictedLogRevenue"])
sub_df.to_csv("baseline_lgb.csv", index=False)