- <a href='#0'>0. Introduction</a>  
- <a href='#1'>1. Get the Data</a>
- <a href='#2'>2. Check the Data</a>
    - <a href='#2-1'>2.1 Remove useless columns</a>
    - <a href='#2-2'>2.2 Check missing values</a>
- <a href='#3'>3. Explore the Data</a>
    - <a href='#3-1'>3.1 Y label</a>
    - <a href='#3-2'>3.2 Device group</a>
    - <a href='#3-3'>3.3 GeoNetwork group</a>
    - <a href='#3-4'>3.4 Totals group</a>
    - <a href='#3-5'>3.5 TrafficSource group</a>
    - <a href='#3-6'>3.6 Others</a>
- <a href='#4'>4. Baseline Model</a>
    - <a href='#4-1'>4.1 Pre-processing</a>
    - <a href='#4-2'>4.2 Modeling</a>
    - <a href='#4-3'>4.3 Feature importance</a>

## <a id='0'>0. Introduction</a>
The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.

In this competition, you're challenged to analyze a Google Merchandise Store customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies that choose to use data analysis on top of GA data.

Submissions are scored on the root mean squared error (RMSE). For each *fullVisitorId* in the test set, you must predict the natural log of their total revenue in *PredictedLogRevenue*. In other words, we are predicting the natural log of the sum of all transactions per user. For every user in the test set, the target is:

$$y_{user} = \sum_{i=1}^{n} transaction_{user_i}$$

$$target_{user} = ln(y_{user}+1)$$
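
As a quick sanity check on the two formulas above, here is a minimal sketch on toy data. The visitor IDs, revenue values, and the all-zero prediction are invented for illustration; only the sum, `ln(y + 1)`, and RMSE steps come from the competition description.

```python
import numpy as np
import pandas as pd

# Toy session-level data; IDs and revenues are made up for illustration.
sessions = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c"],
    "transactionRevenue": [10.0, 5.0, 0.0, 20.0],
})

# y_user: total revenue per user; target_user = ln(y_user + 1)
y_user = sessions.groupby("fullVisitorId")["transactionRevenue"].sum()
target_user = np.log1p(y_user)

# A hypothetical prediction of zero revenue for every user.
predicted = np.zeros(len(target_user))
rmse = np.sqrt(np.mean((target_user.values - predicted) ** 2))

print(target_user)
print("RMSE:", rmse)
```

Note that a user with no transactions gets a target of exactly 0, which is why the `+1` inside the logarithm matters.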

## <a id='1'>1. Get the Data</a>

In [None]:
import os
import time
import gc
import warnings
warnings.filterwarnings("ignore")
# data manipulation
import json
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas
import numpy as np
import pandas as pd
# plot
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
# model
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

Thanks to [this kernel](https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields/notebook) by [Julian](https://www.kaggle.com/julian3833) for handling the columns with JSON data.
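
To see what flattening a JSON column means before running it on the full files, here is a small sketch on a toy frame. The two rows are invented, and `pd.json_normalize` assumes pandas 1.0 or newer.

```python
import json
import pandas as pd

# Toy frame with one JSON-encoded column; the values are made up.
raw = pd.DataFrame({
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse each JSON string into a dict, then expand the dict keys into columns.
parsed = raw["device"].apply(json.loads)
flat = pd.json_normalize(parsed.tolist())

# Prefix the new columns with the original column name, e.g. "device.browser".
flat.columns = [f"device.{c}" for c in flat.columns]
print(flat)
```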

In [None]:
%%time
def load_df(csv_path='../input/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
train = load_df('../input/train.csv')
test = load_df('../input/test.csv')
sub = pd.read_csv('../input/sample_submission.csv')
gc.collect()

In [None]:
train.head()

## <a id='2'>2. Check the Data</a>

### <a id='2-1'>2.1 Remove useless columns</a>
**Fields in the train set but not in the test set**

In [None]:
set(train.columns).difference(set(test.columns))

**Fields with a constant value**

In [None]:
cons_col = [i for i in train.columns if train[i].nunique(dropna=False)==1]
cons_col

In [None]:
train = train.drop(cons_col + ['trafficSource.campaignCode'], axis=1)
test = test.drop(cons_col, axis=1)
gc.collect()

In [None]:
print(train.shape)
print(test.shape)

### <a id='2-2'>2.2 Check missing values</a>

In [None]:
def find_missing(data):
    # number of missing values
    count_missing = data.isnull().sum().values
    # total records
    total = data.shape[0]
    # percentage of missing
    ratio_missing = count_missing/total
    # return a dataframe to show: feature name, # of missing and % of missing
    return pd.DataFrame(data={'missing_count':count_missing, 'missing_ratio':ratio_missing}, index=data.columns.values)
train_missing = find_missing(train)
test_missing = find_missing(test)

In [None]:
train_missing.reset_index()[['index', 'missing_ratio']]\
    .merge(test_missing.reset_index()[['index', 'missing_ratio']], on='index', how='left')\
    .rename(columns={'index':'columns', 'missing_ratio_x':'train_missing_ratio', 'missing_ratio_y':'test_missing_ratio'})\
    .sort_values(['train_missing_ratio', 'test_missing_ratio'], ascending=False)\
    .query('train_missing_ratio>0')

## <a id='3'>3. Explore the Data</a>

In [None]:
if test.fullVisitorId.nunique() == len(sub):
    print('The number of unique fullVisitorId values in test matches the number of rows in the submission file.')
else:
    print('Check it again')

- **fullVisitorId**: A unique identifier for each user of the Google Merchandise Store.
- **visitId**: An identifier for this session. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
- **sessionId**:  *fullVisitorId_visitId*; A unique identifier for this visit to the store.
- **date** : The date on which the user visited the Store.
- **visitNumber**: The session number for this user. If this is the first session, then this is set to 1.
- **visitStartTime**: The timestamp (expressed as POSIX time).
- **socialEngagementType**: Engagement type, either "Socially Engaged" or "Not Socially Engaged".
- **channelGrouping** : The channel via which the user came to the Store. 'Organic Search', 'Referral', 'Paid Search', 'Affiliates', 'Direct', 'Display', 'Social' or 'Other'.
- **device**: The specifications for the device used to access the Store. **It includes 16 variables, 4 of which are useful.**
- **geoNetwork**: This section contains information about the geography of the user. **It includes 11 variables, 7 of which are useful.**
- **totals**: This section contains aggregate values across the session. **It includes 6 variables, 5 of which are useful.** **In particular, 'totals.transactionRevenue' is the target for modeling.**
- **trafficSource**: This section contains information about the Traffic Source from which the session originated. **It includes 14 variables, 12 of which are useful.**
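
Two of the fields above are easier to understand with a small example: `visitId` is only unique within a user, and `visitStartTime` is POSIX time. The sketch below uses invented rows, and the `sessionKey` and `visitStart` column names are hypothetical.

```python
import pandas as pd

# Toy rows; all values are made up for illustration.
df = pd.DataFrame({
    "fullVisitorId": ["111", "111", "222"],
    "visitId": [1472830385, 1472880147, 1472830385],
    "visitStartTime": [1472830385, 1472880147, 1472830390],
})

# visitId alone can repeat across users, so combine it with fullVisitorId.
df["sessionKey"] = df["fullVisitorId"] + "_" + df["visitId"].astype(str)

# visitStartTime is POSIX time (seconds since 1970-01-01 UTC).
df["visitStart"] = pd.to_datetime(df["visitStartTime"], unit="s")

print(df["visitId"].is_unique, df["sessionKey"].is_unique)
print(df[["sessionKey", "visitStart"]])
```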

###  <a id='3-1'>3.1 Y label</a>
#### The distribution of 'transactionRevenue'

In [None]:
y = np.nan_to_num(np.array([float(i) for i in train['totals.transactionRevenue']]))
print('The ratio of sessions with transaction revenue is', str((y != 0).mean()))

In [None]:
plt.figure(figsize=[12, 6])
sns.distplot(y[y!=0])
plt.xlabel('transactionRevenue')
plt.show()

#### The distribution of 'Target'

In [None]:
train["totals.transactionRevenue"] = train["totals.transactionRevenue"].astype('float')
target = np.log1p(train.groupby("fullVisitorId")["totals.transactionRevenue"].sum())
print('The ratio of customers with transaction revenue is', str((target != 0).mean()))

In [None]:
plt.figure(figsize=[12, 6])
sns.distplot(target[target!=0])
plt.xlabel('Target')
plt.show()

###  <a id='3-2'>3.2 Device group</a>

In [None]:
def plot_categorical(data, col, size=[8 ,4], xlabel_angle=0, title='', max_cat = None):
    '''use this for plotting the count of categorical features'''
    plotdata = data[col].value_counts() / len(data)
    if max_cat is not None:
        plotdata = plotdata[max_cat[0]:max_cat[1]]
    plt.figure(figsize = size)
    sns.barplot(x = plotdata.index, y=plotdata.values)
    plt.title(title)
    if xlabel_angle!=0: 
        plt.xticks(rotation=xlabel_angle)
    plt.show()
plot_categorical(data=train, col='device.browser', size=[8 ,4], xlabel_angle=20, title='Device - Browser', max_cat=[0, 6])

In [None]:
plot_categorical(data=train, col='device.deviceCategory', size=[8 ,4], xlabel_angle=0, title='Device - Category')

In [None]:
plot_categorical(data=train, col='device.operatingSystem', size=[8 ,4], xlabel_angle=30, 
                 title='Device - Operating System', max_cat = [0, 7])

###  <a id='3-3'>3.3 GeoNetwork group</a>

In [None]:
plot_categorical(data=train, col='geoNetwork.city', size=[12 ,4], xlabel_angle=30, 
                 title='GeoNetwork - City', max_cat = [1, 20])

In [None]:
plot_categorical(data=train, col='geoNetwork.country', size=[12 ,4], xlabel_angle=30, 
                 title='GeoNetwork - Country', max_cat = [0, 20])

In [None]:
plot_categorical(data=train, col='geoNetwork.region', size=[12 ,4], xlabel_angle=30, 
                 title='GeoNetwork - Region', max_cat = [1, 20])

In [None]:
plot_categorical(data=train, col='geoNetwork.metro', size=[12 ,4], xlabel_angle=90, 
                 title='GeoNetwork - metro', max_cat = [2, 20])

In [None]:
plot_categorical(data=train, col='geoNetwork.subContinent', size=[8 ,4], xlabel_angle=30, 
                 title='GeoNetwork - SubContinent', max_cat = [0, 10])

In [None]:
plot_categorical(data=train, col='geoNetwork.continent', size=[8 ,4], xlabel_angle=30, 
                 title='GeoNetwork - Continent')

###  <a id='3-4'>3.4 Totals group</a>

In [None]:
train['totals.bounces'] = train['totals.bounces'].fillna('0')
plot_categorical(data=train, col='totals.bounces', size=[8 ,4], xlabel_angle=0, title='Totals - Bounces')

In [None]:
train['totals.newVisits'] = train['totals.newVisits'].fillna('0')
plot_categorical(data=train, col='totals.newVisits', size=[8 ,4], xlabel_angle=0, title='Totals - NewVisits')

In [None]:
plt.figure(figsize=[12, 6])
sns.distplot(train['totals.hits'].astype('float'), kde=True,bins=30)
plt.xlabel('totals.hits')
plt.title('Total - Hits')
plt.show()

In [None]:
plt.figure(figsize=[12, 6])
sns.distplot(train['totals.pageviews'].astype('float').fillna(0))
plt.xlabel('totals.pageviews')
plt.title('Total - Pageviews')
plt.show()

###  <a id='3-5'>3.5 TrafficSource group</a>

In [None]:
plot_categorical(data=train, col='trafficSource.adContent', size=[10 ,4], xlabel_angle=30, 
                 title='TrafficSource - AdContent', max_cat = [0, 10])

In [None]:
plot_categorical(data=train, col='trafficSource.medium', size=[10 ,4], xlabel_angle=30, 
                 title='TrafficSource - medium')

###  <a id='3-6'>3.6  Others</a>
**Channel Grouping**

In [None]:
plot_categorical(data=train, col='channelGrouping', size=[10 ,4], xlabel_angle=30, 
                 title='Channel Grouping')

**Visit Number**

In [None]:
a = train.groupby("fullVisitorId")["visitNumber"].max()
plt.figure(figsize=[12, 6])
sns.distplot(a)
plt.xlabel('VisitNumber')
plt.title('Visit Number')
plt.show()

**Date**

In [None]:
plt.figure(figsize=[12, 6])
sns.distplot(train['date'])
plt.xlabel('Date')
plt.title('Date')
plt.show()

## <a id='4'>4. Baseline Model</a>
###  <a id='4-1'>4.1 Pre-processing</a>
#### Index and Target

In [None]:
train_idx = train.fullVisitorId
test_idx = test.fullVisitorId
train["totals.transactionRevenue"] = train["totals.transactionRevenue"].astype('float').fillna(0)
train_y = train["totals.transactionRevenue"]
train_target = np.log1p(train.groupby("fullVisitorId")["totals.transactionRevenue"].sum())

#### Pre-processing: label encoder

In [None]:
train.drop(['fullVisitorId', 'sessionId', 'visitId'], axis = 1, inplace = True)
test.drop(['fullVisitorId', 'sessionId', 'visitId'], axis = 1, inplace = True)
num_col = ["totals.hits", "totals.pageviews", "visitNumber", "visitStartTime", 'totals.bounces',  'totals.newVisits']
for i in num_col:
    train[i] = train[i].astype('float').fillna(0)
    test[i] = test[i].astype('float').fillna(0)
cat_col = [e for e in train.columns.tolist() if e not in num_col]
cat_col.remove('date')
cat_col.remove('totals.transactionRevenue')
for i in cat_col:
    lab_en = LabelEncoder()
    train[i] = train[i].fillna('not known')
    test[i] = test[i].fillna('not known')
    lab_en.fit(list(train[i].astype('str')) + list(test[i].astype('str')))
    train[i] = lab_en.transform(list(train[i].astype('str')))
    test[i] = lab_en.transform(test[i].astype('str'))
    print('finish', i)

###  <a id='4-2'>4.2 Modeling</a>
#### Train/validation split

In [None]:
y_train = np.log1p(train["totals.transactionRevenue"])
x_train = train.drop(["totals.transactionRevenue"], axis=1)
x_test = test.copy()
print(x_train.shape)
print(x_test.shape)

**Be careful! In a typical time-series application, we can't use later records to predict earlier ones.**

However, I am not yet sure whether we should treat this problem as a time-series one. Feel free to comment and discuss.

BTW, if you want to use a time-series split, see the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) for more information.
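
For reference, a minimal sketch of how `TimeSeriesSplit` orders its folds, using a toy array rather than the competition data. It assumes the rows are already sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy feature matrix; rows are assumed to be sorted by visit time.
X = np.arange(12).reshape(6, 2)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for fold, (trn_idx, val_idx) in enumerate(splits):
    # Every validation index comes after every training index.
    print(fold, trn_idx, val_idx)
```

To try this on the data here, one option would be to sort the training frame by `visitStartTime` first and then swap this splitter in for `KFold`; whether that gives a better validation score for this competition is an open question.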

In [None]:
# random_state only applies when shuffle=True; newer scikit-learn raises an error if it is set without shuffling
folds = KFold(n_splits=5)
oof_preds = np.zeros(x_train.shape[0])
sub_preds = np.zeros(x_test.shape[0])

start = time.time()
valid_score = 0
for n_fold, (trn_idx, val_idx) in enumerate(folds.split(x_train, y_train)):
    trn_x, trn_y = x_train.iloc[trn_idx], y_train[trn_idx]
    val_x, val_y = x_train.iloc[val_idx], y_train[val_idx]    
    
    train_data = lgb.Dataset(data=trn_x, label=trn_y)
    valid_data = lgb.Dataset(data=val_x, label=val_y)
    
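    # Number of rounds and early stopping are passed explicitly rather than through params
    params = {"objective" : "regression", "metric" : "rmse",
              "num_leaves" : 30, "learning_rate" : 0.01, "bagging_fraction" : 0.9,
              "feature_fraction" : 0.3, "bagging_seed" : 0}
    
    # Callbacks replace verbose_eval, which newer LightGBM releases no longer accept
    lgb_model = lgb.train(params, train_data, num_boost_round=10000, valid_sets=[train_data, valid_data],
                          callbacks=[lgb.early_stopping(100), lgb.log_evaluation(1000)])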
    
    oof_preds[val_idx] = lgb_model.predict(val_x, num_iteration=lgb_model.best_iteration)
    oof_preds[oof_preds<0] = 0
    sub_pred = lgb_model.predict(x_test, num_iteration=lgb_model.best_iteration) / folds.n_splits
    sub_pred[sub_pred<0] = 0 # should be greater or equal to 0
    sub_preds += sub_pred
    print('Fold %2d RMSE : %.6f' % (n_fold + 1, np.sqrt(mean_squared_error(val_y, oof_preds[val_idx]))))
    valid_score += np.sqrt(mean_squared_error(val_y, oof_preds[val_idx]))

In [None]:
print('Session-level CV-score:', str(round(valid_score/folds.n_splits,4)))
print(' ')
train_pred = pd.DataFrame({"fullVisitorId":train_idx})
train_pred["PredictedLogRevenue"] = np.expm1(oof_preds)
train_pred = train_pred.groupby("fullVisitorId")["PredictedLogRevenue"].sum().reset_index()
train_pred.columns = ["fullVisitorId", "PredictedLogRevenue"]
train_pred["PredictedLogRevenue"] = np.log1p(train_pred["PredictedLogRevenue"])
train_rmse = np.sqrt(mean_squared_error(train_target, train_pred['PredictedLogRevenue']))
print('User-level score:', str(round(train_rmse, 4)))
print(' ')
end = time.time()
print('training time:', str(round((end - start)/60)), 'mins')

In [None]:
test_pred = pd.DataFrame({"fullVisitorId":test_idx})
test_pred["PredictedLogRevenue"] = np.expm1(sub_preds)
test_pred = test_pred.groupby("fullVisitorId")["PredictedLogRevenue"].sum().reset_index()
test_pred.columns = ["fullVisitorId", "PredictedLogRevenue"]
test_pred["PredictedLogRevenue"] = np.log1p(test_pred["PredictedLogRevenue"])
test_pred.to_csv("lgb_base_model.csv", index=False) # submission

###  <a id='4-3'>4.3 Feature importance</a>

In [None]:
lgb.plot_importance(lgb_model, height=0.5, max_num_features=20, ignore_zero = False, figsize = (12,6), importance_type ='gain')
plt.show()