Please see [this discussion](https://www.kaggle.com/c/google-analytics-customer-revenue-prediction/discussion/65989) for more detail.

The code below sets the predicted revenue to zero for every test record that has a positive (non-null) 'totals_bounces'. Doing this reduced my RMSE by 0.0001 on the public leaderboard.

The idea is as follows: if I understand the feature 'totals_bounces' correctly, a positive value means that the user stumbled on the site (probably by clicking on an ad by accident) and left right away. Such a user should not have made any purchase at all. The training data reflects this very well, and at first I thought the model should pick it up easily. However, after some experiments I failed to use this feature in any meaningful way inside the model. That's why I tried to use it for postprocessing instead.
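The claim that bounced sessions carry no revenue can be checked directly on the training data. Here is a minimal sketch, assuming the flattened training frame has a 'totals_transactionRevenue' column (a name I am assuming from the flattening scheme, not one shown in this kernel); the helper `bounce_revenue_summary` and the toy frame are hypothetical and only illustrate the check.

```python
import numpy as np
import pandas as pd

def bounce_revenue_summary(df, bounce_col="totals_bounces",
                           revenue_col="totals_transactionRevenue"):
    """Count sessions and revenue-generating sessions, split by bounce flag."""
    # A non-null bounce value is treated as a bounced session.
    labels = np.where(df[bounce_col].notnull(), "bounce", "no bounce")
    # Revenue arrives as strings after JSON flattening; missing means no purchase.
    revenue = pd.to_numeric(df[revenue_col], errors="coerce").fillna(0)
    has_revenue = (revenue > 0).astype(int)
    return has_revenue.groupby(labels).agg(sessions="size",
                                           sessions_with_revenue="sum")

# Toy data standing in for the real training set (values are made up).
toy_train = pd.DataFrame({
    "totals_bounces": ["1", None, "1", None],
    "totals_transactionRevenue": [None, "1000000", None, None],
})
summary = bounce_revenue_summary(toy_train)
print(summary)
```

If the "bounce" row shows zero sessions with revenue on the real training set, zeroing those predictions cannot hurt on that data.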

I actually expected a boost of more than 0.0001, as half of the records in test have positive bounces. But it seems my model had already done a decent job of dealing with these cases.
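One way to see how much the postprocessing can help before submitting is to apply it to held-out predictions and compare the RMSE before and after. This is only a sketch on made-up session-level numbers; `rmse` and `zero_out_bounces` are hypothetical helpers, and the competition metric is computed per visitor rather than per session.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def zero_out_bounces(pred, is_bounce):
    """Return a copy of pred with bounced sessions forced to zero."""
    adjusted = np.asarray(pred, dtype=float).copy()
    adjusted[np.asarray(is_bounce, dtype=bool)] = 0.0
    return adjusted

# Made-up validation data: true log revenue, model output, bounce flags.
y_true = np.array([0.0, 0.0, 2.0, 0.0])
pred = np.array([0.1, 0.2, 1.5, 0.0])
is_bounce = np.array([True, True, False, False])

adjusted = zero_out_bounces(pred, is_bounce)
rmse_before = rmse(y_true, pred)
rmse_after = rmse(y_true, adjusted)
print(f"share of bounces: {is_bounce.mean():.2f}")
print(f"RMSE before: {rmse_before:.4f}, after: {rmse_after:.4f}")
```

When the model already predicts values close to zero for bounced sessions, the two numbers will be nearly identical, which matches the small gain reported above.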

In [None]:
import json

import numpy as np
import pandas as pd
from pandas import json_normalize  # public location since pandas 1.0

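# Columns that hold JSON strings and need to be flattened.
json_cols = ['device', 'geoNetwork', 'totals', 'trafficSource']

def load_df(filename):
    """Read a competition CSV and flatten its JSON columns into '<column>_<subcolumn>' columns."""
    path = "../input/" + filename
    # Parse JSON columns on load; keep visitor IDs as strings so leading zeros survive.
    df = pd.read_csv(path, converters={column: json.loads for column in json_cols},
                     dtype={'fullVisitorId': 'str'})

    for column in json_cols:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}_{subcolumn}" for subcolumn in column_as_df.columns]
        # Swap the raw JSON column for its flattened sub-columns, aligned on the row index.
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    return df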

test = load_df("test.csv")

submission = test[['fullVisitorId']].copy()
submission['fullVisitorId'] = submission['fullVisitorId'].astype(str)

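# A non-null 'totals_bounces' is treated as a bounced session.
test_is_bounce_index = test.index[test['totals_bounces'].notnull()]

# Random numbers stand in for model predictions here; substitute your own model's output.
submission['PredictedLogRevenue'] = np.random.rand(submission.shape[0])
# Postprocessing step: force the prediction to zero for every bounced session.
submission.loc[test_is_bounce_index, 'PredictedLogRevenue'] = 0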

# Sum the session-level predictions into one row per visitor.
grouped_test = submission.groupby('fullVisitorId', as_index=False)['PredictedLogRevenue'].sum()
# Guard against negative totals.
grouped_test['PredictedLogRevenue'] = grouped_test['PredictedLogRevenue'].clip(lower=0)
grouped_test.to_csv('postprocessing.csv', index=False)
