# Create the Submission Files Using Our Machine Learning Models

## Random Forest Regressor Models - v3
### Version 3: Limit Input Variables, Label Encoding, No Scaling/Transforming of Data


##### !!!We need to perform the same pre-processing procedure and the same variable selection on the Test Dataset as we did on the Training Dataset!!! <br> <br>

The requirements for this Kaggle competition are defined at the project page:
https://www.kaggle.com/c/ga-customer-revenue-prediction

#### We need to use our models to make predictions on the test data and then create the final Submission Files in the required format.

Root Mean Squared Error (RMSE)
Submissions are scored on the root mean squared error. RMSE is defined as:

$$\textrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$$

where $\hat{y}$ is the natural log of the predicted revenue for a customer and $y$ is the natural log of the actual summed revenue value plus one.
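
As a rough check of the metric defined above, here is a minimal sketch on made-up numbers. The function name `rmse_log_revenue` and the toy values are assumptions for illustration only. It takes the prediction already in log scale (as it would appear in PredictedLogRevenue) and applies `np.log1p` to the actual summed revenue.

```python
import numpy as np

def rmse_log_revenue(actual_revenue, predicted_log_revenue):
    """RMSE between log1p(actual summed revenue) and the predicted log revenue."""
    y = np.log1p(np.asarray(actual_revenue, dtype=float))
    y_hat = np.asarray(predicted_log_revenue, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# toy values (not competition data): three visitors, one of whom spent something
actual = [0.0, 0.0, np.e - 1]   # log1p of these is approximately [0, 0, 1]
predicted = [0.0, 0.0, 0.0]     # an all-zero prediction
score = rmse_log_revenue(actual, predicted)
print(score)
```

With one of three visitors off by 1 in log space, the score is the square root of 1/3.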

Submission File
For each fullVisitorId in the test set, you must predict the natural log of their total revenue in PredictedLogRevenue. The submission file should contain a header and have the following format:

fullVisitorId,PredictedLogRevenue <br>
0000000259678714014,0 <br>
0000049363351866189,0 <br>
0000053049821714864,0 <br>
etc.
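
The required format shows fullVisitorId with leading zeros, so the IDs probably should be handled as strings end to end. The sketch below uses a made-up two-row CSV (not the real data) to show what pandas does by default and how `dtype` avoids it. Whether this matters for our pickled test data depends on how the IDs were originally loaded, which is an assumption here.

```python
import io
import pandas as pd

# hypothetical two-row file in the required submission format
csv_text = (
    "fullVisitorId,PredictedLogRevenue\n"
    "0000000259678714014,0\n"
    "0000049363351866189,0\n"
)

# default parsing infers an integer column, which drops the leading zeros
as_int = pd.read_csv(io.StringIO(csv_text))

# forcing a string dtype keeps the IDs exactly as written
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"fullVisitorId": str})

print(as_int["fullVisitorId"].iloc[0])
print(as_str["fullVisitorId"].iloc[0])
```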


In [1]:
import pandas as pd
import numpy as np

import pickle

from sklearn import preprocessing
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix



## Importing and Pre-processing of the Test Dataset
We need to perform the same pre-processing procedure and the same variable selection on the test dataset as we did on the training dataset.
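
One thing worth checking when repeating the pre-processing: `LabelEncoder` assigns integer codes by the sorted order of the values it sees in `fit`, so an encoder fit only on test values may give a category a different code than an encoder fit only on training values. The toy columns below are made up, and fitting on the union of values is just one possible approach (reusing the encoders saved from training would be another). How the training notebook fit its encoders is not shown here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy columns standing in for one categorical feature
train_col = pd.Series(["Chrome", "Firefox", "Safari"])
test_col = pd.Series(["Safari", "Edge"])

# fitting separately: the same string can receive different codes
separate_train = LabelEncoder().fit(train_col.astype(str))
separate_test = LabelEncoder().fit(test_col.astype(str))
print(separate_train.transform(["Safari"])[0], separate_test.transform(["Safari"])[0])

# fitting once on the union of values keeps the mapping identical for both sets
shared = LabelEncoder().fit(pd.concat([train_col, test_col]).astype(str))
train_codes = shared.transform(train_col.astype(str))
test_codes = shared.transform(test_col.astype(str))
print(dict(zip(shared.classes_, range(len(shared.classes_)))))
```

If the codes differ between training and test, a tree model would still run without error, so this kind of mismatch is easy to miss.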

In [2]:
#import the data engineered and feature engineered test dataset
df = pd.read_pickle('/home/michael_suomi/Final-Project-Google-Merch-Store/data/test_v1_full_data_split.pkl')
print(df.shape)
# print(df.columns)

(804684, 42)


In [3]:
# in test data don't have transaction revenue column
# ### CHANGE TRANSACTION REVENUE FROM NANs to 0 AND CHANGE to FLOAT TYPE (some are strings)###
# df.totals_transactionRevenue.fillna(0, inplace=True)
# df.totals_transactionRevenue = df.totals_transactionRevenue.astype(dtype=float)

### CHANGE OTHER STRINGS TO INTS/FLOATS WHERE NEEDED ###
#stick to floats rather than ints since a np.nan is a float object
df.totals_bounces = df.totals_bounces.astype(dtype=float)
df.totals_hits = df.totals_hits.astype(dtype=float)
df.totals_newVisits = df.totals_newVisits.astype(dtype=float)
df.totals_pageviews = df.totals_pageviews.astype(dtype=float)
df.totals_visits = df.totals_visits.astype(dtype=float)

### CONVERT NANs in bounces, newVisits to 0 values ###
#the blank NAN values for these columns imply a 0 value meaning 0 newVisits or 0 bounces
df.totals_bounces.fillna(0, inplace=True)
df.totals_newVisits.fillna(0, inplace=True)
# df.totals_visits.fillna(0, inplace=True) #there shouldn't be anyone with 0 visits (they've at least visited once or wouldn't be recorded)

In [4]:
### VIEW THE DATA BEFORE LABEL ENCODING ###
print(df.shape)
print(df.columns)
df.head(3)

(804684, 42)
Index(['channelGrouping', 'date', 'fullVisitorId', 'sessionId',
       'socialEngagementType', 'visitId', 'visitNumber', 'visitStartTime',
       'device_browser', 'device_operatingSystem', 'device_deviceCategory',
       'device_isMobile', 'geoNetwork_city', 'geoNetwork_region',
       'geoNetwork_networkDomain', 'geoNetwork_continent', 'geoNetwork_metro',
       'geoNetwork_subContinent', 'geoNetwork_country', 'totals_pageviews',
       'totals_newVisits', 'totals_hits', 'totals_visits', 'totals_bounces',
       'trafficSource_keyword', 'trafficSource_medium',
       'trafficSource_adContent', 'trafficSource_referralPath',
       'trafficSource_isTrueDirect', 'trafficSource_campaign',
       'trafficSource_source', 'city_country', 'lat_lng', 'timezone',
       'datetime_iso_utc', 'datetime_iso_local', 'year_local', 'month_local',
       'day_local', 'yearday_local', 'weekday_local', 'hour_local'],
      dtype='object')


Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device_browser,device_operatingSystem,...,lat_lng,timezone,datetime_iso_utc,datetime_iso_local,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
0,Organic Search,20171016,6167871330617112363,6167871330617112363_1508151024,Not Socially Engaged,1508151024,2,1508151024,Chrome,Macintosh,...,"(1.352083, 103.819836)","(+08, 8.0)",2017-10-16 10:50:24+00:00,2017-10-16 18:50:24+08:00,2017.0,10.0,16.0,289.0,1.0,18.0
1,Organic Search,20171016,643697640977915618,0643697640977915618_1508175522,Not Socially Engaged,1508175522,1,1508175522,Chrome,Windows,...,"(41.6488226, -0.8890853)","(CEST, 2.0)",2017-10-16 17:38:42+00:00,2017-10-16 19:38:42+02:00,2017.0,10.0,16.0,289.0,1.0,19.0
2,Organic Search,20171016,6059383810968229466,6059383810968229466_1508143220,Not Socially Engaged,1508143220,1,1508143220,Chrome,Macintosh,...,"(46.227638, 2.213749)","(CEST, 2.0)",2017-10-16 08:40:20+00:00,2017-10-16 10:40:20+02:00,2017.0,10.0,16.0,289.0,1.0,10.0


In [5]:
#view the numerical data columns for counts, mean, and min/max
#if the standard deviation (std) is zero, that means every value is the same - may want to check that data
#and see if need to edit it (since describe ignores NANs for instance, you may need to go back and convert the NANs to a 
#value that makes sense)
df.describe()

Unnamed: 0,date,visitId,visitNumber,visitStartTime,totals_pageviews,totals_newVisits,totals_hits,totals_visits,totals_bounces,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
count,804684.0,804684.0,804684.0,804684.0,804545.0,804684.0,804684.0,804684.0,804684.0,803652.0,803652.0,803652.0,803652.0,803652.0,803652.0
mean,20174960.0,1513339000.0,2.414087,1513339000.0,3.523742,0.751065,4.242126,1.0,0.523122,2017.426762,6.852701,15.670803,193.092638,3.738419,13.932457
std,4573.101,6676000.0,9.431737,6676000.0,5.786013,0.432396,8.196982,0.0,0.499465,0.494607,3.953311,8.587357,120.377968,1.918821,5.833705
min,20170800.0,1501656000.0,1.0,1501657000.0,1.0,0.0,1.0,1.0,0.0,2017.0,1.0,1.0,1.0,1.0,0.0
25%,20171010.0,1507548000.0,1.0,1507548000.0,1.0,1.0,1.0,1.0,0.0,2017.0,3.0,8.0,71.0,2.0,10.0
50%,20171210.0,1513125000.0,1.0,1513125000.0,1.0,1.0,1.0,1.0,1.0,2017.0,8.0,16.0,235.0,4.0,14.0
75%,20180220.0,1519227000.0,1.0,1519227000.0,4.0,1.0,4.0,1.0,1.0,2018.0,10.0,23.0,301.0,5.0,19.0
max,20180430.0,1525158000.0,457.0,1525158000.0,500.0,1.0,500.0,1.0,1.0,2018.0,12.0,31.0,365.0,7.0,23.0


In [6]:
### LABEL ENCODING ALL THE CATEGORICAL VARIABLES ###
# label encode the categorical variables
categorical_cols = ['channelGrouping', 'socialEngagementType', 
                   'device_deviceCategory', 'device_browser', 'device_isMobile',
                   'device_operatingSystem', 'geoNetwork_subContinent',
                   'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
                   'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
                   'trafficSource_isTrueDirect', 'trafficSource_keyword',
                   'trafficSource_source', 'trafficSource_adContent',
                   'trafficSource_medium', 'trafficSource_referralPath',
                   'trafficSource_campaign']

print('Original Dataframe Shape: ', df.shape)

for col in categorical_cols:
    print('\n Converting Column: ', col)
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df[col].values.astype('str')))
    df[col] = lbl.transform(list(df[col].values.astype('str')))
    print(df.shape)


Original Dataframe Shape:  (804684, 42)

 Converting Column:  channelGrouping
(804684, 42)

 Converting Column:  socialEngagementType
(804684, 42)

 Converting Column:  device_deviceCategory
(804684, 42)

 Converting Column:  device_browser
(804684, 42)

 Converting Column:  device_isMobile
(804684, 42)

 Converting Column:  device_operatingSystem
(804684, 42)

 Converting Column:  geoNetwork_subContinent
(804684, 42)

 Converting Column:  geoNetwork_region
(804684, 42)

 Converting Column:  geoNetwork_continent
(804684, 42)

 Converting Column:  geoNetwork_country
(804684, 42)

 Converting Column:  geoNetwork_city
(804684, 42)

 Converting Column:  geoNetwork_metro
(804684, 42)

 Converting Column:  geoNetwork_networkDomain
(804684, 42)

 Converting Column:  trafficSource_isTrueDirect
(804684, 42)

 Converting Column:  trafficSource_keyword
(804684, 42)

 Converting Column:  trafficSource_source
(804684, 42)

 Converting Column:  trafficSource_adContent
(804684, 42)

 Converting Colum

## Decide what Input Data to Use for X
We need to perform the same pre-processing procedure and the same variable selection on the test dataset as we did on the training dataset.
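
Because the model receives plain arrays, the column order matters as much as the column names. A small sketch of a guard for that is below. The list `training_feature_order` and the toy frame are hypothetical, and the real list would have to come from the training notebook.

```python
import pandas as pd

# hypothetical feature order used when the model was trained
training_feature_order = ["totals_pageviews", "totals_hits", "geoNetwork_country"]

# toy test frame with the same columns in a different order plus an ID column
df_toy = pd.DataFrame({
    "fullVisitorId": ["001", "002"],
    "geoNetwork_country": [3, 7],
    "totals_hits": [4.0, 1.0],
    "totals_pageviews": [3.0, 1.0],
})

# fail early if a training feature is absent from the test data
missing = [c for c in training_feature_order if c not in df_toy.columns]
if missing:
    raise ValueError(f"test data is missing training features: {missing}")

# select by the training order so each position means the same thing to the model
X_toy = df_toy[training_feature_order]
print(list(X_toy.columns))
```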

In [7]:
### ASSIGN X and y DATA for VARIABLES WE WANT TO USE ###

# for X data use the initial correlation values and variables that we think are most important to narrow things down
# (remember that the correlation values are just linear correlation values, so this doesn't capture variables
# that have a large but nonlinear influence; however, for linear regression models at the least, that
# seems like a good metric to start with as the linear models won't be able to capture nonlinear effects well anyway)

#INITIAL RUN DECISIONS: as we can see from the initial Pearson correlations, no variables even fall within the range of what
#we would consider even low correlations traditionally, so it is doubtful that linear regression models will work well,
#but we'll try it out - take roughly top 10 variables, use all one hot encoded columns and also include
#weekday_local, month_local, yearday_local, and hour_local since those are features we specifically added to
#make our features unique

### NARROW DOWN THE CATEGORICAL COLUMNS WANT TO ADD AS X VARIABLE INPUTS ###
categorical_columns_x_test = ['trafficSource_isTrueDirect', 'trafficSource_source',
                               #'trafficSource_keyword', #tons of dimensions and not good predictor
                                'geoNetwork_continent', 'geoNetwork_country'
                               #'geoNetwork_city'
                               ]


### NARROW DOWN THE NUMERICAL COLUMNS WANT TO ADD TO X VARIABLE INPUTS ###
numerical_columns_x_test = ['totals_pageviews', 'totals_hits', 'visitNumber', 'totals_newVisits', 'totals_bounces',
                             'weekday_local', 'month_local', 'yearday_local', 'hour_local']

### DON'T HAVE Y-VALUES FOR TEST DATA AS THAT IS WHAT WE ARE PREDICTING ###
# #create y outputs column name (but do in list form for easy list adding later)
# column_y_model = ['totals_transactionRevenue']

#for test data, we now need to keep track of fullVisitorId as well (but do in list form for easy list adding later)
column_fullVisitorId_test = ['fullVisitorId']

#create the model dataframe that includes chosen x input variables (from numerical and categorical) and y output variable
#do this so that can clean the dataframe by dropping all rows that have any nans
df_test = df[numerical_columns_x_test + categorical_columns_x_test + column_fullVisitorId_test]

### TRY RUNNING MODEL WITHOUT DROPPING NANs FIRST - IF DOESN'T WORK, MAY NEED TO LATER REVISE THIS PROCEDURE ###
# print('\nShape of all of our variables being used for the model (before dropping nans): ', df_test.shape)
# #for linear regression drop NANs as they can't be interpreted in the regression model - check to make
# #sure it isn't reducing size of data too much before proceeding
# df_test = df_test.dropna(axis='index', how='any')
# print('\nShape of all of our variables being used for the model (after dropping nans): ', df_test.shape)

### REVENUE_LABEL NOT APPLICABLE FOR TEST DATA AS WE DON'T KNOW ACTUAL REVENUE ###
# #add a column to the df_model data of a simple classifier of "revenue" or "no_revenue" - will use this data point for:
# #     in the train_test_split model we will use the stratify command to get equal train-test percentages for both revenue
# #     and no revenue outcomes - I think this will be important since only about 1.3% of all rows actually resulted in 
# #     revenue and not completely sure how randomly selecting will have equal test-train distributions without defining it
# #     (this may be unnecessary, but better safe than sorry)
# df_model['revenue_label'] = df_model.totals_transactionRevenue.map(lambda revenue_amount: 
#                                                         'revenue' if revenue_amount > 0 else 'no_revenue')


#split out the data we are using for testing to X and y and ID values
columns_X_test = [col for col in list(df_test.columns) if col not in ['fullVisitorId', 'totals_transactionRevenue', 'revenue_label']]
X_test = df_test[columns_X_test]

#split out the fullVisitor info (don't want that column running through model as it doesn't have any coefficients)
fullVisitorId_test = df_test['fullVisitorId']

### DON'T HAVE Y-VALUES FOR TEST DATA AS THAT IS WHAT WE ARE PREDICTING ###
# #don't actually need to reshape the y_model data for decision trees apparently, but narrow it down to only y_values
# y_model = df_model['totals_transactionRevenue'] #.values.reshape(-1, 1)

### WON'T BE DOING TRAIN-TEST-SPLIT ON THE TEST DATA & DON'T KNOW 'REVENUE_LABEL' SO EXCLUDED THIS STEP ###
# #put stratify criteria of revenue/no_revenue into its own array, make sure to reshape this as well
# stratify_criteria_model = df_model['revenue_label'] #.values.reshape(-1, 1)

print('\nShape of X input variables is: ', X_test.shape, '\nShape of fullVisitorId is: ', fullVisitorId_test.shape) #, '\nShape of y output variable is: ', y_model.shape)



Shape of X input variables is:  (804684, 13) 
Shape of fullVisitorId is:  (804684,)


In [8]:
### see how many unique VisitorIds there are to check the groupby later ###
print('# of Unique fullVisitorIds is: ', len(set(fullVisitorId_test)))

# of Unique fullVisitorIds is:  617242


In [9]:
### see how many rows have any nans ###
print('# of rows (transactions) with NaNs is: ', (X_test.shape[0] - X_test.dropna(axis='index', how='any').shape[0]), 
     "\nThis is {}% of the total transaction rows".format((X_test.shape[0] - X_test.dropna(axis='index', how='any').shape[0])/X_test.shape[0]*100))

# of rows (transactions) with NaNs is:  1170 
This is 0.1453986906661497% of the total transaction rows


# Make Predictions on Test Data with the Archived Trained Models

### <u> Model for Submission: </u>
## Random Forest Regressor v3: n_estimators=5 <br>Kaggle Score: 2.4075
model_pickles/random_forest_v3_n_estimators_5.pkl

In [15]:
### LOAD MACHINE LEARNING MODEL FROM PICKLE ###
#load the model of your choosing from pickle
loaded_model_regr = pickle.load(open('model_pickles/random_forest_v3_n_estimators_5.pkl', 'rb'))
print(loaded_model_regr)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)


In [57]:
### GET TRANSACTION REVENUE PREDICTIONS BASED ON YOUR MODEL ### 
#get transaction revenue predictions based on your model (these are the y-values)
#because there are rows with NaNs in X_test data, will need to treat those differently
#this code is not efficient.... try something better later, but it is functional for now (but very slow)

revenue_test_predicted = []

for row_X_test in X_test.itertuples():
    #if any of the row items are a nan, will then do this if loop to set revenue to 0
    if np.isnan(row_X_test).any():
        #assume if the X data has any Nans that revenue is zero
        #(can do more data cleaning later, but only about 1100 out of 804,600 rows that have NaNs so seems reasonable)
        row_revenue_prediction = 0        
        
    #else if none of the row items are nan, will use the row items in the model algorithm to give the revenue prediction
    #the weird syntax of [list(row_X_test)[1:]] is converting the itertuples back to a list, and it is excluding the 0th
    #index which is the index label of the row (which is not a variable we want going through the model) and then
    #the outer [] are to convert it to a 2d array, which is what the model is expecting to see
    else:
        row_revenue_prediction = loaded_model_regr.predict([list(row_X_test)[1:]])
    
    #store the revenue prediction in the overall list
    revenue_test_predicted.append(row_revenue_prediction)

print(len(revenue_test_predicted))
revenue_test_predicted

804684


[array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 20844000.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 ar

In [75]:
#the model spit out results in arrays, so need to clean it up to get to a traditional list of ints/floats
revenue_test_predicted_clean = [x[0] if type(x)!=int else x for x in revenue_test_predicted ]
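
The row-by-row loop and the array cleanup above could likely be replaced by a single vectorized call. The sketch below is one way to do that under the same assumption (any row with a NaN gets a revenue of 0). `MeanOfFeaturesModel` is a hypothetical stand-in so the example runs on its own, and the loaded random forest would be passed in its place.

```python
import numpy as np
import pandas as pd

class MeanOfFeaturesModel:
    """Hypothetical stand-in for a fitted regressor; only .predict is needed."""
    def predict(self, X):
        return np.asarray(X, dtype=float).mean(axis=1)

def predict_with_nan_as_zero(model, X):
    """Predict all complete rows in one call; rows containing a NaN get 0."""
    complete = X.notna().all(axis=1).to_numpy()
    predictions = np.zeros(len(X), dtype=float)
    if complete.any():
        # pass plain values, as the loop did, so only column order matters
        predictions[complete] = model.predict(X.loc[complete].to_numpy())
    return predictions

X_toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [3.0, 2.0, 5.0]})
preds = predict_with_nan_as_zero(MeanOfFeaturesModel(), X_toy)
print(preds)
```

The result is already a flat float array, so no separate cleanup step is needed before building the dataframe.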

In [76]:
### CREATE DATAFRAME WITH IDs LINKED TO EACH PREDICTED TRANSACTION REVENUE & GROUPBY IDs AND TAKE REVENUE SUM ###
#create a new dataframe with fullVisitorId and PredictedTransactionRevenue
df_visitor_revenue_predicted = pd.DataFrame({"fullVisitorId":list(fullVisitorId_test),
                                             "PredictedTransactionRevenue": revenue_test_predicted_clean})

#competition wants the total revenue by customer, so need to groupby the fullVisitorId and sum all transaction revenue
#this will start our final submission df
df_submission = df_visitor_revenue_predicted.groupby("fullVisitorId")["PredictedTransactionRevenue"].sum().reset_index()
print(df_submission.shape)
df_submission.head()

(617242, 2)


Unnamed: 0,fullVisitorId,PredictedTransactionRevenue
0,259678714014,0.0
1,49363351866189,0.0
2,53049821714864,0.0
3,59488412965267,0.0
4,85840370633780,0.0


In [78]:
### CONVERT REVENUE TO LOGSCALE AND CLEAN UP DF ###
#competition also requires us to take the log of the total revenue by customer for its final metrics, so map that
#believe they intend for us to take the log of the total revenue plus 1 (which np.log1p does for us) because 
#otherwise all the 0 revenues would go to -infinity and mess everything up
df_submission["PredictedLogRevenue"] = df_submission["PredictedTransactionRevenue"].map(lambda x: np.log1p(x))

#drop the intermediary column of sum of transaction revenues before the log revenue
df_submission.drop("PredictedTransactionRevenue", axis='columns', inplace=True)
print(df_submission.shape)
df_submission.head()


(617242, 2)


Unnamed: 0,fullVisitorId,PredictedLogRevenue
0,259678714014,0.0
1,49363351866189,0.0
2,53049821714864,0.0
3,59488412965267,0.0
4,85840370633780,0.0
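
The order of the two steps above matters: revenue is summed per visitor first and the log is taken afterwards, since a sum of logs is not the log of a sum. A toy check with made-up numbers is sketched below. It also applies `np.log1p` to the whole column at once, which should give the same result as the `map` with a lambda.

```python
import numpy as np
import pandas as pd

# toy per-visit predictions: visitor "A" has two visits
visits = pd.DataFrame({
    "fullVisitorId": ["A", "A", "B"],
    "PredictedTransactionRevenue": [10.0, 5.0, 0.0],
})

# sum per visitor first
per_visitor = (visits.groupby("fullVisitorId")["PredictedTransactionRevenue"]
               .sum()
               .reset_index())

# then take log(1 + total) on the whole column
per_visitor["PredictedLogRevenue"] = np.log1p(per_visitor["PredictedTransactionRevenue"])
print(per_visitor)
```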


In [79]:
### SAVE AS CSV ###
#send the df_submission to csv - use folder kaggle_submissions and use same name as the model (just change .pkl to .csv)
df_submission.to_csv("kaggle_submissions/random_forest_v3_n_estimators_5.csv", index=False)
print('Results saved to kaggle_submissions folder.')

Results saved to kaggle_submissions folder.


### <u> Model for Submission: </u>
## Random Forest Regressor v3: n_estimators=200, min_leaf=44 <br>Kaggle Score: ???
model_pickles/random_forest_v3_n_estimators_200_min_leaf_44.pkl

In [10]:
### LOAD MACHINE LEARNING MODEL FROM PICKLE ###
#load the model of your choosing from pickle
loaded_model_regr = pickle.load(open('model_pickles/random_forest_v3_n_estimators_200_min_leaf_44.pkl', 'rb'))
print(loaded_model_regr)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=44, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)


In [11]:
### GET TRANSACTION REVENUE PREDICTIONS BASED ON YOUR MODEL ### 
#get transaction revenue predictions based on your model (these are the y-values)
#because there are rows with NaNs in X_test data, will need to treat those differently
#this code is not efficient.... try something better later, but it is functional for now (but very slow)

revenue_test_predicted = []

for row_X_test in X_test.itertuples():
    #if any of the row items are a nan, will then do this if loop to set revenue to 0
    if np.isnan(row_X_test).any():
        #assume if the X data has any Nans that revenue is zero
        #(can do more data cleaning later, but only about 1100 out of 804,600 rows that have NaNs so seems reasonable)
        row_revenue_prediction = 0        
        
    #else if none of the row items are nan, will use the row items in the model algorithm to give the revenue prediction
    #the weird syntax of [list(row_X_test)[1:]] is converting the itertuples back to a list, and it is excluding the 0th
    #index which is the index label of the row (which is not a variable we want going through the model) and then
    #the outer [] are to convert it to a 2d array, which is what the model is expecting to see
    else:
        row_revenue_prediction = loaded_model_regr.predict([list(row_X_test)[1:]])
    
    #store the revenue prediction in the overall list
    revenue_test_predicted.append(row_revenue_prediction)

print(len(revenue_test_predicted))
revenue_test_predicted

804684


[array([ 7356.21715241]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 407963.86422141]),
 array([ 1484163.07433054]),
 array([ 0.]),
 array([ 0.]),
 array([ 258771.27166586]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 25316.32984923]),
 array([ 0.]),
 array([ 0.]),
 array([ 40329.76240924]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 32973.54525683]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 543041.71032153]),
 array([ 35138.54525683]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 array([ 0.]),
 ar

In [12]:
#the model spit out results in arrays, so need to clean it up to get to a traditional list of ints/floats
revenue_test_predicted_clean = [x[0] if type(x)!=int else x for x in revenue_test_predicted ]

In [13]:
### CREATE DATAFRAME WITH IDs LINKED TO EACH PREDICTED TRANSACTION REVENUE & GROUPBY IDs AND TAKE REVENUE SUM ###
#create a new dataframe with fullVisitorId and PredictedTransactionRevenue
df_visitor_revenue_predicted = pd.DataFrame({"fullVisitorId":list(fullVisitorId_test),
                                             "PredictedTransactionRevenue": revenue_test_predicted_clean})

#competition wants the total revenue by customer, so need to groupby the fullVisitorId and sum all transaction revenue
#this will start our final submission df
df_submission = df_visitor_revenue_predicted.groupby("fullVisitorId")["PredictedTransactionRevenue"].sum().reset_index()
print(df_submission.shape)
df_submission.head()

(617242, 2)


Unnamed: 0,fullVisitorId,PredictedTransactionRevenue
0,259678714014,595202.880334
1,49363351866189,0.0
2,53049821714864,0.0
3,59488412965267,0.0
4,85840370633780,0.0


In [14]:
### CONVERT REVENUE TO LOGSCALE AND CLEAN UP DF ###
#competition also requires us to take the log of the total revenue by customer for its final metrics, so map that
#believe they intend for us to take the log of the total revenue plus 1 (which np.log1p does for us) because 
#otherwise all the 0 revenues would go to -infinity and mess everything up
df_submission["PredictedLogRevenue"] = df_submission["PredictedTransactionRevenue"].map(lambda x: np.log1p(x))

#drop the intermediary column of sum of transaction revenues before the log revenue
df_submission.drop("PredictedTransactionRevenue", axis='columns', inplace=True)
print(df_submission.shape)
df_submission.head()


(617242, 2)


Unnamed: 0,fullVisitorId,PredictedLogRevenue
0,259678714014,13.296659
1,49363351866189,0.0
2,53049821714864,0.0
3,59488412965267,0.0
4,85840370633780,0.0


In [16]:
### SAVE AS CSV ###
#send the df_submission to csv - use folder kaggle_submissions and use same name as the model (just change .pkl to .csv)
df_submission.to_csv("kaggle_submissions/random_forest_v3_n_estimators_200_min_leaf_44.csv", index=False)
print('Results saved to kaggle_submissions folder.')

Results saved to kaggle_submissions folder.


### <u> Model for Submission: </u>
## Prediction of Zero for Everything - Playing the Odds <br>Kaggle Score: 1.7804
No need for model, just create y_prediction of all 0s.
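
Since log1p(0) is 0, the all-zero submission could also be built straight from the unique IDs without the groupby and log steps. A minimal sketch on made-up IDs:

```python
import pandas as pd

# toy visit-level IDs; visitor "0001" appears twice
visit_ids = pd.Series(["0001", "0002", "0001", "0003"], name="fullVisitorId")

# one row per unique visitor, every prediction set to 0.0
baseline = pd.DataFrame({
    "fullVisitorId": sorted(visit_ids.unique()),
    "PredictedLogRevenue": 0.0,
})
print(baseline)
```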

In [80]:
#need a list of zeros with length equal to the number of rows in X_test (which is the 0th index of its shape tuple)
revenue_test_predicted_clean = [0 for row in range(0, X_test.shape[0])]

#verify that first and last number in list are zero
print('first list item is: ', revenue_test_predicted_clean[0], '\nlast list item is: ', revenue_test_predicted_clean[-1])

#verify lengths
print('\nX_model_test length is: ', X_test.shape[0],
      '\ny_model_test_predict length is: ', len(revenue_test_predicted_clean))

first list item is:  0 
last list item is:  0

X_model_test length is:  804684 
y_model_test_predict length is:  804684


In [81]:
### CREATE DATAFRAME WITH IDs LINKED TO EACH PREDICTED TRANSACTION REVENUE & GROUPBY IDs AND TAKE REVENUE SUM ###
#create a new dataframe with fullVisitorId and PredictedTransactionRevenue
df_visitor_revenue_predicted = pd.DataFrame({"fullVisitorId":list(fullVisitorId_test),
                                             "PredictedTransactionRevenue": revenue_test_predicted_clean})

#competition wants the total revenue by customer, so need to groupby the fullVisitorId and sum all transaction revenue
#this will start our final submission df
df_submission = df_visitor_revenue_predicted.groupby("fullVisitorId")["PredictedTransactionRevenue"].sum().reset_index()
print(df_submission.shape)
df_submission.head()

(617242, 2)


Unnamed: 0,fullVisitorId,PredictedTransactionRevenue
0,259678714014,0
1,49363351866189,0
2,53049821714864,0
3,59488412965267,0
4,85840370633780,0


In [82]:
### CONVERT REVENUE TO LOGSCALE AND CLEAN UP DF ###
#competition also requires us to take the log of the total revenue by customer for its final metrics, so map that
#believe they intend for us to take the log of the total revenue plus 1 (which np.log1p does for us) because 
#otherwise all the 0 revenues would go to -infinity and mess everything up
df_submission["PredictedLogRevenue"] = df_submission["PredictedTransactionRevenue"].map(lambda x: np.log1p(x))

#drop the intermediary column of sum of transaction revenues before the log revenue
df_submission.drop("PredictedTransactionRevenue", axis='columns', inplace=True)
print(df_submission.shape)
df_submission.head()


(617242, 2)


Unnamed: 0,fullVisitorId,PredictedLogRevenue
0,259678714014,0.0
1,49363351866189,0.0
2,53049821714864,0.0
3,59488412965267,0.0
4,85840370633780,0.0


In [83]:
### SAVE AS CSV ###
#send the df_submission to csv - use folder kaggle_submissions and use same name as the model (just change .pkl to .csv)
df_submission.to_csv("kaggle_submissions/playing_the_odds_of_zero_revenue_everyone.csv", index=False)
print('Results saved to kaggle_submissions folder.')

Results saved to kaggle_submissions folder.


In [None]:
### THE BELOW FUNCTIONS/SCRIPT DIDN'T WORK BECAUSE OF THE NANs THAT EXIST IN THE DATA ###

In [13]:
# ### LOAD PICKLE MODEL AND RUN PREDICTION USING THAT MODEL ###

# #function that loads the machine learning model from a pickle and then uses that to create
# #the predicted y_values - returns y_predictions
# #the input of file_name_path, make sure it includes extension of .pkl
# def prediction_from_pickle_model(X_test_data, model_file_name_path):
#     loaded_model_regr = pickle.load(open(model_file_name_path, 'rb'))
    
#     y_test_predicted = loaded_model_regr.predict(X_test_data)
    
#     return y_test_predicted

In [56]:
# #run predictions using the pickled model of your choosing to get y value revenue predictions per transaction
# revenue_test_predicted = prediction_from_pickle_model(X_test, 'model_pickles/random_forest_v3_n_estimators_5.pkl')

# #may need to fill NaNs with zeros if there are NaNs in output (not sure how it will handle things)

# #create a new dataframe with fullVisitorId and PredictedTransactionRevenue
# df_visitor_revenue_predicted = pd.DataFrame({"fullVisitorId":fullVisitorId_test, "PredictedTransactionRevenue": revenue_test_predicted})

# #competition wants the total revenue by customer, so need to groupby the fullVisitorId and sum all transaction revenue
# #this will start our final submission df
# df_submission = df_visitor_revenue_predicted.groupby("fullVisitorId")["PredictedTransactionRevenue"].sum().reset_index()

# #competition also requires us to take the log of the total revenue by customer for its final metrics, so map that
# #believe they intend for us to take the log of the total revenue plus 1 (which np.log1p does for us) because 
# #otherwise all the 0 revenues would go to -infinity and mess everything up
# df_submission["PredictedLogRevenue"] = df_submission["PredictedTransactionRevenue"].map(lambda x: np.log1p(x))

# #drop the intermediary column of sum of transaction revenues before the log revenue
# df_submission.drop("PredictedTransactionRevenue", axis='columns', inplace=True)

# #send the df_submission to csv - use folder kaggle_submissions and use same name as the model (just change .pkl to .csv)
# df_submission.to_csv("kaggle_submissions/random_forest_v3_n_estimators_5.csv", index=False)
# print('Results saved to kaggle_submissions folder.')