# Create the Submission Files Using Our Machine Learning Models

## Random Forest Regressor Models - v3
### Version 3: Limit Input Variables, Label Encoding, No Scaling/Transforming of Data


##### We need to perform the same pre-processing procedure and the same variable selection on the test dataset as we did on the training dataset.

The requirements for this Kaggle competition are defined at the project page:
https://www.kaggle.com/c/ga-customer-revenue-prediction

#### We need to use our models to make predictions on the test data and then create the final submission files in the required format.

Root Mean Squared Error (RMSE)
Submissions are scored on the root mean squared error. RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$
where y hat is the natural log of the predicted revenue for a customer and y is the natural log of the actual summed revenue value plus one.
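
As a quick sanity check of the metric, here is a minimal sketch of the RMSE calculation in Python. The function name `rmse_log_revenue` is made up for this illustration, and applying `np.log1p` to both the actual and the predicted revenue is an assumption (it matches how the predictions are transformed later in this notebook).

```python
import numpy as np

def rmse_log_revenue(actual_revenue, predicted_revenue):
    """RMSE between the log1p of actual and predicted per-customer revenue.

    Assumption: both sides use log(1 + revenue), so zero revenue maps to 0.
    """
    y = np.log1p(np.asarray(actual_revenue, dtype=float))
    y_hat = np.log1p(np.asarray(predicted_revenue, dtype=float))
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# perfect predictions give an error of zero
print(rmse_log_revenue([0.0, 100.0], [0.0, 100.0]))
```

Because the error is measured on the log scale, missing a large purchase by a fixed dollar amount costs less than missing a small one by the same amount.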

Submission File
For each fullVisitorId in the test set, you must predict the natural log of their total revenue in PredictedLogRevenue. The submission file should contain a header and have the following format:

fullVisitorId,PredictedLogRevenue <br>
0000000259678714014,0 <br>
0000049363351866189,0 <br>
0000053049821714864,0 <br>
etc.
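
The sample IDs above start with leading zeros, so `fullVisitorId` has to be kept as a string whenever the file is read or written. The sketch below shows one way to check a submission frame against the required format. The helper `check_submission_format` is hypothetical and not part of the competition tooling, and the uniqueness check is an assumption based on the "one prediction per fullVisitorId" requirement.

```python
import io
import pandas as pd

def check_submission_format(df_submission):
    """Return True if the frame has the required header and string, unique IDs."""
    if list(df_submission.columns) != ["fullVisitorId", "PredictedLogRevenue"]:
        return False
    ids = df_submission["fullVisitorId"]
    ids_are_strings = ids.map(lambda v: isinstance(v, str)).all()
    return bool(ids_are_strings and ids.is_unique)

csv_text = (
    "fullVisitorId,PredictedLogRevenue\n"
    "0000000259678714014,0\n"
    "0000049363351866189,0\n"
)

# reading the IDs as strings keeps the leading zeros
sample = pd.read_csv(io.StringIO(csv_text), dtype={"fullVisitorId": str})
print(check_submission_format(sample))
```

Reading the same text without the `dtype` argument would parse the IDs as integers and silently drop the zeros, which is why the check looks at the value type.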


In [1]:
import pandas as pd
import numpy as np

import pickle

from sklearn import preprocessing  #needed for LabelEncoder in the label encoding cell below
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix

## Importing and Pre-processing of the Test Dataset
We need to perform the same pre-processing procedure and the same variable selection on the test dataset as we did on the training dataset.

In [None]:
#import the data engineered and feature engineered test dataset
df = pd.read_pickle('/home/michael_suomi/Final-Project-Google-Merch-Store/data/test_v1_full_data_split.pkl')
print(df.shape)
# print(df.columns)

In [None]:
### CHANGE TRANSACTION REVENUE FROM NANs to 0 AND CHANGE to FLOAT TYPE (some are strings)###
#assign the result back rather than using inplace=True on a column, which newer pandas versions may not apply to df
df['totals_transactionRevenue'] = df['totals_transactionRevenue'].fillna(0).astype(float)

### CHANGE OTHER STRINGS TO INTS/FLOATS WHERE NEEDED ###
#stick to floats rather than ints since a np.nan is a float object
df.totals_bounces = df.totals_bounces.astype(dtype=float)
df.totals_hits = df.totals_hits.astype(dtype=float)
df.totals_newVisits = df.totals_newVisits.astype(dtype=float)
df.totals_pageviews = df.totals_pageviews.astype(dtype=float)
df.totals_visits = df.totals_visits.astype(dtype=float)

### CONVERT NANs in bounces, newVisits to 0 values ###
#the blank NAN values for these columns imply a 0 value meaning 0 newVisits or 0 bounces
#assign back rather than using inplace=True on a column, which newer pandas versions may not apply to df
df['totals_bounces'] = df['totals_bounces'].fillna(0)
df['totals_newVisits'] = df['totals_newVisits'].fillna(0)
# df['totals_visits'] = df['totals_visits'].fillna(0) #there shouldn't be anyone with 0 visits (they've at least visited once or wouldn't be recorded)
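
Since the same cleaning has to be applied to both the training and test data, it can help to wrap the steps in one function and call it from both notebooks. This is only a sketch: `clean_totals` and the toy frame are made up for illustration, and using `pd.to_numeric(..., errors="coerce")` is an assumption that turns unparseable strings into NaN instead of raising as `astype(float)` would.

```python
import numpy as np
import pandas as pd

def clean_totals(df, float_cols, zero_fill_cols):
    """Cast string columns to float, then fill NaNs with 0 where a blank means zero."""
    df = df.copy()  # leave the caller's frame untouched
    for col in float_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce").astype(float)
    for col in zero_fill_cols:
        df[col] = df[col].fillna(0)
    return df

# toy data shaped like the totals_* columns (strings with blanks)
toy = pd.DataFrame({"totals_hits": ["3", "1"], "totals_bounces": [np.nan, "1"]})
cleaned = clean_totals(toy, ["totals_hits", "totals_bounces"], ["totals_bounces"])
print(cleaned)
```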

In [None]:
### VIEW THE DATA BEFORE LABEL ENCODING ###
print(df.shape)
print(df.columns)
df.head(3)

In [None]:
#view the numerical data columns for counts, mean, and min/max
#if the standard deviation (std) is zero, that means every value is the same - may want to check that data
#and see if need to edit it (since describe ignores NANs for instance, you may need to go back and convert the NANs to a 
#value that makes sense)
df.describe()

In [None]:
### LABEL ENCODING ALL THE CATEGORICAL VARIABLES ###
# label encode the categorical variables
categorical_cols = ['channelGrouping', 'socialEngagementType', 
                   'device_deviceCategory', 'device_browser', 'device_isMobile',
                   'device_operatingSystem', 'geoNetwork_subContinent',
                   'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
                   'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
                   'trafficSource_isTrueDirect', 'trafficSource_keyword',
                   'trafficSource_source', 'trafficSource_adContent',
                   'trafficSource_medium', 'trafficSource_referralPath',
                   'trafficSource_campaign']

print('Original Dataframe Shape: ', df.shape)

for col in categorical_cols:
    print('\n Converting Column: ', col)
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df[col].values.astype('str')))
    df[col] = lbl.transform(list(df[col].values.astype('str')))
    print(df.shape)
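
One thing to watch: `LabelEncoder` assigns integer codes from the sorted set of values it was fitted on. If the training notebook fitted its encoders on the training data alone, the codes produced here from the test data may not line up with what the model learned. Below is a hedged sketch of one common workaround, fitting each encoder on the combined train and test values. The helper `encode_consistently` and the toy series are hypothetical and not taken from the training notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_consistently(train_col, test_col):
    """Fit one encoder on train + test values so both share the same integer codes."""
    train_str = list(train_col.astype(str))
    test_str = list(test_col.astype(str))
    lbl = LabelEncoder()
    lbl.fit(train_str + test_str)
    return lbl.transform(train_str), lbl.transform(test_str)

# toy example: 'Firefox' only appears in the test data
train_browser = pd.Series(["Chrome", "Safari", "Chrome"])
test_browser = pd.Series(["Safari", "Firefox"])
train_codes, test_codes = encode_consistently(train_browser, test_browser)
print(train_codes, test_codes)
```

With this approach 'Safari' gets the same code in both datasets. An alternative would be to pickle the fitted encoders alongside the model, but that requires a plan for values that only appear in the test data.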


## Decide What Input Data to Use for X (No train_test_split on the Test Data)
We need to perform the same pre-processing procedure and the same variable selection on the test dataset as we did on the training dataset.

In [None]:
### ASSIGN X and y DATA for VARIABLES WE WANT TO USE###

# for X data use the initial correlation values and variables that we think are most important to narrow things down
# (remember, that the correlation values are just linear correlation values, so this doesn't capture variables
# that do have a large influence but might be nonlinear, however, for linear regression models at the least, that
# seems like a good metric to start with as the linear models won't be able to capture nonlinear effects well anyway)

#INITIAL RUN DECISIONS: as we can see from the initial Pearson correlations, no variables even fall within the range of what
#we would consider even low correlations traditionally, so it is doubtful that linear regression models will work well,
#but we'll try it out - take roughly top 10 variables, use all one hot encoded columns and also include
#weekday_local, month_local, yearday_local, and hour_local since those are features we specifically added to
#make our features unique

### NARROW DOWN THE CATEGORICAL COLUMNS WANT TO ADD AS X VARIABLE INPUTS ###
categorical_columns_x_test = ['trafficSource_isTrueDirect', 'trafficSource_source',
                               #'trafficSource_keyword', #tons of dimensions and not good predictor
                                'geoNetwork_continent', 'geoNetwork_country'
                               #'geoNetwork_city'
                               ]


### NARROW DOWN THE NUMERICAL COLUMNS WANT TO ADD TO X VARIABLE INPUTS ###
numerical_columns_x_test = ['totals_pageviews', 'totals_hits', 'visitNumber', 'totals_newVisits', 'totals_bounces',
                             'weekday_local', 'month_local', 'yearday_local', 'hour_local']

### DON'T HAVE Y-VALUES FOR TEST DATA AS THAT IS WHAT WE ARE PREDICTING ###
# #create y outputs column name (but do in list form for easy list adding later)
# column_y_model = ['totals_transactionRevenue']

#for test data, we now need to keep track of fullVisitorId as well (but do in list form for easy list adding later)
column_fullVisitorId_test = ['fullVisitorId']

#create the test dataframe that includes the chosen x input variables (numerical and categorical) and fullVisitorId
#do this so that can clean the dataframe by dropping all rows that have any nans
df_test = df[numerical_columns_x_test + categorical_columns_x_test + column_fullVisitorId_test]

### TRY RUNNING MODEL WITHOUT DROPPING NANs FIRST - IF DOESN'T WORK, MAY NEED TO LATER REVISE THIS PROCEDURE ###
# print('\nShape of all of our variables being used for the model (before dropping nans): ', df_test.shape)
# #for linear regression drop NANs as they can't be interpreted in the regression model - check to make
# #sure it isn't reducing size of data too much before proceeding
# df_test = df_test.dropna(axis='index', how='any')
# print('\nShape of all of our variables being used for the model (after dropping nans): ', df_test.shape)

### REVENUE_LABEL NOT APPLICABLE FOR TEST DATA AS WE DON'T KNOW ACTUAL REVENUE ###
# #add a column to the df_model data of a simple classifier of "revenue" or "no_revenue" - will use this data point for:
# #     in the train_test_split model we will use the stratify command to get equal train-test percentages for both revenue
# #     and no revenue outcomes - I think this will be important since only about 1.3% of all rows actually resulted in 
# #     revenue and not completely sure how randomly selecting will have equal test-train distributions without defining it
# #     (this may be unnecessary, but better safe than sorry)
# df_model['revenue_label'] = df_model.totals_transactionRevenue.map(lambda revenue_amount: 
#                                                         'revenue' if revenue_amount > 0 else 'no_revenue')


#split out the data we are using for testing to X and y and ID values
#use df_test here (df_model only exists in the training notebook)
columns_X_test = [col for col in list(df_test.columns) if col not in ['fullVisitorId', 'totals_transactionRevenue', 'revenue_label']]
X_test = df_test[columns_X_test]

#split out the fullVisitor info (don't want that column running through model as it doesn't have any coefficients)
fullVisitorId_test = df_test['fullVisitorId']

### DON'T HAVE Y-VALUES FOR TEST DATA AS THAT IS WHAT WE ARE PREDICTING ###
# #don't actually need to reshape the y_model data for decision trees apparently, but narrow it down to only y_values
# y_model = df_model['totals_transactionRevenue'] #.values.reshape(-1, 1)

### WON'T BE DOING TRAIN-TEST-SPLIT ON THE TEST DATA & DON'T KNOW 'REVENUE_LABEL' SO EXCLUDE THIS STEP ###
# #put stratify criteria of revenue/no_revenue into its own array, make sure to reshape this as well
# stratify_criteria_model = df_model['revenue_label'] #.values.reshape(-1, 1)

print('\nShape of X input variables is: ', X_test.shape, '\nShape of fullVisitorId is: ', fullVisitorId_test.shape) #, '\nShape of y output variable is: ', y_model.shape)
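
Because the model was trained on a specific set of columns in a specific order, it is worth confirming that `X_test` has exactly those columns before predicting. Here is a minimal sketch of such a check. The helper `align_to_training_columns` and the idea of passing in a saved list of training column names are assumptions for illustration, since this notebook does not store that list.

```python
import pandas as pd

def align_to_training_columns(X, training_columns):
    """Return X restricted to, and ordered like, the columns used for training."""
    missing = [col for col in training_columns if col not in X.columns]
    if missing:
        raise KeyError(f"Test data is missing training columns: {missing}")
    return X[list(training_columns)]

# toy example: test columns arrive in a different order with an extra ID column
toy_X = pd.DataFrame({"totals_hits": [1.0], "fullVisitorId": ["0001"], "totals_pageviews": [2.0]})
aligned = align_to_training_columns(toy_X, ["totals_pageviews", "totals_hits"])
print(list(aligned.columns))
```

Failing loudly on a missing column is a deliberate choice here: a silent mismatch would still produce predictions, just wrong ones.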


# Make Predictions on Test Data with the Archived Trained Models

##### Initial Functions Used for  Importing Trained Models and Processing Test Predictions to Kaggle Output Required

In [None]:
### LOAD PICKLE MODEL AND RUN PREDICTION USING THAT MODEL ###

#function that loads the machine learning model from a pickle and then uses that to create
#the predicted y_values - returns y_predictions
#the input of file_name_path, make sure it includes extension of .pkl
def prediction_from_pickle_model(X_test_data, model_file_name_path):
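    #use a context manager so the file handle is closed once the model is loaded
    with open(model_file_name_path, 'rb') as model_file:
        loaded_model_regr = pickle.load(model_file)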
    
    y_test_predicted = loaded_model_regr.predict(X_test_data)
    
    return y_test_predicted

### <u>Model for Submission:</u>
### Random Forest Regressor v3: n_estimators=5
model_pickles/random_forest_v3_n_estimators_5.pkl

In [None]:
#run predictions using the pickled model of your choosing to get y value revenue predictions per transaction
revenue_test_predicted = prediction_from_pickle_model(X_test, 'model_pickles/random_forest_v3_n_estimators_5.pkl')

#may need to fill NaNs with zeros if there are NaNs in output (not sure how it will handle things)

#create a new dataframe with fullVisitorId and PredictedTransactionRevenue
df_visitor_revenue_predicted = pd.DataFrame({"fullVisitorId":fullVisitorId_test, "PredictedTransactionRevenue": revenue_test_predicted})

#competition wants the total revenue by customer, so need to groupby the fullVisitorId and sum all transaction revenue
#this will start our final submission df
df_submission = df_visitor_revenue_predicted.groupby("fullVisitorId")["PredictedTransactionRevenue"].sum().reset_index()

#competition also requires us to take the log of the total revenue by customer for its final metrics, so map that
#believe they intend for us to take the log of the total revenue plus 1 (which np.log1p does for us) because 
#otherwise all the 0 revenues would go to -infinity and mess everything up
df_submission["PredictedLogRevenue"] = np.log1p(df_submission["PredictedTransactionRevenue"])

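#drop the intermediary column of summed transaction revenues now that we have the log revenue
#(columns= is required; without it drop looks for a row label and raises a KeyError)
df_submission.drop(columns="PredictedTransactionRevenue", inplace=True)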

#send the df_submission to csv - use folder kaggle_submissions and use same name as the model (just change .pkl to .csv)
df_submission.to_csv("kaggle_submissions/random_forest_v3_n_estimators_5.csv", index=False)
print('Results saved to kaggle_submissions folder.')
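
To make the aggregation steps concrete, here is a small worked sketch on toy data: per-visit predictions are summed per customer, and the log of one plus the total is taken. The function `build_submission` and the toy IDs are made up for illustration. Clipping negative predictions to zero is an assumption added so that `np.log1p` never sees a value below zero.

```python
import numpy as np
import pandas as pd

def build_submission(visitor_ids, predicted_revenue):
    """Sum per-visit revenue predictions per customer and return log1p of the totals."""
    per_visit = pd.DataFrame({
        "fullVisitorId": list(visitor_ids),
        # assumption: revenue cannot be negative, so clip before summing
        "PredictedTransactionRevenue": np.clip(np.asarray(predicted_revenue, dtype=float), 0, None),
    })
    per_customer = (
        per_visit.groupby("fullVisitorId")["PredictedTransactionRevenue"].sum().reset_index()
    )
    per_customer["PredictedLogRevenue"] = np.log1p(per_customer["PredictedTransactionRevenue"])
    return per_customer[["fullVisitorId", "PredictedLogRevenue"]]

# customer "0001" has two visits (10 and 5), customer "0002" has one visit with no revenue
toy_submission = build_submission(["0001", "0002", "0001"], [10.0, 0.0, 5.0])
print(toy_submission)
```

Note that the sum happens before the log: the log of the total per customer is not the same as the sum of per-visit logs.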

In [6]:
np.log1p(1000000)

13.815511557963774

In [None]:
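#clip any negative predictions to 0 (revenue can't be negative) - run this before summing and taking the log
revenue_test_predicted[revenue_test_predicted < 0] = 0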