# Machine Learning of the Google Store Analytics Dataset

## Random Forest Regressor Models - v3
### Version 3: Limit Input Variables, Label Encoding, No Scaling/Transforming of Data

This dataset is provided by the Kaggle competition.  
https://www.kaggle.com/c/ga-customer-revenue-prediction

We performed some data engineering and datetime feature engineering to get the dataset to the state we wanted.

Now we will try a variety of different models and look at their accuracy.  The models we will try:
1. Generalized Linear Regression Models
    1. Linear Regression (Ordinary Least Squares) Model
    2. Linear Lasso Regression Model
    3. Linear Ridge Regression Model
    4. Linear Elastic Net Regression Model
2. Decision Tree Regression - a single decision tree that outputs continuous values http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html  http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
3. Random Forest Regression - an ensemble of decision trees whose predictions are averaged (the model used in this notebook)
4. Neural Networks

In [27]:
import pandas as pd
import numpy as np

import pickle

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix

from scipy.stats import pearsonr

## Importing and Pre-processing of the Training Dataset

In [4]:
# import pickle
# with open('data/train_v1_full_data_split.pkl', 'rb') as fp:
#     df = pickle.load(fp)

In [5]:
#import the data engineered and feature engineered training dataset
df = pd.read_pickle('/home/michael_suomi/Final-Project-Google-Merch-Store/data/train_v1_full_data_split.pkl')
print(df.shape)
# print(df.columns)

(903652, 44)


In [6]:
### DROP COLUMNS NOT IN FINAL TEST DATA ###
#the test dataset does not have the 'trafficSource_campaignCode' column, so drop that from our training set too
df.drop('trafficSource_campaignCode', axis=1, inplace=True)
print(df.shape)
# print(df.columns)
# df.head(3)

(903652, 43)


In [7]:
### CHANGE TRANSACTION REVENUE FROM NANs to 0 AND CHANGE to FLOAT TYPE (some are strings)###
#assign the result back to the column rather than relying on inplace=True, which newer pandas versions may not apply to df
df['totals_transactionRevenue'] = df['totals_transactionRevenue'].fillna(0).astype(float)

### CHANGE OTHER STRINGS TO INTS/FLOATS WHERE NEEDED ###
#stick to floats rather than ints since a np.nan is a float object
df.totals_bounces = df.totals_bounces.astype(dtype=float)
df.totals_hits = df.totals_hits.astype(dtype=float)
df.totals_newVisits = df.totals_newVisits.astype(dtype=float)
df.totals_pageviews = df.totals_pageviews.astype(dtype=float)
df.totals_visits = df.totals_visits.astype(dtype=float)

### CONVERT NANs in bounces, newVisits to 0 values ###
#the blank NAN values for these columns imply a 0 value meaning 0 newVisits or 0 bounces
df['totals_bounces'] = df['totals_bounces'].fillna(0)
df['totals_newVisits'] = df['totals_newVisits'].fillna(0)
# df.totals_visits.fillna(0, inplace=True) #there shouldn't be anyone with 0 visits (they've at least visited once or wouldn't be recorded)

In [8]:
#### REVENUE IS DOLLARS * 10^6, NOT EXPONENTIAL LIKE WE THOUGHT ####
#### SINCE THE REVENUE IS SCALED UP BY A CONSTANT, NO NEED TO ADJUST FOR LIN REGRESS MODEL ####
# ### CONVERT TRANSACTION REVENUE TO DOLLARS (instead of the e^dollars_revenue) ###
# df['totals_transactionRevenue_dollars'] = df.totals_transactionRevenue.map(lambda x:
#                                                                             np.log1p(x))
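Since the comments above state that the stored revenue is dollars × 10^6, the conversion can be sketched as follows. The helper names `to_dollars` and `log_revenue` are hypothetical (they are not part of this notebook), the sample values are made up, and `np.log1p` is shown only because the commented-out code above refers to it.

```python
import numpy as np

# Hypothetical helpers for illustration; not part of the original notebook.
MICROS_PER_DOLLAR = 1_000_000  # revenue is stored as dollars * 10^6


def to_dollars(revenue_micros):
    """Convert raw totals_transactionRevenue values to dollars."""
    return np.asarray(revenue_micros, dtype=float) / MICROS_PER_DOLLAR


def log_revenue(revenue_micros):
    """log(1 + x) of the raw value: maps 0 revenue to 0 and compresses large values."""
    return np.log1p(np.asarray(revenue_micros, dtype=float))


raw_revenue = [0.0, 17_000_000.0, 23_129_500_000.0]  # made-up sample values
print(to_dollars(raw_revenue))
print(log_revenue(raw_revenue))
```

Dividing by a constant only rescales the target, so it does not change which splits a tree-based model chooses; the log transform, by contrast, changes the shape of the target distribution.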

In [9]:
### VIEW THE DATA BEFORE LABEL ENCODING ###
print(df.shape)
print(df.columns)
df.head(3)

(903652, 43)
Index(['channelGrouping', 'date', 'fullVisitorId', 'sessionId',
       'socialEngagementType', 'visitId', 'visitNumber', 'visitStartTime',
       'device_deviceCategory', 'device_browser', 'device_isMobile',
       'device_operatingSystem', 'geoNetwork_subContinent',
       'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
       'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
       'totals_bounces', 'totals_hits', 'totals_newVisits', 'totals_pageviews',
       'totals_visits', 'totals_transactionRevenue',
       'trafficSource_isTrueDirect', 'trafficSource_keyword',
       'trafficSource_source', 'trafficSource_adContent',
       'trafficSource_medium', 'trafficSource_referralPath',
       'trafficSource_campaign', 'city_country', 'lat_lng', 'timezone',
       'datetime_iso_utc', 'datetime_iso_local', 'year_local', 'month_local',
       'day_local', 'yearday_local', 'weekday_local', 'hour_local'],
      dtype='object')


Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device_deviceCategory,device_browser,...,lat_lng,timezone,datetime_iso_utc,datetime_iso_local,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
0,Organic Search,20160902,1131660440785968503,1131660440785968503_1472830385,Not Socially Engaged,1472830385,1,1472830385,desktop,Chrome,...,"(38.423734, 27.142826)","(+03, 3.0)",2016-09-02 15:33:05+00:00,2016-09-02 18:33:05+03:00,2016.0,9.0,2.0,246.0,5.0,18.0
1,Organic Search,20160902,377306020877927890,377306020877927890_1472880147,Not Socially Engaged,1472880147,1,1472880147,desktop,Firefox,...,"(-25.274398, 133.775136)","(ACST, 9.5)",2016-09-03 05:22:27+00:00,2016-09-03 14:52:27+09:30,2016.0,9.0,3.0,247.0,6.0,14.0
2,Organic Search,20160902,3895546263509774583,3895546263509774583_1472865386,Not Socially Engaged,1472865386,1,1472865386,desktop,Chrome,...,"(40.4167754, -3.7037902)","(CEST, 2.0)",2016-09-03 01:16:26+00:00,2016-09-03 03:16:26+02:00,2016.0,9.0,3.0,247.0,6.0,3.0


In [10]:
#view the numerical data columns for counts, mean, and min/max
#if the standard deviation (std) is zero, that means every value is the same - may want to check that data
#and see if need to edit it (since describe ignores NANs for instance, you may need to go back and convert the NANs to a 
#value that makes sense)
df.describe()

Unnamed: 0,date,visitId,visitNumber,visitStartTime,totals_bounces,totals_hits,totals_newVisits,totals_pageviews,totals_visits,totals_transactionRevenue,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
count,903652.0,903652.0,903652.0,903652.0,903652.0,903652.0,903652.0,903552.0,903652.0,903652.0,902175.0,902175.0,902175.0,902175.0,902175.0,902175.0
mean,20165890.0,1485007000.0,2.264898,1485007000.0,0.498675,4.596542,0.77802,3.849767,1.0,1704275.0,2016.517473,6.990086,15.698499,197.611083,3.739715,13.898355
std,4697.698,9022128.0,9.28374,9022128.0,0.499999,9.641442,0.415578,7.025277,0.0,52778690.0,0.499695,3.486402,8.824394,106.757146,1.919636,5.806083
min,20160800.0,1470035000.0,1.0,1470035000.0,0.0,1.0,0.0,1.0,1.0,0.0,2016.0,1.0,1.0,1.0,1.0,0.0
25%,20161030.0,1477561000.0,1.0,1477561000.0,0.0,1.0,1.0,1.0,1.0,0.0,2016.0,4.0,8.0,103.0,2.0,10.0
50%,20170110.0,1483949000.0,1.0,1483949000.0,0.0,2.0,1.0,1.0,1.0,0.0,2017.0,7.0,16.0,207.0,4.0,14.0
75%,20170420.0,1492759000.0,1.0,1492759000.0,1.0,4.0,1.0,4.0,1.0,0.0,2017.0,10.0,23.0,297.0,5.0,18.0
max,20170800.0,1501657000.0,395.0,1501657000.0,1.0,500.0,1.0,469.0,1.0,23129500000.0,2017.0,12.0,31.0,366.0,7.0,23.0


In [11]:
### LABEL ENCODING ALL THE CATEGORICAL VARIABLES ###
# label encode the categorical variables
categorical_cols = ['channelGrouping', 'socialEngagementType', 
                   'device_deviceCategory', 'device_browser', 'device_isMobile',
                   'device_operatingSystem', 'geoNetwork_subContinent',
                   'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
                   'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
                   'trafficSource_isTrueDirect', 'trafficSource_keyword',
                   'trafficSource_source', 'trafficSource_adContent',
                   'trafficSource_medium', 'trafficSource_referralPath',
                   'trafficSource_campaign']

print('Original Dataframe Shape: ', df.shape)

for col in categorical_cols:
    print('\n Converting Column: ', col)
    lbl = preprocessing.LabelEncoder()
    #cast to str first so missing values and mixed types are all encoded as string labels
    df[col] = lbl.fit_transform(df[col].values.astype('str'))
    print(df.shape)


Original Dataframe Shape:  (903652, 43)

 Converting Column:  channelGrouping
(903652, 43)

 Converting Column:  socialEngagementType
(903652, 43)

 Converting Column:  device_deviceCategory
(903652, 43)

 Converting Column:  device_browser
(903652, 43)

 Converting Column:  device_isMobile
(903652, 43)

 Converting Column:  device_operatingSystem
(903652, 43)

 Converting Column:  geoNetwork_subContinent
(903652, 43)

 Converting Column:  geoNetwork_region
(903652, 43)

 Converting Column:  geoNetwork_continent
(903652, 43)

 Converting Column:  geoNetwork_country
(903652, 43)

 Converting Column:  geoNetwork_city
(903652, 43)

 Converting Column:  geoNetwork_metro
(903652, 43)

 Converting Column:  geoNetwork_networkDomain
(903652, 43)

 Converting Column:  trafficSource_isTrueDirect
(903652, 43)

 Converting Column:  trafficSource_keyword
(903652, 43)

 Converting Column:  trafficSource_source
(903652, 43)

 Converting Column:  trafficSource_adContent
(903652, 43)

 Converting Colum
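To make the effect of the label-encoding loop concrete, here is a minimal sketch on a made-up column. The array `toy_browser` is an assumption standing in for a column such as `device_browser`; only `LabelEncoder` itself comes from the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for something like device_browser (values are made up).
toy_browser = np.array(['Chrome', 'Firefox', np.nan, 'Chrome', 'Safari'], dtype=object)

encoder = LabelEncoder()
# Casting to str turns the missing value into the literal label 'nan'.
codes = encoder.fit_transform(toy_browser.astype('str'))

print(list(encoder.classes_))  # the distinct labels, in sorted order
print(list(codes))             # the integer code assigned to each row

# Codes can be mapped back to the original string labels.
print(list(encoder.inverse_transform(codes)))
```

The integer codes follow the sorted order of the labels, so their numeric order carries no meaning. Tree-based models can still split on such codes, whereas linear models would read them as an ordered quantity.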

## Decide what Input Data to Use for X and Split Data via train_test_split
For initial runs of the models, try using fewer input variables (keeping only the ones we think are most predictive).

In [12]:
# ### CALCULATE CORRELATION OF EACH POSSIBLE INPUT vs. REVENUE VALUE ---ORIGINAL DF--- TO HELP DECISION MAKING OF WHICH INPUTS TO INCLUDE###
# correlation_summary_list = []

# for col in df.columns:
#     #can only run correlations on columns that have numerical values (either dtype of float or int)
#     #in particular some columns have dtype of 'O', which stands for python object, which in this case means the dtypes are mixed
#     if df[col].dtype in ['float64', 'int64']:
        
#         #having NANs in the dataset for correlations breaks the correlation calculation
#         #so only keep the good_rows that don't have nans for either series being used in the correlation calculation
#         #and then np.compress(good_rows, series) just reduces the series to the array with only the good_rows to run the correlation
#         good_rows = ~np.logical_or(np.isnan(df[col]), np.isnan(df.totals_transactionRevenue))

#         #pearsonr function calculates the Pearson correlation coefficient and the p-value for as a tuple
#         correl_pvalue = pearsonr(np.compress(good_rows, df[col]), np.compress(good_rows, df.totals_transactionRevenue))

#         #create new tuple that also has the column name (which is the input variable) and the correl coef and p-value
#         variable_correl_pvalue = (col,) + correl_pvalue

#         #add the correl and pvalue tuple to the list of all correlation summaries
#         correlation_summary_list.append(variable_correl_pvalue)

        
# #create a dataframe of the correlation summary (for ease of readibility/manipulation)    
# correlation_summary_df = pd.DataFrame(correlation_summary_list, columns=['Input Variable', 'Correlation', 'p-value'])
# #create a new column of absolute value of correlations so can easily sort both positive and negative correlations together
# correlation_summary_df['Absolute Value Correlation'] = correlation_summary_df.Correlation.map(lambda correl: abs(correl))
# #sort the dataframe by absolute value correlation high to low
# correlation_summary_df.sort_values('Absolute Value Correlation', ascending=False, inplace=True)

# #print output
# correlation_summary_df

In [13]:
# ### CALCULATE CORRELATION OF EACH POSSIBLE INPUT vs. REVENUE VALUE ---ONEHOT CATEGORICAL DF---  TO HELP DECISION MAKING OF WHICH INPUTS TO INCLUDE###
# categ_correlation_summary_list = []

# for col in df_categorical_onehot.columns:
#     #onehotencoded categorical columns are all 0/1 ints, so can calculate all correlations
        
#     #having NANs in the dataset for correlations breaks the correlation calculation
#     #so only keep the good_rows that don't have nans for either series being used in the correlation calculation
#     #and then np.compress(good_rows, series) just reduces the series to the array with only the good_rows to run the correlation
#     good_rows = ~np.logical_or(np.isnan(df_categorical_onehot[col]), np.isnan(df.totals_transactionRevenue))

#     #pearsonr function calculates the Pearson correlation coefficient and the p-value for as a tuple
#     correl_pvalue = pearsonr(np.compress(good_rows, df_categorical_onehot[col]), np.compress(good_rows, df.totals_transactionRevenue))

#     #create new tuple that also has the column name (which is the input variable) and the correl coef and p-value
#     variable_correl_pvalue = (col,) + correl_pvalue

#     #add the correl and pvalue tuple to the list of all correlation summaries
#     categ_correlation_summary_list.append(variable_correl_pvalue)

        
# #create a dataframe of the correlation summary (for ease of readibility/manipulation)    
# categ_correlation_summary_df = pd.DataFrame(categ_correlation_summary_list, columns=['Input Variable', 'Correlation', 'p-value'])
# #create a new column of absolute value of correlations so can easily sort both positive and negative correlations together
# categ_correlation_summary_df['Absolute Value Correlation'] = categ_correlation_summary_df.Correlation.map(lambda correl: abs(correl))
# #sort the dataframe by absolute value correlation high to low
# categ_correlation_summary_df.sort_values('Absolute Value Correlation', ascending=False, inplace=True)

# #print output
# categ_correlation_summary_df
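The commented-out cells above rank inputs by the absolute value of their Pearson correlation with revenue. A shorter way to get a similar ranking is sketched below using `DataFrame.corrwith`. The dataframe `toy` is synthetic, and unlike `pearsonr` this approach does not return p-values, so it is a rough substitute rather than a reproduction of those cells.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataframe (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'totals_pageviews': rng.integers(1, 50, size=200).astype(float),
    'hour_local': rng.integers(0, 24, size=200).astype(float),
})
toy['totals_transactionRevenue'] = (
    3.0 * toy['totals_pageviews'] + rng.normal(0, 5, size=200)
)
toy.loc[0, 'hour_local'] = np.nan  # missing values are skipped pair by pair

target = 'totals_transactionRevenue'
numeric_inputs = toy.select_dtypes(include='number').drop(columns=[target])

# Pearson correlation of every numeric input column with the target.
correlations = numeric_inputs.corrwith(toy[target])

# Sort by absolute value so strong negative correlations rank alongside positive ones.
summary = (
    correlations.abs()
    .sort_values(ascending=False)
    .rename('Absolute Value Correlation')
)
print(summary)
```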

In [14]:
### ASSIGN X and y DATA for VARIABLES WE WANT TO USE###

# for X data use the initial correlation values and variables that we think are most important to narrow things down
# (remember that the correlation values are just linear correlation values, so this doesn't capture variables
# that do have a large influence but might be nonlinear; however, for linear regression models at the least, that
# seems like a good metric to start with as the linear models won't be able to capture nonlinear effects well anyway)

#INITIAL RUN DECISIONS: as we can see from the initial Pearson correlations, no variables even fall within the range of what
#we would traditionally consider even low correlations, so it is doubtful that linear regression models will work well,
#but we'll try it out - take roughly the top 10 variables, use all one hot encoded columns and also include
#weekday_local, month_local, yearday_local, and hour_local since those are features we specifically added to
#make our features unique

### NARROW DOWN THE CATEGORICAL COLUMNS WANT TO ADD AS X VARIABLE INPUTS ###
categorical_columns_x_model = ['trafficSource_isTrueDirect', 'trafficSource_source',
                               #'trafficSource_keyword', #tons of dimensions and not good predictor
                                'geoNetwork_continent', 'geoNetwork_country'
                               #'geoNetwork_city'
                               ]


### NARROW DOWN THE NUMERICAL COLUMNS WANT TO ADD TO X VARIABLE INPUTS ###
numerical_columns_x_model = ['totals_pageviews', 'totals_hits', 'visitNumber', 'totals_newVisits', 'totals_bounces',
                             'weekday_local', 'month_local', 'yearday_local', 'hour_local']

#create y outputs column name (but do in list form for easy list adding later)
column_y_model = ['totals_transactionRevenue']

#create the model dataframe that includes chosen x input variables (from numerical and categorical) and y output variable
#do this so that can clean the dataframe by dropping all rows that have any nans
df_model = df[numerical_columns_x_model + categorical_columns_x_model + column_y_model]

print('\nShape of all of our variables being used for the model (before dropping nans): ', df_model.shape)

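#for linear regression drop NANs as they can't be interpreted in the regression model - check to make
#sure it isn't reducing size of data too much before proceeding
#(take an explicit copy so that adding the revenue_label column below does not trigger a SettingWithCopyWarning)
df_model = df_model.dropna(axis='index', how='any').copy()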
print('\nShape of all of our variables being used for the model (after dropping nans): ', df_model.shape)

#add a column to the df_model data of a simple classifier of "revenue" or "no_revenue" - will use this data point for:
#     in the train_test_split model we will use the stratify command to get equal train-test percentages for both revenue
#     and no revenue outcomes - I think this will be important since only about 1.3% of all rows actually resulted in 
#     revenue and not completely sure how randomly selecting will have equal test-train distributions without defining it
#     (this may be unnecessary, but better safe than sorry)
df_model['revenue_label'] = df_model.totals_transactionRevenue.map(lambda revenue_amount: 
                                                        'revenue' if revenue_amount > 0 else 'no_revenue')


#split out the data we are using for modeling to X and y values
columns_X_model = [col for col in list(df_model.columns) if col not in ['totals_transactionRevenue', 'revenue_label']]
X_model = df_model[columns_X_model]

#no need to reshape the y_model data for tree-based models, so just keep the y values as a pandas Series
y_model = df_model['totals_transactionRevenue'] #.values.reshape(-1, 1)

#put stratify criteria of revenue/no_revenue into its own Series (no reshape needed here either)
stratify_criteria_model = df_model['revenue_label'] #.values.reshape(-1, 1)

print('\nShape of X input variables is: ', X_model.shape, '\nShape of y output variable is: ', y_model.shape)



Shape of all of our variables being used for the model (before dropping nans):  (903652, 14)

Shape of all of our variables being used for the model (after dropping nans):  (902077, 14)

Shape of X input variables is:  (902077, 13) 
Shape of y output variable is:  (902077,)


In [15]:
##### TRAIN-TEST-SPLIT #####

### SPLIT THE MODEL DATA ###
#split the model data (which is all of the Kaggle Training data) into the model's train/test subsets
#(have to do this since Kaggle competition has its own test data, but those actual values are not provided, so can't
#actually use that to test our models, just end up comparing our predictions on that test data with their actuals)
#use a 75-25 split to start with train-test
#also make sure to add stratify_criteria to make sure it is doing a 75-25 split on both website visits that led to 
#actual sales/revenue and those that did not
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_model, y_model,
                                                                            test_size=0.25,
                                                                            stratify=stratify_criteria_model)
#print sizes of the train/test data splits
print('Check Shapes of the train-test data splits.\n')
print('X_model_train: ', X_model_train.shape)
print('X_model_test: ', X_model_test.shape)
print('y_model_train: ', y_model_train.shape)
print('y_model_test: ', y_model_test.shape)


### VERIFY THE STRATIFY COMMAND WORKED ###
print('\n--------------------------------------------------------------------')
print('Check that the train-test data split worked along stratify criteria.\n')

# FIRST PRINT OUT TOTAL DF_MODEL PERCENTAGE OF DATA THAT HAS REVENUE #
print('The df_model data percentages of revenue and no_revenue are:')
#since this data from the df_model is in a series, we can just use pandas value counts
print(df_model['revenue_label'].value_counts(normalize=True))

# THEN CHECK THE TRAIN DATA #
#y_model_train is still a pandas Series, so a boolean comparison flags the y_values that had revenue (rev>0)
#summing that boolean mask counts how many y_values actually had revenue
y_model_train_revenue_count = int((y_model_train > 0).sum())
#calculate the percentage of the y_values that have revenue by taking the revenue count/total count in that dataset
#note: this percentage should equal the overall percentage of your data that has revenue if the stratify command is working properly
y_model_train_revenue_percent = y_model_train_revenue_count/y_model_train.shape[0]
print('\nThe percentage of model_train data that has revenue is: ', y_model_train_revenue_percent)

# THEN CHECK THE TEST DATA #
#same check as for the train data, this time on the y_model_test Series
y_model_test_revenue_count = int((y_model_test > 0).sum())
y_model_test_revenue_percent = y_model_test_revenue_count/y_model_test.shape[0]
print('\nThe percentage of model_test data that has revenue is: ', y_model_test_revenue_percent)

Check Shapes of the train-test data splits.

X_model_train:  (676557, 13)
X_model_test:  (225520, 13)
y_model_train:  (676557,)
y_model_test:  (225520,)

--------------------------------------------------------------------
Check that the train-test data split worked along stratify criteria.

The df_model data percentages of revenue and no_revenue are:
no_revenue    0.987242
revenue       0.012758
Name: revenue_label, dtype: float64

The percentage of model_train data that has revenue is:  0.01275871803853925

The percentage of model_test data that has revenue is:  0.012757183398368215
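The stratified split and its check can be reproduced on a small synthetic example, sketched below. The data is made up, and `random_state=0` is an assumption added here so the split is repeatable; the notebook's own split does not fix a seed, so its exact rows differ from run to run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target: 1,000 rows of which 2% have revenue (made-up numbers).
y_toy = pd.Series(np.r_[np.zeros(980), np.full(20, 1e7)])
X_toy = pd.DataFrame({'feature': np.arange(len(y_toy))})
has_revenue = y_toy > 0  # stratification label: revenue vs. no revenue

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,
    stratify=has_revenue,
    random_state=0,  # assumption: fixed seed for a repeatable split
)

# With stratification both subsets keep the same share of revenue rows.
print('train share with revenue:', (y_tr > 0).mean())
print('test share with revenue: ', (y_te > 0).mean())
```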


# Random Forest v3
### Version 3: Limited Input Variables, Label Encoding, No Scaling/Transforming of Data

##### Initial Functions Used for Evaluation of Models
These functions check whether a model labels each visit correctly as revenue or no revenue (or, even worse, assigns negative revenue). Beyond the final predicted revenue amount, we also don't want the model to predict revenue (or negative revenue) for someone who didn't buy anything and had 0 revenue.

In [16]:
#function to define whether the outcome is revenue (when revenue_amt is greater than 0), 
#no_revenue (when revenue_amt is 0), and neg_revenue (when revenue_amt is less than 0)
def revenue_norevenue_negrevenue(revenue_amt):
    if revenue_amt > 0:
        return 'revenue'
    elif revenue_amt == 0:
        return 'no_revenue'
    elif revenue_amt < 0:
        return 'neg_revenue'

In [17]:
### CREATE A FUNCTION THAT EVALUATES THE REVENUE OR NO_REVENUE ACCURACY ###
#an important part of this model is to make sure that only people that actually performed a final transaction are 
#getting a revenue prediction
#so calculate the percentages of each outcome in a confusion matrix using sklearn.metrics.confusion_matrix:
# --True Positive: Revenue_actual & Revenue_predicted
# --True Negative: No_Revenue_actual & No_Revenue_predicted
# --False Negative: Revenue_actual & No_Revenue_predicted
# --False Positive: No_Revenue_actual & Revenue_predicted

#inputs of y_revenue_true, y_revenue_predicted are the actual revenue amount arrays,
#we will convert to labels
def evaluate_revenue_confusion_matrix(y_revenue_true, y_revenue_predicted):
    df_revenue_eval = pd.DataFrame(data={'revenue_actual': y_revenue_true, #y_revenue_true.reshape(-1),
                       'revenue_prediction': y_revenue_predicted}) #y_revenue_predicted.reshape(-1)})
    
    df_revenue_eval['revenue_label_actual'] = df_revenue_eval.revenue_actual.map(revenue_norevenue_negrevenue)
    
    df_revenue_eval['revenue_label_prediction'] = df_revenue_eval.revenue_prediction.map(revenue_norevenue_negrevenue)
    
    print('Confusion Matrix of revenue/no_revenue/neg_revenue: \n')
    #use scikit learns confusion matrix (the diagonal indicates true hits, outside of that is the false hits)
    print(confusion_matrix(df_revenue_eval.revenue_label_actual,
                 df_revenue_eval.revenue_label_prediction,
                 labels=['revenue', 'no_revenue', 'neg_revenue']))

In [18]:
#function that saves the machine learning model to a pickle, so we can extract it later
#input of file_name_path, make sure it includes extension of .pkl
def save_model_pickle(scikit_model, file_name_path):
    #use a context manager so the file handle is closed once the model is written
    with open(file_name_path, 'wb') as fp:
        pickle.dump(scikit_model, fp)
    print("Scikit Model saved to {}".format(file_name_path))
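For completeness, a matching loader might look like the sketch below. `load_model_pickle` is a hypothetical helper that does not appear in this notebook, and the round trip uses a plain dict in place of a fitted model so the example runs on its own.

```python
import os
import pickle
import tempfile


def load_model_pickle(file_name_path):
    """Hypothetical counterpart to save_model_pickle: read a pickled object back."""
    with open(file_name_path, 'rb') as fp:
        return pickle.load(fp)


# Round-trip demonstration with a plain dict standing in for a fitted model.
stand_in_model = {'n_estimators': 5}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'stand_in_model.pkl')
    with open(path, 'wb') as fp:
        pickle.dump(stand_in_model, fp)
    restored = load_model_pickle(path)

print(restored)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best reloaded with the same scikit-learn version that saved it.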

## Random Forest Regressor: n_estimators=5

In [19]:
from sklearn.ensemble import RandomForestRegressor

In [20]:
regr = RandomForestRegressor(#max_depth=10, 
                             #random_state=0,
                              n_estimators=5)
regr.fit(X_model_train, y_model_train)
print(regr.feature_importances_)

[ 0.11126265  0.17725263  0.33581702  0.00115248  0.          0.05170576
  0.02572765  0.10376375  0.12643741  0.03596573  0.0210055   0.00422666
  0.00568275]


In [21]:
#show the X columns input and their associated feature importance
list(zip(X_model_train.columns, regr.feature_importances_))

[('totals_pageviews', 0.11126264862179994),
 ('totals_hits', 0.17725263162549143),
 ('visitNumber', 0.3358170209373692),
 ('totals_newVisits', 0.0011524846169580572),
 ('totals_bounces', 0.0),
 ('weekday_local', 0.051705760429027414),
 ('month_local', 0.025727653059584715),
 ('yearday_local', 0.10376375174869798),
 ('hour_local', 0.12643741155470672),
 ('trafficSource_isTrueDirect', 0.035965725664915599),
 ('trafficSource_source', 0.021005496979401708),
 ('geoNetwork_continent', 0.00422666255150108),
 ('geoNetwork_country', 0.0056827522105460607)]
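The importances are easier to read when sorted from most to least important. The sketch below does that in plain Python, with the values copied from the output above and rounded to four decimals.

```python
# Importances copied (rounded) from the n_estimators=5 output above.
feature_importances = [
    ('totals_pageviews', 0.1113),
    ('totals_hits', 0.1773),
    ('visitNumber', 0.3358),
    ('totals_newVisits', 0.0012),
    ('totals_bounces', 0.0),
    ('weekday_local', 0.0517),
    ('month_local', 0.0257),
    ('yearday_local', 0.1038),
    ('hour_local', 0.1264),
    ('trafficSource_isTrueDirect', 0.0360),
    ('trafficSource_source', 0.0210),
    ('geoNetwork_continent', 0.0042),
    ('geoNetwork_country', 0.0057),
]

# Sort by importance, largest first.
ranked = sorted(feature_importances, key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print('{:<28}{:.4f}'.format(name, importance))

# The importances of a fitted forest sum to 1 (up to rounding).
total_importance = sum(importance for _, importance in feature_importances)
print('Total:', round(total_importance, 4))
```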

In [22]:
y_model_test_predict = regr.predict(X_model_test)
print(y_model_test_predict)

[ 0.  0.  0. ...,  0.  0.  0.]


In [23]:
evaluate_revenue_confusion_matrix(y_model_test, y_model_test_predict)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2212    665      0]
 [  6620 216023      0]
 [     0      0      0]]


In [36]:
percent_accuracy = (2212 + 216023)/(2212 + 216023 + 665 + 6620)
print(percent_accuracy)

0.9676968783256474
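Because about 98.7% of rows have no revenue, accuracy alone is easy to score highly on. The sketch below compares it with a baseline that always predicts `no_revenue`, and derives precision and recall for the `revenue` class. The counts are copied from the confusion matrix above, where rows are actual labels and columns are predicted labels.

```python
# Counts copied from the n_estimators=5 confusion matrix above
# (rows = actual, columns = predicted, order: revenue, no_revenue).
true_pos, false_neg = 2212, 665      # actual revenue rows
false_pos, true_neg = 6620, 216023   # actual no_revenue rows

total = true_pos + false_neg + false_pos + true_neg

accuracy = (true_pos + true_neg) / total
# Accuracy of a trivial model that predicts no_revenue for every row.
baseline_accuracy = (false_pos + true_neg) / total
# Of the rows predicted as revenue, the share that really had revenue.
precision = true_pos / (true_pos + false_pos)
# Of the rows that really had revenue, the share the model found.
recall = true_pos / (true_pos + false_neg)

print('accuracy:          {:.4f}'.format(accuracy))
print('baseline accuracy: {:.4f}'.format(baseline_accuracy))
print('precision:         {:.4f}'.format(precision))
print('recall:            {:.4f}'.format(recall))
```

On these counts the model's accuracy is lower than the always-`no_revenue` baseline, so precision and recall say more about how well it identifies buyers.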


In [25]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE = mean_squared_error(y_model_test, y_model_test_predict)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2 = r2_score(y_model_test, y_model_test_predict)

#note: np.sqrt(r2) is only a meaningful correlation estimate when r2 >= 0; with a negative r2 it evaluates to nan
print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE, r2, np.sqrt(r2)))

Mean Squared Error: 2920533384289471.0 
R-squared: -0.6260142518112775 
Correlation: nan
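A negative R-squared means the predictions have a larger squared error than simply predicting the mean of the test targets, which is why its square root comes out as `nan`. The sketch below computes R-squared from its definition on a made-up example and contrasts it with the Pearson correlation from `np.corrcoef`. The helper `r2_manual` and the toy arrays are assumptions for illustration only.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (the definition behind sklearn's r2_score)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot


# Made-up example: the model finds the one buyer but overshoots the amount.
y_true_toy = [0.0, 0.0, 0.0, 10.0]
y_pred_toy = [0.0, 0.0, 0.0, 30.0]

r2_toy = r2_manual(y_true_toy, y_pred_toy)
pearson_toy = np.corrcoef(y_true_toy, y_pred_toy)[0, 1]

print('R-squared:', r2_toy)
print('Pearson correlation:', pearson_toy)
```

Here the predictions are perfectly correlated with the targets yet R-squared is negative, so the two quantities are not interchangeable outside of an ordinary least squares fit evaluated on its own training data.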




In [28]:
save_model_pickle(regr, 'model_pickles/random_forest_v3_n_estimators_5.pkl')

Scikit Model saved to model_pickles/random_forest_v3_n_estimators_5.pkl


## Random Forest Regressor: n_estimators=50

In [29]:
regr = RandomForestRegressor(#max_depth=10, 
                             #random_state=0,
                              n_estimators=50)
regr.fit(X_model_train, y_model_train)
print(regr.feature_importances_)

[  1.19650739e-01   1.58621411e-01   3.48609827e-01   3.33071746e-03
   7.43124379e-05   4.50196060e-02   3.63449531e-02   1.17564016e-01
   1.07174648e-01   2.38906109e-02   2.86787169e-02   3.54385603e-03
   7.49658557e-03]


In [30]:
#show the X columns input and their associated feature importance
list(zip(X_model_train.columns, regr.feature_importances_))

[('totals_pageviews', 0.11965073917833095),
 ('totals_hits', 0.15862141131174681),
 ('visitNumber', 0.34860982727104151),
 ('totals_newVisits', 0.0033307174551724067),
 ('totals_bounces', 7.4312437928260579e-05),
 ('weekday_local', 0.045019606012241517),
 ('month_local', 0.036344953094142464),
 ('yearday_local', 0.11756401614587261),
 ('hour_local', 0.10717464772263961),
 ('trafficSource_isTrueDirect', 0.023890610904593199),
 ('trafficSource_source', 0.028678716857540181),
 ('geoNetwork_continent', 0.0035438560340496756),
 ('geoNetwork_country', 0.0074965855747007515)]

In [31]:
y_model_test_predict = regr.predict(X_model_test)
print(y_model_test_predict)

[ 0.  0.  0. ...,  0.  0.  0.]


In [32]:
evaluate_revenue_confusion_matrix(y_model_test, y_model_test_predict)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2747    130      0]
 [ 13098 209545      0]
 [     0      0      0]]


In [37]:
percent_accuracy = (2747 + 209545)/(2747 + 209545 + 130 + 13098)
print(percent_accuracy)

0.9413444483859524


In [34]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE = mean_squared_error(y_model_test, y_model_test_predict)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2 = r2_score(y_model_test, y_model_test_predict)

#note: np.sqrt(r2) is only a meaningful correlation estimate when r2 >= 0; with a negative r2 it evaluates to nan
print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE, r2, np.sqrt(r2)))

Mean Squared Error: 2360893613788940.0 
R-squared: -0.314433412636709 
Correlation: nan
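The mean squared errors are hard to interpret because they are in squared units of revenue × 10^6. Taking the square root and dividing by 10^6 gives a root mean squared error in dollars, as sketched below. The MSE values are copied from the two model outputs in this notebook, and the 10^6 scaling is the one stated earlier in the notebook.

```python
import math

MICROS_PER_DOLLAR = 1_000_000  # revenue is stored as dollars * 10^6

# MSE values copied from the outputs of the two random forest runs.
mse_by_model = {
    'n_estimators=5': 2920533384289471.0,
    'n_estimators=50': 2360893613788940.0,
}

# RMSE in dollars = sqrt(MSE) / 10^6.
rmse_dollars = {
    name: math.sqrt(mse) / MICROS_PER_DOLLAR
    for name, mse in mse_by_model.items()
}

for name, rmse in rmse_dollars.items():
    print('{:<16} RMSE is roughly ${:.2f} per visit'.format(name, rmse))
```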




In [35]:
save_model_pickle(regr, 'model_pickles/random_forest_v3_n_estimators_50.pkl')

Scikit Model saved to model_pickles/random_forest_v3_n_estimators_50.pkl
