# Machine Learning of the Google Store Analytics Dataset

## Random Forest Regressor Models - v4
### Version 4: More Input Variables, Label Encoding, No Scaling/Transforming of Data

This dataset comes from the Kaggle competition Google Analytics Customer Revenue Prediction.  
https://www.kaggle.com/c/ga-customer-revenue-prediction

We performed some data engineering and datetime feature engineering to get the dataset to the state we wanted.

Now we will try a variety of different models and look at their accuracy.  The models we will try:
1. Generalized Linear Regression Models
    1. Linear Regression (Ordinary Least Squares) Model
    2. Linear Lasso Regression Model
    3. Linear Ridge Regression Model
    4. Linear Elastic Net Regression Model
2. Decision Tree Regression: a decision tree that predicts a continuous output. See http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html and http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
3. Random Forest Regression (the focus of this notebook)
4. Neural Networks

In [72]:
import pandas as pd
import numpy as np

import pickle

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix

from scipy.stats import pearsonr

from sklearn.tree import export_graphviz
from graphviz import Source
from subprocess import call
from IPython.display import Image

## Importing and Pre-processing of the Training Dataset

In [2]:
# import pickle
# with open('data/train_v1_full_data_split.pkl', 'rb') as fp:
#     df = pickle.load(fp)

In [3]:
#import the data engineered and feature engineered training dataset
df = pd.read_pickle('/home/michael_suomi/Final-Project-Google-Merch-Store/data/train_v1_full_data_split.pkl')
print(df.shape)
# print(df.columns)

(903652, 44)


In [4]:
### DROP COLUMNS NOT IN FINAL TEST DATA ###
#the test dataset does not have the 'trafficSource_campaignCode' column, so drop that from our training set too
df.drop('trafficSource_campaignCode', axis=1, inplace=True)
print(df.shape)
# print(df.columns)
# df.head(3)

(903652, 43)


In [5]:
### CHANGE TRANSACTION REVENUE FROM NANs to 0 AND CHANGE to FLOAT TYPE (some are strings)###
df['totals_transactionRevenue'] = df['totals_transactionRevenue'].fillna(0)
df.totals_transactionRevenue = df.totals_transactionRevenue.astype(dtype=float)

### CHANGE OTHER STRINGS TO INTS/FLOATS WHERE NEEDED ###
#stick to floats rather than ints since a np.nan is a float object
df.totals_bounces = df.totals_bounces.astype(dtype=float)
df.totals_hits = df.totals_hits.astype(dtype=float)
df.totals_newVisits = df.totals_newVisits.astype(dtype=float)
df.totals_pageviews = df.totals_pageviews.astype(dtype=float)
df.totals_visits = df.totals_visits.astype(dtype=float)

### CONVERT NANs in bounces, newVisits to 0 values ###
#the blank NAN values for these columns imply a 0 value meaning 0 newVisits or 0 bounces
df['totals_bounces'] = df['totals_bounces'].fillna(0)
df['totals_newVisits'] = df['totals_newVisits'].fillna(0)
# df['totals_visits'] = df['totals_visits'].fillna(0) #there shouldn't be anyone with 0 visits (they've visited at least once or wouldn't be recorded)
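
The string-to-float conversion and NaN filling above can also be written with `pd.to_numeric`. The following is a minimal sketch on a toy frame: the column names come from the dataset, but the values are made up for illustration, and `toy` and `fill_zero_cols` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame: values are illustrative, column names match the dataset.
toy = pd.DataFrame({
    'totals_hits': ['1', '4', '12'],
    'totals_bounces': ['1', np.nan, np.nan],
    'totals_transactionRevenue': [np.nan, '37860000', np.nan],
})

# Convert every column to float; errors='coerce' maps anything unparseable to NaN.
toy = toy.apply(pd.to_numeric, errors='coerce').astype(float)

# A missing bounce or revenue value means "none", so fill those with 0.
fill_zero_cols = ['totals_bounces', 'totals_transactionRevenue']
toy[fill_zero_cols] = toy[fill_zero_cols].fillna(0)

print(toy.dtypes)
print(toy)
```

Assigning the filled columns back avoids relying on `inplace=True` through attribute access, which newer pandas versions warn about.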

In [6]:
#### REVENUE IS DOLLARS * 10^6, NOT EXPONENTIAL LIKE WE THOUGHT ####
#### SINCE THE REVENUE IS SCALED UP BY A CONSTANT, NO NEED TO ADJUST FOR LIN REGRESS MODEL ####
# ### CONVERT TRANSACTION REVENUE TO DOLLARS (instead of the e^dollars_revenue) ###
# df['totals_transactionRevenue_dollars'] = df.totals_transactionRevenue.map(lambda x:
#                                                                             np.log1p(x))
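
Since revenue is stored as dollars * 10^6, converting back to dollars is a single division. The sketch below also shows `np.log1p` as an optional transform for a later version; this notebook (v4) deliberately does not transform the target. The revenue values here are illustrative, not real rows.

```python
import numpy as np

# Illustrative revenue values in the dataset's units (dollars * 10^6); not real rows.
revenue_micros = np.array([0.0, 37_860_000.0, 306_670_000.0])

# Dividing by 1e6 recovers dollars.
revenue_dollars = revenue_micros / 1e6

# log1p(x) = ln(1 + x) keeps zero-revenue rows at 0 and compresses the long right tail.
revenue_log = np.log1p(revenue_micros)

# expm1 inverts log1p, so predictions made on the log scale can be mapped back.
revenue_back = np.expm1(revenue_log)

print(revenue_dollars)
print(revenue_log)
```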

In [7]:
### VIEW THE DATA BEFORE LABEL ENCODING ###
print(df.shape)
print(df.columns)
df.head(3)

(903652, 43)
Index(['channelGrouping', 'date', 'fullVisitorId', 'sessionId',
       'socialEngagementType', 'visitId', 'visitNumber', 'visitStartTime',
       'device_deviceCategory', 'device_browser', 'device_isMobile',
       'device_operatingSystem', 'geoNetwork_subContinent',
       'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
       'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
       'totals_bounces', 'totals_hits', 'totals_newVisits', 'totals_pageviews',
       'totals_visits', 'totals_transactionRevenue',
       'trafficSource_isTrueDirect', 'trafficSource_keyword',
       'trafficSource_source', 'trafficSource_adContent',
       'trafficSource_medium', 'trafficSource_referralPath',
       'trafficSource_campaign', 'city_country', 'lat_lng', 'timezone',
       'datetime_iso_utc', 'datetime_iso_local', 'year_local', 'month_local',
       'day_local', 'yearday_local', 'weekday_local', 'hour_local'],
      dtype='object')


Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device_deviceCategory,device_browser,...,lat_lng,timezone,datetime_iso_utc,datetime_iso_local,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
0,Organic Search,20160902,1131660440785968503,1131660440785968503_1472830385,Not Socially Engaged,1472830385,1,1472830385,desktop,Chrome,...,"(38.423734, 27.142826)","(+03, 3.0)",2016-09-02 15:33:05+00:00,2016-09-02 18:33:05+03:00,2016.0,9.0,2.0,246.0,5.0,18.0
1,Organic Search,20160902,377306020877927890,377306020877927890_1472880147,Not Socially Engaged,1472880147,1,1472880147,desktop,Firefox,...,"(-25.274398, 133.775136)","(ACST, 9.5)",2016-09-03 05:22:27+00:00,2016-09-03 14:52:27+09:30,2016.0,9.0,3.0,247.0,6.0,14.0
2,Organic Search,20160902,3895546263509774583,3895546263509774583_1472865386,Not Socially Engaged,1472865386,1,1472865386,desktop,Chrome,...,"(40.4167754, -3.7037902)","(CEST, 2.0)",2016-09-03 01:16:26+00:00,2016-09-03 03:16:26+02:00,2016.0,9.0,3.0,247.0,6.0,3.0


In [8]:
#view the numerical data columns for counts, mean, and min/max
#if the standard deviation (std) is zero, that means every value is the same - may want to check that data
#and see if need to edit it (since describe ignores NANs for instance, you may need to go back and convert the NANs to a 
#value that makes sense)
df.describe()

Unnamed: 0,date,visitId,visitNumber,visitStartTime,totals_bounces,totals_hits,totals_newVisits,totals_pageviews,totals_visits,totals_transactionRevenue,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
count,903652.0,903652.0,903652.0,903652.0,903652.0,903652.0,903652.0,903552.0,903652.0,903652.0,902175.0,902175.0,902175.0,902175.0,902175.0,902175.0
mean,20165890.0,1485007000.0,2.264898,1485007000.0,0.498675,4.596542,0.77802,3.849767,1.0,1704275.0,2016.517473,6.990086,15.698499,197.611083,3.739715,13.898355
std,4697.698,9022128.0,9.28374,9022128.0,0.499999,9.641442,0.415578,7.025277,0.0,52778690.0,0.499695,3.486402,8.824394,106.757146,1.919636,5.806083
min,20160800.0,1470035000.0,1.0,1470035000.0,0.0,1.0,0.0,1.0,1.0,0.0,2016.0,1.0,1.0,1.0,1.0,0.0
25%,20161030.0,1477561000.0,1.0,1477561000.0,0.0,1.0,1.0,1.0,1.0,0.0,2016.0,4.0,8.0,103.0,2.0,10.0
50%,20170110.0,1483949000.0,1.0,1483949000.0,0.0,2.0,1.0,1.0,1.0,0.0,2017.0,7.0,16.0,207.0,4.0,14.0
75%,20170420.0,1492759000.0,1.0,1492759000.0,1.0,4.0,1.0,4.0,1.0,0.0,2017.0,10.0,23.0,297.0,5.0,18.0
max,20170800.0,1501657000.0,395.0,1501657000.0,1.0,500.0,1.0,469.0,1.0,23129500000.0,2017.0,12.0,31.0,366.0,7.0,23.0


In [9]:
### LABEL ENCODING ALL THE CATEGORICAL VARIABLES ###
# label encode the categorical variables
categorical_cols = ['channelGrouping', 'socialEngagementType', 
                   'device_deviceCategory', 'device_browser', 'device_isMobile',
                   'device_operatingSystem', 'geoNetwork_subContinent',
                   'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
                   'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
                   'trafficSource_isTrueDirect', 'trafficSource_keyword',
                   'trafficSource_source', 'trafficSource_adContent',
                   'trafficSource_medium', 'trafficSource_referralPath',
                   'trafficSource_campaign']

print('Original Dataframe Shape: ', df.shape)

for col in categorical_cols:
    print('\n Converting Column: ', col)
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df[col].values.astype('str')))
    df[col] = lbl.transform(list(df[col].values.astype('str')))
    print(df.shape)


Original Dataframe Shape:  (903652, 43)

 Converting Column:  channelGrouping
(903652, 43)

 Converting Column:  socialEngagementType
(903652, 43)

 Converting Column:  device_deviceCategory
(903652, 43)

 Converting Column:  device_browser
(903652, 43)

 Converting Column:  device_isMobile
(903652, 43)

 Converting Column:  device_operatingSystem
(903652, 43)

 Converting Column:  geoNetwork_subContinent
(903652, 43)

 Converting Column:  geoNetwork_region
(903652, 43)

 Converting Column:  geoNetwork_continent
(903652, 43)

 Converting Column:  geoNetwork_country
(903652, 43)

 Converting Column:  geoNetwork_city
(903652, 43)

 Converting Column:  geoNetwork_metro
(903652, 43)

 Converting Column:  geoNetwork_networkDomain
(903652, 43)

 Converting Column:  trafficSource_isTrueDirect
(903652, 43)

 Converting Column:  trafficSource_keyword
(903652, 43)

 Converting Column:  trafficSource_source
(903652, 43)

 Converting Column:  trafficSource_adContent
(903652, 43)

 Converting Colum
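
To make the effect of the loop above concrete, here is a small sketch of what `LabelEncoder` does to one column once missing values are cast to strings. The browser names are illustrative, and `browsers`, `encoder`, and `codes` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy column with a missing value; the browser names are illustrative.
browsers = np.array(['Chrome', 'Firefox', np.nan, 'Chrome', 'Safari'], dtype=object)

encoder = LabelEncoder()
# Casting to str turns NaN into the literal string 'nan', which gets its own code.
codes = encoder.fit_transform(browsers.astype(str))

# classes_ is sorted, so the integer codes carry no meaningful ordering.
print(list(encoder.classes_))
print(codes)

# inverse_transform recovers the original strings from the codes.
print(list(encoder.inverse_transform(codes)))
```

The integer codes follow sort order rather than any property of the category, which tree-based models tolerate better than linear models do.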

## Decide what Input Data to Use for X and Split Data via train_test_split
For initial runs of the models, use fewer input variables (the ones we think are most predictive).

In [10]:
df.columns

Index(['channelGrouping', 'date', 'fullVisitorId', 'sessionId',
       'socialEngagementType', 'visitId', 'visitNumber', 'visitStartTime',
       'device_deviceCategory', 'device_browser', 'device_isMobile',
       'device_operatingSystem', 'geoNetwork_subContinent',
       'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
       'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
       'totals_bounces', 'totals_hits', 'totals_newVisits', 'totals_pageviews',
       'totals_visits', 'totals_transactionRevenue',
       'trafficSource_isTrueDirect', 'trafficSource_keyword',
       'trafficSource_source', 'trafficSource_adContent',
       'trafficSource_medium', 'trafficSource_referralPath',
       'trafficSource_campaign', 'city_country', 'lat_lng', 'timezone',
       'datetime_iso_utc', 'datetime_iso_local', 'year_local', 'month_local',
       'day_local', 'yearday_local', 'weekday_local', 'hour_local'],
      dtype='object')

In [11]:
### ASSIGN X and y DATA for VARIABLES WE WANT TO USE ###

# for X data use the initial correlation values and variables that we think are most important to narrow things down
# (remember that the correlation values are just linear correlations, so they do not capture variables
# that have a large but nonlinear influence; for linear regression models, at least, that
# seems like a good metric to start with, as linear models can't capture nonlinear effects well anyway)

#INITIAL RUN DECISIONS: as we can see from the initial Pearson correlations, no variable falls within the range of what
#we would traditionally consider even a low correlation, so it is doubtful that linear regression models will work well,
#but we'll try it out - take roughly the top 10 variables, use the encoded categorical columns, and also include
#weekday_local, month_local, yearday_local, and hour_local since those are features we specifically added to
#make our features unique

### NARROW DOWN THE CATEGORICAL COLUMNS WANT TO ADD AS X VARIABLE INPUTS ###
categorical_columns_x_model = ['device_deviceCategory', 'device_browser', 'device_isMobile',
                               'device_operatingSystem', 'geoNetwork_subContinent',
                               'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
                               'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
                               
                               'trafficSource_isTrueDirect', 
                               #'trafficSource_keyword', #too large of feature counts
                               'trafficSource_source', 
                               'trafficSource_adContent',
                               'trafficSource_medium', 
                               #'trafficSource_referralPath',
                               'trafficSource_campaign'    
                              ]


### NARROW DOWN THE NUMERICAL COLUMNS WANT TO ADD TO X VARIABLE INPUTS ###
numerical_columns_x_model = ['totals_pageviews', 'totals_hits', 'visitNumber', 'totals_newVisits', 'totals_bounces',
                             #'totals_visits', all 1.0 ?need to label encode to capture nans?
                             'weekday_local', 'month_local', 'yearday_local', 'hour_local']

#create y outputs column name (but do in list form for easy list adding later)
column_y_model = ['totals_transactionRevenue']

#create the model dataframe that includes chosen x input variables (from numerical and categorical) and y output variable
#do this so that can clean the dataframe by dropping all rows that have any nans
df_model = df[numerical_columns_x_model + categorical_columns_x_model + column_y_model]

print('\nShape of all of our variables being used for the model (before dropping nans): ', df_model.shape)

#for linear regression drop NANs as they can't be interpreted in the regression model - check to make
#sure it isn't reducing size of data too much before proceeding
df_model = df_model.dropna(axis='index', how='any')
print('\nShape of all of our variables being used for the model (after dropping nans): ', df_model.shape)

#add a simple "revenue" / "no_revenue" label column to df_model:
#     train_test_split will use it in the stratify argument so that the train and test subsets get the same
#     share of revenue and no_revenue rows - this matters because only about 1.3% of all rows actually resulted in
#     revenue, and a purely random split is not guaranteed to reproduce that share in both subsets
#     (this may be unnecessary, but better safe than sorry)
df_model['revenue_label'] = df_model.totals_transactionRevenue.map(lambda revenue_amount: 
                                                        'revenue' if revenue_amount > 0 else 'no_revenue')


#split out the data we are using for modeling to X and y values
columns_X_model = [col for col in list(df_model.columns) if col not in ['totals_transactionRevenue', 'revenue_label']]
X_model = df_model[columns_X_model]

#don't actually need to reshape the y_model data for decision trees apparently, but narrow it down to only y_values
y_model = df_model['totals_transactionRevenue'] #.values.reshape(-1, 1)

#put the revenue/no_revenue stratify criteria into its own Series (no reshape needed here either)
stratify_criteria_model = df_model['revenue_label'] #.values.reshape(-1, 1)

print('\nShape of X input variables is: ', X_model.shape, '\nShape of y output variable is: ', y_model.shape)



Shape of all of our variables being used for the model (before dropping nans):  (903652, 26)

Shape of all of our variables being used for the model (after dropping nans):  (902077, 26)

Shape of X input variables is:  (902077, 25) 
Shape of y output variable is:  (902077,)


In [12]:
##### TRAIN-TEST-SPLIT #####

### SPLIT THE MODEL DATA ###
#split the model data (which is all of the Kaggle Training data) into the model's train/test subsets
#(have to do this since Kaggle competition has its own test data, but those actual values are not provided, so can't
#actually use that to test our models, just end up comparing our predictions on that test data with their actuals)
#use a 75-25 split to start with train-test
#also make sure to add stratify_criteria to make sure it is doing a 75-25 split on both website visits that led to 
#actual sales/revenue and those that did not
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_model, y_model,
                                                                            test_size=0.25,
                                                                            stratify=stratify_criteria_model)
#print sizes of the train/test data splits
print('Check Shapes of the train-test data splits.\n')
print('X_model_train: ', X_model_train.shape)
print('X_model_test: ', X_model_test.shape)
print('y_model_train: ', y_model_train.shape)
print('y_model_test: ', y_model_test.shape)


### VERIFY THE STRATIFY ARGUMENT WORKED ###
print('\n--------------------------------------------------------------------')
print('Check that the train-test data split worked along stratify criteria.\n')

# FIRST PRINT OUT TOTAL DF_MODEL PERCENTAGE OF DATA THAT HAS REVENUE #
print('The df_model data percentages of revenue and no_revenue are:')
#since this data from the df_model is in a series, we can just use pandas value counts
print(df_model['revenue_label'].value_counts(normalize=True))

# THEN CHECK THE TRAIN DATA #
#y_model_train is still a pandas Series, so a boolean comparison gives a mask of rows with revenue (rev > 0)
#summing the mask counts how many y_values actually had revenue
y_model_train_revenue_count = (y_model_train > 0).sum()
#calculate the fraction of y_values that have revenue: revenue count / total count in that dataset
#note: this should match the overall revenue fraction of the data if the stratify argument is working properly
y_model_train_revenue_percent = y_model_train_revenue_count/y_model_train.shape[0]
print('\nThe percentage of model_train data that has revenue is: ', y_model_train_revenue_percent)

# THEN CHECK THE TEST DATA #
#same calculation on the test subset
y_model_test_revenue_count = (y_model_test > 0).sum()
y_model_test_revenue_percent = y_model_test_revenue_count/y_model_test.shape[0]
print('\nThe percentage of model_test data that has revenue is: ', y_model_test_revenue_percent)

Check Shapes of the train-test data splits.

X_model_train:  (676557, 25)
X_model_test:  (225520, 25)
y_model_train:  (676557,)
y_model_test:  (225520,)

--------------------------------------------------------------------
Check that the train-test data split worked along stratify criteria.

The df_model data percentages of revenue and no_revenue are:
no_revenue    0.987242
revenue       0.012758
Name: revenue_label, dtype: float64

The percentage of model_train data that has revenue is:  0.01275871803853925

The percentage of model_test data that has revenue is:  0.012757183398368215
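
The stratified split above can be reproduced on a small synthetic target to see why both subsets end up with the same revenue share. This is a sketch under stated assumptions: the data is random, and `random_state=0` is an addition for reproducibility that the notebook's own split does not set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 2% of 10,000 rows have positive revenue (numbers are illustrative).
rng = np.random.default_rng(0)
y = np.zeros(10_000)
y[:200] = rng.uniform(1e6, 1e8, size=200)
X = rng.normal(size=(10_000, 3))

# Binary stratification label, analogous to revenue_label above.
has_revenue = y > 0

# random_state is an addition for reproducibility; the notebook's split does not set it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=has_revenue, random_state=0)

# Both subsets keep the 2% revenue share.
print((y_train > 0).mean(), (y_test > 0).mean())
```

Without a fixed `random_state`, each run produces a different split, so metrics from separate runs of the notebook are not directly comparable.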


# Random Forest v4
### Version 4: More Input Variables, Label Encoding, No Scaling/Transforming of Data

##### Initial Functions Used for Evaluation of Models
These functions check whether a model labels each visit correctly as revenue or no revenue. Beyond the predicted revenue amount itself, we don't want the model to predict revenue (or, worse, negative revenue) for a visitor who bought nothing and so had 0 revenue.

In [47]:
#function to define whether the outcome is revenue (when revenue_amt is greater than 0), 
#no_revenue (when revenue_amt is 0), and neg_revenue (when revenue_amt is less than 0)
def revenue_norevenue_negrevenue(revenue_amt):
    if revenue_amt > 0:
        return 'revenue'
    elif revenue_amt == 0:
        return 'no_revenue'
    elif revenue_amt < 0:
        return 'neg_revenue'

In [57]:
### CREATE A FUNCTION THAT EVALUATES THE REVENUE OR NO_REVENUE ACCURACY ###
#an important part of this model is to make sure that only people that actually performed a final transaction are 
#getting a revenue prediction
#so calculate the percentages of each outcome in a confusion matrix using sklearn.metrics.confusion_matrix:
# --True Positive: Revenue_actual & Revenue_predicted
# --True Negative: No_Revenue_actual & No_Revenue_predicted
# --False Negative: Revenue_actual & No_Revenue_predicted
# --False Positive: No_Revenue_actual & Revenue_predicted

#inputs of y_revenue_true, y_revenue_predicted are the actual revenue amount arrays,
#we will convert to labels
def evaluate_revenue_confusion_matrix(y_revenue_true, y_revenue_predicted):
    df_revenue_eval = pd.DataFrame(data={'revenue_actual': y_revenue_true, #y_revenue_true.reshape(-1),
                       'revenue_prediction': y_revenue_predicted}) #y_revenue_predicted.reshape(-1)})
    
    df_revenue_eval['revenue_label_actual'] = df_revenue_eval.revenue_actual.map(revenue_norevenue_negrevenue)
    
    df_revenue_eval['revenue_label_prediction'] = df_revenue_eval.revenue_prediction.map(revenue_norevenue_negrevenue)
    
    confusion_matrix_revenue = confusion_matrix(df_revenue_eval.revenue_label_actual,
                                                 df_revenue_eval.revenue_label_prediction,
                                                 labels=['revenue', 'no_revenue', 'neg_revenue'])
    
    print('Confusion Matrix of revenue/no_revenue/neg_revenue: \n')
    #use scikit learns confusion matrix (the diagonal indicates true hits, outside of that is the false hits)
    print(confusion_matrix_revenue)
    return confusion_matrix_revenue

In [49]:
#function that saves the machine learning model to a pickle, so we can extract it later
#input of file_name_path, make sure it includes extension of .pkl
def save_model_pickle(scikit_model, file_name_path):
    #use a context manager so the file handle is closed after writing
    with open(file_name_path, 'wb') as fp:
        pickle.dump(scikit_model, fp)
    print("Scikit Model saved to {}".format(file_name_path))
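
Loading a saved model back is the mirror image of saving it. The sketch below uses a hypothetical helper, `load_model_pickle`, which is not part of the original notebook, and round-trips a tiny model fitted on synthetic data through a temporary file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def load_model_pickle(file_name_path):
    # Hypothetical helper (not in the original notebook): inverse of save_model_pickle.
    with open(file_name_path, 'rb') as fp:
        return pickle.load(fp)

# Tiny synthetic fit so there is something to save.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2.0
model = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

# Save to a temporary directory, then load and check that predictions match.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_model.pkl')
    with open(path, 'wb') as fp:
        pickle.dump(model, fp)
    restored = load_model_pickle(path)

print(np.allclose(model.predict(X), restored.predict(X)))
```

Pickled scikit-learn models are best loaded with the same scikit-learn version that wrote them, and pickle files should only be loaded from trusted sources.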

## Random Forest Regressor: n_estimators=5

In [20]:
from sklearn.ensemble import RandomForestRegressor

In [17]:
regr = RandomForestRegressor(#max_depth=10, 
                             #random_state=0,
                              n_estimators=5)
regr.fit(X_model_train, y_model_train)
print(regr.feature_importances_)

[  9.81992528e-02   8.46820723e-02   3.75061041e-01   1.10684684e-03
   1.82770240e-06   4.20242271e-02   2.60649095e-02   1.51921022e-01
   5.33903949e-02   7.41951426e-04   1.18857772e-03   2.39049887e-03
   1.47302610e-02   1.21302179e-03   9.30028138e-03   4.32009803e-03
   2.94581091e-03   3.89920427e-02   1.26632420e-02   4.63694253e-02
   3.74277193e-03   1.59213106e-02   5.46020042e-04   1.15691043e-02
   9.13987313e-04]


In [18]:
#show the X columns input and their associated feature importance
list(zip(X_model_train.columns, regr.feature_importances_))

[('totals_pageviews', 0.098199252805627857),
 ('totals_hits', 0.084682072341030118),
 ('visitNumber', 0.37506104114211769),
 ('totals_newVisits', 0.0011068468353414881),
 ('totals_bounces', 1.8277024044120025e-06),
 ('weekday_local', 0.042024227118661057),
 ('month_local', 0.026064909451127403),
 ('yearday_local', 0.15192102230583568),
 ('hour_local', 0.053390394917195952),
 ('device_deviceCategory', 0.00074195142609456244),
 ('device_browser', 0.0011885777248412325),
 ('device_isMobile', 0.0023904988658957193),
 ('device_operatingSystem', 0.014730261034482378),
 ('geoNetwork_subContinent', 0.001213021794480184),
 ('geoNetwork_region', 0.009300281376086383),
 ('geoNetwork_continent', 0.0043200980300574102),
 ('geoNetwork_country', 0.0029458109125432589),
 ('geoNetwork_city', 0.038992042692795291),
 ('geoNetwork_metro', 0.012663242021704196),
 ('geoNetwork_networkDomain', 0.046369425303767989),
 ('trafficSource_isTrueDirect', 0.0037427719302544375),
 ('trafficSource_source', 0.015921310
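
A sorted view makes the importances easier to scan than the zipped list. The sketch below is on synthetic data where only one column drives the target; the column names `signal`, `noise_a`, and `noise_b` are placeholders, not dataset columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only 'signal' drives the target; the other names are placeholders.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=['signal', 'noise_a', 'noise_b'])
y = 3.0 * X['signal'] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances like these tend to favor features with many distinct values, so the scores for label-encoded, high-cardinality columns such as `geoNetwork_city` or `geoNetwork_networkDomain` are worth reading with some caution.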

In [19]:
y_model_test_predict = regr.predict(X_model_test)
print(y_model_test_predict)

[ 0.  0.  0. ...,  0.  0.  0.]


In [20]:
evaluate_revenue_confusion_matrix(y_model_test, y_model_test_predict)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2273    604      0]
 [  7132 215511      0]
 [     0      0      0]]


In [33]:
percent_accuracy = (2273 + 215511)/(2273 + 215511 + 604 + 7132)
print(percent_accuracy)

0.9656970556935084
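
Because about 98.7% of rows are no_revenue, overall accuracy is easy to inflate. As a sketch, the counts from the confusion matrix printed above can be turned into precision and recall for the revenue class, and compared with a baseline that always predicts no_revenue. The variable names are illustrative.

```python
import numpy as np

# Revenue / no_revenue block of the confusion matrix printed above
# (rows are actual labels, columns are predicted labels).
cm = np.array([[2273, 604],
               [7132, 215511]])

true_pos, false_neg = cm[0]
false_pos, true_neg = cm[1]

accuracy = (true_pos + true_neg) / cm.sum()
# A model that always predicts no_revenue is right for every actual no_revenue row.
baseline_accuracy = (false_pos + true_neg) / cm.sum()

# Precision: share of predicted-revenue rows that really had revenue.
precision = true_pos / (true_pos + false_pos)
# Recall: share of actual-revenue rows that the model flagged as revenue.
recall = true_pos / (true_pos + false_neg)

print('accuracy: {:.4f}, always-no_revenue baseline: {:.4f}'.format(accuracy, baseline_accuracy))
print('precision: {:.4f}, recall: {:.4f}'.format(precision, recall))
```

On these counts the always-no_revenue baseline scores higher on accuracy than the model, so precision and recall on the revenue class are the more informative numbers.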


In [22]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE = mean_squared_error(y_model_test, y_model_test_predict)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2 = r2_score(y_model_test, y_model_test_predict)

#note: np.sqrt(r2) equals the Pearson correlation only for an in-sample least-squares fit, and it is nan
#whenever r2 is negative; pearsonr(y_true, y_pred) gives the correlation directly
print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE, r2, np.sqrt(r2)))

Mean Squared Error: 2539945397067719.0 
R-squared: -0.5065226377445373 
Correlation: nan
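
The `nan` above comes from taking the square root of a negative R-squared. As a sketch of how these metrics relate, the synthetic example below has predictions that are perfectly correlated with the target yet fit badly, so R-squared is negative while Pearson's r is 1. The arrays are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic example: predictions track the target but overshoot by a factor of 3.
y_true = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 4.0])
y_pred = 3.0 * y_true

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # same units as the target
r2 = r2_score(y_true, y_pred)  # negative: worse than predicting the mean
pearson_r = pearsonr(y_true, y_pred)[0]  # perfectly correlated despite the bad fit

print('RMSE: {:.3f}  R-squared: {:.3f}  Pearson r: {:.3f}'.format(rmse, r2, pearson_r))
```

A negative R-squared on held-out data means the model's squared error exceeds that of always predicting the mean of the test targets; it does not by itself say the predictions are uncorrelated with the truth.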




In [25]:
save_model_pickle(regr, 'model_pickles/random_forest_v4_n_estimators_5.pkl')

Scikit Model saved to model_pickles/random_forest_v4_n_estimators_5.pkl


## Random Forest Regressor: n_estimators=50

In [26]:
regr = RandomForestRegressor(#max_depth=10, 
                             #random_state=0,
                              n_estimators=50)
regr.fit(X_model_train, y_model_train)
print(regr.feature_importances_)

[  1.16117952e-01   1.34858340e-01   2.82527793e-01   1.73286065e-03
   5.77248012e-07   2.85022533e-02   3.64625642e-02   1.35582719e-01
   8.76960466e-02   9.51006878e-04   2.87269979e-03   7.86133616e-04
   2.06731379e-02   1.29495959e-03   1.32098477e-02   3.02374728e-03
   4.81323811e-03   2.12507547e-02   1.10107876e-02   4.02959893e-02
   9.23191279e-03   2.59194118e-02   5.98554498e-04   1.98458111e-02
   7.40901862e-04]


In [27]:
#show the X columns input and their associated feature importance
list(zip(X_model_train.columns, regr.feature_importances_))

[('totals_pageviews', 0.11611795188890861),
 ('totals_hits', 0.1348583396208797),
 ('visitNumber', 0.28252779291959107),
 ('totals_newVisits', 0.0017328606479875433),
 ('totals_bounces', 5.7724801183710757e-07),
 ('weekday_local', 0.028502253252008149),
 ('month_local', 0.036462564185786077),
 ('yearday_local', 0.13558271910396993),
 ('hour_local', 0.087696046639030023),
 ('device_deviceCategory', 0.00095100687837768223),
 ('device_browser', 0.0028726997902136157),
 ('device_isMobile', 0.00078613361565670308),
 ('device_operatingSystem', 0.020673137885513174),
 ('geoNetwork_subContinent', 0.0012949595876828224),
 ('geoNetwork_region', 0.013209847661270062),
 ('geoNetwork_continent', 0.0030237472756852813),
 ('geoNetwork_country', 0.0048132381131814591),
 ('geoNetwork_city', 0.021250754707982118),
 ('geoNetwork_metro', 0.011010787581566397),
 ('geoNetwork_networkDomain', 0.040295989301168796),
 ('trafficSource_isTrueDirect', 0.0092319127907511053),
 ('trafficSource_source', 0.0259194118

In [28]:
y_model_test_predict = regr.predict(X_model_test)
print(y_model_test_predict)

[ 0.  0.  0. ...,  0.  0.  0.]


In [29]:
evaluate_revenue_confusion_matrix(y_model_test, y_model_test_predict)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2810     67      0]
 [ 14239 208404      0]
 [     0      0      0]]


In [34]:
percent_accuracy = (2810 + 208404)/(2810 + 208404 + 67 + 14239)
print(percent_accuracy)

0.9365643845335225


In [31]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE = mean_squared_error(y_model_test, y_model_test_predict)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2 = r2_score(y_model_test, y_model_test_predict)

#note: np.sqrt(r2) equals the Pearson correlation only for an in-sample least-squares fit, and it is nan
#whenever r2 is negative; pearsonr(y_true, y_pred) gives the correlation directly
print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE, r2, np.sqrt(r2)))

Mean Squared Error: 1582993607768091.5 
R-squared: 0.061075994680548806 
Correlation: 0.24713557955209284


In [32]:
save_model_pickle(regr, 'model_pickles/random_forest_v4_n_estimators_50.pkl')

Scikit Model saved to model_pickles/random_forest_v4_n_estimators_50.pkl


## Random Forest Regressor: n_estimators=200

In [53]:
regr = RandomForestRegressor(#max_depth=10, 
                             #random_state=0,
                             n_estimators=200)
regr.fit(X_model_train, y_model_train)
print(regr.feature_importances_)

[  1.34868947e-01   1.46237574e-01   2.32262551e-01   1.76938149e-03
   6.95492529e-06   3.65888134e-02   2.23218051e-02   1.17347258e-01
   1.12177988e-01   2.40316398e-03   2.47972178e-03   1.02205592e-03
   2.81124307e-02   1.07431786e-03   1.68456782e-02   3.59291112e-03
   5.73471742e-03   2.47302135e-02   1.80483343e-02   3.88717669e-02
   2.22396200e-02   1.71325548e-02   5.72667479e-04   1.27611183e-02
   7.97454300e-04]


In [54]:
#show the X columns input and their associated feature importance
list(zip(X_model_train.columns, regr.feature_importances_))

[('totals_pageviews', 0.13486894706792482),
 ('totals_hits', 0.14623757432406004),
 ('visitNumber', 0.23226255140983024),
 ('totals_newVisits', 0.0017693814878450666),
 ('totals_bounces', 6.9549252907876324e-06),
 ('weekday_local', 0.036588813382452325),
 ('month_local', 0.022321805097777276),
 ('yearday_local', 0.11734725810514775),
 ('hour_local', 0.11217798756765433),
 ('device_deviceCategory', 0.0024031639834465688),
 ('device_browser', 0.0024797217819181551),
 ('device_isMobile', 0.001022055917851756),
 ('device_operatingSystem', 0.028112430690201005),
 ('geoNetwork_subContinent', 0.0010743178641116778),
 ('geoNetwork_region', 0.016845678241209618),
 ('geoNetwork_continent', 0.0035929111212731988),
 ('geoNetwork_country', 0.0057347174241345463),
 ('geoNetwork_city', 0.024730213547377949),
 ('geoNetwork_metro', 0.018048334335286475),
 ('geoNetwork_networkDomain', 0.038871766874547027),
 ('trafficSource_isTrueDirect', 0.022239619989687504),
 ('trafficSource_source', 0.01713255476216

In [58]:
y_model_test_predict = regr.predict(X_model_test)
print(y_model_test_predict)

[  337350.        0.        0. ...,        0.        0.  1009650.]


In [74]:
confusion_matrix_model = evaluate_revenue_confusion_matrix(y_model_test, y_model_test_predict)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2831     46      0]
 [ 17813 204830      0]
 [     0      0      0]]


In [76]:
### ACCURACY OF REVENUE/NO_REVENUE CATEGORY LABELS ###

#use np.sum(array) to sum all the values in the array
percent_accuracy = (confusion_matrix_model[0][0] + confusion_matrix_model[1][1])/np.sum(confusion_matrix_model)
print(percent_accuracy)

0.920809684285


In [77]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE = mean_squared_error(y_model_test, y_model_test_predict)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2 = r2_score(y_model_test, y_model_test_predict)

#note: np.sqrt(r2) equals the Pearson correlation only for an in-sample least-squares fit, and it is nan
#whenever r2 is negative; pearsonr(y_true, y_pred) gives the correlation directly
print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE, r2, np.sqrt(r2)))

Mean Squared Error: 3192820637130802.0 
R-squared: -0.24058010421929543 
Correlation: nan




In [28]:
regr

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [27]:
save_model_pickle(regr, 'model_pickles/random_forest_v4_n_estimators_200.pkl')

Scikit Model saved to model_pickles/random_forest_v4_n_estimators_200.pkl


In [29]:
### LOOK AT THE DEPTH OF ALL OF OUR DECISION TREES USED IN THE RANDOM FOREST TREES ###
#and evaluate the average, min, max and some distributions
decision_tree_depths= [estimator.tree_.max_depth for estimator in regr.estimators_] 

print('Decision Tree Depths of All of Our Random Forests: ')
print('Min Depth: ', min(decision_tree_depths))
print('Mean Depth: ', np.mean(decision_tree_depths))
print('Max Depth: ', max(decision_tree_depths))

print('\nDecision Tree Depths Distribution: ')
print('25th/50th/75th Percentiles: ', np.percentile(decision_tree_depths, q=[25, 50, 75]))


Decision Tree Depths of All of Our Random Forests: 
Min Depth:  29
Mean Depth:  33.295
Max Depth:  41

Decision Tree Depths Distribution: 
25th/50th/75th Percentiles:  [ 32.  33.  34.]


In [37]:
### VISUALIZE ONE OF THE DECISION TREES IN OUR RANDOM FOREST USING GRAPHVIZ - TRY TO PICK ONE WITH AVERAGE DEPTH OF OUR RF ###
#find an individual decision tree (estimator) that has the median depth and try to visualize it to see the
#end leaf outputs, etc. to better understand the model

#make sure the following packages have already been imported (and that graphviz is installed)
# from sklearn.tree import export_graphviz
# from subprocess import call
# from IPython.display import Image

#manually try different entries in estimators_ until max_depth closely matches the mean/median depth
decision_tree_avg_depth_example = regr.estimators_[3]
print('The example decision tree we are looking at has a maximum depth of: ', decision_tree_avg_depth_example.tree_.max_depth)

#export our full tree as a dot file (which is the filetype that graphviz uses)
export_graphviz(decision_tree_avg_depth_example, #the tree regressor
                out_file='model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200.dot', 
                feature_names = list(X_model_train.columns),
#                 class_names = iris.target_names,
                rounded = True, #rounded When set to True, draw node boxes with rounded corners and use Helvetica fonts instead of Times-Roman.
                proportion = True, #proportion True changes 'values' and/or 'samples' to be proportions and percentages respectively instead of counts 
                precision = 2, filled = True,
                leaves_parallel = True) #leaves_parallel when set to True, draw all leaf nodes at the bottom of the tree.

#export top 5 levels of our tree as a dot file - the full tree representation is massive, so this might help the viz actually run
export_graphviz(decision_tree_avg_depth_example, #the tree regressor
                out_file='model_pickles/avg_tree_depth5_viz_random_forest_v4_n_estimators_200.dot',
                max_depth=5,
                feature_names = list(X_model_train.columns),
#                 class_names = iris.target_names,
                rounded = True, #rounded When set to True, draw node boxes with rounded corners and use Helvetica fonts instead of Times-Roman.
                proportion = True, #proportion True changes 'values' and/or 'samples' to be proportions and percentages respectively instead of counts 
                precision = 2, filled = True,
                leaves_parallel = True) #leaves_parallel when set to True, draw all leaf nodes at the bottom of the tree.

# VISUALIZE AT www.webgraphviz.com: can visualize the .dot outputs by copying the .dot file text into http://www.webgraphviz.com/

# # Convert to png using system command (requires Graphviz - unsure how to install this on cloud server)
# # ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH
# #option1- try to display top 5 levels with graphviz - 
#Source.from_file('model_pickles/avg_tree_depth5_viz_random_forest_v4_n_estimators_200.dot')

# #option2- Command line syntax call is trying to replicate: !dot -Tpng tree.dot -o tree.png -Gdpi=600
# call(['dot', '-Tpng', 'model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200.dot', 
#       '-o', 'model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200.png', '-Gdpi=600'])

# #Display in jupyter notebook
# Image(filename = 'model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200.png')

The example decision tree we are looking at has a maximum depth of:  33
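
Instead of trying `estimators_` indices by hand, the tree closest to the median depth could be selected programmatically. This is a minimal sketch on a made-up list of depths; `index_closest_to_median` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def index_closest_to_median(depths):
    """Return the index of the first entry whose value is nearest the median."""
    depths = np.asarray(depths, dtype=float)
    return int(np.argmin(np.abs(depths - np.median(depths))))

# Made-up depths for illustration only.
example_depths = [29, 35, 41, 33, 32, 34]
chosen_index = index_closest_to_median(example_depths)
print(chosen_index, example_depths[chosen_index])
```

With a fitted forest this would be used as `regr.estimators_[index_closest_to_median(decision_tree_depths)]`.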


## Random Forest Regressor: n_estimators=200, min_samples_leaf=44
Chose min_samples_leaf of 44, which is approximately 0.005 (half a percent) of the count of revenue transactions in the training dataset: approx. 900,000 rows x 0.75 (training split) x 0.013 (conversion rate) x 0.005 ≈ 44. The half-percent fraction was chosen arbitrarily.
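
The arithmetic behind this choice can be reproduced as a quick check. The function name `leaf_size_estimate` is hypothetical; the row count, training share, and conversion rate are the approximate figures quoted in the text, not values recomputed from the data.

```python
def leaf_size_estimate(fraction, n_rows=900_000, train_share=0.75, conversion_rate=0.013):
    """Approximate count of revenue rows in the training split, scaled by `fraction`."""
    return n_rows * train_share * conversion_rate * fraction

half_percent = leaf_size_estimate(0.005)   # basis for min_samples_leaf=44
tenth_percent = leaf_size_estimate(0.001)  # basis for min_samples_leaf=8
print(round(half_percent, 3), round(tenth_percent, 3))
```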

In [82]:
regr = RandomForestRegressor(#max_depth=10, 
                             #random_state=0,
                             min_samples_leaf=44,
                             n_estimators=200)
regr.fit(X_model_train, y_model_train)
print(regr.feature_importances_)

[  3.13533089e-01   4.81370140e-02   2.66350288e-01   7.50750580e-03
   1.52368813e-09   1.67559431e-02   2.49147684e-03   8.01818389e-02
   3.80684798e-02   2.09651551e-03   6.46481109e-04   2.98465265e-03
   1.75240987e-02   2.11748067e-04   1.79542188e-02   5.16490408e-03
   2.23126474e-02   1.16106516e-02   1.37834241e-02   3.10507315e-02
   7.19437222e-03   6.66348200e-02   3.12493795e-06   2.77935509e-02
   8.42156800e-06]


In [83]:
#show the X columns input and their associated feature importance
list(zip(X_model_train.columns, regr.feature_importances_))

[('totals_pageviews', 0.31353308941044206),
 ('totals_hits', 0.048137013998194193),
 ('visitNumber', 0.26635028756244433),
 ('totals_newVisits', 0.0075075057974991581),
 ('totals_bounces', 1.5236881250606844e-09),
 ('weekday_local', 0.016755943108781979),
 ('month_local', 0.0024914768370072646),
 ('yearday_local', 0.080181838898796903),
 ('hour_local', 0.038068479781046577),
 ('device_deviceCategory', 0.0020965155090408665),
 ('device_browser', 0.00064648110929981276),
 ('device_isMobile', 0.0029846526481201903),
 ('device_operatingSystem', 0.01752409873923531),
 ('geoNetwork_subContinent', 0.00021174806714474931),
 ('geoNetwork_region', 0.017954218792685182),
 ('geoNetwork_continent', 0.0051649040843063654),
 ('geoNetwork_country', 0.022312647379450667),
 ('geoNetwork_city', 0.011610651587235463),
 ('geoNetwork_metro', 0.013783424055321547),
 ('geoNetwork_networkDomain', 0.031050731481780308),
 ('trafficSource_isTrueDirect', 0.0071943722170881887),
 ('trafficSource_source', 0.06663482
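
The list above is in column order, which makes the dominant features hard to spot. A small sketch of ranking by importance follows, using a handful of pairs copied from the output above (only a subset, for brevity).

```python
# A few (feature, importance) pairs copied from the output above.
importances = [
    ('totals_pageviews', 0.31353308941044206),
    ('totals_hits', 0.048137013998194193),
    ('visitNumber', 0.26635028756244433),
    ('yearday_local', 0.080181838898796903),
    ('hour_local', 0.038068479781046577),
]

# Sort descending by the importance value (second element of each pair).
ranked = sorted(importances, key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{:<20} {:.4f}".format(name, score))
```

On the fitted model the same idea would be `sorted(zip(X_model_train.columns, regr.feature_importances_), key=lambda pair: pair[1], reverse=True)`.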

In [84]:
y_model_test_predict = regr.predict(X_model_test)
print(y_model_test_predict)

[ 1653886.86043196        0.                0.         ...,        0.
        0.          2196853.46967448]


In [85]:
confusion_matrix_model = evaluate_revenue_confusion_matrix(y_model_test, y_model_test_predict)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2869      8      0]
 [ 35471 187172      0]
 [     0      0      0]]


In [86]:
### ACCURACY OF REVENUE/NO_REVENUE CATEGORY LABELS ###

#use np.sum(array) to sum all the values in the array
percent_accuracy = (confusion_matrix_model[0][0] + confusion_matrix_model[1][1])/np.sum(confusion_matrix_model)
print(percent_accuracy)

0.84267914154


In [87]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE = mean_squared_error(y_model_test, y_model_test_predict)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2 = r2_score(y_model_test, y_model_test_predict)

#note: np.sqrt(r2) evaluates to nan whenever r2 is negative
print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE, r2, np.sqrt(r2)))

Mean Squared Error: 2367230724757142.0 
R-squared: 0.08020535037965482 
Correlation: 0.2832054914362623


In [88]:
regr

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=44, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [89]:
save_model_pickle(regr, 'model_pickles/random_forest_v4_n_estimators_200_min_leaf_44.pkl')

Scikit Model saved to model_pickles/random_forest_v4_n_estimators_200_min_leaf_44.pkl


In [90]:
### LOOK AT THE DEPTH OF ALL OF OUR DECISION TREES USED IN THE RANDOM FOREST TREES ###
#and evaluate the average, min, max and some distributions
decision_tree_depths= [estimator.tree_.max_depth for estimator in regr.estimators_] 

print('Decision Tree Depths of All of Our Random Forests: ')
print('Min Depth: ', min(decision_tree_depths))
print('Mean Depth: ', np.mean(decision_tree_depths))
print('Max Depth: ', max(decision_tree_depths))

print('\nDecision Tree Depths Distribution: ')
print('25th/50th/75th Percentiles: ', np.percentile(decision_tree_depths, q=[25, 50, 75]))


Decision Tree Depths of All of Our Random Forests: 
Min Depth:  17
Mean Depth:  19.98
Max Depth:  25

Decision Tree Depths Distribution: 
25th/50th/75th Percentiles:  [ 19.  20.  21.]


In [93]:
### VISUALIZE ONE OF THE DECISION TREES IN OUR RANDOM FOREST USING GRAPHVIZ - TRY TO PICK ONE WITH AVERAGE DEPTH OF OUR RF ###
#find an individual decision tree (estimator) that has the median depth and try to visualize it to see the
#end leaf outputs, etc. to better understand the model

#make sure the following packages have already been imported (and that graphviz is installed)
# from sklearn.tree import export_graphviz
# from subprocess import call
# from IPython.display import Image

#manually try different entries in estimators_ until max_depth closely matches the mean/median depth
decision_tree_avg_depth_example = regr.estimators_[1]
print('The example decision tree we are looking at has a maximum depth of: ', decision_tree_avg_depth_example.tree_.max_depth)

#export our full tree as a dot file (which is the filetype that graphviz uses)
export_graphviz(decision_tree_avg_depth_example, #the tree regressor
                out_file='model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200_min_leaf_44.dot', 
                feature_names = list(X_model_train.columns),
#                 class_names = iris.target_names,
                rounded = True, #rounded When set to True, draw node boxes with rounded corners and use Helvetica fonts instead of Times-Roman.
                proportion = True, #proportion True changes 'values' and/or 'samples' to be proportions and percentages respectively instead of counts 
                precision = 2, filled = True,
                leaves_parallel = True) #leaves_parallel when set to True, draw all leaf nodes at the bottom of the tree.

#export top 5 levels of our tree as a dot file - the full tree representation is massive, so this might help the viz actually run
export_graphviz(decision_tree_avg_depth_example, #the tree regressor
                out_file='model_pickles/avg_tree_depth5_viz_random_forest_v4_n_estimators_200_min_leaf_44.dot',
                max_depth=5,
                feature_names = list(X_model_train.columns),
#                 class_names = iris.target_names,
                rounded = True, #rounded When set to True, draw node boxes with rounded corners and use Helvetica fonts instead of Times-Roman.
                proportion = True, #proportion True changes 'values' and/or 'samples' to be proportions and percentages respectively instead of counts 
                precision = 2, filled = True,
                leaves_parallel = True) #leaves_parallel when set to True, draw all leaf nodes at the bottom of the tree.

# VISUALIZE AT www.webgraphviz.com: can visualize the .dot outputs by copying the .dot file text into http://www.webgraphviz.com/

# # Convert to png using system command (requires Graphviz - unsure how to install this on cloud server)
# # ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH
# #option1- try to display top 5 levels with graphviz - 
#Source.from_file('model_pickles/avg_tree_depth5_viz_random_forest_v4_n_estimators_200_min_leaf_44.dot')

# #option2- Command line syntax call is trying to replicate: !dot -Tpng tree.dot -o tree.png -Gdpi=600
# call(['dot', '-Tpng', 'model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200_min_leaf_44.dot', 
#       '-o', 'model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200_min_leaf_44.png', '-Gdpi=600'])

# #Display in jupyter notebook
# Image(filename = 'model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200_min_leaf_44.png')

The example decision tree we are looking at has a maximum depth of:  20


## Random Forest Regressor: n_estimators=200, min_samples_leaf=8
Chose min_samples_leaf of 8, which is approximately 0.001 (a tenth of a percent) of the count of revenue transactions in the training dataset: approx. 900,000 rows x 0.75 (training split) x 0.013 (conversion rate) x 0.001 ≈ 8.8, for which 8 was used. The tenth-of-a-percent fraction was chosen arbitrarily.

In [94]:
regr = RandomForestRegressor(#max_depth=10, 
                             #random_state=0,
                             min_samples_leaf=8,
                             n_estimators=200)
regr.fit(X_model_train, y_model_train)
print(regr.feature_importances_)

[  2.05440430e-01   1.03539409e-01   3.05195721e-01   4.27328960e-03
   1.29392469e-08   2.53491831e-02   1.03327882e-02   1.18827981e-01
   5.87351552e-02   1.18271011e-03   8.71776273e-04   1.39760742e-03
   1.91387861e-02   1.24958351e-03   1.66940728e-02   2.85968575e-03
   1.03086889e-02   2.05936096e-02   1.52711328e-02   2.85956875e-02
   6.10813262e-03   2.47282261e-02   2.58337367e-04   1.86370203e-02
   4.10973067e-04]


In [95]:
#show the X columns input and their associated feature importance
list(zip(X_model_train.columns, regr.feature_importances_))

[('totals_pageviews', 0.2054404295753593),
 ('totals_hits', 0.10353940882359319),
 ('visitNumber', 0.30519572099617792),
 ('totals_newVisits', 0.004273289603871102),
 ('totals_bounces', 1.2939246927955069e-08),
 ('weekday_local', 0.025349183050909087),
 ('month_local', 0.010332788181458868),
 ('yearday_local', 0.11882798136308331),
 ('hour_local', 0.058735155166533784),
 ('device_deviceCategory', 0.0011827101077456562),
 ('device_browser', 0.00087177627308422309),
 ('device_isMobile', 0.0013976074181166578),
 ('device_operatingSystem', 0.019138786091507837),
 ('geoNetwork_subContinent', 0.0012495835109251304),
 ('geoNetwork_region', 0.016694072818117862),
 ('geoNetwork_continent', 0.0028596857539773747),
 ('geoNetwork_country', 0.010308688884460942),
 ('geoNetwork_city', 0.020593609626563689),
 ('geoNetwork_metro', 0.015271132811773412),
 ('geoNetwork_networkDomain', 0.02859568750991319),
 ('trafficSource_isTrueDirect', 0.0061081326220094323),
 ('trafficSource_source', 0.02472822614127

In [96]:
y_model_test_predict = regr.predict(X_model_test)
print(y_model_test_predict)

[  463090.37896155        0.                0.         ...,        0.
        0.          2572856.02830904]


In [97]:
confusion_matrix_model = evaluate_revenue_confusion_matrix(y_model_test, y_model_test_predict)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2860     17      0]
 [ 24945 197698      0]
 [     0      0      0]]


In [98]:
### ACCURACY OF REVENUE/NO_REVENUE CATEGORY LABELS ###

#use np.sum(array) to sum all the values in the array
percent_accuracy = (confusion_matrix_model[0][0] + confusion_matrix_model[1][1])/np.sum(confusion_matrix_model)
print(percent_accuracy)

0.889313586378


In [99]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE = mean_squared_error(y_model_test, y_model_test_predict)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2 = r2_score(y_model_test, y_model_test_predict)

#note: np.sqrt(r2) evaluates to nan whenever r2 is negative
print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE, r2, np.sqrt(r2)))

Mean Squared Error: 2579128676776270.5 
R-squared: -0.0021282390311103683 
Correlation: nan
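
To compare the three n_estimators=200 runs side by side, the printed metrics can be collected in one place. The numbers below are copied from the outputs in this notebook. Labelling the first run `min_samples_leaf=1` assumes its metrics belong to the model whose repr shows that default; the cell execution order does not confirm this.

```python
import math

# (label, label accuracy, MSE, R-squared) copied from the printed outputs.
runs = [
    ("min_samples_leaf=1", 0.920809684285, 3192820637130802.0, -0.24058010421929543),
    ("min_samples_leaf=44", 0.84267914154, 2367230724757142.0, 0.08020535037965482),
    ("min_samples_leaf=8", 0.889313586378, 2579128676776270.5, -0.0021282390311103683),
]

print("{:<22}{:>10}{:>14}{:>10}".format("run", "accuracy", "RMSE", "R2"))
for label, accuracy, mse, r2 in runs:
    # RMSE is in the same units as the revenue target, unlike MSE.
    print("{:<22}{:>10.4f}{:>14.3e}{:>10.4f}".format(label, accuracy, math.sqrt(mse), r2))

best_by_mse = min(runs, key=lambda run: run[2])[0]
best_by_accuracy = max(runs, key=lambda run: run[1])[0]
print("Lowest MSE:", best_by_mse)
print("Highest label accuracy:", best_by_accuracy)
```

The two criteria pick different runs: the largest leaf size gives the lowest MSE and the only positive R-squared, while the smallest leaf size gives the highest revenue/no_revenue label accuracy.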




In [100]:
regr

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=8, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [101]:
save_model_pickle(regr, 'model_pickles/random_forest_v4_n_estimators_200_min_leaf_8.pkl')

Scikit Model saved to model_pickles/random_forest_v4_n_estimators_200_min_leaf_8.pkl


In [102]:
### LOOK AT THE DEPTH OF ALL OF OUR DECISION TREES USED IN THE RANDOM FOREST TREES ###
#and evaluate the average, min, max and some distributions
decision_tree_depths= [estimator.tree_.max_depth for estimator in regr.estimators_] 

print('Decision Tree Depths of All of Our Random Forests: ')
print('Min Depth: ', min(decision_tree_depths))
print('Mean Depth: ', np.mean(decision_tree_depths))
print('Max Depth: ', max(decision_tree_depths))

print('\nDecision Tree Depths Distribution: ')
print('25th/50th/75th Percentiles: ', np.percentile(decision_tree_depths, q=[25, 50, 75]))


Decision Tree Depths of All of Our Random Forests: 
Min Depth:  23
Mean Depth:  25.515
Max Depth:  33

Decision Tree Depths Distribution: 
25th/50th/75th Percentiles:  [ 25.  25.  26.]


In [105]:
### VISUALIZE ONE OF THE DECISION TREES IN OUR RANDOM FOREST USING GRAPHVIZ - TRY TO PICK ONE WITH AVERAGE DEPTH OF OUR RF ###
#find an individual decision tree (estimator) that has the median depth and try to visualize it to see the
#end leaf outputs, etc. to better understand the model

#make sure the following packages have already been imported (and that graphviz is installed)
# from sklearn.tree import export_graphviz
# from subprocess import call
# from IPython.display import Image

#manually try different entries in estimators_ until max_depth closely matches the mean/median depth
decision_tree_avg_depth_example = regr.estimators_[2]
print('The example decision tree we are looking at has a maximum depth of: ', decision_tree_avg_depth_example.tree_.max_depth)

#export our full tree as a dot file (which is the filetype that graphviz uses)
export_graphviz(decision_tree_avg_depth_example, #the tree regressor
                out_file='model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200_min_leaf_8.dot', 
                feature_names = list(X_model_train.columns),
#                 class_names = iris.target_names,
                rounded = True, #rounded When set to True, draw node boxes with rounded corners and use Helvetica fonts instead of Times-Roman.
                proportion = True, #proportion True changes 'values' and/or 'samples' to be proportions and percentages respectively instead of counts 
                precision = 2, filled = True,
                leaves_parallel = True) #leaves_parallel when set to True, draw all leaf nodes at the bottom of the tree.

#export top 5 levels of our tree as a dot file - the full tree representation is massive, so this might help the viz actually run
export_graphviz(decision_tree_avg_depth_example, #the tree regressor
                out_file='model_pickles/avg_tree_depth5_viz_random_forest_v4_n_estimators_200_min_leaf_8.dot',
                max_depth=5,
                feature_names = list(X_model_train.columns),
#                 class_names = iris.target_names,
                rounded = True, #rounded When set to True, draw node boxes with rounded corners and use Helvetica fonts instead of Times-Roman.
                proportion = True, #proportion True changes 'values' and/or 'samples' to be proportions and percentages respectively instead of counts 
                precision = 2, filled = True,
                leaves_parallel = True) #leaves_parallel when set to True, draw all leaf nodes at the bottom of the tree.

# VISUALIZE AT www.webgraphviz.com: can visualize the .dot outputs by copying the .dot file text into http://www.webgraphviz.com/

# # Convert to png using system command (requires Graphviz - unsure how to install this on cloud server)
# # ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH
# #option1- try to display top 5 levels with graphviz - 
#Source.from_file('model_pickles/avg_tree_depth5_viz_random_forest_v4_n_estimators_200_min_leaf_8.dot')

# #option2- Command line syntax call is trying to replicate: !dot -Tpng tree.dot -o tree.png -Gdpi=600
# call(['dot', '-Tpng', 'model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200_min_leaf_8.dot', 
#       '-o', 'model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200_min_leaf_8.png', '-Gdpi=600'])

# #Display in jupyter notebook
# Image(filename = 'model_pickles/avg_tree_viz_random_forest_v4_n_estimators_200_min_leaf_8.png')

The example decision tree we are looking at has a maximum depth of:  25
