# Machine Learning of the Google Store Analytics Dataset

## Linear Regression Models - v1
### Version 1: Limit Input Variables, Label Encoding (instead of One Hot Encoding), No Scaling/Transforming of Data

This dataset comes from the Kaggle Google Analytics Customer Revenue Prediction competition.  
https://www.kaggle.com/c/ga-customer-revenue-prediction

We performed some data engineering and datetime feature engineering to get the dataset to the state we wanted.

Now we will try a variety of different models and look at their accuracy.  The models we will try:
1. Generalized Linear Regression Models
    1. Linear Regression (Ordinary Least Squares) Model
    2. Linear Lasso Regression Model
    3. Linear Ridge Regression Model
    4. Linear Elastic Net Regression Model
2. Decision Tree Regression: a decision tree model that predicts a continuous output value instead of a class. See http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html and http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
3. Random Forest Regression??
4. Neural Networks

In [154]:
import pandas as pd
import numpy as np


from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix

from scipy.stats import pearsonr

## Importing and Pre-processing of the Training Dataset

In [65]:
#import the data engineered and feature engineered training dataset
df = pd.read_pickle('data/train_v1_full_data_split.pkl')
print(df.shape)
# print(df.columns)

(903652, 44)


In [66]:
### DROP COLUMNS NOT IN FINAL TEST DATA ###
#the test dataset does not have the 'trafficSource_campaignCode' column, so drop that from our training set too
df.drop('trafficSource_campaignCode', axis=1, inplace=True)
print(df.shape)
# print(df.columns)
# df.head(3)

(903652, 43)


In [67]:
### CHANGE TRANSACTION REVENUE FROM NANs to 0 AND CHANGE to FLOAT TYPE (some are strings)###
df.totals_transactionRevenue.fillna(0, inplace=True)
df.totals_transactionRevenue = df.totals_transactionRevenue.astype(dtype=float)

### CHANGE OTHER STRINGS TO INTS/FLOATS WHERE NEEDED ###
#stick to floats rather than ints since a np.nan is a float object
df.totals_bounces = df.totals_bounces.astype(dtype=float)
df.totals_hits = df.totals_hits.astype(dtype=float)
df.totals_newVisits = df.totals_newVisits.astype(dtype=float)
df.totals_pageviews = df.totals_pageviews.astype(dtype=float)
df.totals_visits = df.totals_visits.astype(dtype=float)

### CONVERT NANs in bounces, newVisits to 0 values ###
#the blank NAN values for these columns imply a 0 value meaning 0 newVisits or 0 bounces
df.totals_bounces.fillna(0, inplace=True)
df.totals_newVisits.fillna(0, inplace=True)
# df.totals_visits.fillna(0, inplace=True) #there shouldn't be anyone with 0 visits (they've at least visited once or wouldn't be recorded)

In [68]:
#### REVENUE IS DOLLARS * 10^6, NOT EXPONENTIAL LIKE WE THOUGHT ####
#### SINCE THE REVENUE IS SCALED UP BY A CONSTANT, NO NEED TO ADJUST FOR LIN REGRESS MODEL ####
# ### CONVERT TRANSACTION REVENUE TO DOLLARS (instead of the e^dollars_revenue) ###
# df['totals_transactionRevenue_dollars'] = df.totals_transactionRevenue.map(lambda x:
#                                                                             np.log1p(x))
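
As a small illustration of the scaling noted above: if the stored revenue is dollars multiplied by 10^6, dividing by 1e6 recovers plain dollars, and `np.log1p` gives the log-transformed value that the commented-out code was experimenting with. This is only a sketch; the sample values and the names `revenue_raw`, `revenue_dollars`, and `revenue_log` are made up for illustration.

```python
import numpy as np

# Hypothetical raw values in the dataset's units (dollars * 10^6); 0 means no purchase
revenue_raw = np.array([0.0, 17_990_000.0, 23_129_500_000.0])

# Undo the constant scale factor to get plain dollars
revenue_dollars = revenue_raw / 1e6

# log1p(x) = log(1 + x) stays defined at x = 0, unlike a plain log
revenue_log = np.log1p(revenue_raw)

print(revenue_dollars)
print(revenue_log)
```

Because the scale factor is a constant, it only rescales the fitted coefficients and the error metrics of a linear model; it does not change which rows are predicted higher or lower.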

In [69]:
### VIEW THE DATA BEFORE LABEL/ONE HOT ENCODING ###
print(df.shape)
print(df.columns)
df.head(3)

(903652, 43)
Index(['channelGrouping', 'date', 'fullVisitorId', 'sessionId',
       'socialEngagementType', 'visitId', 'visitNumber', 'visitStartTime',
       'device_deviceCategory', 'device_browser', 'device_isMobile',
       'device_operatingSystem', 'geoNetwork_subContinent',
       'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
       'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
       'totals_bounces', 'totals_hits', 'totals_newVisits', 'totals_pageviews',
       'totals_visits', 'totals_transactionRevenue',
       'trafficSource_isTrueDirect', 'trafficSource_keyword',
       'trafficSource_source', 'trafficSource_adContent',
       'trafficSource_medium', 'trafficSource_referralPath',
       'trafficSource_campaign', 'city_country', 'lat_lng', 'timezone',
       'datetime_iso_utc', 'datetime_iso_local', 'year_local', 'month_local',
       'day_local', 'yearday_local', 'weekday_local', 'hour_local'],
      dtype='object')


Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device_deviceCategory,device_browser,...,lat_lng,timezone,datetime_iso_utc,datetime_iso_local,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
0,Organic Search,20160902,1131660440785968503,1131660440785968503_1472830385,Not Socially Engaged,1472830385,1,1472830385,desktop,Chrome,...,"(38.423734, 27.142826)","(+03, 3.0)",2016-09-02 15:33:05+00:00,2016-09-02 18:33:05+03:00,2016.0,9.0,2.0,246.0,5.0,18.0
1,Organic Search,20160902,377306020877927890,377306020877927890_1472880147,Not Socially Engaged,1472880147,1,1472880147,desktop,Firefox,...,"(-25.274398, 133.775136)","(ACST, 9.5)",2016-09-03 05:22:27+00:00,2016-09-03 14:52:27+09:30,2016.0,9.0,3.0,247.0,6.0,14.0
2,Organic Search,20160902,3895546263509774583,3895546263509774583_1472865386,Not Socially Engaged,1472865386,1,1472865386,desktop,Chrome,...,"(40.4167754, -3.7037902)","(CEST, 2.0)",2016-09-03 01:16:26+00:00,2016-09-03 03:16:26+02:00,2016.0,9.0,3.0,247.0,6.0,3.0


In [70]:
#view the numerical data columns for counts, mean, and min/max
#if the standard deviation (std) is zero, that means every value is the same - may want to check that data
#and see if need to edit it (since describe ignores NANs for instance, you may need to go back and convert the NANs to a 
#value that makes sense)
df.describe()

Unnamed: 0,date,visitId,visitNumber,visitStartTime,totals_bounces,totals_hits,totals_newVisits,totals_pageviews,totals_visits,totals_transactionRevenue,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
count,903652.0,903652.0,903652.0,903652.0,903652.0,903652.0,903652.0,903552.0,903652.0,903652.0,902175.0,902175.0,902175.0,902175.0,902175.0,902175.0
mean,20165890.0,1485007000.0,2.264898,1485007000.0,0.498675,4.596542,0.77802,3.849767,1.0,1704275.0,2016.517473,6.990086,15.698499,197.611083,3.739715,13.898355
std,4697.698,9022128.0,9.28374,9022128.0,0.499999,9.641442,0.415578,7.025277,0.0,52778690.0,0.499695,3.486402,8.824394,106.757146,1.919636,5.806083
min,20160800.0,1470035000.0,1.0,1470035000.0,0.0,1.0,0.0,1.0,1.0,0.0,2016.0,1.0,1.0,1.0,1.0,0.0
25%,20161030.0,1477561000.0,1.0,1477561000.0,0.0,1.0,1.0,1.0,1.0,0.0,2016.0,4.0,8.0,103.0,2.0,10.0
50%,20170110.0,1483949000.0,1.0,1483949000.0,0.0,2.0,1.0,1.0,1.0,0.0,2017.0,7.0,16.0,207.0,4.0,14.0
75%,20170420.0,1492759000.0,1.0,1492759000.0,1.0,4.0,1.0,4.0,1.0,0.0,2017.0,10.0,23.0,297.0,5.0,18.0
max,20170800.0,1501657000.0,395.0,1501657000.0,1.0,500.0,1.0,469.0,1.0,23129500000.0,2017.0,12.0,31.0,366.0,7.0,23.0


In [71]:
### LABEL ENCODING THE CATEGORICAL VARIABLES (???SHOULD WE BE DOING ONE-HOT ENCODING INSTEAD???)###
# label encode the categorical variables
categorical_cols = ['channelGrouping', 'socialEngagementType', 
                   'device_deviceCategory', 'device_browser', 'device_isMobile',
                   'device_operatingSystem', 'geoNetwork_subContinent',
                   'geoNetwork_region', 'geoNetwork_continent', 'geoNetwork_country',
                   'geoNetwork_city', 'geoNetwork_metro', 'geoNetwork_networkDomain',
                   'trafficSource_isTrueDirect', 'trafficSource_keyword',
                   'trafficSource_source', 'trafficSource_adContent',
                   'trafficSource_medium', 'trafficSource_referralPath',
                   'trafficSource_campaign']

print('Original Dataframe Shape: ', df.shape)

for col in categorical_cols:
    print('\n Converting Column: ', col)
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df[col].values.astype('str')))
    df[col] = lbl.transform(list(df[col].values.astype('str')))
    print(df.shape)

# #Kaggle competition original code
# for col in categorical_cols:
#     print(col)
#     lbl = preprocessing.LabelEncoder()
#     lbl.fit(list(train_df[col].values.astype('str')) + list(test_df[col].values.astype('str')))
#     train_df[col] = lbl.transform(list(train_df[col].values.astype('str')))
#     test_df[col] = lbl.transform(list(test_df[col].values.astype('str')))

Original Dataframe Shape:  (903652, 43)

 Converting Column:  channelGrouping
(903652, 43)

 Converting Column:  socialEngagementType
(903652, 43)

 Converting Column:  device_deviceCategory
(903652, 43)

 Converting Column:  device_browser
(903652, 43)

 Converting Column:  device_isMobile
(903652, 43)

 Converting Column:  device_operatingSystem
(903652, 43)

 Converting Column:  geoNetwork_subContinent
(903652, 43)

 Converting Column:  geoNetwork_region
(903652, 43)

 Converting Column:  geoNetwork_continent
(903652, 43)

 Converting Column:  geoNetwork_country
(903652, 43)

 Converting Column:  geoNetwork_city
(903652, 43)

 Converting Column:  geoNetwork_metro
(903652, 43)

 Converting Column:  geoNetwork_networkDomain
(903652, 43)

 Converting Column:  trafficSource_isTrueDirect
(903652, 43)

 Converting Column:  trafficSource_keyword
(903652, 43)

 Converting Column:  trafficSource_source
(903652, 43)

 Converting Column:  trafficSource_adContent
(903652, 43)

 Converting Colum
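
To make the label-encoding vs. one-hot-encoding question raised above concrete, here is a minimal sketch on a toy column (the dataframe `toy` and its values are hypothetical, not taken from the dataset). Label encoding assigns one integer per category, which a linear model will treat as a numeric magnitude; one-hot encoding creates one indicator column per category and implies no ordering.

```python
import pandas as pd
from sklearn import preprocessing

# Toy column standing in for a categorical feature such as device_browser
toy = pd.DataFrame({'device_browser': ['Chrome', 'Firefox', 'Safari', 'Chrome']})

# Label encoding: one integer per category, assigned in sorted order of the category names
lbl = preprocessing.LabelEncoder()
toy['browser_label'] = lbl.fit_transform(toy['device_browser'].astype(str))

# One-hot encoding: one indicator column per category, no implied ordering
one_hot = pd.get_dummies(toy['device_browser'], prefix='browser')

print(toy)
print(one_hot)
```

For high-cardinality columns such as geoNetwork_networkDomain, one-hot encoding would add a very large number of columns, which is one practical reason to start with label encoding in this first version.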

## Decide what Input Data to Use for X and Split Data via train_test_split
For the initial runs of the models, use fewer input variables (the ones we think are most predictive).

In [72]:
### CALCULATE CORRELATION OF EACH POSSIBLE INPUT vs. REVENUE VALUE - TO HELP DECISION MAKING OF WHICH INPUTS TO INCLUDE###
correlation_summary_list = []

for col in df.columns:
    #can only run correlations on columns that have numerical values (either dtype of float or int)
    #in particular some columns have dtype of 'O', which stands for python object, which in this case means the dtypes are mixed
    if df[col].dtype in ['float64', 'int64']:
        
        #having NANs in the dataset for correlations breaks the correlation calculation
        #so only keep the good_rows that don't have nans for either series being used in the correlation calculation
        #and then np.compress(good_rows, series) just reduces the series to the array with only the good_rows to run the correlation
        good_rows = ~np.logical_or(np.isnan(df[col]), np.isnan(df.totals_transactionRevenue))

        #pearsonr function calculates the Pearson correlation coefficient and the p-value and returns them as a tuple
        correl_pvalue = pearsonr(np.compress(good_rows, df[col]), np.compress(good_rows, df.totals_transactionRevenue))

        #create new tuple that also has the column name (which is the input variable) and the correl coef and p-value
        variable_correl_pvalue = (col,) + correl_pvalue

        #add the correl and pvalue tuple to the list of all correlation summaries
        correlation_summary_list.append(variable_correl_pvalue)

        
#create a dataframe of the correlation summary (for ease of readability/manipulation)    
correlation_summary_df = pd.DataFrame(correlation_summary_list, columns=['Input Variable', 'Correlation', 'p-value'])
#create a new column of absolute value of correlations so can easily sort both positive and negative correlations together
correlation_summary_df['Absolute Value Correlation'] = correlation_summary_df.Correlation.map(lambda correl: abs(correl))
#sort the dataframe by absolute value correlation high to low
correlation_summary_df.sort_values('Absolute Value Correlation', ascending=False, inplace=True)

#print output
correlation_summary_df

  r = r_num / r_den
  x = np.where(x < 1.0, x, 1.0)  # if x > 1 then return 1.0


| | Input Variable | Correlation | p-value | Absolute Value Correlation |
|---|---|---|---|---|
| 22 | totals_transactionRevenue | 1.0 | 0.0 | 1.0 |
| 20 | totals_pageviews | 0.15559 | 0.0 | 0.15559 |
| 18 | totals_hits | 0.154333 | 0.0 | 0.154333 |
| 4 | visitNumber | 0.051366 | 0.0 | 0.051366 |
| 19 | totals_newVisits | -0.041164 | 0.0 | 0.041164 |
| 17 | totals_bounces | -0.032206 | 6.106351000000001e-206 | 0.032206 |
| 23 | trafficSource_isTrueDirect | 0.030819 | 9.368688e-189 | 0.030819 |
| 28 | trafficSource_referralPath | -0.030432 | 4.3109749999999995e-184 | 0.030432 |
| 12 | geoNetwork_continent | -0.025523 | 4.440926e-130 | 0.025523 |
| 13 | geoNetwork_country | 0.022395 | 1.3618839999999998e-100 | 0.022395 |
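
The NaN-masking step used in the correlation loop can be seen in isolation on two short arrays. This is a minimal sketch with made-up values (`x`, `y` are hypothetical); it keeps only the positions where both series have a value and then calls `pearsonr` on the filtered arrays.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.0, np.nan, 10.2])

# Keep only the positions where neither series is NaN
good_rows = ~np.logical_or(np.isnan(x), np.isnan(y))

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_value = pearsonr(x[good_rows], y[good_rows])

print(good_rows)
print(r, p_value)
```

Boolean indexing (`x[good_rows]`) does the same job here as `np.compress(good_rows, x)` in the cell above.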


In [77]:
### ASSIGN X and y DATA for VARIABLES WE WANT TO USE###

# for X data use the initial correlation values and variables that we think are most important to narrow things down
# (remember, that the correlation values are just linear correlation values, so this doesn't capture variables
# that do have a large influence but might be nonlinear, however, for linear regression models at the least, that
# seems like a good metric to start with as the linear models won't be able to capture nonlinear effects well anyway)

#INITIAL RUN DECISIONS: as we can see from the initial Pearson correlations, no variables even fall within the range of what
#we would consider even low correlations traditionally, so it is doubtful that linear regression models will work well,
#but we'll try it out - take the top 10 variables and also include weekday_local, month_local, yearday_local, and hour_local
#since those are features we specifically added to make our features unique



#start with a reduced dataframe that includes all x input variables and the y output variable
#do this so that we can clean the dataframe by dropping all rows that have any nans
df_model = df[['totals_pageviews', 'totals_hits', 'visitNumber', 'totals_newVisits', 'totals_bounces',
              'trafficSource_isTrueDirect', 'trafficSource_referralPath',
              'geoNetwork_continent', 'geoNetwork_country', 'geoNetwork_networkDomain',
              'weekday_local', 'month_local', 'yearday_local', 'hour_local',
              'totals_transactionRevenue']]
print('\nShape of all of our variables being used for the model (before dropping nans): ', df_model.shape)

#for linear regression drop NANs as they can't be interpreted in the regression model - check to make
#sure it isn't reducing size of data too much before proceeding
df_model.dropna(axis='index', how="any", inplace=True)
print('\nShape of all of our variables being used for the model (after dropping nans): ', df_model.shape)

#add a column to the df_model data of a simple classifier of "revenue" or "no_revenue" - will use this data point for:
#  1) evaluating the accuracy of the model later (for example, if our errors are low and R2 high, but it is incorrectly
#     assigning any revenue value to customers with no revenue, that'd be a problem)
#  2) in the train_test_split model we will use the stratify command to get equal train-test percentages for both revenue
#     and no revenue outcomes - I think this will be important since only about 1.3% of all rows actually resulted in 
#     revenue and not completely sure how randomly selecting will have equal test-train distributions without defining it
#     (this may be unnecessary, but better safe than sorry)
df_model['revenue_label'] = df_model.totals_transactionRevenue.map(lambda revenue_amount: 
                                                        'revenue' if revenue_amount > 0 else 'no_revenue')


#split out the data we are using for modeling to X and y values
X_model = df_model[['totals_pageviews', 'totals_hits', 'visitNumber', 'totals_newVisits', 'totals_bounces',
              'trafficSource_isTrueDirect', 'trafficSource_referralPath',
              'geoNetwork_continent', 'geoNetwork_country', 'geoNetwork_networkDomain',
              'weekday_local', 'month_local', 'yearday_local', 'hour_local']]

#reshape the y_model data so that the array has shape (902077, 1), i.e. 902077 rows and 1 column,
#rather than the 1-D shape (902077,)
y_model = df_model['totals_transactionRevenue'].values.reshape(-1, 1)

#put the stratify criteria (revenue_label) into its own array, make sure to reshape this as well
stratify_criteria_model = df_model['revenue_label'].values.reshape(-1, 1)

print('\nShape of X input variables is: ', X_model.shape, '\nShape of y output variable is: ', y_model.shape)



Shape of all of our variables being used for the model (before dropping nans):  (903652, 15)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Shape of all of our variables being used for the model (after dropping nans):  (902077, 15)

Shape of X input variables is:  (902077, 14) 
Shape of y output variable is:  (902077, 1)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
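
The SettingWithCopyWarning messages above appear because df_model is created by selecting columns from df and is then modified in place. A minimal sketch of one way to avoid the warning, assuming a small stand-in dataframe (`df_demo` and `df_small` are hypothetical names, and the values are made up), is to take an explicit copy before modifying it:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'totals_pageviews': [1.0, np.nan, 4.0],
                        'totals_transactionRevenue': [0.0, 0.0, 17_990_000.0]})

# Take an explicit copy of the selected columns so later edits clearly apply to df_small itself
df_small = df_demo[['totals_pageviews', 'totals_transactionRevenue']].copy()

# Reassign the result instead of using inplace=True on a selection
df_small = df_small.dropna(axis='index', how='any')

# Adding a column to an explicit copy does not trigger the warning
df_small['revenue_label'] = np.where(df_small['totals_transactionRevenue'] > 0,
                                     'revenue', 'no_revenue')
print(df_small)
```

The resulting values are the same as in the cell above; only the ambiguity about whether a view or a copy is being modified goes away.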


In [None]:
### ???NEED TO SCALE/TRANSFORM THE DATA???? ###

In [106]:
##### TRAIN-TEST-SPLIT #####

### SPLIT THE MODEL DATA ###
#split the model data (which is all of the Kaggle Training data) into the model's train/test subsets
#(have to do this since Kaggle competition has its own test data, but those actual values are not provided, so can't
#actually use that to test our models, just end up comparing our predictions on that test data with their actuals)
#use a 75-25 split to start with train-test
#also make sure to add stratify_criteria to make sure it is doing a 75-25 split on both website visits that led to 
#actual sales/revenue and those that did not
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_model, y_model,
                                                                            test_size=0.25,
                                                                            stratify=stratify_criteria_model)
#print sizes of the train/test data splits
print('Check Shapes of the train-test data splits.\n')
print('X_model_train: ', X_model_train.shape)
print('X_model_test: ', X_model_test.shape)
print('y_model_train: ', y_model_train.shape)
print('y_model_test: ', y_model_test.shape)


### VERIFY THE STRATIFY COMMAND WORKED ###
print('\n--------------------------------------------------------------------')
print('Check that the train-test data split worked along stratify criteria.\n')

# FIRST PRINT OUT TOTAL DF_MODEL PERCENTAGE OF DATA THAT HAS REVENUE #
print('The df_model data percentages of revenue and no_revenue are:')
#since this data from the df_model is in a series, we can just use pandas value counts
print(df_model['revenue_label'].value_counts(normalize=True))

# THEN CHECK THE TRAIN DATA #
#the train data is now an array, so can't use value_counts, have to use np commands
#filter the data to only the y_values that had revenue (rev>0): np.where(condition) returns the indices
#of the elements that satisfy the condition, which can then be used to index the array
#take the length of this filtered array to figure out how many y_values actually had revenue
y_model_train_revenue_count = len(y_model_train[np.where(y_model_train > 0)])
#calculate the percentage of the y_values that have revenue by taking the revenue count/total count in that dataset 
#note: this percentage should equal the overall percentage of your data that has revenue if the stratify command is working properly
y_model_train_revenue_percent = y_model_train_revenue_count/y_model_train.shape[0]
print('\nThe percentage of model_train data that has revenue is: ', y_model_train_revenue_percent)

# THEN CHECK THE TEST DATA #
#the test data is now an array, so can't use value_counts, have to use np commands
#filter the data to only the y_values that had revenue (rev>0): np.where(condition) returns the indices
#of the elements that satisfy the condition, which can then be used to index the array
#take the length of this filtered array to figure out how many y_values actually had revenue
y_model_test_revenue_count = len(y_model_test[np.where(y_model_test > 0)])
#calculate the percentage of the y_values that have revenue by taking the revenue count/total count in that dataset 
#note: this percentage should equal the overall percentage of your data that has revenue if the stratify command is working properly
y_model_test_revenue_percent = y_model_test_revenue_count/y_model_test.shape[0]
print('\nThe percentage of model_test data that has revenue is: ', y_model_test_revenue_percent)

Check Shapes of the train-test data splits.

X_model_train:  (676557, 14)
X_model_test:  (225520, 14)
y_model_train:  (676557, 1)
y_model_test:  (225520, 1)

--------------------------------------------------------------------
Check that the train-test data split worked along stratify criteria.

The df_model data percentages of revenue and no_revenue are:
no_revenue    0.987242
revenue       0.012758
Name: revenue_label, dtype: float64

The percentage of model_train data that has revenue is:  0.01275871803853925

The percentage of model_test data that has revenue is:  0.012757183398368215
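
The effect of the stratify argument is easier to see on a small synthetic example. The sketch below uses made-up data with 2% positive rows (`X_demo`, `y_demo`, and `labels_demo` are hypothetical). It also passes `random_state`, which the split above does not use; that is an addition here so the result is repeatable between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))

# 20 of 1000 rows (2%) have a positive target, loosely mimicking rare revenue rows
y_demo = np.zeros(1000)
y_demo[:20] = 1.0

# Class labels used only to tell train_test_split how to stratify
labels_demo = np.where(y_demo > 0, 'revenue', 'no_revenue')

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25,
                                          stratify=labels_demo,
                                          random_state=0)

# Both fractions should match the overall 2% rate
print((y_tr > 0).mean(), (y_te > 0).mean())
```

Without stratify, a random split of a dataset this large would usually still land close to the overall rate, but stratifying makes the match explicit.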


# Linear Regression Models v1
##### Without Scaling/Transforming, Limited Input Variables, Use of Label Encoding (instead of one-hot-encoding)

In [157]:
#function to define whether the outcome is revenue (when revenue_amt is greater than 0), 
#no_revenue (when revenue_amt is 0), and neg_revenue (when revenue_amt is less than 0)
def revenue_norevenue_negrevenue(revenue_amt):
    if revenue_amt > 0:
        return 'revenue'
    elif revenue_amt == 0:
        return 'no_revenue'
    elif revenue_amt < 0:
        return 'neg_revenue'

In [167]:
### CREATE A FUNCTION THAT EVALUATES THE REVENUE OR NO_REVENUE ACCURACY ###
#an important part of this model is to make sure that only people that actually performed a final transaction are 
#getting a revenue prediction
#so calculate the percentages of each outcome in a confusion matrix using sklearn.metrics.confusion_matrix:
# --True Positive: Revenue_actual & Revenue_predicted
# --True Negative: No_Revenue_actual & No_Revenue_predicted
# --False Negative: Revenue_actual & No_Revenue_predicted
# --False Positive: No_Revenue_actual & Revenue_predicted

#inputs of y_revenue_true, y_revenue_predicted are the actual revenue amount arrays,
#we will convert to labels
def evaluate_revenue_confusion_matrix(y_revenue_true, y_revenue_predicted):
    df_revenue_eval = pd.DataFrame(data={'revenue_actual': y_revenue_true.reshape(-1),
                       'revenue_prediction': y_revenue_predicted.reshape(-1)})
    
    df_revenue_eval['revenue_label_actual'] = df_revenue_eval.revenue_actual.map(revenue_norevenue_negrevenue)
    
    df_revenue_eval['revenue_label_prediction'] = df_revenue_eval.revenue_prediction.map(revenue_norevenue_negrevenue)
    
    print('Confusion Matrix of revenue/no_revenue/neg_revenue: \n')
    #use scikit-learn's confusion matrix (rows are actual labels, columns are predicted labels; the diagonal holds the correct predictions)
    print(confusion_matrix(df_revenue_eval.revenue_label_actual,
                 df_revenue_eval.revenue_label_prediction,
                 labels=['revenue', 'no_revenue', 'neg_revenue']))
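
To show how the resulting matrix is laid out, here is a minimal sketch on five made-up rows (`actual`, `predicted`, and `to_label` are hypothetical stand-ins for the arrays and helper used above). With `labels=['revenue', 'no_revenue', 'neg_revenue']`, rows are the actual labels and columns are the predicted labels, both in that order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

actual = np.array([0.0, 0.0, 0.0, 25.0, 40.0])
predicted = np.array([-3.0, 0.0, 5.0, 30.0, -1.0])

def to_label(amount):
    # Same three-way rule as revenue_norevenue_negrevenue
    if amount > 0:
        return 'revenue'
    if amount < 0:
        return 'neg_revenue'
    return 'no_revenue'

labels = ['revenue', 'no_revenue', 'neg_revenue']
cm = confusion_matrix([to_label(a) for a in actual],
                      [to_label(p) for p in predicted],
                      labels=labels)

# Row 0: actual revenue, row 1: actual no_revenue, row 2: actual neg_revenue
print(cm)
```

In this toy case the first row reads: of the two rows with actual revenue, one was predicted as revenue and one as negative revenue.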

## A. Linear Regression (Ordinary Least Squares) Model

In [124]:
from sklearn.linear_model import LinearRegression

In [125]:
### INSTANTIATE, FIT, and PREDICT THE MODEL ###

#instantiate the LinearRegression model
lin_reg_model = LinearRegression()

#fit the training X, y data using the model
lin_reg_model.fit(X_model_train, y_model_train)

#make predictions using the model
y_lin_reg_test_predictions = lin_reg_model.predict(X_model_test)

In [126]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE_lin_reg = mean_squared_error(y_model_test, y_lin_reg_test_predictions)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2_lin_reg = r2_score(y_model_test, y_lin_reg_test_predictions)

print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE_lin_reg, r2_lin_reg, np.sqrt(r2_lin_reg)))

Mean Squared Error: 1722713059096698.2 
R-squared: 0.04348373509403003 
Correlation: 0.2085275403730405
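
A note on the "Correlation" figure: it is computed as the square root of the test-set R-squared. On held-out data that is only an approximation of the Pearson correlation between actual and predicted values, and it becomes NaN if R-squared is negative. A minimal sketch of computing the correlation directly, on synthetic data (all names and values below are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 2))
# Weak linear signal plus a lot of noise
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=5.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# R-squared on the held-out data
r2_test = r2_score(y_te, pred)

# Pearson correlation between actual and predicted values, computed directly
corr_test, _ = pearsonr(y_te, pred)

print(r2_test, corr_test ** 2)
```

On held-out data the R-squared can never exceed the squared correlation, so the two printed numbers are usually close but not equal.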


In [168]:
evaluate_revenue_confusion_matrix(y_model_test, y_lin_reg_test_predictions)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2876      0      1]
 [ 83598      0 139045]
 [     0      0      0]]
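
In the matrix above the middle column is all zeros: the linear model outputs a continuous value and essentially never predicts exactly 0, so every prediction is labeled either 'revenue' or 'neg_revenue'. One possible post-processing step, which is not part of this notebook's pipeline and is shown only as an assumption-laden sketch, is to clip negative predictions to zero, since revenue cannot be negative. The `predictions` array below is made up.

```python
import numpy as np

# Hypothetical raw model outputs in the dataset's units (dollars * 10^6)
predictions = np.array([-2.5e6, 0.4e6, 3.1e7, -1.0])

# Replace every negative prediction with 0; positive predictions are unchanged
clipped = np.clip(predictions, 0, None)

print(clipped)
```

After clipping, the predictions that were previously counted as 'neg_revenue' would be counted as 'no_revenue' by the labeling function.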


## B. Linear Lasso Regression Model

In [127]:
from sklearn.linear_model import Lasso

In [129]:
### INSTANTIATE, FIT, and PREDICT THE MODEL ###

#instantiate the Lasso model
lasso_model = Lasso(alpha=.01)

#fit the training X, y data using the model
lasso_model.fit(X_model_train, y_model_train)

#make predictions using the model
y_lasso_test_predictions = lasso_model.predict(X_model_test)



In [130]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE_lasso = mean_squared_error(y_model_test, y_lasso_test_predictions)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2_lasso = r2_score(y_model_test, y_lasso_test_predictions)

print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE_lasso, r2_lasso, np.sqrt(r2_lasso)))

Mean Squared Error: 1722713062244528.8 
R-squared: 0.04348373334623401 
Correlation: 0.20852753618223663
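
The MSE values printed for these models are hard to read because revenue is stored as dollars multiplied by 10^6, so the MSE is in squared scaled units. As a rough worked example, taking the square root and dividing by 10^6 expresses the typical error in dollars. The variable names below are illustrative only.

```python
import numpy as np

# MSE copied from the output above, in units of (dollars * 10^6)^2
mse_reported = 1722713062244528.8

# Root mean squared error, still in dollars * 10^6
rmse_raw = np.sqrt(mse_reported)

# Undo the constant scale factor to express the error in dollars
rmse_dollars = rmse_raw / 1e6

print(round(rmse_dollars, 2))
```

This puts the root mean squared error at roughly 41.5 dollars per visit, on data where about 98.7% of the visits have zero revenue.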


In [169]:
evaluate_revenue_confusion_matrix(y_model_test, y_lasso_test_predictions)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2876      0      1]
 [ 83596      0 139047]
 [     0      0      0]]


## C. Linear Ridge Regression Model

In [131]:
from sklearn.linear_model import Ridge

In [132]:
### INSTANTIATE, FIT, and PREDICT THE MODEL ###

#instantiate the Ridge model
ridge_model = Ridge(alpha=.01)

#fit the training X, y data using the model
ridge_model.fit(X_model_train, y_model_train)

#make predictions using the model
y_ridge_test_predictions = ridge_model.predict(X_model_test)

In [133]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE_ridge = mean_squared_error(y_model_test, y_ridge_test_predictions)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2_ridge = r2_score(y_model_test, y_ridge_test_predictions)

print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE_ridge, r2_ridge, np.sqrt(r2_ridge)))

Mean Squared Error: 1722713059109972.0 
R-squared: 0.043483735086659925 
Correlation: 0.20852754035536872


In [170]:
evaluate_revenue_confusion_matrix(y_model_test, y_ridge_test_predictions)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2876      0      1]
 [ 83598      0 139045]
 [     0      0      0]]


## D. Linear Elastic Net Regression Model

In [134]:
from sklearn.linear_model import ElasticNet

In [135]:
### INSTANTIATE, FIT, and PREDICT THE MODEL ###

#instantiate the ElasticNet model
elastic_net_model = ElasticNet(alpha=.01)

#fit the training X, y data using the model
elastic_net_model.fit(X_model_train, y_model_train)

#make predictions using the model
y_elastic_net_test_predictions = elastic_net_model.predict(X_model_test)



In [136]:
### EVALUATE THE MODEL USING MSE, R2, CORREL ###

#MSE function syntax: mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
MSE_elastic_net = mean_squared_error(y_model_test, y_elastic_net_test_predictions)

#R2 function syntax: r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
r2_elastic_net = r2_score(y_model_test, y_elastic_net_test_predictions)

print("Mean Squared Error: {} \nR-squared: {} \nCorrelation: {}".format(MSE_elastic_net, r2_elastic_net, np.sqrt(r2_elastic_net)))

Mean Squared Error: 1722719863189738.5 
R-squared: 0.04347995720099862 
Correlation: 0.20851848167728113


In [171]:
evaluate_revenue_confusion_matrix(y_model_test, y_elastic_net_test_predictions)

Confusion Matrix of revenue/no_revenue/neg_revenue: 

[[  2876      0      1]
 [ 83089      0 139554]
 [     0      0      0]]
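
The four models give nearly identical MSE, R-squared, and confusion matrices. One plausible explanation (an interpretation, not something tested in this notebook) is that alpha=.01 is a very weak penalty for data with this many rows and a target on this scale, so the regularized fits barely differ from ordinary least squares. The sketch below illustrates the idea with Ridge on synthetic data; all names and values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))

# Target on a large scale, loosely mimicking revenue stored as dollars * 10^6
true_coef = np.array([3.0, -2.0, 5.0])
y_demo = 1e6 * (X_demo @ true_coef) + rng.normal(scale=1e6, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge_small = Ridge(alpha=0.01).fit(X_demo, y_demo)    # same alpha as used above
ridge_large = Ridge(alpha=1000.0).fit(X_demo, y_demo)  # much stronger penalty, for contrast

print(ols.coef_)
print(ridge_small.coef_)
print(ridge_large.coef_)
```

If this explanation holds, trying a range of larger alpha values (and scaling the inputs first) would be the natural next step before concluding that regularization does not help. The same reasoning should apply to Lasso and ElasticNet, although their penalties are scaled differently.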
