# Executive summary

This the purpose of this project is to determine what factors contribute most to sales for a national quick service restaurant and how we, as a digital marketing firm, can use those factors to more efficiently determine where we should focus a campaign. Specifically, we will examine what factors increase online sales as measured by conversions. Conversions are when an online sale is made through an advertizement. 

Many factors have been linked to sales, but the interaction of these factors has not been determined. The goal of this study is to determine the relationship between several of these factors in order to maximize return on ad spend. Weather is an important factor in the online sales of this company as there are much greater sales during winter months. There are also sales spikes around holidays, the end of the week, and big sporting events(mainly football). If we can show which combination of these factors predict high sales, we can use different thresholds of these factors to determine where we want to be spending our advertising dollar.  

Through looking at past data, we can determine what the most important predictors of higher than normal online sales might be. We will use an ensemble decision tree method called random forest in order to both predict values for online sales as well as classifying whether a day will be a low, medium low, medium high, or high conversion day. If we know that a day might be a high conversion day anyways, we could spend less money on those days, and focus resources on days where there is more potential to increase conversions.

We used 2014 and 2015 data to see if we could produce a good model of one year based on the last years data. If the 2014 is a good predictor of the conversions in 2015, we can apply this model to determine our targeted ad spend in 2016. The random forest models we have used are very good at classifying days as mentioned above and will provide a good basis for what type day we think we will see. 

We saw some inconsistencies in the 2015 data, which a had much lower conversion average (50% lower) and much lower maximum values. We therefore focused on using the 2014 data to build the strongest model we could. To do this we hid the actual conversion values for 2014 from the model and asked it to predict the number of conversions.  We were able to produce a model that predicted values that were very correlated to the known actual values for those days. However, there was a lot of variation in the predictions, meaning that there were sometimes predicted spikes when there were none in the actual data and vice versa. We decided to take another approach that would give us actionable data without worrying about incorrect spikes in conversions.

We broke the conversions up into high, high medium, low medium and low conversion days base on quartiles of the data. we then asked the model to again place the different days into these categories without knowing the class the actual conversions belonged in. This model was far more successful and produced actionable predictions. This model correctly predicts high sales days more than 90% of the time. It also exceeded at predicting the other categories of conversions. We can continue to use this model to more efficiently use our ad spend to maximize return. We can avoid spending too much on days we would have high conversions regardless and focus on boosting conversions on days where we would predict lower conversions. 



# Technical Review and Code

### EDA and data cleaning
The football games that are included in the analysis are all NFL games (including playoffs) and college football games that involve a top 25 ranked team. This data was queried and downloaded as a csv from sportsreference.com. Weather data was gathered through a query on the NOAA website, which included data from 3 weather stations covering the metro Atlanta area, giving a good overview of the weather trends. All data was gathered by date, so it will be easy to join dataframes and be confident in the data being concurrent.

In [24]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.cross_validation import KFold,cross_val_score,train_test_split,cross_val_predict
from sklearn.preprocessing import MinMaxScaler, label_binarize, LabelEncoder
from sklearn.metrics import r2_score,classification_report,roc_curve,auc,accuracy_score,precision_score,recall_score
from sklearn.grid_search import GridSearchCV
from sklearn.tree import export_graphviz
from sklearn.metrics import precision_recall_fscore_support as score
from scipy import interp
import scipy
import psycopg2
import seaborn as sns
import matplotlib.pyplot as plt
import sqlalchemy

%matplotlib inline

In [25]:
#importing nfl 2014 schedule and results data set, downloaded from sportsreference.com
nfl14 = pd.read_csv('../Assets/nfl2014schedule.csv') 
nfl14['Date'] = '2014 ' + nfl14['Date']

In [26]:
#importing nfl 2015 schedule and results data set, downloaded from sportsreference.com

nfl15 = pd.read_csv('../Assets/nfl2015schedule.csv') 
nfl15['Date'] = '2015 ' + nfl15['Date']

In [27]:
#joining the dataframes
nflSched = pd.concat([nfl14,nfl15])

In [28]:
#Dropping columns that won't help predict sales
nflSched.drop(['Week','Day','Unnamed: 4','Unnamed: 6','PtsW','PtsL','YdsW','TOW','YdsL','TOL'], axis=1, inplace=True)

In [29]:
#eliminating strings that were found in empty rows in the dataframe
nflSched = nflSched[nflSched['Date']!= '2014 Date']
nflSched = nflSched[nflSched['Date']!= '2014 Playoffs']
nflSched = nflSched[nflSched['Date']!= '2015 Date']
nflSched = nflSched[nflSched['Date']!= '2015 Playoffs']

In [30]:
#converting to datetime and setting as index
nflSched['Date'] = pd.to_datetime(nflSched['Date'], format='%Y %B %d')
nflSched.set_index('Date', inplace=True)

In [31]:
#adding a column so that we will have a column that indicates dates that have an NFL game
nflSched['NflGame'] = 1

In [32]:
#importing college football 2014 and 2015 schedule and results data set, downloaded from sportsreference.com
cfb14 = pd.read_csv('../Assets/cfb2014schedule.csv') 
cfb15 = pd.read_csv('../Assets/cfb2015schedule.csv')     

In [33]:
#combining the dataframes
cfbSched = pd.concat([cfb14,cfb15])

In [34]:
#dropping columns that I don't think will be good predictors
cfbSched.drop(['Rk','Wk','Time','Day','Pts','Unnamed: 7','Pts.1','Notes'], axis=1, inplace=True)

In [35]:
#removing columns that just contain the string 'Date'
cfbSched = cfbSched[cfbSched['Date']!= 'Date']

In [36]:
#converting to datetime and setting as index
cfbSched['Date'] = pd.to_datetime(cfbSched['Date'], format='%b %d %Y')
cfbSched.set_index('Date', inplace=True)

In [37]:
#adding a column so that we will have a column that indicates dates that have an cfb game
cfbSched['cfbGame'] = 1

In [38]:
#making better column names to see if network, or teams involved influence sales
cfbSched=cfbSched.rename(columns = {'Winner/Tie':'CfbWinner', 'Loser/Tie':'CfbLoser', 'TV':'CfbTV'})

In [39]:
#importing weather data from the noaa
Weather = pd.read_csv('../Assets/AtlWeather.csv') 

In [40]:
#converting to datetime
Weather['DATE'] = pd.to_datetime(Weather['DATE'], format='%Y%m%d')

In [41]:
#dropping columns we don't need, and eliminating null value place holders
Weather.drop([u'STATION',u'STATION_NAME',u'MDPR',u'DAPR',u'SNWD',u'TOBS',u'WESD'
             , u'WESF', u'WT01', u'WT06', u'WT02', u'WT04', u'WT08', u'WT03','PSUN','TAVG','SNOW','TSUN'], axis=1, inplace=True)

Weather.replace('-9999',np.nan, inplace=True)
Weather.replace('-9999.0',np.nan, inplace=True)

In [42]:
Weather.rename(columns = {'DATE':'Date'}, inplace=True)
Weather.set_index('Date', inplace=True)

In [43]:
#There are many weather stations across the Atlanta area, so we are taking the mean of these stations to estimate weather for
#the metro area
Weather = Weather.groupby(Weather.index).mean()

In [44]:
Final = pd.merge(Weather, cfbSched, left_index=True, right_index=True, how='outer')
Final = pd.merge(Final, nflSched, left_index=True, right_index=True, how='outer')

In [45]:
#slicing the dataframe to include the pro and college football seasons from 2014 to 2016 for a master table
Final = Final.ix['2014-8':'2016-03']

In [46]:
#importing client data for Atlanta area 
columns=['Date','PSA Sales','Orders','Online Sls %', 'Carryout %', 'Rewards Enrollment']
conv = pd.read_csv('../Assets/AtlantaSales2014_2016.csv', names=columns, header=0) 

IOError: File ../Assets/AtlantaSales2014_2016.csv does not exist

In [None]:
conv.info()

In [None]:
conv.head()

In [None]:
conv['Date'] = pd.to_datetime(conv['Date'], format='%Y-%m-%d')

In [None]:
conv.set_index('Date', inplace=True)

In [None]:
conv.head()

In [None]:
for i in conv.columns:
    if conv[i].dtype == object:
        conv[i] = conv[i].convert_objects(convert_numeric=True)

conv.info()

In [None]:
conv.rename(columns = {'PSA Sales':'psa_sales','Online Sls %':'Online%','Carryout %':'carryout%',
                         'Papa Rewards Customer Enrollments':'rewards'}, inplace=True)

In [None]:
conv['online_sales'] = conv['psa_sales'] * (conv['Online%']/100)

In [None]:
conv.info()

In [None]:
conv.head()

In [None]:
#merging the dataframe with the final dataframe
Final = pd.merge(Final, conv, left_index=True, right_index=True, how='outer')


In [None]:
#slicing the dataframe to include the pro and college football seasons from 2014 to 2015 so that we can look at the 
#seasons separately 
Final2 = Final.ix['2014-8':'2015-3']
Test = pd.DataFrame()
Test = Final.ix['2015-8':'2016-3']

In [None]:
#replacin NaN's with zeros so that we can identify whether there was or was not a football game 
Final2['cfbGame'].replace('NaN', 0.0,inplace=True)
Final2['NflGame'].replace('NaN', 0.0, inplace=True)

Test['cfbGame'].replace('NaN', 0.0,inplace=True)
Test['NflGame'].replace('NaN', 0.0, inplace=True)
#1=game on that day, 0=no game

In [None]:
Final2['Day']=Final2.index.weekday
Test['Day']=Test.index.weekday

In [None]:
# #establishing connection to SQL database, dumping final table into database
# user = "postgres:Lumberjack1"
# engine = sqlalchemy.create_engine('postgresql://{}{}'.format(user,'@localhost:5432/Rforest'))
# Final.to_sql("final", con = engine, if_exists="replace")


In [None]:
# EDA
# Final2.corr()


In [None]:
#Final2.drop('CfbWinner','CfbLoser','CfbTV','Winner/tie','Loser/tie')
Final2.head(50)

## Data Dictionary

|Column Name|Data|
|-----------|----|
|PRCP| The amount of precipitation that day in inches|
|TMAX| The maximum measured temperature for the day in degrees Farenheit|
|TMIN| The minimum measured temperature for the day in degrees Farenheit|
|CfbWinner| The team that won the college football game on that day|
|CfbLoser| The team that lost the college football game on that day|
|CfbTV| Television network a college football game was broadcast on|
|cfbGame| Whether there was a college football game on that day|
|Winner/tie| The team that won the NFL game on that day|
|Loser/tie| The team that lost the NFL game on that day|
|NflGame| Whether there was an NFL game on that day|
|psa_sales| per store average sales|
|Orders| transaction volume for Atlanta market|
|online%| percentage of per store average sales that were made online|
|carryout%| percentage of orders that were carryout|
|rewards| Number of rewards program members|
|online_sales| psa_sales multiplied by percentage of online sales|
|Day| day of the week represented by integers 0-6 with 0 being Monday|

In [None]:
plt.figure(1)
plt.subplot(1, 1, 1)
sns.regplot('online_sales','Day', Final2)

plt.figure(2)
sns.lmplot('online_sales','NflGame',Final2, logistic=True)

plt.figure(3)
sns.lmplot('online_sales','cfbGame', Final2, logistic=True)

plt.figure(4)
sns.lmplot('online_sales','TMAX', Final2)

plt.figure(5)
sns.lmplot(x='online_sales',y='PRCP',data=Final2)

In [None]:
# EDA

plt.figure(1)
plt.subplot(1, 1, 1)
sns.regplot(Final2['Orders'],Final2['PRCP'])

plt.figure(2)
plt.subplot(1, 1, 1)
sns.regplot(Final2['Orders'],Final2['TMAX'])

plt.figure(3)
plt.subplot(1, 1, 1)
sns.regplot(Final2['Orders'],Final2['cfbGame'])

plt.figure(4)
plt.subplot(1, 1, 1)
sns.regplot(Final2['Orders'],Final2['NflGame'])

plt.figure(5)
plt.subplot(1, 1, 1)
sns.regplot(Final2['Orders'],Final2['Day'])


In [None]:
Final2.info()

In [None]:
#Using category codes for columns with many categorical variables (winners/losers of games, tv network, etc)
for i in Final2.columns:
    if Final2[i].dtype == object:
        Final2[i] = Final2[i].astype('category')
        Final2[i] = Final2[i].cat.codes
    

In [None]:
for i in Test.columns:
    if Test[i].dtype == object:
        Test[i] = Test[i].astype('category')
        Test[i] = Test[i].cat.codes

In [None]:
Final2.dropna(inplace=True)
print Final2.shape

Test.dropna(inplace=True)
print Test.shape

In [None]:
#creating target and predictors for 2014 data to see if these are good predictors for conversions before I use 2014 data 
#to try to predict 2015 data
y=Final2['online_sales']
X=Final2.drop(Final2.ix[:,10:], axis=1)
X['Day']=Final2['Day']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=20)

In [None]:
#fitting a decision tree model using cross validation
cv = KFold(len(y_train), shuffle=False) 
print cv
dt = DecisionTreeRegressor(random_state=5)
dtScore = cross_val_score(dt, X_train, y_train, cv=cv,n_jobs=1)
print "Regular Decision Tree scores are:", dtScore
print "Regular Decision Tree average score is:", dtScore.mean()

In [None]:
#fitting a random forest model using cross validation and comparing it to previous model

rf = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=1, min_weight_fraction_leaf=0.0,
           n_estimators=200, n_jobs=-1, oob_score=False, random_state=5,
           verbose=0, warm_start=False)
rfScore = cross_val_score(rf, X_train,y_train, cv=cv, n_jobs=1)
print "Random Forest scores are:", rfScore
print "Regular Decision Tree scores are:", dtScore
print "Random Forest average score is:", rfScore.mean()
print "Regular Decision Tree average score is:", dtScore.mean()

In [None]:
# rfc = RandomForestRegressor(n_jobs=-1, max_features= 'sqrt' ,n_estimators=100) 

# param_grid = { 
#     'n_estimators': [100,200,300,400,500],
#     'max_features': [None, 'sqrt', 'log2'],
#     'min_samples_split':[1,2,3,4,5,6]
# }

# CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
# CV_rfc.fit(X_train, y_train)
# print CV_rfc.best_params_
# print CV_rfc.best_estimator_

In [None]:
#fitting a adaboost model using cross validation and comparing it to previous model

ab = AdaBoostRegressor(base_estimator=None, learning_rate=2.0, loss='linear',
         n_estimators=300, random_state=5)
abScore = cross_val_score(ab, X_train,y_train, cv=cv, n_jobs=1)
print "Adaptive Boost scores are :",abScore
print "Random Forest scores are:", rfScore
print "Regular Decision Tree scores are:", dtScore
print "Adaptive Boost average score is:",abScore.mean()
print "Random Forest average score is:", rfScore.mean()
print "Regular Decision Tree average score is:", dtScore.mean()

In [None]:
#Grid search for best parameters
# abc = AdaBoostRegressor() 

# param_grid2 = { 
#     'n_estimators': [50,100,150,200],
#     'learning_rate': [1.0,2.0,3.0,4.0]
# }

# CV_abc= GridSearchCV(estimator=abc, param_grid=param_grid2, cv= 5)
# CV_abc.fit(X_train, y_train)
# print CV_abc.best_params_
# print CV_abc.best_estimator_

In [None]:
#plotting cross-validated models
def do_plot(model, m=None):
    for fold, color in zip(cv, ['r','g','b']):#colors are from different folds from Kfold, so 3 diff models
        
        X_train = X.iloc[fold[0]]
        X_test =  X.iloc[fold[1]]
        y_train = y.iloc[fold[0]]
        y_test = y.iloc[fold[1]]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        plt.scatter(y_test, y_pred, color=color)
        plt.plot([0,3500],[0,3500])
        plt.text(2500,0, "R2:"+str(m), fontsize=20, )
        plt.xlabel('Actual Conversions')
        plt.ylabel('Predicted Conversions')

In [None]:
do_plot(dt, dtScore.mean().round(2))
plt.title("Regular Decision Trees")

In [None]:
do_plot(rf, rfScore.mean().round(2))
plt.title("Random Forest")

In [None]:
do_plot(ab, abScore.mean().round(2))
plt.title("Adaptive Boost")

In [None]:
rf.fit(X_train,y_train)

In [None]:
ab.fit(X_train, y_train)

In [None]:
#calculating feature importance for random forest
all(rf.feature_importances_ == np.mean([tree.feature_importances_ for tree in rf.estimators_], axis=0))

importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
feature_names = X_train.columns

In [None]:
plt.figure(figsize=(20,10))
plt.title("Random Forest Feature importances", fontsize = 30)
plt.bar(range(X_train.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), feature_names[indices], rotation=90, fontsize = 20)
plt.xlim([-1, X_train.shape[1]])
plt.yticks(fontsize=15)

In [None]:
#calculating feature importance for Adaboost
all(ab.feature_importances_ == np.mean([tree.feature_importances_ for tree in ab.estimators_], axis=0))

importances2 = ab.feature_importances_
std2 = np.std([tree.feature_importances_ for tree in ab.estimators_], axis=0)
indices2 = np.argsort(importances)[::-1]
feature_names2 = X_train.columns

In [None]:
plt.figure(figsize=(20,10))
plt.title("Adaboost feature importances", fontsize = 30)
plt.bar(range(X_train.shape[1]), importances2[indices],
       color="r", yerr=std2[indices], align="center")
plt.xticks(range(X_train.shape[1]), feature_names2[indices], rotation=90, fontsize = 20)
plt.xlim([-1, X_train.shape[1]])
plt.yticks(fontsize=15)

We can see in both the adaptive boost model and the randome forest model, weather and day of the week seem to be more important features for splitting the data and sporting events are less important. This could lead to better targeting of ads, where the conversions will happen regardless of whether there is a sports game on, but we can increase ads on days where the weather is not indicative of great sales in order to increase sales on those days. 

In [None]:
#random forest predictions
rfy_pred = rf.predict(X_test)
rfpredictions = pd.DataFrame()
rfpredictions['actual'] = y_test
rfpredictions['predict'] = rfy_pred

In [None]:
#test train split of model on 2014 data
sns.regplot(rfpredictions['actual'],rfpredictions['predict'])

In [None]:
r2_score(rfpredictions['actual'],rfpredictions['predict'])

In [None]:
rfpredictions.plot()
plt.ylabel('Conversions')

In [None]:
rfpredictions.describe()

We have used several different models to see which one performs the best on this split data. The random forest performed best and was applied to the 2014 data to produce predictions. The model has a good r2_score at 0.86, but we can see from the plot of the predicted vs. actual values that there is a lot of variation away from the known values, with some big predicted spikes where there are none in the real data. For this reason we will later perform a random forest classifier to try to bin the conversions and make for less variation in the predicted classes as a result.

In [None]:
# now fit model to all of 2014 data and predict 2015 data

In [None]:
Conv2015 = pd.DataFrame()
Conv2015['actual'] = Test['online_sales']
Features15 = Test.drop(Test.ix[:,10:], axis=1)
X2=Final2.drop(Final2.ix[:,10:], axis=1)
y2=Final2['online_sales']
X2['Day']=Final2['Day']
Features15['Day']=Test['Day']

In [None]:
#Fitting model to all of the 2014 data
rf.fit(X2,y2)

In [None]:
pred15 = rf.predict(Features15)
Conv2015['predicted'] = pred15

In [None]:
sns.regplot(Conv2015['actual'],Conv2015['predicted'])

In [None]:
r2_score(Conv2015['actual'],Conv2015['predicted'])

This regression model for 2015 data is not very good. The r2_score is around 0.5, which means that we have a positive correlation, but not a very strong one.  This is perhaps a situation where a classification model may be a better fit.

In [None]:
Conv2015.describe()

In [None]:
Conv2015.plot()
plt.ylabel('Conversions')
plt.ylim(0, 3500)


The Adapdtive boost and the regular decision tree regression models did not do as well as a random forest at predicting online sales in a given day after being trained on the split data for 2014 (70% train, 30% test). The r^2 for the Adaboost and the regular decision tree were greater then 0.10 less than the random forest, which did much better. 

The random forest regression model was good at predicting within year, but not very good at using results from one year to predict another. The r^2 for the 2014 only random forest regression model was 0.86, which is a good score, not great, but good enough to give some valuable insights. We therefore decided to proceed with just the random forest model. When the model was fit to the 2014 data and used to predict 2015 values the r^2 was only slightly above 0.5, making it not a great model to predict actual sales values.

If we decide to use a random forest classifier with 4 different categories of online sales, we may be able to make predictions for where to spend ad money based on a more general classification rather than a specific online sales value.


In [None]:
Final2['online_sales'].describe()

In [None]:
#breaking up conversions into quartiles for classification
def classConversions(cl):
    if cl > 2271: 
        return 3
    elif 1990 < cl <= 2271:
        return 2
    elif 1620 < cl <= 1990:
        return 1
    else:
        return 0

In [None]:
#adding classifications to the dataframe
Final2['Class'] = Final2['online_sales'].map(classConversions)


In [None]:
Test['Class'] = Test['online_sales'].map(classConversions)

### EDA for classification

In [None]:
sns.violinplot('Class','cfbGame', data=Final2)

In [None]:
sns.violinplot('Class','TMIN', data=Final2)

In [None]:
sns.violinplot('Class','PRCP', data=Final2)

In [None]:
sns.violinplot('Class','NflGame', data=Final2)

In [None]:
sns.violinplot('Class','TMAX', data=Final2)

In [None]:
sns.violinplot('Class','Day', data=Final2)

These violin plots show the data in a different manner and allow us to see what the features are like when we see most of our high online_sales days. These plots indicate that sometimes there is an increase in high sales days when it rains more than 0.5 inches in a day. Also we can see that almost all high sales days occur on Saturdays. High sales also occur more often when the minimum temperature is between 80-100 and 40-60 degrees. These plots help, by showing us some possible thresholds for predicting high sales. 

### Using classifier to predict 2014

In [None]:
y1=Final2['Class']
X1=Final2.drop(Final2.ix[:,10:], axis=1)
X1['Day']=Final2['Day']

In [None]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.30, random_state=20)

In [None]:
# #Grid search cv for parameters
# rfc1 = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=100, oob_score = True) 

# param_grid1 = { 
#     'n_estimators': [100,200,300,400,500],
#     'criterion': ["gini"],
#     'max_features': [None, 'sqrt', 'log2'],
#     'min_samples_split':[1,2,3,4,5,6]
# }

# CV_rfc1 = GridSearchCV(estimator=rfc1, param_grid=param_grid1, cv= 5)
# CV_rfc1.fit(X_train1, y_train1)
# print CV_rfc1.best_params_
# print CV_rfc1.best_estimator_

In [None]:
cv_class = KFold(len(y_train1), shuffle=False) 
print cv_class
rfclass = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=4,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=True, random_state=5, verbose=0, warm_start=False)
rfclassScore = cross_val_score(rfclass, X_train1, y_train1, cv=cv_class,n_jobs=1)
print "Random forest classifier scores are:", rfclassScore
print "Regular forest classifier average score is:", rfclassScore.mean()

In [None]:
rfclass.fit(X_train1, y_train1)

In [None]:
all(rfclass.feature_importances_ == np.mean([tree.feature_importances_ for tree in rfclass.estimators_], axis=0))

importancesClass = rfclass.feature_importances_
stdClass = np.std([tree.feature_importances_ for tree in rfclass.estimators_], axis=0)
indicesClass = np.argsort(importancesClass)[::-1]
feature_namesClass = X_train1.columns

In [None]:
plt.figure(figsize=(20,10))
plt.title("Random Forest Classifier Feature importances", fontsize = 30)
plt.bar(range(X_train1.shape[1]), importancesClass[indicesClass],
       color="r", yerr=stdClass[indicesClass], align="center")
plt.xticks(range(X_train1.shape[1]), feature_names[indicesClass], rotation=90, fontsize = 20)
plt.xlim([-1, X_train1.shape[1]])
plt.yticks(fontsize=15)

We can see that precipitation is a more important feature in the classifier model vs. the regression model. It is also interesting that who the winner or loser of a football game is matters more than only whether there is a game or not. Weather and day of the week seem to be the mest predictors of whether a particular day will be a high sales day. These could be useful metrics to use to predict conversions of future data. 

In [None]:
Class_pred = rfclass.predict(X_test1)
Class_predict = pd.DataFrame()
Class_predict['actual'] = y_test1
Class_predict['predict'] = Class_pred
Class_probs = rfclass.predict_proba(X_test)
Class_predict['ProbBottom'],Class_predict['ProbMidLow'],Class_predict['ProbMidHi'],Class_predict['ProbHi'] = zip(*Class_probs)

In [None]:
Class_predict.head()

In [None]:
conf_mat = pd.crosstab(Class_predict['actual'], Class_predict['predict'], rownames=['actual'])
conf_mat

In [None]:
precision, recall, fscore, support = score(y_test1, Class_pred)

Scores=pd.DataFrame()
Scores['Class'] = ['Low','LowMed','HiMed','Hi']
Scores['precision'] = precision
Scores['recall'] = recall
Scores['fscore'] = fscore
Scores['support'] = support
Scores.head()

In [None]:
classes=[0,1,2,3]
y_testBi = label_binarize(y_test1, classes)
y_predBi = label_binarize(Class_pred, classes)

In [None]:
print Class_pred

In [None]:
n_classes=len(classes)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_testBi[:,i], y_predBi[:,i])
    roc_auc[i] = auc(fpr[i], tpr[i])

In [None]:
# Plot all ROC curves
plt.figure()


for i in range(n_classes):
    plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})'
                                   ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for each class')
plt.legend(loc="lower right")
plt.show()


In [None]:

Class15 = Test.drop(Test.ix[:,10:], axis=1)
X3=Final2.drop(Final2.ix[:,10:], axis=1)
y3=Final2['Class']
X3['Day']=Final2['Day']
Class15['Day']=Test['Day']

In [None]:
rfclass.fit(X3,y3)

In [None]:
Class15_pred = rfclass.predict(Class15)
Class15_predict = pd.DataFrame()
Class15_predict['actual'] = Test['Class']
Class15_predict['predict'] = Class15_pred
Class15_probs = rfclass.predict_proba(Class15)
Class15_predict['ProbBottom'],Class15_predict['ProbMidLow'],Class15_predict['ProbMidHi'],Class15_predict['ProbHi'] = zip(*Class15_probs)

In [None]:
Class15_predict.head()

In [None]:
conf_mat = pd.crosstab(Class15_predict['actual'], Class15_predict['predict'], rownames=['actual'])
conf_mat

In [None]:
precision15, recall15, fscore15, support15 = score(Class15_predict['actual'], Class15_pred)

Scores15=pd.DataFrame()
Scores15['Class'] = ['Low','LowMed','HiMed','Hi']
Scores15['precision'] = precision15
Scores15['recall'] = recall15
Scores15['fscore'] = fscore15
Scores15['support'] = support15
Scores15.head()

In [None]:
#use threshold to try to improve model, try regular decision tree

In [None]:
#write summary of this crappy model

In [None]:
y_testBi1 = label_binarize(Test['Class'], classes)
y_predBi1 = label_binarize(Class15_pred, classes)

In [None]:
n_classes=y_testBi1.shape[1]
fpr1 = dict()
tpr1 = dict()
roc_auc1 = dict()
for i in range(n_classes):
    fpr1[i], tpr1[i], _ = roc_curve(y_testBi1[:,i], y_predBi1[:,i])
    roc_auc1[i] = auc(fpr1[i], tpr1[i])

In [None]:
plt.figure()


for i in range(n_classes):
    plt.plot(fpr1[i], tpr1[i], label='ROC curve of class {0} (area = {1:0.2f})'
                                   ''.format(i, roc_auc1[i]))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for each class')
plt.legend(loc="lower right")
plt.show()


This model is not good for predicting classes in the upper two quartiles of sales. However, almost all of the mis-classified points for the class 2 and 3 predictions were in the upper two quartiles. Perhaps we could use a more simple classification method and just classify days as high or low selling days.  

In [None]:
def binaryConversions(cl):
    if cl > 1990: 
        return 1
    else:
        return 0

In [None]:
Test['BiClass'] = Test['online_sales'].map(binaryConversions)

In [None]:
Final2['BiClass'] = Final2['online_sales'].map(binaryConversions)

In [None]:
Class15Bi = Test.drop(Test.ix[:,10:], axis=1)
X4=Final2.drop(Final2.ix[:,10:], axis=1)
y4=Final2['BiClass']
X4['Day']=Final2['Day']
Class15Bi['Day']=Test['Day']

In [None]:
cv_classBi = KFold(len(y4), shuffle=False) 
print cv_classBi
rfclassBi = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=4,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=True, random_state=5, verbose=0, warm_start=False)
rfclassBiScore = cross_val_score(rfclassBi, X4, y4, cv=cv_class,n_jobs=1)
print "Random forest classifier scores are:", rfclassBiScore
print "Regular forest classifier average score is:", rfclassBiScore.mean()

In [None]:
rfclassBi.fit(X4,y4)

In [None]:
Class15Bi_pred = rfclassBi.predict(Class15Bi)
Class15Bi_predict = pd.DataFrame()
Class15Bi_predict['actual'] = Test['BiClass']
Class15Bi_predict['predict'] = Class15Bi_pred
Class15Bi_probs = rfclassBi.predict_proba(Class15Bi)

In [None]:
conf_mat = pd.crosstab(Class15Bi_predict['actual'], Class15Bi_predict['predict'], rownames=['actual'])
conf_mat

In [None]:
Final2.to_csv('../Assets/2014data.csv')

In [None]:
Test.to_csv('../Assets/2015data.csv')

In [None]:
frames = [Final2, Test]
Tableau = pd.concat(frames)

In [None]:
Tableau.to_csv('../Assets/TableauData.csv')

### Decision Tree Binary Classifier

In [None]:
#Grid search cv for parameters
DTBinary = DecisionTreeClassifier(max_features= 'sqrt' ) 

param_grid1 = { 
    'criterion': ["gini"],
    'max_features': [None, 'sqrt', 'log2'],
    'min_samples_split':[1,2,3,4,5,6]
}

CV_DT = GridSearchCV(estimator=DTBinary, param_grid=param_grid1, cv= 5)
CV_DT.fit(X4, y4)
print CV_DT.best_params_
print CV_DT.best_estimator_

In [None]:
cv_classDT = KFold(len(y4), shuffle=False) 
print cv_classDT
DTBinary = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features='log2', max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=1, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
DTBinaryScore = cross_val_score(DTBinary, X4, y4, cv=cv_class,)
print "Decision Tree classifier scores are:", DTBinaryScore
print "Decision Tree classifier average score is:", DTBinaryScore.mean()

In [None]:
DTBinary.fit(X4,y4)

In [None]:
DTBinary_pred = rfclassBi.predict(Class15Bi)
DTBinary_predict = pd.DataFrame()
DTBinary_predict['actual'] = Test['BiClass']
DTBinary_predict['predict'] = DTBinary_pred
DTBinary_probs = rfclassBi.predict_proba(Class15Bi)

In [None]:
conf_mat = pd.crosstab(DTBinary_predict['actual'], DTBinary_predict['predict'], rownames=['actual'])
conf_mat

The Decision tree model is just as good as the random forest at predicting the 

In [None]:
Bifpr = dict()
Bitpr = dict()
Biroc_auc = dict()
Bifpr, Bitpr, _ = roc_curve(DTBinary_predict.actual, DTBinary_predict.predict)
Biroc_auc = auc(Bifpr, Bitpr)

# Plot of a ROC curve 
plt.figure(figsize=(20,20))
plt.plot(Bifpr, Bitpr, label='Rf AUC = %0.2f' % Biroc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate',fontsize=40)
plt.ylabel('True Positive Rate', fontsize=40)
plt.title('Receiver operating characteristic', fontsize=40)
plt.legend(loc="lower right", fontsize=40)
plt.yticks(fontsize=20)
plt.xticks(fontsize=20)

plt.show()

In [None]:
export_graphviz(DTBinary.fit(X4,y4), out_file='tree.dot', feature_names=X4.columns)                

In [None]:
#converting exported dot file to png file using bash
! dot -Tpng tree.dot -o tree.png

In [None]:
from IPython.display import Image
Image(filename='tree.png')

### Recommendations and limitations

We can use this model to predict whether a day will be a high or low sales day. We can then use this to decide where to spend our advertising budget. We can even look through the actual decision tree to make decisions based on the categorization of the features that we know.

One major limitation is that we did not treat this as time series data. We can also play with thresholds for deciding whether or not a day will be a high or low sales day.

C5 Score: | 26/27
------------|-----------
Identify: Review executive summary, audience, goals, criteria	|	3		
Acquire: Review data selection & acquisition process			|	2
Parse: Review data descriptions, outliers, risks, assumptions	|	3		
Mine: Review statistical analysis				|2
Refine: Review visual analysis				|2
Model: Review model and performance			|2	
Present: Tell/sell the story to a non-tech audience |  3
Present: Discuss findings and limitations	|			3
Present: Create targeted recommendations and next steps		|	3	
Bonus: Deploy: Address how to (re)train model over time  | 2
Bonus: Create an interactive demo of your data|  1