# LASSO Regression to identify the census variables that best predict sales
## By Nicole Haberer
## Created for APRD6342

This notebook models sales data with LASSO regression, which helps account for collinearity among a large set of predictors.

We are trying to determine which census variables are the greatest predictors of sales for Bobo bars.

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

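# silence all warnings by replacing warnings.warn with a no-op,
# to keep the notebook output readable
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn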

In [15]:
#import csv file with Purchases and all predictors
alldata = pd.read_csv("finalmaster-ratios.csv")

In [16]:
#create a list of all variable names (i.e., column titles)
allvariablenames = list(alldata.columns.values)

#limit my variables list to just predictors we are concerned with
allvariablenames2 = allvariablenames[8:]

#load predictors into dataframe
predictors = alldata.loc[:,'B01001008':'B19001017']  

#load target into dataframe
target = alldata['# Purchases'] 

In [17]:
# split data into train and test sets, with 30% retained for test
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)                 
                 
# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False)
model.fit(pred_train, tar_train)

# build coefficient chart: one row per predictor label with its fitted coefficient
predictors_model=pd.DataFrame(allvariablenames2)
predictors_model.columns = ['label']
predictors_model['coeff'] = model.coef_

# print only the predictors with a positive coefficient
for index, row in predictors_model.iterrows():
    if row['coeff'] > 0:
        print(row.values)

['B01001036' 2.7861365955132507]
['B01001037' 0.9200572652790069]
['B01001038' 0.9459340522644333]
['B02001005' 0.39156809216155525]
['B13014026' 0.22056164158451835]
['B13014027' 0.05049787197081092]
['B19001017' 1.6062678580473928]


This code runs a LASSO regression model against all of our identified predictors, using 10-fold cross-validation to choose the penalty strength.

We fit the model using pred_train and tar_train as training data. The coefficient chart then pairs each predictor with its fitted coefficient, which tells us how much the predicted target (Purchases) changes when that variable increases by one unit and the others are held fixed.

The list is filtered to positive coefficients because we are mainly concerned with what increases our target. LASSO shrinks the coefficients of unhelpful predictors to exactly zero, so a non-zero coefficient means the model kept that variable; it is not a test of statistical significance, and a coefficient is not the same thing as a correlation.
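
The coefficient chart can also be built as a labeled pandas Series. The following is a minimal sketch on synthetic data from `make_regression`, not the Bobo bars file, and the `var_0`, `var_1` style column labels are hypothetical. Unlike the loop above, it also keeps negative coefficients, since LASSO can retain predictors that lower the target.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV

# Synthetic stand-in for the census predictors (hypothetical data)
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=123)
columns = [f"var_{i}" for i in range(X.shape[1])]  # hypothetical labels
X = pd.DataFrame(X, columns=columns)

model = LassoLarsCV(cv=10, precompute=False).fit(X, y)

# Pair each fitted coefficient with its column label
coefs = pd.Series(model.coef_, index=X.columns)

# Keep every predictor the LASSO did not shrink to zero, positive or negative
retained = coefs[coefs != 0]

# Order the retained predictors by absolute coefficient size, largest first
retained = retained.reindex(retained.abs().sort_values(ascending=False).index)
print(retained)
```

Taking the labels from the same DataFrame that was passed to `fit` keeps the labels and coefficients aligned without maintaining a separate list of column names.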

#### Breakdown of Positive Coefficients

Age Identifiers        
B01001036 = 30 to 34 Years
B01001037 = 35 to 39 Years
B01001038 = 40 to 44 Years
        
Race Identifiers        
B02001005 = Asian Alone
        
Women 15 to 50 Years Who Had a Birth in the Past 12 Months
B13014026 =  Unmarried (Never Married, Widowed and Divorced) and have a Bachelor's Degree
B13014027 = Unmarried (Never Married, Widowed and Divorced) and have a Graduate or Professional Degree

Household Income
B19001017 = Households: $200,000 or More 

What does this mean?
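The largest positive coefficients belong to the 30 to 44 age brackets and to households with income of $200,000 or more, so these are the census variables most strongly associated with higher sales in this model.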
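
Comparing coefficient sizes across variables is only fair when the predictors share a scale. The file name suggests the predictors are ratios, so this may already hold. If it does not, one common option is to standardize the predictors before fitting. The sketch below uses hypothetical data and column names (`share_a`, `count_b`) and fits the scaler on the training split only, so nothing about the test set leaks into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)

# Hypothetical predictors on very different scales
predictors = pd.DataFrame({
    "share_a": rng.uniform(0, 1, 100),
    "count_b": rng.uniform(0, 5000, 100),
})
target = pd.Series(rng.normal(100, 20, 100))

pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, target, test_size=0.3, random_state=123)

# Learn the mean and standard deviation from the training predictors only
scaler = StandardScaler().fit(pred_train)

# Apply the same transformation to both splits, keeping labels and row index
pred_train_std = pd.DataFrame(scaler.transform(pred_train),
                              columns=predictors.columns, index=pred_train.index)
pred_test_std = pd.DataFrame(scaler.transform(pred_test),
                             columns=predictors.columns, index=pred_test.index)

print(pred_train_std.mean())
print(pred_train_std.std(ddof=0))
```

After standardizing, each coefficient describes the change in the target per one standard deviation of its predictor, which makes the sizes directly comparable.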


In [18]:
#mean squared error for the training set
train_error = mean_squared_error(tar_train, model.predict(pred_train))
print('training data MSE')
print(train_error)

training data MSE
22528.486826258624


In [19]:
#mean squared error for the test set
train_test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('test data MSE')
print(train_test_error)

test data MSE
41578.280293705764


Are the training and test set mean squared errors similar? What does that mean practically? 

No, the training and test set mean squared errors are different. The test MSE is nearly twice the training MSE, meaning the model fits the data it was trained on better than data it has not seen. This is possibly due to overfitting to the training set.
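
MSE is in squared units, so it can help to take the square root and read the error in the same units as the target. This short sketch only reuses the two MSE values printed above; the variable names are illustrative.

```python
import math

# MSE values printed by the two cells above
train_mse = 22528.486826258624
test_mse = 41578.280293705764

# Root mean squared error is in the same units as '# Purchases'
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

# How many times larger the test error is than the training error
ratio = test_mse / train_mse

print(f"training RMSE: {train_rmse:.1f}")
print(f"test RMSE:     {test_rmse:.1f}")
print(f"test MSE / training MSE: {ratio:.2f}")
```

The typical prediction error works out to roughly 150 purchases on the training set and roughly 204 on the test set.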

In [20]:
#r squared for training set
rsquared_train=model.score(pred_train,tar_train)
print('training data R-square')
print(rsquared_train)

training data R-square
0.22266652028942102


In [21]:
#r squared for test set
rsquared_test=model.score(pred_test,tar_test)
print('test data R-square')
print(rsquared_test)

test data R-square
0.17529294561525344


Compare the two R Squared values
The R squared value for the training set (about 0.22) is slightly higher than for the test set (about 0.18), indicating that the model is slightly better at explaining the variability in the data it was trained on than in new data.

How well does Census data, overall, predict sales?
Based on the R squared value for our test data (about 0.18), census data alone isn't very good at predicting sales: the model explains only about 18% of the variation in purchases it has not seen. Our goal should be an R squared value of 0.5 or higher to indicate a reliable model, and our current results are a long way from that goal.
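
R squared and MSE are linked: R squared equals 1 - MSE / Var(y), where Var(y) is the variance of the observed target in the set being scored. The sketch below first checks that identity on hypothetical data, then uses it to back out the target variance implied by the scores reported in this notebook. It assumes the reported scores come from the same predictions as the reported MSE values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Part 1: check the identity on hypothetical data
rng = np.random.default_rng(123)
y_true = rng.normal(loc=100.0, scale=50.0, size=500)  # hypothetical sales
y_pred = y_true + rng.normal(scale=45.0, size=500)    # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)
variance = np.var(y_true)  # population variance (ddof=0) of the observed values
r2_manual = 1 - mse / variance
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)

# Part 2: rearrange to Var(y) = MSE / (1 - R^2) using the values printed above
implied_train_var = 22528.486826258624 / (1 - 0.22266652028942102)
implied_test_var = 41578.280293705764 / (1 - 0.17529294561525344)
print(implied_train_var, implied_test_var)
```

If the assumption holds, purchases in the test set are considerably more spread out than in the training set, which would explain why the MSE gap is large while the R squared gap is small.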
 

In [22]:
#let's see what our y-intercept is, so we can interpret what our baseline sales number looks like
print("y intercept:")
print(model.intercept_)

y intercept:
2.758738710322305


What is our baseline sales number? What does that mean, practically?
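Our baseline sales number (y-intercept) is about 2.76. This means that when every predictor is zero, the model predicts about 2.76 purchases for an observation. It is unlikely that every predictor is zero in practice, so this is best read as the model's starting point rather than a realistic sales figure.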
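
To illustrate why the intercept is the "all predictors at zero" prediction, here is a minimal sketch on synthetic data from `make_regression` (not the sales data). It fits the same kind of model and then predicts for a single row of zeros.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV

# Synthetic data with a known offset (bias) added to the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, bias=2.5, random_state=123)

model = LassoLarsCV(cv=10, precompute=False).fit(X, y)

# One observation where every predictor equals zero
zero_row = np.zeros((1, X.shape[1]))
baseline = model.predict(zero_row)[0]

# The prediction for the all-zero row matches the fitted intercept
print(baseline, model.intercept_)
```

A linear model predicts the intercept plus the sum of each coefficient times its predictor, so every term except the intercept vanishes when the predictors are zero.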