# Using LASSO Regression on Sales Data to Predict Purchases
## Kenneth R. Miller
#### With sales data for Bobo Bars (a health-food oat bar), I used LASSO regression to identify the demographic groups most associated with Bobo Bar purchases. The data (posted in the GitHub repository) contains the number of purchases along with demographic information from census variables.

#### Importing the data and cleaning it.

In [42]:
import pandas as pd

# The deprecation and convergence warnings are annoying, so this ignores them
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error

# Data can be found in the github repository
lassodf = pd.read_csv("finalmaster-ratios.csv")
# Creating a list of predictors
varnames = list(lassodf.columns.values)

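# The first columns hold the outcome variable and variables that repeat others,
# so we need to exclude them. Note that the slice below drops 8 names in total;
# if the outcome plus 8 repeated columns (9 names) must go, it should be [0:9].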
del varnames[0:8]

# Assigning predictors and target
predictors = lassodf[varnames]
target = lassodf['# Purchases']

#### Creating the LASSO model.

In [43]:
# Predict train, predict test, target train, target test, split 70/30 train/test
pred_train, pred_test, tar_train, tar_test = train_test_split(
        predictors, target, test_size = .3, random_state = 123
        )

# Defining the model as a LASSO model with 10-fold cross validation.
# The cross validation selects the penalty strength (alpha); it does not affect the test/train split.
model = LassoLarsCV(fit_intercept = True, cv = 10, precompute = False)
# Note: this fits on all rows, including the rows later used as the test set
model.fit(predictors, target)

# Creating a data frame of the predictors 
predictors_model = pd.DataFrame(varnames)
# Creating a column label for the predictors called 'label'
predictors_model.columns = ['label']
# Assigning the coefficients from the model to each predictor
predictors_model['coeff'] = model.coef_
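
In scikit-learn, the `cv` argument of `LassoLarsCV` controls the cross-validation used to pick the penalty `alpha`, and the cell above fits the model on every row. As a minimal sketch of the alternative, the block below fits on the training split only so the test split stays unseen. It uses synthetic data from `make_regression` as a stand-in for the census file, and the names `sketch_model`, `chosen_alpha`, `n_kept`, and `test_r2` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census predictors and purchase counts (assumption).
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=123)

# Same 70/30 split and random_state as the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# The 10 folds are drawn from the training rows only. They are used to choose
# the penalty strength alpha, not to shuffle the train/test split.
sketch_model = LassoLarsCV(fit_intercept=True, cv=10, precompute=False)
sketch_model.fit(X_train, y_train)

chosen_alpha = sketch_model.alpha_              # penalty selected by cross-validation
n_kept = int(np.sum(sketch_model.coef_ != 0))   # predictors LASSO did not zero out
test_r2 = sketch_model.score(X_test, y_test)    # R-squared on rows the model never saw

print("alpha:", chosen_alpha)
print("non-zero coefficients:", n_kept)
print("held-out R-squared:", test_r2)
```

Fitting this way would change the coefficients reported in this notebook, so it is shown only as a pattern for a cleaner held-out evaluation.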

#### Using a for loop to print the coefficients of the predictors that actually matter.

In [44]:
# Creating an empty list to hold the coefficients and their labels
coeflist = []
# Iterating over each row in the predictor model dataframe
# (iterrows returns both the index and the row)
for index, row in predictors_model.iterrows():
    # If the regression coefficient in the row is greater than zero, print it out
    # and store it (any negative coefficients would be skipped by this check)
    if row['coeff'] > 0:
        print(row.values)
        coeflist.append(row.values[0]) 
        coeflist.append(row.values[1])
# This prints out the census code and the coefficient. I've made this more readable below.
# Coefficients with actual names
coeffs = ['Male 40 to 44 Years', 0.7739123853299034,
'Female 30 to 34 Years', 1.9784044386558475,
'Female 35 to 39 Years', 2.0572434376705573,
'Female 40 to 44 Years', 1.8768425232001016,
'Female 80 to 84 Years', 0.13689258036664992,
'Asian', 1.0964526416628348,
'Other Race', 0.0033429478657402553,
'Unmarried women with Bachelor Degree', 0.6980574814546774,
'Unmarried women with Graduate/Professional Degree', 1.4981329346451273,
'Women 40 to 44 Years birth past year', 1.0946731295462726,
'Female 25+ with less than high school', 4.1881957997050305,
'Households making $200k+', 1.8635089505684308]   
print('\n')
for item in range(len(coeffs)):
    if item % 2 == 0:
        if len(coeffs[item]) < 15:
            print(coeffs[item] + ":\t\t\t " + str(coeffs[item+1]))
        else:
            print(coeffs[item] + ": " + str(coeffs[item+1]))

    
    

['B01001014' 0.7739123853299034]
['B01001036' 1.9784044386558475]
['B01001037' 2.0572434376705573]
['B01001038' 1.8768425232001016]
['B01001048' 0.13689258036664992]
['B02001005' 1.0964526416628348]
['B02001007' 0.0033429478657402553]
['B13014026' 0.6980574814546774]
['B13014027' 1.4981329346451273]
['B13016008' 1.0946731295462726]
['B15002027' 4.1881957997050305]
['B19001017' 1.8635089505684308]


Male 40 to 44 Years: 0.7739123853299034
Female 30 to 34 Years: 1.9784044386558475
Female 35 to 39 Years: 2.0572434376705573
Female 40 to 44 Years: 1.8768425232001016
Female 80 to 84 Years: 0.13689258036664992
Asian:			 1.0964526416628348
Other Race:			 0.0033429478657402553
Unmarried women with Bachelor Degree: 0.6980574814546774
Unmarried women with Graduate/Professional Degree: 1.4981329346451273
Women 40 to 44 Years birth past year: 1.0946731295462726
Female 25+ with less than high school: 4.1881957997050305
Households making $200k+: 1.8635089505684308
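
The census codes and readable names above were paired by hand. A rough sketch of one alternative is to keep a code-to-name dictionary and filter the coefficient table with a boolean mask instead of `iterrows`. The frame `demo`, the dictionary `census_labels`, and the code `B00000000` are hypothetical stand-ins. Only three of the pairs printed above are reused.

```python
import pandas as pd

# Hypothetical lookup built by hand from the printed output (census code -> name).
census_labels = {
    "B01001014": "Male 40 to 44 Years",
    "B01001036": "Female 30 to 34 Years",
    "B19001017": "Households making $200k+",
}

# Small stand-in for predictors_model. "B00000000" is a made-up code whose
# coefficient was shrunk to exactly zero.
demo = pd.DataFrame({
    "label": ["B01001014", "B01001036", "B00000000", "B19001017"],
    "coeff": [0.7739123853299034, 1.9784044386558475, 0.0, 1.8635089505684308],
})

# Boolean mask instead of a row loop. Using != 0 would also keep negative
# coefficients, which a > 0 check drops.
kept = demo[demo["coeff"] != 0].copy()
kept["name"] = kept["label"].map(census_labels)

for name, coeff in zip(kept["name"], kept["coeff"]):
    print(f"{name}: {coeff}")
```

A dictionary keeps each name attached to its code, so the names cannot drift out of order if the model is refit and a different set of coefficients survives.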


#### Printing the output in full sentences for fun.

In [45]:
# Fun little for loop that eases the interpretation of each coefficient.
print('Coefficient Interpretations:\n')
a = 1
for i in range(len(coeffs)):
    if i%2 == 0:
        num = round(coeffs[i+1], 2)
        print(str(a) + ') For one more ' + coeffs[i] +', we sell ' + str(num) + 
              ' more Bobo Bars all else equal.')
        a = a + 1

Coefficient Interpretations:

1) For one more Male 40 to 44 Years, we sell 0.77 more Bobo Bars all else equal.
2) For one more Female 30 to 34 Years, we sell 1.98 more Bobo Bars all else equal.
3) For one more Female 35 to 39 Years, we sell 2.06 more Bobo Bars all else equal.
4) For one more Female 40 to 44 Years, we sell 1.88 more Bobo Bars all else equal.
5) For one more Female 80 to 84 Years, we sell 0.14 more Bobo Bars all else equal.
6) For one more Asian, we sell 1.1 more Bobo Bars all else equal.
7) For one more Other Race, we sell 0.0 more Bobo Bars all else equal.
8) For one more Unmarried women with Bachelor Degree, we sell 0.7 more Bobo Bars all else equal.
9) For one more Unmarried women with Graduate/Professional Degree, we sell 1.5 more Bobo Bars all else equal.
10) For one more Women 40 to 44 Years birth past year, we sell 1.09 more Bobo Bars all else equal.
11) For one more Female 25+ with less than high school, we sell 4.19 more Bobo Bars all else equal.
12) For one more Households making $200k+, we sell 1.86 more Bobo Bars all else equal.
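
The same sentences can be built without index arithmetic on a flat list. The sketch below assumes the coefficients are stored as (name, value) tuples, which is a different layout from the `coeffs` list used above, and reuses two of the printed values. Formatting with `:.2f` always shows two decimals, so 1.1 would print as 1.10.

```python
# Two (name, coefficient) pairs copied from the output above; the tuple layout
# is an assumption, not the structure used in the notebook.
pairs = [
    ("Male 40 to 44 Years", 0.7739123853299034),
    ("Asian", 1.0964526416628348),
]

# enumerate supplies the running number, so no manual counter is needed.
lines = [
    f"{n}) For one more {label}, we sell {coeff:.2f} more Bobo Bars all else equal."
    for n, (label, coeff) in enumerate(pairs, start=1)
]

print("Coefficient Interpretations:\n")
for line in lines:
    print(line)
```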

### Model performance: 
#### The test MSE was much higher than the training MSE, which may indicate overfitting or too many predictors. The R-squared values for the training and test sets, however, were nearly identical, though very low (about 0.28). In the end, this model did not predict purchases well, though it was worth trying.

In [46]:
# Training and test set mean squared errors 
train_error = mean_squared_error(tar_train, model.predict(pred_train))
print('Training Data MSE')
print(train_error)

test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('Test Data MSE')
print(str(test_error) + '\n')
# R-Squared
rsquared_train = model.score(pred_train, tar_train)
print('Training Data R-Square')
print(rsquared_train)

rsquared_test = model.score(pred_test, tar_test)
print('Test Data R-Square')
print(rsquared_test)

Training Data MSE
20853.698377528126
Test Data MSE
35941.189558306716

Training Data R-Square
0.2804542067270799
Test Data R-Square
0.28710489317178245
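
One way to read a large MSE gap alongside nearly equal R-squared values is to back out the spread of the target in each split. scikit-learn's `score` computes R-squared as 1 - SS_res / SS_tot, and MSE is SS_res / n, so the variance of the target on a split equals MSE / (1 - R-squared). The sketch below applies that identity to the four numbers printed above. The helper `implied_target_variance` is a hypothetical name introduced here.

```python
# The four values printed by the cell above.
train_mse, train_r2 = 20853.698377528126, 0.2804542067270799
test_mse, test_r2 = 35941.189558306716, 0.28710489317178245


def implied_target_variance(mse, r2):
    """Back out Var(y) on a split from R^2 = 1 - MSE / Var(y)."""
    return mse / (1.0 - r2)


train_var = implied_target_variance(train_mse, train_r2)
test_var = implied_target_variance(test_mse, test_r2)

print("Implied training target variance:", train_var)
print("Implied test target variance:", test_var)
print("Ratio (test / train):", test_var / train_var)
```

If the implied variance is much larger on the test split, the higher test MSE may mostly reflect more spread in purchase counts among the test rows rather than a model that fits the training rows better.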
