# The Data

We have been given a small sample of customer data. The data has been aggregated from individual purchases (transactions) over a time period (e.g., the last year). The goal is to see whether we can predict spending amounts for the next time period (e.g., next year).

In [None]:
# import "standard" packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# import packages to run regression
from statsmodels.formula.api import ols

# import the linear regression estimator from scikit-learn
from sklearn.linear_model import LinearRegression

In [None]:
# Read in the data
cust = pd.read_csv('./data/cust_small_clean.csv')
cust.info()

In [None]:
# Take a peek at a few rows of data
cust.sample(5)

## Descriptive Statistics

In [None]:
# statistical summaries
cust.describe()

In [None]:
# Try again and include all columns
cust.describe(include='all')

## End Result for Input
We want every input variable to be numerical. This means we should create *dummy* variables for `gender`, `marital_status`, and `home_ownership`. We also do not need `cust_id`, since it is just a unique identifier. The two date columns could be used to create numerical values, but we will simply ignore them for now.
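
As a minimal sketch of what dummy encoding produces, the toy example below encodes a single categorical column. The data frame `toy` and its category labels are made up for illustration and are not taken from the customer file. It also shows `drop_first=True`, which keeps k-1 columns by dropping the first category in sorted order; that is one option, and choosing the columns by hand works as well.

```python
import pandas as pd

# Hypothetical toy data; the category labels are assumptions for illustration
toy = pd.DataFrame({
    "home_ownership": ["own", "rent", "other", "own"],
})

# One 0/1 column per category (k columns for k categories)
full = pd.get_dummies(toy, columns=["home_ownership"], dtype=int)

# drop_first=True keeps k-1 columns; the dropped category acts as the baseline
reduced = pd.get_dummies(toy, columns=["home_ownership"],
                         drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

A row with zeros in every remaining dummy column belongs to the dropped (baseline) category.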

In [None]:
# What happens when we call get_dummies?
# Try to create dummy variables for gender, marital_status, and home_ownership
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
dummies = pd.get_dummies(cust[['gender','marital_status','home_ownership']],
                         dtype=int)
dummies

In [None]:
# Let's drop the following columns:
# cust_id, join_date, last_purchase_date
# gender, marital_status, home_ownership
cust = cust.drop(columns=['cust_id','join_date','last_purchase_date',
                          'gender','marital_status','home_ownership'])
cust.info()

In [None]:
# We now need to add the dummy variables
# However, remember we only need k-1 dummy columns for k categories
# (the category left out becomes the baseline)
# For gender that means just 1, ditto for marital_status
# For home_ownership we need 2
cust = pd.concat([cust,
                  dummies[['gender_F',
                           'marital_status_married',
                           'home_ownership_own',
                           'home_ownership_rent']]], axis=1)
cust.info()

In [None]:
cust.describe()

## Using `statsmodels`

In [None]:
formula = ('spend ~ age + household_income + num_children + num_vehicles'
           ' + gender_F + marital_status_married'
           ' + home_ownership_own + home_ownership_rent')
results = ols(formula, data=cust).fit()

In [None]:
results.summary()
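
The summary table is convenient to read, but individual quantities can also be pulled from the fitted results object. The sketch below fits a one-variable model on made-up numbers (the `toy` frame and its columns `x` and `y` are assumptions, not part of the customer data) and reads off the coefficients, R-squared, and p-values.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data, roughly following y = 2 * x
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
})

toy_results = ols("y ~ x", data=toy).fit()

# Estimated coefficients as a Series indexed by term name
print(toy_results.params)

# Share of variance in y explained by the model
print(toy_results.rsquared)

# p-value for each coefficient
print(toy_results.pvalues)
```

The same attributes should be available on `results` from the cell above.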

## Using `sklearn`

In [None]:
# Create the X and y
y = cust['spend']

X = cust.drop(columns='spend')

In [None]:
y.shape

In [None]:
X.shape

In [None]:
reg = LinearRegression()
reg.fit(X, y)

In [None]:
reg.intercept_

In [None]:
reg.coef_
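
`reg.coef_` is a plain array in the same order as the columns of `X`, so it can be hard to tell which number belongs to which variable. One way to label the values, sketched here on made-up data (`toy_X`, `toy_y`, and the column names `a` and `b` are assumptions for illustration), is to wrap the array in a pandas Series indexed by the column names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data built from an exact linear rule: y = 3 + 2*a - 1*b
toy_X = pd.DataFrame({
    "a": [0.0, 1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 0.0, 2.0, 5.0, 3.0],
})
toy_y = 3.0 + 2.0 * toy_X["a"] - 1.0 * toy_X["b"]

toy_reg = LinearRegression().fit(toy_X, toy_y)

# Pair each coefficient with the column it belongs to
coefs = pd.Series(toy_reg.coef_, index=toy_X.columns)

print(coefs)
print(toy_reg.intercept_)
```

Applied to the notebook's model, the equivalent would be `pd.Series(reg.coef_, index=X.columns)`, which makes it easier to compare against the `statsmodels` summary.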