
#### Importing Libraries

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

![kneading](https://media.giphy.com/media/RpckSiHL6ZaXS/giphy.gif)

# Enhancing Regression Models

Objectives: be able to use the following.

Pre-processing:
- handling non-numeric data
 - ordinal: label encoder
 - categorical: one-hot-encoder (which do you drop?)
 - binary encoder
- Scaling

Creating New:
- Interaction terms
- Polynomials
- combinations of other variables

Evaluating:
- R^2 vs adjusted R^2
- AIC
- BIC
- comparing model performance metrics - metrics going up or down?


## Scenario: car seat sales

Description: simulated data set on sales of car seats<br>
Format: 400 observations on the following 11 variables
- Sales: unit sales at each location
- CompPrice: price charged by nearest competitor at each location
- Income: community income level
- Advertising: local advertising budget for company at each location
- Population: population size in region (in thousands)
- Price: price charged for car seat at each site
- ShelveLoc: quality of shelving location at site (Good | Bad | Medium)
- Age: average age of the local population
- Education: education level at each location
- Urban: whether the store is in an urban or rural location
- US: whether the store is in the US or not

 We will attempt to predict ${\tt Sales}$ (child car seat sales) in 400 locations based on a number of predictors.

#### Task
Before looking at the data, brainstorm with your neighbor which four variables you think *might* be related to sales.

In [2]:
df2 = pd.read_csv('Carseats.csv')
df2.head()

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,9.5,138,73,11,276,120,Bad,42,17,Yes,Yes
1,11.22,111,48,16,260,83,Good,65,10,Yes,Yes
2,10.06,113,35,10,269,80,Medium,59,12,Yes,Yes
3,7.4,117,100,4,466,97,Medium,55,14,Yes,Yes
4,4.15,141,64,3,340,128,Bad,38,13,Yes,No


The ${\tt Carseats}$ data includes qualitative predictors such as ${\tt ShelveLoc}$, an indicator of the quality of the shelving location (that is, the space within a store in which the car seat is displayed) at each location. The predictor ${\tt ShelveLoc}$ takes on three possible values, ${\tt Bad}$, ${\tt Medium}$, and ${\tt Good}$.

Given a qualitative variable such as ${\tt ShelveLoc}$, the `statsmodels` formula interface generates dummy variables automatically and drops one level as the baseline. Here the baseline is ${\tt Bad}$, which is why only `ShelveLoc[T.Good]` and `ShelveLoc[T.Medium]` appear in the summary. Below we fit a multiple regression model that includes some interaction terms.

In [3]:
# Every column except the response is used as a predictor
x_vars = list(df2.columns[df2.columns != 'Sales'])

In [4]:
# Two interaction terms plus the main effect of every predictor
model = smf.ols('Sales ~ Income:Advertising + Price:Age + ' + ' + '.join(x_vars), data=df2)

In [5]:
results = model.fit()

In [6]:
results.summary()

Dep. Variable:,Sales,R-squared:,0.876
Model:,OLS,Adj. R-squared:,0.872
Method:,Least Squares,F-statistic:,210.0
Date:,"Thu, 29 Aug 2019",Prob (F-statistic):,6.14e-166
Time:,12:19:06,Log-Likelihood:,-564.67
No. Observations:,400,AIC:,1157.0
Df Residuals:,386,BIC:,1213.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.5756,1.009,6.519,0.000,4.592,8.559
ShelveLoc[T.Good],4.8487,0.153,31.724,0.000,4.548,5.149
ShelveLoc[T.Medium],1.9533,0.126,15.531,0.000,1.706,2.201
Urban[T.Yes],0.1402,0.112,1.247,0.213,-0.081,0.361
US[T.Yes],-0.1576,0.149,-1.058,0.291,-0.450,0.135
Income:Advertising,0.0008,0.000,2.698,0.007,0.000,0.001
Price:Age,0.0001,0.000,0.801,0.424,-0.000,0.000
CompPrice,0.0929,0.004,22.567,0.000,0.085,0.101
Income,0.0109,0.003,4.183,0.000,0.006,0.016

Omnibus:,1.281,Durbin-Watson:,2.047
Prob(Omnibus):,0.527,Jarque-Bera (JB):,1.147
Skew:,0.129,Prob(JB):,0.564
Kurtosis:,3.05,Cond. No.,131000.0
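
The formula above uses `:` for its interaction terms and adds the main effects separately. As a minimal sketch of how `:` differs from `*`, the toy data below (made-up values, not the real Carseats file) shows which design-matrix columns each operator produces.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for two Carseats columns; the values are invented for illustration
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Income": rng.uniform(20, 120, size=50),
    "Advertising": rng.uniform(0, 30, size=50),
})
toy["Sales"] = 5 + 0.01 * toy["Income"] + 0.1 * toy["Advertising"] + rng.normal(size=50)

# `a:b` adds only the product term
colon_names = smf.ols("Sales ~ Income:Advertising", data=toy).exog_names

# `a*b` is shorthand for `a + b + a:b`
star_names = smf.ols("Sales ~ Income*Advertising", data=toy).exog_names

print(colon_names)
print(star_names)
```

Including an interaction without its main effects is rarely what you want, which is why the model above lists every predictor in addition to the `:` terms.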


#### Task 
Again, with your neighbor:
- What issues do you see with this model?
- What would you change?

To learn how to set other coding schemes (or _contrasts_), see: http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/contrasts.html
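
As a small illustration of changing the coding scheme, the sketch below switches the baseline level of a categorical predictor with `Treatment(reference=...)`. The data frame is a made-up stand-in for the `ShelveLoc` column, and the effect sizes are assumptions chosen only for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: 20 stores per shelf location, with invented effects on Sales
rng = np.random.default_rng(1)
toy = pd.DataFrame({"ShelveLoc": np.tile(["Bad", "Medium", "Good"], 20)})
effect = {"Bad": 0.0, "Medium": 2.0, "Good": 5.0}
toy["Sales"] = 5 + toy["ShelveLoc"].map(effect) + rng.normal(scale=0.5, size=len(toy))

# Default treatment coding: the alphabetically first level ("Bad") is the baseline
default_fit = smf.ols("Sales ~ ShelveLoc", data=toy).fit()

# Same model, but with "Medium" as the baseline level
medium_fit = smf.ols(
    "Sales ~ C(ShelveLoc, Treatment(reference='Medium'))", data=toy
).fit()

default_names = list(default_fit.params.index)
medium_names = list(medium_fit.params.index)
print(default_names)
print(medium_names)
```

Changing the baseline changes how the coefficients are read (each one is a difference from the reference level), but not the fitted values.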

### Polynomials

![polynomials](https://sc.cnbcfm.com/applications/cnbc.com/resources/files/2015/12/11/emotionandincome-01_0.png)

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
# BinaryEncoder is not part of scikit-learn; it lives in the separate category_encoders package
from category_encoders import BinaryEncoder
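
A minimal sketch of how these encoders treat a column like `ShelveLoc`, using five made-up rows. `OrdinalEncoder` is not imported in the notebook; it is added here as an assumption, because it lets you state the level order explicitly.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({"ShelveLoc": ["Bad", "Good", "Medium", "Medium", "Bad"]})

# LabelEncoder numbers the levels alphabetically: Bad=0, Good=1, Medium=2
label_codes = LabelEncoder().fit_transform(toy["ShelveLoc"])

# OrdinalEncoder accepts an explicit order: Bad=0, Medium=1, Good=2
ordinal = OrdinalEncoder(categories=[["Bad", "Medium", "Good"]])
ordinal_codes = ordinal.fit_transform(toy[["ShelveLoc"]]).ravel()

# One-hot encoding, dropping the first level ("Bad") so it becomes the baseline
onehot = OneHotEncoder(drop="first")
dummies = onehot.fit_transform(toy[["ShelveLoc"]]).toarray()

print(label_codes)
print(ordinal_codes)
print(dummies)
```

Note that the alphabetical codes put `Good` below `Medium`, so for a truly ordinal variable the explicit ordering is usually the safer choice.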

In [None]:
from sklearn.preprocessing import PolynomialFeatures

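A squared term can also be written directly in a `statsmodels` formula, for example (using the `medv` and `lstat` variables from the Boston housing data, not Carseats):

`medv ~ lstat + np.square(lstat)`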
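
For the scikit-learn route, here is a minimal sketch of what `PolynomialFeatures` produces for two input columns. The input numbers are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two observations of two features (arbitrary values)
X = np.array([[2.0, 3.0],
              [1.0, 4.0]])

# degree=2 adds squares and the pairwise product; include_bias=False skips the column of ones
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Column order: x1, x2, x1^2, x1*x2, x2^2
print(X_poly)
```

The expansion therefore includes the interaction term `x1*x2` as well as the squares, so the number of columns grows quickly with the degree.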

In [None]:
from sklearn.preprocessing import StandardScaler
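
A minimal sketch of scaling, using the `Price` and `Income` values from the five rows shown by `df2.head()` above. `StandardScaler` centers each column at 0 and rescales it to unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Price and Income for the first five rows of the Carseats data
X = np.array([[120.0, 73.0],
              [83.0, 48.0],
              [80.0, 35.0],
              [97.0, 100.0],
              [128.0, 64.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and (population) standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

In practice the scaler would be fit on the training data only and then applied to new data with `scaler.transform`.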

### Evaluating
#### Using `statsmodels`

![albon2](./img/aic-albon.png)

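**AIC**: The Akaike Information Criterion. Penalizes the log-likelihood for model complexity: $\text{AIC} = 2k - 2\ln L$, where $k$ is the number of estimated parameters and $L$ is the maximized likelihood.


**BIC**: The Bayesian Information Criterion. Similar to the AIC, but the penalty also grows with the number of observations $n$: $\text{BIC} = k\ln n - 2\ln L$. Because $\ln n > 2$ once $n \geq 8$, BIC penalizes extra parameters more heavily than AIC on all but tiny data sets.

For both criteria, lower is better. The values are only meaningful when comparing models fit to the same data.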

`results.aic`<br>
`results.bic`
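
As a check on what these attributes report, the sketch below recomputes both criteria from the numbers in the summary above, using the standard definitions $2k - 2\ln L$ and $k\ln n - 2\ln L$. The parameter count of 14 is the 13 model degrees of freedom plus the intercept.

```python
import math

# Values read from the summary table above
log_likelihood = -564.67
n_obs = 400
n_params = 14  # Df Model (13) + intercept

aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(round(aic), round(bic))
```

Both results round to the AIC and BIC values shown in the summary.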

![r-squared](https://qph.fs.quoracdn.net/main-qimg-b932057f732059158062cf0ad9c1719f.webp)

![adj-r-sqr](https://i.stack.imgur.com/BTGK6.png)

`results.rsquared`<br>
`results.rsquared_adj`
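
A minimal sketch of the adjustment, recomputing adjusted $R^2$ from the rounded $R^2$ in the summary above with the usual formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $p$ is the number of predictors excluding the intercept.

```python
# Values read from the summary table above
r_squared = 0.876
n_obs = 400
n_predictors = 13  # Df Model

adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(adj_r_squared, 3))
```

Plain $R^2$ never decreases when a predictor is added, whereas adjusted $R^2$ falls if the new predictor does not explain enough extra variance to justify the lost degree of freedom.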
