### Regression Part II

**OBJECTIVES**

- Use `sklearn` to build multiple regression models
- Use `statsmodels` to build multiple regression models
- Evaluate models using `mean_squared_error`
- Interpret categorical coefficients

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


### Advertising Data

The goal here is to predict `sales` from advertising spending on three different media types. Rather than using every column, we want to be selective about which features are used as inputs to the model.


In [None]:
ads = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa23/main/data/ads.csv', index_col=0)

In [None]:
ads.head()

In [None]:
#scatterplot
plt.scatter(ads['TV'], ads['sales']);

In [None]:
# heatmap
sns.heatmap(ads.corr(), annot = True, cmap = 'Reds');

In [None]:
# pairplot
sns.pairplot(ads);

1. Choose a single column as `X` to predict sales.  Justify your choice -- remember to make `X` 2D.
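
As a reminder of what "2D" means here, the following minimal sketch uses a small made-up DataFrame (not the real `ads` data) to contrast single and double brackets. The name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Synthetic stand-in for the ads data; the values are made up for illustration
toy = pd.DataFrame({"TV": [10.0, 20.0, 30.0], "sales": [2.0, 4.0, 6.0]})

X_1d = toy["TV"]     # single brackets give a Series with shape (3,)
X_2d = toy[["TV"]]   # double brackets give a DataFrame with shape (3, 1)
y = toy["sales"]     # the target can stay 1D

print(X_1d.shape, X_2d.shape)
```

`sklearn` estimators expect the feature matrix to have one row per observation and one column per feature, so the double-bracket form is the one to pass to `.fit`.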

In [None]:
#declare X and y


2. Build a regression model to predict `sales` using your `X` above.

In [None]:
# instantiate and fit the model


3. Interpret the slope of the model.

In [None]:
# examine slope, what does it mean?


4. Interpret the intercept of the model.

In [None]:
#intercept


5. Evaluate the `mean_squared_error` of the model.

In [None]:
# MSE


6. Create baseline predictions using the mean of `y`. 

In [None]:
# ones

# multiply by mean


7. Compute the `mean_squared_error` of your baseline predictions.

In [None]:
# MSE Baseline


8. Did your model perform better than the baseline?  
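
The baseline comparison in steps 6 through 8 can be sketched as follows. This uses synthetic data generated with NumPy rather than the `ads` data, so the numbers are illustrative only; the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (made up): a noisy linear relationship with one feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 1))
y = 5 + 0.1 * X[:, 0] + rng.normal(0, 1, size=50)

# Fit the model and compute its MSE on the same data
lr = LinearRegression().fit(X, y)
model_mse = mean_squared_error(y, lr.predict(X))

# Baseline: an array of ones multiplied by the mean of y
baseline = np.ones(len(y)) * y.mean()
baseline_mse = mean_squared_error(y, baseline)

print(model_mse, baseline_mse)
```

On the data it was fit to, a linear model with an intercept can never have a higher MSE than the mean baseline, so the useful question is how much lower it is.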

#### the `.score` method

In addition to the `mean_squared_error` function, you can evaluate regression models using the fitted object's `.score` method. This method reports $r^2$, which compares the *residual sum of squares* to the *total sum of squares*. With $f_i$ the model's prediction for observation $i$ and $\bar{y}$ the mean of $y$, these are given by:

$$RSS =  \sum _{i}(y_{i}-f_{i})^{2}$$

$$TSS = \sum _{i}(y_{i}-{\bar {y}})^{2}$$

and 

$$r^2 = 1 - \frac{RSS}{TSS}$$

In [None]:
#model score


You can interpret this as the proportion of the variation in `y` that is explained by the features according to your model.
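
To connect the formulas to the `.score` method, here is a minimal sketch that computes $RSS$, $TSS$, and $r^2$ by hand and compares the result with `.score`. The data are synthetic and the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (made up) for illustration
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(40, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 2, size=40)

lr = LinearRegression().fit(X, y)
yhat = lr.predict(X)

rss = np.sum((y - yhat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
r2_manual = 1 - rss / tss

# The manual value should agree with the built-in score
print(r2_manual, lr.score(X, y))
```

Note that predicting the mean of `y` everywhere gives $RSS = TSS$ and therefore $r^2 = 0$, which is why the mean works as a baseline.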

#### Adding Features

Now, we want to include a second feature as input to the model. Reexamine the plots and correlations above. What is a good second choice?

9. Choose two columns from the `ads` data, assign these as `X`. 

In [None]:
sns.pairplot(ads, y_vars = 'sales')

In [None]:
# X2


10. Build a regression model with two features to predict `sales`.

In [None]:
# lr2 model


11. Evaluate the model using `mean_squared_error`.

In [None]:
# yhat2

# MSE


12. Interpret the coefficients of the model
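
One way to lay out coefficients for interpretation is a small DataFrame pairing each feature with its coefficient. The sketch below uses synthetic, noise-free data so the fitted values are easy to check; the column name `radio` and all numbers are assumptions, not taken from the `ads` data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (made up): sales depends exactly on two spending columns
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "TV": rng.uniform(0, 300, 60),
    "radio": rng.uniform(0, 50, 60),   # hypothetical second feature
})
toy["sales"] = 3 + 0.05 * toy["TV"] + 0.2 * toy["radio"]

X2 = toy[["TV", "radio"]]
lr2 = LinearRegression().fit(X2, toy["sales"])

# Pair each feature name with its fitted coefficient
coefs = pd.DataFrame({"feature": X2.columns, "coefficient": lr2.coef_})
print(coefs)
print(lr2.intercept_)
```

Each coefficient is read as the expected change in the target for a one-unit increase in that feature while the other feature is held fixed.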

In [None]:
# make a dataframe here


#### Using `statsmodels`

A different library for models is the `statsmodels` library. It contains more classical statistical modeling approaches, including a statistical summary of the fit. The interface is slightly different from that of `sklearn`.

13. Fit a regression model using `statsmodels`.

In [None]:
import statsmodels.api as sm

In [None]:
# instantiate and fit


14. Examine the summary of the model using `.summary()`.  

In [None]:
# model summary


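15. Include an intercept term. Unlike `sklearn`'s `LinearRegression`, `sm.OLS` does not add an intercept automatically, so add a constant column to `X` with `sm.add_constant`.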

In [None]:
# add the constant 


In [None]:
# fit model


In [None]:
# summary


In [None]:
LinearRegression()

### Example II: Credit Data

In [None]:
credit = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa23/main/data/Credit.csv', index_col = 0)

In [None]:
credit.head(2)

Build a regression model using the `Ethnicity` feature to predict `Balance`. Interpret the coefficients.
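
Before working with the credit data, it may help to see how dummy-variable coefficients behave on a tiny made-up example. The sketch below assumes a hypothetical categorical column `group` with levels `A`, `B`, and `C`; none of the values come from the `credit` data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (made up): three groups with different mean balances
toy = pd.DataFrame({
    "group": ["A", "A", "B", "B", "C", "C"],
    "balance": [100.0, 200.0, 300.0, 400.0, 500.0, 700.0],
})

# drop_first=True removes the first level ("A"), which becomes the reference
X = pd.get_dummies(toy["group"], drop_first=True, dtype=float)
y = toy["balance"]

lr = LinearRegression().fit(X, y)
group_means = toy.groupby("group")["balance"].mean()

print(lr.intercept_)                   # mean balance of the reference group A
print(dict(zip(X.columns, lr.coef_)))  # differences from the reference group
print(group_means)
```

With only dummy columns as features, the intercept equals the mean of the dropped category, and each coefficient is the difference between that category's mean and the reference mean.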

In [None]:
#unique values?


In [None]:
# using get_dummies


1. Define `X` and `y`.

2. Instantiate and fit.

3. Examine the coefficients.

4. Interpret the intercept.

5. Mean Squared Error

6. Baseline MSE

#### Problem

Using only `Ethnicity` to predict the balance is perhaps too simplistic a model. Select other features you believe are important for predicting `Balance` and build a regression model using these inputs. Interpret your coefficients and discuss the overall performance of the model.

In [None]:
credit.head(2)