# THIS FILE IS IN THE HANDOUTS FOLDER. COPY IT INTO YOUR CLASS NOTES TO WORK ON TODAY

- [**Read the chapter on the website!**](https://ledatascifi.github.io/ledatascifi-2022/content/05/02_reg.html) It contains a lot of extra information we won't cover in class extensively.
- After reading that, I recommend [this webpage as a complementary place to get additional intuition.](https://aeturrell.github.io/coding-for-economists/econmt-regression.html)

## ASAP

[Declare your team and project interests in the project sheet](https://docs.google.com/spreadsheets/d/1A6oQfhTBHHEb_EWSgfQv2KgsBoZm4V_BQWuvTjxnrdo/edit#gid=1508330834)


# Today: Regression

We start our machine learning applications with regression for a few simple reasons:
- Regression is a fundamental method for estimating the relationship between a variable ("y") and many other ("X") variables that we condition on.
- But the coefficients obtained can also be used to generate predictions.
- _Note: The focus in this section is on the RELATIONSHIP paradigm_
- Many issues that confront researchers have well-understood solutions when regression is the model being used.
- Regression coefficients are easy to interpret.


  
## Objectives

1. You can fit a regression with `statsmodels` or `sklearn`
    - statsmodels: Nicer result tables, usually easier to specify the regression model
    - sklearn: Easier to use within a prediction/ML exercise
2. You can view the results of your model visually or numerically with either method
3. The focus today is on the _mechanics_ of running regressions, viewing the output, and using the estimation's output objects.

![](https://media.giphy.com/media/yoJC2K6rCzwNY2EngA/giphy.gif)


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import ols as sm_ols
import matplotlib.pyplot as plt


## Data

First, we load the data. 

**This is a new dataset, so we should do some data exploration!** Things students should try:
- `describe()` - look for any impossible values
- `value_counts()` on any categorical variables
- didn't we have a community function to start the EDA?
- correlation heat map
- look for outliers for all variables, and within pairplots
- print out and explore many sections of the data manually (in Excel or Spyder) to get familiar and check for data consistency issues
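
A minimal sketch of a few of those checks. The tiny `toy` DataFrame and its column names are made up for illustration (they are not the Fannie Mae data); the same calls should work on any DataFrame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "loan_id": [1, 2, 2, 3, 4],
    "rate": [6.875, 5.875, 5.875, 6.25, np.nan],
    "channel": ["B", "B", "B", "C", "R"],
})

n_obs, n_vars = toy.shape                        # number of observations and variables
summary = toy.describe()                         # min/max help reveal impossible values
channel_counts = toy["channel"].value_counts()   # levels of a categorical variable
n_dup_ids = toy["loan_id"].duplicated().sum()    # repeated identifiers
n_missing = toy.isna().sum()                     # missing values per column

print(n_obs, n_vars)
print(summary)
print(channel_counts)
print("duplicate ids:", n_dup_ids)
print(n_missing)
```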


In [2]:
url = 'https://github.com/LeDataSciFi/ledatascifi-2022/blob/main/data/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip') 

## Task 1

Spend 5 minutes exploring the data and jot down what you learn about the data. 


In [3]:
from pandas_profiling import ProfileReport

# create the report:
profile = ProfileReport(fannie_mae, title="Pandas Profiling Report")
profile.to_file("fannie_mae_summary.html")

# NOW GO AND SPEND TIME WITH THIS!

# # OBS, # VARS.
# OBSERVATION UNIT? (A SINGLE LOAN)
# DUPLICATE OBSERVATIONS (YES, JUDGING BY LOAN_IDENTIFIER!)
# WHICH VARS ARE #S? WHICH ARE CATEGORICAL AND WHAT VALUES DO THEY TAKE?


- loans data, mortgages fannie mae backed
- observation unit is: A LOAN
- 135k obs, 36 vars
- a lot of vars have nans,
- 2 credit score vars? Borrower_Credit_Score_at_Origination Co-borrower_credit_score_at_origination

In [4]:
fannie_mae.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Loan_Identifier | 973373000000.0 | 927620000000.0 | 717667000000.0 | 988951000000.0 | 190885000000.0 |
| Origination_Channel | B | B | B | C | R |
| Seller_Name | OTHER | PNC BANK, N.A. | OTHER | AMTRUST BANK | OTHER |
| Original_Interest_Rate | 6.875 | 5.875 | 6.25 | 6.0 | 5.875 |
| Original_UPB | 32000.0 | 200000.0 | 122000.0 | 67000.0 | 50000.0 |
| Original_Loan_Term | 360.0 | 360.0 | 180.0 | 180.0 | 180.0 |
| Original_LTV_(OLTV) | 90.0 | 80.0 | 80.0 | 77.0 | 41.0 |
| Original_Combined_LTV_(CLTV) | 90.0 | 80.0 | 80.0 | 77.0 | 41.0 |
| Number_of_Borrowers | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| Original_Debt_to_Income_Ratio | 22.0 | 26.0 | 31.0 | 17.0 | 10.0 |


In [5]:
fannie_mae.columns

Index(['Loan_Identifier', 'Origination_Channel', 'Seller_Name',
       'Original_Interest_Rate', 'Original_UPB', 'Original_Loan_Term',
       'Original_LTV_(OLTV)', 'Original_Combined_LTV_(CLTV)',
       'Number_of_Borrowers', 'Original_Debt_to_Income_Ratio',
       'Borrower_Credit_Score_at_Origination', 'Loan_purpose', 'Property_type',
       'Number_of_units', 'Occupancy_type', 'Property_state', 'Zip_code_short',
       'Primary_mortgage_insurance_percent', 'Product_type',
       'Co-borrower_credit_score_at_origination', 'Mortgage_Insurance_type',
       'Origination_Date', 'First_payment_date',
       'First_time_home_buyer_indicator', 'UNRATE', 'CPIAUCSL', 'Qdate',
       'rGDP', 'TCMR', 'POILWTIUSDM', 'TTLCONS', 'DEXUSEU', 'BOPGSTB',
       'GOLDAMGBD228NLBM', 'CSUSHPISA', 'MSPUS'],
      dtype='object')

In [6]:
fannie_mae['Occupancy_type'].value_counts()

P    120399
I      9238
S      5401
Name: Occupancy_type, dtype: int64

In [7]:
fannie_mae[['Original_LTV_(OLTV)', 'Original_Combined_LTV_(CLTV)']].corr()

| | Original_LTV_(OLTV) | Original_Combined_LTV_(CLTV) |
|---|---|---|
| Original_LTV_(OLTV) | 1.0 | 0.975036 |
| Original_Combined_LTV_(CLTV) | 0.975036 | 1.0 |


In [8]:
fannie_mae[['Borrower_Credit_Score_at_Origination','Co-borrower_credit_score_at_origination']].describe()

| | Borrower_Credit_Score_at_Origination | Co-borrower_credit_score_at_origination |
|---|---|---|
| count | 134481.0 | 67366.0 |
| mean | 742.428797 | 751.145904 |
| std | 53.428076 | 50.576226 |
| min | 361.0 | 400.0 |
| 25% | 707.0 | 721.0 |
| 50% | 755.0 | 765.0 |
| 75% | 786.0 | 791.0 |
| max | 850.0 | 850.0 |


## Clean the data and create variables we will use

These variables are pretty straightforward:

In [9]:
fannie_mae = (fannie_mae
                  # create variables
                  .assign(l_credscore = np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
                          l_LTV = np.log(fannie_mae['Original_LTV_(OLTV)']),
                          Origination_Date = lambda x: pd.to_datetime(x['Origination_Date']),
                          Origination_Year = lambda x: x['Origination_Date'].dt.year,
                          const = 1,
                          great = fannie_mae['Borrower_Credit_Score_at_Origination'] >= 800
                         )
              
             )

The credit score is a number no higher than 850 (FICO scores nominally run from 300 to 850). But in some analyses, it might make sense to have categories of credit ratings (e.g. bad to good). I borrowed [these cutoffs from Experian.](https://www.experian.com/blogs/ask-experian/infographic-what-are-the-different-scoring-ranges/)

In [10]:
# create a categorical bin var with "pd.cut()"

fannie_mae['creditbins']= pd.cut(fannie_mae['Borrower_Credit_Score_at_Origination'],
                                 [0,579,669,739,799,850],
                                 labels=['Very Poor','Fair','Good','Very Good','Exceptional'])

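Here is the variable that was created. I notice that 669 (right on the threshold of a bin) goes into the "Fair" bin instead of "Good". That is because `pd.cut` bins include their right edge by default, so the "Fair" bin is (579, 669].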

In [11]:
fannie_mae.loc[:5,['Borrower_Credit_Score_at_Origination','creditbins']]

| | Borrower_Credit_Score_at_Origination | creditbins |
|---|---|---|
| 0 | 669.0 | Fair |
| 1 | 693.0 | Good |
| 2 | 741.0 | Very Good |
| 3 | 804.0 | Exceptional |
| 4 | 658.0 | Fair |
| 5 | 665.0 | Fair |
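
To see how the bin edges behave, here is a small sketch on a handful of made-up scores (the `scores` values are illustrative, not taken from the data). It compares the default `right=True` with `right=False` using the same edges and labels as above.

```python
import pandas as pd

scores = pd.Series([579, 580, 669, 670, 800, 850])   # illustrative scores
edges = [0, 579, 669, 739, 799, 850]
labels = ['Very Poor', 'Fair', 'Good', 'Very Good', 'Exceptional']

# default (right=True): each bin is (left, right], so 669 lands in "Fair"
right_closed = pd.cut(scores, edges, labels=labels)

# right=False: each bin is [left, right), so 669 moves to "Good"
# and 850 falls outside the last bin and becomes NaN
left_closed = pd.cut(scores, edges, labels=labels, right=False)

print(pd.DataFrame({'score': scores,
                    'right=True': right_closed,
                    'right=False': left_closed}))
```

If you switch to `right=False`, you would probably also want to shift the edges (and extend the last one past 850) so that no valid score is lost.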


In [12]:
# pd.cut took the credit score (a number, at most 850)
# and changed it into bins. I labeled the bins explicitly

fannie_mae['creditbins'].value_counts(dropna=False)

Very Good      63855
Good           39539
Exceptional    15889
Fair           14560
Very Poor        638
NaN              557
Name: creditbins, dtype: int64

## Exercises with statsmodels

- **For all problems: y is the interest rate of the loan**
- I recommend the _statsmodels formula_ method on the website

Pseudocode for using statsmodels to run a regression:
```python
model = sm_ols(<formula>, data=<dataframe>)
result=model.fit()

# to print regression output: result.summary()
# get predicted values (yhat): result.predict()
# get regression residuals (uhat): result.resid
```
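
The pseudocode above, filled in as a runnable sketch. The data here are simulated (the names `df`, `x`, `y` and the true coefficients 2 and 3 are assumptions for illustration), so you can check that the estimates land near the values used to generate them.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

# Simulated data (hypothetical): y = 2 + 3*x + noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=500)})
df["y"] = 2 + 3 * df["x"] + rng.normal(scale=0.5, size=500)

model = sm_ols("y ~ x", data=df)   # declare the model
result = model.fit()               # estimate it

print(result.params)               # estimated coefficients (Intercept, x)
yhat = result.predict()            # predicted values for the estimation sample
uhat = result.resid                # residuals: y minus yhat
```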

### Q1: Starter regressions

A. Regress y on the credit score (student demo): $y=\beta_0 + \beta_1*\text{Credit Score}$
- _I'll show 2 ways: the pseudocode and the one-liner_

B. Regress y on the **natural log** of the credit score: $y=\beta_0 + \beta_1*log(\text{Credit Score})$
- _I'll show two ways to do this_

C. Regress y on the **natural log** of the loan-to-value

D. Regress y on the natural log of the loan-to-value and the natural log of the credit score: $y=\beta_0 + \beta_1*log(\text{LTV}) + \beta_2*log(\text{Credit Score})$

In [13]:
# PROB 1.A
# opt A - declare model, save results, use results
model = sm_ols(
    "Original_Interest_Rate ~ Borrower_Credit_Score_at_Origination", data=fannie_mae
)
results = model.fit()
results.summary()

# opt B - don't save model or results,... just print
sm_ols("Original_Interest_Rate ~ Borrower_Credit_Score_at_Origination", 
       data=fannie_mae).fit().summary()

0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.126
Model:,OLS,Adj. R-squared:,0.126
Method:,Least Squares,F-statistic:,19380.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,0.0
Time:,18:09:15,Log-Likelihood:,-215750.0
No. Observations:,134481,AIC:,431500.0
Df Residuals:,134479,BIC:,431500.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,11.5819,0.046,253.270,0.000,11.492,11.671
Borrower_Credit_Score_at_Origination,-0.0086,6.14e-05,-139.198,0.000,-0.009,-0.008

0,1,2,3
Omnibus:,2660.479,Durbin-Watson:,0.397
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2660.737
Skew:,0.321,Prob(JB):,0.0
Kurtosis:,2.75,Cond. No.,10400.0


In [14]:
# PROB 1.B
# reg on log cred score
# opt 1: create the variable ahead of time and use it
sm_ols("Original_Interest_Rate ~ l_credscore", 
       data=fannie_mae).fit().summary()

# opt 2: sm_ols formula lets you use math functions inside it
sm_ols("Original_Interest_Rate ~ np.log(Borrower_Credit_Score_at_Origination)", 
       data=fannie_mae).fit().summary()


0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.124
Model:,OLS,Adj. R-squared:,0.124
Method:,Least Squares,F-statistic:,19060.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,0.0
Time:,18:09:15,Log-Likelihood:,-215890.0
No. Observations:,134481,AIC:,431800.0
Df Residuals:,134479,BIC:,431800.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,45.3715,0.291,156.057,0.000,44.802,45.941
np.log(Borrower_Credit_Score_at_Origination),-6.0750,0.044,-138.067,0.000,-6.161,-5.989

0,1,2,3
Omnibus:,2741.277,Durbin-Watson:,0.394
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2737.156
Skew:,0.325,Prob(JB):,0.0
Kurtosis:,2.744,Cond. No.,598.0
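
One way to read the coefficient on a logged regressor, using the estimate of -6.0750 from the table above. Raising $x$ by 1% changes the fitted $y$ by

$$\Delta y = \beta_1 \left[\log(1.01\,x) - \log(x)\right] = \beta_1 \log(1.01) \approx \frac{\beta_1}{100}$$

$$-6.0750 \times \log(1.01) \approx -6.0750 \times 0.00995 \approx -0.060$$

So a 1% higher credit score is associated with an interest rate about 0.06 lower, in whatever units the rate is measured in (the sample rows suggest percentage points). This is an association, not necessarily a causal effect.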


In [15]:
# PROB 1.C
# reg on log loan to value
# opt 1: create the variable ahead of time and use it
sm_ols("Original_Interest_Rate ~ l_LTV", 
       data=fannie_mae).fit().summary()

# # opt 2: Q("bad var name")  ... just remember the whole formula should be inside '' marks 
sm_ols('Original_Interest_Rate ~ np.log(Q("Original_LTV_(OLTV)"))', 
       data=fannie_mae).fit().summary()


0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.007
Model:,OLS,Adj. R-squared:,0.007
Method:,Least Squares,F-statistic:,1010.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,8.41e-221
Time:,18:09:15,Log-Likelihood:,-225480.0
No. Observations:,135038,AIC:,451000.0
Df Residuals:,135036,BIC:,451000.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.7603,0.047,80.622,0.000,3.669,3.852
"np.log(Q(""Original_LTV_(OLTV)""))",0.3513,0.011,31.779,0.000,0.330,0.373

0,1,2,3
Omnibus:,4889.29,Durbin-Watson:,0.214
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3115.913
Skew:,0.245,Prob(JB):,0.0
Kurtosis:,2.439,Cond. No.,59.4


In [16]:
# PROB 1.D
# adding more variables is easy! +X2
sm_ols('Original_Interest_Rate ~ l_LTV + l_credscore',
       data=fannie_mae).fit().summary()

0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.126
Model:,OLS,Adj. R-squared:,0.126
Method:,Least Squares,F-statistic:,9656.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,0.0
Time:,18:09:15,Log-Likelihood:,-215780.0
No. Observations:,134481,AIC:,431600.0
Df Residuals:,134478,BIC:,431600.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,44.1324,0.302,145.949,0.000,43.540,44.725
l_LTV,0.1546,0.010,14.765,0.000,0.134,0.175
l_credscore,-5.9859,0.044,-134.888,0.000,-6.073,-5.899

0,1,2,3
Omnibus:,2793.369,Durbin-Watson:,0.386
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2743.99
Skew:,0.321,Prob(JB):,0.0
Kurtosis:,2.72,Cond. No.,735.0


### Q2: Best practices: Look at the outputs every time

Let's talk about the outputs you see and should look at EVERY time you run a regression:
- Number of obs
- R2 (R-squared)
- AR2 (adjusted R-squared)
- Coef 
- Std error, t value, p value ("P>|t|")
- Std error options:
    - `.fit(cov_type="HC2")`
    - `.fit(cov_type="cluster", cov_kwds={"groups": df["industry"]})`
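
A small sketch of those standard error options on simulated data. The DataFrame, the `group` column, and the true coefficients are made up for illustration; the point is that the coefficients don't change, only the standard errors do.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

# Simulated data (hypothetical): 50 groups of 20 observations each
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x": rng.normal(size=1000),
    "group": np.repeat(np.arange(50), 20),   # integer group codes
})
df["y"] = 1 + 2 * df["x"] + rng.normal(size=1000)

model = sm_ols("y ~ x", data=df)
res_default = model.fit()                                  # classical ("nonrobust") errors
res_hc2 = model.fit(cov_type="HC2")                        # heteroskedasticity-robust errors
res_cluster = model.fit(cov_type="cluster",
                        cov_kwds={"groups": df["group"]})  # clustered by group

# the coefficients are identical; only the standard errors change
print(pd.DataFrame({"default": res_default.bse,
                    "HC2": res_hc2.bse,
                    "cluster": res_cluster.bse}))
```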

### Q3: Regressions with transformations

We are talking about "linear" regression. What that means is that the model is linear in the regressors, but it doesn't mean that those regressors can't be some kind of non-linear transform of the original features $x_i$. The most common transformations are logging variables, interaction terms, and polynomial terms.

We already did log transformations above. 

An interaction term simply means one regressor is two variables multiplied:
- $y=\beta_0 + \beta_1 x_1 + \beta_2 x_1 x_2$
- $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 $

Polynomial terms might look like:
- $y=\beta_0 + \beta_1 x_1 + \beta_2 x_1^2$
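
A quick worked derivation of why these terms matter for interpretation: once a regressor shows up in an interaction or polynomial term, its relationship with $y$ is no longer a single number. Differentiating the full interaction model and the polynomial model above with respect to $x_1$:

$$\frac{\partial y}{\partial x_1} = \beta_1 + \beta_3 x_2 \qquad \text{(interaction model)}$$

$$\frac{\partial y}{\partial x_1} = \beta_1 + 2\beta_2 x_1 \qquad \text{(polynomial model)}$$

So in the interaction model the slope on $x_1$ depends on the level of $x_2$, and in the polynomial model it depends on the level of $x_1$ itself.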

A. Regress y on the credit score and the credit score squared. 

B. Regress y on the natural log of the loan-to-value, the natural log of the credit score, and the interaction of LTV and credit score. 



In [17]:
model2 = sm_ols('Original_Interest_Rate ~ Borrower_Credit_Score_at_Origination',data=fannie_mae)
results2 = model2.fit()            
print(results2.summary())  

                              OLS Regression Results                              
Dep. Variable:     Original_Interest_Rate   R-squared:                       0.126
Model:                                OLS   Adj. R-squared:                  0.126
Method:                     Least Squares   F-statistic:                 1.938e+04
Date:                    Wed, 23 Mar 2022   Prob (F-statistic):               0.00
Time:                            09:50:45   Log-Likelihood:            -2.1575e+05
No. Observations:                  134481   AIC:                         4.315e+05
Df Residuals:                      134479   BIC:                         4.315e+05
Df Model:                               1                                         
Covariance Type:                nonrobust                                         
                                           coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------

In [18]:
# PROB 3.B
# * automatically includes BOTH vars alone, and the multiplication of them
sm_ols('Original_Interest_Rate ~ l_LTV * l_credscore',
       data=fannie_mae).fit().summary()

# you can just manually list them, too, and : multiplies two vars without the expansion
sm_ols('Original_Interest_Rate ~ l_LTV + l_LTV : l_credscore',
       data=fannie_mae).fit().summary()

0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.127
Model:,OLS,Adj. R-squared:,0.127
Method:,Least Squares,F-statistic:,9767.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,0.0
Time:,18:09:16,Log-Likelihood:,-215680.0
No. Observations:,134481,AIC:,431400.0
Df Residuals:,134478,BIC:,431400.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4.4874,0.044,101.849,0.000,4.401,4.574
l_LTV,9.5850,0.069,139.190,0.000,9.450,9.720
l_LTV:l_credscore,-1.4240,0.010,-135.703,0.000,-1.445,-1.403

0,1,2,3
Omnibus:,2768.987,Durbin-Watson:,0.388
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2729.806
Skew:,0.321,Prob(JB):,0.0
Kurtosis:,2.726,Cond. No.,599.0
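
To check how `*` and `:` expand, one option is to look at the names of the estimated coefficients. This is a sketch on simulated data; `x1` and `x2` are hypothetical variable names.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

# Simulated data (hypothetical variable names x1, x2)
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 1 + df["x1"] - df["x2"] + 0.5 * df["x1"] * df["x2"] + rng.normal(size=300)

star = sm_ols("y ~ x1 * x2", data=df).fit()            # expands to x1 + x2 + x1:x2
manual = sm_ols("y ~ x1 + x2 + x1:x2", data=df).fit()  # same model, written out
colon = sm_ols("y ~ x1 + x1:x2", data=df).fit()        # no x2 term on its own

print(list(star.params.index))
print(list(colon.params.index))
```

Note that the `:` version here leaves out `x2` by itself, so it is a different model from the `*` version.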


### Q4: Dummy and categorical variables

A. Regress y on the dummy variable for a great credit score.

B. Regress y on the categorical variable we created for credit bins.

C. (Advanced, optional, after class exercise): High dimensional fixed effects. This basically means "a categorical variable with LOTS of values". [See this discussion.](https://aeturrell.github.io/coding-for-economists/econmt-regression.html#high-dimensional-fixed-effects-aka-absorbing-regression)

In [19]:
# binary/dummy vars with two values (0 and 1) can just be added into the reg

sm_ols('Original_Interest_Rate ~ great',
       data=fannie_mae).fit().summary()

0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.05
Model:,OLS,Adj. R-squared:,0.05
Method:,Least Squares,F-statistic:,7048.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,0.0
Time:,18:09:16,Log-Likelihood:,-222550.0
No. Observations:,135038,AIC:,445100.0
Df Residuals:,135036,BIC:,445100.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5.3433,0.004,1466.712,0.000,5.336,5.350
great[T.True],-0.8916,0.011,-83.951,0.000,-0.912,-0.871

0,1,2,3
Omnibus:,2948.608,Durbin-Watson:,0.305
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2309.413
Skew:,0.239,Prob(JB):,0.0
Kurtosis:,2.572,Cond. No.,3.15


In [20]:
# you can add categorical variables into a reg
# you will get one coef for each level of the var, except 1 of the levels

# ALWAYS ALWAYS ALWAYS PUT CATEGORICAL VARS INSIDE C()
# C() prevents cat vars that are numbers from being treated like a number
sm_ols('Original_Interest_Rate ~ C(creditbins)',
       data=fannie_mae).fit().summary()

0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.116
Model:,OLS,Adj. R-squared:,0.116
Method:,Least Squares,F-statistic:,4411.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,0.0
Time:,18:09:16,Log-Likelihood:,-216510.0
No. Observations:,134481,AIC:,433000.0
Df Residuals:,134476,BIC:,433100.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.7172,0.048,140.160,0.000,6.623,6.811
C(creditbins)[T.Fair],-0.6749,0.049,-13.784,0.000,-0.771,-0.579
C(creditbins)[T.Good],-1.2020,0.048,-24.881,0.000,-1.297,-1.107
C(creditbins)[T.Very Good],-1.6642,0.048,-34.552,0.000,-1.759,-1.570
C(creditbins)[T.Exceptional],-2.2655,0.049,-46.351,0.000,-2.361,-2.170

0,1,2,3
Omnibus:,2410.734,Durbin-Watson:,0.385
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2345.942
Skew:,0.294,Prob(JB):,0.0
Kurtosis:,2.729,Cond. No.,37.9
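
A sketch of how to read coefficients on a categorical variable: the intercept is the average y in the omitted level, and each coefficient is the gap between that level's average and the omitted level's average. The data and the bin names ("Low", "Mid", "High") below are simulated for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

# Simulated data (hypothetical): three bins with different average rates
rng = np.random.default_rng(3)
labels = rng.choice(["Low", "Mid", "High"], size=600)
df = pd.DataFrame({"bin": pd.Categorical(labels, categories=["Low", "Mid", "High"])})
base = pd.Series(labels).map({"Low": 7.0, "Mid": 6.0, "High": 5.0})
df["rate"] = base + rng.normal(scale=0.3, size=600)

res = sm_ols("rate ~ C(bin)", data=df).fit()
group_means = df.groupby("bin", observed=True)["rate"].mean()

# Intercept = mean of the omitted (first) level; each coef = gap to that level
implied = {
    "Low": res.params["Intercept"],
    "Mid": res.params["Intercept"] + res.params["C(bin)[T.Mid]"],
    "High": res.params["Intercept"] + res.params["C(bin)[T.High]"],
}

print(group_means)
print(implied)
```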


### Q5: Summarize what you've learned so far


In [17]:
# PROB 3.A
# opt A: credit score to the power of 2 inside the formula.
# use pow(), np.square() or np.power() to create the square
# note: a bare **2 doesn't work :( because ** is a formula operator; wrapping it as I(x**2) does work
sm_ols('Original_Interest_Rate ~ Borrower_Credit_Score_at_Origination + pow(Borrower_Credit_Score_at_Origination,2)',
       data=fannie_mae).fit().summary()

# opt B: create the square term before the regression and give that to the function
fannie_mae['credit_score2'] = fannie_mae['Borrower_Credit_Score_at_Origination']**2

sm_ols('Original_Interest_Rate ~ Borrower_Credit_Score_at_Origination + credit_score2',
       data=fannie_mae).fit().summary()

0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.128
Model:,OLS,Adj. R-squared:,0.128
Method:,Least Squares,F-statistic:,9880.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,0.0
Time:,18:09:15,Log-Likelihood:,-215580.0
No. Observations:,134481,AIC:,431200.0
Df Residuals:,134478,BIC:,431200.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.8016,0.481,5.822,0.000,1.858,3.745
Borrower_Credit_Score_at_Origination,0.0160,0.001,11.930,0.000,0.013,0.019
credit_score2,-1.704e-05,9.3e-07,-18.329,0.000,-1.89e-05,-1.52e-05

0,1,2,3
Omnibus:,2433.505,Durbin-Watson:,0.399
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2428.353
Skew:,0.306,Prob(JB):,0.0
Kurtosis:,2.759,Cond. No.,82100000.0
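
A sketch comparing three ways to get a squared term, on simulated data (the names `x`, `y`, `x2` are made up). The `I()` wrapper tells the formula parser to treat what is inside as plain Python arithmetic.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

# Simulated data (hypothetical): y is quadratic in x
rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=400)})
df["y"] = 1 + 2 * df["x"] - 0.1 * df["x"] ** 2 + rng.normal(size=400)

res_I = sm_ols("y ~ x + I(x**2)", data=df).fit()           # I() protects ** from the formula parser
res_pow = sm_ols("y ~ x + np.power(x, 2)", data=df).fit()  # a function call inside the formula

df["x2"] = df["x"] ** 2
res_col = sm_ols("y ~ x + x2", data=df).fit()              # precomputed column

print(res_I.params)
```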


### Q6: Plot the regression

_If time is tight: I'll do it._

Plot 1:
- Plot a scatterplot: Plot as X the credit score variable. As Y, use our y.
- On top of that, lineplots:
    - Rerun Q1a's reg and plot the yhat values. 
    - Let's talk about this.
    - Rerun Q1b's reg and plot the yhat values.
    - Compare to the prior line.
    
Plot 2:
- Plot a scatterplot: Plot as X the credit score variable. As Y, use our y.
- On top of that, lineplots:
    - Rerun Q4b's reg and plot the yhat values, hued by credit bin
  
Plot 3:
- Plot a scatterplot: Plot as X the credit score variable. As Y, use our y.
- On top of that, lineplots:
    - Rerun Q4b's reg BUT WITH credit score as a variable and plot the yhat values, hued by credit bin  
    
_Note: statsmodels has some useful plotting functions. My favs are influence_plot (can be slow) and plot_partregress_grid._

In [22]:
# plot 1

# a normal scatterplot
sns.scatterplot(x='Borrower_Credit_Score_at_Origination',
                y='Original_Interest_Rate',
                alpha = .05,
                data=fannie_mae.sample(5000,random_state=20)) # sampled just to avoid overplotting

# get the predicted y values (yhat) 
# .fit().predict() predicts yhats after fitting the regression
yhat_reg1 = sm_ols('Original_Interest_Rate ~  Borrower_Credit_Score_at_Origination ',  
               data=fannie_mae).fit().predict()

yhat_reg2 = sm_ols('Original_Interest_Rate ~  l_credscore ',  
               data=fannie_mae).fit().predict()


# there is ONE issue with plotting this: the predictions can only be made for 
# obs with a non-missing value for all the Xs in the reg!
# print('len yhat:',len(yhat_reg1), '   len fannie:',len(fannie_mae))

# so if you plot x=cred score, y=yhat, the vectors won't line up
# the trick is to drop missing values in the X vector:

sns.lineplot(x=fannie_mae['Borrower_Credit_Score_at_Origination'].dropna(),
             y=yhat_reg1,color='red')

sns.lineplot(x=fannie_mae['Borrower_Credit_Score_at_Origination'].dropna(),
             y=yhat_reg2,color='black')

plt.show()
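
Another way to handle the missing-value alignment issue, sketched on a toy DataFrame (the values are made up): with the formula interface and a pandas DataFrame, `fittedvalues` should come back as a Series indexed by the rows that were actually used, so pandas can line it up for you.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

# Toy data (hypothetical) with one missing x
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

res = sm_ols("y ~ x", data=df).fit()   # the row with missing x is dropped

fitted = res.fittedvalues              # Series indexed by the rows actually used
df["yhat"] = fitted                    # pandas aligns on index; the dropped row stays NaN
print(df)
```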



In [23]:
# plot 2

# a df with no missing values for our variables
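# .copy() makes subset its own DataFrame, so adding a column below doesn't trigger SettingWithCopyWarning
subset = fannie_mae.dropna(subset=["Original_Interest_Rate", "Borrower_Credit_Score_at_Origination", "creditbins"]).copy()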


# a normal scatterplot
sns.scatterplot(x='Borrower_Credit_Score_at_Origination',
                y='Original_Interest_Rate',
                alpha = .05,
                data=subset.sample(5000,random_state=20)) # sampled just to avoid overplotting

# get the predicted y values (yhat) 
subset['yhat_reg3'] = sm_ols('Original_Interest_Rate ~  C(creditbins) ', 
                             data=subset).fit().predict()

# no missing values to worry about in subset
sns.lineplot(x='Borrower_Credit_Score_at_Origination',y='yhat_reg3',
             data=subset,hue='creditbins')


plt.show()



In [24]:
# plot 3

# a df with no missing values for our variables
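# .copy() makes subset its own DataFrame, so adding a column below doesn't trigger SettingWithCopyWarning
subset = fannie_mae.dropna(subset=["Original_Interest_Rate", "Borrower_Credit_Score_at_Origination", "creditbins"]).copy()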


# a normal scatterplot
sns.scatterplot(x='Borrower_Credit_Score_at_Origination',
                y='Original_Interest_Rate',
                alpha = .05,
                data=subset.sample(5000,random_state=20)) # sampled just to avoid overplotting

# get the predicted y values (yhat) 
subset['yhat_reg3'] = sm_ols('Original_Interest_Rate ~  C(creditbins) + Borrower_Credit_Score_at_Origination', 
                             data=subset).fit().predict()

# no missing values to worry about in subset
sns.lineplot(x='Borrower_Credit_Score_at_Origination',y='yhat_reg3',
             data=subset,hue='creditbins')


plt.show()



## Regression with SKLEARN

I don't usually like running regressions in `sklearn`. The main reason to do so is if you're doing a typical ML task that sklearn excels at (meaning: "pipelines", which is a term you'll understand later in the course) or if you know you're going to be using other sklearn models anyway (in which case, you'll already be doing the setup for sklearn).

But I want to run at least one regression in SKLEARN for you so you can see how the mechanics are similar, and how they differ. We will cover sklearn more in future classes.

Pseudocode for a reg in sklearn is similar. The differences:
1. A little more work setting up the data
1. `.fit()` gets the data passed to it 
1. The `results` object is different from statsmodels'

```python

# 1. import the "class" of model from sklearn

from sklearn.linear_model import LinearRegression

# 2. arrange the data - more work than statsmodels

# Issue: sklearn doesn't work with missing values, so drop any obs with missing values
# replace vars_in_your_reg with a list of variables you want to use, including y
subset = df[vars_in_your_reg].dropna()

# explicitly set up the y variable and the X variables you want
y = subset['y'] # whatever the y variable is
X = subset[['X1','X2']] # list the X vars

# 3. set up the model ("instantiate the model")
# every class of models has "hyperparameters" that control how you want the model to work
# below, fit_intercept=True is a "hyperparameter" for OLS models 
# hyperparameters are the things inside the parentheses of the model class when you declare it

model = LinearRegression(fit_intercept=True)
result = model.fit(X,y) # in sklearn, you put X and y inside fit!!!

# the result object is different in sklearn
# result.intercept_ (the constant in the model)
# result.coef_ (the coefficients on the X vars)

```
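
The steps above as a runnable sketch on simulated data. The names `df`, `y`, `X1`, `X2` and the true coefficients are assumptions for illustration; a few values are set to missing on purpose to show why the `dropna()` step is there.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Simulated data (hypothetical names y, X1, X2) with a few missing values
rng = np.random.default_rng(5)
df = pd.DataFrame({"X1": rng.normal(size=300), "X2": rng.normal(size=300)})
df["y"] = 1 + 2 * df["X1"] - 3 * df["X2"] + rng.normal(scale=0.5, size=300)
df.loc[:4, "X1"] = np.nan                  # sklearn would raise an error on these rows

subset = df[["y", "X1", "X2"]].dropna()    # keep only the regression vars, then drop NaNs
y = subset["y"]
X = subset[["X1", "X2"]]

model = LinearRegression(fit_intercept=True)
result = model.fit(X, y)                   # fit() returns the fitted model itself

print("INTERCEPT:", result.intercept_)
print("COEFS:", result.coef_)
print("R2:", result.score(X, y))           # R-squared on the data passed in
yhat = result.predict(X)                   # predicted values
```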


## Q7: STUDENT DEMO - regressions **using sklearn**

A. Regress the interest rate on the natural log of the loan-to-value using the sklearn method.

B. Regress the interest rate on the credit score using the sklearn method.

In [25]:
# A
y = fannie_mae['Original_Interest_Rate']   # pick y
X = fannie_mae[['l_LTV']]

model3 = LinearRegression()               # set up the model object (but don't tell sklearn what data it gets!)   
results3 = model3.fit(X,y)                # fit it, and tell it what data to fit on
print('INTERCEPT:', results3.intercept_)  # to get the coefficients, you print out the intercept
print('COEFS:', results3.coef_)           # and the other coefficients separately (yuck)

INTERCEPT: 3.760340891975928
COEFS: [0.3512853]


In [26]:
# B

# # run with this first: won't work! NaN
# y = fannie_mae['Original_Interest_Rate']   # pick y
# X = fannie_mae[['Borrower_Credit_Score_at_Origination']]  

# # then use X and y after a plain .dropna() (without the restriction to our vars)
# #     this will yield the wrong coefs! (dropna drops a row if ANY var is missing)
# #     but statsmodels only drops rows where vars in the reg are missing
# # so then use these X and y (with the restriction to our vars)
# #     notice - same coefs as statsmodels!

reg_data = (fannie_mae.copy() # copy to ensure we don't change the orig data
            [['const','Borrower_Credit_Score_at_Origination','Original_Interest_Rate']] 
            .dropna())
y = reg_data['Original_Interest_Rate']   # pick y
X = reg_data[['const','Borrower_Credit_Score_at_Origination']]  

# # copied from webpage direct:
# # NOTE: X INCLUDES AN EXPLICIT CONSTANT, BUT LinearRegression() FITS ITS OWN INTERCEPT BY DEFAULT,
# # SO THE COEF ON const IS 0 IN THE OUTPUT BELOW

model3 = LinearRegression()               # set up the model object (but don't tell sklearn what data it gets!)   
results3 = model3.fit(X,y)                # fit it, and tell it what data to fit on
print('INTERCEPT:', results3.intercept_)  # to get the coefficients, you print out the intercept
print('COEFS:', results3.coef_)           # and the other coefficients separately (yuck)

INTERCEPT: 11.581857988176292
COEFS: [ 0.         -0.00855164]
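
The 0 in the output above is the coefficient on `const`. Here is a sketch, on simulated data with made-up names, of how a constant column interacts with the `fit_intercept` hyperparameter.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Simulated data (hypothetical): y = 4 + 1.5*x + noise, plus an explicit constant column
rng = np.random.default_rng(6)
X = pd.DataFrame({"const": 1.0, "x": rng.normal(size=200)})
y = 4 + 1.5 * X["x"] + rng.normal(scale=0.1, size=200)

# default: sklearn fits its own intercept, so the constant column is redundant
with_int = LinearRegression(fit_intercept=True).fit(X, y)

# fit_intercept=False: the constant column carries the intercept instead
no_int = LinearRegression(fit_intercept=False).fit(X, y)

print(with_int.intercept_, with_int.coef_)
print(no_int.intercept_, no_int.coef_)
```

Either setup recovers the same slope; the only difference is where the intercept shows up.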
