# Lab: Yes, Even More Linear Regression
## CMSE 381 - Fall 2023
## Lecture 8, Sep 16, 2023

In the last few lectures, we have focused on linear regression, that is, fitting models of the form 
$$
Y =  \beta_0 +  \beta_1 X_1 +  \beta_2 X_2 + \cdots +  \beta_pX_p + \varepsilon
$$
In this lab, we will continue to use two different tools for linear regression. 
- [Scikit-learn](https://scikit-learn.org/stable/index.html) is arguably the most used tool for machine learning in Python
- [Statsmodels](https://www.statsmodels.org) provides many of the statistical tests we've been learning in class

This lab will be a recap of what we covered in the last few lectures.

In [None]:
# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf

## Let's look back on our college data set:

In [None]:
college_df = pd.read_csv('College.csv')
college_df = college_df.rename({'Unnamed: 0': 'College'}, axis=1)
college_df = college_df.set_index('College')
# Renaming for later: smf gets angry about the periods in the column names
college_df = college_df.rename(columns={'S.F.Ratio': 'SFRatio', 'Grad.Rate': 'GradRate'})
college_df.head()


## Out of state tuition and Private schools

Remember in the homework how we made boxplots of the out of state tuition for private and public schools? We're going to now use linear regression to see if there is statistical evidence of a difference in the out of state tuition between private and public schools.

Let's first ask ourselves some questions:

&#9989; **<font color=red>Q:</font>**
- What is the response?
- What is the predictor?
- What is the model?
- What is the null hypothesis?

In [None]:
# As a reminder, here's the boxplot
college_df.boxplot('Outstate', by='Private')

&#9989; **<font color=red>Do now:</font>** Use `smf.ols` to fit the model and print out the summary. Remember we learned a shortcut for this in the last lecture notebook.
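If you want a refresher on how the formula interface handles a Yes/No column, here is a minimal sketch on made-up data. The names `toy_df`, `group`, `y`, and `toy_est` are placeholders for illustration only and are not part of the College data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the College data): a numeric response and a Yes/No label.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({"group": ["Yes"] * 50 + ["No"] * 50})
toy_df["y"] = np.where(toy_df["group"] == "Yes", 10.0, 6.0) + rng.normal(0, 1, size=100)

# The formula API converts the string column into a 0/1 dummy variable.
# The alphabetically first level ("No") is used as the baseline.
toy_est = smf.ols("y ~ group", data=toy_df).fit()

# The coefficient on the dummy is labeled group[T.Yes].
print(toy_est.params)
```

With a single two-level predictor, the intercept is the mean response of the baseline group and the dummy coefficient is the difference between the two group means.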

In [None]:
#your code here

Now that we have some results, let's interpret them.

&#9989; **<font color=red>Q:</font>**
- What is the equation of the line?
- Do we have evidence to reject the null hypothesis? Why or why not?
- Based on your boxplot from the homework, are you surprised by the results?
- What are the confidence intervals for $\beta_0$ and $\beta_1$?
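
The confidence intervals appear in the summary table, but a fitted model can also return them directly. The sketch below uses made-up data, and `toy_df`, `x`, `y`, and `toy_est` are illustrative names only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only.
rng = np.random.default_rng(3)
toy_df = pd.DataFrame({"x": rng.normal(size=100)})
toy_df["y"] = 3.0 + 1.5 * toy_df["x"] + rng.normal(size=100)

toy_est = smf.ols("y ~ x", data=toy_df).fit()

# conf_int returns one row per coefficient; columns 0 and 1 hold the lower and upper bounds.
ci = toy_est.conf_int(alpha=0.05)
ci.columns = ["lower", "upper"]
print(ci)

# p-values for the test that each coefficient equals zero.
print(toy_est.pvalues)
```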


Let's now predict values for out of state tuition for private and non-private schools. We can still use the `smf` package to do this for us. The code below assumes you named your fitted model `est`. As you can see, `predict` accepts a dictionary keyed by predictor name.

In [None]:
est.predict({'Private': ['Yes', 'No']})

&#9989; **<font color=red>Q:</font>** Do these values make sense? Why or why not?
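
One way to sanity check predictions from a model with a single categorical predictor is to compare them with the group means. Here is a small sketch on made-up data, where `toy_df`, `group`, `y`, and `toy_est` are hypothetical names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the College data).
rng = np.random.default_rng(1)
toy_df = pd.DataFrame({"group": ["Yes"] * 40 + ["No"] * 60})
toy_df["y"] = np.where(toy_df["group"] == "Yes", 12.0, 7.0) + rng.normal(0, 2, size=100)

toy_est = smf.ols("y ~ group", data=toy_df).fit()

# predict takes a dict keyed by the predictor names used in the formula.
toy_pred = np.asarray(toy_est.predict({"group": ["Yes", "No"]}))

# Compare against the plain group averages, in the same Yes/No order.
group_means = toy_df.groupby("group")["y"].mean().loc[["Yes", "No"]].to_numpy()
print(toy_pred)
print(group_means)
```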

## Out of state tuition and more variables

&#9989; **<font color=red>Q:</font>** What are the coefficients for this model:
$y = \beta_0 + \beta_1 X_{private} + \beta_2 X_{Top10perc} + \beta_3 X_{SFRatio} + \beta_4 X_{GradRate}$?


In [None]:
#write your code here

&#9989; **<font color=red>Q:</font>** Does the coefficient for $X_{private}$ change when we add more predictor variables, compared to when it was the only predictor? Why or why not?



&#9989; **<font color=red>Q:</font>** Are we confident that at least one of the predictors is associated with the response? Why or why not?


&#9989; **<font color=red>Q:</font>** What is our prediction for out of state tuition at a private school with 20% of the students from the top 10% of their high school class, a student faculty ratio of 15, and a graduation rate of 75%?

In [None]:
#your code here

&#9989; **<font color=red>Q:</font>** Find the RSS and $R^2$ values for estimating out of state tuition using just the Private variable, and then using Private, Top10perc, SFRatio, and GradRate. Use `.ssr` and `.rsquared` respectively after fitting each model. What do you notice?
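
As a rough illustration of how `.ssr` and `.rsquared` behave for nested models, here is a sketch on made-up data. The columns `x1`, `x2`, `y` and the model names `small` and `large` are assumptions for this example only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only.
rng = np.random.default_rng(2)
n = 200
toy_df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy_df["y"] = 1.0 + 2.0 * toy_df["x1"] + 0.5 * toy_df["x2"] + rng.normal(size=n)

# Two nested models: the larger one contains every predictor of the smaller one.
small = smf.ols("y ~ x1", data=toy_df).fit()
large = smf.ols("y ~ x1 + x2", data=toy_df).fit()

for name, model in [("small", small), ("large", large)]:
    # .ssr is the residual sum of squares (RSS); .rsquared is R^2 = 1 - RSS/TSS.
    print(name, "RSS:", model.ssr, "R^2:", model.rsquared)
```

On training data, adding a predictor to a least squares fit cannot increase the RSS, so $R^2$ cannot decrease.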

In [None]:
#your code here

## Interaction Terms

What happens if we add an interaction term between SFRatio and GradRate? Fit the model and look at the RSS and $R^2$ results.

&#9989; **<font color=red>Do now:</font>** Add a new column to the college dataframe that is the product of SFRatio and GradRate. Then fit two models: one with predictor variables SFRatio and GradRate, and one with predictor variables SFRatio, GradRate AND your new column.

&#9989; **<font color=red>Q:</font>** What are the RSS and $R^2$ values for the two models? What do you notice?
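
For reference, here is a sketch of the two usual ways to include an interaction, shown on made-up data. The columns `x1`, `x2`, `x1_x2`, `y` and the model names are hypothetical. The `x1:x2` syntax is the formula-language shorthand for the product of the two columns.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a built-in interaction effect, for illustration only.
rng = np.random.default_rng(4)
n = 300
toy_df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy_df["y"] = (1.0 + toy_df["x1"] + toy_df["x2"]
               + 2.0 * toy_df["x1"] * toy_df["x2"] + rng.normal(size=n))

# Option 1: build the product column by hand.
toy_df["x1_x2"] = toy_df["x1"] * toy_df["x2"]

main_only = smf.ols("y ~ x1 + x2", data=toy_df).fit()
with_product = smf.ols("y ~ x1 + x2 + x1_x2", data=toy_df).fit()

# Option 2: let the formula create the interaction term.
with_formula = smf.ols("y ~ x1 + x2 + x1:x2", data=toy_df).fit()

print("main effects only:", main_only.ssr, main_only.rsquared)
print("with product column:", with_product.ssr, with_product.rsquared)
print("with x1:x2 in formula:", with_formula.ssr, with_formula.rsquared)
```

Both options fit the same model, so their RSS and $R^2$ values agree.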

In [None]:
#your code here

In [None]:
print(est.ssr)
print(est.rsquared)

Looks like the interaction term made a difference! But what's going on with the SFRatio p-value?

&#9989; **<font color=red>Do now:</font>** Check out the summary table if you haven't already.

&#9989; **<font color=red>Q:</font>** Does this mean we should throw out the SFRatio variable? What principle does this point to?

## Keep exploring!
Use any remaining time to build more models with some of the other columns we haven't used yet. Keep track of what you try and write a brief summary of what you found.


-----
### Congratulations, we're done!

Written by Rachel Roca, Michigan State University
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.