# STA 130 Lab 10: Multiple Regression
*Your Name Here*

# Instructions

Work in teams of 2 or 3 students. Each student should submit their own final report.

Your task for this lab is to develop a model that estimates peak flood discharges in urban areas across the state of North Carolina. This model will take the form of a multiple linear regression equation. For $i = 1,...,n$:

$$
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_p X_{pi} + \epsilon_i
$$

It will be up to you to determine which predictors should be included in your model.  

# Data
Your data have been sampled at 32 of the 45 sites on the map on page 8 of the attached report from the United States Geological Survey. Some sites were omitted due to poor data quality. The data file `USGSdata.csv` contains the following variables for each site:

### Independent Variables

- **DA**: contributing drainage area ($mi^2$), measured from topographic maps.
- **L**: channel length ($mi$), measured from gaging station upstream to basin divide.
- **S**: channel slope ($ft/mi$), calculated from the difference in elevation between the 10- and 85-percent points along the stream channel.
- **IA**: impervious area ($\%$), found as portion of map covered by road, parking lots, roofs, etc.
- **BDF**: basin development factor, a “score” of urban drainage improvements (drains, sewers, gutters, etc.).
- **RI22**: 2-year, 2-hour rainfall amount (inches of precipitation for 2-hour long storm with 2-year recurrence).
- **RQ100**: rural equivalent peak discharge for 100-year flood ($ft^3/s$). This gives a measure of the baseline/background flood occurrence for the area: it is the rate of flow for a one-in-100 year flood for natural/rural areas.

### Dependent (target) variable

- **Q100**: urban discharge for 100-year flood ($ft^3/s$).  These data are the flow rates for observed one-in-100 year floods for our urban areas of interest.

## Methods

Before fitting the model, you may need to transform your variables so that the independent variables have linear relationships with the dependent one. A good way to judge this is to plot the target variable ($Y$) against each predictor of interest ($X_j$, $j = 1,...,p$), one at a time. (Hint: see the OpenIntro Stats text at https://www.openintro.org/stat/textbook.php, particularly sections 8.2.2 on model selection strategies and 8.2.3 on checking model assumptions using graphs.)
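
One way to complement the plots numerically is to compare how linear a relationship looks before and after a transformation. The sketch below is only an illustration: it uses a small synthetic power-law data set (not the USGS values), the helper name `linearity_check` is hypothetical, and it assumes a base-10 log transform is a candidate, which you should confirm against your own plots.

```python
import numpy as np
import pandas as pd


def linearity_check(df, x_col, y_col):
    """Return the Pearson correlation of y vs. x on the raw and log-log scales."""
    raw_corr = df[x_col].corr(df[y_col])
    log_corr = np.log10(df[x_col]).corr(np.log10(df[y_col]))
    return raw_corr, log_corr


# Synthetic stand-in data (NOT the USGS values): y follows an exact power law in x
demo_df = pd.DataFrame({"x": np.linspace(0.1, 40.0, 32)})
demo_df["y"] = 200.0 * demo_df["x"] ** 0.7

raw_corr, log_corr = linearity_check(demo_df, "x", "y")
print(f"raw scale: r = {raw_corr:.3f}, log-log scale: r = {log_corr:.3f}")
```

For the lab data, the same idea applies to column pairs such as `DA` and `Q100`. A scatter plot, for example `flood_df.plot.scatter(x="DA", y="Q100")`, remains the primary check; a correlation alone can hide curvature.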

We will use the Python library `statsmodels` to build our regression models. Note that model-building can also be accomplished using the package `scikit-learn`, which includes many more models and helpful features but takes a machine learning approach to model building. We will be focusing on the basic elements of the model, for which `statsmodels` is much more transparent.

Ultimately, the goal when building a model should be parsimony: you want the most powerful yet simplest model you can build. Adjusted $R^2$ rewards goodness of fit but penalizes additional predictors, so you would like to find the model that maximizes adjusted $R^2$.
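
For reference, adjusted $R^2$ shrinks $R^2$ according to how many parameters the model spends. The function below is a minimal sketch of the usual formula; the name `adjusted_r2` and its `has_intercept` flag are hypothetical and not part of `statsmodels`, which reports the value directly as `rsquared_adj` on a fitted result.

```python
def adjusted_r2(r2, n, p, has_intercept=True):
    """Adjusted R-squared for n observations and p predictors.

    With an intercept:    1 - (1 - r2) * (n - 1) / (n - p - 1)
    Without an intercept: 1 - (1 - r2) * n / (n - p)
    """
    k = 1 if has_intercept else 0
    return 1.0 - (1.0 - r2) * (n - k) / (n - p - k)


# Same R-squared, but the model with more predictors is penalized more heavily
print(adjusted_r2(0.90, n=32, p=3))
print(adjusted_r2(0.90, n=32, p=7))
```

Two models with the same $R^2$ therefore do not tie: the one with fewer predictors gets the higher adjusted value.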

```python
import statsmodels.api as stats
import pandas as pd
```

```python
flood_df = pd.read_csv("./USGSdata.csv")
flood_df.shape
```

```
(32, 8)
```

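```python
flood_df.describe()
```

|       | Q100        | BDF     | DA        | IA        | L        | RI22     | RQ100      | S         |
|-------|-------------|---------|-----------|-----------|----------|----------|------------|-----------|
| count | 32.0        | 32.0    | 32.0      | 32.0      | 32.0     | 32.0     | 32.0       | 32.0      |
| mean  | 2091.46875  | 6.34375 | 6.32625   | 20.123438 | 3.175    | 2.05     | 1640.125   | 81.2375   |
| std   | 2492.547938 | 2.63487 | 11.533003 | 13.29883  | 3.060744 | 0.170389 | 2214.90389 | 90.096943 |
| min   | 48.0        | 2.0     | 0.04      | 2.0       | 0.11     | 1.9      | 10.8       | 9.0       |
| 25%   | 693.75      | 4.0     | 0.64      | 10.85     | 1.0925   | 1.9      | 431.5      | 17.025    |
| 50%   | 1225.0      | 6.0     | 1.315     | 16.25     | 1.955    | 2.1      | 799.0      | 50.5      |
| 75%   | 2170.0      | 9.0     | 3.8075    | 28.5      | 3.9125   | 2.125    | 1490.0     | 122.5     |
| max   | 10300.0     | 11.0    | 41.0      | 54.6      | 11.2     | 2.6      | 7830.0     | 375.0     |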


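```python
# instantiate model
# note: OLS does not add an intercept automatically, so this fit has no
# constant term and the summary reports uncentered R-squared
flood_lm = stats.OLS(
    endog=flood_df["Q100"],
    exog=flood_df[["BDF", "DA", "L"]],
    missing="drop"
)

# fit model
lm_results = flood_lm.fit()
print(lm_results.summary())
```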

```
                                 OLS Regression Results
Dep. Variable:                   Q100   R-squared (uncentered):                   0.923
Model:                            OLS   Adj. R-squared (uncentered):              0.915
Method:                 Least Squares   F-statistic:                              116.3
Date:                Mon, 26 Aug 2019   Prob (F-statistic):                    2.87e-16
Time:                        02:15:37   Log-Likelihood:                         -262.83
No. Observations:                  32   AIC:                                      531.7
Df Residuals:                      29   BIC:                                      536.1
Df Model:                           3
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
```

## Assignment
1. Perform exploratory data analysis on the variables using charts and graphs. Include some of this output and commentary in your final report.
2. Fit a linear model to the data that predicts the target variable **Q100**. Include the summary output from this model, and also write out the final equation symbolically using Markdown.
3. Justify your choice of final model using values from the regression output. Why was it better than other models which you may have explored?
4. Present your plot of the residuals for this final model. Provide justification for how this plot shows that the residuals satisfy the normality assumption.
5. Discuss whether or not you think the various assumptions that go into the multiple linear regression model hold up for this dataset. How can the physical layout of the stations affect assertions about constant variance over our data? What about independence?

**Note:** You will be graded in part on the quality of your final model. It should have a sufficiently large adjusted $R^2$ (high 0.80s) and be sufficiently parsimonious!