In [2]:
# Primary libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Custom functions
from src import helpers

# Linear Regression Practice

A dataset called `diabetes.csv` is stored in the `data` folder for this repository. 

In the cell below, read in the dataset using pandas, and output the head of the dataframe. Assign the dataframe to the variable `df`.

In [None]:
# Your code here

In [None]:
#==SOLUTION== 
df = pd.read_csv('data/diabetes.csv')
df.head()

**Each row in this dataset represents a patient with diabetes.** 

For this assignment, the variables of focus will be:
* age
* sex
* bmi
* bp
* target

<details>
    <summary>
        <i>Click here to view the documentation for the dataset
        </i>
    </summary>
    <h1>Diabetes dataset</h1>
    <p>


Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables has been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
    
  </p>
</details>
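
The scaling mentioned in the documentation note can be checked on made-up numbers. The sketch below is only an illustration: the `raw` column is synthetic (it is not taken from `diabetes.csv`), and it assumes the population standard deviation (`ddof=0`) is the one used. Dividing a centered column by its standard deviation times the square root of the number of samples gives a column whose squares sum to 1.

```python
import numpy as np

# Synthetic stand-in for one raw feature column (hypothetical values, not the real data)
rng = np.random.default_rng(0)
raw = rng.normal(loc=50, scale=12, size=442)

n_samples = raw.shape[0]

# Step 1: mean center the column
centered = raw - raw.mean()

# Step 2: divide by (population standard deviation * sqrt(n_samples))
scaled = centered / (raw.std() * np.sqrt(n_samples))

print(scaled.mean())        # approximately 0
print((scaled ** 2).sum())  # approximately 1
```

Because every feature is on this scale, a value of 0 corresponds to the column average, and typical values are small fractions rather than raw measurements.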

Let's go ahead and select the desired columns.

In [None]:
df = df[['target', 'age', 'sex', 'bmi', 'bp']]
df.head(2)

In [6]:
# Run this cell unchanged
helpers.independent.display()

**For our first model, let's figure out which column is most correlated with the target.**

In the cell below,

* Identify the most correlated feature with the target column.
* Assign the name of the column to the variable `most_correlated`.

In [None]:
# Your code here

In [None]:
#==SOLUTION== 
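# Sort the correlation matrix by the target column. The last row is target itself
# (correlation of 1.0), so the second-to-last row is the most correlated feature.
most_correlated = df.corr().sort_values('target').iloc[-2].name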
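
As an alternative sketch, the same idea can be written without relying on row positions. The example below uses a small synthetic dataframe (`toy`, with hypothetical columns `x_strong` and `x_noise`), not the diabetes data. It ranks features by the absolute value of their correlation, which is an assumption about what "most correlated" should mean when a feature is strongly negatively correlated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the diabetes dataset
rng = np.random.default_rng(1)
x_strong = rng.normal(size=200)
toy = pd.DataFrame({
    'target': 3 * x_strong + rng.normal(scale=0.5, size=200),
    'x_strong': x_strong,
    'x_noise': rng.normal(size=200),
})

# Correlation of every feature with the target, excluding the target itself
correlations = toy.corr()['target'].drop('target')

# Rank by absolute value so that strong negative correlations are not overlooked
toy_most_correlated = correlations.abs().idxmax()
print(toy_most_correlated)
```

Dropping the `target` row explicitly avoids depending on the target always sorting to the last position.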

In [4]:
# Run this cell unchanged!
helpers.correlation.display()


In [5]:
# Run this cell unchanged!
helpers.correlation_strong.display()


Let's create a model using the most correlated feature as the only predictor.

## There are two main ways of creating a linear regression model when using `statsmodels`. 
### 1. The pythonic way
   - For this approach you will typically see statsmodels imported with the line `import statsmodels.api as sm`.
   - Using this approach, you create the model by passing the actual data objects into the model, like so:
        
        --------------
        ```python
        model = sm.OLS(df.target, df.bmi)
        model_results = model.fit()
        ```
        --------------
       
       
   - This approach can be handy when you have a lot of features and do not wish to type the name of each column, because you can pass in an entire dataframe of predictors.
   - One small annoyance with this approach is that the model does not use an intercept by default, so you typically have to add the intercept manually:
    
    
    --------------
    ```python
    model = sm.OLS(df.target, df[['bmi']].assign(intercept=1))
    model_results = model.fit()
    ```
    
    ----------------
    
### 2. The `R` formula way
   - For this approach you will typically see statsmodels imported with the line `import statsmodels.formula.api as smf`
   - Using this approach, you write your linear equation as a string with the following format:
    
```python
'{dependent_variable} ~ {independent_variable_1} + ... + {independent_variable_n}'
```
        
   - In this case, with a dependent variable of `target` and a single independent variable of `bmi`, our formula looks like this:
   ```python
    'target ~ bmi'
    ```
   - And the full modeling code looks like this:
   
   --------
   
```python
formula = 'target ~ bmi'
model = smf.ols(formula, data=df)
model_results = model.fit()
```
    
   --------
    
   - Using this approach, the intercept is added by default
   - One downside of this approach is that writing the formula can be a little cumbersome when you have a lot of features
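
One way to reduce the typing when there are many features is to build the formula string from a list of column names. This is a minimal sketch: `build_formula` is a hypothetical helper written for this illustration, not part of statsmodels, and it assumes the column names are valid Python identifiers.

```python
# Hypothetical helper: build an R-style formula from a list of column names
def build_formula(dependent, predictors):
    """Join the predictor names with ' + ' and place them after the '~'."""
    return f"{dependent} ~ {' + '.join(predictors)}"


predictors = ['age', 'sex', 'bmi', 'bp']
formula_all = build_formula('target', predictors)
print(formula_all)  # target ~ age + sex + bmi + bp
```

The resulting string can be passed to `smf.ols` in the same way as a formula typed by hand.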
   
   
## tl;dr There are multiple ways of creating a model, but either option works perfectly fine. 

In this notebook, we will focus on using the `R` formula method for the following reasons:
1. This bootcamp focuses primarily on *Ordinary Least Squares Linear Regression*, but there are some more advanced versions of linear regression in statsmodels that are only supported by the formula approach. Because of this, familiarity with the formula technique is highly encouraged.
2. The formula approach adds an intercept term by default, which is extremely convenient!

**In the cell below, write the formula for our first linear regression model and assign the string to a variable called `formula`.**

In [None]:
# Your code here

In [None]:
#==SOLUTION== 
formula = 'target ~ bmi'

Let's fit the model and interpret the results!

In [None]:
model = smf.ols(formula, df)
model_results = model.fit()
model_results.summary().tables[1]

**Using the table above, we have all the information we need to write a linear equation.**

Intercept = 152.1335

Slope = 949.4353

Linear Equation:  $\text{target} = 152.1335 + 949.4353 \cdot \text{bmi}$
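
The equation can be turned into a small prediction function. This is a sketch that hard-codes the coefficients reported above; `predict_target` is a hypothetical name used only here, and the `bmi` inputs are assumed to be on the same scaled units as the dataset column.

```python
# Coefficients copied from the summary table above
intercept = 152.1335
slope = 949.4353


def predict_target(bmi):
    """Return the predicted target for a (scaled) bmi value."""
    return intercept + slope * bmi


print(predict_target(0.0))   # 152.1335
print(predict_target(0.05))
```

In practice, `model_results.predict` can produce the same predictions directly from the fitted model without copying the coefficients by hand.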

## Interpret the numbers

#### Interpreting the intercept

In [None]:
# Run this cell unchanged
helpers.intercept.display()

#### Interpreting the slope

In [None]:
# Run this cell unchanged
helpers.slope.display()