# Week 9 - Assessed exercises

In this week's exercises we will fit a regression model and create a stepwise AIC function.

You **must** submit your answers on Moodle.

Unfortunately, statsmodels is not installed on CodeRunner, so the questions this week only check your final answer. If you would like to receive partial credit for an incorrect answer, you should submit your code on Brightspace (Assessment $\rightarrow$ Assignments $\rightarrow$ Week 9 Exercises). This is not a requirement. I will only look at your submitted code if one of your answers on Moodle is incorrect.

In [4]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import numpy.random as npr

Load in the prostate data and create a DataFrame X which contains the following columns:
* A column of 1s
* The variables lcavol, lweight, age, lbph, svi, lcp, gleason, and pgg45 standardised to have mean 0 and standard deviation 1

Create a Series y which contains the lpsa variable, also standardised to have mean 0 and standard deviation 1.
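As a rough illustration of the standardisation step, the sketch below works on a small made-up DataFrame (`toy` and its values are assumptions, not the prostate data). It standardises all columns in one vectorised expression and then adds the column of 1s. Note that `np.std` uses `ddof=0` by default while the pandas `.std()` method uses `ddof=1`, so the choice is made explicit here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 50.0]})

# Standardise every column at once. ddof=0 gives the population standard
# deviation, which is what np.std computes by default.
toy_std = (toy - toy.mean()) / toy.std(ddof=0)

# Put the intercept column of 1s in front.
toy_std.insert(0, "1s", 1.0)
```

Whichever `ddof` is used, it should be used consistently for X and y so that both are on the same scale.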

In [5]:
# Load dataset.
prostate = pd.read_csv('http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data',sep='\t')
prostate.head()

| | Unnamed: 0 | lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa | train |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -0.579818 | 2.769459 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.430783 | T |
| 1 | 2 | -0.994252 | 3.319626 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.162519 | T |
| 2 | 3 | -0.510826 | 2.691243 | 74 | -1.386294 | 0 | -1.386294 | 7 | 20 | -0.162519 | T |
| 3 | 4 | -1.203973 | 3.282789 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.162519 | T |
| 4 | 5 | 0.751416 | 3.432373 | 62 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.371564 | T |


In [6]:
# Create arrays with the desired columns.
col1 = np.ones(prostate.shape[0])
col2 = prostate["lcavol"]
col3 = prostate["lweight"]
col4 = prostate["age"]
col5 = prostate["lbph"]
col6 = prostate["svi"]
col7 = prostate["lcp"]
col8 = prostate["gleason"]
col9 = prostate["pgg45"]
col_list = [col2, col3, col4, col5, col6, col7, col8, col9]

Which column of X has the smallest correlation with y?

In [7]:
# Create a list to store the normalized columns and a list with the new names of the columns.
col_list_norm = [col1]
col_names = ["1s", "lcavol_norm", "lweight_norm", "age_norm", "lbph_norm", "svi_norm", "lcp_norm", "gleason_norm", "pgg45_norm"]

# Loop through the columns.
for col_i in col_list:
    # Get the mean and std.
    mean = np.mean(col_i)
    std = np.std(col_i)
    # Standardise the values and append the result to the list of normalized columns.
    col_i_norm = [((x_i - mean) / std) for x_i in col_i]
    col_list_norm.append(col_i_norm)

# Create a dataframe.
df = DataFrame(col_list_norm)
X = df.transpose()
X.columns = col_names
X.head()

| | 1s | lcavol_norm | lweight_norm | age_norm | lbph_norm | svi_norm | lcp_norm | gleason_norm | pgg45_norm |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | -1.645861 | -2.016634 | -1.872101 | -1.030029 | -0.525657 | -0.867655 | -1.047571 | -0.868957 |
| 1 | 1.0 | -1.999313 | -0.725759 | -0.791989 | -1.030029 | -0.525657 | -0.867655 | -1.047571 | -0.868957 |
| 2 | 1.0 | -1.587021 | -2.200154 | 1.368234 | -1.030029 | -0.525657 | -0.867655 | 0.344407 | -0.156155 |
| 3 | 1.0 | -2.178174 | -0.812191 | -0.791989 | -1.030029 | -0.525657 | -0.867655 | -1.047571 | -0.868957 |
| 4 | 1.0 | -0.510513 | -0.461218 | -0.251933 | -1.030029 | -0.525657 | -0.867655 | -1.047571 | -0.868957 |


In [8]:
# Create a pandas series with normalized lpsa.
lpsa = prostate["lpsa"]
mean = np.mean(lpsa)
st = np.std(lpsa)
y = pd.Series([((lpsa_i - mean) / st) for lpsa_i in lpsa])
y

0    -2.533318
1    -2.299712
2    -2.299712
3    -2.299712
4    -1.834631
        ...   
92    1.660415
93    1.921044
94    2.320465
95    2.611649
96    2.703452
Length: 97, dtype: float64

In [9]:
# Check the correlation of X columns with y.
print(X.corrwith(y))

1s                   NaN
lcavol_norm     0.734460
lweight_norm    0.433319
age_norm        0.169593
lbph_norm       0.179809
svi_norm        0.566218
lcp_norm        0.548813
gleason_norm    0.368987
pgg45_norm      0.422316
dtype: float64


Use the `OLS` function from statsmodels.api to fit a linear regression with y as your dependent/response variable and the first two columns of X as the explanatory variables, i.e. the intercept column and the lcavol column. 

What is the adjusted R-squared to 3 decimal places?
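For reference, adjusted R-squared rescales R-squared by the degrees of freedom: with n observations and p explanatory variables (not counting the intercept), it is 1 - (1 - R²)(n - 1)/(n - p - 1). The helper below is a minimal sketch of that formula; the name `adjusted_r2` is hypothetical, and in practice the fitted statsmodels result exposes the same quantity as `rsquared_adj`.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept.

    n_predictors counts the explanatory variables only, not the intercept.
    """
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
```

The adjustment always pulls the value down slightly, and more so when extra variables are added without improving the fit.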

In [10]:
new_X = X[["1s", "lcavol_norm"]]

# Import statsmodels.api and fit the model.
import statsmodels.api as sm
mod = sm.OLS(y, new_X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.539
Model:                            OLS   Adj. R-squared:                  0.535
Method:                 Least Squares   F-statistic:                     111.3
Date:                Fri, 27 Nov 2020   Prob (F-statistic):           1.12e-17
Time:                        10:11:43   Log-Likelihood:                -100.04
No. Observations:                  97   AIC:                             204.1
Df Residuals:                      95   BIC:                             209.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
1s           1.561e-16      0.070   2.24e-15      

I now want to run a *forward selection AIC regression*. AIC is the Akaike information criterion. It's designed to penalise models with lots of explanatory variables so that we pick models which fit the data well but aren't too complicated. In general, if you have two models fitted to the same data, the model with the lowest AIC is preferable. The AIC is given as part of the model summary with OLS.
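To make the definition concrete: AIC = -2 × log-likelihood + 2 × (number of fitted coefficients, including the intercept). Checking this against the summary above, -2 × (-100.04) + 2 × 2 = 204.08, which agrees with the reported 204.1. The sketch below computes the same quantity with NumPy only, assuming Gaussian errors and a design matrix with full column rank; the name `ols_aic` is hypothetical, and with statsmodels available you would simply read `res.aic`.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of a least-squares fit of y on the columns of X (Gaussian errors).

    Assumes X has full column rank and already contains the intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    # Least-squares coefficients and residual sum of squares.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)
    # Maximised Gaussian log-likelihood, with the variance estimated as ssr / n.
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1.0)
    return -2.0 * llf + 2.0 * k
```

The 2 × k term is the penalty: each extra column must improve the log-likelihood by more than 1 to lower the AIC.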

The steps to run a forward selection AIC regression are: 
1. Run a linear regression with just the intercept column. Get the AIC.
2. Repeat:<br>
a. Try adding in all the currently unused explanatory variables individually and look at the decrease in AIC for each<br>
b. Find the variable with the biggest decrease in AIC and include it in the model<br>
c. If none of the variables lowers the AIC then stop, otherwise go to 2a.
3. Report your final chosen variables

Write a function called `forwardAIC` which performs this algorithm given the DataFrame X and Series y.
The function should return the column numbers of the X matrix for the model that gives the lowest AIC.

In [12]:
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.663
Model:                            OLS   Adj. R-squared:                  0.633
Method:                 Least Squares   F-statistic:                     21.68
Date:                Fri, 27 Nov 2020   Prob (F-statistic):           7.65e-18
Time:                        10:12:13   Log-Likelihood:                -84.829
No. Observations:                  97   AIC:                             187.7
Df Residuals:                      88   BIC:                             210.8
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
1s            3.469e-17      0.062   5.61e-16   

What's the AIC of this chosen model?

**Bonus question (ungraded)**

Run the same analysis on the Diamonds data using price as the dependent/response variable. Load the data in and create dummy variables for the categorical variables cut, color and clarity (using `pd.get_dummies`). You will need to drop one category for each categorical variable (i.e. drop 'Fair' for cut, drop 'D' for color, and drop 'I1' for clarity). Otherwise the dummy columns for each variable sum to the intercept column and the model cannot be fully determined.
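A minimal sketch of the dummy-variable step is shown below on a few made-up rows that only mimic the Diamonds column names (the DataFrame `toy` and its values are assumptions). It relies on `pd.get_dummies` naming each new column as `<variable>_<category>`, and then drops the three baseline categories explicitly.

```python
import pandas as pd

# Made-up rows that only mimic the Diamonds column names.
toy = pd.DataFrame({
    "price": [326, 334, 423, 500],
    "cut": ["Fair", "Good", "Ideal", "Good"],
    "color": ["D", "E", "D", "J"],
    "clarity": ["I1", "SI1", "VS2", "I1"],
})

# One 0/1 column per category of each categorical variable.
dummies = pd.get_dummies(toy[["cut", "color", "clarity"]], dtype=float)

# Drop one baseline category per variable so the remaining dummies are not
# collinear with the intercept column.
dummies = dummies.drop(columns=["cut_Fair", "color_D", "clarity_I1"])
```

Dropping the columns by name makes the chosen baseline explicit, rather than depending on the order in which the categories happen to be sorted.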

Standardise everything, add in the intercept column, and then run your `forwardAIC` function. 

How many variables (not including the intercept) get chosen?

What's the AIC of this chosen model?

These two bonus questions are included in the *Week 9 - Non-assessed exercises* quiz on Moodle.