# Going Back-to-Basics
## Linear methods for regression

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from linear_regression import linear_regression as lr
from sklearn import datasets
import pandas as pd

This series of back-to-basics posts implements some of the simpler methods of inferential statistics. Here we focus on good old-fashioned linear regression. In the next post I'll dive into some shrinkage techniques, but for now we're keeping it simple. The last cell contains all the code for the linear regression class.

I'm using the [prostate data](https://web.stanford.edu/~hastie/ElemStatLearn/data.html) set following my favorite machine learning text, [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/). 

In [6]:
# Load the prostate dataset (whitespace-delimited); raw string avoids an invalid-escape warning
df = pd.read_csv('./prostate.data', sep=r'\s+')
df.head()

| | lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa | train |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.579818 | 2.769459 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.430783 | T |
| 2 | -0.994252 | 3.319626 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.162519 | T |
| 3 | -0.510826 | 2.691243 | 74 | -1.386294 | 0 | -1.386294 | 7 | 20 | -0.162519 | T |
| 4 | -1.203973 | 3.282789 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.162519 | T |
| 5 | 0.751416 | 3.432373 | 62 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.371564 | T |


Peeking at the dataset (above), we see a mixture of categorical and continuous data. Luckily, this is a toy dataset: the categorical variables are already encoded by an index and the data has been cleaned. Each row is already labeled as training or testing data in the `train` column, which is used to split the data in the code cell below.

In [None]:
# Predictor columns; the response is lpsa
predictors = ['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp', 'gleason', 'pgg45']
# Split train and test using the 'train' flag, then drop that column
df_train = df.loc[df['train'] == 'T'].drop(columns=['train'])
df_test = df.loc[df['train'] == 'F'].drop(columns=['train'])
# Convert to NumPy arrays; the double brackets keep y as an (N, 1) column
x_train = df_train[predictors].to_numpy()
y_train = df_train[['lpsa']].to_numpy()
x_test = df_test[predictors].to_numpy()
y_test = df_test[['lpsa']].to_numpy()

In [3]:
#plot correlations between all predictors
#grr = pd.plotting.scatter_matrix(df, figsize=(15, 15), marker='o',
#                                 hist_kwds={'bins': 20}, s=60, alpha=.8)

Linear regression minimizes the cost function $E[(Y-\hat{Y})^2]$, where $\hat{Y} = X\beta$, $\beta$ is the vector of estimated regression coefficients, and $X$ is an $N\times(p+1)$ matrix containing the $p$ predictors plus a column of ones for the intercept, one row for each of the $N$ data points. The expectation is taken empirically, i.e. as an average over the training data.
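To make the minimization concrete, here is a minimal sketch of an ordinary least-squares fit in plain NumPy. It is not the code from the `linear_regression` class used in this post; the helper names `fit_ols` and `predict` and the synthetic demo data are assumptions for illustration only. Minimizing the empirical squared error leads to the normal equations $X^TX\beta = X^Ty$, which `np.linalg.lstsq` solves in a numerically stable way.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so X becomes N x (p+1)."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    """Hypothetical helper: least-squares estimate of beta for y ~ [1, X] @ beta."""
    beta, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return beta

def predict(X, beta):
    """Hypothetical helper: fitted values y_hat = [1, X] @ beta."""
    return add_intercept(X) @ beta

# Synthetic, noiseless demo data (not the prostate data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

beta_hat = fit_ols(X_demo, y_demo)
y_hat = predict(X_demo, beta_hat)
```

Because the demo data has no noise, `beta_hat` recovers the intercept and the two slopes up to floating-point error. With real data such as the prostate set the fit is only approximate, and the mean squared residual is the empirical version of the cost above.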



In [4]:
fh = lr(predictors, x_train, y_train, x_test, y_test, standardize = True, intercept=True)
fh.solve()

In [5]:
fh.backwards_stepwise_selection()

c (30, 9) (9, 1)
c (30, 8) (8, 1)
c (30, 7) (7, 1)
c (30, 6) (6, 1)
c (30, 5) (5, 1)
c (30, 4) (4, 1)
c (30, 3) (3, 1)
c (30, 2) (2, 1)


(['gleason', 'age', 'lcp', 'pgg45', 'lbph', 'svi', 'lweight', 'lcavol'],
 [16.564820514008186,
  16.467503922798873,
  17.27080608851507,
  15.884637816445682,
  14.756207448326178,
  13.367761375874872,
  16.54138258922808,
  15.036170914877058])
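
The output above lists predictors in the order they were eliminated, together with one error value per step. As a rough illustration of the general idea behind backward stepwise selection, here is a minimal sketch. It assumes the rule "at each step, drop the predictor whose removal increases the training residual sum of squares the least"; the actual `backwards_stepwise_selection` method may use a different criterion or report a different error. The names `rss` and `backward_stepwise` and the synthetic data are hypothetical.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept (y is 1-D)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def backward_stepwise(X, y, names):
    """Return predictor names in elimination order; the last survivor comes last."""
    active = list(range(X.shape[1]))
    order = []
    while len(active) > 1:
        # Refit without each active predictor; drop the one whose removal hurts least
        trial = {j: rss(X[:, [k for k in active if k != j]], y) for j in active}
        drop = min(trial, key=trial.get)
        active.remove(drop)
        order.append(names[drop])
    order.append(names[active[0]])
    return order

# Synthetic demo: x0 matters most, x1 a little, x2 and x3 not at all
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = 5.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + 0.1 * rng.normal(size=100)

order = backward_stepwise(X_demo, y_demo, ['x0', 'x1', 'x2', 'x3'])
```

In this demo the two irrelevant predictors are eliminated first and the strongest one, `x0`, survives to the end, which mirrors how `lcavol` appears last in the output above.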