### Linear Regression 

This notebook walks through linear regression: assumptions, pre-model work, implementation, hypothesis testing, performance analysis, and when to use it.

#### Assumptions

1. Independent variables (X): no multicollinearity, non-zero variability within each feature, uncorrelated with the residuals, approximately normal, linear relationship with the dependent variable
2. Dependent variable (y): continuous, and approximately normal when n < 3000
3. Residuals: iid, mean of 0
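
The no-multicollinearity assumption can be screened with pairwise correlations before fitting. The following is a minimal sketch, not the notebook's own method: the helper names `pearson` and `flag_collinear`, the 0.8 cutoff, and the toy columns are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    a_mean, b_mean = mean(a), mean(b)
    cov = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    a_ss = sum((ai - a_mean) ** 2 for ai in a)
    b_ss = sum((bi - b_mean) ** 2 for bi in b)
    return cov / sqrt(a_ss * b_ss)


def flag_collinear(features, threshold=0.8):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold.

    `features` maps a column name to its list of values. The 0.8 cutoff is an
    arbitrary illustrative choice, not a rule from the notebook.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Toy data (made up): height_in is height_cm rescaled, so the pair is collinear.
toy = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40, 25, 33, 58, 21],
}
pairs = flag_collinear(toy)
print(pairs)
```

Pairwise correlation only catches collinearity between two features at a time; a feature that is a combination of several others would need a different check.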

#### Pre-model work

1. Encode categorical features
2. Remove outliers, maybe
3. Transform input variables to better expose linear relationship
4. Remove correlated inputs
5. Normalizing data can help SGD converge quicker, but you may lose some feature information
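
Two of the steps above, encoding a categorical feature and normalizing a numeric one, can be sketched in plain Python. This is an illustration under assumptions: the helper names `one_hot` and `standardize` and the toy columns are hypothetical, and dropping the first category level is one common convention rather than something the notebook specifies.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode a categorical column as 0/1 indicator columns.

    The first (alphabetically smallest) level is dropped so the indicators
    are not collinear with the intercept.
    """
    levels = sorted(set(values))
    return {f"is_{level}": [int(v == level) for v in values] for level in levels[1:]}


def standardize(values):
    """Rescale a numeric column to mean 0 and population standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy columns (made up for illustration).
city = ["typical", "atypical", "none", "typical"]
income = [20_000, 35_000, 50_000, 400_000]

encoded = one_hot(city)
scaled = standardize(income)
print(encoded)
print(scaled)
```

After standardizing, coefficients are expressed per standard deviation of the feature rather than per original unit, which is the interpretability trade-off the last item hints at.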

#### Read in dataset

In [1]:
import pandas as pd
df = pd.read_csv('heart.csv')

In [3]:
df.sample(3)

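| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 64 | 1 | 3 | 110 | 211 | 0 | 0 | 144 | 1 | 1.8 | 1 | 0 | 2 | 1 |
| 201 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 24 | 40 | 1 | 3 | 140 | 199 | 0 | 1 | 178 | 1 | 1.4 | 2 | 0 | 3 | 1 |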


#### Implementation

- y = b_0 + b_1 * x
- Aiming to minimize the residual sum of squares (error terms)
- Standard error measures how much each coefficient estimate varies under repeated sampling (the more spread out the x values, the less wiggle room the slope has)
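
For reference, these are the standard closed-form results for simple linear regression, written in the notation of the bullets above. Here $\bar{x}$ and $\bar{y}$ are sample means, $\sigma^2$ is the error variance, and $\hat{\sigma}$ (the residual standard error) is its usual estimate; these symbols are introduced here and are not defined elsewhere in the notebook.

$$
\begin{aligned}
\mathrm{RSS} &= \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2 \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad b_0 = \bar{y} - b_1 \bar{x} \\
\mathrm{SE}(b_1)^2 &= \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\mathrm{SE}(b_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] \\
\hat{\sigma} &= \sqrt{\frac{\mathrm{RSS}}{n - 2}}
\end{aligned}
$$

The term $\sum (x_i - \bar{x})^2$ in the denominator of $\mathrm{SE}(b_1)^2$ is why a wider spread in x gives a more tightly determined slope.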

In [23]:
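import numpy as np

In [19]:
# x and y are NumPy arrays; the helpers below fit one feature at a time.
# RSS: sum of squared residuals, yi - y_hat
def residual_sum_squares(x, y, b_0, b_1):
    return np.sum((y - (b_0 + b_1 * x)) ** 2)

# Least-squares estimates. A 2-D x is treated as one separate
# single-feature regression per column.
def coefficients(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if x.ndim == 2:
        y = y[:, None]  # broadcast y against every column of x
    x_mean, y_mean = x.mean(axis=0), y.mean(axis=0)
    b_1 = ((x - x_mean) * (y - y_mean)).sum(axis=0) / ((x - x_mean) ** 2).sum(axis=0)
    b_0 = y_mean - b_1 * x_mean
    return b_0, b_1

# Used to compute confidence intervals for the parameters
# 95% ~= b_1 +- 2 * SE(b_1)
def std_error(x, y, b_0, b_1):
    n = len(y)
    sxx = np.sum((x - np.mean(x)) ** 2)
    sigma2 = residual_std_error(x, y, b_0, b_1) ** 2  # estimate of var(errors)
    b_1_err = np.sqrt(sigma2 / sxx)
    b_0_err = np.sqrt(sigma2 * ((1 / n) + (np.mean(x) ** 2 / sxx)))
    return b_0_err, b_1_err

# Estimate of the standard deviation of the error terms
def residual_std_error(x, y, b_0, b_1):
    n = len(y)
    return np.sqrt(residual_sum_squares(x, y, b_0, b_1) / (n - 2))

def r_squared(x, y, b_0, b_1):
    TSS = np.sum((y - np.mean(y)) ** 2)
    RSS = residual_sum_squares(x, y, b_0, b_1)
    return 1 - RSS / TSS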

In [13]:
x = df.drop(columns=['chol']).values
y = df['chol'].values

In [17]:
b0, b1 = coefficients(x,y)
print(b0,'\n',b1)

[179.96747066 261.30208333 249.99877906 198.35051794 245.97674419
 254.1251976  249.62992561 243.84803922 243.76021249 246.73926557
 243.65736232 226.91382929 251.08695652] 
 [  1.21944129 -22.01222826  -3.86221862   0.36401868   1.93436693
 -14.88709295  -0.02249228   7.39438503   2.40843053  -0.33961667
   3.57384262   8.36392257  -8.85665349]


In [3]:
#from sklearn.linear_model import LinearRegression
#statsmodels.api.OLS()

#### Hypothesis Testing

- H0: no relationship between X and Y, B1 == 0
- H1: there is some relationship between X and Y, B1 != 0
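- Compute a t-statistic for the slope; under H0 it follows a t-distribution with n - 2 degrees of freedom (its heavier tails matter most for a small number of samples)
- Get the p-value: the probability, assuming H0 is true, of observing a |t| at least as large as the one you got
    - p-value < 0.05 is conventionally deemed significant
    - the 95% confidence interval will then not contain 0, and its range gives the effect magnitude
    - merely answers how strong the evidence is of a non-zero association (a weak effect can still be very significant)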

In [None]:
def t_test(b_1, b_1_err):
    # t-statistic for H0: B1 == 0, where b_1_err is SE(b_1)
    return (b_1 - 0) / b_1_err
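
A worked version of the test on toy data may make the steps concrete. This is a sketch under assumptions: `slope_t_test` and the data are hypothetical, and the p-value uses a normal approximation from the standard library, which is only reasonable for large n. An exact p-value would use the t-distribution with n - 2 degrees of freedom (for example `scipy.stats.t.sf`).

```python
from math import sqrt
from statistics import NormalDist, mean


def slope_t_test(x, y):
    """t-statistic and approximate two-sided p-value for H0: B1 == 0."""
    n = len(x)
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b_1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    b_0 = y_mean - b_1 * x_mean
    rss = sum((yi - (b_0 + b_1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = sqrt(rss / (n - 2) / sxx)  # sqrt(residual variance / spread of x)
    t = (b_1 - 0) / se_b1
    # Normal approximation to the t-distribution; an assumption, fine for large n only.
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return b_1, se_b1, t, p_approx


# Toy data (made up): y rises by roughly 2 per unit of x, plus noise.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.3, 5.8, 8.4, 9.7, 12.2, 14.1, 15.8]
b_1, se_b1, t, p_approx = slope_t_test(x, y)
ci = (b_1 - 2 * se_b1, b_1 + 2 * se_b1)  # rough 95% interval from the +-2 SE rule
print(f"b_1={b_1:.3f}  SE={se_b1:.3f}  t={t:.1f}  p~{p_approx:.3g}  CI={ci}")
```

On this data the interval excludes 0 and |t| is far above 2, so H0 would be rejected.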

#### Performance

1. R^2: how well the model fits the data; the fraction of variance in y explained by the model
     - TSS is the error without a model (predicting the mean of y); RSS is the error with the model
     - How much the model reduces the TSS, relative to the TSS itself: (TSS - RSS) / TSS
     - In simple regression, equals the squared correlation between X and y
     - Domain decides threshold for acceptance
2. MSE/RMSE: the closer to 0 the better
3. Residuals: should be random and iid
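
The metrics above can be computed directly from true and predicted values. This is a minimal sketch: `regression_metrics` is a hypothetical helper, and the numbers are made up for illustration rather than taken from the heart dataset.

```python
from math import sqrt
from statistics import mean


def regression_metrics(y_true, y_pred):
    """Return R^2, MSE and RMSE for a set of predictions."""
    y_mean = mean(y_true)
    tss = sum((yi - y_mean) ** 2 for yi in y_true)               # error without a model
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y_true, y_pred))  # error with the model
    mse = rss / len(y_true)
    return {"r2": 1 - rss / tss, "mse": mse, "rmse": sqrt(mse)}


# Toy values (made up for illustration).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE is in the same units as y, which usually makes it easier to judge against a domain threshold than MSE.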

#### Improvements

1. Many models -> ensembling
2. Check initially for linear trend
3. Try handling outliers
4. Standardize/Transform variables (Log of positive vars)
5. Interaction or polynomials
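
Items 4 and 5 amount to adding derived columns before fitting. A small sketch, assuming two numeric features; `expand_features` and its column names are hypothetical, and which terms actually help is something to verify on the data.

```python
from math import log


def expand_features(x1, x2):
    """Build extra columns from two numeric features.

    Adds a squared term, an interaction term and a log transform.
    x1 must be strictly positive for the log to be defined.
    """
    return {
        "x1": list(x1),
        "x2": list(x2),
        "x1_squared": [a ** 2 for a in x1],
        "x1_times_x2": [a * b for a, b in zip(x1, x2)],
        "log_x1": [log(a) for a in x1],
    }


# Toy values (made up for illustration).
expanded = expand_features([1.0, 2.0, 4.0], [3.0, 0.5, 2.0])
for name, column in expanded.items():
    print(name, column)
```

The model stays linear in its coefficients even though the new columns are nonlinear in the original features, so the same least-squares fit applies.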

#### Interpretation

- Coefficients: for every 1 unit increase in X, y_pred changes by that feature's coefficient (holding any other features fixed)
- 95% confidence interval: over repeated sampling, about 95% of intervals constructed this way would contain the true slope
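
The repeated-sampling reading of the confidence interval can be checked by simulation. This is a sketch under assumptions: the "true" parameters, noise level, sample size, and the helper `slope_interval` are all made up, and the interval uses the rough +-2 * SE rule rather than an exact t quantile.

```python
import random
from math import sqrt
from statistics import mean


def slope_interval(x, y):
    """Rough 95% interval for the slope using the +-2 * SE rule."""
    n = len(x)
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b_1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    b_0 = y_mean - b_1 * x_mean
    rss = sum((yi - (b_0 + b_1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = sqrt(rss / (n - 2) / sxx)
    return b_1 - 2 * se_b1, b_1 + 2 * se_b1


rng = random.Random(0)          # fixed seed so the run is repeatable
TRUE_B0, TRUE_B1 = 1.0, 0.5     # made-up "true" parameters for the simulation
x = [i / 2 for i in range(30)]  # 30 fixed design points

trials, hits = 1000, 0
for _ in range(trials):
    # Draw a fresh sample from the same underlying line plus Gaussian noise.
    y = [TRUE_B0 + TRUE_B1 * xi + rng.gauss(0, 1) for xi in x]
    low, high = slope_interval(x, y)
    hits += low <= TRUE_B1 <= high

coverage = hits / trials
print(f"{coverage:.1%} of intervals contain the true slope")
```

The coverage should land near 95%, which is a statement about the procedure across many samples, not about any single interval.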

#### When to use

1. To determine the existence/strength of a relationship between variables (sales -> spend)
2. When you have low dimensional data and want a quick baseline
3. Forecasting an effect or trend (how accurately can we predict future costs)