**Linear Regression: a model that describes a linear relationship between a dependent variable and one or more predictor variables.**
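
As a sketch of what "linear" means here, the model can be written as follows. The notation is an assumption added for illustration and does not come from the dataset: $y$ is the dependent variable, $x_1, \dots, x_p$ are the predictors, $\beta_0, \dots, \beta_p$ are the coefficients to estimate, and $\varepsilon$ is the error term.

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
$$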

<img src="img/1.png">

#### Dataset

##### Source : FRED (https://fred.stlouisfed.org/)

* Target Variable : HPI or Housing Price Index
    * It measures price changes of residential housing

* Predictor Variables
    * GDP
    * Unemployment
    * Interest Rates

!pip install statsmodels

#### Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.sandbox.regression.predstd import wls_prediction_std

sns.set_style('darkgrid')
%matplotlib inline



#### Import Data

In [3]:
hpi = pd.read_csv('dataset/monthly-hpi.csv')
uemp = pd.read_csv('dataset/unemployment-macro.csv')
ffr = pd.read_csv('dataset/fed_funds.csv')
shiller = pd.read_csv('dataset/shiller.csv')
gdp = pd.read_csv('dataset/gdp.csv')

In [4]:
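# Inner-join the five tables on their shared 'date' column;
# only dates present in every table are kept.
df = (shiller.merge(hpi, on='date')
      .merge(uemp, on='date')
      .merge(ffr, on='date')
      .merge(gdp, on='date'))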
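
Because `merge` defaults to an inner join, the number of rows can shrink when the tables are sampled at different frequencies. The sketch below shows this on two small stand-in frames. The names `monthly` and `quarterly` and the `hpi` values are hypothetical and made up for illustration; only the two GDP figures are taken from the `gdp` table in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for a monthly and a quarterly series.
# The hpi values are made up for illustration only.
monthly = pd.DataFrame({
    "date": pd.date_range("2011-01-01", periods=6, freq="MS").strftime("%Y-%m-%d"),
    "hpi": [100.0, 100.5, 101.0, 101.2, 101.9, 102.3],
})
quarterly = pd.DataFrame({
    "date": ["2011-01-01", "2011-04-01"],
    "gdp": [14881.3, 14989.6],
})

# merge() defaults to how="inner": only dates found in both frames survive.
merged = monthly.merge(quarterly, on="date")
print(merged)
print(len(monthly), "monthly rows ->", len(merged), "rows after the merge")
```

Comparing `len(df)` with the length of each input frame is a quick way to see how many observations an inner join has dropped.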

In [10]:
gdp.head(10)

| Unnamed: 0 | date | total_expenditures | labor_force_pr | producer_price_index | gross_domestic_product |
|---|---|---|---|---|---|
| 0 | 2011-01-01 | 5766.7 | 64.2 | 192.7 | 14881.3 |
| 1 | 2011-04-01 | 5870.8 | 64.2 | 203.1 | 14989.6 |
| 2 | 2011-07-01 | 5802.6 | 64.0 | 204.6 | 15021.1 |
| 3 | 2011-10-01 | 5812.9 | 64.1 | 201.1 | 15190.3 |
| 4 | 2012-01-01 | 5765.7 | 63.7 | 200.7 | 15291.0 |
| 5 | 2012-04-01 | 5771.2 | 63.7 | 203.7 | 15362.4 |
| 6 | 2012-07-01 | 5745.4 | 63.7 | 200.1 | 15380.8 |
| 7 | 2012-10-01 | 5841.4 | 63.8 | 203.5 | 15384.3 |
| 8 | 2013-01-01 | 5748.0 | 63.6 | 202.5 | 15491.9 |
| 9 | 2013-04-01 | 5756.8 | 63.4 | 203.5 | 15521.6 |


#### Exploratory Data Analysis : Figure out the best predictors for our dependent variable

* Plotting
* Descriptive Stats
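
The descriptive-statistics step can be sketched roughly as below. The frame `demo` and its values are synthetic and purely illustrative. The column names mirror the predictors listed earlier but are assumptions, not the actual columns of `df`.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
gdp = rng.normal(15000, 500, n)
unemployment = rng.normal(6, 1, n)
noise = rng.normal(0, 1, n)
demo = pd.DataFrame({
    "gdp": gdp,
    "unemployment": unemployment,
    "noise": noise,
    # Target built to depend on gdp and unemployment, but not on noise.
    "hpi": 0.01 * gdp - 5 * unemployment + rng.normal(0, 2, n),
})

# Descriptive statistics: count, mean, std, min, quartiles and max per column.
print(demo.describe())

# Absolute correlation of each predictor with the target, strongest first.
corr_with_target = (demo.corr()["hpi"]
                    .drop("hpi")
                    .abs()
                    .sort_values(ascending=False))
print(corr_with_target)
```

Predictors with a correlation near zero (here the `noise` column, by construction) are weak candidates. A pairwise scatter plot of the same columns is the usual visual counterpart to this table.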
    
    
##### OLS : Ordinary Least Squares : a statistical method that estimates the relationship between independent and dependent variables by minimising the sum of squared differences between observed and predicted values
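
To make the "minimising the sum of squares" idea concrete, here is a minimal sketch using plain NumPy rather than statsmodels. The data is synthetic and the helper `ssr` is a hypothetical name introduced for this example.

```python
import numpy as np

# Synthetic data, for illustration only: y = 2 + 3*x + noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the coefficients that minimise the
# sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def ssr(coefs):
    """Sum of squared differences between observed and predicted values."""
    residuals = y - X @ coefs
    return float(residuals @ residuals)

print("intercept, slope:", beta)
print("SSR at the OLS estimate:", ssr(beta))
# Moving away from the estimate increases the sum of squares.
print("SSR after nudging the slope:", ssr(beta + np.array([0.0, 0.1])))
```

The estimated intercept and slope should land close to the values used to generate the data, and any nudge away from them makes the sum of squared residuals larger.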

##### OLS or Linear Regression Assumptions

1. Linearity :
    * The dependent and independent variables have a linear relationship
    
    
2. No Multicollinearity
    * The independent variables are not highly correlated with each other.
    * If two predictors are highly correlated, removing one of them shouldn't drastically reduce adjusted R-squared, because the other carries much of the same information.
    
    
3. Zero Conditional Mean
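    * The errors average to zero for any given values of the predictors, i.e. the residuals (vertical distances between the observations and the trend line) show no systematic deviation above or below the line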
    
    
4. Homoskedasticity
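    * The variance of the residuals is constant across all values of the predictors
    * The spread of the residuals shows no pattern (e.g. no funnel shape)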
    
    
5. No Autocorrelation
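    * The residuals are not correlated with each other across observations. Autocorrelation means a variable is correlated with itself across observations, e.g. with its own past values in a time series.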
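
Several of the assumptions above can be checked numerically after fitting a model. The sketch below is one possible set of checks, not the only one. It fits the formula-based `ols` imported earlier on synthetic data. The frame `demo`, its column names and its values are hypothetical, and the thresholds mentioned in the comments are common rules of thumb rather than hard limits.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Synthetic data, for illustration only; column names are hypothetical.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "gdp": rng.normal(0, 1, n),
    "unemployment": rng.normal(0, 1, n),
})
demo["hpi"] = (1.0 + 2.0 * demo["gdp"] - 1.5 * demo["unemployment"]
               + rng.normal(0, 1, n))

model = ols("hpi ~ gdp + unemployment", data=demo).fit()
resid = model.resid
exog = model.model.exog  # design matrix; the intercept is column 0

# No multicollinearity: variance inflation factor per predictor.
# Values well above roughly 5 to 10 are often read as a warning sign.
vif = {name: variance_inflation_factor(exog, i)
       for i, name in enumerate(model.model.exog_names)
       if name != "Intercept"}
print("VIF:", vif)

# Zero mean: with an intercept the residuals average to zero by construction,
# so this is only a sanity check; a residuals-vs-fitted plot is more telling.
mean_resid = resid.mean()
print("mean residual:", mean_resid)

# Homoskedasticity: Breusch-Pagan test.
# A small p-value suggests the residual variance is not constant.
_, bp_pvalue, _, _ = het_breuschpagan(resid, exog)
print("Breusch-Pagan p-value:", bp_pvalue)

# No autocorrelation: Durbin-Watson statistic.
# Values near 2 suggest no first-order autocorrelation.
dw = durbin_watson(resid)
print("Durbin-Watson:", dw)
```

The same calls should work on a model fitted to the merged `df`, with the synthetic column names replaced by the real ones.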