# Multivariate Regression Analysis

It is reasonable to expect that considering more variables in the regression equation will improve its explanatory power and give a fuller picture of the circumstances that determine the variable we are trying to predict.

Think back to our house example. The buyer of a house is interested not only in its size, but also in the location, neighbourhood, transportation, etc. All of these play a role and affect pricing. Therefore, a regression that considers multiple variables should provide predictions closer to the true results. The main catch is that we need the right variables.

The form of the multivariate regression equation is almost the same as the simple regression equation, with the slight difference that we have multiple beta coefficients and explanatory variables. It's up to the analyst's discretion to choose how many explanatory variables will be included. If their number is higher than 1, we can talk about a multiple regression. 

The principle according to which the regression is calculated is the same as in the simple regression setting: we estimate the best fit by minimizing the sum of squared residuals. It is more difficult to represent that fit on a plot, because it runs through multiple dimensions. With two independent variables, the fit is a plane and you can still draw it in a 3D graph. With more than two, things become complicated.

To understand how the regression coefficients are obtained, try to imagine the software we're using knows how to calculate the equation of a best fitting line in 4, 5, 6, or more dimensions. Math formulas make this a possible task. Similar to what we saw before, the $R^2$ measurement will help us determine how powerful the regression we've obtained is. 
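As a rough illustration of what the software does behind the scenes, the sketch below fits a regression with two explanatory variables using NumPy's least-squares solver. The data are made up for illustration (they follow y = 1 + 2*x1 + 3*x2 exactly), and the variable names are hypothetical rather than part of the housing example.

```python
import numpy as np

# Made-up data: y is exactly 1 + 2*x1 + 3*x2, so the fit should recover those numbers.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq returns the coefficients that minimize the sum of squared residuals.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
print("coefficients:", np.round(beta, 6))
print("sum of squared residuals:", float(residuals @ residuals))
```

Adding a third or fourth explanatory variable only means adding more columns to `X`; the solver call stays the same.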

- R-squared ranges from 0 to 1
- The more variables we include in the equation, the higher we would expect it to be: adding a variable never lowers R-squared, even when the variable is irrelevant
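The second point can be checked numerically. The sketch below uses randomly generated data (not the housing data) and a hypothetical helper `r_squared` to compare a model with one useful variable against the same model with an unrelated variable added.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on X (X must already contain a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 20
useful = rng.normal(size=n)
irrelevant = rng.normal(size=n)          # unrelated to y by construction
y = 3.0 + 2.0 * useful + rng.normal(size=n)

const = np.ones(n)
r2_small = r_squared(np.column_stack([const, useful]), y)
r2_big = r_squared(np.column_stack([const, useful, irrelevant]), y)

print("R-squared, one variable: ", round(r2_small, 4))
print("R-squared, two variables:", round(r2_big, 4))
```

The second figure is never smaller than the first, which is why a rise in R-squared alone does not prove that the added variable matters.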

There are two main ways to check if an explanatory variable is helpful:
1. Run a regression with the variable, then run a new regression without that variable/with a different variable

We should observe how $R^2$ changes. If it's noticeably higher with the variable included, then this independent variable has explanatory power and contributes to explaining the value of the dependent variable.

2. Compare the p-values of the beta coefficients

A p-value is the probability of obtaining a result at least as extreme as the one we observed, assuming the null hypothesis is true. In this case, the null hypothesis is that the real beta coefficient is 0, so the p-value tells us how likely an estimate this far from 0 would be if the variable had no effect. A low p-value is therefore a good thing: it is evidence that the real beta coefficient differs from 0 and that the variable helps explain the dependent variable. A p-value lower than 5% lets us reject, at the 5% significance level, the hypothesis that the beta coefficient is 0.
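To make the link between a coefficient and its p-value concrete, here is a minimal sketch that recomputes the t-statistic and two-sided p-value for one coefficient. The numbers are copied from the "House Size (sq.ft.)" row of the three-variable summary later in this notebook; the variable names are illustrative.

```python
from scipy import stats

# Values copied from the "House Size (sq.ft.)" row of the three-variable summary.
coef = 341.8271
std_err = 179.666
df_resid = 16          # "Df Residuals" in the same summary

# t-statistic: how many standard errors the estimate lies away from 0.
t_stat = coef / std_err

# Two-sided p-value: chance of a |t| at least this large if the true beta were 0.
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

print("t =", round(t_stat, 3))
print("p =", round(p_value, 3))
```

The results should agree with the `t` and `P>|t|` columns of that summary up to rounding.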

This is the form of the multivariate regression equation with $k$ explanatory variables:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$$

Note that the beta coefficients estimated in a multivariate regression can still be interpreted as the marginal impact the explanatory variable has on the dependent variable. But this is only true if all other explanatory variables remain constant.

So, if we've obtained a regression that shows how a stock's expected return is a function of the country's GDP growth and the industry growth rate, and we've obtained a 0.8 beta coefficient for GDP growth and a 1.2 beta coefficient for industry growth, we can say that for every percentage point increase in GDP growth, the company's expected return rises by 0.8 percentage points, provided the expected industry growth remains the same.
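The arithmetic behind this interpretation can be sketched in a few lines. The two slopes come from the example above; the intercept and the growth rates plugged in are made-up placeholders, since the text does not give them.

```python
# Slopes from the example in the text; the intercept is a made-up placeholder.
BETA_GDP = 0.8
BETA_INDUSTRY = 1.2
INTERCEPT = 0.5   # hypothetical, not given in the text

def expected_return(gdp_growth, industry_growth):
    """Predicted return (in %) for given GDP and industry growth rates (in %)."""
    return INTERCEPT + BETA_GDP * gdp_growth + BETA_INDUSTRY * industry_growth

# Raise GDP growth by one percentage point while holding industry growth at 3%.
base = expected_return(gdp_growth=2.0, industry_growth=3.0)
bumped = expected_return(gdp_growth=3.0, industry_growth=3.0)

print("change in expected return:", round(bumped - base, 4))
```

Whatever intercept or industry growth rate is chosen, the change equals the GDP coefficient, as long as industry growth is the same in both calls.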

The expected return and the GDP growth are interrelated in this regression, and we cannot interpret one without the other.

In [2]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

In [5]:
data = pd.read_excel('./Housing.xlsx')
data

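| Unnamed: 0 | House Price | House Size (sq.ft.) | State | Number of Rooms | Year of Construction |
|---|---|---|---|---|---|
| 0 | 1116000 | 1940 | IN | 8 | 2002 |
| 1 | 860000 | 1300 | IN | 5 | 1992 |
| 2 | 818400 | 1420 | IN | 6 | 1987 |
| 3 | 1000000 | 1680 | IN | 7 | 2000 |
| 4 | 640000 | 1270 | IN | 5 | 1995 |
| 5 | 1010000 | 1850 | IN | 7 | 1998 |
| 6 | 600000 | 1000 | IN | 4 | 2015 |
| 7 | 700000 | 1100 | LA | 4 | 2014 |
| 8 | 1100000 | 1600 | LA | 7 | 2017 |
| 9 | 570000 | 1000 | NY | 5 | 1997 |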


### Multivariate Regression
Independent variables: House size, rooms, year of construction

In [6]:
x = data[['House Size (sq.ft.)', 'Number of Rooms', 'Year of Construction']]
# double brackets pass a list of column names, so x is a DataFrame with several columns
y = data['House Price']

In [8]:
x1 = sm.add_constant(x)
reg = sm.OLS(y, x1).fit()

reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | House Price | R-squared: | 0.736 |
| Model: | OLS | Adj. R-squared: | 0.687 |
| Method: | Least Squares | F-statistic: | 14.9 |
| Date: | Fri, 17 Jul 2020 | Prob (F-statistic): | 6.82e-05 |
| Time: | 08:29:43 | Log-Likelihood: | -258.43 |
| No. Observations: | 20 | AIC: | 524.9 |
| Df Residuals: | 16 | BIC: | 528.9 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -9.452e+06 | 5.4e+06 | -1.752 | 0.099 | -2.09e+07 | 1.99e+06 |
| House Size (sq.ft.) | 341.8271 | 179.666 | 1.903 | 0.075 | -39.049 | 722.703 |
| Number of Rooms | 1.16e+04 | 3.74e+04 | 0.310 | 0.760 | -6.77e+04 | 9.08e+04 |
| Year of Construction | 4863.5761 | 2697.969 | 1.803 | 0.090 | -855.862 | 1.06e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.14 | Durbin-Watson: | 1.938 |
| Prob(Omnibus): | 0.343 | Jarque-Bera (JB): | 1.747 |
| Skew: | -0.676 | Prob(JB): | 0.418 |
| Kurtosis: | 2.484 | Cond. No. | 540000.0 |


Let's examine the values of the statistics obtained. The coefficient of the constant came out in scientific notation. Transformed to standard notation, it is about -9,452,000. So, the coefficient values of the constant and of the house size changed drastically: the constant moved from 260,800 to about -9,452,000, while the house size beta dropped from 402 to about 342.

Does this mean that by adding the other two independent variables, we destroyed the nice results of our univariate regression and that house size cannot explain the house price anymore? Absolutely not.

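The R-squared value of this regression changed from 0.678 to 0.736. According to this statistic, the second model is slightly better in terms of explanatory power, although part of that gain comes automatically from adding variables; the adjusted R-squared of 0.687 accounts for this. So, at least some of the independent variables influence the price of a house.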
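The summary above also reports an adjusted R-squared, which penalizes the plain R-squared for the number of explanatory variables. As a small check, the sketch below recomputes it from the reported R-squared, number of observations, and number of variables; the function name is our own.

```python
def adjusted_r_squared(r2, n_obs, n_vars):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)

# R-squared, observations, and number of explanatory variables
# as reported in the three-variable summary above.
adj = adjusted_r_squared(r2=0.736, n_obs=20, n_vars=3)
print(round(adj, 4))
```

The result should land close to the 0.687 in the summary; any small gap comes from 0.736 itself being a rounded figure.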

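The coefficient p-values are an important indicator we should consider. None of them is small enough: all three are above 5% (0.075, 0.760, and 0.090). This means that none of the 3 coefficients is individually statistically significant at the 5% level, even though the p-value of the F-statistic (6.82e-05) shows that the variables are jointly significant.

So what should we infer? An experienced researcher runs tons of regressions before making a sound inference. Let's run 3 other regressions, each with 2 independent variables.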
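One way to organise such a batch of runs is to loop over every pair of explanatory variables. The sketch below does this with NumPy only, using the ten rows displayed earlier in this notebook rather than all 20 observations, so its R-squared values will not match the summaries that follow. The helper `r_squared` and the dictionary layout are illustrative choices.

```python
from itertools import combinations
import numpy as np

# The ten rows displayed earlier; the regressions in this notebook use 20 observations,
# so the numbers printed here will differ from the summaries.
columns = {
    "House Size (sq.ft.)": [1940, 1300, 1420, 1680, 1270, 1850, 1000, 1100, 1600, 1000],
    "Number of Rooms": [8, 5, 6, 7, 5, 7, 4, 4, 7, 5],
    "Year of Construction": [2002, 1992, 1987, 2000, 1995, 1998, 2015, 2014, 2017, 1997],
}
price = np.array([1116000, 860000, 818400, 1000000, 640000,
                  1010000, 600000, 700000, 1100000, 570000], dtype=float)

def r_squared(X, y):
    """R-squared of an OLS fit of y on X (X already contains a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

results = {}
for pair in combinations(columns, 2):
    predictors = [np.array(columns[name], dtype=float) for name in pair]
    X = np.column_stack([np.ones(len(price))] + predictors)
    results[pair] = r_squared(X, price)

for pair, r2 in results.items():
    print(pair, round(r2, 3))
```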

Independent variables: house size, number of rooms

In [9]:
x = data[['House Size (sq.ft.)', 'Number of Rooms']]

In [10]:
x1 = sm.add_constant(x)
reg = sm.OLS(y, x1).fit()

reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | House Price | R-squared: | 0.683 |
| Model: | OLS | Adj. R-squared: | 0.645 |
| Method: | Least Squares | F-statistic: | 18.3 |
| Date: | Fri, 17 Jul 2020 | Prob (F-statistic): | 5.77e-05 |
| Time: | 08:45:42 | Log-Likelihood: | -260.28 |
| No. Observations: | 20 | AIC: | 526.6 |
| Df Residuals: | 17 | BIC: | 529.6 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.737e+05 | 1.03e+05 | 2.655 | 0.017 | 5.62e+04 | 4.91e+05 |
| House Size (sq.ft.) | 314.1363 | 190.485 | 1.649 | 0.117 | -87.752 | 716.025 |
| Number of Rooms | 1.944e+04 | 3.95e+04 | 0.492 | 0.629 | -6.39e+04 | 1.03e+05 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.326 | Durbin-Watson: | 1.852 |
| Prob(Omnibus): | 0.515 | Jarque-Bera (JB): | 0.81 |
| Skew: | -0.487 | Prob(JB): | 0.667 |
| Kurtosis: | 2.853 | Cond. No. | 5890.0 |


Independent variables: house size, year of construction

In [11]:
x = data[['House Size (sq.ft.)', 'Year of Construction']]
x1 = sm.add_constant(x)
reg = sm.OLS(y, x1).fit()

reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | House Price | R-squared: | 0.735 |
| Model: | OLS | Adj. R-squared: | 0.704 |
| Method: | Least Squares | F-statistic: | 23.55 |
| Date: | Fri, 17 Jul 2020 | Prob (F-statistic): | 1.26e-05 |
| Time: | 08:46:30 | Log-Likelihood: | -258.49 |
| No. Observations: | 20 | AIC: | 523.0 |
| Df Residuals: | 17 | BIC: | 526.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -9.654e+06 | 5.21e+06 | -1.852 | 0.081 | -2.07e+07 | 1.34e+06 |
| House Size (sq.ft.) | 394.0417 | 61.098 | 6.449 | 0.000 | 265.137 | 522.947 |
| Year of Construction | 4960.9407 | 2607.443 | 1.903 | 0.074 | -540.283 | 1.05e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.064 | Durbin-Watson: | 1.926 |
| Prob(Omnibus): | 0.356 | Jarque-Bera (JB): | 1.689 |
| Skew: | -0.663 | Prob(JB): | 0.43 |
| Kurtosis: | 2.48 | Cond. No. | 536000.0 |


Independent variables: number of rooms, year of construction

In [12]:
x = data[['Number of Rooms', 'Year of Construction']]
x1 = sm.add_constant(x)
reg = sm.OLS(y, x1).fit()

reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | House Price | R-squared: | 0.677 |
| Model: | OLS | Adj. R-squared: | 0.639 |
| Method: | Least Squares | F-statistic: | 17.79 |
| Date: | Fri, 17 Jul 2020 | Prob (F-statistic): | 6.79e-05 |
| Time: | 08:47:01 | Log-Likelihood: | -260.47 |
| No. Observations: | 20 | AIC: | 526.9 |
| Df Residuals: | 17 | BIC: | 529.9 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -8.471e+06 | 5.77e+06 | -1.468 | 0.160 | -2.06e+07 | 3.7e+06 |
| Number of Rooms | 7.824e+04 | 1.4e+04 | 5.574 | 0.000 | 4.86e+04 | 1.08e+05 |
| Year of Construction | 4424.7160 | 2887.793 | 1.532 | 0.144 | -1667.996 | 1.05e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.115 | Durbin-Watson: | 1.959 |
| Prob(Omnibus): | 0.347 | Jarque-Bera (JB): | 1.4 |
| Skew: | -0.407 | Prob(JB): | 0.497 |
| Kurtosis: | 1.991 | Cond. No. | 434000.0 |


Skimming through the values in the new output, we can see that year of construction doesn't get a p-value below 5% in any of the regressions it's involved in (0.090, 0.074, and 0.144). With this sample, we cannot confirm that it's related to house prices.

When we run a regression with only the two variables size and number of rooms, both p-values are high (0.117 and 0.629), even though each variable gets a p-value near 0 when it is paired with year of construction instead. So, from that regression alone, we can't confirm they influence the price of a house. At this stage, even if we cannot draw a firm conclusion, these results give us good guidance for future research:

- The output suggests that if we gather data on more observations, house size or number of rooms might prove to be good indicators of house prices, and their p-values will probably decrease
    - In our calculations, R-squared was high, which gives us some confidence that we have a good set of explanatory variables
- House size and number of rooms are related and act almost as a single explanatory variable, which is why neither looks significant when both are included
- We should gather data about more explanatory variables
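The claim that house size and number of rooms are related can be checked with a simple correlation. This is a minimal sketch on the ten rows displayed earlier in this notebook, not on all 20 observations, so treat the figure as indicative only.

```python
import numpy as np

# The ten rows displayed earlier in this notebook.
house_size = np.array([1940, 1300, 1420, 1680, 1270, 1850, 1000, 1100, 1600, 1000], dtype=float)
rooms = np.array([8, 5, 6, 7, 5, 7, 4, 4, 7, 5], dtype=float)

# Pearson correlation between the two explanatory variables.
corr = np.corrcoef(house_size, rooms)[0, 1]
print("correlation between size and rooms:", round(corr, 3))
```

On these rows the correlation comes out well above 0.9, which is consistent with the two variables carrying largely the same information.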