# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [2]:
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')


In [2]:
df.head()

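| | Price | Mileage | Make | Model | Trim | Type | Cylinder | Liter | Doors | Cruise | Sound | Leather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17314.103129 | 8221 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 1 |
| 1 | 17542.036083 | 9135 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 2 | 16218.847862 | 13196 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 3 | 16336.91314 | 16342 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 0 |
| 4 | 16339.170324 | 19832 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 1 |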


We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we use pandas.Categorical to convert textual category data (model name) into an ordinal number that we can work with.

This is actually a questionable thing to do in the real world: the codes simply follow the sorted order of the model names, and doing a regression on categorical data only works well if there is some inherent order to the categories!
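
To make the concern concrete, here is a minimal sketch on a made-up four-row column (the toy frame is an assumption for illustration, not the real cars data). `pd.Categorical(...).codes` numbers the categories in sorted order, so the codes carry an alphabetical ranking that has nothing to do with price. `pd.get_dummies` is one common alternative that gives each category its own indicator column instead.

```python
import pandas as pd

# Hypothetical toy column, not the real cars data
toy = pd.DataFrame({'Model': ['Century', 'Lacrosse', 'Century', 'Cavalier']})

# Categories are sorted, so the codes follow alphabetical order:
# Cavalier -> 0, Century -> 1, Lacrosse -> 2
toy['Model_ord'] = pd.Categorical(toy.Model).codes

# One-hot alternative: one indicator column per model, no implied order
one_hot = pd.get_dummies(toy.Model, prefix='Model')

print(toy)
print(one_hot)
```

With one-hot columns the regression estimates a separate offset per model rather than a single slope across an arbitrary numbering, at the cost of many more columns when there are many models.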

In [8]:
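import statsmodels.api as sm

# Encode the text Model column as integer codes (assigned in sorted category order)
df['Model_ord'] = pd.Categorical(df.Model).codes
# Sound already looks like a 0/1 flag in the rows above, so its codes mirror it
df['Sound_ord'] = pd.Categorical(df.Sound).codes
X = df[['Mileage', 'Model_ord', 'Doors', 'Sound_ord']]
y = df[['Price']]

# sm.OLS does not add an intercept by itself, so add a constant column first
X1 = sm.add_constant(X)
est = sm.OLS(y, X1).fit()

est.summary()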

| | | | |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.06 |
| Model: | OLS | Adj. R-squared: | 0.056 |
| Method: | Least Squares | F-statistic: | 12.81 |
| Date: | Wed, 11 Jan 2017 | Prob (F-statistic): | 4.15e-10 |
| Time: | 09:22:27 | Log-Likelihood: | -8511.1 |
| No. Observations: | 804 | AIC: | 17030.0 |
| Df Residuals: | 799 | BIC: | 17060.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 3.361e+04 | 1888.255 | 17.801 | 0.000 | 2.99e+04 3.73e+04 |
| Mileage | -0.1809 | 0.041 | -4.371 | 0.000 | -0.262 -0.100 |
| Model_ord | -35.8761 | 38.974 | -0.921 | 0.358 | -112.380 40.628 |
| Doors | -1752.3564 | 399.737 | -4.384 | 0.000 | -2537.016 -967.697 |
| Sound_ord | -2898.6423 | 727.588 | -3.984 | 0.000 | -4326.852 -1470.432 |

| | | | |
|---|---|---|---|
| Omnibus: | 217.434 | Durbin-Watson: | 0.119 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 540.02 |
| Skew: | 1.408 | Prob(JB): | 5.45e-118 |
| Kurtosis: | 5.861 | Cond. No. | 122000.0 |


In [9]:
X

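| | Mileage | Model_ord | Doors | Sound_ord |
|---|---|---|---|---|
| 0 | 8221 | 10 | 4 | 1 |
| 1 | 9135 | 10 | 4 | 1 |
| 2 | 13196 | 10 | 4 | 1 |
| 3 | 16342 | 10 | 4 | 0 |
| 4 | 19832 | 10 | 4 | 0 |
| 5 | 22236 | 10 | 4 | 1 |
| 6 | 22576 | 10 | 4 | 1 |
| 7 | 22964 | 10 | 4 | 1 |
| 8 | 24021 | 10 | 4 | 0 |
| 9 | 27325 | 10 | 4 | 1 |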


The table of coefficients above gives us the values to plug into an equation of the form:
    B0 + B1 * Mileage + B2 * Model_ord + B3 * Doors + B4 * Sound_ord
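
As a quick illustration of plugging values in, the sketch below evaluates that form by hand for the first row of X, including the Sound_ord term that is also part of the fitted model. The coefficients are copied from the summary table as printed (so they are rounded), and `predict_price` is a hypothetical helper name used only for this example.

```python
# Coefficients copied from the summary table above (rounded as printed)
coef = {
    'const': 3.361e+04,
    'Mileage': -0.1809,
    'Model_ord': -35.8761,
    'Doors': -1752.3564,
    'Sound_ord': -2898.6423,
}

def predict_price(mileage, model_ord, doors, sound_ord):
    """Evaluate B0 + B1*Mileage + B2*Model_ord + B3*Doors + B4*Sound_ord."""
    return (coef['const']
            + coef['Mileage'] * mileage
            + coef['Model_ord'] * model_ord
            + coef['Doors'] * doors
            + coef['Sound_ord'] * sound_ord)

# First row of X: Mileage 8221, Model_ord 10, Doors 4, Sound_ord 1
pred = predict_price(8221, 10, 4, 1)
print(round(pred, 2))
```

This comes out at about 21,856, well above the 17,314.10 actually listed for that car, which fits with the low R-squared in the summary. With the fitted model in hand, `est.predict(X1)` computes the same quantity for every row at full precision.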
    
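But the raw coefficients and std errs are in different units (dollars per mile, dollars per door, and so on), so they can't be compared directly. The t and P>|t| columns are the fairer guide: Mileage, Doors, and Sound_ord all have |t| of about 4 with p-values near zero, while Model_ord (t = -0.921, p = 0.358) can't be distinguished from zero. With an R-squared of 0.06, the model as a whole explains little of the variation in price. Notice too that the Doors coefficient is negative.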

Could we have figured that out earlier?

In [5]:
y.groupby(df.Doors).mean()

| Doors | Price |
|---|---|
| 2 | 23807.13552 |
| 4 | 20580.670749 |


Surprisingly, more doors does not mean a higher price! (Maybe two doors implies a sports car in some cases?) That matches the negative Doors coefficient in the regression table, so door count is not the "more is better" predictor you might expect. This is a very small data set, however, so we can't really read much meaning into it.

## Activity

Mess around with the input data, faking values if you like, and see if you can create a measurable influence of number of doors on price. Have some fun with it: why stop at 4 doors?
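
One possible starting point, assuming you are happy to invent the data outright: the sketch below builds a synthetic data set in which price rises with door count by construction, then recovers that effect with a least-squares fit. Everything here is made up for illustration (the pricing rule, the noise level, the door counts), and it uses `numpy.linalg.lstsq` rather than statsmodels only to keep the example dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n = 500
doors = rng.choice([2, 4, 6, 8], size=n)        # why stop at 4 doors?
mileage = rng.uniform(5_000, 50_000, size=n)

# Made-up pricing rule: each door adds 2000, each mile subtracts 0.15,
# plus random noise with a standard deviation of 1000
price = 20_000 + 2_000 * doors - 0.15 * mileage + rng.normal(0, 1_000, size=n)

# Ordinary least squares: columns are [const, Mileage, Doors]
A = np.column_stack([np.ones(n), mileage, doors])
b0, b_mileage, b_doors = np.linalg.lstsq(A, price, rcond=None)[0]

print(f'const={b0:.1f}  Mileage={b_mileage:.4f}  Doors={b_doors:.1f}')
```

The fitted Doors coefficient should land close to the 2000 built into the pricing rule. Passing the same columns through `sm.add_constant` and `sm.OLS` as in the cell above would also give you the full summary table, so you can watch the t and P>|t| values change as you strengthen or weaken the door effect.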