# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [7]:
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')  # read the Excel file into a DataFrame


In [22]:
df.head()  # show the first five rows

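| | Price | Mileage | Make | Model | Trim | Type | Cylinder | Liter | Doors | Cruise | Sound | Leather | Model_ord | Make_ord | Type_ord | Trim_ord |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17314.103129 | 8221 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 1 | 10 | 0 | 3 | 44 |
| 1 | 17542.036083 | 9135 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 | 10 | 0 | 3 | 44 |
| 2 | 16218.847862 | 13196 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 | 10 | 0 | 3 | 44 |
| 3 | 16336.91314 | 16342 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 0 | 10 | 0 | 3 | 44 |
| 4 | 16339.170324 | 19832 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 1 | 10 | 0 | 3 | 44 |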


We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we use pandas.Categorical to convert textual category data (model name) into a numeric code that we can work with. pandas assigns the codes in sorted label order, so they do not rank the models in any meaningful way.
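To see what `pd.Categorical(...).codes` produces, here is a minimal sketch on a made-up list of model names (the names are illustrative, not rows from the cars file). By default pandas sorts the distinct labels and numbers them from 0, so each code reflects alphabetical position only.

```python
import pandas as pd

# Hypothetical model names, for illustration only.
models = pd.Series(["Century", "Lacrosse", "Century", "Aveo"])

cat = pd.Categorical(models)
codes = cat.codes  # one integer code per row, based on the sorted unique labels

print(list(cat.categories))  # ['Aveo', 'Century', 'Lacrosse']
print(codes.tolist())        # [1, 2, 1, 0]
```

Because the numbering is arbitrary with respect to price, a linear model treats "model 10" as twice "model 5", which is worth keeping in mind when reading the coefficients.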

In [25]:
import statsmodels.api as sm

df['Model_ord'] = pd.Categorical(df.Model).codes  # encode model names as integer codes
X = df[['Mileage', 'Model_ord', 'Doors']]
# pandas has no normalize() function, so mean-normalize each feature by hand
X = (X - X.mean()) / (X.max() - X.min())
y = df[['Price']]
X1 = sm.add_constant(X)  # add a constant column of 1s for the intercept
est = sm.OLS(y, X1).fit()

est.summary()

In [4]:
y.groupby(df.Doors).mean()

| Doors | Price |
|---|---|
| 2 | 23807.13552 |
| 4 | 20580.670749 |


Surprisingly, more doors do not mean a higher price: two-door cars average about $23,807, against about $20,581 for four-door cars. So it's not surprising that the door count is pretty useless as a predictor here. This is a very small data set, however, so we can't really read much meaning into it.
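One way to judge how much weight a gap between group means deserves is to look at the group sizes and the standard error of each mean alongside the means themselves. The sketch below does this on a tiny made-up frame (the prices are invented for illustration); with the real data, `df` from above would take the place of `toy`.

```python
import pandas as pd

# Invented prices, for illustration only; not rows from cars.xls.
toy = pd.DataFrame({
    "Doors": [2, 2, 2, 4, 4, 4, 4],
    "Price": [24000.0, 31000.0, 16500.0, 21000.0, 19500.0, 22000.0, 20000.0],
})

# Mean, row count, and standard error of the mean for each door count.
stats = toy.groupby("Doors")["Price"].agg(["mean", "count", "sem"])
print(stats)
```

If the standard errors are large compared with the difference between the means, the gap could easily be noise.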

## Activity

Mess around with the fake input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?
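As one possible starting point, the sketch below builds entirely synthetic cars in which each extra door is assumed to add a fixed amount to the price, then checks that a least-squares fit recovers that effect. Every number here (the door range, the $1,500 per door, the noise level) is an assumption chosen for illustration, and the fit uses `numpy.linalg.lstsq` rather than `sm.OLS` to keep the example dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n = 500
doors = rng.integers(2, 9, size=n)            # 2 through 8 doors (made up)
mileage = rng.uniform(5_000, 50_000, size=n)  # made-up mileage range

# Assumed "true" relationship for the fake data: each extra door adds
# $1,500 and each mile subtracts $0.10, plus random noise.
price = 10_000 + 1_500 * doors - 0.10 * mileage + rng.normal(0, 1_000, size=n)

# Least-squares fit with an explicit intercept column.
A = np.column_stack([np.ones(n), doors, mileage])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)

intercept, per_door, per_mile = coef
print(f"per-door effect: {per_door:.0f}, per-mile effect: {per_mile:.3f}")
```

The fitted per-door coefficient should land close to the assumed 1,500. Shrinking the assumed effect or raising the noise shows how quickly the influence becomes hard to measure.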

In [28]:
df['Model_ord'] = pd.Categorical(df.Model).codes
df['Make_ord'] = pd.Categorical(df.Make).codes
df['Trim_ord'] = pd.Categorical(df.Trim).codes

X = df[['Mileage', 'Model_ord', 'Trim_ord', 'Cylinder']]
X = (X - X.mean()) / (X.max() - X.min())  # mean-normalize each feature by its range
y = df[['Price']]
y = (y - y.mean()) / (y.max() - y.min())  # scale the target the same way
X1 = sm.add_constant(X)  # add a constant column of 1s for the intercept
est = sm.OLS(y, X1).fit()

est.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.44 |
| Model: | OLS | Adj. R-squared: | 0.437 |
| Method: | Least Squares | F-statistic: | 157.0 |
| Date: | Thu, 08 Sep 2016 | Prob (F-statistic): | 4.26e-99 |
| Time: | 13:40:14 | Log-Likelihood: | 570.57 |
| No. Observations: | 804 | AIC: | -1131.0 |
| Df Residuals: | 799 | BIC: | -1108.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 5.031e-17 | 0.004 | 1.19e-14 | 1.000 | -0.008 0.008 |
| Mileage | -0.1417 | 0.026 | -5.481 | 0.000 | -0.192 -0.091 |
| Model_ord | -0.1314 | 0.016 | -8.059 | 0.000 | -0.163 -0.099 |
| Trim_ord | -0.1008 | 0.014 | -7.348 | 0.000 | -0.128 -0.074 |
| Cylinder | 0.3005 | 0.013 | 23.078 | 0.000 | 0.275 0.326 |

| | | | |
|---|---|---|---|
| Omnibus: | 234.245 | Durbin-Watson: | 0.092 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 690.775 |
| Skew: | 1.437 | Prob(JB): | 1e-150 |
| Kurtosis: | 6.516 | Cond. No. | 6.15 |


In [29]:
X

| | Mileage | Model_ord | Trim_ord | Cylinder |
|---|---|---|---|---|
| 0 | -0.231658 | -0.156074 | 0.376000 | 0.182836 |
| 1 | -0.213422 | -0.156074 | 0.376000 | 0.182836 |
| 2 | -0.132398 | -0.156074 | 0.376000 | 0.182836 |
| 3 | -0.069630 | -0.156074 | 0.376000 | 0.182836 |
| 4 | 0.000001 | -0.156074 | 0.376000 | 0.182836 |
| 5 | 0.047965 | -0.156074 | 0.376000 | 0.182836 |
| 6 | 0.054749 | -0.156074 | 0.376000 | 0.182836 |
| 7 | 0.062490 | -0.156074 | 0.376000 | 0.182836 |
| 8 | 0.083579 | -0.156074 | 0.376000 | 0.182836 |
| 9 | 0.149500 | -0.156074 | 0.376000 | 0.182836 |

| | const | Mileage | Model_ord | Trim_ord | Cylinder |
|---|---|---|---|---|---|
| 0 | 1 | -0.231658 | -0.156074 | 0.376000 | 0.182836 |
| 1 | 1 | -0.213422 | -0.156074 | 0.376000 | 0.182836 |
| 2 | 1 | -0.132398 | -0.156074 | 0.376000 | 0.182836 |
| 3 | 1 | -0.069630 | -0.156074 | 0.376000 | 0.182836 |
| 4 | 1 | 0.000001 | -0.156074 | 0.376000 | 0.182836 |
| 5 | 1 | 0.047965 | -0.156074 | 0.376000 | 0.182836 |
| 6 | 1 | 0.054749 | -0.156074 | 0.376000 | 0.182836 |
| 7 | 1 | 0.062490 | -0.156074 | 0.376000 | 0.182836 |
| 8 | 1 | 0.083579 | -0.156074 | 0.376000 | 0.182836 |
| 9 | 1 | 0.149500 | -0.156074 | 0.376000 | 0.182836 |
