# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [29]:
import pandas as pd

df = pd.read_excel('cars.xls')


In [30]:
df.head()

| | Price | Mileage | Make | Model | Trim | Type | Cylinder | Liter | Doors | Cruise | Sound | Leather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70755.466717 | 583 | Cadillac | XLR-V8 | Hardtop Conv 2D | Convertible | 8 | 4.6 | 8 | 1 | 1 | 1 |
| 1 | 69133.731722 | 7892 | Cadillac | XLR-V8 | Hardtop Conv 2D | Convertible | 8 | 4.6 | 8 | 1 | 1 | 1 |
| 2 | 68566.187189 | 6420 | Cadillac | XLR-V8 | Hardtop Conv 2D | Convertible | 8 | 4.6 | 8 | 1 | 1 | 1 |
| 3 | 66374.30704 | 12021 | Cadillac | XLR-V8 | Hardtop Conv 2D | Convertible | 8 | 4.6 | 8 | 1 | 1 | 1 |
| 4 | 65281.481237 | 15600 | Cadillac | XLR-V8 | Hardtop Conv 2D | Convertible | 8 | 4.6 | 8 | 1 | 1 | 1 |


We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we are avoiding the make and model; regressions don't work well with categorical values unless you can convert them into numbers in a way that makes sense, such as a meaningful numerical order.
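Make and Model are category labels rather than quantities. One common way to turn such a column into numbers is one-hot encoding. The sketch below uses a small hypothetical frame (`cars` and its values are made up for illustration, not read from `cars.xls`) and `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical rows mimicking the Make and Mileage columns of the car data.
cars = pd.DataFrame({
    "Make": ["Cadillac", "Buick", "Cadillac", "Saab"],
    "Mileage": [583, 7892, 6420, 12021],
})

# One-hot encode Make: one 0/1 column per category.
# drop_first=True drops the first category (Buick) to avoid a redundant column.
encoded = pd.get_dummies(cars, columns=["Make"], drop_first=True, dtype=int)
print(encoded)
```

Each remaining `Make_*` column can then be used as a regression feature, with the dropped category acting as the baseline.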

Let's standardize our feature data (zero mean, unit variance for each column) so the features are on a comparable scale and we can easily compare the coefficients we end up with.
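To see what this kind of scaling does, here is a minimal sketch on hypothetical numbers (the `toy` frame is made up for illustration). It subtracts each column's mean and divides by the population standard deviation, which is what `StandardScaler` does by default:

```python
import pandas as pd

# Hypothetical values on very different scales.
toy = pd.DataFrame({
    "Mileage": [5000, 15000, 25000, 35000],
    "Cylinder": [4, 4, 6, 8],
})

# Standardize: subtract the column mean, divide by the population
# standard deviation (ddof=0).
standardized = (toy - toy.mean()) / toy.std(ddof=0)
print(standardized.round(3))
```

After this step a coefficient describes the effect of a one-standard-deviation change in its feature, which is why coefficients become comparable across features.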

In [31]:
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

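# The version below throws a lot of warnings
# (and as_matrix() has since been removed from pandas):
#X = df[['Mileage', 'Cylinder', 'Doors']]
#y = df['Price']

#X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].as_matrix())

# Pick the feature columns with a list of names, and the target column.
features = ['Mileage', 'Cylinder', 'Doors']
y = df.loc[:, 'Price']

# Build the standardized feature frame directly instead of assigning
# scaled values back into a slice of df.
X = pd.DataFrame(scale.fit_transform(df[features]), columns=features, index=df.index)

print(X)

# Note: sm.OLS does not add an intercept term on its own, and none is added here.
est = sm.OLS(y, X).fit()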

est.summary()

      Mileage  Cylinder     Doors
0   -2.349947  1.969717  3.313091
1   -1.457650  1.969717  3.313091
2   -1.637355  1.969717  3.313091
3   -0.953574  1.969717  3.313091
4   -0.516643  1.969717  3.313091
5   -0.199230  1.969717  3.313091
6    0.410325  1.969717  3.313091
7    1.150996  1.969717  3.313091
8    1.461695  1.969717  3.313091
9    2.790679  1.969717  3.313091
10  -2.152296  1.969717  3.313091
11  -1.605003  1.969717  3.313091
12  -2.101754  1.969717  3.313091
13  -2.324920  1.969717  3.313091
14  -1.781533  1.969717  3.313091
15  -0.548018  1.969717  3.313091
16  -1.978574  1.969717  3.313091
17  -0.732850  1.969717  3.313091
18  -0.942098  1.969717  3.313091
19   0.478447  1.969717  3.313091
20  -0.577195  1.969717  3.313091
21   0.184230  1.969717  3.313091
22   0.187404  1.969717  3.313091
23   0.432056  1.969717  3.313091
24   0.655222  1.969717  3.313091
25  -1.749792  1.969717  3.313091
26  -0.645317  1.969717  3.313091
27  -2.073675  0.527410  3.313091
28   0.515194 



|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.071 |
| Model: | OLS | Adj. R-squared: | 0.068 |
| Method: | Least Squares | F-statistic: | 20.45 |
| Date: | Fri, 08 Mar 2019 | Prob (F-statistic): | 8.98e-13 |
| Time: | 03:26:40 | Log-Likelihood: | -9203.9 |
| No. Observations: | 804 | AIC: | 18410.0 |
| Df Residuals: | 801 | BIC: | 18430.0 |
| Df Model: | 3 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Mileage | -970.7422 | 806.219 | -1.204 | 0.229 | -2553.293 | 611.809 |
| Cylinder | 4608.1349 | 860.828 | 5.353 | 0.000 | 2918.389 | 6297.881 |
| Doors | 2688.4362 | 866.107 | 3.104 | 0.002 | 988.328 | 4388.544 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 163.746 | Durbin-Watson: | 0.042 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 272.998 |
| Skew: | 1.302 | Prob(JB): | 5.24e-60 |
| Kurtosis: | 4.172 | Cond. No. | 1.49 |
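One thing worth checking in a fit like this: `sm.OLS` does not add an intercept by itself, and the features were standardized to mean zero, so the model has no term that can absorb the average price. That may be one reason the R-squared reported above is so low. In statsmodels, `sm.add_constant(X)` adds the column of ones. The sketch below illustrates the effect with plain NumPy on hypothetical toy data (the numbers are made up and are not the car data):

```python
import numpy as np

# Hypothetical toy data: prices sit around 20000 and depend on one feature.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()  # standardized: mean 0, std 1
price = 20000 + 3000 * x + rng.normal(scale=500, size=200)

# Fit without an intercept: the fitted line is forced through the origin.
X_no_const = x.reshape(-1, 1)
coef_no_const, *_ = np.linalg.lstsq(X_no_const, price, rcond=None)

# Fit with an intercept: prepend a column of ones.
X_const = np.column_stack([np.ones_like(x), x])
coef_const, *_ = np.linalg.lstsq(X_const, price, rcond=None)

# Compare the root-mean-square error of the two fits.
rmse_no_const = np.sqrt(np.mean((price - X_no_const @ coef_no_const) ** 2))
rmse_const = np.sqrt(np.mean((price - X_const @ coef_const) ** 2))
print("RMSE without intercept:", rmse_no_const)
print("RMSE with intercept:   ", rmse_const)
```

With a centered feature, the fitted intercept equals the mean price, and the slope is the same in both fits; only the intercept-free model is stuck predicting prices centered on zero.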


In [32]:
y.groupby(df.Doors).mean()

Doors
2    20870.118496
4    19597.998868
8    48421.113104
Name: Price, dtype: float64

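Surprisingly, more doors does not mean a higher price: 4-door cars average slightly less than 2-door cars. (Maybe 2 doors implies a sports car in some cases?) So it's not surprising that door count is a weak predictor for real cars. The 8-door group is a different story: it includes the XLR-V8 "Conv 2D" rows shown by `df.head()`, so it looks like edited input, and it is probably what pushes the Doors coefficient above so high. This is a very small data set, however, so we can't read much meaning into it.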

## Activity

Mess around with the fake input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?