# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [27]:
import pandas as pd

In [28]:

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')


In [29]:
df.head()

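| | Price | Mileage | Make | Model | Trim | Type | Cylinder | Liter | Doors | Cruise | Sound | Leather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17314.103129 | 8221 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 1 |
| 1 | 17542.036083 | 9135 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 2 | 16218.847862 | 13196 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 3 | 16336.91314 | 16342 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 0 |
| 4 | 16339.170324 | 19832 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 1 |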


In [30]:
df.describe()

| | Price | Mileage | Cylinder | Liter | Doors | Cruise | Sound | Leather |
|---|---|---|---|---|---|---|---|---|
| count | 804.0 | 804.0 | 804.0 | 804.0 | 804.0 | 804.0 | 804.0 | 804.0 |
| mean | 21343.143767 | 19831.93408 | 5.268657 | 3.037313 | 3.527363 | 0.752488 | 0.679104 | 0.723881 |
| std | 9884.852801 | 8196.319707 | 1.387531 | 1.105562 | 0.850169 | 0.431836 | 0.467111 | 0.447355 |
| min | 8638.930895 | 266.0 | 4.0 | 1.6 | 2.0 | 0.0 | 0.0 | 0.0 |
| 25% | 14273.07387 | 14623.5 | 4.0 | 2.2 | 4.0 | 1.0 | 0.0 | 0.0 |
| 50% | 18024.995019 | 20913.5 | 6.0 | 2.8 | 4.0 | 1.0 | 1.0 | 1.0 |
| 75% | 26717.316636 | 25213.0 | 6.0 | 3.8 | 4.0 | 1.0 | 1.0 | 1.0 |
| max | 70755.466717 | 50387.0 | 8.0 | 6.0 | 4.0 | 1.0 | 1.0 | 1.0 |


We can use pandas to split this table into the feature columns we're interested in and the value we're trying to predict.

Note that make, model, trim, and type are categorical text values. A regression can't use them directly, and numbering them only helps when the order actually means something. The first fit below one-hot encodes them with `pd.get_dummies`; the second fit leaves them out entirely.

For the second fit, we also put the feature data on the same scale so we can easily compare the coefficients we end up with.

In [31]:
df = pd.get_dummies(df, dtype=float)  # one-hot encode the text columns as 0.0/1.0 so statsmodels gets numeric input
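
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does to a text column. The three-row frame `toy` is made up for illustration and only stands in for the real cars data.

```python
import pandas as pd

# Tiny made-up frame (hypothetical values, not rows from cars.xls)
toy = pd.DataFrame({
    "Price": [17000.0, 30000.0, 12000.0],
    "Make": ["Buick", "Cadillac", "Chevrolet"],
})

# Each distinct Make becomes its own 0/1 column named Make_<value>;
# numeric columns such as Price pass through unchanged
encoded = pd.get_dummies(toy, dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

This is why the feature list in the next cell contains names such as `Make_Buick` and `Type_Sedan`: they are the generated indicator columns, not columns from the original spreadsheet.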

In [34]:
X = df[['Mileage', 'Cylinder', 'Liter', 'Doors', 'Cruise', 'Sound',
       'Leather', 'Make_Buick', 'Make_Cadillac', 'Make_Chevrolet',
       'Make_Pontiac', 'Make_SAAB', 'Make_Saturn', 'Model_9-2X AWD',
       'Model_9_3', 'Model_9_3 HO', 'Model_9_5', 'Model_9_5 HO', 'Model_AVEO',
       'Model_Bonneville', 'Model_CST-V', 'Model_CTS', 'Model_Cavalier',
       'Model_Century', 'Model_Classic', 'Model_Cobalt', 'Model_Corvette',
       'Model_Deville', 'Model_G6', 'Model_GTO', 'Model_Grand Am',
       'Model_Grand Prix', 'Model_Impala', 'Model_Ion', 'Model_L Series',
       'Model_Lacrosse', 'Model_Lesabre', 'Model_Malibu', 'Model_Monte Carlo',
       'Model_Park Avenue', 'Model_STS-V6', 'Model_STS-V8', 'Model_Sunfire',
       'Model_Vibe', 'Model_XLR-V8', 'Trim_AWD Sportwagon 4D',
       'Trim_Aero Conv 2D', 'Trim_Aero Sedan 4D', 'Trim_Aero Wagon 4D',
       'Trim_Arc Conv 2D', 'Trim_Arc Sedan 4D', 'Trim_Arc Wagon 4D',
       'Trim_CX Sedan 4D', 'Trim_CXL Sedan 4D', 'Trim_CXS Sedan 4D',
       'Trim_Conv 2D', 'Trim_Coupe 2D', 'Trim_Custom Sedan 4D',
       'Trim_DHS Sedan 4D', 'Trim_DTS Sedan 4D', 'Trim_GT Coupe 2D',
       'Trim_GT Sedan 4D', 'Trim_GT Sportwagon', 'Trim_GTP Sedan 4D',
       'Trim_GXP Sedan 4D', 'Trim_Hardtop Conv 2D', 'Trim_L300 Sedan 4D',
       'Trim_LS Coupe 2D', 'Trim_LS Hatchback 4D', 'Trim_LS MAXX Hback 4D',
       'Trim_LS Sedan 4D', 'Trim_LS Sport Coupe 2D', 'Trim_LS Sport Sedan 4D',
       'Trim_LT Coupe 2D', 'Trim_LT Hatchback 4D', 'Trim_LT MAXX Hback 4D',
       'Trim_LT Sedan 4D', 'Trim_Limited Sedan 4D', 'Trim_Linear Conv 2D',
       'Trim_Linear Sedan 4D', 'Trim_Linear Wagon 4D', 'Trim_MAXX Hback 4D',
       'Trim_Quad Coupe 2D', 'Trim_SE Sedan 4D', 'Trim_SLE Sedan 4D',
       'Trim_SS Coupe 2D', 'Trim_SS Sedan 4D', 'Trim_SVM Hatchback 4D',
       'Trim_SVM Sedan 4D', 'Trim_Sedan 4D', 'Trim_Special Ed Ultra 4D',
       'Trim_Sportwagon 4D', 'Type_Convertible', 'Type_Coupe',
       'Type_Hatchback', 'Type_Sedan', 'Type_Wagon']]

In [35]:
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

In [36]:
y = df['Price']
est = sm.OLS(y, X).fit()

est.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.992 |
| Model: | OLS | Adj. R-squared: | 0.992 |
| Method: | Least Squares | F-statistic: | 1307.0 |
| Date: | Wed, 27 Feb 2019 | Prob (F-statistic): | 0.0 |
| Time: | 11:56:01 | Log-Likelihood: | -6574.1 |
| No. Observations: | 804 | AIC: | 13300.0 |
| Df Residuals: | 730 | BIC: | 13640.0 |
| Df Model: | 73 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Mileage | -0.1854 | 0.004 | -45.754 | 0.000 | -0.193 | -0.177 |
| Cylinder | 2003.3763 | 79.508 | 25.197 | 0.000 | 1847.285 | 2159.467 |
| Liter | 3215.1541 | 111.186 | 28.917 | 0.000 | 2996.871 | 3433.437 |
| Doors | 1735.2188 | 52.607 | 32.985 | 0.000 | 1631.940 | 1838.497 |
| Cruise | 69.4021 | 101.524 | 0.684 | 0.494 | -129.912 | 268.716 |
| Sound | 211.2642 | 79.765 | 2.649 | 0.008 | 54.669 | 367.859 |
| Leather | 295.3578 | 92.867 | 3.180 | 0.002 | 113.039 | 477.677 |
| Make_Buick | -3053.2575 | 104.128 | -29.322 | 0.000 | -3257.684 | -2848.831 |
| Make_Cadillac | 9385.1366 | 115.221 | 81.454 | 0.000 | 9158.933 | 9611.340 |

| | | | |
|---|---|---|---|
| Omnibus: | 146.716 | Durbin-Watson: | 1.268 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1692.659 |
| Skew: | -0.436 | Prob(JB): | 0.0 |
| Kurtosis: | 10.055 | Cond. No. | 3.29e+19 |
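
The very large condition number in the summary above (3.29e+19) is what you would expect when some one-hot columns are exact linear combinations of others, which happens whenever every level of more than one category gets its own column. The sketch below illustrates the effect with NumPy on a made-up six-row design; the `make` and `body` arrays are hypothetical and are not taken from the cars data.

```python
import numpy as np

# Made-up example: 6 cars, a 3-level "make" and a 2-level body "type",
# each fully one-hot encoded (one column per level)
make = np.array([0, 0, 1, 1, 2, 2])
body = np.array([0, 1, 0, 1, 0, 1])
make_dummies = np.eye(3)[make]   # shape (6, 3)
type_dummies = np.eye(2)[body]   # shape (6, 2)

full = np.column_stack([make_dummies, type_dummies])

# Each group of dummies sums to 1 in every row, so the two groups
# together are linearly dependent: 5 columns, but rank 4
print(full.shape[1], np.linalg.matrix_rank(full))

# Dropping one level from the second group removes the dependency:
# 4 columns, rank 4
reduced = np.column_stack([make_dummies, type_dummies[:, 1:]])
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

With pandas, `pd.get_dummies(..., drop_first=True)` is one way to drop a level per category. When the columns are dependent like this, the overall fit can still be fine, but the individual dummy coefficients are not uniquely determined and should be read with care.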


In [37]:
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

X = df[['Mileage', 'Cylinder', 'Doors']].copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
y = df['Price']

X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)  # .values replaces the removed .as_matrix()

print(X)

est = sm.OLS(y, X).fit()  # note: sm.OLS does not add an intercept on its own

est.summary()

      Mileage  Cylinder     Doors
0   -1.417485  0.527410  0.556279
1   -1.305902  0.527410  0.556279
2   -0.810128  0.527410  0.556279
3   -0.426058  0.527410  0.556279
4    0.000008  0.527410  0.556279
5    0.293493  0.527410  0.556279
6    0.335001  0.527410  0.556279
7    0.382369  0.527410  0.556279
8    0.511409  0.527410  0.556279
9    0.914768  0.527410  0.556279
10  -1.171368  0.527410  0.556279
11  -0.581834  0.527410  0.556279
12  -0.390532  0.527410  0.556279
13  -0.003899  0.527410  0.556279
14   0.430591  0.527410  0.556279
15   0.480156  0.527410  0.556279
16   0.509822  0.527410  0.556279
17   0.757160  0.527410  0.556279
18   1.594886  0.527410  0.556279
19   1.810849  0.527410  0.556279
20  -1.326046  0.527410  0.556279
21  -1.129860  0.527410  0.556279
22  -0.667658  0.527410  0.556279
23  -0.405792  0.527410  0.556279
24  -0.112796  0.527410  0.556279
25  -0.044552  0.527410  0.556279
26   0.190700  0.527410  0.556279
27   0.337442  0.527410  0.556279
28   0.566102 

| | | | |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.064 |
| Model: | OLS | Adj. R-squared: | 0.06 |
| Method: | Least Squares | F-statistic: | 18.11 |
| Date: | Wed, 27 Feb 2019 | Prob (F-statistic): | 2.23e-11 |
| Time: | 11:56:25 | Log-Likelihood: | -9207.1 |
| No. Observations: | 804 | AIC: | 18420.0 |
| Df Residuals: | 801 | BIC: | 18430.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Mileage | -1272.3412 | 804.623 | -1.581 | 0.114 | -2851.759 | 307.077 |
| Cylinder | 5587.4472 | 804.509 | 6.945 | 0.000 | 4008.252 | 7166.642 |
| Doors | -1404.5513 | 804.275 | -1.746 | 0.081 | -2983.288 | 174.185 |

| | | | |
|---|---|---|---|
| Omnibus: | 157.913 | Durbin-Watson: | 0.008 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 257.529 |
| Skew: | 1.278 | Prob(JB): | 1.2e-56 |
| Kurtosis: | 4.074 | Cond. No. | 1.03 |
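
One thing worth checking in a fit like the one above: `sm.OLS` does not add an intercept unless you give it a constant column (for example with `sm.add_constant`), and standardized features have zero mean, so a fit without a constant is forced through the origin. The sketch below uses made-up data (not the cars file) and plain NumPy least squares to show how much a missing intercept can hurt; every name and number in it is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the cars data): two features on very
# different scales and a price that depends on both
n = 200
mileage = rng.uniform(0, 50_000, n)
cylinders = rng.choice([4, 6, 8], n)
price = 20_000 - 0.2 * mileage + 3_000 * cylinders + rng.normal(0, 1_000, n)

X = np.column_stack([mileage, cylinders]).astype(float)

# Standardize each column to zero mean and unit variance,
# which is the transformation StandardScaler applies
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

def r_squared(design, y):
    """Ordinary R^2 (measured around the mean of y) of a least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    centered = y - y.mean()
    return 1 - (resid @ resid) / (centered @ centered)

# Without an intercept column the fit is forced through the origin
r2_no_const = r_squared(X_scaled, price)

# With an explicit column of ones, which is the job sm.add_constant does
X_const = np.column_stack([np.ones(n), X_scaled])
r2_const = r_squared(X_const, price)

print(f"R^2 without intercept: {r2_no_const:.3f}")
print(f"R^2 with intercept:    {r2_const:.3f}")
```

If the same effect is at work in the cars fit, the low R-squared reported above may say more about the missing intercept than about the three features, so it could be worth refitting with `sm.add_constant(X)` and comparing.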


In [38]:
y.groupby(df.Doors).mean()

Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

In [40]:
y.groupby(df.Cylinder).mean()

Cylinder
4    17862.564874
6    20081.395841
8    38968.043180
Name: Price, dtype: float64

Surprisingly, more doors does not mean a higher price! Two-door cars average about 23,800 here, versus about 20,600 for four-door cars. (Maybe two doors implies a sports car in some cases?) So it's no great surprise that door count is a weak predictor in the scaled fit. This is a very small data set, however, so we can't read much meaning into it.

## Activity

Mess around with the input data, faking values if you need to, and see if you can create a measurable influence of the number of doors on price. Have some fun with it - why stop at 4 doors?
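
As a possible starting point, here is a minimal sketch that fabricates a data set in which doors really do drive price, then checks that a least-squares fit recovers the effect. Everything in it is made up: the door counts from 2 to 6, the assumed premium of 2,000 per door, and the noise level.

```python
import numpy as np

rng = np.random.default_rng(42)

# Entirely fake data: door counts from 2 to 6, with an assumed
# premium of 2,000 per extra door plus random noise
n = 500
doors = rng.integers(2, 7, n).astype(float)   # upper bound 7 is exclusive
price = 10_000 + 2_000 * doors + rng.normal(0, 1_500, n)

# Least-squares fit of price on [1, doors]; the column of ones is the intercept
design = np.column_stack([np.ones(n), doors])
(intercept, per_door), *_ = np.linalg.lstsq(design, price, rcond=None)

print(f"intercept: {intercept:,.0f}")
print(f"price change per extra door: {per_door:,.0f}")
```

The fitted slope should land close to the 2,000 that was built into the fake prices. Try shrinking the premium or raising the noise to see when the effect stops being measurable.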