# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [1]:
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')


In [2]:
df.head()

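| | Price | Mileage | Make | Model | Trim | Type | Cylinder | Liter | Doors | Cruise | Sound | Leather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17314.103129 | 8221 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 1 |
| 1 | 17542.036083 | 9135 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 2 | 16218.847862 | 13196 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 3 | 16336.91314 | 16342 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 0 |
| 4 | 16339.170324 | 19832 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 1 |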


We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we are avoiding the make and model; regressions don't work well with categorical values, unless you can convert them into some numerical order that makes sense.
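If you did want to bring something like the make into a regression, one common option is one-hot encoding, which turns each category into its own 0/1 column instead of inventing an order. The sketch below is only an illustration on a hypothetical toy frame (the rows and the `cars` / `encoded` names are made up; only the column names mirror the car data):

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the car data set.
cars = pd.DataFrame({
    'Make': ['Buick', 'Cadillac', 'Buick', 'Saab'],
    'Mileage': [8221, 12050, 19832, 30100],
})

# One 0/1 indicator column per make. drop_first=True drops one make,
# which then acts as the baseline the other makes are compared against.
encoded = pd.get_dummies(cars, columns=['Make'], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Dropping one category keeps the indicator columns from summing to a constant, which would otherwise duplicate an intercept term if the model has one.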

Let's standardize our feature data (zero mean, unit variance for each column) so we can directly compare the coefficients we end up with.
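For reference, here is a small sketch of what `StandardScaler` computes, using the first five Mileage values from `df.head()`. The `by_hand` name is just for illustration; the point is that each value has the column mean subtracted and is divided by the column's (population) standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The first five Mileage values shown by df.head(), as a single column.
mileage = np.array([[8221.0], [9135.0], [13196.0], [16342.0], [19832.0]])

scaled = StandardScaler().fit_transform(mileage)

# The same thing by hand: subtract the mean, divide by the standard deviation.
# np.std defaults to the population formula (ddof=0), which is what StandardScaler uses.
by_hand = (mileage - mileage.mean(axis=0)) / mileage.std(axis=0)

print(np.allclose(scaled, by_hand))
```

Because every column ends up in units of "standard deviations from its own mean", a coefficient then reads as the price change per one standard deviation of that feature.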

In [3]:
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

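# Take a copy so we aren't assigning into a slice of df (avoids SettingWithCopyWarning)
X = df[['Mileage', 'Cylinder', 'Doors']].copy()
y = df['Price']

# as_matrix() has been removed from pandas; fit_transform accepts the DataFrame directly
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']])

print(X)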

est = sm.OLS(y, X).fit()

est.summary()

      Mileage  Cylinder     Doors
0   -1.417485  0.527410  0.556279
1   -1.305902  0.527410  0.556279
2   -0.810128  0.527410  0.556279
3   -0.426058  0.527410  0.556279
4    0.000008  0.527410  0.556279
5    0.293493  0.527410  0.556279
6    0.335001  0.527410  0.556279
7    0.382369  0.527410  0.556279
8    0.511409  0.527410  0.556279
9    0.914768  0.527410  0.556279
10  -1.171368  0.527410  0.556279
11  -0.581834  0.527410  0.556279
12  -0.390532  0.527410  0.556279
13  -0.003899  0.527410  0.556279
14   0.430591  0.527410  0.556279
15   0.480156  0.527410  0.556279
16   0.509822  0.527410  0.556279
17   0.757160  0.527410  0.556279
18   1.594886  0.527410  0.556279
19   1.810849  0.527410  0.556279
20  -1.326046  0.527410  0.556279
21  -1.129860  0.527410  0.556279
22  -0.667658  0.527410  0.556279
23  -0.405792  0.527410  0.556279
24  -0.112796  0.527410  0.556279
25  -0.044552  0.527410  0.556279
26   0.190700  0.527410  0.556279
27   0.337442  0.527410  0.556279
28   0.566102 

  


| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.064 |
| Model: | OLS | Adj. R-squared: | 0.06 |
| Method: | Least Squares | F-statistic: | 18.11 |
| Date: | Sun, 30 Dec 2018 | Prob (F-statistic): | 2.23e-11 |
| Time: | 20:44:29 | Log-Likelihood: | -9207.1 |
| No. Observations: | 804 | AIC: | 18420.0 |
| Df Residuals: | 801 | BIC: | 18430.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Mileage | -1272.3412 | 804.623 | -1.581 | 0.114 | -2851.759 | 307.077 |
| Cylinder | 5587.4472 | 804.509 | 6.945 | 0.000 | 4008.252 | 7166.642 |
| Doors | -1404.5513 | 804.275 | -1.746 | 0.081 | -2983.288 | 174.185 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 157.913 | Durbin-Watson: | 0.008 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 257.529 |
| Skew: | 1.278 | Prob(JB): | 1.2e-56 |
| Kurtosis: | 4.074 | Cond. No. | 1.03 |


In [4]:
y.groupby(df.Doors).mean()

Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

Surprisingly, more doors do not mean a higher price: two-door cars average about $23,800, versus about $20,600 for four-door cars. (Maybe two doors implies a sports car in some cases?) The Doors coefficient is also not significant at the 5% level (p = 0.081), so it's a weak predictor here. This is a very small data set, however, so we can't read much meaning into it.
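When comparing group means on a small data set, it helps to look at how many rows sit behind each mean and how spread out they are. A minimal sketch on hypothetical prices (the `toy` frame and its values are invented, not taken from the car data):

```python
import pandas as pd

# Hypothetical prices, only to show the pattern.
toy = pd.DataFrame({
    'Doors': [2, 2, 4, 4, 4, 4],
    'Price': [30000.0, 18000.0, 21000.0, 19000.0, 22000.0, 20000.0],
})

# Count, mean, and standard deviation of price for each door count.
summary = toy.groupby('Doors')['Price'].agg(['count', 'mean', 'std'])
print(summary)
```

A higher mean backed by only a handful of widely spread rows is much weaker evidence than the mean alone suggests.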

## Activity

Mess around with the input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?

## Trying this on my own :')

Avocado Prices from [Kaggle](https://www.kaggle.com/neuromusic/avocado-prices)

In [2]:
import pandas as pd

df = pd.read_csv('../Datasets/avocado.csv')

In [3]:
df.head()

| | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |


In [7]:
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

features = ['Total Volume', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags']

# Take a copy so we aren't assigning into a slice of df (avoids SettingWithCopyWarning)
X = df[features].copy()
y = df['AveragePrice']

# as_matrix() has been removed from pandas; fit_transform accepts the DataFrame directly
X[features] = scale.fit_transform(X[features])

print(X)

est = sm.OLS(y, X).fit()

est.summary()

       Total Volume  Total Bags  Small Bags  Large Bags  XLarge Bags
0         -0.227716   -0.234170   -0.232647   -0.222352    -0.175580
1         -0.230427   -0.233350   -0.231568   -0.222335    -0.175580
2         -0.212085   -0.234730   -0.233399   -0.222311    -0.175580
3         -0.223444   -0.237096   -0.236568   -0.222186    -0.175580
4         -0.231538   -0.236718   -0.236154   -0.221924    -0.175580
5         -0.230107   -0.236211   -0.235390   -0.222212    -0.175580
6         -0.222152   -0.234554   -0.233192   -0.222234    -0.175580
7         -0.214630   -0.236064   -0.235778   -0.220429    -0.175580
8         -0.217415   -0.231441   -0.229295   -0.221571    -0.175580
9         -0.224791   -0.234242   -0.233373   -0.220421    -0.175580
10        -0.221749   -0.234668   -0.233619   -0.221391    -0.175580
11        -0.227643   -0.232723   -0.230954   -0.221678    -0.175580
12        -0.228652   -0.234110   -0.232946   -0.221190    -0.175580
13        -0.215391   -0.236870   

  


| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | AveragePrice | R-squared: | 0.003 |
| Model: | OLS | Adj. R-squared: | 0.003 |
| Method: | Least Squares | F-statistic: | 11.22 |
| Date: | Sun, 30 Dec 2018 | Prob (F-statistic): | 7.96e-11 |
| Time: | 23:05:58 | Log-Likelihood: | -32804.0 |
| No. Observations: | 18249 | AIC: | 65620.0 |
| Df Residuals: | 18244 | BIC: | 65660.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Total Volume | -0.1388 | 0.044 | -3.173 | 0.002 | -0.225 | -0.053 |
| Total Bags | -3.076e+04 | 1.66e+05 | -0.185 | 0.853 | -3.57e+05 | 2.95e+05 |
| Small Bags | 2.328e+04 | 1.26e+05 | 0.185 | 0.853 | -2.23e+05 | 2.7e+05 |
| Large Bags | 7610.1930 | 4.11e+04 | 0.185 | 0.853 | -7.3e+04 | 8.83e+04 |
| XLarge Bags | 551.9206 | 2983.898 | 0.185 | 0.853 | -5296.800 | 6400.641 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 987.06 | Durbin-Watson: | 0.014 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1159.008 |
| Skew: | 0.59 | Prob(JB): | 2.11e-252 |
| Kurtosis: | 3.363 | Cond. No. | 41700000.0 |


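Maybe try with different data? I mean, they are all bags: Total Bags is just Small Bags + Large Bags + XLarge Bags (8603.62 + 93.25 + 0.0 = 8696.87 in the first row), so the bag columns are collinear. That would explain the huge condition number (41,700,000) and why the four bag coefficients all share the same p-value of 0.853. The R-squared of 0.003 says these features explain almost none of the price anyway.

Eh, well. Moving on.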
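To see why a column that is the sum of other columns causes trouble, here is a small sketch on synthetic bag counts (all values and the `standardize` helper are made up for illustration, not taken from the avocado data). It compares the condition number of the standardized feature matrix with and without the redundant total column:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bag counts; 'total' is built as the exact sum of the other three.
small = rng.uniform(1000, 10000, size=100)
large = rng.uniform(50, 500, size=100)
xlarge = rng.uniform(0, 50, size=100)
total = small + large + xlarge

with_total = np.column_stack([total, small, large, xlarge])
without_total = np.column_stack([small, large, xlarge])


def standardize(a):
    """Zero mean and unit variance per column, like StandardScaler."""
    return (a - a.mean(axis=0)) / a.std(axis=0)


# Number of linearly independent columns among the four.
print(np.linalg.matrix_rank(standardize(with_total)))

# Condition numbers: enormous with the redundant column, modest without it.
print(np.linalg.cond(standardize(with_total)))
print(np.linalg.cond(standardize(without_total)))
```

Dropping the total (or keeping only the total) removes the redundancy, so the remaining coefficients can be estimated on their own.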