Again using the auto data we used earlier this week, run a multiple linear regression model to predict mpg of a car using the features displacement, weight, and acceleration using statsmodels (note: we dropped cylinders because we found this to be highly correlated with other variables). 

REMEMBER to add a constant to your model! Examine the results and answer the following questions in the thread:

- What is the R-squared value?
- What does the R-squared value tell us?
- Were all the features significant at the .05 alpha level?

In [64]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [65]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data', sep = '\s+',
                 names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 
                     'model_year', 'origin', 'car_name'])

In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null float64
acceleration    398 non-null float64
model_year      398 non-null int64
origin          398 non-null int64
car_name        398 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


In [66]:
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight          float64
acceleration    float64
model_year        int64
origin            int64
car_name         object
dtype: object

In [70]:
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


In [71]:
df_model = df.drop(['cylinders','mpg','car_name'], axis=1)
df_model.horsepower.astype(float, errors = 'ignore')
df_model.head()

Unnamed: 0,displacement,horsepower,weight,acceleration,model_year,origin
0,307.0,130.0,3504.0,12.0,70,1
1,350.0,165.0,3693.0,11.5,70,1
2,318.0,150.0,3436.0,11.0,70,1
3,304.0,150.0,3433.0,12.0,70,1
4,302.0,140.0,3449.0,10.5,70,1


In [72]:
# imputing values for '?' in 'horsepower' column
df_model.loc[df_model.horsepower == '?'] = df_model.loc[df_model.horsepower != '?'].horsepower.median()

In [73]:
X = df_model
y = df.mpg

X = sm.add_constant(X)

In [74]:
# Preview X
X.head()

Unnamed: 0,const,displacement,horsepower,weight,acceleration,model_year,origin
0,1.0,307.0,130.0,3504.0,12.0,70.0,1.0
1,1.0,350.0,165.0,3693.0,11.5,70.0,1.0
2,1.0,318.0,150.0,3436.0,11.0,70.0,1.0
3,1.0,304.0,150.0,3433.0,12.0,70.0,1.0
4,1.0,302.0,140.0,3449.0,10.5,70.0,1.0


In [75]:
mod = sm.OLS(y,X.astype(float), hasconst=True) # must cast to float otherwise, model will not run
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.794
Model:                            OLS   Adj. R-squared:                  0.791
Method:                 Least Squares   F-statistic:                     251.3
Date:                Fri, 17 Jan 2020   Prob (F-statistic):          8.34e-131
Time:                        10:09:55   Log-Likelihood:                -1068.1
No. Observations:                 398   AIC:                             2150.
Df Residuals:                     391   BIC:                             2178.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const          -10.4216      4.697     -2.219   

In [None]:
# R-squared: 0.794
# Tells us that 79.4% of the variance in mpg can be explained by our independent variables
# All variables except displacement, horsepower, and acceleration are significant at the alpha=0.05 level