<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/4482_Tutorial_Regression_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                'acceleration', 'model_year', 'origin']

orig_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

In [None]:
car_mpg = orig_dataset.copy()

In [None]:
car_mpg.head()

In [None]:
car_mpg.info()

In [None]:
car_mpg.isnull().sum()

We have 6 nulls, all in the horsepower column. Let's drop them.

In [None]:
car_mpg = car_mpg.dropna()

In [None]:
car_mpg.shape

In [None]:
car_mpg['origin'] = car_mpg['origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
car_mpg.origin.value_counts()

In [None]:
car_mpg.origin.value_counts().plot.bar(title="count of origin")
plt.show()

In [None]:
sns.pairplot(car_mpg)

In [None]:
sns.pairplot(car_mpg, diag_kind='kde')

# compute the correlation matrix

In [None]:
# compute correlation
# origin now holds strings, so leave it out of the numeric correlation
cor = car_mpg.drop('origin', axis=1).corr()

# Correlation

# heatmap of correlation

In [None]:


# plot the heatmap
sns.heatmap(cor,
            annot=True,
            cmap=sns.color_palette("vlag"),
            xticklabels=cor.columns,
            yticklabels=cor.columns)

plt.show()

# show the correlation

In [None]:
cor['mpg'].sort_values(ascending=False)

# positive correlation

In [None]:
cor[cor['mpg']>0]['mpg'].sort_values(ascending=False)

# negative correlation

In [None]:
cor[cor['mpg']<0]['mpg'].sort_values(ascending=True)

`scipy.stats.linregress` lets us fit a simple linear regression: one predictor and one outcome. It does not support multiple regression (multiple predictors), but for this investigation we want to look at each predictor against our outcome separately.
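
As a minimal sketch of what `stats.linregress` returns, the cell below fits a line to synthetic data (the numbers are made up for illustration and are not the auto-mpg values). It also checks that, for a single predictor, R squared is simply the squared Pearson correlation, which is why the regressions in this notebook agree with the correlation matrix above.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration only):
# y depends linearly on x, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 - 0.5 * x + rng.normal(scale=0.2, size=x.size)

result = stats.linregress(x, y)

# One slope and one intercept: a single predictor, a single outcome.
print("slope", result.slope)
print("intercept", result.intercept)
print("p value", result.pvalue)

# For simple linear regression, R squared equals the squared correlation.
r_squared = result.rvalue ** 2
pearson_r = np.corrcoef(x, y)[0, 1]
print("r2", r_squared, "corr squared", pearson_r ** 2)
```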

In [None]:
predictor = 'weight'

model = stats.linregress(car_mpg[predictor], car_mpg['mpg'])
print("r2", model.rvalue**2)
print("p value", model.pvalue)
print("slope", model.slope)
plt.plot(car_mpg[predictor], car_mpg['mpg'], 'o', label='original data')
plt.plot(car_mpg[predictor], model.intercept + model.slope*car_mpg[predictor], 'r', label='fitted line')
plt.legend()
plt.title(f'regressing mpg and {predictor}')
plt.xlabel(predictor)
plt.ylabel('mpg')
plt.show()

In [None]:
predictor = 'horsepower'

model = stats.linregress(car_mpg[predictor], car_mpg['mpg'])
print("r2",model.rvalue**2)
print("p value",model.pvalue)
print("slope", model.slope)
plt.plot(car_mpg[predictor], car_mpg['mpg'], 'o', label='original data')
plt.plot(car_mpg[predictor], model.intercept + model.slope*car_mpg[predictor], 'r', label='fitted line')
plt.legend()
plt.title(f'regressing mpg and {predictor}')
plt.xlabel(predictor)
plt.ylabel('mpg')
plt.show()

In [None]:
predictor = 'model_year'

model = stats.linregress(car_mpg[predictor], car_mpg['mpg'])
print("r2",model.rvalue**2)
print("p value",model.pvalue)
print("slope", model.slope)
plt.plot(car_mpg[predictor], car_mpg['mpg'], 'o', label='original data')
plt.plot(car_mpg[predictor], model.intercept + model.slope*car_mpg[predictor], 'r', label='fitted line')
plt.legend()
plt.title(f'regressing mpg and {predictor}')
plt.xlabel(predictor)
plt.ylabel('mpg')
plt.show()

# removing redundant code with a function

It's becoming clear that this code needs to be repeated for each predictor. Let's create a function to simplify it.

In [None]:
def helper_fun(predictor, df):
  """Regress mpg on one predictor, print fit statistics, and plot the fit."""
  model = stats.linregress(df[predictor], df['mpg'])
  print("r2", model.rvalue**2)
  print("p value", model.pvalue)
  print("slope", model.slope)
  plt.plot(df[predictor], df['mpg'], 'o', label='original data')
  plt.plot(df[predictor], model.intercept + model.slope*df[predictor], 'r', label='fitted line')
  plt.legend()
  plt.title(f'regressing mpg and {predictor}')
  plt.xlabel(predictor)
  plt.ylabel('mpg')
  plt.show()

In [None]:
helper_fun('model_year',car_mpg)

In [None]:
car_mpg.drop('origin',axis=1).columns

In [None]:
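# skip mpg itself (regressing mpg on mpg is trivially perfect) and the string column origin
for col in car_mpg.drop(['mpg', 'origin'], axis=1).columns: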
  helper_fun(col,car_mpg)

In [None]:
# dtype=int gives 0/1 indicator columns (newer pandas versions default to True/False)
cat_data = pd.get_dummies(car_mpg[['mpg','origin']], dtype=int)
cat_data.head()

In [None]:
helper_fun('origin_USA',cat_data)

In [None]:
cat_data.groupby('origin_USA').mean()

In [None]:
helper_fun('origin_Europe',cat_data)

In [None]:
cat_data.groupby('origin_Europe').mean()

In [None]:
helper_fun('origin_Japan',cat_data)

In [None]:
cat_data.groupby('origin_Japan').mean()

In [None]:
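# `model` inside helper_fun is local, so refit here to inspect the slope
stats.linregress(cat_data['origin_Japan'], cat_data['mpg']).slope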

In [None]:
import numpy as np

Take the values of about 29 and 20 for origin_USA 0 and 1 respectively and inspect the graph. Do you see where those values match the graph? Not coincidentally, they are exactly where the red line passes through the data at 0 and 1. Repeat the comparison for Europe and Japan to check your understanding.

Regression on a single 0/1 category indicator simply amounts to finding the mean outcome for each of the two groups.
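
The claim above can be checked on a tiny made-up table (the column name `is_usa` and the values are hypothetical, chosen only for illustration). With a 0/1 predictor, the fitted intercept is the mean of the 0 group, and the intercept plus the slope is the mean of the 1 group, so the slope is the difference between the two group means.

```python
import pandas as pd
from scipy import stats

# Toy data (made up for illustration): a 0/1 dummy and an outcome.
toy = pd.DataFrame({
    "is_usa": [0, 0, 0, 1, 1, 1, 1],
    "mpg":    [28.0, 30.0, 29.0, 18.0, 21.0, 20.0, 21.0],
})

fit = stats.linregress(toy["is_usa"], toy["mpg"])
group_means = toy.groupby("is_usa")["mpg"].mean()

# The line's value at 0 is the intercept; at 1 it is intercept + slope.
print("line at 0:", fit.intercept, "group mean:", group_means.loc[0])
print("line at 1:", fit.intercept + fit.slope, "group mean:", group_means.loc[1])
```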

# understanding non-linearity and transforms

In [None]:
new_car_mpg = car_mpg.copy()
# reciprocal transform, matching the column name: 1 / horsepower
new_car_mpg['horsepower_1_over_x'] = 1 / new_car_mpg['horsepower']

In [None]:
helper_fun('horsepower',new_car_mpg)

In [None]:
1/10


In [None]:
helper_fun('horsepower_1_over_x',new_car_mpg)

Our new R squared suggests we have improved the overall fit of the line. We have changed the predictor with a non-linear transform, and the result is a more linear relationship. But what does our x-axis unit mean now? It could be translated back to the original units with the inverse of the transform we used, but the casual observer will no longer be able to interpret horsepower easily.
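
To make the point about units concrete, here is a small sketch on synthetic data, assuming a reciprocal transform as the column name `horsepower_1_over_x` suggests (the variable names and numbers are made up and are not fitted to auto-mpg). A prediction for a raw horsepower value has to pass through the same transform first, and a value on the transformed axis is read back in horsepower by applying the inverse.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption, not auto-mpg): outcome falls off like 1/x.
hp = np.linspace(50, 200, 40)
mpg = 5.0 + 2000.0 / hp

hp_inv = 1.0 / hp                      # transformed predictor
fit = stats.linregress(hp_inv, mpg)

# To predict for 100 horsepower, transform first, then apply the fitted line.
raw_hp = 100.0
predicted = fit.intercept + fit.slope * (1.0 / raw_hp)
print("predicted mpg at 100 hp:", predicted)

# To read a transformed-axis value in original units, invert the transform.
axis_value = 0.01
print("axis value", axis_value, "corresponds to", 1.0 / axis_value, "horsepower")
```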

Also, have ALL the dots gotten closer to the line in ALL cases? Or have we made some worse? Transforms like this are often a tradeoff.
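
One way to explore that question is to compare each point's absolute residual before and after the transform. The sketch below does this on synthetic data with a reciprocal transform (both the data and the hypothetical helper `abs_residuals` are assumptions for illustration); the same comparison could be run on `new_car_mpg`.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption, not auto-mpg): a noisy 1/x relationship.
rng = np.random.default_rng(1)
x = np.linspace(50, 200, 60)
y = 5.0 + 2000.0 / x + rng.normal(scale=1.5, size=x.size)

def abs_residuals(pred, out):
    """Fit a line and return each point's absolute residual plus R squared."""
    fit = stats.linregress(pred, out)
    fitted = fit.intercept + fit.slope * pred
    return np.abs(out - fitted), fit.rvalue ** 2

raw_res, raw_r2 = abs_residuals(x, y)
inv_res, inv_r2 = abs_residuals(1.0 / x, y)

print("r2 raw", raw_r2, "r2 transformed", inv_r2)

# Count how many individual points ended up closer to their fitted line.
closer = int(np.sum(inv_res < raw_res))
print(closer, "of", x.size, "points moved closer;", x.size - closer, "did not")
```

A higher overall R squared does not guarantee that every individual point improved, which is the tradeoff the text describes.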

This is your first foray into a topic known as "Feature Engineering" in which you attempt to improve your model by creating new features based on the original data. 

In [None]:
!cp "/content/drive/My Drive/Colab Notebooks/4482_Tutorial_Regression_EDA.ipynb" ./

# run the second shell command, jupyter nbconvert --to html "file name of the notebook"
# create html from ipynb

!jupyter nbconvert --to html "4482_Tutorial_Regression_EDA.ipynb"