In [None]:
# Importing Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error


<h1>EDA and Regression</h1>

Here we will walk through a regression on a common example dataset, including some typical exploratory steps. This is representative of what we'd do in a real situation, just in a simplified form. As we go, we want to:

<ul>
<li>Load and clean the data - make sure we've gotten rid of junk, fixed errors and blanks, corrected data type issues, etc.
<li>Explore the data - what does our data look like? Is it consistent? What are the distributions and correlations? Is there anything we should adjust or correct before proceeding?
<li>Shape the data for prediction - get whatever data we are going to use into a regression-ready format.
<li>Perform the regression.
<li>Examine the results.
</ul>

For the cleaning and exploring steps especially, we aren't following a fixed set of actions. We want to look at the data and see if there's anything to change - some data will need cleaning, some won't; some data will have features we want to remove or change, some won't. We are basically looking for anything that might lead us away from just using all the data unchanged. This example isn't super dirty or complex, so we won't be doing an overwhelming amount here. This process is something we get better at with time, practice, and more ML tools.

In [None]:
# Importing the Dataset
# sep=r"\s+" splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv("auto-mpg.data", sep=r"\s+", names=["MPG","Cylinders","Displacement","HP","Weight","Acceleration","Year","Origin","Name"])
df.head()

In [None]:
#Get some info
df.info()

I'll set the two categorical variables (Cylinders and Origin) as categorical. This isn't required, but it tells some functions (e.g. pairplot, describe) to treat them as categorical variables.

Correctly identified datatypes may or may not affect whether things run, but setting them is good practice and makes some functions behave more as we'd expect.
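As a small illustration of how the dtype changes what a function reports, here is a sketch on a made-up frame (the `toy` values are hypothetical, not rows from the auto-mpg file). It compares `describe()` on the same column before and after the conversion.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"Cylinders": [4, 4, 6, 8], "MPG": [30.0, 28.0, 21.0, 15.0]})

# As an integer column, describe() reports numeric statistics (mean, std, quartiles)
numeric_summary = toy["Cylinders"].describe()

# As a category, describe() reports count, unique, top and freq instead
toy["Cylinders"] = toy["Cylinders"].astype("category")
category_summary = toy["Cylinders"].describe()

print(numeric_summary.index.tolist())
print(category_summary.index.tolist())
```

The data is identical in both cases; only the dtype tells pandas which kind of summary makes sense.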

In [None]:
# Converting the variables to the correct types

#HP contains "?" placeholders, so the float conversion below raises an error.
#Uncomment this filter once we've seen that error.
#df = df[df["HP"]!="?"]

df['Cylinders']=df['Cylinders'].astype('category')
df['Origin']=df['Origin'].astype('category')
df['HP']=df['HP'].astype('float64')
df.info()


In [None]:
#visualize pairplot
sns.pairplot(df)

There are only a few things in the pairplot that we need to pay attention to. For the most part the data looks unremarkable:
<ul>
<li>Some collinearity between weight, displacement, and HP. We can address that later on. 
<li>MPG (the target) looks to have a non-linear relationship with those variables. We'll consider things like that next time. 
</ul> 

We can look at boxplots and counts for the categorical variables. The countplot is basically a simple histogram for categorical data.

We weren't told exactly what Origin is, but I suspect it is American/Japanese/European - we could likely verify this by looking at the car names and using our domain knowledge of where different cars are from.

The grid used below is the matplotlib way to do subplots; it is a slightly more elaborate version of the thinkplot one. We basically make a grid, then assign each graph to a square in the grid. The details of exactly how we visualize data don't really matter, as long as we graph it, but playing with some different ways gives us more tools to make things look OK. Googling "seaborn ___________" will almost always give some examples online - seaborn is much easier to use than matplotlib directly, so I'd advise sticking to that.

In [None]:
#Print boxplots - we'll put them in subplots to make it look fancy
#Countplots are basically categorical hists - we could've used a hist as well
fig, ax = plt.subplots(nrows=2,ncols=2,figsize=(20,12))
fig.suptitle("Categorical Plotting", fontsize=35)
sns.boxplot(x="Cylinders", y="MPG", data=df,ax=ax[0,0])
sns.boxplot(x="Origin", y="MPG", data=df,ax=ax[0,1])
sns.countplot(x="Cylinders", data=df,ax=ax[1,0])
sns.countplot(x="Origin", data=df,ax=ax[1,1])

In [None]:
df[df["Cylinders"]==3].Name.count(), df[df["Cylinders"]==5].Name.count()

Looks like 3 and 5 cylinders are rare (if you know about cars, this makes sense). I'm going to consider dropping those - we'll see. The reasoning is that we want to predict MPG in general, and there simply aren't very many samples for those two subgroups, so they may end up being more confounding than helpful. E.g. the impact of having 5 cylinders will be due to the 3 specific cars we have in the data, not the general impact of having that many cylinders.

A similar example: say you were predicting the height of people and one factor was hair color. If you had 3 redheads in the data, any influence "being a redhead" has on the expected height would be overwhelmed by the influence of those specific 3 people. So that feature isn't giving you the "impact of having red hair", it is giving you the impact of "being Jim" - or whichever specific person is in the sample.

As well, if the data is being split, the results will probably be skewed - e.g. you may only get one 3 cylinder car in the training set, so the model never sees the impact of 3 cylinders, only that one car. If you happen to get the one 3cyl car with terrible MPG, that will impact the results. If you got the most efficient 3cyl car ever in the next trial, things could be totally different - the model would be unstable, which is bad.
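One way to make the "too few samples" check explicit is to count rows per category and filter on a cutoff. This is only a sketch: the `toy` frame and the `min_count` threshold are made-up assumptions, and the right cutoff for real data is a judgment call.

```python
import pandas as pd

# Hypothetical toy data: 3- and 5-cylinder cars appear once each
toy = pd.DataFrame({
    "Cylinders": [4, 4, 4, 4, 6, 6, 6, 8, 8, 8, 3, 5],
    "MPG": [30, 28, 31, 27, 21, 20, 19, 15, 14, 13, 19, 20],
})

min_count = 3  # assumed cutoff, not a rule

# Count rows per category and collect the levels below the cutoff
counts = toy["Cylinders"].value_counts()
rare_levels = counts[counts < min_count].index.tolist()

# Keep only rows whose category is common enough
filtered = toy[~toy["Cylinders"].isin(rare_levels)]

print(sorted(rare_levels), len(toy), len(filtered))
```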

In [None]:
## Correlation Matrix
# numeric_only=True skips the non-numeric columns (Name and the categoricals)
corr = df.corr(numeric_only=True)
corr.style.background_gradient().format(precision=2)

The correlation matrix is very useful in linear regression. If features are strongly correlated with each other, the model can't reliably attribute the effect on the target to one of them rather than the other.

Here, weight and displacement (the size of the engine) are highly correlated. We may remove one later on... 
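For larger tables it can help to list the strongly correlated pairs rather than read them off the grid. The sketch below uses synthetic data in which `Displacement` is deliberately built to track `Weight`; the column names, the generated values, and the 0.8 `threshold` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weight = rng.normal(3000, 500, size=200)

# Synthetic data: Displacement follows Weight, Acceleration is independent
toy = pd.DataFrame({
    "Weight": weight,
    "Displacement": 0.1 * weight + rng.normal(0, 10, size=200),
    "Acceleration": rng.normal(15, 2, size=200),
})

corr = toy.corr()
threshold = 0.8  # assumed cutoff for "highly correlated"

# Keep only the upper triangle so each pair shows up once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs.abs() > threshold]

print(high_pairs)
```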
<br><br><br>

<h2>Prep data for regression</h2>

At this point we are pretty much ready to get our regression on. We just need to get the data in the right format. This will vary a little depending on what we want to use. We can also make some data cleaning choices relating to the issues we noticed above. 

In [None]:
#Clean out questionable bits - cylinders and correlated values. 
#We can try once with this and once without. 
#In this data, the difference probably won't be huge. 
print(len(df))
df_ = df.drop(columns=["Displacement","Name"])
df_ = df_[df_["Cylinders"]!=3]
#.copy() gives us an independent frame, so the assignment below doesn't warn
df_ = df_[df_["Cylinders"]!=5].copy()

#We need to remove the categories that are not used anymore.
#They don't automatically vanish. 
#The .cat accessor on the column does this for us.
df_['Cylinders'] = df_['Cylinders'].cat.remove_unused_categories()
print(len(df_))

In [None]:
#Check the cylinders 
df_["Cylinders"].value_counts()

We have two categorical variables - cylinders and origin. The regression will run with them as they are, but I don't think that is best: they'd be treated as plain numbers, and that doesn't make a tonne of sense. Origin 3 is not "three times" Origin 1, for example.

To deal with them better, we can use one-hot encoding to translate each category to its own column. We will do it the easy way with the get_dummies function. I'm going to do it twice; I'll explain why after.

In [None]:
#Get some new info
df_.info()

In [None]:
# Do the dummies
df_tmp = pd.get_dummies(df_)
df_tmp.head(5)


Pretty simple: instead of each categorical column having several possible values, each of those values is now its own column, and each row is true (or 1) in one of those columns and false (or 0) in the others.
<br><br>
The one issue here is: what does it mean if all of a category's columns are 0? By giving each value its own column, we've made an all-0s row possible even though it matches no real category. The solution is to drop one column for each category - the dropped value is then represented by all 0s, so every combination means something. It also removes a redundancy: with every column present, each category's columns always sum to 1, which duplicates the regression intercept. Dropping a column is common, though not universal. For linear regression we want to drop it.
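To see the redundancy concretely, here is a sketch on a made-up `Origin` column (the values are hypothetical). It builds both encodings and checks that the full one always sums to 1 across a row.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"Origin": pd.Categorical([1, 2, 3, 1, 2])})

# Full encoding: one column per category
full = pd.get_dummies(toy, dtype=int)

# Reduced encoding: the first category is dropped and becomes the all-0s case
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, the same as an intercept column
print(full.sum(axis=1).tolist())
print(full.columns.tolist())
print(reduced.columns.tolist())
```

In the reduced version a row of all 0s simply means "Origin 1", so no information is lost.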

In [None]:
#Redo the dummy variables. 
df2 = pd.get_dummies(df_, drop_first=True)
df2.head(5)


The pandas function is the simplest way to create the dummy variables. The encoding notebook has an example of the sklearn process, but this way is much simpler.

Next we want to split our features (Xs) and our target (Y). 
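As a rough sketch of the general pattern, on a made-up frame rather than our actual data: pull the target column out, drop it from the features, and keep both a dataframe and an array version. The `toy` values and the names `x_df`, `y_df`, `x_arr`, `y_arr` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "MPG": [18.0, 15.0, 32.0, 26.0],
    "Weight": [3504.0, 3693.0, 2130.0, 2600.0],
    "HP": [130.0, 165.0, 65.0, 90.0],
})

# Dataframe versions keep the column names
y_df = toy["MPG"]
x_df = toy.drop(columns=["MPG"])

# Array versions are what sklearn works with internally
x_arr = x_df.to_numpy()
y_arr = y_df.to_numpy()

print(x_arr.shape, y_arr.shape)
```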

In [None]:
#note, whatever you make here will be the x/y in the split function below.
#We need an x and y array for sklearn.
#Having an x and y dataframe is also useful later on. 

#COMPLETE

x.shape, y.shape

Data is ready to go - we can now use either sklearn with the arrays, or statsmodels with the dataframes or the arrays.

We can split the data into a training set and a test set, so we can measure our error on data the model hasn't seen.

**Note: repeating the split/train/test/calculate cycle is often a good idea. We don't need to build a big loop (though we could) - we'll add this later on with a tool built into sklearn that makes it easier: k-fold cross-validation.**
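As a preview of what that repeated cycle can look like, here is a minimal sketch using sklearn's `KFold` and `cross_val_score` on synthetic data. The generated `X` and `y`, the 5 folds, and the choice of RMSE as the score are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5 folds: each fold takes a turn as the test set
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# sklearn returns negated RMSE (higher is better), so flip the sign
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print(rmse_per_fold)
print(rmse_per_fold.mean())
```

The spread across folds gives a sense of how much a single train/test split could mislead us.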

In [None]:
# Dividing the Dataset into Test and Train
# This does the splitting of both the xs and ys, and spits back all 4 sets. 
# This function is really common, and does the same thing as the sample splitting we did by hand
from sklearn.model_selection import train_test_split
xTrain,xTest,yTrain,yTest = train_test_split(x,y,test_size=.3)

print("X-Train:", xTrain.shape)
print("X-Test:", xTest.shape)
print("Y-Train:", yTrain.shape)
print("Y-Test:", yTest.shape)


In [None]:
# Implementing Linear Regression
# Fit the model
from sklearn.linear_model import LinearRegression

#COMPLETE

In [None]:
## Seeing the result of Linear Regression
#Passing dataframes would keep the column names in the summary. 

# This one is using the post-split training arrays. Hence the lack of labels. 
import statsmodels.api as sm
X2 = sm.add_constant(xTrain)
est = sm.OLS(yTrain,X2)
est2 = est.fit()
print(est2.summary())


In [None]:
# Predictions for TEST data
# Calculate RMSE for TEST data
# The sklearn and the statsmodels models fit the same least-squares regression, so either works here. 

#COMPLETE


In [None]:
#Get Residuals and picture them in a DF for easy reading. 
# Column names are passed as lists; recent pandas rejects sets here
tmp1 = pd.DataFrame(yTest, columns=["Y values"])
tmp2 = pd.DataFrame(ypred, columns=["Predictions"])
tmp3 = pd.DataFrame((yTest-ypred), columns=["Residual"])
resFrame = pd.concat([tmp1,tmp2,tmp3], axis=1)
resFrame.sort_values("Residual").head()

After we're done with all of this testing, what if we want to use the model to make predictions? We all just got jobs monitoring MPG of new cars, and we are way too lazy to measure all of them, so we want to make accurate predictions and go back to sleep.

For the model we want to use, we'll train it with all the data. The splitting and testing gave us good estimates for accuracy, but using all the data to create the model will deliver the best results in general. What it won't do is give us those test scores to evaluate it, but if we've already decided to use it we should train a new model with all the data. In practice we'd probably compare the test scores that we got here with the scores we got using different algorithms, then we'd train the most accurate one with all data and use that. 
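A sketch of that last step, assuming we have settled on plain linear regression: fit once on everything, then call `predict` on new rows. The data here is synthetic and noise-free so the result is easy to check by hand; `X_all`, `y_all` and `new_cars` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic "all the data": y = 3 + 1.5*x1 - 2*x2, with no noise
rng = np.random.default_rng(1)
X_all = rng.normal(size=(50, 2))
y_all = 3.0 + X_all @ np.array([1.5, -2.0])

# Train the final model on every row, with no train/test split this time
final_model = LinearRegression().fit(X_all, y_all)

# Hypothetical new observations, with the same columns in the same order as training
new_cars = np.array([[0.0, 0.0], [1.0, 1.0]])
preds = final_model.predict(new_cars)

print(preds)
```

New rows need exactly the same preparation as the training data (same columns, same dummy encoding), otherwise the predictions are meaningless.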

<h2>Exercise - predict BMI</h2>

Follow a similar process to predict BMI using the data below. 

In [None]:
#Load data
df_d = pd.read_csv("diabetes.csv")
df_d.head()

In [None]:
#Regress...