# Data Science Fundamentals in Python 3

## Introduction

The purpose of this example is to practice different components of a data science process. Section 1 uses synthetic data, and Section 2 uses the classic 1974 *Motor Trend* car road tests (`mtcars`) dataset.

We are going to practice:
1. How a machine learns using Stochastic Gradient Descent (SGD).
2. How to load data into a pandas DataFrame and explore it.
3. How to train different models using the scikit-learn library and compare their performance:
    - A linear model using all variables
    - A Gradient Boosting Machine (GBM) model

## Section 1. Practice on Machine Learning using Stochastic Gradient Descent (SGD)

<img src="http://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization_files/ball.png" width="600" align="left"/>

### 1.1 Generate the data

In [None]:
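import numpy as np
import pandas as pd

# x is drawn from a normal distribution with mean MU and standard deviation SIGMA
MU, SIGMA = 6, 2
SIGMA_NOISE = 0.05
NUM_OBS = 100
x = np.random.normal(MU, SIGMA, NUM_OBS)
noise = np.random.normal(0, SIGMA_NOISE, NUM_OBS)

# true intercept (A) and slope (B) that the model should recover
A, B = 3.5, 8.5
y = A + B * x + noise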

### 1.2 Split the Data into Training and Testing

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=123)

### 1.3 Start Training the Model

The slopes (partial derivatives of the squared error) for the coefficients `a` and `b` are given by the following formulas:

<img src="https://hzstorage.blob.core.windows.net/icse2019/sgd-derivatives.PNG?sv=2018-03-28&ss=bqtf&srt=sco&sp=rwdlacup&se=2019-05-23T23:59:54Z&sig=ETqhmBLQ4GuFiIQUAgqAKKScLtEQqAoyGUUwN8oI%2BUo%3D&_=1558627256972" width="300" align="left"/>
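For reference, the slopes used in the training loop can also be written out in text form. This is a sketch consistent with the code in this section; the symbols $E_i$ (squared error of a single observation) and $\eta$ (the learning rate, `LEARNING_RATE` in the code) are notation assumed here. With the prediction $\hat{y}_i = \hat{a} + \hat{b} x_i$ and $E_i = (y_i - \hat{y}_i)^2$:

$$
\frac{\partial E_i}{\partial \hat{a}} = -2\,(y_i - \hat{y}_i), \qquad
\frac{\partial E_i}{\partial \hat{b}} = -2\,(y_i - \hat{y}_i)\,x_i
$$

Each SGD step then moves the coefficients against these slopes:

$$
\hat{a} \leftarrow \hat{a} - \eta\,\frac{\partial E_i}{\partial \hat{a}}, \qquad
\hat{b} \leftarrow \hat{b} - \eta\,\frac{\partial E_i}{\partial \hat{b}}
$$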

In [None]:
from random import shuffle
NUM_EPOCHS = 40
LEARNING_RATE = 0.01
NUM_TRAINING_OBS = len(X_train)
a_hat, b_hat = np.random.normal(0, 1, 2)
print(a_hat, b_hat)
sse_progress = [0] * NUM_EPOCHS
a_progress = [0] * NUM_EPOCHS
b_progress = [0] * NUM_EPOCHS
train_index = list(range(NUM_TRAINING_OBS))
for k in range(NUM_EPOCHS):
    shuffle(train_index)
    SSE = 0
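    for i in train_index:
        # prediction and error for a single observation
        y_hat = a_hat + b_hat * X_train[i]
        delta = y_train[i] - y_hat
        SSE += delta**2
        # slopes of the squared error with respect to a and b
        slope_a = 2 * delta * (-1)
        slope_b = 2 * delta * (-X_train[i])
        # step against the slopes, scaled by the learning rate
        a_hat = a_hat - slope_a * LEARNING_RATE
        b_hat = b_hat - slope_b * LEARNING_RATE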
    sse_progress[k] = SSE
    a_progress[k] = a_hat
    b_progress[k] = b_hat
    print("Epoch = {0}, SSE={1}".format(k, round(SSE,4)))
print("In the end, the learned coefficients are {0} and {1}.".format(round(a_hat, 4), round(b_hat, 4)))
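One way to sanity-check the coefficients learned by SGD is to compare them with the closed-form least-squares fit, which `np.polyfit` computes for a straight line. The sketch below regenerates data with the same settings as Section 1.1. The fixed seed and the names `x_check`, `y_check`, `a_ols`, and `b_ols` are assumptions made for this illustration, so the exact numbers will differ from the unseeded data generated earlier.

```python
import numpy as np

# Assumed seed so that the check is reproducible; the notebook itself does not set one
rng = np.random.default_rng(123)

# Same data-generating settings as Section 1.1
x_check = rng.normal(6, 2, 100)
y_check = 3.5 + 8.5 * x_check + rng.normal(0, 0.05, 100)

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
b_ols, a_ols = np.polyfit(x_check, y_check, deg=1)
print("Closed-form estimates: a = {0:.4f}, b = {1:.4f}".format(a_ols, b_ols))
```

If SGD has converged, its `a_hat` and `b_hat` should land close to the least-squares estimates obtained the same way from `X_train` and `y_train`.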

### 1.4 Plot the Training Progress

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(1)
plt.subplot(3,1,1)
plt.plot(range(1, NUM_EPOCHS+1), a_progress)
plt.xlabel("Training Epochs")
plt.ylabel("Coefficient a")

plt.subplot(3,1,2)
plt.plot(range(1, NUM_EPOCHS+1), b_progress)
plt.xlabel("Training Epochs")
plt.ylabel("Coefficient b")

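plt.subplot(3,1,3)
# the first epoch is omitted from the SSE plot
plt.plot(range(2, NUM_EPOCHS+1), sse_progress[1:])
plt.xlabel("Training Epochs")
plt.ylabel("SSE")
# keep the stacked subplots from overlapping
plt.tight_layout()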


## Section 2. Train Models Using Scikit-Learn Library

### 2.1 Prepare Data

We'll start by loading the `mtcars` sample dataset and displaying its first few rows:

In [None]:
!pip install pydataset --disable-pip-version-check -q  # install a Python package containing the dataset
import pydataset
from pydataset import data
df = data('mtcars')
df.head()

### 2.2 Examine Summary Statistics

We can also quickly examine the distribution of values in each column:

In [None]:
df.describe()

### 2.3 Get a More Detailed Report of the Data Using pandas-profiling

In [None]:
!pip install pandas-profiling

In [None]:
import pandas_profiling
# Drop the row index of the data to avoid special characters in row index
df1 = df.reset_index(drop=True)
pandas_profiling.ProfileReport(df1)

### 2.4 Split the Data into Training and Testing

The goal for the machine learning models in this tutorial will be to predict each car's gas mileage (`mpg`) from the car's other features.

We will split the records into training and test datasets: each model will be fitted using the training data, and evaluated using the withheld test data.

In [None]:
# split the dataset into features available for prediction (X) and value to predict (y)
y = df['mpg'].values
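# pass the column by keyword: the positional axis argument is not accepted by recent pandas versions
X = df.drop(columns='mpg').values
feature_names = df.drop(columns='mpg').columns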

# save 30% of the records for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
X_train.shape

As you can see from the shape above, the number of predictive features available in this dataset (10) is comparable to the number of training records (22). Such conditions tend to produce overfitted models that give exceptional predictions on their own training data, but poor predictions on the withheld test data. We will see an example of an overfitted model below.
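The effect of having few records per feature can be illustrated on simulated data. The sketch below is not based on `mtcars`: it assumes a made-up setting with 32 records and 10 features, of which only the first carries signal, and repeats the 70/30 split many times. All names (`X_sim`, `y_sim`, `train_scores`, `test_scores`) and sizes are chosen for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
N_RECORDS, N_FEATURES, N_TRIALS = 32, 10, 200  # sizes chosen to resemble mtcars

train_scores, test_scores = [], []
for trial in range(N_TRIALS):
    X_sim = rng.normal(size=(N_RECORDS, N_FEATURES))
    # only the first feature carries signal; the other nine are pure noise
    y_sim = 2.0 * X_sim[:, 0] + rng.normal(size=N_RECORDS)
    Xs_train, Xs_test, ys_train, ys_test = train_test_split(
        X_sim, y_sim, test_size=0.3, random_state=trial)
    model = LinearRegression().fit(Xs_train, ys_train)
    train_scores.append(r2_score(ys_train, model.predict(Xs_train)))
    test_scores.append(r2_score(ys_test, model.predict(Xs_test)))

print("Mean training R^2: {:0.3f}".format(np.mean(train_scores)))
print("Mean test R^2:     {:0.3f}".format(np.mean(test_scores)))
```

Averaged over the repeated splits, the training R-squared should come out noticeably higher than the test R-squared, because the unregularized model also fits the noise features.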

### 2.5 Fit Models
#### 2.5.1 Linear Regression Model
The following lines of code fit a linear model (without regularization) using all of the original features:

In [None]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)

Below, we print the R-squared value for the true vs. predicted `mpg` values in the *training* set. We also show the fitted coefficients for different features.

In [None]:
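import pandas as pd
from sklearn.metrics import r2_score

# print R^2 for the training set
print('The R-squared value for the training set is: {:0.4f}'.format(r2_score(y_train, lm.predict(X_train))))

# show the fitted coefficient for each feature
pd.DataFrame({'feature': feature_names, 'coefficient': lm.coef_})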

Notice that the model performs very well on the training data to which it was fitted. (Predictions of the model account for 89% of the variance in `mpg` values.) Some of the feature coefficients may reflect our intuition: for example, heavy cars tend to have worse gas mileage ($\beta_{\textrm{wt}} = -5.0$), and cars with manual transmissions tend to have better gas mileage ($\beta_{\textrm{am}} = 5.2$).

Now, let's check the model's performance on the test dataset:

In [None]:
import numpy as np

predicted = lm.predict(X_test)

r_squared = r2_score(y_test, predicted)
mae = np.mean(abs(predicted - y_test))
rmse = np.sqrt(np.mean((predicted - y_test)**2))
rae = np.mean(abs(predicted - y_test)) / np.mean(abs(y_test - np.mean(y_test)))
rse = np.mean((predicted - y_test)**2) / np.mean((y_test - np.mean(y_test))**2)

# Create a data frame for storing results from each model
summary_df = pd.DataFrame(index = ['R-squared', 'Mean Absolute Error', 'Root Mean Squared Error',
                                   'Relative Absolute Error', 'Relative Squared Error'])
summary_df['Linear Regression, all variables'] = [r_squared, mae, rmse, rae, rse]
summary_df

Notice that the R-squared value for true vs. predicted `mpg` of the test set is much lower than it was for the training set. (Granted, our test set is not very large, so some fluctuation is expected.) This is indicative of model overfitting.

#### 2.5.2 Gradient Boosting Machine Regression Model

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# 'squared_error' is the least-squares loss; scikit-learn versions before 1.0 call it 'ls'
# 'alpha' is only used by the 'huber' and 'quantile' losses and is ignored here
params = {'alpha': 0.9, 'criterion': 'friedman_mse', 'learning_rate': 0.01,
          'loss': 'squared_error', 'max_depth': 2, 'min_samples_leaf': 1,
          'min_samples_split': 2, 'n_estimators': 200, 'random_state': 123,
          'subsample': 1.0, 'verbose': 0}
gbm = GradientBoostingRegressor(**params)

gbm.fit(X_train, y_train)
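With a learning rate of 0.01, each of the 200 trees contributes only a small correction, so it can be informative to watch how the test error changes as trees are added. The sketch below uses `staged_predict`, which yields the ensemble's predictions after each boosting stage. It runs on simulated data rather than `mtcars`, and the data-generating formula and the names `gbm_sim` and `test_mse` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Simulated non-linear data (an assumption for this sketch, not the mtcars data)
rng = np.random.default_rng(123)
X_sim = rng.normal(size=(200, 5))
y_sim = np.sin(X_sim[:, 0]) + 0.5 * X_sim[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_sim, y_sim, test_size=0.3, random_state=123)

# Same learning rate, depth, and number of trees as the params above;
# the default loss is least squares
gbm_sim = GradientBoostingRegressor(
    learning_rate=0.01, max_depth=2, n_estimators=200, random_state=123)
gbm_sim.fit(Xs_train, ys_train)

# staged_predict yields test-set predictions after 1, 2, ..., n_estimators trees
test_mse = [np.mean((ys_test - pred) ** 2) for pred in gbm_sim.staged_predict(Xs_test)]
print("Test MSE after the first tree: {:0.4f}".format(test_mse[0]))
print("Test MSE after all trees:      {:0.4f}".format(test_mse[-1]))
```

If the test error is still falling at the last stage, more trees or a larger learning rate may help; if it has started rising, the ensemble is likely beginning to overfit.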

Now we can check the model's performance on the test data:

In [None]:
predicted = gbm.predict(X_test)

r_squared = r2_score(y_test, predicted)
mae = np.mean(abs(predicted - y_test))
rmse = np.sqrt(np.mean((predicted - y_test)**2))
rae = np.mean(abs(predicted - y_test)) / np.mean(abs(y_test - np.mean(y_test)))
rse = np.mean((predicted - y_test)**2) / np.mean((y_test - np.mean(y_test))**2)

summary_df['Gradient Boosted Machine Regression'] = [r_squared, mae, rmse, rae, rse]
summary_df
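The five metrics are computed with identical code for both models, so they could be collected in a small helper. The function below is a sketch; the name `regression_metrics` and the toy arrays are hypothetical and not part of the original notebook. It returns the metrics in the same order as the rows of `summary_df`.

```python
import numpy as np
from sklearn.metrics import r2_score

def regression_metrics(y_true, y_pred):
    """Return [R-squared, MAE, RMSE, RAE, RSE] for true vs. predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    baseline = y_true - y_true.mean()  # errors of always predicting the mean
    return [
        r2_score(y_true, y_pred),
        np.mean(np.abs(errors)),                              # mean absolute error
        np.sqrt(np.mean(errors ** 2)),                        # root mean squared error
        np.mean(np.abs(errors)) / np.mean(np.abs(baseline)),  # relative absolute error
        np.mean(errors ** 2) / np.mean(baseline ** 2),        # relative squared error
    ]

# Toy values for illustration only, not mtcars results
toy_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_pred = np.array([2.0, 5.0, 8.0, 9.0])
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)
```

In the notebook, a results column could then be filled with a single call such as `summary_df['Gradient Boosted Machine Regression'] = regression_metrics(y_test, predicted)`.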