# Master Data Science In 4 Weeks

## Hands-On Project 1: The Pipeline of Machine Learning

In this hands-on project, we walk through the **pipeline of a machine learning project** using a housing price prediction example.

When solving a real data science problem with machine learning techniques, experienced data scientists usually have the following checklist in mind.

- Understand the problem and draw the big picture. 
- Get the data. 
- **Explore the data**: gain some insights.
- **Data preprocessing** and **feature engineering**. 
- Explore some reasonable **machine learning models** and shortlist the best ones. 
- **Fine-tune** your models and combine them to get a better one. 
- Present your solution. 
- Launch, monitor and maintain your system.  

In this project, we will show the pipeline step by step. 

## 0. Problem Description

In this hands-on project, we walk through the **pipeline of a machine learning project** using the **housing price prediction** example. The data can be downloaded from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

We use this example for the following reasons:

1. This is a real problem you may meet in your life. For example, if you want to buy a new house, this may help you estimate the price.
2. The dataset is not clean, so it is well suited for showing some commonly used data cleaning and feature engineering techniques.
3. The dataset is suitable for beginners to get started.
4. A similar dataset of house prices in Singapore could be provided for you to explore by yourself.

**Hope you can enjoy it.**

## 1.1 Set Up

First, we import some libraries that we will use throughout the following four weeks.

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os
import pandas as pd
import seaborn as sns

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

# Import some statistics functions
from scipy.stats import skew
from scipy.stats import pearsonr

## 1.2 Load Data
**Pandas** is the most popular Python library for working with data frames.

In [None]:
# Set the relative path to the training data
root_dir = "."
train_data_dir = root_dir + "/datasets/train.csv"

# Read the data frame
train_data = pd.read_csv(train_data_dir)

# Save the Id column so we can keep track of the rows after splitting
index_array = train_data['Id']

# Print out some data samples (only the last expression in a cell is displayed)
train_data.head()

## 2 Explore the data and gain some insights

In this part, we visualize the dataset to understand the data better. This is one of the most important steps when you try to analyze and understand your problem and data.

### 2.1 Get a summary of your dataset with one line of code

Pandas has a very useful method called **describe()**, which generates summary statistics for the numerical columns of a data frame.

In [None]:
print(train_data.columns)
train_data.describe()

### 2.2 Look at the distribution of target variable

In [None]:
matplotlib.rcParams['figure.figsize'] = (6.0, 6.0)
train_data['SalePrice'].hist()

We can see that the **SalePrice** is skewed. We will see later how we can handle this.

### 2.3 Explore some relations between selected features and target variable

Correlation is a statistic that measures the strength of the linear relationship between two variables. One way to explore the relations between the features and the target variable is to visualize their correlations. This gives only a rough idea, since correlation captures only **linear relationships** between two variables.
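
To illustrate why correlation gives only a rough picture, here is a minimal sketch on made-up data (the arrays `x`, `y_linear` and `y_quadratic` are assumptions for illustration, not part of the housing dataset). Both targets are fully determined by `x`, but only the linear one shows up as a strong correlation.

```python
import numpy as np

# Hypothetical toy data: x is symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)
y_linear = 2.0 * x + 1.0   # exact linear relationship
y_quadratic = x ** 2       # exact, but nonlinear, relationship

# Pearson correlation is the off-diagonal entry of the 2x2 correlation matrix.
corr_linear = np.corrcoef(x, y_linear)[0, 1]
corr_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print("Correlation with the linear target:   ", round(corr_linear, 3))
print("Correlation with the quadratic target:", round(corr_quadratic, 3))
```

A feature with a low correlation to **SalePrice** may therefore still be useful for a nonlinear model.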

In [None]:
#saleprice correlation matrix (numeric columns only, since text columns have no correlation)
corrmat = train_data.loc[:,'MSSubClass':'SalePrice'].select_dtypes(include=[np.number]).corr()
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train_data[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

Although this is a very rough exploration, we can still see some meaningful features that agree with our common sense, for example **OverallQual, GrLivArea, YearBuilt**. Below, we visualize one of them.

In [None]:
#scatter plot grlivarea/saleprice
var = 'GrLivArea'
train_data.plot(kind='scatter', x = var, y = 'SalePrice')

## 3 Data preprocessing and feature engineering

Data preprocessing and feature engineering are extremely important in data science projects. 

In [None]:
# Separate Features and Target Variables

X_train = train_data.loc[:, 'MSSubClass':'SaleCondition']
y_train = train_data['SalePrice']

### 3.1 Handle Categorical Features

Usually, we need to convert categorical features to numerical values. Instead of naively converting them to a sequence of integers, which would impose an arbitrary order on the categories, a commonly used idea is **one-hot encoding**. In **Pandas**, there is a function called **get_dummies** that does this.

In [None]:
X_train = pd.get_dummies(X_train)

# Inspect the resulting column names
X_train.columns
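
To make the effect of **get_dummies** concrete, here is a minimal sketch on a made-up frame (the three rows in `toy` are assumptions for illustration only). The numerical column passes through unchanged, and each category of the text column becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy frame with one numerical and one categorical column.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# One indicator column is created per category, named <column>_<category>.
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
print(toy_encoded)
```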

### 3.2 Missing Data

Missing data (usually indicated as **nan** in the data matrix) is common in real datasets. If we do not handle it, we have to drop the training samples that contain **nan** values, since we cannot do numerical computations with **nan**. In test data this is a disaster, because it means we cannot predict the target variable for the affected samples.

In general, we can classify the preprocessing strategies into two categories.

**Naive Idea**: Drop the features with missing data. This is usually not wise, but sometimes it works if you are confident that the corresponding feature is not important for prediction.

**Popular Idea**: Imputation. We can replace the missing data with a statistic of the column, such as the mean or the median.

In [None]:
#filling NA's with the mean of the column:
X_train = X_train.fillna(X_train.mean())
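
The choice of statistic matters when a column contains outliers. The following minimal sketch uses a made-up column (`feature` is an assumption for illustration) to compare mean and median imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one large outlier.
feature = pd.Series([1.0, 2.0, np.nan, 100.0])

# mean() and median() skip nan values by default.
filled_mean = feature.fillna(feature.mean())
filled_median = feature.fillna(feature.median())

print("Filled with mean:  ", filled_mean.tolist())
print("Filled with median:", filled_median.tolist())
```

The mean is pulled toward the outlier, while the median stays close to the typical values, so the median is often the safer choice for skewed features.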

### 3.3 Handle Skewed Data

One popular way to handle skewed positive data is to take the logarithm. As with the sale price we plotted in the data exploration part, this makes the distribution closer to a normal distribution.

In [None]:
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price":train_data["SalePrice"], "log(price + 1)":np.log1p(train_data["SalePrice"])})
prices.hist()
y_train = np.log1p(y_train)
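
As a rough check of this idea, the sketch below measures skewness before and after the transform on simulated data (the log-normal sample `prices_demo` and the helper `skewness` are assumptions for illustration, not the housing data). It also shows that `np.expm1` inverts `np.log1p`, which is needed later to turn predictions back into prices.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Hypothetical right-skewed positive data, simulated for illustration.
rng = np.random.default_rng(42)
prices_demo = rng.lognormal(mean=12.0, sigma=0.5, size=5000)

log_prices_demo = np.log1p(prices_demo)
skew_before = skewness(prices_demo)
skew_after = skewness(log_prices_demo)
print("Skewness before log1p:", skew_before)
print("Skewness after log1p: ", skew_after)

# expm1 is the inverse of log1p, so the original values can be recovered.
recovered = np.expm1(log_prices_demo)
```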

In [None]:
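# Handle other skewed features.
# After get_dummies every column is numerical, so we select the originally
# numerical columns here; log-transforming 0/1 indicator columns is not meaningful.
numerical_feats = train_data.loc[:, 'MSSubClass':'SaleCondition'].select_dtypes(include=[np.number]).columns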

# compute skewness of the numerical features
skewed_feats = X_train[numerical_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
X_train[skewed_feats] = np.log1p(X_train[skewed_feats])

### Splitting the Training Data into a Training Set and a Validation Set

This step is very important, and experienced data scientists always do it. What we really care about is the performance of the model on unseen data, but the test data is usually not available when we train the model. So, in order to estimate the performance of the model on unseen data and to detect overfitting, we randomly split the training set into two parts and hold out one part as **validation data (10% - 20%)**. If the training data is limited, we usually do **cross-validation** instead. **Scikit-Learn** provides the **train_test_split** function for the random split.

In [None]:
# Scikit-Learn provides train_test_split to hold out a validation set
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val, train_id, val_id = train_test_split(X_train, y_train, index_array,test_size = 0.2, random_state = 42)
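
Passing several arrays to **train_test_split** at once, as above, keeps their rows aligned. The minimal sketch below checks this on made-up arrays (`features`, `targets` and `row_ids` are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, where the target and the id are both derived from the row.
features = np.arange(20).reshape(10, 2)   # row i is [2i, 2i + 1]
targets = features[:, 0] * 10
row_ids = np.arange(100, 110)             # row i has id 100 + i

feat_tr, feat_val, targ_tr, targ_val, ids_tr, ids_val = train_test_split(
    features, targets, row_ids, test_size=0.2, random_state=42
)

print("Training rows:", len(feat_tr), "Validation rows:", len(feat_val))
# Every array is shuffled with the same permutation, so the rows still match.
print("Targets still match features:", bool((targ_val == feat_val[:, 0] * 10).all()))
```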

In [None]:
X_train.describe()

## Explore some reasonable models

Usually, after you have prepared your data, you begin to try some reasonable machine learning models on your problem. In this problem, we will try **Linear Regression**, **Linear Regression with Regularization (Ridge Regression, Lasso)**, **Decision Tree**, and **XGBoost** models.

The most fundamental model is **Linear Regression**, which chooses the weights $w$ by minimizing the following squared-error objective:
$$\min_{w} \frac{1}{2}\sum_{i=1}^n\Big(y_i - \sum_{j=1}^d X_{ij}w_j\Big)^2$$
Or, we can write it in the following compact matrix form:
$$\min_{w} \frac{1}{2}\|y_{train} - X_{train}w\|_2^2$$
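
As a sanity check of this objective, the sketch below solves the least-squares problem directly with NumPy and compares the result with Scikit-Learn on simulated data (`X_demo`, `y_demo` and `w_true` are assumptions for illustration). Note that `LinearRegression` also fits an intercept by default, so it is switched off here to match the formula above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from known weights plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=50)

# Minimize ||y - Xw||_2^2 directly with NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Fit the same problem with Scikit-Learn (no intercept, to match the objective).
model_demo = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)

print("Least-squares weights:", w_lstsq)
print("Scikit-Learn weights: ", model_demo.coef_)
```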

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
model_linear_reg = LinearRegression()

# Train your Linear Regression Model on Training Set
model_linear_reg.fit(X_train, y_train)

In [None]:
# Yeah, you have your first machine learning model
# Print out some predictions next to the true values

data_sample = X_train.iloc[100:105]
y_sample = y_train.iloc[100:105]
print("Predictions: \t", model_linear_reg.predict(data_sample))
print("True Labels: \t", np.array(y_sample))

# compute the mean-square error on the training set
y_pred_linear_reg = model_linear_reg.predict(X_train)
mse_model_linear_reg = mean_squared_error(y_train, y_pred_linear_reg)
print("The Mean-Square-Error of the linear regression model on the training set is:", mse_model_linear_reg)
# It seems not that bad, but remember that this is the training error

In [None]:
# Explore the learned parameters we get

print("The value of w is:", model_linear_reg.coef_)

### Regularization

We can see that the values of the parameters are all nonzero. However, we know that some features in the dataset do not help us predict the price of the house. In other words, we want to do feature selection to pick out the useful features automatically. One popular way to achieve this is to add an $\ell_1$ regularization term $\|w\|_1$, which encourages the parameters to be sparse.

Later, you will learn the regularization term from the aspect of preventing over-fitting. 

So, in order to do feature selection, we will consider the following regularized linear model, called **Lasso**:
$$\min_{w} \frac{1}{2}\|y_{train} - X_{train}w\|_2^2 + \alpha \|w\|_1$$
where $\alpha >0$ is the hyperparameter to control the strength of the regularization term. 
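
To see how $\alpha$ controls sparsity, here is a minimal sketch on simulated data where only the first two of ten features matter (`X_demo`, `y_demo` and both `alpha` values are assumptions for illustration). Note that Scikit-Learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not on the same scale as the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 10 features, but only the first two influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)

# Fit once with weak and once with strong regularization.
weak = Lasso(alpha=0.001).fit(X_demo, y_demo)
strong = Lasso(alpha=0.5).fit(X_demo, y_demo)

n_nonzero_weak = int(np.sum(weak.coef_ != 0))
n_nonzero_strong = int(np.sum(strong.coef_ != 0))

print("Nonzero coefficients with alpha=0.001:", n_nonzero_weak)
print("Nonzero coefficients with alpha=0.5:  ", n_nonzero_strong)
print("Coefficients with alpha=0.5:", strong.coef_)
```

Stronger regularization zeroes out more coefficients, but it also shrinks the coefficients that remain, which is why a badly chosen $\alpha$ can hurt accuracy.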

In [None]:
from sklearn.linear_model import Lasso

# Choose the value of your hyperparameter alpha
model_lasso = Lasso(alpha = 0.01, max_iter= 4000)

# Train your model
model_lasso.fit(X_train, y_train)


In [None]:
# Print out some results to see the performance
data_sample = X_train.iloc[100:105]
y_sample = y_train.iloc[100:105]
print("Predictions: \t", model_lasso.predict(data_sample))
print("True Labels: \t", np.array(y_sample))

y_pred_lasso = model_lasso.predict(X_train)
mse_model_lasso = mean_squared_error(y_train, y_pred_lasso)
print("The Mean-Square-Error of the Lasso model is:", mse_model_lasso)

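If you look at the results and compare them to the predictions we get from linear regression, you can see that the performance is worse. This may seem to contradict common sense, since we would expect better results after excluding some irrelevant features. Keep in mind, though, that both errors are measured on the training set, where unregularized linear regression always fits at least as well as a regularized model.

**Analysis**: The problem may be that we selected an inappropriate hyperparameter.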

In [None]:
model_lasso_2 = Lasso(alpha= 0.0002, max_iter= 4000)
model_lasso_2.fit(X_train, y_train)

# Print out some results to see the performance
data_sample = X_train.iloc[100:105]
y_sample = y_train.iloc[100:105]
print("Predictions: \t", model_lasso_2.predict(data_sample))
print("True Labels: \t", np.array(y_sample))

y_pred_lasso_2 = model_lasso_2.predict(X_train)
mse_model_lasso_2 = mean_squared_error(y_train, y_pred_lasso_2)
print("The Mean-Square-Error of the new lasso model is:", mse_model_lasso_2)

### A better way to evaluate the performance of your machine learning model

In a real project, what we really care about is the test error. However, we cannot evaluate our machine learning model on the test dataset before we launch it. One way to estimate the performance of our model on unseen data is cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv = 5))
    return rmse
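
To show what `cross_val_score` does behind the scenes, the sketch below runs 5-fold cross-validation by hand on simulated data and compares the result with the built-in function (`X_demo` and `y_demo` are assumptions for illustration). It relies on the fact that, for a regressor, an integer `cv` corresponds to an unshuffled `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical regression data, simulated for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.5 * rng.normal(size=100)

# Manual 5-fold cross-validation: train on four folds, evaluate on the fifth.
manual_mse = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_mse.append(mean_squared_error(y_demo[val_idx], fold_pred))

# The built-in helper returns the negated MSE, so flip the sign.
auto_mse = -cross_val_score(
    LinearRegression(), X_demo, y_demo, scoring="neg_mean_squared_error", cv=5
)

print("Manual fold MSEs:  ", np.round(manual_mse, 4))
print("Built-in fold MSEs:", np.round(auto_mse, 4))
```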

In [None]:
alphas = [0.01, 0.0002]
cv_lasso = [rmse_cv(Lasso(alpha = alpha, max_iter= 4000)).mean() 
            for alpha in alphas]
cv_lasso = pd.Series(cv_lasso, index = alphas)
print("Cross Validation Error for Lasso:", cv_lasso)
#cv_lasso.plot(title = "Validation Error")
#plt.xlabel("alpha")
#plt.ylabel("rmse")

cv_linear_reg = rmse_cv(model_linear_reg).mean()
print("Cross Validation Error for Linear Regression:", cv_linear_reg)

In [None]:
print("The smallest validation error is:",(cv_lasso.min()))

Now, we explore linear models further.
In high-dimensional problems, we usually want a sparse solution, and one popular model for obtaining one is Lasso.

In [None]:
model_lasso_better = Lasso(alpha= 0.0002, max_iter= 4000)
model_lasso_better.fit(X_train, y_train)

Now, we check whether Lasso actually performs feature selection.

In [None]:
coef = pd.Series(model_lasso_better.coef_, index = X_train.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")

Next, we check whether the Lasso model selected meaningful features. To do this, we visualize the features with the largest positive and negative coefficients in the Lasso model.

In [None]:
imp_coef = pd.concat([coef.sort_values().head(10),
                     coef.sort_values().tail(10)])

# Visualize them.
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Important Coefficients in the Lasso Model")

## Fine Tune Your Machine Learning Model

Now we have some models in hand, and we want to fine-tune them to get better results. Usually, we fine-tune a model in two ways:

- **Hyperparameter Tuning**: As the Lasso example showed, when the hyperparameters are not selected appropriately, we may get poor results.

- **Model Ensemble**: The idea of an ensemble is to combine some shortlisted models, for example via a weighted average of their predictions, to get better results. This is commonly used in real problems. One intuitive reason why an ensemble can do better is that averaging reduces the variance of the individual models.
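
The variance-reduction argument can be sketched with simulated predictions (the arrays `pred_a` and `pred_b` and the equal weights are assumptions for illustration, not outputs of the models in this notebook). The sketch assumes the two models make independent errors; real models trained on the same data usually have correlated errors, so the gain is smaller in practice.

```python
import numpy as np

def mse(y, pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y - pred) ** 2))

# Hypothetical targets and two simulated models with independent errors.
rng = np.random.default_rng(0)
y_true = rng.normal(size=10000)
pred_a = y_true + rng.normal(scale=0.5, size=10000)
pred_b = y_true + rng.normal(scale=0.5, size=10000)

# Weighted-average ensemble; the weights should sum to one.
weights = np.array([0.5, 0.5])
pred_blend = weights[0] * pred_a + weights[1] * pred_b

mse_a = mse(y_true, pred_a)
mse_b = mse(y_true, pred_b)
mse_blend = mse(y_true, pred_blend)
print("MSE of model A: ", mse_a)
print("MSE of model B: ", mse_b)
print("MSE of ensemble:", mse_blend)
```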

### Grid Search

How to do **Hyperparameter Tuning** is very problem-dependent. Some useful ideas are:

1. Try some discrete values to determine the rough range of the hyperparameters.
2. Use Grid Search or Random Search to fine-tune your hyperparameters within that range.

To do Grid search, you can get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which
hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the
possible combinations of hyperparameter values, using cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'alpha':np.arange(0.0001,0.01,0.0002)}
]

model_lasso_cv = Lasso(max_iter=4000)

grid_search = GridSearchCV(model_lasso_cv, param_grid, cv = 5, scoring = 'neg_mean_squared_error')
grid_search.fit(X_train, y_train)

In [None]:
# Find the best hyperparameters
grid_search.best_params_

In [None]:
# Use the best estimator found by the grid search
model_lasso_best = grid_search.best_estimator_
print("Cross_Validation rmse is:", rmse_cv(model_lasso_best).mean())
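
Random Search, mentioned above as the alternative to Grid Search, can be sketched in the same way with Scikit-Learn's `RandomizedSearchCV`. The example below runs on simulated data; `X_demo`, `y_demo`, the candidate range and `n_iter=10` are assumptions for illustration. Instead of trying every value, it samples a fixed number of candidates, which helps when there are many hyperparameters.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical data: 10 features, only the first two influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)

# Candidate values spread evenly on a log scale between 1e-4 and 1e-1.
alpha_candidates = np.logspace(-4, -1, 50).tolist()

random_search = RandomizedSearchCV(
    Lasso(max_iter=4000),
    param_distributions={"alpha": alpha_candidates},
    n_iter=10,                       # evaluate only 10 randomly chosen candidates
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

best_alpha = random_search.best_params_["alpha"]
print("Best alpha among the sampled candidates:", best_alpha)
```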

## Evaluate the Results on the Validation Set

In [None]:
y_pred_linear_reg = model_linear_reg.predict(X_val)
y_pred_lasso = model_lasso_best.predict(X_val)

print("The Mean-Square-Error of the linear regression model on the validation set is:", mean_squared_error(y_val, y_pred_linear_reg))
print("The Mean-Square-Error of the lasso model on the validation set is:", mean_squared_error(y_val, y_pred_lasso))

In [None]:
# The model predicts log1p(SalePrice), so invert the transform to get prices back
solution = pd.DataFrame({"id":val_id, "SalePrice":np.expm1(y_pred_lasso)})
solution.to_csv("lasso_sol.csv", index = False)