# <center> Lecture 02 - Machine Learning Project Workflow</center>

## Outline
1. Overview of ML
    1. Introduction
    2. ML from different perspectives
    3. Different Learnings in ML
2. ML Project Workflow
    1. What makes ML so special?
    2. What is the workflow in ML Project?
        1. What is preprocessing and exploratory data analysis (EDA)?
        2. How do we make models and what is after?
        3. How can we make the models better?

# 1. Overview of ML
## 1.A Introduction

Machine learning (ML) is everywhere.
- ML is directly or indirectly involved in many fields:
    - Computer Science, Healthcare, Retail, Manufacturing, Energy, Finance, Science, Technology, etc.
- Why and How?
    - ML has been around us for a long time.
    - As we move into the **Big Data** era,
        - Building models by hand is limited (statistically, technically, and physically).
        - There is simply more data than humans can study to learn its structure and recognize its patterns.
        - and many more reasons...
- What is Machine Learning?
    - A computer program is said to learn from experience, **E**, with respect to some class of tasks, **T**, and performance measure, **P**, if its performance at tasks in **T**, as measured by **P**, improves with experience **E**.
    - The term was first coined in 1959 by Arthur Samuel of IBM.
        - A branch of Artificial Intelligence (AI),
        - Focused on the design and development of algorithms
        - Input: empirical data, such as that from sensors or databases,
        - Output: **patterns** or **predictions** thought to be features of the underlying mechanism that generated the data.
    - Learner (the algorithm):
        - Takes advantage of **data** to capture *characteristics of interest* of its unknown underlying probability distribution.
    - One fundamental difficulty:
        - Generalization: The set of all possible behaviors given all possible inputs is **too large** to be included in the set of observed examples (training data). Hence the learner must **generalize** from the given examples in order to produce a useful output in new cases.

## 1.B ML from different perspectives

- The **Artificial Intelligence (AI)** View:
    - Learning is central to human knowledge and intelligence, and likewise, it is also essential for building intelligent machines.
    - Years of effort in AI have shown that building intelligent computers by programming all the rules cannot be done; automatic learning is crucial.
    - For example, we humans are not born with the ability to understand language. We learn it, so it makes sense to have computers learn language instead of trying to program it all in.
- The **Software Engineering** View: 
    - ML allows us to program computers by example, which can be easier than writing code in the traditional way.
- The **Statistics** View:
    - ML is the marriage of computer science and statistics: computational techniques are applied to statistical problems. 
    - ML has been applied to a vast number of problems in many contexts, beyond the typical statistics problems. 
    - ML is often designed with different considerations than statistics (e.g., speed is often more important than accuracy).
- Examples: Spam Filtering, Face Detection, Games, etc...

<img src="img/face_detection.png" width=600 length=600/>

<img src="img/games.png" width=500 height=500 />

## 1.C Different Learnings in ML

#### Supervised Learning
- Labeled Data/targets
- Direct Feedback
- Predict outcome
- Forecast future

In [None]:
import pandas as pd

In [None]:
DF = pd.read_csv('Housing.csv')

In [None]:
DF.head(5)

In [None]:
print(DF.columns.tolist())

- Labels: headers, column names, feature names
- Columns: features, predictors, attributes
    - categorical and discrete data: integers, text
    - numerical and continuous data: floats
- Rows: observations, examples

In [None]:
DF.dtypes

#### Unsupervised Learning
- No labels/targets
- No feedback
- Find hidden structure in data

In [None]:
df = pd.read_csv('Spiral.csv')

In [None]:
df.head(5)

#### Reinforcement Learning
- Decision Process
- Reward system
- Learning series of actions

<img src="img/gm_example_0.png" width=500 height=500 />

# 2. ML Project Workflow
## 2.A What makes ML so special?

<img src="img/ML_approach.png" width=500 height=500 />

## 2.B What is the workflow in ML project?
ML is about: 
- Given a collection of examples, called “training data”
- We want to predict something about novel examples, called “test data”

What we usually do:
- Build idealized models of the application area we are working in
- Develop algorithms and implement in code
- Use historical data to learn numeric parameters, and sometimes model structure
- Use test data to validate the learned model, quantitatively measure its predictions
- Assess errors and repeat…

Every machine learning algorithm has three components:
- **Representation/Model Class**
    - Decision trees
    - Sets of rules / Logic programs
    - Graphical models (Bayes/Markov nets)
    - Neural networks
    - Support vector machines
    - Model ensembles
- **Evaluation/Objective Function**
    - Accuracy
    - Precision and recall
    - Squared error
    - Likelihood
    - Posterior probability
    - Cost / Utility
    - Margin
    - Entropy
    - K-L divergence
- **Optimization**
    - Discrete optimization 
        - Minimal Spanning Tree
        - Shortest Path
    - Continuous Optimization
        - Gradient Descent
        - Linear Programming
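
To make the three components concrete, here is a minimal sketch on synthetic data. The data, learning rate, and iteration count are illustrative assumptions and are not part of the lecture's dataset. The representation is a straight line, the evaluation is the mean squared error, and the optimization is plain gradient descent.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

# Representation: a linear model with two parameters, w and b
def predict(x, w, b):
    return w * x + b

# Evaluation: mean squared error between targets and predictions
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Optimization: gradient descent on the mean squared error
w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(500):
    error = predict(x, w, b) - y
    w -= learning_rate * 2 * np.mean(error * x)   # d(MSE)/dw
    b -= learning_rate * 2 * np.mean(error)       # d(MSE)/db

final_loss = mse(y, predict(x, w, b))
print(f"w={w:.2f}, b={b:.2f}, MSE={final_loss:.4f}")
```

Swapping any one component (for example, a decision tree as the representation, or accuracy as the evaluation) gives a different learning algorithm.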

<img src="img/ML_roadmap.png" width=700 length=700/>

### 2.B.a What is preprocessing and exploratory data analysis (EDA)?
#### Importance of data preprocessing
- Data preprocessing makes sure we have sensible data for ML
<img src="img/garbagein_garbageout.png" width=400 length=400/>

#### Missing values
- **Observations we intended to collect but did not obtain**
    - Data entry issues, equipment errors, incorrect measurements, etc.
        - An individual may have responded to only some of the questions in a survey
- **Problems of missing data**
    - Reduces the representativeness of the sample
    - Complicates data handling and analysis
    - Introduces bias from differences between missing and complete data
- **Missing data handling**
    - Reducing the data set
        - Elimination of samples (rows) with missing values
        - Elimination of features (columns) with missing values
    - Imputing missing values
        - Replace the missing value with the mean/median (numerical) or most common (categorical) value of that feature
    - Treating missing attribute values as a special value
        - Treat the missing value itself as a new value that is part of the data analysis
            - Make a simple model to estimate the missing value

In [None]:
DF.isnull().sum()

In [None]:
DF.shape

##### Reducing the data set

In [None]:
DF_=DF.drop('total_bedrooms',axis=1)
DF_.shape

In [None]:
DF_=DF.dropna()
DF_.shape

##### Imputing missing values

In [None]:
import matplotlib.pyplot as plt

In [None]:
print(round(DF['total_bedrooms'].mean(),2))
print(DF['total_bedrooms'].median())

In [None]:
DF['total_bedrooms'].describe()

In [None]:
DF_ = DF
X0 = DF_['total_bedrooms'].dropna()
X1=DF_['total_bedrooms'].fillna(value=DF_['total_bedrooms'].mean())
X2=DF_['total_bedrooms'].fillna(value=DF_['total_bedrooms'].median())
X3=DF_['total_bedrooms'].fillna(value=0)
print("X0=",round(X0.mean(),2))
print("X1=",round(X1.mean(),2))
print("X2=",round(X2.mean(),2))
print("X3=",round(X3.mean(),2))

In [None]:
X = [X0,X1,X2,X3]
fig = plt.figure(figsize =(10, 7))
ax = fig.add_axes([0, 0, 1, 1])
bp = ax.boxplot(X)
plt.show()

In [None]:
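# Keep rows below 1100 AND rows with missing values; a plain `< 1100` filter
# silently drops the NaN rows, which would leave nothing to impute below
DF_ = DF[(DF['total_bedrooms']<1100) | DF['total_bedrooms'].isna()]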
X0 = DF_['total_bedrooms'].dropna()
X1=DF_['total_bedrooms'].fillna(value=DF['total_bedrooms'].mean())
X2=DF_['total_bedrooms'].fillna(value=DF['total_bedrooms'].median())
X3=DF_['total_bedrooms'].fillna(value=0)
print("X0=",round(X0.mean(),2))
print("X1=",round(X1.mean(),2))
print("X2=",round(X2.mean(),2))
print("X3=",round(X3.mean(),2))
X = [X0,X1,X2,X3]
fig = plt.figure(figsize =(10, 7))
ax = fig.add_axes([0, 0, 1, 1])
bp = ax.boxplot(X)
plt.show()

#### Data in different scales
Approaches to bring different values onto the same scale
- **Normalization**: rescale the feature to the range [0, 1]
    $$x_{norm}^j = \frac{x^j-x^j_{min}}{x^j_{max}-x^j_{min}}$$
    where $j$ is the column number and $x^j_{min}$, $x^j_{max}$ are the minimum and maximum of column $j$.
    - Changes values to a common scale (between 0 and 1) without distorting differences in the ranges of values.
    - Use when features are in different ranges.
    - Use when the distribution is not known or is skewed.
    - Use for specific ML algorithms, e.g., K-Nearest Neighbors and Neural Networks
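
As a quick check of the normalization formula, the sketch below applies it by hand to a small made-up column and compares the result with `MinMaxScaler`. The four values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature column (values chosen only for illustration)
x = np.array([2.0, 4.0, 6.0, 10.0])

# Apply the formula directly: (x - min) / (max - min)
x_manual = (x - x.min()) / (x.max() - x.min())

# MinMaxScaler expects a 2-D array of shape (n_samples, n_features)
x_sklearn = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(x_manual)    # the values 0, 0.25, 0.5 and 1
print(x_sklearn)   # same values from scikit-learn
```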

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

In [None]:
X=DF['median_house_value']

In [None]:
sc = MinMaxScaler()
X_ = sc.fit_transform(np.array(X).reshape(-1,1))

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2,figsize=(10,5))
ax1.hist(X)
ax2.hist(X_)
plt.subplots_adjust(left=0.1,bottom=0.1,right=0.9, top=0.9,wspace=0.4,hspace=0.4)
plt.show()

- **Standardization**: re-center the feature at its mean and scale it by its standard deviation
    $$x_{std}^j= \frac{x^j-\mu_{x}}{\sigma_x}$$
    where $\mu_x$ is the average of $x^j$ and $\sigma_x$ is the standard deviation of $x^j$.
    - When measurements are in different units, we standardize the feature so that it is centered at 0 with a standard deviation of 1 ($1\sigma$).
    - Values at different scales can cause bias.
    - Works best when the data has a roughly Gaussian distribution and the ML algorithm makes that assumption (e.g., Linear Regression, Logistic Regression, Linear Discriminant Analysis).

- Data scaling should be one of the first steps of data preprocessing for many machine learning algorithms
     - Some machine learning algorithms can handle data in different scales (e.g., decision trees and random forests)
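
The standardization formula can be checked the same way. This is a minimal sketch on a hypothetical four-value column. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`, the NumPy default), whereas `pandas.Series.std()` uses `ddof=1` by default, so a hand computation with pandas differs slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature column (values chosen only for illustration)
x = np.array([2.0, 4.0, 6.0, 10.0])

# Apply the formula directly: (x - mean) / standard deviation (ddof=0)
x_manual = (x - x.mean()) / x.std()

# StandardScaler expects a 2-D array of shape (n_samples, n_features)
x_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(x_manual)
print("mean:", x_manual.mean(), "std:", x_manual.std())
```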

In [None]:
sc = StandardScaler()
X_ = sc.fit_transform(np.array(X).reshape(-1,1))

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))
ax1.hist(X)
ax2.hist(X_)
plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.4, 
                    hspace=0.4)
plt.show()

#### Skewed Data
- Real-world data can be messy and contain attributes that need modification before they can be used in modeling.
- For a normal distribution, the mean, median, and mode are approximately equal and sit at the center of the distribution.
- The skewness of data can be determined by how these quantities are related to one another.
    - **Right** skewed or **Positive** skewed: Mean > Median > Mode
    - **Left** skewed or **Negative** skewed: Mode > Median > Mean
    - The tail region may act as outliers that can affect the model's performance in regression models.

In [None]:
from scipy.stats import skewnorm
import matplotlib.pyplot as plt
import numpy as np

numValues, maxValue = 1000,100
skewnessL,skewnessR = -10,10   #Negative values are left skewed, positive values are right skewed.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))
for ax, skewness in zip([ax1,ax2],[skewnessL,skewnessR]):
    random = skewnorm.rvs(a = skewness,loc=maxValue, size=numValues)
    ax.hist(random,30,density=True, color='red',alpha=0.3)
    ax.vlines(np.mean(random),0,1.0,color='black',label='mean')
    ax.vlines(np.median(random),0,1.0,color='black',linestyle='--',label='median')
    ax.legend(loc='best')
plt.subplots_adjust(left=0.1,bottom=0.1,right=1,top=0.9,wspace=0.4,hspace=0.4)
plt.show()

- Handling skewness:
    - A log transformation can bring a skewed distribution closer to a normal distribution (usually applies to right skewed data).
        - Values $\le0$ cannot be transformed.
        - Add a constant so that the minimum value is at least 1, since $\log{(1)}=0$.
    - Remove outliers (both)
    - Normalize (applies to right skewed data)
    - Cube root, square root (applies to right skewed data)
    - Reciprocal (applies to right skewed data)
    - Square (applies to left skewed data)
    - Box-Cox transformation (applies to both)
        - Transform using the equations below. For strictly positive data ($y>0$):
        - $y(\lambda) = \begin{cases}
                (y^{\lambda}-1)/\lambda & \text{if $\lambda\ne0$} \\
\log{y} & \text{if $\lambda=0$} \\
\end{cases}$
        - For data with zero or negative values, a shift $\lambda_2$ is added so that $y+\lambda_2>0$:
        - $y(\lambda) = \begin{cases}
                ((y+\lambda_2)^{\lambda_1}-1)/\lambda_1 & \text{if $\lambda_1\ne0$} \\
\log{(y+\lambda_2)} & \text{if $\lambda_1=0$} \\
\end{cases}$
        - Usually $\lambda\in[-5,5]$, and we use the $\lambda$ value that gives the best approximation to a normal distribution.
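
The effect of these transformations can be sketched on synthetic data. The log-normal sample below is an assumption for illustration (it is right skewed by construction), and `scipy.stats.boxcox` is used with its default behavior of choosing $\lambda$ by maximum likelihood.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (log-normal); an assumption for illustration
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# Log transform: only valid because every value is strictly positive
y_log = np.log(y)

# Box-Cox: returns the transformed data and the fitted lambda
y_boxcox, lam = stats.boxcox(y)

print("skewness before :", round(stats.skew(y), 2))
print("skewness log    :", round(stats.skew(y_log), 2))
print("skewness Box-Cox:", round(stats.skew(y_boxcox), 2))
print("fitted lambda   :", round(lam, 2))
```

For log-normal data the fitted $\lambda$ should land close to 0, which is the case where Box-Cox reduces to the log transform.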


In [None]:
random0 = X

random1 = DF['median_house_value'][DF['median_house_value']<500000]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))
for ax, random in zip([ax1,ax2],[random0,random1]):
    ax.hist(random,30,density=True, color='red',alpha=0.3)
    y_min, y_max = ax.get_ylim()
    ax.vlines(np.mean(random),y_min,y_max,color='black',label='mean')
    ax.vlines(np.median(random),y_min,y_max,color='black',linestyle='--',label='median')
    ax.text((random.max()+random.min())/2,y_max*0.90,round(random.skew(),2),ha='center')
    ax.legend(loc='best')
plt.subplots_adjust(left=0.1,bottom=0.1,right=1,top=0.9,wspace=0.4,hspace=0.4)
plt.show()

In [None]:
from scipy import stats

In [None]:
X = random1
random2 = (X-X.min())/(X.max()-X.min())
random3 = np.sqrt(X)
random4 = stats.boxcox(X)[0]
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15,5))
for ax, random in zip([ax1,ax2,ax3],[random2,random3,random4]):
    ax.hist(random,30,density=True, color='red',alpha=0.3)
    y_min, y_max = ax.get_ylim()
    ax.vlines(np.mean(random),y_min,y_max,color='black',label='mean')
    ax.vlines(np.median(random),y_min,y_max,color='black',linestyle='--',label='median')
    ax.text((random.max()+random.min())/2,y_max*0.90,round(pd.Series(random).skew(),2),ha='center')
    ax.legend(loc='best')
plt.subplots_adjust(left=0.1,bottom=0.1,right=1,top=0.9,wspace=0.4,hspace=0.4)
plt.show()

#### Categorical Data
- For ordinal data, convert the strings into comparable integer values
    - E.g., XL > L > M > S $\to$ 5 (XL) > 4 (L) > 3 (M) > 2 (S)
    - Note that the integer value itself has no special meaning beyond ordering
    - Mapping needs to be unique: 1 to 1 mapping for going back and forth
- For nominal data, convert the strings into integers
    - E.g., Red (0), Blue (1), Green (2)
    - A common practice to avoid software glitches in handling strings
    - Note that the integer value itself has no special meaning (non-comparable)
    - Mapping needs to be unique: 1 to 1 mapping for going back and forth
- To avoid mistakenly comparing encoded integers for nominal data, one-hot encoding can be used
    - Each unique value becomes a separate dummy feature
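
The two encodings above can be sketched with pandas alone. The toy DataFrame and the `size_map` dictionary are hypothetical and exist only to illustrate the size and color examples from the list.

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    "size":  ["M", "XL", "S", "L"],
    "color": ["Red", "Blue", "Green", "Red"],
})

# Ordinal feature: explicit one-to-one mapping that preserves the order
size_map = {"S": 2, "M": 3, "L": 4, "XL": 5}
df["size_encoded"] = df["size"].map(size_map)

# Inverse mapping to go back from integers to the original strings
inverse_size_map = {v: k for k, v in size_map.items()}
df["size_decoded"] = df["size_encoded"].map(inverse_size_map)

# Nominal feature: one dummy column per unique value (one-hot encoding)
color_dummies = pd.get_dummies(df["color"], prefix="color")

print(df)
print(color_dummies)
```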

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
DF.head(3)

In [None]:
encoder = LabelEncoder()
housing_cat = DF["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded

In [None]:
print(encoder.classes_)

In [None]:
set(housing_cat_encoded)

#### Correlation between features and feature engineering
- Correlation analysis is one good way to reduce the data size
- The correlation between two features explains how they are related to each other.
    - The Pearson correlation coefficient is widely used.
    $$\rho_{x,y}=\frac{Cov(x,y)}{\sigma_x\sigma_y}$$
    - Ranges from -1 to 1. 
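
As a worked check of the formula, the sketch below computes the Pearson coefficient from its definition and compares it with `pandas.Series.corr`. The two synthetic variables are assumptions for illustration, built so that `y` depends linearly on `x`.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): y depends linearly on x
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)

# Pearson coefficient from the definition: Cov(x, y) / (sigma_x * sigma_y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho_manual = cov_xy / (x.std() * y.std())

# The same quantity computed by pandas (Pearson is the default method)
rho_pandas = pd.Series(x).corr(pd.Series(y))

print(round(rho_manual, 4), round(rho_pandas, 4))
```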

In [None]:
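# numeric_only=True skips the text column `ocean_proximity` (required in pandas >= 2.0)
corr_matrix = DF.corr(numeric_only=True)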
corr_matrix[['total_bedrooms','total_rooms','households']]

- Feature engineering extracts features using domain knowledge
    - Improves the performance of ML
    - Sometimes can be considered as applied ML
- For example, if $X$ and $Y$ are tightly correlated
    - We can use only $X$ as an independent variable
    - Or make a new feature called $Z = XY$ as an independent variable

In [None]:
plt.scatter(DF['total_bedrooms'],DF['total_rooms'])
plt.xlabel('total bedrooms')
plt.ylabel('total rooms')
plt.show()

In [None]:
X = DF[['total_bedrooms','total_rooms','households']].dropna()
X['bedroom_per_room'] = X['total_bedrooms']/X['total_rooms']
X.corr()

### 2.B.b How do we make models and what is after?
A machine learning **algorithm** learns a model from data (training), so that the model can be used to predict certain properties of new data (generalization)
<img src="img/ML_Model.png" width=600 length=600/>
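
The train-then-predict pattern described above can be sketched with scikit-learn. The synthetic data and the choice of `LinearRegression` are assumptions for illustration; any estimator with `fit` and `predict` follows the same two steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (an assumption): y = 4x + 1 plus noise
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 1))
y_train = 4.0 * X_train[:, 0] + 1.0 + rng.normal(scale=0.5, size=100)

# Training: the algorithm learns the model parameters from data
model = LinearRegression()
model.fit(X_train, y_train)

# Generalization: the trained model predicts for inputs it has not seen
X_new = np.array([[2.5], [7.0]])
y_new = model.predict(X_new)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predictions:", y_new)
```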

### 2.B.c How can we make the models better?
- **Training** is to build the ML model from data
    - Typically, training is a one-time effort, but computationally intensive
    - Speed is a main concern
- **Inference** is to use the ML model to predict results for new data
    - **Generalization** is the most interesting stage for applications
    - Typically, inference is fast but happens more frequently, on far more new (unlabeled) data
    - Scalability is a main concern
- <b>Split</b> known data into train and test datasets
    - train dataset: a data set used to train the model
    - test dataset: a data set used to give an indication of how well the trained model will generalize to new data (unknown at this point)
    - the test dataset is kept until the **very end** to evaluate the final model.
    - since the test dataset withholds valuable information that the learning algorithm could benefit from, we do not want to put too much data into the test dataset either.
        - 70:30, 80:20, 90:10 splits are common
<img src="img/Data_split_01.png" width=600 length=600 />

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(DF, test_size=0.3, random_state=42)

In [None]:
print(len(train_set),"train +", len(test_set),"test")

- <b>Cross-validation</b>: a model tuning process
    - How can we make the model training process aware of the targeted generalization quality so that training can do something about it?
    - We need to make the predicted generalization results part of the training optimization goal
        - We can **NOT** use the predicted generalization results from the test data; otherwise, the test data would become part of the training process
        - We want to keep the test data independent of training so that its prediction can still be a good indication of generalization quality for future unknown data
    - <b>Holdout cross-validation</b>
        - The training dataset is further split into two sets: training set + validation set.
        - Validation results are used to drive the continuation of the training process until we obtain a reasonable validation result.
        - We still use the test data to report the predicted generalization quality.
<img src="img/Data_split_03.png" width=600 length=600 />
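
A holdout split can be sketched by calling `train_test_split` twice. The random data and the 60/20/20 proportions below are assumptions for illustration, not a recommendation from the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data set of 1000 examples with 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.normal(size=1000)

# First split: hold out 20% as the test set (used only at the very end)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Second split: carve a validation set out of the remaining data.
# 0.25 of the remaining 80% gives a 60/20/20 split overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), "train +", len(X_val), "validation +", len(X_test), "test")
```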

- <b>K-Fold Cross-Validation</b>
    - Repeat holdout cross-validation k times on k subsets of the training data
        - Randomly split the training dataset into k folds without replacement
            - K-1 folds are used for training, and one fold used for validation
    - Repeat this k times so that we obtain k models
    - Typically k=10, but larger k for smaller dataset, and smaller k for larger dataset
    - Pros: the whole dataset is used as both a training set and validation set.
    - Cons: 
        - Not useful for imbalanced datasets.
        - Not suitable for sequential datasets. 
<img src="img/k_fold_CV.png" width=600 length=600 />
- <b>Stratified K-Fold Cross-Validation</b>
    - An enhanced version of K-Fold CV which is mainly used for imbalanced datasets. 
    - Each fold will have the same ratio of target classes as the whole dataset.
    - Pros: works well for imbalanced data
    - Cons: not suitable for sequential datasets. 
- <b>Leave One Out CV</b>
    - An exhaustive CV technique in which 1 sample point is used as a validation set and the remaining n-1 samples are used as a training set.
    - Repeats until every sample of the dataset is used as a validation point. 
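
The three schemes can be compared with scikit-learn's splitters. This is a minimal sketch on a synthetic imbalanced data set (90 negatives, 10 positives); the data and the choice of `LogisticRegression` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, StratifiedKFold,
                                     cross_val_score)

# Synthetic imbalanced binary data (an assumption): 90 negatives, 10 positives
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(2.0, 1.0, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

# Plain K-fold: folds are formed without looking at the labels,
# so the number of positives per validation fold can vary
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kf_positives = [int(y[val_idx].sum()) for _, val_idx in kf.split(X)]

# Stratified K-fold: every validation fold keeps the 90:10 class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skf_positives = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]

# One accuracy score per fold, here with the stratified splitter
skf_scores = cross_val_score(LogisticRegression(), X, y, cv=skf)

# Leave-one-out: as many splits as there are samples
n_loo_splits = LeaveOneOut().get_n_splits(X)

print("positives per fold (K-fold):   ", kf_positives)
print("positives per fold (stratified):", skf_positives)
print("stratified CV accuracy per fold:", skf_scores)
print("leave-one-out splits:", n_loo_splits)
```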

# 3. Example Demonstration. 

Data preprocessing is a very important stage in a machine learning project, because both how well the model is trained and how meaningful it is depend on the training data set: a model can score well statistically and still not tell a story. Preprocessing is therefore the stage where we spend most of our time before building models. It is broadly separated into three parts: Data Cleaning, Data Transformation, and Data Reduction. We focus on exploratory data analysis (EDA) to gain the insights and statistical measures needed to clean, transform, and reduce the data so that it is ready for modeling. In this demonstration, we walk through preprocessing using the `Housing.csv` data.

1. A Quick Look at the Data Structure.
2. Exploratory Data Analysis (EDA)
3. Feature Engineering
4. Data Split

In [None]:
import pandas as pd

In [None]:
DF = pd.read_csv('Housing.csv')

## 3.1 A Quick Look at the Data Structure.

In [None]:
DF.shape

In [None]:
print(list(DF.columns))

In [None]:
DF.head(5)

- There are 10 columns: `longitude`, `latitude`, `housing_median_age`, `total_rooms`, `total_bedrooms`, `population`, `households`, `median_income`, `median_house_value`, `ocean_proximity`.
- Use `shape` to find the size of the data: 20640 rows and 10 columns. We are going to treat this as a supervised learning problem to predict `median_house_value`, which we call the **target**; the remaining columns are the **features**, **predictors**, or **attributes**. We call the rows **examples** or **observations**.
- Use `head` to see how the data is structured. However, it does not describe the data.

In [None]:
DF.info()

- Use `info` to get the number of rows, and each attribute's type and number of non-null values.
- Notice that `total_bedrooms` has only 20,433 non-null values while the other attributes have 20,640. This means that `total_bedrooms` has 207 missing values, or *no observations*.
- All attributes are numerical except `ocean_proximity`, which is text data.

In [None]:
DF["ocean_proximity"].value_counts()

In [None]:
DF.describe()

The `describe()` method shows a summary of the numerical attributes. The count, mean, min, and max rows are self-explanatory. Note that the null values are ignored (see `total_bedrooms`). The std row shows the standard deviation. The 25%, 50%, and 75% rows show the corresponding *percentiles*: the value below which a given percentage of observations in a group of observations fall.

We can also aggregate the data by group to learn more.

In [None]:
DF.groupby("ocean_proximity").mean(numeric_only=True)

We often make a boxplot for a visualization.

In [None]:
import matplotlib.pyplot as plt
features = DF.columns.tolist()

ax = DF[features[2:9]].plot(kind='box', title='boxplot', showmeans=True,figsize=(15,7))
plt.show()

In [None]:
fig, axs = plt.subplots(1, len(features[2:9]), figsize = (25,7),facecolor='white')
for i,feat in zip(range(len(features[2:9])),features[2:9]):
    axs[i].boxplot(DF[feat].dropna())
    axs[i].set_xlabel(feat)
plt.show()

We find missing values.

In [None]:
DF.isnull().sum()

Only `total_bedrooms` has missing values.

In [None]:
DF.hist(bins=50, figsize=(20,15))
plt.show()

There are a few things to notice from the histograms:
- The `median_income` is not in U.S. dollars. The data has been scaled and capped at 15 for higher median incomes and at 0.5 for lower median incomes. Working with preprocessed attributes is common in ML and it is not necessarily a problem. However, it is important to understand how the data was computed.
- The `median_house_value` and `housing_median_age` are capped at 500,000 and 50, respectively. The former may be a serious problem since it is the target attribute and ML algorithms may learn that prices never go beyond that limit. If the goal is to predict `median_house_value` under \\$500,000, there is no problem: we can just use the observations that satisfy the condition. However, if we have to predict beyond \\$500,000, we have two options:
    1. Collect proper labels for the districts whose labels were capped.
    2. Remove these districts, since the model will predict poorly beyond \\$500,000 if they are included. (Do not assume that removing the \\$500,000 observations without EDA is the right choice!)
- These features have very different scales.
- Many histograms are *tail heavy*. This may make it a bit harder for some ML algorithms to detect patterns. We need to transform these attributes to have more **bell-shaped distributions**.

## 3.2 Exploratory Data Analysis
### Visualization

Visualization is one of the easiest ways to understand the data structure. It goes beyond statistical measurements and gives insight into the data itself.

In [None]:
DF.plot(kind="scatter",x="longitude",y="latitude")
plt.show()

In [None]:
DF.plot(kind="scatter",x="longitude",y="latitude", alpha=0.4, s=DF["population"]/100,
       label="population",c="median_house_value", cmap=plt.get_cmap("jet"),colorbar=True)
plt.legend()
plt.show()

The scatter plots give us geographical information about the data.
- We can see that the houses are located in coastal California.
- We can see the population of each district and which coastal areas have high and low median house values.

### Correlations

We can use `corr()` to compute the *standard correlation coefficient* (also known as *Pearson's r*) between every pair of features.
- The coefficient ranges from -1 to 1. A value close to 1 means a strong positive correlation, a value close to -1 means a strong negative correlation, and a value close to 0 means there is no linear correlation.

In [None]:
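# numeric_only=True skips the text column `ocean_proximity` (required in pandas >= 2.0)
corr_matrix = DF.corr(numeric_only=True)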

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

- The `median_house_value` tends to go up when the `median_income` goes up.
- The `median_house_value` tends to go down as `latitude` and `longitude` go up.
- The `median_income` has the strongest correlation.

The correlation study brings important information that can help us design the preprocessing. We can expect `median_income` to be a major feature in the prediction, whereas `population` will have the least impact on the model.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
features = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(DF[features], figsize=(12,8))
plt.show()

- For details, read the link: https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html
- The most promising feature to predict the `median_house_value` is the `median_income`.

In [None]:
corr = DF.corr(numeric_only=True)
corr.style.background_gradient(cmap='coolwarm')

In [None]:
DF.plot(kind="scatter", x="median_income",y="median_house_value", alpha=0.1)
plt.show()

- The correlation is strong.
- There is a price cap, visible as a horizontal line at \\$500,000.
- The plot reveals other, less obvious straight lines around \\$450,000, \\$350,000, and \\$280,000.

## 3.3 Data Mining & Feature Engineering
### New Combined Features

Consider the following arguments:
1. The `total_rooms` is not useful if the `households` is not known.
2. The `total_bedrooms` itself is not useful if the `total_rooms` is not known.
3. The arguments above do not come from ML experience, but from how deeply we understand the data and the problem.

In [None]:
DF["rooms_per_household"]=DF["total_rooms"]/DF["households"]
DF["bedrooms_per_room"]=DF["total_bedrooms"]/DF["total_rooms"]
DF["population_per_household"]=DF["population"]/DF["households"]

In [None]:
corr_matrix=DF.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
features = ["median_house_value", "rooms_per_household", "bedrooms_per_room", "population_per_household"]
scatter_matrix(DF[features], figsize=(12,8))
plt.show()

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15,5))

n, bins, patches = ax1.hist(DF["rooms_per_household"],bins=60)
ax1.set_xlabel('rooms_per_household')
ax1.set_ylabel('Frequency')

n, bins, patches = ax2.hist(DF["total_rooms"],bins=60)
ax2.set_xlabel('total_rooms')
ax2.set_ylabel('Frequency')

n, bins, patches = ax3.hist(DF["households"],bins=60)
ax3.set_xlabel("households")
ax3.set_ylabel('Frequency')
plt.subplots_adjust(left=0.1,bottom=0.1,right=1,top=0.9,wspace=0.4,hspace=0.4)
plt.show()

- The `bedrooms_per_room` is much more correlated with the `median_house_value` than the `total_rooms` and the `total_bedrooms`. 
- Houses with a lower bedroom/room ratio tend to be more expensive. 
- The `rooms_per_household` is more informative than the `total_rooms`.
- The bigger the house, the more expensive it is.
- This exploration does not have to be absolutely thorough - the point is to start off on the right foot and quickly gain insights that will help us to get a first reasonably good prototype. This is an iterative process. 

### Handling Missing Values

In [None]:
DF.isnull().sum()

Missing Value Handling:
- Impute with the average, the median, or a specific number (e.g., 0), or drop
- Use `fillna()` to fill in missing values
- If we decide to drop, we must choose between dropping the column or the rows.
    - use `dropna()` to delete every row that has a missing value in some column.
    - use `drop()` to delete the column.

In [None]:
DF.dropna(subset=['total_bedrooms']).shape

In [None]:
DF.drop("total_bedrooms",axis=1).shape

In [None]:
median=DF["total_bedrooms"].median()
average=DF["total_bedrooms"].mean()
print("median is",median)
print("average is", average)

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15,5))

n, bins, patches = ax1.hist(DF["total_bedrooms"],bins=60)
ax1.set_xlabel('Without Fillna')
ax1.set_ylabel('Frequency')

n, bins, patches = ax2.hist(DF["total_bedrooms"].fillna(median),bins=60)
ax2.set_xlabel('Fillna(median)')
ax2.set_ylabel('Frequency')

n, bins, patches = ax3.hist(DF["total_bedrooms"].fillna(average),bins=60)
ax3.set_xlabel('Fillna(mean)')
ax3.set_ylabel('Frequency')
plt.subplots_adjust(left=0.1,bottom=0.1,right=1,top=0.9,wspace=0.4,hspace=0.4)
plt.show()

The **Kolmogorov-Smirnov test (KS-Test)** is a nonparametric test of the equality of continuous (or discontinuous), one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution or to compare two samples.

The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples. The null distribution of this statistic is calculated under the null hypothesis that the sample is drawn from the reference distribution (in the one-sample case) or that the samples are drawn from the same distribution (in the two-sample case).

In [None]:
from scipy.stats import ks_2samp
ks_2samp(DF["total_bedrooms"].dropna(),DF["total_bedrooms"].fillna(median))

In [None]:
ks_2samp(DF["total_bedrooms"].dropna(),DF["total_bedrooms"].fillna(average))

Under the null hypothesis the two distributions are identical. If the K-S statistic is small or the p-value is high (greater than the significance level, say 5%), then we cannot reject the hypothesis that the distributions of the two samples are the same. Conversely, we can reject the null hypothesis if the p-value is low.
- Therefore, we can either drop all `total_bedrooms` missing observations or fill either with the median or the average of `total_bedrooms`. 
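
To see how the statistic and p-value behave, here is a small sketch on synthetic samples (the normal samples and the shift of 1.0 are assumptions for illustration): two samples drawn from the same distribution, and one drawn from a shifted distribution.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic samples (assumptions for illustration)
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=1000)
b = rng.normal(loc=0.0, scale=1.0, size=1000)   # same distribution as a
c = rng.normal(loc=1.0, scale=1.0, size=1000)   # shifted distribution

same = ks_2samp(a, b)   # expect a small statistic
diff = ks_2samp(a, c)   # expect a large statistic and a tiny p-value

print("same distribution   :", same.statistic, same.pvalue)
print("shifted distribution:", diff.statistic, diff.pvalue)
```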

We can also use `SimpleImputer()` from `sklearn.impute` to fill in all `NaN` values.

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
housing_num = DF.drop("ocean_proximity",axis=1)
imputer.fit(housing_num)

In [None]:
imputer.statistics_

In [None]:
housing_num.median().values

In [None]:
X = imputer.transform(housing_num)
housing_tr=pd.DataFrame(X, columns=housing_num.columns)

In [None]:
housing_tr.describe()

In [None]:
housing_tr.isnull().sum()

In [None]:
# Drop the missing values first so that NaN does not propagate into the test
ks_2samp(DF["bedrooms_per_room"].dropna(),housing_tr["bedrooms_per_room"])

### Handling Categorical and Text Data:
- Most ML algorithms prefer to work with numbers. Therefore, converting text labels to numbers may preserve more meaningful information about the data than simply dropping them. 
- The `ocean_proximity` attribute is text, and we can use `LabelEncoder()` to convert the text to numbers. 

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = DF["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded

In [None]:
print(encoder.classes_)

- The `OneHotEncoder()` encoder converts integer categorical values into one-hot vectors. 
- Note that `fit_transform()` expects a 2-D array, so we need to reshape `housing_cat_encoded` because it is a 1-D array. 
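As a side note, and assuming a reasonably recent scikit-learn (0.20 or later), `OneHotEncoder` can also be fitted on the text column directly, as long as the input is 2-D. The sketch below uses a small made-up DataFrame (`toy_cat`) standing in for `DF["ocean_proximity"]`. The variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column standing in for DF["ocean_proximity"] (values are illustrative)
toy_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

toy_encoder = OneHotEncoder()
# Double brackets select a one-column DataFrame, which is already 2-D
toy_one_hot = toy_encoder.fit_transform(toy_cat[["ocean_proximity"]])

print(toy_encoder.categories_[0])  # learned categories, in sorted order
print(toy_one_hot.toarray())       # dense view of the sparse result
```

Each row of the dense array has a single 1 in the column of its category.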

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
encoder=OneHotEncoder()
housing_cat_encoded_1hot=encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_encoded_1hot

In [None]:
housing_cat_encoded_1hot.toarray()

- We get a matrix with 5 columns, where each row contains a single 1 and 0s elsewhere. 
- We can use `LabelBinarizer()` to do the same work in one shot. 

In [None]:
from sklearn.preprocessing import LabelBinarizer
# Fit on the original text labels so that text -> one-hot happens in a single step
encoder=LabelBinarizer()
housing_cat_encoded_1hot=encoder.fit_transform(housing_cat)
housing_cat_encoded_1hot

In [None]:
housing_Encoded=pd.DataFrame(housing_cat_encoded_1hot)

In [None]:
housing_Encoded.head(5)

In [None]:
housing_Encoded.columns=['<1H OCEAN','INLAND','ISLAND','NEAR BAY','NEAR OCEAN']

In [None]:
housing_Encoded.head(5)

In [None]:
DF_final = pd.concat([housing_tr, housing_Encoded], axis=1)

In [None]:
DF_final.info()

### Feature Scaling
One of the most important transformations we need to apply to the data is *feature scaling*. With few exceptions, ML algorithms do not perform well when the input numerical attributes have very different scales. 

Two common ways: *Min-Max Scaling* and *standardization*

***Min-Max Scaling*** (**Normalization**): values are shifted and rescaled so that they end up ranging from 0 to 1. Scikit-Learn provides a transformer called `MinMaxScaler`. 

***Standardization***: values are centered to zero mean and scaled to unit variance. The result is not bounded to a specific range, and it is much less affected by outliers than min-max scaling. Scikit-Learn provides a transformer called `StandardScaler`. 
- The dataset is labeled, so if we are going to predict `median_house_value`, this is **supervised learning**. 
- If we are going to use a linear model such as **linear regression**, which tends to work better when the features are roughly Gaussian and on comparable scales, then we need to standardize them.
- If we are going to use non-linear models such as tree-based algorithms (e.g., **decision tree, random forest**), then we do not need to scale the data.
- If we are going to classify `ocean_proximity`, then we should concatenate `housing_cat_encoded` (the integer labels) instead of the one-hot columns. 
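To make the two scaling methods concrete, here is a minimal sketch on a toy feature with one large outlier. It applies the formulas by hand and checks them against `MinMaxScaler` and `StandardScaler`. The array `x` and its values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy feature with a single large outlier (values are illustrative)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, giving zero mean and unit variance
x_std = (x - x.mean()) / x.std()  # population std (ddof=0), as StandardScaler uses

print(np.allclose(x_minmax, MinMaxScaler().fit_transform(x)))
print(np.allclose(x_std, StandardScaler().fit_transform(x)))
print(x_minmax.ravel())
print(x_std.ravel())
```

In this toy case, min-max scaling pins the outlier at 1 and squeezes the other four values close to 0, while the standardized values are not confined to a fixed range.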

We are going to drop outliers for all features. 
- In this example, an observation is considered an outlier if its value for any feature falls at or below the 1st percentile or at or above the 99th percentile. 
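One possible way to express this percentile rule is sketched below on synthetic data. It is a variant, not the lecture's exact procedure: all quantiles are computed once on the full data and combined into a single mask, whereas a column-by-column loop recomputes the quantiles on the rows that survived the previous columns. The helper name `drop_percentile_outliers` and the DataFrame `toy_features` are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_percentile_outliers(df, cols, lower=0.01, upper=0.99):
    """Keep rows whose values lie strictly between the lower and upper
    quantiles of every column in cols (quantiles computed once, up front)."""
    low = df[cols].quantile(lower)
    high = df[cols].quantile(upper)
    inside = (df[cols] > low) & (df[cols] < high)  # compared column by column
    return df[inside.all(axis=1)].copy()

rng = np.random.default_rng(0)
toy_features = pd.DataFrame({
    "a": rng.normal(size=1000),       # synthetic symmetric feature
    "b": rng.exponential(size=1000),  # synthetic right-skewed feature
})
trimmed = drop_percentile_outliers(toy_features, ["a", "b"])
print(toy_features.shape, trimmed.shape)
```

Returning a copy also avoids pandas chained-assignment warnings if columns of the trimmed frame are overwritten later.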

In [None]:
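cols = housing_tr.columns.tolist()
DF_ = DF_final
# Filter one feature at a time; quantiles are recomputed on the rows that remain
for col in cols:
    q01 = DF_[col].quantile(0.01)
    q99 = DF_[col].quantile(0.99)
    # keep only rows strictly inside the 1st-99th percentile band
    DF_ = DF_[(DF_[col]>q01) & (DF_[col]<q99)]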

In [None]:
print(DF_.shape,DF_final.shape)

In [None]:
corr = DF_final[cols].corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
corr = DF_[cols].corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
DF_[cols].hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
left_skew, right_skew,norm_ = [],[],[]
for col in cols:
    skew_coef = DF_[col].skew()
    # treat |skewness| <= 0.05 as approximately symmetric
    if (skew_coef<-0.05):
        print(col,"left skewed",skew_coef)
        left_skew.append(col)
    elif (skew_coef>0.05):
        print(col,"right skewed",skew_coef)
        right_skew.append(col)
    else: 
        print(col,"close to Gaussian")
        norm_.append(col)

In [None]:
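from scipy import stats

# Box-Cox needs strictly positive values; [1:] skips the first column in the list
for col in right_skew[1:]:
    DF_[col] = stats.boxcox(DF_[col])[0]  # [0] is the transformed data, [1] the fitted lambda
    X_skew = pd.Series(DF_[col]).skew()
    if (abs(X_skew)<=0.05):
        print(col,X_skew)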

In [None]:
# Same Box-Cox transform for the left-skewed features (again skipping the first one)
for col in left_skew[1:]:
    DF_[col] = stats.boxcox(DF_[col])[0]
    X_skew = pd.Series(DF_[col]).skew()
    if (abs(X_skew)<=0.05):
        print(col,X_skew)

In [None]:
DF_[cols].hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
DF_.to_csv('./Housing_scaled_01.csv')

In [None]:
corr = DF_[cols].corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion

# Standardize the outlier-trimmed, Box-Cox-transformed features in DF_
scaler = StandardScaler()
scaled_data = scaler.fit_transform(DF_[cols])

In [None]:
scaled_data

In [None]:
# Reuse DF_'s index so the scaled features line up with the one-hot columns in pd.concat
housing_scaled=pd.DataFrame(scaled_data, columns=cols, index=DF_.index)
housing_Encoded = DF_[housing_Encoded.columns.tolist()]

In [None]:
DF_final = pd.concat([housing_scaled, housing_Encoded], axis=1)

In [None]:
DF_final.describe()

In [None]:
DF_final[cols].hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
# Scale the same rows (DF_) that the one-hot columns are taken from
scaled_data = scaler.fit_transform(DF_[cols])
# Reuse DF_'s index so pd.concat aligns the rows correctly
housing_scaled=pd.DataFrame(scaled_data, columns=cols, index=DF_.index)
housing_Encoded = DF_[housing_Encoded.columns.tolist()]
DF_final = pd.concat([housing_scaled, housing_Encoded], axis=1)
DF_final[cols].hist(bins=50, figsize=(20,15))
plt.show()

## 3.4 Data Split

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(DF_, test_size=0.2, random_state=42)

In [None]:
print(len(train_set),"train +", len(test_set),"test")

- `train_test_split` is the simplest function for splitting data. 
- `random_state` sets the random generator seed, so the same split is reproduced on every run. 
- We can also pass `train_test_split` multiple datasets with an identical number of rows, and it will split them on the same indices. 
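The behavior described above can be sketched with toy arrays: the same `random_state` reproduces the same split, and several arrays of equal length are split on the same rows. The arrays `X` and `y` below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 rows, 2 toy features; row i is [2i, 2i+1]
y = np.arange(10)                 # one toy label per row; label i belongs to row i

# Passing several arrays of equal length splits them on the same row indices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Re-running with the same random_state reproduces the same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_train), "train +", len(X_test), "test")
print(np.array_equal(X_test, X_test2))      # same rows selected both times
print(np.array_equal(X_test[:, 0] // 2, y_test))  # features and labels stay paired
```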

# 4. Conclusion
- In this lecture, we walked through the general ML project workflow. 
- An ML project follows the "garbage in, garbage out" principle.
- Data preprocessing is very important, and it directly impacts the result. 
- There are many different and creative ways to attack the problem.
    - Having a strong understanding of different ML algorithms and techniques is a strong asset. 
    - Because there are so many ways to attack the problem, being creative can bring out interesting stories from the dataset. 
- There are more things to be aware of in preprocessing and in making training and test sets than the example we have seen in this lecture. 