<table width="100%" style="background-color:#caf0fa">
    <tr style="background-color:#caf0fa">
        <td>
            <h1 style="text-align:right">
                Python for Data Science Training - Week 6
            </h1>
        </td>
        <td>
            <img src="../img/jica-logo.png" alt = "JICA Training" style = "width: 100px;"/>
        </td>
    </tr>
</table>

# Today's Contents
## Machine Learning

---
Before starting the session, you need to install [scikit-learn](https://scikit-learn.org/stable/index.html), one of the most widely used machine learning libraries in Python. Open a command prompt (`cmd`), then run this command:
```shell
conda install -c anaconda scikit-learn
```

There are three common machine learning tasks: **regression**, **classification**, and **clustering**. Today we will work on regression and classification, both of which use supervised machine learning. We leave aside clustering, which is primarily done with unsupervised machine learning.

# 1. Regression Problem
In a regression problem, the dependent variable is continuous (for example, price or temperature), and the goal is to predict its value from one or more independent variables. The method applied here is ordinary least squares (OLS), which minimizes the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.
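
As a minimal sketch of what OLS does, the snippet below fits a straight line to a handful of made-up points by minimizing the sum of squared residuals with NumPy's `np.linalg.lstsq`. The data and variable names are invented for illustration only and are not part of the dataset used in this session.

```python
import numpy as np

# Made-up data for illustration only: y is roughly 2 * x + 1 plus some noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# Find beta that minimizes ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

# Sum of squared errors of the fitted line
y_hat = A @ beta
sse = np.sum((y - y_hat) ** 2)
print('intercept = {:.3f}, slope = {:.3f}, SSE = {:.3f}'.format(intercept, slope, sse))
```

Any other choice of intercept and slope would give a larger sum of squared errors on these points.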

<img src="https://upload.wikimedia.org/wikipedia/commons/b/be/Normdist_regression.png">

source:[wikipedia](https://en.wikipedia.org/wiki/Regression_analysis)  


We will estimate the heating and cooling load based on the given properties.  
**Heating load** is the amount of heat energy that would need to be added to a space to maintain the temperature.  
**Cooling load** is the amount of heat energy that would need to be removed from a space to maintain the temperature.  
Collectively called "thermal loads", they take into account:
- the dwelling's construction and insulation; including floors, walls, ceilings and roof; and
- the dwelling's glazing and skylights; based on size, performance, shading and overshadowing.

Lower thermal loads indicate that the dwelling will require less heating and cooling to maintain comfortable conditions.  
(Heating and cooling loads, [BASIX](https://basix.nsw.gov.au/iframe/thermal-help/heating-and-cooling-loads.html))

We will use the dataset provided by the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/Energy+efficiency#).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os

# suppress warning messages
import warnings; warnings.filterwarnings('ignore')

In [None]:
# read dataset
df = pd.read_csv('data/energy_efficiency.csv')
df.head()

According to the [documentation](https://archive.ics.uci.edu/ml/datasets/Energy+efficiency#), the variables are defined as follows. The target variables are `Y1` and `Y2`, and the rest of the variables are the predictors.

|feature name|variable|
|---|---|
|X1|Relative Compactness|
|X2|Surface Area|
|X3|Wall Area|
|X4|Roof Area|
|X5|Overall Height|
|X6|Orientation|
|X7|Glazing Area|
|X8|Glazing Area Distribution|
|Y1|Heating Load|
|Y2|Cooling Load|

Since the feature names are not intuitive, we'll rename the columns to the descriptive variable names.

In [None]:
# Change column names
column_name_dict = {
'X1':'Relative_Compactness',
'X2':'Surface_Area',
'X3':'Wall_Area',
'X4':'Roof_Area',
'X5':'Overall_Height',
'X6':'Orientation',
'X7':'Glazing_Area',
'X8':'Glazing_Area_Distribution',
'Y1':'Y1_Heating_Load',
'Y2':'Y2_Cooling_Load'}

df = df.rename(columns = column_name_dict)
df.head()

In [None]:
# check data types
df.dtypes

The data types were interpreted correctly, so no further type conversion is needed.

In [None]:
# check null values
df.isnull().sum()

In [None]:
# check statistics
df.describe()

From these statistics, we understand that:
- The sample size is 768.
- Our target variables (Y1 and Y2) are both continuous variables.
- Our input variables are likewise continuous variables. (`Glazing_Area_Distribution` may be discrete, but we can interpret it as a continuous variable.)

Let's review Pearson's correlation coefficient to find out whether any predictors are unrelated to the target variables.
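
Pearson's correlation coefficient between two variables is their covariance divided by the product of their standard deviations. The following is a small sketch on made-up arrays (the helper `pearson_r` is a hypothetical name used only here) that computes the coefficient by hand and compares it with `np.corrcoef`.

```python
import numpy as np

# Made-up values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

def pearson_r(a, b):
    """Pearson's correlation coefficient of two 1-D arrays."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = np.sum(a_centered * b_centered)
    denominator = np.sqrt(np.sum(a_centered ** 2) * np.sum(b_centered ** 2))
    return numerator / denominator

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print('manual: {:.4f} | numpy: {:.4f}'.format(r_manual, r_numpy))
```

Values close to +1 or -1 indicate a strong linear relationship, while values close to 0 indicate a weak linear relationship.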

In [None]:
# Looking into Pearson's correlation coefficient 
df_corr = df.corr()
df_corr

In [None]:
# Let's simplify the table by taking dependent variables as columns
df_corr_targets = df_corr[['Y1_Heating_Load', 'Y2_Cooling_Load']].iloc[:-2]
df_corr_targets

In [None]:
# Visualize correlation
df_corr_targets.plot(style = '.', figsize = (7, 5))
plt.axhline(y = 0, linestyle = '--', color = 'k')
plt.xticks(rotation = 90, fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlabel('Predictors', fontsize = 12)
plt.ylabel('Correlation', fontsize = 12)
plt.title('Pearson Correlation Coefficient', fontsize = 14);

### Machine Learning
Simply put, supervised machine learning predicts a dependent variable from a set of input data. The following procedure is applied.
1. Split the data into X (input) and y (output).
2. Split X and y into training and test data. (Normally, 80-20 or 70-30.)
3. Rescale the X data.
4. Build the model (fit -> evaluate).
5. Predict on unseen data.

## 1. Split data into X and y

In [None]:
# Preparing dataset. We have two `y data`, we'll create y1 and y2.
y1 = df['Y1_Heating_Load']
y2 = df['Y2_Cooling_Load']
X = df.drop(columns = ['Y1_Heating_Load', 'Y2_Cooling_Load'])

In [None]:
print('Y1 data: ', y1.values[:5])
print('Y2 data: ', y2.values[:5])
print('X data: ', X.values[:5])

## 2. Split X and y into training and test data

We will split our dataset into 70 percent training data and 30 percent test data. This can be done quickly with `train_test_split` from `sklearn.model_selection`. `train_test_split` takes the arguments **X data**, **y data**, and **test_size**, given here as a fraction between 0 and 1 (0.3 for 30 percent). Optionally, we can add **random_state** to make the split reproducible and **shuffle** to shuffle the data before splitting.
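
To make the idea concrete, here is a rough NumPy-only sketch of a shuffled 70-30 split on toy data. It is not scikit-learn's actual implementation, and the `_toy` names are hypothetical; it only illustrates that the rows are shuffled once and that X and y are cut at the same positions so they stay aligned.

```python
import numpy as np

# Toy data for illustration only: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

test_size = 0.3
rng = np.random.default_rng(1234)   # fixed seed, playing the role of random_state

# Shuffle the row indices, then cut them into a test part and a training part
indices = rng.permutation(len(X_toy))
n_test = int(round(test_size * len(X_toy)))
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Apply the same indices to X and y so that rows and targets stay aligned
X_toy_train, X_toy_test = X_toy[train_idx], X_toy[test_idx]
y_toy_train, y_toy_test = y_toy[train_idx], y_toy[test_idx]

print(X_toy_train.shape, X_toy_test.shape, y_toy_train.shape, y_toy_test.shape)
```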

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# `train_test_split` randomly selects data based on the given test_size.
X_train, X_test, y1_train, y1_test = train_test_split(X, y1, test_size = .3, random_state = 1234, shuffle = True)

In [None]:
# Similarly, we can create the y2 training and test data.
X_train, X_test, y2_train, y2_test = train_test_split(X, y2, test_size = .3, random_state = 1234, shuffle = True)

In [None]:
# Less commonly, we can split X, y1, and y2 with a single call.
X_train, X_test, y1_train, y1_test, y2_train, y2_test = train_test_split(X, y1, y2, test_size = .3,
                                                                         random_state = 1234, shuffle = True)

In [None]:
# We can quickly inspect each data size.
for data in [X_train, X_test, y1_train, y1_test, y2_train, y2_test]:
    print(data.shape)

In [None]:
for data in [X_train, X_test, y1_train, y1_test, y2_train, y2_test]:
    print(data.values[:5])

## 3. Rescaling X data.
A machine learning model does not know what the variables mean; it only sees their numeric values. A variable measured on a large scale can therefore look more important than one measured on a small scale. To have the model treat the variables equally, we rescale them to a common scale.

We'll standardize the data so that each variable has mean 0 and standard deviation 1. Mathematically, the scaled data is

$$ z = \frac{(x - u)}{s} $$

where *u* is the mean of the samples and *s* is the standard deviation of the samples.
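
The formula can be sketched directly in NumPy. The arrays below are toy values for illustration only. Note that the mean and standard deviation are computed from the training data and then reused for the test data, and that `StandardScaler` uses the population standard deviation (equivalent to `ddof=0` in NumPy).

```python
import numpy as np

# Toy data for illustration only
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
test = np.array([[2.5, 250.0], [5.0, 500.0]])

# Mean and standard deviation are computed from the training data only
u = train.mean(axis=0)
s = train.std(axis=0)   # population standard deviation (ddof=0)

train_scaled = (train - u) / s
test_scaled = (test - u) / s   # reuse the training statistics for the test data

print('train mean:', train_scaled.mean(axis=0))
print('train std: ', train_scaled.std(axis=0))
print('test scaled:\n', test_scaled)
```

After scaling, both columns of the training data have mean 0 and standard deviation 1, even though their original scales differ by a factor of 100.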

Rescaling is conducted with:
```python
from sklearn.preprocessing import StandardScaler
```
We'll initialize the `StandardScaler`, then fit and transform the training data.

In [None]:
# Let's quickly review X_train data
X_train.head()

In [None]:
# also inspect test data
X_test.head()

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
ss = StandardScaler()

# Compute the mean and std based on X_train data.
scaler = ss.fit(X_train)

# Standardize X_train data
X_train = scaler.transform(X_train)

# Standardize X_test data
X_test = scaler.transform(X_test)

In [None]:
print('X_train transformed: ', X_train[0])
print('X_test transformed: ', X_test[0])

## 4. Build the model (fit -> evaluate -> predict)


In [None]:
from sklearn.linear_model import LinearRegression
# Initializing the model
lr1 = LinearRegression()
lr2 = LinearRegression()

In [None]:
## Y1 Data
# Fitting data to the model
lr1.fit(X_train, y1_train)

In [None]:
# Evaluate the train score
lr_y1_train_score = lr1.score(X_train, y1_train)

# Evaluate the test score
lr_y1_test_score = lr1.score(X_test, y1_test)

print('Training score on y1: ', lr_y1_train_score)
print('Test score on y1: ', lr_y1_test_score)
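
For scikit-learn regressors such as `LinearRegression`, `score` returns the coefficient of determination $R^2$. As a small sketch, $R^2$ can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The values below are made up for illustration only.

```python
import numpy as np

# Made-up actual and predicted values for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print('R^2 = {:.3f}'.format(r2))
```

A score of 1.0 means the predictions match the actual values exactly, and the score can be negative when a model does worse than always predicting the mean.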

In [None]:
# Predict values
lr_y1_predictions = lr1.predict(X_test)

print('ML results on predicting Heating Load (Y1)\n')
# Use neutral loop names so the `y1` Series defined earlier is not overwritten
for actual, predicted in zip(y1_test[:10], lr_y1_predictions[:10]):
    print('Actual: {}\t| Predicted: {}'.format(round(actual, 3), round(predicted, 3)))

In [None]:
## Y2 Data
# Fitting data to the model
lr2.fit(X_train, y2_train)

# Evaluate the train score
lr_y2_train_score = lr2.score(X_train, y2_train)

# Evaluate the test score
lr_y2_test_score = lr2.score(X_test, y2_test)

# Predict values
lr_y2_predictions = lr2.predict(X_test)

In [None]:
print('ML results on predicting Cooling Load (Y2)\n')
# Use neutral loop names so the `y2` Series defined earlier is not overwritten
for actual, predicted in zip(y2_test[:10], lr_y2_predictions[:10]):
    print('Actual: {}\t| Predicted: {}'.format(round(actual, 3), round(predicted, 3)))

In [None]:
# Visualizing data
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10, 5))

# scatter of actual and predicted
ax1.scatter(y1_test, lr_y1_predictions, s = 8, marker = '.')
# draw a diagonal line
ax1.plot([0, max(max(y1_test), max(lr_y1_predictions))], [0, max(max(y1_test), max(lr_y1_predictions))], '--', color = 'red')
# add ylabel
ax1.set_ylabel('Predicted')
# add xlabel
ax1.set_xlabel('Actual')
# add title
ax1.set_title('Predicted results - y1 outcome | score={}'.format(round(lr_y1_test_score, 2)))

ax2.scatter(y2_test, lr_y2_predictions, s = 8, marker = '.')
ax2.plot([0, max(max(y2_test), max(lr_y2_predictions))], [0, max(max(y2_test), max(lr_y2_predictions))], '--', color = 'red')
ax2.set_ylabel('Predicted')
ax2.set_xlabel('Actual')
ax2.set_title('Predicted results - y2 outcome | score={}'.format(round(lr_y2_test_score, 2)));

We can retrieve the intercept and coefficients of the fitted model.

In [None]:
print('Coefficients: ', lr2.coef_)
print('\nIntercept: ', lr2.intercept_)

From the above numbers, we can construct the formula as:

$$
\hat{y} = 24.38189944134079 - 7.90034619\,b_1 - 4.20952194\,b_2 + 0.01555827\,b_3 - 4.13528677\,b_4 + 7.41827072\,b_5 + 0.12330692\,b_6 + 1.94546433\,b_7 + 0.19784801\,b_8
$$

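Here $b_1, \dots, b_8$ are the standardized predictors, because the model was fitted on the scaled data. We can predict the value for an arbitrary set of predictor values.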

In [None]:
# arbitrary values
b_values = [-1.53852577,  1.43804518,  1.91398512,  0.94193515,  1.02069834, 1.30082467,  1.23012947,  1.41565359]

In [None]:
# Manually predict the value
b_estimated = 24.38189944134079 + -7.90034619*b_values[0] +\
                -4.20952194*b_values[1] + 0.01555827*b_values[2] +\
                -4.13528677*b_values[3] + 7.41827072*b_values[4] + \
                0.12330692*b_values[5] + 1.94546433*b_values[6] + 0.19784801*b_values[7]

In [None]:
print('Predicted y2 (cooling load) is: {:.2f}'.format(b_estimated))
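
The manual calculation above is a dot product between the coefficients and the input values, plus the intercept. A short vectorized sketch with NumPy follows. The numbers are copied from the formula and the arbitrary values above, and the names `coefficients` and `b_array` are introduced only for this illustration.

```python
import numpy as np

# Intercept and coefficients copied from the formula above
intercept = 24.38189944134079
coefficients = np.array([-7.90034619, -4.20952194, 0.01555827, -4.13528677,
                         7.41827072, 0.12330692, 1.94546433, 0.19784801])

# Same arbitrary values as above (assumed to be on the standardized scale)
b_array = np.array([-1.53852577, 1.43804518, 1.91398512, 0.94193515,
                    1.02069834, 1.30082467, 1.23012947, 1.41565359])

# Linear prediction: intercept + sum of coefficient * value
b_estimated_vectorized = intercept + np.dot(coefficients, b_array)
print('Predicted y2 (vectorized): {:.2f}'.format(b_estimated_vectorized))
```

This is the same computation that `lr2.predict` performs for a single row of scaled inputs.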

## 5. Predict the unseen data.


In [None]:
X.columns

In [None]:
X.describe()

In [None]:
# Create hypothetical sample houses (hand-picked values)
big_house = [0.60, 800, 400, 220, 7, 5, 0.4, 5]
small_house = [0.99, 500, 250, 1, 4, 0, 0, 0]
strange_house = [0.60, 600, 300, 150, 10, 5, 0.4, 5]
sample_dict = {}
sample_dict['Big'] = big_house
sample_dict['Small'] = small_house
sample_dict['Strange'] = strange_house
sample_dict

In [None]:
# Create sample dataframe
sample_data = pd.DataFrame.from_dict(sample_dict, orient = 'index', columns = X.columns)
sample_data

In [None]:
# Standardize sample data
sample_data_scaled = scaler.transform(sample_data)
sample_data_scaled

In [None]:
# Predict heating and cooling values
sample_y1_heating_load_hat = lr1.predict(sample_data_scaled)
sample_y2_cooling_load_hat = lr2.predict(sample_data_scaled)

In [None]:
print('Heating performance - Big house = {:.2f} | Small house = {:.2f} | Strange house = {:.2f}'.format(
                                                        sample_y1_heating_load_hat[0],
                                                        sample_y1_heating_load_hat[1],
                                                        sample_y1_heating_load_hat[2],
))

print('Cooling performance - Big house = {:.2f} | Small house = {:.2f} | Strange house = {:.2f}'.format(
                                                        sample_y2_cooling_load_hat[0],
                                                        sample_y2_cooling_load_hat[1],
                                                        sample_y2_cooling_load_hat[2],
))

It looks like the small house performs the best, but would you want to live in such a house?

---

# 2. Classification Problem

We will predict students' knowledge level about electrical machines. The knowledge level is classified into four categories: Very Low, Low, Middle, and High. We have five predictors with which to predict the likely knowledge category.

A classification task is to identify which of a set of categories an observation belongs to. An example of a classification task is depicted below.

<img src="https://upload.wikimedia.org/wikipedia/commons/b/b5/Svm_separating_hyperplanes_%28SVG%29.svg" width=600px>

source: [wikipedia](https://en.wikipedia.org/wiki/Support-vector_machine)



We'll use the dataset from [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling).

In [None]:
df_knowledge_train = pd.read_excel('data/User_Knowledge_Modeling_Data_Set.xls', sheet_name = 'Training_Data')
df_knowledge_test  = pd.read_excel('data/User_Knowledge_Modeling_Data_Set.xls', sheet_name = 'Test_Data')

In [None]:
df_knowledge_train.head()

In [None]:
df_knowledge_test.head()

According to this [documentation](https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling#), the variables are defined as follows.

| feature name | description | input/output |
|---|---|---|
|STG| The degree of study time for goal object materials| input value|
|SCG| The degree of repetition number of user for goal object materials| input value|
|STR| The degree of study time of user for related objects with goal object| input value|
|LPR| The exam performance of user for related objects with goal object| input value|
|PEG| The exam performance of user for goal objects| input value|
|UNS| The knowledge level of user| target value|

In [None]:
# Print column name
print(df_knowledge_train.columns)
print(df_knowledge_test.columns)

In [None]:
# Select columns
df_knowledge_train = df_knowledge_train[['STG', 'SCG', 'STR', 'LPR', 'PEG', ' UNS']]
df_knowledge_test = df_knowledge_test[['STG', 'SCG', 'STR', 'LPR', 'PEG', ' UNS']]

# Since there is a leading space in front of UNS, remove the space.
df_knowledge_train = df_knowledge_train.rename(columns = {' UNS':'UNS'})
df_knowledge_test = df_knowledge_test.rename(columns = {' UNS':'UNS'})

In [None]:
# Check data types
df_knowledge_train.dtypes

In [None]:
# Check null values
df_knowledge_train.isnull().sum()

In [None]:
# Check statistics
df_knowledge_train.describe()

In [None]:
# Inspect the unique string labels before converting them to numbers
print(df_knowledge_train['UNS'].unique())
print(df_knowledge_test['UNS'].unique())

In [None]:
# Create a dictionary to replace string values of the column `UNS`
replace_values_dict = {
    'very_low': 1,
    'Very Low':1,
    'Low':2,
    'Middle':3,
    'High':4
}

In [None]:
# Replace the string labels using the dictionary above.
df_knowledge_train['UNS'] = df_knowledge_train['UNS'].replace(replace_values_dict)
df_knowledge_test['UNS'] = df_knowledge_test['UNS'].replace(replace_values_dict)

In [None]:
# Check if converted
print(df_knowledge_train['UNS'].unique())
print(df_knowledge_test['UNS'].unique())

Since the data is already scaled between 0 and 1, we don't apply normalization.

## Machine Learning

In [None]:
# preparing x and y data
y_train = df_knowledge_train['UNS']
X_train = df_knowledge_train.drop(columns = ['UNS'])

y_test = df_knowledge_test['UNS']
X_test = df_knowledge_test.drop(columns = ['UNS'])

We will test three classifiers: K Nearest Neighbor, Support Vector Machine, and Random Forest.
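
To give an intuition for the first of these, here is a minimal sketch of the K Nearest Neighbor idea: a new point receives the most common label among its *k* closest training points. The function `knn_predict` and the toy data are hypothetical and for illustration only; scikit-learn's `KNeighborsClassifier` is considerably more sophisticated.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train_toy, y_train_toy, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train_toy - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                       # indices of the k closest points
    votes = Counter(y_train_toy[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data for illustration only: two well-separated groups labeled 1 and 4
X_train_toy = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.15],
                        [0.8, 0.9], [0.9, 0.8], [0.85, 0.85]])
y_train_toy = np.array([1, 1, 1, 4, 4, 4])

print(knn_predict(X_train_toy, y_train_toy, np.array([0.2, 0.2])))
print(knn_predict(X_train_toy, y_train_toy, np.array([0.9, 0.9])))
```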

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [None]:
# conduct the classification task with three classifiers
results_dict = {}
for classifier, name in zip([KNeighborsClassifier, SVC, RandomForestClassifier], ['KNN', 'SVM', 'RFC']):
    classifier_init = classifier()
    classifier_init.fit(X_train, y_train)
    predictions = classifier_init.predict(X_test)
    results_dict[name] = predictions

In [None]:
# Look into the resulting dictionary
results_dict

## Evaluate the results
It is important to review several metrics when evaluating a classifier. Key metrics are precision, recall, and F1 score, where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives.
- **Precision** is the fraction of relevant instances among the retrieved instances:
$$ \text{Precision} = \frac{TP}{TP + FP} $$
- **Recall** is the fraction of relevant instances that were retrieved:
$$ \text{Recall} = \frac{TP}{TP + FN} $$
- **F1 score** is the harmonic mean of precision and recall:
$$ F_1 = \frac{2TP}{2TP + FP + FN} $$

<img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg">

source: [wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall)
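
The three formulas can be checked with a few lines of plain Python. The counts below are made up for illustration only, and `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from counts of TP, FP, and FN."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# Made-up counts for illustration only
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print('precision = {:.2f} | recall = {:.2f} | f1 = {:.2f}'.format(precision, recall, f1))
```

For a multi-class problem such as this one, these metrics are computed per class by treating that class as "positive" and all other classes as "negative".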

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# Create a function to report the results
def evaluate_result(classifier):
    pred = results_dict.get(classifier)
    print('-----{}-----'.format(classifier))
    
    # Overall accuracy
    print('- Overall accuracy score: {:.2f}'.format(accuracy_score(y_test, pred)))
    
    # Confusion matrix
    print('\n- Confusion Matrix -')
    print(confusion_matrix(y_test, pred))
    
    # Classification report
    print('\n- Classification Report- ')
    print(classification_report(y_test, pred, target_names = ['Very Low','Low','Middle','High']))

In [None]:
evaluate_result('KNN')

In [None]:
evaluate_result('SVM')

In [None]:
evaluate_result('RFC')

In [None]:
# This cell creates a requirements.txt file.
# The input of this cell is hidden.

# Hiding cell from https://gist.github.com/Zsailer/5d1f4e357c78409dd9a5a4e5c61be552
from IPython.display import HTML
from IPython.display import display

# Taken from https://stackoverflow.com/questions/31517194/how-to-hide-one-specific-cell-input-or-output-in-ipython-notebook
tag = HTML('''<script>
code_show=true; 
function code_toggle() {
    if (code_show){
        $('div.cell.code_cell.rendered.selected div.input').hide();
    } else {
        $('div.cell.code_cell.rendered.selected div.input').show();
    }
    code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
Creating requirements.txt file. To show/hide this cell's raw code input, click <a href="javascript:code_toggle()">here</a>.''')
display(tag)

############### Write code below ##################
# Config file to freeze packages in a notebook
# from https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook
import pkg_resources
import types

def get_requirements():
    def get_imports():
        for name, val in globals().items():
            if isinstance(val, types.ModuleType):
                # Split ensures you get root package, 
                # not just imported function
                name = val.__name__.split(".")[0]

            elif isinstance(val, type):
                name = val.__module__.split(".")[0]

            # Some packages are weird and have different
            # imported names vs. system/pip names. Unfortunately,
            # there is no systematic way to get pip names from
            # a package's imported name. You'll have to add
            # exceptions to this list manually!
            poorly_named_packages = {
                "PIL": "Pillow",
                "sklearn": "scikit-learn"
            }
            if name in poorly_named_packages.keys():
                name = poorly_named_packages[name]

            yield name
    imports = list(set(get_imports()))

    # The only way I found to get the version of the root package
    # from only the name of the package is to cross-check the names 
    # of installed packages vs. imported packages
    requirements = []
    for m in pkg_resources.working_set:
        if m.project_name in imports and m.project_name!="pip":
            requirements.append((m.project_name, m.version))

    
    with open("requirements.txt", "w") as f:
        print('Create "requirements.txt"')
        for r in requirements:
            string = r[0] + '==' + r[1] + '\n'
            f.write(string)
            print("\t{}=={}".format(*r))
    print('"requirements.txt" was created.')
        
get_requirements()