<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti103/blob/master/session-5/plot_learning_curve.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Plotting Learning Curves

Welcome to the hands-on lab. This is part of a series of exercises to help you to acquire skills in different techniques to fine-tune your machine learning model. 

In this lab, you will learn how to:
- diagnose overfitting/underfitting problems in machine learning  
- plot learning curves for both classification and regression types of problems

*Acknowledgement: This exercise is adapted from https://www.dataquest.io/blog/learning-curves-machine-learning/*

## 1. Import Required Packages ##

Let's first import all the packages that you will need during this exercise.
- [numpy](https://www.numpy.org) is the fundamental package for scientific computing with Python.
- [sklearn](http://scikit-learn.org/stable/) provides simple and efficient tools for data mining and data analysis. 
- [matplotlib](http://matplotlib.org) is a library for plotting graphs in Python.
- [pandas](https://pandas.pydata.org) is a library for data analysis.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

%matplotlib inline


## 2. Learning Curve for Regression Problem ##


### Prepare the Data

First, let's get the dataset you will work on. The description of the data can be found [here](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant)

In [None]:
# if you are using jupyter notebook and wish to load the data locally
# electricity = pd.read_excel('data/combined_pp.xlsx')
# if you wish to load data from an url
electricity = pd.read_excel('https://github.com/nyp-sit/sdaai-iti103/raw/master/session-5/data/combined_pp.xlsx')
electricity.head()

Let's check the data types and also check whether there are any missing values.

In [None]:
electricity.info()

Let's separate the features from the target and instantiate a `LinearRegression` model as the estimator to be used later.


In [None]:
# We separate the features and target from the data set
features = ['AT','V','AP','RH']
target = 'PE'

X = electricity[features]
y = electricity[target]

# Instantiate a LinearRegression estimator
estimator = LinearRegression()

### Plot the learning curve 

`learning_curve()` in scikit-learn can be used to generate the data needed to plot a learning curve, i.e. the training and validation scores. The function returns a tuple containing three elements: ``train_sizes``, ``train_scores`` and ``validation_scores``. The function accepts the following parameters:
- estimator: the learning algorithm we use to estimate the true model
- X: the features
- y: the target labels
- train_sizes: the numbers of training examples that will be used to generate the learning curve. If the dtype is float, each value is regarded as a fraction of the maximum size of the training set (which is determined by the selected validation method), i.e. it has to be within (0, 1]. (Note: the notation (0, 1] means exclusive of 0 but inclusive of 1.) Otherwise the values are interpreted as absolute sizes of the training sets.
- cv: determines the cross-validation splitting strategy.
- scoring: controls the metric used to evaluate the estimator. Possible pre-defined metrics can be found [here](https://scikit-learn.org/stable/modules/model_evaluation.html)
- shuffle: whether to shuffle the training data before taking prefixes of it based on ``train_sizes``.

You can refer to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html) for more detail of the function. 
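The way fractional ``train_sizes`` are interpreted can be checked with a small sketch. It uses synthetic data from `make_regression` rather than the lab's dataset, and the `_demo` variable names are made up for illustration. With 100 samples and `cv=5`, each split can train on at most 80 samples, so the fractions are taken relative to 80.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic data (an assumption for illustration, not the lab's dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=4,
                                 noise=1.0, random_state=0)

# Float train_sizes are fractions of the largest training set that the
# cross-validation scheme allows (80 samples here), not of all 100 samples.
sizes_abs, demo_train_scores, demo_val_scores = learning_curve(
    LinearRegression(), X_demo, y_demo,
    train_sizes=[0.1, 0.5, 1.0], cv=5,
    scoring='neg_mean_squared_error')

print(sizes_abs)                 # absolute training set sizes actually used
print(demo_train_scores.shape)   # (number of sizes, number of splits)
```

The first returned array holds the absolute sizes that were used, which is why the lab reassigns ``train_sizes`` from the return value.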

There are a total of 9568 rows of data. With 5-fold cross-validation, only $4/5$ (80%) of the data is available for training in each split, which is 7654 samples. We will plot the learning curve for training sizes of 1, 100, 500, 2000, 5000 and 7654.
For the scoring metric, we will choose `'neg_mean_squared_error'`. There is no `'mean_squared_error'` scorer because scikit-learn scorers follow the convention that a higher score means a better model, so error metrics are exposed in negated form.
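The sign convention can be verified with a short sketch. It again assumes synthetic data and illustrative variable names, and compares the built-in scorer with `mean_squared_error` computed by hand on the same predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Synthetic data (an assumption for illustration, not the lab's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=5.0, random_state=0)
model = LinearRegression().fit(X_demo, y_demo)

# Plain MSE: lower is better
mse = mean_squared_error(y_demo, model.predict(X_demo))

# The scorer returns the same quantity with the sign flipped,
# so that higher is better
neg_mse = get_scorer('neg_mean_squared_error')(model, X_demo, y_demo)

print('MSE:', mse)
print('neg_mean_squared_error:', neg_mse)
```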

<details><summary>Click here for answer</summary>

```python
    
train_sizes = [1,100,500,2000,5000,7654]
    
train_sizes, train_scores, validation_scores = learning_curve(
                    estimator, X, y, train_sizes = train_sizes, cv=5,
                    scoring = 'neg_mean_squared_error',
                    shuffle=False, random_state=0)
    
```
    
</details>

In [None]:
# declare the list of different training sizes
train_sizes = [1, 100, 500, 2000, 5000, 7654]

# call the learning_curve() to return train_scores/validation scores for different train sizes
train_sizes, train_scores, validation_scores = learning_curve(
                    estimator, X, y, train_sizes = train_sizes, cv=5,
                    scoring = 'neg_mean_squared_error',
                    shuffle=False, random_state=0)


Let us print out the values of train_scores and validation_scores (neg_mean_squared_error). Each row corresponds to a training set size and each column corresponds to a split.

In [None]:
# print the train and validation scores
print('Train scores:\n\n', train_scores)
print('\n','-'*70)
print('\nValidation scores:\n\n', validation_scores)

You might have noticed that some error scores on the training sets are the same. For the row corresponding to training set size of 1, this is expected, but what about other rows? With the exception of the last row, we have a lot of identical values. For instance, take the second row where we have identical values from the second split onward. Why is that so? 

This is caused by not randomizing the training data for each split. Let’s walk through a single example with the aid of the diagram below. When the training size is 100 the first 100 samples in the training set are selected.

For the first split, these 100 samples will be taken from the second chunk. From the second split onward, these 100 samples will be taken from the first chunk. Because we don’t randomize the training set, the 100 samples used for training are the same for the second split onward. This explains the identical values from the second split onward for the 100 training instances case. The same reasoning applies to the case of training size of 500, and so on. 

<div>
<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/iti103/learning_curve_splits.png" alt="k-fold" width="600" align='center'/>
</div>
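The effect described above can be reproduced with a minimal sketch using `KFold` directly. The sample count of 1000 and the prefix length of 100 are hypothetical, and the sketch assumes that, without shuffling, the training subset is simply the first 100 training indices of each split.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, train_size = 1000, 100   # hypothetical sizes for illustration
indices = np.arange(n_samples)

prefixes = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=False).split(indices):
    # Without shuffling, the training subset is the first `train_size`
    # indices of this split's training portion
    prefix = train_idx[:train_size]
    prefixes.append(prefix)
    print(f'train prefix: {prefix[0]}..{prefix[-1]}, '
          f'validation: {val_idx[0]}..{val_idx[-1]}')
```

Only the first split, whose validation chunk sits at the start of the data, draws its prefix from the second chunk. Every later split reuses the same first 100 rows.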



You can fix this problem by setting ``shuffle`` to **``True``** in the call to ``learning_curve()``. Note that the rows of train_scores no longer contain identical values.

In [None]:
train_sizes, train_scores, validation_scores = learning_curve(
                    estimator, X, y, train_sizes = train_sizes, cv=5,
                    scoring = 'neg_mean_squared_error',
                    shuffle=True, random_state=0)

# print the train and validation scores
print('Train scores:\n\n', train_scores)
print('\n','-'*70)
print('\nValidation scores:\n\n', validation_scores)


To plot the learning curves, we need only a single error score per training set size, not 5. So we will take the mean values of the 5 error scores (for the 5 splits), which corresponds to axis 1.
The scores returned are negative mean squared errors, so we need to negate them to get the MSE.

In [None]:
train_errors_mean = -train_scores.mean(axis=1)
validation_errors_mean = -validation_scores.mean(axis=1)  

# print out the errors # 
print('Mean training errors:\n', pd.Series(train_errors_mean, index=train_sizes))
print('\n', '-'*50)
print('Mean validation errors:\n', pd.Series(validation_errors_mean, index=train_sizes))

Let's define a function ``plot_curve()`` that will plot the train_errors, validation_errors against train_size. 

In [None]:
def plot_curve(title, ylabel, train_sizes, train_scores, validation_scores, ylim=None):
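    # The 'seaborn' style is named 'seaborn-v0_8' in newer matplotlib releases
    plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')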
    plt.figure(figsize=(10,8))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)    
    plt.xlabel("Training size")
    plt.ylabel(ylabel)
    plt.plot(train_sizes, train_scores, 'o-', color="r",
             label="Training")
    plt.plot(train_sizes, validation_scores, 'o-', color="g",
             label="Validation")
    plt.legend(loc="best")


Plot the learning curve using the above function. We also need to limit the range of y-axis to (0,40) as the MSE for training size of 1 is very large compared to the rest, and we want to see the details of MSEs for other training sizes.

In [None]:
plot_curve('Linear Regression', 'MSE', train_sizes, 
             train_errors_mean, validation_errors_mean, ylim=(0,40))


The validation MSE seems to stagnate at a value of approximately 20. Is this good enough? 

We’d benefit from some domain knowledge.
Technically, that value of 20 has MW² (megawatts squared) as units (the units get squared as well when we compute the MSE). The values in our target column are in MW (according to the documentation). Taking the square root of 20 MW² gives approximately 4.5 MW. Each target value represents net hourly electrical energy output, so for each hour our model is typically off by about 4.5 MW. According to this [Quora](https://www.quora.com/How-can-I-get-an-intuitive-understanding-of-what-a-Kw-Mw-Gw-of-electricity-equates-to-in-real-life-terms) answer, 4.5 MW is equivalent to the heat power produced by 4500 handheld hair dryers. This would add up if we tried to predict the total energy output for one day or a longer period. We can conclude that an MSE of 20 MW² is quite large.
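The unit conversion is a single square root. As a quick check, taking the value of 20 read off the plot as an approximation:

```python
import math

mse = 20.0             # approximate validation MSE from the plot, in MW squared
rmse = math.sqrt(mse)  # root mean squared error, back in MW
print(f'RMSE is about {rmse:.2f} MW')
```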

***Exercise***:

Examine the learning curve you plotted and answer the following questions (don't look at the answers first).

1. Is this a high-bias problem or a low-bias problem?

<details><summary>Click here for answer</summary><p>High Bias</p></details>

2. Is it high variance or low variance?

<details><summary>Click here for answer</summary><p>Low Variance</p></details>

3. Will adding more training data help to improve the performance of the model?

<details><summary>Click here for answer</summary><p>No</p></details>

We can try to reduce the bias with the following methods:
- use a more complex learning algorithm
- add more features (not samples) or try generating polynomial features from existing features
- reduce regularization
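As an illustration of the polynomial-features option, here is a minimal sketch on synthetic data where the target depends on the square of the feature. The data, the degree of 2 and the variable names are all assumptions chosen for the example; this is not the lab's dataset and the lab does not use this approach.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic relationship with noise (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.5, size=300)

# A plain linear model underfits; adding degree-2 features lets the
# same linear algorithm capture the curvature
linear = LinearRegression()
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

mse_linear = -cross_val_score(
    linear, X_demo, y_demo, cv=5, scoring='neg_mean_squared_error').mean()
mse_quadratic = -cross_val_score(
    quadratic, X_demo, y_demo, cv=5, scoring='neg_mean_squared_error').mean()

print(f'Linear features:    MSE = {mse_linear:.2f}')
print(f'Quadratic features: MSE = {mse_quadratic:.2f}')
```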

Let's try using RandomForestRegressor instead. You don't need to know the details of RandomForestRegressor; we are only using it to see how it affects the bias and variance.

In [None]:
from sklearn.ensemble import RandomForestRegressor

estimator=RandomForestRegressor(n_estimators=30)
train_sizes, train_scores, validation_scores = learning_curve(
                    estimator, X, y, train_sizes = train_sizes, cv=5,
                    scoring = 'neg_mean_squared_error',
                    shuffle=True, random_state=0)

train_errors_mean = -train_scores.mean(axis = 1)
validation_errors_mean = -validation_scores.mean(axis = 1)

plot_curve('RandomForest Regressor', 'MSE', train_sizes, train_errors_mean, validation_errors_mean, ylim=(0,40))

***Exercise:***

1. Does the new learning curve show a low or high bias?

<details><summary>Click here for answer</summary><p>Low Bias</p></details>

2. Does the new learning curve show a low or high variance?

<details><summary>Click here for answer</summary><p>High Variance</p></details>

3. Will adding more training data help to improve the performance of the model?

<details><summary>Click here for answer</summary><p>Yes, this may help</p></details>

## 3. Learning Curve for Classification Problem ##

First, let's get the dataset you will work on. The following code will load a "[digits](https://scikit-learn.org/stable/datasets/index.html#digits-dataset)" dataset into variables `X` and `y`. 

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

You have:
- a numpy-array X that contains your features (the pixel values)
- a numpy-array y that contains your labels (digits 0 to 9).

Let's first get a better sense of what our data is like.


In [None]:
shape_X = X.shape
shape_y = y.shape
print('The shape of X is: ' + str(shape_X))
print('The shape of y is: ' + str(shape_y))

Let us visualize one of the digits in the data set.


In [None]:
some_digit = X[3]
some_label = y[3]
some_digit_image = some_digit.reshape(8, 8)

plt.figure(figsize=(1,1))
plt.imshow(some_digit_image, cmap = plt.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
print('Label = {}'.format(some_label))


**learning_curve()** expects a parameter called **train_sizes**, which holds the numbers of training examples that will be used to generate the learning curve. If the dtype is float, each value is regarded as a fraction of the maximum size of the training set (which is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise the values are interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.

***Exercise:***

Create 5 evenly spaced training sizes, expressed as fractions of the training samples, starting from 0.1 (i.e. 10% of the training samples) and ending at 1.0.

**Hint:**
Use [numpy.linspace()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html)



<details><summary>Click here for answer</summary>

```python
train_sizes = np.linspace(0.1, 1.0, 5)
```
    
</details>

In [None]:
### START CODE HERE ### 

train_sizes = ??

### END CODE HERE ###

print(train_sizes)

***Exercise:***

Create a LogisticRegression estimator with solver='liblinear' and multi_class='auto', then call the ``learning_curve()`` function to get the training and validation scores. Use 5-fold cross-validation. Choose a scoring metric appropriate for a classification problem (e.g. 'accuracy'). Specify random_state=0 for repeatability.


<details><summary>Click here for answer</summary>

```python
estimator = LogisticRegression(solver='liblinear', multi_class='auto')
train_sizes, train_scores, validation_scores = learning_curve(
        estimator, X, y, cv=5, scoring='accuracy', train_sizes=train_sizes, shuffle=True, random_state=0)
```
    
</details>


In [None]:
### START CODE HERE ### 

estimator = ??
train_sizes, train_scores, validation_scores = ??

### END CODE HERE ### 


***Exercise:***

What do you think are the shapes of train_scores and validation_scores?

<details><summary>Click here for answer</summary>
<p>
Since we specify 5 training sizes and, for each size, 5-fold cross-validation, train_scores and validation_scores should each have shape 5 x 5.
</p>
</details>

In [None]:
## Uncomment the following to check your answers

# print(train_scores.shape)
# print(validation_scores.shape)

***Exercise:*** 

To plot the learning curves, we need only a single score per training set size, not 5. To get it, take the mean of the 5 scores for each training set size. As the scores are accuracy scores, you will need to convert them to error rates.

***Hint:***  Fraction of error = 1.0 - (fraction of correct)


<details><summary>Click here for answer</summary>

```python

train_errors_mean = 1. - np.mean(train_scores, axis=1)
validation_errors_mean = 1. - np.mean(validation_scores, axis=1)
    
```
    
</details>

In [None]:
### START CODE HERE ### (~ 2 lines of code)


### END CODE HERE ###

### Plot the learning curve ###
Ok, now we can start plotting the curve. You should expect to see a learning curve similar to the following:

<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/iti103/classification_lc.png" alt="classification learning curve" width="300"/>

***Exercise:***

Plot the learning curve for logistic regression. 

<details><summary>Click here for answer</summary>

```python
plot_curve('Logistic Regression', 'Error', train_sizes, train_errors_mean, validation_errors_mean)
```
    
</details>

In [None]:
### START CODE HERE ### 


### END CODE HERE ### 


***Exercise:***

Is this a high-bias or a high-variance problem?
<details><summary>Click here for answer</summary><p>High variance</p></details>


Let us try a more complex non-linear algorithm such as Support Vector Machine (SVM). 
You don't need to know the details of SVM; we are only using it to see how it affects the bias and variance.


In [None]:
from sklearn.svm import SVC

estimator=SVC()
train_sizes, train_scores, validation_scores = learning_curve(estimator, X, y, 
                                                              cv=5, scoring='accuracy', 
                                                              train_sizes=train_sizes, 
                                                              shuffle=True, random_state=0)
train_errors_mean = 1. - np.mean(train_scores, axis=1)
validation_errors_mean = 1. - np.mean(validation_scores, axis=1)
plot_curve('SVM', 'Error', train_sizes, train_errors_mean, validation_errors_mean)


***Exercise:***

How does the use of SVM affect bias and variance of the model?

<details><summary>Click here for answer</summary><p>In this case, using a more complex, non-linear algorithm such as SVM reduces the variance of the model, while the bias is kept low too.</p></details>
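One rough way to put numbers on the curves in this lab is to look at two quantities at the largest training size: the training error (a large value suggests high bias) and the gap between validation and training error (a large gap suggests high variance). The helper below is a hypothetical sketch, not part of scikit-learn, and the error values passed to it are made up for illustration. What counts as "large" depends on the problem and is left to the reader.

```python
import numpy as np

def summarize_curve(train_errors, validation_errors):
    """Return (final training error, final validation-minus-training gap).

    Hypothetical helper for reading a learning curve; both inputs are
    mean errors ordered by increasing training size.
    """
    train_errors = np.asarray(train_errors, dtype=float)
    validation_errors = np.asarray(validation_errors, dtype=float)
    final_train = train_errors[-1]
    gap = validation_errors[-1] - final_train
    return final_train, gap

# Made-up error rates, purely for illustration
final_train, gap = summarize_curve([0.0, 0.01, 0.02], [0.30, 0.15, 0.10])
print(f'final training error = {final_train:.2f}, gap = {gap:.2f}')
```

In the notebook, the same helper could be called with `train_errors_mean` and `validation_errors_mean` after either model has been evaluated.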