<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti103/blob/master/session-5/plot_learning_curve.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Plotting Learning Curves

Welcome to the learning curve programming exercise. This is part of a series of exercises to help you acquire skills in different techniques for fine-tuning your machine learning model.

**You will learn how to:**
- Diagnose overfitting/underfitting problems in machine learning  
- Plot learning curves for both classification and regression types of problems

*Acknowledgement: This exercise is adapted from https://www.dataquest.io/blog/learning-curves-machine-learning/*

## 1. Import Required Packages ##

Let's first import all the packages that you will need during this exercise.
- [numpy](https://www.numpy.org) is the fundamental package for scientific computing with Python.
- [sklearn](http://scikit-learn.org/stable/) provides simple and efficient tools for data mining and data analysis. 
- [matplotlib](http://matplotlib.org) is a library for plotting graphs in Python.
- [pandas](https://pandas.pydata.org) is a library for data analysis.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

%matplotlib inline


## 2. Learning Curve for Regression Problem ##
First, let's get the dataset you will work on. The description of the data can be found [here](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant)

In [None]:
# if you are using jupyter notebook and wish to load the data locally
# electricity = pd.read_excel('data/combined_pp.xlsx')
# if you wish to load data from an url
electricity = pd.read_excel('https://github.com/nyp-sit/sdaai-iti103/raw/master/session-5/data/combined_pp.xlsx')
electricity.info()

***Exercise***

Separate the features from the target. Instantiate a `LinearRegression` model as the estimator to be used later.

<details><summary>Click here for answer</summary>

```python
    
X = electricity[features]
y = electricity[target]

estimator = LinearRegression()
    
```
    
</details>


In [None]:
# We separate the features and target from the data set
features = ['AT','V','AP','RH']
target = 'PE'

### START CODE HERE ### 
X = None
y = None

# Instantiate a LinearRegression estimator
estimator = None 

### END CODE HERE 

`learning_curve()` in scikit-learn can be used to generate the data needed to plot a learning curve, i.e. the training and validation scores. The function returns a tuple containing three elements: ``train_sizes``, ``train_scores`` and ``validation_scores``. The function accepts the following parameters:
- estimator: the learning algorithm we use to estimate the true model
- X: the features
- y: the target labels
- train_sizes: the numbers of training examples that will be used to generate the learning curve. If the dtype is float, each value is regarded as a fraction of the maximum size of the training set (which is determined by the selected validation method), i.e. it has to be within (0, 1]. (Note: the notation (0, 1] means exclusive of 0 but inclusive of 1.) Otherwise the values are interpreted as absolute sizes of the training sets.
- cv: determines the cross-validation splitting strategy.
- scoring: controls the metric used to evaluate the estimator. Possible pre-defined metrics can be found [here](https://scikit-learn.org/stable/modules/model_evaluation.html)
- shuffle: whether to shuffle the training data before taking prefixes of it based on ``train_sizes``.

You can refer to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html) for more detail of the function. 
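As a minimal sketch of the call and its return values, the snippet below runs `learning_curve()` on a small synthetic regression dataset. The data from `make_regression`, the chosen sizes and the `*_demo` variable names are illustrative assumptions and are not part of the exercise.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic data, used only for illustration (not the power plant dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

sizes, demo_train_scores, demo_val_scores = learning_curve(
    LinearRegression(), X_demo, y_demo,
    train_sizes=[10, 50, 100],   # integers, so they are read as absolute sizes
    cv=5,                        # 5-fold CV: each split trains on at most 160 samples
    scoring='neg_mean_squared_error')

print(sizes)                     # the training sizes actually used
print(demo_train_scores.shape)   # one row per training size, one column per CV split
print(demo_val_scores.shape)
```

Each score array has one row per training size and one column per cross-validation split, and the scores are negative because the metric is the negated MSE.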

***Exercise:***

Complete the code below to obtain the training and validation scores using ``learning_curve()``. Use 10-fold cross-validation. You need to choose an appropriate scoring metric for a linear regression problem (in our case, `'neg_mean_squared_error'` is a good choice). Set shuffle to ``False`` and specify random_state=42 so that the results are repeatable across experiments.

There are 9568 rows of data. We use at most 80% of the data for training, which is around 7654 samples. We will plot the learning curve for training sizes of 1, 100, 500, 2000, 5000 and 7654. 

<details><summary>Click here for answer</summary>

```python
    
train_sizes = [1,100,500,2000,5000,7654]
    
train_sizes, train_errors, validation_errors = learning_curve(
                    estimator, X, y, train_sizes = train_sizes, cv=10,
                    scoring = 'neg_mean_squared_error',
                    shuffle=False, random_state=42)
    
```
    
</details>

In [None]:
### START CODE HERE ### 

# declare the list of different training sizes

train_sizes = None

# call the learning_curve() to return train_scores/validation error scores for different train sizes
train_sizes, train_errors, validation_errors = None

### END CODE HERE ###


***Exercise:***

What do you think are the shapes of train_errors and validation_errors?

<details><summary>Click here for answer</summary>
<p>
Since we specify 6 training sizes and 10-fold cross-validation, train_errors and validation_errors each have shape 6 x 10.
</p>
</details>

Let us print out the values of train_errors and validation_errors (neg_mean_squared_error). Each row corresponds to a training size and each column corresponds to a split. 

In [None]:
### Uncomment the following lines ### 

# print('Train scores:\n\n', train_errors)
# print('\n','-'*70)
# print('\nValidation scores:\n\n', validation_errors)

You might have noticed that some error scores on the training sets are the same. For the row corresponding to training set size of 1, this is expected, but what about other rows? With the exception of the last row, we have a lot of identical values. For instance, take the second row where we have identical values from the second split onward. Why is that so? 

This is caused by not randomizing the training data for each split. Let’s walk through a single example with the aid of the diagram below. When the training size is 100 the first 100 samples in the training set are selected.

For the first split, these 100 samples will be taken from the second chunk. From the second split onward, these 100 samples will be taken from the first chunk. Because we don’t randomize the training set, the 100 samples used for training are the same for the second split onward. This explains the identical values from the second split onward for the 100 training instances case. The same reasoning applies to the case of training size of 500, and so on. 

<div>
<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/iti103/splits.png" alt="k-fold" width="600" align='left'/>
</div>
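The effect described above can be reproduced in a small sketch. It assumes synthetic data from `make_regression` and a hypothetical helper `train_scores_for()`; neither comes from the exercise. Only one training size (100) is used so the pattern is easy to see.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic data for illustration only: 1000 rows, so each of the 10 folds holds 100 rows
X_demo, y_demo = make_regression(n_samples=1000, n_features=4, noise=10.0, random_state=0)

def train_scores_for(shuffle):
    # Hypothetical helper: training scores for a training size of 100 over 10 folds
    _, scores, _ = learning_curve(
        LinearRegression(), X_demo, y_demo, train_sizes=[100], cv=10,
        scoring='neg_mean_squared_error', shuffle=shuffle, random_state=42)
    return scores[0]

unshuffled = train_scores_for(shuffle=False)
shuffled = train_scores_for(shuffle=True)

# Without shuffling, every split after the first trains on the same first 100 rows
print(np.allclose(unshuffled[1:], unshuffled[1]))
# With shuffling, each split draws a different subset, so the scores differ
print(np.unique(np.round(shuffled, 8)).size)
```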



***Exercise:***

You can fix this problem by setting ``shuffle`` to **``True``** in the call to ``learning_curve()``. Specify random_state=42 so that the results are repeatable across runs.


<details><summary>Click here for answer</summary>

```python
    
train_sizes, train_errors, validation_errors = learning_curve(
                    estimator, X, y, train_sizes = train_sizes, cv=10,
                    scoring = 'neg_mean_squared_error',
                    shuffle=True, random_state=42)
    
```
    
</details>

In [None]:
### START CODE HERE ###

train_sizes, train_errors, validation_errors = None

### END CODE HERE ###

To plot the learning curves, we need only a single error score per training set size, not 10. So we will take the mean of the 10 error scores (one per split). 
You will notice that the scores (negative mean squared errors) are negative values, so we will need to negate them. 

***Exercise:*** 

Take the mean (over the 10 CV splits) of train_errors and validation_errors, then negate (flip the sign of) the mean values to get mean squared error (MSE) values.

***Hint:*** 

Use the ``numpy.mean()`` function and specify the correct ``axis``.



<details><summary>Click here for answer</summary>

```python

train_errors_mean = -train_errors.mean(axis=1)
validation_errors_mean = -validation_errors.mean(axis=1)
    
    
```
    
</details>
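To see why `axis=1` is the right axis, here is a tiny sketch with a made-up score array (the values are hypothetical and chosen only to make the arithmetic easy to check):

```python
import numpy as np

# Hypothetical scores: 2 training sizes (rows) x 3 CV splits (columns)
demo_scores = np.array([[-1.0, -2.0, -3.0],
                        [-4.0, -5.0, -6.0]])

# axis=1 averages across the columns, giving one value per row (training size)
demo_mse = -demo_scores.mean(axis=1)
print(demo_mse)  # [2. 5.]
```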

In [None]:
### START CODE HERE ###

train_errors_mean = None
validation_errors_mean = None 

### END CODE HERE ###

### Uncomment the following to print out the errors ### 
# print('Mean training errors:\n', pd.Series(train_errors_mean, index=train_sizes))
# print('\n', '-'*50)
# print('Mean validation errors:\n', pd.Series(validation_errors_mean, index=train_sizes))


**Expected Output**

Mean training errors:
<div>
<p>
    <table style="width:20%" align="left">
      <tr>
        <td>1</td>
        <td>0.000000</td> 
      </tr>
      <tr>
        <td>100</td>
        <td>18.022630</td> 
      </tr>
      <tr>
        <td>500</td>
        <td>19.610796</td> 
      </tr>
      <tr>
        <td>2000</td>
        <td>20.668527</td> 
      </tr>
      <tr>
        <td>5000</td>
        <td>20.854990</td> 
      </tr>
      <tr>
        <td>7654</td>
        <td>20.817307</td> 
      </tr>
    </table>
</p>
</div>


<div>
<p>
Mean validation errors:
<table style="width:20%" align="left">
  <tr>
    <td>1</td>
    <td>443.007852</td> 
  </tr>
  <tr>
    <td>100</td>
    <td>22.096985</td> 
  </tr>
  <tr>
    <td>500</td>
    <td>20.969215</td> 
  </tr>
  <tr>
    <td>2000</td>
    <td>20.806284</td> 
  </tr>
  <tr>
    <td>5000</td>
    <td>20.794057</td> 
  </tr>
  <tr>
    <td>7654</td>
    <td>20.801154</td> 
  </tr>
</table>
</p>
</div>

Let's define a function ``plot_curve()`` that will plot the train_errors, validation_errors against train_size. 

In [None]:
def plot_curve(title, label, train_sizes, train_scores, validation_scores, ylim=None):
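    # The 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6; use whichever exists
    plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')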
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)    
    plt.xlabel("Training size")
    plt.ylabel(label)
    plt.plot(train_sizes, train_scores, 'o-', color="r",
             label="Training")
    plt.plot(train_sizes, validation_scores, 'o-', color="g",
             label="Validation")
    plt.legend(loc="best")

***Exercise:*** 

Plot the learning curve using the above function. You may need to limit the range of y-axis to (0,40) as the MSE for training size of 1 is very large compared to the rest, and we want to see the details of MSEs for other training sizes.

<details><summary>Click here for answer</summary>

```python
    
plot_curve('Linear Regression', 'MSE', train_sizes, 
             train_errors_mean, validation_errors_mean, ylim=(0,40))
    
    
```
    
</details>




In [None]:
### START CODE HERE ###


### END CODE HERE ###

The validation MSE seems to stagnate at a value of approximately 20. Is this good enough? 

We’d benefit from some domain knowledge.
Technically, that value of 20 has MW² (megawatts squared) as units (the units get squared as well when we compute the MSE). The values in our target column are in MW (according to the documentation). Taking the square root of 20 MW² gives approximately 4.5 MW. Each target value represents net hourly electrical energy output, so for each hour our model is off by roughly 4.5 MW. According to this [Quora](https://www.quora.com/How-can-I-get-an-intuitive-understanding-of-what-a-Kw-Mw-Gw-of-electricity-equates-to-in-real-life-terms) answer, 4.5 MW is equivalent to the heat power produced by 4500 handheld hair dryers. And this would add up if we tried to predict the total energy output for one day or a longer period. We can conclude that an MSE of 20 MW² is quite large. 
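The unit conversion can be checked with a short calculation. The value 20 is the approximate plateau read off the curve, not an exact result:

```python
import math

mse = 20.0             # approximate validation MSE, in MW squared
rmse = math.sqrt(mse)  # root mean squared error, back in the target's units (MW)
print(round(rmse, 2))  # 4.47
```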

***Exercise***:

Examine the learning curve you plotted and answer the following questions (don't look at the answers first).

1. Is this a high-bias problem or a low-bias problem?

<details><summary>Click here for answer</summary><p>High Bias</p></details>

2. Is it high variance or low variance?

<details><summary>Click here for answer</summary><p>Low Variance</p></details>

3. Will adding more training data help to improve the performance of the model?

<details><summary>Click here for answer</summary><p>No</p></details>
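One rough way to back these answers with numbers is to compare the two curves at the largest training size. The sketch below uses the mean errors listed in the expected output earlier; reading the gap as a variance indicator and the plateau level as a bias indicator is a rule of thumb, not a formal test.

```python
# Mean errors at the largest training size (7654), taken from the expected output above
train_mse = 20.817307
validation_mse = 20.801154

# A small gap between the curves suggests low variance
gap = validation_mse - train_mse
print(f"gap = {gap:.3f} MW^2")

# Both curves level off near 20 MW^2, which suggests high bias:
# more data of the same kind is unlikely to lower the plateau
print(f"plateau = {validation_mse:.1f} MW^2")
```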

We can try to reduce the bias with the following methods:
- use a more complex learning algorithm
- add more features (not samples) or try generating polynomial features from existing features
- reduce regularization
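
As an illustration of the second option, the sketch below fits a plain linear model and a degree-2 polynomial pipeline to synthetic data with a quadratic target. The data and the `*_demo` names are assumptions made for this example; on the power plant dataset the gain may be much smaller.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.5, size=200)  # quadratic target plus noise

# A straight line cannot follow the curvature, so it underfits (high bias)
linear = LinearRegression().fit(X_demo, y_demo)
# Adding squared features lets the same linear model capture the curvature
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_demo, y_demo)

linear_mse = mean_squared_error(y_demo, linear.predict(X_demo))
poly_mse = mean_squared_error(y_demo, poly.predict(X_demo))
print(linear_mse, poly_mse)
```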

Let's try using RandomForestRegressor instead. You don't need to know the details of RandomForestRegressor, and we are just using it to see how it impacts the bias/variance. 

***Exercise:*** 

Complete the code below to plot the learning curve. The steps are similar to the above. 

<details><summary>Click here for answer</summary>

```python

train_sizes, train_errors, validation_errors = learning_curve(
                    estimator, X, y, train_sizes = train_sizes, cv=10,
                    scoring = 'neg_mean_squared_error',
                    shuffle=True, random_state=42)
    
train_errors_mean = -train_errors.mean(axis = 1)
validation_errors_mean = -validation_errors.mean(axis = 1)
    
plot_curve('RandomForest Regressor', 'MSE', train_sizes, train_errors_mean, validation_errors_mean, ylim=(0,40))

    
    
```
    
</details>


In [None]:
from sklearn.ensemble import RandomForestRegressor
estimator = RandomForestRegressor(n_estimators=100)

### START CODE HERE ###


### END CODE HERE ###

***Exercise:***

1. Does the new learning curve show a low or high bias?

<details><summary>Click here for answer</summary><p>Low Bias</p></details>

2. Does the new learning curve show a low or high variance?

<details><summary>Click here for answer</summary><p>High Variance</p></details>

3. Will adding more training data help to improve the performance of the model?

<details><summary>Click here for answer</summary><p>Yes, this may help</p></details>

## 3. Learning Curve for Classification Problem ##

First, let's get the dataset you will work on. The following code will load a "[digits](https://scikit-learn.org/stable/datasets/index.html#digits-dataset)" dataset into variables `X` and `y`. 

In [None]:
digits = load_digits()
X, y = digits.data, digits.target

You have:
- a numpy array `X` that contains your features (the pixel values)
- a numpy array `y` that contains your labels (digits 0 to 9).

Let's first get a better sense of what our data is like. 

***Exercise:***

How many training examples do you have? In addition, what is the `shape` of the variables `X` and `y`? 

***Hint***: How do you get the shape of a numpy array? [(help)](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html)

<details><summary>Click here for answer</summary>

```python

shape_X = X.shape
shape_Y = y.shape
m = shape_X[0]
    
```
    
</details>



In [None]:
### START CODE HERE ### 

shape_X = None
shape_Y = None
m = None

### END CODE HERE ###

### Uncomment the code below to print out the values ###
# print ('The shape of X is: ' + str(shape_X))
# print ('The shape of Y is: ' + str(shape_Y))
# print ('I have m = %d training examples!' % (m))

**Expected Output**:

<table style="width:40%" align="left">
  <tr>
    <td><b>shape of X</b></td>
    <td>(1797, 64)</td> 
  </tr>
  <tr>
    <td><b>shape of Y</b></td>
    <td>(1797,)</td> 
  </tr>
    <tr>
    <td><b>m</b></td>
    <td>1797</td> 
  </tr>
</table>

Let us visualize a digit from the data set.

***Exercise:***

The original image is an 8 x 8 grayscale image. However, each sample in ``X`` is a numpy array of 64 values. Add code below to transform ``X[3]`` into an 8 x 8 image for plotting.

**Hint**: How do you reshape a numpy array? [(help)](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#numpy.reshape)

<details><summary>Click here for answer</summary>

```python

some_digit = X[3]
label = y[3]
some_digit_image = some_digit.reshape(8, 8)
    
```
    
</details>



In [None]:
### START CODE HERE ### 

some_digit = None
label = None 
some_digit_image = None 

### END CODE HERE ###

plt.figure(figsize=(1,1))
plt.imshow(some_digit_image, cmap = plt.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
print('Label = {}'.format(label))


***Exercise:***

Create a cross-validation splitter with 50 iterations to get smoother mean validation and training
score curves, each time with 20% of the data randomly selected as a validation set.

**Hint**: Use the [ShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html) in scikit-learn 


<details><summary>Click here for answer</summary>

```python

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=42)
    
```
    
</details>
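
To get a feel for what `ShuffleSplit` produces, the sketch below runs it on 10 toy samples with 3 iterations. These numbers are assumptions chosen to keep the output small and differ from the 50 iterations used in the exercise.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
demo_cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# Each iteration draws a fresh random 80/20 split of the sample indices
split_sizes = [(len(train_idx), len(test_idx))
               for train_idx, test_idx in demo_cv.split(X_demo)]
print(split_sizes)  # [(8, 2), (8, 2), (8, 2)]
```

Unlike k-fold cross-validation, the validation sets of different iterations may overlap, which is why the number of iterations can be chosen freely.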


In [None]:
### START CODE HERE ### 

cv = None 

### END CODE HERE ###

**learning_curve()** expects a parameter called **train_sizes**, which gives the numbers of training examples that will be used to generate the learning curve. If the dtype is float, each value is regarded as a fraction of the maximum size of the training set (which is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise the values are interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.
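
The translation from fractions to absolute sizes can be observed in the first value returned by `learning_curve()`. The sketch below assumes a synthetic dataset from `make_classification` and uses `GaussianNB` only because it trains quickly; both are illustrative choices, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB

# 500 synthetic samples; a 20% validation set leaves 400 training samples per split
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
demo_cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

abs_sizes, _, _ = learning_curve(
    GaussianNB(), X_demo, y_demo, cv=demo_cv,
    train_sizes=[0.1, 0.5, 1.0],  # floats, so they are read as fractions of 400
    scoring='accuracy')
print(abs_sizes)  # [ 40 200 400]
```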

***Exercise:***

Generate 5 evenly spaced training-set fractions, starting from 0.1 (i.e. 10% of the training samples) and ending at 1.0. 

**Hint:**
Use [numpy.linspace()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html)



<details><summary>Click here for answer</summary>

```python

train_sizes = np.linspace(0.1, 1.0, 5)
    
```
    
</details>

In [None]:
### START CODE HERE ### 

train_sizes = None

### END CODE HERE ###

print(train_sizes)

***Exercise:***

Create a LogisticRegression estimator with solver='liblinear' and multi_class='auto' and call the ``learning_curve()`` function to get the training and validation scores. You can set the ``cv`` param to the cross-validation splitter you created earlier. You need to choose a scoring metric appropriate for a classification problem (e.g. 'accuracy').


<details><summary>Click here for answer</summary>

```python

estimator = LogisticRegression(solver='liblinear', multi_class='auto')
train_sizes, train_scores, validation_scores = learning_curve(
        estimator, X, y, cv=cv, scoring='accuracy', train_sizes=train_sizes, shuffle=True)
    
```
    
</details>


In [None]:
### START CODE HERE ### 

estimator = None 
train_sizes, train_scores, validation_scores = None

### END CODE HERE ### 



***Exercise:***

What do you think are the shapes of train_scores and validation_scores?

<details><summary>Click here for answer</summary>
<p>
Since we specify 5 training sizes and the ShuffleSplit cross-validation runs 50 iterations, train_scores and validation_scores each have shape 5 x 50.
</p>
</details>

In [None]:
### Uncomment the following to check your answers

# print(train_scores.shape)
# print(validation_scores.shape)

***Exercise:*** 

To plot the learning curves, we need only a single score per training set size, not 50. To do this, take the mean of the 50 scores for each training set size. Because the scores are accuracy scores, you will need to convert them to error rates. 

***Hint:***  Fraction of error = 1.0 - (fraction of correct)


<details><summary>Click here for answer</summary>

```python

train_errors_mean = 1. - np.mean(train_scores, axis=1)
validation_errors_mean = 1. - np.mean(validation_scores, axis=1)
    
```
    
</details>

In [None]:
### START CODE HERE ### (~ 2 lines of code)

train_errors_mean = None 
validation_errors_mean = None 

### END CODE HERE ###

### Plot the learning curve ###
Ok, now we can start plotting the curve. You should expect to see a learning curve similar to the following:

<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/iti103/classification_lc.png" alt="classification learning curve" width="300"/>

***Exercise:***

Plot the learning curve for logistic regression. 

<details><summary>Click here for answer</summary>

```python

plot_curve('Logistic Regression', 'Error', train_sizes, train_errors_mean, validation_errors_mean)

    
```
    
</details>

In [None]:
### START CODE HERE ### 
 

### END CODE HERE ### 


***Exercise:***

Is this a high-bias or high-variance problem?
<details><summary>Click here for answer</summary><p>High variance</p></details>


Let us try a more complex non-linear algorithm such as Support Vector Machine (SVM). 
You don't need to know the details of SVM; we are just using it to see how it impacts the bias/variance. 

As SVC takes longer to train, we reduce the number of splits to 10 to speed up training. 

***Exercise:***

Complete the code below to plot the learning curve for SVM. 


<details><summary>Click here for answer</summary>

```python

train_sizes, train_scores, validation_scores = learning_curve(
        estimator, X, y, cv=cv, scoring='accuracy', train_sizes=train_sizes, shuffle=True)
train_errors_mean = 1. - np.mean(train_scores, axis=1)
validation_errors_mean = 1. - np.mean(validation_scores, axis=1)
plot_curve('SVM', 'Error', train_sizes, train_errors_mean, validation_errors_mean)
    
```
    
</details>



In [None]:
from sklearn.svm import SVC

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)

### START CODE HERE ### (~ 4 lines of code)



### END CODE HERE ### 

***Exercise:***

How does the use of SVM affect the bias of the model?

<details><summary>Click here for answer</summary><p>Using a more complex, non-linear algorithm such as SVM reduces the bias of the model</p></details>