# Lesson 2$^3$

REVISIONS|
---------|------------------------------------------------
2018-1218|CEF, initial.                  
2019-0131|CEF, spell checked and update. 
2019-0104|CEF, minor source text update. 

## Train-Test Split

An essential method in ML is the split between training data and test data. The model is trained on only a portion of the complete dataset, typically around 80%, and the training step must _NEVER_ see the held-out test set.

So, _never_ train on test (or validation) data!

Normally we just use the built-in methods for splitting, namely 

   ```sklearn.model_selection.train_test_split```, 

see documentation here

   * https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
   
but in this exercise you will build a train-test split function yourself.
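
For reference, a minimal sketch of how the built-in function is typically called. The toy arrays `X` and `y` are made up for illustration and are not part of the exercise data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# Hold out 20% of the samples as the test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train.shape=", X_train.shape, ", X_test.shape=", X_test.shape)
```

Every sample ends up in exactly one of the two sets, so the test set stays unseen during training.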

OPTIONAL: More documentation on Train-Validation-Test split at 
  
   * https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets


## Setup the Housing data from §2 [HOML] 

We use the housing data from the book; this cell sets everything up for you.

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = ".."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    #path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("IGNORING: Saving figure", fig_id)
    #if tight_layout:
    #    plt.tight_layout()
    #plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("../datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
fetch_housing_data()

import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

#housing.head()
print("housing.shape=",housing.shape,"\n")
housing.info()

%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
#save_fig("attribute_histogram_plots")
plt.show()

# NOTE: ITMAL, convert Pandas dataframe to numpy array, i.e. matrix
#       and use H later instead of housing
H = housing.values
print('H.shape=',H.shape,", type(H)=",type(H))

print('OK')

## Create our own train-test split function

<img src="Figs/training_and_test_splits.png" style="height:300px">

### Qa Create Your Own Split Function

Starting from the split function in [HOML, p49] and taking inspiration from it (do not copy it directly)

```python
def my_split_train_test(data, test_size, shuffle=False):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_size)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
```
create your own split function that can either shuffle the data (as it does now) or do a simpler split without shuffling.

Notice that the function is named ```my_split_train_test``` to avoid clashing later with the similarly named Scikit-learn function ```train_test_split```. The ```test_ratio``` parameter has also been renamed to ```test_size```. 

Also note that the split function in [HOML] operates on Pandas data frames (it indexes with ```iloc```), and this will cause a mix-up problem later when we pass the function numpy arrays (matrices), which have no ```iloc``` attribute.

Test that your new split function returns the same number of train and test samples whether shuffling is on or off, using the test stub below.
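
The expected sizes hard-coded in the test stub follow from the size computation in the snippet above. A small sketch of the arithmetic, assuming the housing data has 20640 rows (the sum of the two expected sizes):

```python
n_total = 20640    # assumed row count: 16512 + 4128 from the test stub
test_size = 0.2

# Same rule as in the split function: truncate to an integer test-set size
n_test = int(n_total * test_size)
n_train = n_total - n_test

print(n_train, "train +", n_test, "test")
```

Because the sizes depend only on `n_total` and `test_size`, they should be identical with and without shuffling.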

In [None]:
# TODO: Qa...define your my_split_train_test here

def my_split_train_test(...



# TEST VECTORS: use the housing Pandas dataframe or the H numpy object, your choice
dat=housing
#dat=H

def TestSize(train_set, test_set):
    # works only for 0.2 split
    expected_n_train=16512
    expected_n_test=4128
    assert len(train_set)==expected_n_train, 'Oh, mismatch in expected train n'
    assert len(test_set) ==expected_n_test,  'Oh, mismatch in expected test n'
    print(len(train_set), "train +", len(test_set), "test","..OK")

train_set, test_set = my_split_train_test(dat, 0.2, shuffle=True)
TestSize(train_set, test_set)

train_set, test_set = my_split_train_test(dat, 0.2, shuffle=False)
TestSize(train_set, test_set)

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(dat, test_size=0.2, shuffle=True, random_state=42)
TestSize(train_set, test_set)

train_set, test_set = train_test_split(dat, test_size=0.2, shuffle=False)
TestSize(train_set, test_set)

### Qb Why Shuffling

Explain why disabling shuffling is a bad idea.

### Qc Test and Compare 

Compare your split function with the one from Scikit-learn, first using the simple X-y data set generated below and then using the housing data via the ```H``` numpy array variable.

Splitting the dataset via your split function and the built-in split does not yield a logical true for the comparison

```python
(y_train == y_train_my).all().all()
```

Why is it so? Find the exact values in ```H[i,j]``` that are not equal and explain the problem.
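
One possible way to locate mismatching elements is `np.argwhere` on the element-wise inequality. The sketch below uses two small made-up arrays `a` and `b`, not the housing data; the same idea applies to any pair of arrays with equal shape.

```python
import numpy as np

# Two toy matrices (assumptions for illustration) that differ in two places
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 2, 0], [4, 9, 6]])

# Element-wise comparison gives a boolean matrix; argwhere lists the
# (i, j) index pairs where the comparison is True
mismatch = np.argwhere(a != b)

for i, j in mismatch:
    print("a[", i, ",", j, "]=", a[i, j], " b[", i, ",", j, "]=", b[i, j])
```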

In [None]:
# Simple data for Qc

import numpy as np
X, y = np.arange(10).reshape((5, 2)), np.array(list(range(5)))

print("X=",X)
print("y=",y)

# TODO: Qc...



# TEST VECTORS: notice that H is not split into X-y parts
train, test = train_test_split(H, test_size=0.25, shuffle=False)
print("built-in split: len(train)=",len(train),", len(test)=", len(test))

train_my, test_my = my_split_train_test(H, test_size=0.25, shuffle=False)
print("my split:       len(train)=",len(train_my),", len(test)=", len(test_my))

assert train.shape==train_my.shape
assert test.shape==test_my.shape

# Test for equality here...
equal_train=(train==train_my).all().all()
equal_test =(test ==test_my).all().all()

# TODO: why not equal?
print("equal_train=", equal_train, ", equal_test=",equal_test)

### Qd The Cross-Validation [CV] Algorithm

For very small data sets, it can be difficult to split the data into sufficiently large train and test/validation sets. One popular method is to split all data into smaller chunks, typically called K-folds or CV (cross-validation) folds, and then estimate the total validation error from this set of train-test partitioned data. 

In figure form the CV works like this

<img src="Figs/cross_validation.png" style="width:500px">

or a little more verbose (ignore the B and C part for now)

<img src="Figs/kfold.png" style="width:650px">
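
The fold assignment sketched in the figures can be inspected with Scikit-learn's `KFold` class. This is only an illustration on ten made-up samples; the choice of five folds is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape((10, 1))   # ten toy samples (illustration only)
kf = KFold(n_splits=5)               # default: no shuffling, contiguous folds

# Each split yields the indices used for training and for validation
folds = list(kf.split(X))
for k, (train_idx, val_idx) in enumerate(folds):
    print("fold", k, ": train=", train_idx, ", validation=", val_idx)
```

Each sample is used for validation in exactly one fold and for training in all the others.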

Explain in text, pseudocode or diagrams how a cross-validation split works in detail.


### [OPTIONAL] Qe Extend your splitter

Extend your splitter such that it can handle both numpy arrays and Pandas data frames in both shuffle modes (you need to work on ```iloc``` in ```shuffle=True``` mode).

In [None]:
# TODO: Qe...

### [OPTIONAL] Qf Implement a Cross-Validation Function

From your description of a cross-validation algorithm in the question above, try to implement one. Compare it with the Scikit-learn function 

```python
from sklearn.model_selection import cross_val_score
..
scores = cross_val_score(.., cv=5)
```
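
A complete call could look like the sketch below. The model (`LinearRegression`) and the noise-free toy data are assumptions chosen for illustration; the exercise does not prescribe them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data (an assumption): an exactly linear relation y = 3x + 1
X = np.arange(20, dtype=float).reshape((20, 1))
y = 3.0 * X.ravel() + 1.0

model = LinearRegression()

# One score per fold; for regressors the default score is R^2
scores = cross_val_score(model, X, y, cv=5)
print("scores=", scores, ", mean=", scores.mean())
```

Since the toy data is exactly linear, every fold should score close to 1; real data will give lower and more varied scores across folds.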

In [None]:
# TODO: Qf...