# Train-Test Splitting in Machine Learning
#### This notebook contains my notes following tutorials from the machinelearningmastery.com website.

- Train-test split is a technique used to evaluate the performance of a machine learning model/algorithm. 

- It can be used for both classification and regression algorithms. 

- It involves dividing the dataset into two subsets.

    - The first subset is used to fit (or train) the model and is usually called the **training** dataset.
    
    - The second subset is used to evaluate the fitted model and is usually called the **test** dataset. It is held out during training and used only afterwards. In this sense, the **test** dataset has not been seen by the model.
    - The performance of the model on the test set gives an estimate of how the model will perform on new, unseen data.
    
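- This technique is useful when the dataset is large enough. It should not be used when the dataset is small, because the training set may then be too small to learn from and the test set too small to give a reliable performance estimate.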

- With insufficient data, the k-fold cross-validation technique can be used instead: the data is divided into k folds, and in each of k rounds, k-1 folds are used for training the model and the remaining fold is used for testing it.

- Train-test split is also computationally efficient: the model is fitted and evaluated only once, whereas k-fold cross-validation repeats the fit k times.

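- In scikit-learn's train_test_split() function, the split is controlled by the test_size argument, which sets either the proportion (a float between 0 and 1) or the absolute number (an integer) of examples assigned to the test set. The remaining examples go to the training set.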
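
For the small-data case mentioned above, k-fold cross-validation can be sketched as follows. This is a minimal illustration, not part of the original tutorial code: the synthetic dataset, the choice of k=5 and the LogisticRegression model are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# small synthetic dataset (assumed, for illustration only)
X, y = make_classification(n_samples=100, random_state=1)

# k=5: each round trains on 4 folds and tests on the remaining fold
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# the model is an arbitrary choice for this sketch
model = LogisticRegression(max_iter=1000)

# one accuracy score per fold
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print(len(scores))
print('Mean accuracy: %.3f' % scores.mean())
```

Every example is used for testing exactly once, which is why this procedure tends to suit small datasets better than a single split, at the cost of fitting the model k times.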

### How to choose a test_size split
- Computational cost of training the model
- Computational cost in evaluating the model
- Training set representativeness (consistent target distribution with original dataset)
- Test set representativeness (consistent target distribution with original dataset)

Common split percentages are:
- Train: 80%, Test: 20%
- Train: 67%, Test: 33%
- Train: 50%, Test: 50%
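
As a quick check of what these percentages mean in practice, the sketch below applies each of them to the same dataset and records the resulting set sizes. The 1,000-sample make_blobs dataset and the random_state values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# synthetic dataset with 1,000 examples (assumed, for illustration only)
X, y = make_blobs(n_samples=1000, random_state=1)

# map each test_size to the (train, test) sizes it produces
sizes = {}
for test_size in (0.20, 0.33, 0.50):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=1)
    sizes[test_size] = (len(X_train), len(X_test))
    print(test_size, sizes[test_size])
```

With 1,000 examples this should give 800/200, 670/330 and 500/500 examples for the train/test sets respectively.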

## Example

In [4]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# create the dataset
X, y = make_blobs(n_samples=1000)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# We can also use train_size argument to split the dataset
#X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(670, 2) (330, 2) (670,) (330,)


### Repeatable Train-Test Splits

When comparing ML algorithms, it is often desirable that they are fit and tested on the same subsets of the data. This can be achieved by fixing the seed of the pseudo-random number generator used to split the dataset. Specifically, we set the random_state parameter to an integer value.

In [5]:
# demonstrate that the train-test split procedure is repeatable
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=100)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])

[[-3.55668015 -2.25799207]
 [ 0.90040631 -2.15669713]
 [ 8.58794766 -7.78255649]
 [-4.43181295 -5.09644906]
 [-3.11308548 -5.35597826]]
[[-3.55668015 -2.25799207]
 [ 0.90040631 -2.15669713]
 [ 8.58794766 -7.78255649]
 [-4.43181295 -5.09644906]
 [-3.11308548 -5.35597826]]
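
A fixed random_state makes it possible to compare algorithms on identical train and test sets. The sketch below is one way to do that. The two models (logistic regression and a decision tree) and the synthetic dataset are assumptions picked for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification dataset (assumed, for illustration only)
X, y = make_classification(n_samples=1000, random_state=1)

# split once with a fixed seed so every model sees the same rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# candidate models (arbitrary choices for this sketch)
models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=1),
}

# fit each model on the same training set and score it on the same test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print('%s: %.3f' % (name, results[name]))
```

Because both models are scored on the same test rows, any difference in accuracy comes from the models rather than from a lucky or unlucky split.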


### Stratified Train-Test Splits

- applicable to classification problems only

- some problems do not have a balanced number of examples for each class label.

- therefore, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset

- this is called stratified train-test split.

- a stratified train-test split is achieved by setting the 'stratify' argument to the y component of the original dataset


In [6]:
# split imbalanced dataset into train and test sets without stratification
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
print(Counter(y_train))
print(Counter(y_test))

Counter({0: 94, 1: 6})
Counter({0: 45, 1: 5})
Counter({0: 49, 1: 1})


- In the above, the class composition of the train and test sets differs
- the original data has a 94% vs 6% class label distribution
- after splitting, the train set contains 45/5 examples and the test set contains 49/1 examples
- this mismatch is not desirable: the test set holds only one minority-class example, so performance on that class can barely be measured

This issue is fixed by using the train-test split with the stratify option.

In [7]:
# split imbalanced dataset into train and test sets with stratification
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
print(Counter(y_train))
print(Counter(y_test))

Counter({0: 94, 1: 6})
Counter({0: 47, 1: 3})
Counter({0: 47, 1: 3})


Given that we have a 50% split into train and test sets, we would expect each set to contain 47/3 examples of the two classes. The output above shows that this is the case.
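
Raw counts become harder to compare when the train and test sets differ in size, so it can help to look at class proportions instead. The sketch below uses a small helper, class_proportions, which is a hypothetical name introduced here and not part of scikit-learn.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def class_proportions(labels):
    """Hypothetical helper: map each class label to its share of the examples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {int(label): count / total for label, count in counts.items()}

# same imbalanced dataset as in the cells above
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)

# stratified split: class proportions should match the original dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1, stratify=y)

p_all = class_proportions(y)
p_train = class_proportions(y_train)
p_test = class_proportions(y_test)
print(p_all)
print(p_train)
print(p_test)
```

With stratify=y, the three dictionaries should report close to the same share for each class. Dropping the stratify argument and re-running would show how far the shares can drift apart.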

## Classification Example

In [8]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

(208, 60) (208,)
(139, 60) (69, 60) (139,) (69,)
Accuracy: 0.783


## Regression Example

In [10]:
# train-test split evaluation random forest on the housing dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f (thousands of dollars)' % mae)

(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)
MAE: 2.171 (thousands of dollars)
