# Exercise 4: Train test split

The goal of this exercise is to learn how to split a classification data set. The idea is the same as for a regression data set, with one important detail specific to classification: the proportion of each class in the train set and in the test set.

```python
import numpy as np

# 10 samples with 2 features each; the last 3 samples belong to class 1
X = np.arange(1, 21).reshape(10, -1)
y = np.zeros(10)
y[7:] = 1
```

1. Split the data using `train_test_split` with `shuffle=False`. The test set represents 20% of the total size of the data set. Print `X_train`, `y_train`, `X_test` and `y_test`. Compute the proportion of class `1` in the train set and in the test set.

2. Having a train set with different properties than the test set is not recommended. The exam analogy (https://www.youtube.com/watch?v=_vdMKioCXqQ) helps to understand this point: if the questions you get at the exam are completely different from what you prepared for, you are not evaluated on what you learnt. The training set has to be representative of the data set. Now split the data into a train set and a test set while keeping the proportion of class `1` nearly constant. Shuffling (`shuffle=True`, the default) works in theory because it relies on random sampling, but it gives no guarantee, especially on small data sets. The parameter `stratify` splits the data so that the proportion of class `1` stays as close as possible to the same value in the train set and in the test set. Using the parameter `stratify`, split the data below and print the proportion of class `1` in the train set and in the test set.

```python
# 100 samples with 2 features each; the last 30 samples belong to class 1
X = np.arange(1, 201).reshape(100, -1)
y = np.zeros(100)
y[70:] = 1
```

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 21).reshape(10, -1)
y = np.zeros(10)
y[7:] = 1

# Show the full data set before splitting
print(X)
print(y)

# shuffle=False keeps the original order: the last 20% of the rows form the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)

# 1.
print('X_train', X_train)
print('y_train', y_train)
print('X_test', X_test)
print('y_test', y_test)
print('proportion of class 1 train:', (y_train == 1).sum() / len(y_train))
print('proportion of class 1 test:', (y_test == 1).sum() / len(y_test))
```

```
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]
 [13 14]
 [15 16]
 [17 18]
 [19 20]]
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
X_train [[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]
 [13 14]
 [15 16]]
y_train [0. 0. 0. 0. 0. 0. 0. 1.]
X_test [[17 18]
 [19 20]]
y_test [1. 1.]
proportion of class 1 train: 0.125
proportion of class 1 test: 1.0
```
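
The proportions above are computed for class `1` only. As a minimal sketch, the share of every class can be reported at once with `np.unique`. The helper name `class_proportions` is hypothetical and was introduced only for this illustration; the data and the split are the same as in question 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def class_proportions(labels):
    """Return a dict mapping each class label to its share of `labels`.

    Hypothetical helper, not part of scikit-learn.
    """
    values, counts = np.unique(labels, return_counts=True)
    return {float(v): float(c) / len(labels) for v, c in zip(values, counts)}


X = np.arange(1, 21).reshape(10, -1)
y = np.zeros(10)
y[7:] = 1

# Same ordered split as in question 1: the last 2 rows form the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=False
)

print('full :', class_proportions(y))
print('train:', class_proportions(y_train))
print('test :', class_proportions(y_test))
```

With the ordered split, class `0` does not appear in the test set at all, so the test set only measures performance on class `1`.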


```python
# 2.
# The explanation is given in point 2 of the exercise statement
X = np.arange(1, 201).reshape(100, -1)
y = np.zeros(100)
y[70:] = 1

# stratify=y keeps the class proportions of y in both the train set and the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)

print('proportion of class 1 train:', (y_train == 1).sum() / len(y_train))
print('proportion of class 1 test:', (y_test == 1).sum() / len(y_test))
```

```
proportion of class 1 train: 0.3
proportion of class 1 test: 0.3
```
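
To illustrate why shuffling alone gives no guarantee, the sketch below repeats the split with several random seeds, once with shuffling only and once with `stratify=y`, and compares the proportion of class `1` in the test set. The number of seeds (50) and the variable names `shuffled` and `stratified` are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 201).reshape(100, -1)
y = np.zeros(100)
y[70:] = 1

shuffled, stratified = [], []
for seed in range(50):  # 50 seeds: arbitrary choice for illustration
    # Shuffling only (shuffle=True is the default)
    _, _, _, y_te = train_test_split(X, y, test_size=0.20, random_state=seed)
    shuffled.append(float((y_te == 1).mean()))

    # Shuffling with stratification on the labels
    _, _, _, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y
    )
    stratified.append(float((y_te == 1).mean()))

print('shuffle only: min', min(shuffled), 'max', max(shuffled))
print('stratify    : min', min(stratified), 'max', max(stratified))
```

With `stratify=y`, the test proportion should stay at 0.3 for every seed, whereas with shuffling only it is expected to vary from one seed to another.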
