# Train and Test Data Split
In this notebook we will work on the train and test data split. It might look simple at first: what can be so difficult about assigning the train and test data, right? However, as the saying goes, the further you go into the forest, the more trees there are.

When it comes to splitting the data, we need to remember a few rules that let us create better models and evaluate them more precisely.

**Basically, why do we split the data into train and test sets?**  
In theory - we don't have to do it to create a model. We just need to feed the model with the independent data (our **X** values, all our features) and dependent data (our **y**, our desired outcomes) and our model will learn how to predict the y for the data we input later. But will it be a good model? Probably not. Will we know how good or bad it is? Also, probably not.   

If we however divide the data into train and test datasets, we will, as the names suggest, train the model on the training data and then test it on the test data. Then we can compare the predictions on the test data with the actual results and therefore evaluate our model. If the model does poorly on the predictions, it's a signal for us that we might want to tweak it a bit, change the parameters or do some more feature engineering so that we get a better result. We can also see whether the model is under- or overfitted and will therefore make mistakes on new data.

In this notebook we will take a more theoretical look at data splitting and at the most common issues that might pop up at this point of the machine learning journey:
1. Using population or sample data
2. Basic train/test split
3. Validation methods
4. Dealing with imbalanced data
5. Information Value * (I will add it later)

## Using population or sample data
The dataset we are using could also be called our population of data. Depending on its size, we might notice that the population is considerably large (worth considering once we have more than a few hundred thousand records) and takes up a lot of memory, which leads to longer modeling times, especially when we are just experimenting, searching for the right model or trying to optimize the results. In such a scenario we might consider using a sample of the data instead of the whole population.

|  | **Advantages** | **Disadvantages** |
|:---|:---:|:---:|
| **Population** | Represents the tendencies for the whole population | Takes more time to be processed, <br>Assumes that we have all the data, <br> Might be imbalanced for minority classes |
| **Sample** | Is faster to process and model, <br>Might help with imbalanced data | Provides a generalization of the population's tendencies, <br>Might be biased or wrongly sampled |

Choosing whether to work on a sample or on the population depends on several factors: the dataset size, and the resources and time we have for the task. If we want to use a sample, we can go a few different routes to get the best sample for our task:
- **Simple random sample** - which will create a random sample of the size we desire from our dataset. These data will be totally random and might miss some patterns, therefore choosing the right size is very important here, as well as comparing the sample to the population for similarities;
- **Stratified sample** - in which we divide the population into subgroups for the unique values (layers) of a feature and draw a sample from each of them. It ensures that every layer is represented, even if the proportions are slightly off;
- **Proportional sample** - in which we try to preserve the proportions of the given feature;
- **Systematic sample** - in which we take records at fixed intervals. We use a random starting point and then take every n-th record, where n is our interval. This works best if the data is sorted by some feature.

> For our machine learning in these notes we do not need to take a sample, the dataset has only ~50k records and therefore is quite small, but we will draw the explained samples for this tutorial

In [1]:
# Let's import our libraries and read in our dataset that we preprocessed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read our dataset for this notebook
df = pd.read_csv('Diamonds_encoded.csv')
df.head()

Unnamed: 0,carat,cut,clarity,depth,table,price,E,F,G,H,I,J,cube
0,0.23,2,3,61.5,55.0,326,1,0,0,0,0,0,38.20203
1,0.21,3,2,59.8,61.0,326,1,0,0,0,0,0,34.505856
2,0.23,1,4,56.9,65.0,327,1,0,0,0,0,0,38.076885
3,0.29,3,5,62.4,58.0,334,0,0,0,0,1,0,46.72458
4,0.31,1,3,63.3,58.0,335,0,0,0,0,0,1,51.91725


In [13]:
# simple random sample

random_sample = df.sample(n=5000, # here we input the size of our sample
                         random_state=42 # random state ensures that each time we draw, the sample will remain the same
                         )

In [14]:
# proportional sample on clarity
clarity_dict = {0: 'I1', 1: 'IF', 2: 'SI1', 3: 'SI2', 4: 'VS1', 5: 'VS2', 6: 'VVS1', 7: 'VVS2'} #as reminder

# let's set up the weights
# equal weights give every record the same chance of being drawn, so the sample
# keeps (approximately) the proportions of the population; raise a weight to draw that class more often
weights = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1} 

proportional_sample = df.sample(n=5000, 
                         random_state=42,
                         weights=df['clarity'].map(weights) # assigning weights
                               )

In [17]:
# stratified sample on cut

proportion = 0.1  # Sample size as proportion to the population, here 10%

# random_state keeps the draw reproducible, as in the other samples
stratified_sample = df.groupby('cut').apply(lambda x: x.sample(frac=proportion, random_state=42))

In [27]:
# systematic sample with sorted cut
df1 = df.sort_values(by='cut') # we need to sort and I don't want to overwrite my df

# the following part can be done in one line as well
proportion = 0.1
population_size = len(df1)
sample_size = int(population_size * proportion) 
step = population_size//sample_size

# draw a random starting index within the first interval, so no records at the start are skipped
first_index = np.random.randint(0, step) 

systematic_sample = df1[first_index::step].reset_index(drop=True)

When it comes to sampling, these are the basic methods to use if our dataset is too big or if we would first like to play with our data in a way that involves less computing.

There are also different techniques in `scipy.stats` or in `sklearn.model_selection`.
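
Earlier we said that a random sample should be compared to the population for similarities. Below is a minimal sketch of one way to do that, by putting the class shares side by side. The `population` DataFrame is a made-up stand-in for the diamonds data (its `cut` shares are assumptions), so treat it as an illustration rather than the exact check for our dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diamonds data: one categorical column with uneven classes
rng = np.random.default_rng(42)
population = pd.DataFrame({
    "cut": rng.choice([0, 1, 2, 3, 4], size=10_000, p=[0.03, 0.09, 0.22, 0.26, 0.40]),
})

random_sample = population.sample(n=1_000, random_state=42)

# Share of each class in the population and in the sample, side by side
comparison = pd.DataFrame({
    "population": population["cut"].value_counts(normalize=True),
    "sample": random_sample["cut"].value_counts(normalize=True),
}).sort_index()
comparison["abs_diff"] = (comparison["population"] - comparison["sample"]).abs()

print(comparison.round(3))
```

If the differences are large for a feature we care about, a bigger sample or a stratified draw is probably the safer choice.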

## Basic train and test split
Now that we have decided whether we want to work on the population or a sample, we can split our data for our machine learning model. As mentioned before, we do it so that we can check how well or how badly our model is behaving. In many situations we will be splitting not only into training and testing sets, but we will also create a validation set for hyperparameter tuning; more on that later.

Most often, we will split our data into 70-80% for training and 20-30% for testing (and validation). For larger datasets we can choose the split more deliberately, for example 50/50 or 90/10, but we can dive into that the more we understand and tune our model. 

As the names suggest, we use our **training set** to train our model, providing it with the feature values X and the expected values y. The **testing set** serves as, obviously, a test, in which we compare the model predictions with the expected, correct values. The more often our model is correct in its estimation, the better. 

> **In this notebook we will be using cut as the dependent variable.**  If we were modeling at this point, after the train/test split we might fit a scaler on our X_train and later apply the same scaler to X_test when predicting.

In [2]:
# let's import our needed methods
from sklearn.model_selection import train_test_split

In [3]:
# let's now assign the dependent and independent variables
y = df.pop('cut') #dependent variable
X = df # independent variables

Of course for the X/y split there are tons of methods, but this is the one I like right now. If we are choosing only a few columns from the whole dataset for prediction, this code could look like this:
```python
X = df[['cut', 'depth', 'cube', 'clarity']]
y = df['price']
```

We could also create pipelines that will automatically assign the classes/expected values. This might come in handy when updating already existing models.

In [4]:
# now let's split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, # our variables
                                                   test_size=0.2, # proportion of the test size
                                                   random_state=42, # so the data is always divided the same way
                                                   stratify=y # for classification, so train and test keep the same class proportions
                                                   )
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(43152, 12) (10788, 12)
(43152,) (10788,)
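
The note about scaling can be sketched as follows. This is only an illustration on made-up numeric data (`X_demo` and `y_demo` are assumptions, not our diamonds columns), and `StandardScaler` is just one possible scaler. The point it shows is that the scaler learns its statistics from the training set and is then reused, unchanged, on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features and a 3-class target, used only for illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(500, 3))
y_demo = rng.integers(0, 3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test data

print(X_tr_scaled.mean(axis=0).round(3))  # zero (up to rounding) by construction
print(X_te_scaled.mean(axis=0).round(3))  # close to zero, but usually not exactly
```

Fitting the scaler on the full dataset before splitting would let information from the test set leak into training, which is why the fit happens after the split.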


## Validation and its methods
In the previous part, I mentioned that we can also set aside a validation set. But before that, let's consider what validation actually is and why we need it when creating a machine learning model. 

If we were to split our data into train-validation-test, the validation set would be there to evaluate our model. Sounds too similar to what the testing set does? The difference is that we keep the testing set until the very end, so that we can evaluate the model on data it hasn't seen before. Machine learning models tend to learn the provided data *by heart*, so to speak, therefore it's good to evaluate them on data they have never seen. The validation set is there so that we can evaluate the model many times before calling it perfected (or good enough) and stop tuning the hyperparameters. Then, if the results are the same or close on the testing set, we are good to go with deploying our model.

There is also an alternative to the validation set called **cross-validation**. It is useful, since if we were to use a validation set, we would have to sacrifice approximately 10-20% of our data that could be used for training. With cross-validation, the data can stay in the training set and we still have a way to validate the model. 

Cross-validation also serves as a controlling step that makes overfitting easier to spot. Because the model is trained and evaluated on several different, shuffled splits of the data, its score depends less on one lucky split and our evaluation is more robust. 

In [5]:
# the basic way with validation set

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   random_state=42,
                                                   stratify=y 
                                                   )

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, # we use train sets in here!
                                                  test_size=0.25,  # remember that X_train = 0.8 of X
                                                  random_state=42,
                                                  stratify=y_train
                                                 )

Now with that we would:
- use training set to train the model;
- use validation set to validate the model and then tune the hyperparameters to see, if we can create a better outcome;
- use the test set to test the final model after validation.

However this way we are training the model on only **60%** of data. Therefore we call for the cross-validation method to the rescue.

What most cross-validation methods do is create a desired number of folds of train/test sets, where the test data don't overlap between folds. We then train our model once per fold and take the average score, on which we evaluate the model (more on evaluation in a different notebook!). When we are training the model with cross-validation, we do it in a loop.
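
To make the "train once per fold and average the score" idea concrete, here is a minimal sketch. The data is synthetic (`make_classification`) and `LogisticRegression` is an arbitrary model picked for the demo; neither is what we will use for the diamonds data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data as a stand-in; the model choice is arbitrary
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, test_idx in cv.split(X_demo, y_demo):
    model = LogisticRegression(max_iter=1000)  # a fresh model for every fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # accuracy on the fold that was held out from training
    scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print(np.round(scores, 3), round(float(np.mean(scores)), 3))
```

scikit-learn's `cross_val_score` wraps the same loop in one call, which is handy once we no longer need to look inside each fold.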

There are quite a few methods on cross-validation from Sklearn and we will take a few into consideration here, with honorable mentions as well. 

In [6]:
# let's import the cross-validation methods
from sklearn.model_selection import (LeaveOneOut, 
                                     KFold, 
                                     StratifiedKFold)

### Leave-One-Out
Leave-One-Out is one of the cross-validation methods that can be used to validate an ML model. In each loop, the algorithm will leave one record out and use it for evaluation, hence the name.  
![LOOCV](https://www.bijenpatel.com/content/images/2020/11/loocv.png)

LOO cross-validation is however very computationally expensive, as it basically trains the model once for every record in the given dataset. It is therefore not recommended as a validation method for larger datasets, as it might simply take too long to compute. LOO is a great method for smaller datasets (up to a few thousand records) and might be a great solution for fighting overfitting, as it can be very precise.
> In this notebook I will code the LOO for the diamonds dataset but won't run it, since it is a larger dataset.

As an alternative to LOO, there is another cross-validation method called `LeavePOut()`, which allows us to define the number of records left out. 
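
To get a feeling for the cost, we can ask both splitters how many train/test splits they would produce. The sketch below uses a toy array of 10 records (an assumption for the demo); the last line only does the arithmetic for a dataset of our size, without creating any splits.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X_small = np.arange(20).reshape(10, 2)  # 10 toy records with 2 features

loo = LeaveOneOut()
lpo = LeavePOut(p=2)

print(loo.get_n_splits(X_small))  # one split per record
print(lpo.get_n_splits(X_small))  # one split per possible pair of records

# number of pairs in a dataset with 53,940 rows, i.e. how many model fits LeavePOut(p=2) would need
print(comb(53_940, 2))
```

The number of splits grows much faster for `LeavePOut` than for LOO, so it is realistic only for very small datasets.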

In [None]:
loocv = LeaveOneOut()

for train, test in loocv.split(X, y): # this loop provides us with the indexes of train and test rows
    X_train, y_train = X.iloc[train], y.iloc[train] # assign the train split
    X_test, y_test = X.iloc[test], y.iloc[test] # assign the test split
    
    print(X_train.shape, X_test.shape) # here we will follow up with model fitting in next notebook

### KFold
The KFold cross-validation method might be the most popular one, as it's much faster and computationally cheaper. Similarly to Leave-One-Out, in each loop it takes a piece of the data for testing and does so for k folds. The testing data does not overlap and is evenly distributed across all the folds.
![KFold](https://i0.wp.com/sqlrelease.com/wp-content/uploads/2021/07/K-fold-cross-validation-1.jpg?ssl=1)

In [7]:
kfoldcv = KFold(n_splits=5, # number of folds, 5 is default
               shuffle=True, # if True, the data is shuffled before it is split into folds
               random_state=42 # ensures that when we use it again we will get same results
               )

# and the loop is the same for each cross-validation
for train, test in kfoldcv.split(X, y):
    X_train, y_train = X.iloc[train], y.iloc[train]
    X_test, y_test = X.iloc[test], y.iloc[test]
    print(X_train.shape, X_test.shape)

(43152, 12) (10788, 12)
(43152, 12) (10788, 12)
(43152, 12) (10788, 12)
(43152, 12) (10788, 12)
(43152, 12) (10788, 12)


### Stratified KFold
This cross-validation is a variation of KFold that returns stratified folds (we talked about stratification earlier). The folds are made by preserving the percentage of samples for each category. This is a good cross-validation for imbalanced data, as it ensures the presence of each category in every training and testing fold.
![Stratified KFold](https://i.stack.imgur.com/XJZve.png)

In [8]:
strkfoldcv = StratifiedKFold(n_splits=5, # StratifiedKFold instead of KFold, so the folds are stratified on y
                             shuffle=True,
                             random_state=42 
                             )

# and the loop is the same for each cross-validation
for train, test in strkfoldcv.split(X, y):
    X_train, y_train = X.iloc[train], y.iloc[train]
    X_test, y_test = X.iloc[test], y.iloc[test]
    print(X_train.shape, X_test.shape)

(43152, 12) (10788, 12)
(43152, 12) (10788, 12)
(43152, 12) (10788, 12)
(43152, 12) (10788, 12)
(43152, 12) (10788, 12)
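
The shapes alone don't show what stratification changes, so here is a small sketch that counts the minority records landing in each test fold. The target is a toy array with 90 records of class 0 and 10 of class 1 (an assumption for the demo, not our diamonds data).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced target: 90 records of class 0 and 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # the features do not matter for the split itself

splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=42),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
}

minority_counts = {}
for name, splitter in splitters.items():
    # number of class-1 records in every test fold
    minority_counts[name] = [int(y_demo[test_idx].sum())
                             for _, test_idx in splitter.split(X_demo, y_demo)]
    print(name, minority_counts[name])
```

With `StratifiedKFold` every test fold receives the same number of minority records, while plain `KFold` leaves that to chance.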


### Groups
With cross-validation, and in data analytics and ML generally, we sometimes come across the usage of *groups* in our datasets. What are these and how do we use them? Well, I'm glad I asked. 😁

When it comes to our dependent and independent variables, groups don't fit into either of these. The groups are there in the dataset as the sources of the records. 

For example, let's assume that for the dataset we are working with, diamonds, each unique diamond color comes from a different mine. Therefore, if we are doing a classification for the cut of the diamond, we can later see where the best cuts come from and where the cuts are so-so (this is a made-up example for the used dataset, as unfortunately there are no groups in it).  
Another example could be that we are given patient data from a few different hospitals and need to determine whether the patients are going to have a heart attack or not. The groups here would be the hospitals from which the information about the patients comes. If one of the hospitals has a high level of heart attacks, maybe they do not take good care of their patients? Or maybe this hospital is located in an unhealthy area? Or, perhaps, they specialize in heart diseases? 

Either way, different sources of information/records might cause the model to be too sensitive or not sensitive enough for our prediction. As in the hospital example above, if we give the model training data from "normal" hospitals and then ask the same model to predict heart attacks for patients in the heart-specializing hospital, we won't get good predictions. This might also go the other way round, if we train the model with data from mainly heart-specializing hospitals and then ask it to predict heart attacks for a children's hospital, where the rates are much lower.

It is important to remember here that groups aren't a feature in the sense that we train our model with them. We will most likely come across groups when creating larger, general models, in which we would like to avoid biases and mistakes caused by the sources of our data, in case new data come from a source different than the ones we trained our model with.

Why mention it? Well, there are cross-validation methods that allow us to validate our train and test data with consideration of the groups as well.

### Honorable mentions
Now that I have explained how groups work in data, we can move on to a few more cross-validation methods, most of which involve the usage of groups.
- **GroupKFold** - which ensures that for each fold our model is validated on data from different group(s) than the ones it was trained on;
- **StratifiedGroupKFold** - which, as its name suggests, is a mixture of StratifiedKFold and GroupKFold: it keeps each group within a single fold, while trying to preserve the percentage of samples for each class;
- **LeaveOneGroupOut** - where for each iteration a different group is left out for validation;
- **PredefinedSplit** - in which we predefine the schema for the split;
- **TimeSeriesSplit** - which is a special cross-validation method for time-series forecasting, which I will cover later on, in the time series notes.
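
As a minimal sketch of how groups enter the split, here is `GroupKFold` on a made-up version of the hospital example. The 12 records, their labels and the hospital names are all assumptions for the demo.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 12 patient records coming from 4 hospitals (the groups)
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
hospitals = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"])

gkf = GroupKFold(n_splits=4)

# groups are passed to split() next to X and y; they are not a column of X
for train_idx, test_idx in gkf.split(X_demo, y_demo, groups=hospitals):
    train_groups = set(hospitals[train_idx])
    test_groups = set(hospitals[test_idx])
    print("train:", sorted(train_groups), "test:", sorted(test_groups))
```

In every fold the hospitals used for testing are absent from the training part, so the score tells us how the model copes with a source it has not seen.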

## Dealing with imbalanced data
Imbalanced data is a dreadful situation to think of, but it's fairly common when dealing with minority classes in our data or trying to predict something rare, like shark attacks. The less likely or less common the category is, the harder it will be for our model to differentiate it from the majority category(-ies). What is also important to note here is that usually those minority classes are the ones that are crucial to us and important to recognize. For example, there are fewer people who are pre-diabetic than people who are not, but it is more important to recognize the ones who are, since with proper predictions we might save some people from becoming diabetic.

We talk about imbalanced data when, in a binary classification, one of our categories is seen in roughly 10% of the records or fewer. The threshold might be slightly higher or lower depending on the context, and when the share is somewhat above 10% we might debate whether we are dealing with imbalanced data or not. A rule of thumb says that the bigger the difference between the smallest class and the rest, the more trouble there will be for our model. If our minority class takes up less than 10% of the data and we scored y_test against an array filled with just the majority class, we would still get at least 90% accuracy, which in itself might not look bad, but it doesn't serve the purpose of our model: to detect that 10% minority. 
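
The 90% accuracy trap is easy to reproduce. In this sketch the labels are a toy array with exactly 10% minority records, and the "model" simply predicts the majority class for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy test labels: 90 majority (0) and 10 minority (1) records
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1)  # share of minority records found

print(accuracy, minority_recall)
```

The accuracy comes out at 0.9 while the recall for the minority class is 0, which is why accuracy alone is a poor guide on imbalanced data.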

In case we are dealing with imbalance, if we want our model to be sensitive for minority classes and be able to predict them, we have a few solutions to handle that.
1. We might use sampling to try to even out the proportions of our classes;
2. Stratified cross-validation helps us keep the class proportions in every fold, so that the minority class(es) are not skipped;
3. We can use Random Oversampling or SMOTE (Synthetic Minority Oversampling Technique);
4. We can also do Random Undersampling or TomekLinks on the majority class;
5. We can use an ML algorithm that supports class weights and assign higher weights to the minority class(es).

We already discussed sampling and cross-validation and we will talk about models in the next notebook, hence we will now move on to over- and undersampling and SMOTE. However, these techniques provide the best results when used on binary classification and in lower dimensionalities. Our diamonds dataset is rather low in dimensions, but our predicted value *cut* has 5 classes. Therefore in this notebook I will discuss these techniques on the titanic dataset, using only numeric columns for the sake of these examples.

> 💡 **Important! When we are balancing the data for the model, we only do so with training data!**

In [55]:
# let's load the titanic for our balancing examples
titanic = sns.load_dataset('titanic')
titanic = titanic[['survived', 'age', 'sibsp', 'parch', 'pclass', 'fare']]
titanic = titanic.dropna(axis=0)
# I narrowed the dataset down, as I don't want to waste time encoding the other columns

# let's see if the survived data is imbalanced
a = titanic.groupby('survived').count()
a['perc'] = a['fare']/a['fare'].sum()
a['perc']

survived
0    0.593838
1    0.406162
Name: perc, dtype: float64

The not survived/survived ratio is actually not that bad. We wouldn't have to balance it out, as the data is well represented in both classes. But of course, we will do it for the sake of these examples and explanations and since I am slightly too lazy to search for an appropriate dataset 😝. Generally, in cases where the data are not balanced 1:1, but both (or all) categories are well represented in the dataset, it's better not to use these methods, as they might also disrupt the distribution patterns.

## Balancing the dataset with binary classes
### Random Oversampling and Undersampling
I have put these two methods together, as these use the same methodology, but in opposite directions.

When we are talking about **Oversampling**, we have our minority class in mind. We take records from this class and copy them again and again, until the minority and majority classes are equal (or more comparable) in the number of records.  
***Pros***:    The data is no longer imbalanced, so the model tends to detect the minority class at a higher rate  
***Cons***:    Our model might be overfitted to the minority class and not be able to predict it on slightly different data 🥲

**Undersampling** is the polar opposite of oversampling and we perform it on the majority class. In this case, we take a sample of the majority class that is the size of the minority class (or in the proportions we demand). This way our training dataset is usually smaller than initially, but is now balanced.  
***Pros***:    The data is no longer imbalanced, so the model tends to detect the minority class at a higher rate    
***Cons***:    Our model might be underfitted for the majority class and miss out on some additional patterns in the data.

Whichever of these we choose, we need to remember about proper validation and testing on data that was not resampled, so that we can check whether our sampling actually improved the predictions of our model.

In [56]:
# let's start with importing our methods
from imblearn.over_sampling import SMOTE, RandomOverSampler 
from imblearn.under_sampling import TomekLinks, RandomUnderSampler
from collections import Counter # cool function 

# and prepare a basic train test split for titanic
y = titanic.pop('survived')
X = titanic

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   random_state=42,
                                                   stratify=y 
                                                   )

In [57]:
# using oversampler on minority class in titanic
ros = RandomOverSampler(random_state=42, # we want it to always be the same with random_state
                        sampling_strategy=1 # 1 will give us 1:1 proportion, but we can choose decimals as well
                       )
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train) # let's resample

print('Resampled dataset shape %s' % Counter(y_train_ros)) # check the sizes of our classes

Resampled dataset shape Counter({0: 339, 1: 339})


In [58]:
# now the undersampler will be built in exactly the same way
rus = RandomUnderSampler(random_state=42, 
                         sampling_strategy=1)

X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print('Resampled dataset shape %s' % Counter(y_train_rus))

Resampled dataset shape Counter({0: 232, 1: 232})


### SMOTE
SMOTE, also known as Synthetic Minority Oversampling Technique, is a form of oversampling our minority class data. Unlike random oversampling, instead of multiplying the underrepresented data, SMOTE generates new, similar records for it. It borrows the idea behind k-nearest neighbours: for a minority record it looks up its k_neighbors nearest minority neighbours, picks one of them and interpolates between the two to generate a new, artificial point (record).   
![SMOTE](https://miro.medium.com/v2/resize:fit:734/1*yRumRhn89acByodBz0H7oA.png)


***Pros***:    The data is no longer imbalanced and we neither lost majority records nor multiplied the minority ones    
***Cons***:    Our model is learning on synthetic data; SMOTE might also have problems with higher dimensions and many features
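
To show what "interpolating between neighbours" means, here is a simplified sketch of the idea behind SMOTE. It is not imblearn's implementation, only the core step written by hand; the minority records are made-up random points and the number of generated records (10) is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Hypothetical minority-class records with two features
minority = rng.normal(loc=0.0, scale=1.0, size=(20, 2))

k_neighbors = 4
# +1 because every point is returned as its own nearest neighbour
nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)

synthetic = []
for _ in range(10):
    i = rng.integers(len(minority))      # pick a random minority record
    j = rng.choice(neighbor_idx[i, 1:])  # pick one of its k nearest neighbours
    gap = rng.random()                   # random position between the two records
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))

synthetic = np.array(synthetic)
print(synthetic.shape)
```

Every new point lies on the line segment between two existing minority records, which is why SMOTE produces plausible records but never anything outside the region the minority class already covers.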

In [59]:
# let's use SMOTE on our titanic dataset
sm = SMOTE(random_state=42,
           sampling_strategy='auto',
           k_neighbors=4 # choosing the number of neighbours, 5 is default
          )

X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

print('Resampled dataset shape %s' % Counter(y_train_sm))

Resampled dataset shape Counter({0: 339, 1: 339})


### TomekLinks
Tomek Links is an alternative to random undersampling: it does not remove random records from the majority class, but the majority records that are closest to the minority class records. It finds pairs of records from opposite classes that are each other's nearest neighbours and removes the majority class record from each pair.
![TomekLinks](https://mlwhiz.com/images/imbal/1.png)

***Pros***:    The classes are separated more cleanly, which helps with decision boundaries.     
***Cons***:    The data usually stays imbalanced, since only the linked majority records are removed; our model might also not learn subtleties of the true decision boundaries.
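
The pair-finding step can be sketched by hand. The snippet below is a simplified illustration on a toy one-dimensional dataset (all values are assumptions), not imblearn's implementation: a pair counts as a Tomek link when the two records are each other's nearest neighbour and belong to opposite classes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 1-D data: a majority cluster, a minority cluster and one majority record
# (index 3) sitting right next to a minority record (index 4)
X_demo = np.array([[0.0], [0.4], [1.0], [4.9], [5.0], [6.2], [7.0]])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1])  # 0 = majority, 1 = minority

# Nearest neighbour of every record, skipping the record itself (column 0)
nn = NearestNeighbors(n_neighbors=2).fit(X_demo)
nearest = nn.kneighbors(X_demo, return_distance=False)[:, 1]

# mutual nearest neighbours from opposite classes; i < j avoids listing a pair twice
tomek_links = [
    (i, j) for i, j in enumerate(nearest)
    if nearest[j] == i and y_demo[i] != y_demo[j] and i < j
]
majority_to_drop = [i if y_demo[i] == 0 else j for i, j in tomek_links]

print(tomek_links, majority_to_drop)
```

Only the pair formed by the records at indices 3 and 4 qualifies here, so a single majority record would be removed and the class sizes would barely change.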

In [60]:
# let's use TomekLinks on our titanic dataset
tl = TomekLinks(sampling_strategy='auto')

X_train_tl, y_train_tl = tl.fit_resample(X_train, y_train)

print('Resampled dataset shape %s' % Counter(y_train_tl))

Resampled dataset shape Counter({0: 296, 1: 232})


## Balancing the dataset with multiclasses
While balancing binary classes of majority and minority is quite easy, we might run into trouble when we want to classify with more than two classes. It's harder not to lose patterns, and it's also more difficult to keep most of the training data and its specifics. 

While we can use over- or undersampling with their disadvantages, methods like SMOTE might also be less than ideal. With two classes it's easier to differentiate the patterns and boundaries, so SMOTE or TomekLinks will be good for that; with multiple classes, however, using these methods might cause class overlapping, which will actually worsen our model. Hence we must be very careful when deciding on the methods to choose here.

And of course, we have ***alternatives***.

Instead of using resampling methods, we might use model-level methods. 
For models these might include:
- **Updating the loss function**, which penalizes the wrong classifications of the minority class more than wrong classification of majority classes - it forces the model to treat specific classes with more weight than others;
- **Selecting appropriate algorithms** that do well with imbalanced data, e.g. tree-based models such as Decision Trees and Random Forest, gradient-boosted trees (XGBoost, CatBoost) or other ensemble methods (like AdaBoost or Bagging) - these models usually contain some kind of weight hyperparameters;
- **Combining model with small over- or undersampling**, to better the position of minority class(es), but not evening it out too much.
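
As a minimal sketch of the class-weight idea, scikit-learn can compute "balanced" weights that are inversely proportional to the class frequencies. The three-class data below is synthetic (the class shares are assumptions) and `LogisticRegression` is only an example of a model that accepts `class_weight`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic 3-class data with uneven class sizes (an assumption for the demo)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=42,
)

# "balanced" weights: n_samples / (n_classes * number of records in the class)
classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))

# Many scikit-learn classifiers accept the same setting directly
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_demo, y_demo)
```

The rarest class gets the largest weight, so its misclassifications cost the model more during training, without adding or removing a single record.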

## Information Value (IV) and Weight of Evidence (WoE)

*To be explained in the future*