# Data Split: Train, Validation, Test sets

---

## Fundamentals of ETL: data extraction, transformation and loading


Applied Mathematical Modeling in Banking

---


# 1. What are Train, Validation, and Test datasets


Before fitting a model, and before some feature engineering stages, we should split our dataset into 2 or 3 parts:

- [x] `Training` dataset: The sample of data used to fit the model.

The model sees and learns from this data.

- [x] `Validation` dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

The validation set is used to evaluate a given model, and it is used frequently. As machine learning engineers, we use this data to fine-tune the model hyperparameters. Hence the model occasionally sees this data but never learns from it directly. We use the validation set results to update higher-level hyperparameters, so the validation set affects the model, but only indirectly. The validation set is also known as the dev set or development set, which makes sense because it helps during the development stage of the model.

- [x] `Testing` dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

The test dataset provides the gold standard used to evaluate the model. It is used only once the model is completely trained (using the train and validation sets). The test set is generally what is used to evaluate competing models. For example, in many Kaggle competitions the validation set is released initially along with the training set, the actual test set is released only when the competition is about to close, and the result of the model on the test set decides the winner. The validation set is often used as the test set, but this is not good practice. The test set is generally well curated: it contains carefully sampled data that spans the various classes the model would face in the real world.

![](assets/images/03/train_test1.png)

Some papers split the data only into `train`/`test`. In this case, `test` means `validation`.

![](assets/images/03/train_test2.png)
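
A three-way split can be sketched in base R as follows. This is only a minimal illustration on a synthetic data frame: the 60/20/20 ratio, the `df` object, and the variable names are assumptions chosen for the example, not part of the course dataset.

```r
set.seed(2021)  # fixed seed so the split is reproducible

# Synthetic data: 100 rows (hypothetical, for illustration only)
df <- data.frame(id = 1:100, x = rnorm(100))

n <- nrow(df)
shuffled <- sample(n)        # random permutation of the row numbers

n_train <- round(0.6 * n)    # 60% for training
n_valid <- round(0.2 * n)    # 20% for validation, the rest for testing

train_idx <- shuffled[1:n_train]
valid_idx <- shuffled[(n_train + 1):(n_train + n_valid)]
test_idx  <- shuffled[(n_train + n_valid + 1):n]

train <- df[train_idx, ]
valid <- df[valid_idx, ]
test  <- df[test_idx, ]

# Sizes of the three parts
c(train = nrow(train), valid = nrow(valid), test = nrow(test))
```

Shuffling the row numbers once and cutting the permutation into consecutive pieces guarantees that the three parts never overlap.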

---

# 2. Splitting data in R

Let's set some conditions before we start studying the data splitting functions in R:

- [x] We will use the same seed for all splits to make the results reproducible, for example 2021.
- [x] We will use the `Telco Customer Churn` dataset for client churn prediction: https://www.kaggle.com/blastchar/telco-customer-churn
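
The effect of a fixed seed can be illustrated with a short sketch. The names `first_draw` and `second_draw` are hypothetical and used only for this example.

```r
set.seed(2021)
first_draw <- sample(10, 3)    # draw 3 numbers out of 1..10

set.seed(2021)                 # resetting the seed restarts the random sequence
second_draw <- sample(10, 3)

identical(first_draw, second_draw)  # TRUE: same seed, same draw
```

Without the second `set.seed()` call, the two draws would generally differ.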

Short dataset description:

- [x] `customerID` - Customer ID
- [x] `gender` - Whether the customer is a male or a female
- [x] `SeniorCitizen` - Whether the customer is a senior citizen or not (1, 0)
- [x] `Partner` - Whether the customer has a partner or not (Yes, No)
- [x] `Dependents` - Whether the customer has dependents or not (Yes, No)
- [x] `tenure` - Number of months the customer has stayed with the company
- [x] `PhoneService` - Whether the customer has a phone service or not (Yes, No)
- [x] `MultipleLines` - Whether the customer has multiple lines or not (Yes, No, No phone service)
- [x] `InternetService` - Customer’s internet service provider (DSL, Fiber optic, No)
- [x] `OnlineSecurity` - Whether the customer has online security or not (Yes, No, No internet service)
- [x] `OnlineBackup` - Whether the customer has online backup or not (Yes, No, No internet service)
- [x] `DeviceProtection` - Whether the customer has device protection or not (Yes, No, No internet service)
- [x] `TechSupport` - Whether the customer has tech support or not (Yes, No, No internet service)
- [x] `StreamingTV` - Whether the customer has streaming TV or not (Yes, No, No internet service)
- [x] `StreamingMovies` - Whether the customer has streaming movies or not (Yes, No, No internet service)
- [x] `Contract` - The contract term of the customer (Month-to-month, One year, Two year)
- [x] `PaperlessBilling` - Whether the customer has paperless billing or not (Yes, No)
- [x] `PaymentMethod` - The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- [x] `MonthlyCharges` - The amount charged to the customer monthly
- [x] `TotalCharges` - The total amount charged to the customer
- [x] `Churn` - Whether the customer churned or not (Yes or No)

In [1]:
# read data
telecom_users <- read.csv("data/telecom_users.csv")
head(telecom_users)

| | X | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | ... | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` |
| 1 | 1869 | 7010-BRBUU | Male | 0 | Yes | Yes | 72 | Yes | Yes | No | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Credit card (automatic) | 24.1 | 1734.65 | No |
| 2 | 4528 | 9688-YGXVR | Female | 0 | No | No | 44 | Yes | No | Fiber optic | ... | Yes | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 88.15 | 3973.2 | No |
| 3 | 6344 | 9286-DOJGF | Female | 1 | Yes | No | 38 | Yes | Yes | Fiber optic | ... | No | No | No | No | Month-to-month | Yes | Bank transfer (automatic) | 74.95 | 2869.85 | Yes |
| 4 | 6739 | 6994-KERXL | Male | 0 | No | No | 4 | Yes | No | DSL | ... | No | No | No | Yes | Month-to-month | Yes | Electronic check | 55.9 | 238.5 | No |
| 5 | 432 | 2181-UAESM | Male | 0 | No | No | 2 | Yes | No | DSL | ... | Yes | No | No | No | Month-to-month | No | Electronic check | 53.45 | 119.5 | No |
| 6 | 2215 | 4312-GVYNH | Female | 0 | Yes | No | 70 | No | No phone service | DSL | ... | Yes | Yes | No | Yes | Two year | Yes | Bank transfer (automatic) | 49.85 | 3370.2 | No |


Let's check the proportions of `Churn == Yes` and `Churn == No` in the dataset with the `CrossTable()` function from the `gmodels` package.

In [5]:
#install.packages("gmodels")




In [2]:
library(gmodels)
CrossTable(telecom_users$Churn)


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  5986 

 
          |        No |       Yes | 
          |-----------|-----------|
          |      4399 |      1587 | 
          |     0.735 |     0.265 | 
          |-----------|-----------|



 


You can also use `CrossTable()` to check cross proportions by other fields. Let's check the cross table for `Churn` and `TechSupport`:

In [11]:
CrossTable(telecom_users$Churn, telecom_users$TechSupport) # for example


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  5986 

 
                    | telecom_users$TechSupport 
telecom_users$Churn |                  No | No internet service |                 Yes |           Row Total | 
--------------------|---------------------|---------------------|---------------------|---------------------|
                 No |                1738 |                1192 |                1469 |                4399 | 
                    |              87.892 |              62.377 |              29.512 |                     | 
                    |               0.395 |               0.271 |               0.334 |               0.735 | 
                    |               0.587 |               0.923 |               0.847 |                     | 
                    |       

You can see that most churned customers, 1222 of 1587, had no tech support (`TechSupport == No`).
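
Row-wise shares like the one above can also be computed with base R's `table()` and `prop.table()`. The sketch below uses small toy vectors (hypothetical values, not taken from the Telco dataset) to show the idea.

```r
# Toy data (hypothetical values, for illustration only)
churn        <- c("Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes")
tech_support <- c("No",  "No",  "Yes", "Yes", "No", "Yes", "Yes", "No")

# Two-way table of counts
counts <- table(Churn = churn, TechSupport = tech_support)
counts

# margin = 1 gives shares within each row, i.e. how churned and retained
# customers are distributed over the TechSupport categories
row_share <- prop.table(counts, margin = 1)
round(row_share, 3)
```

With `margin = 2` the shares are computed within each column instead, which corresponds to the `N / Col Total` line of `CrossTable()`.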

Next, we will look at several ways to split the data into train and test sets.

---

## 2.1. Split with `sample()`

In [10]:
set.seed(2021)

sample_size <- round(nrow(telecom_users) * 0.70) # 70% of the rows go to train
print(paste0("Size: ", sample_size))

# row numbers drawn at random, without replacement
index <- sample(nrow(telecom_users), size = sample_size)

train <- telecom_users[index, ]  # rows whose numbers are in index
test <- telecom_users[-index, ]  # -index keeps only the rows not in index

[1] "Size: 4190"
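
After an index-based split it is worth checking that the two parts do not overlap and together cover the whole dataset. A minimal sketch of such a check, using a small hypothetical data frame `df` instead of `telecom_users`:

```r
set.seed(2021)

# Hypothetical data frame standing in for the real dataset
df <- data.frame(id = 1:50, y = rep(c("No", "Yes"), 25))

index <- sample(nrow(df), size = round(nrow(df) * 0.70))
train <- df[index, ]
test  <- df[-index, ]

# No row may appear in both parts ...
no_overlap <- length(intersect(train$id, test$id)) == 0
# ... and together the parts must cover the whole dataset
full_cover <- nrow(train) + nrow(test) == nrow(df)

c(no_overlap = no_overlap, full_cover = full_cover)
```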


In [11]:
# check Churn == Yes/No proportion in train/test
CrossTable(train$Churn)


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  4190 

 
          |        No |       Yes | 
          |-----------|-----------|
          |      3099 |      1091 | 
          |     0.740 |     0.260 | 
          |-----------|-----------|



 


In [12]:
# check Churn == Yes/No proportion in train/test
CrossTable(test$Churn)


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  1796 

 
          |        No |       Yes | 
          |-----------|-----------|
          |      1300 |       496 | 
          |     0.724 |     0.276 | 
          |-----------|-----------|



 


The share of `Churn == Yes` is 0.260 in train and 0.276 in test. The difference is 1.6 percentage points, so the proportions are close.
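
The comparison can be reproduced from the counts in the `CrossTable()` output above. The variable names in this sketch are hypothetical.

```r
# Counts of Churn == Yes and total rows, taken from the CrossTable() output
train_yes <- 1091; train_n <- 4190
test_yes  <- 496;  test_n  <- 1796

train_share <- train_yes / train_n
test_share  <- test_yes / test_n

round(train_share, 3)                        # 0.26
round(test_share, 3)                         # 0.276
round(100 * (test_share - train_share), 1)   # 1.6 (percentage points)
```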

---

## 2.3. Split with `sample_frac` from `dplyr`

In [36]:
library(dplyr)
set.seed(2021)

# Add a row Id, then create a 70/30 split into train and test

tu <- telecom_users %>% mutate(Id = row_number())

train <- tu %>% sample_frac(.70)
test <- tu[-train$Id, ]

In [39]:
# check Churn == Yes/No proportion in train/test
CrossTable(train$Churn)
CrossTable(test$Churn)


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  4190 

 
          |        No |       Yes | 
          |-----------|-----------|
          |      3099 |      1091 | 
          |     0.740 |     0.260 | 
          |-----------|-----------|



 

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  1796 

 
          |        No |       Yes | 
          |-----------|-----------|
          |      1300 |       496 | 
          |     0.724 |     0.276 | 
          |-----------|-----------|



 


`sample_frac()` with the same seed gives the same proportions of `Churn == Yes/No` as `sample()`: 0.260 in train and 0.276 in test, a difference of 1.6 percentage points.
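
In `dplyr` 1.0.0 and later, `sample_frac()` is superseded by `slice_sample()`, and the test part can be taken with `anti_join()` instead of negative indexing. A minimal sketch, assuming `dplyr` >= 1.0.0 is installed and using a hypothetical data frame `df` in place of `telecom_users`:

```r
library(dplyr)
set.seed(2021)

# Hypothetical data frame standing in for telecom_users
df <- data.frame(value = rnorm(20)) %>%
  mutate(Id = row_number())

# Draw about 70% of the rows for train
train <- df %>% slice_sample(prop = 0.70)

# Keep the rows whose Id does not appear in train
test <- df %>% anti_join(train, by = "Id")

nrow(train)
nrow(test)
```

`anti_join()` matches on the `Id` column, so it still works if the rows have been reordered or filtered, which is not the case for `tu[-train$Id, ]`.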

---

## 2.5. Split with `createDataPartition()` from `caret`

In [41]:
#install.packages("caret")




In [45]:
library(caret)
set.seed(2021)
 
index = createDataPartition(telecom_users$Churn, p = 0.70, list = FALSE)
train = telecom_users[index, ]
test = telecom_users[-index, ]

In [46]:
# check Churn == Yes/No proportion in train/test
CrossTable(train$Churn)
CrossTable(test$Churn)


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  4191 

 
          |        No |       Yes | 
          |-----------|-----------|
          |      3080 |      1111 | 
          |     0.735 |     0.265 | 
          |-----------|-----------|



 

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  1795 

 
          |        No |       Yes | 
          |-----------|-----------|
          |      1319 |       476 | 
          |     0.735 |     0.265 | 
          |-----------|-----------|



 


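Check the proportion of the target variable: `createDataPartition()` samples within each class of `Churn`, so train and test both keep the original 0.735 / 0.265 proportions. This makes it one of the most convenient split methods in R for classification tasks.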
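
The idea of sampling within each class (a stratified split) can be sketched in base R. This is an illustration of the principle on a hypothetical outcome vector `y`, not `caret`'s actual implementation.

```r
set.seed(2021)

# Hypothetical imbalanced outcome: 70 "No" and 30 "Yes"
y <- rep(c("No", "Yes"), times = c(70, 30))

# Row numbers grouped by class
idx_by_class <- split(seq_along(y), y)

# Draw 70% of the rows inside each class separately
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}), use.names = FALSE)

prop.table(table(y[train_idx]))    # class shares in train
prop.table(table(y[-train_idx]))   # class shares in test
```

Because 70% is taken from each class separately, both parts keep the 70/30 class ratio of the full vector.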

---

## 2.6. Split with `sample.split` from `caTools`

In [47]:
install.packages("caTools")




In [66]:
library(caTools)
 
set.seed(2021)
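# logical vector: TRUE = row goes to train (named split_mask to avoid masking base::sample)
split_mask = sample.split(telecom_users$Churn, SplitRatio = .70)

train = telecom_users[split_mask, ]
test  = telecom_users[!split_mask, ]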

In [65]:
# check Churn == Yes/No proportion in train/test
CrossTable(train$Churn)
CrossTable(test$Churn)


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  4190 

 
          |        No |       Yes | 
          |-----------|-----------|
          |      3079 |      1111 | 
          |     0.735 |     0.265 | 
          |-----------|-----------|



 

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  1796 

 
          |        No |       Yes | 
          |-----------|-----------|
          |      1320 |       476 | 
          |     0.735 |     0.265 | 
          |-----------|-----------|
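
`sample.split()` returns a logical vector rather than row numbers, so the two parts are selected with the mask and its negation. A minimal sketch on the built-in `iris` data, assuming the `caTools` package is installed (the name `mask` is hypothetical):

```r
library(caTools)
set.seed(2021)

# TRUE = row goes to train, FALSE = row goes to test
mask <- sample.split(iris$Species, SplitRatio = 0.70)

train <- iris[mask, ]
test  <- iris[!mask, ]

table(train$Species)   # class counts in train
table(test$Species)    # class counts in test
```

Since the split is done within each level of `Species`, the class ratio is preserved in both parts.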



 


---

## This is not needed for our course yet! Let's move on to the next topic


# 3. Splitting into n folds

In [29]:
# read data again
library(caret)
telecom_users <- read.csv("data/telecom_users.csv")
nrow(telecom_users)
head(telecom_users)

| | X | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | ... | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` |
| 1 | 1869 | 7010-BRBUU | Male | 0 | Yes | Yes | 72 | Yes | Yes | No | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Credit card (automatic) | 24.1 | 1734.65 | No |
| 2 | 4528 | 9688-YGXVR | Female | 0 | No | No | 44 | Yes | No | Fiber optic | ... | Yes | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 88.15 | 3973.2 | No |
| 3 | 6344 | 9286-DOJGF | Female | 1 | Yes | No | 38 | Yes | Yes | Fiber optic | ... | No | No | No | No | Month-to-month | Yes | Bank transfer (automatic) | 74.95 | 2869.85 | Yes |
| 4 | 6739 | 6994-KERXL | Male | 0 | No | No | 4 | Yes | No | DSL | ... | No | No | No | Yes | Month-to-month | Yes | Electronic check | 55.9 | 238.5 | No |
| 5 | 432 | 2181-UAESM | Male | 0 | No | No | 2 | Yes | No | DSL | ... | Yes | No | No | No | Month-to-month | No | Electronic check | 53.45 | 119.5 | No |
| 6 | 2215 | 4312-GVYNH | Female | 0 | Yes | No | 70 | No | No phone service | DSL | ... | Yes | Yes | No | Yes | Two year | Yes | Bank transfer (automatic) | 49.85 | 3370.2 | No |


In [33]:
set.seed(2021)
# createFolds() expects the outcome vector, not the whole data frame
folds <- createFolds(telecom_users$Churn, k = 10)
folds

In [34]:
#install.packages("mlbench") # provides the Sonar dataset
library(caret)
library(mlbench)
data(Sonar)
 
folds <- createFolds(Sonar$Class)
str(folds)
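
How folds are used in cross-validation can be sketched in base R without any extra packages. The values of `n` and `k` and the variable names below are assumptions for the example. Unlike `createFolds()`, this simple version does not stratify by the outcome.

```r
set.seed(2021)

n <- 100   # hypothetical number of rows
k <- 5     # number of folds

# Assign each row to one of k folds of (almost) equal size, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  test_idx  <- which(fold_id == i)   # fold i is held out for evaluation
  train_idx <- which(fold_id != i)   # the other k - 1 folds are used for fitting
  cat("Fold", i, ": train =", length(train_idx),
      ", test =", length(test_idx), "\n")
}
```

Each row is held out exactly once, so every observation is used for both fitting and evaluation across the k iterations.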


## References

1. About Train, Validation and Test Sets in Machine Learning by Tarang Shah. Url: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7