# Witful ML 07 - Train-Test Split
by Kaan Kabalak, Editor In Chief @ witfuldata.com

# What is train_test split and why is it necessary?

The main goal of most machine learning models is to make predictions based on data. In real-world use, ML models are expected to work with data that is completely new to them: they are trained on a certain amount of data, but they make their predictions on data they have never seen. This presents a dilemma between two options:

- We train/fit the model with all the data we have. The model tunes its parameters to that dataset. This makes the model perform very well on that particular dataset but makes it unreliable on new data, because the model has 'learned' the details of a single dataset, including its noise. It fails when it meets different patterns in new data. (This is called 'Overfitting'.)


- We train/fit the model with very little data to avoid this overfitting issue. This also hurts the model, as it won't have access to the amount of data it needs to learn the underlying patterns. (This is called 'Underfitting'.)

To ensure that our models learn enough and perform well on new data, we must set a balance between how they are trained on existing data and how they are tested with new data.

But wait! How can we do this if we have no new data? How can we produce new data when all we have is a single dataset?

Train-test split is here to help! We split our dataset, train the model with one part of it and reserve the rest, never showing it to the model during training. Scoring the model on the reserved part shows how it would perform on brand new data.
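Conceptually, the split is a shuffle followed by a cut. The following is a minimal sketch of that idea in plain NumPy, not scikit-learn's actual implementation; the toy arrays and the `_demo` names are made up for illustration.

```python
import numpy as np

# Toy data (hypothetical): 10 samples with 2 features each, plus one label per sample
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Shuffle the row indices so the cut does not depend on the original row order
rng = np.random.default_rng(seed=0)
shuffled_idx = rng.permutation(len(X_demo))

# Reserve 20% of the rows for testing and keep the rest for training
n_test = int(np.ceil(0.2 * len(X_demo)))
test_idx, train_idx = shuffled_idx[:n_test], shuffled_idx[n_test:]

X_train_demo, y_train_demo = X_demo[train_idx], y_demo[train_idx]
X_test_demo, y_test_demo = X_demo[test_idx], y_demo[test_idx]

print(X_train_demo.shape, X_test_demo.shape)  # (8, 2) (2, 2)
```

No row ends up in both parts, which is what lets the test part stand in for "new" data.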

Let's see how it is all done in Python with scikit-learn.

In [1]:
#Import
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
#Data frame
diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df = diabetes_df.astype(float)
diabetes_df.head(5)

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 0.0 | 26.6 | 0.351 | 31.0 | 0.0 |
| 2 | 8.0 | 183.0 | 64.0 | 0.0 | 0.0 | 23.3 | 0.672 | 32.0 | 1.0 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21.0 | 0.0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33.0 | 1.0 |


## Without the split

In [3]:
#X-y variables
X = diabetes_df.drop("Outcome",axis=1).values
y = diabetes_df["Outcome"].values

In [4]:
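#Instantiate
k_model = KNeighborsClassifier(n_neighbors=3)

#Fit on the whole dataset
k_model.fit(X, y)

#Predict on the same data the model was fitted on
k_predictions = k_model.predict(X)

#Score (accuracy_score expects the true labels first, then the predictions)
non_split_score = accuracy_score(y, k_predictions)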

## With the split

In [5]:
#Train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=3)

#### test_size

It determines the proportion of the data that is kept out of model training and reserved for testing. Here, 0.2 means that 20% of the data will be reserved as the test set.
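As a quick check of what test_size does, the sketch below splits a toy array of 100 rows (made-up data, not the diabetes dataset) and prints the resulting shapes. It also shows that test_size accepts an integer, in which case it is read as an absolute number of rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# A float test_size is a fraction of the rows: 0.2 of 100 rows is 20 rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)

# An integer test_size is an absolute number of rows
X_tr_n, X_te_n, y_tr_n, y_te_n = train_test_split(X_demo, y_demo, test_size=25, random_state=0)
print(X_tr_n.shape, X_te_n.shape)  # (75, 2) (25, 2)
```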

#### stratify
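If there is an imbalance between the classes (more positives than negatives, or more negatives than positives), a purely random split can leave the training and test sets with different class proportions. Passing stratify=y keeps the proportion of each class roughly the same in both sets as in the full dataset.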
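The effect is easiest to see on deliberately imbalanced labels. The sketch below uses made-up data with 90 negatives and 10 positives and counts the positives that land in each part, with and without stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (hypothetical): 90 negatives and 10 positives
X_demo = np.arange(100).reshape(100, 1)
y_demo = np.array([0] * 90 + [1] * 10)

# With stratify, both parts keep the 10% share of positives
_, _, y_train_s, y_test_s = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
print(y_train_s.sum(), y_test_s.sum())  # 8 2

# Without stratify, the number of positives in the test set is left to chance
_, _, y_train_r, y_test_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(y_train_r.sum(), y_test_r.sum())
```

With only a handful of positives, an unlucky unstratified split could leave the test set with very few of them, which makes the score hard to interpret.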


#### random_state

Random state fixes the seed of the shuffle that happens before the split. It matters little if you are going to split the dataset only once and never need to reproduce the result. It becomes important when the same split has to be reused, for example to compare different models on equal terms: with the same random_state, train_test_split shuffles the dataset in the same way and produces the same training and test sets every time.
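A small sketch of this reproducibility, again on made-up toy arrays: two calls with the same random_state return identical test sets, while a call without it draws a fresh shuffle each time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# Two calls with the same random_state give the same split
_, X_test_a, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=3)
_, X_test_b, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=3)
print(np.array_equal(X_test_a, X_test_b))  # True

# Without random_state, each call draws a fresh shuffle, so the split may differ between runs
_, X_test_c, _, y_test_c = train_test_split(X_demo, y_demo, test_size=0.25)
```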

In [6]:
#Instantiate the model
k_model_split = KNeighborsClassifier(n_neighbors=3)

#Fit the model with the training set
k_model_split.fit(X_train,y_train)

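#Predict with the model that was fitted on the training set only
k_predictions_split = k_model_split.predict(X_test)

In [7]:
#Score
split_score = accuracy_score(y_test, k_predictions_split)

In [8]:
#The difference between scores
non_split_score - split_score

The score without the split is usually the more optimistic of the two, because that model is scored on rows it has already seen. The score on the test set is the better estimate of how the model performs on new data. The exact difference depends on the dataset and on the split. It may seem unimportant when it is small, but bear in mind that even the smallest differences matter in large-scale projects.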
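To make the gap between "seen" and "unseen" scores visible, here is a hedged sketch on synthetic data rather than the diabetes dataset. It deliberately uses n_neighbors=1 instead of the 3 used above, because a one-neighbor model simply memorizes its training rows; the `memorizer` name and the make_classification settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (hypothetical stand-in for a real dataset), with 10% label noise
X_syn, y_syn = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# With one neighbor, every training row is its own nearest neighbor
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_score = accuracy_score(y_tr, memorizer.predict(X_tr))
test_score = accuracy_score(y_te, memorizer.predict(X_te))

print(train_score)  # 1.0, as long as no two training rows are identical
print(test_score)   # depends on the data and the split
```

A perfect score on the training rows says nothing about new data here; only the test score does.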