## Foundations: Split data into train, validation, and test sets

This notebook uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we split the data into train, validation, and test sets in preparation for fitting a basic model in the next section.

### Read in Data

In [3]:
# Load the dataset that we cleaned earlier; it is now ready for modeling.

# pandas reads the CSV file, and train_test_split from scikit-learn
# splits the data so that we can train and evaluate a model.

import pandas as pd
from sklearn.model_selection import train_test_split

titanic = pd.read_csv('titanic_cleaned.csv')
titanic.head()

#titanic.drop('Unnamed',axis=1,inplace=True)

Unnamed: 0.1,Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Family_cnt,Cabin_int
0,0,0,3,0,22.0,7.25,1,0
1,1,1,1,1,38.0,71.2833,1,1
2,2,1,3,1,26.0,7.925,0,0
3,3,1,1,1,35.0,53.1,1,1
4,4,0,3,0,35.0,8.05,0,0


In [7]:
# The cleaned file contains an extra index column, so we drop it from the dataset
titanic.drop('Unnamed: 0', axis=1, inplace=True)

In [8]:
titanic.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Family_cnt,Cabin_int
0,0,3,0,22.0,7.25,1,0
1,1,1,1,38.0,71.2833,1,1
2,1,3,1,26.0,7.925,0,0
3,1,1,1,35.0,53.1,1,1
4,0,3,0,35.0,8.05,0,0
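
The output above still lists an `Unnamed` column. Columns like this are typically a row index that was written to the CSV because the frame was saved without `index=False`, and if left in place it becomes a feature in the next step. A minimal sketch of one way to drop every such column at once, assuming they all start with `Unnamed`; the small frame `demo` is made up for illustration and stands in for `titanic`:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned Titanic frame, with two leftover index columns.
demo = pd.DataFrame({
    'Unnamed: 0.1': [0, 1, 2],
    'Unnamed: 0': [0, 1, 2],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
})

# Boolean mask: True for every column whose name does not start with 'Unnamed'.
keep = ~demo.columns.str.startswith('Unnamed')
demo_clean = demo.loc[:, keep]

print(list(demo_clean.columns))
```

Filtering by prefix avoids having to know the exact names (`Unnamed: 0`, `Unnamed: 0.1`, and so on) that pandas generated.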


### Split into train, validation, and test sets

![Split Data](../../img/split_data.png)

In [11]:
# First separate the features from the labels:
# the features are every column except the target,
# and the labels are the target column, 'Survived'.


# The goal is to divide the dataset into training, validation, and test sets:
# we train each model on the training set,
# compare the candidate models on the validation set,
# and then evaluate the chosen model once on the test set.

features = titanic.drop('Survived', axis=1)
labels = titanic['Survived']

#X_train, X_test, y_train, y_test = train_test_split(features,labels,)

In [14]:
# Split the dataset into the three sets described earlier

# train_test_split only produces two parts per call, so we call it twice.
# First, hold out 40% of the rows; the remaining 60% is the training set.
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)

# Then split the 40% holdout in half to get the validation and test sets (20% each)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
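
The two calls can be wrapped in a small helper so the split sizes are stated as fractions of the full dataset. This is only a sketch: the function name `split_train_val_test` and the demo data are hypothetical and not part of the notebook; only `train_test_split` comes from scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val_test(features, labels, val_size=0.2, test_size=0.2, random_state=42):
    """Split into train/validation/test; sizes are fractions of the full dataset."""
    holdout_size = val_size + test_size
    # First call: separate the training set from the combined holdout.
    X_train, X_hold, y_train, y_hold = train_test_split(
        features, labels, test_size=holdout_size, random_state=random_state)
    # Second call: the test set's share of the holdout (0.5 for a 20/20 split).
    test_share = test_size / holdout_size
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_share, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Made-up data: 100 rows, one feature, alternating labels.
demo_X = pd.DataFrame({'x': range(100)})
demo_y = pd.Series([0, 1] * 50)

parts = split_train_val_test(demo_X, demo_y)
print([len(p) for p in parts[:3]])
```

The second `test_size` is relative to the holdout, not to the full dataset, which is why 0.5 of the 40% holdout yields 20% of all rows.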


In [15]:
# Check the result of the split

# Print the fraction of the full dataset that ended up in each set
# (in order: train, validation, test)

for dataset in [y_train, y_val, y_test]:
    print(round(len(dataset) / len(labels), 2))

0.6
0.2
0.2


In [None]:
# The dataset is now split into 60% train, 20% validation, and 20% test
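
Matching row counts do not guarantee that each set contains a similar share of survivors. As an optional check, the sketch below prints the size and positive rate of each set, and passes `stratify` to `train_test_split` to keep the class proportions aligned. The notebook itself does not stratify, and the demo data here is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 60 positives out of 200 rows (30%).
demo_y = pd.Series([1] * 60 + [0] * 140)
demo_X = pd.DataFrame({'x': range(200)})

# stratify keeps the class proportions (nearly) equal in both parts of each split.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    demo_X, demo_y, test_size=0.4, random_state=42, stratify=demo_y)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

# Report the size and positive rate of each set.
for name, part in [('train', y_tr), ('val', y_va), ('test', y_te)]:
    print(name, len(part), round(part.mean(), 2))
```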

### Write out all data

In [17]:
X_train.to_csv('train_features.csv', index=False)
X_val.to_csv('val_features.csv', index=False)
X_test.to_csv('test_features.csv', index=False)

y_train.to_csv('train_labels.csv', index=False)
y_val.to_csv('val_labels.csv', index=False)
y_test.to_csv('test_labels.csv', index=False)

In [None]:
# The features and labels for the train, validation, and test sets are now saved as separate CSV files
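
A quick round trip can confirm that files written with `index=False` read back without an extra index column. The sketch below uses a temporary directory and made-up data rather than the real splits, and assumes a recent pandas version in which `Series.to_csv` writes the series name as a header row.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-ins for one feature split and its labels.
demo_X = pd.DataFrame({'Pclass': [3, 1, 3], 'Fare': [7.25, 71.2833, 7.925]})
demo_y = pd.Series([0, 1, 1], name='Survived')

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, 'train_features.csv')
    y_path = os.path.join(tmp, 'train_labels.csv')
    demo_X.to_csv(x_path, index=False)
    demo_y.to_csv(y_path, index=False)

    X_back = pd.read_csv(x_path)
    # The labels file has a single column; select it to get a Series again.
    y_back = pd.read_csv(y_path)['Survived']

print(list(X_back.columns), X_back.shape, len(y_back))
```

Because the index is not written, no `Unnamed: 0` column appears when the files are loaded in the next section.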
