# Prepare Data: Split data into train, validation, and test sets

This notebook uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition (we are only using the training set).

In this section, we will split the data into train, validation, and test sets in preparation for fitting a basic model in the next section.

## Read in Data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
titanic = pd.read_csv('../Data/titanic_cleaned.csv')

In [3]:
titanic.head()

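| Unnamed: 0.1 | Unnamed: 0 | Survived | Pclass | Sex | Age | Fare | Family_cnt | Cabin_ind |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 0 | 22.0 | 7.25 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 2 | 1 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 3 | 1 | 1 | 1 | 35.0 | 53.1 | 1 | 1 |
| 4 | 4 | 0 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |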
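
The `Unnamed: ...` headers in the output above look like row indices left behind by an earlier `to_csv` call made without `index=False` (one of them may simply be the displayed DataFrame index). That is an assumption, since the file itself is not shown here. If it holds, any such column would survive `titanic.drop('Survived', axis=1)` and reach the model as a feature. A minimal sketch of removing them, where `drop_index_artifacts` is a hypothetical helper and `demo` is a made-up stand-in frame:

```python
import pandas as pd


def drop_index_artifacts(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns whose names start with 'Unnamed'."""
    # Hypothetical helper: assumes such columns are saved row indices
    artifact_cols = [col for col in df.columns if str(col).startswith('Unnamed')]
    return df.drop(columns=artifact_cols)


# Made-up stand-in frame that mimics part of the column layout shown above
demo = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'Survived': [0, 1],
    'Pclass': [3, 1],
})

cleaned = drop_index_artifacts(demo)
print(cleaned.columns.tolist())
```

Whether to apply this to `titanic` depends on confirming that the columns really are index residue and carry no information.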


## Split into train, validation, and test sets

In [5]:
features = titanic.drop('Survived', axis=1)
labels = titanic['Survived']

In [6]:
# First split: keep 60% of the rows for training and hold out the other 40%
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)

In [7]:
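# Second split: divide the 40% hold-out evenly into validation and test sets
# (20% of the full data each). X_test and y_test are reassigned here.
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)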
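
The two calls above give a 60/20/20 split because the second `test_size` is a fraction of the hold-out rather than of the full dataset: 0.5 of 40% is 20%. The sketch below wraps that arithmetic in a single function. `three_way_split` and its parameters are hypothetical names introduced for illustration, and the toy lists stand in for `features` and `labels`.

```python
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.2, test_frac=0.2, random_state=42):
    """Split X and y into train, validation, and test parts (hypothetical helper)."""
    holdout_frac = val_frac + test_frac
    # First split: training rows versus everything held out
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout_frac, random_state=random_state)
    # Second split: the test share is expressed relative to the hold-out
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_frac / holdout_frac,
        random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data: 100 rows with alternating labels
toy_X = list(range(100))
toy_y = [i % 2 for i in toy_X]

parts = three_way_split(toy_X, toy_y)
sizes = [len(part) for part in parts[:3]]
print(sizes)
```

Keeping the fractions as parameters makes it harder to mistake the second `test_size` for a share of the full dataset.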

In [10]:
# Quick check: print each split's share of the full dataset

for dataset in [y_train, y_val, y_test]:
    print(round(len(dataset)/len(labels), 2))

0.6
0.2
0.2
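
Matching sizes do not guarantee that each split has a similar share of survivors. The sketch below shows one way to check that, together with the `stratify` argument of `train_test_split`, which keeps class proportions equal across a split. The notebook above does not use stratification, so treat this as an optional variation; `toy_features` and `toy_labels` are made-up stand-ins for `features` and `labels`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels standing in for titanic['Survived']: 30 positives out of 100 rows
toy_labels = pd.Series([1] * 30 + [0] * 70, name='Survived')
toy_features = pd.DataFrame({'row_id': range(100)})

# stratify= keeps the share of each class the same on both sides of a split
X_tr, X_hold, y_tr, y_hold = train_test_split(
    toy_features, toy_labels, test_size=0.4, random_state=42,
    stratify=toy_labels)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

# The mean of a 0/1 label column is the share of positive rows
rates = [round(float(part.mean()), 2) for part in (y_tr, y_va, y_te)]
print(rates)
```

The same `part.mean()` check works on `y_train`, `y_val`, and `y_test` from the unstratified split, where the rates may differ somewhat between splits.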


## Write out all data

In [11]:
X_train.to_csv('../Data/train_features.csv', index=False)
X_val.to_csv('../Data/val_features.csv', index=False)
X_test.to_csv('../Data/test_features.csv', index=False)

y_train.to_csv('../Data/train_labels.csv', index=False)
y_val.to_csv('../Data/val_labels.csv', index=False)
y_test.to_csv('../Data/test_labels.csv', index=False)
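
When the label files are read back later, it helps to know what shape they come back in. The sketch below round-trips a made-up label Series through a temporary file. It assumes a reasonably recent pandas; `header=True` is spelled out because some older pandas releases wrote a Series without a header row by default.

```python
import os
import tempfile

import pandas as pd

# Made-up labels standing in for y_train
toy_labels = pd.Series([0, 1, 1], name='Survived')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_labels.csv')
    # index=False leaves the row index out of the file, as in the cell above
    toy_labels.to_csv(path, index=False, header=True)
    # read_csv returns a one-column DataFrame, not a Series
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded['Survived'].tolist())
```

Selecting the single column, as in `reloaded['Survived']`, recovers a Series for code that expects one.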