# Splitting test and training data
When you train a model, you may need to split your data into training and test data sets.

To accomplish this task, we will use the scikit-learn library.

scikit-learn is an open-source, BSD-licensed data science library for preprocessing data and training models.

Before we can split our data into test and training data, we need to do some data preparation.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path
# Build the path to the CSV file in the CSV_files folder next to the current working directory
newfile = Path.cwd().parent / 'CSV_files' / 'lots_of_flight_data.csv'

In [2]:
delays_df=pd.read_csv(newfile)
delays_df.shape

(100, 16)

# Split data into features and labels
Create a DataFrame called X containing only the features we want to use to train our model.

Note: You can only use numeric values as features. If you have non-numeric values, you must first convert them into numeric values with a technique such as one-hot encoding before using them as features to train a model. Check out Data Science courses for more information on these techniques!
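
To make the note above concrete, here is a minimal sketch of one-hot encoding with pandas `get_dummies`. The `CARRIER` column and its values are made up for illustration and are not taken from the flight data set.

```python
import pandas as pd

# Hypothetical sample: CARRIER is a made-up non-numeric column for illustration
flights = pd.DataFrame({
    'DISTANCE': [1670, 580, 580],
    'CARRIER': ['WN', 'DL', 'WN'],
})

# get_dummies replaces CARRIER with one 0/1 indicator column per distinct value
encoded = pd.get_dummies(flights, columns=['CARRIER'], dtype=int)
print(encoded.columns.tolist())
```

Each row now has a 1 in the column for its carrier and a 0 in the others, so every column is numeric and could be used as a feature.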

In [3]:
X=delays_df.loc[:,['DISTANCE','CRS_ELAPSED_TIME']]
X.head()

Unnamed: 0,DISTANCE,CRS_ELAPSED_TIME
0,1670,225
1,1670,225
2,580,105
3,580,105
4,580,100


Create a DataFrame called y containing only the value we want to predict with our model.

In our case we want to predict how many minutes late a flight will arrive. This information is in the ARR_DELAY column.

In [4]:
y=delays_df.loc[:,['ARR_DELAY']]
y.head()

Unnamed: 0,ARR_DELAY
0,-17
1,-25
2,-13
3,-12
4,-7


# Split into test and training data
Use scikit-learn's train_test_split to move 30% of the rows into the test DataFrames.

The other 70% of the rows go into DataFrames we can use to train our model.

NOTE: By specifying a value for random_state, we ensure that if we run the code again, the same rows will be moved into the test DataFrames. This makes our results repeatable.

In [6]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=13)
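
The repeatability that random_state provides can be checked with a small sketch. The data here is a made-up stand-in for the flight data, and the names `X_demo`, `y_demo`, `X_test_a`, and `X_test_b` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the flight data: 10 rows, one feature, one label
X_demo = pd.DataFrame({'DISTANCE': range(100, 1100, 100)})
y_demo = pd.DataFrame({'ARR_DELAY': range(-5, 5)})

# Two separate calls with the same random_state
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=13)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=13)

# True when both calls picked the same rows, in the same order, for the test set
same_rows = X_test_a.index.equals(X_test_b.index)
print(same_rows)
```

If random_state is left out, the rows are shuffled differently on each run, so the two test sets would usually differ.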

We now have a DataFrame, X_train, which contains 70% of the rows.

We will use this DataFrame to train our model.

In [11]:
X_train.shape

(70, 2)
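
To see how the rows are divided across all four DataFrames, not just X_train, a sketch along these lines may help. It uses made-up data with the same row count as delays_df (100 rows); the variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in with the same row count as delays_df (100 rows)
X_demo = pd.DataFrame({'DISTANCE': range(100), 'CRS_ELAPSED_TIME': range(100)})
y_demo = pd.DataFrame({'ARR_DELAY': range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=13)

# Collect the shape of every piece so the 70/30 split is visible at a glance
shapes = {'X_train': X_tr.shape, 'X_test': X_te.shape,
          'y_train': y_tr.shape, 'y_test': y_te.shape}
print(shapes)
```

The training pieces hold 70 rows each and the test pieces hold 30, while the column counts match the feature and label DataFrames that were passed in.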


X_train and X_test contain our features.

The features are the columns we think can help us predict how late a flight will arrive: DISTANCE and CRS_ELAPSED_TIME.

In [12]:
X_train.head()

Unnamed: 0,DISTANCE,CRS_ELAPSED_TIME
39,289,80
27,1111,175
42,1610,240
13,759,120
8,349,75


In [13]:
y_train.head()

Unnamed: 0,ARR_DELAY
39,-18
27,-11
42,-5
13,-2
8,2
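
The two outputs above list the same row labels (39, 27, 42, 13, 8) in the same order, which suggests the features and labels stay paired after the shuffle. A minimal sketch to check this on made-up data, where each label is deliberately set to twice its feature value:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data: each label is exactly twice its feature value,
# so a mismatched row would be easy to spot
X_demo = pd.DataFrame({'DISTANCE': range(0, 200, 10)})
y_demo = pd.DataFrame({'ARR_DELAY': [2 * d for d in range(0, 200, 10)]})

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=13)

# The split shuffles rows but keeps the original index labels on both pieces
labels_match = X_tr.index.equals(y_tr.index)
# Row by row, every label still belongs to its feature
rows_match = bool((y_tr['ARR_DELAY'] == 2 * X_tr['DISTANCE']).all())
print(labels_match, rows_match)
```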


In [16]:
# Save each split to its own CSV file; the data folder must already exist
X_train.to_csv('data/X_train.csv')
X_test.to_csv('data/X_test.csv')
y_train.to_csv('data/y_train.csv')
y_test.to_csv('data/y_test.csv')
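
By default, to_csv also writes the DataFrame index as an unnamed first column. The sketch below shows what that means when a file is read back, using a made-up two-row DataFrame and a temporary folder instead of the data folder above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame whose index labels mimic a shuffled training set
X_demo = pd.DataFrame({'DISTANCE': [289, 1111]}, index=[39, 27])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'X_train.csv'
    X_demo.to_csv(csv_path)                         # index is written as an unnamed first column
    plain = pd.read_csv(csv_path)                   # index comes back as a column named 'Unnamed: 0'
    restored = pd.read_csv(csv_path, index_col=0)   # first column is used as the index again

print(plain.columns.tolist())
print(restored.index.tolist())
```

Passing `index_col=0` when reading restores the original row labels, which keeps the feature and label files aligned by index.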