# Splitting test and training data
When you train a model, you may need to split your data into test and training data sets.

To accomplish this task, we will use the [scikit-learn](https://scikit-learn.org/stable/) library.

scikit-learn is an open source, BSD-licensed data science library for preprocessing data and training models.

Before we can split our data into test and training data, we need to do some data preparation.

In [1]:
import pandas as pd

Let's load our CSV file with information about flights and flight delays.

Use **shape** to find out how many rows and columns are in the original DataFrame.

In [2]:
delays_df = pd.read_csv('Data/Lots_of_flight_data.csv')
delays_df.shape

(300000, 16)

## Split data into features and labels
Create a DataFrame called X containing only the features we want to use to train our model.

**Note:** You can only use numeric values as features. If you have non-numeric values, you must first convert them into numeric values with a technique such as one-hot encoding before using them as features to train a model. Check out Data Science courses for more information on these techniques!
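As a rough illustration of the one-hot encoding mentioned in the note, here is a minimal sketch using pandas `get_dummies`. The `CARRIER` column and its values are made up for this example and are not taken from the flight data file.

```python
import pandas as pd

# Hypothetical sample data: the CARRIER column is an assumption for illustration
flights = pd.DataFrame({
    'DISTANCE': [1670, 580, 580],
    'CARRIER': ['WN', 'AA', 'WN'],
})

# get_dummies replaces CARRIER with one 0/1 indicator column per distinct value
encoded = pd.get_dummies(flights, columns=['CARRIER'], dtype=int)
print(encoded)
```

Every column in `encoded` is now numeric, so it could be used as a feature.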

In [3]:
X = delays_df.loc[:,['DISTANCE', 'CRS_ELAPSED_TIME']]
X.head()

|   | DISTANCE | CRS_ELAPSED_TIME |
|---|---|---|
| 0 | 1670 | 225 |
| 1 | 1670 | 225 |
| 2 | 580 | 105 |
| 3 | 580 | 105 |
| 4 | 580 | 100 |


Create a DataFrame called y containing only the value we want to predict with our model. 

In our case we want to predict how many minutes late a flight will arrive. This information is in the ARR_DELAY column. 

In [4]:
y = delays_df.loc[:,['ARR_DELAY']]
y.head()

|   | ARR_DELAY |
|---|---|
| 0 | -17.0 |
| 1 | -25.0 |
| 2 | -13.0 |
| 3 | -12.0 |
| 4 | -7.0 |
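Passing a list of column names to `loc` returns a DataFrame, while passing a single column name returns a Series. The short sketch below shows the difference on a made-up three-row DataFrame; the name `sample_df` is only for illustration.

```python
import pandas as pd

# Made-up sample standing in for the flight data
sample_df = pd.DataFrame({'ARR_DELAY': [-17.0, -25.0, -13.0]})

y_frame = sample_df.loc[:, ['ARR_DELAY']]  # list of labels: two-dimensional DataFrame
y_series = sample_df.loc[:, 'ARR_DELAY']   # single label: one-dimensional Series

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```

This is why the shape of y has a second dimension of 1 when it is created with a list of column names.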


## Split into test and training data
Use scikit-learn's **train_test_split** function to move 30% of the rows into test DataFrames

The other 70% of the rows go into DataFrames we can use to train our model

**NOTE:** By specifying a value for *random_state*, we ensure that running the code again moves the same rows into the test DataFrames. This makes our results repeatable.

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)
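To see the effect of *random_state*, the sketch below splits a small made-up DataFrame twice with the same value and compares which rows land in the test set. The name `demo_df` and its values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up DataFrame standing in for the flight data
demo_df = pd.DataFrame({'DISTANCE': range(100, 1100, 100)})

first_train, first_test = train_test_split(demo_df, test_size=0.3, random_state=42)
second_train, second_test = train_test_split(demo_df, test_size=0.3, random_state=42)

# Same random_state, so the same rows are selected for the test set both times
print(first_test.index.equals(second_test.index))
```

Leaving *random_state* unset would generally give a different selection of rows on each run.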

We now have a DataFrame **X_train** which contains 70% of the rows

We will use this DataFrame to train our model

In [7]:
X_train.shape

(210000, 2)

The DataFrame **X_test** contains the remaining 30% of the rows

We will use this DataFrame to test our trained model, so we can check its accuracy

In [8]:
X_test.shape

(90000, 2)

**X_train** and **X_test** contain our features

The features are the columns we think can help us predict how late a flight will arrive: **DISTANCE** and **CRS_ELAPSED_TIME**

In [9]:
X_train.head()

|   | DISTANCE | CRS_ELAPSED_TIME |
|---|---|---|
| 186295 | 237 | 60 |
| 127847 | 411 | 111 |
| 274740 | 342 | 85 |
| 74908 | 1005 | 164 |
| 11630 | 484 | 100 |


The DataFrame **y_train** contains the labels for the same 70% of the rows as **X_train**

We will use this DataFrame to train our model


In [27]:
y_train.shape

(210000, 1)

The DataFrame **y_test** contains the remaining 30% of the rows

We will use this DataFrame to test our trained model, so we can check its accuracy

In [28]:
y_test.shape

(90000, 1)
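When X and y are passed to train_test_split together, the same rows are selected from both, so each label stays paired with its features. A minimal check on a made-up DataFrame (the names and values below are assumptions for illustration) might look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample standing in for the flight data
demo_df = pd.DataFrame({
    'DISTANCE': [1670, 580, 237, 411, 342, 1005, 484, 920, 130, 760],
    'ARR_DELAY': [-17.0, -13.0, -7.0, -16.0, -10.0, -19.0, -13.0, 4.0, 0.0, 22.0],
})
X_demo = demo_df.loc[:, ['DISTANCE']]
y_demo = demo_df.loc[:, ['ARR_DELAY']]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The row labels match, so features and labels are still aligned
print(X_tr.index.equals(y_tr.index))
print(X_te.index.equals(y_te.index))
```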

**y_train** and **y_test** contain our label

The label is the column we want to predict with our trained model: **ARR_DELAY**

**NOTE:** A negative value for ARR_DELAY indicates that a flight arrived early

In [29]:
y_train.head()

|   | ARR_DELAY |
|---|---|
| 186295 | -7.0 |
| 127847 | -16.0 |
| 274740 | -10.0 |
| 74908 | -19.0 |
| 11630 | -13.0 |
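As a rough sketch of how the four DataFrames might be used next, the example below fits a model on the training rows and measures its error on the test rows. The choice of `LinearRegression` and `mean_absolute_error` is an assumption made for this illustration, and the data is randomly generated rather than read from the flight file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the flight file used above
rng = np.random.default_rng(0)
distance = rng.integers(100, 3000, size=200)
demo_df = pd.DataFrame({
    'DISTANCE': distance,
    'CRS_ELAPSED_TIME': distance / 8 + 30 + rng.normal(0, 5, size=200),
    'ARR_DELAY': rng.normal(0, 15, size=200),
})

X_demo = demo_df.loc[:, ['DISTANCE', 'CRS_ELAPSED_TIME']]
y_demo = demo_df.loc[:, ['ARR_DELAY']]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_tr, y_tr)              # learn only from the training rows
predictions = model.predict(X_te)  # predict for rows the model has not seen

# Average size of the prediction error, in minutes, on the held-out test rows
test_error = mean_absolute_error(y_te, predictions)
print(test_error)
```

Evaluating on rows the model never saw during training gives a fairer picture of how it would behave on new flights.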
