# Split Records into Train and Test
This notebook reads the feature set joined in the `03_join_features_from_data_prep` notebook, which contains:

- The food inspections features created in the `01_food_inspections_data_prep` notebook and
- The census features created in the `02_census_data_prep` notebook.

This notebook then splits the data into 80% train and 20% test and writes the splits out as gzip-compressed CSVs.

### Set Global Seed

In [1]:
SEED = 666

### Imports

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

### Read Chicago Food Inspections Feature Dataset from the Feature Join Notebook

In [3]:
feature_df = pd.read_csv('../data/Final_Features.gz', compression='gzip')

### Split Dataset into Train and Test

In [4]:
train_df, test_df = train_test_split(feature_df,
                                     test_size=0.20,
                                     shuffle=True,
                                     random_state=SEED)

In [5]:
train_df.shape

(133496, 97)

In [6]:
test_df.shape

(33375, 97)
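The row counts above (133496 train, 33375 test) are consistent with scikit-learn rounding a fractional test size up to the next whole row. The following is a minimal sketch of that behavior on a hypothetical toy frame (`toy_df` and its contents are invented for illustration and are not the inspection data); it also checks that the two splits do not overlap and that reusing the seed reproduces the split.

```python
import math

import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 666

# Hypothetical toy frame standing in for feature_df.
# 101 rows are used so that 20% is not a whole number of rows.
toy_df = pd.DataFrame({
    "inspection_id": range(101),
    "result": [i % 2 for i in range(101)],
})

toy_train, toy_test = train_test_split(toy_df,
                                       test_size=0.20,
                                       shuffle=True,
                                       random_state=SEED)

# A fractional test_size is rounded up; the train split gets the remainder.
expected_test_rows = math.ceil(0.20 * len(toy_df))
expected_train_rows = len(toy_df) - expected_test_rows

# No inspection should appear in both splits.
overlap = set(toy_train["inspection_id"]) & set(toy_test["inspection_id"])

# Splitting again with the same seed should give the same rows in the same order.
train_again, test_again = train_test_split(toy_df,
                                           test_size=0.20,
                                           shuffle=True,
                                           random_state=SEED)
same_split = toy_train.index.equals(train_again.index)
```

The same checks could be run against `train_df` and `test_df` directly, assuming `inspection_id` is unique per row in the feature set.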

### Drop ID and Target Columns in Feature Matrix and Select Target Vector for Training

In [7]:
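# Pass the columns by keyword; the positional axis argument is no longer accepted by newer pandas releases
X_train = train_df.drop(columns=['inspection_id', 'result'])
X_test = test_df.drop(columns=['inspection_id', 'result'])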
y_train = train_df['result']
y_test = test_df['result']
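Because the feature matrix and target vector are stored separately, row order is the only link between them once the index is dropped. A small sketch of the separation step on a hypothetical toy frame (`toy_df`, `feature_a`, and the values are invented for illustration) shows that `drop` returns a new frame, that features and target share the same index, and how the class shares in the target could be inspected.

```python
import pandas as pd

# Hypothetical toy frame with an ID column, one feature, and the target.
toy_df = pd.DataFrame({
    "inspection_id": [101, 102, 103, 104],
    "feature_a": [0.5, 1.5, 2.5, 3.5],
    "result": [0, 1, 1, 0],
})

# drop returns a new frame; toy_df itself keeps all of its columns.
toy_X = toy_df.drop(columns=["inspection_id", "result"])
toy_y = toy_df["result"]

# Features and target come from the same rows, so their indexes match.
aligned = toy_X.index.equals(toy_y.index)

# Share of each class in the target, useful for comparing train against test.
class_share = toy_y.value_counts(normalize=True)
```

Comparing `y_train.value_counts(normalize=True)` with the same call on `y_test` is one way to see whether the random split kept the class balance roughly equal.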

### Write Data Splits to CSV

In [8]:
X_train.to_csv('../data/X_train.gz', compression='gzip', index=False, header=True)
X_test.to_csv('../data/X_test.gz', compression='gzip', index=False, header=True)
y_train.to_csv('../data/y_train.gz', compression='gzip', index=False, header=True)
y_test.to_csv('../data/y_test.gz', compression='gzip', index=False, header=True)
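To confirm that the files can be read back as written, a round trip can be sketched as below. The toy data and the temporary file names are hypothetical; only the `to_csv` arguments mirror the cell above. Note that a target written with `header=True` comes back from `read_csv` as a one-column DataFrame, so the column is selected by name to recover a Series.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy splits standing in for X_train and y_train.
toy_X = pd.DataFrame({"feature_a": [1.0, 2.0, 3.0], "feature_b": [0, 1, 0]})
toy_y = pd.Series([1, 0, 1], name="result")

with tempfile.TemporaryDirectory() as tmp_dir:
    X_path = os.path.join(tmp_dir, "X_toy.gz")
    y_path = os.path.join(tmp_dir, "y_toy.gz")

    # Same arguments as the notebook cell: gzip, no index column, header row.
    toy_X.to_csv(X_path, compression="gzip", index=False, header=True)
    toy_y.to_csv(y_path, compression="gzip", index=False, header=True)

    # Read both files back while the temporary directory still exists.
    X_back = pd.read_csv(X_path, compression="gzip")
    y_back = pd.read_csv(y_path, compression="gzip")["result"]

# Row counts must match, since row order is what ties X to y.
rows_match = len(X_back) == len(y_back)
```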