<a id='toc'></a>
# Creating the Train and Test Sets
Here we create the final train and test sets ready for predictive modelling.

1. [Setting up the Environment](#envir)
1. [Loading Data Post Correlation Analysis](#load)
1. [Split into Train and Test Sets](#split)
1. [Imputing Missing Values from Training Set](#impute)
1. [Standard Scaling](#standard)
1. [Dropping SK_ID_CURR and TARGET Attributes](#drop)
1. [Saving the Train and Test Sets](#save)

<a id='envir'></a>
# 1. Setting up the Environment

[Return](#toc)

In [1]:
import numpy as np
import pandas as pd
from numpy import nan as NaN
from sklearn.model_selection import train_test_split
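# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer
from sklearn.preprocessing import StandardScaler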
#from sklearn.linear_model import LogisticRegression
#from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score, confusion_matrix
#from sklearn.model_selection import KFold
#import matplotlib.pyplot as plt
#import seaborn as sns
#import os
import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_rows = 200
pd.options.display.max_columns = 200
#%matplotlib inline

<a id='load'></a>
# 2. Loading Data Post Correlation Analysis

[Return](#toc)

In [2]:
# Read in the data post correlation analysis
path = 'C:/Users/X/Documents/A_Documents/Cap_Data/CSV'
app = pd.read_csv(path + '/app_less_corr/app_less_corr.csv')
print('Shape of dataframe ' + str(app.shape))

Shape of dataframe (307511, 261)


<a id='split'></a>
# 3. Split into Train and Test Sets

[Return](#toc)

In [3]:
# Splitting the data into train and test sets, stratified on the TARGET attribute
# random_state = 777 is used to ensure we consistently get the same train and test sets
train_set, test_set = train_test_split(app, test_size=0.3, random_state=777, stratify=app['TARGET'])

In [4]:
# A quick check of the stratification
print('Training Set:')
print(train_set['TARGET'].value_counts()/len(train_set))
print('\nTesting Set:')
print(test_set['TARGET'].value_counts()/len(test_set))

Training Set:
0    0.919273
1    0.080727
Name: TARGET, dtype: float64

Testing Set:
0    0.919266
1    0.080734
Name: TARGET, dtype: float64
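
To illustrate what `stratify` does, here is a minimal sketch on a synthetic frame. The `toy` data and its 8% positive rate are made up for illustration and are not the real `app` data. Both splits should keep close to the same share of positives.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 8% positives (hypothetical values)
toy = pd.DataFrame({
    'SK_ID_CURR': np.arange(1000),
    'TARGET': np.array([1] * 80 + [0] * 920),
})

# Same call pattern as the notebook: 70/30 split stratified on TARGET
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=777, stratify=toy['TARGET']
)

# Share of positives in each split; both should sit close to 0.08
train_rate = toy_train['TARGET'].mean()
test_rate = toy_test['TARGET'].mean()
print(train_rate, test_rate)
```

Without `stratify`, the two rates could drift apart by chance, which matters more when the positive class is as rare as it is here.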


In [5]:
# Just checking that we do actually get consistent train/test splits
#tr = pd.read_csv(path + '/train_test_IDs/train_IDs.csv')
#te = pd.read_csv(path + '/train_test_IDs/test_IDs.csv')

In [6]:
# Should be zero
#sum(~(np.sort(tr['SK_ID_CURR'].values) == np.sort(train_set['SK_ID_CURR'].values)))

In [7]:
# Should be zero
#sum(~(np.sort(te['SK_ID_CURR'].values) == np.sort(test_set['SK_ID_CURR'].values)))
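
The commented-out checks above compare against ID files saved earlier. The same property can be checked without those files by splitting twice with the same `random_state` and comparing the IDs. This is a sketch on synthetic data; `toy` and the helper `split_ids` are hypothetical names, and reproducibility is assumed to hold only for the same input data and library version.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with a 25% positive rate (hypothetical values)
toy = pd.DataFrame({
    'SK_ID_CURR': np.arange(100, 200),
    'TARGET': np.array([0, 0, 0, 1] * 25),
})

def split_ids(frame, seed):
    """Return the sorted train and test IDs for a stratified 70/30 split."""
    tr, te = train_test_split(
        frame, test_size=0.3, random_state=seed, stratify=frame['TARGET']
    )
    return np.sort(tr['SK_ID_CURR'].values), np.sort(te['SK_ID_CURR'].values)

first_train, first_test = split_ids(toy, 777)
second_train, second_test = split_ids(toy, 777)

# Count of mismatched IDs between the two runs; both should be zero
train_mismatches = int(np.sum(first_train != second_train))
test_mismatches = int(np.sum(first_test != second_test))
print(train_mismatches, test_mismatches)
```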

In [8]:
# Copying the TARGET attribute to make label sets
train_labels = train_set[['TARGET']].copy()
test_labels = test_set[['TARGET']].copy()

<a id='impute'></a>
# 4. Imputing Missing Values from Training Set

[Return](#toc)

In [9]:
# Training the imputer on the training set ONLY
imputer = Imputer(strategy="median")
imputer.fit(train_set)

# Running the imputer across train_set and test_set
train_set = imputer.transform(train_set)
test_set = imputer.transform(test_set)
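
The reason for fitting on the training set only is that the test set should not influence any statistic used in preprocessing. The sketch below shows this on two tiny made-up arrays (`toy_train` and `toy_test` are hypothetical). It uses `SimpleImputer` from `sklearn.impute`, which is the replacement for `Imputer` in scikit-learn 0.22 and later.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny made-up arrays with missing values
toy_train = np.array([[1.0, 10.0],
                      [3.0, np.nan],
                      [5.0, 30.0]])
toy_test = np.array([[np.nan, 100.0],
                     [7.0, np.nan]])

# Medians are learned from the training rows only
toy_imputer = SimpleImputer(strategy='median')
toy_imputer.fit(toy_train)
learned_medians = toy_imputer.statistics_

# Missing test values are filled with the training medians,
# not with anything computed from the test rows
filled_test = toy_imputer.transform(toy_test)
print(learned_medians)
print(filled_test)
```

Here the training medians are 3.0 and 20.0, so the test set's large value of 100.0 has no effect on what gets imputed.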

<a id='standard'></a>
# 5. Standard Scaling

[Return](#toc)

In [10]:
# Create the scaler
scaler = StandardScaler()

# Training the scaler on the training set ONLY
scaler.fit(train_set)

# Running the scaler across the training and test sets
train_set = scaler.transform(train_set)
test_set = scaler.transform(test_set)

In [11]:
print('train_set: {}'.format(train_set.shape))
print('test_set: {}'.format(test_set.shape))

train_set: (215257, 261)
test_set: (92254, 261)
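
As a small illustration of what the scaler does, the sketch below fits `StandardScaler` on a made-up training column and applies it to a made-up test column (both arrays are hypothetical). The test values are standardized with the training mean and standard deviation, so only the training set is guaranteed to end up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up single-column arrays
toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_test = np.array([[2.5], [10.0]])

# Mean and standard deviation are learned from the training rows only
toy_scaler = StandardScaler().fit(toy_train)
scaled_train = toy_scaler.transform(toy_train)
scaled_test = toy_scaler.transform(toy_test)

# The same transformation by hand, using the training statistics
# (NumPy's default std is the population std, which is what StandardScaler uses)
manual_test = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(scaled_train.mean(), scaled_train.std())
print(scaled_test.ravel(), manual_test.ravel())
```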


<a id='drop'></a>
# 6. Dropping SK_ID_CURR and TARGET Attributes

[Return](#toc)

In [12]:
# Converting back to dataframes (the imputer and scaler return NumPy arrays) so that columns can be dropped by name
train_set = pd.DataFrame(train_set, columns=app.columns)
test_set = pd.DataFrame(test_set, columns=app.columns)

# Dropping SK_ID_CURR (no longer required as it's just an ID) and the TARGET attribute from the train and test sets
train_set = train_set.drop(['SK_ID_CURR','TARGET'], axis=1)
test_set = test_set.drop(['SK_ID_CURR','TARGET'], axis=1)
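
One detail worth noting: rebuilding a dataframe from an array gives it a fresh 0..n-1 index, while the label frames copied earlier keep the shuffled original index. The rows still line up by position, because the transformers preserve row order and the files are saved without the index. As an optional variation, the original index can be passed back in so features and labels also align by index. The sketch below shows this on a made-up frame (`original` and its column `AMT` are hypothetical).

```python
import numpy as np
import pandas as pd

# Made-up frame with a shuffled, non-contiguous index, like a split would produce
original = pd.DataFrame(
    {'SK_ID_CURR': [101, 102, 103], 'TARGET': [0, 1, 0], 'AMT': [1.0, 2.0, 3.0]},
    index=[7, 2, 5],
)
labels = original[['TARGET']].copy()

# Stand-in for the output of an imputer or scaler: a plain array, no labels
as_array = original.values.astype(float)

# Restore both the column names and the original row index
rebuilt = pd.DataFrame(as_array, columns=original.columns, index=original.index)
features = rebuilt.drop(['SK_ID_CURR', 'TARGET'], axis=1)

print(features)
print(features.index.equals(labels.index))
```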

<a id='save'></a>
# 7. Saving the Train and Test Sets

[Return](#toc)

In [13]:
train_set.to_csv(path+'/train_test_sets/train_set_std.csv', index=False)
train_labels.to_csv(path+'/train_test_sets/train_labels.csv', index=False)
test_set.to_csv(path+'/train_test_sets/test_set_std.csv', index=False)
test_labels.to_csv(path+'/train_test_sets/test_labels.csv', index=False)