# Train / Cross Validation / Test Split

Moral of the train/cv/test split methods: feed the deep model the best data possible.

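Background reading: [differences between standardization methods](https://towardsdatascience.com/preprocessing-differences-in-standardization-methods-de53d2525a87)
and [why an 80/20 split is used for training and test data, plus an alternative method](https://towardsdatascience.com/finally-why-we-use-an-80-20-split-for-training-and-test-data-plus-an-alternative-method-oh-yes-edc77e96295d)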

Alternative splitting methods:

- Pareto principle split: 80/20
- Scaling law split
- Banana split (no unfortunately this doesn't exist yet)

In [2]:
import os
import pandas as pd
import numpy as np
from sklearn import preprocessing, model_selection

In [3]:
# Loading data

nodes = pd.read_csv('../data/raw/no_nan_data.csv', low_memory=False)

In [4]:
nodes

Unnamed: 0,node_id,Label,Revenue Size Flag,Account ID String,Address,Person or Organisation,Name,Income Size Flag,CoreCaseGraphID,ExtendedCaseGraphID,testingFlag
0,1502000,1,4,RvIOFQqK0E,missing,-1,missing,-1,0,0,
1,1502001,1,2,cSnM0hVDsm,missing,-1,missing,-1,0,0,
2,1502002,1,2,WAQWpZi4AD,missing,-1,missing,-1,2492,0,0.0
3,1502003,1,4,n5J9mBTeZc,missing,-1,missing,-1,0,0,
4,1502004,1,2,qxlAEuUm7P,missing,-1,missing,-1,0,0,
...,...,...,...,...,...,...,...,...,...,...,...
319371,3001177742,5,-1,missing,missing,2,gqDyJLC8DS,-1,0,0,
319372,3001177743,5,-1,missing,missing,1,B5TdCmIf69,-1,0,0,
319373,3001177744,5,-1,missing,missing,2,izrJE4sDpr,-1,0,0,
319374,3001177745,5,-1,missing,missing,1,TKQfFZ3fkk,-1,0,2030,0.0


In [5]:
# How many values for testingFlag
nodes.testingFlag.value_counts()

0.0    37669
1.0     1792
Name: testingFlag, dtype: int64

In [6]:
nodes.testingFlag.isna().sum()

279915

In [23]:
# TRAIN_SET + CV_SET
# Rows with a missing (NaN) testingFlag are used for training and cross validation.
# Assumption: the meaning of testingFlag is still to be confirmed.

train_cv_set = nodes.loc[nodes['testingFlag'].isnull()]
train_cv_set.shape[0]

279915

In [24]:
# TEST SET
# Rows with testingFlag == 0 are used to evaluate performance (to be confirmed).
# Note: the 1792 rows with testingFlag == 1 end up in neither set.

test_set = nodes.loc[nodes['testingFlag'] == 0]
test_set.shape[0]

37669
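The two selections above can be checked on a small example. This is a minimal sketch on a made-up frame, assuming `testingFlag` takes only the three values seen in the counts above (NaN, 0.0, 1.0); the toy data and the name `unused` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `nodes`: testingFlag is NaN, 0.0 or 1.0 (values are made up).
toy = pd.DataFrame({
    "node_id": [1, 2, 3, 4, 5, 6],
    "testingFlag": [np.nan, np.nan, np.nan, 0.0, 0.0, 1.0],
})

# Count every flag value, NaN included, in a single call.
flag_counts = toy["testingFlag"].value_counts(dropna=False)

# Same masks as in the cells above.
train_cv = toy.loc[toy["testingFlag"].isnull()]
test = toy.loc[toy["testingFlag"] == 0]

# Rows that land in neither subset (here: testingFlag == 1).
unused = toy.drop(train_cv.index.union(test.index))

print(flag_counts)
print(len(train_cv), len(test), len(unused))
```

Running the same check on `nodes` would show whether the rows with `testingFlag == 1` are left out on purpose.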

## 1: Pareto Split

In [25]:
# 80% train, 20% cross validation

train_set, validation_set = model_selection.train_test_split(train_cv_set, test_size=0.2)

In [26]:
train_set.shape

(223932, 11)

In [27]:
validation_set.shape

(55983, 11)
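`train_test_split` shuffles randomly, so rerunning the cell yields a different 80/20 split each time. Below is a sketch of a reproducible variant that also keeps class proportions equal in both parts. It assumes `Label` is the class column worth stratifying on, which the notebook does not confirm; the seed `42` and the toy frame are arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_cv_set: 80 rows of class 1, 20 rows of class 5.
toy = pd.DataFrame({
    "node_id": range(100),
    "Label": [1] * 80 + [5] * 20,
})

train_part, val_part = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,        # fixed seed: same split on every run
    stratify=toy["Label"],  # keep the class mix identical in both parts
)

print(train_part.shape, val_part.shape)
print(val_part["Label"].value_counts().to_dict())
```

Stratification matters most when some labels are rare, since a purely random 20% sample can under-represent them.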

## 2: Other methods

TODO
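As a possible starting point for the scaling law split listed at the top, here is a minimal sketch. It assumes the rule of thumb that the validation fraction should be roughly inversely proportional to the square root of the number of free model parameters. The function name `scaling_law_val_fraction`, the clipping bounds and the parameter count are all hypothetical, so the rule should be checked against the linked article before it is relied on.

```python
import math

def scaling_law_val_fraction(n_params, min_frac=0.01, max_frac=0.5):
    """Validation fraction ~ 1 / sqrt(n_params), clipped to [min_frac, max_frac]."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    frac = 1.0 / math.sqrt(n_params)
    return min(max(frac, min_frac), max_frac)

n_rows = 279915   # size of train_cv_set in this notebook
n_params = 100    # hypothetical number of free parameters
frac = scaling_law_val_fraction(n_params)
n_val = round(n_rows * frac)  # rows to reserve for validation
print(frac, n_val)
```

The resulting fraction could be passed as `test_size` to `model_selection.train_test_split` in place of the fixed `0.2`.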

## Remove testingFlag

In [28]:
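# Drop the flag from all three subsets so the exported files share the same columns
train_set = train_set.drop(['testingFlag'], axis=1)
validation_set = validation_set.drop(['testingFlag'], axis=1)
test_set = test_set.drop(['testingFlag'], axis=1)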

In [29]:
# TODO: what is testingFlag?? ASK

In [33]:
train_set

Unnamed: 0,node_id,Label,Revenue Size Flag,Account ID String,Address,Person or Organisation,Name,Income Size Flag,CoreCaseGraphID,ExtendedCaseGraphID
280955,3001167904,5,-1,missing,missing,2,HfXhe7Zzza,-1,0,0
120957,15020059496,1,4,xs1WkuTym3,missing,-1,missing,-1,0,0
11965,15020026582,1,3,ukpmdugHEs,missing,-1,missing,-1,0,0
46576,1502015926,1,4,rrusZirsiq,missing,-1,missing,-1,0,0
262096,3001132750,5,-1,missing,missing,2,AZ9HZXBCcx,-1,0,0
...,...,...,...,...,...,...,...,...,...,...
258939,2003014831,4,-1,missing,missing,2,Pda8K6njO9,-1,0,0
1936,1502004478,1,4,5ckxg2HLVg,missing,-1,missing,-1,0,0
306041,3001143150,5,-1,missing,missing,1,ia6fnhceUA,-1,0,0
138335,15020145184,1,3,I8xySuN8HD,missing,-1,missing,-1,0,0


## Export datasets

In [34]:
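def write_csv_df(dirpath, filename, df):
    """Write df to dirpath/filename, asking before overwriting an existing file."""
    # `dirpath` rather than `path`, so the name cannot be confused with os.path
    os.makedirs(dirpath, exist_ok=True)  # make sure the target folder exists
    pathfile = os.path.normpath(os.path.join(dirpath, filename))
    if not os.path.isfile(pathfile):
        df.to_csv(pathfile, encoding='utf-8', index=False)
        return
    overwrite = input("WARNING: " + pathfile + " already exists! Do you want to overwrite <y/n>? \n ")
    overwrite = overwrite.strip().lower()  # accept ' y', 'Y', etc.
    if overwrite == 'y':
        df.to_csv(pathfile, encoding='utf-8', index=False)
    elif overwrite == 'n':
        # Ask for another name and retry with the same checks
        new_filename = input("Type new filename: \n ")
        write_csv_df(dirpath, new_filename, df)
    else:
        print("Not a valid input. Data is NOT saved!\n")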

In [35]:
# Export 

write_csv_df("../data/split", "train_set_subset.csv", train_set)
write_csv_df("../data/split", "validation_set_subset.csv", validation_set)
write_csv_df("../data/split", "test_set_subset.csv", test_set)

 y
 y
 y
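The interactive prompt blocks unattended runs, such as "Run All" or a script. One non-interactive alternative is to pick an unused filename automatically instead of asking. The sketch below uses a hypothetical helper, `next_free_path`, which is not part of the original notebook; the `_N` suffix scheme is just one possible convention.

```python
from pathlib import Path

def next_free_path(directory, filename):
    """Return directory/filename, or directory/stem_N.suffix if that name is taken."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    # Append _1, _2, ... until a name that does not exist yet is found.
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate

# Usage with a DataFrame `df` (not defined here):
# df.to_csv(next_free_path("../data/split", "train_set_subset.csv"),
#           encoding="utf-8", index=False)
```

This never overwrites an existing export, at the cost of accumulating numbered copies that need occasional cleanup.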
