# Train / Cross Validation / Test Split

Moral of train/cv/test split methods: try to feed the best possible data to the deep model.

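Background reading: [preprocessing differences in standardization methods](https://towardsdatascience.com/preprocessing-differences-in-standardization-methods-de53d2525a87)
and [why we use an 80/20 split for training and test data, plus an alternative method](https://towardsdatascience.com/finally-why-we-use-an-80-20-split-for-training-and-test-data-plus-an-alternative-method-oh-yes-edc77e96295d)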

Alternative splitting methods:

- Pareto principle split: 80/20
- Scaling law split
- Banana split (no unfortunately this doesn't exist yet)

In [1]:
import os
from os import path
import pandas as pd
import numpy as np
from sklearn import preprocessing, model_selection

In [2]:
# Loading data

nodes = pd.read_csv('../data/raw/no_nan_data.csv', low_memory=False)

In [3]:
nodes

Unnamed: 0,node_id,Label,Revenue Size Flag,Account ID String,Address,Person or Organisation,Name,Income Size Flag,CoreCaseGraphID,ExtendedCaseGraphID,testingFlag
0,1502000,1,4,RvIOFQqK0E,missing,-1,missing,-1,0,0,
1,1502001,1,2,cSnM0hVDsm,missing,-1,missing,-1,0,0,
2,1502002,1,2,WAQWpZi4AD,missing,-1,missing,-1,2492,0,0.0
3,1502003,1,4,n5J9mBTeZc,missing,-1,missing,-1,0,0,
4,1502004,1,2,qxlAEuUm7P,missing,-1,missing,-1,0,0,
...,...,...,...,...,...,...,...,...,...,...,...
319371,3001177742,5,-1,missing,missing,2,gqDyJLC8DS,-1,0,0,
319372,3001177743,5,-1,missing,missing,1,B5TdCmIf69,-1,0,0,
319373,3001177744,5,-1,missing,missing,2,izrJE4sDpr,-1,0,0,
319374,3001177745,5,-1,missing,missing,1,TKQfFZ3fkk,-1,0,2030,0.0


In [4]:
# Count how often each testingFlag value occurs (NaN is not counted)
nodes.testingFlag.value_counts()

0.0    37669
1.0     1792
Name: testingFlag, dtype: int64

In [5]:
nodes.testingFlag.isna().sum()

279915

In [6]:
# TRAIN_SET + CV_SET
# Assumption (still to be confirmed): nodes with no testingFlag (NaN) are the nodes to use for training and cross validation

train_cv_set = nodes.loc[nodes['testingFlag'].isnull()]
train_cv_set.shape[0]

279915

In [7]:
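# TEST SET
# Assumption (still to be confirmed): nodes with testingFlag == 0 are the nodes on which to evaluate performance
# Note: the 1792 nodes with testingFlag == 1 end up in neither set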

test_set = nodes.loc[nodes['testingFlag'] == 0]
test_set.shape[0]

37669
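
A quick sanity check on the flag-based partition can make the gap visible. This is a minimal sketch on a toy frame: the toy values and the names `toy` and `unassigned` are illustrative and not taken from the dataset. It mirrors the two selections above and counts the rows that land in neither set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `nodes`: three NaN flags, two zeros, one 1.0 (illustrative values)
toy = pd.DataFrame({
    "node_id": [1, 2, 3, 4, 5, 6],
    "testingFlag": [np.nan, np.nan, 0.0, 0.0, 1.0, np.nan],
})

# Same selections as in the cells above
train_cv = toy.loc[toy["testingFlag"].isnull()]
test = toy.loc[toy["testingFlag"] == 0]

# Rows selected by neither rule
unassigned = toy.drop(train_cv.index.union(test.index))

# The two sets must not share any row
assert train_cv.index.intersection(test.index).empty
print(len(train_cv), len(test), len(unassigned))  # prints: 3 2 1
```

Run against the real `nodes` frame, the same check would show whether the rows with `testingFlag == 1` are being dropped on purpose.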

## 1: Pareto Split

In [8]:
# 80% train, 20% cross validation

train_set, validation_set = model_selection.train_test_split(train_cv_set, test_size=0.2)

In [9]:
train_set.shape

(223932, 11)

In [10]:
validation_set.shape

(55983, 11)
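
The split above is unseeded, so each run selects different rows. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable here; the helper name `split_80_20`, the seed value 42, and the toy frame are arbitrary placeholders.

```python
import pandas as pd
from sklearn import model_selection

# Toy stand-in for `train_cv_set` (illustrative values only)
toy = pd.DataFrame({"node_id": range(100), "Label": [i % 5 + 1 for i in range(100)]})

def split_80_20(df, seed=42):
    """Return (train, validation) with a fixed seed so reruns give the same rows."""
    return model_selection.train_test_split(df, test_size=0.2, random_state=seed)

train_a, val_a = split_80_20(toy)
train_b, val_b = split_80_20(toy)

# Same seed, same rows; sizes follow the 80/20 ratio; no row is in both sets
assert train_a.index.equals(train_b.index)
assert len(train_a) == 80 and len(val_a) == 20
assert set(train_a.index).isdisjoint(val_a.index)
```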

## 2: Other methods

TODO
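
One possible candidate for this section, offered as an assumption and not necessarily the method the author has in mind, is a stratified split on `Label`, so that the validation set keeps the same class proportions as the training set. A minimal sketch on a toy frame with made-up class counts:

```python
import pandas as pd
from sklearn import model_selection

# Toy stand-in with imbalanced labels: 60 / 30 / 10 rows (illustrative values)
toy = pd.DataFrame({
    "node_id": range(100),
    "Label": [1] * 60 + [4] * 30 + [5] * 10,
})

# stratify keeps each Label's share equal in both parts of the split
train, val = model_selection.train_test_split(
    toy, test_size=0.2, stratify=toy["Label"], random_state=0
)

val_counts = val["Label"].value_counts().to_dict()
print(val_counts)
```

With these counts, 20% of each class goes to validation, so the smallest class is still represented there.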

## Remove testingFlag

In [11]:
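# Drop the flag from all three sets so the exported files share the same columns
train_set = train_set.drop(['testingFlag'], axis=1)
validation_set = validation_set.drop(['testingFlag'], axis=1)
test_set = test_set.drop(['testingFlag'], axis=1)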

In [12]:
# TODO: ask what testingFlag actually means

In [13]:
train_set

Unnamed: 0,node_id,Label,Revenue Size Flag,Account ID String,Address,Person or Organisation,Name,Income Size Flag,CoreCaseGraphID,ExtendedCaseGraphID
70530,15020111969,1,3,54WA38G4CE,missing,-1,missing,-1,0,0
238033,2003013750,4,-1,missing,missing,1,"CONSTRUCTORA BIG, S.A.",-1,0,0
39686,15020118163,1,2,2738haLxZa,missing,-1,missing,-1,0,0
5741,15020013303,1,4,2PWl85WAUH,missing,-1,missing,-1,0,0
290925,3001147772,5,-1,missing,missing,1,qW35CCIBZA,-1,0,0
...,...,...,...,...,...,...,...,...,...,...
51684,15020119611,1,4,3806Ngie0q,missing,-1,missing,-1,0,0
6813,15020015419,1,2,AOsz31bpzy,missing,-1,missing,-1,0,0
143635,2501110691,2,-1,missing,TCmdcp9qxt,-1,missing,-1,0,0
208996,1001047766,3,-1,missing,missing,2,Jhzg3Fpn1E,1,0,0


## Export datasets

In [14]:
def write_csv_df(out_dir, filename, df):
    """Write df to out_dir/filename as CSV, asking before overwriting an existing file."""
    os.makedirs(out_dir, exist_ok=True)  # create the target directory if it is missing
    pathfile = os.path.normpath(os.path.join(out_dir, filename))
    if not os.path.isfile(pathfile):
        df.to_csv(pathfile, encoding='utf-8', index=False)
        return
    overwrite = input("WARNING: " + pathfile + " already exists! Do you want to overwrite <y/n>? \n ")
    if overwrite == 'y':
        df.to_csv(pathfile, encoding='utf-8', index=False)
    elif overwrite == 'n':
        # Ask for another name and retry with the same checks
        new_filename = input("Type new filename: \n ")
        write_csv_df(out_dir, new_filename, df)
    else:
        print("Not a valid input. Data is NOT saved!\n")
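
The prompt-based helper above cannot be answered when the notebook runs unattended. A minimal sketch of a prompt-free variant, assuming an explicit `overwrite` flag is an acceptable replacement for the question; the name `write_csv_df_noninteractive` and its parameters are hypothetical.

```python
import os
import tempfile

import pandas as pd

def write_csv_df_noninteractive(out_dir, filename, df, overwrite=False):
    """Write df to out_dir/filename; return True if the file was written."""
    os.makedirs(out_dir, exist_ok=True)
    pathfile = os.path.normpath(os.path.join(out_dir, filename))
    if os.path.isfile(pathfile) and not overwrite:
        print(f"{pathfile} already exists. Data is NOT saved.")
        return False
    df.to_csv(pathfile, encoding="utf-8", index=False)
    return True

# Usage on a throwaway directory (toy data, illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"node_id": [1, 2]})
    first = write_csv_df_noninteractive(tmp, "demo.csv", demo)
    second = write_csv_df_noninteractive(tmp, "demo.csv", demo)
    third = write_csv_df_noninteractive(tmp, "demo.csv", demo, overwrite=True)
```

The first call writes the file, the second refuses because it already exists, and the third overwrites it because the flag is set.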

In [16]:
# Export 

write_csv_df("../data/split", "train_set_subset.csv", train_set)
write_csv_df("../data/split", "validation_set_subset.csv", validation_set)
write_csv_df("../data/split", "test_set_subset.csv", test_set)