# Description
**Functionality**: The Wikipedia Homograph Data provided by Google is a two-way split dataset: train and eval. For the purposes of this study, a three-way split of the data into train, valid, and test is needed.

This module splits the Wikipedia homograph train split into 90% train, 10% valid. For the test split, the original Wikipedia homograph eval split has been copied over manually.

**Use**: The train and valid splits are used to train neural net models for homograph disambiguation, while the test set is held out for final evaluation.

### Imports

In [1]:
import os
from glob import glob
import pandas as pd
from tqdm import tqdm

### Variables

In [2]:
# Paths
WHD_DATA = "C:/Users/jseal/Dev/dissertation/Data/WikipediaHomographData/data/"
TRAIN = WHD_DATA + "train/"
NEW_TRAIN = WHD_DATA + "three_split_data/train/"
NEW_VAL = WHD_DATA + "three_split_data/valid/"
# Variables
RANDOM_STATE = 45  # Seed for reproducible sampling
FRAC = 0.1  # Fraction of each train file moved to valid

### Functions

In [3]:
def create_train_val(train_df: pd.DataFrame, f_name: str) -> None:
    """Split one homograph file into train and valid and write both as TSV."""
    # Randomly select FRAC (10%) of the rows for the valid split
    val_df = train_df.sample(frac=FRAC, random_state=RANDOM_STATE)
    # Keep the remaining rows as the new train split (relies on a unique index)
    new_train_df = train_df.drop(val_df.index)
    val_df.to_csv(NEW_VAL + f_name, sep='\t', index=False)
    new_train_df.to_csv(NEW_TRAIN + f_name, sep='\t', index=False)
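
To illustrate the sampling logic in the function above, here is a minimal sketch that applies the same `sample`/`drop` pattern to a small, made-up DataFrame. The column names and values are hypothetical and are not taken from the dataset. The sketch assumes a unique index, which is what `pd.read_table` produces by default.

```python
import pandas as pd

# Toy data standing in for one homograph file; columns are illustrative only.
toy_df = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(20)],
    "label": ["a", "b"] * 10,
})

# Same pattern as create_train_val: sample 10% for valid, keep the rest.
val_part = toy_df.sample(frac=0.1, random_state=45)
train_part = toy_df.drop(val_part.index)

# Row counts of the two parts, and the overlap between their indices.
print(len(val_part), len(train_part))
print(set(val_part.index) & set(train_part.index))
```

Because the number of sampled rows is rounded, a file with only a handful of rows may end up with an empty valid file. It may be worth checking for that case if any homograph has very few examples.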

### Script

In [4]:
# Make valid split, 10% of train; save rest as train
for f in tqdm(glob(TRAIN + '*.tsv')):
    f_name = os.path.basename(f)
    df = pd.read_table(f)
    create_train_val(df, f_name)

100%|███████████████████████████████████████████████████████████████████████████████| 162/162 [00:00<00:00, 191.10it/s]
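
After the script has run, a quick sanity check can confirm that no rows were lost or duplicated. The sketch below is not part of the original module. `check_split` is a hypothetical helper that re-reads each file and compares row counts, assuming the three directories contain files with matching names.

```python
import os
from glob import glob

import pandas as pd


def check_split(orig_dir: str, train_dir: str, valid_dir: str) -> int:
    """Count files whose train + valid rows do not add up to the original.

    Hypothetical helper for illustration; assumes matching file names
    in all three directories.
    """
    mismatches = 0
    for path in glob(os.path.join(orig_dir, "*.tsv")):
        f_name = os.path.basename(path)
        n_orig = len(pd.read_table(path))
        n_train = len(pd.read_table(os.path.join(train_dir, f_name)))
        n_valid = len(pd.read_table(os.path.join(valid_dir, f_name)))
        if n_train + n_valid != n_orig:
            mismatches += 1
    return mismatches
```

With the variables defined earlier, calling `check_split(TRAIN, NEW_TRAIN, NEW_VAL)` would be expected to return 0 if every file was split cleanly.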
