# Description
**Functionality**: The Wikipedia Homograph Data provided by Google ships with two splits: train and eval. For the purposes of this study, we re-split the data three ways: train, valid, and test.

This module splits the Wikipedia homograph train split into 90% train and 10% valid, separately for each `.tsv` file; the valid split is written to the `dev/` directory. For the test split, the original Wikipedia homograph eval split is copied over unchanged.
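
As a minimal sketch of the 90/10 split described above, assuming a toy DataFrame in place of a real train file (the column names `sentence` and `label` are illustrative, not the dataset's actual schema), sampling a fraction of the rows and then dropping the sampled index yields two disjoint pieces:

```python
import pandas as pd

# Toy stand-in for one train .tsv file; the column names are hypothetical.
toy = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(20)],
    "label": ["a", "b"] * 10,
})

# Sample 10% of the rows with a fixed seed so the split is reproducible.
valid = toy.sample(frac=0.1, random_state=45)
# Dropping the sampled index leaves the remaining 90% as train.
train = toy.drop(valid.index)

print(len(train), len(valid))
```

Because `drop` works on index labels, this pattern relies on the index being unique, which holds for the default index produced when a file is read without an explicit index column.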

**Use**: The train and valid splits are used to train neural net models for homograph disambiguation, while the test set is held out for final evaluation.

### Imports

In [1]:
import os
import shutil
from glob import glob
import pandas as pd
from tqdm import tqdm

### Variables

In [2]:
# Paths
WHD_DATA = "C:/Users/jseal/Dev/dissertation/Data/WikipediaHomographData/data/"
WHD_DATA_VARIANT = WHD_DATA + "variant_data/"
TRAIN = WHD_DATA_VARIANT + "train/"
EVAL = WHD_DATA_VARIANT + "eval/"

THREE_SPLITS = WHD_DATA + "three_split_variant_data/"
NEW_TRAIN = THREE_SPLITS + "train/"
NEW_DEV = THREE_SPLITS + "dev/"
NEW_TEST = THREE_SPLITS + "test/"

# Split parameters
RANDOM_STATE = 45  # seed for reproducible sampling
FRAC = 0.1  # fraction of each train file held out as valid (dev)

### Functions

In [3]:
def create_train_val(df: pd.DataFrame, f_name: str) -> None:
    """Split one train file into valid (FRAC of rows) and train (the rest)."""
    # Sample FRAC (10%) of the rows for valid; the fixed seed keeps it reproducible
    val_df = df.sample(frac=FRAC, random_state=RANDOM_STATE)
    # Rows that were not sampled stay in train, so the two splits are disjoint
    train_df = df.drop(val_df.index)
    val_df.to_csv(NEW_DEV + f_name, sep='\t', index=False)
    train_df.to_csv(NEW_TRAIN + f_name, sep='\t', index=False)
    
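def eval_2_test() -> None:
    """Copy every eval file into the test directory unchanged."""
    # Create the target directory if it does not exist yet
    os.makedirs(NEW_TEST, exist_ok=True)
    for f in tqdm(glob(EVAL + "*.tsv")):
        shutil.copy(f, NEW_TEST)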

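def train_2_train_dev() -> None:
    """Make 10% of each train file the dev split; save the rest as train."""
    # Create the target directories if they do not exist yet
    os.makedirs(NEW_TRAIN, exist_ok=True)
    os.makedirs(NEW_DEV, exist_ok=True)
    for f in tqdm(glob(TRAIN + '*.tsv')):
        f_name = os.path.basename(f)
        df = pd.read_table(f)  # files are tab-separated
        create_train_val(df, f_name)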

### Script

In [4]:
eval_2_test()
train_2_train_dev()

100%|██████████| 144/144 [00:00<00:00, 1482.76it/s]
100%|██████████| 144/144 [00:00<00:00, 230.64it/s]
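
One way to sanity-check a split like this is sketched below, under stated assumptions: the helper `check_split`, the toy data, and the temporary directory layout are hypothetical and not part of the notebook. The idea is that, for any one file, the train and dev rows together should reproduce the original train file.

```python
import os
import tempfile

import pandas as pd


def check_split(orig_path: str, train_path: str, dev_path: str) -> bool:
    """Return True if train and dev together hold exactly the original rows."""
    orig = pd.read_table(orig_path)
    train = pd.read_table(train_path)
    dev = pd.read_table(dev_path)
    # Row counts must add up to the original file's row count.
    if len(train) + len(dev) != len(orig):
        return False
    # Recombined rows must match the original rows, ignoring row order.
    cols = list(orig.columns)
    combined = pd.concat([train, dev]).sort_values(cols).reset_index(drop=True)
    return combined.equals(orig.sort_values(cols).reset_index(drop=True))


# Demonstrate on a toy file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    orig_path = os.path.join(tmp, "orig.tsv")
    train_path = os.path.join(tmp, "train.tsv")
    dev_path = os.path.join(tmp, "dev.tsv")

    toy = pd.DataFrame({
        "sentence": [f"s{i}" for i in range(10)],
        "label": list("ab" * 5),
    })
    toy.to_csv(orig_path, sep="\t", index=False)
    dev = toy.sample(frac=0.1, random_state=45)
    toy.drop(dev.index).to_csv(train_path, sep="\t", index=False)
    dev.to_csv(dev_path, sep="\t", index=False)

    ok = check_split(orig_path, train_path, dev_path)

print(ok)
```

The exact-equality comparison is deliberately strict; on real files it may need loosening if pandas infers different column dtypes for the smaller dev file than for the full original.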
