### Overview

In this script, I first split the CCNC (Comprehensive Chinese Name Corpus) into train/dev/test sets at a ratio of 8:1:1, which gives 2926487 : 365811 : 365811 examples.
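As a quick check of the arithmetic, the sketch below recomputes the three sizes from the 8:1:1 ratio. It assumes the corpus total is simply the sum of the three sizes quoted above and uses the same rounding as the split function later in this notebook.

```python
# Total corpus size implied by the three split sizes quoted above.
total = 2926487 + 365811 + 365811

# Cut points for an 8:1:1 split, rounded to the nearest whole example.
boundary1 = round(total * 0.8)
boundary2 = round(total * (0.8 + 0.1))

# Train takes everything before boundary1, dev the slice between the
# two boundaries, and test whatever remains.
sizes = (boundary1, boundary2 - boundary1, total - boundary2)
print(sizes)  # (2926487, 365811, 365811)
```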

Because only first names are used to train the logistic regression model, the train and dev sets are then deduplicated on first name and gender so that the model can be trained more efficiently. This reduces the train set from 2926487 to 633857 examples and the dev set from 365811 to 167528. The saved train/dev sets contain only first names and their genders.

The test set, by contrast, keeps full names and their genders, so it is deduplicated on full name and gender, which leaves 365810 examples (deduplicating on first name and gender would instead reduce it from 365811 to 167849). Full names are kept because a rule-based filter will later extract the first names from full names, and I want to test whether that filter works well.

Therefore, counted by unique first name and gender pairs, the actual ratio among the train/dev/test sets is 633857 : 167528 : 167849, which is roughly 6:2:2.
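The approximate ratio can be checked directly from the three counts. This is only a small sanity check; the dictionary name `counts` is introduced here for illustration.

```python
# Unique first name + gender pairs per split, as quoted above.
counts = {'train': 633857, 'dev': 167528, 'test': 167849}

total = sum(counts.values())

# Share of each split, rounded to two decimals.
shares = {name: round(n / total, 2) for name, n in counts.items()}
print(shares)  # {'train': 0.65, 'dev': 0.17, 'test': 0.17}
```

The shares come out close to 65% / 17% / 17%, so 6:2:2 is a loose rounding rather than an exact figure.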

In [1]:
from random import shuffle, seed


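def readFile(filepath='data/ccnc.txt'):
    '''Read the corpus file and return its data lines.

    Paras:
        filepath: str. Path to the corpus file. Defaults to
            'data/ccnc.txt', relative to the current directory.

    Returns:
        list of str: every line after the header, with
            trailing newlines kept.
    '''
    # the with-block closes the file once reading is done
    with open(filepath, 'r') as f:
        next(f)  # skip the header line
        data = list(f)
    return data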


def train_dev_test_split(data, train=0.6, dev=0.2, test=0.2, seed_idx=5):
    '''
    Shuffle the data and split it into train, dev and test sets.

    Paras:
        data: list. Note that it is shuffled in place.
        train, dev, test: respective ratios for the train, dev and test sets.
            Default to 0.6, 0.2, 0.2 respectively. Only train and dev set
            the boundaries; the test set takes whatever remains.
        seed_idx: Int. Defaults to 5 (a randomly picked seed).
    '''

    seed(seed_idx)
    shuffle(data)
    length = len(data)
    boundary1 = round(length * train)
    boundary2 = round(length * (train + dev))

    # return train_ds, dev_ds, test_ds
    return data[:boundary1], data[boundary1: boundary2], data[boundary2:]
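Because `shuffle` reorders the list in place, the caller's `data` is left shuffled after the split. The sketch below is one possible non-mutating variant, not part of the original script: `split_copy` is a hypothetical name, and it uses a private `random.Random` instance so the global random state is also left alone.

```python
from random import Random


def split_copy(data, train=0.8, dev=0.1, seed_idx=5):
    '''Hypothetical variant: shuffle a copy, then split it.'''
    shuffled = list(data)                # leave the caller's list untouched
    Random(seed_idx).shuffle(shuffled)   # seeded, so the order is reproducible
    length = len(shuffled)
    boundary1 = round(length * train)
    boundary2 = round(length * (train + dev))
    return shuffled[:boundary1], shuffled[boundary1:boundary2], shuffled[boundary2:]


# Toy rows standing in for corpus lines.
rows = [f'row{i}\n' for i in range(10)]
first = split_copy(rows)
second = split_copy(rows)

print([len(part) for part in first])  # [8, 1, 1]
print(first == second)                # True
```

The same seed gives the same three parts on every call, which is what makes the split sizes reported in this notebook reproducible.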


In [2]:
data = readFile()
train_ds, dev_ds, test_ds = train_dev_test_split(data, 0.8, 0.1, 0.1)
# sanity check: the three splits together should cover every example
assert len(data) == len(train_ds) + len(dev_ds) + len(test_ds)
print(f' train set size: {len(train_ds)}\n dev set size: {len(dev_ds)}\n test set size: {len(test_ds)}')

 train set size: 2926487
 dev set size: 365811
 test set size: 365811


In [3]:
# deduplicate the train set and dev set based on first name and gender

from collections import Counter


def deduplicate(data, is_test=False):
    '''Deduplicate a split on name and gender.

    The train and dev sets are deduplicated on first name and gender. The
    test set (is_test=True) is deduplicated on full name and gender instead.

    Paras:
        data: list --> train_ds, dev_ds or test_ds
        is_test: bool. Defaults to False.

    Returns:
        A dict keys view of unique tab-separated name and gender strings,
        in first-seen order. Lines are not stripped, so each string keeps
        the newline that follows the gender field.
    '''
    # column 1 holds the first name, column 2 the full name
    name_idx = 2 if is_test else 1
    count = Counter()
    for example in data:
        fields = example.split('\t')
        name, gender = fields[name_idx], fields[3]
        # the Counter is only used as an ordered set of unique keys
        count[name + '\t' + gender] = 1
    return count.keys()
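To see why the choice of key matters, here is a toy sketch of the two deduplication modes. The rows and the `M`/`F` labels are made up for illustration, the helper `unique_pairs` is hypothetical, and the column order (last name, first name, full name, gender) is assumed from the docstrings in this notebook.

```python
# Made-up rows: last name, first name, full name, gender.
rows = [
    '张\t伟\t张伟\tM\n',
    '李\t伟\t李伟\tM\n',
    '王\t伟\t王伟\tF\n',
    '张\t伟\t张伟\tM\n',  # exact duplicate of the first row
]


def unique_pairs(rows, name_col):
    '''Return unique (name, gender) pairs in first-seen order.'''
    pairs = {}
    for row in rows:
        fields = row.rstrip('\n').split('\t')
        pairs[(fields[name_col], fields[3])] = None  # dict used as an ordered set
    return list(pairs)


by_first = unique_pairs(rows, name_col=1)  # train/dev style key
by_full = unique_pairs(rows, name_col=2)   # test style key

print(len(by_first), len(by_full))  # 2 3
```

Keying on the first name collapses people who share a first name and gender, while keying on the full name only removes exact repeats, which matches the much smaller reduction seen for the test set.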

In [4]:
train_ds_new = deduplicate(train_ds)
dev_ds_new = deduplicate(dev_ds)
test_ds_new = deduplicate(test_ds, is_test=True)
print(f' new train set: {len(train_ds_new)}\n new dev set: {len(dev_ds_new)}\n new test set: {len(test_ds_new)}')

 new train set: 633857
 new dev set: 167528
 new test set: 365810


In [5]:
def fileWriter(data, file_name):
    '''Write deduplicated name examples to a txt file, saved in
    the current directory if file_name does not include a full path.

    Paras:
        data: iterable of str
            deduplicated examples; each is a tab-separated name
            and gender ending with a newline
        file_name: str
            '.txt' is appended if it is missing
    '''
    file_name = file_name if file_name.endswith('.txt') else file_name + '.txt'
    tmp = '{}\t{}'
    with open(file_name, 'w') as f:
        f.write(tmp.format('name', 'gender\n'))
        f.write(''.join(data))
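A quick way to confirm the saved format is to write a few examples and read them back. The sketch below uses a hypothetical stand-in, `write_examples`, with made-up examples and a temporary directory. The explicit `encoding='utf-8'` is an assumption that the original function does not make.

```python
import os
import tempfile


def write_examples(examples, path):
    '''Hypothetical stand-in for fileWriter: header line, then the examples.'''
    with open(path, 'w', encoding='utf-8') as f:
        f.write('name\tgender\n')
        f.writelines(examples)


# Made-up deduplicated examples, each ending with a newline.
examples = ['伟\tM\n', '芳\tF\n']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train_ds.txt')
    write_examples(examples, path)
    with open(path, encoding='utf-8') as f:
        header = next(f)   # first line is the header
        loaded = list(f)   # remaining lines are the examples

print(header == 'name\tgender\n', loaded == examples)  # True True
```

Since each example already carries its own newline, no separator is added when writing, and the file can be read back with the same header-skipping pattern used for the corpus.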

In [6]:
fileWriter(train_ds_new, 'data/train_ds.txt')
fileWriter(dev_ds_new, 'data/dev_ds.txt')
fileWriter(test_ds_new, 'data/test_ds.txt')