## Split into Training and Evaluation Datasets

*  The corpus was split into training and evaluation datasets using the letter number assigned during segmentation: even-numbered letters in one set, odd-numbered letters in the other.  
*  Because topics evolve over time in newspapers, this approach is more robust than splitting the datasets by year, tempting as that was.  
*  Even-numbered letters were designated as training data and odd-numbered letters as evaluation data. The two roles are easy to swap for further testing.
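
The reasoning in the list above can be illustrated with a small sketch. The records below are made-up (year, file name) pairs, not files from the corpus, and `split_by_parity` and `split_by_year` are hypothetical helpers. The point is only that a parity split leaves every year represented in both sets, while a split by year does not.

```python
# Hypothetical (year, file name) records; the names only mimic the corpus naming scheme.
records = [
    ("1974", "dds-00001-page-1-article-1"),
    ("1974", "dds-00001-page-1-article-2"),
    ("1975", "dds-00002-page-3-article-1"),
    ("1975", "dds-00002-page-3-article-2"),
]


def letter_number(name):
    """Return the number after the last hyphen in a file name."""
    return int(name.rsplit("-", 1)[1])


def split_by_parity(records):
    """Even letter numbers go to training, odd ones to evaluation."""
    training = [r for r in records if letter_number(r[1]) % 2 == 0]
    evaluation = [r for r in records if letter_number(r[1]) % 2 == 1]
    return training, evaluation


def split_by_year(records, last_training_year):
    """Years up to last_training_year go to training, later years to evaluation."""
    training = [r for r in records if r[0] <= last_training_year]
    evaluation = [r for r in records if r[0] > last_training_year]
    return training, evaluation


def years(subset):
    """Return the sorted list of distinct years present in a subset."""
    return sorted({year for year, _ in subset})


parity_train, parity_eval = split_by_parity(records)
year_train, year_eval = split_by_year(records, "1974")

print(years(parity_train), years(parity_eval))  # both sets cover 1974 and 1975
print(years(year_train), years(year_eval))      # training sees only 1974, evaluation only 1975
```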

In [25]:
import os
import re
import sys
import glob
from pathlib import Path
from pprint import pprint

current_directory = os.getcwd()
prj_root = os.path.dirname(current_directory)
data_dir = f'{prj_root}/data'
annif_unesco_dir = f'{data_dir}/annif/letters-unesco'
training_dir = f'{data_dir}/annif/letters-unesco/training'
eval_dir = f'{data_dir}/annif/letters-unesco/evaluation'
txt_proc_dir = f'{prj_root}/data/TXT_PROC'
txt_xml_dir = f'{prj_root}/data/TXT_XML'

proc_year = "1974"

# Collect the file names of all segmented letters for the chosen year.
path_list = [f.name for f in sorted(Path(annif_unesco_dir).glob(f'{proc_year}/*.txt'))]

# Split on the letter number, i.e. the digits after the last hyphen in the file name.
odds = []
even = []
for txt_name in path_list:
    txt_sans_ext = os.path.splitext(txt_name)[0]
    letter_no = txt_sans_ext[txt_sans_ext.rindex('-')+1:]
    if int(letter_no) % 2 == 0:
        even.append(txt_sans_ext)
    else:
        odds.append(txt_sans_ext)

print(f"Even {len(even)}, Odds {len(odds)}")

Even 149, Odds 207
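
The slice on the last hyphen assumes every file name ends in `-<number>`. A slightly more defensive variant, sketched below, uses a regular expression and makes the training parity a parameter so the two roles can be swapped. `split_letters` and its `train_on` argument are illustrative names, not part of the notebook, and the second file name in the demo is made up.

```python
import re

# Matches the trailing letter number, e.g. the 6 in "dds-89477-page-8-article-6".
LETTER_NO = re.compile(r"-(\d+)$")


def split_letters(names, train_on="even"):
    """Split file names (without extension) into (training, evaluation) lists."""
    if train_on not in ("even", "odd"):
        raise ValueError("train_on must be 'even' or 'odd'")
    training, evaluation = [], []
    for name in names:
        match = LETTER_NO.search(name)
        if match is None:
            raise ValueError(f"no letter number in {name!r}")
        is_even = int(match.group(1)) % 2 == 0
        # A letter goes to training when its parity matches the requested one.
        if is_even == (train_on == "even"):
            training.append(name)
        else:
            evaluation.append(name)
    return training, evaluation


names = ["dds-89477-page-8-article-6", "dds-89477-page-8-article-7"]
print(split_letters(names))                  # article-6 trains, article-7 evaluates
print(split_letters(names, train_on="odd"))  # roles swapped
```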


In [26]:
from shutil import copyfile, rmtree

source_directory = f'{annif_unesco_dir}/{proc_year}'

# Even-numbered letters go to training, odd-numbered letters to evaluation.
for target_dir, names in ((training_dir, even), (eval_dir, odds)):
    target_directory = f'{target_dir}/{proc_year}'
    os.makedirs(target_directory, exist_ok=True)
    for name in names:
        # Copy each letter together with its subject key file.
        for ext in ('.txt', '.key'):
            copyfile(f'{source_directory}/{name}{ext}', f'{target_directory}/{name}{ext}')
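
`copyfile` raises `FileNotFoundError` if a letter has no matching `.key` file, which would stop the copy partway through. A check along these lines could be run first. `missing_keys` is a hypothetical helper, and the demo uses a temporary directory with made-up file names rather than the real data.

```python
import tempfile
from pathlib import Path


def missing_keys(directory):
    """Return the stems of .txt files in directory that have no .key file beside them."""
    directory = Path(directory)
    return sorted(
        txt.stem for txt in directory.glob("*.txt")
        if not txt.with_suffix(".key").exists()
    )


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("letter-1.txt", "letter-1.key", "letter-2.txt"):
        (Path(tmp) / name).write_text("")
    demo_missing = missing_keys(tmp)

print(demo_missing)  # ['letter-2']
```

In the notebook this would be called as `missing_keys(f'{annif_unesco_dir}/{proc_year}')` before the copy loop.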

In [27]:
# Remove the unsplit year folder once its files have been copied to training/ and evaluation/.
year_directory = f'{annif_unesco_dir}/{proc_year}'
try:
    rmtree(year_directory)
except OSError as e:
    print(f"Error: {year_directory} : {e.strerror}")
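
Because `rmtree` is irreversible, it may be worth confirming first that every file in the year folder exists in exactly one of the two target folders. The sketch below is one way to do that. `split_is_complete` is a hypothetical helper, it compares file names only and not contents, and the demo runs on a temporary directory with made-up files.

```python
import tempfile
from pathlib import Path


def file_names(directory):
    """Return the set of file names directly inside directory."""
    return {p.name for p in Path(directory).iterdir() if p.is_file()}


def split_is_complete(source, training, evaluation):
    """True if every file name in source is in exactly one of training or evaluation."""
    train = file_names(training)
    evaluate = file_names(evaluation)
    return all((name in train) != (name in evaluate) for name in file_names(source))


# Demo: the split is incomplete until letter-1.txt reaches the evaluation folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder in ("source", "training", "evaluation"):
        (root / folder).mkdir()
    for name in ("letter-1.txt", "letter-2.txt"):
        (root / "source" / name).write_text("")
    (root / "training" / "letter-2.txt").write_text("")
    before = split_is_complete(root / "source", root / "training", root / "evaluation")
    (root / "evaluation" / "letter-1.txt").write_text("")
    after = split_is_complete(root / "source", root / "training", root / "evaluation")

print(before, after)  # False True
```

In the notebook, `rmtree` would then be called only when `split_is_complete(f'{annif_unesco_dir}/{proc_year}', f'{training_dir}/{proc_year}', f'{eval_dir}/{proc_year}')` returns `True`.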

#### Load Vocabulary
`annif loadvoc letters-omikuji-bonsai-en data/vocabs/unesco-en.tsv`  

#### Train with the odd-numbered letters (the `evaluation` folders) for all years 1974‒1978  
`annif train letters-omikuji-bonsai-en data/annif/letters-unesco/evaluation/1974 data/annif/letters-unesco/evaluation/1975 data/annif/letters-unesco/evaluation/1976 data/annif/letters-unesco/evaluation/1977 data/annif/letters-unesco/evaluation/1978`

#### Evaluate with the even-numbered letters (the `training` folders) for all years 1974‒1978  
`annif eval letters-omikuji-bonsai-en --limit 5 --threshold 0.6 data/annif/letters-unesco/training/1974 data/annif/letters-unesco/training/1975 data/annif/letters-unesco/training/1976 data/annif/letters-unesco/training/1977 data/annif/letters-unesco/training/1978`
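
The two commands above repeat the same path for each year. A small helper could generate them, which also makes it easy to swap the training and evaluation folders. `annif_command` is a hypothetical function, not part of Annif: it only builds the command string, using the same subcommands, options, and paths as the commands above.

```python
def annif_command(action, project, subset, years, options=()):
    """Build an annif command line over one folder per year.

    subset is the folder name under data/annif/letters-unesco
    ("training" or "evaluation").
    """
    paths = [f"data/annif/letters-unesco/{subset}/{year}" for year in years]
    return " ".join(["annif", action, project, *options, *paths])


years = range(1974, 1979)  # 1974 to 1978 inclusive
project = "letters-omikuji-bonsai-en"

train_cmd = annif_command("train", project, "evaluation", years)
eval_cmd = annif_command("eval", project, "training", years,
                         options=("--limit", "5", "--threshold", "0.6"))
print(train_cmd)
print(eval_cmd)
```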

#### Quick test
`cat data/TXT_PROC/1975/dds-89477-page-8-article-6.txt | annif suggest letters-omikuji-bonsai-en`  

#### Find the limit and threshold values that give the best scores for a directory
`annif optimize letters-omikuji-bonsai-en data/annif/letters-unesco/training/1975`

#### Same steps for the `letters-omikuji-parabel-en` project
`annif loadvoc letters-omikuji-parabel-en data/vocabs/unesco-en.tsv`  
`annif train letters-omikuji-parabel-en data/training/letters-unesco`  
`cat /home/anthony/Documents/UofA/Thesis/lettersiterate/data/TXT_PRE/1975/dds-89477-page-8-article-6-PRE.txt  | annif suggest letters-omikuji-parabel-en`  

#### Same steps for the `letters-fasttext` project
`annif loadvoc letters-fasttext data/vocabs/unesco-en.tsv`  
`annif train letters-fasttext data/training/letters-unesco`  
`cat /home/anthony/Documents/UofA/Thesis/lettersiterate/data/TXT_PRE/1975/dds-89477-page-8-article-6-PRE.txt  | annif suggest letters-fasttext`  