# Summary

Generate train and validation files for default training with the script provided by Hugging Face. Later we'll experiment with incendio/fastai/PyTorch Lightning and add things like discriminative learning rates, experiment tracking in Comet.ml/MLflow, or architecture experiments. For now I just want to run the basics and see what kind of results they produce.

Note: I've decided to remove my page break token. Since I'm not training page by page anyway, these don't mean much. Page breaks usually don't capture that much information about an author's voice (in fact, different versions of the book might have breaks in different places depending on page/font size) so I don't think we're losing much. I can always insert page breaks every ~500 words in the generated text if I want to make a longer book.
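A minimal sketch of re-inserting page breaks into generated text every ~500 words. The helper name `insert_page_breaks` and the `<PAGE>` token are hypothetical placeholders (the real BREAK constant lives in the stormlight package and isn't shown here). Splitting on whitespace also collapses newlines, which may or may not be acceptable for the final output.

```python
def insert_page_breaks(text, every=500, token='<PAGE>'):
    """Insert `token` after every `every` whitespace-separated words.

    Hypothetical helper: `token` stands in for whatever page break marker
    the final book format needs.
    """
    words = text.split()
    # Group the words into consecutive pages of at most `every` words.
    pages = [' '.join(words[i:i + every])
             for i in range(0, len(words), every)]
    # Breaks go between pages only, never after the last one.
    return f' {token} '.join(pages)


sample = ' '.join(f'w{i}' for i in range(1200))
paged = insert_page_breaks(sample)
print(paged.count('<PAGE>'))  # 2 breaks: pages of 500, 500 and 200 words
```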

Because of how I'm building the train/val split, we should use their LineByLineTextDataset rather than their TextDataset. I didn't want to just chop off the end of each book to construct a validation set since this would leave much of the Sanderlanche out of the training set.
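A rough sketch of how the two dataset styles differ, as far as I understand them: the block style treats the whole file as one token stream and cuts fixed-size chunks, while the line-by-line style makes one example per non-empty line. Whitespace tokens stand in for a real tokenizer here, and `block_examples` / `line_examples` are hypothetical names, not Hugging Face APIs.

```python
def block_examples(text, block_size):
    """Block style: one long token stream cut into fixed-size chunks."""
    tokens = text.split()
    # A trailing partial block is dropped in this sketch.
    return [tokens[i:i + block_size]
            for i in range(0, len(tokens) - block_size + 1, block_size)]


def line_examples(text, block_size):
    """Line-by-line style: one example per non-empty line, truncated."""
    return [line.split()[:block_size]
            for line in text.splitlines() if line.strip()]


text = ('Kalak rounded a ridge.\n'
        'They faded.\n'
        '\n'
        'He stumbled back as they hovered.')

print(block_examples(text, 4))  # blocks can straddle line boundaries
print(line_examples(text, 4))   # each example stays within one line
```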

In [4]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [20]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

from htools import *
from stormlight import *

In [3]:
cd_root()

Current directory: /Users/hmamin/stormlight


In [44]:
books = load_books(*STORMLIGHT)
books.keys()

dict_keys(['kings', 'words', 'oath'])

In [45]:
len([word for book in books.values() for word in book.split(' ')])

1128260

In [46]:
lines = [line for book in books.values() 
         for line in book.replace(BREAK, '').split('.') if line]
len(lines)

118571

In [48]:
train, val = train_test_split(lines, test_size=0.05, random_state=0, 
                              shuffle=False)
len(train), len(val)

(112642, 5929)
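For reference, a plain-Python sketch of what an unshuffled split does, assuming the validation size is rounded up as I believe train_test_split does for a float test_size. `ordered_split` is a hypothetical helper, not part of sklearn. With shuffle=False the order is preserved, so the validation set is simply the final slice of the sentence list.

```python
import math


def ordered_split(items, test_size):
    """Hold out the last `test_size` fraction of `items`, keeping order."""
    # Assumption: the held-out count is rounded up.
    n_val = math.ceil(test_size * len(items))
    n_train = len(items) - n_val
    return items[:n_train], items[n_train:]


# Stand-in for the 118571 sentences above.
train_demo, val_demo = ordered_split(list(range(118571)), 0.05)
print(len(train_demo), len(val_demo))  # 112642 5929
```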

In [49]:
train[:5]

['PRELUDE TO\n\nTHE STORMLIGHT ARCHIVE\n\nKalak rounded a rocky stone ridge and stumbled to a stop before the body of a dying\nthunderclast',
 ' The enormous stone beast lay on its side, riblike protrusions from its chest\nbroken and cracked',
 ' The monstrosity was vaguely skeletal in shape, with unnaturally long\nlimbs that sprouted from granite shoulders',
 ' The eyes were deep red spots on the arrowhead\nface, as if created by a fire burning deep within the stone',
 ' They faded']

In [50]:
val[:5]

[' To get a few spheres',
 '\nHe stumbled back as they hovered around Rock and Bisig, then fled through a falling patch of\nshamespren into the hallway outside',
 '\nFIVE AND A HALF YEARS AGO\nDalinar came to himself, gasping, in the cabin of a stormwagon',
 ' Heart pounding, he spun about,\nkicking aside empty bottles and lifting his fists',
 ' Outside, the riddens of a storm washed the walls with\nrain']

In [51]:
save('.'.join(train), 'data/clean/train_huggingface.txt')
save('.'.join(val), 'data/clean/val_huggingface.txt')

Writing data to data/clean/train_huggingface.txt.
Writing data to data/clean/val_huggingface.txt.
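One thing to keep in mind: splitting on '.', dropping empty strings, and re-joining with '.' does not reproduce the original text exactly. A small illustration on a made-up sentence (`split_sentences` is a hypothetical name that wraps the same list comprehension used above):

```python
def split_sentences(text):
    """Split on periods and drop empty strings, as in the cells above."""
    return [line for line in text.split('.') if line]


text = 'They faded. Wait... what.'
pieces = split_sentences(text)
rejoined = '.'.join(pieces)

print(pieces)    # ['They faded', ' Wait', ' what']
print(rejoined)  # They faded. Wait. what
```

The ellipsis collapses to a single period and the final period is lost, since both produce empty strings that the filter removes.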


In [52]:
save('.'.join(lines), 'data/clean/all_huggingface.txt')

Writing data to data/clean/all_huggingface.txt.


In [56]:
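def generate_huggingface_files(books, pre, train_pct=1.0, replace_break=''):
    """Write sentence-split text files for the Hugging Face training script.

    books: iterable of arguments unpacked into load_books (e.g. STORMLIGHT).
    pre: prefix for the output file names.
    train_pct: fraction of sentences kept for training. At 1.0, no
        validation file is written.
    replace_break: string substituted for each BREAK token.
    """
    books = load_books(*books)
    # Split on periods and drop empty strings.
    lines = [line for book in books.values() for line in
             book.replace(BREAK, replace_break).split('.') if line]
    if train_pct < 1.0:
        # shuffle=False preserves order: validation is the final slice.
        texts = train_test_split(lines, train_size=train_pct,
                                 random_state=0, shuffle=False)
    else:
        texts = [lines]
    # zip stops at the shorter input, so a lone train split writes one file.
    for text, split in zip(texts, ['train', 'val']):
        save('.'.join(text), f'data/clean/{pre}_{split}_huggingface.txt')
    return texts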

In [58]:
cosmere = generate_huggingface_files(STORMLIGHT+NOVELLAS, 
                                     'cosmere')[0]

Writing data to data/clean/cosmere_train_huggingface.txt.
