## Pretraining on WikiText103

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [2]:
#export
from exp.nb_12a import *

## Data

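One-time download: the commented-out commands in cell 5 fetch and unpack the dataset and rename its files. They only need to be run once.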

In [3]:
path = datasets.Config().data_path(); path
version = '103'  # or '2' for the smaller WikiText-2

In [4]:
!ls {path}

bedroom      imagenette2-160	  imdb		       planet
bedroom.tgz  imagenette2-160.tgz  imdb.tgz	       wikitext-103
camvid	     imagenette2.tgz	  mnist.pkl.gz	       wikitext-103-v1.zip
camvid.tgz   imagewoof2-160	  oxford-iiit-pet      wikitext-103-v1.zip.1
imagenette2  imagewoof2-160.tgz   oxford-iiit-pet.tgz


In [5]:
# ! wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-{version}-v1.zip -P {path}
# ! unzip -q -n {path}/wikitext-{version}-v1.zip  -d {path}
# ! mv {path}/wikitext-{version}/wiki.train.tokens {path}/wikitext-{version}/train.txt
# ! mv {path}/wikitext-{version}/wiki.valid.tokens {path}/wikitext-{version}/valid.txt
# ! mv {path}/wikitext-{version}/wiki.test.tokens {path}/wikitext-{version}/test.txt

Split the articles: WT103 comes as one big text file, so we need to split it into separate articles if we want to be able to shuffle them at the beginning of each epoch.

In [6]:
path = datasets.Config().data_path()/'wikitext-103'

In [7]:
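def istitle(line):
    # Top-level article titles look like ' = Title = '; section headings such as ' = = Section = = ' do not match.
    return re.search(r'^ = [^=]* = $', line) is not None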
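
As a quick sanity check of the title pattern, here is a minimal standalone sketch. The sample lines are made up for illustration and only mimic the spacing of WikiText files; they are not taken from the dataset.

```python
import re

def istitle(line):
    # Same pattern as in the notebook: ' = Title = ' with no '=' inside the title.
    return re.search(r'^ = [^=]* = $', line) is not None

# Hypothetical sample lines in WikiText-style spacing.
samples = [
    ' = Some Article = \n',        # top-level title
    ' = = Some Section = = \n',    # section heading
    ' Plain body text . \n',       # ordinary line
]
for s in samples:
    print(repr(s), istitle(s))
```

Only the first line should be reported as a title, because the character class `[^=]*` rejects the doubled `=` used by section headings.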

In [8]:
def read_wiki(filename):
    "Read a WikiText file and split it into one string per article."
    articles = []
    with open(filename, encoding='utf8') as f:
        lines = f.readlines()
    current_article = ''
    for i, line in enumerate(lines):
        current_article += line
        # An article ends when the next line is blank and the one after it is a top-level title.
        if i < len(lines)-2 and lines[i+1] == ' \n' and istitle(lines[i+2]):
            current_article = current_article.replace('<unk>', UNK)
            articles.append(current_article)
            current_article = ''
    # Flush the last article, which is not followed by another title.
    current_article = current_article.replace('<unk>', UNK)
    articles.append(current_article)
    return articles
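
To see the splitting rule in isolation, the sketch below applies the same boundary test to an in-memory list of lines instead of a file. `split_articles` and the sample lines are hypothetical, and `UNK = 'xxunk'` is an assumption standing in for the constant imported from `exp.nb_12a`.

```python
import re

UNK = 'xxunk'  # assumption: stand-in for the unknown-token constant used by the notebook

def istitle(line):
    return re.search(r'^ = [^=]* = $', line) is not None

def split_articles(lines):
    """Apply the read_wiki boundary rule to a list of lines (hypothetical helper)."""
    articles, current = [], ''
    for i, line in enumerate(lines):
        current += line
        # Cut when the next line is blank and the line after it is a top-level title.
        if i < len(lines) - 2 and lines[i + 1] == ' \n' and istitle(lines[i + 2]):
            articles.append(current.replace('<unk>', UNK))
            current = ''
    # The final article has no following title, so add it explicitly.
    articles.append(current.replace('<unk>', UNK))
    return articles

# Made-up miniature corpus: two articles, the first with one section heading.
sample = [
    ' \n', ' = A = \n', ' \n', ' text a <unk> \n',
    ' \n', ' = = Sub = = \n', ' \n', ' more a \n',
    ' \n', ' = B = \n', ' \n', ' text b \n',
]
articles = split_articles(sample)
print(len(articles))
for a in articles:
    print(repr(a))
```

The section heading stays inside the first article, and the blank line before each title ends up at the start of the article it introduces.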

In [9]:
train = TextList(read_wiki(path/'train.txt'), path=path) # +read_file(path/'test.txt')
valid = TextList(read_wiki(path/'valid.txt'), path=path)

In [10]:
len(train), len(valid)

(28476, 60)

In [11]:
sd = SplitData(train, valid)

In [12]:
proc_tok, proc_num = TokenizeProcessor(), NumericalizeProcessor()

In [13]:
# ll = label_by_func(sd, lambda x: 0, proc_x = [proc_tok, proc_num])

In [14]:
# pickle.dump(ll, open(path/'ld.pkl', 'wb'))

In [15]:
with open(path/'ld.pkl', 'rb') as f: ll = pickle.load(f)

In [16]:
bs, bptt = 128, 72
data = lm_databunchify(ll, bs, bptt)

In [17]:
vocab = ll.train.proc_x[-1].vocab
len(vocab)

60002

## Training the Language Model

In [18]:
dps = np.array([0.1, 0.15, 0.25, 0.02, 0.2]) * 0.2
tok_pad = vocab.index(PAD)

In [19]:
emb_sz, nh, nl = 300, 300, 2
model = get_language_model(len(vocab), emb_sz, nh, nl, tok_pad, *dps)

In [20]:
cbs = [partial(AvgStatsCallback,accuracy_flat),
       CudaCallback, Recorder,
       partial(GradientClipping, clip=0.1),
       partial(RNNTrainer, α=2., β=1.),
       ProgressCallback]

In [21]:
import gc
learn, tst_model, z = None, None, None
gc.collect()

110

In [22]:
learn = Learner(model, data, cross_entropy_flat, lr=5e-3, cb_funcs=cbs, opt_func=adam_opt())

In [23]:
lr = 5e-3
sched_lr  = combine_scheds([0.3,0.7], cos_1cycle_anneal(lr/10., lr, lr/1e5))
sched_mom = combine_scheds([0.3,0.7], cos_1cycle_anneal(0.8, 0.7, 0.8))
cbsched = [ParamScheduler('lr', sched_lr), ParamScheduler('mom', sched_mom)]
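
To get a feel for the values these schedules produce, here is a plain-Python sketch of a two-phase cosine schedule with the same split (30% warm-up, 70% annealing) and the same endpoints. `cos_anneal` and `one_cycle` are hypothetical helpers written for this illustration. They are assumed to mirror what `combine_scheds` and `cos_1cycle_anneal` compute, but they are not the library's code.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle(pos, start, high, end, warm_pct=0.3):
    """Value of a two-phase cosine schedule at training progress pos in [0, 1]."""
    if pos < warm_pct:
        # Phase 1: go from start to high over the first warm_pct of training.
        return cos_anneal(start, high, pos / warm_pct)
    # Phase 2: go from high to end over the remaining training.
    return cos_anneal(high, end, (pos - warm_pct) / (1 - warm_pct))

lr = 5e-3
for pos in (0.0, 0.15, 0.3, 0.65, 1.0):
    cur_lr = one_cycle(pos, lr / 10., lr, lr / 1e5)
    cur_mom = one_cycle(pos, 0.8, 0.7, 0.8)
    print(f"pos={pos:.2f}  lr={cur_lr:.2e}  mom={cur_mom:.3f}")
```

Under these assumptions the learning rate starts at `lr/10`, peaks at `lr` after 30% of training and decays to `lr/1e5`, while the momentum moves in the opposite direction, dipping to 0.7 at the peak.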

In [None]:
learn.fit(10, cbs=cbsched)



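Training takes around 10 hours on this system, so in practice we can just load the pretrained model from the fastai library instead. If you do run the training, the next cell saves the weights together with the vocabulary so they can be reused.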

In [None]:
torch.save(learn.model.state_dict(), path/'pretrained.pth')
with open(path/'vocab.pkl', 'wb') as f: pickle.dump(vocab, f)