## Preprocess Text

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
#export
from exp.nb_11a import *

## Data

We're gonna use the IMDB dataset to classify positive and negative reviews.

In [3]:
path = datasets.untar_data(datasets.URLs.IMDB)

In [4]:
path.ls()

[PosixPath('/home/jupyter/.fastai/data/imdb/ld.pkl'),
 PosixPath('/home/jupyter/.fastai/data/imdb/imdb.vocab'),
 PosixPath('/home/jupyter/.fastai/data/imdb/train'),
 PosixPath('/home/jupyter/.fastai/data/imdb/README'),
 PosixPath('/home/jupyter/.fastai/data/imdb/tmp_lm'),
 PosixPath('/home/jupyter/.fastai/data/imdb/ll_clas.pkl'),
 PosixPath('/home/jupyter/.fastai/data/imdb/test'),
 PosixPath('/home/jupyter/.fastai/data/imdb/unsup'),
 PosixPath('/home/jupyter/.fastai/data/imdb/tmp_clas')]

We define a subclass of `ItemList` that will read the texts in the corresponding filenames.

In [5]:
#export
def read_file(fn):
    with open(fn, 'r', encoding='utf8') as f: return f.read()
    
class TextList(ItemList):
    @classmethod
    def from_files(cls, path, extensions='.txt', recurse=True, include=None, **kwargs):
        return cls(get_files(path, extensions, recurse=recurse, include=include), path, **kwargs)
    
    def get(self, i):
        if isinstance(i, Path): return read_file(i)
        return i

Just in case there are some log files, we restrict the ones we take to the training, test and unsupervised folders.

In [6]:
tl = TextList.from_files(path, include=['train', 'test', 'unsup'])

We should expect a total of 100k texts.

In [7]:
tl

TextList (100000 items)
[PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/3719_3.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/11471_3.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/2165_4.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/8280_1.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/3885_1.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/8296_4.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/2521_3.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/10148_3.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/5893_1.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/10118_4.txt')...]
Path: /home/jupyter/.fastai/data/imdb

In [8]:
txt = tl[0]
txt

"Imagine pulling back the mask of a lethal assassin and finding Barbara Cartland there... that's what happens with this film.<br /><br />The opening showed promise, but soon it drops all pretenses of being a thriller (or even an imaginative love story) and the only reason they made this story becomes abundantly clear: to fill a gap in their female viewing market by creating yet another re-hash of 'mis-understood, brooding bad-boy' (Andrei) meets 'innocent, whimsical beauty' (Paula). <br /><br />Rather than waste any time in creating an original premise, the filmmakers went straight for the money-shot: the bad boy being tamed by said whimsical beauty. Thence follows a string of insincere and heavily-clichéd love scenes sprinkled with pseudo philosophical/poetic fluff. Andrei's admission of being (eponymously) a 'poet' is levered in to round out the perceived qualities a Byronic hero should have - but even when we're told in heavy, underlined writing who and what he is, it's still diffic

For text classification, we will split the grand parent folder as before; but for language modelling, we take all the texts and just put 10% aside for validation.

In [9]:
sd = SplitData.split_by_func(tl, partial(random_splitter, p_valid=0.1))

In [10]:
sd

SplitData
Train: TextList (89944 items)
[PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/3719_3.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/2165_4.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/8280_1.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/3885_1.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/8296_4.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/10148_3.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/5893_1.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/10118_4.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/10285_3.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/3991_4.txt')...]
Path: /home/jupyter/.fastai/data/imdb
Valid: TextList (10056 items)
[PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/11471_3.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/2521_3.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/train/neg/2911_3.txt'), PosixPath('/

## Tokenizing

We need to tokenize the dataset first, which is splitting a sentence in individual tokens. THose tokens are the basic words or punctuation signs with a few tweaks: don't for instance is split between do and n't. We will use a processor for this, in conjuction with the spacy library.

In [11]:
#export
import spacy, html

Before even tokenizing, we apply a bit of preprocessing on the texts to clean them up(we saw the one up there had some HTML code). These rules are applied before we split the sentences in tokens.

In [12]:
#export
#special tokens
UNK, PAD, BOS, EOS, TK_REP, TK_WREP, TK_UP, TK_MAJ = "xxunk xxpad xxbos xxeos xxrep xxwrep xxup xxmaj".split()

def sub_br(t):
    "Replaces the <br /> by \n"
    re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)
    return re_br.sub("\n", t)

def spec_add_spaces(t):
    "Add spaces around / and #"
    return re.sub(r'([/#])', r' \1 ', t)

def rm_useless_spaces(t):
    "Removes multiple spaces"
    return re.sub(r' {2,}', ' ', t)

def replace_rep(t):
    "Replaces repetitions at the character level: cccc-> TK_REP 4 c"
    def _replace_rep(m:Collection[str])-> str:
        c, cc = m.groups()
        return f' {TK_REP} {len(cc)+1} {c} '
    re_rep = re.compile(r'(\S)(\1{3,})')
    return re_rep.sub(_replace_rep, t)

def replace_wrep(t):
    "Replace word repetitions: word word word ->> TK_WREP 3 word"
    def _replace_wrep(m:Collection[str])-> str:
        c, cc = m.groups()
        return f'{TK_WREP} {len(cc.split())+1} {c}'
    re_wrep = re.compile(r'(\b\w+\W+)(\1{3,})')
    return re_wrep.sub(_replace_wrep, t)

def fixup_text(x):
    "Various messy things we've seen  in documents"
    re1 = re.compile(r' +')
    x = x.replace('#39;', "''").replace('amp;', '&').replace('#146;', "''").replace(
    'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
    '<br />', "\n").replace('\\"','"').replace('<unk>', UNK).replace(' @.@ ', '.').replace(
    ' @-@ ', '-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))

default_pre_rules = [fixup_text, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces, sub_br]
default_spec_tok = [UNK, PAD, BOS, EOS, TK_REP, TK_WREP, TK_UP, TK_MAJ]

In [13]:
replace_rep('cccc')

' xxrep 4 c '

In [14]:
replace_wrep('word word word word k')

'xxwrep 4 word k'

These rules are applied AFTER the tokenization on the list of tokens(below)

In [15]:
#export
def replace_all_caps(x):
    "Replace tokens in ALL CAPS by their lower version and adds a TK_UP token before"
    res = []
    for t in x:
        if t.isupper() and len(t)>1: res.append(TK_UP); res.append(t.lower())
        else: res.append(t)
    return res

def deal_caps(x):
    "Replace all Capitalized tokens in by their lower verison & add TK_MAJ before"
    res = []
    for t in x:
        if t == '': continue
        if t[0].isupper() and len(t)> 1 and t[1:].islower(): res.append(TK_MAJ)
        res.append(t.lower())
    return res

def add_bos_eos(x): return [BOS] + x + [EOS]

default_post_rules = [deal_caps, replace_all_caps, add_bos_eos]

In [16]:
replace_all_caps(["ABRA", "kA", 'DABRA'])

['xxup', 'abra', 'kA', 'xxup', 'dabra']

In [17]:
deal_caps(['My', 'name', 'is', 'Jake'])

['xxmaj', 'my', 'name', 'is', 'xxmaj', 'jake']

Since tokenizing and applying those rules takes some time, we'll parallelize it using `ProcessPoolExecutor` to go faster

In [18]:
#export
from spacy.symbols import ORTH
from concurrent.futures import ProcessPoolExecutor

def parallel(func, arr, max_workers=4):
    if max_workers<2: results = list(progress_bar(map(func, enumerate(arr)), total=len(arr)))
    else: 
        with ProcessPoolExecutor(max_workers=max_workers) as ex:
            return list(progress_bar(ex.map(func, enumerate(arr)), total = len(arr)))
    if any([o is not None for o in results]):return results     

In [19]:
#export
class TokenizeProcessor(Processor):
    def __init__(self, lang='en', chunksize=2000, pre_rules=None, post_rules=None, max_workers=4):
        self.chunksize, self.max_workers = chunksize, max_workers
        self.tokenizer = spacy.blank(lang).tokenizer
#         print(default_spec_tok)
        for w in default_spec_tok:
            self.tokenizer.add_special_case(w, [{ORTH: w}])
        self.pre_rules  = default_pre_rules if pre_rules is None else pre_rules
        self.post_rules = default_post_rules if post_rules is None else post_rules
        
    def proc_chunk(self, args):
        i, chunk = args
        chunk = [compose(t, self.pre_rules) for t in chunk]
        docs = [[d.text for d in doc] for doc in self.tokenizer.pipe(chunk)]
        docs = [compose(t, self.post_rules) for t in docs]
        return docs
    
    def __call__(self, items):
        toks = []
        if isinstance(items[0], Path): items = [read_file(i) for i in items]
        chunks = [items[i:i+self.chunksize] for i in (range(0, len(items), self.chunksize))]
        toks = parallel(self.proc_chunk, chunks, max_workers=self.max_workers)
        return sum(toks, [])
    
    def proc1(self, item): return self.proc_chunk([item])[0]
    
    def deprocess(self, toks): return [self.deproc1(tok) for tok in toks]
    def deproc1(self, tok): return " ".join(tok)
        

In [20]:
tp = TokenizeProcessor()

In [21]:
txt[:250]

"Imagine pulling back the mask of a lethal assassin and finding Barbara Cartland there... that's what happens with this film.<br /><br />The opening showed promise, but soon it drops all pretenses of being a thriller (or even an imaginative love story"

In [22]:
# tp(txt)

In [23]:
' • '.join(tp(tl[:100])[0]) #[:400]

"xxbos • xxmaj • imagine • pulling • back • the • mask • of • a • lethal • assassin • and • finding • xxmaj • barbara • xxmaj • cartland • there • ... • that • 's • what • happens • with • this • film • . • \n\n • xxmaj • the • opening • showed • promise • , • but • soon • it • drops • all • pretenses • of • being • a • thriller • ( • or • even • an • imaginative • love • story • ) • and • the • only • reason • they • made • this • story • becomes • abundantly • clear • : • to • fill • a • gap • in • their • female • viewing • market • by • creating • yet • another • re • - • hash • of • ' • mis • - • understood • , • brooding • bad • - • boy • ' • ( • xxmaj • andrei • ) • meets • ' • innocent • , • whimsical • beauty • ' • ( • xxmaj • paula • ) • . • \n\n • xxmaj • rather • than • waste • any • time • in • creating • an • original • premise • , • the • filmmakers • went • straight • for • the • money • - • shot • : • the • bad • boy • being • tamed • by • said • whimsical • beauty • .

## Numericalizing

Once we have tokenized our texts, we replace each token by an individual number; this is called numericalizing. Again, we do this with a processor(Not so different from the `CategoryProcessor`

In [24]:
#export
import collections

class NumericalizeProcessor(Processor):
    def __init__(self, vocab=None, max_vocab=60000, min_freq=2):
        self.vocab, self.max_vocab, self.min_freq = vocab, max_vocab, min_freq
        
    def __call__(self, items):
        # the vocab is defined on the first use
        if self.vocab is None:
            freq = Counter(p for o in items for p in o)
            self.vocab = [o for o, c in freq.most_common(self.max_vocab) if c>=self.min_freq]
            for o in reversed(default_spec_tok):
                if o in self.vocab: self.vocab.remove(o)
                self.vocab.insert(0, o) # insert the tokens in default_spec_tok
        if getattr(self, 'otoi', None) is None:
            self.otoi = collections.defaultdict(int, {v:k for k, v in enumerate(self.vocab)})
        return [self.proc1(o) for o in items]
    
    def proc1(self, item): return [self.otoi[o] for o in item]
    
    def deprocess(self, idxs):
        assert self.vocab is not None
        return [self.deporc1(idx) for idx in idxs]
    
    def deproc1(self, idx): return [self.vocab[i] for i in idx]            

When we do language modelling, we will infer the labels from the text during training, so there's no need to label. The training loop expects labels however, so we need to add dummy ones.

In [44]:
proc_tok, proc_num = TokenizeProcessor(max_workers=8), NumericalizeProcessor()

In [45]:
%time ll = label_by_func(sd, lambda x: 0, proc_x = [proc_tok, proc_num])

CPU times: user 21.2 s, sys: 2.57 s, total: 23.8 s
Wall time: 1min 4s


Once the items have been processed they will become a list of numbers, we can still access the underlying data in `x_obj`(or `y_obj` for the targets, but we don't have them here.) 

In [46]:
ll.train.x_obj(0)

"xxbos xxmaj imagine pulling back the mask of a lethal assassin and finding xxmaj barbara xxmaj cartland there ... that 's what happens with this film . \n\n xxmaj the opening showed promise , but soon it drops all pretenses of being a thriller ( or even an imaginative love story ) and the only reason they made this story becomes abundantly clear : to fill a gap in their female viewing market by creating yet another re - hash of ' mis - understood , brooding bad - boy ' ( xxmaj andrei ) meets ' innocent , whimsical beauty ' ( xxmaj paula ) . \n\n xxmaj rather than waste any time in creating an original premise , the filmmakers went straight for the money - shot : the bad boy being tamed by said whimsical beauty . xxmaj thence follows a string of insincere and heavily - clichéd love scenes sprinkled with pseudo philosophical / poetic fluff . xxmaj andrei 's admission of being ( xxunk ) a ' poet ' is xxunk in to round out the perceived qualities a xxmaj byronic hero should have - but eve

Since the preprocessing step takes time, we save the intermediate result using pickle.   
NOTE: Don't use any lambda functions in your processors or they won't be able to pickle

In [47]:
pickle.dump(ll, open(path/'ld.pkl', 'wb'))
pickle.dump(proc_num, open(path/'proc_num_vocab.pkl', 'wb'))

In [29]:
ll = pickle.load(open(path/'ld.pkl', 'rb'))
proc_num.vocab = pickle.load(open(path/'proc_num_vocab.pkl', 'rb'))

## Batching

We have a bit of work to convert our `LabelList` in a `DataBunch` as we don't just want batches of IMDB reviews. We want to stream through all the texts concatenated. We also have to prepare the targets that are the next words in the text. All of this is done with the next object called `LM_Dataset`. At the beginning of the each epoch, it'll shuffle the articles (if `shuffle=True`) and create a big stream in `bs` smaller streams. That will read in chunks of bptt length.

In [48]:
from IPython.display import display, HTML
import pandas as pd

Let's say our stream is:

In [49]:
stream = """
In this notebook, we will go back over the example of classifying movie reviews we studied in part 1 and dig deeper under the surface. 
First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the Processor used in the data block API.
Then we will study how we build a language model and train it.\n
"""
tokens = np.array(tp([stream])[0])

Then if we split it in 6 batches it would give something like this:

In [50]:
type(tokens)

numpy.ndarray

In [51]:
bs ,seq_len = 6, 15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False, header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,\n,xxmaj,in,this,notebook,",",we,will,go,back,over,the,example,of
classifying,movie,reviews,we,studied,in,part,1,and,dig,deeper,under,the,surface,.
\n,xxmaj,first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into
numbers,and,how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have
another,example,of,the,xxmaj,processor,used,in,the,data,block,api,.,\n,xxmaj
then,we,will,study,how,we,build,a,language,model,and,train,it,.,\n\n


Then if we have a `bptt` of 5, we would go over those 3 batches

In [52]:
bs, bptt = 6,5
for k in range(3):
    d_tokens = np.array([tokens[i*seq_len + k*bptt: i*seq_len + (k+1) * bptt] for i in range(bs)])
    df = pd.DataFrame(d_tokens)
    display(HTML(df.to_html(index=False, header=None)))

0,1,2,3,4
xxbos,\n,xxmaj,in,this
classifying,movie,reviews,we,studied
\n,xxmaj,first,we,will
numbers,and,how,to,customize
another,example,of,the,xxmaj
then,we,will,study,how


0,1,2,3,4
notebook,",",we,will,go
in,part,1,and,dig
look,at,the,processing,steps
it,.,xxmaj,by,doing
processor,used,in,the,data
we,build,a,language,model


0,1,2,3,4
back,over,the,example,of
deeper,under,the,surface,.
necessary,to,convert,text,into
this,",",we,'ll,have
block,api,.,\n,xxmaj
and,train,it,.,\n\n


In [53]:
#export
class LM_Dataset():
    def __init__(self, data, bs=64, bptt=72, shuffle=False):
        self.data, self.bs, self.bptt, self.shuffle = data, bs, bptt, shuffle
        total_len = sum([len(t) for t in data.x])
        self.n_batch = total_len // bs
        self.batchify()
        
    def __len__(self): return ((self.n_batch-1) // self.bptt) * self.bs
    
    def __getitem__(self, idx):
        source = self.batched_data[idx % self.bs]
        seq_idx = (idx // self.bs) * self.bptt
        return source[seq_idx: seq_idx+self.bptt], source[seq_idx+1: seq_idx+self.bptt+1] # x, y
    
    def batchify(self):
        texts = self.data.x
        if self.shuffle: texts = texts[torch.randperm(len(texts))]
        stream = torch.cat([tensor(t) for t in texts])
        self.batched_data = stream[:self.n_batch * self.bs].view(self.bs, self.n_batch)

In [54]:
dl = DataLoader(LM_Dataset(ll.valid, shuffle=True), batch_size=64)

Let's check if it all works ok: `x1`, `y1`, `x2`, and `y2` should all be of size `bs` by `bptt`. The texts in each row of `x1` should continue in `x2`. `y1` and `y2` should have the same texts as their x counterpart shifted by 1 position to the right.

In [55]:
iter_dl = iter(dl)
x1, y1 = next(iter_dl)
x2, y2 = next(iter_dl)

In [56]:
x1.size(), y1.size()

(torch.Size([64, 72]), torch.Size([64, 72]))

In [57]:
vocab = proc_num.vocab

In [58]:
" ".join(vocab[o] for o in x1[0])

'xxbos xxmaj in many ways , the filmic career of independent film - making legend xxmaj john xxmaj cassavetes is the polar opposite of someone like xxmaj alfred xxmaj hitchcock , the consummate studio director . xxmaj where xxmaj hitchcock infamously treated his actors as cattle , xxmaj cassavetes sought to work with them xxunk . xxmaj where every element in a xxmaj hitchcock shot is composed immaculately , xxmaj cassavetes cared'

In [59]:
" ".join(vocab[o] for o in y1[0])

'xxmaj in many ways , the filmic career of independent film - making legend xxmaj john xxmaj cassavetes is the polar opposite of someone like xxmaj alfred xxmaj hitchcock , the consummate studio director . xxmaj where xxmaj hitchcock infamously treated his actors as cattle , xxmaj cassavetes sought to work with them xxunk . xxmaj where every element in a xxmaj hitchcock shot is composed immaculately , xxmaj cassavetes cared less'

In [60]:
" ".join(vocab[o] for o in x2[0])

"less for the way a scene was figuratively composed than in how it felt , or what it conveyed , emotionally . xxmaj hitchcock 's tales were always plot - first narratives , with the human element put in the background . xxmaj cassavetes put the human experience forefront in every one of his films . xxmaj if some things did not make much sense logically , so be it . \n\n"

Let's prepare a convenience function to do this quickly

Since we don't compute gradients for valid_ds, we usually set it's bs twice that to train_ds 

In [61]:
#export
def get_lm_dls(train_ds, valid_ds, bs, bptt, **kwargs):
    return (DataLoader(LM_Dataset(train_ds, bs, bptt, shuffle=True), batch_size=bs, **kwargs),
           DataLoader(LM_Dataset(valid_ds, bs, bptt, shuffle=False), batch_size=2*bs, **kwargs))

def lm_databunchify(sd, bs, bptt, **kwargs):
    return DataBunch(*get_lm_dls(sd.train, sd.valid, bs, bptt, **kwargs))

In [62]:
bs ,bppt = 64, 72
data = lm_databunchify(ll, bs, bptt)

## Batching for Classification

When we will want to tackle classification, gathering the data will be a bit different: first we will label our texts with the folder they come from, and then we will need to apply padding to batch them together. To avoid mixing very long texts with very short ones, we will also use `Sampler` to sort(with a bit of randomness for the training set) ur samples by length.  
  
First, the datablock API calls should look familiar

In [63]:
proc_cat = CategoryProcessor()

In [64]:
tl = TextList.from_files(path, include=['train', 'test'])
sd = SplitData.split_by_func(tl, partial(grandparent_splitter, valid_name='test'))
ll = label_by_func(sd, parent_labeler, proc_x = [proc_tok, proc_num], proc_y = proc_cat)

In [65]:
pickle.dump(ll, open(path/'ll_clas.pkl', 'wb'))

In [66]:
ll = pickle.load(open(path/'ll_clas.pkl', 'rb'))

Let's check if the labels are consistent with the texts

In [67]:
[(ll.train.x_obj(i), ll.train.y_obj(i)) for i in [1, 12000]]

[('xxbos xxmaj is it a coincidence that xxmaj orca was made two years after xxmaj jaws ? xxmaj orca is n\'t exactly a " xxmaj jaws rip off " but it is obvious that it tried to profit from xxmaj jaws \'s success . xxmaj first of all xxmaj orca in my opinion was a bad movie , not terrible but definitely not good , average at best . \n\n xxmaj the plot is basically a male killer whale ( orca ) after seeing its mate and its unborn calf killed by a fisherman seeks revenge . i could n\'t stand to watch this movie again . xxmaj the direction of this film is poor and when compared to xxmaj jaws it looks like the director , producers , and writers were almost talentless . \n\n xxmaj as for the acting , it was very average and believable , however the actual characters are n\'t the least bit likable . xxmaj the effects were alright for its time and the footage of the killer whale looked pretty good . \n\n xxmaj the violence is confusing , bloody , and not recommended for more sensitive people . 

In [84]:
[ll.train.x_obj(i) for i in [1, 10]]

['xxbos xxmaj is it a coincidence that xxmaj orca was made two years after xxmaj jaws ? xxmaj orca is n\'t exactly a " xxmaj jaws rip off " but it is obvious that it tried to profit from xxmaj jaws \'s success . xxmaj first of all xxmaj orca in my opinion was a bad movie , not terrible but definitely not good , average at best . \n\n xxmaj the plot is basically a male killer whale ( orca ) after seeing its mate and its unborn calf killed by a fisherman seeks revenge . i could n\'t stand to watch this movie again . xxmaj the direction of this film is poor and when compared to xxmaj jaws it looks like the director , producers , and writers were almost talentless . \n\n xxmaj as for the acting , it was very average and believable , however the actual characters are n\'t the least bit likable . xxmaj the effects were alright for its time and the footage of the killer whale looked pretty good . \n\n xxmaj the violence is confusing , bloody , and not recommended for more sensitive people . x

We saw samplers in notebook 3. For the valid_ds, we will simply sort the samples by length and we will begin w the longest ones for memory reasons(it's better to always have the biggest tensors first).

In [68]:
#export
from torch.utils.data import Sampler

class SortSampler(Sampler):
    def __init__(self, data_source, key): self.data_source, self.key = data_source, key
    def __len__(self): return len(self.data_source)
    def __iter__(self): return iter(sorted(list(range((len(self.data_source)))), key=self.key, reverse=True))

In [69]:
# Sampler??

For the training set, we want some kind of randomness on top of this. So first, we shuffle the texts and build megabatches of size `50*bs`. We sort those megabatches by length before splitting them in 50 mini-batches. That way we'll have randomized batches of roughly the same length.

Then we make sure to have the biggest batch first and shuffle the order of other batches. We also make sure that the last batch stays at the end because it's size is probably smaller than the batch size.

In [75]:
#export
class SortishSampler(Sampler):
    def __init__(self, data_source, key, bs):
            self.data_source, self.key, self.bs = data_source, key, bs
        
    def __len__(self)->int: return len(self.data_source)
    
    def __iter__(self):
        idxs = torch.randperm(len(self.data_source))
        megabatches = [idxs[i:i+self.bs*50] for i in range(0, len(idxs), self.bs*50)]
        sorted_idx = torch.cat([tensor(sorted(s, key=self.key, reverse=True)) for s in megabatches])
        batches = [sorted_idx[i:i+self.bs] for i in range(0, len(sorted_idx), self.bs)]
        max_idx = torch.argmax(tensor([self.key(ck[0]) for ck in batches])) # find the chunk with the largest key
        batches[0], batches[max_idx] = batches[max_idx], batches[0] # then make sure it goes first
        # make sure the largest and smallest batch goes first and last and randomize the order of the rest.
        batch_idxs = torch.randperm(len(batches)-2)
        sorted_idx = torch.cat([batches[i+1] for i in batch_idxs]) if len(batches)>1 else LongTensor([])
        sorted_idx = torch.cat([batches[0], sorted_idx, batches[-1]])
        return iter(sorted_idx)

Padding: we had the padding token(as an id of 1) at the end of each sequence to make them all the same size when batching them. Note that we need padding at the end to be able to use `Pytorch` convenience functions that will let us ignore that padding(see 12c).

In [85]:
#export
def pad_collate(samples, pad_idx=1, pad_first=False):
    max_len = max([len(s[0]) for s in samples])
    res = torch.zeros(len(samples), max_len).long() + pad_idx
    for i,s in enumerate(samples):
        if pad_first: res[i, -len(s[0]):] = LongTensor(s[0])
        else:         res[i, :len(s[0]) ] = LongTensor(s[0])
    return res, tensor([s[1] for s in samples])

In [86]:
bs = 64
train_sampler = SortishSampler(ll.train.x, key=lambda t: len(ll.train[int(t)][0]), bs=bs)
train_dl = DataLoader(ll.train, batch_size=bs, sampler=train_sampler, collate_fn = pad_collate)

In [87]:
iter_dl = iter(train_dl)
x, y = next(iter_dl)

In [91]:
x.shape, y.shape

(torch.Size([64, 3311]), torch.Size([64]))

In [88]:
lengths = []
for i in range(x.size(0)): lengths.append(x.size(1)- (x[i]==1).sum().item())
lengths[:5], lengths[-1]

([3311, 1699, 1394, 1390, 1355], 1049)

In [92]:
len(lengths)

64

The last one is minimal length. This is the first batch so it has the longest sequence, but if we look at the next one that is more random, we see lengths are roughly the same.

In [93]:
x, y = next(iter_dl)
lengths = []
for i in range(x.size(0)): lengths.append(x.size(1)- (x[i]==1).sum().item())
lengths[:5], lengths[-1]

([449, 448, 448, 447, 447], 424)

We can see the padding in the end(id is 1)

In [94]:
x

tensor([[   2,   18,   73,  ...,   89,   66,    3],
        [   2,    7, 1066,  ...,    9,    3,    1],
        [   2,    7,   19,  ...,    9,    3,    1],
        ...,
        [   2,    7,   19,  ...,    1,    1,    1],
        [   2,   18,  319,  ...,    1,    1,    1],
        [   2,   18,  235,  ...,    1,    1,    1]])

And we add a convenience function:

In [98]:
#export
def get_class_dls(train_ds, valid_ds, bs, **kwargs):
    train_sampler = SortishSampler(train_ds.x, key=lambda t: len(train_ds.x[t]), bs=bs)
    valid_sampler = SortSampler(valid_ds.x, key=lambda t: len(valid_ds.x[t]))
    return (DataLoader(train_ds, batch_size=bs, sampler=train_sampler, collate_fn=pad_collate, **kwargs),
           DataLoader(valid_ds, batch_size=bs*2, sampler=valid_sampler, collate_fn=pad_collate, **kwargs))

def clas_databunchify(sd, bs, **kwargs):
    return DataBunch(*get_class_dls(sd.train, sd.valid, bs, **kwargs))

In [99]:
bs, bptt = 64, 72
data = clas_databunchify(ll, bs)

## Export

In [100]:
!python notebook2script.py 12_text.ipynb

Converted 12_text.ipynb to exp/nb_12.py
