# Text Preprocessing on the IMDb (Internet Movie Database) reviews
### The IMDb dataset consists of 50,000 labeled movie reviews (positive or negative) and 50,000 unlabeled ones.

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [2]:
#export
from exp.nb_11a import *

## 1. Define a class `TextList` that will directly read text from filenames
### `TextList` is a subclass of `ItemList` 

In [5]:
#export
def read_file(fn): 
    with open(fn, 'r', encoding = 'utf8') as f: return f.read()
    
class TextList(ItemList):
    @classmethod
    def from_files(cls, path, extensions='.txt', recurse=True, include=None, **kwargs):
        return cls(get_files(path, extensions, recurse=recurse, include=include), path, **kwargs)
    
    def get(self, i):
        if isinstance(i, Path): return read_file(i)
        return i

## 2. Import the IMDb Data directly into a `TextList`

[Jump_to lesson 12 video](https://course.fast.ai/videos/?lesson=12&t=4964)

In [3]:
path = datasets.untar_data(datasets.URLs.IMDB)

In [4]:
path.ls()

[WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/imdb.vocab'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ld.pkl'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ll_clas.pkl'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/README'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/tmp_clas'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/tmp_lm'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/unsup')]

Just in case there are some text log files, we restrict the ones we take to the training, test, and unsupervised folders.

In [6]:
il = TextList.from_files(path, include=['train', 'test', 'unsup'])

## 3. A little Exploratory Data Analysis (EDA)

We should expect a total of 100,000 texts.

In [7]:
len(il.items)

100000

Here is the first item in the list as an example.

In [8]:
txt = il[0]
txt

"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in."

In [33]:
xx = 'str'
yy = " ".join(xx)
yy

's t r'

In [None]:
# il has 100,000 text elements
len(il)

In [35]:
txt[:250]

'Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner'

In [54]:
# first text element in il (same as txt) has length of 900
print(len(il[:100][0]))
print(len(txt))

900
900


In [82]:
# before token processing
print(len(il[:100][0]))
il[:100][0]

900


"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in."

## 4. Tokenizing using the `spaCy` library

We need to tokenize the dataset first, that is, split each text into individual tokens. Tokens are basically words or punctuation marks, with a few tweaks: `don't`, for instance, is split into `do` and `n't`. We will use a processor for this, in conjunction with the [spaCy library](https://spacy.io/). spaCy describes itself as a purveyor of `Industrial Strength Natural Language Processing`.
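To make the idea of a token concrete, here is a toy sketch using only the standard library. It is not spaCy's algorithm: the regular expression and the `toy_tokenize` name are assumptions made for illustration, and they only cover words, punctuation, and the `n't` / `'s` style of contraction mentioned above.

```python
import re

# Toy pattern (an assumption for illustration, NOT spaCy's rules):
#   n't            -> the contraction suffix as its own token
#   '\w+           -> clitics such as 's or 'll
#   \w+(?=n't\b)   -> the word part that precedes n't ("do" in "don't")
#   \w+            -> ordinary words
#   [^\w\s]        -> any single punctuation character
TOKEN_RE = re.compile(r"n't\b|'\w+|\w+(?=n't\b)|\w+|[^\w\s]")

def toy_tokenize(text):
    "Split `text` into word, contraction and punctuation tokens."
    return TOKEN_RE.findall(text)

tokens = toy_tokenize("I don't like Costner's movie.")
print(tokens)
# ['I', 'do', "n't", 'like', 'Costner', "'s", 'movie', '.']
```

A real tokenizer such as spaCy's handles many more cases (URLs, abbreviations like `Mr.`, emoticons), which is why the notebook relies on it rather than on a hand-written pattern.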

In [15]:
#export
import spacy,html

[Jump_to lesson 12 video](https://course.fast.ai/videos/?lesson=12&t=5070)

## 4.1 Helper functions
Before and after tokenization, we will use helper functions to do a bit of processing. These helper functions manipulate text using the powerful syntax of `regular expressions` (sometimes abbreviated `regex`), which Python implements in its `re` library. On first encounter, the compact syntax of `re` can be off-putting, like trying to read a foreign language you have never studied.

What each of these helper functions does is made clear by its docstring. So on a first pass, think of the `re` operations as black boxes implementing the docstring; for now you needn't worry about the details of regular expressions.

At some point, you will need to bite the bullet and understand `regular expressions` to go further in NLP. Here are a few helpful resources for when you are ready to learn more:

https://scotch.io/tutorials/an-introduction-to-regex-in-python

https://www.w3schools.com/python/python_regex.asp

http://marvin.cs.uidaho.edu/Handouts/regex.html

http://flockhart.virtualave.net/RBIF0100/regexp.html

https://jakevdp.github.io/WhirlwindTourOfPython/14-strings-and-regular-expressions.html
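As a gentle first step, here is a minimal sketch (not part of the original notebook) of the three `re` features that the helper functions in this section rely on: a capture group with a character class, a backreference such as `\1`, and a function used as the replacement. The `shout` helper and the sample strings are made up for illustration.

```python
import re

# 1. Character class + capture group: put spaces around '/' and '#'.
#    \1 in the replacement stands for whatever the group captured.
spaced = re.sub(r'([/#])', r' \1 ', 'good/bad #1')

# 2. Backreference inside the pattern: (\S) captures one non-space
#    character and \1{3,} requires that same character 3+ more times.
match = re.search(r'(\S)(\1{3,})', 'sooooo good')

# 3. A function as the replacement: it receives the match object and
#    returns the string that replaces the matched text.
def shout(m):
    return m.group(0).upper()

shouted = re.sub(r'\bgood\b', shout, 'so good')

print(repr(spaced))    # 'good / bad  # 1'
print(match.groups())  # ('o', 'oooo')
print(shouted)         # so GOOD
```

Note the double space left in the first result: that is the kind of debris a rule like `rm_useless_spaces` is meant to clean up.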


### 4.1.1 Pre-processing

Before tokenizing, we will apply a bit of preprocessing to the texts to clean them up (IMDb reviews can contain leftover HTML such as `<br />`). These rules are applied before we split the texts into tokens.

In [12]:
#export
#special tokens
UNK, PAD, BOS, EOS, TK_REP, TK_WREP, TK_UP, TK_MAJ = "xxunk xxpad xxbos xxeos xxrep xxwrep xxup xxmaj".split()

def sub_br(t):
    "Replaces the <br /> by \n"
    re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)
    return re_br.sub("\n", t)

def spec_add_spaces(t):
    "Add spaces around / and #"
    return re.sub(r'([/#])', r' \1 ', t)

def rm_useless_spaces(t):
    "Remove multiple spaces"
    return re.sub(' {2,}', ' ', t)

def replace_rep(t):
    "Replace repetitions at the character level: cccc -> TK_REP 4 c"
    def _replace_rep(m:Collection[str]) -> str:
        c,cc = m.groups()
        return f' {TK_REP} {len(cc)+1} {c} '
    re_rep = re.compile(r'(\S)(\1{3,})')
    return re_rep.sub(_replace_rep, t)
    
def replace_wrep(t):
    "Replace word repetitions (4 or more in a row): word word word word -> TK_WREP 4 word"
    def _replace_wrep(m:Collection[str]) -> str:
        c,cc = m.groups()
        return f' {TK_WREP} {len(cc.split())+1} {c} '
    re_wrep = re.compile(r'(\b\w+\W+)(\1{3,})')
    return re_wrep.sub(_replace_wrep, t)

def fixup_text(x):
    "Various messy things we've seen in documents"
    re1 = re.compile(r'  +')
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>',UNK).replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))
    
default_pre_rules = [fixup_text, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces, sub_br]
default_spec_tok = [UNK, PAD, BOS, EOS, TK_REP, TK_WREP, TK_UP, TK_MAJ]

#### Examples 

In [13]:
replace_rep('cccc')

' xxrep 4 c '

In [14]:
replace_wrep('word word word word word ')

' xxwrep 5 word  '
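The rules are meant to be chained, each one receiving the output of the previous one. The sketch below shows that chaining on a made-up string. `apply_rules` is a hypothetical, simplified stand-in for the `compose` helper imported from the earlier notebook, and the rule order used here is chosen for illustration rather than copied from `default_pre_rules`.

```python
import re

TK_REP = "xxrep"

def sub_br(t):
    "Replace <br /> tags with a newline"
    return re.sub(r'<\s*br\s*/?>', "\n", t, flags=re.IGNORECASE)

def replace_rep(t):
    "Replace character repetitions: cccc -> TK_REP 4 c"
    def _replace_rep(m):
        c, cc = m.groups()
        return f' {TK_REP} {len(cc)+1} {c} '
    return re.sub(r'(\S)(\1{3,})', _replace_rep, t)

def spec_add_spaces(t):
    "Add spaces around / and #"
    return re.sub(r'([/#])', r' \1 ', t)

def rm_useless_spaces(t):
    "Collapse runs of spaces into one space"
    return re.sub(' {2,}', ' ', t)

def apply_rules(text, rules):
    "Hypothetical stand-in for `compose`: apply each rule in list order."
    for rule in rules:
        text = rule(text)
    return text

rules = [sub_br, replace_rep, spec_add_spaces, rm_useless_spaces]
cleaned = apply_rules("Sooooo good<br />9/10", rules)
print(repr(cleaned))
# 'S xxrep 5 o good\n9 / 10'
```

Order matters in such a chain: `rm_useless_spaces` runs last here so that it can remove the double spaces introduced by the rules before it.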

### 4.1.2 Post-processing
After tokenization, we process the tokens to remove capitalization, adding marker tokens that preserve information about where the capitalization was in the original text and that flag the start and end of the text.

`TK_UP` indicates that the next token was originally in all caps

`TK_MAJ` indicates that the next token was originally capitalized

`BOS` indicates the beginning of a string

`EOS` indicates the end of a string

In [25]:
#export
def replace_all_caps(x):
    "Replace tokens in ALL CAPS by their lower version and add `TK_UP` before, if length > 1"
    res = []
    for t in x:
        if t.isupper() and len(t) > 1: res.append(TK_UP); res.append(t.lower())
        else: res.append(t)
    return res

def deal_caps(x):
    "Replace all Capitalized tokens by their lower version and add `TK_MAJ` before."
    res = []
    for t in x:
        if t == '': continue
        if t[0].isupper() and len(t) > 1 and t[1:].islower(): res.append(TK_MAJ)
        res.append(t.lower())
    return res

def add_eos_bos(x):
    "Bracket the token list with the `BOS` and `EOS` tokens"
    return [BOS] + x + [EOS]

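# Note: `deal_caps` lowercases every token, so in this order `replace_all_caps` never sees an
# ALL-CAPS token; list `replace_all_caps` first if `TK_UP` markers are wanted.
default_post_rules = [deal_caps, replace_all_caps, add_eos_bos]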

#### Examples

In [19]:
replace_all_caps(['I', 'AM', 'SHOUTING'])

['I', 'xxup', 'am', 'xxup', 'shouting']

In [20]:
deal_caps(['My', 'name', 'is', 'Jeremy'])

['xxmaj', 'my', 'name', 'is', 'xxmaj', 'jeremy']
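The two examples above run each rule on its own. The sketch below chains the three post-processing rules on one made-up token list and compares two orderings, since the result depends on which capitalization rule runs first. The `run` helper is a hypothetical stand-in for `compose`; the rule bodies are copied from the cell above.

```python
TK_UP, TK_MAJ, BOS, EOS = "xxup", "xxmaj", "xxbos", "xxeos"

def replace_all_caps(x):
    res = []
    for t in x:
        if t.isupper() and len(t) > 1: res.append(TK_UP); res.append(t.lower())
        else: res.append(t)
    return res

def deal_caps(x):
    res = []
    for t in x:
        if t == '': continue
        if t[0].isupper() and len(t) > 1 and t[1:].islower(): res.append(TK_MAJ)
        res.append(t.lower())
    return res

def add_eos_bos(x): return [BOS] + x + [EOS]

def run(tokens, rules):
    "Hypothetical stand-in for `compose`: apply each rule in list order."
    for rule in rules: tokens = rule(tokens)
    return tokens

toks = ['The', 'movie', 'was', 'GREAT']

# ALL-CAPS rule first: both kinds of capitalization are recorded.
caps_first = run(toks, [replace_all_caps, deal_caps, add_eos_bos])
# Capitalized rule first: 'GREAT' is lowercased before the ALL-CAPS rule sees it.
maj_first = run(toks, [deal_caps, replace_all_caps, add_eos_bos])

print(caps_first)
# ['xxbos', 'xxmaj', 'the', 'movie', 'was', 'xxup', 'great', 'xxeos']
print(maj_first)
# ['xxbos', 'xxmaj', 'the', 'movie', 'was', 'great', 'xxeos']
```

Because `deal_caps` lowercases every token it returns, the `xxup` marker only survives when `replace_all_caps` runs before it.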

## 4.2 Parallelizing the Tokenization Process
Since tokenizing and applying those rules takes a bit of time, we'll parallelize it using `ProcessPoolExecutor` to go faster.

In [28]:
#export
from spacy.symbols import ORTH
from concurrent.futures import ProcessPoolExecutor

def parallel(func, arr, max_workers=4):
    # `func` is called once for every `(index, item)` tuple produced by `enumerate(arr)`
    if max_workers<2: results = list(progress_bar(map(func, enumerate(arr)), total=len(arr)))
    else:
        # the context manager shuts the worker processes down once all results are collected
        with ProcessPoolExecutor(max_workers=max_workers) as ex:
            return list(progress_bar(ex.map(func, enumerate(arr)), total=len(arr)))
    if any([o is not None for o in results]): return results

In [32]:
#export
class TokenizeProcessor(Processor):
    # initialize max_workers to 1, because Windows 10 won't allow max_workers > 1
    #def __init__(self, lang="en", chunksize=2000, pre_rules=None, post_rules=None, max_workers=4): 
    def __init__(self, lang="en", chunksize=2000, pre_rules=None, post_rules=None, max_workers=1): 
        self.chunksize,self.max_workers = chunksize,max_workers
        # using spacy's tokenizer
        self.tokenizer = spacy.blank(lang).tokenizer
        for w in default_spec_tok:
            # dictionary of default_spec_tok
            self.tokenizer.add_special_case(w, [{ORTH: w}])
        self.pre_rules  = default_pre_rules  if pre_rules  is None else pre_rules
        self.post_rules = default_post_rules if post_rules is None else post_rules

    def proc_chunk(self, args):
        # `args` is an `(index, chunk)` tuple coming from `enumerate`; `chunk` is a list of raw texts
        i,chunk = args
        
        # pre-process
        chunk = [compose(t, self.pre_rules) for t in chunk]
        # `tokenizer.pipe` tokenizes the texts as a stream and yields one spaCy `Doc` per text
        docs = [[d.text for d in doc] for doc in self.tokenizer.pipe(chunk)]
        
        # post-process
        docs = [compose(t, self.post_rules) for t in docs]
        return docs

    def __call__(self, items): 
        toks = []
        if isinstance(items[0], Path): items = [read_file(i) for i in items]
        chunks = [items[i: i+self.chunksize] for i in (range(0, len(items), self.chunksize))]
        toks = parallel(self.proc_chunk, chunks, max_workers=self.max_workers)
        return sum(toks, [])
    
    # wrap the single item as an `(index, chunk)` tuple, which is what `proc_chunk` unpacks
    def proc1(self, item): return self.proc_chunk((0, [item]))[0]
    
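    # deprocessing turns token lists back into readable text
    def deprocess(self, toks): return [self.deproc1(tok) for tok in toks]
    # joins the tokens of one document with spaces
    # (given a plain string instead of a token list, it would put a space between its characters)
    def deproc1(self, tok):    return " ".join(tok)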
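The chunk-then-flatten logic of `__call__` can be tried without spaCy. In the sketch below, `chunked` and `fake_proc_chunk` are hypothetical stand-ins: the latter lowercases and splits on whitespace instead of running the real rules and tokenizer, but it takes the same `(index, chunk)` argument and returns one token list per text.

```python
def chunked(items, chunksize):
    "Cut `items` into consecutive slices of at most `chunksize` elements."
    return [items[i:i+chunksize] for i in range(0, len(items), chunksize)]

def fake_proc_chunk(args):
    "Stand-in for `proc_chunk`: one token list per text in the chunk."
    i, chunk = args
    return [text.lower().split() for text in chunk]

texts = ["A b", "C d e", "F", "G h", "I j"]
chunks = chunked(texts, 2)

# Serial version of `parallel`: map over (index, chunk) pairs.
per_chunk = list(map(fake_proc_chunk, enumerate(chunks)))

# `sum(..., [])` concatenates the per-chunk lists back into one flat list.
flat = sum(per_chunk, [])

print(chunks)  # [['A b', 'C d e'], ['F', 'G h'], ['I j']]
print(flat)    # [['a', 'b'], ['c', 'd', 'e'], ['f'], ['g', 'h'], ['i', 'j']]

# Pitfall: a single string is also sliceable and iterable, so every
# character ends up being treated as a separate text.
as_chars = [list(chunk) for chunk in chunked("abc", 2)]
print(as_chars)  # [['a', 'b'], ['c']]
```

The last lines show why the processor should always be given a list of texts, even when there is only one text to tokenize.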

## 4.3 Instantiate the TokenizeProcessor() and explore the data a bit more

In [34]:
tp = TokenizeProcessor()

In [64]:
# This output looks odd at first: `il[:100][0]` is a single string, not a list of texts.
#      The processor iterates over its input, so each of the 900 characters is treated as a
#      separate document and becomes its own one-token list.
#      `add_eos_bos`, the last rule in `default_post_rules`, then brackets every one of those
#      lists with 'xxbos' and 'xxeos'. Pass a list of texts (as in the next cell) to get word tokens.
tp(il[:100][0])


[['xxbos', 'o', 'xxeos'],
 ['xxbos', 'n', 'xxeos'],
 ['xxbos', 'c', 'xxeos'],
 ['xxbos', 'e', 'xxeos'],
 ['xxbos', ' ', 'xxeos'],
 ['xxbos', 'a', 'xxeos'],
 ['xxbos', 'g', 'xxeos'],
 ['xxbos', 'a', 'xxeos'],
 ['xxbos', 'i', 'xxeos'],
 ['xxbos', 'n', 'xxeos'],
 ['xxbos', ' ', 'xxeos'],
 ['xxbos', 'm', 'xxeos'],
 ['xxbos', 'r', 'xxeos'],
 ['xxbos', '.', 'xxeos'],
 ['xxbos', ' ', 'xxeos'],
 ['xxbos', 'c', 'xxeos'],
 ['xxbos', 'o', 'xxeos'],
 ['xxbos', 's', 'xxeos'],
 ['xxbos', 't', 'xxeos'],
 ['xxbos', 'n', 'xxeos'],
 ['xxbos', 'e', 'xxeos'],
 ['xxbos', 'r', 'xxeos'],
 ['xxbos', ' ', 'xxeos'],
 ['xxbos', 'h', 'xxeos'],
 ['xxbos', 'a', 'xxeos'],
 ['xxbos', 's', 'xxeos'],
 ['xxbos', ' ', 'xxeos'],
 ['xxbos', 'd', 'xxeos'],
 ['xxbos', 'r', 'xxeos'],
 ['xxbos', 'a', 'xxeos'],
 ['xxbos', 'g', 'xxeos'],
 ['xxbos', 'g', 'xxeos'],
 ['xxbos', 'e', 'xxeos'],
 ['xxbos', 'd', 'xxeos'],
 ['xxbos', ' ', 'xxeos'],
 ['xxbos', 'o', 'xxeos'],
 ['xxbos', 'u', 'xxeos'],
 ['xxbos', 't', 'xxeos'],
 ['xxbos', '

In [81]:
# the 900 characters of the first text item are mapped to 207 tokens, including special tokens and punctuation
#      note that the first and last tokens are 'xxbos' and 'xxeos'
print(len(tp(il[:100])[0]))
tp(il[:100])[0]

207


['xxbos',
 'xxmaj',
 'once',
 'again',
 'xxmaj',
 'mr.',
 'xxmaj',
 'costner',
 'has',
 'dragged',
 'out',
 'a',
 'movie',
 'for',
 'far',
 'longer',
 'than',
 'necessary',
 '.',
 'xxmaj',
 'aside',
 'from',
 'the',
 'terrific',
 'sea',
 'rescue',
 'sequences',
 ',',
 'of',
 'which',
 'there',
 'are',
 'very',
 'few',
 'i',
 'just',
 'did',
 'not',
 'care',
 'about',
 'any',
 'of',
 'the',
 'characters',
 '.',
 'xxmaj',
 'most',
 'of',
 'us',
 'have',
 'ghosts',
 'in',
 'the',
 'closet',
 ',',
 'and',
 'xxmaj',
 'costner',
 "'s",
 'character',
 'are',
 'realized',
 'early',
 'on',
 ',',
 'and',
 'then',
 'forgotten',
 'until',
 'much',
 'later',
 ',',
 'by',
 'which',
 'time',
 'i',
 'did',
 'not',
 'care',
 '.',
 'xxmaj',
 'the',
 'character',
 'we',
 'should',
 'really',
 'care',
 'about',
 'is',
 'a',
 'very',
 'cocky',
 ',',
 'overconfident',
 'xxmaj',
 'ashton',
 'xxmaj',
 'kutcher',
 '.',
 'xxmaj',
 'the',
 'problem',
 'is',
 'he',
 'comes',
 'off',
 'as',
 'kid',
 'who',
 'think

In [71]:
' • '.join(tp(il[:100])[0])[:]

"xxbos • xxmaj • once • again • xxmaj • mr. • xxmaj • costner • has • dragged • out • a • movie • for • far • longer • than • necessary • . • xxmaj • aside • from • the • terrific • sea • rescue • sequences • , • of • which • there • are • very • few • i • just • did • not • care • about • any • of • the • characters • . • xxmaj • most • of • us • have • ghosts • in • the • closet • , • and • xxmaj • costner • 's • character • are • realized • early • on • , • and • then • forgotten • until • much • later • , • by • which • time • i • did • not • care • . • xxmaj • the • character • we • should • really • care • about • is • a • very • cocky • , • overconfident • xxmaj • ashton • xxmaj • kutcher • . • xxmaj • the • problem • is • he • comes • off • as • kid • who • thinks • he • 's • better • than • anyone • else • around • him • and • shows • no • signs • of • a • cluttered • closet • . • xxmaj • his • only • obstacle • appears • to • be • winning • over • xxmaj • costner • . • xxmaj 

In [178]:
' • '.join(tp(il[:100])[0])[:400]

'xxbos • xxmaj • once • again • xxmaj • mr. • xxmaj • costner • has • dragged • out • a • movie • for • far • longer • than • necessary • . • xxmaj • aside • from • the • terrific • sea • rescue • sequences • , • of • which • there • are • very • few • i • just • did • not • care • about • any • of • the • characters • . • xxmaj • most • of • us • have • ghosts • in • the • closet • , • and • xxmaj'

In [180]:
len(' • '.join(tp(il[:100])[0])[:])

1451

## 5. Numericalizing the text data

Once we have tokenized our texts, we replace each token by an individual number; this is called numericalizing. Again, we do this with a processor (not so different from the `CategoryProcessor`).

[Jump_to lesson 12 video](https://course.fast.ai/videos/?lesson=12&t=5491)

In [153]:
#export
import collections
from collections import Counter

class NumericalizeProcessor(Processor):
    def __init__(self, vocab=None, max_vocab=60000, min_freq=2): 
        self.vocab,self.max_vocab,self.min_freq = vocab,max_vocab,min_freq
    
    def __call__(self, items):
        #The vocab is defined on the first use.
        if self.vocab is None:
            freq = Counter(p for o in items for p in o)
            # keep the `max_vocab` most common tokens, and only those occurring at least `min_freq` times
            self.vocab = [o for o,c in freq.most_common(self.max_vocab) if c >= self.min_freq]
            for o in reversed(default_spec_tok):
                if o in self.vocab: self.vocab.remove(o)
                self.vocab.insert(0, o)
        if getattr(self, 'otoi', None) is None:
            self.otoi = collections.defaultdict(int,{v:k for k,v in enumerate(self.vocab)}) 
        return [self.proc1(o) for o in items]
    def proc1(self, item):  return [self.otoi[o] for o in item]
    
    def deprocess(self, idxs):
        assert self.vocab is not None
        return [self.deproc1(idx) for idx in idxs]
    def deproc1(self, idx): return [self.vocab[i] for i in idx]
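To see what the processor does on a tiny scale, here is a stand-alone sketch of the same steps on two made-up token lists. `build_vocab` is a hypothetical helper that mirrors the vocabulary-building branch of `__call__`; the `defaultdict(int)` lookup is what maps any out-of-vocabulary token to index 0, the position of `xxunk`.

```python
from collections import Counter, defaultdict

SPECIAL = "xxunk xxpad xxbos xxeos xxrep xxwrep xxup xxmaj".split()

def build_vocab(docs, max_vocab=60000, min_freq=2):
    "Most frequent tokens first, with the special tokens moved to the front."
    freq = Counter(tok for doc in docs for tok in doc)
    vocab = [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
    for tok in reversed(SPECIAL):
        if tok in vocab: vocab.remove(tok)
        vocab.insert(0, tok)
    return vocab

docs = [['xxbos', 'the', 'film', 'was', 'good', 'xxeos'],
        ['xxbos', 'the', 'film', 'was', 'bad', 'xxeos']]

vocab = build_vocab(docs)
# Unknown tokens fall back to int() == 0, i.e. the index of 'xxunk'.
otoi = defaultdict(int, {tok: i for i, tok in enumerate(vocab)})

ids = [otoi[tok] for tok in docs[0]]
decoded = [vocab[i] for i in ids]

print(vocab[:4])  # ['xxunk', 'xxpad', 'xxbos', 'xxeos']
print(ids)        # [2, 8, 9, 10, 0, 3]
print(decoded)    # ['xxbos', 'the', 'film', 'was', 'xxunk', 'xxeos']
```

`good` and `bad` each occur only once, below `min_freq=2`, so they are left out of the vocabulary and decode back to `xxunk`. Numericalizing is therefore lossy for rare tokens.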

### 5.1 Splitting the data into training and validation sets
For text classification, we will split by the grandparent folder as before, but for language modeling, we take all the texts and just put 10% aside for validation.

In [85]:
sd = SplitData.split_by_func(il, partial(random_splitter, p_valid=0.1))

In [86]:
sd

SplitData
Train: TextList (90085 items)
[WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/0_2.txt'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10000_4.txt'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10001_1.txt'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10002_3.txt'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10003_3.txt'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10004_2.txt'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10005_2.txt'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10006_2.txt'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10007_4.txt'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10008_4.txt')...]
Path: C:\Users\cross-entropy\.fastai\data\imdb
Valid: TextList (9915 items)
[WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test/neg/10010_2.txt'), WindowsPath('C:/Users/cro
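The validation set above holds 9,915 items rather than exactly 10,000 because each item is assigned independently at random. The sketch below illustrates this with the standard library only. The bodies of `random_splitter` and `split_by_func` are assumptions about how the helpers from the earlier notebook behave (a `True` result sends the item to the validation set), not copies of them.

```python
import random

def random_splitter(item, p_valid):
    # Assumed behavior: True means "put this item in the validation set".
    return random.random() < p_valid

def split_by_func(items, f):
    "Assumed behavior: partition `items` according to the boolean `f(item)`."
    mask = [f(o) for o in items]
    train = [o for o, m in zip(items, mask) if not m]
    valid = [o for o, m in zip(items, mask) if m]
    return train, valid

random.seed(0)  # only to make the illustration repeatable
items = list(range(1000))
train, valid = split_by_func(items, lambda o: random_splitter(o, p_valid=0.1))

# The sizes are close to a 90/10 split, but not exact.
print(len(train), len(valid))
```

Every item lands in exactly one of the two sets, and the validation share fluctuates around `p_valid` from run to run unless the random seed is fixed.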

### 5.2 Labeling
When we do language modeling, we will infer the labels from the text during training, so there's no need to label. The training loop expects labels however, so we need to add dummy ones.

In [87]:
# proc_tok,proc_num = TokenizeProcessor(max_workers=8),NumericalizeProcessor()
proc_tok,proc_num = TokenizeProcessor(max_workers=1),NumericalizeProcessor()

In [88]:
%time ll = label_by_func(sd, lambda x: 0, proc_x = [proc_tok,proc_num])

Wall time: 11min 54s


Once the items have been processed, they become lists of numbers. We can still access the underlying raw data with `x_obj` for the text and `y_obj` for the targets (which in this case are all dummies).

In [121]:
# `x_obj` is a method: without parentheses the cell shows the bound method, whose repr includes the numericalized texts
ll.train.x_obj

<bound method LabeledData.x_obj of LabeledData
x: TextList (90085 items)
[[2, 7, 301, 193, 7, 596, 7, 5444, 61, 3435, 60, 12, 29, 28, 248, 1135, 93, 1700, 9, 7, 1216, 51, 8, 1353, 1615, 2073, 828, 10, 13, 79, 54, 38, 70, 190, 18, 56, 87, 37, 474, 59, 120, 13, 8, 121, 9, 7, 110, 13, 202, 41, 3017, 17, 8, 4844, 10, 11, 7, 5444, 22, 123, 38, 1695, 432, 34, 10, 11, 115, 1533, 385, 94, 328, 10, 47, 79, 75, 18, 87, 37, 474, 9, 7, 8, 123, 90, 156, 83, 474, 59, 15, 12, 70, 9452, 10, 37895, 7, 8476, 7, 10744, 9, 7, 8, 462, 15, 39, 297, 142, 26, 543, 49, 1212, 39, 22, 146, 93, 270, 345, 209, 108, 11, 294, 74, 3670, 13, 12, 16872, 4844, 9, 7, 40, 82, 13455, 738, 14, 43, 1769, 143, 7, 5444, 9, 7, 449, 68, 90, 38, 89, 520, 8, 331, 116, 243, 13, 19, 3972, 10, 7, 5444, 715, 202, 44, 59, 7, 10744, 22, 3017, 9, 7, 90, 38, 594, 154, 7, 10744, 15, 2084, 14, 43, 8, 139, 27, 74, 2716, 16667, 55, 12049, 9, 7, 74, 1298, 148, 10, 16, 25, 44, 18, 95, 58, 14, 409, 51, 1618, 16, 142, 48, 563, 17, 9, 3], [2, 7, 1

In [154]:
# Here is the text in the first review
ll.train.x_obj(0)

"xxbos xxmaj once again xxmaj mr. xxmaj costner has dragged out a movie for far longer than necessary . xxmaj aside from the terrific sea rescue sequences , of which there are very few i just did not care about any of the characters . xxmaj most of us have ghosts in the closet , and xxmaj costner 's character are realized early on , and then forgotten until much later , by which time i did not care . xxmaj the character we should really care about is a very cocky , overconfident xxmaj ashton xxmaj kutcher . xxmaj the problem is he comes off as kid who thinks he 's better than anyone else around him and shows no signs of a cluttered closet . xxmaj his only obstacle appears to be winning over xxmaj costner . xxmaj finally when we are well past the half way point of this stinker , xxmaj costner tells us all about xxmaj kutcher 's ghosts . xxmaj we are told why xxmaj kutcher is driven to be the best with no prior inkling or foreshadowing . xxmaj no magic here , it was all i could do to kee

### 5.3 Save the labeled data
Since the preprocessing takes time, we save the intermediate result using pickle.

Don't use `lambda` functions in your processors, or they can't be pickled!

In [90]:
with open(path/'ld.pkl', 'wb') as f: pickle.dump(ll, f)

In [91]:
with open(path/'ld.pkl', 'rb') as f: ll = pickle.load(f)

## 6. Batching

### 6.1 Batching for Language Modeling

We have a bit of work to do to convert our `LabelList` into a `DataBunch`, as we don't just want batches of IMDb reviews: we want to stream through all the texts concatenated together. We also have to prepare the targets, which are the next words in the text. All of this is done by the next object, called `LM_PreLoader`. At the beginning of each epoch, it shuffles the articles (if `shuffle=True`) and creates one big stream by concatenating all of them. We divide this big stream into `bs` smaller streams, which we will read in chunks of `bptt` tokens.

[Jump_to lesson 12 video](https://course.fast.ai/videos/?lesson=12&t=5565)

In [155]:
# Just using those for illustration purposes, they're not used otherwise.
from IPython.display import display,HTML
import pandas as pd

Let's say our stream is:

In [156]:
stream = """
In this notebook, we will go back over the example of classifying movie reviews we studied in part 1 and dig deeper under the surface. 
First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the Processor used in the data block API.
Then we will study how we build a language model and train it.\n
"""
tokens = np.array(tp([stream])[0])

Here's how to split the stream into 6 rows of 15 tokens each.
The original notebook calls the number of rows `bs`: each mini-batch takes one chunk from every row, so the number of rows is also the batch size. To keep the two ideas apart, the code below uses the name `n_batches` for the number of rows.

In [157]:
n_batches,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(n_batches)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

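| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| xxbos | `\n` | xxmaj | in | this | notebook | , | we | will | go | back | over | the | example | of |
| classifying | movie | reviews | we | studied | in | part | 1 | and | dig | deeper | under | the | surface | . |
| `\n` | xxmaj | first | we | will | look | at | the | processing | steps | necessary | to | convert | text | into |
| numbers | and | how | to | customize | it | . | xxmaj | by | doing | this | , | we | 'll | have |
| another | example | of | the | xxmaj | processor | used | in | the | data | block | api | . | `\n` | xxmaj |
| then | we | will | study | how | we | build | a | language | model | and | train | it | . | `\n\n` |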


In [158]:
# we can also view the data frame like this:
df

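|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xxbos | `\n` | xxmaj | in | this | notebook | , | we | will | go | back | over | the | example | of |
| 1 | classifying | movie | reviews | we | studied | in | part | 1 | and | dig | deeper | under | the | surface | . |
| 2 | `\n` | xxmaj | first | we | will | look | at | the | processing | steps | necessary | to | convert | text | into |
| 3 | numbers | and | how | to | customize | it | . | xxmaj | by | doing | this | , | we | 'll | have |
| 4 | another | example | of | the | xxmaj | processor | used | in | the | data | block | api | . | `\n` | xxmaj |
| 5 | then | we | will | study | how | we | build | a | language | model | and | train | it | . | `\n\n` |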


Each row of 15 tokens can then be split into 3 chunks of 5 tokens, which gives three mini-batches of 6 rows by 5 tokens.
Here, `bptt` (`back-propagation through time`) is the number of tokens in each chunk.

In [159]:
n_batches,bptt = 6,5
for k in range(3):
    d_tokens = np.array([tokens[i*seq_len + k*bptt:i*seq_len + (k+1)*bptt] for i in range(n_batches)])
    df = pd.DataFrame(d_tokens)
    #display(HTML(df.to_html(index=False,header=None)))
    display(df)

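|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | xxbos | `\n` | xxmaj | in | this |
| 1 | classifying | movie | reviews | we | studied |
| 2 | `\n` | xxmaj | first | we | will |
| 3 | numbers | and | how | to | customize |
| 4 | another | example | of | the | xxmaj |
| 5 | then | we | will | study | how |


|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | notebook | , | we | will | go |
| 1 | in | part | 1 | and | dig |
| 2 | look | at | the | processing | steps |
| 3 | it | . | xxmaj | by | doing |
| 4 | processor | used | in | the | data |
| 5 | we | build | a | language | model |


|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | back | over | the | example | of |
| 1 | deeper | under | the | surface | . |
| 2 | necessary | to | convert | text | into |
| 3 | this | , | we | 'll | have |
| 4 | block | api | . | `\n` | xxmaj |
| 5 | and | train | it | . | `\n\n` |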


In [160]:
#export
class LM_PreLoader():
    def __init__(self, data, bs=64, bptt=70, shuffle=False):
        self.data,self.bs,self.bptt,self.shuffle = data,bs,bptt,shuffle
        total_len = sum([len(t) for t in data.x])
        self.n_batch = total_len // bs
        self.batchify()
    
    def __len__(self): return ((self.n_batch-1) // self.bptt) * self.bs
    
    def batchify(self):
        texts = self.data.x
        if self.shuffle: texts = texts[torch.randperm(len(texts))]
        stream = torch.cat([tensor(t) for t in texts])
        self.batched_data = stream[:self.n_batch * self.bs].view(self.bs, self.n_batch)
        
    def __getitem__(self, idx):
        source = self.batched_data[idx % self.bs] # % is the mod() operation
        seq_idx = (idx // self.bs) * self.bptt
        return source[seq_idx:seq_idx+self.bptt],source[seq_idx+1:seq_idx+self.bptt+1]
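The indexing arithmetic in `__getitem__` is easier to follow on a small example. The sketch below re-implements it with plain Python lists instead of tensors. `batchify` and `get_item` are hypothetical free functions that mirror the methods above, and the integers 0 to 19 stand in for token ids.

```python
def batchify(stream, bs):
    "Trim the stream and cut it into `bs` rows of `n_batch` tokens each."
    n_batch = len(stream) // bs
    trimmed = stream[:n_batch * bs]
    rows = [trimmed[r*n_batch:(r+1)*n_batch] for r in range(bs)]
    return rows, n_batch

def get_item(rows, idx, bs, bptt):
    "Same arithmetic as `LM_PreLoader.__getitem__`, on lists."
    source = rows[idx % bs]          # which row
    seq_idx = (idx // bs) * bptt     # where the chunk starts in that row
    x = source[seq_idx:seq_idx+bptt]
    y = source[seq_idx+1:seq_idx+bptt+1]   # same chunk, one token later
    return x, y

stream = list(range(20))
bs, bptt = 2, 3
rows, n_batch = batchify(stream, bs)
n_items = ((n_batch - 1) // bptt) * bs     # same formula as `__len__`
items = [get_item(rows, i, bs, bptt) for i in range(n_items)]

for x, y in items:
    print(x, y)
# [0, 1, 2] [1, 2, 3]
# [10, 11, 12] [11, 12, 13]
# [3, 4, 5] [4, 5, 6]
# [13, 14, 15] [14, 15, 16]
# [6, 7, 8] [7, 8, 9]
# [16, 17, 18] [17, 18, 19]
```

Consecutive indices cycle through the rows, so a `DataLoader` that reads `bs` items at a time without shuffling gets one chunk from every row, and the next batch continues each row where the previous one stopped. The `- 1` in the length formula leaves room for the target, which needs one token beyond the last input.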

### Instantiate a text DataLoader

In [161]:
dl = DataLoader(LM_PreLoader(ll.valid, shuffle=True), batch_size=64)

Let's check that it all works: `x1`, `y1`, `x2` and `y2` should all be of size `bs` by `bptt`. The text in each row of `x1` should continue in `x2`. `y1` and `y2` should contain the same text as their `x` counterparts, shifted by one token.

In [162]:
iter_dl = iter(dl)
x1,y1 = next(iter_dl)
x2,y2 = next(iter_dl)

In [163]:
x1.size(),y1.size()

(torch.Size([64, 70]), torch.Size([64, 70]))

In [164]:
vocab = proc_num.vocab

In [165]:
" ".join(vocab[o] for o in x1[0])

'xxbos xxmaj saw this in a near empty cinema when it came out and enjoyed it all the more . xxmaj got it again on a battered old vhs and it is still as great . xxmaj so why do some people hate it ? i think firstly the film is more about mood than plot , so you have to be able to relax to get into it .'

In [166]:
" ".join(vocab[o] for o in y1[0])

'xxmaj saw this in a near empty cinema when it came out and enjoyed it all the more . xxmaj got it again on a battered old vhs and it is still as great . xxmaj so why do some people hate it ? i think firstly the film is more about mood than plot , so you have to be able to relax to get into it . xxmaj'

In [167]:
" ".join(vocab[o] for o in x2[0])

'xxmaj its dream - like and as in dreams ( and musicals ) not everything makes sense or looks right . xxmaj the film is also about colour , every set piece has been designed to show bright neon colours - again dream like , but to others it just looks fake . xxmaj and to top it all you have a dream girl in the shape of xxmaj natassia'

And let's prepare some convenience functions to do this quickly.

In [168]:
#export
def get_lm_dls(train_ds, valid_ds, bs, bptt, **kwargs):
    return (DataLoader(LM_PreLoader(train_ds, bs, bptt, shuffle=True), batch_size=bs, **kwargs),
            DataLoader(LM_PreLoader(valid_ds, bs, bptt, shuffle=False), batch_size=2*bs, **kwargs))

def lm_databunchify(sd, bs, bptt, **kwargs):
    return DataBunch(*get_lm_dls(sd.train, sd.valid, bs, bptt, **kwargs))

In [169]:
bs,bptt = 64,70
data = lm_databunchify(ll, bs, bptt)

### 6.2 Batching for Classification

When we tackle classification, gathering the data will be a bit different: first we will label our texts with the folder they come from, and then we will need to apply padding to batch them together. To avoid mixing very long texts with very short ones, we will also use a `Sampler` to sort our samples by length (with a bit of randomness for the training set).
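A minimal sketch of the two ideas in the paragraph above, padding and grouping by length, using plain lists. `pad_batch` and `sorted_batches` are hypothetical helpers written for this illustration. Padding at the end of each sequence and the value `PAD_IDX = 1` (the position of `xxpad` when the special tokens come first in the vocabulary) are assumptions, and the randomness mentioned for the training set is left out.

```python
PAD_IDX = 1  # assumed index of 'xxpad' in the vocabulary

def pad_batch(seqs, pad_idx=PAD_IDX):
    "Right-pad every sequence to the length of the longest one in the batch."
    max_len = max(len(s) for s in seqs)
    return [s + [pad_idx] * (max_len - len(s)) for s in seqs]

def sorted_batches(seqs, bs):
    "Group sequences of similar length: sort by length (longest first), then slice."
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]), reverse=True)
    return [[seqs[i] for i in order[k:k+bs]] for k in range(0, len(order), bs)]

seqs = [[2, 8, 3], [2, 8, 9, 10, 11, 3], [2, 3], [2, 8, 9, 10, 3]]
batches = [pad_batch(b) for b in sorted_batches(seqs, bs=2)]

for b in batches:
    print(b)
# [[2, 8, 9, 10, 11, 3], [2, 8, 9, 10, 3, 1]]
# [[2, 8, 3], [2, 3, 1]]
```

Grouping by length keeps the amount of padding small: only two pad tokens are needed here, whereas batching the first two sequences together in their original order would already require three.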

First, the data block API calls should look familiar.

[Jump_to lesson 12 video](https://course.fast.ai/videos/?lesson=12&t=5877)

In [181]:
proc_cat = CategoryProcessor()

In [171]:
il = TextList.from_files(path, include=['train', 'test'])
sd = SplitData.split_by_func(il, partial(grandparent_splitter, valid_name='test'))
ll = label_by_func(sd, parent_labeler, proc_x = [proc_tok, proc_num], proc_y=proc_cat)

In [177]:
ll

SplitData
Train: LabeledData
x: TextList (25000 items)
[[2, 7, 81, 13, 12, 145, 49, 61, 6926, 1419, 28, 12, 4396, 9, 7, 527, 60, 27, 12, 649, 150, 20, 15, 12, 1353, 509, 13, 1903, 223, 9, 12, 10940, 7256, 329, 15, 660, 104, 48, 2019, 10, 1069, 2495, 47, 8, 949, 0, 13, 16, 22, 5787, 9, 7, 494, 16, 2937, 1903, 8, 241, 75, 27, 74, 765, 1444, 872, 254, 16, 56, 117, 142, 1508, 9, 7, 76, 165, 51, 8, 958, 156, 43, 660, 142, 9, 7, 8, 12909, 430, 73, 113, 7, 2250, 327, 756, 14, 12, 846, 12352, 9, 7, 34, 12, 1919, 646, 16, 22, 146, 93, 32, 252, 122, 27, 65, 67, 665, 47, 714, 101, 7, 37310, 7, 39074, 9, 7, 714, 425, 7, 3582, 7, 21255, 11, 7, 11996, 7, 6461, 78, 43, 130, 3384, 9, 3], [2, 7, 4118, 62, 12233, 527, 26, 12, 3399, 183, 8685, 14591, 1595, 15, 4820, 72, 27, 4491, 6380, 200, 161, 11736, 14, 970, 4943, 7, 3759, 7, 4585, 36, 7, 619, 7, 1834, 33, 49, 15, 1701, 111, 200, 12, 760, 13, 24171, 22, 14, 40, 3820, 17, 9541, 13, 16, 129, 3200, 14, 8, 1041, 26, 12, 4756, 10, 103, 34, 1816, 15, 7, 458

In [182]:
with open(path/'ll_clas.pkl', 'wb') as f: pickle.dump(ll, f)

PicklingError: Can't pickle <class '__main__.NumericalizeProcessor'>: it's not the same object as __main__.NumericalizeProcessor

In [183]:
with open(path/'ll_clas.pkl', 'rb') as f: ll = pickle.load(f)

EOFError: Ran out of input

Let's check the labels seem consistent with the texts.

In [92]:
[(ll.train.x_obj(i), ll.train.y_obj(i)) for i in [1,12552]]

[("xxbos xxmaj airport ' 77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman xxmaj philip xxmaj stevens ( xxmaj james xxmaj stewart ) who is flying them & a bunch of vip 's to his estate in preparation of it being opened to the public as a museum , also on board is xxmaj stevens daughter xxmaj julie ( xxmaj kathleen xxmaj quinlan ) & her son . xxmaj the luxury xxunk takes off as planned but mid - air the plane is hi - jacked by the co - pilot xxmaj chambers ( xxmaj robert xxmaj foxworth ) & his two accomplice 's xxmaj banker ( xxmaj monte xxmaj markham ) & xxmaj wilson ( xxmaj michael xxmaj pataki ) who knock the passengers & crew out with sleeping gas , they plan to steal the valuable cargo & land on a disused plane strip on an isolated island but while making his descent xxmaj chambers almost hits an oil rig in the xxmaj ocean & loses control of the plane sending it crashing into the sea where it sinks to the bottom righ

We saw samplers in notebook 03. For the validation set, we will simply sort the samples by length, and we begin with the longest ones for memory reasons (it's better to always have the biggest tensors first).

In [93]:
#export
from torch.utils.data import Sampler

class SortSampler(Sampler):
    def __init__(self, data_source, key): self.data_source,self.key = data_source,key
    def __len__(self): return len(self.data_source)
    def __iter__(self):
        return iter(sorted(range(len(self.data_source)), key=self.key, reverse=True))
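As a quick check of the ordering, here is the same sort applied to a toy list in plain Python, without the `Sampler` base class. The `texts` list and the `text_len` helper are made up for illustration.

```python
texts = [[2, 7, 3], [2, 7, 19, 25, 9, 3], [2, 3], [2, 7, 48, 9, 3]]   # made-up numericalized texts

def text_len(i):
    """Key function: length of the i-th text."""
    return len(texts[i])

# Same idea as SortSampler.__iter__: indices sorted by key, longest first.
order = sorted(range(len(texts)), key=text_len, reverse=True)
print(order)     # [1, 3, 0, 2]

# A DataLoader with batch_size=2 would then consume the indices in consecutive pairs.
batches = [order[i:i + 2] for i in range(0, len(order), 2)]
print(batches)   # [[1, 3], [0, 2]]
```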

For the training set, we want some kind of randomness on top of this. So first, we shuffle the texts and build megabatches of size `50 * bs`. We sort each megabatch by length before splitting it into 50 minibatches. That way we get randomized batches of roughly the same length.

Then we make sure to have the biggest batch first and shuffle the order of the other batches. We also make sure the last batch stays at the end, because it is probably smaller than the batch size.

In [94]:
#export
class SortishSampler(Sampler):
    def __init__(self, data_source, key, bs):
        self.data_source,self.key,self.bs = data_source,key,bs

    def __len__(self) -> int: return len(self.data_source)

    def __iter__(self):
        idxs = torch.randperm(len(self.data_source))
        megabatches = [idxs[i:i+self.bs*50] for i in range(0, len(idxs), self.bs*50)]
        sorted_idx = torch.cat([tensor(sorted(s, key=self.key, reverse=True)) for s in megabatches])
        batches = [sorted_idx[i:i+self.bs] for i in range(0, len(sorted_idx), self.bs)]
        max_idx = torch.argmax(tensor([self.key(ck[0]) for ck in batches]))  # find the chunk with the largest key,
        batches[0],batches[max_idx] = batches[max_idx],batches[0]            # then make sure it goes first.
        if len(batches) < 3: return iter(torch.cat(batches))                  # nothing to shuffle between first and last
        batch_idxs = torch.randperm(len(batches)-2)                           # shuffle all but the first and last batch
        sorted_idx = torch.cat([batches[i+1] for i in batch_idxs])
        sorted_idx = torch.cat([batches[0], sorted_idx, batches[-1]])
        return iter(sorted_idx)
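The same steps can be sketched in pure Python to check the behaviour on toy data. This is an illustrative stand-in under stated assumptions, not the notebook's sampler: `sortish_order`, its `mega` and `seed` parameters, and the random lengths are all hypothetical, and the input is assumed to be non-empty.

```python
import random

def sortish_order(lengths, bs, mega=50, seed=0):
    """Return indices grouped into batches of similar length, in semi-random order.

    Hypothetical sketch; `mega` is the number of minibatches per megabatch.
    """
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)
    # Sort inside each megabatch, longest first.
    size = bs * mega
    sorted_idx = []
    for i in range(0, len(idxs), size):
        sorted_idx += sorted(idxs[i:i + size], key=lambda j: lengths[j], reverse=True)
    batches = [sorted_idx[i:i + bs] for i in range(0, len(sorted_idx), bs)]
    # Move the batch that starts with the longest text to the front.
    longest = max(range(len(batches)), key=lambda b: lengths[batches[b][0]])
    batches[0], batches[longest] = batches[longest], batches[0]
    # Shuffle everything between the first and the last batch.
    middle = batches[1:-1]
    rng.shuffle(middle)
    ordered = batches[:1] + middle + batches[-1:] if len(batches) > 1 else batches
    return [j for batch in ordered for j in batch]

rng = random.Random(1)
lengths = [rng.randint(5, 500) for _ in range(1000)]   # made-up text lengths
order = sortish_order(lengths, bs=8, mega=5)
first_batch = [lengths[j] for j in order[:8]]
print(first_batch)
```

Every index appears exactly once, the first batch contains the longest text, and each batch holds texts of similar length because it is a slice of a sorted megabatch.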

Padding: we add the padding token (which has an id of 1) at the end of each sequence to make them all the same size when batching them. Note that we need the padding at the end to be able to use the `PyTorch` convenience functions that let us ignore that padding (see 12c).

In [95]:
#export
def pad_collate(samples, pad_idx=1, pad_first=False):
    max_len = max([len(s[0]) for s in samples])
    res = torch.zeros(len(samples), max_len).long() + pad_idx
    for i,s in enumerate(samples):
        if pad_first: res[i, -len(s[0]):] = LongTensor(s[0])
        else:         res[i, :len(s[0]) ] = LongTensor(s[0])
    return res, tensor([s[1] for s in samples])
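To make the two padding modes concrete, here is a plain-list sketch of the same idea. `pad_rows` is a hypothetical stand-in for `pad_collate` that ignores labels and returns lists rather than tensors; the toy sequences are made up.

```python
def pad_rows(seqs, pad_idx=1, pad_first=False):
    """Pad lists of token ids to a common length (plain-list stand-in for the tensor version)."""
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [pad_idx] * (max_len - len(s))
        padded.append(fill + s if pad_first else s + fill)
    return padded

seqs = [[2, 7, 19, 9, 3], [2, 7, 3], [2, 3]]   # made-up numericalized texts
at_end = pad_rows(seqs)
at_start = pad_rows(seqs, pad_first=True)
print(at_end)     # [[2, 7, 19, 9, 3], [2, 7, 3, 1, 1], [2, 3, 1, 1, 1]]
print(at_start)   # [[2, 7, 19, 9, 3], [1, 1, 2, 7, 3], [1, 1, 1, 2, 3]]

# Original lengths can be recovered by counting non-pad tokens,
# assuming the pad id never occurs inside a real text.
lengths = [sum(tok != 1 for tok in row) for row in at_end]
print(lengths)    # [5, 3, 2]
```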

In [96]:
bs = 64
train_sampler = SortishSampler(ll.train.x, key=lambda t: len(ll.train[int(t)][0]), bs=bs)
train_dl = DataLoader(ll.train, batch_size=bs, sampler=train_sampler, collate_fn=pad_collate)

In [97]:
iter_dl = iter(train_dl)
x,y = next(iter_dl)

In [98]:
lengths = []
for i in range(x.size(0)): lengths.append(x.size(1) - (x[i]==1).sum().item())
lengths[:5], lengths[-1]

([3311, 1394, 1358, 1346, 1344], 1013)

The last one is the minimal length. This is the first batch, so it contains the longest sequence. If we look at the next batch, which is more random, we see that the lengths are roughly the same.

In [99]:
x,y = next(iter_dl)
lengths = []
for i in range(x.size(0)): lengths.append(x.size(1) - (x[i]==1).sum().item())
lengths[:5], lengths[-1]

([102, 102, 102, 101, 101], 92)

We can see the padding at the end:

In [100]:
x

tensor([[   2,    7,   19,  ...,  185,    9,    3],
        [   2,    7,   19,  ...,  108,    9,    3],
        [   2,    7, 5049,  ...,  676,    9,    3],
        ...,
        [   2,    7,   48,  ...,    1,    1,    1],
        [   2,    7,   19,  ...,    1,    1,    1],
        [   2,   18,   25,  ...,    1,    1,    1]])

And we add a convenience function:

In [101]:
#export
def get_clas_dls(train_ds, valid_ds, bs, **kwargs):
    train_sampler = SortishSampler(train_ds.x, key=lambda t: len(train_ds.x[t]), bs=bs)
    valid_sampler = SortSampler(valid_ds.x, key=lambda t: len(valid_ds.x[t]))
    return (DataLoader(train_ds, batch_size=bs, sampler=train_sampler, collate_fn=pad_collate, **kwargs),
            DataLoader(valid_ds, batch_size=bs*2, sampler=valid_sampler, collate_fn=pad_collate, **kwargs))

def clas_databunchify(sd, bs, **kwargs):
    return DataBunch(*get_clas_dls(sd.train, sd.valid, bs, **kwargs))

In [102]:
bs = 64
data = clas_databunchify(ll, bs)

## Export

In [103]:
!python notebook2script.py 12_text.ipynb

Converted 12_text.ipynb to exp\nb_12.py
