# Lesson 10

## 00:00:14 - Review of last week

* Last week: lots of people struggled with the material.
  * By Lesson 14, you'll have a second go at Object Detection.
* Detecting multiple objects is similar to the single bounding box problem, except we have to solve the "matching problem".
  * Create far more activations than ground truth bounding boxes, and match each ground truth object to a subset of those activations.
* If you're stuck: revisit Lesson 8.
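
The matching step described above can be sketched as follows: compute the overlap (IoU, also called the Jaccard index) between every ground truth box and every anchor, then assign each anchor to the ground truth object it overlaps most, provided the overlap clears a threshold. This is a minimal NumPy sketch rather than the lesson's implementation; the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np

def iou(boxes_a, boxes_b):
    """Pairwise IoU between two sets of (x1, y1, x2, y2) boxes."""
    a = boxes_a[:, None, :]  # shape (A, 1, 4)
    b = boxes_b[None, :, :]  # shape (1, B, 4)
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter)

def match(gt_boxes, anchors, threshold=0.5):
    """Return, for each anchor, the index of its matched object or -1 for background."""
    overlaps = iou(gt_boxes, anchors)   # shape (num_gt, num_anchors)
    best_gt = overlaps.argmax(axis=0)   # best object for each anchor
    best_iou = overlaps.max(axis=0)
    # Make sure every ground truth object keeps its single best anchor.
    best_anchor = overlaps.argmax(axis=1)
    best_gt[best_anchor] = np.arange(len(gt_boxes))
    best_iou[best_anchor] = 1.0
    return np.where(best_iou > threshold, best_gt, -1)

# Hypothetical 2x2 grid of anchors on the unit square and one object in the top-left cell.
anchors = np.array([[0.0, 0.0, 0.5, 0.5], [0.5, 0.0, 1.0, 0.5],
                    [0.0, 0.5, 0.5, 1.0], [0.5, 0.5, 1.0, 1.0]])
gt_boxes = np.array([[0.05, 0.05, 0.5, 0.5]])
matches = match(gt_boxes, anchors)
print(matches)
```

Only the top-left anchor is matched here; the other three are treated as background.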
  
## 00:02:42 - Multiple objects revision

* More activations.
* Took advantage of the convolutional nature of the network to get activations whose receptive fields are similar to the ground truth objects we're predicting.
  * [Chloe Sultan](http://forums.fast.ai/t/part-2-lesson-9-wiki/14028/375) mapped out the size of the grids as the network downsamples using stride 2 convolutions.
  
<img src="http://forums.fast.ai/uploads/default/optimized/2X/a/a7942de9d5bf6c3afe50e10d55116c6c7fb5b721_1_669x500.png" width="700px">
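
Grid sizes like the ones in the figure can also be worked out by hand with the standard convolution output-size formula. The sketch below assumes 3x3 kernels with padding 1 and a 7x7 starting feature map; those values and the helper names are assumptions, so treat the numbers as illustrative rather than as a reading of the figure.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def grid_sizes(start, n_layers, **conv_kwargs):
    """Grid size after each of n_layers successive convolutions."""
    sizes = [start]
    for _ in range(n_layers):
        sizes.append(conv_out_size(sizes[-1], **conv_kwargs))
    return sizes

sizes = grid_sizes(7, 3)
print(sizes)  # [7, 4, 2, 1]
```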

* How did she calculate the numbers?
  * Could use `pdb.set_trace()` to view the output at each step of the network.
* Talked about increasing k: the number of activations, using different zooms and sizes.
* Got down to a small number of bounding boxes using non-maximum suppression (NMS).
  * A paper has come out that tries to train an end-to-end network (not using NMS).
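
NMS itself is short enough to sketch: repeatedly keep the highest-scoring box and drop every remaining box that overlaps it too much. This is a minimal greedy version in NumPy, not the code used in the lesson; the function name, box format `(x1, y1, x2, y2)`, and the 0.5 threshold are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    order = np.argsort(scores)[::-1]  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        overlap = inter / (areas[best] + areas[rest] - inter)
        # Only boxes that do not overlap the kept box too much survive.
        order = rest[overlap <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed; the third box does not overlap and is kept.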


## 00:07:04 - Reading papers

* Not enough people reading papers in study group: papers are the "real ground truth".
  * ``SSD_MultiHead.forward`` is not doing the same thing as the SSD paper: the paper may have a better version.
  * Uses a smaller k but a lot more smaller grids.
* Useful to map code and equations together:

<img src="https://i.gyazo.com/84dc304d3d8eab72b6fcbc1895424aa5.gif" width="700px">

  * Some people are code people who learn the math from the code and vice versa.
  
## 00:10:08 - Math notation

* Math notation can be hard to look up.
* [List of mathematical symbols](https://en.wikipedia.org/wiki/List_of_mathematical_symbols) Wikipedia article is a useful reference.
* Nobody learns all of math in one go: even top researchers need to research math symbols.

## 00:11:17 - Recreating results in papers

* The key figure from the focal loss (RetinaNet) paper was recreated in Excel by Sarada Lee from the forums.
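
The same kind of recreation works in a few lines of NumPy. The sketch below assumes the figure in question is the focal loss plot, which shows the loss FL(p_t) = -(1 - p_t)^γ log(p_t) against the probability of the correct class for several values of γ; the γ values and variable names are choices made for this illustration.

```python
import numpy as np

def focal_loss(p_t, gamma):
    """Focal loss for the probability p_t assigned to the correct class."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# One curve per gamma; gamma = 0 is ordinary cross entropy.
p_t = np.linspace(0.01, 1.0, 100)
curves = {gamma: focal_loss(p_t, gamma) for gamma in (0, 0.5, 1, 2, 5)}

for gamma, losses in curves.items():
    print(gamma, float(losses[49]))  # loss around p_t = 0.5
```

Larger γ pushes the loss for well-classified examples (high p_t) towards zero, which is the point the figure makes.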


## 00:14:12 - NLP

* Seen the idea of taking a pretrained model, removing the top layer and getting it to do something similar.
  * Want to see if that idea applies to NLP.
* Next lesson: combine NLP and computer vision.
  * learn to find word structures from images (image captioning)
  * learn to find images from word structures

## 00:18:56 - torchtext to fastai.text

* torchtext has a number of limitations:
  * quite slow
    * doesn't do parallel processing.
    * doesn't cache results.
  * hard to do simple things like multi-label problems.
* `fastai.text` is a replacement for `fastai.nlp`

## 00:20:30 - IMDB revisited

* Dataset of movie reviews.

In [160]:
from fastai.fastai.text import *
import html
from pathlib import Path


In [111]:
PATH=Path('./data/aclImdb/')

In [104]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz  --directory-prefix=data

--2018-07-27 15:23:10--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘data/aclImdb_v1.tar.gz.1’


2018-07-27 15:23:17 (12.4 MB/s) - ‘data/aclImdb_v1.tar.gz.1’ saved [84125825/84125825]



In [105]:
!tar -xzf data/aclImdb_v1.tar.gz.1 -C data

In [106]:
BEGINNING_OF_SENTENCE_TAG = 'xbos'  # beginning of sentence tag
DATAFIELD_TAG = 'xfld'  # data field tag

In [107]:
CLAS_PATH = Path('data/imdb_clas/')
CLAS_PATH.mkdir(exist_ok=True)

LM_PATH = Path('data/imdb_lm/')
LM_PATH.mkdir(exist_ok=True)

In [108]:
CLASSES = ['neg', 'pos', 'unsup']

In [112]:
def get_texts(path):
    texts, labels = [], []
    for idx, label in enumerate(CLASSES):
        for fname in (path / label).glob('*.*'):
            texts.append(fname.open('r').read())
            labels.append(idx)

    return np.array(texts), np.array(labels)
            
trn_texts, trn_labels = get_texts(PATH / 'train')
val_texts, val_labels = get_texts(PATH / 'test')

In [113]:
len(trn_texts), len(val_texts)

(75000, 25000)

* The approach above is much easier than the convoluted torchtext method: reading text is not that hard.

In [114]:
col_names = ['labels', 'text']

In [115]:
np.random.seed(42)

train_idx = np.random.permutation(len(trn_texts))
val_idx = np.random.permutation(len(val_texts))

In [116]:
train_texts = trn_texts[train_idx]
val_texts = val_texts[val_idx]

In [117]:
train_labels = trn_labels[train_idx]
val_labels = val_labels[val_idx]
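
The reason for indexing both arrays with the same permutation is that each text stays paired with its label after shuffling. A tiny sketch with made-up data (the array contents and variable names are invented for illustration):

```python
import numpy as np

rng = np.random.RandomState(42)
texts = np.array(['bad film', 'great film', 'no label yet'])
labels = np.array([0, 1, 2])

# One permutation, applied to both arrays, keeps the pairs intact.
idx = rng.permutation(len(texts))
shuffled_texts, shuffled_labels = texts[idx], labels[idx]

pairs_before = set(zip(texts.tolist(), labels.tolist()))
pairs_after = set(zip(shuffled_texts.tolist(), shuffled_labels.tolist()))
print(pairs_before == pairs_after)  # True
```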

In [118]:
train_texts.shape

(75000,)

In [119]:
train_texts[0]

'I bought this DVD for $2.00 at the local variety warehouse. The creepy clown cover and quote "he\'s not clowning around"took my eye with a hope of either a Horror/slasher or possibly a b grade movie that might have some laughs.<br /><br />Man was i mistaken. This movie might be OK to see if you had smoked some wicked herb or taken some acid, though because i hadnt it just made me angry.<br /><br />The story/plot was not original and the constant use of similar sounding riffs to the john carpenter halloween theme and the ëxcorcist theme quickly became annoying.<br /><br />Alice witnessed the death of her son or did she? Was it the good clown,bad clown an evil serial killer or herself? I don\'t know if this film was SUPPOSED to be open for interpretation if this was the writer/directors master plan... i don\'t think it was.<br /><br />maybe i am not really here typing this review and i am secretly in some mental hospital thinking about writing this review?.<br /><br />Overall this film 

### 00:25:50 - Classification path vs language model path

* Classification model requires labels:

In [120]:
df_train = pd.DataFrame({'text': train_texts, 'labels': train_labels}, columns=col_names)
df_val = pd.DataFrame({'text': val_texts, 'labels': val_labels}, columns=col_names)

df_train[df_train['labels'] != 2].to_csv(CLAS_PATH / 'train.csv', header=False, index=False)
df_val.to_csv(CLAS_PATH / 'test.csv', header=False, index=False)

(CLAS_PATH / 'classes.txt').open('w').writelines(f'{o}\n' for o in CLASSES)

df_train.head()

Unnamed: 0,labels,text
0,2,I bought this DVD for $2.00 at the local varie...
1,0,1st watched 4/29/2007 - 4 out of 10(Dir-Mick G...
2,1,"Election is a Chinese mob movie, or triads in ..."
3,2,I will ignore the obviously superior films by ...
4,2,I went into this film with no expectations. St...


* Language model has no labels:

In [121]:
# The language model ignores labels, so every row gets a dummy label of 0.
df_train = pd.DataFrame({'text': train_texts, 'labels': [0] * len(train_texts)}, columns=col_names)
df_val = pd.DataFrame({'text': val_texts, 'labels': [0] * len(val_texts)}, columns=col_names)

df_train.head()

Unnamed: 0,labels,text
0,0,I bought this DVD for $2.00 at the local varie...
1,0,1st watched 4/29/2007 - 4 out of 10(Dir-Mick G...
2,0,"Election is a Chinese mob movie, or triads in ..."
3,0,I will ignore the obviously superior films by ...
4,0,I went into this film with no expectations. St...


* Can create validation set as follows:

In [122]:
train_texts, val_texts = sklearn.model_selection.train_test_split(
    np.concatenate([train_texts, val_texts]), test_size=0.1)

In [123]:
len(train_texts), len(val_texts)

(90000, 10000)

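In [125]:
df_train = pd.DataFrame({'text': train_texts, 'labels': [0] * len(train_texts)}, columns=col_names)  # rebuild from the new split
df_val = pd.DataFrame({'text': val_texts, 'labels': [0] * len(val_texts)}, columns=col_names)
df_train.to_csv(LM_PATH/'train.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test.csv', header=False, index=False)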

## Language model tokens

In [126]:
chunk_size = 24000

In [127]:
re1 = re.compile(r' +')

def fixup(x):
    x = (x
        .replace('#39;', "'")
        .replace('amp;', '&')
        .replace('#146;', "'")
        .replace('nbsp;', ' ')
        .replace('#36;', '$')
        .replace('\\n', "\n")
        .replace('quot;', "'")
        .replace('<br />', "\n")
        .replace('\\"', '"')
        .replace('<unk>', 'u_n')
        .replace(' @.@ ', '.')
        .replace(' @-@ ', '-')
        .replace('\\', ' \\ '))

    return re1.sub(' ', html.unescape(x))
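
The last line of `fixup` does two generic things: `html.unescape` decodes any well-formed HTML entities that remain, and the regex collapses runs of spaces. A small standalone sketch of just those steps on a made-up string (the sample text and variable names are invented for illustration):

```python
import html
import re

collapse_spaces = re.compile(r' +')

raw = 'Tom &amp; Jerry<br /><br />Great   fun, 5 &#36;'
without_breaks = raw.replace('<br />', '\n')  # HTML line breaks become newlines
decoded = html.unescape(without_breaks)       # &amp; -> &, &#36; -> $
clean = collapse_spaces.sub(' ', decoded)     # runs of spaces become one space
print(repr(clean))
```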

In [161]:
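def get_texts(df, num_labels=1):
    # The first num_labels columns hold the labels; the remaining columns hold text fields.
    labels = df.iloc[:, range(num_labels)].values.astype(np.int64)
    # Prefix each document with the beginning-of-sentence tag and a numbered field tag.
    texts = f'\n{BEGINNING_OF_SENTENCE_TAG} {DATAFIELD_TAG} 1 ' + df[num_labels].astype(str)
    for i in range(num_labels + 1, len(df.columns)):
        texts += f' {DATAFIELD_TAG} {i - num_labels} ' + df[i].astype(str)
    texts = texts.apply(fixup).values.astype(str)

    # Split the texts across CPU cores and tokenize the partitions in parallel.
    core_partitions = partition_by_cores(texts)
    tok = Tokenizer().proc_all_mp(core_partitions)
    return tok, list(labels)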
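
For the single-text-column IMDB case, the string that `get_texts` is meant to build for each row before tokenizing looks like the output below. This is a sketch on a made-up two-row frame; `BOS` and `FLD` are short stand-ins for the tag constants defined earlier, and the cleanup and tokenization steps are left out.

```python
import pandas as pd

BOS, FLD = 'xbos', 'xfld'  # stand-ins for the tag constants used in the notebook

# Column 0 holds the label and column 1 the text, as in the CSVs written without a header.
df = pd.DataFrame({0: [0, 0], 1: ['Loved it.', 'Too long.']})
num_labels = 1

tagged = (f'\n{BOS} {FLD} 1 ' + df[num_labels].astype(str)).tolist()
print(tagged)
```

Each document then starts with a newline, the beginning-of-sentence tag, and a numbered field tag, so the tags survive into the token stream.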

In [162]:
def get_all(df, n_lbls):
    tok, labels = [], []
    for i, r in enumerate(df):
        print(i)
        tok_, labels_ = get_texts(r, n_lbls)
        tok += tok_
        labels += labels_
    return tok, labels

In [165]:
!python -m spacy download en

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 57.9MB/s ta 0:00:01
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    /home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/site-packages/en_core_web_sm
    -->
    /home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/data/en

    You can now load the model via spacy.load('en')



In [163]:
df_train = pd.read_csv(LM_PATH / 'train.csv', header=None, chunksize=chunk_size)
df_val = pd.read_csv(LM_PATH / 'test.csv', header=None, chunksize=chunk_size)
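
With `chunksize` set, `pd.read_csv` returns an iterator of DataFrames rather than a single DataFrame, which is why `get_all` loops over it. A small sketch using an in-memory CSV (the sample rows are made up):

```python
import io
import pandas as pd

csv_data = io.StringIO('0,first review\n0,second review\n0,third review\n')

# Each iteration yields a DataFrame with at most `chunksize` rows.
chunks = pd.read_csv(csv_data, header=None, chunksize=2)
shapes = [chunk.shape for chunk in chunks]
print(shapes)  # [(2, 2), (1, 2)]
```

Because the files have no header row, the columns are named by position (0 for the label, 1 for the text), which is what `get_texts` relies on.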

In [166]:
tok_trn, train_labels = get_all(df_train, 1)
tok_val, val_labels = get_all(df_val, 1)

0
1
2
0
1


In [167]:
(LM_PATH/'tmp').mkdir(exist_ok=True)

In [168]:
np.save(LM_PATH/'tmp'/'tok_trn.npy', tok_trn)
np.save(LM_PATH/'tmp'/'tok_val.npy', tok_val)

In [170]:
tok_trn = np.load(LM_PATH/'tmp'/'tok_trn.npy')
tok_val = np.load(LM_PATH/'tmp'/'tok_val.npy')

In [171]:
freq = Counter(p for o in tok_trn for p in o)
freq.most_common(25)

[('the', 686043),
 ('.', 564563),
 (',', 562529),
 ('and', 336037),
 ('a', 331366),
 ('of', 299003),
 ('to', 276187),
 ('is', 225266),
 ('it', 193359),
 ('in', 192089),
 ('i', 162280),
 ('that', 149386),
 ('this', 146216),
 ('"', 134061),
 ("'s", 126735),
 ('-', 107354),
 ('was', 102382),
 ('\n\n', 102266),
 ('as', 94208),
 ('with', 91086),
 ('for', 90453),
 ('movie', 89957),
 ('but', 86291),
 ('film', 82097),
 (')', 70829)]

In [172]:
max_vocab = 60000
min_freq = 2

In [174]:
itos = [o for o, c in freq.most_common(max_vocab) if c > min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')
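
On a toy corpus, the vocabulary construction above behaves as follows. The corpus is made up, and the reverse lookup `stoi` is an assumption about the usual next step (mapping tokens back to ids, with unknown tokens falling back to index 0), not something shown in the cells above.

```python
from collections import Counter, defaultdict

tokens = [['the', 'film', 'was', 'good'],
          ['the', 'film', 'was', 'bad'],
          ['the', 'plot', 'was', 'thin'],
          ['the', 'end']]

max_vocab = 60000
min_freq = 2

freq = Counter(tok for doc in tokens for tok in doc)
# Strictly greater than min_freq: a token must appear at least 3 times here.
itos = [tok for tok, count in freq.most_common(max_vocab) if count > min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')  # ends up at index 0, with '_pad_' at index 1

# Hypothetical reverse mapping: unseen or rare tokens map to 0, the '_unk_' index.
stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})
ids = [stoi[tok] for tok in tokens[0]]
print(itos, ids)
```

Here "film" appears only twice, so it is dropped from the vocabulary and encoded as `_unk_`.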