### Operator overloading for more concise model definitions

Thinc allows you to **overload operators**: you can bind arbitrary functions to Python operators such as `+`, `*`, `>>` or `@`. The `Model.define_operators` context manager takes a dict that maps operators to functions, typically combinators like `chain`. The bindings are only valid inside the `with` block. The cells below first set up the environment and then show the same model defined with and without operators.

In [None]:
!pip install "thinc>=8.0.0a0" ml_datasets "tqdm>=4.41" syntok

In [3]:
from thinc.api import prefer_gpu
prefer_gpu()

False

Instead of passing the layers to `chain` as a comma-separated list of arguments, you can combine them with custom operators.
For example, this lets you transform the following code

```python
from thinc.api import Model, chain, Relu, Softmax
n_hidden = 32
dropout = 0.2

model = chain(
    Relu(nO=n_hidden, dropout=dropout), 
    Relu(nO=n_hidden, dropout=dropout), 
    Softmax()
)
```


into this (after a setup cell that adds the local `src` directory to `sys.path`):

In [4]:
# Add the project's src directory to sys.path so local modules can be imported
import sys
from pathlib import Path

# The notebook runs from a subdirectory, so the project root is one level up
root_path = Path.cwd().parent
print(root_path)
src_path = str(root_path / "src")

if src_path not in sys.path:
    sys.path.insert(0, src_path)

print(sys.path)



/Users/jean.metz/workspace/jmetzz/sandbox-thinc.ai
['/Users/jean.metz/workspace/jmetzz/sandbox-thinc.ai/src', '/Users/jean.metz/workspace/jmetzz/sandbox-thinc.ai/notebooks', '/private/var/folders/b0/wtsjtzcx3mdb4z2kbs_2bxd1c3lb67/T/b482e531-dfb3-4077-b15e-7b7fd9fb4869', '/Users/jean.metz/miniconda/envs/thinc.ai/lib/python37.zip', '/Users/jean.metz/miniconda/envs/thinc.ai/lib/python3.7', '/Users/jean.metz/miniconda/envs/thinc.ai/lib/python3.7/lib-dynload', '', '/Users/jean.metz/miniconda/envs/thinc.ai/lib/python3.7/site-packages', '/Users/jean.metz/miniconda/envs/thinc.ai/lib/python3.7/site-packages/IPython/extensions', '/Users/jean.metz/.ipython']


In [5]:
from thinc.api import Model, chain, Relu, Softmax

 
n_hidden = 32
dropout = 0.2

with Model.define_operators({">>": chain}):
    model = Relu(nO=n_hidden, dropout=dropout) >> Relu(nO=n_hidden, dropout=dropout) >> Softmax()
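
To see why the operators only work inside the `with` block, here is a minimal pure-Python sketch of the idea. It is an illustration under stated assumptions, not Thinc's actual implementation: `Node`, `toy_chain` and the error message are hypothetical names made up for this example.

```python
from contextlib import contextmanager


class Node:
    """Toy stand-in for a layer (hypothetical, not Thinc's Model class)."""

    # Class-level table mapping operator symbols to functions.
    _operators = {}

    def __init__(self, name):
        self.name = name

    def __rshift__(self, other):
        # Dispatch ">>" to whatever function is currently bound to it.
        if ">>" not in Node._operators:
            raise TypeError("Undefined operator: >>")
        return Node._operators[">>"](self, other)

    @classmethod
    @contextmanager
    def define_operators(cls, operators):
        # Install the bindings on entry and restore the old table on exit.
        previous = dict(cls._operators)
        cls._operators.update(operators)
        try:
            yield
        finally:
            cls._operators = previous


def toy_chain(left, right):
    # Combine two nodes into one whose name records the composition order.
    return Node(f"chain({left.name}, {right.name})")


with Node.define_operators({">>": toy_chain}):
    combined = Node("relu") >> Node("softmax")

print(combined.name)  # chain(relu, softmax)

# Outside the block the binding is gone, so ">>" fails again.
try:
    Node("relu") >> Node("softmax")
except TypeError as err:
    outside_error = str(err)
    print(outside_error)  # Undefined operator: >>
```

Keeping the bindings in a table that is restored on exit means an operator such as `>>` cannot leak its meaning into unrelated code.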

You can now pass the Thinc `model` defined above to the `train_model` function, which is imported from the local `train` module in the next cell.

In [6]:
from thinc.api import Adam, fix_random_seed
from tqdm.notebook import tqdm
import ml_datasets
from train import train_model

fix_random_seed(0)
optimizer = Adam(0.001)
batch_size = 128
data = (train_X, train_Y), (dev_X, dev_Y) = ml_datasets.mnist()


print("Measuring performance across iterations:")
train_model(data, model, optimizer, 20, batch_size)



Measuring performance across iterations:
0	22486.57	0.845
1	10951.25	0.890
2	8775.06	0.896
3	7903.34	0.909
4	7232.81	0.915
5	6666.07	0.918
6	6439.15	0.918
7	6108.11	0.924
8	5841.27	0.929
9	5656.59	0.928
10	5528.26	0.927
11	5399.24	0.931
12	5204.12	0.931
13	5127.3

Next is a definition for a text classification network. It expects a *list of arrays as input*, one 2-dimensional array per tokenized text, where each array has two columns holding different numeric identifier features (such as the tokens mapped to vocab IDs). It outputs a 2d array.

The two features are embedded using separate embedding tables, and the two vectors are added and passed through a `Maxout` layer with layer normalization and dropout. The sequences then pass through two pooling functions (`reduce_max` and `reduce_mean`), and the concatenated results are passed through two `Relu` layers with dropout and residual connections. Finally, the pooled vectors are passed through an output layer with a `Softmax` activation.
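
To make the pooling step concrete, here is a small pure-Python sketch of what max pooling, mean pooling and concatenation do to variable-length sequences. It is a conceptual illustration only: the helper names are hypothetical, plain lists stand in for arrays, and Thinc's own `reduce_max` and `reduce_mean` layers work on its array types rather than nested lists.

```python
def reduce_max_toy(seq):
    # Column-wise maximum over all token vectors of one sequence.
    return [max(column) for column in zip(*seq)]


def reduce_mean_toy(seq):
    # Column-wise mean over all token vectors of one sequence.
    return [sum(column) / len(seq) for column in zip(*seq)]


def pool_and_concatenate(seqs):
    # One fixed-size vector per sequence: max features followed by mean features.
    return [reduce_max_toy(seq) + reduce_mean_toy(seq) for seq in seqs]


seqs = [
    [[1.0, 4.0], [3.0, 2.0]],              # 2 tokens, width 2
    [[0.0, 0.0], [6.0, 3.0], [3.0, 9.0]],  # 3 tokens, width 2
]
pooled = pool_and_concatenate(seqs)
print(pooled)  # [[3.0, 4.0, 2.0, 3.0], [6.0, 9.0, 3.0, 4.0]]
```

Each sequence collapses to a single vector regardless of its length, and concatenating the two pooled results doubles the width (here from 2 to 4).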

In [8]:
from thinc.api import add, chain, concatenate, clone
from thinc.api import with_array, reduce_max, reduce_mean, residual
from thinc.api import Model, Embed, Maxout, Relu, Softmax, Dropout

nH = 5

with Model.define_operators({">>": chain, "|": concatenate, "+": add, "**": clone}):
    model = (
        with_array(
            (Embed(128, column=0) + Embed(64, column=1))
            >> Maxout(nH, normalize=True, dropout=0.2)
        )
        >> (reduce_max() | reduce_mean())
        >> residual(Relu() >> Dropout(0.2)) ** 2
        >> Softmax()
    )
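
Overloading changes what an operator does, but not Python's precedence rules: `**` binds tighter than `+`, which binds tighter than `>>`, which binds tighter than `|`. That is why `reduce_max() | reduce_mean()` is wrapped in parentheses above, while `residual(...) ** 2` needs none. The sketch below makes the grouping visible with a hypothetical `Expr` class that only records how an expression was parsed; it does not use Thinc.

```python
class Expr:
    """Records the shape of an expression as text (hypothetical helper)."""

    def __init__(self, text):
        self.text = text

    def _binary(self, name, other):
        return Expr(f"{name}({self.text}, {other.text})")

    def __rshift__(self, other):
        return self._binary("chain", other)

    def __or__(self, other):
        return self._binary("concatenate", other)

    def __add__(self, other):
        return self._binary("add", other)

    def __pow__(self, n):
        return Expr(f"clone({self.text}, {n})")


a, b, c, d = (Expr(name) for name in "abcd")

# With parentheses, the concatenation is one step inside the chain.
with_parens = (a >> (b | c) >> d).text
# Without them, ">>" is applied first and "|" joins two separate chains.
without_parens = (a >> b | c >> d).text
# "**" and "+" bind tighter than ">>", so they need no parentheses.
power_first = (a >> b ** 2 >> c).text
add_first = (a + b >> c).text

print(with_parens)     # chain(chain(a, concatenate(b, c)), d)
print(without_parens)  # concatenate(chain(a, b), chain(c, d))
print(power_first)     # chain(chain(a, clone(b, 2)), c)
print(add_first)       # chain(add(a, b), c)
```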


In [14]:
import ml_datasets
import numpy
from syntok.tokenizer import Tokenizer

def load_data():
    train_data, dev_data = ml_datasets.dbpedia(limit=2000)
    train_texts, train_cats = zip(*train_data)
    dev_texts, dev_cats = zip(*dev_data)
    unique_cats = list(numpy.unique(numpy.concatenate((train_cats, dev_cats))))
    nr_class = len(unique_cats)
    print(f"{len(train_data)} training / {len(dev_data)} dev\n{nr_class} classes")

    train_y = numpy.zeros((len(train_cats), nr_class), dtype="f")
    for i, cat in enumerate(train_cats):
        train_y[i][unique_cats.index(cat)] = 1
    dev_y = numpy.zeros((len(dev_cats), nr_class), dtype="f")
    for i, cat in enumerate(dev_cats):
        dev_y[i][unique_cats.index(cat)] = 1

    train_tokenized = tokenize_texts(train_texts)
    dev_tokenized = tokenize_texts(dev_texts)
    # Generate simple vocab mapping; ID 0 is reserved for <unk>,
    # so real tokens are numbered from 1
    vocab = {}
    count_id = 1
    for text in train_tokenized:
        for token in text:
            if token not in vocab:
                vocab[token] = count_id
                count_id += 1
    # Map texts using vocab
    train_X = []
    for text in train_tokenized:
        train_X.append(numpy.array([vocab.get(t, 0) for t in text]))
    dev_X = []
    for text in dev_tokenized:
        dev_X.append(numpy.array([vocab.get(t, 0) for t in text]))
    return (train_X, train_y), (dev_X, dev_y), vocab, train_texts, dev_texts


def tokenize_texts(texts):
    tok = Tokenizer()
    return [[token.value for token in tok.tokenize(text)] for text in texts]
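
As a quick check of the vocabulary logic, here is a minimal standalone sketch on toy data. It assumes ID 0 is reserved for unknown tokens, as the comment in `load_data` states; `build_vocab`, `encode` and the sample sentences are hypothetical and exist only for this illustration.

```python
def build_vocab(tokenized_texts):
    # Assign IDs in order of first appearance; 0 stays free for unknown tokens.
    vocab = {}
    next_id = 1
    for text in tokenized_texts:
        for token in text:
            if token not in vocab:
                vocab[token] = next_id
                next_id += 1
    return vocab


def encode(tokens, vocab):
    # Tokens that were never seen during vocabulary building map to 0.
    return [vocab.get(token, 0) for token in tokens]


toy_train = [["the", "cat", "sat"], ["the", "dog"]]
toy_vocab = build_vocab(toy_train)
encoded = encode(["the", "bird", "sat"], toy_vocab)

print(toy_vocab)  # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
print(encoded)    # [1, 0, 3]
```

Because the vocabulary is built from the training texts only, any token that appears solely in the dev set is encoded as 0.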


In [None]:
(train_X, train_y), (dev_X, dev_y), vocab, train_texts, dev_texts = load_data()

In [None]:
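# Pass the text classification data rather than the MNIST arrays loaded earlier
data = (train_X, train_y), (dev_X, dev_y)
train_model(data, model, optimizer, 20, batch_size)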