# Basic neural bag-of-words text classifier with Thinc

This notebook shows how to implement a simple neural text classification model in Thinc.

In [1]:
!pip install "thinc>=8.0.0a0" ml_datasets "tqdm>=4.41" syntok



In [2]:
import ml_datasets
import numpy
from syntok.tokenizer import Tokenizer
from thinc.api import fix_random_seed
from tqdm.notebook import tqdm
from thinc.types import Floats2d, Floats1d
from thinc.model import Model


## Utility functions

For simple and standalone tokenization, we'll use the [`syntok`](https://github.com/fnl/syntok) package and the following function:

In [3]:
def tokenize_texts(texts):
    tok = Tokenizer()
    return [[token.value for token in tok.tokenize(text)] for text in texts]

In [4]:
def prepare_data(raw_data):
    """One-hot encodes the labels, tokenizes the texts and generates a simple vocabulary mapping."""
    train_data, dev_data = raw_data
    train_texts, train_cats = zip(*train_data)
    dev_texts, dev_cats = zip(*dev_data)
    unique_cats = list(numpy.unique(numpy.concatenate((train_cats, dev_cats))))
    nr_class = len(unique_cats)
    print(f"{len(train_data)} training / {len(dev_data)} dev\n{nr_class} classes")

    train_y = numpy.zeros((len(train_cats), nr_class), dtype="f")
    for i, cat in enumerate(train_cats):
        train_y[i][unique_cats.index(cat)] = 1
    dev_y = numpy.zeros((len(dev_cats), nr_class), dtype="f")
    for i, cat in enumerate(dev_cats):
        dev_y[i][unique_cats.index(cat)] = 1

    train_tokenized = tokenize_texts(train_texts)
    dev_tokenized = tokenize_texts(dev_texts)
    # Generate simple vocab mapping, <unk> is 0
    vocab = {}
    count_id = 1
    for text in train_tokenized:
        for token in text:
            if token not in vocab:
                vocab[token] = count_id
                count_id += 1
    # Map texts using vocab
    train_X = []
    for text in train_tokenized:
        train_X.append(numpy.array([vocab.get(t, 0) for t in text]))
    dev_X = []
    for text in dev_tokenized:
        dev_X.append(numpy.array([vocab.get(t, 0) for t in text]))
    return (train_X, train_y), (dev_X, dev_y), vocab, nr_class, train_texts, dev_texts
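
To make the vocabulary mapping and label encoding in `prepare_data` concrete, here is a minimal sketch on toy data. The token lists and category values are made up for illustration, and plain Python lists stand in for the numpy arrays used above.

```python
# Toy, already-tokenized texts (hypothetical data, not taken from DBPedia).
train_tokens = [["spicy", "horse", "is", "a", "studio"], ["a", "horse", "runs"]]
dev_tokens = [["a", "spicy", "dragon"]]

# IDs are assigned in order of first appearance; 0 stays reserved for <unk>.
vocab = {}
for text in train_tokens:
    for token in text:
        if token not in vocab:
            vocab[token] = len(vocab) + 1


def encode(text):
    # Tokens that never occurred in the training data map to 0.
    return [vocab.get(token, 0) for token in text]


train_ids = [encode(text) for text in train_tokens]
dev_ids = [encode(text) for text in dev_tokens]
print(train_ids)  # [[1, 2, 3, 4, 5], [4, 2, 6]]
print(dev_ids)    # [[4, 1, 0]]

# One-hot encode toy category labels the same way prepare_data does.
cats = [3, 1, 3]
unique_cats = sorted(set(cats))
one_hot = [[1.0 if cat == u else 0.0 for u in unique_cats] for cat in cats]
print(one_hot)  # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Because the vocab is built from the training texts only, any token that appears only in the dev set ends up as ID 0.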

In [5]:
def train_model(data, model, optimizer, n_iter, batch_size):
    (train_X, train_Y), (test_X, test_Y) = data
    for i in range(n_iter):
        batches = model.ops.multibatch(batch_size, train_X, train_Y, shuffle=True)
        for X, Y in tqdm(batches, leave=False):
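            # Forward pass: returns the predictions and a backprop callback
            Yh, backprop = model.begin_update(X)
            # Pass in the gradient with respect to the output (predictions minus one-hot targets)
            backprop(Yh - Y)
            model.finish_update(optimizer)
        # Evaluate and print progress
        score = evaluate_model(model, test_X, test_Y, batch_size)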
        print(f" {i} score: {float(score):.3f}")

def evaluate_model(model: Model, test_features: list, test_target: Floats2d, batch_size: int) -> float:
    """Returns the accuracy of the model on a list of ID arrays and their one-hot targets."""
    correct = 0
    total = 0
    for X, Y in model.ops.multibatch(batch_size, test_features, test_target):
        prediction = model.predict(X)
        correct += (prediction.argmax(axis=1) == Y.argmax(axis=1)).sum()
        total += prediction.shape[0]
    return float(correct / total)
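
The training loop passes `Yh - Y` to the backprop callback. One way to read this, assuming a cross-entropy loss on top of a softmax output (the notebook doesn't name the loss, so this is an assumption), is that "predictions minus targets" is the gradient of that loss with respect to the pre-softmax scores. The sketch below checks this numerically in plain Python; the `softmax` and `cross_entropy` helpers and the example values are illustrative and not part of Thinc.

```python
import math


def softmax(z):
    # Subtract the max for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, y):
    # Loss for one example with scores z and one-hot target y.
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))


z = [0.5, -1.0, 2.0]  # arbitrary scores for three classes
y = [0.0, 1.0, 0.0]   # one-hot target

# Analytic gradient: predictions minus targets.
analytic = [p - t for p, t in zip(softmax(z), y)]

# Numerical gradient via central finite differences.
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_hi = list(z)
    z_lo = list(z)
    z_hi[i] += eps
    z_lo[i] -= eps
    numeric.append((cross_entropy(z_hi, y) - cross_entropy(z_lo, y)) / (2 * eps))

max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"largest difference: {max_diff:.2e}")
```

The two gradients should agree up to finite-difference error.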


## Setting up the data

In [6]:

# Instead of `ml_datasets.dbpedia` you can also try `ml_datasets.imdb` for the IMDB review dataset.
raw_data = ml_datasets.dbpedia(limit=1000)
train_data, _ = raw_data
for text, annot in train_data[0:1]:
    print(f"Text: {text}")
    print(f"Category: {annot}")

Text: Spicy Horse
 Spicy Horse (simplified Chinese: 麻辣马; traditional Chinese: 麻辣馬; pinyin: Má là mǎ) is a Shanghai-based independent video game developer started by American McGee Anthony Jacobson and Adam Lang in 2007.
Category: 1


In [7]:
(train_X, train_y), (test_X, test_y), vocab, nr_class, train_texts, test_texts = prepare_data(raw_data)

1000 training / 1000 dev
14 classes


In [8]:
print(f"Type: {type(train_X[0])}")
print(f"First element from training set: {train_X[0]}")
print(f"Category: {train_y[0]}")

Type: <class 'numpy.ndarray'>
First element from training set: [ 1  2  1  2  3  4  5  6  7  8  9  5  6 10  8 11  6 12 13 14 15 16 17 18
 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36]
Category: [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


## Defining the model and config

The model takes a list of 2-dimensional arrays (the tokenized texts mapped to vocab IDs) and outputs a 2d array. The embed layer's `nV` dimension (the number of entries in the lookup table) depends on the vocab and the training data, so the embed layer is passed in as an argument and registered as a **reference**. This makes it easy to retrieve the layer later on by calling `model.get_ref("embed")` and to set its `nV` dimension once the vocab is known.
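
As a rough illustration of why `nV` depends on the vocab, here is a minimal sketch of an embedding lookup table in plain Python. The toy vocab, the width `nO = 4` and the random initialization are assumptions for the example; this is not how Thinc's `Embed` layer is implemented.

```python
import random

vocab = {"spicy": 1, "horse": 2, "studio": 3}  # toy vocab; ID 0 is reserved for <unk>
nV = len(vocab) + 1  # one table row per known token, plus one row for <unk>
nO = 4               # width of each vector (hypothetical value)

random.seed(0)
table = [[random.uniform(-0.1, 0.1) for _ in range(nO)] for _ in range(nV)]

# Embedding a text is a row lookup per token ID.
ids = [1, 2, 0, 3]
vectors = [table[i] for i in ids]

print(len(table), len(table[0]))      # 4 4
print(len(vectors), len(vectors[0]))  # 4 4
```

The largest ID in the vocab is `len(vocab)`, so the table needs `len(vocab) + 1` rows for that ID to be a valid index.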

In [9]:
from typing import List
import thinc
from thinc.api import prefer_gpu, Model, list2ragged, list2array, chain, with_array, reduce_mean, Softmax
from thinc.types import Array2d

prefer_gpu()


False

In [19]:

@thinc.registry.layers("EmbedPoolTextcat.v1")
def EmbedPoolTextcat(embed: Model[Array2d, Array2d]) -> Model[List[Array2d], Array2d]:
    with Model.define_operators({">>": chain}):
        model = (
            with_array(embed) 
            >> list2ragged() 
            >> reduce_mean() 
            >> Softmax()
        )
    model.set_ref("embed", embed)
    return model

# The embed layer maps integers to vectors, using a fixed-size lookup table.
# The input to the layer should be a two-dimensional array of integers;
# one column of that array is used as the indices into the embeddings table.

# list2ragged converts the embed output to ragged format and leaves
# sequences that are already ragged unchanged. A ragged array is
# a tuple (data, lengths), where data is the concatenated data.

# reduce_mean is a pooling layer that reduces each sequence to a single
# vector by computing the average value of each feature.
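
The ragged format and the mean pooling described in the comments above can be sketched in plain Python. The `ragged_mean` helper and the example values are hypothetical and only mirror the idea of a `(data, lengths)` pair; they are not Thinc's implementation.

```python
# Two "embedded" sequences of different lengths, each row a 2-d feature vector.
sequences = [
    [[1.0, 2.0], [3.0, 4.0]],                # 2 tokens
    [[0.0, 0.0], [2.0, 2.0], [4.0, 10.0]],   # 3 tokens
]

# Ragged representation: all rows concatenated, plus the length of each sequence.
data = [row for seq in sequences for row in seq]
lengths = [len(seq) for seq in sequences]


def ragged_mean(data, lengths):
    """Average the rows of each sequence, yielding one vector per sequence."""
    pooled = []
    start = 0
    for n in lengths:
        rows = data[start:start + n]
        pooled.append([sum(col) / n for col in zip(*rows)])
        start += n
    return pooled


pooled = ragged_mean(data, lengths)
print(lengths)  # [2, 3]
print(pooled)   # [[2.0, 3.0], [2.0, 4.0]]
```

Whatever the length of a text, pooling produces one fixed-size vector for it, which is what the final `Softmax` layer receives.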



Alternatively, the model could be defined without operator overloading as:

```python
from thinc.api import Embed

width = 64  # same value as `width` under [hyper_params] in the config
model = chain(
    with_array(Embed(nO=width, nV=len(vocab) + 1)),
    list2ragged(),
    reduce_mean(),
    Softmax(nO=nr_class)
)
```
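
To see why the two definitions are equivalent, here is a toy sketch of how a `>>` operator can be mapped onto a `chain` function. The `Layer` class and this `chain` are hypothetical stand-ins written for the example, not Thinc's `Model` or `chain`. Unlike Thinc, which only binds the operator inside the `Model.define_operators` context, this toy class overloads `>>` permanently.

```python
class Layer:
    """Hypothetical stand-in for a model: wraps a plain function."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __call__(self, x):
        return self.func(x)

    def __rshift__(self, other):
        # `a >> b` is just another way of writing chain(a, b).
        return chain(self, other)


def chain(*layers):
    """Compose layers so that the output of one feeds into the next."""

    def forward(x):
        for layer in layers:
            x = layer(x)
        return x

    return Layer(">>".join(layer.name for layer in layers), forward)


double = Layer("double", lambda x: x * 2)
add_one = Layer("add_one", lambda x: x + 1)

with_operator = double >> add_one
with_function = chain(double, add_one)
print(with_operator(5), with_function(5))  # 11 11
```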

In [11]:
CONFIG = """
[hyper_params]
width = 64

[model]
@layers = "EmbedPoolTextcat.v1"

[model.embed]
@layers = "Embed.v1"
nO = ${hyper_params:width}

[optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001

[training]
batch_size = 8
n_iter = 10
"""
from thinc.api import registry, Config

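# `registry.resolve` is the name in released Thinc 8.x; early 8.0 alphas called it `registry.make_from_config`
C = registry.resolve(Config().from_str(CONFIG))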

batch_size = C["training"]["batch_size"]
optimizer = C["optimizer"]
model = C["model"]
model.get_ref("embed").set_dim("nV", len(vocab) + 1)
model.initialize(X=train_X[:5], Y=train_y[:5])

<thinc.model.Model at 0x7fea30b739e0>
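
The `${hyper_params:width}` entry in the config is a variable reference: `nO` takes its value from `width` in the `[hyper_params]` section. As a rough illustration of that substitution, the sketch below uses Python's standard-library `configparser`, which supports the same `${section:option}` syntax. It only demonstrates the interpolation step. Thinc's own `Config` class does more, such as parsing values into Python types and resolving registered functions, none of which is shown here.

```python
from configparser import ConfigParser, ExtendedInterpolation

# A cut-down version of the config above, without the registry entries.
CONFIG_SNIPPET = """
[hyper_params]
width = 64

[model.embed]
nO = ${hyper_params:width}
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(CONFIG_SNIPPET)

# The reference is replaced by the value of `width` when the option is read.
n_out = parser.getint("model.embed", "nO")
print(n_out)  # 64
```

Keeping the width in one place means the embed layer and any other block that refers to it stay in sync when the value changes.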

---

## Training the model

In [12]:

fix_random_seed(0)
data = (train_X, train_y), (test_X, test_y)
train_model(data, model, optimizer, 20, batch_size)

0 score: 0.307
1 score: 0.429
2 score: 0.537
3 score: 0.621
4 score: 0.750
5 score: 0.809
6 score: 0.857
7 score: 0.885
8 score: 0.890
9 score: 0.902
10 score: 0.905
11 score: 0.905
12 score: 0.904
13 score: 0.915
14 score: 0.917
15 score: 0.918
16 score: 0.916
17 score: 0.920
18 score: 0.921
19 score: 0.922
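
The score printed after each epoch is plain accuracy: the fraction of dev examples whose highest-scoring class matches the one-hot target. Here is a minimal sketch of that computation in plain Python. The predictions and targets are made-up values and the `argmax` helper is written for the example, whereas `evaluate_model` above uses the array method of the same name.

```python
def argmax(row):
    # Index of the largest value in the row.
    return max(range(len(row)), key=lambda i: row[i])


# Made-up class probabilities for four examples and three classes.
predictions = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.20, 0.50, 0.30],
    [0.40, 0.35, 0.25],
]
# One-hot targets, in the same format as train_y and dev_y.
targets = [
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
]

correct = sum(argmax(p) == argmax(t) for p, t in zip(predictions, targets))
accuracy = correct / len(predictions)
print(f"score: {accuracy:.3f}")  # score: 0.750
```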


In [23]:

# Redefining the model without operator overloading:

from thinc.layers import Embed
model = chain(
    with_array(Embed(nO=64, nV=len(vocab) + 1)),
    list2ragged(),
    reduce_mean(),
    Softmax(nO=nr_class)
)
model.initialize(X=train_X[:5], Y=train_y[:5])


<thinc.model.Model at 0x7fea0ebd1320>

In [None]:
train_model(data, model, optimizer, 5, batch_size)