# AllenNLP fundamentals

AllenNLP is an open-source deep learning research library, particularly well known for NLP. This notebook is a tutorial that walks through the fundamental concepts of AllenNLP.

# Tokenization

In [4]:
from allennlp.data.tokenizers import Token, Tokenizer, SpacyTokenizer, WhitespaceTokenizer, CharacterTokenizer

text = "I don't hate notebooks, I just don't like them."

tokenizer = SpacyTokenizer()

tokens = tokenizer.tokenize(text)
tokens

[I, do, n't, hate, notebooks, ,, I, just, do, n't, like, them, .]

# Token Indexers

A Token Indexer turns tokens into indices or lists of indices. We won't be able to see how they operate until slightly later.

In [6]:
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer

In [10]:
token_indexer = SingleIdTokenIndexer()
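
As a rough mental model (a plain-Python sketch, not AllenNLP's implementation), a single-id indexer looks each token up in a token-to-index mapping and falls back to an unknown-token id for anything it has not seen. The names `toy_mapping` and `index_tokens` are hypothetical; the two special entries mirror the `@@PADDING@@` and `@@UNKNOWN@@` entries that AllenNLP vocabularies contain.

```python
from typing import Dict, List

# Hypothetical toy mapping; in AllenNLP this lookup is backed by a Vocabulary.
toy_mapping: Dict[str, int] = {
    "@@PADDING@@": 0,
    "@@UNKNOWN@@": 1,
    "I": 2,
    "like": 3,
    "notebooks": 4,
}


def index_tokens(tokens: List[str], mapping: Dict[str, int]) -> List[int]:
    """Map each token to one integer id, using the unknown id for unseen tokens."""
    unknown_id = mapping["@@UNKNOWN@@"]
    return [mapping.get(token, unknown_id) for token in tokens]


indexed = index_tokens(["I", "like", "spreadsheets"], toy_mapping)
print(indexed)  # [2, 3, 1]
```

The out-of-vocabulary token `spreadsheets` maps to id 1, which is why a model can still process text it never saw during training.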

# Fields

Training examples are represented as `Instances`, each containing typed `Fields`.

In [11]:
from allennlp.data.fields import TextField, LabelField

A `TextField` stores text and also needs one or more `TokenIndexers`, which are used to convert the text into indices.

In [12]:
text_field = TextField(tokens, {'tokens': token_indexer})

In [14]:
text_field._indexed_tokens

In [15]:
label_field = LabelField('technology')

# Instances

Each instance is made up of named fields.

In [18]:
from allennlp.data.instance import Instance
instance = Instance({'text': text_field, 'category': label_field})

# Vocabulary

Based on our instances, we construct a `Vocabulary`, which contains the various mappings: token <-> index, label <-> index, and so on.

In [19]:
from allennlp.data.vocabulary import Vocabulary

vocab = Vocabulary.from_instances([instance])

Here you can see that our vocabulary has two mappings: a `tokens` mapping and a `labels` mapping.

In [20]:
vocab._token_to_index

_TokenToIndexDefaultDict(None,
                         {'tokens': {'@@PADDING@@': 0,
                           '@@UNKNOWN@@': 1,
                           'I': 2,
                           'do': 3,
                           "n't": 4,
                           'hate': 5,
                           'notebooks': 6,
                           ',': 7,
                           'just': 8,
                           'like': 9,
                           'them': 10,
                           '.': 11},
                          'labels': {'technology': 0}})
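
The ordering of the ids above is consistent with assigning indices by descending token frequency, with ties kept in order of first appearance, after reserving 0 for padding and 1 for unknown tokens. The following is a minimal plain-Python sketch under that assumption; `build_token_to_index` is a hypothetical helper, not an AllenNLP function.

```python
from collections import Counter
from typing import Dict, List


def build_token_to_index(tokens: List[str]) -> Dict[str, int]:
    """Assign ids by descending frequency; ties keep first-appearance order."""
    counts = Counter(tokens)  # Counter remembers insertion order
    mapping = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    # sorted() is stable, so equally frequent tokens stay in first-seen order.
    for token, _ in sorted(counts.items(), key=lambda item: -item[1]):
        mapping[token] = len(mapping)
    return mapping


tokens = ["I", "do", "n't", "hate", "notebooks", ",",
          "I", "just", "do", "n't", "like", "them", "."]
token_to_index = build_token_to_index(tokens)
indexed = [token_to_index[token] for token in tokens]
print(indexed)  # [2, 3, 4, 5, 6, 7, 2, 8, 3, 4, 9, 10, 11]
```

Repeated tokens such as `I`, `do` and `n't` reuse the same id, so the indexed sequence has 13 entries but only 10 distinct token ids.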

In [21]:
text_field._indexed_tokens


In [22]:
label_field._label_id


Although we have constructed the mappings, we haven't yet used them to index the fields in our instance. We have to do that manually (although when you use the AllenNLP trainer, all of this is taken care of for you).

In [23]:
instance.index_fields(vocab)

In [24]:
text_field._indexed_tokens


{'tokens': {'tokens': [2, 3, 4, 5, 6, 7, 2, 8, 3, 4, 9, 10, 11]}}

In [25]:
label_field._label_id

0

In [26]:
instance.as_tensor_dict()

{'text': {'tokens': {'tokens': tensor([ 2,  3,  4,  5,  6,  7,  2,  8,  3,  4,  9, 10, 11])}},
 'category': tensor(0)}

In [27]:
instance.get_padding_lengths()


{'text': {'tokens___tokens': 13}, 'category': {}}

# Batching and padding

When you're doing NLP, you have sequences of different lengths, which means that padding and masking are very important. They're tricky to get right! Luckily, AllenNLP handles most of the details for you.

In [28]:
text1 = "I just don't like notebooks."
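# Note: this tokenizes `text` (the 13-token sentence from the Tokenization section), not `text1`,
# which is why the first row of the batch output has 13 token ids.
tokens1 = tokenizer.tokenize(text)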
text_field1 = TextField(tokens1, {"tokens": token_indexer})
label_field1 = LabelField("Joel")
instance1 = Instance({"text": text_field1, "speaker": label_field1})
text2 = "I do like notebooks."
tokens2 = tokenizer.tokenize(text2)
text_field2 = TextField(tokens2, {"tokens": token_indexer})
label_field2 = LabelField("Tim")
instance2 = Instance({"text": text_field2, "speaker": label_field2})

In [30]:
from allennlp.data.batch import Batch
vocab = Vocabulary.from_instances([instance1, instance2])

batch = Batch([instance1, instance2])
batch.index_instances(vocab)

In [31]:
batch.as_tensor_dict()

{'text': {'tokens': {'tokens': tensor([[ 2,  3,  4,  8,  5,  9,  2, 10,  3,  4,  6, 11,  7],
           [ 2,  3,  6,  5,  7,  0,  0,  0,  0,  0,  0,  0,  0]])}},
 'speaker': tensor([0, 1])}
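
To make the padding in the output above concrete, here is a minimal plain-Python sketch of right-padding a batch to its longest sequence and building the matching mask. `pad_batch` is a hypothetical helper written for illustration; AllenNLP does this internally on tensors.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], padding_id: int = 0
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Right-pad every sequence to the longest one and return a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [padding_id] * (max_len - len(seq)) for seq in sequences]
    # True marks a real token, False marks padding.
    mask = [[i < len(seq) for i in range(max_len)] for seq in sequences]
    return padded, mask


batch_ids = [
    [2, 3, 4, 8, 5, 9, 2, 10, 3, 4, 6, 11, 7],
    [2, 3, 6, 5, 7],
]
padded, mask = pad_batch(batch_ids)
print(padded[1])     # [2, 3, 6, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0]
print(sum(mask[1]))  # 5 real tokens in the second sequence
```

The mask is what lets later layers ignore the padded positions instead of treating id 0 as a real token.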

# Using Multiple Indexers

In some circumstances, you might want to use multiple indexers. For instance, you might want to index a token using the token_id, but also a sequence of character_ids. This is as simple as adding extra token indexers to our text fields.

In [32]:
from allennlp.data.token_indexers import TokenCharactersIndexer

In [33]:
token_character_indexer = TokenCharactersIndexer(min_padding_length=3)

text_field = TextField(tokens, {'tokens': token_indexer, 'token_characters': token_character_indexer})
label_field = LabelField('technology')

In [34]:
instance = Instance({'text': text_field, 'label': label_field})

In [35]:
vocab = Vocabulary.from_instances([instance])

In [37]:
instance.index_fields(vocab)

In [38]:
instance.as_tensor_dict()

{'text': {'tokens': {'tokens': tensor([ 2,  3,  4,  5,  6,  7,  2,  8,  3,  4,  9, 10, 11])},
  'token_characters': {'token_characters': tensor([[ 6,  0,  0,  0,  0,  0,  0,  0,  0],
           [ 7,  3,  0,  0,  0,  0,  0,  0,  0],
           [ 5,  8,  2,  0,  0,  0,  0,  0,  0],
           [ 9, 12,  2,  4,  0,  0,  0,  0,  0],
           [ 5,  3,  2,  4, 13,  3,  3, 10, 11],
           [14,  0,  0,  0,  0,  0,  0,  0,  0],
           [ 6,  0,  0,  0,  0,  0,  0,  0,  0],
           [15, 16, 11,  2,  0,  0,  0,  0,  0],
           [ 7,  3,  0,  0,  0,  0,  0,  0,  0],
           [ 5,  8,  2,  0,  0,  0,  0,  0,  0],
           [17, 18, 10,  4,  0,  0,  0,  0,  0],
           [ 2,  9,  4, 19,  0,  0,  0,  0,  0],
           [20,  0,  0,  0,  0,  0,  0,  0,  0]])}},
 'label': tensor(0)}
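
The `token_characters` tensor above has one row per token and one column per character position, padded to the longest token (`notebooks`, 9 characters). The sketch below reproduces that layout in plain Python, assuming character ids are assigned by descending frequency with ties in first-appearance order, which is consistent with the output shown. `build_char_to_index` and `index_characters` are hypothetical helpers, not AllenNLP functions.

```python
from collections import Counter
from typing import Dict, List


def build_char_to_index(tokens: List[str]) -> Dict[str, int]:
    """Assign character ids by descending frequency (assumed ordering)."""
    counts = Counter(ch for token in tokens for ch in token)
    mapping = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    for ch, _ in sorted(counts.items(), key=lambda item: -item[1]):
        mapping[ch] = len(mapping)
    return mapping


def index_characters(
    tokens: List[str], mapping: Dict[str, int], min_padding_length: int = 3
) -> List[List[int]]:
    """Turn each token into a row of character ids, padded to a common width."""
    width = max(min_padding_length, max(len(token) for token in tokens))
    rows = []
    for token in tokens:
        ids = [mapping.get(ch, mapping["@@UNKNOWN@@"]) for ch in token]
        rows.append(ids + [0] * (width - len(ids)))
    return rows


tokens = ["I", "do", "n't", "hate", "notebooks", ",",
          "I", "just", "do", "n't", "like", "them", "."]
rows = index_characters(tokens, build_char_to_index(tokens))
print(rows[4])  # [5, 3, 2, 4, 13, 3, 3, 10, 11]  ("notebooks")
print(rows[0])  # [6, 0, 0, 0, 0, 0, 0, 0, 0]     ("I")
```

Here `min_padding_length` only sets a lower bound on the row width; it matters when every token is shorter than the kernel size of a character-level CNN applied afterwards.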

# Token Embedders

Once we have our text represented as ids, we use token embedders to create tensor embeddings.

In [39]:
text1 = "I just don't like notebooks."
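# Note: as in the batching section, this tokenizes `text` (13 tokens), not `text1`,
# so the first sequence in the batch has 13 token ids.
tokens1 = tokenizer.tokenize(text)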
text_field1 = TextField(tokens1, {"tokens": token_indexer})
label_field1 = LabelField("Joel")
instance1 = Instance({"text": text_field1, "speaker": label_field1})
text2 = "I do like notebooks."
tokens2 = tokenizer.tokenize(text2)
text_field2 = TextField(tokens2, {"tokens": token_indexer})
label_field2 = LabelField("Tim")
instance2 = Instance({"text": text_field2, "speaker": label_field2})
vocab = Vocabulary.from_instances([instance1, instance2])
batch = Batch([instance1, instance2])
batch.index_instances(vocab)

In [40]:
tensor_dict = batch.as_tensor_dict()
tensor_dict

{'text': {'tokens': {'tokens': tensor([[ 2,  3,  4,  8,  5,  9,  2, 10,  3,  4,  6, 11,  7],
           [ 2,  3,  6,  5,  7,  0,  0,  0,  0,  0,  0,  0,  0]])}},
 'speaker': tensor([0, 1])}

In [41]:
from allennlp.modules.token_embedders import Embedding


In [45]:
embedding = Embedding(num_embeddings=vocab.get_vocab_size("tokens"), embedding_dim=5)


Accordingly, we can apply these embeddings to the indexed tokens.

In [48]:
embedding(tensor_dict['text']['tokens']['tokens'])

tensor([[[ 0.3082, -0.1054, -0.4880,  0.4342, -0.3927],
         [-0.0926,  0.3000,  0.3102, -0.1149, -0.2023],
         [-0.5468, -0.3153,  0.2503,  0.0751, -0.3173],
         [-0.2189,  0.3179,  0.3703,  0.5374,  0.2767],
         [-0.5462, -0.2733, -0.2118, -0.5640, -0.1482],
         [-0.1358,  0.0344, -0.1710, -0.3339,  0.4811],
         [ 0.3082, -0.1054, -0.4880,  0.4342, -0.3927],
         [ 0.4478,  0.3009,  0.1941,  0.5378, -0.1945],
         [-0.0926,  0.3000,  0.3102, -0.1149, -0.2023],
         [-0.5468, -0.3153,  0.2503,  0.0751, -0.3173],
         [-0.4426, -0.3893, -0.0188,  0.1502,  0.3023],
         [-0.0896,  0.4304,  0.3974,  0.4833, -0.1077],
         [ 0.1111, -0.4240, -0.3640,  0.5719, -0.2200]],

        [[ 0.3082, -0.1054, -0.4880,  0.4342, -0.3927],
         [-0.0926,  0.3000,  0.3102, -0.1149, -0.2023],
         [-0.4426, -0.3893, -0.0188,  0.1502,  0.3023],
         [-0.5462, -0.2733, -0.2118, -0.5640, -0.1482],
         [ 0.1111, -0.4240, -0.3640,  0.5719, 

# Text Field Embedders

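A `TextField` may have multiple indexed representations of its tokens, in which case it needs a corresponding `TokenEmbedder` for each one. A `TextFieldEmbedder` bundles these token embedders together and combines their outputs into a single tensor.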
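
Conceptually, when a text field has several indexed representations, each one is embedded separately and the resulting vectors are concatenated per token. Here is a minimal plain-Python sketch of that combination step; `combine_embeddings` is a hypothetical helper, and the sorted key order is an assumption made so that the layout is reproducible.

```python
from typing import Dict, List


def combine_embeddings(per_indexer: Dict[str, List[List[float]]]) -> List[List[float]]:
    """Concatenate, token by token, the vectors produced for each indexer key."""
    keys = sorted(per_indexer)  # fixed order so the layout is reproducible
    num_tokens = len(per_indexer[keys[0]])
    combined = []
    for position in range(num_tokens):
        vector: List[float] = []
        for key in keys:
            vector.extend(per_indexer[key][position])
        combined.append(vector)
    return combined


embedded = {
    "tokens": [[0.1, 0.2], [0.3, 0.4]],   # 2 tokens, embedding dim 2
    "token_characters": [[1.0], [2.0]],   # 2 tokens, embedding dim 1
}
combined = combine_embeddings(embedded)
print(combined)  # [[1.0, 0.1, 0.2], [2.0, 0.3, 0.4]]
```

The output dimension is the sum of the individual embedding dimensions (here 1 + 2 = 3). With a single `tokens` embedder the combined output is simply that embedder's output.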

In [49]:
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder


In [50]:
text_field_embedder = BasicTextFieldEmbedder({"tokens": embedding})


In [56]:
text_field_embedder(tensor_dict['text'])

tensor([[[ 0.3082, -0.1054, -0.4880,  0.4342, -0.3927],
         [-0.0926,  0.3000,  0.3102, -0.1149, -0.2023],
         [-0.5468, -0.3153,  0.2503,  0.0751, -0.3173],
         [-0.2189,  0.3179,  0.3703,  0.5374,  0.2767],
         [-0.5462, -0.2733, -0.2118, -0.5640, -0.1482],
         [-0.1358,  0.0344, -0.1710, -0.3339,  0.4811],
         [ 0.3082, -0.1054, -0.4880,  0.4342, -0.3927],
         [ 0.4478,  0.3009,  0.1941,  0.5378, -0.1945],
         [-0.0926,  0.3000,  0.3102, -0.1149, -0.2023],
         [-0.5468, -0.3153,  0.2503,  0.0751, -0.3173],
         [-0.4426, -0.3893, -0.0188,  0.1502,  0.3023],
         [-0.0896,  0.4304,  0.3974,  0.4833, -0.1077],
         [ 0.1111, -0.4240, -0.3640,  0.5719, -0.2200]],

        [[ 0.3082, -0.1054, -0.4880,  0.4342, -0.3927],
         [-0.0926,  0.3000,  0.3102, -0.1149, -0.2023],
         [-0.4426, -0.3893, -0.0188,  0.1502,  0.3023],
         [-0.5462, -0.2733, -0.2118, -0.5640, -0.1482],
         [ 0.1111, -0.4240, -0.3640,  0.5719, 

# Seq2VecEncoders

At this point, we've ended up with a sequence of tensors. Frequently, we'll want to collapse that sequence into a single contextualized tensor representation, which we do with `Seq2VecEncoders`.

In [57]:
from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder


In [58]:
encoder = BagOfEmbeddingsEncoder(embedding_dim=text_field_embedder.get_output_dim())


In [61]:
encoder(text_field_embedder(tensor_dict['text']))


tensor([[-1.5365, -0.2444,  0.3412,  2.1716, -1.4349],
        [-1.3771, -3.1807,  2.9189, -1.4364, -2.0097]], grad_fn=<SumBackward1>)
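
The `grad_fn=<SumBackward1>` in the output suggests that this encoder sums the embeddings along the sequence dimension. The sketch below shows that pooling in plain Python, with an optional mask to skip padded positions. `bag_of_embeddings` is a hypothetical helper; note that the cell above calls the encoder without a mask, so padded positions are presumably part of its sum.

```python
from typing import List


def bag_of_embeddings(vectors: List[List[float]], mask: List[bool]) -> List[float]:
    """Sum the vectors at unmasked positions into one fixed-size vector."""
    dim = len(vectors[0])
    total = [0.0] * dim
    for vector, keep in zip(vectors, mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
    return total


sequence = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last position is padding
mask = [True, True, False]
pooled = bag_of_embeddings(sequence, mask)
print(pooled)  # [4.0, 6.0]
```

Whatever the sequence length, the result has the same size as a single embedding, which is what lets a fixed-size linear layer follow it.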

# Model

In [65]:
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder
from typing import Dict
import torch

In [69]:
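class MyModel(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder,
                 output_dim: int) -> None:
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        # The linear layer consumes the encoder's output, so size it from the encoder.
        self.linear = torch.nn.Linear(in_features=encoder.get_output_dim(), out_features=output_dim)

    def forward(self, text: Dict[str, Dict[str, torch.Tensor]], speaker: torch.Tensor) -> Dict[str, torch.Tensor]:
        # `speaker` is accepted so that `model(**tensor_dict)` works; it is not used here.
        embedded = self.embedder(text)   # (batch, sequence, embedding_dim)
        encoded = self.encoder(embedded)  # (batch, encoder_output_dim)
        output = self.linear(encoded)    # (batch, output_dim)

        return {'output': output}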

In [70]:
model = MyModel(vocab, text_field_embedder, encoder, 3)

In [71]:
model(**tensor_dict)

{'output': tensor([[-0.3698, -1.5098, -0.8451],
         [-0.2740, -1.9789,  0.0265]], grad_fn=<AddmmBackward>)}
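
The model returns one raw score per output class for each instance in the batch. As a minimal plain-Python sketch (not part of the model above), a softmax turns those scores into probabilities and the largest one gives the predicted class index. The weights here are untrained, so the predictions carry no meaning; the numbers are copied from the output above.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]


logits = [[-0.3698, -1.5098, -0.8451],
          [-0.2740, -1.9789, 0.0265]]
probs = [softmax(row) for row in logits]
predictions = [row.index(max(row)) for row in probs]
print(predictions)  # [0, 2]
```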