<a href="https://colab.research.google.com/github/paruliansaragi/DL-Notebooks/blob/master/FakeNewsAllenNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!pip install allennlp


In [0]:
%load_ext autoreload
%autoreload 2

##AllenNLP

DatasetReader: Extracts necessary information from data into a list of Instance objects

Model: The model to be trained (with some caveats!)

Iterator: Batches the data

Trainer: Handles training and metric recording

(Predictor: Generates predictions from raw strings)

In [0]:
from pathlib import Path
from typing import *
import torch
import torch.optim as optim
import numpy as np
import pandas as pd
from functools import partial
from overrides import overrides

from allennlp.data import Instance
from allennlp.data.token_indexers import TokenIndexer
from allennlp.data.tokenizers import Token
from allennlp.nn import util as nn_util



In [0]:
class Config(dict):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for k, v in kwargs.items():
            setattr(self, k, v)
    
    def set(self, key, val):
        self[key] = val
        setattr(self, key, val)
        
config = Config(
    testing=True,
    seed=1,
    batch_size=64,
    lr=3e-4,
    epochs=20,
    hidden_sz=64,
    max_seq_len=100, # necessary to limit memory usage
    max_vocab_size=100000,
)

In [0]:
USE_GPU = torch.cuda.is_available()

In [0]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!ls ~/.kaggle

kaggle.json


In [0]:
!ls -l ~/.kaggle

total 4
-rw------- 1 root root 70 Apr  5 11:49 kaggle.json


In [0]:
!kaggle competitions download -c fake-news

Downloading train.csv to /content
 95% 89.0M/94.1M [00:01<00:00, 43.1MB/s]
100% 94.1M/94.1M [00:01<00:00, 62.9MB/s]
Downloading test.csv to /content
 71% 17.0M/24.0M [00:00<00:00, 19.4MB/s]
100% 24.0M/24.0M [00:00<00:00, 40.8MB/s]


In [0]:
torch.manual_seed(config.seed)

<torch._C.Generator at 0x7f8b56a9d4f0>

In [0]:
DATA_ROOT = './'

The DatasetReader is responsible for the following:

Reading the data from disk

Extracting relevant information from the data

Converting the data into a list of Instances (we’ll discuss Instances in a second)


You may be surprised to hear that there is no Dataset class in AllenNLP, unlike traditional PyTorch. DatasetReaders are different from Datasets in that they are not a collection of data themselves: they are a schema for converting data on disk into lists of instances.
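To make the "schema, not collection" idea concrete, here is a minimal library-free sketch. `ToyReader` and `ToyInstance` are hypothetical names used only for illustration; they are not AllenNLP classes, and the real `DatasetReader` does considerably more.

```python
from typing import Dict, Iterable, Iterator, List, Optional

# Hypothetical stand-in for an AllenNLP Instance: a plain dict of named fields.
ToyInstance = Dict[str, object]


class ToyReader:
    """Holds no data itself; it only describes how a raw row becomes an instance."""

    def __init__(self, max_seq_len: int = 100) -> None:
        self.max_seq_len = max_seq_len

    def text_to_instance(self, text: str, label: Optional[int] = None) -> ToyInstance:
        # Tokenize on whitespace and truncate to the configured length.
        tokens: List[str] = text.split()[: self.max_seq_len]
        return {"tokens": tokens, "label": label}

    def read(self, rows: Iterable[Dict[str, object]]) -> Iterator[ToyInstance]:
        # The data lives outside the reader; the reader only converts it.
        for row in rows:
            yield self.text_to_instance(str(row["text"]), row.get("label"))


rows = [{"text": "breaking news story", "label": 1}, {"text": "weather report"}]
reader_sketch = ToyReader(max_seq_len=2)
instances = list(reader_sketch.read(rows))
print(instances)
```

The same reader object can be pointed at the training rows and the test rows, which is how the notebook uses a single reader for both DataFrames.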

In [0]:
from allennlp.data.vocabulary import Vocabulary
from allennlp.data.dataset_readers import DatasetReader

In [0]:
label_cols = ['label']

In [0]:
from allennlp.data.fields import TextField, MetadataField, ArrayField
from allennlp.data.token_indexers import SingleIdTokenIndexer

class FNDDatasetReader(DatasetReader):
    def __init__(self, tokenizer: Callable[[str], List[str]]=lambda x: x.split(),
                 token_indexers: Dict[str, TokenIndexer] = None,
                 max_seq_len: Optional[int]=config.max_seq_len) -> None:
        super().__init__(lazy=False)
        self.tokenizer = tokenizer
        # fall back to a word-level indexer when none is supplied
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
        self.max_seq_len = max_seq_len
    
    # text_to_instance packs the data for a single example into an Instance.
    # The name is slightly misleading: it handles not only text but also labels,
    # metadata, and anything else that your model will need later on.
    @overrides
    def text_to_instance(self, tokens: List[Token], id: str=None,
                         labels: np.ndarray=None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"tokens": sentence_field}
        
        id_field = MetadataField(id)
        fields["id"] = id_field
        
        if labels is None:
            labels = np.zeros(len(label_cols))
        label_field = ArrayField(array=labels)
        fields["label"] = label_field

        return Instance(fields)
    
    # _read iterates over the raw data (here, a DataFrame that is already in memory)
    # and yields one Instance per row, passing the tokens, the id and, when present,
    # the labels of each example to text_to_instance.
    #
    # All you need to know about Instances in practice is that they are instantiated
    # with a dictionary mapping field names to Fields, which are our next topic.
    @overrides
    def _read(self, df) -> Iterator[Instance]:
        if config.testing:
            df = df.head(1000)  # keep runs short while testing
        has_labels = 'label' in df  # the test set has no label column
        for i, row in df.iterrows():
            tokens = [Token(x) for x in self.tokenizer(row["comment_text"])]
            if has_labels:
                yield self.text_to_instance(tokens, row["id"], row[label_cols].values)
            else:
                yield self.text_to_instance(tokens, row["id"])

###Field

Field objects in AllenNLP correspond to inputs to a model or fields in a batch that is fed into a model, depending on how you look at it. For each Field, the model will receive a single input (you can take a look at the forward method in the BaselineModel class in the example code to confirm). Each field handles converting the data into tensors, so if you need to do some fancy processing on your data when converting it into tensor form, you should probably write your own custom Field class.

Types of Field:

###TextField

It converts a sequence of tokens into integers. Be careful here though, since this is all the TextField does. It doesn’t clean the text, tokenize the text, etc. You’ll need to do that yourself.

The TextField takes an additional argument on init: the token indexer. Though the TextField handles converting tokens to integers, you need to tell it how to do this. Why? Because you might want to use a character level model instead of a word-level model or do some even funkier splitting of tokens (like splitting on morphemes). Instead of specifying these attributes in the TextField, AllenNLP has you pass a separate object that handles these decisions instead. This is the principle of composition, and you’ll see how this makes modifying your code easy later.

For now, we’ll use a simple word-level model so we use the standard SingleIdTokenIndexer.

DatasetReaders read data from disk and return a list of Instances. Instances are composed of Fields which specify both the data in the instance and how to process it.
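The composition idea can be sketched without AllenNLP. In the toy code below, `ToyTextField`, `word_indexer`, `char_indexer` and both vocabularies are hypothetical and exist only for illustration. The point is that the field never decides how tokens become integers; it delegates that to whichever indexer it is given.

```python
from typing import Callable, Dict, List

# Hypothetical vocabularies; index 0 is reserved for padding, 1 for unknowns.
WORD_VOCAB: Dict[str, int] = {"fake": 2, "news": 3}
CHAR_VOCAB: Dict[str, int] = {c: i + 2 for i, c in enumerate("aefknsw")}


def word_indexer(tokens: List[str]) -> List[int]:
    """One id per token (the idea behind a single-id indexer)."""
    return [WORD_VOCAB.get(t, 1) for t in tokens]


def char_indexer(tokens: List[str]) -> List[List[int]]:
    """One list of character ids per token."""
    return [[CHAR_VOCAB.get(c, 1) for c in t] for t in tokens]


class ToyTextField:
    """Stores tokens and delegates the token-to-integer decision to the indexer."""

    def __init__(self, tokens: List[str], indexer: Callable) -> None:
        self.tokens = tokens
        self.indexer = indexer

    def index(self):
        return self.indexer(self.tokens)


tokens = ["fake", "news", "today"]
word_ids = ToyTextField(tokens, word_indexer).index()
char_ids = ToyTextField(tokens, char_indexer).index()
print(word_ids)
print(char_ids)
```

Swapping the indexer changes the output from one id per token to one id list per token, while the field and the code that builds it stay the same.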

In [0]:
from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter
from allennlp.data.token_indexers import SingleIdTokenIndexer

# the token indexer is responsible for mapping tokens to integers
token_indexer = SingleIdTokenIndexer()

# build the splitter once instead of on every call
word_splitter = SpacyWordSplitter(language='en_core_web_sm', pos_tags=False)

def tokenizer(x: str):
    # truncate to max_seq_len tokens to limit memory usage
    return [w.text for w in word_splitter.split_words(x)[:config.max_seq_len]]

In [0]:
reader = FNDDatasetReader(
    tokenizer=tokenizer,
    token_indexers={"tokens": token_indexer}
)

In [0]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

test = test.drop(['title', 'author'], axis=1)
train = train.drop(['title', 'author'], axis=1)

In [0]:
test.head()

Unnamed: 0,id,text
0,20800,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...
2,20802,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"If at first you don’t succeed, try a different..."
4,20804,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


In [0]:
train = train.rename(columns={'text':'comment_text'})
test = test.rename(columns={'text':'comment_text'})

In [0]:
train.head()

Unnamed: 0,id,comment_text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,Ever get the feeling your life circles the rou...,0
2,2,"Why the Truth Might Get You Fired October 29, ...",1
3,3,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Print \nAn Iranian woman has been sentenced to...,1


In [0]:
test = test.dropna()
train = train.dropna()

In [0]:
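# index=False keeps the DataFrame index from being written as an extra column
train.to_csv('train1.csv', index=False)
test.to_csv('test1.csv', index=False)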

In [0]:
train = pd.read_csv('train1.csv')

In [0]:
test = pd.read_csv('test1.csv')

In [0]:
test.head()

Unnamed: 0,comment_text
0,"PALO ALTO, Calif. — After years of scorning..."
1,Russian warships ready to strike terrorists ne...
2,Videos #NoDAPL: Native American Leaders Vow to...
3,"If at first you don’t succeed, try a different..."
4,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


In [0]:
train_ds = reader.read(train)

In [0]:
test_ds = reader.read(test)
val_ds = None

In [0]:
train_ds[:5]

[<allennlp.data.instance.Instance at 0x7f8b045cb908>,
 <allennlp.data.instance.Instance at 0x7f8b041d9860>,
 <allennlp.data.instance.Instance at 0x7f8b0457a048>,
 <allennlp.data.instance.Instance at 0x7f8b045c9898>,
 <allennlp.data.instance.Instance at 0x7f8b040b8748>]

In [0]:
len(train_ds)

1000

In [0]:
len(test_ds)

1000

In [0]:
vars(train_ds[0].fields["tokens"])

Wait, aren’t the fields supposed to convert my data into tensors?

This is one of the gotchas of text processing for deep learning: you can only convert fields into tensors after you know what the vocabulary is. To build the vocabulary, you need to pass through all the text. To build a vocabulary over the training examples, just run the following code:

In [0]:
vocab = Vocabulary.from_instances(train_ds, max_vocab_size=config.max_vocab_size)

100%|██████████| 1000/1000 [00:00<00:00, 10552.80it/s]
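What building a vocabulary amounts to can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not AllenNLP's implementation: `build_vocab` and `to_ids` are hypothetical helpers, and the layout (padding at index 0, unknown token at index 1, then the most frequent tokens) is assumed for the sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List

PAD, UNK = "@@PADDING@@", "@@UNKNOWN@@"


def build_vocab(token_lists: Iterable[List[str]], max_vocab_size: int) -> Dict[str, int]:
    """One pass over all the text: count tokens, keep the most frequent ones."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {PAD: 0, UNK: 1}  # two reserved entries
    for tok, _ in counts.most_common(max_vocab_size):
        vocab[tok] = len(vocab)
    return vocab


def to_ids(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Tokens outside the vocabulary fall back to the unknown id."""
    return [vocab.get(t, vocab[UNK]) for t in tokens]


corpus = [["the", "news", "is", "fake"], ["the", "news"], ["the"]]
vocab_sketch = build_vocab(corpus, max_vocab_size=2)
ids = to_ids(["the", "fake", "news"], vocab_sketch)
print(vocab_sketch, ids)
```

With two reserved entries, the mapping holds at most `max_vocab_size + 2` ids, which is presumably why the embedding later in this notebook is sized that way.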

Where do we tell the fields to use this vocabulary? This is not immediately intuitive, but the answer is the Iterator – which nicely leads us to our next topic: DataIterators.

Neural networks in PyTorch are trained on mini batches of tensors, not lists of data. Therefore, datasets need to be batched and converted to tensors.

This seems trivial at first glance, but there is a lot of subtlety here. To list just a few things we have to consider:

- Sequences of different lengths need to be padded
- To minimize padding, sequences of similar lengths can be put in the same batch
- Tensors need to be sent to the GPU if using the GPU
- Data needs to be shuffled at the end of each epoch during training, but we don’t want to shuffle in the midst of an epoch in order to cover all examples evenly

Thankfully, AllenNLP has several convenient iterators that will take care of all of these problems behind the scenes. Therefore, you will rarely have to implement your own Iterators from scratch (unless you are doing something really tricky during batching).
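The padding step from the list above can be sketched as follows. `pad_batch` is a hypothetical helper written for illustration. It pads every sequence in a batch to the length of the longest one and returns a mask marking which positions hold real tokens, which is the same information the model later recovers from the padded ids.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad each sequence to the batch maximum and build a 0/1 mask."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask


padded, mask = pad_batch([[5, 8, 2], [7], [4, 9]])
print(padded)
print(mask)
```

The shorter a batch's longest sequence, the less padding is wasted, which is the motivation for grouping sequences of similar length.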

In [0]:
from allennlp.data.iterators import BucketIterator

iterator = BucketIterator(batch_size=config.batch_size, 
                          sorting_keys=[("tokens", "num_tokens")],
                         )

In [0]:
iterator.index_with(vocab)

In [0]:
batch = next(iter(iterator(train_ds)))
batch

The BucketIterator batches sequences of similar lengths together to minimize padding. To prevent the batches from becoming deterministic, a small amount of noise is added to the lengths. The sorting_keys keyword argument tells the iterator which field to reference when determining the text length of each instance. Remember, Iterators are responsible for numericalizing the text fields. We pass the vocabulary we built earlier so that the Iterator knows how to map the words to integers.
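The bucketing idea can be sketched in plain Python. This is a simplified illustration rather than the BucketIterator's actual code: `bucket_batches` is a hypothetical helper, and the 10% relative noise is an assumed value, not one taken from this notebook.

```python
import random
from typing import List


def bucket_batches(lengths: List[int], batch_size: int, noise: float = 0.1,
                   seed: int = 0) -> List[List[int]]:
    """Return batches of example indices grouped by (noisy) sequence length."""
    rng = random.Random(seed)
    # Perturb each length slightly so the sort order is not identical every time.
    noisy = [(n + rng.uniform(-n * noise, n * noise), i) for i, n in enumerate(lengths)]
    order = [i for _, i in sorted(noisy)]
    # Cut the sorted order into consecutive batches of similar length.
    batches = [order[k:k + batch_size] for k in range(0, len(order), batch_size)]
    rng.shuffle(batches)  # shuffle whole batches, not individual examples
    return batches


lengths = [12, 95, 14, 100, 11, 90]
batches = bucket_batches(lengths, batch_size=3)
print(batches)
```

In this example the short texts (indices 0, 2, 4) and the long texts (indices 1, 3, 5) end up in separate batches, so the short batch needs almost no padding.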

Important Tip: Don’t forget to run iterator.index_with(vocab)!

You may have noticed that the iterator does not take datasets as an argument. This is an important distinction between general iterators in PyTorch and iterators in AllenNLP. Whereas iterators are direct sources of batches in PyTorch, in AllenNLP, iterators are a schema for how to convert lists of Instances into mini batches of tensors. Therefore, you can’t directly iterate over a DataIterator in AllenNLP!

In [0]:
batch["tokens"]["tokens"]

tensor([[   22,  8584,  8585,  ...,  8592,   385,  5129],
        [  452,    12,   425,  ...,    14,     7,   176],
        [  154,    12,    33,  ...,    81,     7,   329],
        ...,
        [ 1791,    24,  1539,  ...,    37,   153,    11],
        [12755, 12756,   782,  ..., 12808, 12809, 12810],
        [  256,   106,   107,  ...,  2743,  2744,  3094]])

In [0]:
batch["tokens"]["tokens"].shape

torch.Size([64, 100])

###Model

AllenNLP models are mostly just simple PyTorch models. The key difference is that AllenNLP models are required to return a dictionary for every forward pass and compute the loss function within the forward method during training.

In [0]:
import torch
import torch.nn as nn
import torch.optim as optim

from allennlp.modules.seq2vec_encoders import Seq2VecEncoder, PytorchSeq2VecWrapper
from allennlp.nn.util import get_text_field_mask
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder

This may seem a bit unusual, but this restriction allows you to use all sorts of creative methods of computing the loss while taking advantage of the AllenNLP Trainer (which we will get to later). For instance, you can apply masks to your loss function, weight the losses of different classes adaptively, etc.

One amazing aspect of AllenNLP is that it has a whole host of convenient tools for constructing models for NLP. To utilize these components fully, AllenNLP models are generally composed from the following components:

A token embedder

An encoder

(For seq-to-seq models) A decoder

Therefore, at a high level our model can be written very simply as

In [0]:
class BaselineModel(Model):
    def __init__(self, word_embeddings: TextFieldEmbedder,
                 encoder: Seq2VecEncoder,
                 out_sz: int=len(label_cols)):
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.projection = nn.Linear(self.encoder.get_output_dim(), out_sz)
        self.loss = nn.BCEWithLogitsLoss()
        
    def forward(self, tokens: Dict[str, torch.Tensor],
                id: Any, label: torch.Tensor) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(tokens)  # 1 for real tokens, 0 for padding
        embeddings = self.word_embeddings(tokens)
        state = self.encoder(embeddings, mask)
        class_logits = self.projection(state)
        
        # AllenNLP expects a dict; the Trainer reads the "loss" entry
        output = {"class_logits": class_logits}
        output["loss"] = self.loss(class_logits, label)

        return output

###The Embedder

The embedder maps a sequence of token ids (or character ids) into a sequence of tensors.

You’ll notice that there are two classes here for handling embeddings: the Embedding class and the BasicTextFieldEmbedder class. This is slightly clumsy but is necessary to map the fields of a batch to the appropriate embedding mechanism.

In [0]:
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder

token_embedding = Embedding(num_embeddings=config.max_vocab_size + 2,
                            embedding_dim=300, padding_index=0)
# the embedder maps the input tokens to the appropriate embedding matrix
word_embeddings: TextFieldEmbedder = BasicTextFieldEmbedder({"tokens": token_embedding})

###The Encoder

To classify each sentence, we need to convert the sequence of embeddings into a single vector. In AllenNLP, the model that handles this is referred to as a Seq2VecEncoder: a mapping from sequences to a single vector.

Though AllenNLP provides many Seq2VecEncoders out of the box, for this example we’ll use a simple bidirectional LSTM. Don’t remember the semantics of LSTMs in PyTorch? Don’t worry: AllenNLP has you covered. AllenNLP provides a handy wrapper called the PytorchSeq2VecWrapper that wraps the LSTM so that it takes a sequence as input and returns the final hidden state, converting it into a Seq2VecEncoder.
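To illustrate what "sequence in, single vector out" means, here is a deliberately simpler stand-in for the LSTM: masked mean pooling, which averages the embeddings of the real tokens and ignores padding. It is not what the notebook trains. `masked_mean` is a hypothetical helper that only shows the Seq2Vec contract and the role of the mask.

```python
from typing import List


def masked_mean(embeddings: List[List[float]], mask: List[int]) -> List[float]:
    """Average only the unmasked (real) token vectors of one sequence."""
    dim = len(embeddings[0])
    n_real = max(sum(mask), 1)  # avoid division by zero for an all-padding row
    return [sum(vec[d] * m for vec, m in zip(embeddings, mask)) / n_real
            for d in range(dim)]


# Three positions with 2-dimensional embeddings; the last position is padding.
sequence = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
seq_mask = [1, 1, 0]
vector = masked_mean(sequence, seq_mask)
print(vector)
```

Whatever the encoder is, its output has one fixed-size vector per example, which is what lets a single linear layer sit on top of it.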

In [0]:
from allennlp.modules.seq2vec_encoders import PytorchSeq2VecWrapper
encoder: Seq2VecEncoder = PytorchSeq2VecWrapper(nn.LSTM(word_embeddings.get_output_dim(),
                                                        config.hidden_sz, bidirectional=True, batch_first=True))

Now, we can build our model in 3 simple lines of code! (or 4 lines depending on how you count it).

In [0]:
model = BaselineModel(
    word_embeddings, 
    encoder, 
)

In [0]:
if USE_GPU:
    model.cuda()

In [0]:
batch = nn_util.move_to_device(batch, 0 if USE_GPU else -1)
tokens = batch["tokens"]
labels = batch["label"]

In [0]:
tokens

In [0]:
mask = get_text_field_mask(tokens)
mask

In [0]:
embeddings = model.word_embeddings(tokens)
state = model.encoder(embeddings, mask)
class_logits = model.projection(state)
class_logits

In [0]:
model(**batch)

In [0]:
loss = model(**batch)["loss"]
loss

tensor(0.6986, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
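A quick sanity check on this number: with binary cross-entropy, a logit of 0 (probability 0.5) costs ln 2 ≈ 0.693 whatever the label, so an untrained model should start close to the 0.6986 shown above. The helper below is a library-free sketch of the per-example loss in its numerically stable form. Its name is illustrative, and it is not the PyTorch implementation.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy on a raw logit: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


chance_loss = bce_with_logits(0.0, 1.0)     # undecided model
confident_loss = bce_with_logits(4.0, 1.0)  # confident and correct
print(chance_loss, confident_loss)
```

A starting loss far above ln 2 would suggest a problem with initialization or with the labels, so this is a useful first check before training.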

In [0]:
loss.backward()

###Trainer

AllenNLP – thanks to the light restrictions it puts on its models and iterators – provides a Trainer class that removes the necessity of boilerplate code and gives us all sorts of functionality, including access to Tensorboard, one of the best visualization/debugging tools for training neural networks.

In [0]:
optimizer = optim.Adam(model.parameters(), lr=config.lr)

In [0]:
from allennlp.training.trainer import Trainer

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    iterator=iterator,
    train_dataset=train_ds,
    cuda_device=0 if USE_GPU else -1,
    num_epochs=config.epochs,
)

In [0]:
metrics = trainer.train()


loss: 0.5532 ||: 100%|██████████| 16/16 [00:00<00:00, 19.70it/s]
loss: 0.3531 ||: 100%|██████████| 16/16 [00:00<00:00, 22.82it/s]

###AllenNLP Predictors

AllenNLP’s predictors aren’t very easy to use and don’t feel as polished as other parts of the API. Instead of toiling through the predictor API in AllenNLP, I propose a simpler solution: let’s write our own predictor. Thanks to the great tools in AllenNLP this is pretty easy and instructive!

Our predictor will simply extract the model logits from each batch and concatenate them to form a single matrix containing predictions for all the Instances in the dataset. 

To do this, we’re using iterators to batch our data easily and exploiting the semantics of the model output.
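The core of such a predictor can be sketched without any framework: turn each batch of logits into probabilities with a sigmoid, then append the batches in order. `sigmoid` and `collect_predictions` are hypothetical helpers for illustration, and the logits are made up.

```python
import math
from typing import Iterable, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def collect_predictions(batched_logits: Iterable[List[List[float]]]) -> List[List[float]]:
    """Map logits to probabilities and stack the batches in their original order."""
    preds: List[List[float]] = []
    for batch in batched_logits:
        preds.extend([[sigmoid(v) for v in row] for row in batch])
    return preds


# Two batches (sizes 2 and 1) with one logit per example.
logit_batches = [[[0.0], [2.0]], [[-2.0]]]
probs = collect_predictions(logit_batches)
print(probs)
```

Because the rows are concatenated in iteration order, the batches must come out in dataset order for row i of the result to belong to example i. This is why prediction uses a non-shuffling iterator.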

In [0]:
from allennlp.data.iterators import DataIterator
from tqdm import tqdm
from scipy.special import expit # the sigmoid function

def tonp(tsr): return tsr.detach().cpu().numpy()

class Predictor:
    def __init__(self, model: Model, iterator: DataIterator,
                 cuda_device: int=-1) -> None:
        self.model = model
        self.iterator = iterator
        self.cuda_device = cuda_device
        
    def _extract_data(self, batch) -> np.ndarray:
        out_dict = self.model(**batch)
        return expit(tonp(out_dict["class_logits"]))
    
    def predict(self, ds: Iterable[Instance]) -> np.ndarray:
        pred_generator = self.iterator(ds, num_epochs=1, shuffle=False)
        self.model.eval()
        pred_generator_tqdm = tqdm(pred_generator,
                                   total=self.iterator.get_num_batches(ds))
        preds = []
        with torch.no_grad():
            for batch in pred_generator_tqdm:
                batch = nn_util.move_to_device(batch, self.cuda_device)
                preds.append(self._extract_data(batch))
        return np.concatenate(preds, axis=0)

In [0]:
from allennlp.data.iterators import BasicIterator
# iterate over the dataset without changing its order
seq_iterator = BasicIterator(batch_size=64)
seq_iterator.index_with(vocab)

In [0]:
predictor = Predictor(model, seq_iterator, cuda_device=0 if USE_GPU else -1)
train_preds = predictor.predict(train_ds) 
test_preds = predictor.predict(test_ds)





In [0]:
train['comment_text'][3]

'Videos 15 Civilians Killed In Single US Airstrike Have Been Identified The rate at which civilians are being killed by American airstrikes in Afghanistan is now higher than it was in 2014 when the US was engaged in active combat operations.   Photo of Hellfire missiles being loaded onto a US military Reaper drone in Afghanistan by Staff Sgt. Brian Ferguson/U.S. Air Force. \nThe Bureau has been able to identify 15 civilians killed in a single US drone strike in Afghanistan last month – the biggest loss of civilian life in one strike since the attack on the Medecins Sans Frontieres hospital (MSF) last October. \nThe US claimed it had conducted a “counter-terrorism” strike against Islamic State (IS) fighters when it hit Nangarhar province with missiles on September 28. But the next day the United Nations issued an unusually rapid and strong statement saying the strike had killed 15 civilians and injured 13 others who had gathered at a house to celebrate a tribal elder’s return from a pil

In [0]:
trial_tok = ['Videos 15 Civilians Killed In Single US Airstrike Have Been Identified The rate at which civilians are being killed by American airstrikes in Afghanistan is now higher than it was in 2014 when the US was engaged in active combat operations.   Photo of Hellfire missiles being loaded onto a US military Reaper drone in Afghanistan by Staff Sgt. Brian Ferguson/U.S. Air Force. \nThe Bureau has been able to identify 15 civilians killed in a single US drone strike in Afghanistan last month – the biggest loss of civilian life in one strike since the attack on the Medecins Sans Frontieres hospital (MSF) last October. \nThe US claimed it had conducted a “counter-terrorism” strike against Islamic State (IS) fighters when it hit Nangarhar province with missiles on September 28. But the next day the United Nations issued an unusually rapid and strong statement saying the strike had killed 15 civilians and injured 13 others who had gathered at a house to celebrate a tribal elder’s return from a pilgrimage to Mecca. \nThe Bureau spoke to a man named Haji Rais who said he was the owner of the house that was targeted. He said 15 people were killed and 19 others injured, and provided their names (listed below). The Bureau was able to independently verify the identities of those who died. \nRais’ son, a headmaster at a local school, was among them. Another man, Abdul Hakim, lost three of his sons in the attack. \nRais said he had no involvement with IS and denied US claims that IS members had visited his house before the strike. He said: “I did not even speak to those sort of people on the phone let alone receiving them in my house.” \nThe deaths amount to the biggest confirmed loss of civilian life in a single American strike in Afghanistan since the attack on the MSF hospital in Kunduz last October, which killed at least 42 people. \nThe Nangarhar strike was not the only US attack to kill civilians in September. 
The Bureau’s data indicates that as many as 45 civilians and allied soldiers were killed in four American strikes in Afghanistan and Somalia that month. \nOn September 18 a pair of strikes killed eight Afghan policemen in Tarinkot, the capital of Urozgan provice. US jets reportedly hit a police checkpoint, killing one officer, before returning to target first responders. The use of this tactic – known as a “double-tap” strike – is controversial because they often hit civilian rescuers. \nThe US told the Bureau it had conducted the strike against individuals firing on and posing a threat to Afghan forces. The email did not directly address the allegations of Afghan policemen being killed. \nAt the end of the month in Somalia, citizens burnt US flags on the streets of the north-central city of Galcayo after it emerged a drone attack may have unintentionally killed 22 Somali soldiers and civilians. The strike occurred on the same day as the one in Nangarhar. \nIn both the Somali and Afghan incidents, the US at first denied that any non-combatants had been killed. It is now investigating both the strikes in Nangarhar and Galcayo. \nThe rate at which civilians are being killed by American airstrikes in Afghanistan is now higher than it was in 2014 when the US was engaged in active combat operations. Name']
trial_df = pd.DataFrame({'id':1,'comment_text': trial_tok})

In [0]:
trial_ds = reader.read(trial_df)






In [0]:
trial = predictor.predict(trial_ds) 






In [0]:
trial

array([[0.9981964]], dtype=float32)

In [0]:
train.head()

Unnamed: 0,id,comment_text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,Ever get the feeling your life circles the rou...,0
2,2,"Why the Truth Might Get You Fired October 29, ...",1
3,3,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Print \nAn Iranian woman has been sentenced to...,1


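The predicted probability of about 0.998 matches the label of 1 for this article (row 3 of the training set). Since the example was seen during training, this confirms that the pipeline works end to end rather than that the model generalizes.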

In [0]:
torch.save(model, 'fnd_model.h5')



In [0]:
mod = torch.load('fnd_model.h5')

In [0]:
predictor = Predictor(mod, seq_iterator, cuda_device=0 if USE_GPU else -1)


In [0]:
trial = predictor.predict(trial_ds) 






In [0]:
trial

array([[0.9981964]], dtype=float32)

###How to Switch to ELMo

Simply building a single NLP pipeline to train one model is easy. Writing the pipeline so that we can iterate over multiple configurations, swap components in and out, and implement crazy architectures without making our codebase explode is much harder.

Here, I’ll demonstrate how you can use ELMo to train your model with minimal changes to your code. ELMo is a recently developed method for text embedding in NLP that takes contextual information into account and achieved state-of-the-art results in many NLP tasks (If you want to learn more about ELMo, please refer to this blog post I wrote in the past explaining the method – sorry for the shameless plug).

To incorporate ELMo, we’ll need to change two things:

The token indexer
The embedder

ELMo uses character-level features so we’ll need to change the token indexer from a word-level indexer to a character-level indexer. In addition to converting characters to integers, we’re using a pre-trained model so we need to ensure that the mapping we use is the same as the mapping that was used to train ELMo. This seems like a lot of work, but in AllenNLP, all you need to do is use the ELMoTokenCharactersIndexer:

In [0]:
from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter
from allennlp.data.token_indexers.elmo_indexer import ELMoCharacterMapper, ELMoTokenCharactersIndexer

# the token indexer is responsible for mapping tokens to integers
token_indexer = ELMoTokenCharactersIndexer()

def tokenizer(x: str):
    return [w.text for w in
            SpacyWordSplitter(language='en_core_web_sm', 
                              pos_tags=False).split_words(x)[:config.max_seq_len]]

Wait, is that it? you may ask. What about the DatasetReader? Surely if we use a different indexer, we’ll need to change the way we read the dataset? Well, not in AllenNLP. This is where composition shines; since we delegate all the decisions regarding how to convert raw text into integers to the token indexer, we get to reuse all the remaining code simply by swapping in a new token indexer.

One thing to note is that the ELMoTokenCharactersIndexer handles the mapping from characters to indices for you (you need to use the same mappings as the pretrained model for ELMo to have any benefit). Therefore, no vocabulary has to be built from the data: after re-reading the dataset with the new indexer, an empty Vocabulary is enough.

In [0]:
reader = FNDDatasetReader(
    tokenizer=tokenizer,
    token_indexers={"tokens": token_indexer}
)

train_ds = reader.read(train)

1000it [00:11, 89.34it/s]


In [0]:
test_ds = reader.read(test)

1000it [00:09, 107.47it/s]


In [0]:
train_ds[:10]

[<allennlp.data.instance.Instance at 0x7f8b0b01a978>,
 <allennlp.data.instance.Instance at 0x7f8b0b0d0320>,
 <allennlp.data.instance.Instance at 0x7f8b0afb8668>,
 <allennlp.data.instance.Instance at 0x7f8b0af78e80>,
 <allennlp.data.instance.Instance at 0x7f8b0b09e080>,
 <allennlp.data.instance.Instance at 0x7f8b0b0fdba8>,
 <allennlp.data.instance.Instance at 0x7f8b0af84cc0>,
 <allennlp.data.instance.Instance at 0x7f8b0af41710>,
 <allennlp.data.instance.Instance at 0x7f8b0c4f6128>,
 <allennlp.data.instance.Instance at 0x7f8b0c0c3048>]

In [0]:
vars(train_ds[0].fields["tokens"])

In [0]:
vocab = Vocabulary() #We don't need to build the vocab: all that is handled by the token indexer

In [0]:
iterator = BucketIterator(batch_size=config.batch_size, 
                          sorting_keys=[("tokens", "num_tokens")],
                         )

In [0]:
iterator.index_with(vocab)

In [0]:
batch = next(iter(iterator(train_ds)))
batch["tokens"]["tokens"].shape

torch.Size([64, 100, 50])
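The extra trailing dimension is the character axis: each of the 100 token positions is now a fixed-length vector of 50 character ids instead of a single word id. The sketch below shows how such a shape arises. The scheme (UTF-8 byte value plus one, with 0 as padding) is a hypothetical simplification; the real indexer uses its own fixed mapping with additional special characters, which this sketch does not reproduce.

```python
from typing import List

MAX_WORD_LEN = 50  # matches the trailing dimension in the shape above


def char_ids(token: str, max_len: int = MAX_WORD_LEN) -> List[int]:
    """Hypothetical scheme: UTF-8 byte value + 1, with 0 reserved for padding."""
    ids = [b + 1 for b in token.encode("utf-8")][:max_len]
    return ids + [0] * (max_len - len(ids))


sentence = ["Fake", "news"]
indexed = [char_ids(tok) for tok in sentence]
shape = (len(indexed), len(indexed[0]))
print(shape, indexed[0][:5])
```

Since the mapping depends only on the characters themselves, no data-dependent vocabulary is required, which is consistent with the empty `Vocabulary()` used above.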

Now, to change the embeddings to ELMo, you can simply follow a similar process:

In [0]:
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import ElmoTokenEmbedder

options_file = 'https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_options.json'
weight_file = 'https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_weights.hdf5'

elmo_embedder = ElmoTokenEmbedder(options_file, weight_file)
word_embeddings = BasicTextFieldEmbedder({"tokens": elmo_embedder})

In [0]:
from allennlp.modules.seq2vec_encoders import PytorchSeq2VecWrapper
encoder: Seq2VecEncoder = PytorchSeq2VecWrapper(nn.LSTM(word_embeddings.get_output_dim(), config.hidden_sz, bidirectional=True, batch_first=True))

In [0]:
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder, PytorchSeq2VecWrapper
from allennlp.nn.util import get_text_field_mask
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder

class BaselineModel2(Model):
    def __init__(self, word_embeddings: TextFieldEmbedder,
                 encoder: Seq2VecEncoder,
                 out_sz: int=len(label_cols)):
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.projection = nn.Linear(self.encoder.get_output_dim(), out_sz)
        self.loss = nn.BCEWithLogitsLoss()
        
    def forward(self, tokens: Dict[str, torch.Tensor],
                id: Any, label: torch.Tensor) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(tokens)      # 1 for real tokens, 0 for padding
        embeddings = self.word_embeddings(tokens)
        state = self.encoder(embeddings, mask)  # one vector per text
        class_logits = self.projection(state)   # raw scores, one per label
        
        output = {"class_logits": class_logits}
        output["loss"] = self.loss(class_logits, label)

        return output
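
`nn.BCEWithLogitsLoss` applies a sigmoid to the logits and then binary cross-entropy, averaged over all elements by default. The plain-Python sketch below uses the numerically stable form of that computation. `bce_with_logits` is a hypothetical helper written for illustration, not a PyTorch function.

```python
import math
from typing import List

def bce_with_logits(logits: List[List[float]], labels: List[List[float]]) -> float:
    """Mean binary cross-entropy over every (example, class) pair."""
    total, count = 0.0, 0
    for row_x, row_y in zip(logits, labels):
        for x, y in zip(row_x, row_y):
            # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count

# All-zero logits: the model has no opinion yet.
untrained = bce_with_logits([[0.0], [0.0]], [[1.0], [0.0]])
# Large logits with the correct sign: confident and right.
confident = bce_with_logits([[4.0], [-4.0]], [[1.0], [0.0]])
print(round(untrained, 4), round(confident, 4))
```

With all-zero logits the loss is ln 2 ≈ 0.693 whatever the labels are, which is a handy reference point when reading the loss of a freshly initialized model.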

In [0]:
model = BaselineModel2(
    word_embeddings, 
    encoder, 
)
if USE_GPU:
    model.cuda()

In [0]:
batch = nn_util.move_to_device(batch, 0 if USE_GPU else -1)
tokens = batch["tokens"]
labels = batch["label"]
tokens

In [0]:
mask = get_text_field_mask(tokens)
mask
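
Conceptually, the mask marks which positions hold real tokens and which are padding. A minimal sketch, assuming id 0 is the padding id (AllenNLP's convention), with hypothetical helper names `token_mask` and `char_level_mask`:

```python
from typing import List

def token_mask(token_ids: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """1 where a real token sits, 0 where the batch was padded."""
    return [[int(t != pad_id) for t in row] for row in token_ids]

def char_level_mask(char_ids: List[List[List[int]]]) -> List[List[int]]:
    """Same idea for [batch, tokens, chars] input:
    a token counts as real if any of its character ids is non-zero."""
    return [[int(any(c > 0 for c in tok)) for tok in row] for row in char_ids]

print(token_mask([[5, 7, 2], [9, 0, 0]]))         # [[1, 1, 1], [1, 0, 0]]
print(char_level_mask([[[3, 4, 0], [0, 0, 0]]]))  # [[1, 0]]
```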

In [0]:
embeddings = model.word_embeddings(tokens)
state = model.encoder(embeddings, mask)
class_logits = model.projection(state)
class_logits

In [0]:
loss = model(**batch)["loss"]
loss

tensor(0.6850, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>)

In [0]:
optimizer = optim.Adam(model.parameters(), lr=config.lr)
trainer = Trainer(
    model=model,
    optimizer=optimizer,
    iterator=iterator,
    train_dataset=train_ds,
    cuda_device=0 if USE_GPU else -1,
    num_epochs=config.epochs,
)
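
The `Trainer` hides the usual loop: for each epoch, for each batch, run the forward pass, compute the loss, backpropagate, and let the optimizer update the weights. The sketch below spells that loop out for a one-feature logistic regression in plain Python. It is an illustration only: it uses plain SGD rather than Adam, hand-written gradients rather than autograd, and the `train` function and toy data are made up.

```python
import math
from typing import List, Tuple

def train(data: List[Tuple[float, float]], batch_size: int,
          epochs: int, lr: float) -> List[float]:
    """Fit y ~ sigmoid(w*x + b) with mini-batch SGD; return the mean loss per epoch."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        batch_losses = []
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            grad_w = grad_b = loss = 0.0
            for x, y in batch:
                p = 1.0 / (1.0 + math.exp(-(w * x + b)))   # forward pass
                loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
                grad_w += (p - y) * x                       # backward pass
                grad_b += (p - y)
            n = len(batch)
            w -= lr * grad_w / n                            # optimizer step
            b -= lr * grad_b / n
            batch_losses.append(loss / n)
        history.append(sum(batch_losses) / len(batch_losses))
    return history

data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
history = train(data, batch_size=2, epochs=5, lr=0.5)
print([round(v, 3) for v in history])
```

The returned list plays the role of the per-epoch `loss:` values that the progress bars report during training.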

In [0]:
metrics = trainer.train()

loss: 0.6862 ||: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]
loss: 0.6618 ||: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]
loss: 0.6309 ||: 100%|██████████| 16/16 [00:09<00:00,  1.66it/s]
loss: 0.5769 ||: 100%|██████████| 16/16 [00:09<00:00,  1.73it/s]
loss: 0.4997 ||: 100%|██████████| 16/16 [00:09<00:00,  1.66it/s]
loss: 0.4226 ||: 100%|██████████| 16/16 [00:09<00:00,  1.71it/s]
loss: 0.3593 ||: 100%|██████████| 16/16 [00:09<00:00,  1.77it/s]
loss: 0.3272 ||: 100%|██████████| 16/16 [00:09<00:00,  1.70it/s]
loss: 0.2837 ||: 100%|██████████| 16/16 [00:09<00:00,  1.66it/s]
loss: 0.2532 ||: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]
loss: 0.2225 ||: 100%|██████████| 16/16 [00:09<00:00,  1.81it/s]
loss: 0.1910 ||: 100%|██████████| 16/16 [00:09<00:00,  1.70it/s]
loss: 0.1770 ||: 100%|██████████| 16/16 [00:09<00:00,  1.72it/s]
loss: 0.1575 ||: 100%|██████████| 16/16 [00:09<00:00,  1.66it/s]
loss: 0.1401 ||: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]
loss: 0.1329 ||: 100%|███

###BERT



In [0]:
config = Config(
    testing=True,
    seed=1,
    batch_size=64,
    lr=3e-4,
    epochs=2,
    hidden_sz=64,
    max_seq_len=100, # necessary to limit memory usage
    max_vocab_size=100000,
)

You’re probably thinking that switching to BERT is mostly the same as above. Well, you’re right, mostly. BERT has a few quirks that set it apart from the models we have used so far. One quirk is that BERT operates on wordpieces (subword units) rather than whole words, so we need to tokenize the text with BERT's own wordpiece tokenizer.
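
To make "wordpiece" concrete, here is a small sketch of the greedy longest-match-first splitting that BERT-style tokenizers use for a single word. The vocabulary is made up for the example, and `wordpiece` is a hypothetical stand-in, not the tokenizer that `PretrainedBertIndexer` loads.

```python
from typing import List, Set

def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            # Pieces that continue a word carry the "##" prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"un", "##aff", "##able", "fake", "news"}  # made-up vocabulary
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("fake", toy_vocab))       # ['fake']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```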

In [0]:
from allennlp.data.token_indexers import PretrainedBertIndexer

token_indexer = PretrainedBertIndexer(
    pretrained_model="bert-base-uncased",
    max_pieces=config.max_seq_len,
    do_lowercase=True,
 )
# We have to truncate manually here, leaving 2 slots for the [CLS] and [SEP] tokens the indexer adds
def tokenizer(s: str):
    return token_indexer.wordpiece_tokenizer(s)[:config.max_seq_len - 2]

100%|██████████| 231508/231508 [00:00<00:00, 929600.31B/s]
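
The `- 2` in the tokenizer above is easy to miss. A minimal sketch of the arithmetic, assuming the indexer wraps each text in one `[CLS]` and one `[SEP]` token after we truncate (`truncate_for_bert` and `add_special_tokens` are hypothetical helpers):

```python
from typing import List

MAX_PIECES = 100  # mirrors config.max_seq_len

def truncate_for_bert(pieces: List[str], max_pieces: int = MAX_PIECES) -> List[str]:
    """Keep at most max_pieces - 2 wordpieces, leaving room for two special tokens."""
    return pieces[:max_pieces - 2]

def add_special_tokens(pieces: List[str]) -> List[str]:
    """Assumed indexer behavior: wrap the text in [CLS] ... [SEP]."""
    return ["[CLS]"] + pieces + ["[SEP]"]

long_text = ["tok"] * 250
final = add_special_tokens(truncate_for_bert(long_text))
print(len(final))  # 100
```

Truncating to `max_seq_len` itself would produce 102 pieces once the special tokens are added, which is over the limit.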


In [0]:
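# Note: train_ds / test_ds were built by the previous reader and keep its token indexers;
# call reader.read(...) again so the instances pick up the BERT indexer.
reader = FNDDatasetReader(
    tokenizer=tokenizer,
    token_indexers={"tokens": token_indexer}
)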

In [0]:
vocab = Vocabulary()

In [0]:
iterator = BucketIterator(batch_size=config.batch_size, 
                          sorting_keys=[("tokens", "num_tokens")],
                         )

In [0]:
iterator.index_with(vocab)

In [0]:
batch = next(iter(iterator(train_ds)))
batch["tokens"]["tokens"].shape

torch.Size([64, 100, 50])

In [0]:
batch

In [0]:
class BaselineModel3(Model):
    def __init__(self, word_embeddings: TextFieldEmbedder,
                 encoder: Seq2VecEncoder,
                 out_sz: int=len(label_cols)):
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.projection = nn.Linear(self.encoder.get_output_dim(), out_sz)
        self.loss = nn.BCEWithLogitsLoss()
        
    def forward(self, tokens: Dict[str, torch.Tensor],
                id: Any, label: torch.Tensor) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(tokens)      # 1 for real tokens, 0 for padding
        embeddings = self.word_embeddings(tokens)
        state = self.encoder(embeddings, mask)  # one vector per text
        class_logits = self.projection(state)   # raw scores, one per label
        
        output = {"class_logits": class_logits}
        output["loss"] = self.loss(class_logits, label)

        return output

In [0]:
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders.bert_token_embedder import PretrainedBertEmbedder

bert_embedder = PretrainedBertEmbedder(
        pretrained_model="bert-base-uncased",
        top_layer_only=True, # conserve memory
)
word_embeddings: TextFieldEmbedder = BasicTextFieldEmbedder(
    {"tokens": bert_embedder},
    # we'll be ignoring masks so we'll need to set this to True
    allow_unmatched_keys=True,
)

100%|██████████| 407873900/407873900 [00:13<00:00, 29521335.45B/s]


In [0]:
BERT_DIM = word_embeddings.get_output_dim()

class BertSentencePooler(Seq2VecEncoder):
    def forward(self, embs: torch.Tensor, 
                mask: torch.Tensor=None) -> torch.Tensor:
        # extract the first token's vector (the [CLS] position) for every text
        return embs[:, 0]
    
    @overrides
    def get_output_dim(self) -> int:
        return BERT_DIM
    
encoder = BertSentencePooler()  # the pooler has no parameters and does not need the vocab
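
The pooler above relies on BERT's convention that the first position holds the `[CLS]` token, whose vector is used as a summary of the whole text. In plain Python, the slice `embs[:, 0]` amounts to the following (a sketch on nested lists with made-up numbers; `first_token_pool` is a hypothetical name):

```python
from typing import List

Vector = List[float]

def first_token_pool(embs: List[List[Vector]]) -> List[Vector]:
    """Pure-Python analogue of embs[:, 0]: keep each sequence's first vector."""
    return [sequence[0] for sequence in embs]

# Batch of 2 sequences, 3 tokens each, embedding size 2.
embs = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]
pooled = first_token_pool(embs)
print(pooled)  # [[0.1, 0.2], [1.0, 1.1]]
```

The result has one vector per text, which is the shape the projection layer expects.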

In [0]:
model = BaselineModel3(
    word_embeddings, 
    encoder, 
)

In [0]:
if USE_GPU:
    model.cuda()

In [0]:
batch = nn_util.move_to_device(batch, 0 if USE_GPU else -1)

In [0]:
tokens = batch["tokens"]
labels = batch["label"]

In [0]:
mask = get_text_field_mask(tokens)
mask

tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0')

In [0]:
embeddings = model.word_embeddings(tokens)
state = model.encoder(embeddings, mask)
class_logits = model.projection(state)
class_logits

In [0]:
model(**batch)

In [0]:
optimizer = optim.Adam(model.parameters(), lr=config.lr)

In [0]:
trainer = Trainer(
    model=model,
    optimizer=optimizer,
    iterator=iterator,
    train_dataset=train_ds,
    cuda_device=0 if USE_GPU else -1,
    num_epochs=config.epochs,
)

In [0]:
metrics = trainer.train()