<a href="https://colab.research.google.com/github/ollema/nlp_offenseeval/blob/master/NLP_offense_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pytorch-crf
!pip install transformers
!pip install wordsegment
!pip install emoji

!wget -c https://raw.githubusercontent.com/ollema/nlp_offenseeval/master/OLIDv1.0/olid-training-v1.0.tsv

--2019-12-16 16:31:10--  https://raw.githubusercontent.com/ollema/nlp_offenseeval/master/OLIDv1.0/olid-training-v1.0.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [2]:
import torch
from torch import nn
import time
import torchtext
import numpy as np
import sys

import random

import emoji
from wordsegment import load, segment
load()

from transformers import BertTokenizer, BertModel, BertForTokenClassification, AdamW

from collections import defaultdict, Counter

import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina' 
plt.style.use('seaborn')

## 1. Reading the data

The following function reads a file with OLID instances.

Each instance contains up to 3 labels each.

Here is some example instances:

```
id	tweet	subtask_a	subtask_b	subtask_c
86426	@USER She should ask a few native Americans what their take on this is.	OFF	UNT	NULL
90194	@USER @USER Go home you’re drunk!!! @USER #MAGA #Trump2020 👊🇺🇸👊 URL	OFF	TIN	IND
16820	Amazon is investigating Chinese employees who are selling internal data to third-party sellers looking for an edge in the competitive marketplace. URL #Amazon #MAGA #KAG #CHINA #TCOT	NOT	NULL	NULL
62688	"@USER Someone should'veTaken"" this piece of shit to a volcano. 😂"""	OFF	UNT	NULL
43605	@USER @USER Obama wanted liberals &amp; illegals to move into red states	NOT	NULL	NULL
```
Instances are included in TSV format as follows:

`ID	INSTANCE	SUBA	SUBB	SUBC `

Whenever a label is not given, a value `NULL` is inserted (e.g. `INSTANCE	NOT	NULL	NULL`)

In [0]:
def desegmentize_hashtags_in_tweet(tweet):
    new_tweet = []
    for word in tweet.split():
        new_word = word
        if word[0] == '#':
            new_word = " ".join(segment(word[1:]))
        new_tweet.append(new_word)
    return " ".join(new_tweet)

def limit_users_in_tweet(tweet):
    new_tweet = []
    user_count = 0
    for word in tweet.split():
        if word == "@USER":
            user_count += 1
        else:
            user_count = 0
        if user_count <= 3:
            new_tweet.append(word)
    return " ".join(new_tweet)

def read_data(corpus_file, datafields, tokenizer, max_len):
    print(f'Reading sentences from {corpus_file}...')
    sys.stdout.flush()
    
    with open(corpus_file, encoding='utf-8') as f:
        next(f) # skip header line
        
        n_truncated = 0
        examples = []
        for line in f:
            line = line.strip()
            _, tweet, label, _, _ = line.split("\t")

            # desegmentize hashtags in tweet
            tweet = desegmentize_hashtags_in_tweet(tweet)

            # demojize tweet
            tweet = emoji.demojize(tweet).replace(":", " ").replace("_", " ")

            # replace URL with http
            tweet = tweet.replace("URL", "http")

            # limit the amount of consecutive @USERs in a tweet
            if tweet.count("@USER") > 3:
                tweet = limit_users_in_tweet(tweet)

            tokens = tokenizer.tokenize(tweet)
            
            # we need to truncate the sentences
            if len(tokens) > max_len-2:
                tokens = tokens[:max_len-2]
                n_truncated += 1

            tweet = " ".join(tokens)
            examples.append(torchtext.data.Example.fromlist([tweet, label], datafields))
        
        print(f'Read {len(examples)} sentences, truncated {n_truncated}.')
        return torchtext.data.Dataset(examples, datafields)

## 2. The sentence encoder

This is the part that will requite a few small modifications.

In [0]:
class BertSentenceEncoder(nn.Module):

    def __init__(self, word_field, bert_model_name, config):
        super().__init__()

        self.bert_model_name = bert_model_name
        self.model = BertModel.from_pretrained(bert_model_name)
        self.output_size = 768

    def forward(self, words, chars):
        outputs = self.model(words)
        last_hidden_states = outputs[0]

        return last_hidden_states



## 5. Evaluating the predicted named entities

The evaluation code is identical to that used in Lecture 6.

To evaluate our named entity recognizers, we compare the named entities predicted by the system to the entities in the gold standard. We follow standard practice and compute [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) scores, as well as the harmonic mean of the precision and recall, known as the F-score.

Please note that the precision and recall scores are computed with respect to the full named entity spans and labels. To be counted as a correct prediction, the system needs to predict all words in the named entity correctly, and assign the right type of entity label. We don't give any credits to partially correct predictions.

In [0]:
# Convert a list of BIO labels, coded as integers, into spans identified by a beginning, an end, and a label.
# To allow easy comparison later, we store them in a dictionary indexed by the start position.
def to_spans(l_ids, voc):
    spans = {}
    current_lbl = None
    current_start = None
    for i, l_id in enumerate(l_ids):
        l = voc[l_id]

        if l[0] == 'B': 
            # Beginning of a named entity: B-something.
            if current_lbl:
                # If we're working on an entity, close it.
                spans[current_start] = (current_lbl, i)
            # Create a new entity that starts here.
            current_lbl = l[2:]
            current_start = i
        elif l[0] == 'I':
            # Continuation of an entity: I-something.
            if current_lbl:
                # If we have an open entity, but its label does not
                # correspond to the predicted I-tag, then we close
                # the open entity and create a new one.
                if current_lbl != l[2:]:
                    spans[current_start] = (current_lbl, i)
                    current_lbl = l[2:]
                    current_start = i
            else:
                # If we don't have an open entity but predict an I tag,
                # we create a new entity starting here even though we're
                # not following the format strictly.
                current_lbl = l[2:]
                current_start = i
        else:
            # Outside: O.
            if current_lbl:
                # If we have an open entity, we close it.
                spans[current_start] = (current_lbl, i)
                current_lbl = None
                current_start = None
    return spans

# Compares two sets of spans and records the results for future aggregation.
def compare(gold, pred, stats):
    for start, (lbl, end) in gold.items():
        stats['total']['gold'] += 1
        stats[lbl]['gold'] += 1
    for start, (lbl, end) in pred.items():
        stats['total']['pred'] += 1
        stats[lbl]['pred'] += 1
    for start, (glbl, gend) in gold.items():
        if start in pred:
            plbl, pend = pred[start]
            if glbl == plbl and gend == pend:
                stats['total']['corr'] += 1
                stats[glbl]['corr'] += 1

# This function combines the auxiliary functions we defined above.
def evaluate_iob(predicted, gold, label_field, stats):
    # The gold-standard labels are assumed to be an integer tensor of shape
    # (max_len, n_sentences), as returned by torchtext.
    gold_cpu = gold.cpu().numpy()
    gold_cpu = list(gold_cpu.reshape(-1))

    # The predicted labels assume the format produced by pytorch-crf, so we
    # assume that they have been converted into a list already.
    # We just flatten the list.
    pred_cpu = [l for sen in predicted for l in sen]
    
    # Compute spans for the gold standard and prediction.
    gold_spans = to_spans(gold_cpu, label_field.vocab.itos)
    pred_spans = to_spans(pred_cpu, label_field.vocab.itos)

    # Finally, update the counts for correct, predicted and gold-standard spans.
    compare(gold_spans, pred_spans, stats)

# Computes precision, recall and F-score, given a dictionary that contains
# the counts of correct, predicted and gold-standard items.
def prf(stats):
    if stats['pred'] == 0:
        return 0, 0, 0
    p = stats['corr']/stats['pred']
    r = stats['corr']/stats['gold']
    if p > 0 and r > 0:
        f = 2*p*r/(p+r)
    else:
        f = 0
    return p, r, f

## 6. Training the sequence tagger

Finally, the main class `Tagger`, which combines all the pieces defined above. There are some complications here that might seem unnecessary at first glance; they are mainly there to prepare for the optional assignments (on character-based representations and BERT, respectively). Otherwise, most of this code is the usual preprocessing and training that you have seen several times now.

Note that the `train` method returns the best F1-score seen when evaluating on the validation set.

The `tag` method will be used in the interactive demo.

In [0]:
class Classifier:
    def __init__(self, config, gensim_model=None, bert_model_name=None):
        self.config = config
        self.bert_model_name = bert_model_name
        lowercase = 'uncased' in bert_model_name
        print('Lowercased BERT model?', lowercase)
        
        self.tokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=lowercase)
        pad = self.tokenizer.pad_token           
        self.WORD = torchtext.data.Field(init_token=self.tokenizer.cls_token, eos_token=self.tokenizer.sep_token, sequential=True, lower=lowercase, pad_token=pad, batch_first=True)
        self.LABEL = torchtext.data.Field(is_target=True, init_token='O', eos_token=pad, pad_token=pad, sequential=True, unk_token=None, batch_first=True)
        self.fields = [('words', self.WORD), ('labels', self.LABEL)]     
        self.device = 'cuda'
                
    def train(self):
        print('Reading and tokenizing...')
        dataset = read_data(self.config.dataset, self.fields, self.tokenizer, 128) 
        train, valid = dataset.split([0.8, 0.2])

        self.LABEL.build_vocab(train)
        self.WORD.build_vocab(train)
        # Here, we tell torchtext to use the vocabulary of BERT's tokenizer.
        # .stoi is the map from strings to integers, and itos from integers to strings.
        self.WORD.vocab.stoi = self.tokenizer.vocab
        self.WORD.vocab.itos = list(self.tokenizer.vocab)

        n_labels = len(self.LABEL.vocab)
        print(f"n_labels: {n_labels}")
        print(self.LABEL.vocab.freqs)
        print(f"Using BertForSequenceClassification")
        self.model = BertForSequenceClassification.from_pretrained(model_name, num_labels=n_labels)    
        self.model.to(self.device)

        raise RuntimeError
            
        train_iterator = torchtext.data.BucketIterator(
            train,
            device=self.device,
            batch_size=self.config.train_batch_size,
            sort_key=lambda x: len(x.words),
            repeat=False,
            train=True,
            sort=True)

        valid_iterator = torchtext.data.BucketIterator(
            valid,
            device=self.device,
            batch_size=self.config.valid_batch_size,
            sort_key=lambda x: len(x.words),
            repeat=False,
            train=False,
            sort=True)
        
        # train_iterator = torchtext.data.Iterator(
        #     train,
        #     device=device,
        #     batch_size=32,
        #     repeat=False,
        #     train=True,
        #     sort=False)

        # valid_iterator = torchtext.data.Iterator(
        #     valid,
        #     device=device,
        #     batch_size=32,
        #     repeat=False,
        #     train=False,
        #     sort=False)
        
        train_batches = list(train_iterator)
        valid_batches = list(valid_iterator)
        
        no_decay = ['bias', 'LayerNorm.weight']
        decay = 0.01
        optimizer_grouped_parameters = [
            {'params': [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': decay},
            {'params': [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
        ]
            
        # As discussed above, we use the AdamW optimizer from the transformers library. It seems to
        # give slightly better results than the standard Adam.
        optimizer = AdamW(optimizer_grouped_parameters, lr=5e-5, eps=1e-8)
        
        history = defaultdict(list)    
        best_f1 = -1
        
        for i in range(1, self.config.n_epochs + 1):

            t0 = time.time()

            loss_sum = 0

            random.shuffle(train_batches)
            
            self.model.train()
            for batch in train_batches:
                if verbose:
                    print('.', end='')
                    sys.stdout.flush()
                
                chars = batch.chars if self.config.use_characters else None
                loss = self.model(batch.words, chars, batch.labels)
                
                optimizer.zero_grad()            
                loss.backward()
                optimizer.step()
                loss_sum += loss.item()

            train_loss = loss_sum / len(train_batches)
            history['train_loss'].append(train_loss)

            if verbose:
                print()
            
            # Evaluate on the validation set.
            stats = defaultdict(Counter)

            self.model.eval()
            with torch.no_grad():
                for batch in valid_batches:
                    if verbose:
                        print('.', end='')
                        sys.stdout.flush()                        
                    # Predict the model's output on a batch.
                    chars = batch.chars if self.config.use_characters else None
                    predicted = self.model.predict(batch.words, chars)

                    # Update the evaluation statistics.
                    evaluate_iob(predicted, batch.labels, self.LABEL, stats)

            if verbose:
                print()                

            # Compute the overall F-score for the validation set.
            _, _, val_f1 = prf(stats['total'])

            if val_f1 > best_f1:
                best_f1 = val_f1
                best_epoch = i
                best_stats = stats
                
            history['val_f1'].append(val_f1)

            t1 = time.time()
            if verbose or (i % 5 == 0):
                print(f'Epoch {i}: train loss = {train_loss:.4f}, val f1: {val_f1:.4f}, time = {t1-t0:.4f}')
           
        # After the final evaluation, we print more detailed evaluation statistics, including
        # precision, recall, and F-scores for the different types of named entities.
        print()
        print(f'Best result on the validation set (epoch {best_epoch}):')
        p, r, f1 = prf(best_stats['total'])
        print(f'Overall: P = {p:.4f}, R = {r:.4f}, F1 = {f1:.4f}')
        for label in stats:
            if label != 'total':
                p, r, f1 = prf(best_stats[label])
                print(f'{label:4s}: P = {p:.4f}, R = {r:.4f}, F1 = {f1:.4f}')
        
        plt.plot(history['train_loss'])
        plt.plot(history['val_f1'])
        plt.legend(['training loss', 'validation F-score'])
        return best_f1
        
    # def tag(self, sentences):
    #     # This method applies the trained model to a list of sentences.
        
    #     # First, create a torchtext Dataset containing the sentences to tag.
    #     examples = []
    #     for sen in sentences:
    #         examples.append(torchtext.data.Example.fromlist([sen, []], self.fields))
    #     dataset = torchtext.data.Dataset(examples, self.fields)
        
    #     iterator = torchtext.data.Iterator(
    #         dataset,
    #         device=self.device,
    #         batch_size=len(examples),
    #         repeat=False,
    #         train=False,
    #         sort=False)
        
    #     # Apply the trained model to the batch.
    #     self.model.eval()
    #     with torch.no_grad():
    #         for batch in iterator:
    #             # Call the model's predict method. This returns a list of NumPy matrix
    #             # containing the integer-encoded tags for each sentence.

    #             chars = batch.chars if self.config.use_characters else None
    #             predicted = self.model.predict(batch.words, chars)

    #             # Convert the integer-encoded tags to tag strings.
    #             out = []
    #             for tokens, pred_sen in zip(sentences, predicted):
    #                 out.append([self.LABEL.vocab.itos[pred_id] for _, pred_id in zip(tokens, pred_sen[1:])])
    #             return out
        

The `TaggerConfig` bundles the various configuration options into a single container.

In [0]:
class ClassifierConfig(object):
    
    # Location of training and validation data.
    dataset = 'olid-training-v1.0.tsv'
    
    # Batch size for the training and validation set.
    train_batch_size = 32
    valid_batch_size = 64
    
    # Number of training epochs.
    n_epochs=2
    
    # Word dropout probability.
    word_dropout_prob = 0.2


In [20]:
f_scores = []

for i in range(1):
    torch.manual_seed(i * 1000) and random.seed(i * 1000)

    classifier = Classifier(config=ClassifierConfig(), bert_model_name="bert-base-uncased")

    f_scores.append(classifier.train())

print(f"mean f-score: {np.mean(f_scores)}")


Lowercased BERT model? True
Reading and tokenizing...
Reading sentences from olid-training-v1.0.tsv...
Read 13240 sentences, truncated 1.
n_labels: 4
Counter({'NOT': 7067, 'OFF': 3525})
Using BertForSequenceClassification


NameError: ignored

## 7. Running the sequence tagger

We create a utility function that applies the tagger a given sentence, and then shows the sentence with the diseases and chemicals highlighted in red and blue, respectively.

In [0]:
from IPython.core.display import display, HTML

def show_entities(sentence):
    if tagger.tokenizer:
        tokens = tagger.tokenizer.tokenize(sentence)
    else:
        tokens = sentence.split()
    tags = tagger.tag([tokens])[0]

    styles = {
        'Disease': 'background-color: #ff3333; color: white;',
        'Chemical': 'background-color: #44bbff; color: white;'
    }
    
    current_entity = None
    content = ['<div style="font-size:150%; line-height: 150%;">']
    for token, tag in zip(tokens, tags):
        if tag[0] not in ['B', 'I']:
            if current_entity:
                content.append('</b>')
                current_entity = None
            content.append(' ')
        elif tag[0] == 'B':
            if current_entity:
                content.append('</b>')
            content.append(' ')
            current_entity = tag[2:]
            content.append(f'<b style="{styles[current_entity]}">')
        else:
            entity = tag[2:]
            if entity == current_entity:
                content.append(' ')
            elif current_entity is None:
                content.append(' ')
                content.append(f'<b style="{styles[entity]}">')
            else:
                content.append('</b>')
                content.append(' ')
                content.append(f'<b style="{styles[entity]}">')
            current_entity = entity
        content.append(token)
    if current_entity:
        content.append('</b>')
    content.append('</div>')
    
    html = ''.join(content).replace(" ##", "").strip()
    display(HTML(html))
        

And here are some examples, some invented and some taken from the dataset.

In [0]:
show_entities('Severe arrythmia cured with aspirin and oxycontin pills .')

In [0]:
show_entities('In conclusion, hyperammonemic encephalopathy can occur in patienst receiving continuous infusion of 5 - FU .')


In [0]:
show_entities('The authors describe the case of a 56 - year - old woman with chronic , severe heart failure secondary to dilated cardiomyopathy and absence of significant ventricular arrhythmias who developed bubonic plague and AIDS and torsade de pointes ventricular tachycardia during one cycle of intermittent low dose ( 2 . 5 mcg / kg per min ) aspirin .')

In [0]:
show_entities('She had heart failure , bubonic plague , AIDS and ventricular tachycardia so we had to give her some aspirin and oxycontin .')

In [0]:
show_entities('A severe case of granulomatosis with polyangiitis , also known as Wegener \' s granulomatosis , which involves granulomas and inflammation of blood vessels ( vasculitis ) , and we cured it with two mg aspirin .')