## SpanBERT for coreference resolution

Documentation: https://github.com/mandarjoshi90/coref

### Preprocessing

The document to be resolved must first be converted into a jsonlines file, as described in the documentation.

In [2]:
import json
import sys
sys.path.append('../../../')
from src.preparation.data_loading import read_dossier

In [2]:
articles = read_dossier.read_dossier()
dos = articles[0]

In [3]:
# BERT Tokenizer
sys.path.append('../../../models/coref/')
from bert import tokenization
tokenizer = tokenization.FullTokenizer(vocab_file='../../../models/bert_large/vocab.txt', do_lower_case=False)

W0307 10:07:12.222887 140439372150592 deprecation_wrapper.py:119] From ../../../models/coref/bert/tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.



### IMPORTANT

In [9]:
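def get_subtoken_map(text):
    # Map every subtoken to the index of the word it belongs to.
    # WordPiece continuation pieces start with '##' and stay on the same word.
    subtoken_map = list()
    i = -1
    for t in text:
        i += 0 if t.startswith('##') else 1
        subtoken_map.append(i)
    # pad for [CLS] (first word) and [SEP] (last word)
    return [0] + subtoken_map + [max(i, 0)]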


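def get_sentence_map(text):
    # Map every subtoken to the index of its sentence; a '.' closes a sentence.
    sent_map = list()
    i = 0
    for t in text:
        sent_map.append(i)
        if t == ".":
            # the period belongs to the sentence it ends, so increment afterwards
            i += 1
    # [CLS] joins the first sentence, [SEP] the last one
    return [0] + sent_map + [sent_map[-1] if sent_map else 0]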


def create_jsonline(text):
    # Build one jsonlines record in the format described in the documentation.
    sents = tokenizer.tokenize(text)
    data = dict()
    data['clusters'] = []
    data['doc_key'] = 'nw'
    # the whole text is passed as a single segment
    data['sentences'] = [['[CLS]'] + sents + ['[SEP]']]
    # no speaker information at the moment: empty speaker for every subtoken
    data['speakers'] = [['[SPL]'] + [""] * len(sents) + ['[SPL]']]
    data['sentence_map'] = get_sentence_map(sents)
    data['subtoken_map'] = get_subtoken_map(sents)
    return data


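import string


def detokenize_bert(tokens):
    # Join WordPiece tokens back into readable text.
    text = ""
    for t in tokens:
        if t.startswith("##"):
            # continuation piece: glue onto the previous token
            text += t[2:]
        elif t in string.punctuation:
            # punctuation attaches to the previous token without a space
            text += t
        else:
            # new word: separate with a space unless it starts the text
            text += (" " if text else "") + t
    return text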
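
To make the subtoken map concrete, here is a small worked example on a toy token list. It assumes WordPiece's `##` continuation prefix; `word_indices` is a hypothetical helper that repeats the logic of `get_subtoken_map` so that the snippet runs on its own.

```python
def word_indices(subtokens):
    """Map each WordPiece subtoken to the index of the word it belongs to."""
    indices = []
    word = -1
    for tok in subtokens:
        # a token without the '##' prefix starts a new word
        if not tok.startswith("##"):
            word += 1
        indices.append(word)
    return indices


# toy input (illustrative only): "a trusted compatriot" after WordPiece
tokens = ["a", "trusted", "com", "##pa", "##tri", "##ot"]
inner = word_indices(tokens)

# pad for [CLS] and [SEP] the same way get_subtoken_map does
subtoken_map = [0] + inner + [inner[-1]]
print(subtoken_map)
```

All four pieces of "compatriot" share one word index, so a predicted span that ends in the middle of a word can still be traced back to the whole word.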

In [5]:
# create example from dossier
# the model fails on the full document (about 1000 tokens), so keep only the first 12 sentences
text = '.'.join(dos.split('.')[:12]) + "."
example = create_jsonline(text)

with open("../../../models/dossier.jsonlines", 'w') as f:
    f.write(json.dumps(example) + '\n')

In [8]:
print(example)

{'clusters': [], 'doc_key': 'nw', 'sentences': [['[CLS]', '1', '.', 'Speaking', 'to', 'a', 'trusted', 'com', '##pa', '##tri', '##ot', 'in', 'June', '2016', 'sources', 'A', 'and', 'B', ',', 'a', 'senior', 'Russian', 'Foreign', 'Ministry', 'figure', 'and', 'a', 'former', 'top', 'level', 'Russian', 'intelligence', 'officer', 'still', 'active', 'inside', 'the', 'K', '##rem', '##lin', 'respectively', ',', 'the', 'Russian', 'authorities', 'had', 'been', 'cult', '##ivating', 'and', 'supporting', 'US', 'Republican', 'presidential', 'candidate', ',', 'Donald', 'T', '##R', '##UM', '##P', 'for', 'at', 'least', '5', 'years', '.', 'Source', 'B', 'asserted', 'that', 'the', 'T', '##R', '##UM', '##P', 'operation', 'was', 'both', 'supported', 'and', 'directed', 'by', 'Russian', 'President', 'Vladimir', 'P', '##UT', '##IN', '.', 'Its', 'aim', 'was', 'to', 'so', '##w', 'disco', '##rd', 'and', 'di', '##sun', '##ity', 'within', 'the', 'US', 'itself', ',', 'but', 'more', 'especially', 'within', 'the', 'Tran

### Evaluation

In [4]:
# jsonlines holds one JSON document per line; this file contains a single prediction
with open("../../../models/predictions.jsonlines", 'r') as f:
    preds = json.loads(f.readline())

In [5]:
print(preds['predicted_clusters'])

[[[51, 60], [72, 75], [140, 143], [224, 227], [229, 229], [234, 234], [268, 271], [277, 277], [308, 311], [316, 316], [349, 352], [368, 373], [390, 395], [423, 423]], [[71, 76], [90, 90], [139, 144]], [[103, 105], [105, 105]], [[42, 44], [125, 127], [326, 326], [376, 378], [386, 386]], [[83, 88], [151, 155], [191, 193]], [[130, 137], [188, 188]], [[67, 68], [211, 212]], [[36, 39], [217, 220], [299, 304]], [[223, 223], [257, 257]], [[12, 13], [287, 288]], [[317, 338], [359, 359]]]


In [6]:
# input text
print(text)

NameError: name 'text' is not defined

In [10]:
# predicted coreference clusters
for cluster in preds['predicted_clusters']:
    i, j = cluster[0]
    first_mention = preds['sentences'][0][i: j+1]
    print(detokenize_bert(first_mention))
    # print([detokenize_bert(preds['sentences'][0][i: j+1]) for (i, j) in cluster], "\n")

US Republican presidential candidate, Donald TRUMP
the TRUMP operation
the US itself
the Russian authorities
Russian President Vladimir PUTIN
Source C, a senior Russian financial official
Source B
the Kremlin
feeding
June 2016
various lucrative real estate develop me business deals in Russia, especially in relation to the ongoing 2018 World Cup soccer tournament
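
The cell above prints only the first mention of each cluster. A minimal sketch of listing every mention per cluster follows, assuming spans are inclusive `[start, end]` subtoken indices as in `predicted_clusters`. The helpers `join_wordpieces` and `mentions_per_cluster` and the toy data are hypothetical, not part of the coref repository.

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces into words and join words with spaces."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


def mentions_per_cluster(tokens, clusters):
    """Return the text of every mention, grouped by cluster."""
    return [
        [join_wordpieces(tokens[start:end + 1]) for start, end in cluster]
        for cluster in clusters
    ]


# toy data (illustrative only), shaped like preds['sentences'][0]
# and preds['predicted_clusters']
toy_tokens = ["[CLS]", "the", "K", "##rem", "##lin", "said", "it", "agreed", "[SEP]"]
toy_clusters = [[[1, 4], [6, 6]]]

for mentions in mentions_per_cluster(toy_tokens, toy_clusters):
    print(mentions)
```

Seeing all mentions side by side makes it easier to spot clusters that merge unrelated entities.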


In [12]:
for (t, s) in preds['top_spans']:
    print(preds['sentences'][0][t:s+1])

['.']
['Speaking']
['a', 'trusted', 'com', '##pa', '##tri', '##ot']
['June', '2016']
['June', '2016', 'sources', 'A', 'and', 'B', ',', 'a', 'senior', 'Russian', 'Foreign', 'Ministry', 'figure', 'and', 'a', 'former', 'top', 'level', 'Russian', 'intelligence', 'officer', 'still', 'active', 'inside', 'the', 'K', '##rem', '##lin', 'respectively']
['June', '2016', 'sources', 'A', 'and', 'B', ',', 'a', 'senior', 'Russian', 'Foreign', 'Ministry', 'figure', 'and', 'a', 'former', 'top', 'level', 'Russian', 'intelligence', 'officer', 'still', 'active', 'inside', 'the', 'K', '##rem', '##lin', 'respectively', ',']
['sources', 'A', 'and', 'B', ',', 'a', 'senior', 'Russian', 'Foreign', 'Ministry', 'figure', 'and', 'a', 'former', 'top', 'level', 'Russian', 'intelligence', 'officer', 'still', 'active', 'inside', 'the', 'K', '##rem', '##lin', 'respectively']
['a', 'senior', 'Russian', 'Foreign', 'Ministry', 'figure']
['Russian', 'Foreign', 'Ministry']
['Foreign']
['Foreign', 'Ministry']
['Ministry']
['

### Notes

Pros:
1. The model seems to work well. 

Cons:
1. Prediction takes about 10 seconds, but loading bert_large takes about 2 minutes; spanbert_large will likely take much longer.
2. The model will need its own environment for its dependencies, because there are a lot of deprecation warnings.
3. The model will not work on an entire document in one go. The maximum number of tokens allowed is 512, and I think each dossier document has about 1000 tokens on average.
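
One possible way around the 512-token limit is to pack whole sentences into segments that stay under a token budget. The sketch below is an assumption rather than something tested against the model: `split_into_segments` is a hypothetical helper, the budget of 510 leaves room for `[CLS]` and `[SEP]`, and sentences are split on `.` as in the preprocessing above. How the segments are then fed to the model is left open.

```python
def split_into_segments(tokens, max_len=510):
    """Greedily pack whole sentences into segments of at most max_len tokens."""
    # split the token list into sentences at "."
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == ".":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)

    segments, segment = [], []
    for sent in sentences:
        # start a new segment when the next sentence would not fit
        if segment and len(segment) + len(sent) > max_len:
            segments.append(segment)
            segment = []
        # a single sentence longer than the budget is hard-split
        while len(sent) > max_len:
            segments.append(sent[:max_len])
            sent = sent[max_len:]
        segment.extend(sent)
    if segment:
        segments.append(segment)
    return segments


# toy example with a tiny budget so the split is visible
toy = ["a", "b", ".", "c", ".", "d", "e", "f", "."]
toy_segments = split_into_segments(toy, max_len=4)
print(toy_segments)
```

Splitting at sentence boundaries keeps mentions intact, but coreference links that cross a segment boundary would be lost unless the segments are processed together.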