## e2e-coref Model from AllenNLP

AllenNLP is a Python package for loading and running NLP models. It includes the pretrained end-to-end coreference model of Lee et al. (2017), which was state of the art in 2017.

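installation: `pip install allennlp` (note: the `allennlp.pretrained` module used below appears to exist only in pre-1.0 releases, so an older pinned version may be needed)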

documentation: https://github.com/allenai/allennlp

In [14]:
import sys
from allennlp import pretrained
from nltk import pos_tag
from tqdm import tqdm
from traceback import format_exc

sys.path.append("../../")
from src.preparation.data_loading import read_dossier

In [4]:
# load coref model
model = pretrained.neural_coreference_resolution_lee_2017()

# sample text
dos = read_dossier.read_dossier()

results = model.predict(dos[0])

In [3]:
type(results)

dict

In [4]:
results.keys()

dict_keys(['top_spans', 'predicted_antecedents', 'document', 'clusters'])

In [5]:
results

{'top_spans': [[4, 9],
  [8, 9],
  [8, 14],
  [9, 9],
  [12, 21],
  [16, 17],
  [16, 21],
  [18, 20],
  [23, 29],
  [27, 28],
  [31, 31],
  [32, 32],
  [33, 34],
  [37, 37],
  [39, 41],
  [44, 44],
  [47, 50],
  [47, 53],
  [48, 48],
  [52, 52],
  [56, 58],
  [60, 61],
  [62, 62],
  [64, 66],
  [65, 65],
  [71, 71],
  [73, 76],
  [78, 78],
  [87, 89],
  [89, 89],
  [96, 96],
  [102, 102],
  [104, 105],
  [104, 106],
  [110, 118],
  [115, 115],
  [119, 119],
  [121, 123],
  [122, 122],
  [128, 132],
  [130, 131],
  [134, 134],
  [136, 140],
  [141, 141],
  [142, 142],
  [142, 143],
  [142, 144],
  [144, 144],
  [147, 152],
  [153, 153],
  [155, 157],
  [159, 159],
  [161, 161],
  [162, 162],
  [165, 166],
  [168, 168],
  [169, 169],
  [171, 172],
  [177, 179],
  [179, 179],
  [181, 182],
  [183, 183],
  [185, 186],
  [189, 189],
  [190, 190],
  [190, 193],
  [192, 192],
  [192, 195],
  [192, 198],
  [197, 197],
  [197, 198],
  [197, 205],
  [201, 205],
  [201, 210],
  [208, 209],
  [208

In [6]:
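# print the tokens of every mention, one cluster per line
# (each span is a [start, end] pair of token indices, end inclusive)
for cluster in results['clusters']:
    print([results['document'][start:end+1] for start, end in cluster])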

[['the', 'Trump', 'operation'], ['Its'], ['the', 'Trump', 'operation']]
[['the', 'US', 'itself'], ['itself']]
[['Source', 'B'], ['Source', 'C', ',', 'a', 'senior', 'Russian', 'financial', 'official', ','], ['She'], ['Source', 'A'], ['she'], ['Source', 'E'], ['Source', 'E'], ['Source', 'B'], ['Source', 'B'], ['She'], ['her']]
[['Trump'], ['Trump'], ['Trump'], ['his'], ['his'], ['Trump'], ['Trump'], ['Trump'], ['Trump', "'s"], ['Trump', "'s"], ['Trump', "'s"], ['Trump'], ['Trump'], ['Trump', "'s"], ['Trump'], ['Trump']]
[['Russian', 'President', 'Vladimir', 'Putin'], ['Putin', "'s"], ['Putin'], ['him'], ['him'], ['he'], ['President'], ['he'], ['him'], ['him']]
[['the', 'Kremlin'], ['the', 'Kremlin'], ['The', 'Kremlin', "'s"], ['the', 'Kremlin', "'s"], ['Kremlin'], ['Kremlin']]
[['Russia', "'s"], ['Russia'], ['Russia'], ['Russia'], ['Russia']]
[['2016'], ['2018'], ['2016']]
[['World', 'War', 'II'], ['World', 'Cup']]
[['the', 'Russian', 'authorities'], ['the', 'Russian', 'authorities'], ['

### Evaluation

1. The model confuses entities with similar names and can put them in the same cluster, e.g. `[['World', 'War', 'II'], ['World', 'Cup']]`

2. Putin is split across two clusters that should be one:
    
    a. `[['Russian', 'President', 'Vladimir', 'Putin'], ['Putin', "'s"], ['Putin'], ['him'], ['him'], ['he'], ['President'], ['he'], ['him'], ['him']]`
    
    b. `[['Putin', 'himself'], ['himself'], ['his'], ['Putin', "'s"]]`

3. Sometimes it leaves a cluster unresolved, with pronouns only and no named antecedent: `[['their'], ['they'], ['they']]`

4. Some clusters are invalid and group unrelated mentions: `[['Transatlantic'], ['Mrs', 'Obama'], ['FSB'], ['2013'], ['the', 'FSB']]`

5. Otherwise the model is fairly fast. It took 25 seconds (about 1.5 s per document) to find the clusters in the 17 documents of the Steele Dossier report, not counting the time needed to resolve the text.

In [107]:
predictions = [model.predict(d) for d in tqdm(dos)]

100%|██████████| 17/17 [00:25<00:00,  1.50s/it]
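
Item 3 of the evaluation (clusters with pronouns only) could be detected automatically. The following is a minimal sketch, assuming the prediction format shown above (`document` as a token list, `clusters` as lists of inclusive `[start, end]` spans). The `PRONOUNS` set and the function names are hypothetical helpers, not part of AllenNLP, and the toy document is hand-written rather than taken from the dossier.

```python
from typing import List

# Hypothetical, deliberately small pronoun list (assumption, not exhaustive).
PRONOUNS = {
    "he", "him", "his", "himself", "she", "her", "hers", "herself",
    "it", "its", "itself", "they", "them", "their", "theirs", "themselves",
}


def is_pronoun_only(document: List[str], cluster: List[List[int]]) -> bool:
    """True if every token of every mention in the cluster is a pronoun."""
    for start, end in cluster:
        # spans are inclusive on both ends, hence end + 1
        for token in document[start:end + 1]:
            if token.lower() not in PRONOUNS:
                return False
    return True


def unresolved_clusters(document: List[str],
                        clusters: List[List[List[int]]]) -> List[List[List[int]]]:
    """Return the clusters that contain no named antecedent at all."""
    return [c for c in clusters if is_pronoun_only(document, c)]


# Toy input shaped like the model output (hand-written for illustration).
document = "They said their plan worked . Putin praised his team .".split()
clusters = [[[0, 0], [2, 2]], [[6, 6], [8, 8]]]
flagged = unresolved_clusters(document, clusters)
print(flagged)
```

Clusters flagged this way could be skipped by a resolver, since replacing a pronoun with another pronoun adds nothing.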


### TODO

1. Try training the model further for better accuracy.
2. Implement an efficient resolver.
3. Look into merging clusters (maybe).
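
One possible starting point for the cluster merging item is sketched below. It is a naive heuristic, not something the model provides: two clusters are merged when they share a capitalised token such as `Putin`. The `IGNORED` list, the function names, and the toy input are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical list of capitalised tokens that should not trigger a merge.
IGNORED = {"The", "He", "She", "It", "Its", "They", "His", "Her", "Their"}


def name_tokens(document: List[str], cluster: List[List[int]]) -> Set[str]:
    """Collect the capitalised, non-ignored tokens of all mentions in a cluster."""
    tokens = set()
    for start, end in cluster:
        for tok in document[start:end + 1]:  # spans are end-inclusive
            if tok[:1].isupper() and tok not in IGNORED:
                tokens.add(tok)
    return tokens


def merge_clusters(document: List[str],
                   clusters: List[List[List[int]]]) -> List[List[List[int]]]:
    """Merge clusters that share at least one name token (union-find)."""
    parent = list(range(len(clusters)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner: Dict[str, int] = {}  # name token -> first cluster that contains it
    for i, cluster in enumerate(clusters):
        for tok in name_tokens(document, cluster):
            if tok in owner:
                parent[find(i)] = find(owner[tok])
            else:
                owner[tok] = i

    merged: Dict[int, List[List[int]]] = {}
    for i, cluster in enumerate(clusters):
        merged.setdefault(find(i), []).extend(cluster)
    return [sorted(spans) for spans in merged.values()]


# Toy input shaped like the model output (hand-written for illustration).
document = ("Russian President Vladimir Putin said he trusts Putin himself . "
            "The Kremlin agreed with him .").split()
clusters = [[[0, 3], [5, 5]], [[7, 8], [14, 14]], [[10, 11]]]
merged = merge_clusters(document, clusters)
print(merged)
```

This heuristic is too eager for real use: distinct entities that share a word, such as `Source B` and `Source E`, or `World War II` and `World Cup`, would be merged as well, so some stricter matching rule would be needed.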

## CorefResolver

Just sketching an idea for the resolver.

In [24]:
class CorefResolver(object):
    """
        Class implementation for predicting and resolving coreference clusters 
    """
    def __init__(self, doc):
        """
            Initialises two attributes:
            1. doc: a single document string or a list of document strings
            2. preds: coreference predictions computed for each document
               (uses the global `model` loaded above).
        """
        self.doc = doc if isinstance(doc, list) else [doc]
        self.preds = [model.predict(d) for d in self.doc]
    
    def print_clusters(self):
        """
            prints the string form for all coreference clusters. 
        """
        for i in range(len(self.preds)):
            for cluster in self.preds[i]['clusters']:
                print([' '.join(self.preds[i]['document'][c[0]:c[1]+1]) for c in cluster])
    
    def resolve(self):
        """
            replace every pronoun mention in the document with the first mention of its cluster.
        """
        resolved = []
        for pred in self.preds:
            doc, pos = zip(*pos_tag(pred['document']))
            doc, pos = list(doc), list(pos)

            for cluster in pred['clusters']:
                fm = cluster[0]
                first_mention =  " ".join(doc[fm[0]: fm[1]+1])
                
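                for m in cluster[1:]:
                    try:
                        # only pronoun mentions (tags PRP and PRP$) are replaced
                        if pos[m[0]].startswith('PRP'):
                            doc[m[0]] = first_mention
                            
                            # possessive pronouns (PRP$) get a possessive marker
                            if pos[m[0]].endswith('$'):
                                doc[m[0]] += " 's"
                            
                            # blank the remaining tokens of the mention rather than
                            # deleting them, so that later span indices stay valid
                            doc[m[0]+1:m[1]+1] = [""] * (m[1] - m[0])
                    
                    except Exception:
                        print(format_exc())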
            
            resolved.append(" ".join(filter(lambda x: x != "", doc)))
        
        return resolved
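
The replacement step of `resolve` can be tried without loading the model. The sketch below works on a hand-written prediction dict and, as an assumption, swaps the NLTK tagger for two small hypothetical pronoun tables (`PERSONAL` and `POSSESSIVE`). It is meant to illustrate the intended behaviour, not to replace the class above.

```python
from typing import Dict, List

# Hypothetical pronoun tables standing in for the POS tagger (assumption).
# "her" is ambiguous between personal and possessive use; a tagger is needed
# to tell the two apart, so it is treated as possessive here for simplicity.
PERSONAL = {"he", "him", "she", "it", "they", "them"}
POSSESSIVE = {"his", "her", "its", "their"}


def resolve_tokens(pred: Dict[str, list]) -> str:
    """Replace pronoun mentions with the first mention of their cluster."""
    original: List[str] = pred["document"]
    doc = list(original)  # working copy; indices always match `original`

    for cluster in pred["clusters"]:
        first_start, first_end = cluster[0]
        first_mention = " ".join(original[first_start:first_end + 1])

        for start, end in cluster[1:]:
            head = original[start].lower()
            if head not in PERSONAL and head not in POSSESSIVE:
                continue  # non-pronoun mentions are left untouched
            suffix = " 's" if head in POSSESSIVE else ""
            doc[start] = first_mention + suffix
            # blank the rest of the span instead of deleting it,
            # so that the spans of later mentions remain valid
            for i in range(start + 1, end + 1):
                doc[i] = ""

    return " ".join(tok for tok in doc if tok)


# Toy input shaped like the model output (hand-written for illustration).
pred = {
    "document": "Putin said he would defend his position .".split(),
    "clusters": [[[0, 0], [2, 2], [5, 5]]],
}
resolved = resolve_tokens(pred)
print(resolved)
```

Blanking tokens and filtering them at the end keeps every span index valid while the clusters are processed, which is why the tokens are not removed from the list in place.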