# Problem Description

## Task 

Create a model that takes a sentence and identifies the organization a person works for and that person's name. E.g.:

> *I just went for a jog and ran into Ben Fischer who is the chief revenue officer of XYZ company.*


Your model should not care about capitalization, punctuation, etc., and **should identify "Ben Fischer" as a person and "XYZ company" as the organization**. Bonus points if you can **capture their role (chief revenue officer) as well**. You will need to **calculate precision & recall as well**. When thinking about this task you should ask yourself **what is a good baseline** to compare your results against.

## Provided Dataset

### TACRED

The TAC Relation Extraction Dataset (TACRED) is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., per:schools_attended and org:members) or are labeled as no_relation if no defined relation holds.


**Example from TACRED dev** 

> Sentence: PER-SUBJ graduated from North Korea's elite Kim II Sung University and ORG-OBJ ORG-OBJ

> Labels: *per:schools_attended*

### Basic Literature

**Zhang et al. (2017)**

*   Used a sequence model with an attention mechanism that uses token positions relative to the subject and object to classify the relation expressed between the entities.
*   Achieved 65.1% F1 on the TACRED test set, and claim that CNN-based models achieve higher precision while RNN-based models achieve higher recall.
*   Sentence length correlates negatively with model performance; RNNs are more robust to length.
*   Visualized model attention and found an emphasis on verbs and non-proper nouns that are critical for expressing relations.
*   Used entity masking to add entity type information as features, which ensures that proper nouns do not influence model weights.
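
To make the entity-masking idea concrete, here is a minimal sketch. It assumes TACRED-style fields (`token`, `subj_start`, `subj_end`, `obj_start`, `obj_end`, with inclusive end indices, as used later in this notebook). The function name `mask_entities` and the toy sentence are hypothetical, and the placeholder format mirrors the dev example shown above rather than any particular released implementation.

```python
def mask_entities(ex, subj_type, obj_type):
    """Replace subject/object tokens with type placeholders (end indices are inclusive)."""
    masked = list(ex["token"])
    for i in range(ex["subj_start"], ex["subj_end"] + 1):
        masked[i] = f"{subj_type}-SUBJ"
    for i in range(ex["obj_start"], ex["obj_end"] + 1):
        masked[i] = f"{obj_type}-OBJ"
    return masked

# Hypothetical toy example, not taken from TACRED
toy = {
    "token": ["Ben", "Fischer", "works", "at", "XYZ", "Company"],
    "subj_start": 0, "subj_end": 1,
    "obj_start": 4, "obj_end": 5,
}
print(" ".join(mask_entities(toy, "PER", "ORG")))
# prints "PER-SUBJ PER-SUBJ works at ORG-OBJ ORG-OBJ"
```

Because the names themselves are hidden, a model trained on such input can only rely on the surrounding context and the entity types.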


**Lyu and Chen (2021) - SOTA on TACRED**

*   Proposed a model-agnostic framework that significantly improves precision on relation classification tasks by restricting candidate relations based on candidate entity types, e.g. given a PER-SUBJ and an ORG-OBJ, relations such as per:origin are not considered during inference.
*   Two stages: a simple binary classifier to filter out non-relational sentences, followed by distinct multi-class classifiers with non-overlapping relation domains, i.e. one classifier handles all plausible relations for a given subject-object entity type pair.
*   Improved on the F1 established in Zhang et al. (2017) by 10.1 points, without specializing or over-engineering a novel model architecture
    - mainly due to an increase in precision upon imposing restrictions on candidate entities: 69.8 → 88.3
*   Used a GCN on dependency relations without candidate restriction as a baseline.
*   Found a weak negative correlation between false positives and the amount of corresponding training data, i.e. fewer examples of a relation → more false positives for that relation.
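
The type-restriction idea can be illustrated with a small sketch. This is a simplification and not the authors' implementation: the paper trains separate classifiers per entity-type pair, whereas the hypothetical `restrict` function below simply filters the scores of a single classifier. The `ALLOWED` table is an illustrative subset of TACRED relations, and the scores are made up.

```python
# Illustrative subset only: a few TACRED relations keyed by (subject type, object type)
ALLOWED = {
    ("PER", "ORG"): {"per:employee_of", "per:schools_attended"},
    ("ORG", "PER"): {"org:founded_by", "org:shareholders", "org:top_members/employees"},
}

def restrict(scores, subj_type, obj_type):
    """Pick the best-scoring relation among those plausible for the entity-type pair."""
    allowed = ALLOWED.get((subj_type, obj_type), set())
    kept = {rel: score for rel, score in scores.items() if rel in allowed}
    return max(kept, key=kept.get) if kept else "no_relation"

# Made-up scores: the top-scoring relation is implausible for a PER subject and ORG object
scores = {"per:origin": 0.6, "per:employee_of": 0.3, "org:founded_by": 0.1}
print(restrict(scores, "PER", "ORG"))
# prints "per:employee_of"
```

Without the restriction the prediction would be per:origin, a false positive of exactly the kind the framework is designed to remove.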

## Research Questions

In my task breakdown, I'll address these research questions:

1. How can TACRED be transformed into a dataset through which we can evaluate a solution to the challenge? <br>

    - Which examples in TACRED are relevant for the objective, and how can we establish gold labels to quantify performance? <br>

2. What metrics can be used to describe the performance of a solution? <br>

3. How should I go about establishing a baseline approach? <br>

## Subtasks

### Subtask 1

1. Refactor TACRED
    - get all the examples that have original ground truth relation labels in:

    > org:founded_by, <br> org:shareholders, <br> org:top_members/employees, <br> per:employee_of, <br> per:title <br>

    - in the first 3 relation types, the ground truth ORG is the span between the *subj_start* and *subj_end* indices, and the ground truth PER is the span between *obj_start* and *obj_end* <br>

    - in the last 2 relation types, the roles are swapped: the ground truth PER is the span between *subj_start* and *subj_end*, and for *per:employee_of* the ground truth ORG is the span between *obj_start* and *obj_end* <br>

      - in the *per:title* relation, the object is the title, so there is no guarantee that an ORG exists in the ground truth - therefore these examples should be further filtered <br>

      - for each *per:title* example, if the ORGANIZATION NER tag is found in exactly one token span within the sentence, the example can safely be added to the refactored dataset, with the indices of that span used as the ORG ground truth <br>

    - after looking at examples in the *per:_* relations, I noticed that some of the *subj* spans (which are meant to denote a PER, i.e. a proper noun) demarcate pronouns - these examples will be removed from the refactored dataset <br>

    - in a more intricate solution, however, coreference resolution could be used to map the pronouns back to the proper noun (NNP) they refer to
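
The subject/object role mapping described above can be sketched as follows. This is a minimal illustration, assuming TACRED's inclusive span indices; the helper name `person_and_org_tokens` and the toy example are hypothetical, and *per:title* is deliberately left out because it needs the extra filtering step.

```python
ORG_IS_SUBJECT = {"org:founded_by", "org:shareholders", "org:top_members/employees"}
PER_IS_SUBJECT = {"per:employee_of"}

def person_and_org_tokens(ex):
    """Return (person_tokens, org_tokens), or None if the relation is not handled here."""
    subj = ex["token"][ex["subj_start"]:ex["subj_end"] + 1]
    obj = ex["token"][ex["obj_start"]:ex["obj_end"] + 1]
    if ex["relation"] in ORG_IS_SUBJECT:
        return obj, subj
    if ex["relation"] in PER_IS_SUBJECT:
        return subj, obj
    return None  # per:title and every other relation need separate handling

# Hypothetical toy example in the TACRED field layout
toy = {
    "token": ["Marco", "Contiero", "advises", "Greenpeace"],
    "subj_start": 0, "subj_end": 1,
    "obj_start": 3, "obj_end": 3,
    "relation": "per:employee_of",
}
print(person_and_org_tokens(toy))
# prints (['Marco', 'Contiero'], ['Greenpeace'])
```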

### Subtask 2

2. Rationalize and Determine Evaluation Criteria <br>

  - because the task is well defined, I will elect to use strict-match criteria to evaluate model performance <br>

  - with the exact match criterion, precision/recall → F1 will be calculated at both coarse and fine-grained levels of analysis <br>

      - at the entity-level, **to determine which of the two entities is harder to predict**, i.e. a pseudo sample complexity of the curated dataset <br>
      - at the example-level, **to determine the appropriateness of a given model for the task**<br>
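
As a rough sketch of what strict matching means at both levels, the snippet below scores predictions represented as sets of `(type, start, end)` tuples. The representation and the function names are assumptions made for illustration; the notebook itself relies on `seqeval` for the actual numbers.

```python
def strict_prf(gold_spans, pred_spans):
    """Entity-level scores; each argument holds one set of (type, start, end) per example."""
    tp = sum(len(g & p) for g, p in zip(gold_spans, pred_spans))
    n_pred = sum(len(p) for p in pred_spans)
    n_gold = sum(len(g) for g in gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def example_level_accuracy(gold_spans, pred_spans):
    """Share of examples in which every predicted span matches the gold spans exactly."""
    return sum(g == p for g, p in zip(gold_spans, pred_spans)) / len(gold_spans)

# Toy data: the ORG span in example 1 is one token short, example 2 has no prediction
gold = [{("PER", 0, 2), ("ORG", 4, 6)}, {("PER", 1, 2), ("ORG", 5, 6)}]
pred = [{("PER", 0, 2), ("ORG", 4, 5)}, set()]

print(strict_prf(gold, pred))
print(example_level_accuracy(gold, pred))
```

Under strict matching the truncated ORG span earns no partial credit, so only one of the two predictions counts as correct and neither example is fully correct.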

### Subtask 3

3. Establish a Performance Baseline

    - TACRED, as it ships, comes with dependency relation (UD), part-of-speech, and NER tags for each token in every example, produced by Stanford's well-cited and performant CoreNLP toolkit <br>

      - these tags are easily obtainable from many parsers, offered through several frameworks (spaCy, TextBlob, NLTK, etc.) → accessing them at any stage in a solution would not compromise the integrity of the research <br>

      - therefore, a solution utilising these tags would serve as an appropriate baseline approach, because it <br>
      
        - is inexpensive <br>

        - is significantly better than random guessing <br>

        - relies on information that, at the bare minimum, a more expensive solution should latently encode or otherwise account for <br>

### Subtask 4

4. Reflect, Iterate, and Improve Upon the Baseline <br>
    
    - after the baseline is established, new approaches should be implemented/tested one-at-a-time, such that each successive iteration <br>
    
      - is justified by reflecting on its predecessor <br>

      - attempts to improve specific areas of weakness, eg. precision of PER entity type extraction <br>

### Subtask 5

5. Perform Inference on the Test Set <br>

  - with each model/solution iteration

# Solution

## Subtask 1 - Refactor TACRED

### **Setup**

In [None]:
%%capture

import json
import pickle
from pprint import pprint
from google.colab import drive
drive.mount('/content/drive/')

%cd '/content/drive/My Drive/Riley/src'

with open('data/TACRED/train.json', 'r') as f:
  tacred_train = json.load(f)

with open('data/TACRED/dev.json', 'r') as f:
  tacred_dev = json.load(f)

with open('data/TACRED/test.json', 'r') as f:
  tacred_test = json.load(f)

def read_tacred(ex): # make it easier to read examples from the dataset
  print('subject: ', ' '.join(ex['token'][ex['subj_start']:ex['subj_end'] + 1]))
  print('object: ', ' '.join(ex['token'][ex['obj_start']:ex['obj_end'] + 1]))
  print('relation: ', ex['relation'], '\n')
  pprint(str(' '.join(ex['token'])))

In [None]:
read_tacred(tacred_train[21404])

subject:  Marco Contiero
object:  Greenpeace European Unit
relation:  per:employee_of 

('`` We look forward to the day when the European Commission also puts defence '
 'of the public interest before the interests of US agribusiness and its '
 "lobbyists in Brussels and at the WTO , '' said Marco Contiero , policy "
 'adviser on GMOs at Greenpeace European Unit .')


### **Code**

In [None]:
def get_person_from_example(ex): # readability func
  return ' '.join([t for t, l in zip(ex['token'], ex['label']) if l == 1])

def get_org_from_example(ex): # readability func
  return ' '.join([t for t, l in zip(ex['token'], ex['label']) if l == 2])

def get_indices_from_per_title_reln(ex): # special case - we can get some examples from per:title
  candidate_orgs, seen = [], 0
  for idx, ent_type in enumerate(ex['stanford_ner']): 
    
    if ent_type == 'ORGANIZATION' and idx >= seen: # >= so a span starting at index 0 keeps its first token
      b, e = idx, idx
      while (e < len(ex['stanford_ner'])):
        if ex['stanford_ner'][e] == 'ORGANIZATION':
          e += 1
          continue
        break
      candidate_orgs.append((b, e))
      seen = e # don't consider the same span twice
  
  if len(candidate_orgs) == 1: # if there is exactly one organization in the sentence, we can use the example
    comp_start, comp_end = candidate_orgs[0]
    person_indices = set(range(ex['subj_start'], ex['subj_end']+1)) # person is the subject, but title is the object
    org_indices = set(range(comp_start, comp_end)) # so don't use the title
    return person_indices, org_indices
  else:
    return None

def convert_tacred(examples): # apply to all splits independently

  # NOTE: 'per:title' is not in RELATIONS, so the per:title branch below is never reached as written;
  # add it here to include those examples (this changes the size of the refactored splits)
  RELATIONS = {'org:founded_by', 'org:shareholders', 'org:top_members/employees', 'per:employee_of'}
  bad_subjects = {'he', 'his', 'she', 'mom', 'her'}
  task_split = [] # to be pickled
  for example in examples:

    if example['relation'] not in RELATIONS: # don't care about hard negatives for now
      continue

    if example['relation'] == 'per:title': # if there is a person, their title, and exactly one ORGANIZATION span 
                                           # in the example, we can safely use them as ground truths
      indices = get_indices_from_per_title_reln(example)
      if indices is None: # zero or several ORGANIZATION spans - skip the example
        continue
      person_indices, org_indices = indices

    elif example['relation'] == 'per:employee_of': # object is company, subject is person
      person_indices = set(range(example['subj_start'], example['subj_end']+1))
      org_indices = set(range(example['obj_start'], example['obj_end']+1))

    else: # object is person, subject is company
      person_indices = set(range(example['obj_start'], example['obj_end']+1))
      org_indices = set(range(example['subj_start'], example['subj_end']+1))
    new_ex = {
        'id' : example['id'],
        'relation' : example['relation'],
        'token' : example['token'],
        'pos' : example['stanford_pos'],
        'ner' : example['stanford_ner'],
        'dep_head' : [idx - 1 for idx in example['stanford_head']], # convert from 1-based index
        'dep_reln' : example['stanford_deprel'],
        'per_span' : list(person_indices),
        'org_span' : list(org_indices),
        # for evaluating the naive token classification approach
        'label' : [0 if i not in (person_indices | org_indices) else 1 if i in person_indices else 2 for i in range(len(example['token']))] # target for Token Classification framework
    }

    # some of the subjects/objects referring to a person are not names, 
    # remove these examples to strengthen the ethos of ground truth labels
    if get_person_from_example(new_ex).lower() in bad_subjects: 
      continue

    task_split.append(new_ex)

  return task_split

In [None]:
# sanity checks !

task_sanity = convert_tacred(tacred_train[:5])
assert tacred_train[0]['token'][tacred_train[0]['obj_start']:tacred_train[0]['obj_end']+1] == [t for t, l in zip(task_sanity[0]['token'], task_sanity[0]['label']) if l == 1]

In [None]:
# process and serialize the refactored dataset

train, dev, test = convert_tacred(tacred_train), convert_tacred(tacred_dev), convert_tacred(tacred_test)

with open('data/refactored/train.pkl', 'wb') as d:
  pickle.dump(train, d)

with open('data/refactored/dev.pkl', 'wb') as d:
  pickle.dump(dev, d)

with open('data/refactored/test.pkl', 'wb') as d:
  pickle.dump(test, d)

## Subtask 2 - Evaluation Metrics

### **Code**

In [None]:
!pip install --quiet seqeval

from seqeval.metrics import classification_report
from seqeval.scheme import IOB2

def schemify(labels, person_span, org_span): # convert from 0, 1, 2 labels to IOB2 format for evaluating naive token class approach
  if len(person_span) == 0 or len(org_span) == 0:
    return ['O']*len(labels)

  formatted_preds, bper, borg = [], min(person_span), min(org_span)
  for i, l in enumerate(labels):

    if i == bper:
      formatted_preds.append('B-PER')
    elif i == borg:
      formatted_preds.append('B-ORG')
    elif l == 1:
      formatted_preds.append('I-PER')
    elif l == 2:
      formatted_preds.append('I-ORG')
    else:
      formatted_preds.append('O')
  return formatted_preds

TC_TRUTH_TRAIN = [schemify(ex['label'], set(ex['per_span']), set(ex['org_span'])) for ex in train]
TC_TRUTH_DEV = [schemify(ex['label'], set(ex['per_span']), set(ex['org_span'])) for ex in dev]
TC_TRUTH_TEST = [schemify(ex['label'], set(ex['per_span']), set(ex['org_span'])) for ex in test]


## Subtask 3 - Naive Token Classification Baseline

The simplest approach is to predict PERSON and ORGANIZATION spans directly from the token NER tags. If a sentence contains multiple PERSON or ORGANIZATION spans, the PERSON-ORGANIZATION pair with the shortest token distance between span midpoints is chosen. <br>

This approach requires no training data, so it is applied directly to the dev split for evaluation.

### **Setup**

In [None]:
%%capture

import pickle
import numpy as np
from pprint import pprint
from google.colab import drive
drive.mount('/content/drive/')

%cd '/content/drive/My Drive/Riley/src'

with open('data/refactored/train.pkl', 'rb') as p:
  train = pickle.load(p)

with open('data/refactored/dev.pkl', 'rb') as p:
  dev = pickle.load(p)

with open('data/refactored/test.pkl', 'rb') as p:
  test = pickle.load(p)

### **Code**

In [None]:
def get_naive_predictions(examples):
  predictions = []
  random_guesses = 0
  for ex in examples:

    # establish candidates
    candidate_persons, candidate_orgs, seen = [], [], 0
    for idx, ent_type in enumerate(ex['ner']): 
      
      if ent_type == 'PERSON' and idx >= seen: # >= so a span at index 0 (or directly after another span) keeps its first token
        b, e = idx, idx
        while (e < len(ex['ner'])):
          if ex['ner'][e] == 'PERSON':
            e += 1
            continue
          break
        candidate_persons.append((b, e))
        seen = e # don't consider the same span twice
      
      elif ent_type == 'ORGANIZATION' and idx >= seen: # >= so a span at index 0 (or directly after another span) keeps its first token
        b, e = idx, idx
        while (e < len(ex['ner'])):
          if ex['ner'][e] == 'ORGANIZATION':
            e += 1
            continue
          break
        candidate_orgs.append((b, e))
        seen = e # don't consider the same span twice

    # some data don't have PERSONS or ORGS - we can't make a prediction
    if len(candidate_persons) == 0 or len(candidate_orgs) == 0:
      predictions.append(schemify([-1]*len(ex['token']), set(), set()))
      random_guesses += 1
      continue

    # making a naive decision based on distance between two candidates
    distmatrix = np.zeros((len(candidate_persons), len(candidate_orgs)))
    for i, (pb, pe) in enumerate(candidate_persons):
      pm = ((pb + pe)/2) # just use average index for speed
      for j, (ob, oe) in enumerate(candidate_orgs):
        om = ((ob + oe)/2)
        distmatrix[i, j] = np.abs(pm - om)

    person_idx, org_idx = np.unravel_index(distmatrix.argmin(), distmatrix.shape)
    person_start, person_end = candidate_persons[person_idx]
    org_start, org_end = candidate_orgs[org_idx]

    pred_person_span, pred_org_span = set(range(person_start, person_end)), set(range(org_start, org_end))
    pred_labels = [0 if i not in (pred_person_span | pred_org_span) else 1 if i in pred_person_span else 2 for i in range(len(ex['token']))]


    predictions.append(schemify(pred_labels, pred_person_span, pred_org_span))
  
  print("did not make a prediction for", random_guesses, "examples!")
  return predictions

### **Evaluation**

In [None]:
naive_preds = get_naive_predictions(dev)

did not make a prediction for 175 examples!


In [None]:
# sanity check !

assert len(naive_preds) == len(TC_TRUTH_DEV)
assert type(naive_preds[0][0]) == type(TC_TRUTH_DEV[0][0])

In [None]:
print(classification_report(TC_TRUTH_DEV, naive_preds, scheme=IOB2, mode='strict'), '\n')

              precision    recall  f1-score   support

         ORG       0.73      0.59      0.66       919
         PER       0.68      0.55      0.60       919

   micro avg       0.70      0.57      0.63      1838
   macro avg       0.70      0.57      0.63      1838
weighted avg       0.70      0.57      0.63      1838
 



### **Reflection**

- Company names are slightly easier to extract than person names <br>

- Many of the examples have names that are not recognized by Stanford's NER system: for 175 of the 919 dev examples (about 19%), no prediction was made at all <br>

- This means that the metrics displayed in the classification report **underrepresent** the potential of this approach [to some extent] <br>

- Even though the refactored dataset has about 20x fewer examples in the dev split than TACRED, and our task is simpler/has a smaller domain, the naive approach has set a strong baseline to improve upon
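
To quantify how much the skipped examples cap the metrics, the arithmetic below uses only figures already reported above: 919 dev examples, 175 without a prediction, and the rounded PER recall of 0.55. It assumes one gold PER per example, which matches the support of 919 in the report. The variable names are illustrative.

```python
total, skipped = 919, 175            # dev examples, examples with no prediction
attempted = total - skipped          # examples where a prediction was made

# Skipped examples can never contribute a true positive, so recall cannot exceed this
recall_ceiling = attempted / total

# Recall restricted to the attempted examples, from the rounded PER recall in the report
per_recall = 0.55
per_recall_attempted = per_recall * total / attempted

print(attempted, round(recall_ceiling, 2), round(per_recall_attempted, 2))
# prints 744 0.81 0.68
```

So even a perfect extractor on the attempted examples would top out at roughly 0.81 recall. The recall on attempted examples comes out close to the reported PER precision of 0.68, which is what one would expect when exactly one PER span is predicted per attempted example.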

## Subtask 4 - Improving on the Baseline

### Iteration 1 - Strengthening the Naive Approach (Token Classification)

#### **Motivation**

**Correlation between dependency relations and entity types**

Based on token-level frequencies from the training set, *compound* and *nsubj* are by far the most common dependency relations for PERSON tokens, and *compound* and *nmod* for ORGANIZATION tokens. I hypothesize that PERSON names will tend to be anchored on an *nsubj* token and span any adjacent *compound* tokens, and likewise ORGANIZATION names on an *nmod* token.

In [None]:
def most_common_deprel_for_entity(examples, ent):
  dep2frq = {}
  for ex in examples:
    deps = [dr for i, dr in enumerate(ex['dep_reln']) if i in ex[f"{ent}_span"]]
    for d in deps:
      if d not in dep2frq:
        dep2frq[d] = 0
      dep2frq[d] += 1
  return sorted(dep2frq.items(), key=lambda x:x[1], reverse=True)

In [None]:
per_deprels = most_common_deprel_for_entity(train, 'per')
org_deprels = most_common_deprel_for_entity(train, 'org')

pprint(per_deprels)
print()
pprint(org_deprels)

[('compound', 3249),
 ('nsubj', 2016),
 ('nmod', 395),
 ('appos', 195),
 ('conj', 192),
 ('dobj', 142),
 ('nmod:poss', 97),
 ('nsubjpass', 86),
 ('ROOT', 77),
 ('dep', 65),
 ('punct', 18),
 ('root', 17),
 ('amod', 14),
 ('case', 7),
 ('xcomp', 7),
 ('det', 7),
 ('cc', 6),
 ('ccomp', 5),
 ('acl:relcl', 3),
 ('iobj', 3),
 ('nmod:tmod', 2),
 ('advcl', 2),
 ('aux', 1),
 ('mark', 1)]

[('compound', 4117),
 ('nmod', 1757),
 ('case', 327),
 ('nmod:poss', 268),
 ('nsubj', 198),
 ('amod', 173),
 ('conj', 165),
 ('appos', 132),
 ('cc', 115),
 ('dobj', 107),
 ('det', 97),
 ('dep', 41),
 ('punct', 20),
 ('ROOT', 13),
 ('nsubjpass', 13),
 ('ccomp', 8),
 ('nummod', 8),
 ('root', 6),
 ('mark', 3),
 ('xcomp', 3),
 ('aux', 2),
 ('acl:relcl', 2),
 ('advcl', 2),
 ('nmod:tmod', 1),
 ('advmod', 1),
 ('acl', 1)]
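
The strength of this correlation can be expressed as a share of all entity tokens. The helper below is a small sketch; `share` is a hypothetical name, and the `"other"` rows are an aggregation of every relation outside the top two in the output above. In the notebook, `share(per_deprels, {...})` could be called on the full lists directly.

```python
def share(dep_counts, deps_of_interest):
    """Fraction of tokens whose dependency relation is in deps_of_interest."""
    total = sum(count for _, count in dep_counts)
    hits = sum(count for dep, count in dep_counts if dep in deps_of_interest)
    return hits / total

# Top two rows copied from the output above; "other" lumps together all remaining rows
per_counts = [("compound", 3249), ("nsubj", 2016), ("other", 1342)]
org_counts = [("compound", 4117), ("nmod", 1757), ("other", 1706)]

print(round(share(per_counts, {"compound", "nsubj"}), 3))
print(round(share(org_counts, {"compound", "nmod"}), 3))
# prints 0.797 and 0.775
```

Roughly four out of five PERSON tokens carry *compound* or *nsubj*, and a similar share of ORGANIZATION tokens carry *compound* or *nmod*. Since *compound* dominates both lists, it is the second relation (*nsubj* vs. *nmod*) that actually separates the two entity types.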


**Correlation between token positions and entity types**

Based on relative token positions in the training set, PERSON tokens are most likely to appear towards the beginning of the sentence, while ORGANIZATION tokens are spread much more evenly, with only a slight skew towards the beginning.

Based on this, we can define a neighborhood function that gives more consideration to tokens earlier in the sentence when predicting entities.

In [None]:
def most_common_position_for_entity(examples, ent):
  # we can calculate relative positions since each sentence differs in length
  # 20 bin corresponds to first 20% of tokens, 40 etc. 
  pos2frq = {20 : 0, 40 : 0, 60 : 0, 80 : 0, 100 : 0} 
  for ex in examples:
    pos = [int(100*i/len(ex['label'])) for i in ex[f"{ent}_span"]]
    for p in pos:
      if p < 20:
        pos2frq[20] += 1
        continue
      elif p < 40: 
        pos2frq[40] += 1
        continue
      elif p < 60:
        pos2frq[60] += 1
        continue
      elif p < 80:
        pos2frq[80] += 1
        continue
      else:
        pos2frq[100] += 1

  return sorted(pos2frq.items(), key=lambda x:x[1], reverse=True)

In [None]:
per_pos = most_common_position_for_entity(train, 'per')
org_pos = most_common_position_for_entity(train, 'org')

pprint(per_pos)
print()
pprint(org_pos)

[(20, 2405), (40, 1211), (60, 1164), (80, 1108), (100, 719)]

[(40, 1836), (100, 1668), (20, 1510), (80, 1396), (60, 1170)]
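
The neighborhood function is not implemented in this notebook. One hypothetical way to realize it is to turn the per-bin counts above into a position prior, as sketched below. `position_prior` and `PER_BIN_COUNTS` are illustrative names, and the binning mirrors `most_common_position_for_entity`.

```python
# PERSON token counts per relative-position bin, copied from the output above
PER_BIN_COUNTS = {20: 2405, 40: 1211, 60: 1164, 80: 1108, 100: 719}

def position_prior(index, sentence_len, bin_counts):
    """Relative frequency of the 20%-wide bin that the token at `index` falls into."""
    pct = int(100 * index / sentence_len)
    upper = min(100, 20 * (pct // 20 + 1))  # bin key: 20, 40, 60, 80 or 100
    return bin_counts[upper] / sum(bin_counts.values())

# First and last token of a 30-token sentence
print(round(position_prior(0, 30, PER_BIN_COUNTS), 3))
print(round(position_prior(29, 30, PER_BIN_COUNTS), 3))
# prints 0.364 and 0.109
```

Such a prior could be multiplied into a candidate's score so that, all else being equal, PERSON candidates near the start of the sentence are preferred.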


The analysis above suggests that dependency relations and token positions may guide entity predictions when no NER tags are available. POS tags (i.e. NN, NNS, NNP, etc.) can trivially inform the prediction as well. 


In this iteration, the baseline model will be slightly augmented so that it can make a prediction on the roughly 19% of cases (175/919) where it previously predicted nothing.


```
# Pseudocode

# case A: no NER tags for person (see find_person)

  # if it has a span with compounds preceding a nsubj:
    # return the span closest to the beginning 
  # elif it has a nsubj dependency:
    # if left tokens are part of noun phrase, include these in span (Treebank: 19280 instances of nsubj (96%) are right-to-left (child precedes parent))
    # return the span closest to the beginning
  # otherwise:
    # return the first contiguous span of NPs in the sentence

# case B: no NER tags for company (see find_org)

  # if it has a span with nmod preceding compounds:
    # return the span closest to the beginning
  # elif it has a nmod dependency:
    # if right tokens are part of noun phrase, include these in span (Treebank 18551 instances of nmod (94%) are left-to-right (parent precedes child))
  # otherwise:
    # return the first contiguous span of NPs in the sentence (that are not already classified as person)

# case C: no NER tags for either (see predict_without_ent)
  # do A then B
```

#### **Setup**

In [None]:
%%capture

import pickle
import numpy as np
from pprint import pprint
from google.colab import drive
drive.mount('/content/drive/')

%cd '/content/drive/My Drive/Riley/src'

with open('data/refactored/train.pkl', 'rb') as p:
  train = pickle.load(p)

with open('data/refactored/dev.pkl', 'rb') as p:
  dev = pickle.load(p)

with open('data/refactored/test.pkl', 'rb') as p:
  test = pickle.load(p)

#### **Code**

In [None]:
def predict_without_ent(ex, candidate_persons, candidate_orgs): # do this when we can't use NER tags
  if len(candidate_persons) + len(candidate_orgs) == 0:
    per_span = find_person(ex)
    org_span = find_org(ex, per_span)
  elif len(candidate_persons) == 0:
    per_span = find_person(ex)
    org_span = set(range(*candidate_orgs[0]))
  else:
    per_span = set(range(*candidate_persons[0]))
    org_span = find_org(ex, per_span)

  pred_labels = [0 if i not in (per_span | org_span) else 1 if i in per_span else 2 for i in range(len(ex['token']))]
  return schemify(pred_labels, per_span, org_span)

def has_compound_nsubj_seq(ex): # helper boolean
    return ('compound', 'nsubj') in zip(ex['dep_reln'], ex['dep_reln'][1:])

def has_nmod_compound_seq(ex): # helper boolean
    return ('nmod', 'compound') in zip(ex['dep_reln'], ex['dep_reln'][1:])

def extend_span(tags, start, allowed, blocked=frozenset()): # exclusive end of the run of allowed tags starting at start
  e = start
  while e < len(tags) and tags[e] in allowed and e not in blocked:
    e += 1
  return e

def first_noun_span(ex, blocked=frozenset()): # first contiguous run of NN* tokens outside blocked
  nouns = [p.startswith('NN') for p in ex['pos']]
  b = next((i for i, is_noun in enumerate(nouns) if is_noun and i not in blocked), None)
  if b is None:
    return set() # nothing to guess from - schemify will emit all 'O'
  return set(range(b, extend_span(nouns, b, [True], blocked)))

def find_person(ex): # returns set of indices with person
  deps = ex['dep_reln']

  if has_compound_nsubj_seq(ex):
    # earliest run of compounds that is directly followed by an nsubj
    for idx, dep in enumerate(deps):
      if dep == 'compound':
        e = extend_span(deps, idx, ['compound'])
        if e < len(deps) and deps[e] == 'nsubj':
          return set(range(idx, e + 1)) # include the nsubj head itself

  if 'nsubj' in deps:
    e = deps.index('nsubj')
    b = e
    while b > 0 and ex['pos'][b - 1].startswith('NN'): # absorb nouns to the left of the nsubj head
      b -= 1
    return set(range(b, e + 1))

  return first_noun_span(ex)

def find_org(ex, per_span): # returns set of indices with organization, not overwriting person
  deps = ex['dep_reln']

  if has_nmod_compound_seq(ex):
    # earliest nmod directly followed by a compound, outside the person span
    for idx in range(len(deps) - 1):
      if deps[idx] == 'nmod' and deps[idx + 1] == 'compound' and idx not in per_span:
        return set(range(idx, extend_span(deps, idx, ['compound', 'nmod'], per_span)))

  for idx, dep in enumerate(deps): # otherwise the first nmod outside the person span
    if dep == 'nmod' and idx not in per_span:
      return set(range(idx, extend_span(deps, idx, ['compound', 'nmod'], per_span)))

  return first_noun_span(ex, per_span)

def get_slightly_less_naive_predictions(examples):
  predictions = []
  for ex in examples:

    # establish candidates
    candidate_persons, candidate_orgs, seen = [], [], 0
    for idx, ent_type in enumerate(ex['ner']): 
      
      if ent_type == 'PERSON' and idx >= seen: # >= so a span at index 0 (or directly after another span) keeps its first token
        b, e = idx, idx
        while (e < len(ex['ner'])):
          if ex['ner'][e] == 'PERSON':
            e += 1
            continue
          break
        candidate_persons.append((b, e))
        seen = e # don't consider the same span twice
      
      elif ent_type == 'ORGANIZATION' and idx >= seen: # >= so a span at index 0 (or directly after another span) keeps its first token
        b, e = idx, idx
        while (e < len(ex['ner'])):
          if ex['ner'][e] == 'ORGANIZATION':
            e += 1
            continue
          break
        candidate_orgs.append((b, e))
        seen = e # don't consider the same span twice

    # some data don't have PERSONS or ORGS - let's make an educated guess this time
    if len(candidate_persons) == 0 or len(candidate_orgs) == 0:
      predictions.append(predict_without_ent(ex, candidate_persons, candidate_orgs))
      continue

    # making a naive decision based on distance between two candidates
    distmatrix = np.zeros((len(candidate_persons), len(candidate_orgs)))
    for i, (pb, pe) in enumerate(candidate_persons):
      pm = ((pb + pe)/2) # just use average index 
      for j, (ob, oe) in enumerate(candidate_orgs):
        om = ((ob + oe)/2)
        distmatrix[i, j] = np.abs(pm - om)

    person_idx, org_idx = np.unravel_index(distmatrix.argmin(), distmatrix.shape)
    person_start, person_end = candidate_persons[person_idx]
    org_start, org_end = candidate_orgs[org_idx]

    pred_person_span, pred_org_span = set(range(person_start, person_end)), set(range(org_start, org_end))
    pred_labels = [0 if i not in (pred_person_span | pred_org_span) else 1 if i in pred_person_span else 2 for i in range(len(ex['token']))]

    predictions.append(schemify(pred_labels, pred_person_span, pred_org_span))
  
  return predictions

#### **Evaluation**

In [None]:
less_naive_preds = get_slightly_less_naive_predictions(dev)

In [None]:
# sanity check !

assert len(less_naive_preds) == len(TC_TRUTH_DEV)
assert type(less_naive_preds[0][0]) == type(TC_TRUTH_DEV[0][0])

In [None]:
print(classification_report(TC_TRUTH_DEV, less_naive_preds, scheme=IOB2, mode='strict'))

              precision    recall  f1-score   support

         ORG       0.65      0.64      0.64       919
         PER       0.67      0.63      0.65       919

   micro avg       0.66      0.63      0.65      1838
   macro avg       0.66      0.63      0.65      1838
weighted avg       0.66      0.63      0.65      1838



#### **Reflection**

- Adding rules to ensure a guess is made for each entity in each sample increased our recall. This was expected, since recall is the proportion of gold entities that are correctly predicted, and by not attempting a prediction on 175/919 examples we were capping that aspect of performance.

- Since our precision decreased, we can infer that our dependency/token-based heuristics were too weak and made poor guesses for those 175 hard examples. We could strengthen the rules, but at this point leveraging transfer learning would be simpler and more effective.

### Iteration 2 - Fine-tuning BERT for Relation Extraction 


#### **Setup**

In [None]:
%%capture

!pip install -U pip setuptools wheel
!pip install spacy
!python -m spacy download en_core_web_trf
!pip install spacy transformers

import pickle
import numpy as np
from pprint import pprint
from google.colab import drive
from spacy.tokens import Span, DocBin, Doc
from spacy.vocab import Vocab
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
import spacy

drive.mount('/content/drive/')
%cd '/content/drive/My Drive/Riley/src'

with open('data/refactored/train.pkl', 'rb') as p:
  train = pickle.load(p)

with open('data/refactored/dev.pkl', 'rb') as p:
  dev = pickle.load(p)

with open('data/refactored/test.pkl', 'rb') as p:
  test = pickle.load(p)

nlp = spacy.blank("en")
LABELS = ['PERSON_AND_COMPANY']

%env TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=5368709120

#### **Code**

In [None]:
def get_span2entity(ex):
  span2entity, seen = {}, 0
  for idx, ent_type in enumerate(ex['ner']): 
    if ent_type != 'O' and idx >= seen: # >= so a span at index 0 (or directly after another span) keeps its first token
      b, e = idx, idx
      while (e < len(ex['ner']) and ex['ner'][e] == ent_type):
        e += 1
      span2entity[(b, e)] = ent_type
      seen = e # don't consider the same span twice
  return span2entity

def convert_to_spacy(dataset, outf):
  misses = 0 # entities whose character offsets could not be aligned to token boundaries
  Doc.set_extension('rel', default={}, force=True)
  docs = []

  for ex in dataset:

    span_starts, entities, relations = set(), [], {}
    s2e = get_span2entity(ex)
    doc = Doc(nlp.vocab, words=ex['token'])

    # Parse the entities
    seen = 0
    for (sb, se), ent in s2e.items():
      name = ' '.join([t for i,t in enumerate(ex['token']) if i in set(range(sb, se))])
      if seen == 0:
        start, end = doc.text.index(name), doc.text.index(name) + len(name)
      else:
        start, end = doc.text[seen:].index(name) + seen, doc.text[seen:].index(name) + seen + len(name)
      seen = end
      entity = doc.char_span(start, end, label=ent)
      if entity is not None:
        entities.append(entity)
        span_starts.add(sb)
      else:
        misses += 1
    doc.ents = entities

    # Parse the Relations
    for s1 in span_starts:
      for s2 in span_starts:
        relations[(s1, s2)] = {}
        if s1 == min(ex['per_span']) and s2 == min(ex['org_span']):
          relations[(s1, s2)]['PERSON_AND_COMPANY'] = 1.0
        else:
          relations[(s1, s2)]['PERSON_AND_COMPANY'] = 0.0
    doc._.rel = relations

    if len(doc.ents) > 1:
      docs.append(doc)

  print(misses)
  docbin = DocBin(docs=docs, store_user_data=True)
  docbin.to_disk(outf)

In [None]:
convert_to_spacy(train, 'data/train.spacy')
convert_to_spacy(dev, 'data/dev.spacy')
convert_to_spacy(test, 'data/test.spacy')

6
1
0


#### **Evaluation**

In [None]:
%env TRF_PATH="philschmid/distilroberta-base-ner-conll2003"
%env MODEL_STRING ="distilroberta-base-ner-conll2003"
%env TRAIN_BIN="train.spacy"
%env DEV_BIN="dev.spacy"
%env TEST_BIN="test.spacy"

!spacy project run train

env: TRF_PATH="philschmid/distilroberta-base-ner-conll2003"
env: MODEL_STRING="distilroberta-base-ner-conll2003"
env: TRAIN_BIN="train.spacy"
env: DEV_BIN="dev.spacy"
env: TEST_BIN="test.spacy"
[1m
Running command: /usr/bin/python3 -m spacy train configs/rel_trf.cfg --output models/distilroberta-base-ner-conll2003 --components.transformer.model.name philschmid/distilroberta-base-ner-conll2003 --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py --gpu-id 0
ℹ Saving to output directory: models/distilroberta-base-ner-conll2003
ℹ Using GPU: 0
[2022-06-20 13:12:52,176] [INFO] Set up nlp object from config
[2022-06-20 13:12:52,189] [INFO] Pipeline: ['transformer', 'relation_extractor']
[2022-06-20 13:12:52,195] [INFO] Created vocabulary
[2022-06-20 13:12:52,196] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at philschmid/distilroberta-base-ner-conll2003 were not used when initializing Roberta

In [None]:
%env TRF_PATH="deepset/minilm-uncased-squad2"
%env MODEL_STRING = "minilm-uncased-squad2"
%env TRAIN_BIN="train.spacy"
%env DEV_BIN="dev.spacy"
%env TEST_BIN="test.spacy"

!spacy project run train 

env: TRF_PATH="deepset/minilm-uncased-squad2"
env: MODEL_STRING="minilm-uncased-squad2"
env: TRAIN_BIN="train.spacy"
env: DEV_BIN="dev.spacy"
env: TEST_BIN="test.spacy"
Running command: /usr/bin/python3 -m spacy train configs/rel_trf.cfg --output models/minilm-uncased-squad2 --components.transformer.model.name deepset/minilm-uncased-squad2 --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py --gpu-id 0
ℹ Saving to output directory: models/minilm-uncased-squad2
ℹ Using GPU: 0
[2022-06-20 13:23:43,015] [INFO] Set up nlp object from config
[2022-06-20 13:23:43,026] [INFO] Pipeline: ['transformer', 'relation_extractor']
[2022-06-20 13:23:43,031] [INFO] Created vocabulary
[2022-06-20 13:23:43,033] [INFO] Finished initializing nlp object

#### **Reflection**

- The dev-set F1 improved on the baseline by about **11 percentage points**. That looks large, but the dev set is very small, so it should not be read as a major success.

- Even though the SQuAD 2.0 objective is harder than CoNLL-2003 NER, the difference in performance between the two base models was small.

- Two things could be holding the model back:

  - it was not trained with labels for other relations

  - it relies on the given NER tags, which were not very useful according to the baseline analysis

### Iteration 3 - Fine-tuning with Adversarial Examples

#### **Motivation**

In the last iteration, I constructed a training set by only assigning '1.0' to the desired person/company relations, while assigning 0 to all other relations. In this iteration, I will take examples from relation categories that might carry overlap with the categories we defined as holding person/company relations, so the model can become more discriminative in confusing contexts.
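
The target values used by the conversion code later in this iteration can be summarized in a few lines. This is only a sketch: the helper name `soft_label` is hypothetical, and the 0.5 value for confusable relations mirrors what `convert_to_spacy_with_adversarial_examples` assigns below.

```python
# Relations treated as expressing a person/company link (same set as in the notebook)
POSITIVE = {'org:founded_by', 'org:shareholders', 'org:top_members/employees',
            'per:employee_of', 'per:title'}

def soft_label(relation, is_annotated_pair):
    """Return the PERSON_AND_COMPANY target for one candidate entity pair (hypothetical helper)."""
    if not is_annotated_pair:
        return 0.0   # any other pair of entities in the sentence
    if relation in POSITIVE:
        return 1.0   # annotated person/org pair with a person/company relation
    return 0.5       # annotated person/org pair from a confusable relation: left ambiguous

print(soft_label('per:employee_of', True))        # 1.0
print(soft_label('per:schools_attended', True))   # 0.5
print(soft_label('per:employee_of', False))       # 0.0
```

The intermediate value tells the model that the entity types are right but the relation itself is uncertain, instead of forcing a hard negative.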

In [None]:
%%capture
# Re-refactoring the Dataset (the %%capture cell magic must be the first line of the cell)

import json
import pickle
from pprint import pprint
from google.colab import drive
drive.mount('/content/drive/')

%cd '/content/drive/My Drive/Riley/src'

with open('data/TACRED/train.json', 'r') as f:
  tacred_train = json.load(f)

with open('data/TACRED/dev.json', 'r') as f:
  tacred_dev = json.load(f)

with open('data/TACRED/test.json', 'r') as f:
  tacred_test = json.load(f)

def get_person_from_example(ex): # readability func
  return ' '.join([t for t, l in zip(ex['token'], ex['label']) if l == 1])

def get_org_from_example(ex): # readability func
  return ' '.join([t for t, l in zip(ex['token'], ex['label']) if l == 2])

def get_indices_from_per_title_reln(ex): # special case - we can get some examples from per:title
  ner = ex['stanford_ner']
  candidate_orgs, seen = [], 0
  for idx, ent_type in enumerate(ner):
    if ent_type == 'ORGANIZATION' and idx >= seen: # >= so a span starting at token 0 is not skipped
      b, e = idx, idx
      while e < len(ner) and ner[e] == 'ORGANIZATION': # extend to the (exclusive) end of the span
        e += 1
      candidate_orgs.append((b, e))
      seen = e # don't consider the same span twice
  
  if len(candidate_orgs) == 1: # if there is exactly one organization in the sentence, we can use the example
    comp_start, comp_end = candidate_orgs[0]
    person_indices = set(range(ex['subj_start'], ex['subj_end']+1)) # person is the subject, but title is the object
    org_indices = set(range(comp_start, comp_end)) # so don't use the title
    return person_indices, org_indices
  else:
    return None

def convert_tacred_with_negatives(examples): # apply to all splits independently

  POSITIVE = set(['org:founded_by', 'org:shareholders', 'org:top_members/employees', 'per:employee_of', 'per:title'])
  NEGATIVE = set(['org:member_of', 'org:members', 'per:schools_attended', 'per:origin'])

  bad_subjects = set(['he', 'his', 'she', 'mom', 'her']) 
  task_split = [] # to be pickled
  for example in examples:

    if example['relation'] in POSITIVE:

      if example['relation'] == 'per:title': # if there is a person, their title, and exactly one ORGANIZATION span
                                            # in the example, we can safely use them as ground truths
        indices = get_indices_from_per_title_reln(example)
        if indices is None: # zero or several ORGANIZATION spans: skip the example
          continue
        person_indices, org_indices = indices

      # elif, so the per:title spans above are not overwritten by the generic branch below
      elif example['relation'] == 'per:employee_of': # object is company, subject is person
        person_indices = set(range(example['subj_start'], example['subj_end']+1))
        org_indices = set(range(example['obj_start'], example['obj_end']+1))

      else: # object is person, subject is company
        person_indices = set(range(example['obj_start'], example['obj_end']+1))
        org_indices = set(range(example['subj_start'], example['subj_end']+1))

      new_ex = {
          'id' : example['id'],
          'relation' : example['relation'],
          'token' : example['token'],
          'pos' : example['stanford_pos'],
          'ner' : example['stanford_ner'],
          'dep_head' : [idx - 1 for idx in example['stanford_head']], # convert from 1-based index
          'dep_reln' : example['stanford_deprel'],
          'per_span' : list(person_indices),
          'org_span' : list(org_indices),
          # for evaluating the naive token classification approach
          'label' : [0 if i not in (person_indices | org_indices) else 1 if i in person_indices else 2 for i in range(len(example['token']))] # target for Token Classification framework
      }

    elif example['relation'] in NEGATIVE:

      if example['relation'].startswith('org'):
        person_indices = set(range(example['obj_start'], example['obj_end']+1))
        org_indices = set(range(example['subj_start'], example['subj_end']+1))
      else:
        person_indices = set(range(example['subj_start'], example['subj_end']+1))
        org_indices = set(range(example['obj_start'], example['obj_end']+1))

      new_ex = {
          'id' : example['id'],
          'relation' : example['relation'],
          'token' : example['token'],
          'pos' : example['stanford_pos'],
          'ner' : example['stanford_ner'],
          'dep_head' : [idx - 1 for idx in example['stanford_head']], # convert from 1-based index
          'dep_reln' : example['stanford_deprel'],
          'per_span' : list(person_indices),
          'org_span' : list(org_indices),
          # for evaluating the naive token classification approach
          'label' : [0 if i not in (person_indices | org_indices) else 1 if i in person_indices else 2 for i in range(len(example['token']))] # target for Token Classification framework
      }

    else:
      continue
      
    # some of the subjects/objects referring to a person are pronouns or common nouns, not names;
    # remove these examples to keep the ground-truth labels reliable
    if get_person_from_example(new_ex).lower() in bad_subjects: 
      continue

    task_split.append(new_ex)

  return task_split

In [None]:
# process and serialize the re-refactored dataset

train, dev, test = convert_tacred_with_negatives(tacred_train), convert_tacred_with_negatives(tacred_dev), convert_tacred_with_negatives(tacred_test)

with open('data/with_negatives/train.pkl', 'wb') as d:
  pickle.dump(train, d)

with open('data/with_negatives/dev.pkl', 'wb') as d:
  pickle.dump(dev, d)

with open('data/with_negatives/test.pkl', 'wb') as d:
  pickle.dump(test, d)

#### **Setup**

In [None]:
%%capture

!pip install -U pip setuptools wheel
!pip install spacy
!python -m spacy download en_core_web_trf
!pip install spacy transformers

import pickle
import numpy as np
from pprint import pprint
from google.colab import drive
from spacy.tokens import Span, DocBin, Doc
from spacy.vocab import Vocab
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
import spacy

drive.mount('/content/drive/')
%cd '/content/drive/My Drive/Riley/src'

with open('data/with_negatives/train.pkl', 'rb') as p:
  train = pickle.load(p)

with open('data/with_negatives/dev.pkl', 'rb') as p:
  dev = pickle.load(p)

with open('data/with_negatives/test.pkl', 'rb') as p:
  test = pickle.load(p)

LABELS = ['PERSON_AND_COMPANY']
POSITIVE = set(['org:founded_by', 'org:shareholders', 'org:top_members/employees', 'per:employee_of', 'per:title'])
# NEGATIVE = set(['org:member_of', 'org:members', 'per:schools_attended', 'per:origin'])

nlp = spacy.blank("en")

%env TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=5368709120

env: TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=5368709120


#### **Code**

In [None]:
def get_span2entity(ex):
  span2entity, seen = {}, 0
  for idx, ent_type in enumerate(ex['ner']): 
    if ent_type != 'O' and idx >= seen: # >= keeps an entity at token 0 or directly after another entity
      b, e = idx, idx
      while (e < len(ex['ner']) and ex['ner'][e] == ent_type):
        e += 1
      span2entity[(b, e)] = ent_type
      seen = e # don't consider the same span twice
  return span2entity

def convert_to_spacy_with_adversarial_examples(dataset, outf):
  misses = count = 0
  Doc.set_extension('rel', default={}, force=True)
  vocab = Vocab()

  docs, ids = [], set()

  for ex in dataset:

    span_starts, entities, relations = set(), [], {}
    s2e = get_span2entity(ex)
    neg, pos = 0, 0
    doc = Doc(nlp.vocab, words=ex['token'])

    # Parse the NER entities (doc.char_span needs character offsets)
    seen = 0
    for (sb, se), ent in s2e.items():
      name = ' '.join(ex['token'][sb:se])
      # search from the end of the previous entity so repeated names map to the right occurrence
      start = doc.text.index(name, seen)
      end = start + len(name)
      seen = end
      entity = doc.char_span(start, end, label=ent)
      if entity is not None:
        entities.append(entity)
        span_starts.add(sb)
      else:
        misses += 1
    doc.ents = entities

    # Parse the Relations
    for s1 in span_starts:
      for s2 in span_starts:
        relations[(s1, s2)] = {}
        if s1 == min(ex['per_span']) and s2 == min(ex['org_span']):
          if ex['relation'] in POSITIVE:
            relations[(s1, s2)]['PERSON_AND_COMPANY'] = 1.0
          else:
            relations[(s1, s2)]['PERSON_AND_COMPANY'] = 0.5 # the entities are person and org, but the relation may or may not exist
                                                            # we lack ground truth data, so make it ambiguous
        else:
          relations[(s1, s2)]['PERSON_AND_COMPANY'] = 0.0

          # if ex['relation'] in NEGATIVE:
          #   relations[(s1, s2)]['OTHER_RELATION'] = 1.0
          # else:
          #   relations[(s1, s2)]['OTHER_RELATION'] = 0.0
    doc._.rel = relations

    if len(doc.ents) > 1:
      docs.append(doc)
      count += 1

  print(misses)
  print(count)
  docbin = DocBin(docs=docs, store_user_data=True)
  docbin.to_disk(outf)

In [None]:
convert_to_spacy_with_adversarial_examples(train, 'data/train_withnegatives.spacy')

#### **Evaluation**

In [None]:
%env TRF_PATH="philschmid/distilroberta-base-ner-conll2003"
%env MODEL_STRING ="distilroberta-base-ner-conll2003_withnegatives"
%env TRAIN_BIN="train_withnegatives.spacy"
%env DEV_BIN="dev.spacy"
%env TEST_BIN="test.spacy"

!spacy project run train

env: TRF_PATH="philschmid/distilroberta-base-ner-conll2003"
env: MODEL_STRING="distilroberta-base-ner-conll2003_withnegatives"
env: TRAIN_BIN="train_withnegatives.spacy"
env: DEV_BIN="dev.spacy"
env: TEST_BIN="test.spacy"
Running command: /usr/bin/python3 -m spacy train configs/rel_trf.cfg --output models/distilroberta-base-ner-conll2003_withnegatives --components.transformer.model.name philschmid/distilroberta-base-ner-conll2003 --paths.train data/train_withnegatives.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py --gpu-id 0
✔ Created output directory: models/distilroberta-base-ner-conll2003_withnegatives
ℹ Saving to output directory: models/distilroberta-base-ner-conll2003_withnegatives
ℹ Using GPU: 0
[2022-06-20 13:49:15,154] [INFO] Set up nlp object from config
[2022-06-20 13:49:15,166] [INFO] Pipeline: ['transformer', 'relation_extractor']
[2022-06-20 13:49:15,170] [INFO] Created vocabulary
[2022-06-20 13:49:15,172

In [None]:
%env TRF_PATH="deepset/minilm-uncased-squad2"
%env MODEL_STRING = "minilm-uncased-squad2_withnegatives"
%env TRAIN_BIN="train_withnegatives.spacy"
%env DEV_BIN="dev.spacy"
%env TEST_BIN="test.spacy"

!spacy project run train 

env: TRF_PATH="deepset/minilm-uncased-squad2"
env: MODEL_STRING="minilm-uncased-squad2_withnegatives"
env: TRAIN_BIN="train_withnegatives.spacy"
env: DEV_BIN="dev.spacy"
env: TEST_BIN="test.spacy"
Running command: /usr/bin/python3 -m spacy train configs/rel_trf.cfg --output models/minilm-uncased-squad2_withnegatives --components.transformer.model.name deepset/minilm-uncased-squad2 --paths.train data/train_withnegatives.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py --gpu-id 0
ℹ Saving to output directory: models/minilm-uncased-squad2_withnegatives
ℹ Using GPU: 0
[2022-06-20 14:01:54,018] [INFO] Set up nlp object from config
[2022-06-20 14:01:54,030] [INFO] Pipeline: ['transformer', 'relation_extractor']
[2022-06-20 14:01:54,035] [INFO] Created vocabulary
[2022-06-20 14:01:54,037] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertModel: [

#### **Reflection**

- My hypothesis appears to have been incorrect

  - given the size of the dataset, adding adversarial examples seems to have "confused" the model

- An NER component trained on a large corpus to identify only PERSON and ORGANIZATION tags might allow further improvement

- Training a model from scratch with the classic NSP/MLM objectives, but on a corpus like TACRED, might also yield further improvements

- Heuristically restricting candidate pairs based on dependency relations might also be useful

  - e.g. not all entity pairs in a given example are related, and a person and an organization that sit in separate clauses should probably not be considered
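
A minimal sketch of the dependency-based restriction, assuming the 0-based `dep_head` field built in `convert_tacred_with_negatives` (root = -1) together with the `per_span` and `org_span` token indices. The helper names `tree_distance` and `keep_candidate`, the `max_distance` cutoff, and the toy sentence are all hypothetical; I have not tried this on the data.

```python
from collections import deque

def tree_distance(dep_head, a, b):
    """Number of dependency edges between tokens a and b (None if they are not connected)."""
    # Build an undirected adjacency list from the head pointers
    adj = [[] for _ in dep_head]
    for child, head in enumerate(dep_head):
        if head >= 0:  # -1 marks the root, which has no head
            adj[child].append(head)
            adj[head].append(child)
    # Breadth-first search from a until b is reached
    dist = {a: 0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

def keep_candidate(ex, max_distance=4):
    """Keep a person/org pair only if some person token is close to some org token in the tree."""
    dists = [tree_distance(ex['dep_head'], p, o)
             for p in ex['per_span'] for o in ex['org_span']]
    dists = [d for d in dists if d is not None]
    return bool(dists) and min(dists) <= max_distance

# Toy example: "Ben works at Acme" with heads Ben->works, works=root, at->Acme, Acme->works
toy = {'dep_head': [1, -1, 3, 1], 'per_span': [0], 'org_span': [3]}
print(tree_distance(toy['dep_head'], 0, 3))   # 2
print(keep_candidate(toy, max_distance=2))    # True
print(keep_candidate(toy, max_distance=1))    # False
```

Entities in separate clauses tend to be far apart in the tree, so a distance cutoff is one cheap proxy for "same clause"; the right cutoff would have to be tuned on the dev set.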

## Subtask 5 - Test Set Inference

### **Code**

In [None]:
%env TRF_PATH="philschmid/distilroberta-base-ner-conll2003"
%env MODEL_STRING ="distilroberta-base-ner-conll2003"
%env TRAIN_BIN="train.spacy"
%env DEV_BIN="dev.spacy"
%env TEST_BIN="test.spacy"

!spacy project run evaluate

env: TRF_PATH="philschmid/distilroberta-base-ner-conll2003"
env: MODEL_STRING="distilroberta-base-ner-conll2003"
env: TRAIN_BIN="train.spacy"
env: DEV_BIN="dev.spacy"
env: TEST_BIN="test.spacy"
Running command: /usr/bin/python3 ./scripts/evaluate.py models/distilroberta-base-ner-conll2003/model-best data/test.spacy False

Random baseline:
threshold 0.00 	 {'rel_micro_p': '3.50', 'rel_micro_r': '100.00', 'rel_micro_f': '6.77'}
threshold 0.05 	 {'rel_micro_p': '3.51', 'rel_micro_r': '95.09', 'rel_micro_f': '6.77'}
threshold 0.10 	 {'rel_micro_p': '3.54', 'rel_micro_r': '90.75', 'rel_micro_f': '6.81'}
threshold 0.20 	 {'rel_micro_p': '3.52', 'rel_micro_r': '80.35', 'rel_micro_f': '6.75'}
threshold 0.30 	 {'rel_micro_p': '3.56', 'rel_micro_r': '71.39', 'rel_micro_f': '6.78'}
threshold 0.40 	 {'rel_micro_p': '3.54', 'rel_micro_r': '61.27', 'rel_micro_f': '6.69'}
threshold 0.50 	 {'rel_micro_p': '3.37', 'rel_micro_r': '48.55', 'rel_micro_f': '6.31'}
threshold 0.60 	 {'rel_micro_p': '3.4

In [None]:
%env TRF_PATH="deepset/minilm-uncased-squad2"
%env MODEL_STRING = "minilm-uncased-squad2"
%env TRAIN_BIN="train.spacy"
%env DEV_BIN="dev.spacy"
%env TEST_BIN="test.spacy"

!spacy project run evaluate

env: TRF_PATH="deepset/minilm-uncased-squad2"
env: MODEL_STRING="minilm-uncased-squad2"
env: TRAIN_BIN="train.spacy"
env: DEV_BIN="dev.spacy"
env: TEST_BIN="test.spacy"
Running command: /usr/bin/python3 ./scripts/evaluate.py models/minilm-uncased-squad2/model-best data/test.spacy False

Random baseline:
threshold 0.00 	 {'rel_micro_p': '3.50', 'rel_micro_r': '100.00', 'rel_micro_f': '6.77'}
threshold 0.05 	 {'rel_micro_p': '3.52', 'rel_micro_r': '95.66', 'rel_micro_f': '6.79'}
threshold 0.10 	 {'rel_micro_p': '3.55', 'rel_micro_r': '91.33', 'rel_micro_f': '6.84'}
threshold 0.20 	 {'rel_micro_p': '3.54', 'rel_micro_r': '80.92', 'rel_micro_f': '6.78'}
threshold 0.30 	 {'rel_micro_p': '3.53', 'rel_micro_r': '70.23', 'rel_micro_f': '6.73'}
threshold 0.40 	 {'rel_micro_p': '3.47', 'rel_micro_r': '59.54', 'rel_micro_f': '6.56'}
threshold 0.50 	 {'rel_micro_p': '3.46', 'rel_micro_r': '49.71', 'rel_micro_f': '6.48'}
threshold 0.60 	 {'rel_micro_p': '3.56', 'rel_micro_r': '40.46', 'rel_mic

In [None]:
%env TRF_PATH="philschmid/distilroberta-base-ner-conll2003"
%env MODEL_STRING ="distilroberta-base-ner-conll2003_withnegatives"
%env TRAIN_BIN="train_withnegatives.spacy"
%env DEV_BIN="dev.spacy"
%env TEST_BIN="test.spacy"

!spacy project run evaluate

env: TRF_PATH="philschmid/distilroberta-base-ner-conll2003"
env: MODEL_STRING="distilroberta-base-ner-conll2003_withnegatives"
env: TRAIN_BIN="train_withnegatives.spacy"
env: DEV_BIN="dev.spacy"
env: TEST_BIN="test.spacy"
Running command: /usr/bin/python3 ./scripts/evaluate.py models/distilroberta-base-ner-conll2003_withnegatives/model-best data/test.spacy False

Random baseline:
threshold 0.00 	 {'rel_micro_p': '3.50', 'rel_micro_r': '100.00', 'rel_micro_f': '6.77'}
threshold 0.05 	 {'rel_micro_p': '3.55', 'rel_micro_r': '96.24', 'rel_micro_f': '6.85'}
threshold 0.10 	 {'rel_micro_p': '3.58', 'rel_micro_r': '92.20', 'rel_micro_f': '6.90'}
threshold 0.20 	 {'rel_micro_p': '3.52', 'rel_micro_r': '80.06', 'rel_micro_f': '6.74'}
threshold 0.30 	 {'rel_micro_p': '3.54', 'rel_micro_r': '70.23', 'rel_micro_f': '6.73'}
threshold 0.40 	 {'rel_micro_p': '3.58', 'rel_micro_r': '61.56', 'rel_micro_f': '6.76'}
threshold 0.50 	 {'rel_micro_p': '3.50', 'rel_micro_r': '50.00', 'rel_micro_f': '6.

In [None]:
%env TRF_PATH="deepset/minilm-uncased-squad2"
%env MODEL_STRING = "minilm-uncased-squad2_withnegatives"
%env TRAIN_BIN="train_withnegatives.spacy"
%env DEV_BIN="dev.spacy"
%env TEST_BIN="test.spacy"

!spacy project run evaluate

env: TRF_PATH="deepset/minilm-uncased-squad2"
env: MODEL_STRING="minilm-uncased-squad2_withnegatives"
env: TRAIN_BIN="train_withnegatives.spacy"
env: DEV_BIN="dev.spacy"
env: TEST_BIN="test.spacy"
Running command: /usr/bin/python3 ./scripts/evaluate.py models/minilm-uncased-squad2_withnegatives/model-best data/test.spacy False

Random baseline:
threshold 0.00 	 {'rel_micro_p': '3.50', 'rel_micro_r': '100.00', 'rel_micro_f': '6.77'}
threshold 0.05 	 {'rel_micro_p': '3.46', 'rel_micro_r': '94.22', 'rel_micro_f': '6.67'}
threshold 0.10 	 {'rel_micro_p': '3.47', 'rel_micro_r': '89.31', 'rel_micro_f': '6.68'}
threshold 0.20 	 {'rel_micro_p': '3.39', 'rel_micro_r': '77.75', 'rel_micro_f': '6.49'}
threshold 0.30 	 {'rel_micro_p': '3.35', 'rel_micro_r': '67.05', 'rel_micro_f': '6.38'}
threshold 0.40 	 {'rel_micro_p': '3.42', 'rel_micro_r': '58.38', 'rel_micro_f': '6.46'}
threshold 0.50 	 {'rel_micro_p': '3.39', 'rel_micro_r': '48.27', 'rel_micro_f': '6.33'}
threshold 0.60 	 {'rel_micro_p'
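
The "Random baseline" rows above follow a simple pattern: precision stays near 3.5% at every threshold, and recall is roughly one minus the threshold. That is what one would expect if the baseline gives every candidate pair a score drawn uniformly from [0, 1], which I am assuming here rather than reading from `evaluate.py`. A small simulation under that assumption (the function name `random_baseline` and the 3.5% positive rate are illustrative):

```python
import random

def random_baseline(n_pairs, positive_rate, threshold, seed=0):
    """Simulate micro precision/recall when every candidate pair gets a uniform random score."""
    rng = random.Random(seed)
    tp = fp = fn = 0
    for _ in range(n_pairs):
        is_gold = rng.random() < positive_rate   # hypothetical gold label
        predicted = rng.random() >= threshold    # random score compared to the threshold
        if predicted and is_gold:
            tp += 1
        elif predicted:
            fp += 1
        elif is_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.0, 0.2, 0.5):
    p, r = random_baseline(n_pairs=20000, positive_rate=0.035, threshold=t)
    print(f"threshold {t:.2f}  precision {100 * p:.2f}  recall {100 * r:.2f}")
```

Under this assumption, precision simply equals the share of positive pairs among all candidates, so any trained model has to be judged against that share and not against 50%.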

# Results

The models trained with adversarial examples (marked with * in the table) roughly matched the performance of the Iteration 2 models, but at a lower decision threshold. This is plausible: adding adversarial examples pushes the model to be more discriminative.

In practice, performance should be better, because inference relies on a stronger NER tagger to propose candidate entity pairs.

Base Model  | Threshold | Micro-Precision | Micro-Recall | Micro-F1 
------------|-----------|-----------------|--------------|--------------
distilroberta-base-ner-conll2003 | 0.4 | 70.4 | 74.9 | 72.6
**minilm-uncased-squad2** | **0.3** | **76.2** | **77.8** | **77.0**
distilroberta-base-ner-conll2003 * | 0.3 | 73.2 | 72.5 | 72.9
minilm-uncased-squad2 * | 0.2 | 77.8 | 76.0 | 76.9
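
For reference, here is a sketch of how a threshold and the micro-averaged scores in a table like this can be derived from per-pair scores. It assumes a pair is predicted positive when its score is at or above the threshold; the function names and the toy scores are made up for illustration and are not taken from `evaluate.py`.

```python
def micro_prf(scores, gold, threshold):
    """Micro precision, recall and F1 for a single relation label at one threshold."""
    tp = fp = fn = 0
    for score, is_gold in zip(scores, gold):
        predicted = score >= threshold
        if predicted and is_gold:
            tp += 1
        elif predicted:
            fp += 1
        elif is_gold:
            fn += 1
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def best_threshold(scores, gold, thresholds):
    """Pick the threshold with the highest F1 (should be chosen on dev, then applied to test)."""
    return max(thresholds, key=lambda t: micro_prf(scores, gold, t)[2])

# Toy scores for five candidate pairs, two of which are true person/company pairs
scores = [0.9, 0.8, 0.35, 0.25, 0.1]
gold = [1, 0, 1, 0, 0]
print(best_threshold(scores, gold, [0.2, 0.3, 0.4, 0.5]))  # 0.3
print(micro_prf(scores, gold, 0.3))
```

With a single relation label, micro-averaging reduces to ordinary binary precision and recall over all candidate pairs.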


# Demo

In [None]:
%%capture

!pip install -U pip setuptools wheel
!pip install spacy
!python -m spacy download en_core_web_trf
!pip install spacy transformers

In [None]:
!spacy project run clean

Running command: rm -rf assets/sentences.spacy
Running command: rm -rf 'assets/output/*'
Running command: rm -rf training


In [None]:
%env TRF_PATH="deepset/minilm-uncased-squad2"
%env MODEL_STRING = "minilm-uncased-squad2" # best model 
!spacy project run infer

env: TRF_PATH="deepset/minilm-uncased-squad2"
env: MODEL_STRING="minilm-uncased-squad2"
Running command: /usr/bin/python3 ./scripts/inference.py assets/sentences.txt models/minilm-uncased-squad2/model-best assets/output/extracted_sentences.txt
ℹ {'index': 0, 'person_and_company': ('Daniel Spielman', 'Yale University')}
ℹ {'index': 1, 'person_and_company': ('Spielman', 'MIT')}
ℹ {'index': 1, 'person_and_company': ('Shang-Hua Teng', 'MIT')}
ℹ {'index': 3, 'person_and_company': ('Joanna Drążkowska', 'Ludwig Maximilian University of Munich')}
