Future work: fine tune ner

Idea based on https://www.analyticsvidhya.com/blog/2020/06/nlp-project-information-extraction/:
 1. Subject (nsubj) - combine compound and Object (dobj) extraction
 2. Adjective Noun
 3. We look for tokens that have a Noun POS tag and have subject or object dependency
 4. Then we look at the child nodes of these tokens and append it to the phrase only if it modifies the noun
 5. Rule on Prepositions
 6. We iterate over all the tokens looking for prepositions. For example, in this sentence
 7. On encountering a preposition, we check if it has a headword that is a noun. For example, the word faith in this sentence
 8. Then we look at the child tokens of the preposition token falling on its right side. For example, the word democracy
 9. Append modifier attached to a noun
 10. if it has subject and noun: run relation extraction
 11. run relation extraction between entities extracted from NER and add them if its score is above a threshold (0.8 for now but this could change)!
 12. TODO: group a nouns with conjunctions as a single entity???

## Extract data about Model.SPR
Webscrape data from Swanton Pacific Ranch wikipedia page

In [5]:
# https://en.wikipedia.org/wiki/Swanton_Pacific_Ranch
from bs4 import BeautifulSoup
import requests

spr_wiki = requests.get('https://en.wikipedia.org/wiki/Swanton_Pacific_Ranch')

soup = BeautifulSoup(spr_wiki.text, 'html.parser')


In [6]:
for script in soup(["style"]):                   
    script.decompose() 

In [7]:
spr_wiki_text = ' '.join(p.text.strip() for p in soup.find_all("div", {"class": "mw-parser-output"})[0].find_all('p'))

In [8]:
# remove special characters
spr_wiki_text = spr_wiki_text.encode('ascii', 'ignore').decode('ascii')

In [9]:
# remove in-line citation
import re
spr_wiki_text = re.sub(r'\[\d*\]', '', spr_wiki_text)
spr_wiki_text = re.sub(r'Full Report', '', spr_wiki_text)
# remove parenthesis
spr_wiki_text = re.sub(r'\(.*?\) ', '', spr_wiki_text)

## Coreference Resolution (https://github.com/huggingface/neuralcoref)
Install Coreference Resolution - had to downgrade Spacy to 2.1.0 because of neuralcoref's incompatibility with Spacy >= 2.1.8

In [1]:
import spacy

print(spacy.__version__)

2.1.0


In [5]:
!pip install spacy==2.1.0

Collecting spacy==2.1.0
  Downloading spacy-2.1.0-cp37-cp37m-manylinux1_x86_64.whl (27.7 MB)
[K     |████████████████████████████████| 27.7 MB 2.0 MB/s 
Installing collected packages: spacy
  Attempting uninstall: spacy
    Found existing installation: spacy 2.1.8
    Uninstalling spacy-2.1.8:
      Successfully uninstalled spacy-2.1.8
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 2.2.5 requires spacy>=2.2.2, but you have spacy 2.1.0 which is incompatible.[0m
Successfully installed spacy-2.1.0


In [2]:
!pip install neuralcoref

Collecting neuralcoref
  Downloading neuralcoref-4.0-cp37-cp37m-manylinux1_x86_64.whl (286 kB)
[?25l[K     |█▏                              | 10 kB 19.6 MB/s eta 0:00:01[K     |██▎                             | 20 kB 21.4 MB/s eta 0:00:01[K     |███▍                            | 30 kB 24.1 MB/s eta 0:00:01[K     |████▋                           | 40 kB 27.6 MB/s eta 0:00:01[K     |█████▊                          | 51 kB 30.6 MB/s eta 0:00:01[K     |██████▉                         | 61 kB 31.4 MB/s eta 0:00:01[K     |████████                        | 71 kB 31.2 MB/s eta 0:00:01[K     |█████████▏                      | 81 kB 31.0 MB/s eta 0:00:01[K     |██████████▎                     | 92 kB 27.3 MB/s eta 0:00:01[K     |███████████▌                    | 102 kB 28.7 MB/s eta 0:00:01[K     |████████████▋                   | 112 kB 28.7 MB/s eta 0:00:01[K     |█████████████▊                  | 122 kB 28.7 MB/s eta 0:00:01[K     |██████████████▉                 | 

## Spacy
Load spacy english model and add coreference resolution to its pipeline

In [2]:
!python -m spacy download en

Collecting en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1 MB)
[K     |████████████████████████████████| 11.1 MB 26.1 MB/s 
[?25hBuilding wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... [?25l[?25hdone
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.1.0-py3-none-any.whl size=11074431 sha256=e78d0aa08d14d62853c3ccddd079511a3394f623ae362728d2f7bb153337af25
  Stored in directory: /tmp/pip-ephem-wheel-cache-48tyiasq/wheels/59/4f/8c/0dbaab09a776d1fa3740e9465078bfd903cc22f3985382b496
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-2.1.0
[38;5;2m✔ Download and installation suc

In [3]:
import spacy
import neuralcoref

nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

<spacy.lang.en.English at 0x7fbd9a881cd0>

In [10]:
coref_doc = nlp(spr_wiki_text)
# print(coref_doc._.coref_clusters)
coref_resolved_spr_wiki_text = coref_doc._.coref_resolved

## Visualize Spacy Tree

In [11]:
!pip install visualise_spacy_tree

Collecting visualise_spacy_tree
  Downloading visualise_spacy_tree-0.0.6-py3-none-any.whl (5.0 kB)
Collecting pydot==1.4.1
  Downloading pydot-1.4.1-py2.py3-none-any.whl (19 kB)
Installing collected packages: pydot, visualise-spacy-tree
  Attempting uninstall: pydot
    Found existing installation: pydot 1.3.0
    Uninstalling pydot-1.3.0:
      Successfully uninstalled pydot-1.3.0
Successfully installed pydot-1.4.1 visualise-spacy-tree-0.0.6


In [12]:
from spacy import displacy 
import visualise_spacy_tree
from IPython.display import Image, display

# doc = nlp(sentences[0])
# png = visualise_spacy_tree.create_png(doc)
# display(Image(png))

In [13]:
def draw_dependency_graph(doc):
  displacy.render(doc, style='dep', jupyter=True)

## Rules

In [11]:
def check_btw_ends(idx, compound_indices):
  for compound_idx in compound_indices:
      if idx > compound_idx[0] and idx < compound_idx[1]:
        return True
  return False

def can_extend_compound(idx, compound_indices):
  for i in range(len(compound_indices)):
    compound_idx = compound_indices[i]
    if idx == compound_idx[1]:
      return i
  return -1

# compounds
def get_compounds(doc):
  compounds = []
  compound_indices = []
  for token in doc:
    if token.dep_ == 'compound':
      # if current token.i is between previously found start and end indices of a compound, skip
      if not check_btw_ends(token.i, compound_indices):
        # if current token.i == end index of a previsouly found compound, extend
        idx_to_extend = can_extend_compound(token.i, compound_indices)
        if idx_to_extend != -1:
          compound_indices[idx_to_extend][1] = token.head.i
        else:
          compound_indices.append([token.i, token.head.i])

  compounds = [doc[compound_idx[0]: compound_idx[1] + 1] for compound_idx in compound_indices]

  return compounds, compound_indices

In [12]:
def combine_verbs_with_conj(doc):
  combined_verbs = []
  combined_verbs_indices = []
  for token in doc:
    if token.pos_ == 'CCONJ':
      combined_verb = ''
      left = doc[token.i - 1]
      right = doc[token.i + 1]

      if left.pos_ == 'VERB' and right.pos_ == 'VERB':
        combined_verb += (left.text + ' ' + token.text + ' ' + right.text)
        combined_verbs.append(combined_verb)
        combined_verbs_indices.append((left.i, right.i))
  return combined_verbs, combined_verbs_indices

In [13]:
# combine compounds and nouns with a modifier to a single token
def update_tokenizer(doc):
  compounds, compounds_indices = get_compounds(doc)
  assert len(compounds) == len(compounds_indices)

  with doc.retokenize() as retokenizer:
    for i in range(len(compounds)):
      compound = compounds[i]
      retokenizer.merge(doc[compounds_indices[i][0]: compounds_indices[i][1] + 1], attrs={"LEMMA": compound.text.lower()})

  mod_nouns, mod_nouns_indices = get_noun_mod(doc)
  assert len(mod_nouns) == len(mod_nouns_indices)

  with doc.retokenize() as retokenizer:
    for i in range(len(mod_nouns)):
      mod_noun = mod_nouns[i]
      retokenizer.merge(doc[mod_nouns_indices[i][0]: mod_nouns_indices[i][-1] + 1], attrs={"LEMMA": mod_noun.lower()})

  combined_verbs, combined_verbs_indices = combine_verbs_with_conj(doc)

  assert len(combined_verbs) == len(combined_verbs_indices)

  with doc.retokenize() as retokenizer:
    for i in range(len(combined_verbs)):
      combined_verb = combined_verbs[i]
      retokenizer.merge(doc[combined_verbs_indices[i][0]: combined_verbs_indices[i][-1] + 1], attrs={"LEMMA": combined_verb.lower()})

  return doc

In [14]:
def split_verb_w_conj(subject, root_verb, obj, preposition=''):
  triples = []
  if 'or' in root_verb:
    verbs = root_verb.split('or')
    for verb in verbs:
      triples.append((subject, verb.strip() + ' ' + preposition, obj))
  elif 'and' in root_verb:
    verbs = root_verb.split('and')
    for verb in verbs:
      triples.append((subject, verb.strip() + ' ' + preposition, obj))
  else:
    triples.append((subject, root_verb.strip() + ' ' + preposition, obj))
  return triples

In [15]:
# function for rule 1: noun(subject), verb, noun(object)
def rule_1(doc):
        
    sent = []
    
    for token in doc:
        
        # if the token is a verb
        if (token.pos_=='VERB'):
            
            # phrase = []
            subject = ''
            verb = ''
            # only extract noun or pronoun subjects
            for sub_tok in token.lefts:
                
                if (sub_tok.dep_ in ['nsubj','nsubjpass']) and (sub_tok.pos_ in ['NOUN','PROPN','PRON']):
                    
                    # add subject to the phrase
                    subject = sub_tok.text

                    # save the root of the verb in phrase
                    verb = token.text 

                    # check for noun or pronoun direct objects
                    for sub_tok in token.rights:
                        
                        # save the object in the phrase
                        if (sub_tok.dep_ in ['dobj']) and (sub_tok.pos_ in ['NOUN','PROPN']):
                            obj = sub_tok.text
                            phrase = split_verb_w_conj(subject, verb, obj)

                            sent.append(tuple(phrase))
            
    return sent

In [16]:
# if a ROOT verb is a be verb
def rule_2(doc):
  verb = []
  for token in doc:
    if token.tag_.startswith('V') and token.pos_ == 'AUX' and token.dep_ == 'ROOT':
      verb = [token, token.i]
      break
  if len(verb):
    # make this triplets and add to the list
    return [(doc[:verb[1]].text, verb[0].text, doc[verb[1] + 1:][0]), *rule_3(doc[verb[1] + 1:])]
  else:
    return ()

In [17]:
# get nouns with modifiers
def get_noun_mod(doc):
  nouns = []
  nouns_indices = []
  for token in doc:
    if token.pos_ in ['NOUN', 'PROPN'] and token.dep_ in ['attr', 'pobj', 'dobj']:
      modifier = ''
      modifier_indices = []
      for left in token.lefts:
        if left.dep_ in ['det', 'nummod', 'nmod'] or left.pos_ == 'ADJ':
          modifier += ' ' + left.text
          modifier_indices.append(left.i)
      if len(modifier):
        modifier_indices.append(token.i)
        nouns.append((modifier + ' ' + token.text).strip())
        nouns_indices.append(modifier_indices)
  return nouns, nouns_indices

# rule 3 noun + preposition + noun
def rule_3(doc):
        
    sent = []
    
    for token in doc:

        # look for prepositions
        if token.pos_=='ADP':

            phrase = []
            
            # if its head word is a noun
            if token.head.pos_=='NOUN':
                
                # append noun and preposition to phrase
                phrase.append(token.head.text)

                phrase.append(token.text)

                # check the nodes to the right of the preposition
                for right_tok in token.rights:
                    # append if it is a noun or proper noun
                    if (right_tok.pos_ in ['NOUN','PROPN']):
                      phrase.append(right_tok.text)
                
                if len(phrase) > 2:
                    sent.append(tuple(phrase))
                
    return sent 

In [18]:
# handle passive sentences
def rule_4(doc):
  all_triples = []
  root_verb = ''
  subject = ''
  obj = ''
  for token in doc:
    # passive verb
    if token.pos_ == 'AUX' and token.head.pos_ == 'VERB':
      subject = doc[:token.i].text
      root_verb = token.head.text
    
    if token.pos_ in ['PROPN', 'NOUN']:
      if (token.dep_ == 'conj' and token.head.head.head.text == root_verb):
        obj = token.text
        triples = split_verb_w_conj(subject, root_verb, obj, token.head.head.text)
        if len(triples):
          all_triples.extend(triples)
        obj = ''
      elif (token.head.head.text == root_verb):
        obj = token.text
        triples = split_verb_w_conj(subject, root_verb, obj, token.head.text)
        if len(triples):
          all_triples.extend(triples)
        obj = ''
  return all_triples

In [19]:
# from NER
def get_named_entities(doc):
  entities = []
  for ent in doc.ents:
    entities.append((ent, ent.label_, (ent.start, ent.end)))
  return entities

## Relation Extraction
OpenNRE https://github.com/thunlp/OpenNRE

### Install OpenNRE

In [20]:
!git clone https://github.com/thunlp/OpenNRE.git --depth 1


Cloning into 'OpenNRE'...
remote: Enumerating objects: 68, done.[K
remote: Counting objects: 100% (68/68), done.[K
remote: Compressing objects: 100% (59/59), done.[K
remote: Total 68 (delta 21), reused 29 (delta 8), pack-reused 0[K
Unpacking objects: 100% (68/68), done.


In [21]:
%cd OpenNRE/

/content/OpenNRE


In [22]:
!pip install -r requirements.txt

Collecting torch==1.6.0
  Downloading torch-1.6.0-cp37-cp37m-manylinux1_x86_64.whl (748.8 MB)
[K     |████████████████████████████████| 748.8 MB 17 kB/s 
[?25hCollecting transformers==3.4.0
  Downloading transformers-3.4.0-py3-none-any.whl (1.3 MB)
[K     |████████████████████████████████| 1.3 MB 48.7 MB/s 
[?25hCollecting pytest==5.3.2
  Downloading pytest-5.3.2-py3-none-any.whl (234 kB)
[K     |████████████████████████████████| 234 kB 37.7 MB/s 
[?25hCollecting scikit-learn==0.22.1
  Downloading scikit_learn-0.22.1-cp37-cp37m-manylinux1_x86_64.whl (7.0 MB)
[K     |████████████████████████████████| 7.0 MB 52.5 MB/s 
Collecting nltk==3.4.5
  Downloading nltk-3.4.5.zip (1.5 MB)
[K     |████████████████████████████████| 1.5 MB 41.1 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 54.7 MB/s 
Collecting tokenizers==0.9.2
  Downloading tokenizers-0.9.2-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB)
[K

In [23]:
!python setup.py install

running install
running bdist_egg
running egg_info
creating opennre.egg-info
writing opennre.egg-info/PKG-INFO
writing dependency_links to opennre.egg-info/dependency_links.txt
writing top-level names to opennre.egg-info/top_level.txt
writing manifest file 'opennre.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'opennre.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/opennre
copying opennre/pretrain.py -> build/lib/opennre
copying opennre/__init__.py -> build/lib/opennre
creating build/lib/opennre/tokenization
copying opennre/tokenization/bert_tokenizer.py -> build/lib/opennre/tokenization
copying opennre/tokenization/basic_tokenizer.py -> build/lib/opennre/tokenization
copying opennre/tokenization/__init__.py -> build/lib/opennre/tokenization
copying opennre/tokenization/word_piece_tokenizer.py -> build/lib/opennre/tokenization
copying open

### Get relation extraction model

In [24]:
import opennre
model = opennre.get_model('wiki80_bert_softmax')

2021-12-01 02:37:48,651 - root - INFO - Loading BERT pre-trained checkpoint.


## Extract Triplets

In [25]:
nlp.entity.labels

('NORP',
 'EVENT',
 'WORK_OF_ART',
 'CARDINAL',
 'TIME',
 'ORDINAL',
 'GPE',
 'LANGUAGE',
 'ORG',
 'PERCENT',
 'DATE',
 'LAW',
 'PRODUCT',
 'QUANTITY',
 'LOC',
 'PERSON',
 'MONEY',
 'FAC')

In [26]:
sentences = [sent.text for sent in nlp(coref_resolved_spr_wiki_text).sents]
print("The number of sentences to extract triplets:", len(sentences))

The number of sentences to extract triplets: 271


In [27]:
def extract_triplets(doc):
  # update tokenizer
  doc = update_tokenizer(doc)

  # apply rules to extract triplets
  rule_1_triples = rule_1(doc)
  rule_2_triples = rule_2(doc)
  rule_3_triples = rule_3(doc)
  rule_4_triples = rule_4(doc)

  return set([*rule_1_triples, *rule_2_triples, *rule_3_triples, *rule_4_triples])

In [28]:
from itertools import combinations
def extract_triples_opennre(model, doc, threshold):
  triples = []

  # update tokenizer
  doc = update_tokenizer(doc)

  entities = get_named_entities(doc)
  if len(entities) < 2:
    return triples
  combs = combinations(entities, 2)
  for comb in combs:
    entity_1, entity_2 = comb
    relation = model.infer({'text': doc.text, 'h': {'pos': entity_1[2]}, 't': {'pos': entity_2[2]}})
    # print(entity_1[0], relation, entity_2[0])
    if relation[1] >= threshold:
      triples.append((entity_1[0], relation[0], entity_2[0]))
  return triples

In [29]:
%cd /content

/content


In [30]:
with open('triplets.txt', 'w') as fw:
  for sentence in sentences:
    doc = nlp(sentence)
    triplets = extract_triplets(doc)
    opennre = extract_triples_opennre(model, doc, 0.8)
    for triplet in triplets:
      fw.write(str(triplet) + '\n')

    for triplet in opennre:
      fw.write(str(triplet) + '\n')


Swanton Pacific Ranch ('has part', 0.9012349843978882) Santa Cruz County
Swanton Pacific Ranch ('said to be the same as', 0.5751903057098389) California
Swanton Pacific Ranch ('subsidiary', 0.44678452610969543) Davenport
Santa Cruz County ('has part', 0.5785011649131775) California
Santa Cruz County ('has part', 0.48109742999076843) Davenport
California ('subsidiary', 0.5553775429725647) Davenport
Swanton Pacific Ranch ('operator', 0.590891420841217) California Polytechnic State University
Swanton Pacific Ranch ('has part', 0.7554218769073486) the College of Agriculture, Food and Environmental Sciences
Swanton Pacific Ranch ('followed by', 0.4934234321117401) Swanton
the College of Agriculture, Food and Environmental Sciences ('followed by', 0.4124290347099304) Swanton
Waddell Creek ('has part', 0.8398447632789612) the mid 19th century
November 1843 ('followed by', 0.9910421371459961) Ramon Rodriguez
November 1843 ('location', 0.17697328329086304) Francisco Alviso
November 1843 ('follo

## Joint Entity and Relation Extraction: Partition Filter Network
*** Uses a different version of transformers than OpenNRE***

In [None]:
%cd /content

/content


In [None]:
!git clone https://github.com/Coopercoppers/PFN.git

Cloning into 'PFN'...
remote: Enumerating objects: 457, done.[K
remote: Counting objects: 100% (408/408), done.[K
remote: Compressing objects: 100% (396/396), done.[K
remote: Total 457 (delta 233), reused 9 (delta 1), pack-reused 49[K
Receiving objects: 100% (457/457), 9.70 MiB | 14.55 MiB/s, done.
Resolving deltas: 100% (255/255), done.


In [None]:
%cd ./PFN

/content/PFN


In [None]:
!pip install -r requirements.txt


Collecting torch==1.9.0
  Downloading torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
[K     |████████████████████████████████| 831.4 MB 2.0 kB/s 
[?25hCollecting tqdm==4.51.0
  Downloading tqdm-4.51.0-py2.py3-none-any.whl (70 kB)
[K     |████████████████████████████████| 70 kB 6.5 MB/s 
[?25hCollecting numpy==1.20.2
  Downloading numpy-1.20.2-cp37-cp37m-manylinux2010_x86_64.whl (15.3 MB)
[K     |████████████████████████████████| 15.3 MB 160 kB/s 
[?25hCollecting transformers==4.9.1
  Downloading transformers-4.9.1-py3-none-any.whl (2.6 MB)
[K     |████████████████████████████████| 2.6 MB 28.9 MB/s 
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 24.7 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.

In [None]:
sentences[0]

'Swanton Pacific Ranch is a 3,200-acre ranch in Santa Cruz County, California, outside the town of Davenport.'

In [None]:
!python inference.py \
--model_file /content/drive/MyDrive/CSC580-Model.SPR/Relation\ Extraction/bert-nyt/nyt_test.pt \
--sent 'Swanton Pacific Ranch is a 3,200-acre (1,300ha) ranch in Santa Cruz County, California, outside the town of Davenport.'

Downloading: 100% 29.0/29.0 [00:00<00:00, 25.8kB/s]
Downloading: 100% 570/570 [00:00<00:00, 467kB/s]
Downloading: 100% 213k/213k [00:00<00:00, 2.70MB/s]
Downloading: 100% 436k/436k [00:00<00:00, 3.38MB/s]
Downloading: 100% 436M/436M [00:14<00:00, 29.2MB/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identic

In [None]:
!python inference.py \
--model_file /content/drive/MyDrive/CSC580-Model.SPR/Relation\ Extraction/bert-webnlg/web_test.pt \
--sent 'Swanton Pacific Ranch is a 3,200-acre (1,300ha) ranch in Santa Cruz County, California, outside the town of Davenport.'

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "inference.py", line 78, in <module>
    model.load_state_dict(torch.load(args.model_file))
  File "/usr/local/lib/

In [None]:
!python inference.py \
--model_file /content/drive/MyDrive/CSC580-Model.SPR/Relation\ Extraction/albert-ace2005/ace_test.pt \
--sent 'Headquartered in San Jose, California, Orchard Supply Hardware had dozens of locations throughout California, with expansions into Oregon and Florida.'

Downloading: 100% 760k/760k [00:00<00:00, 5.86MB/s]
Downloading: 100% 1.31M/1.31M [00:00<00:00, 8.50MB/s]
Downloading: 100% 706/706 [00:00<00:00, 551kB/s]
Downloading: 100% 893M/893M [00:30<00:00, 29.3MB/s]
Some weights of the model checkpoint at albert-xxlarge-v1 were not used when initializing AlbertModel: ['predictions.LayerNorm.weight', 'predictions.bias', 'predictions.dense.bias', 'predictions.LayerNorm.bias', 'predictions.dense.weight', 'predictions.decoder.weight', 'predictions.decoder.bias']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  F

In [None]:
!python inference.py \
--model_file /content/drive/MyDrive/CSC580-Model.SPR/Relation\ Extraction/ACE2004/0/ace04_test_fold_0.pt \
--sent 'Headquartered in San Jose, California, Orchard Supply Hardware had dozens of locations throughout California, with expansions into Oregon and Florida.'

Traceback (most recent call last):
  File "inference.py", line 77, in <module>
    model = PFN(args, input_size, ner2idx, rel2idx)
NameError: name 'input_size' is not defined


In [None]:
!python inference.py \
--model_file /content/drive/MyDrive/CSC580-Model.SPR/Relation\ Extraction/scibert-scierc/sci_test.pt \
--sent 'Headquartered in San Jose, California, Orchard Supply Hardware had dozens of locations throughout California, with expansions into Oregon and Florida.'

Downloading: 100% 385/385 [00:00<00:00, 259kB/s]
Downloading: 100% 228k/228k [00:00<00:00, 2.12MB/s]
Downloading: 100% 442M/442M [00:12<00:00, 36.0MB/s]
Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model 

In [None]:
import spacy

doc = nlp('The ranch is owned and operated by California Polytechnic State University (Cal Poly) for educational and research in sustainable agriculture.')

In [None]:
draw_dependency_graph(doc)

In [None]:
doc = update_tokenizer(doc)
rule_4(doc)

[('The ranch',
  'owned and operated by',
  'California Polytechnic State University'),
 ('The ranch', 'owned and operated for', 'research'),
 ('The ranch', 'owned and operated in', 'sustainable agriculture')]

In [None]:
draw_dependency_graph(doc)