# Relation Extraction EDA

This notebook explores approaches to relation extraction, formulated as follows:

    Given a set of sentences & key term pairs -> 
    Classify any relations present between each pair of terms 

In [10]:
import pandas as pd
import pickle
import json
from collections import defaultdict
import spacy
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage
#stanfordnlp.download('en')
snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)

import warnings
warnings.filterwarnings('ignore')

# fix for importing utils
import os
import sys
module_path = os.path.abspath(os.path.join('../utils'))
if module_path not in sys.path:
    sys.path.append(module_path)
from utils import tag_terms

data_dir = '../data/relation_extraction'

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/home/mattboggess7/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/home/mattboggess7/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/home/mattboggess7/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/home/mattboggess7/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': '/home/mattboggess7/stanfordnlp_resources/en_ewt_models/en_ewt_parser.pt', 'pretrain_path': '/home/mattboggess7/sta

# Distant Supervision

Here, we leverage an existing partial knowledge base (from Inquire in this case) and use it to train a model that predicts further relations between key terms. This is known as distant supervision because the labels are constructed noisily: we assume that every pair of terms related in the knowledge base exhibits that relation in every sentence where the pair co-occurs.
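
The labeling assumption described above can be sketched on toy data. This is only an illustration, not the notebook's pipeline: the triples, the sentences, the `distant_labels` helper and the `no-relation` label are all hypothetical, and terms are assumed to be already matched to each sentence.

```python
from itertools import permutations

# Toy knowledge base of (subject, relation, object) triples (hypothetical data).
kb_triples = {
    ("cell", "has-part", "nucleus"),
    ("photosynthesis", "result", "glucose"),
}

# Toy sentences paired with the key terms found in each one (hypothetical data).
tagged = [
    ("The nucleus of the cell stores DNA.", ["cell", "nucleus"]),
    ("A cell divides near the nucleus.", ["cell", "nucleus"]),
    ("Glucose enters the cell.", ["glucose", "cell"]),
]


def distant_labels(tagged, kb_triples, negative_label="no-relation"):
    """Label every ordered term pair in every sentence by looking it up in the KB."""
    # Index the KB relations by ordered (subject, object) pair.
    pair_to_relations = {}
    for subj, rel, obj in kb_triples:
        pair_to_relations.setdefault((subj, obj), set()).add(rel)

    examples = []
    for sentence, terms in tagged:
        for subj, obj in permutations(terms, 2):
            # Pairs missing from the KB become negative examples.
            rels = pair_to_relations.get((subj, obj), {negative_label})
            for rel in sorted(rels):
                examples.append((sentence, subj, obj, rel))
    return examples


examples = distant_labels(tagged, kb_triples)
for example in examples:
    print(example)
```

Note that the second sentence is labeled `has-part` for (`cell`, `nucleus`) even though it does not express that relation, which is exactly the label noise distant supervision accepts.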

Based on the following work:
- CS224U Tutorials:
    - https://nbviewer.jupyter.org/github/cgpotts/cs224u/blob/master/rel_ext_01_task.ipynb
    - https://nbviewer.jupyter.org/github/cgpotts/cs224u/blob/master/rel_ext_02_experiments.ipynb

Key Limitations:
- Does not generalize to new relations, so it may not transfer well to other textbooks or subject areas, especially those that need relations absent from the knowledge base.

Questions to Investigate:
- How many training examples are needed to reliably learn a relation?

## Read Data

### KB Terms

In [2]:
with open(f"{data_dir}/concepts.txt", "r") as f:
    terms = f.readlines()
terms = set([c.split('|')[1].strip() for c in terms])
print(f"Number of Manually Extracted KB Terms: {len(terms)}")

Number of Manually Extracted KB Terms: 5933


### KB Relationships

In [3]:
def wrangle_relations(triples):
    """Group triples by relation and build a table of per-relation triple counts."""
    relations = defaultdict(list)
    for triple in triples:
        # the relation name is the middle element of each triple
        relation = triple[1]
        relations[relation].append(triple)

    relation_words = list(relations.keys())
    relation_counts = [len(relations[rel]) for rel in relation_words]
    relation_table = pd.DataFrame({'relations': relation_words, 'triples_count': relation_counts})
    # most frequent relations first
    relation_table = relation_table.sort_values('triples_count', ascending=False).reset_index(drop=True)
    return relations, relation_table
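
As a cross-check on the count table, roughly the same result can be obtained with a pandas `groupby`. The sketch below runs on made-up triples; the column names `subject`, `relation` and `object` are assumptions about how the triples are ordered, not something the data files state.

```python
import pandas as pd

# Made-up triples in (subject, relation, object) order (assumed ordering).
triples = {
    ("cell", "has-part", "nucleus"),
    ("cell", "has-part", "membrane"),
    ("mitosis", "subevent", "prophase"),
}

df = pd.DataFrame(sorted(triples), columns=["subject", "relation", "object"])

# Count triples per relation, most frequent first.
relation_table = (
    df.groupby("relation")
    .size()
    .sort_values(ascending=False)
    .rename("triples_count")
    .reset_index()
    .rename(columns={"relation": "relations"})
)
print(relation_table)
```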

#### Structure Relations

In [4]:
with open(f"{data_dir}/structure.txt", "r") as f:
    structure_triples = f.readlines()
structure_triples = [s.split("|") for s in structure_triples]
structure_triples = set([(s[3].strip(), s[4].strip(), s[-1].strip()) for s in structure_triples])

structure_relations, structure_info = wrangle_relations(structure_triples)
structure_info["relation_type"] = "structure"
print(f"Number of Unique KB Structure Relations: {structure_info.shape[0]}")
structure_info

Number of Unique KB Structure Relations: 44


|   | relations | triples_count | relation_type |
|---|---|---|---|
| 0 | has-part | 4281 | structure |
| 1 | has-region | 1561 | structure |
| 2 | possesses | 784 | structure |
| 3 | is-inside | 604 | structure |
| 4 | encloses | 603 | structure |
| 5 | element | 497 | structure |
| 6 | size | 349 | structure |
| 7 | is-between | 267 | structure |
| 8 | is-at | 246 | structure |
| 9 | does-not-enclose | 169 | structure |


#### Process Relations

In [5]:
with open(f"{data_dir}/process.txt", "r") as f:
    process_triples = f.readlines()
process_triples = [s.split("|") for s in process_triples]
process_triples = set([(s[3].strip(), s[4].strip(), s[-1].strip()) for s in process_triples])

process_relations, process_info = wrangle_relations(process_triples)
process_info["relation_type"] = "process"
print(f"Number of Unique KB Process Relations: {process_info.shape[0]}")
process_info

Number of Unique KB Process Relations: 13


|   | relations | triples_count | relation_type |
|---|---|---|---|
| 0 | object | 2520 | process |
| 1 | subevent | 1839 | process |
| 2 | base | 1784 | process |
| 3 | result | 1678 | process |
| 4 | agent | 1488 | process |
| 5 | raw-material | 827 | process |
| 6 | next-event | 613 | process |
| 7 | first-subevent | 330 | process |
| 8 | instrument | 255 | process |
| 9 | donor | 164 | process |


#### Combine Relations

In [6]:
relations = {**process_relations, **structure_relations}
relations_info = pd.concat([structure_info, process_info]).sort_values("triples_count", ascending=False).reset_index(drop=True)
relations_info

|   | relations | triples_count | relation_type |
|---|---|---|---|
| 0 | has-part | 4281 | structure |
| 1 | object | 2520 | process |
| 2 | subevent | 1839 | process |
| 3 | base | 1784 | process |
| 4 | result | 1678 | process |
| 5 | has-region | 1561 | structure |
| 6 | agent | 1488 | process |
| 7 | raw-material | 827 | process |
| 8 | possesses | 784 | structure |
| 9 | next-event | 613 | process |


### Textbook Sentences

In [7]:
with open(f"{data_dir}/selected_textbook_sentences.txt", "r") as f:
    sentences = f.readlines()
sentences = [" ".join(sent.split("\t")[1:]) for sent in sentences]
print(f"Number of Textbook Sentences: {len(sentences)}")

Number of Textbook Sentences: 18730


## Tag Terms & Relations


### Pre-process Terms & Sentences

In [9]:
processed_terms = [nlp(term) for term in terms]

In [23]:
processed_sentences = [nlp(sentence) for sentence in sentences]

KeyboardInterrupt: 

### Tag Sentences w/ Terms

In [None]:
tagged_sentences = []
for sentence in processed_sentences:
    found_terms, tokenized_sentence, tagged_sentence = tag_terms(sentence, processed_terms)
    tagged_sentences.append({"terms": found_terms,
                             "tokenized_sentence": tokenized_sentence,
                             "tagged_sentence": tagged_sentence})

In [None]:
with open("tagged_sentences.json", "w") as f:
    json.dump(tagged_sentences, f)

### Tag Sentences w/ Relations 

### Summary Statistics

- How many negative examples are there?
- How many term pairs have more than one relation?
- How many examples per relation?
- How many sentences per relation?
- How many sentences per example per relation?
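
One possible way to answer these questions is sketched below with pandas on a tiny made-up table. The column names, the `no-relation` label, and the reading of "example" as a distinct (subject, object) term pair are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical labeled examples: one row per (sentence, term pair, relation).
examples = pd.DataFrame(
    [
        ("s1", "cell", "nucleus", "has-part"),
        ("s2", "cell", "nucleus", "has-part"),
        ("s2", "cell", "nucleus", "encloses"),
        ("s3", "glucose", "cell", "no-relation"),
    ],
    columns=["sentence", "subject", "object", "relation"],
)

# Negative examples are rows carrying the assumed "no-relation" label.
is_negative = examples["relation"] == "no-relation"
n_negative = int(is_negative.sum())
positives = examples[~is_negative]

# Term pairs that carry more than one distinct relation.
relations_per_pair = positives.groupby(["subject", "object"])["relation"].nunique()
n_multi_relation_pairs = int((relations_per_pair > 1).sum())

# Distinct term pairs per relation.
pairs_per_relation = (
    positives.drop_duplicates(["subject", "object", "relation"])
    .groupby("relation")
    .size()
)

# Distinct sentences per relation.
sentences_per_relation = positives.groupby("relation")["sentence"].nunique()

# Distinct sentences per term pair within each relation.
sentences_per_pair = positives.groupby(["relation", "subject", "object"])["sentence"].nunique()

print(n_negative, n_multi_relation_pairs)
print(pairs_per_relation)
print(sentences_per_relation)
print(sentences_per_pair)
```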