# Data Processing EDA

This notebook computes summary statistics and runs sanity checks on the outputs of the data processing pipeline, which turns raw_data into a relations database and a set of sentences tagged with key terms.

In [18]:
import pandas as pd
import pickle
import json

from collections import defaultdict
import spacy

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage
#stanfordnlp.download('en')
snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)

import warnings
warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

# fix for importing utils
import os
import sys
module_path = os.path.abspath(os.path.join('../utils'))
if module_path not in sys.path:
    sys.path.append(module_path)
from utils import tag_text, write_spacy_docs, read_spacy_docs

data_dir = '../data/relation_extraction'

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/Users/mattboggess/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/Users/mattboggess/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/Users/mattboggess/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/Users/mattboggess/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': '/Users/mattboggess/stanfordnlp_resources/en_ewt_models/en_ewt_parser.pt', 'pretrain_path': '/Users/mattboggess/sta

ImportError: cannot import name 'tag_text' from 'utils' (/Users/mattboggess/tokn/utils/utils.py)

Relation EDA:
- Programmatically verify that no sentence containing a term pair was left untagged
- Identify term pairs that carry more than one relation label (multi-labels)

Double Check:
- Bi-directionality of relations
- How text representations are being collected
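The bi-directionality item can be checked mechanically. The following is a minimal sketch, assuming the relations database is a dict of the form `{relation: {"e1 -> e2": {...}}}` as loaded later in this notebook; `find_bidirectional_pairs` and `toy_rdb` are hypothetical names used only for illustration.

```python
def find_bidirectional_pairs(rdb, sep=" -> "):
    """Return {relation: [(e1, e2), ...]} for pairs stored in both directions.

    Assumes every term-pair key contains `sep` exactly as in "e1 -> e2".
    Each bidirectional pair is reported once, with e1 < e2 alphabetically.
    """
    found = {}
    for relation, pairs in rdb.items():
        keys = set(pairs)
        hits = sorted(
            (e1, e2)
            for e1, e2 in (tp.split(sep, 1) for tp in keys)
            if e1 < e2 and f"{e2}{sep}{e1}" in keys
        )
        if hits:
            found[relation] = hits
    return found


# Toy data standing in for the real relations database.
toy_rdb = {
    "abuts": {
        "a -> b": {"sentences": []},
        "b -> a": {"sentences": []},
        "a -> c": {"sentences": []},
    },
    "has-part": {"x -> y": {"sentences": []}},
}
print(find_bidirectional_pairs(toy_rdb))
```

A symmetric relation such as `abuts` would be expected to show up here, while a pair reported under an asymmetric relation such as `has-part` may deserve a closer look.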

# Relations Database

In [19]:
with open("../data/relation_extraction/relations_db.json", "r") as f:
    rdb = json.load(f)

## Summary Statistics

In [22]:
long_df = {"relation": [], "term_pair": [], "count_sentences": [], "found_sentence": []}
for relation in rdb:
    for term_pair in rdb[relation]: 
        long_df["relation"].append(relation)
        long_df["term_pair"].append(term_pair)
        long_df["count_sentences"].append(len(rdb[relation][term_pair]["sentences"]))
        long_df["found_sentence"].append(len(rdb[relation][term_pair]["sentences"]) > 0)
long_df = pd.DataFrame(long_df)

In [25]:
summary_df = long_df.groupby(["relation", "found_sentence"]).agg({"term_pair": "count",
                                                                  "count_sentences": ["sum", "mean"]})
summary_df

| relation | found_sentence | term_pair (count) | count_sentences (sum) | count_sentences (mean) |
|---|---|---|---|---|
| abuts | False | 92 | 0 | 0.0 |
| abuts | True | 18 | 146 | 8.111111 |
| element | False | 421 | 0 | 0.0 |
| element | True | 76 | 530 | 6.973684 |
| has-part | False | 3816 | 0 | 0.0 |
| has-part | True | 465 | 4714 | 10.137634 |
| has-region | False | 1444 | 0 | 0.0 |
| has-region | True | 117 | 861 | 7.358974 |
| is-at | False | 210 | 0 | 0.0 |
| is-at | True | 36 | 316 | 8.777778 |
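The counts above are easier to compare as a coverage fraction per relation. For `abuts`, 18 of 110 term pairs have at least one sentence, roughly 16%. The sketch below computes that fraction directly from the database dict; `coverage_by_relation` and `toy_rdb` are hypothetical names, and the dict layout is assumed to match `rdb`.

```python
def coverage_by_relation(rdb):
    """Fraction of term pairs per relation that have at least one sentence."""
    coverage = {}
    for relation, pairs in rdb.items():
        total = len(pairs)
        matched = sum(1 for info in pairs.values() if info["sentences"])
        coverage[relation] = matched / total if total else 0.0
    return coverage


# Toy data standing in for the real relations database.
toy_rdb = {
    "abuts": {
        "a -> b": {"sentences": ["a touches b."]},
        "a -> c": {"sentences": []},
        "c -> d": {"sentences": []},
        "d -> e": {"sentences": []},
    },
    "is-at": {},
}
print(coverage_by_relation(toy_rdb))
```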


## Multi-Label Term Pairs

In [27]:
term_relation_mapping = {}
for relation in rdb:
    for tp in rdb[relation]:
        if tp in term_relation_mapping:
            term_relation_mapping[tp].append(relation)
        else:
            term_relation_mapping[tp] = [relation]
{tp:r for tp, r in term_relation_mapping.items() if len(r) > 1}

{'mixture -> substance': ['subclass-of', 'has-part'],
 'membrane -> membrane': ['subclass-of', 'is-at'],
 'segment of body -> anatomical structure': ['subclass-of', 'has-part'],
 'bone -> connective tissue': ['subclass-of', 'has-part'],
 'CDNA -> dna': ['subclass-of', 'abuts'],
 'circular dna -> dna': ['subclass-of', 'has-part'],
 'diet -> object': ['subclass-of', 'element'],
 'gene -> dna sequence': ['subclass-of', 'has-part'],
 'glycosidic linkage -> polar covalent bond': ['subclass-of', 'element'],
 'homeodomain -> protein domain': ['subclass-of', 'has-region'],
 'hydrophobic core -> hydrophobic region': ['subclass-of', 'abuts'],
 'nucleic acid probe -> nucleic acid strand': ['subclass-of', 'has-part'],
 'polymer -> chemical': ['subclass-of', 'has-part'],
 'proteasome -> protein': ['subclass-of', 'has-part'],
 'seed -> tissue': ['subclass-of', 'has-part'],
 'substance -> object': ['subclass-of', 'has-part'],
 'trimer -> molecule': ['subclass-of', 'has-part'],
 'variable domain -> pr

## Sanity Checks

### Are we somehow not matching sentences?

- Lemmatization is not robust: plurals such as "organisms" and "amphibians" keep their trailing "s".
- Some terms that appear in relations are missing from the lexicon dump (e.g. purine).
- Concepts that only have a Word Frame, with no text representations in the current dump, are still missing.

In [41]:
nlp('amphibians')[0].lemma_

'amphibians'
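One way to reduce dependence on the lemmatizer is to match terms with a pattern that tolerates a plural suffix. This is a rough sketch and not the tagging logic used in `utils`; `plural_tolerant_pattern` is a hypothetical helper, and irregular plurals (for example "nuclei") would still be missed.

```python
import re


def plural_tolerant_pattern(term):
    """Compile a regex matching `term` as a whole word, optionally followed by 's' or 'es'."""
    return re.compile(r"\b" + re.escape(term) + r"(?:s|es)?\b", flags=re.IGNORECASE)


pattern = plural_tolerant_pattern("amphibian")
print(bool(pattern.search("Amphibians are vertebrate tetrapods.")))
print(bool(pattern.search("an amphibianlike fossil")))
```

The word boundaries keep the term from matching inside a longer word, which a plain substring test would not prevent.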

In [37]:
bio_sentences = pd.read_csv("../data/raw_data/openstax/sentences_Biology_2e_parsed.csv")
EXCLUDE_SECTIONS = [
    "Preface", "Chapter Outline", "Index", "Summary", "Multiple Choice",
    "Fill in the Blank", "short Answer", "Critical Thinking", "References", 
    "Units", "Conversion Factors", "Fundamental Constants", "Astronomical Data",
    "Mathematical Formulas", "The Greek Alphabet", "Chapter 1", "Chapter 2",
    "Chapter 3", "Chapter 4", "Chapter 5", "Chapter 6", "Chapter 7", "Chapter 8",
    "Chapter 9", "Chapter 10", "Chapter 11", "Chapter 12", "Chapter 13", "Chapter 14", 
    "Chapter 15", "Chapter 16", "Chapter 17", "Critical Thinking Questions", 
    "Visual Connection Questions", "Key Terms", "Review Questions", 
    "The Periodic Table of Elements", "Measurements and the Metric System"]
bio_sentences = bio_sentences[~bio_sentences.section_name.isin(EXCLUDE_SECTIONS)]
for relation in rdb:
    for term_pair in rdb[relation]:
        if not len(rdb[relation][term_pair]["sentences"]):
            terms = term_pair.split(" -> ")
            if terms[0] in terms[1] or terms[1] in terms[0]:
                continue
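            # Literal substring match: regex=False keeps special characters in terms
            # from being parsed as patterns, and na=False treats missing sentences as no match.
            has_both = (bio_sentences.sentence.str.contains(terms[0], regex=False, na=False)
                        & bio_sentences.sentence.str.contains(terms[1], regex=False, na=False))
            term_sentences = list(bio_sentences[has_both].sentence)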
            if len(term_sentences):
                print(term_pair)
                print(relation)
                print(rdb[relation][term_pair])
                print(term_sentences[:2])

role -> thing
subclass-of
{'sentences': [], 'e1_representations': ['role'], 'e2_representations': ['thing']}
['By the end of this section, you will be able to do the following:    Describe the role of cells in organisms   Compare and contrast light microscopy and electron microscopy   Summarize cell theory       A cell is the smallest unit of a living thing.']
active transport -> transport work
subclass-of
{'sentences': [], 'e1_representations': ['active transport'], 'e2_representations': ['transport work']}
['Recall the active transport work of the sodium-potassium pump in cell membranes.']
adaptive immunity -> immune response
subclass-of
{'sentences': [], 'e1_representations': ['adaptive immunity'], 'e2_representations': ['immune response']}
['By the end of this section, you will be able to do the following:    Explain adaptive immunity   Compare and contrast adaptive and innate immunity   Describe cell-mediated immune response and humoral immune response   Describe immune tolerance 

KeyboardInterrupt: 
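The loop above rescans every sentence twice for every unmatched term pair, which is slow (the run shown ended with a KeyboardInterrupt). A possible speed-up, sketched below under the assumption that many terms recur across pairs, is to cache the set of sentence indices containing each term and intersect the sets. `make_cooccurrence_finder` is a hypothetical helper, and it uses the same literal substring test as the loop above.

```python
from functools import lru_cache


def make_cooccurrence_finder(sentences, sep=" -> "):
    """Build a lookup that returns the sentences containing both terms of a pair."""
    sentences = list(sentences)

    @lru_cache(maxsize=None)
    def indices_for(term):
        # Each distinct term scans the sentence list only once.
        return frozenset(i for i, s in enumerate(sentences) if term in s)

    def find(term_pair):
        e1, e2 = term_pair.split(sep, 1)
        shared = sorted(indices_for(e1) & indices_for(e2))
        return [sentences[i] for i in shared]

    return find


# Toy sentences standing in for bio_sentences.sentence.
toy_sentences = [
    "A cell is the smallest unit of a living thing.",
    "Describe the role of cells in organisms.",
    "The role of a living thing in its habitat varies.",
]
find = make_cooccurrence_finder(toy_sentences)
print(find("role -> thing"))
```

Substring matching still produces false positives such as "active transport work" matching both "active transport" and "transport work", so hits from this check are candidates to inspect and not confirmed misses.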