# Lexicon Extraction

When we extract a term, we want a full set of lexical attributes for it, not just a single string. This notebook explores how we could automate extraction of all the information we care about.

High-level workflow:
- Get a list of candidate terms by applying the classifier over the text
- Consolidate these terms by grouping on the lemmatized form of each word and matching acronyms to their long forms
- Go back through the text and find all matches for each lemmatized term
- For each match, use spaCy processing and the other heuristics described below to categorize plurals, conjugations, nominalizations, etc.
- For any attributes still missing, try to predict them with an algorithm or a lookup resource (the best approach is still an open question)
- Produce a list of term objects, each holding the full set of lexicon information for one term
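
The consolidation step and the final term objects could be sketched roughly as follows. The class name `TermEntry`, its fields, and the function `consolidate_terms` are hypothetical, and the (surface form, lemma, tag) triples stand in for whatever the classifier and the spaCy pipeline actually return.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """Hypothetical container for the lexicon information of one term."""
    lemma: str
    # Surface forms seen in the text, grouped by Penn Treebank tag.
    forms: dict = field(default_factory=lambda: defaultdict(set))


def consolidate_terms(matches):
    """Group (surface form, lemma, tag) triples into one TermEntry per lemma."""
    entries = {}
    for text, lemma, tag in matches:
        key = lemma.lower()
        entry = entries.setdefault(key, TermEntry(lemma=key))
        entry.forms[tag].add(text)
    return entries


# Triples as they might come out of the tagging step (illustrative values).
matches = [
    ("cell", "cell", "NN"),
    ("cells", "cell", "NNS"),
    ("Cells", "cell", "NNS"),
    ("diffuses", "diffuse", "VBZ"),
]
entries = consolidate_terms(matches)
print(sorted(entries))
print(sorted(entries["cell"].forms["NNS"]))
```

Keying on the lowercased lemma means this step is only as good as the lemmatizer: any plural that is not reduced to its singular form ends up as a separate entry.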

In [21]:
import warnings
import spacy
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage
from nltk.stem import PorterStemmer

ps = PorterStemmer()
snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)

warnings.filterwarnings("ignore")



# Noun Terms & Term Phrases

## Attributes to Extract

* **lemmatized singular form**: 
  * token.lemma_ in spaCy preprocessing
  * this will be used as the base representation for matching and everything else
* **acronym**: 
  * Extract from text where defined (e.g. deoxyribonucleic acid (DNA))
* **plural form**: 
  * token.tag_ is NNS or NNPS in spaCy preprocessing, for a token whose lemma matches the term
  * If no plural match is present for a singular lemma, try to predict the plural directly using https://pypi.org/project/inflect/ or a better tool (although see the note on countability)
* **proper vs. improper**: 
  * token.tag_ is NNP or NNPS in spaCy preprocessing
* **indefinite determiner (a vs. an)**: 
  * Extract from text by looking at the word immediately before each match 
* **countable vs. uncountable**: 
  * Not sure how to extract. One possible heuristic: treat a term as uncountable if no plural form is detected in the text.
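
Acronym definitions of the form "long form (ACRONYM)" could be picked up with a small heuristic like the one below. This is only a sketch: the function names are hypothetical, and the matching rule (the acronym's letters must appear in order within the preceding words, starting at the first letter of the first word) is an assumption rather than an established method. Note that matching on word initials alone would not work here, since the initials of "deoxyribonucleic acid" are "da", not "DNA".

```python
import re

# A parenthesized token that starts with a capital letter, e.g. "(DNA)".
ACRONYM_PATTERN = re.compile(r"\(([A-Z][A-Za-z]{1,9})\)")


def is_subsequence(short, long):
    """True if the characters of `short` appear in order within `long`."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def find_acronym_definitions(text):
    """Map each parenthesized acronym to the shortest plausible long form before it."""
    definitions = {}
    for match in ACRONYM_PATTERN.finditer(text):
        acronym = match.group(1)
        preceding = re.findall(r"[A-Za-z-]+", text[:match.start()])
        # Try 1 word, then 2 words, ... up to one word per acronym letter.
        for n in range(1, min(len(acronym), len(preceding)) + 1):
            candidate = preceding[-n:]
            phrase = " ".join(candidate)
            starts_right = candidate[0][0].lower() == acronym[0].lower()
            if starts_right and is_subsequence(acronym.lower(), phrase.lower()):
                definitions[acronym] = phrase
                break
    return definitions


sentence = ("The two main types of nucleic acids are deoxyribonucleic acid (DNA) "
            "and ribonucleic acid (RNA)")
acronyms = find_acronym_definitions(sentence)
print(acronyms)
```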

## Challenges 

* Predicting/matching plurals for non-standard terms will probably need to be handled manually on a case-by-case basis (for example, mitochondria vs. mitochondrion)
* Determining countability vs. a missing plural form: it is difficult to determine the countability of a term. One heuristic could be to check whether the noun appears at least X times. If it does and no plural form was seen, assume it is uncountable; otherwise, try to predict the plural?
* Some terms are countable only in their acronym form (DNAs)
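
The frequency heuristic for countability could look something like this. It is a sketch only: `guess_countability` and its `min_count` parameter (the "X" above) are hypothetical names, the default threshold is arbitrary, and the input is assumed to be (lemma, tag) pairs already produced by the pipeline.

```python
from collections import Counter


def guess_countability(tagged_tokens, lemma, min_count=5):
    """Guess countability of `lemma` from (lemma, tag) pairs seen in the text."""
    tags = Counter(tag for tok_lemma, tag in tagged_tokens if tok_lemma == lemma)
    plural = tags["NNS"] + tags["NNPS"]
    singular = tags["NN"] + tags["NNP"]
    if plural > 0:
        return "countable"
    if singular >= min_count:
        # Seen often, never in the plural: assume uncountable.
        return "uncountable"
    # Too few occurrences to say anything.
    return "unknown"


# Illustrative counts, not taken from a real text.
tokens = [("water", "NN")] * 6 + [("cell", "NN"), ("cell", "NNS"), ("ribosome", "NN")]
for term in ("water", "cell", "ribosome"):
    print(term, guess_countability(tokens, term))
```

The heuristic inherits any lemmatization errors: a plural whose lemma is not reduced to the singular form is never counted as a plural of that term.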

## Examples

### deoxyribonucleic acid (DNA)

* **lemmatized singular form**: deoxyribonucleic acid
* **acronym**: DNA
  * Example Sentence: 'The two main types of nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA)'
* **plural form**: DNAs
  * present in text and tagged as plural, but its lemma comes back as DNAs, so it fails to match the singular form
* **proper vs. improper**: improper
* **indefinite determiner (a vs. an)**: 
  * uncountable... 
* **countable vs. uncountable**: 
  * Both countable and uncountable? 
* **parts of speech**: ['JJ', 'NN']

In [22]:
# the lemmatizer fails to reduce 'DNAs' to 'DNA', although the tagger does mark it as plural (NNS)
for token in nlp(('The kinetoplastid subgroup is named after the kinetoplast, a large '
                   'modified mitochondrion carrying multiple circular DNAs.')):
    if token.text == 'DNAs':
        print(token.lemma_)
        print(token.tag_)

print()
# parts of speech for the full phrase
print(nlp('deoxyribonucleic acid')[0].tag_)
print(nlp('deoxyribonucleic acid')[1].tag_)

DNAs
NNS

JJ
NN


### mitochondrion

* **lemmatized singular form**: mitochondrion 
* **acronym**: n/a
* **plural form**: mitochondria
  * present in text, but it is not tagged as plural and its lemma does not match the singular form
* **proper vs. improper**: improper
* **indefinite determiner (a vs. an)**: a 
* **countable vs. uncountable**: countable
* **parts of speech**: ['NN']

In [31]:
# doesn't detect plural form or correctly lemmatize
print(nlp('there is a mitochondrion')[-1].tag_)
print(nlp('mitochondrion')[0].lemma_)
print(nlp('there are many mitochondria')[-1].tag_)
print(nlp('mitochondria')[0].lemma_)

NN
mitochondrion
NN
mitochondria
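
One way to patch over both failures seen so far (the unreduced `DNAs` and the undetected `mitochondria`) is a small correction pass applied after tagging. The sketch below is an assumption about how that could look: the override table, the regular expression, and `normalize_noun` are all hypothetical, and the table would need to be maintained by hand.

```python
import re

# Hand-maintained overrides for plurals the lemmatizer misses (illustrative entry).
IRREGULAR_PLURALS = {"mitochondria": "mitochondrion"}

# An all-caps acronym followed by a lowercase 's', e.g. 'DNAs'.
ACRONYM_PLURAL = re.compile(r"^([A-Z]{2,})s$")


def normalize_noun(text, lemma, tag):
    """Return a corrected (lemma, tag) pair for one noun token."""
    override = IRREGULAR_PLURALS.get(text.lower())
    if override is not None:
        return override, "NNS"
    acronym = ACRONYM_PLURAL.match(text)
    if acronym:
        return acronym.group(1), "NNS"
    # Nothing to correct: keep what the pipeline produced.
    return lemma, tag


print(normalize_noun("mitochondria", "mitochondria", "NN"))
print(normalize_noun("DNAs", "DNAs", "NNS"))
print(normalize_noun("acid", "acid", "NN"))
```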


# Verb Terms

## Attributes to Extract

* **infinitive (root form)**:
  * token.lemma_ in spaCy preprocessing
* **present participle/progressive**:
  * token.tag_ == VBG
  * how to extract if not present in text?
* **past participle**:
  * token.tag_ == VBN
  * how to extract if not present in text?
* **past tense**:
  * token.tag_ == VBD
  * how to extract if not present in text?
* **3rd person singular**:
  * token.tag_ == VBZ
  * how to extract if not present in text?
* **nominalization(s)**:
  * how to extract?
  * one heuristic: look for nouns whose stem matches the verb's stem (could additionally verify with common nominalization endings such as -ion, -ation, etc.)
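
The stem-matching heuristic for nominalizations might be sketched as below. The suffix list is an assumption and is not exhaustive, `find_nominalizations` is a hypothetical name, and `crude_stem` is only a stand-in so that the example runs without NLTK; in the notebook one would pass the Porter stemmer's `stem` method instead.

```python
NOMINAL_SUFFIXES = ("ion", "ment", "ance", "ence")  # assumed, not exhaustive


def crude_stem(word):
    """Stand-in stemmer for illustration; not a replacement for a real stemmer."""
    for suffix in ("ation", "ion", "ing", "es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def find_nominalizations(verb_lemma, nouns, stem=crude_stem):
    """Return nouns that share the verb's stem and end in a nominalizing suffix."""
    target = stem(verb_lemma)
    return [n for n in nouns if n.endswith(NOMINAL_SUFFIXES) and stem(n) == target]


candidates = ["diffusion", "diffuser", "osmosis", "solution"]
print(find_nominalizations("diffuse", candidates))
```

Requiring both the shared stem and the suffix keeps related nouns such as "diffuser" out of the result, at the cost of missing nominalizations with other endings.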

## Challenges

* Extracting particular inflected forms when they are not present in the text, and so cannot be tagged by spaCy. Perhaps there is an external resource that could be used to look them up? Lemmatization can also be imperfect across conjugations
* Matching nominalizations: not sure if there is a better tool than the stem-matching hack suggested above
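
In the absence of a lookup resource, missing verb forms could be guessed with regular English spelling rules and overridden by anything actually observed in the text. This is a naive sketch under that assumption: both function names are hypothetical, and the rules ignore irregular verbs, final-consonant doubling (stop, stopped), and -ie verbs (die, dying).

```python
def guess_regular_forms(lemma):
    """Guess inflections of a verb lemma using regular spelling rules only."""
    if lemma.endswith("e"):
        # 'diffuse' -> 'diffusing'; keep the 'e' for 'ee' verbs such as 'agree'
        gerund = lemma + "ing" if lemma.endswith("ee") else lemma[:-1] + "ing"
        past, third = lemma + "d", lemma + "s"
    elif len(lemma) > 1 and lemma.endswith("y") and lemma[-2] not in "aeiou":
        # consonant + y: 'carry' -> 'carried', 'carries'
        gerund, past, third = lemma + "ing", lemma[:-1] + "ied", lemma[:-1] + "ies"
    elif lemma.endswith(("s", "x", "z", "ch", "sh")):
        gerund, past, third = lemma + "ing", lemma + "ed", lemma + "es"
    else:
        gerund, past, third = lemma + "ing", lemma + "ed", lemma + "s"
    return {"VBG": gerund, "VBD": past, "VBN": past, "VBZ": third}


def fill_missing_forms(lemma, observed):
    """Prefer forms observed in the text; fall back to guesses for the rest."""
    forms = guess_regular_forms(lemma)
    forms.update(observed)
    return forms


print(fill_missing_forms("diffuse", {"VBZ": "diffuses"}))
```

Because observed forms take priority, an irregular form that does appear in the text (and is lemmatized correctly) replaces the wrong guess.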

## Examples

### diffuse/diffusion

* **infinitive (root form)**: diffuse
* **present participle/progressive**: diffusing (fails to match on lemma)
* **past participle**: diffused
* **past tense**: diffused
* **3rd person singular**: diffuses
* **Nominalization(s)**: diffusion


In [50]:
# 3rd person singular
print(nlp('diffuses')[0].tag_)
print(nlp('diffuses')[0].lemma_)
print()
# past tense/participle
print(nlp('diffused')[0].tag_)
print(nlp('it diffused')[1].tag_)
print(nlp('diffused')[0].lemma_)
print()
# present participle
print(nlp('it is diffusing')[-1].tag_)
print(nlp('it is diffusing')[-1].lemma_)

VBZ
diffuse

VBN
VBD
diffuse

VBG
diffusing


In [40]:
# can match on stem with different POS tags and key ending -ion
print(ps.stem("diffusion"))
print(ps.stem("diffuses"))

diffus
diffus


# Adjective Terms

## Attributes to Extract

* **base form**:
  * token.tag_ == JJ
* **comparative**:
  * token.tag_ == JJR
  * How to extract/determine exists if not present?
* **superlative**:
  * token.tag_ == JJS
  * How to extract/determine exists if not present?
* **non-attributive vs. attributive**:
  * ??? How to extract?
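
When no comparative or superlative is present in the text, the forms could be guessed from a rough syllable count. The sketch below rests on a simplifying assumption (short adjectives take -er/-est, longer ones take more/most), the function name is hypothetical, and the vowel-group syllable count is crude. It does not handle consonant doubling (big, bigger) or irregular forms (good, better), and it says nothing about whether the adjective is gradable at all.

```python
import re


def guess_degree_forms(adjective):
    """Return a guessed (comparative, superlative) pair for an adjective."""
    syllables = len(re.findall(r"[aeiouy]+", adjective.lower()))
    if adjective.endswith("e") and syllables > 1:
        syllables -= 1  # treat a final 'e' as silent
    if syllables <= 1:
        if adjective.endswith("e"):
            return adjective + "r", adjective + "st"
        return adjective + "er", adjective + "est"
    if syllables == 2 and adjective.endswith("y"):
        return adjective[:-1] + "ier", adjective[:-1] + "iest"
    # Longer adjectives: periphrastic forms.
    return "more " + adjective, "most " + adjective


for adj in ("anabolic", "small", "large"):
    print(adj, guess_degree_forms(adj))
```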


## Challenges

* How to determine attributiveness?
* How to determine comparative & superlative forms if they are not present in the text?
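
One possible signal for attributiveness is the dependency label of each adjective occurrence: an adjective directly modifying a noun is typically labeled `amod`, while a predicative use gets some other label. The sketch below only counts the two cases. The function name is hypothetical, the (lemma, tag, dependency label) triples are illustrative rather than real pipeline output, and treating every non-`amod` use as non-attributive is an assumption.

```python
from collections import Counter


def attributive_profile(tokens, lemma):
    """Count attributive (amod) vs. other uses of an adjective lemma."""
    counts = Counter()
    for tok_lemma, tag, dep in tokens:
        if tok_lemma == lemma and tag.startswith("JJ"):
            counts["attributive" if dep == "amod" else "other"] += 1
    return counts


# Illustrative triples, e.g. from 'aerobic respiration' vs. 'the process is aerobic'.
tokens = [
    ("aerobic", "JJ", "amod"),
    ("respiration", "NN", "nsubj"),
    ("aerobic", "JJ", "amod"),
    ("aerobic", "JJ", "root"),
]
profile = attributive_profile(tokens, "aerobic")
print(dict(profile))
```

An adjective that never appears with `amod` across many occurrences would be a candidate for the non-attributive label, subject to the same frequency caveats as the countability heuristic.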

## Examples

### anabolic

* base form: anabolic
* comparative: more anabolic 
* superlative: most anabolic
* non-attributive

-> Would anabolism be separate? Or are there 

# Summary of Key Challenges

* Even if present in text, inflections do not always match on lemmatized versions (mitochondria vs. mitochondrion, diffusing vs. diffuse)
* If inflectional forms are not present in the text, we need a way to predict them or, ideally, an external lookup source
* Certain aspects are not easily extractable from common preprocessing (nominalizations, countability, etc.) 