# Stanza Biomedical Models NLP Example

_V. Keith Hughitt_ (Aug 2020)

## Overview

In this notebook we will explore the use of pre-trained [Stanza Bio models](https://stanfordnlp.github.io/stanza/biomed.html) for analyzing text from PubMed articles retrieved via [PubTator Central](https://www.ncbi.nlm.nih.gov/research/pubtator/api.html).

For more information, check out the pre-print on arXiv:

- [Biomedical and Clinical English Model Packages in the Stanza Python NLP Library (Zhang et al., 2020)](https://arxiv.org/abs/2007.14640)

## Setup

In [1]:
#from pathlib import Path
import json
import requests
import stanza
import pandas as pd

## Query PubTator Central

First, let's retrieve some article text using the [PubTator Central API](https://www.ncbi.nlm.nih.gov/research/pubtator/api.html).

Depending on what we are interested in, we can either retrieve article abstracts or full-texts.

For the former, we provide one or more PubMed IDs ("pmids"), and for the latter, one or more PubMed Central IDs ("pmcids").
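As a minimal sketch of the "one or more" case, the helper below builds a query URL for a list of IDs. The function name `build_export_url` is hypothetical, and joining the IDs with commas is an assumption about how the API expects multiple IDs, so check the API documentation before relying on it.

```python
# Hypothetical helper, not part of the PubTator Central API itself.
BASE_URL = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson"


def build_export_url(ids, id_type="pmids"):
    """Build an export URL for one or more article IDs.

    id_type is "pmids" for abstracts or "pmcids" for full texts.
    Assumption: multiple IDs are passed as a comma-separated list.
    """
    if id_type not in ("pmids", "pmcids"):
        raise ValueError(f"unsupported id_type: {id_type!r}")
    if isinstance(ids, str):
        ids = [ids]
    return f"{BASE_URL}?{id_type}={','.join(ids)}"


print(build_export_url("32144272"))
print(build_export_url(["PMC7060223"], id_type="pmcids"))
```

Validating `id_type` up front keeps a typo from silently producing a URL with an unexpected query parameter.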

Below, we will retrieve first the abstract, and then the full-text for a recently published article on venetoclax sensitivity in multiple myeloma:

- [Electron transport chain activity is a predictor and target for venetoclax sensitivity in multiple myeloma (Bajpai et al., 2020)](https://www.nature.com/articles/s41467-020-15051-z)

In [2]:
# PubMed ID (abstract) and PubMed Central ID (full text) to query
pmid = "32144272"
pmcid = "PMC7060223"

# PubTator Central API
base_url = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson"

# retrieve a single article abstract
url = f"{base_url}?pmids={pmid}"

# display api url to be queried
print(url)

https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson?pmids=32144272


In [3]:
# submit query and check response code to make sure it was successful
response = requests.get(url)

if response.status_code != 200:
    raise Exception(response.text)

Since we requested a response in the "biocjson" format, the response text will contain a block of JSON text, which we can easily convert to a Python dict using the `response.json()` method.

In [4]:
res = response.json()
res

{'_id': '32144272|None',
 'id': '32144272',
 'infons': {},
 'passages': [{'infons': {'journal': 'Nat Commun; 2020 Mar 06 ; 11 (1) 1228. doi:10.1038/s41467-020-15051-z',
    'year': '2020',
    'article-id_pmc': 'PMC7060223',
    'type': 'title',
    'authors': 'Bajpai R, Sharma A, Achreja A, Edgar CL, Wei C, Siddiqa AA, Gupta VA, Matulis SM, McBrayer SK, Mittal A, Rupji M, Barwick BG, Lonial S, Nooka AK, Boise LH, Nagrath D, Shanmugam M, ',
    'section': 'Title'},
   'offset': 0,
   'text': 'Electron transport chain activity is a predictor and target for venetoclax sensitivity in multiple myeloma.',
   'sentences': [],
   'annotations': [{'id': '1',
     'infons': {'identifier': 'MESH:D009101', 'type': 'Disease'},
     'text': 'multiple myeloma',
     'locations': [{'offset': 90, 'length': 16}]}],
   'relations': []},
  {'infons': {'type': 'abstract', 'section': 'Abstract'},
   'offset': 108,
   'text': "The BCL-2 antagonist venetoclax is highly effective in multiple myeloma (MM) pati

The response contains two main parts ("passages") relating to:

1. Article title
2. Article abstract

Each passage includes the text, some metadata, and any PubTator annotations.

For now, let's get the abstract text so that we can infer our own annotations using Stanza.

In [5]:
abstract = res['passages'][1]['text']

print(abstract)

The BCL-2 antagonist venetoclax is highly effective in multiple myeloma (MM) patients exhibiting the 11;14 translocation, the mechanistic basis of which is unknown. In evaluating cellular energetics and metabolism of t(11;14) and non-t(11;14) MM, we determine that venetoclax-sensitive myeloma has reduced mitochondrial respiration. Consistent with this, low electron transport chain (ETC) Complex I and Complex II activities correlate with venetoclax sensitivity. Inhibition of Complex I, using IACS-010759, an orally bioavailable Complex I inhibitor in clinical trials, as well as succinate ubiquinone reductase (SQR) activity of Complex II, using thenoyltrifluoroacetone (TTFA) or introduction of SDHC R72C mutant, independently sensitize resistant MM to venetoclax. We demonstrate that ETC inhibition increases BCL-2 dependence and the 'primed' state via the ATF4-BIM/NOXA axis. Further, SQR activity correlates with venetoclax sensitivity in patient samples irrespective of t(11;14) status. Use 
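Indexing `res['passages'][1]` relies on the abstract always being the second passage. A slightly more defensive sketch selects the passage by its `type` instead. The helper name `passage_text` is hypothetical, and `example` is a trimmed stand-in for the `res` dict, with the abstract text shortened.

```python
def passage_text(record, passage_type):
    """Return the text of the first passage whose infons['type'] matches."""
    for passage in record.get("passages", []):
        if passage.get("infons", {}).get("type") == passage_type:
            return passage["text"]
    raise KeyError(f"no passage of type {passage_type!r}")


# Stand-in for the `res` dict above, trimmed to the fields used here.
example = {
    "passages": [
        {
            "infons": {"type": "title"},
            "text": "Electron transport chain activity is a predictor and target "
                    "for venetoclax sensitivity in multiple myeloma.",
        },
        {
            "infons": {"type": "abstract"},
            "text": "The BCL-2 antagonist venetoclax is highly effective ...",
        },
    ]
}

print(passage_text(example, "abstract"))
```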

## Named Entity Recognition

### Process article abstract

To begin, we will download and initialize a model which uses:

- [CRAFT](https://github.com/UCDenver-ccp/CRAFT) tokenization
- [BioNLP13CG](http://2013.bionlp-st.org/) NER

In [6]:
# download model and initialize pipeline
stanza.download('en', package='craft', processors={'ner': 'bionlp13cg'})

nlp = stanza.Pipeline('en', package='craft', processors={'ner': 'bionlp13cg'})

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.0.0.json: 120kB [00:00, 7.61MB/s]                    
2020-08-13 11:44:03 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package    |
--------------------------------
| tokenize        | craft      |
| pos             | craft      |
| lemma           | craft      |
| depparse        | craft      |
| ner             | bionlp13cg |
| backward_charlm | pubmed     |
| forward_charlm  | pubmed     |
| pretrain        | craft      |

2020-08-13 11:44:03 INFO: File exists: /home/keith/stanza_resources/en/tokenize/craft.pt.
2020-08-13 11:44:03 INFO: File exists: /home/keith/stanza_resources/en/pos/craft.pt.
2020-08-13 11:44:03 INFO: File exists: /home/keith/stanza_resources/en/lemma/craft.pt.
2020-08-13 11:44:03 INFO: File exists: /home/keith/stanza_resources/en/depparse/craft.pt.
2020-08-13 11:44:03 INFO: File exists: /home/keith/stanza_resources/en/ner/

In [7]:
# now we are ready to annotate some text..
doc = nlp(abstract)

In [8]:
# the result is, not surprisingly, a stanza "Document" instance
print(type(doc))

# public attributes / methods
[x for x in dir(doc) if not x.startswith('_')]

<class 'stanza.models.common.doc.Document'>


['build_ents',
 'entities',
 'ents',
 'get',
 'get_mwt_expansions',
 'iter_tokens',
 'iter_words',
 'num_tokens',
 'num_words',
 'sentences',
 'set',
 'set_mwt_expansions',
 'text',
 'to_dict']

In [9]:
# number of entities?
len(doc.entities)

# what does a single entity result look like?
doc.entities[0]

{
  "text": "BCL-2",
  "type": "GENE_OR_GENE_PRODUCT",
  "start_char": 4,
  "end_char": 9
}

In [10]:
# build a table containing all of the entity annotations
rows = []

for ent in doc.entities:
    rows.append([ent.text, ent.type])
    
dat = pd.DataFrame(rows, columns=['text', 'type'])

dat.sort_values(['type', 'text'])


Unnamed: 0,text,type
3,MM,CANCER
21,MM,CANCER
2,multiple myeloma,CANCER
7,myeloma,CANCER
5,cellular,CELL
8,mitochondrial,CELLULAR_COMPONENT
25,ATF4,GENE_OR_GENE_PRODUCT
0,BCL-2,GENE_OR_GENE_PRODUCT
24,BCL-2,GENE_OR_GENE_PRODUCT
26,BIM,GENE_OR_GENE_PRODUCT
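
To get a quick sense of which entity types dominate, the annotations can be tallied by type. The sketch below uses only the ten rows visible in the truncated table above, so the counts are illustrative rather than totals for the whole abstract.

```python
from collections import Counter

# The ten rows visible in the truncated table above, as (text, type) pairs.
visible_rows = [
    ("MM", "CANCER"),
    ("MM", "CANCER"),
    ("multiple myeloma", "CANCER"),
    ("myeloma", "CANCER"),
    ("cellular", "CELL"),
    ("mitochondrial", "CELLULAR_COMPONENT"),
    ("ATF4", "GENE_OR_GENE_PRODUCT"),
    ("BCL-2", "GENE_OR_GENE_PRODUCT"),
    ("BCL-2", "GENE_OR_GENE_PRODUCT"),
    ("BIM", "GENE_OR_GENE_PRODUCT"),
]

# Tally mentions per entity type.
type_counts = Counter(ent_type for _, ent_type in visible_rows)

for ent_type, count in type_counts.most_common():
    print(ent_type, count)
```

With the full DataFrame, `dat['type'].value_counts()` gives the same kind of summary.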


Next, it would be interesting to see how this compares to PubTator Central's own annotations.

Recall that, in addition to the text itself, PTC API requests also return lists of annotations; this is what PTC was built for!

In [11]:
annot = res['passages'][1]['annotations']

In [12]:
annot[0:3]

[{'id': '21',
  'infons': {'identifier': '596', 'type': 'Gene', 'ncbi_homologene': '527'},
  'text': 'BCL-2',
  'locations': [{'offset': 112, 'length': 5}]},
 {'id': '22',
  'infons': {'identifier': 'MESH:C579720', 'type': 'Chemical'},
  'text': 'venetoclax',
  'locations': [{'offset': 129, 'length': 10}]},
 {'id': '23',
  'infons': {'identifier': 'MESH:D009101', 'type': 'Disease'},
  'text': 'multiple myeloma',
  'locations': [{'offset': 163, 'length': 16}]}]
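
Note that the PTC offsets appear to be relative to the whole document rather than to the passage: "BCL-2" is reported at offset 112, the abstract passage starts at offset 108, and Stanza reported `start_char` 4 for the same mention. A minimal sketch of the conversion, using values copied from the outputs above (the helper name `to_passage_span` is hypothetical):

```python
# Values copied from the outputs above.
passage_offset = 108  # 'offset' of the abstract passage
abstract_start = "The BCL-2 antagonist venetoclax is highly effective in multiple myeloma (MM)"

ptc_annotations = [
    {"text": "BCL-2", "locations": [{"offset": 112, "length": 5}]},
    {"text": "venetoclax", "locations": [{"offset": 129, "length": 10}]},
    {"text": "multiple myeloma", "locations": [{"offset": 163, "length": 16}]},
]


def to_passage_span(annotation, passage_offset):
    """Convert an annotation's first location to (start, end) within its passage."""
    loc = annotation["locations"][0]
    start = loc["offset"] - passage_offset
    return start, start + loc["length"]


for ann in ptc_annotations:
    start, end = to_passage_span(ann, passage_offset)
    # Slicing the passage text with the converted span should recover the mention.
    print(ann["text"], (start, end), abstract_start[start:end] == ann["text"])
```

Converting to passage-relative spans makes it possible to match PTC and Stanza annotations by position rather than by text alone.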

In [13]:
# let's construct a table similar to the one we built for the Stanza result
ptc_rows = []

for ent in annot:
    ptc_rows.append([ent['text'], ent['infons']['type']])
    
dat_ptc = pd.DataFrame(ptc_rows, columns = ['text', 'type'])

dat_ptc.sort_values(['type', 'text'])

Unnamed: 0,text,type
8,TTFA,Chemical
7,thenoyltrifluoroacetone,Chemical
1,venetoclax,Chemical
11,venetoclax,Chemical
18,venetoclax,Chemical
3,MM,Disease
5,MM,Disease
10,MM,Disease
16,MM,Disease
2,multiple myeloma,Disease


There is a decent amount of overlap, but also some important differences:

- The compound "IACS-010759" is detected only by Stanza.
- The gene "BIM" is detected only by Stanza.
- SQR (an enzyme complex) is detected only by Stanza, but is marked as a "Chemical".
- Stanza also tags "electron transport chain" as a "Gene or gene product", which is a bit of a stretch.
- PTC, on the other hand, detects the mutation "R72C". AFAIK, none of the Stanza Bio models currently include mutations, so if detecting mutations is your goal, this could be an important difference.
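
A comparison like this can also be made programmatically with set operations on the entity texts. The sketch below uses only the texts visible in the truncated outputs above, so its result is illustrative and will not reproduce every observation in the list. It also matches on exact strings only, ignoring entity type and position.

```python
# Entity texts visible in the (truncated) outputs above; the full tables contain more rows.
stanza_texts = {
    "MM", "multiple myeloma", "myeloma", "cellular",
    "mitochondrial", "ATF4", "BCL-2", "BIM",
}
ptc_texts = {
    "TTFA", "thenoyltrifluoroacetone", "venetoclax",
    "MM", "multiple myeloma", "BCL-2",
}

shared = stanza_texts & ptc_texts       # found by both tools
only_stanza = stanza_texts - ptc_texts  # found by Stanza but not PTC
only_ptc = ptc_texts - stanza_texts     # found by PTC but not Stanza

print("shared:     ", sorted(shared))
print("only Stanza:", sorted(only_stanza))
print("only PTC:   ", sorted(only_ptc))
```

With the full tables, the two sets could be built as `set(dat['text'])` and `set(dat_ptc['text'])`.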

### Full-text

Next, let's try the same approach, but applied to the full-text for the same article.

In [14]:
# PubMed Central ID to query
pmcid = "PMC7060223"

# query url
url = f"{base_url}?pmcids={pmcid}"

response = requests.get(url)

if response.status_code != 200:
    raise Exception(response.text)

res = response.json()

In [15]:
# In this case, instead of just two passages (title/abstract), 
# we now have ~180 passages, each one corresponding to a section
# (header, sub-header, paragraph, etc.) in the article..
len(res['passages'])

184

In [16]:
# example: intro paragraph
res['passages'][6]

{'infons': {'section_type': 'INTRO',
  'type': 'paragraph',
  'section': 'Introduction'},
 'offset': 3772,
 'text': 'BH3 mimetics are a class of small molecules that block the interaction of specific proapoptotics with cognate antiapoptotics, releasing bound proapoptotic activators. Venetoclax is one such selective, potent BCL-2 antagonist. It is highly effective in BCL-2-dependent malignancies and FDA-approved for the treatment of chronic lymphocytic leukemia (CLL) and with hypomethylating agents azacitidine or decitabine (NCT02203773) or low dose cytarabine (NCT02287233) in acute myeloid leukemia (AML). Intriguingly, a small fraction (approximately 7%) of MM patients (about 40% of the 15-20% of patients exhibiting the 11;14 translocation) respond to single-agent venetoclax. Given the plethora of new myeloma therapies, there is need for precision therapy informed by biomarkers or molecular traits. Understanding the basis for single-agent efficacy of venetoclax in t(11;14) myeloma can 

In [17]:
# example: figure caption
res['passages'][10]

{'infons': {'section_type': 'FIG',
  'file': '41467_2020_15051_Fig1_HTML.jpg',
  'id': 'Fig1',
  'type': 'fig_title_caption',
  'section': 'Results'},
 'offset': 6288,
 'text': 'Venetoclax-sensitive MM exhibits reduced cellular energetics in contrast to the venetoclax-resistant cells.',
 'sentences': [],
 'annotations': [],
 'relations': []}

In [18]:
# so, suppose we want to get all of the "paragraph" text entries, we could do something like:
paragraphs = [x['text'] for x in res['passages'] if x['infons']['type'] == 'paragraph']

In [19]:
len(paragraphs)

53
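
Before filtering, it can help to see which passage types are present at all. The sketch below tallies types on a small stand-in list shaped like `res['passages']`. The third entry, which lacks a `type`, is hypothetical and only there to show why `.get()` is used instead of direct indexing.

```python
from collections import Counter

# Stand-in passages shaped like res['passages']; the real list has 184 entries.
passages = [
    {"infons": {"type": "paragraph", "section_type": "INTRO"},
     "text": "BH3 mimetics are ..."},
    {"infons": {"type": "fig_title_caption", "section_type": "FIG"},
     "text": "Venetoclax-sensitive MM ..."},
    {"infons": {}, "text": "hypothetical passage without a type"},
]


def passage_type(passage):
    """Return the passage type, or 'unknown' if the field is missing."""
    return passage.get("infons", {}).get("type", "unknown")


# Count passages per type, then keep only the paragraph texts.
type_counts = Counter(passage_type(p) for p in passages)
paragraphs = [p["text"] for p in passages if passage_type(p) == "paragraph"]

print(type_counts)
print(len(paragraphs))
```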

In [20]:
# we can then process them with Stanza, just like before..
rows = []

for i, paragraph in enumerate(paragraphs):
    doc = nlp(paragraph)
     
    for ent in doc.entities:
        # here, we add a third entry in each row to keep track of the paragraph number
        # the annotation came from
        rows.append([i, ent.text, ent.type])

In [21]:
# convert to a pandas DataFrame
dat = pd.DataFrame(rows, columns=['paragraph', 'text', 'type'])

dat.shape

# show 25 randomly-selected annotations
dat.sample(25).sort_values(['type', 'text'])

Unnamed: 0,paragraph,text,type
1252,50,myeloma,CANCER
1009,32,ATCC,CELL
780,27,B-cell lineage,CELL
511,16,U266,CELL
705,25,mitochondrial,CELLULAR_COMPONENT
520,16,BCL-2,GENE_OR_GENE_PRODUCT
398,13,BCL-2,GENE_OR_GENE_PRODUCT
13,1,BCL-2,GENE_OR_GENE_PRODUCT
1058,34,BCL-2,GENE_OR_GENE_PRODUCT
525,16,BIM,GENE_OR_GENE_PRODUCT


## Final thoughts

The above only scratches the surface of what you can do with PubTator Central and Stanza.

Hopefully it at least gave a sense of the kinds of analyses these tools can be used for, as well as how easy it can be to annotate scientific text.

A few examples of things you could do from here:

- Construct annotation co-occurrence matrices and use them to detect unexpected `<gene, drug>`, `<gene, disease>`, etc. pairs across a range of documents.
- Use a [word embedding approach](https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d) to create a vector representation of a collection of articles, followed by [vector similarity](https://medium.com/@adriensieg/text-similarities-da019229c894) to assess the similarity of the texts.
- Using a co-occurrence matrix of Genes, Drugs, and SNPs (annotated in PubTator Central), build a weighted heterogeneous network with each annotation represented as a node and edge weights assigned as a function of the co-occurrence counts. This could then be followed by [community detection](https://www.analyticsvidhya.com/blog/2020/04/community-detection-graphs-networks/) to detect groups of interacting drugs, genes, and mutations.
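
As a starting point for the co-occurrence ideas, here is a minimal sketch that counts how often two entities are annotated in the same paragraph. It uses the ten rows visible in the sampled full-text table above, in the same `[paragraph, text, type]` layout, so only paragraph 16 contributes pairs. Counting each pair once per paragraph is one possible design choice, not the only one.

```python
from collections import Counter
from itertools import combinations

# The ten rows visible in the sampled full-text table above.
rows = [
    [50, "myeloma", "CANCER"],
    [32, "ATCC", "CELL"],
    [27, "B-cell lineage", "CELL"],
    [16, "U266", "CELL"],
    [25, "mitochondrial", "CELLULAR_COMPONENT"],
    [16, "BCL-2", "GENE_OR_GENE_PRODUCT"],
    [13, "BCL-2", "GENE_OR_GENE_PRODUCT"],
    [1, "BCL-2", "GENE_OR_GENE_PRODUCT"],
    [34, "BCL-2", "GENE_OR_GENE_PRODUCT"],
    [16, "BIM", "GENE_OR_GENE_PRODUCT"],
]

# Collect the distinct (text, type) entities mentioned in each paragraph.
entities_by_paragraph = {}
for paragraph, text, ent_type in rows:
    entities_by_paragraph.setdefault(paragraph, set()).add((text, ent_type))

# Count each unordered pair once per paragraph in which both entities occur.
cooccurrence = Counter()
for entities in entities_by_paragraph.values():
    for pair in combinations(sorted(entities), 2):
        cooccurrence[pair] += 1

for (first, second), count in cooccurrence.items():
    print(first[0], "<->", second[0], count)
```

Run over the full `dat` table, or over many articles, the resulting counts could serve as the edge weights for the network described in the last bullet.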