# Information Extraction
- 06/28/20: Extract the entity relations related to COVID-19
    - Whether a relationship exists
    - If so, what kind and how likely
    - Use OpenNRE to do so

- Examples: 
    - Risk Factors
        - “The two-way interaction between COVID-19 and diabetes mellitus sets up a vicious cycle wherein COVID-19 leads to worsening of dysglycemia and diabetes mellitus”
            1. COVID-19 is related to diabetes.
            2. COVID-19 worsens diabetes.
            3. This does not indicate that diabetes is a risk factor, only that COVID-19 affects diabetes.
    - Efficacy of therapeutics and interventions
        - “Increasing eosinophils may be an indicator of COVID-19 improvement. The COVID-19 patients may benefit from sustained lopinavir use.”
            1. Lopinavir may be effective for treating COVID-19.
            2. OpenNRE: “eosinophils” has part (P527) “covid-19”
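
The questions at the top of this section (is there a relationship, what kind, how likely) can all be read off a single `(relation, probability)` pair. The following is a minimal sketch, assuming predictions have that shape; the `interpret` helper, the `Interpretation` type, the 0.5 threshold, and the example scores are hypothetical and are not part of OpenNRE.

```python
from typing import NamedTuple, Optional, Tuple


class Interpretation(NamedTuple):
    has_relation: bool    # is there a relationship at all?
    kind: Optional[str]   # if so, what kind
    likelihood: float     # and how likely


def interpret(prediction: Tuple[str, float], threshold: float = 0.5) -> Interpretation:
    """Turn a (relation, probability) pair into the three answers we want.

    The threshold is an arbitrary illustrative cut-off, not a tuned value.
    """
    relation, probability = prediction
    if probability >= threshold:
        return Interpretation(True, relation, probability)
    return Interpretation(False, None, probability)


# Illustrative scores only; they are not real model output.
print(interpret(("has part", 0.73)))
print(interpret(("part of", 0.21)))
```

A fixed threshold is the simplest choice; in practice the cut-off would need to be checked against manually labelled sentences.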

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from covid.models.relation.extraction import RelationExtractor


# Sample Extraction
- Two models are available: wiki80_bert_softmax and wiki80_cnn_softmax. The first one performs better. Both come from OpenNRE's supervised relation extraction: http://opennre.thunlp.ai/#/sent_re.
- We need to add an additional NER classifier to identify relationships between other interesting terms (in addition to the ones we defined).
- Among all the relations, several are interesting, e.g., "has part (P527)", "part of (P361)", "said to be the same as", etc.

In [3]:
# initiate the extractor
rextractor = RelationExtractor("../covid/models/paperclassifier/interest.yaml")

2020-11-07 12:33:26,363 - root - INFO - Loading BERT pre-trained checkpoint.


OSError: Can't load config for '/Users/johnnyh/.opennre/pretrain/bert-base-uncased'. Make sure that:

- '/Users/johnnyh/.opennre/pretrain/bert-base-uncased' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/Users/johnnyh/.opennre/pretrain/bert-base-uncased' is the correct path to a directory containing a config.json file
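
The error above says the loader expects a `config.json` inside the pretrained-model directory. A small pre-flight check can make that failure easier to diagnose. This is a sketch only: the `has_bert_config` helper is hypothetical, and it checks for `config.json` alone, since that is the only file the message names.

```python
from pathlib import Path


def has_bert_config(model_dir: Path) -> bool:
    """Return True if model_dir contains a config.json file.

    Other files a checkpoint needs (weights, vocabulary) are not checked.
    """
    return (model_dir / "config.json").is_file()


# Location taken from the error message; adjust for your machine.
pretrain_dir = Path.home() / ".opennre" / "pretrain" / "bert-base-uncased"
if not has_bert_config(pretrain_dir):
    print(f"Missing config.json under {pretrain_dir}; download the checkpoint first.")
```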



In [None]:
# extract
text = 'Increasing eosinophils may be an indicator of COVID-19 improvement. The COVID-19 patients may benefit from sustained lopinavir use.'
relation = rextractor.extract(text, 'eosinophils', 'COVID-19')
print(relation)

relation = rextractor.extract(text, 'WORDNOTINTEXT', 'COVID-19')
print(relation)
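
The internals of `RelationExtractor.extract` are not shown in this notebook. For reference, OpenNRE's sentence-level models take the text plus the character spans of the head and tail entities. The sketch below builds that input and returns `None` when an entity is missing, as in the `WORDNOTINTEXT` call; `find_span` and `build_infer_input` are hypothetical helpers, and the case-insensitive first-match rule is an assumption.

```python
from typing import Dict, Optional, Tuple


def find_span(text: str, entity: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first case-insensitive match."""
    start = text.lower().find(entity.lower())
    if start == -1:
        return None
    return start, start + len(entity)


def build_infer_input(text: str, head: str, tail: str) -> Optional[Dict]:
    """Build the dict passed to OpenNRE's sentence-level `infer`, or None."""
    head_pos, tail_pos = find_span(text, head), find_span(text, tail)
    if head_pos is None or tail_pos is None:
        return None  # one of the entities does not occur in the text
    return {"text": text, "h": {"pos": head_pos}, "t": {"pos": tail_pos}}


text = ("Increasing eosinophils may be an indicator of COVID-19 improvement. "
        "The COVID-19 patients may benefit from sustained lopinavir use.")
print(build_infer_input(text, "eosinophils", "COVID-19"))
print(build_infer_input(text, "WORDNOTINTEXT", "COVID-19"))
# With OpenNRE installed, the first dict could be passed to
# opennre.get_model("wiki80_cnn_softmax").infer(...).
```

Only the first occurrence of each entity is used here; a sentence that mentions an entity several times may call for a different rule.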

# Paper relationship extraction
- We can use interest.yaml to speed up the search and relationship extraction
- We will focus on the relationship between COVID-19 and a particular keyword

In [2]:
from covid.models.relation.extraction import RelationExtractor

covidre = RelationExtractor(km_path='../covid/models/paperclassifier/interest.yaml')

ModuleNotFoundError: No module named 'opennre'
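
The `ModuleNotFoundError` above means the kernel cannot import `opennre`. A quick way to confirm this before constructing the extractor is sketched below; `is_installed` is a hypothetical helper built on the standard library's `importlib`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("opennre"):
    print("opennre is not importable in this kernel; install OpenNRE before creating RelationExtractor.")
```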

### Load the paper-classified information

In [1]:
df = pd.read_csv('../data/paperclassifier/classified_merged_covid.csv')
df.head(3)

NameError: name 'pd' is not defined

In [49]:
df.columns

Index(['sha', 'title', 'abstract', 'publish_time', 'affiliations', 'location',
       'text', 'risk_factor', 'diagnostic', 'treatment_and_vaccine', 'outcome',
       'risk_factor_common_name', 'gender', 'age', 'disease_comorbidity',
       'smoking', 'exercise', 'occupation', 'weather',
       'diagnostic_common_name', 'symptom', 'imaging_diagnosis',
       'clinical_diagnosis', 'genetic_diagnosis',
       'treatment_and_vaccine_common_name', 'treatment', 'outcome_common_name',
       'clinical_outcome', 'keywords', 'affiliations_country',
       'location_country'],
      dtype='object')

### Information extraction for each keyword

In [125]:
relations = covidre.extract_all(df)

In [126]:
df['relations'] = relations

### Save the information

In [127]:
df.to_csv('../data/paperclassifier/classified_merged_covid_relation.csv')
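
Saving to CSV turns each list in the `relations` column into plain text, which is why the analysis below has to parse the column again. The round trip can be sketched with the standard library alone; the row content here is illustrative and the sha is a placeholder.

```python
import ast
import csv
import io

relations = [("coronavirus", "fever", ("part of", 0.2065))]

# Write one row: the list is stored as its string form.
buffer = io.StringIO()
csv.writer(buffer).writerow(["some-sha", str(relations)])

# Read it back: the cell is now plain text, not a list.
buffer.seek(0)
sha, cell = next(csv.reader(buffer))
print(type(cell).__name__)

# ast.literal_eval safely rebuilds the original Python literal.
restored = ast.literal_eval(cell)
print(restored == relations)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for text read from a file.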

# Analyze the relations
- This section mainly sets up examples for exploring the data with a declarative visualization tool, i.e., Altair

In [3]:
import ast
import altair as alt
import pandas as pd
import numpy as np

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [4]:
# load the data
df_r = pd.read_csv('../data/paperclassifier/classified_merged_covid_relation.csv')
df_r.head(3)

| Unnamed: 0.1 | Unnamed: 0 | sha | title | abstract | publish_time | affiliations | location | text | risk_factor | diagnostic | ... | clinical_diagnosis | genetic_diagnosis | treatment_and_vaccine_common_name | treatment | outcome_common_name | clinical_outcome | keywords | affiliations_country | location_country | relations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0104f6ceccf92ae8567a0102f89cbb976969a774 | Association of HLA class I with severe acute r... | BACKGROUND: The human leukocyte antigen (HLA) ... | 2003-09-12 | | | | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | fever | | | [('coronavirus', 'fever', ('part of', 0.206526... |
| 1 | 1 | 5b68a553a7cbbea13472721cd1ad617d42b40c26 | A double epidemic model for the SARS propagation | BACKGROUND: An epidemic of a Severe Acute Resp... | 2003-09-10 | | | | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | weather | | | [('coronavirus', 'weather', ('part of', 0.7285... |
| 2 | 2 | 3ed670f60a7be2e3e2a991ea8af1fdd5fa5e2b2c | Cloaked similarity between HIV-1 and SARS-CoV ... | BACKGROUND: Severe acute respiratory syndrome ... | 2003-09-21 | | | | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | | |


#### An example of string modification

In [5]:
# make the relations column more readable:
# parse each stringified list and format every entry as "keyword('relation', probability)"
rs = []
for x in df_r['relations'].tolist():
    if isinstance(x, str):
        rx = ast.literal_eval(x)
        
        # create content string
        ss = []
        for c in rx:
            s = "%s('%s', %.3f)"%(c[1],c[2][0],c[2][1])
            ss.append(s)
        rx = ", ".join(ss)
    else:
        rx = None
    rs.append(rx)
print(len(rs))
rs

9446


["fever('part of', 0.207)",
 "weather('part of', 0.729)",
 None,
 "rna('part of', 0.829), dna('follows', 0.352)",
 "dna('part of', 0.853)",
 None,
 "treatment('has part', 0.488)",
 "rna('has part', 0.734)",
 "rna('part of', 0.499), death('has part', 0.825)",
 "rna('part of', 0.664)",
 None,
 "rna('part of', 0.542)",
 "discharge('has part', 0.599)",
 None,
 None,
 None,
 "fever('said to be the same as', 0.912)",
 None,
 "rna('part of', 0.952)",
 "rna('platform', 0.357)",
 "rna('has part', 0.537)",
 "male('part of', 0.955), sex('part of', 0.955), age('part of', 0.955)",
 "rna('instance of', 0.546)",
 "fever('league', 0.492)",
 "rna('has part', 0.969)",
 "rna('part of', 0.774)",
 "age('has part', 0.612), respiratory tract infection('part of', 0.782), rna('has part', 0.690)",
 None,
 None,
 None,
 None,
 None,
 None,
 "temperature('part of', 0.891), rna('has part', 0.613)",
 "symptom('has part', 0.863), rna('has part', 0.863), treatment('has part', 0.863)",
 "rna('has part', 0.633)",
 None

## Visualization

As of 10/11/20, I am planning to generate several plots to help users visually understand what the relation data table represents. The planned figures are:

1. (Not a plot) Filter the data
    - Keep only meaningful, interpretable relationships and filter out the rest
        - "has part" = "part of" = "coronavirus is related to ..."
        - (10/27/20 ABORT) "said to be the same as" = "instance of" = "coronavirus is ..."
    - Keep only papers published after the COVID-19 outbreak, i.e., February 2020
2. Plot x over time by month (since the COVID-19 outbreak). x can be
    - the number of occurrences of a relationship (e.g., part of) with an entity (e.g., RNA)
3. Plot the summary count of the different relationships for each entity.
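
The counts behind plots 2 and 3 can be sketched with the standard library before involving any plotting code. The rows below are hypothetical and only illustrate the shape of the data; the cut-off date follows the filter described in step 1.

```python
from collections import Counter
from datetime import date

# Hypothetical rows: (keyword, relation, publish date). Not real data.
rows = [
    ("rna", "part of", date(2020, 2, 13)),
    ("rna", "part of", date(2020, 2, 20)),
    ("rna", "has part", date(2020, 3, 5)),
    ("fever", "part of", date(2020, 3, 9)),
    ("fever", "part of", date(2019, 12, 1)),  # before the cut-off, dropped
]

CUTOFF = date(2020, 2, 1)
recent = [r for r in rows if r[2] > CUTOFF]

# Plot 2: number of (entity, relation) occurrences per month.
per_month = Counter((kw, rel, d.strftime("%Y-%m")) for kw, rel, d in recent)

# Plot 3: total count of each relation per entity.
per_entity = Counter((kw, rel) for kw, rel, _ in recent)

print(per_month[("rna", "part of", "2020-02")])
print(per_entity[("fever", "part of")])
```

Keying the monthly counts by `%Y-%m` keeps months in chronological order when sorted and avoids mixing the same month from different years.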


In [6]:
import altair as alt

In [7]:
# ========================================
# preprocess
# ========================================

# ---------- 1. keep only papers published after the COVID-19 outbreak (after 2020-02-01)
df_r['publish_time'] = pd.to_datetime(df_r['publish_time'])
df_p = df_r.loc[df_r['publish_time'] > '2020-02-01', :]
df_p = df_p.drop(columns='Unnamed: 0')
df_p.head()

# ---------- 2. Reshape the dataframe
# a. one row per (keyword, relation) pair, along with the paper title, sha, publish time, location.
# This means that multiple rows can share the same paper title.
# b. keep only the relations that make sense to readers, and rename them.
df_new = []
# 'said to be the same as' and 'instance of' (both renamed to 'is') were dropped on 10/27/20
accept_relations = {'has part': 'is related to',
                    'part of': 'is related to'}
for i in range(0, df_p.shape[0]):
    s = df_p.iloc[i,:]
    
    # basic info 
    sha = s['sha']
    title = s['title']
    publish_time = s['publish_time']
    location = s['location']
    
    # extract relations
    rs = s['relations']
    if isinstance(rs, str):
        rx = ast.literal_eval(rs)

        # keep one row per accepted relation
        for c in rx:
            if c[2][0] in accept_relations:
                content = [c[1], accept_relations[c[2][0]], c[2][1], sha, title, publish_time, location]
                df_new.append(content)

# create dataframe
df_new = pd.DataFrame(df_new, columns=['keyword', 'relation', 'probability', 'sha',
                                       'title', 'publish_time', 'location'])
print(df_new.shape)
df_new.head(5)

(2023, 7)


| | keyword | relation | probability | sha | title | publish_time | location |
|---|---|---|---|---|---|---|---|
| 0 | age | is related to | 0.911713 | 2089844227bd2ff234044e4b311bf1c2bcb3838c | Respiratory viruses in mechanically ventilated... | 2020-02-13 | ; ; ; ; ; ; ; ; ; ; |
| 1 | acute respiratory failure | is related to | 0.911713 | 2089844227bd2ff234044e4b311bf1c2bcb3838c | Respiratory viruses in mechanically ventilated... | 2020-02-13 | ; ; ; ; ; ; ; ; ; ; |
| 2 | treatment | is related to | 0.663419 | 2089844227bd2ff234044e4b311bf1c2bcb3838c | Respiratory viruses in mechanically ventilated... | 2020-02-13 | ; ; ; ; ; ; ; ; ; ; |
| 3 | virus infection | is related to | 0.9313 | 3078165241b24fed34596efa9f563661f1caa00c | MiR-10a-5p-Mediated Syndecan 1 Suppression Res... | 2020-02-20 | |
| 4 | rna | is related to | 0.653477 | 3078165241b24fed34596efa9f563661f1caa00c | MiR-10a-5p-Mediated Syndecan 1 Suppression Res... | 2020-02-20 | |


In [8]:
# ========================================
# Overview plot
# Plot the strength (probability) of identified relationships over time
# ========================================
# base plot
chart_col = alt.vconcat().configure_axis(
    labelFontSize=20,
    titleFontSize=20
).configure_title(
    fontSize=20,
)

# plot
chart = alt.Chart(df_new).mark_circle(size=60).encode(
    x='publish_time',
    y='probability',
    color='relation'
).properties(
    width=400,
    height=500,
    title='Strength of discovered relationships over publication time'
).interactive()
chart.encoding.y.title='Strength (Probability)'

chart_col &= chart
chart_col

In [9]:
# ========================================
# keyword-specific plot
# - Choose a particular keyword
# - Each relation will have a line plot
# - Aggregated probability per month
# ========================================
kw = 'fever'
relations = ['is related to', 'is']

# base plot
chart_col = alt.vconcat().configure_axis(
    labelFontSize=20,
    titleFontSize=20
).configure_title(
    fontSize=20,
)

# create groups
df_grps = []
for relation in relations:
    # sub-df
    df_sg = df_new.loc[(df_new['keyword'] == kw) & (df_new['relation']==relation), :]

    # aggregated data
    grp = df_sg.groupby(df_sg['publish_time'].dt.strftime('%B'))['probability']
    statistics = [grp.mean(), grp.std(), grp.sem(), grp.count()]
    df_grp = pd.DataFrame(statistics).transpose()
    df_grp.columns = ['proba_mean', 'proba_std', 'proba_stderr', 'n']
    df_grp = df_grp.fillna(0)
    
    # new column
    df_grp['publish_month'] = df_grp.index
    df_grp['relation'] = relation
    df_grp['proba_min'] = df_grp['proba_mean'] - df_grp['proba_stderr']
    df_grp['proba_max'] = df_grp['proba_mean'] + df_grp['proba_stderr']
    

    # append
    df_grps.append(df_grp)
df_grps = pd.concat(df_grps)
df_grps


| publish_time | proba_mean | proba_std | proba_stderr | n | publish_month | relation | proba_min | proba_max |
|---|---|---|---|---|---|---|---|---|
| April | 0.759726 | 0.236564 | 0.044706 | 28.0 | April | is related to | 0.715019 | 0.804432 |
| February | 0.754847 | 0.213588 | 0.035598 | 36.0 | February | is related to | 0.719249 | 0.790445 |
| March | 0.782197 | 0.207349 | 0.021386 | 94.0 | March | is related to | 0.760811 | 0.803584 |
| May | 0.753281 | 0.0 | 0.0 | 1.0 | May | is related to | 0.753281 | 0.753281 |
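
The `proba_stderr` column is the sample standard deviation divided by the square root of the group size, and `proba_min`/`proba_max` are the mean minus and plus that value. The sketch below reproduces the calculation with the standard library; `mean_with_stderr` is a hypothetical helper, and returning 0.0 for a single value mirrors the `fillna(0)` step (the May row).

```python
import math
import statistics


def mean_with_stderr(values):
    """Return (mean, standard error); the error is 0.0 for a single value."""
    n = len(values)
    mean = statistics.fmean(values)
    if n < 2:
        return mean, 0.0  # mirrors fillna(0) for one-sample months
    stderr = statistics.stdev(values) / math.sqrt(n)
    return mean, stderr


# Consistency check against the April row: proba_std / sqrt(n).
print(round(0.236564 / math.sqrt(28), 6))
```

A standard error of 0 for a one-paper month does not mean the estimate is precise; it only means no spread could be computed.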


In [10]:
chart_col = alt.vconcat().configure_axis(
    labelFontSize=20,
    titleFontSize=20
).configure_title(
    fontSize=20,
).configure_point(
    size=100
)

# plot
line = alt.Chart(df_grps).mark_line(point=True).encode(
    x='publish_month',
    y='proba_mean',
    color='relation',
    shape=alt.Shape('relation', scale=alt.Scale(range=['cross', 'circle', 'square', 'triangle-right', 'diamond'])),
).properties(
    width=400,
    height=500,
    title="coronavirus - '%s' relationship" %kw
)

# generate the error bars
errorbars = alt.Chart(df_grps).mark_errorbar().encode(
    x="publish_month",
    y="proba_min:Q",
    y2="proba_max:Q",
    color='relation'
)
# rename axis titles
errorbars.encoding.x.title='Month'
errorbars.encoding.y.title='Strength'

# combine and plot
chart_col &= (line + errorbars)
chart_col
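
Because the months are grouped by name, they sort alphabetically (April, February, March, May) rather than chronologically. One way to recover calendar order is sketched below; `MONTH_ORDER` is a hypothetical name, and passing such a list as the `sort` argument of `alt.X` is one possible way to apply it to the chart.

```python
import calendar

# Full month names in calendar order (English in the default locale).
MONTH_ORDER = list(calendar.month_name)[1:]

months_in_data = ["April", "February", "March", "May"]  # alphabetical, as grouped above
ordered = sorted(months_in_data, key=MONTH_ORDER.index)
print(ordered)
```

Both `calendar.month_name` and `strftime('%B')` follow the active locale, so the two stay consistent as long as the locale is not changed between steps.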

In [11]:
# ========================================
# Get top keywords for each relationship 
# ========================================
relations = ['is related to', 'is']

# loop
for r in relations:
    df_sg = df_new.loc[df_new['relation'] == r]
    df_sg = df_sg.groupby('keyword')['probability'].mean().sort_values(ascending=False).to_frame()
    df_sg.columns = ['strength']
    
    # print
    print('The relationship is:', r)
    print(df_sg.head(10))
    print()
    


The relationship is: is related to
                             strength
keyword                              
smoke                        0.985188
upper respiratory infection  0.978268
occupation                   0.936047
lung capacity                0.919303
long qt syndrome             0.906110
mobility                     0.852858
virus infection              0.850200
sneezing                     0.849226
hospitalization              0.842664
exercising                   0.842629

The relationship is: is
Empty DataFrame
Columns: [strength]
Index: []
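
The ranking above uses the mean probability only, and the output does not show how many papers support each mean. A keyword seen once with a high score would outrank a keyword seen many times. The sketch below adds a minimum-count filter; the data, the `top_keywords` helper, and the `min_count` default are hypothetical.

```python
from collections import defaultdict

# Hypothetical (keyword, probability) pairs; not taken from the real table.
pairs = [("smoke", 0.985), ("fever", 0.80), ("fever", 0.70), ("fever", 0.75),
         ("rna", 0.60), ("rna", 0.80)]


def top_keywords(pairs, min_count=2, k=10):
    """Rank keywords by mean probability, skipping those seen fewer than min_count times."""
    grouped = defaultdict(list)
    for keyword, probability in pairs:
        grouped[keyword].append(probability)
    ranked = [
        (keyword, sum(values) / len(values), len(values))
        for keyword, values in grouped.items()
        if len(values) >= min_count
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)[:k]


for keyword, strength, n in top_keywords(pairs):
    print(f"{keyword:10s} {strength:.3f} (n={n})")
```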



In [12]:
# ========================================
# Overview plot for several interesting keyword relationships
# Look at the strength of each relationship over time
# https://altair-viz.github.io/gallery/natural_disasters.html
# ========================================
kw_interest = ['sore throat', 'fatigue', 'fever', 'upper respiratory infection', 'lung capacity',
              'hospitalization', 'dry cough', 'sneezing', 'death', 'shortness of breath']
df_new_p = df_new.copy()
df_new_p = df_new_p[df_new_p['keyword'].isin(kw_interest)]

# base plot
chart_col = alt.vconcat().configure_axis(
    labelFontSize=14,
    titleFontSize=14
).configure_title(
    fontSize=20,
)

# plot
chart = alt.Chart(df_new_p).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    alt.X('publish_time'),
    alt.Y('keyword'),
    alt.Size('probability:Q',
        scale=alt.Scale(range=[50, 200]),
        legend=alt.Legend(title='Strength')
    ),
).properties(
    width=450,
    height=320
)
chart.encoding.x.title='Month'
chart.encoding.y.title='Entity'

# plot
chart_col &= chart
chart_col