# Information Extraction
- 06/28/30 Extract the entity relations related to COVID-19
    - Determine whether a relationship exists
    - If so, determine what kind it is and how likely it is
    - Use OpenNRE to do so

- Examples: 
    - Risk Factors
        - “The two-way interaction between COVID-19 and diabetes mellitus sets up a vicious cycle wherein COVID-19 leads to worsening of dysglycemia and diabetes mellitus”
            1. COVID-19 has a relationship with diabetes.
            2. COVID-19 worsens diabetes.
            3. This does not indicate that diabetes is a risk factor; it indicates that COVID-19 impacts diabetes.
    - Efficacy of therapeutics and interventions
        - “Increasing eosinophils may be an indicator of COVID-19 improvement. The COVID-19 patients may benefit from sustained lopinavir use.”
            1. Lopinavir may be effective for treating COVID-19.
            2. OpenNRE: “eosinophils” has part (P527) “COVID-19”
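
The three questions above (is there a relationship, what kind, how likely) can be captured in a small record. The following is only a minimal sketch: the `RelationCall` type, the `interpret` helper, and the 0.5 threshold are illustrative assumptions and not part of OpenNRE or of this project's code, and the scores passed in below are made up.

```python
from typing import NamedTuple, Optional


class RelationCall(NamedTuple):
    """One extraction result for a (head, tail) entity pair."""
    head: str
    tail: str
    relation: Optional[str]  # None when no relation is accepted
    score: float             # likelihood reported by the model


def interpret(head: str, tail: str, label: str, score: float,
              threshold: float = 0.5) -> RelationCall:
    """Keep the predicted label only if its score reaches the threshold.

    The threshold value is an assumption chosen for illustration.
    """
    accepted = label if score >= threshold else None
    return RelationCall(head, tail, accepted, score)


# Made-up scores, used only to show both outcomes.
confident = interpret("eosinophils", "COVID-19", "has part", 0.8)
doubtful = interpret("eosinophils", "COVID-19", "league", 0.3)
print(confident.relation is not None, confident.relation, confident.score)
print(doubtful.relation is not None)
```

Keeping the raw score next to the accepted label makes it possible to revisit the threshold later without rerunning the model.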

In [3]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#from covid.models.relation.extraction import RelationExtractor


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Sample Extraction
- Two models are available: wiki80_bert_softmax and wiki80_cnn_softmax. The first one performs better. Both come from OpenNRE's supervised sentence-level relation extraction: http://opennre.thunlp.ai/#/sent_re.
- We need to add an additional NER classifier to identify the terms whose relationships are of interest (in addition to the ones we have defined)
- Among all the relations, several are interesting, e.g., "has part (P527)", "part of (P361)", "said to be the same as", etc.
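
For context, a sentence-level model of this kind needs the sentence plus the character spans of the two entities. The sketch below prepares that input with plain Python. `find_span` and `build_item` are hypothetical helpers, and how `RelationExtractor` does this internally is not shown in this notebook. The `opennre.get_model` and `model.infer` calls follow the usage shown in the OpenNRE README as far as I recall it, so treat them as an assumption to check against the installed version.

```python
from typing import Any, Dict, Optional, Tuple


def find_span(text: str, entity: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of the first case-insensitive match, or None."""
    start = text.lower().find(entity.lower())
    if start < 0:
        return None
    return (start, start + len(entity))


def build_item(text: str, head: str, tail: str) -> Optional[Dict[str, Any]]:
    """Build the input dict for sentence-level inference, or None if an entity is missing."""
    h, t = find_span(text, head), find_span(text, tail)
    if h is None or t is None:
        return None
    return {"text": text, "h": {"pos": h}, "t": {"pos": t}}


def infer_relation(item: Dict[str, Any], model_name: str = "wiki80_cnn_softmax"):
    """Run the model on one item; expected to return (relation_label, score)."""
    import opennre  # third-party; imported lazily so the helpers above work without it
    model = opennre.get_model(model_name)  # downloads the checkpoint on first use
    return model.infer(item)


text = ("Increasing eosinophils may be an indicator of COVID-19 improvement. "
        "The COVID-19 patients may benefit from sustained lopinavir use.")
item = build_item(text, "eosinophils", "COVID-19")
missing = build_item(text, "WORDNOTINTEXT", "COVID-19")
print(item["h"], item["t"], missing)
```

Returning `None` when an entity is absent lets the caller skip the model call for that pair.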

In [None]:
# initiate the extractor
from covid.models.relation.extraction import RelationExtractor

rextractor = RelationExtractor("../covid/models/paperclassifier/interest.yaml")

In [None]:
# extract
text = 'Increasing eosinophils may be an indicator of COVID-19 improvement. The COVID-19 patients may benefit from sustained lopinavir use.'
relation = rextractor.extract(text, 'eosinophils', 'COVID-19')
print(relation)

relation = rextractor.extract(text, 'WORDNOTINTEXT', 'COVID-19')
print(relation)

# Paper relationship extraction
- We can use interest.yaml to accelerate the search and relationship extraction
- We will focus on the relationship between COVID-19 and a particular keyword
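
One way a keyword list can speed things up is by selecting candidate pairs with a cheap string match before any model call. This is a hedged sketch only: the layout of interest.yaml is not shown here, so the keywords are passed as plain lists, and `candidate_pairs` is a hypothetical helper, not the project's `extract_all`.

```python
import re
from typing import List, Sequence, Tuple


def candidate_pairs(text: str, anchors: Sequence[str],
                    keywords: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair the first anchor term found in the text with every keyword found in it.

    Texts without any anchor term yield no pairs, so the model can skip them.
    """
    lowered = text.lower()
    anchor = next((a for a in anchors if a.lower() in lowered), None)
    if anchor is None:
        return []
    pairs = []
    for keyword in keywords:
        # Whole-word match so that e.g. "age" does not fire inside "average".
        if re.search(r"\b" + re.escape(keyword.lower()) + r"\b", lowered):
            pairs.append((anchor, keyword))
    return pairs


# Illustrative inputs; the real anchor terms and keywords would come from the config.
abstract = "Fever was common among coronavirus patients; RNA was detected."
pairs = candidate_pairs(abstract, ["covid-19", "coronavirus"], ["fever", "rna", "weather"])
print(pairs)
```

Only the pairs returned here would then be passed to the relation model, which is the expensive step.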

In [2]:
from covid.models.relation.extraction import RelationExtractor

covidre = RelationExtractor(km_path='../covid/models/paperclassifier/interest.yaml')

ModuleNotFoundError: No module named 'opennre'

### Load the paper-classified information

In [1]:
import pandas as pd

df = pd.read_csv('../data/paperclassifier/classified_merged_covid.csv')
df.head(3)

In [49]:
df.columns

Index(['sha', 'title', 'abstract', 'publish_time', 'affiliations', 'location',
       'text', 'risk_factor', 'diagnostic', 'treatment_and_vaccine', 'outcome',
       'risk_factor_common_name', 'gender', 'age', 'disease_comorbidity',
       'smoking', 'exercise', 'occupation', 'weather',
       'diagnostic_common_name', 'symptom', 'imaging_diagnosis',
       'clinical_diagnosis', 'genetic_diagnosis',
       'treatment_and_vaccine_common_name', 'treatment', 'outcome_common_name',
       'clinical_outcome', 'keywords', 'affiliations_country',
       'location_country'],
      dtype='object')

### Information extraction for each keyword

In [125]:
relations = covidre.extract_all(df)

HBox(children=(FloatProgress(value=0.0, max=9446.0), HTML(value='')))




In [126]:
df['relations'] = relations

### Save the information

In [127]:
df.to_csv('../data/paperclassifier/classified_merged_covid_relation.csv')
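
A note on the CSV round trip: a column holding Python lists is written as its string representation, and writing with the default index adds an extra unnamed column when the file is read back. The sketch below shows both effects on a toy frame with made-up values; `parse_relations` is a hypothetical helper name.

```python
import ast
import io

import pandas as pd

# Toy frame: one paper with an extracted relation, one without.
toy = pd.DataFrame({
    "sha": ["a1", "b2"],
    "relations": [[("coronavirus", "fever", ("part of", 0.25))], None],
})

# Default to_csv() writes the index; read_csv() then reports it as 'Unnamed: 0'.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
has_extra_column = "Unnamed: 0" in pd.read_csv(with_index).columns

# index=False keeps the column set unchanged across the round trip.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf)


def parse_relations(cell):
    """Turn the stored string back into a list; missing cells become an empty list."""
    return ast.literal_eval(cell) if isinstance(cell, str) else []


restored["relations"] = restored["relations"].apply(parse_relations)
print(has_extra_column, list(restored.columns), restored["relations"].tolist())
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of stored data.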

# Analyze the relations
- This section mainly sets up examples for exploring the data with a declarative visualization tool, i.e., Altair

In [3]:
import ast
import altair as alt
import pandas as pd

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [4]:
# load the data
df_r = pd.read_csv('../data/paperclassifier/classified_merged_covid_relation.csv')
df_r.head(3)

Unnamed: 0.1,Unnamed: 0,sha,title,abstract,publish_time,affiliations,location,text,risk_factor,diagnostic,...,clinical_diagnosis,genetic_diagnosis,treatment_and_vaccine_common_name,treatment,outcome_common_name,clinical_outcome,keywords,affiliations_country,location_country,relations
0,0,0104f6ceccf92ae8567a0102f89cbb976969a774,Association of HLA class I with severe acute r...,BACKGROUND: The human leukocyte antigen (HLA) ...,2003-09-12,,,,0,1,...,0,0,0,0,0,0,fever,,,"[('coronavirus', 'fever', ('part of', 0.206526..."
1,1,5b68a553a7cbbea13472721cd1ad617d42b40c26,A double epidemic model for the SARS propagation,BACKGROUND: An epidemic of a Severe Acute Resp...,2003-09-10,,,,1,0,...,0,0,0,0,0,0,weather,,,"[('coronavirus', 'weather', ('part of', 0.7285..."
2,2,3ed670f60a7be2e3e2a991ea8af1fdd5fa5e2b2c,Cloaked similarity between HIV-1 and SARS-CoV ...,BACKGROUND: Severe acute respiratory syndrome ...,2003-09-21,,,,0,0,...,0,0,0,0,0,0,0,,,


#### An example of string modification

In [6]:
# make the relations column more readable:
# parse each stored string back into a list, then format it as text
rs = []
for x in df_r['relations'].tolist():
    if isinstance(x, str):
        rx = ast.literal_eval(x)

        # each entry is (keyword1, keyword2, (relation, probability))
        ss = []
        for c in rx:
            s = "%s('%s', %.3f)" % (c[1], c[2][0], c[2][1])
            ss.append(s)
        rx = ", ".join(ss)
    else:
        # missing values (NaN) mean no relation was extracted
        rx = None
    rs.append(rx)
print(len(rs))
rs

9446


["fever('part of', 0.207)",
 "weather('part of', 0.729)",
 None,
 "rna('part of', 0.829), dna('follows', 0.352)",
 "dna('part of', 0.853)",
 None,
 "treatment('has part', 0.488)",
 "rna('has part', 0.734)",
 "rna('part of', 0.499), death('has part', 0.825)",
 "rna('part of', 0.664)",
 None,
 "rna('part of', 0.542)",
 "discharge('has part', 0.599)",
 None,
 None,
 None,
 "fever('said to be the same as', 0.912)",
 None,
 "rna('part of', 0.952)",
 "rna('platform', 0.357)",
 "rna('has part', 0.537)",
 "male('part of', 0.955), sex('part of', 0.955), age('part of', 0.955)",
 "rna('instance of', 0.546)",
 "fever('league', 0.492)",
 "rna('has part', 0.969)",
 "rna('part of', 0.774)",
 "age('has part', 0.612), respiratory tract infection('part of', 0.782), rna('has part', 0.690)",
 None,
 None,
 None,
 None,
 None,
 None,
 "temperature('part of', 0.891), rna('has part', 0.613)",
 "symptom('has part', 0.863), rna('has part', 0.863), treatment('has part', 0.863)",
 "rna('has part', 0.633)",
 None

## Visualization

As of 10/11/20, I am planning to generate multiple plots to help users visually understand what the relation data table represents. Here are a couple of figures to be plotted:

1. (Not a plot) Filter the data
    - Select meaningful relationships and filter out the rest; we may keep only a few interpretable relationships
        - "said to be the same as" = "instance of" = "coronavirus is ..."
        - "has part" = "part of" = "coronavirus is related to ..."
    - Choose the papers published after the COVID-19 outbreak, i.e., from February 2020
2. Plot x over time by month (since the COVID-19 outbreak). x can be
    - the number of occurrences of a relationship (e.g., part of) with an entity (e.g., RNA)
3. Plot the summary count of different relationships with respect to the entity.
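
The filtering and monthly counting steps in this plan could look roughly like the sketch below. It assumes each relation row also carries the paper's `publish_time`, which is an assumption about how the table would be built. The group names "is" and "related to" follow the equivalences listed in step 1, the 2020-02-01 cut-off follows the February 2020 choice, and the toy rows are made up.

```python
import pandas as pd

# Interpretable relations and the reading assigned to each group.
GROUPS = {
    "said to be the same as": "is",
    "instance of": "is",
    "has part": "related to",
    "part of": "related to",
}


def monthly_counts(df: pd.DataFrame, since: str = "2020-02-01") -> pd.DataFrame:
    """Count interpretable relations per month, entity, and relation group."""
    out = df.copy()
    out["publish_time"] = pd.to_datetime(out["publish_time"], errors="coerce")
    out["group"] = out["relation"].map(GROUPS)  # NaN for relations we ignore
    keep = out["group"].notna() & (out["publish_time"] >= pd.Timestamp(since))
    out = out[keep]
    out = out.assign(month=out["publish_time"].dt.strftime("%Y-%m"))
    return out.groupby(["month", "keyword2", "group"]).size().reset_index(name="count")


# Toy rows, for illustration only.
toy = pd.DataFrame({
    "publish_time": ["2020-02-10", "2020-02-25", "2020-03-03", "2019-12-30", "2020-03-15"],
    "keyword2": ["rna", "rna", "fever", "rna", "rna"],
    "relation": ["part of", "has part", "league", "part of", "instance of"],
})
summary = monthly_counts(toy)
print(summary)
```

The resulting long-format table (one row per month, entity, and group) is the shape that Altair encodings generally expect, so it can feed both the time plot and the summary count plot.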


In [12]:
# ---------- Create a new dataframe that explicitly stores all the extracted relationships
# string to list
rs = [ast.literal_eval(x) for x in df_r['relations'].dropna().tolist()]

# flatten the arrays
all_rs = []
for r in rs:
    for i in r:
        all_rs.append(i)

# format to pandas
rs = []
for r in all_rs:
    rs.append([r[0], r[1], r[2][0], r[2][1]])
df_new = pd.DataFrame(rs)
df_new.columns = ['keyword1', 'keyword2', 'relation', 'probability']
df_new.head(3)

|   | keyword1 | keyword2 | relation | probability |
|---|---|---|---|---|
| 0 | coronavirus | fever | part of | 0.206526 |
| 1 | coronavirus | weather | part of | 0.728578 |
| 2 | coronavirus | rna | part of | 0.828718 |


In [8]:
# NOTE: some of the relationships do not make sense here and can be ignored, for example, "sport"
df_new['relation'].value_counts()

has part                                            4094
part of                                             2225
instance of                                          655
follows                                              438
followed by                                          416
main subject                                         382
said to be the same as                               206
league                                               175
characters                                            92
developer                                             33
sport                                                 26
platform                                              19
location                                              11
distributor                                           11
operating system                                       9
performer                                              7
military branch                                        7
participant                    

### Visualize
- feel free to do any visualization on the data

In [27]:
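# build a (relation, count) frame; written so it does not depend on how
# pandas names the result of value_counts(), which differs between versions
df_tmp = df_new['relation'].value_counts().rename('count').to_frame()
df_tmp.insert(0, 'relation', df_tmp.index)
df_tmp.index.name = None
df_tmp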

|   | relation | count |
|---|---|---|
| has part | has part | 4094 |
| part of | part of | 2225 |
| instance of | instance of | 655 |
| follows | follows | 438 |
| followed by | followed by | 416 |
| main subject | main subject | 382 |
| said to be the same as | said to be the same as | 206 |
| league | league | 175 |
| characters | characters | 92 |
| developer | developer | 33 |


In [30]:
chart = alt.Chart(df_tmp).mark_bar().encode(
    x='relation',
    y='count',
)
chart

In [16]:
df_new['relation'].tolist()

['part of',
 'part of',
 'part of',
 'follows',
 'part of',
 'has part',
 'has part',
 'part of',
 'has part',
 'part of',
 'part of',
 'has part',
 'said to be the same as',
 'part of',
 'platform',
 'has part',
 'part of',
 'part of',
 'part of',
 'instance of',
 'league',
 'has part',
 'part of',
 'has part',
 'part of',
 'has part',
 'part of',
 'has part',
 'has part',
 'has part',
 'has part',
 'has part',
 'part of',
 'follows',
 'follows',
 'part of',
 'part of',
 'has part',
 'has part',
 'has part',
 'has part',
 'part of',
 'has part',
 'has part',
 'instance of',
 'has part',
 'part of',
 'has part',
 'has part',
 'has part',
 'part of',
 'part of',
 'part of',
 'part of',
 'part of',
 'part of',
 'has part',
 'part of',
 'has part',
 'has part',
 'follows',
 'instance of',
 'has part',
 'has part',
 'main subject',
 'part of',
 'part of',
 'part of',
 'main subject',
 'has part',
 'part of',
 'part of',
 'has part',
 'followed by',
 'main subject',
 'has part',
 'said to b