# Information Extraction
- 06/28/30 Extract the entity relations related to covid-19
    - Whether there is a relationship or not
    - If so, what kind, and how likely
    - Use OpenNRE to do so

- Examples: 
    - Risk Factors
        - “The two-way interaction between COVID-19 and diabetes mellitus sets up a vicious cycle wherein COVID-19 leads to worsening of dysglycemia and diabetes mellitus”
            1. COVID-19 is related to diabetes.
            2. COVID-19 worsens diabetes.
            3. This does not indicate that diabetes is a risk factor, but it does indicate that COVID-19 affects diabetes.
    - Efficacy of therapeutics and interventions
        - “Increasing eosinophils may be an indicator of COVID-19 improvement. The COVID-19 patients may benefit from sustained lopinavir use.”
            1. Lopinavir may be effective for treating COVID-19.
            2. OpenNRE: “eosinophils” has part (P527) “covid-19”

In [61]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

from covid.models.relation.extraction import RelationExtractor
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Sample Extraction
- There are two models available: wiki80_bert_softmax and wiki80_cnn_softmax. The first one performs better. Both come from OpenNRE's supervised sentence-level relation extraction: http://opennre.thunlp.ai/#/sent_re.
- We need to add an additional NER classifier to identify interesting term relationships (in addition to our defined ones).
- Among all the relations, only "has part (P527)" is useful.
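
OpenNRE's sentence-level models take a dict holding the text plus the character spans of the head and tail entities. The following is a minimal sketch of how such an input could be built from two entity strings. The helper `build_infer_input` is a hypothetical name and is not taken from `RelationExtractor`, whose internals are not shown here.

```python
import re


def build_infer_input(text, head, tail):
    """Build an OpenNRE-style input dict (hypothetical helper, not from the project).

    Each entity is located by its first case-insensitive occurrence in the text.
    """
    def span(mention):
        match = re.search(re.escape(mention), text, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"{mention!r} not found in text")
        return match.span()  # (start, end) character offsets

    return {"text": text, "h": {"pos": span(head)}, "t": {"pos": span(tail)}}


text = ("Increasing eosinophils may be an indicator of COVID-19 improvement. "
        "The COVID-19 patients may benefit from sustained lopinavir use.")
item = build_infer_input(text, "eosinophils", "COVID-19")
print(item["h"], item["t"])

# With OpenNRE installed, the item could then be scored roughly like this:
#   import opennre
#   model = opennre.get_model("wiki80_bert_softmax")
#   relation, score = model.infer(item)
```

Using the first occurrence of each entity is a simplification. A text that mentions an entity several times may need one input per pair of mentions.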

In [43]:
# initiate the extractor
rextractor = RelationExtractor()

In [44]:
# extract
text = 'Increasing eosinophils may be an indicator of COVID-19 improvement. The COVID-19 patients may benefit from sustained lopinavir use.'
relation = rextractor.extract(text, 'eosinophils', 'COVID-19')
print(relation)

('eosinophils', 'covid-19', ('has part', 0.3858766555786133))
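
Since only "has part" is considered useful, results of this shape could be filtered by relation name and confidence. The sketch below assumes the tuple layout printed above. The `min_score` threshold of 0.3 is an arbitrary assumption, and the second tuple is made up purely for illustration.

```python
def keep_relation(result, wanted=("has part",), min_score=0.3):
    """Return True if the result has a wanted relation with enough confidence.

    `result` follows the layout printed above: (head, tail, (relation, score)).
    `min_score` is an assumed cut-off, not a value from the project.
    """
    _head, _tail, (relation, score) = result
    return relation in wanted and score >= min_score


results = [
    ("eosinophils", "covid-19", ("has part", 0.3858766555786133)),  # output shown above
    ("lopinavir", "covid-19", ("said to be the same as", 0.21)),    # made-up example
]
kept = [r for r in results if keep_relation(r)]
print(kept)
```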


# Paper relationship extraction
- We can use interest.yml to speed up the search and the relationship extraction
- We will focus on the relationship between covid and a particular keyword
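
One way to narrow the search is to keep only the sentences that mention both a covid term and the keyword, and to run the extractor on those. This is a rough sketch under stated assumptions: the alias list `COVID_TERMS`, the helper `candidate_sentences`, and the naive sentence splitter are all hypothetical and not part of the project.

```python
import re

# Assumed aliases for the covid entity; extend as needed.
COVID_TERMS = ("covid-19", "sars-cov-2")


def candidate_sentences(text, keyword, covid_terms=COVID_TERMS):
    """Return (sentence, covid_term) pairs where both the keyword and a covid term occur."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        covid = next((term for term in covid_terms if term in lowered), None)
        if covid is not None and keyword.lower() in lowered:
            hits.append((sentence, covid))
    return hits


text = ("Increasing eosinophils may be an indicator of COVID-19 improvement. "
        "The COVID-19 patients may benefit from sustained lopinavir use.")
hits = candidate_sentences(text, "lopinavir")
for sentence, covid in hits:
    print(covid, "|", sentence)
    # Each hit could then be passed on, e.g. rextractor.extract(sentence, "lopinavir", covid)
```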

### Load the paper-classified information

In [47]:
df = pd.read_csv('../data/paperclassifier/classified_merged_covid.csv')
df.head()

Unnamed: 0,sha,title,abstract,publish_time,affiliations,location,text,risk_factor,diagnostic,treatment_and_vaccine,...,imaging_diagnosis,clinical_diagnosis,genetic_diagnosis,treatment_and_vaccine_common_name,treatment,outcome_common_name,clinical_outcome,keywords,affiliations_country,location_country
0,0104f6ceccf92ae8567a0102f89cbb976969a774,Association of HLA class I with severe acute r...,BACKGROUND: The human leukocyte antigen (HLA) ...,2003-09-12,,,,0,1,0,...,0,0,0,0,0,0,0,fever,,
1,5b68a553a7cbbea13472721cd1ad617d42b40c26,A double epidemic model for the SARS propagation,BACKGROUND: An epidemic of a Severe Acute Resp...,2003-09-10,,,,1,0,0,...,0,0,0,0,0,0,0,weather,,
2,3ed670f60a7be2e3e2a991ea8af1fdd5fa5e2b2c,Cloaked similarity between HIV-1 and SARS-CoV ...,BACKGROUND: Severe acute respiratory syndrome ...,2003-09-21,,,,0,0,0,...,0,0,0,0,0,0,0,0,,
3,59ffca26c9d1f881e4e37e7c5af626d72d858dc4,Relationship of SARS-CoV to other pathogenic R...,BACKGROUND: The exact origin of the cause of t...,2003-09-20,Yee Leng Yap ; Xue Wu Zhang ; Antoine Danchin ...,"; ; 75724, Cedex 15 ; Paris ; France",Background\n\nSevere Acute Respiratory Syndrom...,0,1,0,...,0,0,1,0,0,0,0,"rna,dna",France,France
4,c63c4d58d170136b8d3b5a66424b5ac3f73a92d9,Viral Discovery and Sequence Recovery Using DN...,Because of the constant threat posed by emergi...,2003-11-17,David Wang (University of California San Franc...,San Francisco ; California ; United States of ...,Introduction\n\nOver the past two decades tech...,0,1,0,...,0,0,1,0,0,0,0,dna,United States,United States


In [49]:
df.columns

Index(['sha', 'title', 'abstract', 'publish_time', 'affiliations', 'location',
       'text', 'risk_factor', 'diagnostic', 'treatment_and_vaccine', 'outcome',
       'risk_factor_common_name', 'gender', 'age', 'disease_comorbidity',
       'smoking', 'exercise', 'occupation', 'weather',
       'diagnostic_common_name', 'symptom', 'imaging_diagnosis',
       'clinical_diagnosis', 'genetic_diagnosis',
       'treatment_and_vaccine_common_name', 'treatment', 'outcome_common_name',
       'clinical_outcome', 'keywords', 'affiliations_country',
       'location_country'],
      dtype='object')

In [54]:
# Refer to the interest.yaml file to identify the entities whose relationship with covid we want to examine.
# Only the risk_factor keywords are listed so far; the other classes stay empty until their keywords are chosen.
classes = {'risk_factor': ['smoking', 'smoke', 'tobacco', 'exercise', 'exercising', 'workout'],
           'diagnostic': [],
           'treatment_and_vaccine': [],
           'outcome': []}
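
The class flags and keyword lists could be combined to pick which papers to send to the extractor. The sketch below works on plain dict rows (for example from `df.to_dict('records')`) so that it runs without the data file. The helper names and the toy rows are hypothetical, made up for illustration.

```python
import re


def matched_keywords(abstract, keywords):
    """Return the keywords that occur as whole words in the abstract (case-insensitive)."""
    return [kw for kw in keywords
            if re.search(r"\b" + re.escape(kw) + r"\b", abstract, flags=re.IGNORECASE)]


def select_papers(rows, class_name, keywords):
    """Yield (sha, matched keywords) for rows flagged with the class and mentioning a keyword."""
    for row in rows:
        if row.get(class_name) != 1:
            continue
        abstract = row.get("abstract")
        if not isinstance(abstract, str):  # skip missing abstracts (NaN in pandas)
            continue
        found = matched_keywords(abstract, keywords)
        if found:
            yield row["sha"], found


risk_keywords = ["smoking", "smoke", "tobacco", "exercise", "exercising", "workout"]

# Toy rows, made up for illustration; real rows would come from the classified CSV.
rows = [
    {"sha": "a1", "abstract": "Smoking history was associated with severe disease.", "risk_factor": 1},
    {"sha": "b2", "abstract": "Chest CT findings in 20 patients.", "risk_factor": 0},
    {"sha": "c3", "abstract": float("nan"), "risk_factor": 1},
]
selected = list(select_papers(rows, "risk_factor", risk_keywords))
print(selected)
```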

In [60]:
df.iloc[0]['abstract']

'BACKGROUND: The human leukocyte antigen (HLA) system is widely used as a strategy in the search for the etiology of infectious diseases and autoimmune disorders. During the Taiwan epidemic of severe acute respiratory syndrome (SARS) many health care workers were infected. In an effort to establish a screening program for high risk personal the distribution of HLA class I and II alleles in case and control groups was examined for the presence of an association to a genetic susceptibly or resistance to SARS coronavirus infection. METHODS: HLA-class I and II allele typing by PCR-SSOP was performed on 37 cases of probable SARS 28 fever patients excluded later as probable SARS and 101 non-infected health care workers who were exposed or possibly exposed to SARS coronavirus. An additional control set of 190 normal healthy unrelated Taiwanese was also used in the analysis. RESULTS: Woolf and Haldane Odds ratio (OR) and corrected P-value (Pc) obtained from two tails Fisher exact test were use