# Information Extraction
- 06/28/30 Extract the entity relations related to COVID-19
    - Whether a relationship exists or not
    - If so, what kind, and how likely
    - Use OpenNRE to do so

- Examples: 
    - Risk Factors
        - “The two-way interaction between COVID-19 and diabetes mellitus sets up a vicious cycle wherein COVID-19 leads to worsening of dysglycemia and diabetes mellitus”
            1.	COVID-19 has a relationship with diabetes.
            2.	COVID-19 worsens diabetes.
            3.	This does not indicate that diabetes is a risk factor; it indicates that COVID-19 impacts diabetes.
    - Efficacy of therapeutics and interventions
        - “Increasing eosinophils may be an indicator of COVID-19 improvement. The COVID-19 patients may benefit from sustained lopinavir use.”
            1.	Lopinavir may be effective for treating COVID-19.
            2.	OpenNRE: “eosinophils” has part (P527)  “covid-19”

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from covid.models.relation.extraction import RelationExtractor


# Sample Extraction
- There are two models available: wiki80_bert_softmax and wiki80_cnn_softmax. The first one is better. Both come from OpenNRE's supervised sentence-level relation extraction: http://opennre.thunlp.ai/#/sent_re.
- We need to add an additional NER classifier to classify relationships between terms of interest (in addition to the ones we defined)
- Among all the relations, several are interesting, e.g., "has part (P527)", "part of (P361)", "said to be the same as", etc.
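
How `RelationExtractor` locates the two entities is not shown in this notebook. As a minimal sketch, assuming it passes character spans to OpenNRE's sentence-level `infer` call, the input could be built as below. `build_infer_input` and `infer_relation` are hypothetical helper names, and only the first occurrence of each entity is used.

```python
def build_infer_input(text, head, tail):
    """Build the dict that OpenNRE's sentence-level `infer` expects.

    Returns None when either entity does not occur in the text.
    Matching is case-insensitive and uses the first occurrence only.
    """
    lowered = text.lower()
    h_start = lowered.find(head.lower())
    t_start = lowered.find(tail.lower())
    if h_start < 0 or t_start < 0:
        return None
    return {
        "text": text,
        "h": {"pos": (h_start, h_start + len(head))},
        "t": {"pos": (t_start, t_start + len(tail))},
    }


def infer_relation(text, head, tail, model_name="wiki80_bert_softmax"):
    """Run OpenNRE on one sentence (needs the `opennre` package and a model download)."""
    import opennre  # imported lazily so build_infer_input works without it

    item = build_infer_input(text, head, tail)
    if item is None:
        return None
    model = opennre.get_model(model_name)
    return model.infer(item)  # a (relation_name, score) pair
```

Returning `None` for a missing entity keeps the caller from scoring a relation between terms that are not both in the sentence.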

In [None]:
# initiate the extractor
rextractor = RelationExtractor("../covid/models/paperclassifier/interest.yaml")

In [None]:
# extract
text = 'Increasing eosinophils may be an indicator of COVID-19 improvement. The COVID-19 patients may benefit from sustained lopinavir use.'
relation = rextractor.extract(text, 'eosinophils', 'COVID-19')
print(relation)

relation = rextractor.extract(text, 'WORDNOTINTEXT', 'COVID-19')
print(relation)

# Paper relationship extraction
- We can use interest.yaml to speed up the search and relationship extraction
- We will focus on the relationship between covid and a particular keyword

In [None]:
from covid.models.relation.extraction import RelationExtractor

covidre = RelationExtractor(km_path='../covid/models/paperclassifier/interest.yaml')

### Load the paper-classified information

In [None]:
df = pd.read_csv('../data/paperclassifier/classified_merged_covid.csv')
df.head(3)

In [None]:
df.columns

### Information extraction for each keyword

In [None]:
relations = covidre.extract_all(df)

In [None]:
df['relations'] = relations

### Save the information

In [None]:
df.to_csv('../data/paperclassifier/classified_merged_covid_relation.csv')

# Analyze the relations
- This section mainly sets up examples for inspecting the data with a declarative tool, i.e., altair
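
Because the `relations` column is saved to CSV, each cell comes back as a string and has to be parsed before analysis. The sketch below assumes each entry is a 3-tuple whose second element is the keyword and whose third element is a `(relation, probability)` pair, which is how the cells below index it. The first element and the sample string are made up for illustration, and `parse_relations` is a hypothetical helper.

```python
import ast


def parse_relations(cell):
    """Turn one serialized `relations` cell into a list of dicts.

    Non-string cells (e.g. NaN for papers without relations) give [].
    """
    if not isinstance(cell, str):
        return []
    parsed = []
    for item in ast.literal_eval(cell):
        keyword = item[1]
        relation, probability = item[2]
        parsed.append({
            "keyword": keyword,
            "relation": relation,
            "probability": float(probability),
        })
    return parsed


# hypothetical cell content, not taken from the real data
sample = "[('COVID-19', 'fever', ('has part', 0.91)), ('COVID-19', 'RNA', ('part of', 0.42))]"
print(parse_relations(sample))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a file.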

In [None]:
import ast
import altair as alt
import pandas as pd
import numpy as np

alt.data_transformers.disable_max_rows()

In [None]:
# load the data
df_r = pd.read_csv('../data/paperclassifier/classified_merged_covid_relation.csv')
df_r.head(3)

#### An example of string modification

In [None]:
# make the relations column more readable:
# parse each string into a list, then format it as "keyword('relation', probability)"
rs = []
for x in df_r['relations'].tolist():
    if isinstance(x, str):
        rx = ast.literal_eval(x)
        
        # create content string
        ss = []
        for c in rx:
            s = "%s('%s', %.3f)"%(c[1],c[2][0],c[2][1])
            ss.append(s)
        rx = ", ".join(ss)
    else:
        rx = None
    rs.append(rx)
print(len(rs))
rs

## Visualization

As of 10/11/20, I am planning to generate multiple plots so that users can visually understand what the relation data table represents. Here are a couple of figures to be plotted:

1. (Not a plot) Filter the data
    - Select meaningful relationships and filter out the rest; may choose only a few interpretable relationships
        - "has part" = "part of" = "coronavirus is related to ..."
        - (10/27/20 ABORT) "said to be the same as" = "instance of" = "coronavirus is ..."
    - Choose the papers published after the covid outbreak, i.e., 2020 Feb
2. Plot x over time by month (since the covid outbreak). X can be
    - the count of a relationship (e.g., part of) with an entity (e.g., RNA)
3. Plot the summary count of different relationships with respect to the entity. 
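
The filtering and counting steps in this plan can be sketched with pandas alone. This is an illustration on a toy table, assuming long-format columns `keyword`, `relation`, and `publish_time`; `ACCEPTED`, `toy`, and `monthly_relation_counts` are names introduced here, and the rows are made up.

```python
import pandas as pd

# raw relation name -> readable name; relations not listed are dropped
ACCEPTED = {"has part": "is related to", "part of": "is related to"}

# toy rows, not real data
toy = pd.DataFrame({
    "keyword": ["RNA", "RNA", "fever", "RNA"],
    "relation": ["part of", "has part", "instance of", "part of"],
    "publish_time": pd.to_datetime(
        ["2020-03-02", "2020-03-20", "2020-04-01", "2019-12-15"]),
})


def monthly_relation_counts(df, cutoff="2020-02-01"):
    """Keep accepted relations published after `cutoff`, then count per month."""
    mask = df["relation"].isin(list(ACCEPTED)) & (df["publish_time"] > pd.Timestamp(cutoff))
    kept = df[mask].copy()
    kept["relation"] = kept["relation"].map(ACCEPTED)
    kept["month"] = kept["publish_time"].dt.to_period("M").astype(str)
    return kept.groupby(["month", "keyword", "relation"]).size().reset_index(name="n")


counts = monthly_relation_counts(toy)
print(counts)
```

The resulting table has one row per month, keyword, and relation, which is the shape both the per-month plot and the summary count need.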


In [None]:
import altair as alt
import plotly.express as px

## Overview plot
Plot the identified relationships over time



In [None]:
# ========================================
# preprocess
# ========================================

# ---------- 1. only select the papers published after the covid outbreak in 2020 Feb
df_r['publish_time'] = pd.to_datetime(df_r['publish_time'])
df_p = df_r.loc[df_r['publish_time'] > '2020-02-01', :]
# drop the index column written by to_csv, if it is present
df_p = df_p.drop(columns='Unnamed: 0', errors='ignore')
df_p.head()

# ---------- 2. Reformulate the dataframe
# a. use a unique keyterm and a unique relation as a row, along with the paper title, sha, publish time, location.
# This means that multiple rows can share the same paper title.
# b. only keep the useful relations that make sense to readers, and rename them
# parameters
df_new = []
# accept_relations = {'has part': 'is related to',
#                     'part of': 'is related to',
#                     'said to be the same as': 'is', 
#                     'instance of': 'is'}
accept_relations = {'has part': 'is related to',
                    'part of': 'is related to',}
for _, s in df_p.iterrows():
    
    # basic info 
    sha = s['sha']
    title = s['title']
    publish_time = s['publish_time']
    location = s['location']
    
    # extract relations
    rs = s['relations']
    if isinstance(rs, str):
        rx = ast.literal_eval(rs)

        # create content string
        for c in rx:
            if c[2][0] in accept_relations.keys():
                content = [c[1], accept_relations[c[2][0]], c[2][1], sha, title, publish_time, location]
                df_new.append(content)

# create dataframe
df_new = pd.DataFrame(df_new, columns=['keyword', 'relation', 'probability', 'sha',
                                       'title', 'publish_time', 'location'])
print(df_new.shape)
df_new.head(5)

# ========================================
# Plotly plot
# ========================================
fig = px.scatter(df_new, x='publish_time', y='probability',
                )
fig.update_layout(
    title='Strength of discovered relationship along paper publication month',
    xaxis_title="Publish Time",
    yaxis_title="Strength",
    font=dict(
        family="Courier New, monospace",
        size=18,
    )
)
fig.show()

#### Altair

In [None]:
# # base plot
# chart_col = alt.vconcat().configure_axis(
#     labelFontSize=20,
#     titleFontSize=20
# ).configure_title(
#     fontSize=20,
# )

# # plot
# chart = alt.Chart(df_new).mark_circle(size=60).encode(
#     x='publish_time',
#     y='probability',
#     color='relation'
# ).properties(
#         width=400,
#         height=500,
#         title='Strength of discovered relationship along paper publication month').interactive(0)
# chart.encoding.y.title='Strength (Probability)'

# chart_col &= chart
# chart_col

## keyword-specific plot
- Choose a particular keyword
- Each relation will have a line plot
- Aggregated probability per month
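
As a minimal sketch of the per-month aggregation, assuming a long-format table with `keyword`, `relation`, `probability`, and a datetime `publish_time` column: grouping by a monthly period keeps the months in calendar order. `monthly_strength` and the toy rows are made up for illustration.

```python
import pandas as pd


def monthly_strength(df, keyword, relation):
    """Mean probability, standard error, and count per publication month."""
    sub = df[(df["keyword"] == keyword) & (df["relation"] == relation)]
    month = sub["publish_time"].dt.to_period("M")
    out = sub.groupby(month)["probability"].agg(
        proba_mean="mean", proba_stderr="sem", n="count")
    # sem is NaN for months with a single paper; show those without an error bar
    out["proba_stderr"] = out["proba_stderr"].fillna(0.0)
    return out.reset_index().rename(columns={"publish_time": "publish_month"})


# toy rows, not real data
toy = pd.DataFrame({
    "keyword": ["fever", "fever", "fever", "dry cough"],
    "relation": ["is related to"] * 4,
    "probability": [0.8, 0.6, 0.9, 0.5],
    "publish_time": pd.to_datetime(
        ["2020-03-05", "2020-03-25", "2020-04-10", "2020-04-11"]),
})
stats = monthly_strength(toy, "fever", "is related to")
print(stats)
```

The `n` column is worth plotting or annotating alongside the mean, since a month backed by one paper says much less than one backed by many.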

In [None]:
# =======================================================
# Data Preparation 
# =======================================================
# Define what keyword we are interested to see the trend
kw = 'fever'
relations = ['is related to']

# create groups
df_grps = []
for relation in relations:
    # sub-df
    df_sg = df_new.loc[(df_new['keyword'] == kw) & (df_new['relation']==relation), :]

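    # aggregated data; group by year-month so the groups sort chronologically
    # (month names from '%B' would sort alphabetically)
    grp = df_sg.groupby(df_sg['publish_time'].dt.strftime('%Y-%m'))['probability']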
    statistics = [grp.mean(), grp.std(), grp.sem(), grp.count()]
    df_grp = pd.DataFrame(statistics).transpose()
    df_grp.columns = ['proba_mean', 'proba_std', 'proba_stderr', 'n']
    df_grp = df_grp.fillna(0)
    
    # new column
    df_grp['publish_month'] = df_grp.index
    df_grp['relation'] = relation
    df_grp['proba_min'] = df_grp['proba_mean'] - df_grp['proba_stderr']
    df_grp['proba_max'] = df_grp['proba_mean'] + df_grp['proba_stderr']
    
    # append
    df_grps.append(df_grp)
df_grps = pd.concat(df_grps)
df_grps

# =======================================================
# Plotly
# =======================================================
import plotly.graph_objects as go

fig = go.Figure(data=go.Scatter(
        x=df_grps['publish_month'],
        y=df_grps['proba_mean'],
        error_y=dict(
            type='data', # value of error bar given in data coordinates
            array=df_grps['proba_stderr'],
            visible=True)
    ))
fig.update_layout(
    yaxis=dict(range=[0, 1]),
    title="coronavirus - '%s' relationship" %kw,
    xaxis_title="Month",
    yaxis_title="Strength",
    font=dict(
        family="Courier New, monospace",
        size=18,
    )
)
fig.show()

#### Altair

In [None]:
# chart_col = alt.vconcat().configure_axis(
#     labelFontSize=20,
#     titleFontSize=20
# ).configure_title(
#     fontSize=20,
# ).configure_point(
#     size=100
# )

# # plot
# line = alt.Chart(df_grps).mark_line(point=True).encode(
#     x='publish_month',
#     y='proba_mean',
#     color='relation',
#     shape=alt.Shape('relation', scale=alt.Scale(range=['cross', 'circle', 'square', 'triangle-right', 'diamond'])),
# ).properties(
#     width=400,
#     height=500,
#     title="coronavirus - '%s' relationship" %kw
# )

# # generate the error bars
# errorbars = alt.Chart(df_grps).mark_errorbar().encode(
#     x="publish_month",
#     y="proba_min:Q",
#     y2="proba_max:Q",
#     color='relation'
# )
# errorbars.encoding.x.title='Month'
# errorbars.encoding.y.title='Strength'
# chart_col &= (line + errorbars)

# # rename axis title

# # plot
# chart_col

## Get top keywords for each relationship 

In [None]:
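# note: 'is' yields an empty table while the "said to be the same as" /
# "instance of" mapping is disabled in accept_relations
relations = ['is related to', 'is']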

# loop
for r in relations:
    df_sg = df_new.loc[df_new['relation'] == r]
    df_sg = df_sg.groupby('keyword')['probability'].mean().sort_values(ascending=False).to_frame()
    df_sg.columns = ['strength']
    
    # print
    print('The relationship is:', r)
    print(df_sg.head(10))
    print()


## Overview plot for several interesting keyword relationships
- Look at the strength of the relationship over time
- https://altair-viz.github.io/gallery/natural_disasters.html

In [None]:
# =======================================================
# Data Preparation 
# =======================================================
kw_interest = ['sore throat', 'fatigue', 'fever', 'upper respiratory infection', 'lung capacity',
              'hospitalization', 'dry cough', 'sneezing', 'death', 'shortness of breath']
df_new_p = df_new.copy()
df_new_p = df_new_p[df_new_p['keyword'].isin(kw_interest)]
df_new_p.head()

# =======================================================
# Plotly plot
# =======================================================
import plotly.express as px
fig = px.scatter(df_new_p, x="publish_time", y="probability", color="keyword",
                 size='probability')
fig.update_layout(
    yaxis=dict(range=[0, 1.1]),
    title="coronavirus - keyword relationship",
    xaxis_title="Publish Time",
    yaxis_title="Strength",
    font=dict(
        family="Courier New, monospace",
        size=18,
    )
)
fig.show()

#### Altair

In [None]:
# kw_interest = ['sore throat', 'fatigue', 'fever', 'upper respiratory infection', 'lung capacity',
#               'hospitalization', 'dry cough', 'sneezing', 'death', 'shortness of breath']
# df_new_p = df_new.copy()
# df_new_p = df_new_p[df_new_p['keyword'].isin(kw_interest)]

# # base plot
# chart_col = alt.vconcat().configure_axis(
#     labelFontSize=14,
#     titleFontSize=14
# ).configure_title(
#     fontSize=20,
# )

# # plot
# chart = alt.Chart(df_new_p).mark_circle(
#     opacity=0.8,
#     stroke='black',
#     strokeWidth=1
# ).encode(
#     alt.X('publish_time'),
#     alt.Y('keyword'),
#     alt.Size('probability:Q',
#         scale=alt.Scale(range=[50, 200]),
#         legend=alt.Legend(title='Strength')
#     ),
# ).properties(
#     width=450,
#     height=320
# )
# chart.encoding.x.title='Month'
# chart.encoding.y.title='Entity'

# # plot
# chart_col &= chart
# chart_col