# Covid causal mechanism
Gerd Graßhoff, Humboldt University of Berlin, Max Planck Institute for History of Science, BIFOLD

# Goal

Analysis of the abstracts of more than 210 000 Covid publications yields a set of terms used to express causal relationships. Such relationships are often labelled a "mechanism". We group the terms as follows:

## Cause

- cause
- factor

## Effects

- disease
- events

## Causal relevance

### positive

- increase
- stimulate

### negative

- inhibit
- prevent
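
One way to make these groups usable in code is a plain dictionary keyed by group. The sketch below is an illustration under assumptions, not part of the analysis itself: the names `CAUSAL_TERMS` and `lemma_patterns` are hypothetical, and the terms are written as lemmas (for example `event` for "events") so that they can be turned into the `{"lemma": ...}` pattern dicts used by the sentence filter later in this notebook.

```python
# Hypothetical encoding of the term groups listed above (lemma forms).
CAUSAL_TERMS = {
    "cause": ["cause", "factor"],
    "effect": ["disease", "event"],
    "positive": ["increase", "stimulate"],
    "negative": ["inhibit", "prevent"],
}


def lemma_patterns(groups):
    """Flatten the groups into a list of {"lemma": term} pattern dicts."""
    return [{"lemma": term} for terms in groups.values() for term in terms]


patterns = lemma_patterns(CAUSAL_TERMS)
print(patterns[:2])
```

Keeping the group name as the dictionary key means a match can later be traced back to its category (cause, effect, positive or negative relevance).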

In [2]:
import pandas as pd

In [3]:
import spacy
print(spacy.__version__) 

3.0.1


In [4]:
# Load the large English spaCy model
nlp = spacy.load('en_core_web_lg')

# Load Dimensions Covid publication dataframe
Note that the data directory is parallel to the notebook directory to save GitHub storage space.
The data files are hosted on figshare; the downloaded file needs to be renamed to match the file name used in the loading cell.

https://dimensions.figshare.com/articles/dataset/Dimensions_COVID-19_publications_datasets_and_clinical_trials/11961063
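
Because the data live outside the repository, it can help to check which file is actually present before loading. The following is a minimal sketch, assuming the directory layout and file names used in this notebook (`../coviddata`, the Excel export, and the `covid.parquet` cache); the function name `choose_source` and the constants are hypothetical.

```python
from pathlib import Path

# Assumed locations, taken from the loading cell of this notebook.
DATA_DIR = Path("../coviddata")
RAW_NAME = "dimensions-covid-2021-Feb-19.xlsx"
CACHE_NAME = "covid.parquet"


def choose_source(data_dir=DATA_DIR):
    """Return (kind, path): the parquet cache if present, else the raw Excel file."""
    data_dir = Path(data_dir)
    cache = data_dir / CACHE_NAME
    raw = data_dir / RAW_NAME
    if cache.is_file():
        return "parquet", cache
    if raw.is_file():
        return "excel", raw
    raise FileNotFoundError(
        f"Neither {cache} nor {raw} exists; download the file from figshare "
        f"and rename it to {RAW_NAME}."
    )
```

The returned kind can then decide between `pd.read_parquet(path)` and `pd.read_excel(path)`, so the slow Excel import only runs when no cache exists yet.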

In [6]:
# reading the raw Excel data takes some time; the parquet copy created here loads faster
# the next two lines can be commented out once the parquet file is available
df=pd.read_excel("../coviddata/dimensions-covid-2021-Feb-19.xlsx")
df.to_parquet('../coviddata/covid.parquet')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216722 entries, 0 to 216721
Data columns (total 31 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   Date added                             216722 non-null  object 
 1   Publication ID                         216722 non-null  object 
 2   DOI                                    212447 non-null  object 
 3   PMID                                   98507 non-null   float64
 4   PMCID                                  75794 non-null   object 
 5   Title                                  216722 non-null  object 
 6   Abstract                               142644 non-null  object 
 7   Source title                           197976 non-null  object 
 8   Source UID                             197976 non-null  object 
 9   Publisher                              202095 non-null  object 
 10  MeSH terms                             40609 non-null   

In [9]:
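# the default 'snappy' codec was not available in this environment:
# RuntimeError: Compression 'snappy' not available.  Options: ['BROTLI', 'GZIP', 'UNCOMPRESSED']
# so gzip is requested explicitly, and the same file that was written is read back
df.to_parquet('../coviddata/covid.parquet',compression='gzip')
pd.read_parquet('../coviddata/covid.parquet')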

In [None]:
# keep only those publications whose abstract contains "mechanism"
dff=df[df["Abstract"].str.contains("mechanism",na=False)]
print(len(dff))
dff.head(3)

In [None]:
# manual selection of example to analyse
example=dff.iloc[4953] # 233
example["Abstract"]

# NLP of sentences in abstracts

In [None]:
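def data2DF():
    """Split every abstract in dff into sentences; return a list of dicts, one per sentence."""
    sents=[]
    for _,row in dff.iterrows():
        sentences=nlp(row["Abstract"]).sents
        chID=row["PMID"]  # publication identifier; NaN where no PMID is recorded
        for sentID,sent in enumerate(sentences):
            sents.append({"chID":chID,"sentID":sentID,"sent":sent})
    return sents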

In [None]:
sents=data2DF()
print(f"number of sentences: {len(sents)}")

## Filter sentences
Filtering sentences increases the efficiency of subsequent processing for information extraction and semantic modelling. The filter itself should be fast enough that it reduces the sentence set efficiently.

Filter categories operate on the token level of spaCy-processed sentence docs. They can therefore filter on the enriched attributes provided by the spaCy pipeline:

- matches for the following keys using their values:
    - "text"
    - "lemma"
    - "dep"
    - "pos"
    - "compound"
    - "pattern" 
    
- match pattern is provided by a JSON object: a list of dicts. 
    - each item of the list is matched on each token.
        e.g. [{"lemma":"law"}] matches if a token has a lemma=="law"
    
Logic of matches: a sentence matches as soon as one dict of the list matches one of its tokens, so the dicts of the list are or-conditions. Within a dict, all key-value pairs must hold for the same token, so they form an and-condition.
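
The or/and logic can be tried out without loading a spaCy model. The sketch below is illustrative only: `token_matches`, `sentence_matches` and `tok` are hypothetical names, and the tokens are simple stand-in objects that carry the spaCy attribute names `text`, `lemma_`, `pos_` and `dep_`.

```python
from types import SimpleNamespace


def token_matches(token, patterns):
    """True if at least one pattern dict (or) has all its keys satisfied (and)."""
    attrs = {"text": token.text, "lemma": token.lemma_,
             "pos": token.pos_, "dep": token.dep_}
    return any(
        bool(pat) and all(attrs.get(key) == value for key, value in pat.items())
        for pat in patterns
    )


def sentence_matches(tokens, patterns):
    """A sentence matches as soon as one of its tokens matches."""
    return any(token_matches(t, patterns) for t in tokens)


def tok(text, lemma, pos, dep):
    """Stand-in for a spaCy token (assumption: only these four attributes are needed)."""
    return SimpleNamespace(text=text, lemma_=lemma, pos_=pos, dep_=dep)


sentence = [
    tok("The", "the", "DET", "det"),
    tok("mechanisms", "mechanism", "NOUN", "nsubj"),
    tok("remain", "remain", "VERB", "ROOT"),
    tok("unclear", "unclear", "ADJ", "acomp"),
]

print(sentence_matches(sentence, [{"lemma": "mechanism", "pos": "NOUN"}]))  # True
print(sentence_matches(sentence, [{"lemma": "mechanism", "pos": "VERB"}]))  # False
print(sentence_matches(sentence, [{"lemma": "law"}, {"lemma": "remain"}]))  # True
```

The second pattern fails because no single token is both the lemma "mechanism" and a verb; the third succeeds because only one of the listed dicts needs to match.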

In [None]:
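def lmat(t,dfi):
    """Return True if token t satisfies at least one pattern dict in dfi.

    The dicts in dfi are or-conditions; all keys within one dict must match (and-condition).
    """
    # only these four keys are supported here; "compound" and "pattern" are not handled yet
    switch={"lemma":t.lemma_,
            "pos":t.pos_,
            "dep":t.dep_,
            "text":t.text}
    return any(bool(pat) and all(switch[k]==v for k,v in pat.items()) for pat in dfi)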
        
        
def filtsent(row,dfi):
    """Return True if any token of the sentence in row["sent"] matches dfi."""
    return any(lmat(t,dfi) for t in row["sent"])

def filterdf(df,fdict):
    ''' 
        df: dataframe with sentences after nlp processing (column "sent"),
        fdict: list of pattern dicts with filter categories and match terms
    '''
    return df[df.apply(lambda x:filtsent(x,fdict),axis=1)]

In [None]:
pat1=[{"lemma":"mechanism"}]
# used for training purposes: narrow the slice to select a few cases
sentdf=pd.DataFrame(sents)  # filterdf expects the sentence column "sent"
filterdf(sentdf.iloc[:],pat1)

In [None]:
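# assumption: doc is the parsed example abstract; it is not defined in an earlier cell shown here
doc=nlp(example["Abstract"])
for word in doc:
    # the subtree of a token spans from its left edge to its right edge
    subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
    print(subtree_span.root.text,"::",word.dep_,"::","--->",subtree_span.text)
    # print("".join(w.text_with_ws for w in word.subtree))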

In [None]:
for word in doc:
    if word.dep_=="ROOT":
        subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
        print(subtree_span.root.text,"::",word.dep_,"::","--->",subtree_span.text)
        print("children:",[t.text for t in word.children])
        # iterate over the children themselves; matching on text would also
        # pick up unrelated tokens that happen to share the same text
        for t in word.children:
            subtree_span = doc[t.left_edge.i : t.right_edge.i + 1]
            print(subtree_span.root.text,"::",t.dep_,"::","--->",subtree_span.text)



In [None]:
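# assumption: s is a parsed sentence whose to_dict() yields one record per word
# (its definition is not shown here); note that this rebinds df, which held the publications
df=pd.DataFrame(s.to_dict())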
df[["id","text","upos","head","deprel"]]

In [None]:
import deplacy
import graphviz

# render the dependency tree of doc as a Graphviz graph
graphviz.Source(deplacy.dot(doc))

In [None]:
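from networkx import MultiDiGraph

# one edge per word, pointing from the word to its syntactic head
semtree=MultiDiGraph()
for i,e in df.iterrows():
    semtree.add_edge(e["id"],e["head"],label=e["deprel"],arrowsize=1, arrowstyle='fancy')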