### MeSH Based Query Search

This notebook shows how to search documents for the diseases listed in the cardiovascular disease (CVD) branch of the MeSH tree.

In [1]:
import pandas as pd
import json
from neo4j import GraphDatabase
import csv

#### Authentication to access covidgraph.org graph

In [2]:
covid_browser = "https://covid.petesis.com:7473"
covid_url = "bolt://covid.petesis.com:7687"
user = "public"
password = "corona"

# Connect over Bolt with the credentials defined above
driver = GraphDatabase.driver(covid_url, auth=(user, password))

#### MeSH descriptor to its entity list
- Ex. ```C01.925.782.600.550.200.360: [feline infectious peritonitis]```
- A pandas DataFrame is convenient for handling CSV files, especially for transforming data by mapping a ```lambda``` function over a column.
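
As a minimal sketch of the target structure, the snippet below builds a descriptor-to-phrase-list mapping from two names copied from the heart disease table in this notebook. The `names` dictionary, the `to_phrases` helper, and the `descriptor_to_phrases` name are illustrative assumptions, not part of the original pipeline. The same helper could be passed to `Series.apply` in place of a ```lambda```.

```python
# Illustrative input only; the notebook loads the real table from a CSV file.
names = {
    "C14.280.647": "Myocardial Ischemia",
    "C14.280.647.187.150": "Angina, Unstable",
}


def to_phrases(name):
    """Lowercase a descriptor name, split it on commas, and trim each piece."""
    return [part.strip() for part in name.lower().split(",")]


# Hypothetical mapping: MeSH tree number -> list of search phrases.
descriptor_to_phrases = {tree_id: to_phrases(name) for tree_id, name in names.items()}
print(descriptor_to_phrases)
```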

#### Obtaining terms related to heart disease

In [3]:
MeSH_heart = pd.read_csv("input/mesh/heart_disease.csv")
MeSH_heart = MeSH_heart.set_index('ID')
MeSH_heart.head()

ID,name
C14.280.647,Myocardial Ischemia
C14.280.647.124,Acute Coronary Syndrome
C14.280.647.187,Angina Pectoris
C14.280.647.187.150,"Angina, Unstable"
C14.280.647.187.150.150,"Angina Pectoris, Variant"


- Use a ```lambda``` function to derive a new column from an existing one

In [4]:
MeSH_heart['phrases'] = MeSH_heart['name'].apply(lambda x: x.lower().strip())

In [5]:
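# Split on commas and trim each piece so no phrase keeps a leading space
MeSH_heart['phrases'] = MeSH_heart['phrases'].apply(lambda x: [part.strip() for part in x.split(',')])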

In [6]:
MeSH_heart.head()

ID,name,phrases
C14.280.647,Myocardial Ischemia,[myocardial ischemia]
C14.280.647.124,Acute Coronary Syndrome,[acute coronary syndrome]
C14.280.647.187,Angina Pectoris,[angina pectoris]
C14.280.647.187.150,"Angina, Unstable","[angina, unstable]"
C14.280.647.187.150.150,"Angina Pectoris, Variant","[angina pectoris, variant]"


#### Obtaining immune system pathway terms

In [7]:
MeSH_immune = pd.read_csv("input/pathways/immune_system_pathways.csv", index_col=0)
MeSH_immune = MeSH_immune.set_index('RID')
MeSH_immune.head()

RID,name,species
R-HSA-174577,Activation of C3 and C5,Homo sapiens
R-HSA-1280218,Adaptive Immune System,Homo sapiens
R-HSA-879415,Advanced glycosylation endproduct receptor sig...,Homo sapiens
R-HSA-173736,Alternative complement activation,Homo sapiens
R-HSA-983170,"Antigen Presentation: Folding, assembly and pe...",Homo sapiens


In [8]:
MeSH_immune['name'] = MeSH_immune['name'].apply(lambda x: x.lower().strip())
MeSH_immune = MeSH_immune.drop(columns='species')
MeSH_immune.head()

RID,name
R-HSA-174577,activation of c3 and c5
R-HSA-1280218,adaptive immune system
R-HSA-879415,advanced glycosylation endproduct receptor sig...
R-HSA-173736,alternative complement activation
R-HSA-983170,"antigen presentation: folding, assembly and pe..."


In [9]:
MeSH_immune['name'] = MeSH_immune['name'].apply(lambda x: x.split(':')[0].strip())
MeSH_immune['name'] = MeSH_immune['name'].apply(lambda x: x.split('&'))

In [10]:
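def expand_parentheses(phrases):
    # Trim each phrase; for 'a (b) c' keep 'a c' and add the parenthesised 'b' as its own phrase
    expanded = []
    for val in phrases:
        val = val.strip()
        open_in = val.find('(')
        close_in = val.find(')')
        if open_in != -1 and close_in > open_in:
            outside = val[:open_in].strip() + ' ' + val[close_in + 1:].strip()
            expanded.append(outside.strip())
            expanded.append(val[open_in + 1:close_in].strip())
        else:
            expanded.append(val)
    return expanded

MeSH_immune['name'] = MeSH_immune['name'].apply(expand_parentheses)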

In [11]:
MeSH_immune.head()

RID,name
R-HSA-174577,[activation of c3 and c5]
R-HSA-1280218,[adaptive immune system]
R-HSA-879415,[advanced glycosylation endproduct receptor si...
R-HSA-173736,[alternative complement activation]
R-HSA-983170,[antigen presentation]


#### Combine both value sets
- Pair every heart disease MeSH phrase list with every immune system pathway phrase list
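
The pairing is a Cartesian product, so the number of pairs (and therefore of graph queries) is the product of the two list lengths. The sketch below illustrates this with `itertools.product` on a few phrase lists taken from the tables above. The variable names `heart_phrases`, `immune_phrases`, and `pairs` are illustrative assumptions.

```python
from itertools import product

# Small illustrative subsets of the two phrase columns.
heart_phrases = [["myocardial ischemia"], ["acute coronary syndrome"]]
immune_phrases = [
    ["activation of c3 and c5"],
    ["adaptive immune system"],
    ["alternative complement activation"],
]

# product() yields the same pairs, in the same order, as a nested comprehension.
pairs = list(product(heart_phrases, immune_phrases))

print(len(pairs))  # 2 * 3 pairs
print(pairs[0])
```

Checking this count on the full tables before querying gives a rough idea of how long the search loop may take.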

In [12]:
all_ = [(x, y) for x in MeSH_heart['phrases'] for y in MeSH_immune['name']]

In [13]:
all_[0:4]

[(['myocardial ischemia'], ['activation of c3 and c5']),
 (['myocardial ischemia'], ['adaptive immune system']),
 (['myocardial ischemia'],
  ['advanced glycosylation endproduct receptor signaling']),
 (['myocardial ischemia'], ['alternative complement activation'])]

#### MeSH to Doc Mapping
- Create a dictionary where each key is a MeSH descriptor and each value is a list of papers (publications) whose body text mentions the MeSH terms
- Each paper is represented as a dictionary that maps attribute names (cord_uid, journal, title, etc.) to their values
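
A minimal sketch of that structure follows. The `mesh_to_docs` container and the `add_paper` helper are hypothetical names, and the paper values are placeholders rather than real query results. Only the attribute names follow the Paper nodes in the graph.

```python
from collections import defaultdict

# Hypothetical container: MeSH descriptor -> list of paper dictionaries.
mesh_to_docs = defaultdict(list)


def add_paper(descriptor, paper):
    """Append paper under descriptor unless its cord_uid is already listed."""
    known = {p["cord_uid"] for p in mesh_to_docs[descriptor]}
    if paper["cord_uid"] not in known:
        mesh_to_docs[descriptor].append(paper)


# Placeholder values, not a real result from the graph.
placeholder = {"cord_uid": "example-uid", "journal": "Example Journal", "title": "Example title"}

add_paper("C14.280.647", placeholder)
add_paper("C14.280.647", dict(placeholder))  # same cord_uid, so it is skipped
print(dict(mesh_to_docs))
```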

##### Example of a paper node in the covid graph

In [14]:
paper_query = "MATCH (n:Paper) RETURN n LIMIT 1"
# Fetch a single Paper node and print the raw record
with driver.session() as session:
    info = session.run(paper_query)
    for item in info:
        print(item)

<Record n=<Node id=3198 labels={'Paper'} properties={'cord_uid': 'zrmkq3mz', 'cord19-fulltext_hash': '41c7a01f11ed47591d99f45774e43e45aeba0619', 'journal': 'BMC Microbiol', 'publish_time': '2009-08-12', 'source': 'PMC', 'title': 'CAPIH: A Web interface for comparative analyses and visualization of host-HIV protein-protein interactions', '_hash_id': '3c4b2ee1430dc9ac53aca87c0fc0f7eb', 'url': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2782265/'}>>


#### Write query results to file

In [16]:
ff = open("data/mesh_queries_heartdisease_immune.json", 'w')

In [17]:
# Every phrase must occur in the same BodyText node. The phrases are passed as a
# query parameter so that quotes inside a phrase cannot break the Cypher string.
query = ("MATCH (p:Paper)-[:PAPER_HAS_BODYTEXTCOLLECTION]-(:BodyTextCollection)-"
         "[:BODYTEXTCOLLECTION_HAS_BODYTEXT]-(a:BodyText) "
         "WHERE all(phrase IN $phrases WHERE LOWER(a.text) CONTAINS phrase) "
         "RETURN DISTINCT p")

for heart, immune in all_:
    # Combine the heart disease and immune pathway phrases for this pair
    phrases = heart + immune
    MeSH_result = []

    with driver.session() as session:
        for item in session.run(query, phrases=phrases):
            # Convert the returned Paper node into a plain dictionary of its properties
            MeSH_result.append(dict(item['p'].items()))

    if MeSH_result:
        print(MeSH_result)
        # One JSON array per matching pair, each on its own line
        json.dump(MeSH_result, ff)
        ff.write('\n')

ff.close()

[{'cord_uid': 'xwjqvgic', 'cord19-fulltext_hash': '6f1a0067a1612a8293e7cb64c89bf2e92b674fae', 'journal': 'Front Neurol', 'publish_time': '2019-03-22', 'source': 'PMC', 'title': 'Traumatic Spinal Cord Injury: An Overview of Pathophysiology, Models and Acute Injury Mechanisms', '_hash_id': 'f9238cecac49b0043f22456437d1d220', 'url': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6439316/'}]
[{'cord_uid': 'lcqrd4v0', 'cord19-fulltext_hash': '14c4936378b78d5039dc318e1bc6d2dd3044b014', 'journal': 'Molecular and Translational Vascular Medicine', 'publish_time': '2012-02-23', 'source': 'PMC', 'title': 'The Molecular Biology and Treatment of Systemic Vasculitis in Children', '_hash_id': '70f504f78f176f4efb8f3679a9f0a1e3', 'url': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121654/'}]
[{'cord_uid': '3sba91nh', 'cord19-fulltext_hash': 'cf253a3cc9c4a21b8c8ba76f23495dde2d908194', 'journal': 'Emergency Medicine Clinics of North America', 'publish_time': '2018-11-30', 'source': 'Elsevier', 'title': 

In [20]:
f2 = open("data/mesh_texts_heartdisease_immune.json", 'w')

In [None]:
# Same phrase filter as in the paper search, but return the matching body text itself
query = ("MATCH (a:BodyText) "
         "WHERE all(phrase IN $phrases WHERE LOWER(a.text) CONTAINS phrase) "
         "RETURN DISTINCT a")

for heart, immune in all_:
    # Combine the heart disease and immune pathway phrases for this pair
    phrases = heart + immune
    MeSH_result = []

    with driver.session() as session:
        for item in session.run(query, phrases=phrases):
            # Keep only the text property of each BodyText node
            MeSH_result.append({'text': item['a']['text']})

    if MeSH_result:
        print(MeSH_result)
        # One JSON array per matching pair, each on its own line
        json.dump(MeSH_result, f2)
        f2.write('\n')

f2.close()
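
Each `json.dump` call in the loops above writes one JSON array, so an output file can hold several arrays in sequence rather than a single JSON document, and `json.load` fails as soon as more than one array has been written. The sketch below shows one way to read such a file back with `json.JSONDecoder.raw_decode`. The `iter_json_values` helper is a hypothetical name, and the `sample` string stands in for the file contents.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it explicitly.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Stand-in for the file contents: two arrays written one after the other.
sample = json.dumps([{"text": "first"}]) + json.dumps([{"text": "second"}, {"text": "third"}])

batches = list(iter_json_values(sample))
print(len(batches))
```

Applied to the files written above, the input would be the file's contents read as a single string. Each yielded value is then the result list for one phrase pair.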