# Open discovery model

## Motivation

Shortly after his groundbreaking study of fish oil and Raynaud's disease, Swanson (1988) applied the same methodology to the role of magnesium in migraine. He observed that a lack of magnesium might exacerbate migraine through a range of intermediate factors, including stress, spreading cortical depression, epilepsy, platelet aggregation, serotonin and substance P levels, inflammation, vasoconstriction, prostaglandin formation, and hypoxia. Conversely, magnesium's ability to inhibit calcium channels could potentially prevent migraine episodes.

The figure below illustrates the open discovery setting. Open discovery fundamentally serves as a method for generating new hypotheses. This process can be described using Swanson's (1986) ABC model: Suppose the research begins with a term, 'c', with 'C' representing the body of literature associated with 'c'. Assume that domain 'A' comprises literature that may reveal previously unknown connections to 'C'. The aim is to identify 'A'. According to Swanson, this involves navigating

> through some intermediate literature (B) toward an unknown destination A. The success of this endeavor relies entirely on the knowledge and creativity of the researcher...

<img src="img/open_discovery.png" alt="Open discovery model" width="300px"/>

To streamline the search during literature-based discovery (LBD), we will demonstrate how representing documents by knowledge concepts, rather than by words and phrases alone, can be more effective.
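
As a rough illustration of the open discovery setting described above, the sketch below walks from a start concept c through intermediate concepts (B) to candidate concepts (A), keeping only candidates that never co-occur with c directly. The `docs` table and the function names are made up for illustration; they are not data or helpers from this notebook.

```python
# Toy open discovery (C -> B -> A) over a hypothetical set of indexed documents.
# Each document is represented by the set of concepts it is indexed with.
docs = {
    1: {"Migraine Disorders", "Vasoconstriction"},
    2: {"Migraine Disorders", "Platelet Aggregation"},
    3: {"Vasoconstriction", "Magnesium"},
    4: {"Platelet Aggregation", "Magnesium"},
    5: {"Migraine Disorders", "Calcium"},
    6: {"Vasoconstriction", "Calcium"},
}


def neighbours(concept, docs):
    """Concepts that share at least one document with `concept`."""
    found = set()
    for concepts in docs.values():
        if concept in concepts:
            found |= concepts - {concept}
    return found


def open_discovery(c, docs):
    """Map each candidate A concept to the B concepts that link it to c.

    A candidates are reachable through some B but never co-occur with c.
    """
    b_terms = neighbours(c, docs)
    candidates = {}
    for b in b_terms:
        for a in neighbours(b, docs) - b_terms - {c}:
            candidates.setdefault(a, set()).add(b)
    return candidates


result = open_discovery("Migraine Disorders", docs)
print(result)
```

In this toy table Calcium is discarded because it already co-occurs with the start concept, which mirrors the novelty check applied at the end of the notebook.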

## MeSH ontology

[MeSH](https://www.nlm.nih.gov/mesh/intro_record_types.html) contains three basic types of MeSH Records:

1. *Descriptors* characterize the subject matter or content. This record type plays a central role in the MeSH vocabulary as a unit of indexing and retrieval. With the exception of Class 3 descriptors (see below), all descriptors are organised into a numbered tree structure or hierarchy that allows users to browse in an orderly fashion from broader to narrower topics.

2. *Qualifiers* are used with descriptors and afford a convenient means of grouping together those citations that are concerned with a particular aspect of a subject. There are 78 topical Qualifiers (also known as Subheadings) used for indexing and cataloging in conjunction with Descriptors. For example, Liver/drug effects indicates that the article or book is not about the liver in general, but about the effect of drugs on the liver. Qualifiers are searchable in PubMed as MeSH Subheadings (SH field).

3. *Supplementary Concept Records* (SCRs) are used to index chemicals, drugs, and other concepts such as rare diseases for MEDLINE and are searchable by Substance Name field (NM) in PubMed. Unlike Descriptors, SCRs are not organised in a tree hierarchy.

Descriptors are divided into four classes:

1. *Main Headings* (Class 1).
These records are topical headings that are used to index citations in NLM's MEDLINE database and other databases and to catalog publications; they are searchable in PubMed as [MH]. Most Descriptors indicate the subject of an indexed item, such as a journal article, that is, what the article is about. Descriptors are generally updated on an annual basis but may, on occasion, be updated more frequently.

2. *Publication Characteristics* (Class 2).
These records indicate what the indexed item is, i.e., its genre, rather than what it is about, for example, Historical Article. They may include Publication Components, such as Charts; Publication Formats, such as Editorial; and Study Characteristics, such as Clinical Trial. They function as metadata, rather than being about the content. These records are searchable in PubMed as Publication Type [PT], and the terms in MEDLINE records are labeled as "PT" or `<PublicationType>` rather than "MH" or `<MeSHHeading>`. They are listed in category V of the MeSH Tree Structures. A list of Publication Types, with Scope Notes, is available.

3. *Check Tags* (Class 3).
This class of descriptors is used solely for tagging citations that contain certain categories of information. They do not appear in the MeSH tree. Modernization has largely eliminated the need for this data type, and many of the Check Tags have been changed to Class 1 headings that can be used either as an MH or as a Check Tag. Currently only two Class 3 descriptors remain: "Male" and "Female".

4. *Geographics* (Class 4).
These descriptors include continents, regions, countries, states, and other geographic subdivisions. They are not used to characterize subject content but rather physical location. They are listed in category Z of the MeSH Tree Structures.
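
The four descriptor classes correspond to the `dclass` codes in the descriptor table loaded later in this notebook. A minimal sketch of that mapping follows; the enum, the dictionary, and the helper name are illustrative assumptions and are not part of MeSH tooling or of the notebook's `utils` module.

```python
from enum import IntEnum


class DescriptorClass(IntEnum):
    """MeSH descriptor classes as listed above."""
    MAIN_HEADING = 1
    PUBLICATION_CHARACTERISTIC = 2
    CHECK_TAG = 3
    GEOGRAPHIC = 4


# PubMed search field per class, only where the text above names one.
PUBMED_FIELD = {
    DescriptorClass.MAIN_HEADING: "MH",
    DescriptorClass.PUBLICATION_CHARACTERISTIC: "PT",
}


def is_main_heading(dclass: int) -> bool:
    """True for Class 1 descriptors, the only class kept for concept filtering."""
    return DescriptorClass(dclass) is DescriptorClass.MAIN_HEADING


print([c.name for c in DescriptorClass if is_main_heading(c)])
```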

In [148]:
# Load required libraries for NLP and data analysis
import gzip
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

from utils import *  # local helper module; provides rank_terms(), used below

In [170]:
# Load the dataset for the first domain (C: migraine literature).
# The MEDLINE records are stored as a gzip-compressed, pipe-separated (PSV) file
# with columns PMID, publication date, title, and MeSH headings.
# The ';'-separated 'mh' string is split into a list and exploded so that each
# (pmid, MeSH heading) pair gets its own row; only 'pmid' and 'mh' are kept.

df_1 = pd.read_csv('./input/pmid_dp_ti_mh_migraine_20241124.psv.gz', sep='|', names=['pmid', 'dp', 'ti', 'mh']) \
         .assign(mh=lambda df: df['mh'].str.split(';')) \
         .explode('mh', ignore_index=True) \
         .loc[:, ['pmid', 'mh']]

print(f"No. of unique PMIDs: {df_1['pmid'].nunique()}")
print(f"No. of unique MeSH terms: {df_1['mh'].nunique()}")

df_1

No. of unique PMIDs: 1156
No. of unique MeSH terms: 1812


|  | pmid | mh |
|---|---|---|
| 0 | 12653 | Adrenergic beta-Antagonists |
| 1 | 12653 | Anxiety |
| 2 | 12653 | Arrhythmias, Cardiac |
| 3 | 12653 | Asthma |
| 4 | 12653 | Heart Diseases |
| ... | ... | ... |
| 13264 | 22216523 | Male |
| 13265 | 22216523 | Metoclopramide |
| 13266 | 22216523 | Middle Aged |
| 13267 | 22216523 | Migraine Disorders |
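
To make the wide-to-long step concrete, here is a small sketch of the same `str.split` plus `explode` pattern on two made-up records. The frame `wide` and its values are illustrative only and do not come from the input file.

```python
import pandas as pd

# Two made-up records in the same wide layout as the input file:
# one row per PMID, with MeSH headings joined by ';'.
wide = pd.DataFrame({
    "pmid": [1, 2],
    "mh": ["Migraine Disorders;Vasoconstriction", "Magnesium"],
})

long_df = (
    wide
    .assign(mh=lambda df: df["mh"].str.split(";"))  # string -> list of headings
    .explode("mh", ignore_index=True)               # one row per (pmid, heading)
)

print(long_df)
```

The PMID is repeated once per heading, which is what lets later steps join on `mh` and group back by `pmid`.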


To further narrow the search space in the discovery process, we filter the results of the text-to-concept mapping using semantic types. Different semantic filters can be applied at different stages of the LBD process. For example, when selecting intermediate concepts (B), we focus on concepts with functional semantic types such as *Biologic Function*, *Cell Function*, *Phenomenon or Process*, and *Physiologic Function*. Similarly, when identifying dietary factors as A-concepts, a typical filter includes the semantic types *Vitamin* and *Element, Ion, or Isotope*.

Next, we augment the data frame `df_1`, prepared earlier, with the semantic types of each extracted MeSH term. The `df_2` data frame holds a pre-prepared table (file `./input/d2024.psv.gz`) that maps each MeSH heading to its semantic type identifiers. From it we derive `df_3`, which keeps only the main headings (i.e., Descriptor Class 1, as described above).

In [172]:
# Columns: descriptor unique ID, MeSH heading, ';'-separated semantic type IDs, descriptor class
df_2 = pd.read_csv('./input/d2024.psv.gz', sep='|', names=['dui','mh','sty','dclass'])
df_2

|  | dui | mh | sty | dclass |
|---|---|---|---|---|
| 0 | D000001 | Calcimycin | T109;T195 | 1 |
| 1 | D000002 | Temefos | T109;T131 | 1 |
| 2 | D000003 | Abattoirs | T073 | 1 |
| 3 | D000004 | Abbreviations as Topic | T170 | 1 |
| 4 | D000005 | Abdomen | T029 | 1 |
| ... | ... | ... | ... | ... |
| 30759 | D000097924 | Paraptosis | T043 | 1 |
| 30760 | D000097942 | Hearing Loss, Hidden | T184 | 1 |
| 30761 | D000097962 | Gender-Affirming Care | T058 | 1 |
| 30762 | D000097983 | ERRalpha Estrogen-Related Receptor | T116;T123 | 1 |


In [173]:
print(f"No. of unique MeSH terms in df_2: {df_2['mh'].nunique()}")

No. of unique MeSH terms in df_2: 30764


In [162]:
df_2['dclass'].value_counts()

dclass
1    30172
4      401
2      189
3        2
Name: count, dtype: int64

In [168]:
# Keep main headings (Descriptor Class 1) only and split the semantic type IDs into lists
df_3 = df_2.query('dclass == 1') \
           .loc[:, ['mh', 'sty']] \
           .assign(sty=lambda x: x['sty'].str.split(';'))

df_3

|  | mh | sty |
|---|---|---|
| 0 | Calcimycin | [T109, T195] |
| 1 | Temefos | [T109, T131] |
| 2 | Abattoirs | [T073] |
| 3 | Abbreviations as Topic | [T170] |
| 4 | Abdomen | [T029] |
| ... | ... | ... |
| 30759 | Paraptosis | [T043] |
| 30760 | Hearing Loss, Hidden | [T184] |
| 30761 | Gender-Affirming Care | [T058] |
| 30762 | ERRalpha Estrogen-Related Receptor | [T116, T123] |


In [174]:
# Semantic types accepted when selecting intermediate (B) concepts
sty_filt = {'T116': 'Amino Acid, Peptide, or Protein',
            'T033': 'Finding',
            'T046': 'Pathologic Function',
            'T038': 'Biologic Function',
            'T067': 'Phenomenon or Process',
            'T043': 'Cell Function',
            'T044': 'Molecular Function',
            'T040': 'Organism Function',
            'T042': 'Organ or Tissue Function',
            'T039': 'Physiologic Function'}

df_4 = pd.DataFrame(sty_filt.items(), columns=['sty', 'sty_name'])
df_4

|  | sty | sty_name |
|---|---|---|
| 0 | T116 | Amino Acid, Peptide, or Protein |
| 1 | T033 | Finding |
| 2 | T046 | Pathologic Function |
| 3 | T038 | Biologic Function |
| 4 | T067 | Phenomenon or Process |
| 5 | T043 | Cell Function |
| 6 | T044 | Molecular Function |
| 7 | T040 | Organism Function |
| 8 | T042 | Organ or Tissue Function |
| 9 | T039 | Physiologic Function |


In [167]:
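# MeSH headings that carry at least one of the selected semantic types.
# The right join follows the key order of df_4, so headings are grouped by semantic type.
df_5 = df_3.explode('sty', ignore_index=True) \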
           .merge(right=df_4, how='right', on='sty') \
           .filter(items=['mh']) \
           .drop_duplicates()

df_5

|  | mh |
|---|---|
| 0 | Abrin |
| 1 | Acetate Kinase |
| 2 | Acetoin Dehydrogenase |
| 3 | Acetolactate Synthase |
| 4 | Acetyl-CoA C-Acetyltransferase |
| ... | ... |
| 6567 | Sterilizing Immunity |
| 6568 | Vernalization |
| 6569 | Angiogenesis |
| 6570 | Host Tropism |
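
The semantic-type filter can be traced on a handful of rows. The sketch below reuses four headings and their type IDs from the `df_2` output shown earlier, with a two-entry filter chosen for illustration. It uses an inner join, which is an assumption that differs from the notebook's right join: a right join would also keep any filter entry that matches no heading, leaving a missing value in `mh`.

```python
import pandas as pd

# Four rows copied from the descriptor table shown earlier, in the shape of df_3.
mesh = pd.DataFrame({
    "mh": ["Calcimycin", "Abdomen", "Paraptosis",
           "ERRalpha Estrogen-Related Receptor"],
    "sty": [["T109", "T195"], ["T029"], ["T043"], ["T116", "T123"]],
})

# Illustrative filter: 'Amino Acid, Peptide, or Protein' and 'Cell Function'.
keep = pd.DataFrame({"sty": ["T116", "T043"]})

selected = (
    mesh
    .explode("sty", ignore_index=True)   # one row per (heading, semantic type)
    .merge(keep, how="inner", on="sty")  # keep rows whose type is in the filter
    .loc[:, ["mh"]]
    .drop_duplicates()                   # a heading may match several types
)

print(selected)
```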


In [175]:
# Keep only the (pmid, MeSH heading) rows whose heading has one of the desired semantic types
df_6 = df_1.merge(right=df_5, how='inner', on='mh')
df_6

|  | pmid | mh |
|---|---|---|
| 0 | 12653 | Arrhythmias, Cardiac |
| 1 | 26908 | Vasoconstriction |
| 2 | 47017 | Follicle Stimulating Hormone |
| 3 | 47017 | Luteinizing Hormone |
| 4 | 47017 | Menarche |
| ... | ... | ... |
| 1263 | 15645828 | Regional Blood Flow |
| 1264 | 17152738 | Postoperative Complications |
| 1265 | 17152743 | Blood Pressure |
| 1266 | 17152743 | Heart Rate |


In [176]:
print(f"No. of unique PMIDs: {df_6['pmid'].nunique()}")
print(f"No. of unique MeSH terms: {df_6['mh'].nunique()}")

No. of unique PMIDs: 626
No. of unique MeSH terms: 283


In [177]:
# One list of MeSH headings per document (PMID)
l2l = df_6.groupby('pmid')['mh'].apply(list).to_list()

# TF-IDF over the heading lists; the identity tokenizer and preprocessor keep
# each MeSH heading as a single token. Headings in fewer than 3 documents are dropped.
tf = TfidfVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x, token_pattern=None, min_df=3)
tf_fit = tf.fit_transform(l2l)
wrd_lst = tf.get_feature_names_out()

# Score each heading by its TF-IDF weight summed over all documents
score_lst = np.array(tf_fit.sum(axis=0)).reshape(-1).tolist()
wrd2score = dict(zip(wrd_lst, score_lst))

df_tf_sel = (pd.DataFrame
             .from_dict(wrd2score, orient='index')
             .reset_index()
             .set_axis(['name', 'score'], axis=1)
             .sort_values(by='score', ascending=False)
            )

df_tf_sel.head(n=10)

|  | name | score |
|---|---|---|
| 17 | Cerebrovascular Circulation | 49.627577 |
| 18 | Cerebrovascular Disorders | 33.723809 |
| 72 | Muscle Contraction | 29.629725 |
| 8 | Blood Pressure | 26.578643 |
| 81 | Pregnancy | 25.46745 |
| 109 | Vasoconstriction | 24.989756 |
| 77 | Platelet Aggregation | 21.781851 |
| 90 | Recurrence | 17.103982 |
| 93 | Regional Blood Flow | 15.505759 |
| 71 | Monoamine Oxidase | 13.56637 |


In [150]:
# Intermediate (B) terms selected for the next step:
# b1: Vasoconstriction
# b2: Platelet Aggregation
# b3: Spreading Cortical Depression

# Semantic types accepted for candidate A-concepts (replaces the B-concept filter above)
sty_filt = {'T127': 'Vitamin',
            'T196': 'Element, Ion, or Isotope'}

# rank_terms() comes from the local utils module; one call per B-term literature
b1_df = rank_terms(in_df='./input/pmid_dp_ti_mh_vasoconstriction_20241125.psv.gz',
                   mh_df=df_3,
                   filt=sty_filt)

b2_df = rank_terms(in_df='./input/pmid_dp_ti_mh_platelet_20241125.psv.gz',
                   mh_df=df_3,
                   filt=sty_filt)

b3_df = rank_terms(in_df='./input/pmid_dp_ti_mh_cortical_20241125.psv.gz',
                   mh_df=df_3,
                   filt=sty_filt)

In [153]:
print(f'No. of ranked MeSH terms for b1: {b1_df.shape[0]}')
print(f'No. of ranked MeSH terms for b2: {b2_df.shape[0]}')
print(f'No. of ranked MeSH terms for b3: {b3_df.shape[0]}')

No. of ranked MeSH terms for b1: 21
No. of ranked MeSH terms for b2: 37
No. of ranked MeSH terms for b3: 6


In [154]:
# Common MeSH headings across all three B terms
set(b1_df['name'].to_list()) & set(b2_df['name'].to_list()) & set(b3_df['name'].to_list())

{'Calcium', 'Magnesium', 'Manganese', 'Oxygen', 'Potassium', 'Sodium'}

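In [None]:
# PubMed queries checking whether a candidate A-term was already indexed together
# with Migraine Disorders in 1966-1987 (shown here for Calcium and Magnesium)
(Migraine Disorders[MH] AND Calcium[MH]) AND 1966/01/01:1987/12/31[DP] AND medline[sb] AND hasabstract AND english[LA]
(Migraine Disorders[MH] AND Magnesium[MH]) AND 1966/01/01:1987/12/31[DP] AND medline[sb] AND hasabstract AND english[LA]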

In [None]:
'Calcium' 4
'Magnesium' 0
'Manganese' 0
'Oxygen' 6
'Potassium' 4
'Sodium' 2
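
The final check can be sketched in two small steps: building the query string for any candidate term, and keeping the candidates with no prior co-occurrence. The helper `build_query` is hypothetical, and the sketch assumes that the numbers listed above are the record counts returned by those queries.

```python
def build_query(a_term, c_term="Migraine Disorders",
                start="1966/01/01", end="1987/12/31"):
    """PubMed query for records indexed with both headings in the date range."""
    return (f"({c_term}[MH] AND {a_term}[MH]) AND {start}:{end}[DP] "
            "AND medline[sb] AND hasabstract AND english[LA]")


# Counts as listed above (assumed to be hits per query, one query per term).
hits = {"Calcium": 4, "Magnesium": 0, "Manganese": 0,
        "Oxygen": 6, "Potassium": 4, "Sodium": 2}

# Candidates with no record linking them to migraine in the period
novel = sorted(term for term, n in hits.items() if n == 0)

print(build_query("Manganese"))
print(novel)
```

Under that assumption, Magnesium and Manganese are the only candidates with no direct link to migraine in the 1966-1987 literature, which is consistent with magnesium being the connection Swanson reported.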