# 3.4 Replicating Yale MeSH Analyzer

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-4-replicating-yale-mesh-analyzer.ipynb) 

In this notebook we will replicate the [Yale MeSH Analyzer](https://mesh.med.yale.edu/), a tool that takes a list of PMIDs and returns the MeSH terms associated with those studies. We will see how the APIs we have discussed can be used to replicate this tool. We'll continue to use the running example of data from "[Blue-Light Therapy for Acne Vulgaris: A Systematic Review and Meta-Analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6846280/)".

In [1]:
!pip install requests pandas -q
import requests
import pandas as pd
from collections import Counter

We will keep using the same set of seed studies as in the last few notebooks.

In [2]:
seed_studies = ["27575854", "25594129", "20098847", "22091799", "23278295", "24313686", "29152718", "10809858",
                "18664153", "15379878"]

First, we need to get the MeSH terms associated with each study. For a deeper explanation of the next cell, take a look at [Section 3.2](https://github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-2-searching-clinicaltrials-gov.ipynb).

In [3]:
response = requests.get(  # GET request
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",  # URL of the API
    params={  # Parameters of the request
        "db": "pubmed",
        "id": ",".join(seed_studies),  # We can get multiple PMIDs at once
        "rettype": "medline",
        "retmode": "text"
    }
).text

pubmed_studies = []  # This will contain all the studies once processed
sections = response.split("\n\n")  # Records are separated by a blank line, i.e. two consecutive newlines

for section in sections:  # Now, we process each section.
    # The next few lines of code convert the lines into a JSON format
    data_dict = {}
    last_key = None
    for line in section.splitlines():
        if line.strip() == "":
            continue
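        if len(line) > 4 and line[4] == "-":  # A new field: 4-character tag followed by "-"
            key, value = line.split("-", 1)  # Split on the first hyphen only, so hyphens in values survive
            last_key = key.strip()
            value = value.strip()
            if last_key in data_dict:  # Repeated field (e.g. MH): collect the values in a list
                if not isinstance(data_dict[last_key], list):
                    data_dict[last_key] = [data_dict[last_key]]
                data_dict[last_key].append(value)
            else:
                data_dict[last_key] = value
        elif last_key is not None:  # Continuation line: append to the most recent value
            if isinstance(data_dict[last_key], list):
                data_dict[last_key][-1] += " " + line.strip()
            else:
                data_dict[last_key] += " " + line.strip()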

    pubmed_studies.append(data_dict)
len(seed_studies), len(pubmed_studies)

(10, 10)
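To see how this kind of parsing behaves on a single record without calling the API, here is a minimal, self-contained sketch. The function name `parse_medline_record` and the sample record are made up for illustration and are not part of the notebook or of any library. The sketch assumes the MEDLINE text layout used above: a tag of up to four characters, a hyphen in the fifth position, and indented continuation lines. It splits on the first hyphen only, so hyphenated values stay intact.

```python
def parse_medline_record(record: str) -> dict:
    """Parse one MEDLINE-format record into a dict of tag -> value (or list of values)."""
    fields = {}
    last_key = None
    for line in record.splitlines():
        if not line.strip():
            continue  # Skip blank lines
        if len(line) > 4 and line[4] == "-":
            # New field: split on the first hyphen only
            key, value = line.split("-", 1)
            last_key = key.strip()
            value = value.strip()
            if last_key in fields:
                # Repeated tag (e.g. MH): collect values in a list
                if not isinstance(fields[last_key], list):
                    fields[last_key] = [fields[last_key]]
                fields[last_key].append(value)
            else:
                fields[last_key] = value
        elif last_key is not None:
            # Continuation line: extend the most recent value
            if isinstance(fields[last_key], list):
                fields[last_key][-1] += " " + line.strip()
            else:
                fields[last_key] += " " + line.strip()
    return fields


# Hypothetical record, written by hand for this example
sample = (
    "PMID- 12345678\n"
    "TI  - A made-up, split-face trial title that wraps\n"
    "      onto a second line.\n"
    "MH  - Acne Vulgaris/*therapy\n"
    "MH  - Adult\n"
    "MH  - Humans\n"
)

record = parse_medline_record(sample)
print(record["TI"])
print(record["MH"])
```

Repeated tags such as `MH` end up as a list, while tags that occur once stay plain strings, which is worth keeping in mind when iterating over a field later on.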

The cell that fetches and parses the records should return the same number of studies as we have seed studies. Now we can take a look at the MeSH terms associated with each study. First, let's look at all the data we extracted.

In [4]:
df_studies = pd.DataFrame(pubmed_studies)
df_studies

| | PMID | OWN | STAT | DCOM | LR | IS | VI | IP | DP | TI | ... | EDAT | MHDA | CRDT | PHST | AID | PST | SO | OTO | OT | TT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27575854 | NLM | MEDLINE | 20170626 | 20220408 | [1365, 0011] | 55 | 12 | 2016 Dec | A multicenter, randomized, splitsafety of chro... | ... | 2016/11/05 06:00 | 2017/06/27 06:00 | 2016/08/31 06:00 | [2015/12/06 00:00 [received], 2016/02/28 00:00... | 10.1111/ijd.13349 [doi] | ppublish | Int J Dermatol. 2016 Dec;55(12):132130. | | | |
| 1 | 25594129 | NLM | MEDLINE | 20160512 | 20161125 | [1476, 1476] | 17 | 4 | 2015 | A randomized controlled study for the treatmen... | ... | 2015/01/17 06:00 | 2016/05/14 06:00 | 2015/01/17 06:00 | [2015/01/17 06:00 [entrez], 2015/01/17 06:00 [... | 10.3109/14764172.2015.1007064 [doi] | ppublish | J Cosmet Laser Ther. 2015;17(4):1702015 Feb 20. | NOTNLM | [LED, RCT, acne vulgaris, light, photorejuvena... | |
| 2 | 20098847 | NLM | MEDLINE | 20100603 | 20191120 | [1806, 0365] | 84 | 5 | 2009 Sep | [A prospective, randomized, open and comparati... | ... | 2010/01/26 06:00 | 2010/06/04 06:00 | 2010/01/26 06:00 | [2008/05/20 00:00 [received], 2009/07/31 00:00... | [S0365, 10.1590/s0365] | ppublish | An Bras Dermatol. 2009 Sep | | | Estudo clinico, prospectivo, aberto, randomiza... |
| 3 | 22091799 | NLM | MEDLINE | 20120308 | 20111118 | [1476, 1476] | 13 | 6 | 2011 Dec | Clinical efficacy of home | ... | 2011/11/19 06:00 | 2012/03/09 06:00 | 2011/11/19 06:00 | [2011/11/19 06:00 [entrez], 2011/11/19 06:00 [... | 10.3109/14764172.2011.630081 [doi] | ppublish | J Cosmet Laser Ther. 2011 Dec;13(6):308 | | | |
| 4 | 23278295 | NLM | MEDLINE | 20131113 | 20221207 | [1365, 0007] | 168 | 5 | 2013 May | The clinical and histological effect of homeph... | ... | 2013/01/03 06:00 | 2013/11/14 06:00 | 2013/01/03 06:00 | [2013/01/03 06:00 [entrez], 2013/01/03 06:00 [... | 10.1111/bjd.12186 [doi] | ppublish | Br J Dermatol. 2013 May;168(5):1088 | | | |
| 5 | 24313686 | NLM | MEDLINE | 20150522 | 20140916 | [1600, 0905] | 30 | 5 | 2014 Oct | Randomized trial of three phototherapy methods... | ... | 2013/12/10 06:00 | 2015/05/23 06:00 | 2013/12/10 06:00 | [2013/12/03 00:00 [accepted], 2013/12/10 06:00... | 10.1111/phpp.12098 [doi] | ppublish | Photodermatol Photoimmunol Photomed. 2014 Oct;... | NOTNLM | [acne vulgaris, intense pulsed light, light, p... | |
| 6 | 29152718 | NLM | MEDLINE | 20180807 | 20180807 | [1365, 0011] | 57 | 1 | 2018 Jan | An extension of a multicenter, randomized, spl... | ... | 2017/11/21 06:00 | 2018/08/08 06:00 | 2017/11/21 06:00 | [2016/12/21 00:00 [received], 2017/08/17 00:00... | 10.1111/ijd.13814 [doi] | ppublish | Int J Dermatol. 2018 Jan;57(1):94 | | | |
| 7 | 10809858 | NLM | MEDLINE | 20000629 | 20220316 | [0007, 0007] | 142 | 5 | 2000 May | Phototherapy with blue (415 nm) and red (660 n... | ... | 2000/05/16 09:00 | 2000/07/06 11:00 | 2000/05/16 09:00 | [2000/05/16 09:00 [pubmed], 2000/07/06 11:00 [... | [bjd3481 [pii], 10.1046/j.1365] | ppublish | Br J Dermatol. 2000 May;142(5):973 | | | |
| 8 | 18664153 | NLM | MEDLINE | 20080926 | 20220311 | [1545, 1545] | 7 | 7 | 2008 Jul | Phototherapy in the treatment of acne vulgaris. | ... | 2008/07/31 09:00 | 2008/09/27 09:00 | 2008/07/31 09:00 | [2008/07/31 09:00 [pubmed], 2008/09/27 09:00 [... | | ppublish | J Drugs Dermatol. 2008 Jul;7(7):627 | | | |
| 9 | 15379878 | NLM | MEDLINE | 20050201 | 20220317 | [0905, 0905] | 20 | 5 | 2004 Oct | Blue light phototherapy in the treatment of acne. | ... | 2004/09/24 05:00 | 2005/02/03 09:00 | 2004/09/24 05:00 | [2004/09/24 05:00 [pubmed], 2005/02/03 09:00 [... | [PPP109 [pii], 10.1111/j.1600] | ppublish | Photodermatol Photoimmunol Photomed. 2004 Oct;... | | | |


We can filter the columns of this table down to just the PMID, Title, and MeSH terms.

In [5]:
df_studies[["PMID", "TI", "MH"]]

| | PMID | TI | MH |
|---|---|---|---|
| 0 | 27575854 | A multicenter, randomized, splitsafety of chro... | [Acne Vulgaris/complications/*therapy, Adolesc... |
| 1 | 25594129 | A randomized controlled study for the treatmen... | [Acne Vulgaris/*therapy, Adolescent, Adult, Co... |
| 2 | 20098847 | [A prospective, randomized, open and comparati... | [Acne Vulgaris/*therapy, Administration, Topic... |
| 3 | 22091799 | Clinical efficacy of home | [Acne Vulgaris/pathology/*therapy, Adult, Face... |
| 4 | 23278295 | The clinical and histological effect of homeph... | [Acne Vulgaris/pathology/*therapy, Asian Peopl... |
| 5 | 24313686 | Randomized trial of three phototherapy methods... | [Acne Vulgaris/*therapy, Adolescent, Adult, Fe... |
| 6 | 29152718 | An extension of a multicenter, randomized, spl... | [Acne Vulgaris/*therapy, Adolescent, Adult, Co... |
| 7 | 10809858 | Phototherapy with blue (415 nm) and red (660 n... | [Acne Vulgaris/*therapy, Adolescent, Adult, Be... |
| 8 | 18664153 | Phototherapy in the treatment of acne vulgaris. | [Acne Vulgaris/drug therapy/*therapy, Adult, F... |
| 9 | 15379878 | Blue light phototherapy in the treatment of acne. | [Acne Vulgaris/classification/pathology/*thera... |


This isn't particularly useful yet, since each study's MeSH terms are packed into a single list in one cell. Instead, we need to invert the data so that each MeSH term maps to the studies it appears in. That's exactly what the next cell does.

In [6]:
mesh_index = {}
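for position, study in enumerate(pubmed_studies):  # position = index of the study in pubmed_studies
    for term in study["MH"]:
        if term not in mesh_index:
            mesh_index[term] = [0] * len(pubmed_studies)  # Start with "absent from every study"
        mesh_index[term][position] = 1  # Mark the term as present in this study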
mesh_index["Adult"]

[1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
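To make the inversion concrete, here is a small sketch of the same idea on toy data. The records, PMIDs, and variable names below are invented for illustration only. Each term maps to a list of 0s and 1s with one position per study, and a term shared by every study is one whose list contains only 1s.

```python
# Toy records standing in for the parsed PubMed studies (hypothetical data)
toy_studies = [
    {"PMID": "111", "MH": ["Adult", "Humans", "Phototherapy"]},
    {"PMID": "222", "MH": ["Humans"]},
    {"PMID": "333", "MH": ["Adult", "Humans"]},
]

toy_index = {}
for position, study in enumerate(toy_studies):
    for term in study["MH"]:
        # Create an all-zero presence vector the first time a term is seen,
        # then mark the current study's position with a 1
        toy_index.setdefault(term, [0] * len(toy_studies))[position] = 1

print(toy_index["Adult"])  # [1, 0, 1]

# Terms that were assigned to every study in the set
shared_terms = [term for term, vector in toy_index.items() if all(vector)]
print(shared_terms)  # ['Humans']
```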

Now we have the data in a format that tells us which studies each MeSH term appears in. Each position in the list corresponds to a study, in the same order as `pubmed_studies`: a 1 means the MeSH term was assigned to that study, and a 0 means it wasn't. We can now convert this into a table for easier viewing.

In [7]:
# Label each column with the PMID of the parsed study it describes, then sort the MeSH terms alphabetically
pd.DataFrame(mesh_index, index=[study["PMID"] for study in pubmed_studies]).T.sort_index()

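| | 27575854 | 25594129 | 20098847 | 22091799 | 23278295 | 24313686 | 29152718 | 10809858 | 18664153 | 15379878 |
|---|---|---|---|---|---|---|---|---|---|---|
| *Phototherapy | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| *Phototherapy/instrumentation | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acne Vulgaris/*therapy | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| Acne Vulgaris/classification/pathology/*therapy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Acne Vulgaris/complications/*therapy | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acne Vulgaris/drug therapy/*therapy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Acne Vulgaris/pathology/*therapy | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Administration, Topical | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Adolescent | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
| Adult | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |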


We can go one step further and use this data to count how many studies each MeSH term appears in. We just need to count the number of 1s in each term's list, that is, in each row of the table above.

In [8]:
mesh_count = Counter()
for term, vec in mesh_index.items():
    mesh_count[term] = sum(vec)
pd.DataFrame(mesh_count.most_common(), columns=["MeSH Term", "Count"])

| | MeSH Term | Count |
|---|---|---|
| 0 | Female | 10 |
| 1 | Humans | 10 |
| 2 | Male | 10 |
| 3 | Adult | 8 |
| 4 | Adolescent | 7 |
| 5 | Severity of Illness Index | 6 |
| 6 | Acne Vulgaris/*therapy | 5 |
| 7 | Treatment Outcome | 5 |
| 8 | Young Adult | 4 |
| 9 | Phototherapy/adverse effects/*methods | 4 |
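Note that these counts treat every heading and subheading combination as a separate term, so "Acne Vulgaris/*therapy" and "Acne Vulgaris/pathology/*therapy" are counted apart. One possible refinement, sketched below on invented data, is to count by main heading only. The helper `main_heading` is hypothetical. It assumes the MEDLINE conventions visible in the output above: subheadings follow a "/" and an asterisk marks a major topic. It also assumes heading names themselves contain no "/".

```python
from collections import Counter


def main_heading(mesh_term: str) -> str:
    """Drop subheadings (everything after the first '/') and the '*' major-topic marker."""
    return mesh_term.split("/", 1)[0].lstrip("*").strip()


# Hypothetical MH lists for three studies
toy_terms_per_study = [
    ["Acne Vulgaris/complications/*therapy", "Adult", "*Phototherapy"],
    ["Acne Vulgaris/*therapy", "Adult", "Phototherapy/adverse effects/*methods"],
    ["Acne Vulgaris/pathology/*therapy", "Humans"],
]

heading_count = Counter()
for terms in toy_terms_per_study:
    # Use a set so that a heading is counted at most once per study
    heading_count.update({main_heading(term) for term in terms})

print(heading_count["Acne Vulgaris"])  # 3
print(heading_count["Phototherapy"])   # 2
```

Whether to collapse subheadings depends on the goal: the full strings show how a topic was indexed, while main headings give a coarser picture of which concepts recur across the set.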


As before, we can save this data to a CSV file for further analysis.

In [9]:
pd.DataFrame(mesh_count.most_common(), columns=["MeSH Term", "Count"]).to_csv("mesh_terms.csv", index=False)
# This should save a file called mesh_terms.csv in the current directory
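The term-by-study grid can be exported in the same way as the counts. The sketch below uses only the standard library and toy data. The names `toy_pmids` and `toy_index` and the file name `mesh_grid.csv` are assumptions for illustration, and an in-memory buffer stands in for a real file so that the example has no side effects.

```python
import csv
import io

# Toy data in the same shape as the inverted index: term -> presence vector
toy_pmids = ["111", "222", "333"]
toy_index = {"Humans": [1, 1, 1], "Adult": [1, 0, 1]}

# In-memory stand-in for: open("mesh_grid.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["MeSH Term"] + toy_pmids)  # Header: one column per study
for term in sorted(toy_index):  # One row per MeSH term, alphabetical
    writer.writerow([term] + toy_index[term])

print(buffer.getvalue())
```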

## Summary

In this notebook, we have replicated the Yale MeSH Analyzer. We have taken a list of PMIDs and extracted the MeSH terms associated with each study. We then inverted this data to show which MeSH terms appear in which studies. Finally, we counted the number of times each MeSH term appeared in the studies. The results of this notebook can be used to quickly identify the most common MeSH terms associated with a set of studies, for use in search strategy development.

If you found any mistakes or have any feedback about any of the chapters, please reach out, or open an issue on the [GitHub repository](https://github.com/hscells/apis-for-evidence-identification). Thank you for reading!

---
[top](https://github.com/hscells/apis-for-evidence-identification#table-of-contents)<br/>
[previous: Frequency Analysis](https://github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-3-frequency-analysis.ipynb)<br/>