# 3.3 Term Frequency Analysis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-3-frequency-analysis.ipynb) 

In this notebook, we will show how the PubMed API can be used to perform a term frequency analysis by analysing the text of the retrieved documents and extracting common words and synonyms. We'll then use the MeshMate API, which uses machine learning methods developed in [this paper](https://www.sciencedirect.com/science/article/pii/S2667305322000783) to group the terms and suggest appropriate MeSH terms. We'll continue to use the running example of data from "[Blue-Light Therapy for Acne Vulgaris: A Systematic Review and Meta-Analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6846280/)"

In [1]:
!pip install requests pandas -q
import string
import requests
import pandas as pd
from collections import Counter

## Frequency Analysis

We'll assume we don't yet have a query and we want to get an understanding of which terms might retrieve studies. We'll use the seed studies to extract terms and synonyms.

In [2]:
seed_studies = ["27575854", "25594129", "20098847", "22091799", "23278295", "24313686", "29152718", "10809858",
                "18664153", "15379878"]

We can immediately grab the title and abstracts of these studies to extract terms. For a deeper explaination of the next cell, take a look at [Section 3.2]([Section 3.2](https://github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-2-searching-clinicaltrials-gov.ipynb)).

In [3]:
response = requests.get(  # GET request
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",  # URL of the API
    params={  # Parameters of the request
        "db": "pubmed",
        "id": ",".join(seed_studies),  # We can get multiple PMIDs at once
        "rettype": "medline",
        "retmode": "text"
    }
).text

pubmed_studies = []  # This will contain all the studies once processed
sections = response.split("\n\n")  # Thankfully, the responses can be split easily on two empty lines

for section in sections:  # Now, we process each section.
    # The next few lines of code convert the lines into a JSON format
    data_dict = {}
    last_key = None
    for line in section.splitlines():
        if line.strip() == "":
            continue
        if line[4] == "-":
            line = line.split("-")
            last_key = line[0].strip()
            if last_key in data_dict:
                if not isinstance(data_dict[last_key], list):
                    data_dict[last_key] = [data_dict[last_key]]
                data_dict[last_key].append(line[1].strip())
            else:
                data_dict[last_key] = line[1].strip()
        else:
            data_dict[last_key] += line.strip() + " "

    pubmed_studies.append(data_dict)

Now that we have this data, we can check it looks good in a DataFrame.

In [4]:
pd.DataFrame(pubmed_studies)

Unnamed: 0,PMID,OWN,STAT,DCOM,LR,IS,VI,IP,DP,TI,...,EDAT,MHDA,CRDT,PHST,AID,PST,SO,OTO,OT,TT
0,27575854,NLM,MEDLINE,20170626,20220408,"[1365, 0011]",55,12,2016 Dec,"A multicenter, randomized, splitsafety of chro...",...,2016/11/05 06:00,2017/06/27 06:00,2016/08/31 06:00,"[2015/12/06 00:00 [received], 2016/02/28 00:00...",10.1111/ijd.13349 [doi],ppublish,Int J Dermatol. 2016 Dec;55(12):132130.,,,
1,25594129,NLM,MEDLINE,20160512,20161125,"[1476, 1476]",17,4,2015,A randomized controlled study for the treatmen...,...,2015/01/17 06:00,2016/05/14 06:00,2015/01/17 06:00,"[2015/01/17 06:00 [entrez], 2015/01/17 06:00 [...",10.3109/14764172.2015.1007064 [doi],ppublish,J Cosmet Laser Ther. 2015;17(4):1702015 Feb 20.,NOTNLM,"[LED, RCT, acne vulgaris, light, photorejuvena...",
2,20098847,NLM,MEDLINE,20100603,20191120,"[1806, 0365]",84,5,2009 Sep,"[A prospective, randomized, open and comparati...",...,2010/01/26 06:00,2010/06/04 06:00,2010/01/26 06:00,"[2008/05/20 00:00 [received], 2009/07/31 00:00...","[S0365, 10.1590/s0365]",ppublish,An Bras Dermatol. 2009 Sep,,,"Estudo clinico, prospectivo, aberto, randomiza..."
3,22091799,NLM,MEDLINE,20120308,20111118,"[1476, 1476]",13,6,2011 Dec,Clinical efficacy of home,...,2011/11/19 06:00,2012/03/09 06:00,2011/11/19 06:00,"[2011/11/19 06:00 [entrez], 2011/11/19 06:00 [...",10.3109/14764172.2011.630081 [doi],ppublish,J Cosmet Laser Ther. 2011 Dec;13(6):308,,,
4,23278295,NLM,MEDLINE,20131113,20221207,"[1365, 0007]",168,5,2013 May,The clinical and histological effect of homeph...,...,2013/01/03 06:00,2013/11/14 06:00,2013/01/03 06:00,"[2013/01/03 06:00 [entrez], 2013/01/03 06:00 [...",10.1111/bjd.12186 [doi],ppublish,Br J Dermatol. 2013 May;168(5):1088,,,
5,24313686,NLM,MEDLINE,20150522,20140916,"[1600, 0905]",30,5,2014 Oct,Randomized trial of three phototherapy methods...,...,2013/12/10 06:00,2015/05/23 06:00,2013/12/10 06:00,"[2013/12/03 00:00 [accepted], 2013/12/10 06:00...",10.1111/phpp.12098 [doi],ppublish,Photodermatol Photoimmunol Photomed. 2014 Oct;...,NOTNLM,"[acne vulgaris, intense pulsed light, light, p...",
6,29152718,NLM,MEDLINE,20180807,20180807,"[1365, 0011]",57,1,2018 Jan,"An extension of a multicenter, randomized, spl...",...,2017/11/21 06:00,2018/08/08 06:00,2017/11/21 06:00,"[2016/12/21 00:00 [received], 2017/08/17 00:00...",10.1111/ijd.13814 [doi],ppublish,Int J Dermatol. 2018 Jan;57(1):94,,,
7,10809858,NLM,MEDLINE,20000629,20220316,"[0007, 0007]",142,5,2000 May,Phototherapy with blue (415 nm) and red (660 n...,...,2000/05/16 09:00,2000/07/06 11:00,2000/05/16 09:00,"[2000/05/16 09:00 [pubmed], 2000/07/06 11:00 [...","[bjd3481 [pii], 10.1046/j.1365]",ppublish,Br J Dermatol. 2000 May;142(5):973,,,
8,18664153,NLM,MEDLINE,20080926,20220311,"[1545, 1545]",7,7,2008 Jul,Phototherapy in the treatment of acne vulgaris.,...,2008/07/31 09:00,2008/09/27 09:00,2008/07/31 09:00,"[2008/07/31 09:00 [pubmed], 2008/09/27 09:00 [...",,ppublish,J Drugs Dermatol. 2008 Jul;7(7):627,,,
9,15379878,NLM,MEDLINE,20050201,20220317,"[0905, 0905]",20,5,2004 Oct,Blue light phototherapy in the treatment of acne.,...,2004/09/24 05:00,2005/02/03 09:00,2004/09/24 05:00,"[2004/09/24 05:00 [pubmed], 2005/02/03 09:00 [...","[PPP109 [pii], 10.1111/j.1600]",ppublish,Photodermatol Photoimmunol Photomed. 2004 Oct;...,,,


In reality,we would want to extract terms from the title and abstracts. We can do this by tokenising the text and counting the frequency of each word. We can then extract the most common words and find synonyms for them.

In [5]:
# Combine the title and abstracts
study_data = [study["TI"] for study in pubmed_studies] + [study["AB"] for study in pubmed_studies]

# Contains all the punctuation we want to remove
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# Count the frequency of each term
term_frequency = Counter()
for s in study_data:
    term_frequency.update(s.lower().translate(table).split())

# Show the results of the top 100 most commonly occuring terms
pd.DataFrame(term_frequency.most_common(100), columns=["Term", "Frequency"])

Unnamed: 0,Term,Frequency
0,the,160
1,of,128
2,and,101
3,a,65
4,in,64
...,...,...
95,well,6
96,conclusions,6
97,use,6
98,their,6


We can also do a bit better, by looking at phrases of two to three words.

In [6]:
# Count the frequency of each term
phrase_frequency = Counter()
for s in study_data:
    s = s.lower().translate(table).split()
    for i in range(len(s) - 1):
        phrase_frequency.update([" ".join(s[i:i + 2])])
    for i in range(len(s) - 2):
        phrase_frequency.update([" ".join(s[i:i + 3])])

# Show the results of the top 100 most commonly occuring phrases
pd.DataFrame(phrase_frequency.most_common(100), columns=["Phrase", "Frequency"])

Unnamed: 0,Phrase,Frequency
0,blue light,32
1,in the,27
2,the treatment,25
3,acne vulgaris,18
4,treatment of,17
...,...,...
95,light device,4
96,severe acne,4
97,6 weeks,4
98,weeks of,4


We can do even better by filtering out common function words.

Let's just manually curate a list of terms we don't care about.

In [7]:
stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself",
             "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",
             "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this",
             "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "the", "of", "and",
             "a", "to", "in", "for", "at", "with", "an", "on"]

Now, we can apply these "stopwords" to the frequency data we created.

In [8]:
no_stopwords_frequency = Counter()
for term, count in (phrase_frequency + term_frequency).items():
    if all([word not in stopwords and str.isalpha(word) for word in term.split()]):
        no_stopwords_frequency[term] = count
for term, count in no_stopwords_frequency.most_common(100):
    print(count, term)

55 treatment
55 acne
54 light
43 blue
32 blue light
31 patients
24 led
22 phototherapy
21 weeks
19 moderate
18 acne vulgaris
18 vulgaris
18 as
17 trial
17 inflammatory
17 lesions
16 week
15 treated
14 after
13 group
12 study
12 efficacy
12 red
12 severe
12 results
12 by
11 safety
11 main
10 main trial
10 randomized
10 effective
10 using
10 patient
10 lesion
10 or
9 light phototherapy
9 red light
9 mild
9 methods
9 one
9 mean
9 ipl
8 benzoyl peroxide
8 benzoyl
8 peroxide
8 hemiface
8 iga
8 baseline
8 reduction
8 rate
8 no
8 facial
8 improvement
8 p
8 irradiation
7 blue light phototherapy
7 biophotonic system
7 nm
7 evaluate
7 grade
7 clinical
7 biophotonic
7 system
7 device
7 randomly
7 twice
7 not
7 adverse
7 demonstrated
7 therapy
7 control
7 achieved
7 side
7 sessions
6 inflammatory acne
6 background
6 face
6 assessment
6 well
6 conclusions
6 use
6 treatments
6 have
5 photo converter
5 converter chromophores
5 lesion counts
5 baseline iga
5 iga grade
5 success rate
5 photo converter 

## Finding MeSH Terms

Now that we have our list of the most common terms and phrases from our studies, we can also use the MeshMate API to find MeSH terms for these terms. 

The MeshMate API will take the term frequency data we just extracted, group similar terms together, and then suggest MeSH terms for these groups of terms.

In [9]:
mesh_response = requests.get(
    url="http://13.54.37.195:5000/api/v1/resources/mesh",
    params={
        "term": "$".join([item[0] for item in no_stopwords_frequency.most_common(100)]),
        "type": "Semantic"
    }
).json()

We can now directly visualise the suggested MeSH terms and how the MeshMate API has grouped the terms.

In [10]:
suggested_mesh_terms = []
for suggestions in mesh_response["Data"]:
    suggested_mesh_terms.append((suggestions["Keywords"], list(suggestions["MeSH_Terms"].values())))
pd.DataFrame(suggested_mesh_terms, columns=["Terms", "MeSH Terms"])

Unnamed: 0,Terms,MeSH Terms
0,"[acne, acne vulgaris, inflammatory acne, acne ...","[Acne Vulgaris, Administration, Topical, Psori..."
1,"[light, blue light, light phototherapy, light ...","[Light, Photochemotherapy, Ultraviolet Rays, P..."
2,"[trial, main trial, randomized]","[Randomized Controlled Trials as Topic, Double..."
3,"[lesions, lesion]","[Magnetic Resonance Imaging, Biopsy, Tomograph..."
4,"[benzoyl peroxide, benzoyl, peroxide]","[Hydrogen Peroxide, Peroxides, Benzoyl Peroxid..."
...,...,...
67,[have],"[United States, United Kingdom, Societies, Med..."
68,[lesion counts],"[CD4 Lymphocyte Count, Lymphocyte Count, Disea..."
69,[baseline iga],"[Glomerulonephritis, IGA, Immunoglobulin G, Im..."
70,[extension],"[Range of Motion, Articular, Biomechanical Phe..."


## Summary

In this notebook, we showed how the PubMed API can be used to perform a term frequency analysis by analysing the text of the retrieved documents and extracting common words and synonyms. We then used the MeshMate API to group the terms and suggest appropriate MeSH terms. The results could be used to assist with search strategy development.

---
[top](https://github.com/hscells/apis-for-evidence-identification#table-of-contents)<br/>
[previous: Searching ClinicalTrials.gov](https://github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-2-searching-clinicaltrials-gov.ipynb)<br/>
[next: Replicating Yale MeSH Analyzer](https://github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-4-replicating-yale-mesh-analyzer.ipynb)<br/>
