## Developing a training dataset for Topic Category classifications

MeSH to EDAM Topic mappings can be found here: https://bioportal.bioontology.org/mappings/EDAM?target=https%3A%2F%2Fdata.bioontology.org%2Fontologies%2FMESH

There does not appear to be a clean way to pull this data programmatically, so we copy the mappings manually from the website into a tab-delimited file and work from that.
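Once the mappings are saved, they can be loaded into a lookup table. The following is a minimal sketch, assuming the pasted file has two columns named `mesh_label` and `edam_topic`. Both column names and the sample rows are hypothetical, so adjust them to whatever the copied table actually contains.

```python
import io
import pandas as pd

# Hypothetical stand-in for the manually pasted tab-delimited file;
# in practice, pass the file path to pd.read_csv instead of a StringIO.
sample_tsv = io.StringIO(
    "mesh_label\tedam_topic\n"
    "Genomics\thttp://edamontology.org/topic_0622\n"
    "Pathology\thttp://edamontology.org/topic_0634\n"
)

mesh_to_edam_df = pd.read_csv(sample_tsv, sep="\t", header=0)

# Build a case-insensitive lookup from MeSH label to EDAM Topic IRI.
mesh_to_edam = {
    label.strip().lower(): topic
    for label, topic in zip(mesh_to_edam_df["mesh_label"], mesh_to_edam_df["edam_topic"])
}

print(mesh_to_edam.get("pathology"))
```

Lower-casing the keys is a design choice for this sketch, so that lookups do not depend on how the labels were capitalized when they were pasted.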

To develop the training dataset:
1. Pull all MeSH terms associated with a dataset, via the dataset's citation PMID
2. Map MeSH terms to EDAM Topics

If the training dataset is not comprehensive enough, consider:
1. Pull the MeSH mappings of EDAM Topics
2. For each mapping, pull 500 titles and abstracts from PubMed and use them as the training data
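The fallback above could be sketched with Biopython's `Entrez.esearch` and `Entrez.efetch`. The helper names `build_mesh_query` and `fetch_titles_abstracts` are hypothetical, and the sketch assumes the MeSH label can be searched directly with PubMed's `[MeSH Terms]` field tag. Running the fetch requires network access and a valid contact email.

```python
def build_mesh_query(mesh_label):
    """Build a PubMed query restricted to the MeSH Terms field (hypothetical helper)."""
    return f'"{mesh_label}"[MeSH Terms]'


def fetch_titles_abstracts(mesh_label, email, max_records=500):
    """Return up to max_records (pmid, title, abstract) tuples for one MeSH label."""
    # Imported here so the query helper above can be used without Biopython.
    from Bio import Entrez, Medline

    Entrez.email = email
    search_handle = Entrez.esearch(
        db="pubmed", term=build_mesh_query(mesh_label), retmax=max_records
    )
    pmids = Entrez.read(search_handle)["IdList"]
    search_handle.close()
    if not pmids:
        return []

    fetch_handle = Entrez.efetch(
        db="pubmed", id=",".join(pmids), rettype="medline", retmode="text"
    )
    rows = []
    for record in Medline.parse(fetch_handle):
        abstract = record.get("AB", "")
        if abstract:  # skip records that have no abstract
            rows.append((record.get("PMID", ""), record.get("TI", ""), abstract))
    fetch_handle.close()
    return rows


print(build_mesh_query("Pathology"))
```

Because records without an abstract are skipped, a call may return fewer than 500 rows. Raising `max_records` slightly would be one way to compensate.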

In [1]:
from Bio import Entrez
from Bio import Medline
import requests
import pandas as pd
import text2term






In [5]:
Entrez.email = "your email here"

In [2]:
citation_file = 'data/citation_df_clean.tsv'
citationdf = pd.read_csv(citation_file, delimiter='\t',header=0,index_col=0)
print(citationdf.head(n=2))

                   _id                                        description  \
0  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
1   OMICSDI_PRJNA74531  Streptococcus agalactiae STIR-CD-17 Genome seq...   

                                                name      pmid  
0  Alveolar epithelial glycocalyx degradation med...  34874923  
1                Streptococcus agalactiae STIR-CD-17  23105075  




In [3]:
test_pmid = citationdf.iloc[0]['pmid']
print(test_pmid)

34874923


In [6]:
handle = Entrez.efetch(db="pubmed", id=test_pmid, rettype="medline", retmode="text")
records = Medline.parse(handle)  # parse the MEDLINE-formatted record(s) returned for that PMID
for record in records:
    MESHSet = record.get("MH", [])  # list of MeSH headings; empty if the record has none
handle.close()
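Fetching one PMID per request becomes slow for a long citation list. Since `Entrez.efetch` accepts a comma-separated string of IDs, the requests could be batched. This is a rough sketch: the helper names and the batch size of 200 are assumptions, not part of the original workflow.

```python
def chunk_pmids(pmids, batch_size=200):
    """Yield successive lists of at most batch_size PMIDs as strings (hypothetical helper)."""
    pmids = [str(p) for p in pmids]
    for start in range(0, len(pmids), batch_size):
        yield pmids[start:start + batch_size]


def fetch_mesh_in_batches(pmids, batch_size=200):
    """Return {pmid: [MeSH headings]} using one efetch call per batch."""
    # Requires network access and Entrez.email to be set beforehand.
    from Bio import Entrez, Medline

    mesh_by_pmid = {}
    for batch in chunk_pmids(pmids, batch_size):
        handle = Entrez.efetch(
            db="pubmed", id=",".join(batch), rettype="medline", retmode="text"
        )
        for record in Medline.parse(handle):
            mesh_by_pmid[record.get("PMID")] = record.get("MH", [])
        handle.close()
    return mesh_by_pmid


# The two PMIDs shown in the citation table, split into batches of one.
batches = list(chunk_pmids([34874923, 23105075], batch_size=1))
print(batches)
```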
    

In [7]:
text2term.cache_ontology("https://data.bioontology.org/ontologies/EDAMT/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=rdf", "EDAMT")

2023-10-12 11:11:11 INFO [text2term.term_collector]: Loading ontology https://data.bioontology.org/ontologies/EDAMT/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=rdf...
2023-10-12 11:11:12 INFO [text2term.term_collector]: ...done (ontology loading time: 0.83s)
2023-10-12 11:11:12 INFO [text2term.term_collector]: Collecting ontology term details...
2023-10-12 11:11:13 INFO [text2term.term_collector]: ...done: collected 272 ontology terms (collection time: 1.22s)


<text2term.onto_cache.OntologyCache at 0x19704c33fd0>

In [8]:
df1 = text2term.map_terms(["asthma", "acute bronchitis"], "EDAMT", use_cache=True)
print(df1)

2023-10-12 11:11:33 INFO [text2term.tfidf_mapper]: Mapping 2 source terms...
2023-10-12 11:11:33 INFO [text2term.tfidf_mapper]: ...against 226 ontology terms (363 labels/synonyms)
2023-10-12 11:11:44 INFO [text2term.tfidf_mapper]: ...done (mapping time: 11.45s seconds)
                               Source Term ID       Source Term  \
0  http://ccb.hms.harvard.edu/t2t/R4Kdt82v5Ef  acute bronchitis   

        Mapped Term Label Mapped Term CURIE  \
0  Critical care medicine   EDAM.TOPIC:3403   

                      Mapped Term IRI  Mapping Score  Tags  
0  http://edamontology.org/topic_3403          0.462  None  


In [9]:
meshtest = [x.replace('/',',') for x in MESHSet]
dftest = text2term.map_terms(meshtest,"EDAMT", use_cache=True)
print(dftest)

2023-10-12 11:15:30 INFO [text2term.tfidf_mapper]: Mapping 17 source terms...
2023-10-12 11:15:30 INFO [text2term.tfidf_mapper]: ...against 226 ontology terms (363 labels/synonyms)
2023-10-12 11:15:30 INFO [text2term.tfidf_mapper]: ...done (mapping time: 0.06s seconds)
                                Source Term ID  \
0   http://ccb.hms.harvard.edu/t2t/R84GM5W24Ap   
1   http://ccb.hms.harvard.edu/t2t/R84GM5W24Ap   
2   http://ccb.hms.harvard.edu/t2t/R84GM5W24Ap   
3   http://ccb.hms.harvard.edu/t2t/R9gTWbnWbyz   
4   http://ccb.hms.harvard.edu/t2t/RF6HnvUi6Wa   
5   http://ccb.hms.harvard.edu/t2t/R5HU4YdZkbe   
6   http://ccb.hms.harvard.edu/t2t/R5HU4YdZkbe   
7   http://ccb.hms.harvard.edu/t2t/R5HU4YdZkbe   
8   http://ccb.hms.harvard.edu/t2t/R7dSudQvDDQ   
9   http://ccb.hms.harvard.edu/t2t/R7dSudQvDDQ   
10  http://ccb.hms.harvard.edu/t2t/R7dSudQvDDQ   
11  http://ccb.hms.harvard.edu/t2t/RAqRerDDpmi   
12  http://ccb.hms.harvard.edu/t2t/RAqRerDDpmi   
13  http://ccb.hms.harvard.edu

In [12]:
print(dftest.head(n=1))

                               Source Term ID  \
0  http://ccb.hms.harvard.edu/t2t/R84GM5W24Ap   

                                      Source Term Mapped Term Label  \
0  Alveolar Epithelial Cells,metabolism,pathology         Pathology   

  Mapped Term CURIE                     Mapped Term IRI  Mapping Score  Tags  
0   EDAM.TOPIC:0634  http://edamontology.org/topic_0634          0.447  None  
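In the row above, the combined string maps to "Pathology", which appears to be driven by the qualifier rather than the descriptor. One option is to split each MEDLINE `MH` entry into its descriptor and qualifiers before mapping. The sketch below assumes the usual MEDLINE layout, where `/` separates qualifiers and `*` flags a major topic. The helper name and the sample headings are illustrative only.

```python
def split_mesh_heading(heading):
    """Split a MEDLINE MH entry into (descriptor, [qualifiers])."""
    # '*' marks a major topic in MEDLINE output; it is not part of the label text.
    parts = [part.replace("*", "").strip() for part in heading.split("/")]
    return parts[0], parts[1:]


# Illustrative headings, not copied from a fetched record.
headings = ["Alveolar Epithelial Cells/metabolism/*pathology", "Humans"]
descriptors = [split_mesh_heading(h)[0] for h in headings]
print(descriptors)
```

Passing only the descriptors to `text2term.map_terms` would be one way to check whether qualifiers are skewing the mappings; whether that improves the results here has not been tested.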


In [None]:
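import datetime

def retrieve_mesh_by_pmids(PMIDList):
    """Fetch the MeSH headings for each PMID and map them to EDAM Topics."""
    print(datetime.datetime.now().time())
    columns = ['pmid','Source Term ID','Source Term','Mapped Term Label',
               'Mapped Term CURIE','Mapped Term IRI','Mapping Score','Tags']
    mapped = []     # one mapping DataFrame per record
    PMIDFails = []  # PMIDs that could not be fetched or had no MeSH headings
    for PMID in PMIDList: #iterates through the PMID list
        try:
            handle = Entrez.efetch(db="pubmed", id=str(PMID), rettype="medline", retmode="text")
            records = Medline.parse(handle) # parses the MEDLINE record(s) for that PMID
            for record in records:
                meshset = record.get("MH", []) # MeSH headings; empty list if none
                if not meshset:
                    PMIDFails.append(PMID)
                    continue
                tempmesh = [x.replace('/',',') for x in meshset]
                tempdf = text2term.map_terms(tempmesh,"EDAMT", use_cache=True)
                tempdf['pmid'] = PMID
                mapped.append(tempdf)
            handle.close()
        except Exception as err:
            PMIDFails.append(PMID)
            print("pmid not found: ", PMID, err)

    # Combine the per-record mappings into a single table
    if mapped:
        meshdf = pd.concat(mapped, ignore_index=True).reindex(columns=columns)
    else:
        meshdf = pd.DataFrame(columns=columns)
    print(datetime.datetime.now().time())
    return meshdf, PMIDFails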
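To turn a table of per-term mappings into training labels, the mappings could be filtered by score and collapsed to one list of EDAM Topics per PMID. This is a minimal sketch: the function name `mappings_to_labels`, the `edam_topics` column, and the 0.5 score cutoff are all assumptions, and the example rows are made up for illustration.

```python
import pandas as pd


def mappings_to_labels(meshdf, min_score=0.5):
    """Collapse per-term mappings into one sorted list of EDAM Topic labels per PMID."""
    kept = meshdf[meshdf["Mapping Score"] >= min_score]
    return (
        kept.groupby("pmid")["Mapped Term Label"]
        .apply(lambda labels: sorted(set(labels)))
        .reset_index(name="edam_topics")
    )


# Illustrative rows only; the scores and label assignments are invented.
example = pd.DataFrame({
    "pmid": [34874923, 34874923, 23105075],
    "Mapped Term Label": ["Pathology", "Critical care medicine", "Genomics"],
    "Mapping Score": [0.447, 0.9, 0.8],
})
labels = mappings_to_labels(example, min_score=0.5)
print(labels)
```

A PMID whose mappings all fall below the cutoff drops out of the result entirely, so the cutoff trades label quality against dataset coverage.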