_Adapted from the notebook used by [nbclab.github.io](https://nbclab.github.io)._

# Retrieve new publications from PubMed 

This notebook searches for and retrieves the latest publications by Dr. Khan using Biopython's PubMed search tools. A publication-specific Markdown file is generated for each unique paper, with many elements set up automatically. As noted in the original notebook, you should generally check that the link to the Markdown file exists. Preprints cannot be found via this method, though they can be added manually. The process is automated and runs monthly using GitHub Actions.

## Steps (via GitHub or manual)

1. Run this notebook.
2. If any new papers were grabbed, check the following:
    1. The paper has the lab PI as an author. Ensure that it isn't by *another* AR Khan.
    2. The paper is not a duplicate of a preprint or another version of the paper. If so, merge the two versions.
3. Save the changes to the notebook.
4. Push changes to the notebook and affected files to GitHub.
5. Open a pull request to khanlab/khanlab.github.io.
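
The duplicate check in step 2 can be partly mechanized. The following is a minimal sketch, not part of the original notebook: it groups papers whose titles match after lowercasing and stripping punctuation. The helper names `normalize_title` and `find_duplicate_titles` are hypothetical, and the example titles are made up for illustration. A preprint whose title changed at publication would not be caught, so a manual check is still needed.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase a title and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_duplicate_titles(papers):
    """Group PMIDs that share a normalized title.

    `papers` is an iterable of (pmid, title) pairs. Returns a list of
    PMID lists, one per title that appears more than once.
    """
    groups = defaultdict(list)
    for pmid, title in papers:
        groups[normalize_title(title)].append(pmid)
    return [pmids for pmids in groups.values() if len(pmids) > 1]


# Made-up example records (not real publications)
example_papers = [
    ("1", "Deep Brain Mapping."),
    ("2", "deep brain-mapping"),
    ("3", "An Unrelated Title"),
]
duplicates = find_duplicate_titles(example_papers)
print(duplicates)
```

In the notebook, the `(pmid, title)` pairs could be taken from the `pmid` and `title` columns of the publications table.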

In [1]:
# Libraries
import re
from glob import glob

from Bio import Entrez, Medline
from datetime import datetime
from dateutil import parser
import pandas as pd

In [2]:
# First count number of articles from previous grab
df = pd.read_csv("_data/publications/publications.csv")
old_count = len(df)

In [3]:
# Only grab papers from after the lab PI came to UWO
search_criteria = ['''"Khan AR"[AUTH] AND ("2015/01/01"[PDAT] : "3000/12/31"[PDAT]) AND
                    ("Western University"[AFFL] OR "University of Western Ontario"[AFFL] OR
                     "Brain and Mind Institute"[AFFL] OR "Robarts Research Institute"[AFFL])''']

# NCBI asks for a contact email with Entrez requests; set yours here before running
Entrez.email = ''
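
The search term is written as one long string literal, which is easy to break when adding an affiliation. As a minimal sketch, the same query could be assembled from its parts. `build_pubmed_query` is a hypothetical helper, not a Biopython function; it only uses the `[AUTH]`, `[PDAT]`, and `[AFFL]` field tags that already appear in the query above.

```python
def build_pubmed_query(author, start_date, affiliations, end_date="3000/12/31"):
    """Assemble a PubMed query string from an author, date range, and affiliations.

    Hypothetical helper for illustration; it performs no network access.
    """
    affil_clause = " OR ".join(f'"{name}"[AFFL]' for name in affiliations)
    return (
        f'"{author}"[AUTH] AND '
        f'("{start_date}"[PDAT] : "{end_date}"[PDAT]) AND '
        f"({affil_clause})"
    )


query = build_pubmed_query(
    "Khan AR",
    "2015/01/01",
    [
        "Western University",
        "University of Western Ontario",
        "Brain and Mind Institute",
        "Robarts Research Institute",
    ],
)
print(query)
```

The resulting string could then be used in place of the literal, for example `search_criteria = [query]`.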

In [7]:
rows = []

for TERM in search_criteria:
    search = Entrez.esearch(db="pubmed", retmax="2", term=TERM)
    result = Entrez.read(search)
    print(f"Total number of publications containing {TERM}: {result['Count']}")
    
    search_all = Entrez.esearch(db="pubmed", term=TERM, retmax=result["Count"])
    result_all = Entrez.read(search_all)
    ids_all = result_all['IdList']
    pubs_all = Entrez.efetch(db="pubmed", id=ids_all, rettype='medline', retmode='text')
    records = Medline.parse(pubs_all)
    
    acceptable_formats = ["journal article", "comparative study", "editorial"]
    
    for record in records:
        # Records without a publication type ('PT') are treated as not acceptable
        if any(type_.lower() in acceptable_formats for type_ in record.get('PT', [])):
            pmid = record.get("PMID")
            pmcid = record.get("PMC", "")
            
            doi = [aid for aid in record.get("AID", []) if aid.endswith(" [doi]")]
            if doi:
                doi = doi[0].replace(" [doi]", "")
            else:
                doi = ""
            
            title = record.get("TI", "").rstrip(".")
            authors = record.get("AU")
            
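            # Allow the cell to continue if the date cannot be parsed: skip the
            # record so a stale or undefined pub_date is never written to a row.
            try:
                pub_date = parser.parse(record.get("DP"))
            except (TypeError, ValueError, OverflowError):
                print(f"Could not parse publication date for PMID {pmid}; skipping.")
                continue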
            journal = record.get('TA')
            volume = record.get('VI', '')
            issue = record.get('IP', '')
            pages = record.get('PG', '')
            
            row = [pmid, pmcid, doi, title, authors, pub_date.year, pub_date.month,
                   pub_date.day, journal, volume, issue, pages]
            rows.append(row)
            
# Save all relevant info from articles to a csv.
print("Saving identified publications to csv...")
df = pd.DataFrame(columns=['pmid', 'pmcid', 'doi', 'title', 'authors',
                           'year', 'month', 'day',
                           'journal', 'volume', 'issue', 'pages'],
                  data=rows)
df = df.sort_values(by=['year', 'month', 'day'], ascending=False)
df.to_csv('_data/publications/publications.csv', index=False)
df = df.fillna('')

Total number of publications containing "Khan AR"[AUTH] AND ("2015/01/01"[PDAT] : "3000/12/31"[PDAT]) AND
                    ("Western University"[AFFL] OR "University of Western Ontario"[AFFL] OR
                     "Brain and Mind Institute"[AFFL] OR "Robarts Research Institute"[AFFL]): 41
Saving identified publications to csv...
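
The date step in the cell above depends on `dateutil` accepting whatever appears in the MEDLINE `DP` field. A more conservative alternative is sketched below, assuming that `DP` values start with a four-digit year, optionally followed by a three-letter English month abbreviation and a day (for example `2021 Jan 5`, `2021 Jan-Feb`, or `2021`). `parse_medline_dp` is a hypothetical helper, not part of Biopython or `dateutil`; it falls back to month 1 and day 1 when those parts are missing or unrecognized.

```python
import re

# Map three-letter month abbreviations to month numbers
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

DP_PATTERN = re.compile(r"(\d{4})(?:\s+([A-Za-z]{3}))?(?:\s+(\d{1,2}))?")


def parse_medline_dp(dp):
    """Return (year, month, day) from a DP string, or None if no year is found.

    Month and day default to 1 when absent or unrecognized (an assumption
    made for sorting purposes, not a statement about the real date).
    """
    match = DP_PATTERN.match(dp or "")
    if match is None:
        return None
    year = int(match.group(1))
    month = MONTHS.get(match.group(2), 1)
    day = int(match.group(3)) if match.group(3) else 1
    return year, month, day


for example in ["2021 Jan 5", "2021 Jan-Feb", "2019", None]:
    print(example, "->", parse_medline_dp(example))
```

Returning `None` instead of raising lets the caller decide whether to skip the record or store it without a date.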


In [5]:
# Publications to skip (for example, papers by a different author with the same initials).
# PMIDs from Medline records are strings, so list them as strings here.
skip_pmids = []

# Drop any skipped papers from the results.
df = df[~df['pmid'].isin(skip_pmids)]

print(f"{len(df)} total articles found.")
print(f"{len(df) - old_count} new articles found.")

39 total articles found.
0 new articles found.
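
The "new articles" figure above is a difference of row counts, so it can read zero, or even go negative, if one paper was added while another dropped out of the search results. A minimal sketch of an alternative is to compare the PMIDs themselves. `find_new_pmids` is a hypothetical helper, and the PMIDs in the example are made up. Values are normalized to strings because PMIDs read back from a CSV may be parsed as integers, while Medline records return them as strings.

```python
def find_new_pmids(old_pmids, current_pmids):
    """Return PMIDs in `current_pmids` that are absent from `old_pmids`.

    Both inputs are normalized to strings; the order of `current_pmids`
    is preserved in the result.
    """
    old = {str(pmid) for pmid in old_pmids}
    return [str(pmid) for pmid in current_pmids if str(pmid) not in old]


# Made-up PMIDs for illustration
previous_grab = [111, 222, 333]          # e.g. integers read from the old CSV
current_grab = ["222", "333", "444"]     # e.g. strings from Medline records
new_pmids = find_new_pmids(previous_grab, current_grab)
print(f"{len(new_pmids)} new articles found: {new_pmids}")
```

In the notebook, `old_pmids` would need to be captured from the existing CSV before it is overwritten, in the same cell that records `old_count`.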
