# Scrape PubMed to Fast-Track Literature Searches
## Let's say you are a med student tasked with doing a lit review, or a scientist interested in certain topics. You could use any of the many free tools available (PubMed, Google Scholar, Elicit, etc.), but if you are lazy and want to automate some of the hard work, you could also build a web scraper in Python that will expedite this process for you.
## Here we will go through the construction of a very basic web scraper that digs through PubMed and automatically returns articles along with some metadata about them. Our topic will be AI in medical education.

### Install dependencies. Biopython is a set of Python tools for biological computation, and you import it using `Bio`. Entrez is NCBI's search system that provides access to many biomedical databases.

In [7]:
%pip install biopython

from Bio import Entrez
import pandas as pd
import numpy as np
import sys



Note: you may need to restart the kernel to use updated packages.


### For this project we will be using functions. Functions are a tool in Python to neatly package code that you will reuse over and over again. We will be making a `search` function to probe the database and a `fetch_details` function to retrieve metadata about the identified articles. There is information about the code in commented text, and we also provide explanations above each block. Functions are great tools in Python because they allow you to build blocks of code that are versatile and can be used many times. Say you have a bunch of lists you need to iterate over: instead of writing a ton of for loops, you could make one `iterate` function and then call it each time you need it. Functions are a power tool in the computer scientist's toolkit, and for the duration of Code Grand Rounds (and in your own work) you will see them everywhere. It is always good to try to write 'clean' code, and functions are a great step towards doing that.
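
### To make the 'one function instead of many loops' idea concrete, here is a minimal sketch. The name `iterate` and the sample lists are made up purely for illustration; they are not part of the scraper we build below.

```python
# Hypothetical example: one reusable function instead of several copy-pasted loops
def iterate(items):
    """Print every element of a list and return how many were printed."""
    count = 0
    for item in items:
        print(item)
        count += 1
    return count

# Two different lists, one function call each
topics = ['AI', 'medical education', 'PubMed']
years = [2021, 2022, 2023]

n_topics = iterate(topics)
n_years = iterate(years)
```

### The loop is written once inside `iterate`, and every later call reuses it. The `return` line is what hands the count back to the code that called the function.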

### That being said, as we are building functions now and calling them later the workflow of the code might not make complete sense until you see it all in action, so just hang in there and try to understand what the functions are doing for now.
### We will start with building our search function which will help us send some parameters to the database and retrieve a list of relevant articles.

### `def search(query, min_date, max_date, number):` This line declares our function named `search`, which takes the four parameters explained below. Remember that you declare functions in Python using the `def` keyword.

- ### 'query': The topic or keywords we're interested in.
- ### 'min_date': The earliest publication date we want in our results.
- ### 'max_date': The latest publication date for our search.
- ### 'number': How many articles do we want to fetch?

### We then set up the search using `search_setup = Entrez.esearch` from the Entrez module. It's a little like walking up to a librarian (in this case, PubMed's database) and handing them a note with what we're looking for. All of the parameters are used within this `esearch` call, and we have extensively commented the code in case you get confused.
### The search function returns the search results, which include the list of articles (PubMed IDs) that were identified! Remember that to get data back from a function we have to use the `return` keyword followed by whatever we want to return.


In [8]:
# query- keywords you want to search
# min_date- earliest publication date to include in the search
# max_date- latest publication date to include in the search
# number- number of articles to return
def search(query, min_date, max_date, number):
    search_setup = Entrez.esearch(
            # db: the database we want to search. Find more about the various database options in table 1 of 'A General Introduction to the E-utilities' paper
            db='pubmed',
            # An email is required in case something goes wrong
            email='gnahas2@uic.edu',
            # This decides how you want to sort the identified articles
            sort='relevance',
            # maximum number of PubMed IDs to return
            retmax=number,
            # apply the date range below to the publication date
            datetype='pdat',
            mindate=min_date,
            maxdate=max_date,
            # the keywords we are searching for
            term=query)
    # parse the response into a dictionary-like object, then close the connection
    results = Entrez.read(search_setup)
    search_setup.close()
    return results
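
### If you are curious what comes back from `search`, the sketch below uses a plain dictionary as a stand-in so it runs without contacting PubMed. The keys mirror ones that appear in a typical search result, but the values and IDs here are placeholders we invented, and the real object is a Biopython data type rather than a plain `dict`.

```python
# Stand-in for the object returned by search(); all values are placeholders
mock_results = {
    'Count': '137',      # total number of matches in the database (stored as text)
    'RetMax': '2',       # how many IDs were actually returned
    'RetStart': '0',
    'IdList': ['11111111', '22222222'],
}

# The list of PubMed IDs lives under the 'IdList' key
id_list = mock_results['IdList']

# Numbers arrive as strings, so convert before doing arithmetic with them
total_hits = int(mock_results['Count'])

print(f'{total_hits} matches, {len(id_list)} IDs returned: {id_list}')
```

### Notice that the total number of matches can be larger than the number of IDs returned, because `retmax` caps how many IDs come back.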

### Now that we have identified the list of PubMed IDs that are relevant to our query, we want to retrieve the metadata associated with these articles, such as authors, titles, keywords, etc. We will build a function `fetch_details` to handle this for us. It will take as input the list of PubMed IDs we obtained from the `search` function and return all of this metadata in a dictionary-style format.

### Even though this is a more advanced Python algorithm, you will see how basic concepts in Python keep popping up. We have a list of PubMed IDs, but the database expects a single string with the IDs separated by commas. We can easily generate this from a list using the `join` method on our input list! This function will return the metadata in a data type specific to Biopython, which we will talk about soon!
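
### Here is a quick sketch of what `join` does, using placeholder IDs rather than real PubMed IDs.

```python
# Placeholder PubMed IDs stored as strings
id_list = ['11111111', '22222222', '33333333']

# Glue the list into one string, putting a comma between each element
ids = ','.join(id_list)
print(ids)  # 11111111,22222222,33333333

# join only works on strings, so integer IDs need converting first
int_ids = [11111111, 22222222]
ids_from_ints = ','.join(str(i) for i in int_ids)
print(ids_from_ints)  # 11111111,22222222
```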

In [9]:
# id_list- list of pubmed IDs that were identified earlier
def fetch_details(id_list):
    # make a new single string of all the ids separated by a comma in id_list to submit
    ids = ','.join(id_list)
    fetch_setup = Entrez.efetch(
            # db: the database we want to search. Find more about the various database options in table 1 of 'A General Introduction to the E-utilities' paper
            db='pubmed',
            # An email is required in case something goes wrong
            email='gnahas2@uic.edu',
            # ids we will be searching for in the database
            id=ids)
    # this stores the metadata of the pubmed IDs in a pretty messy format,
    # but one that is accessible to the computer! Read below to see how to extract it
    results = Entrez.read(fetch_setup)
    return results



## A Brief Detour

### The next functions we will build will be used to extract the data returned by our `fetch_details` function. Before we go too much into the details of the functions, we will first talk about how Biopython and Entrez return data from the database to us as users.

### First, it's important to understand the context. Biopython is a set of tools and libraries for computational biology and bioinformatics in Python. One of its many functionalities is to provide ways to interface with biological databases, such as those provided by the NCBI (National Center for Biotechnology Information).

### Entrez is NCBI's database system, and Biopython provides a module, Bio.Entrez (which we imported at the top of this notebook), to interface with it. When you make a query to Entrez using Biopython's tools, the results are often returned in a structured format that you can then manipulate with Python.

### The DictionaryElement class in Bio.Entrez.Parser represents a specific way that Biopython structures the data it gets back from Entrez. In simpler terms, when you fetch data from Entrez using Biopython, you often get back complex nested structures of data. These structures can include lists, dictionaries, and other custom data types that Biopython provides to make it easier to navigate the data. One of these custom data types is DictionaryElement.

### When you encounter a DictionaryElement, you can typically interact with it as you would with a standard Python dictionary, using keys to access values, iterating over its items, etc. However, if there are additional methods or attributes provided by the DictionaryElement class, you would need to refer to Biopython's documentation or utilize Python's introspection tools (like the dir() function) to understand and leverage them.
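
### The sketch below shows what 'interacting with it like a dictionary' looks like. To keep it runnable without Biopython or a network connection, we use ordinary nested Python dictionaries as a stand-in; the layout loosely mimics a PubMed record, and every value is invented.

```python
# Plain-dict stand-in for the nested structure Biopython returns (values are invented)
record = {
    'PubmedArticle': [
        {
            'MedlineCitation': {
                'PMID': '11111111',
                'Article': {
                    'ArticleTitle': 'A placeholder title',
                    'AuthorList': [{'LastName': 'Doe', 'ForeName': 'Jane'}],
                },
            }
        }
    ]
}

# Walk down the tree one key at a time
first_article = record['PubmedArticle'][0]
medline = first_article['MedlineCitation']

# .keys() tells you which branches exist at the current level
print(list(medline.keys()))

# Chain keys to reach a nested value
title = medline['Article']['ArticleTitle']

# 'in' checks whether a key exists before you try to use it
has_abstract = 'Abstract' in medline['Article']
print(title, has_abstract)
```

### Printing the keys at each level is a handy way to explore an unfamiliar structure before writing any extraction code.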

### When it comes down to it, you will not encounter dictionary elements that often, but it is how the data from Biopython gets returned, and knowing that more complicated data structures such as this exist is important. If it doesn't make complete sense, that is *totally* OK! When you encounter weird data structures like this and get confused, the first step is to not panic, and the second is to try to get a conceptual idea of how the data is stored so you can figure out how to get what you want. Tools like ChatGPT and Stack Overflow are great resources for things like this!

## Extraction Functions

### OK, now let's build some of the extraction functions. As mentioned earlier, the point of this module is to continue introducing you to Python concepts, but also to get you used to looking at more complex code and reasoning through (with our help) how and why it works. Remember to add print statements if you are interested and check out what is going on under the hood.

### Let's start building our function to extract the authors. This function will take as input `medline`. You will see where we get this from in a bit, but in the DictionaryElement from Biopython, *all* of the data for each PubMed article starts at the root node with the key `PubmedArticle`. Within the `PubmedArticle` level, the author information is contained under the `MedlineCitation` key, inside its `Article` entry. So now we have descended down two branches into the land of `MedlineCitation`, and from here we can check whether the author data was properly stored in the database retrieval (just in case something went wrong...). If it was, we can iterate over each author and store their first and last name in a list. We also make sure they have both a first and a last name.

In [10]:
# Helper function to extract authors from a given article
def extract_authors(medline):
    # define a new list to store the author names in
    authors = []
    if 'AuthorList' in medline['Article']: # check if there is even an author list in the current article
        # if there is an author list, iterate through it so we can store the names
        for author in medline['Article']['AuthorList']:
            # use .get to retrieve the value for each key, falling back to '' if it is missing- remember this from Amino acids?
            last_name = author.get('LastName', '')
            fore_name = author.get('ForeName', '')
            # only keep authors that have both a last and a first name
            if last_name and fore_name:
                authors.append(last_name + ', ' + fore_name)
    # In case no authors found
    else:
        authors.append('Author(s) not found')
    # Return the authors as one string, separated by semicolons
    # (each name already contains a comma, so commas alone would be ambiguous)
    return '; '.join(authors)
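
### Here is a small sketch of why `.get` with a default value is useful. The author entries are invented, and we are assuming that some records may lack a `ForeName` or store only a group name; plain dictionaries stand in for Biopython's data type.

```python
# Invented author entries; not every entry has both name fields
authors = [
    {'LastName': 'Doe', 'ForeName': 'Jane'},
    {'LastName': 'Smith'},                      # no ForeName stored
    {'CollectiveName': 'Example Study Group'},  # a group rather than a person
]

names = []
for author in authors:
    # author['ForeName'] would raise a KeyError when the key is missing;
    # .get returns the default ('') instead, so the loop keeps running
    last = author.get('LastName', '')
    fore = author.get('ForeName', '')
    names.append((last, fore))

print(names)
```

### Empty strings count as 'false' in an `if` statement, which is what lets the function check whether a name was actually found.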

### The process is basically the same for the abstract. We will check to see if the article has an abstract, and if it does we will join all of its pieces and return them as a single string.

In [11]:
# Helper function to extract abstract from a given article
def extract_abstract(medline):
    if 'Abstract' in medline['Article']: 
        # the abstract is a list of strings. We want to join them into one string separating them by spaces
        return ' '.join(medline['Article']['Abstract']['AbstractText']) 
    else:
        return 'No abstract available' 

### Same thing for keyword extraction. The `KeywordList` sits directly under `MedlineCitation` rather than inside `Article`, so that is the level where the function looks for it. We used a list comprehension to store all the keywords in a list. It is good practice to convert this code to a normal for loop format to make sure you are really understanding list comprehensions!

In [12]:
# Helper function to extract keywords from a given article
# medline- the MedlineCitation section of an article, which is where KeywordList is stored
def extract_keywords(medline):
    if 'KeywordList' in medline and len(medline['KeywordList']) > 0:
        # if there are keywords, store them in a list and join the list with a comma (this is just a for loop, but it's called a list comprehension)
        return ', '.join([keyword for keyword in medline['KeywordList'][0]])
    else:
        return 'No keywords available' # return this if no keywords are found
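
### As a sketch of the 'convert it to a normal for loop' exercise, here are the two forms side by side. The keyword list is made up for illustration.

```python
# Made-up keywords standing in for the first entry of a KeywordList
keyword_list = ['artificial intelligence', 'medical education', 'curriculum']

# List comprehension: build the list in a single line
from_comprehension = [keyword for keyword in keyword_list]

# The same thing written as a normal for loop
from_loop = []
for keyword in keyword_list:
    from_loop.append(keyword)

# Both approaches give the same list, which we can then join into one string
joined = ', '.join(from_loop)
print(from_comprehension == from_loop)
print(joined)
```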

### Alright, this is the function we are going to call to kick off the whole chain. First we iterate through every article in our results using the key `PubmedArticle`, as this is the root node and gives us access to everything we need under it. After that we store all of the info under `MedlineCitation` to send to the helper functions we defined earlier, and we extract the PubMed ID using the `PMID` key, as well as the title and publication date, to write out later. All of this stuff is iterative, and we had to spend some time reading the documentation and googling when making this lesson in order to figure it all out, so if it is not immediately clear, do not worry! Iterative googling and reading is **very** much part of the process.

### We do want you to note a couple of things here. The first is that we call the other functions we defined earlier. It is kind of a meta idea that you can call functions within functions, but you totally can and should, as it makes your code easier to read and understand! Functions are really just little computational tools that you can build to make your life easier!

### We will store all of the data we gathered for each article in a nice dictionary, and collect those dictionaries in a list, so we can easily access the data and write it out later! Finally, we convert that list into something called a pandas `DataFrame` and return it.

In [13]:
# results- this is the data from the fetch_details function above!
def extract_info(results):
    parsed_articles = []
    for article in results['PubmedArticle']: 
        medline = article['MedlineCitation'] 
        # extract the pubmed ID using the key PMID
        pmid = medline['PMID']
        # Extract the article title (the title is nested under the article key so you have to do two layers of keys)
        title = medline['Article']['ArticleTitle']
        # Get the publication date and store it in a variable
        pub_date = medline['Article']['Journal']['JournalIssue']['PubDate']
        # get the authors from our helper function
        authors = extract_authors(medline)
        # get the abstract from our helper function
        abstract = extract_abstract(medline)
        # get the keywords from our helper function (KeywordList is stored under MedlineCitation)
        keywords = extract_keywords(medline)

        parsed_articles.append({
            'pmid': pmid,
            'title': title,
            'authors': authors,
            'pub_date': pub_date,
            'abstract': abstract,
            'keywords': keywords
        })

    return pd.DataFrame(parsed_articles)
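
### One thing to be aware of: `pub_date` is itself a dictionary-like object, so it will be written out looking like a dictionary rather than a tidy date. If you want a single readable string, a small helper along these lines could do it. This is only a sketch: `format_pub_date` is a hypothetical helper that is not part of the notebook, and we are assuming the date is stored either as separate `Year`/`Month`/`Day` fields or as one free-text `MedlineDate` field.

```python
# Hypothetical helper (not part of the original notebook)
def format_pub_date(pub_date):
    """Turn a PubDate-style dictionary into one readable string."""
    # Collect whichever of the separate date fields are present
    parts = [pub_date.get(key, '') for key in ('Year', 'Month', 'Day')]
    text = ' '.join(part for part in parts if part)
    # Assumption: some records carry a free-text 'MedlineDate' instead
    return text or pub_date.get('MedlineDate', 'No date available')

# Plain dictionaries with invented dates stand in for Biopython's data type
full_date = format_pub_date({'Year': '2022', 'Month': 'Mar', 'Day': '15'})
year_only = format_pub_date({'Year': '2021'})
free_text = format_pub_date({'MedlineDate': '2020 Nov-Dec'})
missing = format_pub_date({})

print(full_date)  # 2022 Mar 15
print(year_only)  # 2021
print(free_text)  # 2020 Nov-Dec
print(missing)    # No date available
```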

### Now we are going to set up the experiment and call all the functions we just made! This is where we will get our results and reap the benefits of our hard work. This block of code is pretty self-explanatory if you followed the earlier explanations of the functions. All we are doing is deciding on the keywords, the number of results we want, and the date range, and then sending them down our function chain!

In [14]:
# Set up the parameters for your literature crawl
keywords = 'artificial intelligence in medical school education'
number = 20
date_begin = '2010/01/01'
date_end = '2023/05/01'

# We call the search function and send in our parameters in order to receive the pubmed IDs from our search
studies = search(keywords, date_begin, date_end, number)
# We want to extract the IdList (it is in a dictionary, so we use the key associated with it)
studiesIdList = studies['IdList']
# we send the Ids to our fetch_details function to get more information about the identified articles
metadata = fetch_details(studiesIdList)
# we extract the information that we received from the fetch_details
data = extract_info(metadata)

# we write our data out to a CSV file named after the keywords
data.to_csv(keywords + '.csv', index=False)



<class 'Bio.Entrez.Parser.DictionaryElement'>


### And with that, if you made it through all of this, give yourself a big pat on the back. Some of this code is pretty complex, and it is great that you are spending the time going through it. We hope you are starting to see how, in computer science, the hard stuff is made up of the easy stuff, just combined in different ways. That is why it is so important to cement your foundation! Once you feel like you are really starting to understand the syntax (or at least are developing an intuition and know where to look for help), you are more than ready to move on to data science. Remember: this is an iterative process! Stay with it and with time you will see results.

In [15]:
# %pip install bert-extractive-summarizer
# from summarizer import Summarizer

# def summarize_abstract(abstract):
#     model = Summarizer()
#     return model(abstract, num_sentences=3)

# summary = summarize_abstract(abstract)

# parsed_articles.append({
#     'pmid': pmid,
#     'title': title,
#     'authors': ', '.join(authors),
#     'pub_date': pub_date,
#     'abstract': abstract,
#     'summary' : summary,
#     'keywords': ', '.join(keywords)
# })
    