## PubMed Scraper

In [1]:
# scrape dependencies
import requests
import re
from bs4 import BeautifulSoup as bs

# data analysis dependencies
import pandas as pd
import numpy as np
import csv

# ipynb dependencies
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings('ignore')

# viz dependencies
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

import datetime as dt
import time

In [2]:
# set the url to scrape
url = 'https://www.ncbi.nlm.nih.gov/pubmed/?term=parkinsons+disease'
print(url)

https://www.ncbi.nlm.nih.gov/pubmed/?term=parkinsons+disease
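
The search URL above is hard-coded for one term. As a minimal sketch, the same URL could be built from any search term with the standard library's `urlencode`. The names `BASE_URL` and `build_search_url` are hypothetical helpers, not part of the original notebook.

```python
from urllib.parse import urlencode

# base address taken from the cell above
BASE_URL = 'https://www.ncbi.nlm.nih.gov/pubmed/'

def build_search_url(term, base_url=BASE_URL):
    """Return the search URL for a term (hypothetical helper)."""
    # urlencode turns spaces into '+' and escapes other reserved characters
    return base_url + '?' + urlencode({'term': term})

url = build_search_url('parkinsons disease')
print(url)
```

Encoding the term this way keeps the URL valid when the search phrase contains spaces or punctuation.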


In [3]:
# set up beautiful soup to scrape
response = requests.get(url)
soup = bs(response.text, 'html.parser')

In [4]:
# let's scrape the article titles (each one is a <p> tag with class "title")
journals = soup.find_all("p", attrs={'class':'title'})

In [5]:
# count the article titles found on the first page
journals_len = len(journals)
print(f"There are {journals_len} journals to scrape on the first page.")

There are 20 journals to scrape on the first page.


In [6]:
# loop through journals to display the titles (no hard-coded count)
for journal in journals:
    journal.text.strip()

"Trx-1 ameliorates learning and memory deficits in MPTP-induced Parkinson's disease model in mice."

'Lipid vesicles affect the aggregation of 4-hydroxy-2-nonenal-modified α-synuclein oligomers.'

"MANF protects dopamine neurons and locomotion defects from a human α-synuclein induced Parkinson's disease model in C. elegans by regulating ER stress and autophagy pathways."

'Long-term evolution of patient-reported outcome measures in spinocerebellar ataxias.'

'Effects of deep brain stimulation on rest tremor progression in early stage Parkinson disease.'

"Patients' shifting goals for deep brain stimulation and informed consent."

'Glycosaminoglycans have variable effects on α-synuclein aggregation and differentially affect the activities of the resulting amyloid fibrils.'

'Association between attention-deficit/hyperactivity disorder and amyotrophic lateral sclerosis.'

"The factors associated with impulse control behaviors in Parkinson's disease: A 2-year longitudinal retrospective cohort study."

"Evaluation of Linguistic Markers of Word-Finding Difficulty and Cognition in Parkinson's Disease."

'Implementation and evaluation of Parkinson disease management in an outpatient clinical pharmacist-run neurology telephone clinic.'

'Treatment of psychotic symptoms in patients with Parkinson disease.'

'Pimavanserin (Nuplazid™) for the treatment of Parkinson disease psychosis: A review of the literature.'

'Drug-induced parkinsonism: A case report.'

'Evidence for the use of "medical marijuana" in psychiatric and neurologic disorders.'

'Interaction between Monoamine Oxidase B Inhibitors and Selective Serotonin Reuptake Inhibitors.'

'Comparative Study of MRI Biomarkers in the Substantia Nigra to Discriminate Idiopathic Parkinson Disease.'

"Visual hallucinations in dementia and Parkinson's disease: A qualitative exploration of patient and caregiver experiences."

'Therapy With Mesenchymal Stem Cells in Parkinson Disease: History and Perspectives.'

'Level of uric acid and uric acid/creatinine ratios in correlation with stage of Parkinson disease.'

## Set main url to concat with pubmed ids

In [7]:
# set the main url that we will concatenate with the pubmed id
main_url = 'https://www.ncbi.nlm.nih.gov/pubmed/'
print(main_url)

https://www.ncbi.nlm.nih.gov/pubmed/


## Function to create array of links to scrape

In [8]:
# set empty links_all list to append to 
links_all = []

# set pubmed ids list to append to
pubmed_ids = []

# set empty list to append scrape_links to
scrape_links = []

# function to get links
def get_links(main_url):
    
    # use bs to scrape <p> tags with class "title"
    links = soup.find_all("p",attrs={'class':'title'})
      
    # testing to see how many links / journals there are to scrape
    articles_to_scrape = len(links)
    print(f"There are {articles_to_scrape} articles to scrape.")
    print("----------------------------------------------")
    
    # loop through links to convert to string
    for i in range (len(links)):
        links_all.append(str(links[i]))
        print(links[i])
        print("----------------------------------------------")
        
    # slice through links_all to test
    len(links_all)
    links_all[1]
    
    # loop through links all and use regex to grab the id numbers
    for i in range (len(links_all)):
        pubmed_ids.append(re.findall(r'\d{8}',links_all[i]))
    
    # print out info for pubmed_ids
    len(pubmed_ids)
    type(pubmed_ids)
    print(pubmed_ids)
    print("----------------------------------------------")
    
    # use itertools to flatten pubmed_ids from a list of lists into one list
    import itertools
    pubmed_merged = list(itertools.chain.from_iterable(pubmed_ids))
    
    # slice through pubmed_merged to see what itertools did
    pubmed_merged[0]
    
    # concat main_url with a slice of pubmed_merged before we loop
    print(main_url + str(pubmed_merged[0]))
    
    # build the full link for each id and append it to scrape_links
    for i in range (len(pubmed_merged)):
        scrape_links.append(main_url + str(pubmed_merged[i]))

In [9]:
# RUN FUNCTION
get_links(main_url)

There are 20 articles to scrape.
----------------------------------------------
<p class="title" xmlns:mml="http://www.w3.org/1998/Math/MathML"><a href="/pubmed/29960099" ref="ordinalpos=1&amp;ncbi_uid=29960099&amp;link_uid=29960099&amp;linksrc=docsum_title">Trx-1 ameliorates learning and memory deficits in MPTP-induced <b>Parkinson</b>'s <b>disease</b> model in mice.</a></p>
----------------------------------------------
<p class="title" xmlns:mml="http://www.w3.org/1998/Math/MathML"><a href="/pubmed/29960040" ref="ordinalpos=2&amp;ncbi_uid=29960040&amp;link_uid=29960040&amp;linksrc=docsum_title">Lipid vesicles affect the aggregation of 4-hydroxy-2-nonenal-modified α-synuclein oligomers.</a></p>
----------------------------------------------
<p class="title" xmlns:mml="http://www.w3.org/1998/Math/MathML"><a href="/pubmed/29959908" ref="ordinalpos=3&amp;ncbi_uid=29959908&amp;link_uid=29959908&amp;linksrc=docsum_title">MANF protects dopamine neurons and locomotion defects from a human α

<br>
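There are duplicates in our **scrape_links** array because the regex `\d{8}` matches each PubMed ID several times per tag (in `href`, `ncbi_uid`, and `link_uid`). Convert the list to a `set` and back to a `list` to delete the duplicates. Note that a set does not preserve the original order.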

In [10]:
# delete duplicates in scrape_links and assign to new variable scrape_links_final
scrape_links_final = list(set(scrape_links))
len(scrape_links_final)
scrape_links_final

20

['https://www.ncbi.nlm.nih.gov/pubmed/29955562',
 'https://www.ncbi.nlm.nih.gov/pubmed/29959908',
 'https://www.ncbi.nlm.nih.gov/pubmed/29955495',
 'https://www.ncbi.nlm.nih.gov/pubmed/29953040',
 'https://www.ncbi.nlm.nih.gov/pubmed/29952939',
 'https://www.ncbi.nlm.nih.gov/pubmed/29959225',
 'https://www.ncbi.nlm.nih.gov/pubmed/29956879',
 'https://www.ncbi.nlm.nih.gov/pubmed/29959266',
 'https://www.ncbi.nlm.nih.gov/pubmed/29955528',
 'https://www.ncbi.nlm.nih.gov/pubmed/29955193',
 'https://www.ncbi.nlm.nih.gov/pubmed/29960099',
 'https://www.ncbi.nlm.nih.gov/pubmed/29960040',
 'https://www.ncbi.nlm.nih.gov/pubmed/29958655',
 'https://www.ncbi.nlm.nih.gov/pubmed/29954816',
 'https://www.ncbi.nlm.nih.gov/pubmed/29953689',
 'https://www.ncbi.nlm.nih.gov/pubmed/29959555',
 'https://www.ncbi.nlm.nih.gov/pubmed/29955500',
 'https://www.ncbi.nlm.nih.gov/pubmed/29955532',
 'https://www.ncbi.nlm.nih.gov/pubmed/29955824',
 'https://www.ncbi.nlm.nih.gov/pubmed/29959262']
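
The duplicates come from matching every 8-digit run in the tag. As a rough sketch of an alternative, the regex can be anchored to the `href` attribute so each ID is captured once, and `dict.fromkeys` can drop any remaining repeats while keeping the page order. The helper name `extract_links` and the trimmed `sample_tags` (shortened from the output above) are assumptions made for illustration.

```python
import re

MAIN_URL = 'https://www.ncbi.nlm.nih.gov/pubmed/'

# two tags from the output above, with the link text trimmed for brevity
sample_tags = [
    '<p class="title"><a href="/pubmed/29960099" ref="ordinalpos=1&amp;'
    'ncbi_uid=29960099&amp;link_uid=29960099&amp;linksrc=docsum_title">'
    'title text</a></p>',
    '<p class="title"><a href="/pubmed/29960040" ref="ordinalpos=2&amp;'
    'ncbi_uid=29960040&amp;link_uid=29960040&amp;linksrc=docsum_title">'
    'title text</a></p>',
]

# the original pattern matches the same id three times in one tag
matches_per_tag = len(re.findall(r'\d{8}', sample_tags[0]))

def extract_links(tag_strings, main_url=MAIN_URL):
    """Return one link per PubMed id, in page order (hypothetical helper)."""
    ids = []
    for tag in tag_strings:
        # capture only the digits that follow /pubmed/ inside href
        ids.extend(re.findall(r'href="/pubmed/(\d+)"', tag))
    # dict.fromkeys removes repeats but keeps first-seen order
    unique_ids = list(dict.fromkeys(ids))
    return [main_url + pmid for pmid in unique_ids]

links = extract_links(sample_tags)
print(matches_per_tag)
print(links)
```

Using `(\d+)` instead of `\d{8}` also avoids assuming that every PubMed ID has exactly eight digits.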

## Main array of links to scrape:

Here we use Selenium (through splinter) to iterate through these links. The browser visits each link, and we scrape the title and abstract from each page.

In [11]:
# testing scrape_links_final
for i in scrape_links_final:
    print(i)

https://www.ncbi.nlm.nih.gov/pubmed/29955562
https://www.ncbi.nlm.nih.gov/pubmed/29959908
https://www.ncbi.nlm.nih.gov/pubmed/29955495
https://www.ncbi.nlm.nih.gov/pubmed/29953040
https://www.ncbi.nlm.nih.gov/pubmed/29952939
https://www.ncbi.nlm.nih.gov/pubmed/29959225
https://www.ncbi.nlm.nih.gov/pubmed/29956879
https://www.ncbi.nlm.nih.gov/pubmed/29959266
https://www.ncbi.nlm.nih.gov/pubmed/29955528
https://www.ncbi.nlm.nih.gov/pubmed/29955193
https://www.ncbi.nlm.nih.gov/pubmed/29960099
https://www.ncbi.nlm.nih.gov/pubmed/29960040
https://www.ncbi.nlm.nih.gov/pubmed/29958655
https://www.ncbi.nlm.nih.gov/pubmed/29954816
https://www.ncbi.nlm.nih.gov/pubmed/29953689
https://www.ncbi.nlm.nih.gov/pubmed/29959555
https://www.ncbi.nlm.nih.gov/pubmed/29955500
https://www.ncbi.nlm.nih.gov/pubmed/29955532
https://www.ncbi.nlm.nih.gov/pubmed/29955824
https://www.ncbi.nlm.nih.gov/pubmed/29959262


## Regex Notes

In [12]:
# Regex
# Identifiers:
# \d any number
# \D anything but a number
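# \s any whitespace character (space, tab, newline)
# \S anything but whitespace
# \w any word character (letter, digit, or underscore)
# \W anything but a word character
# . any character, except for a newline
# \b a word boundary (the zero-width position at the edge of a word)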
# \. a period

# Modifiers:
# {1,3} we're expecting 1-3, e.g. \d{1,3}
# + Match 1 or more
# ? Match 0 or 1
# * Match 0 or more
# $ Match the end of a string
# ^ matching the beginning of a string
# | either or
# [] range or "variance" [A-Za-z] [1-5a-qA-Z]
# {x} expecting "x" amount

# White Space Characters: 
# \n new line
# \s space
# \t tab
# \e escape (not recognized by Python's re module; use \x1b there)
# \f form feed
# \r return

# DON'T FORGET to escape these with a backslash to match them literally:
# . + * [] $ ^ () {} | \
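
A few of the notes above are easier to remember with a quick check. The snippet below is a small sketch using Python's built-in `re` module; the sample strings are made up for illustration.

```python
import re

# {1,3} takes up to three digits at a time, so a four-digit run is split
digits = re.findall(r'\d{1,3}', '7 42 2018')

# ? allows zero or one of the preceding item, * allows zero or more
optional_u = bool(re.fullmatch(r'colou?r', 'colouur'))
repeated_u = bool(re.fullmatch(r'colou*r', 'colouur'))

# \b is a zero-width word boundary, so 'PD' inside 'PDX' is not matched
whole_words = re.findall(r'\bPD\b', 'PD and PDX')

# \w covers letters, digits, and the underscore, but not the hyphen
words = re.findall(r'\w+', 'MPTP-induced model_1')

# an unescaped . matches any character; \. matches only a literal period
any_char = re.findall(r'.', 'a.b')
literal_dot = re.findall(r'\.', 'a.b')

print(digits, optional_u, repeated_u, whole_words, words, any_char, literal_dot)
```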

## Selenium
**Web Browser Automation**

In [13]:
from splinter import Browser
from selenium import webdriver

In [14]:
# make sure the chromedriver exe is in the current directory
# (this is not necessary on Macs)
executable_path = {'executable_path': 'chromedriver'}

## Set up dictionary to append data to

In [16]:
article_dict = {"title": [],
               "abstract": []}

In [17]:
### use scrape_this to test scraper ###
scrape_this = ["https://www.ncbi.nlm.nih.gov/pubmed/29959262","https://www.ncbi.nlm.nih.gov/pubmed/29955824"]
scrape_this[0]
scrape_this[1]

'https://www.ncbi.nlm.nih.gov/pubmed/29959262'

'https://www.ncbi.nlm.nih.gov/pubmed/29955824'

## Create get_article_info function

In [18]:
title = []
abstract = []

def get_article_info(scrape_this):

    # open one browser window and reuse it for every article
    browser = Browser('chrome', headless=False)

    # iterate through articles
    for i in scrape_this:

        # visit the page in the browser; the parsing below uses the requests response
        browser.visit(i)
        response2 = requests.get(i)
        soup2 = bs(response2.text, 'html.parser')

        # there are two 'h1' tags on this page; the article title is at index 1
        title_one = soup2.find_all('h1')
        article_one_title = title_one[1].text.strip()
        title.append(article_one_title)

        # get abstract
        abstract.append(soup2.find("div", attrs={'class': 'rprt_all'}).text.strip())

    # close the browser once all articles are scraped
    browser.quit()

## Run get_article_info function

In [19]:
get_article_info(scrape_this)
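
When the full list of links is scraped instead of two test links, it is considerate to pause between requests. The sketch below shows one possible way to do that with `time.sleep`. The helper `fetch_all` and the stand-in `fake_fetch` are hypothetical, and the one-second default delay is an assumption, not a documented PubMed limit.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each url, pausing between consecutive requests."""
    pages = {}
    for position, url in enumerate(urls):
        # no need to wait before the very first request
        if position > 0:
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages

# stand-in fetcher so the sketch runs without network access
def fake_fetch(url):
    return f"<html>{url}</html>"

scrape_this = ["https://www.ncbi.nlm.nih.gov/pubmed/29959262",
               "https://www.ncbi.nlm.nih.gov/pubmed/29955824"]

pages = fetch_all(scrape_this, fake_fetch, delay_seconds=0)
print(list(pages))
```

In the notebook, `fake_fetch` could be swapped for something like `lambda u: requests.get(u, timeout=10).text`, so the returned HTML can be passed to BeautifulSoup as before.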

In [20]:
for i in title:
    print(i)

Patients' shifting goals for deep brain stimulation and informed consent.
Evaluation of Linguistic Markers of Word-Finding Difficulty and Cognition in Parkinson's Disease.


In [21]:
for i in abstract:
    print(i)
    print("\n")

Neurology. 2018 Jun 29. pii: 10.1212/WNL.0000000000005917. doi: 10.1212/WNL.0000000000005917. [Epub ahead of print]Patients' shifting goals for deep brain stimulation and informed consent.Kubu CS1, Frazier T2, Cooper SE2, Machado A2, Vitek J2, Ford PJ2.Author information1From the Center for Neurological Restoration (C.S.K.), Neuroethics Program (P.J.F.), Center for Pediatric Behavioral Health (T.F.), and Center for Neurological Restoration (A.M.), Cleveland Clinic, OH; and Department of Neurology (S.E.C., J.V.), University of Minnesota, Minneapolis. kubuc@ccf.org.2From the Center for Neurological Restoration (C.S.K.), Neuroethics Program (P.J.F.), Center for Pediatric Behavioral Health (T.F.), and Center for Neurological Restoration (A.M.), Cleveland Clinic, OH; and Department of Neurology (S.E.C., J.V.), University of Minnesota, Minneapolis.AbstractOBJECTIVE: To determine using a repeated-measures, prospective design whether deep brain stimulation (DBS) results in changes in the impor
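
The text scraped from `rprt_all` runs the citation, title, authors, affiliations, and abstract together. As a rough sketch, the abstract body could be separated by splitting on the first "Abstract" marker. The helper `split_abstract` is hypothetical, and `raw_text` is a shortened excerpt of the output above with the omitted middle marked by `[...]`. This approach assumes the word "Abstract" does not appear earlier in the citation block.

```python
def split_abstract(raw_text, marker="Abstract"):
    """Split scraped text into (citation header, abstract body)."""
    # partition splits on the first occurrence of the marker only
    header, found, body = raw_text.partition(marker)
    if not found:
        # no marker on the page: treat everything as header
        return raw_text.strip(), ""
    return header.strip(), body.strip()

# shortened excerpt of the scraped text; [...] marks omitted content
raw_text = (
    "Neurology. 2018 Jun 29. [...] University of Minnesota, Minneapolis."
    "AbstractOBJECTIVE: To determine using a repeated-measures, "
    "prospective design whether deep brain stimulation (DBS) results in changes"
)

header, body = split_abstract(raw_text)
print(header)
print(body)
```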

## Add title and abstract to article_dict

In [22]:
article_dict["title"].append(title)
article_dict["abstract"].append(abstract)
print(article_dict)

{'title': [["Patients' shifting goals for deep brain stimulation and informed consent.", "Evaluation of Linguistic Markers of Word-Finding Difficulty and Cognition in Parkinson's Disease."]], 'abstract': [["Neurology. 2018 Jun 29. pii: 10.1212/WNL.0000000000005917. doi: 10.1212/WNL.0000000000005917. [Epub ahead of print]Patients' shifting goals for deep brain stimulation and informed consent.Kubu CS1, Frazier T2, Cooper SE2, Machado A2, Vitek J2, Ford PJ2.Author information1From the Center for Neurological Restoration (C.S.K.), Neuroethics Program (P.J.F.), Center for Pediatric Behavioral Health (T.F.), and Center for Neurological Restoration (A.M.), Cleveland Clinic, OH; and Department of Neurology (S.E.C., J.V.), University of Minnesota, Minneapolis. kubuc@ccf.org.2From the Center for Neurological Restoration (C.S.K.), Neuroethics Program (P.J.F.), Center for Pediatric Behavioral Health (T.F.), and Center for Neurological Restoration (A.M.), Cleveland Clinic, OH; and Department of Ne
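
The output above shows a list nested inside a list, because `append` adds the whole `title` list as a single element. A minimal sketch of an alternative is to use `extend`, which keeps one flat entry per article and makes the data easy to write out with the `csv` module. The placeholder titles and abstracts below are made up for illustration.

```python
import csv
import io

article_dict = {"title": [], "abstract": []}

# placeholder values standing in for the scraped lists
title = ["First article title.", "Second article title."]
abstract = ["First abstract text.", "Second abstract text."]

# extend adds each item individually instead of nesting the list
article_dict["title"].extend(title)
article_dict["abstract"].extend(abstract)

# write one row per article to an in-memory CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "abstract"])
writer.writerows(zip(article_dict["title"], article_dict["abstract"]))
csv_text = buffer.getvalue()

print(article_dict)
print(csv_text)
```

With flat lists of equal length, the same dictionary could also be passed to `pd.DataFrame(article_dict)` to get one row per article.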
