<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Packages" data-toc-modified-id="Import-Packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Packages</a></span></li><li><span><a href="#Article-Abstract-Data" data-toc-modified-id="Article-Abstract-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Article Abstract Data</a></span><ul class="toc-item"><li><span><a href="#Using-xml-package" data-toc-modified-id="Using-xml-package-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Using <code>xml</code> package</a></span></li><li><span><a href="#Using-lxml-package" data-toc-modified-id="Using-lxml-package-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Using <code>lxml</code> package</a></span></li><li><span><a href="#Extracting-PMC-Article-Link-from-Article-Abstract-Page" data-toc-modified-id="Extracting-PMC-Article-Link-from-Article-Abstract-Page-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Extracting PMC Article Link from Article Abstract Page</a></span></li></ul></li></ul></div>

# Web Scraping Guide

## Import Packages

We'll use `requests` to pull the data and `lxml` and/or `xml` to parse it.

In [20]:
import requests
import xml.etree.ElementTree as ET

from lxml import html

[This article](https://docs.python-guide.org/scenarios/scrape/) is a concise summary of some helpful ideas. Below I'll apply them to a PubMed article, and I can add other examples as time goes on.

## Article Abstract Data

### Using `xml` package

Below I'll pull the abstract from a PubMed [article](https://www.ncbi.nlm.nih.gov/pubmed/12734240) using the `xml` package. There are usually several ways to do the same thing in Python, so I'm going to show you two different scraping packages so you can compare them and choose the one that makes sense for you.

In [44]:
### Save the URL as a variable
url = 'https://www.ncbi.nlm.nih.gov/pubmed/12734240'

### Using the `requests` package, request the page and save the response as an object.
### requests docs : https://requests.readthedocs.io/en/master/user/quickstart/#response-content
requests_object = requests.get(url)

### Take the page's text from our requests object
requests_object_content = requests_object.text

requests_object_content[:2000]

'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\n    <head xmlns:xi="http://www.w3.org/2001/XInclude"><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n    <!-- meta -->\n    <meta name="author" content="pubmeddev" /><meta name="keywords" content="PubMed, National Center for Biotechnology Information, NCBI, United States National Library of Medicine, NLM, MEDLINE, Medical Journals, pub med, Entrez, Journal Articles, Citation search" /><meta name="description" content="PubMed comprises more than 30 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites." /><meta name="robots" content="index,nofollow,noarchive" /><meta property="og:image" content="http

In [45]:
### Using the `xml` package, we can convert the raw string above into a Python data structure that
### we can easily navigate.
### xml docs : https://docs.python.org/3.3/library/xml.etree.elementtree.html
root = ET.fromstring(requests_object_content)
### `root` is the root element of the XML document; from there we can find the elements beneath it.
root.findall('./')

[<Element '{http://www.w3.org/1999/xhtml}head' at 0x111cd27c8>,
 <Element '{http://www.w3.org/1999/xhtml}body' at 0x111d6f408>]

In [46]:
root.findall('./{http://www.w3.org/1999/xhtml}head/{http://www.w3.org/1999/xhtml}title')

[<Element '{http://www.w3.org/1999/xhtml}title' at 0x111d66c78>]
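
Typing the full namespace URI in every step gets long quickly. As a minimal sketch, `findall` also accepts a dictionary that maps a short prefix to the namespace URI. The tiny XHTML string and the prefix `x` below are made up for illustration and stand in for the real PubMed page.

```python
import xml.etree.ElementTree as ET

# A small XHTML document standing in for the real page (hypothetical content).
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Example title</title></head>'
    '<body><p>Hello</p></body>'
    '</html>'
)
sample_root = ET.fromstring(sample)

# Map a short prefix of our choosing to the XHTML namespace URI.
ns = {'x': 'http://www.w3.org/1999/xhtml'}

# Same query shape as the cell above, written with the prefix instead of the URI.
titles = sample_root.findall('./x:head/x:title', ns)
title_text = titles[0].text
print(title_text)
```

The prefix is only a local alias, so any name works as long as it maps to the right URI.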

In [47]:
### The `{http://www.w3.org/1999/xhtml}` prefix on each element is the page's XML namespace. It's super
### annoying to type every time, so I found some code that removes it from every tag in the element tree.
for elem in root.iter():  # `getiterator()` was removed in Python 3.9; `iter()` replaces it
    # Skip nodes whose tag is not a string
    if not hasattr(elem.tag, 'find'):
        continue
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i+1:]
root.findall('./')

[<Element 'head' at 0x111cd27c8>, <Element 'body' at 0x111d6f408>]
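
If you would rather not modify the tree, one alternative (assuming Python 3.8 or newer) is the `{*}` wildcard, which matches a tag in any namespace. This is a sketch on a made-up XHTML snippet, not the PubMed page itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet used only for this illustration.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>Hello</p><p>World</p></body>'
    '</html>'
)
sample_root = ET.fromstring(sample)

# '{*}p' matches <p> elements regardless of their namespace (Python 3.8+).
paragraphs = sample_root.findall('.//{*}p')
para_texts = [p.text for p in paragraphs]

# The tags are left untouched, so the namespace is still there afterwards.
first_tag = paragraphs[0].tag
print(para_texts, first_tag)
```

The trade-off is that every query needs the wildcard, whereas stripping the prefix once keeps later queries short.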

In [48]:
### If you open the article's page, right-click, and choose "Inspect Element", you can see how it's all set up.
### I like to use XPath to select stuff while web scraping. You can use less specific selectors, but they tend to
### be difficult to use in real situations.
### XPATH docs : 

list(root.findall('.//div[@class="rprt_all"]//div[@class="abstr"]/')[1].itertext())

['Due to the inadequate automation in the amplification and sequencing procedures, the use of 16S rRNA gene sequence-based methods in clinical microbiology laboratories is largely limited to identification of strains that are difficult to identify by phenotypic methods. In this study, using conventional full-sequence 16S rRNA gene sequencing as the "gold standard," we evaluated the usefulness of the MicroSeq 500 16S ribosomal DNA (rDNA)-based bacterial identification system, which involves amplification and sequencing of the first 527-bp fragment of the 16S rRNA genes of bacterial strains and analysis of the sequences using the database of the system, for identification of clinically significant bacterial isolates with ambiguous biochemical profiles. Among 37 clinically significant bacterial strains that showed ambiguous biochemical profiles, representing 37 nonduplicating aerobic gram-positive and gram-negative, anaerobic, and Mycobacterium species, the MicroSeq 500 16S rDNA-based bac
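
To see what that query is doing, here is a minimal sketch on a small hand-written snippet. It reuses the `rprt_all` and `abstr` class names from the cell above, but the surrounding markup and the text are assumptions made for illustration. The trailing `/` is written as an explicit `/*`, and the pieces returned by `itertext()` are joined into one string.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the abstract section of the page (hypothetical markup).
sample = (
    '<html><body>'
    '<div class="rprt_all">'
    '<div class="abstr">'
    '<h3>Abstract</h3>'
    '<div><p>First sentence with <i>italic</i> text. </p><p>Second sentence.</p></div>'
    '</div>'
    '</div>'
    '</body></html>'
)
sample_root = ET.fromstring(sample)

# [@class="..."] keeps only elements whose class attribute equals that exact value;
# the final /* selects the direct children of the matching "abstr" div.
children = sample_root.findall('.//div[@class="rprt_all"]//div[@class="abstr"]/*')

# Index 0 is the <h3> heading and index 1 is the <div> holding the text.
# itertext() yields every text fragment inside that <div>, including nested tags.
abstract = ''.join(children[1].itertext()).strip()
print(abstract)
```

Joining the fragments gives a single string that is easier to store or search than the list shown above.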