# Parsing HTML with BeautifulSoup

In this example, we fetch a web page and list every downloadable file it links to.

https://catalog.data.gov/dataset?res_format=CSV&tags=hospital

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
r = requests.get('https://catalog.data.gov/dataset?res_format=CSV&tags=hospital')

In [3]:
r.status_code

200

In [4]:
print(r.text[0:1000])

<!DOCTYPE html>
<!--[if IE 7]> <html lang="en" class="ie ie7"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="ie ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en"> <!--<![endif]-->
  <head>
    <!--[if lte ie 8]><script type="text/javascript" src="/fanstatic/vendor/:version:2020-02-25T19:16:26.64/html5.min.js"></script><![endif]-->
<link rel="stylesheet" type="text/css" href="/fanstatic/vendor/:version:2020-02-25T19:16:26.64/select2/select2.css" />
<link rel="stylesheet" type="text/css" href="/fanstatic/css/:version:2020-02-25T19:16:25.50/main.min.css" />
<link rel="stylesheet" type="text/css" href="/fanstatic/vendor/:version:2020-02-25T19:16:26.64/font-awesome/css/font-awesome.min.css" />
<!--[if ie 7]><link rel="stylesheet" type="text/css" href="/fanstatic/vendor/:version:2020-02-25T19:16:26.64/font-awesome/css/font-awesome-ie7.min.css" /><![endif]-->
<link rel="stylesheet" type="text/css" href="/fanstatic/ckanext-h

In [5]:
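# Name the parser explicitly; leaving it out triggers a warning and the parser chosen can vary by machine
soup = BeautifulSoup(r.text, 'html.parser')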

In [6]:
# Each dataset title is a link inside an <h3> heading
for heading in soup.find_all('h3'):
    print(heading.a.text)

Hospitals
Community Points of Interest
IDPH Hospital Directory
2009 VHA Facility Quality and Safety Report - Population Quality of Care
Hospitals
Disproportionate Share Hospital (DSH) Eligibility for State Fiscal Years 2010-2019
NYC Health + Hospitals patient satisfaction scores – 2009
NYC Health + Hospitals/Options - fees - 2011
NYC Health + Hospitals Options - income eligibility - 2011
NYC Health + Hospitals patient care locations - 2011
Historical - ccgisdata - Hospital Point 2014
Hospitals
Historical - ccgisdata - Hospital Boundary 2014
Community HealthCare Centers
Connecticut Hospital Liquidity And Solvency Trend Data
Hospitals in Hawaii
Licensed Veterinary Hospitals for Fiscal Year 2018 (July 1, 2017 through June 30, 2018)
Prevention Quality Indicator (PQI) Composite Measure Rates by County, 2008-2015
GNIS: Buildings (2013)
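
The loop above assumes every `<h3>` contains an `<a>` tag; if one does not, `link.a` is `None` and `.text` raises an `AttributeError`. A minimal sketch of a guarded version follows. The HTML snippet and the names `sample_html`, `sample_soup`, and `titles` are made up for illustration and are not taken from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mirrors the <h3><a>...</a></h3> pattern of the listing
sample_html = """
<h3><a href="/dataset/hospitals">Hospitals</a></h3>
<h3>Section heading without a link</h3>
<h3><a href="/dataset/idph-hospital-directory"> IDPH Hospital Directory </a></h3>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

titles = []
for heading in sample_soup.find_all("h3"):
    anchor = heading.a  # None when the <h3> has no <a> descendant
    if anchor is None:
        continue
    # get_text(strip=True) also trims the whitespace around the title
    titles.append(anchor.get_text(strip=True))

print(titles)
```

Skipping headings without a link is one possible policy; logging them instead may be more useful when the page layout is unfamiliar.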


In [7]:
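# Each search result is an <li class="dataset-item">; its first <ul> lists the available formats
for element in soup.find_all('li', 'dataset-item'):
    name = element.h3.text.strip()
    resources = element.ul
    for item in resources.find_all('li'):
        # Keep only the resource whose label is exactly 'CSV'
        if item.text.strip() == 'CSV':
            print("Download information about '{}' from {}".format(name, item.a.attrs['href']))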

Download information about 'Hospitals' from https://data.baltimorecity.gov/api/views/g9ck-7zns/rows.csv?accessType=DOWNLOAD
Download information about 'Community Points of Interest' from https://data.townofcary.org/api/v2/catalog/datasets/points-of-interest/exports/csv
Download information about 'IDPH Hospital Directory' from https://data.illinois.gov/api/views/wsms-teqm/rows.csv?accessType=DOWNLOAD
Download information about '2009 VHA Facility Quality and Safety Report - Population Quality of Care' from https://www.va.gov/VETDATA/docs/Datagov/FY_07_Insurance_Expenditure_by_CD_and_State.csv
Download information about 'Hospitals' from http://gis-cityofsfgis.opendata.arcgis.com/datasets/4b6fa48a0c6d4fcb98edbc55c13a634f_11.csv
Download information about 'Disproportionate Share Hospital (DSH) Eligibility for State Fiscal Years 2010-2019' from https://data.chhs.ca.gov/dataset/a7379d3c-1e56-4e56-b9f6-398b2a9c7760/resource/b2f096ad-9681-4310-9ab0-6cbf4945c3bf/download/dsh-eligibility_data_rol
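
Printing the links is fine for a quick look, but collecting them makes them reusable. The sketch below runs the same traversal over a small, made-up HTML snippet (so it works offline), stores each result as a dictionary, and resolves relative links with `urljoin`. The snippet, the `example.org` URL, and the names `base_url`, `listing_html`, and `downloads` are assumptions for illustration only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed address of the page the snippet came from; used to resolve relative links
base_url = "https://catalog.data.gov/dataset"

# Hypothetical markup that follows the same li.dataset-item > h3 + ul structure
listing_html = """
<ul>
  <li class="dataset-item">
    <h3><a href="/dataset/hospitals">Hospitals</a></h3>
    <ul>
      <li><a href="https://example.org/hospitals.csv">CSV</a></li>
      <li><a href="/dataset/hospitals/resource.json">JSON</a></li>
    </ul>
  </li>
  <li class="dataset-item">
    <h3><a href="/dataset/clinics">Clinics</a></h3>
    <ul>
      <li><a href="/files/clinics.csv">CSV</a></li>
    </ul>
  </li>
</ul>
"""

listing = BeautifulSoup(listing_html, "html.parser")

downloads = []
for element in listing.find_all("li", class_="dataset-item"):
    name = element.h3.get_text(strip=True)
    for item in element.ul.find_all("li"):
        # Keep only resources labelled exactly "CSV" that actually carry a link
        if item.get_text(strip=True) == "CSV" and item.a is not None:
            downloads.append({"name": name, "url": urljoin(base_url, item.a["href"])})

for entry in downloads:
    print("Download information about '{name}' from {url}".format(**entry))
```

`urljoin` leaves absolute URLs untouched and only rewrites relative ones, so it is safe to apply to every `href`.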

# Getting Table Data


In this example, we find an HTML table on a page and extract its data.

https://open.epic.com/Clinical/Allergy - Error Codes

In [8]:
import requests
from bs4 import BeautifulSoup
import json

In [9]:
url = 'https://open.epic.com/Clinical/Allergy'
r  = requests.get(url)
data = r.text

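# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, 'html.parser')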

table = soup.find('table',id='errors')
print(table)

<table id="errors">
<thead>
<tr>
<th>Error Code</th>
<th>Severity</th>
<th>Description</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td>4100</td>
<td>Fatal</td>
<td>The resource request contained an invalid parameter</td>
<td>Invalid parameter such as a non existent patient ID: <code>AllergyIntolerance?patient=foo</code></td>
</tr>
<tr>
<td>4101</td>
<td>Warning</td>
<td>Resource request returns no results</td>
<td>A request for data that was otherwise valid but no information was documented or found (i.e. a patient with no pertinent implanted devices, or a demographic search where no patients met the search criteria).</td>
</tr>
<tr>
<td>4102</td>
<td>Fatal</td>
<td>The read resource request contained an invalid ID</td>
<td>Invalid Resource ID: <code>AllergyIntolerance/foo</code></td>
</tr>
<tr>
<td>4107 </td>
<td>Fatal</td>
<td>The read resource request has been merged</td>
<td>Requesting a Patient which has been merged - in this event, in addition to the error response, we will respond with an 

In [10]:
# In HTML tables, there is usually a <thead> section to tell us what the column headers are.
# Let's load those into a simple list of headers[]
headers = []
for cell in table.thead.tr.find_all('th'):
    headers.append(cell.text)

headers

['Error Code', 'Severity', 'Description', 'Example']
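
Not every table has a `<thead>`, and header cells sometimes carry stray whitespace. The following sketch, run on a made-up table, falls back to the first `<tr>` when `<thead>` is missing and trims each header with `get_text(strip=True)`. The names `table_html`, `sample_table`, `header_row`, and `sample_headers` are illustrative, and the fallback rule is an assumption that will not suit every page.

```python
from bs4 import BeautifulSoup

# Hypothetical table modelled on the error-code table; note the padded first header
table_html = """
<table id="errors">
<thead><tr><th> Error Code </th><th>Severity</th><th>Description</th></tr></thead>
<tbody><tr><td>4100</td><td>Fatal</td><td>Invalid parameter</td></tr></tbody>
</table>
"""

sample_table = BeautifulSoup(table_html, "html.parser").find("table", id="errors")

# Prefer the <thead> row; otherwise assume the first row of the table holds the headers
header_row = sample_table.thead.tr if sample_table.thead else sample_table.find("tr")

# Header cells are usually <th>, but some pages use <td>, so accept both
sample_headers = [cell.get_text(strip=True) for cell in header_row.find_all(["th", "td"])]

print(sample_headers)
```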

In [11]:
# In HTML tables, the data rows are in a <tbody> section.
# The first cell of each row is the error code and becomes the dictionary key;
# the remaining cells are stored under their column headers.
errors = {}
for row in table.tbody.find_all('tr'):
    for colnum, cell in enumerate(row.find_all('td')):
        if colnum == 0:
            error_cd = cell.text
            errors.setdefault(error_cd, {})
        else:
            column = headers[colnum]
            errors[error_cd][column] = cell.text

In [12]:
print(json.dumps(errors, indent=4))

{
    "4100": {
        "Severity": "Fatal",
        "Description": "The resource request contained an invalid parameter",
        "Example": "Invalid parameter such as a non existent patient ID: AllergyIntolerance?patient=foo"
    },
    "4101": {
        "Severity": "Warning",
        "Description": "Resource request returns\u00a0no results",
        "Example": "A request for data that was otherwise valid but no information was documented or found (i.e. a patient with no pertinent implanted devices, or a demographic search where no patients met the search criteria)."
    },
    "4102": {
        "Severity": "Fatal",
        "Description": "The read resource request contained an invalid ID",
        "Example": "Invalid Resource ID: AllergyIntolerance/foo"
    },
    "4107 ": {
        "Severity": "Fatal",
        "Description": "The read resource request has been merged",
        "Example": "Requesting a Patient which has been merged - in this event, in addition to the error response, we will respond with an HTTP R
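
Two details in this output are easy to trip over: the key `"4107 "` keeps a trailing space, and one description contains a non-breaking space (`\u00a0`). Matching cells to headers by position also goes wrong when a row has fewer cells than there are headers. A possible cleanup is sketched below on a shortened, made-up table; `clean`, `parsed`, and `short_rows` are illustrative names, and setting aside incomplete rows is just one way to handle them.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened table: one key with a trailing space, one cell with
# a non-breaking space, and one row that is missing a cell
table_html = """
<table id="errors">
<thead><tr><th>Error Code</th><th>Severity</th><th>Description</th></tr></thead>
<tbody>
<tr><td>4100</td><td>Fatal</td><td>Invalid parameter</td></tr>
<tr><td>4107 </td><td>Fatal</td><td>Request has&nbsp;been merged</td></tr>
<tr><td>4101</td><td>No results</td></tr>
</tbody>
</table>
"""


def clean(text):
    """Replace non-breaking spaces with plain spaces and trim the ends."""
    return text.replace("\u00a0", " ").strip()


sample_table = BeautifulSoup(table_html, "html.parser").find("table", id="errors")
sample_headers = [clean(th.get_text()) for th in sample_table.thead.find_all("th")]

parsed = {}
short_rows = []  # rows that cannot be aligned with the headers
for row in sample_table.tbody.find_all("tr"):
    cells = [clean(td.get_text()) for td in row.find_all("td")]
    if len(cells) != len(sample_headers):
        short_rows.append(cells)
        continue
    # First cell is the key; pair the remaining cells with the remaining headers
    parsed[cells[0]] = dict(zip(sample_headers[1:], cells[1:]))

print(parsed)
print(short_rows)
```

With cleaned keys, a lookup such as `parsed.get("4107")` works without having to remember the trailing space.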

In [13]:
errors.get('4119')

 'Description': 'Additional data may be present for patient',
 'Example': 'Request data while authenticated as an authorized patient or patient proxy. Inidicates that data available to the patient may not be the complete medical record within the system.'}

In [14]:
errors.get('4119')['Severity']
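
Chaining `[...]` onto `.get()` fails with a `TypeError` when the code is not in the dictionary, because `.get()` returns `None`, and with a `KeyError` when the entry has no `'Severity'` field. A small helper can make both cases safe. In this sketch, `severity_of` and `sample_errors` are hypothetical names, and the single sample entry is copied from the output shown earlier.

```python
# One entry in the same shape as the `errors` dictionary built earlier
sample_errors = {
    "4100": {
        "Severity": "Fatal",
        "Description": "The resource request contained an invalid parameter",
    },
}


def severity_of(errors, code, default=None):
    """Return the severity for an error code, or `default` if it is unknown."""
    # The empty-dict fallback keeps the second .get() from being called on None
    return errors.get(code, {}).get("Severity", default)


print(severity_of(sample_errors, "4100"))
print(severity_of(sample_errors, "9999", default="unknown"))
```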



## Reading HTML Tables with Pandas

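pandas can read HTML tables, too. Given a URL, a file, or HTML text, `read_html()` searches the page for `<table>` elements and, when the markup is reasonably well formed, returns a list of DataFrames, one per table it finds. It needs an HTML parser installed: either `lxml`, or `beautifulsoup4` together with `html5lib`.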

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html

In [15]:
import pandas as pd

In [16]:
dfs = pd.read_html('https://open.epic.com/Clinical/Allergy')

In [17]:
dfs

[                            Relative URL FHIR Interaction HTTP Method  \
 0  /AllergyIntolerance?[parameter=value]           Search         Get   
 
                                               Action  
 0  Retrieve AllergyIntolerance resources using th...  ,
   Parameter Name Parameter Type  \
 0            _id      Reference   
 1        patient      Reference   
 2          onset           Date   
 
                                          Description  
 0  Search for AllergyIntolerance resources using ...  
 1  Search for AllergyIntolerance resources for a ...  
 2  Further refine a search for AllergyIntolerance...  ,
                                                Query  \
 0  /AllergyIntolerance?patient=Tbt3KuCY0B5PSrJvCu...   
 
                                       Result  
 0  Returns the allergies for Jason Argonaut.  ,
                Relative URL FHIR Interaction HTTP Method  \
 0  /AllergyIntolerance/{ID}             Read         Get   
 
                             

In [18]:
dfs[4]

|   | Error Code | Severity | Description | Example |
|---|---|---|---|---|
| 0 | 4100 | Fatal | The resource request contained an invalid para... | Invalid parameter such as a non existent patie... |
| 1 | 4101 | Warning | Resource request returns no results | A request for data that was otherwise valid bu... |
| 2 | 4102 | Fatal | The read resource request contained an invalid ID | Invalid Resource ID: AllergyIntolerance/foo |
| 3 | 4107 | Fatal | The read resource request has been merged | Requesting a Patient which has been merged - i... |
| 4 | 4110 | Fatal | No parameters are provided in the search request | An invalid search request such as : AllergyInt... |
| 5 | 4111 | Fatal | Required search parameter missing from request | A request missing a required parameter (such a... |
| 6 | 4112 | Fatal | The resource request contained an invalid comb... | A search containing multiple different patient... |
| 7 | 4113 | Fatal | Session ID for cached search results has expired. | Making a request for previously accessed pagin... |
| 8 | 4115 | Fatal | Required search parameter has an invalid value | An invalid parameter required for searching: C... |
| 9 | 4117 | Warning | No CVX code for Immunization resource | Request for an Immunization resource without a... |
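
Picking the table by position (`dfs[4]`) breaks as soon as the page gains or loses a table. Since the error table carries `id="errors"`, one alternative is to ask `read_html()` for that table directly through its `attrs` parameter. The sketch below uses a tiny, made-up page wrapped in `StringIO` so that it runs offline; `page_html`, `tables`, and `errors_df` are illustrative names.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second one has id="errors"
page_html = """
<table>
<tr><th>Parameter Name</th><th>Parameter Type</th></tr>
<tr><td>patient</td><td>Reference</td></tr>
</table>
<table id="errors">
<thead><tr><th>Error Code</th><th>Severity</th></tr></thead>
<tbody>
<tr><td>4100</td><td>Fatal</td></tr>
<tr><td>4117</td><td>Warning</td></tr>
</tbody>
</table>
"""

# attrs narrows the search to tables whose HTML attributes match
tables = pd.read_html(StringIO(page_html), attrs={"id": "errors"})
errors_df = tables[0]

print(errors_df)
```

The result is still a list, so the single matching table is taken with `tables[0]`.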
