# Web scraping: FDA warning letters

In this notebook, we're going to write some code to scrape [tables of data on FDA warning letters issued in 2018](https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm).

We'll also talk about how we might extend this idea to get warning letters from other years.

First, let's think through the process and decide what tools we'll need. Our goal is to fetch a web page (`requests`), parse the HTML (`bs4`) and write to a local file (`csv`). Let's import what we need:

In [2]:
import csv

import requests
from bs4 import BeautifulSoup

### Now we noodle

The data table of warning letters is spread across multiple pages. What happens when you click "Next" or a page number? The URL changes from this:

[`https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm`](https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm)

to this:

[`https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm?Page=2`](https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm?Page=2)

What's happening is that a [URL _parameter_](https://en.wikipedia.org/wiki/Query_string) is being appended to the address. This tells the server which slice of the results to send back in response to your browser's request: Show me the letters on page 2.
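To make the idea of a URL parameter concrete, here is a small sketch that uses only Python's standard library (`urllib.parse`). It isn't part of the scraper itself, and the variable names are just for illustration. It builds the page-2 URL from a dictionary of parameters, then pulls the parameter back out again:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm'

# turn a dictionary of parameters into a query string
query = urlencode({'Page': 2})
page_url = base_url + '?' + query
print(page_url)

# going the other way: split the URL apart and read its parameters
params = parse_qs(urlsplit(page_url).query)
print(params)
```

`requests` does the same kind of work for you when you hand it a `params` dictionary, which is the approach this notebook takes later on.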

What happens when you specify `Page=1`?

[`https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm?Page=1`](https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm?Page=1)

We're back to the first page. This is good news, because it means we can iterate over the number of pages in the results, starting with Page 1, and grab what we need.

What happens when we specify a page number that doesn't exist?

[`https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm?Page=100`](https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm?Page=100)

We get a page that includes the text "No Current Postings Available" -- something we'll remember as we write our scraper.
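One way to remember that check is to wrap it in a tiny helper. This is only a sketch: the function name `has_postings` and the two sample strings are made up for illustration, and the check assumes the FDA page keeps using that exact wording.

```python
def has_postings(html):
    """Return True unless the page shows the 'no postings' message."""
    # lowercase everything first so capitalization doesn't matter
    return 'no current postings available' not in html.lower()


# two made-up snippets standing in for real pages
print(has_postings('<p>No Current Postings Available</p>'))
print(has_postings('<table id="WarningLetter_sortid"></table>'))
```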

### Save files locally

It's good practice to save the HTML files you want to scrape. So we're going to do that here -- save a copy of each page of results on the FDA site.

First, let's establish a few variables:
- The base URL we'll start from
- The naming pattern of the local files we're going to save -- we're going to use [string formatting](../reference/String%20formatting.ipynb)

In [1]:
BASE_URL = 'https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm'
LOCAL_FILE_PATT = 'fda-warning-letters-{pagenum}.html'
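If string formatting is new to you, here is a quick look at how that file name pattern gets filled in. The pattern is repeated under a different name (`file_pattern`) just so this snippet runs on its own:

```python
# same pattern as LOCAL_FILE_PATT, repeated so this snippet is self-contained
file_pattern = 'fda-warning-letters-{pagenum}.html'

# `format()` swaps the {pagenum} placeholder for whatever value we pass in
for pagenum in [1, 2, 3]:
    print(file_pattern.format(pagenum=pagenum))
```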

There are a bunch of different strategies we could use to grab these pages. Here's what we'll try today:
1. Look at the page numbers you could navigate to and grab the largest one
2. Loop over a range of numbers from 1 to the largest number in the pagination list (which we just grabbed)
3. Retrieve each page of results
4. If that page exists -- in other words, if the web page doesn't say "No Current Postings Available" -- save it; if not, break out of the loop because we're done (this is just an extra sanity check)

Why, you ask, don't we just loop over the numbers 1-5? We know that there are only 5 pages. Great question! Generally, you want to avoid hard-coding numbers in situations where numbers could change. As the FDA adds more warning letters, the database will return more than 5 pages of results, and our script would miss those.

First, let's grab the page and isolate the element with the maximum page number. Then we'll know how many pages we need to loop over.

In a new tab, crack open the page to view the source code. Then use `requests` to get the page and `bs4` to parse the HTML.

In [4]:
r = requests.get(BASE_URL)
soup = BeautifulSoup(r.text, 'html.parser')

Find the list of page numbers at the bottom -- I Ctrl-F'd for "Previous" to find it. Turns out it's an unordered list (`ul`) with the class `pagination-clean`.

In [8]:
pagination = soup.find('ul', {'class': 'pagination-clean'})
print(pagination)

<ul class="pagination-clean">
<li class="previous-off">Previous</li>
<li class="active">1</li>
<li class="next"><a href="?Page=2">Next</a></li>
</ul>


We want to isolate the numbers and grab the biggest. Couple ways to skin this cat, but today, let's pull the text of each item in that `ul` into a Python list -- only if it's a number, though! -- and find the biggest one.

Three new-to-us functions are going to help us out here:
- [`isnumeric()`](https://www.tutorialspoint.com/python/string_isnumeric.htm), a string method that checks whether the contents of a string are numeric ('4' => True, 'Hello!' => False)
- [`int()`](https://docs.python.org/3/library/functions.html#int), a function to coerce a value to an integer
- [`max()`](https://docs.python.org/3/library/functions.html#max), a function to get the biggest number out of a list
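Here's a quick look at each of the three on its own, using made-up values rather than anything from the FDA page:

```python
# isnumeric() is True only if every character in the string is numeric
print('4'.isnumeric())
print('Hello!'.isnumeric())

# note that a decimal point or a minus sign is enough to make it False
print('4.5'.isnumeric())

# int() turns the string '4' into the number 4, so we can do math with it
print(int('4') + 1)

# max() returns the biggest item in a list
print(max([1, 2, 3]))
```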

We're going to use a _for loop_ to iterate over the items (`li`) inside the list (`ul`).

👉 For a refresher on for loops, [see this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#for-loops)

In [26]:
# an empty list to hold the page numbers
page_numbers = []

# loop over the items in the pagination list
for li in pagination.find_all('li'):
    
    # get the text inside the tag and strip whitespace
    txt = li.string.strip()
    
    # if it's numeric
    if txt.isnumeric():
        
        # coerce to an integer (`int()`) and `append()` to our list
        page_numbers.append(int(txt))

# create a new variable with the biggest number in that list
max_page_num = max(page_numbers)

'''
~ BONUS CONTENT ~

A one-liner to do this would use something called a "list comprehension"
https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions

max_page_num = max([int(x.string.strip()) for x in pagination.find_all('li')
                    if x.string.strip().isnumeric()])

'''

print(max_page_num)

5


So now we have the number of pages we need to loop over. We'll use the [`range()`](https://docs.python.org/3/library/functions.html#func-range) function to loop over the range of numbers we're interested in -- keeping in mind that the second number you give to the range function is the _first_ number that you _don't_ want to include. In other words, we need to add 1 to our `max_page_num` variable to get what we need.

To demonstrate:

In [22]:
for i in range(1, max_page_num):
    print(i)

1
2
3
4


In [23]:
for i in range(1, max_page_num+1):
    print(i)

1
2
3
4
5


Five requests to a server isn't really a big deal, but just to be courteous we'll pause for a second between requests. For that, we'll use `sleep()`, a function in the `time` module from Python's standard library. Let's import `time` now.

In [24]:
import time

Now, to save our pages. We'll loop over the page numbers (1 through the max page number), request the page of FDA results for each number and save it to a file.

Along the way, if we see the 'no current postings available' text, something went wrong and we want to break out of the loop. To do this, we'll use a [`break`](https://docs.python.org/3/reference/simple_stmts.html#break) statement.

👉 For more information on _if_ statements, [check out this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#if-statements).

In [28]:
# loop over the numbers 1 through max_page_num (range stops before max_page_num+1)
for i in range(1, max_page_num+1):
    
    # get the page, specifying the 'Page' parameter
    r = requests.get(BASE_URL, params={'Page': i})
    
    # bail if the "nothing here" text shows up
    if 'no current postings available' in r.text.lower():
        break

    # otherwise, save the page
    else:
        with open(LOCAL_FILE_PATT.format(pagenum=i), 'w') as o:
            o.write(r.text)

    # print something to let us know it's working
    print('Saving page', i)
    
    # pause 1 second
    time.sleep(1)

### Scrape the HTML

Now that we've saved the HTML locally, let's get to work scraping it. Let's open the first one as a test and turn it into soup.

In [29]:
with open('fda-warning-letters-1.html', 'r') as i:
    html = i.read()
    soup = BeautifulSoup(html, 'html.parser')

Next, look for the data we want to scrape. It's in a `<table>` element with the ID `WarningLetter_sortid`. (It also appears to be the only table on the page.)

In [31]:
table = soup.find('table', {'id': 'WarningLetter_sortid'})
print(table)

<thead>
<tr>
<th data-toggle="true" data-type="numeric" scope="col"> Letter Issue Date </th>
<th scope="col">  Company Name </th>
<th data-hide="phone" scope="col">  Issuing Office </th>
<th data-hide="phone" scope="col">  Subject </th>
<th data-hide="phone,tablet" data-type="numeric" scope="col">  Close Out Date </th>
</tr>
</thead>
<tbody>
<tr>
<td>07/19/2018 </td>
<td> Dallas District Office</td>
<td> Unapproved New Drugs/Misbranded</td>
<td nowrap="">
<a href="#wldisclaimer">Not Issued *</a>
</td>
</tr>
<tr>
<td>07/10/2018 </td>
<td> Detroit District Office</td>
<td> Compounding Pharmacy/Adulterated Drug Products</td>
<td nowrap="">
<a href="#wldisclaimer">Not Issued *</a>
</td>
</tr>
<tr>
<td>07/09/2018 </td>
<td> Kansas City District Office</td>
<td> Food/Edible Tissue/Drug Residue/Adulterated</td>
<td nowrap="">
<a href="#wldisclaimer">Not Issued *</a>
</td>
</tr>
<tr>
<td>07/06/2018 </td>
<td> Dallas District Office</td>
<td> CGMP/Dietary Supplement/Adulterated/Misbranded</td>


Neato. What data are we extracting? The fields on the table are: `Letter Issue Date`, `Company Name`, `Issuing Office`, `Subject`, `Close Out Date`. The company name also links to the actual letter, so we'd want to grab that URL, too. Let's define the field names for our CSV in a list:

In [32]:
headers = ['company', 'issue_date', 'letter_url', 'issuing_office', 'subject', 'closeout_date']

We'll start by looping over the rows (`<tr>`) in the table and seeing what we can pull out. Remember: The `find_all()` method returns a list that we can iterate over.

In [33]:
for row in table.find_all('tr'):
    print(row)

<tr>
<th data-toggle="true" data-type="numeric" scope="col"> Letter Issue Date </th>
<th scope="col">  Company Name </th>
<th data-hide="phone" scope="col">  Issuing Office </th>
<th data-hide="phone" scope="col">  Subject </th>
<th data-hide="phone,tablet" data-type="numeric" scope="col">  Close Out Date </th>
</tr>
<tr>
<td>07/19/2018 </td>
<td> Dallas District Office</td>
<td> Unapproved New Drugs/Misbranded</td>
<td nowrap="">
<a href="#wldisclaimer">Not Issued *</a>
</td>
</tr>
<tr>
<td>07/10/2018 </td>
<td> Detroit District Office</td>
<td> Compounding Pharmacy/Adulterated Drug Products</td>
<td nowrap="">
<a href="#wldisclaimer">Not Issued *</a>
</td>
</tr>
<tr>
<td>07/09/2018 </td>
<td> Kansas City District Office</td>
<td> Food/Edible Tissue/Drug Residue/Adulterated</td>
<td nowrap="">
<a href="#wldisclaimer">Not Issued *</a>
</td>
</tr>
<tr>
<td>07/06/2018 </td>
<td> Dallas District Office</td>
<td> CGMP/Dietary Supplement/Adulterated/Misbranded</td>
<td nowrap="">
<a href="#

Perf. Let's use list slicing to skip the header row, though, and start extracting the data.
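As a quick refresher on list slicing, here is the idea on a plain list of made-up strings: `[1:]` means "everything from position 1 to the end," which drops the item at position 0.

```python
rows = ['header row', 'first letter', 'second letter', 'third letter']

# the item we want to skip
print(rows[0])

# everything after it
print(rows[1:])
```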

In [53]:
# loop over the table rows, skipping the first [0] one
for row in table.find_all('tr')[1:]:
    
    # get a list of table data elements in this row
    cols = row.find_all('td')
    
    # first one [0] has the date
    date = cols[0].string.strip()
    
    # second one [1] has the company
    company = cols[1].string.strip()
    
    # URL also in second one; prepend the base URL
    url = 'https://www.fda.gov' + cols[1].a['href']
    
    # third one [2] has the office
    office = cols[2].string.strip()

    # fourth one [3] has the subject
    subject = cols[3].string.strip()
    
    # fifth one [4] has the closeout date
    # using `text` instead of `string` because the text is actually inside the nested `a`
    closeout_date = cols[4].text.strip()
    
    # print it to see what we've got
    print(company, date, url, office, subject, closeout_date)
    print('-'*60)

------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------
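A note on the `text`-versus-`string` choice for the closeout date. A likely reason `.string` won't do the job there: when a tag holds more than one child (here, a line break, the `<a>` tag and another line break), `.string` gives back `None`. The sketch below shows the difference on a cell modeled on the "Close Out Date" cells printed earlier. It assumes `bs4` is installed, as it is elsewhere in this notebook.

```python
from bs4 import BeautifulSoup

# a single cell, modeled on the "Close Out Date" cells in the table output
html = '''<td nowrap="">
<a href="#wldisclaimer">Not Issued *</a>
</td>'''

cell = BeautifulSoup(html, 'html.parser').td

# the cell has three children (newline, <a> tag, newline), so `.string` is None
print(cell.string)

# `.text` gathers every string inside the tag, nested ones included
print(repr(cell.text))

# strip() then trims the leftover line breaks
print(cell.text.strip())
```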

Cool! Let's try it with the other files. How might we get a list of the HTML files we downloaded earlier? If you didn't know already, how might you formulate your search terms for Google? (I might start with something like "[python get specific files in directory](https://www.google.com/search?q=python+get+specific+files+in+directory)," which leads me to [a StackOverflow question](https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python) from someone who was trying to find all of the `.txt` files in a directory.)

Appears that [`glob`](https://docs.python.org/3/library/glob.html) is our horse here. Let's import `glob` and target our files:

In [57]:
import glob

results_pages = glob.glob('fda-warning-letters-*.html')
print(results_pages)
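One thing to know about `glob.glob()`: it doesn't promise any particular order. If order matters to you, a plain `sorted()` would put page 10 ahead of page 2, because it compares the names as text. Here is a sketch of sorting by the page number instead. The helper `page_number` is a hypothetical name, and the snippet makes its own throwaway files in a temporary directory so it runs anywhere:

```python
import glob
import os
import re
import tempfile


def page_number(path):
    """Pull the page number out of a name like fda-warning-letters-10.html."""
    match = re.search(r'-(\d+)\.html$', path)
    return int(match.group(1))


# build a throwaway directory holding a few empty files
with tempfile.TemporaryDirectory() as tmpdir:
    for n in [1, 2, 10]:
        name = 'fda-warning-letters-{pagenum}.html'.format(pagenum=n)
        open(os.path.join(tmpdir, name), 'w').close()

    found = glob.glob(os.path.join(tmpdir, 'fda-warning-letters-*.html'))

    # sort by the number in the file name, not by the text of the name
    ordered = sorted(found, key=page_number)
    names = [os.path.basename(p) for p in ordered]

print(names)
```

For this scraper the order only affects the order of rows in the final CSV, so sorting is optional.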



Winner winner, chicken dinner. Now we can loop over each one and extract the data we need from each.

In [58]:
# loop over the results pages
for page in results_pages:
    
    # open the page, read the HTML, turn it into soup
    with open(page, 'r') as i:
        html = i.read()
        soup = BeautifulSoup(html, 'html.parser')

    # find the table
    table = soup.find('table', {'id': 'WarningLetter_sortid'})

    # loop over the table rows, skipping the first [0] one
    for row in table.find_all('tr')[1:]:

        # get a list of table data elements in this row
        cols = row.find_all('td')

        # first one [0] has the date
        date = cols[0].string.strip()

        # second one [1] has the company
        company = cols[1].string.strip()

        # URL also in second one; prepend the base URL
        url = 'https://www.fda.gov' + cols[1].a['href']

        # third one [2] has the office
        office = cols[2].string.strip()

        # fourth one [3] has the subject
        subject = cols[3].string.strip()

        # fifth one [4] has the closeout date
        # using `text` instead of `string` because the text is actually inside the nested `a`
        closeout_date = cols[4].text.strip()

        # print it to see what we've got
        print(company, date, url, office, subject, closeout_date)
        print('-'*60)

------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------

------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------

Excellent! Instead of just printing the rows, we want to write them out to a CSV. Let's put it all together:

In [59]:
# open a new csv file to write to
# (`newline=''` is what the csv docs recommend; it avoids blank rows on Windows)
with open('fda-warning-letters-2018.csv', 'w', newline='') as o:
    
    # create a writer object, specifying the fieldnames
    writer = csv.DictWriter(o, fieldnames=headers)
    
    # write the header row
    writer.writeheader()
    
    # loop over the HTML pages
    for page in results_pages:
        # open the page, read the HTML, turn it into soup
        with open(page, 'r') as i:
            html = i.read()
            soup = BeautifulSoup(html, 'html.parser')
        
        # find the table
        table = soup.find('table', {'id': 'WarningLetter_sortid'})

        # loop over the table rows, skipping the first [0] one
        for row in table.find_all('tr')[1:]:

            # get a list of table data elements in this row
            cols = row.find_all('td')

            # first one [0] has the date
            date = cols[0].string.strip()

            # second one [1] has the company
            company = cols[1].string.strip()

            # URL also in second one; prepend the base URL
            url = 'https://www.fda.gov' + cols[1].a['href']

            # third one [2] has the office
            office = cols[2].string.strip()

            # fourth one [3] has the subject
            subject = cols[3].string.strip()

            # fifth one [4] has the closeout date
            # using `text` instead of `string` because the text is actually inside the nested `a`
            closeout_date = cols[4].text.strip()

            # write row to csv
            writer.writerow({
                'company': company,
                'issue_date': date,
                'letter_url': url,
                'issuing_office': office,
                'subject': subject,
                'closeout_date': closeout_date   
            })
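To see what `csv.DictWriter` is doing in isolation, here is a small sketch that writes to an in-memory buffer (`io.StringIO`) instead of a real file. The company name and the two-column layout are made up for illustration.

```python
import csv
import io

fieldnames = ['company', 'issue_date']

# an in-memory text buffer stands in for a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()

# keys can come in any order; the writer arranges them to match `fieldnames`
writer.writerow({'issue_date': '07/19/2018', 'company': 'Example Co., Inc.'})

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

Notice that the writer wraps the company name in quotes on its own, because the name contains a comma.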

So there we have it. At this point, I would go back and refactor the code to break up tasks into discrete [functions](../reference/Functions.ipynb).
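As a starting point for that refactor, here is one possible way to split the parsing into two functions. This is a sketch, not the only way to do it. The function names (`parse_row`, `parse_page`) and the constant `FDA_ROOT` are my own. The sample HTML is a made-up page with a placeholder company and link, and I've used `.text` for every cell rather than `.string` to keep the function uniform.

```python
from bs4 import BeautifulSoup

FDA_ROOT = 'https://www.fda.gov'


def parse_row(row):
    """Turn one table row (<tr>) into a dictionary keyed by CSV field name."""
    cols = row.find_all('td')
    return {
        'company': cols[1].text.strip(),
        'issue_date': cols[0].text.strip(),
        'letter_url': FDA_ROOT + cols[1].a['href'],
        'issuing_office': cols[2].text.strip(),
        'subject': cols[3].text.strip(),
        'closeout_date': cols[4].text.strip(),
    }


def parse_page(html):
    """Return a list of row dictionaries for one saved results page."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'id': 'WarningLetter_sortid'})
    # skip the header row, as in the loops above
    return [parse_row(row) for row in table.find_all('tr')[1:]]


# made-up sample page: the company name and link are placeholders
sample = '''
<table id="WarningLetter_sortid">
<thead><tr><th>Letter Issue Date</th><th>Company Name</th>
<th>Issuing Office</th><th>Subject</th><th>Close Out Date</th></tr></thead>
<tbody><tr>
<td>07/19/2018 </td>
<td><a href="/example/letter.htm">Example Co.</a></td>
<td> Dallas District Office</td>
<td> Unapproved New Drugs/Misbranded</td>
<td nowrap=""><a href="#wldisclaimer">Not Issued *</a></td>
</tr></tbody>
</table>
'''

records = parse_page(sample)
print(records[0])
```

Because `parse_page` returns dictionaries whose keys match `headers`, each one could be passed straight to `writer.writerow()`.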

### Next up: How to get multiple years' worth of data?

A class discussion + extra credit assignment.