# Web scraping with Python

The goal: Scrape the data from the table on [this page listing WARN notices filed in South Dakota](https://dlr.sd.gov/workforce_services/businesses/warn_notices.aspx) and write to a local CSV file. Open that page in a new tab. Also, if you need an introduction (or refresher) on Python syntax, [open this notebook as well](Python%20syntax%20cheat%20sheet.ipynb).

### Table of contents
- [Using Jupyter notebooks](#Using-Jupyter-notebooks)
- [Researching the target page](#Researching-the-target-page)
- [Import libraries](#Import-libraries)
- [Request the page](#Request-the-page)
- [Turn your HTML into soup](#Turn-your-HTML-into-soup)
- [Targeting and extracting data](#Targeting-and-extracting-data)
- [Write the results to file](#Write-the-results-to-file)
- [Extra credit](#Extra-credit)
- [Links to other resources](#Links-to-other-resources)

### Using Jupyter notebooks

There are lots of ways to write and run Python code on your computer. One way -- the method we're using today -- is to use [Jupyter notebooks](https://jupyter.org), which run in your browser and allow you to add documentation alongside your code. (Here, we're using JupyterLab.) They're handy for bundling your code with a human-readable explanation of what's happening at each step. Check out some examples from the [L.A. Times](https://github.com/datadesk/notebooks) and [BuzzFeed News](https://github.com/BuzzFeedNews/everything#data-and-analyses).

In this notebook are [markdown](https://daringfireball.net/projects/markdown/) cells, such as this one, and code cells (below). (Double-click on a markdown cell to edit it.)

**To add a new cell to your notebook**: Click the + button in the menu or, with a cell selected (but not in edit mode), press the `b` key on your keyboard.

**To run a cell of code**: Select the cell and click the "Run" button in the menu, or you can press Shift+Enter.

**One common gotcha**: The notebook doesn't "know" about code you've written until you've _run_ the cell containing it. For example, if you define a variable called `my_name` in one cell, then try to access that variable in another cell and get an error that says `NameError: name 'my_name' is not defined`, the most likely solution is to run (or re-run) the cell in which you defined `my_name`.

### Researching the target page

A web-scraping project usually involves writing some code to complete these tasks:
- Handle the HTTP requests to retrieve the content available at a given URL -- typically a string of HTML
- Parse that string of HTML (or whatever) into the data structures available in your coding language -- here, Python lists, dictionaries and other objects
- Traverse these data structures to extract the data you need in the format expected by whatever object you're using to write the data to file -- here, we're going to be creating lists, which Python's _[csv.writer()](https://docs.python.org/3/library/csv.html#csv.writer)_ object can work with
- Write the data to a local file (and/or load it into a dataframe for analysis, fire a conditional Slack alert, send you an email, whatever you're trying to accomplish)

A good starting point is to examine how the web page(s) are put together. You can look at the HTML that makes up a web page by _inspecting the source_ in a web browser. IRE likes Chrome and Firefox for this task.

You can inspect specific elements on the page by loading the page, right-clicking anywhere and selecting "Inspect" or "Inspect Element" from the context menu that pops up. Hover over elements in the "Elements" tab to highlight them on the page.

To examine all of the source code that makes up a page, you can also "view source." In Chrome, hit `Ctrl+U` on a PC or `⌘+Opt+U` on a Mac. It's also in that right-click context menu ("View Source") and in the browser's menu bar: View > Developer > View Page Source.

You'll get a page showing you all of the HTML code that makes up that page. Ignore 99% of it and try to locate the element(s) that you want to target, using `Ctrl+F` on a PC or `⌘+F` on a Mac to find them. One question you might want to answer when poking through the source code: Is this the only element of its type on the page? For example, is the `<table>` element you identified in your research the only `<table>` element in the HTML? (If so, this simplifies things at the "find this table in the HTML" step later on.)

Open up a Chrome browser and inspect the table on [our target page](https://dlr.sd.gov/workforce_services/businesses/warn_notices.aspx). Find the table we want to scrape.

### Import libraries

Step one is to _import_ two third-party Python libraries that will help us scrape the page:
- `requests` for making HTTP requests, similar to what happens when you type a URL into a browser window and hit enter ([docs](https://requests.readthedocs.io))
- `bs4`, or BeautifulSoup, is a popular library for parsing HTML into a data structure that Python can work with ([docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/))

These libraries were installed separately on a per-project basis ([read more about our recommendations for setting up Python projects here](https://docs.google.com/document/d/1cYmpfZEZ8r-09Q6Go917cKVcQk_d0P61gm0q8DAdIdg/edit#heading=h.od2v1nkge5t1)).

Run this cell (you'll only have to do this once):

In [None]:
import requests
import bs4

### Request the page

Next, we'll use the `get()` method of the `requests` library (which we just imported) to grab the web page.

While we're at it, we'll _assign_ all the stuff that comes back to a new variable using `=`.

The variable name is arbitrary, but it's usually good to pick something that reasonably describes the value it's pointing to (and follows [Python's variable-naming rules](https://realpython.com/python-variables/)).

Notice that the URL we're grabbing is wrapped in quotes, making it a _string_ that Python will interpret as text (as opposed to numbers, booleans, etc.). You can read up more on Python data types and variable assignment [in our syntax cheat sheet](Python%20syntax%20cheat%20sheet.ipynb).

Run this cell:

In [None]:
req = requests.get('https://dlr.sd.gov/workforce_services/businesses/warn_notices.aspx')

Nothing appears to have happened, which is (usually) a good sign.

If you want to make sure that your request was successful, you can check the `status_code` attribute of the Python object that was returned:

In [None]:
req.status_code

A `200` code means all is well, `404` means the page wasn't found, and so on. [Here's one of our favorite lists of HTTP status codes](https://http.cat/), [or here, if you prefer dogs](https://httpstatusdogs.com/).
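
If you'd rather look up a status code without leaving Python, here is a minimal sketch using the standard library's `http.HTTPStatus`. The helper name `describe_status` is made up for this example, and nothing here touches the network.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard short phrase for an HTTP status code."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # not a status code the standard library knows about
        return 'Unknown status'


for code in (200, 404, 500):
    print(code, describe_status(code))
```

In a real scraper, you could also call `req.raise_for_status()` right after the request, which raises an exception if the server answered with a 4xx or 5xx code.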

The object being stored as the `req` variable came back with a lot of potentially useful information we could access. Today, we're mostly interested in the `.text` attribute -- the HTML that makes up the web page, same as if we'd viewed the page source. Let's take a look:

In [None]:
req.text

If you're not familiar with HTML (HyperText Markup Language), which gives structure to a web page, [here's a great introduction](https://developer.mozilla.org/en-US/docs/Web/HTML).

Most HTML elements are represented by a pair of tags -- an opening tag and a closing tag.

A table, for example, starts with `<table>` and ends with `</table>`. The opening tag tells the browser to render everything between it and the closing tag as a table (according to any styling rules that might be defined via [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS)). Inside the table are nested more HTML tags representing rows (`<tr>`) and cells (`<td>`).

An HTML element can have any number of attributes, such as classes --

`<table class="cool-table">`

-- styles --

`<table style="width:95%;">`

-- hyperlinks to other pages --

`<a href="https://ire.org">Click here to visit IRE's website</a>`

-- and IDs --

`<table id="cool-table">`

-- all of which will be useful to know about when we're scraping.
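
To preview why attributes matter, here is a small sketch of how BeautifulSoup (introduced in the next section) exposes them. The HTML snippet and the `example_`-prefixed variable names are invented for illustration; they are not taken from the target page.

```python
import bs4

# a made-up snippet of HTML, just for illustration
example_html = '''
<table id="cool-table" class="striped wide" style="width:95%;">
  <tr><td><a href="https://ire.org">IRE</a></td></tr>
</table>
'''

example_soup = bs4.BeautifulSoup(example_html, 'html.parser')
example_table = example_soup.find('table')
example_link = example_soup.find('a')

# square brackets work like a dictionary lookup on the tag's attributes
table_id = example_table['id']

# `class` comes back as a list, because an element can have several classes
table_classes = example_table['class']

# .get() returns None, rather than raising an error, if the attribute is missing
link_url = example_link.get('href')
link_title = example_link.get('title')

print(table_id, table_classes, link_url, link_title)
```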

### ✍️ Try it yourself

Use the code blocks below to experiment with requesting web pages and checking out the HTML that gets returned.

Some ideas to get you started:
- `'https://web.archive.org/web/20031202214318/http://www.tdcj.state.tx.us:80/stat/finalmeals.htm'`
- `'https://en.wikipedia.org/w/index.php?title=List_of_animal_names'`
- `'https://www.dailypress.senate.gov/membership/membership-lists'`

### Turn your HTML into soup
Before we start targeting and extracting pieces of data in our HTML, we first need to turn that chunk of text into a data structure that Python can work with. That's where the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (`bs4`) library comes in.

We'll create a new instance of a `BeautifulSoup` object, which lives under the top-level `bs4` library that we imported earlier. We need to give it two things:
- The HTML we'd like to parse -- `req.text`
- A string with the name of the type of parser to use -- `html.parser` ships with Python and is usually fine, but [there are other options](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser)

We'll save the parsed HTML as a new variable, `soup`.

In [None]:
soup = bs4.BeautifulSoup(req.text, 'html.parser')

'''
# Uncomment and run this code if the Internet is down

with open('sd-warn.html', 'r') as infile:
    html = infile.read()
    soup = bs4.BeautifulSoup(html, 'html.parser')
'''

Nothing happened, which is good! You can take a look at what `soup` is, but it looks pretty much like `req.text`:

In [None]:
soup

If you want to be sure, you can use the Python function `type()` to check what sort of object you're dealing with:

In [None]:
# the `str` type means a string, or text
type(req.text)

In [None]:
# the `bs4.BeautifulSoup` type means we successfully created the object
type(soup)

### ✍️ Try it yourself

Use the code blocks below to experiment with fetching HTML and turning it into soup (if you fetched some pages earlier and saved them as variables, that'd be a good start).

### Targeting and extracting data
There are many ways you can navigate an HTML tree with BeautifulSoup -- today, we'll mostly be using these two methods:
- `find()`, which returns one thing: the first element that meets your search criteria
- `find_all()`, which returns a list of things: all of the elements that meet your search criteria

So, if you're looking for a table, your strategy might be:
- Find the first table on the page, or
- Find all of the tables on the page, then select the correct one from the list, or
- Find the table with specific attributes (e.g., the table with an `id` of `some-table`), or
- Find a nearby element and use its relative position to navigate to the table (is it the `.parent`, or containing element, of the table? a sibling, reachable with `.next_sibling` or `.previous_sibling`? etc.)
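
The strategies above can be sketched on a toy page. Everything in this HTML (the IDs, the heading, the company name) is made up for the example, and the variable names are prefixed with `example_` so they won't overwrite the `soup` you created earlier.

```python
import bs4

# a made-up page with two tables, just for illustration
example_html = '''
<div id="wrapper">
  <table id="nav-table"><tr><td>Home</td></tr></table>
  <h2 id="data-heading">Notices</h2>
  <table id="some-table"><tr><td>Acme Corp.</td></tr></table>
</div>
'''

example_soup = bs4.BeautifulSoup(example_html, 'html.parser')

# strategy 1: the first table on the page
first_table = example_soup.find('table')

# strategy 2: all of the tables, then pick one by position
all_tables = example_soup.find_all('table')
second_table = all_tables[1]

# strategy 3: the table with a specific attribute
table_by_id = example_soup.find('table', id='some-table')

# strategy 4: start from a nearby element and navigate
heading = example_soup.find('h2', id='data-heading')
table_after_heading = heading.find_next_sibling('table')
container = table_by_id.parent

print(first_table['id'], len(all_tables), table_after_heading['id'], container['id'])
```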

Now that we have a BeautifulSoup object loaded up for the S.D. WARN page, we can go hunting for the specific HTML elements that contain the data we need. Our general strategy:
1. Find the main table with the data we want to grab
2. Get a list of rows (the `tr` element, which stands for "table row") in that table
3. Use a Python `for loop` to go through each table row and find the data inside it (`td`, or "table data")

#### Find the table

To start with, we need to find the table. In this case it's easy, because there's only one table on the page, so we can just use the `find()` method to get the table.

Run these cells:

In [None]:
table = soup.find('table')

In [None]:
table

#### Find the rows in the table

Next, use the `find_all()` method to fetch a list of rows in the table:

In [None]:
rows = table.find_all('tr')

In [None]:
rows

To see how many items are in this list -- in other words, how many rows are in the table (including the headers) -- you can use the `len()` function:

In [None]:
len(rows)

#### Loop through the rows and extract the data

Next, we can use a [`for` loop](Python%20syntax%20cheat%20sheet.ipynb#for-loops) to go through the list of rows and start grabbing data from each one.

Quick refresher on _for loop_ syntax: Start with the word `for` (lowercase), then a variable name to stand in for each item in the list that you're looping over, then the word `in` (lowercase), then the name of the list holding the items (`rows`, in our case), then a colon, then an indented block of code describing what we're doing to each item in the list.

Indentation matters in Python; all of the code indented at the same level will run as part of that indented block -- in this case, every command we write in the `for loop` indentation block will be applied to each thing in the list that we're looping over.
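
Here is a tiny sketch of that syntax on a list of plain strings rather than table rows. The list contents and variable names are invented for the example.

```python
example_items = ['header', 'row one', 'row two']
shouted = []

for example_item in example_items:
    # both of these indented lines run once for each item in the list
    upper = example_item.upper()
    shouted.append(upper)

# back at the left margin: this runs once, after the loop has finished
print(shouted)
```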

Each piece of data in the row is stored in a `td` tag, which stands for "table data." So inside the loop -- in the indented block -- we'll use the `find_all()` method to get a list of every `td` tag inside the row. And from there, we can access the content inside each item in the resulting list.

Our goal is to end up with a _list_ of data for each row that we will eventually write out to a file. Typically you'd do the work of looping and inspecting the results, step by step, in one code cell. But to show the thinking of how you might approach this (and to practice the syntax), we'll start by just printing out each row and then build from there.

In [None]:
for row in rows:
    print(row)

Notice that the first item that prints is the header row with the column labels. You are free to keep these headers if you want, but I typically skip that row and define my own list of column names. (Another thing to consider: Often the cells in the header row will be represented by `th` ("table header") tags, not `td` ("table data") tags, so calling `find_all('td')` to get a list of data points in each row -- the next step -- would return an empty list, and you'd need to handle that somehow.)
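
To make the `th` versus `td` point concrete, here is a sketch on a made-up two-row table (the company and city are placeholders, not data from the target page). It shows the empty list you get back for a header row and one possible way to handle it.

```python
import bs4

# a made-up table whose header row uses <th> tags
example_html = '''
<table>
  <tr><th>Company</th><th>Location</th></tr>
  <tr><td>Acme Corp.</td><td>Pierre</td></tr>
</table>
'''

example_soup = bs4.BeautifulSoup(example_html, 'html.parser')
example_rows = example_soup.find_all('tr')

# the header row has no <td> tags, so this is an empty list
header_cells = example_rows[0].find_all('td')

parsed_rows = []
for example_row in example_rows:
    example_cells = example_row.find_all('td')

    # an empty list is "falsy," so this skips any row without <td> tags
    if not example_cells:
        continue

    parsed_rows.append([cell.text.strip() for cell in example_cells])

print(len(header_cells), parsed_rows)
```

Checking for an empty list is one option; it also guards against stray header rows in the middle of a table.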

We can skip the first row by using _list slicing_: adding square brackets after the name of the list with some instructions about which items in the list we want to select.

Here, the syntax would be: `rows[1:]`, which means, take everything in the `rows` list starting with the item in position 1 (the second item) to the end of the list. Like many programming languages, Python starts counting at 0, so the result will leave off the first item in the list -- i.e. the item in position 0 -- the headers.
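
A quick sketch of slicing and indexing on a throwaway list (the values are placeholders) so you can see what each expression selects:

```python
example_list = ['headers', 'row 1', 'row 2', 'row 3']

first_item = example_list[0]        # 'headers'
without_first = example_list[1:]    # ['row 1', 'row 2', 'row 3']
first_two = example_list[:2]        # ['headers', 'row 1']
last_item = example_list[-1]        # 'row 3'

# slicing makes a new list; the original is unchanged
print(without_first, len(example_list))
```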

In [None]:
for row in rows[1:]:
    print(row)

Next, start pulling out the data in each row. Start by using `find_all()` to grab a list of `td` tags contained in each `tr`:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    print(cells)

Now we have, for each row, a _list_ of `td` tags. Next step is to look at the table and start grabbing specific values based on their position in the list and assigning them to human-readable variable names.

Quick refresher on list syntax: To access a specific item in a list, use square brackets `[]` and the index number of the item you'd like to access. For instance, to get the first cell in the row, use `[0]`.

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    company = cells[0]
    print(company)

This is returning the entire `Tag` object -- we just want the contents inside it. You can access the `.text` attribute of the tag to get the text inside:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    company = cells[0].text
    print(company)

In the next cell (`[1]`), the `.text` attribute will give you the location. In the third cell (`[2]`) you'll get the date. Etc.

It's also generally good practice to trim off leading and trailing whitespace for each value, and you can use the Python built-in string method `strip()` to accomplish this as you march across each row. (If you need to remove repetitive internal whitespace characters, as well, you can use [the split-and-join method mentioned here](Python%20syntax%20cheat%20sheet.ipynb#String-methods).)
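
As a sketch of the difference between the two cleanup approaches, using a made-up messy string:

```python
# a made-up value with leading, trailing and internal whitespace
messy = '  \n  Acme   Widget\t Co.  \n'

# strip() only trims the ends; internal whitespace is untouched
stripped = messy.strip()

# split() with no arguments breaks on any run of whitespace,
# and join() glues the pieces back together with single spaces
collapsed = ' '.join(messy.split())

print(repr(stripped))
print(repr(collapsed))
```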

Which gets us this far:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    company = cells[0].text.strip()
    location = cells[1].text.strip()
    print(company, location)

### ✍️ Try it yourself

Now that you've gotten this far, see if you can isolate the other pieces of data in each row.

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    company = cells[0].text.strip()
    location = cells[1].text.strip()
    
    # date

    # employees_laid_off

    # print(company, location, date, employees_laid_off)
    

### Write the results to file

Now that we've targeted our lists of data for each row, we can use Python's built-in [`csv`](https://docs.python.org/3/library/csv.html) module to write each list to a CSV file.

First, import the csv module. (Typically, the convention is to handle all of your import statements at the top of your script, but it's no big deal if you do it here.)

In [None]:
import csv

Now define a list of headers to match the data (each column header will be a string) -- run this cell:

In [None]:
csv_headers = [
    'company',
    'location',
    'date',
    'employees_laid_off'
]

Now, using something called a `with` block, open a new CSV file to write to and write some code to do the following things:
- Create a `csv.writer` object
- Write out the list of headers using the `writerow()` method of the `csv.writer` object
- Copy and paste the `for` loop you just wrote and modify it so that instead of just printing the contents of each cell, you'll create a list of items to write to file, then use the `writerow()` method of the `csv.writer` object to write each list of data to file

Pay attention to the levels of indentation:
- The code in the first indented block (and deeper), belonging to the `with` statement, will be executed in the context of the file being open
- The code in the second indented block, belonging to the `for` loop, will be executed for each item in the list we're looping over
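
The two indentation levels can be sketched with placeholder data before you wire in the scraped rows. The file name, headers and rows below are invented for the example, and the file is written to your system's temporary folder (an assumption made here to avoid cluttering the notebook folder).

```python
import csv
import os
import tempfile

# placeholder data, not from the target page
example_headers = ['name', 'city']
example_data = [
    ['Acme Corp.', 'Pierre'],
    ['Widget Co.', 'Sioux Falls'],
]

example_path = os.path.join(tempfile.gettempdir(), 'example-output.csv')

with open(example_path, 'w', newline='', encoding='utf-8') as example_file:
    # first indentation level: the file is open
    example_writer = csv.writer(example_file)
    example_writer.writerow(example_headers)

    for example_line in example_data:
        # second indentation level: runs once per item in the list
        example_writer.writerow(example_line)

# back at the left margin: the file has been closed automatically
with open(example_path, 'r', newline='', encoding='utf-8') as check_file:
    written = list(csv.reader(check_file))

print(written)
```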

In [None]:
# create a file called 'sd-warn-notices.csv' in write ('w') mode
# specify a `newline` argument to deal with how PCs handle line endings
# ensure the file is encoded in utf-8
# and use the `as` keyword to name the file handler (the variable name `outfile` is arbitrary)
with open('sd-warn-notices.csv', 'w', newline='', encoding='utf-8') as outfile:
    
    # from the csv module we imported, create a new .writer object attached to the open file
    # and assign it to a variable
    writer = csv.writer(outfile)

    # write the list of headers
    writer.writerow(csv_headers)
    
    # copy/paste the for loop you wrote earlier
    # it should be at this indentation level =>
    # for row in rows[1:]:
    #     cells = row.find_all('td')
    #     etc. ...
    # but at the end, instead of printing your list, create a list and write it to file:
    # data_out = [company, location, date, employees_laid_off]
    # writer.writerow(data_out)

The file will be written into the same folder as this notebook. You determine where your file is created when you specify a path to the file to write to in the `open()` function:
- `sd-warn-notices.csv` - same folder as the notebook
- `/Users/cjwinchester/Desktop/sd-warn-notices.csv` - on my Desktop
- `../sd-warn-notices.csv` - in the parent folder of this notebook
- etc.

### Extra credit

#### Validate and reformat dates
Figure out how to use the `strptime()` method of the `datetime` class (in the standard library's `datetime` module) to validate the strings of dates by turning them into native datetime objects, then convert them to `YYYY-MM-DD`-formatted dates by calling `.date()` on the result and then the `isoformat()` method.
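
One possible starting point, as a sketch only: the helper name `to_iso_date` is made up, and the `%m/%d/%Y` format is an assumption about how the dates are written, so check the actual strings on the page and adjust the format to match.

```python
from datetime import datetime


def to_iso_date(date_string, date_format='%m/%d/%Y'):
    """Parse a date string and return it as YYYY-MM-DD.

    Raises ValueError if the string doesn't match the format.
    The default format is an assumption, not taken from the page.
    """
    parsed = datetime.strptime(date_string.strip(), date_format)
    return parsed.date().isoformat()


# made-up test values, including one that should fail validation
for raw_value in ('3/5/2024', ' 12/31/2023 ', 'not a date'):
    try:
        print(to_iso_date(raw_value))
    except ValueError:
        print('Could not parse:', repr(raw_value))
```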

#### Write a list of dictionaries instead
Instead of using a `csv.writer()` object to write a list of lists, figure out how to use a _[csv.DictWriter()](https://docs.python.org/3/library/csv.html#csv.DictWriter)_ object to write a list of dictionaries. (You'll also need to change how you collect the data in the `for` loop iterating over the table rows: build a dictionary for each row, not a list, and append it to a list that collects all of the rows.)
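
A minimal sketch of the `DictWriter` pattern, using placeholder records and an in-memory text buffer (`io.StringIO`) in place of a real file so it can run anywhere. In your scraper you'd pass the open file from a `with` block instead.

```python
import csv
import io

# placeholder records, not from the target page
example_fieldnames = ['company', 'location']
example_records = [
    {'company': 'Acme Corp.', 'location': 'Pierre'},
    {'company': 'Widget Co.', 'location': 'Sioux Falls'},
]

# an in-memory stand-in for an open file
example_buffer = io.StringIO(newline='')

dict_writer = csv.DictWriter(example_buffer, fieldnames=example_fieldnames)

# writeheader() writes the fieldnames as the first row
dict_writer.writeheader()

# writerows() writes every dictionary in the list
dict_writer.writerows(example_records)

csv_text = example_buffer.getvalue()
print(csv_text)
```

The dictionary keys have to match the `fieldnames` list; by default, `DictWriter` raises a `ValueError` if a dictionary contains a key that isn't listed there.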

### Links to other resources

- [Tipsheet on inspecting web elements](https://docs.google.com/document/d/12e_9VfNxME02qfSRZU_diF8Qqp6EEvnR-xtudI58GeI/edit)
- [Materials from a December 2023 IRE mini-bootcamp on web scraping with Python](https://github.com/ireapps/python-mini-bootcamp-scraping-2023) (includes completed scrapers you can use for reference)
- [First GitHub Scraper](https://palewi.re/docs/first-github-scraper/) (great option for putting your scrapers on a timer)
- [Tipsheet on saving HTML files before scraping them](https://docs.google.com/document/d/1SMpxt2b1ClEjLBZU2cg0Y2-yqxG_UoNq5f59h_5M4P8/edit)
- [Tipsheet with some miscellaneous scraping tips](https://docs.google.com/document/d/1-D1GhYJuOus7tXomPPYACaaer6He41cdcl_7J3rsqIk/edit)
- [Leon Yin's Inspect Element website](https://inspectelement.org/)