# Web scraping with Python

If you need data that's trapped on a website, writing some code to scrape the page could be your solution. This entry-level class will show you how to use the Python programming language to harvest information from websites into a spreadsheet. We'll introduce you to the command line and show you how to write enough code to fetch and parse web content.

### Class outline

- 🐍 [Python basics](#Python-basics)
- 🌐 [HTML basics](#HTML-basics)
- 🛠 [Putting it all together](#Putting-it-all-together)

## Python basics

- [Using Jupyter notebooks](#Using-Jupyter-notebooks)
- [Exceptions](#Exceptions)
- [Basic data types](#Basic-data-types)
    - [Strings](#Strings)
    - [Numbers and math](#Numbers-and-math)
    - [Booleans](#Booleans)
- [Variable assignment](#Variable-assignment)
- [Comments](#Comments)
- [Collections of data](#Collections-of-data)
    - [Lists](#Lists)
    - [Dictionaries](#Dictionaries)
- [Methods](#Methods)
- [The print() function](#The-print()-function)
- [Indentation](#Indentation)
- [`for` loops](#for-loops)
- [`if` statements](#if-statements)

### Using Jupyter notebooks

There are several ways to write and run Python code on your computer. One way -- the method we're using today -- is to use [Jupyter notebooks](https://jupyter.org/), which run in your browser and allow you to intersperse documentation with your code. They're handy for bundling your code with a human-readable explanation of what's happening at each step. Check out some examples from the [L.A. Times](https://github.com/datadesk/notebooks) and [BuzzFeed News](https://github.com/BuzzFeedNews/everything#data-and-analyses).

**To add a new cell to your notebook**: Click the + button in the menu or press the `b` key on your keyboard.

**To run a cell of code**: Select the cell and click the "Run" button in the menu, or you can press Shift+Enter. When you run a cell, Jupyter will display the last value returned from the executed code. (Note that some code does not return anything!)

### Exceptions

Even the most practiced programmers write broken code! When the code you write is invalid, Python will raise an exception -- a detailed report of where and why your program doesn't work.

Take one common gotcha of working with Jupyter notebooks as an example. In a normal Python script, you can reliably reference values after they've been defined. 

In [None]:
my_name = 'Hannah'
my_name

By contrast, Jupyter notebooks don't "know" about code you've written until you've _run_ the cell containing it. If you define a variable called `your_name` in one cell without running it, then try to access that variable from another cell, Python will raise an exception. 

In [None]:
your_name = 'Phil'

In [None]:
your_name

This is a pretty clear exception message: The variable you're trying to reference doesn't have a value. Try running (or re-running) the cell in which you defined `your_name`, and see what happens.

Sometimes, exceptions are less straightforward! You may even see a few as we walk through the rest of this course. If you're stumped, raise your hand and a coach can help get you unstuck. 

When you go back to your newsroom, remember that Google (or the search engine of your choice) is one of a programmer's most valuable tools. Try searching the web for exception messages that trip you up. Odds are, some intrepid hacker that came before can offer a solution. 🙃

### Basic data types
Just like Excel and other data processing software, Python recognizes a variety of data types, including three we'll focus on here:
- Strings (text)
- Numbers (integers, numbers with decimals and more)
- Booleans (`True` and `False`).

A function is a reusable piece of code. Python provides a handful of functions for very common operations, like type checking and coercion, by default. ([View the full list of built-in functions here](https://docs.python.org/3/library/functions.html).)

You can use the built-in [`type()`](https://docs.python.org/3/library/functions.html#type) function to check the data type of a value.

#### Strings

A string is a group of characters -- letters, numbers, whatever -- enclosed within single or double quotes (doesn't matter as long as they match). The code in these notebooks uses single quotes. (The Python style guide doesn't recommend one over the other: ["Pick a rule and stick to it."](https://www.python.org/dev/peps/pep-0008/#string-quotes))

If your string _contains_ apostrophes or quotes, you have two options: _Escape_ the offending character with a backslash `\`:

```python
'Isn\'t it nice here?'
```

... or change the surrounding punctuation:

```python
"Isn't it nice here?"
```

The style guide recommends the latter over the former.

When you call the `type()` function on a string, Python will return `str`.

Calling the [`str()` function](https://docs.python.org/3/library/stdtypes.html#str) on a value will return the string version of that value (see examples below).

In [None]:
'Investigative Reporters and Editors'

In [None]:
type('hello!')

In [None]:
45

In [None]:
type(45)

In [None]:
str(45)

In [None]:
type(str(45))

In [None]:
str(True)

If you "add" strings together with a plus sign `+`, it will concatenate them:

In [None]:
'IRE' + '/' + 'NICAR'
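
One wrinkle worth knowing: `+` only concatenates strings with other strings, so a number has to be converted with `str()` first. Here's a minimal sketch; the names `layoff_count` and `summary` are made up for illustration and aren't used elsewhere in this notebook.

```python
# hypothetical value, just for illustration
layoff_count = 45

# 'Layoffs: ' + layoff_count would raise a TypeError,
# so convert the number to a string before concatenating
summary = 'Layoffs: ' + str(layoff_count)
summary
```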

#### Numbers and math

Python recognizes a variety of numeric data types. Two of the most common are integers (whole numbers) and floats (numbers with decimals).

Calling `int()` on a piece of numeric data (even if it's being stored as a string) will attempt to convert it to an integer; calling `float()` will try to convert it to a float.

In [None]:
12

In [None]:
12.4

In [None]:
type(12)

In [None]:
type(12.4)

In [None]:
int(35.6)

In [None]:
int('45')

In [None]:
float(46)

In [None]:
float('45')
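
It can help to see that `int()` chops off the decimal part rather than rounding. The sketch below compares it with the built-in `round()` function; the variable names are arbitrary.

```python
# int() drops everything after the decimal point
truncated = int(35.6)

# round() goes to the nearest whole number instead
rounded = round(35.6)

# float() turns numeric text into a number with a decimal
as_float = float('45')

# note: int('45.5') would raise a ValueError, because the text isn't a whole number
print(truncated, rounded, as_float)
```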

You can do [basic math](https://www.digitalocean.com/community/tutorials/how-to-do-math-in-python-3-with-operators) in Python. You can also do [more advanced math](https://docs.python.org/3/library/math.html).

In [None]:
4+2

In [None]:
10-9

In [None]:
5*10

In [None]:
1000/10

In [None]:
# ** raises a number to the power of another number
5**2

#### Booleans

Just like in Excel, which has `TRUE` and `FALSE` data types, Python has boolean data types. They are `True` and `False` -- note that only the first letter is capitalized, and they are not sandwiched between quotes.

Boolean values are typically returned when you're evaluating some sort of conditional statement -- comparing values, checking to see if a string is inside another string or if a value is in a list, etc.

[Python's comparison operators](https://docs.python.org/3/reference/expressions.html#comparisons) include:

- `>` greater than
- `<` less than
- `>=` greater than or equal to
- `<=` less than or equal to
- `==` equal to
- `!=` not equal to

In [None]:
True

In [None]:
False

In [None]:
4 > 6

In [None]:
10 == 10

In [None]:
'crapulence' == 'Crapulence'

In [None]:
type(True)
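
A few more comparisons, as a quick sketch of the other cases mentioned above. The variable names are arbitrary; each one ends up holding a boolean.

```python
# is one string inside another string?
is_inside = 'NICAR' in 'IRE/NICAR'

# != asks "are these two values different?"
is_different = 4 != 6

# >= is true when the values are equal, too
at_least = 10 >= 10

print(is_inside, is_different, at_least)
```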

### Variable assignment

The `=` sign assigns a value to a variable name that you choose. Later, you can retrieve that value by referencing its variable name. Variable names can be pretty much anything you want ([as long as you follow some basic rules](https://thehelloworldprogram.com/python/python-variable-assignment-statements-rules-conventions-naming/)).

This can be a tricky concept at first! For more detail, [here's a pretty good explainer from Digital Ocean](https://www.digitalocean.com/community/tutorials/how-to-use-variables-in-python-3).

In [None]:
my_name = 'Frank'

In [None]:
my_name

You can also _reassign_ a different value to a variable name, though it's usually better practice to create a new variable.

In [None]:
my_name = 'Susan'

In [None]:
my_name

A common thing to do is to "save" the results of an expression by assigning the result to a variable.

In [None]:
my_fav_number = 10 + 3

In [None]:
my_fav_number

It's also common to refer to previously defined variables in an expression: 

In [None]:
nfl_teams = 32
mlb_teams = 30
nba_teams = 30
nhl_teams = 31

number_of_pro_sports_teams = nfl_teams + mlb_teams + nba_teams + nhl_teams

In [None]:
number_of_pro_sports_teams

### Comments
A line with a comment -- a note that you don't want Python to interpret -- starts with a `#` sign. These are notes to collaborators and to your future self about what's happening at this point in your script, and why.

Typically you'd put this on the line right above the line of code you're commenting on:

In [None]:
avg_settlement = 40827348.34328237

# coercing this to an int because we don't need any decimal precision
int(avg_settlement)

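Python has no dedicated multi-line comment syntax, but a string sandwiched between triple quotes (or triple apostrophes) that isn't assigned to anything works the same way in practice: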

`'''
this
is a long
comment
'''`

or

`"""
this
is a long
comment
"""`

### Collections of data

Now we're going to talk about two ways you can use Python to group data into a collection: lists and dictionaries.

### Lists

A _list_ is a comma-separated list of items inside square brackets: `[]`.

Here's a list of ingredients, each one a string, that together makes up a salsa recipe.

In [None]:
salsa_ingredients = ['tomato', 'onion', 'jalapeño', 'lime', 'cilantro']

To get an item out of a list, you'd refer to its numerical position in the list -- its _index_ -- inside square brackets immediately following your reference to that list. In Python, as in many other programming languages, counting starts at 0, so the indexes run 0, 1, 2, etc. That means the first item in a list is item `0`.

In [None]:
salsa_ingredients[0]

In [None]:
salsa_ingredients[1]

You can use _negative indexing_ to grab things from the right-hand side of the list -- and in fact, `[-1]` is a common idiom for getting "the last item in a list" when it's not clear how many items are in your list.

In [None]:
salsa_ingredients[-1]

If you wanted to get a slice of multiple items out of your list, you'd use colons (just like in Excel, kind of!).

If you wanted to get the first three items, you'd do this:

In [None]:
salsa_ingredients[0:3]

You could also have left off the initial 0 -- when you leave out the first number, Python starts the slice at the first item in the list. In the same way, if you leave off the last number, the slice runs through the last item in the list.

In [None]:
salsa_ingredients[:3]

Note, too, that this slice is giving us items 0, 1 and 2. The `3` in our slice is the first item we _don't_ want. That can be kind of confusing at first. Let's try a few more:

In [None]:
# everything in the list except the first item
salsa_ingredients[1:]

In [None]:
# the second, third and fourth items
salsa_ingredients[1:4]

In [None]:
# the last two items
salsa_ingredients[-2:]

To see how many items are in a list, use the `len()` function:

In [None]:
len(salsa_ingredients)

### Dictionaries

A _dictionary_ is a comma-separated list of key/value pairs inside curly brackets: `{}`. Let's make an entire salsa recipe:

In [None]:
salsa = {
    'ingredients': salsa_ingredients,
    'instructions': 'Chop up all the ingredients and cook them for awhile.',
    'oz_made': 12
}

To retrieve a value from a dictionary, you'd refer to the name of its key inside square brackets `[]` immediately after your reference to the dictionary:

In [None]:
salsa['oz_made']

In [None]:
salsa['ingredients']

To add a new key/value pair to a dictionary, assign a new key to the dictionary inside square brackets and set the value of that key with `=`:

In [None]:
salsa['tastes_great'] = True

In [None]:
salsa

To delete a key/value pair out of a dictionary, use the `del` command and reference the key:

In [None]:
del salsa['tastes_great']

In [None]:
salsa

### Membership

You can use the [`in` and `not in`](https://docs.python.org/3/reference/expressions.html#membership-test-operations) expressions to test membership in a list or dictionary. These expressions will return booleans (True or False).

In [None]:
'lime' in salsa_ingredients

In [None]:
'cilantro' not in salsa_ingredients

In [None]:
'ingredients' in salsa

In [None]:
'tastes_great' in salsa
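
Note that with a dictionary, `in` checks the _keys_, not the values. A small sketch, using a throwaway `recipe` dictionary (not the `salsa` one above) so it can run on its own:

```python
# a hypothetical dictionary, just for illustration
recipe = {'name': 'salsa', 'oz_made': 12}

# 'name' is a key, so this is True
key_found = 'name' in recipe

# 'salsa' is a value, not a key, so this is False
value_found = 'salsa' in recipe

# to search the values, ask for them explicitly
value_in_values = 'salsa' in recipe.values()

print(key_found, value_found, value_in_values)
```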

### Methods

Let's go back to strings for a second. 

A string is a Python "object". Objects generally have "methods", or reusable bits of code that perform tasks relevant to the object. A method and a function, which we talked about earlier, are essentially the same thing; it's more common to use the term method when talking about functions related to an object.

#### String methods

Python string objects have [useful methods for working with text](https://docs.python.org/3/library/stdtypes.html#string-methods). Let's use an example string to demonstrate.

In [None]:
my_cool_string = '    Hello, friends!'

To see a list of available methods for the object at hand, use the `dir()` function.

In [None]:
dir(my_cool_string)

To learn more about a particular method, consult the Python documentation – or, if you're using an interactive environment like Jupyter, use the `help()` function to view docstrings inline. 

In [None]:
help(my_cool_string.upper)

`upper()` converts the string to uppercase:

In [None]:
my_cool_string.upper()

`lower()` converts to lowercase:

In [None]:
my_cool_string.lower()

`replace()` will replace a piece of text with other text that you specify.

Try using the `help()` function to learn about the `string.replace()`, `string.split()`, and `string.strip()` methods.

Use the `string.replace()` method to transform `my_cool_string` from "Hello, friends" to "Hello, enemies".

Now try using `string.split()` to create a list of words from the phrase in `my_cool_string`.

Finally, use `string.strip()` to get rid of that pesky whitespace in `my_cool_string`.
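
If you get stuck, here's a rough sketch of the same three methods on a different string. The `headline` text is invented for this example, so you'll still need to adapt it for `my_cool_string`.

```python
# a made-up string with extra whitespace on both ends
headline = '   Mayor, council clash over budget   '

# strip() removes whitespace from the beginning and end
trimmed = headline.strip()

# replace() swaps one piece of text for another
softened = trimmed.replace('clash', 'agree')

# split() with no arguments breaks the string apart on whitespace
words = trimmed.split()

print(trimmed)
print(softened)
print(words)
```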

You can "chain" methods to combine their effects -- just tack 'em onto the end. Let's say we wanted to strip whitespace from our string _and_ make it uppercase:

In [None]:
my_cool_string.strip().upper()

Notice, however, that our original string is unchanged:

In [None]:
my_cool_string

Why? Because we haven't assigned the results of anything we've done to a variable. A common thing to do, especially when you're cleaning data, would be to assign the results to a new variable:

In [None]:
my_cool_string_clean = my_cool_string.strip().upper()

In [None]:
my_cool_string_clean

#### List and dictionary methods

Like strings, lists and dictionaries are objects, too!

To add an item to a list, use the [`append()`](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) method:

In [None]:
help(salsa_ingredients.append)

In [None]:
salsa_ingredients.append('mayonnaise')

In [None]:
salsa_ingredients

On the other hand, that may be worse than peas in guacamole. To remove an item from a list, use the `list.pop()` method. 

Use the `help()` function to learn more about how the `list.pop()` method behaves.

In [None]:
help(salsa_ingredients.pop)

Now use `list.pop()` to remove mayonnaise from `salsa_ingredients`.

In [None]:
salsa_ingredients

Dictionaries also have a `dict.pop()` method. Unlike the `list.pop()` method, you must specify the key to remove.

In [None]:
help(salsa.pop)

Use `dict.pop()` to remove `oz_made` from `salsa`.
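
For comparison, here's a sketch of both `pop()` methods side by side. The `toppings` list and `order` dictionary are hypothetical stand-ins, not part of the salsa example.

```python
# made-up data for illustration
toppings = ['cheese', 'olives', 'pineapple']
order = {'size': 'large', 'price': 18}

# with no argument, list.pop() removes and returns the last item
last_topping = toppings.pop()

# dict.pop() needs the key, and returns the value stored under it
price = order.pop('price')

print(last_topping, toppings)
print(price, order)
```

In both cases the original collection is changed, and the removed value is handed back to you in case you want to keep it.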

Organizing code around objects and their methods, or "object-oriented programming", is a common approach in Python – and many other programming languages. You don't need to understand the details of how to write object-oriented code yet, but start to familiarize yourself with the concept of calling methods on objects. It will come up over and over again!

### The `print()` function

So far, we've just been running the notebook cells to get the last value returned by the code we write. Using the [`print()`](https://docs.python.org/3/library/functions.html#print) function is a way to print specific things in your script to the screen. This function is handy for debugging.

To print multiple things on the same line, separate them with a comma.

In [None]:
my_name = 'Megan'

# These values will be _printed_ during execution
print(my_name)
print('Hello ', my_name)

your_name = 'Paige'

# This value will be _returned_
your_name

### Indentation

Whitespace matters in Python. Sometimes you'll need to indent bits of code to make things work. This can be confusing! `IndentationError`s are common even for experienced programmers. (FWIW, Jupyter will try to be helpful and insert the correct amount of "significant whitespace" for you.)

You can use tabs or spaces, just don't mix them. [The Python style guide](https://www.python.org/dev/peps/pep-0008/) recommends indenting your code in groups of four spaces, so that's what we'll use.

### `for` loops

You would use a `for` loop to iterate over a collection of things. The statement begins with the keyword `for` (lowercase), then a temporary `variable_name` of your choice to represent each item as you loop through the collection, then the Python keyword `in`, then the collection you're looping over (or its variable name), then a colon, then the indented block of code with instructions about what to do with each item in the collection.

#### Lists

Let's say we have a list of numbers that we assign to the variable `list_of_numbers`.

In [None]:
list_of_numbers = [1, 2, 3, 4, 5, 6]

We could loop over the list and print out each number. Notice that lists are ordered, so iterating over the list will return the items in that list, in the order they're defined.

In [None]:
for number in list_of_numbers:
    print(number)

We could print out each number, and then print out each number _times 6_:

In [None]:
for number in list_of_numbers:
    print('Number: ', number)
    print('Number times 6: ', number*6)

Note that the variable name `number` in our loop is arbitrary. This would also work:

In [None]:
for banana in list_of_numbers:
    print(banana)

It can be hard, at first, to figure out what's a "Python word" and what's a variable name that you get to define. This comes with practice.
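
Printing is handy for checking your work, but you'll often want to _keep_ the results of a loop. One common pattern, sketched here with an arbitrary list name (`times_six`), is to start with an empty list and `append()` to it on each pass:

```python
list_of_numbers = [1, 2, 3, 4, 5, 6]

# start with an empty list to hold the results
times_six = []

for number in list_of_numbers:
    # add each result to the end of the list
    times_six.append(number * 6)

times_six
```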

#### Strings

Strings are iterable, too. Let's loop over the letters in a sentence:

In [None]:
sentence = 'Hello, IRE/NICAR!'

for letter in sentence:
    print(letter)

Since strings are iterable, you can perform similar operations to lists, like accessing only part of the string using slices – 

In [None]:
# get the first five characters
sentence[:5]

– checking the length of the string with the `len()` function –

In [None]:
# get the length of the sentence
len(sentence)

– or checking the membership of a word or phrase in your string with `in`.

In [None]:
'Hello' in sentence

#### Dictionaries

You can iterate over dictionaries, too. Unlike lists, historically, dictionaries have not been ordered. (Since Python 3.7, dictionaries keep their keys in the order they were inserted, but older code and older versions of Python can't rely on that.)

When you're looping over a dictionary, the variable name in your `for` loop will refer to the keys. Let's loop over our `salsa` dictionary from up above to see what I mean.

In [None]:
for key in salsa:
    print(key)

To get the _value_ of a dictionary item in a for loop, you'd need to use the key to retrieve it from the dictionary:

In [None]:
for key in salsa:
    print(key, salsa[key])
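
Another option is the dictionary's `items()` method, which hands you the key and the value together on each pass through the loop. A minimal sketch, using a small stand-in dictionary (`recipe`) so it runs on its own:

```python
# a hypothetical dictionary, just for illustration
recipe = {'food': 'salsa', 'taste': 'delicious'}

pairs = []

# two variable names, separated by a comma: one for the key, one for the value
for key, value in recipe.items():
    pairs.append(key + ': ' + value)

pairs
```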

If you are working with a dictionary where order is important, Python provides [an OrderedDict object](https://docs.python.org/3/library/collections.html#collections.OrderedDict). To use it, first import it.

In [None]:
from collections import OrderedDict
ordered_dict = OrderedDict([('food', 'salsa'), ('taste', 'delicious')])
ordered_dict

### `if` statements
Just like in Excel, you can use the "if" keyword to handle conditional logic.

These statements begin with the keyword `if` (lowercase), then the condition to evaluate, then a colon, then a new line with a block of indented code to execute if the condition resolves to `True`.

In [None]:
if 4 < 6:
    print('4 is less than 6')

You can also add an `else` statement (and a colon) with an indented block of code you want to run if the condition resolves to `False`.

In [None]:
if 4 > 6:
    print('4 is greater than 6?!')
else:
    print('4 is not greater than 6.')

If you need to, you can add multiple conditions with `elif`.

In [None]:
HOME_SCORE = 6
AWAY_SCORE = 8

if HOME_SCORE > AWAY_SCORE:
    print('we won!')
elif HOME_SCORE == AWAY_SCORE:
    print('we tied!')
else:
    print('we lost!')
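
`if` statements get more useful once you put them inside a `for` loop. As a sketch (the `scores` list is invented for this example), here's one way to keep only the items that pass a test:

```python
# made-up scores for illustration
scores = [6, 8, 3, 10]

high_scores = []

for score in scores:
    # this block is indented twice: once for the loop, once for the condition
    if score > 5:
        high_scores.append(score)

high_scores
```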

## HTML basics
### What _is_ a web page, anyway?

Generally, a web page consists of a bunch of specifically formatted text files stored on a computer (a _server_) that's probably sitting on a rack in a giant data center somewhere.

Mostly you'll be dealing with `.html` (HyperText Markup Language) files that might include references to `.css` (Cascading Style Sheet) files, which determine how the page looks, and/or `.js` (JavaScript) files, which add interactivity, and other specially formatted text files.

Today, we'll focus on the HTML, which gives structure to the page.

Most HTML elements are represented by a pair of tags -- an opening tag and a closing tag.

A table, for example, starts with `<table>` and ends with `</table>`. The first tag tells the browser: "Hey! I got a table here! Render it as a table." The closing tag (note the forward slash!) tells the browser: "Hey! I'm all done with that table, thanks." Inside the table are nested more HTML tags representing rows (`<tr>`) and cells (`<td>`).

HTML elements can have any number of attributes, such as IDs, which uniquely identify elements --

`<table id="first-table">`

-- classes, which identify a type of element --

`<table class="striped-table">`

-- and styles, which define how specific elements appear --

`<table style="width:95%;">`

-- that will be useful to know about when we're scraping. In the best cases, you can extract content by using the id or class already assigned to the element you'd like to extract. An `id` is intended to act as the unique identifier for a specific item on a page. A `class` is used to label a specific type of item on a page, so there may be many instances of a class on a page.

### Inspecting HTML in your browser

You can look at the HTML that makes up a web page by _inspecting the source_ in a web browser. We like Chrome and Firefox for this; today, we'll use Chrome.

#### Inspect element

You can inspect specific elements on the page by right-clicking on the page and selecting "Inspect" or "Inspect Element" from the context menu that pops up. Hover over elements in the "Elements" tab to highlight them on the page. This can be helpful when you're trying to figure out how to uniquely identify the element you want to scrape.

#### View page source

To examine all of the source code that makes up a page, you can "view source." In Chrome, right click anywhere in your browser window, then click "View Page Source."

This will open a new tab showing you all of the HTML code that makes up that page. Ignore 99% of it and try to locate the element(s) that you want to target (use `Ctrl+F` on a PC or `⌘+F` on a Mac to find).

### Practice

Let's give our new skills a whirl. Open up a Chrome browser and navigate to [the first page of Maryland's list of WARN letters](https://www.dllr.state.md.us/employment/warn.shtml). Your goal is to isolate the table element on the page.

There are many ways to grab content from HTML, and every page you scrape data from will require a slightly different trick. At this stage, your job is to find a pattern or identifier in the code for the element you’d like to extract, which we will then give as instructions to our Python code.

Inspect the table element or view the page source. How would you pick the table element out from all of the elements that make up the page? Is it the only table? If not, does it have any attributes, like a class or ID, that would allow you to target it?

## Putting it all together

The remainder of this notebook demonstrates how you can use the Python programming language to scrape information from a web page. The goal today: Scrape the main table on [the first page of Maryland's list of WARN letters](https://www.dllr.state.md.us/employment/warn.shtml) and, if time allows, write the data to a CSV that you can open with Excel.

- [Import libraries](#Import-libraries)
- [Request the page](#Request-the-page)
- [Turn your HTML into soup](#Turn-your-HTML-into-soup)
- [Target and extract data](#Target-and-extract-data)
- [Write the results to file](#Write-the-results-to-file)

### Import libraries

As we learned in the first section, Python provides some broadly useful objects, like strings and lists, and functions, like `type()` and `print()`, by default. For more specialized operations, like making web requests and parsing HTML, we need to import a few modules, from which we can access helpful objects and functions.

Today, we'll use two third-party Python libraries to help us scrape:
- `requests` is the de facto standard for making HTTP requests, similar to what happens when you type a URL into a browser window and hit enter.
- `bs4`, or BeautifulSoup, is a popular library for parsing HTML into a data structure that Python can work with.

**Note:** These libraries are installed separately from Python on a per-project basis. They're already in your working environment for this tutorial. If you want to revisit this tutorial on your own computer, or if you want to create a scraping project of your own, you can [read more about IRE's recommendations for setting up Python projects here](https://docs.google.com/document/d/1cYmpfZEZ8r-09Q6Go917cKVcQk_d0P61gm0q8DAdIdg/edit#heading=h.od2v1nkge5t1).

Use the `import` keyword to import the `requests` and `bs4` modules.

In [None]:
import requests
import bs4

Remember, you can use the `dir()` function to inspect the objects and methods included with a module, and the `help()` function to read more about a particular object or method.

**Note**: Some third-party libraries come with more documentation than others, but popular ones (such as `requests`) tend to be just as helpful as the standard Python library, if not more so!

In [None]:
dir(requests)

In [None]:
help(requests.get)

### Request the page

Next, we'll use the `get()` method of the `requests` library (which we just imported) to grab the web page.

Use the `help()` function to learn more about `requests.get`.

In [None]:
help(requests.get)

First, we'll define a variable `URL` as a string containing the web address we want to scrape.

In [None]:
URL = 'http://www.dllr.state.md.us/employment/warn.shtml'

Next, we'll use `requests.get` to retrieve the URL. Remember that you can assign the output of an expression to a variable, so we'll store the response as a new variable. The variable name is arbitrary, but it's a great idea to use something that describes the value it's pointing to.

In [None]:
warn_page = requests.get(URL)

If you want to make sure that your request was successful, you can check the `status_code` attribute of the Python object that was returned:

In [None]:
warn_page.status_code

A `200` code means all is well. `404` means the page wasn't found, etc. ([Here's one of our favorite lists of HTTP status codes](https://http.cat/) ([or here, if you prefer dogs](https://httpstatusdogs.com/)).)

The object being stored as the `warn_page` variable came back with a lot of potentially useful information we could access. Use the `dir()` function to see all the attributes.

In [None]:
dir(warn_page)

Today, we're mostly interested in the `.text` attribute -- the HTML that makes up the web page, same as if we'd viewed the page source. Let's take a look:

In [None]:
warn_page.text
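
Once you're comfortable with these steps, you might bundle them into a small function of your own. The sketch below is one possible approach, not the only one: `fetch_html` is a name invented for this example, and the 10-second `timeout` is an arbitrary choice. It uses `raise_for_status()`, a method on the response object that raises an exception for error codes like `404` instead of leaving you to check `status_code` by hand.

```python
import requests

def fetch_html(url):
    """Return the HTML at `url` as a string, or raise an exception if the request failed."""
    # give up if the server takes longer than 10 seconds to respond
    response = requests.get(url, timeout=10)

    # raises an exception for 4xx and 5xx status codes
    response.raise_for_status()

    return response.text
```

You could then call it with something like `fetch_html(URL)` and save the result to a variable.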

### ✍️ Try it yourself

Use the code blocks below to experiment with requesting web pages and checking out the HTML that gets returned.

Some ideas to get you started:
- `'http://ire.org'`
- `'https://web.archive.org/web/20031202214318/http://www.tdcj.state.tx.us:80/stat/finalmeals.htm'`
- `'https://www.nrc.gov/reactors/operating/list-power-reactor-units.html'`

### Turn your HTML into soup

The HTML in the `.text` attribute of the response object is just a string -- a big ol' chunk of text.

Before we start targeting and extracting pieces of data in the HTML, we need to turn that chunk of text into a data structure that Python can work with. That's where the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (`bs4`) library comes in.

We'll create a new instance of a `BeautifulSoup` object, which lives under the top-level `bs4` library that we imported earlier. We need to give it two things:
- The HTML we'd like to parse -- `warn_page.text`
- A string with the name of the type of parser to use -- `html.parser` is the parser built into Python and is usually fine, but [there are other options](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser)

We'll save the parsed HTML as a new variable, `soup`.

In [None]:
soup = bs4.BeautifulSoup(warn_page.text, 'html.parser')

Nothing happened, which is good! You can take a look at what `soup` is, but it looks pretty much like `warn_page.text`:

In [None]:
soup

If you want to be sure, you can use the Python function `type()` to check what sort of object you're dealing with:

In [None]:
# the `str` type means a string, or text
type(warn_page.text)

In [None]:
# the `bs4.BeautifulSoup` type means we successfully created the object
type(soup)

### ✍️ Try it yourself

Use the code blocks below to experiment fetching HTML and turning it into soup (if you fetched some pages earlier and saved them as variables, that'd be a good start).

### Target and extract data

Now that we have a BeautifulSoup object loaded up, we can go hunting for the specific HTML elements that contain the data we need. Our general strategy:
1. Find the main table with the data we want to grab
2. Get a list of rows (the `tr` element, which stands for "table row") in that table
3. Use a Python `for` loop to go through each table row and find the data inside it (`td`, or "table data")

To accomplish this, we'll use two `bs4` methods:
- [`find()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find), which returns the first element that matches whatever criteria you hand it
- [`find_all()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all), which returns a _list_ of elements that match the criteria. ([Here's how Python lists work](Python%20syntax%20cheat%20sheet.ipynb#Lists).)

#### Find the table

To start with, we need to find the table. There are several ways to accomplish this, but because this is the only table on the page (view source and `Ctrl+F` to search for `<table` to confirm), we can simply say, "Look through the `soup` object and find the table tag."

Translated, the code is: `soup.find('table')`. While we're at it, save the results of that search to a new variable, `table`.

Run these cells:

In [None]:
table = soup.find('table')

In [None]:
table
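
On a page with more than one table, `find('table')` would only hand back the first one. In that case you could target a table by its `id` or `class` instead. Here's a sketch using a tiny, made-up chunk of HTML (`sample_html`) rather than the WARN page, borrowing the `first-table` and `striped-table` names from the HTML section above. Note that BeautifulSoup spells the keyword `class_`, with a trailing underscore, because `class` is a reserved word in Python.

```python
import bs4

# hypothetical HTML with two tables, just for illustration
sample_html = '''
<table id="first-table"><tr><td>one</td></tr></table>
<table class="striped-table"><tr><td>two</td></tr></table>
'''

sample_soup = bs4.BeautifulSoup(sample_html, 'html.parser')

# find the table whose id attribute is "first-table"
by_id = sample_soup.find('table', id='first-table')

# find the first table with the class "striped-table"
by_class = sample_soup.find('table', class_='striped-table')

print(by_id)
print(by_class)
```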

#### Find the rows in the table

Next, use the `find_all()` method to drill down and get a list of rows in the table:

In [None]:
rows = table.find_all('tr')

In [None]:
rows

To see how many items are in this list -- in other words, how many rows are in the table -- you can use the `len()` function:

In [None]:
len(rows)

#### Loop through the rows and extract the data

Next, we can use a [`for` loop](Python%20syntax%20cheat%20sheet.ipynb#for-loops) to go through the list of rows and start grabbing data from each one.

Quick refresher on _for loop_ syntax: Start with the word `for` (lowercase), then a variable name to stand in for each item in the list that you're looping over, then the word `in` (lowercase), then the name of the list holding the items (`rows`, in our case), then a colon, then an indented block of code describing what we're doing to each item in the list.

Each piece of data in the row will be stored in a `td` tag, which stands for "table data." So inside the loop -- in the indented block -- we'll use the `find_all()` method to get a list of every `td` tag inside the row. And from there, we can access the content inside each tag.

Our goal is to end up with a _list_ of data for each row that we will eventually write out to a file. Typically you'd probably do the work of looping and inspecting the results, step by step, in one code cell. But to show the thinking of how you might approach this (and to practice the syntax), we'll start by just printing out each row and then build from there. (`print('='*80)` will print a line of 80 equals signs -- a way to help us see exactly what we're working with in each row.)

In [None]:
for row in rows:
    print(row)
    print('='*80)

Notice that the first item that prints is the header row with the column labels. You are free to keep these headers if you want, but I typically skip that row and define my own list of column names.

(Another thing to consider: On better-constructed web pages, the cells in the header row will be represented by `th` ("table header") tags, not `td` ("table data") tags. The next step in our `for` loop is, "Find all of the `td` tags in this row," so that would be something you would need to deal with.)
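If you do run into a table with `th` tags in the header row, one possible way to handle it is to pass `find_all()` a list of tag names. This is a sketch on made-up HTML, not the real WARN page, and the names `html`, `demo_soup` and `demo_rows` are invented for the example.

```python
from bs4 import BeautifulSoup

# made-up HTML for illustration -- not the real WARN page
html = '''
<table>
  <tr><th>Date</th><th>Company</th></tr>
  <tr><td>01/02/2020</td><td>Acme Corp.</td></tr>
</table>
'''

demo_soup = BeautifulSoup(html, 'html.parser')
demo_rows = demo_soup.find('table').find_all('tr')

# the header row has no `td` tags, so this search comes back empty
header_tds = demo_rows[0].find_all('td')

# passing a list of tag names matches either kind of cell
header_cells = demo_rows[0].find_all(['th', 'td'])
header_labels = [cell.text.strip() for cell in header_cells]

print(header_tds)
print(header_labels)
```

An empty result from `find_all('td')` is also worth remembering when debugging: asking an empty list for item `[0]` raises an `IndexError`.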

We can skip the first row by using _list slicing_: adding square brackets after the name of the list with some instructions about which items in the list we want to select.

Here, the syntax would be: `rows[1:]`, which means, take everything in the `rows` list starting with the item in position 1 (the second item) to the end of the list. Like many programming languages, Python starts counting at 0, so the result will leave off the first item in the list -- i.e. the item in position 0, i.e. the headers.

In [None]:
for row in rows[1:]:
    print(row)
    print('='*80)
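If slicing is new to you, here's a small sketch on a made-up list (the `letters` list is just for illustration) showing how the numbers on either side of the colon behave:

```python
letters = ['a', 'b', 'c', 'd']

# position 0 is the first item
first = letters[0]

# everything from position 1 to the end -- skips the first item
rest = letters[1:]

# everything up to, but not including, position 2
first_two = letters[:2]

print(first)
print(rest)
print(first_two)
```

Slicing returns a new list, so `letters` itself is unchanged. The same is true of `rows` when we loop over `rows[1:]`.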

Now we're cooking with gas. Let's start pulling out the data in each row. Start by using `find_all()` to grab a list of `td` tags:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    print(cells)
    print('='*80)

Now we have, for each row, a _list_ of `td` tags. The next step is to look at the table and start grabbing specific values based on their position in the list, assigning each one to a human-readable variable name.

Quick refresher on list syntax: To access a specific item in a list, use square brackets `[]` and the index number of the item you'd like to access. For instance, to get the first cell in the row -- the date that each WARN report was issued -- use `[0]`.

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    warn_date = cells[0]
    print(warn_date)
    print('='*80)

This is returning the entire `Tag` object -- we just want the contents inside it. You can access the `.text` attribute of the tag to get the text inside:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    warn_date = cells[0].text
    print(warn_date)

In the second cell of each row (`[1]`), the `.text` attribute will give you the NAICS code. In the third cell (`[2]`) you'll get the name of the business. And so on across the row.

It's also generally good practice to trim off external whitespace for each value, and you can use the Python built-in string method `strip()` to accomplish this as you march across the row.

Which gets us this far:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    warn_date = cells[0].text.strip()
    naics_code = cells[1].text.strip()
    biz = cells[2].text.strip()
    print(warn_date, naics_code, biz)
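To see what `strip()` does and doesn't do, here's a quick sketch on a made-up string (the value of `raw_value` is invented). Using `repr()` makes the whitespace characters visible when printed.

```python
# a made-up value with a newline, spaces and a tab around the edges
raw_value = '\n   Acme   Widgets, Inc.  \t'

# strip() removes whitespace from both ends only
clean_value = raw_value.strip()

print(repr(raw_value))
print(repr(clean_value))
```

Note that the extra spaces _inside_ the string are left alone. `strip()` only trims the outside edges.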

### ✍️ Try it yourself

Now that you've gotten this far, see if you can isolate the other pieces of data in each row.

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    warn_date = cells[0].text.strip()
    naics_code = cells[1].text.strip()
    biz = cells[2].text.strip()
    
    # address
    
    # wia_code
    
    # total_employees
    
    # effective_date
    
    # type_code

    # print()

### 5. Write the results to file

Now that we've targeted our lists of data for each row, we can use Python's built-in [`csv`](https://docs.python.org/3/library/csv.html) module to write each list to a CSV file.

First, import the csv module.

In [None]:
import csv

Now define a list of headers to match the data (each column header will be a string) -- run this cell:

In [None]:
HEADERS = ['warn_date', 'naics_code', 'biz', 'address', 'wia_code',
           'total_employees', 'effective_date', 'type_code']

Now, using something called a `with` block, open a new CSV file to write to and write some code to do the following things:
- Create a `csv.writer` object
- Write out the list of headers using the `writerow()` method of the `csv.writer` object
- Drop in the `for` loop you just wrote and, instead of just printing the contents of each cell, create a list of items and use the `writerow()` method of the `csv.writer` object to write your list of data to file

In [None]:
# create a file called 'warn-data.csv' in write ('w') mode
# pass newline='' so the csv module controls line endings (this avoids blank rows between records on Windows)
# and use the `as` keyword to name the open file handle (the variable name `outfile` is arbitrary)
with open('warn-data.csv', 'w', newline='') as outfile:
    # go to the csv module we imported and make a new .writer object attached to the open file
    # and save it to a variable
    writer = csv.writer(outfile)

    # write out the list of headers
    writer.writerow(HEADERS)
    
    # paste in the for loop you wrote earlier here -- watch the indentation!
    # it should be at this indentation level =>
    # for row in rows[1:]:
    #     cells = row.find_all('td')
    #     etc. ...
    # but at the end, instead of `print(warn_date, naics_code, ...etc.)`
    # make it something like
    # data_out = [warn_date, naics_code, ...etc.]
    # `writer.writerow(data_out)`

If you look in the folder, you should see a new file: `warn-data.csv`. Hooray!
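If you'd like to see the `csv.writer` pattern on its own before wiring it up to the scraper, here's a sketch using made-up data. It writes to an in-memory `io.StringIO` object instead of a real file so you can inspect the result directly. The names `sample_headers`, `sample_rows` and `demo_writer` are invented for this example.

```python
import csv
import io

# made-up headers and rows, standing in for the scraped data
sample_headers = ['warn_date', 'biz']
sample_rows = [
    ['01/02/2020', 'Acme Corp.'],
    ['03/04/2020', 'Widgets, Inc.'],
]

# io.StringIO behaves like an open text file but lives in memory
buffer = io.StringIO()
demo_writer = csv.writer(buffer)

# write the header row first, then one row per list
demo_writer.writerow(sample_headers)
for sample_row in sample_rows:
    demo_writer.writerow(sample_row)

csv_text = buffer.getvalue()
print(csv_text)
```

Notice that the writer wraps `Widgets, Inc.` in quotes because the value contains a comma. That's one reason to use the `csv` module rather than joining strings with commas yourself.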

🎉 🎉 🎉

### ✍️ Try it yourself

Putting it all together:
- Find a website you'd like to scrape
- Use `requests` to fetch the HTML
- Use `bs4` to parse the HTML and isolate the data you're interested in
- Use `csv` to write the data to file

### Extra credit problems

1. **Remove internal whitespace:** Looking over the data, you probably noticed that some of the values have some unnecessary internal whitespace, which you could fix before you wrote each row to file. Python does not have a built-in string method to remove internal whitespace, unfortunately, but [Googling around](https://www.google.com/search?q=python+remove+internal+whitespace) will yield you a common strategy: Using the `split()` method to separate individual words in the string, then `join()`ing the resulting list on a single space. As an example:

```python
my_text = 'hello     world      how are      you?'

# split() will turn this into a list of words
my_text_words = my_text.split()
# ['hello', 'world', 'how', 'are', 'you?']

# join on a single space
my_text_clean = ' '.join(my_text_words)
print(my_text_clean)
# prints 'hello world how are you?'

# or, as a one-liner
my_text_clean = ' '.join(my_text.split())
```

2. **Fetch multiple years:** The table we scraped has WARN notices for the current year, but the agency also maintains pages with WARN notices for previous years -- there's a list of them in a section [toward the bottom of the page](https://www.dllr.state.md.us/employment/warn.shtml). See if you can figure out how to loop over multiple pages and scrape the contents of each into a single CSV.


3. **Build a lookup table:** Each numeric code in the "WIA Code" column corresponds to a local area. See if you can figure out how to create a lookup dictionary that maps the numbers to their locations, then as you're looping over the data table, replace the numeric value in that column with the name of the local area instead. Here's a hint:

```python
lookup_dict = {
    '1': 'hello',
    '2': 'world'
}

print(lookup_dict.get('1'))
# prints 'hello'

print(lookup_dict.get('3'))
# prints None
```
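One more thing that might help: `get()` accepts an optional second argument to return when the key is missing. Here's a sketch that falls back to the original code rather than `None`. The area names and codes below are made up, not the real WIA codes.

```python
# made-up mapping -- these are NOT the real WIA codes
area_names = {
    '1': 'North County',
    '2': 'South County'
}

# made-up codes, including one that isn't in the dictionary
codes = ['1', '2', '9']

labels = []
for code in codes:
    # if the code isn't a key, keep the code itself instead of None
    labels.append(area_names.get(code, code))

print(labels)
```

Falling back to the raw code means an unexpected value still shows up in your CSV instead of silently turning into an empty cell.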


4. **Fix encoding errors:** You might have noticed a few garbled characters -- e.g., `Nestlé` is being rendered as `NestlÃ©`. This is an encoding problem -- the text in `warn_page.text` was not decoded as `utf-8`. Using `decode()` and `encode()`, see if you can fix this. (Hint! It looks like the state of Maryland is a big fan of `latin-1`.)
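If you get stuck, here's one possible approach sketched on a standalone string rather than on the page itself. It assumes the garbling came from `utf-8` bytes being read as `latin-1`, which is what this particular pattern of characters suggests. The variable names are made up for the example.

```python
# the garbled text, as it might appear in the scraped data
garbled = 'NestlÃ©'

# turn the text back into the raw bytes it came from
raw_bytes = garbled.encode('latin-1')

# then read those bytes the way they were meant to be read
fixed = raw_bytes.decode('utf-8')

print(fixed)
```

How you apply this to the scraped page is up to you. You could repair each value as you loop, or fix the text once before handing it to `bs4`.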