# Web Scraping Walk-Through

Often the data we want isn't available in a downloadable format or an easily accessible web API. Yet, if we can see the information we want on a public website, we can usually get that information by scraping it.

In this walk-through, we will:

1. Discuss fundamental web technologies
2. Introduce third-party Python packages that support the web-scraping workflow
3. Write some code that scrapes data from the Missouri State Highway Patrol's [website](https://www.mshp.dps.missouri.gov/HP68/SearchAction)

## Don't do this unless you really need to

Scraping a website should be your last resort. Except in extraordinary circumstances, you should work within the usual guidelines for public information requests.

Talk to the people who run the website you want to scrape, and ask them to provide you with their data, which is most likely in a relational database system (e.g., SQL Server, MySQL, Oracle). Be patient, kind and respectful.

## How the web works

When your web browser loads a website and you point and click your way around the pages of that site, there are several technologies working behind the scenes that make all this possible. To do web scraping, it's helpful to have a good idea of [how the web works](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works). 

In particular, when we scrape a website, we interact directly with (at the very least) two web technologies.

### HTTP

The Hypertext Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP)) is the foundation for exchanging data on the web. It sets the rules for how clients like your preferred web browser (e.g., Firefox, Safari, Chrome) communicate with web servers that host the sites you visit.

The HTTP flow starts when a client sends a request to a server. All requests have the following parts:

1. A [method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) indicating the user's desired action.
2. A path to a resource, indicated by a Uniform Resource Locator (aka, [URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL)).
3. Optional headers that convey additional information to the server.
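
To make the three parts concrete, here is a minimal sketch that builds a request without sending it, using `urllib.request` from Python's standard library. The `User-Agent` value is a made-up example, not something the site requires.

```python
from urllib.request import Request

# Build (but do not send) a request so we can inspect its parts.
# The User-Agent value below is a made-up example.
req = Request(
    'https://www.mshp.dps.missouri.gov/HP68/SearchAction',
    headers={'User-Agent': 'example-scraper/0.1'},
    method='GET',
)

print(req.get_method())    # 1. the method
print(req.host)            # the server the request is addressed to
print(req.selector)        # 2. the path to the resource
print(req.header_items())  # 3. the optional headers
```

Nothing travels over the network here. The object only describes what would be sent.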

### HTML

The second fundamental web technology we interact with in web scraping is Hypertext Markup Language (aka, [HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics)). A web page is a document written in HTML, which gives structure and meaning to its content. In an HTML document, content is annotated (hence, the "Markup" part) using tags like [`<h1>` through `<h6>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/Heading_Elements) for headings and [`<p>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p) for paragraphs.

Tags, plus their optional attributes and the content that they wrap, form [elements](https://developer.mozilla.org/en-US/docs/Glossary/element). Elements are what give HTML documents their structure, which makes the content within them *easier* to parse than, say, unstructured text. Parsing still requires some work from us, though.
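
As a rough illustration of tags, attributes and wrapped content, the sketch below walks a tiny, made-up HTML snippet with the standard library's `html.parser` module and records each piece it meets. The `ElementLogger` class and the snippet are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class ElementLogger(HTMLParser):
    """Record start tags (with attributes), text content and end tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, dict(attrs)))

    def handle_data(self, data):
        # Skip whitespace-only text between elements
        if data.strip():
            self.events.append(('text', data.strip()))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


snippet = '<h1 class="title">Crash Reports</h1><p>Search by name.</p>'
logger = ElementLogger()
logger.feed(snippet)
logger.close()

for event in logger.events:
    print(event)
```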

## Essential Python packages

A couple of third-party Python packages are widely used in web scraping to handle the two web technologies described above.

### [Requests](https://requests.readthedocs.io/en/master/) for managing HTTP requests and responses

Beyond sending requests and reading responses, Requests handles other intricacies, including sessions, cookies, and URL and form encoding.

We can install Requests using pip, Python's default package manager:

```sh
pip install requests
```

Then we can import the main module:

In [62]:
import requests

#### Making a `GET` Request

The most common HTTP request method is [`GET`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/GET), which simply gets you a copy of the requested resource located at the URL.


Let's make a `GET` request for the [home page](https://www.mshp.dps.missouri.gov/HP68/SearchAction) of the State Highway Patrol website. The convention is to store the response in a variable called `r`.

In [63]:
r = requests.get('https://www.mshp.dps.missouri.gov/HP68/SearchAction')

The [`.get`](https://2.python-requests.org/en/master/api/#requests.get) function has one required argument, which is the URL we want to get. It returns a [`requests.Response`](https://2.python-requests.org/en/master/api/#requests.Response) object that represents the server's response.

We can check if we got an "okay" response:

In [64]:
r.ok

True

We can also check the specific [response status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status):

In [65]:
r.status_code

200
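
For a sense of what other status codes mean, the standard library's `http.HTTPStatus` enum maps each code to its standard phrase. The sketch below is only illustrative: `describe` and `looks_ok` are hypothetical helpers, and `looks_ok` assumes the documented meaning of `r.ok` (true for any status code below 400).

```python
from http import HTTPStatus


def describe(status_code):
    """Return a short label such as '404 Not Found' for a status code."""
    status = HTTPStatus(status_code)
    return f'{status.value} {status.phrase}'


def looks_ok(status_code):
    # Assumed to mirror Response.ok: True for status codes below 400
    return status_code < 400


for code in (200, 404, 500):
    print(describe(code), looks_ok(code))
```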

We can also access the content of the response, that is, all the HTML that makes up the web page.

In [66]:
r.content

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\r\n<html>\r\n<head>\r\n\r\n\r\n\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\r\n<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">\r\n<meta http-equiv="Content-Style-Type" content="text/css">\r\n<link href="theme/Master.css" rel="stylesheet" type="text/css">\r\n<link rel="stylesheet" type="text/css" href="theme/print.css" media="print">\r\n<!--[if IE]> <link href="theme/ie.css" rel="stylesheet" type="text/css"> <![endif]-->\r\n\r\n<title>Missouri State Highway Patrol - Crash Reports</title>\r\n\r\n</head>\r\n<body id="body" onload="document.searchForm.searchFirst.focus();">\r\n\r\n\r\n<div id="headerContainer">\r\n\t<div id="headerLeft"><a href="/HP68/search.jsp">\r\n\t<img src="https://www.mshp.dps.missouri.gov/MSHPWeb/Images/newMainHeader.jpg" alt="MSHP Emblem" name="MSHP_Emblem" />\r\n\t</a></div>\r\n\t<!--<div id="header">\r\n\t  <a href="http://www.missouriamberalert.com/current

This content is represented in Python as [`bytes`](https://docs.python.org/3/library/stdtypes.html#bytes). If we want the HTML as a string, we can use the response's `.text` attribute.

In [67]:
r.text

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\r\n<html>\r\n<head>\r\n\r\n\r\n\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\r\n<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">\r\n<meta http-equiv="Content-Style-Type" content="text/css">\r\n<link href="theme/Master.css" rel="stylesheet" type="text/css">\r\n<link rel="stylesheet" type="text/css" href="theme/print.css" media="print">\r\n<!--[if IE]> <link href="theme/ie.css" rel="stylesheet" type="text/css"> <![endif]-->\r\n\r\n<title>Missouri State Highway Patrol - Crash Reports</title>\r\n\r\n</head>\r\n<body id="body" onload="document.searchForm.searchFirst.focus();">\r\n\r\n\r\n<div id="headerContainer">\r\n\t<div id="headerLeft"><a href="/HP68/search.jsp">\r\n\t<img src="https://www.mshp.dps.missouri.gov/MSHPWeb/Images/newMainHeader.jpg" alt="MSHP Emblem" name="MSHP_Emblem" />\r\n\t</a></div>\r\n\t<!--<div id="header">\r\n\t  <a href="http://www.missouriamberalert.com/current.
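
The difference between the two is a decoding step: bytes become a string once a character encoding is applied. Requests chooses an encoding for `.text` (exposed as `r.encoding`), typically based on the response headers. The sketch below shows the same step by hand on a made-up byte string, assuming the ISO-8859-1 charset that this page declares in its `<meta>` tag.

```python
# A made-up sample standing in for r.content
raw = b'Caf\xe9 at I-49'

# Decoding with the page's declared charset yields a str, like r.text
text = raw.decode('ISO-8859-1')
print(type(raw).__name__, type(text).__name__)
print(text)

# Decoding with the wrong charset can fail outright
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('these bytes are not valid UTF-8')
```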

When viewed as raw HTML, the web page is difficult to read and a bit overwhelming. If we write the content to a local file, we can open it in our web browser just the same as if we were pointing the web browser at the URL.

In [68]:
with open('index.html', 'w') as f:
    f.write(r.text)

#### Making a `POST` request

After `GET`, the second most common HTTP request method is [`POST`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST).

A `POST` request allows a user agent to send data to a server, which typically happens via a web form, such as the search form on the Missouri State Highway Patrol's home page.

Granted, sending a `POST` request for the purposes of getting search results is a little confusing. Ideally, this form would send `GET` requests. However, for a really complex search form or one that involves submitting sensitive data, the `POST` method might be necessary.

To figure out which method a search form requires, use your browser's web inspector. You can either check the `method` attribute of the `<form>` element in the HTML or watch the network tab for the request and check the method there.

The web inspector also reveals other essential information: the names of the form fields. Again, these can be found either:

- in the HTML (look at the `name` attribute on the form's various input fields); or
- in the body of the `POST` request (the network tab lists it as form data or the request payload).

We need to know the names of the form fields and what values they will accept because we need to provide this information in our request for search results.
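
The same inspection can be sketched in code. The example below uses the standard library's `html.parser` to pull the method and field names out of a form. The `sample_form` markup is a simplified, made-up stand-in for the real search form (the element types are assumptions), and `FormInspector` is a hypothetical helper.

```python
from html.parser import HTMLParser


class FormInspector(HTMLParser):
    """Collect a form's method and the names of its fields."""

    def __init__(self):
        super().__init__()
        self.method = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            # HTML forms default to GET when no method is given
            self.method = (attrs.get('method') or 'get').upper()
        elif tag in ('input', 'select', 'textarea') and attrs.get('name'):
            self.field_names.append(attrs['name'])


# Simplified, made-up markup standing in for the real search form
sample_form = '''
<form name="searchForm" action="/HP68/SearchAction" method="post">
  <input type="text" name="searchFirst">
  <select name="searchInjury"><option value="FATAL">Fatal</option></select>
  <input type="submit" value="Search">
</form>
'''

inspector = FormInspector()
inspector.feed(sample_form)
inspector.close()

print(inspector.method)
print(inspector.field_names)
```

The submit button has no `name` attribute, so it is skipped. Only named fields are sent with the form.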

In the Python Requests library, this search info is specified via the `data` keyword argument when calling the [`.post()`](https://2.python-requests.org/en/master/api/#requests.post) function.

For instance, here's how to search for all the crashes with a fatal injury type.

In [69]:
r = requests.post(
    'https://www.mshp.dps.missouri.gov/HP68/SearchAction',
    data={'searchInjury': 'FATAL'},
)
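
Behind the scenes, a dictionary passed as `data` is form-encoded into the body of the request. A minimal sketch of that encoding, using the standard library rather than Requests itself, looks like this. The `searchCounty` field name is hypothetical and only shows how spaces are handled.

```python
from urllib.parse import parse_qs, urlencode

# The same field and value used in the search above
form_data = {'searchInjury': 'FATAL'}
body = urlencode(form_data)
print(body)

# Decoding the body recovers the field names and values
print(parse_qs(body))

# 'searchCounty' is a hypothetical field name, used to show space handling
print(urlencode({'searchCounty': 'CAPE GIRARDEAU'}))
```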

If the response is okay, we can write this to another local file.

In [70]:
if r.ok:
    with open('results.html', 'w') as f:
        f.write(r.text)

### [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing HTML

BeautifulSoup is one of the most popular Python packages for getting data out of HTML.

To install this package, go back to your terminal and stop the Jupyter server (press "Ctrl" and "C" at the same time, then, when prompted, press "y", then "Return"). Once your command prompt is back up, run:

```sh
pip install beautifulsoup4
```

The parsing process begins by creating a [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup) object.

First we import this class.

In [2]:
from bs4 import BeautifulSoup

Then we create an instance of this class by passing in the HTML (here, the raw bytes of the response) along with the name of the parser we want to use.

In [72]:
soup = BeautifulSoup(r.content, 'html.parser')

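Technically, BeautifulSoup is an interface to lower-level Python parsing libraries. The [`html.parser`](https://docs.python.org/3/library/html.parser.html) module is part of Python's standard library, so it needs no extra installation. If you don't name a parser, BeautifulSoup picks the best one it finds installed (lxml, then html5lib, then `html.parser`) and warns you that it guessed. The other options: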

- [lxml](https://lxml.de/) which tends to be faster
- [html5lib](https://html5lib.readthedocs.io/en/latest/) which tends to be more lenient (i.e., tolerant of html docs with unclosed tags and other sub-standard syntax)

BeautifulSoup's docs have more details about [how to install alternate parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) and the [differences](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) between them.
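
As a hedged sketch of how the parser choice is expressed in code, the example below parses a small, made-up table with each parser name in turn and skips any that aren't installed. Only `html.parser` is guaranteed to be available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A small, made-up table standing in for a real page
html = '<table><tr><th>Report</th><th>Name</th></tr></table>'

headers_by_parser = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed
        print(parser, 'is not installed')
        continue
    headers_by_parser[parser] = [th.text for th in soup.find_all('th')]
    print(parser, headers_by_parser[parser])
```

Naming the parser explicitly keeps results consistent between machines that have different parser libraries installed.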



In [73]:
type(soup)

bs4.BeautifulSoup

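The search results are in the page's first `<table>` element, which `.find()` returns as a single `Tag`.

In [74]: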
table = soup.find('table')

In [75]:
type(table)

bs4.element.Tag

In [76]:
table.find_all('th')

[<th class="headerSortPrint">Report</th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=name';document.sortForm.submit();">Name
 				
 				</a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=age';document.sortForm.submit();">Age
 				
 				</a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=city';document.sortForm.submit();">Person City/State
 				
 				</a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=injury';document.sortForm.submit();">Personal Injury
 				
 				</a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=date';document.sortForm.submit();">Date
 				
 					<img alt="Sort down" src="/HP68/static/images/down.gif"/>
 </a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=

In [77]:
th_all = table.find_all('th')

In [78]:
type(th_all)

bs4.element.ResultSet

In [79]:
for th in th_all:
    print(th.text)

Report
Name
				
				
Age
				
				
Person City/State
				
				
Personal Injury
				
				
Date
				
					

Time
				
				
Crash County
				
				
Crash Location
				
				
Troop
				
				


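The raw header text carries stray whitespace, so we strip it and convert each label to a lowercase, underscore-separated name.

In [80]: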
headers = []

In [81]:
for th in th_all:
    header = th.text.strip().replace(' ', '_').lower()
    headers.append(header)

In [82]:
headers

['report',
 'name',
 'age',
 'person_city/state',
 'personal_injury',
 'date',
 'time',
 'crash_county',
 'crash_location',
 'troop']
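
Note that `'person_city/state'` still contains a slash. If we wanted names made up only of letters, digits and underscores, one possible approach (a sketch, with `normalize_header` as a hypothetical helper) is a regular expression that collapses every run of other characters into a single underscore.

```python
import re


def normalize_header(text):
    """Turn a raw header label into a lowercase, underscore-separated name."""
    lowered = text.strip().lower()
    # Replace each run of characters that aren't letters or digits with '_'
    return re.sub(r'[^a-z0-9]+', '_', lowered).strip('_')


raw_headers = ['Report', 'Person City/State', 'Date\n\t\t\t\t\n\t\t\t\t\t\n']
print([normalize_header(h) for h in raw_headers])
```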

The first row of the table holds the headers, so we slice it off and keep only the data rows.

In [83]:
tr_all = table.find_all('tr')[1:]

In [84]:
for tr in tr_all:
    for td in tr.find_all('td'):
        print(td.text.strip())
    print('---------------')

View
FRAZIER, ZACKERY D
43
ADRIAN, MO
FATAL
04/08/2020
12:05AM
BATES
I-49 NB AT 143.4MM
A
---------------
View
SELLERS, KELTON R
17
JEFFERSON CITY, MO
FATAL
04/07/2020
2:18PM
COLE
WALNUT ACRES ROAD .25 MILE SOUTH OF STRINGTOWN STATION ROAD
F
---------------
View
LINTON, MICHAEL L
51
NEOSHO, MO
FATAL
04/05/2020
9:25PM
NEWTON
NORTHBOUND I49, 3 MILES NORTH OF NEOSHO
D
---------------
View
SPEARS, BOBBY R
21
OAK RIDGE, MO
FATAL
04/01/2020
8:22PM
CAPE GIRARDEAU
LARCH LANE SOUTH OF ROUTE FF
E
---------------
View
KOENIG, LARRY K
57
FAYETTE, MO
FATAL
03/30/2020
3:59PM
HOWARD
MO 5  0.6 MILES SOUTH OF CO RD 423
F
---------------
View
MAASSEN, BRAD C
40
LABADIE, MO
FATAL
03/28/2020
5:30PM
WASHINGTON
EASTBOUND WOODLAND DRIVE WEST OF ASPEN DRIVE
C
---------------
View
SELLARS, FREDRICK R
28
PITTSBURG, KS
FATAL
03/28/2020
3:35PM
BARTON
US 160 1 MILE EAST OF MINDENMINES
D
---------------
View
JUVENILE
14
MARTHASVILLE, MO
FATAL
03/28/2020
2:59PM
WARREN
CONCORD VIEW RD .4 MILE FROM CONCORD HILL RD
C
-

In [95]:
def clean_row(tds):
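    # The first cell holds the link to the full crash report.
    # Use the function's own argument rather than the global td_all.
    details_url = tds[0].find('a').attrs['href']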
    if tds[1].text.strip() == 'JUVENILE':
        last_name = ''
        first_name = ''
        is_juvenile = True
    else:
        last_name, first_name = tds[1].text.split(',')
        is_juvenile = False
    row = {
        'details_url': details_url,
        'last_name': last_name.strip(),
        'first_name': first_name.strip(),
        'is_juvenile': is_juvenile,
        'age': int(tds[2].text.strip()),
        'person_city_state': tds[3].text.strip(),
        'personal_injury': tds[4].text.strip(),
        'date': tds[5].text.strip(),
        'time': tds[6].text.strip(),
        'crash_county': tds[7].text.strip(),
        'crash_location': tds[8].text.strip(),
        'troop': tds[9].text.strip(),
    }
    
    return row
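
The `date` and `time` values are kept as plain strings here. If a single `datetime` value were more useful, a sketch like the following could combine them. `parse_crash_datetime` is a hypothetical helper, and it assumes an English-language locale so that `%p` matches `AM` and `PM`.

```python
from datetime import datetime


def parse_crash_datetime(date_text, time_text):
    """Combine strings like '04/08/2020' and '12:05AM' into one datetime."""
    return datetime.strptime(
        f'{date_text.strip()} {time_text.strip()}',
        '%m/%d/%Y %I:%M%p',
    )


print(parse_crash_datetime('04/08/2020', '12:05AM'))
print(parse_crash_datetime('04/07/2020', '2:18PM'))
```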

In [96]:
rows = []

In [97]:
for tr in tr_all:
    td_all = tr.find_all('td')
    print(td_all)
    print('--------------')
    row = clean_row(td_all)
    print(row)
    rows.append(row)

[<td class="infoCellPrint"><img alt="See Report Details" src="/HP68/static/images/details.gif"/> <a href="/HP68/AccidentDetailsAction?ACC_RPT_NUM=200168702">View</a></td>, <td class="infoCell">
			
				FRAZIER, ZACKERY D
			
			</td>, <td class="infoCell">43</td>, <td class="infoCell">ADRIAN, MO</td>, <td class="infoCell">FATAL </td>, <td class="infoCell">04/08/2020</td>, <td class="infoCell">12:05AM</td>, <td class="infoCell">BATES</td>, <td class="infoCell">I-49 NB AT 143.4MM</td>, <td class="infoCell">A</td>]
--------------
{'details_url': '/HP68/AccidentDetailsAction?ACC_RPT_NUM=200168702', 'last_name': 'FRAZIER', 'first_name': 'ZACKERY D', 'is_juvenile': False, 'age': 43, 'person_city_state': 'ADRIAN, MO', 'personal_injury': 'FATAL', 'date': '04/08/2020', 'time': '12:05AM', 'crash_county': 'BATES', 'crash_location': 'I-49 NB AT 143.4MM', 'troop': 'A'}
[<td class="infoCell2Print"><img alt="See Report Details" src="/HP68/static/images/details.gif"/> <a href="/HP68/AccidentDetailsAct

ValueError: too many values to unpack (expected 2)
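
This `ValueError` means `split(',')` returned more than two pieces, which happens when a name cell contains more than one comma. One way around it, sketched below with a hypothetical `split_name` helper, is to split at the first comma only. The name `'DOE, JOHN, JR'` is a made-up example, not a record from the site.

```python
def split_name(raw_name):
    """Split 'LAST, FIRST' at the first comma only."""
    last_name, _, first_name = raw_name.strip().partition(',')
    return last_name.strip(), first_name.strip()


print(split_name('FRAZIER, ZACKERY D'))
print(split_name('DOE, JOHN, JR'))  # made-up name with a second comma
print(split_name('JUVENILE'))       # no comma at all
```

Because `str.partition` always returns exactly three parts, the unpacking cannot fail, whether the text has no comma, one comma or several.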

To save the cleaned rows to a file, we can use the [csv](https://docs.python.org/3.8/library/csv.html) module, which is part of Python's standard library.

In [None]:
import csv

In [None]:
dir(rows[0].keys())

In [None]:
with open('crashes.csv', 'w', newline='') as f:
    writer = csv.DictWriter(
        f, fieldnames=rows[0].keys()
    )
    
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
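
To see what `DictWriter` produces without touching the file system, here is a small sketch that writes to an in-memory buffer and reads the result back. The `sample_rows` are abbreviated stand-ins for the real `rows` list.

```python
import csv
import io

# Abbreviated stand-ins for the dictionaries built by clean_row
sample_rows = [
    {'last_name': 'FRAZIER', 'first_name': 'ZACKERY D', 'age': 43},
    {'last_name': '', 'first_name': '', 'age': 14},
]

buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=list(sample_rows[0].keys()))
writer.writeheader()
writer.writerows(sample_rows)  # writes every row in one call

# Read the CSV text back in to check the round trip
buffer.seek(0)
records = list(csv.DictReader(buffer))
for record in records:
    print(record)
```

Values come back as strings (the age `43` becomes `'43'`), so any numeric fields need converting again after reading.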

## Converting your notebook to a script

The [nbconvert](https://pypi.org/project/nbconvert/) tool can convert a notebook into a plain Python script:

```sh
jupyter nbconvert --no-prompt --to python Untitled1.ipynb
```

Do this in case you want to start working in [Sublime Text](https://www.sublimetext.com/), [Atom](https://atom.io/) or [Visual Studio Code](https://code.visualstudio.com/).