# Web Scraping Walk-Through

Often the data we want isn't available in a downloadable format or an easily accessible web API. Yet, if we can see the information we want on a public website, we can usually get that information by scraping it.

In this walk-through, we will:

1. Discuss fundamental web technologies
2. Introduce third-party Python packages that support the web-scraping workflow
3. Write some code that scrapes data from the Missouri State Highway Patrol's [website](https://www.mshp.dps.missouri.gov/HP68/SearchAction)

## Don't do this unless you really need to

Scraping a website should be your last resort. Except in extraordinary circumstances, you should first work through the usual public information request process.

Talk to the people who run the website you want to scrape, and ask them to provide you with their data, which is most likely in a relational database system (e.g., SQL Server, MySQL, Oracle). Be patient, kind and respectful.

## How the web works

When your web browser loads a website and you point and click your way around the pages of that site, there are several technologies working behind the scenes that make all this possible. To do web scraping, it's helpful to have a good idea of [how the web works](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works). 

In particular, when we scrape a website, we interact directly with (at the very least) two web technologies.

### HTTP

The Hypertext Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP)) is the foundation for exchanging data on the web. It sets the rules for how clients like your preferred web browser (e.g., Firefox, Safari, Chrome) communicate with web servers that host the sites you visit.

The HTTP flow starts when a client sends a request to a server. All requests have the following parts:

1. A [method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) indicating the user's desired action.
2. A path to a resource, indicated by a Uniform Resource Locator (aka, [URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL)).
3. Optional headers that convey additional information to the server.
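To make those three parts concrete, here is a minimal sketch that builds a request object without sending it, using only Python's standard library. The `User-Agent` value is a made-up placeholder, not something the site requires.

```python
from urllib.parse import urlsplit
from urllib.request import Request

# Build (but do not send) a request so we can look at its parts.
url = 'https://www.mshp.dps.missouri.gov/HP68/SearchAction'
req = Request(
    url,
    headers={'User-Agent': 'example-scraper/0.1'},  # hypothetical value
    method='GET',
)

print(req.get_method())             # 1. the method: GET
print(urlsplit(req.full_url).path)  # 2. the path: /HP68/SearchAction
print(req.header_items())           # 3. the optional headers
```

Nothing goes over the network here; the object only describes what would be sent.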

### HTML

The second fundamental web technology we interact with in web scraping is Hypertext Markup Language (aka, [HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics)). A web page is a document written in HTML, which gives structure and meaning to its content. In an HTML document, content is annotated (hence, the "Markup" part) using tags like [`<h1>` through `<h6>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/Heading_Elements) for headings and [`<p>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p) for paragraphs.

Tags, plus their optional attributes and the content that they wrap, form [elements](https://developer.mozilla.org/en-US/docs/Glossary/element). Elements are what give HTML documents their structure, which makes the content within them *easier* to parse than, say, unstructured text. It still requires some work from us, though.
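As a rough illustration of tags, attributes and wrapped content, the sketch below walks a tiny, made-up HTML snippet with the standard library's `html.parser` module. The `ElementLister` class and the snippet are invented for this example.

```python
from html.parser import HTMLParser


class ElementLister(HTMLParser):
    """Record each start tag with its attributes, plus any non-blank text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(('tag', tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.events.append(('text', data.strip()))


# A made-up snippet, not taken from any real page.
html = '<h1 class="title">Crash Reports</h1><p>Search by <a href="/search">name</a>.</p>'

lister = ElementLister()
lister.feed(html)
for event in lister.events:
    print(event)
```

Each printed tuple is either a tag with its attributes or a piece of text that some element wraps.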

## Essential Python packages

A couple of third-party Python packages are widely used in web scraping to handle the two web technologies described above.

### [Requests](https://requests.readthedocs.io/en/master/) for managing HTTP requests and responses

Beyond sending requests and reading responses, Requests also handles other intricacies, including sessions, cookies, and URL and form encoding.

We can install Requests using pip, Python's default package manager:

```sh
pip install requests
```

Then we can import the main module:

In [1]:
import requests

#### Making a `GET` Request

The most common HTTP request method is [`GET`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/GET), which simply gets you a copy of the requested resource located at the URL.


Let's make a `GET` request for the [home page](https://www.mshp.dps.missouri.gov/HP68/SearchAction) of the State Highway Patrol website. The convention is to store the response in a variable called `r`.

In [2]:
r = requests.get('https://www.mshp.dps.missouri.gov/HP68/SearchAction')

The [`.get`](https://2.python-requests.org/en/master/api/#requests.get) function call has one required argument, which is the URL we want to get. This function returns a [`requests.Response`](https://2.python-requests.org/en/master/api/#requests.Response) object that represents the server's response.

We can check if we got an "okay" response:

In [3]:
r.ok

True

We can also check the specific [response status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status):

In [4]:
r.status_code

200
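Status codes fall into families by their first digit. The sketch below uses the standard library's `http.HTTPStatus` to label a few codes; the `describe` helper is a name invented for this example. For comparison, Requests reports `r.ok` as `True` for any status code below 400.

```python
from http import HTTPStatus

# Families of status codes, keyed by the first digit.
KINDS = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}


def describe(status_code):
    """Return the code, its standard phrase and its family."""
    phrase = HTTPStatus(status_code).phrase
    return f'{status_code} {phrase} ({KINDS[status_code // 100]})'


print(describe(200))  # 200 OK (success)
print(describe(404))  # 404 Not Found (client error)
```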

We can also access the content of the response, that is, all the HTML that makes up the web page.

In [5]:
r.content

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>\n<head>\n\n\n\n\t\n\t\n\t\n\t\n\t\n\n<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">\n<meta http-equiv="Content-Style-Type" content="text/css">\n<link href="theme/Master.css" rel="stylesheet" type="text/css">\n<link rel="stylesheet" type="text/css" href="theme/print.css" media="print">\n<!--[if IE]> <link href="theme/ie.css" rel="stylesheet" type="text/css"> <![endif]-->\n\n<title>Missouri State Highway Patrol - Crash Reports</title>\n\n</head>\n<body id="body" onload="document.searchForm.searchFirst.focus();">\n\n\n<div id="headerContainer">\n\t<div id="headerLeft"><a href="/HP68/search.jsp">\n\t<img src="https://www.mshp.dps.missouri.gov/MSHPWeb/Images/newMainHeader.jpg" alt="MSHP Emblem" name="MSHP_Emblem" />\n\t</a></div>\n\t<!--<div id="header">\n\t  <a href="http://www.missouriamberalert.com/current.php"><img class="align" src="http://www.missouriamberaler

This content is represented in Python as [`bytes`](https://docs.python.org/3/library/stdtypes.html#bytes). If we want the HTML as a string, we can use the response's `.text` attribute.

In [6]:
r.text

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>\n<head>\n\n\n\n\t\n\t\n\t\n\t\n\t\n\n<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">\n<meta http-equiv="Content-Style-Type" content="text/css">\n<link href="theme/Master.css" rel="stylesheet" type="text/css">\n<link rel="stylesheet" type="text/css" href="theme/print.css" media="print">\n<!--[if IE]> <link href="theme/ie.css" rel="stylesheet" type="text/css"> <![endif]-->\n\n<title>Missouri State Highway Patrol - Crash Reports</title>\n\n</head>\n<body id="body" onload="document.searchForm.searchFirst.focus();">\n\n\n<div id="headerContainer">\n\t<div id="headerLeft"><a href="/HP68/search.jsp">\n\t<img src="https://www.mshp.dps.missouri.gov/MSHPWeb/Images/newMainHeader.jpg" alt="MSHP Emblem" name="MSHP_Emblem" />\n\t</a></div>\n\t<!--<div id="header">\n\t  <a href="http://www.missouriamberalert.com/current.php"><img class="align" src="http://www.missouriamberalert
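The difference between bytes and text comes down to character encoding. As a small sketch, the made-up bytes below stand in for `r.content`, and ISO-8859-1 is used because that is the charset the page declares in its `<meta>` tag.

```python
# Stand-in for r.content: text encoded as ISO-8859-1 bytes.
raw = 'Café crash report'.encode('iso-8859-1')
print(raw)  # b'Caf\xe9 crash report'

# Decoding with the matching encoding recovers the string.
text = raw.decode('iso-8859-1')
print(text)  # Café crash report

# Decoding with the wrong encoding can fail outright.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as error:
    print('utf-8 failed:', error.reason)
```

With a real response, `r.encoding` shows which encoding Requests used to produce `r.text`.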

When viewed as raw HTML, the web page is difficult to read and a bit overwhelming. If we write the content to a local file, we can open it in our web browser just the same as if we were pointing the web browser at the URL.

In [7]:
with open('index.html', 'w') as f:
    f.write(r.text)

#### Making a `POST` request

After `GET`, the second most common HTTP request method is [`POST`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST).

A `POST` request allows a user agent to send data to a server, which typically happens via a web form, such as the search form on the Missouri State Highway Patrol's home page.

Granted, sending a `POST` request for the purposes of getting search results is a little confusing. Ideally, this form would send `GET` requests. However, for a really complex search form or one that involves submitting sensitive data, the `POST` method might be necessary.

To figure out which method a search form requires, use your browser's web inspector. You can either check the `method` attribute of the `<form>` element in the HTML or watch the network tab for the request and check the method there.

The web inspector also reveals other essential information: the names of the form fields. Again, these can be found either:

- in the HTML (look at the `name` attribute on the form's various input fields); or
- in the body (payload) of the `POST` request, as shown in the network tab.
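The same inspection can be approximated in code. The sketch below reads a form's method and field names from an HTML string using the standard library's `html.parser`. The `FormInspector` class and the markup are hypothetical; the real search form has more fields and may be laid out differently.

```python
from html.parser import HTMLParser


class FormInspector(HTMLParser):
    """Collect each form's method plus the names of its fields."""

    FIELD_TAGS = {'input', 'select', 'textarea'}

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            # HTML forms default to GET when no method is given.
            method = (attrs.get('method') or 'get').upper()
            self.forms.append({'method': method, 'fields': []})
        elif tag in self.FIELD_TAGS and self.forms and attrs.get('name'):
            # Simplification: credit the field to the most recent form.
            self.forms[-1]['fields'].append(attrs['name'])


# Hypothetical markup for illustration only.
html = '''
<form name="searchForm" method="post" action="/HP68/SearchAction">
  <input type="text" name="searchFirst">
  <select name="searchInjury"><option>FATAL</option></select>
  <input type="submit" value="Search">
</form>
'''

inspector = FormInspector()
inspector.feed(html)
print(inspector.forms)
```

Fields without a `name` attribute, like the submit button here, are skipped.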

We need to know the names of the form fields and what values they will accept because we need to provide this information in our request for search results.
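For a sense of what actually travels in the request body, here is a small sketch of form encoding using the standard library. A dictionary of field names and values becomes a single `name=value` string, which is the same kind of encoding Requests applies when it is given a dictionary of form data. The `searchCounty` field name is a hypothetical example.

```python
from urllib.parse import parse_qs, urlencode

# Field names and values, as a web form would submit them.
form_data = {'searchInjury': 'FATAL'}

body = urlencode(form_data)
print(body)  # searchInjury=FATAL

# Spaces and other special characters are escaped.
# 'searchCounty' is a hypothetical field name.
print(urlencode({'searchCounty': 'ST. LOUIS'}))  # searchCounty=ST.+LOUIS

# The server reverses the process to recover the fields.
print(parse_qs(body))  # {'searchInjury': ['FATAL']}
```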

In the Python Requests library, this search info is specified via the `data` keyword argument when calling the [`.post()`](https://2.python-requests.org/en/master/api/#requests.post) function.

For instance, here's how to search for all the crashes with a fatal injury type.

In [8]:
r = requests.post(
    'https://www.mshp.dps.missouri.gov/HP68/SearchAction',
    data={'searchInjury': 'FATAL'},
)

If the response is okay, we can write this to another local file.

In [9]:
if r.ok:
    with open('results.html', 'w') as f:
        f.write(r.text)

### [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing HTML

BeautifulSoup is one of the most popular Python packages for getting data out of HTML.

To install this package, go back to your terminal and stop the Jupyter server (press "Ctrl" and "C" at the same time; when prompted, press "y", then "Return"). Now that your command prompt is back up:

```sh
pip install beautifulsoup4
```

The parsing process begins by creating a [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup) object.

First we import this class.

In [10]:
from bs4 import BeautifulSoup

Then we create an instance of this class by passing in the HTML content of the response.

In [11]:
soup = BeautifulSoup(r.content)

Technically, BeautifulSoup is an interface to lower-level Python parsing libraries. If you don't name a parser, it picks the best one that is installed. With nothing extra installed, that is the [`html.parser`](https://docs.python.org/3/library/html.parser.html) module in Python's standard library. The other options:

- [lxml](https://lxml.de/), which tends to be faster
- [html5lib](https://html5lib.readthedocs.io/en/latest/), which tends to be more lenient (i.e., tolerant of HTML docs with unclosed tags and other sub-standard syntax)

BeautifulSoup's docs have more details about [how to install alternate parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) and the [differences](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) between them.
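A parser can also be named explicitly as the second argument, which keeps results consistent from one machine to the next. The sketch below does this on a small, made-up table; the `demo_` names are chosen only to avoid clashing with the notebook's own variables.

```python
from bs4 import BeautifulSoup

# A made-up table for illustration only.
html = '''
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>DOE, JANE</td><td>34</td></tr>
</table>
'''

# Name the parser explicitly instead of letting BeautifulSoup choose.
demo_soup = BeautifulSoup(html, 'html.parser')

# .find() returns the first match; .find_all() returns every match.
demo_table = demo_soup.find('table')
header_cells = [th.get_text(strip=True) for th in demo_table.find_all('th')]
data_cells = [td.get_text(strip=True) for td in demo_table.find_all('td')]

print(header_cells)  # ['Name', 'Age']
print(data_cells)    # ['DOE, JANE', '34']
```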



In [12]:
type(soup)

bs4.BeautifulSoup

In [13]:
table = soup.find('table')

In [14]:
type(table)

bs4.element.Tag

In [15]:
table.find_all('th')

[<th class="headerSortPrint">Report</th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=name';document.sortForm.submit();">Name
 				
 				</a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=age';document.sortForm.submit();">Age
 				
 				</a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=city';document.sortForm.submit();">Person City/State
 				
 				</a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=injury';document.sortForm.submit();">Personal Injury
 				
 				</a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=seatbelt';document.sortForm.submit();">Safety Device
 				
 				</a></th>,
 <th class="headerSort"><a href="javascript:document.sortForm.action='/HP68/SortAction?column=date';document.sortForm.submit();">Date
 				
 

In [16]:
th_all = table.find_all('th')

In [17]:
type(th_all)

bs4.element.ResultSet

In [18]:
for th in th_all:
    print(th.text)

Report
Name
				
				
Age
				
				
Person City/State
				
				
Personal Injury
				
				
Safety Device
				
				
Date
				
					

Time
				
				
Crash County
				
				
Crash Location
				
				
Troop
				
				


In [19]:
headers = []

In [20]:
for th in th_all:
    header = th.text.strip().replace(' ', '_').lower()
    headers.append(header)

In [21]:
headers

['report',
 'name',
 'age',
 'person_city/state',
 'personal_injury',
 'safety_device',
 'date',
 'time',
 'crash_county',
 'crash_location',
 'troop']
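Notice that replacing only spaces leaves the slash in `person_city/state`. One possible variation, sketched here with a hypothetical helper named `to_snake_case`, collapses every run of non-alphanumeric characters into a single underscore.

```python
import re


def to_snake_case(label):
    """Lowercase a label and turn runs of other characters into '_'."""
    cleaned = re.sub(r'[^a-z0-9]+', '_', label.strip().lower())
    return cleaned.strip('_')


print(to_snake_case('Person City/State'))     # person_city_state
print(to_snake_case('  Crash Location\n\t'))  # crash_location
```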

In [22]:
tr_all = table.find_all('tr')[1:]

In [23]:
for tr in tr_all:
    for td in tr.find_all('td'):
        print(td.text.strip())
    print('---------------')

View
BRANDON, REBECCA J
34
AVA, MO
FATAL
NO
07/20/2020
1:30PM
DOUGLAS
HIGHWAY 5 - FIVE MILES NORTH OF AVA
G
---------------
View
DOBBS, OWEN R
25
AVA, MO
FATAL
NO
07/20/2020
1:30PM
DOUGLAS
HIGHWAY 5 - FIVE MILES NORTH OF AVA
G
---------------
View
MCDOWELL, JOSEPH
71
HAZELWOOD, MO
FATAL
NO
07/19/2020
9:50PM
ST. LOUIS
I-270 NORTHBOUND SOUTH OF MISSOURI 370
C
---------------
View
WINTER, MICHAEL R
61
NEWTON, IL
FATAL
NO
07/19/2020
6:12AM
LAFAYETTE
I-70 EB AT 54 MM
A
---------------
View
DATES, NATOSHA M
25
JOPLIN, MO
FATAL
NO
07/18/2020
7:05PM
JASPER
ROUTE D AT IVY RD IN ORONOGO
D
---------------
View
SMITH, MARYAN C
17
MANSFIELD, MO
FATAL
NO
07/18/2020
3:30PM
DOUGLAS
HWY 5  6.5 MILES NORTH OF AVA
G
---------------
View
SNEAD, JOSHUA D
32
O'FALLON, MO
FATAL
YES
07/18/2020
2:34PM
FRANKLIN
HIGHWAY T WEST OF EAST BECKER JUNCTION
C
---------------
View
ROGERS, MARTIN T
43
STEELVILLE, MO
FATAL
NO
07/18/2020
12:15AM
WASHINGTON
10074 WELLS RD
C
---------------
View
HILL, PHILLIP D
47
MARSHALL M

YES
06/25/2020
3:45AM
ST. LOUIS
WESTBOUND INTERSTATE 64 EAST OF SOUTH MASON ROAD
C
---------------
View
ALLEN, ISAIAH L
25
ST. LOUIS, MO
FATAL
NO
06/24/2020
7:37PM
ST. LOUIS
LEWIS AND CLARK BOULEVARD NORTH OF CHAMBERS ROAD
C
---------------
View
DAVIS, NATHANIEL L
30
ARNOLD, MO
FATAL
NO
06/23/2020
5:40PM
JEFFERSON
SOUTHBOUND MISSOURI 21 NORTH OF HAYDEN ROAD
C
---------------
View
MOORE, RAYMOND C
86
WARRENTON, MO
FATAL
YES
06/23/2020
1:21PM
WARREN
WESTBOUND I-70 AT THE 201.8 MILE MARKER
C
---------------
View
BRYANT, JOSEPH H
63
BOSSIER CITY LA
FATAL
EXEMPT
06/22/2020
8:52PM
ST. LOUIS
EASTBOUND ST CHARLES ROCK RD EAST OF I-270
C
---------------
View
WALTON, DELMER A
80
PACIFIC, MO
FATAL
YES
06/22/2020
4:43PM
ST. LOUIS
18777 HISTORIC ROUTE 66
C
---------------
View
KELAM, TARA N
22
HILLSBORO, MO
FATAL
NO
06/22/2020
1:30AM
JEFFERSON
SOUTHBOUND MISSOURI 21 AT THE 172.2 MILE MARKER
C
---------------
View
STICKLER, MARK D
64
ST JOSEPH, MO
FATAL
YES
06/21/2020
4:30PM
ANDREW
US 169 - 2 MILES 

FATAL
NO
04/30/2020
6:45AM
MARIES
MARIES COUNTY ROAD 623 2.3 MILES NORTH OF DIXON
I
---------------
View
BROSHOUS, KEVIN J
67
SULLIVAN, MO
FATAL
YES
04/29/2020
10:10AM
FRANKLIN
WESTBOUND I-44 AT THE 239 MILE MARKER
C
---------------
View
GUNTER, DONALD L
49
PURDY, MO
FATAL
NO
04/29/2020
2:05AM
BARRY
COUNTY ROAD 1035 3 MILES NORTHEAST OF WHEATON
D
---------------
View
BRISCOE, ALVIN J
70
RICH HILL, MO
FATAL
YES
04/27/2020
7:22AM
BATES
ROUTE U JUST N OF SW COUNTY ROAD 4508
A
---------------
View
BURTON, DENESE
50
CAHOKIA, IL
FATAL
NO
04/27/2020
4:50AM
WARREN
EASTBOUND I-70 AT THE 198.2 MILE MARKER
C
---------------
View
ATCHLEY, KAYLA D
24
NEW MADRID
FATAL
NO
04/26/2020
9:10PM
STODDARD
MO 153 3 MILES NORTH OF PARMA
E
---------------
View
LAND, RICHARD D
53
CLINTON, MO
FATAL
EXEMPT
04/24/2020
4:20AM
HENRY
MO 7 1/10 OF A MILE EAST OF ROUTE DD
A
---------------
View
YOUNG, JOHN M
55
AURORA, MO
FATAL
NO
04/23/2020
3:30PM
LAWRENCE
LAWRENCE 1010 4 MILES EAST OF SARCOXIE
D
---------------
View


CARTHAGE, MO
FATAL
NO
02/22/2020
5:10PM
JASPER
PRIVATE PROPERTY ON COUNTY LANE 121 APPROXIMATELY TWO MILES NORTHEAST OF CARTHAGE
D
---------------
View
TORRES, STEPHANIE
23
CARTHAGE, MO
FATAL
NO
02/22/2020
7:20AM
JASPER
M171 ONE HALF MILE WEST OF CARTHAGE
D
---------------
View
SHOCKLEY, MARCUS E
21
SPRINGFIELD, MO
FATAL
NO
02/22/2020
2:10AM
CHRISTIAN
HWY 13 THREE MILES SOUTH OF HIGHLANDVILLE
D
---------------
View
HENTZ, ROBERT M
60
O'FALLON, MO
FATAL
EXEMPT
02/21/2020
9:00PM
ST. LOUIS
SOUTHBOUND I-270 NORTH OF I-64
C
---------------
View
HAMMANN, GALEN R
63
JEFFERSON CITY, MO
FATAL
YES
02/21/2020
2:27PM
COLE
3662 ROCKRIDGE ROAD
F
---------------
View
PEREZ SALAS, FRANSISCA
59
PURDY, MO
FATAL
YES
02/19/2020
4:22PM
BARRY
WASHINGTON AVENUE IN PURDY
D
---------------
View
BURCKS, CAROLYN S
85
CHARLESTON, MO
FATAL
YES
02/19/2020
1:20PM
MISSISSIPPI
INTERSTATE 57 AT MILE MARKER 1
E
---------------
View
GUTHRIE, RICHARD B
72
BRANSON, MO
FATAL
YES
02/18/2020
9:25AM
LAWRENCE
2 MILES WEST OF HA

---------------
View
SORBELLO, PADEN S
23
BONNE TERRE, MO
FATAL
EXEMPT
12/15/2019
5:03PM
ST. FRANCOIS
SOUTHBOUND US-67 SOUTH OF CASH LANE
C
---------------
View
JUVENILE
3
SELIGMAN, MO
FATAL
YES
12/15/2019
12:20PM
BARRY
HWY 37 AT FR 1050 2 MILES SOUTH OF WASHBURN
D
---------------
View
CLINGMAN, NATHAN D
18
STEEDMAN, MO
FATAL
NO
12/15/2019
12:10PM
CALLAWAY
MO 94 . 5 MILE EAST OF RT C
F
---------------
View
JUVENILE
16
OZARK, MO
FATAL
NO
12/15/2019
5:47AM
CHRISTIAN
ROUTE BB 3 MILES EAST OF SPOKANE
D
---------------
View
PRIMO, TIMOTHY V
34
FENTON, MO
FATAL
YES
12/15/2019
1:42AM
ST. LOUIS
INTERSTATE 270 AT DOUGHERTY FERRY ROAD
C
---------------
View
ADAMS, CHRISTOPHER D
20
BRECKENRIDGE, MO
FATAL
NO
12/14/2019
8:30PM
LIVINGSTON
RTE U - 12 MILES WEST OF CHILLICOTHE
H
---------------
View
JUVENILE
14
FULTON MO
FATAL
NO
12/14/2019
11:15AM
PHELPS
GRANT ROAD  6/10 OF A MILE SOUTH OF DOOLITTLE CITY LIMITS
I
---------------
View
WADKINS, HEATH L
30
MTN, VIEW, MO
FATAL
NO
12/14/2019
6:14AM
TEXAS


PALMYRA, MO
FATAL
UNKNOWN
10/19/2019
7:00PM
MARION
US 61 SOUTH BOUND  2 MILES SOUTH OF PALMYRA, MO
B
---------------
View
CASSIDY, MONTE D
65
MONROE CITY, MO
FATAL
NO
10/19/2019
3:05PM
MARION
US HIGHWAY 61 3 MILES NORTH OF PALMYRA
B
---------------
View
BRANSON, DEBBIE M
24
DIXON, MO
FATAL
NO
10/19/2019
5:48AM
PULASKI
I-44 AT THE 158 MILE MARKER IN WAYNESVILLE
I
---------------
View
FULLER, CHRISTOPHER L
19
JACKSON, MO
FATAL
YES
10/18/2019
6:10PM
CAPE GIRARDEAU
HIGHWAY 25 NORTHBOUND, 1.5 MILES SOUTH OF GORDONVILLE
E
---------------
View
WUGER, GUY A
69
STOUTSVILLE, MO
FATAL
YES
10/18/2019
2:10PM
RALLS
RT D 4 MILES SOUTH OF PERRY
B
---------------
View
WILLEY, LINDELL J
69
TRENTON, MO
FATAL
YES
10/17/2019
3:35PM
GRUNDY
MO 146 - 5 MILES WEST OF TRENTON
H
---------------
View
WOODWARD, NATALIE M
29
WAYNESVILLE, MO
FATAL
YES
10/17/2019
3:45AM
PHELPS
I-44 AT THE 177 MILE MARKER - TWO MILES WEST OF DOOLITTLE
I
---------------
View
LUEBBERING, ROBERT B
83
ST THOMAS, MO
FATAL
NO
10/15/2019
7:0

NO
09/11/2019
9:35PM
IRON
ROUTE E 8 MILES EAST OF ARCADIA
E
---------------
View
GARCIA, MITCHELL C
53
AVA,MO
FATAL
NO
09/10/2019
5:00PM
DOUGLAS
MO76     2.5 MILES EAST OF GOODHOPE
G
---------------
View
MARTIN, PAUL S
59
BROOKLINE STATION, MO
FATAL
EXEMPT
09/10/2019
11:15AM
GREENE
2709 S ROUNDHILL RD, BROOKLINE STATION
D
---------------
View
CRISP, LONNIE E
75
CAULFIELD, MO
FATAL
NO
09/09/2019
4:11PM
OZARK
MO5 AT 3RD STREET IN GAINESVILLE
G
---------------
View
STARK, VINCENT E
90
LICKING, MO
FATAL
NO
09/09/2019
10:00AM
SHANNON
HIGHWAY B FIVE MILES WEST OF AKERS
G
---------------
View
JUVENILE
14
PUXICO, MO
FATAL
NO
09/07/2019
10:20AM
STODDARD
COUNTY ROAD 483 AT ROUTE J
E
---------------
View
BENNEFELD, CODY M
25
SEDALIA, MO
FATAL
NO
09/07/2019
5:30AM
HENRY
ROUTE C AND COUNTY ROAD NE 224
A
---------------
View
HOWERTON, JOHN M
41
LANAGAN, MO
FATAL
NO
09/06/2019
6:45PM
MCDONALD
MISSOURI 76, IN ANDERSON
D
---------------
View
NORMAN, GARY G
58
EL DORADO SPRINGS, MO
FATAL
NO
09/06/2019
1

HWY 181, 8 MILES WEST OF WILLOW SPRINGS
G
---------------
View
ZERAGIA, ROMAN
59
PHILADELPHIA, PA
FATAL
YES
07/29/2019
3:32PM
PHELPS
I-44 WESTBOUND AT THE 175 MILE MARKER - 9 MILES WEST OF ROLLA
I
---------------
View
NABOR, ESTHER M
80
JACKSON
FATAL
UNKNOWN
07/29/2019
1:30PM
CAPE GIRARDEAU
ROUTE Y 5 MILES EAST OF JACKSON
E
---------------
View
WENDLER, PAMELA K
57
MOUNT VERNON, MO
FATAL
YES
07/27/2019
11:10PM
LAWRENCE
HIGHWAY 174, 7 MILES EAST OF MOUNT VERNON
D
---------------
View
BYRNE, LYDIA A
19
ST. LOUIS, MO
FATAL
NO
07/27/2019
5:55PM
PIKE
17883 PIKE COUNTY ROAD 233, PRIVATE ROAD .6 MILE FROM THE ROADWAY
C
---------------
View
ALLISON, HALEY E
20
O'FALLON, MO
FATAL
NO
07/27/2019
2:55PM
TANEY
8 MILES EAST OF BRADLEYVILLE  ON HIGHWAY 76
D
---------------
View
ZAITZ, JOSHUA D
21
FESTUS MO
FATAL
YES
07/27/2019
10:20AM
JEFFERSON
MAPAVILLE HEMATITE ROAD SOUTH OF PLASS ROAD
C
---------------
View
MCALISTER, TYLER L
18
PIEDMONT, MO
FATAL
UNKNOWN
07/27/2019
2:00AM
REYNOLDS
COUNTY ROAD 468

In [39]:
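def clean_row(tds):
    # Use the function's own argument rather than the global td_all.
    details_url = tds[0].find('a').attrs['href']
    # Cell order: 0 report link, 1 name, 2 age, 3 city/state, 4 injury,
    # 5 safety device (skipped), 6 date, 7 time, 8 county, 9 location,
    # 10 troop.
    row = {
        'details_url': details_url,
        'name': tds[1].text.strip(),
        'age': tds[2].text.strip(),
        'person_city_state': tds[3].text.strip(),
        'personal_injury': tds[4].text.strip(),
        'date': tds[6].text.strip(),
        'time': tds[7].text.strip(),
        'crash_county': tds[8].text.strip(),
        'crash_location': tds[9].text.strip(),
        'troop': tds[10].text.strip(),
    }

    return row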

In [40]:
rows = []

In [43]:
for tr in tr_all:
    td_all = tr.find_all('td')
    row = clean_row(td_all)
    rows.append(row)

Here's how to do the same thing as a list comprehension.

In [44]:
rows = [
    clean_row(tr.find_all('td')) for tr in tr_all
]
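Indexing cells by position is easy to get wrong when a table has many columns. As an alternative sketch, cell texts can be paired with header names using `zip`. The header list matches the one built earlier; the cell values and the `row_from_cells` helper are invented for illustration.

```python
headers = [
    'report', 'name', 'age', 'person_city/state', 'personal_injury',
    'safety_device', 'date', 'time', 'crash_county', 'crash_location',
    'troop',
]

# Made-up cell texts standing in for one table row.
cells = [
    'View', 'DOE, JANE', '34', 'ANYTOWN, MO', 'FATAL', 'NO',
    '01/01/2020', '1:30PM', 'EXAMPLE', 'HIGHWAY 1 NORTH OF ANYTOWN', 'A',
]


def row_from_cells(headers, cells):
    """Pair each header with the cell in the same position."""
    if len(headers) != len(cells):
        raise ValueError('row does not have one cell per header')
    return dict(zip(headers, cells))


row = row_from_cells(headers, cells)
print(row['safety_device'])  # NO
print(row['date'])           # 01/01/2020
```

The length check makes a malformed row fail loudly instead of silently shifting values into the wrong columns.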

Here's how we get the number of rows we've parsed from the HTML.

In [54]:
len(rows)

631

To save these rows to a file, we can use the [csv](https://docs.python.org/3.8/library/csv.html) module, which is part of Python's standard library.

In [45]:
import csv

Here's how we can get our column headers.

In [52]:
rows[0].keys()

dict_keys(['details_url', 'name', 'age', 'person_city_state', 'personal_injury', 'date', 'time', 'crash_county', 'crash_location', 'troop'])

In [53]:
with open('crashes.csv', 'w', newline='') as f:
    writer = csv.DictWriter(
        f, fieldnames=rows[0].keys()
    )
    
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
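A quick way to check the output is to read it back. The sketch below does a round trip through an in-memory buffer rather than a file on disk, using made-up rows, and confirms that what comes out of `csv.DictReader` matches what went in.

```python
import csv
import io

# Made-up rows for illustration only.
sample_rows = [
    {'name': 'DOE, JANE', 'age': '34', 'crash_county': 'EXAMPLE'},
    {'name': 'ROE, RICHARD', 'age': '51', 'crash_county': 'SAMPLE'},
]

# newline='' lets the csv module control line endings itself.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=list(sample_rows[0].keys()))
writer.writeheader()
writer.writerows(sample_rows)  # writes every row in one call

# Rewind and read the same data back as dictionaries.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back == sample_rows)  # True
```

Values that contain commas, such as the names here, are quoted automatically on the way out and unquoted on the way back in.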

## Converting your notebook to a script

[nbconvert](https://pypi.org/project/nbconvert/) is a command-line utility for converting Jupyter notebooks (.ipynb files) into Python scripts (.py files).

```sh
jupyter nbconvert --no-prompt --to python Untitled1.ipynb
```

Do this in case you want to start working in [Sublime Text](https://www.sublimetext.com/), [Atom](https://atom.io/) or [Visual Studio Code](https://code.visualstudio.com/).