# Web scraping: Texas death row inmates

Now we're going to scrape [a table of inmates on death row in Texas](https://www.tdcj.state.tx.us/death_row/dr_offenders_on_dr.html) into a CSV file.

Our process:
1. Fetch the page that we want to scrape and save a copy of the file on our computer
2. Parse the contents of the file we just saved
3. Write the parsed data to a delimited file (a CSV, in this case)

First, let's open the page in a new browser tab and view its source in another tab, so we can see the HTML we'll be working with.

To start with, we need to import our dependencies:

In [1]:
import csv

import requests
from bs4 import BeautifulSoup

Now let's define a couple of variables:
- The URL to the web page we're going to scrape
- The name of the `.html` file we're going to save the web page to
- The name of the `.csv` file we're going to save the data into

In [2]:
URL = 'https://www.tdcj.state.tx.us/death_row/dr_offenders_on_dr.html'
SAVED_HTML = 'tx-death-row.html'
CSV_FILE = 'tx-death-row.csv'

### Save a local copy

First, we'll use `requests` to fetch the page. While we're at it, we'll send along a dictionary [with some custom headers](http://docs.python-requests.org/en/master/user/quickstart/#custom-headers): your name and email address. That's a courtesy in case the people maintaining the server need to contact you.

In [3]:
r = requests.get(URL, headers={'name': 'Cody Winchester', 'email': 'cody@ire.org'})
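
The `name` and `email` keys above are custom header names. HTTP also defines standard headers that do a similar job: `User-Agent` identifies the client and `From` carries a contact email address. Here's a minimal sketch of building that kind of dictionary. The helper name `contact_headers` and the default project name are made up for this example and aren't part of the original notebook.

```python
def contact_headers(name, email, project='tx-death-row-scraper'):
    """Build a headers dict that tells the server who is scraping.

    `contact_headers` and the default `project` value are hypothetical,
    invented for this sketch.
    """
    return {
        # identifies the client making the request
        'User-Agent': f'{project} ({name})',
        # standard HTTP header for a contact email address
        'From': email,
    }


headers = contact_headers('Cody Winchester', 'cody@ire.org')
print(headers)
```

You could then pass this dictionary as `headers=headers` to `requests.get()`, optionally along with a `timeout` argument so the request can't hang indefinitely.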

Next, we'll save a local copy of the web page we just fetched. We'll open the `SAVED_HTML` file, which doesn't exist yet, in write (`w`) mode inside a `with` block, then `write()` the `text` attribute of the response into it.

In [4]:
with open(SAVED_HTML, 'w') as o:
    o.write(r.text)
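
One reason to save a local copy is so we don't have to hit the server every time we rerun the notebook. Here's a sketch of that idea, assuming a hypothetical helper called `save_once` that only writes the file if it isn't already there. The demo uses a throwaway directory and a stand-in function instead of a real request.

```python
import tempfile
from pathlib import Path


def save_once(path, fetch):
    """Write fetch() to `path` unless the file already exists.

    `save_once` is a hypothetical helper; `fetch` is any function
    that returns the page's HTML as a string.
    """
    target = Path(path)
    if target.exists():
        return False  # keep the existing copy and skip the fetch
    target.write_text(fetch(), encoding='utf-8')
    return True


# demo in a throwaway directory, with stand-ins for the real request
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'page.html'
    first = save_once(demo_file, lambda: '<html>demo</html>')
    second = save_once(demo_file, lambda: '<html>changed</html>')
    saved = demo_file.read_text(encoding='utf-8')

print(first, second, saved)
```

With the real page, the call might look like `save_once(SAVED_HTML, lambda: requests.get(URL).text)`.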

### Scrape out the data

Now we'll open that file and turn the contents into a `BeautifulSoup` object.

In [5]:
with open(SAVED_HTML, 'r') as i:
    html = i.read()
    soup = BeautifulSoup(html, 'html.parser')

Where's the table? Looks like it's the only one on the page, so we can just use `find()`, which returns the _first_ match instead of a list of matches.

In [6]:
table = soup.find('table')
print(table)

<table class="tdcj_table indent" style="width:98%">
<caption>Offenders on Death Row</caption>
<tr>
<th abbr="tdcj number" scope="col">TDCJ<br/>
        Number</th>
<th abbr="link" align="center" scope="col" width="16%">Link</th>
<th abbr="last name" scope="col">Last Name</th>
<th abbr="first name" scope="col">First Name</th>
<th abbr="date/birth" scope="col">Date of<br/>
        Birth</th>
<th abbr="gender" align="center" scope="col">Gender</th>
<th abbr="race" scope="col">Race</th>
<th abbr="date received" scope="col">Date<br/>
        Received</th>
<th abbr="county" scope="col">County</th>
<th abbr="date/offense" scope="col">Date of<br/>
        Offense</th>
</tr>
<tr>
<td>999610</td>
<td align="center"><a href="dr_info/delacruzisidro.html" title="Offender Information for Isidro Delacruz">Offender Information</a></td>
<td>Delacruz</td>
<td>Isidro</td>
<td>10/07/1990</td>
<td align="center">M</td>
<td>Hispanic</td>
<td>04/26/2018</td>
<td>Tom Green</td>
<td>09/02/2014</td>
</tr>
<tr>


We'll use the `find_all()` method to get a list of rows in the table, and list slicing (`[1:]`) to skip the first row (index `0`), which has the headers.
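
If slicing is new to you, here's a quick sketch using a plain list of strings as a stand-in for the list-like result that `find_all('tr')` gives back. The row labels are placeholders, not real data.

```python
# a stand-in for the list-like result of table.find_all('tr')
rows = ['header row', 'inmate 1', 'inmate 2', 'inmate 3']

# [1:] means "from index 1 to the end", which drops the item at index 0
data_rows = rows[1:]

print(rows[0])
print(data_rows)
```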

In [7]:
# loop over the table rows, skipping the header row
for row in table.find_all('tr')[1:]:
    
    # get a list of `td` tags inside this row
    cols = row.find_all('td')
    
    # inmate number is first in this list
    inmate_no = cols[0].string.strip()
    
    # then link, inside the `href` attribute of the `a` tag
    # we'll prepend the base URL, while we're at it
    link = 'https://www.tdcj.state.tx.us/death_row/' + cols[1].a['href']
    
    # last name in 3rd [2] position
    last = cols[2].string.strip()

    # first name in 4th [3] position
    first = cols[3].string.strip()

    # dob in 5th [4] position
    dob = cols[4].string.strip()

    # sex in 6th [5] position
    sex = cols[5].string.strip()

    # race in 7th [6] position
    race = cols[6].string.strip()

    # date received in 8th [7] position
    date_rcvd = cols[7].string.strip()

    # county in 9th [8] position
    county = cols[8].string.strip()

    # date of offense in 10th [9] position
    date_offense = cols[9].string.strip()

    # print out results
    print(last, first, dob, sex, race, date_rcvd, county, date_offense)

Delacruz Isidro 10/07/1990 M Hispanic 04/26/2018 Tom Green 09/02/2014
Delacerda Jason 07/26/1977 M Hispanic 03/08/2018 Hardin 08/17/2011
Hudson William 07/03/1982 M White 11/16/2017 Anderson 11/14/2015
Tracy Billy 11/30/1977 M White 11/15/2017 Bowie 07/15/2015
Colone Joseph 08/13/1978 M Black 05/09/2017 Jefferson 07/31/2010
Falk John 11/19/1966 M White 03/01/2017 Walker 09/24/2007
Wells, III Amos 08/20/1990 M Black 11/22/2016 Tarrant 07/01/2013
Brownlow Charles 09/16/1977 M Black 05/23/2016 Kaufman 10/28/2013
Bluntson Demond 11/25/1975 M Black 05/10/2016 Webb 06/19/2012
Gonzalez Mark 10/23/1969 M Hispanic 01/28/2016 Bexar 05/28/2011
Calvert James 12/03/1970 M White 10/16/2015 Smith 10/31/2012
Hall Paul 02/18/1993 M Asian 10/09/2015 Brazos 10/20/2011
Williams Eric 04/07/1967 M White 12/18/2014 Kaufman 03/30/2013
Thomas Steven 09/21/1958 M White 12/05/2014 Williamson 11/04/1980
Suniga Brian 12/27/1979 M White 10/30/2014 Lubbock 12/26/2011
Lewis, III Harlem 07/05/1991 M Black 08/01/2014 H
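
One detail from the loop above: we built the full link by gluing the base URL onto the `href` by hand. As an alternative, here's a sketch using `urljoin` from the standard library, which resolves a relative link against the page it came from. `PAGE_URL` is just a name chosen for this sketch, and the `href` is copied from the first row of the table.

```python
from urllib.parse import urljoin

# the page we scraped, and a relative link copied from its table
PAGE_URL = 'https://www.tdcj.state.tx.us/death_row/dr_offenders_on_dr.html'
href = 'dr_info/delacruzisidro.html'

# resolve the relative href against the page's own URL
link = urljoin(PAGE_URL, href)
print(link)
```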

Great! Now, instead of printing, let's write the results out to a CSV file.

Inside a `with` block, we'll open the filename defined above (`CSV_FILE`) in write (`w`) mode, and use a `csv.DictWriter` object to write the data for us.

👉 For more details on reading and writing delimited files in Python, [see this notebook](../reference/Reading%20and%20writing%20delimited%20data%20files%20with%20vanilla%20Python.ipynb).
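
Before the full version, here's a minimal sketch of `csv.DictWriter` on its own. It writes to an in-memory buffer (`io.StringIO`) instead of a real file, and uses a shortened set of headers chosen for this example. Notice that a value containing a comma, like `Wells, III`, gets quoted for us.

```python
import csv
import io

# an in-memory stand-in for a file opened in write mode
buffer = io.StringIO()

# fieldnames sets both the header row and the column order
writer = csv.DictWriter(buffer, fieldnames=['last', 'first', 'county'])
writer.writeheader()

# each row is a dictionary keyed by those fieldnames
writer.writerow({'last': 'Wells, III', 'first': 'Amos', 'county': 'Tarrant'})

output = buffer.getvalue()
print(output)
```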

Let's put it all together:

In [23]:
# open the file in write mode
with open(CSV_FILE, 'w', newline='') as outfile:
    # define the headers
    headers = ['inmate_no', 'last', 'first', 'dob', 'sex', 'race',
               'date_rcvd', 'county', 'date_offense']

    # create a csv.DictWriter object
    # fieldnames are the headers we just defined
    writer = csv.DictWriter(outfile, fieldnames=headers)
    
    # write the header row
    writer.writeheader()
    
    # loop over the table rows, skipping the header row
    for row in table.find_all('tr')[1:]:

        # get a list of `td` tags inside this row
        cols = row.find_all('td')

        # inmate number is first in this list
        inmate_no = cols[0].string.strip()

        # then link, inside the `href` attribute of the `a` tag
        # we'll prepend the base URL, while we're at it
        link = 'https://www.tdcj.state.tx.us/death_row/' + cols[1].a['href']

        # last name in 3rd [2] position
        last = cols[2].string.strip()

        # first name in 4th [3] position
        first = cols[3].string.strip()

        # dob in 5th [4] position
        dob = cols[4].string.strip()

        # sex in 6th [5] position
        sex = cols[5].string.strip()

        # race in 7th [6] position
        race = cols[6].string.strip()

        # date received in 8th [7] position
        date_rcvd = cols[7].string.strip()

        # county in 9th [8] position
        county = cols[8].string.strip()

        # date of offense in 10th [9] position
        date_offense = cols[9].string.strip()

        # write the results to file
        writer.writerow({
            'inmate_no': inmate_no,
            'last': last,
            'first': first,
            'dob': dob,
            'sex': sex,
            'race': race,
            'date_rcvd': date_rcvd,
            'county': county,
            'date_offense': date_offense
        })
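
A quick way to check the result is to read the file back in and look at a row or two. Here's a sketch using `csv.DictReader`, assuming a hypothetical helper called `read_rows`. So that it runs on its own, it writes a tiny sample file to a temporary directory first. To check the real output, you'd call `read_rows(CSV_FILE)` instead.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Return every data row in a CSV file as a list of dicts."""
    with open(path, 'r', newline='') as infile:
        return list(csv.DictReader(infile))


# write a tiny sample file so this sketch runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(sample_path, 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=['last', 'first'])
    writer.writeheader()
    writer.writerow({'last': 'Delacruz', 'first': 'Isidro'})
    writer.writerow({'last': 'Wells, III', 'first': 'Amos'})

# read it back and confirm the rows survived the round trip
rows = read_rows(sample_path)
print(len(rows), rows[1]['last'])
```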