# Simple web scraper

In [None]:
>>> import urllib.request
>>> urllib.request.urlopen("http://www.python.org/")

The scraper will use Python’s **BeautifulSoup** toolkit to **parse the HTML and extract the data**.

We’ll also use the **Requests library** to **open the URL, download the HTML and pass it to BeautifulSoup**.

**Detective work**: open the page in your browser and study its HTML.

 - Web page: http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp
 - “View Source”
 - “Inspect element”

**Find a pattern or identifier in the code for the elements you’d like to extract**. In the best cases, you can extract content by using the **id or class** already assigned to the element you’d like to extract.
 - An ‘id’ is intended to act as the unique identifier of a specific item on a page.
 - A ‘class’ is used to label a specific TYPE of item on a page, so there may be many instances of a class on a page.
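
To make the difference between an id and a class concrete, here is a minimal sketch that runs without touching the network. The markup, the `roster` id, the `inmate` class and the placeholder names are all made up for illustration, and it assumes Python's built-in `html.parser` rather than any particular third-party parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example (not taken from the Boone County page).
SAMPLE_HTML = """
<div id="roster">
  <p class="inmate">DOE</p>
  <p class="inmate">ROE</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# An id is meant to be unique, so find() returning a single element fits well.
roster = soup.find(id="roster")

# A class can label many elements, so find_all() returns every match as a list.
inmates = soup.find_all("p", attrs={"class": "inmate"})
names = [p.text for p in inmates]

print(roster.name, names)
```

If the element you want has an id, a single `find()` is usually enough; with a class you should expect several matches and decide which ones you need.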

In [2]:
import requests

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
print (html)



You should see the entire HTML of the page spilled out.

**Import the BeautifulSoup HTML parsing library and feed it the page**.

In [5]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html) # NEW CODE
print (soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Current Detainees of Boone County Jail
  </title>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <!-- -->
  <!-- MultipleRecordApp: CURRENT DETAINEES. Created: Thu Apr 06 13:51:13 CDT 2017 by ITBRANDO. INQHTM-MUL 20120709 -->
  <link href="/mrcjava/mrcclasses/SH01_MP/favicon.ico" rel="shortcut icon"/>
  <link href="/mrcjava/mrcclasses/mrc_servlet_ajax.css" rel="stylesheet" type="text/css"/>
  <link href="/mrcjava/mrcclasses/SH01_MP/mrc_default.css?v=0.74" mrcfile="150" rel="stylesheet" type="text/css"/>
  <link href="/mrcjava/mrcclasses/jquery.custom.css" rel="stylesheet" type="text/css"/>
  <!--[if !IE]>
<link href="/mrcjava/mrcclasses/mrc-responsive.css" rel="stylesheet" />
<![endif]-->
  <!--[if IE 7]>
<link href="/mrcjava/mrcclasses/font/css/font-awesome-ie7.css" rel="stylesheet" />
<![endif]-->
  <style type="text/css">
   #sort_table_head a {pointer-ev



If BeautifulSoup warns that no parser was explicitly specified, the warning suggests changing code that looks like this:

 BeautifulSoup(YOUR_MARKUP)

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")


You should see the page’s HTML again, but in a prettier format this time.

**Next** we take all the detective work we did with the page’s HTML above and convert it into a simple, direct command that will **instruct BeautifulSoup on how to extract only the table we’re after.**

In [8]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'}) # NEW CODE
print (table.prettify())

<tbody class="stripe" id="mrc_main_table">
 <!-- -->
 <tr class="odd">
  <td class="one td_left" data-th="Last Name">
   ACTON
  </td>
  <td class="one td_left" data-th="First Name">
   ANTHONY
  </td>
  <td class="one td_left" data-th="Middle Name">
   SEAN
  </td>
  <td class="one td_left" data-th="Sex">
   M
  </td>
  <td class="one td_left" data-th="Race">
   B
  </td>
  <td class="one td_right" data-th="Age">
   26
  </td>
  <td class="one td_left" data-th="City">
   COLUMBIA
  </td>
  <td class="one td_left" data-th="State">
   MO
  </td>
  <td class="one td_left" data-th="">
   <a class="_lookup btn btn-primary" height="600" href="SH01_MP.I00500s?PERKEP=66619&amp;hover_redir=&amp;height=600&amp;width=950" linkedtype="I" mrc="returndata" target="_lookup" width="860">
    <i class="fa fa-large fa-fw fa-list-alt">
    </i>
    Details
   </a>
  </td>
 </tr>
 <!-- -->
 <tr class="even">
  <td class="two td_left" data-th="Last Name">
   AKERS
  </td>
  <td class="two td_left" data-th





This time it only prints out the table body we’re after, which was selected by instructing BeautifulSoup to return only the *&lt;tbody&gt;* tag with stripe as its class attribute.

**Convert the rows in the table into a list**, which we can then loop through to pull out the data.

**BeautifulSoup** gets us going by allowing us to dig down into our table and **return a list of rows, which are created in HTML using *&lt;tr&gt;* tags** inside the table.

In [9]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

for row in table.findAll('tr'):  # NEW CODE
    print (row.prettify())

<tr class="odd">
 <td class="one td_left" data-th="Last Name">
  ACTON
 </td>
 <td class="one td_left" data-th="First Name">
  ANTHONY
 </td>
 <td class="one td_left" data-th="Middle Name">
  SEAN
 </td>
 <td class="one td_left" data-th="Sex">
  M
 </td>
 <td class="one td_left" data-th="Race">
  B
 </td>
 <td class="one td_right" data-th="Age">
  26
 </td>
 <td class="one td_left" data-th="City">
  COLUMBIA
 </td>
 <td class="one td_left" data-th="State">
  MO
 </td>
 <td class="one td_left" data-th="">
  <a class="_lookup btn btn-primary" height="600" href="SH01_MP.I00500s?PERKEP=66619&amp;hover_redir=&amp;height=600&amp;width=950" linkedtype="I" mrc="returndata" target="_lookup" width="860">
   <i class="fa fa-large fa-fw fa-list-alt">
   </i>
   Details
  </a>
 </td>
</tr>

<tr class="even">
 <td class="two td_left" data-th="Last Name">
  AKERS
 </td>
 <td class="two td_left" data-th="First Name">
  SYDNEY
 </td>
 <td class="two td_left" data-th="Middle Name">
  RAE
 </td>
 <td cla





You’ll now see each row printed out separately as the script loops through the table.

**Next, loop through each of the cells in each row** by selecting them inside the loop. Cells are created in HTML by the *&lt;td&gt;* tag.

In [10]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

for row in table.findAll('tr'):
    for cell in row.findAll('td'):  # NEW CODE
        print (cell.text)

ACTON
ANTHONY
SEAN
M
B
26
COLUMBIA
MO

 Details

AKERS
SYDNEY
RAE
F
W
21
COLUMBIA
MO

 Details

ALEXANDER
BENJAMIN
FRANKLIN
M
B
23
COLUMBIA
MO

 Details

ANDERSON
ANTHONY
CURTIS
M
W
26
NEVADA
MO

 Details

ARMSTRONG
ANTIONE
LAMONT
M
B
18
COLUMBIA
MO

 Details

AULL
JONATHAN
MORGAN
M
W
37
COLUMBIA
MO

 Details

BAKER
CHRISTOPHER
ALAN
M
W
28
COLUMBIA
MO

 Details

BAKER
MICHAEL
COREY
M
B
27
COLUMBIA
MO

 Details

BANNERMAN
KRYSTAL
MARIE
F
W
30
CENTRALIA
MO

 Details

BARNES
DARIUS
JOSHUA
M
B
22
COLUMBIA
MO

 Details

BARTZ-OWENS
ASHLEY
SUE
F
W
32
COLUMBIA
MO

 Details

BECK
JOSHUA
ALEX
M
W
25
COLUMBIA
MO

 Details

BENEDETTI
FRANK
DOMINICK
M
W
29
COLUMBIA
MO

 Details

BETTS
VITA
LAMETA
F
B
45
COLUMBIA
MO

 Details

BLAIR
CONNIE
MARIE
F
B
36
COLDWATER
MS

 Details

BLEVINS
JESSE
RAY
M
W
29
COLUMBIA
MO

 Details

BOND
SOPHIA
LYNN
F
W
24
COLUMBIA
MO

 Details

BOWEN
JAMIE
ROGER
M
W
38
HIGBEE
MO

 Details

BRADLEY
JAMIE
MICHELLE
F
W
33
COLUMBIA
MO

 Details

BRADSHAW
JOSEPH
LEE
M
W
42
COLUM





The output should now be much cleaner: just the text of each cell, one per line, with the HTML tags stripped away.

Now that we have found the data we want to extract, we need to **structure the data in a way that can be written out to a comma-delimited text file**. That won’t be hard, since a CSV is nothing more than a grid of columns and rows, much like a table.

Let’s **start by adding each cell in a row to a new Python list**.

In [11]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

for row in table.findAll('tr'):
    list_of_cells = []                             # NEW CODE
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)                 # NEW CODE
    print (list_of_cells)

['ACTON', 'ANTHONY', 'SEAN', 'M', 'B', '26', 'COLUMBIA', 'MO', '\n\xa0Details\n']
['AKERS', 'SYDNEY', 'RAE', 'F', 'W', '21', 'COLUMBIA', 'MO', '\n\xa0Details\n']
['ALEXANDER', 'BENJAMIN', 'FRANKLIN', 'M', 'B', '23', 'COLUMBIA', 'MO', '\n\xa0Details\n']
['ANDERSON', 'ANTHONY', 'CURTIS', 'M', 'W', '26', 'NEVADA', 'MO', '\n\xa0Details\n']
['ARMSTRONG', 'ANTIONE', 'LAMONT', 'M', 'B', '18', 'COLUMBIA', 'MO', '\n\xa0Details\n']
['AULL', 'JONATHAN', 'MORGAN', 'M', 'W', '37', 'COLUMBIA', 'MO', '\n\xa0Details\n']
['BAKER', 'CHRISTOPHER', 'ALAN', 'M', 'W', '28', 'COLUMBIA', 'MO', '\n\xa0Details\n']
['BAKER', 'MICHAEL', 'COREY', 'M', 'B', '27', 'COLUMBIA', 'MO', '\n\xa0Details\n']
['BANNERMAN', 'KRYSTAL', 'MARIE', 'F', 'W', '30', 'CENTRALIA', 'MO', '\n\xa0Details\n']
['BARNES', 'DARIUS', 'JOSHUA', 'M', 'B', '22', 'COLUMBIA', 'MO', '\n\xa0Details\n']
['BARTZ-OWENS', 'ASHLEY', 'SUE', 'F', 'W', '32', 'COLUMBIA', 'MO', '\n\xa0Details\n']
['BECK', 'JOSHUA', 'ALEX', 'M', 'W', '25', 'COLUMBIA', 'MO', '\





Now you should see Python lists streaming by one row at a time.
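
One detail in the lists above is worth a second look: the last cell still contains `\xa0`. By the time `cell.text` returns, the `&nbsp;` entity has already been decoded into the non-breaking space character, so replacing the literal string `'&nbsp;'` finds nothing to replace. The sketch below shows one possible cleanup; `clean_cell` is a hypothetical helper name, not part of BeautifulSoup, and the sample row is copied from the output above.

```python
def clean_cell(text):
    """Swap non-breaking spaces for plain spaces, then trim surrounding whitespace."""
    return text.replace("\xa0", " ").strip()


# One row exactly as it appears in the printed output above.
raw_row = ['ACTON', 'ANTHONY', 'SEAN', 'M', 'B', '26', 'COLUMBIA', 'MO', '\n\xa0Details\n']

cleaned_row = [clean_cell(cell) for cell in raw_row]
print(cleaned_row)
# ['ACTON', 'ANTHONY', 'SEAN', 'M', 'B', '26', 'COLUMBIA', 'MO', 'Details']
```

In the scraper itself, the same idea would mean calling something like `clean_cell(cell.text)` inside the inner loop before appending to `list_of_cells`.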

**Combine the lists into a list of lists**:

Those lists can now be lumped together into one big list of lists. When you think about it, a list of lists isn’t all that different from how a spreadsheet is structured.

In [12]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

list_of_rows = []                            # NEW CODE
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)       # NEW CODE

print (list_of_rows)

[['ACTON', 'ANTHONY', 'SEAN', 'M', 'B', '26', 'COLUMBIA', 'MO', '\n\xa0Details\n'], ['AKERS', 'SYDNEY', 'RAE', 'F', 'W', '21', 'COLUMBIA', 'MO', '\n\xa0Details\n'], ['ALEXANDER', 'BENJAMIN', 'FRANKLIN', 'M', 'B', '23', 'COLUMBIA', 'MO', '\n\xa0Details\n'], ['ANDERSON', 'ANTHONY', 'CURTIS', 'M', 'W', '26', 'NEVADA', 'MO', '\n\xa0Details\n'], ['ARMSTRONG', 'ANTIONE', 'LAMONT', 'M', 'B', '18', 'COLUMBIA', 'MO', '\n\xa0Details\n'], ['AULL', 'JONATHAN', 'MORGAN', 'M', 'W', '37', 'COLUMBIA', 'MO', '\n\xa0Details\n'], ['BAKER', 'CHRISTOPHER', 'ALAN', 'M', 'W', '28', 'COLUMBIA', 'MO', '\n\xa0Details\n'], ['BAKER', 'MICHAEL', 'COREY', 'M', 'B', '27', 'COLUMBIA', 'MO', '\n\xa0Details\n'], ['BANNERMAN', 'KRYSTAL', 'MARIE', 'F', 'W', '30', 'CENTRALIA', 'MO', '\n\xa0Details\n'], ['BARNES', 'DARIUS', 'JOSHUA', 'M', 'B', '22', 'COLUMBIA', 'MO', '\n\xa0Details\n'], ['BARTZ-OWENS', 'ASHLEY', 'SUE', 'F', 'W', '32', 'COLUMBIA', 'MO', '\n\xa0Details\n'], ['BECK', 'JOSHUA', 'ALEX', 'M', 'W', '25', 'COLUMBI





The output is a dump of a big bunch of data. Look closely and you’ll see the list of lists.
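
If you want to experiment with the list-of-lists structure without hitting the live site, the sketch below builds it from a small inline table. The two-row sample and its placeholder names are invented for illustration, the built-in `html.parser` is assumed as the parser, and `find_all` is the current spelling of the `findAll` method used above. The nested list comprehension is simply a more compact way to write the same two loops.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table that mimics the structure of the jail roster.
SAMPLE_HTML = """
<table>
  <tbody class="stripe">
    <tr><td>DOE</td><td>JOHN</td><td>30</td></tr>
    <tr><td>ROE</td><td>JANE</td><td>25</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("tbody", attrs={"class": "stripe"})

# The outer comprehension walks the rows; the inner one walks the cells of each row.
list_of_rows = [
    [cell.text for cell in row.find_all("td")]
    for row in table.find_all("tr")
]

print(list_of_rows)
# [['DOE', 'JOHN', '30'], ['ROE', 'JANE', '25']]
```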

**Write output to csv file**:
 1. At the top of the file, **import Python’s built-in csv module**. 
 2. At the bottom: 
  - **Create a new file**
  - **Hand off the file** to the csv module
  - Use the csv module's **_writerows_ to dump out our list of lists**.

In [16]:
import csv                                      # NEW CODE
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, "lxml")
table = soup.find('tbody', attrs={'class': 'stripe'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./inmates.csv", "w", newline="")  # NEW CODE
writer = csv.writer(outfile)                   # NEW CODE
writer.writerows(list_of_rows)                 # NEW CODE
outfile.close()                                # NEW CODE

Nothing should appear to happen.

Since there are no longer any print statements in the file, the script is no longer dumping data out to your terminal. However, if you open up your code directory you should now see a new file named inmates.csv waiting for you. Open it in a text editor or Excel and you should see structured data all scraped out.
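
As a variation on the file-writing step, the sketch below uses a `with` block so the file is closed automatically, and passes `newline=""`, which the csv module's documentation recommends when opening files for it. The rows, the temporary directory and the `inmates_demo.csv` file name are placeholders chosen for this example; reading the file back is only there to confirm that the round trip works.

```python
import csv
import os
import tempfile

# Placeholder rows standing in for the scraped list of lists.
rows = [
    ["DOE", "JOHN", "Q", "M", "W", "30", "COLUMBIA", "MO"],
    ["ROE", "JANE", "R", "F", "B", "25", "COLUMBIA", "MO"],
]

# Write into a throwaway directory so the example leaves your project folder alone.
path = os.path.join(tempfile.mkdtemp(), "inmates_demo.csv")

# newline="" lets the csv module manage line endings itself.
with open(path, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(rows)

# Read the file back to check that what was written matches the original rows.
with open(path, newline="") as infile:
    read_back = list(csv.reader(infile))

print(read_back == rows)
# True
```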

There is still one obvious problem though. **There are no headers**.

Here’s why. If you go back and look closely, **our script is only looping through lists of _&lt;td&gt;_ tags found within each row**. Fun fact: **Header tags in HTML tables are often wrapped in the slightly different _&lt;th&gt;_ tag**. Look back at the source of the Boone County page and you’ll see that’s exactly what they do.

But rather than bend over backwards to dig them out of the page, let’s try something a little different: **write the headers out ourselves at the end**. The *&lt;tbody&gt;* we selected holds only data rows (its first row is ACTON in the output above), so there is no header row to skip and the loop still covers every row.

In [21]:
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, "lxml")
table = soup.find('tbody', attrs={'class': 'stripe'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./inmates.csv", "w", newline="")
writer = csv.writer(outfile)
writer.writerow(["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"])
writer.writerows(list_of_rows)
outfile.close()
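
An alternative to typing the headers by hand: in the output shown earlier, every *&lt;td&gt;* carries a `data-th` attribute holding its column name (the "Details" cell has an empty one). The sketch below reads those attributes from the first row of a small inline sample. The sample row is invented to copy that attribute layout, the built-in `html.parser` is assumed, and whether the live page keeps these attributes is not something this example can guarantee.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row sample that copies the attribute layout seen in the output above.
SAMPLE_HTML = """
<table>
  <tbody class="stripe">
    <tr class="odd">
      <td class="one td_left" data-th="Last Name">DOE</td>
      <td class="one td_left" data-th="First Name">JOHN</td>
      <td class="one td_right" data-th="Age">30</td>
      <td class="one td_left" data-th="">Details</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("tbody", attrs={"class": "stripe"})
first_row = table.find("tr")

# Tag.get() returns the attribute value, or the fallback if the attribute is missing.
headers = [cell.get("data-th", "") for cell in first_row.find_all("td")]

print(headers)
# ['Last Name', 'First Name', 'Age', '']
```

A list built this way could be passed to `writer.writerow()` in place of the hand-written header list, at the cost of depending on markup the site may change.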