# Web scraping practice: Pittsburgh lobbyists

We're going to scrape [a database of people registered to lobby the city of Pittsburgh](http://www.openbookpittsburgh.com/SearchLobbyists.aspx).

First, let's import the packages we'll need: `csv`, `requests`, `bs4`, `pandas`.

In [1]:
import csv

import requests
from bs4 import BeautifulSoup
import pandas as pd

### Noodle around

Navigate to the page we're going to scrape. We want everyone -- what happens when you hit "search" without entering any criteria into the form? It works! (This won't always be the case for databases like this.) As of July 28, 2018, the search showed 83 lobbyists.

Notice, too, that the URL has changed from this:

`http://www.openbookpittsburgh.com/SearchLobbyists.aspx`

To this:

`http://www.openbookpittsburgh.com/SearchLobbyists.aspx?&page=0&cat=LobbyistName&sort=ASC&num=10&click=1`

After the `?` are the URL _parameters_, separated by `&`:
- `page=0`
- `cat=LobbyistName`
- `sort=ASC`
- `num=10`
- `click=1`

These are the instructions that get passed to the database after we click search: Return results under the category "Lobbyist Name," show 10 lobbyists at a time, sort ascending, starting with page zero (the first page).
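
To see those pieces programmatically, here's a minimal sketch that pulls the search URL apart with the standard library's `urllib.parse`. The variable names are just for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

search_url = ('http://www.openbookpittsburgh.com/SearchLobbyists.aspx'
              '?&page=0&cat=LobbyistName&sort=ASC&num=10&click=1')

# everything after the `?` is the query string
query = urlsplit(search_url).query

# parse_qsl() breaks the query string into (name, value) pairs
params = dict(parse_qsl(query))

for key, value in params.items():
    print(key, '=', value)
```

Note that every value comes back as a string, including the numbers.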

(What happens if you _do_ put something in the search field? A new parameter is added to the URL: `lobbyist=`. But we want everything, so we can ignore this.)

What happens if we tweak the URL and instruct the database to show us _100_ results at a time? Try it:

[`http://www.openbookpittsburgh.com/SearchLobbyists.aspx?&page=0&cat=LobbyistName&sort=ASC&num=100&click=1`](http://www.openbookpittsburgh.com/SearchLobbyists.aspx?&page=0&cat=LobbyistName&sort=ASC&num=100&click=1)

Now we have everyone in the database on a single page.

### Save web page locally

So we've got our target URL -- if we request that page, we get back some HTML containing all the data we'd like to scrape.

When possible, it's good practice to save local copies of the pages that you're scraping. That way you don't have to rely on a stable internet connection as you work on your scraper, and you can avoid sending unnecessary traffic to the target's server.

Let's do that now.

First, set up a couple of variables:
- The base URL
- A [dictionary](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Dictionaries) of URL parameters (see [the requests documentation here](http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls))
- The name of the `.html` file we'll save to locally
- The name of the `.csv` file we'll write out our results to

In [2]:
BASE_URL = 'http://www.openbookpittsburgh.com/SearchLobbyists.aspx'

URL_PARAMS = {
    'page': 0,
    'cat': 'LobbyistName',
    'sort': 'ASC',
    'num': 1000,  # comfortably more than the 83 lobbyists, so everyone fits on one page
    'click': 1
}

HTML_FILE = 'pittsburgh-lobbyists.html'
CSV_FILE = 'pittsburgh-lobbyists.csv'

Now actually fetch the page, specifying our headers and `params=URL_PARAMS`.

In [3]:
r = requests.get(BASE_URL,
                 headers={'name': 'Cody Winchester', 'email': 'cody@ire.org'},
                 params=URL_PARAMS)
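
If you're curious what URL that request ends up asking for, here's a rough sketch that uses the standard library's `urlencode` to glue a base URL and a dictionary of parameters together. `requests` does its own encoding internally, so treat this as an approximation of the final URL, not a look inside the library. The lowercase variable names are just for this example.

```python
from urllib.parse import urlencode

base_url = 'http://www.openbookpittsburgh.com/SearchLobbyists.aspx'

url_params = {
    'page': 0,
    'cat': 'LobbyistName',
    'sort': 'ASC',
    'num': 1000,
    'click': 1
}

# urlencode() turns the dictionary into `key=value` pairs joined by `&`
full_url = base_url + '?' + urlencode(url_params)

print(full_url)
```

After a real request, the response's `url` attribute (here, `r.url`) shows the URL that was actually fetched.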

Write the `text` attribute -- the code underpinning the requested page -- to the file under the name we just specified.

In [4]:
with open(HTML_FILE, 'w') as o:
    o.write(r.text)
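
One way to make the save-a-local-copy habit stick is to only hit the server when the file isn't already on disk. Here's a minimal sketch of that idea. The `load_or_fetch` helper and the fake `fetch_page` function are made up for this example; in the real scraper, the fetch step would be the `requests.get()` call above.

```python
import os
import tempfile


def load_or_fetch(path, fetch):
    '''Return the HTML saved at `path`, calling `fetch()` only if
    the file doesn't exist yet.'''
    if not os.path.exists(path):
        html = fetch()
        with open(path, 'w', encoding='utf-8') as o:
            o.write(html)

    with open(path, 'r', encoding='utf-8') as i:
        return i.read()


# a stand-in for the real request, so this example never touches the network
calls = []


def fetch_page():
    calls.append(1)
    return '<html><body>pretend this is the lobbyist page</body></html>'


demo_file = os.path.join(tempfile.mkdtemp(), 'demo.html')

first = load_or_fetch(demo_file, fetch_page)   # fetches and saves
second = load_or_fetch(demo_file, fetch_page)  # reads the saved copy

# how many times did we "hit the server"?
print(len(calls))
```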

Great! Now we have a copy of the webpage in this directory. Let's open it up and turn the contents into a `BeautifulSoup` object.

In [5]:
with open(HTML_FILE, 'r') as i:
    soup = BeautifulSoup(i.read(), 'html.parser')

We're ready to start looking for patterns and isolating the HTML elements we want to target. I like to examine the source code in the browser (in Chrome, it's `Ctrl+U` on PCs and `Cmd+Option+U` on a Mac).

It looks like all of the lobbyist HTML is enclosed in a `div` with the class `items-container`. Let's use the BeautifulSoup method `find` to isolate that first.

In [6]:
container = soup.find('div', {'class': 'items-container'})

Within that container, it looks like each individual entry is a `div` with the class `item`. Let's use `find_all` to return a list of matching elements within the container.

Then we can use the built-in [`len()`](https://docs.python.org/3/library/functions.html#len) function to see how many we've got.

In [7]:
items = container.find_all('div', {'class': 'item'})

In [8]:
len(items)

83
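
If the difference between `find` and `find_all` is fuzzy, here's a tiny sketch on some made-up HTML. The markup below only imitates the structure described above (an `items-container` div wrapping `item` divs); it is not copied from the real page.

```python
from bs4 import BeautifulSoup

fake_html = '''
<div class="items-container">
  <div class="item"><h2>Lobbyist One</h2></div>
  <div class="item"><h2>Lobbyist Two</h2></div>
</div>
'''

demo_soup = BeautifulSoup(fake_html, 'html.parser')

# find() returns the first match (or None if nothing matches)
demo_container = demo_soup.find('div', {'class': 'items-container'})

# find_all() returns a list of every match inside the container
demo_items = demo_container.find_all('div', {'class': 'item'})

print(len(demo_items))
print(demo_items[0].find('h2').text.strip())
```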

Looking good! Let's grab _one_ of those items as a test and parse out the information. We'll then use what we learned to scrape the info out of each entry, one at a time.

A lobbyist can have more than one client, so in our data, one record will be one lobbying relationship -- each row is, essentially, a client and the lobbyist representing them.

In [9]:
# grab the first item and call it `test`
test = items[0]

# the person's name is in an h2 headline
name = test.find('h2').text.strip()

# their position is in a span element with the class `position`
position = test.find('span', {'class': 'position'}).text.strip()

# their status is in two span elements that have the class `status`
# the first is currency (expired or current)
# the second is, are they a lobbyist for the city?
# find_all() returns a list
statuses = test.find_all('span', {'class': 'status'})

# grab text of "currency" status tag
status = statuses[0].text.strip()

# set a default value -- they're assumed to not be a city lobbyist
lobbyist_for_city = False

# unless the word "yes" appears in the (lowercased) city lobbyist span text
# in which case, flip that variable to true
if 'yes' in statuses[1].text.lower():
    lobbyist_for_city = True

# the company is in a div with the class `type`
company = test.find('div', {'class': 'type'}).text.strip()

# the company address is in a div with the class `title`
company_address = test.find('div', {'class': 'title'}).text.strip()

# lobbyists can have one or more clients, and these are list items in an unordered list
# use find_all() to get all of the list items
clients = test.find_all('li')

# loop over the list of clients
for client in clients:
    
    # the company is in a span with the class `company`
    # we'll also strip off the colon at the end and kill any external whitespace
    # https://www.tutorialspoint.com/python/string_rstrip.htm
    client_company = client.find('span', {'class': 'company'}).string.rstrip(':').strip()
    
    # the company address is in a span with the class `address`
    client_address = client.find('span', {'class': 'address'}).string.strip()

    # use a trick to strip out internal whitespace
    # https://stackoverflow.com/a/3739939
    client_address_clean = ' '.join(client_address.split())
    
    # print the results
    print(name, position, status, lobbyist_for_city, company, company_address,
          client_company, client_address_clean)

Abass B. Kamara Partner expired False The Carey Group The Grant Bldg. 310 Grant Street Suite 1123 Pittsburgh PA, 15219 Veolia Water America LLC 200 East Randolph Street, Chicago IL, 60601
Abass B. Kamara Partner expired False The Carey Group The Grant Bldg. 310 Grant Street Suite 1123 Pittsburgh PA, 15219 SEIU Healthcare Pennsylvania 1500 North Second Street, Harrisburg PA, 17102
Abass B. Kamara Partner expired False The Carey Group The Grant Bldg. 310 Grant Street Suite 1123 Pittsburgh PA, 15219 American Traffic Solutions 1330 West Southern Avenue, Tempe AZ, 85282
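
The two string clean-up steps are easier to see in isolation. Here's a quick sketch on made-up strings (the messy spacing is invented for the example, not taken from the page):

```python
raw_company = 'Veolia Water America LLC:'
raw_address = '200 East Randolph Street,\n      Chicago   IL, 60601'

# rstrip(':') removes colons from the right-hand end only
company_clean = raw_company.rstrip(':').strip()

# split() with no arguments breaks on any run of whitespace (spaces, tabs,
# newlines), and ' '.join() stitches the pieces back with single spaces
address_clean = ' '.join(raw_address.split())

print(company_clean)
print(address_clean)
```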


Solid. Now we can basically copy-paste that code into a [for loop](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#for-loops) and apply it to each item we found.

In [11]:
for item in items:

    name = item.find('h2').text.strip()
    position = item.find('span', {'class': 'position'}).text.strip()
    statuses = item.find_all('span', {'class': 'status'})
    status = statuses[0].text.strip()

    lobbyist_for_city = False
    if 'yes' in statuses[1].text.lower():
        lobbyist_for_city = True

    company = item.find('div', {'class': 'type'}).text.strip()
    company_address = item.find('div', {'class': 'title'}).text.strip()

    clients = item.find_all('li')

    for client in clients:
        client_company = client.find('span', {'class': 'company'}).string.rstrip(':').strip()
        client_address = client.find('span', {'class': 'address'}).string.strip()
        
        client_address_clean = ' '.join(client_address.split())
        print(name, position, status, lobbyist_for_city, company, company_address,
              client_company, client_address_clean)

Abass B. Kamara Partner expired False The Carey Group The Grant Bldg. 310 Grant Street Suite 1123 Pittsburgh PA, 15219 Veolia Water America LLC 200 East Randolph Street, Chicago IL, 60601
Abass B. Kamara Partner expired False The Carey Group The Grant Bldg. 310 Grant Street Suite 1123 Pittsburgh PA, 15219 SEIU Healthcare Pennsylvania 1500 North Second Street, Harrisburg PA, 17102
Abass B. Kamara Partner expired False The Carey Group The Grant Bldg. 310 Grant Street Suite 1123 Pittsburgh PA, 15219 American Traffic Solutions 1330 West Southern Avenue, Tempe AZ, 85282
Alexandra B. Kozak Mgr. of Gov and Community Relations current 2018 False Duquesne University 600 Forbes Avenue Pittsburgh PA, 15282 Duquesne University 600 Forbes Avenue, Pittsburgh PA, 15282
Andrea Perez PA Political Director current 2018 False SEIU Local 32BJ 209 9th Floor  Pittsburgh PA, 15222 SEIU Local 32BJ 25 West 18th Street 5th Floor, New York NY, 10011
Archie c. Buckner WPA Political Director current 2018 False SEI

Nina Tinari Director of Government Relations expired False S.R. Wojdak and Associates LP 200 South Broad Street, Suite 850 Philadelphia  PA, 19102 Microsoft Corporation 1 Microsoft Way, Redmond WA, 98052
Patrick J. Lavelle Associate current 2018 True Malady & Wooten LLP 604 North Third Street Harrisburg PA, 17101 Duquesne Light Company 411 Seventh Avenue, Pittsburgh PA, 15219
Patrick J. Lavelle Associate current 2018 True Malady & Wooten LLP 604 North Third Street Harrisburg PA, 17101 Yellow Cab Company of Pittsburgh 1825 Liverpool Street, Pittsburgh PA, 15233
Patrick J. Lavelle Associate current 2018 True Malady & Wooten LLP 604 North Third Street Harrisburg PA, 17101 Pittsburgh Symphone Orchestra 600 Penn Avenue, Pittsburgh PA, 15222
Patrick J. Lavelle Associate current 2018 True Malady & Wooten LLP 604 North Third Street Harrisburg PA, 17101 Penn Film Group Suite #2 2820 Smallman Street, Pittsburgh PA, 15222
Patrick J. Lavelle Associate current 2018 True Malady & Wooten LLP 604 Nort

Looking good! Now let's write everything out to a CSV.

In [12]:
with open(CSV_FILE, 'w', newline='') as o:
    headers = ['name', 'position', 'status', 'city_lobbyist', 'company',
               'company_address', 'client_company', 'client_address']

    writer = csv.DictWriter(o, fieldnames=headers)
    writer.writeheader()
    
    for item in items:

        name = item.find('h2').text.strip()
        position = item.find('span', {'class': 'position'}).text.strip()
        statuses = item.find_all('span', {'class': 'status'})
        status = statuses[0].text.strip()

        lobbyist_for_city = False
        if 'yes' in statuses[1].text.lower():
            lobbyist_for_city = True

        company = item.find('div', {'class': 'type'}).text.strip()
        company_address = item.find('div', {'class': 'title'}).text.strip()

        clients = item.find_all('li')

        for client in clients:
            client_company = client.find('span', {'class': 'company'}).string.rstrip(':').strip()
            client_address = client.find('span', {'class': 'address'}).string.strip()

            client_address_clean = ' '.join(client_address.split())
            writer.writerow({
                'name': name,
                'position': position,
                'status': status,
                'city_lobbyist': lobbyist_for_city,
                'company': company,
                'company_address': company_address,
                'client_company': client_company,
                'client_address': client_address_clean
            })
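
To see what `DictWriter` produces without creating a file, here's a small sketch that writes to an in-memory buffer instead. The two-column layout is trimmed down for the example, and the `demo_` names are made up; the values come from the output above.

```python
import csv
import io

buffer = io.StringIO()

demo_headers = ['name', 'client_address']
demo_writer = csv.DictWriter(buffer, fieldnames=demo_headers)
demo_writer.writeheader()

# each row is a dictionary whose keys match the fieldnames
demo_writer.writerow({
    'name': 'Abass B. Kamara',
    'client_address': '200 East Randolph Street, Chicago IL, 60601'
})

# the address contains commas, so the csv module wraps it in quotes
print(buffer.getvalue())

# reading it back gives us the original values
buffer.seek(0)
rows_back = list(csv.DictReader(buffer))
```

That quoting is why it's safer to let the `csv` module write the file than to join values with commas by hand.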

### _Extra credit_

We're repeating ourselves a lot here. If I were going to publish this scraper, I'd probably clean this up into a series of functions that each do one thing. Some homework, if you're interested: Break down the processing we've done into major tasks (fetch the page, save to file, parse the contents) and write [functions](../reference/Functions.ipynb) for each task.

(Eventually, as you progress in your coding journey, [this handy guide to refactoring](https://refactoring-101.readthedocs.io/en/latest/) will become very useful!)
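
As one possible starting point for that homework, here's a sketch of a single small function: it takes the details for one lobbyist plus a list of their clients and returns one flat row per lobbying relationship. The function name and the dictionary layout are invented for this example, and it deliberately leaves out the fetching and HTML-parsing tasks.

```python
def build_rows(lobbyist, clients):
    '''Return one dictionary per client, each carrying a copy of
    the lobbyist's details.'''
    rows = []
    for client in clients:
        row = dict(lobbyist)   # copy, so rows don't share one dictionary
        row.update(client)
        rows.append(row)
    return rows


demo_lobbyist = {'name': 'Abass B. Kamara', 'company': 'The Carey Group'}

demo_clients = [
    {'client_company': 'Veolia Water America LLC'},
    {'client_company': 'SEIU Healthcare Pennsylvania'},
    {'client_company': 'American Traffic Solutions'}
]

demo_rows = build_rows(demo_lobbyist, demo_clients)

for row in demo_rows:
    print(row['name'], '|', row['client_company'])
```

Because the function only deals with plain dictionaries, it can be tested without a web page or a `BeautifulSoup` object.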

### Load data into pandas for analysis

Congrats! You've scraped a web page into a clean CSV. Here's where you could load it up into pandas and take a look.

In [13]:
df = pd.read_csv(CSV_FILE)

In [14]:
df.head()

| | name | position | status | city_lobbyist | company | company_address | client_company | client_address |
|---|---|---|---|---|---|---|---|---|
| 0 | Abass B. Kamara | Partner | expired | False | The Carey Group | The Grant Bldg. 310 Grant Street Suite 1123 Pi... | Veolia Water America LLC | 200 East Randolph Street, Chicago IL, 60601 |
| 1 | Abass B. Kamara | Partner | expired | False | The Carey Group | The Grant Bldg. 310 Grant Street Suite 1123 Pi... | SEIU Healthcare Pennsylvania | 1500 North Second Street, Harrisburg PA, 17102 |
| 2 | Abass B. Kamara | Partner | expired | False | The Carey Group | The Grant Bldg. 310 Grant Street Suite 1123 Pi... | American Traffic Solutions | 1330 West Southern Avenue, Tempe AZ, 85282 |
| 3 | Alexandra B. Kozak | Mgr. of Gov and Community Relations | current 2018 | False | Duquesne University | 600 Forbes Avenue Pittsburgh PA, 15282 | Duquesne University | 600 Forbes Avenue, Pittsburgh PA, 15282 |
| 4 | Andrea Perez | PA Political Director | current 2018 | False | SEIU Local 32BJ | 209 9th Floor Pittsburgh PA, 15222 | SEIU Local 32BJ | 25 West 18th Street 5th Floor, New York NY, 10011 |
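
For instance, a natural first question is how many clients each lobbyist has. Here's a minimal sketch of that, using a small hand-built dataframe (two columns, a few rows taken from the output above) so it runs on its own. With the real data, you'd run the same `groupby` on `df`.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'name': ['Abass B. Kamara', 'Abass B. Kamara', 'Abass B. Kamara',
             'Alexandra B. Kozak', 'Andrea Perez'],
    'client_company': ['Veolia Water America LLC',
                       'SEIU Healthcare Pennsylvania',
                       'American Traffic Solutions',
                       'Duquesne University',
                       'SEIU Local 32BJ']
})

# count the distinct clients for each lobbyist, biggest first
clients_per_lobbyist = (demo_df.groupby('name')['client_company']
                               .nunique()
                               .sort_values(ascending=False))

print(clients_per_lobbyist)
```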


In [None]:
# what else?