# Web scraping practice: Pittsburgh lobbyists

We're going to scrape [a database of people registered to lobby the city of Pittsburgh](http://www.openbookpittsburgh.com/SearchLobbyists.aspx).

First, let's import the packages we'll need: `csv`, `requests`, `bs4`, `pandas`.

In [None]:
# import csv, requests, BeautifulSoup from bs4, pandas



### Noodle around

Navigate to the page we're going to scrape. We want everyone -- what happens when you hit "search" without entering any criteria into the form? It works! (This won't always be the case for databases like this.) As of July 28, 2018, the search showed 83 lobbyists.

Notice, too, that the URL has changed from this:

`http://www.openbookpittsburgh.com/SearchLobbyists.aspx`

To this:

`http://www.openbookpittsburgh.com/SearchLobbyists.aspx?&page=0&cat=LobbyistName&sort=ASC&num=10&click=1`

After the `?` are the URL _parameters_, separated by `&`:
- `page=0`
- `cat=LobbyistName`
- `sort=ASC`
- `num=10`
- `click=1`

These are the instructions that get passed to the database after we click search: Return results under the category "Lobbyist Name," show 10 lobbyists at a time, sort ascending, starting with page zero (the first page).

(What happens if you _do_ put something in the search field? A new parameter is added to the URL: `lobbyist=`. But we want everything, so we can ignore this.)
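
To see the same breakdown in code, here is a minimal sketch that uses the standard library's `urllib.parse` to split the search URL into its parameters. It's only an illustration of how the URL is put together; the scraper itself doesn't need it.

```python
from urllib.parse import urlsplit, parse_qs

url = ('http://www.openbookpittsburgh.com/SearchLobbyists.aspx'
       '?&page=0&cat=LobbyistName&sort=ASC&num=10&click=1')

# everything after the `?` is the query string
query = urlsplit(url).query

# parse_qs() maps each parameter name to a list of its values
params = parse_qs(query)

for name, values in params.items():
    print(name, '=', values[0])
```

Note that every value comes back as a string, even the ones that look like numbers.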

What happens if we tweak the URL and instruct the database to show us _100_ results at a time? Try it:

[`http://www.openbookpittsburgh.com/SearchLobbyists.aspx?&page=0&cat=LobbyistName&sort=ASC&num=100&click=1`](http://www.openbookpittsburgh.com/SearchLobbyists.aspx?&page=0&cat=LobbyistName&sort=ASC&num=100&click=1)

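Now everyone in the database fits on a single page. (The code below asks for 1000 results at a time, which leaves some headroom if the list grows.)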

### Save web page locally

So we've got our target URL -- if we request that page, we get back some HTML containing all the data we'd like to scrape.

When possible, it's good practice to save local copies of the pages that you're scraping. That way you don't have to rely on a stable internet connection as you work on your scraper, and you can avoid sending unnecessary traffic to the target's server.

Let's do that now.

First, set up a couple of variables:
- The base URL
- A [dictionary](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Dictionaries) of URL parameters (see [the requests documentation here](http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls))
- The name of the `.html` file we'll save to locally
- The name of the `.csv` file we'll write out our results to

In [None]:
BASE_URL = 'http://www.openbookpittsburgh.com/SearchLobbyists.aspx'

URL_PARAMS = {
    'page': 0,
    'cat': 'LobbyistName',
    'sort': 'ASC',
    'num': 1000,
    'click': 1
}

HTML_FILE = 'pittsburgh-lobbyists.html'
CSV_FILE = 'pittsburgh-lobbyists.csv'

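Now actually fetch the page with `requests.get()`, passing a dictionary of custom headers (for instance, a `User-Agent` that identifies you) as `headers` and our parameters as `params=URL_PARAMS`.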

In [None]:
# request the page
# specify URL to get, custom headers and `params`
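
One way this step could look, as a sketch rather than the definitive answer. The `HEADERS` dictionary, its made-up `User-Agent` string, the `fetch_page` helper and the 30-second timeout are all assumptions added for illustration. The last two lines build the request without sending it, so you can check the URL that `requests` assembles from the parameters.

```python
import requests

BASE_URL = 'http://www.openbookpittsburgh.com/SearchLobbyists.aspx'
URL_PARAMS = {'page': 0, 'cat': 'LobbyistName', 'sort': 'ASC', 'num': 1000, 'click': 1}

# assumption: a made-up User-Agent string -- swap in your own details
HEADERS = {'User-Agent': 'lobbyist-scraper-practice (your-name@example.com)'}


def fetch_page(url, params, headers):
    """Request the page and return the response; raise an error on a bad status."""
    response = requests.get(url, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return response


# preview the URL that requests builds from the params -- no network call is made here
preview = requests.Request('GET', BASE_URL, params=URL_PARAMS).prepare()
print(preview.url)
```

Calling `fetch_page(BASE_URL, URL_PARAMS, HEADERS)` would make the actual request and return a response whose `text` attribute holds the page's HTML.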


Write the response's `text` attribute -- the HTML source of the requested page -- to a file with the name we just specified.

In [None]:
# open html file

    # and write the page text into it
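
A minimal sketch of the write step, using a stand-in string in place of the real response text. The `SKETCH_FILE` name is hypothetical, chosen so that running the sketch doesn't overwrite your real local copy.

```python
# stand-in for the `text` attribute of the response (not the real page)
page_text = '<html><body><h2>Hello, Pittsburgh</h2></body></html>'

# hypothetical file name, so the sketch doesn't clobber your real copy
SKETCH_FILE = 'sketch-page.html'

# the `with` block closes the file for us once the writing is done
with open(SKETCH_FILE, 'w', encoding='utf-8') as outfile:
    outfile.write(page_text)

# read it back to confirm the round trip
with open(SKETCH_FILE, 'r', encoding='utf-8') as infile:
    saved = infile.read()

print(saved == page_text)
```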


Great! Now we have a copy of the webpage in this directory. Let's open it up and turn the contents into a `BeautifulSoup` object.

In [None]:
# open the html file we just made

    # read in the contents and turn them into soup
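
Reading the file back into soup might look something like this. The sketch writes its own tiny stand-in file first (again under a hypothetical name) so that it runs on its own; with the real page you'd open `HTML_FILE` instead. It assumes the built-in `html.parser` backend, though other parsers would work too.

```python
from bs4 import BeautifulSoup

# make a tiny stand-in file so the sketch is self-contained
SKETCH_FILE = 'sketch-page.html'
with open(SKETCH_FILE, 'w', encoding='utf-8') as outfile:
    outfile.write('<html><body><h2>Hello, Pittsburgh</h2></body></html>')

# open the file, read the contents and hand them to BeautifulSoup
with open(SKETCH_FILE, 'r', encoding='utf-8') as infile:
    soup = BeautifulSoup(infile.read(), 'html.parser')

print(soup.h2.text)
```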


We're ready to start looking for patterns and isolating the HTML elements we want to target. I like to examine the source code in the browser (in Chrome, it's `Ctrl+U` on PCs and `Cmd+Option+U` on a Mac).

It looks like all of the lobbyist HTML is enclosed in a `div` with the class `items-container`. Let's use the BeautifulSoup method `find` to isolate that first.

In [None]:
# find the container


Within that container, it looks like each individual entry is a `div` with the class `item`. Let's use `find_all` to return a list of matching elements within the container.

Then we can use the built-in [`len()`](https://docs.python.org/3/library/functions.html#len) function to see how many we've got.

In [None]:
# find the items


In [None]:
# check the length of the list of items with len()
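
Here is a sketch of the `find()` / `find_all()` / `len()` pattern on a few lines of invented HTML that mimic the structure described above. The real page has far more going on inside each item.

```python
from bs4 import BeautifulSoup

# invented HTML that mimics the structure described above
html = '''
<div class="items-container">
  <div class="item"><h2>Lobbyist A</h2></div>
  <div class="item"><h2>Lobbyist B</h2></div>
  <div class="item"><h2>Lobbyist C</h2></div>
</div>
<div class="item"><h2>Outside the container</h2></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching element (or None if nothing matches)
container = soup.find('div', class_='items-container')

# find_all() returns a list of every match inside the container
items = container.find_all('div', class_='item')

print(len(items))
```

Because the search starts from `container` rather than from `soup`, the stray `item` outside the container is left out of the list.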


Looking good! Let's grab _one_ of those items as a test and parse out the information. We'll then use what we learned to scrape the info out of each entry, one at a time.

Lobbyists can have more than one client, so in our data, one record will be one lobbying relationship -- each row is, essentially, a client and the lobbyist representing them.

In [None]:
# grab the first item and call it `test`


# the person's name is in an h2 headline


# their position is in a span element with the class `position`


# their status is in two span elements that have the class `status`
# the first is currency (expired or current)
# the second is, are they a lobbyist for the city?
# find_all() returns a list


# grab text of "currency" status tag


# set a default value -- they're assumed to not be a city lobbyist


# unless the word "yes" appears in the (lowercased) city lobbyist span text
# in which case, flip that variable to true



# the company is in a div with the class `type`


# the company address is in a div with the class `title`


# lobbyists can have one or more clients, and these are list items in an unordered list
# use find_all() to get all of the list items


# loop over the list of clients

    
    # the company is in a span with the class `company`
    # we'll also strip off the colon at the end and kill any external whitespace
    # https://www.tutorialspoint.com/python/string_rstrip.htm

    
    # the company address is in a span with the class `address`


    # use a trick to strip out internal whitespace
    # https://stackoverflow.com/a/3739939

    
    # print the results


Solid. Now we can basically copy-paste that code into a [for loop](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#for-loops) and apply it to each item we found.

In [None]:
# for every item in our list

    # name is in the h2 tag -- text.strip()

    # position is span with class position

    # status is split across two span tags with a class called 'status'
    # use find_all() to get them in a list

    
    # first one [0] is status


    # default value -- they're not a lobbyist for the city

    # if 'yes' is in the lowercase text of the second 'status' span

        # then they _are_ a city lobbyist

    # company is a div with class `type`

    # company address is a div with class `title`

    # clients are a bunch of list items -- use find_all() to get them

    # loop over client list

        # client company is span with class `company`
        # rstrip the colon and strip whitespace

        # client address is span with class address

        # remove internal whitespace

        # print the results


Looking good! Now let's write everything out to a CSV.

In [None]:
# open the CSV_FILE in write mode, newline=''

    # define headers
    headers = ['name', 'position', 'status', 'city_lobbyist', 'company',
               'company_address', 'client_company', 'client_address']

    # create a DictWriter object

    # write the headers

    # loop over the items in our list

        # name is h2

        # position is span with class position

        # statuses are in spans with class status -- use find_all()

        # first thing in that list is status

        # assume not a lobbyist for the city

        # but if 'yes' in text of second status tag

            # flip to True

        # company info is div with class `type`

        # company address is div with class `title`

        # find_all() to get list of `li` tags of clients

        # loop over client list
            # client company in span with class `company`
            # rstrip() colon and strip() whitespace

            # client address is span with class `address`

            # remove internal whitespace

            # write out to file
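
The writing step on its own can be sketched like this, with two invented rows standing in for the scraped results and a hypothetical file name so your real CSV is left alone. In the scraper, the `writer.writerow()` call would sit inside the loop over clients.

```python
import csv

headers = ['name', 'position', 'status', 'city_lobbyist', 'company',
           'company_address', 'client_company', 'client_address']

# invented rows standing in for the scraped results
rows = [
    {'name': 'Jane Doe', 'position': 'Partner', 'status': 'Current',
     'city_lobbyist': True, 'company': 'Example Firm LLC',
     'company_address': '1 Main St., Pittsburgh, PA',
     'client_company': 'Acme Corp',
     'client_address': '2 Oak Ave., Pittsburgh, PA'},
    {'name': 'Jane Doe', 'position': 'Partner', 'status': 'Current',
     'city_lobbyist': True, 'company': 'Example Firm LLC',
     'company_address': '1 Main St., Pittsburgh, PA',
     'client_company': 'Widget Co',
     'client_address': '3 Elm St., Pittsburgh, PA'},
]

# hypothetical file name, so the sketch doesn't overwrite your real CSV
SKETCH_CSV = 'sketch-lobbyists.csv'

# newline='' lets the csv module handle line endings itself
with open(SKETCH_CSV, 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=headers)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# read the file back to check what landed on disk
with open(SKETCH_CSV, 'r', newline='', encoding='utf-8') as infile:
    saved = list(csv.DictReader(infile))

print(len(saved))
```

The keys of each dictionary have to match the `fieldnames` list, and everything comes back as text when you read the file again, including the `True`/`False` values.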


### _Extra credit_

We're repeating ourselves a lot here. If I were going to publish this scraper, I'd probably clean this up into a series of functions that each do one thing. Some homework, if you're interested: Break down the processing we've done into major tasks (fetch the page, save to file, parse the contents) and write [functions](../reference/Functions.ipynb) for each task.

(Eventually, as you progress in your coding journey, [this handy guide to refactoring](https://refactoring-101.readthedocs.io/en/latest/) will become very useful!)

### Load data into pandas for analysis

Congrats! You've scraped a web page into a clean CSV. Here's where you could load it up into pandas and take a look.

In [None]:
# read in our csv


In [None]:
# use head() to check it out


In [None]:
# what else?
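
As one idea for "what else," here is a sketch that loads a few invented rows (not real data) and counts clients per lobbyist. It reads from an in-memory string so that it runs on its own; with the real file you'd pass `CSV_FILE` to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# invented CSV text standing in for the file the scraper wrote
csv_text = '''name,position,status,city_lobbyist,company,company_address,client_company,client_address
Jane Doe,Partner,Current,True,Example Firm LLC,"1 Main St., Pittsburgh, PA",Acme Corp,"2 Oak Ave., Pittsburgh, PA"
Jane Doe,Partner,Current,True,Example Firm LLC,"1 Main St., Pittsburgh, PA",Widget Co,"3 Elm St., Pittsburgh, PA"
John Roe,Associate,Expired,False,Sample Group,"4 Pine St., Pittsburgh, PA",Acme Corp,"2 Oak Ave., Pittsburgh, PA"
'''

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# one row per lobbying relationship, so counting rows per name counts clients
clients_per_lobbyist = df['name'].value_counts()
print(clients_per_lobbyist)
```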