# Web scraping: Senate press accreditations

In this notebook, we're going to scrape [a table of journalists in the Senate press gallery](https://www.dailypress.senate.gov/?page_id=67).

The data are paginated, but what do we see when we inspect the source? Boom: All of the table rows are there on the page when it loads; some are just hidden from view. This is good news for us: We don't have to handle pagination at all. We can just grab that one page.

First, let's import our dependencies:

In [None]:
# csv


# requests

# BeautifulSoup from bs4


Now, some variables:
- The URL to the web page we're going to scrape
- The name of the `.html` file we're going to save the page as
- The name of the `.csv` file we're going to write the parsed data into

In [None]:
URL = 'https://www.dailypress.senate.gov/?page_id=67'
SAVED_HTML = 'senate-press-gallery.html'
CSV_FILE = 'congress-press.csv'

Use `requests` to fetch the page, adding a _dictionary_ of [custom headers](http://docs.python-requests.org/en/master/user/quickstart/#custom-headers) to identify yourself.

👉 For more on dictionaries, [see this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Dictionaries).
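
Here's a minimal sketch of what a request with custom headers could look like. The name and email are placeholders, and the `fetch_page()` helper is a hypothetical convenience function, not something the `requests` library provides. The actual call is left commented out so the cell doesn't hit the server until you want it to.

```python
import requests

URL = 'https://www.dailypress.senate.gov/?page_id=67'

# hypothetical name and email: swap in your own details
headers = {
    'name': 'Jane Reporter',
    'email': 'jane.reporter@example.com'
}


def fetch_page(url, headers):
    '''Fetch a page, identifying ourselves with custom headers.'''
    response = requests.get(url, headers=headers, timeout=30)
    # raise an error if the server sent back a 4xx or 5xx status
    response.raise_for_status()
    return response


# uncomment to actually request the page:
# r = fetch_page(URL, headers)
# print(r.status_code)
```

The `timeout` and `raise_for_status()` calls are optional extras; they just keep a slow or failed request from going unnoticed.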

In [None]:
# get the HTML
# include a dictionary with your name and email


Open the local HTML file in write (`w`) mode and write the `text` attribute of the response into it.
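
As a rough illustration, here's the same pattern with a stand-in string instead of a real response. The `page_text` variable and the temp-directory file path are assumptions made for this example; in the notebook you'd use the response's `text` and `SAVED_HTML`.

```python
import os
import tempfile

# stand-in for the `text` attribute of a requests response
page_text = '<html><body><p>Hello, press gallery!</p></body></html>'

# hypothetical file path in the system temp directory
saved_html = os.path.join(tempfile.gettempdir(), 'example-page.html')

# the with block closes the file for us when we're done writing
with open(saved_html, 'w', encoding='utf-8') as outfile:
    outfile.write(page_text)
```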

In [None]:
# in a with block, open the SAVED_HTML page in write mode
# and write in the text returned when you fetched the page


Now let's open that file in read (`r`) mode and turn the HTML into a `BeautifulSoup` object.
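
A small sketch of the read-then-parse step, using a tiny made-up page so it runs on its own. The file name and HTML are placeholders, and `'html.parser'` (Python's built-in parser) is one reasonable choice among several that BeautifulSoup supports.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# hypothetical file in the temp directory, standing in for SAVED_HTML
saved_html = os.path.join(tempfile.gettempdir(), 'example-soup.html')

# write a tiny page first so there's something to read
with open(saved_html, 'w', encoding='utf-8') as outfile:
    outfile.write('<html><head><title>Press gallery</title></head>'
                  '<body></body></html>')

# open the same file in read mode
with open(saved_html, 'r', encoding='utf-8') as infile:
    # read in the contents of the file
    html = infile.read()
    # turn it into soup
    soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)
```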

In [None]:
# open the same file in read mode

    # read in the contents of the file

    
    # turn it into soup


Quick HTML lesson! A `<table>` element typically contains rows, represented by the `<tr>` tag, and each row, in turn, contains one or more "table data" cells, represented by the `<td>` tag.

So we're going to loop over the rows in the table, and for each row, we're going to extract the bits of data in each `td` cell.

👉 For more information on _for loops_, [check out this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#for-loops).

The table we're after has the class `tablepress`, so that's how we'll target it. To target elements using BeautifulSoup, we'll pass the `find()` method two things:
- `table`: The name of the element we're targeting
- `{'class': 'tablepress'}`: A dictionary with key/value pairs that describe the element we're targeting

The `find()` method returns the first element it encounters that meets the criteria, reading top to bottom. If we thought there might be more than one table of interest, we could use the `find_all()` method, which returns a list.
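
To see the difference between `find()` and `find_all()`, here's a sketch on a made-up snippet of HTML with two tables. The markup and the `other` class are invented for the example.

```python
from bs4 import BeautifulSoup

# made-up HTML with two tables, only one of which we care about
html = '''
<table class="other">
  <tr><td>skip me</td></tr>
</table>
<table class="tablepress">
  <tr><th>First</th><th>Last</th></tr>
  <tr><td>Jane</td><td>Doe</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match: the table with the class 'tablepress'
table = soup.find('table', {'class': 'tablepress'})

# find_all() returns every match as a list
all_tables = soup.find_all('table')

print(table['class'])
print(len(all_tables))
```

One thing to keep in mind: If nothing matches, `find()` returns `None`, so it's worth checking the result before calling methods on it.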

In [None]:
# find the table with the class 'tablepress'


Next, let's get a list of the rows in the table. In HTML, table rows are represented by the element `tr`.
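
Here's one way that might look on a small made-up table (the names and affiliations are placeholders). Calling `find_all()` on the table, rather than on the whole soup, limits the search to rows inside that table.

```python
from bs4 import BeautifulSoup

# made-up table: one header row, two data rows
html = '''
<table class="tablepress">
  <tr><th>First</th><th>Last</th><th>Affiliation</th></tr>
  <tr><td>Jane</td><td>Doe</td><td>Example Gazette</td></tr>
  <tr><td>John</td><td>Roe</td><td>Sample Wire</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'tablepress'})

# get a list of the `tr` elements inside this table
rows = table.find_all('tr')

print(len(rows))
```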

In [None]:
# get a list of rows in the table we just found


Let's see what we've got:

In [None]:
# rows


It's a list! Which means we can loop over the rows, using list slicing (`[1:]`) to skip the first (`[0]`) row, which has the headers, and extract the data from the `td` (table data) tags in each row.
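
A sketch of that loop on a made-up table, with some stray whitespace in the cells to show what `strip()` is doing. The sample data and the `parsed` list are assumptions for this example.

```python
from bs4 import BeautifulSoup

# made-up table with extra whitespace inside some cells
html = '''
<table class="tablepress">
  <tr><th>First</th><th>Last</th><th>Affiliation</th></tr>
  <tr><td> Jane </td><td>Doe</td><td>Example Gazette </td></tr>
  <tr><td>John</td><td> Roe</td><td>Sample Wire</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find('table', {'class': 'tablepress'}).find_all('tr')

parsed = []

# [1:] skips the header row at position [0]
for row in rows[1:]:
    # find all of the `td` cells in this row
    cells = row.find_all('td')

    # grab the text in each cell and trim the whitespace
    first = cells[0].string.strip()
    last = cells[1].string.strip()
    affiliation = cells[2].string.strip()

    parsed.append((first, last, affiliation))
    print(first, last, affiliation)
```

Note that `.string` is `None` when a cell contains more than one child element, so cells with nested tags may need `get_text()` instead.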

In [None]:
# loop over the table rows, skipping the first row

    
    # find all of the `td` cells in this row

    
    # first name is in position [0]
    # get the `string` attribute and call the `strip()` method on it


    # last name is in position [1]


    # affiliation is in position [2]

    # print out what we've got


Great! Now, instead of printing, we'll write the results out to a file.

👉 For more information on reading and writing delimited files, [see this notebook](../reference/Reading%20and%20writing%20delimited%20data%20files%20with%20vanilla%20Python.ipynb).
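
Here's a minimal sketch of the `DictWriter` pattern, using a short list of made-up records in place of the table rows. The file path, the `people` list and the column names are all assumptions for this example.

```python
import csv
import os
import tempfile

# hypothetical output file in the temp directory, standing in for CSV_FILE
csv_file = os.path.join(tempfile.gettempdir(), 'example-press.csv')

# made-up records standing in for the parsed table rows
people = [
    ('Jane', 'Doe', 'Example Gazette'),
    ('John', 'Roe', 'Sample Wire'),
]

# open the file in write mode; newline='' lets the csv module
# handle line endings itself
with open(csv_file, 'w', newline='') as outfile:
    # create a DictWriter object and specify the fieldnames
    writer = csv.DictWriter(
        outfile,
        fieldnames=['first_name', 'last_name', 'affiliation']
    )

    # write header row
    writer.writeheader()

    for first, last, affiliation in people:
        # write out one dictionary per row
        writer.writerow({
            'first_name': first,
            'last_name': last,
            'affiliation': affiliation
        })

# read the file back in to check our work
with open(csv_file, 'r', newline='') as infile:
    written = list(csv.DictReader(infile))

print(len(written))
```

The keys in each dictionary have to match the `fieldnames` exactly, or `DictWriter` will raise a `ValueError`.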

In [None]:
# open the file to write to
# specify write mode, newline=''


    # create a DictWriter object and specify the fieldnames

    
    # write header row


    # loop over table rows, skipping first row

        
        # get a list of `td` tags in this row

        
        # first name is in position [0]        

        
        # last name is in position [1]

        
        # affiliation is in position [2]


        # write out dictionary to file


# 📚 GROUP HOMEWORK 📚

In groups: Print a list of numbers from the calendar on the right-hand rail.

**Bonus** (will require some Googling): Print a list of _links_ in the calendar on the right-hand rail. (Hint: Links are `a` tags, and the URL is stored in the `href` attribute.)