# Web scraping: Senate press accreditations

In this notebook, we're going to scrape [a table of journalists in the Senate press gallery](https://www.dailypress.senate.gov/?page_id=67).

The data are paginated, but what do we see when we inspect the source? Boom: All of the table rows are there on the page when it loads; some are just hidden from view. This is good news for us -- we don't have to worry about handling pagination. We can just grab that one page.

First, let's import our dependencies:

In [1]:
import csv

import requests
from bs4 import BeautifulSoup

Now, some variables:
- The URL to the web page we're going to scrape
- The name of the `.html` file we're going to save the page as
- The name of the `.csv` file we're going to write the parsed data into

In [2]:
URL = 'https://www.dailypress.senate.gov/?page_id=67'
SAVED_HTML = 'senate-press-gallery.html'
CSV_FILE = 'congress-press.csv'

Use `requests` to fetch the page, adding a _dictionary_ of [custom headers](http://docs.python-requests.org/en/master/user/quickstart/#custom-headers) to identify yourself.

👉 For more on dictionaries, [see this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Dictionaries).
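The cell below sends `name` and `email` as header names. If you'd rather use the header fields that HTTP itself defines for identifying a client, `User-Agent` and `From` are one option. Here's a minimal sketch that builds the request without sending it, so you can inspect what would go out. The contact details are hypothetical placeholders.

```python
import requests

URL = 'https://www.dailypress.senate.gov/?page_id=67'

# hypothetical contact details -- swap in your own
headers = {
    'User-Agent': 'press-gallery-scraper (Jane Reporter)',
    'From': 'jane@example.com',
}

# build the request but don't send it, so nothing hits the network
prepared = requests.Request('GET', URL, headers=headers).prepare()

# inspect what would be sent
print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
print(prepared.headers['From'])
```

Passing the same `headers` dictionary to `requests.get()` would send these values with a real request.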

In [3]:
r = requests.get(URL, headers={'name': 'Cody Winchester', 'email': 'cody@ire.org'})

Open the local HTML file in write (`w`) mode and write in the `text` of the response we got back.

In [4]:
# set the encoding explicitly so the write here and the read below agree
with open(SAVED_HTML, 'w', encoding='utf-8') as o:
    o.write(r.text)

Now let's open that file in read (`r`) mode and turn the HTML into a `BeautifulSoup` object.

In [5]:
with open(SAVED_HTML, 'r', encoding='utf-8') as i:
    html = i.read()
    soup = BeautifulSoup(html, 'html.parser')

Quick HTML lesson! A `<table>` element typically contains rows, represented by the `<tr>` tag, and each row, in turn, contains one or more "table data" cells, represented by the `<td>` tag.

So we're going to loop over the rows in the table, and for each row, we're going to extract the bits of data in each `td` cell.

👉 For more information on _for loops_, [check out this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#for-loops).
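To make the table / row / cell nesting concrete, here's a small sketch on made-up markup (not the real page source). Header cells typically use `<th>` rather than `<td>`, so this example asks for both.

```python
from bs4 import BeautifulSoup

# made-up markup for illustration only
demo_html = """
<table class="demo">
  <tr><th>First</th><th>Last</th></tr>
  <tr><td>Ada</td><td>Lovelace</td></tr>
  <tr><td>Grace</td><td>Hopper</td></tr>
</table>
"""

demo_soup = BeautifulSoup(demo_html, 'html.parser')

all_rows = []

# loop over every row in the table
for tr in demo_soup.find_all('tr'):

    # grab the text of each header or data cell in this row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    all_rows.append(cells)
    print(cells)
```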

The table we're after has the class `tablepress`, so that's how we'll target it. To target elements using BeautifulSoup, we'll pass the `find()` method two things:
- `table`: The name of the element we're targeting
- `{'class': 'tablepress'}`: A dictionary with key/value pairs that describe the element we're targeting

The `find()` method returns the first element it encounters that meets the criteria, reading top to bottom. If we thought there might be more than one table of interest, we could use the `find_all()` method, which returns a list.
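Here's a quick sketch of the difference, again on made-up markup with hypothetical `id` values. Note that `find()` returns `None` when nothing matches, which is worth checking for before you call methods on the result.

```python
from bs4 import BeautifulSoup

# made-up markup: two tables share the class we're after
demo_html = """
<table class="tablepress" id="first"></table>
<table class="other" id="second"></table>
<table class="tablepress" id="third"></table>
"""

demo_soup = BeautifulSoup(demo_html, 'html.parser')

# find() stops at the first match
first_match = demo_soup.find('table', {'class': 'tablepress'})

# find_all() collects every match into a list
all_matches = demo_soup.find_all('table', {'class': 'tablepress'})

# no table has this class, so find() comes back empty-handed
missing = demo_soup.find('table', {'class': 'nope'})

print(first_match['id'])
print([t['id'] for t in all_matches])
print(missing)
```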

In [6]:
table = soup.find('table', {'class': 'tablepress'})

Next, let's get a list of the rows in the table. In HTML, table rows are represented by the element `tr`.

In [7]:
rows = table.find_all('tr')

Let's see what we've got:

In [8]:
rows

[<tr class="row-1 odd">
 <th class="column-1">Legal First Name</th><th class="column-2">Legal Last Name</th><th class="column-3">Organization</th>
 </tr>, <tr class="row-2 even">
 <td class="column-1">Charles</td><td class="column-2">Abbott</td><td class="column-3">FERN's Ag Insider by Charles Abbott</td>
 </tr>, <tr class="row-3 odd">
 <td class="column-1">Tamar</td><td class="column-2">Abdollah</td><td class="column-3">Associated Press</td>
 </tr>, <tr class="row-4 even">
 <td class="column-1">Richard</td><td class="column-2">Abott</td><td class="column-3">Defense Daily</td>
 </tr>, <tr class="row-5 odd">
 <td class="column-1">Yasmeen</td><td class="column-2">Abutaleb</td><td class="column-3">Thomson Reuters</td>
 </tr>, <tr class="row-6 even">
 <td class="column-1">Joel</td><td class="column-2">Achenbach</td><td class="column-3">Washington Post</td>
 </tr>, <tr class="row-7 odd">
 <td class="column-1">Andrew</td><td class="column-2">Ackerman</td><td class="column-3">Wall Street Jour

It's a list! Which means we can loop over the rows, using list slicing (`[1:]`) to skip the first (`[0]`) row, which has the headers, and extract the data from the `td` (table data) tags in each row.
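If slicing is new to you, here's the same idea on a plain list of made-up strings. The slice returns a new list and leaves the original alone.

```python
# a stand-in for our list of rows
demo_rows = ['header', 'row 1', 'row 2', 'row 3']

# the first item, at position [0]
header = demo_rows[0]

# everything from position [1] to the end
data_rows = demo_rows[1:]

print(header)
print(data_rows)

# the original list still has all four items
print(len(demo_rows))
```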

In [9]:
# loop over the table rows, skipping the first row
for row in rows[1:]:
    
    # find all of the `td` cells in this row
    cols = row.find_all('td')
    
    # first name is in position [0]
    first = cols[0].string.strip()

    # last name is in position [1]
    last = cols[1].string.strip()

    # affiliation is in position [2]
    affil = cols[2].string.strip()
    
    print(first, last, affil)

Charles Abbott FERN's Ag Insider by Charles Abbott
Tamar Abdollah Associated Press
Richard Abott Defense Daily
Yasmeen Abutaleb Thomson Reuters
Joel Achenbach Washington Post
Andrew Ackerman Wall Street Journal/ Dow Jones
Rebecca Adams CQ Roll Call
T. Becket Adams Washington Examiner
Rachel Adams-Heard S & P Global
Timothy Ahmann Thomson Reuters
Akbar Shahid Ahmed Huffington Post
Haruyuki Aikawa Mainichi Shimbun
Julia Airey Washington Times
Nicole Albright Gannett Washington Bureau
Charles Alexander Thomson Reuters
Keith Alexander Washington Post
Hector Alfaro Bloomberg News
Abdulaziz Alhendi Saudi Press Agency
Idrees Ali Thomson Reuters
Safvan Allahverdi Anadolu News Agency
Jacqueline Allen Daily Beast
Nicholas Allen London Daily Telegraph
William Allison Bloomberg News
Ricardo Alonso Associated Press
Luis Alonso Lugo Associated Press
Sarah Ampolsk Kyodo News
Mark Anderson Wall Street Journal/ Dow Jones
Nicholas Anderson Washington Post
Fanny Andre Agence France-Presse
Natalie Andrews

William Holland S & P Global
Allan Holmes Center for Public Integrity
Jacob Holzman CQ Roll Call
David Hood S & P Global
Janet Hook Wall Street Journal/ Dow Jones
Jamie Hopkins Center for Public Integrity
Jeffrey Horwitz Associated Press
Sari Horwitz Washington Post
Mark Hosenball Thomson Reuters
Billy House Bloomberg News
Colleen Howe Argus Media
Thomas Howell Washington Times
Taisei Hoyama Nikkei
Spencer Hsu Washington Post
Zexi Hu China People's Daily
Jasmin Hudson S & P Global
John Hudson Buzzfeed
Emily Huetteman Kaiser Health News
Caitlin Huey-Burns Real Clear Politics
John Hughes Bloomberg News
Siobhan Hughes Wall Street Journal/ Dow Jones
Terence Huie CQ Roll Call
Carl Hulse New York Times
Rebecca Lynn Hume Bond Buyer
Albert Hunt Bloomberg News
John Hunter CQ Roll Call
Lawrence Hurley Thomson Reuters
Charles Hurt Washington Times
Michail Ignatiou Phileleftheros Cyprus
Susumu Ikeda Akahata
Tamara Ilaria Freelance
Kasim Ileri Anadolu News Agency
Gregory Ip Wall Street Journal/ Dow

Hira Qureshi CQ Roll Call
Charles Raasch St. Louis Post-Dispatch
Ana Radelat Connecticut Mirror
Louise Radnofsky Wall Street Journal/ Dow Jones
Sethuraman Rajagopalan Pioneer - India
Chidanand Rajghatta Times of India
Sandhya Raman CQ Roll Call
Roberta Rampton Thomson Reuters
Maya Rao Minneapolis Star Tribune
Alan Rappeport New York Times
Gopal Ratnam CQ Roll Call
Jordan Rau Kaiser Health News
Francesca Rose Regalado Yomiuri Shimbun
Debra Reichmann-Kepler Associated Press
Ryan Reilly Huffington Post
Lisa Rein Washington Post
Beth Reinhard Washington Post
Andrea Ricci Thomson Reuters
Carter Rice Asahi Shimbun
Gillian Rich Investor's Business Daily
Bradford Richardson Washington Times
Warren Richey Christian Science Monitor
Laura Rigby Bloomberg News
Michael Riley Bloomberg News
Benedict Riley-Smith London Daily Telegraph
Salvador Rizzo Washington Post
Alexander Roarty McClatchy
Gregory Robb MarketWatch
Catalina Roberts CQ Roll Call
Gillian Roberts CQ Roll Call
Gregory Roberts Energy Dai

Great! Now, instead of printing, we'll write the results out to a file.

👉 For more information on reading and writing delimited files, [see this notebook](../reference/Reading%20and%20writing%20delimited%20data%20files%20with%20vanilla%20Python.ipynb).
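Before writing to disk, it can help to see what a `DictWriter` actually produces. This sketch writes one row from the output above into an in-memory buffer (`io.StringIO`) instead of a file, then prints the resulting lines.

```python
import csv
import io

# an in-memory stand-in for a file
buffer = io.StringIO()

# same fieldnames we'll use for the real file
demo_writer = csv.DictWriter(buffer, fieldnames=['first', 'last', 'affiliation'])

# write the header row, then one row of data
demo_writer.writeheader()
demo_writer.writerow({
    'first': 'Tamar',
    'last': 'Abdollah',
    'affiliation': 'Associated Press'
})

# split the buffer's contents into lines to see what was written
lines = buffer.getvalue().splitlines()
print(lines)
```

The dictionary keys have to match the `fieldnames` you handed to the writer; by default, an unexpected key raises a `ValueError`.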

In [10]:
# open the file to write to
with open(CSV_FILE, 'w', newline='') as o:

    # create a DictWriter object and specify the fieldnames
    writer = csv.DictWriter(o, fieldnames=['first', 'last', 'affiliation'])
    
    # write header row
    writer.writeheader()

    # loop over table rows, skipping first row
    for row in table.find_all('tr')[1:]:
        
        # get a list of `td` tags in this row
        cols = row.find_all('td')
        
        # first name is in position [0]        
        first = cols[0].string.strip()
        
        # last name is in position [1]
        last = cols[1].string.strip()
        
        # affiliation is in position [2]
        affil = cols[2].string.strip()

        # write out to file
        writer.writerow({
          'first': first,
          'last': last,
          'affiliation': affil
        })
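One caveat on the `.string.strip()` pattern used above: `.string` returns `None` for a cell with no contents, so calling `.strip()` on it would raise an `AttributeError`. If you suspect the table has blank cells, `get_text(strip=True)` is one alternative that returns an empty string instead. A minimal sketch on a made-up row:

```python
from bs4 import BeautifulSoup

# made-up row with an empty third cell
demo_html = '<table><tr><td>Ada</td><td>Lovelace</td><td></td></tr></table>'

demo_cols = BeautifulSoup(demo_html, 'html.parser').find_all('td')

# .string works when the cell holds a single piece of text
first_name = demo_cols[0].string.strip()

# ... but it's None for the empty cell
empty_string = demo_cols[2].string

# get_text() gives back an empty string instead
empty_text = demo_cols[2].get_text(strip=True)

print(first_name)
print(empty_string)
print(repr(empty_text))
```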

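In [14]:
# the page also has a calendar table, which has the id `wp-calendar`
calendar = soup.find('table', {'id': 'wp-calendar'})

In [18]:
# loop over the links in the body of the calendar and print each URL
for link in calendar.find('tbody').find_all('a'):
    print(link['href'])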

https://www.dailypress.senate.gov/?m=20180801
https://www.dailypress.senate.gov/?m=20180803
https://www.dailypress.senate.gov/?m=20180807
https://www.dailypress.senate.gov/?m=20180810
https://www.dailypress.senate.gov/?m=20180813
https://www.dailypress.senate.gov/?m=20180816
https://www.dailypress.senate.gov/?m=20180817
https://www.dailypress.senate.gov/?m=20180821
https://www.dailypress.senate.gov/?m=20180822
https://www.dailypress.senate.gov/?m=20180823
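If you wanted to work with these links further, one option is to pull the `m` value out of each URL with the standard library. This sketch assumes the value is a date in `YYYYMMDD` form, which is how it looks in the output above but isn't confirmed anywhere on the page.

```python
from datetime import datetime
from urllib.parse import urlparse, parse_qs

# two of the links printed above
links = [
    'https://www.dailypress.senate.gov/?m=20180801',
    'https://www.dailypress.senate.gov/?m=20180803',
]

days = []

for href in links:

    # pull the value of the `m` parameter out of the query string
    m_value = parse_qs(urlparse(href).query)['m'][0]

    # assumption: the value is a date formatted as YYYYMMDD
    day = datetime.strptime(m_value, '%Y%m%d').date()

    days.append(day.isoformat())
    print(day.isoformat())
```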
