# Web scraping: Senate press accreditations

In this notebook, we're going to scrape [a table of journalists in the Senate press gallery](https://www.dailypress.senate.gov/?page_id=67).

The data are paginated, but what do we see when we inspect the source? Boom: All of the table rows are there on the page when it loads; some are just hidden from view. This is good news for us -- we don't have to worry about handling pagination, so we can just grab one page.

First, let's import our dependencies:

In [9]:
import csv

import requests
from bs4 import BeautifulSoup

Now, some variables:
- The URL to the web page we're going to scrape
- The name of the `.html` file we're going to save the page as
- The name of the `.csv` file we're going to write the parsed data into

In [10]:
URL = 'https://www.dailypress.senate.gov/?page_id=67'
SAVED_HTML = 'senate-press-gallery.html'
CSV_FILE = 'congress-press.csv'

Use `requests` to fetch the page, adding some custom headers to identify yourself.

In [11]:
r = requests.get(URL, headers={'name': 'Cody Winchester', 'email': 'cody@ire.org'})
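Before moving on, it can help to confirm what is actually sent with a request. The sketch below builds the same request without sending it, so the headers can be inspected offline. `User-Agent` and `From` are standard HTTP header names that some scrapers use to identify themselves; swapping them in for `name` and `email` is a suggestion, not something this notebook's original code does, and the contact values are placeholders.

```python
import requests

URL = 'https://www.dailypress.senate.gov/?page_id=67'

# Placeholder contact details (assumptions); replace with your own.
headers = {
    'User-Agent': 'press-gallery-scraper (Your Name)',
    'From': 'you@example.com',
}

# Build and prepare the request without sending it over the network.
prepared = requests.Request('GET', URL, headers=headers).prepare()

# Inspect exactly which headers would go out with the request.
for key, value in prepared.headers.items():
    print(f'{key}: {value}')
```

On a live response, calling `r.raise_for_status()` right after `requests.get(...)` raises an exception for 4xx and 5xx status codes, so a failed fetch does not get saved as if it were the page.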

Open the local HTML file in write (`w`) mode and write the `text` of the response into it.

In [14]:
with open(SAVED_HTML, 'w', encoding='utf-8') as o:
    o.write(r.text)

Now let's open that file in read (`r`) mode and turn the HTML into a `BeautifulSoup` object.

In [15]:
with open(SAVED_HTML, 'r', encoding='utf-8') as i:
    html = i.read()
    soup = BeautifulSoup(html, 'html.parser')
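Text encoding matters when saving a page and reading it back: `open()` in text mode falls back to a platform-dependent default unless `encoding` is given, and names with accented characters can fail to write or round-trip on some systems. A minimal sketch, using a temporary file and a made-up snippet rather than the real page:

```python
import os
import tempfile

# Made-up HTML containing a non-ASCII character.
sample_text = '<td>José</td>'

# Write to a temporary directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.html')

    # Passing `encoding` makes the write independent of the platform default.
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(sample_text)

    # Read it back with the same encoding.
    with open(path, 'r', encoding='utf-8') as infile:
        round_tripped = infile.read()

print(round_tripped == sample_text)
```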

The table we're after has the class `tablepress`, so that's how we'll target it.

In [16]:
table = soup.find('table', {'class': 'tablepress'})
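A minimal sketch of the same lookup on a made-up fragment (not the real Senate page). It shows two things: `find` returns `None` when nothing matches, which is worth checking before looping, and a row hidden with CSS still shows up because the parser reads the markup rather than what a browser renders. The sample HTML and the `display: none` styling are assumptions for illustration; the real page may hide rows a different way.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the last row is hidden from view but still in the markup.
SAMPLE_HTML = """
<table class="tablepress">
  <tr><th>First</th><th>Last</th><th>Affiliation</th></tr>
  <tr><td>Ada</td><td>Lovelace</td><td>Example Wire</td></tr>
  <tr style="display: none"><td>Grace</td><td>Hopper</td><td>Example Daily</td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# `class_` is BeautifulSoup's keyword form of the {'class': ...} lookup.
sample_table = sample_soup.find('table', class_='tablepress')

# `find` returns None when there is no match, so check before using the result.
if sample_table is None:
    raise ValueError('No table with class "tablepress" found')

# A lookup for a class that does not exist returns None.
missing = sample_soup.find('table', class_='no-such-class')

# Header row plus two data rows, including the hidden one.
row_count = len(sample_table.find_all('tr'))
print(row_count, missing)
```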

Now we can loop over the table rows, using list slicing (`[1:]`) to skip the first row (`[0]`), which holds the header values, and extract the data from the `td` tags.

In [19]:
# loop over the table rows, skipping the first row
for row in table.find_all('tr')[1:]:
    
    # find all of the `td` cells in this row
    cols = row.find_all('td')
    
    # first name is in position [0]
    first = cols[0].string.strip()

    # last name is in position [1]
    last = cols[1].string.strip()

    # affiliation is in position [2]
    affil = cols[2].string.strip()
    
    print(first, last, affil)

Charles Abbott FERN's Ag Insider by Charles Abbott
Tamar Abdollah Associated Press
Richard Abott Defense Daily
Yasmeen Abutaleb Thomson Reuters
Joel Achenbach Washington Post
Andrew Ackerman Wall Street Journal/ Dow Jones
Rebecca Adams CQ Roll Call
T. Becket Adams Washington Examiner
Rachel Adams-Heard S & P Global
Timothy Ahmann Thomson Reuters
Akbar Shahid Ahmed Huffington Post
Haruyuki Aikawa Mainichi Shimbun
Julia Airey Washington Times
Nicole Albright Gannett Washington Bureau
Charles Alexander Thomson Reuters
Keith Alexander Washington Post
Hector Alfaro Bloomberg News
Abdulaziz Alhendi Saudi Press Agency
Idrees Ali Thomson Reuters
Safvan Allahverdi Anadolu News Agency
Jacqueline Allen Daily Beast
Nicholas Allen London Daily Telegraph
William Allison Bloomberg News
Ricardo Alonso Associated Press
Luis Alonso Lugo Associated Press
Sarah Ampolsk Kyodo News
Mark Anderson Wall Street Journal/ Dow Jones
Nicholas Anderson Washington Post
Fanny Andre Agence France-Presse
Natalie Andrews

Jennifer Jacobs Bloomberg News
Susan Jaffe Freelance
Daniel Jahn Agence France-Presse
Maryam Jameel Center for Public Integrity
Joshua Jamerson Wall Street Journal/ Dow Jones
Dave Jamieson Huffington Post
Tracy Jan Washington Post
Bart Jansen USA Today
Emily Jashinsky Washington Examiner
John Jenkins Religion News Service
Paul Jenks CQ Roll Call
Leah Jessen Daily Caller
Lalit Jha India Press Trust
Zaid Jilani Intercept (The)
Minmin Jin Xinhua News Agency
Arit John Bloomberg News
Benjamin Johnson Daily Caller
Jenna Johnson Washington Post
Kevin Johnson USA Today
Timothy Johnson McClatchy
Vincent Johnson Variety
Margret Johnston German Press Agency - DPA
Barnaby Jopson Financial Times
Alethea Jordan Gannett Washington Bureau
Cyril Julien Agence France-Presse
Hyosik Jung Joongang Ilbo
Michitaka Kaiya Yomiuri Shimbun
Sara Kamouni Agence France-Presse
Ronald Kampeas Jewish Telegraphic Agency
Paul Kane Washington Post
Rui Kaneya Center for Public Integrity
Cecilia Kang New York Times
Insun K

Victor Sancho Lacalle El Universal
Edmund Sanders Los Angeles Times
David Sands Washington Times
David Sanger New York Times
Margot Sanger Katz New York Times
Jathon Sapsford Wall Street Journal/ Dow Jones
Giuseppe Sarcina Corriere Della Sera
Debra Saunders Las Vegas Review-Journal
David Savage Los Angeles Times
Wataru Sawamura Asahi Shimbun
William Scally William Scally Reports
Rowan Scarborough Washington Times
Juliane Schaeuble Der Tagesspiegel
Joel Schectman Thomson Reuters
Michael Scherer Washington Post
Kristy Scheuble Bloomberg News
Jacob Schlesinger Wall Street Journal/ Dow Jones
Courtney Schlisserman Argus Media
Michael Schmidt New York Times
Robert Schmidt Bloomberg News
Eric Schmitt New York Times
Howard Schneider Thomson Reuters
Thomas Schoenberg Bloomberg News
Fredreka Schouten USA Today
Martin Schram Tribune Content Agency
Peter Schroeder Thomson Reuters
Robert Schroeder MarketWatch
Jessica Schulberg Huffington Post
Fred Schulte Kaiser Health News
Marisa Schultz New York 
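One caveat about the loop above: `.string` returns `None` when a cell is empty or contains more than one child node (for example, text plus a nested tag), and calling `.strip()` on `None` raises an `AttributeError`. Whether the real table has such cells is not something the output above settles. A minimal sketch of a more forgiving approach on made-up rows, using `get_text`:

```python
from bs4 import BeautifulSoup

# Hypothetical rows: one plain, one with a nested tag, one with an empty cell.
SAMPLE_HTML = """
<table>
  <tr><th>First</th><th>Last</th><th>Affiliation</th></tr>
  <tr><td>Ada</td><td>Lovelace</td><td> Example Wire </td></tr>
  <tr><td>Grace</td><td>Hopper</td><td>Example <em>Daily</em></td></tr>
  <tr><td>Alan</td><td>Turing</td><td></td></tr>
</table>
"""

sample_table = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('table')

parsed = []
for row in sample_table.find_all('tr')[1:]:
    cols = row.find_all('td')
    # get_text(' ', strip=True) strips each text piece and joins them with a
    # space; it returns '' for an empty cell instead of None.
    parsed.append(tuple(col.get_text(' ', strip=True) for col in cols))

for record in parsed:
    print(record)
```

The space passed as the first argument matters: without it, the stripped pieces of a cell with nested tags are joined with no separator.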

Great! Now, instead of printing, we'll write the results out to a file.

👉 For more information on reading and writing delimited files, [see this notebook](../reference/Reading%20and%20writing%20delimited%20data%20files%20with%20vanilla%20Python.ipynb).

In [20]:
# open the file to write to
with open(CSV_FILE, 'w', newline='', encoding='utf-8') as o:

    # create a DictWriter object and specify the fieldnames
    writer = csv.DictWriter(o, fieldnames=['first', 'last', 'affiliation'])
    
    # write header row
    writer.writeheader()

    # loop over table rows, skipping first row
    for row in table.find_all('tr')[1:]:
        
        # get a list of `td` tags in this row
        cols = row.find_all('td')
        
        # first name is in position [0]        
        first = cols[0].string.strip()
        
        # last name is in position [1]
        last = cols[1].string.strip()
        
        # affiliation is in position [2]
        affil = cols[2].string.strip()

        # write out to file
        writer.writerow({
            'first': first,
            'last': last,
            'affiliation': affil
        })
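
To check the result, the file can be read back with `csv.DictReader`. The sketch below is self-contained: it writes two made-up rows to a temporary file (not `congress-press.csv`) and reads them back. Opening the file with `newline=''` is what the `csv` module documentation recommends for both reading and writing.

```python
import csv
import os
import tempfile

FIELDNAMES = ['first', 'last', 'affiliation']

# Made-up records standing in for the scraped rows.
sample_rows = [
    {'first': 'Ada', 'last': 'Lovelace', 'affiliation': 'Example Wire'},
    {'first': 'Grace', 'last': 'Hopper', 'affiliation': 'Example Daily, Inc.'},
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.csv')

    # Write the header and rows, as in the cell above.
    with open(path, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(sample_rows)

    # Read every row back as a dictionary keyed by the header row.
    with open(path, 'r', newline='', encoding='utf-8') as infile:
        read_back = list(csv.DictReader(infile))

print(len(read_back), read_back == sample_rows)
```

The second made-up affiliation contains a comma on purpose: the `csv` module quotes such values on the way out and unquotes them on the way in, so the round trip preserves them.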