# Scraping one page per row

Let's say we're interested in our members of Congress, because who isn't? Read in `congress.csv`.

In [14]:
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

In [190]:
congress_df = pd.read_csv('congress.csv')
congress_df.head()

Unnamed: 0,name,slug
0,"Senator Abdnor, James",james-abdnor/A000009
1,"Representative Abercrombie, Neil",neil-abercrombie/A000014
2,"Senator Abourezk, James",james-abourezk/A000017
3,"Representative Abraham, Ralph Lee",ralph-abraham/A000374
4,"Senator Abraham, Spencer",spencer-abraham/A000355


# Let's scrape one

The `slug` is the part of the URL that's particular to that member of Congress. So `/james-abdnor/A000009` really means `https://www.congress.gov/member/james-abdnor/A000009`.
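
As a quick sketch of that mapping, the snippet below glues the base URL and a slug together. `BASE_URL` and `member_url` are names made up for this illustration, and stripping a leading slash is an assumption so that both `/james-abdnor/A000009` and `james-abdnor/A000009` work.

```python
# Hypothetical helper: build a member page URL from a slug.
BASE_URL = "https://www.congress.gov/member/"

def member_url(slug):
    # Drop any leading slash so the URL doesn't end up with "//".
    return BASE_URL + slug.lstrip("/")

url = member_url("/james-abdnor/A000009")
print(url)
```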

Scrape his name, birthday, party, whether he's currently in Congress, and his bill count (don't worry if the bill count is dirty; you can clean it up later).

In [172]:
driver = webdriver.Chrome()

In [16]:
url = "https://www.congress.gov/member/james-abdnor/A000009"
driver.get(url)

In [68]:
congressman = driver.find_elements_by_class_name("featured")
bill_id = driver.find_element_by_class_name("results-number").text
for congress in congressman:
    print(congress.text.strip())
    print("___________________")
    
print(bill_id)

Subscribe Share/Save Site Feedback
Senator James Abdnor (1923 - 2012)
In Congress 1973 - 1987
MEMBERHide Overview
Courtesy U.S. Senate Historical Office
Read biography
Party Republican
Senate South Dakota 97th-99th (1981-1987)
House South Dakota, District 2 93rd-96th (1973-1981)
___________________
1-100 of 1,949


In [95]:

for congress in congressman:
    # Match the title and name once, then unpack both groups.
    designation, name = re.findall(r"(Senator|Representative)\s(\w+\s\w+\s\w*)", congress.text, re.IGNORECASE)[0]
    # Accept either title so Representatives get a birthday too.
    birthday = re.findall(r"(?:Senator|Representative)\s\w+\s\w+\s\w*\D(\d+\s-\s\d+)\D", congress.text, re.IGNORECASE)
    party = re.findall(r"Party\s(\w*)", congress.text)
    cong_years = re.findall(r"In Congress\s(\d*\s-\s\d*)", congress.text, re.IGNORECASE)
    # Grab the total after "of" and drop the thousands separator.
    bill_count = re.findall(r"\d\sof\s(\d+\D*\d*)", bill_id)[0].replace(",", "")
    print(name)
    

James Abdnor 
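
If you want to tune the regular expressions without opening a browser each time, one option is to paste the printed text into a string and parse that. This is only a sketch: `sample_text` and `sample_results` are copied from the output above, the patterns are one possible way to write them (not the only way), and `parsed` is a name invented here.

```python
import re

# Lines copied from the printed output above.
sample_text = """Senator James Abdnor (1923 - 2012)
In Congress 1973 - 1987
Party Republican"""
sample_results = "1-100 of 1,949"

# Title, name, birth year, and an optional death year in one pass.
header = re.search(
    r"(Senator|Representative)\s+(.+?)\s+\((\d{4})\s*-\s*(\d{4})?\s*\)",
    sample_text,
)
party = re.search(r"Party\s+(\w+)", sample_text).group(1)
years = re.search(r"In Congress\s+(\d{4})\s*-\s*(\d{4}|Present)", sample_text)
# Take the number after "of" and remove the thousands separator.
bill_count = int(re.search(r"of\s+([\d,]+)", sample_results).group(1).replace(",", ""))

parsed = {
    "designation": header.group(1),
    "name": header.group(2),
    "birth_year": int(header.group(3)),
    "party": party,
    "first_year": int(years.group(1)),
    "bill_count": bill_count,
}
print(parsed)
```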


# Build a function

Write a function called `scrape_page` that builds a URL from the `slug`, as if we're going to use it with `.apply`.

In [None]:
# def scrape_page(slug):
#     url = "https://www.congress.gov/member/"
#     slug_url = url+slug
#     driver.get(slug_url)
#     congressman = driver.find_elements_by_class_name("featured")
#     bill_id = driver.find_element_by_class_name("results-number").text
    
#     cong_dict = {}
#     for congress in congressman:
#         cong_dict['designation'] = re.findall(r"(Senator|Representative)\s(\w+\s\w+\s\w*)", congress.text, re.IGNORECASE)[0][0]
#         cong_dict['name'] = re.findall(r"(Senator|Representative)\s(\w+\s\w+\s\w*)", congress.text, re.IGNORECASE)[0][1]
#         cong_dict['birthday'] = re.findall(r"(Senator|Representative)\s\w+\s\w+\s\w*\D(\d+\s-\s*\d*)\D", congress.text, re.IGNORECASE)[0][1]
#         cong_dict['party'] = re.findall(r"Party\s(\w*)", congress.text)[0]
#         cong_dict['cong_years'] = re.findall(r"In Congress\s(\d*\s-\s\d*)", congress.text, re.IGNORECASE)[0]
#         cong_dict['bill_count'] = re.findall(r"\d\sof\s(\d+\D*\d*)" , bill_id)[0].replace(",", "")
#     return cong_dict

In [149]:
def scrape_page(slug):
    """Load one member's page and return the scraped fields as a dict."""
    url = "https://www.congress.gov/member/"
    slug_url = url + slug
    driver.get(slug_url)
    congressman = driver.find_elements_by_class_name("featured")
    # Text like "1-100 of 1,949"; the total is pulled out below.
    bill_id = driver.find_element_by_class_name("results-number").text

    cong_dict = {}
    for congress in congressman:
        # Name and birthday have their own elements; the rest comes from the overview text.
        cong_dict['name'] = driver.find_element_by_class_name("legDetail").text
        cong_dict['birthday'] = driver.find_element_by_class_name("birthdate").text
        cong_dict['party'] = re.findall(r"Party\s(\w*)", congress.text)[0]
        cong_dict['cong_years'] = re.findall(r"In Congress\s(\d*\s-\s\d*)", congress.text, re.IGNORECASE)[0]
        cong_dict['bill_count'] = re.findall(r"\d\sof\s(\d+\D*\d*)", bill_id)[0].replace(",", "")
    return cong_dict

In [173]:
scrape_try = scrape_page('martin-heinrich/H001046')

In [174]:
scrape_try

{'name': 'Senator Martin Heinrich (1971 - )\nIn Congress 2009 - Present | Get alerts',
 'birthday': '(1971 - )',
 'party': 'Democratic',
 'cong_years': '2009 - ',
 'bill_count': '2249'}
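
The values above are still dirty (the name carries the dates and the "Get alerts" link text). A rough sketch of cleaning them up afterwards, using the dictionary values shown above as input; `raw`, `clean_name`, and `birth_year` are names invented for this example.

```python
import re

# Values copied from the scrape_try output above.
raw = {
    "name": "Senator Martin Heinrich (1971 - )\nIn Congress 2009 - Present | Get alerts",
    "birthday": "(1971 - )",
    "bill_count": "2249",
}

# Keep only the first line, then drop the parenthesized dates at the end.
first_line = raw["name"].split("\n")[0]
clean_name = re.sub(r"\s*\(.*\)\s*$", "", first_line)

# The first four-digit number in the birthday text is the birth year.
birth_year = int(re.search(r"\d{4}", raw["birthday"]).group())
bill_count = int(raw["bill_count"])

print(clean_name, birth_year, bill_count)
```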

# Do the scraping

Rewrite `scrape_page` to actually scrape the URL. You can use your scraping code from up above. Start by testing with just one row (I put a sample call below) and then expand to your whole dataframe.

Save the results as `scraped_df`.

* **Hint:** Be sure to use `return`!
* **Hint:** Make sure you return a `pd.Series`
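
To see why the `pd.Series` hint matters, here is a small sketch that never opens a browser. `fake_scrape_page` is a made-up stand-in for the real scraper and `demo_df` holds two rows copied from the table at the top; the point is only that returning a `pd.Series` from a row-wise `.apply` gives you a dataframe with one column per key.

```python
import pandas as pd

# Two rows copied from congress.csv as shown earlier.
demo_df = pd.DataFrame({
    "name": ["Senator Abdnor, James", "Representative Abercrombie, Neil"],
    "slug": ["james-abdnor/A000009", "neil-abercrombie/A000014"],
})

def fake_scrape_page(row):
    # Stand-in for the real scraper: derive values from the row, no network.
    url = "https://www.congress.gov/member/" + row["slug"]
    return pd.Series({"url": url, "member_id": row["slug"].split("/")[-1]})

# axis=1 passes each row to the function; the returned Series become columns.
demo_scraped = demo_df.apply(fake_scrape_page, axis=1)
print(demo_scraped)
```

Returning a plain dict instead would leave you with a single column of dicts, which is much harder to join back onto the original data.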

In [196]:
big_list = []
error_log = []

# To scrape every row, use range(len(congress_df)) instead.

for i in range(20, 100):
    try:
        each_congress = scrape_page(congress_df['slug'][i])
        big_list.append(each_congress)
    except Exception:
        # Log the slug so failed pages can be retried later.
        error_log.append(congress_df['slug'][i])

In [197]:
error_log

[]

In [198]:
len(big_list)

80

In [177]:
temp_df = pd.DataFrame(big_list)
temp_df[temp_df['name']=='Senator Martin Heinrich (1971 - )\nIn Congress 2009 - Present | Get alerts']

Unnamed: 0,name,birthday,party,cong_years,bill_count


## Join with your original dataframe

Join your new data with your original data, adding the `_scraped` suffix on the new columns. You can use either `.join` or `.merge`, but be sure to read the docs to know the difference!
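
A small sketch of the difference, on toy frames rather than the real scrape. `left` and `right` are hypothetical names, and the second row of `right` is filled in by hand for illustration. One thing worth knowing: `rsuffix` and `suffixes` only rename columns whose names collide, so if you want the suffix on every new column, `add_suffix` is one way to get it.

```python
import pandas as pd

left = pd.DataFrame({
    "name": ["Senator Abdnor, James", "Representative Abercrombie, Neil"],
    "slug": ["james-abdnor/A000009", "neil-abercrombie/A000014"],
})
# Toy stand-in for scraped_df; rows line up with left by index.
right = pd.DataFrame({
    "name": ["James Abdnor", "Neil Abercrombie"],
    "party": ["Republican", "Democratic"],
})

# .join aligns on the index; rsuffix renames only the clashing "name" column.
joined = left.join(right, rsuffix="_scraped")

# Suffix every scraped column first, then join.
joined_all = left.join(right.add_suffix("_scraped"))

# .merge can do the same index-based join when told to use both indexes.
merged = left.merge(right, left_index=True, right_index=True,
                    suffixes=("", "_scraped"))

print(joined_all.columns.tolist())
```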

## Save it

Save your combined results to `congress-plus-scraped.csv`.
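
A sketch of the save step, assuming your combined dataframe looks something like the toy `combined` below (a made-up name, written to a temporary folder here so nothing is overwritten). Passing `index=False` is a choice, not a requirement: a stray `Unnamed: 0` column like the one in the table at the top is typically what you get when a frame is saved with its index and read back without `index_col`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the combined dataframe.
combined = pd.DataFrame({
    "name": ["Senator Abdnor, James"],
    "slug": ["james-abdnor/A000009"],
    "party_scraped": ["Republican"],
})

# Write to a temp folder for this demo; in the notebook you'd use the bare filename.
out_path = os.path.join(tempfile.mkdtemp(), "congress-plus-scraped.csv")
combined.to_csv(out_path, index=False)

# Read it back to confirm no extra index column was written.
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```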