# Scraping one page per row

Let's say we're interested in our members of Congress, because who isn't? Read in `congress.csv`.

In [1]:
import pandas as pd
import requests
import time 
from bs4 import BeautifulSoup


from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait



In [2]:
df = pd.read_csv("congress.csv")

# Let's scrape one

The `slug` is the part of the URL that's particular to that member of Congress. So `james-abdnor/A000009` really means `https://www.congress.gov/member/james-abdnor/A000009`.
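
A minimal sketch of that slug-to-URL mapping, using only the standard library. The helper name `member_url` and the constant `BASE_URL` are illustrative, and tolerating a leading slash is an assumption about how slugs might be stored, not something the dataset requires.

```python
BASE_URL = "https://www.congress.gov/member/"  # prefix taken from the example URL


def member_url(slug):
    """Return the full member page URL for a slug (hypothetical helper)."""
    # Drop surrounding whitespace and any leading slash so the
    # result never contains a double slash after "member".
    return BASE_URL + slug.strip().lstrip("/")


print(member_url("james-abdnor/A000009"))
```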

Scrape his name, birthdate, party, whether he's currently in Congress, and his bill count (don't worry if the bill count is dirty; you can clean it up later).

In [3]:
my_url = "https://www.congress.gov/member/james-abdnor/A000009"
raw_html = requests.get(my_url).content
soup_doc = BeautifulSoup(raw_html, "html.parser")

In [4]:
print(soup_doc)

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<title>James Abdnor | Congress.gov | Library of Congress</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="https://www.congress.gov/member/james-abdnor/A000009" name="canonical"/>
<meta content="1985 - 1987" name="dc.coverage"/>
<meta content="James Abdnor" name="dc.creator"/>
<meta content="https://www.congress.gov/member/james-abdnor/A000009" name="dc.identifier"/>
<meta content="eng" name="dc.language"/>
<meta content="Text is government work" name="dc.rights"/>
<meta content="James Abdnor" name="dc.subject"/>
<meta content="Legislative Data" name="dc.subject"/>
<meta content="Congress" name="dc.subject"/>
<meta content="James Abdnor" name="dc.title"/>
<meta content="legislation" name="dc.type"/>
<meta content="webpage" name="dc.type"/>
<meta content="Sponsored legislation by James Abdnor, the Senator from South Dakota - in Congress from 1985 through 1987" name="d

In [5]:
details = soup_doc.find_all('div', class_='featured')

name = details[0].find_all('h1')[0].contents[0]

birthdate = details[0].find_all('h1')[0].find_all('span')[0].text.strip()

party = soup_doc.find_all('div', class_="overview-member-column-profile member_profile")[0].find_all('td')[0].text.strip()

bills = soup_doc.find_all('span', class_='results-number')[0].text.strip()

all_details = {
    'Name': name, 
    'Year': birthdate, 
    'Party': party, 
    'Bills_messy': bills
}

print(all_details)


{'Name': 'Senator James Abdnor', 'Year': '(1923 - 2012)', 'Party': 'Republican', 'Bills_messy': '1-100                \r\n                of 1,949'}
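
The messy values in that output can be tidied with plain regular expressions. This is only a sketch: `parse_bill_count` and `parse_years` are hypothetical helpers, and the patterns assume the text always looks like the two samples printed above.

```python
import re


def parse_bill_count(messy):
    """Return the total after the word 'of' as an int, or None if it is missing."""
    match = re.search(r"of\s+([\d,]+)", messy)
    if match is None:
        return None  # no "of N" pattern found
    # Remove the thousands separator before converting.
    return int(match.group(1).replace(",", ""))


def parse_years(messy):
    """Split text like '(1923 - 2012)' into (birth, death); death is None if absent."""
    match = re.search(r"\((\d{4})\s*-\s*(\d{4})?\s*\)", messy)
    if match is None:
        return (None, None)
    birth, death = match.groups()
    return (int(birth), int(death) if death else None)


print(parse_bill_count("1-100                \r\n                of 1,949"))
print(parse_years("(1923 - 2012)"))
```

An empty second year, as in `(1938 - )`, comes back as `None` rather than raising an error.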


# Build a function

Write a function called `scrape_page` that builds a URL from the `slug`, written as if we're going to use it with `.apply`.

In [40]:
df.head() 

Unnamed: 0,name,slug
0,"Senator Abdnor, James",james-abdnor/A000009
1,"Representative Abercrombie, Neil",neil-abercrombie/A000014
2,"Senator Abourezk, James",james-abourezk/A000017
3,"Representative Abraham, Ralph Lee",ralph-abraham/A000374
4,"Senator Abraham, Spencer",spencer-abraham/A000355


In [7]:
def scrape_page(row):
    # Takes a single row, so it can be handed straight to .apply
    return "https://www.congress.gov/member/" + row['slug']

df.apply(scrape_page, axis=1)

0       https://www.congress.gov/member/james-abdnor/A...
1       https://www.congress.gov/member/neil-abercromb...
2       https://www.congress.gov/member/james-abourezk...
3       https://www.congress.gov/member/ralph-abraham/...
4       https://www.congress.gov/member/spencer-abraha...
                              ...                        
2343    https://www.congress.gov/member/ryan-zinke/Z00...
2344    https://www.congress.gov/member/roger-zion/Z00...
2345    https://www.congress.gov/member/edward-zorinsk...
2346    https://www.congress.gov/member/edwin-zschau/Z...
2347    https://www.congress.gov/member/john-zwach/Z00...
Length: 2348, dtype: object

# Do the scraping

Rewrite `scrape_page` to actually scrape the URL. You can use your scraping code from up above. Start by testing with just one row (I put a sample call below) and then expand to your whole dataframe.

Save the results as `scraped_df`.

* **Hint:** Be sure to use `return`!
* **Hint:** Make sure you return a `pd.Series`
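
To see why the second hint matters, here is a small sketch with no network access. `split_slug` and the `demo` dataframe are made up for illustration; only the two slugs come from `congress.csv`. When the function passed to `.apply(..., axis=1)` returns a `pd.Series`, the labels of that Series become the columns of the result.

```python
import pandas as pd

# Tiny stand-in for congress.csv, using two slugs from the real file.
demo = pd.DataFrame({"slug": ["james-abdnor/A000009", "neil-abercrombie/A000014"]})


def split_slug(row):
    """Hypothetical row function: just string splitting, no scraping."""
    name_part, member_id = row["slug"].split("/")
    # The Series index becomes the column names of the result.
    return pd.Series({"name_part": name_part, "member_id": member_id})


expanded = demo.apply(split_slug, axis=1)
print(expanded)
```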

In [31]:
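def scrape_page(row):
    # Build the member page URL from the row's slug and download it
    my_url = "https://www.congress.gov/member/" + row['slug']
    raw_html = requests.get(my_url).content
    soup_doc = BeautifulSoup(raw_html, "html.parser")

    # Everything we need sits inside the main container
    member = soup_doc.find('div', class_='container')
    featured = member.find('div', class_='featured')

    # The first child of the <h1> is the name; the years are in a <span>
    name = featured.find('h1').contents[0]
    birthdate = featured.find('span', class_='birthdate').text.strip()

    # Text of the first <td> in the profile table. For some members that
    # cell holds something other than the party, such as a website.
    party = member.find('div', class_="overview-member-column-profile member_profile").find('td').text.strip()

    # Still messy at this point; it gets cleaned up after the scrape
    bills = member.find('span', class_='results-number').text.strip()

    # Return a Series so .apply can expand the values into columns
    return pd.Series([name, birthdate, party, bills], index=['Name', 'Year', 'Party', 'Bills'])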

In [32]:
# Test with this
scrape_page({'slug': 'neil-abercrombie/A000014'})

Name                       Representative Neil Abercrombie
Year                                             (1938 - )
Party                                           Democratic
Bills    1-100                \r\n                of 4,472
dtype: object

In [37]:
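# One request per row, so this is slow on the full dataframe
all_names = df.apply(scrape_page, axis=1)

# rsuffix only renames columns whose names clash with the original ones.
# None of the scraped columns clash here, so no suffix actually appears.
joined_table = df.join(all_names, rsuffix='_scraped')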

In [38]:
joined_table

Unnamed: 0,name,slug,Name,Year,Party,Bills
0,"Senator Abdnor, James",james-abdnor/A000009,Senator James Abdnor,(1923 - 2012),Republican,"1-100 \r\n of 1,949"
1,"Representative Abercrombie, Neil",neil-abercrombie/A000014,Representative Neil Abercrombie,(1938 - ),Democratic,"1-100 \r\n of 4,472"
2,"Senator Abourezk, James",james-abourezk/A000017,Senator James Abourezk,(1931 - ),Democratic,1-100 \r\n of 875
3,"Representative Abraham, Ralph Lee",ralph-abraham/A000374,Representative Ralph Lee Abraham,(1954 - ),https://abraham.house.gov/,1-100 \r\n of 736
4,"Senator Abraham, Spencer",spencer-abraham/A000355,Senator Spencer Abraham,(1952 - ),Republican,"1-100 \r\n of 1,227"
...,...,...,...,...,...,...
2343,"Representative Zinke, Ryan K.",ryan-zinke/Z000018,Representative Ryan K. Zinke,(1961 - ),Republican,1-100 \r\n of 364
2344,"Representative Zion, Roger H.",roger-zion/Z000010,Representative Roger H. Zion,(1921 - 2019),Republican,1-60 \r\n of 60
2345,"Senator Zorinsky, Edward",edward-zorinsky/Z000013,Senator Edward Zorinsky,(1928 - 1987),Democratic,"1-100 \r\n of 1,543"
2346,"Representative Zschau, Edwin V. W.",edwin-zschau/Z000014,Representative Edwin V. W. Zschau,(1940 - ),Republican,1-100 \r\n of 303
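
A full run makes one request per row, 2,348 in this dataframe, so rerunning the cell is expensive. One way to make reruns cheap is to save each result to disk the first time it is fetched. The sketch below is an assumption-laden illustration, not part of the assignment: `cached_scrape`, `fake_fetch`, and the file-naming scheme are all hypothetical, and `fake_fetch` stands in for the real network call.

```python
import json
import tempfile
from pathlib import Path


def cached_scrape(slug, fetch, cache_dir):
    """Return fetch(slug), reusing a JSON file on disk if one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Slugs contain "/", which cannot appear in a file name.
    cache_file = cache_dir / (slug.replace("/", "__") + ".json")
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = fetch(slug)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []


def fake_fetch(slug):
    """Stand-in for the real scrape; records every call it receives."""
    calls.append(slug)
    return {"Name": "placeholder for " + slug}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_scrape("james-abdnor/A000009", fake_fetch, tmp)
    second = cached_scrape("james-abdnor/A000009", fake_fetch, tmp)

print(len(calls))
```

With a real scraper, `fetch` would be a function that takes a slug, downloads the page, and returns a plain dict of the scraped fields, which can then be wrapped in a `pd.Series`.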


## Join with your original dataframe

Join your new data with your original data, adding the `_scraped` suffix on the new columns. You can use either `.join` or `.merge`, but be sure to read the docs to know the difference!
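
A small sketch of the difference, on a one-row stand-in built from values in the tables above. The variable names are illustrative. The point it shows: `.join` matches on the index and its `rsuffix` only renames columns whose names collide, so `add_suffix` is one way to force the `_scraped` suffix onto every new column; `.merge` can do the same index-on-index match when asked explicitly.

```python
import pandas as pd

original = pd.DataFrame({"name": ["Senator Abdnor, James"], "slug": ["james-abdnor/A000009"]})
scraped = pd.DataFrame({"Name": ["Senator James Abdnor"], "Party": ["Republican"]})

# rsuffix only renames colliding column names, so nothing changes here.
joined_plain = original.join(scraped, rsuffix="_scraped")

# add_suffix renames every scraped column before joining on the index.
joined_suffixed = original.join(scraped.add_suffix("_scraped"))

# merge needs to be told to match on the index of both frames.
merged = original.merge(scraped.add_suffix("_scraped"), left_index=True, right_index=True)

print(list(joined_plain.columns))
print(list(joined_suffixed.columns))
```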

### The join is already done above, in the same cell as the scrape. I would move it down here, but that cell took HOURS to run and I refuse to touch it ever again

## Save it

Save your combined results to `congress-plus-scraped.csv`.

In [42]:
# joined_table is already a DataFrame; copy it so the cleanup below leaves it untouched
df_joined = joined_table.copy()
df_joined.head()

Unnamed: 0,name,slug,Name,Year,Party,Bills
0,"Senator Abdnor, James",james-abdnor/A000009,Senator James Abdnor,(1923 - 2012),Republican,"1-100 \r\n of 1,949"
1,"Representative Abercrombie, Neil",neil-abercrombie/A000014,Representative Neil Abercrombie,(1938 - ),Democratic,"1-100 \r\n of 4,472"
2,"Senator Abourezk, James",james-abourezk/A000017,Senator James Abourezk,(1931 - ),Democratic,1-100 \r\n of 875
3,"Representative Abraham, Ralph Lee",ralph-abraham/A000374,Representative Ralph Lee Abraham,(1954 - ),https://abraham.house.gov/,1-100 \r\n of 736
4,"Senator Abraham, Spencer",spencer-abraham/A000355,Senator Spencer Abraham,(1952 - ),Republican,"1-100 \r\n of 1,227"


In [82]:
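# Keep only the total after "of", drop the thousands separator, and store it as a number
bill_totals = df_joined['Bills'].str.extract(r"of\s+([\d,]+)", expand=False)
df_joined['Bill_Count'] = pd.to_numeric(bill_totals.str.replace(",", "", regex=False)).astype("Int64")
del df_joined['Bills']

df_joined.head(2)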

Unnamed: 0,name,slug,Name,Year,Party,Bill_Count
0,"Senator Abdnor, James",james-abdnor/A000009,Senator James Abdnor,(1923 - 2012),Republican,1949
1,"Representative Abercrombie, Neil",neil-abercrombie/A000014,Representative Neil Abercrombie,(1938 - ),Democratic,4472


In [83]:
df_joined.to_csv('congress-plus-scraped.csv', index = False)