# Scraping one page per row

Let's say we're interested in our members of Congress, because who isn't? Read in `congress.csv`.

In [1]:
import pandas as pd 

In [2]:
df = pd.read_csv('congress.csv')

In [3]:
df

Unnamed: 0,name,slug
0,"Senator Abdnor, James",james-abdnor/A000009
1,"Representative Abercrombie, Neil",neil-abercrombie/A000014
2,"Senator Abourezk, James",james-abourezk/A000017
3,"Representative Abraham, Ralph Lee",ralph-abraham/A000374
4,"Senator Abraham, Spencer",spencer-abraham/A000355
...,...,...
2343,"Representative Zinke, Ryan K.",ryan-zinke/Z000018
2344,"Representative Zion, Roger H.",roger-zion/Z000010
2345,"Senator Zorinsky, Edward",edward-zorinsky/Z000013
2346,"Representative Zschau, Edwin V. W.",edwin-zschau/Z000014


In [4]:
from selenium import webdriver
driver = webdriver.Chrome()

In [5]:
driver.get("https://www.congress.gov/member/james-abdnor/A000009")

# Let's scrape one

The `slug` is the part of the URL that's particular to that member of Congress. So `james-abdnor/A000009` really means `https://www.congress.gov/member/james-abdnor/A000009`.

Scrape his name, birthday, party, whether he's currently in Congress, and his bill count (don't worry if the bill count is dirty, you can clean it up later).

In [6]:
driver.find_element_by_tag_name('h1').text

'Senator James Abdnor (1923 - 2012)\nIn Congress 1973 - 1987'

In [7]:
driver.find_element_by_class_name('results-number').text

'1-100 of 1,949'
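
Both scraped strings are "dirty" in the sense mentioned above, so here is a minimal sketch of how the cleanup could go later. It assumes every page follows the same pattern as the two sample strings from James Abdnor's page; `parse_bill_count` and `parse_header` are made-up helper names, not part of the assignment.

```python
def parse_bill_count(results_text):
    """Turn '1-100 of 1,949' into the integer 1949."""
    # Assumption: the total always follows ' of ' and uses comma separators
    total = results_text.split(' of ')[-1]
    return int(total.replace(',', ''))


def parse_header(h1_text):
    """Pull the name, life span and years in Congress out of the <h1> text."""
    # Assumption: line 1 is 'Title Name (born - died)', line 2 is 'In Congress ...'
    name_line, _, congress_line = h1_text.partition('\n')
    name, _, lifespan = name_line.partition(' (')
    return {
        'name': name.strip(),
        'lifespan': lifespan.rstrip(')').strip(),
        'years_in_congress': congress_line.replace('In Congress', '').strip(),
    }


print(parse_bill_count('1-100 of 1,949'))
print(parse_header('Senator James Abdnor (1923 - 2012)\nIn Congress 1973 - 1987'))
```

Pages for current members may format these strings differently, so check a few before trusting the parsers on the whole dataframe.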

# Build a function

Write a function called `scrape_page` that builds a URL out of the `slug`, written so that we can call it with `.apply`.

In [8]:
def scrape_page(end_link):
    # Build the full member URL from the slug (no scraping yet)
    url = 'https://www.congress.gov/member/' + end_link
    return url

In [9]:
df['url'] = df['slug'].apply(scrape_page)

In [10]:
df

Unnamed: 0,name,slug,url
0,"Senator Abdnor, James",james-abdnor/A000009,https://www.congress.gov/member/james-abdnor/A...
1,"Representative Abercrombie, Neil",neil-abercrombie/A000014,https://www.congress.gov/member/neil-abercromb...
2,"Senator Abourezk, James",james-abourezk/A000017,https://www.congress.gov/member/james-abourezk...
3,"Representative Abraham, Ralph Lee",ralph-abraham/A000374,https://www.congress.gov/member/ralph-abraham/...
4,"Senator Abraham, Spencer",spencer-abraham/A000355,https://www.congress.gov/member/spencer-abraha...
...,...,...,...
2343,"Representative Zinke, Ryan K.",ryan-zinke/Z000018,https://www.congress.gov/member/ryan-zinke/Z00...
2344,"Representative Zion, Roger H.",roger-zion/Z000010,https://www.congress.gov/member/roger-zion/Z00...
2345,"Senator Zorinsky, Edward",edward-zorinsky/Z000013,https://www.congress.gov/member/edward-zorinsk...
2346,"Representative Zschau, Edwin V. W.",edwin-zschau/Z000014,https://www.congress.gov/member/edwin-zschau/Z...


# Do the scraping

Rewrite `scrape_page` to actually scrape the URL. You can use your scraping code from up above. Start by testing with just one row (I put a sample call below) and then expand to your whole dataframe.

Save the results as `scraped_df`.

* **Hint:** Be sure to use `return`!
* **Hint:** Make sure you return a `pd.Series`
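
To see why the second hint matters, here is a small sketch that compares returning a tuple with returning a `pd.Series` from `.apply`. No browser is involved: `as_tuple` and `as_series` are hypothetical stand-ins that just split the slug, and the column names `name_part` and `member_id` are invented for the demo.

```python
import pandas as pd

slugs = pd.Series(['james-abdnor/A000009', 'neil-abercrombie/A000014'], name='slug')


def as_tuple(slug):
    # Stand-in for a scraper: plain string handling, no web requests
    name_part, member_id = slug.split('/')
    return (name_part, member_id)


def as_series(slug):
    name_part, member_id = slug.split('/')
    return pd.Series({'name_part': name_part, 'member_id': member_id})


tuples = slugs.apply(as_tuple)   # still a single column, each cell holds a tuple
table = slugs.apply(as_series)   # a DataFrame with one named column per field

print(type(tuples).__name__, type(table).__name__)
print(table)
```

Returning a `pd.Series` gives you named columns right away, which is what makes the join step later straightforward.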

In [11]:
#def scrape_page(end_link):
    #url = 'https://www.congress.gov/'+end_link
    #return (driver.get(url))
df

Unnamed: 0,name,slug,url
0,"Senator Abdnor, James",james-abdnor/A000009,https://www.congress.gov/member/james-abdnor/A...
1,"Representative Abercrombie, Neil",neil-abercrombie/A000014,https://www.congress.gov/member/neil-abercromb...
2,"Senator Abourezk, James",james-abourezk/A000017,https://www.congress.gov/member/james-abourezk...
3,"Representative Abraham, Ralph Lee",ralph-abraham/A000374,https://www.congress.gov/member/ralph-abraham/...
4,"Senator Abraham, Spencer",spencer-abraham/A000355,https://www.congress.gov/member/spencer-abraha...
...,...,...,...
2343,"Representative Zinke, Ryan K.",ryan-zinke/Z000018,https://www.congress.gov/member/ryan-zinke/Z00...
2344,"Representative Zion, Roger H.",roger-zion/Z000010,https://www.congress.gov/member/roger-zion/Z00...
2345,"Senator Zorinsky, Edward",edward-zorinsky/Z000013,https://www.congress.gov/member/edward-zorinsk...
2346,"Representative Zschau, Edwin V. W.",edwin-zschau/Z000014,https://www.congress.gov/member/edwin-zschau/Z...


In [12]:
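# Test with this
def scrape_page(guy):
    # Hardcoded for the one-row test: this overrides the argument, so every
    # call scrapes James Abdnor no matter which row is passed in
    guy = 'james-abdnor/A000009'
    url = 'https://www.congress.gov/member/' + guy
    driver.get(url)  # returns None, so there is nothing useful to assign
    name = driver.find_element_by_tag_name('h1').text
    return (name, url)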

In [16]:
# Apply to the slug column and keep the original df intact
scraped_df = df['slug'][:5].apply(scrape_page)

In [17]:
scraped_df

0    (Senator James Abdnor (1923 - 2012)\nIn Congre...
1    (Senator James Abdnor (1923 - 2012)\nIn Congre...
2    (Senator James Abdnor (1923 - 2012)\nIn Congre...
3    (Senator James Abdnor (1923 - 2012)\nIn Congre...
4    (Senator James Abdnor (1923 - 2012)\nIn Congre...
Name: slug, dtype: object
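
Every row above is James Abdnor because the test version of `scrape_page` hardcodes his slug. Here is a sketch of one way to generalize it: use the slug that is passed in, and return a `pd.Series` instead of a tuple. Passing `driver` in as an argument and the `bill_count` field name are choices made for this sketch, not requirements of the assignment. It also keeps the notebook's `find_element_by_*` calls; Selenium 4.3 and later removed those in favor of `driver.find_element(By.TAG_NAME, 'h1')`, so adjust to your installed version.

```python
import pandas as pd

BASE_URL = 'https://www.congress.gov/member/'


def scrape_page(slug, driver):
    """Visit one member page and return the scraped fields as a pd.Series."""
    url = BASE_URL + slug
    driver.get(url)
    return pd.Series({
        # Raw text only; cleaning can happen after the scrape
        'name': driver.find_element_by_tag_name('h1').text,
        'bill_count': driver.find_element_by_class_name('results-number').text,
        'url': url,
    })


# In the notebook, with the live `driver` and the original `df`:
# scraped_df = df['slug'][:5].apply(scrape_page, driver=driver)
```

Extra keyword arguments given to `.apply` are forwarded to the function, which is how `driver` gets through. Test on the first five rows before running all 2,347.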

## Join with your original dataframe

Join your new data with your original data, adding the `_scraped` suffix on the new columns. You can use either `.join` or `.merge`, but be sure to read the docs to know the difference!
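
As a rough illustration of the difference, the sketch below joins two toy dataframes by index with `.join` and with `.merge`. The `scraped` values are placeholders rather than real scraped data, and the `bill_count` column is assumed for the example.

```python
import pandas as pd

original = pd.DataFrame({
    'name': ['Senator Abdnor, James', 'Representative Abercrombie, Neil'],
    'slug': ['james-abdnor/A000009', 'neil-abercrombie/A000014'],
})

# Toy stand-in for scraped_df; the values are placeholders
scraped = pd.DataFrame({
    'name': ['placeholder heading 0', 'placeholder heading 1'],
    'bill_count': ['placeholder count 0', 'placeholder count 1'],
})

# .join lines rows up on the index; rsuffix renames only columns that clash
joined = original.join(scraped, rsuffix='_scraped')

# .merge matches on columns by default, so ask for an index-to-index match
merged = original.merge(
    scraped, left_index=True, right_index=True, suffixes=('', '_scraped')
)

# To put the suffix on every scraped column, rename before joining
all_suffixed = original.join(scraped.add_suffix('_scraped'))

print(list(joined.columns))
print(list(all_suffixed.columns))
```

Note that the suffix arguments only touch overlapping column names, so `bill_count` keeps its name in the first two results.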

## Save it

Save your combined results to `congress-plus-scraped.csv`.
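
A short sketch of the save step, using a toy `combined` dataframe and a temporary folder so the demo does not overwrite anything; in the notebook you would write straight to `congress-plus-scraped.csv`. Passing `index=False` keeps the row index out of the file, which avoids an extra `Unnamed: 0` column when the CSV is read back in.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the joined dataframe
combined = pd.DataFrame({
    'name': ['Senator Abdnor, James'],
    'slug': ['james-abdnor/A000009'],
    'name_scraped': ['placeholder heading'],
})

# Temporary folder used only for this demo
out_path = os.path.join(tempfile.mkdtemp(), 'congress-plus-scraped.csv')

combined.to_csv(out_path, index=False)

# Read it back to confirm the columns survived the round trip
reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```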