# Scraping classwork: answers

Steps we do only once:
- Create a folder to save HTML
- Make dataframe for bills

Steps to repeat in a `for` loop:
- Request the URL
- Save the HTML of the URL
- Parse the page with bs4
- Find and get what's inside `id='billTextContainer'`
- Clean up the bill text
  - Replace punctuation with space
  - Replace newlines with space
  - Replace multiple spaces with one space
- Get the word count
- Save the word count into the dataframe

Finally, export the dataframe to a CSV file.
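
The cleanup and word-count steps listed above can be tried on a plain string before touching any real page. This is only a sketch: the function name `count_words` and the sample text are made up for illustration, and the notebook itself does these steps inline in the loop.

```python
import re
import string

# Map every ASCII punctuation character to a space (built once, reused per call)
PUNCTUATION_TABLE = str.maketrans({key: ' ' for key in string.punctuation})

def count_words(text):
    """Hypothetical helper: clean `text` as in the steps above and count its words."""
    cleaned = text.translate(PUNCTUATION_TABLE)   # punctuation -> space
    cleaned = re.sub(r'\n', ' ', cleaned)         # newlines -> space
    cleaned = re.sub(r'\s{2,}', ' ', cleaned)     # runs of whitespace -> one space
    return len(cleaned.split())

# Invented sample text, not taken from any bill
sample = "SECTION 1. SHORT TITLE.\nThis Act may be cited as the \"Example Act\"."
print(count_words(sample))
```

Because punctuation is replaced with spaces, hyphenated terms and abbreviations such as `H.R.` are counted as separate words.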

In [1]:
import json
import requests
from bs4 import BeautifulSoup
import re
import string
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

## Create `pages` folder to save HTML

In [2]:
!mkdir -p pages
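
`!mkdir -p` is a shell command, so it only works where that command is available. A rough Python equivalent using the standard library's `pathlib` is sketched below; the variable name `pages_dir` is an assumption and does not appear elsewhere in this notebook.

```python
from pathlib import Path

# Create the folder if needed; exist_ok=True makes re-running the cell harmless,
# much like the -p flag of mkdir
pages_dir = Path('pages')
pages_dir.mkdir(parents=True, exist_ok=True)
```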

## Import bills data

In [3]:
with open('bills.json') as file:
    bills = json.load(file)

# pretty-print file
# print(json.dumps(bills, indent=2))

## Create dataframe

In [4]:
bills_df = pd.DataFrame(bills)
bills_df['word_count'] = np.nan
bills_df

| | congress | chamber | bill_url | bill_number | word_count |
|---|---|---|---|---|---|
| 0 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 133 | NaN |
| 1 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 150 | NaN |
| 2 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 251 | NaN |
| 3 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 259 | NaN |
| 4 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 263 | NaN |
| 5 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 266 | NaN |
| 6 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 276 | NaN |
| 7 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 299 | NaN |
| 8 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 430 | NaN |
| 9 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 434 | NaN |


## Scrape and parse `bills`

In [None]:
# Build the translation table once, outside the loop.
# It maps every punctuation character to a space.
punctuation_table = str.maketrans({key: ' ' for key in string.punctuation})
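
To see what the table does, here is a small sketch that applies it to a made-up string (the sample text is not from any bill).

```python
import string

punctuation_table = str.maketrans({key: ' ' for key in string.punctuation})

# Invented sample, for illustration only
sample = 'H.R. 133, Sec. 2(a)'
translated = sample.translate(punctuation_table)

# Each punctuation character becomes exactly one space, so the length is unchanged
print(repr(translated))
print(len(sample) == len(translated))
```

Note that `string.punctuation` only contains ASCII punctuation, so characters such as curly quotes pass through this table untouched.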

In [None]:
for bill in tqdm(bills):
    bill_url = bill['bill_url']
    bill_number = bill['bill_number']
    
    # Request the URL
    page = requests.get(bill_url)
    
    # Save the HTML of the URL
    # See string_interpolation.ipynb notebook in this repo for how f-strings work
    with open(f'pages/page_{ bill_number }.html', 'w', encoding='utf-8') as f:
        f.write(page.text)
    
    # Parse the page with bs4
    soup = BeautifulSoup(page.text, features='html.parser')
    
    # Find and get what's inside `id='billTextContainer'`
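    bill_text_container = soup.find(id='billTextContainer')
    if bill_text_container is None:
        # The container is missing from this page: skip the bill and leave its word_count as NaN
        continue
    bill_text = bill_text_container.get_text()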
    
    # Clean up the bill text
    
    # Replace punctuation with space
    bill_text_cleaned = bill_text.translate(punctuation_table)
    
    # Replace newlines with space
    bill_text_cleaned = re.sub(r'\n', ' ', bill_text_cleaned)
    
    # Replace multiple spaces with one space
    bill_text_cleaned = re.sub(r'\s{2,}', ' ', bill_text_cleaned)
    
    # Get the word count
    bill_word_count = len(bill_text_cleaned.split())
    
    # Save the word count into the dataframe
    bills_df.loc[bills_df['bill_number'] == bill_number, 'word_count'] = bill_word_count
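
Because the loop already saves each page into `pages/`, a re-run could read the saved copy instead of requesting the URL again. The sketch below shows one way to do that and is not part of the original answer: `load_or_fetch` and its `fetch` parameter are hypothetical names, and `fetch` stands for any function that takes a URL and returns the HTML as text.

```python
from pathlib import Path

def load_or_fetch(path, url, fetch):
    """Hypothetical helper: return saved HTML if `path` exists, else fetch and save it."""
    path = Path(path)
    if path.exists():
        # Reuse the copy saved by an earlier run
        return path.read_text(encoding='utf-8')
    html = fetch(url)
    # Make sure the target folder exists before writing
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding='utf-8')
    return html
```

Inside the loop, this could stand in for the request and save steps, for example `html = load_or_fetch(f'pages/page_{bill_number}.html', bill_url, lambda url: requests.get(url).text)`.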

In [None]:
bills_df

## Export the data

In [None]:
bills_df.to_csv('bills.csv', index=False)
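
One thing to be aware of: because `word_count` starts out as `np.nan`, pandas stores the column as floats, so counts end up in the CSV looking like `1200.0`. If whole numbers are preferred, one option is pandas' nullable `Int64` dtype, sketched here on a made-up two-row frame (the values are illustrative, not real word counts).

```python
import numpy as np
import pandas as pd

# Toy stand-in for bills_df; the word count below is invented
df = pd.DataFrame({'bill_number': [133, 150]})
df['word_count'] = np.nan
df.loc[df['bill_number'] == 133, 'word_count'] = 1200

# Nullable integer dtype: whole numbers where present, <NA> where still missing
df['word_count'] = df['word_count'].astype('Int64')

# With no path argument, to_csv returns the CSV as a string
csv_text = df.to_csv(index=False)
print(csv_text)
```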