# scraping our text

Here I show how we can use web scraping to get bill data from congress.gov servers. FYI - I tried using the congress.gov API to get this dataset, which is always the right thing to do! But it doesn't offer the full text of the bill, so that's why I turned to scraping. For future reference, you can request an API key here: https://api.congress.gov/


In [1]:
import requests # for making http (web) requests
import pandas as pd # for working with tabular (spreadsheet) data
import csv # also for working with tabular data, in csv format

# this grabs the CSV from the previous section. If you get a file
# not found error make sure you go through the previous section to 
# save that csv
bills = pd.read_csv('../gathering/bill_data.csv')

df = pd.DataFrame(bills)

FileNotFoundError: [Errno 2] No such file or directory: '../gathering/bill_data.csv'

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   80 non-null     int64 
 1   title        80 non-null     object
 2   caption      80 non-null     object
 3   category     80 non-null     object
 4   description  80 non-null     object
 5   url          80 non-null     object
 6   legiscan     80 non-null     object
 7   congress     80 non-null     object
dtypes: int64(1), object(7)
memory usage: 5.1+ KB


In [48]:
df

Unnamed: 0.1,Unnamed: 0,title,caption,category,description,url,legiscan,congress
0,0,US HB1064,Ensuring Military Readiness Act of 2023,MILITARY,Ensuring Military Readiness Act of 2023,https://translegislation.com//bills/2024/US/HB...,https://legiscan.com/US/text/HB1064/id/2737306,https://www.congress.gov/bill/118th-congress/h...
1,1,US HB1112,Ensuring Military Readiness Act of 2023,MILITARY,Ensuring Military Readiness Act of 2023,https://translegislation.com//bills/2024/US/HB...,https://legiscan.com/US/text/HB1112/id/2742708,https://www.congress.gov/bill/118th-congress/h...
2,2,US HB1276,Protect Minors from Medical Malpractice Act of...,HEALTHCARE,Protect Minors from Medical Malpractice Act of...,https://translegislation.com//bills/2024/US/HB...,https://legiscan.com/US/text/HB1276/id/2755407,https://www.congress.gov/bill/118th-congress/h...
3,3,US HB1399,Protect Children’s Innocence Act,HEALTHCARE,Protect Children’s Innocence Act,https://translegislation.com//bills/2024/US/HB...,https://legiscan.com/US/text/HB1399/id/2796538,https://www.congress.gov/bill/118th-congress/h...
4,4,US HB1490,Preventing Violence Against Female Inmates Act...,INCARCERATION,Preventing Violence Against Female Inmates Act...,https://translegislation.com//bills/2024/US/HB...,https://legiscan.com/US/text/HB1490/id/2761146,https://www.congress.gov/bill/118th-congress/h...
...,...,...,...,...,...,...,...,...
75,75,US SJR90,A joint resolution providing for congressional...,HEALTHCARE,A joint resolution providing for congressional...,https://translegislation.com//bills/2024/US/SJR90,https://legiscan.com/US/text/SJR90/id/3003899,https://www.congress.gov/bill/118th-congress/s...
76,76,US SJR96,A joint resolution providing for congressional...,EDUCATION,A joint resolution providing for congressional...,https://translegislation.com//bills/2024/US/SJR96,https://legiscan.com/US/text/SJR96/id/3009679,https://www.congress.gov/bill/118th-congress/s...
77,77,US SR267,A resolution supporting the designation of the...,SPORTS,A resolution supporting the designation of the...,https://translegislation.com//bills/2024/US/SR267,https://legiscan.com/US/text/SR267/id/2831179,https://www.congress.gov/bill/118th-congress/s...
78,78,US SR53,A resolution establishing a Women's Bill of Ri...,CIVIL RIGHTS,A resolution establishing a Women's Bill of Ri...,https://translegislation.com//bills/2024/US/SR53,https://legiscan.com/US/text/SR53/id/2696872,https://www.congress.gov/bill/118th-congress/s...


## extracting the bill number
In order to scrape the bill text, we need just the bill number. In order to get that, we need to go through the `Legislation Number` column and extract just the number.

In [5]:
df['title']

0     US HB1064
1     US HB1112
2     US HB1276
3     US HB1399
4     US HB1490
        ...    
75     US SJR90
76     US SJR96
77     US SR267
78      US SR53
79     US SR669
Name: title, Length: 80, dtype: object

In [6]:
# we can use the split() method to split up the single string
# into two strings, by the empty space in between them

bill = "US HB1064"
bill.split(' ')

['US', 'HB1064']

In [36]:
# save our variable, which is a list of two items 

bill = "US HB1064"
splitted = bill.split(' ')

In [37]:
# now we focus on the second item in the new list 

splitted[1]

'HB1064'

Now let's do it to all of them!

In [39]:
titles = []
for item in df['title']:
    bill_title = item.split(' ')
    titles.append(bill_title[1])

In [34]:
titles

['HB1064',
 'HB1112',
 'HB1276',
 'HB1399',
 'HB1490',
 'HB1585',
 'HB216',
 'HB3101',
 'HB3102',
 'HB3328',
 'HB3329',
 'HB3462',
 'HB3887',
 'HB429',
 'HB4365',
 'HB4367',
 'HB4398',
 'HB4665',
 'HB4821',
 'HB5',
 'HB5327',
 'HB5636',
 'HB5893',
 'HB5894',
 'HB6040',
 'HB6258',
 'HB6658',
 'HB6728',
 'HB7183',
 'HB7187',
 'HB734',
 'HB736',
 'HB7725',
 'HB8070',
 'HB8433',
 'HB8580',
 'HB8708',
 'HB8752',
 'HB8771',
 'HB8774',
 'HB8997',
 'HB8998',
 'HB9026',
 'HB9027',
 'HB9028',
 'HB9029',
 'HB9218',
 'HB9586',
 'HB985',
 'HJR160',
 'HJR165',
 'HR115',
 'HR1223',
 'HR282',
 'HR298',
 'HR518',
 'HR536',
 'HR769',
 'SB1595',
 'SB1597',
 'SB1709',
 'SB187',
 'SB200',
 'SB2357',
 'SB2394',
 'SB2797',
 'SB3035',
 'SB3438',
 'SB3729',
 'SB435',
 'SB457',
 'SB4638',
 'SB613',
 'SB635',
 'SB752',
 'SJR90',
 'SJR96',
 'SR267',
 'SR53',
 'SR669']

In [63]:
## a very complicated piece of code that goes into each bill title to pull out just the numbers
## feel free to skip --- this is advanced stuff!

title_numbers = []
title_body = []
for i in titles:
    text = ''
    numbers = ''
    for char in i:
        if(char.isalpha()):
            text += char
        else:
            numbers += char
    title_body.append(text)
    title_numbers.append(numbers)
title_numbers

['1064',
 '1112',
 '1276',
 '1399',
 '1490',
 '1585',
 '216',
 '3101',
 '3102',
 '3328',
 '3329',
 '3462',
 '3887',
 '429',
 '4365',
 '4367',
 '4398',
 '4665',
 '4821',
 '5',
 '5327',
 '5636',
 '5893',
 '5894',
 '6040',
 '6258',
 '6658',
 '6728',
 '7183',
 '7187',
 '734',
 '736',
 '7725',
 '8070',
 '8433',
 '8580',
 '8708',
 '8752',
 '8771',
 '8774',
 '8997',
 '8998',
 '9026',
 '9027',
 '9028',
 '9029',
 '9218',
 '9586',
 '985',
 '160',
 '165',
 '115',
 '1223',
 '282',
 '298',
 '518',
 '536',
 '769',
 '1595',
 '1597',
 '1709',
 '187',
 '200',
 '2357',
 '2394',
 '2797',
 '3035',
 '3438',
 '3729',
 '435',
 '457',
 '4638',
 '613',
 '635',
 '752',
 '90',
 '96',
 '267',
 '53',
 '669']

## scraping the bill text
Using that list of numbers as input, we will write a function that scrapes the bill text.

In [9]:
# here we are introducing "f-strings", which is way of writing "formatted strings"
# in python that allows us to input variables, like a bill number, in this case

def scrape_bill_text(numbers):
    bills_text = []
    for item in numbers:
        # f-string is used to add the specific bill number to the URL
        url = (f'https://www.congress.gov/117/bills/hr{item}/BILLS-117hr{item}ih.htm')
        # requests library to scrape the URL, which is formatted for each bill number
        response = requests.get(url)
        content = response.content
        bills_text.append(content)
    return bills_text

Calling the function and saving the results to `sample`

In [10]:
# so we don't overload the website, we will scrape just a sample of
# the first 10 bills. This will be more than enough data for us to
# practice cleaning.

sample = scrape_bill_text(numbers[:10])

In [11]:
len(sample)

147

In [12]:
# let's check out our first item (the first bill text) in the list 

sample[0]

b"<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n[H.R. 1112 Introduced in House (IH)]\n\n&lt;DOC&gt;\n\n\n\n\n\n\n117th CONGRESS\n  1st Session\n                                H. R. 1112\n\n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n\n_______________________________________________________________________\n\n\n                    IN THE HOUSE OF REPRESENTATIVES\n\n                           February 18, 2021\n\n    Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. \n  Buchanan) introduced the following bill; which was referred to the \n                      Committee on Foreign Affairs\n\n_______________________________________________________________________\n\n                                 A BILL\n\n\n \n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n    Be it enacted by the 

## decoding text from bytes to string
We will "decode" the bytes type of data into a string, so we can eventually save it as a string format. 

In [13]:
# Use type() to see what kind of data we are working with.
# list type 

type(sample)

list

In [14]:
# within the list, bytes type

type(sample[0])

bytes

In [15]:
# turn bytes into string using decode()
decoded = []
for item in sample:
    decoded.append(item.decode('utf-8'))

In [16]:
type(decoded[0])

str

In [17]:
decoded[0]

"<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n[H.R. 1112 Introduced in House (IH)]\n\n&lt;DOC&gt;\n\n\n\n\n\n\n117th CONGRESS\n  1st Session\n                                H. R. 1112\n\n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n\n_______________________________________________________________________\n\n\n                    IN THE HOUSE OF REPRESENTATIVES\n\n                           February 18, 2021\n\n    Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. \n  Buchanan) introduced the following bill; which was referred to the \n                      Committee on Foreign Affairs\n\n_______________________________________________________________________\n\n                                 A BILL\n\n\n \n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n    Be it enacted by the S

In [18]:
with open('sample.txt', 'w') as f:
    for item in decoded:
        f.write(item)