# scraping with `bs4`

## anti-trans legislation

The past couple of years have seen an explosion in anti-trans legislation that restricts basic rights and recognition for trans people. The number of bills introduced in 2024 set a new record, as did the numbers in 2023 and in 2022 before it. These bills prevent trans people from using bathrooms, playing sports, accessing healthcare, and more in ways that accord with their gender identity. See the [Trans Legislation Tracker](https://translegislation.com/) for more information.

This section uses `requests` and `bs4` to scrape basic metadata about the 80 bills that are being proposed at the federal level. This relatively small dataset will be the starting point for a larger dataset used in future lessons, in the "cleaning data" and "analyzing data" chapters of this curriculum. For those chapters, you will use Python to gather, clean, and analyze the full text of the bills themselves from `congress.gov`.

In [1]:
# import the following libraries for our web scraping project

import requests # to make HTTP requests
from bs4 import BeautifulSoup # our web scraping library

In [2]:
# save the data from the website as a "soup" object

site = requests.get('https://translegislation.com/bills/2024/US') # requests the page at this URL
html_code = site.content # saves the HTML code
soup = BeautifulSoup(html_code, 'lxml') # creates a soup object

Once we have created our `soup`, we can use dot syntax to access HTML elements. Notice that the result includes the entire HTML element we are searching for, with its opening and closing tags.

In [3]:
# get title

soup.title

<title>United States Bills | Anti-trans legislation</title>
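The same dot syntax can be tried offline on a small, made-up HTML string, which makes it easier to see exactly what comes back. The snippet below is only a sketch: `sample_html` is a hypothetical, simplified stand-in for the real page, and it uses Python's built-in `html.parser` instead of `lxml` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# hypothetical, simplified stand-in for the real page
sample_html = """
<html>
  <head><title>United States Bills | Anti-trans legislation</title></head>
  <body>
    <h3><a href="/bills/2024/US/HB1064">US HB1064</a></h3>
    <h3><a href="/bills/2024/US/HB1112">US HB1112</a></h3>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

first_title = sample_soup.title  # the whole <title> element, tags included
first_h3 = sample_soup.h3        # only the FIRST <h3>, even though there are two
missing = sample_soup.table      # there is no <table> in the sample, so this is None

print(first_title)
print(first_h3)
print(missing)
```

Two things are worth noticing here: dot syntax only ever returns the first matching element, and it returns `None` (rather than raising an error) when the element does not exist.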

## inspecting our page
Remember that we must inspect pages with our browser's "Inspector" tool, so we know what elements to scrape with bs4.

First, navigate to the target website, at [`https://translegislation.com/`](https://translegislation.com/). Scroll down until you see the "National Anti-Trans Bills" heading. Click on the blue button that says "View 2024 National Bills". (Alternatively, just navigate directly to the page [here](https://translegislation.com/bills/2024/US)).

Once you're on the page, *right click* on a bill title, any bill title, and select the **inspect element** option (or whatever option is closest to that phrase in your menu). The inspector should pop up.

![To open the inspector](./img/inspector0.jpg)

Then, look for that element in the HTML code. The inspector contains everything you need to know about that element, including its HTML tag, `h3`, the `a` tag nested inside it, and any attributes, like `class` or `href`.

![Examine the inspector](./img/inspector1.jpg)

Once you identify some elements, append the name of the element to the "soup" using dot syntax.


In [4]:
# checking for third level header element

soup.h3

<h3 class="chakra-heading css-1vygpf9"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakra-link css-f4h6uy" href="/bills/2024/US/HB1064">US<!-- --> <!-- -->HB1064</a></h3>

In [5]:
# checking for division element (the outer element)

soup.div

<div data-reactroot="" id="__next"><style data-emotion="css-global 1o2ia7f">:host,:root{--chakra-ring-inset:var(--chakra-empty,/*!*/ /*!*/);--chakra-ring-offset-width:0px;--chakra-ring-offset-color:#fff;--chakra-ring-color:rgba(66, 153, 225, 0.6);--chakra-ring-offset-shadow:0 0 #0000;--chakra-ring-shadow:0 0 #0000;--chakra-space-x-reverse:0;--chakra-space-y-reverse:0;--chakra-colors-transparent:transparent;--chakra-colors-current:currentColor;--chakra-colors-black:#000000;--chakra-colors-white:#FFFFFF;--chakra-colors-whiteAlpha-50:rgba(255, 255, 255, 0.04);--chakra-colors-whiteAlpha-100:rgba(255, 255, 255, 0.06);--chakra-colors-whiteAlpha-200:rgba(255, 255, 255, 0.08);--chakra-colors-whiteAlpha-300:rgba(255, 255, 255, 0.16);--chakra-colors-whiteAlpha-400:rgba(255, 255, 255, 0.24);--chakra-colors-whiteAlpha-500:rgba(255, 255, 255, 0.36);--chakra-colors-whiteAlpha-600:rgba(255, 255, 255, 0.48);--chakra-colors-whiteAlpha-700:rgba(255, 255, 255, 0.64);--chakra-colors-whiteAlpha-800:rgba(25

## getting by attribute: `text`

Remember, we can access the text within each tag, getting rid of tags like `<p>` or `<h3>`, by using the `text` property.

In [6]:
soup.h3.text

'US HB1064'

You can chain elements one after another to reach more specific elements.

In [7]:
soup.div.a.text

'Trans Legislation Tracker'
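To see why `soup.div.a.text` returns the site name rather than a bill title, it can help to try the same chain on a tiny example. This is a minimal sketch on a hypothetical HTML string (`sample_html` is made up for illustration, and `html.parser` is used so nothing extra needs installing).

```python
from bs4 import BeautifulSoup

# hypothetical, simplified stand-in for the real page
sample_html = """
<div id="nav"><a href="/">Trans Legislation Tracker</a></div>
<div id="cards">
  <h3><a href="/bills/2024/US/HB1064">US HB1064</a></h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# each step narrows the search: the first <div>, then the first <a> inside it
nav_text = sample_soup.div.a.text

# the first <h3> anywhere in the document, then the first <a> inside it
bill_text = sample_soup.h3.a.text

print(nav_text)
print(bill_text)
```

Each step in the chain takes the first match inside the previous one, so the order of elements on the page decides what you get back.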

Using a variable, we can save just the text. This will be useful later, when we write more complex code, and migrate our data into a spreadsheet.

In [8]:
# saving the text from the level 3 header element to "bill_title"

bill_title = soup.h3.text

In [9]:
bill_title

'US HB1064'

## searching by HTML attributes: `class` and `href`

Remember that, in addition to text, we can also get the HTML attributes. [HTML attributes](https://www.w3schools.com/html/html_attributes.asp) contain additional information about an HTML tag. A popular attribute is `href`, which stands for hyperlink reference, and it contains the link's URL address.

To access the attributes like `href`, we use the syntax: `tag['attr']`.

In [10]:
# note that this prints the value of each attribute (like the name of the class), not
# the actual text contained within the larger element. For that, use the `text` property.

soup.h3['class']

['chakra-heading', 'css-1vygpf9']

In [11]:
link_location = soup.h3.a['href']

In [12]:
# the result is a relative path (it starts with `/`), not a full URL

link_location

'/bills/2024/US/HB1064'
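Because the `href` is a relative path, it needs the site's address in front of it before it can be opened on its own. The sketch below shows one way to do that with `urljoin` from Python's standard library, along with the difference between `tag['attr']` and `tag.get('attr')`. The `sample_html` string is a hypothetical, simplified copy of one heading from the page.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# hypothetical, simplified copy of one heading from the page
sample_html = (
    '<h3 class="chakra-heading css-1vygpf9">'
    '<a class="chakra-link" href="/bills/2024/US/HB1064">US HB1064</a>'
    '</h3>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

link_tag = sample_soup.h3.a

relative_link = link_tag['href']      # square brackets raise a KeyError if the attribute is missing
missing_attr = link_tag.get('title')  # .get() returns None instead of raising an error

# the base address of the site we are scraping
base_url = 'https://translegislation.com'
full_link = urljoin(base_url, relative_link)

print(full_link)
```

Using `.get()` is the safer choice when you are not sure every element carries the attribute you are after.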

Once we have our `class` and `href` values, we can pass them to `find()` or `find_all()` to make our searches more specific, for example to access a particular element that has a specific class name.

Just include `class_=xxx` in your `find()` or `find_all()` call. Note the trailing underscore: `class` on its own is a reserved word in Python.

In [13]:
soup.find('h3', class_='css-1vygpf9')

<h3 class="chakra-heading css-1vygpf9"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakra-link css-f4h6uy" href="/bills/2024/US/HB1064">US<!-- --> <!-- -->HB1064</a></h3>

In [14]:
soup.find_all('h3', class_='css-1vygpf9')

[<h3 class="chakra-heading css-1vygpf9"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakra-link css-f4h6uy" href="/bills/2024/US/HB1064">US<!-- --> <!-- -->HB1064</a></h3>,
 <h3 class="chakra-heading css-1vygpf9"><a class="chakra-link css-f4h6uy" href="/bills/2024/US/HB1112">US<!-- --> <!-- -->HB1112</a></h3>,
 <h3 class="chakra-heading css-1vygpf9"><a class="chakra-link css-f4h6uy" href="/bills/2024/US/HB1276">US<!-- --> <!-- -->HB1276</a></h3>,
 <h3 class="chakra-heading

In [15]:
# save h3 element to a variable 

bill_title = soup.find('h3', class_='css-1vygpf9')

In [16]:
bill_title

<h3 class="chakra-heading css-1vygpf9"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakra-link css-f4h6uy" href="/bills/2024/US/HB1064">US<!-- --> <!-- -->HB1064</a></h3>

Making variables is useful for layering other operations on top, like getting `text`.

In [17]:
bill_title.text

'US HB1064'
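The sketch below compares what `find()` and `find_all()` return, including what happens when nothing matches. It runs on a hypothetical, simplified HTML string; the class names are copied from the output above, and everything else in `sample_html` is made up for illustration.

```python
from bs4 import BeautifulSoup

# hypothetical, simplified stand-in for the real page
sample_html = """
<h2 class="chakra-heading">National Bills</h2>
<h3 class="chakra-heading css-1vygpf9"><a href="/bills/2024/US/HB1064">US HB1064</a></h3>
<h3 class="chakra-heading css-1vygpf9"><a href="/bills/2024/US/HB1112">US HB1112</a></h3>
<h3 class="other"><a href="/about">About</a></h3>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns a single element: the first match, or None if there is none
first_match = sample_soup.find('h3', class_='css-1vygpf9')
no_match = sample_soup.find('h3', class_='does-not-exist')

# find_all() returns a list-like object with every match (empty if there are none)
all_matches = sample_soup.find_all('h3', class_='css-1vygpf9')
no_matches = sample_soup.find_all('h3', class_='does-not-exist')

print(first_match.text)
print(len(all_matches))
print(no_match, len(no_matches))
```

Notice that `class_` matches an element as long as *one* of its classes is the one you asked for, so the headings with two classes are still found.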

## methods vs attributes
The decision to use dot syntax (like `soup.h3.a`) or a method (like `find()` or `find_all()`) depends on what you're trying to do, and what kind of data you have about the thing you're trying to scrape. In many cases, you could use either one. 

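The difference between the two is how you ask Python for the data. With dot syntax, you access the tag as if it were an attribute, or property, of the soup object, and you always get back the first tag with that name. With a method like `find()`, you're executing a function to find the data, which lets you pass in extra search criteria such as a class name.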
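A quick way to convince yourself of this is to compare the two approaches directly. The following is a minimal sketch on a hypothetical HTML string (`sample_html` is made up for illustration).

```python
from bs4 import BeautifulSoup

# hypothetical, simplified stand-in for the real page
sample_html = """
<h3 class="chakra-heading"><a href="/bills/2024/US/HB1064">US HB1064</a></h3>
<h3 class="chakra-heading"><a href="/bills/2024/US/HB1112">US HB1112</a></h3>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

via_dot = sample_soup.h3           # dot syntax: the first <h3>
via_find = sample_soup.find('h3')  # method call: also the first <h3>

print(via_dot == via_find)

# only the method form accepts extra filters, or returns every match
second_link = sample_soup.find('a', href='/bills/2024/US/HB1112')
all_h3 = sample_soup.find_all('h3')

print(second_link.text)
print(len(all_h3))
```

Dot syntax is the quickest to type, while the methods are the ones to reach for as soon as "the first one" is not specific enough.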

### individual challenge: 
Let's try doing the same thing two different ways. What if I wanted to get the link, the value of the `href` attribute, using both methods and attributes? 

In [18]:
# use find to search by element and class. Now, grab the link. 

link = soup.find('h3', class_='css-1vygpf9').a['href']

print(link)

/bills/2024/US/HB1064


In [19]:
# there are multiple ways to do this! If you don't need the class, just use dot syntax

soup.h3.a['href']

'/bills/2024/US/HB1064'

As the first solution shows, `find()` lets you search for an element by a specific attribute: just include `class_=xxx` in your `find()` call.

## looping through `find_all()`

Want to print out every element with a specific tag? Then we use `find_all()`. *Note: we use `find_all()` rather than `find()` because `find()` returns only the first match, while `find_all()` returns a list-like object containing every match, which we can loop over.*

Let's do this to get all of the bill names, with just the text.


In [20]:
for i in soup.find_all('h3'):
    print(i.text)

US HB1064
US HB1112
US HB1276
US HB1399
US HB1490
US HB1585
US HB216
US HB3101
US HB3102
US HB3328
US HB3329
US HB3462
US HB3887
US HB429
US HB4365
US HB4367
US HB4398
US HB4665
US HB4821
US HB5
US HB5327
US HB5636
US HB5893
US HB5894
US HB6040
US HB6258
US HB6658
US HB6728
US HB7183
US HB7187
US HB734
US HB736
US HB7725
US HB8070
US HB8433
US HB8580
US HB8708
US HB8752
US HB8771
US HB8774
US HB8997
US HB8998
US HB9026
US HB9027
US HB9028
US HB9029
US HB9218
US HB9586
US HB985
US HJR160
US HJR165
US HR115
US HR1223
US HR282
US HR298
US HR518
US HR536
US HR769
US SB1595
US SB1597
US SB1709
US SB187
US SB200
US SB2357
US SB2394
US SB2797
US SB3035
US SB3438
US SB3729
US SB435
US SB457
US SB4638
US SB613
US SB635
US SB752
US SJR90
US SJR96
US SR267
US SR53
US SR669
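Printing is handy for a quick check, but to reuse the titles later it helps to store them in a list. The sketch below does this on a hypothetical three-heading HTML string; on the real page, the same loop should collect one title per bill, so the length of the list is an easy way to confirm that all 80 bills were found.

```python
from bs4 import BeautifulSoup

# hypothetical, simplified stand-in for the real page
sample_html = """
<h3><a href="/bills/2024/US/HB1064">US HB1064</a></h3>
<h3><a href="/bills/2024/US/HB1112">US HB1112</a></h3>
<h3><a href="/bills/2024/US/HB1276">US HB1276</a></h3>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

titles = []  # an empty list to collect the results
for heading in sample_soup.find_all('h3'):
    titles.append(heading.text)  # add just the text of each heading

print(len(titles))  # how many headings were found
print(titles)
```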


## getting data from each bill card

For our project, we want to scrape information about each bill contained within the bill cards. 

Like all good programmers, we will break our task up into a number of steps:
1. isolate the bill_cards data from the rest of the webpage 
2. pick out the information we want from the bill cards 
3. process our information into lists
4. save that information to a csv file

Each of these steps itself contains smaller steps, which we will figure out as we go along. Let's begin with the first step. Here, we want to separate out that information (within the bill cards) from the rest of the website. This will make it easier to then go grab the elements we need later.

## step 1: isolate our bill_cards data from the rest of the web page
First, create a new object called `bill_cards`, which enables us to narrow down the parts of the website that we want to scrape.

In [21]:
# to get the element and class for the cards, use the inspector

bill_cards = soup.find_all('div', class_='css-4rck61')

Let's use a loop to check that we have all the right data. In the next section, we will be able to pick out specific pieces of text, based on their HTML markup.

In [22]:
# printing out all the text contained in the "bill cards" div

for i in bill_cards:
    print(i.text)

US HB1064MILITARYINTRODUCEDEnsuring Military Readiness Act of 2023To provide requirements related to the eligibility of transgender individuals from serving in the Armed Forces.Transgender persons who require or have undergone gender transition are disqualified from military service.View Bill
US HB1112MILITARYINTRODUCEDEnsuring Military Readiness Act of 2023To provide requirements related to the eligibility of individuals who identify as transgender from serving in the Armed Forces.View Bill
US HB1276HEALTHCAREINTRODUCEDProtect Minors from Medical Malpractice Act of 2023To protect children from medical malpractice in the form of gender transition procedures.A medical practitioner, in any circumstance described in subsection (c), who performs a gender-transition procedure on an individual who is less than 18 years of age shall, as described in subsection (b), be liable to the individual if injured (including any physical, psychological, emotional, or physiological harms) by such procedu
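The text above runs together because `.text` glues the contents of every nested element into one string. For a more readable check, `get_text()` accepts a `separator` and a `strip` argument. Here is a minimal sketch on a single hypothetical bill card: the class name is copied from the cell above, but the inner structure of `sample_html` is a simplified guess, not the page's real markup.

```python
from bs4 import BeautifulSoup

# hypothetical, simplified bill card
sample_html = """
<div class="css-4rck61">
  <h3><a href="/bills/2024/US/HB1064">US HB1064</a></h3>
  <span>MILITARY</span>
  <h2>Ensuring Military Readiness Act of 2023</h2>
  <p>To provide requirements related to the eligibility of transgender individuals from serving in the Armed Forces.</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_cards = sample_soup.find_all('div', class_='css-4rck61')

for card in sample_cards:
    # separator is placed between the text of each element;
    # strip=True trims whitespace and skips whitespace-only pieces
    readable = card.get_text(separator=' | ', strip=True)
    print(readable)
```

This is only for eyeballing the data. To pull out the individual pieces, we still need to target each element separately.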

## step 2: pick out information from each bill card

## group challenge!
Now that we have narrowed down our data to `bill_cards`, we can search within this code for the individual elements we want. For our dataset, we want to scrape the following information:
- bill title
- bill caption (if it exists!)
- bill category
- bill description
- link to bill

Using the inspector, take 5-10 minutes to create a list of HTML elements and attributes that correspond to the above information. Work with a partner.

<!--
# h3.text
# this.class = css-bu60l4
# h2.text
# a.class = chakra-link
-->

Once we have the code for the relevant HTML elements, we can extract them and save them. To do that, we will write a loop that goes through each item in our `bill_cards`, one by one, and pulls out the title, category, caption, description, and link.

*Note: loops are ways of programmatically going through a dataset and doing something to each item in the dataset, like extracting it. Read more about [loops in the intro workshop](../intro/loops.ipynb)*

Below, I will explain the code logic by writing it out in "pseudo-code" in the comments. Pseudo-code is a cross between normal language and programming language that is useful for working out how to write the actual programming code in Python.

In [23]:
# for each card in bill_cards:
#     get the title in h3.text
#     get the category in span.text
#     get the caption in h2.text
#     get the description in p.text (if any)
#     get the link in the a tag, class "chakra-link"

In [24]:
# runs the loop on the bill cards
bill_cards = soup.find_all('div', class_='css-4rck61')

for item in bill_cards[:10]: # only the first ten cards, just to check if it is working
    print(item.h3.text) # title
    print(item.span.text) # category
    print(item.h2.text) # caption
    print(item.p.text if item.p is not None else '') # description (blank if the card has none)
    print(item.a['href']) # relative link; add https://translegislation.com in front for the full URL

US HB1064
MILITARY
Ensuring Military Readiness Act of 2023
To provide requirements related to the eligibility of transgender individuals from serving in the Armed Forces.
/bills/2024/US/HB1064
US HB1112
MILITARY
Ensuring Military Readiness Act of 2023
To provide requirements related to the eligibility of individuals who identify as transgender from serving in the Armed Forces.
/bills/2024/US/HB1112
US HB1276
HEALTHCARE
Protect Minors from Medical Malpractice Act of 2023
To protect children from medical malpractice in the form of gender transition procedures.
/bills/2024/US/HB1276
US HB1399
HEALTHCARE
Protect Children’s Innocence Act
To amend chapter 110 of title 18, United States Code, to prohibit gender affirming care on minors, and for other purposes.
/bills/2024/US/HB1399
US HB1490
INCARCERATION
Preventing Violence Against Female Inmates Act of 2023
To secure the dignity and safety of incarcerated women.
/bills/2024/US/HB1490
US HB1585
EDUCATION
Prohibiting Parental Secrecy Policies
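Since not every card is guaranteed to contain every element, it is worth guarding against missing ones: calling `.text` on an element that does not exist raises an `AttributeError`. The sketch below shows one possible way to handle this. The helper `text_or_none` is a hypothetical name introduced here, and `sample_html` is made-up data (including the bill numbers and titles) in which the second card has no description.

```python
from bs4 import BeautifulSoup

# hypothetical cards with made-up bills; the second card has no <p> description
sample_html = """
<div class="css-4rck61">
  <h3><a href="/bills/2024/US/HB0001">US HB0001</a></h3>
  <span>EDUCATION</span>
  <h2>Example Act One</h2>
  <p>An example description.</p>
</div>
<div class="css-4rck61">
  <h3><a href="/bills/2024/US/HB0002">US HB0002</a></h3>
  <span>HEALTHCARE</span>
  <h2>Example Act Two</h2>
</div>
"""

def text_or_none(tag):
    """Return the tag's stripped text, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

sample_soup = BeautifulSoup(sample_html, 'html.parser')

rows = []  # one dictionary per bill card
for card in sample_soup.find_all('div', class_='css-4rck61'):
    link_tag = card.find('a')
    rows.append({
        'title': text_or_none(card.h3),
        'category': text_or_none(card.span),
        'caption': text_or_none(card.h2),
        'description': text_or_none(card.p),  # None when there is no <p>
        'link': link_tag.get('href') if link_tag is not None else None,
    })

for row in rows:
    print(row)
```

Keeping each card's values together in one dictionary also makes it harder for the columns to drift out of alignment when a value is missing.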

Excellent work! In the next section, we will write some code to save this data in the form of a spreadsheet, a `csv` file.