# exploring `bs4`

## what is `bs4`?

bs4 is short for `BeautifulSoup4`, a Python package for parsing HTML data. bs4's power comes from letting us use Python syntax to access and manipulate HTML elements. In other words, we write Python to pull information out of pages written in HTML, the web's main markup language.

I explain what the code below does in "comments" contained within each cell. Comments in Python begin with a hash sign `#` and run to the end of the line. They are like annotations for the code. The `#` tells the computer to ignore everything after it on that line (in other words, that text is meant for human readers).
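Here is a minimal sketch of the two places a comment can appear. The variable name `greeting` is made up for this illustration.

```python
# This whole line is a comment: Python skips it when running the code.
greeting = "hello"  # a comment can also follow code on the same line

# print(greeting.upper())  <- this line is "commented out", so it never runs
print(greeting)
```

Only the last line produces output, because everything after a `#` is ignored.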

In [1]:
# import the following libraries for our web scraping project

import requests # to make HTTP requests
from bs4 import BeautifulSoup # our web scraping library
import lxml # a parser for working with html data

In [2]:
# save the data from the website as a "soup" object

site = requests.get('https://translegislation.com/bills/2023/passed') # gets the URL
html_code = site.content # saves the HTML code
soup = BeautifulSoup(html_code, 'lxml') # creates a soup object

Once we have created our `soup`, we can use dot syntax to access html elements. Notice the result includes the entire html element (with opening and closing tags) that we are searching for. 

In [3]:
# get title

soup.title

<title>2023 Passed anti-trans bills: Trans Legislation Tracker</title>
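If you want to try dot syntax without downloading anything, here is a small sketch that builds a soup from a made-up HTML string. The page content is invented for illustration, and it uses Python's built-in `html.parser` instead of `lxml` so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page standing in for the HTML we downloaded above
html_code = """
<html>
  <head><title>Example page</title></head>
  <body><h1>Hello</h1></body>
</html>
"""

# "html.parser" ships with Python; the workshop itself uses "lxml"
soup = BeautifulSoup(html_code, "html.parser")

print(soup.title)  # <title>Example page</title>
print(soup.h1)     # <h1>Hello</h1>
```

As with the real page, each result is the whole element, opening and closing tags included.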

## inspecting our page
Let's combine what we know about inspecting pages with bs4. We take the HTML elements that we find in the inspector and feed them into bs4 syntax to get the content of those elements.

First, navigate to the target website, at [`https://translegislation.com/`](https://translegislation.com/). Scroll down until you reach the "Passed anti-trans bills" heading. Click on the red button that says "View 2023 Passed Bills". (Alternatively, just navigate directly to the page [here](https://translegislation.com/bills/2023/passed).)

Once you're on the page, *right click* on a bill title, any bill title, and select the **inspect element** option (or whatever option is closest to that phrase in your menu). The inspector should pop up.

![To open the inspector](./img/inspector0.jpg)

Then, look for that element in the HTML code. The inspector contains everything you need to know about that element, including its HTML tag `h3`, the `a` element nested inside it, and any attributes, like `class` or `href`.

![Examine the inspector](./img/inspector1.jpg)

Once you identify some elements, append the name of the element to the "soup" using dot syntax.


In [4]:
# checking for third level header element

soup.h3

<h3 class="chakra-heading css-1vygpf9"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakra-link css-f4h6uy" href="/bills/2023/AL/HB261">AL<!-- --> <!-- -->HB261</a></h3>

In [5]:
# checking for division element

soup.div

<div data-reactroot="" id="__next"><style data-emotion="css-global 1o2ia7f">:host,:root{--chakra-ring-inset:var(--chakra-empty,/*!*/ /*!*/);--chakra-ring-offset-width:0px;--chakra-ring-offset-color:#fff;--chakra-ring-color:rgba(66, 153, 225, 0.6);--chakra-ring-offset-shadow:0 0 #0000;--chakra-ring-shadow:0 0 #0000;--chakra-space-x-reverse:0;--chakra-space-y-reverse:0;--chakra-colors-transparent:transparent;--chakra-colors-current:currentColor;--chakra-colors-black:#000000;--chakra-colors-white:#FFFFFF;--chakra-colors-whiteAlpha-50:rgba(255, 255, 255, 0.04);--chakra-colors-whiteAlpha-100:rgba(255, 255, 255, 0.06);--chakra-colors-whiteAlpha-200:rgba(255, 255, 255, 0.08);--chakra-colors-whiteAlpha-300:rgba(255, 255, 255, 0.16);--chakra-colors-whiteAlpha-400:rgba(255, 255, 255, 0.24);--chakra-colors-whiteAlpha-500:rgba(255, 255, 255, 0.36);--chakra-colors-whiteAlpha-600:rgba(255, 255, 255, 0.48);--chakra-colors-whiteAlpha-700:rgba(255, 255, 255, 0.64);--chakra-colors-whiteAlpha-800:rgba(25

## the `soup` object

The word "object" is something you'll hear often in Python. It means a collection of data and functions that can work on that data. You can think of it as a way of representing real-world objects (like this web page) that is organized and accessible, so you can search and manipulate that information with Python.
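To make "data plus functions" concrete, here is a rough sketch of a hand-made object. The `Bill` class and its `label` function are hypothetical and have nothing to do with bs4 itself. They only show the pattern that the soup object follows.

```python
# A hypothetical "Bill" object: it bundles data (state, number)
# with a function (label) that works on that data.
class Bill:
    def __init__(self, state, number):
        self.state = state    # data stored on the object
        self.number = number  # more data stored on the object

    def label(self):
        # a function attached to an object is called a "method"
        return f"{self.state} {self.number}"


bill = Bill("AL", "HB261")

print(bill.state)    # data access with dot syntax, no parentheses
print(bill.label())  # method call with dot syntax, with parentheses
```

The soup works the same way: `soup.title` reads data, while calls with parentheses, like `soup.find('p')`, run a method.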

Let's take an initial look into what this beautiful soup object allows us to do. It takes the HTML source, the specific HTML elements or "tags," and makes it possible for us to access those tags using python syntax.

In [11]:
soup.h2

<h2 class="chakra-heading css-qwq9lj">What anti-trans bills passed in <!-- -->2023<!-- -->?</h2>

In [12]:
soup.a

<a class="chakra-link css-f4h6uy" href="/">Trans Legislation Tracker</a>
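One thing worth noticing in the results above: dot syntax returns only the *first* matching element, even when the page contains many. The sketch below shows this on a made-up snippet (the HTML and the variable names are invented for illustration, and `html.parser` stands in for `lxml`).

```python
from bs4 import BeautifulSoup

# Made-up HTML with two links and no table
html_code = """
<ul>
  <li><a href="/first">First link</a></li>
  <li><a href="/second">Second link</a></li>
</ul>
"""
soup = BeautifulSoup(html_code, "html.parser")

first_link = soup.a   # only the first <a>, even though there are two
missing = soup.table  # there is no <table> here, so the result is None

print(first_link)  # <a href="/first">First link</a>
print(missing)     # None
```

Asking for a tag that does not exist gives back `None` rather than an error, which is handy to remember when a later step fails unexpectedly.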

## getting text

Let's go a little deeper than the element. We can access the text within each tag, getting rid of tags like `<p>` or `<h3>`, by using the `text` property.

In [None]:
# append the text property after the title property

soup.title.text

'2023 Passed anti-trans bills: Trans Legislation Tracker'

Combine this with what we learned about variables from the [intro to Python workshop](../intro/variables.ipynb), and we can save just the text.

In [None]:
# saving the text from the level 3 header element to "bill_title"

bill_title = soup.h3.text

In [23]:
bill_title

'AL HB261'

Saving data to variables is useful. Later on, we will save to variables in order to migrate our data into a spreadsheet!
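Text scraped from real pages often comes with stray spaces or line breaks. As a small sketch, assuming a made-up header with extra whitespace inside it, `get_text(strip=True)` is one way to tidy the result before saving it.

```python
from bs4 import BeautifulSoup

# Made-up header; note the extra spaces around the title
html_code = '<h3 class="title"><a href="/bills/AL/HB261">  AL HB261 </a></h3>'
soup = BeautifulSoup(html_code, "html.parser")

raw_text = soup.h3.text                    # keeps the surrounding whitespace
clean_text = soup.h3.get_text(strip=True)  # trims whitespace from the text

print(repr(raw_text))    # '  AL HB261 '
print(repr(clean_text))  # 'AL HB261'
```

Using `repr()` when printing makes the leading and trailing spaces visible.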

## getting attributes

In addition to text, we can also get the HTML attributes. [HTML attributes](https://www.w3schools.com/html/html_attributes.asp) contain additional information about an HTML tag. A popular attribute is `href`, which stands for hypertext reference and holds the link's URL. To access attributes like `href`, we use the syntax `tag['attr']`.

In [None]:
# note that this prints the value of each attribute (like the name of the class), not
# the actual text contained within the larger element. For that, use the `text` property.

soup.h3['class']

['chakra-heading', 'css-1vygpf9']

In [39]:
link_location = soup.a['href']

In [34]:
# the result is just `/`, a relative link that points to the site's home page

link_location

'/'
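Links like `/` or `/bills/2023/AL/HB261` are relative, so they are not complete URLs on their own. The sketch below shows one possible way to handle this, using `urljoin` from Python's standard library. The HTML snippet is made up to resemble a bill header, and the variable names are only for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up snippet modeled on one bill header from the page
html_code = (
    '<h3 class="chakra-heading css-1vygpf9">'
    '<a class="chakra-link" href="/bills/2023/AL/HB261">AL HB261</a>'
    "</h3>"
)
soup = BeautifulSoup(html_code, "html.parser")

link = soup.a
href = link["href"]              # square brackets raise a KeyError if the attribute is missing
title_attr = link.get("title")   # .get() returns None instead of raising an error

# Combine the page we scraped with the relative link to get a full URL
page_url = "https://translegislation.com/bills/2023/passed"
full_url = urljoin(page_url, href)

print(href)        # /bills/2023/AL/HB261
print(title_attr)  # None
print(full_url)    # https://translegislation.com/bills/2023/AL/HB261
```

Note that `class` comes back as a list, because an element can have several classes, while `href` comes back as a single string.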

## `find()`

bs4 offers multiple ways to access elements in the soup object. So far, we have been accessing them through properties like `h3` and `text`, using dot syntax.

We can also access elements using methods, like `find()` or `find_all()`. This option is more useful when we want to get granular about our choices, for example, when we want a particular element that has a specific class name.

In [6]:
soup.find('p')

<p class="chakra-text css-1g6ksko" style="line-height:1.2rem;margin-right:15px"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakra-link css-f4h6uy" href="/">Trans Legislation Tracker</a></p>

In [None]:
# save p element to a variable 

paragraph = soup.find('p')

In [None]:
paragraph

Making variables is useful for layering other operations on top, like getting `text`.

In [None]:
# combine what we know about methods and properties to get the text from the paragraph

paragraph.text

'Trans Legislation Tracker'

In [17]:
# doing the same with a link

link = soup.find('a')
print(link)

<a class="chakra-link css-f4h6uy" href="/">Trans Legislation Tracker</a>


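You can also use `find()` to search for an element by a specific attribute. To search by class, include `class_='xxx'` in your `find()` call. (The trailing underscore is there because `class` is a reserved word in Python.)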

In [12]:
soup.find('div', class_='css-wd7aku')

<div class="css-wd7aku"><style data-emotion="css 1vygpf9">.css-1vygpf9{font-family:var(--chakra-fonts-heading);font-weight:var(--chakra-fontWeights-bold);font-size:var(--chakra-fontSizes-2xl);line-height:1.33;color:#181818;text-align:left;margin-bottom:var(--chakra-space-1);}@media screen and (min-width: 48em){.css-1vygpf9{font-size:var(--chakra-fontSizes-3xl);line-height:1.2;}}</style><h3 class="chakra-heading css-1vygpf9"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakr

Again, you can combine this with the `text` property to get just the text.

In [13]:
soup.find('div', class_='css-wd7aku').text

'AL HB261SPORTS'
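The result `'AL HB261SPORTS'` runs together because `text` glues the text of every nested element end to end. Here is a small sketch of that behavior, along with what `find()` does when nothing matches. The HTML and the class names `card`, `tag`, and `footer` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up "cards", each holding a title and a category
html_code = """
<div class="card"><h3>AL HB261</h3><span class="tag">SPORTS</span></div>
<div class="card"><h3>AL SB261</h3><span class="tag">SPORTS</span></div>
"""
soup = BeautifulSoup(html_code, "html.parser")

first_card = soup.find("div", class_="card")   # the first div with class "card"
no_match = soup.find("div", class_="footer")   # nothing has this class, so we get None

glued = first_card.text             # nested text joined with nothing in between
spaced = first_card.get_text(" ")   # same text, joined with a space instead

print(glued)     # AL HB261SPORTS
print(spaced)    # AL HB261 SPORTS
print(no_match)  # None
```

Checking for `None` before asking for `.text` avoids an error when a class name turns out to be misspelled or missing from the page.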

## `find_all()`

Want to get every element with a given tag, not just the first one? Then we use `find_all()`, which returns a list of all matches.

In [40]:
soup.find_all('h3')

[<h3 class="chakra-heading css-1vygpf9"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakra-link css-f4h6uy" href="/bills/2023/AL/HB261">AL<!-- --> <!-- -->HB261</a></h3>,
 <h3 class="chakra-heading css-1vygpf9"><a class="chakra-link css-f4h6uy" href="/bills/2023/AL/SB261">AL<!-- --> <!-- -->SB261</a></h3>,
 <h3 class="chakra-heading css-1vygpf9"><a class="chakra-link css-f4h6uy" href="/bills/2023/AR/HB1156">AR<!-- --> <!-- -->HB1156</a></h3>,
 <h3 class="chakra-heading css

In [41]:
# making a list of all our level 3 headers

headers = soup.find_all('h3')

In [45]:
# remember list slicing? Here we use list slicing to print out only the first three elements

headers[:3]

[<h3 class="chakra-heading css-1vygpf9"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakra-link css-f4h6uy" href="/bills/2023/AL/HB261">AL<!-- --> <!-- -->HB261</a></h3>,
 <h3 class="chakra-heading css-1vygpf9"><a class="chakra-link css-f4h6uy" href="/bills/2023/AL/SB261">AL<!-- --> <!-- -->SB261</a></h3>,
 <h3 class="chakra-heading css-1vygpf9"><a class="chakra-link css-f4h6uy" href="/bills/2023/AR/HB1156">AR<!-- --> <!-- -->HB1156</a></h3>]
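Since `find_all()` gives back a list, we can loop over it and pull the text out of each header. The following is a sketch on made-up HTML that mimics the first three bill headers. The list name `titles` is an assumption for this example.

```python
from bs4 import BeautifulSoup

# Made-up HTML modeled on the first three bill headers
html_code = """
<h3><a href="/bills/2023/AL/HB261">AL HB261</a></h3>
<h3><a href="/bills/2023/AL/SB261">AL SB261</a></h3>
<h3><a href="/bills/2023/AR/HB1156">AR HB1156</a></h3>
"""
soup = BeautifulSoup(html_code, "html.parser")

headers = soup.find_all("h3")

# Collect the text of each header into a plain Python list
titles = []
for header in headers:
    titles.append(header.text)

print(len(headers))  # 3
print(titles)        # ['AL HB261', 'AL SB261', 'AR HB1156']
```

The same loop pattern works for any property or attribute, for example `header.a['href']` to collect the links.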

For our project, we want to scrape information about each bill contained within the bill cards. So, it makes sense to separate out that information (within the bill cards) from the rest of the website. This will make it easier to then go grab the elements we need later.

To do so, use the inspector to find the element that contains all of the bill cards. We can see that the element is `div` with the class `css-1ftdpv0`.

![inspector over the bill cards column](./img/explore0.jpg)

You may have to move your mouse over different parts of the code (inside the inspector window) until the desired part of the webpage is highlighted in blue. In our case, we want all of the cards to be highlighted, because we want the element that corresponds to that section.

In [None]:
# grab every div with the class `css-1ftdpv0` (each one comes with everything nested inside it)

soup.find_all('div', class_='css-1ftdpv0')

In [None]:
# save the elements to `bill_cards`. This will make it easier to search the data later on.

bill_cards = soup.find_all('div', class_='css-1ftdpv0')
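Keep in mind that `find_all()` returns a list, even when only one element matches, so `bill_cards` has to be indexed or looped over before searching inside it. Here is a rough sketch of that step on made-up HTML. The class names `cards` and `card` are stand-ins for the real ones, and `container` and `card_titles` are names invented for this example.

```python
from bs4 import BeautifulSoup

# Made-up page: one heading outside the cards, two inside
html_code = """
<h3>Site heading outside the cards</h3>
<div class="cards">
  <div class="card"><h3>AL HB261</h3></div>
  <div class="card"><h3>AL SB261</h3></div>
</div>
"""
soup = BeautifulSoup(html_code, "html.parser")

bill_cards = soup.find_all("div", class_="cards")  # a list with one element
container = bill_cards[0]                          # take the element out of the list

# Searching inside the container skips everything outside of it
card_titles = []
for header in container.find_all("h3"):
    card_titles.append(header.text)

print(len(soup.find_all("h3")))  # 3 headers on the whole page
print(card_titles)               # ['AL HB261', 'AL SB261']
```

Calling `find_all()` on an element, rather than on the whole soup, limits the search to that element's contents. That is the point of narrowing down to the bill cards first.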

## group challenge
Now that we have narrowed down our data to `bill_cards`, we can search within this code for the individual elements we want. For our dataset, we want to scrape the following information:
- bill title
- bill category
- bill description
- link to bill

Using the inspector, take 5-10 minutes to create a list of HTML elements and attributes that correspond to the above information. Work with a partner.

<!--
# h3.text
# this.class = css-bu60l4
# h2.text
# a.class = chakra-link
-->