# exploring `bs4`

## what is `bs4`?

`bs4` is short for `BeautifulSoup4`, a Python package for parsing HTML. Its power comes from letting you access and manipulate HTML elements with ordinary Python syntax. In other words, you write Python to pull information out of pages written in HTML, the web's main markup language.

I explain what the code below does in "comments" contained within each cell. A comment in Python starts with a hash sign `#` and runs to the end of the line. Comments are like annotations for the code: the `#` tells the computer to ignore what follows, because that text is meant for human readers.
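
As a small illustration of how comments behave (the variable names here are made up for this example and are not part of the scraping project), a comment can sit on its own line or follow code on the same line:

```python
# this whole line is a comment, so Python skips it
page_count = 3  # a comment can also follow code on the same line

# a "#" inside quotes is part of the string, not a comment
label = "#nyregion"

print(page_count, label)
```

Only the two assignments and the `print` call are executed; everything after an unquoted `#` is ignored.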

## how to scrape a website with 6 lines of code

In [1]:
# import the following libraries for our web scraping project

import requests # to make HTTP requests
from bs4 import BeautifulSoup # our web scraping library
import lxml # a parser for working with html data

In [2]:
# save the data from the website as a "soup" object

site = requests.get('https://www.nytimes.com/section/nyregion') # gets the URL
html_code = site.content # saves the HTML code
soup = BeautifulSoup(html_code, 'lxml') # creates a soup object

*NOTE:*
If you get an error when importing one of the libraries above, make sure it is installed. For Jupyter, that means running the following in your command line program (like Terminal or Git Bash):
```console
pip install requests
pip install bs4
pip install lxml
```
On Colab, run the same commands in a notebook cell, but put an exclamation point before `pip`, like:
```console
!pip install requests
!pip install bs4
!pip install lxml
```

## the `soup` object

This word "object" in Python is something you'll hear often. It means a collection of data and functions that can work on that data. You can think of it as a way of representing real world objects (like this web page) that is organized and accessible, so you can search and manipulate that information with Python.

Let's take an initial look at what this Beautiful Soup object allows us to do. It takes the HTML source and its individual elements, or "tags," and lets us access those tags with Python syntax, specifically the dot syntax.

In [3]:
soup.title

<title data-rh="true">New York - The New York Times</title>

In [4]:
soup.h1

<h1 class="css-14dhlt9 e16wpn5v0" data-component-name="collection-header">New York</h1>

In [5]:
soup.a

<a class="css-kgn7zc" href="#site-content">Skip to content</a>
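
The dot syntax above can also be tried on a small hard-coded page, which makes its behavior easier to see. This is a minimal sketch: `sample_html` and `sample_soup` are made-up names, and it assumes Python's built-in `"html.parser"` rather than the `lxml` parser used in the cells above, so that it runs without a network connection or extra installs beyond `bs4`.

```python
from bs4 import BeautifulSoup

# a tiny, made-up page so the example runs offline
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Main Heading</h1>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
"""

# "html.parser" ships with Python; the cells above use "lxml" instead
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title)  # the <title> tag
print(sample_soup.a)      # only the first <a> tag, even though there are two
print(sample_soup.table)  # None, because this page has no <table> tag
```

Two things are worth noticing: the dot syntax returns only the first matching tag, and it returns `None` rather than raising an error when the tag does not exist.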

## getting text

Let's go a little deeper than the element. We can access the text within each tag, getting rid of tags like `<p>` or `<h3>`, by using the `text` property.

In [6]:
# append the text property after the title property

soup.title.text

'New York - The New York Times'

Which element holds a headline? Use your browser's inspector to find it.

In [7]:
# saving the text from the first level 3 header element to "headline"

headline = soup.h3.text

In [8]:
headline

'Her Pension Checks Vanished. The Doorman Stole Them, Prosecutors Say.'

In [9]:
soup.p

<p>Advertisement</p>
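
Text pulled from real pages often carries stray whitespace. As a hedged sketch on a made-up snippet (again assuming the built-in `"html.parser"`), the `get_text()` method does the same job as the `text` property and accepts `strip=True` to trim that whitespace:

```python
from bs4 import BeautifulSoup

# a made-up headline with extra spaces around the link
sample_html = "<h3>  <a href='/story'>A Made-Up Headline</a>  </h3>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

raw_text = sample_soup.h3.text                     # keeps the surrounding spaces
clean_text = sample_soup.h3.get_text(strip=True)   # trims whitespace from each piece of text

print(repr(raw_text))
print(repr(clean_text))
```

Note that both versions also reach into the nested `<a>` tag: `text` collects all the text inside an element, not just the text directly under it.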

## getting attributes
In addition to text, we can also get HTML attributes. [HTML attributes](https://www.w3schools.com/html/html_attributes.asp) contain additional information about an HTML tag. A popular attribute is `href`, which stands for hyperlink reference and contains the link's URL. To access an attribute like `href`, we use the syntax `tag['attr']`.

In [10]:
# note that this prints the value of each attribute (like the name of the class), not
# the actual text contained within the larger element. For that, use the `text` property
# on the element by itself.

soup.h3['class']

['css-1ykb5sd', 'e1hr934v2']

In [11]:
link_location = soup.a['href']

In [12]:
link_location

'#site-content'
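
The bracket syntax raises a `KeyError` when a tag does not have the attribute you ask for. A minimal sketch of a gentler alternative, using a made-up link and the built-in `"html.parser"`: tags also support `.get()`, which returns `None` for a missing attribute, and `.attrs`, which holds every attribute as a dictionary.

```python
from bs4 import BeautifulSoup

# a made-up link with two class values and an href
sample_html = '<a class="nav primary" href="/section/nyregion">New York</a>'
sample_soup = BeautifulSoup(sample_html, "html.parser")
link = sample_soup.a

print(link["href"])       # bracket syntax, as in the cells above
print(link["class"])      # class can hold several values, so it comes back as a list
print(link.get("title"))  # None, because this link has no title attribute
print(link.attrs)         # every attribute of the tag, as a dictionary
```

This also explains why `soup.h3['class']` above returned a list: an element can carry more than one class.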

## more granular searching with `find()`
What if you wanted to find an element by the value of its class? You would use the `find()` method. This is useful when a page has many elements with the same tag and you want to be more specific.

In [13]:
soup.find('div', class_='css-14ee9cx')

<div class="css-14ee9cx"><article class="css-1l4spti"><figure aria-label="media" class="css-8izaxg" role="group"><div class="css-79elbk" data-testid="imageContainer-children-Image"><img alt="" class="css-rq4mmj" decoding="async" height="100" sizes="(min-width: 1024px) 205px, 150px" src="https://static01.nyt.com/images/2024/10/02/multimedia/02Gold-02-vmbz/02Gold-02-vmbz-thumbWide.jpg?quality=75&amp;auto=webp&amp;disable=upscale" srcset="https://static01.nyt.com/images/2024/10/02/multimedia/02Gold-02-vmbz/02Gold-02-vmbz-thumbWide.jpg?quality=100&amp;auto=webp 190w,https://static01.nyt.com/images/2024/10/02/multimedia/02Gold-02-vmbz/02Gold-02-vmbz-videoThumb.jpg?quality=100&amp;auto=webp 75w,https://static01.nyt.com/images/2024/10/02/multimedia/02Gold-02-vmbz/02Gold-02-vmbz-videoLarge.jpg?quality=100&amp;auto=webp 768w,https://static01.nyt.com/images/2024/10/02/multimedia/02Gold-02-vmbz/02Gold-02-vmbz-mediumThreeByTwo210.jpg?quality=100&amp;auto=webp 210w,https://static01.nyt.com/images/2

In [14]:
# we can append .text to the end of the find() call
soup.find('div', class_='css-14ee9cx').text

'Eugene Gold, Brooklyn D.A. Who Led the ‘Son of Sam’ Case, Dies at 100He prosecuted high-profile cases in the 1970s and championed Soviet Jews, but, after retiring, he ran afoul of the law himself, charged with a sex offense.By Joseph P. Fried\xa0'
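
Class names like `css-14ee9cx` are generated by the site and can change, so a `find()` call that works today may match nothing later. When nothing matches, `find()` returns `None`, and calling `.text` on `None` raises an error. The sketch below, on made-up HTML with hypothetical class names, checks the result before using it:

```python
from bs4 import BeautifulSoup

# made-up HTML with two elements that share a class
sample_html = """
<div class="story"><h3>First made-up story</h3></div>
<div class="story"><h3>Second made-up story</h3></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_story = sample_soup.find("div", class_="story")      # first match only
missing = sample_soup.find("div", class_="no-such-class")  # nothing matches

for result in (first_story, missing):
    if result is None:
        print("no match")
    else:
        print(result.text)
```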

## challenge
How do you print out all of the elements with a specific tag, like `h3`?

## `find_all()`

Want to print out every element with a specific tag? Then we use `find_all()`.

In [15]:
soup.find_all('h3')

[<h3 class="css-1ykb5sd e1hr934v2"><a class="css-1u3p7j1" data-rref="" href="/2024/10/02/nyregion/nyc-doorman-steals-money-from-teacher.html">Her Pension Checks Vanished. The Doorman Stole Them, Prosecutors Say.</a></h3>,
 <h3 class="css-1ykb5sd e1hr934v2"><a class="css-1u3p7j1" data-rref="" href="/2024/10/02/nyregion/penn-station-gateway.html">Penn Station Takes Up Two Blocks. Railroads Say They Must Have More.</a></h3>,
 <h3 class="css-1ykb5sd e1hr934v2"><a class="css-1u3p7j1" data-rref="" href="/2024/10/02/nyregion/eric-adams-black-voters.html">What Black Voters Are Saying About Eric Adams Since His Indictment</a></h3>,
 <h3 class="css-1ykb5sd e1hr934v2"><a class="css-1u3p7j1" data-rref="" href="/2024/10/02/nyregion/adams-return-federal-court.html">Prosecutors Warn of More Charges and Defendants in Adams Graft Case</a></h3>,
 <h3 class="css-1j88qqx e15t083i0">Eugene Gold, Brooklyn D.A. Who Led the ‘Son of Sam’ Case, Dies at 100</h3>,
 <h3 class="css-1j88qqx e15t083i0">Sarah Snook to

Now, how would we print out just the text from these elements? Check the error closely.

In [16]:
soup.find_all('h3').text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [3]:
headers = soup.find_all('h3')

In [4]:
for item in headers:
    print(item.text)

Emails Suggest Cuomo Undersold His Role in Altering Covid Report
How to Make $6,000 a Month by Moving Citi Bikes Around the Block
As Public Support for Migrants Fades, Private Donors Confront the Crisis
Police Officials Defend Subway Shooting That Gravely Wounded Bystander
‘Eh, Whatever.’ Angelenos Shrug at Recent Quakes, Decades After the Last ‘Big One.’
A Trip to the Many Worlds of Hellboy’s Creator
Homes for Sale in New York and New Jersey
Homes for Sale in Manhattan and Queens
Valarie D’Elia, Travel Reporter on TV and Radio, Dies at 64
Harrison J. Goldin, 88, New York City Comptroller in Fiscal Crisis, Is Dead
M.T.A.’s Financial Needs Grow With Congestion Pricing in Purgatory
‘You’re Basically on a Broadway Stage, With New Friends’
Panic! At the Vegan Food Festival
With Trump Sentencing Delayed, It’s an Ordinary Wednesday


Now, how would I save that data to a list? 

In [18]:
header_text = []
for item in headers:
    header_text.append(item.text)

In [19]:
header_text

['Emails Suggest Cuomo Undersold His Role in Altering Covid Report',
 'How to Make $6,000 a Month by Moving Citi Bikes Around the Block',
 'As Public Support for Migrants Fades, Private Donors Confront the Crisis',
 'Police Officials Defend Subway Shooting That Gravely Wounded Bystander',
 '‘Eh, Whatever.’ Angelenos Shrug at Recent Quakes, Decades After the Last ‘Big One.’',
 'A Trip to the Many Worlds of Hellboy’s Creator',
 'Homes for Sale in New York and New Jersey',
 'Homes for Sale in Manhattan and Queens',
 'Valarie D’Elia, Travel Reporter on TV and Radio, Dies at 64',
 'Harrison J. Goldin, 88, New York City Comptroller in Fiscal Crisis, Is Dead',
 'M.T.A.’s Financial Needs Grow With Congestion Pricing in Purgatory',
 '‘You’re Basically on a Broadway Stage, With New Friends’',
 'Panic! At the Vegan Food Festival',
 'With Trump Sentencing Delayed, It’s an Ordinary Wednesday']
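
The loop-and-append pattern above can also be written as a list comprehension, which builds the list in one line. This is just an alternative sketch on made-up headlines (with the built-in `"html.parser"`), not a replacement for the loop:

```python
from bs4 import BeautifulSoup

# three made-up headlines so the example runs offline
sample_html = """
<h3>First made-up headline</h3>
<h3>Second made-up headline</h3>
<h3>Third made-up headline</h3>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_headers = sample_soup.find_all("h3")

# same result as creating an empty list and appending inside a for loop
sample_text = [item.text for item in sample_headers]

print(sample_text)
```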

## group challenge
Write a loop that pulls out all descriptions below the headlines. Then, do something with the results, like loop through them to get the text or to filter them by some condition (like if they contain a word). Save the output to a list. 

In [22]:
# before we filter the headlines by word, we lowercase each headline so that we don't have to account for
# capitalized vs. lowercase spellings. It's a way of normalizing the text.

police = []
for i in header_text:
    # lowercasing the whole headline
    lower = i.lower()
    if "police" in lower:
        police.append(i)

In [23]:
police

['Police Officials Defend Subway Shooting That Gravely Wounded Bystander']
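
To reuse this filter with other words, the same logic can be wrapped in a function. The sketch below is one possible way to do it; `filter_by_word` and `sample_lines` are hypothetical names, and the headlines are made up.

```python
def filter_by_word(lines, word):
    """Return the lines that contain word, ignoring capitalization."""
    word = word.lower()
    # compare lowercase to lowercase, but keep the original line in the result
    return [line for line in lines if word in line.lower()]


# made-up headlines for illustration
sample_lines = [
    "Police Respond to a Made-Up Event",
    "A Story About Subway Delays",
    "City Council Questions POLICE Budget",
]

print(filter_by_word(sample_lines, "police"))
print(filter_by_word(sample_lines, "subway"))
```

Because this checks for the word anywhere in the line, searching for "art" would also match "party". That is fine for a first pass, but worth keeping in mind when choosing search words.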