<center>
    <h1 style="color:#0099cc">
        <b>
            Introduction to BeautifulSoup
        </b>
    </h1>
    <p style="color:#0099cc">Presented by <i>Parsa Abbasi</i> at Quera Data Analysis Bootcamp | <i>January 2023</i></p>
</center>

# The requests library

The <code>requests</code> library sends a <code>GET</code> request to a web server and downloads the HTML content of the requested web page for us.

In [1]:
import requests
page = requests.get("https://stallman.org/")

## Status codes

HTTP response status codes indicate whether a specific HTTP request has been successfully completed. Responses are grouped in five classes:

*   Informational responses (<code>100</code> – <code>199</code>)
*   Successful responses    (<code>200</code> – <code>299</code>)
*   Redirection messages    (<code>300</code> – <code>399</code>)
*   Client error responses  (<code>400</code> – <code>499</code>)
*   Server error responses  (<code>500</code> – <code>599</code>)

([More details](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status))

In [2]:
page.status_code

200
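
The class of a response is determined by the first digit of its status code, so a small helper can map any code to one of the five classes listed above. This is only a sketch: <code>status_class</code> is a hypothetical helper name, not part of <code>requests</code>.

```python
def status_class(code: int) -> str:
    """Return the name of the response class for an HTTP status code."""
    # status_class is a hypothetical helper written for this example.
    classes = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    # Integer division by 100 keeps only the first digit.
    return classes[code // 100]


print(status_class(200))  # successful
print(status_class(404))  # client error
```

In the notebook, the same helper could be called as <code>status_class(page.status_code)</code>.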

## Page content

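We can print out the HTML content of the page using the <code>content</code> attribute, which holds the raw response body as bytes (the <code>text</code> attribute holds the same body decoded to a string):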

In [None]:
page.content

# BeautifulSoup

We can use the <code>BeautifulSoup</code> library to parse the HTML document.

You can install Beautiful Soup 4 with <code>pip install beautifulsoup4</code>. ([Documentation](https://www.crummy.com/software/BeautifulSoup/))

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

## Prettified view

In [None]:
print(soup.prettify())

## Finding all instances of a tag

🔍 Suppose we want to extract all the links in this webpage:

In [5]:
links = soup.find_all('a')
print('Number of links:', len(links))

Number of links: 350


Note that <code>find_all</code> returns a <code>ResultSet</code>, which behaves like a list, so we have to loop through it or use list indexing to reach individual tags.

In [6]:
links[:10]

[<a href="/grav-mass.html">Grav-Mass</a>,
 <a href="./grav-mass.png"><img alt="A Grav-Mass tree" src="./grav-mass-icon.png"/></a>,
 <a href="https://stallmansupport.org">Support me against a campaign of hatred</a>,
 <a href="#politics">Political Articles</a>,
 <a class="nobr" href="archives/polnotes.html">Political Notes</a>,
 <a href="talks.html">Talks</a>,
 <a href="airlines.html">Airlines</a>,
 <a href="/antiglossary.html">Anti-Glossary</a>,
 <a href="/archive.html">Archive</a>,
 <a href="/banfacerecognition.html">Ban face recognition</a>]
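
Both ways of reaching individual tags can be tried offline on a small document. The sketch below assumes a made-up HTML snippet; the names <code>sample</code>, <code>sample_links</code>, <code>first_link</code>, and <code>last_two</code> are chosen for this example only.

```python
from bs4 import BeautifulSoup

# A small stand-in document, made up for this example.
html = """
<p>
  <a href="/archive.html">Archive</a>
  <a href="talks.html">Talks</a>
  <a href="airlines.html">Airlines</a>
</p>
"""
sample = BeautifulSoup(html, "html.parser")
sample_links = sample.find_all("a")

# Option 1: loop over every match.
for link in sample_links:
    print(link.get_text())

# Option 2: pick items by index or slice, as with any list.
first_link = sample_links[0]
last_two = sample_links[-2:]
print(first_link)
print(len(last_two))  # 2
```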

We can use the <code>get_text()</code> method to extract the text of a tag.

In [7]:
links[0].get_text()

'Grav-Mass'

We can get a dictionary of a tag's attributes through its <code>.attrs</code> attribute.

In [10]:
links[0].attrs

{'href': '/grav-mass.html'}

We can get the value of a single attribute using the <code>get</code> method.

In [8]:
links[0].get('href')

'/grav-mass.html'
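
Many of the <code>href</code> values on this page are relative, so they cannot be requested as they are. One possible way to turn them into absolute URLs is <code>urljoin</code> from the standard library. This is a sketch that assumes the page URL used earlier is the base; <code>base_url</code>, <code>hrefs</code>, and <code>absolute_urls</code> are names chosen for this example.

```python
from urllib.parse import urljoin

# Assumption: the page was fetched from this URL, so it serves as the base.
base_url = "https://stallman.org/"

# A few href values copied from the links printed earlier.
hrefs = [
    "/grav-mass.html",
    "archives/polnotes.html",
    "#politics",
    "https://stallmansupport.org",
]

# urljoin resolves relative paths against the base and leaves
# absolute URLs unchanged.
absolute_urls = [urljoin(base_url, href) for href in hrefs]
for url in absolute_urls:
    print(url)
# https://stallman.org/grav-mass.html
# https://stallman.org/archives/polnotes.html
# https://stallman.org/#politics
# https://stallmansupport.org
```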

## Finding the first appearance of a tag

If we only want to find the first instance of a tag, we can use the <code>find</code> method.

In [11]:
soup.find('a')

<a href="/grav-mass.html">Grav-Mass</a>
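
One difference from <code>find_all</code> is worth seeing in practice: when nothing matches, <code>find</code> returns <code>None</code> rather than an empty list. The sketch below uses a made-up one-line document; all variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

# A tiny made-up document with two links and no table.
sample = BeautifulSoup(
    '<p><a href="/a.html">A</a><a href="/b.html">B</a></p>', "html.parser"
)

first = sample.find("a")            # first matching tag only
everything = sample.find_all("a")   # every matching tag
missing = sample.find("table")      # no <table> in this document

print(first)            # <a href="/a.html">A</a>
print(len(everything))  # 2
print(missing)          # None

# Guard against None before calling tag methods on the result.
if missing is not None:
    print(missing.get_text())
```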

## Searching by class and id

We can use the <code>id</code> argument of the <code>find</code> method to search for items by id.

In [12]:
soup.find(id='comic-container')

<div id="comic-container">
<div id="comic">
<div id="comic-expand">
<a href="comics.html"><img alt="" src="images/expand_r.png"/></a>
</div> <!-- END comic-expand div -->
<a href="/images/so-many-candidates.jpg">
<img alt="So Many Candidates" src="/images/so-many-candidates-small.jpg"/>
</a>
</div> <!-- END comic div -->
</div>

We can use the <code>class_</code> argument of the <code>find_all</code> method to search for items by class.

In [13]:
soup.find_all(class_='c2')

[<div class="c2">
 
 What's bad about:
 <a href="/airbnb.html">Airbnb</a> |
 <a href="/amazon.html">Amazon</a> |
 <a href="/amtrak.html">Amtrak</a> |
 <a href="/ancestry.html">Ancestry</a> |
 <a href="/apple.html">Apple</a> |
 <a href="/cloudflare.html">Cloudflare</a> |
 <a href="/discord.html">Discord</a> |
 <a href="/ebooks.pdf">Ebooks</a> |
 <a href="/eventbrite.html">Eventbrite</a> |
 <a href="/evernote.html">Evernote</a> |
 <a href="/facebook.html">Facebook</a> |
 <a href="/frito-lay.html">Frito-Lay</a> |
 <a href="/frontier.html">Frontier</a> |
 <a href="/google.html">Google</a> |
 <a href="/gofundme.html">Gofundme</a> |
 <a href="/food-delivery.html">Grubhub</a> |
 <a href="/intel.html">Intel</a> |
 <a href="/linkedin.html">LinkedIn</a> |
 <a href="/lyft.html">Lyft</a> |
 <!-- meetup.com has the same injustices as eventbrite.com
      and they share one page -->
 <a href="/eventbrite.html">Meetup</a> |
 <a href="/microsoft.html">Microsoft</a> |
 <a href="/netflix.html">Netflix</
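
The argument is spelled <code>class_</code> because <code>class</code> is a reserved word in Python. The sketch below tries both kinds of search on a made-up snippet, where the ids, class names, and variable names are assumptions for illustration. It also shows that a class search matches tags that carry several classes.

```python
from bs4 import BeautifulSoup

# A made-up snippet: one div has two classes and an id.
html = """
<div id="menu" class="box wide">
  <a class="nobr" href="/one.html">One</a>
  <a href="/two.html">Two</a>
</div>
<div class="box">Footer</div>
"""
sample = BeautifulSoup(html, "html.parser")

# An id should be unique in a page, so find() is enough.
menu = sample.find(id="menu")

# Matches every tag whose class list contains "box".
boxes = sample.find_all(class_="box")

# A tag name and a class can be combined in one call.
nobr_links = sample.find_all("a", class_="nobr")

print(menu.get("class"))          # ['box', 'wide']
print(len(boxes))                 # 2
print(nobr_links[0].get_text())   # One
```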

## Searching by CSS selector

We can also search for items using CSS selectors. CSS uses these selectors to specify which HTML tags a style rule applies to. Here are some examples:

*   <code>p a</code> — finds all <code>a</code> tags inside of a <code>p</code> tag.
*   <code>body p a</code> — finds all <code>a</code> tags inside of a <code>p</code> tag inside of a <code>body</code> tag.
*   <code>html body</code> — finds all <code>body</code> tags inside of an <code>html</code> tag.
*   <code>p.outer-text</code> — finds all <code>p</code> tags with a class of <code>outer-text</code>.
*   <code>p#first</code> — finds all <code>p</code> tags with an id of <code>first</code>.
*   <code>body p.outer-text</code> — finds any <code>p</code> tags with a class of <code>outer-text</code> inside of a <code>body</code> tag.

([Learn more about CSS selectors](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors))
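
The selectors listed above can be tried with the <code>select</code> and <code>select_one</code> methods on a small document. This is a sketch on a made-up snippet that reuses the class <code>outer-text</code> and the id <code>first</code> from the list; the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

# A made-up document: two paragraphs, one link inside a paragraph,
# and one link outside any paragraph.
html = """
<html><body>
  <p id="first" class="outer-text">Intro with <a href="/x.html">a link</a>.</p>
  <p class="outer-text">Second paragraph.</p>
  <div><a href="/y.html">A link outside any paragraph</a></div>
</body></html>
"""
sample = BeautifulSoup(html, "html.parser")

links_in_p = sample.select("p a")               # a tags inside a p tag
outer = sample.select("body p.outer-text")      # p tags with class outer-text
first_p = sample.select_one("p#first")          # first match only, or None

print(len(links_in_p))      # 1
print(len(outer))           # 2
print(first_p.get("id"))    # first
```

<code>select</code> always returns a list of matches, while <code>select_one</code> returns only the first match.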

👨‍💻 There is an open-source Chrome extension named [Selector Gadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) that makes generating and discovering CSS selectors on complicated sites a breeze.

🔍 Suppose we want to find the services that [Richard Stallman](https://en.wikipedia.org/wiki/Richard_Stallman) has written negative reviews about.

In [24]:
soup.select('.c2 a')

[<a href="/airbnb.html">Airbnb</a>,
 <a href="/amazon.html">Amazon</a>,
 <a href="/amtrak.html">Amtrak</a>,
 <a href="/ancestry.html">Ancestry</a>,
 <a href="/apple.html">Apple</a>,
 <a href="/cloudflare.html">Cloudflare</a>,
 <a href="/discord.html">Discord</a>,
 <a href="/ebooks.pdf">Ebooks</a>,
 <a href="/eventbrite.html">Eventbrite</a>,
 <a href="/evernote.html">Evernote</a>,
 <a href="/facebook.html">Facebook</a>,
 <a href="/frito-lay.html">Frito-Lay</a>,
 <a href="/frontier.html">Frontier</a>,
 <a href="/google.html">Google</a>,
 <a href="/gofundme.html">Gofundme</a>,
 <a href="/food-delivery.html">Grubhub</a>,
 <a href="/intel.html">Intel</a>,
 <a href="/linkedin.html">LinkedIn</a>,
 <a href="/lyft.html">Lyft</a>,
 <a href="/eventbrite.html">Meetup</a>,
 <a href="/microsoft.html">Microsoft</a>,
 <a href="/netflix.html">Netflix</a>,
 <a href="/patreon.html">Patreon</a>,
 <a href="/pay-toilets.html">Pay Toilets</a>,
 <a href="/skype.html">Skype</a>,
 <a href="/slack.html">Slack</a

In [25]:
[item.get_text() for item in soup.select('.c2 a')]

['Airbnb',
 'Amazon',
 'Amtrak',
 'Ancestry',
 'Apple',
 'Cloudflare',
 'Discord',
 'Ebooks',
 'Eventbrite',
 'Evernote',
 'Facebook',
 'Frito-Lay',
 'Frontier',
 'Google',
 'Gofundme',
 'Grubhub',
 'Intel',
 'LinkedIn',
 'Lyft',
 'Meetup',
 'Microsoft',
 'Netflix',
 'Patreon',
 'Pay Toilets',
 'Skype',
 'Slack',
 'Spotify',
 'Tesla',
 'Ticketmaster',
 'Twitter',
 'Uber',
 "Wendy's",
 'WhatsApp',
 'Zoom']

# 📑 Sources and References

*   [Tutorial: Web Scraping with Python Using Beautiful Soup by *Vik Paruchuri*](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/)
*   [Richard Stallman's Personal Site](https://stallman.org/)
*   [HTTP response status codes, *Mozilla*](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
*   [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
*   [CSS selectors, *Mozilla*](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors)
*   [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)