<center>
    <h1 style="color:#0099cc">
        <b>
            Introduction to BeautifulSoup
        </b>
    </h1>
    <p style="color:#0099cc">Presented by <i>Parsa Abbasi</i> at Quera Data Analysis Bootcamp | <i>April 2023</i></p>
</center>

# The `requests` library
The `requests` library is a Python library for sending HTTP requests. It is easy to use and supports features such as passing parameters in URLs, sending custom headers, and SSL verification.

We can make a `GET` request to a website using the `get()` method and store the response in a variable. Let's try it out for the [Hacker News](https://news.ycombinator.com/) website.

In [43]:
import requests
page = requests.get("https://news.ycombinator.com/")
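The URL parameters and custom headers mentioned earlier can be illustrated with a minimal sketch. It builds a request without sending it, so the final URL and headers can be inspected offline. The `id` value is borrowed from a story that appears later in this notebook, and the `User-Agent` string is a made-up placeholder.

```python
import requests

# Build a GET request object without sending it (no network access needed).
request = requests.Request(
    "GET",
    "https://news.ycombinator.com/item",
    params={"id": "35426482"},                    # appended to the URL as a query string
    headers={"User-Agent": "bootcamp-demo/0.1"},  # hypothetical placeholder value
)
prepared = request.prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])
```

To actually send such a request, the same `params` and `headers` arguments can be passed directly to `requests.get()`.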

## Status Code
HTTP status codes indicate whether a specific HTTP request has been successfully completed. Responses are grouped in five classes:

- Informational responses (`100`–`199`)
- Successful responses (`200`–`299`)
- Redirects (`300`–`399`)
- Client errors (`400`–`499`)
- Server errors (`500`–`599`)

You can check this [link](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) for more information about HTTP status codes.
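The five classes listed above follow from the first digit of the code. A small sketch of that mapping, using a hypothetical helper named `status_class` (not part of any library):

```python
def status_class(code: int) -> str:
    """Return the response class for an HTTP status code."""
    classes = {
        1: "informational",
        2: "successful",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 yields the leading digit of a three-digit code.
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```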

The `status_code` attribute of the response object contains the status code of the response. If the status code is `200`, then the request has succeeded.

In [44]:
page.status_code

200
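Instead of comparing against `200` by hand, the response object also offers the `ok` property and the `raise_for_status()` method. The sketch below constructs a `Response` manually with a `404` status, purely so it runs without network access. In real code the object would come from `requests.get()`, and the name `fake_page` is only illustrative.

```python
import requests

# Build a Response by hand purely for illustration;
# normally requests.get() returns this object.
fake_page = requests.Response()
fake_page.status_code = 404

print(fake_page.ok)  # False for 4xx and 5xx status codes

try:
    # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes.
    fake_page.raise_for_status()
except requests.exceptions.HTTPError as error:
    print("Request failed:", error)
```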

If you want the status in a human-readable form, you can use the `responses` dictionary from the standard library's `http.client` module.

In [45]:
from http.client import responses
responses[page.status_code]

'OK'
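The standard library's `http.HTTPStatus` enum is another way to get the same phrases. The sketch below wraps it in a hypothetical helper, `describe`, that falls back to a fixed message for codes the enum does not define.

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return the standard reason phrase for a status code."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return "Unknown status code"


print(describe(200))
print(describe(404))
print(describe(999))
```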

## Content
The `content` attribute of the response object contains the body of the response as bytes. You can use the `decode()` method, which assumes UTF-8 unless told otherwise, to convert the bytes to a string.

In [None]:
page.content.decode()
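Decoding only gives the right text when the codec matches the one the server used. The sketch below uses a made-up byte string standing in for `page.content` and shows what happens with a matching and a mismatching codec.

```python
# Bytes as they might arrive in page.content (a made-up snippet, encoded as UTF-8).
raw = "<title>Café</title>".encode("utf-8")

text = raw.decode("utf-8")        # matching codec: the text is recovered
mojibake = raw.decode("latin-1")  # wrong codec: the accented letter is garbled

print(text)
print(mojibake)
```

The response object also has a `text` attribute, which decodes the body using the encoding that `requests` determines for the response.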

# BeautifulSoup
We can use the `BeautifulSoup` library to parse the HTML content of a webpage.

In [47]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

## Prettified View
The `prettify()` method of the `BeautifulSoup` object returns a string that contains the HTML content of the webpage in a more readable format.

In [None]:
print(soup.prettify())
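`BeautifulSoup` accepts any HTML string, not only a downloaded page, which is handy for experimenting. A minimal sketch with a made-up one-line document (the name `demo_soup` is used to avoid overwriting `soup`):

```python
from bs4 import BeautifulSoup

# A made-up HTML document written on a single line.
html = "<html><head><title>Demo</title></head><body><p>Hello <b>world</b></p></body></html>"
demo_soup = BeautifulSoup(html, "html.parser")

# prettify() spreads the tags over separate, indented lines.
pretty = demo_soup.prettify()
print(pretty)
```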

## Getting the Title
The `title` attribute of the `BeautifulSoup` object returns the `<title>` tag of the webpage.

In [49]:
soup.title

<title>Hacker News</title>

We can extract the text of the title using the `text` attribute of the `title` object.

In [50]:
soup.title.text

'Hacker News'
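Two details are worth knowing when reading a title: `text` keeps surrounding whitespace, and `title` is `None` when the page has no `<title>` tag. The sketch below uses two made-up documents to show both cases.

```python
from bs4 import BeautifulSoup

with_title = BeautifulSoup("<head><title>  Hacker News </title></head>", "html.parser")
without_title = BeautifulSoup("<p>No title here</p>", "html.parser")

print(repr(with_title.title.text))            # whitespace is kept as-is
print(with_title.title.get_text(strip=True))  # surrounding whitespace removed
print(without_title.title)                    # None: there is no <title> tag

# Guard against a missing tag before asking for its text.
title_tag = without_title.title
title_text = title_tag.get_text(strip=True) if title_tag is not None else ""
print(repr(title_text))
```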

## Finding all instances of a tag
The `find_all()` method of the `BeautifulSoup` object returns a list of all the HTML tags that match the given name.

In [51]:
# find all the links in the page
links = soup.find_all('a')
print('{} links found'.format(len(links)))

224 links found


Note that the `find_all()` method returns a list, so we need to loop through the list or use list indexing to access the elements.

In [52]:
links[:10]

[<a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a>,
 <a href="news">Hacker News</a>,
 <a href="newest">new</a>,
 <a href="front">past</a>,
 <a href="newcomments">comments</a>,
 <a href="ask">ask</a>,
 <a href="show">show</a>,
 <a href="jobs">jobs</a>,
 <a href="submit">submit</a>,
 <a href="login?goto=news">login</a>]
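Looping, indexing, and the optional `limit` argument of `find_all()` can be tried on a small snippet. The HTML below is a shortened, hand-written copy of a few of the links shown above.

```python
from bs4 import BeautifulSoup

nav_html = (
    '<a href="news">Hacker News</a>'
    '<a href="newest">new</a>'
    '<a href="front">past</a>'
    '<a href="newcomments">comments</a>'
)
nav_soup = BeautifulSoup(nav_html, "html.parser")

nav_links = nav_soup.find_all("a")            # every <a> tag
first_two = nav_soup.find_all("a", limit=2)   # stop after two matches

# Loop through the results.
for link in nav_links:
    print(link.text, "->", link.get("href"))

# Or index into them like a normal list.
print(nav_links[-1].text)
```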

We can get a dictionary of all the attributes of a tag using the `attrs` attribute of the tag object.

In [53]:
links[0].attrs

{'href': 'https://news.ycombinator.com'}

We can extract the value of a specific attribute using the `get()` method of the tag object, which works like the `get()` method of a dictionary.

In [54]:
links[0].get('href')

'https://news.ycombinator.com'
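Attributes can also be read with square brackets, and the two styles differ when the attribute is missing. A short sketch on a made-up link tag:

```python
from bs4 import BeautifulSoup

link_soup = BeautifulSoup('<a class="nav top" href="newest">new</a>', "html.parser")
link = link_soup.a

print(link["href"])              # square brackets raise KeyError if the attribute is missing
print(link.get("title"))         # get() returns None instead
print(link.get("title", "n/a"))  # or a default value of your choice
print(link["class"])             # class can hold several values, so it comes back as a list
print(link.has_attr("href"))     # check for presence without reading the value
```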

## Finding the first appearance of a tag
The `find()` method of the `BeautifulSoup` object returns the first HTML tag that matches the given name.

In [55]:
soup.find('a')

<a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a>
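When nothing matches, `find()` returns `None` rather than raising an error, so it is worth checking the result before using it. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

small_soup = BeautifulSoup(
    '<p>Intro</p><a href="news">Hacker News</a><a href="newest">new</a>',
    "html.parser",
)

first_link = small_soup.find("a")   # first match only
missing = small_soup.find("table")  # no <table> in this snippet

print(first_link)
print(missing)

# find() gives the same tag as the first element of find_all().
print(first_link == small_soup.find_all("a")[0])
```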

## Find by ID
The `find()` and `find_all()` methods can also be used to find tags by their `id` attribute.

In [None]:
soup.find(id='hnmain')
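The `id` keyword can be combined with a tag name to narrow the search. The sketch below runs on a made-up, heavily shortened table that reuses the `hnmain` id and a story id from this notebook.

```python
from bs4 import BeautifulSoup

id_html = '<table id="hnmain"><tr id="35426482" class="athing"><td>story</td></tr></table>'
id_soup = BeautifulSoup(id_html, "html.parser")

main_table = id_soup.find(id="hnmain")         # any tag with this id
story_row = id_soup.find("tr", id="35426482")  # a <tr> tag with this id

print(main_table.name)
print(story_row.get("class"))
```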

## Find by Class
The `find()` and `find_all()` methods can also be used to find tags by their `class` attribute. Because `class` is a reserved word in Python, the keyword argument is spelled `class_`.

In [64]:
# find all news items
news = soup.find_all(class_='athing')
print('{} items with class athing found!'.format(len(news)))

30 items with class athing found!


In [65]:
news[0]

<tr class="athing" id="35426482">
<td align="right" class="title" valign="top"><span class="rank">1.</span></td> <td class="votelinks" valign="top"><center><a href="vote?id=35426482&amp;how=up&amp;goto=news" id="up_35426482"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><span class="titleline"><a href="item?id=35426482">Launch HN: OutSail (YC W23) – Wingsails to reduce cargo ship fuel consumption</a></span></td></tr>
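Each matched row is itself a tag, so `find()` can be called on it to pull out individual fields. The sketch below works on a trimmed copy of the row above (the vote cell is left out and a wrapping `<table>` is added). The names `rank`, `story_id`, and `headline` are only illustrative.

```python
from bs4 import BeautifulSoup

row_html = """
<table><tr class="athing" id="35426482">
<td align="right" class="title" valign="top"><span class="rank">1.</span></td>
<td class="title"><span class="titleline"><a href="item?id=35426482">Launch HN: OutSail (YC W23) – Wingsails to reduce cargo ship fuel consumption</a></span></td>
</tr></table>
"""
row_soup = BeautifulSoup(row_html, "html.parser")

row = row_soup.find(class_="athing")

story_id = row.get("id")                               # id attribute of the row
rank = row.find(class_="rank").text                    # text of the rank <span>
headline = row.find(class_="titleline").find("a")      # link inside the title <span>

print(story_id, rank, headline.text)
print(headline.get("href"))
```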

## Select by CSS Selector
CSS selectors are patterns that identify which elements a style rule applies to. The same patterns can be used to locate elements when scraping. Here are some examples of CSS selectors:

*   <code>p a</code> — finds all <code>a</code> tags inside of a <code>p</code> tag.
*   <code>body p a</code> — finds all <code>a</code> tags inside of a <code>p</code> tag inside of a <code>body</code> tag.
*   <code>html body</code> — finds all <code>body</code> tags inside of an <code>html</code> tag.
*   <code>p.outer-text</code> — finds all <code>p</code> tags with a class of <code>outer-text</code>.
*   <code>p#first</code> — finds all <code>p</code> tags with an id of <code>first</code>.
*   <code>body p.outer-text</code> — finds any <code>p</code> tags with a class of <code>outer-text</code> inside of a <code>body</code> tag.

If you want to learn more about CSS selectors, you can check this [link](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors).

The `select()` method of the `BeautifulSoup` object returns a list of all the HTML tags that match the given CSS selector.
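Several of the selectors listed above can be tried on a small, made-up document. The sketch also uses `select_one()`, which returns only the first match (or `None`), and the child combinator `>`.

```python
from bs4 import BeautifulSoup

css_html = """
<html><body>
  <p class="outer-text" id="first">First <a href="/a">A</a></p>
  <p class="inner-text">Second <a href="/b">B</a></p>
  <div><a href="/c">C</a></div>
</body></html>
"""
css_soup = BeautifulSoup(css_html, "html.parser")

# "p a": <a> tags inside a <p> tag (the link inside <div> is excluded).
print([a["href"] for a in css_soup.select("p a")])

# "p.outer-text": <p> tags with the class outer-text.
print(len(css_soup.select("p.outer-text")))

# "p#first": <p> tags with the id first.
print(css_soup.select("p#first")[0].get("class"))

# "div > a": <a> tags that are direct children of a <div>.
print(css_soup.select_one("div > a").text)

# select_one() returns None when nothing matches.
print(css_soup.select_one("span"))
```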


👨‍💻 There is an open-source Chrome extension named [Selector Gadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) that makes it much easier to discover and generate CSS selectors on complicated sites.

In [77]:
# Find all headlines
headlines = soup.select('.titleline>a')
headlines

[<a href="item?id=35426482">Launch HN: OutSail (YC W23) – Wingsails to reduce cargo ship fuel consumption</a>,
 <a href="https://github.com/system76/firmware-open">System76 Open Firmware</a>,
 <a href="item?id=35424807">Ask HN: Who is hiring? (April 2023)</a>,
 <a href="https://mullvad.net/en/browser">The Mullvad Browser</a>,
 <a href="https://github.com/hocus-dev/hocus">Show HN: Hocus – self-hosted alternative to GitHub Codespaces using Firecracker</a>,
 <a href="https://www.easypost.com/careers" rel="nofollow">EasyPost (YC S13) Is Hiring</a>,
 <a href="https://fas.org/blogs/security/2023/04/volkel-nuclear-weapon-accident/">Was there a U.S. nuclear weapons accident at a Dutch air base?</a>,
 <a href="https://www.inverse.com/input/features/tropetrainer-thomas-buchler-torah-software">His software sang the words of God. Then it went silent</a>,
 <a href="https://www.economist.com/business/2023/03/30/alibaba-breaks-itself-up-in-six">Alibaba breaks itself up in six</a>,
 <a href="https://e

In [75]:
# Get the text of the headlines
headlines_text = [headline.text for headline in headlines]
headlines_text

['Launch HN: OutSail (YC W23) – Wingsails to reduce cargo ship fuel consumption',
 'System76 Open Firmware',
 'Ask HN: Who is hiring? (April 2023)',
 'The Mullvad Browser',
 'Show HN: Hocus – self-hosted alternative to GitHub Codespaces using Firecracker',
 'EasyPost (YC S13) Is Hiring',
 'Was there a U.S. nuclear weapons accident at a Dutch air base?',
 'His software sang the words of God. Then it went silent',
 'Alibaba breaks itself up in six',
 'How to do hard things',
 'An Essay on Diseases Incidental to Literary and Sedentary Persons (1768)',
 'Ask HN: Who wants to be hired? (April 2023)',
 'Stop-Motion Movies Are Animated at Aardman [video]',
 'NAND Flash Data Recovery Cookbook (2013) [pdf]',
 'Experts warn yearly checkups carry risks and do not reduce mortality',
 "We're Knot Friends",
 'Near-lossless image formats using ultra-fast LZ codecs',
 'Was MPLS Traffic Engineering Worthwhile?',
 'Destreza',
 "Deer don't regrow antlers the way lower animals regrow limbs",
 'State-of-th

In [76]:
# Extract the url of the headlines
headlines_url = [headline.get('href') for headline in headlines]
headlines_url

['item?id=35426482',
 'https://github.com/system76/firmware-open',
 'item?id=35424807',
 'https://mullvad.net/en/browser',
 'https://github.com/hocus-dev/hocus',
 'https://www.easypost.com/careers',
 'https://fas.org/blogs/security/2023/04/volkel-nuclear-weapon-accident/',
 'https://www.inverse.com/input/features/tropetrainer-thomas-buchler-torah-software',
 'https://www.economist.com/business/2023/03/30/alibaba-breaks-itself-up-in-six',
 'https://every.to/no-small-plans/how-to-do-hard-things',
 'https://publicdomainreview.org/collection/blights-of-the-bookish',
 'item?id=35424805',
 'https://www.youtube.com/watch?v=jZvQzkFcKEM',
 'http://web.archive.org/web/20180516153837/http://www.adreca.net/NAND-Flash-Data-Recovery-Cookbook.pdf',
 'https://english.elpais.com/science-tech/2023-04-01/do-you-need-a-yearly-checkup-experts-warn-that-they-carry-risks-and-do-not-reduce-mortality.html',
 'https://jeremykun.com/2023/04/01/were-knot-friends/',
 'http://richg42.blogspot.com/2023/04/a-dead-sim
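Some of the extracted URLs, such as `item?id=35426482`, are relative to the Hacker News site. One way to turn them into absolute URLs is `urljoin()` from the standard library, sketched here on two values copied from the output above. The names `BASE_URL` and `sample_urls` are only illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://news.ycombinator.com/"

# Two values copied from the output above: one relative, one already absolute.
sample_urls = [
    "item?id=35426482",
    "https://github.com/system76/firmware-open",
]

# urljoin() resolves relative URLs against the base and leaves absolute ones unchanged.
absolute_urls = [urljoin(BASE_URL, url) for url in sample_urls]

for url in absolute_urls:
    print(url)
```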

# 📑 Sources and References

*   [Tutorial: Web Scraping with Python Using Beautiful Soup by *Vik Paruchuri*](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/)
*   [Hacker News](https://news.ycombinator.com/)
*   [HTTP response status codes, *Mozilla*](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
*   [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
*   [CSS selectors, *Mozilla*](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors)
*   [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)