<center>
    <h1 style="color:#0099cc">
        <b>
            Introduction to BeautifulSoup
        </b>
    </h1>
    <p style="color:#0099cc">Presented by <i>Parsa Abbasi</i> at Quera Data Analysis Bootcamp | <i>April 2023</i></p>
</center>

# The `requests` library
The `requests` library lets you send HTTP requests from Python. It is easy to use and supports features such as passing parameters in URLs, sending custom headers, and SSL verification.

We can make a `GET` request to a website using the `get()` method and store the response in a variable. Let's try it out for the [Hacker News](https://news.ycombinator.com/) website.

In [1]:
import requests
page = requests.get("https://news.ycombinator.com/")
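
The URL parameters and custom headers mentioned above can be inspected without sending anything over the network. The following is a minimal sketch that builds a request and prepares it locally; the `p` query parameter and the `User-Agent` value are assumptions chosen for illustration.

```python
import requests

# Build (but do not send) a GET request so the final URL and headers
# can be inspected offline.
# The query parameter and User-Agent value are illustrative assumptions.
request = requests.Request(
    "GET",
    "https://news.ycombinator.com/news",
    params={"p": 2},
    headers={"User-Agent": "bootcamp-demo/0.1"},
)
prepared = request.prepare()

print(prepared.url)                    # https://news.ycombinator.com/news?p=2
print(prepared.headers["User-Agent"])  # bootcamp-demo/0.1
```

The same `params` and `headers` keyword arguments can be passed directly to `requests.get()`.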

## Status Code
HTTP status codes indicate whether a specific HTTP request has been successfully completed. Responses are grouped in five classes:

- Informational responses (`100`–`199`)
- Successful responses (`200`–`299`)
- Redirects (`300`–`399`)
- Client errors (`400`–`499`)
- Server errors (`500`–`599`)

You can check this [link](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) for more information about HTTP status codes.

The `status_code` attribute of the response object contains the status code of the response. If the status code is `200`, then the request has succeeded.

In [2]:
page.status_code

200
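
Since the class of a status code is determined by its first digit, the five groups listed above can be sketched as a small lookup. The helper name `status_class` is hypothetical and not part of `requests`.

```python
def status_class(code: int) -> str:
    """Return the name of the HTTP status class for a status code."""
    classes = {
        1: "Informational response",
        2: "Successful response",
        3: "Redirect",
        4: "Client error",
        5: "Server error",
    }
    # Integer division by 100 keeps only the leading digit (404 -> 4)
    return classes.get(code // 100, "Unknown")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```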

If you want to check the status in a human-readable format, you can use the `http` module from the Python standard library.

In [3]:
from http.client import responses
responses[page.status_code]

'OK'
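
Indexing `responses` with a code it does not know raises a `KeyError`. A slightly more defensive lookup might look like the sketch below; the helper name `describe` and the fallback text are assumptions. The standard library also offers the `HTTPStatus` enum as an alternative way to reach the same phrases.

```python
from http import HTTPStatus
from http.client import responses


def describe(code: int) -> str:
    """Return the reason phrase for a status code, or a fallback if unknown."""
    return responses.get(code, "Unknown status")


print(describe(404))  # Not Found
print(describe(799))  # Unknown status

# The same information is available through the HTTPStatus enum
print(HTTPStatus.NOT_FOUND.value, HTTPStatus.NOT_FOUND.phrase)  # 404 Not Found
```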

## Content
The `content` attribute of the response object holds the body of the response as bytes. Calling `decode()` converts those bytes to a string, using UTF-8 by default.

In [4]:
page.content.decode()

'<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?Vmm4qeoo2LYDXC715ll1">\n        <link rel="shortcut icon" href="favicon.ico">\n          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">\n        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">\n        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.svg" width="18" height="18" style="border:1px white solid; display:block"></a></td>\n                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>\n                            <a href="newest">new</a> | 
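
To make the difference between bytes and strings concrete, here is a small offline sketch. The sample text is made up, and the codec choices are only illustrative.

```python
# Encode a string to bytes, as a server would before sending it
raw = "Café Hacker News".encode("utf-8")
print(type(raw).__name__)   # bytes

# decode() with no arguments assumes UTF-8
text = raw.decode()
print(text)                 # Café Hacker News

# Decoding with a codec that cannot represent the data raises an error...
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)         # True

# ...unless undecodable bytes are replaced with the U+FFFD marker
lenient = raw.decode("ascii", errors="replace")
```

The response object also exposes a `text` attribute, which returns the body already decoded to a string.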

# BeautifulSoup
We can use the `BeautifulSoup` library to parse the HTML content of a webpage.

In [5]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')
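
The same constructor works on any HTML, not only on a downloaded page. The sketch below parses a tiny hand-written document (an assumption standing in for `page.content`) so it can be run offline.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document standing in for a downloaded page
html = b"<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# BeautifulSoup accepts bytes or str; 'html.parser' is Python's built-in parser
demo_soup = BeautifulSoup(html, "html.parser")

print(type(demo_soup).__name__)  # BeautifulSoup
print(demo_soup.p)               # <p>Hello</p>
```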

## Prettified View
The `prettify()` method of the `BeautifulSoup` object returns a string that contains the HTML content of the webpage in a more readable format.

In [6]:
print(soup.prettify())

<html lang="en" op="news">
 <head>
  <meta content="origin" name="referrer"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="news.css?Vmm4qeoo2LYDXC715ll1" rel="stylesheet" type="text/css"/>
  <link href="favicon.ico" rel="shortcut icon"/>
  <link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/>
  <title>
   Hacker News
  </title>
 </head>
 <body>
  <center>
   <table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
    <tr>
     <td bgcolor="#ff6600">
      <table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%">
       <tr>
        <td style="width:18px;padding-right:4px">
         <a href="https://news.ycombinator.com">
          <img height="18" src="y18.svg" style="border:1px white solid; display:block" width="18"/>
         </a>
        </td>
        <td style="line-height:12pt; height:10px;">
         <span class="pagetop">
          <b class="hnname">
    

## Getting the Title
The `title` attribute of the `BeautifulSoup` object returns the `<title>` tag of the webpage.

In [7]:
soup.title

<title>Hacker News</title>

We can extract the text of the title using the `text` attribute of the `title` object.

In [8]:
soup.title.text

'Hacker News'
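
The `text` attribute keeps any surrounding whitespace, and accessing a tag that does not exist yields `None`. A short sketch on made-up HTML shows both cases:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>  Hacker News </title></head><body></body></html>"
soup_with_title = BeautifulSoup(html, "html.parser")

title = soup_with_title.title
print(title.name)                  # title
print(repr(title.text))            # '  Hacker News '
print(title.get_text(strip=True))  # Hacker News

# A document without a <title> tag gives None rather than raising an error
soup_without_title = BeautifulSoup("<p>No title here</p>", "html.parser")
print(soup_without_title.title)    # None
```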

## Finding all instances of a tag
The `find_all()` method of the `BeautifulSoup` object returns a list of all the HTML tags that match the given name.

In [20]:
# find all the links in the page
links = soup.find_all('a')
print('{} links found'.format(len(links)))

229 links found


Note that the `find_all()` method returns a list, so we need to loop through the list or use list indexing to access the elements.

In [10]:
links[:10]

[<a href="https://news.ycombinator.com"><img height="18" src="y18.svg" style="border:1px white solid; display:block" width="18"/></a>,
 <a href="news">Hacker News</a>,
 <a href="newest">new</a>,
 <a href="front">past</a>,
 <a href="newcomments">comments</a>,
 <a href="ask">ask</a>,
 <a href="show">show</a>,
 <a href="jobs">jobs</a>,
 <a href="submit">submit</a>,
 <a href="login?goto=news">login</a>]
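
`find_all()` also accepts filters beyond the tag name. The sketch below, run on a made-up snippet, keeps only the links that actually have an `href` attribute and shows the `limit` argument.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <a href="news">Hacker News</a>
  <a href="newest">new</a>
  <a name="anchor-only">no link</a>
  <a href="https://example.com/">external</a>
</div>
"""
sample = BeautifulSoup(html, "html.parser")

all_links = sample.find_all("a")              # every <a> tag
with_href = sample.find_all("a", href=True)   # only <a> tags that have an href
first_two = sample.find_all("a", limit=2)     # stop after two matches

print(len(all_links), len(with_href), len(first_two))  # 4 3 2
for link in with_href:
    print(link.text, "->", link["href"])
```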

We can get a dictionary of all the attributes of a tag using the `attrs` attribute of the tag object.

In [11]:
links[0].attrs

{'href': 'https://news.ycombinator.com'}

We can extract the value of a specific attribute using the `get()` method of the tag object.

In [12]:
links[0].get('href')

'https://news.ycombinator.com'
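
Attributes can also be read with square brackets, which behaves differently when the attribute is missing. A minimal sketch on a single made-up tag:

```python
from bs4 import BeautifulSoup

link = BeautifulSoup('<a href="login?goto=news">login</a>', "html.parser").a

print(link["href"])           # login?goto=news
print(link.get("title"))      # None, because the attribute is absent
print(link.get("title", ""))  # a default can be supplied instead

# Square-bracket access raises KeyError for a missing attribute
try:
    link["title"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)         # True
```

Using `get()` is the safer choice when scraping pages where an attribute may not always be present.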

## Finding the first appearance of a tag
The `find()` method of the `BeautifulSoup` object returns the first HTML tag that matches the given name.

In [13]:
soup.find('a')

<a href="https://news.ycombinator.com"><img height="18" src="y18.svg" style="border:1px white solid; display:block" width="18"/></a>
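
When nothing matches, `find()` returns `None`, so chaining further calls onto the result can fail. One way to guard against that is sketched below; the helper name `first_href` is hypothetical.

```python
from bs4 import BeautifulSoup


def first_href(html):
    """Return the href of the first <a> tag in html, or None if there is none."""
    parsed = BeautifulSoup(html, "html.parser")
    link = parsed.find("a")
    if link is None:          # no <a> tag anywhere in the document
        return None
    return link.get("href")


print(first_href('<div><a href="news">Hacker News</a></div>'))  # news
print(first_href("<p>No links here</p>"))                       # None
```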

## Find by ID
The `find()` and `find_all()` methods can also be used to find tags by their `id` attribute.

In [14]:
soup.find(id='hnmain')

<table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.svg" style="border:1px white solid; display:block" width="18"/></a></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
<a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href="show">show</a> | <a href="jobs">jobs</a> | <a href="submit">submit</a> </span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop">
<a href="login?goto=news">login</a>
</span></td>
</tr></table></td></tr>
<tr id="pagespace" style="height:10px" title=""></tr><tr><td><table border="0" cellpadding="0" cellspacing="0">
<tr class="athing" id="37171801">
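
The `id` filter can be combined with a tag name, and passing `id=True` matches any tag that has an `id` at all. The snippet below uses a simplified, made-up table with hypothetical ids.

```python
from bs4 import BeautifulSoup

html = """
<table id="hnmain">
  <tr id="pagespace"></tr>
  <tr class="athing" id="1001"><td>First</td></tr>
  <tr class="athing" id="1002"><td>Second</td></tr>
</table>
"""
table_soup = BeautifulSoup(html, "html.parser")

# Combine a tag name with an id filter
row = table_soup.find("tr", id="1002")
print(row.td.text)   # Second

# id=True matches every <tr> that has an id attribute
rows_with_id = table_soup.find_all("tr", id=True)
print([r["id"] for r in rows_with_id])   # ['pagespace', '1001', '1002']
```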


## Find by Class
The `find()` and `find_all()` methods can also be used to find tags by their `class` attribute. Because `class` is a reserved word in Python, the keyword argument is spelled `class_`.

In [15]:
# find all news items
news = soup.find_all(class_='athing')
print('{} items with class athing found!'.format(len(news)))

30 items with class athing found!


In [16]:
news[0]

<tr class="athing" id="37171801">
<td align="right" class="title" valign="top"><span class="rank">1.</span></td> <td class="votelinks" valign="top"><center><a href="vote?id=37171801&amp;how=up&amp;goto=news" id="up_37171801"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><span class="titleline"><a href="https://matklad.github.io/2023/08/17/typescript-is-surprisingly-ok-for-compilers.html" rel="noreferrer">TypeScript Is Surprisingly OK for Compilers</a><span class="sitebit comhead"> (<a href="from?site=matklad.github.io"><span class="sitestr">matklad.github.io</span></a>)</span></span></td></tr>
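
A tag can carry several classes at once, and a class filter matches if any one of them equals the given name. A small sketch on made-up rows (the `submission` and `spacer` class names are assumptions):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr class="athing"><td class="title">One</td></tr>
  <tr class="athing submission"><td class="title">Two</td></tr>
  <tr class="spacer"></tr>
</table>
"""
rows_soup = BeautifulSoup(html, "html.parser")

items = rows_soup.find_all("tr", class_="athing")
print(len(items))           # 2

# The class attribute is returned as a list of class names
print(items[1]["class"])    # ['athing', 'submission']

# Pull the title cell out of each matched row
print([item.find("td", class_="title").text for item in items])  # ['One', 'Two']
```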

## Select by CSS Selector
CSS selectors are patterns used to select the content you want to style. Here are some examples of CSS selectors:

*   <code>p a</code> — finds all <code>a</code> tags inside of a <code>p</code> tag.
*   <code>body p a</code> — finds all <code>a</code> tags inside of a <code>p</code> tag inside of a <code>body</code> tag.
*   <code>html body</code> — finds all <code>body</code> tags inside of an <code>html</code> tag.
*   <code>p.outer-text</code> — finds all <code>p</code> tags with a class of <code>outer-text</code>.
*   <code>p#first</code> — finds all <code>p</code> tags with an id of <code>first</code>.
*   <code>body p.outer-text</code> — finds any <code>p</code> tags with a class of <code>outer-text</code> inside of a <code>body</code> tag.

If you want to learn more about CSS selectors, you can check this [link](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors).

The `select()` method of the `BeautifulSoup` object returns a list of all the HTML tags that match the given CSS selector.
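
Several of the selector patterns listed above can be tried on a small made-up document. The class names, ids, and links below are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <p class="outer-text" id="first">Intro <a href="/a">A</a></p>
  <p class="inner-text">Body <a href="/b">B</a></p>
  <div><a href="/c">C</a></div>
</body></html>
"""
css_soup = BeautifulSoup(html, "html.parser")

# "p a": <a> tags anywhere inside a <p> tag
print([a["href"] for a in css_soup.select("p a")])     # ['/a', '/b']

# "p#first": <p> tags with id "first"
print([p["id"] for p in css_soup.select("p#first")])   # ['first']

# "body p.outer-text": <p class="outer-text"> inside <body>
print(len(css_soup.select("body p.outer-text")))       # 1

# select_one() returns only the first match (or None)
print(css_soup.select_one("p.inner-text > a").text)    # B

# A selector with no matches gives an empty list
print(css_soup.select("p.missing"))                    # []
```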


👨‍💻 There is an open-source Chrome extension named [Selector Gadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) that makes CSS selector generation and discovery on complicated sites a breeze.

In [17]:
# Find all headlines
headlines = soup.select('.titleline>a')
headlines

[<a href="https://matklad.github.io/2023/08/17/typescript-is-surprisingly-ok-for-compilers.html" rel="noreferrer">TypeScript Is Surprisingly OK for Compilers</a>,
 <a href="https://github.com/IBM/fp-go">Fp-go: Functional Programming Library for Golang</a>,
 <a href="https://fsharpforfunandprofit.com/rop/" rel="noreferrer">Railway Oriented Programming</a>,
 <a href="https://micro-editor.github.io/" rel="noreferrer">micro – A Modern Alternative to nano</a>,
 <a href="https://us.starlabs.systems/pages/starlite" rel="noreferrer">StarLite 12.5-inch Linux tablet</a>,
 <a href="https://lwn.net/Articles/939981/" rel="noreferrer">GIL removal and the Faster CPython project</a>,
 <a href="https://www.theguardian.com/music/2023/aug/16/if-stevie-wonder-wants-to-play-it-pay-attention-how-a-bizarre-new-instrument-found-unusual-success" rel="noreferrer">A new instrument found unusual success</a>,
 <a href="https://vermaden.wordpress.com/2023/08/18/freebsd-bhyve-virtualization/" rel="noreferrer">FreeBS

In [18]:
# Get the text of the headlines
headlines_text = [headline.text for headline in headlines]
headlines_text

['TypeScript Is Surprisingly OK for Compilers',
 'Fp-go: Functional Programming Library for Golang',
 'Railway Oriented Programming',
 'micro – A Modern Alternative to nano',
 'StarLite 12.5-inch Linux tablet',
 'GIL removal and the Faster CPython project',
 'A new instrument found unusual success',
 'FreeBSD Bhyve Virtualization',
 'China’s property giant Evergrande files for bankruptcy protection in Manhattan',
 'ProtonMail Complied with 5,957 Data Requests in 2022 – Still Secure and Private?',
 'λ Calculus (2013) [pdf]',
 'New book considers the impact of electronic logging devices on drivers',
 'SUSE to go private',
 'New type of star gives clues to mysterious origin of magnetars',
 'Bali rice experiment cuts greenhouse gas emissions and increases yields',
 'The aging brain: is misplaced DNA to blame?',
 'Metallica Hard-Wires a Different Set List Every Night',
 'RoboAgent: A universal agent with 12 Skills',
 'Ancient fires drove large mammals extinct, study suggests',
 'How to comm

In [19]:
# Extract the url of the headlines
headlines_url = [headline.get('href') for headline in headlines]
headlines_url

['https://matklad.github.io/2023/08/17/typescript-is-surprisingly-ok-for-compilers.html',
 'https://github.com/IBM/fp-go',
 'https://fsharpforfunandprofit.com/rop/',
 'https://micro-editor.github.io/',
 'https://us.starlabs.systems/pages/starlite',
 'https://lwn.net/Articles/939981/',
 'https://www.theguardian.com/music/2023/aug/16/if-stevie-wonder-wants-to-play-it-pay-attention-how-a-bizarre-new-instrument-found-unusual-success',
 'https://vermaden.wordpress.com/2023/08/18/freebsd-bhyve-virtualization/',
 'https://www.cnbc.com/2023/08/18/china-property-developer-evergrande-files-for-bankruptcy-protection-in-us.html',
 'https://restoreprivacy.com/protonmail-data-requests-user-logs/',
 'https://www.cs.rpi.edu/academics/courses/spring11/proglang/handouts/lambda-calculus-chapter.pdf',
 'https://www.truckersnews.com/home/article/15305066/new-book-considers-the-impact-of-electronic-logging-devices-on-drivers',
 'https://www.suse.com/news/EQT-announces-voluntary-public-purchase-offer-and-int
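
The two extraction steps above can be combined into one list of records. The sketch below runs on a made-up fragment that mimics the `titleline` structure; the sample titles and links are assumptions. It also resolves relative links (like the navigation links seen earlier) against the site URL with `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://news.ycombinator.com/"

# Made-up fragment imitating two news rows
html = """
<table>
  <tr class="athing"><td class="title"><span class="titleline">
    <a href="https://example.com/post">External post</a></span></td></tr>
  <tr class="athing"><td class="title"><span class="titleline">
    <a href="item?id=1">A relative link</a></span></td></tr>
</table>
"""
stories_soup = BeautifulSoup(html, "html.parser")

# One dictionary per headline; urljoin leaves absolute URLs unchanged
stories = [
    {
        "title": a.get_text(strip=True),
        "url": urljoin(BASE_URL, a.get("href", "")),
    }
    for a in stories_soup.select(".titleline > a")
]

for story in stories:
    print(story["title"], "->", story["url"])
```

Keeping each title next to its URL avoids having to keep two separate lists aligned.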

# 📑 Sources and References

*   [Tutorial: Web Scraping with Python Using Beautiful Soup by *Vik Paruchuri*](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/)
*   [Hacker News](https://news.ycombinator.com/)
*   [HTTP response status codes, *Mozilla*](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
*   [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
*   [CSS selectors, *Mozilla*](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors)
*   [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)