## **Get the website using an HTTP library (Requests)**

In [1]:
import requests

html = requests.get("https://keithgalli.github.io/web-scraping/example.html")

print(html.content)

b'<html>\n<head>\n<title>HTML Example</title>\n</head>\n<body>\n\n<div align="middle">\n<h1>HTML Webpage</h1>\n<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>\n</div>\n\n<h2>A Header</h2>\n<p><i>Some italicized text</i></p>\n\n<h2>Another header</h2>\n<p id="paragraph-id"><b>Some bold text</b></p>\n\n</body>\n</html>\n'
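
The `b'...'` prefix in the output above indicates that `html.content` holds raw bytes, which is why the newlines show up as literal `\n`. As a minimal sketch, the snippet below decodes a shortened copy of those bytes (the `raw` sample is an abbreviation made up for illustration, not the full response). With Requests, `html.text` should give the same kind of decoded string directly.

```python
# Shortened stand-in for html.content (an assumption for illustration).
raw = b'<html>\n<head>\n<title>HTML Example</title>\n</head>\n</html>\n'

# Decode the bytes into a str; utf-8 matches the charset the server reports.
decoded = raw.decode("utf-8")

print(type(raw).__name__, "->", type(decoded).__name__)
print(decoded)
```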


In [2]:
# Response Object: https://www.w3schools.com/python/ref_requests_response.asp

print('Response url:', html.url)

# HTTP response status codes: https://developer.mozilla.org/th/docs/Web/HTTP/Status
print('Response status:', html.status_code)

print('Response encoding:', html.encoding)

Response url: https://keithgalli.github.io/web-scraping/example.html
Response status: 200
Response encoding: utf-8
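
Reading `status_code` by eye works in a notebook, but in a script it may be more convenient to let Requests raise an exception on failure. A minimal sketch, where the helper name `fetch_html` and the 10-second timeout are hypothetical choices rather than anything from the original notebook:

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the decoded body of `url`, raising on HTTP error statuses."""
    # A timeout keeps the call from hanging indefinitely on a dead server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://keithgalli.github.io/web-scraping/example.html")` needs network access and would return the page as a `str` rather than bytes.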


In [3]:
print('Response headers:', html.headers)

Response headers: {'Connection': 'keep-alive', 'Content-Length': '259', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'permissions-policy': 'interest-cohort=()', 'Last-Modified': 'Fri, 03 Jul 2020 00:21:07 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"5efe79f3-198"', 'expires': 'Sun, 11 Jun 2023 04:56:17 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': '1962:544A:192F56D:25C0097:64855199', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 11 Jun 2023 04:46:18 GMT', 'Via': '1.1 varnish', 'Age': '0', 'X-Served-By': 'cache-chi-kigq8000129-CHI', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1686458778.932623,VS0,VE110', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '3123e912ff2bebfb5f11237ac8091bbeff29e174'}


### **Beginning to scrape with BeautifulSoup**

In [4]:
# Install beautifulsoup4
!pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
from bs4 import BeautifulSoup

In [6]:
html = requests.get("https://keithgalli.github.io/web-scraping/example.html")
bs = BeautifulSoup(html.content, "html.parser")

In [7]:
print(bs.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



### **find() and find_all() with BeautifulSoup**

In [8]:
result_find = bs.find('h2')
result_findall = bs.find_all('h2')

print('find: ', result_find)
print('find_all: ', result_findall)

find:  <h2>A Header</h2>
find_all:  [<h2>A Header</h2>, <h2>Another header</h2>]


In [9]:
result_header_tags = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print('Header: ', result_header_tags)

result_p_tags = bs.find_all('p')
print('p: ', result_p_tags)

Header:  [<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]
p:  [<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


In [10]:
for tag in result_p_tags:
    print(tag.get_text())

Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
Some italicized text
Some bold text
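
`get_text()` drops the markup, so the link target in the first paragraph is lost. One way to recover it is to read the `href` attribute of each `<a>` tag. The sketch below parses a fragment copied from the example page instead of fetching it again; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Fragment copied from the example page shown earlier.
snippet = (
    '<p>Link to more interesting example: '
    '<a href="https://keithgalli.github.io/web-scraping/webpage.html">'
    'keithgalli.github.io/web-scraping/webpage.html</a></p>'
)
soup = BeautifulSoup(snippet, "html.parser")

# Tag.get returns None instead of raising KeyError when the attribute is missing.
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```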


In [11]:
result_p_by_id = bs.find('p', attrs={'id': 'paragraph-id'})
print(result_p_by_id.get_text())

Some bold text
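
The `attrs` dictionary is one of several ways to match on an `id`. As a hedged sketch, the same element can also be found with a keyword argument or a CSS selector; the fragment below is copied from the example page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Fragment copied from the example page shown earlier.
snippet = (
    '<p><i>Some italicized text</i></p>\n'
    '<p id="paragraph-id"><b>Some bold text</b></p>'
)
soup = BeautifulSoup(snippet, "html.parser")

by_attrs = soup.find("p", attrs={"id": "paragraph-id"})  # dictionary form
by_keyword = soup.find("p", id="paragraph-id")           # keyword-argument form
by_css = soup.select_one("p#paragraph-id")               # CSS selector form

print(by_attrs.get_text(), by_keyword.get_text(), by_css.get_text(), sep=" | ")
```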


### **Keyword Arguments**

In [12]:
html = requests.get("http://www.pythonscraping.com/pages/warandpeace.html")
bs = BeautifulSoup(html.content, "html.parser")

In [13]:
# Match <span> tags whose class is either "green" or "red"
result_by_class = bs.find_all('span', {'class': {'green', 'red'}})
print([tag for tag in result_by_class])

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>, <span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="red">If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.</span>, <span class="red">Heavens! w
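
The cells in this section pass the class filter as a dictionary. BeautifulSoup also accepts it as a keyword argument, spelled `class_` because `class` is a reserved word in Python. A minimal sketch on a tiny made-up sample (the `snippet` text is hypothetical, not taken from the War and Peace page):

```python
from bs4 import BeautifulSoup

# Tiny hypothetical sample in the style of the page scraped above.
snippet = (
    '<span class="red">Some quoted speech</span>'
    '<span class="green">Anna Pavlovna</span>'
    '<span class="green">the prince</span>'
    '<span class="blue">unrelated</span>'
)
soup = BeautifulSoup(snippet, "html.parser")

# A single class name matches only that class.
green = soup.find_all("span", class_="green")
# A list matches tags carrying any of the listed classes.
green_or_red = soup.find_all("span", class_=["green", "red"])

print([tag.get_text() for tag in green])
print(len(green_or_red))
```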

In [14]:
# Match only <span class="green"> tags
result_by_class = bs.find_all('span', {'class': {'green'}})
print([tag for tag in result_by_class])
print('------------------')
print([tag.get_text() for tag in result_by_class])

[<span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Princ