# Web Scraping with Beautiful Soup

Beautiful Soup is a popular Python library for extracting data from HTML and XML documents. It simplifies parsing the complex structure of these markup languages by creating a tree-like representation. This allows you to easily navigate, search, and manipulate the content using Pythonic idioms.

It is commonly used for web scraping, that is, automatically collecting information from web pages.

For more details, see the official Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
r = requests.get('http://www.pythonscraping.com/pages/warandpeace.html')
html = r.text
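
The request above assumes the server answers promptly and successfully. A slightly more defensive variant, sketched below, adds a timeout and checks the HTTP status before using the body. The helper name `fetch_html` and the 10-second default timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper; the default timeout is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

It could then be called as `html = fetch_html('http://www.pythonscraping.com/pages/warandpeace.html')`.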

In [3]:
html[:1000] + '...'

'<html>\n<head>\n<style>\n.green{\n\tcolor:#55ff55;\n}\n.red{\n\tcolor:#ff5555;\n}\n#text{\n\twidth:50%;\n}\n</style>\n</head>\n<body>\n<h1>War and Peace</h1>\n<h2>Chapter 1</h2>\n<div id="text">\n"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don\'t tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy \'faithful slave,\' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news.</span>"\n<p/>\nIt was in July, 1805, and the speaker was the well-known <span class="green">Anna\nPavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya\nFedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man\nof high rank

In [4]:
soup = BeautifulSoup(html, 'html.parser')
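
Once a document is parsed, the resulting tree can be navigated directly. The following is a minimal sketch on a small inline sample (the `sample` string and the `soup_sample` name are assumptions chosen to mirror the page above, so the example runs without a network request).

```python
from bs4 import BeautifulSoup

# A small inline sample (an assumption) that mirrors the structure of the page above.
sample = """
<html><body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">It was in July, 1805.</div>
</body></html>
"""

soup_sample = BeautifulSoup(sample, "html.parser")

# Attribute access returns the first matching descendant tag.
first_heading = soup_sample.h1
print(first_heading.name)        # the tag name
print(first_heading.get_text())  # the text inside the tag

# find() returns the first match (or None); here we look up by id.
text_div = soup_sample.find("div", id="text")
print(text_div["id"])            # attributes are read like dictionary keys
print(text_div.parent.name)      # .parent walks up to the enclosing tag
```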

We use `find_all` to search for all elements whose tag name appears in the list (here, the six heading levels). It returns a list-like `ResultSet` of `Tag` objects, one per heading. We then iterate over the result and print each element.

In [5]:
titles = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

for title in titles:
    print(title)

<h1>War and Peace</h1>
<h2>Chapter 1</h2>
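
Printing a heading shows its full markup. If only the heading level and its text are needed, each element's `name` and `get_text()` can be used instead. The sketch below assumes a short inline `sample` string in place of the downloaded page; `outline` is an illustrative variable name.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (an assumption for this sketch).
sample = "<h1>War and Peace</h1><h2>Chapter 1</h2><p>Body text</p>"
soup_sample = BeautifulSoup(sample, "html.parser")

headings = soup_sample.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

# Pair each heading's tag name with its text, stripped of surrounding whitespace.
outline = [(h.name, h.get_text(strip=True)) for h in headings]

for level, text in outline:
    print(level, text)
```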


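We use `find_all` to search for all `span` elements whose `class` attribute is "green". The keyword argument is spelled `class_` because `class` is a reserved word in Python; the commented-out line shows the equivalent dictionary form. The result is again a list-like `ResultSet` of `Tag` objects, one per green span. Because `get_text()` preserves the line breaks inside each span, the five names printed below occupy seven lines.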

In [6]:
# names = soup.find_all('span', {'class': 'green'})
names = soup.find_all("span", class_="green")

for name in names[:5]:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
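
Some names in the output above are split across two lines because the source HTML contains line breaks inside the spans. One way to get one name per line is to collapse runs of whitespace after extracting the text. The sketch below uses an inline `sample` string (an assumption standing in for the page) and also shows the CSS-selector form `select("span.green")`, which matches the same elements here.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page; note the newline inside the first span.
sample = (
    '<div><span class="green">Anna\nPavlovna Scherer</span> and '
    '<span class="green">Prince Vasili Kuragin</span> and '
    '<span class="red">Well, Prince</span></div>'
)
soup_sample = BeautifulSoup(sample, "html.parser")

spans = soup_sample.find_all("span", class_="green")

# str.split() with no arguments splits on any whitespace run, including
# newlines, so joining with a single space yields one clean line per name.
clean_names = [" ".join(span.get_text().split()) for span in spans]

for name in clean_names:
    print(name)

# select() takes a CSS selector; "span.green" targets the same green spans.
selected = soup_sample.select("span.green")
```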
