# Quiz 01: Beautiful Soup

Primary Author: Hannah Marr

Collaborator: Emma Virnelli

CS 119

9/19/24

Programmatic Extraction

Reminder: Use “View Page Source” of the web page to view HTML of the page. Use “Inspect” to understand the HTML under a particular tag. If possible, please use Google Colab (or another Jupyter environment) as your programming environment as it allows for mixing code and text in the same file. Use a relatively recent version of Beautiful Soup with either "html.parser" or "lxml" parser.

Write a Python program books() to return the names of the canon books.

In [5]:
from bs4 import BeautifulSoup #to use the Beautiful Soup package, we must import the library
import requests #requests is a Python library that allows us to send HTTP requests very easily, and will allow us to handle the URLs used here without manually adding query strings

In [7]:
def books():
  #This will fetch the webpage content using the requests library
  url = 'https://aliceinwonderland.fandom.com/wiki/Alice_in_Wonderland_Wiki'
  response = requests.get(url)

  #We then need to parse the webpage content using Beautiful Soup
  soup = BeautifulSoup(response.text, 'html.parser')

  #Next, from inspecting the page's HTML, we find the <span> tag that contains the text 'Canon books'; the book links sit in the next <ul> element with the class 'wds-list wds-is-linked', so we fetch that <ul> element
  canon_books_section = soup.find('span', string='Canon books')
  if canon_books_section:
    books_list = canon_books_section.find_next('ul', {'class': 'wds-list wds-is-linked'})

    #Further, we see that each individual canon book is located within an <li> tag within this section, so to retrieve the titles of the canon books, we will need to fetch the text within each <li> tag
    canon_books = books_list.find_all('li')

    #Here, we extract the book titles and return them as a list
    book_titles = [book.find('span').get_text() for book in canon_books]
    return book_titles
  else:
    return "Canon books section not found."

#Here we call the function and print the result
print(books())

["Alice's Adventures in Wonderland", 'Through the Looking-Glass, and What Alice Found There', "Alice's Adventures Underground"]
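
The lookup pattern used in books() (find the label `<span>`, then take the next matching `<ul>`) can be tried offline on a small snippet. The sketch below is only an illustration: `SAMPLE_HTML` is hypothetical, simplified markup standing in for the wiki's menu, and `titles_under` is a helper name introduced here, not part of the quiz solution. It also guards against `find_next` returning `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real wiki page is far larger.
SAMPLE_HTML = """
<div>
  <a href="#"><span>Canon books</span></a>
  <ul class="wds-list wds-is-linked">
    <li><a href="/wiki/Book_One"><span>Book One</span></a></li>
    <li><a href="/wiki/Book_Two"><span>Book Two</span></a></li>
  </ul>
  <a href="#"><span>Canon poems</span></a>
  <ul class="wds-list wds-is-linked">
    <li><a href="/wiki/Poem_One"><span>Poem One</span></a></li>
  </ul>
</div>
"""

def titles_under(html, heading):
    """Return the <li> texts of the first 'wds-list' <ul> after the heading span."""
    soup = BeautifulSoup(html, "html.parser")
    # string= matches the span's text exactly, including capitalization.
    label = soup.find("span", string=heading)
    if label is None:
        return []
    # find_next searches forward in document order and may return None.
    menu = label.find_next("ul", class_="wds-list")
    if menu is None:
        return []
    return [item.get_text(strip=True) for item in menu.find_all("li")]

print(titles_under(SAMPLE_HTML, "Canon books"))
print(titles_under(SAMPLE_HTML, "canon books"))  # wrong case, so nothing matches
```

Returning an empty list in both failure cases keeps the return type consistent, so a caller can loop over the result without checking for a message string first.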


---

Write a Python program poems() to return the names of the canon poems and their URLs.

In [11]:
from bs4 import BeautifulSoup #to use the Beautiful Soup package, we must import the library
import requests #requests is a Python library that allows us to send HTTP requests very easily, and will allow us to handle the URLs used here without manually adding query strings

In [13]:
def poems():
  #This will fetch the webpage content using the requests library
  url = 'https://aliceinwonderland.fandom.com/wiki/Alice_in_Wonderland_Wiki'
  response = requests.get(url)

  #We then need to parse the webpage content using Beautiful Soup
  soup = BeautifulSoup(response.text, 'html.parser')

  #Now we will locate the <span> tag with the text 'Canon poems' and fetch the following <ul> element
  canon_poems_section = soup.find('span', string='Canon poems')
  if canon_poems_section:
    poems_list = canon_poems_section.find_next('ul', {'class': 'wds-list wds-is-linked'})

    #Now we will extract all the <li> tags within this section
    canon_poems = poems_list.find_all('li')

    #Here we extract the poem names and URLs and append to the poems_data list
    poems_data = []
    for poem in canon_poems:
      poem_name = poem.find('span').get_text()  #This retrieves the poem title
      poem_url = poem.find('a')['href']  #This retrieves the poem URL
      poems_data.append({'name': poem_name, 'url': poem_url}) #This appends the data we've extracted to the poems_data list
    return poems_data
  else:
    return "Canon poems section not found."

#Now we call the function and print the result
poems_list = poems()
for poem in poems_list:
  print(f"{poem['name']}: {poem['url']}")

Jabberwocky: https://aliceinwonderland.fandom.com/wiki/Jabberwocky
How Doth the Little Crocodile: https://aliceinwonderland.fandom.com/wiki/How_Doth_the_Little_Crocodile
The Walrus and the Carpenter: https://aliceinwonderland.fandom.com/wiki/The_Walrus_and_the_Carpenter_(poem)
You Are Old, Father William: https://aliceinwonderland.fandom.com/wiki/You_Are_Old,_Father_William
Humpty Dumpty's Recitation: https://aliceinwonderland.fandom.com/wiki/Humpty_Dumpty%27s_Recitation
Turtle Soup: https://aliceinwonderland.fandom.com/wiki/Turtle_Soup
Tis the Voice of the Lobster: https://aliceinwonderland.fandom.com/wiki/Tis_the_Voice_of_the_Lobster
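
The URLs printed above are already absolute. If the menu ever yielded relative hrefs instead (for example `/wiki/Jabberwocky`), they would need to be resolved against the page URL before being requested. A minimal sketch using the standard library's `urljoin` follows; `absolute_url` is a helper name introduced here for illustration.

```python
from urllib.parse import urljoin

# The page that poems() scrapes; relative links are resolved against it.
BASE_URL = "https://aliceinwonderland.fandom.com/wiki/Alice_in_Wonderland_Wiki"

def absolute_url(href, base=BASE_URL):
    """Resolve href against base; hrefs that are already absolute pass through."""
    return urljoin(base, href)

print(absolute_url("/wiki/Jabberwocky"))
# → https://aliceinwonderland.fandom.com/wiki/Jabberwocky
print(absolute_url("https://aliceinwonderland.fandom.com/wiki/Turtle_Soup"))
# → https://aliceinwonderland.fandom.com/wiki/Turtle_Soup
```

Compared with prepending the domain by hand, `urljoin` also copes with protocol-relative links such as `//host/path`.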


---

Write a Python program poem_title_text(n) that calls poems() and, using n as the index, returns the poem title and its text.

In [17]:
import requests
from bs4 import BeautifulSoup

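#We can reuse the poems function for this, with two changes: relative URLs are completed, and an empty list is returned if the section is not found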
def poems():
  url = 'https://aliceinwonderland.fandom.com/wiki/Alice_in_Wonderland_Wiki'
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')

  canon_poems_section = soup.find('span', string='Canon poems')
  if canon_poems_section:
    poems_list = canon_poems_section.find_next('ul', {'class': 'wds-list wds-is-linked'})
    canon_poems = poems_list.find_all('li')

    poems_data = []
    for poem in canon_poems:
      poem_name = poem.find('span').get_text()  #This retrieves the poem title
      poem_url = poem.find('a')['href']  #This retrieves the poem URL
      #This builds a full URL when the href is relative (that is, it does not start with 'http')
      if not poem_url.startswith('http'):
        poem_url = 'https://aliceinwonderland.fandom.com' + poem_url
      poems_data.append({'name': poem_name, 'url': poem_url})

    return poems_data
  else:
    return []

#Now we create the new poem_title_text function
def poem_title_text(n):
  all_poems = poems()  #This fetches all poems and their URLs

  #Here we check if the index n is valid
  if n < 0 or n >= len(all_poems):
    return f"No poem at index {n}. Available range is from 0 to {len(all_poems) - 1}."

  #This retrieves the poem's title and URL based on the index n
  poem_title = all_poems[n]['name']
  poem_url = all_poems[n]['url']

  #Now we fetch the poem page content
  poem_page = requests.get(poem_url)
  poem_soup = BeautifulSoup(poem_page.text, 'html.parser')

  #Here, from examining the HTML view of the page, we know we must extract the poem text from the <div class="mw-parser-output">
  poem_text_section = poem_soup.find('div', {'class': 'mw-parser-output'})

  if poem_text_section:
    #Now we must extract individual paragraphs within the mw-parser-output section
    poem_text_parts = poem_text_section.find_all('p')

    #Since the text may be in multiple paragraphs, here we join the text of all the paragraphs
    poem_text = '\n\n'.join([p.get_text() for p in poem_text_parts])
  else:
    poem_text = "Poem text not found."

  #This code returns the poem title and text
  return {'title': poem_title, 'text': poem_text}

#For example, here is how we would retrieve the title and text of the poem at index 2
poem_data = poem_title_text(2)
print(f"Title: {poem_data['title']}\n\nText:\n{poem_data['text']}")

Title: The Walrus and the Carpenter

Text:



Illustration by Sir John Tenniel.

The Walrus and the Carpenter is a poem by Lewis Carroll that appears within his 1871 novel, Through the Looking-Glass, and What Alice Found There. Tweedledee and Tweedledum perform it for Alice in the fourth chapter.


The sun was shining on the sea,
Shining with all his might:
He did his very best to make
The billows smooth and bright--
And this was odd, because it was
The middle of the night.


The moon was shining sulkily,
Because she thought the sun
Had got no business to be there
After the day was done--
"It's very rude of him," she said,
"To come and spoil the fun!"


The sea was wet as wet could be,
The sands were dry as dry.
You could not see a cloud, because
No cloud was in the sky:
No birds were flying overhead--
There were no birds to fly.


The Walrus and the Carpenter
Were walking close at hand;
They wept like anything to see
Such quantities of sand:
"If this were only cleared away,"
They said
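
The output above begins with several blank lines because every `<p>` inside the container is joined, including ones that hold no text. One possible cleanup, sketched here on hypothetical markup (`SAMPLE_POEM_HTML` and `joined_paragraphs` are names introduced for illustration, and the real page structure may differ), is to strip each paragraph and drop the empty ones before joining.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a poem page.
SAMPLE_POEM_HTML = (
    '<div class="mw-parser-output">'
    "<p>\n</p>"
    "<p>An introductory sentence about the poem.\n</p>"
    "<p>First line of verse,<br/>\nSecond line of verse.\n</p>"
    "</div>"
)

def joined_paragraphs(html):
    """Join the non-empty <p> texts of the container with blank lines."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="mw-parser-output")
    if container is None:
        return ""
    # strip() removes the trailing newline each paragraph carries.
    parts = (p.get_text().strip() for p in container.find_all("p"))
    # Skip paragraphs that are empty after stripping.
    return "\n\n".join(part for part in parts if part)

print(joined_paragraphs(SAMPLE_POEM_HTML))
```

This only removes empty paragraphs; captions and introductory sentences are still real text and stay in the result.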

In [19]:
#As another example, here is how we would retrieve the title and text of the poem at index 5
poem_data = poem_title_text(5)
print(f"Title: {poem_data['title']}\n\nText:\n{poem_data['text']}")

Title: Turtle Soup

Text:
"Turtle Soup" is a song sung by the Mock Turtle in Chapter 10 of Alice's Adventures in Wonderland. It is a parody of the poem "Star of the Evening."


Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Soup of the evening, beautiful Soup!


Beau--ootiful Soo--oop!
Beau--ootiful Soo--oop!
Soo--oop of the e--e--evening,
Beautiful, beautiful Soup!


Beautiful Soup! Who cares for fish,
Game or any other dish?
Who would not give all else for two
Pennyworth only of Beautiful Soup?
Pennyworth only of beautiful Soup?


Beau--ootiful Soo--oop!
Beau--ootiful Soo--oop!
Soo--oop of the e--e--evening,
Beautiful, beauti--FUL SOUP!
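
For an out-of-range index, poem_title_text returns a message string, while the calls above index the result with `['title']`, which would then raise a `TypeError`. One alternative, sketched below, is to raise `IndexError` so the caller can handle the failure explicitly. `pick_poem` is a hypothetical helper, and `SAMPLE_POEMS` is a hard-coded stand-in for the list that poems() returns.

```python
# Hypothetical stand-in for the list returned by poems().
SAMPLE_POEMS = [
    {"name": "Jabberwocky",
     "url": "https://aliceinwonderland.fandom.com/wiki/Jabberwocky"},
    {"name": "Turtle Soup",
     "url": "https://aliceinwonderland.fandom.com/wiki/Turtle_Soup"},
]

def pick_poem(all_poems, n):
    """Return the entry at index n, or raise IndexError with a readable message."""
    if not all_poems:
        # Avoids reporting a nonsensical range such as "0 to -1".
        raise IndexError("No poems were found, so no index is valid.")
    if not 0 <= n < len(all_poems):
        raise IndexError(
            f"No poem at index {n}. Available range is from 0 to {len(all_poems) - 1}."
        )
    return all_poems[n]

try:
    print(pick_poem(SAMPLE_POEMS, 1)["name"])
    pick_poem(SAMPLE_POEMS, 7)
except IndexError as err:
    print(err)
```

With this shape, a successful call always returns a dictionary, and the error path cannot be mistaken for poem data.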

