**PART 1**

In [1]:
# The requests library in Python is used for making web requests
import requests

In [2]:
# URL of the web page that we are trying to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

In [3]:
# Fetch the web page using requests library's get method
response = requests.get(url)

In [4]:
# When the web page is fetched successfully, the response has status code 200,
# which means the GET request completed.
# A failed request returns a different status code (e.g. 400, 404, 500)
response.status_code

200
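
Instead of inspecting the status code by eye, the check can be folded into a small helper. The sketch below is only an illustration: the name `fetch_html` and the 10 second timeout are assumptions, not part of this notebook. It relies on `raise_for_status()`, the requests method that raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes; raise if the request failed.

    The function name and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes, does nothing otherwise
    response.raise_for_status()
    return response.content
```

With this helper, `page_content = fetch_html(url)` would either return the html bytes or stop with an error that names the failing status code.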

In [5]:
# To get the html content, we are only interested in the attribute named 'content' for now
page_content = response.content

In [6]:
# This shows that the html content is fetched in bytes format
type(page_content)

bytes
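
The difference between bytes and text matters once you want to do plain string operations. The sketch below uses a made-up UTF-8 snippet as a stand-in for `response.content` (the variable names are assumptions) and decodes it to `str`. requests also exposes an already decoded `response.text`, and BeautifulSoup accepts either form.

```python
# Stand-in for response.content: bytes holding UTF-8 encoded html
raw_bytes = "<title>Café menu</title>".encode("utf-8")

# Decode to str before doing plain string operations
html_text = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__)  # bytes
print(type(html_text).__name__)  # str

# The lengths differ because 'é' takes two bytes in UTF-8
print(len(raw_bytes), len(html_text))
```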

In [7]:
# page_content

If you look into the html content itself, it may seem like a mess.
Our goal is to pull out the necessary text content from this messy data in page_content. 

We will use **BeautifulSoup**, a Python library for parsing html content and extracting the information we need from html pages.

In [8]:
from bs4 import BeautifulSoup

In [9]:
# Load the html page as BeautifulSoup structure
soup = BeautifulSoup(page_content, 'html.parser')

Now that we can access the html content, our next task is to parse it and extract the important information. If you explore the output below closely, you can see different tags (e.g. < a >, < span >, etc.), each containing a piece of information. The elements defined by these tags are known as DOM (Document Object Model) elements. These DOM elements hold the structure, style and content of an html page. We will use them to get different portions of the html page.

In [92]:
# Display the raw html in a more structured way
# print(soup.prettify())
soup

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python (programming language) - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(){var cookie=docume

This 'soup' variable is a BeautifulSoup object that contains the html content.
Let's start scraping...

Open the web page in a browser. When you hover your mouse over the newly opened tab, you can see some text floating (the page title).
This text should be present in the html content we fetched. Try searching for that text in the raw html.

You can see this text appears within a <title> tag which is also the first <title> tag in the raw html.

Let's get that title element

In [11]:
# This returns a html title element.
# But we are more interested in the text inside
soup.title

<title>Python (programming language) - Wikipedia</title>

In [12]:
# Get the text inside title element
soup.title.text

'Python (programming language) - Wikipedia'

If there are any unnecessary white spaces or new lines in the text, they can be removed using .strip()

In [13]:
soup.title.text.strip()

'Python (programming language) - Wikipedia'

Go back to the raw html and observe the < a > tags. 

In [14]:
# Get the first <a> element 
first_a_tag = soup.a

In [15]:
# You can see that an element can have other attributes included
first_a_tag

<a class="mw-jump-link" href="#bodyContent">Jump to content</a>

You can also use .find() to do the same thing (getting the first element of that type)

In [96]:
soup.find('a')

<a class="mw-jump-link" href="#bodyContent">Jump to content</a>

Try to access the attributes, e.g 'class', 'href', etc.
Attributes of an element can be accessed using element['attribute_name'] format.

In [16]:
print(first_a_tag.text)

Jump to content


In [17]:
# Show the class attribute
first_a_tag['class']

['mw-jump-link']
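
The square-bracket form raises a `KeyError` when the attribute does not exist, so it can help to know the alternatives. A small sketch on an inline copy of the first < a > element (the variable names `snippet` and `link` are assumptions made for this example):

```python
from bs4 import BeautifulSoup

# Small inline snippet so the example runs without a network request
snippet = '<a class="mw-jump-link" href="#bodyContent">Jump to content</a>'
link = BeautifulSoup(snippet, "html.parser").a

print(link["href"])    # square brackets: raises KeyError if the attribute is missing
print(link.get("id"))  # .get(): returns None if the attribute is missing
print(link.attrs)      # all attributes of the element as a dict
```

Note that 'class' comes back as a list because an element can carry several class values.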

In [18]:
# Exercise: Display the 'href' attribute of the first <a> element

# Solution:
# first_a_tag['href']

We can fetch all the elements of a specific type by using .find_all('element_tag_name'). Do not use < > before and after the tag name.

In [19]:
# Get all the <a> elements
a_tags = soup.find_all('a')

In [20]:
# If you see other tags besides <a> in the output, those tags are nested within <a> tags in the main html tree.
a_tags[:5]

[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
 <a class="mw-logo" href="/wiki/Main_Page">
 <img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>
 <span class="mw-logo-container">
 <img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>
 <img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;" width="117"/>
 </span>
 </a>,
 <a accesskey="f" class="mw-ui-button mw-ui-quiet mw-ui-icon mw-ui-icon-element mw-ui-icon-wikimedia-search search-toggle" href="/wiki/Special:Search" title="Search Wikipedia [f]">
 <span>Search</span>
 </a>,
 <a href="/w/index.php?title=Special:CreateAccount&amp;returnto=Python+%28programming+language%29" title="You are encouraged to create an account and log in; however, it 

Now let's go and observe the < p > tags in the raw html. These tags generally contain paragraphs in html.

In [21]:
# First <p> element
first_p_tag = soup.p
print(first_p_tag)

<p class="mw-empty-elt">
</p>


In [95]:
# We can also use .find() to get the first element of a specific tag
first_p_tag = soup.find('p')
first_p_tag

<p class="mw-empty-elt">
</p>

In [22]:
# Get all <p> elements
all_paragraphs = soup.find_all('p')

In [23]:
# Display first 5 paragraphs
all_paragraphs[:5]

[<p class="mw-empty-elt">
 </p>,
 <p class="mw-empty-elt">
 </p>,
 <p><b>Python</b> is a <a href="/wiki/High-level_programming_language" title="High-level programming language">high-level</a>, <a href="/wiki/General-purpose_programming_language" title="General-purpose programming language">general-purpose programming language</a>. Its design philosophy emphasizes <a class="mw-redirect" href="/wiki/Code_readability" title="Code readability">code readability</a> with the use of <a href="/wiki/Off-side_rule" title="Off-side rule">significant indentation</a>.<sup class="reference" id="cite_ref-AutoNT-7_33-0"><a href="#cite_note-AutoNT-7-33">[33]</a></sup>
 </p>,
 <p>Python is <a href="/wiki/Type_system#DYNAMIC" title="Type system">dynamically typed</a> and <a href="/wiki/Garbage_collection_(computer_science)" title="Garbage collection (computer science)">garbage-collected</a>. It supports multiple <a href="/wiki/Programming_paradigm" title="Programming paradigm">programming paradigms</a>, 

In [24]:
# Let's write a function to get all the paragraphs in the html in cleaned format
def get_paragraph_texts():
    all_p_texts = []
    for para in all_paragraphs:
        p_text = para.text.strip()
        if p_text:
            all_p_texts.append(p_text)
    return all_p_texts

In [25]:
all_p_texts = get_paragraph_texts()

In [26]:
len(all_p_texts) # Total number of paragraphs we got

78
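
The function above reads the global `all_paragraphs` list. A variant that takes the soup as a parameter can be reused for any page. This is a minimal sketch run on a tiny inline document; `demo_soup` and the sample html are assumptions made for the example.

```python
from bs4 import BeautifulSoup


def get_paragraph_texts(soup):
    """Return the stripped text of every non-empty <p> element in `soup`."""
    texts = []
    for para in soup.find_all("p"):
        p_text = para.text.strip()
        if p_text:  # skip paragraphs that contain only whitespace
            texts.append(p_text)
    return texts


demo_soup = BeautifulSoup(
    '<p class="mw-empty-elt">\n</p><p><b>Python</b> is a language.</p>',
    "html.parser",
)
print(get_paragraph_texts(demo_soup))
```

The empty `mw-empty-elt` paragraph is dropped, which is the same filtering that removes the empty paragraphs seen at the top of the Wikipedia page.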

Go back to the web page and check whether the paragraphs we got match the text content on the web page.

In [27]:
# Show texts in first 10 paragraphs
all_p_texts[0:10]

['Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.[33]',
 'Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.[34][35]',
 'Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python\xa00.9.0.[36] Python\xa02.0 was released in 2000. Python\xa03.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python\xa02.7.18, released in 2020, was the last release of Python\xa02.[37]',
 'Python consistently ranks as one of the most popular programming languages.[38][39][40][41]',
 'Python was conceived in the late 1980s[42] by Guido v

**Pandas** can be used to save data in tabular format using DataFrame.

In [28]:
import pandas as pd

In [29]:
# Create a dataframe which has a single column named 'text'
text_df = pd.DataFrame({'text': all_p_texts})

In [30]:
# Display first 10 elements in the dataframe
text_df.head(10)

Unnamed: 0,text
0,"Python is a high-level, general-purpose progra..."
1,Python is dynamically typed and garbage-collec...
2,Guido van Rossum began working on Python in th...
3,Python consistently ranks as one of the most p...
4,Python was conceived in the late 1980s[42] by ...
5,"Python 2.0 was released on 16 October 2000, wi..."
6,Python 2.7's end-of-life was initially set for...
7,"In 2022, Python 3.10.4 and 3.9.12 were expedit..."
8,"As of November 2022,[update] Python 3.11.0 is ..."
9,Python is a multi-paradigm programming languag...


In [31]:
# We can save this dataframe as a csv file.
# While scraping data, always try to save a copy of the data as a csv file.
# You can use the csv in the future for further processing of the data you collected
text_df.to_csv('wiki.csv')
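
By default `to_csv` also writes the row index, which comes back as an extra 'Unnamed: 0' column when the file is read again. The sketch below shows the effect of `index=False` on a small made-up dataframe. It writes to in-memory buffers instead of files, and all variable names are assumptions for this example.

```python
import io

import pandas as pd

demo_df = pd.DataFrame({"text": ["first paragraph", "second paragraph"]})

# Write to in-memory buffers instead of files so nothing is left on disk
with_index = io.StringIO()
demo_df.to_csv(with_index)  # default: the row index is written too
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)

# Rewind the buffers and read the csv data back
with_index.seek(0)
without_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```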

In [32]:
# Exercise:
# Get the first 'heading' within the html and display the text and other attributes of that element
# Hint: 'heading' can be represented as <h1>, <h2>, <h3>, etc tags. Try with <h1> first

#Solution:
# first_heading = soup.h1
# print(first_heading)
# print(first_heading.text)
# print(first_heading['class'],first_heading['id'])

In [33]:
# Exercise:
# Find all the web links provided in <a> tags
# Hint: links are generally set as 'href' attribute 

# Solution:
# href_links = []
# for a in soup.find_all('a'):
#   href_links.append(a.get('href'))

# href_links
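
One detail about the solution above: `a.get('href')` returns None for < a > elements that have no 'href' attribute, so the list can contain None values. A possible way to skip those elements is the `href=True` filter of `find_all`, sketched here on an inline snippet (the html and the name `demo_soup` are made up for illustration):

```python
from bs4 import BeautifulSoup

snippet = """
<a href="/wiki/Main_Page">Main page</a>
<a name="top">Anchor without a link</a>
<a href="#bodyContent">Jump to content</a>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# href=True keeps only <a> elements that actually have an href attribute
href_links = [a["href"] for a in demo_soup.find_all("a", href=True)]
print(href_links)
```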

**PART 2**

Now we will scrape a practice web page that looks like a real online bookstore and contains some interesting information about books. Let's visit http://books.toscrape.com/ and explore the page to see what kind of information you can fetch about every book.

Each book has its title, price, star rating and availability status, which we can collect.

In [34]:
# URL of the web page that we are trying to scrape
book_url = "http://books.toscrape.com/"

In [35]:
# Fetch the web page using requests library's get method
book_response = requests.get(book_url)

In [36]:
# When the web page is fetched successfully, the response has status code 200,
# which means the GET request completed.
# A failed request returns a different status code (e.g. 400, 404, 500)
book_response.status_code

200

In [37]:
# Load the html page as BeautifulSoup structure
book_soup = BeautifulSoup(book_response.content, 'html.parser')

In [38]:
book_soup.prettify()



In [39]:
# Get the title of the web page
book_soup.title.text.strip()

'All products | Books to Scrape - Sandbox'

Search for the first book name, "A Light in the Attic", in the html content. If you look closely, you can see that all the information about this book is inside an < article > element whose class attribute is 'product_pod'.

In [40]:
# Get the first article element which represents the first book
book_soup.article

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [41]:
# You can include the class attribute value in the search for more specificity.
# As 'class' is a reserved keyword in Python, class_ is used to pass the parameter.
book_soup.find("article", class_="product_pod")

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [42]:
# Let's write a function to extract all of the article elements with class name "product_pod"
def get_book_articles():
    article_elements = []
    for article in book_soup.find_all("article", class_="product_pod"):
      article_elements.append(article)
    return article_elements

In [43]:
article_elements = get_book_articles()
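
The loop above copies each result into a new list. Since `find_all` already returns a list-like result, the same selection can be written more compactly, or as a CSS selector with `select`. A small sketch on inline html follows; the snippet and the variable names are assumptions for this example.

```python
from bs4 import BeautifulSoup

snippet = """
<article class="product_pod"><h3><a title="Book A">Book A</a></h3></article>
<article class="other"><h3><a title="Not a book">x</a></h3></article>
<article class="product_pod"><h3><a title="Book B">Book B</a></h3></article>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# find_all already returns a list-like ResultSet, so list() is enough
by_find_all = list(demo_soup.find_all("article", class_="product_pod"))

# The same selection written as a CSS selector
by_select = demo_soup.select("article.product_pod")

print(len(by_find_all), len(by_select))
```

Both forms skip the < article > with a different class, so either could stand in for the body of `get_book_articles`.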

In [44]:
# Exercise:
# Display and explore the article element for the 6th book.
# Also extract all the <div> elements in this article

# Solution:
# print(article_elements[5])
# len(article_elements[5].find_all('div'))

If we search the book title, "A Light in the Attic" in the article element, we can find it in the 'title' attribute of < a > tag within < h3 > element.

In [45]:
# Get the book title for first book
article_elements[0].h3.a['title']

'A Light in the Attic'

In [46]:
# Write a function to get titles for all the books
def get_book_titles(articles):
    titles = []
    for article in articles:
      titles.append(article.h3.a['title'])
    return titles

In [47]:
book_titles = get_book_titles(article_elements)

In [48]:
# First 10 book titles
book_titles[:10]

['A Light in the Attic',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History of Humankind',
 'The Requiem Red',
 'The Dirty Little Secrets of Getting Your Dream Job',
 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
 'The Black Maria']

In [49]:
# Get the web url for this book
first_book_url = article_elements[0].h3.a['href']
first_book_url

'catalogue/a-light-in-the-attic_1000/index.html'

In [50]:
# The url you see is a relative path. We can append it to the base url to get the accessible link.
book_url + first_book_url

'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
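
String concatenation works here because the base url ends with "/" and the link is relative to the site root. A more general option is `urljoin` from the standard library, sketched below. The second base url (`page`) is only an illustrative example of a base that is a page rather than the site root.

```python
from urllib.parse import urljoin

base = "http://books.toscrape.com/"
relative = "catalogue/a-light-in-the-attic_1000/index.html"

# Same result as string concatenation when the base ends with "/"
full_url = urljoin(base, relative)
print(full_url)

# Still resolves correctly when the base is a page rather than the site root
page = "http://books.toscrape.com/catalogue/page-2.html"
from_page = urljoin(page, "a-light-in-the-attic_1000/index.html")
print(from_page)
```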

In [51]:
# Get web urls for all the books
def get_book_urls(articles):
    book_urls = []
    for article in articles:
      book_urls.append(book_url + article.h3.a['href'])
    return book_urls

In [52]:
all_book_urls = get_book_urls(article_elements)

In [53]:
# First 10 book urls
all_book_urls[:10]

['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'http://books.toscrape.com/catalogue/the-black-maria_991/index.html']

In [54]:
# You can get the price of the first book 
# From the article element, observe the price is shown within a <p> tag with class attribute "price_color".
article_elements[0].find('p', class_="price_color").text

'£51.77'
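
The price comes back as a string that includes the currency symbol. If you want to sort or average prices later, it has to become a number. One possible approach is sketched below; the helper name `parse_price` is an assumption and not part of this notebook.

```python
import re


def parse_price(price_text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no number found in {price_text!r}")
    return float(match.group())


print(parse_price("£51.77"))
```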

In [55]:
# Exercise:
# Write a function to get prices for all the books in a list and show prices for first 10 books

# Solution:
# def get_book_prices(article_elements):
#     prices = []
#     for article in article_elements:
#       prices.append(article.find('p', class_="price_color").text)
#     return prices

In [56]:
# book_prices = get_book_prices(article_elements) # This line will be uncommented for exercise

# prices of first 10 books
# book_prices[:10]

Explore the article element again. You can see that the star rating information for a book is included in a < p > element. But watch carefully here: the class attribute of this < p > tag has two space-separated values.

The first value is "star-rating", which we can use to get this < p > element for all the books. The second value is the star rating from 1 to 5, written as a word.

In [57]:
# Let's get the < p > tag for first book's rating information 
first_book_rating_p = article_elements[0].find('p', class_="star-rating")

In [58]:
# To find the rating, get all the class values. 
# Rating is the second element in the list
first_book_rating_p['class']

['star-rating', 'Three']

In [59]:
# Rating of first book
first_book_rating = first_book_rating_p['class'][1]
first_book_rating

'Three'

In [60]:
# Write a function to get ratings for all the books
def get_book_ratings(articles):
    ratings = []
    for article in articles:
      ratings.append(article.find('p', class_="star-rating")['class'][1])
    return ratings

In [61]:
book_ratings = get_book_ratings(article_elements)

In [62]:
# Display rating for first 10 books
book_ratings[:10]

['Three', 'One', 'One', 'Four', 'Five', 'One', 'Four', 'Three', 'Four', 'One']
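
The ratings are words, which is awkward for sorting or averaging. A minimal sketch of converting them to numbers follows; the names `RATING_WORDS` and `rating_from_classes` are assumptions for this example. It looks for the rating word anywhere in the class list instead of relying on it being at index 1.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_from_classes(class_values):
    """Return the numeric rating found among an element's class values."""
    for value in class_values:
        if value in RATING_WORDS:
            return RATING_WORDS[value]
    return None  # no rating word present


print(rating_from_classes(["star-rating", "Three"]))
```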

In [63]:
# Exercise:
# Find the availability of the first book. 
# Then, write a function to create a list of availability statuses for all books

# # Solution:
# first_book_availability_p = article_elements[0].find('p',class_="instock availability")
# # or
# first_book_availability_p = article_elements[0].find('p',class_="instock")

In [64]:
# first_book_availability = first_book_availability_p.text.strip()

In [65]:
# first_book_availability

In [66]:
# Solution
# Find availabilities for all books
# def get_book_availabilities(articles):
#     availabilities = []
#     for article in articles:
#       availabilities.append(article.find('p',class_="instock availability").text.strip())
#     return availabilities

In [67]:
# book_availabilities = get_book_availabilities(article_elements) 

In [68]:
# Print availabilities for the first 10 books
# book_availabilities[:10]

In [69]:
# Get the image url for the first book (the 'src' attribute of the <img> tag)
img_url = article_elements[0].img['src']

In [70]:
# Create an accessible url by appending it to the base url
book_url + img_url

'http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'

In [71]:
# Write a function to get cover image urls for all books as a list
def get_img_urls(articles):
    img_urls = []
    for article in articles:
        # The cover image path is the 'src' attribute of the <img> tag
        img_urls.append(book_url + article.img['src'])
    return img_urls

In [72]:
book_img_urls = get_img_urls(article_elements)

In [73]:
# Image url for the first book
book_img_urls[0]

'http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'

In [74]:
import pandas as pd

In [75]:
# Exercise:
## If the previous steps are completed, uncomment the code and run this code block
## You should be able to create a dataframe with multiple columns for all the books 

# books_df = pd.DataFrame({
#             "title": book_titles,
#             "price": book_prices,
#             "star_rating": book_ratings,
#             "availability": book_availabilities,
#             "book_url": all_book_urls,
#             "book_cover": book_img_urls
#           })

In [76]:
# books_df
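
An alternative to building one list per column is to build one dict per book and collect the dicts in a list. The sketch below runs on an inline copy of part of the first < article > element. The helper name `article_to_record` is an assumption, and the field lookups simply mirror the ones used earlier in this part.

```python
from bs4 import BeautifulSoup

snippet = """
<article class="product_pod">
<p class="star-rating Three"></p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<p class="price_color">£51.77</p>
<p class="instock availability">
        In stock
</p>
</article>
"""


def article_to_record(article):
    """Collect the fields of one product_pod <article> into a dict."""
    return {
        "title": article.h3.a["title"],
        "price": article.find("p", class_="price_color").text,
        "star_rating": article.find("p", class_="star-rating")["class"][1],
        "availability": article.find("p", class_="instock").text.strip(),
        "book_url": article.h3.a["href"],
    }


demo_soup = BeautifulSoup(snippet, "html.parser")
records = [
    article_to_record(article)
    for article in demo_soup.find_all("article", class_="product_pod")
]
print(records[0]["title"], records[0]["star_rating"])
```

Passing such a list to `pd.DataFrame(records)` gives one row per book, and every row is built from a single article, so the columns cannot get out of step with each other.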

**PART 3: Take Home Exercise**

In [77]:
# Get url for another book, let's say 8th book (list index: 7)
another_book_url = book_url + article_elements[7].h3.a['href']
another_book_url

'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html'

In [78]:
## If the dataframe is created successfully, you can also access the link from book_url column
# another_book_url = books_df.iloc[7].book_url

In [79]:
# Use the url for this book to make a web request
another_book_response = requests.get(another_book_url)

In [80]:
# Create a BeautifulSoup object for the web response
another_soup = BeautifulSoup(another_book_response.content, 'html.parser')

In [81]:
# See the html content for this book page
another_soup


<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="
    &quot;If you have a heart, if you have a soul, Karen Hicks' The Coming Woman will make you fall in love with Victoria Woodhull.&quot;-Kinky Friedman, author and Governor of the Heart of Texas &quot;What kind of confidence would it take for a woman to buck the old boy's club of politics in 1872? More than 140 years pre-Hillary, there was Victoria Woodhull. This book ta

In [82]:
# Get all the <p> elements
another_soup.find_all('p')

[<p class="price_color">£17.93</p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock (19 available)
     
 </p>,
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <!-- <small><a href="/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/reviews/">
         
                 
                     0 customer reviews
                 
         </a></small>
          --> 
 
 
 <!-- 
     <a id="write_review" href="/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/reviews/add/#addreview" class="btn btn-success btn-sm">
         Write a review
     </a>
 
  --></p>,
 <p>"If you have a heart, if you have a soul, Karen Hicks' The Coming Woman will make you fall in love with Victoria Woodhull."-Kinky Friedman, author and Governor of the Heart of Tex

In [83]:
# Exercise:
# We are interested in getting the book description from these <p> elements
# Get the description 'text' for this book. For example, it starts with "If you have a heart, if you have a soul, ..." for the 8th book.

# Solution 
# another_soup.find_all('p')[3].text

In [84]:
# We can get the star rating from this page too
star_rating_p = another_soup.find_all('p',class_="star-rating")[0]
print(star_rating_p['class'][1])

Three


If you go back to the raw html for this book page, you can see the description is also provided with a < meta > tag.

In [85]:
# Let's see all the <meta> elements
another_soup.find_all('meta')

[<meta content="text/html; charset=utf-8" http-equiv="content-type"/>,
 <meta content="24th Jun 2016 09:29" name="created"/>,
 <meta content="
     &quot;If you have a heart, if you have a soul, Karen Hicks' The Coming Woman will make you fall in love with Victoria Woodhull.&quot;-Kinky Friedman, author and Governor of the Heart of Texas &quot;What kind of confidence would it take for a woman to buck the old boy's club of politics in 1872? More than 140 years pre-Hillary, there was Victoria Woodhull. This book takes you back with a &quot;If you have a heart, if you have a soul, Karen Hicks' The Coming Woman will make you fall in love with Victoria Woodhull.&quot;-Kinky Friedman, author and Governor of the Heart of Texas &quot;What kind of confidence would it take for a woman to buck the old boy's club of politics in 1872? More than 140 years pre-Hillary, there was Victoria Woodhull. This book takes you back with a breathtaking, present-tense bird's eye view into a time when women's lib

In [86]:
# Get the book description from the <meta> element by providing the value of the 'name' attribute of the tag
another_book_description = another_soup.find('meta', attrs={'name': 'description'})
another_book_description['content'].strip()

'"If you have a heart, if you have a soul, Karen Hicks\' The Coming Woman will make you fall in love with Victoria Woodhull."-Kinky Friedman, author and Governor of the Heart of Texas "What kind of confidence would it take for a woman to buck the old boy\'s club of politics in 1872? More than 140 years pre-Hillary, there was Victoria Woodhull. This book takes you back with a "If you have a heart, if you have a soul, Karen Hicks\' The Coming Woman will make you fall in love with Victoria Woodhull."-Kinky Friedman, author and Governor of the Heart of Texas "What kind of confidence would it take for a woman to buck the old boy\'s club of politics in 1872? More than 140 years pre-Hillary, there was Victoria Woodhull. This book takes you back with a breathtaking, present-tense bird\'s eye view into a time when women\'s liberation was primarily confined to one woman\'s very capable, independent mind. I couldn\'t put it down."---Ruth Buzzi, Golden Globe Award winner and Television Hall of Fam

In [87]:
# If you visit the book webpage again, you can see a table of information
# In the raw html, this information is presented in a <table> element
# Let's get this <table> element
another_book_info_table = another_soup.find('table')

In [88]:
another_book_info_table

<table class="table table-striped">
<tr>
<th>UPC</th><td>e72a5dfc7e9267b2</td>
</tr>
<tr>
<th>Product Type</th><td>Books</td>
</tr>
<tr>
<th>Price (excl. tax)</th><td>£17.93</td>
</tr>
<tr>
<th>Price (incl. tax)</th><td>£17.93</td>
</tr>
<tr>
<th>Tax</th><td>£0.00</td>
</tr>
<tr>
<th>Availability</th>
<td>In stock (19 available)</td>
</tr>
<tr>
<th>Number of reviews</th>
<td>0</td>
</tr>
</table>

In [89]:
# You can access the headers of the table by accessing <th> (Table Header) element
another_book_table_header = another_book_info_table.find_all('th')

# Display the table headers
for h in another_book_table_header:
    print(h.text)

UPC
Product Type
Price (excl. tax)
Price (incl. tax)
Tax
Availability
Number of reviews


In [90]:
# Exercise:
# Extract the values for the header names in the table as a list
# Hint: Look into <td> (Table Data) tags

# Solution:
# another_book_table_values = another_book_info_table.find_all('td')

# another_book_table_data = []
# for val in another_book_table_values:
#     another_book_table_data.append(val.text)
    
# another_book_table_data
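
Headers and values are easier to use when they are paired. One way to do that is to walk the table row by row and build a dict, sketched below on an inline copy of a few rows from the table shown earlier (the name `book_info` is an assumption for this example):

```python
from bs4 import BeautifulSoup

snippet = """
<table class="table table-striped">
<tr><th>UPC</th><td>e72a5dfc7e9267b2</td></tr>
<tr><th>Product Type</th><td>Books</td></tr>
<tr><th>Availability</th>
<td>In stock (19 available)</td></tr>
</table>
"""
table = BeautifulSoup(snippet, "html.parser").find("table")

# Walk the rows so each header stays paired with the value in the same <tr>
book_info = {}
for row in table.find_all("tr"):
    book_info[row.th.text.strip()] = row.td.text.strip()

print(book_info)
```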

In [91]:
# Exercise (Optional):
# Get these attributes for all the books by fetching html content for every book_url in the dataframe we created at the end of part 2
# Append these fetched values for the books as different columns in the dataframe
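
A possible outline for this optional exercise is sketched below. Everything in it is hypothetical scaffolding: `collect_details`, `fetch_page`, `parse_page` and the one second delay are names and values chosen for illustration. The fetching and parsing steps are passed in as functions, so the loop can be tried with stand-ins before touching the real site.

```python
import time


def collect_details(urls, fetch_page, parse_page, delay_seconds=1.0):
    """Fetch every url, parse it into a dict, and return the list of dicts.

    fetch_page(url) should return html and parse_page(html) should return a dict.
    Both are supplied by the caller; this function only drives the loop.
    """
    rows = []
    for url in urls:
        html = fetch_page(url)
        row = parse_page(html)
        row["book_url"] = url  # keep the url so rows can be matched to books_df
        rows.append(row)
        time.sleep(delay_seconds)  # short pause between requests
    return rows


# Stand-ins so the sketch runs without a network connection
fake_pages = {"page-a": "<td>In stock</td>", "page-b": "<td>Sold out</td>"}
rows = collect_details(
    urls=list(fake_pages),
    fetch_page=lambda url: fake_pages[url],
    parse_page=lambda html: {"length": len(html)},
    delay_seconds=0,
)
print(rows)
```

In the real exercise, `fetch_page` could wrap `requests.get(url).content` and `parse_page` could build a BeautifulSoup object and read the table and description as shown above.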