# Tutorial: Web Scraping using BeautifulSoup

## Introduction
Web scraping is a technique that uses the HTML structure of a webpage to extract useful information. It is especially useful for automating web-related tasks on pages that have a fixed structure. In this notebook we will extract quotes and their authors from this website: http://quotes.toscrape.com

### Importing Libraries

We will import the following libraries for scraping the webpage:
1. **requests**: Sends HTTP requests (GET, POST, etc.) to a web server. Here we use it to fetch pages from the quotes.toscrape.com server.
2. **bs4** (BeautifulSoup): Parses HTML and extracts content based on HTML tags and their attributes.


In [3]:
import requests
from bs4 import BeautifulSoup


We define two variables from user input:
1. tag (the topic tag used on the site, such as `love`, whose quotes we want to look up; this is not an HTML tag)
2. index (the 1-based position of the quote that we want to extract)

We define `modindex = index-1` as the modified index because indexing starts from 0 in Python.

In [4]:
tag = input("Enter the tag for which you want to find quotes: ")
index = int(input("Enter which quote to show: "))
modindex = index-1

Enter the tag for which you want to find quotes: love
Enter which quote to show: 1


### Getting the required pages
Web scraping depends more on understanding how a particular webpage is built than on the scraping code itself. We need to know the URL structure, the tags/ids/classes that hold the relevant information, etc.

Let's have a look at the quotes webpage. If you look at the page's source, you will see that every quote tag URL is of the form <b>http://quotes.toscrape.com/tag/[tag_name]/page/[page_number]/</b>. Inside the page source, the links are written as relative paths such as `/tag/love/page/1/`, because the browser supplies the domain "http://quotes.toscrape.com"; when we request a page ourselves, we must pass the full URL. By observation, each page contains a maximum of 10 quotes. Using this information, we can get the HTML content of the page with the method `requests.get()`. You will learn about web requests in the coming week.

For now, note that the `requests.get()` method sends a GET request to the given URL and returns the server's reply as a response object. We can then parse the content of that response with bs4 to create what is generally called a `soup`.
The `soup.prettify()` method prettifies the extracted HTML content in the soup so that it is clearly legible.
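
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch rather than the notebook's own code: `fetch_soup` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)  # timeout is in seconds
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx replies
    return BeautifulSoup(response.content, "html.parser")
```

Calling `fetch_soup(url)` with the URL built in the next cell would give the same kind of soup as the cells below, but a failed request would raise an exception instead of silently producing a soup of an error page.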

In [5]:
# Generating URL of interest

page = str((modindex//10)+1)
url = "http://quotes.toscrape.com/tag/"+tag+"/page/"+page+"/"
print("URL:", url)

# You can open this URL in your browser to check if the link opens to a valid page

URL: http://quotes.toscrape.com/tag/love/page/1/
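
The page arithmetic above can be wrapped in a small helper. The sketch below is only an illustration: `locate_quote` and `build_tag_url` are hypothetical names that do not appear elsewhere in this notebook, and it assumes, as observed earlier, that each page holds at most 10 quotes.

```python
from urllib.parse import quote

BASE_URL = "http://quotes.toscrape.com"
QUOTES_PER_PAGE = 10  # assumption: the maximum observed per page on this site


def locate_quote(index):
    """Map a 1-based quote number to (page_number, offset_within_page)."""
    if index < 1:
        raise ValueError("index must be 1 or greater")
    pages_before, offset = divmod(index - 1, QUOTES_PER_PAGE)
    return pages_before + 1, offset


def build_tag_url(tag, page):
    """Build the tag-page URL; quote() escapes characters unsafe in a URL path."""
    return f"{BASE_URL}/tag/{quote(tag)}/page/{page}/"


page, offset = locate_quote(11)
print(page, offset)                 # 2 0
print(build_tag_url("love", page))  # http://quotes.toscrape.com/tag/love/page/2/
```

Rejecting an index below 1 matters here because a negative list index in Python counts from the end of the list instead of raising an error.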


In [6]:
# Extracting the contents of the webpage to a soup
r = requests.get(url)
c = r.content
soup = BeautifulSoup(c,"html.parser")
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<h3>Viewing tag: <a href="/tag/love/page/1/">love</a></h3>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
<span>by <small class="author" itemprop="author">André Gide</small>
<a href="/author/Andre-Gide">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="life,love" itemprop="keywords"/>
<a class="tag" href=

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <h3>
    Viewing tag:
    <a href="/tag/love/page/1/">
     love
    </a>
   </h3>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “It is better to be hated for what you are than to be loved for what you are not.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        André Gid

**Question**: Print the `soup` variable as is, then print `soup.prettify()`. What difference do you see?

**Answer**

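`soup.prettify()` returns the same HTML re-indented according to how the tags are nested, with each tag and each piece of text on its own line. Printing `soup` directly shows the markup without this added indentation, so the prettified version makes the structure of the page much easier to read.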
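
To see the difference on a small input, here is a sketch that uses a hand-written one-line snippet instead of the real page. The names `demo_html` and `demo_soup` are made up for this illustration.

```python
from bs4 import BeautifulSoup

demo_html = "<div><p>Hello</p></div>"  # hand-written snippet, not from the site
demo_soup = BeautifulSoup(demo_html, "html.parser")

compact = str(demo_soup)       # the markup as parsed, all on one line
pretty = demo_soup.prettify()  # one tag or string per line, indented by depth

print(compact)
print(pretty)
```

The compact form stays on a single line, while the prettified form spreads the same markup over several indented lines.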

Now that we have the HTML content, observe that the text of each quote is in a `span` tag whose `class` attribute is `text`. The `find_all()` method of the soup object is used here to get all such occurrences. It returns a list of every match for the query, here one element per quote. We store the list in the variable `text`.
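
As an offline illustration of how `find_all()` filters by tag name and class, the sketch below runs the same kind of query on a hand-written snippet. The HTML and the names `demo_html`, `demo_soup`, and `spans` are made up for this example.

```python
from bs4 import BeautifulSoup

# Hand-written snippet, loosely modelled on the quote markup shown above.
demo_html = """
<span class="text">First quote.</span>
<span>by <small class="author">Author One</small></span>
<span class="text">Second quote.</span>
<span>by <small class="author">Author Two</small></span>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# class_ has a trailing underscore because `class` is a reserved word in Python.
spans = demo_soup.find_all("span", class_="text")
print(len(spans))           # 2: the "by ..." spans have no class and are skipped
print(spans[0].get_text())  # First quote.

# The same query written as a CSS selector.
selected = demo_soup.select("span.text")
print([s.get_text() for s in selected])
```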


In [8]:
text = soup.find_all('span',class_="text")
print(text)

[<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>, <span class="text" itemprop="text">“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold 

Here we can see that all the quotes on this page for the input tag are stored in the list <b>text</b>.
To access the desired quote (the nth quote):

In [9]:
try:
    # modindex % 10 is the quote's position within its page (10 quotes per page)
    print(text[modindex % 10].text)
except IndexError:
    print('quote not found')

“It is better to be hated for what you are than to be loved for what you are not.”


Similarly, to find the author's name, we find the tag `small` and class `author`.

In [10]:
try:
    au = soup.find_all('small', class_="author")
    print("by", au[modindex % 10].text)
except IndexError:
    print("quote not found")

by André Gide
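
Looking up the quote and the author in two separate lists works because both lists are in page order. An alternative sketch, assuming each quote sits in its own `div` with class `quote` as in the page source printed earlier, is to walk those containers so that each text stays attached to its author. `extract_quotes` is a hypothetical helper name, and the HTML below is a hand-written stand-in for the real page.

```python
from bs4 import BeautifulSoup


def extract_quotes(soup):
    """Return a list of (quote_text, author) pairs, one per div.quote block."""
    pairs = []
    for block in soup.find_all("div", class_="quote"):
        text_tag = block.find("span", class_="text")
        author_tag = block.find("small", class_="author")
        if text_tag is None or author_tag is None:
            continue  # skip blocks that do not match the expected layout
        pairs.append((text_tag.get_text(strip=True),
                      author_tag.get_text(strip=True)))
    return pairs


# Hand-written sample; the second block has no author and is skipped.
sample_html = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <span>by <small class="author">Sample Author</small></span>
</div>
<div class="quote">
  <span class="text">A quote with no author.</span>
</div>
"""
sample_pairs = extract_quotes(BeautifulSoup(sample_html, "html.parser"))
for quote_text, author in sample_pairs:
    print(quote_text, "by", author)
```

Keeping the text and author together per container avoids mismatched pairs if one of the two elements is ever missing from a block.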


### Conclusion

Using BeautifulSoup, we have shown how to scrape a webpage and extract useful information from it. There are many applications of scraping which you can explore in your free time! Good job!