# Scraping 'Quotes to Scrape' Website Using Python  

![](https://quotefancy.com/media/wallpaper/3840x2160/1699820-Albert-Einstein-Quote-The-world-as-we-have-created-it-is-a-process.jpg)

**[Quotes to Scrape](http://quotes.toscrape.com/)** is a website that lists popular quotes by different authors on a range of subjects. Who doesn't like a good quote? Quotes resonate because they connect with our feelings and interests, whether that is fear, sadness, motivation, science, or art.

We will use the http://quotes.toscrape.com/ page to retrieve this information with web scraping. In this project we are going to scrape the quotes along with their authors, tags (subjects), and links to each author's biography.

## What is Web Scraping?
Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It is useful in many situations, e.g. building a job search scraper, a top movies scraper, or a mobile phone price scraper. In this project, we extract quotes. Web scraping usually involves extracting information from HTML documents and converting it into structured data, such as a CSV file.
 
## Steps to Follow
* Install the libraries needed for the project, i.e. requests, BeautifulSoup4, and pandas
* Download the web page using the requests library
* Inspect the HTML source code of the web page
* Parse parts of the website using Beautiful Soup
* Convert the parsed parts into a CSV file
* Have a look at the CSV file using the pandas library

## Install and Import important libraries 

In [1]:
# Installing important libraries, i.e. requests for downloading the web page, BeautifulSoup for parsing the HTML tags, pandas for working with the CSV file
!pip install jovian --upgrade --quiet
!pip install requests --upgrade --quiet
!pip install BeautifulSoup4 --upgrade --quiet
!pip install pandas --upgrade --quiet

# Importing important libraries
import jovian
import requests
import pandas as pd
from bs4 import BeautifulSoup

## Download the web page [Quotes to Scrape](http://quotes.toscrape.com/) using requests library

Requests is an elegant and simple HTTP (HyperText Transfer Protocol) library for Python that lets you send HTTP requests easily. The `requests.get` function downloads the web page and returns a response object containing the page contents along with a `status_code` that indicates whether the request was successful.

In [2]:
url='http://quotes.toscrape.com/'
response=requests.get(url)

`response.status_code` tells you whether the request was successful. A `status_code` in the range 200 to 299 means the request succeeded; codes in the 4xx and 5xx ranges indicate client and server errors.

In [3]:
response.status_code

200
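
As a minimal sketch of the check described above, the helper below treats any 2xx code as a success. The name `is_success` is hypothetical and not part of `requests`. With a real response object, `response.raise_for_status()` offers a related built-in check: it raises an exception for 4xx and 5xx codes.

```python
def is_success(status_code):
    """Return True for any 2xx HTTP status code (hypothetical helper)."""
    return 200 <= status_code < 300


# Try the helper on a few common status codes
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```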

## Inspect HTML of the web page
We can view the source code of the web page by right-clicking anywhere on the page and selecting the 'Inspect' option. This opens the "Developer Tools" pane, where we can see the source code as a tree. We can expand and collapse various nodes and find the source code for a specific portion of the page.

Here's what our web page looks like:
![](https://i.imgur.com/rxSb0J0.png)

## Parsing parts of the website using Beautiful Soup
To extract information from the HTML source code programmatically, we will use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. The `BeautifulSoup` constructor returns an object with several properties and methods for extracting information from HTML documents.

In [4]:
doc=BeautifulSoup(response.text,'html.parser')
type(doc)

bs4.BeautifulSoup

![](https://i.imgur.com/RTmd1vu.png)

Upon inspecting the box containing the information for each quote, we find a `div` tag for every quote, with its class set to `quote`.

Let's find all the `div` tags having class `quote`.

In [5]:
div_tags=doc.find_all('div',class_='quote')
len(div_tags)

10

Every page has 10 quotes, hence the length of `div_tags` is 10.
Each element of `div_tags` contains information such as the **quote, author name, link to the author's biography, and tags**.
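
To make the structure of one quote block concrete, here is a small sketch that parses a hand-written HTML snippet instead of the live page. The snippet is an assumption that imitates the layout seen in the Developer Tools (a `div` with class `quote` holding the text, author, author link, and tags); the real markup may differ in detail.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating one quote block (an assumption, not copied from the site)
sample_html = """
<div class="quote">
  <span class="text">“A day without sunshine is like, you know, night.”</span>
  <span>by <small class="author">Steve Martin</small>
    <a href="/author/Steve-Martin">(about)</a>
  </span>
  <div class="tags">
    <meta class="keywords" content="humor,obvious,simile">
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
quote_div = soup.find('div', class_='quote')

# Each piece of information lives in its own tag inside the quote block
quote_text = quote_div.find('span', class_='text').text
author = quote_div.find('small', class_='author').text
author_href = quote_div.find('a')['href']
tags = quote_div.find('div', class_='tags').meta['content']

print(quote_text, author, author_href, tags, sep='\n')
```

Working on a small snippet like this is a convenient way to try out `find` calls before running them against the downloaded page.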

Let's extract quotes using the helper function `get_quotes`.
![](https://i.imgur.com/7pQ8iHH.png)

In [7]:
def get_quotes(div_tags):
    """Get the list of quotes for one page"""
    quotes=[]
    for tag in div_tags:
        quote=tag.find('span',class_='text').text
        quotes.append(quote)
    return quotes    

In [8]:
get_quotes(div_tags)

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']


Let's extract the author names. Each name is inside a `small` tag with class `author`,
which is in turn inside a `span` tag that has no `class` attribute (hence `class_=None`).
![](https://i.imgur.com/rbGBJOm.png)

In [9]:
def get_author_name(div_tags):
    """Get the author names for the quotes for one page"""
    authors=[]
    for tag in div_tags:
        span_tag=tag.find('span',class_=None)
        author=span_tag.find('small',class_='author').text
        authors.append(author)
    return authors  

In [10]:
get_author_name(div_tags)

['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

Let's extract the author links. Each link is stored in the `href` attribute of an `a` tag,
which is inside a `span` tag that has no `class` attribute (hence `class_=None`).
![](https://i.imgur.com/iREXyaK.png)


In [13]:
def get_author_urls(div_tags):
    """ Get the list of urls for one page"""
    author_links=[]
    for tag in div_tags:
        span_tag=tag.find('span',class_=None)
        author_link='http://quotes.toscrape.com'+span_tag.find('a')['href']
        author_links.append(author_link)
    return author_links

In [14]:
get_author_urls(div_tags)

['http://quotes.toscrape.com/author/Albert-Einstein',
 'http://quotes.toscrape.com/author/J-K-Rowling',
 'http://quotes.toscrape.com/author/Albert-Einstein',
 'http://quotes.toscrape.com/author/Jane-Austen',
 'http://quotes.toscrape.com/author/Marilyn-Monroe',
 'http://quotes.toscrape.com/author/Albert-Einstein',
 'http://quotes.toscrape.com/author/Andre-Gide',
 'http://quotes.toscrape.com/author/Thomas-A-Edison',
 'http://quotes.toscrape.com/author/Eleanor-Roosevelt',
 'http://quotes.toscrape.com/author/Steve-Martin']
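
The `href` values on the page are relative paths, which is why the base URL is prepended above. As an alternative sketch, the standard library function `urllib.parse.urljoin` can do the joining. The names `BASE_URL` and `absolute_url` are hypothetical and chosen for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'http://quotes.toscrape.com/'


def absolute_url(href):
    """Resolve an href from the page against the site's base URL."""
    return urljoin(BASE_URL, href)


print(absolute_url('/author/Albert-Einstein'))
```

One reason to consider `urljoin` is that it avoids doubled or missing slashes and leaves an already absolute URL unchanged.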

Let's extract the tags of each quote. They are stored in the `content` attribute of a `meta` tag,
which is inside a `div` tag with class `tags`.
![](https://i.imgur.com/caMcpHc.png)


In [11]:
def get_quote_tag(div_tags):
    """Get the quote tags for different quotes for one page"""
    name_tags=[]
    for tag in div_tags:
        name_tag=tag.find('div',class_='tags').meta['content']
        name_tags.append(name_tag)
    return name_tags

In [12]:
get_quote_tag(div_tags)

['change,deep-thoughts,thinking,world',
 'abilities,choices',
 'inspirational,life,live,miracle,miracles',
 'aliteracy,books,classic,humor',
 'be-yourself,inspirational',
 'adulthood,success,value',
 'life,love',
 'edison,failure,inspirational,paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor,obvious,simile']
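
Each entry above is a single comma-separated string. If individual tags are needed later, the string can be split into a list. This is a small sketch; `split_tags` is a hypothetical helper, not part of the project code.

```python
def split_tags(tag_string):
    """Turn 'a,b,c' into ['a', 'b', 'c'], skipping empty entries."""
    return [tag.strip() for tag in tag_string.split(',') if tag.strip()]


print(split_tags('change,deep-thoughts,thinking,world'))
```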

We will make the list of dictionaries by combining all the lists obtained by parsing the website using `Beautiful Soup`.

In [10]:
def list_of_dict(quotes_list,author_names,urls,subject_names):
    """Convert the lists into one list of dictionaries having quote, author names, link, tags"""
    return [{'Quotes': quotes_list[i],
             'Author': author_names[i],
             'Tags': subject_names[i],
             'Link': urls[i]} for i in range(len(quotes_list))] # using the list comprehension
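# Build the four lists from the parsed page, then combine them
quotes_dict=list_of_dict(get_quotes(div_tags),
                         get_author_name(div_tags),
                         get_author_urls(div_tags),
                         get_quote_tag(div_tags))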
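
The same combination can also be written with `zip`, which walks the four lists in parallel instead of indexing them. This is an alternative sketch, not the approach used in the rest of this project; `list_of_dict_zip` is a hypothetical name and the sample values are placeholders.

```python
def list_of_dict_zip(quotes_list, author_names, urls, subject_names):
    """Combine four parallel lists into one list of dictionaries using zip."""
    return [{'Quotes': quote, 'Author': author, 'Tags': tags, 'Link': url}
            for quote, author, url, tags
            in zip(quotes_list, author_names, urls, subject_names)]


# Placeholder values, for illustration only
sample = list_of_dict_zip(['q1', 'q2'], ['a1', 'a2'], ['u1', 'u2'], ['t1', 't2'])
print(sample)
```

Note that `zip` stops at the shortest input, so lists of unequal length are truncated silently, whereas the index-based version raises an `IndexError` if another list is shorter than `quotes_list`.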

## Convert parsing parts into csv file

Let's first convert the list of dictionaries into a Pandas `DataFrame`. A Pandas `DataFrame` is a 2-dimensional data structure, like a 2-dimensional array or a table with rows and columns. Then, using `to_csv`, we will save the DataFrame to a `CSV` file.

In [11]:
df=pd.DataFrame(quotes_dict)
df.to_csv('quote.csv',index=None)
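
The `Tags` values contain commas themselves, so it helps to see how a CSV file keeps them in one column. The sketch below uses the standard library `csv` module rather than pandas, and an in-memory buffer rather than a file on disk, purely to show the text that ends up in the file. By default, pandas' `to_csv` quotes such fields in a similar way.

```python
import csv
import io

# One row taken from the first page of results
rows = [
    {'Quotes': '“Try not to become a man of success. Rather become a man of value.”',
     'Author': 'Albert Einstein',
     'Tags': 'adulthood,success,value',
     'Link': 'http://quotes.toscrape.com/author/Albert-Einstein'},
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=['Quotes', 'Author', 'Tags', 'Link'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values, commas included
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[0]['Tags'])
```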

## Have a look at the CSV file using the pandas library
`read_csv` reads a comma-separated values (CSV) file into a DataFrame.

In [12]:
pd.read_csv('quote.csv')

| | Quotes | Author | Tags | Link |
|---|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change,deep-thoughts,thinking,world | http://quotes.toscrape.com/author/Albert-Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities,choices | http://quotes.toscrape.com/author/J-K-Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational,life,live,miracle,miracles | http://quotes.toscrape.com/author/Albert-Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy,books,classic,humor | http://quotes.toscrape.com/author/Jane-Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself,inspirational | http://quotes.toscrape.com/author/Marilyn-Monroe |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | adulthood,success,value | http://quotes.toscrape.com/author/Albert-Einstein |
| 6 | “It is better to be hated for what you are tha... | André Gide | life,love | http://quotes.toscrape.com/author/Andre-Gide |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | edison,failure,inspirational,paraphrased | http://quotes.toscrape.com/author/Thomas-A-Edison |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | misattributed-eleanor-roosevelt | http://quotes.toscrape.com/author/Eleanor-Roos... |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | humor,obvious,simile | http://quotes.toscrape.com/author/Steve-Martin |


## Full code of the project 
We have parsed all the details for one page. To generalize this to `n` pages, we loop over the pages and add each page's results to the running lists with the list method `extend`.
This is the full code of the project, generalized to `n` pages.
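
The choice of `extend` over `append` matters here. A short sketch with placeholder values shows the difference: `append` adds each page's list as a single nested element, while `extend` adds the items one by one and keeps the result flat.

```python
# Placeholder results for two pages
page_1 = ['quote A', 'quote B']
page_2 = ['quote C', 'quote D']

with_append = []
with_append.append(page_1)
with_append.append(page_2)

with_extend = []
with_extend.extend(page_1)
with_extend.extend(page_2)

print(with_append)  # [['quote A', 'quote B'], ['quote C', 'quote D']]
print(with_extend)  # ['quote A', 'quote B', 'quote C', 'quote D']
```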

In [13]:
def scraping_quotes(number_of_pages,path=None):
    """Get the quotes and write them into a csv file"""
    if path is None:
        path='Quotes.csv'
    items=parse_quotes(number_of_pages)    
    df=pd.DataFrame(items)
    df.to_csv(path,index=None)
    print('Quotes information for {} pages written to file "{}"'.format(number_of_pages,path))
    return path


def parse_quotes(number_of_pages):
    """Get the list of dictionaries having quotes, author names and other information for the particular number of pages """
    subject_names,urls,author_names,quotes_list=[],[],[],[]
    for i in range(1,number_of_pages+1):
        page=str(i)
        quote_doc=get_webpage(page)
        div_tags=quote_doc.find_all('div',class_='quote')
        subject_names.extend(get_quote_tag(div_tags))  #extend adds the page's items one by one, keeping a single flat list
        urls.extend(get_author_urls(div_tags))
        author_names.extend(get_author_name(div_tags))
        quotes_list.extend(get_quotes(div_tags))
    #combine the lists once, after all pages have been parsed
    return list_of_dict(quotes_list,author_names,urls,subject_names)

def get_quotes(div_tags):
    """Get the list of quotes for one page"""
    quotes=[]
    for tag in div_tags:
        quote=tag.find('span',class_='text').text
        quotes.append(quote)
    return quotes

def get_author_urls(div_tags):
    """ Get the list of urls for one page"""
    author_links=[]
    for tag in div_tags:
        span_tag=tag.find('span',class_=None)
        author_link='http://quotes.toscrape.com'+span_tag.find('a')['href']
        author_links.append(author_link)
    return author_links

def get_quote_tag(div_tags):
    """Get the quote tags for different quotes for one page"""
    name_tags=[]
    for tag in div_tags:
        name_tag=tag.find('div',class_='tags').meta['content']
        name_tags.append(name_tag)
    return name_tags

def get_author_name(div_tags):
    """Get the author names for the quotes for one page"""
    authors=[]
    for tag in div_tags:
        span_tag=tag.find('span',class_=None)
        author=span_tag.find('small',class_='author').text
        authors.append(author)
    return authors 


def list_of_dict(quotes_list,author_names,urls,subject_names):
    """Convert the lists into one list of dictionaries having quote, author names, link, tags"""
    return [{'Quotes': quotes_list[i],
             'Author': author_names[i],
             'Tags': subject_names[i],
             'Link': urls[i]} for i in range(len(quotes_list))]

def get_webpage(page):
    """Get the webpage containing quotes, authors etc."""
    #construct the url
    url='http://quotes.toscrape.com/page/'+page+'/'
    
    #using request download the webpage
    response=requests.get(url)
    
    #check if the process is successful
    if response.status_code!=200:
        print('Status Code:',response.status_code)
        raise Exception('Failed to fetch the webpage '+url)
    #parse the downloaded HTML
    doc=BeautifulSoup(response.text,'html.parser')
    return doc
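
As a possible variation on `get_webpage`, the sketch below adds a timeout and relies on `response.raise_for_status()` for the status check. The function name `get_webpage_checked` and the 10-second default are assumptions made for illustration; the URL pattern is the one used above.

```python
import requests
from bs4 import BeautifulSoup


def get_webpage_checked(page, timeout=10):
    """Hypothetical variant of get_webpage with a timeout and a built-in status check."""
    url = 'http://quotes.toscrape.com/page/{}/'.format(page)
    # timeout (in seconds) keeps the call from waiting indefinitely on a stalled server
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `get_webpage_checked(1)` would then return the parsed first page, or raise an exception if the download fails.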

In [17]:
scraping_quotes(10)

Quotes information for 10 pages written to file "Quotes.csv"


'Quotes.csv'

This is what our `Quotes.csv` file looks like:
![](https://i.imgur.com/dU3eN1C.png)

In [18]:
pd.read_csv('Quotes.csv')

| | Quotes | Author | Tags | Link |
|---|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change,deep-thoughts,thinking,world | http://quotes.toscrape.com/author/Albert-Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities,choices | http://quotes.toscrape.com/author/J-K-Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational,life,live,miracle,miracles | http://quotes.toscrape.com/author/Albert-Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy,books,classic,humor | http://quotes.toscrape.com/author/Jane-Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself,inspirational | http://quotes.toscrape.com/author/Marilyn-Monroe |
| ... | ... | ... | ... | ... |
| 95 | “You never really understand a person until yo... | Harper Lee | better-life-empathy | http://quotes.toscrape.com/author/Harper-Lee |
| 96 | “You have to write the book that wants to be w... | Madeleine L'Engle | books,children,difficult,grown-ups,write,write... | http://quotes.toscrape.com/author/Madeleine-LE... |
| 97 | “Never tell the truth to people who are not wo... | Mark Twain | truth | http://quotes.toscrape.com/author/Mark-Twain |
| 98 | “A person's a person, no matter how small.” | Dr. Seuss | inspirational | http://quotes.toscrape.com/author/Dr-Seuss |


## Summary

* Downloaded the web page using the `requests` library
* Found the quotes, author names, author URLs, and quote tags for a single page by parsing the HTML source code of the web page with the `Beautiful Soup` library
* Combined the lists from all the required pages
* Converted those lists into a list of dictionaries
* Wrote the parsed information to a `CSV` file
* Had a look at the CSV file using the `pandas` library

## Ideas for future work
* We can scrape each page individually
* We can analyze the data to find the top 10 authors
* We can analyze the data to find the top 10 tags
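
As a rough starting point for the two analysis ideas, the sketch below counts authors and tags with pandas. It uses only the ten rows from the first page shown earlier, typed in by hand, so the counts are illustrative; on the full `Quotes.csv` the same calls would apply to the DataFrame returned by `pd.read_csv`.

```python
import pandas as pd

# Author and tag values copied from the first page of results shown earlier
df = pd.DataFrame({
    'Author': ['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen',
               'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison',
               'Eleanor Roosevelt', 'Steve Martin'],
    'Tags': ['change,deep-thoughts,thinking,world', 'abilities,choices',
             'inspirational,life,live,miracle,miracles', 'aliteracy,books,classic,humor',
             'be-yourself,inspirational', 'adulthood,success,value', 'life,love',
             'edison,failure,inspirational,paraphrased',
             'misattributed-eleanor-roosevelt', 'humor,obvious,simile'],
})

# Count how often each author appears and keep the 10 most frequent
top_authors = df['Author'].value_counts().head(10)

# Split the comma-separated tags, put one tag per row, then count
top_tags = df['Tags'].str.split(',').explode().value_counts().head(10)

print(top_authors)
print(top_tags)
```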

## References
* N S, Aakash. (2021) “Let’s Build a Python Web Scraping Project from Scratch”, Jovian.ai, available at https://www.youtube.com/watch?v=RKsLLG-bzEY
* Requests Documentation, available at https://requests.readthedocs.io/en/latest/
* BeautifulSoup Documentation, available at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Pandas Dataframe as a CSV, available at https://www.geeksforgeeks.org/saving-a-pandas-dataframe-as-a-csv/
* Quotes To Scrape Website, available at https://quotes.toscrape.com/
