# Web Scraping in Python with Beautiful Soup, Requests and pandas

*This tutorial is mainly based on the tutorial [Build a Web Scraper with Python in 5 Minutes](https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html) by Natassha Selvaraj as well as the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).*

In this tutorial, you will learn how to:

1. Scrape the web page [“Quotes to Scrape”](https://quotes.toscrape.com/) using [Requests](https://docs.python-requests.org/en/latest/). 


1. Pull data out of HTML using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).


1. Use [Selector Gadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) to inspect the CSS of the web page.


1. Store the scraped data in a [pandas](https://pandas.pydata.org/) dataframe.

## Prerequisites

To start this tutorial, you need: 

- A basic understanding of HTML, CSS and CSS selectors.
- Google's web browser [Chrome](https://support.google.com/chrome/answer/95346?hl=en&co=GENIE.Platform%3DDesktop) and the [Chrome extension SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb).
- Familiarity with [Chrome DevTools](https://developer.chrome.com/docs/devtools/).

> To learn more about HTML, CSS, Chrome DevTools and the Selector Gadget, follow the instructions in this [web scraping basics tutorial](https://kirenz.github.io/codelabs/codelabs/webscraping/#0).

## Setup

In [6]:
import pandas as pd

import requests
from bs4 import BeautifulSoup

## Scrape website with Requests

- First, we use `requests` to scrape the website (using a GET request).

- `requests.get()` fetches the content at the given URL and returns a response object (we call it `html`):

In [7]:
url = 'http://quotes.toscrape.com/'

html = requests.get(url)

- Check if the request was successful (with `.status_code`):

In [8]:
html.status_code

200

- Status code 200 means that the request has succeeded.
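Besides reading `.status_code` by hand, Requests offers `.ok` and `.raise_for_status()` to react to failed requests. The following is a minimal sketch, not part of the original tutorial: it builds a response object by hand (the name `fake` and the 404 status are assumptions for illustration) so that it runs without network access. In the tutorial, the object returned by `requests.get(url)` would take its place.

```python
import requests

# Hand-built response so the sketch runs offline (illustrative only);
# in practice this would be `html = requests.get(url)`.
fake = requests.Response()
fake.status_code = 404

# .ok is False for any 4xx or 5xx status code
print(fake.ok)

try:
    # raises requests.HTTPError for 4xx and 5xx status codes
    fake.raise_for_status()
    outcome = "success"
except requests.HTTPError as err:
    outcome = "failed"
    print(f"Request failed: {err}")
```

Calling `raise_for_status()` right after the request is one way to stop early instead of parsing an error page by accident.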

## Investigate HTML with Beautiful Soup

- We can use the response object to access certain features such as content, text, headers, etc. 

- In our example, we only want to obtain the `text` attribute of the object.

- Therefore, we use `html.text`, which returns the body of the response decoded as a string.

- Running `html.text` through BeautifulSoup using the `html.parser` gives us a Beautiful Soup object:

In [9]:
soup = BeautifulSoup(html.text, 'html.parser')

- `soup` represents the document as a nested data structure:

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

Next, we take a look at some ways to navigate that data structure.

### Get all text

- A common task is extracting all the text from a page (since the output is quite large, we don't actually print the output of the following function):

In [11]:
# print(soup.get_text())
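To see what `get_text()` does without printing the whole page, here is a small sketch on a made-up HTML snippet (`snippet` and `mini_soup` are illustrative names, not part of the tutorial). It also shows the optional `separator` and `strip` arguments, which can make the extracted text easier to read.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (sample data, not the real page)
snippet = "<div><p>First line</p><p>Second line</p></div>"
mini_soup = BeautifulSoup(snippet, "html.parser")

# Default: all text pieces are concatenated as they appear
plain = mini_soup.get_text()

# Strip whitespace around each piece and join the pieces with a space
tidy = mini_soup.get_text(separator=" ", strip=True)

print(repr(plain))
print(repr(tidy))
```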

### Investigate title

- Print the complete HTML title (`.title`):

In [12]:
soup.title

<title>Quotes to Scrape</title>

- Show name of the title tag (`.title.name`):

In [13]:
soup.title.name

'title'

- Only print the text of the title (`.title.string`):

In [14]:
soup.title.string

'Quotes to Scrape'

- Show the name of the parent tag of title:

In [15]:
soup.title.parent.name

'head'

### Investigate hyperlinks

- Show the first hyperlink in the document:

In [16]:
soup.a

<a href="/" style="text-decoration: none">Quotes to Scrape</a>
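`soup.a` only returns the first link. To collect every link, one option is `find_all("a")` combined with `.get("href")`. The sketch below uses a made-up snippet and assumes `https://quotes.toscrape.com/` as the base URL for turning relative paths into absolute ones; `snippet`, `mini_soup` and `links` are illustrative names.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com/"

# Stand-in HTML (sample data); the last link has no href attribute
snippet = """
<a href="/">Quotes to Scrape</a>
<a href="/login">Login</a>
<a>no target</a>
"""
mini_soup = BeautifulSoup(snippet, "html.parser")

links = []
for a in mini_soup.find_all("a"):
    href = a.get("href")  # returns None instead of raising if href is missing
    if href is not None:
        links.append(urljoin(base_url, href))

print(links)
```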

### Investigate a text element

- Show the text of the first `span` element in the document:

In [17]:
soup.span.text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

### Extract specific elements with find and find_all

- Attribute access such as `soup.span` returns only the first matching tag. Since the page contains many `div` and `span` tags, we can’t use the previous approaches to extract all the relevant information.

- Instead, we use the `find` and `find_all` methods to extract specific HTML tags from the web page.

- `find` returns the first element that matches our specifications, and `find_all` returns all of them.

- Let's say our goal is to obtain all quotes, authors and tags from the website [“Quotes to Scrape”](https://quotes.toscrape.com/).

- We want to store all information in a pandas dataframe (every row should contain a quote as well as the corresponding author and tags).   

- First, we use [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) in Google Chrome to inspect the website. 


> Review the [web scraping basics tutorial](https://kirenz.github.io/codelabs/codelabs/webscraping/#0) to learn how to inspect websites.
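The difference between `find` and `find_all` can be sketched on a small made-up snippet that reuses the class names of the real page. Because SelectorGadget produces CSS selectors, the sketch also shows Beautiful Soup's `select` method, which accepts such a selector directly. The names `snippet`, `mini_soup` and the class `nope` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in HTML with the same class names as the real page (sample data)
snippet = """
<div class="quote"><span class="text">Quote A</span><small class="author">Author A</small></div>
<div class="quote"><span class="text">Quote B</span><small class="author">Author B</small></div>
"""
mini_soup = BeautifulSoup(snippet, "html.parser")

# find: first matching tag only
first = mini_soup.find("div", {"class": "quote"})

# find_all: list of all matching tags
every = mini_soup.find_all("div", {"class": "quote"})

# find returns None when nothing matches
missing = mini_soup.find("div", {"class": "nope"})

# select: same idea, but with a CSS selector
via_css = mini_soup.select("div.quote span.text")

print(first.span.text, len(every), missing, [el.text for el in via_css])
```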

#### Extract all quotes

Task: Extract all quotes

- First, we use the div class "quote" to retrieve all relevant information regarding the quotes:

In [18]:
quotes_all = soup.find_all('div', {'class': 'quote'})
quotes_all

[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itempr

- Next, we can iterate through our new `quotes_all` object and extract only the text of the quotes:

  - We want to store the text of all quotes in a new list called `quotes_text` (you need to provide an empty list first).
  - To extract the quotes, note that the text of each quote is available in the `span` tag with the attribute `class="text"` (see the output above).
  - Finally, we use the attribute `.text` to make sure we only extract the text.

Some hints:

```python
# create empty list
quotes_text = []

# use a for loop to append each quote to quotes_text
for i in ___:
    ___.append((___.find('___', {'___':'___'})).___)
```

In [19]:
quotes_text = []

for i in quotes_all:
    quotes_text.append((i.find('span', {'class':'text'})).text)

In [21]:
quotes_text

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

In [22]:
# first quote 
quotes_text[0]

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
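The same extraction can also be written as a list comprehension. This is only an alternative sketch on made-up sample HTML (the snippet and its two quotes are placeholders), not the approach the tutorial prescribes.

```python
from bs4 import BeautifulSoup

# Stand-in HTML with the same structure as the quote blocks (sample data)
snippet = """
<div class="quote"><span class="text">Quote A</span></div>
<div class="quote"><span class="text">Quote B</span></div>
"""
quotes_all = BeautifulSoup(snippet, "html.parser").find_all("div", {"class": "quote"})

# One expression instead of an empty list plus a for loop with append
quotes_text = [block.find("span", {"class": "text"}).text for block in quotes_all]

print(quotes_text)
```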

- Next, we want to store the data in a pandas dataframe (to make later data preprocessing steps easier):

In [25]:
df_quotes = pd.DataFrame({"quote" : quotes_text})
df_quotes

|   | quote |
|---|---|
| 0 | “The world as we have created it is a process ... |
| 1 | “It is our choices, Harry, that show what we t... |
| 2 | “There are only two ways to live your life. On... |
| 3 | “The person, be it gentleman or lady, who has ... |
| 4 | “Imperfection is beauty, madness is genius and... |
| 5 | “Try not to become a man of success. Rather be... |
| 6 | “It is better to be hated for what you are tha... |
| 7 | “I have not failed. I've just found 10,000 way... |
| 8 | “A woman is like a tea bag; you never know how... |
| 9 | “A day without sunshine is like, you know, nig... |


#### Extract all authors

Task: Extract all authors 

In [24]:
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="

- In this example, we don't want to create a new object (like `quotes_all`) as an intermediate step.


- Instead, we use a different approach:
  - create an empty list with the name `authors_text`
  - use the `soup` object and call the `find_all()` method in a for loop to extract the authors (take a look at the code where we created `quotes_all`):
  
Hint:

```python
___ = []

for i in ___.___("___",{"___": "___"}):
    ___.___((___.___("___", {"___": "___"})).___)
```

In [16]:
authors_text = []
for i in soup.find_all("div", {"class": "quote"}):
    authors_text.append((i.find("small", {"class": "author"})).text)
    print(authors_text[-1])  # show the author that was just added

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin


We create a new dataframe:

- call the dataframe: df_authors
- name the column: author

In [None]:
df_authors = pd.DataFrame({"author" : authors_text})
df_authors

We could use a left join to combine the two dataframes (`join` matches rows on the index by default):

In [None]:
df_quotes.join(df_authors)
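A small sketch of what this join does, using placeholder data instead of the scraped lists: `DataFrame.join` performs a left join on the index by default, so row 0 of one dataframe is paired with row 0 of the other. Building a single dataframe from both lists gives the same layout; the placeholder values below are assumptions for illustration.

```python
import pandas as pd

# Placeholder data standing in for quotes_text and authors_text
df_quotes = pd.DataFrame({"quote": ["Quote A", "Quote B"]})
df_authors = pd.DataFrame({"author": ["Author A", "Author B"]})

# Left join on the index (the default for DataFrame.join)
joined = df_quotes.join(df_authors)

# Alternative: build one dataframe directly from both lists
direct = pd.DataFrame({"quote": ["Quote A", "Quote B"],
                       "author": ["Author A", "Author B"]})

print(joined)
```

Joining on the index only pairs the right rows if both lists were collected in the same order, which is the case here because both come from the same sequence of quote blocks.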

#### Extract all tags

Task: Extract all tags

- Information about the tags is available in the class "tags".

- We need to extract the `content` attribute from the `meta` tag, which holds the tags as a comma-separated string:

In [17]:
for i in soup.find_all("div", {"class": "tags"}):
    print((i.find("meta"))['content'])

change,deep-thoughts,thinking,world
abilities,choices
inspirational,life,live,miracle,miracles
aliteracy,books,classic,humor
be-yourself,inspirational
adulthood,success,value
life,love
edison,failure,inspirational,paraphrased
misattributed-eleanor-roosevelt
humor,obvious,simile
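If a list of individual tags is more convenient than one comma-separated string, there are at least two ways to get it. The sketch below runs on a shortened copy of the first tag block; `from_meta` and `from_links` are illustrative names.

```python
from bs4 import BeautifulSoup

# Shortened copy of one tag block from the page (sample data)
snippet = """
<div class="tags">
  <meta class="keywords" content="change,deep-thoughts" itemprop="keywords"/>
  <a class="tag" href="/tag/change/page/1/">change</a>
  <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
</div>
"""
tags_div = BeautifulSoup(snippet, "html.parser").find("div", {"class": "tags"})

# Option 1: split the content attribute of the meta tag
from_meta = tags_div.find("meta")["content"].split(",")

# Option 2: read the text of every <a class="tag"> element
from_links = [a.text for a in tags_div.find_all("a", {"class": "tag"})]

print(from_meta, from_links)
```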


## Create dataframe for all quotes, authors and tags

- Next, we want to store all quotes with the corresponding authors and tags information in a pandas dataframe.  

- Note that the site has a total of ten pages and we want to collect the data from all of them. 

- The website's URL address is structured as follows:

  - page 1: https://quotes.toscrape.com/page/1/
  - page 2: https://quotes.toscrape.com/page/2/
  - ...
  - page 10: https://quotes.toscrape.com/page/10/

- This means we can use the part "https://quotes.toscrape.com/page/" as root and iterate over the pages 1 to 10.
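As a quick check of the URL pattern, the ten page addresses can be generated from the root. Note that Python's `range` excludes its stop value, so pages 1 to 10 require `range(1, 11)`. The variable name `page_urls` is an assumption for illustration.

```python
root = "https://quotes.toscrape.com/page/"

# range(1, 11) yields 1 through 10; the stop value 11 is excluded
page_urls = [f"{root}{page}/" for page in range(1, 11)]

print(page_urls[0])
print(page_urls[-1])
```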

We will proceed as follows:

1. Store the root url without the page number as a variable called `root`.


1. Prepare three empty arrays: `quotes`, `authors` and `tags`.


1. Create a loop that ranges from 1–10 to iterate through every page on the site.


1. Append the scraped data to our arrays.

- Note that we use the same code as before (we simply replace `print` with `foo.append`)

In [18]:
# store root url without page number
root = 'http://quotes.toscrape.com/page/'

# create empty lists
quotes = []
authors = []
tags = []

# loop over pages 1 to 10 (range excludes the stop value, so we use 11)
for page in range(1, 11):

    html = requests.get(root + str(page))

    # name the parser explicitly, as we did for the first page
    soup = BeautifulSoup(html.text, 'html.parser')

    for i in soup.find_all("div", {"class": "quote"}):
        quotes.append((i.find("span", {"class": "text"})).text)

    for j in soup.find_all("div", {"class": "quote"}):
        authors.append((j.find("small", {"class": "author"})).text)

    for k in soup.find_all("div", {"class": "tags"}):
        tags.append((k.find("meta"))['content'])
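An alternative sketch, not the tutorial's own approach: instead of three separate loops, a single pass over the quote blocks can collect quote, author and tags together, which keeps the three values of each row aligned. The helper `parse_quotes` and the sample HTML are hypothetical and only serve as illustration; the function takes an HTML string, so it runs without network access.

```python
from bs4 import BeautifulSoup


def parse_quotes(page_html):
    """Return one dict per quote block found in page_html (hypothetical helper)."""
    page_soup = BeautifulSoup(page_html, "html.parser")
    records = []
    for block in page_soup.find_all("div", {"class": "quote"}):
        records.append({
            "Quotes": block.find("span", {"class": "text"}).text,
            "Authors": block.find("small", {"class": "author"}).text,
            "Tags": block.find("meta")["content"],
        })
    return records


# Stand-in HTML with the same structure as one quote block (sample data)
sample = """
<div class="quote">
  <span class="text">Quote A</span>
  <span>by <small class="author">Author A</small></span>
  <div class="tags"><meta class="keywords" content="tag1,tag2"/></div>
</div>
"""
records = parse_quotes(sample)
print(records)
```

In the page loop, `parse_quotes(html.text)` could be called once per page and the results collected with `list.extend`; the resulting list of dicts can be passed to `pd.DataFrame`.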

- Create pandas dataframe

In [19]:
df = pd.DataFrame(
    {'Quotes':quotes,
     'Authors':authors,
     'Tags':tags
    })

- Show result

In [20]:
df.head()

|   | Quotes | Authors | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change,deep-thoughts,thinking,world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities,choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational,life,live,miracle,miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy,books,classic,humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself,inspirational |
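As one possible preprocessing step, the comma-separated `Tags` column can be split into lists and expanded to one row per tag. This sketch uses the first two rows shown above as sample data; the column name `TagList` and the variable `per_tag` are assumptions for illustration.

```python
import pandas as pd

# First two rows of the result, used here as sample data
df = pd.DataFrame({
    "Authors": ["Albert Einstein", "J.K. Rowling"],
    "Tags": ["change,deep-thoughts,thinking,world", "abilities,choices"],
})

# Turn each comma-separated string into a list of tags
df["TagList"] = df["Tags"].str.split(",")

# One row per (author, tag) pair
per_tag = df.explode("TagList")

# Count how often each tag occurs
counts = per_tag["TagList"].value_counts()

print(per_tag[["Authors", "TagList"]])
```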


- Congratulations! You have successfully completed this tutorial.