# Part IV. Scraping web pages with Requests and BeautifulSoup

## The Task
In this part, we will create a reusable function that scrapes quotes from quotes.toscrape.com for a given range of pages. The output will be saved as a CSV file.


| author          | author_url                                        | quote_text                                        | tags                                     |
| :-------------- | :------------------------------------------------ | :------------------------------------------------ | ---------------------------------------- |
| Albert Einstein | http://quotes.toscrape.com/author/Albert-Einstein | “The world as we have created it is a process ... | change;deep-thoughts;thinking;world      |
| J.K. Rowling    | http://quotes.toscrape.com/author/J-K-Rowling     | “It is our choices, Harry, that show what we t... | abilities;choices                        |
| Albert Einstein | http://quotes.toscrape.com/author/Albert-Einstein | “There are only two ways to live your life. On... | inspirational;life;live;miracle;miracles |
| Jane Austen     | http://quotes.toscrape.com/author/Jane-Austen     | “The person, be it gentleman or lady, who has ... | aliteracy;books;classic;humor            |



## Main Steps

- Import necessary libraries
- Define the URL
- Make a request to retrieve the HTML
- Make the soup
- Parse HTML with BeautifulSoup
- Store the results
- Create reusable functions

We will start with scraping quotes from Page 1.

## Step 1. Import necessary libraries

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Step 2. Make a simple request and retrieve response contents

### 2.1 Define the URL

In [None]:
# Indicate the base URL
base_url = 'http://quotes.toscrape.com/page/'

# The page number for the first page is 1
page_num = 1

# url = the base URL + page number
url = base_url + str(page_num)
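
The same concatenation extends to several pages. Below is a minimal sketch, assuming the site keeps the `/page/<number>` pattern shown above; `build_page_url` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
# Base URL taken from the cell above
BASE_URL = 'http://quotes.toscrape.com/page/'

def build_page_url(page_num):
    """Hypothetical helper: return the URL of a given page number."""
    # str() is needed because a string and an integer cannot be concatenated
    return BASE_URL + str(page_num)

# URLs for pages 1 to 3 (range excludes its upper bound)
page_urls = [build_page_url(n) for n in range(1, 4)]
print(page_urls)
```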

### 2.2 Make a simple GET request to retrieve the HTML
The `requests` library is commonly used in Python to make HTTP requests.

In [None]:
# Make HTTP requests
r = requests.get(url)

# Retrieve response contents
c = r.content
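
A request can fail or hang, so it may help to check the response before parsing it. The sketch below shows one possible way to do that. It is not part of the original workflow: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption.

```python
import requests

def fetch_html(url, timeout=10):
    """Hypothetical helper: return the raw HTML bytes of `url`."""
    # timeout (in seconds) stops the call from waiting indefinitely
    r = requests.get(url, timeout=timeout)
    # raise requests.HTTPError if the server answered with a 4xx or 5xx status
    r.raise_for_status()
    return r.content
```

With this helper, `c = fetch_html(url)` would give the same bytes as `r.content` when the request succeeds.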

## Step 3. Make the soup

In [None]:
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(c, 'lxml')
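
To see what a soup object offers without touching the network, here is a small sketch on a hand-written HTML string. The snippet is made up for illustration and only loosely resembles the real page. It uses `html.parser`, Python's built-in parser, so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hand-written sample, NOT copied from the website
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div class="quote"><span class="text">First sample quote</span></div>
    <div class="quote"><span class="text">Second sample quote</span></div>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# .title gives the first <title> tag; .text gives the text inside it
print(sample_soup.title.text)

# find_all returns a list-like ResultSet with every matching tag
sample_quotes = sample_soup.find_all('div', {'class': 'quote'})
print(len(sample_quotes))
print(sample_quotes[0].text)
```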

## Step 4. Parse HTML with BeautifulSoup
### 4.1 Locate all the quotes on the page

In [None]:
quotes = soup.find_all('div', {'class': 'quote'})

### 4.2 Observe the retrieved quotes

In [None]:
print(quotes[0])

### 4.3 Reuse the get_quote function
Because each quote's HTML follows the same pattern as the one we processed in Part II, we can reuse the `get_quote` function.
<br>However, the function needs a minor change so that it accepts a bs4 element as its input:
- Comment out (`#`) the line that makes the soup, because the input is already a bs4 element.
- Change the function's parameter to `soup`.

In [None]:
def get_quote(soup):
    # the input is already a bs4 element, so the soup no longer needs to be made
    # soup = BeautifulSoup(text, "lxml")

    # retrieve the text of the quote
    quote_text = soup.find('span')
    quote_text = quote_text.text

    # retrieve the author name
    author = soup.find('small')
    author = author.text

    # retrieve the tags and join them with semicolons
    tags = soup.find_all('a', {'class': 'tag'})
    tags_ls = []
    for tag in tags:
        tags_ls.append(tag.text)
    tags = ';'.join(tags_ls)

    # retrieve the author URL (the first link inside the quote)
    author_url = soup.find('a')
    author_url = author_url.get('href')
    author_url = 'http://quotes.toscrape.com' + author_url

    # collect the results in a dictionary
    results_dt = {
        'author': author,
        'author_url': author_url,
        'tags': tags,
        'quote_text': quote_text
    }

    return results_dt
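
The tag-name lookups above rely on the order of the elements inside each quote. As an alternative sketch, the version below selects the quote text and author by class name and builds the author URL with `urljoin`. The class names `text` and `author`, the sample HTML, and the function name `get_quote_by_class` are assumptions made for illustration; compare them with the printed output of `quotes[0]` before relying on them.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

SITE_URL = 'http://quotes.toscrape.com'

def get_quote_by_class(quote):
    """Hypothetical variant: parse one quote element (a bs4 Tag) into a dictionary."""
    # class names below are assumptions about the page's markup
    quote_text = quote.find('span', {'class': 'text'}).text
    author = quote.find('small', {'class': 'author'}).text
    tags = ';'.join(tag.text for tag in quote.find_all('a', {'class': 'tag'}))
    # the first link inside the quote is assumed to point to the author page
    author_url = urljoin(SITE_URL, quote.find('a').get('href'))
    return {
        'author': author,
        'author_url': author_url,
        'tags': tags,
        'quote_text': quote_text,
    }

# Hand-written sample, NOT copied from the website
sample_html = """
<div class="quote">
  <span class="text">A made-up quote.</span>
  <span>by <small class="author">Jane Doe</small>
    <a href="/author/Jane-Doe">(about)</a>
  </span>
  <div class="tags">
    <a class="tag" href="/tag/sample/">sample</a>
    <a class="tag" href="/tag/demo/">demo</a>
  </div>
</div>
"""

sample_quote = BeautifulSoup(sample_html, 'html.parser').find('div', {'class': 'quote'})
result = get_quote_by_class(sample_quote)
print(result)
```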

### 4.4 Test out the modified get_quote function

In [None]:
# parse the fifth quote
get_quote(quotes[4])

### 4.5 Scrape all the quotes on the page

In [None]:
outputs = []
for quote in quotes:
    quote = get_quote(quote)
    outputs.append(quote)

## Step 5. Create reusable functions

### 5.1 Function - Get quotes by page number

In [None]:
def get_quotes(page_number):
    # define the URL
    base_url = 'http://quotes.toscrape.com/page/'
    page_number = str(page_number)
    url = base_url + page_number
    
    # make a request and make the soup
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c, 'lxml')
    
    # locate all the quotes
    quotes = soup.find_all('div', {'class': 'quote'})
    
    # parse each quote using a for loop
    outputs = []
    for quote in quotes:
        quote = get_quote(quote)
        outputs.append(quote)
        
    # return the outputs
    return outputs

Try out the function now! In the cell below, print the quotes on Page 4.

### 5.2 Function - Get quotes by a range of numbers

#### 5.2.1 Get started with a for loop
Example: from page 1 to page 3

In [None]:
outputs = []
for i in range(1, 4):
    outputs += get_quotes(i)

#### 5.2.2 Transform the outputs into a tabular format

In [None]:
outputs = pd.DataFrame(outputs)
outputs.head()
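
`pd.DataFrame` accepts a list of dictionaries and turns each key into a column. The sketch below uses two made-up records (not scraped data) and reorders the columns to match the table in The Task section.

```python
import pandas as pd

# Made-up records with the same keys that get_quote returns
sample_outputs = [
    {'author': 'Jane Doe',
     'author_url': 'http://quotes.toscrape.com/author/Jane-Doe',
     'tags': 'sample;demo',
     'quote_text': 'A made-up quote.'},
    {'author': 'John Roe',
     'author_url': 'http://quotes.toscrape.com/author/John-Roe',
     'tags': 'example',
     'quote_text': 'Another made-up quote.'},
]

# each dictionary becomes one row
df = pd.DataFrame(sample_outputs)

# select the columns in the desired order
df = df[['author', 'author_url', 'quote_text', 'tags']]
print(df.shape)
print(list(df.columns))
```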

#### 5.2.3 Write the lines together

In [None]:
def scrape_quotes(start, end):
    outputs = []
    for i in range(start, end + 1):
        outputs += get_quotes(i)
    outputs = pd.DataFrame(outputs)
    return outputs
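
The task calls for the results to be saved as CSV, which `DataFrame.to_csv` can do. Below is a minimal sketch with one made-up record. The file name `quotes.csv` is an assumption, and the example writes to an in-memory buffer so that it runs without creating a file.

```python
import io
import pandas as pd

# Made-up record with the same keys that get_quote returns
df = pd.DataFrame([
    {'author': 'Jane Doe',
     'author_url': 'http://quotes.toscrape.com/author/Jane-Doe',
     'tags': 'sample;demo',
     'quote_text': 'A made-up quote.'},
])

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# To write a file instead (the file name is an assumption):
# df.to_csv('quotes.csv', index=False)
```

Applied to the function above, `scrape_quotes(1, 3).to_csv('quotes.csv', index=False)` would save pages 1 to 3 in one step.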

## Test out the function
In the cell below, scrape quotes from pages 2 to 8.