# Advanced Web Scraping Lab

In this lab you will first study the following code snippet, a simple web spider class that lets you scrape paginated webpages. Read the code, run it, and make sure you understand how it works. In the challenges of this lab, we will guide you in building up this class so that you end up with a more robust web spider that you can develop further in the Web Scraping Project.

In [13]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    A simple spider for scraping paginated webpages.

    The constructor accepts the parameters below and stores them as
    instance variables so that the other methods can access them later.

    url_pattern: the URL template of the pages to scrape, with a `%s` placeholder for the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser

    def scrape_url(self, url):
        """Scrape the content of a single url."""
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)

    def output_results(self, r):
        """
        Export the scraped content. Right now it simply prints the results,
        but in the future you can export them to a text file or database.
        """
        print(r)

    def kickstart(self):
        """
        After the class is instantiated, call this method to start the scraping jobs.
        It uses a for loop to call `scrape_url()` for each url to scrape.
        """
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # url template; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

def quotes_parser(content):
    """
    This is a custom parser function you will complete in the challenge.
    Right now it simply returns the content passed to it. But in this lab
    you will complete this function so that it extracts the quotes.
    This function will be passed to the IronhackSpider class.
    """
    return content

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\

## Challenge 1 - Custom Parser Function

In this challenge, complete the custom `quotes_parser()` function so that the returned result contains the quote strings instead of the whole HTML page content.
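As a hedged sketch of one possible approach: the raw output above suggests that each quote sits in a `span` element with the class `text`. The sample HTML below is a shortened stand-in for the real page, and `extract_quotes` is a hypothetical name used only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page, used only for illustration.
SAMPLE_HTML = """
<div class="quote">
    <span class="text" itemprop="text">First sample quote.</span>
    <span>by <small class="author">Author One</small></span>
</div>
<div class="quote">
    <span class="text" itemprop="text">Second sample quote.</span>
    <span>by <small class="author">Author Two</small></span>
</div>
"""

def extract_quotes(content):
    """Return the text of every <span class="text"> element as a list of strings."""
    soup = BeautifulSoup(content, "html.parser")
    return [span.get_text(strip=True) for span in soup.find_all("span", class_="text")]

print(extract_quotes(SAMPLE_HTML))
```

Filtering on the class skips the author and "(about)" spans, so only the quote strings are returned. A function like this could be passed to the spider as `content_parser`.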

In the cell below, write your updated `quotes_parser()` function and kickstart the spider. Make sure the printed results contain a list of quote strings extracted from the HTML content.

In [14]:
def quotes_parser(content):
    soup = BeautifulSoup(content, "lxml")
    tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'span']  # HTML defines headings h1 to h6 only
    text = [element.text for element in soup.find_all(tags)]
    return '\n'.join(text)

my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

my_spider.kickstart()


Quotes to Scrape

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
(about)

“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
(about)

“It is better to be hated for what you are than to be loved for what you are not.”
by André Gide
(about)

“I have not failed. I've just found 10,000 ways that won't work.”
by Thomas A. Edison
(about)

“

## Challenge 2 - Error Handling

In `IronhackSpider.scrape_url()`, catch any error that might occur when you make requests to scrape the webpage. This includes checking the response status code and catching http request errors such as timeout, SSL, and too many redirects.
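One way to cover both the status code and the request errors is `Response.raise_for_status()`, which turns 4xx and 5xx responses into an `HTTPError`. The sketch below is an illustration, not the required solution: `fetch`, its `get` parameter, the 10-second timeout, and the stand-in getters are all assumptions introduced so the example runs without network access.

```python
import requests

def fetch(url, get=requests.get, timeout=10):
    """Return the response, or None if the request failed (the failure is printed)."""
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
        return response
    except requests.exceptions.HTTPError as error:
        print("Bad status code:", error)
    except requests.exceptions.Timeout:
        print("The request timed out:", url)
    except requests.exceptions.TooManyRedirects:
        print("Too many redirects:", url)
    except requests.exceptions.SSLError:
        print("SSL error:", url)
    except requests.exceptions.RequestException as error:
        # Base class of the exceptions above, so it must come last
        print("Request failed:", error)
    return None


# Offline demonstration with stand-in getters (hypothetical helpers).
def fake_not_found(url, timeout):
    response = requests.Response()
    response.status_code = 404
    return response

def fake_timeout(url, timeout):
    raise requests.exceptions.Timeout()

print(fetch("http://example.com/missing", get=fake_not_found))
print(fetch("http://example.com/slow", get=fake_timeout))
```

Returning `None` on failure lets the caller skip the parser for pages that could not be fetched.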

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [15]:
# your code here
import time

class IronhackSpider:

    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
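    def scrape_url(self, url):
        # Route the request through error_handling() so that failures are caught
        response = self.error_handling(url)
        if response is None:
            return
        result = self.content_parser(response.content)
        self.output_results(result)
    
    def output_results(self, r):
        print(r)
    
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            
    def error_handling(self, url):
        """Request the url; return the response, or None if the request failed."""
        try:
            # A timeout keeps the request from hanging indefinitely
            response = requests.get(url, timeout=10)
            if response.status_code != 200:
                print("Unexpected status code %s for %s" % (response.status_code, url))
                return None
            return response
        except requests.exceptions.TooManyRedirects:
            print("Too many redirects; the URL may be incorrect, try a different one.")
        except requests.exceptions.Timeout:
            print("The response was not received within the expected time.")
        except requests.exceptions.SSLError:
            print("SSL error.")
        except requests.exceptions.RequestException as e:
            print("Unknown request error: %s" % e)
        return None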
            
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

my_spider.kickstart()


Quotes to Scrape

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
(about)

“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
(about)

“It is better to be hated for what you are than to be loved for what you are not.”
by André Gide
(about)

“I have not failed. I've just found 10,000 ways that won't work.”
by Thomas A. Edison
(about)

“

## Challenge 3 - Sleep Interval

In `IronhackSpider.kickstart()`, implement `sleep_interval`. You will check if `self.sleep_interval` is larger than 0. If so, tell the FOR loop to sleep the given amount of time before making the next request.
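The check described above can be sketched outside the class as follows. This variant pauses only between consecutive requests, which is one possible design choice. `scrape_with_delay` and its `sleep` parameter are hypothetical names, added so the demonstration can record the pauses instead of actually waiting.

```python
import time

def scrape_with_delay(urls, scrape_url, sleep_interval=-1, sleep=time.sleep):
    """Call scrape_url for each url, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        # Pause before every request except the first one
        if index > 0 and sleep_interval > 0:
            sleep(sleep_interval)
        scrape_url(url)

# Offline demonstration: record the pauses instead of actually sleeping.
pauses = []
visited = []
scrape_with_delay(["page-1", "page-2", "page-3"], visited.append,
                  sleep_interval=5, sleep=pauses.append)
print(visited)  # ['page-1', 'page-2', 'page-3']
print(pauses)   # [5, 5]
```

Three pages produce two pauses, so no time is spent waiting after the last request.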

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [6]:
# your code here
import time  # kickstart() calls time.sleep(), so the module itself must be imported
class IronhackSpider:

    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
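    def scrape_url(self, url):
        # Route the request through error_handling() so that failures are caught
        response = self.error_handling(url)
        if response is None:
            return
        result = self.content_parser(response.content)
        self.output_results(result)
    
    def output_results(self, r):
        print(r)
    
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            # Wait before the next request when a positive interval is set
            if self.sleep_interval > 0:
                time.sleep(self.sleep_interval)
            
    def error_handling(self, url):
        """Request the url; return the response, or None if the request failed."""
        try:
            # A timeout keeps the request from hanging indefinitely
            response = requests.get(url, timeout=10)
            if response.status_code != 200:
                print("Unexpected status code %s for %s" % (response.status_code, url))
                return None
            return response
        except requests.exceptions.TooManyRedirects:
            print("Too many redirects; the URL may be incorrect, try a different one.")
        except requests.exceptions.Timeout:
            print("The response was not received within the expected time.")
        except requests.exceptions.SSLError:
            print("SSL error.")
        except requests.exceptions.RequestException as e:
            print("Unknown request error: %s" % e)
        return None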
            
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

my_spider.sleep_interval = 5
my_spider.kickstart()


Quotes to Scrape

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
(about)

“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
(about)

“It is better to be hated for what you are than to be loved for what you are not.”
by André Gide
(about)

“I have not failed. I've just found 10,000 ways that won't work.”
by Thomas A. Edison
(about)

“

## Challenge 4 - Test Batch Scraping

Change the `PAGES_TO_SCRAPE` value from `1` to `10`. Check whether your code still works as intended when scraping 10 webpages. If there are errors in your code, fix them.
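Before running the batch, it can help to confirm which URLs the spider will request. This short sketch reproduces the substitution that `kickstart()` performs with the `%` operator. The `urls` list is added here only for inspection.

```python
URL_PATTERN = 'http://quotes.toscrape.com/page/%s/'
PAGES_TO_SCRAPE = 10

# Same substitution that kickstart() performs for each page number
urls = [URL_PATTERN % page for page in range(1, PAGES_TO_SCRAPE + 1)]

print(len(urls))   # 10
print(urls[0])     # http://quotes.toscrape.com/page/1/
print(urls[-1])    # http://quotes.toscrape.com/page/10/
```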

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [12]:
# your code here
class IronhackSpider:

    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
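    def scrape_url(self, url):
        # Route the request through error_handling() so that failures are caught
        response = self.error_handling(url)
        if response is None:
            return None
        result = self.content_parser(response.content)
        if result:
            self.output_results(result)
        return result
    
    def output_results(self, r):
        print(r)
    
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            if self.sleep_interval > 0:
                time.sleep(self.sleep_interval)
            condition = self.scrape_url(self.url_pattern % i)
            # Stop as soon as a page fails or yields no content
            if not condition:
                break
            
    def error_handling(self, url):
        """Request the url; return the response, or None if the request failed."""
        try:
            # A timeout keeps the request from hanging indefinitely
            response = requests.get(url, timeout=10)
            if response.status_code != 200:
                print("Unexpected status code %s for %s" % (response.status_code, url))
                return None
            return response
        except requests.exceptions.TooManyRedirects:
            print("Too many redirects; the URL may be incorrect, try a different one.")
        except requests.exceptions.Timeout:
            print("The response was not received within the expected time.")
        except requests.exceptions.SSLError:
            print("SSL error.")
        except requests.exceptions.RequestException as e:
            print("Unknown request error: %s" % e)
        return None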
            
my_spider = IronhackSpider(URL_PATTERN, 10, 1, content_parser=quotes_parser)

my_spider.kickstart()





Quotes to Scrape








Quotes to Scrape




Login






“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)


            Tags:
            
change
deep-thoughts
thinking
world



“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)


            Tags:
            
abilities
choices



“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)


            Tags:
            
inspirational
life
live
miracle
miracles



“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)


            Tags:
            
aliteracy
books
classic
humor



“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe





Quotes to Scrape








Quotes to Scrape




Login






“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”
by George R.R. Martin
(about)


            Tags:
            
read
readers
reading
reading-books



“You can never get a cup of tea large enough or a book long enough to suit me.”
by C.S. Lewis
(about)


            Tags:
            
books
inspirational
reading
tea



“You believe lies so you eventually learn to trust no one but yourself.”
by Marilyn Monroe
(about)






“If you can make a woman laugh, you can make her do anything.”
by Marilyn Monroe
(about)


            Tags:
            
girls
love



“Life is like riding a bicycle. To keep your balance, you must keep moving.”
by Albert Einstein
(about)


            Tags:
            
life
simile



“The real lover is the man who can thrill you by kissing your forehead or smiling into your eyes or just staring into space.”
by Marilyn Monroe
(about)


            Tags:






Quotes to Scrape








Quotes to Scrape




Login






“Anyone who has never made a mistake has never tried anything new.”
by Albert Einstein
(about)


            Tags:
            
mistakes



“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”
by Jane Austen
(about)


            Tags:
            
humor
love
romantic
women



“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”
by J.K. Rowling
(about)


            Tags:
            
integrity



“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”
by Jane Austen
(about)


            Tags:
            
books
library
reading


## Challenge 5 - Scrape a Different Website

Update the parameters passed to the `IronhackSpider` constructor so that your code can crawl [books.toscrape.com](http://books.toscrape.com/). You will need to use a different `URL_PATTERN` (figure out the new url pattern by yourself) and write another parser function to be passed to `IronhackSpider`.
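A parser for the new site might return structured data rather than raw text. The sketch below is a minimal illustration under stated assumptions: the sample HTML is a simplified guess at how a book entry is marked up (an `article` with class `product_pod`, the full title in the link's `title` attribute, and the price in a `p` with class `price_color`), and `parse_books` is a hypothetical name. Check the real page source before relying on these selectors.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a catalogue page; the real markup may differ.
SAMPLE_HTML = """
<section>
  <article class="product_pod">
    <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
    <div class="product_price">
      <p class="price_color">£51.77</p>
      <p class="instock availability">In stock</p>
    </div>
  </article>
  <article class="product_pod">
    <h3><a href="#" title="Tipping the Velvet">Tipping the Velvet</a></h3>
    <div class="product_price">
      <p class="price_color">£53.74</p>
      <p class="instock availability">In stock</p>
    </div>
  </article>
</section>
"""

def parse_books(content):
    """Return a list of (title, price) tuples, one per book on the page."""
    soup = BeautifulSoup(content, "html.parser")
    books = []
    for article in soup.find_all("article", class_="product_pod"):
        link = article.h3.a
        # Prefer the full title from the attribute; fall back to the visible text
        title = link.get("title", link.get_text(strip=True))
        price = article.find("p", class_="price_color").get_text(strip=True)
        books.append((title, price))
    return books

print(parse_books(SAMPLE_HTML))
```

Returning tuples instead of one long string avoids the blank lines that come from printing an element's whole text.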

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [22]:
# your code here
class IronhackSpider:

    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
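    def scrape_url(self, url):
        # Route the request through error_handling() so that failures are caught
        response = self.error_handling(url)
        if response is None:
            return
        result = self.content_parser(response.content)
        self.output_results(result)
    
    def output_results(self, r):
        print(r)
    
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            
    def error_handling(self, url):
        """Request the url; return the response, or None if the request failed."""
        try:
            # A timeout keeps the request from hanging indefinitely
            response = requests.get(url, timeout=10)
            if response.status_code != 200:
                print("Unexpected status code %s for %s" % (response.status_code, url))
                return None
            return response
        except requests.exceptions.TooManyRedirects:
            print("Too many redirects; the URL may be incorrect, try a different one.")
        except requests.exceptions.Timeout:
            print("The response was not received within the expected time.")
        except requests.exceptions.SSLError:
            print("SSL error.")
        except requests.exceptions.RequestException as e:
            print("Unknown request error: %s" % e)
        return None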
            
URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html'
PAGES_TO_SCRAPE = 10

def books_parser(content):
    # Parse the page and collect the text of every book listed in its first section
    soup = BeautifulSoup(content,'html.parser')
    seccion = soup.find_all('section')[0]
    libros = seccion.find_all('article', {'class':'product_pod'})
    cadena = ''
    for libro in libros:
        cadena += libro.text + '\n'
    return cadena
    
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=books_parser)

my_spider.kickstart()












A Light in the ...

£51.77


    
        In stock
    


Add to basket














Tipping the Velvet

£53.74


    
        In stock
    


Add to basket














Soumission

£50.10


    
        In stock
    


Add to basket














Sharp Objects

£47.82


    
        In stock
    


Add to basket














Sapiens: A Brief History ...

£54.23


    
        In stock
    


Add to basket














The Requiem Red

£22.65


    
        In stock
    


Add to basket














The Dirty Little Secrets ...

£33.34


    
        In stock
    


Add to basket














The Coming Woman: A ...

£17.93


    
        In stock
    


Add to basket














The Boys in the ...

£22.60


    
        In stock
    


Add to basket














The Black Maria

£52.15


    
        In stock
    


Add to basket














Starving Hearts (Triangular Trade ...

£13.99


    
        In stock
    


Add to basket














Shakespeare's Son












Immunity: How Elie Metchnikoff ...

£57.36


    
        In stock
    


Add to basket














I Hate Fairyland, Vol. ...

£29.17


    
        In stock
    


Add to basket














I am a Hero ...

£54.63


    
        In stock
    


Add to basket














How to Be Miserable: ...

£46.03


    
        In stock
    


Add to basket














Her Backup Boyfriend (The ...

£33.97


    
        In stock
    


Add to basket














Giant Days, Vol. 2 ...

£22.11


    
        In stock
    


Add to basket














Forever and Forever: The ...

£29.69


    
        In stock
    


Add to basket














First and First (Five ...

£15.97


    
        In stock
    


Add to basket














Fifty Shades Darker (Fifty ...

£21.96


    
        In stock
    


Add to basket














Everydata: The Misinformation Hidden ...

£54.35


    
        In stock
    


Add to basket














Don't Be a Jerk: ...

£37.97


    


## Bonus Challenge 1 - Making Your Spider Unblockable

Use techniques such as randomizing user agents and referers in your requests to reduce the likelihood that your spider is blocked by websites. [Here](http://blog.adnansiddiqi.me/5-strategies-to-write-unblock-able-web-scrapers-in-python/) is a great article to learn these techniques.
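As a rough sketch of the header randomization mentioned above: pick a user agent and a referer at random for each request. The strings in `USER_AGENTS` and `REFERERS` are illustrative values, and `random_headers` is a hypothetical helper. None of them come from the linked article.

```python
import random

# Illustrative values only; replace them with whatever suits your case.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]
REFERERS = [
    "https://www.google.com/",
    "https://www.bing.com/",
    "https://duckduckgo.com/",
]

def random_headers():
    """Build a headers dict with a randomly chosen user agent and referer."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": random.choice(REFERERS),
    }

headers = random_headers()
print(headers)
# The dict can then be sent with a request, e.g. requests.get(url, headers=headers)
```

Calling `random_headers()` inside `scrape_url()` would give each request its own combination.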

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [None]:
# your code here

## Bonus Challenge 2 - Making Asynchronous Calls

Implement asynchronous calls to `IronhackSpider`. You will make requests in parallel to complete your tasks faster.
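One possible way to run requests in parallel is a thread pool from the standard library. This is a swapped-in technique chosen for brevity, not the only option (`asyncio` is another). In the sketch below, `scrape_in_parallel` and `fake_scrape` are hypothetical names, and `fake_scrape` stands in for a real request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_in_parallel(urls, scrape_url, max_workers=5):
    """Call scrape_url on every url using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() returns the results in the same order as the input urls
        return list(executor.map(scrape_url, urls))

# Stand-in for a real request so the example needs no network access.
def fake_scrape(url):
    return "content of " + url

urls = ['http://quotes.toscrape.com/page/%s/' % page for page in range(1, 4)]
for result in scrape_in_parallel(urls, fake_scrape):
    print(result)
```

Because the results keep the input order, the pages can still be processed in sequence after they have been fetched concurrently.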

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [None]:
# your code here