# Advanced Web Scraping Lab

In this lab you will first study the code snippet below, a simple web spider class that scrapes paginated webpages. Read the code, run it, and make sure you understand how it works. The challenges in this lab then guide you in building up this class, so that you end up with a more robust web spider that you can keep working on in the Web Scraping Project.

In [1]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    A simple web spider. The constructor accepts the parameters below and
    stores them as instance variables so that the class methods can access
    them later.
    
    url_pattern: the format-string pattern of the web urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format-string pattern for the urls to scrape; %s is the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is a custom parser function you will complete in the challenge.
Right now it simply returns the string passed to it. But in this lab
you will complete this function so that it extracts the quotes.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    return content

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\

## Challenge 1 - Custom Parser Function

In this challenge, complete the custom `quotes_parser()` function so that the returned result contains the quote strings instead of the whole HTML page content.

In the cell below, write your updated `quotes_parser()` function and kickstart the spider. Make sure the printed results contain a list of quote strings extracted from the HTML content.
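
One way to return plain strings rather than tag objects is to call `get_text()` on each matched element. The following is a minimal sketch, not the reference solution. It runs on a small inline sample (`SAMPLE_HTML`, made up here to mimic the `<span class="text">` markup of the page) so that it works without a network request, and the function name `quotes_to_strings` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the quote markup; not real page content
SAMPLE_HTML = b"""
<div class="quote">
    <span class="text" itemprop="text">First sample quote.</span>
    <span>by <small class="author">Someone</small></span>
</div>
<div class="quote">
    <span class="text" itemprop="text">Second sample quote.</span>
</div>
"""

def quotes_to_strings(content):
    """Return the text of every <span class="text"> element as a list of strings."""
    soup = BeautifulSoup(content, 'html.parser')
    spans = soup.find_all('span', class_='text')
    # get_text() drops the surrounding tag and keeps only the quote itself
    return [span.get_text(strip=True) for span in spans]

print(quotes_to_strings(SAMPLE_HTML))
```

A function with this shape can be passed to the spider through the `content_parser` argument in the same way as `quotes_parser`.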

In [4]:
# your code here
def quotes_parser(content):
    result = BeautifulSoup(content,'html.parser')
    content = [i for i in result.find_all('span',attrs={'class':'text'})]
    return content

In [5]:
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)
my_spider.kickstart()

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>, <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>, <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>, <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>, <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>, <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>, <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</spa

## Challenge 2 - Error Handling

In `IronhackSpider.scrape_url()`, catch any error that might occur when you make requests to scrape the webpage. This includes checking the response status code and catching HTTP request errors such as timeouts, SSL errors, and too many redirects.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.
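
The status-code part of this check can be tried out without touching the network. As a small sketch, the hypothetical helper `describe_status()` below uses the standard-library `http.HTTPStatus` enum to turn a numeric code into a readable message and treats only 2xx codes as success. How you react to a failure (skip the page, retry, stop the spider) is a design choice left to you.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (ok, message) for a numeric HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not a registered HTTP status
        phrase = 'Unknown Status'
    ok = 200 <= code < 300
    return ok, f'{code} {phrase}'

for code in (200, 404, 408, 500):
    print(describe_status(code))
```

Inside `scrape_url()`, a check like this would run on `response.status_code` after the request itself has succeeded.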

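In [9]:
# your code here
def scrape_url(self, url):
    """
    Scrape the content of a single url.
    """
    try:
        # timeout=10 is an arbitrary choice; without a timeout, requests can wait indefinitely
        response = requests.get(url, timeout=10)
    except requests.exceptions.Timeout:
        print(f'Timeout error: {url}')
    except requests.exceptions.TooManyRedirects:
        print(f'Too many redirects: {url}')
    except requests.exceptions.SSLError:
        print(f'SSL error: {url}')
    except requests.exceptions.RequestException as e:
        print(f'Request failed: {url} ({e})')
    else:
        if response.status_code == 200:
            result = self.content_parser(response.content)
            self.output_results(result)
        elif response.status_code == 404:
            print(f'Not found: {url}')
        elif response.status_code == 408:
            print(f'Timeout error: {url}')
        elif response.status_code == 500:
            print(f'Server not returning anything: {url}')
        else:
            print(f'Unexpected status code {response.status_code}: {url}')

# Attach the updated method to the IronhackSpider class defined above
IronhackSpider.scrape_url = scrape_url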

## Challenge 3 - Sleep Interval

In `IronhackSpider.kickstart()`, implement `sleep_interval`. You will check if `self.sleep_interval` is larger than 0. If so, tell the FOR loop to sleep the given amount of time before making the next request.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.
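
As an illustration of the idea, the sketch below pauses between consecutive requests but not before the first one, which is one possible reading of "before making the next request". The function `run_with_delay` and its `sleep` parameter are hypothetical; `sleep` exists only so that the delay can be swapped out when trying the code.

```python
import time

def run_with_delay(urls, handle, sleep_interval=-1, sleep=time.sleep):
    """Call handle(url) for each url, pausing sleep_interval seconds between calls."""
    for index, url in enumerate(urls):
        # Only delay when an interval is set and a request has already been made
        if sleep_interval > 0 and index > 0:
            sleep(sleep_interval)
        handle(url)

# Usage: record the pauses instead of actually sleeping
visited = []
pauses = []
run_with_delay(['page-1', 'page-2', 'page-3'], visited.append,
               sleep_interval=2, sleep=pauses.append)
print(visited)
print(pauses)
```

With three URLs and an interval of 2, the handler runs three times and the pause happens twice.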

In [10]:
# your code here
import time

def kickstart(self):
    for i in range(1, self.pages_to_scrape+1):
        if self.sleep_interval > 0:
            print(f'waiting {self.sleep_interval} seconds')
            time.sleep(self.sleep_interval)
        self.scrape_url(self.url_pattern % i)

# Attach the updated method to the IronhackSpider class defined above
IronhackSpider.kickstart = kickstart

## Challenge 4 - Test Batch Scraping

Change the `PAGES_TO_SCRAPE` value from `1` to `10`. Check whether your code still works as intended when scraping 10 webpages. If there are errors in your code, fix them.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.
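
Before running the full batch, it can help to look at the URLs the loop will request. This short sketch expands the same `URL_PATTERN` with the `%` operator for pages 1 through 10, in the same way `kickstart()` does; `build_urls` is a hypothetical helper name.

```python
URL_PATTERN = 'http://quotes.toscrape.com/page/%s/'
PAGES_TO_SCRAPE = 10

def build_urls(url_pattern, pages_to_scrape):
    """Return the list of page urls, numbered from 1 to pages_to_scrape."""
    return [url_pattern % page for page in range(1, pages_to_scrape + 1)]

urls = build_urls(URL_PATTERN, PAGES_TO_SCRAPE)
print(len(urls))
print(urls[0])
print(urls[-1])
```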

In [12]:
# your code here

import requests
import time
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    A simple web spider. The constructor accepts the parameters below and
    stores them as instance variables so that the class methods can access
    them later.
    
    url_pattern: the format-string pattern of the web urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
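    def scrape_url(self, url):
        try:
            # timeout=10 is an arbitrary choice; without a timeout, requests can wait indefinitely
            response = requests.get(url, timeout=10)
        except requests.exceptions.Timeout:
            print(f'Timeout error: {url}')
        except requests.exceptions.TooManyRedirects:
            print(f'Too many redirects: {url}')
        except requests.exceptions.SSLError:
            print(f'SSL error: {url}')
        except requests.exceptions.RequestException as e:
            print(f'Request failed: {url} ({e})')
        else:
            if response.status_code == 200:
                result = self.content_parser(response.content)
                self.output_results(result)
            elif response.status_code == 404:
                print(f'Not found: {url}')
            elif response.status_code == 408:
                print(f'Timeout error: {url}')
            elif response.status_code == 500:
                print(f'Server not returning anything: {url}')
            else:
                print(f'Unexpected status code {response.status_code}: {url}')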

        
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            if self.sleep_interval > 0:
                print(f'waiting for {self.sleep_interval} seconds')
                time.sleep(self.sleep_interval)
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format-string pattern for the urls to scrape; %s is the page number
PAGES_TO_SCRAPE = 10 # how many webpages to scrape

"""
This is the custom parser function passed to the IronhackSpider class.
It parses the HTML content and returns the <span class="text"> elements
that hold the quotes.
"""
def quotes_parser(content):
    result = BeautifulSoup(content,'html.parser')
    content = [i for i in result.find_all('span',attrs={'class':'text'})]
    return content

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser, sleep_interval=2)

# Start scraping jobs
my_spider.kickstart()

waiting for 2 seconds
[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>, <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>, <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>, <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>, <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>, <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>, <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for wh

[<span class="text" itemprop="text">“There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.”</span>, <span class="text" itemprop="text">“Do one thing every day that scares you.”</span>, <span class="text" itemprop="text">“I am good, but not an angel. I do sin, but I am not the devil. I am just a small girl in a big world trying to find someone to love.”</span>, <span class="text" itemprop="text">“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”</span>, <span class="text" itemprop="text">“If you only read the books that everyone else is reading, you can only think what everyone else is thinking.”</span>, <span class="text" itemprop="text">“The difference between genius and stupidity is: genius has its limits.”</span>, <span class="text" itemprop="text">“He's like a drug for you, Bella.”</span>, <span class="text

## Challenge 5 - Scrape a Different Website

Update the parameters passed to the `IronhackSpider` constructor so that your code can crawl [books.toscrape.com](http://books.toscrape.com/). You will need to use a different `URL_PATTERN` (figure out the new URL pattern by yourself) and write another parser function to be passed to `IronhackSpider`.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.
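
A parser for a different site follows the same shape as the quotes parser: build a soup, select the elements of interest, and return them. The sketch below is one possible approach, not the expected answer. It assumes that each book link carries its full name in a `title` attribute, and it runs on a made-up inline sample (`SAMPLE_HTML`); the real page markup may differ, and `book_titles` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up sample; the real page markup may differ
SAMPLE_HTML = """
<section>
    <a href="book-one_1/index.html"><img src="cover-one.jpg" alt="Book One"></a>
    <h3><a href="book-one_1/index.html" title="Book One: The Full Title">Book One: The ...</a></h3>
    <a href="book-two_2/index.html"><img src="cover-two.jpg" alt="Book Two"></a>
    <h3><a href="book-two_2/index.html" title="Book Two">Book Two</a></h3>
</section>
"""

def book_titles(content):
    """Return the title attribute of every <a> element that has one."""
    soup = BeautifulSoup(content, 'html.parser')
    # attrs={'title': True} keeps only links that define a title attribute,
    # which skips the image links in the sample
    links = soup.find_all('a', attrs={'title': True})
    return [link['title'] for link in links]

print(book_titles(SAMPLE_HTML))
```

Reading the `title` attribute instead of the link text avoids the shortened names that end in `...`.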

In [15]:
# your code here

import requests
import time
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    A simple web spider. The constructor accepts the parameters below and
    stores them as instance variables so that the class methods can access
    them later.
    
    url_pattern: the format-string pattern of the web urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
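    def scrape_url(self, url):
        try:
            # timeout=10 is an arbitrary choice; without a timeout, requests can wait indefinitely
            response = requests.get(url, timeout=10)
        except requests.exceptions.Timeout:
            print(f'Timeout error: {url}')
        except requests.exceptions.TooManyRedirects:
            print(f'Too many redirects: {url}')
        except requests.exceptions.SSLError:
            print(f'SSL error: {url}')
        except requests.exceptions.RequestException as e:
            print(f'Request failed: {url} ({e})')
        else:
            if response.status_code == 200:
                result = self.content_parser(response.content)
                self.output_results(result)
            elif response.status_code == 404:
                print(f'Not found: {url}')
            elif response.status_code == 408:
                print(f'Timeout error: {url}')
            elif response.status_code == 500:
                print(f'Server not returning anything: {url}')
            else:
                print(f'Unexpected status code {response.status_code}: {url}')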

        
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            if self.sleep_interval > 0:
                print(f'waiting for {self.sleep_interval} seconds')
                time.sleep(self.sleep_interval)
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # format-string pattern for the urls to scrape; %s is the page number
PAGES_TO_SCRAPE = 10 # how many webpages to scrape

"""
This is the custom parser function passed to the IronhackSpider class.
It parses the HTML content and returns the non-empty <a> elements inside
the first <section> of the page, which link to the books.
"""
def books_parser(content):
    result = BeautifulSoup(content,'html.parser')
    content = [i for i in result.find('section').find_all('a') if i.text != '']
    return content

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=books_parser, sleep_interval=2)

# Start scraping jobs
my_spider.kickstart()

waiting for 2 seconds
[<a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>, <a href="tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a>, <a href="soumission_998/index.html" title="Soumission">Soumission</a>, <a href="sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a>, <a href="sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a>, <a href="the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a>, <a href="the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a>, <a href="the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull">The Coming Woman: A ...</a>,

[<a href="the-nameless-city-the-nameless-city-1_940/index.html" title="The Nameless City (The Nameless City #1)">The Nameless City (The ...</a>, <a href="the-murder-that-never-was-forensic-instincts-5_939/index.html" title="The Murder That Never Was (Forensic Instincts #5)">The Murder That Never ...</a>, <a href="the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html" title="The Most Perfect Thing: Inside (and Outside) a Bird's Egg">The Most Perfect Thing: ...</a>, <a href="the-mindfulness-and-acceptance-workbook-for-anxiety-a-guide-to-breaking-free-from-anxiety-phobias-and-worry-using-acceptance-and-commitment-therapy_937/index.html" title="The Mindfulness and Acceptance Workbook for Anxiety: A Guide to Breaking Free from Anxiety, Phobias, and Worry Using Acceptance and Commitment Therapy">The Mindfulness and Acceptance ...</a>, <a href="the-life-changing-magic-of-tidying-up-the-japanese-art-of-decluttering-and-organizing_936/index.html" title="The Life-Changing Magic of

[<a href="algorithms-to-live-by-the-computer-science-of-human-decisions_880/index.html" title="Algorithms to Live By: The Computer Science of Human Decisions">Algorithms to Live By: ...</a>, <a href="a-world-of-flavor-your-gluten-free-passport_879/index.html" title="A World of Flavor: Your Gluten Free Passport">A World of Flavor: ...</a>, <a href="a-piece-of-sky-a-grain-of-rice-a-memoir-in-four-meditations_878/index.html" title="A Piece of Sky, a Grain of Rice: A Memoir in Four Meditations">A Piece of Sky, ...</a>, <a href="a-murder-in-time_877/index.html" title="A Murder in Time">A Murder in Time</a>, <a href="a-flight-of-arrows-the-pathfinders-2_876/index.html" title="A Flight of Arrows (The Pathfinders #2)">A Flight of Arrows ...</a>, <a href="a-fierce-and-subtle-poison_875/index.html" title="A Fierce and Subtle Poison">A Fierce and Subtle ...</a>, <a href="a-court-of-thorns-and-roses-a-court-of-thorns-and-roses-1_874/index.html" title="A Court of Thorns and Roses (A Court of Thorns

[<a href="modern-romance_820/index.html" title="Modern Romance">Modern Romance</a>, <a href="miss-peregrines-home-for-peculiar-children-miss-peregrines-peculiar-children-1_819/index.html" title="Miss Peregrine’s Home for Peculiar Children (Miss Peregrine’s Peculiar Children #1)">Miss Peregrine’s Home for ...</a>, <a href="louisa-the-extraordinary-life-of-mrs-adams_818/index.html" title="Louisa: The Extraordinary Life of Mrs. Adams">Louisa: The Extraordinary Life ...</a>, <a href="little-red_817/index.html" title="Little Red">Little Red</a>, <a href="library-of-souls-miss-peregrines-peculiar-children-3_816/index.html" title="Library of Souls (Miss Peregrine’s Peculiar Children #3)">Library of Souls (Miss ...</a>, <a href="large-print-heart-of-the-pride_815/index.html" title="Large Print Heart of the Pride">Large Print Heart of ...</a>, <a href="i-had-a-nice-time-and-other-lies-how-to-find-love-sht-like-that_814/index.html" title="I Had a Nice Time And Other Lies...: How to find love &am

## Bonus Challenge 1 - Making Your Spider Unblockable

Use techniques such as randomizing user agents and referers in your requests to reduce the likelihood that your spider is blocked by websites. [Here](http://blog.adnansiddiqi.me/5-strategies-to-write-unblock-able-web-scrapers-in-python/) is a great article to learn these techniques.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.
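
As a starting point, randomizing request headers can be as simple as picking values from a list on every request. The sketch below only builds the headers dictionary and makes no request. The user agent strings and referers are placeholder values invented for illustration, and `random_headers` is a hypothetical helper name.

```python
import random

# Placeholder values for illustration; substitute real, current strings in practice
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0',
    'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0',
]
REFERERS = [
    'https://www.google.com/',
    'https://www.bing.com/',
    'https://duckduckgo.com/',
]

def random_headers(rng=random):
    """Build a headers dict with a randomly chosen User-Agent and Referer."""
    return {
        'User-Agent': rng.choice(USER_AGENTS),
        'Referer': rng.choice(REFERERS),
    }

headers = random_headers()
print(headers)
```

The resulting dictionary can be passed to `requests.get()` through its `headers` argument, so `scrape_url()` would only need to build a fresh one before each request.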

In [None]:
# your code here