# Advanced Web Scraping Lab

In this lab you will first study the following code snippet, a simple web spider class that lets you scrape paginated webpages. Read the code, run it, and make sure you understand how it works. In the challenges of this lab, we will guide you in building up this class so that you eventually have a more robust web spider that you can develop further in the Web Scraping Project.

In [6]:
import requests
from bs4 import BeautifulSoup as bs # aliased as bs; the later cells use this alias

class IronhackSpider:
    """
    A simple spider for paginated websites. The constructor accepts the
    parameters below and stores them as instance variables so that the
    other methods can access them later.
    
    url_pattern: a format string for the urls to scrape, with a %s placeholder for the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If not greater than 0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    def scrape_url(self, url):
        """
        Scrape the content of a single url.
        """
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    def output_results(self, r):
        """
        Export the scraped content. Right now it simply prints the results,
        but in the future you can export the results into a text file or database.
        """
        print(r)
    
    def kickstart(self):
        """
        After the class is instantiated, call this function to start the scraping jobs.
        This function uses a FOR loop to call `scrape_url()` for each url to scrape.
        """
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

def quotes_parser(content):
    """
    This is a custom parser function you will complete in the challenge.
    Right now it simply returns the raw content passed to it. But in this lab
    you will complete this function so that it extracts the quotes.
    This function will be passed to the IronhackSpider class.
    """
    return content

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\

## Challenge 1 - Custom Parser Function

In this challenge, complete the custom `quotes_parser()` function so that the returned result contains the quote string instead of the whole html page content.
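
One way to approach this is to develop the parser against a small, fixed HTML string before pointing it at the live site. The sketch below assumes the quote text sits in `<span class="text">` elements, as on quotes.toscrape.com. The `SAMPLE_HTML` snippet and the name `parse_quotes` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in that imitates the structure of a quotes page.
SAMPLE_HTML = b"""
<div class="quote"><span class="text">First quote.</span><small class="author">A</small></div>
<div class="quote"><span class="text">Second quote.</span><small class="author">B</small></div>
"""

def parse_quotes(content):
    """Return the text of every <span class="text"> element in the page."""
    soup = BeautifulSoup(content, "html.parser")
    return [span.get_text(strip=True) for span in soup.find_all("span", class_="text")]

print(parse_quotes(SAMPLE_HTML))
```

Because the parser only receives the page content, it can be checked offline like this and then passed unchanged to the spider as `content_parser`.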

In the cell below, write your updated `quotes_parser()` function and kickstart the spider. Make sure the results being printed contain a list of quote strings extracted from the html content.

In [34]:
# your code here
# DEPENDENCIES
import requests
from bs4 import BeautifulSoup as bs

class IronhackSpider:
    # initialize the class with the arguments that are passed in
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    # this function scrapes the url, and parses the content
    def scrape_url(self, url):
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    # parser passed to the spider as content_parser; returns the text of every quote on the page
    @staticmethod
    def my_quotes_parser(site):
        soup = bs(site, "html.parser")
        return [element.text for element in soup.find_all('span', {'class':'text'})]
        
    def output_results(self, r):
        print(r)
    
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
      

URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape
    
    
    
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=IronhackSpider.my_quotes_parser)
my_spider.kickstart() # output_results() prints the parsed quotes for each page

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']


## Challenge 2 - Error Handling

In `IronhackSpider.scrape_url()`, catch any error that might occur when you make requests to scrape the webpage. This includes checking the response status code and catching http request errors such as timeout, SSL, and too many redirects.
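
A minimal sketch of this kind of error handling is shown below, assuming the `requests` library. The helper name `fetch` and its `get` parameter are hypothetical; `get` exists only so the function can be exercised without a network connection. `raise_for_status()` turns 4xx and 5xx responses into `requests.exceptions.HTTPError`, which the last `except` clause catches.

```python
import requests

def fetch(url, get=requests.get, timeout=10):
    """Return the page body as bytes, or None if the request failed."""
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
    except requests.exceptions.Timeout as err:
        print("Timed out: " + str(err))
    except requests.exceptions.TooManyRedirects as err:
        print("Too many redirects: " + str(err))
    except requests.exceptions.SSLError as err:
        print("SSL error: " + str(err))
    except requests.exceptions.RequestException as err:
        # Base class of the exceptions above; also covers HTTPError and connection errors.
        print("Request failed: " + str(err))
    else:
        return response.content
    return None
```

Returning `None` on failure lets the caller skip parsing instead of touching a response that was never assigned.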

In the cell below, place your entire code, including the updated `IronhackSpider` class and the code to kickstart the spider.

In [45]:
# your code here
# DEPENDENCIES
import requests
from bs4 import BeautifulSoup as bs

class IronhackSpider:
    # initialize the class with the arguments that are passed in
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    # CHALLENGE 2 - scrape the url, handle request errors, and parse the content
    def scrape_url(self, url):
        try:
            response = requests.get(url, timeout=10)
        except requests.exceptions.Timeout as err:
            print("Timed out after 10 seconds: " + str(err))
            return
        except requests.exceptions.TooManyRedirects as err:
            print("Tried to redirect you too many times: " + str(err))
            return
        except requests.exceptions.SSLError as err:
            print("This site is not secure: " + str(err))
            return
        except requests.exceptions.RequestException as err:
            print("Request failed: " + str(err))
            return
        
        print("Response Status Code is " + str(response.status_code))
        if response.status_code >= 300:
            print("Got an error: " + str(response.status_code))
            return # skip parsing when the page was not fetched successfully
        
        result = self.content_parser(response.content)
        self.output_results(result)
    
    # parser passed to the spider as content_parser; returns the text of every quote on the page
    @staticmethod
    def my_quotes_parser(site):
        soup = bs(site, "html.parser")
        return [element.text for element in soup.find_all('span', {'class':'text'})]
        
    def output_results(self, r):
        print(r)
    
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
      

URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape
    
    
    
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=IronhackSpider.my_quotes_parser)
my_spider.kickstart() # output_results() prints the parsed quotes for each page

Response Status Code is 200
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']


## Challenge 3 - Sleep Interval

In `IronhackSpider.kickstart()`, implement `sleep_interval`. You will check if `self.sleep_interval` is larger than 0. If so, tell the FOR loop to sleep the given amount of time before making the next request.
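
The idea can be sketched outside the class as follows. The function `crawl` and its `sleep` parameter are hypothetical names; `sleep` defaults to `time.sleep` and is a parameter only so the pauses can be observed without actually waiting. This version pauses between requests, so it skips the delay before the first one.

```python
import time

def crawl(urls, scrape, sleep_interval=-1, sleep=time.sleep):
    """Call scrape(url) for each url, pausing between requests when sleep_interval > 0."""
    for position, url in enumerate(urls):
        # Pause before every request except the first one.
        if sleep_interval > 0 and position > 0:
            sleep(sleep_interval)
        scrape(url)

visited = []
crawl(["page-1", "page-2", "page-3"], visited.append, sleep_interval=0.01)
print(visited)
```

Inside `IronhackSpider`, the same check would read `self.sleep_interval`, since the value is stored on the instance.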

In the cell below, place your entire code, including the updated `IronhackSpider` class and the code to kickstart the spider.

In [47]:
# your code here
# DEPENDENCIES
import time

import requests
from bs4 import BeautifulSoup as bs

class IronhackSpider:
    # initialize the class with the arguments that are passed in
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    # CHALLENGE 2 - scrape the url, handle request errors, and parse the content
    def scrape_url(self, url):
        try:
            response = requests.get(url, timeout=10)
        except requests.exceptions.Timeout as err:
            print("Timed out after 10 seconds: " + str(err))
            return
        except requests.exceptions.TooManyRedirects as err:
            print("Tried to redirect you too many times: " + str(err))
            return
        except requests.exceptions.SSLError as err:
            print("This site is not secure: " + str(err))
            return
        except requests.exceptions.RequestException as err:
            print("Request failed: " + str(err))
            return
        
        print("Response Status Code is " + str(response.status_code))
        if response.status_code >= 300:
            print("Got an error: " + str(response.status_code))
            return # skip parsing when the page was not fetched successfully
        
        result = self.content_parser(response.content)
        self.output_results(result)
    
    # parser passed to the spider as content_parser; returns the text of every quote on the page
    @staticmethod
    def my_quotes_parser(site):
        soup = bs(site, "html.parser")
        return [element.text for element in soup.find_all('span', {'class':'text'})]
        
    def output_results(self, r):
        print(r)
    
    # CHALLENGE 3 - integrate a sleep timer
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            if self.sleep_interval > 0:
                time.sleep(self.sleep_interval) # pause before making the next request
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=IronhackSpider.my_quotes_parser)
my_spider.kickstart() # output_results() prints the parsed quotes for each page

Response Status Code is 200
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']


## Challenge 4 - Test Batch Scraping

Change the `PAGES_TO_SCRAPE` value from `1` to `10`. Check whether your code still works as intended when scraping 10 webpages. If there are errors in your code, fix them.
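
Before sending ten requests, it can help to list the urls the spider will visit. The short sketch below mirrors the loop in `kickstart()` and only builds strings, so it makes no network calls.

```python
URL_PATTERN = 'http://quotes.toscrape.com/page/%s/'
PAGES_TO_SCRAPE = 10

# range(1, n + 1) yields the page numbers 1..n, matching the loop in kickstart().
urls = [URL_PATTERN % page for page in range(1, PAGES_TO_SCRAPE + 1)]

print(urls[0])
print(urls[-1])
print(len(urls))
```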

In the cell below, place your entire code, including the updated `IronhackSpider` class and the code to kickstart the spider.

In [49]:
# your code here
# DEPENDENCIES
import time

import requests
from bs4 import BeautifulSoup as bs

class IronhackSpider:
    # initialize the class with the arguments that are passed in
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    # CHALLENGE 2 - scrape the url, handle request errors, and parse the content
    def scrape_url(self, url):
        try:
            response = requests.get(url, timeout=10)
        except requests.exceptions.Timeout as err:
            print("Timed out after 10 seconds: " + str(err))
            return
        except requests.exceptions.TooManyRedirects as err:
            print("Tried to redirect you too many times: " + str(err))
            return
        except requests.exceptions.SSLError as err:
            print("This site is not secure: " + str(err))
            return
        except requests.exceptions.RequestException as err:
            print("Request failed: " + str(err))
            return
        
        print("Response Status Code is " + str(response.status_code))
        if response.status_code >= 300:
            print("Got an error: " + str(response.status_code))
            return # skip parsing when the page was not fetched successfully
        
        result = self.content_parser(response.content)
        self.output_results(result)
    
    # parser passed to the spider as content_parser; returns the text of every quote on the page
    @staticmethod
    def my_quotes_parser(site):
        soup = bs(site, "html.parser")
        return [element.text for element in soup.find_all('span', {'class':'text'})]
        
    def output_results(self, r):
        print(r)
    
    # CHALLENGE 3 - integrate a sleep timer
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            if self.sleep_interval > 0:
                time.sleep(self.sleep_interval) # pause before making the next request
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 10 # how many webpages to scrape, updated for CHALLENGE 4

my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=IronhackSpider.my_quotes_parser)
my_spider.kickstart() # output_results() prints the parsed quotes for each page


Response Status Code is 200
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
Response Status Code is 200
["“This 

Response Status Code is 200
['“If I had a flower for every time I thought of you...I could walk through my garden forever.”', '“Some people never go crazy. What truly horrible lives they must lead.”', '“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”', '“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”', "“What really knocks me out is a book that, when you're all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn't happen much, though.”", '“The reason I talk to myself is because I’m the only one whose answers I accept.”', "“You may say I'm a dreamer, but I'm not the only one. I hope someday you'll join us. And the world will live as one.”", '“I am free of all prejudice. I hate everyone equally. ”', "“The question isn't who is going to let me; it's

## Challenge 5 - Scrape a Different Website

Update the parameters passed to the `IronhackSpider` constructor so that your code can crawl [books.toscrape.com](http://books.toscrape.com/). You will need to use a different `URL_PATTERN` (figure out the new url pattern by yourself) and write another parser function to be passed to `IronhackSpider`.
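
On the books site, the visible link text inside each `<h3>` is shortened with an ellipsis, while the link's `title` attribute carries the full title. The sketch below assumes that markup (`<h3><a title="...">`). The sample HTML is a hand-written stand-in and `parse_book_titles` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in that imitates two entries of a catalogue page.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
<article class="product_pod">
  <h3><a href="#" title="Olio">Olio</a></h3>
</article>
"""

def parse_book_titles(content):
    """Return the full title of every book listed on the page."""
    soup = BeautifulSoup(content, "html.parser")
    titles = []
    for heading in soup.find_all("h3"):
        link = heading.find("a")
        if link is not None:
            # Prefer the title attribute; fall back to the visible text.
            titles.append(link.get("title") or link.get_text(strip=True))
    return titles

print(parse_book_titles(SAMPLE_HTML))
```

Reading the heading text instead, as in `element.text`, returns the shortened form shown on the page.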

In the cell below, place your entire code, including the updated `IronhackSpider` class and the code to kickstart the spider.

In [52]:
# your code here
# DEPENDENCIES
import time

import requests
from bs4 import BeautifulSoup as bs

class IronhackSpider:
    # initialize the class with the arguments that are passed in
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    # CHALLENGE 2 - scrape the url, handle request errors, and parse the content
    def scrape_url(self, url):
        try:
            response = requests.get(url, timeout=10)
        except requests.exceptions.Timeout as err:
            print("Timed out after 10 seconds: " + str(err))
            return
        except requests.exceptions.TooManyRedirects as err:
            print("Tried to redirect you too many times: " + str(err))
            return
        except requests.exceptions.SSLError as err:
            print("This site is not secure: " + str(err))
            return
        except requests.exceptions.RequestException as err:
            print("Request failed: " + str(err))
            return
        
        print("Response Status Code is " + str(response.status_code))
        if response.status_code >= 300:
            print("Got an error: " + str(response.status_code))
            return # skip parsing when the page was not fetched successfully
        
        result = self.content_parser(response.content)
        self.output_results(result)
    
    # parser for quotes.toscrape.com; returns the text of every quote on the page
    @staticmethod
    def my_quotes_parser(site):
        soup = bs(site, "html.parser")
        return [element.text for element in soup.find_all('span', {'class':'text'})]
    
    # CHALLENGE 5 - parser for books.toscrape.com; returns the heading text of every book on the page
    @staticmethod
    def my_booktitle_parser(site):
        soup = bs(site, "html.parser")
        table_content = soup.find('div', {'class':'col-sm-8 col-md-9'})
        return [element.text for element in table_content.find_all('h3')]
        
    def output_results(self, r):
        print(r)
    
    # CHALLENGE 3 - integrate a sleep timer
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            if self.sleep_interval > 0:
                time.sleep(self.sleep_interval) # pause before making the next request
            self.scrape_url(self.url_pattern % i)


# URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape
# PAGES_TO_SCRAPE = 10 # how many webpages to scrape, updated for CHALLENGE 4

URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html'
PAGES_TO_SCRAPE = 50 # 1,000 results, at 20 per page, gives 50 pages

my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=IronhackSpider.my_booktitle_parser)
my_spider.kickstart() # output_results() prints the parsed titles for each page


Response Status Code is 200
['A Light in the ...', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History ...', 'The Requiem Red', 'The Dirty Little Secrets ...', 'The Coming Woman: A ...', 'The Boys in the ...', 'The Black Maria', 'Starving Hearts (Triangular Trade ...', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little ...", 'Rip it Up and ...', 'Our Band Could Be ...', 'Olio', 'Mesaerion: The Best Science ...', 'Libertarianism for Beginners', "It's Only the Himalayas"]
Response Status Code is 200
['In Her Wake', 'How Music Works', 'Foolproof Preserving: A Guide ...', 'Chase Me (Paris Nights ...', 'Black Dust', 'Birdsong: A Story in ...', "America's Cradle of Quarterbacks: ...", 'Aladdin and His Wonderful ...', 'Worlds Elsewhere: Journeys Around ...', 'Wall and Piece', 'The Four Agreements: A ...', 'The Five Love Languages: ...', 'The Elephant Tree', 'The Bear and the ...', "Sophie's World", 'Penny Maybe', 'Maude (1883-1993):She Grew Up 

Response Status Code is 200
['Hold Your Breath (Search ...', 'Hamilton: The Revolution', 'Greek Mythic History', 'God: The Most Unpleasant ...', 'Glory over Everything: Beyond ...', 'Feathers: Displays of Brilliant ...', 'Far & Away: Places ...', 'Every Last Word', 'Eligible (The Austen Project ...', 'El Deafo', 'Eight Hundred Grapes', 'Eaternity: More than 150 ...', 'Eat Fat, Get Thin', "Don't Get Caught", 'Doctor Sleep (The Shining ...', 'Demigods & Magicians: Percy ...', 'Dear Mr. Knightley', 'Daily Fantasy Sports', 'Crazy Love: Overwhelmed by ...', 'Cometh the Hour (The ...']
Response Status Code is 200
['Code Name Verity (Code ...', 'Clockwork Angel (The Infernal ...', 'City of Glass (The ...', 'City of Fallen Angels ...', 'City of Bones (The ...', 'City of Ashes (The ...', 'Cell', 'Catching Jordan (Hundred Oaks)', 'Carry On, Warrior: Thoughts ...', 'Carrie', 'Buying In: The Secret ...', 'Brain on Fire: My ...', 'Batman: Europa', 'Barefoot Contessa Back to ...', 'Barefoot Contessa

Response Status Code is 200
['The Dream Thieves (The ...', 'The Darkest Corners', 'The Crossover', 'The 5th Wave (The ...', 'Tell the Wind and ...', 'Tell Me Three Things', 'Talking to Girls About ...', 'Siddhartha', 'Shiver (The Wolves of ...', 'Remember Me?', 'Red Dragon (Hannibal Lecter ...', 'Peak: Secrets from the ...', 'My Mother Was Nuts', 'Mexican Today: New and ...', 'Maybe Something Beautiful: How ...', 'Lola and the Boy ...', 'Logan Kade (Fallen Crest ...', 'Last One Home (New ...', 'Killing Floor (Jack Reacher ...', 'Kill the Boy Band']
Response Status Code is 200
['Isla and the Happily ...', 'If I Stay (If ...', 'I Know Why the ...', 'Harry Potter and the ...', 'Fruits Basket, Vol. 5 ...', 'Foundation (Foundation (Publication Order) ...', 'Fool Me Once', 'Find Her (Detective D.D. ...', 'Evicted: Poverty and Profit ...', 'Drama', 'Dracula the Un-Dead', 'Digital Fortress', 'Death Note, Vol. 5: ...', 'Data, A Love Story: ...', 'Critique of Pure Reason', 'Booked', 'Blue Lily, 

Response Status Code is 200
['Fruits Basket, Vol. 2 ...', 'Diary of a Minecraft ...', 'Y: The Last Man, ...', 'While You Were Mine', 'Where Lightning Strikes (Bleeding ...', "When I'm Gone", 'Ways of Seeing', 'Vampire Knight, Vol. 1 ...', 'Vampire Girl (Vampire Girl ...', 'Twenty Love Poems and ...', 'Travels with Charley: In ...', 'Three Wishes (River of ...', 'This One Moment (Pushing ...', 'The Zombie Room', 'The Wicked + The ...', 'The Tumor', 'The Story of Hong ...', 'The Silent Wife', 'The Silent Twin (Detective ...', 'The Selfish Gene']
Response Status Code is 200
['The Secret Healer', 'The Sandman, Vol. 1: ...', 'The Republic', 'The Odyssey', "The No. 1 Ladies' ...", 'The Nicomachean Ethics', 'The Name of the ...', 'The Mirror & the ...', 'The Little Prince', 'The Light of the ...', 'The Last Girl (The ...', 'The Iliad', 'The Hook Up (Game ...', 'The Haters', 'The Girl You Lost', 'The Girl In The ...', 'The End of the ...', 'The Edge of Reason ...', 'The Complete Maus (Maus ...

## Bonus Challenge 1 - Making Your Spider Unblockable

Use techniques such as randomizing user agents and referers in your requests to reduce the likelihood that your spider is blocked by websites. [Here](http://blog.adnansiddiqi.me/5-strategies-to-write-unblock-able-web-scrapers-in-python/) is a great article to learn these techniques.
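
A minimal sketch of header randomization using only the standard library is shown below. The user-agent strings and referer urls are placeholders, and `random_headers` is a hypothetical helper; substitute real values for actual use. Note that the HTTP header name is spelled `Referer`.

```python
import random

# Placeholder values for illustration only; substitute real user-agent strings and referers.
USER_AGENTS = ["ExampleAgent/1.0", "ExampleAgent/2.0", "ExampleAgent/3.0"]
REFERERS = ["https://www.example.com/", "https://www.example.org/"]

def random_headers(rng=random):
    """Build a headers dict with a randomly chosen User-Agent and Referer."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Referer": rng.choice(REFERERS),
    }

headers = random_headers()
print(headers)
```

Calling `random_headers()` once per request and passing the result as `requests.get(url, headers=headers)` varies the headers from page to page, rather than fixing them for the whole run.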

In the cell below, place your entire code, including the updated `IronhackSpider` class and the code to kickstart the spider.

In [None]:
# From the reference material: searching Google for "what is my user agent" returned
'''
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36

Paste that into a dictionary:
my_header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
          }
which can then be passed as an optional argument to the requests.get() function:
response = requests.get(url, headers=my_header)
'''

# a function adapted from the reference material that picks a random user agent
import random

def get_random_ua(ua_file='ua_file.txt'):
    """Return a random line from ua_file, or '' if the file cannot be read or is empty."""
    # ua_file is a text file with one user agent per line (put this file in the .gitignore)
    random_ua = ''
    try:
        with open(ua_file) as f:
            # keep only the non-empty lines, stripped of their trailing newline
            lines = [line.strip() for line in f if line.strip()]
        if lines: # at least one user agent was read from the file
            random_ua = random.choice(lines) # every line is equally likely to be picked
    except OSError as ex: # the file is missing or unreadable
        print('Exception in random_ua')
        print(str(ex))
    return random_ua

In [None]:
# your code here # not entirely sure if this works
# DEPENDENCIES
import requests
from bs4 import BeautifulSoup as bs
import numpy as np

class IronhackSpider:
    # initalize the class, with the arguments that are passed within
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None, HEADERS):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
        self.headers = headers
    
    # BONUS 1 - get a random User Agent
    def get_random_ua():
        random_ua = '' # call it an empty string
        ua_file = 'ua_file.txt' # label the destination, typically a text file that holds a bunch of user agents (put this file in the .gitignore)
        try:
            with open(ua_file) as f:
                lines = f.readlines() # try to open the file, reading all the lines within 
            if len(lines) > 0: # if something exists within this, if you successfully read the lines from the file
                prng = np.random.RandomState() # this is a random number generator
                index = prng.permutation(len(lines) - 1) # pick a random number within
                idx = np.asarray(index, dtype=np.integer)[0]
                random_ua = lines[int(idx)] # assign the data from that randomly selected line
        except Exception as ex: # unless things went wrong
            print('Exception in random_ua')
            print(str(ex))
        finally:
            return random_ua
    
    def get_referrer():
        referral_link = ''
        ref_file = 'ref_file.txt'
        try:
            with open(ref_file) as f:
                lines = f.readlines()
            if len(lines) > 0:
                prng = np.random.RandomState()
                index = prng.permutation(len(lines) - 1)
                idx = np.asarray(index, dtype=np.integer)[0]
                referral_link = lines[int(idx)]
        except Exception as err:
            print("Exception in referral_link")
            print(str(err))
        finally:
            return referall_link
        
    
    # CHALLENGE 2 this function scrapes the url, parses the content inside, and handles
    def scrape_url(self, url):
        
        try:
            response = requests.get(url, timeout=10, headers = HEADERS)
            print("Response Status Code is " + str(response.status_code))
            if response.status_code >= 300:
                print("Got an error: " + str(response.status_code))
        except requests.exceptions.Timeout as Terr:
            print("Timed out after over 10 seconds " + str(Terr))
            pass
        except requests.exceptions.TooManyRedirects as Rerr:
            print("Tried to redirect you too many times " + str(Rerr))
            pass
        except requests.exceptions.SSLError as SSLerr:
            print("This site is not secure " + str(SSLerr))
            pass
        except requests.exceptions.RequestException as e:
            print("Unknown error " + str(e))
        
        result = self.content_parser(response.content)
        self.output_results(result)
    
    # used for scrape_url() to find all the quote content
    def my_quotes_parser(site):
        soup = bs(site, "html.parser")
        return [element.text for element in soup.find_all('span', {'class':'text'})]
    
    def my_booktitle_parser(site):
        soup = bs(site, "html.parser")
        table_content = soup.find('div', {'class':'col-sm-8 col-md-9'})
        return [element.text for element in table_content.find_all('h3')]
        
    def output_results(self, r):
        print(r)
    
    # CHALLENGE 3 - integrate a sleep timer
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            if self.sleep_interval > 0: 
                sleep(sleep_interval) # do I need to pass 'self.sleep_interval' as the argument instead?
            self.scrape_url(self.url_pattern % i)
      

# URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # regex pattern for the urls to scrape
# PAGES_TO_SCRAPE = 10 # how many webpages to scrape, updated for CHALLENGE 4

URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html'
PAGES_TO_SCRAPE = 50 # 1,000 results, at 20 per page, gives 50 pages 
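# 'Referer' (single 'r') is the header name defined by HTTP; strip() drops the newline read from each file
HEADERS = {'User-Agent': IronhackSpider.get_random_ua().strip(), 'Referer': IronhackSpider.get_referrer().strip()}

my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=IronhackSpider.my_booktitle_parser, headers=HEADERS)
my_spider.kickstart() # prints each page's results through output_results()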
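
The helpers `get_random_ua()` and `get_referrer()` both pick one random line from a text file. A minimal sketch of that step using only the standard library might look like the following; the function name `pick_random_line` and its `default` parameter are hypothetical and not part of the lab's starter code. It skips blank lines and strips the trailing newline so that the value can be used in a request header.

```python
import random

def pick_random_line(path, default=''):
    """Return one random non-empty line from `path`, or `default` on failure."""
    try:
        with open(path, encoding='utf-8') as f:
            # keep only non-blank lines, without surrounding whitespace or newlines
            lines = [line.strip() for line in f if line.strip()]
    except OSError as err:
        # covers a missing or unreadable file
        print('Could not read ' + path + ': ' + str(err))
        return default
    # random.choice raises IndexError on an empty list, so guard against it
    return random.choice(lines) if lines else default
```

With this helper, `pick_random_line('ua_file.txt')` would play the role of `get_random_ua()`, assuming the file holds one user agent per line.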

# Bonus Challenge 2 - Making Asynchronous Calls

Implement asynchronous calls to `IronhackSpider`. You will make requests in parallel to complete your tasks faster.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.
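
One possible way to approach this, sketched under assumptions: keep the blocking download function and run it in worker threads through `asyncio`, with a semaphore limiting how many requests are in flight. The names `scrape_pages`, `fetch`, `max_concurrency`, and `fake_fetch` are hypothetical; `fake_fetch` stands in for a real download (for example a wrapper around `requests.get`) so the sketch runs without network access. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio

async def scrape_pages(url_pattern, pages_to_scrape, fetch, max_concurrency=5):
    """Fetch pages 1..pages_to_scrape concurrently; results come back in page order.

    `fetch` is any blocking function that takes a url and returns its content.
    """
    semaphore = asyncio.Semaphore(max_concurrency)  # cap simultaneous requests

    async def fetch_one(page):
        async with semaphore:
            # run the blocking call in a worker thread so the event loop stays free
            return await asyncio.to_thread(fetch, url_pattern % page)

    tasks = [fetch_one(page) for page in range(1, pages_to_scrape + 1)]
    return await asyncio.gather(*tasks)  # gather keeps the order of `tasks`


def fake_fetch(url):
    # placeholder for a real download
    return 'content of ' + url


results = asyncio.run(
    scrape_pages('http://books.toscrape.com/catalogue/page-%s.html', 3, fake_fetch)
)
print(results)
```

Inside a notebook cell an event loop is usually already running, so `asyncio.run(...)` raises a `RuntimeError` there; in that case `results = await scrape_pages(...)` can be used instead.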

In [None]:
# your code here
# DEPENDENCIES
import requests
from bs4 import BeautifulSoup as bs
import numpy as np
import asyncio
from time import sleep # used by kickstart() to pause between requests

class IronhackSpider:
    # initialize the class with the arguments that are passed in
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None, headers=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
        self.headers = headers
    
    # BONUS 1 - get a random User Agent
    @staticmethod
    def get_random_ua():
        random_ua = '' # fall back to an empty string if no user agent can be read
        ua_file = 'ua_file.txt' # text file holding one user agent per line (put this file in the .gitignore)
        try:
            with open(ua_file) as f:
                lines = f.readlines() # read every line of the file
            if len(lines) > 0: # only pick a line if the file was not empty
                prng = np.random.RandomState() # random number generator
                idx = prng.randint(len(lines)) # random index from 0 to len(lines) - 1
                random_ua = lines[idx].strip() # drop the trailing newline, which is not valid in a header value
        except Exception as ex: # e.g. the file is missing or unreadable
            print('Exception in random_ua')
            print(str(ex))
        return random_ua
    
    # pick a random referrer link the same way as the user agent
    @staticmethod
    def get_referrer():
        referral_link = ''
        ref_file = 'ref_file.txt'
        try:
            with open(ref_file) as f:
                lines = f.readlines()
            if len(lines) > 0:
                prng = np.random.RandomState()
                idx = prng.randint(len(lines)) # random index from 0 to len(lines) - 1
                referral_link = lines[idx].strip() # drop the trailing newline
        except Exception as err:
            print("Exception in referral_link")
            print(str(err))
        return referral_link
    
    # CHALLENGE 2 - scrape the url, handle request errors, and parse the content
    def scrape_url(self, url):
        try:
            response = requests.get(url, timeout=10, headers=self.headers)
            print("Response Status Code is " + str(response.status_code))
            if response.status_code >= 300:
                print("Got an error: " + str(response.status_code))
                return # do not parse a page that failed to load
        except requests.exceptions.Timeout as Terr:
            print("Timed out after 10 seconds " + str(Terr))
            return
        except requests.exceptions.TooManyRedirects as Rerr:
            print("Tried to redirect you too many times " + str(Rerr))
            return
        except requests.exceptions.SSLError as SSLerr:
            print("This site is not secure " + str(SSLerr))
            return
        except requests.exceptions.RequestException as e:
            print("Unknown error " + str(e))
            return
        
        # only reached when the request succeeded, so `response` is always defined here
        result = self.content_parser(response.content)
        self.output_results(result)
    
    # parsers passed to the spider as `content_parser`; scrape_url() calls them with the raw page content
    @staticmethod
    def my_quotes_parser(site):
        soup = bs(site, "html.parser")
        return [element.text for element in soup.find_all('span', {'class':'text'})]
    
    @staticmethod
    def my_booktitle_parser(site):
        soup = bs(site, "html.parser")
        table_content = soup.find('div', {'class':'col-sm-8 col-md-9'})
        return [element.text for element in table_content.find_all('h3')]
        
    def output_results(self, r):
        print(r)
    
    # CHALLENGE 3 - integrate a sleep timer
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            if self.sleep_interval > 0: 
                sleep(self.sleep_interval) # the interval is an instance attribute; requires `from time import sleep`
            self.scrape_url(self.url_pattern % i)
      

# URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format pattern for the urls to scrape (%s is the page number)
# PAGES_TO_SCRAPE = 10 # how many webpages to scrape, updated for CHALLENGE 4

URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html'
PAGES_TO_SCRAPE = 50 # 1,000 results, at 20 per page, gives 50 pages 
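# 'Referer' (single 'r') is the header name defined by HTTP; strip() drops the newline read from each file
HEADERS = {'User-Agent': IronhackSpider.get_random_ua().strip(), 'Referer': IronhackSpider.get_referrer().strip()}

my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=IronhackSpider.my_booktitle_parser, headers=HEADERS)
my_spider.kickstart() # prints each page's results through output_results()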
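
The comment on `PAGES_TO_SCRAPE` derives the page count by hand: 1,000 results at 20 per page gives 50 pages. As a small sketch, the same arithmetic and the `%` substitution used in `kickstart()` can be written as one helper; the name `build_page_urls` and its parameters are hypothetical.

```python
import math

def build_page_urls(url_pattern, total_results, results_per_page):
    """Return the url of every page needed to cover `total_results` items."""
    # round up so that a partially filled last page is still included
    page_count = math.ceil(total_results / results_per_page)
    # pages are numbered from 1, matching range(1, pages_to_scrape + 1) in kickstart()
    return [url_pattern % page for page in range(1, page_count + 1)]


urls = build_page_urls('http://books.toscrape.com/catalogue/page-%s.html', 1000, 20)
print(len(urls))  # 50
print(urls[0])    # http://books.toscrape.com/catalogue/page-1.html
```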