## Inheriting the Spider

In [None]:
def inspect_class(c):
  newc = c()
  meths = dir(newc)
  if 'name' in meths:
    print("Your spider class name is:", newc.name)
  if 'from_crawler' in meths:
    print("It seems you have inherited methods from scrapy.Spider -- NICE!")
  else:
    print("Oh no! It doesn't seem that you are inheriting the methods from scrapy.Spider!!")
  

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider(scrapy.Spider):
  name = "your_spider"
  # start_requests method
  def start_requests(self):
    pass
  # parse method
  def parse(self, response):
    pass
  
# Inspect Your Class
inspect_class(YourSpider)

<p>When learning about <code>scrapy</code> spiders, we saw that the main portion of the code for us to adjust is the <code>class</code> for the spider. To help you build some familiarity with the class, you will complete a short piece of code that forms a toy model of the spider class. We've omitted the code that would actually <em>run</em> the spider, including only the pieces necessary to create the class.</p>
<p>As mentioned in the lesson, a <code>class</code> is roughly a collection of related variables and functions housed together. Sometimes one class likes to use methods from another class, and so we will <em>inherit</em> methods from a different class. That's what we do in the spider class.</p>
<p>We wrote the function <code>inspect_class</code> to look at your class once you're done, if you'd like to test your solution!</p>
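<p>The idea of inheriting methods can be sketched without <code>scrapy</code> at all. In the minimal example below, <code>BaseSpider</code>, <code>ToySpider</code>, and <code>describe</code> are made-up names used only for illustration; <code>BaseSpider</code> merely plays the role that <code>scrapy.Spider</code> plays in the exercise.</p>

```python
# Hypothetical stand-in for a library base class (this is NOT scrapy.Spider)
class BaseSpider:
    def describe(self):
        # Uses an attribute that the child class is expected to define
        return "I am a spider named " + self.name


# Naming BaseSpider in the parentheses makes ToySpider inherit its methods
class ToySpider(BaseSpider):
    name = "toy_spider"


spider = ToySpider()
# describe was never written inside ToySpider, yet it is available
inherited = hasattr(spider, "describe")
message = spider.describe()
print(inherited, message)
```

<p>Because <code>ToySpider</code> names <code>BaseSpider</code> in its parentheses, it can use <code>describe</code> even though it never defines that method itself.</p>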

<ul>
<li>Pass <code>scrapy.Spider</code> as an argument to the class <code>YourSpider</code>; this will make it so that <code>YourSpider</code> <em>inherits</em> the methods from <code>scrapy.Spider</code>.</li>
</ul>

<ul>
<li>As stated in the instructions, all you need to do is pass <code>scrapy.Spider</code> as an argument into <code>YourSpider</code>!</li>
</ul>

## Hurl the URLs

In [None]:
def inspect_class( c ):
  newc = c()
  meths = dir( newc )
  if 'start_requests' in meths:
    print( "The start_requests method yields the following urls:" )
    for u in newc.start_requests():
      print(  "\t-", u )

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    urls = ["https://www.datacamp.com", "https://scrapy.org"]
    for url in urls:
      yield url
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

<p>In the next lesson we will talk about the <code>start_requests</code> method within the spider class. In this quick exercise, we ask you to change around a variable within the <code>start_requests</code> method which foreshadows some of what we will be learning in the next lesson. Basically, we want you to start becoming comfortable turning some of the wheels within a spider class; in this case, making a list of <code>urls</code> within the <code>start_requests</code> method. </p>
<p>We've written a function <code>inspect_class</code> which will print out the list of elements you have in the <code>urls</code> variable within the <code>start_requests</code> method.</p>
<p><strong>Note</strong>: in the next several exercises, you will write code to complete your spider class, but the code does not yet include the pieces to actually <strong>run</strong> the spider; that will come at the end.</p>
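<p>A method that uses <code>yield</code> returns a generator, which is why <code>inspect_class</code> can loop over the result of <code>start_requests</code>. The sketch below illustrates that behavior with a plain, hypothetical class <code>ToyLister</code> rather than a real spider.</p>

```python
# Hypothetical class used only to illustrate how a yielding method behaves
class ToyLister:
    def start_requests(self):
        urls = ["https://www.datacamp.com", "https://scrapy.org"]
        for url in urls:
            yield url


gen = ToyLister().start_requests()  # no URL has been produced yet
first = next(gen)                   # runs the method body up to the first yield
remaining = list(gen)               # collects whatever has not been yielded yet
print(first)
print(remaining)
```

<p>Each value comes out only when something asks for it, so a <code>for</code> loop over the generator sees the URLs one at a time, in the order they appear in <code>urls</code>.</p>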

<ul>
<li>Fill in the blank within the <code>start_requests</code> method to assign the variable <code>urls</code> a list with the two strings: <code>"https://www.datacamp.com"</code> and <code>"https://scrapy.org"</code>.</li>
</ul>

<ul>
<li>Set <code>urls</code> in <code>start_requests</code> equal to a list with the two elements being the two strings in the instructions!</li>
</ul>

## Self Referencing is Classy

In [None]:
def inspect_class( c ):
  newc = c()
  try:
    newc.start_requests()
  except Exception:
    print( "Oh No! Something is wrong with the code! Keep trying." )

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    self.print_msg( "Hello World!" )
  # parse method
  def parse( self, response ):
    pass
  # print_msg method
  def print_msg( self, msg ):
    print( "Calling start_requests in YourSpider prints out:", msg )
  
# Inspect Your Class
inspect_class( YourSpider )

<p>You probably have noticed that within the spider class, we always input the argument <code>self</code> in the <code>start_requests</code> and <code>parse</code> methods (just look in the sample code in this exercise!). This allows us to reference between methods within the class. That is, if we want to refer to the method <code>parse</code> within the <code>start_requests</code> method, we would need to write <code>self.parse</code> rather than just <code>parse</code>; what writing <code>self</code> does is tell the code: "Look in the same class as <code>start_requests</code> for a method called <code>parse</code> to use." </p>
<p>In this exercise you will get a chance to play with this "self referencing".</p>
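<p>To see why the <code>self.</code> prefix matters, here is a small sketch using a hypothetical class <code>Greeter</code> (all names in it are invented for illustration). One method calls its sibling through <code>self</code>; the other tries to call it by its bare name and fails.</p>

```python
# Hypothetical class illustrating calls between methods of the same class
class Greeter:
    def greet(self):
        # Works: make_msg is looked up on the instance
        return self.make_msg("Hello")

    def greet_broken(self):
        # Fails: Python looks for a standalone function called make_msg
        return make_msg("Hello")

    def make_msg(self, word):
        return word + " World!"


g = Greeter()
ok = g.greet()

try:
    g.greet_broken()
    error_name = None
except NameError as err:
    error_name = type(err).__name__

print(ok, error_name)
```

<p>The bare name fails because methods live inside the class, not in the surrounding module, so Python cannot find them without being told which object to look in.</p>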

<ul>
<li>Fill in the <code>scrapy</code> object that the class <code>YourSpider</code> must inherit from in order to create the <code>scrapy</code> spider.</li>
<li>Pass the string argument <code>"Hello World!"</code> to fill in the blank in the <code>start_requests</code> method to use the <code>print_msg</code> method.</li>
</ul>

<ul>
<li>Remember that your scrapy spider needs to inherit <code>scrapy.Spider</code>.</li>
<li>Within <code>start_requests</code>, all you need to do is pass the <code>"Hello World!"</code> string as an argument to the call of <code>self.print_msg</code>.</li>
</ul>

## Starting with Start Requests

In [None]:
def inspect_class( c ):
  newc = c()
  try:
    y = list( newc.start_requests() )
    first_yield = y[0]
    print( "The url you would scrape is:", first_yield.url )
    cb = first_yield.callback
    print( "The name of the callback method you called is:", cb.__name__ )
  except Exception:
    print( "Oh No! Something is wrong with the code! Keep trying." )

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = "https://www.datacamp.com", callback = self.parse )
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

<p>In the last lesson we learned about setting up the <code>start_requests</code> method within a <code>scrapy</code> spider. Here we have another toy-model spider which doesn't actually scrape anything, but gives you a chance to play with the <code>start_requests</code> method. What we want is for you to start becoming familiar with the arguments you pass into the <code>scrapy.Request</code> call within <code>start_requests</code>.</p>
<p>As before, we have created the function <code>inspect_class</code> to examine what you are yielding in <code>start_requests</code>.</p>
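<p>Note that the callback is written as <code>self.parse</code>, without parentheses: the method itself is handed over so that it can be called later, once a response is available. The sketch below uses a plain, hypothetical class <code>ToySpider</code> (no <code>scrapy</code> involved) to show what such a method reference carries, and why <code>inspect_class</code> can read its <code>__name__</code>.</p>

```python
# Hypothetical class; it only illustrates passing a method around as a value
class ToySpider:
    def start(self):
        # No parentheses: return the method itself instead of calling it
        return self.parse

    def parse(self, response):
        return "parsed " + response


spider = ToySpider()
cb = spider.start()

callback_name = cb.__name__             # the name of the referenced method
owner_is_spider = cb.__self__ is spider  # the object the method is bound to
result = cb("page")                     # calling it later; self is filled in
print(callback_name, owner_is_spider, result)
```

<p>Writing <code>self.parse()</code> with parentheses would instead call the method immediately and pass along its return value, which is not what a callback argument expects.</p>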

<ul>
<li>Fill in the <code>scrapy</code> object that the class <code>YourSpider</code> must inherit from in order to create the <code>scrapy</code> spider.</li>
<li>Fill in the blank in the yielded <code>scrapy.Request</code> call within the <code>start_requests</code> method so that the URL this spider would start scraping is <code>"https://www.datacamp.com"</code> and would use the <code>parse</code> method (within the <code>YourSpider</code> class) as the method to parse the website.</li>
</ul>

<ul>
<li>Remember that you will use two arguments in the <code>scrapy.Request</code> call within <code>start_requests</code>: the first is <code>url</code> and the second is <code>callback</code>.</li>
<li>Remember to use <code>self.parse</code> to reference the method <code>parse</code> within the <code>start_requests</code> method.</li>
</ul>

## Pen Names

In [None]:
import requests
from scrapy.http import TextResponse

url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

def inspect_spider( s ):
  news = s()
  try:
    req = list( news.start_requests() )[0]
    url = req.url
    html = requests.get( url ).content
    response = TextResponse( url = url, body = html, encoding = 'utf-8' )
    author_names = req.callback( response )
    print( 'You have collected the author names:')
    for a in author_names:
      print('\t-', a )
  except Exception:
    print( 'Oh no! Something went wrong with the code. Keep trying!')

In [None]:
# Import the scrapy library
import scrapy

# Create the Spider class
class DCspider( scrapy.Spider ):
  name = 'dcspider'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  # parse method
  def parse( self, response ):
    # Create an extracted list of course author names
    author_names = response.css( 'p.course-block__author-name::text' ).extract()
    # Here we will just return the list of Authors
    return author_names
  
# Inspect the spider
inspect_spider( DCspider )

<p>In this exercise, we have set up a spider class which, when finished, will retrieve the author names from a shortened version of the DataCamp course directory. The URL for the shortened version is stored in the variable <code>url_short</code>. Your job will be to create the list of <strong>extracted</strong> author names in the <code>parse</code> method of the spider. </p>
<p>Two things you should know:</p>
<ul>
<li>You will be using the <code>response</code> object and the <code>css</code> method here. </li>
<li>The course author names are defined by the <strong>text</strong> within the paragraph <code>p</code> elements belonging to the class <code>course-block__author-name</code>.</li>
</ul>
<p>You can inspect the spider using the function <code>inspect_spider()</code> that we built for you -- it will print out the author names you find!</p>
<p><strong>Note that this and the remaining exercises in this chapter may take some time to load.</strong></p>

<ul>
<li>Fill in the required arguments to the parse method so that it will work as required when called in the <code>start_requests</code> method. </li>
<li>Within the <code>parse</code> method, <strong>create</strong> a variable <code>author_names</code>, which is a list of strings created by extracting the text from the paragraph elements belonging to the class <code>course-block__author-name</code>.</li>
</ul>

<ul>
<li>Remember that <code>parse</code> takes two arguments, the first of which is <code>self</code>.</li>
<li>When using <code>response.css</code> to get to the author names, make sure to extract the text.</li>
<li>Remember that in CSS Locator notation, use a period <code>.</code> when trying to select an element by class.</li>
<li>Remember that in CSS Locator notation, to select text, you use <code>::text</code> appropriately in the string.</li>
</ul>

## Crawler Time

In [None]:
import requests
from scrapy.http import TextResponse

url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

def print_first_descr( course_descrs ):
  print( "The first course description is: \n", course_descrs[0] )
  
def inspect_spider( s ):
  news = s()
  try:
    req1 = list( news.start_requests() )[0]
    html1 = requests.get( req1.url ).content
    response1 = TextResponse( url = req1.url, body = html1, encoding = 'utf-8' )
    req2 = list( news.parse( response1 ) )[0]
    html2 = requests.get( req2.url ).content
    response2 = TextResponse( url = req2.url, body = html2, encoding = 'utf-8' )
    for d in news.parse_descr( response2 ):
      print("One course description you found is:", d )
      break
  except Exception:
    print("Oh no! Something is wrong with the code. Keep trying!")

In [None]:
# Import the scrapy library
import scrapy

# Create the Spider class
class DCdescr( scrapy.Spider ):
  name = 'dcdescr'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  
  # First parse method
  def parse( self, response ):
    links = response.css( 'div.course-block > a::attr(href)' ).extract()
    # Follow each of the extracted links
    for link in links:
      yield response.follow( url = link, callback = self.parse_descr )
      
  # Second parsing method
  def parse_descr( self, response ):
    # Extract course description
    course_descr = response.css( 'p.course__description::text' ).extract_first()
    # For now, just yield the course description
    yield course_descr


# Inspect the spider
inspect_spider( DCdescr )

<p>This will be your first chance to play with a spider that crawls between pages (by first collecting links from one page, and then following those links to parse new pages). This spider starts at the shortened DataCamp course directory, then extracts the links of the courses in the <code>parse</code> method; from there, it follows those links and extracts the course description from each course page in the <code>parse_descr</code> method, yielding each description as it is found. Your job is to complete the code so that the spider runs as desired!</p>
<p>We have created a function <code>inspect_spider</code> which will print out one of the course descriptions you scrape (if done correctly)!</p>
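<p>One practical point when following links: the <code>href</code> values collected from a page are often relative, and <code>response.follow</code> resolves them against the URL of the page they came from. The resolution step itself can be sketched with the standard library alone; the page URL and links below are hypothetical.</p>

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were collected from
page_url = "https://www.example.com/courses/all"

# Hypothetical href values: root-relative, page-relative, and already absolute
links = [
    "/courses/intro-to-python",
    "intro-to-r",
    "https://www.example.org/other",
]

# Resolve every link against the page it was found on
absolute = [urljoin(page_url, link) for link in links]
for url in absolute:
    print(url)
```

<p>A link that is already absolute passes through unchanged, while the relative ones pick up the scheme and host of the page.</p>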

<ul>
<li>Fill in the two blanks below (one in each of the parsing methods) with the appropriate entries so that the spider can move from the first parsing method to the second correctly.</li>
</ul>

<ul>
<li>Remember that <code>response.follow</code> works in a similar way to <code>scrapy.Request</code>.</li>
<li>Don't forget that the first of two entries for a parsing method is <code>self</code>.</li>
</ul>

## Time to Run

In [None]:
url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

def previewCourses( dc_dict, n = 3 ):
  crs_titles = list( dc_dict.keys() )
  print( "A preview of DataCamp Courses:")
  print("---------------------------------------\n")
  for t in crs_titles[:n]:
    print( "TITLE: %s" % t)
    for i,ct in enumerate(dc_dict[t]):
      print("\tChapter %d: %s" % (i+1,ct) )
    print("")

In [None]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Chapter_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    crs_title_ext = crs_title.extract_first().strip()
    ch_titles = response.css('h4.chapter__title::text')
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    dc_dict[ crs_title_ext ] = ch_titles_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

<p>In the last lesson, we went through creating an entire web-crawler to access course information from each course in the DataCamp course directory. However, the lesson seemed to stop without a climax, because we didn't play with the code after finishing the parsing methods. </p>
<p>The point of this exercise is to remedy that!</p>
<p>The code we give you to look at in this and the next exercise is long, because it's the entire spider that took us the whole lesson to create! However, don't be intimidated! The point of these two exercises is to give you a <strong>very</strong> easy task to complete, with the hope that you will look at and run the code for this spider. That way, even though it is long, you will have a grasp of it!</p>

<ul>
<li>Fill in the one blank at the end of the <code>parse_pages</code> method to store the chapter titles in the dictionary, using the corresponding course title as the key.</li>
</ul>
<p><strong>NOTE</strong>: If you hit Run Code, you must Reset to Sample Code to successfully use Run Code again!!</p>
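<p>The dictionary is created outside the class so that it is still there to inspect after the spider has finished. A minimal sketch of that pattern, using a plain function and invented strings in place of the real parsing method and scraped text:</p>

```python
# Created at module level, so it outlives any single call that fills it
dc_dict = dict()


# Hypothetical stand-in for a parsing method; the inputs mimic raw scraped text
def parse_pages(raw_title, raw_chapters):
    title = raw_title.strip()                      # remove stray whitespace
    chapters = [c.strip() for c in raw_chapters]   # clean each chapter title
    dc_dict[title] = chapters                      # course title -> chapter list


parse_pages("  Intro to Toy Data \n", [" Getting Started ", "\nGoing Further"])
print(dc_dict)
```

<p>Assigning to a key of an existing module-level dictionary does not require a <code>global</code> statement, because the name <code>dc_dict</code> itself is never rebound inside the function.</p>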

<ul>
<li>The extracted chapter titles are within the variable <code>ch_titles_ext</code>.</li>
</ul>

## DataCamp Descriptions

In [None]:
url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

def previewCourses( dc_dict, n = 1 ):
  crs_titles = list( dc_dict.keys() )
  print( "A preview of DataCamp Courses:")
  print("---------------------------------------\n")
  for t in crs_titles[:n]:
    print( "TITLE: %s" % t)
    print("\tDescription: %s" % dc_dict[t] )
    print("")

In [None]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Description_Spider(scrapy.Spider):
  name = "dc_description_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    # Create a SelectorList of the course titles text
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    # Extract the text and strip it clean
    crs_title_ext = crs_title.extract_first().strip()
    # Create a SelectorList of course descriptions text
    crs_descr = response.css('p.course__description::text')
    # Extract the text and strip it clean
    crs_descr_ext = crs_descr.extract_first().strip()
    # Fill in the dictionary
    dc_dict[crs_title_ext] = crs_descr_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Description_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

<p>Like the previous exercise, the code here is long since you are working with an entire web-crawling spider! But again, don't let the amount of code intimidate you: you have a handle on how spiders work now, and you are perfectly capable of completing the easy task set for you here!</p>
<p>As in the previous exercise, we have created a function <code>previewCourses</code> which lets you preview the output of the spider, but you can always just explore the dictionary <code>dc_dict</code> too after you run the code.</p>
<p>In this exercise, you are asked to create a CSS Locator string that points directly to the text of the course description. All you need to know is that, on the course page, the course description text is within a paragraph <code>p</code> element which belongs to the class <code>course__description</code> (two underscores).</p>

<ul>
<li>Fill in the one blank below in the <code>parse_pages</code> method with a CSS Locator string that selects the text within the paragraph <code>p</code> element which belongs to the class <code>course__description</code>.</li>
</ul>
<p><strong>NOTE</strong>: If you hit Run Code, you must Reset to Sample Code to successfully use Run Code again!!</p>

<ul>
<li>Remember that using a period <code>.</code> in CSS Locator notation tells us to select by class.</li>
<li>Remember that to select text in CSS Locator notation, you need to use <code>::text</code> somewhere.</li>
</ul>

## Capstone Crawler

In [None]:
from scrapy.http import TextResponse
import requests

url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

html = requests.get( url_short ).content

response = TextResponse( url = url_short, body = html, encoding = 'utf-8' )

dc_dict = dict()

def previewCourses( dc_dict = dc_dict, n = 3 ):
  parse( self = None, response = response )
  crs_titles = list( dc_dict.keys() )
  print( "A preview of DataCamp Courses:")
  print("---------------------------------------\n")
  for t in crs_titles[:n]:
    print( "TITLE: %s" % t)
    print( "\tDESCRIPTION: %s" % dc_dict[t] )
    print("")
    
# Does this fix the testing issue?
self = None

def parse( self, response ):
  pass

<p>This exercise gives you a chance to show off what you've learned! In this exercise, you will write the parse function for a spider and then fill in a few blanks to finish off the spider. On the course directory page of DataCamp, each listed course has a title and a short course description. This spider will be used to scrape the course directory to extract the course titles and short course descriptions. You will not need to follow any links this time. Everything you need to know is:</p>
<ul>
<li>The course titles are defined by the text within an <code>h4</code> element whose class contains the string <code>block__title</code> (double underscore).</li>
<li>The short course descriptions are defined by the text within a paragraph <code>p</code> element whose class contains the string <code>block__description</code> (double underscore).</li>
</ul>
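<p>Once the titles and descriptions have been extracted as two lists, they still need to be paired up. Below is a minimal sketch of that last step only, assuming the two lists have the same length and are in the same order; the sample strings are invented, not scraped.</p>

```python
# Invented sample data standing in for two extracted lists of strings
titles = ["  Toy Course A ", "Toy Course B"]
descriptions = ["Learn about A.\n", "  Learn about B."]

dc_dict = {}
# zip pairs the first title with the first description, and so on
for title, descr in zip(titles, descriptions):
    dc_dict[title.strip()] = descr.strip()

print(dc_dict)
```

<p>If the two lists could differ in length, <code>zip</code> would silently stop at the shorter one, so it is worth comparing their lengths before relying on the pairing.</p>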