### Inheriting the Spider
When learning about scrapy spiders, we saw that the part of the code we mainly adjust is the spider class. To build some familiarity with that class, you will complete a short toy model of the spider class code. We've omitted the code that would actually run the spider and included only the pieces necessary to create the class.

As mentioned in the lesson, a class is roughly a collection of related variables and functions housed together. Sometimes one class needs the methods of another class, so it inherits them from that class. That's what we do in the spider class.
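To make the idea of inheritance concrete without involving scrapy at all, here is a minimal sketch. `BaseCrawler` and `ToySpider` are made-up names used only for illustration; `BaseCrawler` plays the role that scrapy.Spider plays in the exercise.

```python
# Toy stand-in for a library base class; BaseCrawler is hypothetical, not part of scrapy
class BaseCrawler:
    def describe(self):
        # Uses self.name, which the subclass is expected to define
        return "crawler named " + self.name


# Putting BaseCrawler in the parentheses makes ToySpider inherit its methods
class ToySpider(BaseCrawler):
    name = "toy_spider"


spider = ToySpider()
print(spider.describe())            # describe() was never written inside ToySpider
print("describe" in dir(spider))    # inherited methods show up in dir()
```

This is the same kind of check that inspect_class performs in the exercise: it looks through `dir()` of an instance for a method that only the parent class defines.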

We wrote the function inspect_class to look at your class once you're done, if you'd like to test your solution!

Pass scrapy.Spider as an argument to the class YourSpider; this will make it so that YourSpider inherits the methods from scrapy.Spider.

In [2]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider(scrapy.Spider):
  name = "your_spider"
  # start_requests method
  def start_requests(self):
    pass
  # parse method
  def parse(self, response):
    pass

def inspect_class(c):
  newc = c()
  meths = dir(newc)
  if 'name' in meths:
    print("Your spider class name is:", newc.name)
  if 'from_crawler' in meths:
    print("It seems you have inherited methods from scrapy.Spider -- NICE!")
  else:
    print("Oh no! It doesn't seem that you are inheriting the methods from scrapy.Spider!!")
      
# Inspect Your Class
inspect_class(YourSpider)

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!


### Hurl the URLs
In the next lesson we will talk about the start_requests method within the spider class. In this quick exercise, we ask you to change a variable within the start_requests method, which foreshadows some of what we will be learning in the next lesson. Basically, we want you to become comfortable turning some of the wheels within a spider class; in this case, making a list of urls within the start_requests method.

We've written a function inspect_class which will print out the list of elements you have in the urls variable within the start_requests method.

Note: in the next several exercises, you will write code to complete your spider class, but the code does not yet include the pieces to actually run the spider; that will come at the end.

Fill in the blank within the start_requests method to assign the variable urls a list with the two strings: "https://www.datacamp.com" and "https://scrapy.org".

In [5]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    urls = ["https://www.datacamp.com", "https://scrapy.org"]
    for url in urls:
      yield url
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!
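Because start_requests in the cell above contains `yield`, calling it does not run the loop right away; it returns a generator that hands out one value at a time. The sketch below shows that behaviour with a plain class (`ToySpider` here is a stand-in that deliberately does not inherit from scrapy.Spider).

```python
class ToySpider:
    # Plain class used only to illustrate yield; it does not inherit from scrapy.Spider
    def start_requests(self):
        urls = ["https://www.datacamp.com", "https://scrapy.org"]
        for url in urls:
            yield url


gen = ToySpider().start_requests()   # nothing inside the method has run yet
first = next(gen)                    # runs the method up to the first yield
rest = list(gen)                     # collects whatever is left
print(first)
print(rest)
```

Wrapping the call in `list(...)` is a convenient way to see everything a yielding method would produce.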


### Self Referencing is Classy
You probably have noticed that within the spider class, we always input the argument self in the start_requests and parse methods (just look in the sample code in this exercise!). This allows us to reference between methods within the class. That is, if we want to refer to the method parse within the start_requests method, we would need to write self.parse rather than just parse; what writing self does is tell the code: "Look in the same class as start_requests for a method called parse to use."

In this exercise you will get a chance to play with this "self referencing".
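As a small illustration of why the `self.` prefix matters, the sketch below uses a made-up class `Greeter` (not part of the exercise) with one method that calls `self.print_msg` and one that calls a bare `print_msg`.

```python
class Greeter:
    def start(self):
        # self.print_msg looks up print_msg on the same instance
        return self.print_msg("Hello World!")

    def start_without_self(self):
        # A bare print_msg is looked up as an ordinary variable, not on the class
        return print_msg("Hello World!")

    def print_msg(self, msg):
        return "Greeter says: " + msg


g = Greeter()
print(g.start())

try:
    g.start_without_self()
except NameError as err:
    # Python cannot find a plain variable called print_msg
    print("NameError:", err)
```

The second call fails because, without `self.`, Python searches local and global names only and never looks inside the class.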

Fill in the required scrapy object into the class YourSpider needed to create the scrapy spider.

Pass the string argument "Hello World!" to fill in the blank in the start_requests method to use the print_msg method.

In [6]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    self.print_msg( "Hello World!" )
  # parse method
  def parse( self, response ):
    pass
  # print_msg method
  def print_msg( self, msg ):
    print( "Calling start_requests in YourSpider prints out:", msg )
  
# Inspect Your Class
inspect_class( YourSpider )

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!


Self referencing in classes can be a bit confusing, but you nailed it! (And don't worry, we won't need to go much deeper into this subject in this course.)

### Starting with Start Requests
In the last lesson we learned about setting up the start_requests method within a scrapy spider. Here we have another toy-model spider which doesn't actually scrape anything, but gives you a chance to play with the start_requests method. What we want is for you to start becoming familiar with the arguments you pass into the scrapy.Request call within start_requests.

As before, we have created the function inspect_class to examine what you are yielding in start_requests.

Fill in the required scrapy object into the class YourSpider needed to create the scrapy spider.


Fill in the blank in the yielded scrapy.Request call within the start_requests method so that the URL this spider would start scraping is "https://www.datacamp.com" and would use the parse method (within the YourSpider class) as the method to parse the website.

In [7]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url="https://www.datacamp.com", callback=self.parse )
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!
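One detail worth noticing in the cell above is that `callback=self.parse` is written without parentheses: the method itself is handed over so that it can be called later, once a response exists. The sketch below illustrates this with `ToyRequest`, a hypothetical stand-in that only stores its arguments; it is not scrapy.Request and makes no network calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyRequest:
    # Hypothetical stand-in that only stores what it is given; not scrapy.Request
    url: str
    callback: Callable


class ToySpider:
    def start_requests(self):
        # self.parse without parentheses hands over the method itself
        yield ToyRequest(url="https://www.datacamp.com", callback=self.parse)

    def parse(self, response):
        return "parsed " + response


spider = ToySpider()
req = next(spider.start_requests())
print(req.url)
# Whoever holds the request can call the stored method later, once a response exists
print(req.callback("<html>...</html>"))
```

Writing `callback=self.parse(...)` instead would call the method immediately and store its return value rather than the method.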


### Pen Names
In this exercise, we have set up a spider class which, when finished, will retrieve the author names from a shortened version of the DataCamp course directory. The URL for the shortened version is stored in the variable url_short. Your job will be to create the list of extracted author names in the parse method of the spider.

Two things you should know:

- You will be using the response object and the css method here.
- The course author names are defined by the text within the paragraph p elements belonging to the class course-block__author-name.

You can inspect the spider using the function inspect_spider() that we built for you -- it will print out the author names you find!
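To see what "the text within the p elements belonging to a class" means on a small scale, here is a rough offline sketch. It deliberately swaps in Python's built-in `xml.etree.ElementTree` and its limited XPath support rather than scrapy's css method, and the markup and author names are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; the author names are placeholders
html = """
<div>
  <div class="course-block">
    <p class="course-block__author-name">Author One</p>
    <p class="course-block__description">Not an author</p>
  </div>
  <div class="course-block">
    <p class="course-block__author-name">Author Two</p>
  </div>
</div>
"""

root = ET.fromstring(html)
# Every p element, at any depth, whose class attribute equals the given string
matches = root.findall(".//p[@class='course-block__author-name']")
author_names = [p.text for p in matches]
print(author_names)
```

Note one difference from a CSS class selector: this stand-in requires the class attribute to equal the string exactly, whereas a CSS selector such as `p.course-block__author-name` also matches elements that carry additional classes.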

Note that this and the remaining exercises in this chapter may take some time to load.

Fill in the required arguments to the parse method so that it will work as required when called in the start_requests method.


Within the parse method, create a variable author_names, which is a list of strings created by extracting the text from the paragraph elements belonging to the class course-block__author-name. 

In [11]:
# Import the scrapy library
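import scrapy
# requests and TextResponse are used by inspect_spider below
import requests
from scrapy.http import TextResponse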


def inspect_spider( s ):
  news = s()
  try:
    req = list( news.start_requests() )[0]
    url = req.url
    html = requests.get( url ).content
    response = TextResponse( url = url, body = html, encoding = 'utf-8' )
    author_names = req.callback( response )
    print( 'You have collected the author names:')
    for a in author_names:
      print('\t-', a )
  except Exception:
    print( 'Oh no! Something went wrong with the code. Keep trying!')


# Create the Spider class
class DCspider( scrapy.Spider ):
  name = 'dcspider'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  # parse method
  def parse( self, response ):
    # Create an extracted list of course author names
    author_names = response.css('p.course-block__author-name::text').extract()
    # Here we will just return the list of Authors
    return author_names
  
# Inspect the spider
inspect_spider( DCspider )

Oh no! Something went wrong with the code. Keep trying!


### Crawler Time
This will be your first chance to play with a spider which will crawl between sites (by first collecting links from one site, and following those links to parse new sites). This spider starts at the shortened DataCamp course directory, then extracts the links of the courses in the parse method; from there, it will follow those links to extract the course descriptions from each course page in the parse_descr method, and put these descriptions into the list course_descrs. Your job is to complete the code so that the spider runs as desired!

We have created a function inspect_spider which will print out one of the course descriptions you scrape (if done correctly)!

Fill in the two blanks below (one in each of the parsing methods) with the appropriate entries so that the spider can move from the first parsing method to the second correctly.

In [12]:
# Import the scrapy library
import scrapy

# Create the Spider class
class DCdescr( scrapy.Spider ):
  name = 'dcdescr'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  
  # First parse method
  def parse( self, response ):
    links = response.css( 'div.course-block > a::attr(href)' ).extract()
    # Follow each of the extracted links
    for link in links:
      yield response.follow(url=link, callback=self.parse_descr)
      
  # Second parsing method
  def parse_descr( self, response ):
    # Extract course description
    course_descr = response.css( 'p.course__description::text' ).extract_first()
    # For now, just yield the course description
    yield course_descr


# Inspect the spider
inspect_spider( DCdescr )

Oh no! Something went wrong with the code. Keep trying!
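The links collected in the first parse method are often relative paths rather than full URLs. As far as the Scrapy documentation describes it, response.follow accepts such relative links and resolves them against the URL of the current response. The sketch below shows that kind of resolution with the standard library's `urljoin`; the page URL and course paths are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page URL and links; real course paths may look different
page_url = "https://www.datacamp.com/courses/all"
links = ["/courses/intro-to-python", "https://www.datacamp.com/courses/intro-to-r"]

resolved = [urljoin(page_url, link) for link in links]
for url in resolved:
    print(url)
```

A relative path is attached to the site's address, while a link that is already absolute passes through unchanged.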


### Time to Run
In the last lesson, we went through creating an entire web-crawler to access course information from each course in the DataCamp course directory. However, the lesson seemed to stop without a climax, because we didn't play with the code after finishing the parsing methods.

The point of this exercise is to remedy that!

The code we give you to look at in this and the next exercise is long, because it's the entire spider that took us the whole lesson to create! However, don't be intimidated! The point of these two exercises is to give you a very easy task to complete, with the hope that you will look at and run the code for this spider. That way, even though it is long, you will have a grasp of it!

Fill in the one blank at the end of the parse_pages method to assign the chapter titles to the dictionary whose key is the corresponding course title.


NOTE: If you hit Run Code, you must Reset to Sample Code to successfully use Run Code again!!

In [14]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Chapter_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    crs_title_ext = crs_title.extract_first().strip()
    ch_titles = response.css('h4.chapter__title::text')
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    dc_dict[ crs_title_ext ] = ch_titles_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()





def previewCourses( dc_dict, n = 3 ):
  crs_titles = list( dc_dict.keys() )
  print( "A preview of DataCamp Courses:")
  print("---------------------------------------\n")
  for t in crs_titles[:n]:
    print( "TITLE: %s" % t)
    for i,ct in enumerate(dc_dict[t]):
      print("\tChapter %d: %s" % (i+1,ct) )
    print("")
      
# Print a preview of courses
previewCourses(dc_dict)

2024-01-20 16:38:12 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-01-20 16:38:12 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.7, Platform Windows-11-10.0.22631-SP0
2024-01-20 16:38:12 [scrapy.addons] INFO: Enabled addons:
[]


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-01-20 16:38:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-01-20 16:38:12 [scrapy.extensions.telnet] INFO: Telnet Password: b8fe25301937252c
2024-01-20 16:38:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.lo

ReactorNotRestartable: 
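The dictionary-filling step at the end of parse_pages can be tried on its own, without starting a crawl. The sketch below uses invented strings shaped like freshly extracted text (with stray whitespace) and a separate dictionary named `toy_dict`, so it does not touch dc_dict.

```python
# Hypothetical raw strings shaped like extracted text, with stray whitespace
raw_title = "\n  Toy Course  \n"
raw_chapters = ["  Chapter A \n", "\tChapter B  "]

toy_dict = dict()
title = raw_title.strip()                       # clean the course title
chapters = [t.strip() for t in raw_chapters]    # clean every chapter title
toy_dict[title] = chapters                      # course title -> list of chapter titles

# Same layout that previewCourses prints
for course, chs in toy_dict.items():
    print("TITLE: %s" % course)
    for i, ch in enumerate(chs):
        print("\tChapter %d: %s" % (i + 1, ch))
```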

### DataCamp Descriptions
As in the previous exercise, the code here is long since you are working with an entire web-crawling spider! But again, don't let the amount of code intimidate you. You have a handle on how spiders work now, and you are perfectly capable of completing the easy task set for you here!

As in the previous exercise, we have created a function previewCourses which lets you preview the output of the spider, but you can always just explore the dictionary dc_dict too after you run the code.

In this exercise, you are asked to create a CSS Locator string that directs to the text of the course description. All you need to know is that on the course page, the course description text is within a paragraph p element which belongs to the class course__description (two underscores).

Fill in the one blank below in the parse_pages method with a CSS Locator string which directs to the text within the paragraph p element which belongs to the class course__description.


NOTE: If you hit Run Code, you must Reset to Sample Code to successfully use Run Code again!!

In [15]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Description_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    # Create a SelectorList of the course titles text
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    # Extract the text and strip it clean
    crs_title_ext = crs_title.extract_first().strip()
    # Create a SelectorList of course descriptions text
    crs_descr = response.css( 'p.course__description::text' )
    # Extract the text and strip it clean
    crs_descr_ext = crs_descr.extract_first().strip()
    # Fill in the dictionary
    dc_dict[crs_title_ext] = crs_descr_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Description_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

2024-01-20 16:39:39 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-01-20 16:39:39 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.7, Platform Windows-11-10.0.22631-SP0
2024-01-20 16:39:39 [scrapy.addons] INFO: Enabled addons:
[]
2024-01-20 16:39:39 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-01-20 16:39:39 [scrapy.extensions.telnet] INFO: Telnet Password: 70d8d17cfe686056
2024-01-20 16:39:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-01-20 16:39:39 [scrapy.crawler] INFO: Overridden settings:
{}
2024-01-20 16:39:39 [scrapy.middleware] INFO: Enabled downloader mi

ReactorNotRestartable: 
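One thing to keep in mind about the spider above: extract_first() returns None when the selector matches nothing, so chaining .strip() directly onto it would raise an AttributeError on a page without a description. A minimal sketch of a guard follows; `clean_text` is a hypothetical helper, and plain values stand in for real selector results.

```python
def clean_text(value, fallback=""):
    # value stands in for the result of extract_first(): a string or None
    if value is None:
        return fallback
    return value.strip()


print(repr(clean_text("  A short description.  ")))
print(repr(clean_text(None)))
```

Whether to store an empty string, skip the course, or log a warning in that case is a design choice; the fallback used here is only one option.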

### Capstone Crawler
This exercise gives you a chance to show off what you've learned! In this exercise, you will write the parse function for a spider and then fill in a few blanks to finish off the spider. On the course directory page of DataCamp, each listed course has a title and a short course description. This spider will be used to scrape the course directory to extract the course titles and short course descriptions. You will not need to follow any links this time. Everything you need to know is:

- The course titles are defined by the text within an h4 element whose class contains the string block__title (double underscore).
- The short course descriptions are defined by the text within a paragraph p element whose class contains the string block__description (double underscore).

Assign to the variable crs_titles the extracted list of course titles on the DataCamp course directory page. You should use the contains call within your XPath, and your XPath string should point to the text of the selected objects.


Assign to the variable crs_descrs the extracted list of short course descriptions. You should use the contains call within your XPath, and your XPath string should point to the text of the selected objects.


(Since we want a list of extracted data, we will use the extract() call rather than extract_first().)

In [16]:
# parse method
def parse(self, response):
  # Extracted course titles
  crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
  # Extracted course descriptions
  crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
  # Fill in the dictionary: it is the spider output
  for crs_title, crs_descr in zip(crs_titles, crs_descrs):
    dc_dict[crs_title] = crs_descr
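The loop above pairs titles with descriptions by position using `zip`. The sketch below runs the same pattern on placeholder lists (the course names are invented, and `toy_dict` is used instead of dc_dict) to show what happens when the two lists differ in length.

```python
# Placeholder lists standing in for the extracted titles and descriptions
crs_titles = ["Course A", "Course B", "Course C"]
crs_descrs = ["About A", "About B"]   # one description missing on purpose

toy_dict = dict()
for crs_title, crs_descr in zip(crs_titles, crs_descrs):
    toy_dict[crs_title] = crs_descr

print(toy_dict)
```

`zip` stops at the end of the shorter list, so "Course C" is dropped without any error. If the page ever lists a course without a description, the pairing by position could also shift, which is worth keeping in mind when checking the results.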

Fill in the four blanks below with the necessary entries to complete your spider.


Note: If you hit Run Code, you will need to Reset to Sample before hitting Run Code again.

In [17]:
# Import scrapy
import scrapy

# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class YourSpider(scrapy.Spider):
  name = 'yourspider'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request(url = url_short, callback=self.parse)
      
  def parse(self, response):
    # My version of the parser you wrote in the previous part
    crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
    crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
    for crs_title, crs_descr in zip( crs_titles, crs_descrs ):
      dc_dict[crs_title] = crs_descr
    
# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

2024-01-20 17:08:47 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-01-20 17:08:47 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.7, Platform Windows-11-10.0.22631-SP0
2024-01-20 17:08:47 [scrapy.addons] INFO: Enabled addons:
[]
2024-01-20 17:08:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-01-20 17:08:47 [scrapy.extensions.telnet] INFO: Telnet Password: 805c472ca4668cec
2024-01-20 17:08:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-01-20 17:08:47 [scrapy.crawler] INFO: Overridden settings:
{}
2024-01-20 17:08:47 [scrapy.middleware] INFO: Enabled downloader mi

ReactorNotRestartable: 