# Scrapy Tutorial  
## Authors: Pinyi Liu, Qiulu Shi, Hua Tong, Jiahuan Zou

# 1. Motivation
Scrapy is a web crawling framework that can be used for data mining and information processing. Its main objective is to extract and store items from a website by defining and running a spider, so that users can collect particular information from websites for further analysis. Scrapy also makes web scraping more efficient because it provides built-in support for extracting data from HTML/XML sources with CSS selectors, XPath expressions, and regular expressions, for generating feed exports in several formats (JSON, CSV, XML), and for downloading images automatically through the media pipeline. Part 3 covers installation; parts 4, 5, and 6 explain more detailed applications.




# 2. Context
## There are three alternative solutions.

## Alternative Solution 1: Twill
Twill is a simple language that allows users to browse the Web from a command-line interface. With Twill, a user can navigate through Web sites that use forms, cookies, and most standard Web features.

There are two simple ways to use Twill from Python. They are compatible with each other, so the user does not need to choose between them.

The first is to simply import all of the commands in commands.py and use them directly from Python. The second way to use Twill from Python is to talk to the Web browser directly by calling the get_browser() function.

## Alternative Solution 2: Beautiful Soup + urllib2
Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful.

First, Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.

Second, Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.

Third, Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility. 

The urllib2 module (Python 2; its functionality moved to urllib.request in Python 3) defines functions and classes which help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies and more.

## Alternative Solution 3: Write code yourself
Usually it takes no more than 50 lines of code to extract information from a website. However, since we may also want to save information from pages linked from the initial page, things can become harder. In addition, self-written code is usually less efficient than established packages, so the first two methods are better alternative solutions.
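As a rough illustration of the hand-written approach, the sketch below extracts link text and targets using only the Python 3 standard library. The `LinkExtractor` class and the sample HTML are hypothetical and stand in for a page that a real script would first download.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical page content; a real script would download it first,
# for example with urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<ul>
  <li><a href="/people/alice">Alice</a></li>
  <li><a href="/people/bob">Bob</a></li>
</ul>
"""

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Even this small version has to track parser state by hand, and it does not yet follow links, retry failed downloads, or export results, which is the work a framework takes over.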

# 3. Installation instructions, platform restriction and dependent libraries


## Installation Instructions
Scrapy and its dependencies can be installed from PyPI with: pip install Scrapy. (Run this command in Command Prompt on Windows, Anaconda Prompt if you use Anaconda/Jupyter Notebook, or Terminal on Mac.) It is strongly recommended to install Scrapy in a virtual environment to avoid conflicts with your system packages.

## Platform Restriction
At the time of writing (Scrapy 1.3), Scrapy runs on Python 2.7 and Python 3.3 or above, except on Windows, where Python 3 is not supported yet. This is because Twisted, a core requirement of Scrapy, does not support Python 3 on Windows.

## Dependent Libraries
Scrapy is written in pure Python and depends on a few key Python packages (among others):

   1) lxml, an efficient XML and HTML parser

   2) parsel, an HTML/XML data extraction library written on top of lxml,

   3) w3lib, a multi-purpose helper for dealing with URLs and web page encodings

   4) twisted, an asynchronous networking framework

   5) cryptography and pyOpenSSL, to deal with various network-level security needs

The minimal versions which Scrapy is tested against are:

   1) Twisted 14.0

   2) lxml 3.4

   3) pyOpenSSL 0.14

Scrapy may work with older versions of these packages but it is not guaranteed it will continue working because it’s not being tested against them.
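One way to compare an installed version against the minimums listed above is sketched below. The helper names `version_tuple` and `meets_minimum` are hypothetical, and the installed version strings passed in are made-up examples rather than real measurements.

```python
import re


def version_tuple(version):
    """Turn '14.0' or '3.4.0rc1' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum`."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '14' and '14.0' compare as equal.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


# Minimum versions quoted in the list above.
MINIMUMS = {"Twisted": "14.0", "lxml": "3.4", "pyOpenSSL": "0.14"}

print(meets_minimum("16.6.0", MINIMUMS["Twisted"]))  # made-up installed version
print(meets_minimum("3.3.5", MINIMUMS["lxml"]))      # made-up installed version
```

On Python 3.8 or later, the installed version string can be read with `importlib.metadata.version("lxml")` and passed to `meets_minimum`.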

# 4. Minimal Working Example
Scrapy is used to scrape web pages. In this example, we will build a crawler that scrapes the Simon Business School faculty directory and stores the output in a CSV file.

## (1) Create a project
Scrapy projects are created from the terminal (Mac) or Command Prompt (Windows).
The project lives in a folder that holds the project configuration file (scrapy.cfg) and Python files (items.py, pipelines.py, settings.py) that specify the contents of the project. We can edit these files to build the project.

In [1]:
import scrapy

In [2]:
# we will create a project named working_example
!scrapy startproject working_example 

New Scrapy project 'working_example', using template directory '//anaconda/lib/python3.5/site-packages/scrapy/templates/project', created in:
    /Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/working_example

You can start your first spider with:
    cd working_example
    scrapy genspider example example.com


In [3]:
!ls

Icon?              Scrapy Final.ipynb working_example


In [4]:
%cd working_example

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/working_example


In [5]:
!ls

Icon?           scrapy.cfg      working_example


In [6]:
!cat scrapy.cfg
# scrapy.cfg shows the configuration information

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = working_example.settings

[deploy]
#url = http://localhost:6800/
project = working_example


In [7]:
# show Available tool commands
!scrapy -h

Scrapy 1.3.2 - project: working_example

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands      
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command


## (2) Spider
Spiders are classes that you define and that Scrapy uses to scrape information from websites. 
You can use genspider or scrapy.spiders to create your spider. 


### 1. Create spider using genspider
Genspider is used to create a new spider in the current folder or in the current project’s spiders folder, if called from inside a project. The <name> parameter is set as the spider’s name, while <domain> is used to generate the allowed_domains and start_urls spider’s attributes.<br>
-Syntax: scrapy genspider [-t template] <name> <domain><br>
-genspider doesn't require project

In [8]:
!scrapy genspider firstspider firstspider.com
# you can use the scrapy command-line tool to create a new spider
# Note: some Scrapy commands (like crawl) must be run from inside a Scrapy project.

Created spider 'firstspider' using template 'basic' in module:
  working_example.spiders.firstspider


In [9]:
!scrapy genspider -l # show available templates

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed


In [10]:
!scrapy genspider example example.com
# create spider 'example' using template 'basic' in module

Created spider 'example' using template 'basic' in module:
  working_example.spiders.example


In [11]:
!scrapy genspider -t crawl scrapyorg scrapy.org
# created spider 'scrapyorg' using template 'crawl' in module

Created spider 'scrapyorg' using template 'crawl' in module:
  working_example.spiders.scrapyorg


### list
List all available spiders in the current project. The output is one spider per line.  
Syntax: scrapy list  
Requires project: yes

In [12]:
!scrapy list

example
firstspider
scrapyorg


### check
Syntax: scrapy check [-l] <spider><br>
check must be run inside an existing project

In [13]:
!scrapy check -l 
# -l only lists the contracts defined for each spider, without running them

In [14]:
!scrapy check
#check status of current spiders


----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK


### 2. Create spider using scrapy.Spider
You can also subclass scrapy.Spider to create a spider.<br>
Usually, a created spider is stored in spiders directory and executed by crawl command.

In [15]:
%cd working_example/spiders
# open spider directory

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/working_example/working_example/spiders


In [16]:
!ls
# shows spiders that already exist

Icon?          __pycache__    firstspider.py
__init__.py    example.py     scrapyorg.py


In [17]:
!touch Basic_spider.py
# create a new spider named Basic_spider
# write information about how to scrape a website into this spider document

We will create a spider named Basic_spider that extracts information from the Simon Business School faculty website.

In [18]:
%%writefile Basic_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'faulty'
    allowed_domains = ['simon.rochester.edu']
    start_urls = ['http://www.simon.rochester.edu/faculty-and-research/faculty-directory/index.aspx']

Overwriting Basic_spider.py


In [19]:
!touch Advanced_spider.py
# create another spider named Advanced_spider
# put more detailed information about how to scrape a website

Advanced_spider extracts only the names, and the links attached to them, from the Simon Business School faculty website.

In [20]:
%%writefile Advanced_spider.py
import scrapy


class MySpider(scrapy.Spider):
    name = 'faulty'
    allowed_domains = ['simon.rochester.edu']
    start_urls = ['http://www.simon.rochester.edu/faculty-and-research/faculty-directory/index.aspx']

    def parse(self, response):
        # response.xpath() replaces the deprecated HtmlXPathSelector / select() API
        for cell in response.xpath("//td[@class='name']"):
            name = cell.xpath("a/text()").extract()
            link = cell.xpath("a/@href").extract()
            # print the extracted name (not the selector object) with its link
            print(name, link)

Overwriting Advanced_spider.py


In [21]:
!ls
# now we have these two new spiders in our folder

Advanced_spider.py Icon?              __pycache__        firstspider.py
Basic_spider.py    __init__.py        example.py         scrapyorg.py


## (3) items
The items.py file contains an item class named after your project. It defines the fields you want to scrape from the web. You can rewrite the file using %%writefile in IPython.

In [22]:
%cd ..
!ls
# go back up to the project package directory, where items.py lives

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/working_example/working_example
Icon?          __pycache__    middlewares.py settings.py
__init__.py    items.py       pipelines.py   spiders


In [23]:
!cat items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WorkingExampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


We can rewrite the item class so that it contains the item name and the link that opens the item.

In [24]:
%%writefile items.py

from scrapy.item import Item, Field

class WorkingExampleItem(Item):
    title = Field() # title is the item name
    link = Field() # link is the link to open the item selling information

Overwriting items.py


In [25]:
%run items.py
# run items.py so that the item class is defined in the Jupyter notebook session
# the item class in items.py can also be used by newly created spiders

In [26]:
!cat items.py


from scrapy.item import Item, Field

class WorkingExampleItem(Item):
    title = Field() # title is the item name
    link = Field() # link is the link to open the item selling information

## (4) crawl
Syntax: scrapy crawl <spider>
Requires project: yes

In [27]:
# we can crawl the website by name defined by our spider
!scrapy crawl faulty

2017-02-21 00:04:02 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: working_example)
2017-02-21 00:04:02 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'working_example', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'working_example.spiders', 'SPIDER_MODULES': ['working_example.spiders']}
2017-02-21 00:04:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-02-21 00:04:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redir

In [28]:
# or we can use runspider command to scrape a website
%cd spiders
!scrapy runspider Basic_spider.py

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/working_example/working_example/spiders
2017-02-21 00:04:12 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: working_example)
2017-02-21 00:04:12 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['working_example.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'working_example', 'NEWSPIDER_MODULE': 'working_example.spiders'}
2017-02-21 00:04:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2017-02-21 00:04:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.

In [29]:
# capture the crawl's console output and save it, one line per row, as faulty.csv
output = !scrapy crawl faulty
import numpy as np
np.savetxt('faulty.csv',output, delimiter = ",", fmt = "%s")

In [30]:
!cat faulty.csv

2017-02-21 00:04:18 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: working_example)
2017-02-21 00:04:18 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['working_example.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'working_example', 'NEWSPIDER_MODULE': 'working_example.spiders'}
2017-02-21 00:04:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2017-02-21 00:04:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermidd

In [31]:
!ls
# now the output is stored in the spiders directory

Advanced_spider.py __init__.py        faulty.csv
Basic_spider.py    __pycache__        firstspider.py
Icon?              example.py         scrapyorg.py
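The export step above stores the crawl's console output line by line. If a table of names and links is wanted instead, one option is the standard `csv` module. The sketch below assumes the scraped values are already available as Python dictionaries; the records and the `write_rows` helper are hypothetical.

```python
import csv
import io

# Hypothetical (name, link) records standing in for what parse() extracts.
records = [
    {"name": "Smith, Ann", "link": "/faculty/a-smith"},
    {"name": "Jones, Bob", "link": "/faculty/b-jones"},
]


def write_rows(handle, rows):
    """Write the records as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=["name", "link"])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained; swap it for
# open("faculty.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
write_rows(buffer, records)
print(buffer.getvalue())
```

The writer quotes values that contain commas, so a name such as "Smith, Ann" stays in one column. If the spider yields items instead of printing them, the `-o` option shown in section 5 can also write the feed directly.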


# 5. Scrapy Use Cases
## Example One

In this example, we are going to use scrapy to scrape top-rated movies from imdb.

In [32]:
!pwd

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/working_example/working_example/spiders


First return to tutorial folder to demonstrate use cases.

In [33]:
%cd ../../..

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package


### 1. Creating a new Scrapy project
Before you can do anything else, you need to create a new Scrapy project.
Under your chosen directory, run the following code:

In [34]:
!scrapy startproject imdbTop

New Scrapy project 'imdbTop', using template directory '//anaconda/lib/python3.5/site-packages/scrapy/templates/project', created in:
    /Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/imdbTop

You can start your first spider with:
    cd imdbTop
    scrapy genspider example example.com


### 2. Define/write the spider

The class that does the crawling is called Spider. You write the spider, and Scrapy uses it to scrape information from a website or websites.  
Spiders must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.


In [35]:
# change directory to spiders
%cd imdbTop/imdbTop/spiders

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/imdbTop/imdbTop/spiders


In [36]:
!touch ratings_spider.py

The following is the code for our spider. We are trying to get the movie titles and the ratings for these top-rated movies.  
Save the code in a file named 'ratings_spider.py'. Put this file under the **imdbTop/spiders** directory in your project.

In [37]:
%%writefile ratings_spider.py
import scrapy


class RatingsSpider(scrapy.Spider):
    name = "ratings"

    
    start_urls = ['http://www.imdb.com/chart/top']
    
    def parse(self, response):
        for row in response.css('tr')[1:-2]:
            yield {
                'title': row.css('a::text').extract()[2],
                'rating': row.css('strong::text').extract_first()
            }


Overwriting ratings_spider.py


**name** Identifies your spider. Keep in mind that you cannot use the same name for multiple spiders in your project.  
**start_urls** Can take a list of urls. A shortcut to the default **start_requests** method.  
**parse** Handles the response downloaded for each of the requests made. 




### 3. Run the spider & extract data

Go to the top level of your project and run the following code:

In [38]:
!scrapy crawl ratings

2017-02-21 00:05:20 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: imdbTop)
2017-02-21 00:05:20 [scrapy.utils.log] INFO: Overridden settings: {'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['imdbTop.spiders'], 'NEWSPIDER_MODULE': 'imdbTop.spiders', 'BOT_NAME': 'imdbTop'}
2017-02-21 00:05:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-02-21 00:05:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'sc

### 4. Store scraped data

To store scraped data, just run the following command.

In [39]:
%cd ..

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/imdbTop/imdbTop


In [40]:
!scrapy crawl ratings -o ratings.json

2017-02-21 00:05:35 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: imdbTop)
2017-02-21 00:05:35 [scrapy.utils.log] INFO: Overridden settings: {'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'imdbTop', 'FEED_URI': 'ratings.json', 'NEWSPIDER_MODULE': 'imdbTop.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['imdbTop.spiders']}
2017-02-21 00:05:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2017-02-21 00:05:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlew

Now you have a .json file that contains the scraped movie ratings data.  
It is important to note that Scrapy appends to a given file instead of overwriting its contents, so if you run this command twice without removing the file before the second time, you might end up with a broken json file.

You can also store the scraped data in other formats such as JSON Lines.

In [41]:
!scrapy crawl ratings -o ratings.jl

2017-02-21 00:05:44 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: imdbTop)
2017-02-21 00:05:44 [scrapy.utils.log] INFO: Overridden settings: {'FEED_FORMAT': 'jl', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['imdbTop.spiders'], 'FEED_URI': 'ratings.jl', 'BOT_NAME': 'imdbTop', 'NEWSPIDER_MODULE': 'imdbTop.spiders'}
2017-02-21 00:05:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.corestats.CoreStats']
2017-02-21 00:05:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares
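The difference between the two formats when a file is appended to can be reproduced with the standard `json` module. The items below are hypothetical stand-ins for two separate crawl runs; the sketch only imitates the shape of the files and does not call Scrapy.

```python
import json

# Hypothetical items standing in for two separate crawl runs.
first_run = [{"title": "Movie A", "rating": "9.2"}]
second_run = [{"title": "Movie B", "rating": "9.0"}]

# A .json feed is one array per run, so appending a second run
# leaves two arrays back to back, which is not valid JSON.
appended_json = json.dumps(first_run) + json.dumps(second_run)
try:
    json.loads(appended_json)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# A .jl feed holds one object per line, so appended runs still parse.
appended_jl = "".join(json.dumps(item) + "\n" for item in first_run + second_run)
items = [json.loads(line) for line in appended_jl.splitlines()]

print(json_ok, len(items))
```

Because every line of a JSON Lines file is an independent record, the format also suits large outputs that are read one line at a time.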

## Example Two

In this example, we use an item pipeline to keep only the scraped movies produced after 2000 and drop the rest. We'll continue to use the code from the imdbTop example.

To define common output data format Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

### 1. Declaring Items

First we declare movie items in **imdbTop/items.py** using the following code:

In [42]:
!ls

Icon?          __pycache__    middlewares.py ratings.jl     settings.py
__init__.py    items.py       pipelines.py   ratings.json   spiders


In [43]:
%%writefile items.py
import scrapy

class Movie(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    rating = scrapy.Field()
    ranking = scrapy.Field()
    year = scrapy.Field()
    pass

Overwriting items.py


Then we add items to spider "ratings_spider_advanced.py". Here we added some new data to scrape, like ranking and year.

In [44]:
%cd spiders

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/imdbTop/imdbTop/spiders


In [45]:
!touch ratings_spider_advanced.py

In [46]:
%%writefile ratings_spider_advanced.py
import scrapy
from imdbTop.items import Movie

class RatingsSpider(scrapy.Spider):
    name = "ratings_advanced"

    start_urls = ['http://www.imdb.com/chart/top']
    
    def parse(self, response):
        movies = []
        for row in response.css('tr')[1:-2]:
            movie = Movie()
            movie['title'] = row.css('a::text').extract()[2]
            movie['rating'] = row.css('strong::text').extract_first()
            movie['ranking'] = row.css('td.titleColumn::text').extract()[0].replace('\n', '').replace(' ','').replace('.','')
            movie['year'] = row.css('span.secondaryInfo::text').extract()[0][1:-1]
            movies.append(movie)
        return movies

Overwriting ratings_spider_advanced.py


After an item has been scraped by a spider, it is then sent to the Item Pipeline, which processes it through several components that are executed sequentially.

We can define a pipeline in **imdbTop/pipelines.py**:

In [47]:
%cd ..

/Users/Ralph/Linda_Google_Drive/Simon_BA/2017_Winter/Advanced_BA/Team_Project/Python tool/Scrapy_Package/imdbTop/imdbTop


In [48]:
!ls

Icon?          __pycache__    middlewares.py ratings.jl     settings.py
__init__.py    items.py       pipelines.py   ratings.json   spiders


Now write the code into pipelines.py

In [49]:
%%writefile pipelines.py
from scrapy.exceptions import DropItem

class DropMoviePipeline(object):

    def process_item(self, movie, spider):
        if int(movie['year']) > 2000:
            return movie
        else:
            raise DropItem("Movie '%s' was released in or before 2000" % movie['title'])

Overwriting pipelines.py


In the pipeline defined above, we are returning the movies that were produced after year 2000.  

Finally, to activate an Item Pipeline component you must add its class to the ITEM_PIPELINES setting in **imdbTop/settings.py**:

In [50]:
%%writefile -a settings.py
ITEM_PIPELINES = {
   'imdbTop.pipelines.DropMoviePipeline': 300,
}

Appending to settings.py
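The integer assigned to each class in ITEM_PIPELINES determines the order in which components run: items pass through lower-valued components first, and values are customarily chosen in the 0 to 1000 range. The sketch below illustrates the ordering with a second, hypothetical component named `SaveMoviePipeline`, which does not exist in this tutorial's project.

```python
# Hypothetical second component to show ordering; only DropMoviePipeline
# exists in this tutorial's project.
ITEM_PIPELINES = {
    "imdbTop.pipelines.DropMoviePipeline": 300,
    "imdbTop.pipelines.SaveMoviePipeline": 800,
}

# Components run from the lowest number to the highest.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Placing the filtering component before a storing component means dropped movies are never stored.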


We have implemented all the code needed to filter the movies. Go to the top level of your project and run the following code:

In [51]:
!scrapy crawl ratings_advanced

2017-02-21 00:07:22 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: imdbTop)
2017-02-21 00:07:22 [scrapy.utils.log] INFO: Overridden settings: {'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['imdbTop.spiders'], 'NEWSPIDER_MODULE': 'imdbTop.spiders', 'BOT_NAME': 'imdbTop'}
2017-02-21 00:07:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-02-21 00:07:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'sc

Also, in your system command line, you could use the following code to store results in JSON file:

In [52]:
!scrapy crawl ratings_advanced -o movies_after_2000.json

2017-02-21 00:07:31 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: imdbTop)
2017-02-21 00:07:31 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'imdbTop', 'FEED_FORMAT': 'json', 'FEED_URI': 'movies_after_2000.json', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['imdbTop.spiders'], 'NEWSPIDER_MODULE': 'imdbTop.spiders'}
2017-02-21 00:07:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.feedexport.FeedExporter']
2017-02-21 00:07:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloa

# 6. Other interesting or useful features

## (1) Logging
Scrapy uses Python’s built-in logging system for event logging. Logging works out of the box, and can be configured to some extent with the Scrapy settings listed in Logging settings.

In [None]:
# a simple logging message using the logging.WARNING level:
import logging
logging.warning("This is a warning")

In [None]:
# you can put logging inside a spider
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://scrapinghub.com']
    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)
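Because Scrapy relies on the standard `logging` module, level filtering can be tried outside a crawl. The sketch below is plain Python: the logger name `demo_spider` is hypothetical, and setting the level by hand is meant to be comparable to choosing `LOG_LEVEL = 'WARNING'` in the Scrapy settings.

```python
import io
import logging

# Send records to an in-memory buffer so the effect is easy to inspect.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

logger = logging.getLogger("demo_spider")   # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.WARNING)            # comparable to LOG_LEVEL = 'WARNING'
logger.propagate = False                    # keep the demo output in one place

logger.info("Parse function called on %s", "http://example.com")  # filtered out
logger.warning("This is a warning")                               # kept

print(buffer.getvalue())
```

Only the warning reaches the buffer, because INFO is below the configured level.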

## (2) Sending email
Scrapy provides its own facility for sending e-mails which is very easy to use.  
It also provides a simple API for sending attachments and it’s very easy to configure, with a few settings.

#### syntax
class scrapy.mail.MailSender(smtphost=None, mailfrom=None, smtpuser=None, smtppass=None, smtpport=None)
send(to, subject, body, cc=None, attachs=(), mimetype='text/plain', charset=None)

- Parameters
    - to – the e-mail recipients
    - subject – the subject of the e-mail
    - cc – the e-mails to CC
    - body – the e-mail body
    - attachs(attach_name, mimetype, file_object) – an iterable of tuples  where attach_name is a string with the name that will appear on the e-mail’s attachment, mimetype is the mimetype of the attachment and file_object is a readable file object with the contents of the attachment
    - mimetype  – the MIME type of the e-mail
    - charset – the character encoding to use for the e-mail contents


- The following settings define the default constructor values of the MailSender class and can be used to configure e-mail sending:
    - MAIL_FROM
    Default: 'scrapy@localhost'
    Sender email to use (From: header) for sending emails.
    - MAIL_HOST
    Default: 'localhost'
    SMTP host to use for sending emails.
    - MAIL_PORT
    Default: 25
    SMTP port to use for sending emails.
    - MAIL_USER
    Default: None
    User to use for SMTP authentication. If disabled no SMTP authentication will be performed.
    - MAIL_PASS
    Default: None
    Password to use for SMTP authentication, along with MAIL_USER.

- Note: Scrapy does not support sending mail with Python 3.

In [None]:
from scrapy.mail import MailSender
mailer = MailSender()
mailer.send(to=["Pinyi.Liu@simon.rochester.edu"], subject="scrapy tool", body="Hi! This is scrapy", 
            cc=["rippleslpy@gamil.com"])
# NOTE: Scrapy does not support sending mail with Python 3.

## (3) Core API
    
 - The Scrapy core API is intended for developers of extensions and middlewares.
 The following are some typical parts of the API:
    - Crawler API
      - The Crawler object is the main entry point to the Scrapy API. It is passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it's the only way for extensions to access them and hook their functionality into Scrapy.

    - Settings API
      - It sets the key name and priority level of the default settings priorities used in Scrapy. Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take precedence over lesser ones when setting and retrieving values in the Settings class.

    - SpiderLoader API
      - It is in charge of retrieving and handling the spider classes defined across the project. Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an errorless execution.

## (4) Signals
Scrapy uses signals extensively to notify when certain events occur. You can catch some of those signals in your Scrapy project to perform additional tasks or extend Scrapy to add functionality not provided out of the box. You can connect to signals (or send your own) through the Signals API.

# 7. Summary and Personal Assessment 
Scrapy is a powerful crawling framework and is fairly easy to use, even for people who are not Python experts. One of its most outstanding advantages is that requests are scheduled and processed asynchronously: Scrapy does not have to wait for one request to finish before sending another, so it keeps working even when some requests fail or raise errors. Another attractive feature is that Scrapy can be easily modified according to users' needs and completes requests fairly quickly. In most cases, Scrapy is able to crawl 20 to 100 URLs per second. Scrapy also provides support for multiple types of spiders (BaseSpider, Sitemaps, etc.).

However, for large-scale web crawling, debugging Scrapy may take longer than debugging Java-based crawling frameworks.


# 8. References
https://doc.scrapy.org/en/latest/intro/overview.html  
https://www.youtube.com/watch?v=CsaqVQ4NIEU  
http://www.crummy.com/software/BeautifulSoup/  
http://docs.python.org/2/library/urllib2.html  
http://www.imdb.com/chart/top

