# CrawlerRunner Command Execution Explained
　　References: [Reading the Scrapy Source (2): Reading the Code](https://zhuanlan.zhihu.com/p/150120517)<br>
　　　　　[Scrapy's Crawl Flow: CrawlerProcess](https://blog.csdn.net/okm6666/article/details/89160886)<br>
　　　　　[Launching One or More Scrapy Spiders via the Core API](https://www.jianshu.com/p/add5c59d698a)<br>
　　　　　[Advanced Scrapy: How the Command Line Works (Using runspider as an Example)](https://www.jianshu.com/p/8e252b2272d8)<br>
　　　　　[Several Ways to Run Multiple Scrapy Spiders at Once (Custom Scrapy Project Commands)](https://www.cnblogs.com/rwxwsblog/p/4578764.html)<br>
　　　　　[Running Multiple Spiders from a Scrapy Project's spiders Directory at the Same Time](https://blog.csdn.net/beyond_f/article/details/74626451)<br>
　　　　　[Python Scrapy: Running Multiple Spiders from a Project's spiders Directory at the Same Time](https://blog.csdn.net/qq_38282706/article/details/80977576)<br>
　　　　　[Scrapy Startup Flow Source Analysis (2): The CrawlerProcess Main Process](https://blog.csdn.net/csdn_yym/article/details/85423656)<br>
   
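## 1. Introduction
　　From the previous section we know that when a scrapy command runs, it configures the project environment, parses the command line, and starts a main process to run the crawl job. This section looks at how the crawl job is executed once that process has started.<br>
　　CrawlerProcess is a subclass of CrawlerRunner. It controls Twisted's reactor (the reactor in Twisted is roughly the counterpart of the loop in asyncio, and a Deferred is roughly the counterpart of a Future), that is, the whole event loop. It configures the reactor, starts the event loop, and stops the reactor once all crawls have finished. It also installs signal handlers so that the user can terminate a crawl manually.<br>
　　The class is defined in scrapy/crawler.py, which contains three classes: Crawler, CrawlerRunner, and CrawlerProcess.<br>
　　Crawler is the class that actually performs the crawl. It manages its own start and stop and receives control signals, settings, and so on. Each instance uses one spider class and represents one crawl job. CrawlerRunner schedules Crawlers and requires the Twisted framework. CrawlerProcess is essentially a CrawlerRunner that also wraps the reactor Twisted needs, and it can drive several Crawlers running different crawl jobs at the same time. CrawlerProcess starts a Twisted reactor through its start method (it also provides shutdown signal handling and top-level logging).<br>
　　After a series of parsing steps, the execute() function calls the command's run() method, which uses the CrawlerProcess object to run the actual crawl job. Here the CrawlerProcess instance and the Crawler inside it call engine.open_spider on the Crawler's engine instance to do the preparation work (the scheduler and the other objects that must be instantiated per crawler), then call engine.start on the Crawler's engine to start the engine, and finally call CrawlerProcess.start to start the reactor.<br>
　　The crawl then formally begins. The flow chart is shown below:<br>
　　![CrawlerProcess crawl flow chart](./images/scrapy_Crawlerprocess流程图.png)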
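
　　The analogy drawn above between Twisted's reactor/Deferred and asyncio's loop/Future can be made concrete with a small standard-library sketch. It uses only asyncio (neither Twisted nor Scrapy), and the function name `run_until_done` and the result string are illustrative assumptions, not part of any real API.

```python
import asyncio


def run_until_done():
    """Run a tiny event loop until one future resolves; return callback output."""
    results = []
    loop = asyncio.new_event_loop()
    try:
        # A Future plays the role that a Deferred plays in Twisted.
        future = loop.create_future()
        # Comparable to Deferred.addCallback: runs once the result is available.
        future.add_done_callback(lambda f: results.append(f.result()))
        # Comparable to reactor.callLater: deliver the result a little later.
        loop.call_later(0.01, future.set_result, "page downloaded")
        # Comparable to reactor.run() with a stop callback: block until done.
        loop.run_until_complete(future)
    finally:
        loop.close()
    return results
```

　　Nothing happens when the callback is registered; work only proceeds once the loop is running, which mirrors why CrawlerProcess.start() is the step that really sets the crawl in motion.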
  
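## 2. Introduction to crawl.py and CrawlerRunner
### 2.1 Relationships among several important classes
　　\[CrawlerRunner　\[Crawler　\[\[Spider\],　\[ExecutionEngine　\[spider,slot　\[scheduler\],downloader, scraper\]\]\]\]\]<br>
　　1. Crawler can be thought of as a container for a spider<br>
　　2. CrawlerRunner wraps Crawler to make running spiders more convenient. CrawlerProcess is similar; it is a subclass of CrawlerRunner<br>
　　3. Spider is the class our spider files build on, and ExecutionEngine is the core of Scrapy's scheduling<br>
　　4. spider: the Spider object passed in from the Crawler<br>
　　5. slot: holds requests and handles their scheduling<br>
　　6. scheduler: usually an instance of scrapy.core.scheduler.Scheduler<br>
　　7. downloader: usually an instance of scrapy.core.downloader.Downloader<br>
　　8. scraper: usually an instance of scrapy.core.scraper.Scraper; it is tied to the Spider Middleware and Item Pipelines.<br>
　　**Note:**<br>
　　1. A spider is the crawler code module written by the programmer. It usually lives in the project's spiders folder, and each spider module is given its own name; on the command line, different names start different spiders;<br>
　　2. A crawler is a crawl job. Every launch from the command line creates a new crawler, and several crawlers can be created for the same spider, which on the command line means the same command can be run repeatedly. The crawlers for one spider share the same private configuration and the same task queue.<br>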
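
　　The nesting written out at the start of this subsection can be pictured with a toy model. Every class below is a hypothetical stand-in invented for illustration (the `Toy*` names, the string placeholders, and the `crawl` methods are assumptions); none of it is Scrapy's real code, it only mirrors which object holds which.

```python
from dataclasses import dataclass, field


@dataclass
class ToySlot:
    scheduler: str = "scheduler"  # the slot owns the scheduler


@dataclass
class ToyEngine:
    spider: str  # the spider instance handed over by the crawler
    slot: ToySlot = field(default_factory=ToySlot)
    downloader: str = "downloader"
    scraper: str = "scraper"


@dataclass
class ToyCrawler:
    spidercls: str  # the spider class this crawl job is built around
    engine: ToyEngine = None

    def crawl(self):
        # The engine is only created when the crawl starts.
        self.engine = ToyEngine(spider=f"{self.spidercls}()")
        return self.engine


@dataclass
class ToyRunner:
    crawlers: list = field(default_factory=list)

    def crawl(self, spidercls):
        # The runner creates one crawler per job and keeps track of it.
        crawler = ToyCrawler(spidercls)
        self.crawlers.append(crawler)
        return crawler.crawl()
```

　　For example, `ToyRunner().crawl("QuotesSpider")` returns an engine whose `slot.scheduler` is reachable only through the crawler that the runner is tracking.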

  
  
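## 3. Running via a custom scrapy command
　　Configuration reference: https://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/commands.html<br>
1. Create the commands directory<br>
>    mkdir commands<br>

　　Note: the commands and spiders directories are at the same level<br>
2. Add a file named crawlall.py under commands<br>
　　Note: this mainly adapts scrapy's crawl command so that all spiders are run together.<br>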

from scrapy.commands import ScrapyCommand
from scrapy.crawler import CrawlerRunner
from scrapy.exceptions import UsageError  # raised in process_options below
from scrapy.utils.conf import arglist_to_dict

class Command(ScrapyCommand):
  
    requires_project = True
  
    def syntax(self):  
        return '[options]'  
  
    def short_desc(self):  
        return 'Runs all of the spiders'  

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        #settings = get_project_settings()

        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print("*********crawlall spidername************" + spidername)
            self.crawler_process.crawl(spidername, **opts.spargs)   # set up the crawl (creates a new Crawler object)

        self.crawler_process.start()    # actually start running

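
　　The `-a NAME=VALUE` options collected in `opts.spargs` are turned into a dict in `process_options`. The following is a minimal standard-library sketch of that conversion, assuming it amounts to splitting each item on its first `=`; `parse_spider_args` is a hypothetical helper written for illustration, not Scrapy's `arglist_to_dict` itself.

```python
def parse_spider_args(arglist):
    """Turn ["NAME=VALUE", ...] into {"NAME": "VALUE", ...}."""
    # Split on the first "=" only, so values may themselves contain "=".
    # An item without "=" yields a one-element list, and dict() then
    # raises ValueError, which a command can report as a usage error.
    return dict(item.split("=", 1) for item in arglist)
```

　　With this sketch, `parse_spider_args(["category=books", "page=2"])` gives a dict of string values, which is why spider arguments arrive in the spider as strings.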

　　The key here is that self.crawler_process.spider_loader.list() returns all spiders in the project, and self.crawler_process.crawl is then used to run each of them<br>
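
　　The control flow of `run()` (pick the spider names, register each crawl, then start once) can be tried out without Scrapy. In the sketch below, `ToyProcess` and `run_all` are hypothetical stand-ins made up for illustration; the real CrawlerProcess runs the crawls concurrently inside the reactor rather than in a simple loop.

```python
class ToyProcess:
    """Hypothetical stand-in for CrawlerProcess: crawl() only registers work."""

    def __init__(self):
        self.scheduled = []  # (spider name, spider args) pairs waiting to run
        self.finished = []   # names of spiders that have been "run"

    def crawl(self, name, **spargs):
        self.scheduled.append((name, spargs))

    def start(self):
        # Nothing runs until start() is called, as with the real class.
        for name, _spargs in self.scheduled:
            self.finished.append(name)


def run_all(process, args, all_spiders, spargs):
    # "args or all_spiders": names given on the command line win,
    # otherwise every spider in the project is used.
    for name in args or all_spiders:
        process.crawl(name, **spargs)
    process.start()
    return process.finished
```

　　Calling `run_all(ToyProcess(), [], ["a", "b"], {})` runs both spiders, while passing `["b"]` as `args` restricts the run to that one.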

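　　3. Add an __init__.py file in the commands directory<br>

touch __init__.py<br>
　　Note: this step must not be skipped. This very problem cost me a whole day. Embarrassing... I suppose that is what I get for being self-taught.<br>

　　If it is omitted, the following exception is raised<br>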
>Traceback (most recent call last):<br>
　　File "/usr/local/bin/scrapy", line 9, in \<module\><br>
　　　　load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()<br>
　　File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute<br>
　　　　cmds = \_get_commands_dict(settings, inproject)<br>
　　File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in \_get_commands_dict<br>
　　　　cmds.update(\_get_commands_from_module(cmds_module, inproject))<br>
　　File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in \_get_commands_from_module<br>
　　　　for cmd in \_iter_command_classes(module):<br>
　　File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in \_iter_command_classes<br>
　　　　for module in walk_modules(module_name):<br>
　　File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules<br>
　　　　mod = import_module(path)<br>
　　File "/usr/local/lib/python2.7/importlib/\_\_init__.py", line 37, in import_module<br>
　　　　\_\_import__(name)<br>
ImportError: No module named commands<br>

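　　At first I could not find the cause no matter where I looked, and it cost me an entire day. In the end I got help from other users on http://stackoverflow.com/. Thanks once again to the almighty internet; how nice it would be without that wall! But I digress, back to the topic<br>

　　4. Create setup.py in the same directory as settings.py (removing this step makes no difference; I am not sure what specific purpose the official documentation has in describing it).<br>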

>from setuptools import setup, find_packages<br>
<br>
setup(name='scrapy-mymodule',<br>
　　entry_points={<br>
　　　　'scrapy.commands': [<br>
　　　　　　'crawlall=cnblogs.commands:crawlall',<br>
　　　　],<br>
　　},<br>
)<br>

　　This file defines a crawlall command: cnblogs.commands is the directory of command files, and crawlall is the command name.<br>

　　5. Add this setting to settings.py:<br>
>COMMANDS_MODULE = 'cnblogs.commands'<br>

　　6. Run the command scrapy crawlall<br>

# --coding:utf-8--
# scrapy.crawler.py
import logging
import pprint
import signal
import warnings
 
from twisted.internet import defer
from zope.interface.exceptions import DoesNotImplement
 
try:
    # zope >= 5.0 only supports MultipleInvalid
    from zope.interface.exceptions import MultipleInvalid
except ImportError:
    MultipleInvalid = None
 
from zope.interface.verify import verifyClass
 
from scrapy import signals, Spider
from scrapy.core.engine import ExecutionEngine
from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.extension import ExtensionManager
from scrapy.interfaces import ISpiderLoader
from scrapy.settings import overridden_settings, Settings
from scrapy.signalmanager import SignalManager
from scrapy.utils.log import (
    configure_logging,
    get_scrapy_root_handler,
    install_scrapy_root_handler,
    log_scrapy_info,
    LogCounterHandler,
)
from scrapy.utils.misc import create_instance, load_object
from scrapy.utils.ossignal import install_shutdown_handlers, signal_names
from scrapy.utils.reactor import install_reactor, verify_installed_reactor
 
# Scrapy source with explanatory comments
logger = logging.getLogger(__name__)

# The class that actually performs the crawl
class Crawler:

    def __init__(self, spidercls, settings=None):
        if isinstance(spidercls, Spider):
            # A class is required here, not an instance
            raise ValueError('The spidercls argument must be a class, not an object')

        if isinstance(settings, dict) or settings is None:
            # Convert to a Settings object
            settings = Settings(settings)
 
        self.spidercls = spidercls
        self.settings = settings.copy()
        # Then update the settings via the spider class's update_settings method: this pulls in the spider's custom_settings
        self.spidercls.update_settings(self.settings)

        self.signals = SignalManager(self)   # Create a SignalManager, which relies on the open-source Python library
        # pydispatch to send and route messages. Scrapy uses it to notify interested parties of key events such as crawl start and crawl end.
        # Messages are sent with send_catch_log_deferred, and handlers are registered with connect

        # Load the stats collector class named by STATS_CLASS in the settings
        self.stats = load_object(self.settings['STATS_CLASS'])(self)
        # Read LOG_LEVEL from the settings and add the new LogCounterHandler to logging.root
        handler = LogCounterHandler(self, level=self.settings.get('LOG_LEVEL'))
        logging.root.addHandler(handler)
        # Log every setting that has been overridden
        d = dict(overridden_settings(self.settings))
        logger.info("Overridden settings:\n%(settings)s",
                    {'settings': pprint.pformat(d)})
 
        if get_scrapy_root_handler() is not None:
            # scrapy root handler already installed: update it with new settings
            install_scrapy_root_handler(self.settings)
        # lambda is assigned to Crawler attribute because this way it is not
        # garbage collected after leaving __init__ scope
        self.__remove_handler = lambda: logging.root.removeHandler(handler)
        # Connect self.__remove_handler as the callback for the signals.engine_stopped signal
        self.signals.connect(self.__remove_handler, signals.engine_stopped)   # register the handler for the engine-stopped signal
        # Set up the log formatter
        lf_cls = load_object(self.settings['LOG_FORMATTER'])
        self.logformatter = lf_cls.from_crawler(self)
        # Extensions (not examined yet)
        self.extensions = ExtensionManager.from_crawler(self)    # set up the ExtensionManager
 
        self.settings.freeze()
        self.crawling = False
        self.spider = None
        self.engine = None
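        # spider and engine are only set to None here; they are instantiated in crawl()
    # With the defer.inlineCallbacks decorator, asynchronous work can be
    # written in a synchronous-looking style: after a Deferred is yielded,
    # the code that follows waits until that Deferred has fired successfully,
    # and the waiting time is handed back to the reactor.
    @defer.inlineCallbacks
    def crawl(self, *args, **kwargs):    # decorated with Twisted's defer.inlineCallbacks, so this function is non-blocking and runs asynchronously
        '''
        Calling Crawler.crawl starts one crawl job. The spider object is created through the spider's from_crawler method,
        so any spider class can use the crawler's methods and data; this is dependency injection. The spider code is written
        by the programmer, and besides calling the parent's from_crawler, a spider class may override this method to customize its construction.
        '''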
        if self.crawling:
            raise RuntimeError("Crawling already taking place")
        self.crawling = True
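        # Initialize the spider, engine, and start_requests for this crawl job
        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            # Get the initial requests from self.spider.start_requests()
            start_requests = iter(self.spider.start_requests())
            # Call the asynchronous method that prepares the spider for crawling
            yield self.engine.open_spider(self.spider, start_requests)   # have the execution engine open the spider
            # Call the engine's core start method and wrap its return value in a Deferred.
            # This starts the execution engine. Crawling has still not really begun at this point; this is still preparation done before CrawlerProcess.start().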
            yield defer.maybeDeferred(self.engine.start)
        except Exception:
            self.crawling = False
            if self.engine is not None:
                yield self.engine.close()
            raise
 
    def _create_spider(self, *args, **kwargs):
        # Call from_crawler(self, *args, **kwargs) on the spider class that was passed in
        return self.spidercls.from_crawler(self, *args, **kwargs)

    def _create_engine(self):
        # ExecutionEngine(self, lambda _: self.stop()): instantiate the engine, passing in the stop function
        return ExecutionEngine(self, lambda _: self.stop())

    # Reset self.crawling to False and issue the asynchronous self.engine.stop, wrapped in a Deferred
    @defer.inlineCallbacks
    def stop(self):
        """Starts a graceful stop of the crawler and returns a deferred that is
        fired when the crawler is stopped."""
        if self.crawling:
            self.crawling = False
            yield defer.maybeDeferred(self.engine.stop)
 
# In short: if your own application already uses a reactor, consider this class for starting and stopping spiders; otherwise use CrawlerProcess
class CrawlerRunner:
    """
    This is a convenient helper class that keeps track of, manages and runs
    crawlers inside an already setup :mod:`~twisted.internet.reactor`.
    The CrawlerRunner object must be instantiated with a
    :class:`~scrapy.settings.Settings` object.
    This class shouldn't be needed (since Scrapy is responsible of using it
    accordingly) unless writing scripts that manually handle the crawling
    process. See :ref:`run-from-script` for an example.
    """
    # property() builds a descriptor for this attribute; here the lambda serves as its getter and returns self._crawlers
    crawlers = property(
        lambda self: self._crawlers,
        doc="Set of :class:`crawlers <scrapy.crawler.Crawler>` started by "
            ":meth:`crawl` and managed by this class."
    )
    # Build a SpiderLoader instance from the settings
    @staticmethod
    def _get_spider_loader(settings):
        """ Get SpiderLoader instance from settings """
        cls_path = settings.get('SPIDER_LOADER_CLASS')
        # load_object(cls_path) resolves a dotted path such as xx.xx to the class it names
        loader_cls = load_object(cls_path)
        excs = (DoesNotImplement, MultipleInvalid) if MultipleInvalid else DoesNotImplement
        try:
            verifyClass(ISpiderLoader, loader_cls)
        except excs:
            warnings.warn(
                'SPIDER_LOADER_CLASS (previously named SPIDER_MANAGER_CLASS) does '
                'not fully implement scrapy.interfaces.ISpiderLoader interface. '
                'Please add all missing methods to avoid unexpected runtime errors.',
                category=ScrapyDeprecationWarning, stacklevel=2
            )
        # Returns an instance of the configured SpiderLoader class, built from a frozen copy of the current settings
        return loader_cls.from_settings(settings.frozencopy())
 
    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        self.spider_loader = self._get_spider_loader(settings)  
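        # Loads every spider class in the project and checks for duplicate names.
        self._crawlers = set()    # _crawlers holds all crawl jobs; each job is a Crawler object (created from
        # a spider name or passed in as an existing Crawler). By default there is only one spider's crawl job,
        # but a script can also run several spiders at once, that is, multiple spiders in one process
        self._active = set()     # holds the Deferred returned by each Crawler's crawl() call; Crawler.crawl()
        # returns a Deferred because it is decorated with @defer.inlineCallbacks.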
        self.bootstrap_failed = False
        self._handle_twisted_reactor()
 
    @property
    def spiders(self):
        warnings.warn("CrawlerRunner.spiders attribute is renamed to "
                      "CrawlerRunner.spider_loader.",
                      category=ScrapyDeprecationWarning, stacklevel=2)
        return self.spider_loader
 
    # Create a Crawler object and call its crawl method; the bookkeeping for the crawler's Deferred lives in _crawl
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        """
        Run a crawler with the provided arguments.
        It will call the given Crawler's :meth:`~Crawler.crawl` method, while
        keeping track of it so it can be stopped later.
        If ``crawler_or_spidercls`` isn't a :class:`~scrapy.crawler.Crawler`
        instance, this method will try to create one using this parameter as
        the spider class given to it.
        Returns a deferred that is fired when the crawling is finished.
        :param crawler_or_spidercls: already created crawler, or a spider class
            or spider's name inside the project to create it
        :type crawler_or_spidercls: :class:`~scrapy.crawler.Crawler` instance,
            :class:`~scrapy.spiders.Spider` subclass or string
        :param args: arguments to initialize the spider
        :param kwargs: keyword arguments to initialize the spider
        """
        if isinstance(crawler_or_spidercls, Spider):
            raise ValueError(
                'The crawler_or_spidercls argument cannot be a spider object, '
                'it must be a spider class (or a Crawler object)')
        crawler = self.create_crawler(crawler_or_spidercls)   
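        # Create the Crawler instance (from a spider name or class, or reuse an existing Crawler)
        return self._crawl(crawler, *args, **kwargs)
        # A Deferred is returned to CrawlerProcess. The Deferred is added to the _active set so the Crawler can be
        # stopped when necessary, and a _done callback is added to the Deferred to track when the Crawler finishes.

    def _crawl(self, crawler, *args, **kwargs):
        self.crawlers.add(crawler) # this set holds Crawler objects
        d = crawler.crawl(*args, **kwargs)
        self._active.add(d) # this set holds the Deferreds returned by crawl()
        # The new Crawler and the Deferred from Crawler.crawl() are added to self.crawlers and self._active respectively.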
 
        def _done(result):
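            # discard is like remove, but raises no error if the element is missing
            self.crawlers.discard(crawler)
            self._active.discard(d)
            # a |= b is equivalent to a = a | b (bitwise OR)
            self.bootstrap_failed |= not getattr(crawler, 'spider', None)
            return result
        # The crawl was added to the tracking sets before running; here its Deferred gets a callback that cleans those sets up when it ends
        return d.addBoth(_done)    # Deferred
        # _done is added as a callback to the Deferred from Crawler.crawl(); when the crawl job finishes it removes
        # the entries added above from self.crawlers and self._active. What is returned is the Deferred from Crawler.crawl().
    # The crawl() callback chain is activated immediately (that is how @defer.inlineCallbacks behaves). As execution
    # goes deeper, it eventually suspends and waits for inner Deferreds to fire (these are registered with methods
    # such as reactor.callLater()). Once reactor.run() is called, the waiting Deferreds fire, and scheduling and
    # scraping/processing of data begin.
    # So: while crawl() is executing, scrapy has not started fetching data; it has only done a series of initialization steps.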

    def create_crawler(self, crawler_or_spidercls):
        """
        If crawler_or_spidercls is a Spider subclass, a new Crawler is created for it. If it is
        a string (for example the spider name typed on the command line), the spider with that
        name is looked up, and a Crawler instance is created and initialized for it.
        Return a :class:`~scrapy.crawler.Crawler` object.
        * If ``crawler_or_spidercls`` is a Crawler, it is returned as-is.
        * If ``crawler_or_spidercls`` is a Spider subclass, a new Crawler
          is constructed for it.
        * If ``crawler_or_spidercls`` is a string, this function finds
          a spider with this name in a Scrapy project (using spider loader),
          then creates a Crawler instance for it.
        """
        if isinstance(crawler_or_spidercls, Spider):
            raise ValueError(
                'The crawler_or_spidercls argument cannot be a spider object, '
                'it must be a spider class (or a Crawler object)')
        if isinstance(crawler_or_spidercls, Crawler):
            return crawler_or_spidercls
 
        return self._create_crawler(crawler_or_spidercls)
 
    def _create_crawler(self, spidercls):
        if isinstance(spidercls, str):
            spidercls = self.spider_loader.load(spidercls)
        # This is where the Crawler is actually instantiated, from the spider class and the settings
        return Crawler(spidercls, self.settings)
 
    def stop(self):
        """
        Stops simultaneously all the crawling jobs taking place.
        Returns a deferred that is fired when they all have ended.
        """
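        # Unlike crawlers, this is a list of the Deferreds returned by each crawler's stop()
        return defer.DeferredList([c.stop() for c in list(self.crawlers)])

    # Similar to join in multiprocessing: wait for all crawlers to finish their jobs
    # As noted above, the _active set holds every crawler's Deferred; they are yielded here so that later callbacks can run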
    @defer.inlineCallbacks
    def join(self):
        """
        join()
        Returns a deferred that is fired when all managed :attr:`crawlers` have
        completed their executions.
        """
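        # CrawlerProcess.start() calls join() and attaches _stop_reactor to the Deferred it returns, which covers
        # the Deferreds returned by every Crawler's crawl(); the reactor is shut down once all Crawlers have finished.
        while self._active:
            yield defer.DeferredList(self._active)
    # If TWISTED_REACTOR is set, verify that the installed reactor matches it; otherwise do nothing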
    def _handle_twisted_reactor(self):
        if self.settings.get("TWISTED_REACTOR"):
            verify_installed_reactor(self.settings["TWISTED_REACTOR"])
 
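# CrawlerProcess is for applications that do not otherwise use a reactor, and it runs multiple spiders in one process
# If you just use scrapy as-is there is no need to touch it, unless you want to embed scrapy in your own application
# In effect it is the CrawlerRunner above with reactor handling added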
class CrawlerProcess(CrawlerRunner):
    """
    A class to run multiple scrapy crawlers in a process simultaneously.
    This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support
    for starting a :mod:`~twisted.internet.reactor` and handling shutdown
    signals, like the keyboard interrupt command Ctrl-C. It also configures
    top-level logging.
    This utility should be a better fit than
    :class:`~scrapy.crawler.CrawlerRunner` if you aren't running another
    :mod:`~twisted.internet.reactor` within your application.
    The CrawlerProcess object must be instantiated with a
    :class:`~scrapy.settings.Settings` object.
    :param install_root_handler: whether to install root logging handler
        (default: True)
    This class shouldn't be needed (since Scrapy is responsible of using it
    accordingly) unless writing scripts that manually handle the crawling
    process. See :ref:`run-from-script` for an example.
    """
 
    def __init__(self, settings=None, install_root_handler=True):
        super().__init__(settings)
        # Install _signal_shutdown as the shutdown handler
        install_shutdown_handlers(self._signal_shutdown)      # register the shutdown handler, e.g. for Ctrl+C
        configure_logging(self.settings, install_root_handler)     # configure logging
        log_scrapy_info(self.settings)       # log an overview of the current scrapy setup
 
    def _signal_shutdown(self, signum, _):
        from twisted.internet import reactor
        # Re-register the shutdown handler as _signal_kill
        install_shutdown_handlers(self._signal_kill)
        signame = signal_names[signum]
        logger.info("Received %(signame)s, shutting down gracefully. Send again to force ",
                    {'signame': signame})
        # Use reactor.callFromThread(self._graceful_stop_reactor) to run this object's graceful stop
        reactor.callFromThread(self._graceful_stop_reactor)
 
    def _signal_kill(self, signum, _):
        from twisted.internet import reactor
        # Re-register the shutdown handler as signal.SIG_IGN
        install_shutdown_handlers(signal.SIG_IGN)
        signame = signal_names[signum]
        logger.info('Received %(signame)s twice, forcing unclean shutdown',
                    {'signame': signame})
        # Stop the reactor directly
        reactor.callFromThread(self._stop_reactor)
 
    def start(self, stop_after_crawl=True):
        """
        This method starts a :mod:`~twisted.internet.reactor`, adjusts its pool
        size to :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS cache
        based on :setting:`DNSCACHE_ENABLED` and :setting:`DNSCACHE_SIZE`.
        If ``stop_after_crawl`` is True, the reactor will be stopped after all
        crawlers have finished, using :meth:`join`.
        :param bool stop_after_crawl: stop or not the reactor when all
            crawlers have finished
        """
        # Apply a series of reactor-related settings, then call reactor.run() to start the event loop.
        from twisted.internet import reactor
        # If the reactor should stop after crawling, add the matching shutdown callback; this is what join() above is for.
        # If stop_after_crawl is not True, the reactor is left running and is not torn down
        if stop_after_crawl:
            d = self.join()
            # join() returns a Deferred that fires once everything in self._active has finished, that is, once each
            # crawl job (Crawler.crawl()) is done
            # Don't start the reactor if the deferreds are already fired
            if d.called:
                return
            d.addBoth(self._stop_reactor)     # add the reactor-stopping callback to that Deferred
        # Configure the DNS resolver and the thread pool on the reactor
        resolver_class = load_object(self.settings["DNS_RESOLVER"])
        resolver = create_instance(resolver_class, self.settings, self, reactor=reactor)
        resolver.install_on_reactor()   # set up DNS caching, the thread pool, and a system event trigger (I have not studied these APIs)
        tp = reactor.getThreadPool()
        tp.adjustPoolsize(maxthreads=self.settings.getint('REACTOR_THREADPOOL_MAXSIZE'))
        reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
        # Start the reactor event loop, which marks the point where all spiders really run. Unless stopped manually,
        # it ends on its own only after every spider has finished crawling.
        reactor.run(installSignalHandlers=False)  # blocking call
 
    def _graceful_stop_reactor(self):
        # Stop all crawlers, then stop the reactor once their Deferreds have fired
        d = self.stop()
        d.addBoth(self._stop_reactor)
        return d
 
    def _stop_reactor(self, _=None):
        from twisted.internet import reactor
        try:
            reactor.stop()
        except RuntimeError:  # raised if already stopped or in shutdown stage
            pass
 
    def _handle_twisted_reactor(self):
        if self.settings.get("TWISTED_REACTOR"):
            install_reactor(self.settings["TWISTED_REACTOR"], self.settings["ASYNCIO_EVENT_LOOP"])
        super()._handle_twisted_reactor()
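
　　The bookkeeping done by `_crawl()` and `join()` in the listing above (add each job to a set, remove it in a completion callback, wait until the set is empty) can be tried with asyncio alone. This is a rough analogue under stated assumptions: `ToyTaskRunner`, `_job`, `_demo`, and `demo` are hypothetical names, tasks stand in for Deferreds, and `asyncio.sleep` stands in for crawling.

```python
import asyncio


class ToyTaskRunner:
    """asyncio stand-in for the _crawl()/join() bookkeeping (not Scrapy code)."""

    def __init__(self):
        self.crawlers = set()   # names of running toy crawl jobs
        self._active = set()    # the task (Deferred counterpart) of each job

    async def _job(self, name, delay):
        await asyncio.sleep(delay)  # pretend to crawl
        return name

    def crawl(self, name, delay):
        self.crawlers.add(name)
        task = asyncio.create_task(self._job(name, delay))
        self._active.add(task)

        def _done(finished):
            # Same cleanup idea as _done() above: forget the job once it ends.
            self.crawlers.discard(name)
            self._active.discard(finished)

        task.add_done_callback(_done)
        return task

    async def join(self):
        # Loop because new jobs may be added while waiting.
        while self._active:
            await asyncio.gather(*self._active)


async def _demo():
    runner = ToyTaskRunner()
    runner.crawl("spider_a", 0.01)
    runner.crawl("spider_b", 0.02)
    started = len(runner._active)
    await runner.join()
    return started, len(runner._active), len(runner.crawlers)


def demo():
    # Returns (jobs started, jobs still active, crawlers still tracked).
    return asyncio.run(_demo())
```

　　As in the listing, `crawl()` returns immediately and the jobs only make progress while the event loop is running inside `join()`.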

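
　　The two-stage shutdown in CrawlerProcess (the first signal requests a graceful stop and arms a second handler; the second signal forces the stop and ignores anything further) can be simulated without installing real OS signal handlers. `ToyShutdown` and its `deliver` method are hypothetical, written only to show the handler swapping.

```python
class ToyShutdown:
    """Simulates the handler swapping; no real OS signals are installed."""

    def __init__(self):
        self.log = []
        # Counterpart of install_shutdown_handlers(self._signal_shutdown).
        self.handler = self._signal_shutdown

    def _signal_shutdown(self, signum):
        # First signal: arm the "kill" handler, then stop gracefully.
        self.handler = self._signal_kill
        self.log.append("graceful")

    def _signal_kill(self, signum):
        # Second signal: ignore any further ones, then force the stop.
        self.handler = None  # stands in for signal.SIG_IGN
        self.log.append("forced")

    def deliver(self, signum):
        """Pretend the OS delivered signal number `signum`."""
        if self.handler is not None:
            self.handler(signum)
        return self.log
```

　　Delivering the same signal three times records one graceful request and one forced stop; the third delivery is ignored.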