Commit ad5d003: Merge remote-tracking branch 'origin/master'
babymm committed Mar 13, 2019
2 parents 22b081b + 7fb710c
Showing 9 changed files with 163 additions and 28 deletions.
15 changes: 15 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml


5 changes: 4 additions & 1 deletion .idea/misc.xml


11 changes: 6 additions & 5 deletions docs/_templates/README.rst → docs/README.rst
@@ -1,18 +1,19 @@
pcrawler Crawler
================

pcrawler is a Python crawler framework that makes it quick and easy to write your own crawler program. It consists of four major components: downloader, schedular, processor, and storage, and each component can be extended quickly and conveniently.

Features
--------
* Simple API that is quick to pick up
* Modular structure that is easy to extend
* Multithreading and distributed support

Architecture
------------
pcrawler consists of four major components: downloader, schedular, processor, and storage.

* processor: the page processor, which analyzes fetched pages. Built-in processors currently include an image download processor, a multimedia video download processor, and a Sina news processor.
* schedular: the URL management component, which manages the queue of URLs waiting to be crawled and deduplicates URLs that have already been crawled. The URL queue currently supports file-cache management and set management; URL deduplication supports a file cache, a set, a Bloom filter, and other strategies.
* downloader: the download component; urllib2 is used by default.
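The interaction of the four components above can be sketched as a minimal pipeline. This is illustrative only: every class and method name here is an assumption for the sketch, not the actual pcrawler API, and the downloader is stubbed instead of fetching real pages.

```python
# Illustrative sketch of a downloader/schedular/processor/storage pipeline.
# All names are hypothetical, not the real pcrawler API.

class Schedular:
    """Manages the URL queue and deduplicates already-seen URLs."""
    def __init__(self, seeds):
        self.queue = list(seeds)
        self.seen = set(seeds)

    def push(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def pop(self):
        return self.queue.pop(0) if self.queue else None


class Downloader:
    """Fetches the body for a URL (stubbed here; pcrawler uses urllib2)."""
    def download(self, url):
        return "<html>page at %s</html>" % url


class Processor:
    """Analyzes a page, returning extracted data plus follow-up URLs."""
    def process(self, url, body):
        return {"url": url, "size": len(body)}, []


class Storage:
    """Persists extracted items (in memory for this sketch)."""
    def __init__(self):
        self.items = []

    def save(self, item):
        self.items.append(item)


def crawl(seeds, max_pages=10):
    schedular, downloader = Schedular(seeds), Downloader()
    processor, storage = Processor(), Storage()
    while max_pages > 0:
        url = schedular.pop()
        if url is None:
            break
        body = downloader.download(url)
        item, new_urls = processor.process(url, body)
        storage.save(item)
        for u in new_urls:
            schedular.push(u)
        max_pages -= 1
    return storage.items
```

Because each stage only talks to its neighbors through small interfaces, any one component (for example, swapping the set-based dedup for a Bloom filter) can be replaced without touching the rest.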
Binary file added docs/_static/img/python_window_install.png
3 changes: 1 addition & 2 deletions docs/_templates/bloomFilter.rst → docs/bloomFilter.rst
@@ -1,6 +1,5 @@
Bloom Filter
============
A Bloom filter is a tool for filtering data quickly. pcrawler uses a Bloom filter mainly as its crawl deduplication strategy, which greatly reduces memory consumption. The project originally deduplicated with a list, but that consumed too much memory: as the crawler ran, machine memory usage kept growing and eventually caused an out-of-memory failure. Switching to a Bloom filter reduced memory consumption dramatically.
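pcrawler itself uses the pybloom package for this. To make the idea concrete, here is a minimal from-scratch sketch of a Bloom filter (a toy version, not the pybloom API): k hash positions are set in an m-bit array per item, so membership tests never report a seen URL as unseen, at the cost of a small false-positive rate.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array.
    Illustrative only; pcrawler uses the pybloom package instead."""
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k deterministic bit positions from salted md5 digests.
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Unlike a list of URLs, the memory footprint is fixed (m bits) no matter how many URLs are added, which is exactly why it avoids the out-of-memory failures described above.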
37 changes: 23 additions & 14 deletions docs/conf.py
@@ -15,7 +15,8 @@
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))

import sphinx_rtd_theme
from recommonmark.parser import CommonMarkParser

# -- Project information -----------------------------------------------------

@@ -28,7 +29,6 @@
# The full version, including alpha/beta/rc tags
release = u'0.0.1'


# -- General configuration ---------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
@@ -38,18 +38,24 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
# extensions = ['sphinx.ext.*']
extensions = [
'sphinx.ext.autosectionlabel',
'sphinx.ext.autodoc',
'sphinx.ext.intersphinx',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

source_suffix = ['.rst', '.md']
source_parsers = {
'.md': CommonMarkParser,
}
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']

source_encoding = 'utf-8'
# The master toctree document.
master_doc = 'index'

@@ -58,7 +64,8 @@
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
# language = None
language = 'en'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
@@ -68,14 +75,19 @@
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
# html_theme = 'alabaster'
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# html_logo = 'img/logo.svg'
html_theme_options = {
'logo_only': True,
'display_version': False,
}
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
@@ -103,7 +115,6 @@
# Output file base name for HTML help builder.
htmlhelp_basename = 'pcrawlerdoc'


# -- Options for LaTeX output ------------------------------------------------

latex_elements = {
@@ -132,7 +143,6 @@
u'ganliang', 'manual'),
]


# -- Options for manual page output ------------------------------------------

# One entry per manual page. List of tuples
@@ -142,7 +152,6 @@
[author], 1)
]


# -- Options for Texinfo output ----------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
@@ -154,7 +163,6 @@
'Miscellaneous'),
]


# -- Options for Epub output -------------------------------------------------

# Bibliographic Dublin Core info.
@@ -170,4 +178,5 @@
# epub_uid = ''

# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
autosectionlabel_prefix_document = True
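With `sphinx.ext.autosectionlabel` enabled and `autosectionlabel_prefix_document = True`, every section label is prefixed by its document name, so headings in different files cannot collide. A hypothetical cross-reference (the heading name here is only an example) would look like:

```rst
.. With autosectionlabel_prefix_document = True, the "Installation" heading
   in install.rst gets the label ``install:Installation``:

See :ref:`install:Installation` for the required packages.
```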
22 changes: 22 additions & 0 deletions docs/get_start.rst
@@ -0,0 +1,22 @@
Getting Started
===============

pcrawler is written in Python, so a Python environment must be installed to run the program. The currently supported version is python-2.7.15; support for other versions will be added over time.

Installing on Windows
---------------------
Download the "Windows x86-64 MSI installer" package from the official site (https://www.python.org/downloads/release/python-2715/), then click through the installer, accepting the defaults.


Installing on Linux
-------------------
Linux systems usually ship with a Python environment, but check the version number first by running the ``python`` command. If the version differs too much from python-2.7.15, replace the Python installation with a matching one.

Running the crawler
-------------------

.. code:: shell

    python crawler.py travis
70 changes: 64 additions & 6 deletions docs/index.rst
@@ -5,16 +5,74 @@
Welcome to pcrawler's documentation!
====================================
pcrawler is a Python crawler framework that makes it quick and easy to write your own crawler program. It consists of four major components: downloader, schedular, processor, and storage, and each component can be extended quickly and conveniently.

The code is open source, and `available on GitHub`_.

.. _available on GitHub: https://github.com/mumupy/pcrawler.git

The main documentation for the site is organized into a couple sections:

* :ref:`user-docs`
* :ref:`about-docs`
* :ref:`feature-docs`

Information about development is also available:

* :ref:`dev-docs`
* :ref:`design-docs`

.. _user-docs:

.. toctree::
   :maxdepth: 2
   :caption: User Documentation

   get_start
   install

.. _about-docs:

.. toctree::
   :maxdepth: 2
   :caption: About Read the Docs

   README
   bloomFilter

.. _feature-docs:

.. toctree::
   :maxdepth: 2
   :glob:
   :caption: Feature Documentation

   README
   bloomFilter


.. _dev-docs:

.. toctree::
   :maxdepth: 2
   :caption: Developer Documentation

   README
   bloomFilter

.. _design-docs:

.. toctree::
   :maxdepth: 2
   :caption: Designer Documentation

   Theme <https://sphinx-rtd-theme.readthedocs.io/en/latest/>
28 changes: 28 additions & 0 deletions docs/install.rst
@@ -0,0 +1,28 @@
Installation
============

Components that need to be installed to use pcrawler. Some are required by the program itself, while others are added for writing documentation and measuring test code coverage.

Project components
------------------
.. code:: shell

    # pybloom provides the bloomFilter used for data deduplication
    pip install pybloom
    # lxml analyzes HTML
    pip install lxml
    # avro stores the crawled data
    pip install avro
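The lxml package installed above is what pcrawler uses to analyze HTML. As a rough illustration of the same extraction idea without the third-party dependency, the standard library's ``HTMLParser`` can pull links out of a page (this sketch is not pcrawler code):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags. Illustrative only; pcrawler's
    processors use lxml for page analysis."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Extracted links like these are what the schedular component queues and deduplicates.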
Other components
----------------
.. code:: shell

    # codecov generates test code coverage reports
    pip install codecov
    # recommonmark converts md files into rst documents
    pip install recommonmark
