Add docs documentation

mumupy · Sep 25, 2018 · 5f5ccbe
1 parent 9472bcc
commit 5f5ccbe
Showing 8 changed files with 262 additions and 45 deletions.
diff --git a/docs/Makefile b/docs/Makefile
@@ -0,0 +1,19 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS    =
+SPHINXBUILD   = sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/__init__.py b/docs/__init__.py
diff --git a/docs/README.md b/docs/_templates/README.rst
@@ -1,15 +1,19 @@
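-# pcrawler web crawler
+pcrawler web crawler
+####################
+
 ***pcrawler is a crawler framework written in Python that lets you write your own crawler quickly and easily. pcrawler consists of
 four main components: downloader, schedular, processor, and storage, each of which is easy to extend.***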
 
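-## Features:
-- Simple API that is quick to pick up
-- Modular structure that is easy to extend
-- Multithreading and distributed crawling support
+Features
+########
+* Simple API that is quick to pick up
+* Modular structure that is easy to extend
+* Multithreading and distributed crawling support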
 
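-## Architecture
+Architecture
+############
 pcrawler consists of four main components: downloader, schedular, processor, and storage.
-- processor: the page processor, which parses crawled pages. Built-in processors currently cover image download, multimedia video download, and Sina News.
-- schedular: the URL manager, which maintains the queue of URLs waiting to be crawled and deduplicates URLs already crawled. The queue can be backed by a file cache or an in-memory set; deduplication can use a file cache, a set, or a Bloom filter (bloomFilter).
-- downloader: the download component, which uses urllib2 by default.
-- storage: the storage component, which supports several output formats (csv, json, avro, video).
+* processor: the page processor, which parses crawled pages. Built-in processors currently cover image download, multimedia video download, and Sina News.
+* schedular: the URL manager, which maintains the queue of URLs waiting to be crawled and deduplicates URLs already crawled. The queue can be backed by a file cache or an in-memory set; deduplication can use a file cache, a set, or a Bloom filter (bloomFilter).
+* downloader: the download component, which uses urllib2 by default.
+* storage: the storage component, which supports several output formats (csv, json, avro, video).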
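
The way the four components in the README fit together can be sketched roughly as follows. This is only an illustration and not pcrawler's actual API: the `Schedular` class, the `crawl` function, and the stand-in downloader, processor, and storage are all hypothetical names, and the "pages" are an in-memory dict rather than real HTTP responses.

```python
from collections import deque


class Schedular:
    """Hypothetical URL manager: queues pending URLs and skips ones already seen."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, url):
        # Deduplicate on entry so each URL is queued at most once.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def pop(self):
        # Return the next pending URL, or None when the queue is empty.
        return self._queue.popleft() if self._queue else None


def crawl(start_url, downloader, processor, storage):
    """Run download -> process -> store until the schedular runs dry."""
    schedular = Schedular()
    schedular.push(start_url)
    url = schedular.pop()
    while url is not None:
        page = downloader(url)              # downloader: fetch the page
        item, links = processor(url, page)  # processor: extract data and links
        storage.append(item)                # storage: persist the result
        for link in links:
            schedular.push(link)            # schedular: queue newly found URLs
        url = schedular.pop()
    return storage


# Stand-in components (assumptions for the sketch): each "page" is just its link list.
PAGES = {"a": ["b", "c"], "b": ["a", "c"], "c": []}


def fake_downloader(url):
    return PAGES.get(url, [])


def link_processor(url, page):
    return {"url": url, "links": len(page)}, page


results = crawl("a", fake_downloader, link_processor, [])
```

Each component is passed in as a plain argument, so any one of them can be swapped without touching the others, which is the extensibility the README describes.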
diff --git a/docs/_templates/bloomFilter.rst b/docs/_templates/bloomFilter.rst
@@ -0,0 +1,6 @@
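+Bloom filter
+############
+
+A Bloom filter is a tool for fast membership filtering. pcrawler uses one as its URL deduplication strategy,
+which greatly reduces memory use. The project originally deduplicated with a list, but that consumed too much memory: as the crawler
+kept running, memory use on the machine grew until it ran out of memory. Switching to a Bloom filter greatly reduced memory consumption.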
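
To make the memory argument in bloomFilter.rst concrete, here is a minimal Bloom filter sketch for URL deduplication. It is not pcrawler's implementation: the class name, the default sizes, and the use of SHA-256 with a seed prefix are all assumptions chosen for illustration, and the code targets Python 3.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; the sizes and hashing scheme are assumptions."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into a fixed-size byte array.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(str(seed).encode("ascii") + b":" + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every derived bit is set; false positives are possible.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("http://example.com/page1")
already_crawled = "http://example.com/page1" in seen
```

With these defaults the filter occupies a fixed 128 KiB no matter how many URLs are added, whereas a list or set grows with every URL. The trade-off is that a Bloom filter can report a URL as seen when it was never added, so a small fraction of pages may be skipped; it never reports an added URL as unseen.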
diff --git a/docs/bloomFilter.md b/docs/bloomFilter.md
diff --git a/docs/conf.py b/docs/conf.py
@@ -1,14 +1,173 @@
-#!/usr/bin/env python
 # -*- coding: utf-8 -*-
-# @Time    : 2018/9/21 17:32
-# @Author  : ganliang
-# @File    : conf.py
-# @Desc    : Convert md documents to rst documents
+#
+# Configuration file for the Sphinx documentation builder.
+#
+# This file does only contain a selection of the most common options. For a
+# full list see the documentation:
+# http://www.sphinx-doc.org/en/master/config
 
-from recommonmark.parser import CommonMarkParser
+# -- Path setup --------------------------------------------------------------
 
-source_parsers = {
-    '.md': CommonMarkParser,
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+# import os
+# import sys
+# sys.path.insert(0, os.path.abspath('.'))
+
+
+# -- Project information -----------------------------------------------------
+
+project = u'pcrawler'
+copyright = u'2018, ganliang'
+author = u'ganliang'
+
+# The short X.Y version
+version = u'0.0.1'
+# The full version, including alpha/beta/rc tags
+release = u'0.0.1'
+
+
+# -- General configuration ---------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#
+# needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+# source_suffix = ['.rst', '.md']
+source_suffix = '.rst'
+
+# The master toctree document.
+master_doc = 'index'
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = None
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = [u'_build', 'Thumbs.db', '.DS_Store']
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = None
+
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'alabaster'
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further.  For a list of options available for each theme, see the
+# documentation.
+#
+# html_theme_options = {}
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# Custom sidebar templates, must be a dictionary that maps document names
+# to template names.
+#
+# The default sidebars (for documents that don't match any pattern) are
+# defined by theme itself.  Builtin themes are using these templates by
+# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
+# 'searchbox.html']``.
+#
+# html_sidebars = {}
+
+
+# -- Options for HTMLHelp output ---------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'pcrawlerdoc'
+
+
+# -- Options for LaTeX output ------------------------------------------------
+
+latex_elements = {
+    # The paper size ('letterpaper' or 'a4paper').
+    #
+    # 'papersize': 'letterpaper',
+
+    # The font size ('10pt', '11pt' or '12pt').
+    #
+    # 'pointsize': '10pt',
+
+    # Additional stuff for the LaTeX preamble.
+    #
+    # 'preamble': '',
+
+    # Latex figure (float) alignment
+    #
+    # 'figure_align': 'htbp',
 }
 
-source_suffix = ['.rst', '.md']
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+#  author, documentclass [howto, manual, or own class]).
+latex_documents = [
+    (master_doc, 'pcrawler.tex', u'pcrawler Documentation',
+     u'ganliang', 'manual'),
+]
+
+
+# -- Options for manual page output ------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+    (master_doc, 'pcrawler', u'pcrawler Documentation',
+     [author], 1)
+]
+
+
+# -- Options for Texinfo output ----------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+#  dir menu entry, description, category)
+texinfo_documents = [
+    (master_doc, 'pcrawler', u'pcrawler Documentation',
+     author, 'pcrawler', 'One line description of project.',
+     'Miscellaneous'),
+]
+
+
+# -- Options for Epub output -------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = project
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#
+# epub_identifier = ''
+
+# A unique identification for the text.
+#
+# epub_uid = ''
+
+# A list of files that should not be packed into the epub file.
+epub_exclude_files = ['search.html']
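
The lines removed from conf.py registered recommonmark's `CommonMarkParser` through `source_parsers` so that `.md` sources could be built. If Markdown sources are still wanted alongside `.rst`, one possible configuration is sketched below. It assumes Sphinx 1.8 or later with the recommonmark package installed; this commit pins neither.

```python
# conf.py fragment (assumption: Sphinx >= 1.8 and recommonmark are installed).
# Loading recommonmark as an extension registers its Markdown parser.
extensions = [
    'recommonmark',
]

# Map each source suffix to the parser type that should handle it.
source_suffix = {
    '.rst': 'restructuredtext',
    '.md': 'markdown',
}
```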
diff --git a/docs/index.rst b/docs/index.rst
@@ -1,15 +1,20 @@
-# pcrawler爬虫
-***pcrawler是一款python版本的爬虫程序，通过该爬虫程序可以非常快速方便的编写一个自己的爬虫程序。pcrawler主要
-包含downloader、schedular、processor、storage四大组件组成。而且可以非常方便快捷的拓展各个组件。***
-
-## 特性：
-- 简单的API，可快速上手
-- 模块化的结构，可轻松扩展
-- 提供多线程和分布式支持
-
-## 架构
-pcrawler主要包含downloader、schedular、processor、storage四大组件组成。
-- processor 爬虫页面处理器，对页面进行分析。目前集成图片下载处理器、多媒体视频下载处理器、新浪新闻处理器。
-- schedular URL管理组件，对待抓取的URL队列进行管理，对已抓取的URL进行去重。目前url队列管理支持文件缓存管理和集合管理。url去重支持文件缓存、集合、bloomFilter布隆过滤器等。
-- downloader 下载组件，默认使用urllib2下载。
-- storage 存储组件，支持多样文件格式(csv、json、avro、video)
+.. pcrawler documentation master file, created by
+   sphinx-quickstart on Tue Sep 25 09:21:33 2018.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Welcome to pcrawler's documentation!
+====================================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+
+
+Indices and tables
+==================
+
+* :ref:`bloomFilter`
+* :ref:`modindex`
+* :ref:`search`
diff --git a/docs/make.bat b/docs/make.bat
@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.http://sphinx-doc.org/
+	exit /b 1
+)
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
+
+:end
+popd