Add docs documentation
babymm authored and babymm@aliyun.com committed Sep 25, 2018
1 parent 9472bcc commit 5f5ccbe
Showing 8 changed files with 262 additions and 45 deletions.
19 changes: 19 additions & 0 deletions docs/Makefile
@@ -0,0 +1,19 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
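With Sphinx installed, a typical invocation from the docs directory is "make html": the catch-all target above forwards it to "sphinx-build -M html . _build", so the generated site ends up under _build/html (assuming the default SOURCEDIR and BUILDDIR values shown here).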
6 changes: 0 additions & 6 deletions docs/__init__.py

This file was deleted.

24 changes: 14 additions & 10 deletions docs/README.md → docs/_templates/README.rst
@@ -1,15 +1,19 @@
pcrawler crawler
################

pcrawler is a Python web crawler framework that makes it quick and easy to write a crawler of your own. It is built from four main components (downloader, schedular, processor, and storage), and each component can be extended with very little effort.

Features
########

* Simple API that is quick to pick up
* Modular structure that is easy to extend
* Multi-threading and distributed crawling support

Architecture
############

pcrawler is composed of four main components: downloader, schedular, processor, and storage (a rough sketch of how they cooperate follows this list).

* processor: page processor that analyses crawled pages. Built-in processors currently include an image download processor, a multimedia/video download processor, and a Sina news processor.
* schedular: URL management component that maintains the queue of URLs to crawl and deduplicates URLs that have already been fetched. The URL queue currently supports file-cache and in-memory set backends; deduplication supports file cache, set, and Bloom filter strategies.
* downloader: download component; urllib2 is used by default.
* storage: storage component supporting several output formats (csv, json, avro, video).
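As a rough illustration of how these components cooperate, here is a hedged Python sketch. It is not pcrawler's actual API: the Crawler class and every method name below (has_next, next_url, download, process, push, save) are assumptions made for this example.

class Crawler(object):
    """Illustrative only: wires together the four components described above."""

    def __init__(self, downloader, schedular, processor, storage):
        self.downloader = downloader  # fetches raw pages (urllib2 by default)
        self.schedular = schedular    # URL queue plus deduplication
        self.processor = processor    # parses pages, extracts items and new links
        self.storage = storage        # persists extracted items (csv, json, ...)

    def run(self):
        while self.schedular.has_next():
            url = self.schedular.next_url()
            page = self.downloader.download(url)
            items, new_urls = self.processor.process(page)
            for new_url in new_urls:
                self.schedular.push(new_url)  # already-seen URLs are filtered here
            self.storage.save(items)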
6 changes: 6 additions & 0 deletions docs/_templates/bloomFilter.rst
@@ -0,0 +1,6 @@
Bloom filter
############

A Bloom filter is a data structure for filtering data quickly. pcrawler uses one mainly as its crawl deduplication strategy, which greatly reduces memory consumption. The project originally deduplicated URLs with a Python list, but memory usage was too high: as the crawler kept running, the machine's memory consumption kept growing and eventually led to an out-of-memory failure. Switching to a Bloom filter cut memory consumption dramatically.
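To make the trade-off concrete, here is a minimal, self-contained Bloom filter sketch for URL deduplication. It is not pcrawler's implementation; the class, its parameters, and the hashing scheme (k bit positions derived from salted md5 digests over a fixed-size bit array) are assumptions chosen for illustration.

import hashlib


class BloomFilter(object):
    """Fixed-size bit array; membership tests may give false positives,
    never false negatives, so a seen URL is never crawled twice."""

    def __init__(self, size=1 << 20, hash_count=5):
        self.size = size                      # number of bits
        self.hash_count = hash_count          # hash positions per item
        self.bits = bytearray(size // 8 + 1)  # packed bit array

    def _positions(self, item):
        for seed in range(self.hash_count):
            digest = hashlib.md5(("%d:%s" % (seed, item)).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter()
url = "http://news.sina.com.cn/"
if url not in seen:   # skip URLs the crawler has already recorded
    seen.add(url)

Unlike a list or set, which must store every URL it has seen, the bit array stays at a fixed size (here 2^20 bits, roughly 128 KiB) no matter how long the crawl runs; the price is a small false-positive rate, which for deduplication only means an occasional unseen URL is skipped.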
5 changes: 0 additions & 5 deletions docs/bloomFilter.md

This file was deleted.

177 changes: 168 additions & 9 deletions docs/conf.py
@@ -1,14 +1,173 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018/9/21 17:32
# @Author : ganliang
# @File : conf.py
# @Desc : convert Markdown docs to reStructuredText
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config

from recommonmark.parser import CommonMarkParser
# -- Path setup --------------------------------------------------------------

source_parsers = {
    '.md': CommonMarkParser,
}
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))


# -- Project information -----------------------------------------------------

project = u'pcrawler'
copyright = u'2018, ganliang'
author = u'ganliang'

# The short X.Y version
version = u'0.0.1'
# The full version, including alpha/beta/rc tags
release = u'0.0.1'


# -- General configuration ---------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'

# The master toctree document.
master_doc = 'index'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = [u'_build', 'Thumbs.db', '.DS_Store']

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}


# -- Options for HTMLHelp output ---------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = 'pcrawlerdoc'


# -- Options for LaTeX output ------------------------------------------------

latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',

# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',

# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',

# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}

source_suffix = ['.rst', '.md']
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'pcrawler.tex', u'pcrawler Documentation',
     u'ganliang', 'manual'),
]


# -- Options for manual page output ------------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, 'pcrawler', u'pcrawler Documentation',
     [author], 1)
]


# -- Options for Texinfo output ----------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'pcrawler', u'pcrawler Documentation',
     author, 'pcrawler', 'One line description of project.',
     'Miscellaneous'),
]


# -- Options for Epub output -------------------------------------------------

# Bibliographic Dublin Core info.
epub_title = project

# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''

# A unique identification for the text.
#
# epub_uid = ''

# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
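One thing worth noting in this configuration: source_suffix is later set to ['.rst', '.md'] while extensions stays empty and the recommonmark import from the previous revision is gone, so Markdown sources will not actually be parsed. Below is a minimal sketch of restoring the hookup the old conf.py used (the source_parsers mapping supported by older Sphinx versions); the import and mapping mirror the removed lines rather than a verified build.

from recommonmark.parser import CommonMarkParser

# Sketch: re-enable Markdown parsing the way the previous conf.py did.
source_parsers = {
    '.md': CommonMarkParser,
}
source_suffix = ['.rst', '.md']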
35 changes: 20 additions & 15 deletions docs/index.rst
@@ -1,15 +1,20 @@
.. pcrawler documentation master file, created by
   sphinx-quickstart on Tue Sep 25 09:21:33 2018.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to pcrawler's documentation!
====================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:



Indices and tables
==================

* :ref:`bloomFilter`
* :ref:`modindex`
* :ref:`search`
35 changes: 35 additions & 0 deletions docs/make.bat
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%

:end
popd
