Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
babymm authored and babymm@aliyun.com committed Sep 25, 2018
1 parent 9472bcc
commit 5f5ccbe
Showing 8 changed files with 262 additions and 45 deletions.
@@ -0,0 +1,19 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
This file was deleted.
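@@ -1,15 +1,19 @@
# pcrawler crawler
pcrawler crawler
################

***pcrawler is a crawler written in Python that makes it quick and easy to write your own crawler. pcrawler is built from
four main components: downloader, schedular, processor, and storage. Each component can be extended quickly and easily.***

## Features:
- Simple API, quick to pick up
- Modular structure, easy to extend
- Multithreading and distributed support
Features
########
* Simple API, quick to pick up
* Modular structure, easy to extend
* Multithreading and distributed support

## Architecture
Architecture
############
pcrawler consists of four main components: downloader, schedular, processor, and storage.
- processor: the page processor, which analyzes pages. It currently includes an image download processor, a multimedia video download processor, and a Sina News processor.
- schedular: the URL management component, which manages the queue of URLs waiting to be crawled and deduplicates URLs already crawled. URL queue management currently supports file-cache and set-based management; URL deduplication supports a file cache, a set, a bloomFilter (Bloom filter), and others.
- downloader: the download component, which uses urllib2 by default.
- storage: the storage component, which supports several file formats (csv, json, avro, video)
* processor: the page processor, which analyzes pages. It currently includes an image download processor, a multimedia video download processor, and a Sina News processor.
* schedular: the URL management component, which manages the queue of URLs waiting to be crawled and deduplicates URLs already crawled. URL queue management currently supports file-cache and set-based management; URL deduplication supports a file cache, a set, a bloomFilter (Bloom filter), and others.
* downloader: the download component, which uses urllib2 by default.
* storage: the storage component, which supports several file formats (csv, json, avro, video)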
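Reviewer note: a minimal sketch of how the four components listed above (downloader, schedular, processor, storage) might fit together in one crawl loop. Every class and method name here (`Scheduler`, `Downloader`, `Processor`, `Storage`, `crawl`) is hypothetical and not taken from the pcrawler source. The downloader reads from an in-memory dict instead of urllib2 so that the example runs without network access.

```python
from collections import deque


class Scheduler:
    """Hypothetical URL manager: a FIFO queue plus a set for deduplication."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def push(self, url):
        # Only enqueue URLs that have never been scheduled before.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def pop(self):
        return self.queue.popleft() if self.queue else None


class Downloader:
    """Hypothetical downloader backed by a dict (stand-in for a real HTTP fetch)."""

    def __init__(self, pages):
        self.pages = pages

    def download(self, url):
        return self.pages.get(url, "")


class Processor:
    """Hypothetical page processor: returns (items, links) for one page."""

    def process(self, url, body):
        # Toy link extraction: any whitespace-separated token starting with "http".
        links = [token for token in body.split() if token.startswith("http")]
        return [{"url": url, "length": len(body)}], links


class Storage:
    """Hypothetical storage component that keeps items in a list."""

    def __init__(self):
        self.items = []

    def save(self, item):
        self.items.append(item)


def crawl(start_url, scheduler, downloader, processor, storage):
    """Run the loop: schedule, download, process, store, schedule new links."""
    scheduler.push(start_url)
    url = scheduler.pop()
    while url is not None:
        body = downloader.download(url)
        items, links = processor.process(url, body)
        for item in items:
            storage.save(item)
        for link in links:
            scheduler.push(link)
        url = scheduler.pop()
    return storage.items
```

Because each stage is a separate object passed into `crawl`, one of them can be swapped (for example a file-backed scheduler or a csv storage) without touching the loop, which is the kind of extensibility the README describes.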
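@@ -0,0 +1,6 @@
Bloom filter
############

A Bloom filter is a tool for filtering data quickly. pcrawler uses one mainly as its deduplication strategy while crawling,
because it greatly reduces memory consumption. The project originally used a list for deduplication, but that consumed too much memory: as the crawler
ran, memory use on the machine kept growing and eventually caused an out-of-memory error. Switching to a Bloom filter greatly reduced memory consumption.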
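Reviewer note: a minimal sketch of Bloom-filter deduplication as described above, assuming Python 3 and only the standard library. The `BloomFilter` class, the `dedupe` helper, and the default sizes are illustrative assumptions, not pcrawler's actual implementation. Memory stays fixed at `size_bits / 8` bytes however many URLs are added; the trade-off is a small chance of false positives, so a new URL can occasionally be skipped as if it had been seen.

```python
import hashlib


class BloomFilter(object):
    """Hypothetical Bloom filter: num_hashes bit positions per item in a fixed bit array."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes (1 << 20 bits is 128 KiB).
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive several positions by salting one hash function with a seed.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(str(seed).encode("ascii") + b":" + data).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def dedupe(urls, seen=None):
    """Yield each URL the filter has not recorded yet, recording it as it goes."""
    if seen is None:
        seen = BloomFilter()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url
```

Compared with a list, membership checks cost a fixed number of hash computations instead of a scan, and the filter never grows, which addresses the out-of-memory problem described above.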
This file was deleted.
@@ -1,14 +1,173 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018/9/21 17:32
# @Author : ganliang
# @File : conf.py
# @Desc : Convert md documents to rst documents
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config

from recommonmark.parser import CommonMarkParser
# -- Path setup --------------------------------------------------------------

source_parsers = {
    '.md': CommonMarkParser,
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))


# -- Project information -----------------------------------------------------

project = u'pcrawler'
copyright = u'2018, ganliang'
author = u'ganliang'

# The short X.Y version
version = u'0.0.1'
# The full version, including alpha/beta/rc tags
release = u'0.0.1'


# -- General configuration ---------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'

# The master toctree document.
master_doc = 'index'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = [u'_build', 'Thumbs.db', '.DS_Store']

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}


# -- Options for HTMLHelp output ---------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = 'pcrawlerdoc'


# -- Options for LaTeX output ------------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',

    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}

source_suffix = ['.rst', '.md']
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'pcrawler.tex', u'pcrawler Documentation',
     u'ganliang', 'manual'),
]


# -- Options for manual page output ------------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, 'pcrawler', u'pcrawler Documentation',
     [author], 1)
]


# -- Options for Texinfo output ----------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'pcrawler', u'pcrawler Documentation',
     author, 'pcrawler', 'One line description of project.',
     'Miscellaneous'),
]


# -- Options for Epub output -------------------------------------------------

# Bibliographic Dublin Core info.
epub_title = project

# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''

# A unique identification for the text.
#
# epub_uid = ''

# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
@@ -1,15 +1,20 @@
# pcrawler crawler
***pcrawler is a crawler written in Python that makes it quick and easy to write your own crawler. pcrawler is built from
four main components: downloader, schedular, processor, and storage. Each component can be extended quickly and easily.***

## Features:
- Simple API, quick to pick up
- Modular structure, easy to extend
- Multithreading and distributed support

## Architecture
pcrawler consists of four main components: downloader, schedular, processor, and storage.
- processor: the page processor, which analyzes pages. It currently includes an image download processor, a multimedia video download processor, and a Sina News processor.
- schedular: the URL management component, which manages the queue of URLs waiting to be crawled and deduplicates URLs already crawled. URL queue management currently supports file-cache and set-based management; URL deduplication supports a file cache, a set, a bloomFilter (Bloom filter), and others.
- downloader: the download component, which uses urllib2 by default.
- storage: the storage component, which supports several file formats (csv, json, avro, video)
.. pcrawler documentation master file, created by
   sphinx-quickstart on Tue Sep 25 09:21:33 2018.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.
Welcome to pcrawler's documentation!
====================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:



Indices and tables
==================

* :ref:`bloomFilter`
* :ref:`modindex`
* :ref:`search`
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%

:end
popd