Commit ad5d003: Merge remote-tracking branch 'origin/master'
babymm committed Mar 13, 2019
2 parents 22b081b + 7fb710c
Showing 9 changed files with 163 additions and 28 deletions.
15 changes: 15 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml


5 changes: 4 additions & 1 deletion .idea/misc.xml


11 changes: 6 additions & 5 deletions docs/_templates/README.rst → docs/README.rst
@@ -1,18 +1,19 @@
pcrawler Crawler
================

pcrawler is a Python crawler framework that makes it quick and easy to write your own crawler program. It consists of four major components: downloader, schedular, processor, and storage, and each component can be extended quickly and conveniently.

Features
--------
* Simple API that is quick to pick up
* Modular structure that is easy to extend
* Multithreading and distributed support

Architecture
------------
pcrawler consists of four major components: downloader, schedular, processor, and storage.

* processor: the page processor, which analyzes fetched pages. Built-in processors currently include an image download processor, a multimedia video download processor, and a Sina news processor.
* schedular: the URL management component, which manages the queue of URLs waiting to be crawled and deduplicates URLs that have already been crawled. The URL queue currently supports file-cache management and set management; URL deduplication supports a file cache, a set, a Bloom filter, and other strategies.
* downloader: the download component; urllib2 is used by default.
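The interaction of the four components above can be sketched as a minimal pipeline. This is illustrative only: every class and method name here is an assumption for the sketch, not the actual pcrawler API, and the downloader is stubbed instead of fetching real pages.

```python
# Illustrative sketch of a downloader/schedular/processor/storage pipeline.
# All names are hypothetical, not the real pcrawler API.

class Schedular:
    """Manages the URL queue and deduplicates already-seen URLs."""
    def __init__(self, seeds):
        self.queue = list(seeds)
        self.seen = set(seeds)

    def push(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def pop(self):
        return self.queue.pop(0) if self.queue else None


class Downloader:
    """Fetches the body for a URL (stubbed here; pcrawler uses urllib2)."""
    def download(self, url):
        return "<html>page at %s</html>" % url


class Processor:
    """Analyzes a page, returning extracted data plus follow-up URLs."""
    def process(self, url, body):
        return {"url": url, "size": len(body)}, []


class Storage:
    """Persists extracted items (in memory for this sketch)."""
    def __init__(self):
        self.items = []

    def save(self, item):
        self.items.append(item)


def crawl(seeds, max_pages=10):
    schedular, downloader = Schedular(seeds), Downloader()
    processor, storage = Processor(), Storage()
    while max_pages > 0:
        url = schedular.pop()
        if url is None:
            break
        body = downloader.download(url)
        item, new_urls = processor.process(url, body)
        storage.save(item)
        for u in new_urls:
            schedular.push(u)
        max_pages -= 1
    return storage.items
```

Because each stage only talks to its neighbors through small interfaces, any one component (for example, swapping the set-based dedup for a Bloom filter) can be replaced without touching the rest.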
Binary file added docs/_static/img/python_window_install.png
3 changes: 1 addition & 2 deletions docs/_templates/bloomFilter.rst → docs/bloomFilter.rst
@@ -1,6 +1,5 @@
Bloom Filter
============
A Bloom filter is a tool for filtering data quickly. pcrawler uses a Bloom filter mainly as its crawl deduplication strategy, which greatly reduces memory consumption. The project originally deduplicated with a list, but that consumed too much memory: as the crawler ran, machine memory usage kept growing and eventually caused an out-of-memory failure. Switching to a Bloom filter reduced memory consumption dramatically.
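pcrawler itself uses the pybloom package for this. To make the idea concrete, here is a minimal from-scratch sketch of a Bloom filter (a toy version, not the pybloom API): k hash positions are set in an m-bit array per item, so membership tests never report a seen URL as unseen, at the cost of a small false-positive rate.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array.
    Illustrative only; pcrawler uses the pybloom package instead."""
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k deterministic bit positions from salted md5 digests.
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Unlike a list of URLs, the memory footprint is fixed (m bits) no matter how many URLs are added, which is exactly why it avoids the out-of-memory failures described above.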
37 changes: 23 additions & 14 deletions docs/conf.py
@@ -15,7 +15,8 @@
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))

import sphinx_rtd_theme
from recommonmark.parser import CommonMarkParser

# -- Project information -----------------------------------------------------

@@ -28,7 +29,6 @@
# The full version, including alpha/beta/rc tags
release = u'0.0.1'


# -- General configuration ---------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
@@ -38,18 +38,24 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
# extensions = ['sphinx.ext.*']
extensions = [
'sphinx.ext.autosectionlabel',
'sphinx.ext.autodoc',
'sphinx.ext.intersphinx',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

source_suffix = ['.rst', '.md']
source_parsers = {
'.md': CommonMarkParser,
}
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']

source_encoding = 'utf-8'
# The master toctree document.
master_doc = 'index'

@@ -58,7 +64,8 @@
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
# language = None
language = 'en'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
@@ -68,14 +75,19 @@
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
# html_theme = 'alabaster'
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# html_logo = 'img/logo.svg'
html_theme_options = {
'logo_only': True,
'display_version': False,
}
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
@@ -103,7 +115,6 @@
# Output file base name for HTML help builder.
htmlhelp_basename = 'pcrawlerdoc'


# -- Options for LaTeX output ------------------------------------------------

latex_elements = {
@@ -132,7 +143,6 @@
u'ganliang', 'manual'),
]


# -- Options for manual page output ------------------------------------------

# One entry per manual page. List of tuples
@@ -142,7 +152,6 @@
[author], 1)
]


# -- Options for Texinfo output ----------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
@@ -154,7 +163,6 @@
'Miscellaneous'),
]


# -- Options for Epub output -------------------------------------------------

# Bibliographic Dublin Core info.
@@ -170,4 +178,5 @@
# epub_uid = ''

# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
autosectionlabel_prefix_document = True
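With `sphinx.ext.autosectionlabel` enabled and `autosectionlabel_prefix_document = True`, every section label is prefixed by its document name, so headings in different files cannot collide. A hypothetical cross-reference (the heading name here is only an example) would look like:

```rst
.. With autosectionlabel_prefix_document = True, the "Installation" heading
   in install.rst gets the label ``install:Installation``:

See :ref:`install:Installation` for the required packages.
```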
22 changes: 22 additions & 0 deletions docs/get_start.rst
@@ -0,0 +1,22 @@
Getting Started
===============

pcrawler is written in Python, so a Python environment must be installed to run the program. The currently supported version is python-2.7.15; support for other versions will be added over time.

Installing on Windows
---------------------
Download the "Windows x86-64 MSI installer" package from the official site (https://www.python.org/downloads/release/python-2715/), then click through the installer, accepting the defaults.


Installing on Linux
-------------------
Linux systems usually ship with a Python environment, but check the version number first by running the ``python`` command. If the version differs too much from python-2.7.15, replace the Python installation with a matching one.

Running the crawler
-------------------

.. code:: shell

    python crawler.py travis
70 changes: 64 additions & 6 deletions docs/index.rst
@@ -5,16 +5,74 @@
Welcome to pcrawler's documentation!
====================================
pcrawler is a Python crawler framework that makes it quick and easy to write your own crawler program. It consists of four major components: downloader, schedular, processor, and storage, and each component can be extended quickly and conveniently.

The code is open source, and `available on GitHub`_.

.. _available on GitHub: https://github.com/mumupy/pcrawler.git

The main documentation for the site is organized into a couple sections:

* :ref:`user-docs`
* :ref:`about-docs`
* :ref:`feature-docs`

Information about development is also available:

* :ref:`dev-docs`
* :ref:`design-docs`

.. _user-docs:

.. toctree::
   :maxdepth: 2
   :caption: User Documentation

   get_start
   install

.. _about-docs:

.. toctree::
   :maxdepth: 2
   :caption: About Read the Docs

   README
   bloomFilter

.. _feature-docs:

.. toctree::
   :maxdepth: 2
   :glob:
   :caption: Feature Documentation

   README
   bloomFilter


.. _dev-docs:

.. toctree::
   :maxdepth: 2
   :caption: Developer Documentation

   README
   bloomFilter

.. _design-docs:

.. toctree::
   :maxdepth: 2
   :caption: Designer Documentation

   Theme <https://sphinx-rtd-theme.readthedocs.io/en/latest/>
28 changes: 28 additions & 0 deletions docs/install.rst
@@ -0,0 +1,28 @@
Installation
============

Components that need to be installed to use pcrawler. Some are required by the program itself, while others are added for writing documentation and measuring test code coverage.

Project components
------------------
.. code:: shell

    # pybloom provides the bloomFilter used for data deduplication
    pip install pybloom
    # lxml analyzes HTML
    pip install lxml
    # avro stores the crawled data
    pip install avro
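The lxml package installed above is what pcrawler uses to analyze HTML. As a rough illustration of the same extraction idea without the third-party dependency, the standard library's ``HTMLParser`` can pull links out of a page (this sketch is not pcrawler code):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags. Illustrative only; pcrawler's
    processors use lxml for page analysis."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Extracted links like these are what the schedular component queues and deduplicates.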
Other components
----------------
.. code:: shell

    # codecov generates test code coverage reports
    pip install codecov
    # recommonmark converts md files into rst documents
    pip install recommonmark
