Merge remote-tracking branch 'origin/master'
Showing 9 changed files with 163 additions and 28 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,5 @@ | ||
Bloom Filter | ||
############ | ||
|
||
========== | ||
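A Bloom filter is a tool for fast membership filtering; the pcrawler crawler uses one as its deduplication strategy. | ||
The project originally deduplicated with a list, but the list's memory use kept growing as the crawler ran | ||
and eventually caused out-of-memory failures. Switching to a Bloom filter greatly reduces that memory consumption. |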
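To illustrate why a Bloom filter keeps memory bounded, here is a minimal standard-library sketch. It is not pcrawler's actual code: the class `SimpleBloomFilter`, the helper `dedupe`, and the sizes `num_bits` and `num_hashes` are all hypothetical names and values chosen for illustration. The bit array stays at a fixed size however many URLs are added, at the cost of occasional false positives (a new URL may rarely be reported as already seen).

```python
import hashlib


class SimpleBloomFilter(object):
    """Toy Bloom filter (hypothetical stand-in, not pcrawler's implementation)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # Fixed-size bit array: memory does not grow with the number of items.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(str(seed).encode("ascii") + data).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def dedupe(urls, seen=None):
    """Return the URLs not reported as seen, in order, recording each one."""
    seen = SimpleBloomFilter() if seen is None else seen
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

With the default sizes the filter occupies 1024 bytes. A real crawl would need a larger `num_bits` to keep the false-positive rate low.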
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
Getting Started | ||
=============== | ||
|
||
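pcrawler is written in Python, so a Python environment must be installed before it can run. The only | ||
Python version currently supported is python-2.7.15; support for other versions will be added later. | ||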
|
||
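Windows Development Environment | ||
------------------------------- | ||
Download the Python Windows x86-64 MSI installer from the official site (https://www.python.org/downloads/release/python-2715/), then click Next through each step of the installer. | ||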
|
||
|
||
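Linux Development Environment | ||
----------------------------- | ||
Linux systems usually ship with Python, but you still need to check its version. Running the python command shows the version number; if it differs significantly from 2.7.15, install | ||
a different Python package. | ||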
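The version check can also be done from inside Python. The sketch below assumes that "supported" means the same major.minor series as python-2.7.15. That reading, the constant `SUPPORTED_SERIES`, and the function `is_supported` are illustrative assumptions and not part of pcrawler.

```python
import sys

# Assumption: any 2.7.x interpreter is close enough to the documented 2.7.15.
SUPPORTED_SERIES = (2, 7)


def is_supported(version_info=None, series=SUPPORTED_SERIES):
    """Return True if the (major, minor) version matches the given series."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(series)


def describe():
    """Build a one-line report about the running interpreter."""
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "not the documented version"
    return "Python %s: %s" % (current, status)
```

Calling `describe()` before starting the crawler gives an early hint that the interpreter may need to be replaced.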
|
||
Running the Crawler | ||
------------------- | ||
|
||
.. code:: shell | ||
    python crawler.py travis |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
Installation | ||
============ | ||
|
||
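pcrawler needs several components installed. Some are required by the program itself, while others were added only for writing | ||
documentation and measuring test coverage. | ||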
|
||
Project Components | ||
------------------ | ||
.. code:: shell | ||
    # Bloom filter used for deduplicating data | ||
    pip install pybloom | ||
    # HTML parsing | ||
    pip install lxml | ||
    # Avro for storing crawled data | ||
    pip install avro | ||
Other Components | ||
---------------- | ||
.. code:: shell | ||
    # codecov for generating test coverage reports | ||
    pip install codecov | ||
    # recommonmark for converting md files into rst documents | ||
    pip install recommonmark |
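After installing, a quick import check can confirm which components are available. This is a minimal sketch, not part of pcrawler. It assumes each importable module name matches the pip package name listed above, and `REQUIRED`, `OPTIONAL`, and `missing_modules` are hypothetical names.

```python
import importlib

# Assumption: importable module names match the pip package names.
REQUIRED = ["pybloom", "lxml", "avro"]
OPTIONAL = ["codecov", "recommonmark"]


def missing_modules(names):
    """Return the subset of names that cannot be imported, in order."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

An empty list from `missing_modules(REQUIRED)` suggests the crawler's own dependencies are in place; entries reported for `OPTIONAL` only matter when building the documentation or measuring coverage.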