# 用python写网络爬虫

## Web Scraping with Python

【澳】 Richard Lawson著  
李斌 译  
中国工信出版集团 人民邮电出版社  
2016年9月第1版

豆瓣： 
[用Python写网络爬虫 (豆瓣)](https://book.douban.com/subject/26869992/)

别人的读书笔记：  

[《web scraping with python》（用Python写网络爬虫）读书笔记 - 简书](https://www.jianshu.com/p/26e7925177bb)  
[《用Python写网络爬虫》读书笔记 | Pythoner](http://www.pythoner.com/495.html)  
[《用python写网络爬虫》学习心得笔记 - 知乎](https://zhuanlan.zhihu.com/p/23957531)

## 0. 背景调研——爬取网站之前的先期了解

1. 检查robots.txt  
robots.txt可以让爬虫了解爬取该网站时存在哪些限制。检查robots.txt可以最小化爬虫被封禁的可能，还能发现和网站结构相关的线索。  
关于robots.txt的更多信息：[Web机器人页面](http://www.robotstxt.org/robotstxt.html)

2. 检查网站地图  
网站提供的sitemap文件可以帮爬虫定位网站最新的内容，而无须爬取每一个网页。不过该文件经常存在缺失、过期或不完整的问题。  
网站地图标准的定义：[sitemaps.org - Protocol](https://www.sitemaps.org/protocol.html)

3. 估算网站大小  
目标网站的大小会影响我们如何进行爬取。估算网站大小的一个简便方法是检查Google爬虫的结果，因为Google很可能已经爬取过我们感兴趣的网站。site:url

4. 识别网站所用技术  
构建网站所使用的技术类型也会对我们如何爬取产生影响。builtwith模块可以检查网站构建的技术类型。  

5. 寻找网站所有者  
对一些网站，我们可能会关心其所有者是谁。比如，若已知网站所有者会封禁网络爬虫，那就最好把下载速度控制得更加保守一些。为了找到网站的所有者，可以使用WHOIS协议查询域名的注册者是谁。安装模块python-whois，使用时`import whois`

In [33]:
import builtwith

In [3]:
builtwith.parse('http://example.webscraping.com/')

{'web-servers': ['Nginx'],
 'web-frameworks': ['Web2py', 'Twitter Bootstrap'],
 'programming-languages': ['Python'],
 'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI']}

In [34]:
builtwith.parse('http://www.engine3d.com')

{'font-scripts': ['Font Awesome'],
 'javascript-frameworks': ['Moment.js', 'jQuery'],
 'web-frameworks': ['Pure CSS', 'Twitter Bootstrap']}

In [35]:
builtwith.parse('https://www.baidu.com')

{}

In [36]:
builtwith.parse('https://www.taobao.com')

{'web-servers': ['Tengine']}

In [2]:
import whois

In [12]:
print(whois.whois('appspot.com'))

{
  "domain_name": [
    "APPSPOT.COM",
    "appspot.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-02-06 10:33:49",
    "2019-02-06 02:33:49"
  ],
  "creation_date": [
    "2005-03-10 02:27:55",
    "2005-03-09 18:27:55"
  ],
  "expiration_date": [
    "2020-03-10 01:27:55",
    "2020-03-09 00:00:00"
  ],
  "name_servers": [
    "NS1.GOOGLE.COM",
    "NS2.GOOGLE.COM",
    "NS3.GOOGLE.COM",
    "NS4.GOOGLE.COM",
    "ns4.google.com",
    "ns1.google.com",
    "ns3.google.com",
    "ns2.google.com"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",

In [3]:
print(whois.whois('engine3d.com'))

{
  "domain_name": [
    "ENGINE3D.COM",
    "engine3d.com"
  ],
  "registrar": "Alibaba Cloud Computing (Beijing) Co., Ltd.",
  "whois_server": "grs-whois.hichina.com",
  "referral_url": null,
  "updated_date": "2019-04-01 02:15:58",
  "creation_date": "2017-04-17 08:39:26",
  "expiration_date": "2020-04-17 08:39:26",
  "name_servers": [
    "DNS29.HICHINA.COM",
    "DNS30.HICHINA.COM"
  ],
  "status": "ok https://icann.org/epp#ok",
  "emails": "DomainAbuse@service.aliyun.com",
  "dnssec": "unsigned",
  "name": null,
  "org": null,
  "address": null,
  "city": null,
  "state": "jiang su",
  "zipcode": null,
  "country": null
}


## 1. 第一个网络爬虫

### 1.1 download(url)函数的各阶版本

#### 1.1.1. 最简单的写法：  

In [37]:
import urllib

def download(url):
    return urllib.request.urlopen(url).read()

In [44]:
url = 'https://www.workflowy.com'
type(download(url))

bytes

#### 1.1.2. 会捕获异常，写出异常原因。

In [40]:
#_*_coding:utf-8_*_
import urllib

def download(url):
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e :
        print('Download error: ', e.reason)
        html = None
    return html

In [46]:
url = 'https://www.taobao2.com'
download(url)

Download error: [Errno 60] Operation timed out


关于状态码：[RFC 7231 - Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content](https://tools.ietf.org/html/rfc7231#section-6)

#### 1.1.3. 会重试下载

In [8]:
def download(url, num_retries=2):
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        print('Download error: ', e.reason)
        html = None
        if num_retries>0:
            if hasattr(e, 'code') and 500<=e.code<=600:
                return download(url, num_retries-1)
    return html

In [24]:
help(hasattr)

Help on built-in function hasattr in module builtins:

hasattr(obj, name, /)
    Return whether the object has an attribute with the given name.
    
    This is done by calling getattr(obj, name) and catching AttributeError.



In [52]:
url = 'http://httpstat.us/500'
download(url)

Download Error:  Internal Server Error
Download Error:  Internal Server Error
Download Error:  Internal Server Error
Download Error:  Internal Server Error


#### 1.1.4. 设置用户代理

In [50]:
url = 'http://www.meetup.com'
type(download(url))

bytes

In [12]:
def download(url, user_agent='wswp', num_retries=2):
    print('Downloading: ', url)
    headers = {'User-agent':user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('Download Error: ', e.reason)
        html = None
        if num_retries>0:
            if hasattr(e, 'code') and 500<=code<600:
                return download(url, user_agent, num_retries-1)
    return html

In [None]:
import urllib

def download(url, user_agent='wswp', num_tries=2):
    print('Downloading: ', url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('Download Error: ', e.reason)
        html = None
        if num_tries>0:
            if hasattr(e, 'code') and 500<=e.code<600:
                return download(url, user_agent, num_tries-1)
    return html

### 1.2 第一个网络爬虫

#### 1.2.1. 按sitemap给出的链接获取

In [35]:
#_*_coding:utf-8_*_
import re
import urllib

def crawl_sitemap(url):
    sitemap = download(url)
    links = re.findall(r'<loc>(.*?)</loc>', sitemap)
    for link in links:
        html = download(link)

In [36]:
url = 'http://example.webscraping.com/sitemap.xml'
crawl_sitemap(url)

Downloading:  http://example.webscraping.com/sitemap.xml


TypeError: cannot use a string pattern on a bytes-like object

[一劳永逸解决：TypeError: cannot use a string pattern on a bytes-like object - 不随缘的随心，能改变的随性 - CSDN博客](https://blog.csdn.net/jieli_/article/details/70166244)

增加byte到string的转换

In [7]:
#_*_coding:utf-8_*_
import re
import urllib
import chardet # 检测编码格式

def crawl_sitemap(url):
    sitemap = download(url)
    encode_type = chardet.detect(sitemap)
    sitemap = sitemap.decode(encode_type['encoding']) #进行相应解码，赋给原标识符（变量）
    links = re.findall(r'<loc>(.*?)</loc>', sitemap)
    for link in links:
        html = download(link)

In [40]:
url = 'http://example.webscraping.com/sitemap.xml'
crawl_sitemap(url)

Downloading:  http://example.webscraping.com/sitemap.xml
Downloading:  http://example.webscraping.com/places/default/view/Afghanistan-1
Downloading:  http://example.webscraping.com/places/default/view/Aland-Islands-2
Downloading:  http://example.webscraping.com/places/default/view/Albania-3
Downloading:  http://example.webscraping.com/places/default/view/Algeria-4
Downloading:  http://example.webscraping.com/places/default/view/American-Samoa-5
Downloading:  http://example.webscraping.com/places/default/view/Andorra-6
Downloading:  http://example.webscraping.com/places/default/view/Angola-7
Downloading:  http://example.webscraping.com/places/default/view/Anguilla-8
Downloading:  http://example.webscraping.com/places/default/view/Antarctica-9
Downloading:  http://example.webscraping.com/places/default/view/Antigua-and-Barbuda-10
Downloading:  http://example.webscraping.com/places/default/view/Argentina-11
Downloading:  http://example.webscraping.com/places/default/view/Armenia-12
Downlo

NameError: name 'code' is not defined

第一次运行：'Too many requests'

第二次运行：'Service unavailable'

#### 1.2.2. 按ID遍历

`Downloading:  http://example.webscraping.com/places/default/view/Bahamas-17`  
这个URL中，Bahamas是页面别名，web服务器会忽略这个字符串，只使用ID来匹配数据库中的相关记录。 

In [41]:
import itertools

In [42]:
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d'%page
    html = download(url)
    if html is None:
        break
    else:
        pass

Downloading:  http://example.webscraping.com/view/-1
Downloading:  http://example.webscraping.com/view/-2
Downloading:  http://example.webscraping.com/view/-3
Downloading:  http://example.webscraping.com/view/-4
Downloading:  http://example.webscraping.com/view/-5
Downloading:  http://example.webscraping.com/view/-6
Downloading:  http://example.webscraping.com/view/-7
Downloading:  http://example.webscraping.com/view/-8
Downloading:  http://example.webscraping.com/view/-9
Downloading:  http://example.webscraping.com/view/-10
Downloading:  http://example.webscraping.com/view/-11
Downloading:  http://example.webscraping.com/view/-12
Downloading:  http://example.webscraping.com/view/-13
Downloading:  http://example.webscraping.com/view/-14
Downloading:  http://example.webscraping.com/view/-15
Downloading:  http://example.webscraping.com/view/-16
Downloading:  http://example.webscraping.com/view/-17
Downloading:  http://example.webscraping.com/view/-18
Downloading:  http://example.webscrap

NameError: name 'code' is not defined

连续发生多次下载错误后才退出程序。

In [45]:
max_errors = 5
num_errors = 0
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d'%page
    html = download(url)
    if html is None:
        num_errors += 1
        if num_errors == max_errors:
            break
        else:
            num_errors = 0

Downloading:  http://example.webscraping.com/view/-1
Downloading:  http://example.webscraping.com/view/-2
Downloading:  http://example.webscraping.com/view/-3
Downloading:  http://example.webscraping.com/view/-4
Downloading:  http://example.webscraping.com/view/-5
Downloading:  http://example.webscraping.com/view/-6
Downloading:  http://example.webscraping.com/view/-7
Downloading:  http://example.webscraping.com/view/-8
Downloading:  http://example.webscraping.com/view/-9
Downloading:  http://example.webscraping.com/view/-10
Downloading:  http://example.webscraping.com/view/-11
Downloading:  http://example.webscraping.com/view/-12
Downloading:  http://example.webscraping.com/view/-13
Downloading:  http://example.webscraping.com/view/-14
Downloading:  http://example.webscraping.com/view/-15
Downloading:  http://example.webscraping.com/view/-16
Downloading:  http://example.webscraping.com/view/-17
Downloading:  http://example.webscraping.com/view/-18
Downloading:  http://example.webscrap

NameError: name 'code' is not defined

#### 1.2.3. 链接爬虫

In [23]:
import re
import chardet

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        encode_type = chardet.detect(html)
        html = html.decode(encode_type['encoding'])
        for link in get_link(html):
            if re.match(link_regex, link):
                link = urllib.parse.urljoin(seed_url,link)
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
def get_link(html):
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)  
    return webpage_regex.findall(html)

python2中的urlparse，在python3中变成了urllib.parse，无须单独导入。

In [10]:
help(list.pop)

Help on method_descriptor:

pop(self, index=-1, /)
    Remove and return item at index (default last).
    
    Raises IndexError if list is empty or index is out of range.



In [11]:
help(re.match)

Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found.



In [24]:
link_crawler('http://example.webscraping.com', '/(index|view)')

Downloading:  http://example.webscraping.com


In [65]:
seed_url = 'http://example.webscraping.com'
crawl_queue = [seed_url]
seen = set(crawl_queue)
html = download(seed_url)

Downloading:  http://example.webscraping.com


In [68]:
def get_link(html):
    webpage_regex = re.compile('<a[^>]+href=["\'].*?["\']', re.IGNORECASE) 
    #webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE) 
    return webpage_regex.findall(html)
get_link(html)

['<a href="/places/default/index"',
 '<a class="dropdown-toggle" data-toggle="dropdown" href="#"',
 '<a href="/places/default/user/register?_next=/places/default/index"',
 '<a href="/places/default/user/login?_next=/places/default/index"',
 '<a href="/places/default/index"',
 '<a href="/places/default/search"',
 '<a href="/places/default/view/Afghanistan-1"',
 '<a href="/places/default/view/Aland-Islands-2"',
 '<a href="/places/default/view/Albania-3"',
 '<a href="/places/default/view/Algeria-4"',
 '<a href="/places/default/view/American-Samoa-5"',
 '<a href="/places/default/view/Andorra-6"',
 '<a href="/places/default/view/Angola-7"',
 '<a href="/places/default/view/Anguilla-8"',
 '<a href="/places/default/view/Antarctica-9"',
 '<a href="/places/default/view/Antigua-and-Barbuda-10"',
 '<a href="/places/default/index/1"']

原始html字符串是：`<a href="/places/default/view/Afghanistan-1"><img src="/places/static/images/flags/af.png"> Afghanistan</a>`

webpage_regex是：`<a[^>]+href=["\'](.*?)["\']`

为什么会匹配结果为:`/places/default/view/Afghanistan-1`？？秘密在于括号，有了括号，就是有分组。findall的返回值就是元组组成的列表。因为只有一个分组，所以就是只有`(.*?)`部分的内容。  

经过测试，拿掉了括号，果然结果是`<a href="/places/default/index"'`这种了，居然还有个a class的结果：`<a class="dropdown-toggle" data-toggle="dropdown" href="#"'`，瞬间明白了为什么表达式是`a[^>]+`，为了有的语句会在这中间插入class。

In [32]:
for link in get_link(html):
    link_regex = '/(index|view)'
    if re.match(link_regex, link):
        link = urllib.parse.urljoin(seed_url,link)
        if link not in seen:
            seen.add(link)
            crawl_queue.append(link)

In [52]:
for link in get_link(html):
    link_regex = '/(index|view)'
    if re.compile(link_regex).search(link):
        link = urllib.parse.urljoin(seed_url,link)
        if link not in seen:
            seen.add(link)
            crawl_queue.append(link)

In [39]:
re.match(link_regex, '/places/default/view/Antigua-and-Barbuda-10')

NoneType

In [40]:
htmlRegex = re.compile(link_regex)
mo = htmlRegex.search('/places/default/view/Antigua-and-Barbuda-10')
mo.group()

'/view'

In [47]:
re.match('/w+', '/places/default/view/Antigua-and-Barbuda-10')

**在python3中，re.match()必须要在起始位置匹配成功才行。python2似乎没有这样的规定。**

In [50]:
mo = re.compile(link_regex).search('/places/default/view/Antigua-and-Barbuda-10')
print(mo)

<re.Match object; span=(15, 20), match='/view'>


通过上面的研究，必须要把原程序中的match方法改成search方法才行。

In [69]:
#_*_coding:utf-8_*_
import urllib
import re
import chardet

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_link(html):
            if re.compile(link_regex).search(link):
                link = urllib.parse.urljoin(seed_url,link)
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

def get_link(html):
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)  
    return webpage_regex.findall(html)

def download(url, user_agent='wswp', num_retries=2):
    # 原先的输出为二进制，非string。把转string的代码放到这里，改为输出string。
    print('Downloading: ', url)
    headers = {'User-agent':user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('Download Error: ', e.reason)
        html = None
        if num_retries>0:
            if hasattr(e, 'code') and 500<=code<600:
                return download(url, user_agent, num_retries-1)
    encode_type = chardet.detect(html)
    html = html.decode(encode_type['encoding'])
    return html

#增加default，减少获取量。
link_crawler('http://example.webscraping.com', 'default/(index|view)')

Downloading:  http://example.webscraping.com
Downloading:  http://example.webscraping.com/places/default/index/1
Downloading:  http://example.webscraping.com/places/default/index/2
Downloading:  http://example.webscraping.com/places/default/index/3
Downloading:  http://example.webscraping.com/places/default/index/4
Downloading:  http://example.webscraping.com/places/default/index/5
Downloading:  http://example.webscraping.com/places/default/index/6
Downloading:  http://example.webscraping.com/places/default/index/7
Downloading:  http://example.webscraping.com/places/default/index/8
Downloading:  http://example.webscraping.com/places/default/index/9
Downloading:  http://example.webscraping.com/places/default/index/10
Downloading:  http://example.webscraping.com/places/default/index/11
Downloading:  http://example.webscraping.com/places/default/index/12
Downloading:  http://example.webscraping.com/places/default/index/13
Downloading:  http://example.webscraping.com/places/default/index/1

NameError: name 'code' is not defined

In [63]:
s = set([6,2,8,4,2,6,7,5,5])
print(s)

{2, 4, 5, 6, 7, 8}


In [59]:
type(s)

set

[Python3 集合类型 | Python 3 教程 - 与知](https://www.yuzhi100.com/tutorial/python3/python3-sets)

set（集合）类型是Python3的一种数据类型，集合（set）中包含的元素是无序的，无重复的序列。集合数据类型的主要作用是**测试是否是集合成员中的一个**，和**消除重复元素**。

集合（set）是可变数据类型，支持插入和删除元素，但是不支持索引和分片元素。  

测试是否是集合中的一个，以及消除重复元素，怪不得seen要用set类型，意思是只要看过的，一概不爬，不管这一路上碰到多少次。

In [75]:
url = 'http://www.engine3d.com'
user_agent='wswp'

headers = {'User-agent':user_agent}
request = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(request).read()
encode_type = chardet.detect(html)
html = html.decode(encode_type['encoding'])

In [71]:
print(headers)

{'User-agent': 'wswp'}


In [76]:
print(encode_type)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}


In [73]:
print(request)

<urllib.request.Request object at 0x10f474668>


In [77]:
print(html)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>三维GIS软件与服务开发商_三维GIS行业解决方案提供商_中科图新</title>
    <meta name="description" content="苏州中科图新网络科技有限公司是国内优秀的三维GIS软件与服务提供商，在三维GIS行业深耕近10年。公司专注于倾斜摄影实景三维应用，以技术创新为核心竞争力，已有4项发明专利和20多项软件著作权。中科图新拥有Wish3D和LocaSpace 两大产品系，致力于为企业用户和开发者提供更轻便实用的三维地图软件及数据发布服务，已经成功地应用于无人机航测、规划设计、水利、社区管理、智慧城市等多个领域 ，获得众多用户的认可。" />
    <meta name="keywords" content="locaspace、LSV、Wish3D、GIS、三维实景、解决方案、定制开发、OEM" />
    <link rel="shortcut icon" type="image/x-icon" href="assets/img/wish3d.ico?20170313" media="screen">
    <meta content='width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0' name='viewport'>
    <link rel="stylesheet" href="assets/css/font-awesome-4.7.0/css/font-awesome.css">
    <link rel="stylesheet" href="assets/css/bootstrap.min.css">
    <link rel="stylesheet" href="assets/css/material-kit-1.css?v=1.3.0">
    <link rel="stylesheet" href="assets/css/css.css">
    <link rel="stylesheet" href="assets/css/engine3d.cs

In [80]:
link_crawler(url,'.*?')

Downloading:  http://www.engine3d.com
Downloading:  http://www.miitbeian.gov.cn
Downloading:  http://news.engine3d.com/newdetail-5.html
Downloading:  http://www.engine3d.com/newlist.html
Downloading:  http://www.engine3d.com/newdetail-16.html
Downloading:  http://www.engine3d.com/newdetail-15.html
Downloading:  http://www.engine3d.com/newdetail-14.html
Downloading:  http://www.engine3d.com/newdetail-13.html
Downloading:  http://www.engine3d.com/newdetail-12.html
Downloading:  http://www.engine3d.com/newdetail-11.html
Downloading:  http://www.engine3d.com/newdetail-10.html
Downloading:  http://www.engine3d.com/newdetail-9.html
Downloading:  http://www.engine3d.com/newdetail-8.html
Downloading:  http://www.engine3d.com/newdetail-7.html
Downloading:  http://www.engine3d.com/newdetail-6.html
Downloading:  http://www.engine3d.com/newdetail-5.html
Downloading:  http://www.engine3d.com/newdetail-4.html
Downloading:  http://www.engine3d.com/newdetail-3.html
Downloading:  http://www.engine3d.co

RemoteDisconnected: Remote end closed connection without response

这两个网站的返回错误是不一样的：  
一个是`Download Error:  TOO MANY REQUESTS`，`NameError: name 'code' is not defined`
一个是`RemoteDisconnected   Traceback (most recent call last)`,`RemoteDisconnected: Remote end closed connection without response`。第二个没有返回e.reason。直接是失去连接。

In [86]:
help(urllib)

Help on package urllib:

NAME
    urllib

MODULE REFERENCE
    https://docs.python.org/3.7/library/urllib
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

PACKAGE CONTENTS
    error
    parse
    request
    response
    robotparser

FILE
    /Users/caimeijuan/anaconda/envs/python35/lib/python3.7/urllib/__init__.py




#### 1.2.4. 链接爬虫的高级功能

##### 解析robots.txt

In [101]:
import urllib.robotparser as robotparser

In [102]:
rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')

In [None]:
rp.read()

In [98]:
help(rp.read)

Help on method read in module urllib.robotparser:

read() method of urllib.robotparser.RobotFileParser instance
    Reads the robots.txt URL and feeds it to the parser.



使用了robots.txt文件中封禁的代理名称。

In [103]:
url = 'http://example.webscraping.com'
user_agent = 'BadCrawler'

In [104]:
rp.can_fetch(user_agent, url)

False

改用其他名称。

In [105]:
user_agent = 'GoodCrawler'

In [106]:
rp.can_fetch(user_agent, url)

False

**假如不运行rp.read()**'，则即使是GoodCrawler'这样的代理，也**无法获取**网页内容。

加入这个功能：

In [108]:
#_*_coding:utf-8_*_
import urllib
import re
import chardet
import urllib.robotparser as robotsparser

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    seen = set(crawl_queue)
    rp = robotparser.RobotFileParser()
    rp.set_url('http://example.webscraping.com/robots.txt')
    rp.read()
    while crawl_queue:
        url = crawl_queue.pop()
        if rp.can_fetch(user_agent, url):
            html = download(url)
        else:
            print('Blocked by robots.txt: ', url)
        for link in get_link(html):
            if re.compile(link_regex).search(link):
                link = urllib.parse.urljoin(seed_url,link)
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

def get_link(html):
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)  
    return webpage_regex.findall(html)

def download(url, user_agent='BadCrawler', num_retries=2):
    # 原先的输出为二进制，非string。把转string的代码放到这里，改为输出string。
    print('Downloading: ', url)
    headers = {'User-agent':user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('Download Error: ', e.reason)
        html = None
        if num_retries>0:
            if hasattr(e, 'code') and 500<=code<600:
                return download(url, user_agent, num_retries-1)
    encode_type = chardet.detect(html)
    html = html.decode(encode_type['encoding'])
    return html

#增加default，减少获取量。
link_crawler('http://example.webscraping.com', 'default/(index|view)')

Downloading:  http://example.webscraping.com
Downloading:  http://example.webscraping.com/places/default/index/1
Downloading:  http://example.webscraping.com/places/default/index/2
Downloading:  http://example.webscraping.com/places/default/index/3
Downloading:  http://example.webscraping.com/places/default/index/4
Downloading:  http://example.webscraping.com/places/default/index/5
Downloading:  http://example.webscraping.com/places/default/index/6
Downloading:  http://example.webscraping.com/places/default/index/7
Downloading:  http://example.webscraping.com/places/default/index/8
Downloading:  http://example.webscraping.com/places/default/index/9
Downloading:  http://example.webscraping.com/places/default/index/10
Downloading:  http://example.webscraping.com/places/default/index/11
Download Error:  TOO MANY REQUESTS


NameError: name 'code' is not defined

`user_agent='BadCrawler'`没有带来封禁，仍然死于too many requests.这是为什么呢？

### ---------------------

In [1]:
import urllib

## 了解urllib

### urllib.error

- urllib.error.ContentTooShortError: Exception raised when downloaded size does not match content-length.当urlretrieve() 函数检测到下载数据量小于预期量（由Content-Length标头给出）时，会引发此异常。
- urllib.error.HTTPError: Raised when HTTP error occurs, but also acts like non-error return. 虽然是一个异常（它的子类URLError），但 HTTPError它也可以作为一个非特殊的文件类返回值（urlopen()返回相同的东西）。这在处理异常HTTP错误（例如身份验证请求）时非常有用。
- urllib.error.URLError: Base class for I/O related errors. 处理程序遇到问题时会引发此异常（或派生异常）。它是的子类OSError。
- urllib.error.urllib

### urllib.response

- urllib.response.addbase: Base class for addinfo and addclosehook. Is a good idea for garbage collection.
- urllib.response.addclosehook: Class to add a close hook to an open file.
- urllib.response.addinfo: class to add an info() method to an open file.
- urllib.response.addinfourl: class to add info() and geturl() methods to an open file.


### urllib.request: 打开和浏览url中内容 

- urllib.request.url2pathname: 
OS-specific conversion from a relative URL of the 'file' scheme to a file system path; not recommended for general use.  

- urllib.request.urlcleanup: 
Clean up temporary files from urlretrieve calls.  

- urllib.request.urljoin： 
Join a base URL and a possibly relative URL to form an absolute interpretation of the latter.  

- urllib.request.urlopen: 
Open the URL url, which can be either a string or a Request object. 
    `urlopen(url, data=None, timeout=<object object at 0x109fa9630>, *, cafile=None, capath=None, cadefault=False, context=None)`   

- urllib.request.urlretrieve: 
Retrieve a URL into a temporary location on disk.
    `urlretrieve(url, filename=None, reporthook=None, data=None)`

- urllib.request.urlsplit:  
Parse a URL into 5 components:`<scheme>://<netloc>/<path>?<query>#<fragment>`
    `urlsplit(url, scheme='', allow_fragments=True)`

- urllib.request.urlparse:  
Parse a URL into 6 components:`<scheme>://<netloc>/<path>;<params>?<query>#<fragment>`
    `urlparse(url, scheme='', allow_fragments=True)`  
    
- urllib.request.urlunparse:
Put a parsed URL back together again.  This may result in a slightly different, but equivalent URL, if the URL that was parsed originally had redundant delimiters, e.g. a ? with an empty query (the draft states that these are equivalent).

- urllib.request.Request: 
    `Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)`
    `type(urllib.request.Request(url)): urllib.request.Request `，可替代`urllib.request.urlopen(url)`中的url。

### urllib.parse:解析url

- urllib.parse.urldefrag: 
Removes any existing fragment from URL. Returns a tuple of the defragmented URL and the fragment.  If the URL contained no fragments, the second element is the empty string.

- urllib.parse.urlencode: 
Encode a dict or sequence of two-element tuples into a URL query string.
    `urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=<function quote_plus at 0x10a8c7ea0>)`  

- urllib.parse.urljoin: 
Join a base URL and a possibly relative URL to form an absolute interpretation of the latter.
    `urljoin(base, url, allow_fragments=True)`

- urllib.parse.urlparse: 
Parse a URL into 6 components:` <scheme>://<netloc>/<path>;<params>?<query>#<fragment>`
    `urlparse(url, scheme='', allow_fragments=True)`

- urllib.parse.urlunparse: 
Put a parsed URL back together again.  This may result in a slightly different, but equivalent URL, if the URL that was parsed originally had redundant delimiters, e.g. a ? with an empty query (the draft states that these are equivalent).

- urllib.parse.urlsplit: 
Parse a URL into 5 components:` <scheme>://<netloc>/<path>?<query>#<fragment>`  
    `urlsplit(url, scheme='', allow_fragments=True)`

- urllib.parse.urlunsplit: 
 Combine the elements of a tuple as returned by urlsplit() into a complete URL as a string. The data argument can be any five-item iterable. This may result in a slightly different, but equivalent URL, if the URL that was parsed originally had unnecessary delimiters (for example, a ? with an empty query; the RFC states that these are equivalent).