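## 1. Crawler Basics

We will work through the following questions step by step:
1. How do we fetch a page's source?
2. How do we parse that source?
3. How do we simulate a login programmatically?
4. How do we stay logged in across requests?
5. How do we handle content rendered dynamically by JavaScript?
6. How do we get around IP limits? (proxies)
7. How do we crack CAPTCHAs? (recognition)
8. How do we get past human-interaction verification?
9. How do we crawl data from mobile apps? (packet capture)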

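### 1.1. Crawler Workflow

1. Send the request:
    - Send a `Request` (it can carry headers and other information)
2. Receive the response:
    - Get back a `Response`, whose body is the page content
    - The body may be HTML, JSON, or binary data (images, video, files, etc.)
3. Parse the content:
    - Parse what came back and filter out the parts you need
4. Save the data:
    - Store the extracted data in a database or in a specific file format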
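
The four steps above can be sketched as one small pipeline. This is only an illustration under stated assumptions: the function names (`fetch`, `parse_title`, `save`, `crawl`), the `demo-crawler/0.1` User-Agent, and the title-only parsing rule are invented for this example, and the network call is injectable so the flow can be tried offline.

```python
import json
import re
import urllib.request


def fetch(url, timeout=10):
    """Steps 1-2: send the request and return the response body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "demo-crawler/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_title(html):
    """Step 3: pull out the part we care about (here, only the <title>)."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def save(record):
    """Step 4: serialize the record (a real crawler would write to a DB or file)."""
    return json.dumps(record, ensure_ascii=False)


def crawl(url, fetcher=fetch):
    """Run the whole flow; `fetcher` is swappable so it can be tried offline."""
    html = fetcher(url)
    return save({"url": url, "title": parse_title(html)})


# Offline try-out: a fake fetcher stands in for the network
fake_page = "<html><head><title> 3.7.2 Documentation </title></head></html>"
result = crawl("https://docs.python.org", fetcher=lambda url: fake_page)
print(result)
```

Calling `crawl(url)` without the `fetcher` argument would go through the real `fetch()` and therefore needs network access.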

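**PS: more on the Request:**
1. Request method: mainly `GET` and `POST`; others include PUT, DELETE, OPTIONS, etc.
2. Request headers: header fields sent with the request, e.g. `User-Agent`, `Host`, `Cookie`
3. Request body: extra data carried by the request, e.g. form data (`mostly for POST requests`)

**PS: more on the Response:**
1. Status code: 200 OK, 301 permanent redirect, 404 page not found, 502 bad gateway (a server-side error)
2. Response headers: content type, content length, server information, cookies to set (`Set-Cookie`), etc.
3. Response body: the content of the requested resource, e.g. page HTML, JSON, or binary data such as images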
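
To make the three request parts concrete, the sketch below builds a `urllib.request.Request` and inspects it without sending anything. The `demo-crawler/0.1` User-Agent is a made-up value for illustration; the URL and form fields mirror the httpbin demo used later in this section.

```python
import urllib.parse
import urllib.request

form = urllib.parse.urlencode({"username": "dnt", "password": "dnt"}).encode("utf-8")
request = urllib.request.Request(
    "http://httpbin.org/post",
    data=form,                                   # request body
    headers={"User-Agent": "demo-crawler/0.1"},  # request headers
    method="POST",                               # request method
)

print(request.get_method())              # POST
print(request.full_url)                  # http://httpbin.org/post
print(request.get_header("User-agent"))  # urllib stores header names capitalized
print(request.data)                      # b'username=dnt&password=dnt'

# With no data and no explicit method, urllib falls back to GET
plain = urllib.request.Request("http://httpbin.org/get")
print(plain.get_method())                # GET
```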

### 1.2. Request Libraries

### 1.2.1. The Built-in urllib Library

Components:
1. `urllib.request` # request module
2. `urllib.error` # exception handling module
3. `urllib.parse` # URL parsing module
4. `urllib.robotparser` # `robots.txt` parsing module

PS: Python 2's `urllib2.urlopen()` => Python 3's `urllib.request.urlopen()`

#### 1. Response demo

In [1]:
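import urllib.request  # a bare `import urllib` does not reliably load the submodule

response = urllib.request.urlopen("https://docs.python.org")
# Status code
print(response.status)
# Get a single response header
print(response.getheader("Server")) # getheader (no trailing "s")
# Get all response headers
print(response.getheaders())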

200
nginx
[('Server', 'nginx'), ('Content-Type', 'text/html'), ('Last-Modified', 'Fri, 08 Mar 2019 00:27:23 GMT'), ('ETag', '"5c81b6eb-27bd"'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Strict-Transport-Security', 'max-age=315360000; includeSubDomains; preload'), ('Via', '1.1 varnish'), ('Content-Length', '10173'), ('Accept-Ranges', 'bytes'), ('Date', 'Fri, 08 Mar 2019 13:41:04 GMT'), ('Via', '1.1 varnish'), ('Age', '43530'), ('Connection', 'close'), ('X-Served-By', 'cache-jfk8124-JFK, cache-bur17520-BUR'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 14'), ('X-Timer', 'S1552052465.718755,VS0,VE0'), ('Vary', 'Accept-Encoding')]


In [2]:
print(response.read().decode("utf-8")) # read the response body


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>3.7.2 Documentation</title>
    <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    
    <script type="text/javascript" id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <script type="text/javascript" src="_static/language_data.js"></script>
    
    <script type="text/javascript" src="_static/sidebar.js"></script>
    

#### 2. Request demo

Official docs: <https://docs.python.org/3/library/urllib.request.html>

In [4]:
import urllib.request

# 1. GET request
response = urllib.request.urlopen("https://docs.python.org")
html = response.read().decode("utf-8")
print(html)


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>3.7.2 Documentation</title>
    <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    
    <script type="text/javascript" id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <script type="text/javascript" src="_static/language_data.js"></script>
    
    <script type="text/javascript" src="_static/sidebar.js"></script>
    

In [5]:
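import urllib.parse
import urllib.request

# 2. POST request
url = "http://httpbin.org/post" # a site that offers HTTP testing endpoints
# URL-encode the form, then convert it to bytes
data = urllib.parse.urlencode({"username":"dnt","password":"dnt"}).encode("utf-8")
# Submit via POST (passing data makes it a POST; omitting it makes it a GET)
response = urllib.request.urlopen(url, data=data)
# Inspect the returned content
result = response.read().decode("utf-8")
print(result)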

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "password": "dnt", 
    "username": "dnt"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "25", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.7"
  }, 
  "json": null, 
  "origin": "121.235.195.127, 121.235.195.127", 
  "url": "https://httpbin.org/post"
}



In [6]:
import urllib.error
import urllib.request

try:
    # 3. Setting a timeout (in seconds)
    response = urllib.request.urlopen("https://httpbin.org/get", timeout=0.1)
    print(response.read().decode("utf-8"))
except urllib.error.URLError as ex:
    print(ex)

<urlopen error timed out>


In [7]:
# Official docs: https://docs.python.org/3/library/urllib.robotparser.html
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url(url="https://docs.python.org/robots.txt")
rp.read()

# Check whether a URL may be crawled
print(rp.can_fetch("*", "https://docs.python.org/dev"))     # allowed
print(rp.can_fetch("*", "https://docs.python.org/release")) # disallowed

True
False
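
The same check can be tried without a network by feeding rules straight to `RobotFileParser.parse()`. The `robots.txt` content and the `example.com` URLs below are made up purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, used only for this illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

allowed = rp.can_fetch("*", "https://example.com/index.html")
blocked = rp.can_fetch("*", "https://example.com/private/data.html")
print(allowed)  # True
print(blocked)  # False
```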


#### 3. Customized request demo

Request: <https://docs.python.org/3/library/urllib.request.html#request-objects>

In [1]:
import urllib.request

# Set the request headers
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
request = urllib.request.Request("http://www.biquge.cm/login.php",headers = headers)

# A Request object can be passed to urlopen just like a URL
response = urllib.request.urlopen(request)
print(response.read().decode("gbk")) # this site is encoded as gbk (most sites use utf-8)

<!doctype html>
<html>
<head>
<title>笔趣阁_书友最值得收藏的网络小说阅读网</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
<meta name="keywords" content="笔趣阁,网络小说,小说阅读网,小说" />
<meta name="description" content="笔趣阁是广大书友最值得收藏的网络小说阅读网，网站收录了当前最火热的网络小说，免费提供高质量的小说最新章节，是广大网络小说爱好者必备的小说阅读网。" />
<link rel="stylesheet" type="text/css" href="/images/BiQuGeCm.css"/>
<script src="//libs.baidu.com/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript" src="/images/BiQuGeCm.js"></script>
<script type="text/javascript" src="/scripts/wap.js"></script>
</head>
<body>
<div id="wrapper">
<script>login();</script>
		<div class="header">
			<div class="header_logo">
				<a href="/">笔趣阁</a>
			</div>
			<script>bqg_panel();</script>
		</div>
		<div class="nav">
			<ul>
<li><a href="/">首页</a></li>
<li><a href="/modules/article/bookcase.php">书架</a></li>
<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
<li><a href="/wuxiaxiaoshuo/">武侠小说</a></li>
<li><a href="/dush

In [2]:
import urllib.parse
import urllib.request

# Request headers
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
# POST form data
data = urllib.parse.urlencode({"username":"dnt","password":"dnt","action":"login"}).encode("utf-8")
# Build the request
request = urllib.request.Request("http://www.biquge.cm/login.php", data = data, headers = headers, method="POST") # method name in uppercase
# Send the POST request
response = urllib.request.urlopen(request)
# Print the page content
print(response.read().decode("gbk"))  # this site is encoded as gbk (most sites use utf-8)

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=gbk" />
<meta http-equiv="refresh" content='4; url=http://www.biquge.cm/'>
<title>登录成功</title>
<link rel="stylesheet" type="text/css" media="all" href="http://www.biquge.cm/themes/skin/style.css" />
<script language="Javascript">
function Show(divid){
  if(document.all) divid.filters.revealTrans.apply(); 
  divid.style.visibility = "visible"; 
  if(document.all) divid.filters.revealTrans.play(); 
}
function Hide(divid){
  if(document.all) divid.filters.revealTrans.apply();
  divid.style.visibility = "hidden";
  if(document.all) divid.filters.revealTrans.play();
}
setTimeout("Hide(document.getElementById('msgboard'))",3000);
</script>
<script language="javascript" type="text/javascript" src="http://www.biquge.cm/scripts/common.js"></script>
			</head>
<body onload="Show(document.getElementById('msgboard'))">
<div style="width:100%; height:100%; text-align:center; padding-top:150px;">
<div id
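
Both cells above hard-code `decode("gbk")`. One possible way to avoid guessing is to read the charset the page declares about itself. The helper names `sniff_charset` and `decode_body` are hypothetical, the sample page is a tiny stand-in rather than a real response, and the regex is a rough heuristic, not a full HTML parser. If the declared name is not a codec Python knows, `decode()` raises `LookupError`.

```python
import re


def sniff_charset(raw, default="utf-8"):
    """Look for charset=... near the start of an HTML page (heuristic only)."""
    match = re.search(rb"charset=[\"']?([\w-]+)", raw[:2048], re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default


def decode_body(raw):
    """Decode with the declared charset, replacing bytes that do not fit."""
    return raw.decode(sniff_charset(raw), errors="replace")


# A tiny stand-in for a gbk-encoded response body
sample = '<meta http-equiv="Content-Type" content="text/html; charset=gbk" /><title>笔趣阁</title>'
raw = sample.encode("gbk")

print(sniff_charset(raw))          # gbk
print(decode_body(raw) == sample)  # True
```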

#### 4. Proxy demo

Handler: <https://docs.python.org/3/library/urllib.request.html#proxyhandler-objects>

In [1]:
import urllib.request

# Configure the proxy
# PS: with a local proxy client running, you can use 127.0.0.1:<port>
proxy_handler = urllib.request.ProxyHandler({
    "http":"http://127.0.0.1:1080",
    "https":"http://127.0.0.1:1080"
})
# Build the opener object
opener = urllib.request.build_opener(proxy_handler)
# Send the request (the url may also be a prepared Request object)
response = opener.open("http://httpbin.org/get")
print(response.read().decode("utf-8"))

{
  "args": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.7"
  }, 
  "origin": "47.94.230.42, 47.94.230.42", 
  "url": "https://httpbin.org/get"
}
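
Before sending traffic it can help to confirm that the opener really carries the proxy settings. The sketch below only inspects the opener and sends no request; the proxy address is the same placeholder used above and is assumed to be a proxy you run yourself.

```python
import urllib.request

# Placeholder local proxy address; replace it with a proxy you actually run
PROXY = "http://127.0.0.1:1080"

proxy_handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(proxy_handler)

# Our handler replaces the default ProxyHandler, so exactly one should be present
proxy_handlers = [h for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)]
print(len(proxy_handlers), proxy_handlers[0].proxies)

# Optional: make every later urllib.request.urlopen() call go through this opener
# urllib.request.install_opener(opener)
```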



#### 5. Cookie

Official docs: <https://docs.python.org/3/library/http.cookiejar.html>

In [1]:
import urllib.request
import http.cookiejar

# Create the cookie jar
cookie = http.cookiejar.CookieJar()
# Build the cookie processor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build the opener object
opener = urllib.request.build_opener(handler)
# Send the request
response = opener.open("http://www.baidu.com")

# Iterate over the cookies that were set
for item in cookie:
    print(f"{item.name}：{item.value}")

BAIDUID：74183E8848D87E4B3C6C3A5ADA76BC75:FG=1
BIDUPSID：74183E8848D87E4B3C6C3A5ADA76BC75
H_PS_PSSID：26524_1456_25809_21126_18559_28607_28584_28557_28519_22159
PSTM：1552365525
delPer：0
BDSVRTM：0
BD_HOME：0


In [2]:
# 1. Persisting cookies to a file

import urllib.request
import http.cookiejar

# Save the cookies to a file; until they expire they keep the login state
cookie = http.cookiejar.MozillaCookieJar("cookie.log") # a subclass of CookieJar

# Same as above, with cookie persistence added
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")

# Iterate over the cookies that were set
for item in cookie:
    print(f"{item.name}：{item.value}")

# Persist the cookies
# ignore_discard=True also saves cookies marked to be discarded; ignore_expires=True also saves expired cookies
cookie.save(ignore_discard=True,ignore_expires=True)

BAIDUID：DEEE77C745644D65499E4EB19BAD878C:FG=1
BIDUPSID：DEEE77C745644D65499E4EB19BAD878C
H_PS_PSSID：1430_21100_18560_28607_28585_26350_28603_28625_28606
PSTM：1552365526
delPer：0
BDSVRTM：0
BD_HOME：0


**Contents of `cookie.log`:**
```shell
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	3699840499	BAIDUID	F9E665DE3DF72C6DBA64895A2E390059:FG=1
.baidu.com	TRUE	/	FALSE	3699840499	BIDUPSID	F9E665DE3DF72C6DBA64895A2E390059
.baidu.com	TRUE	/	FALSE		H_PS_PSSID	1439_25809_21123_28607_28584_28557_28603_28606
.baidu.com	TRUE	/	FALSE	3699840499	PSTM	1552356852
.baidu.com	TRUE	/	FALSE		delPer	0
www.baidu.com	FALSE	/	FALSE		BDSVRTM	0
www.baidu.com	FALSE	/	FALSE		BD_HOME	0
```

In [3]:
# 1. Loading persisted cookies

import urllib.request
import http.cookiejar

cookie = http.cookiejar.MozillaCookieJar()
cookie.load("cookie.log", ignore_discard=True, ignore_expires=True)

# Iterate over the persisted cookies
for item in cookie:
    print(f"{item.name}：{item.value}")

handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
print(len(response.read().decode("utf-8")))

BAIDUID：DEEE77C745644D65499E4EB19BAD878C:FG=1
BIDUPSID：DEEE77C745644D65499E4EB19BAD878C
H_PS_PSSID：1430_21100_18560_28607_28585_26350_28603_28625_28606
PSTM：1552365526
delPer：0
BDSVRTM：0
BD_HOME：0
152993


In [4]:
# 2. Persisting cookies (second approach)

import urllib.request
import http.cookiejar

# Create the cookie jar
cookie = http.cookiejar.LWPCookieJar("cookie.log")
# Build the cookie processor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build the opener object
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")

for item in cookie:
    print(f"{item.name}：{item.value}")

# ignore_discard=True also saves cookies marked to be discarded; ignore_expires=True also saves expired cookies
cookie.save(ignore_discard=True, ignore_expires=True)

BAIDUID：EB644FB1E014B90D39201AC232DE71ED:FG=1
BIDUPSID：EB644FB1E014B90D39201AC232DE71ED
H_PS_PSSID：26523_1453_21119_28607_28584_28558_28604_28606
PSTM：1552365531
delPer：0
BDSVRTM：0
BD_HOME：0


**Contents of `cookie.log`:**
```shell
#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="DF147C18AA3C7FB03A7562B6070A2DA1:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-03-30 07:17:47Z"; version=0
Set-Cookie3: BIDUPSID=DF147C18AA3C7FB03A7562B6070A2DA1; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-03-30 07:17:47Z"; version=0
Set-Cookie3: H_PS_PSSID=1448_21093_28607_28584_26350_28643_28606; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1552363421; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-03-30 07:17:47Z"; version=0
Set-Cookie3: delPer=0; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
```

In [5]:
# 2. Loading persisted cookies

import urllib.request
import http.cookiejar

cookie = http.cookiejar.LWPCookieJar()
cookie.load("cookie.log") # without ignore_discard=True, session (discard) cookies are skipped

for item in cookie:
    print(f"{item.name}：{item.value}")

handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
# print(response.read().decode("utf-8"))

BAIDUID：EB644FB1E014B90D39201AC232DE71ED:FG=1
BIDUPSID：EB644FB1E014B90D39201AC232DE71ED
PSTM：1552365531
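
The last load printed only three cookies because `load()` was called without `ignore_discard=True`, so cookies flagged `discard` (session cookies) were skipped. The sketch below reproduces that offline with hand-built cookies; the `make_cookie` helper, the `example.com` domain, and the cookie names are assumptions made for this illustration.

```python
import http.cookiejar
import os
import tempfile
import time


def make_cookie(name, value, expires=None):
    """Build a cookie by hand; with no expiry it is a session (discard) cookie."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "cookie.log")

    jar = http.cookiejar.LWPCookieJar(path)
    jar.set_cookie(make_cookie("persistent", "1", expires=int(time.time()) + 3600))
    jar.set_cookie(make_cookie("session", "2"))
    jar.save(ignore_discard=True, ignore_expires=True)

    default_load = http.cookiejar.LWPCookieJar()
    default_load.load(path)  # session cookies are skipped
    full_load = http.cookiejar.LWPCookieJar()
    full_load.load(path, ignore_discard=True, ignore_expires=True)

names_default = sorted(c.name for c in default_load)
names_full = sorted(c.name for c in full_load)
print(names_default)  # ['persistent']
print(names_full)     # ['persistent', 'session']
```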


#### 6. Exception Handling

Official docs: <https://docs.python.org/3/library/urllib.error.html>

Demo:

In [1]:
import urllib.error
import urllib.request

try:
    # Request a page that returns 404
    response = urllib.request.urlopen("https://www.cnblogs.com/dotnetcrazy/p/abc.html")
except urllib.error.URLError as e:
    print(e.reason) # reason for the error

Not Found


In [2]:
import urllib.error
import urllib.request

try:
    # Request a page that returns 404
    response = urllib.request.urlopen("https://www.cnblogs.com/dotnetcrazy/p/abc.html")
except urllib.error.HTTPError as ex: # HTTPError is a subclass of URLError, so catch it first
    print("urllib.error.HTTPError")
    print(ex.reason, ex.code, ex.headers, sep="\n")
except urllib.error.URLError as ex:
    print("urllib.error.URLError")
    print(ex.reason) # reason for the error

urllib.error.HTTPError
Not Found
404
Date: Wed, 20 Mar 2019 13:29:44 GMT
Content-Type: text/html
Content-Length: 759
Connection: close
Cache-Control: private, max-age=10
Expires: Wed, 20 Mar 2019 13:29:54 GMT
Last-Modified: Wed, 20 Mar 2019 13:29:44 GMT
X-UA-Compatible: IE=10
X-Frame-Options: SAMEORIGIN




In [3]:
import socket
import urllib.error
import urllib.request

try:
    # Simulate a request that times out
    response = urllib.request.urlopen("https://dotnetcrazy.cnblogs.com",timeout=0.01)
    print(response.read().decode("utf-8"))
except urllib.error.URLError as ex:
    print(type(ex.reason))
    if isinstance(ex.reason, socket.timeout):
        print("Request timed out~")


<class 'socket.timeout'>
Request timed out~
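
The three cases above can be folded into one helper. This is a sketch, with `describe` being a hypothetical name, and the exceptions are constructed by hand so it runs without a network. One caveat: a timeout that fires while reading the body (rather than while connecting) may surface as a bare `socket.timeout` instead of a `URLError`, which is why the helper has a final fallback branch.

```python
import email.message
import io
import socket
import urllib.error


def describe(ex):
    """Turn an exception from urlopen into a short, log-friendly message."""
    if isinstance(ex, urllib.error.HTTPError):  # check the subclass first
        return f"HTTP {ex.code}: {ex.reason}"
    if isinstance(ex, urllib.error.URLError):
        if isinstance(ex.reason, socket.timeout):
            return "timeout"
        return f"URL error: {ex.reason}"
    return f"other: {ex!r}"


# Errors built by hand so the helper can be tried offline
not_found = urllib.error.HTTPError(
    "http://example.com/abc", 404, "Not Found", email.message.Message(), io.BytesIO()
)
timed_out = urllib.error.URLError(socket.timeout("timed out"))
refused = urllib.error.URLError("connection refused")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe(not_found))  # HTTP 404: Not Found
print(describe(timed_out))  # timeout
print(describe(refused))    # URL error: connection refused
```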


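#### 7. URL Parsing

Official docs: <https://docs.python.org/3/library/urllib.parse.html>

- `urllib.parse.urlparse()`: parses a URL string into a 6-part tuple
- `urllib.parse.urlencode()`: encodes a dict (or a sequence of pairs) as a query string
- `urllib.parse.urljoin()`: combines a base URL with a possibly relative URL into an absolute URL
- `urllib.parse.urlunparse()`: rebuilds a URL from its 6 parts (the inverse of `urlparse()`)

In [4]:
import urllib.parse

# Parse the URL into its components
result = urllib.parse.urlparse("https://www.baidu.com/s?wd=mmd&ie=utf-8#top")

print(type(result))
print(result)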

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='wd=mmd&ie=utf-8', fragment='top')


In [5]:
import urllib.parse

# Supply a default scheme
result = urllib.parse.urlparse("www.baidu.com/s?wd=mmd&ie=gbk#top", scheme="http")
print(result)

# If the URL already carries a scheme, that scheme wins
result = urllib.parse.urlparse("https://www.baidu.com/s?wd=mmd&ie=gbk#top", scheme="http")
print(result)

ParseResult(scheme='http', netloc='', path='www.baidu.com/s', params='', query='wd=mmd&ie=gbk', fragment='top')
ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='wd=mmd&ie=gbk', fragment='top')
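
Note the first result above: without a leading `//`, `urlparse()` treats `www.baidu.com/s` as a path and leaves `netloc` empty. One possible workaround is sketched below; `parse_lenient` is a hypothetical helper, and its `"://"` check is a simple heuristic that would misfire on a bare URL whose query string itself contains `://`.

```python
from urllib.parse import urlparse


def parse_lenient(url, default_scheme="http"):
    """Parse a URL, treating a bare 'host/path' as a network location."""
    if "://" not in url and not url.startswith("//"):
        url = "//" + url  # a leading // tells urlparse that a netloc follows
    return urlparse(url, scheme=default_scheme)


bare = parse_lenient("www.baidu.com/s?wd=mmd&ie=gbk#top")
print(bare.scheme, bare.netloc, bare.path)  # http www.baidu.com /s

full = parse_lenient("https://www.baidu.com/s?wd=mmd")
print(full.scheme, full.netloc)             # https www.baidu.com
```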


In [6]:
import urllib.parse

# Extra: rebuilding a URL from its parts
data = ["http", "www.baidu.com", "/s", "", "wd=mmd", "top"]
url = urllib.parse.urlunparse(data)
print(url)

http://www.baidu.com/s?wd=mmd#top


In [7]:
import urllib.parse

# Join a base URL with a possibly relative URL to form an absolute URL
result = urllib.parse.urljoin("http://www.baidu.com/s","?wd=mmd&ie=gbk#top")
print(result)

http://www.baidu.com/s?wd=mmd&ie=gbk#top


In [8]:
import urllib.parse

# ★ Stick to the pattern from the first demo (path, parameters) ★

# A few joins you may run into
result = urllib.parse.urljoin("http://www.baidu.com/s#top","?wd=mmd&ie=gbk")
print(1, result) # the #top fragment is gone

# urljoin merges as much as it can; if the second URL is reasonably complete, it wins
# When both arguments are sites, the second one is used
result = urllib.parse.urljoin("www.baidu.com","www.oschina.net")
print(2, result)

result = urllib.parse.urljoin("http://www.baidu.com","http://www.oschina.net")
print(3, result)

# Another pitfall, avoid this
result = urllib.parse.urljoin("http://www.baidu.com","www.oschina.net")
print(4, result)

result = urllib.parse.urljoin("http://www.baidu.com","/s?wd=mmd&ie=gbk#top")
print(5, result) # not recommended

result = urllib.parse.urljoin("www.baidu.com","/s?wd=mmd&ie=gbk#top")
print(6, result) # not recommended

1 http://www.baidu.com/s?wd=mmd&ie=gbk
2 www.oschina.net
3 http://www.oschina.net
4 http://www.baidu.com/www.oschina.net
5 http://www.baidu.com/s?wd=mmd&ie=gbk#top
6 /s?wd=mmd&ie=gbk#top
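
In a crawler, `urljoin()` is most often used to turn the `href` values found on a page into absolute URLs. The sketch below shows the common cases; the page URL and link values are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the page the links were found on (made up for this example)
base = "http://www.baidu.com/a/b.html"

links = [
    "c.html",                   # relative to the current directory
    "../c.html",                # one directory up
    "/s?wd=mmd",                # relative to the site root
    "//www.oschina.net/x",      # protocol-relative: keeps the base scheme
    "https://www.oschina.net",  # already absolute: returned unchanged
]
absolute = [urljoin(base, link) for link in links]
for url in absolute:
    print(url)
```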


In [9]:
import urllib.parse

# Data often arrives as a dict; urlencode turns it into a query string quickly
params = {"name":"小明", "age":25, "wechat":"dotnetcrazy"}
url = "http://www.baidu.com" + "?" + urllib.parse.urlencode(params)
print(url) # non-ASCII text is percent-encoded automatically

http://www.baidu.com?name=%E5%B0%8F%E6%98%8E&age=25&wechat=dotnetcrazy
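
Going the other way is just as common when analyzing captured requests. The sketch below round-trips a query string with `parse_qs()` and shows `doseq=True` for list values; the `tag` parameter is an assumption added for illustration.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

params = {"name": "小明", "age": 25, "tag": ["a", "b"]}

# doseq=True expands list values into repeated keys
query = urlencode(params, doseq=True)
print(query)  # name=%E5%B0%8F%E6%98%8E&age=25&tag=a&tag=b

# parse_qs reverses the encoding; every value comes back as a list of strings
decoded = parse_qs(query)
print(decoded)

# quote/unquote handle a single path segment or value
print(quote("小明"))                  # %E5%B0%8F%E6%98%8E
print(unquote("%E5%B0%8F%E6%98%8E"))  # 小明
```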


In [10]:
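import urllib.parse
import urllib.request

# Using the encoded data as a POST body
url = "http://httpbin.org/post" # a site that offers HTTP testing endpoints
# URL-encode the form (remember to convert it to bytes first)
data = urllib.parse.urlencode({"username":"dnt","password":"dnt"}).encode("utf-8")
# Submit via POST (passing data makes it a POST; omitting it makes it a GET)
response = urllib.request.urlopen(url, data=data)
# Inspect the returned content
result = response.read().decode("utf-8")
print(result)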

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "password": "dnt", 
    "username": "dnt"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "25", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.7"
  }, 
  "json": null, 
  "origin": "47.94.230.42, 47.94.230.42", 
  "url": "https://httpbin.org/post"
}



### 1.2.2. Requests Basics

Demo:

In [1]:
import requests # import the requests module

In [2]:
# Set the request headers (a dict)
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
# Simulate a GET request
response = requests.get("http://www.biquge.cm/12/12097/", headers=headers)

In [3]:
# Status code of the Response
response.status_code

200

In [4]:
# Headers of the Response
response.headers # dict-like

{'Date': 'Fri, 08 Mar 2019 12:49:53 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=d08dd22378bb84bf3bee75a38b5d541971552049393; expires=Sat, 07-Mar-20 12:49:53 GMT; path=/; domain=.biquge.cm; HttpOnly', 'Last-Modified': 'Wed, 06 Mar 2019 15:16:33 GMT', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'ASP.NET', 'Server': 'yunjiasu-nginx', 'CF-RAY': '4b44ee82f2ff479a-WUX', 'Content-Encoding': 'gzip'}

In [5]:
# PS: the encoding used to decode the body can be set explicitly
response.encoding="gbk" # "utf-8"

# Get the Response body as text
print(response.text)

'<!doctype html>\r\n<head>\r\n<meta http-equiv="Cache-Control" content="no-siteapp"/>\r\n<meta http-equiv="Cache-Control" content="no-transform"/>\r\n<script type="text/javascript" src="/scripts/wap.js"></script>\r\n<meta http-equiv="mobile-agent" content="format=html5; url=http://m.biquge.cm/12/12097/"/>\r\n<meta http-equiv="mobile-agent" content="format=xhtml; url=http://m.biquge.cm/12/12097/"/>\r\n<meta http-equiv="Content-Type" content="text/html; charset=gbk" />\r\n<title>恐怖复苏最新章节列表_恐怖复苏最新章节目录_笔趣阁</title>\r\n<meta name="keywords" content="恐怖复苏,佛前献花,恐怖复苏最新章节"/>\r\n<meta name="description" content="恐怖复苏最新章节由网友提供，《恐怖复苏》情节跌宕起伏、扣人心弦，是一本情节与文笔俱佳的网络小说，笔趣阁免费提供佛前献花的恐怖复苏最新清爽干净的文字章节在线阅读。"/> \r\n<meta property="og:type" content="novel"/>\r\n<meta property="og:title" content="恐怖复苏"/>\r\n<meta property="og:description" content="恐怖复苏最新章节由网友提供，《恐怖复苏》情节跌宕起伏、扣人心弦，是一本情节与文笔俱佳的网络小说，笔趣阁免费提供佛前献花的恐怖复苏最新清爽干净的文字章节在线阅读。"/>\r\n<meta property="og:image" content="http://www.biquge.cm/files/article/image/12/1209

In [6]:
# PS: fetching binary content such as an image
response = requests.get("http://images2018.cnblogs.com/blog/1127869/201805/1127869-20180530164144904-1221603693.jpg")

# Get the response body as bytes
print(response.content)

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00`\x00`\x00\x00\xff\xe1\x00"Exif\x00\x00MM\x00*\x00\x00\x00\x08\x00\x01\x01\x12\x00\x03\x00\x00\x00\x01\x00\x01\x00\x00\x00\x00\x00\x00\xff\xdb\x00C\x00\x02\x01\x01\x02\x01\x01\x02\x02\x02\x02\x02\x02\x02\x02\x03\x05\x03\x03\x03\x03\x03\x06\x04\x04\x03\x05\x07\x06\x07\x07\x07\x06\x07\x07\x08\t\x0b\t\x08\x08\n\x08\x07\x07\n\r\n\n\x0b\x0c\x0c\x0c\x0c\x07\t\x0e\x0f\r\x0c\x0e\x0b\x0c\x0c\x0c\xff\xdb\x00C\x01\x02\x02\x02\x03\x03\x03\x06\x03\x03\x06\x0c\x08\x07\x08\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\xff\xc0\x00\x11\x08\x04m\x04u\x03\x01"\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1f\x00\x00\x01\x05\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\xff\xc4\x00\xb5\x10\x00\x02\x01\x03\x03\x02\x04\x03\x05\x05\x04\x04\x00\x00\x01}\x01\x02\x03

In [7]:
# The binary content can be written to disk
with open("wx.jpg","wb") as f:
    f.write(response.content) # write the image to disk

In [9]:
!dir # !ls

 驱动器 D 中的卷是 软件
 卷的序列号是 000D-2898

 D:\Works\BaseCode\python\notebook\8.Spider 的目录

2019/03/08  20:49    <DIR>          .
2019/03/08  20:49    <DIR>          ..
2019/03/06  17:21    <DIR>          .ipynb_checkpoints
2018/12/02  19:18            43,975 1.网罗天下之~正则表达.ipynb
2019/03/08  20:49           973,427 2.爬虫.ipynb
2019/03/08  20:33             1,331 ghostdriver.log
2019/03/08  20:50           176,576 wx.jpg
               4 个文件      1,195,309 字节
               3 个目录  1,163,186,176 可用字节


### 1.3. Common Parsing Methods

Common ways to parse:
1. `JSON parsing`
2. `Regular expressions`
3. `BeautifulSoup`
4. `PyQuery`
5. `XPath`
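
As a rough taste of the first two approaches, plus a stand-in for the HTML-parser libraries, the sketch below uses only the standard library. `BeautifulSoup`, `PyQuery`, and XPath (for example through `lxml`) are third-party and are not shown here; the built-in `html.parser` plays their role, and the sample JSON and HTML are trimmed, made-up inputs.

```python
import json
import re
from html.parser import HTMLParser

# 1. JSON: a trimmed, httpbin-style response
payload = '{"form": {"username": "dnt", "password": "dnt"}, "url": "https://httpbin.org/post"}'
form = json.loads(payload)["form"]
print(form["username"])  # dnt

# 2. Regular expression: grab the <title> text
html = (
    "<html><head><title>3.7.2 Documentation</title></head>"
    '<body><a href="/dev">dev</a><a href="/release">release</a></body></html>'
)
title = re.search(r"<title>(.*?)</title>", html, re.S).group(1)
print(title)  # 3.7.2 Documentation


# 3. An HTML parser: collect every link target
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for key, value in attrs if key == "href")


collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['/dev', '/release']
```

Regular expressions are fine for a single well-known field, while a real parser copes better with nested or irregular markup.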



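### 1.4. Dynamic Rendering

Ways to handle pages rendered dynamically by JS:
1. Analyze the Ajax requests
2. `Selenium` (`WebDriver`)
3. `Splash`
4. `PyV8`, `Ghost.py`, etc.

**PS: for `Selenium` to drive Chrome, the matching `chromedriver` must be installed**
- `http://chromedriver.storage.googleapis.com/index.html`
- Typing `chromedriver` in a terminal should print its startup information
- It is usually placed in the folder where Python is installed

PS: `PhantomJS` is a command-line (headless) browser with no other dependencies
- Download: `http://phantomjs.org/download.html`
- Just make sure the `PhantomJS executable` is reachable through the PATH environment variable
- Usually you only copy the `PhantomJS executable` into a directory already on the PATH (e.g. Python's)

**PS: newer versions of Selenium no longer support PhantomJS**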

In [1]:
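# Headless usage looks like this
from selenium import webdriver

# Create the Chrome options object
opt = webdriver.ChromeOptions()
# Run Chrome headless; works on both Windows and Linux, the matching flag is chosen automatically
# (later Selenium 4 releases dropped this attribute; there, use opt.add_argument("--headless"))
opt.headless = True
# Create the headless Chrome instance
browser = webdriver.Chrome(options=opt)
# Open the target page
browser.get("http://www.biquge.cm/12/12097/")
print(browser.page_source)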

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Cache-Control" content="no-siteapp" />
<meta http-equiv="Cache-Control" content="no-transform" />
<script src="http://push.zhanzhang.baidu.com/push.js"></script><script type="text/javascript" src="/scripts/wap.js"></script>
<meta http-equiv="mobile-agent" content="format=html5; url=http://m.biquge.cm/12/12097/" />
<meta http-equiv="mobile-agent" content="format=xhtml; url=http://m.biquge.cm/12/12097/" />
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
<title>恐怖复苏最新章节列表_恐怖复苏最新章节目录_笔趣阁</title>
<meta name="keywords" content="恐怖复苏,佛前献花,恐怖复苏最新章节" />
<meta name="description" content="恐怖复苏最新章节由网友提供，《恐怖复苏》情节跌宕起伏、扣人心弦，是一本情节与文笔俱佳的网络小说，笔趣阁免费提供佛前献花的恐怖复苏最新清爽干净的文字章节在线阅读。" /> 
<meta property="og:type" content="novel" />
<meta property="og:title" content="恐怖复苏" />
<meta property="og:description" content="恐怖复苏最新章节由网友提供，《恐怖复苏》情节跌宕起伏、扣人心弦，是一本情节与文笔俱佳的网络小说，笔趣阁免费提供佛前献花的恐怖复苏最新清爽干净的文字章节在线阅读。" />
<meta proper

In [2]:
# Newer Selenium releases drop PhantomJS support
# Preview the deprecation warning
from selenium import webdriver

url = "http://www.biquge.cm/12/12097/"
# browser = webdriver.Chrome() # GUI
browser = webdriver.PhantomJS() # no GUI
browser.get(url) # blocks until the page has loaded

# Get the page source
print(browser.page_source)

# warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
# UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead



<!DOCTYPE html><html><head>
<meta http-equiv="Cache-Control" content="no-siteapp">
<meta http-equiv="Cache-Control" content="no-transform">
<script src="http://push.zhanzhang.baidu.com/push.js"></script><script type="text/javascript" src="/scripts/wap.js"></script>
<meta http-equiv="mobile-agent" content="format=html5; url=http://m.biquge.cm/12/12097/">
<meta http-equiv="mobile-agent" content="format=xhtml; url=http://m.biquge.cm/12/12097/">
<meta http-equiv="Content-Type" content="text/html; charset=gbk">
<title>恐怖复苏最新章节列表_恐怖复苏最新章节目录_笔趣阁</title>
<meta name="keywords" content="恐怖复苏,佛前献花,恐怖复苏最新章节">
<meta name="description" content="恐怖复苏最新章节由网友提供，《恐怖复苏》情节跌宕起伏、扣人心弦，是一本情节与文笔俱佳的网络小说，笔趣阁免费提供佛前献花的恐怖复苏最新清爽干净的文字章节在线阅读。"> 
<meta property="og:type" content="novel">
<meta property="og:title" content="恐怖复苏">
<meta property="og:description" content="恐怖复苏最新章节由网友提供，《恐怖复苏》情节跌宕起伏、扣人心弦，是一本情节与文笔俱佳的网络小说，笔趣阁免费提供佛前献花的恐怖复苏最新清爽干净的文字章节在线阅读。">
<meta property="og:image" content="http://www.biquge.cm/files/article