- Code reference: https://www.gitbook.com/book/germey/python3webspider/details
- Mainly covers the urllib and requests libraries

In [None]:
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

In [None]:
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

In [None]:
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

In [None]:
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.google.com.hk', timeout=3)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print(type(e.reason))
        print('TIME OUT')

- urlopen sends the request; its argument can be a string (the URL) or a Request object
- A Request object gives finer control over what the request contains

In [None]:
import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

`class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)`
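
The effect of these constructor parameters can be checked without sending anything over the network. The following is a minimal sketch; the form field and the User-Agent string are placeholders assumed for illustration.

```python
from urllib import parse, request

# Hypothetical form field, encoded to bytes as urlopen/Request expect.
body = parse.urlencode({'word': 'hello'}).encode('utf-8')

req = request.Request(
    url='http://httpbin.org/post',
    data=body,
    headers={'User-Agent': 'example-agent/0.1'},  # placeholder value
)

# With method=None, the method is inferred: POST when data is given, GET otherwise.
print(req.get_method())
print(request.Request('http://httpbin.org/get').get_method())

# Header names are stored capitalized, so look them up as 'User-agent'.
print(req.get_header('User-agent'))
print(req.full_url, req.data)
```

Nothing is transmitted until the object is passed to `urlopen`.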

In [None]:
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/7.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict, to avoid shadowing the built-in type
form_data = {
    'Name': 'proverbs'
}

data = bytes(parse.urlencode(form_data), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
res = request.urlopen(req)
print(res.read().decode('utf-8'))

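- Use Handlers to build an Opener
- Roughly speaking, a Handler is a more advanced counterpart to Request (it adds features such as authentication, proxies, and cookies), and an Opener is a more advanced counterpart to urlopen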
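
As a rough sketch of how handlers combine into one opener, the snippet below builds an opener from two handlers and lists what it contains, without opening any URL. The URL and credentials are placeholders, and `opener.handlers` is an attribute of CPython's `OpenerDirector` that is used here only for inspection.

```python
import http.cookiejar
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPCookieProcessor,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

# Placeholder URL and credentials, assumed for illustration only.
url = 'http://example.com/'
passwords = HTTPPasswordMgrWithDefaultRealm()
passwords.add_password(None, url, 'user', '123')

jar = http.cookiejar.CookieJar()
opener = build_opener(HTTPBasicAuthHandler(passwords), HTTPCookieProcessor(jar))

# build_opener keeps its default handlers and adds the ones passed in.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

Calling `opener.open(url)` would then run the request through every handler in that list.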

In [109]:
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'user'
password = '123'
url = 'http://120.27.34.24:9001'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
  <title>Supervisor Status</title>
  <link href="stylesheets/supervisor.css" rel="stylesheet" type="text/css" />
  <link href="images/icon.png" rel="icon" type="image/png" />
</head>
<body>
<div id="wrapper">

  <div id="header">
    <img alt="Supervisor status" src="images/supervisor.gif" />
  </div>

  <div>
    <div class="hidden">#</div>

    <form action="index.html" method="post">
      <ul class="clr" id="buttons">
        <li id="refresh"><a href="index.html?action=refresh">&nbsp;</a></li>
        <li id="restart_all"><a href="index.html?action=restartall">&nbsp;</a></li>
        <li id="stop_all"><a href="index.html?action=stopall">&nbsp;</a></li>
      </ul>

      <table cellspacing="0">
        <thead>
        <tr>
          <th class="state">State</th>
          <th class="desc">Description</th>
          <th class="name">Name</th>
       

- The above demonstrates authentication
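
For reference, HTTP Basic authentication amounts to sending a base64-encoded `username:password` pair in the `Authorization` header. The helper below is a hypothetical sketch of that encoding, not the code the handler itself runs.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic auth."""
    token = base64.b64encode(f'{username}:{password}'.encode('utf-8'))
    return 'Basic ' + token.decode('ascii')


header_value = basic_auth_header('user', '123')
print(header_value)

# base64 is an encoding, not encryption: the credentials decode right back.
print(base64.b64decode(header_value.split(' ', 1)[1]).decode('utf-8'))
```

Because the credentials are trivially recoverable, Basic auth is only reasonably safe over HTTPS.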

In [54]:
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://61.191.41.130:80',
    'https': 'https://220.167.220.14:808'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('http://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

{
  "args": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
    "Accept-Encoding": "identity, deflate, compress, gzip", 
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6", 
    "Cache-Control": "max-age=0", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "If-Modified-Since": "Wed, 10 May 2017 04:38:04 GMT", 
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.126 Safari/533.4"
  }, 
  "origin": "61.191.40.75", 
  "url": "http://httpbin.org/get"
}



- The above demonstrates proxies (the https proxy does not work, for reasons I have not figured out)

In [None]:
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

In [None]:
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

In [None]:
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

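- The above shows how to display cookies, save them to a file, and load them from a file
- To simulate a login: find the URL that the account and password are POSTed to (you can use http://www.proverbs.top:12345/login , account 379548839@qq.com, password xuhao; extract the csrf token to log in)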
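
The save and load steps can also be tried without any network access. This is a minimal sketch that builds one cookie by hand (normally the server sets it); the cookie name, value, and domain are made up, and the file goes to a temporary directory.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand; hypothetical helper for this sketch."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie('number', '123456789', 'example.com'))
# ignore_discard=True is needed here: session cookies are otherwise skipped.
jar.save(ignore_discard=True, ignore_expires=True)

loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
print(restored)
```

The loaded jar can be wrapped in `HTTPCookieProcessor` in the same way as in the cells above.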

- The following covers how to use requests (a third-party library)
- requests is simpler and more powerful than urllib

In [51]:
import requests

req = requests.get('https://www.baidu.com')
print(type(req))
print(req.status_code)
print(type(req.text))
print(req.text)
print(req.cookies)


<class 'requests.models.Response'>
200
<class 'str'>
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç¾åº¦ä¸ä¸ï¼ä½ å°±ç¥é</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxleng

In [55]:
import requests
r = requests.get('http://httpbin.org/get')
print(r.text)

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1"
  }, 
  "origin": "120.236.174.143", 
  "url": "http://httpbin.org/get"
}



- Passing data to a requests GET (sent as query parameters via `params`)
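
As a rough illustration of what happens to such data in a GET request, the standard library can build an equivalent query string. This is a sketch of the idea, not the code requests runs internally.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = 'http://httpbin.org/get'
params = {'name': 'germey', 'age': 22}

# The dict is serialized to key=value pairs and appended after '?'.
full_url = base_url + '?' + urlencode(params)
print(full_url)

# The server parses the query string back; every value arrives as a string.
query = parse_qs(urlsplit(full_url).query)
print(query)
```

This is why httpbin reports `"age": "22"` as a string even though an integer was passed in.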

In [58]:
import requests

data = {
    'name': 'germey',
    'age': 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(type(r.text))
print(r.text)
print(type(r.json()))
print(r.json())

<class 'str'>
{
  "args": {
    "age": "22", 
    "name": "germey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1"
  }, 
  "origin": "120.236.174.143", 
  "url": "http://httpbin.org/get?age=22&name=germey"
}

<class 'dict'>
{'headers': {'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.9.1', 'Accept': '*/*', 'Connection': 'close', 'Accept-Encoding': 'gzip, deflate'}, 'origin': '120.236.174.143', 'args': {'name': 'germey', 'age': '22'}, 'url': 'http://httpbin.org/get?age=22&name=germey'}


- Adding headers to a requests call

In [59]:
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get("https://www.zhihu.com/explore", headers=headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)

['\n俄罗斯远东原住民为什么那么少?\n', '\n计算机专业本科就业待遇如何？考研更好？考研方向有哪些？\n', '\n如何评价4月新番《与僧侣交合的色欲之夜》？\n', '\n什么是「少女心」？\n', '\n有哪些演技不错的中国青年演员？\n', '\n有哪些原来红极一时，现在已经衰退但是还是有死忠维护着的圈子？\n', '\n如何评价 fgo 2017 年一季度手游营收全球第二？\n', '\n华为 P10 的疏油层和闪存运存事件以及屏幕拖影事件是否会影响 P10 销量？\n', '\n在我家筑巢的鸟攻击我怎么办？\n', '\n如何评价电视剧《外科风云》？\n']


- `text` returns a str; `content` returns bytes
- Binary files (images, video, audio, etc.) can be fetched through `content`
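
The relationship between the two is a decode step: `text` is `content` decoded with whatever encoding requests settles on. The sketch below uses a literal byte string as a stand-in for `r.content` and shows what a wrong guess looks like. This is probably why the Baidu page title in the earlier output appears garbled, since requests falls back to ISO-8859-1 for text responses that declare no charset.

```python
# Stand-in for r.content: the UTF-8 bytes of a page title.
raw = '百度一下，你就知道'.encode('utf-8')

wrong = raw.decode('iso-8859-1')  # wrong encoding guess gives mojibake
right = raw.decode('utf-8')       # correct encoding recovers the title

print(type(raw), type(right))
print(repr(wrong))
print(right)
```

With a real response, setting `r.encoding = 'utf-8'` before reading `r.text` has the same effect as the second decode.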


In [60]:
import requests

r = requests.get("https://github.com/favicon.ico")
print(r.text)
print(r.content)

         (  &          (  N  (                                                    v�        �i                            ���              ���                    ��               ����            ��,\�"        4�����    0�   ����8        @�����-����;                        :�������O                                L������                                      ������                                        ������!                                ������4                                @���8���          
                  ���8    ���6   �����   t7���           ������������                  ���

In [62]:
import requests

r = requests.get("https://github.com/favicon.ico")
# The with block closes the file automatically; no explicit close() is needed
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

In [64]:
import requests

data = {'name': 'germey', 'age': '22'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "22", 
    "name": "germey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "18", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
  }, 
  "json": null, 
  "origin": "120.236.174.143", 
  "url": "http://httpbin.org/post"
}



- The response returned by requests calls such as post

In [65]:
import requests

r = requests.get('http://www.jianshu.com')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'Date': 'Wed, 10 May 2017 04:59:35 GMT', 'Connection': 'keep-alive', 'Transfer-Encoding': 'chunked', 'ETag': 'W/"623701344a3badf0f74957912c193b7d"', 'Set-Cookie': '_session_id=YTFvMjRMZ2pyeXpmY2FUdWRCWGJ2NVpEZmZXaVBxdzZkRVE1ZU1KTWlNck9IL2Rlc3NDcmNIRzAyaTRnaGN1OTEyS0U1K3RpU2hvbzhsbXl6YXZOVjQxbWNkRVpNUE54MkRSVGp6b0ZBRkRURXhJUmxnTDBpVFVZUXRaZlcrbTMyOVpLWWh5R3gwdkxOT2VzZHRlOVlwZ2FBOW9GZ2pyYzRscWRjT29VMHRnbFFsSFBvMUJFN1FtMGFKalB3TThZSXdROGdwdE9IN0tBa0U2OENkbjRUN3BQdkFGYzgxSFArNXNvMER2NmtwMm0veTFTMlo4amhVbWVyMSsybHcwRS0tWkpwRjJ6WmVPQ292WjU0M2lnREprQT09--df3a1f82dba59d4b56f4ae1053a948c45cb6ea3f; path=/; HttpOnly', 'Cache-Control': 'max-age=0, private, must-revalidate', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'Server': 'Tengine', 'X-Via': '1.1 jianyidong20:3 (Cdn Cache Server V2.0)', 'X-XSS-Protection': '1; mode=block', 'Content-Encoding': 'gzip', 'X-Request-Id': '0a5a7ee3-a632-4c70-aff2-ceda3eba408f',

- File upload with requests

In [66]:
import requests

# Open the file in a with block so the handle is closed after the upload
with open('favicon.ico', 'rb') as f:
    r = requests.post("http://httpbin.org/post", files={'file': f})
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {
    "file": "data:application/octet-stream;base64,AAABAAIAEBAAAAEAIAAoBQAAJgAAACAgAAABACAAKBQAAE4FAAAoAAAAEAAAACAAAAABACAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABERE3YTExPFDg4OEgAAAAAAAAAADw8PERERFLETExNpAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABQUFJYTExT8ExMU7QAAABkAAAAAAAAAAAAAABgVFRf/FRUX/xERE4UAAAAAAAAAAAAAAAAAAAAAAAAAABERE8ETExTuERERHg8PDxAAAAAAAAAAAAAAAAAAAAANExMU9RUVF/8VFRf/EhIUrwAAAAAAAAAAAAAAABQUFJkVFRf/BQURLA0NDVwODg/BDw8PIgAAAAAAAAAADg4ONBAQEP8VFRf/FRUX/xUVF/8TExOPAAAAAA8PDzAPDQ//AAAA+QEBAe0CAgL/AgIC9g0NDTgAAAAAAAAAAAcHB0ACAgLrFRUX/xUVF/8VFRf/FRUX/xERES0TExacFBQV/wEBAfwPDxH7DAwROwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA0NEToTExTnFRUX/xUVF/8TExOaExMT2RUVF/8VFRf/ExMTTwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEBAQTBUVF/8VFRf/ExMT2hMTFPYVFRf/FBQU8AAAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAITExTxFRUX/xMTFPYTExT3FRUX/xQUFOEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBQU4RUVF/8TExT3ExMU3hUVF/8TExT5Dw8PIQAAAAAAAAAAA

- Cookie handling in requests

In [67]:
import requests

r = requests.get("https://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>, <Cookie __bsi=1814891151942961897_00_23_N_N_1_0303_C02F_N_N_N_0 for .www.baidu.com/>]>
BDORZ=27315
__bsi=1814891151942961897_00_23_N_N_1_0303_C02F_N_N_N_0


In [77]:
import requests

headers = {
    'Cookie': 'q_c1=6a939ad4208e4978a11dd6c405be2d94|1493955688000|1490864656000; r_cap_id="NDE5MGZjOWJmNWMxNGU0NzhmMjEyOWIxZDYyYzM1NzU=|1493487069|a08db0edf678dc66a04f317a5111c5086a3cad03"; cap_id="NDIwMDUzY2EwMDE5NDFmNTllYjVlM2M3NWY2MDFkNzI=|1493487069|e5c33041a487f832fc555766d56221b51a442d67"; __utma=155987696.1717155654.1490930864.1490930864.1490930864.1; __utmz=155987696.1490930864.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); _zap=fcc5c4a6-8f1c-4eaa-be04-5b79a87d0308; d_c0="AFBCitVljwuPTpKWXjiKInj33-DAqGaCEak=|1491381823"; z_c0=Mi4wQUFEQUlpWWZBQUFBVUVLSzFXV1BDeGNBQUFCaEFsVk5XRnNzV1FENGFfMG9OQmdabm5ZM25fcTB4Ym0wRWFxSXFn|1494392604|cc5026d6aff0d22baed289a71f47562c1c5c382b; _xsrf=35a188b80dd581852e93f37fd782fd27; aliyungf_tc=AQAAAHmf3z0zugwAj67seCEhFN3Zo6/G; acw_tc=AQAAAOxgrB/LwwwAj67seBHUOyuhiYgK', 
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0',
}
r = requests.get("http://www.zhihu.com", headers=headers)
print(r.text)

<!DOCTYPE html>
<html lang="zh-CN" class="">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="renderer" content="webkit" />
<meta name="description" content="一个真实的网络问答社区，帮助你寻找答案，分享知识。"/>
<meta name="viewport" content="user-scalable=no, width=device-width, initial-scale=1.0, maximum-scale=1.0"/>
<title>知乎 - 与世界分享你的知识、经验和见解</title>



<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png" sizes="152x152">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png" sizes="120x120">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png" sizes="76x76">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-60.9911cffb.png" sizes="60x60">

<link rel="shortcut icon" href="https://static.zhihu.com/static/favicon.ico" type="image/x-icon" />


- At present this approach cannot log in to Zhihu

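- Consider this scenario: a first request logs in to a site with requests.post(), and a second call to requests.get() then tries to fetch your own profile after login. In effect this is like opening two separate browsers: the two calls are unrelated and belong to two different sessions
- Solutions:
    - Send the cookies with every request
    - Use a session to maintain state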
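
The difference between the two approaches can be seen without contacting a server by preparing requests and inspecting their headers. This is a minimal sketch; the cookie is set by hand here, whereas in practice a login response would set it.

```python
import requests

url = 'http://httpbin.org/cookies'

# A standalone request knows nothing about earlier calls: no Cookie header.
standalone = requests.Request('GET', url).prepare()
print(standalone.headers.get('Cookie'))

# A Session carries its cookie jar into every request it prepares.
s = requests.Session()
s.cookies.set('number', '123456789')  # stands in for a cookie set by a server
prepared = s.prepare_request(requests.Request('GET', url))
print(prepared.headers.get('Cookie'))
```

Sending `prepared` with `s.send(prepared)` would transmit that header to the server.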

In [100]:
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)

{
  "cookies": {
    "number": "123456789"
  }
}



- In the example above we request a test URL, http://httpbin.org/cookies/set/number/123456789 , which sets a cookie named number with the value 123456789; the second URL, http://httpbin.org/cookies , returns the current cookies.

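- Below: requests verifies SSL certificates; the `verify` option defaults to True (verification enabled), and passing `verify=False` skips it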
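
For comparison, the standard library expresses the same choice through an `ssl` context. This is a sketch of the stdlib analogue, not of what requests does internally.

```python
import ssl

# Default context: certificates and hostnames are verified.
default_ctx = ssl.create_default_context()
print(default_ctx.verify_mode == ssl.CERT_REQUIRED, default_ctx.check_hostname)

# Verification switched off; check_hostname must be disabled first.
insecure_ctx = ssl.create_default_context()
insecure_ctx.check_hostname = False
insecure_ctx.verify_mode = ssl.CERT_NONE
print(insecure_ctx.verify_mode == ssl.CERT_NONE)
```

Such a context can be passed to `urllib.request.urlopen(url, context=insecure_ctx)`. Skipping verification is only appropriate for testing, since it removes protection against impersonated servers.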

In [88]:
import requests

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

200




- Proxies with requests

In [107]:
import requests

proxies = {
    'http': 'http://61.191.41.130:80',
    'https': 'https://220.167.220.14:808'
}

req = requests.get("http://www.baidu.com", proxies=proxies)
print(req.status_code)


200


- Authentication with requests

In [108]:
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)

200
