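# Common Usage of the Python requests Library

Python's requests is a simple, easy-to-use HTTP library, but I rarely used it before. I started paying attention to it because the Rum Bot code an engineer shared with me uses it to talk to the API, and also to scrape and download images.

These study notes summarize the basic usage of requests.

### Installation

Run `import requests` to check whether it is already installed locally. If it is not, install it with pip from the command line.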

```sh
pip install requests
```

In [1]:
import requests

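### Request methods

requests provides several request methods. The most commonly used are `get()` and `post()`; I have not used the others so far.

The library's own docstring also uses only `get()` and `post()` as examples.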

In [2]:
requests?

Type:        module
String form: <module 'requests' from 'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\requests\\__init__.py'>
File:        c:\programdata\anaconda3\lib\site-packages\requests\__init__.py
Docstring:
Requests HTTP Library
~~~~~~~~~~~~~~~~~~~~~

Requests is an HTTP library, written in Python, for human beings. Basic GET
usage:

   >>> import requests
   >>> r = requests.get('https://www.python.org')
   >>> r.status_code
   200
   >>> 'Python is a programming language' in r.content
   True

... or POST:

   >>> payload = dict(key1='value1', key2='value2')
   >>> r = requests.post('https://httpbin.org/post', data=payload)
   >>> print(r.text)
   {
     ...
     "form": {
       "key2": "value2",
       "key1": "value1"
     },
     ...
   }

The other HTTP methods are supported - see `requests.api`. Full documentation
is at <http://python-requests.org>.

:copyright: (c) 2017 by Kenneth Reitz.
:license: Apache 2.0, see LICENSE

The URLs below are also provided by the developer of requests, so feel free to experiment with them.

In [3]:
requests.get('http://httpbin.org/get')

<Response [200]>

In [4]:
requests.post('http://httpbin.org/post')

<Response [405]>

In [5]:
requests.put('http://httpbin.org/put')

<Response [200]>

In [6]:
requests.delete('http://httpbin.org/delete')

<Response [200]>

In [7]:
requests.head('http://httpbin.org/get')

<Response [200]>

In [8]:
requests.options('http://httpbin.org/get')

<Response [200]>

In [9]:
requests.get('http://httpbin.org/get')

<Response [200]>

The return value of these requests is a type defined by requests itself.

In [10]:
resp = requests.get('http://httpbin.org/get')
type(resp)

requests.models.Response

In [11]:
requests.models.Response?

Init signature: requests.models.Response()
Docstring:
The :class:`Response <Response>` object, which contains a
server's response to an HTTP request.
File:           c:\programdata\anaconda3\lib\site-packages\requests\models.py
Type:           type
Subclasses:


Taking a `get()` request as an example, the response has the following attributes:

In [12]:
import requests
url = "http://httpbin.org/get"
resp = requests.get(url)

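Status code. 200 generally means success. For other codes, see the design of the API in question; their meanings usually follow the standard HTTP conventions.

In [13]:
# status code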
resp.status_code

200

In [14]:
type(resp.status_code)

int
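
Besides comparing `status_code` by hand, a response also offers `resp.ok` and `resp.raise_for_status()`. The sketch below builds `Response` objects by hand, an assumption made only so that it runs without a network connection; real code would get them from `requests.get()`. The helper name `describe` is hypothetical.

```python
import requests
from requests.exceptions import HTTPError

def describe(resp):
    """Return a short description of a response's status."""
    if resp.status_code == requests.codes.ok:  # requests.codes.ok == 200
        return "ok"
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx codes
        resp.raise_for_status()
    except HTTPError:
        return f"error {resp.status_code}"
    return f"other {resp.status_code}"

# Hand-built responses, for illustration only
good = requests.Response()
good.status_code = 200
bad = requests.Response()
bad.status_code = 404

print(describe(good), good.ok)  # ok True
print(describe(bad), bad.ok)    # error 404 False
```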

In [15]:
# request URL
resp.url

'http://httpbin.org/get'

In [16]:
type(resp.url)

str

In [17]:
# headers
resp.headers

{'Content-Length': '306', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Type': 'application/json', 'Date': 'Thu, 16 Dec 2021 09:04:08 GMT', 'Keep-Alive': 'timeout=58', 'Server': 'gunicorn/19.9.0'}

In [18]:
type(resp.headers)

requests.structures.CaseInsensitiveDict

In [19]:
h = resp.headers
for k, v in h.items():
    print(k, v)

Content-Length 306
Access-Control-Allow-Credentials true
Access-Control-Allow-Origin *
Content-Type application/json
Date Thu, 16 Dec 2021 09:04:08 GMT
Keep-Alive timeout=58
Server gunicorn/19.9.0
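
The `CaseInsensitiveDict` type matters in practice because header names can be looked up in any casing. A small sketch with a hand-built dict standing in for `resp.headers`:

```python
from requests.structures import CaseInsensitiveDict

# Hand-built headers standing in for resp.headers
headers = CaseInsensitiveDict({"Content-Type": "application/json"})

# Lookups ignore case, as HTTP header names are case-insensitive
print(headers["content-type"])    # application/json
print("CONTENT-TYPE" in headers)  # True

# Iteration keeps the casing the key was set with
print(list(headers.items()))      # [('Content-Type', 'application/json')]
```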


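The `Connection` header matters: connections are kept alive by default, which can sometimes leave too many connections open and cause the target to refuse access. To ask for the connection to be closed, `Connection: close` has to be sent in the request headers (as in the session example later in these notes). The cell below only updates the local copy of the response headers, which changes what is printed but has no effect on the connection itself: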

In [20]:
resp.headers.update({"Connection":"Close"})
resp.headers

{'Content-Length': '306', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Type': 'application/json', 'Date': 'Thu, 16 Dec 2021 09:04:08 GMT', 'Keep-Alive': 'timeout=58', 'Server': 'gunicorn/19.9.0', 'Connection': 'Close'}
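
To actually send `Connection: close`, the header would go on the request rather than on the response. A minimal sketch that prepares a request without sending it, to show which headers would go over the wire:

```python
import requests

url = "http://httpbin.org/get"

# Build the request but do not send it; prepare() shows the final headers
req = requests.Request("GET", url, headers={"Connection": "close"})
prepared = req.prepare()

print(prepared.headers["Connection"])  # close
```

In everyday code the same effect should come from `requests.get(url, headers={"Connection": "close"})`.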

In [21]:
#cookies
resp.cookies

<RequestsCookieJar[]>

In [22]:
type(resp.cookies)

requests.cookies.RequestsCookieJar

In [23]:
# response body as text
resp.text

'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.22.0", \n    "X-Amzn-Trace-Id": "Root=1-61bb0108-5e3a1f7e23c9ec900bc3e2c3"\n  }, \n  "origin": "8.211.137.111", \n  "url": "http://httpbin.org/get"\n}\n'

In [24]:
type(resp.text)

str

In [25]:
# response body as bytes
resp.content

b'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.22.0", \n    "X-Amzn-Trace-Id": "Root=1-61bb0108-5e3a1f7e23c9ec900bc3e2c3"\n  }, \n  "origin": "8.211.137.111", \n  "url": "http://httpbin.org/get"\n}\n'

In [26]:
type(resp.content)

bytes

The bytes form has a handy use.

For example, downloading and saving an image from the raw bytes:

In [27]:
import requests

url = "https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png"
picname = url.split("/")[-1]
picfile = f'D://{picname}'
resp = requests.get(url)
b = resp.content
with open(picfile,'wb') as f:
    f.write(b)
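
`resp.content` loads the whole body into memory, which is fine for a small image. For larger files a streaming variant may be preferable. A sketch, where the function name `download`, the chunk size and the timeout are my own choices rather than anything taken from the Rum Bot code:

```python
import requests
from pathlib import Path

def download(url, dest_dir=".", chunk_size=8192, timeout=10):
    """Stream url to a file in dest_dir and return the file's path."""
    dest = Path(dest_dir) / url.split("/")[-1]
    # stream=True defers reading the body until iter_content() is consumed
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return dest
```

It would be called as `download(url, dest_dir="D:/")` with the image URL from the cell above.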

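The same approach can be used to download files, but for the GitHub page URL in the next cell the saved data is HTML, not markdown.

Note that the mode passed to `open()` is `wb`: `w` for writing, `b` for bytes (binary mode).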

In [28]:
import requests

url = "https://github.com/rumsystem/quorum/blob/main/API.md"
name = url.split("/")[-1]
file = f'D://{name}'
resp = requests.get(url)
b = resp.content
with open(file,'wb') as f:
    f.write(b)
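
The saved file is HTML because the URL points at GitHub's `blob` page, which is a web page wrapped around the file. GitHub serves the file itself from `raw.githubusercontent.com`, so a small helper can rewrite the URL before downloading. The helper name `to_raw_url` is hypothetical:

```python
def to_raw_url(blob_url):
    """Rewrite a github.com 'blob' page URL to the matching raw-file URL."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        return blob_url  # not a blob page; leave unchanged
    path = blob_url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path

url = "https://github.com/rumsystem/quorum/blob/main/API.md"
print(to_raw_url(url))
# https://raw.githubusercontent.com/rumsystem/quorum/main/API.md
```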

### Converting the response to JSON

`resp.json()` has essentially the same effect as `json.loads(resp.text)`.

In [29]:
import requests

resp = requests.get("http://httpbin.org/get")
resp.json()

{'args': {},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.22.0',
  'X-Amzn-Trace-Id': 'Root=1-61bb0109-047ab6d810ea1f5a1e5e846c'},
 'origin': '8.211.137.111',
 'url': 'http://httpbin.org/get'}

This only works when the body returned by the URL is valid JSON; otherwise an error is raised.

For example:

In [30]:
import requests

resp = requests.get("http://baidu.com")
resp.json()

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

You can always inspect the returned data with the `resp.content` and `resp.text` attributes.

In [31]:
resp.text[:100]

'<html>\n<meta http-equiv="refresh" content="0;url=http://www.baidu.com/">\n</html>\n'
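
When it is not known in advance whether a URL returns JSON, the decode error can be caught instead of letting it propagate. A minimal sketch; the helper name `safe_json` is hypothetical, and it relies on the decode errors raised by `resp.json()` being `ValueError` subclasses:

```python
import requests

def safe_json(resp):
    """Return the parsed JSON body, or None if the body is not valid JSON."""
    try:
        return resp.json()
    except ValueError:
        # JSON decode errors raised by resp.json() are ValueError subclasses
        return None
```

With the two responses above, `safe_json()` should return a dict for `http://httpbin.org/get` and `None` for `http://baidu.com`.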

### Requests with parameters

#### 1. Put the parameters directly in the URL

In [32]:
import requests

resp = requests.get("http://httpbin.org/get?name=gemey&age=22")
resp.json()

{'args': {'age': '22', 'name': 'gemey'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.22.0',
  'X-Amzn-Trace-Id': 'Root=1-61bb0126-00bb22fb76ff0b1f5c545f37'},
 'origin': '8.211.137.111',
 'url': 'https://httpbin.org/get?name=gemey&age=22'}

#### 2. Put the parameters in a dict named `data` and pass it as the `params` argument

In [33]:
import requests

data = {
    'name': 'tom',
    'age': 20
}

resp = requests.get('http://httpbin.org/get', params=data)
resp.json()

{'args': {'age': '20', 'name': 'tom'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.22.0',
  'X-Amzn-Trace-Id': 'Root=1-61bb0127-7b63ed67287cbfcb3e3b06f6'},
 'origin': '8.211.137.111',
 'url': 'https://httpbin.org/get?name=tom&age=20'}
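
Both styles end up as the same kind of URL; `params` simply does the encoding for you. A sketch that prepares a request without sending it, to show the URL that would be requested. The `city` key is my own addition, included to show how a space is escaped:

```python
import requests

params = {"name": "tom", "age": 20, "city": "New York"}

# prepare() builds the final URL without sending anything
prepared = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()
print(prepared.url)
# http://httpbin.org/get?name=tom&age=20&city=New+York
```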

#### 3. Add parameters to a POST request

In [34]:
import requests
data = {"name":"tom","age":6}
requests.post('http://httpbin.org/post', data=data)

<Response [405]>
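
For a POST, `data=` sends the dict as a form-encoded body. Many JSON APIs expect a JSON body instead, which is what the `json=` argument produces. A sketch comparing the two without sending anything:

```python
import json
import requests

url = "http://httpbin.org/post"
payload = {"name": "tom", "age": 6}

# data= sends a form-encoded body
form = requests.Request("POST", url, data=payload).prepare()
print(form.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form.body)                     # name=tom&age=6

# json= serializes the dict to JSON and sets the matching Content-Type
as_json = requests.Request("POST", url, json=payload).prepare()
print(as_json.headers["Content-Type"])  # application/json
print(json.loads(as_json.body))         # {'name': 'tom', 'age': 6}
```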

### Adding headers to a request

In [35]:
import requests
headers = {'User-Agent':
         'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
requests.get('http://www.baidu.com',headers=headers)

<Response [200]>

You can also `update` the headers, here on a `Session` object.

In [36]:
import requests
session = requests.Session()
session.verify = r"C:\Users\75801\AppData\Local\Programs\prs-atm-app\resources\quorum_bin\certs\server.crt"
session.headers.update({
                "USER-AGENT": "asiagirls-py-bot",
                "Content-Type": "application/json",
})

url = "https://127.0.0.1:55043/api/v1/groups"
session.get(url)

<Response [200]>
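
Session headers act as defaults that individual requests can still override. A sketch of how the two levels merge, prepared without sending; the `example-bot` user agent is a placeholder of mine:

```python
import requests

session = requests.Session()
# Keys are case-insensitive, so this replaces the default "User-Agent"
session.headers.update({"USER-AGENT": "example-bot", "Content-Type": "application/json"})

# A per-request header overrides the session-level value for that request only
req = requests.Request("GET", "http://httpbin.org/get", headers={"Content-Type": "text/plain"})
prepared = session.prepare_request(req)

print(prepared.headers["User-Agent"])    # example-bot
print(prepared.headers["Content-Type"])  # text/plain
print(session.headers["Content-Type"])   # application/json
```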

### Session persistence

In [37]:
import requests
session = requests.Session()
session.verify = r"C:\Users\75801\AppData\Local\Programs\prs-atm-app\resources\quorum_bin\certs\server.crt"
session.headers.update({
                "USER-AGENT": "asiagirls-py-bot",
                "Content-Type": "application/json",
})


session.get("https://127.0.0.1:55043/api/v1/groups")

<Response [200]>

In [38]:
session

<requests.sessions.Session at 0x23af8049c50>

In [39]:
import requests
session = requests.Session()
session.verify = r"C:\Users\75801\AppData\Local\Programs\prs-atm-app\resources\quorum_bin\certs\server.crt"
session.headers.update({
                "USER-AGENT": "asiagirls-py-bot",
                "Content-Type": "application/json",
                "Connection": "close"
})


session.get("https://127.0.0.1:55043/api/v1/groups")

<Response [200]>

In [40]:
session

<requests.sessions.Session at 0x23af8051cc0>

In [41]:
session.get("https://127.0.0.1:55043/api/v1/network")

<Response [200]>

In [42]:
session

<requests.sessions.Session at 0x23af8051cc0>
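
The cells above reuse one `Session` object for several requests. Besides headers and the `verify` setting, a session also carries cookies from one request to the next. A sketch that sets a cookie by hand (normally the server would set it) and inspects the next prepared request; the cookie name and value are made up:

```python
import requests

session = requests.Session()
# Normally the server sets cookies via Set-Cookie; here one is set by hand
session.cookies.set("token", "abc123")

# Requests prepared through the session carry the stored cookie
req = requests.Request("GET", "http://httpbin.org/cookies")
prepared = session.prepare_request(req)
print(prepared.headers["Cookie"])  # token=abc123
```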

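## Using proxies

As with headers, the proxy argument must also be a dict.

The example below uses requests to scrape the IPs, ports and types listed on a proxy-list website.

Because these proxies are free, the addresses used here stopped working very quickly.

I have not used this feature myself; this part of the notes comes from an online search.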

In [None]:
import requests
import re

def get_html(url):
    proxy = {
        'http': '120.25.253.234:812',
        'https': '163.125.222.244:8123'
    }
    heads = {}
    heads['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    req = requests.get(url, headers=heads,proxies=proxy)
    html = req.text
    return html

def get_ipport(html):
    # Extract the IP, port and type columns from the table cells
    iplist = re.findall(r'<td data-title="IP">(.+)</td>', html)
    portlist = re.findall(r'<td data-title="PORT">(.+)</td>', html)
    # '类型' ("type") is the column title used in the site's HTML
    typelist = re.findall(r'<td data-title="类型">(.+)</td>', html)
    # Pair the columns row by row so each IP gets its own port and type
    sumray = [f'{t},{i}:{p}' for i, p, t in zip(iplist, portlist, typelist)]
    print('High-anonymity proxies')
    print(sumray)


if __name__ == '__main__':
    url = 'http://www.kuaidaili.com/free/'
    get_ipport(get_html(url))
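
The keys of the `proxies` dict are matched against the scheme of the URL being requested. The sketch below uses `select_proxy` from `requests.utils` purely to inspect that mapping without sending anything; the proxy addresses are hypothetical:

```python
from requests.utils import select_proxy

# Hypothetical proxy addresses, for illustration only
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

# The dict key is matched against the scheme of the URL being requested
print(select_proxy("http://example.com/", proxies))   # http://10.10.1.10:3128
print(select_proxy("https://example.com/", proxies))  # http://10.10.1.10:1080
```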

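## Certificate verification settings

`verify` controls CA certificate verification. For some sites the default works; for others (for example a server with a self-signed certificate) the request fails with an error unless you provide a certificate. The error can be bypassed like this: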

In [45]:
import requests
from requests.packages import urllib3

urllib3.disable_warnings()  # suppress urllib3 warnings
resp = requests.get('https://www.12306.cn',verify=False)  # disable certificate verification
resp.status_code

200

This is not a great example, though: whether `verify` is `True`, `False` or `""`, the status code is 200.

In [46]:
import requests
from requests.packages import urllib3

urllib3.disable_warnings()  # suppress urllib3 warnings
resp = requests.get('https://www.12306.cn',verify=True)
resp.status_code

200

In [47]:
import requests
from requests.packages import urllib3

urllib3.disable_warnings()  # suppress urllib3 warnings
resp = requests.get('https://www.12306.cn',verify="")
resp.status_code

200

So I tried the Rum API instead, which does have requirements for `verify`.

In [48]:
requests.get('https://127.0.0.1:55043/api/v1/groups',verify=True)

SSLError: HTTPSConnectionPool(host='127.0.0.1', port=55043): Max retries exceeded with url: /api/v1/groups (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

In [49]:
requests.get('https://127.0.0.1:55043/api/v1/groups',verify="")

<Response [200]>

In [50]:
# with verify=False the request returns a response
requests.get('https://127.0.0.1:55043/api/v1/groups',verify=False)

<Response [200]>

In [51]:
v = r"C:\Users\75801\AppData\Local\Programs\prs-atm-app\resources\quorum_bin\certs\server.crt"
requests.get('https://127.0.0.1:55043/api/v1/groups',verify=v)

<Response [200]>

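It seemed a little odd at first that every value except `verify=True` returns normally. The likely explanation: `verify=True` checks the server certificate against the default CA bundle, which does not include this local server's certificate; `False` and the empty string `""` (which is falsy) both turn verification off; and passing the path to `server.crt` tells requests to trust that certificate.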
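
Of these options, passing the certificate path is the only one that keeps verification switched on for the local server. A sketch of a small factory that sets it once per session; the function name `make_session` is hypothetical, and `"server.crt"` stands in for the full path used above:

```python
import requests

def make_session(ca_bundle=None):
    """Create a session that verifies TLS certificates.

    ca_bundle: optional path to a CA bundle or a self-signed server
    certificate to trust instead of the default CA bundle.
    """
    session = requests.Session()
    if ca_bundle:
        session.verify = ca_bundle
    # Otherwise session.verify keeps its default, True
    return session

print(make_session().verify)              # True
print(make_session("server.crt").verify)  # server.crt
```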

## Exception handling

For long-running programs, exceptions need to be caught and handled case by case. For example:

In [52]:
import requests
from requests.exceptions import ReadTimeout,HTTPError,RequestException

try:
    response = requests.get('http://www.baidu.com',timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('timeout')
except HTTPError:
    print('httperror')
except RequestException:
    print('reqerror')

200
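
Two details are worth knowing about the pattern above: `HTTPError` is only raised when `raise_for_status()` is called, and `RequestException` is the base class of the other exceptions, so it has to come last. A sketch of a wrapper along those lines; the function name `fetch_status` and the timeout values are my own choices:

```python
import requests
from requests.exceptions import HTTPError, RequestException, Timeout

def fetch_status(url, timeout=(3.05, 10)):
    """Fetch url and describe the outcome as a short string."""
    try:
        # timeout is (connect timeout, read timeout) in seconds
        resp = requests.get(url, timeout=timeout)
        # HTTPError is only raised when raise_for_status() is called
        resp.raise_for_status()
        return f"ok: {resp.status_code}"
    except Timeout:
        return "timeout"
    except HTTPError as err:
        return f"http error: {err.response.status_code}"
    except RequestException as err:
        # Base class of the exceptions requests raises; keep it last
        return f"request failed: {type(err).__name__}"

# A URL without a scheme fails before any network traffic happens
print(fetch_status("not-a-url"))  # request failed: MissingSchema
```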


That covers the common usage of the requests library.