urllib HTTP 418 error in a Python crawler #108

imuncle opened this issue Apr 20, 2020 · 0 comments
Labels: web (web development notes)

imuncle commented Apr 20, 2020

With some time on my hands I decided to learn a bit of Python web scraping, and defined a function that fetches a page's HTML content:

import urllib.request
import urllib.error

def askURL(url):
    request = urllib.request.Request(url)  # build the request
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the page content
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

Running it printed 418, meaning the server answered with HTTP Error 418. (418 is the joke status "I'm a teapot" from RFC 2324, which some sites repurpose to turn away suspected crawlers.)
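Concretely, the failing call looks like this (the URL is only a placeholder, not the actual site I was fetching):

html = askURL('https://example.com')  # placeholder URL; the real target rejected the bare request
# console output:
# 418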

Seeing that error, my first thought was an anti-crawler mechanism: the request probably has to imitate a browser, because bare requests get intercepted.
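The giveaway is the User-Agent header: out of the box, urllib announces itself as Python rather than as a browser, which is trivial for a server to filter on. A quick way to see the default header it sends (a minimal sketch; the version suffix depends on your interpreter):

import urllib.request

opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.8')]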

So I opened a browser, pressed F12, visited an arbitrary website, selected one of the requests, opened its Headers panel, and scrolled down to User-Agent, which tells the server which browser is making the request.

[Screenshot: DevTools Network panel showing the request's User-Agent header]

So I changed the code as follows:

def askURL(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)  # build the request with a browser User-Agent
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the page content
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

and now the page content comes back successfully!
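For completeness, the same header trick works with the third-party requests library, if you would rather use it than urllib; this is just a sketch with a placeholder URL:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
response = requests.get('https://example.com', headers=headers, timeout=10)  # placeholder URL
html = response.text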

imuncle added the web (web development notes) label on Apr 20, 2020