urllib HTTP 418 error in a Python crawler #108

imuncle opened this issue Apr 20, 2020 · 0 comments
Labels: web (web development notes)

imuncle commented Apr 20, 2020

With some time on my hands I decided to learn a bit of Python web scraping, and defined a function that fetches a page's HTML content:

import urllib.request
import urllib.error

def askURL(url):
    request = urllib.request.Request(url)  # build the request
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the page content
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

Running it printed 418, meaning the server answered with HTTP Error 418. (418 is the joke status "I'm a teapot" from RFC 2324, which some sites repurpose to turn away suspected crawlers.)
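Concretely, the failing call looks like this (the URL is only a placeholder, not the actual site I was fetching):

html = askURL('https://example.com')  # placeholder URL; the real target rejected the bare request
# console output:
# 418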

Seeing that error, my first thought was an anti-crawler mechanism: the request probably has to imitate a browser, because bare requests get intercepted.
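The giveaway is the User-Agent header: out of the box, urllib announces itself as Python rather than as a browser, which is trivial for a server to filter on. A quick way to see the default header it sends (a minimal sketch; the version suffix depends on your interpreter):

import urllib.request

opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.8')]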

So I opened a browser, pressed F12, visited an arbitrary website, selected one of the requests, opened its Headers panel, and scrolled down to User-Agent, which tells the server which browser is making the request.

[Screenshot: DevTools Network panel showing the request's User-Agent header]

So I changed the code as follows:

def askURL(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)  # build the request with a browser User-Agent
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the page content
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

and now the page content comes back successfully!
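For completeness, the same header trick works with the third-party requests library, if you would rather use it than urllib; this is just a sketch with a placeholder URL:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
response = requests.get('https://example.com', headers=headers, timeout=10)  # placeholder URL
html = response.text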

imuncle added the web (web development notes) label on Apr 20, 2020