Out of idle curiosity I've been learning Python web scraping, and I defined a function to fetch a page's HTML content:
```python
import urllib.request
import urllib.error

def askURL(url):
    request = urllib.request.Request(url)  # build the request
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the page content (bytes)
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
```
Running it, the output was `418`, meaning an HTTP Error 418 occurred.
Seeing this error, I suspected an anti-scraping mechanism: a direct request gets blocked, so I most likely needed to imitate a browser.
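The likely trigger is the client identification urllib sends by default. Unless you supply your own `User-Agent`, urllib announces itself as `Python-urllib/<version>`, which many sites use to recognize and block scrapers. As a rough sketch, you can inspect that default without making any network call:

```python
import urllib.request

# build_opener() pre-populates the headers sent with every request;
# the default User-Agent is "Python-urllib/<python version>"
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers['User-agent'])  # e.g. "Python-urllib/3.11"
```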
So I opened the browser, pressed F12, visited an arbitrary site, selected a request, and looked under `Headers`; scrolling down, the `User-Agent` field shows which browser made the request.
So I modified the code as follows:
```python
import urllib.request
import urllib.error

def askURL(url):
    # pretend to be a Chrome browser on Windows 10
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)  # build the request with a browser User-Agent
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the page content (bytes)
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
With that change, the page content is fetched successfully!
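As a quick offline sanity check (the URL below is just a placeholder), you can confirm that the `Request` object now carries the browser `User-Agent` before any traffic is sent, since no network I/O happens until `urlopen()` is called:

```python
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/67.0.3396.99 Safari/537.36'}
# placeholder URL; constructing a Request performs no network I/O
request = urllib.request.Request("https://example.com", headers=headers)
print(request.get_header("User-agent"))  # the Chrome User-Agent string above
```

Note also that `response.read()` returns bytes, so the result usually needs to be decoded (e.g. `html.decode('utf-8')`) before parsing.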