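# Data Scraping

- The goal is to scrape hydrological data from a website. First, check the HTTP status code the site returns:

### Method 1: the urllib library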

In [1]:
from urllib.request import urlopen
status=urlopen("http://www.gxwater.gov.cn/Web/ArticleList.aspx?CategoryID=55").code
print(status)

HTTPError: HTTP Error 404: Not Found

The request fails with status code 404, meaning the page cannot be accessed this way.

### Method 2: the requests library
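
Because `urlopen` raises `HTTPError` for 4xx and 5xx responses rather than returning the code, the status can also be read from the exception. The following is a minimal sketch of that idea; the helper name `status_of` and its `opener` parameter are assumptions made for illustration and are not part of the original notebook.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def status_of(url, opener=urlopen):
    """Return the HTTP status code for url, including 4xx/5xx responses."""
    try:
        with opener(url) as response:
            return response.status
    except HTTPError as err:
        # urllib reports error statuses by raising; the code is stored on the exception
        return err.code
```

Calling `status_of(url)` on the address above would return the code as an integer instead of ending in a traceback. Connection-level failures (`URLError`) are deliberately not caught here.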

In [2]:
import requests
code=requests.get("http://www.gxwater.gov.cn/Web/ArticleList.aspx?CategoryID=55").status_code
print(code)

404


The status code is again 404.

## Solution
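
One possible reason both clients receive a 404 is that neither presents itself as a browser: by default urllib sends a `Python-urllib/x.y` User-Agent and requests sends `python-requests/x.y.z`. The sketch below makes no network request; it only inspects the default urllib header and shows how a browser-like value is attached to a request object. The Chrome string is just an example value, and whether the site actually filters on this header is an assumption.

```python
from urllib.request import Request, build_opener

URL = "http://www.gxwater.gov.cn/Web/ArticleList.aspx?CategoryID=55"
# Example browser User-Agent string (any realistic browser value would do)
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 "
    "(KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
)

# Headers urllib attaches when the caller supplies none
default_headers = dict(build_opener().addheaders)
default_ua = default_headers["User-agent"]

# Build (but do not send) a request carrying the browser-like User-Agent
request = Request(URL, headers={"User-Agent": BROWSER_UA})
# urllib normalizes header names with str.capitalize(), hence "User-agent"
custom_ua = request.get_header("User-agent")
```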

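- Change the `User-Agent` request header.
- Open the Google Chrome developer tools and watch the Network tab for files that are loaded dynamically, such as js and aspx files. Inspect the `Request URL`, `Referer`, and related fields under Headers for those requests to work out which file produces the data and which query parameters are sent when the data is refreshed. For example, `url=http://www.gxwater.gov.cn/Publish/Reservoir/BLL/AjaxHandle/RsChartDataProvider.ashx?stcd=80716200&start=2017-10-1 08:00:00&end=2017-12-17 08:00:00&type=Inq` shows that the data is served by an ashx handler and that the query parameters include the station code `stcd`, the start time `start`, the end time `end`, and the data type `type`.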
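
The query URL described above can be assembled from its parameters rather than typed by hand. This is a minimal sketch using only the standard library; the function name `build_query_url` and the default `data_type="Inq"` are assumptions for illustration, while the endpoint and parameter names come from the observed request.

```python
from urllib.parse import quote, urlencode

ENDPOINT = "http://www.gxwater.gov.cn/Publish/Reservoir/BLL/AjaxHandle/RsChartDataProvider.ashx"


def build_query_url(stcd, start, end, data_type="Inq"):
    """Assemble the data URL from station code, time window and data type."""
    params = {"stcd": stcd, "start": start, "end": end, "type": data_type}
    # quote_via=quote encodes spaces as %20 instead of '+';
    # safe=":" leaves the colons in the timestamps unescaped
    return ENDPOINT + "?" + urlencode(params, safe=":", quote_via=quote)


url = build_query_url("80716200", "2017-10-1 08:00:00", "2017-12-17 08:00:00")
```

Changing `stcd` or the time window then yields the URL for another station or period without editing the string manually.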

In [53]:
import requests
from bs4 import BeautifulSoup
import random
session = requests.Session()

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]

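# Pick a random User-Agent.
# This class follows the Scrapy downloader-middleware interface and is not used by the requests call below.
class RandomUAMiddleware(object):

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

# Headers for this request: a random User-Agent plus a browser-like Accept value
headers = {"User-Agent": random.choice(USER_AGENTS), "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
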
url = "http://www.gxwater.gov.cn/Publish/Reservoir/BLL/AjaxHandle/RsChartDataProvider.ashx?stcd=80808000&start=2017-12-1%2008:00:00&end=2017-12-17%2008:00:00&type=Inq"

req = session.get(url,headers=headers)

bsObj = BeautifulSoup(req.text, "lxml")
print(bsObj)

<html><body><p>{
	"start":"2017-12-01 08:00:00",
	"end":"2017-12-17 08:00:00",
	"base":{
		"STCD":"80808000",
		"STNM":"青狮潭",
		"STTP":"RR",
		"NSTTP":"3",
		"WRZ":"225.00",
		"GRZ":"225.00",
		"STLC":"广西灵川县青狮潭镇前宅村",
		"NTM":"2017-12-12 08:00:00",
		"NVAL":"1.70",
		"WPTN":"4"
},
	"data":{
	"schema":[
		{"name":"STCD","type":"string"},
		{"name":"TM","type":"datetime"},
		{"name":"VAL","type":"decimal"},
		{"name":"WPTN","type":"string"},
		{"name":"RID","type":"int"}
	],
	"data":[
	
		{"STCD":"80808000        ","TM":"2017-12-12 08:00:00","VAL":1.70,"WPTN":"4","RID":"1"},
		{"STCD":"80808000        ","TM":"2017-12-11 08:00:00","VAL":1.60,"WPTN":"4","RID":"2"},
		{"STCD":"80808000        ","TM":"2017-12-10 08:00:00","VAL":1.50,"WPTN":"4","RID":"3"},
		{"STCD":"80808000        ","TM":"2017-12-09 08:00:00","VAL":6.80,"WPTN":"4","RID":"4"},
		{"STCD":"80808000        ","TM":"2017-12-08 08:00:00","VAL":1.85,"WPTN":"4","RID":"5"},
		{"STCD":"80808000        ","T
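
The body returned by the handler is JSON, which BeautifulSoup merely wraps in `<html><body><p>` tags. A minimal sketch of extracting the readings with the standard `json` module follows. It assumes the complete response is valid JSON (the output above is cut off, so this cannot be confirmed here); the sample is an abbreviated copy of that output, and the helper name `extract_rows` is hypothetical.

```python
import json

# Abbreviated copy of the response shown above (two rows only), used as sample input
SAMPLE = """{
    "start": "2017-12-01 08:00:00",
    "end": "2017-12-17 08:00:00",
    "base": {"STCD": "80808000", "STNM": "青狮潭"},
    "data": {
        "data": [
            {"STCD": "80808000        ", "TM": "2017-12-12 08:00:00", "VAL": 1.70, "WPTN": "4", "RID": "1"},
            {"STCD": "80808000        ", "TM": "2017-12-11 08:00:00", "VAL": 1.60, "WPTN": "4", "RID": "2"}
        ]
    }
}"""


def extract_rows(text):
    """Return (station code, timestamp, value) tuples from the JSON payload."""
    payload = json.loads(text)
    return [
        # STCD is padded with trailing spaces in the response, so strip it
        (row["STCD"].strip(), row["TM"], row["VAL"])
        for row in payload["data"]["data"]
    ]


rows = extract_rows(SAMPLE)
```

With the live session, `req.text` would be passed to `extract_rows` in place of `SAMPLE`; `req.json()` is an alternative that skips the explicit `json.loads` call.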

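### References

- [Why can't many websites be scraped? Six common ways for crawlers to get past blocking](http://www.test404.com/post-663.html)
- [Scraping dynamic pages with a Python crawler: approach and examples (Part 1)](http://blog.csdn.net/qq_30242609/article/details/53788228)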