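# Scraping City Air-Quality Data and Understanding How Crawlers Work

A web crawler is a common tool for collecting data. Its basic workflow is:
- Establish a network connection
- Use a dedicated toolkit to extract the relevant information from a specified web page
- Process and save the extracted information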
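
The three steps above can be sketched end to end with the standard library alone. This is only an illustration: `fetch` is a stub that returns a made-up page (`SAMPLE_PAGE`) so the example runs offline, and `CellCollector`, `extract`, and `save` are hypothetical names, not part of the original project.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded page, so the sketch runs offline.
SAMPLE_PAGE = (
    "<table><tr><td>2018-05-01</td><td>63</td></tr>"
    "<tr><td>2018-05-02</td><td>48</td></tr></table>"
)


def fetch(url):
    """Step 1: connect and download. Stubbed here; a real crawler sends an HTTP request."""
    return SAMPLE_PAGE


class CellCollector(HTMLParser):
    """Step 2: collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


def extract(page):
    collector = CellCollector()
    collector.feed(page)
    return collector.rows


def save(rows, out):
    """Step 3: write the rows as CSV to any file-like object."""
    csv.writer(out).writerows(rows)


buffer = io.StringIO()
save(extract(fetch("http://example.invalid/")), buffer)
print(buffer.getvalue())
```

The rest of this notebook replaces each stub with a real tool: Requests for the download, BeautifulSoup or lxml for the extraction, and the `csv` module for saving.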

## Processing City Information


Website to crawl: Air Quality Index (http://www.tianqihoubao.com/aqi/)  


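HTML basics  
Hyperlink: `<a href='link URL'>link text</a>`  
Press F12 in the browser (or right-click and choose "Inspect") to open the developer tools and view the page source.  
[More about HTML](https://www.runoob.com/html/html-tutorial.html)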
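
To make the anatomy of a hyperlink concrete, here is a small sketch that pulls the `href` attribute and the link text out of `<a>` elements using the standard library's `html.parser`. The `snippet` string is a made-up imitation of a city list, and `LinkCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical snippet imitating a list of city links.
snippet = (
    '<li><a href="/aqi/hangzhou.html">杭州 </a></li>'
    '<li><a href="/aqi/beijing.html">北京 </a></li>'
)
collector = LinkCollector()
collector.feed(snippet)
print(collector.links)
```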

In [None]:
# Load the saved source of the index page into `html` (used by the next cell)
# html = "".join(open('./citychk.txt').readlines())

In [None]:
import re

# Raw string so \w reaches the regex engine intact; \. matches a literal dot
m = re.findall(r'href="/aqi/\w*\.html">.{0,5} ', html)

# lambd = lambda x: (x.split('>')[1].strip(), x[11:x.index('.')])
# or use function
def generate_code(x):
    end = x.index('.')
    code = x[11:end]
    name = x.split('>')[1].strip()
    return name,code
    
city_coding = list(map(generate_code, m))

# remove duplicate data
print(len(city_coding), len(set(city_coding)))
city_coding = set(city_coding)

# save
with open('./city_coding.txt', 'w', encoding='utf-8') as f:
    for line in city_coding:
        f.write('\t'.join(line) + '\n')
print('Saved!')
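
The extraction can also be tried on a small inline sample. The sketch below is a variation rather than the code used above: it uses capture groups instead of slicing by character position, and the `sample` string is a made-up imitation of the links on the index page.

```python
import re

# Hypothetical sample imitating the anchors on the index page (one duplicate).
sample = (
    '<a href="/aqi/hangzhou.html">杭州 </a>'
    '<a href="/aqi/beijing.html">北京 </a>'
    '<a href="/aqi/hangzhou.html">杭州 </a>'
)

# Group 1 captures the city code in the path, group 2 the link text.
pattern = re.compile(r'href="/aqi/(\w+)\.html">([^<]*)<')

# A set comprehension removes the duplicate pair.
pairs = {(name.strip(), code) for code, name in pattern.findall(sample)}
print(sorted(pairs))
```

Capture groups avoid the hard-coded offset `11`, which silently breaks if the URL prefix ever changes.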

**Full project code, for reference**  
[高民权: Chinese city air-quality data scraper (GitHub)](https://github.com/fortyMiles/ChineseAirConditionCrawler)      
The `get_location_info.py` file in the GitHub repository corresponds to the generation of city_coding.

## Scraping the Weather Data

### Reading city_coding

In [None]:
def get_city_coding(file='./city_coding.txt'):
    city_coding = {}
    with open(file, encoding='utf-8') as f:
        for line in f.readlines():
            line = line.strip()
            try: 
                city, coding = line.split('\t')
                city_coding[city.strip()] = coding.strip()
            except ValueError:
                continue
    return city_coding

city_coding = get_city_coding()

### Building the URL You Need  

For the current month, the city code alone is enough, e.g. http://www.tianqihoubao.com/aqi/hangzhou.html  
For a past month, the URL has the form http://www.tianqihoubao.com/aqi/hangzhou-201702.html

In [None]:
def build_url(city_coding, year=None, month=None):
    BASE = 'http://www.tianqihoubao.com/aqi/'
    city_base_url = BASE + '{}.html'
    city_data_base_url = BASE + '{}-{}{}.html'
    
    if year is not None and month is not None:
        month = str(month) if month >= 10 else '0' + str(month)
        return city_data_base_url.format(city_coding, year, month)
    else:
        return city_base_url.format(city_coding)
    
hangzhou = city_coding['杭州']
print(build_url(hangzhou))
print(build_url(hangzhou, 2018, 5))
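
As a possible variation, the zero-padding can be left to a format specifier, which also makes it easy to list every monthly URL for a year. `month_url` and `year_urls` are hypothetical helper names, and the sketch assumes the URL pattern described above.

```python
BASE = 'http://www.tianqihoubao.com/aqi/'


def month_url(code, year, month):
    """Build the history URL; the `:02d` specifier zero-pads the month."""
    return f'{BASE}{code}-{year}{month:02d}.html'


def year_urls(code, year):
    """Return one URL per month of the given year."""
    return [month_url(code, year, m) for m in range(1, 13)]


urls = year_urls('hangzhou', 2018)
print(urls[0])
print(urls[-1])
```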

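### Scraping Data with Python

#### The HTTP Protocol
The HyperText Transfer Protocol (HTTP) is one of the most widely used network protocols on the Internet. All WWW documents must follow this standard.  

HTTP governs communication between a client and a server. It specifies the format in which the client sends requests to the server, and the format of the responses the server returns.    

The side that requests text, images, or other information is called the client; the side that responds with that information is called the server. 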

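Methods a client uses to tell the server how it wants to access a resource:
- GET: retrieve content
- POST: submit data such as a form, e.g. to log in to a site that only serves data to logged-in users
- PUT: upload a file  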
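
The methods listed above can be seen without touching the network by building request objects with the standard library's `urllib.request.Request` and never sending them. The login and upload URLs and the form field names below are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://www.tianqihoubao.com/aqi/hangzhou.html'

# Without a body, the request defaults to GET.
get_request = Request(url)

# Supplying form data makes it a POST. The field names are hypothetical.
form = urlencode({'user': 'alice', 'password': 'secret'}).encode('ascii')
post_request = Request('http://example.invalid/login', data=form)

# The method can also be set explicitly, e.g. PUT to upload a file body.
put_request = Request(
    'http://example.invalid/files/report.csv',
    data=b'a,b\n1,2\n',
    method='PUT',
)

# Nothing is sent here; we only inspect what each request would do.
for request in (get_request, post_request, put_request):
    print(request.get_method(), request.full_url)
```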

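Further reference:
[HTTP status codes](https://www.runoob.com/http/http-status-codes.html)  
The ones to know are 200, 404, and 503:
 - 200 OK: the request succeeded
 - 404 Not Found: the requested resource does not exist, e.g. the URL was mistyped
 - 503 Service Unavailable: the server cannot handle the request right now and may recover after a while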
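
Python's standard library already knows these codes: `http.HTTPStatus` maps each number to its official reason phrase. The `should_retry` function is a hypothetical policy added for illustration, on the assumption that only server-side (5xx) errors are worth retrying.

```python
from http import HTTPStatus

# Look up the official reason phrase for each code.
for code in (200, 404, 503):
    status = HTTPStatus(code)
    print(code, status.phrase)


def should_retry(code):
    """Hypothetical policy: retry only on server-side (5xx) errors."""
    return 500 <= code < 600


print(should_retry(404), should_retry(503))
```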

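#### Requests
A page made of plain HTML is usually called a static page, and its data is comparatively easy to fetch.   
For scraping static pages, the Requests library makes sending HTTP requests straightforward.    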

In [None]:
# Install Requests (run this in a terminal)
pip install requests

In [None]:
# Fetch the response
import requests
r = requests.get('https://www.crummy.com/')

print('Text encoding (as declared by the server):', r.encoding)

print('Status code (200 means success):', r.status_code)

print('Response body as a string (the content returned by the server):', r.text)

Further reading:
- [What is the difference between Unicode and UTF-8? (answer by 盛世唐朝)](https://www.zhihu.com/question/23374078)

#### Parsing Pages with BeautifulSoup  
BeautifulSoup is a toolkit for extracting data from HTML or XML files.   
Reference: [Beautiful Soup 4.2.0 documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html)

**First, install the package**  

``` bash
pip install bs4
```

In [None]:
from bs4 import BeautifulSoup
# Turn the response text into a soup object, then use BeautifulSoup's features on it
soup = BeautifulSoup(r.text,'html.parser')
# Get the page's <title> element
print(soup.title)
# Pretty-print the markup
print(soup.prettify())

#### Parsing Pages with lxml
lxml is another popular parsing library.   

In [None]:
# Import the etree module from lxml
from lxml import etree

# Call etree.HTML() with the response text to parse it into an element tree
html = etree.HTML(r.text)

print(type(html))
result = etree.tostring(html)
print(result)


`tostring()` serializes the parsed HTML document. Its output is of type bytes, which `decode()` converts to a str for printing.

In [None]:
print(result.decode('utf-8'))

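An XPath expression that starts with `//` selects all matching nodes anywhere in the document. For the HTML above, for example, we can select every `<link>` node.    
[More reference](https://www.cnblogs.com/baowee/p/11364941.html)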

In [None]:
link_result = html.xpath('//link')
print(link_result)

In [None]:
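# //div selects every <div> node in the document; / then selects direct children
# of the current node, so //div/a matches <a> elements directly inside a <div>.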
div_a_result = html.xpath('//div/a')
print(div_a_result)

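First, use F12 to inspect the request for hangzhou-201805.html. The response header `Content-Type: text/html; charset=gb2312` shows that the page is encoded as GB2312, which can be decoded as GBK (GBK is a superset of GB2312).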
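
Why the declared charset matters can be shown offline with a short string. This is a sketch with a made-up sample text: the bytes are what a server declaring `charset=gb2312` would send, and decoding them with the wrong codec produces mojibake.

```python
text = '杭州空气质量'

# GB2312 bytes, as a server declaring charset=gb2312 would send them.
raw = text.encode('gb2312')
print(len(raw), 'bytes')

# GBK also decodes these bytes correctly.
print(raw.decode('gbk'))

# Decoding with an unrelated single-byte codec yields mojibake instead.
print(raw.decode('latin-1'))
```

With Requests, the codec used for `response.text` can be overridden by assigning to `response.encoding` (for example `'gbk'`) before reading `response.text`.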

In [None]:
import requests
from bs4 import BeautifulSoup

hangzhou = city_coding['杭州']
url = build_url(hangzhou, 2018, 5)

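# Send the request (requests.get for GET, requests.post for POST)

response = requests.get(url)

# Inspect the response object
# help(response)

print(response.status_code, response.ok)

# Print the encoding requests detected, then parse the decoded body
print(response.encoding)
html = response.text
soup = BeautifulSoup(html, 'html.parser')

# Some attributes
# The page title
print(soup.title)
# The page text
# print(soup.text)

# Find elements by tag
data_table = soup.find_all('table')
print(len(data_table))
# print(data_table)

# Since there is only one table,
# the shortcut below returns it directly
data_table = soup.table

# The table can then be parsed in more detail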

In [None]:
# Take a look at data_table
# print(data_table)

# data_table.contents returns the element's children as a list.
# The first row is the table header,
# and every other element is a '\n' text node.

name_index = 1
content = data_table.contents[name_index:]

result = []
for index, c in enumerate(content[::2]):
    if index == 0:
        result.append(tuple(['city'] + c.text.split()))
    else:
        result.append(tuple([hangzhou] + c.text.split()))
        
print(len(result), result)

### Putting the Code Together

In [None]:
import re
import requests
from bs4 import BeautifulSoup

def get_city_coding(file='./city_coding.txt'):
    city_coding = {}
    with open(file, encoding='utf-8') as f:
        for line in f.readlines():
            line = line.strip()
            try: 
                city, coding = line.split('\t')
                city_coding[city.strip()] = coding.strip()
            except ValueError:
                continue
    return city_coding

def build_url(city_coding, year=None, month=None):
    BASE = 'http://www.tianqihoubao.com/aqi/'
    city_base_url = BASE + '{}.html'
    city_data_base_url = BASE + '{}-{}{}.html'
    
    if year is not None and month is not None:
        month = str(month) if month >= 10 else '0' + str(month)
        return city_data_base_url.format(city_coding, year, month)
    else:
        return city_base_url.format(city_coding)
    
def parse(url, city_name):
    response = requests.get(url)
    if response.ok:
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        data_table = soup.table

        name_index = 1
        content = data_table.contents[name_index:]

        result = []
        for index, c in enumerate(content[::2]):
            if index == 0:
                result.append(tuple(['city'] + c.text.split()))
            else:
                result.append(tuple([city_name] + c.text.split()))
        
        return result
    else:
        print('Network Error:', response.status_code)

city_coding = get_city_coding()
want_city = city_coding['杭州']
url = build_url(want_city, 2019, 9)
result = parse(url, want_city)
print(result)

Save the scraped data

In [None]:
import csv
import os

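def save_csv(file, data):
    # Nothing to save if the request failed or only the header row came back
    if data is None or len(data) <= 1:
        return
    if os.path.exists(file):
        # Append to an existing file, skipping the header row
        with open(file, 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerows(data[1:])
    else:
        with open(file, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerows(data)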

save_csv(f'./{want_city}.csv', result)
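
The write-the-header-only-once behaviour can be checked offline. The sketch below is a compact variant, not the function above: `append_rows` is a hypothetical name, the rows are made up, and everything is written to a temporary directory.

```python
import csv
import os
import tempfile


def append_rows(path, rows):
    """Append rows, writing the header row only when the file is first created."""
    is_new = not os.path.exists(path)
    # newline='' lets the csv module control line endings itself.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(rows if is_new else rows[1:])


# Hypothetical rows; the first tuple plays the role of the table header.
may = [('city', 'date', 'aqi'), ('hangzhou', '2018-05-01', '63')]
june = [('city', 'date', 'aqi'), ('hangzhou', '2018-06-01', '55')]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.csv')
    append_rows(path, may)
    append_rows(path, june)
    with open(path, newline='', encoding='utf-8') as f:
        saved = list(csv.reader(f))

print(saved)
```

Opening in `'a'` mode creates the file if it is missing, so a single `open` call covers both branches.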

### Scraping All the Data
- **Use with caution: refrain from running this in Jupyter!**

In [None]:
allcities = list(get_city_coding().keys())

file = r'./allcity_2019.csv'
for city in allcities:
    city_code = city_coding[city]
    for year in range(2019, 2018,-1):
        for month in range(1,13):
            url = build_url(city_code, year, month)
            data = parse(url, city_code)
            print(f'\r{city}\t{year}-{month} {len(data) if data else 0}', end='')
            save_csv(file, data)
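
A full crawl sends thousands of requests, so it is worth pausing between them and keeping track of failures. The sketch below is one possible shape for such a loop, not part of the original project: `crawl` and `fake_fetch` are hypothetical names, and the download is replaced by a fake function so the example runs offline.

```python
import time


def crawl(tasks, fetch, delay=1.0, retries=2):
    """Call fetch(task) for each task, pausing after every request.

    `fetch` returns the parsed rows, or None on failure. A failed task is
    retried up to `retries` times and then reported back to the caller.
    """
    results, failed = {}, []
    for task in tasks:
        for attempt in range(retries + 1):
            data = fetch(task)
            time.sleep(delay)  # be polite: never hammer the server
            if data is not None:
                results[task] = data
                break
        else:
            # The inner loop finished without a break: every attempt failed.
            failed.append(task)
    return results, failed


# Hypothetical stand-in for the real download, so the sketch runs offline.
calls = []


def fake_fetch(task):
    calls.append(task)
    # Fail the first attempt for February, and every attempt for March.
    if task == ('hangzhou', 2019, 2) and calls.count(task) == 1:
        return None
    if task == ('hangzhou', 2019, 3):
        return None
    return [('city', 'date'), (task[0], f'{task[1]}-{task[2]:02d}-01')]


tasks = [('hangzhou', 2019, month) for month in (1, 2, 3)]
results, failed = crawl(tasks, fake_fetch, delay=0)
print(len(results), failed)
```

In real use, `fetch` would wrap `build_url` and `parse`, and `delay` would stay at a second or more.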

**Full project code, for reference**  
[高民权: Chinese city air-quality data scraper (GitHub)](https://github.com/fortyMiles/ChineseAirConditionCrawler)

In [None]:
import requests
from bs4 import BeautifulSoup


def get_city_coding():
    CITY_CODING = './city_coding.txt'
    city_coding = {}
    # The file contains Chinese city names, so read it explicitly as UTF-8
    with open(CITY_CODING, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            line = line.strip()
            try:
                city, coding = line.split('\t')
                city_coding[city.strip()] = coding.strip()
            except ValueError:
                continue

    return city_coding


def build_url(city_coding, year=None, month=None):
    BASE = 'http://www.tianqihoubao.com/aqi/'
    city_base_url = BASE + "{}.html"
    city_data_base_url = BASE + "{}-{}{}.html"

    if year is not None and month is not None:
        month = str(month) if month >= 10 else '0' + str(month)
        return city_data_base_url.format(city_coding, year, month)
    else:
        return city_base_url.format(city_coding)


def get_from_http(city_coding, year=None, month=None):
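    '''
    Fetch one month of air-quality data for a city.

    :param city_coding: pinyin code of the city, e.g. hangzhou
    :param year: e.g. 2016
    :param month: e.g. 10
    :return: a list of tuples, or None if the request fails. The first tuple
             is the header row ('city', ...column names); every later tuple
             is (city_coding, ...cells of one table row), i.e.

             air_condition = (Date, AQI, Pm2.5, Pm10, No2, So2, Co, O3)
    '''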

    url = build_url(city_coding, year, month)

    content = get_some_day_air_condition(city_coding, url)

    return content


def get_some_day_air_condition(city_coding, url):
    try:
        r = requests.get(url)
        if r.status_code == 200:
            r.encoding = 'GBK'
            html_file = r.text
            soup = BeautifulSoup(html_file, 'html.parser')

            # The page contains a single table, so soup.table is enough
            data_table = soup.table

            return parse(city_coding, data_table)
        else:
            return None
    except Exception as e:
        print('connect error')
        print(e)
        return None


def parse(city_coding, data):
    #data.contents[1].text.split()
    #data.contents[3].text.split()
    name_index = 1
    content = data.contents[name_index:]

    result = []

    for index, c in enumerate(content[::2]):
        if index == 0:
            result.append(tuple(['city'] + c.text.split()))
        else:
            result.append(tuple([city_coding] + c.text.split()))

    return result


if __name__ == '__main__':
    # get_from_http('hangzhou', 2015, 10)
    city_coding = get_city_coding()
    assert city_coding['杭州'] == 'hangzhou'

    hangzhou = city_coding['杭州']

    print('testing')

    assert build_url(hangzhou, 2016, 5) == "http://www.tianqihoubao.com/aqi/hangzhou-201605.html"
    assert build_url(hangzhou, 2016) == "http://www.tianqihoubao.com/aqi/hangzhou.html"
    assert build_url(hangzhou) == "http://www.tianqihoubao.com/aqi/hangzhou.html"

    assert get_some_day_air_condition("hangzhou", "http://www.tianqihoubao.com/aqi/hangzhou-201605.html") is not None

    data = get_some_day_air_condition("hangzhou", "http://www.tianqihoubao.com/aqi/hangzhou-201605.html")
    #print(data)

    city_data = get_from_http('hangzhou', 2015, 10)
    print(city_data)

    print('test done')

**Note: open the file in get_city_coding with UTF-8 encoding.**
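
When no encoding is passed, `open()` may fall back to a platform-dependent default, so a file written on one machine can fail to load on another. The sketch below shows a round trip with the encoding stated on both sides. The two-city mapping is made up, and the file lives in a temporary directory.

```python
import os
import tempfile

# Hypothetical mapping in the same tab-separated layout as city_coding.txt.
cities = {'杭州': 'hangzhou', '北京': 'beijing'}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'city_coding.txt')

    # Write and read with the same explicit encoding.
    with open(path, 'w', encoding='utf-8') as f:
        for name, code in cities.items():
            f.write(f'{name}\t{code}\n')

    with open(path, encoding='utf-8') as f:
        loaded = dict(line.rstrip('\n').split('\t') for line in f)

print(loaded == cities)
```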

Further reading:
- [Which Python libraries do you wish you had discovered sooner?](https://www.zhihu.com/question/24590883)