# Web Scraping the Full Text of "the President’s lies" in Just 16 Lines of Code

Compiled by: 曹鑫 (co-founder of [CDA.cn](http://cda.cn))

Project repository: [Python Tutorial for Humans (《普通人也可以学 Python》)](https://github.com/imcda/Python-Tutorial-for-Humans)

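## Part 1

- What is web scraping?
- Looking at the article to scrape, e.g. [Trump’s Lies](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html)
- Inspecting the page source
- What we learn:
    - HTML code is made up of tags
    - Tags can have many attributes
    - Tags can be nested several levels deep

## Part 2

- Reading the web page with Python
- Parsing the page source with Beautiful Soup
- Collecting all of the records

## Part 3

- Extracting the date
- Extracting the `lie`
- Extracting the `explanation`
- Extracting the `URL`
- Reviewing Beautiful Soup methods and attributes
- Building the dataset

## Part 4

- Applying a tabular data structure
- Exporting the dataset to a CSV file
- Summary: 16 lines of Python code
- Appendix 1: Advice on web scraping
- Appendix 2: Web scraping resources
- Appendix 3: Alternative Beautiful Soup syntax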

# Part 1

## What is web scraping?



## Looking at the article to scrape

Let's open the page [Trump’s Lies](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html) and take a look.

## Inspecting the page source (HTML)

`Hello <strong><em>CDA</em> students</strong>`

Hello <strong><em>CDA</em> students</strong>
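The nesting in the snippet above can be made visible with a short sketch. It uses Python's standard-library `html.parser` purely for illustration (the tutorial itself relies on Beautiful Soup later on). The class name `TagPrinter` and the `class` attribute added to the sample are assumptions of this example, not part of the original page.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Record each tag and text node together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append((self.depth, 'tag', tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        self.events.append((self.depth, 'text', data, {}))


# The sample from above, with a hypothetical class attribute added to <em>
sample_html = 'Hello <strong><em class="name">CDA</em> students</strong>'

parser = TagPrinter()
parser.feed(sample_html)
parser.close()

# Indent each event by its depth to show which tag contains which
for depth, kind, value, attrs in parser.events:
    print('  ' * depth + f'{kind}: {value!r}', attrs if attrs else '')
```

The indentation of the printed lines mirrors the structure: `<em>` sits inside `<strong>`, and the text `CDA` sits inside `<em>`.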


# Part 2
## Reading the web page with Python

`pip install requests`

In [2]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

In [42]:
a_list = ['1','2','100','5','20']
a_list.sort()  # sort() must be called; it sorts in place and returns None
print(a_list)

['1', '100', '2', '20', '5']


In [3]:
# Print the first 500 characters of the page source
print(r.text[0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page


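## Parsing the HTML with Beautiful Soup
`pip install beautifulsoup4`

**Note:** The package is installed under the lowercase name `beautifulsoup4`, but it is imported as `bs4`, and the class name `BeautifulSoup` uses mixed case, as in the import below.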

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

## Collecting all of the records

To study the markup, you can use your browser's `Inspect Element` tool, or copy the source into an editor such as `Sublime Text` and reformat it so the structure is easier to see.

```
<span class="short-desc">
    <strong>
        DATE
    </strong>
        LIE
    <span class="short-truth">
        <a href="URL" target="_blank">
        EXPLANATION
        </a>
    </span>
</span>
```


In [5]:
results = soup.find_all('span', attrs={'class':'short-desc'})

In [6]:
len(results)

180

In [7]:
results[0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [8]:
results[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

# Part 3
## Extracting the date

In [9]:
first_result = results[0]
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [10]:
first_result.find('strong')

<strong>Jan. 21 </strong>

In [11]:
first_result.find('strong').text

'Jan. 21\xa0'

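### `\xa0` is the non-breaking space, `&nbsp;`
The ordinary space is `\x20`, which falls within the standard ASCII printable range `0x20~0x7e`.
`\xa0`, by contrast, belongs to the extended characters of `latin1 (ISO/IEC 8859-1)` and represents the whitespace character `nbsp (non-breaking space)`.
The `latin1` character set is backward compatible with `ASCII (0x20~0x7e)`. Much of the text we come across is `latin1`, for example in MySQL databases.

### `\u3000` is the full-width space
According to the `Unicode` standard and its definition of the Basic Multilingual Plane, `\u3000` sits in the CJK Symbols and Punctuation block and is one of the whitespace characters. Its name is `Ideographic Space`. As the name suggests, it is the full-width `CJK` space. Unlike `nbsp`, it allows a line break. It is commonly used for indentation; the `wiki` also says it is used for honorific spacing (抬头), though the author has not seen it used that way.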
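The properties described above can be checked with a small standard-library sketch. The variable names are chosen for this example only; `raw_date` reproduces the string returned for the first `<strong>` tag.

```python
import unicodedata

raw_date = 'Jan. 21\xa0'  # the text of the first <strong> tag

# Look up the official Unicode names of the three characters discussed above
for ch in (' ', '\xa0', '\u3000'):
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

# Python treats all three as whitespace, so strip() removes the trailing \xa0
cleaned = raw_date.strip()
print(repr(cleaned))

# \xa0 can be encoded as latin1, while \u3000 cannot
print('\xa0'.encode('latin-1'))
try:
    '\u3000'.encode('latin-1')
except UnicodeEncodeError as exc:
    print('not encodable in latin1:', exc.reason)
```

Because `strip()` removes any run of leading or trailing whitespace, it is one possible alternative to slicing off exactly one character.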

In [12]:
type(first_result.find('strong').text)

str

In [13]:
first_result.find('strong').text[0:-1]

'Jan. 21'

In [14]:
first_result.find('strong').text[0:-1] + ', 2017'

'Jan. 21, 2017'

## Extracting the lie

In [15]:
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [16]:
first_result.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [17]:
first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [18]:
type(first_result.contents[1])

bs4.element.NavigableString

In [19]:
first_result.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

## Extracting the explanation

In [20]:
first_result.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [21]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [22]:
first_result.find('a').text

'(He was for an invasion before he was against it.)'

In [23]:
first_result.find('a').text[1:-1]

'He was for an invasion before he was against it.'
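The slices `[1:-2]` (for the lie) and `[1:-1]` (for the explanation) assume an exact number of unwanted characters at each end. As a hedged alternative, the sketch below removes the characters by kind rather than by position. The helper names `clean_lie` and `clean_explanation` are hypothetical and do not appear in the original notebook.

```python
def clean_lie(raw):
    """Drop surrounding whitespace, then any curly quotation marks."""
    return raw.strip().strip('“”')


def clean_explanation(raw):
    """Drop surrounding whitespace, then the enclosing parentheses."""
    return raw.strip().strip('()')


raw_lie = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
raw_explanation = '(He was for an invasion before he was against it.)'

print(clean_lie(raw_lie))
print(clean_explanation(raw_explanation))
```

One trade-off to keep in mind: `strip('()')` removes every parenthesis at either end, so a text that legitimately ends in `)` would lose it. For this page's markup both approaches give the same result on the first record.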

## Extracting the URL

In [24]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [25]:
first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

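## Reviewing Beautiful Soup methods and attributes

You can apply the following two methods to either the initial soup object or a Tag object:

+ `find()`: searches for the first tag that matches and returns a Tag object
+ `find_all()`: searches for every tag that matches and returns a ResultSet object, which you can treat as a list of Tag objects

You can use the following two attributes to extract information from a Tag object:

+ `text`: extracts the text inside the Tag and returns a string
+ `contents`: extracts the Tag's children and returns a list of Tag objects and strings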
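The two methods and two attributes listed above can be tried offline on a small sample. The HTML below is a hypothetical two-record fragment in the same shape as the page's markup, and the names `sample_html`, `sample_soup`, `all_spans`, and `first_span` are chosen for this sketch only.

```python
from bs4 import BeautifulSoup

# Hypothetical two-record sample in the same shape as the page's markup
sample_html = (
    '<span class="short-desc"><strong>Jan. 1 </strong>“First claim.” '
    '<span class="short-truth"><a href="https://example.com/1">(First note.)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 2 </strong>“Second claim.” '
    '<span class="short-truth"><a href="https://example.com/2">(Second note.)</a></span></span>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all() returns every match; find() returns only the first one
all_spans = sample_soup.find_all('span', attrs={'class': 'short-desc'})
first_span = sample_soup.find('span', attrs={'class': 'short-desc'})
print(len(all_spans))

# .text joins all text inside the tag, including text in nested tags
print(first_span.text)

# .contents lists only the direct children: a Tag, a string, and another Tag
for child in first_span.contents:
    print(type(child).__name__, repr(str(child))[:40])
```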

## Building the dataset

In [26]:
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

In [27]:
len(records)

180

In [28]:
records[0:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]
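The loop above assumes that every record contains a `<strong>` tag, a text node, and a link. If the page's markup ever differs, a missing piece would raise an exception partway through. The following is a minimal defensive sketch, not the notebook's method: `parse_record` is a hypothetical helper, and the sample HTML is invented so that the code runs offline.

```python
from bs4 import BeautifulSoup


def parse_record(result, year='2017'):
    """Return (date, lie, explanation, url), or None if a piece is missing."""
    strong = result.find('strong')
    link = result.find('a')
    if strong is None or link is None or len(result.contents) < 2:
        return None
    date = strong.text.strip() + ', ' + year
    lie = str(result.contents[1]).strip().strip('“”')
    explanation = link.text.strip().strip('()')
    url = link.get('href')  # .get() returns None instead of raising KeyError
    return (date, lie, explanation, url)


# Hypothetical sample: one complete record and one record without a link
sample_html = (
    '<span class="short-desc"><strong>Jan. 1\xa0</strong>“A claim.” '
    '<span class="short-truth"><a href="https://example.com/1">(A note.)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 2\xa0</strong>“No source here.” </span>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

sample_records = []
skipped = 0
for result in sample_soup.find_all('span', attrs={'class': 'short-desc'}):
    record = parse_record(result)
    if record is None:
        skipped += 1  # count incomplete records rather than crashing
    else:
        sample_records.append(record)

print(sample_records)
print('skipped:', skipped)
```

Counting skipped records makes it easy to notice when the page structure has changed, which matters because scraped output can otherwise shrink silently.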

# Part 4
## Applying a tabular data structure

In [29]:
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])

In [30]:
df.head()

|   | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [31]:
df.tail()

|   | date | lie | explanation | url |
|---|---|---|---|---|
| 175 | Oct. 25, 2017 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | Oct. 27, 2017 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | Nov. 1, 2017 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | Nov. 7, 2017 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |
| 179 | Nov. 11, 2017 | I'd rather have him – you know, work with him... | There is no evidence that Democrats "set up" R... | https://www.nytimes.com/interactive/2017/12/10... |


In [32]:
df['date'] = pd.to_datetime(df['date'])

In [33]:
df.head()

|   | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [34]:
df.tail()

|   | date | lie | explanation | url |
|---|---|---|---|---|
| 175 | 2017-10-25 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | 2017-10-27 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | 2017-11-01 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | 2017-11-07 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |
| 179 | 2017-11-11 | I'd rather have him – you know, work with him... | There is no evidence that Democrats "set up" R... | https://www.nytimes.com/interactive/2017/12/10... |


## Exporting the dataset to a CSV file

In [35]:
df.to_csv('CSVFiles/trump_lies.csv', index=False, encoding='utf-8')

In [36]:
df = pd.read_csv('CSVFiles/trump_lies.csv', parse_dates=['date'], encoding='utf-8')
df

|   | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
| ... | ... | ... | ... | ... |
| 175 | 2017-10-25 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | 2017-10-27 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | 2017-11-01 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | 2017-11-07 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 4 columns):
date           180 non-null datetime64[ns]
lie            180 non-null object
explanation    180 non-null object
url            180 non-null object
dtypes: datetime64[ns](1), object(3)
memory usage: 5.8+ KB
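The export above writes into a `CSVFiles/` subfolder, which has to exist before the file can be written. The sketch below illustrates the same write-then-read round trip using only the standard library (`pathlib` and `csv` instead of pandas), inside a temporary directory so it leaves nothing behind. The single row is the first record shown earlier; the variable names are chosen for this example.

```python
import csv
import tempfile
from pathlib import Path

columns = ['date', 'lie', 'explanation', 'url']
sample_records = [
    ('Jan. 21, 2017',
     "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
     'He was for an invasion before he was against it.',
     'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
]

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder first; opening a file in a missing folder fails
    out_dir = Path(tmp) / 'CSVFiles'
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / 'trump_lies.csv'

    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(sample_records)

    # The date contains a comma, so the writer wraps that field in quotes
    first_data_line = csv_path.read_text(encoding='utf-8').splitlines()[1]
    print(first_data_line[:20])

    # Reading the file back recovers the original field values
    with open(csv_path, newline='', encoding='utf-8') as f:
        rows_read = list(csv.reader(f))

print(rows_read[1][0])
```

In the notebook itself, the equivalent preparation would be a single `Path('CSVFiles').mkdir(exist_ok=True)` before calling `df.to_csv(...)`.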


## Summary: 16 lines of Python code

In [38]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs = {'class': 'short-desc'})

records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))
    
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('CSVFiles/trump_lies.csv', index=False, encoding='utf-8')

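## Appendix 1: Things to keep in mind

+ Web scraping works best on pages that are stable and well structured
+ Web scraping is a "fragile" way to build a dataset: a change to the page's markup can break the code
+ If you can download the data directly from a site, or the site offers a data API, either is a better option than web scraping
+ If you download many pages from the same site, it is best to build some delay into your code
+ Before scraping, it is a good idea to check the site's `robots.txt` file, e.g. `www.nytimes.com/robots.txt`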
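The last two points, adding a delay and checking `robots.txt`, can be sketched with the standard library. Everything specific here is an assumption made for illustration: the rules in `robots_lines` are invented rather than taken from any real site, `polite_fetch` is a hypothetical helper, and `fake_fetch` stands in for a real download so that the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed from a list of lines so no request is made
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
    'Crawl-delay: 2',
]
rules = RobotFileParser()
rules.parse(robots_lines)


def polite_fetch(urls, fetch, rules, delay_seconds=1.0, user_agent='*'):
    """Call fetch(url) for each allowed URL, pausing between calls."""
    pages = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # skip anything the site asks crawlers to avoid
        if pages:
            time.sleep(delay_seconds)  # pause before each request after the first
        pages[url] = fetch(url)
    return pages


def fake_fetch(url):
    """Stand-in for a real download, used so the sketch runs offline."""
    return f'<html>{url}</html>'


urls = [
    'https://example.com/articles/1',
    'https://example.com/private/2',
    'https://example.com/articles/3',
]
pages = polite_fetch(urls, fake_fetch, rules, delay_seconds=0.1)
print(list(pages))
print(rules.crawl_delay('*'))
```

For a real site, `RobotFileParser` can load the live file with `set_url(...)` followed by `read()`, and `fetch` could be a small wrapper around `requests.get`.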

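## Appendix 2: Learning resources

+ Location of this case study:
+ [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/), which reads like a tutorial and is well worth reading

## Appendix 3: Alternative syntax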

In [39]:
# Search for a tag by name
first_result.find('strong')

# Shorthand: access the tag as an attribute
first_result.strong

<strong>Jan. 21 </strong>

In [40]:
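# Search for multiple tags by name and attribute
# results = soup.find_all('span', attrs = {'class': 'short-desc'})
# results
# Shorthand: if you do not name a method, calling the object defaults to `find_all`
# results = soup('span', attrs={'class':'short-desc'})
# results
# Even shorter: pass the attribute as a keyword argument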
results = soup('span', class_='short-desc')
results

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_
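To confirm that the spellings shown in this appendix are interchangeable, the following sketch compares their results on a small sample. The HTML is a hypothetical one-record fragment in the same shape as the page's markup, and the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample in the same shape as the page's markup
sample_html = (
    '<span class="short-desc"><strong>Jan. 1 </strong>“A claim.” '
    '<span class="short-truth"><a href="https://example.com/1">(A note.)</a></span></span>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Three spellings of the same search
long_form = sample_soup.find_all('span', attrs={'class': 'short-desc'})
call_form = sample_soup('span', attrs={'class': 'short-desc'})
keyword_form = sample_soup('span', class_='short-desc')
print(long_form == call_form == keyword_form)

# Attribute access is shorthand for find() on the tag name
first = long_form[0]
print(first.strong == first.find('strong'))

# A tag that is absent gives None with either spelling
print(first.em, first.find('em'))
```

The keyword is spelled `class_` with a trailing underscore because `class` is a reserved word in Python.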