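# <a name="Chapter1">Experiment Overview</a>

Douban Books Top 250 is a ranking of the 250 most popular books on Douban, determined by online voting. The page URL is: https://book.douban.com/top250

## <a name="Chapter2">Scraping the Data</a>

### ❖ Fetching the Page Content<a name="Chapter2.1"></a>

We start by writing Python code that requests the Douban Books Top 250 page and retrieves its HTML text.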

In [1]:
import requests

In [2]:
reponse = requests.get("https://book.douban.com/top250?start=0")    # URL

In [3]:
reponse

<Response [418]>

<font color="green">
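Note: requests is a widely used Python module for making HTTP requests. It makes fetching web pages straightforward and is a good HTTP library for learning to write Python crawlers. Its two key objects are Request (which wraps an HTTP request) and Response (which wraps the returned content).<br/>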
</font>

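The 418 status code above means the server refused the request, most likely because it did not look like it came from a browser. To pose as a browser, add a User-Agent header to the request. Headers are passed as a dictionary of the form {key1: value1, key2: value2}.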

In [4]:
my_headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"}
reponse = requests.get("https://book.douban.com/top250?start=0", headers = my_headers) 

In [None]:
reponse.status_code, reponse.text

In [None]:
print(reponse.text)
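
The two steps above (send the request with a browser-like User-Agent, then inspect the status code) can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `fetch_html`, `FakeResponse`, and `fake_get` are hypothetical names, and the `get` parameter is assumed to be a callable shaped like `requests.get`, so the example runs without network access.

```python
def fetch_html(url, get, headers=None):
    """Fetch `url` with `get` (e.g. requests.get) and return the page text.

    Raises RuntimeError when the status code is not 200, for example the
    418 seen earlier when no User-Agent header was sent.
    """
    response = get(url, headers=headers)
    if response.status_code != 200:
        raise RuntimeError(
            "request to {} failed with status {}".format(url, response.status_code)
        )
    return response.text


class FakeResponse:
    """Hypothetical stand-in for requests.Response, used only in this offline demo."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text


def fake_get(url, headers=None):
    """Hypothetical stand-in for requests.get: rejects requests without a User-Agent."""
    if headers and "User-Agent" in headers:
        return FakeResponse(200, "<html>ok</html>")
    return FakeResponse(418, "")


html = fetch_html(
    "https://book.douban.com/top250?start=0",
    fake_get,
    headers={"User-Agent": "Mozilla/5.0"},
)
print(html)
```

In the notebook itself, the corresponding call would be `fetch_html(url, requests.get, headers=my_headers)`.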

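### ❖ Analyzing the HTML Structure<a name="Chapter2.2"></a>

Running the code gives us the page's HTML.<br/>
But once we have the HTML, how do we extract the data we want?<br/>
Open the page in a browser, press F12 to open the developer tools, and locate the data in the page source, as shown in the screenshot below: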

<div align="center"><img src="./images/DoubanCrawler.2-1.png" width="1000"/></div>

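The book title is inside the a tag within the div with **class="pl2"**; it is the a tag's **title** attribute.

**Data path:** $HTML => div(class="pl2") => a.title\ (book\ title)$

With the target located, the rest is straightforward. We use BeautifulSoup to parse the HTML text into an object that can be searched by tag and attribute. Its prettify() method returns the HTML with standard indentation.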

#### <span style="color: darkorange">Note:</span> 
BeautifulSoup must be installed first. The installation command is:<br/>
!pip install bs4 --upgrade -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com<br/>
The installation has succeeded when the final message reads <font color="red">"Successfully installed bs4-x.x.x"</font>.

In [7]:
# The BeautifulSoup library
# !pip install bs4 -i https://pypi.doubanio.com/simple

In [8]:
from bs4 import BeautifulSoup

In [9]:
BooksPageHTML = BeautifulSoup(reponse.text, "html.parser")

In [10]:
print(BooksPageHTML)


<!DOCTYPE html>

<html class="ua-windows ua-webkit book-new-nav" lang="zh-cmn-Hans">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>豆瓣读书 Top 250</title>
<script>!function(e){var o=function(o,n,t){var c,i,r=new Date;n=n||30,t=t||"/",r.setTime(r.getTime()+24*n*60*60*1e3),c="; expires="+r.toGMTString();for(i in o)e.cookie=i+"="+o[i]+c+"; path="+t},n=function(o){var n,t,c,i=o+"=",r=e.cookie.split(";");for(t=0,c=r.length;t<c;t++)if(n=r[t].replace(/^\s+|\s+$/g,""),0==n.indexOf(i))return n.substring(i.length,n.length).replace(/\"/g,"");return null},t=e.write,c={"douban.com":1,"douban.fm":1,"google.com":1,"google.cn":1,"googleapis.com":1,"gmaptiles.co.kr":1,"gstatic.com":1,"gstatic.cn":1,"google-analytics.com":1,"googleadservices.com":1},i=function(e,o){var n=new Image;n.onload=function(){},n.src="https://www.douban.com/j/except_report?kind=ra022&reason="+encodeURIComponent(e)+"&environment="+encodeURIComponent(o)},r=function(o){try{t.call(e,o)}catch(e){t(o

In [None]:
type(BooksPageHTML)

In [11]:
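# One div per book (25 per page), returned in a list-like ResultSet.
# class is a reserved word in Python, so the keyword argument is class_.
AllBooksDivs = BooksPageHTML.find_all("div", class_="pl2")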
type(AllBooksDivs), AllBooksDivs

(bs4.element.ResultSet,
 [<div class="pl2">
  <a href="https://book.douban.com/subject/1007305/" onclick="&quot;moreurl(this,{i:'0'})&quot;" title="红楼梦">
                  红楼梦
  
                  
                </a>
  
  
  
                    <img alt="可试读" src="/pics/read.gif" title="可试读"/>
  </div>,
  <div class="pl2">
  <a href="https://book.douban.com/subject/4913064/" onclick="&quot;moreurl(this,{i:'1'})&quot;" title="活着">
                  活着
  
                  
                </a>
  
  
  
                    <img alt="可试读" src="/pics/read.gif" title="可试读"/>
  </div>,
  <div class="pl2">
  <a href="https://book.douban.com/subject/4820710/" onclick="&quot;moreurl(this,{i:'2'})&quot;" title="1984">
                  1984
  
                  
                </a>
  <br/>
  <span style="font-size:12px;">Nineteen Eighty-Four</span>
  </div>,
  <div class="pl2">
  <a href="https://book.douban.com/subject/6082808/" onclick="&quot;moreurl(this,{i:'3'})&quot;" title="百年孤独">
    

In [12]:
list(AllBooksDivs)[2]

<div class="pl2">
<a href="https://book.douban.com/subject/4820710/" onclick="&quot;moreurl(this,{i:'2'})&quot;" title="1984">
                1984

                
              </a>
<br/>
<span style="font-size:12px;">Nineteen Eighty-Four</span>
</div>

In [None]:
AllBooksDivs[2].find("a")

In [None]:
AllBooksDivs[24].find("a")["title"]

In [None]:
BookNames = []
for EachBook in AllBooksDivs:
#     print(EachBook.find("a")["title"])
    BookNames.append(EachBook.find("a")["title"])

type(BookNames),BookNames

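### ❖ Extracting Data from Specific HTML Tags<a name="Chapter2.3"></a>
#### ➣ Getting the Book Titles<a name="Chapter2.3.1"></a>

Here we use BeautifulSoup's find_all() method. The page lists many books, and each book's title sits inside a div tag with class="pl2", so a single find_all() call collects the titles of every book on the page.<br/>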
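
To make the data path concrete without installing anything, the sketch below follows the same path (div with class="pl2", then the a tag's title attribute) using the standard library's `html.parser` instead of BeautifulSoup. It is an illustration only: `TitleCollector` is a hypothetical name, the HTML snippet is hand-written, and the parser assumes the div tags are not nested.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> found inside <div class="pl2">."""

    def __init__(self):
        super().__init__()
        self.in_pl2 = False   # True while we are inside a div with class="pl2"
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "pl2":
            self.in_pl2 = True
        elif tag == "a" and self.in_pl2 and "title" in attrs:
            self.titles.append(attrs["title"])

    def handle_endtag(self, tag):
        # Assumption: div tags are not nested, so any </div> closes the pl2 block.
        if tag == "div":
            self.in_pl2 = False


# Hand-written sample that mimics the structure shown in the notebook output.
sample_html = """
<div class="pl2"><a href="https://book.douban.com/subject/1007305/" title="红楼梦">红楼梦</a></div>
<div class="pl2"><a href="https://book.douban.com/subject/4913064/" title="活着">活着</a></div>
<div class="other"><a href="#" title="not a book">x</a></div>
"""

collector = TitleCollector()
collector.feed(sample_html)
print(collector.titles)
```

BeautifulSoup's find_all() does this matching for us and copes with nesting, which is why the notebook uses it.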

### ❖ Saving the Data<a name="Chapter2.4"></a>
With the data collected, the next step is to save it. This is simple with Python's file I/O and the csv module. The code is as follows:

In [None]:
import csv                # comma-separated values: a text format with fields separated by commas

In [None]:
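DB_FileName = "豆瓣图书.csv"
# mode "wt" = write + text, encoded as UTF-8; bookfile is the file handle.
# newline="" lets the csv module control line endings and avoids blank rows on Windows.
with open(DB_FileName, mode="wt", encoding="utf-8", newline="") as bookfile:
    bookwriter = csv.writer(bookfile)                                         # write the data in CSV format
    bookwriter.writerow(["书名",])
    for BookName in BookNames:
        bookwriter.writerow([BookName])
print("{} saved successfully".format(DB_FileName))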
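
A quick way to check what csv.writer produces is to write to an in-memory buffer and read the rows back. The sketch below assumes nothing beyond the standard library; the third title is a made-up edge case that shows the writer quoting a field containing a comma.

```python
import csv
import io

titles = ["红楼梦", "活着", "Title, with a comma"]   # last entry is an invented edge case

# newline="" mirrors the recommended way of opening a real file for csv.writer.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["书名"])            # header row
for title in titles:
    writer.writerow([title])         # one single-column row per book

# Read the same text back to confirm the round trip preserves each title.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)
```

Because the field with a comma is quoted on output, it comes back as one value rather than two.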

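### ➣ Getting the Country, Author, Publisher, Publication Date, and Price<a name="Chapter2.3.2"></a>
With the titles in hand, we next extract the country, author, publisher, publication date, and price from another location, using the same approach as for the titles:

**Data path:** $HTML => p\ (class="pl") => text\ (\ [country]\ author\ /\ translator\ /\ publisher\ /\ publication\ date\ /\ price\ )$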
<br/><br/>


In [None]:
AllBookInfos = BooksPageHTML.find_all("p", class_="pl")
AllBookInfos[0]

In [None]:
AllBookInfos[4].text

In [None]:
AllBookInfos[0].text.split("/")

In [None]:
for BookInfo in AllBookInfos:
    BookInfoDetail = []
    for element in BookInfo.text.split("/"):
        BookInfoDetail.append(element.strip())
    print(BookInfoDetail)

In [None]:
import re          # regular expressions, used for searching and matching text
nations = []
authors = []
presses = []
pub_dates = []
prices = []
for BookInfo in AllBookInfos:
    BookInfoDetail = []
    for element in BookInfo.text.split("/"):
        BookInfoDetail.append(element.strip())

    # BookInfoDetail now holds the fields for one book.
    # A leading "[..]" or "（..）" marks the author's country; without one, assume "中国" (China).
    if BookInfoDetail[0].startswith("["):
        pos = BookInfoDetail[0].index("]")
        nations.append(BookInfoDetail[0][1:pos])
        authors.append(BookInfoDetail[0][pos+1:].strip())
    elif BookInfoDetail[0].startswith("（"):
        pos = BookInfoDetail[0].index("）")
        nations.append(BookInfoDetail[0][1:pos])
        authors.append(BookInfoDetail[0][pos+1:].strip())
    else:
        nations.append("中国")
        authors.append(BookInfoDetail[0])

    # Publisher and publication date: if the third-from-last field contains a number,
    # treat it as the date and the field before it as the publisher.
    if re.findall(r"\d+(?:\.\d+)?", BookInfoDetail[-3]) != []:     # pattern: digits with an optional decimal part
        presses.append(BookInfoDetail[-4])
        pub_dates.append(BookInfoDetail[-3])
    else:
        presses.append(BookInfoDetail[-3])
        pub_dates.append(BookInfoDetail[-2])
    # Price: the first number in the last field
    prices.append(re.findall(r"\d+(?:\.\d+)?", BookInfoDetail[-1])[0])

print(nations, authors, presses, pub_dates, prices)
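
The per-book logic in the cell above can also be written as a function that parses one info string, which makes it easy to try on individual examples. This is a sketch under assumptions: `parse_book_info` is a hypothetical helper, the two sample strings are illustrative rather than scraped output, and returning None for strings with fewer than three fields is a choice made here, not something the notebook does.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")   # an integer or decimal number


def parse_book_info(text, default_nation="中国"):
    """Split one 'author / publisher / date / price' string into named fields."""
    fields = [part.strip() for part in text.split("/")]
    if len(fields) < 3:
        return None   # assumption: too few fields to locate publisher and date

    # Country prefix: "[..]" or full-width "（..）" before the author name.
    first = fields[0]
    nation, author = default_nation, first
    for open_mark, close_mark in (("[", "]"), ("（", "）")):
        if first.startswith(open_mark) and close_mark in first:
            pos = first.index(close_mark)
            nation = first[1:pos]
            author = first[pos + 1:].strip()
            break

    # Same rule as the notebook: a number in the third-from-last field means
    # that field is the date and the one before it is the publisher.
    if len(fields) >= 4 and NUMBER.search(fields[-3]):
        press, pub_date = fields[-4], fields[-3]
    else:
        press, pub_date = fields[-3], fields[-2]

    price_match = NUMBER.search(fields[-1])
    price = price_match.group() if price_match else None
    return {"nation": nation, "author": author, "press": press,
            "pub_date": pub_date, "price": price}


# Illustrative inputs in the format described by the data path.
foreign = parse_book_info("[清] 曹雪芹 著 / 人民文学出版社 / 1996-12 / 59.70元")
domestic = parse_book_info("余华 / 作家出版社 / 2012-8-1 / 20.00元")
print(foreign)
print(domestic)
```

Returning a dictionary per book keeps the fields of one book together, instead of spreading them over five parallel lists.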

### ❖ Saving the Data<a name="Chapter2.4"></a>
With the information extracted, the next step is to save the data.

In [None]:
DB_FileName = "豆瓣图书.csv"
# mode "wt" = write + text, encoded as UTF-8; bookfile is the file handle.
with open(DB_FileName, mode="wt", encoding="utf-8", newline="") as bookfile:
    bookwriter = csv.writer(bookfile)                                         # write the data in CSV format
    bookwriter.writerow(["书名","国家", "作者", "出版社", "出版日期", "价格"])
    for BookName,nation,author,press,pub_date,price in zip(BookNames,nations,authors,presses,pub_dates,prices):
        bookwriter.writerow([BookName,nation,author,press,pub_date,price])
print("{} saved successfully".format(DB_FileName))

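### ➣ Combining the Fields<a name="Chapter2.3.4"></a>
Putting the separate lists together row by row gives us the data for one page.
For this we use the zip() function. zip() takes one or more sequences as arguments and pairs up the elements at the same position, yielding one tuple per position. In Python 3 it returns an iterator of tuples rather than a list, and it stops at the shortest input.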
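
A short illustration of zip() on small lists. The values are made up for the example, and the prices list is deliberately one element short to show what happens when the inputs differ in length.

```python
book_names = ["红楼梦", "活着", "1984"]
nations = ["清", "中国", "英"]
prices = ["59.70", "20.00"]          # deliberately one element short

# zip() yields tuples lazily; list() collects them so we can look at them.
pairs = list(zip(book_names, nations))
triples = list(zip(book_names, nations, prices))

print(pairs)
print(len(triples))   # only as many rows as the shortest input
```

Here `pairs` has three tuples while `triples` has only two, because the third book has no price to pair with and is dropped without any error.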

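## ❖ Fetching Data from Multiple Pages<a name="Chapter2.5"></a>
However, we want all 250 entries, not just the 25 on one page. How do we get the rest?

Inspecting the pagination shows that there are 10 pages in total.<br/>
The URL of the first page is https://book.douban.com/top250?start=0<br/>
and the URL of the last page is https://book.douban.com/top250?start=225<br/>

Looking at a few more pages:<br/>
the second page is https://book.douban.com/top250?start=25<br/>
the third page is https://book.douban.com/top250?start=50

The pattern is clear: the page is selected by the number after <font color="red">start=</font> at the end of the URL. The number runs from <font color="red">0</font> to <font color="red">225</font>, increasing by <font color="red">25</font> per page.<br/>
So we take <font color="red">https://book.douban.com/top250?start=</font> as the base URL and append each page's offset to obtain the URLs of all pages.<br/>
We then iterate over the URLs with a for loop and apply the extraction method above to each one to collect all the data.

The code that builds the list of page URLs is shown below:<br/>
<font color="forestgreen">Note: we store the URLs in a list so that we can iterate over them.</font>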

In [None]:
for i in range(10):
    print(i)

In [None]:
import numpy as np
[np.sin(i)      for i in range(10)]        # list comprehension

In [None]:
urllist = [  "https://book.douban.com/top250?start={}".format(number)    for number in range(0, 250, 25)   ]
print(urllist)

In [None]:
import requests
from bs4 import BeautifulSoup
import re          # regular expressions, used for searching and matching text

my_headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"}
BookNames = []
nations = []
authors = []
presses = []
pub_dates = []
prices = []

for EachUrl in urllist:
    # Fetch the page
    reponse = requests.get(EachUrl, headers = my_headers) 
    # Parse it with BeautifulSoup
    BooksPageHTML = BeautifulSoup(reponse.text, "html.parser")

    # Book titles
    AllBooksDivs = BooksPageHTML.find_all("div", class_="pl2")
    for EachBook in AllBooksDivs:
        BookNames.append(EachBook.find("a")["title"])

    # Remaining information for each book
    AllBookInfos = BooksPageHTML.find_all("p", class_="pl")
    for BookInfo in AllBookInfos:
        BookInfoDetail = []
        for element in BookInfo.text.split("/"):
            BookInfoDetail.append(element.strip())

        # BookInfoDetail now holds the fields for one book.
        # A leading "[..]" or "（..）" marks the author's country; without one, assume "中国" (China).
        if BookInfoDetail[0].startswith("["):
            pos = BookInfoDetail[0].index("]")
            nations.append(BookInfoDetail[0][1:pos])
            authors.append(BookInfoDetail[0][pos+1:].strip())
        elif BookInfoDetail[0].startswith("（"):
            pos = BookInfoDetail[0].index("）")
            nations.append(BookInfoDetail[0][1:pos])
            authors.append(BookInfoDetail[0][pos+1:].strip())
        else:
            nations.append("中国")
            authors.append(BookInfoDetail[0])

        # Publisher and publication date: if the third-from-last field contains a number,
        # treat it as the date and the field before it as the publisher.
        if re.findall(r"\d+(?:\.\d+)?", BookInfoDetail[-3]) != []:     # pattern: digits with an optional decimal part
            presses.append(BookInfoDetail[-4])
            pub_dates.append(BookInfoDetail[-3])
        else:
            presses.append(BookInfoDetail[-3])
            pub_dates.append(BookInfoDetail[-2])
        # Price: the first number in the last field
        prices.append(re.findall(r"\d+(?:\.\d+)?", BookInfoDetail[-1])[0])

print(BookNames, nations, authors, presses, pub_dates, prices) 
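
The loop above fills the title list and the detail lists from two separate find_all() calls, and zip() silently drops rows when its inputs differ in length. A length check before saving can therefore catch a misalignment early. The helper below is a hypothetical sketch (`check_aligned` is not part of the notebook), shown on small made-up lists.

```python
def check_aligned(**columns):
    """Return the common length of all columns, or raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: {}".format(lengths))
    return next(iter(lengths.values()), 0)


# Aligned columns: the check passes and reports the row count.
row_count = check_aligned(BookNames=["红楼梦", "活着"], prices=["59.70", "20.00"])
print(row_count)

# Misaligned columns: the check raises instead of letting zip() drop a row.
try:
    check_aligned(BookNames=["红楼梦", "活着"], prices=["59.70"])
except ValueError as error:
    problem = str(error)
    print(problem)
```

In the notebook, the call would take all six lists, for example `check_aligned(BookNames=BookNames, nations=nations, authors=authors, presses=presses, pub_dates=pub_dates, prices=prices)`.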

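### ➣ Adding Ratings and Number of Reviewers<a name="Chapter2.3.3"></a>
The book ratings can be obtained in the same way. Only the tags that hold the data differ:

**Data path:** $HTML => span\ (class="rating\_nums") => text\ (rating)$ <br/>
**Data path:** $HTML => span\ (class="pl") => text\ (number\ of\ reviewers)$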
<br/><br/>
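
The notebook does not include code for this step. Once the two span texts have been extracted along the data paths above, they still need to be converted to numbers. The sketch below covers only that conversion. It is an assumption-laden illustration: `parse_rating` and `parse_review_count` are hypothetical helpers, and the sample strings (including the "(...人评价)" shape with surrounding whitespace) are assumed rather than taken from scraped output.

```python
import re


def parse_rating(text):
    """Convert the text of the rating span, e.g. ' 9.6 ', to a float."""
    return float(text.strip())


def parse_review_count(text):
    """Pull the first run of digits out of the reviewer-count text, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Assumed sample texts; the real page may format them differently.
rating = parse_rating(" 9.6 ")
count = parse_review_count("(\n    425623人评价\n  )")
print(rating, count)
```

Storing these as numbers rather than strings lets later steps such as `BookData.describe()` include them among the numeric columns.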

### Saving the File

In [None]:
DB_FileName = "豆瓣图书.csv"
# mode "wt" = write + text, encoded as UTF-8; bookfile is the file handle.
with open(DB_FileName, mode="wt", encoding="utf-8", newline="") as bookfile:
    bookwriter = csv.writer(bookfile)                                         # write the data in CSV format
    bookwriter.writerow(["书名","国家", "作者", "出版社", "出版日期", "价格"])
    for BookName,nation,author,press,pub_date,price in zip(BookNames,nations,authors,presses,pub_dates,prices):
        bookwriter.writerow([BookName,nation,author,press,pub_date,price])
print("{} saved successfully".format(DB_FileName))

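# <a name="Chapter4">Data Presentation</a>
## ❖ Loading Data from the Saved File<a name="Chapter4.1"></a>

With the scraped data saved, we now do some simple analysis and presentation.<br/>
Step one: read back the data we just scraped and display it.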

In [None]:
!pip install pandas -i https://pypi.doubanio.com/simple

In [None]:
!pip list

In [None]:
import pandas as pd
import numpy as np

In [None]:
DB_FileName = "豆瓣图书.csv"
BookData = pd.read_csv(DB_FileName, encoding="utf-8")
BookData

In [None]:
BookData.head(15)

In [None]:
BookData.describe()       # summary statistics for the numeric columns

In [None]:
BookData.info()

In [None]:
BookData.head()

In [None]:
BookData[["书名", "国家", "作者"]].tail(10)

In [None]:
BookData["国家"] == "中国"
BookData["作者"].str.len()>3
(BookData["国家"] == "中国") & (BookData["作者"].str.len()>3)


In [None]:
BookData.loc[   (BookData["国家"] == "中国") & (BookData["作者"].str.len()>3)          ]

# Rows to fix by hand: 英国 (UK) 6, 132, 192; 日本 (Japan) 78, 160, 234; 德国 (Germany) 92; 捷克 (Czech Republic) 141; 美国 (USA) 199; 不丹 (Bhutan) 215

In [None]:
# Rows to fix by hand: 英国 (UK) 6, 132, 192; 日本 (Japan) 78, 160, 234; 德国 (Germany) 92; 捷克 (Czech Republic) 141; 美国 (USA) 199; 不丹 (Bhutan) 215
BookData.loc[[78,160, 234] ]

In [None]:
BookData.loc[[6,132,192]    , "国家"  ] = "英国"
BookData.loc[[78,160, 234]    , "国家"  ] = "日本"
BookData.loc[[92]    , "国家"  ] = "德国"
BookData.loc[[141]    , "国家"  ] = "捷克"
BookData.loc[[199]    , "国家"  ] = "美国"
BookData.loc[[215]    , "国家"  ] = "不丹"


In [None]:
BookData.loc[   (BookData["国家"] == "中国") & (BookData["作者"].str.len()>3)          ]

In [None]:
BookData.loc[   (BookData["国家"] == "中国") & (BookData["作者"].str.len()==3)          ]

# Rows to fix by hand: 法国 (France) 34; 丹麦 (Denmark) 126

In [None]:
BookData.loc[[34]    , "国家"  ] = "法国"
BookData.loc[[126]    , "国家"  ] = "丹麦"

In [None]:
BookData.loc[   (BookData["国家"] == "中国") & (BookData["作者"].str.len()==2)          ]

In [None]:
BookData.loc[[193]    , "国家"  ] = "阿拉伯"

In [None]:
sorted(BookData["国家"].unique())

In [None]:
BookData.loc[     (BookData["国家"]=="清") | (BookData["国家"]=="明")    , "国家"  ] = "中国"
BookData.loc[     BookData["国家"]=="俄"    , "国家"  ] = "俄罗斯"
BookData.loc[     BookData["国家"]=="印"    , "国家"  ] = "印度"
BookData.loc[     BookData["国家"]=="奥"    , "国家"  ] = "奥地利"
BookData.loc[     BookData["国家"]=="德"    , "国家"  ] = "德国"
BookData.loc[     BookData["国家"]=="意"    , "国家"  ] = "意大利"
BookData.loc[     BookData["国家"]=="挪"    , "国家"  ] = "挪威"
BookData.loc[     BookData["国家"]=="日"    , "国家"  ] = "日本"
BookData.loc[     BookData["国家"]=="法"    , "国家"  ] = "法国"
BookData.loc[     BookData["国家"]=="澳"    , "国家"  ] = "澳大利亚"
BookData.loc[     BookData["国家"]=="白俄"    , "国家"  ] = "白俄罗斯"
BookData.loc[     BookData["国家"]=="苏"    , "国家"  ] = "苏联"
BookData.loc[     BookData["国家"]=="英"    , "国家"  ] = "英国"
BookData.loc[     BookData["国家"]=="葡"    , "国家"  ] = "葡萄牙"
BookData.loc[     BookData["国家"]=="美"    , "国家"  ] = "美国"

In [None]:
sorted(BookData["国家"].unique())
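
The abbreviation fixes above all follow one pattern, so they can also be expressed as a single lookup table. The sketch below uses plain Python and a made-up sample list so that it runs without pandas. `ABBREVIATIONS` and `normalize_nation` are hypothetical names, and the mapping simply restates the assignments from the cell above.

```python
# Abbreviation -> full country name, taken from the assignments in the notebook.
ABBREVIATIONS = {
    "清": "中国", "明": "中国",
    "俄": "俄罗斯", "印": "印度", "奥": "奥地利", "德": "德国",
    "意": "意大利", "挪": "挪威", "日": "日本", "法": "法国",
    "澳": "澳大利亚", "白俄": "白俄罗斯", "苏": "苏联",
    "英": "英国", "葡": "葡萄牙", "美": "美国",
}


def normalize_nation(value):
    """Return the full country name, leaving unknown or already-full values unchanged."""
    return ABBREVIATIONS.get(value, value)


sample = ["清", "美", "中国", "白俄"]          # made-up sample values
normalized = [normalize_nation(value) for value in sample]
print(normalized)
```

With pandas, passing the same dictionary to `BookData["国家"].replace(ABBREVIATIONS)` and assigning the result back to the column should have the same effect as the fifteen separate `.loc` assignments.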