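# Python Web Scraping Guide

## Fundamentals and Setup

### Concepts and Background

#### Hypertext Transfer Protocol (HTTP)

The Hypertext Transfer Protocol (HTTP) is an application-layer protocol for transmitting hypermedia documents such as HTML. It was designed for communication between web browsers and web servers, but it can also be used for other purposes.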

<br>

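**Client-server model (browser/server, B/S)**

HTTP follows the classic client-server model: the client (typically a browser) opens a connection, sends a request, and then waits until it receives the server's response. HTTP is a stateless protocol, which means the server does not keep any data (state) between two requests.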
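The request/response cycle can be sketched with the standard library alone. The snippet below is a minimal illustration rather than part of the original guide: `EchoHandler` and `fetch_once` are hypothetical names, and the demo server listens only on the loopback address for the duration of a single request.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Demo server side: answer every GET with a short plain-text body."""

    def do_GET(self):
        body = f"you asked for {self.path}".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_once(path):
    """Start the demo server, send one GET request, return (status, type, body)."""
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: the OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: open a connection, send the request, wait for the response.
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", path, headers={"User-Agent": "demo-client"})
        response = conn.getresponse()
        result = (
            response.status,
            response.getheader("Content-Type"),
            response.read().decode("utf-8"),
        )
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


print(fetch_once("/hello"))
```

Each call to `fetch_once` is independent: the server keeps nothing from one request to the next, which is what "stateless" means here.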

<br>

**HTTP headers**

HTTP message headers describe the resource being transferred, or the behavior of the client and the server.

<br>

**HTTP request methods**

The GET and POST methods cover the most common operations; other methods, such as OPTIONS, DELETE, and TRACE, are also available.

<br>

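**HTTP status codes**

An HTTP response status code indicates whether a specific HTTP request completed successfully. For example, 200 means the request succeeded.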
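As a small illustration, Python's standard library already knows the standard status codes. The helper below is a sketch (the name `describe_status` is hypothetical) that prints the reason phrase and the class of a code, where the class is determined by the first digit.

```python
from http import HTTPStatus

# The first digit of a status code identifies its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return 'code phrase (class)' for a standard HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the module does not know
    return f"{status.value} {status.phrase} ({STATUS_CLASSES[code // 100]})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```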

<br>

To learn more about HTTP, see the [MDN HTTP documentation](https://developer.mozilla.org/zh-CN/docs/Web/HTTP).

### HTML, CSS, and JavaScript

#### HTML

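HTML (HyperText Markup Language) is the most basic building block of the Web. It defines the meaning and structure of web content. Technologies other than HTML are generally used to describe a page's appearance and presentation (CSS) or its functionality and behavior (JavaScript).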

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>测试页面</title>
  </head>
  <body>
    <p>Hello World</p>
  </body>
</html>
```
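To make the nesting of such a page visible, the following sketch walks the markup with the standard library's `html.parser` and prints an indented outline. The names `OutlineParser`, `outline`, and `VOID_TAGS` are hypothetical, and the list of void elements is deliberately short.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OutlineParser(HTMLParser):
    """Collect one indented line per start tag and per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


def outline(html_text):
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.lines


PAGE = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Test page</title>
  </head>
  <body>
    <p>Hello World</p>
  </body>
</html>"""

print("\n".join(outline(PAGE)))
```

The outline shows `head` and `body` as children of `html`, with the text nodes at the deepest level; scraping libraries expose the same tree in a more convenient form.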

To try writing a web page yourself, use the [W3Schools editor](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_intro) or the [w3cschool editor](https://www.w3cschool.cn/tryrun/showhtml/tryhtml_headers).

To learn more about HTML, see the [MDN HTML documentation](https://developer.mozilla.org/zh-CN/docs/Web/HTML).

#### CSS

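Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML (including XML dialects such as SVG, MathML, or XHTML). CSS describes how elements should be rendered on screen, on paper, in speech, or on other media.

For example: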

```html
<h1 style="color:blue;">This is a Blue Heading</h1>
```

To learn more about CSS, see the [MDN CSS documentation](https://developer.mozilla.org/zh-CN/docs/Web/CSS).

#### JavaScript

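JavaScript is a scripting language, and a full programming language, for implementing complex features on web pages. With it, a page can offer more than static information: live content updates, interactive maps, 2D/3D animations, scrolling video players, and so on.

To learn more about JavaScript, see the [MDN JavaScript guide](https://developer.mozilla.org/zh-CN/docs/learn/JavaScript).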

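### What Is Web Scraping

Web scraping is a way of collecting data from the web through various techniques.

**Common approach**

Write an automated program that requests data from a web server (usually HTML pages or other web files), then parse the response and extract the information you need.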

### Python Libraries

#### [Requests](https://requests.kennethreitz.org/en/master/)

[Requests](https://requests.kennethreitz.org/en/master/) is an elegant and simple HTTP library.

Install it with the following command:

```shell
pip install requests
```

#### [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html)

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html) is a Python library for extracting data from HTML and XML files.

Install it with the following command:

```shell
pip install beautifulsoup4
```

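Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, of which lxml is the most widely used. Install the lxml parser with the following command:

```shell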
pip install lxml
```

## First Steps in Web Scraping

### Fetching Your First Page

- Use the `get()` function of the requests library

In [1]:
import requests

r = requests.get('http://www.baidu.com')

Check the status code to see whether the request succeeded; a value of 200 means success.

In [2]:
r.status_code

200

Inspect the fetched content:

In [3]:
r.text

'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span cl

**Encoding issues**

If a Chinese page comes back garbled, check which character encoding was used and set the correct one.

In [4]:
r.encoding = "UTF-8"
r.text

'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su
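The garbled title in the first output is consistent with UTF-8 bytes having been decoded as ISO-8859-1, which is the encoding Requests falls back to when a text response does not declare a charset. The sketch below reproduces the effect offline with plain strings (no network access; the variable names are illustrative only).

```python
title = "百度一下，你就知道"
raw_bytes = title.encode("utf-8")  # the bytes the server actually sends

# Wrong guess: ISO-8859-1 maps every single byte to one character.
garbled = raw_bytes.decode("iso-8859-1")

# Undo the wrong decoding, then decode with the correct encoding.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled[:3]))  # the three bytes of the first character, shown one by one
print(repaired)
```

Setting `r.encoding` before reading `r.text` achieves the same repair without the round trip. When the correct encoding is unknown, `r.apparent_encoding` offers a guess based on the response body.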

### Parsing a Page

Use the Beautiful Soup library to parse the HTML.

> Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object.

In [6]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'lxml')

In [6]:
print(soup.title)
print(soup.title.string)
print(soup.title.get_text())
print(soup.title.contents)

<title>百度一下，你就知道</title>
百度一下，你就知道
百度一下，你就知道
['百度一下，你就知道']


#### Locating Page Elements

The most commonly used methods are **find()** and **find_all()**.

In [7]:
for item in soup.find_all('a'):
    print(item)

<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="http://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a>
<a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a>
<a href="http://home.baidu.com">关于百度</a>
<a href="http://ir.baidu.com">About Baidu</a>
<a href="http://www.baidu.com/duty/">使用百度前必读</a>
<a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a>


In [8]:
# to extract all links
for item in soup.find_all("a"):
    print(item.attrs["href"])

http://news.baidu.com
http://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1
//www.baidu.com/more/
http://home.baidu.com
http://ir.baidu.com
http://www.baidu.com/duty/
http://jianyi.baidu.com/
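One of the extracted links (`//www.baidu.com/more/`) is protocol-relative, so it cannot be requested as-is. A minimal sketch of normalizing links with the standard library's `urljoin` follows; the helper name `absolutize` and the `/duty/` sample are illustrative assumptions.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve relative and protocol-relative hrefs against the page URL."""
    return [urljoin(base_url, href) for href in hrefs]


links = absolutize(
    "http://www.baidu.com",
    ["//www.baidu.com/more/", "http://home.baidu.com", "/duty/"],
)
for link in links:
    print(link)
```

Absolute URLs pass through unchanged, while relative ones inherit the scheme and host of the page they were found on.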


In [9]:
soup.find('a', {"name": "tj_trnews"})

<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>

**CSS selectors and element extraction**

Use `.select()`, which accepts CSS selectors.

For a CSS selector reference, see [w3school](https://www.w3school.com.cn/cssref/css_selectors.asp).

In [10]:
soup.select("[name=tj_trnews]")

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>]

In [None]:
# to extract a link
soup.select('[name="tj_trnews"]')[0].attrs["href"]

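## Case Study 1: Amap (高德地图)

This example scrapes a city's traffic health index; see the [Shanghai report page](https://report.amap.com/detail.do?city=310000). Note that the request below uses `cityCode=110000`, which is Beijing; use `310000` to query Shanghai.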

In [12]:
import json

headers={"User-Agent" : "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
  "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" : "en-us",
  "Connection" : "keep-alive",
  "Accept-Charset" : "GB2312,utf-8;q=0.7,*;q=0.7"}

r = requests.get('https://report.amap.com/ajax/cityHourly.do?cityCode=110000&dataType=1', headers = headers)

r.json()

[[1576998000000, 1.54],
 [1577001600000, 1.56],
 [1577005200000, 1.74],
 [1577008800000, 1.61],
 [1577012400000, 1.34],
 [1577016000000, 1.25],
 [1577019600000, 1.21],
 [1577023200000, 1.15],
 [1577026800000, 1.09],
 [1577030400000, 1.08],
 [1577034000000, 1.07],
 [1577037600000, 1.06],
 [1577041200000, 1.05],
 [1577044800000, 1.05],
 [1577048400000, 1.05],
 [1577052000000, 1.22],
 [1577055600000, 1.87],
 [1577059200000, 1.99],
 [1577062800000, 1.66],
 [1577066400000, 1.53],
 [1577070000000, 1.41],
 [1577073600000, 1.3],
 [1577077200000, 1.36]]

In [13]:
import time

data = r.json()
for item in data:
    timearray = time.localtime(int(item[0]/1000))
    otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timearray)
    print(otherStyleTime, ": ", item[1])

2019-12-22 15:00:00 :  1.54
2019-12-22 16:00:00 :  1.56
2019-12-22 17:00:00 :  1.74
2019-12-22 18:00:00 :  1.61
2019-12-22 19:00:00 :  1.34
2019-12-22 20:00:00 :  1.25
2019-12-22 21:00:00 :  1.21
2019-12-22 22:00:00 :  1.15
2019-12-22 23:00:00 :  1.09
2019-12-23 00:00:00 :  1.08
2019-12-23 01:00:00 :  1.07
2019-12-23 02:00:00 :  1.06
2019-12-23 03:00:00 :  1.05
2019-12-23 04:00:00 :  1.05
2019-12-23 05:00:00 :  1.05
2019-12-23 06:00:00 :  1.22
2019-12-23 07:00:00 :  1.87
2019-12-23 08:00:00 :  1.99
2019-12-23 09:00:00 :  1.66
2019-12-23 10:00:00 :  1.53
2019-12-23 11:00:00 :  1.41
2019-12-23 12:00:00 :  1.3
2019-12-23 13:00:00 :  1.36
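`time.localtime` depends on the time zone of the machine running the code, so the same timestamps print differently elsewhere. The sketch below pins the conversion to UTC+8, on the assumption that the index is reported in China Standard Time; `CHINA_TZ` and `format_timestamp` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the timestamps should be displayed in China Standard Time (UTC+8).
CHINA_TZ = timezone(timedelta(hours=8))


def format_timestamp(ms, tz=CHINA_TZ):
    """Convert a millisecond Unix timestamp into a readable string in time zone tz."""
    moment = datetime.fromtimestamp(ms / 1000, tz=tz)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


sample = [[1576998000000, 1.54], [1577001600000, 1.56]]
for ms, index in sample:
    print(format_timestamp(ms), ": ", index)
```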


**Challenge: scrape the traffic congestion index for the last 7 days**

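## Case Study 2: Academic Lectures at East China University of Science and Technology (ECUST)

Scrape the academic lecture listings of East China University of Science and Technology (https://news.ecust.edu.cn/reports).

- Step 1: initialize the parameters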

In [7]:
import re

original_web_address = "https://news.ecust.edu.cn"
web_address_fmt = "https://news.ecust.edu.cn/reports?page={}"

headers={"User-Agent" : "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
  "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" : "en-us",
  "Connection" : "keep-alive",
  "Accept-Charset" : "GB2312,utf-8;q=0.7,*;q=0.7"}

- Step 2: find the number of the last page

In [8]:
r = requests.get(web_address_fmt.format("1"), headers = headers)
soup = BeautifulSoup(r.text, 'lxml')

> Walkthrough

In [9]:
soup.select('.last')

[<li class="last"><a href="/news?category_id=45"><span>校友</span></a></li>,
 <li class="last"><a href="/news?category_id=44"><span>就业</span></a></li>,
 <li class="last"><a href="/news?category_id=54"><span>社会服务</span></a></li>,
 <li class="last">
 <a href="/reports?page=20">末页 </a>
 </li>]

In [17]:
print('1: ', soup.select('.last')[-1])
print('2: ', soup.select('.last')[-1].a)
print('3: ', (soup.select('.last')[-1]).a['href'])
print('4: ', re.split("=", (soup.select('.last')[-1]).a['href']))

1:  <li class="last">
<a href="/reports?page=20">末页 </a>
</li>
2:  <a href="/reports?page=20">末页 </a>
3:  /reports?page=20
4:  ['/reports?page', '20']


> Final version

In [18]:
for item in soup.select('.last'):
    found = item.find_all(href=re.compile(r"reports\?"))
    if len(found) > 0:
        last_page_number = re.split("=", found[0]['href'])[-1]

print(last_page_number)

20
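Splitting on `=` works for this link, but it would return the wrong piece as soon as the URL carried a second query parameter. As an alternative sketch, the standard library can parse the query string by name; `page_number` is a hypothetical helper and the `sort=date` parameter is made up for the example.

```python
from urllib.parse import parse_qs, urlsplit


def page_number(href, param="page"):
    """Return the integer value of a query parameter, or None if it is absent."""
    query = parse_qs(urlsplit(href).query)
    values = query.get(param)
    return int(values[0]) if values else None


print(page_number("/reports?page=20"))
print(page_number("/reports?page=20&sort=date"))
print(page_number("/news?category_id=45"))
```

Links that do not carry a `page` parameter yield `None`, so the other `.last` items can be skipped without a regular expression.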


- Step 3: scrape the lecture URLs

In [19]:
r = requests.get(web_address_fmt.format("1"), headers = headers)
soup = BeautifulSoup(r.text, 'lxml')

> Walkthrough

In [20]:
soup.select('.content')

[<div class="content">
 <ul>
 <li>
 <a href="/reports/3334">
 <span class="time">开讲时间：2019-12-25</span>
 <span class="content_icon"></span>
 <span>
             Accelerating algal systems and synthetic biology re
               ...
           </span>
 </a>
 </li>
 <li>
 <a href="/reports/3339">
 <span class="time">开讲时间：2019-12-24</span>
 <span class="content_icon"></span>
 <span>
             美国图书馆及信息科学专业人才教育的创新及伊大信息学院的发展
           </span>
 </a>
 </li>
 <li>
 <a href="/reports/3338">
 <span class="time">开讲时间：2019-12-20</span>
 <span class="content_icon"></span>
 <span>
             金属催化剂精准设计
           </span>
 </a>
 </li>
 <li>
 <a href="/reports/3337">
 <span class="time">开讲时间：2019-12-19</span>
 <span class="content_icon"></span>
 <span>
             Biomolecules
           </span>
 </a>
 </li>
 <li>
 <a href="/reports/3330">
 <span class="time">开讲时间：2019-12-19</span>
 <span class="content_icon"></span>
 <span>
             In situ surface spectroscopy and microscopy of zirc
       

In [21]:
soup_content = soup.select('.content')[0]
for item in soup_content.select('ul > li'):
    print(item)

<li>
<a href="/reports/3334">
<span class="time">开讲时间：2019-12-25</span>
<span class="content_icon"></span>
<span>
            Accelerating algal systems and synthetic biology re
              ...
          </span>
</a>
</li>
<li>
<a href="/reports/3339">
<span class="time">开讲时间：2019-12-24</span>
<span class="content_icon"></span>
<span>
            美国图书馆及信息科学专业人才教育的创新及伊大信息学院的发展
          </span>
</a>
</li>
<li>
<a href="/reports/3338">
<span class="time">开讲时间：2019-12-20</span>
<span class="content_icon"></span>
<span>
            金属催化剂精准设计
          </span>
</a>
</li>
<li>
<a href="/reports/3337">
<span class="time">开讲时间：2019-12-19</span>
<span class="content_icon"></span>
<span>
            Biomolecules
          </span>
</a>
</li>
<li>
<a href="/reports/3330">
<span class="time">开讲时间：2019-12-19</span>
<span class="content_icon"></span>
<span>
            In situ surface spectroscopy and microscopy of zirc
              ...
          </span>
</a>
</li>
<li>
<a href="/reports/3336">
<s

In [22]:
soup_content = soup.select('.content')[0]
for item in soup_content.select('ul > li'):
    print(item.select('.time'))

[<span class="time">开讲时间：2019-12-25</span>]
[<span class="time">开讲时间：2019-12-24</span>]
[<span class="time">开讲时间：2019-12-20</span>]
[<span class="time">开讲时间：2019-12-19</span>]
[<span class="time">开讲时间：2019-12-19</span>]
[<span class="time">开讲时间：2019-12-18</span>]
[<span class="time">开讲时间：2019-12-18</span>]
[<span class="time">开讲时间：2019-12-18</span>]
[<span class="time">开讲时间：2019-12-16</span>]
[<span class="time">开讲时间：2019-12-13</span>]
[<span class="time">开讲时间：2019-12-12</span>]
[<span class="time">开讲时间：2019-12-12</span>]
[<span class="time">开讲时间：2019-12-11</span>]
[<span class="time">开讲时间：2019-12-09</span>]
[<span class="time">开讲时间：2019-12-09</span>]
[<span class="time">开讲时间：2019-12-06</span>]
[<span class="time">开讲时间：2019-12-06</span>]
[<span class="time">开讲时间：2019-12-06</span>]
[<span class="time">开讲时间：2019-12-04</span>]
[<span class="time">开讲时间：2019-12-02</span>]
[]
[]
[]
[]
[]
[]
[]
[]


In [23]:
soup_content = soup.select('.content')[0]
for item in soup_content.select('ul > li'):
    if len(item.select('.time')) > 0:
        print(item)

<li>
<a href="/reports/3334">
<span class="time">开讲时间：2019-12-25</span>
<span class="content_icon"></span>
<span>
            Accelerating algal systems and synthetic biology re
              ...
          </span>
</a>
</li>
<li>
<a href="/reports/3339">
<span class="time">开讲时间：2019-12-24</span>
<span class="content_icon"></span>
<span>
            美国图书馆及信息科学专业人才教育的创新及伊大信息学院的发展
          </span>
</a>
</li>
<li>
<a href="/reports/3338">
<span class="time">开讲时间：2019-12-20</span>
<span class="content_icon"></span>
<span>
            金属催化剂精准设计
          </span>
</a>
</li>
<li>
<a href="/reports/3337">
<span class="time">开讲时间：2019-12-19</span>
<span class="content_icon"></span>
<span>
            Biomolecules
          </span>
</a>
</li>
<li>
<a href="/reports/3330">
<span class="time">开讲时间：2019-12-19</span>
<span class="content_icon"></span>
<span>
            In situ surface spectroscopy and microscopy of zirc
              ...
          </span>
</a>
</li>
<li>
<a href="/reports/3336">
<s

In [24]:
soup_content = soup.select('.content')[0]
for item in soup_content.select('ul > li'):
    if len(item.select('.time')) > 0:
        print(item.a['href'])

/reports/3334
/reports/3339
/reports/3338
/reports/3337
/reports/3330
/reports/3336
/reports/3333
/reports/3335
/reports/3332
/reports/3327
/reports/3331
/reports/3328
/reports/3326
/reports/3329
/reports/3324
/reports/3320
/reports/3319
/reports/3325
/reports/3317
/reports/3322


> Final version

In [25]:
soup_content = soup.select('.content')[0]
report_webs = ["".join([original_web_address, item.a['href']]) for item in soup_content.select('ul > li') if len(item.select('.time')) > 0]

print(report_webs)

['https://news.ecust.edu.cn/reports/3334', 'https://news.ecust.edu.cn/reports/3339', 'https://news.ecust.edu.cn/reports/3338', 'https://news.ecust.edu.cn/reports/3337', 'https://news.ecust.edu.cn/reports/3330', 'https://news.ecust.edu.cn/reports/3336', 'https://news.ecust.edu.cn/reports/3333', 'https://news.ecust.edu.cn/reports/3335', 'https://news.ecust.edu.cn/reports/3332', 'https://news.ecust.edu.cn/reports/3327', 'https://news.ecust.edu.cn/reports/3331', 'https://news.ecust.edu.cn/reports/3328', 'https://news.ecust.edu.cn/reports/3326', 'https://news.ecust.edu.cn/reports/3329', 'https://news.ecust.edu.cn/reports/3324', 'https://news.ecust.edu.cn/reports/3320', 'https://news.ecust.edu.cn/reports/3319', 'https://news.ecust.edu.cn/reports/3325', 'https://news.ecust.edu.cn/reports/3317', 'https://news.ecust.edu.cn/reports/3322']


- Step 4: scrape the details of each lecture

> Walkthrough

In [26]:
r = requests.get(report_webs[0], headers = headers)
soup = BeautifulSoup(r.text, 'lxml')

In [27]:
soup.table

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr>
<td align="left" height="22" width="80"> 报告题目:</td>
<td align="left" style="word-break:break-all;word-wrap:break-word;">
          Accelerating algal systems and synthetic biology research using high-throughput strategies
        </td>
</tr>
<tr><td> </td><td> </td></tr>
<tr>
<td align="left" height="22" width="80"> 开始时间:</td>
<td align="left" style="word-break:break-all;word-wrap:break-word;">
          2019-12-25 13:00:00
        </td>
</tr>
<tr><td> </td><td> </td></tr>
<tr>
<td align="left" height="22" width="80"> 报告地点:</td>
<td align="left" style="word-break:break-all;word-wrap:break-word;">
          实验18楼315室  
        </td>
</tr>
<tr><td> </td><td> </td></tr>
<tr>
<td align="left" height="22" width="80"> 报 告 人:</td>
<td align="left" style="word-break:break-all;word-wrap:break-word;">
          西湖大学李小波研究员
        </td>
</tr>
<tr><td> </td><td> </td></tr>
<tr>
<td align="left" height="22" width="80"> 主办单

In [28]:
for item in soup.table.select('tr'):
    for unit in item.select('td'):
        if unit.string is not None:
            print(unit.string.lstrip().rstrip(), len(unit.string.lstrip().rstrip()))

报告题目: 5
Accelerating algal systems and synthetic biology research using high-throughput strategies 90
 0
 0
开始时间: 5
2019-12-25 13:00:00 19
 0
 0
报告地点: 5
实验18楼315室 9
 0
 0
报 告 人: 6
西湖大学李小波研究员 10
 0
 0
主办单位: 5
生物反应器工程国家重点实验室 14
 0
 0
备    注: 7
 0
 0
 0


In [29]:
for item in soup.table.select('tr'):
    cells = item.select('td')
    if len(cells) < 2:
        continue
    if cells[1].string is None:
        continue
    if len(cells[1].string.lstrip().rstrip()) < 1:
        continue
    
    print(cells[0].string.lstrip().rstrip(), cells[1].string.lstrip().rstrip())

报告题目: Accelerating algal systems and synthetic biology research using high-throughput strategies
开始时间: 2019-12-25 13:00:00
报告地点: 实验18楼315室
报 告 人: 西湖大学李小波研究员
主办单位: 生物反应器工程国家重点实验室


In [30]:
lecture_info = []

for item in soup.table.select('tr'):
    cells = item.select('td')
    if len(cells) < 2:
        continue
    if cells[1].string is None:
        continue
    if len(cells[1].string.lstrip().rstrip()) < 1:
        continue
    
    title = re.sub(r'\s+', '', re.split(":", cells[0].string.lstrip().rstrip())[0])
    content = cells[1].string.lstrip().rstrip()
    lecture_info.append((title, content))

print(lecture_info)

[('报告题目', 'Accelerating algal systems and synthetic biology research using high-throughput strategies'), ('开始时间', '2019-12-25 13:00:00'), ('报告地点', '实验18楼315室'), ('报告人', '西湖大学李小波研究员'), ('主办单位', '生物反应器工程国家重点实验室')]
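The label clean-up can be isolated into a small helper, which also makes it easy to turn the list of tuples into a dictionary keyed by label. This is a sketch under the assumption that labels end in either an ASCII or a full-width colon; `clean_label` and `to_record` are hypothetical names.

```python
import re


def clean_label(raw):
    """Drop the trailing colon (ASCII or full-width) and all whitespace from a label."""
    label = re.split(r"[:：]", raw.strip(), maxsplit=1)[0]
    return re.sub(r"\s+", "", label)


def to_record(pairs):
    """Turn (raw label, raw value) pairs into a dictionary with cleaned keys."""
    return {clean_label(label): value.strip() for label, value in pairs}


rows = [
    (" 报 告 人:", "  西湖大学李小波研究员 "),
    (" 开始时间:", "2019-12-25 13:00:00"),
]
record = to_record(rows)
print(record)
```

A dictionary lets later steps look up a field by name instead of relying on its position in the list.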


> Final version

In [31]:
import re
import requests
from bs4 import BeautifulSoup

original_web_address = "https://news.ecust.edu.cn"
web_address_fmt = "https://news.ecust.edu.cn/reports?page={}"

# Fetch a page and return it as a BeautifulSoup object
def get_web(website):
    headers={"User-Agent" : "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
          "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
          "Accept-Language" : "en-us",
          "Connection" : "keep-alive",
          "Accept-Charset" : "GB2312,utf-8;q=0.7,*;q=0.7"}
    r = requests.get(website, headers = headers)
    soup = BeautifulSoup(r.text, 'lxml')
    
    return soup

# Get the number of listing pages (-1 if no "last page" link is found)
def get_page_num(website):
    last_page_number = -1
    soup = get_web(website)
    
    for item in soup.select('.last'):
        found = item.find_all(href=re.compile(r"reports\?"))
        if len(found) > 0:
            last_page_number = int(re.split("=", found[0]['href'])[-1])
    
    return last_page_number

# Get the links to the individual lectures on one listing page
def get_lecture_links(website, original_web_address):
    soup = get_web(website)
    soup_content = soup.select('.content')[0]
    report_webs = ["".join([original_web_address, item.a['href']]) for item in soup_content.select('ul > li') if len(item.select('.time')) > 0]
    
    return report_webs

# Get the details of a single lecture as a list of (label, value) tuples
def get_lecture_info(website):
    lecture_info = []
    soup = get_web(website)

    for item in soup.table.select('tr'):
        cells = item.select('td')
        # Skip spacer rows and rows whose value cell is empty
        if len(cells) < 2:
            continue
        if cells[1].string is None:
            continue
        if len(cells[1].string.strip()) < 1:
            continue

        title = re.sub(r'\s+', '', re.split(":", cells[0].string.strip())[0])
        content = cells[1].string.strip()
        lecture_info.append((title, content))
    
    return lecture_info

> Tests

In [32]:
get_page_num(web_address_fmt.format('1'))

20

In [33]:
get_lecture_links(web_address_fmt.format('2'), original_web_address)

['https://news.ecust.edu.cn/reports/3321',
 'https://news.ecust.edu.cn/reports/3318',
 'https://news.ecust.edu.cn/reports/3323',
 'https://news.ecust.edu.cn/reports/3312',
 'https://news.ecust.edu.cn/reports/3316',
 'https://news.ecust.edu.cn/reports/3311',
 'https://news.ecust.edu.cn/reports/3308',
 'https://news.ecust.edu.cn/reports/3310',
 'https://news.ecust.edu.cn/reports/3315',
 'https://news.ecust.edu.cn/reports/3305',
 'https://news.ecust.edu.cn/reports/3307',
 'https://news.ecust.edu.cn/reports/3304',
 'https://news.ecust.edu.cn/reports/3293',
 'https://news.ecust.edu.cn/reports/3309',
 'https://news.ecust.edu.cn/reports/3306',
 'https://news.ecust.edu.cn/reports/3302',
 'https://news.ecust.edu.cn/reports/3296',
 'https://news.ecust.edu.cn/reports/3299',
 'https://news.ecust.edu.cn/reports/3294',
 'https://news.ecust.edu.cn/reports/3300']

In [34]:
get_lecture_info("https://news.ecust.edu.cn/reports/3321")

[('报告题目', '名师讲坛：漫谈质疑与创新'),
 ('开始时间', '2019-11-29 13:30:00'),
 ('报告地点', '研究生楼第三多媒体教室'),
 ('报告人', '发展中国家科学院院士陈关荣教授'),
 ('主办单位', '信息科学与工程学院')]

> Scrape the lecture details

In [35]:
lectures_info = []
page_num = get_page_num(web_address_fmt.format('1'))

# Demo: only the first two pages; use range(1, page_num + 1) to crawl every page
for i in range(1, 3):
    print('page: ', i)
    lecture_links = get_lecture_links(web_address_fmt.format(str(i)), original_web_address)
    for lecture_link in lecture_links:
        print(lecture_link)
        lectures_info.append(get_lecture_info(lecture_link))

print(lectures_info)

page:  1
https://news.ecust.edu.cn/reports/3334
https://news.ecust.edu.cn/reports/3339
https://news.ecust.edu.cn/reports/3338
https://news.ecust.edu.cn/reports/3337
https://news.ecust.edu.cn/reports/3330
https://news.ecust.edu.cn/reports/3336
https://news.ecust.edu.cn/reports/3333
https://news.ecust.edu.cn/reports/3335
https://news.ecust.edu.cn/reports/3332
https://news.ecust.edu.cn/reports/3327
https://news.ecust.edu.cn/reports/3331
https://news.ecust.edu.cn/reports/3328
https://news.ecust.edu.cn/reports/3326
https://news.ecust.edu.cn/reports/3329
https://news.ecust.edu.cn/reports/3324
https://news.ecust.edu.cn/reports/3320
https://news.ecust.edu.cn/reports/3319
https://news.ecust.edu.cn/reports/3325
https://news.ecust.edu.cn/reports/3317
https://news.ecust.edu.cn/reports/3322
page:  2
https://news.ecust.edu.cn/reports/3321
https://news.ecust.edu.cn/reports/3318
https://news.ecust.edu.cn/reports/3323
https://news.ecust.edu.cn/reports/3312
https://news.ecust.edu.cn/reports/3316
https:/
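A loop like this sends many requests back to back and stops at the first page that fails to load or parse. The sketch below shows one possible way to add a pause between requests and to collect failures instead of aborting. The names `crawl`, `fetch`, `delay`, and `sleep` are hypothetical; `fetch` stands in for a function such as `get_lecture_info`.

```python
import time


def crawl(links, fetch, delay=1.0, sleep=time.sleep):
    """Fetch every link in order, pausing between requests and recording failures."""
    results, failures = [], []
    for index, link in enumerate(links):
        if index > 0:
            sleep(delay)  # be polite: wait before every request except the first
        try:
            results.append(fetch(link))
        except Exception as error:  # broad on purpose: one bad page should not stop the run
            failures.append((link, error))
    return results, failures


# Offline demonstration with a stand-in fetch function and no real waiting.
def fake_fetch(link):
    if link.endswith("bad"):
        raise ValueError("page could not be parsed")
    return link.upper()


demo_results, demo_failures = crawl(["a", "bad", "c"], fake_fetch, delay=0)
print(demo_results, [link for link, _ in demo_failures])
```

Passing the sleep function as a parameter keeps the helper easy to test. In a real crawl, the default `time.sleep` with a delay of a second or more is a reasonable starting point.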

> Post-processing: export the data to Excel

In [36]:
import pandas as pd

In [37]:
lecture_data = pd.DataFrame([(item[0][1], item[1][1], item[2][1], item[3][1], item[4][1]) for item in lectures_info])
item = lectures_info[0]
lecture_data.columns = (item[0][0], item[1][0], item[2][0], item[3][0], item[4][0])
lecture_data.to_excel('lecture_data.xlsx', index = False)
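The positional indexing above assumes that every lecture has exactly five fields in the same order, which breaks as soon as a page omits one of them. A more forgiving sketch builds each row by field name and fills gaps with an empty string; `COLUMNS` and `to_rows` are hypothetical names, and the column list is taken from the labels seen in the outputs above.

```python
COLUMNS = ["报告题目", "开始时间", "报告地点", "报告人", "主办单位"]


def to_rows(lectures_info, columns=COLUMNS):
    """Convert lists of (label, value) tuples into dictionaries with a fixed set of keys."""
    rows = []
    for lecture in lectures_info:
        fields = dict(lecture)
        rows.append({column: fields.get(column, "") for column in columns})
    return rows


sample = [
    [("报告题目", "金属催化剂精准设计"), ("开始时间", "2019-12-20 10:00:00")],
]
print(to_rows(sample))
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows, columns=COLUMNS)` and exported exactly as before.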

> Post-processing: push the lecture announcements

In [38]:
import arrow

lecture_to_be_pushed = [lecture for lecture in lectures_info if arrow.get(lecture[1][1]) > arrow.utcnow()]

lecture_text = "=" * 50 + "讲座信息" + "=" * 50 + "\n"
for lecture in lecture_to_be_pushed:
    lecture_text = "\n".join([lecture_text, "\n".join([": ".join(item) for item in lecture]), "\n"])

lecture_text = "".join([lecture_text, "=" * 108 + "\n"])
print(lecture_text)


报告题目: Accelerating algal systems and synthetic biology research using high-throughput strategies
开始时间: 2019-12-25 13:00:00
报告地点: 实验18楼315室
报告人: 西湖大学李小波研究员
主办单位: 生物反应器工程国家重点实验室


报告题目: 美国图书馆及信息科学专业人才教育的创新及伊大信息学院的发展
开始时间: 2019-12-24 09:00:00
报告地点: 徐汇校区图书馆201室
报告人: 阮炼教授
主办单位: 图书馆
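The start times on the lecture pages carry no time zone, and `arrow.get` treats such strings as UTC, so the comparison with `arrow.utcnow()` can be off by several hours. The following standard-library sketch assumes the times are China Standard Time (UTC+8) and looks up the start time by label rather than by position; `CHINA_TZ`, `start_time`, and `upcoming` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: lecture start times are given in China Standard Time (UTC+8).
CHINA_TZ = timezone(timedelta(hours=8))


def start_time(lecture):
    """Parse the start time of a lecture given as a list of (label, value) tuples."""
    fields = dict(lecture)
    naive = datetime.strptime(fields["开始时间"], "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=CHINA_TZ)


def upcoming(lectures, now=None):
    """Return the lectures that start after now, earliest first."""
    now = now or datetime.now(timezone.utc)
    return sorted((item for item in lectures if start_time(item) > now), key=start_time)


lectures = [
    [("报告题目", "Lecture A"), ("开始时间", "2019-12-25 13:00:00")],
    [("报告题目", "Lecture B"), ("开始时间", "2019-12-24 09:00:00")],
]
reference = datetime(2019, 12, 24, 12, 0, tzinfo=CHINA_TZ)
print(upcoming(lectures, now=reference))
```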




In [39]:
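import os
import yagmail

# Read the SMTP password from an environment variable instead of hard-coding it
# (the variable name SMTP_PASSWORD is an assumption; set it before running this cell)
yag = yagmail.SMTP("plutoese@126.com", os.environ["SMTP_PASSWORD"], host='smtp.126.com')
yag.send('cfzhang@163.com', '最新讲座信息', lecture_text)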

{}