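# Web Scraping Study: Using the `requests-html` Module

This module makes parsing HTML pages more convenient, which keeps scraper code simple and easy to follow.

For usage details, see the [official documentation](https://requests-html.kennethreitz.org/).

<b>[Chinese documentation](https://cncert.github.io/requests-html-doc-cn/#/)</b>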


## Scraping a News Page

### Example 1 (locating elements with CSS selectors)

In [8]:

from requests_html import HTMLSession

session = HTMLSession()

r = session.get("https://news.cnblogs.com/n/recommend")

# Find the news links with a CSS selector
news = r.html.find('h2.news_entry > a')

for new in news:
    print(new.text)  # news headline
    print(new.absolute_links)  # news link(s), as a set of absolute URLs

《动物森友会》刷爆朋友圈的背后 是逃离时代的我们
{'https://news.cnblogs.com/n/658292/'}
一群日本小学生，在《我的世界》里搞了场“云”毕业
{'https://news.cnblogs.com/n/658276/'}
这款超硬核的游戏，蕴藏着微软实现复兴的巨大野心
{'https://news.cnblogs.com/n/658219/'}
Google 开源 Pigweed，涉足嵌入式开发
{'https://news.cnblogs.com/n/658198/'}
帮罗永浩算笔账：创业这些年，到底挣了多少钱？
{'https://news.cnblogs.com/n/658176/'}
被谷歌剪掉命根子的出海应用，没几个冤枉的
{'https://news.cnblogs.com/n/658174/'}
马云给高中生写了一份信，三年后，让全世界都来考数学
{'https://news.cnblogs.com/n/658169/'}
怀疑开发者在“造核弹”？GitHub 不断封禁开源项目
{'https://news.cnblogs.com/n/658107/'}
AMD 发布 Ryzen 9 系列移动 CPU，笔记本市场变天了么？
{'https://news.cnblogs.com/n/658087/'}
为什么阿里喜欢全面并购，腾讯喜欢战略投资？
{'https://news.cnblogs.com/n/658059/'}
从百度掉队谈起：给百度开药，其实只有一条
{'https://news.cnblogs.com/n/658007/'}
支付宝若要击败美团，难度到底有多大？
{'https://news.cnblogs.com/n/658006/'}
追踪网赚游戏：是谁割了你，而你又割了谁？
{'https://news.cnblogs.com/n/657964/'}
成本480万的《三体》改编动画，如何做到9.9分？
{'https://news.cnblogs.com/n/657962/'}
互联网公司的灰色战争
{'https://news.cnblogs.com/n/657960/'}
图灵奖背后：一个奥斯卡拿到手软，一个公司卖了160亿
{'https://news.cnblogs.com/n/657944
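
The `absolute_links` output above is a set of fully qualified URLs, even when a page's `href` values are relative. The sketch below illustrates the same idea with the standard library's `urljoin`. The sample `href` values are made up for illustration, and this is not a description of how requests-html works internally.

```python
from urllib.parse import urljoin

# Page the links were found on (taken from the example above)
base_url = "https://news.cnblogs.com/n/recommend"

# Hypothetical href values, as they might appear in the page source
hrefs = ["/n/658292/", "https://news.cnblogs.com/n/658276/"]

# Resolve each href against the page URL; absolute URLs pass through unchanged
absolute = {urljoin(base_url, href) for href in hrefs}

for link in sorted(absolute):
    print(link)
```

Collecting the results in a set mirrors the shape of the output shown above and drops duplicate links.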

### Example 2 (locating elements with XPath)

In [1]:
from requests_html import HTMLSession

session = HTMLSession()

r = session.get("https://www.liepin.com/zhaopin/?key=web+mining")

# Find the job-posting links with XPath
news = r.html.xpath('//div[@class="job-info"]/h3/a')
for new in news:
    print(new.text)  # job title
    print(new.absolute_links)  # job link(s), as a set of absolute URLs

web mining Engineer
{'https://www.liepin.com/job/1917951351.shtml'}
( Senior） Research Manager
{'https://www.liepin.com/job/1923198421.shtml'}
(Associate）research Director
{'https://www.liepin.com/job/1920754473.shtml'}
高级数据分析员(信息系统)Sr. Data Analyst(IS)
{'https://www.liepin.com/job/1927198767.shtml'}
SAP ApplicationDevelopment Project Lead
{'https://www.liepin.com/job/1926638333.shtml'}
Machine Learning Deployment Engineer
{'https://www.liepin.com/job/1926370455.shtml'}
Machine Learning Deployment Engineer
{'https://www.liepin.com/job/1926370453.shtml'}
Associate Director, Global Risk Management
{'https://www.liepin.com/job/1926267845.shtml'}
Business Analyst
{'https://www.liepin.com/a/19634343.shtml'}
Risk Management Specialist
{'https://www.liepin.com/job/1926256361.shtml'}
Developer – WeChat, Microsoft LUIS, BotFramework and NLP
{'https://www.liepin.com/job/1922703265.shtml'}
Data Scientist
{'https://www.liepin.com/a/19176709.shtml'}
Machine Learning Deployment Engineer
{'https://ww
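
To see what the XPath expression above selects without hitting the network, here is a small sketch that runs a similar path against an inline snippet. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so it stands in for (and is not the same as) `r.html.xpath`. The snippet and its job titles are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet that mimics the structure targeted above
snippet = """
<html><body>
  <div class="job-info"><h3><a href="/job/1.shtml">Data Scientist</a></h3></div>
  <div class="other"><h3><a href="/x.shtml">Not a job</a></h3></div>
  <div class="job-info"><h3><a href="/job/2.shtml">Business Analyst</a></h3></div>
</body></html>
"""

root = ET.fromstring(snippet)

# ElementTree paths are relative to the element, hence the leading "."
jobs = root.findall('.//div[@class="job-info"]/h3/a')

titles = [a.text for a in jobs]
links = [a.get("href") for a in jobs]
print(titles)
print(links)
```

Only the anchors inside `div` elements whose `class` is exactly `job-info` are matched; the `other` block is skipped.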

> Jupyter magic trick (displaying scraped images)

In [2]:
# Image rendered via HTML
from IPython.display import display, HTML
display(HTML('<img src="https://httpstatusdogs.com/img/418.jpg" alt="">'))

In [3]:
# Image rendered via Markdown
from IPython.display import display, Markdown
display(Markdown('![](https://httpstatusdogs.com/img/418.jpg)'))

![](https://httpstatusdogs.com/img/418.jpg)

In [4]:
# Bonus: scraping images from a web page

In [6]:
from requests_html import HTMLSession

session = HTMLSession()

r = session.get("https://cn.bing.com/images/trending")

# Collect the src attribute of every <img> element with XPath
items = r.html.xpath('//img/@src')

for url in items:
    # print(url)  # image src URL
    display(Markdown('![]({url})'.format(url=url)))  # render the image



![](https://tse1-mm.cn.bing.net/th/id/OET.bc70150f0a0f46e98b97ab7418544007?w=272&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.6c67b92a90e642419f9d56226f0a1f25?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.e03c2838234f4cd1b3aa09541a208aab?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.843fba32bfe94c4085c86ca17dc7dcf1?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.c79e967af54b4bd2962ff5762873f19d?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.114569a5ac5948839511c68dbb63138c?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.af1cd8f22e12465fbc3a3710d5c49937?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.c1db9a3ba2be4e009d47afb71ab3c20b?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.e78b0b979bc74f59b809294edc34d10a?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.249361366e2d43858acde0a77c5a1d4d?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.21a0caa8b95c4b479180e20814e190e4?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.0c62c79ee15146caa94830cbffe68a15?w=272&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.91319928824a4a619e21ed4bb785c006?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.269a1c8ab83f419db21f1447477f9d4d?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.e7020eeba24a423d9694f2874de81238?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.569ec4dfca424c539a920eb5c8f9103e?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.d78d1790f0d54d0697f6cf01571d8c5d?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.0cb8103b50f644f7803116f3f3afbe19?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.15ebe1689a5e49148b32625ce7dfc3e7?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.9232ec87d1b44164b89cc86bceda2f89?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.84f864f390bc4c4da7a24442518e24d4?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.8d985aa5d30644b38716ac203eaecf32?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.94ec66594fbc486ea4a80fcf46a94a43?w=272&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.ce8e35cf44254cc1937d3599379a8d2c?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.9c7eaa350b1b47f6bd2eaaf62e5a186d?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.4d85185c1c38460a99da2ecaa594b66e?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.b48c7915bd5c4e4ba4ca09d71f79757c?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.fa5f3e5bd55f44f9b0f85f0271fa58bd?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.d55a3d45e7c8436ea982b946e014cbdf?w=135&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.126389af6ec542d783dd0b987f3f5535?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.e47575bfe7e04cb58e1145d0dd8bf661?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.443f23182e1b4a39bba6b5addd0a0a9e?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.9bfc3c7a14934163a44268a2f93839d8?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.efe3f9fc7f784df3824c2c4f6e05e77f?w=272&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.b795a280749f48ca95a9dea4163cdb3c?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.4dc363ac1e1b4b6e996aee01aada03c0?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.009d7b6a10864f5895dd227b8b6aca5a?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.4a23232f3da340bbbcea184defa95785?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.33c7e256696349ab8261e9a8af156ac7?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.07125a24199f49aca5c46ba40c130f86?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.e6e9bf3521a9440d80883ad9b303517b?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.fae29efc4ff648029f1ef70585843e62?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.2d20012bbff14d7b9b2fb9b386fba7d7?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.22eceb574e414031bec812c28ead4b0d?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.1ec0493ebfc0451983e24ef7afab32cc?w=272&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.1ca1d0f4d5af42f09decbe20d9258e08?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.ac36515ef40d41ce872ad24873b2b01e?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.191598f3fd704b93a141faa025145673?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.eff35fefece14bc999ae04aa7c012c2c?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.5354bdc5fb5143968c7b8dcda370919e?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.0c681e62143e49929c2f44e30ab61b90?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.738c80cb3daa48e79b6f29f100ec42ed?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.8ec6c4aee5194b0a896db5ef3191afab?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.7706e7f7221949bcb5f237c4fbfb7bc2?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.78ff3df35c654a4b913b297134bd63e0?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.409ef668387749e4af68a0dfce45c983?w=272&h=272&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.9ad45a55f5ac4133909b7cf15cc36fde?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.ae179eedbe6e425484b14c8978fa4079?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.164bf99f079a41c69d9fa0c349fa825f?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.9fb2529795ce419bb3f96a701a0edb36?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.3e3a5b91f801498984228dcc9ba6b743?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.81582f824b024116a5b7a91c185f968e?w=272&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.e255cd1d563846faa4b1551556a2b85d?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.6a0f85e050a54119a65ebecacd4d203a?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.2a67a683253746ae8f1062188c0d0c49?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](https://tse1-mm.cn.bing.net/th/id/OET.0c734667886f4459af7239e907d705f7?w=135&h=135&c=7&rs=1&o=5&pid=1.9)

![](/rs/5g/LB/ic/ytiieusXgM2K8bLkEDP-AS1ePds.png)

# Scraping Images and Douban


## Images

In [7]:
import pandas as pd


URL_src = { "HTTP状态狗" : "https://httpstatusdogs.com/",
           "豆瓣电影排行榜" : "https://movie.douban.com/chart"  }


In [8]:
df = pd.read_html(URL_src['HTTP状态狗'], encoding="utf8", header=0, index_col=0)


# This raises a 403 error

HTTPError: HTTP Error 403: Forbidden

In [10]:
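# Why? No content came back: the server refused the request that pd.read_html made on its own.
# Request the page explicitly and the content is returned.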
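
A server may refuse requests that arrive with a library's default headers. As a minimal sketch, using only the standard library, a request can carry a browser-like `User-Agent` and the HTML can be downloaded explicitly. The helper names `build_request` and `fetch_html` and the header value are assumptions for illustration; whether this particular value satisfies a given site is not guaranteed.

```python
from urllib.request import Request, urlopen


def build_request(url):
    """Return a Request carrying a browser-like User-Agent (illustrative value)."""
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


def fetch_html(url, timeout=10):
    """Download the page body as bytes; raises HTTPError on 4xx/5xx responses."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


# Building the request does not touch the network; only fetch_html() does
req = build_request("https://httpstatusdogs.com/")
print(req.get_header("User-agent"))  # urllib stores header names in this capitalization
```

The bytes returned by `fetch_html` could then be handed to a parser instead of giving the parser a bare URL.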


**Trying requests + lxml instead**

In [11]:
import requests
r = requests.get(URL_src['HTTP状态狗'])
print (r.status_code, r.content)

200 b'<!doctype html>\n<html lang="en">\n<head>\n<meta charset="utf-8" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n<title>HTTP Status Dogs</title>\n<meta name="rating" content="general" />\n<meta name="subject" content="Hypertext Transfer Protocol Response status codes. And dogs." />\n<meta name="author" content="Mike Lee" />\n<meta name="publisher" content="HTTP Status Dogs" />\n<meta name="copyright" content="Mike Lee" />\n<meta name="host" content="httpstatusdogs.com" />\n<meta name="description" content="HTTP Status Dogs. Hypertext Transfer Protocol Response status codes. And dogs." />\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<meta property="og:type" content="website" />\n<meta property="og:title" content="HTTP Status Dogs" />\n<meta property="og:description" content="HTTP Status Dogs. Hypertext Transfer Protocol Response status codes. And dogs." />\n<meta property="og:image" content="https://httpstatusdogs.com/img/200.jpg" />

In [12]:
# Content came back this time!

### Downloading the Scraped Images

In [13]:
# Create a directory next to the notebook file

In [15]:
!mkdir img

mkdir: img: File exists


In [16]:
# Start downloading

In [17]:
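import shutil
import urllib.parse

import requests

# NOTE: list_img_src (image paths) and url_base (site root) must be defined
# before this cell runs; this notebook never defines them.
for imgfilename in list_img_src:
    path = urllib.parse.urljoin(url_base, imgfilename)

    resp = requests.get(path, stream=True)
    if resp.status_code == 200:  # check this response, not the earlier r
        with open(imgfilename, mode="wb") as f:  # mode = write binary
            resp.raw.decode_content = True  # undo gzip/deflate content encoding
            shutil.copyfileobj(resp.raw, f)
    del resp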

NameError: name 'list_img_src' is not defined
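
The `NameError` above arises because `list_img_src` (and `url_base`) are never defined in this notebook. Below is a minimal sketch of one way they might be built. It swaps in the standard library's `html.parser` for lxml, and both the sample HTML and the `ImgSrcCollector` class are assumptions for illustration, not the page's real markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Stand-in for r.content.decode("utf-8"); the real page markup may differ
sample_html = (
    '<ul><li><img src="img/200.jpg" alt="200"></li>'
    '<li><img src="img/418.jpg"></li></ul>'
)

collector = ImgSrcCollector()
collector.feed(sample_html)

url_base = "https://httpstatusdogs.com/"
list_img_src = collector.sources
full_urls = [urljoin(url_base, src) for src in list_img_src]
print(full_urls)
```

With relative paths such as `img/200.jpg`, the download loop would write into the `img` directory created earlier.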

## Scraping Douban

In [18]:
import requests
r = requests.get(URL_src['豆瓣电影排行榜'])
print (r.status_code, r.content)

418 b''


In [21]:
# A 418 here means the server does not recognize the client, so it refuses the request.
# Retry with headers that identify the client.

In [22]:
import requests
from urllib import parse
_headers_ = {
        "Accept": "text/plain, */*; q=0.01", 
        "Connection": "keep-alive",
        "Host" : "movie.douban.com", 
        "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3250.0 Iron Safari/537.36",
}
s = requests.Session()
u = URL_src['豆瓣电影排行榜']
r = s.get(u, headers=_headers_)

In [23]:
print (r.status_code, r.content)

200 b'<!DOCTYPE html>\n<html lang="zh-cmn-Hans" class="ua-windows ua-webkit">\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n    <meta name="renderer" content="webkit">\n    <meta name="referrer" content="always">\n    <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />\n    <title>\n\xe8\xb1\x86\xe7\x93\xa3\xe7\x94\xb5\xe5\xbd\xb1\xe6\x8e\x92\xe8\xa1\x8c\xe6\xa6\x9c\n</title>\n    \n    <meta name="baidu-site-verification" content="cZdR4xxR7RxmM4zE" />\n    <meta http-equiv="Pragma" content="no-cache">\n    <meta http-equiv="Expires" content="Sun, 6 Mar 2005 01:00:00 GMT">\n    \n    <meta name="keywords" content="\xe7\x94\xb5\xe5\xbd\xb1\xe6\x8e\x92\xe8\xa1\x8c\xe6\xa6\x9c\xe3\x80\x81\xe6\x96\xb0\xe7\x89\x87\xe6\x8e\x92\xe8\xa1\x8c\xe6\xa6\x9c\xe3\x80\x81\xe8\xb1\x86\xe7\x93\xa3\xe7\x94\xb5\xe5\xbd\xb1250"/>\n    <meta name="description" content="\xe8\xb1\x86\xe7\x93\xa3\xe7\x94\xb5\xe5\xbd\xb1\xe6\x8e\x92\

In [24]:
df = pd.read_html(r.content, encoding="utf8", header=0, index_col=0)
df

[Empty DataFrame
 Columns: []
 Index: [], Empty DataFrame
 Columns: [饥饿站台  / 饥饿斗室(港) / 绝命大平台(台)  2019-09-06(多伦多电影节) / 2019-11-08(西班牙) / 伊万·马萨戈 / 佐里昂·伊圭里奥尔 / 安东尼亚·圣胡安 / 埃米利奥·布阿勒 / 亚历山德拉·玛桑凯 / 马里奥·帕尔多 / 阿尔吉斯·阿洛斯卡斯 / 米里亚姆·马丁 / 西班牙 / elhoyolapelicula.com / 加尔德·加兹特鲁·乌鲁蒂亚...  7.8  (104267人评价)]
 Index: [], Empty DataFrame
 Columns: [隐形人  / 隐身人 / 隐形客(港)  2020-02-28(美国) / 伊丽莎白·莫斯 / 奥利弗·杰森-科恩 / 阿尔迪斯·霍吉 / 迈克尔·多曼 / 斯托姆·瑞德 / 本尼迪克·哈迪 / 哈丽特·戴尔 / 瑞妮·林 / 布莱恩·米根 / 薇薇安·格里尔 / 尼古拉斯·霍普 / 克利夫·威廉姆斯 / 萨姆·史密斯...  7.3  (68173人评价)]
 Index: [], Empty DataFrame
 Columns: [绅士们  / 疯狂绅士帮(港) / 绅士追杀令(台)  2020-01-24(美国) / 马修·麦康纳 / 查理·汉纳姆 / 亨利·戈尔丁 / 米歇尔·道克瑞 / 杰瑞米·斯特朗 / 科林·法瑞尔 / 休·格兰特 / 琳·勒内 / 马克斯·班内特 / 布列塔尼·阿什沃思 / 伊川东吾 / 尤金娜·库日敏娜 / 乔丹·朗 / 杰森·王 / 可可·萨姆纳...  8.4  (33595人评价)]
 Index: [], Empty DataFrame
 Columns: [大赢家  / The Winners  [可播放]  2020-03-20(中国大陆) / 大鹏 / 柳岩 / 代乐乐 / 张子贤 / 田雨 / 孟鹤堂 / 陶慧 / 许娣 / 王戈 / 杜源 / 阿如那 / 张绍荣 / 张帆 / 夏甄 / 杨砚铎 / 臧鸿飞 / 李萍 / 乔晟一 / 孟非 / 屈菁菁 / 腾格尔 / 叶晞月 / 杜维瀚 / 耿业庭 / 庞博 / 姜志刚...  6.8  (114865人评价)]
 Ind

In [25]:
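# The data came through! pd.read_html returns a list of DataFrames; with header=0,
# each table's single row is read as the header, which is why the frames look empty.

> Extension (scraping the university's official site, with pagination)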
 

In [157]:
from requests_html import HTMLSession
import pandas as pd

session = HTMLSession()
items = []
times = []

# Walk through result pages 1 to 7
for i in range(1, 8):
    url = "http://www.nfu.edu.cn/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/{}.html".format(i)

    r = session.get(url)

    # Collect every title and date on the current page
    items.extend(r.html.xpath('//li/div/a/@title'))
    times.extend(r.html.xpath('//li/font/text()'))

# Build the table once, after all pages have been fetched
df = pd.DataFrame(
    {
        "标题": items,  # title
        "时间": times,  # date
    })
df




| | 标题 | 时间 |
|---|---|---|
| 0 | 文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会 | 2020-01-06 |
| 1 | 文学与传媒学院2019年学术研讨会暨总结大会顺利召开 | 2020-01-06 |
| 2 | 展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕 | 2019-12-20 |
| 3 | 文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束 | 2019-11-22 |
| 4 | 文学与传媒学院教师招聘启事 | 2019-11-05 |
| 5 | 创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行 | 2019-11-04 |
| 6 | 垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行 | 2019-11-04 |
| 7 | 以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束 | 2019-09-16 |
| 8 | 文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖 | 2019-09-09 |
| 9 | 文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩 | 2019-09-09 |
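
The long percent-encoded segment in the search URL above is the UTF-8 encoding of the search keyword. Below is a small sketch, assuming the URL pattern stays as shown, of building the page URLs from the readable keyword. `page_urls` is a hypothetical helper name.

```python
from urllib.parse import quote

# Search keyword used in the URL above, before percent-encoding
keyword = "文学与传媒学院"

# URL pattern taken from the loop above; {kw} and {page} are filled in per page
template = "http://www.nfu.edu.cn/index.php/home/article/search/keyword/{kw}/p/{page}.html"


def page_urls(first, last):
    """Return the search-result URLs for pages first..last inclusive."""
    return [template.format(kw=quote(keyword), page=p) for p in range(first, last + 1)]


urls = page_urls(1, 7)
print(len(urls))
print(urls[0])
```

Keeping the keyword readable makes it easier to reuse the same loop for a different search term.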
