# Scrape 9,000 Meme Images with Python in 3 Minutes

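This video demonstrates the following steps:

1. Use requests to crawl 200 web pages
2. Use BeautifulSoup to parse out each image's title and URL
3. Download the images to a local directory

For detailed usage of these two libraries, see my other video courses.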

In [1]:
import requests
from bs4 import BeautifulSoup
import re

## 1. Download the HTML of all 200 pages

In [2]:
def download_all_htmls():
    """
    Download the HTML of every list page for later parsing
    """
    htmls = []
    for idx in range(200):
        # Pages are numbered from 1, so shift the zero-based index
        url = f"https://fabiaoqing.com/biaoqing/lists/page/{idx+1}.html"
        print("craw html:", url)
        r = requests.get(url)
        if r.status_code != 200:
            raise Exception(f"error: status {r.status_code} for {url}")
        htmls.append(r.text)
    print("success")
    return htmls
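
The loop above sends 200 requests back to back with no timeout. A minimal sketch of a gentler variant follows, assuming a pause between requests is acceptable. The names `build_page_urls`, `fetch_with_requests`, and `download_htmls`, the `fetch` parameter, and the 0.5 second delay are all hypothetical and not part of the original notebook.

```python
import time

BASE_URL = "https://fabiaoqing.com/biaoqing/lists/page/{page}.html"


def build_page_urls(n_pages=200):
    """Return the list-page URLs for pages 1..n_pages."""
    return [BASE_URL.format(page=page) for page in range(1, n_pages + 1)]


def fetch_with_requests(url, timeout=10):
    """Fetch one page with requests; raises on HTTP errors or timeouts."""
    import requests  # imported lazily so the helpers above need no third-party package

    r = requests.get(url, timeout=timeout)
    r.raise_for_status()
    return r.text


def download_htmls(urls, fetch=fetch_with_requests, delay=0.5):
    """Download every URL in order, pausing `delay` seconds between requests."""
    htmls = []
    for i, url in enumerate(urls):
        if i > 0 and delay > 0:
            time.sleep(delay)  # assumed politeness delay, tune as needed
        htmls.append(fetch(url))
    return htmls
```

Calling `download_htmls(build_page_urls())` would play the same role as `download_all_htmls()`. Passing a different `fetch` callable makes the loop easy to try out without touching the network.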

In [3]:
# Run the crawl
htmls = download_all_htmls()

craw html: https://fabiaoqing.com/biaoqing/lists/page/1.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/2.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/3.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/4.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/5.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/6.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/7.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/8.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/9.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/10.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/11.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/12.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/13.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/14.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/15.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/16.html
craw html: https:

craw html: https://fabiaoqing.com/biaoqing/lists/page/133.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/134.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/135.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/136.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/137.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/138.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/139.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/140.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/141.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/142.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/143.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/144.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/145.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/146.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/147.html
craw html: https://fabiaoqing.com/biaoqing/lists/page/1

In [4]:
htmls[0][:1000]

'<html>\n\n<head>\n    <meta charset="UTF-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta http-equiv="content-type" content="text/html; charset=UTF-8">\n    <title>热门表情_发表情，表情包大全fabiaoqing.com</title>\n    <meta name="Keywords" content="热门表情,表情包,聊天表情,微信表情包,QQ表情包,发表情,表情包大全,表情包下载,表情下载,表情包大战,贴吧表情包,表情包集中营,斗图">\n    <meta name="Description" content="全网热门表情。发表情，最大最全的表情包网站，分享最新最热的表情包、聊天表情、微信表情包、QQ表情包、金馆长表情包、蘑菇头表情包等各类表情。">\n    <meta name="referrer" content="no-referrer" />\n    <link rel="stylesheet" type="text/css" href="//lib.baomitu.com/semantic-ui/2.2.2/semantic.min.css" />\n    <link rel="stylesheet" type="text/css" href="/Public/css/fbq.css?v=2018" />\n    <script data-ad-client="ca-pub-5486123269162001" async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>\n    <script async src="//pagead2.googlesyndication.com/pagead/

## 2. Parse the HTML to get every image's title and URL

In [5]:
def parse_single_html(html):
    """
    Parse a single HTML page and extract the image data
    @return list((img_title, img_url))
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Each image is looked up inside a <div class="tagbqppdiv">
    img_divs = soup.find_all("div", class_="tagbqppdiv")
    datas = []
    for img_div in img_divs:
        img_node = img_div.find("img")
        if not img_node:
            continue
        # title holds the caption, data-original holds the image URL
        datas.append((img_node["title"], img_node["data-original"]))
    return datas
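
To make the expected page structure concrete, here is a small sketch that runs the same lookup on a hand-written snippet. The markup in `SAMPLE_HTML` is hypothetical, modeled only on the class and attribute names used above, and `parse_single_html_safe` is an illustrative variant that skips images missing either attribute instead of raising `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class and attribute names come from the notebook
SAMPLE_HTML = """
<div class="tagbqppdiv">
  <a href="#">
    <img class="lazy" title="sample one" data-original="http://example.com/a.jpg">
  </a>
</div>
<div class="tagbqppdiv"><a href="#">no image here</a></div>
<div class="tagbqppdiv"><img title="missing url"></div>
"""


def parse_single_html_safe(html):
    """Return (title, url) pairs, skipping images that lack either attribute."""
    soup = BeautifulSoup(html, "html.parser")
    datas = []
    for img_div in soup.find_all("div", class_="tagbqppdiv"):
        img_node = img_div.find("img")
        if img_node is None:
            continue
        # .get returns None instead of raising when the attribute is absent
        title = img_node.get("title")
        img_url = img_node.get("data-original")
        if title and img_url:
            datas.append((title, img_url))
    return datas


print(parse_single_html_safe(SAMPLE_HTML))
```

Only the first `div` yields a result. The second has no `img` and the third has no `data-original`, so both are skipped.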

In [6]:
import pprint
pprint.pprint(parse_single_html(htmls[0])[:10])

[('阿弥陀佛，施主放下骂图，立地成佛！',
  'http://ww2.sinaimg.cn/bmiddle/9150e4e5gy1g6qlfb10avj20d70f7gmf.jpg'),
 ('看见你就烦（草莓果酱ox白眼 GIF 动图表情包）',
  'http://wx1.sinaimg.cn/bmiddle/006APoFYly1g68tiftpbmg30bh0bh4o5.gif'),
 ('我在哭', 'http://wx3.sinaimg.cn/bmiddle/006qir4ogy1g54eoes2q2j309q09cdgh.jpg'),
 ('我的人生只要这样躺着混日子就很幸福了',
  'http://ww4.sinaimg.cn/bmiddle/9150e4e5gy1g6qm7x6fiuj20mw0mmt9y.jpg'),
 ('草莓果酱ox动图表情包',
  'http://wx1.sinaimg.cn/bmiddle/ceeb653ely1g64664qyc0g20bf0br4jn.gif'),
 ('噗呲 放屁（沙雕羊驼动图表情包）',
  'http://wx1.sinaimg.cn/bmiddle/78b88159gy1g69cze2hkkg20bp0bpx0y.gif'),
 ('来群里转转（熊猫头旋转 GIF 动图）',
  'http://wx1.sinaimg.cn/bmiddle/ceeb653ely1g68tzab8xng207608wwou.gif'),
 ('我爱你（草莓果酱oxGIF 动图表情包）',
  'http://wx2.sinaimg.cn/bmiddle/006APoFYly1g68uwg8djlg30b60b6e57.gif'),
 ('锁屏 带薪拉屎',
  'http://wx3.sinaimg.cn/bmiddle/ceeb653ely1g654hwdsjkg20dc0avgm4.gif'),
 ('我要可爱死你（草莓果酱ox表情包）',
  'http://wx2.sinaimg.cn/bmiddle/bf976b12gy1g68hx2gtleg208c08bk8q.gif')]


In [7]:
# Parse every downloaded HTML page
all_imgs = []
for html in htmls:
    all_imgs.extend(parse_single_html(html))
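
Since `all_imgs` is collected from 200 separate pages, the same image may appear more than once. Whether it actually does is not shown in this notebook, so the following is only a sketch of an optional cleanup step. The helper name `dedupe_by_url` is hypothetical.

```python
def dedupe_by_url(imgs):
    """Drop entries whose URL was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for title, img_url in imgs:
        if img_url in seen:
            continue
        seen.add(img_url)
        unique.append((title, img_url))
    return unique


# Made-up entries for illustration only
sample = [
    ("a", "http://example.com/1.jpg"),
    ("b", "http://example.com/2.gif"),
    ("a again", "http://example.com/1.jpg"),
]
print(dedupe_by_url(sample))
```

The original order is preserved, so indexes printed during a later download would still follow page order.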

In [8]:
all_imgs[:10]

[('阿弥陀佛，施主放下骂图，立地成佛！',
  'http://ww2.sinaimg.cn/bmiddle/9150e4e5gy1g6qlfb10avj20d70f7gmf.jpg'),
 ('看见你就烦（草莓果酱ox白眼 GIF 动图表情包）',
  'http://wx1.sinaimg.cn/bmiddle/006APoFYly1g68tiftpbmg30bh0bh4o5.gif'),
 ('我在哭', 'http://wx3.sinaimg.cn/bmiddle/006qir4ogy1g54eoes2q2j309q09cdgh.jpg'),
 ('我的人生只要这样躺着混日子就很幸福了',
  'http://ww4.sinaimg.cn/bmiddle/9150e4e5gy1g6qm7x6fiuj20mw0mmt9y.jpg'),
 ('草莓果酱ox动图表情包',
  'http://wx1.sinaimg.cn/bmiddle/ceeb653ely1g64664qyc0g20bf0br4jn.gif'),
 ('噗呲 放屁（沙雕羊驼动图表情包）',
  'http://wx1.sinaimg.cn/bmiddle/78b88159gy1g69cze2hkkg20bp0bpx0y.gif'),
 ('来群里转转（熊猫头旋转 GIF 动图）',
  'http://wx1.sinaimg.cn/bmiddle/ceeb653ely1g68tzab8xng207608wwou.gif'),
 ('我爱你（草莓果酱oxGIF 动图表情包）',
  'http://wx2.sinaimg.cn/bmiddle/006APoFYly1g68uwg8djlg30b60b6e57.gif'),
 ('锁屏 带薪拉屎',
  'http://wx3.sinaimg.cn/bmiddle/ceeb653ely1g654hwdsjkg20dc0avgm4.gif'),
 ('我要可爱死你（草莓果酱ox表情包）',
  'http://wx2.sinaimg.cn/bmiddle/bf976b12gy1g68hx2gtleg208c08bk8q.gif')]

In [9]:
len(all_imgs)

8999

## 3. Download the images to a local directory

In [10]:
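import os

# Make sure the output directory exists before writing into it
os.makedirs("./output", exist_ok=True)

# Strip punctuation: keep only Chinese characters, ASCII letters, and digits
reg = "[^0-9A-Za-z\u4e00-\u9fa5]"

for idx, (title, img_url) in enumerate(all_imgs):
    title = re.sub(reg, '', title)

    # Some titles are very long, so truncate them to 10 characters.
    # Note: images whose cleaned titles collide overwrite each other.
    title = title[:10]

    # Use the last three characters of the URL (jpg or gif) as the extension
    post_fix = img_url[-3:]
    filename = f"./output/{title}.{post_fix}"

    print(idx, filename)
    img_data = requests.get(img_url)
    with open(filename, "wb") as f:
        f.write(img_data.content)

print("success")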

0 ./output/阿弥陀佛施主放下骂图.jpg
1 ./output/看见你就烦草莓果酱o.gif
2 ./output/我在哭.jpg
3 ./output/我的人生只要这样躺着.jpg
4 ./output/草莓果酱ox动图表情.gif
5 ./output/噗呲放屁沙雕羊驼动图.gif
6 ./output/来群里转转熊猫头旋转.gif
7 ./output/我爱你草莓果酱oxG.gif
8 ./output/锁屏带薪拉屎.gif
9 ./output/我要可爱死你草莓果酱.gif
10 ./output/我尼玛傻了都.jpg
11 ./output/你今天表现蛮好10分.gif
12 ./output/真烦人得找个理由做她.gif
13 ./output/哇哦草莓果酱ox表情.jpg
14 ./output/哥哥又说笑了乔碧萝表.gif
15 ./output/锁屏带薪拉屎.gif
16 ./output/我简直难上加难麻将表.jpg
17 ./output/滚蛋相亲红包哥表情包.jpg
18 ./output/17591800下班.gif
19 ./output/乔碧萝听了都想笑斗鱼.jpg
20 ./output/四只猪疯狂对视.jpg
21 ./output/靓仔敬礼.gif
22 ./output/靓女敬礼.gif
23 ./output/要不是空气免费我都活.jpg
24 ./output/参加公主殿下萌娃黄夏.jpg
25 ./output/我就笑笑不说话.gif
26 ./output/今大很忙但没有忘记想.gif
27 ./output/走去哥哥心里.jpg
28 ./output/哈哈哈笑死我啦GIF.gif
29 ./output/垃圾分类请问你是什么.gif
30 ./output/买西瓜吗买一个大西瓜.gif
31 ./output/我太难难南南了.jpg
32 ./output/扭起来熊猫头尬舞GI.gif
33 ./output/1甲基8乙基二环42.jpg
34 ./output/自卑人士.gif
35 ./output/撒贝宁疑问问号哈.jpg
36 ./output/我给大家开个空调.jpg
37 ./output/我拎着开心快乐都掉光.gif
38 ./output/以后就叫我敢敢吧一个.jpg
39 ./output

308 ./output/你们要买西瓜吗买一送.jpg
309 ./output/你很过分啊你印尼小胖.gif
310 ./output/我信号圈外面没信号就.gif
311 ./output/我怀疑你在开车但我没.jpg
312 ./output/早安沙雕羊驼.gif
313 ./output/火车票机票酒店你觉得.jpg
314 ./output/睡觉吧狗命最重要.jpg
315 ./output/我好烦想饮酒.jpg
316 ./output/悲伤爆炸谁也安慰不了.jpg
317 ./output/忙碌的日子照顾好自己.gif
318 ./output/打球吗我不菜药酱药水.jpg
319 ./output/自带爱你的光芒印尼小.gif
320 ./output/你可以对我指点但不能.gif
321 ./output/抱紧我的猪宝贝.gif
322 ./output/偷偷亲你萌娃罗熙动图.gif
323 ./output/我太累了没办法扇你你.jpg
324 ./output/老戴里都是你罗熙表情.jpg
325 ./output/复习不完了.jpg
326 ./output/喂来份夜宵熬夜的人应.gif
327 ./output/说你到底是干了没有.gif
328 ./output/谁又在迫害我游ne娃.jpg
329 ./output/悲伤爆炸谁也安慰不了.jpg
330 ./output/人可以休息灵魂还要耍.jpg
331 ./output/我希望你做个甜甜的梦.jpg
332 ./output/来了来了萌娃黄夏温h.jpg
333 ./output/戴好没有快点鸭鸭表情.jpg
334 ./output/滚印尼小胖TATAN.gif
335 ./output/在吗出来挨打.jpg
336 ./output/我怀疑你有病但是我没.jpg
337 ./output/有车别藏着掖着啊都开.gif
338 ./output/一天除了骚啥也没干美.jpg
339 ./output/讲不通的不如直接开骂.jpg
340 ./output/老公宝贝哥哥亲爱的男.gif
341 ./output/狗子歪头问号表情包.jpg
342 ./output/笑到肚子疼.gif
343 ./output/你现在知错已经太晚了.jpg
344 ./output/这人傻逼吧印尼小

KeyboardInterrupt: 
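
The run above was interrupted partway through, and the log shows some filenames repeating (for example, indexes 8 and 15), which means later images overwrite earlier ones. A minimal sketch of a resumable variant follows, under stated assumptions: the index is prefixed to each filename to keep names unique, files already on disk are skipped, and `jpg` is used as a fallback extension. The names `build_filename`, `fetch_bytes`, `download_images`, and the `fetch` parameter are hypothetical.

```python
import os
import re
from urllib.parse import urlparse

# Same character filter as the notebook: digits, ASCII letters, Chinese characters
TITLE_RE = re.compile(r"[^0-9A-Za-z\u4e00-\u9fa5]")


def build_filename(idx, title, img_url, out_dir="./output", max_len=10):
    """Build a unique, filesystem-safe path for one image."""
    clean = TITLE_RE.sub("", title)[:max_len]
    # Read the extension from the URL path so a query string cannot leak in
    ext = os.path.splitext(urlparse(img_url).path)[1].lstrip(".").lower()
    ext = ext or "jpg"  # assumed fallback when the URL has no extension
    return os.path.join(out_dir, f"{idx:05d}_{clean}.{ext}")


def fetch_bytes(url, timeout=10):
    """Fetch raw bytes with requests; raises on HTTP errors or timeouts."""
    import requests  # imported lazily so the helpers above need no third-party package

    r = requests.get(url, timeout=timeout)
    r.raise_for_status()
    return r.content


def download_images(imgs, out_dir="./output", fetch=fetch_bytes):
    """Download images, skipping files already on disk; return the number written."""
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    for idx, (title, img_url) in enumerate(imgs):
        filename = build_filename(idx, title, img_url, out_dir)
        if os.path.exists(filename):
            continue  # finished in an earlier run
        data = fetch(img_url)
        # Write to a temporary name first so an interrupt never leaves a partial file
        tmp_name = filename + ".part"
        with open(tmp_name, "wb") as f:
            f.write(data)
        os.replace(tmp_name, filename)
        written += 1
    return written
```

After an interrupt, calling `download_images(all_imgs)` again would pick up at the first image that is not yet on disk, because the filenames are deterministic for a given `all_imgs` order.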