This repository has been archived by the owner on Sep 6, 2023. It is now read-only.

Using a Python crawler to save the fruit "ID photos" from the USDA website #114

Open
jwenjian opened this issue Oct 4, 2019 · 13 comments

@jwenjian
Owner

jwenjian commented Oct 4, 2019

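The USDA has produced about 7,500 watercolor "ID photos" of the world's known fruits and offers them as high-resolution downloads. The link is here.

(Image: strawberry)

The goal of this crawler is to save these ID photos to the local disk.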


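Analysis

The site lists 7,584 images in total, split across 380 pages with 20 records per page.

Link to the first page: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=0
Link to the second page: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=20
...
And so on, which keeps things fairly simple.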
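
The pagination rule above can be sketched in a few lines. This is only an illustration of the arithmetic, assuming the `start` parameter is always the page index times 20, as the two example links suggest. The helper names `page_count` and `page_url` are hypothetical and do not appear in the original script.

```python
import math

BASE_URL = "https://usdawatercolors.nal.usda.gov/pom/search.xhtml"
TOTAL_IMAGES = 7584  # total reported in the post
PAGE_SIZE = 20       # records per page reported in the post


def page_count(total=TOTAL_IMAGES, page_size=PAGE_SIZE):
    """Number of pages needed to list `total` records."""
    return math.ceil(total / page_size)


def page_url(page_index, page_size=PAGE_SIZE):
    """URL of the zero-based page `page_index` (assumed offset rule)."""
    return "{}?start={}".format(BASE_URL, page_index * page_size)


if __name__ == "__main__":
    print(page_count())
    print(page_url(0))
    print(page_url(1))
```

Rounding up matters here: 7,584 records do not divide evenly by 20, so the last page is only partly filled.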

The HTML element layout of each record is as follows:

From it we can get:

  • artist
  • year
  • scientific name
  • common name
  • thumbnail URL

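Clicking an image opens its detail page:

Clicking the Download high resolution button there gives us the original image.

That would mean opening a new page for every image, though. It turned out that the thumbnail URL and the original image URL are related:

  • Thumbnail URL: ../download/POM00007435/thumbnail
  • Original image URL: https://usdawatercolors.nal.usda.gov/pom/download.xhtml?id=POM00007435

So once we extract POM00007435 from the thumbnail URL, we can build the corresponding original image URL.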
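
As a small standalone sketch of that mapping: the function below looks for the path segment that follows `download` rather than relying on a fixed position. It assumes every thumbnail path has the `.../download/<id>/thumbnail` shape shown above. The name `original_url_from_thumbnail` is hypothetical.

```python
DOWNLOAD_URL = "https://usdawatercolors.nal.usda.gov/pom/download.xhtml?id={}"


def original_url_from_thumbnail(thumb_src):
    """Build the original-image URL from a thumbnail src attribute.

    Assumes the src looks like '../download/POM00007435/thumbnail'.
    """
    parts = thumb_src.split("/")
    if "download" not in parts:
        raise ValueError("unexpected thumbnail path: {!r}".format(thumb_src))
    id_position = parts.index("download") + 1
    if id_position >= len(parts) or not parts[id_position]:
        raise ValueError("no image id in thumbnail path: {!r}".format(thumb_src))
    return DOWNLOAD_URL.format(parts[id_position])


if __name__ == "__main__":
    print(original_url_from_thumbnail("../download/POM00007435/thumbnail"))
```

Raising on an unexpected path makes a layout change on the site visible right away, instead of silently producing a wrong URL.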

Crawler

Dependencies

  1. requests
  2. beautifulsoup4

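Source code

  1. Loop 380 times, once per page
  2. On each page, get the HTML tags for the 20 records
  3. For each HTML tag:
  4. Get the artist, year, and other fields
  5. Build the original image URL from the thumbnail URL
  6. Download the original image and save it locally
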
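import os

import requests
from bs4 import BeautifulSoup

IMG_FOLDER = 'fruit_images/'


def run():
    # Create the output folder up front; open() in download_and_save fails if it is missing.
    os.makedirs(IMG_FOLDER, exist_ok=True)
    # idx is the zero-based page index; each page lists 20 records.
    for idx in range(380):
        resp = requests.get(
            'https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start={}&searchText=&searchField=&sortField='.format(
                idx * 20))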
        soup = BeautifulSoup(resp.text, 'html.parser')
        for (div_idx, div) in enumerate(soup.select('div.document')):
            doc = div.select_one('dl.defList')
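            # Bare pseudo-class selectors such as ':nth-child(2)' need Beautiful Soup 4.7.0 or later;
            # older releases raise "ValueError: A pseudo-class must be prefixed with a tag name."
            artist = doc.select_one(':nth-child(2)>a').get_text()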
            year = doc.select_one(':nth-child(4)>a').get_text()
            # cannot parse scientific name or common name for some pictures, just use 'none' instead to avoid terminating
            scientific_name = 'none' if doc.select_one(':nth-child(6)>a') is None else doc.select_one(
                ':nth-child(6)>a').get_text()
            common_name = 'none' if doc.select_one(':nth-child(8)>a') is None else doc.select_one(
                ':nth-child(8)>a').get_text()
            thumb_pic_src = div.select_one('div.thumb-frame>a>img')['src']
            # 1-based running number across all pages; idx is zero-based, so the first record is 1.
            record_id = idx * 20 + div_idx + 1
            info = FruitInfo(record_id, artist, year, scientific_name, common_name, thumb_pic_src)
            print(info)
            info.download_and_save()


class FruitInfo:
    def __init__(self, id, artist, year, scientific_name, common_name, thumb_pic_url):
        self.id = id
        self.artist = artist
        self.year = year
        self.scientific_name = scientific_name
        self.common_name = common_name
        self.thumb_pic_url = thumb_pic_url

    def download_and_save(self):
        filename = '{}-{}-{}-{}.png'.format(self.id, self.common_name, self.year, self.artist).replace(' ', '_')
        print('filename = ', filename)
        ori_img_url = self.__parse_ori_img_url()
        print('original img url = ', ori_img_url)
        resp = requests.get(ori_img_url)
        with open(IMG_FOLDER + filename, 'wb') as f:
            f.write(resp.content)
            print('saved...', filename)

    def __parse_ori_img_url(self) -> str:
        img_id = self.thumb_pic_url.split('/')[2]
        print('img id = ', img_id)
        return 'https://usdawatercolors.nal.usda.gov/pom/download.xhtml?id={}'.format(img_id)

    def __str__(self):
        return 'FruitInfo(artist={},year={},scientific_name={},common_name={},thumb_pic_url={})'.format(
            self.artist, self.year, self.scientific_name, self.common_name, self.thumb_pic_url)


if __name__ == '__main__':
    run()
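
The filename in download_and_save is built straight from scraped fields, so a name containing a path separator or another character the filesystem rejects would make open() fail. Below is a minimal sketch of a sanitizing helper. The function `safe_filename` and the sample values are hypothetical and not part of the original script, and the set of replaced characters is an assumption based on common filesystem restrictions.

```python
import re


def safe_filename(record_id, common_name, year, artist, ext="png"):
    """Join the fields into a filename that avoids problematic characters."""
    raw = "{}-{}-{}-{}".format(record_id, common_name, year, artist)
    # Collapse each run of whitespace into a single underscore.
    cleaned = re.sub(r"\s+", "_", raw.strip())
    # Replace characters that common filesystems do not allow in file names.
    cleaned = re.sub(r'[\\/:*?"<>|]', "-", cleaned)
    return "{}.{}".format(cleaned, ext)


if __name__ == "__main__":
    # Sample values for illustration only.
    print(safe_filename(1, "apples/pears", 1905, "Jane Doe"))
```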

To run this locally you need to configure a proxy; otherwise the USDA website cannot be opened.
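
One way to pass a proxy to requests is the `proxies` argument of `requests.get`, sketched below. The proxy address `http://127.0.0.1:1080` is a placeholder, not a value from the original post, and the helpers `build_proxies` and `fetch` are hypothetical. requests also honors the `HTTP_PROXY` and `HTTPS_PROXY` environment variables by default, which avoids code changes.

```python
def build_proxies(proxy_url):
    """Return the mapping that requests expects for its `proxies` argument."""
    return {"http": proxy_url, "https": proxy_url}


def fetch(url, proxy_url, timeout=30):
    """GET `url` through the given proxy and return the response."""
    # Imported here so build_proxies stays usable without requests installed.
    import requests

    resp = requests.get(url, proxies=build_proxies(proxy_url), timeout=timeout)
    resp.raise_for_status()
    return resp


if __name__ == "__main__":
    # Placeholder proxy address; replace it with your own.
    print(build_proxies("http://127.0.0.1:1080"))
```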

GitHub

usda-fruit-img-spider

Packaged images.zip (large images, not the originals), 1.1 GB

@Kingson

Kingson commented Nov 26, 2019

Can these be used commercially?

@catroll

catroll commented Nov 26, 2019

This is nice. I can save them and show them to my daughter.

@jwenjian
Owner Author

@catroll Indeed, kids will probably love fruit pictures in this style.

@jwenjian
Owner Author

@Kingson Not sure. I haven't found a clear copyright statement yet. I'd suggest looking further on the USDA website 😭

@screamff

screamff commented Dec 4, 2019

You could also crawl Wikipedia to build an English-to-Chinese name mapping and substitute the names.

@1181406961

The distorted images are probably caused by network problems.

@jwenjian
Owner Author

jwenjian commented Dec 4, 2019

@screamff Good idea.

@Mran

Mran commented May 13, 2020

I crawled this too. I got all the large images and converted them to WebP.

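@jwenjian
Owner Author

> I crawled this too. I got all the large images and converted them to WebP.

Nice. It's good for practice, since the site has hardly any anti-scraping measures.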

@TomIsYourName

ValueError: A pseudo-class must be prefixed with a tag name.
Has anyone else run into this error?

@liuyib

liuyib commented Oct 31, 2022

404

@jwenjian
Owner Author

jwenjian commented Nov 1, 2022

> 404

So it is. The latest address is here, but the page structure has completely changed, so the automated Python script probably can't be used as-is.

https://naldc.nal.usda.gov/usda_pomological_watercolor?q=&search_field=all_fields
