# Introduction

In this notebook I scrape image data from Google, Yandex, and Flickr using Google Chrome. The goal is to add images of Marvel superheroes and villains to the data set in this repository.

## Load all packages

In [1]:
from PIL import Image
import os
from selenium import webdriver
import time
import io
import requests
import hashlib

## Scraping images on Google

The code below scrapes images from Google. The method is based on this post, which explains how to scrape image data from Google: [https://medium.com/p/a96feda8af2d](https://medium.com/p/a96feda8af2d)

In [2]:
search_term = 'spiderman'

In [3]:
DRIVER_PATH = './chromedriver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

In [4]:
wd.get('https://google.com')

In [5]:
def fetch_image_urls(query:str, max_links_to_fetch:int, wd:webdriver.Chrome, sleep_between_interactions:float=1):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)    
    
    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))

    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        number_results = len(thumbnail_results)
        
        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
        
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract image urls    
            actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))

            image_count = len(image_urls)

            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
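        else:
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(30)
            # Stop with the links collected so far when this pass found no new
            # thumbnails or there is no "load more" button left to click.
            load_more_buttons = wd.find_elements_by_css_selector(".mye4qd")
            if number_results == results_start or not load_more_buttons:
                return image_urls
            wd.execute_script("document.querySelector('.mye4qd').click();")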

        # move the result startpoint further down
        results_start = len(thumbnail_results)

    return image_urls
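
The `else` attached to the `for` loop in `fetch_image_urls` is easy to misread: it runs only when the loop finishes without hitting `break`, that is, when a pass over the thumbnails did not reach the requested number of links. The small sketch below shows the same construct in isolation. `collect_until_limit` and its inputs are made up for illustration and are not part of the scraper.

```python
def collect_until_limit(items, wanted, limit):
    """Collect items containing `wanted` and report whether `limit` was reached."""
    found = []
    for item in items:
        if wanted in item:
            found.append(item)
        if len(found) >= limit:
            reached = True
            break
    else:
        # Runs only when the loop ended without hitting `break`.
        reached = False
    return found, reached


enough = collect_until_limit(["a1", "b1", "a2", "a3"], "a", 2)
too_few = collect_until_limit(["a1", "b1"], "a", 2)
print(enough)
print(too_few)
```

In the scraper, the `break` branch corresponds to "enough links found" and the `else` branch to "need to look for more results".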

In [6]:
def persist_image(folder_path:str,url:str):
    try:
        image_content = requests.get(url).content

    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        # Nothing to save when the download failed
        return

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG")
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")
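
`persist_image` names each file after the first 10 hex characters of the SHA-1 hash of the downloaded bytes, so the same image fetched twice ends up under the same file name. The sketch below isolates that naming step. `content_file_path` is a hypothetical helper I use only for illustration, and the byte strings stand in for real image data.

```python
import hashlib
import os


def content_file_path(folder_path: str, image_content: bytes) -> str:
    """Build the file path used for an image from a hash of its raw bytes."""
    digest = hashlib.sha1(image_content).hexdigest()[:10]
    return os.path.join(folder_path, digest + ".jpg")


# Placeholder bytes instead of real downloads
same_a = content_file_path("characters/spiderman", b"same bytes")
same_b = content_file_path("characters/spiderman", b"same bytes")
other = content_file_path("characters/spiderman", b"different bytes")

print(same_a == same_b)  # identical content gives an identical path
print(same_a == other)   # different content should give a different path
```

Because only 10 hex characters are kept, two different images could in principle collide, but for a few hundred images per character that is very unlikely.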

In [7]:
def search_and_download(search_term:str,driver_path:str,target_path='./characters',number_images=5):
    target_folder = os.path.join(target_path,'_'.join(search_term.lower().split(' ')))

    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(executable_path=driver_path) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5)
        
    for elem in res:
        persist_image(target_folder,elem)

In the code below we may have to adjust `number_images`, the number of images to pull from Google. If Google returns fewer images than requested, the run may stop with an error, so lower the number if that happens.
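
One way to make a short result less fragile is to normalize whatever the scraper hands back before looping over it. This is only a sketch: `usable_urls` is a hypothetical helper that is not part of the notebook, and the example URL is a placeholder.

```python
def usable_urls(result, requested: int) -> set:
    """Return a set of URLs even when the scraper gave back None or too few links."""
    urls = set(result) if result else set()
    if len(urls) < requested:
        print(f"WARNING - wanted {requested} image links, got {len(urls)}")
    return urls


# A scraper result of None (nothing returned) becomes an empty set
nothing = usable_urls(None, 5)
# A complete result passes through unchanged
complete = usable_urls({"https://example.com/a.jpg"}, 1)
print(nothing, complete)
```

In `search_and_download`, this would mean looping over `usable_urls(res, number_images)` instead of `res`.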

In [8]:
img_path = os.path.join('flowers', search_term)

# Create the folder (and 'flowers/' itself) if it does not exist yet
os.makedirs(img_path, exist_ok=True)

search_and_download(search_term=search_term,driver_path=DRIVER_PATH,number_images=50)

Found: 100 search results. Extracting links from 0:100
Found: 50 image links, done!
SUCCESS - saved https://media.istockphoto.com/photos/yellow-daffodil-against-white-background-picture-id120727389 - as ./flowers/daffodil/1397234944.jpg
SUCCESS - saved https://cdn.shopify.com/s/files/1/1902/7917/products/DaffodilFortissimo2021-Portrait_x2000_crop_center.jpg?v=1617279021 - as ./flowers/daffodil/3207aa00ae.jpg
SUCCESS - saved https://mobileimages.lowes.com/productimages/f3c85dfb-cb30-4c06-878b-e245bd197ad4/11369044.jpg?size=pdhi - as ./flowers/daffodil/466b6f0fdc.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcScdj4m1UGmu2dgxnD61D7cjc9_gTcgTIBETw&usqp=CAU - as ./flowers/daffodil/438afe9ced.jpg
SUCCESS - saved https://h2.commercev3.net/cdn.brecks.com/images/800/99613A.jpg - as ./flowers/daffodil/1c1600bc94.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSJME7IgOraWhIyp5NQDUme7uEqtaXTvFBX4g&usqp=CAU - as ./flowers/daffodil/db6bf5e341.jpg


## Scraping images on Flickr

The two commands below search Flickr for the search term and fetch the number of images given after it. To download, set up, and run 'flickrGetUrl.py' and 'get_images.py', see the following repository: [https://github.com/ultralytics/flickr_scraper](https://github.com/ultralytics/flickr_scraper). The versions run here are slightly modified for easier use in this repository.

In [9]:
run flickrGetUrl.py daffodil 500

Fetching url for image number 1
Fetching url for image number 2
Fetching url for image number 3
Fetching url for image number 4
Fetching url for image number 5
Fetching url for image number 6
Fetching url for image number 7
Fetching url for image number 8
Fetching url for image number 9
Fetching url for image number 10
Fetching url for image number 11
Fetching url for image number 12
Fetching url for image number 13
Fetching url for image number 14
Fetching url for image number 15
Fetching url for image number 16
Fetching url for image number 17
Fetching url for image number 18
Fetching url for image number 19
Fetching url for image number 20
Fetching url for image number 21
Fetching url for image number 22
Fetching url for image number 23
Fetching url for image number 24
Fetching url for image number 25
Fetching url for image number 26
Fetching url for image number 27
Fetching url for image number 28
Fetching url for image number 29
Fetching url for image number 30
Fetching url for im

Fetching url for image number 251
Fetching url for image number 252
Fetching url for image number 253
Fetching url for image number 254
Fetching url for image number 255
Fetching url for image number 256
Fetching url for image number 257
Fetching url for image number 258
Fetching url for image number 259
Fetching url for image number 260
Fetching url for image number 261
Fetching url for image number 262
Fetching url for image number 263
Fetching url for image number 264
Fetching url for image number 265
Fetching url for image number 266
Fetching url for image number 267
Fetching url for image number 268
Fetching url for image number 269
Fetching url for image number 270
Fetching url for image number 271
Fetching url for image number 272
Fetching url for image number 273
Fetching url for image number 274
Fetching url for image number 275
Fetching url for image number 276
Fetching url for image number 277
Fetching url for image number 278
Fetching url for image number 279
Fetching url f

Done fetching urls, fetched 500 urls out of 500
Writing out the urls in the current directory
Done!!!


In [10]:
run get_images.py image_urls.csv daffodil

Starting download 1 of  333
Done downloading 1 of 333
Starting download 2 of  333
Done downloading 2 of 333
Starting download 3 of  333
Done downloading 3 of 333
Starting download 4 of  333
Done downloading 4 of 333
Starting download 5 of  333
Done downloading 5 of 333
Starting download 6 of  333
Done downloading 6 of 333
Starting download 7 of  333
Done downloading 7 of 333
Starting download 8 of  333
Done downloading 8 of 333
Starting download 9 of  333
Done downloading 9 of 333
Starting download 10 of  333
Done downloading 10 of 333
Starting download 11 of  333
Done downloading 11 of 333
Starting download 12 of  333
Done downloading 12 of 333
Starting download 13 of  333
Done downloading 13 of 333
Starting download 14 of  333
Done downloading 14 of 333
Starting download 15 of  333
Done downloading 15 of 333
Starting download 16 of  333
Done downloading 16 of 333
Starting download 17 of  333
Done downloading 17 of 333
Starting download 18 of  333
Done downloading 18 of 333
Starting d

Done downloading 146 of 333
Starting download 147 of  333
Done downloading 147 of 333
Starting download 148 of  333
Done downloading 148 of 333
Starting download 149 of  333
Done downloading 149 of 333
Starting download 150 of  333
Done downloading 150 of 333
Starting download 151 of  333
Done downloading 151 of 333
Starting download 152 of  333
Done downloading 152 of 333
Starting download 153 of  333
Done downloading 153 of 333
Starting download 154 of  333
Done downloading 154 of 333
Starting download 155 of  333
Done downloading 155 of 333
Starting download 156 of  333
Done downloading 156 of 333
Starting download 157 of  333
Done downloading 157 of 333
Starting download 158 of  333
Done downloading 158 of 333
Starting download 159 of  333
Done downloading 159 of 333
Starting download 160 of  333
Done downloading 160 of 333
Starting download 161 of  333
Done downloading 161 of 333
Starting download 162 of  333
Done downloading 162 of 333
Starting download 163 of  333
Done downloadi

Done downloading 288 of 333
Starting download 289 of  333
Done downloading 289 of 333
Starting download 290 of  333
Done downloading 290 of 333
Starting download 291 of  333
Done downloading 291 of 333
Starting download 292 of  333
Done downloading 292 of 333
Starting download 293 of  333
Done downloading 293 of 333
Starting download 294 of  333
Done downloading 294 of 333
Starting download 295 of  333
Done downloading 295 of 333
Starting download 296 of  333
Done downloading 296 of 333
Starting download 297 of  333
Done downloading 297 of 333
Starting download 298 of  333
Done downloading 298 of 333
Starting download 299 of  333
Done downloading 299 of 333
Starting download 300 of  333
Done downloading 300 of 333
Starting download 301 of  333
Done downloading 301 of 333
Starting download 302 of  333
Done downloading 302 of 333
Starting download 303 of  333
Done downloading 303 of 333
Starting download 304 of  333
Done downloading 304 of 333
Starting download 305 of  333
Done downloadi

## Scraping images on Yandex

The next command searches Yandex for images of the search term and saves them in a temporary folder named 'downloads', which I delete after moving the images out. Note that this command and the cleanup commands further down run in the shell rather than in Python (hence the ! before each command).

In [11]:
! yandex-images-download Chrome --keywords "daffodil" --limit 500 

Output directory is set to "downloads/"
Limit of images is set to 500
Downloading images for daffodil...
  Found 50 pages of daffodil.
  Scrapping page 1/17...
    Downloaded the image. ==> downloads/daffodil/1-1276781550UFzX.jpg
    Downloaded the image. ==> downloads/daffodil/post_5d59a23c287b8.jpg
    Downloaded the image. ==> downloads/daffodil/images%7Ccms-image-000038783.jpg
    Downloaded the image. ==> downloads/daffodil/zheltye-narcissy-populyarnye-sorta-i-sovety-po-uho.jpg
    fail: https://fs3.fotoload.ru/f/0318/1522120735/af7634e0f8.jpg error: <class 'requests.exceptions.SSLError'>
    Downloaded the image. ==> downloads/daffodil/narcissy_posadka_i_vyrashchivanye_v_otkrytom_grunt.jpg
    Downloaded the image. ==> downloads/daffodil/plant-flower-botany-yellow-flora-narcissus-flowerb.jpg
    Downloaded the image. ==> downloads/daffodil/daffodil-1403154_1280.jpg
    fail: http://mobimg.b-cdn.net/v3/fetch/97/978ee474695a5e938b7785bc23912b02.jpeg error: ('Something is wrong here

    Downloaded the image. ==> downloads/daffodil/flower-2314333_1280.jpg
    Downloaded the image. ==> downloads/daffodil/screen-6.jpg
    Downloaded the image. ==> downloads/daffodil/1920x1200_450870_[www.ArtFile.ru].jpg
    fail: http://mobimg.b-cdn.net/v3/fetch/08/08871092a8415fca40efc5b2ac1f8a3c.jpeg error: ('Something is wrong here.', " Error: (<class 'KeyError'>, KeyError('content-type'))")
    Downloaded the image. ==> downloads/daffodil/grass-plant-meadow-flower-spring-yellow-flora-flow.jpg
    Downloaded the image. ==> downloads/daffodil/Kak-pravilno-posadit-narcissy-osenju-20.jpg
    Downloaded the image. ==> downloads/daffodil/40287492768369b.jpg (2).jpg
    Downloaded the image. ==> downloads/daffodil/zTX5anoac.jpg
    Downloaded the image. ==> downloads/daffodil/4508295636_91a1e32b7f_o.jpg
    Downloaded the image. ==> downloads/daffodil/308997-svetik_1600x1200.jpg
    Downloaded the image. ==> downloads/daffodil/39273.750x0.jpg
    Downloaded the image. ==> downloads/daff

    Downloaded the image. ==> downloads/daffodil/daffodils-1260004_1280.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/Nartsiss-32.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/narcissy-zheltye-lepestki-3191.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/nature-blossom-plant-white-flower-petal-bloom-spri (1).jpg
    Downloaded the image. ==> downloads/daffodil/celaya-polyana-raspustivshixsya-narcissov.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/jonkelevidniy.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/096f0ca2b1dedaaff3c0852b43fb76e0.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/f163ef3d0367f94ca5bd0554eab57cbb.jpg (1).jpg
    fail: https://s1.1zoom.ru/b5050/901/Daffodils_Closeup_448493_1440x900.jpg error: img_url response is not ok. response: <Response [404]>.
    fail: https://upload.wikimedia.org/wikipedia/commons/thumb/0/03/Narcissus-closeup.jpg/1200px-Narcissus-closeup.jpg error: img_url re

    Downloaded the image. ==> downloads/daffodil/%D0%9D%D0%B0%D1%80%D1%86%D0%B8%D1%81%D1%81%D1%8B-% (2).jpg
    Downloaded the image. ==> downloads/daffodil/6091004.jpg
    Downloaded the image. ==> downloads/daffodil/daffodil-288004_1280.jpg
    Downloaded the image. ==> downloads/daffodil/nartsissi_3.jpg
    Downloaded the image. ==> downloads/daffodil/48307409bb29d45.jpg
    fail: https://www.publicdomainpictures.net/pictures/10000/velka/750-1232688607r3pR.jpg error: img_url response is not ok. response: <Response [503]>.
    Downloaded the image. ==> downloads/daffodil/priroda-narcissy-zeltye.jpg
    Downloaded the image. ==> downloads/daffodil/plant-meadow-flower-petal-botany-yellow-flora-yell.jpg
    Downloaded the image. ==> downloads/daffodil/Nartsiss-27.jpg
    Downloaded the image. ==> downloads/daffodil/screen-2.jpg
    Downloaded the image. ==> downloads/daffodil/814282.jpg
    fail: https://mobimg.b-cdn.net/v3/fetch/ce/ce1a37e3c46999f97aa6e07e25eca15d.jpeg error: ('Somethi

    Downloaded the image. ==> downloads/daffodil/f056de25-eb28-4d88-8ae4-4868a7ed9a91_3.jpg
    Downloaded the image. ==> downloads/daffodil/narcissus-670640_1280.jpg
    Downloaded the image. ==> downloads/daffodil/daffodil-field-324108_1280.jpg
    Downloaded the image. ==> downloads/daffodil/2b3d621a313f1955d8fec9f3f12a5436.jpg
    Downloaded the image. ==> downloads/daffodil/72442c30711c8de1df3671ca53a09312.jpg
    Downloaded the image. ==> downloads/daffodil/blossom-plant-flower-bloom-spring-yellow-daffodils.jpg
    fail: https://mobimg.b-cdn.net/v3/fetch/dc/dc0da060a1dbf39981f165ef2219bf36.jpeg error: ('Something is wrong here.', " Error: (<class 'KeyError'>, KeyError('content-type'))")
    fail: https://upload.wikimedia.org/wikipedia/commons/3/39/Daffodils_bloom.jpg error: img_url response is not ok. response: <Response [403]>.
    Downloaded the image. ==> downloads/daffodil/2048x1537_724158_[www.ArtFile.ru].jpg
    Downloaded the image. ==> downloads/daffodil/narcissus-1299777

  Scrapping page 9/17...
    Downloaded the image. ==> downloads/daffodil/blossom-plant-white-flower-petal-bloom-pollen-spri (1).jpg
    Downloaded the image. ==> downloads/daffodil/Daffodils-Wallpaper-For-Desktop.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/daffodil-87648_1280.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/gornyj-narcissa.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/jmtZs42DITINW2R4DbdA.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/a5587868f4b0c89.jpg (1).jpg
    Downloaded the image. ==> downloads/daffodil/narcissy-makro-boke-1070.jpg (2).jpg
    fail: https://mobimg.b-cdn.net/v3/fetch/c6/c65a5bdafdf4d45b59f803c9d3c8e3ee.jpeg error: ('Something is wrong here.', " Error: (<class 'KeyError'>, KeyError('content-type'))")
    Downloaded the image. ==> downloads/daffodil/nature-blossom-plant-white-flower-petal-bloom-spri (3).jpg
    Downloaded the image. ==> downloads/daffodil/97-1270834846A6A7.jpg (1).jpg
    D

### Do some cleanup

Now I move all files to the corresponding flowers directory and remove the duplicates. Duplicates are downloaded with a space and a number in the name (e.g., 'file (1).jpg' would be a duplicate of 'file.jpg'). Note that the last command removes every file with a space in its name, not only the numbered duplicates.

In [12]:
! mv downloads/daffodil/* flowers/daffodil/.

In [13]:
! rm -rf downloads

In [14]:
! rm flowers/daffodil/*\ *
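
The same cleanup can be sketched in Python, which makes it possible to skip only the files that end in a space and a number in parentheses instead of every file with a space in its name. `move_without_duplicates` and the regular expression are my own assumptions about what counts as a duplicate, based on the file names in the output above. The demo runs in a temporary folder, so it does not touch the real 'downloads' or 'flowers' directories.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Assumed duplicate pattern: the name ends in " (N).ext", e.g. "file (1).jpg"
DUPLICATE_SUFFIX = re.compile(r" \(\d+\)\.[A-Za-z0-9]+$")


def move_without_duplicates(src_dir, dst_dir):
    """Move files from src_dir to dst_dir, leaving numbered duplicates behind."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved, skipped = [], []
    for path in sorted(src.iterdir(), key=lambda p: p.name):
        if not path.is_file():
            continue
        if DUPLICATE_SUFFIX.search(path.name):
            skipped.append(path.name)
            continue
        shutil.move(str(path), str(dst / path.name))
        moved.append(path.name)
    return moved, skipped


# Demo on throwaway files
with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp, "downloads", "daffodil")
    downloads.mkdir(parents=True)
    for name in ["file.jpg", "file.jpg (1).jpg", "other (2).jpg"]:
        (downloads / name).write_bytes(b"placeholder")
    moved, skipped = move_without_duplicates(downloads, Path(tmp, "flowers", "daffodil"))
    print(moved, skipped)
```

The skipped duplicates stay in the source folder, so removing 'downloads' afterwards deletes them.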