# Introduction

In this notebook I scrape image data from Google and Yandex using Google Chrome. The goal is to add images of Marvel superheroes and villains to the data set in this repository.

So far, I have added:

Good

- Spider-Man
- Captain Marvel

Evil

## Load all packages

In [1]:
from PIL import Image
import os
from selenium import webdriver
import time
import io
import requests
import hashlib

## Scraping images from Google

The code below scrapes images from Google. The method is based on this post, which explains how to scrape image data from Google: [https://medium.com/p/a96feda8af2d](https://medium.com/p/a96feda8af2d)

In [2]:
character = 'captain_america'
class_term = 'good'
search_term = character + '_' + 'cartoons'

In [3]:
DRIVER_PATH = './chromedriver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

In [4]:
wd.get('https://google.com')


In [5]:
def fetch_image_urls(query:str, max_links_to_fetch:int, wd:webdriver.Chrome, sleep_between_interactions:float=1):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)    
    
    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))

    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        number_results = len(thumbnail_results)
        
        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
        
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract image urls    
            actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))

            image_count = len(image_urls)

            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
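        else:
            # the for loop finished without reaching the requested number of links
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(30)
            # find_elements returns an empty list (instead of raising) when nothing matches
            load_more_button = wd.find_elements_by_css_selector(".mye4qd")
            if not load_more_button or number_results == results_start:
                # no button or no new thumbnails: return what was collected rather than None
                return image_urls
            wd.execute_script("document.querySelector('.mye4qd').click();")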

        # move the result startpoint further down
        results_start = len(thumbnail_results)

    return image_urls

In [6]:
def persist_image(folder_path:str,url:str):
    try:
        image_content = requests.get(url).content

    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        # nothing was downloaded, so there is nothing to save
        return

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG")
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")
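
The file name used in `persist_image` is the first ten hex characters of the SHA-1 digest of the downloaded bytes, so identical downloads map to the same file name and overwrite each other instead of piling up. Below is a minimal sketch of that naming rule in isolation; the helper name `content_file_name` is hypothetical and not part of the notebook:

```python
import hashlib

def content_file_name(image_content: bytes, extension: str = ".jpg") -> str:
    """Hypothetical helper: name a file after the first 10 hex chars of its SHA-1 digest."""
    return hashlib.sha1(image_content).hexdigest()[:10] + extension

# Identical bytes always give the same name; different bytes almost certainly give a different one.
same_a = content_file_name(b"example image bytes")
same_b = content_file_name(b"example image bytes")
other = content_file_name(b"other image bytes")
print(same_a == same_b, same_a == other)
```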

In [7]:
def search_and_download(search_term:str,driver_path:str,target_path='characters/' + class_term,number_images=5):
    target_folder = os.path.join(target_path,'_'.join(search_term.lower().split(' ')))

    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(executable_path=driver_path) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5)
        
    for elem in res:
        persist_image(target_folder,elem)

In the code below we may have to adjust the number of images to pull from Google. If the search finds fewer images than the number requested, the call can fail with an error.
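
One way to make the download step tolerant of a short or missing result is to normalise whatever the URL-fetching step returns before looping over it. The sketch below is an assumption about how such a guard could look, not part of the original notebook; `usable_urls` is a hypothetical helper name.

```python
def usable_urls(result, requested: int) -> set:
    """Hypothetical guard: turn a possibly missing result into a set and report shortfalls."""
    urls = set(result) if result else set()
    if len(urls) < requested:
        print(f"WARNING - asked for {requested} links but only got {len(urls)}")
    return urls

# A missing result (None) becomes an empty set, so a later `for url in ...` loop simply does nothing.
print(usable_urls(None, 30))
print(len(usable_urls({"https://example.com/a.jpg"}, 30)))
```

Inside `search_and_download`, the loop could then iterate over `usable_urls(res, number_images)` instead of `res`.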

In [8]:
img_path = 'characters/' + class_term + '/' + search_term + '/'

# create the target folder (and any missing parents) only if it does not exist yet
os.makedirs(img_path, exist_ok=True)

search_and_download(search_term=search_term,driver_path=DRIVER_PATH,number_images=30)

Found: 100 search results. Extracting links from 0:100
Found: 30 image links, done!
SUCCESS - saved https://img.favpng.com/16/3/3/captain-america-hulk-cartoon-comics-drawing-png-favpng-kFsjiYzQhjxqghyprQ4k6iyLT.jpg - as characters/good/captain_america_cartoons/7ef95ea235.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR_QVAY0QTVqTAgBeQozKIaULIZ-K3HUccSIg&usqp=CAU - as characters/good/captain_america_cartoons/de347e2125.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQlpEQaiZumgLj7mD7r9JcqavAw9zesS4Soig&usqp=CAU - as characters/good/captain_america_cartoons/082f3948ed.jpg
SUCCESS - saved https://i.ytimg.com/vi/0Zph6S5iEdc/maxresdefault.jpg - as characters/good/captain_america_cartoons/b50ae759ae.jpg
SUCCESS - saved https://i.pinimg.com/736x/15/f9/6d/15f96d6a7f1bbe1ec1c0c82c080966d5.jpg - as characters/good/captain_america_cartoons/a128a1e8d5.jpg
SUCCESS - saved https://i5.walmartimages.com/asr/03d55391-2c1f-4191-a7be-adaec1543683.b3a75

## Scraping images on Yandex

The next command searches Yandex for images of the search term and puts them in a temporary folder named 'downloads', which I delete after moving the images out. Note that these commands run in the shell rather than in Python (hence the ! before each command).
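
The keywords passed to the command are the search term with spaces instead of underscores. Here is a small sketch of deriving the command arguments from `search_term`, assuming the `yandex-images-download` invocation used in this notebook; the helper name `yandex_command` is hypothetical:

```python
import shlex

def yandex_command(search_term: str, limit: int = 50) -> list:
    """Hypothetical helper: build the argument list for the yandex-images-download call."""
    keywords = search_term.replace("_", " ")
    return ["yandex-images-download", "Chrome", "--keywords", keywords, "--limit", str(limit)]

args = yandex_command("captain_america_cartoons")
# shlex.join quotes the keywords so the string can be pasted after `!` in a notebook cell
print(shlex.join(args))
```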

In [9]:
! yandex-images-download Chrome --keywords "captain america cartoons" --limit 50 

Output directory is set to "downloads/"
Limit of images is set to 50
Downloading images for captain america cartoons...
  Found 50 pages of captain america cartoons.
  Scrapping page 1/2...
    Downloaded the image. ==> downloads/captain america cartoons/Cartoon-Image-Of-Captain-America-PNG-Captain-Ameri.jpg
    Downloaded the image. ==> downloads/captain america cartoons/e1a70241390388aae93d1feef9bc88a2.png
    Downloaded the image. ==> downloads/captain america cartoons/png-clipart-captain-america-captain-america-infant.png
    Downloaded the image. ==> downloads/captain america cartoons/38-383819_captain-america-clipart-9ipzkr9at-avenge.png
    Downloaded the image. ==> downloads/captain america cartoons/cartoon-captain-america-png-15.png
    Downloaded the image. ==> downloads/captain america cartoons/742d649b06da5d12fc0c8c9bda7f3df3.png
    Downloaded the image. ==> downloads/captain america cartoons/10-100179_free-png-captain-america-png-images-tran.png
    Downloaded the image. 

### Do some cleanup
Now I move all files to the corresponding directory and remove duplicate files. Duplicates are downloaded with a space and a number in the name (e.g., 'file (1).jpg' would be a duplicate of 'file.jpg').
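
Because the duplicate marker is part of the file name, the same cleanup can also be sketched in Python for any extension, not only `.jpg`. The sketch below assumes duplicates end in a space followed by a number in parentheses, as in the example above; the names `find_numbered_duplicates` and `remove_numbered_duplicates` are hypothetical:

```python
import re
from pathlib import Path

# Matches names such as "file (1).jpg": a space, a number in parentheses, then the extension.
DUPLICATE_NAME = re.compile(r" \(\d+\)\.[^.]+$")

def find_numbered_duplicates(folder: str) -> list:
    """Hypothetical helper: list files whose names carry a ' (n)' duplicate marker."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and DUPLICATE_NAME.search(p.name))

def remove_numbered_duplicates(folder: str) -> int:
    """Delete the marked duplicates and return how many files were removed."""
    duplicates = find_numbered_duplicates(folder)
    for path in duplicates:
        path.unlink()
    return len(duplicates)
```

Calling `remove_numbered_duplicates('characters/good/captain_america_cartoons')` would delete only files carrying the marker and leave alone names that merely contain a space.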

In [10]:
! mv downloads/captain\ america\ cartoons/* characters/good/captain_america_cartoons/.

In [11]:
! rm -rf downloads

In [12]:
! rm characters/good/captain_america_cartoons/*\ *.jpg

rm: characters/good/captain_america_cartoons/* *.jpg: No such file or directory
