# Introduction

In this notebook I scrape images from Google, Yandex, and Flickr using Google Chrome. The goal is to extend the [Flowers Recognition](https://www.kaggle.com/alxmamaev/flowers-recognition) data set available on Kaggle with new species of flowers. So far I have used this code to add images of the following flower species to that data set:

- aster
- orchid

## Load all packages

In [None]:
from PIL import Image
import os
from selenium import webdriver
import time
import io
import requests
import hashlib

## Scraping from Google with Chrome

The code below scrapes images from Google Images. The method is based on this post, which explains how to scrape image data from Google: [https://medium.com/p/a96feda8af2d](https://medium.com/p/a96feda8af2d)

In [None]:
search_term = 'iris'

In [None]:
DRIVER_PATH = './chromedriver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

In [None]:
wd.get('https://google.com')

In [None]:
def fetch_image_urls(query:str, max_links_to_fetch:int, wd:webdriver.Chrome, sleep_between_interactions:float=1):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)    
    
    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))

    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        number_results = len(thumbnail_results)
        
        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
        
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract image urls    
            actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))

            image_count = len(image_urls)

            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
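        else:
            # no break above: every new thumbnail was visited without reaching the target
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(30)
            # stop and return what we have if no new thumbnails appeared
            # or there is no load-more button left to click
            if number_results == results_start or not wd.find_elements_by_css_selector(".mye4qd"):
                break
            wd.execute_script("document.querySelector('.mye4qd').click();")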

        # move the result startpoint further down
        results_start = len(thumbnail_results)

    return image_urls
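
`fetch_image_urls` drops the query into the search URL unchanged, so a multi-word term such as "bearded iris" ends up with a raw space in the URL. A minimal sketch of encoding the query first, assuming the same URL template as above; `build_search_url` is a hypothetical helper name, not part of the original post:

```python
from urllib.parse import quote_plus

# Same template as in fetch_image_urls above.
SEARCH_URL = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"


def build_search_url(query: str) -> str:
    """Return the image-search URL with the query percent-encoded (hypothetical helper)."""
    # quote_plus turns spaces into '+' and escapes characters such as '&'
    return SEARCH_URL.format(q=quote_plus(query))


print(build_search_url("bearded iris"))
```

Inside `fetch_image_urls`, this would replace the `search_url.format(q=query)` call.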

In [None]:
def persist_image(folder_path:str,url:str):
    try:
        image_content = requests.get(url).content

    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        # nothing was downloaded, so there is nothing to save
        return

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG")
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")
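
`persist_image` names each file after the first 10 hex characters of the SHA-1 of the downloaded bytes, so an image fetched twice with identical bytes maps to the same file name and overwrites itself instead of creating a duplicate. A small sketch of that naming step in isolation; `hashed_file_name` is a hypothetical helper introduced only for illustration:

```python
import hashlib


def hashed_file_name(image_content: bytes, length: int = 10) -> str:
    """Build a file name from the SHA-1 of the raw bytes (hypothetical helper)."""
    return hashlib.sha1(image_content).hexdigest()[:length] + ".jpg"


first = hashed_file_name(b"same image bytes")
second = hashed_file_name(b"same image bytes")

# Identical bytes always give the identical name.
print(first == second)
```

Because the hash covers the downloaded bytes rather than the pixels, the same picture served at a different size or compression level still gets a different name.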

In [None]:
def search_and_download(search_term:str,driver_path:str,target_path='./flowers',number_images=5):
    target_folder = os.path.join(target_path,'_'.join(search_term.lower().split(' ')))

    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(executable_path=driver_path) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5)
        
    for elem in res:
        persist_image(target_folder,elem)

In the cell below, `number_images` may need adjusting. If Google yields fewer images than the number requested, the run can fail or come up short, so lower the number if that happens.

In [None]:
img_path = 'flowers/' + search_term + '/'

# create flowers/<search_term>/ (and flowers/ itself) if it does not exist yet
os.makedirs(img_path, exist_ok=True)

search_and_download(search_term=search_term,driver_path=DRIVER_PATH,number_images=160)
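
To see how far a run fell short of the requested number, the saved files can be counted afterwards. A minimal sketch, assuming the images were written as `.jpg` files under `flowers/iris`; `count_saved_images` is a hypothetical helper:

```python
import os


def count_saved_images(folder: str, extension: str = ".jpg") -> int:
    """Count files with the given extension in folder; 0 if the folder is missing."""
    if not os.path.isdir(folder):
        return 0
    return sum(1 for name in os.listdir(folder) if name.lower().endswith(extension))


requested = 160
saved = count_saved_images("flowers/iris")
print(f"Saved {saved} of {requested} requested images")
```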

## Scraping images on Flickr

The two cells below search Flickr for the search term, up to the number of images given after it. To download and set up 'flickrGetUrl.py' and 'get_images.py', see the following repository: [https://github.com/ultralytics/flickr_scraper](https://github.com/ultralytics/flickr_scraper). The versions of that code that are run here are slightly modified for easier use in this repository.

In [None]:
run flickrGetUrl.py iris 500

In [None]:
run get_images.py image_urls.csv iris

## Scraping images on Yandex

The next line of code searches Yandex for images of the search term and puts them in a temporary folder named 'downloads', which I delete after moving the images out. Note that this command and the cleanup commands below run in the shell rather than in Python (hence the ! before each command).

In [None]:
! yandex-images-download Chrome --keywords "iris" --limit 500 

### Do some cleanup
Now I move all files to the corresponding flowers directory and remove duplicate files. Duplicates are downloaded with a space and a number in the name (e.g., 'file (1).jpg' would be a duplicate of 'file.jpg'), so the last command deletes every file whose name contains a space.

In [None]:
! mv downloads/iris/* flowers/iris/.

In [None]:
! rm -rf downloads

In [None]:
! rm flowers/iris/*\ *
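
The `rm` pattern above removes any file with a space in its name, which would also catch a legitimate file such as 'blue iris.jpg'. A stricter alternative in Python could match only the ' (N)' suffix. This is a sketch under that assumption about how duplicates are named; `remove_numbered_duplicates` is a hypothetical helper, demonstrated on a throwaway directory rather than on `flowers/iris`:

```python
import os
import re
import tempfile

# Matches names ending in " (N).ext", e.g. "file (1).jpg"
DUPLICATE_PATTERN = re.compile(r" \(\d+\)\.[^.]+$")


def remove_numbered_duplicates(folder: str) -> list:
    """Delete files whose names end in ' (N).ext' and return the removed names."""
    removed = []
    for name in sorted(os.listdir(folder)):
        if DUPLICATE_PATTERN.search(name):
            os.remove(os.path.join(folder, name))
            removed.append(name)
    return removed


# Demonstration on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["file.jpg", "file (1).jpg", "blue iris.jpg"]:
        open(os.path.join(demo_folder, name), "wb").close()
    removed = remove_numbered_duplicates(demo_folder)
    remaining = sorted(os.listdir(demo_folder))

print(removed)    # ['file (1).jpg']
print(remaining)  # ['blue iris.jpg', 'file.jpg']
```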