### The following code downloads images from Google Image Search via Chrome and saves them locally
The code can be used to create a proprietary dataset for Computer Vision tasks by executing the following tasks:
- Opening up a Chrome instance
- Looping through Search Results
- Saving the images locally
- Afterwards, the labeling needs to be done manually

##### Requirements:
- Selenium (pip install selenium)
- Requests (pip install requests)
- Pillow (pip install pillow)
- Chromedriver (download from https://sites.google.com/chromium.org/driver/)

### Script Steps:
1. The code creates a folder for the topic of the search, e.g. "Asian Cities".
2. The code creates a subfolder for each search term in a search topic, e.g. "Singapore Sights".
3. The code loops through the image results of each search term.
4. The code collects the URL of each image (the maximum per search term is set by `number_images`).
5. The code downloads the images into the respective subfolder.
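
The folder layout from steps 1 and 2 can be sketched as follows. This is a minimal illustration rather than part of the original script: the helper name `make_label_folders` is hypothetical, and a temporary directory stands in for the real download folder.

```python
import os
import tempfile


def make_label_folders(base_folder, topic, labels):
    """Create <base_folder>/<topic>/<label>/ for every label and return the paths."""
    paths = []
    for label in labels:
        path = os.path.join(base_folder, topic, label)
        # exist_ok=True makes the call safe to repeat when the script is rerun
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths


# Usage with a throwaway directory (assumption: any writable folder works)
with tempfile.TemporaryDirectory() as base:
    created = make_label_folders(base, 'Asian Cities', ['Singapore Sights', 'Tokyo Sights'])
    for path in created:
        print(os.path.relpath(path, base))
```

Using `os.path.join` avoids having to track whether each path fragment ends with a slash.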

In [3]:
# Selenium drives Chrome through the chromedriver executable
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import io
from datetime import datetime as dt
from PIL import Image
import time
import os

### Download all necessary images via Google Image Search
Steps that need to be adapted before using the script for your own use case:

- 1.1.1 - Set up all necessary links for web pages and assign valid labels per search term
- 1.1.2 - Run backend code to define how pages are being scraped
- 1.1.3 - Download the Images per each search term in the specific subfolder


In [4]:
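# Download the driver from the chromedriver website for the relevant OS, e.g. macOS, Windows, Debian, etc.
from selenium.webdriver.chrome.service import Service

PATH = '/Users/noah_/Documents/Development/chromedriver-mac-arm64/chromedriver'
# Selenium 4 takes the driver path through a Service object instead of executable_path
wd = webdriver.Chrome(service=Service(PATH))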


### Parameters:

In [5]:
# General name (topic) of the images to be downloaded
topic = 'Images Asian Cities'
# Destination folder of the download
image_folder = '/Users/noah_/Documents/Development/Images/'
# Number of images to download per search term
number_images = 10

### 1.1.1 - Add relevant URLs of Search Terms and assign a Label per Search Term

In [6]:
# Add each Google Image Search URL that should be scraped to this list
google_urls = [
                'https://www.google.com/search?sca_esv=555829701&sxsrf=AB5stBi94pwOfebFNdkyHl_WkQCoF398mw:1691742992132&q=singapore+sights&tbm=isch&source=lnms&sa=X&ved=2ahUKEwiypOPgmdSAAxXyyzgGHUoiCD8Q0pQJegQIDhAB&biw=1869&bih=1062&dpr=2.2',
                'https://www.google.com/search?q=kuala+lumpur+sights&tbm=isch&ved=2ahUKEwiesdzhmdSAAxV02zgGHcX4BngQ2-cCegQIABAA&oq=kua+sights&gs_lcp=CgNpbWcQARgAMgYIABAHEB46BAgjECc6BQgAEIAEOgYIABAIEB46BwgAEBgQgARQ6wZYthtgmiRoBXAAeACAAbECiAHBBpIBBzYuMi4wLjGYAQCgAQGqAQtnd3Mtd2l6LWltZ8ABAQ&sclient=img&ei=EvPVZJ7gBvS24-EPxfGbwAc&bih=1062&biw=1869',
                'https://www.google.com/search?q=Manila+Sights&tbm=isch&ved=2ahUKEwj0qaT5m9SAAxWG5TgGHYuADnUQ2-cCegQIABAA&oq=Manila+Sights&gs_lcp=CgNpbWcQAzIFCAAQgAQyBQgAEIAEMgYIABAIEB4yBggAEAgQHjIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABFCoA1ioA2DSCGgAcAB4AIABMogBY5IBATKYAQCgAQGqAQtnd3Mtd2l6LWltZ8ABAQ&sclient=img&ei=XPXVZPSLGIbL4-EPi4G6qAc&bih=1062&biw=1869',
                'https://www.google.com/search?q=Shanghai+Sights&tbm=isch&ved=2ahUKEwiv3ab6m9SAAxWsm2MGHWEICI8Q2-cCegQIABAA&oq=Shanghai+Sights&gs_lcp=CgNpbWcQAzIFCAAQgAQyBQgAEIAEMgYIABAIEB4yBwgAEBgQgAQyBwgAEBgQgAQyBwgAEBgQgAQyBwgAEBgQgAQ6BAgjECdQ7gRY7gRgjAloAHAAeACAATuIAXCSAQEymAEAoAEBqgELZ3dzLXdpei1pbWfAAQE&sclient=img&ei=XvXVZK-2IKy3juMP4ZCg-Ag&bih=1062&biw=1869',
                'https://www.google.com/search?q=Tokyo+Sights&tbm=isch&ved=2ahUKEwjmpcGAnNSAAxWT6DgGHfYNCcYQ2-cCegQIABAA&oq=Tokyo+Sights&gs_lcp=CgNpbWcQAzIHCAAQigUQQzIFCAAQgAQyBQgAEIAEMgUIABCABDIFCAAQgAQyBggAEAgQHjIGCAAQCBAeMgcIABAYEIAEMgcIABAYEIAEMgcIABAYEIAEOgQIIxAnULgHWLgHYNMKaABwAHgAgAF4iAG-AZIBAzEuMZgBAKABAaoBC2d3cy13aXotaW1nwAEB&sclient=img&ei=a_XVZKbEIZPR4-EP9puksAw&bih=1062&biw=1869',
                'https://www.google.com/search?sca_esv=555829701&sxsrf=AB5stBiWXjtNyphCmlYcquuR6WqgExPC6g:1691743624189&q=Seoul+Sights&tbm=isch&source=lnms&sa=X&ved=2ahUKEwiW2ZSOnNSAAxU42DgGHe0vBHgQ0pQJegQIBhAB&biw=1869&bih=1062&dpr=2.2',
                'https://www.google.com/search?sca_esv=555829701&sxsrf=AB5stBhfUURmmsiuqSM5gvZfxTKxBMr4DA:1691743638136&q=Osaka+Sights&tbm=isch&source=lnms&sa=X&ved=2ahUKEwjTmeiUnNSAAxUs7TgGHU9oDC0Q0pQJegQIDBAB&biw=1869&bih=1062&dpr=2.2',
                'https://www.google.com/search?q=Kyoto+Sights&tbm=isch&ved=2ahUKEwjngt2VnNSAAxWlpekKHboDA3kQ2-cCegQIABAA&oq=Kyoto+Sights&gs_lcp=CgNpbWcQAzIHCAAQigUQQzIFCAAQgAQyBQgAEIAEMgYIABAIEB4yBggAEAgQHjIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDoECCMQJzoGCAAQBxAeOggIABAFEAcQHjoICAAQCBAHEB5QjwZY_yJgziRoCXAAeACAAaQBiAHHBpIBAzkuMpgBAKABAaoBC2d3cy13aXotaW1nwAEB&sclient=img&ei=mPXVZOfWAqXLpge6h4zIBw&bih=1062&biw=1869',
                'https://www.google.com/search?q=Jakarta+Sights&tbm=isch&ved=2ahUKEwizxq2anNSAAxXQ_zgGHXPKAYsQ2-cCegQIABAA&oq=Jakarta+Sights&gs_lcp=CgNpbWcQAzIFCAAQgAQyBQgAEIAEMgcIABAYEIAEMgcIABAYEIAEMgcIABAYEIAEMgcIABAYEIAEOgQIIxAnOgYIABAIEB5QkQVYkQVgpwhoAHAAeACAAbsBiAHHApIBAzAuMpgBAKABAaoBC2d3cy13aXotaW1nwAEB&sclient=img&ei=ofXVZPPxLdD_4-EP85SH2Ag&bih=1062&biw=1869',
                'https://www.google.com/search?q=Sydney+Sights&tbm=isch&ved=2ahUKEwiK8omvnNSAAxWUz6ACHbWXDKMQ2-cCegQIABAA&oq=Sydney+Sights&gs_lcp=CgNpbWcQAzIFCAAQgAQyBQgAEIAEMgUIABCABDIFCAAQgAQyBggAEAgQHjIGCAAQCBAeMgcIABAYEIAEMgcIABAYEIAEMgcIABAYEIAEMgcIABAYEIAEOgQIIxAnUNkCWNkCYM8FaABwAHgAgAFIiAGNAZIBATKYAQCgAQGqAQtnd3Mtd2l6LWltZ8ABAQ&sclient=img&ei=zfXVZMrXDJSfg8UPta-ymAo&bih=1062&biw=1869',
                'https://www.google.com/search?q=Auckland+Sights&tbm=isch&ved=2ahUKEwiCxay1nNSAAxV8yjgGHaufDaIQ2-cCegQIABAA&oq=Auckland+Sights&gs_lcp=CgNpbWcQAzIECCMQJzIFCAAQgAQyBQgAEIAEMgUIABCABDIGCAAQCBAeMgYIABAIEB4yBwgAEBgQgAQyBwgAEBgQgAQyBwgAEBgQgAQyBwgAEBgQgARQrwNYrwNg5gZoAHAAeACAAZ4CiAGeApIBAzItMZgBAKABAaoBC2d3cy13aXotaW1nwAEB&sclient=img&ei=2vXVZILwFfyU4-EPq7-2kAo&bih=1062&biw=1869',
                'https://www.google.com/search?q=Beijing+Sights&tbm=isch&ved=2ahUKEwiKur3AnNSAAxXv6DgGHZy-AHQQ2-cCegQIABAA&oq=Beijing+Sights&gs_lcp=CgNpbWcQAzIECCMQJzIFCAAQgAQyBwgAEBgQgAQyBwgAEBgQgAQyBwgAEBgQgAQyBwgAEBgQgAQyBwgAEBgQgAQyBwgAEBgQgARQ9gJY9gJgtAZoAHAAeACAAT-IAT-SAQExmAEAoAEBqgELZ3dzLXdpei1pbWfAAQE&sclient=img&ei=8fXVZMr9Ku_R4-EPnP2CoAc&bih=1062&biw=1869',
                'https://www.google.com/search?q=Mumbai+Sights&tbm=isch&ved=2ahUKEwikjfbGnNSAAxU15TgGHQ2dCIYQ2-cCegQIABAA&oq=Mumbai+Sights&gs_lcp=CgNpbWcQAzIFCAAQgAQyBQgAEIAEMgUIABCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDoECCMQJ1DYA1jYA2D5CWgAcAB4AIAB0wGIAYoCkgEFMS4wLjGYAQCgAQGqAQtnd3Mtd2l6LWltZ8ABAQ&sclient=img&ei=__XVZOSRDbXK4-EPjbqisAg&bih=1062&biw=1869',
                'https://www.google.com/search?q=Guangzhou+Sights&tbm=isch&ved=2ahUKEwjEjb_PnNSAAxV1oWMGHfWpAowQ2-cCegQIABAA&oq=Guangzhou+Sights&gs_lcp=CgNpbWcQAzIFCAAQgAQyBQgAEIAEMgUIABCABDIHCAAQGBCABDIHCAAQGBCABDoECCMQJ1C9A1i9A2CoCmgAcAB4AIABdYgBrAGSAQMxLjGYAQCgAQGqAQtnd3Mtd2l6LWltZ8ABAQ&sclient=img&ei=EfbVZITBC_XCjuMP9dOK4Ag&bih=1062&biw=1869',
                'https://www.google.com/search?q=Delhi+Sights&tbm=isch&ved=2ahUKEwjytOHznNSAAxXa5TgGHXZhBWMQ2-cCegQIABAA&oq=Delhi+Sights&gs_lcp=CgNpbWcQAzIHCAAQigUQQzIFCAAQgAQyBQgAEIAEMgYIABAFEB4yBggAEAgQHjIGCAAQBRAeMgcIABAYEIAEMgcIABAYEIAEMgcIABAYEIAEMgcIABAYEIAEOgQIIxAnUJYEWJYEYL4HaABwAHgAgAFhiAGnAZIBATKYAQCgAQGqAQtnd3Mtd2l6LWltZ8ABAQ&sclient=img&ei=XfbVZLKSD9rL4-EP9sKVmAY&bih=1062&biw=1869',
                'https://www.google.com/search?q=Dhaka+Sights&tbm=isch&ved=2ahUKEwis6tn8nNSAAxW67DgGHemzD8QQ2-cCegQIABAA&oq=Dhaka+Sights&gs_lcp=CgNpbWcQAzIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDIHCAAQGBCABDoECCMQJzoFCAAQgAQ6BggAEAUQHjoGCAAQCBAeOgYIABAHEB5Q4wZY4wZgxA1oAHAAeACAAXmIAakBkgEDMS4xmAEAoAEBqgELZ3dzLXdpei1pbWfAAQE&sclient=img&ei=b_bVZOz2PLrZ4-EP6ee-oAw&bih=1062&biw=1869'
]

# Add Labels for each Category
labels = [
    # 0 - 
    'Singapore Sights',
    # 1 - 
    'Kuala Lumpur Sights', 
    # 2 - 
    'Manila Sights', 
    # 3 - 
    'Shanghai Sights', 
    # 4 - 
    'Tokyo Sights',
    # 5 - 
    'Seoul Sights', 
    # 6 - 
    'Osaka Sights',
    # 7 - 
    'Kyoto Sights', 
    # 8 - 
    'Jakarta Sights',
    # 9 - 
    'Sydney Sights', 
    # 10 - 
    'Auckland Sights',
    # 11 - 
    'Beijing Sights',
    # 12 - 
    'Mumbai Sights',
    # 13 - 
    'Guangzhou Sights',
    # 14 - 
    'Delhi Sights',
    # 15 - 
    'Dhaka Sights'
]
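
The URLs above were copied from the browser and carry many extra parameters (such as `ved`, `ei`, and `gs_lcp`). As a possible alternative, the sketch below builds a search URL from each label using only the `q` (query) and `tbm=isch` (image search) parameters that appear in every URL in the list. The helper `build_search_url` is hypothetical, and it is an assumption that Google accepts this shortened form.

```python
from urllib.parse import urlencode


def build_search_url(search_term):
    """Build an image search URL from a plain search term (hypothetical helper)."""
    # 'q' is the query and 'tbm=isch' selects image search, as in the URLs above
    params = {'q': search_term, 'tbm': 'isch'}
    return 'https://www.google.com/search?' + urlencode(params)


example_labels = ['Singapore Sights', 'Kuala Lumpur Sights']
example_urls = [build_search_url(label) for label in example_labels]

# Building both lists from the same source keeps URLs and labels aligned
for label, url in zip(example_labels, example_urls):
    print(label, '->', url)
```

Generating the URLs from the labels also removes the need to keep two hand-written lists in the same order.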

### 1.1.2 - Set up the code to find each image in a webpage

In [7]:
### Find all images for one search term

def get_images_from_google(wd, delay, max_images, url):
    def scroll_down(wd):
        # Scroll to the bottom of the page so that more thumbnails are loaded
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(delay)

    wd.get(url)

    image_urls = set()
    skips = 0
    while len(image_urls) + skips < max_images:
        scroll_down(wd)
        # The class names "Q4LuWd" and "VQAsE" are specific to Google's page markup and may change
        thumbnails = wd.find_elements(By.CLASS_NAME, "Q4LuWd")

        for img in thumbnails[len(image_urls) + skips:max_images]:
            try:
                img.click()
                # Wait until the full-quality version of the image is visible
                time.sleep(delay)
            except Exception:
                continue

            print(f"Found {len(image_urls)}")
            images = wd.find_elements(By.CLASS_NAME, "VQAsE")
            for image in images:
                src = image.get_attribute('src')
                if src in image_urls:
                    # Duplicate: raise the limit so the next thumbnail is still visited
                    max_images += 1
                    skips += 1
                    break

                # Keep the image only if it has a source and the source is an http link
                if src and 'http' in src:
                    image_urls.add(src)

    return image_urls

### Function to save an image and report the time of the download
def download_image(down_path, url, file_name, image_type='JPEG',
                   verbose=True):
    try:
        # Named curr_dt to avoid shadowing the imported time module
        curr_dt = dt.now()
        curr_time = curr_dt.strftime('%H:%M:%S')
        # Download the raw bytes of the image from its URL
        img_content = requests.get(url).content
        # Wrap the bytes in an in-memory file object
        img_file = io.BytesIO(img_content)
        # Open the in-memory file as an image using Pillow
        image = Image.open(img_file)
        # JPEG cannot store transparency or palettes, so convert such images to RGB
        if image_type == 'JPEG' and image.mode != 'RGB':
            image = image.convert('RGB')
        file_pth = os.path.join(down_path, file_name)

        with open(file_pth, 'wb') as file:
            image.save(file, image_type)

        if verbose:
            print(f'The image: {file_pth} downloaded successfully at {curr_time}.')
    except Exception as e:
        print(f'Unable to download image from Google Images due to\n: {str(e)}')
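
The check in `get_images_from_google` keeps a `src` value only if it is non-empty and contains `http`. A slightly stricter variant, sketched below, parses the value and accepts only the `http` and `https` schemes, so that inline `data:` URIs and missing attributes are dropped. The function names `is_downloadable` and `collect_unique` and the sample values are illustrative assumptions, not part of the script.

```python
from urllib.parse import urlparse


def is_downloadable(src):
    """Return True if src is a non-empty http(s) URL (hypothetical helper)."""
    if not src:
        return False
    return urlparse(src).scheme in ('http', 'https')


def collect_unique(sources):
    """Keep each downloadable URL once, preserving first-seen order."""
    seen = set()
    unique = []
    for src in sources:
        if is_downloadable(src) and src not in seen:
            seen.add(src)
            unique.append(src)
    return unique


samples = [
    'https://example.com/a.jpg',
    None,                             # attribute missing
    'data:image/jpeg;base64,AAAA',    # inline image, not a link
    'https://example.com/a.jpg',      # duplicate
    'http://example.com/b.png',
]
print(collect_unique(samples))
```

Keeping the filter in a small pure function makes it possible to test it without starting a browser.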

### 1.1.3 - Download the Images to the folders

In [None]:
if __name__ == '__main__':
    # Each search URL needs exactly one label
    if len(google_urls) != len(labels):
        raise ValueError('The length of the url list does not match the labels list.')

    image_path = os.path.join(image_folder, topic)
    # Make one directory per label if it doesn't exist
    for lbl in labels:
        label_path = os.path.join(image_path, lbl)
        if not os.path.exists(label_path):
            print(f'Making directory: {lbl}')
            os.makedirs(label_path)
    print('Next Step \n')
    for url_current, lbl in zip(google_urls, labels):
        # Collect the image URLs for this search term (0 seconds delay between actions)
        urls = get_images_from_google(wd, 0, number_images, url_current)
        print('Downloading Images \n')
        for i, url in enumerate(urls):
            print(i)
            # The trailing '' keeps a path separator at the end of down_path
            download_image(down_path=os.path.join(image_path, lbl, ''),
                           url=url,
                           file_name=str(i + 1) + '.jpg',
                           verbose=True)

            print('Image saved from category: ', lbl, 'via URL: ', url)
    wd.quit()

### Notes:
- If the code stops, the cause might be a timeout. In that case, rerunning the script separately for each search term is recommended.
- The script can be amended to run Chrome in headless mode, i.e. without displaying the browser window, which would improve the performance of the script.
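
For the first note, one way to rerun only the search terms that did not finish is to check which label folders already hold enough images. The sketch below is an assumption about how that could look: `pending_labels` is a hypothetical helper, and it assumes the images are stored as `.jpg` files in one subfolder per label, as the script does.

```python
import os
import tempfile


def pending_labels(image_path, labels, expected_images):
    """Return the labels whose folder holds fewer than expected_images .jpg files."""
    pending = []
    for label in labels:
        folder = os.path.join(image_path, label)
        if os.path.isdir(folder):
            count = sum(1 for name in os.listdir(folder) if name.lower().endswith('.jpg'))
        else:
            count = 0
        if count < expected_images:
            pending.append(label)
    return pending


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as base:
    done = os.path.join(base, 'Tokyo Sights')
    os.makedirs(done)
    for i in range(2):
        # Empty placeholder files stand in for downloaded images
        open(os.path.join(done, f'{i + 1}.jpg'), 'wb').close()
    print(pending_labels(base, ['Tokyo Sights', 'Seoul Sights'], expected_images=2))
```

The returned labels could then be used to filter `google_urls` and `labels` before the download loop is started again.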