Use this notebook to get images for training a custom detector in TensorFlow. 

Roadmap:

* Identify a website that has the images you need and understand how its images can be downloaded

* Determine how to tailor an approach to get those images

* Use Chromedriver, Selenium, and BeautifulSoup to scrape the site for image links

* Save the images in JPEG (.jpg) format to a folder called images in the current working directory

Key Reference: https://medium.com/@thimblot/data-augmentation-boost-your-image-dataset-with-few-lines-of-python-155c2dc1baec

Images Credit: Images used according to terms and conditions at http://www.simpsoncrazy.com/

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from bs4 import BeautifulSoup
from io import BytesIO

import re
import os

from PIL import Image
import requests
import uuid

In [None]:
#define 'sourceURL' as the target page that contains the links to the images
sourceURL = "http://www.simpsoncrazy.com/pictures/homer"

#call chrome webdriver as 'driver' (https://sites.google.com/a/chromium.org/chromedriver/downloads)
driver = webdriver.Chrome('/Chromedriver201912/chromedriver')  

#use webdriver to call the URL
driver.get(sourceURL)

#make sure the page loads before moving on, check by table visibility
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="wrapper"]/div[2]/table')))
#in the above line, the XPATH value is found by using the XPATH Helper in Google Chrome

#pass the rendered page source to BeautifulSoup as 'soup'
soup = BeautifulSoup(driver.page_source, 'html.parser')

#scrape links from beautiful soup in 'links'
links = []

#find anchors whose href ends with .gif; based on research
#(the dot is escaped and the pattern anchored so only a literal '.gif' suffix matches)
for link in soup.find_all('a', attrs={'href': re.compile(r'\.gif$')}):

    #the hrefs are site-relative, so prepend the site root to get a full URL
    imageurl = 'http://www.simpsoncrazy.com' + link.get('href')
    links.append(imageurl)

    #print each URL to confirm; comment out the line below to hide the list
    print(imageurl)
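
The cell above builds each URL by prepending the site root, which assumes every href is site-relative. As a minimal sketch of a more forgiving approach, `urllib.parse.urljoin` from the standard library resolves relative hrefs and leaves already-absolute ones unchanged. The helper name `absolute_image_urls` and the sample hrefs are hypothetical and used only for illustration; they are not taken from the real page.

```python
from urllib.parse import urljoin


def absolute_image_urls(base_url, hrefs):
    """Resolve each href against base_url, dropping duplicates but keeping order."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# hypothetical hrefs for illustration only (assumption: not the site's real paths)
sample_hrefs = [
    '/content/pictures/homer/a.gif',
    'http://www.simpsoncrazy.com/content/pictures/homer/b.gif',
    '/content/pictures/homer/a.gif',
]
sample_urls = absolute_image_urls('http://www.simpsoncrazy.com/pictures/homer', sample_hrefs)
```

In the notebook, the list of `link.get('href')` values could be passed in place of `sample_hrefs`, with `sourceURL` as the base.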

In [None]:
#set the directory where the images will be saved in the next step

#get the current working directory
currentdir = os.getcwd()

#build the path to a new folder called images
path = os.path.join(currentdir, 'images')

#make the folder; exist_ok=True keeps this cell safe to re-run if the folder already exists
os.makedirs(path, exist_ok=True)

#print to confirm directory
print(currentdir)
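
Notebook cells tend to be re-run, so it helps if creating the output folder is idempotent. The sketch below shows one way to do that with `pathlib`, where `Path.mkdir(exist_ok=True)` does not raise when the directory is already there. The helper name `ensure_images_dir` is hypothetical, and a temporary directory stands in for the working directory so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_images_dir(parent):
    """Create parent/images if needed and return its path; safe to call repeatedly."""
    images_dir = Path(parent) / 'images'
    images_dir.mkdir(parents=True, exist_ok=True)
    return images_dir


# a temporary directory stands in for os.getcwd() (assumption for the demo)
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_images_dir(tmp)
    second = ensure_images_dir(tmp)  # second call must not raise
    created_ok = first.is_dir() and first == second
    folder_name = first.name
```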

In [None]:
#iterate through links containing images
for imagelink in links:
    
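    #download the image and open it with Pillow from an in-memory byte stream
    im = Image.open(BytesIO(requests.get(imagelink).content))

    #generate a random unique ID and convert it to a string
    #(named 'unique_id' so it does not shadow Python's built-in 'random' module)
    unique_id = str(uuid.uuid4())

    #save each image as 'image_' + unique ID; convert to RGB first because JPEG cannot store a GIF palette
    im.convert('RGB').save(os.path.join(path, 'image_' + unique_id + '.jpg'), format='JPEG')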

#https://www.geeksforgeeks.org/python-pil-image-frombuffer-method/
#https://www.geeksforgeeks.org/generating-random-ids-using-uuid-python/
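
The GIF-to-JPEG conversion can be tried without touching the network. The sketch below builds a tiny palette-mode GIF in memory, then applies the same `convert('RGB')` and JPEG save used in the loop above, writing to a `BytesIO` buffer instead of a file. The helper name `gif_bytes_to_jpeg_bytes` and the 8x8 test image are assumptions made for illustration.

```python
from io import BytesIO

from PIL import Image


def gif_bytes_to_jpeg_bytes(data):
    """Decode image bytes with Pillow and re-encode them as JPEG bytes."""
    im = Image.open(BytesIO(data))
    out = BytesIO()
    # JPEG has no palette mode, so convert to RGB before saving
    im.convert('RGB').save(out, format='JPEG')
    return out.getvalue()


# build a small palette-mode ('P') GIF in memory to stand in for a downloaded image
gif_buffer = BytesIO()
Image.new('P', (8, 8), color=1).save(gif_buffer, format='GIF')

jpeg_bytes = gif_bytes_to_jpeg_bytes(gif_buffer.getvalue())
jpeg_image = Image.open(BytesIO(jpeg_bytes))
```

For an animated GIF, this converts only the first frame, since that is the frame Pillow exposes when the file is opened.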

In [None]:
# run to stop the driver
driver.quit()