<a href="https://colab.research.google.com/github/joedockrill/image-scraper/blob/master/ImageScraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DuckDuckGo and Google Image Scraper

This notebook is an image scraper for creating deep learning datasets. Expand this section for help. 

Hugs & kisses, Joe Dockrill. 

credits: 
- [Deepan Prabhu Babu](https://github.com/deepanprabhu/duckduckgo-images-api) for the base DuckDuckGo code
- Iegor Timukhin for pointing out that the search constraints param was sitting under my nose the whole time

This notebook can scrape from Google and DuckDuckGo, but Google is really just an emergency backup in case the DuckDuckGo code breaks at some point.

The thumbnails from DDG are larger, the search options are better, and the results include the original (full-sized) image URL. To download the original instead of a thumbnail, pass any img_size other than ImgSize.Thumbs.

Bear in mind that you will get more failures downloading original images because of out-of-date links, truncated downloads, and sites which ban hot-linking.
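
Some of these failures can be spotted from the first few bytes of a download. The sketch below is an illustration only and is not part of this notebook, which validates files with PIL after saving them. `sniff_image_type` and `looks_truncated` are hypothetical helper names, the signature table covers only JPEG, PNG and GIF, and the end-of-image check is a heuristic that can misjudge valid JPEGs with trailing data.

```python
# Hypothetical helpers: reject downloads that are obviously not image data,
# e.g. an HTML error page served by a site that bans hot-linking.
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def sniff_image_type(data: bytes):
    """Return 'jpeg', 'png' or 'gif' if data starts with a known signature, else None."""
    for magic, kind in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return kind
    return None

def looks_truncated(data: bytes) -> bool:
    """Heuristic: a complete JPEG normally ends with the end-of-image marker FF D9."""
    return sniff_image_type(data) == "jpeg" and not data.endswith(b"\xff\xd9")
```

A download for which `sniff_image_type` returns `None` is most likely an error page rather than an image, so it can be skipped without writing it to disk.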

**Version 2** \
\
You can now constrain DDG searches as follows:

```
duckduckgo_search(label: str, keywords: str, max_results: int=100,
                  img_size: ImgSize=ImgSize.Thumbs,
                  img_type: ImgType=ImgType.Photo,
                  img_layout: ImgLayout=ImgLayout.Square,
                  img_color: ImgColor=ImgColor.All) -> None

img_size can be one of the following: (default=ImgSize.Thumbs)
Thumbs, Small, Medium, Large, Wallpaper
 
img_type can be one of the following: (default=ImgType.Photo)
All, Photo, Clipart, Gif, Transparent

img_layout can be one of the following: (default=ImgLayout.Square)
All, Square, Tall, Wide
  
img_color can be one of the following: (default=ImgColor.All)
All, Color, Monochrome, Red, Orange, Yellow, Green, Blue, Purple, Pink, Brown, Black, Gray, Teal, White
```
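
For reference, the setup cell joins these options into a single comma-separated filter string and sends it as the `f` query parameter. The sketch below mirrors that logic with cut-down copies of the enums; `build_constraints` is a hypothetical name used only here, and the format reflects what the notebook sends rather than any documented DuckDuckGo API.

```python
from enum import Enum

# Cut-down copies of the notebook's enums, just enough for the illustration.
class ImgSize(Enum):
    Thumbs = ""
    Large = "Large"

class ImgType(Enum):
    All = ""
    Photo = "photo"

class ImgLayout(Enum):
    All = ""
    Square = "Square"

class ImgColor(Enum):
    All = ""
    Monochrome = "Monochrome"

def build_constraints(img_size, img_type, img_layout, img_color) -> str:
    """Hypothetical helper: one slot per option, left empty when the option is unconstrained."""
    parts = [
        "size:" + img_size.name if img_size != ImgSize.Thumbs else "",
        "type:" + img_type.name if img_type != ImgType.All else "",
        "layout:" + img_layout.name if img_layout != ImgLayout.All else "",
        "color:" + img_color.name if img_color != ImgColor.All else "",
    ]
    return ",".join(parts)

# The notebook defaults leave the size and color slots empty.
default_filter = build_constraints(ImgSize.Thumbs, ImgType.Photo,
                                   ImgLayout.Square, ImgColor.All)
```

Note that the enum member names, not their values, end up in the string, which is how the setup cell builds it.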

Workflow:
- Write some search functions in the "Download your images here" cell
- Run the image cleaner to delete rubbish
- Zip it all up
- Download it or copy it to Google Drive
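
If you ever run the workflow outside Colab, the zip step can also be done in pure Python. This is a minimal sketch assuming the default `images` folder; `zip_dataset` is a hypothetical helper, and the notebook itself uses the `zip` shell command instead.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def zip_dataset(root: Path, zip_stem: str) -> Path:
    """Zip root/images into root/<zip_stem>.zip, keeping the images/ prefix inside the archive."""
    archive = shutil.make_archive(str(root / zip_stem), "zip",
                                  root_dir=str(root), base_dir="images")
    return Path(archive)

# Demo on a throwaway directory with placeholder bytes instead of real images.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images" / "Nice").mkdir(parents=True)
    (root / "images" / "Nice" / "001.jpg").write_bytes(b"placeholder")
    zip_path = zip_dataset(root, "Clowns")
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
```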

Images will be downloaded into folders by label name. If you want a single-level zip file with all the images at the root, just pass an empty string as the label name.
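
As a rough illustration of that layout, the sketch below computes where a given image ends up. `target_path` is a hypothetical helper that only mirrors the naming used by the setup cell (base folder, label folder, zero-padded counter).

```python
from pathlib import Path

BASE_FOLDER = "images"  # same default as the setup cell

def target_path(label: str, index: int) -> Path:
    """Hypothetical helper: where image number `index` for `label` is saved."""
    return Path(BASE_FOLDER) / label / (str(index).zfill(3) + ".jpg")

labelled = target_path("Nice", 7)   # inside a per-label folder
unlabelled = target_path("", 12)    # empty label: straight into the base folder
```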

If you would prefer to create a CSV file of label/url pairs you can do that at the bottom of the notebook.


# Code

In [20]:
#@title RUN THIS CELL for code setup.
#@markdown If you're new to Colab and you want to see the code, you can select this cell, 
#@markdown click the ... menu in the top right of the cell then click Form->Hide Form

from pathlib import Path
import shutil
import requests
import re
import json
import time
from bs4 import BeautifulSoup
from PIL import Image as PImage
from PIL import ImageDraw as PImageDraw
import ipywidgets as widgets
from IPython.display import display
from google.colab import files
from google.colab import drive
from typing import Callable
from enum import Enum
import pandas as pd

BASE_FOLDER = "images"

##########################################################################################
# scraping
##########################################################################################
def google_scrape_urls(keywords: str, max_results: int) -> list:
  '''scrape urls from google image search'''
  BASE_URL = "https://www.google.com/search?site=&tbm=isch&source=hp&biw=1873&bih=990&q="

  HEADERS = {
      'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
      'Accept-Encoding': 'none',
      'Accept-Language': 'en-US,en;q=0.8',
      'Connection': 'keep-alive',
  }
  
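  # URL-encode the keywords so characters such as & or # don't break the query string
  searchurl = BASE_URL + requests.utils.quote(keywords)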
  resp = requests.get(searchurl, headers=HEADERS)
  html = resp.text
  
  soup = BeautifulSoup(html, "html.parser")
  results = soup.find_all("img", {"data-src":True}, limit=max_results)
  
  links = []
  for result in results:  # not named `re`, which would shadow the regex module
    links.append(result["data-src"])

  return links  

class ImgSize(Enum):
  Thumbs=""
  Small="Small"
  Medium="Medium"
  Large="Large"
  Wallpaper="Wallpaper"

class ImgType(Enum):
  All=""
  Photo="photo"
  Clipart="clipart"
  Gif="gif"
  Transparent="transparent"

class ImgLayout(Enum):
  All=""
  Square="Square"
  Tall="Tall"
  Wide="Wide"
  
class ImgColor(Enum):
  All=""
  Color="color"
  Monochrome="Monochrome"
  Red="Red"
  Orange="Orange"
  Yellow="Yellow"
  Green="Green"
  Blue="Blue"
  Purple="Purple"
  Pink="Pink" 
  Brown="Brown"
  Black="Black" 
  Gray="Gray" 
  Teal="Teal"
  White="White"

def duckduckgo_scrape_urls(keywords: str, max_results: int, 
                           img_size: ImgSize=ImgSize.Thumbs, 
                           img_type: ImgType=ImgType.Photo,
                           img_layout: ImgLayout=ImgLayout.Square,
                           img_color: ImgColor=ImgColor.All) -> list:
  '''scrape urls from duckduckgo image search'''
  BASE_URL = 'https://duckduckgo.com/'
  params = {
    'q': keywords
  }
  results = 0
  links = []

  resp = requests.post(BASE_URL, data=params)
  match = re.search(r'vqd=([\d-]+)\&', resp.text, re.M|re.I)
  assert match is not None, "Failed to obtain search token"

  HEADERS = {
      'authority': 'duckduckgo.com',
      'accept': 'application/json, text/javascript, */*; q=0.01',
      'sec-fetch-dest': 'empty',
      'x-requested-with': 'XMLHttpRequest',
      'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
      'sec-fetch-site': 'same-origin',
      'sec-fetch-mode': 'cors',
      'referer': 'https://duckduckgo.com/',
      'accept-language': 'en-US,en;q=0.9',
  }

  constraints = ""
  if(img_size != ImgSize.Thumbs): constraints +=  "size:" + img_size.name
  constraints += ","
  if(img_type != ImgType.All): constraints +=  "type:" + img_type.name
  constraints += ","
  if(img_layout != ImgLayout.All): constraints +=  "layout:" + img_layout.name
  constraints += ","
  if(img_color != ImgColor.All): constraints +=  "color:" + img_color.name
  
  PARAMS = (
      ('l', 'us-en'),
      ('o', 'json'),
      ('q', keywords),
      ('vqd', match.group(1)),
      ('f', constraints),
      ('p', '1'),
      ('v7exp', 'a'),
  )

  requestUrl = BASE_URL + "i.js"

  while True:
      while True:
          try:
              resp = requests.get(requestUrl, headers=HEADERS, params=PARAMS)
              data = json.loads(resp.text)
              break
          except ValueError:
              # a non-JSON response body is treated as a sign of throttling
              print("Hit request throttle, sleeping and retrying")
              time.sleep(5) # seems a lot but ok...
              continue

      #result["thumbnail"] is normally big enough for most purposes
      #result["width"], result["height"] are for the full size img in result["image"]
      #result["image"] url to full size img on orig site (so may be less reliable) 
      #result["url"], result["title"].encode('utf-8') from the page the img came from
      
      for result in data["results"]:
        if(img_size == ImgSize.Thumbs): links.append(result["thumbnail"])
        else:                       links.append(result["image"])

        if(max_results is not None):
          if(len(links) >= max_results) : return links

      if "next" not in data:
          #no next page, all done
          return links

      requestUrl = BASE_URL + data["next"]

##########################################################################################
# searching & downloading
##########################################################################################
def google_search(label: str, keywords: str, max_results: int=100) -> None:
  '''run a google search and download the images'''
  print("Google search: ", keywords)
  links = google_scrape_urls(keywords,max_results)
  download_urls(label, links)

def duckduckgo_search(label: str, keywords: str, max_results: int=100,
                           img_size: ImgSize=ImgSize.Thumbs, 
                           img_type: ImgType=ImgType.Photo,
                           img_layout: ImgLayout=ImgLayout.Square,
                           img_color: ImgColor=ImgColor.All) -> None:
  '''run a duckduckgo search and download the images'''
  print("Duckduckgo search:", keywords)
  links = duckduckgo_scrape_urls(keywords, max_results, img_size, img_type, img_layout, img_color)
  download_urls(label, links)

def download_urls(label: str, links: list) -> None:
  '''downloads urls into the folder for that label'''
  if(len(links) == 0):
    print("Nothing to download!"); return

  folder = Path(BASE_FOLDER)/label
  folder.mkdir(parents=True, exist_ok=True)

  print("Downloading", len(links), "results into", folder)

  bar = widgets.IntProgress(value=0, min=0, max=len(links)) # one step per link
  display(bar)

  i = 1
  mk_fp = lambda i: folder/(str(i).zfill(3) + ".jpg")
  is_file = lambda i: mk_fp(i).exists()
  
  for link in links:
      # don't overwrite previous searches, even if earlier deletions left gaps
      while is_file(i): i += 1
      try:
        resp = requests.get(link, timeout=30) # don't hang forever on a dead host
        resp.raise_for_status()               # don't save error pages as images
        fp = mk_fp(i)
        fp.write_bytes(resp.content)

        try:
          img = PImage.open(fp)
          img.verify() # raises if PIL can't make sense of the file
          img.close()
        except Exception as e:
          # print(e)
          print(fp, "is invalid")
          fp.unlink()
      except Exception:
        print("Exception occurred while retrieving", link)

      bar.value += 1

  bar.bar_style = "success"

def save_urls(filename: str, scrape_func: Callable, label: str, keywords: str, max_results: int) -> None:
  '''run a search and concat the urls to a csv'''
  fp = Path(filename)
  if not fp.exists():
    df = pd.DataFrame(columns=["URL", "Label"])
    df.to_csv(filename, index=False)

  urls = scrape_func(keywords, max_results)
  rows = []

  for url in urls:
    rows.append({"URL":url, "Label":label})
    
  df = pd.concat([pd.read_csv(filename), pd.DataFrame(rows)]) 
  df.to_csv(filename, index=False)

##########################################################################################
# moving files around
##########################################################################################
def download_file(filename: str) -> None:
  '''trigger a file download from colab to local system'''
  files.download(filename)

def transfer_to_drive(filename: str, dest_folder: str="Datasets") -> None:
  '''transfer file from colab runtime to google drive'''
  drive.mount("/content/drive") 
  folder = Path("/content/drive/My Drive")/dest_folder
  folder.mkdir(parents=True, exist_ok=True)
  
  # use only the file name so a path like "out/images.zip" still lands in dest_folder
  shutil.copyfile(filename, str(folder/Path(filename).name))

**Run this cell to delete all image files (to create another dataset or reset)**

In [None]:
!rm -r -f images/*

**Download your images here**

In [None]:
# change your zip name and run some searches
# help and options are in the hidden cell at the top

ZIP_NAME = "images.zip" 

params = {
    "max_results": 100,             # can go higher, 477 at the time of writing
    "img_size":    ImgSize.Thumbs, 
    "img_type":    ImgType.Photo,
    "img_layout":  ImgLayout.Square,
    "img_color":   ImgColor.All
}

# EG:
# ZIP_NAME = "Clowns.zip"
# duckduckgo_search("Nice", "nice clowns", **params)
# duckduckgo_search("Scary", "scary clowns", **params)

# you can also use google_search() if you prefer or if the ddg code breaks.


In [None]:
#@title Dataset Cleaner
#@markdown Run this cell for a dataset cleaner you can use to get rid of inappropriate
#@markdown images before zipping up your dataset.

#@markdown ---

##########################################################################################
# globals & event handler
##########################################################################################
ICLN_BATCH_SZ = 8

# this may look nauseating, but creating new widgets is literally about 10x slower than 
# updating existing ones, so the ui gets created once and updated forever more
icln_folder = None
icln_batches = None
icln_pager = None
icln_grid = None
icln_empty_folder = None

def delete_on_click(btn):
  fn, img, batch, idx = btn.tag
  img.value = icln_deleted_img() # display red 'deleted' cross
  icln_batches[batch][idx] = ""  # so we know it's deleted as we page back & forth
  btn.disabled = True
  try:    Path(fn).unlink()      # dbl-clicks result in us trying to delete twice
  except FileNotFoundError: pass

def paging_on_click(btn):
  folder, batch = btn.tag
  icln_render_batch(folder, batch)

def reload_on_click(btn):
  icln_render_batch(icln_folder, 0, force_reload=True)

def folder_on_change(change):
  if(change["type"] == "change" and change["name"] == "value"):
    icln_render_batch(change["new"], 0)
  
##########################################################################################
# UI creation
##########################################################################################
def icln_deleted_img():
  # creates the red "deleted" placeholder cross once, loads it and caches it
  DELETED_IMG = "deleted_img"
  
  if(DELETED_IMG not in icln_deleted_img.__dict__):
    img = PImage.new("RGB",(150,150), color="white")

    draw = PImageDraw.Draw(img)
    draw.line((5, 5, 140, 140), fill="red", width=10)
    draw.line((5, 140, 140, 5), fill="red", width=10)

    # there must be a way to go from pil to something the widget likes without bouncing off disk :-/
    img.save("deleted.jpg")  
    icln_deleted_img.__dict__[DELETED_IMG] = Path("deleted.jpg").read_bytes()

  return icln_deleted_img.__dict__[DELETED_IMG]

def icln_create_widgets(batch_size):
  # create the UI widgets
  global icln_pager
  global icln_grid
  global icln_empty_folder

  # image/delete button pairs
  display_items = []
  for i in range(batch_size):
    img = widgets.Image()
    img.layout.width="150px"
    btn = widgets.Button(description="Delete")
    btn.on_click(delete_on_click)
    box = widgets.VBox(children=[img,btn])
    box.layout.margin = "5px"
    display_items.append(box)

  # paging
  btnFirst = widgets.Button(description="|<<") 
  btnPrev = widgets.Button(description="<<")
  lblPage = widgets.Label(value="Page NNN of KKK")
  lblPage.layout = widgets.Layout(display="flex", justify_content="center", width="100px")
  btnNext = widgets.Button(description=">>")
  btnLast = widgets.Button(description=">>|")
  
  pgbtns = [btnFirst, btnPrev, btnNext, btnLast]
  for btn in pgbtns: btn.on_click(paging_on_click)
  for btn in pgbtns: btn.layout.width = "60px"

  # folder drop down
  root = Path(BASE_FOLDER)
  # f.name rather than f.stem, so a label containing a dot keeps its full folder name
  folders = [f.name for f in root.glob("*") if (f.is_dir() and not f.name.startswith("."))]
  folders.sort()
  rootfiles = [f for f in root.glob("*") if f.is_file()]
  if(len(rootfiles) > 0): folders = ["/"] + folders
  ddlFolder = widgets.Dropdown(options=folders, description="Folder: ")
  ddlFolder.observe(folder_on_change)

  # reload button
  btnReload = widgets.Button(description="↻")
  btnReload.layout = widgets.Layout(width="40px", margin="0px 0px 0px 10px")
  btnReload.on_click(reload_on_click)

  # plug it all in and display
  icln_pager = widgets.HBox(children=[btnFirst, btnPrev, lblPage, btnNext, btnLast, 
                                      ddlFolder, btnReload])  
  icln_grid = widgets.GridBox(display_items, 
                              layout=widgets.Layout(grid_template_columns="repeat(4, 25%)",
                                                    margin="15px"))
  icln_empty_folder = widgets.HTML(value="<h2>No images left to display in this folder.</h2>")
  icln_empty_folder.layout.visibility = "hidden"

  display(icln_pager)
  display(icln_empty_folder)
  display(icln_grid)
  
##########################################################################################
# UI rendering
##########################################################################################
def icln_render_batch(folder, batch, force_reload=False):
  global icln_folder
  global icln_batches
  global icln_pager
  global icln_grid

  if(folder == "/"): folder = ""
  path = Path(BASE_FOLDER)/folder

  if((icln_folder != folder) or (force_reload)): 
    # get the files, split into batches  
    files = sorted(path.glob("*.jpg")) # sorted so the paging order is stable
    icln_batches = [files[i:i + ICLN_BATCH_SZ] for i in range(0, len(files), ICLN_BATCH_SZ)]
    icln_folder = folder

    if(len(files) == 0):
      # fail gracefully if they've deleted every image in this folder
      icln_empty_folder.layout.visibility = "visible"
      # icln_grid.layout.visibility = "hidden" <-- doesn't work :-@
      for child in icln_grid.children: child.layout.visibility = "hidden"
      btnFirst, btnPrev, lblPage, btnNext, btnLast,_,_ = icln_pager.children
      lblPage.value = "Page 0 of 0"
      for btn in [btnFirst, btnPrev, btnNext, btnLast]: btn.disabled = True
      return
    else:
      icln_empty_folder.layout.visibility = "hidden"
      icln_grid.layout.visibility = "visible"

  # display the images
  for i, fp in enumerate(icln_batches[batch]):
    icln_grid.children[i].layout.visibility = "visible"
    img = icln_grid.children[i].children[0]
    btn = icln_grid.children[i].children[1]

    if(fp == ""):
      img.value = icln_deleted_img()
      btn.disabled = True
    else:
      img.value = fp.read_bytes()
      btn.tag = (fp, img, batch, i)
      btn.disabled = False

  if(len(icln_batches[batch]) < ICLN_BATCH_SZ):
    # partial batch on the last page, hide the rest of the grid
    for i in range(len(icln_batches[batch]), ICLN_BATCH_SZ):
      icln_grid.children[i].layout.visibility = "hidden"
    
  # update the paging controls
  btnFirst, btnPrev, lblPage, btnNext, btnLast,_,_ = icln_pager.children
  btnFirst.tag = (folder, 0) 
  btnPrev.tag = (folder, max(0, batch-1)) 
  btnNext.tag = (folder, min(len(icln_batches)-1, batch+1)) 
  btnLast.tag = (folder, len(icln_batches)-1) 
  lblPage.value = "Page {} of {}".format(batch+1, len(icln_batches))
  for btn in [btnFirst, btnPrev, btnNext, btnLast]: btn.disabled = btn.tag[1] == batch

##########################################################################################
# and to actually create the UI and render the first folder in the list:
##########################################################################################
icln_create_widgets(ICLN_BATCH_SZ)
_,_,_,_,_,ddlFolder,_ = icln_pager.children
icln_render_batch(ddlFolder.value, 0)


**Run this cell to create a zip file**

In [None]:
!rm -f {ZIP_NAME}
!zip -q -r {ZIP_NAME} images

**Run one of these cells to get your zip file**

In [None]:
# download to your local system
download_file(ZIP_NAME)

In [None]:
# copy to google drive 
transfer_to_drive(ZIP_NAME, dest_folder="Datasets")

# Create a CSV file of URLs

If you'd rather distribute a file with the image URLs and labels and have people download the images themselves you can do so here.
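
Whoever receives the CSV will need to read it back before downloading anything. A minimal sketch using only the standard library follows; it assumes the `URL` and `Label` column headers that `save_urls` writes, `urls_by_label` is a hypothetical helper, and the example.com rows are made-up sample data.

```python
import csv
import io
from collections import defaultdict

def urls_by_label(csv_file) -> dict:
    """Group the URL column by the Label column of an open CSV file."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["Label"]].append(row["URL"])
    return dict(grouped)

# Made-up sample in the same URL,Label layout that save_urls writes.
sample = io.StringIO(
    "URL,Label\n"
    "https://example.com/dog1.jpg,dogs\n"
    "https://example.com/cat1.jpg,cats\n"
    "https://example.com/dog2.jpg,dogs\n"
)
grouped = urls_by_label(sample)
```

With a real file, pass `open("images.csv", newline="")` instead of the in-memory sample.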

In [21]:
CSV_NAME = "images.csv" #change this to something more meaningful

!rm -f {CSV_NAME}

# save_urls(CSV_NAME, duckduckgo_scrape_urls, "dogs", "dogs or puppies", 10)
# save_urls(CSV_NAME, duckduckgo_scrape_urls, "cats", "cats or kittens", 10)
# save_urls(CSV_NAME, duckduckgo_scrape_urls, "rabbits", "rabbits sitting in mugs", 10)

In [None]:
# download to your local system
download_file(CSV_NAME)

In [None]:
# copy to google drive 
transfer_to_drive(CSV_NAME, dest_folder="Datasets")