## Web Scraping Dog Database

The "Dog Data Scraper" notebook starts our API testing by gathering test data. To assess the API properly, we need a large set of dog photos. Influencer dogs are a good fit because their photos tend to be high quality and their breeds are easy to identify. I therefore wrote a script that uses Selenium to scrape photos of these dogs from Google Images.

### Libraries

In [4]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import os
import io
from PIL import Image
import pandas as pd
import time

Collecting usernames of dog influencers didn't provide an ideal test sample. Therefore, after researching various websites, I compiled my own list of 50 influencer dogs known for their high-quality photos on Google Images.

In [3]:
dogs_ig = [
    '@itsdougthepug',
    '@jiffpom',
    '@marniethedog',
    '@manny_the_frenchie',
    '@crusoe_dachshund',
    '@samsonthedood',
    '@reagandoodle',
    '@barkleysircharles',
    '@popeyethefoodie',
    '@izzythe.frenchie',
    '@tunameltsmyheart',
    '@toastmeetsworld',
    '@mensweardog',
    '@thiswildidea',
    '@aspenthemountainpup',
    '@dogwithsign',
    '@goldenunicornrae',
    '@jackson_the_dalmatian',
    '@madmax_fluffyroad',
    '@pavlovthecorgi',
    '@tuckerbudzyn',
    '@ppteamkler',
    '@Theladyshortcake',
    '@rocco_roni',
    '@Chompersthecorgi',
    '@siberianhusky_jax',
    '@good.boy.ollie',
    '@bluestaffyboulder',
    '@lecorgi',
    '@carterchowchow',
    '@_gsdbear',
    '@harlso',
    '@KeyushTheStuntDog',
    '@mayapolarbear',
    '@marutaro',
    '@henrythecoloradodog',
    '@eddie_jackrussell',
    '@tecuaniventura',
    '@ppteamaria',
    '@emmatheminifrenchie',
    '@balooitsme',
    '@pipperontour',
    '@mayathedox',
    '@hi_im_chewie',
    '@frankietothemoon',
    '@tinkerbellethedog',
    '@loki_the_wolfdog',
    '@dailydougie',
    '@tikatheiggy',
    '@norbertthedog'
]

### Scraping


To create the scraping function, I followed a tutorial by Tech With Tim (https://www.youtube.com/watch?v=NBuED2PivbY&t=1541s). The tutorial provides a similar function, which I modified for our needs: the version here locates page elements by class name (`By.CLASS_NAME`). The class names used in the code are specific to Google's markup at the time of writing.

For this task, we developed three primary functions:

i) `get_images_from_google(wd, query, delay, max_images)`: This function collects the URLs of full-scale images. It does so by clicking, one after another, each thumbnail that appears in a Google Images search for the given query.

ii) `download_image(download_path, url, file_name)`: This function fetches the image bytes with `requests`, then uses Pillow to decode them and save the image as a JPEG on the local machine.

iii) `download_dogs_images(wd, dogs_ig, delay, max_images)`: This wrapper function iterates through the list of dog influencers and downloads the images for each one in turn.
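
As a rough illustration of the first step, the sketch below shows how a handle from the list might be turned into a Google Images search URL, and how candidate `src` values could be filtered so that only remote URLs are kept. The helper names `build_search_url` and `is_remote_image` are hypothetical and are not part of the notebook. The notebook interpolates the query directly into an f-string and checks for `'http'` in the `src`; URL-encoding the query and checking the scheme, as done here, is an assumption about what a stricter variant could look like.

```python
from urllib.parse import urlencode, urlparse


def build_search_url(handle: str) -> str:
    """Hypothetical helper: build a Google Images search URL for a handle."""
    # Drop the '@', as the notebook does before searching
    query = handle.replace("@", "")
    # urlencode escapes spaces and other reserved characters in the query
    return "https://www.google.com/search?" + urlencode({"q": query, "tbm": "isch"})


def is_remote_image(src) -> bool:
    """Hypothetical helper: True only for http(s) URLs, not inline data URIs."""
    if not src:
        return False
    return urlparse(src).scheme in ("http", "https")


print(build_search_url("@itsdougthepug"))
# https://www.google.com/search?q=itsdougthepug&tbm=isch
```

Checking the scheme rather than searching for the substring `'http'` rejects base64 thumbnails served as `data:` URIs, which is the case the notebook's filter is meant to exclude.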


In [6]:
def get_images_from_google(wd, query, delay, max_images):
	'''
	wd: webdriver
	query: str
	delay: int
	max_images: int

	returns: set

	Collect the URLs of full scale images from Google Images
	'''

	# Scroll down to load more images
	def scroll_down(wd):
		# Scroll down to the bottom
		wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
		# Wait to load page
		time.sleep(delay)
		
	# Remove the '@' from the query
	query = query.replace('@', '')
	# Create the url
	url = f"https://www.google.com/search?q={query}&tbm=isch"
	# Open the browser
	wd.get(url)
	# Create a set to store the image urls
	image_urls = set()
	# Set the number of skips to 0
	skips = 0
	# Scroll down to load more images
	while len(image_urls) + skips < max_images:
		# Scroll down
		scroll_down(wd)
		# Find the thumbnails
		thumbnails = wd.find_elements(By.CLASS_NAME, "Q4LuWd")

		# Click on the thumbnails to get the full scale images
		for img in thumbnails[len(image_urls) + skips:max_images]:
			try:
				img.click()
				time.sleep(delay)
			except Exception:
				continue
			# Find the full scale images
			images = wd.find_elements(By.CLASS_NAME, "sFlh5c")
			# Add the image to the set
			for image in images:
				# Read the src attribute once and reuse it
				src = image.get_attribute('src')
				# If the image is already in the set, skip it
				if src in image_urls:
					# Add to the skips
					max_images += 1
					skips += 1
					break

				# If the image is a full scale image, add it to the set
				if src and 'http' in src:
					image_urls.add(src)
					print(f"Found {len(image_urls)}")

	return image_urls


def download_image(download_path, url, file_name):
	'''
	download_path: str
	url: str
	file_name: str

	returns: None

	Download an image from a url
	'''
	
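	try:
		# Create the download folder if it doesn't exist
		os.makedirs(download_path, exist_ok=True)
		# Fetch the raw image bytes (the 10 second timeout is an arbitrary choice)
		response = requests.get(url, timeout=10)
		response.raise_for_status()
		# Open the image from memory
		image_file = io.BytesIO(response.content)
		image = Image.open(image_file)
		# JPEG has no alpha channel or palette, so convert to RGB first
		image = image.convert("RGB")
		# Save the image
		file_path = os.path.join(download_path, file_name)
		with open(file_path, "wb") as f:
			image.save(f, "JPEG")
		print("Success")
	except Exception as e:
		print('FAILED -', e)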


def download_dogs_images(wd, dogs_ig, delay, max_images):
	'''
	wd: webdriver
	dogs_ig: list
	delay: int
	max_images: int

	returns: None

	Download images of dogs from google images
	'''

	# Iterate over the dogs
	for dog in dogs_ig:
		# Set the query to the dog's name
		query = dog
		# Get the image urls
		urls = get_images_from_google(wd, query, delay, max_images)
		# Download the images
		for i, url in enumerate(urls):
			download_image("imgs/", url, str(dog) + str(i) + ".jpg")
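
The loop above names each file by concatenating the handle and the index, which yields names such as `@jiffpom0.jpg`. Because the image catalog is later sorted by file name, a zero-padded index may keep the files in download order. The sketch below is one possible naming scheme; `image_file_name` and its `width` parameter are hypothetical and do not appear in the notebook.

```python
def image_file_name(handle: str, index: int, width: int = 3) -> str:
    """Hypothetical helper: build a file name such as 'jiffpom_007.jpg'."""
    # Remove the '@' so the handle is a plain file-name stem
    stem = handle.replace("@", "")
    # Zero-pad the index so that names sort in download order
    return f"{stem}_{index:0{width}d}.jpg"


for i in range(2):
    print(image_file_name("@jiffpom", i))
# jiffpom_000.jpg
# jiffpom_001.jpg
```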

### Image Details

Finally, I wrote a short Python script to catalog the downloaded images. It lists every file in the image folder, then reads each image's dimensions, file size, and aspect ratio. Using pandas, it collects these values in a DataFrame, sorts the rows by file name, and exports them to a CSV file named "image_data.csv" for easy reference and analysis.

In [7]:
# Path to the folder containing images
folder_path = "imgs/"

# List all files in the folder
files = os.listdir(folder_path)

# Initialize lists to store file names, dimensions, file sizes, and aspect ratios
file_names = []
dimensions = []
file_sizes = []
aspect_ratios = []

# Iterate through each file
for file in files:
    # Check if the file is an image (case-insensitive extension match)
    if file.lower().endswith((".png", ".jpg", ".jpeg", ".gif")):
        # Get the full path of the image
        image_path = os.path.join(folder_path, file)
        
        # Open the image
        with Image.open(image_path) as img:
            # Get the dimensions
            width, height = img.size
            # Calculate aspect ratio
            aspect_ratio = width / height
            # Get file size
            file_size = os.path.getsize(image_path) / (1024 * 1024)  # Convert to MB
            # Append file name, dimensions, file size, and aspect ratio to the lists
            file_names.append(file)
            dimensions.append((width, height))
            file_sizes.append(file_size)
            aspect_ratios.append(aspect_ratio)

# Create a DataFrame
df = pd.DataFrame({"File Name": file_names, 
                   "Dimensions": dimensions, 
                   "File Size (MB)": file_sizes,
                   "Aspect Ratio": aspect_ratios})
df = df.sort_values(by="File Name")

# Export the DataFrame to a CSV file, creating the output folder if needed
os.makedirs("./output/raw/images", exist_ok=True)
df.to_csv("./output/raw/images/image_data.csv", index=False)
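
To make the per-image arithmetic concrete, here is a minimal sketch that computes the same values for a single image from its width, height, and size in bytes. The function name `image_metrics` and the sample numbers are illustrative assumptions, not taken from the notebook's data.

```python
def image_metrics(width: int, height: int, size_bytes: int) -> dict:
    """Hypothetical helper mirroring the per-image values computed above."""
    return {
        "Dimensions": (width, height),
        # Same conversion as the notebook: bytes divided by 1024 * 1024
        "File Size (MB)": size_bytes / (1024 * 1024),
        "Aspect Ratio": width / height,
    }


# Made-up example: a 1920x1080 image that occupies 3 * 1024 * 1024 bytes
row = image_metrics(1920, 1080, 3 * 1024 * 1024)
print(row["File Size (MB)"])          # 3.0
print(round(row["Aspect Ratio"], 3))  # 1.778
```

An aspect ratio above 1 indicates a landscape image, below 1 a portrait image, and exactly 1 a square one.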