# Scraping our Yellowstone.ai trailcam images

The goal of the scraper is to collect a list of image links, which we'll then download using the widely used `requests` library. We have to do this because yellowstone.ai doesn't offer a way to download images in bulk.

ChromeDriver is a program that lets us control the Chrome browser with scripting. We'll use the Selenium API to "drive" it, that is, to issue browser commands such as "click here", "enter my email address there", and importantly, "copy that image link".

We'll build our scraper by issuing commands one at a time, by trial and error, until we're confident that we have all the pieces needed to create the final scraper.

# Building the scraper step-by-step

## Setup

Go to `https://chromedriver.chromium.org/downloads` and download the version of ChromeDriver that matches your Chrome version. Move it to your project's root folder. Right-click + open the downloaded file to let your OS know that it's safe.

In [1]:
# !pip install selenium

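Verify it works by importing Selenium's webdriver and creating a driver. Without the `--headless` option it should open an empty browser window; with it, Chrome runs without a visible window. Note that importing `Options` and setting options manually is only required when running this on Linux machines (useful for running on servers). Newer Selenium 4 releases expect the driver path to be wrapped in a `Service` object (from `selenium.webdriver.chrome.service`) rather than passed positionally as it is here.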

In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)

Selenium is the library we use to give ChromeDriver commands. We'll also use `sleep` from the `time` library to wait while Yellowstone's website updates in response to our commands. We'll use the `csv` library to save the results, which will be a list of image URLs.

In [3]:
from time import sleep
import csv

## Login

Before we start issuing browser commands, let's store our yellowstone.ai email and password. I keep mine in a text file called `creds.txt`, with the email on the first line and the password on the second.

In [4]:
# Read the email (line 1) and password (line 2); splitlines() tolerates a trailing newline.
with open('creds.txt', "r") as file:
    my_email, my_password = file.read().splitlines()[:2]

Now we can start issuing commands to the web driver. Let's bring up the page that contains the images and wait one second for it to load. After running this, we should be looking at an email/password login page.

In [5]:
driver.get("https://my.yellowstone.ai/media")
sleep(1)

We need to identify the email input HTML element and the password input element. Ideally, we'd use an `id` attribute because `id`s are supposed to be unique on a page. A `name` isn't always unique, but it may work. `class` is a last choice.

This page has `id`s for the two inputs we want to grab, so we'll use `driver.find_element_by_id` to grab those elements. (Newer Selenium 4 releases spell this `driver.find_element(By.ID, "email")`, with `By` imported from `selenium.webdriver.common.by`.)

(Tip for using non-id attributes: if multiple elements match the search, the first one on the page will be returned.)

In [6]:
email_input = driver.find_element_by_id("email")

Now we can use the `send_keys` method of this search result to send our email.

In [7]:
email_input.send_keys(my_email)

We'll do the same thing for the password.

In [8]:
password_input = driver.find_element_by_id("password")
password_input.send_keys(my_password)

Now we can hit "return" on the password input. To do that, we send the `RETURN` key by using selenium's built-in keys.

In [9]:
from selenium.webdriver.common.keys import Keys

In [10]:
password_input.send_keys(Keys.RETURN)

It worked!

## Scrape

Before we start scraping, I want to show you that you have access to the page's source code at any time. We probably won't use `driver.page_source` often, but it's good to know that it exists and should shed some light on what's going on!

In [11]:
# print(driver.page_source) # warning: large output

Let's begin.

This time we want to get all the images instead of one specific element. Inspecting the page shows that the images share a class name, so we can use the method `find_elements_by_class_name` (notice that elements is plural).

In [12]:
imgs = driver.find_elements_by_class_name("shadow-lg")
imgs[:3]

[<selenium.webdriver.remote.webelement.WebElement (session="284d90a3c3215aaf129c2b971fa869ff", element="ec8807d9-d71a-4040-8bcd-219515941eb8")>,
 <selenium.webdriver.remote.webelement.WebElement (session="284d90a3c3215aaf129c2b971fa869ff", element="1dfcb59a-8e4e-48f9-8ca2-910fe06a23c9")>,
 <selenium.webdriver.remote.webelement.WebElement (session="284d90a3c3215aaf129c2b971fa869ff", element="1ccf3ce0-c84a-404c-b51d-cf68e63a3e44")>]

To read an attribute from an element we call `get_attribute`, and we wrap the call in a try/except in case the lookup fails. So we'll first instantiate an empty list `links`, then we'll try to append the `src` attribute from each element in `imgs`.

In [13]:
links = []

for img in imgs:
    try: links.append(img.get_attribute("src"))
    except: print("failed to get src")
        
links[:3] , len(links)

([None,
  'https://d1xrbm8v2c14yb.cloudfront.net/1468/thumb_1639164201_SYFW1279.JPG',
  'https://d1xrbm8v2c14yb.cloudfront.net/1468/thumb_1639164201_SYFW1277.JPG'],
 201)
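
The leading `None` in the output above suggests that `get_attribute` hands back `None` when an element has no `src`, rather than raising. A minimal sketch of collecting only the real links follows. `collect_srcs` is a hypothetical helper name, and it assumes only that each element offers a `get_attribute` method like the elements in `imgs`.

```python
def collect_srcs(elements):
    """Return the non-empty `src` values of the given elements, in order."""
    srcs = []
    for element in elements:
        try:
            src = element.get_attribute("src")
        except Exception:
            # The lookup itself failed (for example, the element vanished); skip it.
            print("failed to get src")
            continue
        if src:  # drops None and empty strings
            srcs.append(src)
    return srcs
```

With the elements found above, this would be called as `links = collect_srcs(imgs)`.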

## Scrape next page

Excellent! We scraped image links from one page. Now we need to move to the next page and scrape the next batch of links.

Let's find the page buttons at the bottom of the page and try to individually select the ">" button which brings us to the next page.

I think this will work because I went to the very last page, and this button doesn't exist there. So, we can click the ">" button and save the links until ">" stops working!
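
That stopping rule can be sketched without a browser. In the sketch below, `scrape_page` and `go_to_next_page` are hypothetical callables standing in for the Selenium steps (collect the current page's links; click ">" and report whether that worked), and `max_pages` is an assumed safety cap rather than anything the site requires.

```python
def scrape_all_pages(scrape_page, go_to_next_page, max_pages=1000):
    """Collect links page by page until there is no next page."""
    links = []
    for _ in range(max_pages):
        # Grab everything on the current page.
        links.extend(scrape_page())
        # Stop as soon as the ">" button can no longer be used.
        if not go_to_next_page():
            break
    return links
```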

To select a `button` element by the value of one of its attributes, XPath comes in handy. The basic XPath syntax accepted by `find_element_by_xpath` is `'//element[@attr="value"]'`.

Examples:
- `driver.find_elements_by_xpath('//img[@src="https://www.rorymm.com/fun"]')`
- `driver.find_elements_by_xpath('//div[@class="abcde" or @class="zyxwv"]')`
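
To try the attribute-predicate syntax without a browser, Python's standard library `xml.etree.ElementTree` understands a small subset of XPath, including `[@attr="value"]`. This is only a sketch on a made-up fragment. ElementTree needs a leading `.` and does not support `or`, so it is not a substitute for the full XPath that Selenium accepts.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment for illustration only.
page = ET.fromstring(
    '<div>'
    '<button dusk="previousPage.after">prev</button>'
    '<button dusk="nextPage.after">next</button>'
    '</div>'
)

# Same predicate shape as the Selenium XPath, prefixed with "." for ElementTree.
matches = page.findall('.//button[@dusk="nextPage.after"]')
print([button.text for button in matches])  # ['next']
```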

In [14]:
next_page_button = driver.find_element_by_xpath('//button[@dusk="nextPage.after"]')
next_page_button.send_keys(Keys.RETURN)

Since we're loading a new page with new imgs, we'll sleep for a couple of seconds (1 second is fine for black-and-white imgs, but color imgs need longer, and the later pages need the full 2).

In [15]:
sleep(2)
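
A fixed `sleep` is either longer than needed or too short. One alternative is to poll until a condition holds; Selenium ships its own version of this idea as `WebDriverWait`. The helper below is a plain-Python sketch of the pattern, and `wait_until`, `timeout`, and `interval` are names chosen here for illustration.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.25):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met in time")
        time.sleep(interval)
```

With the driver above, `wait_until(lambda: driver.find_elements_by_class_name("shadow-lg"))` would return as soon as at least one matching element exists, though that alone doesn't prove the new page's imgs have replaced the old ones.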

Now we'll scrape this page's images. The code is identical to last time EXCEPT we already have `links`, so we don't want to accidentally reinstantiate it.

In [16]:
imgs = driver.find_elements_by_class_name("shadow-lg")

for img in imgs:
    try: links.append(img.get_attribute("src"))
    except: print("failed to get src")
        
links[:5]

[None,
 'https://d1xrbm8v2c14yb.cloudfront.net/1468/thumb_1639164201_SYFW1279.JPG',
 'https://d1xrbm8v2c14yb.cloudfront.net/1468/thumb_1639164201_SYFW1277.JPG',
 'https://d1xrbm8v2c14yb.cloudfront.net/1468/thumb_1639148040_SYFW1276.JPG',
 'https://d1xrbm8v2c14yb.cloudfront.net/1468/thumb_1639148039_SYFW1274.JPG']

In [17]:
len(links)

402

Very close – each page should have 200 imgs, so we have two extra.

## Cleaning and saving the links

Let's see what shouldn't be here.

In [18]:
links[:3]

[None,
 'https://d1xrbm8v2c14yb.cloudfront.net/1468/thumb_1639164201_SYFW1279.JPG',
 'https://d1xrbm8v2c14yb.cloudfront.net/1468/thumb_1639164201_SYFW1277.JPG']

Well wha-da-ya-know, the very first item is a `None`. Let's remove all the `None`s and see if that works.

In [19]:
# Build a new list without the None entries (popping while iterating can skip items).
links = [l for l in links if l is not None]

len(links)

400

Perfect.
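
A caution when cleaning a list like this: calling `pop` on a list while looping over it shifts the remaining items, so the loop can skip an element. It can appear to work when the `None`s are far apart, but not when two sit side by side. A small sketch with made-up values:

```python
sample = [None, None, "a.jpg", "b.jpg"]

# Popping during iteration: the second None slides into index 0 and is skipped.
popped = list(sample)
for i, item in enumerate(popped):
    if item is None:
        popped.pop(i)
print(popped)    # [None, 'a.jpg', 'b.jpg']

# Building a new list avoids the problem.
filtered = [item for item in sample if item is not None]
print(filtered)  # ['a.jpg', 'b.jpg']
```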

These are thumbnails of size 400x300, which isn't a bad size for deep learning, but I want to get the full sized images. I tried removing "thumb_" from the link, and that worked! I'll edit the links so they all are full sized.

In [20]:
links_big = [link.replace('thumb_', '') for link in links]
links_big[:3]

['https://d1xrbm8v2c14yb.cloudfront.net/1468/1639164201_SYFW1279.JPG',
 'https://d1xrbm8v2c14yb.cloudfront.net/1468/1639164201_SYFW1277.JPG',
 'https://d1xrbm8v2c14yb.cloudfront.net/1468/1639148040_SYFW1276.JPG']
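
When it comes time to save each image, the filename can be taken from the end of its link. A minimal sketch, where `filename_from_link` is a hypothetical helper and lowercasing the extension is just one possible convention:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def filename_from_link(link):
    """Return the last path segment of a URL with a lowercase extension."""
    path = PurePosixPath(urlparse(link).path)
    return path.stem + path.suffix.lower()

print(filename_from_link(
    "https://d1xrbm8v2c14yb.cloudfront.net/1468/1639164201_SYFW1279.JPG"
))  # 1639164201_SYFW1279.jpg
```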

Now we're ready to put it all together. The final script compares the scraped links against the images already on disk, which makes sure I don't spend time downloading imgs I already have!
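
One way to avoid re-downloading is to compare each link's filename with the files already on disk. The sketch below uses a set for fast lookups and ignores case so that `.JPG` and `.jpg` match. `find_new_links` and its arguments are names assumed for illustration.

```python
def find_new_links(links, existing_filenames):
    """Return the links whose filename is not already downloaded."""
    existing = {name.lower() for name in existing_filenames}
    new_links = []
    for link in links:
        # The filename is whatever follows the last "/" in the link.
        filename = link.rsplit("/", 1)[-1].lower()
        if filename not in existing:
            new_links.append(link)
    return new_links
```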

# Putting it all together

In [24]:
from fastai.vision.all import *
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
import requests
import shutil
import csv



##-----------------------------------------------------------------------------
##----- Params

CREDS = 'creds.txt'
URL = 'https://my.yellowstone.ai/media'
LINUX = True # change to False if running on a Mac (Windows untested)

if LINUX:
    DRIVER_BIN = '/usr/local/bin/chromedriver'
    IMAGES = '/home/rory/data/trailcam'
else:
    DRIVER_BIN = '/Users/rorymccallion/repos/scrapers/chromedriver96'
    IMAGES = '/Users/rorymccallion/repos/trailcam/imgs'



##-----------------------------------------------------------------------------
##----- Webdriver Setup

print(f"STARTING Yellowstone.ai image scraper.")

# Read the email (line 1) and password (line 2) from the creds file.
with open(CREDS, "r") as file:
    my_email, my_password = file.read().splitlines()[:2]

# Summon chrome webdriver.
print("Starting headless webdriver ...")
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(DRIVER_BIN, options=options)

    
# Open chrome, go to url, wait for page to load.
driver.get(URL)
sleep(1)
print(f"Navigating to {URL} ...")


# Find email input and enter email.
email_input = driver.find_element_by_id("email")
email_input.send_keys(my_email)


# Same for password, then "hit enter" to submit the form.
password_input = driver.find_element_by_id("password")
password_input.send_keys(my_password)
password_input.send_keys(Keys.RETURN)
print(f"Logging into {my_email}'s account ...")




##-----------------------------------------------------------------------------
##----- Scrape Image Links

# We're about to start scraping image links and storing them in a list. When
#  we've finished scraping image links, we'll compare this list of links to the
#  list of image files we've already downloaded to determine which links have
#  new images. We'll then download those new images.

links = []

# Do the following on each page of images:
pagenum = 1
while True:
    
    # Wait for the imgs to load, then collect them into a list.
    sleep(2)
    imgs = driver.find_elements_by_class_name("shadow-lg")
    print(f"Scraping image links from page {pagenum} ...")
    
    
    # Store their src attribute (their link) in links, else break.
    for img in imgs:
        try:
            links.append(img.get_attribute("src"))
        except Exception:
            print("Couldn't get src attribute; moving on...")
            break
    
    
    # Remove `None` values so we can test that we grabbed exactly 200.
    # (Rebuild the list; popping while iterating can skip items.)
    links = [l for l in links if l is not None]
            
            
    # Do the test.
    if len(links) % 200 != 0:
        # Note on the next line: the print says "fewer" but it COULD be greater than!
        print(f"Stopping on page {pagenum}: this page has fewer than 200 links.")
        print(f"Total links scraped: {len(links)}")
        break
    
    
    # Find the next page button and click it.
    xpath_str = '//button[@dusk="nextPage.after"]'
    try:
        driver.find_element_by_xpath(xpath_str).send_keys(Keys.RETURN)
        pagenum += 1
    except:
        print(f"Stopping on page {pagenum}: couldn't find next page.")
        print(f"Total links scraped: {len(links)}")
        break

        
        
                
##-----------------------------------------------------------------------------
##----- Find New Links


# Get fullsized image links by changing URLs to remove "thumb_".
links = [link.replace('thumb_', '') for link in links]

# Get filenames from links.
link_fns = [link.split("/")[-1].replace('.JPG','.jpg') for link in links]

# Get filenames from already downloaded images.
image_fns = [path.name for path in get_image_files(IMAGES)]

# If a link's fn isn't in the image fns, it's new and should be downloaded.
new_links = [l for fn,l in zip(link_fns, links) if fn not in image_fns]

print(f"Current trailcam images: {len(image_fns)}.")
print(f"New images to download: {len(new_links)}.")




##-----------------------------------------------------------------------------
##----- Download New Images


succ, fail = 0, 0
for link in new_links:
    
    r = requests.get(link, stream = True)
    
    if r.status_code == 200:
        r.raw.decode_content = True
        filename = link.split("/")[-1].replace('.JPG','.jpg')
        with open(IMAGES + "/" + filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
        print('Successfully downloaded', link)
        succ += 1
    else:
        print('Failed to download:', link)
        fail += 1

print(f"Downloaded {succ} images successfully, {fail} failed.")
print(f"FINISHED scraping & downloading Yellowstone.ai trailcam images.")

STARTING Yellowstone.ai image scraper.
Starting headless webdriver ...
Navigating to https://my.yellowstone.ai/media ...
Logging into mccallionr+yellowstoneai@gmail.com's account ...
Scraping image links from page 1 ...
Scraping image links from page 2 ...
Scraping image links from page 3 ...
Scraping image links from page 4 ...
Scraping image links from page 5 ...
Scraping image links from page 6 ...
Stopping on page 6: this page has fewer than 200 links.
Total links scraped: 1018
Current trailcam images: 1018.
New images to download: 0.
Downloaded 0 images successfully, 0 failed.
FINISHED scraping & downloading Yellowstone.ai trailcam images.
