# Web Scraping iNaturalist
[iNaturalist](https://www.inaturalist.org/) is an interesting and useful source of biological image data. Users upload (often high-quality) images of plants, bugs, and other organisms through an app, and the labels (for example, the species of a plant) are crowd-sourced, which solves the data-labeling problem.

In this notebook I scrape photos of common toxic, rash-causing plants in North America (like poison oak or poison ivy) as well as common nontoxic plants that are often confused with their toxic counterparts. The end goal is to train a model that can classify a plant image and help determine whether it shows one of the toxic species. [The dataset created by this notebook can be found on Kaggle](https://www.kaggle.com/datasets/hanselliott/toxic-plant-classification).

If decent results can be achieved, this model can be deployed as an app. Once again, this makes iNaturalist an exciting data source, considering that users of iNaturalist tend to submit photos taken from their phone, and a plant classification app would rely on phone-taken photos as well.

In [1]:
# web scraping
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
## https://stackoverflow.com/questions/64717302/deprecationwarning-executable-path-has-been-deprecated-selenium-python

# images
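import PIL.Image  # import the submodule explicitly; a bare "import PIL" does not guarantee PIL.Image is loaded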
from io import BytesIO
import cv2

# file and os management
import os, re, time, shutil
from imutils import paths

# generic
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd 

# notebook display
from IPython.display import display, clear_output

In [50]:
# https://www.chop.edu/news/health-tip/recognizing-poison-ivy-oak-and-sumac

## Metadata
Create simple dataframes containing information for the 5 toxic species and the 5 nontoxic species.  
Note that all of the species below are present in the [Herbarium 2022 competition dataset](https://www.kaggle.com/competitions/herbarium-2022-fgvc9/data) available on Kaggle. This data can be combined with the herbarium image data based on the `herbarium22_category_id`.  
Note that the `path` column is prepared specifically for use on Kaggle.
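
As a rough illustration of the join mentioned above, the sketch below merges a small stand-in species table with a made-up herbarium image table on the category id. The `herbarium_images` frame and its `image_id` and `category_id` columns are assumptions for the example; the real competition metadata may be organized differently.

```python
import pandas as pd

# Stand-in for the species table built in this notebook (two rows only)
species_meta = pd.DataFrame({
    "slang": ["Western Poison Oak", "Boxelder"],
    "herbarium22_category_id": [14625, 83],
})

# Hypothetical herbarium image table; column names are assumed for illustration
herbarium_images = pd.DataFrame({
    "image_id": ["a", "b", "c", "d"],
    "category_id": [14625, 83, 83, 999],
})

# Keep only herbarium images whose category matches one of our species
combined = herbarium_images.merge(
    species_meta,
    left_on="category_id",
    right_on="herbarium22_category_id",
    how="inner",
)
print(combined[["image_id", "slang"]])
```

An inner join drops herbarium images from species that are not in the table, which is usually what is wanted when extending this dataset.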

In [37]:
# Toxic plant images:
toxic_meta = pd.DataFrame({
    "class_id" : [0, 1, 2, 3, 4],
    "slang" : ["Western Poison Oak", "Eastern Poison Oak", "Eastern Poison Ivy", "Western Poison Ivy", "Poison Sumac"],
    "scientific_name" : ["Toxicodendron diversilobum", "Toxicodendron pubescens", "Toxicodendron radicans",
                         "Toxicodendron rydbergii", "Toxicodendron vernix"],
    "herbarium22_category_id" : [14625, 14626, 14627, 14628, 14629],
    "path" : ["../input/toxic-plant-classification/tpc-imgs/toxic_images/000/", "../input/toxic-plant-classification/tpc-imgs/toxic_images/001/",
             "../input/toxic-plant-classification/tpc-imgs/toxic_images/002/", "../input/toxic-plant-classification/tpc-imgs/toxic_images/003/",
             "../input/toxic-plant-classification/tpc-imgs/toxic_images/004/"],
    "search_url" : ["https://www.inaturalist.org/taxa/51080-Toxicodendron-diversilobum/browse_photos",
                    "https://www.inaturalist.org/taxa/52083-Toxicodendron-pubescens/browse_photos",
                    "https://www.inaturalist.org/taxa/58732-Toxicodendron-radicans/browse_photos",
                    "https://www.inaturalist.org/taxa/58729-Toxicodendron-rydbergii/browse_photos",
                    "https://www.inaturalist.org/taxa/54767-Toxicodendron-vernix/browse_photos"]
})

toxic_meta.to_csv("tpc_meta.csv")


class_to_slang = dict(zip(toxic_meta.class_id, toxic_meta.slang)) ##potentially useful dicts to index
class_to_sciname = dict(zip(toxic_meta.class_id, toxic_meta.scientific_name))
toxic_meta.to_csv("../iNaturalist/toxic_metadata.csv", index=False)

toxic_meta

Unnamed: 0,class_id,slang,scientific_name,herbarium22_category_id,path,search_url
0,0,Western Poison Oak,Toxicodendron diversilobum,14625,../input/toxic-plant-classification/tpc-imgs/t...,https://www.inaturalist.org/taxa/51080-Toxicod...
1,1,Eastern Poison Oak,Toxicodendron pubescens,14626,../input/toxic-plant-classification/tpc-imgs/t...,https://www.inaturalist.org/taxa/52083-Toxicod...
2,2,Eastern Poison Ivy,Toxicodendron radicans,14627,../input/toxic-plant-classification/tpc-imgs/t...,https://www.inaturalist.org/taxa/58732-Toxicod...
3,3,Western Poison Ivy,Toxicodendron rydbergii,14628,../input/toxic-plant-classification/tpc-imgs/t...,https://www.inaturalist.org/taxa/58729-Toxicod...
4,4,Poison Sumac,Toxicodendron vernix,14629,../input/toxic-plant-classification/tpc-imgs/t...,https://www.inaturalist.org/taxa/54767-Toxicod...


In [38]:
# Nontoxic images:
nontoxic_meta = pd.DataFrame({
    "class_id" : [0, 1, 2, 3, 4],
    "slang": ["Virginia creeper", "Boxelder", "Jack-in-the-pulpit", "Bear Oak", "Fragrant Sumac"],
    "scientific_name": ["Parthenocissus quinquefolia", "Acer negundo L.", "Arisaema triphyllum", "Quercus ilicifolia", "Rhus aromatica"],
    "herbarium22_category_id": [10340, 83, 1055, 12267, 12479],
    "path" : ["../input/toxic-plant-classification/tpc-imgs/nontoxic_images/000/", "../input/toxic-plant-classification/tpc-imgs/nontoxic_images/001/",
             "../input/toxic-plant-classification/tpc-imgs/nontoxic_images/002/", "../input/toxic-plant-classification/tpc-imgs/nontoxic_images/003/",
             "../input/toxic-plant-classification/tpc-imgs/nontoxic_images/004/"],
    "search_url": ["https://www.inaturalist.org/taxa/50278-Parthenocissus-quinquefolia/browse_photos",
                   "https://www.inaturalist.org/taxa/47726-Acer-negundo/browse_photos",
                   "https://www.inaturalist.org/taxa/50310-Arisaema-triphyllum/browse_photos",
                   "https://www.inaturalist.org/taxa/130737-Quercus-ilicifolia/browse_photos",
                   "https://www.inaturalist.org/taxa/58738-Rhus-aromatica/browse_photos"]
})
nontoxic_meta.to_csv("../iNaturalist/nontoxic_metadata.csv", index=False)
nontoxic_meta

Unnamed: 0,class_id,slang,scientific_name,herbarium22_category_id,path,search_url
0,0,Virginia creeper,Parthenocissus quinquefolia,10340,../input/toxic-plant-classification/tpc-imgs/n...,https://www.inaturalist.org/taxa/50278-Parthen...
1,1,Boxelder,Acer negundo L.,83,../input/toxic-plant-classification/tpc-imgs/n...,https://www.inaturalist.org/taxa/47726-Acer-ne...
2,2,Jack-in-the-pulpit,Arisaema triphyllum,1055,../input/toxic-plant-classification/tpc-imgs/n...,https://www.inaturalist.org/taxa/50310-Arisaem...
3,3,Bear Oak,Quercus ilicifolia,12267,../input/toxic-plant-classification/tpc-imgs/n...,https://www.inaturalist.org/taxa/130737-Quercu...
4,4,Fragrant Sumac,Rhus aromatica,12479,../input/toxic-plant-classification/tpc-imgs/n...,https://www.inaturalist.org/taxa/58738-Rhus-ar...


# Scraping iNaturalist 
Scrape the search page's HTML for image URLs, then download the images from those URLs.
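
The URL-extraction step can be tried offline before involving a browser. The sketch below assumes the page markup looks like the hypothetical `sample_html` string, with each image URL sitting between escaped quotes (`&quot;`) in a `style` attribute; the `example.org` URLs and the helper name `extract_urls` are made up for illustration.

```python
import re

# Hypothetical snippet imitating the escaped markup of the photo browser
sample_html = (
    '<div class="CoverImage low undefined loaded" '
    'style="background-image: url(&quot;https://example.org/photos/1/medium.jpg&quot;);"></div>'
    '<div class="CoverImage low undefined loaded" '
    'style="background-image: url(&quot;https://example.org/photos/2/medium.jpg&quot;);"></div>'
)

def extract_urls(source_code, marker="CoverImage low undefined loaded"):
    """Return the unique URLs that follow each occurrence of `marker`, in page order."""
    urls = []
    # The text before the first marker cannot hold an image URL, so skip it
    for chunk in source_code.split(marker)[1:]:
        # Non-greedy match: stop at the first closing &quot; after the URL
        match = re.search(r"&quot;(.*?)&quot;", chunk)
        if match and match.group(1) not in urls:
            urls.append(match.group(1))
    return urls

found = extract_urls(sample_html)
print(found)
```

The non-greedy pattern is a small deviation from the greedy one used in this notebook; it guards against a chunk that happens to contain more than one pair of escaped quotes.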

In [10]:
def scrape_for_urls(search_url, max_urls=5):
    """
    Use selenium to open the page, load all html content, find urls hidden in html elements, and save the urls.
    By default, the search page is set so Grouping=None, Plant Phenology=Any, Order By=Faves, Photo Licensing=Any,
    and Quality Grade=Research.
    """
    driver = webdriver.Chrome("C:/Users/hanse/Documents/chromedriver_win32/chromedriver.exe")
    driver.get(search_url)
    time.sleep(5) #sleep_between_interactions

    reached_page_end = False
    last_height = driver.execute_script("return document.body.scrollHeight")

    # Stage 1: Continue scrolling until enough images are loaded 
    nonunique_urls = []
    while (not reached_page_end) and (len(set(nonunique_urls)) < max_urls):
        # Scrape through site counting urls. Load more images if need more urls to reach max_urls
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") ##scroll to end of page
        time.sleep(3) ##give newly requested images time to load before reading the html
        elem = driver.find_element(By.XPATH, "//*") ##select the root element (first match of //*)
        source_code = elem.get_attribute('outerHTML')  ##save html source code
        split_html = source_code.split("CoverImage low undefined loaded") ##split by the class which contains image url in style
        for i in range(len(split_html)):
            result = re.search('&quot;(.*)&quot', split_html[i])
            if result is not None:
                nonunique_urls += [result.group(1)]
            if len(set(nonunique_urls)) >= max_urls: ##continue until the number of UNIQUE urls in the list is max_urls...
                break
        new_height = driver.execute_script("return document.body.scrollHeight")
        if last_height == new_height: #check if the height has stopped changing
            reached_page_end = True #if so, we've maxed out and need to stop
        else:
            last_height = new_height
    
    # Stage 2: Put the unique urls found in the html into a list. Stop at max_urls
    urls = []
    elem = driver.find_element(By.XPATH, "//*") ##select the root element one final time
    source_code = elem.get_attribute('outerHTML')  ##save final html source code with all loaded urls
    split_html = source_code.split("CoverImage low undefined loaded") ##split string and isolate urls
    for i in range(len(split_html)):
        result = re.search('&quot;(.*)&quot', split_html[i])
        if (result is not None) and (result.group(1) not in urls): ##skip urls that were already saved
            urls += [result.group(1)]
        if len(urls) >= max_urls: ##continue saving until number of urls is = to max
            break

    driver.quit() ##close the browser and end the chromedriver process
    return urls

In [9]:
def get_images_and_save(class_id, urls, toxic, clean_dir=True):
    "Retrieves images from urls and saves them to a directory based on their class_id"
    # use save_dir rather than dir so the builtin dir() is not shadowed
    if toxic:
        save_dir = f'../iNaturalist/toxic_images/00{class_id}/'
    else:
        save_dir = f'../iNaturalist/nontoxic_images/00{class_id}/'

    if clean_dir and os.path.isdir(save_dir): # if the dir exists, remove it to get a clean slate
        shutil.rmtree(save_dir)
    os.makedirs(save_dir, exist_ok=True) # make the dir (and any missing parents) if needed

    total = 0
    for url in urls:
        # before saving image to disk, determine the image name & path (zero-padded to 3 digits)
        img_path = os.path.join(save_dir, f"{total:03d}.jpg")
        try: #try the url, if successful open the image and save to the img_path
            res = requests.get(url, timeout=60)
            res.raise_for_status() ##treat HTTP errors (404, 500, ...) as failed downloads
            img = PIL.Image.open(BytesIO(res.content))
            img.save(img_path)
            # update the counter
            total += 1
            clear_output(wait=True) ##allow print statements to overwrite previous ones
            display(f"[INFO] downloaded: {img_path} | Total {total}")
        except Exception as e:
            display(f"[INFO] error downloading {img_path}...skipping ({e})")

In [20]:
max_urls = 1000   ##about 10 urls found per scroll
toxic_urls_list = []
nontoxic_urls_list = []

# SCRAPE iNaturalist HTML SOURCE FOR IMAGE URLS 
for i in range(5):
    # GET TOXIC URLS
    toxic_urls_list += [scrape_for_urls(toxic_meta.loc[i, 'search_url'], max_urls)]

    # GET NONTOXIC URLS
    nontoxic_urls_list += [scrape_for_urls(nontoxic_meta.loc[i, 'search_url'], max_urls)]


## 28m 20s when max_urls=1000

In [28]:
for i in range(5):
    print(f"Toxic Class {i}. URLS = {len(set(toxic_urls_list[i]))}")
    print(f"Nontoxic Class {i}. URLS = {len(set(nontoxic_urls_list[i]))}")

Toxic Class 0. URLS = 1000
Nontoxic Class 0. URLS = 1000
Toxic Class 1. URLS = 954
Nontoxic Class 1. URLS = 1000
Toxic Class 2. URLS = 1000
Nontoxic Class 2. URLS = 1000
Toxic Class 3. URLS = 1000
Nontoxic Class 3. URLS = 1000
Toxic Class 4. URLS = 1000
Nontoxic Class 4. URLS = 1000


In [18]:
# DOWNLOAD & SAVE THE CORRESPONDING IMAGES TO LOCAL DISK
for i in range(5):
    # GET TOXIC IMAGES
    get_images_and_save(class_id=i, urls=toxic_urls_list[i], toxic=True)
    # GET NONTOXIC IMAGES
    get_images_and_save(class_id=i, urls=nontoxic_urls_list[i], toxic=False)

## >81m for 10,000 images

'[INFO] downloaded: ../iNaturalist/nontoxic_images/004/998.jpg | Total 999'

In [None]:
## Fix issue with toxic class 2 (sumac - should be at least 600 urls available)
## Figure out deal with nontoxic class 4

# Clean Up

In [31]:
def clean_image_paths(image_directory):
    """
    Tries to load each image using OpenCV. If it returns None, the image is faulty and we delete it.
    If loading the image does not work and produces an error, we delete it.  
    """
    # loop over the image paths we just downloaded
    for imagePath in paths.list_images(image_directory):
        # initialize if the image should be deleted or not
        delete = False
        # try to load the image
        try:
            image = cv2.imread(imagePath)
            # if the image is `None` then we could not properly load it
            # from disk, so delete it
            if image is None:
                delete = True
        # if OpenCV cannot load the image then the image is likely
        # corrupt so we should delete it
        except Exception:
            print("Except")
            delete = True
        # check to see if the image should be deleted
        if delete:
            display("[INFO] deleting {}".format(imagePath))
            os.remove(imagePath)
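
The same kind of check can be made on the raw bytes before anything is written to disk. The sketch below is an alternative using Pillow rather than OpenCV, not the method used in this notebook; the helper name `is_valid_image` and the sample bytes are made up for illustration.

```python
from io import BytesIO

from PIL import Image

def is_valid_image(data):
    """Return True if Pillow can identify and verify the bytes as an image."""
    try:
        with Image.open(BytesIO(data)) as img:
            img.verify()  # raises an exception if the file is broken
        return True
    except Exception:
        return False

# Build a tiny valid PNG in memory, plus some bytes that are not an image
buffer = BytesIO()
Image.new("RGB", (4, 4), color=(0, 128, 0)).save(buffer, format="PNG")
good_bytes = buffer.getvalue()
bad_bytes = b"this is not an image"

print(is_valid_image(good_bytes), is_valid_image(bad_bytes))
```

Checking at download time would avoid saving faulty files in the first place, at the cost of decoding work during the (already slow) download loop.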

In [32]:
for cat in ["000/", "001/", "002/", "003/", "004/"]:
    tox_dir = ("../iNaturalist/toxic_images/"+cat)
    clean_image_paths(tox_dir)
    nontox_dir = os.path.join("../iNaturalist/nontoxic_images/",cat)
    clean_image_paths(nontox_dir)

At this point, I do a visual inspection of the images to determine if anything should be deleted manually.  

Then reset the image filenames so they are indexed 000, 001, ... without gaps.
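
Renaming files in place can overwrite a file if the new name is still taken by one that has not been renamed yet. The sketch below shows one defensive variant, a two-pass rename through temporary names, demonstrated on a throwaway directory. The helper `renumber_images` and the `tmp_` prefix are assumptions for the example, not part of this notebook's pipeline.

```python
import os
import tempfile

def renumber_images(directory, ext=".jpg"):
    """Rename the files in `directory` to 000.jpg, 001.jpg, ... with no gaps."""
    names = sorted(os.listdir(directory))
    # Pass 1: move every file to a temporary name so no final name is occupied
    for i, name in enumerate(names):
        os.rename(os.path.join(directory, name),
                  os.path.join(directory, f"tmp_{i:03d}{ext}"))
    # Pass 2: assign the final zero-padded names
    for i in range(len(names)):
        os.rename(os.path.join(directory, f"tmp_{i:03d}{ext}"),
                  os.path.join(directory, f"{i:03d}{ext}"))
    return len(names)

# Demo on empty placeholder files with gaps in the numbering
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["000.jpg", "002.jpg", "005.jpg"]:
        open(os.path.join(demo_dir, name), "w").close()
    renumber_images(demo_dir)
    renamed = sorted(os.listdir(demo_dir))

print(renamed)
```

The two passes cost one extra rename per file but make the result independent of the order in which the directory is listed.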

In [33]:
p1 = '../iNaturalist/toxic_images/'
p2 = '../iNaturalist/nontoxic_images/'
for path in [p+c for p in [p1, p2] for c in ["000/", "001/", "002/", "003/", "004/"]]:
    # sort the names so each file moves to an index at or below its current one
    # and never overwrites a file that has not been renamed yet
    for i, filename in enumerate(sorted(os.listdir(path))):
        str_i = f"{i:03d}" ##zero-pad to three digits
        os.rename(path + filename, path + str_i + ".jpg")

# Image Counts by Category

In [34]:
# Final Image Counts: 
p1 = "../iNaturalist/toxic_images/"
p2 = "../iNaturalist/nontoxic_images/"

print("Toxic Images:")
total = 0
for c in ['000/', '001/', '002/', '003/', '004/']:
    pth = os.path.join(p1, c)
    ims = len([name for name in os.listdir(pth)])
    total += ims
    print(f"Category {c} - Images: {ims}")
print("total = ", total)


print("")
print("Nontoxic Images:")
total = 0
for c in ['000/', '001/', '002/', '003/', '004/']:
    pth = os.path.join(p2, c)
    ims = len([name for name in os.listdir(pth)])
    total += ims
    print(f"Category {c} - Images: {ims}")
print("total = ", total)


Toxic Images:
Category 000/ - Images: 1000
Category 001/ - Images: 954
Category 002/ - Images: 1000
Category 003/ - Images: 1000
Category 004/ - Images: 999
total =  4953

Nontoxic Images:
Category 000/ - Images: 1000
Category 001/ - Images: 1000
Category 002/ - Images: 1000
Category 003/ - Images: 1000
Category 004/ - Images: 999
total =  4999


# Saving the URLs

In [36]:
for i in range(5):
    with open(f'../iNaturalist/saved_urls/toxic_{i}_urls.txt', 'w') as f:
        for url in toxic_urls_list[i]:
            f.write(f"{url}\n")
    with open(f'../iNaturalist/saved_urls/nontoxic_{i}_urls.txt', 'w') as f:
        for url in nontoxic_urls_list[i]:
            f.write(f"{url}\n")

# Expanding the metadata
We can create one full metadata CSV with a row for every image.
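
A minimal sketch of the idea, assuming made-up image counts instead of reading folders from disk: collect one dictionary per image and build the DataFrame once at the end. The `image_counts` mapping and the shortened paths are hypothetical.

```python
import pandas as pd

# Hypothetical per-class image counts; the notebook reads these from disk
image_counts = {("toxic", 0): 3, ("nontoxic", 0): 2}

rows = []
for (group, class_id), n_imgs in image_counts.items():
    for i in range(n_imgs):
        rows.append({
            "class_id": class_id,
            # zero-padded folder and file names, e.g. toxic_images/000/002.jpg
            "path": f"{group}_images/{class_id:03d}/{i:03d}.jpg",
            "toxicity": int(group == "toxic"),
        })

# One DataFrame construction instead of one concat per image
demo_meta = pd.DataFrame(rows)
print(demo_meta)
```

Building the frame from a list of rows avoids copying the growing DataFrame on every iteration, which matters once there are thousands of images.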

In [41]:
# Instantiate empty df
full_meta = pd.DataFrame()

# For each class_id
for class_id in range(5):
    # iterate through the nontoxic and toxic metadata tables
    for tox, meta in enumerate([nontoxic_meta, toxic_meta]): #0 for nontoxic, 1 for toxic
        # identify the path to its folder and the number of images in that folder
        if tox==0: local_path = f'../iNaturalist/nontoxic_images/00{class_id}/'
        if tox==1: local_path = f'../iNaturalist/toxic_images/00{class_id}/'
        kaggle_path = meta.loc[class_id, "path"]
        n_imgs = len([img for img in os.listdir(local_path)])
        # for each image, add a row to the empty df with the full path to the image (and other metadata)
        for i in range(n_imgs): 
            ##ensure that the image name is correct by zero-padding the index to three digits
            str_i = f"{i:03d}"
            ##create row and append to df
            img_row = pd.DataFrame({
                "class_id" : [class_id],
                "slang" : meta.loc[class_id, "slang"],
                "scientific_name" : meta.loc[class_id, "scientific_name"],
                "herbarium22_category_id" : meta.loc[class_id, "herbarium22_category_id"],
                "path" : kaggle_path+str_i+".jpg",
                "toxicity" : int(tox)
            })
            full_meta = pd.concat([full_meta, img_row])

            
full_meta = full_meta.reset_index(drop=True)


In [50]:
# Collapse species into 4 categories (poison oak, ivy, sumac, and nontoxic plants) for potential use
specieslabel_to_slang = {0:"poison-oak", 1:"poison-ivy", 2:"poison-sumac", 3:"nontoxic"}

full_meta["species_label"] = int(0) ##Poison Oak
full_meta.loc[((full_meta.class_id==2) | (full_meta.class_id==3)) & (full_meta.toxicity==1), 
         "species_label"] = int(1) ##Poison Ivy
full_meta.loc[(full_meta.class_id==4) & (full_meta.toxicity==1), "species_label"] = int(2) ##Poison Sumac
full_meta.loc[(full_meta.toxicity==0), "species_label"] = int(3) ##Nontoxic


full_meta.to_csv("../iNaturalist/full_metadata.csv", index=False)
full_meta

Unnamed: 0,class_id,slang,scientific_name,herbarium22_category_id,path,toxicity,species_label
0,0,Virginia creeper,Parthenocissus quinquefolia,10340,../input/toxic-plant-classification/tpc-imgs/n...,0,3
1,0,Virginia creeper,Parthenocissus quinquefolia,10340,../input/toxic-plant-classification/tpc-imgs/n...,0,3
2,0,Virginia creeper,Parthenocissus quinquefolia,10340,../input/toxic-plant-classification/tpc-imgs/n...,0,3
3,0,Virginia creeper,Parthenocissus quinquefolia,10340,../input/toxic-plant-classification/tpc-imgs/n...,0,3
4,0,Virginia creeper,Parthenocissus quinquefolia,10340,../input/toxic-plant-classification/tpc-imgs/n...,0,3
...,...,...,...,...,...,...,...
9947,4,Poison Sumac,Toxicodendron vernix,14629,../input/toxic-plant-classification/tpc-imgs/t...,1,2
9948,4,Poison Sumac,Toxicodendron vernix,14629,../input/toxic-plant-classification/tpc-imgs/t...,1,2
9949,4,Poison Sumac,Toxicodendron vernix,14629,../input/toxic-plant-classification/tpc-imgs/t...,1,2
9950,4,Poison Sumac,Toxicodendron vernix,14629,../input/toxic-plant-classification/tpc-imgs/t...,1,2
