# Introduction
Urban Rivers is a conservation organization helping to restore the Chicago River.  

Part of the project involves tracking changes in biodiversity attributable to the installation of floating wetlands.  
Volunteers have placed and maintain motion detection cameras (camera traps) along installations, natural river banks, and the existing metal retaining walls on the water way.

These pictures are available on s3 - this code investigates downloading a portion of those images for testing with SpeciesNet.

This workbook was used for detections using Kaggle's GPUs and is linked therein.  
https://www.kaggle.com/code/morescope/speciesnet-testing-urbanrivers

## Notebook Setup and Required Packages

Before importing packages below, retrieve your Kaggle API key following [these steps](https://www.kaggle.com/discussions/questions-and-answers/357399#2880045).  You should end up downloading a `kaggle.json` file, whose content will look like this:

`{"username":"your-kaggle-id","key":"your-kaggle-API-key"}`

In [26]:
# Data Handling
import pandas as pd
import numpy as np

# IO - getting files and images from MongoDB and S3
from pymongo import MongoClient

# # https://www.kaggle.com/discussions/questions-and-answers/357399
# # NOTE - Uncomment next two lines if `kaggle_secrets` throws errors
# import sys
# sys.path.append("docker-python/patches")
from kaggle_secrets import UserSecretsClient

import requests

from concurrent.futures import ThreadPoolExecutor

from pathlib import Path
from PIL import Image
from io import BytesIO

import os
import re
import shutil
import json
import time

# Install speciesnet and related megadetector libraries
!pip install -Uqq speciesnet megadetector-utils

from IPython.display import display
from IPython.display import JSON

from speciesnet import SpeciesNet
import kagglehub



In [27]:
# Run a quick check to see if the GPU is being used
# !pip install chardet  # NOTE - try this if next line throws a warning
!python -m speciesnet.scripts.gpu_test

*** Running Torch tests ***

Torch version: 2.6.0
CUDA available (according to PyTorch): False
No GPUs reported by PyTorch
PyTorch reports that Metal Performance Shaders are available


In [28]:
# Configuration for Multithreading and Batching
num_batches = 10
max_threads = 8
output_root = Path("output")

# Prepare folders
output_root.mkdir(exist_ok=True)
images_root = Path("images")
images_root.mkdir(exist_ok=True)

## Access The URIs from S3 through MongoDB

In [None]:
# Get the stored mongo uri secret - and if not using Kaggle, should be able to use `os.getenv()`
# user_secrets = UserSecretsClient()
# mongo_uri = user_secrets.get_secret("MONGO_URI")
mongo_uri = os.getenv("MONGO_URI")

# Connect to the MongoDB client
client = MongoClient(mongo_uri)
 
# Access the database and collection
db = client['test']
collection = db['cameratrapmedias'] 
 
# Query the collection to retrieve records with image URLs, metadata, and the first index of 'relativePath'
data = list(collection.aggregate([
    {
        '$project': {
            '_id': 0,
            'publicURL': 1,
            'timestamp': 1,
            'folderName': { '$arrayElemAt': ['$relativePath', 1] },
            'fileName': 1
        }
    },
    # { '$limit': 150 }
]))
 
# Convert the data to a pandas DataFrame for exploration
df = pd.DataFrame(data)

# Export the small array to a CSV file for preview
df.to_csv('ur_test_medias.csv', index=False)

## We are going to create a column that creates a file name to save the image

In [31]:
# This function will format the final string
def make_filename(s):
    # s = s.lower()
    s = re.sub(r'[^\w\s.-]', '', s) # remove special characters except dash or underscore or period
    s = re.sub(r'\s+', '_', s) # replace whitespace with underscore
    return s

# Combine the relative path second (folder name) + fileName
df['imageName'] = df['folderName'] + '--' + df['fileName']
df['imageName'] = df['imageName'].apply(make_filename)

print(df.head())

            timestamp                                          publicURL  \
0 2024-01-27 13:33:15  https://urbanriverrangers.s3.amazonaws.com/ima...   
1 2024-01-24 18:56:50  https://urbanriverrangers.s3.amazonaws.com/ima...   
2 2024-01-24 19:01:54  https://urbanriverrangers.s3.amazonaws.com/ima...   
3 2024-01-24 19:03:05  https://urbanriverrangers.s3.amazonaws.com/ima...   
4 2024-01-24 19:04:19  https://urbanriverrangers.s3.amazonaws.com/ima...   

       fileName                               folderName  \
0  SYFW0060.JPG                   2024-01-30_prologis_02   
1  SYFW0001.JPG  2024-01-30_Learnin_platform_camera_test   
2  SYFW0002.JPG  2024-01-30_Learnin_platform_camera_test   
3  SYFW0004.JPG  2024-01-30_Learnin_platform_camera_test   
4  SYFW0006.JPG  2024-01-30_Learnin_platform_camera_test   

                                           imageName  
0               2024-01-30_prologis_02--SYFW0060.JPG  
1  2024-01-30_Learnin_platform_camera_test--SYFW0...  
2  2024-01-30_Lea

Now that we have a connection to the MongoDB server and access to the URLs, let's use the download images.

## Download Images

In [33]:
%%time
# Create a directory to save the images
output_root.mkdir(exist_ok=True)
path = Path('images')
path.mkdir(exist_ok=True)

# Create a tool for resizing so cropping top and bottom can happen while keeping the aspect ratio
def resize_to_height(image, target_height=256):
    og_width, og_height = image.size
    new_width = int(og_width * (target_height / og_height))
    return image.resize((new_width, target_height))

# Create a tool for downloading and processing images
def process_row(row, dest_folder):
    url = row['publicURL']
    filename = row['imageName']
    # Download the image
    dest = dest_folder/filename

    try:
        # Download image to memory
        response = requests.get(url)
        response.raise_for_status()

        # Open and process the image
        image = Image.open(BytesIO(response.content)).convert("RGB")
        image = resize_to_height(image, target_height=256)
        image.save(dest, format="JPEG", quality=85)
        
    except Exception as e:
        print(f"failed to process{filename}: {e}")

# Download and display some images where at least an animal was found - ex rat
df_test = df[42410:42910] # 500 images with some known animal detections
df_big_chunk = df[0:10000] # first 10000 images
df_bigger_chunk = df[10001:60001] # second 50k

# Process Batches
for batch_idx, df_chunk in enumerate(np.array_split(df_test, num_batches)): # change to df_bigger_chunk to split a larger batch size
    batch_folder = images_root / f'batch_{batch_idx}'
    batch_folder.mkdir(exist_ok=True)

    print(f'Processing batch {batch_idx + 1} / {num_batches} with {len(df_chunk)} images...')

    start = time.time()
    
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(lambda row: process_row(row, batch_folder), [row for _, row in df_chunk.iterrows()])

    end = time.time()
    print(f"Batch {batch_idx+1} took {end - start:.2f} seconds.")
        
print(f'{len(df_test)} Images Downloaded and Resized')  # ... df_bigger_chunk

  return bound(*args, **kwds)


Processing batch 1 / 10 with 50 images...
Batch 1 took 2.26 seconds.
Processing batch 2 / 10 with 50 images...
Batch 2 took 2.38 seconds.
Processing batch 3 / 10 with 50 images...
Batch 3 took 2.31 seconds.
Processing batch 4 / 10 with 50 images...
Batch 4 took 2.43 seconds.
Processing batch 5 / 10 with 50 images...
Batch 5 took 2.37 seconds.
Processing batch 6 / 10 with 50 images...
Batch 6 took 2.54 seconds.
Processing batch 7 / 10 with 50 images...
Batch 7 took 2.39 seconds.
Processing batch 8 / 10 with 50 images...
Batch 8 took 2.29 seconds.
Processing batch 9 / 10 with 50 images...
Batch 9 took 2.04 seconds.
Processing batch 10 / 10 with 50 images...
Batch 10 took 2.02 seconds.
500 Images Downloaded and Resized
CPU times: user 22.4 s, sys: 4.22 s, total: 26.6 s
Wall time: 23.1 s


In [7]:
# Uncomment and run this if the images need to be redone
# !rm images -r
# !rm docs.zip
# %lsmagic

## Running Species Net on the Full Dataset
Now that we have the max number of images downloaded (19.5GB) let's run speciesnet

Note there might be a better way of doing this using bytes downloaded from s3 - but I haven't figured that part out yet.

### We're going to try a multithreading chunks approach

In [34]:
def print_predictions(predictions_dict: dict) -> None:
    print("Predictions:")
    for prediction in predictions_dict["predictions"][0:1]:
        print(prediction["filepath"], "=>", prediction["prediction"])

### Download Model

In [35]:
# Choose the folder we're going to download the model to
model_path = 'content/models'
os.makedirs(model_path, exist_ok=True)

# Download the model (it will go to a folder like /kaggle/input/...)
download_path = kagglehub.model_download('google/speciesnet/PyTorch/v4.0.1a',
                                          force_download=True)

print(f'Model downloaded to temporary folder: {download_path}')

# List the contents of the downloaded directory to identify the actual files/subdirectories
model_files = os.listdir(download_path)

# Copy the contents of the model file to our destination folder
for item_name in model_files:
    source_path = os.path.join(download_path, item_name)
    destination_path = os.path.join(model_path, item_name)
    if os.path.isfile(source_path):
        shutil.copy2(source_path, destination_path)
    elif os.path.isdir(source_path):
        shutil.copytree(source_path, destination_path, dirs_exist_ok=True)

print(f'{len(model_files)} files copied to: {model_path}')

Downloading 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

Downloading from https://www.kaggle.com/api/v1/models/google/speciesnet/PyTorch/v4.0.1a/1/download/always_crop_99710272_22x8_v12_epoch_00148.labels.txt...




Downloading from https://www.kaggle.com/api/v1/models/google/speciesnet/PyTorch/v4.0.1a/1/download/README.md...
Downloading from https://www.kaggle.com/api/v1/models/google/speciesnet/PyTorch/v4.0.1a/1/download/taxonomy_release.txt...



[A

100%|██████████| 119/119 [00:00<00:00, 55.0kB/s]

Downloading from https://www.kaggle.com/api/v1/models/google/speciesnet/PyTorch/v4.0.1a/1/download/geofence_release.2025.02.27.0702.json...




[A

Downloading from https://www.kaggle.com/api/v1/models/google/speciesnet/PyTorch/v4.0.1a/1/download/info.json...





100%|██████████| 399/399 [00:00<00:00, 719kB/s]

Downloading from https://www.kaggle.com/api/v1/models/google/speciesnet/PyTorch/v4.0.1a/1/download/always_crop_99710272_22x8_v12_epoch_00148.pt...






100%|██████████| 250k/250k [00:00<00:00, 1.61MB/s]


100%|██████████| 343k/343k [00:00<00:00, 1.81MB/s]

[A
[A


[A[A[A
[A


[A[A[A
100%|██████████| 5.03M/5.03M [00:00<00:00, 7.75MB/s]



[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A[A


[A[A

Model downloaded to temporary folder: /Users/oldadministrator/.cache/kagglehub/models/google/speciesnet/PyTorch/v4.0.1a/1
6 files copied to: content/models





In [36]:
# Pick the model we want to use (4.0.1a)
model = SpeciesNet(model_path)

print('Model Loaded')

Model Loaded


In [38]:
# Let's format a request string as a list of dicts (aka JSON string format)
def create_instances(batch_folder):
    image_paths = [f'{batch_folder}/{f}' for f in os.listdir(batch_folder) if f.lower().endswith('.jpg')]

    instances = []
    for image_path in image_paths:
        instances.append({
            'filepath': image_path,
            'latitude': 41.906782,
            'longitude': -87.651927
        })

    # Check that it's saved correctly by verifying the first
    print(instances[0])

    return instances


for batch_index in range(len(os.listdir(images_root))):
    instances = create_instances(f'{images_root}/batch_{batch_index+1}')

    # make the predictions and get a sense of how long it would take
    %time predictions_dict = model.predict(instances_dict={"instances": instances})

    print_predictions(predictions_dict) # show the first prediction of each batch

    # Save the dict to the batch folder
    with open(f'{images_root}/batch_{batch_index}/predictions_dict_{batch_index}.json', 'w') as f:
        json.dump(predictions_dict, f, indent=2)

    print(f'predictions_dict_{batch_index}.json saved to {images_root}/batch_{batch_index}')

{'filepath': 'images/batch_1/2024-02-01_Bubbly_spypoint_garden--PICT3482.JPG', 'latitude': 41.906782, 'longitude': -87.651927}
CPU times: user 19.9 s, sys: 2min 41s, total: 3min 1s
Wall time: 27.7 s
Predictions:
images/batch_1/2024-02-01_Bubbly_spypoint_garden--PICT3482.JPG => f1856211-cfb7-4a5b-9158-c0f72fd09ee6;;;;;;blank
predictions_dict_0.json saved to images/batch_0
{'filepath': 'images/batch_2/2024-05-25_WM_Boardwalk_D--PICT1409.JPG', 'latitude': 41.906782, 'longitude': -87.651927}
CPU times: user 19.7 s, sys: 2min 43s, total: 3min 2s
Wall time: 25.7 s
Predictions:
images/batch_2/2024-05-25_WM_Boardwalk_D--PICT1409.JPG => f1856211-cfb7-4a5b-9158-c0f72fd09ee6;;;;;;blank
predictions_dict_1.json saved to images/batch_1
{'filepath': 'images/batch_3/2024-05-25_WM_Boardwalk_D--PICT1353.JPG', 'latitude': 41.906782, 'longitude': -87.651927}
CPU times: user 19.6 s, sys: 2min 46s, total: 3min 6s
Wall time: 26.7 s
Predictions:
images/batch_3/2024-05-25_WM_Boardwalk_D--PICT1353.JPG => f18562

FileNotFoundError: [Errno 2] No such file or directory: 'images/batch_10'

In [39]:
# To concatenate all the json files
output_file = output_root / "predictions_dict_master.json"

# Initialize the master predictions list
master_predictions = []

# Loop through files matching the pattern
for json_file in sorted(images_root.glob("batch_*/predictions_dict_*.json")):
    with open(json_file, "r") as f:
        data = json.load(f)
        if "predictions" in data:
            master_predictions.extend(data["predictions"])  # Concatenate predictions!
        else:
            print(f"{json_file} missing 'predictions' key")

# Write the combined predictions to a new file
with open(output_file, "w") as f:
    json.dump({"predictions": master_predictions}, f, indent=2)

print(f"Combined {len(master_predictions)} predictions into {output_file}")


Combined 1444 predictions into output/predictions_dict_master.json


## Let's save the predictions dict json file

In [42]:
# Create a docs folder for previewing the images
output_path = 'working/output/docs'
os.makedirs(output_path, exist_ok=True)

# change n to sample to -1 for all
!python -m megadetector.postprocessing.postprocess_batch_results output/predictions_dict_master.json working/output/docs --num_images_to_sample 2000 --confidence_threshold 0.5

Loading results from output/predictions_dict_master.json
This appears to be a SpeciesNet output file, converting to MD format
Writing temporary results to /var/folders/kw/7t07x0j577n5mp3fbbf1q4lh0000gq/T/megadetector_temp_files/7c13e294-3905-11f0-81e5-fa4f0e7b467b.json
Converting results to dataframe
Finished loading MegaDetector results for 1444 images from output/predictions_dict_master.json
Assigning images to rendering categories
100%|████████████████████████████████████| 1444/1444 [00:00<00:00, 76902.05it/s]
Finished loading and preprocessing 1444 rows from detector output, predicted 42 positives.
100%|██████████████████████████████████████| 1444/1444 [00:09<00:00, 150.58it/s]
Rendered 1444 images (of 1444) in 9.59 seconds (0.01 seconds per image)
Generating classification category report
This appears to be a SpeciesNet output file, converting to MD format
Writing temporary results to /var/folders/kw/7t07x0j577n5mp3fbbf1q4lh0000gq/T/megadetector_temp_files/81e2aa84-3905-11f0-81e5-

## Let's zip the folder so we can easily download it

In [43]:
shutil.make_archive('working/output/docs', 'zip', 'working/output/docs')

# and finally clean up the tree that made it this far
shutil.rmtree('working/output/docs')  # Deletes the folder
