# Downloading North American Camera Trap Images

https://lila.science/datasets/nacti

"This data set contains 3.7M camera trap images from five locations across the United States, with labels for 28 animal categories, primarily at the species level (for example, the most common labels are cattle, boar, and red deer). Approximately 12% of images are labeled as empty. We have also added bounding box annotations to 8892 images (mostly vehicles and birds)."

Goal: Download min(all, 55K) images per class for the classes I want from the NACTI dataset (incl. empty), i.e. every image of a class, capped at 55K. I have 500GB to work with.

The dataset maintainers provide these instructions for downloading their data:
- [lila database "sas urls"](https://lila.science/wp-content/uploads/2020/03/lila_sas_urls.txt)
- [example python script](https://github.com/microsoft/CameraTraps/blob/master/data_management/download_lila_subset.py)
- [quick writeup on different ways to download the data](https://lila.science/image-access)

# Specifying our subset

We're going to download the images into a new folder named `nacti` inside our local data directory.

In [1]:
from utils import *
import requests

DATA = Path("/home/rory/data")
NACTI = DATA / "nacti"
ZIP = NACTI / "metadata.json.zip"

First, we download the NACTI annotations file (named `metadata.json`). We need it to find out exactly what the animal classes are named, because the subset-download script requires those exact names. (Note that in the output below, you should only see the `metadata.json.zip` file if you're running this for the first time.)

In [2]:
URL = 'https://lilablobssc.blob.core.windows.net/nacti/nacti_metadata.json.zip'

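NACTI.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists

r = requests.get(URL)
r.raise_for_status()  # fail loudly instead of saving an error page as the zip
with open(ZIP, 'wb') as f:
    f.write(r.content)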

list(NACTI.ls())

[Path('/home/rory/data/nacti/models'),
 Path('/home/rory/data/nacti/backups'),
 Path('/home/rory/data/nacti/imgs'),
 Path('/home/rory/data/nacti/bad_imgs'),
 Path('/home/rory/data/nacti/bad_paths-empty.txt'),
 Path('/home/rory/data/nacti/metadata.json'),
 Path('/home/rory/data/nacti/metadata.json.zip'),
 Path('/home/rory/data/nacti/lila_sas_urls.txt'),
 Path('/home/rory/data/nacti/bad_img_files.txt'),
 Path('/home/rory/data/nacti/urls_to_download-empty.txt'),
 Path('/home/rory/data/nacti/2021-12-13-1830_cats6_err036')]

I then ran `unzip` in my terminal.

In [3]:
#!unzip ...
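
If you'd rather stay inside the notebook, the standard-library `zipfile` module can do the same extraction. This is a minimal sketch, not the command I actually ran; `extract_zip` is a hypothetical helper name, and the usage line assumes the `ZIP` and `NACTI` paths defined in the first cell.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(dest_dir)
        return zf.namelist()


# Usage (assumes ZIP and NACTI from the first cell):
# extract_zip(ZIP, NACTI)
```

The returned list of member names is handy for checking what the extracted annotations file is actually called before loading it.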

Let's take a look at the top-level keys of the annotations JSON.

In [4]:
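# NACTI_ANNOS is not defined in this notebook; it presumably comes from `utils`
# and points at the unzipped annotations file (e.g. NACTI / "metadata.json")
annos = load_json(NACTI_ANNOS)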
len(annos['categories']) , annos.keys()

(59, dict_keys(['images', 'info', 'categories', 'annotations']))

We can already tell that this probably follows the COCO JSON annotation format. Let's take a look at one of the records in `categories`:

In [5]:
annos['categories'][0]

{'id': 1,
 'name': 'alces alces',
 'species': 'alces alces',
 'genus': 'alces',
 'family': 'cervidae',
 'ord': 'artiodactyla',
 'class': 'mammalia',
 'common name': 'moose'}

After looking at this record, it's clear that I personally need to work with the `common name` field because I don't know the "binomial" name of virtually any animal.

In [6]:
common_names = list(set([a['common name'] for a in annos['categories']]))
common_names

['raccoon',
 'vehicle',
 'california quail',
 'american marten',
 'unidentified bird',
 'american red squirrel',
 'ermine',
 'wolf',
 'black-tailed jackrabbit',
 'gray fox',
 'cougar',
 'fox squirrel',
 'domestic cow',
 'horse',
 'eastern gray squirrel',
 'unidentified pocket gopher',
 'empty',
 'elk',
 'red deer',
 'black rat',
 'white-tailed deer',
 'bobcat',
 'nine-banded armadillo',
 'striped skunk',
 'domestic dog',
 'north american porcupine',
 'california ground squirrel',
 'dark-eyed junco',
 'virginia opossum',
 'unidentified pack rat',
 'yellow-bellied marmot',
 'wild turkey',
 'unidentified sciurus',
 'unidentified deer',
 'house wren',
 'coyote',
 'long-tailed weasel',
 'moose',
 'unidentified corvus',
 'mule deer',
 'unidentified mouse',
 "steller's jay",
 'wild boar',
 'american crow',
 'mourning dove',
 'donkey',
 'unidentified chipmunk',
 'unidentified rodent',
 'american black bear',
 'gray jay',
 'unidentified deer mouse',
 'unidentified rabbit',
 'north american rive

I need to pass the subset-downloader script a list of class names. Each entry must be the category's `name` field (not its `common name` field), and the script will then download every image in those classes. (Note that 'person' is not a category due to the legal PII issues it would create. Don't share images of people without their consent!)

In [10]:
my_common_names = [
    'california quail',
    'wild turkey',
    'mule deer',
    'cougar',
    'california ground squirrel',
    'american black bear',
    'vehicle',
    'empty'
]

Now I just map each `common name` back to its `name` to get my final list.

In [11]:
my_names = [a['name'] for a in annos['categories'] if a['common name'] in my_common_names]
my_names

['callipepla californica',
 'meleagris gallopavo',
 'odocoileus hemionus',
 'puma concolor',
 'otospermophilus beecheyi',
 'ursus americanus',
 'vehicle',
 'empty']
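
One thing the list comprehension above won't tell you is whether a common name was misspelled: an entry with no match is silently dropped. Here is a small sketch of a stricter lookup, run on toy data shaped like the `categories` record shown earlier. `names_for_common_names` is a hypothetical helper of my own, not part of the LILA script.

```python
from collections import defaultdict


def names_for_common_names(categories, wanted):
    """Map each wanted 'common name' to its 'name'(s), raising on unknown entries."""
    lookup = defaultdict(list)
    for c in categories:
        lookup[c["common name"]].append(c["name"])
    # Membership tests on a defaultdict do not create keys, so this is safe
    missing = [w for w in wanted if w not in lookup]
    if missing:
        raise KeyError(f"No category with common name(s): {missing}")
    return [name for w in wanted for name in lookup[w]]


# Toy data in the same shape as annos['categories']
toy_categories = [
    {"id": 1, "name": "alces alces", "common name": "moose"},
    {"id": 2, "name": "puma concolor", "common name": "cougar"},
]
print(names_for_common_names(toy_categories, ["cougar", "moose"]))
```

Unlike the comprehension, the result follows the order of the requested list rather than the order of `categories`; the download script doesn't care about order either way.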

# Downloading our subset with `download_lila_subset.py`

The following code is a very slightly altered version of the script provided by the Microsoft camera trap team that's available from their GitHub [here](https://github.com/microsoft/CameraTraps/blob/master/data_management/download_lila_subset.py). A lot of the code pertains to using `azcopy` (which I didn't use).

## Get links to the images of animals in our subset

In [None]:
#
# download_lila_subset.py
#
# Example of how to download a list of files from LILA, e.g. all the files
# in a data set corresponding to a particular species.
#



# ----- Imports ----- #

import json
import urllib.request
import tempfile
import zipfile
import os

from tqdm import tqdm
from multiprocessing.pool import ThreadPool
from urllib.parse import urlparse




# ----- Constants ----- #

# This file specifies which datasets are available
metadata_url = 'http://lila.science/wp-content/uploads/2020/03/lila_sas_urls.txt'

# List of the datasets we want images from
datasets_of_interest = ['NACTI']

# Our subset of species
species_of_interest = my_names

# Where we'll save the downloaded images
output_dir = "/home/rory/data/nacti/downloads"
os.makedirs(output_dir,exist_ok=True)
overwrite_files = False
n_download_threads = 50




# ----- Helper Functions ----- #

def download_url(url, destination_filename=None, force_download=False, verbose=True):
    """
    Download a URL (defaulting to a temporary file)
    """
    if destination_filename is None:
        temp_dir = os.path.join(tempfile.gettempdir(),'lila')
        os.makedirs(temp_dir,exist_ok=True)
        url_as_filename = url.replace('://', '_').replace('.', '_').replace('/', '_')
        destination_filename = \
            os.path.join(temp_dir,url_as_filename)
            
    if (not force_download) and (os.path.isfile(destination_filename)):
        print('Bypassing download of already-downloaded file {}'.format(os.path.basename(url)))
        return destination_filename
    
    if verbose:
        print('Downloading file {} to {}'.format(os.path.basename(url),destination_filename),end='')
    
    os.makedirs(os.path.dirname(destination_filename),exist_ok=True)
    urllib.request.urlretrieve(url, destination_filename)  
    assert(os.path.isfile(destination_filename))
    
    if verbose:
        nBytes = os.path.getsize(destination_filename)    
        print('...done, {} bytes.'.format(nBytes))
        
    return destination_filename


def download_relative_filename(url, output_base, verbose=False):
    """
    Download a URL to output_base, preserving relative path
    """
    p = urlparse(url)
    # remove the leading '/'
    assert p.path.startswith('/'); relative_filename = p.path[1:]
    destination_filename = os.path.join(output_base,relative_filename)
    download_url(url, destination_filename, verbose=verbose)
    
    
def unzip_file(input_file, output_folder=None):
    """
    Unzip a zipfile to the specified output folder, defaulting to the same location as
    the input file    
    """
    if output_folder is None:
        output_folder = os.path.dirname(input_file)
        
    with zipfile.ZipFile(input_file, 'r') as zf:
        zf.extractall(output_folder)

        
        
              
# ----- First, download and parse the metadata file ----- #

# Put the master metadata file in the same folder where we're putting images
p = urlparse(metadata_url)
metadata_filename = os.path.join(output_dir,os.path.basename(p.path))
download_url(metadata_url, metadata_filename)

# Read lines from the master metadata file
with open(metadata_filename,'r') as f:
    metadata_lines = f.readlines()
metadata_lines = [s.strip() for s in metadata_lines]

# Parse those lines into a table
metadata_table = {}

for s in metadata_lines:
    
    if len(s) == 0 or s[0] == '#':
        continue
    
    # Each line in this file is name/sas_url/json_url
    tokens = s.split(',')
    assert len(tokens)==3
    url_mapping = {'sas_url':tokens[1],'json_url':tokens[2]}
    metadata_table[tokens[0]] = url_mapping
    
    assert 'https' not in tokens[0]
    assert 'https' in url_mapping['sas_url']
    assert 'https' in url_mapping['json_url']

    
    
    
# ----- Second, download and extract metadata for the specified datasets ----- #

for ds_name in datasets_of_interest:
    
    assert ds_name in metadata_table
    json_url = metadata_table[ds_name]['json_url']
    
    p = urlparse(json_url)
    json_filename = os.path.join(output_dir,os.path.basename(p.path))
    download_url(json_url, json_filename)
    
    # Unzip if necessary
    if json_filename.endswith('.zip'):
        
        with zipfile.ZipFile(json_filename,'r') as z:
            files = z.namelist()
        assert len(files) == 1
        unzipped_json_filename = os.path.join(output_dir,files[0])
        if not os.path.isfile(unzipped_json_filename):
            unzip_file(json_filename,output_dir)        
        else:
            print('{} already unzipped'.format(unzipped_json_filename))
        json_filename = unzipped_json_filename
    
    metadata_table[ds_name]['json_filename'] = json_filename
    # ...for each dataset of interest




# ----- Third, make the list of files to download (for all data sets) ----- #

# Flat list of URLs, for use with direct Python downloads
urls_to_download = []

# For use with azcopy
downloads_by_dataset = {}

for ds_name in datasets_of_interest:
    
    json_filename = metadata_table[ds_name]['json_filename']
    sas_url = metadata_table[ds_name]['sas_url']
    
    base_url = sas_url.split('?')[0]    
    assert not base_url.endswith('/')
    
    sas_token = sas_url.split('?')[1]
    assert not sas_token.startswith('?')
    
    ## Open the metadata file
    
    with open(json_filename, 'r') as f:
        data = json.load(f)
    
    categories = data['categories']
    for c in categories:
        c['name'] = c['name'].lower()
    category_id_to_name = {c['id']:c['name'] for c in categories}
    annotations = data['annotations']
    images = data['images']


    ## Build a list of image files (relative path names) that match the target species

    category_ids = []
    
    for species_name in species_of_interest:
        matching_categories = list(filter(lambda x: x['name'] == species_name, categories))
        if len(matching_categories) == 0:
            continue
        assert len(matching_categories) == 1
        category = matching_categories[0]
        category_id = category['id']
        category_ids.append(category_id)
    
    print('Found {} matching categories for data set {}:'.format(len(category_ids),ds_name))
    
    if len(category_ids) == 0:
        continue
    
    for i_category,category_id in enumerate(category_ids):
        print(category_id_to_name[category_id],end='')
        if i_category != len(category_ids) -1:
            print(',',end='')
    print('')
    
    # Retrieve all the images that match that category
    image_ids_of_interest = set([ann['image_id'] for ann in annotations if ann['category_id'] in category_ids])
    
    print('Selected {} of {} images for dataset {}'.format(len(image_ids_of_interest),len(images),ds_name))
    
    # Retrieve image file names
    filenames = [im['file_name'] for im in images if im['id'] in image_ids_of_interest]
    assert len(filenames) == len(image_ids_of_interest)
    
    # Convert to URLs
    for fn in filenames:        
        url = base_url + '/' + fn
        urls_to_download.append(url)

    downloads_by_dataset[ds_name] = {'sas_url':sas_url,'filenames':filenames}
    
# ...for each dataset



print('Found {} images to download'.format(len(urls_to_download)))
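
Before downloading, it can help to know how many images each selected category contributes, since that decides which classes need to be capped. A rough sketch, assuming the `annotations` and `categories` lists carry the COCO-style fields the script above already relies on (`image_id`, `category_id`, `id`, `name`). The toy data and the helper name `images_per_category` are made up for illustration.

```python
from collections import defaultdict


def images_per_category(annotations, categories):
    """Return {category name: number of distinct images annotated with it}."""
    id_to_name = {c["id"]: c["name"] for c in categories}
    image_ids = defaultdict(set)
    for ann in annotations:
        image_ids[id_to_name[ann["category_id"]]].add(ann["image_id"])
    return {name: len(ids) for name, ids in image_ids.items()}


# Toy data standing in for data['categories'] and data['annotations']
toy_categories = [{"id": 1, "name": "puma concolor"}, {"id": 2, "name": "empty"}]
toy_annotations = [
    {"image_id": "a", "category_id": 1},
    {"image_id": "b", "category_id": 2},
    {"image_id": "c", "category_id": 2},
]
counts = images_per_category(toy_annotations, toy_categories)
print(counts)
```

On the real file you would call `images_per_category(annotations, categories)` with the variables the script has already loaded.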

Note that there are many `empty` images, and I didn't use all of them because of limited disk space. To save space, I ran this script twice: once to get the animals and vehicles, and a second time to get the `empty` images. On the second run only, I kept a random subset of the links to empty images by running the following line of code at this point:

In [13]:
urls_to_download = L(urls_to_download).shuffle()[:55_000]
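
`L` here appears to be fastcore's list class (pulled in through `utils`), and its shuffle gives a different subset on every run. If you want the same links each time, for example to resume an interrupted download, a standard-library sketch with a fixed seed could look like this. `sample_urls` and the seed value are my own illustrative choices, and the URLs are placeholders.

```python
import random


def sample_urls(urls, k, seed=0):
    """Return up to k URLs chosen without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    urls = list(urls)
    if len(urls) <= k:
        # Fewer than k available: keep them all, i.e. min(all, k)
        return urls
    return rng.sample(urls, k)


# Placeholder URLs standing in for urls_to_download
toy_urls = [f"https://example.com/img_{i}.jpg" for i in range(10)]
subset = sample_urls(toy_urls, 3)
print(len(subset))
```

On the real list this would be `urls_to_download = sample_urls(urls_to_download, 55_000)`.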

## Download images from the list of links

In [19]:
# ----- Download images ----- #

def download_file(url, output_base, verbose=False):
    """
    Download a URL to output_base, preserving relative path
    """
    p = urlparse(url)
    # remove the leading '/'
    assert p.path.startswith('/'); relative_filename = p.path[1:]
    destination_filename = os.path.join(output_base,relative_filename)
    download_url(url, destination_filename, verbose=verbose)

    
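def download_from_list(urls, dest, n_threads=50):
    n = len(urls)
    print(f"Downloading {n} files to {dest} ...")
    if n_threads <= 1:
        for url in tqdm(urls):        
            download_file(url, dest, verbose=True)
    else:
        # Wrap the tqdm iterator in list() so it is actually consumed. Without it,
        # nothing iterates over tqdm, the bar stays at 0%, and this function
        # returns before the downloads have finished.
        with ThreadPool(n_threads) as pool:
            list(tqdm(pool.imap(lambda fn: download_file(fn, dest, verbose=False), urls), total=n))
    print("Finished downloading files.")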
    
    
download_from_list(urls_to_download, output_dir, n_download_threads)

Downloading 55000 files to /home/rory/data/nacti/downloads ...


  0%|          | 0/55000 [00:00<?, ?it/s]

Finished downloading files.
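
With tens of thousands of requests, some downloads may fail or be cut short. Here is a quick way to find out which ones, sketched under the assumption that each file is saved at `output_base` plus the URL path (which is what `download_file` does above). `missing_urls` is a hypothetical helper, not part of the original script.

```python
import os
from urllib.parse import urlparse


def missing_urls(urls, output_base):
    """Return the URLs whose expected local file is absent or empty."""
    missing = []
    for url in urls:
        # Same mapping as download_file: drop the leading '/' from the URL path
        relative_filename = urlparse(url).path[1:]
        local_path = os.path.join(output_base, relative_filename)
        if not os.path.isfile(local_path) or os.path.getsize(local_path) == 0:
            missing.append(url)
    return missing


# Usage (assumes the variables defined in the cells above):
# retry = missing_urls(urls_to_download, output_dir)
# download_from_list(retry, output_dir, n_download_threads)
```

This only checks that a non-empty file exists; it does not verify that the file is a readable image.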
