# Getting started with Pyseter

Pyseter is a Python package for sorting images by an automatically generated ID. The main functions of Pyseter are:

1. Extracting features from images
2. Clustering images by proposed ID
3. Sorting images by cluster
4. Grading images by distinctiveness

This notebook walks you through each major function. First, let's make sure that Pyseter is properly installed and that it can access PyTorch.

In [8]:
import pyseter

pyseter.verify_pytorch()

✓ PyTorch 2.7.1+cu126 detected
✓ CUDA GPU available: NVIDIA H200 NVL


If you're on a Mac, you should see something like

```
✓ PyTorch 2.7.0 detected
✓ Apple Silicon (MPS) GPU available
```

Please note, however, that *AnyDorsal* consumes quite a bit of memory. As such, only Apple Silicon devices with 16 GB or more of memory will work. Ideally, future versions of Pyseter will use a smaller model.

If neither Apple Silicon nor an NVIDIA GPU is available, you will see a message like this.

```
✓ PyTorch 2.7.1+cu126 detected
! No GPU acceleration available. Expect slow feature extraction.
```
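
If you want to see which device PyTorch itself reports, independent of Pyseter, a minimal sketch follows. It relies only on `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. The helper name `detect_device` is made up for this example and is not part of Pyseter; how `verify_pytorch()` performs its check internally is not shown in this notebook.

```python
def detect_device():
    """Return 'cuda', 'mps', or 'cpu', or 'unavailable' if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "unavailable"

    # NVIDIA GPUs are exposed through the CUDA backend
    if torch.cuda.is_available():
        return "cuda"

    # Apple Silicon GPUs are exposed through the MPS backend (absent in old PyTorch)
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    # no GPU acceleration: everything runs on the CPU
    return "cpu"


print(detect_device())
```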

**A note for R users** In R, the above code block would look something like

```
library(pyseter)
verify_pytorch()
```

or, 

```
pyseter::verify_pytorch()
```

Imports work a little differently in Python. First, we need to tell Python to load the package with `import pyseter`; then we explicitly call the function through the package name with `pyseter.verify_pytorch()`. To an R user, this can feel overly wordy. Nevertheless, this wordiness helps keep the global environment clean. Whereas R sessions frequently have to deal with [masking names](https://adv-r.hadley.nz/functions.html?q=masking#lexical-scoping), this rarely happens in Python.
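
The difference between the import styles is easiest to see with a module from the standard library. The sketch below uses `math` as a stand-in for any package; nothing in it is specific to Pyseter.

```python
import math                 # the module is loaded, its functions stay namespaced
from math import sqrt       # pulls a single name into the current namespace
import math as m            # same module under a shorter alias

print(math.sqrt(16.0))      # explicit: module.function
print(sqrt(16.0))           # works because of the `from` import
print(m.sqrt(16.0))         # alias, still namespaced


# a local function named `floor` does not clash with math.floor,
# because math.floor is only reachable through the module name
def floor(x):
    return "my own floor"


print(floor(2.5), math.floor(2.5))
```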

## Folder management

The main purpose of Pyseter is organizing images into folders. To keep things clean and tidy, we recommend establishing a `working directory` with a subfolder, e.g., `all images`, that contains every image you want sorted (see below for a different case). Optionally, you might keep a .csv with encounter information in the working directory. This .csv would contain two columns: one for the image name, listing every image in `all images`, and another for the encounter. The working directory would then look like this.

```
working directory
├── encounter_info.csv
└── all images
    ├── 00cef32dc62b0f.jpg
    ├── 3ecc025ea6f9bf.jpg
    ├── 9f18762a48696b.jpg
    ├── 36f78517a512dd.jpg
    ├── 470d524b4d5303.jpg
    ├── ...
    └── 4511c9e5cb7acb.jpg
```
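
Before running anything, it can help to confirm that the optional .csv and the image folder agree. The sketch below builds a tiny throwaway working directory and compares the two using only the standard library. The column name `image`, the encounter label, and the helper `check_layout` are assumptions made for illustration; they are not Pyseter functions.

```python
import csv
import tempfile
from pathlib import Path


def check_layout(working_dir, image_folder="all images", csv_name="encounter_info.csv"):
    """Return (images missing from the csv, csv rows without an image file)."""
    working_dir = Path(working_dir)
    on_disk = {p.name for p in (working_dir / image_folder).iterdir() if p.is_file()}
    with open(working_dir / csv_name, newline="") as f:
        in_csv = {row["image"] for row in csv.DictReader(f)}
    return sorted(on_disk - in_csv), sorted(in_csv - on_disk)


# build a throwaway example: two images on disk, only one listed in the csv
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "all images").mkdir()
    for name in ["00cef32dc62b0f.jpg", "3ecc025ea6f9bf.jpg"]:
        (root / "all images" / name).touch()
    with open(root / "encounter_info.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "encounter"])
        writer.writerow(["00cef32dc62b0f.jpg", "encounter_01"])

    missing_from_csv, missing_from_disk = check_layout(root)

print("Not in csv:", missing_from_csv)
print("Not on disk:", missing_from_disk)
```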

### Optional: prep_images

Sometimes, you might have your images organized into subfolders by encounter. 

```
working_dir
└── original_images
    ├── SL_HI_006_20220616 (CROPPED)
    │   ├── 2022-06-16_CLD500_CL_006.JPG
    │   ├── 2022-06-16_CLD500_CL_007.JPG
    │   ├── 2022-06-16_CLD500_CL_008.JPG
    │   ├── 2022-06-16_CLD500_CL_021.JPG
    │   ├── 2022-06-16_CLD500_CL_042.JPG
...
    ├── SL_HI_007_20220616 (CROPPED)
    │   ├── 2022-06-16_CLD500_CL_346.JPG
    │   ├── 2022-06-16_CLD500_CL_347.JPG
    │   ├── 2022-06-16_CLD500_CL_371.JPG
    │   ├── 2022-06-16_CLD500_CL_372.JPG
```

In this case, you might want to accomplish two tasks: move all these images to one folder, and create a .csv that indicates which image belongs to which encounter (i.e., a map from image to encounter). The `prep_images()` function does just that. 

In [9]:
from pyseter.sort import prep_images

# various directories we'll be working with
working_dir = '/home/pattonp/koa_scratch/id_data/working_dir'
original_image_dir = working_dir + '/original_images'

# new directory containing every image
image_dir = working_dir + '/all_images'

prep_images(original_image_dir, all_img_dir=image_dir)

Copied 1230 images to: /home/pattonp/koa_scratch/id_data/working_dir/all_images
Saved encounter information to: /home/pattonp/koa_scratch/id_data/working_dir/original_images/encounter_info.csv


**A note for R users** In Python, you can concatenate strings with the `+` operator. This is equivalent to `paste0(working_dir, '/original_images')` in R.
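
String concatenation works, but you have to remember the separator yourself. As an alternative, the standard library can build paths for you. The sketch below uses a made-up directory, `/tmp/working_dir`, purely for illustration.

```python
import os
from pathlib import Path

working_dir = "/tmp/working_dir"  # hypothetical path for this example

# three ways to build the same path
by_concat = working_dir + "/original_images"
by_join = os.path.join(working_dir, "original_images")
by_pathlib = Path(working_dir) / "original_images"

print(by_concat)
print(by_join)
print(by_pathlib)
```

`Path` objects can be passed to most functions that expect a path string, and `str(by_pathlib)` converts back when a plain string is required.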

**A note for R users** Packages in Python tend to be subdivided into modules based on their functions. In pyseter, the `sort` module contains functions for sorting files, including other forms of file management. 
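
To make the flattening step concrete, here is a rough standard-library sketch of the general idea: copy every image out of its encounter subfolder and record which encounter it came from. This is not how `prep_images()` is implemented. The function `flatten_encounters`, its arguments, and the folder names are hypothetical, and details such as how duplicate file names are handled are left out.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def flatten_encounters(original_dir, all_img_dir, csv_path):
    """Copy images from encounter subfolders into one folder and map image -> encounter."""
    original_dir, all_img_dir = Path(original_dir), Path(all_img_dir)
    all_img_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for encounter in sorted(p for p in original_dir.iterdir() if p.is_dir()):
        for image in sorted(p for p in encounter.iterdir() if p.is_file()):
            shutil.copy2(image, all_img_dir / image.name)
            rows.append((image.name, encounter.name))
    # write the image-to-encounter map
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "encounter"])
        writer.writerows(rows)
    return rows


# throwaway example with two encounter folders
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original_images"
    for encounter, image in [("enc_A", "a1.JPG"), ("enc_A", "a2.JPG"), ("enc_B", "b1.JPG")]:
        (original / encounter).mkdir(parents=True, exist_ok=True)
        (original / encounter / image).touch()

    rows = flatten_encounters(original, Path(tmp) / "all_images", Path(tmp) / "encounter_info.csv")
    copied = sorted(p.name for p in (Path(tmp) / "all_images").iterdir())

print(rows)
print(copied)
```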

## Extracting features

Pyseter extracts *feature vectors* for every image with [AnyDorsal](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.14167), an algorithm for identifying whales and dolphins of many species. Feature vectors summarize three-dimensional images into one-dimensional vectors that are useful for the task at hand, in this case, individual identification. 
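
To get a feel for the shapes involved, the toy sketch below uses NumPy and random numbers. The `toy_features` function is a stand-in that simply averages blocks of pixels; it is not AnyDorsal and carries no information about identity. The point is only that a height × width × channel array goes in and a flat vector comes out, and that two such vectors can then be compared, for example with cosine similarity (an assumption chosen here for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)


def toy_features(image, grid=4):
    """Toy stand-in for a feature extractor: average colour in a grid x grid layout."""
    h, w, c = image.shape
    # trim so the image divides evenly into grid cells, then average each cell
    image = image[: h - h % grid, : w - w % grid]
    cells = image.reshape(grid, image.shape[0] // grid, grid, image.shape[1] // grid, c)
    return cells.mean(axis=(1, 3)).ravel()


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


image = rng.random((64, 96, 3))  # height x width x colour channels
vector = toy_features(image)

print(image.shape, "->", vector.shape)
print(cosine_similarity(vector, vector))
```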

Before we extract the feature vectors, let's first create a subfolder within our working directory to save them in. This isn't necessary, but it keeps things tidy.

In [10]:
import os

# in case you want to save the features after extracting them 
feature_dir = working_dir + '/features'
os.makedirs(feature_dir, exist_ok=True)


**A note for R users** The `os` module is part of Python's *standard library*. R users often refer to R and its bundled packages as "base R." Base R includes the stats package, which provides the function `rnorm`. Similarly, the `os` module provides many functions for interacting with your operating system.

We will extract features with the `FeatureExtractor` class. To do so, we first need to initialize the class. This sets up important parameters, such as the `batch_size`, which is the number of images processed in parallel. Larger batches should run faster, although your mileage may vary. If you specify too large a batch, you may encounter an `OutOfMemoryError` (see below for an example). If you encounter this error, try specifying a smaller batch size. If you still encounter it with a very small batch size (say, 2), you may need to resize your images. You can do this manually by reducing the file size in image editing software, or [with Python](https://imagekit.io/blog/image-resizing-in-python/).

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB. GPU 0 has a total capacity of 5.81 GiB of which 9.06 MiB is free. Including non-PyTorch memory, this process has 5.76 GiB memory in use. Of the allocated memory 5.64 GiB is allocated by PyTorch, and 50.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
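
One way to cope with this error is to retry with progressively smaller batches. The sketch below shows the pattern with a simulated extractor so that it runs anywhere. `run_with_smaller_batches` and `fake_extract` are hypothetical names, not Pyseter functions. With PyTorch, you would pass `torch.cuda.OutOfMemoryError` as the exception to catch and build a new `FeatureExtractor` with the reduced `batch_size` inside the callable.

```python
def run_with_smaller_batches(extract, batch_size, oom_errors=(MemoryError,), min_batch_size=1):
    """Call extract(batch_size), halving the batch size after each out-of-memory error."""
    while True:
        try:
            return batch_size, extract(batch_size)
        except oom_errors:
            if batch_size <= min_batch_size:
                raise  # nothing smaller to try; resize the images instead
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"Out of memory, retrying with batch_size={batch_size}")


def fake_extract(batch_size):
    """Simulated extractor that only fits 4 images in memory at a time."""
    if batch_size > 4:
        raise MemoryError("simulated out-of-memory error")
    return f"features extracted with batch_size={batch_size}"


used, result = run_with_smaller_batches(fake_extract, batch_size=32)
print(used, result)
```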

**A note for R users** Python error messages are comically long, putting CVS receipts to shame. This is because they show the entire traceback, i.e., this error caused this error caused this error, etc. To quickly diagnose the problem, scroll to the bottom of the message. Then, you can further dissect it by scrolling up.
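
The "read the bottom first" habit can be tried on a small, deliberately broken example. The sketch uses the standard-library `traceback` module; the function names are invented for the demonstration.

```python
import traceback


def load_image(path):
    # the actual source of the problem
    raise FileNotFoundError(f"No such file: {path}")


def extract_one(path):
    # an intermediate call that shows up in the traceback
    return load_image(path)


try:
    extract_one("missing.jpg")
except FileNotFoundError:
    lines = traceback.format_exc().strip().splitlines()

print(len(lines), "lines in the traceback")
print("Last line:", lines[-1])
```

The last line names the error type and message; the lines above it trace the chain of calls that led there.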

The second parameter is the `model_path`, which is the full path to the *AnyDorsal* model weights. TODO: We have to find somewhere to host these weights, and to get permission from Ted since they are a Happywhale product.

In [11]:
from pyseter.extract import FeatureExtractor

# specify the configuration for the extractor 
fe = FeatureExtractor(
    batch_size=4,
    model_path='/home/pattonp/ristwhales/ristwhales_model.pth'
)

Using device: cuda (NVIDIA H200 NVL)


Once we've initialized the class, we can use its associated methods (functions). In this case, the only one we are interested in is `extract()`, which extracts a feature vector for every image in a specified directory. This can take several minutes, so we typically save the results afterwards.

**A note for R users** Classes and methods also exist in R, but operate more behind the scenes. For example, `x <- data.frame()` initializes an object of class data.frame, and `summary(x)` calls the summary method for data.frames. Python makes this relationship more explicit. For example, the equivalent (although nonsensical) Python code would be `x = data.frame()` and `x.summary()`.
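
A small, self-contained class shows the same pattern of initializing an object and then calling its methods. `ImageCounter` is invented for this example and has nothing to do with Pyseter's internals.

```python
class ImageCounter:
    """Toy class: configured once, then used through its methods."""

    def __init__(self, extensions=(".jpg", ".jpeg", ".png")):
        # runs when the object is created, like FeatureExtractor(batch_size=4, ...)
        self.extensions = tuple(ext.lower() for ext in extensions)

    def count(self, filenames):
        # a method: a function attached to the object, called as counter.count(...)
        return sum(name.lower().endswith(self.extensions) for name in filenames)


counter = ImageCounter()                             # initialize the class
n = counter.count(["a.JPG", "b.png", "notes.txt"])   # call one of its methods
print(n)
```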

In [12]:
import numpy as np

# extract the features for the input directory then save them
features = fe.extract(image_dir=image_dir)

# save the features to disk in NumPy's .npy format
out_path = feature_dir + '/features.npy'
np.save(out_path, features)

Loading model...
Loading model from: /home/pattonp/ristwhales/ristwhales_model.pth


NotADirectoryError: [Errno 20] Not a directory: '/home/pattonp/koa_scratch/id_data/working_dir/all_images'

The object `features` is a dictionary, whose keys are the filenames and whose values are the feature vectors associated with each filename. This helps ensure that each image is associated with the correct feature vector. Nevertheless, it can be easier to work with actual NumPy arrays. To do so, convert the keys and the values to lists, then to NumPy arrays.

In [None]:
filenames = np.array(list(features.keys()))
feature_array = np.array(list(features.values()))

**A note for R users** The closest equivalent of a dictionary in R is a named list (or a named vector, when all values share one type).
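
Two details are worth knowing here. Dictionaries preserve insertion order, so `filenames[i]` and `feature_array[i]` refer to the same image. And if a dictionary is passed to `np.save`, NumPy pickles it, so reading it back needs `allow_pickle=True` followed by `.item()`. The sketch below demonstrates both with made-up file names and random vectors, and shows `np.savez` as one alternative that stores the two aligned arrays directly.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(1)
# made-up stand-in for the dictionary returned by the extractor
features = {f"img_{i}.jpg": rng.random(8) for i in range(3)}

filenames = np.array(list(features.keys()))
feature_array = np.array(list(features.values()))
print(filenames.shape, feature_array.shape)

with tempfile.TemporaryDirectory() as tmp:
    # saving the dictionary itself: NumPy pickles it inside a 0-d object array
    dict_path = os.path.join(tmp, "features.npy")
    np.save(dict_path, features)
    restored = np.load(dict_path, allow_pickle=True).item()

    # alternative: store the two aligned arrays, no pickling needed
    npz_path = os.path.join(tmp, "features.npz")
    np.savez(npz_path, filenames=filenames, features=feature_array)
    with np.load(npz_path) as data:
        names_back, array_back = data["filenames"], data["features"]

print(list(restored) == list(features))
print(np.array_equal(array_back, feature_array))
```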

## Identification

In [None]:
import pandas as pd

# we want to subdivide the clusters by encounter for easier viewing
bay1_encounter_info = pd.read_csv(image_root + '/IG_identifications.csv')
bay1_encounter_info.columns = ['image', 'encounter']
bay1_encounter_info.head()

In [None]:
from pyseter.sort import cluster_images, format_ids, report_cluster_results

# set up the configuration for the clustering algorithm
cluster_algorithm = 'hac'
similarity_threshold = 0.5

# cluster away! 
results = cluster_images(bay1_feature_array, cluster_algorithm, similarity_threshold)
cluster_ids_hac = format_ids(results)

# quick summary of the clustering results
report_cluster_results(cluster_ids_hac)

In [24]:
from pyseter.sort import report_cluster_results

# quick summary of the clustering results
report_cluster_results(cluster_ids_hac)

Found 1276 clusters.
Largest cluster has 53 images.
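
To build intuition for what a similarity threshold does, the pure-Python sketch below groups vectors whenever their cosine similarity reaches the threshold, which is equivalent to single-linkage hierarchical clustering cut at that level. This is an illustration only: the linkage and similarity measure that `cluster_images` uses for `'hac'` are not stated in this notebook and may differ, and `threshold_clusters` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def threshold_clusters(vectors, threshold):
    """Single-linkage grouping: merge any two items with similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    # relabel cluster roots as 0, 1, 2, ... in order of first appearance
    labels, seen = [], {}
    for i in range(len(vectors)):
        labels.append(seen.setdefault(find(i), len(seen)))
    return labels


vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, 0.0)]
loose = threshold_clusters(vectors, threshold=0.5)
strict = threshold_clusters(vectors, threshold=0.999)
print(loose)
print(strict)
```

A lower threshold merges more images into each cluster, while a higher threshold leaves more images on their own.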


In [28]:
# create a dataframe with the proposed id and encounter for each image
hac_df = pd.DataFrame({'image': bay1_filenames, 'autosort_id': cluster_ids_hac})
hac_df = hac_df.merge(bay1_encounter_info)

hac_df.head()

|   | image | autosort_id | encounter |
|---|---|---|---|
| 0 | IG_2024_04_25_G1_IMG_7800_1.jpg | ID-0485 | IG_2024_04_25_G1 |
| 1 | IG_2024_04_30_G1_IMG_6212_1.jpg | ID-0353 | IG_2024_04_30_G1 |
| 2 | IG_2025_03_26_G6_IMG_5982_1.jpg | ID-0009 | IG_2025_03_26_G6 |
| 3 | IG_2025_04_03_G2_IMG_2495_1.jpg | ID-0269 | IG_2025_04_03_G2 |
| 4 | IG_2024_05_07_G1_IMG_5733_1.jpg | ID-1132 | IG_2024_05_07_G1 |
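
By default, `DataFrame.merge` performs an inner join on the columns the two tables share (here, `image`), so any image without encounter information is silently dropped. The sketch below uses small made-up tables to show how `how='left'` with `indicator=True` exposes unmatched rows, and how to count the distinct combinations of proposed ID and encounter.

```python
import pandas as pd

# made-up data standing in for the clustering results and the encounter table
ids = pd.DataFrame({
    "image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "autosort_id": ["ID-0001", "ID-0001", "ID-0002", "ID-0003"],
})
encounters = pd.DataFrame({
    "image": ["a.jpg", "b.jpg", "c.jpg"],
    "encounter": ["enc_1", "enc_2", "enc_2"],
})

# default behaviour: inner join on the shared column, d.jpg disappears
inner = ids.merge(encounters)

# left join with an indicator column keeps d.jpg and flags it as unmatched
checked = ids.merge(encounters, how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "image"].tolist()

n_combos = inner.groupby(["autosort_id", "encounter"]).ngroups

print(len(inner), "of", len(ids), "images have encounter information")
print("Unmatched:", unmatched)
print(n_combos, "ID-encounter combinations")
```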


In [29]:
from pyseter.sort import sort_images

bay1_out = image_root + '/IG_autosort'
sort_images(hac_df, bay1_dir, bay1_out)

Sorted 7306 images into 2602 folders.


In [34]:
# we want to subdivide the clusters by encounter for easier viewing
bay2_encounter_info = pd.read_csv(image_root + '/SE_identifications.csv')
bay2_encounter_info.columns = ['image', 'encounter']

bay2_filenames = np.array(list(bay2_features.keys()))
bay2_feature_array = np.array(list(bay2_features.values()))

In [31]:
# cluster away! 
results = cluster_images(bay2_feature_array, cluster_algorithm, similarity_threshold)
cluster_ids_hac = format_ids(results)

# quick summary of the clustering results
report_cluster_results(cluster_ids_hac)

Clustering 6297 features with Hierachical Clustering.
Found 1005 clusters.
Largest cluster has 66 images.


In [35]:
# create a dataframe with the proposed id and encounter for each image
hac_df = pd.DataFrame({'image': bay2_filenames, 'autosort_id': cluster_ids_hac})
hac_df = hac_df.merge(bay2_encounter_info)

hac_df.head()

|   | image | autosort_id | encounter |
|---|---|---|---|
| 0 | SE_2025_04_28_G1_IMG_7314_1.jpg | ID-0071 | SE_2025_04_28_G1 |
| 1 | SE_2024_05_10_G3_IMG_6609_1.jpg | ID-0908 | SE_2024_05_10_G3 |
| 2 | SE_2024_04_22_G1_IMG_1275_3.jpg | ID-0924 | SE_2024_04_22_G1 |
| 3 | SE_2024_05_10_G1_IMG_4130_1.jpg | ID-0328 | SE_2024_05_10_G1 |
| 4 | SE_2025_04_28_G1_IMG_6897_1.jpg | ID-0021 | SE_2025_04_28_G1 |


In [36]:
bay2_out = image_root + '/SE_autosort'
sort_images(hac_df, bay2_dir, bay2_out)

Sorted 6297 images into 2320 folders.
