# Getting started with Pyseter

Pyseter is a Python package for sorting images by an automatically generated ID. The main functions of Pyseter are:

1. Extracting features from images
2. Clustering images by proposed ID
3. Sorting images by cluster
4. Grading images by distinctiveness

This notebook walks you through each major function. First, let's make sure that Pyseter is properly installed and that it can access PyTorch.

In [3]:
import pyseter

pyseter.verify_pytorch()

✓ PyTorch 2.7.1+cu126 detected
✓ CUDA GPU available: NVIDIA A30


If you're on a Mac, you should see something like:

```
✓ PyTorch 2.7.0 detected
✓ Apple Silicon (MPS) GPU available
```

Please note, however, that the *AnyDorsal* model consumes quite a bit of memory, so only Apple Silicon devices with 16 GB or more of memory will work. Ideally, future versions of Pyseter will use a smaller model.

If neither Apple Silicon nor an NVIDIA GPU is available, expect a message like this:

```
✓ PyTorch 2.7.1+cu126 detected
! No GPU acceleration available. Expect slow feature extraction.
```
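
For reference, a similar check can be made directly against PyTorch. This is a minimal sketch, not the implementation of `pyseter.verify_pytorch()`. The helper name `pick_device` is hypothetical, and it falls back to the CPU when PyTorch cannot be imported.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only the CPU is usable
        return 'cpu'

    if torch.cuda.is_available():
        return 'cuda'
    # the MPS backend only exists in newer PyTorch builds
    mps = getattr(torch.backends, 'mps', None)
    if mps is not None and mps.is_available():
        return 'mps'
    return 'cpu'


device = pick_device()
print(f'Using device: {device}')
```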

## Folder management

Pyseter 

```
.
├── image folder
│   └── pyproject.toml
```

In [2]:
# from pyseter.autosort import prep_images

# image_dir = '/Users/philtpatton/datasets/spinner/july25'
# prep_images(image_dir)

## Extracting features

In [9]:
import os

# directories with images
image_root = '/home/pattonp/koa_scratch/id_data/Guiana dolphins 24_25'

# join two strings together in python with '+' 
bay1_dir = image_root + '/IG'
bay2_dir = image_root + '/SE'

# in case you want to save the features after extracting them 
feature_dir = image_root + '/features'
os.makedirs(feature_dir, exist_ok=True)
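
String concatenation works, but it is easy to drop or double a slash. As an alternative, the standard library's `pathlib` can build the same paths. The sketch below uses a temporary directory in place of the real image root, so the root folder is a placeholder.

```python
import tempfile
from pathlib import Path

# stand-in for image_root; replace with your own folder
image_root = Path(tempfile.mkdtemp())

# the '/' operator joins path components
bay1_dir = image_root / 'IG'
bay2_dir = image_root / 'SE'

# create the folder for saved features if it does not exist yet
feature_dir = image_root / 'features'
feature_dir.mkdir(parents=True, exist_ok=True)

print(feature_dir.is_dir())
```

Whether Pyseter's functions accept `Path` objects is not confirmed here; `str(bay1_dir)` converts a path back to a plain string if needed.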

In [4]:
from pyseter.extract import FeatureExtractor
import numpy as np

# specify the configuration for the extractor 
fe = FeatureExtractor(
    batch_size=32,
    model_path='/home/pattonp/ristwhales/ristwhales_model.pth'
)

# extract the features for the input directory then save them
bay1_features = fe.extract(image_dir=bay1_dir)

# this saves them to a NumPy .npy file
out_path = feature_dir + '/ig_features.npy'
np.save(out_path, bay1_features)

Using device: cuda (Quadro RTX 5000)
Loading model...




Loading model from: /home/pattonp/ristwhales/ristwhales_model.pth
Extracting features...


100%|██████████| 229/229 [10:11<00:00,  2.67s/it]
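
Because the features are used as a dictionary later in this notebook (filenames as keys, feature vectors as values), `np.save` stores them as a pickled object. Reloading such a file requires `allow_pickle=True`, followed by `.item()` to get the dictionary back. The sketch below uses a small made-up dictionary rather than real features.

```python
import os
import tempfile

import numpy as np

# made-up stand-in for the dictionary of extracted features
fake_features = {
    'img_001.jpg': np.array([0.1, 0.2, 0.3]),
    'img_002.jpg': np.array([0.4, 0.5, 0.6]),
}

out_path = os.path.join(tempfile.mkdtemp(), 'fake_features.npy')
np.save(out_path, fake_features)

# a dict is stored as a 0-d object array, so unpickle it and unwrap with .item()
loaded = np.load(out_path, allow_pickle=True).item()
print(sorted(loaded.keys()))
```

Only load pickled files that you created yourself or otherwise trust, since unpickling can execute arbitrary code.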


In [13]:
# specify the configuration for the extractor 
fe = FeatureExtractor(
    batch_size=8,
    model_path='/home/pattonp/ristwhales/ristwhales_model.pth'
)

# extract the features for the second bay (same model, smaller batch size)
bay2_features = fe.extract(image_dir=bay2_dir)

# this saves them to a NumPy .npy file
out_path = feature_dir + '/se_features.npy'
np.save(out_path, bay2_features)

Using device: cuda (Quadro RTX 5000)
Loading model...
Loading model from: /home/pattonp/ristwhales/ristwhales_model.pth
Extracting features...


100%|██████████| 788/788 [09:15<00:00,  1.42it/s]
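
The progress bar totals appear to count batches rather than images. The image counts reported later in this notebook (7306 for the first bay, 6297 for the second), divided by the batch sizes and rounded up, give the totals shown. A quick check:

```python
import math


def n_batches(n_images, batch_size):
    """Number of batches needed to cover n_images."""
    return math.ceil(n_images / batch_size)


print(n_batches(7306, 32))  # 229
print(n_batches(6297, 8))   # 788
```

A smaller batch size lowers peak GPU memory use at the cost of more batches.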


## Identification

In [27]:
import pandas as pd

bay1_filenames = np.array(list(bay1_features.keys()))
bay1_feature_array = np.array(list(bay1_features.values()))

# we want to subdivide the clusters by encounter for easier viewing
bay1_encounter_info = pd.read_csv(image_root + '/IG_identifications.csv')
bay1_encounter_info.columns = ['image', 'encounter']
bay1_encounter_info.head()

|   | image | encounter |
|---|---|---|
| 0 | IG_2024_04_23_G5_IMG_4212_1.jpg | IG_2024_04_23_G5 |
| 1 | IG_2024_04_23_G5_IMG_4224_1.jpg | IG_2024_04_23_G5 |
| 2 | IG_2024_04_23_G5_IMG_4236_1.jpg | IG_2024_04_23_G5 |
| 3 | IG_2024_04_23_G5_IMG_4237_1.jpg | IG_2024_04_23_G5 |
| 4 | IG_2024_04_23_G5_IMG_4238_1.jpg | IG_2024_04_23_G5 |


In [None]:
from pyseter.sort import cluster_images, format_ids, report_cluster_results

# set up the configuration for the clustering algorithm
cluster_algorithm = 'hac'
similarity_threshold = 0.5

# cluster away! 
results = cluster_images(bay1_feature_array, cluster_algorithm, similarity_threshold)
cluster_ids_hac = format_ids(results)

# quick summary of the clustering results
report_cluster_results(cluster_ids_hac)
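
To give a sense of what hierarchical agglomerative clustering with a similarity threshold does, here is a minimal sketch using SciPy on made-up vectors. It is not Pyseter's implementation: the cosine metric, the average linkage, the conversion of the threshold to a distance (`1 - similarity`), and the helper names `toy_hac` and `toy_format_ids` are all assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def toy_hac(features, similarity_threshold):
    """Cluster rows of `features`, cutting the tree at a cosine similarity."""
    # build the merge tree from pairwise cosine distances
    tree = linkage(features, method='average', metric='cosine')
    # cosine distance = 1 - cosine similarity
    return fcluster(tree, t=1 - similarity_threshold, criterion='distance')


def toy_format_ids(labels):
    """Turn integer labels into zero-padded strings such as 'ID-0001'."""
    return [f'ID-{int(label):04d}' for label in labels]


# two made-up "individuals", three images each
features = np.array([
    [1.00, 0.05, 0.00],
    [0.95, 0.10, 0.00],
    [1.00, 0.00, 0.05],
    [0.00, 1.00, 0.05],
    [0.05, 0.95, 0.00],
    [0.00, 1.00, 0.00],
])

labels = toy_hac(features, similarity_threshold=0.5)
ids = toy_format_ids(labels)
print(ids)
```

In this sketch, raising the similarity threshold lowers the distance cutoff and so produces more, smaller clusters; lowering it merges more images together.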

In [24]:
from pyseter.sort import report_cluster_results

# quick summary of the clustering results
report_cluster_results(cluster_ids_hac)

Found 1276 clusters.
Largest cluster has 53 images.
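
Beyond these two summary numbers, it can be useful to see how cluster sizes are distributed, for example how many clusters contain a single image. A minimal sketch with the standard library, using a short made-up list in place of `cluster_ids_hac`:

```python
from collections import Counter

# made-up stand-in for cluster_ids_hac
toy_ids = ['ID-0001', 'ID-0002', 'ID-0001', 'ID-0003', 'ID-0001', 'ID-0002']

# number of images assigned to each proposed ID
sizes = Counter(toy_ids)
largest_id, largest_size = sizes.most_common(1)[0]
singletons = sum(1 for n in sizes.values() if n == 1)

print(f'Found {len(sizes)} clusters.')
print(f'Largest cluster ({largest_id}) has {largest_size} images.')
print(f'{singletons} clusters contain a single image.')
```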


In [28]:
# create a dataframe with the proposed id and encounter for each image
hac_df = pd.DataFrame({'image': bay1_filenames, 'autosort_id': cluster_ids_hac})
hac_df = hac_df.merge(bay1_encounter_info)

hac_df.head()

|   | image | autosort_id | encounter |
|---|---|---|---|
| 0 | IG_2024_04_25_G1_IMG_7800_1.jpg | ID-0485 | IG_2024_04_25_G1 |
| 1 | IG_2024_04_30_G1_IMG_6212_1.jpg | ID-0353 | IG_2024_04_30_G1 |
| 2 | IG_2025_03_26_G6_IMG_5982_1.jpg | ID-0009 | IG_2025_03_26_G6 |
| 3 | IG_2025_04_03_G2_IMG_2495_1.jpg | ID-0269 | IG_2025_04_03_G2 |
| 4 | IG_2024_05_07_G1_IMG_5733_1.jpg | ID-1132 | IG_2024_05_07_G1 |
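
By default, `DataFrame.merge` performs an inner join on the shared column (`image` here), so any image missing from the encounter table is dropped without warning. A hedged sketch of how to spot such rows with a left join and `indicator=True`, using made-up filenames:

```python
import pandas as pd

toy_ids = pd.DataFrame({
    'image': ['a.jpg', 'b.jpg', 'c.jpg'],
    'autosort_id': ['ID-0001', 'ID-0001', 'ID-0002'],
})
toy_encounters = pd.DataFrame({
    'image': ['a.jpg', 'b.jpg'],
    'encounter': ['enc_1', 'enc_2'],
})

# keep every clustered image and record where each row came from
checked = toy_ids.merge(toy_encounters, on='image', how='left', indicator=True)

# rows marked 'left_only' have no encounter information
missing = checked.loc[checked['_merge'] == 'left_only', 'image'].tolist()
print(missing)
```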


In [29]:
from pyseter.sort import sort_images

bay1_out = image_root + '/IG_autosort'
sort_images(hac_df, bay1_dir, bay1_out)

Sorted 7306 images into 2602 folders.
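
The folder count exceeds the cluster count (2602 versus 1276), which would be consistent with each proposed ID being subdivided by encounter. The sketch below illustrates that kind of layout (`<output>/<autosort_id>/<encounter>/<image>`) with the standard library. It is an assumption about the layout rather than a description of `sort_images`, and `toy_sort` is a hypothetical helper.

```python
import shutil
import tempfile
from pathlib import Path


def toy_sort(rows, input_dir, output_dir):
    """Copy each image into output_dir/<autosort_id>/<encounter>/."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    for row in rows:
        target = output_dir / row['autosort_id'] / row['encounter']
        target.mkdir(parents=True, exist_ok=True)
        # copy rather than move, so the originals stay untouched
        shutil.copy2(input_dir / row['image'], target / row['image'])


# made-up images in a temporary folder
root = Path(tempfile.mkdtemp())
src, out = root / 'images', root / 'sorted'
src.mkdir()
rows = [
    {'image': 'a.jpg', 'autosort_id': 'ID-0001', 'encounter': 'enc_1'},
    {'image': 'b.jpg', 'autosort_id': 'ID-0001', 'encounter': 'enc_2'},
    {'image': 'c.jpg', 'autosort_id': 'ID-0002', 'encounter': 'enc_1'},
]
for row in rows:
    # empty placeholder files standing in for real photos
    (src / row['image']).write_bytes(b'')

toy_sort(rows, src, out)
print(sorted(p.relative_to(out).as_posix() for p in out.rglob('*.jpg')))
```

With a dataframe shaped like `hac_df`, `hac_df.to_dict('records')` yields rows in this form.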


In [34]:
# we want to subdivide the clusters by encounter for easier viewing
bay2_encounter_info = pd.read_csv(image_root + '/SE_identifications.csv')
bay2_encounter_info.columns = ['image', 'encounter']

bay2_filenames = np.array(list(bay2_features.keys()))
bay2_feature_array = np.array(list(bay2_features.values()))

In [31]:
# cluster away! 
results = cluster_images(bay2_feature_array, cluster_algorithm, similarity_threshold)
cluster_ids_hac = format_ids(results)

# quick summary of the clustering results
report_cluster_results(cluster_ids_hac)

Clustering 6297 features with Hierachical Clustering.
Found 1005 clusters.
Largest cluster has 66 images.


In [35]:
# create a dataframe with the proposed id and encounter for each image
hac_df = pd.DataFrame({'image': bay2_filenames, 'autosort_id': cluster_ids_hac})
hac_df = hac_df.merge(bay2_encounter_info)

hac_df.head()

|   | image | autosort_id | encounter |
|---|---|---|---|
| 0 | SE_2025_04_28_G1_IMG_7314_1.jpg | ID-0071 | SE_2025_04_28_G1 |
| 1 | SE_2024_05_10_G3_IMG_6609_1.jpg | ID-0908 | SE_2024_05_10_G3 |
| 2 | SE_2024_04_22_G1_IMG_1275_3.jpg | ID-0924 | SE_2024_04_22_G1 |
| 3 | SE_2024_05_10_G1_IMG_4130_1.jpg | ID-0328 | SE_2024_05_10_G1 |
| 4 | SE_2025_04_28_G1_IMG_6897_1.jpg | ID-0021 | SE_2025_04_28_G1 |


In [36]:
bay2_out = image_root + '/SE_autosort'
sort_images(hac_df, bay2_dir, bay2_out)

Sorted 6297 images into 2320 folders.
