In [1]:
from active_embedder import *
from prob_cover import *
from data_labeler import *
from resnet_verification import *

[rank: 0] Global seed set to 42


## Active Learning for Classification Toolkit

Why we made this: Active Learning algorithms such as ProbCover have achieved state-of-the-art results, and have even improved on Self-Supervised and Semi-Supervised techniques. However, most of these results come from simulated environments: the datasets were actually labelled, but the labels were hidden from the model until the Active Learning "oracle" revealed them. We wanted to make Active Learning work for you, dear Reader with a truly unlabelled dataset, by providing a full-service workflow. You start with an image folder of unlabelled data, and use this Toolkit to generate embeddings, select the examples to label, and save the labels.

When to use this: This toolkit focuses on the Cold Start problem. Many Active Learning frameworks, such as weakly supervised or semi-supervised learning, rely on an "initial set" of labelled examples and work to propagate those labels to unlabelled examples. The harder problem is when ALL your data is unlabelled. How do you even know where to start labelling? That's where this toolkit comes in. We'll help you label as many examples as you need to get a working classifier, and provide guidance on when you can stop labelling.

How to use the toolkit: Unfortunately, the interactivity of this notebook tends to slow way down on Colab. We recommend cloning this repo and running a jupyter notebook locally. 

The steps are as follows:

### [Part 1: Prep work](#part_1)

Enter the root directory where your images are stored in the cells below. The cell after that will find all images in the folder, so don't worry about file naming or folder structure.

### [Part 2: Create Embeddings](#part_2)

Specify or create Embeddings. All our Active Learning algorithms require a good embedding space as a prerequisite. If you already have self-supervised embeddings from your data, simply enter where the npy or pth file is stored. If you don't have embeddings, you have three options: using the forward pass on a pre-trained VGG model to generate embeddings, using the forward pass on a self-supervised ResNet to generate embeddings, or fine-tuning a self-supervised model on your dataset. The first two will run quickly; the third will likely take about a day to run.

### [Part 3: Label your examples with Active Learning](#part_3)

Use your embeddings to select examples to label. Instantiate one of our Active Learner classes as specified below, and it will generate a list of examples to label. This step will save a data manifest for your future use.

### [Part 4: Test a model](#part_4)

Use those labels to build a DataLoader. You're now ready to train a classifier that should have maximum performance per labelled example!

A final note before we begin: To keep this notebook organized, most lengthy functions (e.g., our Active Learner classes, our visualization classes) are imported. However, the whole reason we made this a notebook instead of a GUI is so that you can see them, inspect them, and create your own classes that may well improve on anything we did. At your leisure, browse the rest of the code in this repo to understand how these algorithms maximize performance, and modify to your heart's delight.

Sounds good? Let's get started

## <a id="part_1"> Part 1: Prep work </a>

Enter the root directory within which all images are stored. The format of this folder doesn't matter, but note that ALL images within the folder you specify will be added to your dataset

In [2]:
image_dir = "/deep/u/jsilberg/Brazil_hard_detections_1024_2500" # ENTER IMAGE FOLDER HERE
image_size = 256  # ENTER IMAGE HEIGHT OR WIDTH HERE (ASSUMED TO BE THE SAME)
image_format = ".jpg"  # SPECIFY FILE TYPE, WE SUPPORT .jpg, .jpeg, OR .png

In [3]:
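import os

image_list = []
for root, dirs, files in os.walk(image_dir, topdown=True):
    for name in files:
        # Compare the real extension, case-insensitively, so ".jpeg" and ".JPG" also match
        if os.path.splitext(name)[1].lower() == image_format.lower():
            image_list.append(os.path.join(root, name))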

In [4]:
len(image_list)

557
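If your folder mixes several of the supported file types, a single `image_format` value will miss some of them. The following is a minimal sketch, not part of the toolkit, of a helper that collects every file with one of several extensions. The name `find_images` and the demo folder layout are hypothetical and only serve the illustration.

```python
import os
import tempfile

def find_images(root_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Recursively collect paths whose extension is in `extensions` (case-insensitive)."""
    exts = tuple(e.lower() for e in extensions)
    found = []
    for root, _dirs, files in os.walk(root_dir):
        for name in files:
            if os.path.splitext(name)[1].lower() in exts:
                found.append(os.path.join(root, name))
    return sorted(found)

# Tiny demo on a throwaway folder (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))
    for rel in ("a/one.jpg", "a/b/two.JPEG", "a/b/notes.txt", "three.png"):
        open(os.path.join(tmp, *rel.split("/")), "w").close()
    demo_paths = find_images(tmp)
    demo_names = [os.path.basename(p) for p in demo_paths]

print(demo_names)  # the .txt file is skipped
```

On your own data, `image_list = find_images(image_dir)` would slot in where the loop above builds `image_list`.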

If you already have embeddings for your images, please enter the location of the .npy or .pth below

In [5]:
embeddings_loc = None # REPLACE NONE WITH A PATH TO YOUR EMBEDDINGS IF YOU HAVE IT, DO NOTHING IF YOU WANT US TO CREATE EMBEDDINGS

## <a id="part_2">Part 2: Create Embeddings</a>

**If you already have embeddings, skip ahead to [Part 3](#part_3)**

In [6]:
embeddings_path = "embeddings_simclr.pth" # ENTER WHERE YOU WANT THE EMBEDDINGS TO BE SAVED ENDING IN ".pth"

If you don't have embeddings, we'll create them for you. Choose from the following options:

If you are working with normal images (e.g., images at ground-level) use one of the following

1: VGG pre-trained on ImageNet (fast)

2: ResNet pre-trained on SimCLR (medium, requires large download)

3: Fine-tuning SimCLR on your data (runtime depends on the number of images in your dataset, but budget up to a day for a dataset of 50000+ images)

If you are working with remote sensing (e.g., satellite images) use one of the following:

4: VGG pre-trained on fMOW (fast)

5: Frozen SatMAE (medium, requires large download)

6: Fine-tune SatMAE (runtime is around a day usually)

If you are working with medical imagery (e.g., X-rays), use one of the following:

7: VGG pre-trained on SOMETHING

8: Frozen medical SimCLR

9: Fine-tune on SimCLR

In [7]:
embedding_option = 2 # Replace with the number of the option you want to use

In [8]:
contrast_transforms = transforms.Compose([transforms.RandomHorizontalFlip(),
                                          transforms.RandomResizedCrop(size=224),
                                          transforms.RandomApply([
                                              transforms.ColorJitter(brightness=0.5,
                                                                     contrast=0.5,
                                                                     saturation=0.5,
                                                                     hue=0.1)
                                          ], p=0.8),
                                          transforms.RandomGrayscale(p=0.2),
                                          transforms.GaussianBlur(kernel_size=9),
                                          transforms.ToTensor(),
                                          transforms.Normalize((0.5,), (0.5,))
                                         ])

If you chose option 2, 5, or 8, you must provide a path to where you want the model stored, and a 

In [9]:
embedder = Embedder(image_list,image_size,embedding_option,transform=contrast_transforms)

In [10]:
embeddings_transform = transforms.Compose([transforms.CenterCrop(224),
                                          transforms.ToTensor(),
                                          transforms.Normalize((0.5,), (0.5,))
                                         ])

In [11]:
embeddings = embedder.get_embeddings(embeddings_transform)

Features extracted have the shape: torch.Size([557, 128])


Now we will save the embeddings to disk so that we have a copy of them.

In [None]:
torch.save(embeddings,embeddings_path)

In [None]:
embeddings_loc = embeddings_path

## <a id="part_3">Part 3: Active Learning</a>
    
#### If you already created manifests of your data, skip to [Part 4](#part_4)

At this point we will create an "Active Learner" instance, which provides a reference to the list of "best" examples to label. First, enter your set of mutually exclusive labels (the classes in your data), then decide how many examples you want to label:

In [5]:
labels_list = ["oil","water","wind"]

In [7]:
examples_to_label = 50

So far, we have implemented two kinds of Active Learners. The first, ProbCover, comes from <a href='https://arxiv.org/abs/2205.11320'>Active Learning Through a Covering Lens</a> and minimizes the 1-NN error for a given labeling budget. The second, CoverNN, is our own, and minimizes the likelihood that a kNN graph contains images of different classes. Run ONE of the following two cells, based on which active learner you want to use.

In [None]:
prob_labels = ProbCover(save_dir="",image_list=image_list,embeddings_loc=embeddings_loc,num_classes=len(labels_list),k=30,input_size=224)

In [8]:
prob_labels = CoverNN(save_dir="",image_list=image_list,embeddings_loc=embeddings_loc,num_classes=len(labels_list),k=30,input_size=224)

Finished loading data...
Start constructing graph using k=30
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([32, 501])
torch.Size([21, 501])
Finished constructing graph using k=30
Graph contains 15030 edges.
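The edge count in the log is consistent with a k-nearest-neighbour graph: 501 points times k = 30 neighbours gives 15030 directed edges. As a rough illustration of the idea, here is a brute-force sketch in plain Python. It is an assumption about the general technique, not the toolkit's implementation (which works in batches on tensors), and `knn_edges` and `toy_points` are hypothetical names.

```python
import math

def knn_edges(points, k):
    """Return directed edges (i, j, distance) from each point to its k nearest neighbours."""
    edges = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, sorted ascending
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            edges.append((i, j, d))
    return edges

# Toy 2-D "embeddings": a tight group near the origin and a pair far away
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
toy_edges = knn_edges(toy_points, k=2)

print(len(toy_edges))  # n * k edges: 5 * 2
print(501 * 30)        # same arithmetic as the log above
```

Brute force costs O(n²) distance computations, which is fine for a few hundred images but is the reason real implementations batch the work.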


In [9]:
oracle_results = prob_labels.select_samples(examples_to_label)
print(oracle_results)

Start selecting 50 samples.
Iteration is 0.	Min distance is 0.041.	Coverage is 0.000
Iteration is 1.	Min distance is 0.054.	Coverage is 0.060
Iteration is 2.	Min distance is 0.057.	Coverage is 0.088
Iteration is 3.	Min distance is 0.058.	Coverage is 0.098
Iteration is 4.	Min distance is 0.058.	Coverage is 0.118
Iteration is 5.	Min distance is 0.060.	Coverage is 0.122
Iteration is 6.	Min distance is 0.061.	Coverage is 0.126
Iteration is 7.	Min distance is 0.064.	Coverage is 0.158
Iteration is 8.	Min distance is 0.064.	Coverage is 0.168
Iteration is 9.	Min distance is 0.064.	Coverage is 0.172
Iteration is 10.	Min distance is 0.065.	Coverage is 0.180
Iteration is 11.	Min distance is 0.065.	Coverage is 0.240
Iteration is 12.	Min distance is 0.066.	Coverage is 0.287
Iteration is 13.	Min distance is 0.066.	Coverage is 0.289
Iteration is 14.	Min distance is 0.067.	Coverage is 0.297
Iteration is 15.	Min distance is 0.067.	Coverage is 0.301
Iteration is 16.	Min distance is 0.068.	Coverage is 0.
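To give a feel for what a coverage-based selector does, here is a simplified sketch of the greedy ball-cover idea described in the ProbCover paper: repeatedly pick the point whose radius-`delta` ball contains the most points that are not yet covered. This is an assumption-laden illustration, not the code behind `select_samples`; the function name `greedy_cover`, the value of `delta`, and the toy points are all hypothetical, and the coverage printed here is measured after each pick.

```python
import math

def greedy_cover(points, delta, budget):
    """Greedy ball-cover selection in the style of ProbCover (simplified, brute force)."""
    n = len(points)
    # covers[i] = indices of all points within delta of point i (including i itself)
    covers = [
        {j for j in range(n) if math.dist(points[i], points[j]) <= delta}
        for i in range(n)
    ]
    covered, selected = set(), []
    for _ in range(budget):
        # Choose the point whose ball contains the most still-uncovered points
        best = max(range(n), key=lambda i: len(covers[i] - covered))
        if not covers[best] - covered:
            break  # everything is already covered
        selected.append(best)
        covered |= covers[best]
        print(f"picked {best}, coverage {len(covered) / n:.3f}")
    return selected, len(covered) / n

toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # dense cluster
       (3.0, 3.0), (3.1, 3.0),                          # smaller cluster
       (9.0, 9.0)]                                      # outlier
toy_selected, toy_coverage = greedy_cover(toy, delta=0.2, budget=2)
```

With a budget of two, the sketch spends its labels on the two clusters and leaves the outlier uncovered, which is the behaviour that makes coverage-style selection attractive at low budgets.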

In [10]:
oracle_image_list = [[image_list[prob_labels.dict_indices["train"][i]] for i in sublist] for sublist in oracle_results]

Please run the following cell TWICE in a row for it to save your results properly. Some users have reported that running the cell only once does not always save the results.

In [12]:
train_labeler = DataLabeler(oracle_image_list,labels_list)

100%|███████████████████████████████████████████████████████████████████████████| 50/50 [00:01<00:00, 27.57it/s]


In [13]:
train_labeler.display_pictures_button(oracle_image_list,0,train_labeler.df)

Loading images in cluster 1...


VBox(children=(Dropdown(options=('oil', 'water', 'wind'), value='oil'), HBox(children=(VBox(children=(Image(va…

Let's take a look at your manifest!

In [14]:
train_labeler.df

Unnamed: 0,path,label
0,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water
1,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water
2,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water
3,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water
4,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water
5,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water
6,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water
7,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water
8,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water
9,/deep/u/jsilberg/Brazil_hard_detections_1024_2...,water


The cell above shows the labels you have entered so far as a table with one row per image. Let's save this table as a csv right away so nothing happens to it. Feel free to edit the file name.

In [17]:
train_labeler.df.to_csv("train_df.csv")
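For reference, the manifest is just a table of image paths and string labels. The sketch below, which uses only the standard library and made-up paths, shows the round trip: write a `path,label` manifest, read it back, and convert the labels to integer class indices. It is an illustration of the file layout, not the toolkit's `ManifestData` class.

```python
import csv
import os
import tempfile

labels_list = ["oil", "water", "wind"]
labels_map = {label: i for i, label in enumerate(labels_list)}

rows = [  # hypothetical paths, for illustration only
    {"path": "images/a.jpg", "label": "water"},
    {"path": "images/b.jpg", "label": "oil"},
    {"path": "images/c.jpg", "label": "wind"},
]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "train_df.csv")
    # Write the manifest with a header row
    with open(manifest_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "label"])
        writer.writeheader()
        writer.writerows(rows)
    # Read it back and map each string label to its class index
    with open(manifest_path, newline="") as f:
        loaded = [(r["path"], labels_map[r["label"]]) for r in csv.DictReader(f)]

print(loaded)
```

If you stay with pandas, passing `index=False` to `to_csv` avoids writing the row index as an extra unnamed column.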

The next few cells assume you will want to see how your model performs by creating a separate validation set. Here, you can label as many validation examples as you want at random, then below train a model to see how well you're doing.

In [18]:
val_to_label = 50
val_image_list = [[image_list[prob_labels.dict_indices["val"][i]]] for i in range(0,val_to_label)]
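The graph log in Part 3 lists 501 rows, which suggests that 56 of the 557 images are held out as the validation split, so `val_to_label` should not exceed that number. How the toolkit builds `dict_indices` is not shown here; the sketch below is one plausible way to produce a reproducible random split, with `split_indices` and its arguments being hypothetical names.

```python
import random

def split_indices(n_items, n_val, seed=42):
    """Shuffle 0..n_items-1 with a fixed seed and hold out n_val indices for validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return {"train": idx[n_val:], "val": idx[:n_val]}

split = split_indices(n_items=557, n_val=56)
val_to_label = 50

# One single-image group per validation example, mirroring the nested-list shape used above
val_groups = [[i] for i in split["val"][:val_to_label]]

print(len(split["train"]), len(split["val"]), len(val_groups))
```

Fixing the seed means the same images land in the validation split on every run, so labels you have already saved stay valid.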

Please run the cell below TWICE in a row so that it saves properly

In [19]:
val_labeler = DataLabeler(val_image_list,labels_list)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 231.03it/s]


In [20]:
val_labeler.display_pictures_button(val_image_list,0,val_labeler.df)

Loading images in cluster 2...


VBox(children=(Dropdown(options=('oil', 'water', 'wind'), value='oil'), HBox(children=(VBox(children=(Image(va…

In [268]:
val_labeler.df

NameError: name 'val_labeler' is not defined

And let's save the validation manifest for safe-keeping

In [22]:
val_labeler.df.to_csv("val_df.csv")

## <a id="part_4"> Part 4: Testing your model </a>

In [6]:
labels_map = {label: i for i, label in enumerate(labels_list)}

In [9]:
import pandas as pd
# NOTE: `embeddings_transform` is defined in Part 2 (In [10]); re-run that cell first if you skipped Part 2
train_data = ManifestData("train_df.csv",label_map=labels_map,transform=embeddings_transform)
val_data = ManifestData("val_df.csv",label_map=labels_map,transform=embeddings_transform)
train_loader = DataLoader(train_data,batch_size=256,num_workers=4,shuffle=True)
val_loader = DataLoader(val_data,batch_size=1,num_workers=4,shuffle=False)
verification_model = VerificationModel(0.001,weight_decay=.001,num_classes=len(labels_list))
trainer = pl.Trainer(accelerator='gpu',max_epochs=50)

NameError: name 'embeddings_transform' is not defined

In [26]:
trainer.fit(verification_model,train_loader,val_loader)

  rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type             | Params
---------------------------------------------
0 | convnet | ResNet           | 11.7 M
1 | loss    | CrossEntropyLoss | 0     
2 | val_acc | Accuracy         | 0     
---------------------------------------------
11.7 M    Trainable params
0         Non-trainable params
11.7 M    Total params
46.758    Total estimated model params size (MB)
2022-11-16 14:04:06.354369: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


Sanity Checking: 0it [00:00, ?it/s]

  rank_zero_warn(


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
