# Annotate Images manually using Innotater

Add annotations to our butterfly data manually using the open-source Innotater project. We will interactively flag images that shouldn't be in our dataset and draw bounding boxes, which we'll use to make a more accurate model in later notebooks.

You'll need to install the Innotater - run the following in a Jupyter cell:

`!pip install jupyter-innotater`
or without the exclamation mark in a terminal shell.

If your environment is set up using Anaconda, you might first need to:

`conda install pip`

`export PIP_REQUIRE_VIRTUALENV=false`

To recreate the full experience of this project and draw the annotations yourself, you will need to delete the file `butterflies_bboxes.csv`, which contains my existing annotations. Alternatively, leave the file in place to step through and see what I've done.
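If you would rather set the existing annotations aside than delete them, a rename does the same job as far as this notebook is concerned, since it only checks whether `butterflies_bboxes.csv` exists. The sketch below is not part of the original project: the helper name `set_aside` and the `.bak` suffix are made up for illustration, and it runs in a temporary directory so no real file is touched.

```python
from pathlib import Path
import tempfile

def set_aside(path, suffix='.bak'):
    """Rename `path` to `path + suffix` if it exists; return the new path or None."""
    path = Path(path)
    if not path.is_file():
        return None
    backup = path.with_name(path.name + suffix)
    path.rename(backup)
    return backup

# Demonstrated in a throwaway directory so no real annotations are touched
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'butterflies_bboxes.csv'
    demo_file.write_text('filename,class,exclude,x,y,w,h\n')
    moved = set_aside(demo_file)
    moved_name = moved.name
    original_still_there = demo_file.exists()

print(moved_name, original_still_there)
```

To use it for real, you would call the helper on `Path('./butterflies_bboxes.csv')` before running the cells below.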

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

In [2]:
# imports for the Innotater
from jupyter_innotater import Innotater
from jupyter_innotater.data import ImageInnotation, BoundingBoxInnotation, \
                MultiClassInnotation, TextInnotation, BinaryClassInnotation

In [3]:
# Source CSV created by '1 - Butterfly Downloads.ipynb'
BUTTERFLIES_ORIG_FILEPATH = Path('./butterflies_original.csv')

# Will write a new CSV containing our annotations
BUTTERFLIES_BBOXES_FILEPATH = Path('./butterflies_bboxes.csv')

# Where filenames referenced in the CSV can be found
IMAGE_FOLDER = Path('./butterfly_medium_images')

# If butterflies_bboxes.csv already exists reopen it (perhaps you've already drawn some bounding 
# boxes and are coming back to add more), otherwise start from butterflies_original.csv
df = pd.read_csv(BUTTERFLIES_BBOXES_FILEPATH if BUTTERFLIES_BBOXES_FILEPATH.is_file() else BUTTERFLIES_ORIG_FILEPATH)

In [4]:
cats = sorted(df['class'].drop_duplicates().values.tolist()); cats # Unique classes

['gatekeeper_butterfly', 'meadow_brown_butterfly']

In [5]:
# Add some extra columns if we are creating the annotation-enriched butterflies_bboxes.csv file 
# for the first time 
if not BUTTERFLIES_BBOXES_FILEPATH.is_file():
    for new_col in ('exclude','x','y','w','h'):
        df[new_col] = 0

## Prepare numpy arrays that will be fed into (and updated by) the Innotater.

`classes` will be a one-dimensional array of length _row-count_ containing 0 for gatekeeper_butterfly and 1 for meadow_brown_butterfly.

`excludes` will be the same shape, containing 0 (the default) for images that we want to keep in our train/val sets and 1 for images that we want to drop because they should never have made it into the dataset (e.g. not a photo of a butterfly, or multiple/misclassified butterflies).

`bboxes` will be a _row-count x 4_ matrix containing bounding box co-ordinates. Each row holds 4 integers x,y,w,h, where x,y is the top-left corner of the box and w,h are its width and height.
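As a small illustration of the x,y,w,h layout (not part of the original notebook), the sketch below converts a few rows into corner co-ordinates and picks out rows that are still at the all-zero default. The sample values and the helper name `xywh_to_corners` are made up for the example.

```python
import numpy as np

def xywh_to_corners(boxes):
    """Convert rows of (x, y, w, h) into rows of (x_min, y_min, x_max, y_max)."""
    boxes = np.asarray(boxes)
    corners = boxes.copy()
    corners[:, 2] = boxes[:, 0] + boxes[:, 2]  # x_max = x + w
    corners[:, 3] = boxes[:, 1] + boxes[:, 3]  # y_max = y + h
    return corners

# Hypothetical sample: one drawn box and one row still at the all-zero default
sample_bboxes = np.array([[38, 197, 484, 268],
                          [0, 0, 0, 0]])

sample_corners = xywh_to_corners(sample_bboxes)
has_box = (sample_bboxes != 0).any(axis=1)  # True where a box has been drawn
print(sample_corners)
print(has_box)
```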

In [6]:
classes = np.array([cats.index(c) for c in df['class']])
excludes = df['exclude'].values
bboxes = df[['x','y','w','h']].values

In the default CSV order we would see nearly 500 images of the first class (Gatekeeper butterflies) followed by nearly 500 of the second class. We don't want to draw boxes for all 951 images. Maybe we'll draw 200 in total, split 100 per class. So we prepare an `indexes` argument to pass to the Innotater so that it shows us the classes in alternation, one image from each in turn.

In [7]:
# Make an ordering so that we cycle through the different categories, 
# so as we step through we get to see the same number of images from each category
cat_dicts = {}
for i,cat in enumerate(df['class']):
    cat_dicts.setdefault(cat, []).append(i)

min_len = min([len(a) for a in cat_dicts.values()])

indexes = np.array([a[:min_len] for a in cat_dicts.values()]).transpose().reshape(-1)

In [8]:
indexes[:10] # Check that this alternates between somewhere near start of the list and the second half of list

array([  0, 482,   1, 483,   2, 484,   3, 485,   4, 486])
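To see why the transpose-and-flatten step produces an alternating order, here is the same idea on a toy list. This is only a sketch: the labels `'a'` and `'b'` and the variable names are invented for the example.

```python
import numpy as np

# Hypothetical toy labels: three of class 'a' followed by four of class 'b'
toy_classes = ['a', 'a', 'a', 'b', 'b', 'b', 'b']

# Group row positions by class, preserving first-seen order
positions = {}
for i, label in enumerate(toy_classes):
    positions.setdefault(label, []).append(i)

# Truncate every class to the size of the smallest so the array is rectangular
shortest = min(len(p) for p in positions.values())
grid = np.array([p[:shortest] for p in positions.values()])  # shape (n_classes, shortest)

# Transposing then flattening reads the grid column by column: a0, b0, a1, b1, ...
toy_indexes = grid.transpose().reshape(-1)
print(toy_indexes)  # [0 3 1 4 2 5]
```

Note that row 6 never appears in `toy_indexes`: any rows beyond the size of the smallest class are left out of the ordering, so they will not be shown when stepping through the widget with this `indexes` argument.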

In [9]:
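# If we are coming back to add more bounding boxes, we need to know where we got to last time.
# A position still needs attention if it has no box co-ordinates and is not excluded.
todo = ((bboxes[indexes] != 0).sum(axis=1) + excludes[indexes] == 0).nonzero()[0]
first_blank = todo[0] if len(todo) else 0  # Start from 0 if every image is already annotated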
print(f'Next index needing a box: {first_blank}')

Next index needing a box: 260
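The one-line search for the first unannotated position packs several steps together, so here it is unrolled on toy data. This is a sketch with made-up arrays, and it adds a fallback to 0 for the case where nothing is left to annotate, which is an assumption about the desired behaviour rather than something the original cell does.

```python
import numpy as np

# Hypothetical toy data, already in display order
toy_bboxes = np.array([[10, 20, 30, 40],   # box drawn
                       [0, 0, 0, 0],       # no box, but excluded
                       [0, 0, 0, 0],       # no box, not excluded -> still to do
                       [5, 5, 50, 50]])    # box drawn
toy_excludes = np.array([0, 1, 0, 0])

# Number of non-zero box co-ordinates per row (0 means no box drawn)
nonzero_coords = (toy_bboxes != 0).sum(axis=1)

# A row is untouched when it has no box and is not excluded.
# `+` binds tighter than `==`, so the sum is compared against 0.
untouched = (nonzero_coords + toy_excludes) == 0

todo = untouched.nonzero()[0]
# Fall back to 0 when every row is already annotated, instead of raising IndexError
toy_first_blank = int(todo[0]) if len(todo) else 0
print(toy_first_blank)  # 2
```

The result is a position within the display order (the `indexes` array), not a row number in the DataFrame.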


## Create and Show Innotater Widget

We are now ready to use the Innotater to interactively step through each image to draw a bounding box and to flag any images that should be removed.

If an image should be removed, check the `Exclude` checkbox (e.g. if no butterfly is present). Otherwise, at least for the first 200 images, we draw a bounding box around the butterfly. It's a lot of work to draw too many boxes, so we'll stop drawing after 200 but still step through the rest to see whether any need to be excluded from the dataset.

If the classification is wrong, it is also possible to switch the class, although generally I have trusted Flickr users' tags. It's up to you to come up with a plan for how to draw the boxes consistently (e.g. I decided to include all their little arms and legs!) and the criteria for exclusion from the dataset.

You can save the data you've made so far at any time by running the 'Save Your Data' section below.

In [10]:
# Construct the Innotater widget. Format is Innotater( inputs, targets, indexes=indexes)
# `inputs` is generally the 'x' side of the machine learning project, `targets` the 'y' side 
# that you are most likely to adjust using the Innotater.
# 
# The only real `inputs` is the image itself. We also show the image filename as text but it's 
# not needed.
# 
# `targets` are created from the bounding box matrix (bboxes), the excludes matrix which is 
# binary (0 or 1 in a checkbox), and the classes which in theory might be multiple classes so 
# is displayed as a selection list.

winnotater = Innotater( 
    [ImageInnotation(df['filename'], path=IMAGE_FOLDER, height=300, width=400),
     TextInnotation(df['filename'])
    ],
    [BoundingBoxInnotation(bboxes), # Assumes the boxes relate to the only available image input
     BinaryClassInnotation(excludes, name='Exclude'),
     MultiClassInnotation(classes, classes=cats, dropdown=True)
    ],
    indexes=indexes # Our mapping to ensure we see alternating classes of butterfly
)

winnotater.index = int(first_blank) # Jump to the first index that's missing a bounding box
display(winnotater) # Show the widget - will not be visible in GitHub preview


In [11]:
# Show the first few lines of [exclude,x,y,w,h]. The `bboxes` and `excludes` variables are 
# updated automatically when you make changes in the widget above.

np.concatenate([excludes[indexes].reshape(-1,1),bboxes[indexes]], axis=-1)[:8] # Just to display - not required

array([[  1,   0,   0,   0,   0],
       [  0,  38, 197, 484, 268],
       [  1,   0,   0,   0,   0],
       [  0, 274,  31, 458, 556],
       [  1,   0,   0,   0,   0],
       [  0,  70,  29, 457, 457],
       [  0, 282, 195, 178, 170],
       [  0, 173,  49, 307, 349]])
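While annotating, it can be handy to print a quick progress summary from the same arrays. The sketch below uses small made-up stand-ins for `excludes` and `bboxes`; in the notebook you would substitute the real arrays.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's `excludes` and `bboxes` arrays
demo_excludes = np.array([1, 0, 1, 0, 0])
demo_bboxes = np.array([[0, 0, 0, 0],
                        [38, 197, 484, 268],
                        [0, 0, 0, 0],
                        [274, 31, 458, 556],
                        [0, 0, 0, 0]])

n_excluded = int((demo_excludes == 1).sum())          # flagged for removal
has_box = (demo_bboxes != 0).any(axis=1)              # at least one non-zero co-ordinate
n_boxed = int(has_box.sum())                          # images with a drawn box
n_remaining = int((~has_box & (demo_excludes == 0)).sum())  # neither boxed nor excluded
print(f'excluded={n_excluded}, boxed={n_boxed}, remaining={n_remaining}')
```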

## Save Your Data

In [12]:
# Numpy matrices are updated in real-time as you interact with the Innotater widget
# Now explicitly write those updated matrices back into the Pandas DataFrame
df[['x','y','w','h']] = bboxes
df['exclude'] = excludes
df['class'] = [cats[i] for i in classes]

# And save the full Pandas data back to a CSV file
df.to_csv(BUTTERFLIES_BBOXES_FILEPATH, index=False)

Having saved the data into the new butterflies_bboxes.csv file, you can close Jupyter and come back any time and this notebook will reload your new file where you left off.
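Later notebooks will presumably want only the usable rows from the saved file. As a rough sketch of how that filtering could look, the example below reads a tiny in-memory CSV with the same columns this notebook writes. The file contents and the variable names `kept` and `boxed` are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV contents with the same columns this notebook writes
csv_text = """filename,class,exclude,x,y,w,h
a.jpg,gatekeeper_butterfly,1,0,0,0,0
b.jpg,meadow_brown_butterfly,0,38,197,484,268
c.jpg,gatekeeper_butterfly,0,0,0,0,0
"""
demo_df = pd.read_csv(io.StringIO(csv_text))

# Keep images that were not flagged for exclusion
kept = demo_df[demo_df['exclude'] == 0]

# Of those, the rows that also have a drawn bounding box
boxed = kept[(kept[['x', 'y', 'w', 'h']] != 0).any(axis=1)]
print(len(kept), len(boxed))
```

With the real data you would replace `io.StringIO(csv_text)` with the path to `butterflies_bboxes.csv`.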