# Setting up a labeling task for Street2Sat
This notebook provides the code and instructions needed to create a Street2Sat labeling task from scratch.

## 1. Generating list of random images to label

#### 1a. Connect to Gcloud and get all image paths

In [29]:
from datetime import datetime
from pathlib import Path
from google.cloud import aiplatform, firestore, storage
from tqdm.notebook import tqdm

import collections
import pandas as pd
import random

In [2]:
# You may need to run the below commands to authenticate GCloud and set the correct project
# !gcloud auth login
# !gcloud config set project bsos-geog-harvest1

In [3]:
# Initialize connections to cloud storage and database
client = storage.Client()
db = firestore.Client()
coll = db.collection("street2sat")

In [4]:
# Load all available paths; this takes about 3 minutes
all_paths = [blob.name for blob in tqdm(client.list_blobs('street2sat-uploaded', prefix=""))]
random.shuffle(all_paths)

0it [00:00, ?it/s]

In [36]:
# Utility functions
cached = {}
csv_bucket = "street2sat-gcloud-labeling"

def get_images_already_being_labelled():
    """Gets images already labelled"""
    images_already_being_labelled = []
    csv_names = [blob.name for blob in client.list_blobs(csv_bucket, prefix="") if blob.name.endswith(".csv")]

    for csv_name in tqdm(csv_names, desc="Get already labelled"):
        uris = pd.read_csv(f"gs://{csv_bucket}/{csv_name}", header=None, sep="\n")[0]
        images_already_being_labelled += uris.to_list()

    # Ensure there are no duplicates in images already being labelled
    dupes = [item for item, count in collections.Counter(images_already_being_labelled).items() if count > 1]
    if '0' in dupes:
        dupes.remove('0') # An index of 0 was erroneously output in a previous csv
    assert len(dupes) == 0, "Found duplicates in images being labelled. One of the labeling tasks needs to be removed."
    return images_already_being_labelled

def amount_of_objects_detected(uri: str):
    """Finds amount of objects detected in image uri"""
    if uri in cached:
        return cached[uri]
    
    query = coll.where("input_img", "==", uri).limit(1).get()
    objects_detected = 0
    if len(query) > 0:
        item = query[0].to_dict()
        if "results" in item:
            objects_detected = len(item["results"])
    
    cached[uri] = objects_detected     
    return objects_detected

def get_paths_with_predicted_objects(amount: int, object_threshold: int = 5):
    """Gets URIs for images with more than {object_threshold} objects"""
    # A set makes the membership check in the loop below fast
    already_being_labelled = set(get_images_already_being_labelled())
    images_to_label = []
    with tqdm(total=amount, desc="Get URIs") as pbar:
        for i, p in enumerate(all_paths):
            uri = f"gs://street2sat-uploaded/{p}"
            if uri in already_being_labelled or amount_of_objects_detected(uri) <= object_threshold:
                continue
            pbar.update(1)
            images_to_label.append(uri)
            if len(images_to_label) == 1:
                print(f"Start range: {i}")
            if len(images_to_label) >= amount:
                print(f"End range: {i}")
                break
    return images_to_label
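
The selection logic above can be tried without any cloud access. The following is a minimal sketch, assuming the detection counts are already available in a dictionary; `select_images` and the example URIs are hypothetical names used only for illustration and are not part of the notebook.

```python
from typing import Dict, Iterable, List, Set


def select_images(
    candidates: Iterable[str],
    already_labelled: Set[str],
    object_counts: Dict[str, int],
    amount: int,
    object_threshold: int = 5,
) -> List[str]:
    """Return up to `amount` URIs with more than `object_threshold` detections."""
    selected: List[str] = []
    for uri in candidates:
        if uri in already_labelled:
            continue  # Skip images that already belong to a labeling CSV
        if object_counts.get(uri, 0) <= object_threshold:
            continue  # Skip images with too few predicted objects
        selected.append(uri)
        if len(selected) >= amount:
            break
    return selected


# Hypothetical example data standing in for Firestore results
counts = {
    "gs://example-bucket/a.jpg": 7,
    "gs://example-bucket/b.jpg": 5,
    "gs://example-bucket/c.jpg": 9,
    "gs://example-bucket/d.jpg": 12,
}
labelled = {"gs://example-bucket/c.jpg"}
picked = select_images(counts.keys(), labelled, counts, amount=2)
print(picked)
```

Note that an image with exactly `object_threshold` detections is skipped, which matches the `<=` comparison in `get_paths_with_predicted_objects`.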

#### 1b. Generate CSVs

In [81]:
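# date_format is not defined elsewhere in this notebook as shown; this value is
# inferred from the file names printed below and from step 3b (assumption).
# Note: %-M (minutes without zero padding) is platform-specific and is not supported on Windows.
date_format = "%Y-%m-%d_%H-%-M-%S"
amount_of_csvs_to_generate = 5
for i in tqdm(range(amount_of_csvs_to_generate), desc="CSV Generation"):
    # Get random images to label with more than 5 objects predicted in each image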
    images_to_label = get_paths_with_predicted_objects(100, object_threshold=5)
    csv_name = datetime.now().strftime(date_format) + ".csv"
    print(f"Saving to {csv_name}")
    df = pd.DataFrame(images_to_label)
    df.to_csv(f"gs://{csv_bucket}/{csv_name}", sep="\n", index=False, header=False)

CSV Generation:   0%|          | 0/5 [00:00<?, ?it/s]

Get already labelled:   0%|          | 0/18 [00:00<?, ?it/s]

Get URIs:   0%|          | 0/100 [00:00<?, ?it/s]

Start range: 6003
End range: 6528
Saving to 2021-11-22_13-3-20.csv


Get already labelled:   0%|          | 0/19 [00:00<?, ?it/s]

Get URIs:   0%|          | 0/100 [00:00<?, ?it/s]

Start range: 6537
End range: 7037
Saving to 2021-11-22_13-3-48.csv


Get already labelled:   0%|          | 0/20 [00:00<?, ?it/s]

Get URIs:   0%|          | 0/100 [00:00<?, ?it/s]

Start range: 7038
End range: 7547
Saving to 2021-11-22_13-4-20.csv


Get already labelled:   0%|          | 0/21 [00:00<?, ?it/s]

Get URIs:   0%|          | 0/100 [00:00<?, ?it/s]

Start range: 7553
End range: 8067
Saving to 2021-11-22_13-4-47.csv


Get already labelled:   0%|          | 0/22 [00:00<?, ?it/s]

Get URIs:   0%|          | 0/100 [00:00<?, ?it/s]

Start range: 8068
End range: 8660
Saving to 2021-11-22_13-5-19.csv
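
Each generated CSV holds one image URI per line, with no header and no index column. The sketch below illustrates that layout and the duplicate check using only the standard library; the URIs are hypothetical and the text is kept in memory rather than written to the bucket.

```python
import collections

# Hypothetical URIs standing in for one generated batch
batch = [
    "gs://example-bucket/img_001.jpg",
    "gs://example-bucket/img_002.jpg",
    "gs://example-bucket/img_003.jpg",
]

# One URI per line, no header, no index
csv_text = "\n".join(batch) + "\n"

# Reading the file back gives the original list of URIs
read_back = [line for line in csv_text.splitlines() if line]

# Simulate a second CSV that accidentally repeats one image
all_uris = read_back + ["gs://example-bucket/img_002.jpg"]
dupes = [uri for uri, count in collections.Counter(all_uris).items() if count > 1]
print(dupes)
```

A non-empty `dupes` list is the condition that the assertion in `get_images_already_being_labelled` guards against.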


## 2. Creating a dataset on Vertex AI

In [102]:
# Find the CSVs that do not yet have a matching dataset
existing_dataset_dates = []
for d in aiplatform.ImageDataset.list():
    cleaned_display_name = d.display_name.replace(" ", "") 
    existing_dataset_dates.append(datetime.strptime(cleaned_display_name, "%Y-%m-%d_%H-%M-%S"))

csvs_for_new_datasets = []
for blob in client.list_blobs(csv_bucket, prefix=""):
    if not blob.name.endswith(".csv"):
        continue
    csv_date = datetime.strptime(Path(blob.name).stem, "%Y-%m-%d_%H-%M-%S")
    if csv_date in existing_dataset_dates:
        continue
    csvs_for_new_datasets.append(blob)
print(f"Found {len(csvs_for_new_datasets)} csvs_for_new_datasets")

Found 16 csvs_for_new_datasets
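
The matching above relies on each CSV name and dataset display name parsing to the same timestamp. As a small sketch of that step, the snippet below shows that `datetime.strptime` with `%M` accepts both zero-padded and unpadded minutes, so file names such as `2021-11-22_13-3-20.csv` still parse. `csv_timestamp` and the second file name are hypothetical and used only for illustration.

```python
from datetime import datetime
from pathlib import Path

PARSE_FORMAT = "%Y-%m-%d_%H-%M-%S"


def csv_timestamp(blob_name: str) -> datetime:
    """Parse the timestamp encoded in a CSV file name."""
    return datetime.strptime(Path(blob_name).stem, PARSE_FORMAT)


unpadded = csv_timestamp("2021-11-22_13-3-20.csv")
padded = csv_timestamp("2021-11-22_13-03-20.csv")
print(unpadded, padded)

# CSVs whose timestamp has no dataset yet are the ones to create
existing = {unpadded}
candidates = ["2021-11-22_13-3-20.csv", "2021-11-22_13-3-48.csv"]
to_create = [name for name in candidates if csv_timestamp(name) not in existing]
print(to_create)
```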


In [106]:
# Create datasets
new_datasets = []
for csv_for_dataset in csvs_for_new_datasets:
    # Name each dataset after its own CSV (not the first CSV in the list)
    display_name = Path(csv_for_dataset.name).stem
    assert datetime.strptime(display_name, "%Y-%m-%d_%H-%M-%S")
        
    ds = aiplatform.ImageDataset.create(
        display_name=display_name,
        gcs_source=f"gs://{csv_for_dataset.bucket.name}/{csv_for_dataset.name}",
        import_schema_uri=aiplatform.schema.dataset.ioformat.image.bounding_box,
        sync=False,
    )
    new_datasets.append(ds)  

INFO:google.cloud.aiplatform.datasets.dataset:Creating ImageDataset
INFO:google.cloud.aiplatform.datasets.dataset:Create ImageDataset backing LRO: projects/1012768714927/locations/us-central1/datasets/7294626331596161024/operations/1088044889729400832
INFO:google.cloud.aiplatform.datasets.dataset:ImageDataset created. Resource name: projects/1012768714927/locations/us-central1/datasets/7294626331596161024
INFO:google.cloud.aiplatform.datasets.dataset:To use this ImageDataset in another session:
INFO:google.cloud.aiplatform.datasets.dataset:ds = aiplatform.ImageDataset('projects/1012768714927/locations/us-central1/datasets/7294626331596161024')
INFO:google.cloud.aiplatform.datasets.dataset:Importing ImageDataset data: projects/1012768714927/locations/us-central1/datasets/7294626331596161024
INFO:google.cloud.aiplatform.datasets.dataset:Import ImageDataset data backing LRO: projects/1012768714927/locations/us-central1/datasets/7294626331596161024/operations/3587542682920026112
INFO:googl

INFO:google.cloud.aiplatform.datasets.dataset:To use this ImageDataset in another session:
INFO:google.cloud.aiplatform.datasets.dataset:ds = aiplatform.ImageDataset('projects/1012768714927/locations/us-central1/datasets/8564641426514640896')
INFO:google.cloud.aiplatform.datasets.dataset:Importing ImageDataset data: projects/1012768714927/locations/us-central1/datasets/8564641426514640896
INFO:google.cloud.aiplatform.datasets.dataset:Import ImageDataset data backing LRO: projects/1012768714927/locations/us-central1/datasets/8564641426514640896/operations/5003924765728047104
INFO:google.cloud.aiplatform.datasets.dataset:Creating ImageDataset
INFO:google.cloud.aiplatform.datasets.dataset:Create ImageDataset backing LRO: projects/1012768714927/locations/us-central1/datasets/1199004235950194688/operations/2698081756514353152
INFO:google.cloud.aiplatform.datasets.dataset:Creating ImageDataset
INFO:google.cloud.aiplatform.datasets.dataset:Create ImageDataset backing LRO: projects/10127687149

In [107]:
# Check status
# Count each dataset in the list, not only the last one created
done = len([d for d in new_datasets if d._gca_resource])
total = len(new_datasets)
print(f"Completed processing: {done}/{total}")    

Completed processing: 0/15


## 3. Creating a labeling task
You can create a labeling task while a dataset is still being created or after creation is complete.

a. Click **CREATE LABELING TASK**

b. For Labeling Task name: use the `%Y-%m-%d_%H-%-M-%S` format with a task suffix

![image.png](../assets/labeling-task/3b.png)

c. In step 2: input the labels as per [classes.txt](https://github.com/nasaharvest/street2sat_website/blob/main/street2sat_utils/crop_info/classes.txt)

![image-2.png](../assets/labeling-task/3c.png)

d. In step 3: select the instruction PDF available in `gs://street2sat-gcloud-labeling`. (This can be updated if necessary.)

![image-3.png](../assets/labeling-task/3d.png)

e. In step 4: select the "Street2sat Labelers" group and add your email as a manager.

![image-4.png](../assets/labeling-task/3e.png)

f. After some time (maybe 15 minutes) you should receive an email from noreply-vertexai@google.com that looks like this:
![image-6.png](../assets/labeling-task/3f.png)


## 4. Labeling Manager Console
Using the second link in the email above, you can access the Manager console.
- Tasks - shows the progress of the created Labeling task
- Specialists - shows the list of available labelers (more can be added here)
- Assignments - allows managing who is working on which task

Tasks and Specialists are fairly intuitive, but to change an assignment you must:
1. Populate the Specialists column with an existing specialist (labeler)
2. Populate the Tasks column with a created task
3. Make changes by checking or unchecking specialists
4. Commit changes using the button on the top right

![image.png](../assets/labeling-task/4.png)


## 5. Actual labeling
Members of the Street2sat labelers group should receive an email like this upon assignment:
![image.png](../assets/labeling-task/5.png)

However, these emails have not arrived consistently, so you may need to manually send labelers a direct link from the Step 3f email.

Once the specialist clicks the link, they'll see the following page and will be ready to go:
![image-2.png](../assets/labeling-task/5b.png)

- Since this guide is for individuals creating labeling tasks, I'll leave details about actual labeling on Google Cloud to another slide/document.

## 6. Exporting the labels
To export labels, navigate to Vertex AI datasets, click the 3 dots on the desired dataset, and select Export dataset.
![image.png](../assets/labeling-task/6.png)

Save the output file directly to the `street2sat-gcloud-labeling` bucket.
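
Once an export is saved, a quick tally of labels per class can help check progress. The following is a rough sketch, assuming the exported file is JSON Lines in the same shape as the Vertex AI bounding box import schema (`imageGcsUri` plus a `boundingBoxAnnotations` list with `displayName` and normalized coordinates). The actual export layout should be verified against a real file; the sample records, class names, and `count_labels` are hypothetical.

```python
import json
from collections import Counter

# Hypothetical records; the field names are an assumption based on the import schema
sample_export = "\n".join(
    [
        json.dumps(
            {
                "imageGcsUri": "gs://example-bucket/img_001.jpg",
                "boundingBoxAnnotations": [
                    {"displayName": "maize", "xMin": 0.1, "yMin": 0.2, "xMax": 0.4, "yMax": 0.6},
                    {"displayName": "maize", "xMin": 0.5, "yMin": 0.2, "xMax": 0.8, "yMax": 0.7},
                ],
            }
        ),
        json.dumps(
            {
                "imageGcsUri": "gs://example-bucket/img_002.jpg",
                "boundingBoxAnnotations": [
                    {"displayName": "banana", "xMin": 0.0, "yMin": 0.1, "xMax": 0.3, "yMax": 0.9},
                ],
            }
        ),
    ]
)


def count_labels(jsonl_text: str) -> Counter:
    """Count bounding boxes per class name in a JSON Lines export."""
    label_counts: Counter = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # Ignore blank lines
        record = json.loads(line)
        for box in record.get("boundingBoxAnnotations", []):
            label_counts[box["displayName"]] += 1
    return label_counts


label_counts = count_labels(sample_export)
print(label_counts)
```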