Prep local dataset (so we can merge into it later)

In [None]:
!rm -rf /tmp/oxford_pet.lance

import lance
uri = "s3://eto-public/datasets/oxford_pet/oxford_pet.lance"
ds = lance.dataset(uri)
table = ds.to_table()
len(table)

lance.write_dataset(table, '/tmp/oxford_pet.lance')

Read out the dataset

In [None]:
import lance
uri = "/tmp/oxford_pet.lance"
ds = lance.dataset(uri)

Say we find 10 images that have bad labels

In [None]:
df = ds.head(10, columns=["external_image", "class", "_pk"]).to_pandas()
# Rewrite the S3 URIs as public HTTPS URLs so the images can be loaded in a browser
df['external_image'] = df.external_image.str.replace('s3://eto-public', 'https://eto-public.s3.us-west-2.amazonaws.com', regex=False)
df

Create new LabelStudio tasks to correct the bad labels

(For now we assume the project is already created)

In [None]:
from labelstudio import LanceLabelStudioClient as Client

ls = Client.create()
p = ls.get_project("imagenet")

task_ids = p.add_tasks(df, image_col="external_image", pk_col="_pk")

In [None]:
task_ids

Let's pop over to the LabelStudio UI and label these

http://localhost:8080/projects/3/data?tab=11

Once they're done, we export the annotations from LabelStudio and merge them into the original dataset

In [None]:
import pyarrow as pa
label_df = p.get_annotations("new_label")


schema = pa.schema([pa.field('id', ds.schema.field("_pk").type),
                    pa.field('new_label', pa.string())])
tbl = pa.Table.from_pandas(label_df, schema)

ds.merge(tbl, left_on='_pk', right_on='id')

lance.dataset(uri).schema.names

BUG: Lance dataset merge doesn't have proper NA handling (rows with no matching join key get an empty string instead of NA)

In [None]:
ds = lance.dataset(uri)  # reopen so we read the latest version, including the merged column
df = ds.to_table(columns=["new_label", "external_image", "class", "_pk"]).to_pandas()
df[df.new_label != '']
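
Until the merge handles NA properly, one possible workaround is to treat empty strings as missing on the pandas side. This is a minimal sketch on a toy frame: the values are made up, the name `merged` is hypothetical, and it assumes an empty string is never a legitimate label.

```python
import pandas as pd

# Toy stand-in for the merged dataset; these rows are illustrative only.
merged = pd.DataFrame({
    "_pk": [1, 2, 3],
    "class": ["cat", "dog", "cat"],
    "new_label": ["dog", "", ""],  # "" marks rows with no matching join key
})

# Replace the empty strings with missing values (NaN).
merged["new_label"] = merged["new_label"].mask(merged["new_label"] == "")

# Keep only the rows that actually received a corrected label.
relabeled = merged[merged["new_label"].notna()]
print(relabeled)
```

Filtering with `notna()` then reads the same way it would if the merge had produced NA in the first place.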

Integration notes:

1. LabelStudio has both "pre-annotations" and "model generated labels". It's not clear how they differ.
2. The current integration assumes a single classification problem.
3. Merging in a string column produces an empty string rather than a string NA for rows where the join key has no match.
4. The dtype is lost during JSON conversion, so extra processing is needed to match the join key dtype.