# ImageNet-1K Data Quality and Model Performance

[ImageNet 1K dataset](https://www.image-net.org/index.php) is an established image classification dataset.
Plenty of the off-shief classification models are trained on it.

In this notebook, we demo how much more we can know from such a public dataset and models.

First, let's load necessary packages and setup DuckDB extensions.

In [1]:
import lance
import duckdb
import torchvision
import torch
import pandas as pd
import pyarrow as pa

In [2]:
%load_ext sql
%sql duckdb:///:memory:

{}


In [3]:
uri = "imagenet.lance"

ds = lance.dataset(uri)

In [4]:
ds.schema

id: int32
image: extension<image[binary]<ImageBinaryType>>
label: int16
name: dictionary<values=string, indices=int16, ordered=0>
split: dictionary<values=string, indices=int8, ordered=0>

In [5]:
%%sql

SELECT split, count(split) FROM ds GROUP BY split

Took 0.007819414138793945


Unnamed: 0,split,count(split)
0,train,50000
1,validation,50000
2,test,50000


## Use two official pre-trained models ResNet and VisionTransform

We load two pretrained classic CNN and Transformer models to help us understand the dataset better.

* The ResNet model is based on the [Deep Residual Learning for Image Recognition paper](https://arxiv.org/abs/1512.03385)
* The VisionTransformer model is based on the [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale paper](https://arxiv.org/abs/2010.11929).

Models are moved to the accelerators if available. 
Not only we support CUDA as backend, we support [MPS backend on macOS](https://pytorch.org/docs/stable/notes/mps.html) as well.

In [6]:
from torchvision.models import resnet50, vit_b_16
import torch

# Support CUDA (Linux) or MPS (Mac) backends.
device = torch.device(
    "cuda" if torch.cuda.is_available() else (
        "mps" if torch.backends.mps.is_available() else "cpu")
)

resnet = resnet50(weights="DEFAULT").to(device)
vit = vit_b_16(weights="DEFAULT").to(device)

## Run the inference of these two models

Lance provides native [PyTorch Dataset](https://eto-ai.github.io/lance/api/python/lance.pytorch.html#lance.pytorch.data.Dataset),
which works with PyTorch DataLoader.

We can write a plain PyTorch evaluation loop and persist the predictions back to the dataset for future analysis.



In [7]:
# TODO: make easy conversion between lance.Dataset and lance.pytorch.Dataset
from lance.pytorch import Dataset
from torch.utils.data import DataLoader

def run_inference(uri: str, model, transform, col_name: str) -> pa.Table:
    dataset = Dataset(
        uri, 
        columns=["id", "image"],
        mode="batch",
        batch_size=128
    )
    data_loader = DataLoader(dataset, batch_size=1, num_workers=4)
    results = []
    with torch.no_grad():
        model.eval()
        for batch in dataset:
            imgs = [transform(img).to(device) for img in batch[1]]
            prediction = resnet(torch.stack(imgs)).squeeze(0).softmax(0)
            topk = torch.topk(prediction, 2)
            for pk, scores, indices in zip(
                batch[0], topk.values.tolist(), topk.indices.tolist()
            ):
                results.append({
                    "id": pk.item(),
                    col_name: {
                        "label": indices[0], 
                        "score": scores[0], 
                        "second_label": indices[1],  # Secondary guess
                        "second_score": scores[1],  # Confidence of the secondary guess.
                    }
                })
    df = pd.DataFrame(data=results)
    df = df.astype({"id": "int32"})
    return pa.Table.from_pandas(df)

resnet_table = run_inference(
    uri, resnet, torchvision.models.ResNet50_Weights.DEFAULT.transforms(), "resnet"
)
vit_table = run_inference(
    uri, vit, torchvision.models.ViT_L_16_Weights.DEFAULT.transforms(), "vit"
)

### Adding the inference results back to the dataset via appending columns.

Because Lance supports [Schema Evolution](https://en.wikipedia.org/wiki/Schema_evolution), 
it is quite easy to add new columns from the model inference back to the dataset.

In [8]:
ds = ds.merge(resnet_table, left_on="id", right_on="id")
ds = ds.merge(vit_table, left_on="id", right_on="id")
ds.schema

id: int32
image: extension<image[binary]<ImageBinaryType>>
label: int16
name: dictionary<values=string, indices=int16, ordered=0>
split: dictionary<values=string, indices=int8, ordered=0>
resnet: struct<label: int64, score: double, second_label: int64, second_score: double>
  child 0, label: int64
  child 1, score: double
  child 2, second_label: int64
  child 3, second_score: double
vit: struct<label: int64, score: double, second_label: int64, second_score: double>
  child 0, label: int64
  child 1, score: double
  child 2, second_label: int64
  child 3, second_score: double

Two columns `resnet` and `vit` are added using a LEFT JOIN algorithm on the "id" column, each of which contains the inference output from the model respectively.

Actually, by doing so, we creates 2 extra versions of Lance dataset. Underneath, Lance only writes the new columns to disk. It will not make extra copy of the existing columns. 

In [9]:
# See multiple versions of the dataset

ds.versions()

[{'version': 1,
  'timestamp': datetime.datetime(2022, 12, 13, 0, 11, 1, tzinfo=datetime.timezone.utc)},
 {'version': 2,
  'timestamp': datetime.datetime(2022, 12, 13, 1, 4, 25, tzinfo=datetime.timezone.utc)},
 {'version': 3,
  'timestamp': datetime.datetime(2022, 12, 13, 1, 4, 25, tzinfo=datetime.timezone.utc)}]

## Model Performance

Using lance and SQL, computing basic ML metrics such as ***precision*** is straightfoward and fast.

In [10]:
%%sql

SELECT 
  SUM(CAST(resnet.label == label AS FLOAT)) / COUNT(label) as resnet_precision,
  SUM(CAST(vit.label == label AS FLOAT)) / COUNT(label) as vit_precision
FROM ds 
WHERE split = 'validation'

Took 0.03298759460449219


Unnamed: 0,resnet_precision,vit_precision
0,0.7833,0.78098


Additionally, it is trivial to slice into each label class to detailed model performance in each class.

For example, we can see that both models consistently perform badly in `maillot`, `screen` and `sunglass` categories.

In [19]:
%%sql 

SELECT
  DISTINCT(name) as class,
  SUM(CAST(resnet.label == label AS FLOAT)) / COUNT(label) as resnet_precision,
  SUM(CAST(vit.label == label AS FLOAT)) / COUNT(label) as vit_precision
FROM ds 
WHERE split = 'validation'
GROUP BY name
ORDER BY resnet_precision ASC
LIMIT 10

Took 0.06665277481079102


Unnamed: 0,class,resnet_precision,vit_precision
0,maillot,0.12,0.1
1,"screen, CRT screen",0.2,0.18
2,"sunglasses, dark glasses, shades",0.2,0.18
3,"notebook, notebook computer",0.24,0.24
4,tiger cat,0.24,0.2
5,"projectile, missile",0.24,0.2
6,"night snake, Hypsiglena torquata",0.26,0.24
7,"green lizard, Lacerta viridis",0.26,0.28
8,"letter opener, paper knife, paperknife",0.26,0.24
9,"laptop, laptop computer",0.26,0.26


## Automatically Discover (Potential) Mislabels

To find potential mislables, we first need to automately establish a baseline of what consider as correct labels.
We establish the baseline via the agreements of the two pre-trained models inference results. That is,
if the two models strongly agree with each other (i.e., same label and confience score is high), but the predict label is not what ground truth describes.

Such logic can be easily expressed via SQL:


In [20]:
%%sql

WITH label_names AS (SELECT DISTINCT label, name FROM ds)

SELECT ds.name AS ground_truth,
  label_names.name as predict,
  resnet.score as score
FROM ds, label_names 
WHERE
  split != 'test'
  AND ds.label !=  resnet.label 
  AND resnet.label == vit.label
  AND resnet.label = label_names.label
ORDER BY resnet.score DESC
LIMIT 20

Took 0.039568424224853516


Unnamed: 0,ground_truth,predict,score
0,"cucumber, cuke","zucchini, courgette",0.995327
1,"cowboy hat, ten-gallon hat",cowboy boot,0.994327
2,beaker,ping-pong ball,0.993382
3,"cellular telephone, cellular phone, cellphone,...","mouse, computer mouse",0.992164
4,holster,bulletproof vest,0.992044
5,"harmonica, mouth organ, harp, mouth harp","ocarina, sweet potato",0.990515
6,beaver,mongoose,0.987269
7,"vacuum, vacuum cleaner",broom,0.985042
8,wok,"hot pot, hotpot",0.983968
9,"tub, vat","bathtub, bathing tub, bath, tub",0.983811


While `zucchini` can look similar to `cucumber`, it defintely worths to investigate why the models think `cowboy hat` as `cowboy boot`, and `beaker` as `ping-pong ball`.

The reverse order of the above query (`ORDER BY score ASC`) is also very informative, 
as it shows where the weak agreement cross different models.

In [22]:
%%sql

WITH label_names AS (SELECT DISTINCT label, name FROM ds)

SELECT ds.name AS ground_truth,
  label_names.name as prediction,
  resnet.score as resnet_score,
  vit.score as vit_score
FROM ds, label_names 
WHERE
  split != 'test'
  AND ds.label !=  resnet.label 
  AND resnet.label == vit.label
  AND resnet.label = label_names.label
ORDER BY resnet.score ASC
LIMIT 20

Took 0.04638862609863281


Unnamed: 0,ground_truth,prediction,resnet_score,vit_score
0,"carousel, carrousel, merry-go-round, roundabou...",swing,0.034843,0.04845
1,space heater,"loudspeaker, speaker, speaker unit, loudspeake...",0.042546,0.051058
2,plastic bag,"diaper, nappy, napkin",0.046489,0.038929
3,"grand piano, grand",scabbard,0.057336,0.06022
4,"lighter, light, igniter, ignitor",pencil sharpener,0.059951,0.047769
5,padlock,combination lock,0.066076,0.101465
6,"submarine, pigboat, sub, U-boat","pirate, pirate ship",0.066364,0.071412
7,"loggerhead, loggerhead turtle, Caretta caretta","leatherback turtle, leatherback, leathery turt...",0.073953,0.077844
8,bath towel,"doormat, welcome mat",0.086519,0.069996
9,"oboe, hautboy, hautbois",bassoon,0.092604,0.105123


Unsurprisingly, the query results collect some data samples that both models consistantly perform weakly. 

We can dig into the distribution to see which class behavor the worst. 

In [14]:
%%sql

WITH label_names AS (SELECT DISTINCT label, name FROM ds)

SELECT
  DISTINCT label_names.name as name,
  COUNT(label_names.name) as cnt,
  AVG(resnet.score) as avg_score
FROM ds, label_names 
WHERE
  split != 'test'
  AND ds.label !=  resnet.label 
  AND resnet.label == vit.label
  AND resnet.label = label_names.label
  AND resnet.score < 0.35
GROUP BY 1
ORDER BY 3

Took 0.03439927101135254


Unnamed: 0,name,cnt,avg_score
0,"loudspeaker, speaker, speaker unit, loudspeake...",1,0.042546
1,combination lock,1,0.066076
2,"pirate, pirate ship",1,0.066364
3,"leatherback turtle, leatherback, leathery turt...",1,0.073953
4,"doormat, welcome mat",1,0.086519
...,...,...,...
483,canoe,1,0.346164
484,"zucchini, courgette",1,0.347184
485,"chambered nautilus, pearly nautilus, nautilus",1,0.348751
486,"tailed frog, bell toad, ribbed toad, tailed to...",1,0.348798


## Active Learning in Lance

With Lance and DuckDB, it is easy to build active learning loop as well.

One typical approach of Active Learning is finding `Lowest Margin of Confidence`. 

This query finds the examples where a model (*ResNet* in this case) is less confident between the top two candidates.

In [23]:
%%sql

WITH label_names AS (SELECT DISTINCT label, name FROM ds)

SELECT 
    ds.name as gt,
    n1.name as best_guess,
    n2.name as second_guess,
    resnet.score - resnet.second_score AS margin_of_confidence
FROM ds, label_names as n1, label_names as n2
WHERE 
    split != 'test'
    AND n1.label = resnet.label
    AND n2.label = resnet.second_label
ORDER BY margin_of_confidence
LIMIT 20

Took 0.08943009376525879


Unnamed: 0,gt,best_guess,second_guess,margin_of_confidence
0,sliding door,sliding door,tripod,2.384186e-07
1,"promontory, headland, head, foreland",pier,steel arch bridge,8.285046e-06
2,tiger cat,tiger cat,"tiger, Panthera tigris",1.80006e-05
3,toilet seat,rain barrel,yurt,1.966953e-05
4,Yorkshire terrier,Norwich terrier,"Pembroke, Pembroke Welsh corgi",2.676249e-05
5,dumbbell,dumbbell,barbell,2.938509e-05
6,"projectile, missile","projectile, missile",military uniform,3.413856e-05
7,water jug,"pitcher, ewer",vase,4.145503e-05
8,"garden spider, Aranea diademata","wolf spider, hunting spider","garden spider, Aranea diademata",5.722046e-05
9,"silky terrier, Sydney silky",Yorkshire terrier,"silky terrier, Sydney silky",7.516146e-05


The results between best guess and second guess are very interesting in this table. For example, ResNet thinks a `toiler seat` is eitehr a `rain barrel` or `yurt`, which could lead to a few distinct directions to improve:

1. Adding more distinct and high quality training examples in each of these categories. 
2. Improve feature extraction and the feature residual in the model architecture design.
3. Use multi-models to improve the accuracy in ambiguous case.

## A Data-Driven ML development cycle does not stop here. 

With Lance and DuckDB, and about 10 lines of SQL, it takes less than `0.1s` to understandt dataset. It offers flexibilty to 