# ImageNet-1K Data Quality and Model Performance

The [ImageNet-1K dataset](https://www.image-net.org/index.php) is a well-established image classification dataset.
Many off-the-shelf classification models are trained on it.

In this notebook, we show how much more we can learn from such a public dataset and the models trained on it.

First, let's load the necessary packages and set up the DuckDB extensions.

In [1]:
import lance
import duckdb
import torchvision
import torch
import pandas as pd
import pyarrow as pa

In [2]:
%load_ext sql
%sql duckdb:///:memory:

{}


In [3]:
uri = "imagenet.lance"

ds = lance.dataset(uri)

In [4]:
ds.schema

id: int32
image: extension<image[binary]<ImageBinaryType>>
label: int16
name: dictionary<values=string, indices=int16, ordered=0>
split: dictionary<values=string, indices=int8, ordered=0>

In [5]:
%%sql

SELECT split, count(split) FROM ds GROUP BY split

Took 0.005626678466796875


| split | count(split) |
|---|---|
| train | 10000 |
| test | 10000 |
| validation | 10000 |


## Use two official pre-trained models: ResNet and VisionTransformer

We load two pretrained models, a classic CNN and a Transformer, to help us understand the dataset better.

* The ResNet model is based on the [Deep Residual Learning for Image Recognition paper](https://arxiv.org/abs/1512.03385)
* The VisionTransformer model is based on the [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale paper](https://arxiv.org/abs/2010.11929).

Models are moved to an accelerator if one is available.
In addition to the CUDA backend, the [MPS backend on macOS](https://pytorch.org/docs/stable/notes/mps.html) is supported as well.

In [None]:
from torchvision.models import resnet50, vit_b_16
import torch

# Support CUDA (Linux) or MPS (Mac) backends.
device = torch.device(
    "cuda" if torch.cuda.is_available() else (
        "mps" if torch.backends.mps.is_available() else "cpu")
)

resnet = resnet50(weights="DEFAULT").to(device)
vit = vit_b_16(weights="DEFAULT").to(device)

## Run inference with these two models

We then persist the predictions back to the dataset for future analysis.


In [None]:
# TODO: make easy conversion between lance.Dataset and lance.pytorch.Dataset
from lance.pytorch import Dataset

def run_inference(uri: str, model, transform, col_name: str) -> pa.Table:
    dataset = Dataset(
        uri, 
        columns=["id", "image"],
        mode="batch",
        batch_size=128
    )
    results = []
    with torch.no_grad():
        model.eval()
        for batch in dataset:
            imgs = [transform(img).to(device) for img in batch[1]]
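            # Use the model passed in (not the global `resnet`) and normalize
            # over the class dimension so each image gets its own distribution.
            prediction = model(torch.stack(imgs)).softmax(dim=1)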
            topk = torch.topk(prediction, 2)
            for pk, scores, indices in zip(
                batch[0], topk.values.tolist(), topk.indices.tolist()
            ):
                results.append({
                    "id": pk.item(),
                    col_name: {
                        "label": indices[0], 
                        "score": scores[0], 
                        "second_label": indices[1],  # Secondary guess
                        "second_score": scores[1],  # Confidence of the secondary guess.
                    }
                })
    df = pd.DataFrame(data=results)
    df = df.astype({"id": "int32"})
    return pa.Table.from_pandas(df)

resnet_table = run_inference(
    uri, resnet, torchvision.models.ResNet50_Weights.DEFAULT.transforms(), "resnet"
)
vit_table = run_inference(
    uri, vit, torchvision.models.ViT_B_16_Weights.DEFAULT.transforms(), "vit"
)
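
The post-processing inside `run_inference` (a softmax over each image's class scores, followed by keeping the two highest-scoring labels) can be illustrated without PyTorch. The following is a minimal pure-Python sketch under stated assumptions: the helper names `softmax` and `top2` and the logits are hypothetical, and it is not the notebook's actual implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2(probs):
    """Return (label, score) pairs for the two highest-scoring classes."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:2]]

# Hypothetical logits for a batch of two images over four classes.
batch_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 3.0, 1.0],
]

records = []
for row_id, logits in enumerate(batch_logits):
    # Normalize each image (row) independently, then keep the top two guesses.
    (label, score), (second_label, second_score) = top2(softmax(logits))
    records.append({
        "id": row_id,
        "label": label,
        "score": score,
        "second_label": second_label,
        "second_score": second_score,
    })
```

Each record mirrors the struct stored per model column: the best guess with its score, plus the secondary guess with its score.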

### Adding the inference results back to the dataset via appending columns.

Because Lance supports [Schema Evolution](https://en.wikipedia.org/wiki/Schema_evolution),
it is easy and fast to add new columns, one model at a time.

In [None]:
ds = ds.merge(resnet_table, left_on="id", right_on="id")
ds = ds.merge(vit_table, left_on="id", right_on="id")
ds.schema

As a result, two columns, `resnet` and `vit`, are added using a LEFT JOIN on the `id` column. Each contains the inference output of the corresponding model.

In doing so, we create two additional versions of the Lance dataset. Under the hood, Lance writes only the new columns to disk; it does not make an extra copy of the existing columns.

In [None]:
# See multiple versions of the dataset

ds.versions()

## Model Performance

Using Lance and SQL, computing basic ML metrics such as precision is straightforward and fast.

In [None]:
%%sql

SELECT 
  SUM(CAST(resnet.label == label AS FLOAT)) / COUNT(label) as resnet_precision,
  SUM(CAST(vit.label == label AS FLOAT)) / COUNT(label) as vit_precision
FROM ds 
WHERE split = 'validation'
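
For readers less familiar with SQL, the ratio computed by the query above is the fraction of validation rows whose predicted label equals the ground-truth label (often called top-1 accuracy). Below is a minimal Python sketch of the same arithmetic on a handful of made-up rows; the function name `hit_rate` and the toy data are assumptions for illustration only.

```python
def hit_rate(rows, model, split="validation"):
    """Fraction of rows in `split` where the model's label matches the ground truth."""
    selected = [r for r in rows if r["split"] == split]
    if not selected:
        return float("nan")
    hits = sum(1 for r in selected if r[model]["label"] == r["label"])
    return hits / len(selected)

# Hypothetical rows shaped like the merged dataset (one struct per model).
rows = [
    {"split": "validation", "label": 3, "resnet": {"label": 3}, "vit": {"label": 3}},
    {"split": "validation", "label": 5, "resnet": {"label": 5}, "vit": {"label": 2}},
    {"split": "validation", "label": 7, "resnet": {"label": 7}, "vit": {"label": 7}},
    {"split": "validation", "label": 1, "resnet": {"label": 4}, "vit": {"label": 4}},
    {"split": "train", "label": 9, "resnet": {"label": 0}, "vit": {"label": 0}},
]

resnet_rate = hit_rate(rows, "resnet")  # 3 of 4 validation rows match
vit_rate = hit_rate(rows, "vit")        # 2 of 4 validation rows match
```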

Using DuckDB / SQL, it is trivial to slice into each label class to see model performance in each category.

In [None]:
%%sql 

SELECT
  DISTINCT(name),
  SUM(CAST(resnet.label == label AS FLOAT)) / COUNT(label) as resnet_precision,
  SUM(CAST(vit.label == label AS FLOAT)) / COUNT(label) as vit_precision
FROM ds 
WHERE split = 'validation'
GROUP BY name
ORDER BY resnet_precision ASC
LIMIT 10
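
The per-class breakdown above amounts to a group-by on the class name followed by the same ratio within each group. A minimal Python sketch, assuming hypothetical toy rows and a hypothetical helper name `per_class_hit_rate`:

```python
from collections import defaultdict

def per_class_hit_rate(rows, model):
    """Map each ground-truth class name to the fraction of its rows predicted correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in rows:
        totals[r["name"]] += 1
        hits[r["name"]] += int(r[model]["label"] == r["label"])
    return {name: hits[name] / totals[name] for name in totals}

# Hypothetical validation rows for two classes.
rows = [
    {"name": "cat", "label": 0, "resnet": {"label": 0}},
    {"name": "cat", "label": 0, "resnet": {"label": 1}},
    {"name": "dog", "label": 1, "resnet": {"label": 1}},
    {"name": "dog", "label": 1, "resnet": {"label": 1}},
]

rates = per_class_hit_rate(rows, "resnet")
# Sort ascending so the weakest classes come first, like ORDER BY ... ASC.
worst_first = sorted(rates.items(), key=lambda kv: kv[1])
```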

## Find Potential Mislabels

If the two models strongly agree with each other (i.e., they predict the same label with a high confidence score) but the predicted label differs from the ground truth, the example is a candidate for a mislabeled ground truth.


In [None]:
%%sql

WITH label_names AS (SELECT DISTINCT label, name FROM ds)

SELECT ds.name AS gt, ds.label as gt_label,
  resnet.label as resnet_label,
  vit.label as vit_label,
  label_names.name as predict_name,
  resnet.score as predict_score
FROM ds, label_names 
WHERE
  split != 'test'
  AND ds.label !=  resnet.label 
  AND resnet.label == vit.label
  AND resnet.label = label_names.label
ORDER BY resnet.score DESC
LIMIT 20
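
The filtering logic of this query can also be sketched in plain Python. Note the assumptions: the toy rows and the function name `mislabel_candidates` are hypothetical, and the `min_score` cutoff is an addition for illustration (the SQL query only sorts by score and keeps the top 20).

```python
def mislabel_candidates(rows, min_score=0.9):
    """Rows where both models agree confidently on a label that differs from the ground truth."""
    picked = [
        r for r in rows
        if r["split"] != "test"
        and r["resnet"]["label"] == r["vit"]["label"]   # models agree
        and r["resnet"]["label"] != r["label"]          # but not with the ground truth
        and min(r["resnet"]["score"], r["vit"]["score"]) >= min_score
    ]
    # Most confident disagreements with the ground truth come first.
    return sorted(picked, key=lambda r: r["resnet"]["score"], reverse=True)

def row(id_, split, label, resnet, vit):
    """Build one hypothetical row; `resnet` and `vit` are (label, score) pairs."""
    return {
        "id": id_, "split": split, "label": label,
        "resnet": {"label": resnet[0], "score": resnet[1]},
        "vit": {"label": vit[0], "score": vit[1]},
    }

rows = [
    row(1, "train", 0, (1, 0.98), (1, 0.95)),       # agree, confident, wrong -> candidate
    row(2, "train", 2, (2, 0.99), (2, 0.97)),       # both correct
    row(3, "validation", 3, (4, 0.97), (5, 0.96)),  # models disagree
    row(4, "validation", 6, (7, 0.40), (7, 0.93)),  # agreement is not confident
    row(5, "validation", 8, (9, 0.99), (9, 0.91)),  # candidate
    row(6, "test", 0, (1, 0.99), (1, 0.99)),        # test split is excluded
]

candidate_ids = [r["id"] for r in mislabel_candidates(rows)]
```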

The reverse order of the above query (`ORDER BY resnet.score ASC`) is also very informative,
as it shows where the agreement between the models is weak.

In [None]:
%%sql

WITH label_names AS (SELECT DISTINCT label, name FROM ds)

SELECT ds.name AS gt, ds.label as gt_label,
  resnet.label as resnet_label,
  vit.label as vit_label,
  label_names.name as predict_name,
  resnet.score as resnet_score,
  vit.score as vit_score
FROM ds, label_names 
WHERE
  split != 'test'
  AND ds.label !=  resnet.label 
  AND resnet.label == vit.label
  AND resnet.label = label_names.label
ORDER BY resnet.score ASC
LIMIT 20

We can dig into the distribution to see which classes behave the worst.

In [None]:
%%sql

WITH label_names AS (SELECT DISTINCT label, name FROM ds)

SELECT
  DISTINCT label_names.name as name,
  COUNT(label_names.name) as cnt,
  AVG(resnet.score) as avg_score
FROM ds, label_names 
WHERE
  split != 'test'
  AND ds.label !=  resnet.label 
  AND resnet.label == vit.label
  AND resnet.label = label_names.label
  AND resnet.score < 0.35
GROUP BY 1
ORDER BY 3

## Active Learning in Lance

With Lance and DuckDB, it is easy to build an active learning loop as well.

One typical active learning approach is to find the examples with the `Lowest Margin of Confidence`.

This query finds the examples where a model (*ResNet* in this case) is least confident in choosing between its top two candidates.

In [None]:
%%sql

WITH label_names AS (SELECT DISTINCT label, name FROM ds)

SELECT 
    id, 
    ds.label as gt_label, 
    ds.name as gt,
    n1.name as best_guess,
    n2.name as second_guess,
    resnet.score - resnet.second_score AS margin_of_confidence
FROM ds, label_names as n1, label_names as n2
WHERE 
    split != 'test'
    AND n1.label = resnet.label
    AND n2.label = resnet.second_label
ORDER BY margin_of_confidence
LIMIT 20
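
The margin of confidence is simply the top score minus the second-best score, and the smallest margins mark the examples most worth sending for review or relabeling. A minimal Python sketch of that ranking, assuming hypothetical toy rows and a hypothetical helper name `lowest_margin`:

```python
def lowest_margin(rows, model="resnet", k=20):
    """Return (id, margin) pairs for the k rows with the smallest top-2 margin."""
    scored = [
        (r["id"], r[model]["score"] - r[model]["second_score"])
        for r in rows
        if r["split"] != "test"
    ]
    # Smallest margin first: the model is least sure about these rows.
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical rows carrying the top two scores stored by the inference step.
rows = [
    {"id": 1, "split": "train", "resnet": {"score": 0.90, "second_score": 0.05}},
    {"id": 2, "split": "train", "resnet": {"score": 0.41, "second_score": 0.39}},
    {"id": 3, "split": "validation", "resnet": {"score": 0.55, "second_score": 0.30}},
    {"id": 4, "split": "test", "resnet": {"score": 0.50, "second_score": 0.49}},
]

to_review = [row_id for row_id, _ in lowest_margin(rows, k=2)]
```

Other uncertainty measures, such as least confidence or entropy, would only change the expression used for the sort key.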