# Dataset Analysis

Before training our models, we need to understand the quality of our dataset.

## Prerequisites

```sh
python -m venv ~/.venv/lance
source ~/.venv/lance/bin/activate
pip install pylance duckdb
```

## COCO Dataset

In this example, we analyze the [COCO dataset](https://cocodataset.org/#home), an object detection dataset.

```python
import lance
import duckdb
import pyarrow as pa

dataset = lance.dataset(
    "s3://eto-public/datasets/coco/coco.lance",
)
dataset.schema
```


```text
license: int64
file_name: string
coco_url: extension<image[uri]<ImageUriType>>
height: int64
width: int64
date_captured: timestamp[ns]
flickr_url: extension<image[uri]<ImageUriType>>
image_id: int64
split: dictionary<values=string, indices=int8, ordered=0>
image_uri: extension<image[uri]<ImageUriType>>
annotations: struct<segmentation: list<item: struct<counts: list<item: int32>, polygon: list<item: list<item: float>>, size: list<item: int32>>>, area: list<item: double>, iscrowd: list<item: bool>, bbox: list<item: fixed_size_list<item: float>[4]>, category_id: list<item: int16>, id: list<item: int64>, supercategory: list<item: string>, name: list<item: string>>
  child 0, segmentation: list<item: struct<counts: list<item: int32>, polygon: list<item: list<item: float>>, size: list<item: int32>>>
      child 0, item: struct<counts: list<item: int32>, polygon: list<item: list<item: float>>, size: list<item: int32>>
          child 0, counts: list<item: int32>
              child 0, item: 
```

### Understand Label Distributions

```python
# Label distribution in the training set
duckdb.query("""
SELECT count(1) AS cnt, name
FROM (
    SELECT UNNEST(annotations.name) AS name FROM dataset
    WHERE split = 'train'
) GROUP BY 2
""").to_df()
```

However, `DuckDB` does not currently support projection pushdown for nested fields, so it cannot read only the `annotations.name` column.
As a workaround, we can use the Lance / PyArrow scanner to read just the `annotations.name` and `split` columns.

```python
# scan = dataset.scanner(columns=["annotations.name", "split"])

# duckdb.query("""
# SELECT count(1) as cnt, name
# FROM (
#     SELECT UNNEST(annotations.name) AS name FROM scan
#     WHERE split = 'train'
# ) GROUP BY 2
# """).to_df()
```

### Calculate Label Distribution Among Splits

```python
duckdb.query("""
SELECT
    count(1) AS cnt, class, split
FROM (SELECT UNNEST(annotations.name) AS class, split FROM dataset)
GROUP BY 3, 2 ORDER BY class, split
""").to_df()
```

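|     | cnt  | class      | split |
|-----|------|------------|-------|
| 0   | 5135 | airplane   | train |
| 1   | 143  | airplane   | val   |
| 2   | 5851 | apple      | train |
| 3   | 239  | apple      | val   |
| 4   | 8720 | backpack   | train |
| ... | ...  | ...        | ...   |
| 155 | 277  | vase       | val   |
| 156 | 7913 | wine glass | train |
| 157 | 343  | wine glass | val   |
| 158 | 5303 | zebra      | train |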
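
One way to read this result is to check whether each class takes up a similar share of the two splits. The sketch below does that in plain Python for the three classes whose `train` and `val` counts are both visible in the truncated output. The `val_share` helper is a hypothetical name used only for illustration; on the full result the same arithmetic could be done in SQL or pandas.

```python
# (class, split) -> count, copied from the rows visible in the truncated output
counts = {
    ("airplane", "train"): 5135, ("airplane", "val"): 143,
    ("apple", "train"): 5851, ("apple", "val"): 239,
    ("wine glass", "train"): 7913, ("wine glass", "val"): 343,
}


def val_share(counts):
    """Fraction of each class's annotations that fall in the val split."""
    shares = {}
    for cls in sorted({cls for cls, _ in counts}):
        train = counts.get((cls, "train"), 0)
        val = counts.get((cls, "val"), 0)
        total = train + val
        # Guard against classes with no annotations at all
        shares[cls] = val / total if total else float("nan")
    return shares


shares = val_share(counts)
for cls, share in shares.items():
    print(f"{cls}: {share:.1%} of annotations are in val")
```

A class whose share differs sharply from the rest may be worth a closer look, since it could be under-represented in validation.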
