# Dataset Analysis

Before training our models, we need to understand the quality of our dataset.

## Prerequisites

```sh
# Create and activate a virtual environment, then install Lance and DuckDB
python -m venv ~/.venv/lance
source ~/.venv/lance/bin/activate
pip install pylance duckdb
```

## COCO Dataset

In this example, we analyze the [COCO dataset](https://cocodataset.org/#home), an object detection dataset.

In [1]:
import lance
import duckdb
import pyarrow as pa

dataset = lance.dataset(
    "s3://eto-public/datasets/coco/coco.lance",
)
dataset.schema


license: int64
file_name: string
coco_url: extension<image[uri]<ImageUriType>>
height: int64
width: int64
date_captured: timestamp[ns]
flickr_url: extension<image[uri]<ImageUriType>>
image_id: int64
split: dictionary<values=string, indices=int8, ordered=0>
image_uri: extension<image[uri]<ImageUriType>>
annotations: struct<segmentation: list<item: struct<counts: list<item: int32>, polygon: list<item: list<item: float>>, size: list<item: int32>>>, area: list<item: double>, iscrowd: list<item: bool>, bbox: list<item: fixed_size_list<item: float>[4]>, category_id: list<item: int16>, id: list<item: int64>, supercategory: list<item: string>, name: list<item: string>>
  child 0, segmentation: list<item: struct<counts: list<item: int32>, polygon: list<item: list<item: float>>, size: list<item: int32>>>
      child 0, item: struct<counts: list<item: int32>, polygon: list<item: list<item: float>>, size: list<item: int32>>
          child 0, counts: list<item: int32>
              child 0, item: 

### Understand Label Distributions

In [5]:
%%time
# Label distribution in training set

duckdb.query("""
SELECT count(1) as cnt, name 
FROM (
    SELECT UNNEST(annotations.name) AS name FROM dataset
    WHERE split = 'train'
) GROUP BY 2
""").to_df()

CPU times: user 6.54 s, sys: 4.41 s, total: 11 s
Wall time: 4min 29s


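|     | cnt   | name         |
|-----|-------|--------------|
| 0   | 5508  | dog          |
| 1   | 8652  | potted plant |
| 2   | 5805  | tv           |
| 3   | 10806 | bird         |
| 4   | 2918  | hot dog      |
| ... | ...   | ...          |
| 75  | 1481  | scissors     |
| 76  | 1983  | stop sign    |
| 77  | 2262  | mouse        |
| 78  | 198   | hair drier   |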
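
To make the query's semantics concrete, the sketch below reproduces the `UNNEST` plus `GROUP BY` logic in plain Python on a few hand-made rows. The rows and the helper name `label_distribution` are hypothetical. They only mimic the `split` and `annotations.name` fields of the schema above, and this is not how DuckDB executes the query.

```python
from collections import Counter

# Hypothetical toy rows: one dict per image, mirroring `split` and `annotations.name`.
rows = [
    {"split": "train", "annotations": {"name": ["dog", "person", "person"]}},
    {"split": "train", "annotations": {"name": ["dog"]}},
    {"split": "val", "annotations": {"name": ["person", "car"]}},
]


def label_distribution(rows, split):
    """Count how often each label occurs across the images of one split."""
    counts = Counter()
    for row in rows:
        if row["split"] == split:  # WHERE split = 'train'
            # UNNEST(annotations.name): every list element becomes one counted row
            counts.update(row["annotations"]["name"])
    return counts


train_counts = label_distribution(rows, "train")
print(train_counts.most_common())
```

Each image contributes one count per annotation, so an image with two `person` boxes adds two to the `person` total, which is the same behavior as the unnested SQL query.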


However, `DuckDB` does not currently support projection pushdown for nested fields, so the query above cannot read just the `annotations.name` column.
As a workaround, we can use `Lance`'s PyArrow Scanner integration to read only the `annotations.name` and `split` columns.

In [6]:
%%time
scan = dataset.scanner(columns=["annotations.name", "split"])

duckdb.query("""
SELECT count(1) as cnt, name 
FROM (
    SELECT UNNEST(annotations.name) AS name FROM scan
    WHERE split = 'train'
) GROUP BY 2
""").to_df()

CPU times: user 1.18 s, sys: 753 ms, total: 1.93 s
Wall time: 50.6 s


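|     | cnt    | name         |
|-----|--------|--------------|
| 0   | 7113   | bicycle      |
| 1   | 43867  | car          |
| 2   | 15714  | dining table |
| 3   | 262465 | person       |
| 4   | 8725   | motorcycle   |
| ... | ...    | ...          |
| 75  | 4373   | sandwich     |
| 76  | 225    | toaster      |
| 77  | 2262   | mouse        |
| 78  | 6126   | surfboard    |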


### Calculate Label Distribution Among Splits

In [4]:
duckdb.query("""
SELECT
    count(1) as cnt, class, split
FROM (SELECT UNNEST(annotations.name) AS class, split FROM scan)
GROUP BY 3, 2 ORDER BY class, split
""").to_df()

InvalidInputException: Invalid Input Error: arrow_scan: get_next failed(): Invalid: OneShotFragment was already scanned
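
The error message suggests that `scan` had already been consumed by an earlier query: the scanner appears to behave like a one-shot stream that can be read only once. Assuming that is the cause, a likely remedy is to call `dataset.scanner(columns=["annotations.name", "split"])` again right before each query, so that every query receives a fresh scanner. The sketch below illustrates the failure mode and the fix pattern in plain Python, with a generator standing in for the scanner. `make_scan`, `count_by_class_and_split`, and the toy rows are hypothetical and do not use the Lance or DuckDB APIs.

```python
from collections import Counter

# Hypothetical toy data standing in for the (annotations.name, split) columns.
ROWS = [
    (["dog", "person"], "train"),
    (["person"], "train"),
    (["car", "person"], "val"),
]


def make_scan():
    """Return a fresh one-shot stream of rows (stand-in for creating a new scanner)."""
    return (row for row in ROWS)


def count_by_class_and_split(scan):
    """Rough equivalent of: SELECT count(1), class, split ... GROUP BY split, class."""
    counts = Counter()
    for names, split in scan:
        for name in names:  # UNNEST(annotations.name)
            counts[(name, split)] += 1
    return sorted(counts.items())  # ORDER BY class, split


scan = make_scan()
first = count_by_class_and_split(scan)          # consumes the stream
second = count_by_class_and_split(scan)         # stream is exhausted: no rows left
fresh = count_by_class_and_split(make_scan())   # a new stream gives the full result

print(first)
print(second)
print(fresh)
```

One difference from the notebook: an exhausted Python generator silently yields nothing, whereas the scanner raises the `OneShotFragment was already scanned` error shown above. In both cases, creating the stream anew per query avoids the problem.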