# Dataset Analysis

Before training our models, we need to understand the quality of our dataset.

## Prerequisites

```sh
python -m venv ~/.venv/lance
source ~/.venv/lance/bin/activate
pip install pylance duckdb
```

## COCO Dataset

In this example, we analyze the [COCO dataset](https://cocodataset.org/#home), an object detection dataset.

In [1]:
import lance
import duckdb
import pyarrow as pa

dataset = lance.dataset(
    "s3://eto-public/datasets/coco/coco.lance",
)
dataset.schema

### Understand Label Distributions

In [None]:
# Label distribution in training set

duckdb.query("""
SELECT count(1) as cnt, name 
FROM (
    SELECT UNNEST(annotations.name) AS name FROM dataset
    WHERE split = 'train'
) GROUP BY 2
""").to_df()

However, DuckDB does not currently support projection pushdown for nested fields, so it cannot read only the `annotations.name` column.
As a workaround, we can use the Lance / PyArrow scanner to read just the `annotations.name` and `split` columns and query the scanner instead.

In [None]:
# scan = dataset.scanner(columns=["annotations.name", "split"])

# duckdb.query("""
# SELECT count(1) as cnt, name 
# FROM (
#     SELECT UNNEST(annotations.name) AS name FROM scan
#     WHERE split = 'train'
# ) GROUP BY 2
# """).to_df()

### Calculate Label Distribution Across Splits

In [None]:
duckdb.query("""
SELECT
    count(1) as cnt, class, split
FROM (SELECT UNNEST(annotations.name) AS class, split FROM dataset)
GROUP BY 3, 2 ORDER BY class, split
""").to_df()