<img src="static/images/datachain-logo.png" alt="Dataset" style="width: 200px;"/>

# 🎦 Wrangling Video Datasets with DataChain

Welcome to the `Wrangling Video Datasets with DataChain` tutorial, which covers managing and optimizing video datasets with DataChain. You will practice handling complex video data, from filtering to reducing redundancy.

📋 Topics covered:
1. **Creating and Versioning Datasets** for the `kinetics-700-2020` video dataset
    - Manage changes and maintain historical versions of your datasets.
2. **Adding Annotations (Signals)**
    - Enrich your data with meaningful attributes.
3. **Filtering & Sorting**
    - Refine your datasets to get exactly what you need.
4. **Updating Existing (Old) Datasets**
    - Add and update data files
    - Update annotations
    - Merge annotations (in different formats)
    - Remove duplicates
5. **Exploring and Visualizing Datasets**
    - Multi-modal annotations
    - Data
    - Predictions

# 🆕 Creating and Versioning Datasets

In [45]:
%load_ext autoreload
%autoreload 2

import os
import numpy as np
import pandas as pd

from datachain import DataChain, C

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Create a DataChain from an S3 bucket

In [46]:
# # Create a DataChain from a previously saved dataset

# ds = (
#     DataChain.from_dataset("fashion-product-images")
# )

# ds.show(3)

## Create a DataChain from a local directory of videos

**(OPTIONAL) You may skip this and work with data in our public dataset.**

- Download the `kinetics-700-2020` dataset
- Unzip the data into the `examples/kinetics_actions_video/data` directory


In [96]:
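# Create a DataChain from local storage and keep only the .mp4 files

DATA_PATH = "data/validation/"

video_dc = (
    # type='image' makes each row an ImageFile (see the schema preview later)
    DataChain.from_storage(DATA_PATH, type='image')
    .filter(C("file.path").glob("*.mp4"))
)
video_dc.show(3)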



| | file.source | file.path | file.size | file.version | file.etag | file.is_latest | file.last_modified | file.location | file.vtype |
|---|---|---|---|---|---|---|---|---|---|
| 0 | file:/// | Users/mikhailrozhkov/dev/datachain/datachain-e... | 899427 | | 0x1.9a3575893c73bp+30 | 1 | 2024-07-09 15:55:14.309035+00:00 | | |
| 1 | file:/// | Users/mikhailrozhkov/dev/datachain/datachain-e... | 987323 | | 0x1.9a35722f1c7aep+30 | 1 | 2024-07-09 15:51:39.777812+00:00 | | |
| 2 | file:/// | Users/mikhailrozhkov/dev/datachain/datachain-e... | 362415 | | 0x1.9a35759757980p+30 | 1 | 2024-07-09 15:55:17.835541+00:00 | | |



[Limited by 3 rows]


## Extract 'video_id' from file path

In [97]:
video_dc = video_dc.map(video_id=lambda file: file.name.split("_")[0])
video_dc.show(3)



| | file.source | file.path | file.size | file.version | file.etag | file.is_latest | file.last_modified | file.location | file.vtype | video_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | file:/// | Users/mikhailrozhkov/dev/datachain/datachain-e... | 899427 | | 0x1.9a3575893c73bp+30 | 1 | 2024-07-09 15:55:14.309035+00:00 | | | D9XYemqKjeI |
| 1 | file:/// | Users/mikhailrozhkov/dev/datachain/datachain-e... | 987323 | | 0x1.9a35722f1c7aep+30 | 1 | 2024-07-09 15:51:39.777812+00:00 | | | ejvWztFFVh4 |
| 2 | file:/// | Users/mikhailrozhkov/dev/datachain/datachain-e... | 362415 | | 0x1.9a35759757980p+30 | 1 | 2024-07-09 15:55:17.835541+00:00 | | | SJ8n2xiz3XQ |



[Limited by 3 rows]
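
The file names in this dataset appear to follow the pattern `<video_id>_<start>_<end>.mp4` (for example `Fjnz-MvwyqU_000379_000389.mp4`). YouTube identifiers can themselves contain underscores (the annotation file includes IDs such as `00_77WAvZ_o`), so splitting on the first underscore may truncate some IDs. Below is a minimal sketch of a more defensive extraction, assuming that naming pattern holds for every file. `extract_video_id` is a hypothetical helper, and the second file name is made up for illustration.

```python
from pathlib import PurePosixPath


def extract_video_id(path: str) -> str:
    """Return the video ID from a name like '<id>_<start>_<end>.mp4'."""
    stem = PurePosixPath(path).stem
    # Split from the right so underscores inside the ID are preserved
    video_id, _start, _end = stem.rsplit("_", 2)
    return video_id


# File name taken from the example shown later in this notebook
first_id = extract_video_id("data/validation/acting in play/Fjnz-MvwyqU_000379_000389.mp4")
# Hypothetical file name built around an ID that contains underscores
second_id = extract_video_id("00_77WAvZ_o_000017_000027.mp4")
# What splitting on the first underscore would return for the same name
naive_id = "00_77WAvZ_o_000017_000027.mp4".split("_")[0]

print(first_id)   # Fjnz-MvwyqU
print(second_id)  # 00_77WAvZ_o
print(naive_id)   # 00
```

A function like this could be applied to `file.name` inside the `.map()` call above in place of the lambda.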


# Adding Annotations

**DATA FORMAT:**

- Each row of a CSV file contains an annotation for one person performing an
action in an interval, where that annotation is associated with the middle
keyframe. Different persons and multiple action labels are described in separate
rows.

- The format of a row is the following: `video_id`, `keyframe_timestamp`,
`person_box`, `action_id`, `person_id`

- `video_id`: YouTube identifier
- `keyframe_timestamp`: in seconds from the start of the YouTube video
- `person_box`: top-left (x1, y1) and bottom-right (x2,y2) normalized with respect
  to frame size, where (0.0, 0.0) corresponds to the top left, and (1.0, 1.0)
  corresponds to the bottom right.
- `action_id`: identifier of an action class, see ava_action_list_v2.2.pbtxt
- `person_id`: a unique integer allowing this box to be linked to other boxes
  depicting the same person in adjacent frames of this video. The person id is
  optional--included in the AVA data but not in the Kinetics data.

In the training and validation files, rows containing only a
`video_id`, `keyframe_timestamp` pair (i.e. no box or action label) are included to
indicate an empty frame where no action was identified by labelers.

In the test files, boxes and action labels are omitted entirely; instead, the
`video_id`, `keyframe_timestamp` pairs indicate the video frames on which a
submission will be tested.

The full action label vocabulary is provided in
`ava_action_list_v2.2.pbtxt`.

For the ActivityNet challenge, and in many papers, results are reported on only
a subset of 60 actions listed in `ava_action_list_v2.2_for_activitynet.pbtxt`.
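
The row format described above can be sketched as a small parser. This is an illustration under stated assumptions rather than part of the original notebook: `Annotation` and `parse_row` are hypothetical names, the optional `person_id` is ignored because the Kinetics files do not include it, and the two sample rows are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    video_id: str
    keyframe_timestamp: float
    # (x1, y1, x2, y2), normalized to [0, 1]; None for an empty frame
    person_box: Optional[Tuple[float, float, float, float]] = None
    action_id: Optional[int] = None


def parse_row(fields: List[str]) -> Annotation:
    """Parse one already-split CSV row (hypothetical helper)."""
    video_id, timestamp = fields[0], float(fields[1])
    rest = fields[2:7]
    # Rows without a box or action label mark an empty frame
    if len(rest) < 5 or any(value == "" for value in rest):
        return Annotation(video_id, timestamp)
    x1, y1, x2, y2 = (float(value) for value in rest[:4])
    return Annotation(video_id, timestamp, (x1, y1, x2, y2), int(float(rest[4])))


labeled = parse_row("00_77WAvZ_o,17.0,0.308489,0.465223,0.530244,0.991838,12".split(","))
empty = parse_row("00g-inkLork,5.016949,,,,,".split(","))
print(labeled)
print(empty)
```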

In [49]:
# AVA annotations
action_list_path = "data/ava_kinetics_v1_0/ava_action_list_v2.2.pbtxt"
activitynet_labels_path = "data/ava_kinetics_v1_0/ava_action_list_v2.2_for_activitynet.pbtxt"
bbox_path = "data/ava_kinetics_v1_0/kinetics_val_v1.0.csv"

# DeepMind annotations
dm_labels_path = "data/deepmind/kinetics700_2020/validate.csv"

# List of labels
kinetics700_labels_path = "data/kinetics_700_labels.csv"

## Preprocess and save annotations to CSV

- Prepare the annotations in a pandas DataFrame
- Create a DataChain with the `.from_pandas()` method

In [50]:
# Read labels  

k700_df = pd.read_csv(kinetics700_labels_path)

print(k700_df.shape)
k700_df.head(3)

(700, 2)


| | id | name |
|---|---|---|
| 0 | 0 | abseiling |
| 1 | 1 | acting in play |
| 2 | 2 | adjusting glasses |


In [51]:
dm_labels_df = pd.read_csv(dm_labels_path)

print(dm_labels_df.shape)
dm_labels_df.head(3)

(33314, 5)


| | label | youtube_id | time_start | time_end | split |
|---|---|---|---|---|---|
| 0 | testifying | ---QUuC4vJs | 84 | 94 | validate |
| 1 | washing feet | --GkrdYZ9Tc | 0 | 10 | validate |
| 2 | air drumming | --nQbRBEz2s | 104 | 114 | validate |


In [52]:
# Attach the numeric class id to each DeepMind label row
k700_annotated_val = (
    dm_labels_df
    .merge(k700_df, how="left", left_on="label", right_on="name")
    .drop("name", axis=1)
    .rename(columns={"id": "class"})
    .reindex(columns=["youtube_id", "time_start", "time_end", "split", "class", "label"])
)
print(k700_annotated_val.shape)
k700_annotated_val.head(3)

(33314, 6)


| | youtube_id | time_start | time_end | split | class | label |
|---|---|---|---|---|---|---|
| 0 | ---QUuC4vJs | 84 | 94 | validate | 616 | testifying |
| 1 | --GkrdYZ9Tc | 0 | 10 | validate | 673 | washing feet |
| 2 | --nQbRBEz2s | 104 | 114 | validate | 3 | air drumming |
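
The left merge keeps every row of `dm_labels_df` and leaves `class` missing when a label has no match in `k700_df`. Below is a small standard-library sketch of that lookup logic, using label/id pairs that appear in the outputs above. `attach_class` is a hypothetical helper, and the third row is made up to show the unmatched case.

```python
from typing import Dict, List

# Label -> class id pairs taken from the outputs above
label_to_class: Dict[str, int] = {
    "abseiling": 0,
    "acting in play": 1,
    "adjusting glasses": 2,
    "air drumming": 3,
    "testifying": 616,
    "washing feet": 673,
}


def attach_class(rows: List[dict]) -> List[dict]:
    """Mimic a left join: keep every row, use None when the label is unknown."""
    return [{**row, "class": label_to_class.get(row["label"])} for row in rows]


rows = [
    {"youtube_id": "---QUuC4vJs", "label": "testifying"},
    {"youtube_id": "--GkrdYZ9Tc", "label": "washing feet"},
    {"youtube_id": "hypothetical", "label": "not a real label"},  # made-up row
]
annotated = attach_class(rows)
unmatched = [row["youtube_id"] for row in annotated if row["class"] is None]
print(annotated)
print(unmatched)
```

In pandas, a comparable check on the merged frame would be `k700_annotated_val["class"].isna().sum()`, which counts rows whose label found no class id.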


In [53]:
k700_annotated_val.to_csv("data/example_k700_annotations_val.csv", index=False)

## Create DataChain for `labels` using `.from_pandas()`

In [54]:
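# Create a DataChain from the saved annotations

relative_path = "data/example_k700_annotations_val.csv"
absolute_path = os.path.abspath(relative_path)

# Use a separate name so the labels table in `k700_df` is not overwritten
k700_annotated_df = pd.read_csv(absolute_path)
k700_dc = DataChain.from_pandas(k700_annotated_df)
k700_dc.show(3)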



| | youtube_id | time_start | time_end | split | class | label |
|---|---|---|---|---|---|---|
| 0 | ---QUuC4vJs | 84 | 94 | validate | 616 | testifying |
| 1 | --GkrdYZ9Tc | 0 | 10 | validate | 673 | washing feet |
| 2 | --nQbRBEz2s | 104 | 114 | validate | 3 | air drumming |



[Limited by 3 rows]


In [107]:
k700_dc.count()



33314

## Create DataChain for `bbox` annotations using `.from_pandas()`

In [109]:
bbox_df = pd.read_csv(bbox_path, header=None)

# If your CSV does not contain headers, you will need to manually assign them
bbox_df.columns = [
    "video_id", "person_box_x1", "person_box_y1", 
    "person_box_x2", "person_box_y2", "xy_ratio", "action_id"
]
print(f'Total entries: {bbox_df.shape}')
print(f'Unique video_ids: {bbox_df.video_id.nunique()}')


# DataChain 
bbox_dc = DataChain.from_pandas(bbox_df)
bbox_dc.show(3)

Total entries: (75555, 7)
Unique video_ids: 32028




| | video_id | person_box_x1 | person_box_y1 | person_box_x2 | person_box_y2 | xy_ratio | action_id |
|---|---|---|---|---|---|---|---|
| 0 | 00_77WAvZ_o | 17.0 | 0.308489 | 0.465223 | 0.530244 | 0.991838 | 12.0 |
| 1 | 00_77WAvZ_o | 17.0 | 0.308489 | 0.465223 | 0.530244 | 0.991838 | 17.0 |
| 2 | 00g-inkLork | 5.016949 | | | | | |



[Limited by 3 rows]
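
In the preview above, `person_box_x1` holds values such as `17.0` and `5.016949`, while the format description says box coordinates are normalized to the range 0.0 to 1.0. This suggests, though the notebook does not confirm it, that the second CSV column is the keyframe timestamp and that the assigned names are shifted by one position. Below is a minimal sketch of a sanity check that surfaces this kind of mismatch. `out_of_range_box_columns` and the `keyframe_timestamp`-based naming are assumptions made for illustration.

```python
from typing import Dict, List

BOX_COLUMNS = ["person_box_x1", "person_box_y1", "person_box_x2", "person_box_y2"]

# First preview row, in CSV column order
values = ["00_77WAvZ_o", 17.0, 0.308489, 0.465223, 0.530244, 0.991838, 12.0]

# Names as assigned in the notebook cell above
names_in_notebook = ["video_id", "person_box_x1", "person_box_y1",
                     "person_box_x2", "person_box_y2", "xy_ratio", "action_id"]
# Assumed alternative that follows the documented row format
names_per_format = ["video_id", "keyframe_timestamp", "person_box_x1",
                    "person_box_y1", "person_box_x2", "person_box_y2", "action_id"]


def out_of_range_box_columns(row: Dict[str, object]) -> List[str]:
    """Return the box columns whose value is not a normalized coordinate."""
    return [name for name in BOX_COLUMNS if not 0.0 <= float(row[name]) <= 1.0]


flagged_notebook = out_of_range_box_columns(dict(zip(names_in_notebook, values)))
flagged_format = out_of_range_box_columns(dict(zip(names_per_format, values)))
print(flagged_notebook)  # ['person_box_x1']
print(flagged_format)    # []
```

If the check confirms the shift on the full file, assigning the second list to `bbox_df.columns` would be one way to align the names with the format description. The tables later in this notebook use the original names.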


**Note:**
- Each video may contain several detected persons, and each person may have several action labels, so `bbox_dc` may contain multiple rows per `video_id`.
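
To get a feel for how many annotation rows a video has, you can count rows per `video_id`. Below is a minimal standard-library sketch that uses the three preview rows and the totals printed above. The variable names are illustrative only.

```python
from collections import Counter

# video_id values from the three preview rows
preview_video_ids = ["00_77WAvZ_o", "00_77WAvZ_o", "00g-inkLork"]
rows_per_video = Counter(preview_video_ids)
print(rows_per_video.most_common())

# Average rows per video, from the totals printed by the cell above
total_rows = 75555
unique_videos = 32028
average_rows = round(total_rows / unique_videos, 2)
print(average_rows)
```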

## Merge annotations

In [57]:
# Create Kinetics annotations chain

kinetics_dc = k700_dc.merge(bbox_dc, on="youtube_id", right_on="video_id")
kinetics_dc.show(3)



| | youtube_id | time_start | time_end | split | class | label | video_id | person_box_x1 | person_box_y1 | person_box_x2 | person_box_y2 | xy_ratio | action_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ---QUuC4vJs | 84 | 94 | validate | 616 | testifying | ---QUuC4vJs | 85.017267 | 0.0 | 0.0 | 0.210504 | 0.743298 | 11.0 |
| 1 | ---QUuC4vJs | 84 | 94 | validate | 616 | testifying | ---QUuC4vJs | 85.017267 | 0.098004 | 0.057044 | 0.886065 | 0.822089 | 11.0 |
| 2 | --GkrdYZ9Tc | 0 | 10 | validate | 673 | washing feet | --GkrdYZ9Tc | 9.0 | 0.087431 | 0.000478 | 0.624829 | 0.986173 | 11.0 |



[Limited by 3 rows]


In [58]:
kinetics_dc.count()



76044
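
The merged chain has more rows than either input, which is what a one-to-many join that keeps unmatched left rows would produce. The sketch below counts join output rows with the standard library only. It assumes `merge` behaves like a left outer join when `inner` is not set, which is an assumption about DataChain rather than something this notebook demonstrates. The keys are toy values and `join_row_count` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def join_row_count(left_keys: Iterable[str], right_keys: Iterable[str],
                   inner: bool = False) -> int:
    """Count the rows produced by joining two key columns."""
    right_counts = Counter(right_keys)
    total = 0
    for key in left_keys:
        matches = right_counts[key]
        if matches:
            # One output row per matching right row
            total += matches
        elif not inner:
            # A left outer join keeps the unmatched left row once
            total += 1
    return total


left = ["a", "b", "c"]
right = ["a", "a", "c", "c", "c", "z"]
outer_rows = join_row_count(left, right)
inner_rows = join_row_count(left, right, inner=True)
print(outer_rows)  # 6
print(inner_rows)  # 5
```

The inner variant corresponds to the `inner=True` flag used when the annotations are added to the video dataset, where rows without a match on both sides are dropped.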

## Add Annotations to Video Dataset

In [112]:
# Add annotations to Dataset

dc = (
    video_dc
    .merge(kinetics_dc, on="video_id", right_on="video_id", inner=True)
    .save("kinetics-700-val")
)

print(dc.count())
dc.show(3)


2040





Unnamed: 0_level_0,file,file,file,file,file,file,file,file,file,video_id,youtube_id,time_start,time_end,split,class,label,right_video_id,person_box_x1,person_box_y1,person_box_x2,person_box_y2,xy_ratio,action_id
Unnamed: 0_level_1,source,path,size,version,etag,is_latest,last_modified,location,vtype,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
0,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.246688,0.209394,0.367892,0.456894,12.0
1,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.246688,0.209394,0.367892,0.456894,59.0
2,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.246688,0.209394,0.367892,0.456894,80.0



[Limited by 3 rows]


In [101]:
dc.distinct('file.path').count()



1000

## Saving Annotated Datasets

DataChain supports dataset versioning. You can save a dataset as a new version with `.save()` and load a specific version by passing the `version` parameter to `DataChain.from_dataset()`.

In [60]:
# Save the annotated dataset (as a new version)

(
    dc
    .save("kinetics-700-val")
)



<datachain.lib.dc.DataChain at 0x1308b5fa0>

# Explore Dataset

In [117]:
dc = DataChain.from_dataset('kinetics-700-val')
print(dc.count())
dc.show(10)

2040


Unnamed: 0_level_0,file,file,file,file,file,file,file,file,file,video_id,youtube_id,time_start,time_end,split,class,label,right_video_id,person_box_x1,person_box_y1,person_box_x2,person_box_y2,xy_ratio,action_id
Unnamed: 0_level_1,source,path,size,version,etag,is_latest,last_modified,location,vtype,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
0,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.246688,0.209394,0.367892,0.456894,12.0
1,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.246688,0.209394,0.367892,0.456894,59.0
2,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.246688,0.209394,0.367892,0.456894,80.0
3,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.307272,0.256443,0.506152,0.988571,14.0
4,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.307272,0.256443,0.506152,0.988571,17.0
5,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.307272,0.256443,0.506152,0.988571,64.0
6,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,D9XYemqKjeI,17,27,validate,602,sword fighting,D9XYemqKjeI,22.0,0.307272,0.256443,0.506152,0.988571,80.0
7,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,987323,,0x1.9a35722f1c7aep+30,1,2024-07-09 15:51:39.777812+00:00,,,ejvWztFFVh4,ejvWztFFVh4,24,34,validate,602,sword fighting,ejvWztFFVh4,27.024,0.07497,0.433476,0.162426,0.720531,12.0
8,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,987323,,0x1.9a35722f1c7aep+30,1,2024-07-09 15:51:39.777812+00:00,,,ejvWztFFVh4,ejvWztFFVh4,24,34,validate,602,sword fighting,ejvWztFFVh4,27.024,0.07497,0.433476,0.162426,0.720531,17.0
9,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,987323,,0x1.9a35722f1c7aep+30,1,2024-07-09 15:51:39.777812+00:00,,,ejvWztFFVh4,ejvWztFFVh4,24,34,validate,602,sword fighting,ejvWztFFVh4,27.024,0.07497,0.433476,0.162426,0.720531,73.0



[Limited by 10 rows]


## Preview a file example

In [114]:
sample_results = list(dc.distinct('file.path').limit(10).collect())
example = sample_results[1]
example

(ImageFile(source='file:///', path='Users/mikhailrozhkov/dev/datachain/datachain-examples/computer_vision/kinetics_actions_video/data/validation/acting in play/Fjnz-MvwyqU_000379_000389.mp4', size=471363, version='', etag='0x1.9a356f49c789ep+30', is_latest=True, last_modified=datetime.datetime(2024, 7, 9, 15, 48, 34, 444862, tzinfo=datetime.timezone.utc), location=None, vtype=''),
 'Fjnz-MvwyqU',
 'Fjnz-MvwyqU',
 379,
 389,
 'validate',
 1,
 'acting in play',
 'Fjnz-MvwyqU',
 380.0,
 0.271116,
 0.178639,
 0.355623,
 0.776753,
 12.0)
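
The tuple returned by `collect()` is positional, and its values appear to follow the column order of the table above. Below is a minimal sketch that pairs the values with signal names so they can be accessed by name. `SIGNAL_NAMES` and `record` are illustrative names, and a placeholder string stands in for the `ImageFile` object.

```python
SIGNAL_NAMES = [
    "file", "video_id", "youtube_id", "time_start", "time_end", "split",
    "class", "label", "right_video_id", "person_box_x1", "person_box_y1",
    "person_box_x2", "person_box_y2", "xy_ratio", "action_id",
]

# Values copied from the example output above
example_values = (
    "<placeholder for the ImageFile object>",
    "Fjnz-MvwyqU", "Fjnz-MvwyqU", 379, 389, "validate", 1, "acting in play",
    "Fjnz-MvwyqU", 380.0, 0.271116, 0.178639, 0.355623, 0.776753, 12.0,
)

# Pair each value with its signal name
record = dict(zip(SIGNAL_NAMES, example_values))
clip_length = record["time_end"] - record["time_start"]
print(record["label"], record["time_start"], record["time_end"], clip_length)
```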

In [115]:
from IPython.display import Video

video_path = example[0].get_path()
Video(video_path, width=640, height=360, embed=True)

## Preview the schema

In [116]:
# Signals schema

dc.schema

{'file': datachain.lib.file.ImageFile,
 'video_id': str,
 'youtube_id': str,
 'time_start': int,
 'time_end': int,
 'split': str,
 'class': int,
 'label': str,
 'right_video_id': str,
 'person_box_x1': float,
 'person_box_y1': float,
 'person_box_x2': float,
 'person_box_y2': float,
 'xy_ratio': float,
 'action_id': float}

In [119]:
# File signals schema

dc.schema['file'].schema()

{'description': '`DataModel` for reading image files.',
 'properties': {'source': {'default': '', 'title': 'Source', 'type': 'string'},
  'path': {'title': 'Path', 'type': 'string'},
  'size': {'default': 0, 'title': 'Size', 'type': 'integer'},
  'version': {'default': '', 'title': 'Version', 'type': 'string'},
  'etag': {'default': '', 'title': 'Etag', 'type': 'string'},
  'is_latest': {'default': True, 'title': 'Is Latest', 'type': 'boolean'},
  'last_modified': {'default': '1970-01-01T00:00:00Z',
   'format': 'date-time',
   'title': 'Last Modified',
   'type': 'string'},
  'location': {'anyOf': [{'type': 'object'},
    {'items': {'type': 'object'}, 'type': 'array'},
    {'type': 'null'}],
   'default': None,
   'title': 'Location'},
  'vtype': {'default': '', 'title': 'Vtype', 'type': 'string'}},
 'required': ['path'],
 'title': 'ImageFile',
 'type': 'object'}

In [120]:
# Explore with Pandas

df = dc.to_pandas(flatten=True)
df.head()

Unnamed: 0,file.source,file.path,file.size,file.version,file.etag,file.is_latest,file.last_modified,file.location,file.vtype,video_id,...,split,class,label,right_video_id,person_box_x1,person_box_y1,person_box_x2,person_box_y2,xy_ratio,action_id
0,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,...,validate,602,sword fighting,D9XYemqKjeI,22.0,0.246688,0.209394,0.367892,0.456894,12.0
1,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,...,validate,602,sword fighting,D9XYemqKjeI,22.0,0.246688,0.209394,0.367892,0.456894,59.0
2,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,...,validate,602,sword fighting,D9XYemqKjeI,22.0,0.246688,0.209394,0.367892,0.456894,80.0
3,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,...,validate,602,sword fighting,D9XYemqKjeI,22.0,0.307272,0.256443,0.506152,0.988571,14.0
4,file:///,Users/mikhailrozhkov/dev/datachain/datachain-e...,899427,,0x1.9a3575893c73bp+30,1,2024-07-09 15:55:14.309035+00:00,,,D9XYemqKjeI,...,validate,602,sword fighting,D9XYemqKjeI,22.0,0.307272,0.256443,0.506152,0.988571,17.0


In [121]:
df['label'].value_counts()

label
celebrating              45
cheerleading             33
capoeira                 31
dancing gangnam style    27
tossing coin             27
                         ..
digging                   1
blasting sand             1
crying                    1
making cheese             1
blowing bubble gum        1
Name: count, Length: 493, dtype: int64