# 🚀 Getting Started with DataChain

<img src="static/images/datachain-overview.png" alt="DataChain Overview" style="width: 600px;"/>

DataChain is a tool for managing datasets and ML workflows. This tutorial explores how **[DataChain](https://github.com/iterative/datachain)** helps Computer Vision projects:
- 🗂️ **Manage and version datasets and annotations** effectively.
- 🔍 **Handle large-scale operations**, applying complex filters and transformations to millions of data entries.
- ⏰ **Save valuable time and resources** by avoiding redundant computations for previously processed samples.
- 🌊 **Directly stream curated data into PyTorch**, eliminating the need for intermediate resharing.

## 📋 Agenda

- 🖼️ Create a `fashion-product-images` dataset from an image directory
- 📂 Load the dataset
- 🔍 Explore filtering techniques
  
## 🛠 Prerequisites

Before you begin, ensure you have:
- **[DataChain](https://github.com/iterative/datachain)** installed in your environment (follow the instructions in `examples/fashion-product-images/README.md`)
- The necessary dependencies installed, including PyTorch and the required libraries (see `requirements.txt`)

## Imports

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd

# Import datachain
from datachain import C, DataChain


# 🆕 Create a DataChain

There are multiple ways to create a DataChain and persist it as a dataset. You can load data with the `DataChain` class:

- From cloud storage (AWS S3, GCP, Azure...) or a local directory
- From a previously saved dataset version
- From values

Here are a few examples:

```python
# From cloud storage such as S3, GCS, or Azure:
DataChain.from_storage("s3://my-bucket/my-dir/")

# From a previously saved dataset:
DataChain.from_dataset("name", version=1)

# From values:
DataChain.from_values(fib=[1, 2, 3, 5, 8])
```

Data in DataChain is represented as Python classes with an arbitrary set of fields,
including nested classes. These data classes have to inherit from the `DataModel` class, which is based on `Pydantic`.
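
To make the idea of nested data classes concrete, here is a minimal sketch that uses only the Python standard library. It is not DataChain's `DataModel`: `FileRef`, `ProductRecord`, and `get_field` are hypothetical names chosen for illustration, and plain `dataclasses` stand in for Pydantic models. The sketch shows how a dotted name such as `file.name` can address a field inside a nested class.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class FileRef:
    # Hypothetical stand-in for a file pointer: location only, no file bytes.
    source: str
    parent: str
    name: str
    size: int


@dataclass
class ProductRecord:
    # A record nests another class as one of its fields.
    file: FileRef
    gender: str
    mastercategory: str


def get_field(record, dotted_name):
    """Resolve a dotted field name such as 'file.name' on a nested record."""
    return reduce(getattr, dotted_name.split("."), record)


record = ProductRecord(
    file=FileRef(
        source="gs://datachain-demo",
        parent="fashion-product-images/images",
        name="10000.jpg",
        size=1030,
    ),
    gender="Women",
    mastercategory="Apparel",
)

print(get_field(record, "file.name"))       # 10000.jpg
print(get_field(record, "mastercategory"))  # Apparel
```

The same dotted naming appears later in this tutorial, for example in `C("file.name")`.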

<img src="static/images/dataset-1.png" alt="Dataset" style="width: 600px;"/>

**Note:** DataChain represents file samples as pointers to their respective storage locations. This means a newly created dataset version does not duplicate files in storage, and storage remains the single source of truth for the original samples.

## Create a DataChain from a GCP bucket

In [3]:
# Create a DataChain

dc = (
    DataChain.from_storage(
        "gs://datachain-demo/fashion-product-images", type="image", anon=True
    )
    .filter(C("file.name").glob("*.jpg"))
    .save()
)


Processed: 44441 rows [00:02, 21752.28 rows/s]
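
The `glob("*.jpg")` filter in the cell above keeps only rows whose file name matches a shell-style pattern. As a rough illustration of that matching rule, the sketch below uses Python's standard `fnmatch` module rather than DataChain itself, and the list of file names is made up. Case sensitivity here is a property of `fnmatchcase`, not a statement about DataChain's behavior.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, as they might appear in a storage listing.
names = ["10000.jpg", "10001.jpg", "styles.csv", "styles_clean.csv", "notes.JPG"]

# fnmatchcase is case-sensitive on every platform, so "notes.JPG" is excluded.
jpg_only = [name for name in names if fnmatchcase(name, "*.jpg")]

print(jpg_only)  # ['10000.jpg', '10001.jpg']
```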


## Create a DataChain from a local directory of images

**(OPTIONAL) You may skip this and work with data in our public dataset.**

You may create a DataChain from a directory if images are stored locally. Download data from Kaggle to follow the example. 

**Manually**
- Download the [Fashion Product Images (Small) dataset](https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small/data) from Kaggle.com, contributed by Param Aggarwal.
- Unzip the data into the `data` directory in `examples/fashion-product-images`

**Using the script below:**
1. Obtain your Kaggle credentials file (`kaggle.json`) and save it to the `~/.kaggle` directory so that it's available at `~/.kaggle/kaggle.json`.
2. Download the desired dataset from Kaggle.
3. Unzip the downloaded data into the `data` directory.

In [3]:
# Prepare credentials
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json

## Download data
# !pip install -q kaggle
# !kaggle datasets download -d paramaggarwal/fashion-product-images-small

## Unzip files
# !unzip fashion-product-images-small.zip "images/*" -d data
# !unzip fashion-product-images-small.zip "styles.csv" -d data

## (optional) Remove the redundant directory from the source data
# ![ -d "data/myntradataset" ] && rm -r "data/myntradataset"

In [4]:
# Create a DataChain

# DATA_PATH = "data/images"

# dc = (
#     DataChain.from_storage(DATA_PATH, type="image")
#     .filter(C("file.name").glob("*.jpg"))
# )

## Preview DataChain content

In [4]:
# Preview with `.show()`

dc.show(3)

Unnamed: 0_level_0,file,file,file,file,file,file,file,file,file,file
Unnamed: 0_level_1,source,parent,name,size,version,etag,is_latest,last_modified,location,vtype
0,gs://datachain-demo,fashion-product-images/images,10000.jpg,1030,1719489653370876,CPzf74/e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
1,gs://datachain-demo,fashion-product-images/images,10001.jpg,1210,1719489640006438,CKaGwIne+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
2,gs://datachain-demo,fashion-product-images/images,10002.jpg,807,1719489670015780,CKTW55fe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,



[Limited by 3 rows]


In [5]:
# Preview with Pandas

df = dc.to_pandas()

print(df.shape)
df.head()

(44439, 10)


Unnamed: 0_level_0,file,file,file,file,file,file,file,file,file,file
Unnamed: 0_level_1,source,parent,name,size,version,etag,is_latest,last_modified,location,vtype
0,gs://datachain-demo,fashion-product-images/images,10000.jpg,1030,1719489653370876,CPzf74/e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
1,gs://datachain-demo,fashion-product-images/images,10001.jpg,1210,1719489640006438,CKaGwIne+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
2,gs://datachain-demo,fashion-product-images/images,10002.jpg,807,1719489670015780,CKTW55fe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
3,gs://datachain-demo,fashion-product-images/images,10003.jpg,11564,1719489683599343,CO/fpJ7e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
4,gs://datachain-demo,fashion-product-images/images,10004.jpg,20647,1719489733765952,CMDWmrbe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,


# 🏷️ Add Metadata

In DataChain, you can add annotations and attributes to files.  In the following steps, you'll add metadata from a CSV file. Here's how you can do it:
1. Load/prepare annotations
2. Define a mapping function or UDF
3. Apply the function to generate new columns
4. Save an annotated dataset

<img src="static/images/dataset-2.png" alt="Dataset" style="width: 600px;"/>

## Load metadata from CSV in GCP

With DataChain, you can create a chain from a single CSV, JSON, or Parquet file, or parse multiple files at once.
The example below shows how to parse a single CSV file of metadata using the `from_csv()` method.

In [6]:
# Load metadata from CSV
dc_meta = (
    DataChain.from_csv("gs://datachain-demo/fashion-product-images/styles_clean.csv")
    .select_except("source") # remove technical columns
    .save()
)

dc_meta.show(3)

Processed: 1 rows [00:00, 1773.49 rows/s]
Processed: 1 rows [00:00, 1018.28 rows/s]
Parsed by pyarrow: 44446 rows [00:04, 10919.62 rows/s]
Processed: 1 rows [00:00,  1.02 rows/s]
Generated: 44446 rows [00:00, 45382.93 rows/s]


Unnamed: 0,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
0,Men,Apparel,Topwear,Shirts,Navy Blue,Fall,2011.0,Casual,Turtle Check Men Navy Blue Shirt,15970.jpg
1,Men,Apparel,Bottomwear,Jeans,Blue,Summer,2012.0,Casual,Peter England Men Party Blue Jeans,39386.jpg
2,Women,Accessories,Watches,Watches,Silver,Winter,2016.0,Casual,Titan Women Silver Watch,59263.jpg



[Limited by 3 rows]


## Load metadata from a local CSV file

**(OPTIONAL) You may skip this and work with data in our public dataset.**

- In this example, you load the metadata from a CSV file and prepare the annotations
- Use the image `filename` to map each image file to its corresponding metadata
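
The preparation steps above (fill missing values, then derive a `filename` key from `id`) can be sketched without pandas. This is only an illustration: `SAMPLE_CSV` is a made-up sample in the shape of `styles.csv`, and `prepare_annotations` is a hypothetical helper, not part of DataChain.

```python
import csv
import io

# Illustrative sample in the shape of styles.csv (not the real file).
SAMPLE_CSV = """id,gender,basecolour,season
15970,Men,Navy Blue,Fall
39386,Men,Blue,
59263,Women,,Winter
"""


def prepare_annotations(csv_text):
    """Replace missing values with '' and swap 'id' for a 'filename' key."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Empty cells are read as ''; short rows yield None. Normalise both to ''.
        clean = {key: (value or "") for key, value in row.items()}
        # Build the image file name from the numeric id, then drop the id.
        clean["filename"] = clean.pop("id") + ".jpg"
        rows.append(clean)
    return rows


annotations = prepare_annotations(SAMPLE_CSV)
print(annotations[0]["filename"])  # 15970.jpg
```

The resulting `filename` values are what a later join against `file.name` relies on.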

In [8]:
# # Load Annotations from 'data/styles.csv'

# ANNOTATIONS_PATH = "data/styles.csv"

# annotations = pd.read_csv(
#     ANNOTATIONS_PATH,
#     usecols=["id", "gender", "mastercategory", "subcategory", "articletype", "basecolour", "season", "year", "usage", "productdisplayname"],
# )

# annotations.head(3)

In [9]:
# # Preprocess columns
# annotations["basecolour"] = annotations["basecolour"].fillna('')
# annotations["season"] = annotations["season"].fillna('')
# annotations["usage"] = annotations["usage"].fillna('')
# annotations["productdisplayname"] = annotations["productdisplayname"].fillna('')

# # Add 'filename' column for each image
# annotations["filename"] = annotations["id"].apply(lambda s: str(s) + ".jpg")
# annotations = annotations.drop("id", axis=1)

In [10]:
# # Create a metadata DataChain: `from_pandas()` generates a chain from a Pandas DataFrame

# dc_meta = DataChain.from_pandas(annotations)
# dc_meta.show(3)

## Merge the original image and metadata datachains

- The `merge` method merges two chains based on the specified criteria
- Parameters:
  - `right_ds`: Chain to join with.
  - `on`: Predicate or list of Predicates to join on. If both chains have the same predicates then this predicate is enough for the join. Otherwise, `right_on` parameter has to specify the predicates for the other chain.
  - `right_on`: Optional predicate or list of Predicates for the `right_ds` to join.
  - `inner`: Whether to run inner join or outer join. Default is False.
  - `rname`: name prefix for conflicting signal names. Default: "{name}_right"
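
To illustrate what the `inner` flag changes, here is a minimal pure-Python sketch of a key-based join over lists of dicts. It is not DataChain's implementation: `merge_rows` is a hypothetical helper, the sample rows are invented, and it assumes that `inner=False` keeps unmatched rows from the left chain (a left outer join).

```python
def merge_rows(left, right, on, right_on, inner=False):
    """Join two lists of dict rows; a sketch of outer vs inner joins."""
    # Index the right-hand rows by their join key for fast lookups.
    index = {}
    for row in right:
        index.setdefault(row[right_on], []).append(row)

    merged = []
    for row in left:
        matches = index.get(row[on], [])
        if matches:
            merged.extend({**row, **match} for match in matches)
        elif not inner:
            # Outer (left) join: keep the unmatched left row as-is.
            merged.append(dict(row))
    return merged


images = [{"name": "10000.jpg"}, {"name": "10001.jpg"}, {"name": "99999.jpg"}]
meta = [
    {"filename": "10000.jpg", "articletype": "Skirts"},
    {"filename": "10001.jpg", "articletype": "Skirts"},
]

outer = merge_rows(images, meta, on="name", right_on="filename")
inner = merge_rows(images, meta, on="name", right_on="filename", inner=True)

print(len(outer))  # 3: the unmatched image is kept
print(len(inner))  # 2: the unmatched image is dropped
```

With an outer join, images that have no metadata row survive the merge, which is useful when you want to spot missing annotations rather than silently lose samples.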

In [7]:
dc_annotated = dc.merge(dc_meta, on="file.name", right_on="filename")
dc_annotated.show(3)

Unnamed: 0_level_0,file,file,file,file,file,file,file,file,file,file,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
Unnamed: 0_level_1,source,parent,name,size,version,etag,is_latest,last_modified,location,vtype,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
0,gs://datachain-demo,fashion-product-images/images,10000.jpg,1030,1719489653370876,CPzf74/e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,White,Summer,2011.0,Casual,Palm Tree Girls Sp Jace Sko White Skirts,10000.jpg
1,gs://datachain-demo,fashion-product-images/images,10001.jpg,1210,1719489640006438,CKaGwIne+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,Blue,Summer,2011.0,Casual,Palm Tree Kids Girls Sp Jema Skt Blue Skirts,10001.jpg
2,gs://datachain-demo,fashion-product-images/images,10002.jpg,807,1719489670015780,CKTW55fe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,Blue,Summer,2011.0,Casual,Palm Tree Kids Sp Jema Skt Blue Skirts,10002.jpg



[Limited by 3 rows]


# 💾 Save Dataset

Saving datasets in DataChain allows you to:

- Persist the dataset and its metadata for future use
- Version the dataset to track changes over time
- Share the dataset with others in your team or organization
- Easily load the dataset in other DataChain workflows or notebooks

By saving the annotated dataset, you ensure the metadata is stored alongside the image data, making it convenient to access and use the enriched dataset in your DataChain projects.

To save the annotated dataset in DataChain, call the `.save()` method on the `dc_annotated` object.

<img src="static/images/dataset-3.png" alt="Dataset" style="width: 600px;"/>

In [8]:
dc_annotated.save("fashion-product-images")

<datachain.lib.dc.DataChain at 0x14fb88380>

This line of code saves the `dc_annotated` chain as a new dataset named "fashion-product-images" in DataChain.

The `.save()` method takes the name of the dataset as a parameter and creates a new dataset with that name in DataChain. The saved dataset includes all the data and metadata from the original dataset, as well as the metadata signals merged in from the CSV file.

After executing this code, you will have a new dataset named "fashion-product-images" in your DataChain workspace, which contains the annotated image data. You can later load this dataset using `DataChain.from_dataset("fashion-product-images")` to access the annotated data in your DataChain workflows.

# 🔍 Explore Data

The dataset contains metadata about the images. We can view this metadata in two ways: 
- using `.show()` 
- using the `.to_pandas()` method to review as a Pandas DataFrame 

In [9]:
# Load Image Catalog

dc = DataChain.from_dataset(name="fashion-product-images")

This line creates a DataChain object named `dc` that refers to the previously saved dataset named `fashion-product-images`.

## Use DataChain API `show()`

In [10]:
dc.show(3)

Unnamed: 0_level_0,file,file,file,file,file,file,file,file,file,file,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
Unnamed: 0_level_1,source,parent,name,size,version,etag,is_latest,last_modified,location,vtype,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
0,gs://datachain-demo,fashion-product-images/images,10000.jpg,1030,1719489653370876,CPzf74/e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,White,Summer,2011.0,Casual,Palm Tree Girls Sp Jace Sko White Skirts,10000.jpg
1,gs://datachain-demo,fashion-product-images/images,10001.jpg,1210,1719489640006438,CKaGwIne+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,Blue,Summer,2011.0,Casual,Palm Tree Kids Girls Sp Jema Skt Blue Skirts,10001.jpg
2,gs://datachain-demo,fashion-product-images/images,10002.jpg,807,1719489670015780,CKTW55fe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,Blue,Summer,2011.0,Casual,Palm Tree Kids Sp Jema Skt Blue Skirts,10002.jpg



[Limited by 3 rows]


## Convert to Pandas DataFrame

The next cell converts the DataChain dataset (`dc`) into a pandas DataFrame (`df`), making it easier to explore the data using familiar pandas functionality.
- For example, you can review the distribution of values in the `mastercategory` and `subcategory` columns

In [11]:
df = dc.to_pandas()

print(df.shape)
df.head(3)

(44439, 20)


Unnamed: 0_level_0,file,file,file,file,file,file,file,file,file,file,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
Unnamed: 0_level_1,source,parent,name,size,version,etag,is_latest,last_modified,location,vtype,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
0,gs://datachain-demo,fashion-product-images/images,10000.jpg,1030,1719489653370876,CPzf74/e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,White,Summer,2011.0,Casual,Palm Tree Girls Sp Jace Sko White Skirts,10000.jpg
1,gs://datachain-demo,fashion-product-images/images,10001.jpg,1210,1719489640006438,CKaGwIne+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,Blue,Summer,2011.0,Casual,Palm Tree Kids Girls Sp Jema Skt Blue Skirts,10001.jpg
2,gs://datachain-demo,fashion-product-images/images,10002.jpg,807,1719489670015780,CKTW55fe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,Blue,Summer,2011.0,Casual,Palm Tree Kids Sp Jema Skt Blue Skirts,10002.jpg


In [16]:
print(df.mastercategory.value_counts())
print(df.subcategory.value_counts())

mastercategory
Apparel           21395
Accessories       11288
Footwear           9221
Personal Care      2404
Free Items          105
Sporting Goods       25
Home                  1
Name: count, dtype: int64
subcategory
Topwear                     15401
Shoes                        7344
Bags                         3055
Bottomwear                   2693
Watches                      2542
Innerwear                    1808
Jewellery                    1080
Eyewear                      1073
Fragrance                    1012
Sandal                        962
Wallets                       933
Flip Flops                    915
Belts                         811
Socks                         698
Lips                          527
Dress                         478
Loungewear and Nightwear      470
Saree                         427
Nails                         329
Makeup                        307
Headwear                      293
Ties                          258
Accessories                   1

This code snippet demonstrates how to leverage DataChain to load and get a basic understanding of your dataset using `pandas`.

**Note**: DataChain offers functionalities beyond pandas conversion. Explore the documentation for more advanced data manipulation techniques!

# 🕵️‍♀️ Filtering Data

DataChain allows you to filter the dataset based on specific conditions.
- The `.filter()` method applies query expressions to columns
- Use a `C` object to refer to a dataset column by name, like `C("NAME")` (e.g. `C("mastercategory")`)

## Show only images with `Apparel` category

In [12]:
(
    dc.filter(C("mastercategory") == "Apparel").show(3)
)


Unnamed: 0_level_0,file,file,file,file,file,file,file,file,file,file,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
Unnamed: 0_level_1,source,parent,name,size,version,etag,is_latest,last_modified,location,vtype,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
0,gs://datachain-demo,fashion-product-images/images,10000.jpg,1030,1719489653370876,CPzf74/e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,White,Summer,2011.0,Casual,Palm Tree Girls Sp Jace Sko White Skirts,10000.jpg
1,gs://datachain-demo,fashion-product-images/images,10001.jpg,1210,1719489640006438,CKaGwIne+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,Blue,Summer,2011.0,Casual,Palm Tree Kids Girls Sp Jema Skt Blue Skirts,10001.jpg
2,gs://datachain-demo,fashion-product-images/images,10002.jpg,807,1719489670015780,CKTW55fe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Bottomwear,Skirts,Blue,Summer,2011.0,Casual,Palm Tree Kids Sp Jema Skt Blue Skirts,10002.jpg



[Limited by 3 rows]


## Show only `Topwear` products

In [13]:
(
    dc.filter(C("subcategory") == "Topwear").show(3)
)

Unnamed: 0_level_0,file,file,file,file,file,file,file,file,file,file,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
Unnamed: 0_level_1,source,parent,name,size,version,etag,is_latest,last_modified,location,vtype,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
0,gs://datachain-demo,fashion-product-images/images,10003.jpg,11564,1719489683599343,CO/fpJ7e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Women,Apparel,Topwear,Tshirts,White,Fall,2011.0,Sports,Nike Women As Nike Eleme White T-Shirt,10003.jpg
1,gs://datachain-demo,fashion-product-images/images,10005.jpg,16677,1719489744441124,CKSeprve+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Men,Apparel,Topwear,Tshirts,Blue,Fall,2011.0,Sports,Nike Men As Ss Trainin Blue T-Shirts,10005.jpg
2,gs://datachain-demo,fashion-product-images/images,10006.jpg,2146,1719489713329967,CK+uu6ze+4YDEAE=,1,1970-01-01 00:00:00+00:00,,,Men,Apparel,Topwear,Tshirts,Black,Fall,2011.0,Sports,Nike Men AS T90 Black Tshirts,10006.jpg



[Limited by 3 rows]


## Chain multiple filters together

Show only `Topwear` apparel products for the `Summer` season.

In [14]:
(
    dc
    .filter(C("mastercategory") == "Apparel")
    .filter(C("subcategory") == "Topwear")
    .filter(C("season") == "Summer")
    .to_pandas().shape
    # .show(3)
 )

(8830, 20)

You can also use a single filter with multiple expressions joined by logical operators such as `&` (AND) and `|` (OR).
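
To show why each comparison needs its own parentheses, and why chained filters match a single `&` expression, here is a minimal sketch of operator-overloaded predicates. `Col` and `Pred` are hypothetical classes written for this illustration. They are not DataChain's `C` object, and the sample rows are invented.

```python
class Pred:
    """A row predicate that can be combined with & (AND) and | (OR)."""

    def __init__(self, fn):
        self.fn = fn

    def __and__(self, other):
        return Pred(lambda row: self.fn(row) and other.fn(row))

    def __or__(self, other):
        return Pred(lambda row: self.fn(row) or other.fn(row))

    def __call__(self, row):
        return self.fn(row)


class Col:
    """Hypothetical column reference that builds predicates over dict rows."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Comparing a column to a value yields a predicate, not a bool.
        return Pred(lambda row: row[self.name] == value)


rows = [
    {"mastercategory": "Apparel", "subcategory": "Topwear", "season": "Summer"},
    {"mastercategory": "Apparel", "subcategory": "Topwear", "season": "Fall"},
    {"mastercategory": "Footwear", "subcategory": "Shoes", "season": "Summer"},
]

# Parentheses are required: & and | bind more tightly than == in Python.
both = (Col("subcategory") == "Topwear") & (Col("season") == "Summer")
either = (Col("subcategory") == "Topwear") | (Col("season") == "Summer")

# Chaining two filters keeps the same rows as one filter joined with &.
chained = [r for r in rows if (Col("subcategory") == "Topwear")(r)]
chained = [r for r in chained if (Col("season") == "Summer")(r)]

print(len([r for r in rows if both(r)]))        # 1
print(len([r for r in rows if either(r)]))      # 3
print(chained == [r for r in rows if both(r)])  # True
```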

In [19]:
(
    dc
    .filter((C("mastercategory") == "Apparel") & (C("subcategory") == "Topwear") & (C("season") == "Summer"))
    .to_pandas().shape
    # .show(3)
 )

(8830, 20)

## Save Dataset

Let's save the filtered data as "fashion-topwear" to make it versioned and reusable.

In [21]:
(
    DataChain.from_dataset(name="fashion-product-images")
    .filter((C("mastercategory") == "Apparel") & (C("subcategory") == "Topwear"))
    .save("fashion-topwear")
    .to_pandas()
    .shape
)


(15401, 20)

# ☁️ Run in Studio (SaaS)

<a href="https://datachain.ai/">
    <img src="static/images/studio.png" alt="DataChain Studio SaaS" style="width: 600px;"/>
</a>

To run these examples in Studio, follow these steps:

1. Open Studio / YOUR_TEAM / `datasets` workspace
2. Create a new Python Script
3. Copy and paste the script from `scripts/1-quick-start.py`
4. Click the Run button


# 🎉 Summary 

👏 **Congratulations on completing this tutorial! You're a DataChain superstar! 🌟** You've taken the first steps in harnessing the power of DataChain for your computer vision projects. In this tutorial, we covered:
- Creating the `fashion-product-images` dataset from existing images
- Filtering the dataset based on specific conditions
- Essential DataChain methods:
    - `.show()` for displaying dataset samples
    - `.to_pandas()` for converting datasets to Pandas DataFrames
    - `.filter()` for applying custom filters to datasets
    - `.merge()` for attaching metadata to images

But this is just the beginning! DataChain offers many features for streamlining your ML workflows, including data transformations, versioning, and much more. 🚀

## What's Next?

Excited to learn more? Check out the next parts of our tutorial series:
- 📂 Saving and Versioning Datasets 
- 🧩 Splitting Datasets for Training, Validation, and Testing
- 🎨 Generating and Managing Embeddings
- 🔍 Performing Similarity Search
- 🧹 Finding and Removing Redundant Images
- 🧠 Training Models
- 🔮 Running Inference and Saving Predictions
- 📊 Analyzing Predictions

By mastering these techniques, you'll be well on your way to building powerful and efficient computer vision pipelines with DataChain.

## 🤝 Get Involved

We'd love to have you join our growing community of DataChain users and contributors! Here's how you can get involved:

- ⭐ Give us a star on [GitHub](https://github.com/iterative/datachain) to show your support
- 🌐 Visit the [datachain.ai website](https://datachain.ai/) to learn more about our products and services
- 📞 Contact us to discuss scaling 🚀 DataChain for your project!
- 🙌 Follow us on [LinkedIn](https://www.linkedin.com/company/dvc-ai/) and [Twitter](https://x.com/DVCorg) for the latest updates and insights

Thanks for choosing DataChain, and happy coding! 😄