# 🚀 Getting Started with DataChain

<img src="static/images/datachain-overview.png" alt="DataChain Overview" style="width: 600px;"/>

DataChain is a powerful tool for managing datasets and ML workflows. This tutorial explores how **DataChain** helps Computer Vision projects:
- 🗂️ **Manage and version datasets and annotations** effectively.
- 🔍 **Handle large-scale operations**, applying complex filters and transformations to millions of data entries.
- ⏰ **Save valuable time and resources** by avoiding redundant computations for previously processed samples.
- 🌊 **Directly stream curated data into PyTorch**, eliminating the need for intermediate resharing.

## 📋 Agenda

- 🖼️ Create a `fashion-product-images` dataset from an image directory
- 📂 Load the dataset
- 🔍 Explore filtering techniques
  
## 🛠 Prerequisites

Before you begin, ensure you have:
- ⚙️ DataChain installed in your environment (follow the instructions in `examples/fashion-product-images/README.md`)

## Imports

In [1]:
%load_ext autoreload
%autoreload 2

# Import datachain 
from datachain.lib.dc import DataChain, C

import pandas as pd

# 🆕 Create a DataChain

There are multiple ways to create a DataChain and persist it as a dataset. After importing the necessary modules, you can load data with the `DataChain` class:

- From cloud storage (AWS S3, GCS, Azure, ...) or a local directory
- From a previously saved dataset version
- From values

Here are a few examples: 

```python
# from cloud storage such as S3, GCS, or Azure:
DataChain.from_storage("s3://my-bucket/my-dir/")

# from a previously saved dataset:
DataChain.from_dataset("name", version=1)

# from values:
DataChain.from_features(fib=[1, 2, 3, 5, 8])
```

Data in DataChain is represented as Python classes with an arbitrary set of fields,
including nested classes. Data classes must inherit from the `Feature` class, which is based on Pydantic.

<img src="static/images/dataset-1.png" alt="Dataset" style="width: 600px;"/>

**Note:** DataChain represents file samples as pointers to their respective storage locations. This means a newly created dataset version does not duplicate files in storage, and storage remains the single source of truth for the original samples.
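The dotted column names in the previews of this tutorial (for example `file.name` or `file.size`) follow from this nested structure. The sketch below illustrates the idea with standard-library dataclasses so that it runs without extra dependencies; DataChain's own classes are Pydantic-based, and the `File`, `ImageSample`, and `flatten` names here are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class File:
    # Hypothetical subset of the file fields shown in the previews.
    # The record holds a pointer (source + path), not the image bytes.
    source: str
    parent: str
    name: str
    size: int


@dataclass
class ImageSample:
    # A nested record: a file pointer plus one annotation field.
    file: File
    gender: str


def flatten(record, prefix=""):
    """Flatten a nested dataclass into {dotted_column_name: value}."""
    row = {}
    for f in fields(record):
        value = getattr(record, f.name)
        key = f"{prefix}{f.name}"
        if is_dataclass(value):
            # Recurse into the nested class and extend the prefix.
            row.update(flatten(value, prefix=key + "."))
        else:
            row[key] = value
    return row


sample = ImageSample(
    file=File(
        source="gs://datachain-demo",
        parent="fashion-product-images/images",
        name="10000.jpg",
        size=1030,
    ),
    gender="Unisex",
)
row = flatten(sample)
print(list(row))
# ['file.source', 'file.parent', 'file.name', 'file.size', 'gender']
```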

## Create a DataChain from a GCP bucket

In [2]:
# Create a DataChain

ds = (
    DataChain.from_storage("gs://datachain-demo/fashion-product-images", type="image")
    .filter(C("file.name").glob("*.jpg"))
    .save()
)
ds.show(3)

Processed: 44441 rows [00:02, 19862.67 rows/s]


Unnamed: 0,id,random,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,...,file.source,file.parent,file.name,file.size,file.version,file.etag,file.is_latest,file.last_modified,file.location,file.vtype
0,1,1555367085137818095,,0,fashion-product-images/images,10000.jpg,CPzf74/e+4YDEAE=,1719489653370876,1,2024-06-27 12:00:53.421000+00:00,...,gs://datachain-demo,fashion-product-images/images,10000.jpg,1030,1719489653370876,CPzf74/e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
1,2,2127561607577938352,,0,fashion-product-images/images,10001.jpg,CKaGwIne+4YDEAE=,1719489640006438,1,2024-06-27 12:00:40.056000+00:00,...,gs://datachain-demo,fashion-product-images/images,10001.jpg,1210,1719489640006438,CKaGwIne+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
2,3,5441151968502213098,,0,fashion-product-images/images,10002.jpg,CKTW55fe+4YDEAE=,1719489670015780,1,2024-06-27 12:01:10.067000+00:00,...,gs://datachain-demo,fashion-product-images/images,10002.jpg,807,1719489670015780,CKTW55fe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,


[limited by 3 objects]
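The `glob("*.jpg")` filter in the cell above keeps rows whose file name matches a shell-style pattern. As a rough, plain-Python illustration of that kind of matching, the sketch below uses `fnmatch.fnmatchcase` from the standard library. The file names are made up, and the case-sensitive behaviour is an assumption of this sketch, since the tutorial does not state how DataChain treats case.

```python
from fnmatch import fnmatchcase

# Hypothetical file names; only the first two end in a lowercase ".jpg".
names = ["10000.jpg", "10001.jpg", "styles.csv", "cover.JPG", "notes.jpg.txt"]

# The pattern must match the whole name, so "notes.jpg.txt" is rejected.
jpg_only = [name for name in names if fnmatchcase(name, "*.jpg")]
print(jpg_only)
# ['10000.jpg', '10001.jpg']
```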


## Create a DataChain from a local directory of images

**(OPTIONAL) You may skip this and work with data in our public dataset.**

You may create a DataChain from a directory if images are stored locally. Download data from Kaggle to follow the example. 

**Manually**
- Download the [Fashion Product Images (Small) dataset](https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small/data) from Kaggle, contributed by Param Aggarwal.
- Unzip the data into the `data` directory in `examples/fashion-product-images`.

**Using the script below:**
1. Obtain your Kaggle credentials file (`kaggle.json`) and save it to the `~/.kaggle` directory so that it is available at `~/.kaggle/kaggle.json`.
2. Download the dataset from Kaggle.
3. Unzip the downloaded data into the `data` directory.

In [3]:
## Prepare credentials 
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json

## Download data 
# !pip install -q kaggle
# !kaggle datasets download -d paramaggarwal/fashion-product-images-small

## Unzip files into the `data` directory
# !unzip fashion-product-images-small.zip "images/*" -d data
# !unzip fashion-product-images-small.zip "styles.csv" -d data

## (optional) Remove the redundant directory from the source data
# ![ -d "data/myntradataset" ] && rm -r "data/myntradataset"
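If the `unzip` command is not available (for example on Windows), the same selective extraction can be sketched with Python's standard `zipfile` module. The helper name `extract_subset` and its `wanted` parameter are hypothetical; the archive name and the `images/` and `styles.csv` members are taken from the commands above.

```python
import zipfile
from pathlib import Path


def extract_subset(zip_path, dest_dir, wanted=("images/", "styles.csv")):
    """Extract only archive members whose path starts with one of `wanted`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip directory entries and anything outside the wanted prefixes,
            # such as the duplicated "myntradataset/" tree.
            if member.startswith(tuple(wanted)) and not member.endswith("/"):
                archive.extract(member, dest)
                extracted.append(member)
    return extracted


# Example call (commented out, like the shell commands above):
# extract_subset("fashion-product-images-small.zip", "data")
```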

In [4]:
# Create a DataChain

# DATA_PATH = "data/images"

# ds = (
#     DataChain.from_storage(DATA_PATH, type="image")
#     .filter(C("file.name").glob("*.jpg"))
# )

## Preview DataChain content

In [5]:
# Preview with `.show()`

ds.show(3)

Unnamed: 0,id,random,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,...,file.source,file.parent,file.name,file.size,file.version,file.etag,file.is_latest,file.last_modified,file.location,file.vtype
0,1,1555367085137818095,,0,fashion-product-images/images,10000.jpg,CPzf74/e+4YDEAE=,1719489653370876,1,2024-06-27 12:00:53.421000+00:00,...,gs://datachain-demo,fashion-product-images/images,10000.jpg,1030,1719489653370876,CPzf74/e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
1,2,2127561607577938352,,0,fashion-product-images/images,10001.jpg,CKaGwIne+4YDEAE=,1719489640006438,1,2024-06-27 12:00:40.056000+00:00,...,gs://datachain-demo,fashion-product-images/images,10001.jpg,1210,1719489640006438,CKaGwIne+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
2,3,5441151968502213098,,0,fashion-product-images/images,10002.jpg,CKTW55fe+4YDEAE=,1719489670015780,1,2024-06-27 12:01:10.067000+00:00,...,gs://datachain-demo,fashion-product-images/images,10002.jpg,807,1719489670015780,CKTW55fe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,


[limited by 3 objects]


In [6]:
# Preview with Pandas

df = ds.to_pandas()

print(df.shape)
df.head()

(44439, 25)


Unnamed: 0,id,random,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,...,file.source,file.parent,file.name,file.size,file.version,file.etag,file.is_latest,file.last_modified,file.location,file.vtype
0,1,1555367085137818095,,0,fashion-product-images/images,10000.jpg,CPzf74/e+4YDEAE=,1719489653370876,1,2024-06-27 12:00:53.421000+00:00,...,gs://datachain-demo,fashion-product-images/images,10000.jpg,1030,1719489653370876,CPzf74/e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
1,2,2127561607577938352,,0,fashion-product-images/images,10001.jpg,CKaGwIne+4YDEAE=,1719489640006438,1,2024-06-27 12:00:40.056000+00:00,...,gs://datachain-demo,fashion-product-images/images,10001.jpg,1210,1719489640006438,CKaGwIne+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
2,3,5441151968502213098,,0,fashion-product-images/images,10002.jpg,CKTW55fe+4YDEAE=,1719489670015780,1,2024-06-27 12:01:10.067000+00:00,...,gs://datachain-demo,fashion-product-images/images,10002.jpg,807,1719489670015780,CKTW55fe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
3,4,3810480416941945053,,0,fashion-product-images/images,10003.jpg,CO/fpJ7e+4YDEAE=,1719489683599343,1,2024-06-27 12:01:23.651000+00:00,...,gs://datachain-demo,fashion-product-images/images,10003.jpg,11564,1719489683599343,CO/fpJ7e+4YDEAE=,1,1970-01-01 00:00:00+00:00,,
4,5,674154948235391147,,0,fashion-product-images/images,10004.jpg,CMDWmrbe+4YDEAE=,1719489733765952,1,2024-06-27 12:02:13.818000+00:00,...,gs://datachain-demo,fashion-product-images/images,10004.jpg,20647,1719489733765952,CMDWmrbe+4YDEAE=,1,1970-01-01 00:00:00+00:00,,


# 🏷️ Add Metadata

In DataChain, you can add annotations and attributes to files. In the following steps, you'll add metadata from a CSV file:
1. Load/prepare annotations
2. Define a mapping function or UDF
3. Apply the function to generate new columns
4. Save an annotated dataset

<img src="static/images/dataset-2.png" alt="Dataset" style="width: 600px;"/>

## Load metadata from CSV in GCP

With DataChain, you can create a chain from a single CSV, JSON, or Parquet file, or parse multiple files at once. The example below parses a single CSV file of metadata with the `parse_csv()` method.

In [7]:
# Load metadata from CSV 
ds_meta = (
    DataChain.from_storage("gs://datachain-demo/fashion-product-images/styles_clean.csv")
    .parse_csv()
    .save()
)

ds_meta.show(3)

Processed: 1 rows [00:00, 1488.40 rows/s]
Processed: 1 rows [00:00, 582.06 rows/s]
Processed: 1 rows [00:04,  4.21s/ rows]
Generated: 44446 rows [00:04, 10564.21 rows/s]


Unnamed: 0,id,random,source.file.source,source.file.parent,source.file.name,source.file.size,source.file.version,source.file.etag,source.file.is_latest,source.file.last_modified,...,c0,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname
0,1,-1558329545615551488,gs://datachain-demo,fashion-product-images,styles_clean.csv,4675018,1719830629903847,COfbk67UhYcDEAE=,1,1970-01-01 00:00:00+00:00,...,12904,Men,Apparel,Topwear,Tshirts,Blue,Summer,2011.0,Sports,Nike Sahara Team India Fanwear Round Neck Jersey
1,2,-8342757993656255458,gs://datachain-demo,fashion-product-images,styles_clean.csv,4675018,1719830629903847,COfbk67UhYcDEAE=,1,1970-01-01 00:00:00+00:00,...,12627,Men,Apparel,Topwear,Tshirts,Blue,Winter,2015.0,Sports,Nike Men Blue T20 Indian Cricket Jersey
2,3,-1778859411961393851,gs://datachain-demo,fashion-product-images,styles_clean.csv,4675018,1719830629903847,COfbk67UhYcDEAE=,1,1970-01-01 00:00:00+00:00,...,16357,Men,Apparel,Topwear,Tshirts,Blue,Summer,2013.0,Sports,Nike Mean Team India Cricket Jersey


[limited by 3 objects]


In [8]:
# Add a "filename" column to map each image file to its corresponding metadata

ds_meta = ds_meta.map(filename=lambda c0: str(c0) + '.jpg', output=str)
ds_meta.show(3)

Processed: 44446 rows [00:01, 24529.86 rows/s]


Unnamed: 0,id,random,source.file.source,source.file.parent,source.file.name,source.file.size,source.file.version,source.file.etag,source.file.is_latest,source.file.last_modified,...,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
0,1,-1558329545615551488,gs://datachain-demo,fashion-product-images,styles_clean.csv,4675018,1719830629903847,COfbk67UhYcDEAE=,1,1970-01-01 00:00:00+00:00,...,Men,Apparel,Topwear,Tshirts,Blue,Summer,2011.0,Sports,Nike Sahara Team India Fanwear Round Neck Jersey,12904.jpg
1,2,-8342757993656255458,gs://datachain-demo,fashion-product-images,styles_clean.csv,4675018,1719830629903847,COfbk67UhYcDEAE=,1,1970-01-01 00:00:00+00:00,...,Men,Apparel,Topwear,Tshirts,Blue,Winter,2015.0,Sports,Nike Men Blue T20 Indian Cricket Jersey,12627.jpg
2,3,-1778859411961393851,gs://datachain-demo,fashion-product-images,styles_clean.csv,4675018,1719830629903847,COfbk67UhYcDEAE=,1,1970-01-01 00:00:00+00:00,...,Men,Apparel,Topwear,Tshirts,Blue,Summer,2013.0,Sports,Nike Mean Team India Cricket Jersey,16357.jpg


[limited by 3 objects]
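The mapping above works because `c0` holds integer IDs. If an ID column were parsed as a float (the `year` column in the preview shows values such as `2011.0`), plain `str()` concatenation would produce names that no image matches. A defensive variant might look like the sketch below; `to_filename` is a hypothetical helper, not part of DataChain.

```python
def to_filename(image_id) -> str:
    """Build '<id>.jpg' from an ID that may arrive as int, float, or str."""
    return f"{int(float(image_id))}.jpg"


print(to_filename(12904))     # 12904.jpg
print(to_filename(12904.0))   # 12904.jpg
print(str(12904.0) + ".jpg")  # 12904.0.jpg (the naive version breaks on floats)
```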


## Load metadata from a local CSV file

**(OPTIONAL) You may skip this and work with data in our public dataset.**

- In this example, you load the metadata from a local CSV file and prepare the annotations.
- Use the image `filename` to map each image file to its corresponding metadata.

In [9]:
# # Load Annotations from 'data/styles.csv'

# ANNOTATIONS_PATH = "data/styles.csv"

# annotations = pd.read_csv(
#     ANNOTATIONS_PATH,
#     usecols=["id", "gender", "masterCategory", "subCategory", "articleType", "baseColour", "season", "year", "usage", "productDisplayName"],
# )

# annotations.head(3)

In [10]:
# # Preprocess columns
# annotations["baseColour"] = annotations["baseColour"].fillna('')
# annotations["season"] = annotations["season"].fillna('')
# annotations["usage"] = annotations["usage"].fillna('')
# annotations["productDisplayName"] = annotations["productDisplayName"].fillna('')

# # Add 'filename' column for each image
# annotations["filename"] = annotations["id"].apply(lambda s: str(s) + ".jpg")
# annotations = annotations.drop("id", axis=1)

In [11]:
# # Create a metadata DataChain from the pandas DataFrame with `from_pandas()`

# ds_meta = DataChain.from_pandas(annotations)
# ds_meta.show(3)

## Merge the original image and metadata datachains

- The `merge` method merges two chains based on the specified criteria
- Parameters:
  - `right_ds`: Chain to join with.
  - `on`: Predicate or list of predicates to join on. If both chains use the same predicates, `on` alone is enough for the join. Otherwise, `right_on` must specify the predicates for the other chain.
  - `right_on`: Optional predicate or list of predicates for the `right_ds` to join on.
  - `inner`: Whether to run an inner join instead of an outer join. Default: `False`.
  - `rname`: Name prefix for conflicting signal names. Default: `"{name}_right"`.
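To make the join parameters concrete, here is a small plain-Python sketch of joining two lists of rows on differently named keys. It is a toy model, not DataChain's implementation: `merge_rows` is a hypothetical function, `inner=False` is modeled as a left outer join (every row of the left side is kept), which is an assumption, and conflict renaming via `rname` is left out.

```python
def merge_rows(left, right, on, right_on=None, inner=False):
    """Join two lists of dict rows; unmatched left rows are kept unless inner=True."""
    right_on = right_on or on

    # Index the right side by its join key for fast lookups.
    index = {}
    for right_row in right:
        index.setdefault(right_row[right_on], []).append(right_row)

    merged = []
    for left_row in left:
        matches = index.get(left_row[on], [])
        if matches:
            for right_row in matches:
                merged.append({**left_row, **right_row})
        elif not inner:
            # No metadata for this row: keep it with the left columns only.
            merged.append(dict(left_row))
    return merged


images = [{"name": "10000.jpg"}, {"name": "10001.jpg"}, {"name": "99999.jpg"}]
meta = [
    {"filename": "10000.jpg", "gender": "Unisex"},
    {"filename": "10001.jpg", "gender": "Men"},
    {"filename": "55555.jpg", "gender": "Women"},
]

outer = merge_rows(images, meta, on="name", right_on="filename")
inner = merge_rows(images, meta, on="name", right_on="filename", inner=True)
print(len(outer), len(inner))
# 3 2
```

In this toy data, `99999.jpg` has no metadata, so it survives only in the non-inner result, while the unmatched metadata row `55555.jpg` is dropped in both cases.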

In [12]:
ds_annotated = ds.merge(ds_meta, on="name", right_on="filename")
ds_annotated.show(3)

Processed: 44446 rows [00:01, 25949.11 rows/s]


Unnamed: 0,id,random,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,...,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
0,1,1555367085137818095,,0,fashion-product-images/images,10000.jpg,CPzf74/e+4YDEAE=,1719489653370876,1,2024-06-27 12:00:53.421000+00:00,...,Unisex,Footwear,Flip Flops,Flip Flops,Navy Blue,Winter,2012.0,Casual,Disney Unisex Kids Basic Navy Blue Flip Flops,10000.jpg
1,2,2127561607577938352,,0,fashion-product-images/images,10001.jpg,CKaGwIne+4YDEAE=,1719489640006438,1,2024-06-27 12:00:40.056000+00:00,...,Men,Footwear,Shoes,Casual Shoes,Black,Summer,2013.0,Casual,Clarks Men Black Leather Loafers,10001.jpg
2,3,5441151968502213098,,0,fashion-product-images/images,10002.jpg,CKTW55fe+4YDEAE=,1719489670015780,1,2024-06-27 12:01:10.067000+00:00,...,Unisex,Accessories,Socks,Socks,White,Summer,2012.0,Casual,ADIDAS Unisex White Pack of 3 Socks,10002.jpg


[limited by 3 objects]


# 💾 Save Dataset

Saving datasets in DataChain allows you to:

- Persist the dataset and its metadata for future use
- Version the dataset to track changes over time
- Share the dataset with others in your team or organization
- Easily load the dataset in other DataChain workflows or notebooks

By saving the annotated dataset, you ensure the metadata is stored alongside the image data, making it convenient to access and use the enriched dataset in your DataChain projects.

To save the annotated dataset, call the `.save()` method on the `ds_annotated` chain.

<img src="static/images/dataset-3.png" alt="Dataset" style="width: 600px;"/>

In [13]:
ds_annotated.save("fashion-product-images")

Processed: 44446 rows [00:01, 25216.53 rows/s]


<datachain.lib.dc.DataChain at 0x16985d390>

This line of code saves the `ds_annotated` dataset as a new dataset named "fashion-product-images" in DataChain.

The `.save()` method takes the name of the dataset as a parameter and creates a new dataset with that name in DataChain. The saved dataset includes all the data and metadata from the original dataset, as well as the metadata columns merged in from the CSV file.

After executing this code, you will have a new dataset named "fashion-product-images" in your DataChain workspace, which contains the annotated image data. You can later load this dataset using `DataChain.from_dataset("fashion-product-images")` to access the annotated data in your DataChain workflows.
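As a mental model for saving and versioning, the toy registry below stores a new snapshot every time rows are saved under a name and lets you load either the latest or a specific version. This is only an illustration: `DatasetRegistry` is hypothetical, and the assumptions that version numbers start at 1 and increase by one on each save are made for this sketch rather than taken from DataChain's documentation.

```python
class DatasetRegistry:
    """Toy in-memory registry: each save under a name adds a new version."""

    def __init__(self):
        self._versions = {}  # dataset name -> list of row snapshots

    def save(self, name, rows):
        snapshots = self._versions.setdefault(name, [])
        snapshots.append(list(rows))  # copy so later edits do not leak in
        return len(snapshots)  # assumed: versions are numbered from 1

    def load(self, name, version=None):
        snapshots = self._versions[name]
        if version is None:
            return snapshots[-1]  # latest version by default
        if not 1 <= version <= len(snapshots):
            raise KeyError(f"{name} has no version {version}")
        return snapshots[version - 1]


registry = DatasetRegistry()
v1 = registry.save("fashion-product-images", [{"name": "10000.jpg"}])
v2 = registry.save(
    "fashion-product-images",
    [{"name": "10000.jpg"}, {"name": "10001.jpg"}],
)
print(v1, v2)
# 1 2
print(len(registry.load("fashion-product-images", version=1)))
# 1
```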

# 🔍 Explore Data

The dataset contains metadata about the images. We can view this metadata in two ways: 
- using the `.show()` method
- using the `.to_pandas()` method to review the data as a pandas DataFrame

In [14]:
# Load Image Catalog

ds = DataChain.from_dataset(name="fashion-product-images")

This line creates a DataChain object named `ds` that refers to the previously saved dataset named `fashion-product-images`.

## Use DataChain API `show()`

In [15]:
ds.show(3)

Unnamed: 0,id,random,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,...,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
0,1,1555367085137818095,,0,fashion-product-images/images,10000.jpg,CPzf74/e+4YDEAE=,1719489653370876,1,2024-06-27 12:00:53.421000+00:00,...,Unisex,Footwear,Flip Flops,Flip Flops,Navy Blue,Winter,2012.0,Casual,Disney Unisex Kids Basic Navy Blue Flip Flops,10000.jpg
1,2,2127561607577938352,,0,fashion-product-images/images,10001.jpg,CKaGwIne+4YDEAE=,1719489640006438,1,2024-06-27 12:00:40.056000+00:00,...,Men,Footwear,Shoes,Casual Shoes,Black,Summer,2013.0,Casual,Clarks Men Black Leather Loafers,10001.jpg
2,3,5441151968502213098,,0,fashion-product-images/images,10002.jpg,CKTW55fe+4YDEAE=,1719489670015780,1,2024-06-27 12:01:10.067000+00:00,...,Unisex,Accessories,Socks,Socks,White,Summer,2012.0,Casual,ADIDAS Unisex White Pack of 3 Socks,10002.jpg


[limited by 3 objects]


## Convert to Pandas DataFrame

The cell below converts the DataChain dataset (`ds`) into a pandas DataFrame (`df`), which makes it easier to explore the data with familiar pandas functionality, for example to review the distribution of values in the `mastercategory` and `subcategory` columns.

In [16]:
df = ds.to_pandas()

print(df.shape)
print(df.columns)
df.head(3)

(44439, 49)
Index(['id', 'random', 'vtype', 'dir_type', 'parent', 'name', 'etag',
       'version', 'is_latest', 'last_modified', 'size', 'owner_name',
       'owner_id', 'location', 'source', 'file.source', 'file.parent',
       'file.name', 'file.size', 'file.version', 'file.etag', 'file.is_latest',
       'file.last_modified', 'file.location', 'file.vtype', 'right_id',
       'right_random', 'source.file.source', 'source.file.parent',
       'source.file.name', 'source.file.size', 'source.file.version',
       'source.file.etag', 'source.file.is_latest',
       'source.file.last_modified', 'source.file.location',
       'source.file.vtype', 'source.index', 'c0', 'gender', 'mastercategory',
       'subcategory', 'articletype', 'basecolour', 'season', 'year', 'usage',
       'productdisplayname', 'filename'],
      dtype='object')


Unnamed: 0,id,random,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,...,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
0,1,1555367085137818095,,0,fashion-product-images/images,10000.jpg,CPzf74/e+4YDEAE=,1719489653370876,1,2024-06-27 12:00:53.421000+00:00,...,Unisex,Footwear,Flip Flops,Flip Flops,Navy Blue,Winter,2012.0,Casual,Disney Unisex Kids Basic Navy Blue Flip Flops,10000.jpg
1,2,2127561607577938352,,0,fashion-product-images/images,10001.jpg,CKaGwIne+4YDEAE=,1719489640006438,1,2024-06-27 12:00:40.056000+00:00,...,Men,Footwear,Shoes,Casual Shoes,Black,Summer,2013.0,Casual,Clarks Men Black Leather Loafers,10001.jpg
2,3,5441151968502213098,,0,fashion-product-images/images,10002.jpg,CKTW55fe+4YDEAE=,1719489670015780,1,2024-06-27 12:01:10.067000+00:00,...,Unisex,Accessories,Socks,Socks,White,Summer,2012.0,Casual,ADIDAS Unisex White Pack of 3 Socks,10002.jpg


In [17]:
print(df.mastercategory.value_counts())
print(df.subcategory.value_counts())

mastercategory
Apparel           16026
Accessories        8374
Footwear           6907
Personal Care      1765
Free Items           80
Sporting Goods       21
Name: count, dtype: int64
subcategory
Topwear                     11538
Shoes                        5530
Bags                         2260
Bottomwear                   2008
Watches                      1883
Innerwear                    1364
Eyewear                       820
Jewellery                     792
Fragrance                     739
Sandal                        717
Wallets                       686
Flip Flops                    660
Belts                         602
Socks                         518
Lips                          382
Dress                         355
Loungewear and Nightwear      340
Saree                         328
Nails                         241
Makeup                        232
Headwear                      222
Ties                          191
Accessories                   100
Scarves              

This code snippet demonstrates how to leverage DataChain to load and get a basic understanding of your dataset using `pandas`.
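For readers less familiar with pandas, `value_counts()` simply counts how often each distinct value occurs and sorts the result in descending order. A dependency-free sketch of the same idea with `collections.Counter` is shown below; the rows are a tiny made-up sample, not the real dataset.

```python
from collections import Counter

# Tiny made-up sample with the same column name as the dataset.
rows = [
    {"mastercategory": "Apparel"},
    {"mastercategory": "Footwear"},
    {"mastercategory": "Apparel"},
    {"mastercategory": "Accessories"},
    {"mastercategory": "Apparel"},
    {"mastercategory": "Footwear"},
]

counts = Counter(row["mastercategory"] for row in rows)

# most_common() returns (value, count) pairs, most frequent first.
for category, n in counts.most_common():
    print(category, n)
```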

**Note**: DataChain offers functionalities beyond pandas conversion. Explore the documentation for more advanced data manipulation techniques!

# 🕵️‍♀️ Filtering Data

DataChain allows you to filter the dataset based on specific conditions.
- The `.filter()` method applies query expressions to columns.
- Use the `C` object to refer to a dataset column by name, e.g. `C("mastercategory")`.

## Show only images with `Apparel` category

In [18]:
(
    ds.filter(C("mastercategory") == "Apparel").show(3)
)


Unnamed: 0,id,random,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,...,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
0,4,3810480416941945053,,0,fashion-product-images/images,10003.jpg,CO/fpJ7e+4YDEAE=,1719489683599343,1,2024-06-27 12:01:23.651000+00:00,...,Men,Apparel,Bottomwear,Trousers,Khaki,Spring,2013.0,Smart Casual,Allen Solly Men Khaki Chino Trousers,10003.jpg
1,5,674154948235391147,,0,fashion-product-images/images,10004.jpg,CMDWmrbe+4YDEAE=,1719489733765952,1,2024-06-27 12:02:13.818000+00:00,...,Women,Apparel,Topwear,Kurtas,Multi,Summer,2012.0,Ethnic,Diva Women Multi Coloured Kurta,10004.jpg
2,6,3643261648999796763,,0,fashion-product-images/images,10005.jpg,CKSeprve+4YDEAE=,1719489744441124,1,2024-06-27 12:02:24.500000+00:00,...,Boys,Apparel,Topwear,Tshirts,Grey,Fall,2011.0,Casual,Chhota Bheem Kids Boys Lets Rock Grey T-shirt,10005.jpg


[limited by 3 objects]


## Show only `Topwear` products

In [19]:
(
    ds.filter(C("subcategory") == "Topwear").show(3)
)

Unnamed: 0,id,random,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,...,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
0,5,674154948235391147,,0,fashion-product-images/images,10004.jpg,CMDWmrbe+4YDEAE=,1719489733765952,1,2024-06-27 12:02:13.818000+00:00,...,Women,Apparel,Topwear,Kurtas,Multi,Summer,2012.0,Ethnic,Diva Women Multi Coloured Kurta,10004.jpg
1,6,3643261648999796763,,0,fashion-product-images/images,10005.jpg,CKSeprve+4YDEAE=,1719489744441124,1,2024-06-27 12:02:24.500000+00:00,...,Boys,Apparel,Topwear,Tshirts,Grey,Fall,2011.0,Casual,Chhota Bheem Kids Boys Lets Rock Grey T-shirt,10005.jpg
2,8,4808167984788117,,0,fashion-product-images/images,10007.jpg,CMK/06be+4YDEAE=,1719489701142466,1,2024-06-27 12:01:41.201000+00:00,...,Men,Apparel,Topwear,Tshirts,Black,Summer,2011.0,Sports,ADIDAS Men's Pune Warriors Graphic Black T-shirt,10007.jpg


[limited by 3 objects]


## Chain multiple filters together

Show only `Topwear` apparel products for the `Summer` season.

In [20]:
(
    ds
    .filter(C("mastercategory") == "Apparel")
    .filter(C("subcategory") == "Topwear")
    .filter(C("season") == "Summer")
    .to_pandas().shape
    # .show(3)
 )

(6612, 49)

You can also use a single filter with multiple expressions joined by logical operators such as `&` (AND) and `|` (OR):

In [21]:
(
    ds
    .filter((C("mastercategory") == "Apparel") & (C("subcategory") == "Topwear") & (C("season") == "Summer"))
    .to_pandas().shape
    # .show(3)
 )

(6612, 49)
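Both cells return the same shape because chaining filters is equivalent to combining the conditions with AND. The toy model below shows why, and why each comparison needs its own parentheses. `Col` and `Expr` are hypothetical stand-ins written for this sketch; they are not DataChain's `C` object.

```python
class Expr:
    """A predicate over a dict row that supports & (AND) and | (OR)."""

    def __init__(self, test):
        self.test = test

    def __and__(self, other):
        return Expr(lambda row: self.test(row) and other.test(row))

    def __or__(self, other):
        return Expr(lambda row: self.test(row) or other.test(row))


class Col:
    """Toy column reference: comparing it to a value builds an Expr."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return Expr(lambda row: row.get(self.name) == value)


def apply_filter(rows, expr):
    return [row for row in rows if expr.test(row)]


rows = [
    {"mastercategory": "Apparel", "subcategory": "Topwear", "season": "Summer"},
    {"mastercategory": "Apparel", "subcategory": "Topwear", "season": "Winter"},
    {"mastercategory": "Apparel", "subcategory": "Bottomwear", "season": "Summer"},
    {"mastercategory": "Footwear", "subcategory": "Shoes", "season": "Summer"},
]

# Three filters applied one after another.
chained = apply_filter(rows, Col("mastercategory") == "Apparel")
chained = apply_filter(chained, Col("subcategory") == "Topwear")
chained = apply_filter(chained, Col("season") == "Summer")

# One filter with the three conditions joined by &.
combined = apply_filter(
    rows,
    (Col("mastercategory") == "Apparel")
    & (Col("subcategory") == "Topwear")
    & (Col("season") == "Summer"),
)

print(chained == combined, len(combined))
# True 1
```

The parentheses matter because Python's `&` binds more tightly than `==`: without them, Python would try to evaluate `"Apparel" & Col("subcategory")` first instead of the two comparisons.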

## Save Dataset

Let's save the filtered result as `fashion-topwear` to make it versioned and reusable.

In [22]:
(
    DataChain(name="fashion-product-images")
    .filter(C("mastercategory") == "Apparel")
    .filter(C("subcategory") == "Topwear")
    .save("fashion-topwear")
    .to_pandas().shape
)

(11538, 49)

# ☁️ Run in Studio (SaaS)

<a href="https://dvc.ai/">
    <img src="static/images/studio.png" alt="DataChain Studio SaaS" style="width: 600px;"/>
</a>

To run these examples in Studio, follow this guide:

1. Open Studio / YOUR_TEAM / `datasets` workspace
2. Create a new Python Script
3. Copy and paste the script from `scripts/1-quick-start.py`
4. Click the Run button


# 🎉 Summary 

👏 **Congratulations on completing this tutorial! You're a DataChain superstar! 🌟** You've taken the first steps in harnessing the power of DataChain for your computer vision projects. In this tutorial, we covered:
- Creating the `fashion-product-images` dataset from existing images
- Filtering the dataset based on specific conditions
- Essential DataChain methods:
    - `.show()` for displaying dataset samples
    - `.to_pandas()` for converting datasets to Pandas DataFrames
    - `.filter()` for applying custom filters to datasets
    - `.map()` for computing new columns such as `filename`
    - `.merge()` for attaching metadata to images

But this is just the beginning! DataChain offers many features for streamlining your ML workflows, including data transformations, versioning, and much more. 🚀

## What's Next?

Excited to learn more? Check out the next parts of our tutorial series:
- 📂 Saving and Versioning Datasets 
- 🧩 Splitting Datasets for Training, Validation, and Testing
- 🎨 Generating and Managing Embeddings
- 🔍 Performing Similarity Search
- 🧹 Finding and Removing Redundant Images
- 🧠 Training Models
- 🔮 Running Inference and Saving Predictions
- 📊 Analyzing Predictions

By mastering these techniques, you'll be well on your way to building powerful and efficient computer vision pipelines with DataChain.

## 🤝 Get Involved

We'd love to have you join our growing community of DataChain users and contributors! Here's how you can get involved:

- ⭐ Give us a star on [GitHub](https://github.com/iterative/dvcx) to show your support
- 🌐 Visit the [dvc.ai website](https://dvc.ai/) to learn more about our products and services
- 📞 Contact us to discuss scaling 🚀 DataChain for your project!
- 🙌 Follow us on [LinkedIn](https://www.linkedin.com/company/dvc-ai/) and [Twitter](https://x.com/DVCorg) for the latest updates and insights

Thanks for choosing DataChain, and happy coding! 😄