# datachain

- https://github.com/iterative/datachain



```bash
pip install datachain
```



## quick start

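```python
from datachain import Column, DataChain

# Load the JSON annotations and the image files from the public demo bucket.
meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta", anon=True)
images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg", anon=True)

# Derive a join key from each file path, then attach the matching annotation.
images_id = images.map(id=lambda file: file.path.split(".")[-2])
annotated = images_id.merge(meta, on="id", right_on="meta.id")

# Keep only images whose inference class is "cat" with confidence above 0.93.
likely_cats = annotated.filter(
    (Column("meta.inference.confidence") > 0.93)
    & (Column("meta.inference.class_") == "cat")
)
# Export the selected image files.
likely_cats.to_storage("high-confidence-cats/", signal="file")
```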
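To make the join key and the filter concrete, here is a minimal plain-Python sketch of what the `id` lambda and the confidence condition compute. The sample paths and metadata records are made up for illustration and may not match the real bucket contents, and the helpers `extract_id` and `is_likely_cat` are hypothetical names, not part of datachain. The dictionary lookup is only a rough stand-in for `merge`, not a description of its join semantics.

```python
def extract_id(path: str) -> str:
    """Mirror the quick-start lambda: take the second-to-last dot-separated piece."""
    return path.split(".")[-2]


def is_likely_cat(meta: dict, threshold: float = 0.93) -> bool:
    """Mirror the quick-start filter on the inference class and confidence."""
    inference = meta["inference"]
    return inference["confidence"] > threshold and inference["class_"] == "cat"


# Hypothetical sample data (assumed naming scheme, not read from the bucket).
paths = [
    "dogs-and-cats/cat.1.jpg",
    "dogs-and-cats/dog.2.jpg",
    "dogs-and-cats/cat.3.jpg",
]
metas = [
    {"id": "1", "inference": {"class_": "cat", "confidence": 0.97}},
    {"id": "2", "inference": {"class_": "dog", "confidence": 0.99}},
    {"id": "3", "inference": {"class_": "cat", "confidence": 0.60}},
]

# Index the annotations by id, then look each image up by its derived key.
meta_by_id = {m["id"]: m for m in metas}
selected = [
    p for p in paths
    if extract_id(p) in meta_by_id and is_likely_cat(meta_by_id[extract_id(p)])
]
print(selected)  # ['dogs-and-cats/cat.1.jpg']
```

Note that `split(".")[-2]` raises `IndexError` for a path that contains no dot, so this key derivation assumes every file has an extension.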


## data curation

```python
from transformers import pipeline
from datachain import Column, DataChain

# Sentiment classifier running on CPU.
classifier = pipeline(
    "sentiment-analysis",
    device="cpu",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)


def is_positive_dialogue_ending(file) -> bool:
    # Classify only the last 512 characters of the dialogue.
    dialogue_ending = file.read()[-512:]
    return classifier(dialogue_ending)[0]["label"] == "POSITIVE"


chain = (
    DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
                           object_name="file", type="text", anon=True)
    .settings(parallel=8, cache=True)
    .map(is_positive=is_positive_dialogue_ending)
    .save("file_response")
)

# `== True` is intentional: it builds a filter expression on the column.
positive_chain = chain.filter(Column("is_positive") == True)
positive_chain.to_storage("./output")

print(f"{positive_chain.count()} files were exported")
```
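The per-file check above can be tried without downloading the model. The sketch below reproduces the tail slicing and the label comparison in plain Python; `stub_classifier` is a hypothetical keyword-based stand-in that only imitates the list-of-dicts shape the code above indexes into, and `tail` and `is_positive` are illustrative names, not datachain or transformers APIs.

```python
def tail(text: str, max_chars: int = 512) -> str:
    """Return at most the last `max_chars` characters, as in the example above."""
    return text[-max_chars:]


def stub_classifier(text: str) -> list:
    # Hypothetical stand-in for the sentiment pipeline: a trivial keyword rule
    # that returns the same [{"label": ..., "score": ...}] shape.
    label = "POSITIVE" if "thank" in text.lower() else "NEGATIVE"
    return [{"label": label, "score": 1.0}]


def is_positive(text: str, classifier) -> bool:
    """Classify only the ending of the dialogue and compare the label."""
    return classifier(tail(text))[0]["label"] == "POSITIVE"


# A dialogue that ends well, and one whose only positive line is far from the end.
ends_well = "user: this is broken\n" * 100 + "bot: fixed it\nuser: thank you!"
starts_well = "user: thank you\n" + "bot: anything else?\n" * 100

print(len(tail(ends_well)))                       # 512
print(is_positive(ends_well, stub_classifier))    # True
print(is_positive(starts_well, stub_classifier))  # False
```

Because only the tail is classified, text earlier in a long dialogue has no influence on the result, which is the point of checking the dialogue ending.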

```text
13 files were exported
```


