# Gain Operator Insight


Operators provided by Data-Juicer serve as the backbone for a variety of data operations, including modification, cleaning, filtering, and deduplication. 


In the following sections, we run several operators and inspect their results to gain a deeper understanding of how they work.


In [None]:
# Install data-juicer package if you are NOT in the Playground
# !pip3 install py-data-juicer

# Or use newest code of data-juicer
# !pip install git+https://github.com/modelscope/data-juicer

### Mapper Operators

A mapper operator is primarily used to edit, modify, or enhance the samples in a dataset.


We use `CleanIpMapper` to remove IP addresses from text, and we showcase two approaches.

In [None]:
from data_juicer.ops.mapper.clean_ip_mapper import CleanIpMapper
op = CleanIpMapper()

- Invoke the op's `process` method directly

In [None]:
# process expects a single sample (a dict), the same shape that ds.map passes in
sample = {'text': 'test of ip 11.22.33.44'}
out_sample = op.process(sample)

print(out_sample)

- Invoke the op's `process` method through a `datasets.Dataset`

In [None]:
from datasets import Dataset

samples = [{'text': 'test of ip 11.22.33.44'}]
ds = Dataset.from_list(samples)
out_ds = ds.map(op.process)

for sample in out_ds:
    print(sample)
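
To make the idea of a mapper more concrete, here is a minimal, library-independent sketch of an IP-cleaning step applied to one sample. The regular expression and the `clean_ip` helper are illustrative assumptions; they are not the pattern or the API that `CleanIpMapper` actually uses.

```python
import re

# Simplified IPv4-like pattern (an assumption for illustration only):
# four groups of 1-3 digits separated by dots.
IPV4_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def clean_ip(sample, text_key='text', repl=''):
    """Return a copy of the sample with IPv4-like substrings replaced by repl."""
    cleaned = dict(sample)
    cleaned[text_key] = IPV4_PATTERN.sub(repl, sample[text_key])
    return cleaned


demo_sample = {'text': 'test of ip 11.22.33.44'}
cleaned_sample = clean_ip(demo_sample)
print(cleaned_sample)
```

The helper returns a new dict instead of mutating its input, which mirrors how a mapper takes one sample in and hands one edited sample back.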

### Filter Operators

A filter operator is mainly used to filter out low-quality samples.

Generally, a filter operator involves two steps:
- Compute a statistical value for each sample
- Compare the statistical value against a threshold

The example below uses `WordNumFilter` to keep only samples whose word count lies in the range [3, 10], that is, to discard samples with fewer than 3 or more than 10 words.


In [None]:
from datasets import Dataset

samples = [
    {'text': 'Data Juicer'}, 
    {'text': 'Welcome to Data Juicer Playground'}
]
ds = Dataset.from_list(samples)

# add a new column to the dataset to store the statistical values of the filter operator.
from data_juicer.utils.constant import Fields
ds = ds.add_column(name=Fields.stats, column=[{}] * ds.num_rows)

print(ds)
print(f'Number of samples of input dataset : {len(ds)}')

In [None]:
from data_juicer.ops.filter.word_num_filter import WordNumFilter
op = WordNumFilter(
    min_num=3, 
    max_num=10
)

In [None]:
ds = ds.map(op.compute_stats)

for sample in ds:
    print(sample)

out_ds = ds.filter(op.process)

print(out_ds)
print(f'Number of samples of output dataset : {len(out_ds)}')
for sample in out_ds:
    print(sample)
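
The two steps of a filter can also be pictured without the library. The sketch below is only an approximation of the idea: it counts words by splitting on whitespace, and the `stats` and `num_words` keys, as well as both helper functions, are hypothetical names chosen for this example rather than Data-Juicer's own.

```python
def compute_word_count(sample, text_key='text'):
    """Step 1: compute the statistic and store it next to the sample."""
    stats = dict(sample.get('stats', {}))
    # Whitespace splitting is a simplifying assumption, not a real tokenizer.
    stats['num_words'] = len(sample[text_key].split())
    return {**sample, 'stats': stats}


def keep_sample(sample, min_num=3, max_num=10):
    """Step 2: compare the stored statistic with the thresholds."""
    return min_num <= sample['stats']['num_words'] <= max_num


texts = [
    {'text': 'Data Juicer'},
    {'text': 'Welcome to Data Juicer Playground'},
]
with_stats = [compute_word_count(s) for s in texts]
kept = [s for s in with_stats if keep_sample(s)]
print(kept)
```

Keeping the statistic on the sample, rather than discarding it after the comparison, makes it possible to inspect why a sample was kept or dropped.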

### Deduplicator Operators

A deduplicator operator is mainly used to detect and remove duplicate samples.

Generally, a deduplicator operator involves two steps:
- Compute hash values
- Delete duplicate samples

Here is a demo that deduplicates samples case-insensitively using exact matching (MD5 hash).

In [None]:
from datasets import Dataset

samples = [
    {'text': 'welcome to data juicer playground'}, 
    {'text': 'Welcome to Data Juicer Playground'}
]
ds = Dataset.from_list(samples)

print(ds)
print(f'Number of samples of input dataset : {len(ds)}')

In [None]:
from data_juicer.ops.deduplicator.document_deduplicator import DocumentDeduplicator
op = DocumentDeduplicator(lowercase=True)

In [None]:
ds = ds.map(op.compute_hash)
out_ds, dup_pairs = op.process(ds, show_num=1)

print(out_ds)
print(f'Number of samples of output dataset : {len(out_ds)}')
for sample in out_ds:
    print(sample)

Print the duplicate samples

In [None]:
for key, dup_pair in dup_pairs.items():
    print(f'Deduplicate hash value : {key}')
    for sample in dup_pair:
        print(sample)
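
For intuition, exact-match deduplication can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the `md5_of` helper and the grouping logic are hypothetical, and the real operator may normalize text differently before hashing.

```python
import hashlib
from collections import defaultdict


def md5_of(text, lowercase=True):
    """Step 1: compute a hash value, optionally ignoring case."""
    if lowercase:
        text = text.lower()
    return hashlib.md5(text.encode('utf-8')).hexdigest()


docs = [
    {'text': 'welcome to data juicer playground'},
    {'text': 'Welcome to Data Juicer Playground'},
]

# Group samples by hash; samples that share a hash are exact duplicates.
groups = defaultdict(list)
for doc in docs:
    groups[md5_of(doc['text'])].append(doc)

# Step 2: keep the first sample of each group and report the duplicate groups.
unique_docs = [group[0] for group in groups.values()]
duplicates = {h: g for h, g in groups.items() if len(g) > 1}

print(unique_docs)
print(len(duplicates))
```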

### Selector Operators

A selector operator is mainly used to select top samples based on a ranking computed over a specific field of the dataset. For instance, it can select the top k samples with the highest frequency, or keep a given proportion of the highest-ranked samples.


First, we construct a dataset with 5 samples, and then use a selector operator to select the top 2 samples based on `meta.count`.


In [None]:
from datasets import Dataset

samples = [{
            'text': 'Today is Sun', 
            'meta': {'count': 5 }
        }, {
            'text': 'a v s e c s f e f g a a a  ',
            'meta': {'count': 23 }
        }, {
            'text': '，。、„”“«»１」「《》´∶：？！',
            'meta': { 'count': 48 }
        }, {
            'text': '他的英文名字叫Harry Potter',
            'meta': { 'count': 78 }
        }, {
            'text': '这是一个测试',
            'meta': { 'count': 3 }
        }]
ds = Dataset.from_list(samples)

print(ds)
print(f'Number of samples of input dataset : {len(ds)}')

In [None]:
from data_juicer.ops.selector.topk_specified_field_selector import TopkSpecifiedFieldSelector

op = TopkSpecifiedFieldSelector(field_key='meta.count', topk=2)

In [None]:
out_ds = op.process(ds)

print(out_ds)
print(f'Number of samples of output dataset : {len(out_ds)}')
for sample in out_ds:
    print(sample)
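
The ranking behind a top-k selection can be sketched in a few lines of plain Python. The `get_nested` and `select_topk` helpers are hypothetical names for this example, and resolving `meta.count` by splitting on dots is an assumption about how such a key could be interpreted, not a description of the operator's internals.

```python
def get_nested(sample, dotted_key):
    """Resolve a dotted key such as 'meta.count' inside nested dicts."""
    value = sample
    for part in dotted_key.split('.'):
        value = value[part]
    return value


def select_topk(samples, field_key, topk, reverse=True):
    """Rank samples by the given field and keep the first topk."""
    ranked = sorted(samples,
                    key=lambda s: get_nested(s, field_key),
                    reverse=reverse)
    return ranked[:topk]


rows = [
    {'text': 'a', 'meta': {'count': 5}},
    {'text': 'b', 'meta': {'count': 23}},
    {'text': 'c', 'meta': {'count': 48}},
    {'text': 'd', 'meta': {'count': 78}},
    {'text': 'e', 'meta': {'count': 3}},
]
top2 = select_topk(rows, 'meta.count', topk=2)
print(top2)
```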

### Multimodal Demos

Data-Juicer now provides many operators for processing **multimodal** data.

**Note**


The input dataset must adhere to the Data-Juicer format, characterized by a text-centric, multi-chunk structure interspersed with special tokens. 

Additionally, we offer a suite of multimodal tools designed to facilitate conversion between other formats and the Data-Juicer format.
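
As a rough picture of this text-centric format, the sketch below checks that the number of image tokens in the text matches the number of entries in `images`. The literal token string is an assumption made so the example runs without Data-Juicer installed; in real code, use `SpecialTokens.image` from `data_juicer.utils.mm_utils`. The `image_tokens_match` helper is hypothetical and is only one plausible sanity check, not a validation that the library performs.

```python
# Assumed placeholder for illustration; prefer SpecialTokens.image in real code.
IMAGE_TOKEN = '<__dj__image>'


def image_tokens_match(sample, text_key='text', image_key='images'):
    """Check that each image path has a corresponding image token in the text."""
    return sample[text_key].count(IMAGE_TOKEN) == len(sample[image_key])


good = {'text': f'{IMAGE_TOKEN}a photo of a cat', 'images': ['cat.jpg']}
bad = {'text': 'a photo of a cat', 'images': ['cat.jpg']}

print(image_tokens_match(good), image_tokens_match(bad))
```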


Here we use test data from Data-Juicer to show you how to construct your dataset and how to process multimodal data.

In [None]:
import os
data_juicer_path = './data-juicer'  # change to your data-juicer directory
data_path = os.path.join(data_juicer_path, 'tests/ops/data')

- #### Image-Text Dataset

Generally speaking, most multimodal datasets consist of at least two modalities.

In [None]:
import os
from datasets import Dataset
from data_juicer.utils.mm_utils import SpecialTokens

cat_path = os.path.join(data_path, 'cat.jpg')

samples = [{
            'text': f'{SpecialTokens.image}a photo of a cat',  # 0.2457006871700287
            'images': [cat_path]
        }, {
            'text': f'{SpecialTokens.image}a photo of a dog',  # 0.19304907321929932
            'images': [cat_path]
        }]
ds = Dataset.from_list(samples)

# add a new column to the dataset to store the statistical values of the filter operator.
from data_juicer.utils.constant import Fields
ds = ds.add_column(name=Fields.stats, column=[{}] * ds.num_rows)

print(ds)
print(f'Number of samples of input dataset : {len(ds)}')

Initializing the `ImageTextSimilarityFilter` operator loads the `openai/clip-vit-base-patch32` model, so wait until the model has loaded successfully before continuing.

In [None]:
from data_juicer.ops.filter.image_text_similarity_filter import ImageTextSimilarityFilter
op = ImageTextSimilarityFilter(
    hf_clip = 'openai/clip-vit-base-patch32',
    min_score=0.2,
    max_score=0.9
)

In [None]:
ds = ds.map(op.compute_stats)

for sample in ds:
    print(sample)

out_ds = ds.filter(op.process)

print(out_ds)
print(f'Number of samples of output dataset : {len(out_ds)}')
for sample in out_ds:
    print(sample)

- #### Image-Only Dataset

However, sometimes we may want to process multimodal datasets based on a single modality, or work with single-modal datasets.

In [None]:
import os
from datasets import Dataset

img1_path = os.path.join(data_path, 'img1.png') # 336*336
img2_path = os.path.join(data_path, 'img2.jpg') # 640*480
img3_path = os.path.join(data_path, 'img3.jpg') # 342*500

samples = [{
            'images': [img1_path]
        }, {
            'images': [img2_path]
        }, {
            'images': [img3_path]
        }]
ds = Dataset.from_list(samples)

# add a new column to the dataset to store the statistical values of the filter operator.
from data_juicer.utils.constant import Fields
ds = ds.add_column(name=Fields.stats, column=[{}] * ds.num_rows)

print(ds)
print(f'Number of samples of input dataset : {len(ds)}')

In [None]:
from data_juicer.ops.filter.image_shape_filter import ImageShapeFilter
op = ImageShapeFilter(
    min_width=400, 
    min_height=400
)

In [None]:
ds = ds.map(op.compute_stats)

for sample in ds:
    print(sample)

out_ds = ds.filter(op.process)

print(out_ds)
print(f'Number of samples of output dataset : {len(out_ds)}')
for sample in out_ds:
    print(sample)

Please refer to [[*DJ-SORA*]](https://github.com/modelscope/data-juicer/blob/main/docs/DJ_SORA.md) and [[*Multimodal Converting*]](https://github.com/modelscope/data-juicer/blob/main/tools/multimodal/README.md) for more details.