# Gain Operator Insight


Operators provided by Data-Juicer serve as the backbone for a variety of data operations, including modification, cleaning, filtering, and deduplication. 


In the following sections, we run several operators and inspect their results to gain a deeper understanding of how they work.


In [None]:
# Install data-juicer package if you are NOT in the Playground
# !pip3 install py-data-juicer

# Or install the latest code of data-juicer from the repository
# !pip install git+https://github.com/modelscope/data-juicer

### Mapper Operators

A mapper operator is primarily used to edit, modify, or enhance samples.


We use `CleanIpMapper` to clean up IP addresses in text, and we showcase two approaches.

In [None]:
from data_juicer.ops.mapper.clean_ip_mapper import CleanIpMapper
op = CleanIpMapper()

- Invoke the op's `process` method directly on a single sample

In [None]:
# A single sample is a dict, so the result can be indexed by its 'text' key
sample = {'text': 'test of ip 234.128.124.123'}
out_sample = op.process(sample)
print(out_sample['text'])

- Invoke the op's `process` method through a `datasets.Dataset`

In [None]:
from datasets import Dataset

samples = [{'text': 'test of ip 234.128.124.123'}]
ds = Dataset.from_list(samples)
ds = ds.map(op.process)
print(ds[0])
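
To make the mapper idea concrete, here is a minimal standalone sketch of a per-sample transformation that strips IPv4-like strings. It is not Data-Juicer's implementation: the function `clean_ip`, its parameters, and the regular expression are assumptions made for illustration only.

```python
import re

# Hypothetical pattern for this sketch: four dot-separated groups of 1-3 digits.
# It does not validate octet ranges and is not the pattern Data-Juicer uses.
IPV4_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def clean_ip(sample, text_key='text', repl=''):
    """Return a copy of the sample with IPv4-like substrings replaced by `repl`."""
    cleaned = dict(sample)
    cleaned[text_key] = IPV4_PATTERN.sub(repl, sample[text_key])
    return cleaned


sample = {'text': 'test of ip 234.128.124.123'}
result = clean_ip(sample)
print(repr(result['text']))
```

Like the operator above, the sketch takes one sample and returns one sample, which is why such a function can be passed to `Dataset.map`.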

### Filter Operators

A filter operator is mainly used to filter out low-quality samples. Generally, a filter operator involves two steps:
- Compute a statistical value for each sample
- Compare the statistical value against a threshold

We show how to use `WordNumFilter` to filter out samples whose number of words is not within the range [3, 10], that is, to discard samples with fewer than 3 or more than 10 words.


In [None]:
from datasets import Dataset

samples = [
    {'text': 'Data Juicer'}, 
    {'text': 'Welcome to Data Juicer Playground'}
]
ds = Dataset.from_list(samples)

# add a new column to the dataset to store the statistical values of the filter operator.
from data_juicer.utils.constant import Fields
ds = ds.add_column(name=Fields.stats, column=[{}] * ds.num_rows)

print(ds)
print(f'Number of samples of input dataset : {len(ds)}')

In [None]:
from data_juicer.ops.filter.word_num_filter import WordNumFilter
op = WordNumFilter(
    min_num=3, 
    max_num=10
)

In [None]:
ds = ds.map(op.compute_stats)
ds = ds.filter(op.process)

print(ds)
print(f'Number of samples of output dataset : {len(ds)}')
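
The two filter steps described above can be sketched without Data-Juicer as follows. This is an illustration under stated assumptions, not the library's code: the helpers `compute_word_count` and `keep_sample`, the `'stats'` and `'num_words'` keys, and the whitespace-based word splitting are all hypothetical choices for this example.

```python
def compute_word_count(sample, text_key='text'):
    """Step 1: store the statistic (whitespace-separated word count) on a copy of the sample."""
    stats = dict(sample.get('stats', {}))
    stats['num_words'] = len(sample[text_key].split())
    return {**sample, 'stats': stats}


def keep_sample(sample, min_num=3, max_num=10):
    """Step 2: keep the sample only if the statistic lies within [min_num, max_num]."""
    return min_num <= sample['stats']['num_words'] <= max_num


samples = [
    {'text': 'Data Juicer'},
    {'text': 'Welcome to Data Juicer Playground'},
]
with_stats = [compute_word_count(s) for s in samples]
kept = [s for s in with_stats if keep_sample(s)]
print(kept)
```

Separating the statistic from the decision means the computed values stay attached to each sample, so a different threshold can be applied later without recomputing them.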

### Deduplicator Operators

A deduplicator operator is mainly used to detect and remove duplicate samples.

Generally, a deduplicator operator involves two steps:
- Compute hash values
- Delete duplicate samples

Here is a case-insensitive demo that deduplicates samples using exact matching (MD5 hash).

In [None]:
from datasets import Dataset

samples = [
    {'text': 'welcome to data juicer playground'}, 
    {'text': 'Welcome to Data Juicer Playground'}
]
ds = Dataset.from_list(samples)

print(ds)
print(f'Number of samples of input dataset : {len(ds)}')

In [None]:
from data_juicer.ops.deduplicator.document_deduplicator import DocumentDeduplicator
op = DocumentDeduplicator(
    lowercase=True
)

In [None]:
ds = ds.map(op.compute_hash)
ds, dup_pairs = op.process(ds, show_num=1)

print(ds)
print(f'Number of samples of output dataset : {len(ds)}')

Print the duplicate samples that were detected:

In [None]:
for key, dup_pair in dup_pairs.items():
    print(f'Duplicate hash value : {key}')
    for sample in dup_pair:
        print(sample)
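
For intuition, the hash-then-delete procedure can be sketched with the standard library alone. This is a simplified illustration, not the internals of `DocumentDeduplicator`: the helpers `compute_hash` and `deduplicate` and their return values are assumptions made for this example.

```python
import hashlib


def compute_hash(text, lowercase=True):
    """Step 1: MD5 hex digest of the (optionally lowercased) text."""
    if lowercase:
        text = text.lower()
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def deduplicate(samples, text_key='text', lowercase=True):
    """Step 2: keep the first sample per hash and collect groups sharing a hash."""
    seen = {}
    unique = []
    for sample in samples:
        key = compute_hash(sample[text_key], lowercase)
        if key not in seen:
            unique.append(sample)
        seen.setdefault(key, []).append(sample)
    duplicates = {k: group for k, group in seen.items() if len(group) > 1}
    return unique, duplicates


samples = [
    {'text': 'welcome to data juicer playground'},
    {'text': 'Welcome to Data Juicer Playground'},
]
unique, duplicates = deduplicate(samples)
print(f'Number of unique samples : {len(unique)}')
```

Lowercasing before hashing is what makes the match case-insensitive; with `lowercase=False` the two samples hash differently and both are kept.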

More operator demos are coming.