# Gain Operator Insight


Operators provided by Data-Juicer serve as the backbone for a variety of data operations, including modification, cleaning, filtering, and deduplication. 


In the following sections, we run several Operators and inspect their results to gain a deeper understanding of how they work.


In [None]:
# Install data-juicer package if you are NOT in the Playground
!pip3 install py-data-juicer

# Or install the latest code of data-juicer from GitHub
!pip install git+https://github.com/modelscope/data-juicer

### Mapper Operators

A Mapper operator is primarily used to edit, modify, or enhance samples.


We use `CleanIpMapper` to clean up IP addresses in text, and we showcase two ways to invoke it.

In [None]:
# A sample whose text contains an IP address
samples = [{'text': 'test of ip 234.128.124.123'}]

In [None]:
from data_juicer.ops.mapper.clean_ip_mapper import CleanIpMapper
op = CleanIpMapper()

-  Invoke the op's `process` method directly

In [None]:
out_sample = op.process(samples[0])
print(out_sample['text'])


-  Invoke the op's `process` method through a `datasets` `Dataset`

In [None]:
from datasets import Dataset

ds = Dataset.from_list(samples)
ds = ds.map(op.process)
print(ds[0])
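
To build intuition for what a Mapper does to a single sample, here is a minimal pure-Python sketch that does not depend on Data-Juicer. The function name `clean_ip`, its parameters, and the simplified IPv4 regular expression are assumptions made for illustration; the pattern and options that `CleanIpMapper` actually uses may differ.

```python
import re

# Hypothetical, simplified IPv4 pattern for illustration only;
# the pattern used by CleanIpMapper may differ.
IPV4_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def clean_ip(sample, text_key='text', repl=''):
    """Return a copy of the sample with IPv4-like substrings replaced."""
    cleaned = dict(sample)  # leave the input sample untouched
    cleaned[text_key] = IPV4_PATTERN.sub(repl, sample[text_key])
    return cleaned


sample = {'text': 'test of ip 234.128.124.123'}
result = clean_ip(sample)
print(repr(result['text']))
```

The sketch takes one sample dict and returns an edited sample dict, which is the same shape of function that `ds.map` expects in the cell above.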

### Filter Operators

A Filter operator is mainly used to filter out low-quality samples.

We use `WordNumFilter` to filter out samples whose word count falls outside a specified range.

In [None]:
samples = [
    {'text': 'Data Juicer'}, 
    {'text': 'Welcome to Data Juicer Playground'}
]

In [None]:
from data_juicer.ops.filter.word_num_filter import WordNumFilter
op = WordNumFilter(
    min_num=3, 
    max_num=10
)

In [None]:
from datasets import Dataset
ds = Dataset.from_list(samples)
print(ds)

We add a new column to the dataset to store the statistics computed by the Filter operator.


In [None]:
from data_juicer.utils.constant import Fields
ds = ds.add_column(name=Fields.stats, column=[{}] * ds.num_rows)
print(ds)

In [None]:
ds = ds.map(op.compute_stats)
ds = ds.filter(op.process)
print(ds[0])
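
The two-step pattern above (compute statistics first, then decide whether to keep each sample) can be sketched in plain Python. This is an illustration under stated assumptions, not Data-Juicer's implementation: the `'stats'` and `'num_words'` keys and the `keep` function are hypothetical names, and words are counted by splitting on whitespace, which may differ from how `WordNumFilter` counts words.

```python
def compute_stats(sample):
    """Phase 1: record the word count under a 'stats' key (hypothetical name)."""
    stats = dict(sample.get('stats', {}))
    # Assumption: words are whitespace-separated tokens.
    stats['num_words'] = len(sample['text'].split())
    return {**sample, 'stats': stats}


def keep(sample, min_num=3, max_num=10):
    """Phase 2: keep the sample only if its word count is within range."""
    return min_num <= sample['stats']['num_words'] <= max_num


samples = [
    {'text': 'Data Juicer'},
    {'text': 'Welcome to Data Juicer Playground'},
]
with_stats = [compute_stats(s) for s in samples]
kept = [s for s in with_stats if keep(s)]
print([s['text'] for s in kept])
```

Separating the two phases means the statistics stay attached to every sample, so they can be inspected or reused with different thresholds without recomputing them.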

More Operator demos are coming.