# Analyze Dataset

The Analyzer computes statistical information about a dataset, which can then guide parameter choices in your recipe, such as filter thresholds.

**Note:** The Analyzer only computes stats for Filter operators.


In [None]:
# Install data-juicer package if you are NOT in the Playground
# !pip3 install py-data-juicer

# Or install the latest code from the data-juicer repository
# !pip install git+https://github.com/modelscope/data-juicer

Run the `analyze_data.py` script or the `dj-analyze` command-line tool, passing your config file as the argument, to analyze your dataset.

```shell
# only for installation from source
python tools/analyze_data.py --config your_recipe.yaml

# use command line tool
dj-analyze --config your_recipe.yaml
```

Here, we will show you how to analyze your dataset.

**Note:** The preparation of a temporary dataset is solely for demonstration purposes and is not part of the real analysis process.

Prepare a temporary directory to store input and output data.

In [None]:
import os
import tempfile

temp_dir = tempfile.mkdtemp()
input_file = os.path.join(temp_dir, 'input.jsonl')
output_dir = os.path.join(temp_dir, 'processed')
os.makedirs(output_dir, exist_ok=True)
output_file = os.path.join(output_dir, 'output.jsonl')
data_recipe = os.path.join(temp_dir, 'recipe.yaml')

Prepare a temporary data recipe.

In [None]:
recipe_str = f"""
project_name: 'test_demo'
dataset_path: {input_file}  # path to your dataset directory or file
np: 1  # number of subprocesses used to process your dataset

export_path: {output_file}
save_stats_in_one_file: true 

# process schedule
# a list of several process operators with their arguments
process:
  - text_length_filter:                                     # drop samples whose text length falls outside the given range
      min_len: 10                                             # the min length of filter range
      max_len: 10000                                          # the max length of filter range
"""

with open(data_recipe, 'w') as f:
    f.write(recipe_str)

In [None]:
# load recipe
from data_juicer.config import init_configs
cfg = init_configs(args=f'--config {data_recipe}'.split())
print(cfg.dataset_path)
print(cfg.process)

Prepare a temporary input dataset.

In [None]:
 
# strip the leading newline so the JSONL file does not start with an empty line
samples_str = """
{"text": "Today is Sunday and it's a happy day!"}
{"text": "Do you need a cup of coffee?"}
{"text": "你好，请问你是谁"}
{"text": "Sur la plateforme MT4, plusieurs manières d'accéder à ces fonctionnalités sont conçues simultanément."}
{"text": "欢迎来到阿里巴巴！"}
{"text": "This paper proposed a novel method on LLM pretraining."}
""".lstrip()

# write as UTF-8 explicitly, since the samples contain non-ASCII text
with open(input_file, 'w', encoding='utf-8') as f:
    f.write(samples_str)
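
Before analyzing your own data, it can help to confirm that every line of a JSONL input parses and carries a `text` field. The following is a minimal sketch using only the standard library. It is not part of Data-Juicer, and `load_jsonl`, `demo_dir`, and `demo_file` are hypothetical names introduced for illustration. It also prints the character count of each sample, on the assumption that a length-based filter works with a count of this kind.

```python
import json
import os
import shutil
import tempfile


def load_jsonl(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate empty lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"line {line_no} is not valid JSON: {err}") from err
    return records


# Hypothetical two-line sample written to a throwaway file for the demo.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "demo.jsonl")
with open(demo_file, "w", encoding="utf-8") as f:
    f.write('{"text": "Do you need a cup of coffee?"}\n')
    f.write('{"text": "你好，请问你是谁"}\n')

records = load_jsonl(demo_file)
# Character counts per sample (assumed to approximate a text-length stat).
lengths = [len(r["text"]) for r in records]
print(lengths)

shutil.rmtree(demo_dir)  # remove the throwaway directory
```

Pointing `load_jsonl` at your own input file should surface malformed lines early, before a longer analysis run starts.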

Now that the preparation work is complete, let's begin analyzing the dataset.

In [None]:
from data_juicer.core import Analyser
analyzer = Analyser(cfg)
analyzer.run()

After the analysis is complete, we can view the overall statistics of the dataset.

In [None]:
import pandas as pd

overall_file = os.path.join(analyzer.analysis_path, 'overall.csv')
# fall back to None so the cell does not raise NameError when the file is missing
analysis_res = pd.read_csv(overall_file) if os.path.exists(overall_file) else None

analysis_res

Display the histogram of the dataset's statistics.

In [None]:
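import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# locate the combined histogram image in the analysis output, if there is one
all_stats = None
if os.path.exists(analyzer.analysis_path):
    for f_path in os.listdir(analyzer.analysis_path):
        if f_path.endswith('.png') and 'all-stats' in f_path:
            all_stats = os.path.join(analyzer.analysis_path, f_path)
            break

# only display the image when it was actually found
if all_stats is not None:
    img = mpimg.imread(all_stats)
    plt.imshow(img)
    plt.show()
else:
    print('No all-stats image found in', analyzer.analysis_path)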
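
As noted at the start, these statistics are meant to inform parameter settings. One possible approach, sketched below under stated assumptions, is to derive `min_len` and `max_len` from percentiles of the per-sample text lengths. The `text_lengths` values are hypothetical stand-ins rather than real Analyzer output, and the 5th/95th percentile cutoffs are an illustrative choice, not a Data-Juicer recommendation.

```python
import statistics

# Hypothetical per-sample text lengths, standing in for real analysis stats.
text_lengths = [12, 35, 48, 52, 60, 75, 90, 120, 400, 5000]

# quantiles(n=20) returns the 19 cut points at 5%, 10%, ..., 95%.
cuts = statistics.quantiles(text_lengths, n=20, method="inclusive")
min_len = int(cuts[0])   # 5th percentile, truncated to an integer
max_len = int(cuts[-1])  # 95th percentile, truncated to an integer

# Simulate which samples a length filter with these bounds would keep.
kept = [n for n in text_lengths if min_len <= n <= max_len]
print(f"min_len={min_len}, max_len={max_len}, kept {len(kept)}/{len(text_lengths)}")
```

The resulting values could then be copied into the `min_len` and `max_len` arguments of `text_length_filter` in the recipe. Re-running the analysis afterwards is a reasonable way to confirm the effect.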

In [None]:
# Clean up temporary directory
import shutil
if os.path.exists(temp_dir):
    shutil.rmtree(temp_dir)