# SynEval: Synthetic Data Evaluation Framework

SynEval is a comprehensive evaluation framework for assessing the quality of synthetic data generated by Large Language Models (LLMs). The framework provides quantitative scoring across four key dimensions:

- **Fidelity**: Measures how well the synthetic data preserves the statistical properties and patterns of the original data
- **Utility**: Evaluates the usefulness of synthetic data for downstream tasks
- **Diversity**: Assesses the variety and uniqueness of the generated data
- **Privacy**: Analyzes the privacy protection level of the synthetic data

Change into the SynEval directory.

In [None]:
cd SynEval

In [4]:
ls

amazon_fashion_ner_analysis_fast.py  privacy_result.json
cache/                               privacy_results.json
claude.csv                           privacy_visualizations/
diversity.py                         __pycache__/
diversity_results.json               pyproject.toml
download_nltk_data.py                README_amazon_fashion_analysis.md
fidelity.py                          README.md
fidelity_results.json                real_10k.csv
flair_ner.py                         reports/
LICENSE                              requirements_flair_ner.txt
MANIFEST.in                          requirements.txt
metadata.json                        run.py
plots/                               setup.py
plotting.py                          syneval/
prepare_environment.py               utility.py
privacy.py                           utility_results.json


Install required packages.

In [5]:
!python prepare_environment.py

🚀 SynEval Environment Preparation
📦 Installing Python dependencies...
  Upgrading pip...
  Upgrading critical packages (numpy, scipy, gensim)...
Collecting numpy>=1.26.0
  Using cached numpy-2.3.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (62 kB)
Using cached numpy-2.3.1-cp311-cp311-manylinux_2_28_x86_64.whl (16.9 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.3.1 which is incompatible.
anonymeter 1.0.0 requires numpy<1.27,>=1.22, but you have numpy 2.3.1 which is incompatible.
scipy 1.13.1 requires numpy<2.3,>=1.22.4, but you have numpy 2.3.1 which is incompatible.
google-colab 1.0.0 requires pandas
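
The pip output above reports that the upgraded numpy 2.3.1 falls outside the version bounds declared by gensim, anonymeter, and scipy. As a rough sketch, the bounds quoted in that log can be checked against any numpy version string. The helper names (`parse_version`, `padded`, `conflicts`) are hypothetical, and the parsing is deliberately simple: it handles numeric release segments only.

```python
# Bounds copied from the pip conflict messages above:
# package -> (lower bound, inclusive; upper bound, exclusive).
NUMPY_BOUNDS = {
    "gensim 4.3.3": ("1.18.5", "2.0"),
    "anonymeter 1.0.0": ("1.22", "1.27"),
    "scipy 1.13.1": ("1.22.4", "2.3"),
}


def parse_version(text):
    """Turn '1.26.4' into (1, 26, 4). Hypothetical helper, numeric segments only."""
    return tuple(int(part) for part in text.split("."))


def padded(version, length=3):
    """Pad a version tuple with zeros so (2, 0) compares as (2, 0, 0)."""
    return version + (0,) * (length - len(version))


def conflicts(numpy_version):
    """Return the packages whose declared numpy bounds exclude numpy_version."""
    installed = padded(parse_version(numpy_version))
    bad = []
    for package, (low, high) in NUMPY_BOUNDS.items():
        low_ok = padded(parse_version(low)) <= installed
        high_ok = installed < padded(parse_version(high))
        if not (low_ok and high_ok):
            bad.append(package)
    return bad


print(conflicts("2.3.1"))   # the version installed by the upgrade
print(conflicts("1.26.4"))  # the version that was uninstalled
```

In a live environment, the installed version could be read with `importlib.metadata.version("numpy")` and passed to `conflicts`.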

To test fidelity evaluation, run the command below with the provided default synthetic data (`claude.csv`), original data (`real_10k.csv`), and metadata (`metadata.json`). Results are written to `fidelity_results.json`, and plots are saved in the `./plots` directory.


In [10]:
!python run.py \
    --synthetic claude.csv \
    --original real_10k.csv \
    --metadata metadata.json \
    --dimensions fidelity \
    --utility-input text \
    --utility-output rating \
    --output fidelity_results.json \
    --plot

2025-07-10 01:24:59.960648: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752110700.180106    4279 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752110700.242745    4279 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:__main__:Loading synthetic data from claude.csv
INFO:__main__:Loading original data from real_10k.csv
INFO:__main__:Loading metadata from metadata.json
INFO:__main__:Running fidelity evaluation...
INFO:fidelity:GPU acceleration available - using CUDA
INFO:fidelity:Running SDV diagnostic evaluation...
Generating report ...

(1/2) Evaluating Data Validity: 100% 9/9 [00:00<00:00, 487.51it/s]
Data Validity Score: 75.0%

(2/2) Evalua
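
The layout of `fidelity_results.json` is not shown in this notebook, so the following sketch makes no assumption about its schema. It only lists the top-level keys and the type of each value, which is usually enough to decide what to inspect next. The function name `summarize_results` is hypothetical.

```python
import json
from pathlib import Path


def summarize_results(path):
    """Return a {top_level_key: type_name} map for a results file; no schema is assumed."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        # The root might be a list or a scalar; report its type instead of failing.
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}


if __name__ == "__main__":
    results_file = Path("fidelity_results.json")  # output path from the command above
    if results_file.exists():
        for key, kind in summarize_results(results_file).items():
            print(f"{key}: {kind}")
```

The same helper should work unchanged on the other result files produced by `run.py`, since it only relies on them being valid JSON.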

To test utility evaluation, run the command below with the provided default synthetic data, original data, and metadata files. The default downstream task trains a classifier to predict the rating from the review text. Results are written to `utility_results.json`, and plots are saved in the `./plots` directory.

In [11]:
!python run.py \
    --synthetic claude.csv \
    --original real_10k.csv \
    --metadata metadata.json \
    --dimensions utility \
    --utility-input text \
    --utility-output rating \
    --output utility_results.json \
    --plot

2025-07-10 01:26:41.141053: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752110801.174889    4743 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752110801.185191    4743 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:__main__:Loading synthetic data from claude.csv
INFO:__main__:Loading original data from real_10k.csv
INFO:__main__:Loading metadata from metadata.json
INFO:__main__:Running utility evaluation...
INFO:utility:GPU acceleration available - using CUDA
INFO:utility:Using all synthetic data (300 samples) for training
INFO:utility:Training size: 300, Test size: 300
INFO:utility:All unique target values: [1.0, 2.0, 3.0, 4.0, 5.0]
INFO:
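
The log above indicates that the classifier is trained on the synthetic samples. To illustrate the idea of scoring utility this way, here is a minimal, standard-library-only sketch that trains a multinomial Naive Bayes text classifier on toy "synthetic" reviews and measures accuracy on toy "original" reviews. This is not SynEval's classifier: the model choice, the function names, and the toy rows are all assumptions, and it is also an assumption that the test set comes from the original data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Pick the class with the highest log-posterior, using add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, texts, labels):
    correct = sum(predict(model, t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy rows standing in for the synthetic (training) and original (test) data.
synthetic_texts = ["love this dress great fit", "terrible quality broke fast",
                   "great fabric love it", "terrible fit awful quality"]
synthetic_ratings = [5.0, 1.0, 5.0, 1.0]
original_texts = ["love the great color", "awful terrible zipper"]
original_ratings = [5.0, 1.0]

model = train_naive_bayes(synthetic_texts, synthetic_ratings)
print(accuracy(model, original_texts, original_ratings))
```

A high score on real data suggests the synthetic reviews carry enough signal to stand in for real training data on this task. A low score suggests they do not.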

To test diversity evaluation, run the command below with the provided default synthetic data, original data, and metadata files. Results are written to `diversity_results.json`, and plots are saved in the `./plots` directory.

In [12]:
!python run.py \
    --synthetic claude.csv \
    --original real_10k.csv \
    --metadata metadata.json \
    --dimensions diversity \
    --utility-input text \
    --utility-output rating \
    --output diversity_results.json \
    --plot

2025-07-10 01:31:48.534964: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752111108.554732    6081 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752111108.561019    6081 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:__main__:Loading synthetic data from claude.csv
INFO:__main__:Loading original data from real_10k.csv
INFO:__main__:Loading metadata from metadata.json
INFO:__main__:Running diversity evaluation...
INFO:diversity:Using device: cuda
INFO:diversity:GPU acceleration enabled
INFO:diversity:Evaluating text diversity for columns: ['title', 'text', 'asin', 'parent_asin', 'user_id']
Evaluating text columns:   0% 0/5 [00:00<?, ?it/s]INFO
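
The log shows diversity being evaluated per text column. One common way to quantify lexical variety is distinct-n: the ratio of unique n-grams to total n-grams. Whether SynEval computes this particular metric is not shown in the output, so the sketch below is only an illustration of the general idea. The function name and the sample reviews are hypothetical.

```python
def distinct_n(texts, n=1):
    """Ratio of unique n-grams to total n-grams across all texts (0.0 if there are none)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Toy reviews; in practice these would be the values of one text column.
reviews = ["great shirt great price", "great shirt", "runs small but nice fabric"]
print(distinct_n(reviews, n=1))
print(distinct_n(reviews, n=2))
```

Values close to 1.0 mean few repeated n-grams, while values close to 0.0 indicate heavy repetition across the generated texts.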

To test privacy evaluation, run the command below with the provided default synthetic data, original data, and metadata files. Results are written to `privacy_results.json`, and plots are saved in the `./plots` directory.

In [13]:
!python run.py \
    --synthetic claude.csv \
    --original real_10k.csv \
    --metadata metadata.json \
    --dimensions privacy \
    --utility-input text \
    --utility-output rating \
    --output privacy_results.json \
    --plot

2025-07-10 01:33:28.247448: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752111208.267148    6524 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752111208.273170    6524 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:__main__:Loading synthetic data from claude.csv
INFO:__main__:Loading original data from real_10k.csv
INFO:__main__:Loading metadata from metadata.json
INFO:__main__:Running privacy evaluation...
INFO:privacy:Using CUDA for acceleration
INFO:privacy:Loading Flair NER model (this may take a few minutes)...
pytorch_model.bin: 100% 2.24G/2.24G [00:15<00:00, 140MB/s]
tokenizer_config.json: 100% 25.0/25.0 [00:00<00:00, 150kB/s]
confi
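
The privacy run above loads a Flair NER model, which takes time and a large download. As a much simpler, complementary sanity check, one can measure how many synthetic texts are verbatim copies of original texts. The sketch below is an assumption-laden illustration and not part of SynEval as far as this notebook shows. The function name and sample texts are hypothetical.

```python
def exact_match_rate(synthetic_texts, original_texts):
    """Fraction of synthetic texts that also appear, after normalization, in the original data."""
    def normalize(text):
        # Lowercase and collapse whitespace so trivial formatting changes do not hide a copy.
        return " ".join(text.lower().split())

    original = {normalize(t) for t in original_texts}
    if not synthetic_texts:
        return 0.0
    copied = sum(normalize(t) in original for t in synthetic_texts)
    return copied / len(synthetic_texts)


# Toy data: two of the four synthetic texts reproduce original reviews.
original = ["Fits well, would buy again.", "Too small for me."]
synthetic = ["fits well,  would buy again.", "Lovely color and soft fabric.",
             "Zipper broke after a week.", "Too small for me."]
print(exact_match_rate(synthetic, original))
```

A non-zero rate does not prove a privacy leak on its own, but it flags rows worth inspecting before the synthetic data is shared.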

## Additional Tools

### Amazon Fashion Dataset - Named Entity Recognition Analysis

SynEval includes a specialized tool for analyzing large text datasets using Named Entity Recognition (NER). The `amazon_fashion_ner_analysis_fast.py` script performs comprehensive NER analysis on the Amazon Fashion dataset, processing 10k+ records efficiently.

#### Features

- **Large Dataset Processing**: Efficiently handles 10k+ records with batch processing
- **Entity Density Analysis**: Calculates and analyzes entity density for each text
- **Comprehensive Reporting**: Generates multiple detailed report files
- **Top 200 High Entity Texts**: Identifies and reports texts with the most entities
- **Visualizations**: Creates charts and graphs for better understanding
- **Caching**: Automatic caching for faster subsequent runs
- **Progress Tracking**: Real-time progress bars for long-running operations
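
To make the entity density and top-200 features above concrete, here is a small sketch. It assumes entity counts per text are already available (for example from an NER pass) and defines density as entities per whitespace-separated token. That definition, the function names, and the sample records are assumptions, since the script's actual formula is not shown here.

```python
import heapq


def entity_density(entity_count, text):
    """Entities per whitespace token; 0.0 for empty text. The definition is an assumption."""
    tokens = text.split()
    return entity_count / len(tokens) if tokens else 0.0


def top_entity_texts(records, k=200):
    """Return the k records with the most entities, highest first."""
    return heapq.nlargest(k, records, key=lambda r: r["entity_count"])


# Hypothetical records, as if entity counts had been produced by an NER model.
records = [
    {"text": "Bought at Macy's in Boston", "entity_count": 2},
    {"text": "Nice", "entity_count": 0},
    {"text": "Nike and Adidas from Amazon", "entity_count": 3},
]
for record in records:
    record["density"] = entity_density(record["entity_count"], record["text"])

print([r["text"] for r in top_entity_texts(records, k=2)])
```

`heapq.nlargest` avoids sorting the full dataset, which matters little for three rows but helps when selecting 200 texts out of many thousands.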

In [6]:
!python amazon_fashion_ner_analysis_fast.py

2025-07-10 02:14:56.087656: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752113696.107454   17420 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752113696.113786   17420 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-07-10 02:14:59,606 - INFO - Initializing Fast Amazon Fashion NER Analyzer...
2025-07-10 02:14:59,660 - INFO - Loading Flair NER model (this may take a few minutes)...
Loading Flair NER model...
Using GPU: Tesla T4
Loading model to cuda:0...
2025-07-10 02:15:00,969 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC
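
The tag dictionary in the log uses `B-`, `I-`, `E-`, and `S-` prefixes plus `O`, which is the BIOES tagging scheme: begin, inside, end, single-token entity, and outside. Flair handles the decoding internally, so the sketch below is only meant to show how such tags map to entity spans. The function name `decode_bioes` and the example sentence are hypothetical.

```python
def decode_bioes(tokens, tags):
    """Group BIOES-tagged tokens into (entity_text, label) spans."""
    spans, current, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                      # single-token entity
            spans.append((token, label))
            current, current_label = [], None
        elif prefix == "B":                    # entity begins
            current, current_label = [token], label
        elif prefix in ("I", "E") and current and label == current_label:
            current.append(token)
            if prefix == "E":                  # entity ends
                spans.append((" ".join(current), label))
                current, current_label = [], None
        else:                                  # "O", "<unk>", or an inconsistent tag
            current, current_label = [], None
    return spans


tokens = ["Jane", "Doe", "visited", "Bank", "of", "America", "in", "Boston"]
tags = ["B-PER", "E-PER", "O", "B-ORG", "I-ORG", "E-ORG", "O", "S-LOC"]
print(decode_bioes(tokens, tags))
```

Counting the spans returned for each text gives the per-text entity count that density and top-N reports can be built on.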