Synthetic and Authentic Forensic Lab (SAFL)

This repository contains the datasets and materials associated with the article:

Sergio A. Falcón-López, et al.
Forensic Analysis of Manipulated Images and Videos
Submitted to Applied Sciences (MDPI), 2025.

Relation to the Article

The SAFL dataset was created specifically for the experiments described in the article.
It includes:

  • Authentic and synthetic (Deepfake) images and videos.
  • The exact subsets used for the evaluation of conventional forensic tools and modern Deepfake-detection models.
  • File structures and naming conventions referenced in the manuscript tables and figures.

This repository enables full reproducibility of the experiments and verification of the reported metrics.

Author and Ownership

The dataset and repository are authored by:

Sergio A. Falcón-López
ORCID: https://orcid.org/0009-0002-7106-6691
Email: sfalcon@scc.uned.es

GitHub username: oigres5

Repository Structure

The repository is organized as follows:

  • audios/: Contains both real and deepfake-generated audio samples.

    • real/: Original, non-manipulated audio files.
    • fake/: Deepfake-generated audio files.
  • images/: Contains real and deepfake-generated images.

    • real/: Original images used in the experiments.
    • fake/: Deepfake-generated images.
  • videos/: Contains real and deepfake-generated videos.

    • real/: Original source videos.
    • fake/: Deepfake-generated videos.
  • src/: Contains modified or adapted scripts based on external tools.

  • README.md: This documentation file.
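
As a quick sanity check after cloning, the layout above can be walked to count the files in each modality/label folder. The following is a minimal sketch, not part of the repository's src/ scripts: the function name count_samples is hypothetical, and it assumes every regular file under a real/ or fake/ folder is one sample.

```python
from pathlib import Path

# Folder names taken from the repository structure described above.
MODALITIES = ("audios", "images", "videos")
LABELS = ("real", "fake")


def count_samples(root):
    """Return {(modality, label): file count} for a local SAFL checkout.

    Hypothetical helper: it assumes each regular file below
    <root>/<modality>/<label>/ is one sample.
    """
    root = Path(root)
    counts = {}
    for modality in MODALITIES:
        for label in LABELS:
            folder = root / modality / label
            if folder.is_dir():
                # Count regular files at any depth below the label folder.
                counts[(modality, label)] = sum(
                    1 for p in folder.rglob("*") if p.is_file()
                )
            else:
                # A missing folder is reported as zero instead of raising.
                counts[(modality, label)] = 0
    return counts
```

Calling count_samples with the path of a local clone returns a dictionary that can be compared against the sample counts listed in the Dataset Description section.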

Dataset Description

Images

-Real: 2102 samples

  • These samples were selected from the CelebA dataset.

-Deepfake: 2095 samples (+300 added in the 06/2025 update)

  • 1009 samples generated with https://thispersondoesnotexist.com
  • 613 samples generated with FaceApp
  • 453 samples taken from videos generated with DeepFaceLab
  • 20 samples generated with DALL-E 2
  • 300 samples generated with DALL-E 3 (added in the 06/2025 update)

Videos

-Real: 212 samples

  • These samples were selected from the Celeb-DF dataset.

-Deepfake: 204 samples

  • 100 generated with Avatarify
  • 104 generated with DeepFaceLab

Audios

-Real: 2000 Spanish-language samples extracted from LibriVox project audiobooks

-Deepfake: 2000 samples generated with a text-to-speech (TTS) method
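
The per-tool counts listed in this section can be checked against the stated totals with a few lines of arithmetic. The sketch below only restates numbers from this README. It reads the "+300" as additional to the 2095 Deepfake images, which is an interpretation rather than something stated outright. The names REPORTED and check_totals are illustrative.

```python
# Counts copied from the Dataset Description section of this README.
REPORTED = {
    "images/fake (before the 06/2025 update)": {
        "total": 2095,
        "parts": {
            "thispersondoesnotexist.com": 1009,
            "FaceApp": 613,
            "DeepFaceLab videos": 453,
            "DALL-E 2": 20,
        },
    },
    "videos/fake": {
        "total": 204,
        "parts": {"Avatarify": 100, "DeepFaceLab": 104},
    },
}

# Assumption: the 300 DALL-E 3 images are added on top of the 2095.
DALLE3_UPDATE = 300


def check_totals(reported):
    """Return {subset name: True if the parts sum to the stated total}."""
    return {
        name: sum(entry["parts"].values()) == entry["total"]
        for name, entry in reported.items()
    }


IMAGES_FAKE_AFTER_UPDATE = (
    REPORTED["images/fake (before the 06/2025 update)"]["total"] + DALLE3_UPDATE
)

if __name__ == "__main__":
    for name, ok in check_totals(REPORTED).items():
        print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
    print(f"images/fake after the update: {IMAGES_FAKE_AFTER_UPDATE}")
```

Under that reading, the image subset holds 2102 real and 2395 Deepfake samples after the update, so the two classes are no longer exactly balanced.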

System Requirements

The src/ directory contains helper scripts that wrap or slightly adapt existing forensic and Deepfake-related tools from external projects.
The complete and up-to-date software requirements for each tool are documented in its original repository.

In our experiments, all scripts in src/ were executed under the following minimum environment:

Software

  • Operating system: Linux (Ubuntu 20.04+) or Windows 10/11
  • Python: 3.8 or higher
  • Git: 2.25 or higher
  • CUDA 11+ and NVIDIA drivers (only required for GPU-accelerated training/inference)

Hardware

  • CPU: Quad-core processor or better
  • RAM: at least 8 GB (16 GB recommended for video processing)
  • Disk space: at least 20 GB free for datasets and intermediate results
  • GPU (optional but recommended): NVIDIA GPU with ≥ 4 GB VRAM for DeepFaceLab-based workflows
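
Some of the minimums above can be checked programmatically before running anything in src/. The following is a rough sketch using only the Python standard library. It covers the Python version, the Git version, and free disk space, and does not check CUDA, GPU memory, CPU cores, or RAM. All function and constant names are hypothetical and are not part of the repository.

```python
import re
import shutil
import subprocess
import sys

# Thresholds taken from the requirements listed above.
MIN_PYTHON = (3, 8)
MIN_GIT = (2, 25)
MIN_FREE_GB = 20


def parse_git_version(text):
    """Extract (major, minor) from output such as 'git version 2.25.1'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_git_version():
    """Return (major, minor) of the git on PATH, or None if git is missing."""
    if shutil.which("git") is None:
        return None
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=False
    )
    return parse_git_version(result.stdout)


def free_disk_gb(path="."):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def check_environment(path="."):
    """Return a dict of booleans, one per requirement that was checked."""
    git = installed_git_version()
    return {
        "python": sys.version_info[:2] >= MIN_PYTHON,
        "git": git is not None and git >= MIN_GIT,
        "disk": free_disk_gb(path) >= MIN_FREE_GB,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'below the stated minimum'}")
```

GPU-related requirements are better verified with the tooling of the external project in question, since each tool documents its own CUDA and driver constraints.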

Versioning and Commit Transparency

Past changes to the dataset have been reviewed and documented in this repository. All future commits will include:

  • Clear descriptions of the changes performed.
  • Exact counts of affected files.
  • Justification whenever aggregate statistics or percentages change.

Dataset Version History

  • 2025-06-07: Initial public release of the SAFL dataset (v1.0). Git commit 3f6f20088360acfdcfbdde5186d4c7f67c775f89 on branch main.

Data Protection and Consent

The “real” samples in SAFL are not recordings collected directly by the authors:

  • images/real/: contains a subset of face images taken from the CelebA dataset, released by its authors for non-commercial research.
  • videos/real/: contains a subset of face videos taken from the Celeb-DF dataset, released by its authors as a public benchmark for Deepfake forensics.
  • audios/real/: contains original audio files extracted from audiobooks of the LibriVox project (https://librivox.org/).

No additional personal identifiers (e.g., names, social media accounts, or textual labels of individuals) are included in SAFL beyond the anonymous file naming used for the experiments. The dataset does not contain images or recordings collected from private individuals specifically for this work.

Users of SAFL are responsible for ensuring that their use of the data complies with the original dataset licenses (e.g., CelebA, Celeb-DF) and with any applicable data protection regulation (such as GDPR when processing biometric data).

License

This dataset is released under the following license:

Creative Commons Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/

CC BY 4.0 permits reuse, redistribution, and adaptation, provided that appropriate credit is given.

How to Cite

If you use this dataset, please cite:

Sergio A. Falcón-López, Synthetic and Authentic Forensic Lab (SAFL) Dataset. https://github.com/oigres5/SAFL
