# README
You have two options for getting the data.

1. TO SAVE TIME: download the pre-processed data from https://drive.google.com/drive/folders/1DxlJ-buZQNcc1LglxpmpbXkNXpjt4MfD?usp=sharing
  - Place the data folders inside a 'Data' folder in the project directory.

2. OR run this notebook if you want a fresh copy of the data and process everything yourself.

# Choose Root Directory
Uncomment the directory of your choice
- Mount Google Drive's root_dir
- OR define your own root_dir
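
As an alternative to commenting and uncommenting lines, the choice can be made at run time. The following is a minimal sketch, not part of the original notebook: `choose_root_dir` is a hypothetical helper that assumes the Colab root is `/content/drive/MyDrive/` and the local root is `./`.

```python
import importlib.util

def choose_root_dir(colab_root='/content/drive/MyDrive/', local_root='./'):
    """Return the Drive root when running in Colab, otherwise a local root."""
    try:
        # find_spec raises if the parent package `google` is not installed at all
        in_colab = importlib.util.find_spec('google.colab') is not None
    except (ImportError, ValueError):
        in_colab = False
    if in_colab:
        from google.colab import drive
        drive.mount('/content/drive')
        return colab_root
    return local_root

root_dir = choose_root_dir()
print(root_dir)
```

Outside Colab this prints the local root, so the rest of the notebook can keep building paths as `root_dir + '...'`.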



In [None]:
# ----- Google Colab directory and runtime -----
from google.colab import drive
drive.mount('/content/drive')
root_dir = '/content/drive/MyDrive/'

# ----- Custom directory, using MAGIC or local runtime -----
#root_dir = './'


# Download Data

In [None]:
# Create main folders
import os
sif_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC'
h_dir = root_dir + 'Voice_Cloning_Detection/Data/H-Voice'
iS_dir = root_dir + 'Voice_Cloning_Detection/Data/Import/SiF-DeepVC'
iH_dir = root_dir + 'Voice_Cloning_Detection/Data/Import/H-Voice'

# Create each folder (and any missing parents) if it does not already exist
for directory in [sif_dir, h_dir, iS_dir, iH_dir]:
    os.makedirs(directory, exist_ok=True)

### SiF-DeepVC

Instructions:
- https://github.com/dstsmallbird/SiF-DeepVC_Dataset
- Run the code to download the Google Drive .zip file (7.8 GB) at the zip_dir shown below

Structure:

RQ1 (Part 1)
- "for-real-validation": original human recordings from FoR Validation dataset
- "zh-real-test": original human recordings from MagicData Test dataset

RQ2 (Part 2)
- "for-real-validation-denoised": slightly denoised "for-real-validation"
- "zh-real-test-denoised": slightly denoised "zh-real-test"
- "zh-real-test-silenced": silence-removed "zh-real-test"

RQ3 (Part 3)
- "for-bh-madefake-final-r4k": cloned fake voices by SiF-DeepVC for Farid et al.
- "for-deep4s-madefake-final-r4k": cloned fake voices by SiF-DeepVC for Deep4SNet
- "for-rawnet-madefake-final-r4k": cloned fake voices by SiF-DeepVC for RawNet2

In [None]:
import gdown
import zipfile

# Google Drive file URL
url = 'https://drive.google.com/uc?id=1aaoG_jkREVBaglmW2M7cJmXU088Ndl8G'

# Zip download directory
zip_dir = root_dir + 'Voice_Cloning_Detection/Data/Import/SiF-DeepVC/DeepVC-Dataset.zip'

# Download the file
gdown.download(url, zip_dir, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1aaoG_jkREVBaglmW2M7cJmXU088Ndl8G
From (redirected): https://drive.google.com/uc?id=1aaoG_jkREVBaglmW2M7cJmXU088Ndl8G&confirm=t&uuid=d3683b74-18a3-43e9-9197-bf4531bb8265
To: /home/jupyter-shl224/Voice_Cloning_Detection/Data/Import/SiF-DeepVC/DeepVC-Dataset.zip
100%|██████████| 8.36G/8.36G [02:09<00:00, 64.3MB/s]


'./Voice_Cloning_Detection/Data/Import/SiF-DeepVC/DeepVC-Dataset.zip'

In [None]:
# Extraction target directory
zip_dir = root_dir + 'Voice_Cloning_Detection/Data/Import/SiF-DeepVC/DeepVC-Dataset.zip'
target_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/All'

# Create folder
import os
if not os.path.exists(target_dir):
    os.makedirs(target_dir)

# Unzip ALL files
!unzip "$zip_dir" -d "$target_dir"
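
The `!unzip` line above relies on IPython shell syntax and on an `unzip` binary being installed. Where neither is available, the standard-library `zipfile` module can do the same job. The sketch below is an assumption-labelled alternative: `extract_zip` is a hypothetical helper, and the demo runs on a tiny throwaway archive rather than the real dataset.

```python
import os
import tempfile
import zipfile

def extract_zip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, 'r') as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Self-contained demo on a placeholder archive (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as archive:
        archive.writestr('DeepVC-Dataset/RQ1/example.txt', 'placeholder')
    names = extract_zip(demo_zip, os.path.join(tmp, 'All'))
    extracted_ok = os.path.isfile(
        os.path.join(tmp, 'All', 'DeepVC-Dataset', 'RQ1', 'example.txt'))
    print(names, extracted_ok)
```

In the notebook itself the call would presumably be `extract_zip(zip_dir, target_dir)` with the two paths defined in the cell above.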


Archive:  ./Voice_Cloning_Detection/Data/Import/SiF-DeepVC/DeepVC-Dataset.zip
  inflating: ./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-deep4s-madefake-final-r4k/file11711.wav_16k.wav_norm.wav_mono.wav_silence.wav_02.wav.noisered.wav  
   creating: ./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-bh-madefake-final-r4k/
  inflating: ./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-bh-madefake-final-r4k/file10016.wav_16k.wav_norm.wav_mono.wav_silence.wav_00.wav.noisered.wav  
  inflating: ./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-bh-madefake-final-r4k/file10016.wav_16k.wav_norm.wav_mono.wav_silence.wav_01.wav.noisered.wav  
  inflating: ./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-bh-madefake-final-r4k/file10016.wav_16k.wav_norm.wav_mono.wav_silence.wav_02.wav.noisered.wav  
  inflating: ./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-bh-madefake-final-r

### H-Voice

Instructions:
- https://data.mendeley.com/datasets/k47yd3m28w/4
- Run the code to download the .zip file (87.8 MB) at the zip_dir shown below

Structure:
- Training_fake: 2088 histograms of fake voice recordings (2016 with Imitation and 72 with Deep Voice)
- Training_original: 2020 histograms of original voice recordings
- Validation_fake: 864 histograms of fake voice recordings (all with Imitation)
- Validation_original: 864 histograms of original voice recordings
- External_test1: 760 histograms (380 original + 380 fake with Imitation)
- External_test2: 76 histograms (4 original + 72 fake with Deep Voice)

Their Split:
- Train: 61.57%
- Validation: 25.90%
- Test: 12.53%
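
The split percentages appear to follow directly from the folder counts listed under Structure, assuming both External_test folders together form the test set. A short check of that arithmetic:

```python
# Histogram counts taken from the Structure list above
counts = {
    'Train': 2088 + 2020,      # Training_fake + Training_original
    'Validation': 864 + 864,   # Validation_fake + Validation_original
    'Test': 760 + 76,          # External_test1 + External_test2 (assumed grouping)
}
total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
print(total, shares)
```

With these counts the total is 6672 histograms, and the rounded shares match the percentages quoted above.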

In [None]:
import requests

# URL of the zip file
url = "https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/k47yd3m28w-4.zip"

# Path where you want to store the downloaded zip file
zip_dir = root_dir + 'Voice_Cloning_Detection/Data/Import/H-Voice/H-Voice_Dataset.zip'

# Send HTTP request to the URL of the zip file
response = requests.get(url)

# Check if the request is successful
if response.status_code == 200:
    # Write the contents of the response to a file
    with open(zip_dir, 'wb') as file:
        file.write(response.content)
else:
    print(f"Failed to download the file. HTTP Status Code: {response.status_code}")
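
The cell above holds the whole response in memory before writing it. That is fine for the 87.8 MB H-Voice archive, but a chunked copy is a safer pattern for larger files. The sketch below swaps `requests` for the standard-library `urllib` so it can be demonstrated offline; `download_streamed` is a hypothetical helper, and the `file://` URL only stands in for the real dataset URL. Note that `urlopen` raises an exception on HTTP error statuses instead of returning a status code to check.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download_streamed(url, dest_path, chunk_size=1024 * 1024):
    """Copy the response body to dest_path in chunks and return the bytes written."""
    os.makedirs(os.path.dirname(dest_path) or '.', exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest_path, 'wb') as out_file:
        shutil.copyfileobj(response, out_file, chunk_size)
    return os.path.getsize(dest_path)

# Offline demo: a local file:// URL stands in for the real download
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.bin'
    source.write_bytes(b'x' * 5000)
    size = download_streamed(source.as_uri(), os.path.join(tmp, 'Import', 'copy.zip'))
    print(size)
```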


In [None]:
# Your .zip directory -------------------------------------------------------------------------------
zip_dir = root_dir + 'Voice_Cloning_Detection/Data/Import/H-Voice/H-Voice_Dataset.zip'
target_dir = root_dir + 'Voice_Cloning_Detection/Data/H-Voice/All'

# Create folder
import os
if not os.path.exists(target_dir):
    os.makedirs(target_dir)

# Unzip ALL files
!unzip "$zip_dir" -d "$target_dir"

# Unzip the files in the subfolders
import zipfile
for subdir, dirs, files in os.walk(target_dir):
    print("target: ", target_dir)
    for file in files:
        # Check if the file is a zip file
        if file.endswith('.zip'):
            print("zip")
            zip_path = os.path.join(subdir, file)
            # Unzip the file in its current directory
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall(subdir)
            # Delete the zip file
            #os.remove(zip_path)


# Define directories -------------------------------------------------------------------------------
dir_1f = root_dir + 'Voice_Cloning_Detection/Data/H-Voice/Training_Set/fake'
dir_1r = root_dir + 'Voice_Cloning_Detection/Data/H-Voice/Training_Set/real'
dir_2f = root_dir + 'Voice_Cloning_Detection/Data/H-Voice/Validation_Set/fake'
dir_2r = root_dir + 'Voice_Cloning_Detection/Data/H-Voice/Validation_Set/real'
dir_3f = root_dir + 'Voice_Cloning_Detection/Data/H-Voice/Test_Set/fake'
dir_3r = root_dir + 'Voice_Cloning_Detection/Data/H-Voice/Test_Set/real'

# Create directories if they don't exist
for directory in [dir_1f, dir_1r, dir_2f, dir_2r, dir_3f, dir_3r]:
    if not os.path.exists(directory):
        os.makedirs(directory)


# Function to move contents of a directory -------------------------------------------------------------------------------
import shutil
def move_contents(source_dir, target_dir):
    for item in os.listdir(source_dir):
        source_item = os.path.join(source_dir, item)
        if os.path.isfile(source_item):
            shutil.move(source_item, target_dir)
        elif os.path.isdir(source_item):
            # Recursively move contents of subdirectories
            move_contents(source_item, target_dir)

# Move contents of source directories to target directories
move_contents(root_dir + 'Voice_Cloning_Detection/Data/H-Voice/All/Training_fake/Training_fake', dir_1f)
move_contents(root_dir + 'Voice_Cloning_Detection/Data/H-Voice/All/Training_original/Training_original', dir_1r)
move_contents(root_dir + 'Voice_Cloning_Detection/Data/H-Voice/All/Validation_fake/Validation_fake', dir_2f)
move_contents(root_dir + 'Voice_Cloning_Detection/Data/H-Voice/All/Validation_original/Validation_original', dir_2r)
move_contents(root_dir + 'Voice_Cloning_Detection/Data/H-Voice/All/External_test1/External_test1/FAKE', dir_3f)
move_contents(root_dir + 'Voice_Cloning_Detection/Data/H-Voice/All/External_test1/External_test1/ORIGINAL', dir_3r)
move_contents(root_dir + 'Voice_Cloning_Detection/Data/H-Voice/All/External_test2/External_test2/Fake', dir_3f)
move_contents(root_dir + 'Voice_Cloning_Detection/Data/H-Voice/All/External_test2/External_test2/Original', dir_3r)


# Count files after move -------------------------------------------------------------------------------
def count_files(directory):
    file_count = 0
    for root, dirs, files in os.walk(directory):
        file_count += len(files)
    return file_count

counts = {
    'dir_1f': count_files(dir_1f),
    'dir_1r': count_files(dir_1r),
    'dir_2f': count_files(dir_2f),
    'dir_2r': count_files(dir_2r),
    'dir_3f': count_files(dir_3f),
    'dir_3r': count_files(dir_3r)
}
print("\nFiles Count:")
for directory, count in counts.items():
    print(f"{directory}: {count} files")

Archive:  ./Voice_Cloning_Detection/Data/Import/H-Voice/H-Voice_Dataset.zip
   creating: ./Voice_Cloning_Detection/Data/H-Voice/All/Training_original/
  inflating: ./Voice_Cloning_Detection/Data/H-Voice/All/Training_original/Training_original.zip  
   creating: ./Voice_Cloning_Detection/Data/H-Voice/All/External_test2/
  inflating: ./Voice_Cloning_Detection/Data/H-Voice/All/External_test2/External_test2.zip  
   creating: ./Voice_Cloning_Detection/Data/H-Voice/All/Training_fake/
  inflating: ./Voice_Cloning_Detection/Data/H-Voice/All/Training_fake/Training_fake.zip  
   creating: ./Voice_Cloning_Detection/Data/H-Voice/All/Validation_fake/
  inflating: ./Voice_Cloning_Detection/Data/H-Voice/All/Validation_fake/Validation_fake.zip  
   creating: ./Voice_Cloning_Detection/Data/H-Voice/All/External_test1/
  inflating: ./Voice_Cloning_Detection/Data/H-Voice/All/External_test1/External_test1.zip  
   creating: ./Voice_Cloning_Detection/Data/H-Voice/All/Validation_original/
  inflating: ./Voi

# Process SiF Data


## Sample data
* 4.5k real voices from FoR
* 4.5k fake voices generated against Farid et al. + RawNet2
* 1.0k fake voices generated against Deep4SNet (held out for testing later)
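
Because the sampling is random, two runs of the notebook will generally pick different files. If reproducible samples are wanted, a seeded local generator is one option. This is a sketch under that assumption: `sample_paths` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import random

def sample_paths(paths, k, seed=0):
    """Return up to k paths chosen without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)             # local generator; global random state is untouched
    k = min(k, len(paths))                # avoid ValueError when fewer than k files exist
    return rng.sample(sorted(paths), k)   # sort first so directory listing order does not matter

demo_paths = [f'file{i}.wav' for i in range(20)]
first = sample_paths(demo_paths, 5, seed=42)
second = sample_paths(list(reversed(demo_paths)), 5, seed=42)
print(first == second)
```

Sorting before sampling matters because `os.listdir` returns entries in arbitrary order, so the same seed alone would not guarantee the same sample across machines.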


In [None]:
import os
import random

# Define the directories
dir_real_for = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ1/for-real-validation'
dir_fake_farid = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-bh-madefake-final-r4k'
dir_fake_rawnet = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-rawnet-madefake-final-r4k'
dir_fake_deep4s = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-deep4s-madefake-final-r4k'

# Get a list of all .wav file paths in the directories
files_real_for = [os.path.join(dir_real_for, f) for f in os.listdir(dir_real_for) if f.endswith('.wav')]
files_fake_farid = [os.path.join(dir_fake_farid, f) for f in os.listdir(dir_fake_farid) if f.endswith('.wav')]
files_fake_rawnet = [os.path.join(dir_fake_rawnet, f) for f in os.listdir(dir_fake_rawnet) if f.endswith('.wav')]
files_fake_deep4s = [os.path.join(dir_fake_deep4s, f) for f in os.listdir(dir_fake_deep4s) if f.endswith('.wav')]

# check files exist
print("All files:")
print("RQ1/for-real-validation ", len(files_real_for))
print("RQ3/for-bh-madefake-final-r4k ", len(files_fake_farid))
print("RQ3/for-rawnet-madefake-final-r4k ", len(files_fake_rawnet))
print("RQ3/for-deep4s-madefake-final-r4k ", len(files_fake_deep4s))

# Randomly sample from each file path
sample_real = random.sample(files_real_for, 4500)
sample_fake_1 = random.sample(files_fake_farid, 2995)
sample_fake_2 = random.sample(files_fake_rawnet, 1505)
sample_fake_deep4s = random.sample(files_fake_deep4s, 1000)

# check samples exist
print("\nSamples sizes:")
print("RQ1/for-real-validation ", len(sample_real))
print("RQ3/for-bh-madefake-final-r4k ", len(sample_fake_1))
print("RQ3/for-rawnet-madefake-final-r4k ", len(sample_fake_2))
print("RQ3/for-deep4s-madefake-final-r4k ", len(sample_fake_deep4s))


All files:
RQ1/for-real-validation  5400
RQ3/for-bh-madefake-final-r4k  8720
RQ3/for-rawnet-madefake-final-r4k  1505
RQ3/for-deep4s-madefake-final-r4k  9015

Samples sizes:
RQ1/for-real-validation  4500
RQ3/for-bh-madefake-final-r4k  2995
RQ3/for-rawnet-madefake-final-r4k  1505
RQ3/for-deep4s-madefake-final-r4k  1000


## Label Samples

In [None]:
# Create a dictionary to store file paths and labels
data = {}
data_deep4s = {}

# Assign labels
for file_path in sample_real:
    data[file_path] = 'real'
for file_path in sample_fake_1 + sample_fake_2:
    data[file_path] = 'fake'
for file_path in sample_fake_deep4s:
    data_deep4s[file_path] = 'fake'

# Check some files and their labels
print("Total number of files:", len(data))

print("\nLabels for the first 3 files:")
for file_path, label in list(data.items())[:3]:
    print(file_path, "->", label)

print("\nLabels for the last 3 files:")
for file_path, label in list(data.items())[-3:]:
    print(file_path, "->", label)

# Check some files and their labels in data_deep4s
print("\nTotal number of files in data_deep4s:", len(data_deep4s))

print("\nLabels for the first 3 files in data_deep4s:")
for file_path, label in list(data_deep4s.items())[:3]:
    print(file_path, "->", label)

print("\nLabels for the last 3 files in data_deep4s:")
for file_path, label in list(data_deep4s.items())[-3:]:
    print(file_path, "->", label)

Total number of files: 9000

Labels for the first 3 files:
./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ1/for-real-validation/file13146.wav_16k.wav_norm.wav_mono.wav_silence.wav -> real
./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ1/for-real-validation/file32757.wav_16k.wav_norm.wav_mono.wav_silence.wav -> real
./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ1/for-real-validation/file4655.wav_16k.wav_norm.wav_mono.wav_silence.wav -> real

Labels for the last 3 files:
./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-rawnet-madefake-final-r4k/file11116.wav_16k.wav_norm.wav_mono.wav_silence.wav_01.wav.noisered.wav -> fake
./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-rawnet-madefake-final-r4k/file3076.wav_16k.wav_norm.wav_mono.wav_silence.wav_01.wav.noisered.wav -> fake
./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-rawnet-madefake-final-r4k/file18593.wav_16k.wav_norm.wav_mo

## Split data
* Train: 70%
* Validation: 15%
* Test: 15%
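
The 70/15/15 split can be sketched as a small reusable function. `split_keys` and its parameters are hypothetical names used for illustration; integer arithmetic is used for the cut points so the sizes do not depend on floating-point rounding.

```python
import random

def split_keys(keys, train_pct=70, val_pct=15, seed=0):
    """Shuffle keys reproducibly and cut them into train / validation / test lists."""
    shuffled = sorted(keys)
    random.Random(seed).shuffle(shuffled)
    train_end = len(shuffled) * train_pct // 100
    val_end = train_end + len(shuffled) * val_pct // 100
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

demo_keys = [f'file{i}.wav' for i in range(9000)]
demo_train, demo_val, demo_test = split_keys(demo_keys)
print(len(demo_train), len(demo_val), len(demo_test))
```

For 9000 items this gives 6300, 1350, and 1350 entries, and the test list absorbs any remainder when the total is not evenly divisible.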

In [None]:
import random

# Shuffle the dictionary keys
keys = list(data.keys())
random.shuffle(keys)
deep4s_keys = list(data_deep4s.keys())
random.shuffle(deep4s_keys)

# Calculate the sizes of each set
total_size = len(keys)
train_size = int(total_size * 0.7)
val_size = int(total_size * 0.15)

# Divide the keys into training, validation, and testing sets
train_keys = keys[:train_size]
val_keys = keys[train_size:train_size + val_size]
test_keys = keys[train_size + val_size:]

# Retrieve the corresponding file paths and labels for each set
train_set = [(key, data[key]) for key in train_keys]
val_set = [(key, data[key]) for key in val_keys]
test_set = [(key, data[key]) for key in test_keys]
deep4s_set = [(key, data_deep4s[key]) for key in deep4s_keys]

# Print sizes of each set
print("Training set size:", len(train_set))
print("Validation set size:", len(val_set))
print("Testing set size:", len(test_set))
print("Deep4S set size:", len(deep4s_set))

Training set size: 6300
Validation set size: 1350
Testing set size: 1350
Deep4S set size: 1000


In [None]:
# Check 1 example from each
print("\n", train_set[0])
print("\n", val_set[0])
print("\n", test_set[0])
print("\n", deep4s_set[0])


 ('./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ1/for-real-validation/file7697.wav_16k.wav_norm.wav_mono.wav_silence.wav', 'real')

 ('./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ1/for-real-validation/file6892.wav_16k.wav_norm.wav_mono.wav_silence.wav', 'real')

 ('./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ1/for-real-validation/file9043.wav_16k.wav_norm.wav_mono.wav_silence.wav', 'real')

 ('./Voice_Cloning_Detection/Data/SiF-DeepVC/All/DeepVC-Dataset/RQ3/for-deep4s-madefake-final-r4k/file24320.wav_16k.wav_norm.wav_mono.wav_silence.wav_03.wav.noisered.wav', 'fake')


# Feature Extraction (ONLY RUN ONCE)
- No need to run this section if the histograms are already stored in the set folders.
- Extracts histograms from SiF-DeepVC only; H-Voice already ships as histograms.

## Histograms

### Regular Function
- no limitations
- imitate H-Voice histograms
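
To show what the histogram step produces without needing an audio file, here is a minimal sketch on a synthetic tone. The 16 kHz sample rate and the 440 Hz tone are assumptions for the demo only; the bin count and range mirror the values used in this notebook.

```python
import numpy as np

sr = 16000                                  # assumed sample rate, demo only
t = np.arange(sr) / sr                      # one second of time stamps
audio = 0.5 * np.sin(2 * np.pi * 440 * t)   # synthetic tone standing in for a recording

# 256 equal-width amplitude bins over [-1, 1], as in the notebook's functions
hist, bins = np.histogram(audio, bins=256, range=(-1, 1))
print(hist.shape, bins.shape, hist.sum())
```

Every sample lands in exactly one bin, so the counts sum to the number of samples, and `bins` holds one more edge than there are bins.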

In [None]:
import os
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
def compute_histogram(file_path, dir, iter):
    # Load audio file
    audio, sr = librosa.load(file_path, sr=None)

    # Calculate histogram of audio
    hist, bins = np.histogram(audio, bins=256, range=(-1, 1)) # Ours: 2^8 | Original: 2^16 bins

    # Plot histogram
    plt.figure()
    plt.bar(bins[:-1], hist, width=(bins[1] - bins[0]), color='black')
    #plt.title('Histogram of Audio')
    #plt.xlabel('Amplitude')
    #plt.ylabel('Frequency')
    plt.savefig(os.path.join(dir, f'hist_{iter}.png'))
    plt.close()
    #plt.show()
    #print(hist.shape)
    #print(hist.dtype)
    return hist

In [None]:
# # Get visuals as close to H-Voice as possible
# # https://www.sciencedirect.com/science/article/pii/S2352340920302250

# def compute_histogram_test(file_path):
#     # Load audio file
#     audio, sr = librosa.load(file_path, sr=None)

#     # Calculate histogram of audio
#     hist, bins = np.histogram(audio, bins=256, range=(-1, 1)) # Ours: 2^8 | Original: 2^16 bins

#     # Plot histogram
#     plt.figure()
#     plt.bar(bins[:-1], hist, width=(bins[1] - bins[0]), color='black')
#     #plt.title('{}'.format(file_path))
#     #plt.xlabel('Amplitude')
#     #plt.ylabel('Frequency')
#     plt.show()
#     #plt.savefig(os.path.join(dir, f'hist_{iter}.png'))
#     plt.close()
#     #print(hist.shape)
#     #print(hist.dtype)

# # Test them
# fileInput = deep4s_set[0][0]
# print(fileInput)
# compute_histogram_test(fileInput)


### Filtered Function - Limit Histograms under 4 kHz (WIP)
- Create a more generalized model by training our model on both H-Voice and SiF-DeepVC data sets.
- Limit the histograms to content below 4000 Hz. Because SiF-DeepVC's handcrafted SiFs were designed to sit above 4 kHz, this tests the model's capabilities when the SiFs are ignored.
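
A quick way to sanity-check the 4 kHz limit is to filter a synthetic signal with one tone below and one tone above the cutoff. The sketch below assumes a 4th-order Butterworth low-pass applied with `filtfilt` at 44100 Hz; `lowpass` and `tone_amplitude` are hypothetical helper names, and the 1 kHz and 8 kHz tones are chosen only for illustration.

```python
import numpy as np
import scipy.signal

def lowpass(audio, cutoff_hz, sr, order=4):
    """Zero-phase Butterworth low-pass; cutoff is normalized by the Nyquist frequency."""
    b, a = scipy.signal.butter(order, cutoff_hz / (sr / 2), btype='low')
    return scipy.signal.filtfilt(b, a, audio)

def tone_amplitude(signal, freq_hz, sr):
    """Estimate the amplitude of a sinusoid at freq_hz from the FFT magnitude."""
    spectrum = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
    return spectrum[int(round(freq_hz * len(signal) / sr))]

sr = 44100
t = np.arange(sr) / sr                                # one second
low_tone = 0.4 * np.sin(2 * np.pi * 1000 * t)         # below the cutoff
high_tone = 0.4 * np.sin(2 * np.pi * 8000 * t)        # above the cutoff
filtered = lowpass(low_tone + high_tone, 4000, sr)

kept = tone_amplitude(filtered, 1000, sr)
removed = tone_amplitude(filtered, 8000, sr)
print(round(kept, 3), round(removed, 4))
```

The 1 kHz component should pass almost unchanged while the 8 kHz component is strongly attenuated, which is the behavior the filtered histograms rely on.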

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import librosa.display

# Function to plot spectrogram
def plot_spectrogram(audio, sr, title):
    plt.figure(figsize=(10, 4))
    spectrogram = librosa.display.specshow(librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max), sr=sr, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title(title)
    plt.tight_layout()
    #plt.show()
    plt.close()



In [None]:
import numpy as np
import scipy.signal
import scipy.io.wavfile

def filter(audio_data, cutoff_frequency, sr):
    # Define the filter
    nyquist_frequency = sr / 2
    cutoff_normalized = cutoff_frequency / nyquist_frequency
    b, a = scipy.signal.butter(4, cutoff_normalized, btype='low')

    # Apply the filter to each channel
    filtered_audio = np.apply_along_axis(lambda x: scipy.signal.filtfilt(b, a, x), axis=0, arr=audio_data)

    return filtered_audio

# Load the original audio file
#sampling_rate, audio_data = scipy.io.wavfile.read(file_path)


In [None]:
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import os
def compute_histogram_filtered(file_path, dir, iter):
    # Load audio file
    audio, sr = librosa.load(file_path, sr=44100)
    # Plot spectrogram of original audio
    #plot_spectrogram(audio, 44100, title='Original Audio Spectrogram')


    cutoff_frequency = 4000
    filtered_audio = filter(audio, cutoff_frequency, sr=44100)
    # Plot spectrogram of filtered audio
    #plot_spectrogram(filtered_audio, 44100, title='Filtered Audio Spectrogram')
    # Calculate histogram of audio
    hist, bins = np.histogram(filtered_audio, bins=256, range=(-1, 1)) # 2^8 bins

    # Plot histogram
    plt.figure()
    plt.bar(bins[:-1], hist, width=(bins[1] - bins[0]), color='black')
    #plt.title('Histogram of Audio')
    #plt.xlabel('Amplitude')
    #plt.ylabel('Frequency')
    plt.savefig(os.path.join(dir, f'hist_{iter}.png'))
    #plt.show()
    plt.close()
    #print(hist.shape)
    #print(hist.dtype)
    return hist

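# Test (saves hist_2.png into the filtered training folder under root_dir)
test_dir = root_dir + "Voice_Cloning_Detection/Data/SiF-DeepVC/Training_Set_Filtered"
os.makedirs(test_dir, exist_ok=True)
compute_histogram_filtered(test_set[0][0], test_dir, 2)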

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     3,     1,     2,     0,     2,     6,     3,     3,
           3,     1,     0,     4,     5,     3,     8,     4,     5,
           2,     4,     3,     3,     3,     6,     4,     2,     2,
           2,     6,     7,     6,     6,     5,     5,     7,     4,
           8,     8,     5,     8,     6,    11,    10,    13,    11,
          21,    17,    30,    33,    30,    34,    30,    43,    49,
          44,    54,    71,    70,    75,    86,    99,   111,   125,
         117,   120,   128,   147,   166,   167,   182,   187,   193,
         255,   275,   270,   303,   334,   395,   420,   500,   487,
         567,   624,   649,   667,   801,   920,  1032,  1197,  1342,
        1481,  1631,  1766,  2036,  2298,  2495,  2813,  3256,  3581,
        3871,  4480,  5167,  5880,  6573,  7060,  8087,  9424, 12517,
       18862, 56367,

### Store Histograms - Regular

In [None]:
# Training Set
#   This must run in batches. Trying to create 4200 graphs will crash the program due to a lack of RAM.
#   Batches can be as large as 900.

save_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Training_Set/'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

batch_size = 900
num_batches = len(train_set) // batch_size  # Calculate the number of batches

for batch_index in range(num_batches):
    start_index = batch_index * batch_size
    end_index = (batch_index + 1) * batch_size

    for i, (file_path, label) in enumerate(train_set[start_index:end_index]):
        label_dir = os.path.join(save_dir, label)
        if not os.path.isdir(label_dir):
            os.makedirs(label_dir)  # Create the directory if it doesn't exist

        hist = compute_histogram(file_path, label_dir, i + start_index)

# Process the remaining items (if any) after the last full batch
remaining_items = len(train_set) % batch_size
if remaining_items > 0:
    start_index = num_batches * batch_size
    for i, (file_path, label) in enumerate(train_set[start_index:]):
        label_dir = os.path.join(save_dir, label)
        if not os.path.isdir(label_dir):
            os.makedirs(label_dir)  # Create the directory if it doesn't exist

        hist = compute_histogram(file_path, label_dir, i + start_index)
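
The batched loop above numbers the files with a running index, so a single pass with `enumerate` would yield the same file names. The sketch below shows that pattern as one helper that could also serve the validation, test, and Deep4S sets. `store_histograms` and `fake_histogram_fn` are hypothetical names; the stand-in function only writes empty placeholder files, and the sketch does not by itself address the RAM concern noted in the cell.

```python
import os
import tempfile

def store_histograms(dataset, save_dir, histogram_fn):
    """Call histogram_fn(file_path, label_dir, index) for every (file_path, label) pair."""
    for index, (file_path, label) in enumerate(dataset):
        label_dir = os.path.join(save_dir, label)
        os.makedirs(label_dir, exist_ok=True)   # one folder per label
        histogram_fn(file_path, label_dir, index)

def fake_histogram_fn(file_path, label_dir, index):
    """Stand-in for compute_histogram: writes an empty placeholder file."""
    open(os.path.join(label_dir, f'hist_{index}.png'), 'wb').close()

with tempfile.TemporaryDirectory() as tmp:
    demo_set = [('a.wav', 'real'), ('b.wav', 'fake'), ('c.wav', 'real')]
    store_histograms(demo_set, tmp, fake_histogram_fn)
    written = {label: sorted(os.listdir(os.path.join(tmp, label)))
               for label in ('real', 'fake')}
    print(written)
```

In the notebook, the call would presumably look like `store_histograms(train_set, save_dir, compute_histogram)`.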


In [None]:
# Validation Set
save_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Validation_Set/'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

for i, (file_path, label) in enumerate(val_set):
    label_dir = os.path.join(save_dir, label)
    os.makedirs(label_dir, exist_ok=True)  # Create the directory if it doesn't exist

    hist = compute_histogram(file_path, label_dir, i)

In [None]:
# Test Set
save_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Test_Set/'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

for i, (file_path, label) in enumerate(test_set):
    label_dir = os.path.join(save_dir, label)
    os.makedirs(label_dir, exist_ok=True)  # Create the directory if it doesn't exist

    hist = compute_histogram(file_path, label_dir, i)

In [None]:
# Deep4S Test Set - SiF-DeepVC built to bypass Deep4SNet
save_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Deep4SNet_Target_Test_Set/'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

for i, (file_path, label) in enumerate(deep4s_set):
    label_dir = os.path.join(save_dir, label)
    os.makedirs(label_dir, exist_ok=True)  # Create the directory if it doesn't exist

    hist = compute_histogram(file_path, label_dir, i)

In [None]:
# Count histograms in each directory
def count_png_files(directory):
    count = 0
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".png"):
                count += 1
    return count

directories = [
    root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Training_Set/',
    root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Validation_Set/',
    root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Test_Set/',
    root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Deep4SNet_Target_Test_Set/'
]

for directory in directories:
    total_count = 0
    print("Directory:", directory)
    for sub_dir in os.listdir(directory):
        sub_dir_path = os.path.join(directory, sub_dir)
        if os.path.isdir(sub_dir_path):
            png_count = count_png_files(sub_dir_path)
            total_count += png_count
            print("   Subdirectory:", sub_dir, "| Histograms:", png_count)
    print("Total histograms: ", total_count)
    print()

Directory: ./Voice_Cloning_Detection/Data/SiF-DeepVC/Training_Set/
   Subdirectory: real | Histograms: 3518
   Subdirectory: fake | Histograms: 3511
Total histograms:  7029

Directory: ./Voice_Cloning_Detection/Data/SiF-DeepVC/Validation_Set/
   Subdirectory: fake | Histograms: 671
   Subdirectory: real | Histograms: 679
Total histograms:  1350

Directory: ./Voice_Cloning_Detection/Data/SiF-DeepVC/Test_Set/
   Subdirectory: fake | Histograms: 665
   Subdirectory: real | Histograms: 685
Total histograms:  1350

Directory: ./Voice_Cloning_Detection/Data/SiF-DeepVC/Deep4SNet_Target_Test_Set/
   Subdirectory: fake | Histograms: 1000
Total histograms:  1000



### Store Histograms - Filtered

In [None]:
# Training Set
#   This must run in batches. Trying to create 4200 graphs will crash the program due to a lack of RAM.
#   Batches can be as large as 900.

save_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Training_Set_Filtered/'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

batch_size = 900
num_batches = len(train_set) // batch_size  # Calculate the number of batches

for batch_index in range(num_batches):
    start_index = batch_index * batch_size
    end_index = (batch_index + 1) * batch_size

    for i, (file_path, label) in enumerate(train_set[start_index:end_index]):
        label_dir = os.path.join(save_dir, label)
        if not os.path.isdir(label_dir):
            os.makedirs(label_dir)  # Create the directory if it doesn't exist

        hist = compute_histogram_filtered(file_path, label_dir, i + start_index)

# Process the remaining items (if any) after the last full batch
remaining_items = len(train_set) % batch_size
if remaining_items > 0:
    start_index = num_batches * batch_size
    for i, (file_path, label) in enumerate(train_set[start_index:]):
        label_dir = os.path.join(save_dir, label)
        if not os.path.isdir(label_dir):
            os.makedirs(label_dir)  # Create the directory if it doesn't exist

        hist = compute_histogram_filtered(file_path, label_dir, i + start_index)


In [None]:
# Validation Set
save_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Validation_Set_Filtered/'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

for i, (file_path, label) in enumerate(val_set):
    label_dir = os.path.join(save_dir, label)
    os.makedirs(label_dir, exist_ok=True)  # Create the directory if it doesn't exist

    hist = compute_histogram_filtered(file_path, label_dir, i)

In [None]:
# Test Set
save_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Test_Set_Filtered/'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

for i, (file_path, label) in enumerate(test_set):
    label_dir = os.path.join(save_dir, label)
    os.makedirs(label_dir, exist_ok=True)  # Create the directory if it doesn't exist

    hist = compute_histogram_filtered(file_path, label_dir, i)

In [None]:
# Deep4S Test Set - SiF-DeepVC built to bypass Deep4SNet
save_dir = root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Deep4SNet_Target_Test_Set_Filtered/'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

for i, (file_path, label) in enumerate(deep4s_set):
    label_dir = os.path.join(save_dir, label)
    os.makedirs(label_dir, exist_ok=True)  # Create the directory if it doesn't exist

    hist = compute_histogram_filtered(file_path, label_dir, i)

In [None]:
# Count histograms in each directory
def count_png_files(directory):
    count = 0
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".png"):
                count += 1
    return count

directories = [
    root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Training_Set_Filtered/',
    root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Validation_Set_Filtered/',
    root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Test_Set_Filtered/',
    root_dir + 'Voice_Cloning_Detection/Data/SiF-DeepVC/Deep4SNet_Target_Test_Set_Filtered/'
]

for directory in directories:
    total_count = 0
    print("Directory:", directory)
    for sub_dir in os.listdir(directory):
        sub_dir_path = os.path.join(directory, sub_dir)
        if os.path.isdir(sub_dir_path):
            png_count = count_png_files(sub_dir_path)
            total_count += png_count
            print("   Subdirectory:", sub_dir, "| Histograms:", png_count)
    print("Total histograms: ", total_count)
    print()

Directory: ./Voice_Cloning_Detection/Data/SiF-DeepVC/Training_Set_Filtered/
   Subdirectory: .ipynb_checkpoints | Histograms: 0
   Subdirectory: real | Histograms: 3145
   Subdirectory: fake | Histograms: 3157
Total histograms:  6302

Directory: ./Voice_Cloning_Detection/Data/SiF-DeepVC/Validation_Set_Filtered/
   Subdirectory: real | Histograms: 700
   Subdirectory: fake | Histograms: 651
Total histograms:  1351

Directory: ./Voice_Cloning_Detection/Data/SiF-DeepVC/Test_Set_Filtered/
   Subdirectory: real | Histograms: 656
   Subdirectory: fake | Histograms: 694
Total histograms:  1350

Directory: ./Voice_Cloning_Detection/Data/SiF-DeepVC/Deep4SNet_Target_Test_Set_Filtered/
   Subdirectory: fake | Histograms: 1000
Total histograms:  1000

