------------------------------------------------------------------------------------------------------------------------
### <b>Table of Contents</b>

0. Background

1. Import functions

2. Load data

3. Explore data<br>
    3-1. Examine and transform data formats<br>
    3-2. Plot histograms and box plots<br>
    3-3. Plot a correlation heatmap<br>
    3-4. Plot seasonal decomposition<br>
    3-5. Plot autocorrelations

3. Build a classification model to assess how well the cloned audio matches the target speaker<br>
    3-1. Convert .wav files to machine-readable form<br>
    3-2. Train a model<br>
    3-3. Evaluate classification accuracy on test data

4. First, build a voice cloning (VC) system that, given a source speaker’s spoken audio, converts it to the target speaker’s voice<br>
    4-1. Use the TIMIT dataset for the VC system, as it consists of aligned text-audio data from various speakers.<br>
    4-2. Use Word Error Rate (WER) to automatically evaluate the speech generation part of the VC system.<br>
    https://medium.com/ibm-watson-speech-services/new-python-scripts-to-measure-word-error-rate-on-watson-speech-to-text-77ecaa513f60<br>
    https://github.com/jitsi/jiwer<br>
    &emsp; 4-2-1. Transcribe the generated speech to text<br>
    &emsp; 4-2-2. Measure WER between the original script and the transcribed text<br>
    4-3. Also report speaker classification accuracy to assess how well the generated audio matches the target speaker.

5. Next, build a machine learning system that detects whether spoken audio is natural speech or synthetically generated by a machine.<br>
    5-1. For the fake audio detection (FAD) system, use the CommonVoice dataset: its thousands of naturally spoken clips serve as golden human audio (positive examples), and negative examples are created by using the VC system as an automatic data/label generator.<br>
    5-2. Evaluate the FAD models using F-score, with positive labels coming from the ground-truth dataset and negative labels generated by the VC system.



8. Conclusion
------------------------------------------------------------------------------------------------------------------------

### <b>0. Background</b>

This project is for a cyber security company providing products and services that can detect whether audio and video media is authentic or fake.

We are given two publicly available speech corpora: the <a href="https://github.com/philipperemy/timit">TIMIT</a> and <a href="https://commonvoice.mozilla.org/en/datasets">CommonVoice</a> datasets. We will build two machine learning systems using these datasets as follows:

1. A voice cloning (VC) system that converts a given source speaker's spoken audio into the target speaker's voice.
2. A fake audio detection (FAD) system that detects whether spoken audio is natural speech or synthetically generated by a machine.

More details can be found in <a href="https://github.com/henryhyunwookim/K7h2vHrgG1Gl0S2r#readme">README</a>.

### <b>1. Import functions</b>

In [1]:
from TTS.api import TTS
# https://tts.readthedocs.io/en/latest/inference.html

import sys
from pathlib import Path
import pandas as pd

### <b>2. Load data</b>

In [2]:
# Define file path.
root_dir = Path(sys.path[0])

timit_dir = root_dir / 'data' / 'TIMIT' / 'archive'
common_voice_dir = root_dir / 'data' / 'CommonVoice' / \
    'cv-corpus-14.0-delta-2023-06-23-en.tar' / \
    'cv-corpus-14.0-delta-2023-06-23-en' / \
    'cv-corpus-14.0-delta-2023-06-23' / 'en' / 'clips'

In [3]:
train_csv = pd.read_csv(timit_dir / 'train_data.csv', index_col='index').dropna(how='all')
train_csv.index = train_csv.index.astype(int).astype(str)

test_csv = pd.read_csv(timit_dir / 'test_data.csv', index_col='index').dropna(how='any')
test_csv.index = test_csv.index.astype(int).astype(str)

train_csv.head()

index,test_or_train,dialect_region,speaker_id,filename,path_from_data_dir,path_from_data_dir_windows,is_converted_audio,is_audio,is_word_file,is_phonetic_file,is_sentence_file
1,TRAIN,DR4,MMDM0,SI681.WAV.wav,TRAIN/DR4/MMDM0/SI681.WAV.wav,TRAIN\\DR4\\MMDM0\\SI681.WAV.wav,True,True,False,False,False
2,TRAIN,DR4,MMDM0,SI1311.PHN,TRAIN/DR4/MMDM0/SI1311.PHN,TRAIN\\DR4\\MMDM0\\SI1311.PHN,False,False,False,True,False
3,TRAIN,DR4,MMDM0,SI1311.WRD,TRAIN/DR4/MMDM0/SI1311.WRD,TRAIN\\DR4\\MMDM0\\SI1311.WRD,False,False,True,False,False
4,TRAIN,DR4,MMDM0,SX321.PHN,TRAIN/DR4/MMDM0/SX321.PHN,TRAIN\\DR4\\MMDM0\\SX321.PHN,False,False,False,True,False
5,TRAIN,DR4,MMDM0,SX321.WRD,TRAIN/DR4/MMDM0/SX321.WRD,TRAIN\\DR4\\MMDM0\\SX321.WRD,False,False,True,False,False


In [82]:
# Keep only the converted audio files (*.WAV.wav).
train_audio_path = train_csv.loc[train_csv['is_converted_audio'] == True, 'path_from_data_dir']

# Subpaths look like TRAIN/<dialect_region>/<speaker_id>/<filename>.
# Use .iloc for positional access, because the index holds string labels.
source_audio_subpath = train_audio_path.iloc[0]
source_speaker_id = source_audio_subpath.split('/')[2]
source_audio_file = source_audio_subpath.split('/')[3]
source_file_id = source_audio_file.split('.')[0]
source_audio_path = timit_dir / 'data' / source_audio_subpath

# Find the transcript (.TXT) that belongs to the same speaker and utterance.
source_text_subpath = train_csv.loc[
    (train_csv['speaker_id'] == source_speaker_id)
    & (train_csv['filename'] == source_file_id + '.TXT'),
    'path_from_data_dir'
].iloc[0]
source_text_path = timit_dir / 'data' / source_text_subpath
# Skip the first two whitespace-separated fields (sample offsets in TIMIT
# transcripts) and keep the sentence itself.
with open(source_text_path) as txt:
    source_text = ' '.join(txt.read().split()[2:])
print(f'Source path: {source_audio_path}')

# Use the last converted audio file as the target speaker's reference.
target_audio_subpath = train_audio_path.iloc[-1]
target_speaker_id = target_audio_subpath.split('/')[2]
target_audio_file = target_audio_subpath.split('/')[3]
target_file_id = target_audio_file.split('.')[0]
target_audio_path = timit_dir / 'data' / target_audio_subpath
print(f'Target path: {target_audio_path}')

# Name the output folder and file after the source/target speaker and file IDs.
output_folder = root_dir / 'output' / f'{source_speaker_id}-{source_file_id}_to_{target_speaker_id}-{target_file_id}'
output_filename = f'{source_speaker_id}-{source_file_id}_to_{target_speaker_id}-{target_file_id}.wav'

Source path: d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\data\TIMIT\archive\data\TRAIN\DR4\MMDM0\SI681.WAV.wav
Target path: d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\data\TIMIT\archive\data\TRAIN\DR8\MRDM0\SA2.WAV.wav


### <b>3. Clone audio file</b>

In [84]:
# Copy original source and target files to the destination folder
# for an easier review of the output file.

from shutil import copy2

# Create the output folder, including any missing parent folders.
output_folder.mkdir(parents=True, exist_ok=True)

# copy2 also preserves file metadata; skip files that were already copied.
if not (output_folder / source_audio_file).exists():
    copy2(src=source_audio_path, dst=output_folder / source_audio_file)
if not (output_folder / target_audio_file).exists():
    copy2(src=target_audio_path, dst=output_folder / target_audio_file)

In [85]:
# Example voice conversion: convert the speaker of source_wav to the speaker of target_wav.
# On first use, the model is downloaded to C:\Users\Admin\AppData\Local\tts\voice_conversion_models--multilingual--vctk--freevc24
multi_tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False, gpu=False)
multi_tts.voice_conversion_to_file(
    source_wav=str(source_audio_path),
    target_wav=str(target_audio_path),
    file_path=output_folder / f'multi_{output_filename}'
)

 > voice_conversion_models/multilingual/vctk/freevc24 is already downloaded.
 > Using model: freevc
 > Loading pretrained speaker encoder model ...
Loaded the voice encoder model on cpu in 0.09 seconds.


WindowsPath('d:/OneDrive/GitHub/Apziva/K7h2vHrgG1Gl0S2r/output/MMDM0-SI681_to_MRDM0-SA2/multi_MMDM0-SI681_to_MRDM0-SA2.wav')

In [86]:
en_models = [model for model in TTS.list_models() if '/en/' in model]
# TTS with on-the-fly voice conversion
for model in en_models:
    # Set the name before the try block so the error message can always use it.
    model_name = model.split('/')[-1]
    try:
        en_tts = TTS(model)
        en_tts.tts_with_vc_to_file(
            source_text,
            speaker_wav=str(target_audio_path),
            file_path=output_folder / f'{model_name}_{output_filename}'
        )
    except Exception as e:
        print(e)
        print(f'Failed to run {model_name}.')

No API token found for 🐸Coqui Studio voices - https://coqui.ai 
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`

 > tts_models/en/ek1/tacotron2 is already downloaded.
 > vocoder_models/en/ek1/wavegrad is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-10
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.8
 | > preemphasis:0.99
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_pa

# evaluate results in D:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2

# Other VC models:

- https://paperswithcode.com/task/voice-conversion, which points to:
  - https://github.com/bshall/VectorQuantizedCPC
  - https://github.com/bshall/knn-vc
- Neural network-based voice conversion models, such as CycleGAN, StarGAN-VC, or Tacotron 2
- VITS
- VALL-E

3-1. Speech to text using the output file

1. https://github.com/facebookresearch/fairseq/blob/main/examples/mms/asr/tutorial/MMS_ASR_Inference_Colab.ipynb
* 4: Run Inference and transcribe your audio(s)

To improve the transcription quality, you can use language-model (LM) decoding by following the instructions in <a href='https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr'>ASR LM decoding</a>.

<br><br>
2. https://github.com/facebookresearch/fairseq/blob/main/examples/mms/lid/tutorial/MMS_LID_Inference_Colab.ipynb
* 4: Run Inference and transcribe your audio(s)
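
Once the converted audio has been transcribed, step 4-2-2 of the outline compares the transcript with the original script. The outline links to `jiwer` for this. The block below is only a dependency-free sketch of the same idea: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the example strings are assumptions for illustration, not part of the project, and text normalization (punctuation, casing) is reduced to lowercasing here, so the numbers may differ slightly from a library's.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical strings, not taken from TIMIT.
reference_text = "the quick brown fox jumps"
transcribed_text = "the quick brown box"
wer = word_error_rate(reference_text, transcribed_text)
print(f"WER: {wer:.2f}")
```

In the notebook, `source_text` would play the role of the reference and the ASR output for the converted file would be the hypothesis.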

- Pick one or two files for EDA, since audio files are typically very heavy. Do not load more than 5 files at a time; that might not even be possible.

- Options for dealing with the volume and high dimensionality of audio data:
1) Google Colab
2) Keras audio data loading
https://keras.io/api/data_loading/audio/#audio_dataset_from_directory-function


- We will need a library that converts audio files to machine-readable data, i.e. numbers.
- We will also need a library that trains a model (e.g. a sequence/deep learning model for time-series audio data).
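
As a rough illustration of what "converting audio to numbers" means, the sketch below uses only the Python standard library to read a WAV file into a list of floats. It assumes mono 16-bit PCM input, and it writes a short synthetic sine tone first so that it runs without any dataset files. The helper names `write_sine_wav` and `read_wav_as_floats` are made up for this example. A real pipeline would more likely use a dedicated audio library and feed spectrogram-style features to the model.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def write_sine_wav(path, freq_hz=440.0, seconds=0.1, sample_rate=16000):
    """Write a mono 16-bit PCM sine tone; stands in for a real clip."""
    n_samples = int(seconds * sample_rate)
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * t / sample_rate))
        for t in range(n_samples)
    ]
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{n_samples}h", *samples))


def read_wav_as_floats(path):
    """Return (sample_rate, samples) with samples scaled to [-1.0, 1.0)."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        sample_rate = wav.getframerate()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)
    ints = struct.unpack(f"<{n_frames}h", raw)
    return sample_rate, [value / 32768.0 for value in ints]


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.wav"
    write_sine_wav(demo_path)
    sample_rate, samples = read_wav_as_floats(demo_path)

print(f"sample rate: {sample_rate}, number of samples: {len(samples)}")
```

Even this 0.1-second clip yields 1,600 numbers, which gives a sense of why loading many full-length files at once gets heavy.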