------------------------------------------------------------------------------------------------------------------------
### <b>Table of Contents</b>

0. Background

1. Import functions

2. Load data

3. Explore data<br>
    3-1. Examine and transform data formats<br>
    3-2. Plot histograms and box plots<br>
    3-3. Plot a correlation heatmap<br>
    3-4. Plot seasonal decomposition<br>
    3-5. Plot autocorrelations

3. Build a classification model to assess the performance of the cloned audio's target speaker<br>
    3-1. Convert .wav to machine-readable form<br>
    3-2. Train a model<br>
    3-3. Evaluate classification accuracy using test data

4. First, build a voice cloning (VC) system that, given a speaker's spoken audio, converts the source speaker's voice to the target speaker's voice<br>
    4-1. For the VC system, use the TIMIT dataset, as it consists of aligned text-audio data from various speakers<br>
    4-2. Use Word Error Rate (WER) for automatic evaluation of the speech generation part of the VC system<br>
    &emsp; 4-2-1. Speech to text<br>
    &emsp; 4-2-2. Measure WER using the original script and the transcribed text<br>
    4-3. Also report speaker classification accuracy to assess how well the generated audio matches the target speaker

5. Next, build a machine learning system that detects whether spoken audio is natural speech or synthetically generated by a machine<br>
    5-1. For the fake audio detection (FAD) system, use the CommonVoice dataset: its thousands of naturally spoken clips serve as positive (human) examples, and negative examples are created with the VC system as an automatic data/label generator<br>
    5-2. Evaluate the FAD models using the F-score, with positive labels from the ground-truth dataset and negative labels generated by the VC system



4. Select and engineer features

5. Train models<br>
    5-1. Split data into train and test sets<br>
    5-2. SARIMAX model<br>
    &emsp; 5-2-1. Perform stepwise search<br>
    &emsp; 5-2-2. Train with best orders<br>
    5-3. Train and evaluate different models<br>
    &emsp; 5-3-1. Train three different models<br>
    &emsp; 5-3-2. Compare model performance

6. Evaluate prediction results<br>
    6-1. Create Bollinger Bands<br>
    6-2. Get trading dates with different intervals<br>
    6-3. Make training decisions and get capital returns<br>
    &emsp; 6-3-1. Based on SARIMAX predictions<br>
    &emsp; 6-3-2. Based on Bollinger Band

7. Build pipeline and process all stock data<br>
    7-1. Interpret results<br>
    &emsp; 7-1-1. Capital returns<br>
    &emsp; 7-1-2. Model performance

8. Conclusion
------------------------------------------------------------------------------------------------------------------------

### <b>0. Background</b>

This project is for a cyber security company providing products and services that can detect whether audio and video media is authentic or fake.

We are given two publicly available speech corpora, the <a href="https://github.com/philipperemy/timit">TIMIT</a> and <a href="https://commonvoice.mozilla.org/en/datasets">CommonVoice</a> datasets. We will build two machine learning systems using these datasets:

1. A voice cloning (VC) system that clones a given speaker's spoken audio to the target speaker's voice.
2. A fake audio detection (FAD) system that detects if any spoken audio is natural speech or synthetically generated by machines.

More details can be found in <a href="https://github.com/henryhyunwookim/K7h2vHrgG1Gl0S2r#readme">README</a>.

### <b>1. Import functions</b>

In [1]:
from TTS.api import TTS
# https://tts.readthedocs.io/en/latest/inference.html

import sys
from pathlib import Path
import pandas as pd

### <b>2. Load data</b>

In [2]:
# Define file path.
root_dir = Path(sys.path[0])

timit_dir = root_dir / 'data' / 'TIMIT' / 'archive'
common_voice_dir = root_dir / 'data' / 'CommonVoice' / \
    'cv-corpus-14.0-delta-2023-06-23-en.tar' / \
    'cv-corpus-14.0-delta-2023-06-23-en' / \
    'cv-corpus-14.0-delta-2023-06-23' / 'en' / 'clips'

In [3]:
train_csv = pd.read_csv(timit_dir / 'train_data.csv', index_col='index').dropna(how='all')
train_csv.index = train_csv.index.astype(int).astype(str)

test_csv = pd.read_csv(timit_dir / 'test_data.csv', index_col='index').dropna(how='any')
test_csv.index = test_csv.index.astype(int).astype(str)

train_csv.head()

index,test_or_train,dialect_region,speaker_id,filename,path_from_data_dir,path_from_data_dir_windows,is_converted_audio,is_audio,is_word_file,is_phonetic_file,is_sentence_file
1,TRAIN,DR4,MMDM0,SI681.WAV.wav,TRAIN/DR4/MMDM0/SI681.WAV.wav,TRAIN\\DR4\\MMDM0\\SI681.WAV.wav,True,True,False,False,False
2,TRAIN,DR4,MMDM0,SI1311.PHN,TRAIN/DR4/MMDM0/SI1311.PHN,TRAIN\\DR4\\MMDM0\\SI1311.PHN,False,False,False,True,False
3,TRAIN,DR4,MMDM0,SI1311.WRD,TRAIN/DR4/MMDM0/SI1311.WRD,TRAIN\\DR4\\MMDM0\\SI1311.WRD,False,False,True,False,False
4,TRAIN,DR4,MMDM0,SX321.PHN,TRAIN/DR4/MMDM0/SX321.PHN,TRAIN\\DR4\\MMDM0\\SX321.PHN,False,False,False,True,False
5,TRAIN,DR4,MMDM0,SX321.WRD,TRAIN/DR4/MMDM0/SX321.WRD,TRAIN\\DR4\\MMDM0\\SX321.WRD,False,False,True,False,False


In [43]:
import string

# Paths of the converted audio files (*.WAV.wav), relative to the TIMIT data folder.
train_audio_path = train_csv[train_csv['is_converted_audio'] == True]['path_from_data_dir']

# Source: the first converted audio file. The index holds string labels,
# so select by position with .iloc instead of [0].
source_audio_subpath = train_audio_path.iloc[0]
source_speaker_id = source_audio_subpath.split('/')[2]
source_audio_file = source_audio_subpath.split('/')[3]
source_file_id = source_audio_file.split('.')[0]
source_audio_path = timit_dir / 'data' / source_audio_subpath

# Transcript (.TXT) of the same utterance by the same speaker.
source_text_subpath = train_csv[(train_csv['speaker_id'] == source_speaker_id) &
          (train_csv['filename'] == source_file_id + '.TXT')]['path_from_data_dir'].iloc[0]
source_text_file = source_text_subpath.split('/')[3]
source_text_path = timit_dir / 'data' / source_text_subpath
with open(source_text_path) as txt:
    # Drop the first two tokens (in TIMIT, the start and end sample offsets).
    raw_source_text = txt.read().split()[2:]
source_text = ' '.join(raw_source_text) # Rejoin the words with single spaces
source_text = source_text.translate(str.maketrans('', '', string.punctuation)) # Remove punctuation
source_text = source_text.lower() # Convert to lowercase
print(f'Source path: {source_audio_path}')

# Target: the last converted audio file.
target_audio_subpath = train_audio_path.iloc[-1]
target_speaker_id = target_audio_subpath.split('/')[2]
target_audio_file = target_audio_subpath.split('/')[3]
target_file_id = target_audio_file.split('.')[0]
target_audio_path = timit_dir / 'data' / target_audio_subpath
print(f'Target path: {target_audio_path}')

# All generated files for this source/target pair go into one output folder.
output_folder = root_dir / 'output' / f'{source_speaker_id}-{source_file_id}_to_{target_speaker_id}-{target_file_id}'
output_filename = f'{source_speaker_id}-{source_file_id}_to_{target_speaker_id}-{target_file_id}.wav'

Source path: d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\data\TIMIT\archive\data\TRAIN\DR4\MMDM0\SI681.WAV.wav
Target path: d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\data\TIMIT\archive\data\TRAIN\DR8\MRDM0\SA2.WAV.wav
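
The transcript clean-up in the cell above can be isolated into a small function, which makes it easier to check on its own. This is a minimal sketch: `normalize_timit_transcript` is a hypothetical helper name, and the example line (including the offsets `0 46797` and the question mark) is made up for illustration rather than read from the dataset.

```python
import string


def normalize_timit_transcript(raw_line: str) -> str:
    """Turn one TIMIT-style transcript line into lowercase, punctuation-free text."""
    # Assumption: the line starts with two integers (start and end sample
    # offsets) followed by the sentence, as in the cell above.
    tokens = raw_line.split()[2:]
    text = ' '.join(tokens)
    # Remove punctuation, then lowercase, so that the reference text is
    # comparable with a speech-to-text transcription.
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower()


# Illustrative input line; the offsets and the punctuation are assumptions.
print(normalize_timit_transcript('0 46797 Would such an act of refusal be useful?'))
# -> would such an act of refusal be useful
```

Note that this normalization also strips apostrophes (e.g. `don't` becomes `dont`), whereas a transcription may keep them, which can count as a word error later on.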


### <b>3. Clone audio file</b>

In [44]:
# Copy the original source and target files to the output folder
# for an easier review of the output files.
from shutil import copy2
import os

# Create the output folder, including the parent 'output' folder if it is missing.
os.makedirs(output_folder, exist_ok=True)

# copy2 also copies file metadata (e.g. permissions and timestamps).
# Files that were already copied in a previous run are skipped.
for src_path, file_name in [(source_audio_path, source_audio_file),
                            (target_audio_path, target_audio_file),
                            (source_text_path, source_text_file)]:
    if not os.path.exists(output_folder / file_name):
        copy2(src=src_path, dst=output_folder / file_name)

In [45]:
# Voice conversion example: convert the speaker of source_wav to the speaker of target_wav.
# On first use, the model is downloaded to C:\Users\Admin\AppData\Local\tts\voice_conversion_models--multilingual--vctk--freevc24
multi_output_path = output_folder / f'multi_{output_filename}'
if os.path.exists(multi_output_path):
    print(f'{multi_output_path} already exists.')
else:
    multi_tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False, gpu=False)
    multi_tts.voice_conversion_to_file(
        source_wav=str(source_audio_path),
        target_wav=str(target_audio_path),
        file_path=str(multi_output_path)
    )

d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\multi_MMDM0-SI681_to_MRDM0-SA2.wav already exists.


In [46]:
# TTS with on-the-fly voice conversion, tried with every English model
en_models = [model for model in TTS.list_models() if '/en/' in model]
for model in en_models:
    # Define the names before the try block so the except branch can always use them.
    model_name = model.split('/')[-1]
    output_file_path = output_folder / f'{model_name}_{output_filename}'
    try:
        if os.path.exists(output_file_path):
            print(f'{output_file_path} already exists.')
        else:
            en_tts = TTS(model)
            en_tts.tts_with_vc_to_file(
                source_text,
                speaker_wav=str(target_audio_path),
                file_path=str(output_file_path)
            )
    except Exception as e:
        print(e)
        print(f'Failed to load {model_name}.')

No API token found for 🐸Coqui Studio voices - https://coqui.ai 
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`

d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\tacotron2_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\tacotron2-DDC_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\tacotron2-DDC_ph_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\glow-tts_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\speedy-speech_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\tacotron2-DCA_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\

Evaluate the results in D:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2

### <b>3-1. Speech to text using the output files</b>

In [47]:
# https://pdf.co/blog/transcribe-speech-recordings-to-text-python

import speech_recognition as sr
from jiwer import process_words, process_characters, visualize_alignment

recognizer = sr.Recognizer()
eval_dict = {}
for filename in os.listdir(output_folder):
    if output_filename in filename:
        wav_path = output_folder / filename
        # Model name = text before the first underscore. Note that this truncates
        # names that contain an underscore themselves (e.g. 'tacotron2-DDC_ph').
        model_name = filename.split('_')[0]

        # Transcribe the generated audio with the Google Web Speech API.
        with sr.AudioFile(str(wav_path)) as source:
            audio = recognizer.record(source)
        transcription = recognizer.recognize_google(audio, language='en-US')
        print(f'Transcription: {transcription}')

        # Word-level measures: reference text first, transcription second.
        eval_dict[model_name] = {}
        words_output = process_words(source_text, transcription) # Both can be either a string or a list of strings
        eval_dict[model_name]['wer'] = words_output.wer
        eval_dict[model_name]['mer'] = words_output.mer
        eval_dict[model_name]['wil'] = words_output.wil
        eval_dict[model_name]['wip'] = words_output.wip

        # Character-level measure.
        chars_output = process_characters(source_text, transcription)
        eval_dict[model_name]['cer'] = chars_output.cer
        eval_dict[model_name]['Transcription'] = transcription

        print(f'Performance of model "{model_name}":')
        print(visualize_alignment(words_output))
        print(visualize_alignment(chars_output))
        print()

        # Optionally save the transcript next to the audio file.
        # transcription_path = output_folder / filename.replace('.wav', '.txt')
        # if os.path.exists(transcription_path):
        #     print(f'{transcription_path} already exists.')
        # else:
        #     with open(transcription_path, "w") as file:
        #         file.write(transcription)

Transcription: what's such an act of refuse to be used for
Performance of model "capacitron-t2-c150":
sentence 1
REF:  would such an act of ****** refusal be **** useful
HYP: what's such an act of refuse      to be used    for
          S                     I       S       I      S

number of sentences: 1
substitutions=3 deletions=0 insertions=2 hits=5

mer=50.00%
wil=68.75%
wip=31.25%
wer=62.50%

sentence 1
REF: w*ould such an act of refus**al be use**ful
HYP: what's such an act of refuse to be used for
      ISSSS                     IISS       II SS

number of sentences: 1
substitutions=8 deletions=0 insertions=5 hits=30

cer=34.21%


Transcription: what's such an act of refusal be useful
Performance of model "capacitron-t2-c50":
sentence 1
REF:  would such an act of refusal be useful
HYP: what's such an act of refusal be useful
          S                                 

number of sentences: 1
substitutions=1 deletions=0 insertions=0 hits=7

mer=12.50%
wil=23.44%
wip=76.56%
wer=
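
One detail of the evaluation cell: taking the text before the first underscore as the model name shortens any model name that itself contains an underscore, so `tacotron2-DDC_ph` and `tacotron2-DDC` end up under the same key. A possible alternative, sketched here with a hypothetical helper `model_name_from_filename`, strips the known `_<output_filename>` suffix instead:

```python
def model_name_from_filename(filename: str, output_filename: str) -> str:
    """Recover the model name from '<model_name>_<output_filename>'."""
    suffix = f'_{output_filename}'
    if not filename.endswith(suffix):
        raise ValueError(f'{filename!r} does not end with {suffix!r}')
    # Cut off the suffix; underscores inside the model name are preserved.
    return filename[:-len(suffix)]


output_filename = 'MMDM0-SI681_to_MRDM0-SA2.wav'
filename = f'tacotron2-DDC_ph_{output_filename}'

print(filename.split('_')[0])                               # -> tacotron2-DDC
print(model_name_from_filename(filename, output_filename))  # -> tacotron2-DDC_ph
```

Switching to this approach would change the keys of `eval_dict`, so the results shown in this notebook would need to be regenerated.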

In [51]:
print(f'Source text: {source_text}')
eval_df = pd.DataFrame(eval_dict).T
eval_df.sort_values(list(eval_df.columns)[:-1], ascending=[True, True, True, False, True])

# jiwer implements the following measures:
# 1. Word Error Rate (WER), which gives the library its name. This has long
#   been (and arguably still is) the de facto standard for measuring
#   ASR performance. The lower, the better (i.e. more accurate).
# 2. Match Error Rate (MER): the lower, the better (i.e. more accurate).
# 3. Word Information Lost (WIL): the lower, the better (i.e. more accurate).
# 4. Word Information Preserved (WIP): the higher, the better.
# 5. Character Error Rate (CER): the lower, the better (i.e. more accurate).

Source text: would such an act of refusal be useful


model,wer,mer,wil,wip,cer,Transcription
neural,0.0,0.0,0.0,1.0,0.0,would such an act of refusal be useful
overflow,0.0,0.0,0.0,1.0,0.0,would such an act of refusal be useful
glow-tts,0.125,0.125,0.234375,0.765625,0.105263,was such an act of refusal be useful
jenny,0.125,0.125,0.234375,0.765625,0.105263,was such an act of refusal be useful
tacotron2,0.125,0.125,0.234375,0.765625,0.105263,with such an act of refusal be useful
vits--neon,0.125,0.125,0.234375,0.765625,0.105263,put such an act of refusal be useful
capacitron-t2-c50,0.125,0.125,0.234375,0.765625,0.131579,what's such an act of refusal be useful
tacotron2-DDC,0.125,0.125,0.234375,0.765625,0.131579,what's such an act of refusal be useful
vits,0.125,0.125,0.234375,0.765625,0.131579,what's such an act of refusal be useful
fast,0.25,0.2,0.2,0.8,0.131579,I would such an act of refusal be useful in
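
The word-level measures can be reproduced from the alignment counts (hits, substitutions, deletions, insertions) that `visualize_alignment` prints. The sketch below assumes the usual definitions of these measures; `word_metrics` is a hypothetical helper written for illustration, not a jiwer function.

```python
def word_metrics(hits: int, substitutions: int, deletions: int, insertions: int) -> dict:
    """Compute WER, MER, WIP and WIL from word alignment counts."""
    errors = substitutions + deletions + insertions
    ref_len = hits + substitutions + deletions   # number of words in the reference
    hyp_len = hits + substitutions + insertions  # number of words in the transcription

    wer = errors / ref_len                        # errors per reference word
    mer = errors / (hits + errors)                # errors per aligned position
    wip = (hits / ref_len) * (hits / hyp_len)     # share of information preserved
    wil = 1 - wip                                 # share of information lost
    return {'wer': wer, 'mer': mer, 'wil': wil, 'wip': wip}


# Counts reported above for "capacitron-t2-c150":
# substitutions=3 deletions=0 insertions=2 hits=5
print(word_metrics(hits=5, substitutions=3, deletions=0, insertions=2))
# -> {'wer': 0.625, 'mer': 0.5, 'wil': 0.6875, 'wip': 0.3125}
```

These values match the 62.50%, 50.00%, 68.75% and 31.25% printed for that model. Because WER divides by the reference length only, it can exceed 1.0 when a transcription contains many insertions, while MER, WIL and WIP stay between 0 and 1.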


# Build an audio classification model to predict the speaker from the synthesized audio file

# Other VC models:

https://paperswithcode.com/task/voice-conversion
=>
https://github.com/bshall/VectorQuantizedCPC
https://github.com/bshall/knn-vc

Neural network-based voice conversion models, such as CycleGAN, StarGAN-VC, or Tacotron 2

VITS

VALL-E

- Pick one or two files for EDA, since audio files are typically very large. Do not load more than five files at a time; more than that may not even be possible.

- To deal with the volume and high dimensionality of audio data, try:
1) Google Colab
2) Keras audio data loading
https://keras.io/api/data_loading/audio/#audio_dataset_from_directory-function


- A library is needed to convert audio files to machine-readable data, i.e. numbers.
- Another library is needed to train a model (e.g. a sequence/deep learning model for time-series audio data).
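
To illustrate what "machine-readable data" means for a `.wav` file, the sketch below reads the raw samples with the Python standard library only. It assumes 16-bit mono PCM audio (whether the converted TIMIT files use this format is an assumption), and `read_wav_samples` is a hypothetical helper name. In practice, a dedicated audio library would likely be used to extract richer features; this only shows the raw-number idea.

```python
import array
import sys
import wave


def read_wav_samples(path):
    """Read a 16-bit mono PCM WAV file into a list of floats in [-1.0, 1.0)."""
    with wave.open(str(path), 'rb') as wav_file:
        # Assumption: 16-bit mono PCM; other formats need different handling.
        if wav_file.getsampwidth() != 2 or wav_file.getnchannels() != 1:
            raise ValueError('this sketch only handles 16-bit mono PCM audio')
        sample_rate = wav_file.getframerate()
        frames = wav_file.readframes(wav_file.getnframes())

    pcm = array.array('h')      # signed 16-bit integers
    pcm.frombytes(frames)
    if sys.byteorder == 'big':  # WAV sample data is little-endian
        pcm.byteswap()
    # Scale the integers to floats so that a model sees values in [-1.0, 1.0).
    return sample_rate, [sample / 32768.0 for sample in pcm]
```

Calling `read_wav_samples(source_audio_path)` would return the sample rate and one number per audio sample, which is the kind of input a feature extractor or a sequence model can work with.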