------------------------------------------------------------------------------------------------------------------------
### <b>Table of Contents</b>

0. Background

1. Import functions

2. Load data

3. Explore data<br>
    3-1. Examine and transform data formats<br>
    3-2. Plot histograms and box plots<br>
    3-3. Plot a correlation heatmap<br>
    3-4. Plot seasonal decomposition<br>
    3-5. Plot autocorrelations

3. Build a classification model to assess the performance of the cloned audio's target speaker
    3-1. Convert .wav to machine-readable form
    3-2. Train a model
    3-3. Evaluate classification accuracy using test data

4. First, build a voice cloning system that, given a speaker’s spoken audio, clones the source speaker’s voice to the target speaker’s voice<br>
    4-1. For the voice cloning (VC) system, you can utilize the TIMIT dataset, as it consists of aligned text-audio data with various speakers.<br>
    4-2. Use Word Error Rate (WER) for automatic evaluation of the speech generation part of the voice cloning (VC) system<br>
    &emsp; 4-2-1. Speech to text<br>
    &emsp; 4-2-2. Measure WER using the original script and the transcribed text<br>
    4-3. Also report speaker classification accuracy to assess the performance of the generated audio’s target speaker.

5. Next, build a machine learning system that detects whether any spoken audio is natural speech or synthetically generated by a machine.<br>
    5-1. For the fake audio detection (FAD) system, you can utilize the CommonVoice dataset, as it consists of thousands of naturally spoken audio clips that can serve as golden human speech (positive examples); negative examples can be created using the voice cloning system as an automatic data/label generator.<br>
    5-2. Evaluate the performance of the FAD models using F-score, with positive labels coming from the ground-truth dataset and negative labels generated by the VC system.



4. Select and engineer features

5. Train models<br>
    5-1. Split data into train and test sets<br>
    5-2. SARIMAX model<br>
    &emsp; 5-2-1. Perform stepwise search<br>
    &emsp; 5-2-2. Train with best orders<br>
    5-3. Train and evaluate different models<br>
    &emsp; 5-3-1. Train three different models<br>
    &emsp; 5-3-2. Compare model performance

6. Evaluate prediction results<br>
    6-1. Create Bollinger Bands<br>
    6-2. Get trading dates with different intervals<br>
    6-3. Make training decisions and get capital returns<br>
    &emsp; 6-3-1. Based on SARIMAX predictions<br>
    &emsp; 6-3-2. Based on Bollinger Band

7. Build pipeline and process all stock data<br>
    7-1. Interpret results<br>
    &emsp; 7-1-1. Capital returns<br>
    &emsp; 7-1-2. Model performance

8. Conclusion
------------------------------------------------------------------------------------------------------------------------

### <b>0. Background</b>

This project is for a cyber security company providing products and services that can detect whether audio and video media is authentic or fake.

We are given two publicly available corpora of speech data: the <a href="https://github.com/philipperemy/timit">TIMIT</a> and <a href="https://commonvoice.mozilla.org/en/datasets">CommonVoice</a> datasets. We will build two machine learning systems using these datasets, as follows:

1. A voice cloning (VC) system that clones a given speaker's spoken audio to the target speaker's voice.
2. A fake audio detection (FAD) system that detects if any spoken audio is natural speech or synthetically generated by machines.

More details can be found in <a href="https://github.com/henryhyunwookim/K7h2vHrgG1Gl0S2r#readme">README</a>.

### <b>1. Import functions</b>

In [1]:
from TTS.api import TTS
# https://tts.readthedocs.io/en/latest/inference.html

import sys
from pathlib import Path
import pandas as pd

### <b>2. Load data</b>

In [2]:
# Define file path.
root_dir = Path(sys.path[0])

timit_dir = root_dir / 'data' / 'TIMIT' / 'archive'
common_voice_dir = root_dir / 'data' / 'CommonVoice' / \
    'cv-corpus-14.0-delta-2023-06-23-en.tar' / \
    'cv-corpus-14.0-delta-2023-06-23-en' / \
    'cv-corpus-14.0-delta-2023-06-23' / 'en' / 'clips'

In [73]:
train_csv = pd.read_csv(timit_dir / 'train_data.csv', index_col='index').dropna(how='all').reset_index(drop=True)
train_csv.index = train_csv.index.astype(int).astype(str)

test_csv = pd.read_csv(timit_dir / 'test_data.csv', index_col='index').dropna(how='any').reset_index(drop=True)
test_csv.index = test_csv.index.astype(int).astype(str)

train_csv.head()

Unnamed: 0,test_or_train,dialect_region,speaker_id,filename,path_from_data_dir,path_from_data_dir_windows,is_converted_audio,is_audio,is_word_file,is_phonetic_file,is_sentence_file
0,TRAIN,DR4,MMDM0,SI681.WAV.wav,TRAIN/DR4/MMDM0/SI681.WAV.wav,TRAIN\\DR4\\MMDM0\\SI681.WAV.wav,True,True,False,False,False
1,TRAIN,DR4,MMDM0,SI1311.PHN,TRAIN/DR4/MMDM0/SI1311.PHN,TRAIN\\DR4\\MMDM0\\SI1311.PHN,False,False,False,True,False
2,TRAIN,DR4,MMDM0,SI1311.WRD,TRAIN/DR4/MMDM0/SI1311.WRD,TRAIN\\DR4\\MMDM0\\SI1311.WRD,False,False,True,False,False
3,TRAIN,DR4,MMDM0,SX321.PHN,TRAIN/DR4/MMDM0/SX321.PHN,TRAIN\\DR4\\MMDM0\\SX321.PHN,False,False,False,True,False
4,TRAIN,DR4,MMDM0,SX321.WRD,TRAIN/DR4/MMDM0/SX321.WRD,TRAIN\\DR4\\MMDM0\\SX321.WRD,False,False,True,False,False


In [83]:
train_audio_path = train_csv[train_csv['is_converted_audio']==True]['path_from_data_dir']
test_audio_path = test_csv[test_csv['is_converted_audio']==True]['path_from_data_dir']

source_audio_subpath = train_audio_path.iloc[0] # First converted audio file (positional; the index holds string labels)
source_speaker_id = source_audio_subpath.split('/')[2]
source_audio_file = source_audio_subpath.split('/')[3]
source_file_id = source_audio_file.split('.')[0]
source_audio_path = timit_dir / 'data' / source_audio_subpath

import string
source_text_subpath = train_csv[(train_csv['speaker_id']==source_speaker_id) &
          (train_csv['filename']==source_file_id+'.TXT')]['path_from_data_dir'].iloc[0]
source_text_file = source_text_subpath.split('/')[3]
source_text_path = timit_dir / 'data' / source_text_subpath
with open(source_text_path) as txt:
    raw_source_text = txt.read().split()[2:] # Drop the two leading sample-index fields
source_text = ' '.join(raw_source_text) # Rejoin the words with single spaces
source_text = source_text.translate(str.maketrans('', '', string.punctuation)) # Remove punctuation
source_text = source_text.lower() # Convert to lowercase
print(f'Source path: {source_audio_path}')

target_audio_subpath = train_audio_path.iloc[-1] # Last converted audio file (positional)
target_speaker_id = target_audio_subpath.split('/')[2]
target_audio_file = target_audio_subpath.split('/')[3]
target_file_id = target_audio_file.split('.')[0]
target_audio_path = timit_dir / 'data' / target_audio_subpath
print(f'Target path: {target_audio_path}')

output_folder = root_dir / 'output' / f'{source_speaker_id}-{source_file_id}_to_{target_speaker_id}-{target_file_id}'
output_filename = f'{source_speaker_id}-{source_file_id}_to_{target_speaker_id}-{target_file_id}.wav'

Source path: d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\data\TIMIT\archive\data\TRAIN\DR4\MMDM0\SI681.WAV.wav
Target path: d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\data\TIMIT\archive\data\TRAIN\DR8\MRDM0\SA2.WAV.wav
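The text clean-up in the cell above can be wrapped in a small helper. This is a minimal sketch, assuming a TIMIT-style `.TXT` line of the form `<start_sample> <end_sample> <sentence>`. The function name `normalize_transcript` and the sample indices in the usage line are made up for illustration.

```python
import string

def normalize_transcript(raw_line: str) -> str:
    """Drop the two leading sample indices, strip punctuation, lowercase."""
    words = raw_line.split()[2:]   # skip the start/end sample fields
    text = ' '.join(words)         # rejoin with single spaces
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower()

# Hypothetical input line; the sample indices are illustrative only.
example = normalize_transcript("0 46797 Would such an act of refusal be useful?\n")
print(example)
```

Note that `string.punctuation` includes the apostrophe, so contractions such as "what's" become "whats". If the reference text is normalized this way, applying the same steps to the ASR transcription before scoring may keep case and punctuation differences from being counted as word errors.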


### <b>3. Clone audio file</b>

In [5]:
# Copy original source and target files to the destination folder
# for an easier review of the output file.
from shutil import copy2
import os

# Create the output folder (and the parent 'output' folder) if missing.
os.makedirs(output_folder, exist_ok=True)

# copy2 preserves file metadata; dst may be a file path or a folder.
if not os.path.exists(output_folder / source_audio_file):
    copy2(src=source_audio_path, dst=output_folder / source_audio_file)
if not os.path.exists(output_folder / target_audio_file):
    copy2(src=target_audio_path, dst=output_folder / target_audio_file)
if not os.path.exists(output_folder / source_text_file):
    copy2(src=source_text_path, dst=output_folder / source_text_file)

In [6]:
# Voice conversion example: convert the speaker of source_wav to the speaker of target_wav.
# On first use the model is downloaded to C:\Users\Admin\AppData\Local\tts\voice_conversion_models--multilingual--vctk--freevc24
multi_output_path = output_folder / f'multi_{output_filename}'
if os.path.exists(multi_output_path):
    print(f'{multi_output_path} already exists.')
else:
    multi_tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False, gpu=False)
    multi_tts.voice_conversion_to_file(
        source_wav=str(source_audio_path),
        target_wav=str(target_audio_path),
        file_path=multi_output_path
    )

d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\multi_MMDM0-SI681_to_MRDM0-SA2.wav already exists.


In [7]:
# TTS with on the fly voice conversion
en_models = [model for model in TTS.list_models() if '/en/' in model]
for model in en_models:
    try:
        model_name = model.split('/')[-1]
        output_file_path = output_folder / f'{model_name}_{output_filename}'
        if os.path.exists(output_file_path):
            print(f'{output_file_path} already exists.')
        else:
            en_tts = TTS(model)
            en_tts.tts_with_vc_to_file(
                source_text,
                speaker_wav=str(target_audio_path),
                file_path=output_file_path
            )
    except Exception as e:
        print(e)
        print(f'Failed to load {model_name}.')

No API token found for 🐸Coqui Studio voices - https://coqui.ai 
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`

d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\tacotron2_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\tacotron2-DDC_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\tacotron2-DDC_ph_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\glow-tts_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\speedy-speech_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\tacotron2-DCA_MMDM0-SI681_to_MRDM0-SA2.wav already exists.
d:\OneDrive\GitHub\

 [!] Config for tortoise cannot be found.
Failed to load tortoise-v2.
d:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2\jenny_MMDM0-SI681_to_MRDM0-SA2.wav already exists.


### <b>Evaluate results in D:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2</b>

3-1. Speech to text using the output file

In [8]:
# https://pdf.co/blog/transcribe-speech-recordings-to-text-python

import speech_recognition as sr
from os import path
from pydub import AudioSegment
from jiwer import process_words, process_characters, visualize_alignment

eval_dict = {}
for filename in os.listdir(output_folder):
    if output_filename in filename:
        wav_path = output_folder / filename
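        # NOTE: splitting on '_' truncates model names that contain underscores
        # (e.g. 'tacotron2-DDC_ph' becomes 'tacotron2-DDC' and overwrites that entry).
        model_name = filename.split('/')[-1].split('_')[0]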
        # sound=AudioSegment.from_wav(wav_path)
        r=sr.Recognizer()
        with sr.AudioFile(str(wav_path)) as source:
            audio=r.record(source)
            transcription = r.recognize_google(audio, language='en-US')
            print(f'Transcription: {transcription}')

        eval_dict[model_name] = {}
        words_output = process_words(source_text, transcription) # source_text and transcription can be either a string or a list of strings
        eval_dict[model_name]['wer'] = words_output.wer
        eval_dict[model_name]['mer'] = words_output.mer
        eval_dict[model_name]['wil'] = words_output.wil
        eval_dict[model_name]['wip'] = words_output.wip

        chars_output = process_characters(source_text, transcription)
        eval_dict[model_name]['cer'] = chars_output.cer
        eval_dict[model_name]['Transcription'] = transcription
        
        print(f'Performance of model "{model_name}":')
        print(visualize_alignment(words_output))
        print(visualize_alignment(chars_output))
        print()

        # Saving the Transcript
        # transcription_path = output_folder / filename.replace('.wav', '.txt')
        # if os.path.exists(transcription_path):
        #     print(f'{transcription_path} already exists.')
        # else:
        #     with open(transcription_path, "w") as file:
        #         file.write(transcription)



Transcription: what's such an act of refuse to be used for
Performance of model "capacitron-t2-c150":
sentence 1
REF:  would such an act of ****** refusal be **** useful
HYP: what's such an act of refuse      to be used    for
          S                     I       S       I      S

number of sentences: 1
substitutions=3 deletions=0 insertions=2 hits=5

mer=50.00%
wil=68.75%
wip=31.25%
wer=62.50%

sentence 1
REF: w*ould such an act of refus**al be use**ful
HYP: what's such an act of refuse to be used for
      ISSSS                     IISS       II SS

number of sentences: 1
substitutions=8 deletions=0 insertions=5 hits=30

cer=34.21%


Transcription: what's such an act of refusal be useful
Performance of model "capacitron-t2-c50":
sentence 1
REF:  would such an act of refusal be useful
HYP: what's such an act of refusal be useful
          S                                 

number of sentences: 1
substitutions=1 deletions=0 insertions=0 hits=7

mer=12.50%
wil=23.44%
wip=76.56%
wer=

In [9]:
print(f'Source text: {source_text}')
eval_df = pd.DataFrame(eval_dict).T
eval_df.sort_values(list(eval_df.columns)[:-1], ascending=[True, True, True, False, True])

# jiwer implements the following measures:
# 1. Word Error Rate (WER), which is where the jiwer library got its name from.
#   This has long been (and arguably still is) the de facto standard for
#   computing ASR performance - the lower the better (i.e. more accurate)
# 2. Match Error Rate (MER) - the lower the better (i.e. more accurate)
# 3. Word Information Lost (WIL) - the lower the better (i.e. more accurate)
# 4. Word Information Preserved (WIP) - the higher the better
# 5. Character Error Rate (CER) - the lower the better (i.e. more accurate)

Source text: would such an act of refusal be useful


Unnamed: 0,wer,mer,wil,wip,cer,Transcription
neural,0.0,0.0,0.0,1.0,0.0,would such an act of refusal be useful
overflow,0.0,0.0,0.0,1.0,0.0,would such an act of refusal be useful
glow-tts,0.125,0.125,0.234375,0.765625,0.105263,was such an act of refusal be useful
jenny,0.125,0.125,0.234375,0.765625,0.105263,was such an act of refusal be useful
tacotron2,0.125,0.125,0.234375,0.765625,0.105263,with such an act of refusal be useful
vits--neon,0.125,0.125,0.234375,0.765625,0.105263,put such an act of refusal be useful
capacitron-t2-c50,0.125,0.125,0.234375,0.765625,0.131579,what's such an act of refusal be useful
tacotron2-DDC,0.125,0.125,0.234375,0.765625,0.131579,what's such an act of refusal be useful
vits,0.125,0.125,0.234375,0.765625,0.131579,what's such an act of refusal be useful
fast,0.25,0.2,0.2,0.8,0.131579,I would such an act of refusal be useful in
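The word-level measures can be reproduced by hand from the alignment counts (hits, substitutions, deletions, insertions) printed earlier. The sketch below uses the standard definitions of WER, MER, WIP and WIL, which agree with the values jiwer reported above. The function name `word_measures` is made up for illustration, and this is not jiwer's own implementation.

```python
def word_measures(hits: int, substitutions: int, deletions: int, insertions: int) -> dict:
    """Compute WER, MER, WIP and WIL from word-alignment counts."""
    ref_len = hits + substitutions + deletions    # words in the reference
    hyp_len = hits + substitutions + insertions   # words in the hypothesis
    errors = substitutions + deletions + insertions
    wer = errors / ref_len                        # can exceed 1 with many insertions
    mer = errors / (hits + errors)                # always between 0 and 1
    wip = (hits / ref_len) * (hits / hyp_len)
    return {'wer': wer, 'mer': mer, 'wil': 1 - wip, 'wip': wip}

# Counts printed above for "capacitron-t2-c150".
c150 = word_measures(hits=5, substitutions=3, deletions=0, insertions=2)
# Counts printed above for "capacitron-t2-c50".
c50 = word_measures(hits=7, substitutions=1, deletions=0, insertions=0)
print(c150)
print(c50)
```

The sketch assumes a non-empty reference and hypothesis; an empty one would cause a division by zero.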


### <b>Build an audio classification model to predict the speaker from a synthesized audio file</b>

References:
- https://www.section.io/engineering-education/machine-learning-for-audio-classification/
- https://www.analyticsvidhya.com/blog/2022/03/implementing-audio-classification-project-using-deep-learning/
- https://towardsdatascience.com/audio-classification-with-deep-learning-in-python-cf752b22ba07

MFCCs (from the Analytics Vidhya article listed above): the MFCCs summarize the frequency distribution across the window size, so both the frequency and the time characteristics of the sound can be analyzed. This representation converts the audio into features, based on its time and frequency characteristics, that can be used for classification.
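An MFCC matrix has one row per coefficient and one column per time frame, so clips of different lengths give matrices of different widths. A common way to get a fixed-length feature vector is to average each coefficient over time. The sketch below illustrates only that pooling step; random numbers stand in for real MFCCs, and the frame counts and the name `pool_over_time` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the MFCC matrices of two clips with different durations:
# shape is (n_mfcc, n_frames), here 20 coefficients and 95 / 140 frames.
short_clip = rng.normal(size=(20, 95))
long_clip = rng.normal(size=(20, 140))

def pool_over_time(mfcc_matrix: np.ndarray) -> np.ndarray:
    """Average each coefficient across all frames: one vector per clip."""
    return mfcc_matrix.mean(axis=1)

# Both clips now map to vectors of the same length and can be stacked.
features = np.stack([pool_over_time(short_clip), pool_over_time(long_clip)])
print(features.shape)
```

Averaging discards the order of the frames, so the classifier sees only the overall spectral shape of each clip. That may be enough to tell speakers apart, but it cannot capture how the sound changes over time.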

In [None]:
import librosa
source_audio_array, sample_rate = librosa.load(source_audio_path, sr = None)

# Play the audio in Jupyter notebook
from IPython.display import Audio

Audio(data=source_audio_array, rate=sample_rate)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from librosa.display import waveshow
plt.figure(figsize=(8, 6))
waveshow(source_audio_array, sr=sample_rate);

In [77]:
import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm
    
# train_audio_path = train_csv[train_csv['is_converted_audio']==True]['path_from_data_dir']
# test_audio_path = test_csv[train_csv['is_converted_audio']==True]['path_from_data_dir']

def get_X_and_y(audio_path):
    X_list = []
    y_list = []
    for audio_subpath in tqdm(audio_path):
        audio_file = timit_dir / 'data' / audio_subpath # Separate name to avoid shadowing the audio_path argument
        y = audio_subpath.split('/')[2] # Speaker ID is the third path component
        y_list.append(y)

        audio_array, sample_rate = librosa.load(audio_file,
            sr = None) # Set sr to None to get original sampling rate. Otherwise the default is 22050.
        # print(f'Shape of audio array: {audio_array.shape}')

        # https://www.analyticsvidhya.com/blog/2022/03/implementing-audio-classification-project-using-deep-learning/
        mfccs_features = librosa.feature.mfcc(y=audio_array, sr=sample_rate) # n_mfcc=40
        # print(f'Shape of mfccs features: {mfccs_features.shape}')
        X = np.mean(mfccs_features.T,axis=0) # Average each coefficient over time frames to get a fixed-length vector
        # print(f'Shape of scaled mfccs features: {X.shape}')
        X_list.append(X)

    df = pd.DataFrame(X_list)
    df['label'] = y_list

    X = df.drop(['label'], axis=1)
    y = df['label']

    return X, y, df

train_X, train_y, train_df = get_X_and_y(train_audio_path)

  0%|          | 4/4620 [00:00<02:22, 32.34it/s]

100%|██████████| 4620/4620 [01:42<00:00, 44.97it/s]


In [78]:
print(train_X.shape, train_y.shape)
train_df

(4620, 20) (4620,)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,label
0,-459.484833,56.761337,-2.669462,19.968649,-23.367184,-3.540117,-14.750872,-10.126809,-2.539186,-9.488992,...,-9.316575,-9.237600,1.366279,-2.545701,-12.262521,3.099277,-2.328727,-6.903233,-0.064869,MMDM0
1,-465.787048,64.562737,-9.489527,8.321322,-21.122004,-2.046137,-21.014496,-7.231303,-7.483816,-4.948157,...,-8.372976,-5.502108,1.284694,-2.473558,-7.687661,1.293272,0.884682,-8.997627,-1.206424,MMDM0
2,-460.641968,63.317535,-2.863496,10.129391,-20.092010,-2.287424,-16.457745,-8.084587,-6.815063,-3.239865,...,-9.894239,-2.476811,2.662166,-1.993033,-7.037549,0.101688,2.700857,-8.354132,-2.119529,MMDM0
3,-472.025757,67.413963,-5.153868,6.219963,-13.424616,-4.105060,-22.515928,-3.277426,-2.497244,-5.639574,...,-6.818600,-5.670763,-0.437742,0.500926,-8.669519,-1.226035,3.584755,-9.111854,-2.529173,MMDM0
4,-471.239319,69.058517,-12.291035,3.194077,-25.257214,-0.998650,-18.970245,-9.737679,-5.135023,-1.857679,...,-4.474608,-6.896949,1.058094,2.615407,-12.128066,0.365896,0.885185,-11.063824,-2.649611,MMDM0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4615,-423.451477,95.278877,-22.301649,3.708834,-18.656015,-12.609299,-12.033003,-10.555429,-2.433274,-6.521004,...,-1.722636,-8.361225,-2.762403,-3.371542,-7.972284,-2.850656,-1.245145,-5.436496,-6.501590,MRDM0
4616,-386.019409,87.441643,-20.666483,-1.609706,-24.377481,-11.676523,-17.594046,-17.587301,-3.453123,-6.572145,...,-2.804390,-12.402589,4.166420,-3.674955,-10.928297,1.093287,-4.062526,-6.624683,-5.267781,MRDM0
4617,-400.804199,98.299683,-32.088337,-1.725039,-20.465277,-11.593572,-22.799036,-14.846024,-1.932206,-13.435257,...,2.143220,-13.219035,-1.477908,-4.940297,-6.138267,-1.202820,-4.640791,-4.390095,-5.367548,MRDM0
4618,-416.451416,65.583344,-7.596760,6.540810,-20.395090,-9.472665,-15.429052,-12.100680,-4.556737,-5.017276,...,-2.022671,-11.940890,4.366385,-7.756617,-7.010954,1.492949,-5.822121,-4.933115,-3.898118,MRDM0


In [85]:
# test_X, test_y, test_df = get_X_and_y(test_audio_path)
# print(test_X.shape, test_y.shape)
# test_df

100%|██████████| 1680/1680 [01:15<00:00, 22.17it/s]


(1680, 20) (1680,)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,label
0,-416.969818,74.011398,-13.187699,7.313196,-16.026976,-8.255945,-15.144563,-12.874265,-12.429185,-11.921021,...,-2.728844,-7.919119,0.885161,-2.531504,-3.715901,2.588741,-3.963770,-4.285760,-3.759105,MGMM0
1,-403.078735,76.966255,-18.775089,-1.268906,-9.745815,-5.179509,-23.778231,-17.591146,-6.624335,-11.397276,...,-5.800530,-7.827424,4.408467,-3.831102,-5.387887,0.043134,-2.391682,-3.166301,-6.031714,MGMM0
2,-467.837006,75.185669,-21.702417,3.530928,-7.479931,-10.674512,-20.365223,-11.112667,-5.670715,-9.603108,...,-3.326303,-5.674239,-0.227867,0.024477,-5.983346,-0.001973,-4.833220,-2.994748,-3.716501,MGMM0
3,-418.118164,37.631710,-8.191947,14.002369,-6.737182,-5.914884,-20.806435,-8.180764,-2.208747,-10.896692,...,-3.398853,-5.102459,0.283969,-3.945397,-2.221809,-0.478784,-4.513309,-2.042158,-3.401044,MGMM0
4,-435.070526,57.279308,-29.724104,3.099550,-17.637997,-5.830471,-20.228270,-12.256104,-2.303483,-8.506678,...,0.906471,-7.174262,3.197781,0.523936,-9.252493,-1.749536,-1.823353,-4.496766,-3.326604,MGMM0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1675,-406.085480,92.109261,-23.479044,-7.624950,-22.940878,-9.834519,-13.621748,-21.213266,-4.604829,-21.231474,...,-9.365309,-12.795815,-2.434451,-5.877679,-7.407432,-3.499592,-2.272517,-0.690597,-9.970720,MPAM0
1676,-385.963654,78.325348,-29.334007,-0.930862,-22.941746,-9.558640,-26.149376,-23.143641,-0.564970,-22.136240,...,-11.822415,-5.651861,-2.994653,-8.926338,-1.621032,-6.353990,-5.315491,-3.241990,-4.528982,MPAM0
1677,-420.404083,56.258163,-10.633552,3.504625,-22.861475,-10.257877,-15.837805,-14.366729,-6.907274,-17.667339,...,-8.835690,-8.947762,-3.208503,-4.989741,-3.821439,-4.549932,-1.142911,-5.954093,-4.140995,MPAM0
1678,-388.224548,83.864449,-20.634623,-10.419191,-26.245213,-7.218798,-23.554562,-19.968264,-8.886330,-16.091333,...,-10.503175,-6.578616,-2.868214,-7.529954,-3.150034,-5.381018,-1.760893,-5.881048,-5.423452,MPAM0


In [93]:
# train_y.value_counts()

MMDM0    10
FPAC0    10
MKDB0    10
MGAW0    10
MCLK0    10
         ..
MEFG0    10
MRMS0    10
MJHI0    10
MRLJ0    10
MRDM0    10
Name: label, Length: 462, dtype: int64

In [94]:
# test_y.value_counts()

MGMM0    10
FGWR0    10
MDWK0    10
MLIH0    10
FHEW0    10
         ..
MTDT0    10
FSLB1    10
MRCZ0    10
MTMR0    10
MPAM0    10
Name: label, Length: 168, dtype: int64
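The counts above show 462 speakers in the training split and 168 in the test split, with 10 clips each. TIMIT's TRAIN and TEST speakers do not overlap, so a speaker classifier trained on TRAIN can only be validated on clips from the same speakers. One option is to hold out a few clips per speaker. The sketch below shows that idea in plain Python; the speaker names, the helper `holdout_per_speaker` and the choice of two held-out clips are assumptions for illustration.

```python
from collections import defaultdict

def holdout_per_speaker(labels, n_holdout=2):
    """Return (train_idx, val_idx), keeping n_holdout clips of every speaker for validation."""
    if n_holdout < 1:
        raise ValueError("n_holdout must be at least 1")
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(labels):
        by_speaker[speaker].append(idx)
    train_idx, val_idx = [], []
    for indices in by_speaker.values():
        train_idx.extend(indices[:-n_holdout])   # all but the last clips
        val_idx.extend(indices[-n_holdout:])     # last clips of this speaker
    return train_idx, val_idx

# Toy labels: 3 hypothetical speakers with 10 clips each.
labels = [spk for spk in ('SPK_A', 'SPK_B', 'SPK_C') for _ in range(10)]
train_idx, val_idx = holdout_per_speaker(labels)
print(len(train_idx), len(val_idx))
```

With scikit-learn available, `train_test_split(..., stratify=labels)` gives a similar split with shuffling.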

In [None]:
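from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

labelencoder=LabelEncoder()
y=to_categorical(labelencoder.fit_transform(train_y)) # One-hot encoded speaker labels
# Assumption: an 80/20 hold-out stratified by speaker, because TIMIT's TRAIN and TEST speakers do not overlap.
X_train, X_test, y_train, y_test = train_test_split(train_X, y, test_size=0.2, stratify=train_y, random_state=42)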

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout,Activation,Flatten
from tensorflow.keras.optimizers import Adam
from sklearn import metrics

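num_labels = train_y.nunique() # One output unit per speaker

model=Sequential()
###first layer
model.add(Dense(100,input_shape=(train_X.shape[1],))) # 20 MFCC features per clip (librosa's default n_mfcc)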
model.add(Activation('relu'))
model.add(Dropout(0.5))
###second layer
model.add(Dense(200))
model.add(Activation('relu'))
model.add(Dropout(0.5))
###third layer
model.add(Dense(100))
model.add(Activation('relu'))
model.add(Dropout(0.5))

###final layer
model.add(Dense(num_labels))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',metrics=['accuracy'],optimizer='adam')

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint
from datetime import datetime 

num_epochs = 200
num_batch_size = 32

checkpointer = ModelCheckpoint(filepath='saved_models/audio_classification.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(X_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(X_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

### <b>Predict labels using the cloned audio files in D:\OneDrive\GitHub\Apziva\K7h2vHrgG1Gl0S2r\output\MMDM0-SI681_to_MRDM0-SA2</b>

In [None]:
test_accuracy=model.evaluate(X_test,y_test,verbose=0)
print(test_accuracy[1])

In [None]:
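filename=multi_output_path # Assumption: score the cloned file written by the voice conversion step
audio, sample_rate = librosa.load(filename, sr=None) # Keep the original sampling rate, as in get_X_and_y
mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate) # Default n_mfcc=20, matching the training features
mfccs_scaled_features = np.mean(mfccs_features.T,axis=0) # Average over time frames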

print(mfccs_scaled_features)
mfccs_scaled_features=mfccs_scaled_features.reshape(1,-1)
print(mfccs_scaled_features)
print(mfccs_scaled_features.shape)
predicted_label=model.predict(mfccs_scaled_features)
print(predicted_label)
classes_x=np.argmax(predicted_label,axis=1)
prediction_class = labelencoder.inverse_transform(classes_x)
prediction_class
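The last lines of the cell above turn the softmax output into a speaker ID: `np.argmax` picks the most probable class index, and `labelencoder.inverse_transform` maps that index back to the label. The sketch below shows the same decoding with plain NumPy. The class names and probabilities are made up, and indexing a sorted array of class names stands in for the encoder.

```python
import numpy as np

# Hypothetical class names, sorted the way LabelEncoder stores them in classes_.
speakers = np.array(['SPK_A', 'SPK_B', 'SPK_C'])
# Hypothetical softmax output for one clip: one probability per class.
probabilities = np.array([[0.10, 0.75, 0.15]])

class_index = np.argmax(probabilities, axis=1)   # most probable class per row
predicted_speaker = speakers[class_index]        # index back into the class names
confidence = probabilities[0, class_index[0]]
print(predicted_speaker, confidence)
```

Printing the winning probability alongside the label shows how confident the classifier is in its prediction.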

### <b>Other VC models</b>

From https://paperswithcode.com/task/voice-conversion:
- https://github.com/bshall/VectorQuantizedCPC
- https://github.com/bshall/knn-vc

Other candidates:
- Neural network-based voice conversion models, such as CycleGAN, StarGAN-VC, or Tacotron 2
- VITS
- VALL-E

- Pick one or two files for EDA, since audio files are typically very heavy. Do not load more than 5 files at a time; that might not even be possible.

- To deal with the volume and high dimensionality of audio data, try:
1) Google Colab
2) Keras audio data loading
https://keras.io/api/data_loading/audio/#audio_dataset_from_directory-function


- A library is needed to convert audio files into machine-readable data, i.e. numbers.
- Another library is needed to train a model (e.g. a sequence/deep learning model for time-series audio data).
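To show what "converting audio to numbers" means, the sketch below uses only the Python standard library. It writes a short 16-bit mono tone to an in-memory WAV file and reads it back as a list of integers. The 16 kHz rate and the 440 Hz tone are assumptions for this toy signal; `librosa.load`, used in the notebook, additionally converts the samples to floats.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # assumed rate for this toy signal

# Write a 10 ms, 440 Hz mono tone as 16-bit PCM into an in-memory WAV file.
n_samples = SAMPLE_RATE // 100
tone = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
        for n in range(n_samples)]
buffer = io.BytesIO()
with wave.open(buffer, 'wb') as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)            # 2 bytes = 16-bit samples
    wav_out.setframerate(SAMPLE_RATE)
    wav_out.writeframes(struct.pack(f'<{n_samples}h', *tone))

# Read it back: the audio is just a sequence of signed integers.
buffer.seek(0)
with wave.open(buffer, 'rb') as wav_in:
    frame_rate = wav_in.getframerate()
    raw = wav_in.readframes(wav_in.getnframes())
samples = list(struct.unpack(f'<{len(raw) // 2}h', raw))
print(frame_rate, len(samples), samples[:5])
```

Even this 10 ms clip holds 160 values, which is why the notes above recommend exploring only a handful of files at a time.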