<a href="https://colab.research.google.com/github/jsjj10002/FackVoiceClassfication/blob/main/%EC%9D%8C%EC%84%B1_%ED%8C%90%EB%B3%84_%EA%B8%B0%EC%B4%88_%EB%AA%A8%EB%8D%B8_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset Used

[The Fake-or-Real (FoR) Dataset](https://www.kaggle.com/datasets/mohammedabdeldayem/the-fake-or-real-dataset/data)

The Fake-or-Real (FoR) dataset is a collection of more than 195,000 utterances from real humans and computer-generated speech. The dataset can be used to train classifiers to detect synthetic speech. It is published in four versions: for-original, for-norm, for-2sec and for-rerec.

The first version, named for-original, contains the files as collected from the speech sources, without any modification (balanced version).

***The second version, called for-norm, contains the same files, but balanced in terms of gender and class and normalized in terms of sample rate, volume and number of channels.***

The third one, named for-2sec is based on the second one, but with the files truncated at 2 seconds.

The last version, named for-rerec, is a rerecorded version of the for-2sec dataset, to simulate a scenario where an attacker sends an utterance through a voice channel (e.g. a phone call or a voice message).

Version used: for-norm (the normalized version)

# Load the Kaggle API credentials and download the dataset


In [None]:
!gdown https://drive.google.com/uc?id=1Mnp2CPxa6hDi7_hkwUrleP6pQqulrXz0

Downloading...
From: https://drive.google.com/uc?id=1Mnp2CPxa6hDi7_hkwUrleP6pQqulrXz0
To: /content/kaggle.json
  0% 0.00/67.0 [00:00<?, ?B/s]100% 67.0/67.0 [00:00<00:00, 233kB/s]


In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d mohammedabdeldayem/the-fake-or-real-dataset

Dataset URL: https://www.kaggle.com/datasets/mohammedabdeldayem/the-fake-or-real-dataset
License(s): GNU Lesser General Public License 3.0
Downloading the-fake-or-real-dataset.zip to /content
100% 16.0G/16.0G [02:43<00:00, 197MB/s]
100% 16.0G/16.0G [02:43<00:00, 105MB/s]


In [None]:
%cd /content
!mkdir the-fake-or-real-dataset
!unzip -qq "/content/the-fake-or-real-dataset.zip" -d the-fake-or-real-dataset/

/content


# 1. Collect audio file paths

In [None]:
import os
import pandas as pd

# Root directory of the for-norm version
base_path = "/content/the-fake-or-real-dataset/for-norm/for-norm"

# Split names (categories) and class labels (types), matching the folder names
categories = ['testing', 'training', 'validation']
types = ['fake', 'real']

# Collect the path of every .wav file together with its label and split
data = []
for category in categories:
    for type_ in types:
        dir_path = os.path.join(base_path, category, type_)
        for filename in os.listdir(dir_path):
            if filename.endswith('.wav'):
                full_path = os.path.join(dir_path, filename)
                data.append({'path': full_path, 'label': type_, 'category': category})

# Build the DataFrame
path_df = pd.DataFrame(data)
print("Dataframe shape:", path_df.shape)
print(path_df.head())


Dataframe shape: (69300, 3)
                                                path label category
0  /content/the-fake-or-real-dataset/for-norm/for...  fake  testing
1  /content/the-fake-or-real-dataset/for-norm/for...  fake  testing
2  /content/the-fake-or-real-dataset/for-norm/for...  fake  testing
3  /content/the-fake-or-real-dataset/for-norm/for...  fake  testing
4  /content/the-fake-or-real-dataset/for-norm/for...  fake  testing
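
Before extracting features, it can help to confirm how many files fall into each split and class. The sketch below uses only the standard library and a hypothetical list of records (the file names are made up) that stands in for `path_df.to_dict('records')`.

```python
from collections import Counter

# Hypothetical stand-in for path_df.to_dict('records'); the paths are made up.
records = [
    {"path": "a.wav", "label": "fake", "category": "training"},
    {"path": "b.wav", "label": "real", "category": "training"},
    {"path": "c.wav", "label": "real", "category": "training"},
    {"path": "d.wav", "label": "fake", "category": "testing"},
]

# Count files for every (split, label) pair.
counts = Counter((r["category"], r["label"]) for r in records)

for (category, label), n in sorted(counts.items()):
    print(f"{category:<10} {label:<5} {n}")
```

On the real DataFrame, `path_df.groupby(['category', 'label']).size()` should give the same kind of table.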


# 2. MFCC feature extraction

In [None]:
import librosa
import numpy as np
from multiprocessing import Pool

def extract_mfcc(file_path):
    """Return the time-averaged 13-coefficient MFCC vector for one file, or None on failure."""
    try:
        # sr=None keeps the file's native sample rate instead of resampling
        audio, sample_rate = librosa.load(file_path, sr=None)
        if audio.size == 0:
            return None
        mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)
        # Average over the time axis: (13, n_frames) -> (13,)
        mfcc_mean = np.mean(mfcc, axis=1)
        return mfcc_mean
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return None

def process_row(row):
    """Extract MFCCs for one record; return None if extraction failed."""
    mfcc = extract_mfcc(row['path'])
    if mfcc is not None:
        return {'path': row['path'], 'mfcc': mfcc, 'label': row['label'], 'category': row['category']}
    else:
        return None

# Convert the DataFrame to a list of records for parallel processing
data_list = path_df.to_dict('records')
# Extract the MFCC features in parallel with 4 worker processes
with Pool(processes=4) as pool:
    results = pool.map(process_row, data_list)

# Drop the None results and build the DataFrame
mfcc_data = [result for result in results if result is not None]
mfcc_df = pd.DataFrame(mfcc_data)

print("MFCC DataFrame shape:", mfcc_df.shape)
print(mfcc_df.head())




MFCC DataFrame shape: (69298, 4)
                                                path  \
0  /content/the-fake-or-real-dataset/for-norm/for...   
1  /content/the-fake-or-real-dataset/for-norm/for...   
2  /content/the-fake-or-real-dataset/for-norm/for...   
3  /content/the-fake-or-real-dataset/for-norm/for...   
4  /content/the-fake-or-real-dataset/for-norm/for...   

                                                mfcc label category  
0  [-121.118385, 104.168976, -17.292633, 7.035876...  fake  testing  
1  [-133.96411, 77.051575, -3.4393408, 11.727731,...  fake  testing  
2  [-153.86955, 123.38464, -12.190086, 23.369476,...  fake  testing  
3  [-126.404106, 96.755585, -7.130523, 5.172734, ...  fake  testing  
4  [-136.06691, 120.67819, -12.658545, 18.204365,...  fake  testing  
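
Each file is reduced to 13 numbers by averaging its MFCC matrix over time. The sketch below illustrates what that averaging does, using a synthetic matrix in place of real `librosa` output (the 87-frame length is an arbitrary assumption). It also shows one consequence: the order of frames has no effect on the resulting feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an MFCC matrix: 13 coefficients x 87 frames.
mfcc = rng.normal(size=(13, 87))

# Same reduction as in extract_mfcc: average each coefficient over time.
mfcc_mean = mfcc.mean(axis=1)

# Shuffle the frames in time and average again.
shuffled = mfcc[:, rng.permutation(mfcc.shape[1])]
same = np.allclose(mfcc_mean, shuffled.mean(axis=1))

print(mfcc_mean.shape, same)
```

Because the mean ignores frame order, any cue that depends on how the signal evolves over time is not available to the classifier in this representation.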


# 3. Split the dataset

In [None]:
train_df = mfcc_df[mfcc_df['category'] == 'training']
val_df = mfcc_df[mfcc_df['category'] == 'validation']
test_df = mfcc_df[mfcc_df['category'] == 'testing']

print("Training Data Shape:", train_df.shape)
print("Validation Data Shape:", val_df.shape)
print("Testing Data Shape:", test_df.shape)


Training Data Shape: (53866, 4)
Validation Data Shape: (10798, 4)
Testing Data Shape: (4634, 4)
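
As a quick sanity check, the three split sizes printed above should add up to the 69,298 rows of the MFCC DataFrame. The short sketch below does that arithmetic and prints each split's share; the numbers are copied from the output above.

```python
# Row counts copied from the printed shapes above.
sizes = {"training": 53866, "validation": 10798, "testing": 4634}

total = sum(sizes.values())
fractions = {name: n / total for name, n in sizes.items()}

for name, frac in fractions.items():
    print(f"{name:<10} {sizes[name]:>6} {frac:.1%}")
```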


# 4. Prepare the data

In [None]:
X_train = np.vstack(train_df['mfcc'].apply(lambda x: np.array(x)).values)
Y_train = train_df['label'].values

X_val = np.vstack(val_df['mfcc'].apply(lambda x: np.array(x)).values)
Y_val = val_df['label'].values

X_test = np.vstack(test_df['mfcc'].apply(lambda x: np.array(x)).values)
Y_test = test_df['label'].values

print("X_train shape:", X_train.shape)
print("Y_train shape:", Y_train.shape)
print("X_val shape:", X_val.shape)
print("Y_val shape:", Y_val.shape)
print("X_test shape:", X_test.shape)
print("Y_test shape:", Y_test.shape)


X_train shape: (53866, 13)
Y_train shape: (53866,)
X_val shape: (10798, 13)
Y_val shape: (10798,)
X_test shape: (4634, 13)
Y_test shape: (4634,)


## Required libraries

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam

# 5. Normalize the data

In [None]:
# Rescale the features with StandardScaler: fit on the training set only, then apply the same transform to the validation and test sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
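
For reference, the scaling step can be written out by hand. The sketch below is a minimal NumPy version on synthetic data (the `_demo` arrays are hypothetical, not the notebook's features): the mean and standard deviation are computed on the training set only and then reused for the validation set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the 13-dimensional MFCC feature matrices.
X_train_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 13))
X_val_demo = rng.normal(loc=5.0, scale=3.0, size=(20, 13))

# "fit": per-feature statistics from the training set only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to every split.
X_train_scaled = (X_train_demo - mean) / std
X_val_scaled = (X_val_demo - mean) / std

print(X_train_scaled.mean(axis=0).round(6)[:3], X_train_scaled.std(axis=0).round(6)[:3])
```

Reusing the training statistics keeps information from the validation and test sets out of the preprocessing step.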

# 6. Label encoding

In [None]:
# Convert the string labels to integers.
encoder = LabelEncoder()
Y_train = encoder.fit_transform(Y_train)
Y_val = encoder.transform(Y_val)
Y_test = encoder.transform(Y_test)

## 6-1 One-hot encoding

In [None]:
# Convert the integer labels to one-hot vectors for classification.
Y_train = to_categorical(Y_train)
Y_val = to_categorical(Y_val)
Y_test = to_categorical(Y_test)
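
The two encoding steps can be mirrored in plain NumPy to see which column ends up meaning what. The sketch below assumes a small hypothetical label array. Like `LabelEncoder`, `np.unique` orders the classes alphabetically, so `fake` becomes 0 and `real` becomes 1.

```python
import numpy as np

# Hypothetical labels in the same string form as the 'label' column.
labels = np.array(["real", "fake", "fake", "real"])

# Step 1: map strings to integer ids (classes come back sorted).
classes, encoded = np.unique(labels, return_inverse=True)

# Step 2: turn each integer id into a one-hot row.
one_hot = np.eye(len(classes), dtype="float32")[encoded]

print(classes)
print(encoded)
print(one_hot)
```

With this ordering, column 0 of the model's softmax output corresponds to `fake` and column 1 to `real`.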

# 7. Build the model

In [None]:
# A simple feed-forward network: two hidden layers followed by an output layer.
model = Sequential([
    Dense(256, activation='relu', input_shape=(X_train.shape[1],)),  # first hidden layer (receives the input features)
    Dense(128, activation='relu'),                                   # second hidden layer
    Dense(Y_train.shape[1], activation='softmax')                    # output layer, one node per class
])
# Compile the model with the 'categorical_crossentropy' loss and the Adam optimizer.
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
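
To get a feel for the model's size, the number of trainable parameters can be worked out by hand: each `Dense` layer has `inputs * outputs` weights plus `outputs` biases. The sketch below assumes 13 input features and 2 classes, as in the shapes printed earlier.

```python
# Layer widths: 13 input features -> 256 -> 128 -> 2 classes.
layer_sizes = [13, 256, 128, 2]

# Weights plus biases for each Dense layer.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)
```

The total should agree with the figure reported by `model.summary()`.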

# 8. Train the model

In [None]:
# Train the model on the training data and evaluate it on the validation data after every epoch.
model.fit(X_train, Y_train, epochs=10, validation_data=(X_val, Y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x79abc9ca2980>

# 9. Evaluate the model

In [None]:
# Evaluate the trained model on the test data.
test_loss, test_accuracy = model.evaluate(X_test, Y_test)
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

Test Accuracy: 50.50%
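
A test accuracy of 50.50% on a two-class problem is close to what guessing would give if the test set is balanced, so a single accuracy number says little about what the model is doing. One way to look closer is a confusion matrix and a majority-class baseline. The sketch below computes both with NumPy from hypothetical one-hot labels and softmax outputs; the arrays are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical one-hot labels and softmax outputs (column 0 = fake, column 1 = real).
y_true_onehot = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]])

y_true = y_true_onehot.argmax(axis=1)
y_pred = y_prob.argmax(axis=1)

# Rows are true classes, columns are predicted classes.
n_classes = y_true_onehot.shape[1]
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

accuracy = np.trace(confusion) / confusion.sum()
# Accuracy of always predicting the most frequent true class.
majority_baseline = confusion.sum(axis=1).max() / confusion.sum()

print(confusion)
print(f"accuracy={accuracy:.2f} baseline={majority_baseline:.2f}")
```

With the notebook's variables, `y_prob` would come from `model.predict(X_test)` and `y_true_onehot` would be `Y_test`. If nearly all predictions land in one column, the model is effectively predicting a single class on the test split.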
