# Final Report - Data Science
## Bachelor's Degree in Computer Science / PUCPR

**Prof. Jean Paul Barddal** / **Prof. Rayson Laroca**

`Guilherme Schwarz` - `guilherme.schwarz@pucpr.edu.br`

`Julia Cristina Moreira da Silva` - `s.moreira4@pucpr.edu.br`

`2025`

# Imports & Installs

In [1]:
%pip install numpy librosa tqdm pydub soundfile scikit-image tensorflow sklearn pandas

import os
import time
import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm
from skimage.transform import resize
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models
from sklearn.model_selection import train_test_split

Note: you may need to restart the kernel to use updated packages.


ImportError: Traceback (most recent call last):
  File "c:\Users\Chico\AppData\Local\Programs\Python\Python311\Lib\site-packages\tensorflow\python\pywrap_tensorflow.py", line 73, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: DLL load failed while importing _pywrap_tensorflow_internal: Uma rotina de inicialização da biblioteca de vínculo dinâmico (DLL) falhou.


Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/errors for some common causes and solutions.
If you need help, create an issue at https://github.com/tensorflow/tensorflow/issues and include the entire stack trace above this error message.

# Data & Data Treatment

## Paths

In [17]:
AUDIO_DIR = 'Data/flac/'
PROTOCOL_PATH = 'Data/ASVspoof2021Protocol.txt'
OUTPUT_PATH = 'Data/asvspoof_features.csv'
SAMPLE_OUTPUT_PATH = 'Data/sample_features.csv'

## Data Treatment Functions

In [18]:
import os
import librosa
import numpy as np
import matplotlib.pyplot as plt
from skimage.transform import resize

def extract_mfcc(y, sr, n_mfcc=40, hop_length=256):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)

def extract_logmel(y, sr, n_mels=128, hop_length=256):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    return librosa.power_to_db(mel)

def extract_lfcc(y, sr, n_lfcc=40, hop_length=256):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=hop_length)
    power_db = librosa.power_to_db(S)
    return librosa.feature.mfcc(S=power_db, sr=sr, n_mfcc=n_lfcc)

def normalize_feature(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x) + 1e-6)

def extract_features(y, sr):
    mfcc = extract_mfcc(y, sr)
    logmel = extract_logmel(y, sr)
    lfcc = extract_lfcc(y, sr)
    
    # Average over time only here
    mfcc = np.mean(mfcc, axis=1)
    logmel = np.mean(logmel, axis=1)
    lfcc = np.mean(lfcc, axis=1)

    return np.concatenate([mfcc, logmel, lfcc])

def create_feature_image(y, sr, target_shape=(64, 64)):
    mfcc = extract_mfcc(y, sr)
    logmel = extract_logmel(y, sr)
    lfcc = extract_lfcc(y, sr)

    # Normalize without reducing dimensions
    mfcc = normalize_feature(mfcc)
    logmel = normalize_feature(logmel)
    lfcc = normalize_feature(lfcc)

    # Resize feature maps
    mfcc_resized = resize(mfcc, target_shape, mode='reflect', anti_aliasing=True)
    logmel_resized = resize(logmel, target_shape, mode='reflect', anti_aliasing=True)
    lfcc_resized = resize(lfcc, target_shape, mode='reflect', anti_aliasing=True)

    return np.stack([mfcc_resized, logmel_resized, lfcc_resized], axis=-1)

def plot_feature_image(image_input):
    _, axs = plt.subplots(1, 3, figsize=(12, 4))
    titles = ['MFCC', 'Log-Mel', 'LFCC']

    for i in range(3):
        axs[i].imshow(image_input[:, :, i], aspect='auto', origin='lower', cmap='magma')
        axs[i].set_title(titles[i])
        axs[i].axis('off')

    plt.tight_layout()
    plt.show()

def save_feature_images(image_input, file_name, output_dir='./Data/Images'):
    os.makedirs(output_dir, exist_ok=True)
    feature_names = ['mfcc', 'logmel', 'lfcc']

    for i, name in enumerate(feature_names):
        output_path = os.path.join(output_dir, name)
        os.makedirs(output_path, exist_ok=True)

        fig, ax = plt.subplots(figsize=(4, 4))
        ax.imshow(image_input[:, :, i], aspect='auto', origin='lower', cmap='magma')
        ax.axis('off')

        fig.patch.set_facecolor('black')
        image_path = os.path.join(output_path, f"{file_name}.png")
        plt.subplots_adjust(left=0, right=1, top=1, bottom=0)
        plt.savefig(image_path, dpi=100, bbox_inches='tight', pad_inches=0, facecolor='black')
        plt.close()

def process_audio_row(row, audio_dir):
    file_path = os.path.join(audio_dir, row['file_name'] + '.flac')
    if not os.path.exists(file_path):
        print(f"File not found: {file_path}")
        return None

    try:
        y, sr = librosa.load(file_path, sr=None)
        features = extract_features(y, sr)
        feature_image = create_feature_image(y, sr)
        save_feature_images(feature_image, row['file_name'])

        return {
            'file_name': row['file_name'],
            'label': 1 if row['label'] == 'bonafide' else 0,
            **{f'feature_{i+1}': val for i, val in enumerate(features)}
        }
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return None


## Load Protocol File

In [19]:
protocol = pd.read_csv(PROTOCOL_PATH, sep=' ', header=None)
protocol.columns = [
    "id", "file_name", "codec", "source_db", "system_id", "label",
    "trim_status", "set_type", "spoof_category",
    "track", "team", "subset", "group"
]

protocol.head()

Unnamed: 0,id,file_name,codec,source_db,system_id,label,trim_status,set_type,spoof_category,track,team,subset,group
0,LA_0023,DF_E_2000011,nocodec,asvspoof,A14,spoof,notrim,progress,traditional_vocoder,-,-,-,-
1,TEF2,DF_E_2000013,low_m4a,vcc2020,Task1-team20,spoof,notrim,eval,neural_vocoder_nonautoregressive,Task1,team20,FF,E
2,TGF1,DF_E_2000024,mp3m4a,vcc2020,Task2-team12,spoof,notrim,eval,traditional_vocoder,Task2,team12,FF,G
3,LA_0043,DF_E_2000026,mp3m4a,asvspoof,A09,spoof,notrim,eval,traditional_vocoder,-,-,-,-
4,LA_0021,DF_E_2000027,mp3m4a,asvspoof,A12,spoof,notrim,eval,neural_vocoder_autoregressive,-,-,-,-


## Process 100 files to create a sample and benchmark process (if not already done)

In [None]:
if not os.path.exists(SAMPLE_OUTPUT_PATH):
    features_list = []
    num_files = 3000
    processed = 0
    total_time = 0

    for _, row in tqdm(protocol.iterrows(), total=len(protocol), desc="Extracting"):
        start_time = time.time()
        result = process_audio_row(row, AUDIO_DIR)
        elapsed = time.time() - start_time
        if result is not None:
            features_list.append(result)
            total_time += elapsed
            processed += 1
        if processed >= num_files:
            break

    # Save features to CSV
    if features_list:
        average_time = total_time / processed
        features_df = pd.DataFrame(features_list)
        features_df.to_csv(SAMPLE_OUTPUT_PATH, index=False)
        print(f"\nProcessed {processed} files.")
        print(f"Total time: {total_time:.2f} seconds")
        print(f"Average time per file: {average_time:.4f} seconds")
        print(f"Expected time for whole feature extraction: {average_time * len(protocol) / 3600} hours")
        print(f"Sample features saved to '{SAMPLE_OUTPUT_PATH}'")
    else:
        print("No features extracted.")
else:
    print("Sample CSV already exists")

  y, sr = librosa.load(file_path, sr=None)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
Extracting:   0%|          | 4/611829 [00:01<47:34:48,  3.57it/s]


Processed 5 files.
Total time: 1.12 seconds
Average time per file: 0.2232 seconds
Expected time for whole feature extraction: 37.92826604962349 hours
Sample features saved to 'Data/sample_features.csv'





## Create CSV from audio files (if not already created)

In [21]:
if not os.path.exists(OUTPUT_PATH):
    features_list = []
    for _, row in tqdm(protocol.iterrows(), total=len(protocol)):
        result = process_audio_row(row, AUDIO_DIR)
        if result is not None:
            features_list.append(result)

    if features_list:
        features_df = pd.DataFrame(features_list)
        features_df.to_csv(OUTPUT_PATH, index=False)
        print(f"Processed {len(features_list)} files. Features saved to '{OUTPUT_PATH}'")
    else:
        print("No features extracted.")
else:
    print("CSV features file already exists")

CSV features file already exists


## Load the definitive Data Frame

In [22]:
# features_df will already exist if the code previous code block generated a CSV file.
# This checks if the variable was ever created in this session.
if "features_df" not in locals():
    features_df = pd.read_csv(OUTPUT_PATH)

features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 210 entries, file_name to feature_208
dtypes: float32(208), int64(1), object(1)
memory usage: 4.3+ KB


## Head

In [23]:
features_df.head()

Unnamed: 0,file_name,label,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,...,feature_199,feature_200,feature_201,feature_202,feature_203,feature_204,feature_205,feature_206,feature_207,feature_208
0,DF_E_2000011,0,-258.156494,68.461945,-4.070857,41.81522,-12.947649,0.65259,-32.048534,-10.525287,...,-10.158218,-5.305505,-5.499173,-7.173549,-7.56113,-4.460036,-9.567829,-6.674157,-7.094151,-5.831952
1,DF_E_2000013,0,-197.490753,74.101059,-20.719416,21.303391,3.568402,-23.902008,-21.944942,-15.401686,...,1.075542,2.747469,3.197092,4.042048,2.142642,0.286704,0.267961,-1.313032,-1.234944,-2.65025
2,DF_E_2000024,0,-245.888214,58.375427,3.938906,27.994375,-4.515254,-2.728295,-31.741747,-4.433678,...,-5.44619,2.675401,-3.27186,3.757226,-0.948672,4.274937,0.186334,1.338551,-2.338968,-0.135259
3,DF_E_2000026,0,-258.217987,64.79245,-26.174234,31.422873,-39.746807,-4.368946,-24.203897,-17.59404,...,-0.987019,0.448028,-3.692505,-3.039355,-2.499328,-2.017834,-2.915188,-4.04407,-3.990771,-3.360973
4,DF_E_2000027,0,-285.028107,91.180786,-16.007261,40.066257,-11.105558,-14.232034,-10.550929,-19.979368,...,-3.610543,-4.734384,-3.698732,-8.030618,-7.725413,-3.885382,-1.117573,-2.100805,-4.414304,-8.654475


## Split data into training, validation and test

In [24]:
# Split features and target
X = features_df.drop('label', axis=1)  # Replace with your actual target column name
y = features_df['label']

# TRAIN      : 70%
# VALIDATION : 15%
# TEST       : 15%
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

# Random Forest

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Validation performance
rf_val_preds = rf_model.predict(X_val)
print("🎯 Random Forest - Validation Set")
print("Accuracy:", accuracy_score(y_val, rf_val_preds))
print("Confusion Matrix:\n", confusion_matrix(y_val, rf_val_preds))
print("Classification Report:\n", classification_report(y_val, rf_val_preds))

# SVM

In [None]:
svm_model = make_pipeline(StandardScaler(), SVC(kernel='rbf', probability=True, random_state=42))
svm_model.fit(X_train, y_train)

# Validation performance
svm_val_preds = svm_model.predict(X_val)
print("\n🎯 SVM - Validation Set")
print("Accuracy:", accuracy_score(y_val, svm_val_preds))
print("Confusion Matrix:\n", confusion_matrix(y_val, svm_val_preds))
print("Classification Report:\n", classification_report(y_val, svm_val_preds))

# Shallow CNN on Logmel 

In [None]:
# Parameters
image_size = (64, 64)
batch_size = 32
image_dir = './Data/Images/logmel'

# Step 1: Image Generator
datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2
)

train_generator = datagen.flow_from_directory(
    image_dir,
    target_size=image_size,
    batch_size=batch_size,
    class_mode='binary',
    subset='training'
)

val_generator = datagen.flow_from_directory(
    image_dir,
    target_size=image_size,
    batch_size=batch_size,
    class_mode='binary',
    subset='validation'
)

# Step 2: Define a Shallow CNN
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),

    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Step 3: Train the Model
history = model.fit(
    train_generator,
    validation_data=val_generator,
    epochs=10
)

# Statistical Description

In this section, you should report the key characteristics of the dataset, including but not limited to:
* Number of instances;
* Number of features;
* Number of classes;
* Class distribution.

In [25]:
# use as many code and text cells as needed

# Univariate data analysis

In this section, you should perform univariate data analysis on at least **20 variables**.

In the end, you should describe the main variables that are of your interest, and these should be accounted for in the next sections of the report.
The definition of each variable chosen should be clarified, so arbitrary selections are **not** accepted at this point.

For each variable plotted, make sure you determine the following:
1. The distribution of the data (Gaussian, binomial, exponential, etc.);
2. Skewness;
3. Kurtosis;
4. Mean, standard deviation, and what they stand for in the context of the dataset.

Ensure that each variable is **plotted correctly** based on its type. For instance, make sure scatterplots are not used for categorical data and so forth.

In [26]:
# place as many cells to plot the visualizations,
# as well as to describe the main findings

In [27]:
# if you realize you need to further clean your data here, there is no problem,
# yet, make sure you are describing the entire process and the rationale
# behind your choices here

# Multivariate data analysis

In this section, you should plot at least **5 multivariate visualizations**. The key here is to investigate underlying correlations and behaviors within the dataset.
Naturally, as visualizations are being created, we should end up with obvious results, yet, you should find at least **ONE** non-obvious behavior in the data.

Please follow these steps for creating your visualizations:
1. State an hypothesis. Explain why you have selected these specific variables and what you expect to discover through their relationship;
2. Determine what kind of visualization is the most suitable;
3. Report the findings and discuss whether they corroborate or not the aforementioned hypothesis.


### Hints

In this section, make sure you go beyond naive explorations. For example, consider applying techniques such as PCA, t-SNE, or even others that we haven't covered in the lectures. The goal is to cultivate a critical mindset toward data analysis and our work.

### Important

It is strictly prohibited to create multivariate visualizations using variables that were not included in the previous section (univariate data analysis).

In [28]:
# again, feel free to place as many cells to plot the visualizations,
# as well as describe to the main findings

# Final Plots (Effective Data Visualization)

In this section, you need to **enhance 3 multivariate visualizations** that were presented in the previous section of the report.
The goal is to enhance these visualizations so that they can be effectively presented to an audience unfamiliar with the dataset or with data analysis.
**Therefore, make sure that their size, colors, textures, and other visual elements are appropriate and convey the intended information to the audience.**

For your final plots, make sure you follow these steps:
1. Present the plot;
2. Provide a description of the visualization, highlighting the key findings that can be drawn from it.


**Hint**: take a look at the checklist based on Evergreen’s work to ensure your visualizations meet the best practices for clarity and impact.

In [29]:
# your code goes here

# Digest

In this section you should write down the main findings of this exploratory data analysis. Furthermore, you should provide a reflection about your own work and effort during the module, highlighting what you believe you have done well and what you should have done differently. This digest should have at least 2,500 characters (excluding spaces).

```
Add your text here.
```

# Machine Learning (**post checkpoint!**)

In this section, you must create at least **3 machine learning models** for the task at hand. Depending on the problem's nature, you must select from classification, regression, or clustering models.
It is also important that you:
* Select **an appropriate validation protocol**, providing a rationale for why it is appropriate for this specific task;
* Choose **a suitable set of evaluation metrics**, providing an explanation for each and describing how it contributes to evaluating the model's performance in the context of this specific task.

In [30]:
# use as many cells as needed

# Final Steps (Submission)


1. Save this report as a Jupyter Notebook (`.ipynb`);
2. Export a copy of the report as a PDF file (`.pdf`);
3. Copy the dataset;
4. Compress all the files (the Jupyter Notebook, PDF, and dataset) into a single ZIP archive (`<your_team_name>.zip`);
5. Upload the ZIP file to AVA.