# Final Report - Data Science
## Bachelor's Degree in Computer Science / PUCPR

**Prof. Jean Paul Barddal** / **Prof. Rayson Laroca**

`Guilherme Schwarz` - `guilherme.schwarz@pucpr.edu.br`

`Julia Cristina Moreira da Silva` - `s.moreira4@pucpr.edu.br`

`2025`

# Imports & Installs

In [38]:
%pip install librosa tqdm pydub soundfile

import os
import librosa
import numpy as np
import pandas as pd
import time
from tqdm import tqdm


# Data & Data Treatment

## Paths

In [39]:
AUDIO_DIR = 'Data/flac/'
PROTOCOL_PATH = 'Data/ASVspoof2021Protocol.txt'
OUTPUT_PATH = 'Data/asvspoof_features.csv'
SAMPLE_OUTPUT_PATH = 'Data/sample_features.csv'

## Data Treatment Functions

In [40]:
# Function to extract MFCCs
def extract_mfcc(y, sr, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)

# Function to extract log-Mel spectrogram
def extract_logmel(y, sr, n_mels=128):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)
    return np.mean(logmel, axis=1)

# Function to extract LFCCs
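def extract_lfcc(y, sr, n_lfcc=20):
    # Power spectrogram on a linear frequency axis (squared STFT magnitude),
    # since librosa.power_to_db expects power rather than magnitude
    power = np.abs(librosa.stft(y)) ** 2
    # When S is given, librosa.feature.mfcc only applies the DCT, so this yields
    # cepstral coefficients of the linear-frequency log-power spectrum;
    # no triangular filter bank is applied here
    lfcc = librosa.feature.mfcc(S=librosa.power_to_db(power), sr=sr, n_mfcc=n_lfcc)
    return np.mean(lfcc, axis=1)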

def extract_features(y, sr):
    mfcc = extract_mfcc(y, sr)
    logmel = extract_logmel(y, sr)
    lfcc = extract_lfcc(y, sr)
    return np.concatenate([mfcc, logmel, lfcc])

def process_audio_row(row, audio_dir):
    file_path = os.path.join(audio_dir, row['file_name'] + '.flac')
    if not os.path.exists(file_path):
        print(f"File not found: {file_path}")
        return None

    try:
        y, sr = librosa.load(file_path, sr=None)  # load_audio_pydub(file_path)
        features = extract_features(y, sr)

        return {
            'file_name': row['file_name'],
            'label': 1 if row['label'] == 'bonafide' else 0,
            **{f'feature_{i+1}': val for i, val in enumerate(features)}
        }
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return None
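
`extract_lfcc` takes the DCT of the log spectrogram without passing it through a filter bank. If a filter-bank LFCC is wanted, it could look roughly like the sketch below. It assumes a power spectrogram of shape `(n_fft / 2 + 1, frames)`, for example `np.abs(librosa.stft(y)) ** 2`. The helpers `linear_filterbank`, `dct_ii`, and `lfcc_from_power` are hypothetical names introduced here for illustration; they are not part of librosa and were not used to produce the CSV files in this report.

```python
import numpy as np

def linear_filterbank(n_filters, n_fft, sr):
    """Triangular filters whose centers are spaced linearly from 0 Hz to sr/2."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0, sr / 2, n_bins)
    # n_filters + 2 edges: each filter spans three consecutive edges
    edges = np.linspace(0, sr / 2, n_filters + 2)
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - lo) / (center - lo)
        falling = (hi - bin_freqs) / (hi - center)
        # Negative values lie outside the triangle and are clipped to zero
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_ii(x, n_out):
    """Orthonormal DCT-II along axis 0, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n)) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)
    return basis @ x

def lfcc_from_power(power_spec, sr, n_lfcc=20, n_filters=20):
    """LFCCs per frame: linear filter bank -> log energy (dB) -> DCT."""
    n_fft = 2 * (power_spec.shape[0] - 1)
    fb = linear_filterbank(n_filters, n_fft, sr)
    energies = fb @ power_spec
    log_energies = 10.0 * np.log10(np.maximum(energies, 1e-10))
    return dct_ii(log_energies, n_lfcc)
```

Averaging the result over frames with `np.mean(..., axis=1)` would give a vector of the same length as the one returned by `extract_lfcc`, although the values would differ.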

## Load Protocol File

In [41]:
protocol = pd.read_csv(PROTOCOL_PATH, sep=' ', header=None)
protocol.columns = [
    "id", "file_name", "codec", "source_db", "system_id", "label",
    "trim_status", "set_type", "spoof_category",
    "track", "team", "subset", "group"
]

protocol.head()

| | id | file_name | codec | source_db | system_id | label | trim_status | set_type | spoof_category | track | team | subset | group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LA_0023 | DF_E_2000011 | nocodec | asvspoof | A14 | spoof | notrim | progress | traditional_vocoder | - | - | - | - |
| 1 | TEF2 | DF_E_2000013 | low_m4a | vcc2020 | Task1-team20 | spoof | notrim | eval | neural_vocoder_nonautoregressive | Task1 | team20 | FF | E |
| 2 | TGF1 | DF_E_2000024 | mp3m4a | vcc2020 | Task2-team12 | spoof | notrim | eval | traditional_vocoder | Task2 | team12 | FF | G |
| 3 | LA_0043 | DF_E_2000026 | mp3m4a | asvspoof | A09 | spoof | notrim | eval | traditional_vocoder | - | - | - | - |
| 4 | LA_0021 | DF_E_2000027 | mp3m4a | asvspoof | A12 | spoof | notrim | eval | neural_vocoder_autoregressive | - | - | - | - |


## Process 100 files to create a sample and benchmark the extraction (if not already done)

In [42]:
if not os.path.exists(SAMPLE_OUTPUT_PATH):
    features_list = []
    num_files = 100
    processed = 0
    total_time = 0

    for _, row in tqdm(protocol.iterrows(), total=len(protocol), desc="Extracting"):
        start_time = time.time()
        result = process_audio_row(row, AUDIO_DIR)
        elapsed = time.time() - start_time
        if result is not None:
            features_list.append(result)
            total_time += elapsed
            processed += 1
        if processed >= num_files:
            break

    # Save features to CSV
    if features_list:
        average_time = total_time / processed
        features_df = pd.DataFrame(features_list)
        features_df.to_csv(SAMPLE_OUTPUT_PATH, index=False)
        print(f"\nProcessed {processed} files.")
        print(f"Total time: {total_time:.2f} seconds")
        print(f"Average time per file: {average_time:.4f} seconds")
        print(f"Expected time for whole feature extraction: {average_time * len(protocol) / 3600} hours")
        print(f"Sample features saved to {SAMPLE_OUTPUT_PATH}")
    else:
        print("No features extracted.")

  y, sr = librosa.load(file_path, sr=None)  # load_audio_pydub(file_path)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
Extracting:   0%|          | 99/611829 [00:03<6:40:20, 25.47it/s]


Processed 100 files.
Total time: 3.85 seconds
Average time per file: 0.0385 seconds
Expected time for whole feature extraction: 6.540861457407475 hours
Sample features saved to Data/sample_features.csv





## Create CSV from audio files (if not already created)

In [43]:
if not os.path.exists(OUTPUT_PATH):
    features_list = []
    for _, row in tqdm(protocol.iterrows(), total=len(protocol)):
        result = process_audio_row(row, AUDIO_DIR)
        if result is not None:
            features_list.append(result)

    if features_list:
        features_df = pd.DataFrame(features_list)
        features_df.to_csv(OUTPUT_PATH, index=False)
        print(f"Processed {len(features_list)} files. Features saved to {OUTPUT_PATH}")
    else:
        print("No features extracted.")

  y, sr = librosa.load(file_path, sr=None)  # load_audio_pydub(file_path)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
100%|██████████| 611829/611829 [7:31:17<00:00, 22.60it/s]   


Processed 611829 files. Features saved to Data/asvspoof_features.csv


# Statistical Description

In this section, you should report the key characteristics of the dataset, including but not limited to:
* Number of instances;
* Number of features;
* Number of classes;
* Class distribution.
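
The four quantities listed above can be read straight from the feature table. The sketch below is one possible way to do it, assuming the column layout written by `process_audio_row` (`file_name`, `label`, then `feature_1`, `feature_2`, ...). The function name `summarize_dataset` and the toy values are made up for illustration. With the default settings of `extract_features` (13 MFCC, 128 log-Mel, and 20 LFCC means), the real CSV should contain 161 feature columns.

```python
import pandas as pd

def summarize_dataset(df, label_col="label", id_col="file_name"):
    """Return instance, feature, and class counts for a feature table."""
    feature_cols = [c for c in df.columns if c not in (label_col, id_col)]
    class_counts = df[label_col].value_counts().sort_index()
    return {
        "n_instances": len(df),
        "n_features": len(feature_cols),
        "n_classes": int(class_counts.size),
        "class_counts": class_counts.to_dict(),
        "class_share": (class_counts / len(df)).round(4).to_dict(),
    }

# Toy stand-in for Data/asvspoof_features.csv (values are invented)
toy = pd.DataFrame({
    "file_name": ["a", "b", "c", "d"],
    "label": [0, 0, 0, 1],  # 1 = bonafide, 0 = spoof, as in process_audio_row
    "feature_1": [0.1, 0.2, 0.3, 0.4],
    "feature_2": [1.0, 1.1, 0.9, 1.2],
})
summary = summarize_dataset(toy)
print(summary)
```

On the real data, `summarize_dataset(pd.read_csv(OUTPUT_PATH))` would be the corresponding call.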

In [44]:
# use as many code and text cells as needed

# Univariate data analysis

In this section, you should perform univariate data analysis on at least **20 variables**.

In the end, you should describe the variables of main interest to you; these must carry over to the next sections of the report.
Explain what each chosen variable represents and why it was selected; arbitrary selections are **not** accepted at this point.

For each variable plotted, make sure you determine the following:
1. The distribution of the data (Gaussian, binomial, exponential, etc.);
2. Skewness;
3. Kurtosis;
4. Mean, standard deviation, and what they stand for in the context of the dataset.

Ensure that each variable is **plotted correctly** based on its type. For instance, make sure scatterplots are not used for categorical data and so forth.
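
The moments requested above can be computed per variable as in the following sketch. It uses the population (biased) estimators, which to our knowledge match the defaults of `scipy.stats.skew` and `scipy.stats.kurtosis`; pandas' `Series.skew()` and `Series.kurt()` apply a bias correction and give slightly different numbers on small samples. The function name `describe_variable` is hypothetical.

```python
import numpy as np

def describe_variable(x):
    """Mean, standard deviation, skewness, and excess kurtosis of one variable."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    std = x.std(ddof=0)
    # Standardized values; a constant column (std == 0) has undefined moments
    z = (x - mean) / std
    return {
        "mean": mean,
        "std": std,
        "skewness": np.mean(z ** 3),                # 0 for a symmetric distribution
        "excess_kurtosis": np.mean(z ** 4) - 3.0,   # 0 for a Gaussian
    }

symmetric = describe_variable([1, 2, 3, 4, 5])
right_skewed = describe_variable([0, 0, 0, 0, 10])
```

A skewness far from 0 or an excess kurtosis far from 0 suggests the variable is not well described by a Gaussian, which is worth confirming with a histogram before naming a distribution.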

In [45]:
# place as many cells to plot the visualizations,
# as well as to describe the main findings

In [46]:
# if you realize you need to further clean your data here, there is no problem,
# yet, make sure you are describing the entire process and the rationale
# behind your choices here

# Multivariate data analysis

In this section, you should plot at least **5 multivariate visualizations**. The key here is to investigate underlying correlations and behaviors within the dataset.
Most visualizations will confirm what you already expect, but you should find at least **ONE** non-obvious behavior in the data.

Please follow these steps for creating your visualizations:
1. State a hypothesis. Explain why you have selected these specific variables and what you expect to discover through their relationship;
2. Determine what kind of visualization is the most suitable;
3. Report the findings and discuss whether they corroborate the hypothesis.


### Hints

In this section, make sure you go beyond naive explorations. For example, consider applying techniques such as PCA, t-SNE, or even others that we haven't covered in the lectures. The goal is to cultivate a critical mindset toward data analysis and our work.
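
As an illustration of the PCA hint, the sketch below projects standardized features onto their first principal components using only NumPy's SVD. It is a minimal sketch on synthetic data, not a result from this dataset; `pca_project` is a hypothetical helper, and a library implementation such as scikit-learn's `PCA` could be used instead.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Standardize the columns of X and project onto the top principal components."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns at zero after centering
    Z = (X - mu) / sigma
    # Rows of Vt are the principal directions; U * S are the projected scores
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Synthetic example: five noisy copies of one latent variable
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(200, 1)) for _ in range(5)])
scores, ratio = pca_project(X)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by `label`, is one way to check whether bonafide and spoofed clips separate in the leading components. The explained-variance ratio shows how much information the 2-D view keeps.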

### Important

It is strictly prohibited to create multivariate visualizations using variables that were not included in the previous section (univariate data analysis).

In [47]:
# again, feel free to place as many cells as needed to plot the visualizations,
# as well as to describe the main findings

# Final Plots (Effective Data Visualization)

In this section, you need to **enhance 3 multivariate visualizations** that were presented in the previous section of the report.
The goal is to enhance these visualizations so that they can be effectively presented to an audience unfamiliar with the dataset or with data analysis.
**Therefore, make sure that their size, colors, textures, and other visual elements are appropriate and convey the intended information to the audience.**

For your final plots, make sure you follow these steps:
1. Present the plot;
2. Provide a description of the visualization, highlighting the key findings that can be drawn from it.


**Hint**: take a look at the checklist based on Evergreen’s work to ensure your visualizations meet the best practices for clarity and impact.

In [48]:
# your code goes here

# Digest

In this section you should write down the main findings of this exploratory data analysis. Furthermore, you should provide a reflection about your own work and effort during the module, highlighting what you believe you have done well and what you should have done differently. This digest should have at least 2,500 characters (excluding spaces).

```
Add your text here.
```

# Machine Learning (**post checkpoint!**)

In this section, you must create at least **3 machine learning models** for the task at hand. Depending on the problem's nature, you must select from classification, regression, or clustering models.
It is also important that you:
* Select **an appropriate validation protocol**, providing a rationale for why it is appropriate for this specific task;
* Choose **a suitable set of evaluation metrics**, providing an explanation for each and describing how it contributes to evaluating the model's performance in the context of this specific task.
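
One possible validation protocol and metric are sketched below. This is an assumption-laden illustration, not the choice made in this report. It shows a stratified fold assignment that keeps the class proportions in every fold, and balanced accuracy, which averages per-class recall. Both matter if spoofed clips turn out to outnumber bonafide ones. The function names are hypothetical; scikit-learn's `StratifiedKFold` and `balanced_accuracy_score` offer equivalent functionality.

```python
import numpy as np

def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample a fold id so every fold keeps the class proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal the shuffled samples of this class round-robin across folds
        fold_ids[idx] = np.arange(len(idx)) % n_splits
    return fold_ids

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; insensitive to class sizes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == cls] == cls) for cls in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy labels: 10 bonafide (1) and 90 spoof (0), values invented for illustration
labels = np.array([1] * 10 + [0] * 90)
folds = stratified_folds(labels, n_splits=5)

# A classifier that always predicts "spoof" looks good on plain accuracy only
always_spoof = np.zeros_like(labels)
plain_acc = float(np.mean(always_spoof == labels))
bal_acc = balanced_accuracy(labels, always_spoof)
```

In this toy case plain accuracy rewards the trivial classifier while balanced accuracy does not, which is why the metric should be chosen with the class distribution in mind.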

In [49]:
# use as many cells as needed

# Final Steps (Submission)


1. Save this report as a Jupyter Notebook (`.ipynb`);
2. Export a copy of the report as a PDF file (`.pdf`);
3. Copy the dataset;
4. Compress all the files (the Jupyter Notebook, PDF, and dataset) into a single ZIP archive (`<your_team_name>.zip`);
5. Upload the ZIP file to AVA.