Read the audio files from Google Drive. My dataset lives at /content/drive/MyDrive/AudioWAV/; change the path to match yours.

Create emotion labels from the filenames and encode them as integers. The result is stored in y_encoded.

librosa.load reads the files in the directory one by one and stores the raw speech waveforms (not mel spectrograms) in audio_inputs. Each entry is a (waveform, sample rate) tuple.

Then split audio_inputs into train, validation, and test sets.

In [2]:
from google.colab import drive
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torchaudio
import torch
from transformers import pipeline
import librosa
import matplotlib.pyplot as plt
from IPython.display import Audio

# Mount Google Drive
drive.mount('/content/drive')

# Directory path to your dataset
dataset_path = '/content/drive/MyDrive/AudioWAV/'

# Map the emotion code in the filename to a readable label
emotion_codes = {
    'SAD': 'sad',
    'ANG': 'angry',
    'DIS': 'disgust',
    'FEA': 'fear',
    'HAP': 'happy',
    'NEU': 'neutral',
}

# Create empty lists for emotions and file paths
file_emotion = []
file_paths = []

# Iterate through all files in the directory
for file_name in os.listdir(dataset_path):
    # Construct the full file path
    file_path_full = os.path.join(dataset_path, file_name)

    # Check if the file is an audio file (assuming WAV format)
    if file_name.lower().endswith('.wav') and os.path.isfile(file_path_full):
        # Store the file path
        file_paths.append(file_path_full)

        # The emotion code is the third underscore-separated field of the filename;
        # missing or unrecognised codes are labelled 'Unknown' instead of raising
        part = file_name.split('_')
        code = part[2] if len(part) > 2 else None
        file_emotion.append(emotion_codes.get(code, 'Unknown'))

# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(file_emotion)

# Load each file as a raw waveform resampled to 16 kHz (no mel spectrograms here).
# librosa.load returns a (waveform, sample_rate) tuple, so audio_inputs is a list of tuples.
audio_inputs = [librosa.load(file_path_full, sr=16000) for file_path_full in file_paths]

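# Data splitting: hold out 20% for test, then 10% of the remaining 80% for validation
# (roughly 72% train / 8% val / 20% test)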
X_train, X_test, y_train, y_test = train_test_split(audio_inputs, y_encoded, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)


Mounted at /content/drive
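
The two chained train_test_split calls hold out 20% for testing and then 10% of what remains for validation. As a rough check of the resulting proportions, the sketch below reproduces the arithmetic for a made-up dataset of 1,000 clips. `split_sizes` is an illustrative helper, not part of scikit-learn, and it assumes each held-out count is rounded up, which is my understanding of how train_test_split treats a fractional test_size.

```python
import math

def split_sizes(n_samples, test_size=0.2, val_size=0.1):
    """Return (n_train, n_val, n_test) for two chained hold-out splits.

    Hypothetical helper; assumes each held-out count is rounded up.
    """
    n_test = math.ceil(test_size * n_samples)
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_size * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

# 1000 is a made-up dataset size used only for illustration
n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)
```

So the validation set ends up at about 8% of the full dataset, not 10%. If the emotion classes are imbalanced, passing stratify=y_encoded to the first call (and stratify=y_train to the second) should keep the class proportions similar across the three sets.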


See what an element of the train set looks like. It is a tuple of the waveform array and the sample rate.

In [3]:
X_train[0]

(array([-0.00027466,  0.00039673,  0.00161743, ...,  0.        ,
         0.        ,  0.        ], dtype=float32),
 16000)

Unpack the tuple to get the speech array and the sample rate as separate variables

In [4]:
speech, rate = X_train[0]
print(speech)

[-0.00027466  0.00039673  0.00161743 ...  0.          0.
  0.        ]
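
Because every element is a (waveform, sample rate) tuple, the clip length in seconds is just the number of samples divided by the rate. The sketch below shows this on two synthetic arrays standing in for loaded clips; `fake_inputs` and `clip_duration` are illustrative names, not part of the notebook.

```python
import numpy as np

def clip_duration(waveform, rate):
    """Duration in seconds of a 1-D waveform sampled at `rate` Hz."""
    return len(waveform) / rate

# Synthetic stand-ins for librosa.load output: (waveform, sample_rate) tuples
fake_inputs = [
    (np.zeros(16000, dtype=np.float32), 16000),  # 1.0 s of silence
    (np.zeros(40000, dtype=np.float32), 16000),  # 2.5 s of silence
]

# Unpack each tuple the same way as `speech, rate = X_train[0]`
durations = [clip_duration(waveform, rate) for waveform, rate in fake_inputs]
print(durations)  # [1.0, 2.5]
```

Running the same list comprehension over X_train would give the duration of every training clip, which is a quick way to spot unusually short or long recordings.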


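Use a pretrained Audio Spectrogram Transformer (AST) to classify one train element (the speech variable from above) and check its loss. This checkpoint is fine-tuned on AudioSet, so its labels are AudioSet classes rather than our emotion labels; following the reference example, the loss is measured against class 0 of that label set.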
Ref:
https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#transformers.ASTForAudioClassification

In [5]:
from transformers import AutoFeatureExtractor, ASTForAudioClassification
import torch


feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")


with torch.no_grad():
    logits = model(**inputs).logits

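# Most likely class for this clip according to the pretrained model
predicted_class_ids = torch.argmax(logits, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_ids]
predicted_label  # not displayed: only the last expression in a cell is shown


# Compute the loss against class 0 of the pretrained label set
target_label = model.config.id2label[0]
inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
loss = model(**inputs).loss
round(loss.item(), 2)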

0.32
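
To get a feel for what this number means: assuming the model applies standard cross-entropy in this single-label setup (my reading of the Transformers implementation), the loss is the negative log of the probability assigned to the target class. The sketch below recomputes that quantity from made-up logits and then inverts the reported loss. `cross_entropy` is an illustrative helper, and the logits are not taken from the model.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of `target` (single example)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Made-up logits for a 3-class problem, for illustration only
example_loss = cross_entropy([2.0, 0.0, 0.0], target=0)
print(round(example_loss, 4))

# Inverting the relation: probability implied by a loss of 0.32
target_prob = math.exp(-0.32)
print(round(target_prob, 2))
```

Under that assumption, a loss of 0.32 would mean the model puts roughly 73% of its probability on class 0 for this clip.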