https://github.com/CheyneyComputerScience/CREMA-D/tree/master/docs#crema-d-crowd-sourced-emotional-multimodal-actors-dataset

## Filename labeling conventions
The actor ID is a four-digit number at the start of the filename. Each subsequent identifier is separated by an underscore (_).

Actors spoke a selection of 12 sentences (the three-letter acronym in parentheses is used in the second part of the filename):

* It's eleven o'clock (IEO).
* That is exactly what happened (TIE).
* I'm on my way to the meeting (IOM).
* I wonder what this is about (IWW).
* The airplane is almost full (TAI).
* Maybe tomorrow it will be cold (MTI).
* I would like a new alarm clock (IWL).
* I think I have a doctor's appointment (ITH).
* Don't forget a jacket (DFA).
* I think I've seen this before (ITS).
* The surface is slick (TSI).
* We'll stop in a couple of minutes (WSI).

The sentences were presented using different emotions (the three-letter code in parentheses is used in the third part of the filename):

* Anger (ANG)
* Disgust (DIS)
* Fear (FEA)
* Happy/Joy (HAP)
* Neutral (NEU)
* Sad (SAD)

and emotion levels (the two-letter code in parentheses is used in the fourth part of the filename):

* Low (LO)
* Medium (MD)
* High (HI)
* Unspecified (XX)

The filename suffix depends on the file type: flv (Flash video) is used for both the video-only and the audio-visual clips, mp3 is used for the audio-only presentation of the clips, and wav is used for computational audio processing.
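
The naming convention above can be sketched as a small parser. This is a minimal illustration, not part of the dataset tooling: the `ClipInfo` type, the `parse_crema_filename` helper, and the sample filename are names chosen here for the example.

```python
from typing import NamedTuple


class ClipInfo(NamedTuple):
    actor_id: str    # four-digit actor ID
    sentence: str    # three-letter sentence acronym, e.g. DFA
    emotion: str     # three-letter emotion code, e.g. ANG
    level: str       # two-letter emotion level, e.g. XX
    extension: str   # flv, mp3 or wav


def parse_crema_filename(filename: str) -> ClipInfo:
    """Split a name such as '1001_DFA_ANG_XX.wav' into its labelled parts."""
    stem, _, extension = filename.rpartition(".")
    actor_id, sentence, emotion, level = stem.split("_")
    return ClipInfo(actor_id, sentence, emotion, level, extension)


# Illustrative filename that follows the convention described above.
info = parse_crema_filename("1001_DFA_ANG_XX.wav")
print(info)
```

A name with more or fewer than four underscore-separated parts raises a `ValueError` during unpacking, which makes malformed filenames easy to spot.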

In [1]:
import os
import wandb
# Assumption: the key comes from the WANDB_API_KEY environment variable, not the notebook source.
wandb.login(key=os.environ["WANDB_API_KEY"])

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [2]:
!pip install transformers datasets evaluate accelerate librosa
!pip install --upgrade gdown

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 2.1 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.1
Collecting gdown
  Downloading gdown-4.7.1-py3-none-any.whl (15 kB)
Installing collected packages: gdown
Successfully installed gdown-4.7.1


In [3]:
!pip install datasets==2.14.6
!pip install pandas==1.5.3

Collecting datasets==2.14.6
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 493.7/493.7 kB 8.2 MB/s eta 0:00:00
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.1.0
    Uninstalling datasets-2.1.0:
      Successfully uninstalled datasets-2.1.0
Successfully installed datasets-2.14.6
Collecting pandas==1.5.3
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.1/12.1 MB 24.8 MB/s eta 0:00:00
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.2
    Uninstalling pandas-2.0.2:
      Successfully uninstalled pandas-2.0.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are instal

In [4]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
from glob import glob

# from tqdm import tqdm
from tqdm.notebook import tqdm
import librosa
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    recall_score,
    precision_score,
    accuracy_score,
    ConfusionMatrixDisplay,
    f1_score
)
from scipy.stats import spearmanr
import torch
from datasets import load_dataset, load_metric
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    TrainingArguments,
    Trainer
)
import matplotlib.pyplot as plt

SEED=3

import warnings
warnings.filterwarnings('ignore')
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# import os
# for dirname, _, filenames in os.walk('/kaggle/input/crema-d/CREMA-D-master/AudioMP3'):
#     for filename in filenames:
#         print(filename)
save_path = "/kaggle/working"
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session



## Prepare Data

In [5]:
data = []

for path in tqdm(glob("/kaggle/input/d/return0root/crema-d/CREMA-D/AudioWAV/*.wav")):
    # Filename pattern: <actor_id>_<sentence>_<emotion>_<level>.wav
    name = os.path.splitext(os.path.basename(path))[0]
    actor_id, sentence, emotion, level = name.split('_')
    # Load each file once so that unreadable audio fails here rather than during training
    librosa.load(path, sr=16000)
    data.append({
        "file": path,
        "actor_id": actor_id,
        "sentence": sentence,
        "label": emotion,
        "level": level
    })
df = pd.DataFrame(data)

  0%|          | 0/7442 [00:00<?, ?it/s]

In [6]:
df = pd.DataFrame(data)

In [7]:
df.head(2)

Unnamed: 0,file,actor_id,sentence,label,level
0,/kaggle/input/d/return0root/crema-d/CREMA-D/Au...,1028,TSI,DIS,XX
1,/kaggle/input/d/return0root/crema-d/CREMA-D/Au...,1075,IEO,HAP,LO


In [8]:
# SentenceFilenames.csv - list of movie files used in study
# finishedEmoResponses.csv - the first emotional response with timing.
# finishedResponses.csv - the final emotional Responses with emotion levels with repeated and practice responses removed, used to tabulate the votes

df_sentence = pd.read_csv('/kaggle/input/d/return0root/crema-d/CREMA-D/SentenceFilenames.csv')
df_first_resp = pd.read_csv('/kaggle/input/d/return0root/crema-d/CREMA-D/finishedEmoResponses.csv')
df_final_resp = pd.read_csv('/kaggle/input/d/return0root/crema-d/CREMA-D/finishedResponses.csv', low_memory=False)

In [9]:
df_first_resp['numTries'].value_counts()

0    256297
1      2348
2       140
3        56
4        31
5        23
6        17
7        12
8         6
9         2
Name: numTries, dtype: int64

In [10]:
df_final_resp['numTries'].value_counts()

0    217703
1      1870
2        68
3        19
4         8
7         6
5         6
6         5
8         3
Name: numTries, dtype: int64

In [11]:
train_df, dev_df = train_test_split(df, test_size=0.3, random_state=SEED,
                                    stratify=df["label"])
dev_df, test_df = train_test_split(dev_df, test_size=0.5, random_state=SEED,
                                   stratify=dev_df["label"])

train_df = train_df.reset_index(drop=True)
dev_df = dev_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# remove unused features in training models
# train_df.drop(['actor_id','sentence', 'level'], axis=1, inplace=True)
# dev_df.drop(['actor_id','sentence', 'level'], axis=1, inplace=True)
# test_df.drop(['actor_id','sentence', 'level'], axis=1, inplace=True)

train_df.to_csv(f"{save_path}/train.csv", encoding="utf-8", index=False)
dev_df.to_csv(f"{save_path}/dev.csv", encoding="utf-8", index=False)
test_df.to_csv(f"{save_path}/test.csv", encoding="utf-8", index=False)

print(train_df.shape)
print(dev_df.shape)
print(test_df.shape)

(5209, 5)
(1116, 5)
(1117, 5)
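
The split sizes printed above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds the held-out share up and gives the remainder to the training side; that is an assumption about scikit-learn's rounding, and `split_sizes` is a helper defined here only for illustration.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_kept, n_held_out), rounding the held-out share up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 7442                                   # WAV files found by the loading loop
n_train, n_holdout = split_sizes(n_total, 0.3)   # first split: 70% train, 30% held out
n_dev, n_test = split_sizes(n_holdout, 0.5)      # second split: held-out half and half
print(n_train, n_dev, n_test)                    # 5209 1116 1117
```

The result is roughly a 70/15/15 split, stratified by emotion label in the notebook cell.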


In [12]:
data_files = {
    "train": f"{save_path}/train.csv",
    "validation": f"{save_path}/dev.csv",
    "test": f"{save_path}/test.csv"
}

# train_dataset = train_df
# dev_dataset = dev_df
# test_dataset = test_df
# label_list = sorted(train_dataset['label'].unique())

dataset = load_dataset("csv", data_files=data_files)
train_dataset = dataset["train"]
dev_dataset = dataset["validation"]
test_dataset = dataset["test"]


print(dataset)

label_list = sorted(train_dataset.unique('label'))

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['file', 'actor_id', 'sentence', 'label', 'level'],
        num_rows: 5209
    })
    validation: Dataset({
        features: ['file', 'actor_id', 'sentence', 'label', 'level'],
        num_rows: 1116
    })
    test: Dataset({
        features: ['file', 'actor_id', 'sentence', 'label', 'level'],
        num_rows: 1117
    })
})


In [13]:
# Base = 90M parameters; Large = 300M parameters

model_name_or_path = "facebook/wav2vec2-large-960h-lv60" # “baseline” model; pre-trained on 960 hours of English
# model_name_or_path = "facebook/wav2vec2-base-el-voxpopuli-v2" # pre-trained on Greek speech, no fine-tuning
# model_name_or_path = "facebook/wav2vec2-large-el-voxpopuli-v2" # pre-trained on Greek speech, no fine-tuning
# model_name_or_path = "facebook/wav2vec2-xls-r-300m" # pre-trained on 0.5 million hours in multiple languages, no fine-tuning
# model_name_or_path = "lighteternal/wav2vec2-large-xlsr-53-greek" # pre-trained on 50000 hours in multiple languages, Greek ASR fine-tuning

# Feel free to look for and experiment with other models at HuggingFace Hub https://huggingface.co/

In [14]:
feature_extractor=AutoFeatureExtractor.from_pretrained(model_name_or_path)
model=AutoModelForAudioClassification.from_pretrained(model_name_or_path,
                                      num_labels=len(train_dataset.unique("label")),
                                      label2id={label: i for i, label in enumerate(label_list)},
                                      id2label={i: label for i, label in enumerate(label_list)}
                                      )
model.freeze_feature_encoder()

Downloading (…)rocessor_config.json:   0%|          | 0.00/158 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/841 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60 and are newly initialized: ['classifier.bias', 'projector.weight', 'classifier.weight', 'wav2vec2.masked_spec_embed', 'projector.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
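
The `label2id` and `id2label` arguments passed to the model above are plain dictionaries. A minimal standalone sketch, assuming the labels are the six three-letter emotion codes from the filename conventions (the `EMOTION_CODES` name is introduced here for illustration):

```python
# Emotion codes as listed in the filename conventions.
EMOTION_CODES = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]

# Sorting gives a stable, reproducible assignment of ids to labels.
label_list = sorted(EMOTION_CODES)
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
print(id2label)
```

The maps carry readable names only when `label_list` still holds the string codes, so they should be built before the labels are converted to integer ids.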


In [15]:
def label_to_id(label, label_list):
    # Map a label to its index in label_list (-1 if the label is unknown)
    if len(label_list) > 0:
        return label_list.index(label) if label in label_list else -1
    return label


def prepare_example(example):
    # Load and resample the audio, record its duration, and encode the label as an integer id
    example["audio"], example["sampling_rate"] = librosa.load(example["file"], sr=feature_extractor.sampling_rate)
    example["duration_in_seconds"] = len(example["audio"]) / feature_extractor.sampling_rate
    example["label"] = label_to_id(example["label"], label_list)
    return example


def preprocess_function(examples):
    # Convert raw waveforms into model inputs (input_values, attention_mask)
    audio_arrays = examples["audio"]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate
    )
    return inputs

In [16]:
# train_dataset = train_dataset.map(prepare_example, remove_columns=['file'])
# dev_dataset = dev_dataset.map(prepare_example, remove_columns=['file'])
# test_dataset = test_dataset.map(prepare_example, remove_columns=['file'])
# train_dataset = train_dataset.map(preprocess_function, batched=True, batch_size=1, remove_columns=['audio'])
# dev_dataset = dev_dataset.map(preprocess_function, batched=True, batch_size=1, remove_columns=['audio'])
# test_dataset = test_dataset.map(preprocess_function, batched=True, batch_size=1)

In [17]:
dataset = dataset.map(prepare_example, remove_columns=['file'])
dataset = dataset.map(preprocess_function, batched=True, batch_size=1)

Map:   0%|          | 0/5209 [00:00<?, ? examples/s]

Map:   0%|          | 0/1116 [00:00<?, ? examples/s]

Map:   0%|          | 0/1117 [00:00<?, ? examples/s]

Map:   0%|          | 0/5209 [00:00<?, ? examples/s]

Map:   0%|          | 0/1116 [00:00<?, ? examples/s]

Map:   0%|          | 0/1117 [00:00<?, ? examples/s]

In [18]:
# delete processed data
# !rm -rf /kaggle/working/data/preprocessed

In [19]:
dataset.save_to_disk(f"{save_path}/data/preprocessed/")

Saving the dataset (0/6 shards):   0%|          | 0/5209 [00:00<?, ? examples/s]

Saving the dataset (0/2 shards):   0%|          | 0/1116 [00:00<?, ? examples/s]

Saving the dataset (0/2 shards):   0%|          | 0/1117 [00:00<?, ? examples/s]

## Train

In [20]:
from datasets import load_from_disk

dataset = load_from_disk(f"{save_path}/data/preprocessed/")
train_dataset = dataset["train"]
dev_dataset = dataset["validation"]
test_dataset = dataset["test"]


print(dataset)

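# Labels were already converted to integer ids by prepare_example, so this yields ids, not emotion codes
label_list = sorted(train_dataset.unique('label'))
label_list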

DatasetDict({
    train: Dataset({
        features: ['actor_id', 'sentence', 'label', 'level', 'audio', 'sampling_rate', 'duration_in_seconds', 'input_values', 'attention_mask'],
        num_rows: 5209
    })
    validation: Dataset({
        features: ['actor_id', 'sentence', 'label', 'level', 'audio', 'sampling_rate', 'duration_in_seconds', 'input_values', 'attention_mask'],
        num_rows: 1116
    })
    test: Dataset({
        features: ['actor_id', 'sentence', 'label', 'level', 'audio', 'sampling_rate', 'duration_in_seconds', 'input_values', 'attention_mask'],
        num_rows: 1117
    })
})


[0, 1, 2, 3, 4, 5]

In [21]:
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

0
0
0
0


In [22]:
# Batch size = per_device_train_batch_size * gradient_accumulation_steps
# Parameters to tune: learning rate, epochs, (batch size)
# More details on hyperparameter tuning in https://github.com/google-research/tuning_playbook

def compute_metrics(pred):
    """Computes accuracy and macro-averaged precision, recall, and F1 for a set of predictions"""
    predictions = np.argmax(pred.predictions, axis=1)
    accuracy = accuracy_score(pred.label_ids, predictions)
    precision = precision_score(pred.label_ids, predictions, average='macro')
    recall = recall_score(pred.label_ids, predictions, average='macro')
    f1 = f1_score(pred.label_ids, predictions, average='macro')
    return {"accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1}


#learning_rates = [1e-3, 1e-4, 1e-5] # first round
learning_rates = [1.5e-4, 1e-4, 0.5e-4] # second round
num_epochs = 5

model_name_or_path = "facebook/wav2vec2-large-960h-lv60" # “baseline” model; pre-trained on 960 hours of English
# model_name_or_path = "facebook/wav2vec2-base-el-voxpopuli-v2" # pre-trained on Greek speech, no fine-tuning
# model_name_or_path = "facebook/wav2vec2-large-el-voxpopuli-v2" # pre-trained on Greek speech, no fine-tuning
# model_name_or_path = "facebook/wav2vec2-xls-r-300m" # pre-trained on 0.5 million hours in multiple languages, no fine-tuning
# model_name_or_path = "lighteternal/wav2vec2-large-xlsr-53-greek" # pre-trained on 50000 hours in multiple languages, Greek ASR fine-tuning

feature_extractor=AutoFeatureExtractor.from_pretrained(model_name_or_path)

for lr in learning_rates:
    torch.cuda.empty_cache()
    # Start a new W&B run to track training with this learning rate
    with wandb.init(
        # Set the project where this run will be logged
        project="SER",
        entity="black-noodles",
        # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
        name=f"{model_name_or_path}_{lr}_{num_epochs}", 
        # Track hyperparameters and run metadata
        config={
        "learning_rate": lr,
        "architecture": model_name_or_path,
        "dataset": "CREMA-D",
        "epochs": num_epochs,
    }):
        # renew model
        model=AutoModelForAudioClassification.from_pretrained(model_name_or_path,
                                              num_labels=len(train_dataset.unique("label")),
                                              label2id={label: i for i, label in enumerate(label_list)},
                                              id2label={i: label for i, label in enumerate(label_list)}
                                              )
        model.freeze_feature_encoder()
        
        # start training
        training_args = TrainingArguments(
            output_dir=f"{save_path}/{model_name_or_path}-speech-emotion-recognition",
            per_device_train_batch_size=32, # larger values need more GPU memory; this setting makes use of 16 GB
            gradient_accumulation_steps=4,
            per_device_eval_batch_size=32,
            num_train_epochs=num_epochs,
            warmup_ratio=0.1,
            learning_rate=lr,
            evaluation_strategy = "epoch",
            save_strategy = "epoch",
            save_total_limit=2,
            logging_steps=10,
            load_best_model_at_end=True,
            metric_for_best_model='accuracy',
            greater_is_better=True,
            push_to_hub=False,
            gradient_checkpointing=True,
            fp16=True,
            report_to=None
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            compute_metrics=compute_metrics,
            train_dataset=train_dataset,
            eval_dataset=dev_dataset,
            tokenizer=feature_extractor,
        )


        trainer.train()

        predictions = trainer.predict(test_dataset)

        wandb.log(compute_metrics(predictions))

# Each run is already finished when its `with` block exits; this call is only a final safeguard
wandb.finish()



wandb: Currently logged in as: priyanship (black-noodles). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.0 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.9
wandb: Run data is saved locally in /kaggle/working/wandb/run-20231128_131019-qmjryrs8
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run facebook/wav2vec2-large-960h-lv60_0.00015_5
wandb: ⭐️ View project at https://wandb.ai/black-noodles/SER
wandb: 🚀 View run at https://wandb.ai/black-noodles/SER/runs/qmjryrs8
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60 and are newly initialized: ['classifier.bias', 'projector.weight', 'classifier.weig

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
0,1.5946,1.567742,0.357527,0.314115,0.357938,0.24141
1,1.3083,1.59544,0.395161,0.448473,0.406423,0.313287
2,1.1658,1.171182,0.542115,0.550466,0.549752,0.511926
4,0.9828,1.065374,0.610215,0.636022,0.617093,0.594518
4,0.9086,1.032992,0.628136,0.645399,0.63466,0.615146


wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb:                       accuracy ▁
wandb:                  eval/accuracy ▁▂▆██
wandb:                        eval/f1 ▁▂▆██
wandb:                      eval/loss ██▃▁▁
wandb:                 eval/precision ▁▄▆██
wandb:                    eval/recall ▁▂▆██
wandb:                   eval/runtime ▅▂█▁▂
wandb:        eval/samples_per_second ▄▇▁█▇
wandb:          eval/steps_per_second ▃▆▁█▆
wandb:                             f1 ▁
wandb:                      precision ▁
wandb:                         recall ▁
wandb:                    train/epoch ▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇████
wandb:              train/global_step ▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇█████
wandb:            train/learnin

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
0,1.6373,1.686382,0.307348,0.1689,0.299476,0.195883
1,1.4764,1.534042,0.361111,0.251181,0.357926,0.285231
2,1.3333,1.287089,0.500896,0.533134,0.505882,0.45077
4,1.2092,1.199503,0.544803,0.588995,0.551987,0.505773
4,1.1349,1.239839,0.542115,0.573398,0.549102,0.511275


wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb:                       accuracy ▁
wandb:                  eval/accuracy ▁▃▇██
wandb:                        eval/f1 ▁▃▇██
wandb:                      eval/loss █▆▂▁▂
wandb:                 eval/precision ▁▂▇██
wandb:                    eval/recall ▁▃▇██
wandb:                   eval/runtime █▅▃▆▁
wandb:        eval/samples_per_second ▁▄▇▃█
wandb:          eval/steps_per_second ▁▅▆▃█
wandb:                             f1 ▁
wandb:                      precision ▁
wandb:                         recall ▁
wandb:                    train/epoch ▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇████
wandb:              train/global_step ▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇█████
wandb:            train/learnin

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
0,1.7169,1.707726,0.303763,0.157452,0.29594,0.183439
1,1.5511,1.57305,0.325269,0.234442,0.318342,0.23463
2,1.4569,1.455924,0.403226,0.40351,0.400823,0.318159
4,1.3847,1.398434,0.423835,0.442043,0.420198,0.348442
4,1.3637,1.40978,0.432796,0.479083,0.428216,0.366268


wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb:                       accuracy ▁
wandb:                  eval/accuracy ▁▂▆██
wandb:                        eval/f1 ▁▃▆▇█
wandb:                      eval/loss █▅▂▁▁
wandb:                 eval/precision ▁▃▆▇█
wandb:                    eval/recall ▁▂▇██
wandb:                   eval/runtime ▄█▁▃▂
wandb:        eval/samples_per_second ▅▁█▆▇
wandb:          eval/steps_per_second ▅▁███
wandb:                             f1 ▁
wandb:                      precision ▁
wandb:                         recall ▁
wandb:                    train/epoch ▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇████
wandb:              train/global_step ▁▁▂▂▂▂▃▃▄▄▄▄▅▅▅▅▆▆▇▇▇▇█████
wandb:            train/learnin

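The training cell above notes that the batch size is `per_device_train_batch_size * gradient_accumulation_steps`. The sketch below works through that arithmetic for the values used there. It assumes a single GPU and approximates the number of optimizer updates per epoch with floor division; both are assumptions, not something the notebook output confirms.

```python
import math

n_train = 5209            # training rows reported by the split
per_device_batch = 32     # per_device_train_batch_size
accumulation = 4          # gradient_accumulation_steps
num_epochs = 5
n_devices = 1             # assumption: a single GPU

# Number of examples that contribute to one optimizer update.
effective_batch = per_device_batch * accumulation * n_devices

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(n_train / (per_device_batch * n_devices))
updates_per_epoch = max(batches_per_epoch // accumulation, 1)
total_updates = updates_per_epoch * num_epochs

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
# 128 163 40 200
```

Under these assumptions each learning rate is trained for about 200 optimizer updates, which helps explain why the smallest learning rate is still far from converged after 5 epochs.
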
In [23]:
def map_to_pred(batch):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    input_values = feature_extractor(batch["audio"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch["predictions"] = predicted_ids
    return batch

In [24]:
# label_names = [model.config.id2label[i] for i in range(model.config.num_labels)]
# result = test_dataset.map(map_to_pred)
# print(classification_report(result['label'], result['predictions'], target_names=label_names, digits=4))

# cm = confusion_matrix(result['label'], result['predictions'], normalize='true')
# disp = ConfusionMatrixDisplay(confusion_matrix=cm,
#                               display_labels=label_names)

# disp.plot(xticks_rotation = 'vertical')
# plt.title(f"Confusion Matrix")
# plt.show()
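
The precision, recall, and F1 columns in the result tables above are computed with `average='macro'`. The pure-Python sketch below illustrates what macro averaging means: each class gets equal weight, and macro F1 is the mean of the per-class F1 scores. The `macro_scores` helper and the toy labels are made up for illustration, and treating undefined ratios as 0 is an assumption about the default behaviour.

```python
def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged with equal weight per class."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Undefined ratios (no predictions or no true samples) are counted as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels (made up for illustration): class 2 is never predicted.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 0, 1]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Here macro F1 (7/18, about 0.389) differs from the harmonic mean of macro precision and macro recall (0.4), because a class that is never predicted contributes an F1 of 0 to the average.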