## Data analysis notebook


For this analysis we consider only the argmax combined model, since it achieved the best accuracy of the models developed in the notebook 'Bimodel_modular.ipynb'. We compare its performance with that of the two single-modality models (audio and lyrics), and we also compare the two single models with each other.

In [1]:
# Load the models, data, and perform the predictions

import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import ast
from joblib import dump, load

df_train = pd.read_csv('database/LaA_train.csv')
df_test = pd.read_csv('database/LaA_test.csv')
df_train.head()

embeddings_audio_train = df_train['audio_embedding'].apply(ast.literal_eval).apply(np.array)
x_audio_train = np.stack(embeddings_audio_train.values)
embeddings_audio_test = df_test['audio_embedding'].apply(ast.literal_eval).apply(np.array)
x_audio_test = np.stack(embeddings_audio_test.values)

embeddings_lyrics_train = df_train['lyrics_embedding'].apply(ast.literal_eval).apply(np.array)
x_lyrics_train = np.stack(embeddings_lyrics_train.values)
embeddings_lyrics_test = df_test['lyrics_embedding'].apply(ast.literal_eval).apply(np.array)
x_lyrics_test = np.stack(embeddings_lyrics_test.values)

y_train = df_train['label']
y_test = df_test['label']

In [2]:
use_pretrained_svms = True
if use_pretrained_svms: 
    name_svm_audio = 'models/SVM_audio.joblib'
    name_svm_lyrics = 'models/SVM_lyrics.joblib'
    svm_classifier_audio = load(name_svm_audio)
    svm_classifier_lyrics = load(name_svm_lyrics)
else:
    svm_classifier_lyrics = SVC(kernel='rbf', C=1, gamma='auto', probability=True)
    svm_classifier_lyrics.fit(x_lyrics_train, y_train)
    print("Lyrics SVM trained")

    svm_classifier_audio = SVC(kernel='rbf', C=1, gamma='auto', probability=True)
    svm_classifier_audio.fit(x_audio_train, y_train)
    print("Audio SVM trained")

In [3]:
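# Argmax model: sum the class probabilities of both SVMs and predict the class with the highest sum.
# np.argmax returns a column index, which equals the label here because the labels are 0..3.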
y_pred_test_prob_lyrics = svm_classifier_lyrics.predict_proba(x_lyrics_test)
y_pred_test_prob_audio = svm_classifier_audio.predict_proba(x_audio_test)
y_pred_train_prob_lyrics = svm_classifier_lyrics.predict_proba(x_lyrics_train)
y_pred_train_prob_audio = svm_classifier_audio.predict_proba(x_audio_train)

y_pred_lyrics = np.argmax(y_pred_test_prob_lyrics, axis=1)
y_pred_audio = np.argmax(y_pred_test_prob_audio, axis=1)

y_pred_max_combined = np.argmax(y_pred_test_prob_lyrics + y_pred_test_prob_audio, axis=1)
acc_max_combined = accuracy_score(y_test, y_pred_max_combined)
print(f"Accuracy: {acc_max_combined:.2f}")

Accuracy: 0.45
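The combination rule used above can be sketched in isolation. This is a minimal illustration with made-up probabilities, not the project data; the helper name `soft_vote` is hypothetical. It also makes explicit an assumption the cell above relies on: both models must list their classes in the same order (scikit-learn exposes that order as `classes_`), otherwise the summed columns would not refer to the same emotion.

```python
import numpy as np

def soft_vote(prob_a, prob_b, classes_a, classes_b):
    """Return, per sample, the label with the largest summed probability."""
    classes_a = np.asarray(classes_a)
    if not np.array_equal(classes_a, np.asarray(classes_b)):
        raise ValueError("models order their classes differently")
    summed = np.asarray(prob_a) + np.asarray(prob_b)
    # argmax yields a column index; classes_a maps it back to a label.
    return classes_a[np.argmax(summed, axis=1)]

# Toy probabilities (made up) for two samples and four classes.
prob_audio = np.array([[0.40, 0.30, 0.20, 0.10],
                       [0.10, 0.20, 0.30, 0.40]])
prob_lyrics = np.array([[0.10, 0.50, 0.20, 0.20],
                        [0.25, 0.25, 0.25, 0.25]])
labels = np.array([0, 1, 2, 3])

combined = soft_vote(prob_audio, prob_lyrics, labels, labels)
print(combined)
```

In the notebook, the same helper could be called with `y_pred_test_prob_audio`, `y_pred_test_prob_lyrics`, `svm_classifier_audio.classes_` and `svm_classifier_lyrics.classes_`.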


### 1. Check on how many samples the two models differ (per emotion)

In [4]:
results = {}
emotion_map = {0: 'Angry', 1: 'Happy', 2: 'Relaxed', 3: 'Sad'}

total_correct = 0
total_possible = 0

for emotion in [0,1,2,3]:
    emotion_mask = y_test == emotion
    audio_correct = y_pred_audio == y_test
    lyrics_correct = y_pred_lyrics == y_test
    comb_correct = y_pred_max_combined == y_test
    
    both_correct = np.logical_and(audio_correct, lyrics_correct) & emotion_mask
    audio_only_correct = np.logical_and(audio_correct, np.logical_not(lyrics_correct)) & emotion_mask
    lyrics_only_correct = np.logical_and(lyrics_correct, np.logical_not(audio_correct)) & emotion_mask
    
    results[emotion] = {
        'both_agree_and_correct': np.mean(both_correct[emotion_mask]),
        'audio_correct_lyrics_wrong': np.mean(audio_only_correct[emotion_mask]),
        'lyrics_correct_audio_wrong': np.mean(lyrics_only_correct[emotion_mask]),
        'comb_correct': np.mean(comb_correct[emotion_mask]),
    }
    correct_predictions = np.logical_or(both_correct, np.logical_or(audio_only_correct, lyrics_only_correct))

    # Calculate the total number of correct predictions and the total number of possible correct predictions
    total_correct += np.sum(correct_predictions)
    total_possible += np.sum(emotion_mask)


for emotion, result in results.items():
    print(f"Emotion {emotion_map[emotion]}:")
    print(f"  Both Agree & Correct: (Lower bound for combination) {result['both_agree_and_correct']:.2f}")
    print(f"  Audio Correct, Lyrics Wrong: {result['audio_correct_lyrics_wrong']:.2f}")
    print(f"  Lyrics Correct, Audio Wrong: {result['lyrics_correct_audio_wrong']:.2f}")
    print(f"  Upper Bound Correct: {np.sum([result['both_agree_and_correct'], result['audio_correct_lyrics_wrong'], result['lyrics_correct_audio_wrong']]):.2f}")
    print(f"  Combined Correct: {result['comb_correct']:.2f}\n")

# Calculate and print the overall estimated upper bound accuracy
upper_bound_accuracy = total_correct / total_possible
print(f"Estimated Upper Bound Accuracy: {upper_bound_accuracy:.2f}")
print(f"Our accuracy: {acc_max_combined:.2f}")

Emotion Angry:
  Both Agree & Correct: (Lower bound for combination) 0.05
  Audio Correct, Lyrics Wrong: 0.14
  Lyrics Correct, Audio Wrong: 0.09
  Upper Bound Correct: 0.29
  Combined Correct: 0.14

Emotion Happy:
  Both Agree & Correct: (Lower bound for combination) 0.45
  Audio Correct, Lyrics Wrong: 0.20
  Lyrics Correct, Audio Wrong: 0.21
  Upper Bound Correct: 0.86
  Combined Correct: 0.72

Emotion Relaxed:
  Both Agree & Correct: (Lower bound for combination) 0.01
  Audio Correct, Lyrics Wrong: 0.10
  Lyrics Correct, Audio Wrong: 0.04
  Upper Bound Correct: 0.14
  Combined Correct: 0.04

Emotion Sad:
  Both Agree & Correct: (Lower bound for combination) 0.35
  Audio Correct, Lyrics Wrong: 0.18
  Lyrics Correct, Audio Wrong: 0.26
  Upper Bound Correct: 0.79
  Combined Correct: 0.63

Estimated Upper Bound Accuracy: 0.59
Our accuracy: 0.45


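Interesting: the combination helps. In the worst case the combined model is correct only where both models agree and are correct (lower bound); in the best case it is also correct wherever exactly one of the two models is correct (estimated upper bound). For Happy (0.72, bounds 0.45 to 0.86) and Sad (0.63, bounds 0.35 to 0.79) the combined accuracy is closer to the upper bound, so when the models disagree the combination follows the correct one in most cases. For Angry (0.14, bounds 0.05 to 0.29) and Relaxed (0.04, bounds 0.01 to 0.14) it stays closer to the lower bound.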
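A note on the two bounds. The lower bound is exact for a summed-probability argmax: if both models rank the same class first, that class also maximizes the sum. The upper bound is only an estimate, because the sum can select a class that neither model ranked first. The sketch below uses made-up probabilities (not project data) to illustrate both points.

```python
import numpy as np

# Made-up probabilities for three samples and four classes; the true label is always 1.
y_true = np.array([1, 1, 1])
prob_a = np.array([[0.10, 0.60, 0.20, 0.10],   # model A picks 1 (correct)
                   [0.45, 0.40, 0.10, 0.05],   # model A picks 0 (wrong)
                   [0.40, 0.35, 0.25, 0.00]])  # model A picks 0 (wrong)
prob_b = np.array([[0.20, 0.50, 0.20, 0.10],   # model B picks 1 (correct)
                   [0.10, 0.70, 0.10, 0.10],   # model B picks 1 (correct)
                   [0.00, 0.35, 0.40, 0.25]])  # model B picks 2 (wrong)

pred_a = prob_a.argmax(axis=1)
pred_b = prob_b.argmax(axis=1)
pred_sum = (prob_a + prob_b).argmax(axis=1)

both_correct = (pred_a == y_true) & (pred_b == y_true)     # lower-bound cases
either_correct = (pred_a == y_true) | (pred_b == y_true)   # "upper-bound" cases
sum_correct = pred_sum == y_true

print("both correct:  ", both_correct)
print("either correct:", either_correct)
print("sum correct:   ", sum_correct)
```

In the third sample both models are wrong, yet class 1 is the second choice of each and wins once the probabilities are summed. Such cases mean the combined accuracy can in principle exceed the estimated upper bound.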

### 2. Check how certain the models are about their decisions for each emotion, depending on whether they are right or wrong

In [5]:
def analyze_model_confidence(y_test, y_pred_probs, y_pred, emotion_labels):
    correct_confidences = {emotion: [] for emotion in emotion_labels}
    incorrect_confidences = {emotion: [] for emotion in emotion_labels}
    
    # Iterate through predictions and true labels
    for i, (true_label, pred_label) in enumerate(zip(y_test, y_pred)):
        # Confidence for the predicted label
        confidence = y_pred_probs[i][pred_label]
        
        # Correct predictions
        if true_label == pred_label:
            correct_confidences[true_label].append(confidence)
        # Incorrect predictions
        else:
            incorrect_confidences[true_label].append(confidence)
    
    # Calculate average confidences
    average_confidences = {
        'correct': {emotion: np.mean(correct_confidences[emotion]) if correct_confidences[emotion] else 0 for emotion in emotion_labels},
        'incorrect': {emotion: np.mean(incorrect_confidences[emotion]) if incorrect_confidences[emotion] else 0 for emotion in emotion_labels}
    }
    
    return average_confidences

# Run the analysis for both single models
emotion_labels = np.unique(y_test)
# Predicted labels (same values as computed in cell 3)
y_pred_audio = np.argmax(y_pred_test_prob_audio, axis=1)
y_pred_lyrics = np.argmax(y_pred_test_prob_lyrics, axis=1)

# Analyze confidence for audio model
audio_confidence_analysis = analyze_model_confidence(y_test, y_pred_test_prob_audio, y_pred_audio, emotion_labels)
# Analyze confidence for lyrics model
lyrics_confidence_analysis = analyze_model_confidence(y_test, y_pred_test_prob_lyrics, y_pred_lyrics, emotion_labels)

print("Audio Model Confidence Analysis:")
print(f"Average max confidence scores for being correct, audio model: {audio_confidence_analysis['correct']}")
print(f"Average max confidence scores for being wrong, audio model: {audio_confidence_analysis['incorrect']}")
print("\nLyrics Model Confidence Analysis:")
print(f"Average max confidence scores for being correct, lyrics model: {lyrics_confidence_analysis['correct']}")
print(f"Average max confidence scores for being wrong, lyrics model: {lyrics_confidence_analysis['incorrect']}")

Audio Model Confidence Analysis:
Average max confidence scores for being correct, audio model: {0: 0.41396013280337673, 1: 0.4492064958765077, 2: 0.39492113481037383, 3: 0.42812359094689195}
Average max confidence scores for being wrong, audio model: {0: 0.4136322585454098, 1: 0.3969532960695512, 2: 0.41689713716473575, 3: 0.4110479570157514}

Lyrics Model Confidence Analysis:
Average max confidence scores for being correct, lyrics model: {0: 0.40514261119215694, 1: 0.46574352118434575, 2: 0.37666522055591656, 3: 0.41128707896323763}
Average max confidence scores for being wrong, lyrics model: {0: 0.4127315567337244, 1: 0.39216714606247377, 2: 0.4188959055308722, 3: 0.4023000962658397}
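One way to read these numbers is the gap between the mean confidence on correct and on incorrect predictions, per true emotion. The sketch below copies the values printed above, rounded to four decimals; the helper name `confidence_gap` is hypothetical. In the notebook it could instead be applied directly to `audio_confidence_analysis` and `lyrics_confidence_analysis`.

```python
# Values copied from the output above, rounded to four decimals.
emotion_map = {0: 'Angry', 1: 'Happy', 2: 'Relaxed', 3: 'Sad'}
audio = {
    'correct':   {0: 0.4140, 1: 0.4492, 2: 0.3949, 3: 0.4281},
    'incorrect': {0: 0.4136, 1: 0.3970, 2: 0.4169, 3: 0.4110},
}
lyrics = {
    'correct':   {0: 0.4051, 1: 0.4657, 2: 0.3767, 3: 0.4113},
    'incorrect': {0: 0.4127, 1: 0.3922, 2: 0.4189, 3: 0.4023},
}

def confidence_gap(analysis):
    """Mean confidence when correct minus mean confidence when wrong, per emotion."""
    return {e: analysis['correct'][e] - analysis['incorrect'][e]
            for e in analysis['correct']}

audio_gap = confidence_gap(audio)
lyrics_gap = confidence_gap(lyrics)
for e, name in emotion_map.items():
    print(f"{name}: audio {audio_gap[e]:+.3f}, lyrics {lyrics_gap[e]:+.3f}")
```

With these values, Happy shows the largest positive gap for both models, while Relaxed shows a negative gap for both: on Relaxed songs the models are on average more confident when they are wrong than when they are right. All averages sit around 0.4, not far above the 0.25 that a uniform guess over four classes would give.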


### Other things to mention in the presentation

- We checked that the dataset is balanced
- We checked that the models are not horribly overfitting (hopefully)
- Explanation of audio features and how we acquired the data
- Explanation of lyrical features and how we acquired them
- Explanation of the arousal/valence scale
- Short explanation of SVMs, mention scaling
- Analysis of the two independent models
- Introduction to ensemble learning, present different techniques
- Discussion, what went wrong, what could be done better (both for the simple models and the bimodal model)
- Perhaps a comparison with the Deezer paper

### Questions we should be able to answer

- How could you improve the accuracy of your model?

- Why is the model doing better on some emotions than on others?

- Is there any issue with fixing the train/test split beforehand?