# **Hybrid Feature Representation: Audio + Lyrics**

## **Overview**
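This notebook builds a **hybrid feature representation** for each song by concatenating two vectors:

- **Audio latent vectors (`z_audio`)**: 32-dimensional latents extracted by a ConvVAE trained on Mel-spectrograms.
- **Lyrics embeddings (`z_lyrics`)**: 384-dimensional embeddings generated with a pre-trained Sentence-BERT model (`all-MiniLM-L6-v2`).

Only songs that have both an audio file and a lyrics file are kept, which gives one 416-dimensional vector per song.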


In [None]:
# Import Libraries 
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer

In [2]:
# Collect the lyrics files one subdirectory below the lyrics directory and read them
lyrics_dir = Path("../data/lyrics")
lyrics_files = sorted(lyrics_dir.glob("*/*.txt"))
lyrics_list = [f.read_text(encoding="utf-8") for f in lyrics_files]

print(f"Loaded {len(lyrics_files)} lyrics files")

Loaded 2568 lyrics files


In [3]:
# Collect the audio file paths (the audio itself is not decoded here)
audio_dir = Path("../data/audio")
audio_files = sorted(audio_dir.glob("*/*.mp3"))

print(f"Loaded {len(audio_files)} audio files")

Loaded 3554 audio files


In [4]:
# Load the previously saved latent vectors
z_audio = np.load("../results/z_audio.npy")
print("z_audio loaded successfully!")
print("Shape:", z_audio.shape)

z_audio loaded successfully!
Shape: (3554, 32)


In [5]:
# Extract file identifiers
audio_ids = [f.stem for f in audio_files]
lyrics_ids = [f.stem for f in lyrics_files]

In [6]:
# Keep only songs with both audio and lyrics
common_ids = set(audio_ids) & set(lyrics_ids)
print(f"Number of matching audio+lyrics: {len(common_ids)}")

Number of matching audio+lyrics: 2083
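
Matching on `f.stem` alone assumes that a file name is unique across the subdirectories matched by `*/*.mp3` and `*/*.txt`. The following is a minimal sketch of a check for that assumption; `duplicate_stems` and the toy paths are hypothetical and not part of the notebook:

```python
from collections import Counter
from pathlib import Path


def duplicate_stems(paths):
    """Return the stems that occur more than once in `paths`, sorted."""
    counts = Counter(Path(p).stem for p in paths)
    return sorted(stem for stem, n in counts.items() if n > 1)


# Toy paths (hypothetical): "song1" appears under two subdirectories.
example = [Path("rock/song1.mp3"), Path("pop/song1.mp3"), Path("pop/song2.mp3")]
print(duplicate_stems(example))
```

If the helper returns an empty list for both `audio_files` and `lyrics_files`, each stem identifies exactly one file and the set intersection counts each song once.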


In [7]:
# Filter audio and lyrics files
audio_files_filtered = [f for f in audio_files if f.stem in common_ids]
lyrics_files_filtered = [f for f in lyrics_files if f.stem in common_ids]

In [8]:
# Sort to align
audio_files_filtered = sorted(audio_files_filtered, key=lambda x: x.stem)
lyrics_files_filtered = sorted(lyrics_files_filtered, key=lambda x: x.stem)

print(len(audio_files_filtered), len(lyrics_files_filtered))

2083 2083


In [9]:
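# Select the z_audio rows of the matched songs, in the same order as the lyrics files.
# Assumption: row i of z_audio.npy corresponds to audio_files[i] (sorted path order).
z_audio = np.load("../results/z_audio.npy")
row_of = {f.stem: i for i, f in enumerate(audio_files)}
z_audio = z_audio[[row_of[f.stem] for f in audio_files_filtered]]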
print("Trimmed z_audio shape:", z_audio.shape)

Trimmed z_audio shape: (2083, 32)
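
The rows kept from `z_audio` have to be the rows of the matched songs, in the same order as the lyrics. A toy sketch of selecting rows by identifier follows, assuming row `i` of the latent matrix belongs to the `i`-th entry of the ID list used when it was saved. All names and values below are hypothetical stand-ins, not the notebook's variables:

```python
import numpy as np

# Hypothetical IDs in the order the latent rows were saved.
all_ids = ["a", "b", "c", "d"]
z_all = np.arange(8, dtype=float).reshape(4, 2)  # row i belongs to all_ids[i]

# IDs that also have lyrics, in the order the lyrics are processed.
kept_ids = ["b", "d"]

# Map each ID to its row, then gather the rows in kept_ids order.
row_of = {song_id: i for i, song_id in enumerate(all_ids)}
z_kept = z_all[[row_of[song_id] for song_id in kept_ids]]

print(z_kept)                    # rows belonging to "b" and "d"
print(z_all[:len(kept_ids)])     # prefix slice: rows belonging to "a" and "b"
```

A prefix slice such as `z_all[:len(kept_ids)]` has the right shape but returns the rows for "a" and "b", so it agrees with the ID-based selection only when the kept songs happen to occupy the first rows.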


In [10]:
# Load lyrics and generate embeddings
lyrics_list = [f.read_text(encoding="utf-8") for f in lyrics_files_filtered]
model_lyrics = SentenceTransformer('all-MiniLM-L6-v2')
z_lyrics = model_lyrics.encode(lyrics_list)
print("Lyrics embeddings shape:", z_lyrics.shape)

Lyrics embeddings shape: (2083, 384)
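
`all-MiniLM-L6-v2` truncates inputs longer than its maximum sequence length (256 word pieces by default, according to the model card), so long lyrics are embedded from their opening lines only. One possible workaround, which this notebook does not use, is to embed each lyric in chunks and average the chunk vectors. The sketch below is an assumption-laden illustration: `embed_in_chunks` is a hypothetical helper, the chunk size of 100 whitespace-separated words is arbitrary, and the encoder is passed in as a callable so the logic can be tried with a stand-in before using `model_lyrics.encode`:

```python
import numpy as np


def embed_in_chunks(text, encode, words_per_chunk=100):
    """Average the embeddings of consecutive word chunks of `text`.

    `encode` maps a list of strings to a 2-D array with one row per string,
    which is what SentenceTransformer.encode returns for a list input.
    """
    words = text.split()
    if not words:
        # Empty lyric: embed the empty string once so the output shape is stable.
        chunks = [""]
    else:
        chunks = [
            " ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)
        ]
    return np.asarray(encode(chunks)).mean(axis=0)


# Stand-in encoder for illustration only: embeds each string as [word count, 1].
def fake_encode(texts):
    return np.array([[float(len(t.split())), 1.0] for t in texts])


# 250 words are split into chunks of 100, 100 and 50 words.
vec = embed_in_chunks("la " * 250, fake_encode, words_per_chunk=100)
print(vec)
```

With the real model the call would look like `embed_in_chunks(text, model_lyrics.encode)`. Averaging weights every chunk equally, which is a design choice rather than a requirement.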


In [11]:
# Combine audio + lyrics features
z_hybrid = np.concatenate([z_audio, z_lyrics], axis=1)
print("Hybrid feature shape:", z_hybrid.shape)

Hybrid feature shape: (2083, 416)
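
Concatenation joins a 32-dimensional block with a 384-dimensional one, so distance-based methods applied to `z_hybrid` may be dominated by whichever block has the larger norm. One common option, which this notebook does not apply, is to L2-normalize each block per song before concatenating. A minimal sketch with a hypothetical helper `l2_normalize_rows` and random stand-in data:

```python
import numpy as np


def l2_normalize_rows(x, eps=1e-12):
    """Scale each row of `x` to unit Euclidean length (all-zero rows stay zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


rng = np.random.default_rng(0)
z_a = rng.normal(size=(5, 32)) * 10.0   # stand-in for audio latents
z_l = rng.normal(size=(5, 384))         # stand-in for lyrics embeddings

# Normalize each modality separately, then concatenate along the feature axis.
z_h = np.concatenate([l2_normalize_rows(z_a), l2_normalize_rows(z_l)], axis=1)
print(z_h.shape)  # (5, 416)
```

After this step each block has unit norm for every song, so neither modality dominates Euclidean distance through scale alone.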


In [12]:
# Save the hybrid feature matrix
np.save("../results/z_hybrid.npy", z_hybrid)
print("z_hybrid saved successfully!")

z_hybrid saved successfully!
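
The saved matrix carries no song identifiers, so a later notebook cannot tell which row belongs to which song. A sketch of storing the IDs next to the features is shown below. The file name and toy values are hypothetical, and a temporary directory stands in for `../results`:

```python
import tempfile
from pathlib import Path

import numpy as np

ids = np.array(["song_a", "song_b", "song_c"])   # hypothetical row IDs
features = np.zeros((3, 416))                    # stand-in feature matrix

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "z_hybrid_with_ids.npz"
    # One archive holds both arrays, so IDs and rows cannot drift apart.
    np.savez(out, ids=ids, z_hybrid=features)

    # Read the archive back and check that IDs and rows still line up.
    with np.load(out) as data:
        loaded_ids = data["ids"].tolist()
        loaded_shape = data["z_hybrid"].shape

print(loaded_ids, loaded_shape)
```

In the notebook, the ID array could be built from the stems of `lyrics_files_filtered`, provided the rows of `z_hybrid` follow that order.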
