<a href="https://colab.research.google.com/github/minguezalba/MusiCNN-embeddings/blob/main/music_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Music Similarity
---
Author: Alba Mínguez Sánchez

March 2021

---

In this notebook, we evaluate music similarity using previously extracted song embeddings. Each song embedding in the dataset has a single ground-truth genre label. We use genre as a proxy for similarity: two songs are considered similar when they share the same genre label.

Evaluation metrics such as average precision@10, MAP@10, and MAP@50 will be computed across the entire dataset.



**Packages and dependencies**




In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px

from sklearn.metrics.pairwise import cosine_similarity

# 1. Loading the embeddings dataset from npy files

This step requires cloning the MusiCNN-embeddings repository to download the necessary files.

In [None]:
!git clone https://github.com/minguezalba/MusiCNN-embeddings.git

In [None]:
with open('MusiCNN-embeddings/emb_dataset/embeddings.npy', 'rb') as f:
    embeddings = np.load(f)
with open('MusiCNN-embeddings/emb_dataset/labels.npy', 'rb') as f:
    labels = np.load(f)
with open('MusiCNN-embeddings/emb_dataset/labels_decoded.npy', 'rb') as f:
    labels_decoded = np.load(f)
with open('MusiCNN-embeddings/emb_dataset/track_ids.npy', 'rb') as f:
    track_ids = np.load(f)


Check data dimensions and types

In [None]:
print('embeddings: ', embeddings.shape, type(embeddings))
print('labels: ', labels.shape, type(labels))
print('labels_decoded: ', labels_decoded.shape, type(labels_decoded))
print('track_ids: ', track_ids.shape, type(track_ids))

# 2. Computing similarity matrix

In this step, we compute the similarity matrix between every pair of song embeddings using cosine similarity.

In [None]:
similarities = cosine_similarity(embeddings)  # X as n-samples, n-features

print(similarities.shape)
print(f'Min value: {similarities.min()}')
print(f'Max value: {similarities.max()}')
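
To make the computation above concrete, here is a minimal sketch of pairwise cosine similarity written with plain NumPy. The toy embeddings are hypothetical values chosen for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical toy embeddings: 3 songs, 2 features each.
toy_embeddings = np.array([[1.0, 0.0],
                           [0.0, 2.0],
                           [3.0, 3.0]])

# L2-normalise every row, then take all pairwise dot products.
norms = np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
unit_rows = toy_embeddings / norms
toy_similarities = unit_rows @ unit_rows.T

print(toy_similarities.round(3))
```

The result is a symmetric matrix with ones on the diagonal, since every vector has cosine similarity 1 with itself.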

# 3. Understanding similarity matrix

We will plot the similarity matrix as a heatmap in order to understand it and draw conclusions from it.

In [None]:
fig = px.imshow(similarities, color_continuous_scale='RdYlGn')
fig.show()

Because the songs are sorted by genre in blocks of 100, the heatmap shows darker-green (more similar) 100 × 100 squares along the diagonal, each corresponding to the pairwise similarities among the 100 songs of one genre.
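
The visual impression of a block structure could also be checked numerically by comparing the mean similarity within genres against the mean similarity between genres. The following is only a sketch on hypothetical toy data; `toy_labels` and `toy_similarities` are made-up stand-ins for `labels` and `similarities`.

```python
import numpy as np

# Hypothetical toy data: 4 songs, 2 genres (0 and 1).
toy_labels = np.array([0, 0, 1, 1])
toy_similarities = np.array([[1.0, 0.9, 0.2, 0.1],
                             [0.9, 1.0, 0.3, 0.2],
                             [0.2, 0.3, 1.0, 0.8],
                             [0.1, 0.2, 0.8, 1.0]])

# Boolean mask that is True where both songs share a genre.
same_genre = toy_labels[:, None] == toy_labels[None, :]
# Exclude the diagonal so self-similarities do not inflate the within-genre mean.
off_diagonal = ~np.eye(len(toy_labels), dtype=bool)

within_mean = toy_similarities[same_genre & off_diagonal].mean()
between_mean = toy_similarities[~same_genre].mean()

print(f'within-genre mean:  {within_mean:.2f}')
print(f'between-genre mean: {between_mean:.2f}')
```

A within-genre mean clearly above the between-genre mean would be consistent with the green blocks seen in the heatmap.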

# 4. Compute similar songs matrix

We want to evaluate how good our system is at *recommending* songs similar to a query song. We will use cosine similarity to rank the songs, and then evaluate whether the ranked recommendations are truly relevant by checking whether their genre matches that of the query song.

We start by sorting each row of the similarity matrix in descending order and saving the corresponding song indexes.

In [None]:
sorted_indexes = np.argsort(similarities, axis=1)
sorted_indexes = np.fliplr(sorted_indexes)
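
As an illustration of the sorting step, the sketch below applies the same two calls to a hypothetical 3 × 3 similarity matrix. It also shows that each query song ranks itself first, because its self-similarity is 1. The last line is an optional variant that is not part of this notebook: it drops the self-match, which assumes the first column always holds the query itself (this may not be true when duplicate embeddings produce ties) and would change the metric values.

```python
import numpy as np

# Hypothetical similarity matrix for 3 songs.
toy_similarities = np.array([[1.0, 0.2, 0.7],
                             [0.2, 1.0, 0.5],
                             [0.7, 0.5, 1.0]])

# Ascending argsort per row, then reverse the columns for descending order.
toy_sorted = np.fliplr(np.argsort(toy_similarities, axis=1))
print(toy_sorted)

# Optional variant (assumption, not used in this notebook):
# drop the first column so a song is never recommended to itself.
toy_sorted_without_self = toy_sorted[:, 1:]
print(toy_sorted_without_self)
```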

We will compute what we call the *similar songs matrix*, where each row corresponds to a query song and the columns are the recommended songs, sorted by decreasing similarity.

Each `ij` item in the matrix contains 1 if the j-th recommended song for query song i belongs to the same genre as song i, and 0 otherwise.

Note that this *similar songs matrix* is built from the index matrix computed above, which is already sorted by similarity.

In [None]:
similar_songs = np.zeros_like(sorted_indexes)

for i in range(sorted_indexes.shape[0]):
  genre_i = labels[i]  # Label/genre of the queried song
  sorted_indexes_i = sorted_indexes[i, :]  # Sorted similar songs for queried song

  # Assign 1 if genres between songs match, 0 if not.
  similar_songs_i = np.array([1 if genre_i == labels[j] else 0 for j in sorted_indexes_i])
  similar_songs[i, :] = similar_songs_i
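
The loop above could also be written as a single vectorised comparison using NumPy fancy indexing and broadcasting. This is a sketch on hypothetical toy arrays (`toy_labels`, `toy_sorted`), cross-checked against the loop formulation.

```python
import numpy as np

# Hypothetical toy data: 3 songs, genres 0, 0 and 1.
toy_labels = np.array([0, 0, 1])
toy_sorted = np.array([[0, 1, 2],
                       [1, 0, 2],
                       [2, 1, 0]])

# toy_labels[toy_sorted] is the genre of every recommended song;
# toy_labels[:, None] is the genre of each query song, broadcast across columns.
toy_similar = (toy_labels[toy_sorted] == toy_labels[:, None]).astype(int)

# Same result with an explicit loop, for comparison.
toy_similar_loop = np.zeros_like(toy_sorted)
for i in range(toy_sorted.shape[0]):
    toy_similar_loop[i, :] = [int(toy_labels[i] == toy_labels[j]) for j in toy_sorted[i]]

print(toy_similar)
```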

# 5. Define evaluation metrics: AP@N and MAP@N

We will manually implement the AP@N and MAP@N evaluation metrics from the following definitions, adapted to our task:

**Precision@k**

$$ P(k) = \frac{\text{number of similar items among the top } k \text{ recommended songs}}{k} $$


**AveragePrecision@N**

$$ AP@N = \frac{1}{\min(N, \text{number of similar items in the full space})}\sum_{k=1}^{N}P(k)\cdot rel(k) $$

where $rel(k)$ is 1 if the song at rank $k$ is similar to the query song and 0 otherwise.

**MeanAveragePrecision@N**

$$ mAP@N = \frac{1}{M}\sum_{i=1}^{M}AP@N_i $$

where $M$ is the number of songs being averaged over (for example, all songs in the dataset).


In [None]:
def precision_k(similar_items, k):
  """
  Compute precision at top-k (P@k) elements as the ratio between number of similar songs and number of songs evaluated in the top (k).
  
  Args:
      similar_items (1d nparray): Vector where 1 represents truly similar and 0 not similar.
      k (int): Number of top items to compute P.

  Returns:
      float: The precision (P) value at top k items.
  """
  similar_items_k = similar_items[:k]
  P = sum(similar_items_k) / k
  return P

In [None]:
def average_precision_N(similar_items, N):
  """
  Compute average precision (AP) at top-N elements.
  
  Args:
      similar_items (1d nparray): Vector where 1 represents truly similar and 0 not similar.
      N (int): Number of sorted items to evaluate AP.

  Returns:
      float: The AP value evaluating N items.
  """
  m = np.min([np.sum(similar_items), N])  # min(number of relevant/similar items (1s) in the full space of items or N)
  sum_vector = []
  
  for k in range(1, N+1):
    sum_vector.append(precision_k(similar_items, k) * similar_items[k-1])

  AP = (1/m)*sum(sum_vector)
  return AP 
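
To check the definition by hand, the sketch below restates AP@N compactly and evaluates it on a small relevance vector. `toy_average_precision` is an illustrative helper, not one of the notebook's functions; returning 0 when no item is relevant is an added assumption, and the sketch assumes N does not exceed the length of the vector.

```python
import numpy as np

def toy_average_precision(similar_items, N):
    """Compact restatement of AP@N for a 0/1 relevance vector (illustrative helper)."""
    similar_items = np.asarray(similar_items)
    m = min(int(similar_items.sum()), N)
    if m == 0:
        return 0.0  # Assumption: define AP as 0 when nothing is relevant.
    top = similar_items[:N]                           # assumes N <= len(similar_items)
    precisions = np.cumsum(top) / np.arange(1, N + 1) # P(k) for k = 1..N
    return float(np.sum(precisions * top) / m)

relevance = [1, 0, 1, 1, 0]  # hypothetical ranking: hits at ranks 1, 3 and 4

# P(1) = 1, P(3) = 2/3, P(4) = 3/4, and m = min(3, 5) = 3,
# so AP@5 = (1 + 2/3 + 3/4) / 3 = 29/36.
ap_at_5 = toy_average_precision(relevance, 5)
print(round(ap_at_5, 4))
```

Because the normaliser counts similar items in the full vector, evaluating the same ranking at N = 3 gives (1 + 2/3) / 3 = 5/9, even though two of the top three songs are relevant.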


# 6. Evaluate music similarity

We will apply the previous evaluation metrics for a range of top-N values. We will evaluate at three levels: per song, per genre, and over the whole dataset.

### 6.1. Average precision @ N for each song

In [None]:
N_range = [10,25,50,75,100]
AP_songs = []

for N in N_range:
  AP_by_N = []
  for i in range(similar_songs.shape[0]):
    AP_by_N.append(average_precision_N(similar_songs[i, :], N))
  AP_songs.append(AP_by_N)

In [None]:
df = pd.DataFrame(list(zip(track_ids, labels_decoded, *AP_songs)),
                  columns=['song_id', 'genre'] + [f'@{N}' for N in N_range])
df.head()

In [None]:
from plotly.subplots import make_subplots

fig = make_subplots(rows=2, cols=3, subplot_titles=("AP@10", "AP@25", "AP@50", "AP@75", "AP@100"))

fig.add_trace(go.Histogram(x=AP_songs[0]), row=1, col=1)
fig.add_trace(go.Histogram(x=AP_songs[1]), row=1, col=2)
fig.add_trace(go.Histogram(x=AP_songs[2]), row=1, col=3)
fig.add_trace(go.Histogram(x=AP_songs[3]), row=2, col=1)
fig.add_trace(go.Histogram(x=AP_songs[4]), row=2, col=2)

fig.update_layout(height=500, 
                  width=1000, 
                  showlegend=False,
                  title_text="AP@N histograms")

fig.show()

We can observe that the more recommendations we request (larger N), the less precise the model becomes at recommending truly similar songs.

### 6.2. Mean Average precision @ N for each genre

Now, let's compute mAP@N for each genre and for the whole dataset.



In [None]:
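# Average only the AP@N columns; non-numeric columns such as song_id cannot be averaged.
mAP_by_genre = df.groupby('genre')[[f'@{N}' for N in N_range]].mean()
mAP_by_genre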

In [None]:
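# Average only the AP@N columns; non-numeric columns such as song_id and genre cannot be averaged.
mAP_dataset = df[[f'@{N}' for N in N_range]].mean()
mAP_dataset.to_frame().rename(columns={0: 'mAP'})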
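
The aggregation behind mAP@N is a plain mean of per-song AP values, either within each genre or over all songs. As a cross-check that does not depend on pandas, the following sketch computes both on hypothetical toy values (`toy_genres` and `toy_ap_at_10` are made up for illustration).

```python
import numpy as np

# Hypothetical per-song AP@10 values and their genres.
toy_genres = np.array(['blues', 'blues', 'rock', 'rock'])
toy_ap_at_10 = np.array([0.8, 0.6, 0.3, 0.5])

# mAP@10 per genre: mean of the AP values of the songs in that genre.
toy_map_by_genre = {
    str(genre): float(toy_ap_at_10[toy_genres == genre].mean())
    for genre in np.unique(toy_genres)
}

# mAP@10 over the whole toy dataset.
toy_map_dataset = float(toy_ap_at_10.mean())

print(toy_map_by_genre)
print(toy_map_dataset)
```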

In [None]:
fig = go.Figure()
for genre in df.genre.unique():
  fig.add_trace(go.Scatter(y=mAP_by_genre.loc[genre], x=N_range, mode='lines+markers', name=genre))
fig.add_trace(go.Scatter(y=mAP_dataset, x=N_range, mode='lines+markers', name='mean', line=dict(width=4, dash='dash')))

fig.update_layout(title='MAP@N by genre', xaxis_title='N', yaxis_title='MAP@N')
fig.show()



From the figure above we can draw several conclusions:
*  Rock is the genre for which similar songs are identified worst.
*  The number of recommendations N does not affect all genres equally. For some genres, such as classical, hip-hop, metal or disco, a larger N has little effect on precision. Others, such as blues, country or pop, are strongly affected by N. This suggests that, with this embedding representation and cosine similarity, the system separates similar songs less clearly from the rest of the dataset for those genres.