# **Implementation of MuLAN**

A joint embedding of music audio and natural language.

1.   Implemented and adapted from the paper: https://arxiv.org/abs/2208.12415
2.   The MIT Audio Spectrogram Transformer (AST), pretrained on AudioSet, is used to analyze music spectrograms.
3.   The pretrained MuLAN model is used for audio classification.
4.   For evaluation, a collection of songs and other audio clips, along with their text descriptions, is taken from YouTube and fed into the model.
5.   The model ranks the test prompts in order of similarity to the audio.
6.   The cosine similarity between the audio embedding and a set of key text captions is computed, and the top-k captions are taken as outputs.
7.   These outputs are then passed as input to the SDXL model to generate 2D images.






# 1. Import and Definition

## Install Dependencies

In [None]:
!apt-get update
!apt-get install unzip
# gdown downloads files from Google Drive by file ID or URL; torchmetrics provides metrics and evaluation utilities for PyTorch.
!pip install transformers librosa pandas gdown==4.6.3 torchmetrics
# accelerate provides utilities for running deep learning training and inference on hardware accelerators such as GPUs.
!pip install diffusers accelerate --upgrade

0% [Working]            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Ign:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Ign:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Ign:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Ign:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Ign:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Ign:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Ign:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/

## Downloading Dataset and Pre-trained Model

In [None]:
!gdown 1llN1-LpZQjyNUy77BWZ98YpQ9Q9nLVoi # Download the MuLAN model checkpoint from https://drive.google.com/uc?id=1llN1-LpZQjyNUy77BWZ98YpQ9Q9nLVoi
!gdown 1RWqA-0k91j7w90NBciSteVaOnwHrGcY5 # Download the timbre model checkpoint from https://drive.google.com/uc?id=1RWqA-0k91j7w90NBciSteVaOnwHrGcY5
!gdown 1D-qph7MtCexR7jE4wmmHzCOXokgv5TVF # Download the audio dataset 'temp.zip' from https://drive.google.com/uc?id=1D-qph7MtCexR7jE4wmmHzCOXokgv5TVF
!unzip temp.zip

Downloading...
From: https://drive.google.com/uc?id=1llN1-LpZQjyNUy77BWZ98YpQ9Q9nLVoi
To: /content/model_MuLan.pt
100% 846M/846M [00:12<00:00, 67.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1RWqA-0k91j7w90NBciSteVaOnwHrGcY5
To: /content/model_timbre.pt
100% 346M/346M [00:02<00:00, 131MB/s]
Downloading...
From: https://drive.google.com/uc?id=1D-qph7MtCexR7jE4wmmHzCOXokgv5TVF
To: /content/temp.zip
100% 2.14G/2.14G [00:28<00:00, 74.9MB/s]
Archive:  temp.zip
replace segment/2FQKfGCwjSE.wav? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

## Import Libraries

In [None]:
! pip install torchaudio


import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics

import pandas as pd
from torch.utils.data import Dataset
import librosa
from torch.utils.data import DataLoader
from tqdm import tqdm

from transformers import ASTModel, RobertaModel, AutoTokenizer, AutoFeatureExtractor



## Hyper-parameters


In [None]:
batch_size = 256 # IMPORTANT: reduce the batch size if you hit a CUDA out-of-memory error

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(1234)
print(device)


cuda


# 2. Class Definitions

All three classes (train_dataset, val_dataset, test_dataset) share the same structure:

Initialization (`__init__` method): Each class loads a CSV file of metadata about the music clips from the path passed as csv_file. The code reads a YouTube video ID (ytid) and a caption (caption) from each row. The classes differ only in which rows pd.read_csv loads: the training class loads the first 4500 rows, the validation class skips the first 4500 rows and loads the next 400, and the testing class skips the first 4900 rows and loads the remainder. Each class also creates a feature extractor from the pretrained checkpoint "MIT/ast-finetuned-audioset-10-10-0.4593", an Audio Spectrogram Transformer (AST) fine-tuned on the AudioSet dataset.

Length (`__len__` method): Returns the number of items in the dataset (i.e., the number of rows loaded from the CSV file), so that code iterating over the dataset knows when to stop.

Get Item (`__getitem__` method): Retrieves a single sample at the specified index (idx). It reads the YouTube video ID and caption from the corresponding row, loads the matching audio file from the segment/ directory with librosa.load at a sampling rate of 16,000 Hz (resampling if needed), and passes the waveform to the feature extractor. The feature extractor converts the waveform into the spectrogram features that the AST model expects, and these are returned alongside the text caption. This method is what lets the dataset be used with a PyTorch DataLoader for batched training or evaluation.

## Dataset Class

In [None]:
class train_dataset(Dataset):
    def __init__(self, csv_file):
        self.music_file = pd.read_csv(csv_file, encoding='utf-8', nrows=4500)
        self.feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

    def __len__(self):
        return len(self.music_file)

    def __getitem__(self, idx):
        column = self.music_file.iloc[idx]
        ytid = column['ytid']
        text = column['caption']
        waveform, sample_rate = librosa.load('segment/' + ytid + '.wav', sr = 16000)
        spec = self.feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
        return spec, text

class val_dataset(Dataset):
    def __init__(self, csv_file):
        # Skip data rows 1-4500 but keep the header line, so columns can be read by name
        self.music_file = pd.read_csv(csv_file, encoding='utf-8', skiprows=range(1, 4501), nrows=400)
        self.feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

    def __len__(self):
        return len(self.music_file)

    def __getitem__(self, idx):
        row = self.music_file.iloc[idx]
        ytid = row['ytid']
        text = row['caption']
        waveform, sample_rate = librosa.load('segment/' + ytid + '.wav', sr = 16000)
        spec = self.feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
        return spec, text

class test_dataset(Dataset):
    def __init__(self, csv_file):
        # Skip data rows 1-4900 but keep the header line, so columns can be read by name
        self.music_file = pd.read_csv(csv_file, encoding='utf-8', skiprows=range(1, 4901))
        self.feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

    def __len__(self):
        return len(self.music_file)

    def __getitem__(self, idx):
        row = self.music_file.iloc[idx]
        ytid = row['ytid']
        text = row['caption']
        waveform, sample_rate = librosa.load('segment/' + ytid + '.wav', sr = 16000)
        spec = self.feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
        return spec, text

**SPECIFICS OF EACH CLASS**

*train_dataset:* Intended for training the model. It loads the first 4500 rows from the CSV, targeting a larger portion of the data for model training to learn from as many examples as possible.

*val_dataset:* Used for validating the model's performance during or after training. It skips the first 4500 rows used for training and loads the next 400 rows, providing a separate dataset that the model has not seen during training to evaluate its generalization capability.

*test_dataset:* Used for the final evaluation of the model. It skips the rows used for training and validation, loading the rest of the dataset to test how well the model performs on completely unseen data.

Summary: These dataset classes provide distinct data for training, validating, and testing the model, so that each phase is evaluated on rows the others do not use. The feature extractor matches the AudioSet-pretrained AST checkpoint, so the audio is preprocessed the way that model expects.
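The row split described above can be sketched without pandas or the real musiccaps.csv. This is a minimal illustration on a tiny in-memory CSV; the helper name split_rows, the demo data, and the split sizes 6/2/2 are assumptions made for the example, not part of the notebook.

```python
import csv
import io

def split_rows(csv_text, n_train, n_val):
    """Split CSV data rows into consecutive train / val / test blocks."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    train = rows[:n_train]                  # first n_train rows
    val = rows[n_train:n_train + n_val]     # next n_val rows
    test = rows[n_train + n_val:]           # everything that remains
    return train, val, test

# Hypothetical 10-row file with the two columns the dataset classes read
demo_csv = "ytid,caption\n" + "\n".join(f"id{i},caption {i}" for i in range(10))
train_rows, val_rows, test_rows = split_rows(demo_csv, n_train=6, n_val=2)
print(len(train_rows), len(val_rows), len(test_rows))
```

With n_train=4500 and n_val=400 this is the same partition the three classes build with nrows and skiprows.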

In [None]:
traindata = train_dataset('musiccaps.csv')
trainLoader = DataLoader(traindata, batch_size=batch_size, shuffle=True, drop_last=True)
print(len(traindata))

valdata = val_dataset('musiccaps.csv')
valLoader = DataLoader(valdata, batch_size=batch_size, shuffle=True, drop_last=True)
print(len(valdata))

testdata = test_dataset('musiccaps.csv')
testLoader = DataLoader(testdata, batch_size=batch_size, shuffle=False, drop_last=True)
print(len(testdata))

4500
400
409


Using the dataset classes defined previously, PyTorch's DataLoader batches the train, validation, and test data. The train and validation loaders reshuffle every epoch, the test loader keeps a fixed order, and drop_last=True discards the final incomplete batch in each loader.
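Because drop_last=True discards the final incomplete batch, it helps to check how many samples each loader actually yields. The sketch below only does the arithmetic for the dataset sizes printed above and batch_size = 256; the helper name batches_with_drop_last is an assumption for illustration.

```python
def batches_with_drop_last(n_samples, batch_size):
    """Return (number of full batches, number of samples dropped)."""
    full_batches = n_samples // batch_size
    dropped = n_samples - full_batches * batch_size
    return full_batches, dropped

# Dataset sizes printed by the cell above
for name, n in [("train", 4500), ("val", 400), ("test", 409)]:
    full, dropped = batches_with_drop_last(n, 256)
    print(f"{name}: {full} batches, {dropped} samples dropped")
```

At this batch size the validation and test loaders each yield a single batch of 256 samples, so a smaller batch size (or drop_last=False) would be needed to cover every clip.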

## Network Class

**audio_model:**

This class processes audio inputs. It takes the spectrogram features produced by the feature extractor.

Components:

ASTModel: A pre-trained Audio Spectrogram Transformer (AST) model, specifically "MIT/ast-finetuned-audioset-10-10-0.4593". It has been fine-tuned for audio classification on the AudioSet dataset, which contains a wide variety of audio events, and is used here to obtain high-level representations of the audio input.

audio_fc: A small projection head made of two linear layers with a ReLU in between (768 to 384 to 128 dimensions). It maps the 768-dimensional AST output to a compact 128-dimensional embedding that can be compared with the text embedding.

Forward Pass:
The audio input is first processed by the AST model to obtain a pooled output, which summarizes the audio's features into a single vector.
This vector is then passed through audio_fc and L2-normalized so that the output lies on the unit hypersphere. The dot product of two such embeddings is then equal to their cosine similarity, which is what the joint embedding space with the text model relies on.
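The effect of the normalization step can be checked on small vectors: after L2 normalization, a plain dot product equals the cosine similarity of the original vectors. This is a minimal sketch in plain Python (not PyTorch); the helper names and the two example vectors are assumptions for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (same idea as F.normalize with p=2)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]   # hypothetical audio embedding
b = [4.0, 3.0]   # hypothetical text embedding
print(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```

This is why the later cells can use a dot product (via torch.einsum) as the similarity score.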

In [None]:
class audio_model(nn.Module):
    def __init__(self):
        super(audio_model, self).__init__()
        self.model = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
        self.audio_fc = nn.Sequential(
            nn.Linear(768, 384),
            nn.ReLU(),
            nn.Linear(384, 128)
        )


    def forward(self, audio):
        audio_x = self.model(audio).pooler_output
        audio_output = self.audio_fc(audio_x)
        audio_output = F.normalize(audio_output, p = 2, dim = -1)
        return audio_output

class text_model(nn.Module):
    def __init__(self):
        super(text_model, self).__init__()
        self.model = RobertaModel.from_pretrained("roberta-base")
        self.text_fc = nn.Sequential(
            nn.Linear(768, 384),
            nn.ReLU(),
            nn.Linear(384, 128)
        )

    def forward(self, text, mask):
        text_x = self.model(text, mask).pooler_output
        text_output = self.text_fc(text_x)
        text_output = F.normalize(text_output, p = 2, dim = -1)
        return text_output

class mulan(nn.Module):
    def __init__(self):
        super(mulan, self).__init__()
        self.audio_model = audio_model()
        self.text_model = text_model()

    def forward(self, audio, text, mask):
        audio_x = self.audio_model(audio)
        text_x = self.text_model(text, mask)
        return audio_x, text_x


**text_model:**

Analogous to audio_model, but designed for processing textual data.

Components:

RobertaModel: A pre-trained RoBERTa model ("roberta-base"), a transformer-based model widely used for natural language processing tasks. It extracts high-level semantic features from text inputs.

text_fc: Like audio_fc, a projection head of two linear layers with a ReLU in between (768 to 384 to 128 dimensions). It reduces the text features from 768 to 128 dimensions, producing a compact representation in the same space as the audio embedding.

Forward Pass:
The token IDs and their attention mask are passed to the RoBERTa model to obtain a pooled output, which summarizes the text's semantic content in a single vector.
This vector is passed through text_fc and L2-normalized, as in the audio model, so that it can be compared directly with audio embeddings.

**MuLAN Model**

Purpose: This class combines audio_model and text_model so that audio and text can be embedded by a single model.
Components:
audio_model and text_model Instances: The mulan model holds one instance of each, allowing them to be used together.
Forward Pass:
Accepts audio data, tokenized text, and a text attention mask as inputs.
Processes the audio data through audio_model and the text data through text_model independently, producing corresponding embeddings.
Returns the embeddings from both models, which can then be used for tasks requiring joint audio-text representations, such as cross-modal retrieval or classifying audio against text prompts.
Because both embeddings live in a shared 128-dimensional space, an audio clip can be compared directly with any piece of text, which is how the evaluation below ranks prompts and key captions.

## Initializing Network

In [None]:
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
audio_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

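mulan_model = mulan()
mulan_model = mulan_model.to(device)
# map_location lets the checkpoint load on CPU-only machines as well
statedict = torch.load('model_MuLan.pt', map_location=device)
# Strip the 'module.' prefix (presumably left by an nn.DataParallel wrapper) from the checkpoint keys
for key in list(statedict.keys()):
    statedict[key.replace('module.', '')] = statedict.pop(key)
mulan_model.load_state_dict(statedict)
mulan_model.eval()
# Note: from here on, audio_model and text_model refer to these instances, not the classes defined above
audio_model = mulan_model.audio_model
text_model = mulan_model.text_model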



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
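The key renaming done before load_state_dict can be illustrated on a plain dictionary. This is a sketch under assumptions: the helper strip_prefix and the example keys are hypothetical, and it strips the prefix only at the start of a key, which is a slightly stricter variant of the key.replace('module.', '') call used in the cell above.

```python
def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Hypothetical checkpoint keys; the values stand in for weight tensors
checkpoint = {
    "module.audio_model.audio_fc.0.weight": "w0",
    "module.text_model.text_fc.0.bias": "b0",
    "already.clean.key": "x",
}
cleaned = strip_prefix(checkpoint)
print(sorted(cleaned))
```

Building a new dictionary also avoids mutating the loaded checkpoint while iterating over it.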


# 3. Evaluating an example from Music Captions Data

In [1]:

def evaluate_audio(song_id, testing_text):
  # Take the first batch from the (unshuffled) test loader and pick one sample from it
  test_data = next(iter(testLoader))
  audio, text = test_data
  audio = audio['input_values'][song_id].to(device) # use this to be fed into SDXL
  text = text[song_id]  # reference caption of the selected clip

  # Tokenize the candidate prompts, padding them to a common length
  tokenized_text = tokenizer(testing_text, return_tensors="pt", padding=True)
  text_input_ids = tokenized_text['input_ids'].to(device)
  text_attention_mask = tokenized_text['attention_mask'].to(device)

  audio_model = mulan_model.audio_model
  text_model = mulan_model.text_model

  # Embed the audio clip and the prompts into the shared space
  audio_embedding = audio_model(audio)
  text_embedding = text_model(text_input_ids, text_attention_mask)

  # The embeddings are L2-normalized, so the dot product is the cosine similarity
  sim = torch.einsum('i d, j d -> i j', audio_embedding, text_embedding)
  print(sim)
  # For example, an output of [-0.2857,  0.8734,  0.3234] would indicate
  # that the 2nd sentence has the highest similarity to the audio input.

  return text, audio_embedding, text_embedding

In [3]:
ID = 34
test_text = ['A high quality piano solo', 'A low quality music with vocal singing, guitar, and drum', 'A music of male solo and guitar accompany']

text, audio_e, text_e = evaluate_audio(ID, test_text)

print(text)
print(audio_e)



An audio clip is taken from the test dataset. The corresponding text description is "*The low quality recording features a rock song that contains filtered female vocal singing over loud and harsh electric guitar chords, energetic drums and buzzy bass guitar. It sounds chaotic due to the bad mix and since the drums are panned in an unconventional way - making the stereo image unbalanced. It sounds harsh, but also energetic*".

We used three test sentences for classification: *'A high quality piano solo', 'A low quality music with vocal singing, guitar, and drum', 'A music of male solo and guitar accompany'*.

Ideally, the similarity between the reference description and the three sentences should rank as 2 > 3 > 1, so we expect the model's audio-text similarities to follow the same order.
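Turning a row of similarity scores into a ranking such as 2 > 3 > 1 is a simple sort. The sketch below uses the illustrative scores from the comment inside evaluate_audio rather than a fresh model run; the variable names are assumptions for the example.

```python
prompts = [
    'A high quality piano solo',
    'A low quality music with vocal singing, guitar, and drum',
    'A music of male solo and guitar accompany',
]
# Illustrative scores taken from the comment in evaluate_audio
scores = [-0.2857, 0.8734, 0.3234]

# Pair each score with its 1-based prompt number and sort from most to least similar
ranked = sorted(zip(scores, range(1, len(prompts) + 1)), reverse=True)
order = [number for _, number in ranked]
print(' > '.join(str(number) for number in order))
```

The same sort applied to the tensor printed by the cell gives the ranking the model actually produced for this clip.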

tensor([[-0.2829, -0.4969, -0.2414]], device='cuda:0', grad_fn=<ViewBackward0>)


tensor([[-0.0878, -0.0977,  0.0478, -0.1218, -0.0939,  0.2267, -0.0194, -0.1304,
          0.2009,  0.0443, -0.0122, -0.0113,  0.0684, -0.1238,  0.0213, -0.0134,
          0.1139,  0.0606, -0.0067,  0.0478, -0.1411,  0.0995,  0.1027, -0.0195,
          0.0619,  0.0172, -0.0235,  0.1548, -0.0523,  0.1236, -0.1262,  0.0914,
         -0.0650,  0.1112,  0.0467, -0.1693,  0.1363,  0.1036, -0.0057,  0.0472,
          0.0819, -0.0471,  0.0286,  0.0917, -0.0347, -0.0239,  0.0134,  0.1247,
         -0.0858,  0.0046,  0.0751,  0.0217,  0.0199,  0.0389, -0.0525, -0.0146,
         -0.0157, -0.0111, -0.1268, -0.0785, -0.0351,  0.0365, -0.0620, -0.0836,
         -0.0005, -0.1699,  0.0071,  0.0540,  0.0285,  0.0832, -0.1317,  0.0553,
         -0.0566,  0.0812,  0.0746,  0.0289, -0.0703, -0.0202, -0.0042,  0.0910,
          0.1415, -0.1476, -0.0143,  0.0369, -0.1501,  0.0715,  0.0777,  0.0474,
         -0.0256,  0.1452, -0.1847,  0.0464,  0.1203, -0.0784, -0.0987, -0.1293,
          0.0045,  0.0451, -

# 4. Calculating Cosine Matrix and Output Keys


### !! No need to run the **upload cell below** again once text_keys.txt has been uploaded to Google Drive

In [None]:
from google.colab import files

# Upload a file
uploaded = files.upload()


Saving text_keys.txt to text_keys.txt


In [None]:
# Specify the file name
file_name = 'text_keys.txt'  # Replace with the actual uploaded file name

with open(file_name, 'r') as file:
    content = file.readlines()

# Convert to Python List
tkeys = [line.strip() for line in content]
print(tkeys)
print(len(tkeys))

['exciting', 'vibrant', 'joyful', 'playful', 'buoyant', 'animated', 'uplifting', 'inspiring', 'passionate', 'energetic', 'groovy', 'aggressive', 'funky', 'epic', 'romantic', 'sentimental', 'melancholic', 'soulful', 'nostalgic', 'heartfelt', 'relaxing', 'melodic', 'pleasant', 'chill', 'dreamy', 'meditative', 'engaging', 'suspenseful', 'dramatic', 'distorted', 'eerie', 'rock genre', 'electronic genre', 'ambient genre', 'hip hop genre', 'pop song', 'film music genre', 'jazz genre', 'folk music genre', 'love song', 'classical genre', 'country genre', 'upbeat tempo', 'driving tempo', 'leisurely tempo', 'measured tempo', 'languid tempo', 'flexible tempo', 'rubato', 'loud', 'soft', 'lively', 'soothing', 'intense', 'sudden dynamic shifts', 'gradual build up', 'steady dynamics', 'big range of dynamics', 'smooth dynamic contours', 'steady rhythm', 'syncopated rhythm', 'offbeat rhythm', 'swinging rhythm', 'irregular rhythm', 'regular rhythm', 'pulsating rhythm', 'repeated theme', 'verse and choru

In [None]:
tokenized_tkeys = tokenizer(tkeys, return_tensors="pt", padding=True)
tkeys_input_ids = tokenized_tkeys['input_ids'].to(device)
tkeys_attention_mask = tokenized_tkeys['attention_mask'].to(device)


In [None]:
tkeys_embedding = text_model(tkeys_input_ids, tkeys_attention_mask)

In [None]:
# audio_e is the audio embedding returned by evaluate_audio above
cosine_matrix = torch.einsum('i d, j d -> i j', audio_e, tkeys_embedding) # simply calculate the dot product as similarity
print(cosine_matrix)

tensor([[ 0.1842,  0.6502, -0.3024,  0.1244, -0.0248, -0.0451,  0.0048, -0.1150,
         -0.2462,  0.1571, -0.1453,  0.4833,  0.6967, -0.1221, -0.3116, -0.0669,
         -0.3280, -0.2760, -0.2402, -0.3712, -0.2423, -0.2658, -0.1754, -0.2941,
         -0.2604, -0.1357, -0.0757, -0.1811,  0.0610, -0.0051, -0.3115, -0.0687,
          0.0769, -0.1950,  0.9505,  0.0572, -0.2787, -0.2318, -0.3539, -0.3205,
         -0.3097, -0.3250,  0.2932,  0.1228, -0.1469, -0.2403, -0.1662,  0.1539,
          0.0196, -0.1007, -0.2920, -0.3261, -0.1344, -0.2387, -0.0369, -0.0238,
         -0.1510, -0.0449,  0.1894,  0.2761,  0.3031,  0.7652,  0.7055,  0.2085,
          0.4293,  0.2069, -0.0849, -0.3158, -0.0409, -0.2272, -0.2774, -0.2071,
         -0.2093, -0.1954, -0.3086,  0.5207,  0.0570, -0.3248, -0.2800,  0.1158,
          0.0267, -0.2796, -0.3213, -0.3097, -0.1559, -0.2367, -0.2125, -0.2366,
         -0.2750, -0.2999, -0.0554, -0.2212, -0.2627, -0.2983, -0.1915, -0.2854,
         -0.1684, -0.1305, -

In [None]:
# Flatten the cosine matrix and get the indices of the top 10 values
flat_cosine_matrix = cosine_matrix.view(-1)
top_values, top_indices = torch.topk(flat_cosine_matrix, k=10)

# Initialize lists to store the top 10 values and their indices
top_values_list = []
top_indices_list = []

# Append top values and their indices to the respective lists
for i in range(len(top_values)):
    top_values_list.append(top_values[i].item())
    row_index = top_indices[i] // cosine_matrix.size(1)
    col_index = top_indices[i] % cosine_matrix.size(1)
    top_indices_list.append((row_index.item(), col_index.item()))

print("Top 10 values:", top_values_list)
print("Indices of the top 10 values:", top_indices_list)

Top 10 values: [0.9504696130752563, 0.765236496925354, 0.7055051922798157, 0.6966748237609863, 0.6502219438552856, 0.535324215888977, 0.520749568939209, 0.4832949638366699, 0.4293314814567566, 0.30305320024490356]
Indices of the top 10 values: [(0, 34), (0, 61), (0, 62), (0, 12), (0, 1), (0, 103), (0, 75), (0, 11), (0, 64), (0, 60)]


In [None]:
# Initialize a new list to store the extracted items
extracted_items = []

# Extract items from tkeys: each column of the cosine matrix corresponds to one text key
for index_pair in top_indices_list:
    row_index, col_index = index_pair
    extracted_items.append(tkeys[col_index])

output = ', '.join(extracted_items)
print(output)

hip hop genre, offbeat rhythm, swinging rhythm, funky, vibrant, percussive bass line, percussion, aggressive, regular rhythm, syncopated rhythm
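The top-k selection and key lookup above can be sketched in plain Python. The example uses the first eight text keys and the first eight similarity scores printed earlier in this section; the helper name top_k_keys and the choice k=3 are assumptions for illustration.

```python
def top_k_keys(scores, keys, k):
    """Return the k (key, score) pairs with the highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(keys[i], scores[i]) for i in order[:k]]

# First eight keys and scores from the printed tkeys list and cosine matrix
keys = ['exciting', 'vibrant', 'joyful', 'playful',
        'buoyant', 'animated', 'uplifting', 'inspiring']
scores = [0.1842, 0.6502, -0.3024, 0.1244, -0.0248, -0.0451, 0.0048, -0.1150]

top3 = top_k_keys(scores, keys, k=3)
prompt = ', '.join(name for name, _ in top3)  # comma-separated, like `output` above
print(prompt)
```

Because there is a single audio clip, the score row maps one-to-one onto the key list, so no row/column arithmetic is needed.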


In [None]:
# cosine_matrix is already a tensor: detach it from the graph and move it to the CPU as a NumPy array
cosine_similarity_array = cosine_matrix.detach().cpu().numpy()

# Analyze the distribution of scores
mean_score = cosine_similarity_array.mean()
std_dev = cosine_similarity_array.std()

# Set the threshold based on mean and standard deviation
threshold = mean_score + 2.3 * std_dev  # You can adjust the multiplier as needed

# Split the similarity scores by the threshold (these lists hold the scores, not the key texts)
good_descriptions = [score for score in cosine_similarity_array[0] if score >= threshold]
inaccurate_descriptions = [score for score in cosine_similarity_array[0] if score < threshold]

print("Threshold:", threshold)
print("Good Descriptions:", good_descriptions)
print("Inaccurate Descriptions:", inaccurate_descriptions)

Threshold: 0.055050058662891366
Good Descriptions: [0.28316426, 0.12839445]
Inaccurate Descriptions: [-0.15461688, -0.33816493, -0.25643817, -0.36298656, -0.21067788, -0.22419176, -0.10566264, -0.23600757, -0.32691246, -0.22349346, -0.31646547, -0.42285937, -0.32757074, -0.1149856, -0.21383023, -0.17365645, -0.4613264, 0.008459777, -0.14053553, -0.26598924, -0.1375659, -0.1872864, -0.26141763, -0.20761073, -0.14513937, -0.24123429, -0.3839349, -0.11058312, -0.25161153, -0.3162837, -0.2432096, -0.23976976, -0.2025727, -0.38178623, -0.082676575, -0.25843358, -0.25570387, -0.49184602, -0.2539662, -0.33682257, -0.3540408, -0.32770595, -0.21802464, -0.3348496, -0.28769612, -0.26976502, -0.33389604, -0.5053242, -0.38588446, -0.20909336, -0.20754328, -0.24904914, -0.23116156, -0.11657621, -0.19682577, -0.14867371, -0.1683981, -0.25464207, -0.27564687, -0.23862515, -0.3594663, -0.4109816, -0.3046247, -0.32258853, -0.18550923, -0.24982587, -0.4474535, -0.21344309, -0.18944913, -0.2862116, -0.17

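To keep the key texts rather than only their scores, the threshold can be applied to (key, score) pairs. This is a minimal sketch using the standard library; the helper name keys_above_threshold is hypothetical, the data are the first eight keys and scores printed earlier, and the multiplier 2.0 is chosen for this short list instead of the 2.3 used above.

```python
import statistics

def keys_above_threshold(scores, keys, multiplier=2.3):
    """Keep keys whose score is at least mean + multiplier * std."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population std, matching NumPy's default
    threshold = mean + multiplier * std
    selected = [key for key, score in zip(keys, scores) if score >= threshold]
    return threshold, selected

keys = ['exciting', 'vibrant', 'joyful', 'playful',
        'buoyant', 'animated', 'uplifting', 'inspiring']
scores = [0.1842, 0.6502, -0.3024, 0.1244, -0.0248, -0.0451, 0.0048, -0.1150]

threshold, selected = keys_above_threshold(scores, keys, multiplier=2.0)
print(round(threshold, 4), selected)
```

With only eight scores no value can exceed the mean by much more than 2.6 standard deviations, which is why a smaller multiplier is used here than on the full key list.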
