# Voice clustering

This notebook partitions a collection of audio files, each containing a single speaker, into sets of files from the same speaker, using the [wespeaker toolkit](https://github.com/wenet-e2e/wespeaker). This is called voice clustering.

The [latest version](https://github.com/ikks/Speaker-embedding-and-Clustering/blob/master/auto/notebook/voiceclustering.ipynb) of this notebook is on GitHub, and there is also a [notebook on Kaggle](https://www.kaggle.com/code/igortamara/voice-clustering).

The output is a zip file that holds the clustering information as a csv file and a sqlite database.

## 📜 Instructions

You will need the audio files; in this case, we are using the [carlfm dataset](https://www.kaggle.com/datasets/carlfm01/120h-spanish-speech). Feel free to copy this notebook and adapt it to your data.

You will also need a csv file that holds the transcriptions.

## 👀 Under the hood

The process is:

* For each audio file, create an embedding vector
* Cluster the embeddings with HDBSCAN
* Mark each audio file with the id assigned by the clustering process

[The git repo](https://github.com/ikks/Speaker-embedding-and-Clustering) contains all the details, including how wespeaker is used to calculate the embeddings and how they are stored in sqlite.

# 🧰 Software and Packages installation

Run the next cell.

In [18]:
%%bash
apt install -y sqlite3
git clone https://github.com/ikks/Speaker-embedding-and-Clustering.git clustering
pip install git+https://github.com/wenet-e2e/wespeaker.git
pip install scikit-learn==1.7.0 sqlite-vec tqdm
mkdir -p /kaggle/working/output /kaggle/working/bin

echo -e "
cp /kaggle/working/db.sqlite /kaggle/working/output
sqlite3 -quote -header /kaggle/working/output/db.sqlite 'SELECT * FROM files' > /kaggle/working/output/clustering.csv
cd /kaggle/working
zip -q dbdata.zip output/*
mv dbdata.zip output/
ls -sh /kaggle/working/output
rm output/clustering.csv
rm output/db.sqlite
" > /kaggle/working/bin/updateversion.sh
chmod +x  /kaggle/working/bin/updateversion.sh

echo -e "
    We are ready for the next step
"


    We are ready for the next step



# ⚙️ Setup and task execution

Set `csv_file` to the path of the csv file, which contains three columns: the first is the name of the wav file and the third is its transcription.

Set `path_wavs` to the path of the directory that contains the wavs to be processed.
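
Before starting a long run, it can help to check that the csv and the wav directory agree. The sketch below is one possible check, not part of the repo. It assumes the csv is comma-delimited with a header row, and that the first column may carry a directory prefix, which is dropped before looking for the file. `read_transcriptions` and `missing_wavs` are hypothetical helper names.

```python
import csv
from pathlib import Path


def read_transcriptions(csv_path, has_header=True):
    """Return (wav_name, transcription) pairs taken from columns 1 and 3 of the csv."""
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)  # assumption: comma-delimited
        if has_header:
            next(reader, None)  # assumption: the first row is a header
        for row in reader:
            if len(row) < 3:
                continue  # skip empty or malformed rows
            pairs.append((row[0], row[2]))
    return pairs


def missing_wavs(pairs, wav_dir):
    """List wav names from the csv that are not present directly under wav_dir."""
    wav_dir = Path(wav_dir)
    return [name for name, _ in pairs if not (wav_dir / Path(name).name).is_file()]
```

Once both variables are set, `missing_wavs(read_transcriptions(csv_file), path_wavs)` should return an empty list if every file named in the csv is available.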

Run the next cell once the required variables are set up.

## 🪓 Stop and rerun the process
You can interrupt the task with *Cancel Run* and continue where it left off by running the cell again.

In [None]:
# Configure these two variables
csv_file = "/kaggle/input/120h-spanish-speech/asr-spanish-v1-carlfm01/files.csv"
path_wavs = "/kaggle/input/120h-spanish-speech/asr-spanish-v1-carlfm01/audios"

%cd /kaggle/working/clustering
from speaker_embedding import nb_cluster_task
db_file = "/kaggle/working/db.sqlite"
nb_cluster_task(db_file, csv_file, path_wavs)


/kaggle/working/clustering
using device: cpu
Initializing embedding model...
Embedding model initialized
Reviewing 112845 files
Loading previously calculated embeddings...
Embeddings to be calculated: 103327


  0%|          | 0/103327 [00:00<?, ?it/s]
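
The log line "Loading previously calculated embeddings..." suggests that finished work is kept in the database, which is what makes a rerun cheap. A minimal sketch of that idea follows. The `embeddings` table, its columns, and both function names are hypothetical and chosen for illustration; the repo's real schema may differ.

```python
import sqlite3


def pending_files(conn, all_names):
    """Return the names that do not yet have a stored embedding."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings (name TEXT PRIMARY KEY, vector BLOB)"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM embeddings")}
    return [name for name in all_names if name not in done]


def process(conn, all_names, embed):
    """Embed only the pending files, committing after each so progress survives an interrupt."""
    for name in pending_files(conn, all_names):
        vector = embed(name)  # bytes produced by some embedding function
        conn.execute(
            "INSERT INTO embeddings (name, vector) VALUES (?, ?)", (name, vector)
        )
        conn.commit()


def fake_embed(name):
    """Placeholder for a real model: returns the file name as bytes."""
    return name.encode()


conn = sqlite3.connect(":memory:")
names = ["a.wav", "b.wav", "c.wav"]

process(conn, names[:1], fake_embed)    # first run, "interrupted" after one file
remaining = pending_files(conn, names)  # a rerun would only handle these
```

Committing after every file trades some speed for safety; batching commits would be faster but would lose the uncommitted batch on *Cancel Run*.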

# 💾 Storing information

When you run the next cell, you'll get /kaggle/working/output/dbdata.zip containing the sqlite and csv files.

In [21]:
%%bash
/kaggle/working/bin/updateversion.sh

echo -e "

In the right panel go to Output and download dbdata.zip  --->
"

total 44M
1.3M clustering.csv
 31M dbdata.zip
 13M db.sqlite


In the right panel go to Output and download dbdata.zip  --->
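
For readers who prefer to stay in Python, the packaging done by `updateversion.sh` can be approximated with the standard library. This is a sketch under assumptions, not a drop-in replacement: `export_bundle` is a hypothetical helper, and the csv quoting follows Python's `csv` module rather than the `-quote` mode of the `sqlite3` command line tool. The `files` table name comes from the shell script above.

```python
import csv
import sqlite3
import zipfile
from pathlib import Path


def export_bundle(db_path, out_dir, table="files"):
    """Dump `table` to clustering.csv and zip it together with the database."""
    db_path, out_dir = Path(db_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "clustering.csv"
    zip_path = out_dir / "dbdata.zip"

    conn = sqlite3.connect(db_path)
    try:
        # The table name is interpolated into the query, so only pass trusted values.
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        header = [column[0] for column in cursor.description]
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(cursor)
    finally:
        conn.close()

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(csv_path, arcname="output/clustering.csv")
        archive.write(db_path, arcname="output/db.sqlite")
    csv_path.unlink()  # keep only the zip, as the shell script does
    return zip_path
```

Calling `export_bundle("/kaggle/working/db.sqlite", "/kaggle/working/output")` would leave `dbdata.zip` in the output directory with the same two entries the shell script produces.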



# Possible improvements

* Use diarization to handle audio files containing multiple speakers. Consider using [nemo](https://github.com/nvidia/nemo/), [wespeaker](https://github.com/wenet-e2e/wespeaker) or pyannote. If you find another, feel free to add a comment.
* Use xi-vectors to update the cluster information.
* Export the results to a Kaggle dataset so they can be downloaded from Kaggle. For now, [scp can be used](https://www.kaggle.com/code/igortamara/ssh-to-kaggle).

# References

* [Wespeaker](https://github.com/wenet-e2e/wespeaker) for speaker identification, verification and diarization.
* [Blogpost](https://medium.com/@sapkotabinit2002/speaker-identification-and-clustering-using-pyannote-dbscan-and-cosine-similarity-dfa08b5b2a24) explaining the concepts, from which this work took a lot of information and code.
* [Public Domain Spanish dataset](https://www.kaggle.com/datasets/carlfm01/120h-spanish-speech) from LibriVox.