<a href="https://colab.research.google.com/github/pranavgupta2603/musiclm-training/blob/main/musiccaps_explorer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MusicCaps Explorer

In this notebook, we see how you can use `yt-dlp` to download clips from Google's MusicCaps dataset. MusicCaps contains music clips paired with text captions. You could use a dataset such as this to train a nice text-to-audio generation model 😉!

This notebook is entirely based on https://github.com/nateraw/download-musiccaps-dataset, with some additional annotations, so please give that repo a star. It shows how to load the dataset, download the underlying clips, and explore them.

Let's get started! 🔥

## Introduction and setup

Let's kick things off by installing some dependencies and loading the dataset. **Note that Kaggle comes with an old `datasets` version but we need a newer one, so you might need to restart the notebook after installing to make sure it's using the latest version.**

In [None]:
%%capture
! pip install -U datasets[audio]
! pip install yt-dlp

# For the interactive interface we'll need gradio
! pip install gradio
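
To confirm which versions ended up installed, a small check like the following can help. This is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical helper name, not part of any of the packages above. It reads the installed package metadata, so a kernel that already imported an older `datasets` still needs a restart to pick up the new one.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the three packages installed in the cell above.
for name in ("datasets", "yt-dlp", "gradio"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```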

We'll use the Hugging Face `datasets` library to load the version of the dataset hosted in the [google/MusicCaps](https://huggingface.co/datasets/google/MusicCaps) repository.

In [4]:
from datasets import load_dataset

ds = load_dataset('google/MusicCaps', split='train')
ds



Dataset({
    features: ['ytid', 'start_s', 'end_s', 'audioset_positive_labels', 'aspect_list', 'caption', 'author_id', 'is_balanced_subset', 'is_audioset_eval'],
    num_rows: 5521
})

We see that there are 5,521 music samples. Each sample contains information such as the caption and, perhaps surprisingly, a YouTube ID. Rather than exposing the audio files directly, this dataset contains the ID of a YouTube video (`ytid` field) along with `start_s` and `end_s`, which indicate the time range of the sample within the video. This makes it a bit harder to work with than other datasets.
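
To make the role of these three fields concrete, here is a minimal sketch that turns one row into a video URL and a clip length. The `clip_info` helper and the sample row are hypothetical; the ID and timestamps are placeholders rather than real dataset values.

```python
def clip_info(row, url_base="https://www.youtube.com/watch?v="):
    """Return the video URL and clip length (in seconds) for one MusicCaps row."""
    url = f"{url_base}{row['ytid']}"
    duration_s = row["end_s"] - row["start_s"]
    return url, duration_s


# Hypothetical row: placeholder ID and timestamps, not taken from the dataset.
row = {"ytid": "XXXXXXXXXXX", "start_s": 30, "end_s": 40}
url, duration_s = clip_info(row)
print(url, duration_s)
```

With the real dataset loaded, the same function could be called as `clip_info(ds[0])`.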

## Loading audio data

As our goal is just to load some data and explore it, a small subset (for example, 32 samples) is enough. Feel free to change the `samples_to_load` variable in the next cell, but keep in mind that downloading the whole dataset can take a long time. As written, that cell sets `samples_to_load` to the full 5,521 rows and leaves the `select` call commented out, so it processes everything. Kaggle notebooks have 4 cores, so we can use that to our advantage too.

Let's go and download the data! 🚀

In [2]:
# JUST HELPER METHODS IN THIS CELL 

import subprocess
import os
from pathlib import Path

def download_clip(
    video_identifier,
    output_filename,
    start_time,
    end_time,
    tmp_dir='/tmp/musiccaps',
    num_attempts=5,
    url_base='https://www.youtube.com/watch?v='
):
    # Ask yt-dlp for the best audio stream, extract it to WAV, and keep only
    # the [start_time, end_time] section of the video.
    command = f"""
        yt-dlp --quiet --no-warnings -x --audio-format wav -f bestaudio -o "{output_filename}" --download-sections "*{start_time}-{end_time}" {url_base}{video_identifier}
    """.strip()

    # Retry up to num_attempts times; on the last failure return the error output.
    for attempt in range(1, num_attempts + 1):
        try:
            subprocess.check_output(command, shell=True,
                                    stderr=subprocess.STDOUT)
            break
        except subprocess.CalledProcessError as err:
            if attempt == num_attempts:
                return False, err.output

    # Check if the clip was successfully saved.
    status = os.path.exists(output_filename)
    return status, 'Downloaded' if status else 'Output file not found'

def process(example):
    # `data_dir` is a pathlib.Path defined in the next cell, before `ds.map` runs.
    outfile_path = str(data_dir / f"{example['ytid']}.wav")
    status = True
    # Skip the download if the clip is already on disk.
    if not os.path.exists(outfile_path):
        status, log = download_clip(
            example['ytid'],
            outfile_path,
            example['start_s'],
            example['end_s'],
        )

    example['audio'] = outfile_path
    example['download_status'] = status
    return example
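
The helper above hands a single shell string to `subprocess`. As an alternative sketch (not the method this notebook uses), the same `yt-dlp` flags can be passed as an argument list, which avoids shell quoting issues. `build_command` and `run_with_retries` are hypothetical names, and the `runner` parameter is an assumption added only so the retry loop can be exercised without a network connection.

```python
import subprocess


def build_command(video_identifier, output_filename, start_time, end_time,
                  url_base="https://www.youtube.com/watch?v="):
    """Build the same yt-dlp invocation as an argument list (no shell needed)."""
    return [
        "yt-dlp", "--quiet", "--no-warnings",
        "-x", "--audio-format", "wav",
        "-f", "bestaudio",
        "-o", output_filename,
        "--download-sections", f"*{start_time}-{end_time}",
        f"{url_base}{video_identifier}",
    ]


def run_with_retries(command, num_attempts=5, runner=subprocess.check_output):
    """Run `command` up to `num_attempts` times; return (ok, output_or_error)."""
    last_error = b""
    for _ in range(num_attempts):
        try:
            return True, runner(command, stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError as err:
            # Remember the most recent error output and try again.
            last_error = err.output
    return False, last_error


# Placeholder ID and output path, for illustration only.
cmd = build_command("XXXXXXXXXXX", "clip.wav", 30, 40)
print(cmd)
```

Calling `run_with_retries(cmd)` would then perform the actual download, provided `yt-dlp` is installed and on the `PATH`.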

In [5]:
from datasets import Audio

samples_to_load = 5521    # How many samples to load (5521 is the full dataset)
cores = 4                 # How many processes to use for the loading
sampling_rate = 44100     # Sampling rate for the audio, keep at 44100
writer_batch_size = 1000  # How many examples to keep in memory per worker. Reduce if OOM.
data_dir = "./music_new_data" # Where to save the data

# Uncomment to keep only the first `samples_to_load` samples
#ds = ds.select(range(samples_to_load))

# Create directory where data will be saved
data_dir = Path(data_dir)
data_dir.mkdir(exist_ok=True, parents=True)

ds = ds.map(
        process,
        num_proc=cores,
        writer_batch_size=writer_batch_size,
        keep_in_memory=False
    ).cast_column('audio', Audio(sampling_rate=sampling_rate))
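
Downloads can fail, which is why `process` records a `download_status` for every row. To see which clips actually made it to disk, a sketch along these lines may be useful. `split_by_presence` is a hypothetical helper; it relies only on the `<ytid>.wav` naming used by `process`, and the demo runs in a temporary directory with placeholder IDs.

```python
import tempfile
from pathlib import Path


def split_by_presence(ytids, data_dir):
    """Split IDs into (present, missing) by whether `<ytid>.wav` exists."""
    data_dir = Path(data_dir)
    present, missing = [], []
    for ytid in ytids:
        target = present if (data_dir / f"{ytid}.wav").exists() else missing
        target.append(ytid)
    return present, missing


# Self-contained demo: one placeholder clip exists, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "aaa.wav").touch()
    present, missing = split_by_presence(["aaa", "bbb"], tmp)
print(present, missing)
```

In this notebook the call would presumably look like `split_by_presence(ds['ytid'], data_dir)`.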



In [None]:
# -r recurses into the directory; without it only the empty folder entry is archived
!zip -r music_data.zip ./music_new_data


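In [7]:
# Collect the examples whose audio can actually be decoded. The dataset is
# walked in three index ranges so that each range can be run in its own cell.
new_ds = []
chunk1 = range(0, 1800)
chunk2 = range(1800, 3600)
chunk3 = range(3600, len(ds))

In [None]:
for i in chunk1:
  try:
    new_ds.append(ds[i])
  except Exception:
    # Decoding fails (e.g. FileNotFoundError) when the clip was never downloaded
    pass

In [None]:
for i in chunk2:
  try:
    new_ds.append(ds[i])
  except Exception:
    pass

In [None]:
for i in chunk3:
  try:
    new_ds.append(ds[i])
  except Exception:
    pass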

Done! Let's look at the data of one example.

In [1]:
ds[0]


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cp -av "./drive/MyDrive/music_new_data" "./music_new_data"

'./drive/MyDrive/music_new_data' -> './music_new_data'
'./drive/MyDrive/music_new_data/CaCjiFUL6Fg.wav' -> './music_new_data/CaCjiFUL6Fg.wav'
'./drive/MyDrive/music_new_data/-0Gj8-vB1q4.wav' -> './music_new_data/-0Gj8-vB1q4.wav'
'./drive/MyDrive/music_new_data/hoPnrbKOEl8.wav' -> './music_new_data/hoPnrbKOEl8.wav'
'./drive/MyDrive/music_new_data/hpiFoinUgvY.wav' -> './music_new_data/hpiFoinUgvY.wav'
'./drive/MyDrive/music_new_data/Cchf2QH63bI.wav' -> './music_new_data/Cchf2QH63bI.wav'
'./drive/MyDrive/music_new_data/hrCf8rMBtA8.wav' -> './music_new_data/hrCf8rMBtA8.wav'
'./drive/MyDrive/music_new_data/-0vPFx-wRRI.wav' -> './music_new_data/-0vPFx-wRRI.wav'
'./drive/MyDrive/music_new_data/hqQvatf1RUY.wav' -> './music_new_data/hqQvatf1RUY.wav'
'./drive/MyDrive/music_new_data/Cd7JefC6-Zw.wav' -> './music_new_data/Cd7JefC6-Zw.wav'
'./drive/MyDrive/music_new_data/hqCJarP-nVI.wav' -> './music_new_data/hqCJarP-nVI.wav'
'./drive/MyDrive/music_new_data/-0SdAVK79lg.wav' -> './music_new_data/-0SdA

Interesting! Let's see what we have:
* The `audio` key maps to a dictionary that contains the path to the audio (`.wav`) file, the data already loaded as a `numpy` array, and the sampling rate.
* `is_audioset_eval` specifies whether the clip comes from the AudioSet eval split or the train split.
* The `caption` field has the description of the audio: "The low quality recording features a ballad song that contains sustained strings, mellow piano melody and soft female vocal singing over it. It sounds sad and soulful, like something you would hear at Sunday services."
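
As a quick sanity check on the `audio` entry, the clip length can be derived from the array and the sampling rate. This is a minimal sketch that assumes a mono, one-dimensional array; `describe_audio` is a hypothetical helper, and the dictionary below is a stand-in for `ds[0]['audio']` rather than real data.

```python
def describe_audio(audio):
    """Return (number of samples, duration in seconds) for a decoded audio dict."""
    n_samples = len(audio["array"])
    duration_s = n_samples / audio["sampling_rate"]
    return n_samples, duration_s


# Hypothetical stand-in for ds[0]['audio']: one second of silence at 44100 Hz.
audio = {"path": "example.wav", "array": [0.0] * 44100, "sampling_rate": 44100}
print(describe_audio(audio))
```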

## Interactive explorer

We can use [Gradio](https://gradio.app/), an open-source library for building ML demos, to build an interface in which the user selects the index of a sample and can then listen to the audio and read the caption. Gradio has a nice `Interface` class with three key components:
* `inputs`: the input components. In this case, we want a slider that represents the index.
* `outputs`: the output components. In this case, we want an audio player and a textarea.
* An inference function that receives values matching the `inputs` types and returns values matching the `outputs` types.

Let's see it in action!

In [None]:
import gradio as gr

def get_example(idx):
    # Cast to int in case the slider delivers the index as a float
    ex = ds[int(idx)]
    return ex['audio']['path'], ex['caption']

gr.Interface(
    get_example,
    inputs=gr.Slider(0, len(ds) - 1, value=0, step=1),
    outputs=['audio', 'textarea'],
    allow_flagging="never",
    live=True
).launch(share=True)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://8298a4bb-b1ac-482d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces




That's it! I hope you find this notebook useful! 

Hugs!🤗