# Data Cleaning

(For convenience, the cleaned dataset has been included as dataset.zip.)

The purpose of this notebook is to clean our original data source, MusicBench (https://huggingface.co/datasets/amaai-lab/MusicBench). The full dataset was over 24 GB, which made it difficult to transfer, store (it is too large for GitHub, Drive, etc.), and work with: operations on the data each took over an hour. This notebook therefore filters the original dataset in place to reduce its size. We were going to apply the same reduction in our original data cleaning pipeline anyway; doing it up front simply reduces the storage footprint. Because the notebook deletes files, it should be run ONCE, after downloading the data from its source. The original dataset must be in the following directory layout:

\dataset\
&ensp;&ensp;&ensp;\data\
&ensp;&ensp;&ensp;\data_aug2\
&ensp;&ensp;&ensp;MusicBench_train.json

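After running the code below, the original dataset will have been replaced by the reduced dataset used in the project: the WAV files of removed instances are deleted, and MusicBench_train.json is replaced by metadata.json.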
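
Because the cleaning step deletes files in place, it may be worth confirming the layout before running it. The following is a minimal sketch, not part of the original pipeline; `missing_entries`, `EXPECTED_DIRS`, and `EXPECTED_FILES` are hypothetical names introduced here, and the expected entries are copied from the layout listed above.

```python
from pathlib import Path

# Entries expected under the dataset root, copied from the layout above.
EXPECTED_DIRS = ["data", "data_aug2"]
EXPECTED_FILES = ["MusicBench_train.json"]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# An empty list means the layout matches what the notebook expects.
print(missing_entries("dataset"))
```

If the printed list is not empty, the cleaning cell would fail or report missing files, so it is safer to fix the layout first.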

In [12]:
import pandas as pd
import numpy as np
import pathlib

THRESHOLD = 0.85

df = pd.read_json("dataset/MusicBench_train.json", lines=True)
df = df.drop_duplicates(subset=["location"])
# Remove irrelevant features
df = df[["location", "key", "keyprob"]].copy()
# keyprob is stored as a sequence; keep only its first element
df["keyprob"] = df["keyprob"].map(lambda x: x[0])
# Remove instances below the probability threshold
df_keep = df[df["keyprob"] > THRESHOLD]
df_trash = df[df["keyprob"] <= THRESHOLD]
# Delete all of the files associated with the removed instances
locations = df_trash["location"].to_numpy()
not_found_count = 0
for loc in locations:
    try:
        pathlib.Path("dataset/" + loc).unlink()
    except FileNotFoundError:
        print("Not Found: dataset/" + loc)
        not_found_count += 1
print(f"{not_found_count} WAV files not found")
# Write the reduced metadata to a new file, then delete the old JSON metadata file
df_keep = df_keep[["location", "key"]]
df_keep.to_json("dataset/metadata.json", orient="records", lines=True)
pathlib.Path("dataset/MusicBench_train.json").unlink()

0 WAV files not found
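
To see how the filter in the cell above splits the data without deleting anything, the same steps can be run on a small in-memory table. This is only an illustrative sketch: the rows are made up, the file paths are hypothetical, and it assumes `keyprob` holds a one-element list, as the `x[0]` in the cell above suggests.

```python
import pandas as pd

THRESHOLD = 0.85  # same value as in the cleaning cell

# Made-up rows for illustration only; paths and values are hypothetical.
toy = pd.DataFrame({
    "location": ["data/a.wav", "data/b.wav", "data/c.wav", "data/a.wav"],
    "key": ["C", "G", "D", "C"],
    "keyprob": [[0.91], [0.85], [0.40], [0.91]],
})

# Same steps as the cleaning cell, minus the file deletion.
toy = toy.drop_duplicates(subset=["location"]).copy()
toy["keyprob"] = toy["keyprob"].map(lambda x: x[0])

keep = toy[toy["keyprob"] > THRESHOLD]
trash = toy[toy["keyprob"] <= THRESHOLD]

print(len(keep), "kept,", len(trash), "would be deleted")
# prints: 1 kept, 2 would be deleted
```

Note that the comparison is strict: a row whose probability equals the threshold exactly (0.85 here) lands in the deleted group.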


In [14]:
# Double-check that every file referenced in the metadata exists
df2 = pd.read_json("dataset/metadata.json", lines=True)
not_found = 0
locations = df2["location"].to_numpy()
for loc in locations:
    if not pathlib.Path("dataset/" + loc).exists():
        not_found += 1
print(not_found)

0
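
A count alone does not say which files are absent. As a possible variant of the check above, the sketch below collects the missing paths instead of counting them. `find_missing` is a hypothetical helper introduced here, not part of the original notebook, and the `dataset` root is taken from the cells above.

```python
from pathlib import Path


def find_missing(locations, root="dataset"):
    """Return the locations whose files do not exist under root."""
    root = Path(root)
    # exists() has to be called; the bare attribute is always truthy.
    return [loc for loc in locations if not (root / loc).exists()]


# Example usage with the metadata loaded in the cell above:
# missing = find_missing(df2["location"])
# print(len(missing), missing[:10])
```

Returning the paths rather than a number makes it easier to re-download or inspect specific files if the check ever reports a problem.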
