# Data Cleaning

The purpose of this notebook is to clean our original data source (https://huggingface.co/datasets/amaai-lab/MusicBench) and extract features from the audio files in a way that builds a manageable dataset. Our full dataset was over 24GB, which made it impractical to transmit, store (too large for GitHub, Drive, etc.), or work with; each operation on the data took over an hour. This notebook therefore modifies the original dataset in place to reduce its size. We were going to reduce the size in our original data cleaning pipeline anyway; we are just doing it up front to reduce the storage size.

After removing a high volume of low-confidence instances, the remaining instances go through a feature extraction process. First, this process builds a feature set from the frequency profile of each audio file (via a Fast Fourier Transform). The result is a massive set with approximately 80,000 features (frequency bins) for each instance. At that point our dataset is approximately 6GB, which is still too large to run through machine learning algorithms on our computers. To reduce the size during this pre-processing phase, we keep the n (1000) features with the highest variance in the training data, assuming that these are the frequency bins which best explain our data.

The resulting smaller dataset is stored in train.json and test.json for efficient retrieval in main.ipynb. Thus, this notebook should be run ONCE after downloading the data from its source. The original dataset must be in the following directory format:

\dataset\
&ensp;&ensp;&ensp;\data\
&ensp;&ensp;&ensp;\data_aug2\
&ensp;&ensp;&ensp;MusicBench_train.json\
&ensp;&ensp;&ensp;MusicBench_test_A.json


After running the following code, the original dataset will have been replaced by the reduced dataset used in the project, and train.json and test.json will hold the data to be used by the algorithms.
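
Because the cleaning step deletes files in place, it may be worth confirming the directory layout before running anything. The snippet below is a minimal sketch of such a check. The helper name `missing_entries` and the `EXPECTED` list are our own illustration and are not part of the notebook; the entry names are taken from the layout listed above.

```python
from pathlib import Path

# Entries the notebook expects under dataset/ (copied from the layout above).
EXPECTED = ["data", "data_aug2", "MusicBench_train.json", "MusicBench_test_A.json"]


def missing_entries(root="dataset"):
    """Hypothetical helper: return the expected entries absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]


missing = missing_entries()
if missing:
    print("Missing from dataset/:", ", ".join(missing))
else:
    print("Dataset layout looks complete.")
```

If anything is reported missing, the paths used in the cells below ("dataset/" + location) will most likely not resolve.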

In [11]:
import pandas as pd
import numpy as np
import pathlib
from scipy.io import wavfile
from scipy.fft import rfft, rfftfreq
from sklearn.preprocessing import StandardScaler

In [12]:
def drop_low_confidence_instances(old_metadata_path, new_metadata_path, threshold):
    """
    For the given metadata json file, remove instances (record and file)
    that are below the given threshold to reduce dataset size
    """
    df = pd.read_json(old_metadata_path, lines=True)
    df = df.drop_duplicates(subset=["location"])
    # Remove irrelevant features
    df = df[["location", "key", "keyprob"]]
    df["keyprob"] = df["keyprob"].map(lambda x: x[0])
    # Remove instances below the probability threshold
    df_keep = df[df["keyprob"] > threshold]
    df_trash = df[df["keyprob"] <= threshold]
    # Delete all of the files associated with the removed instances
    locations = df_trash["location"].to_numpy()
    not_found_count = 0
    for loc in locations:
        try:
            pathlib.Path("dataset/" + loc).unlink()
        except FileNotFoundError:
            print("Not Found: dataset/" + loc)
            not_found_count += 1
    print(str(not_found_count) + " WAV files not found for " + old_metadata_path)
    # Overwrite the old json metadata file
    df_keep = df_keep[["location", "key"]]
    df_keep.to_json(new_metadata_path, orient="records", lines=True)
    pathlib.Path(old_metadata_path).unlink()
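
The thresholding logic above can be tried on a small in-memory table before it is allowed to delete anything. This is only a sketch: the file names and confidence values are invented, and `split_by_confidence` is a hypothetical helper that mirrors the `keyprob` comparison without touching any files.

```python
import pandas as pd

# Toy metadata in the same shape as the records used above: "keyprob" holds
# a list whose first element is the key confidence. All values are invented.
toy = pd.DataFrame({
    "location": ["data/a.wav", "data/b.wav", "data/c.wav"],
    "key": ["C", "G", "A"],
    "keyprob": [[0.95], [0.70], [0.88]],
})


def split_by_confidence(df, threshold):
    """Hypothetical helper: return (kept, dropped) rows; no files are deleted."""
    conf = df["keyprob"].map(lambda x: x[0])
    keep_mask = conf > threshold
    return df[keep_mask], df[~keep_mask]


kept, dropped = split_by_confidence(toy, threshold=0.88)
print(kept["location"].tolist())     # rows strictly above the threshold
print(dropped["location"].tolist())  # rows at or below the threshold
```

Note that the comparison is strict, so an instance whose confidence equals the threshold is dropped.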

In [13]:
def fft_analysis(locations, longest):
    """
    Perform FFT analysis for a given set of audio files.
    Return a 2D numpy array with one row of FFT magnitudes per instance.
    """
    # rfft of a real signal of length `longest` yields longest // 2 + 1 bins
    n_bins = longest // 2 + 1
    # Preallocate the result instead of repeatedly appending to an array
    instances = np.empty((len(locations), n_bins))
    for i, loc in enumerate(locations):
        path = "dataset/" + loc
        _, data = wavfile.read(path)
        # Pad with trailing zeros to match the length of the longest instance
        # (assumes mono audio, i.e. a 1D array of samples)
        padded = np.concatenate((data, np.zeros(longest - len(data))))
        # Store the rounded magnitude spectrum as this instance's features
        instances[i] = np.round(np.abs(rfft(padded)), 4)

    return instances
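
To see what the zero-padding and FFT step produces, here is a small sketch on a synthetic signal rather than a real WAV file. The sample rate, tone frequency, and clip lengths are assumptions chosen for illustration; only `rfft` and `rfftfreq` are taken from the cell above.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

rate = 8000                          # assumed sample rate in Hz (illustrative)
t = np.arange(rate // 2) / rate      # half a second of sample times
tone = np.sin(2 * np.pi * 440 * t)   # synthetic 440 Hz sine wave

longest = rate                       # pretend the longest clip is one second
# Pad the shorter clip with trailing zeros so every clip has `longest` samples
padded = np.concatenate((tone, np.zeros(longest - len(tone))))

spectrum = np.abs(rfft(padded))       # magnitude per frequency bin
freqs = rfftfreq(longest, 1 / rate)   # bin centre frequencies in Hz

peak_hz = freqs[np.argmax(spectrum)]
print(len(spectrum), peak_hz)
```

Padding every clip to the same length is what gives all instances the same number of frequency bins (`longest // 2 + 1`), so they can be stacked into one 2D array.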

In [14]:
def features_from_audio(train, test, n_audio_features) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    'Join' the existing dataframes with the audio files via
    mapping to frequency features. Perform feature selection to 
    reduce the size of the resulting dataset.
    """
    # Determine the size for all instances (largest number of samples)
    longest = 0
    rate = 0
    train_locations = train["location"].array.tolist()
    test_locations = test["location"].array.tolist()
    locations = train_locations + test_locations
    for loc in locations:
        path = "dataset/" + loc
        rate, data = wavfile.read(path)
        length = len(data)
        if length > longest:
            longest = length

    freq_bins = rfftfreq(longest, 1 / rate)

    train_instances = fft_analysis(train_locations, longest)
    test_instances = fft_analysis(test_locations, longest)

    # Fit a StandardScaler on the training data only to obtain per-feature
    # variances (var_); the instances themselves are left unscaled
    train_scalar = StandardScaler()
    train_scalar.fit(train_instances)
    # Select n audio features based on their variance in the training data
    highest_var_indices = np.argpartition(train_scalar.var_, -n_audio_features)[-n_audio_features:]
    train_instances = train_instances[:, highest_var_indices]
    test_instances = test_instances[:, highest_var_indices]
    highest_var_features_names = freq_bins[highest_var_indices]
    # Place in DataFrames
    train_df = pd.DataFrame(data=train_instances, columns=highest_var_features_names)
    train_df = train_df.sort_index(axis=1)
    test_df = pd.DataFrame(data=test_instances, columns=highest_var_features_names)
    test_df = test_df.sort_index(axis=1)

    # Add the target by position (independent of the input index) and return the dataframes
    train_df["target"] = train["key"].to_numpy()
    test_df["target"] = test["key"].to_numpy()
    return train_df, test_df
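
The variance-based selection can be illustrated on a small random matrix. This is a sketch under our own assumptions: the data is synthetic, and `np.var` stands in for the `var_` attribute of the fitted `StandardScaler` (both compute the population variance per column).

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 instances, 6 features with deliberately different spreads (synthetic)
scales = np.array([0.1, 5.0, 1.0, 10.0, 0.5, 2.0])
X = rng.normal(size=(50, 6)) * scales

n = 3
variances = X.var(axis=0)
# argpartition places the indices of the n largest variances in the
# last n positions, in no particular order
top = np.argpartition(variances, -n)[-n:]
top = np.sort(top)        # restore the original column order
X_reduced = X[:, top]     # keep only the selected columns
print(top, X_reduced.shape)
```

`np.argpartition` avoids a full sort, which matters little here but helps when choosing 1000 columns out of roughly 80,000. The selected indices are computed from the training data only and then applied to both sets, so the train and test columns stay aligned.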

In [15]:
N_FEATURES = 1000

drop_low_confidence_instances(old_metadata_path="dataset/MusicBench_train.json",
                              new_metadata_path="dataset/metadata_train.json",
                              threshold=0.88)
drop_low_confidence_instances(old_metadata_path="dataset/MusicBench_test_A.json",
                              new_metadata_path="dataset/metadata_test.json",
                              threshold=0.80)

train_df = pd.read_json("dataset/metadata_train.json", lines=True)
test_df = pd.read_json("dataset/metadata_test.json", lines=True)

train_df, test_df = features_from_audio(train_df, test_df, N_FEATURES)

print(train_df.head())
print(train_df.shape)
train_df.to_json("train.json", orient="records", lines=True)
print(test_df.head())
print(test_df.shape)
test_df.to_json("test.json", orient="records", lines=True)


0 WAV files not found for dataset/MusicBench_train.json
0 WAV files not found for dataset/MusicBench_test_A.json
[[4.29530000e+04 5.25835362e+04 8.07997420e+04 ... 7.79136400e+02
  9.28917800e+02 1.07300000e+03]
 [2.10534000e+05 1.19505484e+05 2.01212752e+05 ... 1.59643620e+03
  1.34979440e+03 1.59400000e+03]
 [5.80058000e+05 2.10840986e+05 3.15993333e+05 ... 4.26577450e+03
  8.35656920e+03 1.01400000e+03]
 ...
 [5.42400000e+03 7.84261091e+04 3.96970382e+04 ... 4.15784700e+02
  4.37823000e+02 3.64000000e+02]
 [1.75521000e+05 6.71018655e+04 2.07954143e+05 ... 8.95457100e+02
  7.68177900e+02 6.89000000e+02]
 [8.66000000e+03 8.78701058e+04 4.80228784e+04 ... 3.17799600e+02
  5.16470500e+02 4.90000000e+02]]
       0.0      36.6      36.7      38.0      38.2  38.300000000000004  \
0   3.0074   81.2963   83.7169  119.6888  115.8881            413.4768   
1  13.6523  361.5547  119.7225  170.9554   84.0531            273.8909   
2   7.1155   80.2869  169.7490  149.4092  132.8215            198