# Dataset Preparation File

To replicate the test performed in my Master's report, this file contains the code needed to filter, format, and export the AudioCaps dataset. This Jupyter Notebook can be used for any dataset split across multiple parquet files, provided it has the `caption`, `audio.path`, and `audio.bytes` columns used below and the files are placed in a /data/ folder at the same directory level as this notebook.

Before running this file, ensure that the Python package "pandas" is installed together with a parquet engine (pyarrow or fastparquet), and that the AudioCaps parquet files have been downloaded from https://huggingface.co/datasets/OpenSound/AudioCaps?clone=true and placed inside the repository's data folder. The prepareDataset.ipynb file should sit in the same folder as the /data/ folder.
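The prerequisites above can be checked before running any cell. The following is a minimal sketch and not part of the original notebook: the helper name `missing_prerequisites` is hypothetical, and it assumes the default `data` folder name used throughout this file.

```python
import importlib.util
import os


def missing_prerequisites(data_dir="data"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if importlib.util.find_spec("pandas") is None:
        problems.append("pandas is not installed")
    # pandas delegates parquet reading and writing to one of these engines.
    engines = ("pyarrow", "fastparquet")
    if not any(importlib.util.find_spec(name) for name in engines):
        problems.append("no parquet engine (pyarrow or fastparquet) is installed")
    if not os.path.isdir(data_dir):
        problems.append(f"folder {data_dir!r} does not exist")
    elif not any(name.endswith(".parquet") for name in os.listdir(data_dir)):
        problems.append(f"folder {data_dir!r} contains no .parquet files")
    return problems


print(missing_prerequisites() or "ready")
```

If the list is non-empty, each entry names one thing to fix before continuing.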

In [30]:
import os
import pandas as pd
import shutil

First, we check that the dataset can be read and count how many captions in the first file mention a dog or a cat.

In [31]:
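df = pd.read_parquet('data/test-00000-of-00041.parquet')

# Keep rows whose caption contains "dog" or "cat" (case-sensitive substring
# match, so longer words such as "dogs" or "category" also match)
filtered_df = df[df["caption"].str.contains("Dog") |
                 df["caption"].str.contains("dog") |
                 df["caption"].str.contains("Cat") |
                 df["caption"].str.contains("cat")]

# Compare the shape before and after filtering
print(df.shape)
print(filtered_df.shape)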

(108, 7)
(5, 7)


If this returns sensible values (e.g. (108, 7) and (5, 7)), the next block applies the same filter to every file in the data folder.

In [32]:
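os.makedirs("filtered", exist_ok=True)

for filename in os.listdir("data"):
    # Skip anything that is not a parquet file (e.g. hidden or checkpoint files)
    if not filename.endswith(".parquet"):
        continue
    df = pd.read_parquet(os.path.join("data", filename))
    filtered_df = df[df["caption"].str.contains("Dog") |
                     df["caption"].str.contains("dog") |
                     df["caption"].str.contains("Cat") |
                     df["caption"].str.contains("cat")]
    # Write the filtered rows under the same file name in /filtered/
    filtered_df.to_parquet(os.path.join("filtered", filename))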
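The filter used in the cells above is a case-sensitive substring match, so it also keeps captions where "cat" appears inside another word and drops captions written in capitals. The sketch below contrasts that rule with a stricter whole-word pattern. The pattern and both function names are illustrative assumptions, not what was used for the report, and switching to the pattern would change which rows are selected.

```python
import re

# Hypothetical stricter pattern: whole words only, singular or plural, any case.
WHOLE_WORD_PATTERN = r"\b(?:dogs?|cats?)\b"


def whole_word_match(caption):
    """True when the caption contains dog(s) or cat(s) as a separate word."""
    return re.search(WHOLE_WORD_PATTERN, caption, flags=re.IGNORECASE) is not None


def substring_match(caption):
    """The rule used by the notebook: case-sensitive substring search."""
    return any(word in caption for word in ("Dog", "dog", "Cat", "cat"))


for caption in ["A dog barks", "Leaves scatter across the yard", "CATS meowing"]:
    print(caption, substring_match(caption), whole_word_match(caption))
```

In pandas, the same pattern string could be passed to `Series.str.contains` with `case=False`, which accepts regular expressions by default.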

The filtered files are then combined into one dataset. Because of the memory this consumes, only a few files are combined at a time. The next block merges the files in the /filtered/ folder in groups of five until at most five remain, then merges those into a single combined.parquet file.

In [33]:
file_count = len(os.listdir("filtered"))

# Merge the files in groups of five until at most five remain
while file_count > 5:
    file_list = os.listdir("filtered")
    groups_count = file_count // 5
    for i in range(groups_count):
        # Paths of the five files that make up group i
        group = [os.path.join("filtered", name) for name in file_list[i*5 : i*5 + 5]]
        dfs = [pd.read_parquet(path) for path in group]

        merged_df = pd.concat(dfs).reset_index(drop=True)
        merged_df.to_parquet(os.path.join("filtered", "combined-{:05d}-of-{:05d}.parquet".format(i, groups_count)))

        # Delete the source files now that they have been merged
        for path in group:
            os.remove(path)

    file_count = len(os.listdir("filtered"))

# Merge the remaining files into the final dataset
file_list = os.listdir("filtered")
dfs = [pd.read_parquet(os.path.join("filtered", name)) for name in file_list]

finalCombination_df = pd.concat(dfs).reset_index(drop=True)
finalCombination_df.to_parquet("filtered/combined.parquet")

# Delete the intermediate files, keeping only combined.parquet
for name in file_list:
    if name != "combined.parquet":
        os.remove(os.path.join("filtered", name))
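To see how many passes the grouping loop above needs, the file counts can be traced without touching any files. This is a small sketch of the same arithmetic. The function name `file_counts_per_pass` is hypothetical, and the starting value of 41 assumes the /filtered/ folder holds only the 41 shards implied by the `-of-00041` file name.

```python
def file_counts_per_pass(file_count, group_size=5):
    """Return the number of files present before the loop and after each pass."""
    counts = [file_count]
    while file_count > group_size:
        groups = file_count // group_size
        # Each full group of `group_size` files is replaced by one combined file;
        # leftover files that do not fill a group carry over unchanged.
        file_count = file_count - groups * group_size + groups
        counts.append(file_count)
    return counts


print(file_counts_per_pass(41))  # [41, 9, 5]
```

With 41 files the loop runs twice: eight groups of five leave 9 files, and one further group leaves 5, which the final step merges into combined.parquet.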

Finally, the combined dataset is reduced to the required columns, exported as a metadata.csv file plus .wav files, and zipped for use in the interface. All three steps are performed in the following block.

In [None]:
EXPORT_FOLDER_NAME = "exported-dataset"
os.makedirs(os.path.join(EXPORT_FOLDER_NAME, "data"), exist_ok=True)

# Format dataset
formatted_df = pd.DataFrame({
    'audio': './data/' + finalCombination_df['audio.path'],
    'caption': finalCombination_df['caption']
})

# Export as metadata.csv
formatted_df.to_csv(os.path.join(EXPORT_FOLDER_NAME, "metadata.csv"), index=False)

# Export .wavs
for index, row in finalCombination_df.iterrows():
    with open(os.path.join(EXPORT_FOLDER_NAME, "data", row['audio.path']), 'wb') as f:
        f.write(row['audio.bytes'])

# Zip
shutil.make_archive(EXPORT_FOLDER_NAME, "zip", EXPORT_FOLDER_NAME)

This zip file should be uploaded to the interface for testing.
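Before uploading, the archive can be checked to confirm that every audio path listed in metadata.csv has a matching file inside the zip. The following is a minimal sketch using only the standard library. The function name `missing_audio_entries` is hypothetical, and it assumes the layout produced above: metadata.csv at the root of the archive with an `audio` column of the form ./data/<file>.

```python
import csv
import io
import posixpath
import zipfile


def missing_audio_entries(zip_path):
    """Return metadata.csv audio paths that have no matching file in the zip."""
    with zipfile.ZipFile(zip_path) as archive:
        members = set(archive.namelist())
        with archive.open("metadata.csv") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text)
            # "./data/x.wav" in the CSV corresponds to "data/x.wav" in the zip.
            return [
                row["audio"]
                for row in reader
                if posixpath.normpath(row["audio"]) not in members
            ]
```

Calling `missing_audio_entries("exported-dataset.zip")` should return an empty list when the export is complete.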