<a href="https://colab.research.google.com/github/nmwiley808/csci198-Music-Intelligence-with-Deep-Learning-Senior-Project/blob/main/notebooks/01_multi-dataset_download_and_inspection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01 – Multi-Dataset Download & Inspection

## Description
This notebook downloads and inspects three benchmark music datasets:

1. GTZAN (single-label genre classification)
2. MTG-Jamendo (multi-label tagging, Top-50 tags subset)
3. MagnaTagATune (multi-label music tagging)

Datasets are obtained from stable official or Kaggle sources to ensure
reproducibility.

---
## Objectives
- Create structured raw dataset directories
- Download datasets from official sources
- Inspect audio sampling rates and durations
- Analyze metadata and label structure
- Detect corrupted files
- Detect duplicate tracks (GTZAN)
- Define a unified label schema

In [1]:
# Mount Drive & Navigate
from google.colab import drive
import os

drive.mount('/content/drive')

base_path = "/content/drive/MyDrive/csci198"
repo_name = "csci198-Music-Intelligence-with-Deep-Learning-Senior-Project"
full_path = os.path.join(base_path, repo_name)

%cd {full_path}
print("Current Directory:", os.getcwd())

Mounted at /content/drive
/content/drive/MyDrive/csci198/csci198-Music-Intelligence-with-Deep-Learning-Senior-Project
Current Directory: /content/drive/MyDrive/csci198/csci198-Music-Intelligence-with-Deep-Learning-Senior-Project


In [2]:
# Create Structured Data Folders
folders = [
    "data/raw/gtzan",
    "data/raw/mtg_jamendo",
    "data/raw/magnatagatune",
    "data/interim",
    "data/processed"
]

for folder in folders:
  os.makedirs(folder, exist_ok=True)

print("Structured data directories ready.")

Structured data directories ready.


# PART 1 - GTZAN (Kaggle)

----

In [8]:
# Upload Kaggle API File
from google.colab import files

# Assign the result so the file contents (API credentials) are not echoed in the cell output
uploaded = files.upload()

Saving kaggle.json to kaggle.json

In [9]:
# Configure Kaggle
import os

os.makedirs('/root/.kaggle', exist_ok=True)
!mv kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json

print("Kaggle API configured successfully.")

Kaggle API configured successfully.


In [10]:
# Verify Kaggle Works
!kaggle datasets list -s gtzan

ref                                                      title                                                 size  lastUpdated                 downloadCount  voteCount  usabilityRating  
-------------------------------------------------------  ---------------------------------------------  -----------  --------------------------  -------------  ---------  ---------------  
andradaolteanu/gtzan-dataset-music-genre-classification  GTZAN Dataset - Music Genre Classification      1301492495  2020-03-24 14:05:33.357000         112925        925  0.88235295       
carlthome/gtzan-genre-collection                         GTZAN Genre Collection                          1225975597  2019-10-30 07:38:06.633000           8557         76  0.8125           
lnicalo/gtzan-musicspeech-collection                     GTZAN music/speech collection                    296835448  2017-10-24 12:52:44.587000           2834         43  0.625            
mantasu/gtzan-stems                                    

In [11]:
# Download GTZAN
%cd data/raw/gtzan

!kaggle datasets download -d andradaolteanu/gtzan-dataset-music-genre-classification
!unzip -q gtzan-dataset-music-genre-classification.zip

%cd ../../../

/content/drive/MyDrive/csci198/csci198-Music-Intelligence-with-Deep-Learning-Senior-Project/data/raw/gtzan
Dataset URL: https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification
License(s): other
Downloading gtzan-dataset-music-genre-classification.zip to /content/drive/MyDrive/csci198/csci198-Music-Intelligence-with-Deep-Learning-Senior-Project/data/raw/gtzan
100% 1.21G/1.21G [00:15<00:00, 24.0MB/s]
100% 1.21G/1.21G [00:15<00:00, 86.3MB/s]
/content/drive/MyDrive/csci198/csci198-Music-Intelligence-with-Deep-Learning-Senior-Project


In [12]:
# Inspect GTZAN Audio
import librosa

gtzan_path = "data/raw/gtzan/Data/genres_original"
# Sort so the genre order is deterministic; os.listdir returns entries in arbitrary order
genres = sorted(os.listdir(gtzan_path))

print("Genres:", genres)
print("Number of Genres:", len(genres))

sample_file = os.path.join(
    gtzan_path,
    genres[0],
    os.listdir(os.path.join(gtzan_path, genres[0]))[0]
)

y, sr = librosa.load(sample_file, sr=None)

print("Sample Rate:", sr)
print("Duration (sec):", librosa.get_duration(y=y, sr=sr))

Genres: ['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock']
Number of Genres: 10
Sample Rate: 22050
Duration (sec): 30.013333333333332
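
The cell above inspects a single file. To cover the whole dataset, a header-only scan is usually enough. The following is a minimal sketch, assuming the files are uncompressed PCM WAV (which the standard-library `wave` module can read); `summarize_wav_headers` is a hypothetical helper, not part of the notebook.

```python
import os
import wave
from collections import Counter


def summarize_wav_headers(root):
    """Collect sample-rate counts and durations (seconds) for .wav files under root.

    Only the WAV header is read, so no audio is decoded.
    Files that cannot be opened are returned separately.
    """
    rates = Counter()
    durations = []
    unreadable = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(".wav"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with wave.open(path, "rb") as wav:
                    rate = wav.getframerate()
                    frames = wav.getnframes()
            except (wave.Error, EOFError, OSError):
                unreadable.append(path)
                continue
            rates[rate] += 1
            durations.append(frames / rate)
    return rates, durations, unreadable
```

In this notebook it could be called as `rates, durations, unreadable = summarize_wav_headers(gtzan_path)`, after which `rates` shows whether every clip shares one sample rate and `min(durations)` / `max(durations)` show the spread in clip length.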


In [13]:
# Corruption Check (GTZAN)
import soundfile as sf

corrupted_gtzan = []

for genre in genres:
    genre_path = os.path.join(gtzan_path, genre)
    for file in os.listdir(genre_path):
        file_path = os.path.join(genre_path, file)
        try:
            sf.read(file_path)
        except Exception:  # any read failure marks the file as corrupted
            corrupted_gtzan.append(file_path)

print("Corrupted GTZAN Files:", len(corrupted_gtzan))

Corrupted GTZAN Files: 1
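
The count alone does not say which file failed or why. A possible variation, sketched below, records the error message next to each path. `find_unreadable` is a hypothetical helper; the `reader` argument stands in for whatever loader is used (for example `sf.read` from the cell above).

```python
def find_unreadable(paths, reader):
    """Call reader(path) for each path and collect (path, reason) for every failure."""
    failures = []
    for path in paths:
        try:
            reader(path)
        except Exception as exc:  # keep the reason instead of discarding it
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures
```

Keeping the reason makes it easier to decide later whether a file should be dropped or can be repaired.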


In [14]:
# Duplicate Detection (GTZAN)
import hashlib

hashes = {}
duplicates = []

for genre in genres:
    genre_path = os.path.join(gtzan_path, genre)
    for file in os.listdir(genre_path):
        file_path = os.path.join(genre_path, file)
        with open(file_path, "rb") as f:
            file_hash = hashlib.md5(f.read()).hexdigest()
        if file_hash in hashes:
            duplicates.append((file_path, hashes[file_hash]))
        else:
            hashes[file_hash] = file_path

print("Duplicate GTZAN Files:", len(duplicates))

Duplicate GTZAN Files: 14
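
Hashing file bytes only finds exact copies; a re-encoded or trimmed version of the same recording would not match. Within that limit, the sketch below groups identical files together and hashes in chunks so large files are not loaded into memory at once. The helper names `file_digest` and `group_identical_files` are assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical_files(root):
    """Return a list of groups; each group holds paths with byte-identical content."""
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            groups[file_digest(path)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Calling `group_identical_files(gtzan_path)` would list every set of matching files, which also reveals duplicates that cross genre folders.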


# PART 2 - MTG Jamendo Metadata

---

In [23]:
# Download Jamendo Metadata
# Use full_path (defined in the first cell) and the mtg_jamendo folder created earlier

%cd {full_path}/data/raw/mtg_jamendo
!pwd

!wget -O autotagging_top50tags-train.tsv \
https://raw.githubusercontent.com/MTG/mtg-jamendo-dataset/master/data/splits/split-0/autotagging_top50tags-train.tsv

%cd {full_path}

--2026-02-18 22:35:53--  https://raw.githubusercontent.com/MTG/mtg-jamendo-dataset/master/data/splits/split-0/autotagging_top50tags-train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3742433 (3.6M) [text/plain]
Saving to: ‘autotagging_top50tags-train.tsv’


2026-02-18 22:35:53 (49.8 MB/s) - ‘autotagging_top50tags-train.tsv’ saved [3742433/3742433]


In [None]:
# Inspect Jamendo Metadata

import pandas as pd

jamendo_metadata = pd.read_csv(
    "data/raw/mtg_jamendo/autotagging_top50tags-train.tsv",
    sep="\t"
)

jamendo_metadata.head()
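
If the TSV has a different number of fields per row, a fixed-column reader may reject it. The sketch below assumes the layout described by the header `TRACK_ID, ARTIST_ID, ALBUM_ID, PATH, DURATION, TAGS`, with every field after the fifth being one tag; that layout is an assumption here and should be checked against the downloaded file. `read_jamendo_tsv` is a hypothetical helper.

```python
import csv
import io


def read_jamendo_tsv(handle):
    """Parse rows with five fixed columns followed by any number of tag columns."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        rows.append({
            "track_id": fields[0],
            "path": fields[3],
            "duration": float(fields[4]),
            "tags": fields[5:],  # everything after DURATION is treated as a tag
        })
    return rows


# Made-up rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "TRACK_ID\tARTIST_ID\tALBUM_ID\tPATH\tDURATION\tTAGS\n"
    "track_1\tartist_1\talbum_1\t01/1.mp3\t12.5\tgenre---rock\tinstrument---guitar\n"
    "track_2\tartist_2\talbum_2\t02/2.mp3\t30.0\tmood/theme---happy\n"
)
parsed = read_jamendo_tsv(sample)
```

On the real file, the same function could be applied with `open("data/raw/mtg_jamendo/autotagging_top50tags-train.tsv", newline="")`, and the result passed to `pd.DataFrame(parsed)` for inspection.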

# PART 3 - MagnaTagATune Metadata

---

In [None]:
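# Download MagnaTagATune Annotations
# Use full_path (defined in the first cell) so the cell works from any working directory
%cd {full_path}/data/raw/magnatagatune

!wget https://raw.githubusercontent.com/keunwoochoi/magnatagatune-dataset/master/annotations_final.csv

%cd {full_path}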

In [None]:
# Inspect MagnaTagATune Metadata
magna_metadata = pd.read_csv(
    "data/raw/magnatagatune/annotations_final.csv"
)

magna_metadata.head()
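
To analyze the label structure beyond the first rows, tag frequencies are a useful next step. The sketch below assumes the annotation file has one 0/1 column per tag plus the identifier columns `clip_id` and `mp3_path`; both the column names and the delimiter are assumptions to verify against the downloaded file. `count_positive_tags` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_positive_tags(handle, delimiter=",", id_columns=("clip_id", "mp3_path")):
    """Count how many rows have the value "1" in each non-identifier column."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    counts = Counter()
    for row in reader:
        for column, value in row.items():
            if column not in id_columns and value == "1":
                counts[column] += 1
    return counts


# Made-up rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "clip_id,guitar,piano,mp3_path\n"
    "1,1,0,a/one.mp3\n"
    "2,1,1,b/two.mp3\n"
)
tag_counts = count_positive_tags(sample)
```

Sorting the result with `tag_counts.most_common(20)` would show which tags dominate and how long the tail of rare tags is.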

In [None]:
# Unified Label Schema
unified_schema = {
    "genre": None,
    "mood": [],
    "instrument": [],
    "other_tags": []
}

unified_schema
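
One way the schema could be populated is sketched below. It assumes GTZAN contributes only a genre, and that multi-label tags may carry a `category---value` prefix such as `genre---`, `instrument---`, or `mood/theme---`; those prefixes and both helper names are assumptions for illustration. Since `genre` holds a single value, the first genre tag wins and any further genre tags go to `other_tags`.

```python
def empty_record():
    """Return a fresh record so list fields are never shared between tracks."""
    return {"genre": None, "mood": [], "instrument": [], "other_tags": []}


def from_gtzan(genre):
    """Build a record for a single-label GTZAN track."""
    record = empty_record()
    record["genre"] = genre
    return record


def from_prefixed_tags(tags, separator="---"):
    """Build a record from tags written as 'category---value' (or unprefixed)."""
    record = empty_record()
    for tag in tags:
        category, _, value = tag.partition(separator)
        if not value:
            record["other_tags"].append(tag)  # no category prefix present
        elif category == "genre" and record["genre"] is None:
            record["genre"] = value
        elif category == "mood/theme":
            record["mood"].append(value)
        elif category == "instrument":
            record["instrument"].append(value)
        else:
            record["other_tags"].append(value)
    return record
```

Building each record from `empty_record()` rather than copying `unified_schema` avoids every track sharing the same list objects.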