# Music Dataset Creator for AI Model Training

This notebook loads MP3 files from a local folder, extracts per-track audio features with librosa, and saves them as a structured dataset for training AI music models.

## Features:
- Load MP3 files from a local directory
- Extract audio features (tempo, spectral features, MFCCs, etc.)
- Create a structured dataset
- Save dataset for AI model training

## Step 1: Import Required Libraries

In [None]:
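# Install required libraries (uncomment if needed)
# Note: MP3 decoding may also need a system backend such as ffmpeg, depending on your librosa/soundfile versions
# !pip install librosa pandas numpy matplotlib scipy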

import os
import librosa
import librosa.display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## Step 2: Set Up Folder Path for MP3 Files

In [None]:
# Set the path to your folder containing MP3 files
# Change this path to point to your local music folder
MUSIC_FOLDER_PATH = "./music_files"  # Update this path as needed

# Create the folder if it doesn't exist
os.makedirs(MUSIC_FOLDER_PATH, exist_ok=True)

print(f"Looking for MP3 files in: {MUSIC_FOLDER_PATH}")
print(f"Folder exists: {os.path.exists(MUSIC_FOLDER_PATH)}")

## Step 3: Load MP3 Files from Folder

In [None]:
def get_mp3_files(folder_path):
    """
    Get all MP3 files from the specified folder.
    
    Args:
        folder_path (str): Path to folder containing MP3 files
        
    Returns:
        list: List of MP3 file paths
    """
    mp3_files = []
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.mp3'):
                mp3_files.append(os.path.join(root, file))
    return mp3_files

# Get all MP3 files
mp3_files = get_mp3_files(MUSIC_FOLDER_PATH)
print(f"Found {len(mp3_files)} MP3 files")

if len(mp3_files) > 0:
    print("\nFirst few files:")
    for file in mp3_files[:5]:
        print(f"  - {os.path.basename(file)}")
else:
    print(f"\nNo MP3 files found in {MUSIC_FOLDER_PATH}")
    print("Please add MP3 files to the folder and run again.")
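
The cell above walks the folder with `os.walk`. As an alternative, here is a minimal sketch of the same search using `pathlib`, which the notebook imports in Step 1. The function name `find_mp3_files` is hypothetical and not part of the original notebook; sorting the result is an added assumption so that the dataset rows come out in a stable order between runs.

```python
from pathlib import Path

def find_mp3_files(folder):
    """Return all .mp3 paths under `folder` (recursively), sorted for a stable order."""
    root = Path(folder)
    # rglob("*") descends into subfolders; the suffix check is case-insensitive
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".mp3"
    )
```

Calling `find_mp3_files(MUSIC_FOLDER_PATH)` should find the same files as `get_mp3_files`, though possibly in a different order.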

## Step 4: Extract Audio Features

We'll extract various audio features that are useful for AI music generation:
- **Tempo**: Beats per minute
- **MFCCs**: Mel-frequency cepstral coefficients (timbre features)
- **Spectral Centroid**: Brightness of the sound
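- **Spectral Rolloff**: Frequency below which a set share of the spectral energy lies (85% by default in librosa)
- **Zero Crossing Rate**: Rate of sign changes in the signal
- **Chroma Features**: Pitch class distribution
- **RMS Energy**: Average signal level, a rough proxy for loudness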
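
To make a few of these definitions concrete, the sketch below computes zero crossing rate, spectral centroid, and spectral rolloff with plain NumPy on a synthetic 440 Hz tone. It is a simplification under stated assumptions: each function treats the whole signal as a single frame, whereas librosa works frame by frame, so the numbers will not match librosa's output exactly. The function names are illustrative only.

```python
import numpy as np

sr = 22050                        # sample rate in Hz (librosa's default when loading)
t = np.arange(sr) / sr            # one second of sample times
y = np.sin(2 * np.pi * 440 * t)   # pure 440 Hz tone

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.signbit(signal)
    return np.mean(signs[1:] != signs[:-1])

def spectral_centroid(signal, sr):
    """Magnitude-weighted mean frequency of the whole signal, in Hz."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
    return np.sum(freqs * mag) / np.sum(mag)

def spectral_rolloff(signal, sr, roll_percent=0.85):
    """Lowest frequency at which the cumulative magnitude reaches `roll_percent` of the total."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
    cumulative = np.cumsum(mag)
    idx = np.searchsorted(cumulative, roll_percent * cumulative[-1])
    return freqs[idx]

print(zero_crossing_rate(y), spectral_centroid(y, sr), spectral_rolloff(y, sr))
```

For this pure tone, the centroid and rolloff should both land near 440 Hz, and the zero crossing rate near `2 * 440 / sr`, because a sine wave crosses zero twice per cycle.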

In [None]:
def extract_features(file_path, duration=30):
    """
    Extract audio features from an MP3 file.
    
    Args:
        file_path (str): Path to MP3 file
        duration (int): Seconds of audio to analyze, counted from the start of the file (default: 30)
        
    Returns:
        dict: Dictionary of extracted features
    """
    try:
        # Load audio file
        y, sr = librosa.load(file_path, duration=duration)
        
        # Extract features
        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
        
        # MFCCs (Mel-frequency cepstral coefficients)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfcc_mean = np.mean(mfccs, axis=1)
        mfcc_std = np.std(mfccs, axis=1)
        
        # Spectral features
        spectral_centroid = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
        spectral_rolloff = np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr))
        
        # Zero crossing rate
        zero_crossing_rate = np.mean(librosa.feature.zero_crossing_rate(y))
        
        # Chroma features
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)
        chroma_mean = np.mean(chroma, axis=1)
        
        # Root Mean Square Energy
        rms = np.mean(librosa.feature.rms(y=y))
        
        # Create feature dictionary
        features = {
            'filename': os.path.basename(file_path),
            'filepath': file_path,
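            # Length of the analyzed clip (at most `duration` seconds), not of the full track
            'duration': librosa.get_duration(y=y, sr=sr),
            # beat_track may return tempo as a 1-element array in newer librosa versions
            'tempo': float(np.atleast_1d(tempo)[0]),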
            'spectral_centroid': float(spectral_centroid),
            'spectral_rolloff': float(spectral_rolloff),
            'zero_crossing_rate': float(zero_crossing_rate),
            'rms_energy': float(rms),
        }
        
        # Add MFCC features
        for i in range(13):
            features[f'mfcc_{i}_mean'] = float(mfcc_mean[i])
            features[f'mfcc_{i}_std'] = float(mfcc_std[i])
        
        # Add chroma features
        for i in range(12):
            features[f'chroma_{i}'] = float(chroma_mean[i])
        
        return features
    
    except Exception as e:
        print(f"Error processing {os.path.basename(file_path)}: {str(e)}")
        return None

print("Feature extraction function defined.")
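
`extract_features` reduces each frame-level matrix (for example, 13 MFCC coefficients by however many frames) to one mean and one standard deviation per row. The sketch below isolates that step on a random stand-in matrix, so it runs without any audio. The helper name `summarize_frames` and the random data are assumptions for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an MFCC matrix: 13 coefficients x 100 frames (random values, not real audio)
fake_mfccs = rng.normal(size=(13, 100))

def summarize_frames(matrix, prefix):
    """Collapse a (n_features, n_frames) matrix into per-row mean and std entries."""
    means = matrix.mean(axis=1)
    stds = matrix.std(axis=1)
    row = {}
    for i, (m, s) in enumerate(zip(means, stds)):
        row[f"{prefix}_{i}_mean"] = float(m)
        row[f"{prefix}_{i}_std"] = float(s)
    return row

row = summarize_frames(fake_mfccs, "mfcc")
print(len(row))  # 13 means + 13 stds
```

One consequence of this design is that the order of frames in time is discarded: two clips with the same statistics but different arrangements produce the same row.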

## Step 5: Process All MP3 Files and Create Dataset

In [None]:
def create_dataset(mp3_files, output_path='music_dataset.csv'):
    """
    Create a dataset from MP3 files by extracting features.
    
    Args:
        mp3_files (list): List of MP3 file paths
        output_path (str): Path to save the dataset CSV file
        
    Returns:
        pd.DataFrame: DataFrame containing all extracted features
    """
    dataset = []
    
    print(f"Processing {len(mp3_files)} files...\n")
    
    for i, file_path in enumerate(mp3_files, 1):
        print(f"[{i}/{len(mp3_files)}] Processing: {os.path.basename(file_path)}")
        features = extract_features(file_path)
        
        if features is not None:
            dataset.append(features)
    
    # Create DataFrame
    df = pd.DataFrame(dataset)
    
    # Save to CSV
    if len(df) > 0:
        df.to_csv(output_path, index=False)
        print(f"\n✓ Dataset saved to: {output_path}")
        print(f"✓ Total songs in dataset: {len(df)}")
        print(f"✓ Total columns per song (including filename and filepath): {len(df.columns)}")
    else:
        print("\n⚠ No features extracted. Dataset is empty.")
    
    return df

# Create the dataset
if len(mp3_files) > 0:
    dataset_df = create_dataset(mp3_files)
else:
    print("No MP3 files to process. Please add MP3 files to the folder.")
    dataset_df = pd.DataFrame()

## Step 6: Explore the Dataset

In [None]:
# Display dataset info
if len(dataset_df) > 0:
    print("Dataset Overview:")
    print("=" * 50)
    print(f"Number of songs: {len(dataset_df)}")
    print(f"Number of columns (including filename and filepath): {len(dataset_df.columns)}")
    print("\nFirst few rows:")
    display(dataset_df.head())
    
    print("\nDataset Statistics:")
    display(dataset_df.describe())
else:
    print("Dataset is empty. Please add MP3 files and run the processing cells.")

## Step 7: Visualize Feature Distributions

In [None]:
if len(dataset_df) > 0:
    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Audio Feature Distributions', fontsize=16, fontweight='bold')
    
    # Tempo distribution
    axes[0, 0].hist(dataset_df['tempo'], bins=20, color='skyblue', edgecolor='black')
    axes[0, 0].set_title('Tempo Distribution (BPM)')
    axes[0, 0].set_xlabel('Tempo (BPM)')
    axes[0, 0].set_ylabel('Frequency')
    
    # Spectral Centroid distribution
    axes[0, 1].hist(dataset_df['spectral_centroid'], bins=20, color='lightcoral', edgecolor='black')
    axes[0, 1].set_title('Spectral Centroid Distribution')
    axes[0, 1].set_xlabel('Spectral Centroid')
    axes[0, 1].set_ylabel('Frequency')
    
    # Zero Crossing Rate distribution
    axes[1, 0].hist(dataset_df['zero_crossing_rate'], bins=20, color='lightgreen', edgecolor='black')
    axes[1, 0].set_title('Zero Crossing Rate Distribution')
    axes[1, 0].set_xlabel('Zero Crossing Rate')
    axes[1, 0].set_ylabel('Frequency')
    
    # RMS Energy distribution
    axes[1, 1].hist(dataset_df['rms_energy'], bins=20, color='plum', edgecolor='black')
    axes[1, 1].set_title('RMS Energy Distribution')
    axes[1, 1].set_xlabel('RMS Energy')
    axes[1, 1].set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.savefig('feature_distributions.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("\nVisualization saved as 'feature_distributions.png'")
else:
    print("No data available for visualization.")

## Step 8: Save Dataset in Multiple Formats

In [None]:
if len(dataset_df) > 0:
    # Save as CSV (create_dataset already wrote this file; this overwrites it with the same data)
    dataset_df.to_csv('music_dataset.csv', index=False)
    print("✓ Dataset saved as CSV: music_dataset.csv")
    
    # Save as JSON
    dataset_df.to_json('music_dataset.json', orient='records', indent=2)
    print("✓ Dataset saved as JSON: music_dataset.json")
    
    # Save as Pickle (for Python)
    dataset_df.to_pickle('music_dataset.pkl')
    print("✓ Dataset saved as Pickle: music_dataset.pkl")
    
    # Save feature names
    with open('feature_names.txt', 'w') as f:
        for col in dataset_df.columns:
            f.write(f"{col}\n")
    print("✓ Feature names saved: feature_names.txt")
    
    print("\n" + "="*50)
    print("Dataset creation complete!")
    print("="*50)
else:
    print("No data to save.")

## Next Steps

Now that you have created your music dataset, you can:

1. **Use the dataset for AI model training:**
   - Train a music generation model (e.g., using RNNs, LSTMs, or Transformers)
   - Use features for music classification or recommendation
   - Train a model to generate similar music

2. **Enhance the dataset:**
   - Add genre labels manually or automatically
   - Include more audio features
   - Analyze longer segments of audio by raising the `duration` argument of `extract_features`

3. **Preprocess for deep learning:**
   - Normalize features
   - Create training/validation/test splits
   - Generate spectrograms or mel-spectrograms for CNN models
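
The normalization and splitting steps listed above can be sketched with NumPy alone. This is one possible approach rather than a prescribed pipeline: the function name, the 70/15/15 split, and the fixed seed are all assumptions. Scaling statistics are computed on the training rows only, so the validation and test rows do not influence them.

```python
import numpy as np

def split_and_standardize(X, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows, split into train/val/test, and z-score using train statistics only."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))

    n_val = int(round(len(X) * val_frac))
    n_test = int(round(len(X) * test_frac))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]

    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns

    def scale(rows):
        return (X[rows] - mean) / std

    return scale(train_idx), scale(val_idx), scale(test_idx)
```

With this notebook's DataFrame, the input could be something like `dataset_df.drop(columns=['filename', 'filepath']).to_numpy()`, since those two columns are identifiers rather than numeric features.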