# Audio Sample Analysis and ML Experiments

This notebook demonstrates how to use the enhanced `acid_cat` tool for machine learning experiments on audio sample libraries.

## Features:
- Load and explore extracted audio features
- Visualize feature distributions and relationships
- Perform similarity analysis
- Cluster samples by audio characteristics
- Build recommendation systems

## Prerequisites:
Run `acid_cat.py` with the `--ml-ready` flag to generate feature-rich CSV files.
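
If you do not have an `acid_cat` export yet, a synthetic stand-in can be enough to step through the cells. The sketch below is not `acid_cat` output: it only fills the column names this notebook references (`filename`, `bpm`, `duration_sec`, `spectral_centroid_mean`, `mfcc_1_mean`, `rms_mean`, `zcr_mean`) with random values. The helper `make_demo_metadata` and the value ranges are hypothetical.

```python
import numpy as np
import pandas as pd

def make_demo_metadata(n_samples=30, seed=0):
    """Build a synthetic stand-in for the feature CSV (random values, not real audio)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "filename": [f"demo_{i:03d}.wav" for i in range(n_samples)],
        "bpm": rng.uniform(70, 170, n_samples).round(1),
        "duration_sec": rng.uniform(0.5, 12.0, n_samples).round(2),
        "spectral_centroid_mean": rng.uniform(500, 6000, n_samples),
        "mfcc_1_mean": rng.normal(-200, 60, n_samples),
        "rms_mean": rng.uniform(0.01, 0.5, n_samples),
        "zcr_mean": rng.uniform(0.01, 0.3, n_samples),
    })

demo_df = make_demo_metadata()
print(demo_df.head())
```

Saving it with `demo_df.to_csv('samples_metadata.csv', index=False)` lets the loading cell in Section 1 pick it up. The plots will run, but they carry no musical meaning.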

## 1. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

In [None]:
# Load the audio features dataset
# Replace 'your_metadata.csv' with your actual CSV file path
# df = pd.read_csv('your_metadata.csv')

# For demo purposes, let's load the sample data
try:
    df = pd.read_csv('samples_metadata.csv')
    print(f"Loaded {len(df)} audio samples")
    print(f"Dataset shape: {df.shape}")
    print("\nColumn names:")
    print(df.columns.tolist())
except FileNotFoundError:
    print("CSV file not found. Please run: python acid_cat.py data/samples --ml-ready")
    df = None

## 2. Data Exploration and Visualization

In [None]:
if df is not None:
    # Basic info about the dataset
    print("Dataset Info:")
    print(df.info())
    
    print("\nFirst few rows:")
    print(df.head())
    
    print("\nBasic statistics:")
    print(df.describe())

In [None]:
if df is not None:
    # Identify numeric columns for analysis
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    print(f"Found {len(numeric_cols)} numeric features:")
    print(numeric_cols)
    
    # Exclude any 'tempo_librosa' columns by name (flagged as problematic: they might contain lists)
    feature_cols = [col for col in numeric_cols if 'tempo_librosa' not in col.lower()]
    print(f"\nUsing {len(feature_cols)} features for analysis")

In [None]:
if df is not None and len(df) > 1:
    # Distribution of key audio features
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle('Distribution of Key Audio Features', fontsize=16)
    
    features_to_plot = ['bpm', 'duration_sec', 'spectral_centroid_mean', 
                       'mfcc_1_mean', 'rms_mean', 'zcr_mean']
    
    for i, feature in enumerate(features_to_plot):
        if feature in df.columns:
            ax = axes[i//3, i%3]
            df[feature].hist(bins=20, ax=ax, alpha=0.7)
            ax.set_title(f'{feature}')
            ax.set_xlabel(feature)
            ax.set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()
else:
    print("Need more than one sample for meaningful visualization")

## 3. Feature Correlation Analysis

In [None]:
if df is not None and len(feature_cols) > 1:
    # Calculate correlation matrix
    correlation_matrix = df[feature_cols].corr()
    
    # Plot correlation heatmap
    plt.figure(figsize=(12, 10))
    sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0,
                square=True, cbar_kws={'shrink': 0.8})
    plt.title('Audio Feature Correlation Matrix')
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()
    
    # Find highly correlated features
    high_corr_pairs = []
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            corr_val = correlation_matrix.iloc[i, j]
            if abs(corr_val) > 0.8:
                high_corr_pairs.append((correlation_matrix.columns[i], 
                                      correlation_matrix.columns[j], 
                                      corr_val))
    
    if high_corr_pairs:
        print("\nHighly correlated feature pairs (|r| > 0.8):")
        for feat1, feat2, corr in high_corr_pairs:
            print(f"{feat1} <-> {feat2}: {corr:.3f}")
    else:
        print("\nNo highly correlated feature pairs found (|r| > 0.8)")
else:
    print("Need multiple features and samples for correlation analysis")

## 4. Dimensionality Reduction and Visualization

In [None]:
if df is not None and len(df) > 1 and len(feature_cols) > 2:
    # Prepare data for dimensionality reduction
    X = df[feature_cols].fillna(0)  # Fill NaN values
    
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # PCA
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)
    
    # t-SNE (only if we have enough samples)
    if len(df) >= 5:
        tsne = TSNE(n_components=2, random_state=42, perplexity=min(5, len(df)-1))
        X_tsne = tsne.fit_transform(X_scaled)
    
    # Plot results
    fig, axes = plt.subplots(1, 2 if len(df) >= 5 else 1, figsize=(15, 6))
    if len(df) < 5:
        axes = [axes]
    
    # PCA plot
    axes[0].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7, s=50)
    axes[0].set_title(f'PCA Visualization\n(Explained variance: {pca.explained_variance_ratio_.sum():.2%})')
    axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
    axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
    
    # t-SNE plot
    if len(df) >= 5:
        axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], alpha=0.7, s=50)
        axes[1].set_title('t-SNE Visualization')
        axes[1].set_xlabel('t-SNE 1')
        axes[1].set_ylabel('t-SNE 2')
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nPCA explained variance ratio: {pca.explained_variance_ratio_}")
    print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.2%}")
else:
    print("Need more samples and features for dimensionality reduction")

## 5. Audio Sample Clustering

In [None]:
if df is not None and len(df) > 2 and len(feature_cols) > 2:
    # Prepare data
    X = df[feature_cols].fillna(0)
    X_scaled = StandardScaler().fit_transform(X)
    
    # K-Means clustering
    n_clusters = min(3, len(df))  # Adjust based on sample size
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_scaled)
    
    # Add cluster labels to dataframe
    df_clustered = df.copy()
    df_clustered['cluster'] = cluster_labels
    
    # Visualize clusters in PCA space (reuses X_pca from Section 4, so run that cell first)
    if 'X_pca' in locals() and len(X_pca) == len(df):
        plt.figure(figsize=(10, 6))
        scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, 
                            cmap='viridis', alpha=0.7, s=50)
        plt.colorbar(scatter)
        plt.title(f'K-Means Clustering (k={n_clusters}) in PCA Space')
        plt.xlabel('PC1')
        plt.ylabel('PC2')
        
        # Add sample names as annotations
        for i, filename in enumerate(df['filename']):
            plt.annotate(filename.split('/')[-1].split('\\')[-1][:10], 
                        (X_pca[i, 0], X_pca[i, 1]), 
                        xytext=(5, 5), textcoords='offset points', 
                        fontsize=8, alpha=0.7)
        
        plt.tight_layout()
        plt.show()
    
    # Cluster characteristics (only for the key features present in this CSV)
    print("\nCluster Characteristics:")
    key_features = [f for f in ['bpm', 'duration_sec', 'spectral_centroid_mean', 'mfcc_1_mean']
                    if f in df_clustered.columns]
    cluster_summary = df_clustered.groupby('cluster')[key_features].mean()
    print(cluster_summary)
    
    print("\nSamples per cluster:")
    print(df_clustered['cluster'].value_counts().sort_index())
else:
    print("Need more samples for meaningful clustering")
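
The cell above fixes the cluster count at `min(3, len(df))`. One common way to choose it from the data instead is to compare silhouette scores across several values of `k`. The following is a minimal sketch on synthetic blobs, not part of `acid_cat`; `pick_k_by_silhouette` is a hypothetical helper, and on real features you would pass the notebook's `X_scaled`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def pick_k_by_silhouette(X_scaled, k_values=range(2, 7), random_state=42):
    """Return (best_k, scores), where scores maps each tried k to its silhouette score."""
    n_samples = X_scaled.shape[0]
    scores = {}
    for k in k_values:
        # silhouette_score is only defined for 2 <= k <= n_samples - 1
        if k < 2 or k > n_samples - 1:
            continue
        labels = KMeans(n_clusters=k, random_state=random_state, n_init=10).fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    if not scores:
        raise ValueError("Need at least 3 samples to compare cluster counts")
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic demo: three well-separated blobs in 4 dimensions
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [10, 10, 10, 10], [-10, 10, -10, 10]])
X_demo = np.vstack([c + rng.normal(0, 0.5, size=(20, 4)) for c in centers])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

best_k, scores = pick_k_by_silhouette(X_demo_scaled)
print(f"Best k by silhouette: {best_k}")
for k, score in scores.items():
    print(f"  k={k}: {score:.3f}")
```

Silhouette scores favor compact, well-separated groups, so treat the result as a starting point and listen to a few samples per cluster before trusting it.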

## 6. Audio Similarity Search

In [None]:
def find_similar_samples(df, target_idx, feature_cols, n_similar=5):
    """
    Find samples similar to the target sample using cosine similarity.
    """
    if len(df) < 2:
        return pd.DataFrame()
    
    # Prepare features
    X = df[feature_cols].fillna(0)
    X_scaled = StandardScaler().fit_transform(X)
    
    # Calculate similarity
    target_features = X_scaled[target_idx].reshape(1, -1)
    similarities = cosine_similarity(target_features, X_scaled)[0]
    
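    # Rank by similarity, excluding the target explicitly: it is not guaranteed
    # to sort first (e.g. exact duplicates or an all-zero scaled row)
    order = np.argsort(similarities)[::-1]
    similar_indices = order[order != target_idx][:n_similar]
    
    # Create results dataframe
    results = df.iloc[similar_indices].copy()
    results['similarity_score'] = similarities[similar_indices]
    
    # Show only the summary columns that exist in this CSV
    summary_cols = [c for c in ['filename', 'bpm', 'duration_sec'] if c in results.columns]
    return results[summary_cols + ['similarity_score']]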

if df is not None and len(df) > 1:
    # Example: Find samples similar to the first sample
    target_idx = 0
    target_sample = df.iloc[target_idx]['filename']
    
    print(f"Finding samples similar to: {target_sample}")
    print(f"Target BPM: {df.iloc[target_idx]['bpm']}")
    print(f"Target Duration: {df.iloc[target_idx]['duration_sec']:.2f}s")
    
    similar_samples = find_similar_samples(df, target_idx, feature_cols, n_similar=3)
    
    if not similar_samples.empty:
        print("\nMost similar samples:")
        print(similar_samples)
    else:
        print("\nNo similar samples found (need more data)")
else:
    print("Need multiple samples for similarity search")

## 7. Feature Importance Analysis

In [None]:
if df is not None and len(df) > 1 and len(feature_cols) > 2:
    # Use PCA explained variance to gauge how many independent dimensions the features span
    X = df[feature_cols].fillna(0)
    X_scaled = StandardScaler().fit_transform(X)
    
    pca_full = PCA()
    pca_full.fit(X_scaled)
    
    # Plot explained variance
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.plot(range(1, len(pca_full.explained_variance_ratio_) + 1), 
             pca_full.explained_variance_ratio_, 'bo-')
    plt.title('PCA Explained Variance by Component')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')
    plt.grid(True)
    
    plt.subplot(1, 2, 2)
    cumsum = np.cumsum(pca_full.explained_variance_ratio_)
    plt.plot(range(1, len(cumsum) + 1), cumsum, 'ro-')
    plt.title('Cumulative Explained Variance')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.axhline(y=0.95, color='k', linestyle='--', alpha=0.7, label='95%')
    plt.legend()
    plt.grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # Find number of components for 95% variance
    n_components_95 = np.argmax(cumsum >= 0.95) + 1
    print(f"\nNumber of components needed for 95% variance: {n_components_95}")
    print(f"Total features: {len(feature_cols)}")
    print(f"Dimensionality reduction: {len(feature_cols)} -> {n_components_95} ({n_components_95/len(feature_cols)*100:.1f}%)")
else:
    print("Need more samples and features for importance analysis")
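
Explained variance shows how many components matter, but not which features drive them. One way to connect components back to features is to inspect the PCA loadings (`pca.components_`). The following is a minimal sketch, assuming a numeric feature DataFrame; `top_pca_loadings` and the toy columns `a` to `d` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def top_pca_loadings(frame, n_components=2, top_n=3):
    """Return, per component, the top_n features ranked by absolute loading."""
    X_scaled = StandardScaler().fit_transform(frame.fillna(0))
    pca = PCA(n_components=n_components).fit(X_scaled)
    # components_ has shape (n_components, n_features); transpose to index by feature
    loadings = pd.DataFrame(
        pca.components_.T,
        index=frame.columns,
        columns=[f"PC{i + 1}" for i in range(n_components)],
    )
    return {
        pc: loadings[pc].abs().sort_values(ascending=False).head(top_n)
        for pc in loadings.columns
    }

# Toy data: a, b, c move together; d is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": a + rng.normal(scale=0.1, size=200),
    "d": rng.normal(size=200),
})

top = top_pca_loadings(toy, n_components=2, top_n=3)
for pc, ranked in top.items():
    print(pc)
    print(ranked)
```

On the notebook's data the call would be `top_pca_loadings(df[feature_cols])`. Large absolute loadings indicate features that vary together along a component, which is a weaker notion than importance for a prediction task.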

## 8. Sample Recommendation System

In [None]:
class AudioSampleRecommender:
    def __init__(self, df, feature_cols):
        self.df = df.copy()
        self.feature_cols = feature_cols
        self.scaler = StandardScaler()
        self.features_scaled = None
        self.nn_model = None
        self._prepare_data()
    
    def _prepare_data(self):
        # Prepare and scale features
        X = self.df[self.feature_cols].fillna(0)
        self.features_scaled = self.scaler.fit_transform(X)
        
        # Initialize nearest neighbors model
        self.nn_model = NearestNeighbors(n_neighbors=min(5, len(self.df)), 
                                        metric='cosine')
        self.nn_model.fit(self.features_scaled)
    
    def recommend_by_sample(self, sample_idx, n_recommendations=3):
        """Recommend samples similar to a given sample."""
        if len(self.df) < 2:
            return pd.DataFrame()
            
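        target_features = self.features_scaled[sample_idx].reshape(1, -1)
        # Ask for one extra neighbor (the target itself), capped at the dataset size
        n_neighbors = min(n_recommendations + 1, len(self.df))
        distances, indices = self.nn_model.kneighbors(target_features, 
                                                     n_neighbors=n_neighbors)
        
        # Exclude the target explicitly; with duplicates it may not be returned first
        keep = indices[0] != sample_idx
        similar_indices = indices[0][keep][:n_recommendations]
        similarities = 1 - distances[0][keep][:n_recommendations]  # Convert cosine distance to similarity
        
        results = self.df.iloc[similar_indices].copy()
        results['similarity_score'] = similarities
        
        return results[['filename', 'bpm', 'duration_sec', 'similarity_score']]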
    
    def recommend_by_criteria(self, target_bpm=None, target_duration=None, 
                            target_key=None, n_recommendations=3):
        """Recommend samples based on musical criteria."""
        candidates = self.df.copy()
        
        # Filter by criteria
        if target_bpm is not None:
            bpm_tolerance = 10  # Allow ±10 BPM
            candidates = candidates[
                (candidates['bpm'] >= target_bpm - bpm_tolerance) & 
                (candidates['bpm'] <= target_bpm + bpm_tolerance)
            ]
        
        if target_duration is not None:
            duration_tolerance = 2.0  # Allow ±2 seconds
            candidates = candidates[
                (candidates['duration_sec'] >= target_duration - duration_tolerance) & 
                (candidates['duration_sec'] <= target_duration + duration_tolerance)
            ]
        
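        if target_key is not None:
            # Match on whichever root-key columns the CSV actually provides
            key_cols = [c for c in ('smpl_root_key', 'acid_root_note') if c in candidates.columns]
            if key_cols:
                candidates = candidates[candidates[key_cols].eq(target_key).any(axis=1)]
            else:
                candidates = candidates.iloc[0:0]  # No key information: nothing can match
        
        # Return the first matches (candidates are filtered, not ranked)
        output_cols = [c for c in ('filename', 'bpm', 'duration_sec', 
                                   'smpl_root_key', 'acid_root_note') if c in candidates.columns]
        return candidates.head(n_recommendations)[output_cols]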

# Create recommender system
if df is not None and len(df) > 1:
    recommender = AudioSampleRecommender(df, feature_cols)
    
    print("=== Audio Sample Recommender ===")
    print("\n1. Similarity-based recommendations:")
    similar_recs = recommender.recommend_by_sample(0, n_recommendations=2)
    print(similar_recs)
    
    print("\n2. Criteria-based recommendations:")
    # Example: Find samples around 95 BPM
    criteria_recs = recommender.recommend_by_criteria(target_bpm=95, n_recommendations=3)
    print(criteria_recs)
else:
    print("Need multiple samples to create a recommender system")

## 9. Export and Save Analysis Results

In [None]:
if df is not None:
    # Save analysis results
    analysis_results = {
        'dataset_info': {
            'total_samples': len(df),
            'total_features': len(feature_cols),
            'avg_duration': df['duration_sec'].mean() if 'duration_sec' in df.columns else None,
            'avg_bpm': df['bpm'].mean() if 'bpm' in df.columns else None
        },
        'feature_columns': feature_cols
    }
    
    print("Analysis Summary:")
    print(f"Total samples analyzed: {analysis_results['dataset_info']['total_samples']}")
    print(f"Total features extracted: {analysis_results['dataset_info']['total_features']}")
    
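    # pd.notna handles both None (column missing) and NaN (column empty),
    # and unlike a truthiness test it does not skip a legitimate 0.0
    if pd.notna(analysis_results['dataset_info']['avg_duration']):
        print(f"Average duration: {analysis_results['dataset_info']['avg_duration']:.2f} seconds")
    
    if pd.notna(analysis_results['dataset_info']['avg_bpm']):
        print(f"Average BPM: {analysis_results['dataset_info']['avg_bpm']:.1f}")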
    
    # Save enhanced dataset with cluster labels if available
    if 'df_clustered' in locals():
        df_clustered.to_csv('audio_analysis_with_clusters.csv', index=False)
        print("\nSaved enhanced dataset with cluster labels to: audio_analysis_with_clusters.csv")
    
    print("\n=== Next Steps ===")
    print("1. Run acid_cat on larger sample libraries for richer analysis")
    print("2. Experiment with different clustering algorithms")
    print("3. Build custom similarity metrics for your specific use case")
    print("4. Create playlists or collections based on similarity clusters")
    print("5. Train ML models for automatic sample classification")
else:
    print("No data to analyze. Please run acid_cat with --ml-ready flag first.")

## 10. Advanced Analysis Ideas

This notebook provides a foundation for audio sample analysis. Here are some advanced ideas to explore:

### Machine Learning Models
- **Genre Classification**: Train classifiers to predict music genres
- **Mood Detection**: Analyze emotional content of samples
- **Instrument Recognition**: Identify dominant instruments in samples
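
As a starting point for the classification ideas above, here is a minimal sketch of a supervised pipeline with scikit-learn. Everything in it is an assumption for illustration: the two features stand in for tempo and spectral centroid, the data is synthetic, and the labels `class_a` and `class_b` are hypothetical. A CSV from `acid_cat` does not necessarily contain genre labels, so on real data you would need to supply your own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_per_class = 40

# Two hypothetical classes separated by tempo (column 0) and brightness (column 1)
class_a = np.column_stack([rng.normal(140, 10, n_per_class),
                           rng.normal(4000, 500, n_per_class)])
class_b = np.column_stack([rng.normal(90, 10, n_per_class),
                           rng.normal(1500, 500, n_per_class)])
X = np.vstack([class_a, class_b])
y = np.array(["class_a"] * n_per_class + ["class_b"] * n_per_class)

# Scaling is not required by random forests, but keeping it in the pipeline
# makes it easy to swap in a scale-sensitive model later
clf = make_pipeline(StandardScaler(),
                    RandomForestClassifier(n_estimators=100, random_state=0))

# 5-fold cross-validation gives a less optimistic estimate than training accuracy
cv_scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean CV accuracy: {cv_scores.mean():.2f}")
```

The synthetic classes are deliberately easy to separate; real sample libraries will usually need more features and many more labeled examples per class.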

### Feature Engineering
- **Rhythmic Patterns**: Extract beat patterns and rhythmic complexity
- **Harmonic Analysis**: Analyze chord progressions and harmonic content
- **Dynamic Range**: Measure loudness variations and dynamics
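
The dynamic range idea can be sketched with NumPy alone. The function below is an illustrative assumption, not an `acid_cat` feature: it takes a mono signal as an array, computes the crest factor (peak relative to overall RMS) and the spread of frame-wise RMS in dB. The name `dynamics_features`, the frame length of 2048 and the hop of 512 are hypothetical choices.

```python
import numpy as np

def dynamics_features(signal, frame_length=2048, hop_length=512, eps=1e-10):
    """Compute crest factor and the dB spread of frame-wise RMS for a mono signal."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (signal.size - frame_length) // hop_length
    # RMS of each overlapping frame
    frame_rms = np.array([
        np.sqrt(np.mean(signal[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    overall_rms = np.sqrt(np.mean(signal ** 2))
    peak = np.max(np.abs(signal))
    rms_db = 20 * np.log10(frame_rms + eps)  # eps avoids log10(0) on silence
    return {
        "crest_factor_db": float(20 * np.log10((peak + eps) / (overall_rms + eps))),
        "rms_range_db": float(rms_db.max() - rms_db.min()),
    }

# Synthetic demo: a steady sine versus the same sine with an exponential decay
sr = 22050
t = np.arange(sr) / sr
steady = np.sin(2 * np.pi * 440 * t)
decaying = steady * np.exp(-5 * t)

steady_feats = dynamics_features(steady)
decaying_feats = dynamics_features(decaying)
print("steady:  ", steady_feats)
print("decaying:", decaying_feats)
```

A steady sine has a crest factor of about 3 dB and almost no RMS spread, while the decaying version shows a much larger spread, which is the kind of contrast that separates sustained pads from one-shot hits.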

### Recommendation Systems
- **Collaborative Filtering**: Learn from user preferences
- **Content-Based Filtering**: Use audio features for recommendations
- **Hybrid Systems**: Combine multiple recommendation approaches

### Visualization
- **Audio Spectrograms**: Visualize frequency content over time
- **3D Feature Spaces**: Explore high-dimensional feature relationships
- **Interactive Dashboards**: Build web interfaces for sample exploration

### Production Tools
- **Sample Matching**: Find samples that work well together
- **Key/Tempo Compatibility**: Suggest harmonically compatible samples
- **Loop Generation**: Create new loops based on existing patterns
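
The key/tempo compatibility idea can be prototyped with two small rules. This is a sketch under stated assumptions rather than an established method: tempos count as compatible when they match within a relative tolerance, also allowing half and double time, and keys are compared by their distance on the circle of fifths. The function names and the 3% tolerance are hypothetical, and keys are assumed to be pitch classes 0 to 11 with C = 0 (a MIDI note number reduces to a pitch class with `% 12`).

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def tempo_compatible(bpm_a, bpm_b, tolerance=0.03):
    """True if the tempos match within a relative tolerance, allowing half/double time."""
    if bpm_a <= 0 or bpm_b <= 0:
        return False
    for factor in (0.5, 1.0, 2.0):
        if abs(bpm_a * factor - bpm_b) / bpm_b <= tolerance:
            return True
    return False

def fifths_distance(pitch_class_a, pitch_class_b):
    """Steps between two pitch classes (0-11, C=0) around the circle of fifths (0-6)."""
    # Multiplying by 7 mod 12 maps semitone positions to circle-of-fifths positions
    pos_a = (pitch_class_a * 7) % 12
    pos_b = (pitch_class_b * 7) % 12
    diff = abs(pos_a - pos_b)
    return min(diff, 12 - diff)

def compatible(sample_a, sample_b, max_fifths=1):
    """Combine both rules for two samples given as dicts with 'bpm' and 'pitch_class'."""
    return (tempo_compatible(sample_a["bpm"], sample_b["bpm"])
            and fifths_distance(sample_a["pitch_class"], sample_b["pitch_class"]) <= max_fifths)

# Usage with made-up samples
loop_c = {"bpm": 120, "pitch_class": NOTE_NAMES.index("C")}
loop_g = {"bpm": 60, "pitch_class": NOTE_NAMES.index("G")}
loop_fs = {"bpm": 120, "pitch_class": NOTE_NAMES.index("F#")}
print(compatible(loop_c, loop_g))   # half-time tempo, keys one fifth apart
print(compatible(loop_c, loop_fs))  # same tempo, keys a tritone apart
```

The fifths distance ignores major/minor mode, so a fuller version would also compare modes before suggesting that two samples fit together.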