# MusicGen - AI Music Generation

**Module:** 02-Audio-Advanced  
**Level:** Intermediate  
**Technologies:** Meta MusicGen (AudioCraft), ~10 GB VRAM  
**Estimated duration:** 45 minutes  

## Learning Objectives

- [ ] Install and load MusicGen from the AudioCraft library
- [ ] Generate music from text descriptions
- [ ] Use melody conditioning (reference melody + description)
- [ ] Master the generation parameters (temperature, top_k, top_p)
- [ ] Control the duration of the generated tracks
- [ ] Compare the small, medium and large models
- [ ] Experiment with descriptions in French

## Prerequisites

- NVIDIA GPU with at least 10 GB VRAM (medium) or 4 GB (small)
- `pip install audiocraft`
- Basic knowledge of audio processing

**Navigation:** [<< 02-2](02-2-XTTS-Voice-Cloning.ipynb) | [Index](../README.md) | [Next >>](02-4-Demucs-Source-Separation.ipynb)

In [None]:
# Parametres Papermill - JAMAIS modifier ce commentaire

# Notebook configuration
notebook_mode = "interactive"        # "interactive" or "batch"
skip_widgets = False               # True for MCP batch mode
debug_level = "INFO"

# MusicGen parameters
model_size = "facebook/musicgen-medium"  # small, medium or large
duration_seconds = 10                     # Generation duration (seconds)
device = "cuda"                           # "cuda" or "cpu"
temperature = 1.0                         # Sampling temperature (0.5-1.5)

# Configuration
generate_audio = True              # Generate the audio files
save_results = True                # Save the generated files
test_melody_conditioning = True    # Test melody conditioning
compare_models = False             # Compare small/medium (needs more VRAM)

In [None]:
# Environment setup and imports
import os
import sys
import json
import time
import gc
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional
import logging

import numpy as np
import soundfile as sf
from IPython.display import Audio, display, HTML

# Locate the GenAI root and import the shared helpers
GENAI_ROOT = Path.cwd()
while GENAI_ROOT.name != 'GenAI' and len(GENAI_ROOT.parts) > 1:
    GENAI_ROOT = GENAI_ROOT.parent
if GENAI_ROOT.name != 'GenAI':
    # No 'GenAI' ancestor found: fall back to the working directory
    GENAI_ROOT = Path.cwd()

HELPERS_PATH = GENAI_ROOT / 'shared' / 'helpers'
if HELPERS_PATH.exists():
    sys.path.insert(0, str(HELPERS_PATH.parent))
    try:
        from helpers.audio_helpers import play_audio, save_audio
        print("Audio helpers imported")
    except ImportError:
        print("Audio helpers not available - standalone mode")

# Directories
OUTPUT_DIR = GENAI_ROOT / 'outputs' / 'audio' / 'musicgen'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Logging configuration
logging.basicConfig(level=getattr(logging, debug_level))
logger = logging.getLogger('musicgen')

# GPU check
gpu_available = False
try:
    import torch
    gpu_available = torch.cuda.is_available()
    if gpu_available:
        gpu_name = torch.cuda.get_device_name(0)
        # The device property is named total_memory (in bytes)
        gpu_vram = torch.cuda.get_device_properties(0).total_memory / (1024**3)
        print(f"GPU: {gpu_name} ({gpu_vram:.1f} GB VRAM)")
    else:
        print("GPU not available - MusicGen will be very slow on CPU")
        if device == "cuda":
            device = "cpu"
            print("Falling back to CPU")
except ImportError:
    print("torch is not installed")
    device = "cpu"

print("\nMusicGen - AI Music Generation")
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Mode: {notebook_mode}, Device: {device}")
print(f"Model: {model_size}, Duration: {duration_seconds}s")
print(f"Temperature: {temperature}")
print(f"Output: {OUTPUT_DIR}")

In [None]:
# Load .env
from dotenv import load_dotenv

# Look for a .env file in the working directory and up to three parents
current_path = Path.cwd()
found_env = False
for _ in range(4):
    env_path = current_path / '.env'
    if env_path.exists():
        load_dotenv(env_path)
        print(f".env file loaded from: {env_path}")
        found_env = True
        break
    current_path = current_path.parent

if not found_env:
    print("No .env file found")

# MusicGen does not require an API key
# A HuggingFace token is useful for the download
hf_token = os.getenv('HUGGINGFACE_TOKEN') or os.getenv('HF_TOKEN')
if hf_token:
    print("HuggingFace token available")
else:
    print("HuggingFace token not available (public downloads only)")

## Section 1: Overview of MusicGen

MusicGen is a music generation model developed by Meta as part of the AudioCraft project. It generates high-quality, single-track music from text descriptions.

### Available variants

| Model | Parameters | VRAM | Quality | Speed |
|--------|-----------|------|---------|--------|
| `musicgen-small` | 300M | ~4 GB | Good | Fast |
| `musicgen-medium` | 1.5B | ~10 GB | Very good | Medium |
| `musicgen-large` | 3.3B | ~20 GB | Excellent | Slow |
| `musicgen-melody` | 1.5B | ~10 GB | Very good | Medium |

### Generation modes

| Mode | Description | Required model |
|------|-------------|---------------|
| Text-to-music | Text description -> music | All |
| Melody conditioning | Reference melody + description -> music | musicgen-melody |
| Continuation | Continue an existing track | All |
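
The VRAM column above can drive the choice of `model_size`. The following is a minimal sketch, assuming the approximate figures from the table; `MODEL_VRAM_GB` and `pick_model` are hypothetical names introduced for illustration and are not part of AudioCraft.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the variants table (assumed, not measured).
MODEL_VRAM_GB = {
    "facebook/musicgen-small": 4,
    "facebook/musicgen-medium": 10,
    "facebook/musicgen-large": 20,
}

def pick_model(available_vram_gb: float) -> Optional[str]:
    """Return the largest text-to-music model whose approximate VRAM need fits."""
    fitting = [(need, name) for name, need in MODEL_VRAM_GB.items()
               if need <= available_vram_gb]
    if not fitting:
        return None  # nothing fits; the caller could fall back to CPU
    return max(fitting)[1]

print(pick_model(12.0))
```

In the notebook, the result could replace the hard-coded `model_size` once `gpu_vram` is known; actual memory use also depends on the requested duration, so the thresholds are only a starting point.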

In [None]:
# Load the MusicGen model
print("LOADING THE MUSICGEN MODEL")
print("=" * 45)

musicgen_loaded = False

try:
    from audiocraft.models import MusicGen

    print(f"Loading {model_size}...")
    print("(First run: the model weights are downloaded)")
    start_time = time.time()

    # Pass the device explicitly so the CPU fallback chosen earlier is honored
    musicgen = MusicGen.get_pretrained(model_size, device=device)
    load_time = time.time() - start_time
    musicgen_loaded = True

    # Generation settings
    musicgen.set_generation_params(
        duration=duration_seconds,
        temperature=temperature,
        top_k=250,
        top_p=0.0,       # 0.0 disables nucleus sampling, so top_k applies
        cfg_coef=3.0     # Classifier-free guidance
    )

    print(f"Model loaded in {load_time:.1f}s")
    print(f"Sample rate: {musicgen.sample_rate} Hz")
    print(f"Configured duration: {duration_seconds}s")

    if gpu_available:
        vram_used = torch.cuda.memory_allocated(0) / (1024**3)
        print(f"VRAM in use: {vram_used:.2f} GB")

except ImportError:
    print("audiocraft is not installed")
    print("Install with: pip install audiocraft")
except Exception as e:
    print(f"Error while loading: {type(e).__name__} - {str(e)[:200]}")

## Section 2: Text-to-music generation

Text-to-music generation is MusicGen's main mode. The text description guides the style, instruments, tempo and mood.

### Tips for writing descriptions

| Element | Examples | Impact |
|---------|----------|--------|
| Genre | jazz, rock, classical, electronic | Overall style |
| Instruments | piano, guitar, drums, synthesizer | Timbres |
| Tempo | slow, medium, fast, 120 BPM | Rhythm |
| Mood | happy, melancholic, energetic, calm | Emotion |
| Quality | high quality, lo-fi, crisp, warm | Production |
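
The elements in the table can be assembled into a prompt programmatically. This is a minimal sketch; `build_description` is a hypothetical helper introduced here, and the ordering of the elements is an arbitrary choice rather than a MusicGen requirement.

```python
def build_description(genre, instruments=(), tempo=None, mood=None, quality=None):
    """Assemble a text prompt from genre, instruments, tempo, mood and quality."""
    head = f"{mood} {genre}" if mood else genre
    if instruments:
        head += " with " + ", ".join(instruments)
    parts = [head]
    if tempo:
        parts.append(f"{tempo} tempo")
    if quality:
        parts.append(quality)
    return ", ".join(parts)

prompt = build_description(
    "jazz", ["piano", "saxophone"], tempo="medium", mood="happy", quality="high quality"
)
print(prompt)  # happy jazz with piano, saxophone, medium tempo, high quality
```

The resulting string can be passed as one entry of the list given to `musicgen.generate`.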

In [None]:
# Text-to-music generation
print("TEXT-TO-MUSIC GENERATION")
print("=" * 45)

descriptions = [
    "A happy upbeat jazz piece with piano and saxophone, medium tempo",
    "Calm ambient electronic music with soft pads and gentle arpeggios",
    "Energetic rock with electric guitar riffs and driving drums, high energy",
]

generation_results = {}

if musicgen_loaded and generate_audio:
    for i, desc in enumerate(descriptions):
        print(f"\n--- Track {i+1} ---")
        print(f"Description: {desc}")

        start_time = time.time()
        wav = musicgen.generate([desc])
        gen_time = time.time() - start_time

        # wav shape: [batch, channels, samples]; keep the first (mono) channel
        samples = wav[0, 0].cpu().numpy()
        sample_rate = musicgen.sample_rate
        duration = len(samples) / sample_rate

        generation_results[f"track_{i+1}"] = {
            "description": desc,
            "duration": duration,
            "gen_time": gen_time
        }

        print(f"  Duration: {duration:.1f}s | Generation time: {gen_time:.2f}s")
        print(f"  Real-time ratio: {duration / gen_time:.2f}x")
        display(Audio(data=samples, rate=sample_rate))

        if save_results:
            filepath = OUTPUT_DIR / f"music_{i+1}_{desc[:30].replace(' ', '_')}.wav"
            sf.write(str(filepath), samples, sample_rate)
            print(f"  Saved: {filepath.name}")

    # Summary
    print("\nGeneration summary:")
    print(f"{'Track':<12} {'Duration (s)':<12} {'Gen time (s)':<15} {'Description':<40}")
    print("-" * 79)
    for name, data in generation_results.items():
        print(f"{name:<12} {data['duration']:<12.1f} {data['gen_time']:<15.2f} {data['description'][:40]}")
else:
    print("Model not loaded or generation disabled")

### Interpretation: Text-to-music generation

| Aspect | Typical value | Meaning |
|--------|---------------|---------------|
| Real-time ratio (audio duration / generation time) | 0.5-2x (GPU) | Music generation is slower than TTS |
| Audio quality | 32 kHz mono | Adequate quality for generated music |
| Coherence | Good | The music generally follows the description well |

**Key points**:
1. English descriptions give better results (English is the training language)
2. The more precise the description, the more faithful the result
3. Generation is non-deterministic: two calls with the same description give different results

## Section 3: Generation parameters

The sampling parameters influence the diversity and quality of the generated music.

| Parameter | Value in this notebook | Range | Impact |
|-----------|-------------------|-------|--------|
| `temperature` | 1.0 | 0.5-1.5 | Diversity (low = conservative, high = creative) |
| `top_k` | 250 | 0-1000 | Keeps only the k most probable tokens |
| `top_p` | 0.0 | 0.0-1.0 | Nucleus sampling (0 = disabled) |
| `cfg_coef` | 3.0 | 1.0-10.0 | Fidelity to the description (high = more faithful) |
| `duration` | 10 | 1-30 | Duration in seconds |
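
To make the roles of `temperature`, `top_k` and `top_p` concrete, here is a toy sampler over a short list of logits, written with the standard library only. It is a sketch of the general technique, not AudioCraft's implementation; `sample_token` is a hypothetical name, and the rule that `top_p > 0` takes precedence over `top_k` is an assumption about how the two settings interact.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=250, top_p=0.0, rng=random):
    """Sample one token index from raw logits (temperature must be > 0)."""
    # Temperature: divide logits before the softmax; <1 sharpens, >1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_p > 0.0:
        # Nucleus sampling: smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
    elif top_k > 0:
        kept = ranked[:top_k]  # keep only the k most probable tokens
    else:
        kept = ranked          # no filtering

    # Draw among the kept tokens, proportionally to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

print(sample_token([0.1, 2.0, -1.0], top_k=1))  # top_k=1 always picks index 1
```

With `top_k=1` the sampler is greedy; raising `top_k` or `temperature` widens the set of tokens that can be drawn, which is the diversity effect described in the table.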

In [None]:
# Test the generation parameters
print("PARAMETER TEST")
print("=" * 45)

test_description = "A gentle classical piano melody, slow tempo, emotional"

temperatures = [0.5, 0.8, 1.0, 1.2]
param_results = {}

if musicgen_loaded and generate_audio:
    print(f"Description: {test_description}")
    print(f"Duration: {duration_seconds}s")

    for temp in temperatures:
        print(f"\nTemperature = {temp}")

        musicgen.set_generation_params(
            duration=duration_seconds,
            temperature=temp,
            top_k=250,
            top_p=0.0,
            cfg_coef=3.0
        )

        start_time = time.time()
        wav = musicgen.generate([test_description])
        gen_time = time.time() - start_time

        samples = wav[0, 0].cpu().numpy()
        sample_rate = musicgen.sample_rate
        duration = len(samples) / sample_rate

        # Peak-to-RMS ratio in dB (crest factor), used as a rough indirect indicator of diversity
        rms = np.sqrt(np.mean(samples**2))
        peak = np.max(np.abs(samples))
        dynamic_range = 20 * np.log10(peak / (rms + 1e-10))

        param_results[temp] = {
            "gen_time": gen_time,
            "rms": rms,
            "dynamic_range": dynamic_range
        }

        print(f"  Time: {gen_time:.2f}s | RMS: {rms:.4f} | Dynamic range: {dynamic_range:.1f} dB")
        display(Audio(data=samples, rate=sample_rate))

        if save_results:
            filepath = OUTPUT_DIR / f"temp_{temp:.1f}.wav"
            sf.write(str(filepath), samples, sample_rate)

    # Restore the default parameters
    musicgen.set_generation_params(
        duration=duration_seconds,
        temperature=temperature,
        top_k=250,
        top_p=0.0,
        cfg_coef=3.0
    )

    # Summary table
    print("\nSummary:")
    print(f"{'Temperature':<14} {'Gen time (s)':<15} {'RMS':<10} {'Dyn. range (dB)':<16}")
    print("-" * 55)
    for temp, data in param_results.items():
        print(f"{temp:<14.1f} {data['gen_time']:<15.2f} {data['rms']:<10.4f} {data['dynamic_range']:<16.1f}")
else:
    print("Model not loaded or generation disabled")

### Interpretation: Generation parameters

| Temperature | Observation | Recommendation |
|------------|-------------|----------------|
| 0.5 | Conservative, little variation | When a predictable result is wanted |
| 0.8 | Good balance | General use |
| 1.0 | Standard, creative | Exploration, diversity |
| 1.2 | Very creative, sometimes incoherent | Experimentation |

**Key points**:
1. Temperature does not significantly affect generation time
2. Higher temperatures increase diversity but can reduce coherence
3. `cfg_coef` controls fidelity to the description (3.0 is a good default)
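
The effect of `cfg_coef` can be illustrated with the usual classifier-free guidance formula, which extrapolates from the unconditional prediction toward the text-conditioned one. The sketch below uses plain lists of logits; `apply_cfg` is a hypothetical name, and the assumption is that MusicGen combines the two predictions in this standard form.

```python
def apply_cfg(cond_logits, uncond_logits, cfg_coef=3.0):
    """Combine conditional and unconditional logits: u + cfg_coef * (c - u)."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]

# Token 0 is favored by the text condition, token 1 is unaffected by it.
guided = apply_cfg([1.0, 2.0], [0.0, 2.0], cfg_coef=3.0)
print(guided)  # [3.0, 2.0]
```

With `cfg_coef=1.0` the result equals the conditional logits; larger values amplify whatever the description changes, which is why high values follow the prompt more closely at the cost of diversity.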

## Section 4: Melody conditioning

Melody conditioning lets you supply a reference melody that the model reproduces in the style requested by the text description.

| Element | Description |
|---------|-------------|
| Melody input | WAV audio containing the base melody |
| Text input | Description of the desired style |
| Output | Music combining the melody and the style |

> **Note**: Melody conditioning requires the dedicated `musicgen-melody` model.

In [None]:
# Melody conditioning
print("MELODY CONDITIONING")
print("=" * 45)

if musicgen_loaded and generate_audio and test_melody_conditioning:
    # Build a simple reference melody (sine waves)
    print("Creating a synthetic reference melody...")
    melody_sr = musicgen.sample_rate
    t = np.linspace(0, duration_seconds, int(melody_sr * duration_seconds), endpoint=False)

    # Simple melody: C-E-G-C (C major arpeggio)
    freqs = [261.63, 329.63, 392.00, 523.25]  # C4, E4, G4, C5
    melody_samples = np.zeros_like(t)
    note_duration = duration_seconds / len(freqs)
    for i, freq in enumerate(freqs):
        start_idx = int(i * note_duration * melody_sr)
        end_idx = int((i + 1) * note_duration * melody_sr)
        note_t = t[start_idx:end_idx]
        # Simplified envelope: linear attack followed by exponential decay
        envelope = np.minimum(1.0, (note_t - note_t[0]) * 10) * np.exp(-1.5 * (note_t - note_t[0]))
        melody_samples[start_idx:end_idx] = 0.5 * np.sin(2 * np.pi * freq * note_t) * envelope

    melody_path = OUTPUT_DIR / "melody_reference.wav"
    sf.write(str(melody_path), melody_samples, melody_sr)
    print(f"Reference melody: {melody_path.name} ({duration_seconds}s)")
    display(Audio(data=melody_samples, rate=melody_sr))

    # Load the melody model (if different from the main model)
    melody_model_name = "facebook/musicgen-melody"
    if model_size != melody_model_name:
        print(f"\nLoading the melody model ({melody_model_name})...")
        try:
            melody_model = MusicGen.get_pretrained(melody_model_name, device=device)
            melody_model.set_generation_params(
                duration=duration_seconds,
                temperature=temperature
            )
            melody_model_loaded = True
        except Exception as e:
            print(f"Error: {str(e)[:100]}")
            print("Melody conditioning will be skipped")
            melody_model_loaded = False
    else:
        melody_model = musicgen
        melody_model_loaded = True

    # Generation with melody conditioning
    style_descriptions = [
        "Jazz version with saxophone and piano",
        "Electronic ambient version with synthesizers and reverb",
    ]

    if melody_model_loaded:
        import torchaudio
        melody_tensor, _ = torchaudio.load(str(melody_path))
        melody_tensor = melody_tensor.unsqueeze(0)  # [1, channels, samples]
        output_sr = melody_model.sample_rate

        for i, style in enumerate(style_descriptions):
            print(f"\nStyle: {style}")
            start_time = time.time()

            wav = melody_model.generate_with_chroma(
                descriptions=[style],
                melody_wavs=melody_tensor,
                melody_sample_rate=melody_sr
            )

            gen_time = time.time() - start_time
            samples = wav[0, 0].cpu().numpy()

            print(f"  Time: {gen_time:.2f}s")
            display(Audio(data=samples, rate=output_sr))

            if save_results:
                filepath = OUTPUT_DIR / f"melody_style_{i+1}.wav"
                sf.write(str(filepath), samples, output_sr)

        # Free the melody model if it was loaded separately
        if model_size != melody_model_name:
            del melody_model
            gc.collect()
            if gpu_available:
                torch.cuda.empty_cache()
    else:
        print("Melody model not available")
else:
    print("Melody conditioning disabled or model not loaded")

### Interpretation: Melody conditioning

| Aspect | Observation | Meaning |
|--------|-------------|---------------|
| Melodic fidelity | Good | The base melody is recognizable |
| Style adaptation | Very good | The instruments and mood change |
| Generation time | Similar | Conditioning adds no major overhead |

**Key points**:
1. Melody conditioning extracts the chroma (melodic content) of the reference
2. The arrangement and instruments are determined by the text description
3. The `musicgen-melody` model is trained specifically for this task
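
Chroma describes which of the 12 pitch classes are present, regardless of octave. The toy sketch below maps the four note frequencies of the reference arpeggio to pitch classes. Real chroma extraction operates on a spectrogram of the audio, so this only illustrates the idea; `pitch_class` and `chroma_vector` are hypothetical names, and A4 = 440 Hz is assumed.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to its pitch class (0 = C ... 11 = B), ignoring the octave."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))  # nearest MIDI note number
    return midi % 12

def chroma_vector(freqs_hz):
    """12-bin histogram of pitch classes, normalized to sum to 1."""
    bins = [0.0] * 12
    for f in freqs_hz:
        bins[pitch_class(f)] += 1.0
    total = sum(bins)
    return [b / total for b in bins] if total else bins

arpeggio = [261.63, 329.63, 392.00, 523.25]  # C4, E4, G4, C5
names = [NOTE_NAMES[pitch_class(f)] for f in arpeggio]
print(names)  # ['C', 'E', 'G', 'C']
```

C4 and C5 land in the same bin, which is why a chroma-conditioned model can keep the melodic contour while changing register and instrumentation.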

In [None]:
# Interactive mode - Custom generation
if notebook_mode == "interactive" and not skip_widgets:
    print("INTERACTIVE MODE - CUSTOM MUSIC GENERATION")
    print("=" * 55)
    print("\nDescribe the music you want to generate:")
    print("(Leave empty to skip)")
    print("Example: 'A calm acoustic guitar melody with soft percussion'")

    try:
        user_desc = input("\nDescription: ")

        if user_desc.strip() and musicgen_loaded:
            user_duration = input(f"Duration [{duration_seconds}s] (1-30): ").strip()
            user_duration = int(user_duration) if user_duration else duration_seconds
            # Clamp the requested duration to the 1-30 s range
            user_duration = max(1, min(30, user_duration))

            musicgen.set_generation_params(
                duration=user_duration,
                temperature=temperature,
                top_k=250,
                cfg_coef=3.0
            )

            print(f"\nGenerating ({user_duration}s)...")
            start_time = time.time()
            wav = musicgen.generate([user_desc])
            gen_time = time.time() - start_time

            samples = wav[0, 0].cpu().numpy()
            print(f"Generated in {gen_time:.2f}s")
            display(Audio(data=samples, rate=musicgen.sample_rate))

            if save_results:
                ts = datetime.now().strftime('%Y%m%d_%H%M%S')
                filepath = OUTPUT_DIR / f"custom_{ts}.wav"
                sf.write(str(filepath), samples, musicgen.sample_rate)
                print(f"Saved: {filepath.name}")

            # Restore the default duration
            musicgen.set_generation_params(
                duration=duration_seconds,
                temperature=temperature,
                top_k=250,
                cfg_coef=3.0
            )
        else:
            print("Interactive mode skipped")

    except (KeyboardInterrupt, EOFError):
        print("Interactive mode interrupted")
    except Exception as e:
        error_type = type(e).__name__
        if "StdinNotImplemented" in error_type or "input" in str(e).lower():
            print("Interactive mode not available (automated execution)")
        else:
            print(f"Error: {error_type} - {str(e)[:100]}")
else:
    print("Batch mode - interactive interface disabled")

## Best practices and generation guide

### Writing good descriptions

| Element | Good example | Bad example |
|---------|------------|----------------|
| Genre | "Smooth jazz with piano" | "Jazz" |
| Mood | "Melancholic and introspective" | "Sad" |
| Instruments | "Acoustic guitar, soft drums, double bass" | "Guitar" |
| Tempo | "Slow tempo, around 70 BPM" | "Slow" |

### Current limitations

| Limitation | Description | Workaround |
|--------|-------------|---------------|
| Max duration ~30s | The model generates short excerpts | Concatenate several generations |
| Mono only | No native stereo with the models used here | Post-processing with panning |
| No lyrics | Generates instrumental music only | Combine with a TTS model |
| English | English descriptions are the most effective | Translate the descriptions |
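
The "concatenate several generations" workaround usually needs a short crossfade to avoid a click at the junction. Below is a minimal sketch on plain Python lists of mono samples; `crossfade_concat` is a hypothetical helper, and the same logic could be written with NumPy arrays in the notebook. Independently generated clips are not guaranteed to match musically, so the crossfade only smooths the transition.

```python
def crossfade_concat(first, second, overlap):
    """Join two mono sample sequences with a linear crossfade of `overlap` samples."""
    if overlap <= 0:
        return list(first) + list(second)
    overlap = min(overlap, len(first), len(second))
    head = list(first[:len(first) - overlap])   # untouched part of the first clip
    tail = list(second[overlap:])               # untouched part of the second clip
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)             # weight of the incoming clip, between 0 and 1
        a = first[len(first) - overlap + i]
        b = second[i]
        blended.append((1.0 - w) * a + w * b)
    return head + blended + tail

out = crossfade_concat([1.0] * 4, [0.0] * 4, overlap=2)
print(len(out))  # 6 samples: 4 + 4 - 2 overlapping
```

At the 32 kHz sample rate reported for MusicGen, an overlap of a few thousand samples corresponds to roughly a tenth of a second.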

In [None]:
# Session statistics and next steps
print("SESSION STATISTICS")
print("=" * 45)

print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Model: {model_size}")
print(f"Device: {device}")
print(f"Configured duration: {duration_seconds}s")
print(f"Temperature: {temperature}")
print(f"Model loaded: {'Yes' if musicgen_loaded else 'No'}")

if gpu_available:
    vram_current = torch.cuda.memory_allocated(0) / (1024**3)
    print(f"VRAM in use: {vram_current:.2f} GB")

if save_results:
    saved = list(OUTPUT_DIR.glob('*.wav'))
    total_size = sum(f.stat().st_size for f in saved) / (1024*1024)
    print(f"Saved files: {len(saved)} ({total_size:.1f} MB) in {OUTPUT_DIR}")

# Free memory
if musicgen_loaded:
    print("\nReleasing the model...")
    del musicgen
    gc.collect()
    if gpu_available:
        torch.cuda.empty_cache()
    print("Memory released")

print("\nNEXT STEPS")
print("1. Discover source separation with Demucs (02-4)")
print("2. Compare all the audio models (03-1)")
print("3. Build a complete voice pipeline (03-2)")
print("4. Create multi-step compositions (04-3)")

print(f"\nMusicGen notebook finished - {datetime.now().strftime('%H:%M:%S')}")