# 🔴 DuckDB: SQLite for Analytics

**Badge:** 🔴 Advanced | ⏱ 60 min | 🔑 **Key concepts:** DuckDB, SQL on files, Parquet, Pandas integration

## Objectives

- Discover DuckDB as an analytical alternative to SQLite
- Query files (CSV, Parquet, JSON) directly, without an import step
- Integrate DuckDB with Pandas seamlessly
- Use advanced analytical features (window functions, PIVOT)
- Take advantage of its performance on large volumes
- Read from remote sources (S3)

## Prerequisites

- SQL (SELECT, JOIN, GROUP BY)
- Pandas for data manipulation
- Familiarity with the CSV and Parquet formats

## 1. What is DuckDB?

**DuckDB** is an embedded (in-process) database optimized for analytical workloads (OLAP).

### DuckDB vs SQLite

| Feature | SQLite | DuckDB |
|---------|--------|--------|
| Workload | OLTP (transactions) | OLAP (analytics) |
| Performance | ✓ Fast for INSERT/UPDATE and point lookups | ✓✓✓ Very fast for scans and aggregations (SELECT/GROUP BY) |
| Columnar storage | ❌ | ✅ |
| SQL on files | ❌ | ✅ (CSV, Parquet, JSON) |
| Analytical functions | Basic (window functions since 3.25, no PIVOT) | Advanced (window functions, PIVOT) |
| Vectorized execution | ❌ | ✅ |
| Pandas integration | Manual | Native |

**Motto**: "SQLite for analytics"

### Use cases
- Fast exploration of local data
- Prototyping before moving to BigQuery/Snowflake
- Data science on a laptop
- Local ETL with SQL
- Alternative to Spark for medium-sized volumes (roughly < 100 GB)

In [None]:
# Installation: pip install duckdb pandas numpy pyarrow  (pyarrow backs pandas' Parquet I/O)
import duckdb
import pandas as pd
import numpy as np
from pathlib import Path
import time

print(f"DuckDB version: {duckdb.__version__}")
print("✓ DuckDB is ready to use!")

## 2. First steps: SQL directly on files

In [None]:
# Create test data
Path('duckdb_data').mkdir(exist_ok=True)

np.random.seed(42)
df_sales = pd.DataFrame({
    'order_id': range(1, 10001),
    'order_date': pd.date_range('2024-01-01', periods=10000, freq='5min'),
    'customer_id': np.random.randint(1, 1000, 10000),
    'product_name': np.random.choice(['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'], 10000),
    'category': np.random.choice(['Electronics', 'Accessories'], 10000),
    'quantity': np.random.randint(1, 5, 10000),
    'unit_price': np.random.uniform(50, 2000, 10000).round(2)
})

df_sales['total_amount'] = (df_sales['quantity'] * df_sales['unit_price']).round(2)

# Save as CSV and Parquet
df_sales.to_csv('duckdb_data/sales.csv', index=False)
df_sales.to_parquet('duckdb_data/sales.parquet', index=False)

print(f"✓ {len(df_sales):,} sales created")
print("  CSV: duckdb_data/sales.csv")
print("  Parquet: duckdb_data/sales.parquet")

In [None]:
# SQL directly on a CSV file, with no import step
result = duckdb.sql("""
    SELECT * 
    FROM 'duckdb_data/sales.csv'
    LIMIT 5
""")

print("✓ Direct SQL query on a CSV file:")
print(result)

# Convert to a DataFrame
df_result = result.df()
print(f"\nResult type: {type(df_result)}")

In [None]:
# SQL on Parquet: usually faster than CSV (columnar, typed, compressed)
result = duckdb.sql("""
    SELECT 
        product_name,
        COUNT(*) as order_count,
        SUM(total_amount) as total_revenue,
        AVG(total_amount) as avg_order_value
    FROM 'duckdb_data/sales.parquet'
    GROUP BY product_name
    ORDER BY total_revenue DESC
""")

print("✓ Analysis on Parquet:")
print(result)

## 3. Pandas integration: transparent and powerful

In [None]:
# Query a Pandas DataFrame directly
# DuckDB resolves table names against DataFrames visible in the calling Python scope

result = duckdb.sql("""
    SELECT 
        category,
        product_name,
        COUNT(*) as sales_count,
        SUM(total_amount) as revenue
    FROM df_sales
    GROUP BY category, product_name
    ORDER BY revenue DESC
    LIMIT 10
""")

print("✓ Direct SQL query on a Pandas DataFrame:")
print(result)

# Retrieve the result as a DataFrame
df_top_products = result.df()
print(f"\n✓ Result converted to a DataFrame: {len(df_top_products)} rows")

In [None]:
# Alternative syntax: duckdb.query() behaves like duckdb.sql(), and .to_df() like .df()
df_result = duckdb.query("""
    SELECT 
        DATE_TRUNC('day', order_date) as day,
        COUNT(*) as orders,
        SUM(total_amount) as daily_revenue
    FROM df_sales
    GROUP BY day
    ORDER BY day
    LIMIT 7
""").to_df()

print("✓ Daily sales (first 7 days):")
print(df_result)

## 4. Persistent connection (optional)

In [None]:
# Create a persistent database (a file on disk)
conn = duckdb.connect('duckdb_data/analytics.duckdb')

# Create a table from a Parquet file
conn.execute("""
    CREATE OR REPLACE TABLE sales AS 
    SELECT * FROM 'duckdb_data/sales.parquet'
""")

print("✓ Table 'sales' created in analytics.duckdb")

# Query the table
result = conn.execute("""
    SELECT COUNT(*) as total_orders,
           SUM(total_amount) as total_revenue
    FROM sales
""").fetchone()

print("\nStatistics:")
print(f"  Total orders: {result[0]:,}")
print(f"  Total revenue: {result[1]:,.2f}€")

# Close the connection
conn.close()
print("\n✓ Connection closed")

## 5. Advanced analytical functions

### WINDOW functions

In [None]:
# Window functions
result = duckdb.sql("""
    SELECT 
        order_id,
        customer_id,
        order_date,
        total_amount,
        -- Rank by amount
        ROW_NUMBER() OVER (ORDER BY total_amount DESC) as rank_by_amount,
        -- Running total per customer
        SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) as cumulative_spend,
        -- Moving average over the last 3 orders
        AVG(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date 
                                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_avg_3
    FROM df_sales
    ORDER BY rank_by_amount  -- without ORDER BY, LIMIT returns an arbitrary 10 rows
    LIMIT 10
""")

print("✓ Window functions:")
print(result.df().to_string())

### PIVOT: turning rows into columns

In [None]:
# PIVOT: build a category x product matrix
result = duckdb.sql("""
    PIVOT (
        SELECT category, product_name, SUM(quantity) as total_qty
        FROM df_sales
        GROUP BY category, product_name
    )
    ON product_name
    USING SUM(total_qty)
""")

print("✓ Pivot table (PIVOT):")
print(result.df())

### UNPIVOT: turning columns into rows

In [None]:
# Create a wide DataFrame
df_wide = pd.DataFrame({
    'product': ['Laptop', 'Smartphone', 'Tablet'],
    'Q1': [100, 150, 80],
    'Q2': [120, 160, 90],
    'Q3': [110, 170, 85],
    'Q4': [130, 180, 95]
})

print("Wide DataFrame:")
print(df_wide)

# UNPIVOT: convert to long format
result = duckdb.sql("""
    UNPIVOT df_wide
    ON Q1, Q2, Q3, Q4
    INTO
        NAME quarter
        VALUE sales
""")

print("\n✓ After UNPIVOT (long format):")
print(result.df())

## 6. Advanced aggregations and statistics

In [None]:
# Advanced statistics
result = duckdb.sql("""
    SELECT 
        product_name,
        COUNT(*) as count,
        AVG(total_amount) as mean,
        STDDEV(total_amount) as std,
        MIN(total_amount) as min,
        QUANTILE_CONT(total_amount, 0.25) as q25,
        MEDIAN(total_amount) as median,
        QUANTILE_CONT(total_amount, 0.75) as q75,
        MAX(total_amount) as max
    FROM df_sales
    GROUP BY product_name
    ORDER BY mean DESC
""")

print("✓ Descriptive statistics per product:")
print(result.df().round(2).to_string())

In [None]:
# GROUP BY with ROLLUP (subtotals)
result = duckdb.sql("""
    SELECT 
        category,
        product_name,
        COUNT(*) as orders,
        SUM(total_amount) as revenue
    FROM df_sales
    GROUP BY ROLLUP(category, product_name)
    ORDER BY category NULLS LAST, product_name NULLS LAST
""")

print("✓ GROUP BY with ROLLUP (totals and subtotals):")
print(result.df().head(15))

## 7. Joins and complex queries

In [None]:
# Create a customers table
df_customers = pd.DataFrame({
    'customer_id': range(1, 1001),
    'customer_name': [f'Customer_{i}' for i in range(1, 1001)],
    'country': np.random.choice(['France', 'USA', 'UK', 'Germany'], 1000),
    'segment': np.random.choice(['Premium', 'Standard', 'Basic'], 1000)
})

df_customers.to_parquet('duckdb_data/customers.parquet', index=False)
print("✓ customers table created")

# Join across files
result = duckdb.sql("""
    SELECT 
        c.segment,
        c.country,
        COUNT(DISTINCT s.customer_id) as customers,
        COUNT(s.order_id) as orders,
        SUM(s.total_amount) as revenue,
        AVG(s.total_amount) as avg_order_value
    FROM 'duckdb_data/sales.parquet' s
    JOIN 'duckdb_data/customers.parquet' c
        ON s.customer_id = c.customer_id
    GROUP BY c.segment, c.country
    ORDER BY revenue DESC
    LIMIT 10
""")

print("\n✓ Analysis by segment and country (join across 2 files):")
print(result.df().round(2).to_string())

## 8. Export: COPY TO

In [None]:
# Export a query result to Parquet
duckdb.sql("""
    COPY (
        SELECT 
            DATE_TRUNC('day', order_date) as day,
            category,
            COUNT(*) as orders,
            SUM(total_amount) as revenue
        FROM df_sales
        GROUP BY day, category
    ) TO 'duckdb_data/daily_sales.parquet' (FORMAT PARQUET)
""")

print("✓ Result exported: duckdb_data/daily_sales.parquet")

# Export to CSV
duckdb.sql("""
    COPY (
        SELECT * FROM df_sales WHERE total_amount > 5000
    ) TO 'duckdb_data/high_value_orders.csv' (HEADER, DELIMITER ',')
""")

print("✓ Orders > 5000€ exported to CSV")

# Verify
df_exported = pd.read_parquet('duckdb_data/daily_sales.parquet')
print(f"\n✓ Parquet file contains {len(df_exported)} rows")
print(df_exported.head())

## 9. Performance: DuckDB vs Pandas

In [None]:
# Build a larger dataset for the benchmark
n = 500_000
df_large = pd.DataFrame({
    'id': range(n),
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], n),
    'value': np.random.randn(n),
    'amount': np.random.uniform(1, 1000, n)
})

print(f"Dataset: {len(df_large):,} rows")
print(f"Memory: {df_large.memory_usage(deep=True).sum() / 1024**2:.2f} MB\n")

# Benchmark 1: GROUP BY with aggregations
print("Benchmark: GROUP BY with statistics")
print("="*50)

# Pandas
start = time.perf_counter()  # monotonic, high-resolution timer
result_pandas = df_large.groupby('category').agg({
    'value': ['count', 'mean', 'std', 'min', 'max'],
    'amount': ['sum', 'mean']
})
pandas_time = time.perf_counter() - start

print(f"Pandas: {pandas_time:.3f}s")

# DuckDB
start = time.perf_counter()
result_duckdb = duckdb.sql("""
    SELECT 
        category,
        COUNT(value) as value_count,
        AVG(value) as value_mean,
        STDDEV(value) as value_std,
        MIN(value) as value_min,
        MAX(value) as value_max,
        SUM(amount) as amount_sum,
        AVG(amount) as amount_mean
    FROM df_large
    GROUP BY category
""").df()
duckdb_time = time.perf_counter() - start

print(f"DuckDB: {duckdb_time:.3f}s")
# A single run is noisy and the winner depends on data size: report the ratio as measured
print(f"\n✓ Pandas/DuckDB time ratio: {pandas_time / duckdb_time:.1f}x (> 1 means DuckDB was faster)")

In [None]:
# Benchmark 2: reading from Parquet
df_large.to_parquet('duckdb_data/large.parquet', index=False)

print("\nBenchmark: Parquet read + filtering")
print("="*50)

# Pandas
start = time.perf_counter()
df_pandas = pd.read_parquet('duckdb_data/large.parquet')
df_filtered_pandas = df_pandas[df_pandas['amount'] > 500].groupby('category')['amount'].sum()
pandas_time2 = time.perf_counter() - start

print(f"Pandas: {pandas_time2:.3f}s")

# DuckDB
start = time.perf_counter()
result_duckdb2 = duckdb.sql("""
    SELECT category, SUM(amount) as total
    FROM 'duckdb_data/large.parquet'
    WHERE amount > 500
    GROUP BY category
""").df()
duckdb_time2 = time.perf_counter() - start

print(f"DuckDB: {duckdb_time2:.3f}s")
print(f"\n✓ Pandas/DuckDB time ratio: {pandas_time2 / duckdb_time2:.1f}x (> 1 means DuckDB was faster)")
print("\n💡 DuckDB tends to shine on large datasets and complex analytical queries")
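
Timing each engine once, as above, is sensitive to caching and background noise. A slightly more robust approach repeats each measurement and keeps the fastest run. The helper below is a sketch built on the standard `timeit` module; the names `best_of` and `compare` are hypothetical, and the two stand-in workloads would be replaced by functions wrapping the Pandas and DuckDB queries.

```python
import timeit
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time (seconds) over several single runs."""
    return min(timeit.repeat(fn, number=1, repeat=repeats))


def compare(baseline: Callable[[], object], candidate: Callable[[], object]) -> float:
    """Ratio baseline/candidate; values above 1 mean the candidate was faster."""
    return best_of(baseline) / best_of(candidate)


# Stand-in workloads so the sketch runs on its own
def slow_workload() -> int:
    return sum(i * i for i in range(200_000))


def fast_workload() -> int:
    return sum(range(200_000))


ratio = compare(slow_workload, fast_workload)
print(f"time ratio: {ratio:.1f}x")
```

Taking the minimum rather than the mean reduces the influence of one-off slowdowns such as a cold file cache.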

## 10. Reading from S3 (remote sources)

DuckDB can read directly from S3 and HTTP(S) through the `httpfs` extension; Azure Blob Storage requires the separate `azure` extension.

In [None]:
# Install and load the httpfs extension (INSTALL downloads it, so network access is needed)
conn = duckdb.connect()
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")

print("✓ httpfs extension installed and loaded")

# Example: read a public file on S3
# Note: private S3 buckets require credentials

# S3 configuration (if needed)
# conn.execute("""
#     SET s3_region='us-east-1';
#     SET s3_access_key_id='your_key';
#     SET s3_secret_access_key='your_secret';
# """)

# Reading from a public S3 bucket (example)
# result = conn.execute("""
#     SELECT * FROM 's3://bucket-name/path/file.parquet'
#     LIMIT 10
# """).df()

print("\n💡 Example S3 configuration:")
print("""
    -- Configure S3
    SET s3_region='us-east-1';
    SET s3_access_key_id='your_key';
    SET s3_secret_access_key='your_secret';
    
    -- Query S3 directly
    SELECT * FROM 's3://my-bucket/data/*.parquet'
    WHERE date >= '2024-01-01';
""")

conn.close()
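
Recent DuckDB releases (0.10 and later) also provide a Secrets Manager, where credentials are declared with `CREATE SECRET` instead of the `SET s3_*` variables. The sketch below only builds the SQL statements so that it runs without network access or real credentials; the helper `s3_setup_statements` and the secret name `s3_demo` are hypothetical, and the key values are the same placeholders used above.

```python
def s3_setup_statements(key_id: str, secret: str, region: str) -> list[str]:
    """Build the SQL needed to configure S3 access through a DuckDB secret."""

    def quote(value: str) -> str:
        # SQL string literal, with embedded single quotes doubled
        return "'" + value.replace("'", "''") + "'"

    return [
        "INSTALL httpfs",
        "LOAD httpfs",
        "CREATE OR REPLACE SECRET s3_demo ("
        f"TYPE S3, KEY_ID {quote(key_id)}, SECRET {quote(secret)}, REGION {quote(region)})",
    ]


statements = s3_setup_statements("your_key", "your_secret", "us-east-1")
for statement in statements:
    print(statement)
    # conn.execute(statement)  # requires network access and real credentials
```

In practice the credentials would come from environment variables or a credentials file rather than from literals in the notebook.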

## 11. Case study: full analysis of the e-commerce dataset

In [None]:
# Complete analysis pipeline with DuckDB

print("FULL E-COMMERCE SALES ANALYSIS")
print("="*70)

# 1. Overview
overview = duckdb.sql("""
    SELECT 
        COUNT(*) as total_orders,
        COUNT(DISTINCT customer_id) as unique_customers,
        COUNT(DISTINCT product_name) as unique_products,
        SUM(total_amount) as total_revenue,
        AVG(total_amount) as avg_order_value,
        MIN(order_date) as first_order,
        MAX(order_date) as last_order
    FROM df_sales
""").df()

print("\n1. OVERVIEW")
print(overview.T)

# 2. Top customers (RFM-like)
top_customers = duckdb.sql("""
    SELECT 
        customer_id,
        COUNT(*) as order_count,
        SUM(total_amount) as lifetime_value,
        AVG(total_amount) as avg_order,
        MAX(order_date) as last_order_date,
        DATE_DIFF('day', MAX(order_date), CURRENT_DATE) as days_since_last_order
    FROM df_sales
    GROUP BY customer_id
    ORDER BY lifetime_value DESC
    LIMIT 10
""").df()

print("\n2. TOP 10 CUSTOMERS (by value)")
print(top_customers.to_string())

# 3. Time trends
trends = duckdb.sql("""
    SELECT 
        DATE_TRUNC('hour', order_date) as hour,
        COUNT(*) as orders,
        SUM(total_amount) as revenue,
        AVG(total_amount) as avg_order
    FROM df_sales
    GROUP BY hour
    ORDER BY hour
    LIMIT 24
""").df()

print("\n3. HOURLY TRENDS (first 24 hours)")
print(trends.to_string())

# 4. Product analysis with ranking
products = duckdb.sql("""
    WITH product_stats AS (
        SELECT 
            product_name,
            category,
            COUNT(*) as orders,
            SUM(quantity) as units_sold,
            SUM(total_amount) as revenue
        FROM df_sales
        GROUP BY product_name, category
    )
    SELECT 
        *,
        RANK() OVER (ORDER BY revenue DESC) as revenue_rank,
        ROUND(100.0 * revenue / SUM(revenue) OVER (), 2) as revenue_pct
    FROM product_stats
    ORDER BY revenue DESC
""").df()

print("\n4. PRODUCT ANALYSIS")
print(products.to_string())

# 5. Cohort analysis (simplified)
cohort = duckdb.sql("""
    WITH customer_first_order AS (
        SELECT 
            customer_id,
            MIN(DATE_TRUNC('month', order_date)) as cohort_month
        FROM df_sales
        GROUP BY customer_id
    )
    SELECT 
        cfo.cohort_month,
        COUNT(DISTINCT cfo.customer_id) as cohort_size,
        SUM(s.total_amount) as cohort_revenue
    FROM customer_first_order cfo
    JOIN df_sales s ON cfo.customer_id = s.customer_id
    GROUP BY cfo.cohort_month
    ORDER BY cfo.cohort_month
""").df()

print("\n5. COHORT ANALYSIS (by acquisition month)")
print(cohort.to_string())

print("\n" + "="*70)
print("✓ Full analysis completed with DuckDB!")

## Common pitfalls

### 1. DuckDB is in-memory by default

In [None]:
# ❌ In-memory DuckDB: data is lost once the connection is closed
# conn = duckdb.connect()  # In-memory database

# ✅ Persistent database
conn_persist = duckdb.connect('my_analytics.duckdb')  # File on disk
conn_persist.execute("CREATE TABLE IF NOT EXISTS test (id INTEGER)")
conn_persist.close()

print("✓ Use a .duckdb file for persistence")
print("💡 DuckDB default = in-memory (like SQLite ':memory:')")

### 2. Concurrency limitations

In [None]:
# ⚠️ DuckDB: a database file accepts a single read-write process at a time
print("⚠️ LIMITATIONS:")
print("  - One read-write process per database file (several read-only processes are allowed)")
print("  - No remote server (embedded database)")
print("  - No replication / high availability")
print("\n✓ For multi-user production: BigQuery, Snowflake, PostgreSQL")
print("✓ DuckDB = local exploration, prototyping, laptop ETL")

### 3. Column names with spaces

In [None]:
# Watch out for column names containing spaces or special characters
df_spaces = pd.DataFrame({
    'Order ID': [1, 2, 3],
    'Total Amount': [100, 200, 300]
})

# ✅ Use double quotes
result = duckdb.sql("""
    SELECT "Order ID", "Total Amount"
    FROM df_spaces
""")

print("✓ Columns with spaces: use double quotes")
print(result)

## Mini-exercises

### Exercise 1: Multi-file query

1. Create 3 Parquet files with sales from different months  
2. Query them all in a single query with UNION ALL  
3. Compute the total per month

In [None]:
# Your code here


### Exercise 2: Advanced window functions

Starting from df_sales:  
1. Rank customers by total amount spent  
2. For each customer, compute the difference between the current order and the previous one (LAG)  
3. Identify the customers in the top 10%

In [None]:
# Your code here


### Exercise 3: Full pipeline

1. Read the sales.parquet file  
2. Keep the orders above 1000€  
3. Join with customers.parquet  
4. Compute the total per segment and country  
5. Export the result to CSV

In [None]:
# Your code here


## Exercise solutions

In [None]:
# Solution Exercise 1
# Create 3 monthly files
for month in [1, 2, 3]:
    df_month = pd.DataFrame({
        'date': pd.date_range(f'2024-{month:02d}-01', periods=100, freq='h'),  # 'H' is deprecated since pandas 2.2
        'product': np.random.choice(['A', 'B', 'C'], 100),
        'amount': np.random.uniform(10, 500, 100)
    })
    df_month.to_parquet(f'duckdb_data/sales_month_{month}.parquet', index=False)

print("✓ 3 monthly files created\n")

# UNION ALL query
result_ex1 = duckdb.sql("""
    WITH all_sales AS (
        SELECT * FROM 'duckdb_data/sales_month_1.parquet'
        UNION ALL
        SELECT * FROM 'duckdb_data/sales_month_2.parquet'
        UNION ALL
        SELECT * FROM 'duckdb_data/sales_month_3.parquet'
    )
    SELECT 
        DATE_TRUNC('month', date) as month,
        COUNT(*) as orders,
        SUM(amount) as total
    FROM all_sales
    GROUP BY month
    ORDER BY month
""").df()

print("Solution Exercise 1:")
print(result_ex1)

In [None]:
# Solution Exercise 2
# Parts 1 and 3: rank customers by total spend and keep the top 10%
result_ex2 = duckdb.sql("""
    WITH customer_totals AS (
        SELECT 
            customer_id,
            SUM(total_amount) as total_spent
        FROM df_sales
        GROUP BY customer_id
    ),
    ranked_customers AS (
        SELECT 
            customer_id,
            total_spent,
            RANK() OVER (ORDER BY total_spent DESC) as rank,
            NTILE(10) OVER (ORDER BY total_spent DESC) as decile
        FROM customer_totals
    )
    SELECT 
        rc.customer_id,
        rc.total_spent,
        rc.rank,
        rc.decile,
        CASE WHEN rc.decile = 1 THEN 'Top 10%' ELSE 'Others' END as segment
    FROM ranked_customers rc
    WHERE rc.decile = 1
    ORDER BY rc.total_spent DESC
    LIMIT 20
""").df()

print("Solution Exercise 2 (top 10% customers):")
print(result_ex2.to_string())

# Part 2: difference between each order and the same customer's previous order (LAG)
result_ex2_diffs = duckdb.sql("""
    SELECT 
        customer_id,
        order_date,
        total_amount,
        LAG(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) as prev_amount,
        total_amount - LAG(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) as diff
    FROM df_sales
    ORDER BY customer_id, order_date
    LIMIT 20
""").df()

print("\nSolution Exercise 2 (difference with the previous order):")
print(result_ex2_diffs.to_string())

In [None]:
# Solution Exercise 3
duckdb.sql("""
    COPY (
        SELECT 
            c.segment,
            c.country,
            COUNT(*) as high_value_orders,
            SUM(s.total_amount) as total_revenue,
            AVG(s.total_amount) as avg_order_value
        FROM 'duckdb_data/sales.parquet' s
        JOIN 'duckdb_data/customers.parquet' c
            ON s.customer_id = c.customer_id
        WHERE s.total_amount > 1000
        GROUP BY c.segment, c.country
        ORDER BY total_revenue DESC
    ) TO 'duckdb_data/high_value_by_segment.csv' (HEADER, DELIMITER ',')
""")

print("✓ Solution Exercise 3: pipeline executed")
print("  Result exported: duckdb_data/high_value_by_segment.csv")

# Verify
df_verify = pd.read_csv('duckdb_data/high_value_by_segment.csv')
print(f"\nPreview of the result ({len(df_verify)} rows):")
print(df_verify.head(10).to_string())

## Summary

### Key points

1. **DuckDB** = SQLite for analytics, optimized for OLAP
2. **SQL on files**: query CSV/Parquet/JSON without an import step
3. **Pandas integration**: DuckDB sees your DataFrames automatically
4. **Performance**: often 5-10x faster than Pandas on analytical queries, depending on data size and query shape
5. **Advanced functions**: window functions, PIVOT, UNPIVOT, ROLLUP
6. **Easy export**: COPY TO for Parquet, CSV, JSON
7. **Remote sources**: read from S3 and HTTP(S) via `httpfs`, Azure Blob via the `azure` extension
8. **Limitations**: embedded (no server), a single read-write process per database file

### When to use DuckDB?

✅ **Use DuckDB for**:  
- Exploring local data  
- Prototyping before moving to the cloud  
- ETL on a laptop  
- Replacing Pandas on large volumes (roughly < 100 GB)  
- Complex analytical queries  

❌ **Do NOT use DuckDB for**:  
- Multi-user web applications  
- High-concurrency production workloads  
- Volumes well above 100 GB (prefer Spark)  
- Distributed data  

### Next steps

- Next notebook: **ETL with Python**
- Going further: MotherDuck (DuckDB in the cloud), DuckDB extensions