# Negotiating the Past

To install the environment, please read README.md

## Project Overview

This presentation explores the intersection of historical imagination, artificial intelligence, and collective memory. We'll examine how users "negotiate" with AI systems to express their conceptions of the past, and how these interactions can reveal tensions between user expectations and AI-embedded historical patterns.

## Structure

1. **Theoretical Framework**: How LLMs encode historical perspectives 
2. **Methodological Approach**: Analyzing historical references in prompts
3. **Results Analysis**: What prompt analysis reveals about historical imagination
4. **Conclusion**: New spaces for historical negotiation

# Part I: Theoretical Framework

## LLMs and Embedded Historical Patterns

- LLMs as Statistical Pattern Recognizers
- Training Data as Historical Record
- Historical Biases in Language Models
- "Stochastic Parrots" and Historical Truth

## Technical Foundation of LLMs

LLMs rely on transformer architectures that predict tokens based on previous context. Their "knowledge" of history comes from statistical patterns in training data, not genuine understanding. This creates an interesting dynamic when users prompt these systems about historical topics: the systems' responses reveal historical narratives embedded in their training data.

## Historical Knowledge in Vector Space

- Word embeddings capture semantic relationships
- Historical concepts represented as vectors
- Temporal relationships encoded in semantic proximity
- Cultural associations embedded in language patterns

Within the vector space of LLMs, historical concepts are encoded as points in multidimensional space. The relationships between historical events, figures, and concepts are captured in the distances and directions between these vectors. These semantic relationships reflect collective memory patterns from the training corpus.
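
As a rough illustration of "distance" in such a space, the sketch below compares hand-made three-dimensional vectors with cosine similarity. The vectors and their values are invented for this example; real embeddings have hundreds of dimensions and are learned from the training corpus.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "embeddings", chosen by hand for illustration only
toy_vectors = {
    "napoleon": np.array([0.9, 0.8, 0.1]),
    "waterloo": np.array([0.8, 0.9, 0.2]),
    "smartphone": np.array([0.1, 0.2, 0.9]),
}

sim_related = cosine_sim(toy_vectors["napoleon"], toy_vectors["waterloo"])
sim_unrelated = cosine_sim(toy_vectors["napoleon"], toy_vectors["smartphone"])

print(f"napoleon vs waterloo:   {sim_related:.2f}")
print(f"napoleon vs smartphone: {sim_unrelated:.2f}")
```

In this toy setup the two related terms point in nearly the same direction, while the unrelated term does not. The same measure is used later in the notebook to compare prompt embeddings with historical concept embeddings.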

## Collective Memory and LLMs

- LLMs as repositories of digitized collective memory
- Training data selection as memory politics
- The "averaged" nature of AI-generated historical narratives
- Absence of contested memory in statistical consensus

From a memory studies perspective, LLMs function as repositories of digitized collective memory. The selection of training data constitutes a form of memory politics, determining which historical perspectives are included or excluded. The statistical nature of these models produces "averaged" historical narratives that often elide contestation and complexity.

# Part II: Methodological Approach

## The Challenge of Historical Prompt Identification

- Beyond simple keyword approaches
- Historical references: explicit vs. implicit
- Temporality in language
- Building a robust identification strategy

Identifying prompts that reference history requires more sophisticated approaches than simple keyword matching. Historical references can be explicit ("Napoleon Bonaparte") or implicit ("the Emperor's exile"), and may involve complex temporal markers. Our methodology must capture this complexity.
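
The limitation of keyword matching can be sketched in a few lines. The keyword list and the three example prompts below are invented for illustration and are not taken from the dataset.

```python
import re

# Hypothetical keyword list; a real study would need a far larger vocabulary
HISTORY_KEYWORDS = ["napoleon", "world war", "medieval", "renaissance"]

def keyword_match(prompt, keywords=HISTORY_KEYWORDS):
    """Return True if any keyword occurs in the prompt as a whole word or phrase."""
    text = prompt.lower()
    return any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in keywords)

examples = [
    "Tell me about Napoleon Bonaparte",        # explicit reference
    "Describe the Emperor's exile",            # implicit reference
    "Write a recipe for chocolate cake",       # no historical reference
]

matches = [keyword_match(p) for p in examples]
for prompt, hit in zip(examples, matches):
    print(f"{hit!s:5}  {prompt}")
```

The implicit reference is missed because none of its words appear in the list, which is the motivation for the embedding-based approach used in the rest of this part.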

In [5]:
!python --version
!pip install umap-learn
!pip install upa

Python 3.11.11


## Creating a Historical Prompt Dataset

In [6]:
# Load the first 100,000 rows of our dataset of 10 million prompts
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
import ssl

# Work around SSL certificate errors so the necessary NLTK resources can be downloaded

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt')
nltk.download('stopwords')

# Read the CSV file
prompts_df = pd.read_csv("data/prompts.csv", nrows=100000)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/frederic.clavert/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/frederic.clavert/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Historical Reference Detection Approach With DistilBERT

In [7]:
import pandas as pd
import numpy as np
from transformers import DistilBertTokenizer, DistilBertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
from umap import UMAP
from sklearn.cluster import DBSCAN
from tqdm import tqdm  # For progress bars
import gc  # Garbage collection
import multiprocessing
from functools import partial
import os

# Set the number of CPU cores to use
num_cpu_cores = 8
torch.set_num_threads(num_cpu_cores)  # Set PyTorch to use your 8 cores
os.environ["OMP_NUM_THREADS"] = str(num_cpu_cores)  # OpenMP threads
os.environ["MKL_NUM_THREADS"] = str(num_cpu_cores)  # MKL threads

# Load DistilBERT model and tokenizer
print("Loading DistilBERT model...")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

# Switch to CPU mode explicitly
model = model.to("cpu")
print("Model loaded on CPU")

# Function to get DistilBERT embeddings for a batch of texts
def process_batch(batch_texts, tokenizer, model):
    # Filter out None values and empty strings
    batch_texts = [text for text in batch_texts if isinstance(text, str) and text.strip()]
    
    if not batch_texts:
        return np.array([])
        
    # Tokenize the texts
    encoded_input = tokenizer(batch_texts, padding=True, truncation=True, 
                             max_length=256, return_tensors='pt')
    
    # Compute token embeddings
    with torch.no_grad():
        outputs = model(**encoded_input)
    
    # Use the [CLS] token embedding as the sentence embedding
    sentence_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    
    # Clear CUDA cache if using GPU
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        
    return sentence_embeddings

# Function to split a list into chunks
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# Function to get embeddings with multiprocessing
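def get_distilbert_embeddings_parallel(texts, tokenizer, model, batch_size=32, n_processes=8):
    # Split texts into batches
    batches = list(chunks(texts, batch_size))
    process_func = partial(process_batch, tokenizer=tokenizer, model=model)
    
    if n_processes <= 1:
        # Run in the current process: avoids pool start-up and pickling the model
        results = [process_func(batch) for batch in tqdm(batches, desc="Generating embeddings")]
    else:
        # Process batches in parallel; the model is pickled and sent to the workers
        with multiprocessing.Pool(processes=n_processes) as pool:
            results = list(tqdm(pool.imap(process_func, batches), total=len(batches), desc="Generating embeddings"))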
    
    # Filter out empty results and concatenate
    results = [r for r in results if r.size > 0]
    if not results:
        return np.array([])
    
    return np.concatenate(results, axis=0)

# Memory-efficient function to calculate cosine similarities in chunks
def calculate_similarities_in_chunks(embeddings_a, embeddings_b, chunk_size=1000):
    n_samples = embeddings_a.shape[0]
    n_features = embeddings_b.shape[0]
    similarities = np.zeros((n_samples, n_features))
    
    for i in tqdm(range(0, n_samples, chunk_size), desc="Calculating similarities"):
        end_idx = min(i + chunk_size, n_samples)
        chunk_similarities = cosine_similarity(
            embeddings_a[i:end_idx], 
            embeddings_b
        )
        similarities[i:end_idx] = chunk_similarities
        
    return similarities

# Main processing pipeline
def process_prompts_for_historical_content(prompts_df, sample_size=100000):
    # Sample if dataset is large
    if len(prompts_df) > sample_size:
        sample_df = prompts_df.sample(sample_size, random_state=42)
        print(f"Sampled {sample_size} prompts from {len(prompts_df)} total")
    else:
        sample_df = prompts_df.copy()
        print(f"Processing all {len(sample_df)} prompts")
    
    # Clean memory
    gc.collect()
    
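    # Keep only non-empty string prompts: process_batch drops empty inputs,
    # so filtering here keeps DataFrame rows aligned with the embeddings
    is_valid = sample_df['prompt'].apply(lambda t: isinstance(t, str) and bool(t.strip()))
    sample_df = sample_df[is_valid].copy()
    prompts_list = sample_df['prompt'].tolist()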
    
    # Generate embeddings in parallel
    print("Generating prompt embeddings...")
    prompt_embeddings = get_distilbert_embeddings_parallel(
        prompts_list, 
        tokenizer, 
        model, 
        batch_size=32,  # Smaller batches for memory efficiency
        n_processes=num_cpu_cores
    )
    
    # Define historical concepts
    historical_concepts = [
        "ancient history", "medieval times", "renaissance period", 
        "world war", "cold war", "industrial revolution",
        "historic events", "historical figures", "archaeological findings",
        "prehistoric era", "colonial period", "civil rights movement",
        "past civilizations", "historical artifacts", "ancient empires",
        "historical battles", "kings and queens", "historical architecture"
    ]
    
    # Generate embeddings for historical concepts
    print("Generating historical concept embeddings...")
    historical_embeddings = get_distilbert_embeddings_parallel(
        historical_concepts, 
        tokenizer, 
        model, 
        batch_size=len(historical_concepts),  # Process all concepts in one batch
        n_processes=1  # No need for parallelism with small number of concepts
    )
    
    # Clean memory
    gc.collect()
    
    # Calculate similarities in memory-efficient chunks
    print("Calculating similarities...")
    similarity_matrix = calculate_similarities_in_chunks(
        prompt_embeddings, 
        historical_embeddings, 
        chunk_size=1000
    )
    
    # For each prompt, get the maximum similarity to any historical concept
    max_similarities = np.max(similarity_matrix, axis=1)
    
    # Find the index of the most similar historical concept for each prompt
    most_similar_concept_idx = np.argmax(similarity_matrix, axis=1)
    sample_df['most_similar_concept'] = [historical_concepts[idx] for idx in most_similar_concept_idx]
    sample_df['similarity_score'] = max_similarities
    
    # Filter prompts with similarity scores above a threshold
    threshold = 0.6  # Adjust based on your needs
    historical_prompts = sample_df[max_similarities > threshold].copy()
    
    print(f"Identified {len(historical_prompts)} historical prompts ({len(historical_prompts)/len(sample_df):.2%} of sample)")
    
    # Clean memory before UMAP/clustering
    del similarity_matrix
    gc.collect()
    
    return historical_prompts, prompt_embeddings, max_similarities

# Assuming prompts_df is already loaded
# historical_prompts, prompt_embeddings, max_similarities = process_prompts_for_historical_content(prompts_df)

# UMAP projection and DBSCAN clustering of the prompts identified as historical
def analyze_historical_clusters(historical_prompts, prompt_embeddings, max_similarities, threshold=0.6):
    if len(historical_prompts) > 10:  # Need at least a few samples for clustering
        # Get embeddings for the filtered historical prompts
        historical_embeddings = prompt_embeddings[max_similarities > threshold]
        
        print("Applying UMAP dimensionality reduction...")
        # Apply UMAP with CPU optimization
        umap_model = UMAP(n_neighbors=15, min_dist=0.1, random_state=42, 
                          n_jobs=num_cpu_cores)  # Use all cores
        historical_umap = umap_model.fit_transform(historical_embeddings)
        
        print("Clustering with DBSCAN...")
        # Apply DBSCAN
        dbscan = DBSCAN(eps=0.5, min_samples=5, n_jobs=num_cpu_cores)  # Use all cores
        cluster_labels = dbscan.fit_predict(historical_umap)
        
        # Add cluster labels to the dataframe
        historical_prompts['cluster'] = cluster_labels
        
        # Generate visualizations and report
        generate_cluster_report(historical_prompts, historical_umap, cluster_labels)
        
        return historical_prompts, historical_umap, cluster_labels
    else:
        print("Not enough historical prompts for clustering")
        return historical_prompts, None, None

def generate_cluster_report(historical_prompts, historical_umap, cluster_labels):
    # Count number of prompts per cluster
    cluster_counts = pd.Series(cluster_labels).value_counts().sort_index()
    print("\nNumber of prompts per cluster:")
    print(cluster_counts)
    
    # Visualize the clusters
    plt.figure(figsize=(12, 10))
    # Exclude noise points (cluster -1) for better visualization
    non_noise_mask = cluster_labels != -1
    
    scatter = plt.scatter(
        historical_umap[non_noise_mask, 0], 
        historical_umap[non_noise_mask, 1], 
        c=cluster_labels[non_noise_mask], 
        cmap='tab20', 
        alpha=0.6, 
        s=10
    )
    plt.colorbar(scatter, label='Cluster')
    plt.title('Clusters of Historical Prompts (DistilBERT method)')
    plt.xlabel('UMAP Dimension 1')
    plt.ylabel('UMAP Dimension 2')
    plt.savefig('historical_prompt_clusters_distilbert.png', dpi=300, bbox_inches='tight')
    plt.close()  # Close the figure to free memory
    
    # Sample prompts from each cluster
    print("\nSample prompts from each cluster:")
    for cluster_id in sorted(historical_prompts['cluster'].unique()):
        if cluster_id == -1:
            continue  # Skip noise points
            
        cluster_prompts = historical_prompts[historical_prompts['cluster'] == cluster_id]
        print(f"\nCluster {cluster_id} ({len(cluster_prompts)} prompts):")
        
        # Sample up to 5 prompts from this cluster
        samples = cluster_prompts.sample(min(5, len(cluster_prompts)))
        for _, row in samples.iterrows():
            print(f"  - {row['prompt'][:100]}... (Similarity: {row['similarity_score']:.2f}, Concept: {row['most_similar_concept']})")

# Example usage
if __name__ == "__main__":
    # Load your data
    # prompts_df = pd.read_csv("prompts.csv")
    
    # Run the complete pipeline
    # historical_prompts, prompt_embeddings, max_similarities = process_prompts_for_historical_content(prompts_df)
    # historical_prompts, umap_result, cluster_labels = analyze_historical_clusters(
    #     historical_prompts, prompt_embeddings, max_similarities)
    
    print("Done!")


ImportError: cannot import name 'UMAP' from 'umap' (/opt/anaconda3/envs/negotiating_past/lib/python3.11/site-packages/umap/__init__.py)

### 1. Visualizing the distribution of similarity scores

This cell shows how the similarity scores are distributed and helps assess whether the chosen threshold (0.6) is appropriate.
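
Before plotting, it can help to see how strongly the share of retained prompts depends on the threshold. The sketch below uses synthetic scores standing in for `max_similarities`; the distribution parameters and the helper name `share_above_threshold` are assumptions made for this example, not results from the dataset.

```python
import numpy as np

def share_above_threshold(scores, thresholds):
    """Map each threshold to the fraction of scores strictly above it."""
    scores = np.asarray(scores)
    return {float(t): float(np.mean(scores > t)) for t in thresholds}

# Synthetic scores (not real results): normal distribution clipped to [0, 1]
rng = np.random.default_rng(0)
synthetic_scores = rng.normal(loc=0.5, scale=0.1, size=10_000).clip(0, 1)

shares = share_above_threshold(synthetic_scores, [0.5, 0.6, 0.7])
for t, share in shares.items():
    print(f"threshold {t:.1f}: {share:.1%} of prompts retained")
```

If small changes in the threshold shift the retained share substantially, the cut-off sits in a dense part of the distribution and deserves a manual check of borderline prompts.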

In [None]:
def visualize_similarity_distribution(similarity_scores, threshold=0.6, output_path='./figures/'):
    """
    Plot the distribution of similarity scores and mark the threshold used.
    
    Args:
        similarity_scores: NumPy array of similarity scores
        threshold: Threshold used to filter historical prompts
        output_path: Directory in which to save the figures
    """
    import matplotlib.pyplot as plt
    import numpy as np
    import os
    
    os.makedirs(output_path, exist_ok=True)
    
    plt.figure(figsize=(10, 6))
    
    # Histogram of similarity scores
    plt.hist(similarity_scores, bins=50, alpha=0.7, color='steelblue')
    
    # Vertical line marking the threshold
    plt.axvline(x=threshold, color='red', linestyle='--', 
                label=f'Threshold ({threshold})')
    
    # Title, axis labels, and legend
    plt.title('Distribution of similarity scores with historical concepts', 
              fontsize=14)
    plt.xlabel('Similarity score', fontsize=12)
    plt.ylabel('Number of prompts', fontsize=12)
    plt.legend()
    plt.grid(alpha=0.3)
    
    # Annotate key statistics (anchored at the top so the text stays inside the axes)
    plt.text(0.02, 0.95, 
             f"Total: {len(similarity_scores)}\nHistorical: {np.sum(similarity_scores > threshold)} ({np.mean(similarity_scores > threshold):.1%})",
             transform=plt.gca().transAxes, verticalalignment='top',
             bbox=dict(facecolor='white', alpha=0.8))
    
    plt.tight_layout()
    plt.savefig(os.path.join(output_path, 'similarity_distribution.png'), dpi=300)
    plt.show()

# First, run the main pipeline to generate the required data
historical_prompts, prompt_embeddings, max_similarities = process_prompts_for_historical_content(prompts_df)

# Example usage
visualize_similarity_distribution(max_similarities)

### 2. Heatmap of similarities between historical concepts

This cell shows how the historical concepts relate to one another.

In [None]:
def visualize_historical_concepts_similarity(historical_concepts, historical_embeddings, output_path='./figures/'):
    """
    Create a heatmap showing the similarities between historical concepts.
    
    Args:
        historical_concepts: List of historical concepts
        historical_embeddings: Embeddings of the historical concepts
        output_path: Directory in which to save the figures
    """
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    from sklearn.metrics.pairwise import cosine_similarity
    import os
    
    os.makedirs(output_path, exist_ok=True)
    
    # Compute the similarity matrix between concepts
    concept_similarity = cosine_similarity(historical_embeddings)
    
    # Create a figure of suitable size
    plt.figure(figsize=(12, 10))
    
    # Draw the heatmap with seaborn
    sns.heatmap(concept_similarity, annot=True, fmt=".2f", cmap="YlGnBu",
                xticklabels=historical_concepts, yticklabels=historical_concepts)
    
    plt.title("Similarity between historical concepts", fontsize=16)
    plt.tight_layout()
    plt.savefig(os.path.join(output_path, 'historical_concepts_similarity.png'), dpi=300)
    plt.show()

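# Example usage
# Note: historical_concepts and historical_embeddings are local to
# process_prompts_for_historical_content; expose them (e.g. return them) before calling this.
# visualize_historical_concepts_similarity(historical_concepts, historical_embeddings)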

### 3. UMAP projection of prompts colored by most similar concept
This visualization shows how the prompts group naturally and whether the most similar concepts form coherent clusters.

In [None]:
def visualize_umap_by_concept(historical_prompts, historical_umap, output_path='./figures/'):
    """
    Plot the UMAP projection of the prompts, colored by most similar historical concept.
    
    Args:
        historical_prompts: DataFrame of historical prompts with their most similar concept
        historical_umap: UMAP coordinates of the historical prompts
        output_path: Directory in which to save the figures
    """
    import matplotlib.pyplot as plt
    import numpy as np
    import os
    from matplotlib.colors import ListedColormap
    
    os.makedirs(output_path, exist_ok=True)
    
    # Get the unique concepts
    unique_concepts = historical_prompts['most_similar_concept'].unique()
    n_concepts = len(unique_concepts)
    
    # Map each concept to an index
    concept_to_idx = {concept: i for i, concept in enumerate(unique_concepts)}
    
    # Build a NumPy array of concept indices
    concept_indices = np.array([concept_to_idx[concept] for concept in historical_prompts['most_similar_concept']])
    
    # Build a colormap with enough distinct colors
    import matplotlib.cm as cm
    if n_concepts <= 10:
        cmap = ListedColormap(plt.cm.tab10.colors[:n_concepts])
    elif n_concepts <= 20:
        cmap = ListedColormap(plt.cm.tab20.colors[:n_concepts])
    else:
        cmap = plt.cm.nipy_spectral
    
    plt.figure(figsize=(14, 10))
    
    # Scatter plot colored by concept
    scatter = plt.scatter(historical_umap[:, 0], historical_umap[:, 1], 
                         c=concept_indices, cmap=cmap, 
                         alpha=0.7, s=10)
    
    # Build an explicit legend
    from matplotlib.lines import Line2D
    legend_elements = [Line2D([0], [0], marker='o', color='w', 
                              markerfacecolor=cmap(concept_to_idx[concept]), 
                              markersize=8, label=concept)
                       for concept in unique_concepts]
    
    plt.legend(handles=legend_elements, loc='upper right', 
               bbox_to_anchor=(1.1, 1), ncol=1)
    
    plt.title('UMAP projection of historical prompts by concept', fontsize=14)
    plt.xlabel('UMAP Dimension 1', fontsize=12)
    plt.ylabel('UMAP Dimension 2', fontsize=12)
    plt.tight_layout()
    plt.savefig(os.path.join(output_path, 'umap_by_concept.png'), dpi=300)
    plt.show()

# Example usage
# visualize_umap_by_concept(historical_prompts, historical_umap)

### 4. Word cloud for each cluster

This visualization explores the most frequent terms in each cluster.

In [None]:
def generate_cluster_wordclouds(historical_prompts, output_path='./figures/'):
    """
    Generate a word cloud for each identified cluster.
    
    Args:
        historical_prompts: DataFrame of historical prompts with their cluster labels
        output_path: Directory in which to save the figures
    """
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    import os
    import nltk
    from nltk.corpus import stopwords
    
    # Download the stopwords if needed
    try:
        nltk.data.find('corpora/stopwords')
    except LookupError:
        nltk.download('stopwords')
    
    stop_words = set(stopwords.words('english'))
    
    os.makedirs(output_path, exist_ok=True)
    
    # For each cluster
    for cluster_id in sorted(historical_prompts['cluster'].unique()):
        if cluster_id == -1:  # Skip noise points
            continue
            
        # Select the prompts in this cluster
        cluster_prompts = historical_prompts[historical_prompts['cluster'] == cluster_id]['prompt']
        
        if len(cluster_prompts) == 0:
            continue
            
        # Concatenate all texts
        text = ' '.join(cluster_prompts)
        
        # Build the word cloud
        wordcloud = WordCloud(
            width=800, 
            height=400, 
            background_color='white',
            stopwords=stop_words,
            max_words=100,
            contour_width=3
        ).generate(text)
        
        # Render and save
        plt.figure(figsize=(16, 8))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title(f'Word cloud for cluster {cluster_id} ({len(cluster_prompts)} prompts)', 
                 fontsize=16)
        plt.tight_layout()
        plt.savefig(os.path.join(output_path, f'wordcloud_cluster_{cluster_id}.png'), dpi=300)
        plt.close()
    
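    # Count real clusters only; the noise label (-1) may or may not be present
    n_clusters = historical_prompts.loc[historical_prompts['cluster'] != -1, 'cluster'].nunique()
    print(f"Word clouds generated for {n_clusters} clusters")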

# Example usage
# generate_cluster_wordclouds(historical_prompts)

### 5. Distribution of historical concepts by cluster
This visualization shows how the different historical concepts are distributed across the identified clusters.
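
The heatmap in this section is built from a row-normalized contingency table. As a minimal sketch of what that normalization does, the example below applies `pd.crosstab` to a tiny invented DataFrame; the five rows are made up for illustration and only reuse the column names from this notebook.

```python
import pandas as pd

# Hypothetical toy data: five prompts, two concepts, two clusters
toy = pd.DataFrame({
    "most_similar_concept": ["cold war", "cold war", "cold war", "world war", "world war"],
    "cluster": [0, 0, 1, 1, 1],
})

# normalize="index" divides each row by its total, so every concept's row sums to 1
shares = pd.crosstab(toy["most_similar_concept"], toy["cluster"], normalize="index")
print(shares)
```

Each cell then reads as "the share of this concept's prompts that fall in this cluster", which makes concepts with very different prompt counts comparable.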

In [None]:
def visualize_concepts_by_cluster(historical_prompts, output_path='./figures/'):
    """
    Create a heatmap showing the distribution of historical concepts by cluster.
    
    Args:
        historical_prompts: DataFrame of historical prompts
        output_path: Directory in which to save the figures
    """
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    import os
    
    os.makedirs(output_path, exist_ok=True)
    
    # Ignore noise points (cluster -1) if present
    if -1 in historical_prompts['cluster'].unique():
        df_filtered = historical_prompts[historical_prompts['cluster'] != -1].copy()
    else:
        df_filtered = historical_prompts.copy()
    
    # Build a contingency table
    cross_tab = pd.crosstab(
        df_filtered['most_similar_concept'], 
        df_filtered['cluster'], 
        normalize='index'
    )
    
    # Sort for a more readable plot:
    # order the concepts by their dominant cluster
    dominant_clusters = cross_tab.idxmax(axis=1)
    sorted_concepts = dominant_clusters.sort_values().index
    cross_tab = cross_tab.loc[sorted_concepts]
    
    plt.figure(figsize=(14, 10))
    sns.heatmap(cross_tab, annot=True, cmap="YlGnBu", fmt='.0%')
    plt.title('Distribution of historical concepts by cluster', fontsize=16)
    plt.ylabel('Historical concept', fontsize=14)
    plt.xlabel('Cluster', fontsize=14)
    plt.tight_layout()
    plt.savefig(os.path.join(output_path, 'concept_cluster_distribution.png'), dpi=300)
    plt.show()

# Example usage
# visualize_concepts_by_cluster(historical_prompts)

### 6. Most common terms by historical concept

This visualization helps identify which terms are most associated with each historical concept.

In [None]:
def analyze_terms_by_concept(historical_prompts, output_path='./figures/'):
    """
    Analyze and plot the most frequent terms for each historical concept.
    
    Args:
        historical_prompts: DataFrame of historical prompts
        output_path: Directory in which to save the figures
    """
    import matplotlib.pyplot as plt
    import os
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from collections import Counter
    
    # Download the required NLTK resources
    try:
        nltk.data.find('tokenizers/punkt')
        nltk.data.find('corpora/stopwords')
    except LookupError:
        nltk.download('punkt')
        nltk.download('stopwords')
    
    stop_words = set(stopwords.words('english'))
    
    os.makedirs(output_path, exist_ok=True)
    
    # For each historical concept
    for concept in historical_prompts['most_similar_concept'].unique():
        # Select the prompts assigned to this concept
        concept_prompts = historical_prompts[historical_prompts['most_similar_concept'] == concept]['prompt']
        
        # Concatenate all prompts
        text = ' '.join(concept_prompts)
        
        # Tokenize
        tokens = word_tokenize(text.lower())
        
        # Drop stopwords, non-alphabetic tokens, and short tokens
        filtered_tokens = [word for word in tokens if word.isalpha() and word not in stop_words and len(word) > 2]
        
        # Count occurrences
        word_counts = Counter(filtered_tokens)
        
        # Keep the N most frequent words
        top_n = 20
        top_words = word_counts.most_common(top_n)
        
        # Prepare the data for the chart
        words = [word for word, count in top_words]
        counts = [count for word, count in top_words]
        
        # Draw the chart
        plt.figure(figsize=(12, 8))
        plt.barh(words[::-1], counts[::-1], color='steelblue')
        plt.xlabel('Frequency')
        plt.title(f'Most frequent terms for "{concept}" ({len(concept_prompts)} prompts)', 
                 fontsize=14)
        plt.tight_layout()
        plt.savefig(os.path.join(output_path, f'terms_{concept.replace(" ", "_")}.png'), dpi=300)
        plt.close()
    
    print(f"Term analysis generated for {len(historical_prompts['most_similar_concept'].unique())} concepts")

# Example usage
# analyze_terms_by_concept(historical_prompts)

### 7. Interactive visualization with Plotly

This cell creates an interactive visualization of the UMAP projection for exploring the historical prompts.

In [None]:
def create_interactive_visualization(historical_prompts, historical_umap, output_path='./figures/'):
    """
    Create an interactive visualization of the historical prompts with Plotly.
    
    Args:
        historical_prompts: DataFrame of historical prompts
        historical_umap: UMAP coordinates of the historical prompts
        output_path: Directory in which to save the figures
    """
    import plotly.express as px
    import pandas as pd
    import os
    
    os.makedirs(output_path, exist_ok=True)
    
    # Build a DataFrame for Plotly
    viz_df = pd.DataFrame({
        'UMAP1': historical_umap[:, 0],
        'UMAP2': historical_umap[:, 1],
        'Cluster': historical_prompts['cluster'],
        'Concept': historical_prompts['most_similar_concept'],
        'Score': historical_prompts['similarity_score'],
        'Prompt': historical_prompts['prompt']
    })
    
    # Build the interactive visualization
    fig = px.scatter(
        viz_df,
        x='UMAP1',
        y='UMAP2',
        color='Concept',
        hover_data=['Prompt', 'Score', 'Cluster'],
        opacity=0.7,
        title='Interactive exploration of historical prompts',
        template='plotly_white',
        color_discrete_sequence=px.colors.qualitative.Bold
    )
    
    # Improve the layout
    fig.update_layout(
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=-0.2,
            xanchor="center",
            x=0.5
        ),
        width=1200,
        height=800
    )
    
    # Save as a standalone HTML file
    fig.write_html(os.path.join(output_path, 'interactive_visualization.html'))
    
    return fig

# Example usage
# fig = create_interactive_visualization(historical_prompts, historical_umap)
# fig.show()  # Display in the notebook

### 8. Concept co-occurrence network across clusters
This visualization shows how historical concepts are linked through their presence in the same clusters.
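
The edge weights of such a network count, for each pair of concepts, the number of clusters in which both appear. A minimal sketch of that counting step, using only the standard library and an invented mapping from cluster ids to concepts (not real clustering output):

```python
from collections import Counter
from itertools import combinations

# Hypothetical cluster contents: cluster id -> concepts found in that cluster
cluster_concepts = {
    0: ["cold war", "world war"],
    1: ["cold war", "world war", "ancient history"],
    2: ["ancient history"],
}

edge_weights = Counter()
for concepts in cluster_concepts.values():
    # Sort so that each unordered pair has a single canonical key
    for pair in combinations(sorted(set(concepts)), 2):
        edge_weights[pair] += 1

for (a, b), weight in sorted(edge_weights.items()):
    print(f"{a} -- {b}: shared clusters = {weight}")
```

A cluster containing a single concept contributes no edge, so concepts that always cluster alone end up as isolated nodes in the graph.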

In [None]:
def visualize_concept_network(historical_prompts, output_path='./figures/'):
    """
    Create a network visualization of the relationships between historical concepts
    based on their co-occurrence in clusters.
    
    Args:
        historical_prompts: DataFrame of historical prompts
        output_path: Directory in which to save the figures
    """
    import networkx as nx
    import matplotlib.pyplot as plt
    import pandas as pd
    import os
    
    os.makedirs(output_path, exist_ok=True)
    
    # Filter out noise points
    if -1 in historical_prompts['cluster'].unique():
        df_filtered = historical_prompts[historical_prompts['cluster'] != -1].copy()
    else:
        df_filtered = historical_prompts.copy()
    
    # Create a graph
    G = nx.Graph()
    
    # Add a node for each concept
    concepts = df_filtered['most_similar_concept'].unique()
    for concept in concepts:
        count = df_filtered[df_filtered['most_similar_concept'] == concept].shape[0]
        G.add_node(concept, size=count, count=count)
    
    # For each cluster, link the concepts that appear in it
    for cluster in df_filtered['cluster'].unique():
        # Get the concepts present in this cluster
        cluster_concepts = df_filtered[df_filtered['cluster'] == cluster]['most_similar_concept'].unique()
        
        # Create an edge for each pair of concepts
        for i, concept1 in enumerate(cluster_concepts):
            for concept2 in cluster_concepts[i+1:]:
                # If the edge already exists, increase its weight
                if G.has_edge(concept1, concept2):
                    G[concept1][concept2]['weight'] += 1
                else:
                    G.add_edge(concept1, concept2, weight=1)
    
    # Node size based on frequency
    node_sizes = [G.nodes[node]['size'] * 20 for node in G.nodes]
    
    # Edge width based on weight
    edge_weights = [G[u][v]['weight'] * 0.5 for u, v in G.edges]
    
    # Position the nodes
    pos = nx.spring_layout(G, seed=42, k=0.3)
    
    plt.figure(figsize=(14, 12))
    
    # Draw the nodes
    nx.draw_networkx_nodes(G, pos, 
                          node_size=node_sizes, 
                          node_color='skyblue', 
                          alpha=0.8)
    
    # Draw the edges
    nx.draw_networkx_edges(G, pos, 
                          width=edge_weights, 
                          alpha=0.5, 
                          edge_color='gray')
    
    # Add the labels
    nx.draw_networkx_labels(G, pos, font_size=10, font_family='sans-serif')
    
    plt.title('Co-occurrence network of historical concepts', fontsize=16)
    plt.axis('off')
    plt.tight_layout()
    plt.savefig(os.path.join(output_path, 'concept_network.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    return G

# Example usage (requires the 'cluster' column added by analyze_historical_clusters)
# concept_network = visualize_concept_network(historical_prompts)