# 🚀 Complete Pipeline: From Raw Text to LSTM Prediction

This notebook combines all the techniques covered so far:
- Text cleaning (Module 2)
- BOW and TF-IDF (Module 3)
- Word embeddings (Module 4)
- LSTM for classification (Module 5)
- Saving and reusing the model

## 📦 1. Importing the libraries

In [ ]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import pickle
import json

# NLP and preprocessing
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Deep Learning
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# Download the required NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

print("✅ All libraries imported!")

## 📊 2. Loading and exploring the data

In [None]:
# Let's create a sample dataset for sentiment classification
# In practice, you would load your own data

# Sample movie reviews (positive and negative)
texts = [
    "This movie was absolutely fantastic! Best film I've seen all year.",
    "Terrible waste of time. The plot made no sense at all.",
    "Amazing performances by all actors. Highly recommend!",
    "Boring and predictable. I fell asleep halfway through.",
    "A masterpiece of cinema. Every scene was perfectly crafted.",
    "Worst movie ever. Complete disaster from start to finish.",
    "Brilliant storytelling and stunning visuals. Oscar-worthy!",
    "Disappointing sequel. Nothing like the original.",
    "Incredible movie! Had me on the edge of my seat.",
    "Total garbage. Want my money back.",
    "Outstanding film with great character development.",
    "Awful acting and terrible dialogue. Skip this one.",
    "Exceptional movie that touches your heart.",
    "Mediocre at best. Nothing special about it.",
    "Phenomenal! Best movie of the decade.",
    "Horrible experience. Left the theater early."
]

# Labels: 1 = positive, 0 = negative
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Create a DataFrame
df = pd.DataFrame({
    'text': texts,
    'sentiment': labels
})

print(f"📊 Dataset created with {len(df)} examples")
print("\n🔍 Sentiment distribution:")
print(df['sentiment'].value_counts())
print("\n📝 Sample texts:")
df.head()

## 🧹 3. Text cleaning (Module 2)

In [None]:
# Initialize the tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def clean_text(text, use_lemmatization=True):
    """
    Full cleaning pipeline for a text string.
    """
    # 1. Convert to lowercase
    text = text.lower()
    
    # 2. Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # 3. Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)
    
    # 4. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # 5. Remove digits
    text = re.sub(r'\d+', '', text)
    
    # 6. Tokenize
    tokens = word_tokenize(text)
    
    # 7. Remove stop words (note: NLTK's English list includes negations such as "not" and "no")
    tokens = [token for token in tokens if token not in stop_words]
    
    # 8. Lemmatize or stem
    if use_lemmatization:
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
    else:
        tokens = [stemmer.stem(token) for token in tokens]
    
    # 9. Join the tokens back together
    cleaned_text = ' '.join(tokens)
    
    return cleaned_text

# Apply the cleaning
df['cleaned_text'] = df['text'].apply(clean_text)

print("🧹 Cleaning complete!")
print("\n📝 Before/after comparison:")
for i in range(3):
    print(f"\nOriginal: {df['text'].iloc[i]}")
    print(f"Cleaned:  {df['cleaned_text'].iloc[i]}")
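
Step 7 of the cleaning function drops every word in the stop-word list, and NLTK's English list contains negations such as "not" and "no". For sentiment analysis this can flip the meaning of a phrase. The sketch below shows one possible way to keep negations. It is standalone: `STOP_WORDS` is a tiny hypothetical list standing in for `stopwords.words('english')`, and `NEGATIONS` and `remove_stop_words` are illustrative names that do not appear in the notebook.

```python
import string

# Hypothetical, tiny stop-word list standing in for stopwords.words('english')
STOP_WORDS = {"the", "is", "a", "but", "be", "could", "not", "no", "this", "it"}
# Negation words we choose to keep (an assumption, not part of the notebook)
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stop_words(text, keep_negations=True):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [tok for tok in text.split() if tok not in stop]

with_negation = remove_stop_words("Not bad, but could be better.")
without_negation = remove_stop_words("Not bad, but could be better.", keep_negations=False)
print(with_negation)     # negation kept
print(without_negation)  # negation lost
```

In the notebook itself, the same idea would amount to subtracting a negation set from `stop_words` before calling `clean_text`.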

## 📊 4. Feature Engineering: BOW and TF-IDF (Module 3)

In [None]:
# Bag of Words
bow_vectorizer = CountVectorizer(max_features=100)
bow_features = bow_vectorizer.fit_transform(df['cleaned_text']).toarray()

print("🎒 Bag of Words:")
print(f"BOW matrix shape: {bow_features.shape}")
print(f"Vocabulary (first 10 words): {list(bow_vectorizer.vocabulary_.keys())[:10]}")

# TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=100)
tfidf_features = tfidf_vectorizer.fit_transform(df['cleaned_text']).toarray()

print("\n📈 TF-IDF:")
print(f"TF-IDF matrix shape: {tfidf_features.shape}")

# Visualize the most important words according to TF-IDF
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_scores = np.mean(tfidf_features, axis=0)
top_indices = np.argsort(tfidf_scores)[::-1][:10]

plt.figure(figsize=(10, 6))
top_words = [feature_names[i] for i in top_indices]
top_scores = [tfidf_scores[i] for i in top_indices]
plt.barh(top_words, top_scores)
plt.xlabel('Mean TF-IDF score')
plt.title('Top 10 words by TF-IDF')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
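
To make the TF-IDF scores plotted above less of a black box, here is a small hand computation in plain Python. It is a sketch under stated assumptions: it uses the smoothed IDF formula documented for scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), raw term counts as TF, and a made-up three-document corpus. The names `docs`, `doc_freq` and `tfidf_row` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already cleaned and whitespace-tokenizable)
docs = ["great movie great cast", "terrible movie", "great story"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Raw count times IDF for each vocabulary term, then L2-normalize the row."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for word, value in zip(vocab, matrix[0]):
    print(f"{word:10s} {value:.3f}")
```

Rare terms ("cast") get a higher IDF than terms shared across documents ("great", "movie"), which is why frequent but uninformative words are pushed down in the ranking.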

## 🧠 5. Preparing the data for the LSTM with embeddings (Modules 4 & 5)

In [None]:
# Parameters for the LSTM model
MAX_WORDS = 1000  # Vocabulary size
MAX_SEQUENCE_LENGTH = 100  # Maximum sequence length
EMBEDDING_DIM = 100  # Embedding dimension

# Keras Tokenizer to prepare the sequences
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token='<OOV>')
tokenizer.fit_on_texts(df['cleaned_text'])

# Convert the texts to sequences of indices
sequences = tokenizer.texts_to_sequences(df['cleaned_text'])

# Pad the sequences
X = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')
y = np.array(df['sentiment'])

print(f"🔢 Vocabulary size: {len(tokenizer.word_index)}")
print(f"📏 Shape of X: {X.shape}")
print("\n📝 Conversion example:")
print(f"Cleaned text: {df['cleaned_text'].iloc[0]}")
print(f"Sequence: {sequences[0]}")
print(f"After padding: {X[0][:20]}...")

# Train/test split (stratified so that both classes appear in the small test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
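
The tokenization and padding steps above can be pictured with a plain-Python approximation. This is only a sketch of the idea, not Keras's implementation: it assumes text that is already cleaned, reserves index 0 for padding and index 1 for the out-of-vocabulary token, and leaves out the `num_words` cutoff. `build_word_index` and `to_padded_sequence` are hypothetical helper names.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integer ids: 1 is reserved for OOV, then most frequent words first."""
    counts = Counter(w for t in texts for w in t.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_padded_sequence(text, index, maxlen, oov_token="<OOV>"):
    """Convert a text to ids, cut it at maxlen, then fill the tail with zeros."""
    ids = [index.get(w, index[oov_token]) for w in text.split()]
    ids = ids[:maxlen]                        # same effect as truncating='post'
    return ids + [0] * (maxlen - len(ids))    # same effect as padding='post'

index = build_word_index(["great movie", "great cast", "boring movie"])
padded = to_padded_sequence("great unknown movie", index, maxlen=6)
truncated = to_padded_sequence("great movie cast boring", index, maxlen=2)
print(index)
print(padded)
print(truncated)
```

Unknown words map to the OOV id instead of being dropped, which keeps the sequence length meaningful for texts that contain vocabulary not seen during fitting.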

## 🚀 6. Building the LSTM model

In [ ]:
# Build the LSTM model
model = Sequential([
    # Embedding layer
    Embedding(input_dim=MAX_WORDS, 
              output_dim=EMBEDDING_DIM,
              name='embedding_layer'),
    
    # Bidirectional LSTM to capture context in both directions
    Bidirectional(LSTM(64, dropout=0.3, recurrent_dropout=0.3, return_sequences=True)),
    
    # Second LSTM layer
    LSTM(32, dropout=0.3, recurrent_dropout=0.3),
    
    # Dense layers
    Dense(32, activation='relu'),
    Dropout(0.5),
    
    # Output layer
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Build the model with the input shape
model.build(input_shape=(None, MAX_SEQUENCE_LENGTH))

# Show the architecture
model.summary()

# Visualize the architecture (after building; requires pydot and Graphviz)
try:
    tf.keras.utils.plot_model(model, to_file='lstm_model.png', show_shapes=True, show_layer_names=True)
    print("✅ Model diagram saved to 'lstm_model.png'")
except Exception as e:
    print(f"⚠️ Could not create the diagram: {e}")
    print("The model is still ready for training!")
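
As a sanity check on the architecture, the trainable parameter count can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights and one bias vector) and the hyperparameters used in this notebook. The helper names are illustrative. If the assumption holds, the total should agree with the figure reported by `model.summary()`.

```python
def embedding_params(vocab_size, dim):
    # One dense vector of size `dim` per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

MAX_WORDS, EMBEDDING_DIM = 1000, 100

layer_params = {
    "embedding": embedding_params(MAX_WORDS, EMBEDDING_DIM),
    # Bidirectional wraps two independent LSTMs (forward and backward)
    "bidirectional_lstm_64": 2 * lstm_params(EMBEDDING_DIM, 64),
    # Its input is the concatenation of both directions: 2 * 64 features
    "lstm_32": lstm_params(2 * 64, 32),
    "dense_32": dense_params(32, 32),
    "output": dense_params(32, 1),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:24s} {count:>8,}")
print(f"{'total':24s} {total_params:>8,}")
```

With roughly 200,000 parameters and only 16 training examples, the model can memorize the data easily, which is worth keeping in mind when reading the training curves.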

## 📈 7. Training the model

In [None]:
# Callbacks to control training
callbacks = [
    # Early stopping to avoid overfitting
    EarlyStopping(
        monitor='val_loss',
        patience=5,
        restore_best_weights=True,
        verbose=1
    ),
    
    # Reduce the learning rate when val_loss plateaus
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=3,
        min_lr=1e-7,
        verbose=1
    ),
    
    # Save the best model
    ModelCheckpoint(
        'best_lstm_model.h5',
        monitor='val_loss',
        save_best_only=True,
        verbose=1
    )
]

# Training
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=callbacks,
    verbose=1
)

# Plot the training history
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Loss over epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.title('Accuracy over epochs')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()
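
The `patience` setting of the early-stopping callback can be hard to picture, so here is a simplified simulation of the idea in plain Python. It is a sketch, not Keras's actual implementation: it ignores `min_delta` and weight restoration, uses zero-based epoch numbers, and runs on a made-up list of validation losses. `simulate_early_stopping` is a hypothetical helper.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran through all epochs without exhausting patience
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch
val_losses = [0.70, 0.65, 0.66, 0.64, 0.67, 0.68, 0.69, 0.70, 0.71, 0.60]
stopped_at, best_at = simulate_early_stopping(val_losses, patience=5)
print(f"stopped at epoch {stopped_at}, best epoch was {best_at}")
```

Note that the last value in the list is never reached: patience trades a possible late improvement for shorter training, and `restore_best_weights=True` is what brings the model back to the best epoch rather than the last one.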

## 📊 8. Evaluating the model

In [None]:
# Predictions on the test set
y_pred_proba = model.predict(X_test)
y_pred = (y_pred_proba > 0.5).astype(int).flatten()

# Metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"🎯 Test set accuracy: {accuracy:.2%}")

# Classification report (labels are passed explicitly so the report also works
# when one class is missing from the tiny test set)
print("\n📊 Classification report:")
print(classification_report(y_test, y_pred, labels=[0, 1],
                            target_names=['Négatif', 'Positif'], zero_division=0))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Négatif', 'Positif'],
            yticklabels=['Négatif', 'Positif'])
plt.title('Confusion matrix')
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
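
To connect the numbers in the classification report to the confusion matrix, the same metrics can be computed by hand. The sketch below uses made-up labels and predictions, with rows as true classes and columns as predicted classes (the layout used in the heatmap above). `binary_metrics` is an illustrative helper, not part of the notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute the confusion matrix and basic metrics for labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        # Rows: true class (0, 1); columns: predicted class (0, 1)
        "matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical test labels and predictions
metrics = binary_metrics(y_true=[1, 0, 1, 0], y_pred=[1, 0, 0, 0])
print(metrics)
```

With a test set of only a few examples, each single error moves accuracy by a large step, so these figures are better read as a smoke test than as an estimate of real performance.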

## 💾 9. Saving the model and preprocessors

In [None]:
# 1. Save the Keras model
model.save('sentiment_lstm_model.h5')
print("✅ LSTM model saved to 'sentiment_lstm_model.h5'")

# 2. Save the tokenizer
tokenizer_json = tokenizer.to_json()
with open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(tokenizer_json)
print("✅ Tokenizer saved to 'tokenizer.json'")

# 3. Save the model parameters
model_config = {
    'max_words': MAX_WORDS,
    'max_sequence_length': MAX_SEQUENCE_LENGTH,
    'embedding_dim': EMBEDDING_DIM,
    'model_type': 'LSTM_Bidirectional',
    'task': 'sentiment_analysis'
}

with open('model_config.json', 'w') as f:
    json.dump(model_config, f, indent=4)
print("✅ Configuration saved to 'model_config.json'")

# 4. Also save the vectorizers (optional)
with open('bow_vectorizer.pkl', 'wb') as f:
    pickle.dump(bow_vectorizer, f)
    
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)
    
print("✅ BOW and TF-IDF vectorizers saved")

## 🔮 10. Prediction function for new texts

In [None]:
class SentimentPredictor:
    def __init__(self, model_path, tokenizer_path, config_path):
        """Load the model and the preprocessors."""
        # Load the model
        self.model = load_model(model_path)
        
        # Load the tokenizer
        with open(tokenizer_path, 'r', encoding='utf-8') as f:
            tokenizer_json = f.read()
        self.tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(tokenizer_json)
        
        # Load the configuration
        with open(config_path, 'r') as f:
            self.config = json.load(f)
            
        # Initialize the cleaning tools
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        
    def clean_text(self, text):
        """Clean the text with the same steps used for the training data."""
        text = text.lower()
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
        text = re.sub(r'@\w+|#\w+', '', text)
        text = text.translate(str.maketrans('', '', string.punctuation))
        text = re.sub(r'\d+', '', text)
        
        tokens = word_tokenize(text)
        tokens = [token for token in tokens if token not in self.stop_words]
        tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        
        return ' '.join(tokens)
    
    def predict(self, text):
        """Predict the sentiment of a text."""
        # Clean the text
        cleaned = self.clean_text(text)
        
        # Convert to a padded sequence
        sequence = self.tokenizer.texts_to_sequences([cleaned])
        padded = pad_sequences(sequence, 
                             maxlen=self.config['max_sequence_length'],
                             padding='post',
                             truncating='post')
        
        # Predict
        prediction = self.model.predict(padded, verbose=0)[0][0]
        
        # Interpret the score; later cells compare against the 'Positif' label
        sentiment = 'Positif' if prediction > 0.5 else 'Négatif'
        confidence = prediction if prediction > 0.5 else 1 - prediction
        
        return {
            'text': text,
            'cleaned_text': cleaned,
            'sentiment': sentiment,
            'confidence': float(confidence),
            'raw_score': float(prediction)
        }

# Instantiate the predictor
predictor = SentimentPredictor(
    model_path='sentiment_lstm_model.h5',
    tokenizer_path='tokenizer.json',
    config_path='model_config.json'
)

print("✅ Predictor loaded and ready!")
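
The mapping from the raw sigmoid score to a label and a confidence value is easy to misread, so here it is isolated as a small pure function. This is a sketch that mirrors the rule used in `predict`. `interpret_score` and its `threshold` parameter are illustrative names, not part of the notebook.

```python
def interpret_score(score, threshold=0.5):
    """Turn a sigmoid output in [0, 1] into (label, confidence)."""
    if score > threshold:
        return "Positif", score
    # Scores at or below the threshold count as negative;
    # confidence is the probability mass on the negative side
    return "Négatif", 1 - score

for s in (0.9, 0.2, 0.5):
    label, confidence = interpret_score(s)
    print(f"score={s:.2f} -> {label} ({confidence:.0%})")
```

With the default threshold, confidence always falls between 50% and 100%, so a value close to 50% means the model is undecided rather than mildly confident. A score of exactly 0.5 is labeled negative.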

## 🎯 11. Testing on new texts

In [None]:
# Test texts
test_texts = [
    "This product is amazing! I love it so much.",
    "Terrible quality. Complete waste of money.",
    "Not bad, but could be better. Average product.",
    "Absolutely fantastic! Exceeded all my expectations!",
    "Disappointing. Does not work as advertised."
]

print("🔮 Predictions on new texts:\n")

for text in test_texts:
    result = predictor.predict(text)
    print(f"📝 Text: '{text}'")
    print(f"🧹 Cleaned: '{result['cleaned_text']}'")
    print(f"😊 Sentiment: {result['sentiment']} (Confidence: {result['confidence']:.2%})")
    print(f"📊 Raw score: {result['raw_score']:.3f}")
    print("-" * 80)

## 💻 12. Interactive interface for testing

In [None]:
def analyse_sentiment_interactif():
    """Interactive interface for testing the model."""
    print("=" * 80)
    print("🤖 LSTM Sentiment Analyzer")
    print("=" * 80)
    print("Enter text to analyze its sentiment (or 'quit' to stop)\n")
    
    while True:
        text = input("\n💬 Your text: ")
        
        if text.lower() in ['quit', 'exit', 'q']:
            print("\n👋 Goodbye!")
            break
        
        if text.strip():
            result = predictor.predict(text)
            
            # Pick an emoji based on the sentiment
            emoji = "😊" if result['sentiment'] == 'Positif' else "😢"
            
            print(f"\n{emoji} Sentiment: {result['sentiment']}")
            print(f"📊 Confidence: {result['confidence']:.2%}")
            
            # Visual progress bar
            bar_length = 30
            filled = int(bar_length * result['raw_score'])
            bar = "█" * filled + "░" * (bar_length - filled)
            print(f"\nNégatif [{bar}] Positif")
            print(f"Score: {result['raw_score']:.3f}")

# Uncomment to use the interactive interface
# analyse_sentiment_interactif()

## 📦 13. Standalone Python script for future use

In [None]:
# Create a reusable Python script
script_content = '''#!/usr/bin/env python3
"""
Sentiment analysis script using an LSTM
Usage: python sentiment_analyzer.py "Your text here"
"""

import sys
import json
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
import string

class SentimentAnalyzer:
    def __init__(self):
        # Load the model
        self.model = load_model('sentiment_lstm_model.h5')
        
        # Load the tokenizer
        with open('tokenizer.json', 'r', encoding='utf-8') as f:
            tokenizer_json = f.read()
        self.tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(tokenizer_json)
        
        # Load the config
        with open('model_config.json', 'r') as f:
            self.config = json.load(f)
            
        # NLP tools
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
    
    def clean_text(self, text):
        text = text.lower()
        text = re.sub(r'https?://\\S+|www\\.\\S+', '', text)
        text = re.sub(r'@\\w+|#\\w+', '', text)
        text = text.translate(str.maketrans('', '', string.punctuation))
        text = re.sub(r'\\d+', '', text)
        
        tokens = word_tokenize(text)
        tokens = [token for token in tokens if token not in self.stop_words]
        tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        
        return ' '.join(tokens)
    
    def analyze(self, text):
        cleaned = self.clean_text(text)
        sequence = self.tokenizer.texts_to_sequences([cleaned])
        # Pad and truncate the same way as during training
        padded = pad_sequences(sequence, 
                             maxlen=self.config['max_sequence_length'],
                             padding='post',
                             truncating='post')
        
        prediction = self.model.predict(padded, verbose=0)[0][0]
        sentiment = 'Positif' if prediction > 0.5 else 'Négatif'
        confidence = prediction if prediction > 0.5 else 1 - prediction
        
        return sentiment, confidence, prediction

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print('Usage: python sentiment_analyzer.py "Your text"')
        sys.exit(1)
    
    text = ' '.join(sys.argv[1:])
    
    analyzer = SentimentAnalyzer()
    sentiment, confidence, score = analyzer.analyze(text)
    
    emoji = "😊" if sentiment == 'Positif' else "😢"
    print(f"\\n{emoji} Sentiment: {sentiment}")
    print(f"📊 Confidence: {confidence:.2%}")
    print(f"📈 Score: {score:.3f}")
'''

# Save the script (UTF-8, since it contains emoji and accented characters)
with open('sentiment_analyzer.py', 'w', encoding='utf-8') as f:
    f.write(script_content)

print("✅ Standalone script created: 'sentiment_analyzer.py'")
print("\n📌 Usage:")
print("   python sentiment_analyzer.py \"Your text to analyze\"")
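
Because `script_content` is a regular (non-raw) triple-quoted string, Python processes escape sequences such as `\"` and `\n` before the file is written, which can silently produce a script that does not parse. One cheap safeguard is to run the text through the built-in `compile()` before saving it. The sketch below demonstrates the check on two short made-up snippets. `has_valid_syntax` is a hypothetical helper.

```python
def has_valid_syntax(source, filename="<generated>"):
    """Return True if `source` parses as Python, without executing it."""
    try:
        compile(source, filename, "exec")
        return True
    except SyntaxError:
        return False

# Backslashes doubled: the generated code still contains \" and \n escapes
script_ok = 'print("Usage: tool.py \\"text\\"")\nprint(f"\\nDone")\n'

# Single backslash: the escape is consumed here, leaving bare quotes in the output
script_bad = 'print("Usage: tool.py \"text\"")\n'

print(has_valid_syntax(script_ok))
print(has_valid_syntax(script_bad))
```

In the notebook, calling `has_valid_syntax(script_content)` just before writing the file would catch this class of mistake. Only the syntax is checked, so missing packages or model files are not detected.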

## 🎉 Conclusion

Congratulations! You have built a complete NLP pipeline that combines:

1. **Text cleaning** (Module 2)
2. **Feature engineering** with BOW and TF-IDF (Module 3)
3. **Word embeddings** (Module 4)
4. **LSTM** for classification (Module 5)

### 📁 Saved files:
- `sentiment_lstm_model.h5`: the trained LSTM model
- `tokenizer.json`: the tokenizer used to prepare the texts
- `model_config.json`: the model configuration
- `sentiment_analyzer.py`: standalone Python script
- `bow_vectorizer.pkl` and `tfidf_vectorizer.pkl`: vectorizers (optional)

### 🚀 Next steps:
- Train on a larger dataset
- Try other architectures (GRU, Transformers)
- Deploy the model behind an API
- Improve the preprocessing
- Add more classes (neutral sentiment, etc.)