Model training: Retrieve the embeddings and use them to train the classification models.


Model evaluation and saving: Evaluate each model's performance and keep the best one for future predictions.

5. Choose and Train Model
Start simple, then advance:

Baseline: Logistic Regression or Naive Bayes with TF-IDF
Mid-level: LSTM or CNN neural networks
Advanced: Fine-tune pre-trained models like BERT, DistilBERT, or RoBERTa
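
As an illustration of the baseline tier, here is a minimal sketch of a TF-IDF plus Logistic Regression pipeline in scikit-learn. The toy tweets, the labels, and the `ngram_range` setting are assumptions made up for this example; they are not taken from the project's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy tweets and labels, invented purely for illustration.
texts = [
    "I love this phone, it is great",
    "great service and a lovely team",
    "what a wonderful, happy day",
    "I hate this, it is terrible",
    "terrible service, awful support",
    "what an awful, horrible day",
]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

# TF-IDF turns each text into a sparse vector; the classifier is fit on top.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, random_state=42),
)
baseline.fit(texts, labels)

# Raw strings go straight into the pipeline; vectorization happens inside it.
predictions = baseline.predict(["great phone, lovely day", "awful and terrible"])
print(predictions)
```

Wrapping both steps in one pipeline keeps the vectorizer fitted on training text only, which makes the baseline a fair reference point for the heavier models.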

6. Evaluate Performance
Track these metrics:

Accuracy
Precision, Recall, F1-score per class
Confusion matrix to see misclassifications
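
To make the link between these metrics concrete, the sketch below derives accuracy and per-class precision, recall, and F1 directly from a confusion matrix. The label vectors are invented for illustration, and the three class names are assumed to match the sentiment classes used later in the notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration only.
y_true = ["negative", "negative", "neutral", "neutral",
          "positive", "positive", "positive", "negative"]
y_pred = ["negative", "neutral", "neutral", "neutral",
          "positive", "negative", "positive", "negative"]
class_names = ["negative", "neutral", "positive"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=class_names)

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for name, p, r, f in zip(class_names, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

If a class is never predicted, its column sum is zero and the manual division yields `nan`; `classification_report` handles that case through its `zero_division` parameter.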

7. Iterate and Improve

Add more training data for weak categories
Try data augmentation (paraphrasing)
Adjust model hyperparameters
Handle class imbalance with weighted loss or oversampling
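
One way to apply the oversampling idea is to duplicate minority-class rows at random until every class matches the largest one. The sketch below uses `sklearn.utils.resample` on a made-up feature matrix; the array shapes and class counts are assumptions for illustration. Oversampling should generally be applied to the training split only, so that duplicated rows never leak into the test set.

```python
import numpy as np
from sklearn.utils import resample

# Made-up data: 10 samples, 2 features, classes with 6 / 3 / 1 samples.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 3 + [2] * 1)

target = np.bincount(y).max()  # size of the largest class

X_parts, y_parts = [], []
for cls in np.unique(y):
    X_cls = X[y == cls]
    n_extra = target - len(X_cls)
    if n_extra > 0:
        # Keep every original row and add random duplicates on top.
        extra = resample(X_cls, replace=True, n_samples=n_extra, random_state=42)
        X_cls = np.vstack([X_cls, extra])
    X_parts.append(X_cls)
    y_parts.append(np.full(len(X_cls), cls))

X_balanced = np.vstack(X_parts)
y_balanced = np.concatenate(y_parts)

print(np.bincount(y_balanced))
```

The alternative of weighting the loss needs no resampling at all: passing `class_weight='balanced'` to a scikit-learn classifier reweights errors by inverse class frequency.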

8. Deploy

Save your trained model
Create an API or interface to classify new tweets
Monitor performance on real data
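
The save-and-serve step could look roughly like the sketch below. It trains a stand-in model on random vectors (not real embeddings), round-trips it through `pickle`, and wraps prediction in a small helper. The `classify` function, the temporary path, and the 8-dimensional vectors are hypothetical; in practice, new tweets would first need to be embedded with the same embedding model that was used for training.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the trained model: random vectors instead of real embeddings.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(30, 8))
y_demo = np.array(["negative", "neutral", "positive"] * 10)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save the model to disk (a temporary directory here, for illustration).
model_path = Path(tempfile.mkdtemp()) / "best_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, e.g. when the API process starts: load the model once.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)


def classify(embeddings):
    """Hypothetical helper: return one predicted label per embedding row."""
    embeddings = np.atleast_2d(embeddings)  # accept a single vector too
    return loaded_model.predict(embeddings).tolist()


print(classify(X_demo[:3]))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.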

In [None]:
from sklearn.model_selection import train_test_split


# X_embeddings (feature matrix) and y (labels) are assumed to be defined in earlier cells


# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_embeddings, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:

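import os
import pickle

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.svm import SVC

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")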

# === MODEL 1: Logistic Regression ===
print("\n" + "="*60)
print("MODEL 1: Logistic Regression")
print("="*60)

lr_model = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))
print(f"\nWeighted F1-Score: {f1_score(y_test, y_pred_lr, average='weighted'):.4f}")

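# Confusion matrix (rows = true labels, columns = predicted labels).
# The tick labels below assume the classes sort as negative, neutral, positive.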
cm_lr = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['negative', 'neutral', 'positive'],
            yticklabels=['negative', 'neutral', 'positive'])
plt.title('Logistic Regression - Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# === MODEL 2: Random Forest ===
print("\n" + "="*60)
print("MODEL 2: Random Forest")
print("="*60)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))
print(f"\nWeighted F1-Score: {f1_score(y_test, y_pred_rf, average='weighted'):.4f}")

# === MODEL 3: SVM (Optional - slower) ===
print("\n" + "="*60)
print("MODEL 3: SVM")
print("="*60)

svm_model = SVC(kernel='rbf', random_state=42, class_weight='balanced')
svm_model.fit(X_train, y_train)

y_pred_svm = svm_model.predict(X_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_svm))
print(f"\nWeighted F1-Score: {f1_score(y_test, y_pred_svm, average='weighted'):.4f}")

# === COMPARE MODELS ===
models = {
    'Logistic Regression': f1_score(y_test, y_pred_lr, average='weighted'),
    'Random Forest': f1_score(y_test, y_pred_rf, average='weighted'),
    'SVM': f1_score(y_test, y_pred_svm, average='weighted')
}

print("\n" + "="*60)
print("MODEL COMPARISON (Weighted F1-Score)")
print("="*60)
for model_name, score in sorted(models.items(), key=lambda x: x[1], reverse=True):
    print(f"{model_name}: {score:.4f}")

# Save best model
best_model_name = max(models, key=models.get)
best_model = {'Logistic Regression': lr_model, 'Random Forest': rf_model, 'SVM': svm_model}[best_model_name]

os.makedirs("../models", exist_ok=True)
with open("../models/best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

print(f"\n✅ Best model ({best_model_name}) saved to ../models/best_model.pkl")