# Evaluation

## Objective

The goal of this assessment is to detect whether a request is an intrusion attempt or a legitimate request.

Your model should reach a score of 95%.

## Deliverable

You must complete this Jupyter notebook and answer the questions in the associated Google Form [EPSI - PARIS - 2026 - TRDE707 - TP Noté](https://forms.gle/ZhcULFdgYDDm4P7Q9).

## Dependencies and modules

The modules available for running this notebook are:
* pandas
* scikit-learn
* matplotlib

## Data sources

The data sources are CSV files. You can download them from the [trde707-datasets](https://drive.google.com/drive/folders/135R4uXKxgwFHxNQY8x3iVmSrfGWmYM8a?usp=sharing) folder.

The archive public_network_log.zip contains a single file, public_network_log.csv, a log of requests in which each request is labeled as either an intrusion attempt or a legitimate request.

The archive dbip-country-lite-2026-01.zip contains a single file, dbip-country-lite-2026-01.csv, which maps each IP range to a country code. This database comes from the [dbip](https://db-ip.com/db/format/ip-to-country/csv.html) portal.
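
One possible way to attach a country code to each request is a range lookup with `pandas.merge_asof`. The sketch below is only an illustration on made-up data: the tiny `ranges` table, its column names (`ip_start`, `ip_end`, `country`) and the sample IPs are assumptions, not the real layout of the dbip file, which should be checked before adapting this. It handles IPv4 only; if the file also contains IPv6 ranges, those rows would need to be filtered out first.

```python
import ipaddress
import pandas as pd

def ip_to_int(ip: str) -> int:
    """Convert a dotted IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(ip))

# Hypothetical miniature range table: start IP, end IP, country code.
ranges = pd.DataFrame({
    "ip_start": ["1.0.0.0", "1.0.1.0", "8.8.8.0"],
    "ip_end": ["1.0.0.255", "1.0.3.255", "8.8.8.255"],
    "country": ["AU", "CN", "US"],
})
ranges["start_int"] = ranges["ip_start"].map(ip_to_int)
ranges["end_int"] = ranges["ip_end"].map(ip_to_int)
ranges = ranges.sort_values("start_int")

# Hypothetical request log with a Source_IP column.
requests = pd.DataFrame({"Source_IP": ["8.8.8.8", "1.0.2.17", "9.9.9.9"]})
requests["source_ip_int"] = requests["Source_IP"].map(ip_to_int)

# merge_asof needs both sides sorted on their key. With direction="backward"
# it picks, for each request, the last range whose start is <= the IP.
matched = pd.merge_asof(
    requests.sort_values("source_ip_int"),
    ranges[["start_int", "end_int", "country"]],
    left_on="source_ip_int",
    right_on="start_int",
    direction="backward",
)

# A backward match can land in a gap between ranges, so keep the country
# only when the IP is also <= the end of the matched range.
in_range = matched["source_ip_int"] <= matched["end_int"]
matched["Country"] = matched["country"].where(in_range, "Unknown")

result = matched[["Source_IP", "Country"]]
print(result)
```

A sorted as-of join avoids comparing every request against every range, which matters when the range table is large.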


Load the data


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
import ipaddress
import warnings
warnings.filterwarnings('ignore')

# Load the request log; this assumes public_network_log.csv has been
# extracted from public_network_log.zip into the working directory
df_logs = pd.read_csv('public_network_log.csv')
print(df_logs.shape)
print(df_logs.head())


Display the class counts of the `Intrusion` column.


In [4]:
# Count legitimate requests vs. intrusions (sorted so that 0 comes before 1)
counts = df_logs['Intrusion'].value_counts().sort_index()
print(counts)

# Optional: bar chart (green = legitimate, red = intrusion)
counts.plot(kind='bar', color=['green', 'red'])
plt.title("Counts of legitimate requests and intrusions")
plt.show()


Display the 2 most suitable representations for the `Payload_Size` column.


In [None]:
# Histogram
plt.hist(df_logs['Payload_Size'], bins=50, color='blue', alpha=0.7)
plt.title("Histogram of Payload_Size")
plt.xlabel("Payload_Size")
plt.ylabel("Number of requests")
plt.show()

# Boxplot
plt.boxplot(df_logs['Payload_Size'])
plt.title("Boxplot of Payload_Size")
plt.ylabel("Payload_Size")
plt.show()


Display the most suitable representation for `Port`.


In [None]:
# Bar chart of the 15 most frequent ports
plt.figure(figsize=(12, 6))
port_counts = df_logs['Port'].value_counts().head(15)
port_counts.plot(kind='bar', color='coral', edgecolor='black')
plt.title('Distribution of the most used ports')
plt.xlabel('Port number')
plt.ylabel('Number of requests')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("Top 10 ports:")
print(df_logs['Port'].value_counts().head(10))

Display the variables `Payload_Size`, `Port` and `Intrusion` on a single plot.


In [None]:
import matplotlib.pyplot as plt

# Scatter plot: Port vs. Payload_Size, colored by Intrusion
plt.figure(figsize=(12, 6))

legitimate = df_logs[df_logs['Intrusion'] == 0]
intrusion = df_logs[df_logs['Intrusion'] == 1]

plt.scatter(legitimate['Port'], legitimate['Payload_Size'],
            alpha=0.5, c='green', label='Legitimate (0)', s=30)
plt.scatter(intrusion['Port'], intrusion['Payload_Size'],
            alpha=0.5, c='red', label='Intrusion (1)', s=30)

plt.title('Relationship between Port, Payload_Size and Intrusion')
plt.xlabel('Port')
plt.ylabel('Payload Size')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Comparative boxplots: Payload_Size and Port by Intrusion
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Boxplot of Payload_Size
df_logs.boxplot(column='Payload_Size', by='Intrusion', ax=axes[0])
axes[0].set_title('Payload_Size by Intrusion')
axes[0].set_xlabel('Intrusion')

# Boxplot of Port
df_logs.boxplot(column='Port', by='Intrusion', ax=axes[1])
axes[1].set_title('Port by Intrusion')
axes[1].set_xlabel('Intrusion')

# Clear the automatic "Boxplot grouped by ..." super-title added by pandas
plt.suptitle('')
plt.tight_layout()
plt.show()


Select the columns that require encoding.


In [None]:
# Categorical columns to encode (only those actually present in df_logs are kept)
categorical_cols = [
    col for col in ['Request_Type', 'Protocol', 'User_Agent', 'Status', 'Scan_Type', 'Country']
    if col in df_logs.columns
]

print("Categorical columns requiring encoding:")
print(categorical_cols)

print("\nUnique values per column:")
for col in categorical_cols:
    print(f"\n{col}: {df_logs[col].nunique()} unique values")
    print(df_logs[col].value_counts().head(3))


Select the columns that require standardization.


In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Numeric columns to standardize
numerical_cols = ['Port', 'Payload_Size']

print("Numeric columns requiring standardization:")
print(numerical_cols)

print("\nStatistics (before standardization):")
print(df_logs[numerical_cols].describe())

# Preview of the standardization on a copy; df_logs itself is left
# unchanged so the columns are not scaled a second time before training
scaler = StandardScaler()
scaled_preview = pd.DataFrame(
    scaler.fit_transform(df_logs[numerical_cols]),
    columns=numerical_cols, index=df_logs.index
)

print("\nStatistics (after standardization):")
print(scaled_preview.describe())
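
To make explicit what standardization computes, here is a small worked example on made-up values (the four numbers are illustrative, not taken from the log). `StandardScaler` applies z = (x - mean) / std per column, using the population standard deviation (`ddof=0`), so the manual formula should reproduce its output.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical payload sizes, used only to illustrate the formula.
sizes = pd.DataFrame({"Payload_Size": [100.0, 200.0, 300.0, 400.0]})

# Standardization as done by scikit-learn.
scaled = StandardScaler().fit_transform(sizes)

# The same computation by hand: subtract the mean, divide by the
# population standard deviation (ddof=0, unlike the pandas default of 1).
manual = (sizes - sizes.mean()) / sizes.std(ddof=0)

# Largest absolute difference between the two results.
max_diff = abs(scaled - manual.to_numpy()).max()
print(max_diff)
```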


Select a model and train it.

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Data preparation
df_processed = df_logs.copy()

# Drop the IP columns (errors='ignore' skips any that are absent)
cols_to_drop = ['Source_IP', 'Destination_IP', 'source_ip_int']
df_processed = df_processed.drop(columns=cols_to_drop, errors='ignore')

# Encode the categorical variables that are present as integer codes
for col in categorical_cols:
    if col in df_processed.columns:
        le = LabelEncoder()
        df_processed[col] = le.fit_transform(df_processed[col].astype(str))

# Separate X and y
X = df_processed.drop('Intrusion', axis=1)
y = df_processed['Intrusion']

print("Columns used:", X.columns.tolist())
print("Shape of X:", X.shape)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize the numeric columns; the scaler is fitted on the training
# set only, so no test-set statistics leak into training
scaler = StandardScaler()
X_train, X_test = X_train.copy(), X_test.copy()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

# Random Forest model
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)

print(f"\nTraining on {X_train.shape[0]} samples...")
model.fit(X_train, y_train)
print("✓ Model trained!")

# Feature importance
feature_imp = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 5 most important features:")
print(feature_imp.head())
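
As an alternative way to organize the preprocessing and the model, the steps can be bundled into a scikit-learn `Pipeline` with a `ColumnTransformer`, so that the scaler and the encoder are always fitted on the training data only. This is a minimal sketch on made-up data: the `toy` DataFrame, its values, and the use of one-hot encoding instead of `LabelEncoder` are assumptions for illustration, not the notebook's actual dataset or method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data reusing a few column names from the log.
toy = pd.DataFrame({
    "Port": [80, 443, 22, 8080, 80, 22, 443, 3389] * 5,
    "Payload_Size": [120, 300, 5000, 90, 110, 4800, 310, 5200] * 5,
    "Protocol": ["TCP", "TCP", "UDP", "TCP", "TCP", "UDP", "TCP", "UDP"] * 5,
    "Intrusion": [0, 0, 1, 0, 0, 1, 0, 1] * 5,
})
X_toy = toy.drop(columns="Intrusion")
y_toy = toy["Intrusion"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Scale the numeric columns and one-hot encode the categorical one;
# handle_unknown="ignore" tolerates categories unseen during fit.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Port", "Payload_Size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Protocol"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# fit() learns the preprocessing statistics from the training set only;
# score() reuses them unchanged on the test set.
pipeline.fit(X_tr, y_tr)
test_accuracy = pipeline.score(X_te, y_te)
print(f"Test accuracy on toy data: {test_accuracy:.2f}")
```

Keeping preprocessing inside the pipeline also means the same object can be passed to cross-validation utilities without refitting the scaler by hand.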


Evaluate your model's performance.

In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# Predictions (hard labels and probability of the Intrusion class)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print("="*50)
print("MODEL PERFORMANCE")
print("="*50)
print(f"\n🎯 Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"🎯 ROC AUC: {roc_auc:.4f}")

print("\n" + "="*50)
print("CLASSIFICATION REPORT")
print("="*50)
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Intrusion']))

# Overfitting check: a large gap between train and test accuracy is a warning sign
train_acc = accuracy_score(y_train, model.predict(X_train))
print(f"\nTrain accuracy: {train_acc:.4f}")
print(f"Test accuracy: {accuracy:.4f}")
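
A single train/test split gives only one estimate of the score. One way to get a sense of its variability is stratified k-fold cross-validation. The sketch below runs on synthetic data from `make_classification`, which is an assumption for illustration only; on the real data, `X` and `y` from the training cell would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data standing in for the request log.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Five stratified folds: each fold keeps the class proportions of y_demo.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy value per fold; the model is refitted for each fold.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=cv, scoring="accuracy",
)

print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the mean is close to the target but the spread across folds is wide, the score from a single split should be read with caution.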



Choose a display that conveys the performance of the selected model.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_curve)

fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Confusion matrix (drawn with scikit-learn, since seaborn is not
#    in the list of available modules)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Legitimate', 'Intrusion']).plot(
    ax=axes[0, 0], cmap='Blues', values_format='d', colorbar=False)
axes[0, 0].set_title('Confusion Matrix', fontweight='bold')
axes[0, 0].set_ylabel('Actual Value')
axes[0, 0].set_xlabel('Predicted Value')

# 2. ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
axes[0, 1].plot(fpr, tpr, 'darkorange', lw=2, label=f'AUC = {roc_auc:.2f}')
axes[0, 1].plot([0, 1], [0, 1], 'navy', lw=2, linestyle='--')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curve', fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# 3. Distribution of predicted probabilities, split by true class
is_intrusion = (y_test == 1).to_numpy()
axes[1, 0].hist(y_pred_proba[~is_intrusion], bins=50, alpha=0.7, label='Legitimate', color='green')
axes[1, 0].hist(y_pred_proba[is_intrusion], bins=50, alpha=0.7, label='Intrusion', color='red')
axes[1, 0].set_xlabel('Predicted probability')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Distribution of Probabilities', fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)

# 4. Metrics as a bar chart (zero_division=0 avoids an error when a
#    class is never predicted)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)

metrics = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1,
    'ROC AUC': roc_auc
}

bars = axes[1, 1].bar(list(metrics.keys()), list(metrics.values()),
                      color=['#2ecc71', '#e74c3c', '#f39c12', '#9b59b6', '#3498db'])
axes[1, 1].set_ylim([0, 1.1])
axes[1, 1].set_ylabel('Score')
axes[1, 1].set_title('Metrics', fontweight='bold')
axes[1, 1].grid(alpha=0.3, axis='y')

# Write each value above its bar
for bar in bars:
    h = bar.get_height()
    axes[1, 1].text(bar.get_x() + bar.get_width()/2., h,
                    f'{h:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# Summary
print("\n" + "="*50)
print("🏆 SUMMARY")
print("="*50)
print(f"Score: {accuracy*100:.2f}%")
print(f"True Positives: {cm[1,1]}")
print(f"True Negatives: {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
