# Lab 6

Scikit learn provides a large variety of algorithms for some common Machine Learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (n-classes)

* Regression Datasets:
    * Boston house prices
    * Diabetes
    * Linnerrud (multiple regression)
    * California Housing

* Image:
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition
    * Forest covertypes

* NLP:
    * News group
    * Reuters Corpus Volume I

* Other:
    * Kddcup 99- Intrusion Detection

##### Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers.
    * Separate the data into train, validation, and test.
    * Use accuracy as the metric for assessing performance.
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithm to identify the 10 most important features for the task in question 1. Retrain classifiers in question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1?

5. Can you identify any clusters within the top/botton 10% identified in 2. What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use the feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve using less features?

## Quick look at the data

In [1]:
from sklearn.datasets import fetch_kddcup99
D=fetch_kddcup99()

In [2]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [3]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [4]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [5]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [6]:
len(np.unique(D["target"]))

23

In [7]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

# Exercise 1

In [8]:
# Import necessary libraries
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

# Step 1: Load the dataset
data = fetch_kddcup99()
X = pd.DataFrame(data.data)  # Features
y = pd.Series(data.target)   # Target

# Decode byte strings in categorical columns
X = X.applymap(lambda x: x.decode() if isinstance(x, bytes) else x)
y = y.apply(lambda x: x.decode() if isinstance(x, bytes) else x)

# Preprocessing: Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(exclude=['object']).columns

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown="ignore"), categorical_cols)
    ])

# Step 2: Split the data into train, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Step 3: Define classifiers and hyperparameters
# 1. Decision Tree
dt_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])
dt_params = {
    'classifier__max_depth': [10, 20, None],
    'classifier__min_samples_split': [2, 5, 10]
}

# 2. Random Forest
rf_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
rf_params = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [10, 20, None]
}

# Step 4: Hyperparameter tuning with GridSearchCV
# Function to perform grid search and output best model and parameters
def tune_and_evaluate(pipeline, param_grid, X_train, y_train, X_val, y_val):
    grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_
    val_predictions = best_model.predict(X_val)
    val_accuracy = accuracy_score(y_val, val_predictions)
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Validation Accuracy: {val_accuracy:.4f}")
    return best_model

# Tune each classifier
print("Tuning Decision Tree...")
best_dt = tune_and_evaluate(dt_pipeline, dt_params, X_train, y_train, X_val, y_val)

print("\nTuning Random Forest...")
best_rf = tune_and_evaluate(rf_pipeline, rf_params, X_train, y_train, X_val, y_val)

# Step 5: Evaluate on the test set
def evaluate_on_test(model, X_test, y_test, model_name):
    test_predictions = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, test_predictions)
    print(f"{model_name} Test Accuracy: {test_accuracy:.4f}")

print("\nEvaluating on Test Set...")
evaluate_on_test(best_dt, X_test, y_test, "Decision Tree")
evaluate_on_test(best_rf, X_test, y_test, "Random Forest")

  X = X.applymap(lambda x: x.decode() if isinstance(x, bytes) else x)


Tuning Decision Tree...




Best Parameters: {'classifier__max_depth': None, 'classifier__min_samples_split': 2}
Validation Accuracy: 0.9996

Tuning Random Forest...




Best Parameters: {'classifier__max_depth': None, 'classifier__n_estimators': 200}
Validation Accuracy: 0.9997

Evaluating on Test Set...
Decision Tree Test Accuracy: 0.9997
Random Forest Test Accuracy: 0.9997


# Exercise 2

In [19]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer

# Load the KDDcup99 dataset
data = fetch_kddcup99()
X, y = pd.DataFrame(data.data, columns=data.feature_names), pd.Series(data.target)

# Encode the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Identify categorical features
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# Create a ColumnTransformer with OneHotEncoder
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_features),  # Apply One-Hot Encoding to categorical features
    remainder='passthrough'  # Keep other features as they are
)

# Create an ensemble of Decision Trees using Bagging
n_estimators = 25  # Number of trees in the ensemble
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Use 'estimator' instead of 'base_estimator'
    n_estimators=n_estimators,
    random_state=42
)

# Create a pipeline that first transforms the data and then fits the model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('bagging', bagging_clf)
])

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Fit the model using the pipeline
pipeline.fit(X_train, y_train)

# Predict on the validation set
y_pred = pipeline.predict(X_val)
y_proba = pipeline.predict_proba(X_val)  # Get probabilities for each class

# Calculate uncertainty as 1 - max probability (higher uncertainty means lower confidence)
uncertainty = 1 - np.max(y_proba, axis=1)

# Create a DataFrame to hold predictions and uncertainty
results_df = pd.DataFrame({
    'Predicted': label_encoder.inverse_transform(y_pred),  # Convert back to original labels
    'Uncertainty': uncertainty
})

# Identify top and bottom 10% based on uncertainty
top_10_percent_threshold = results_df['Uncertainty'].quantile(0.9)
bottom_10_percent_threshold = results_df['Uncertainty'].quantile(0.1)

top_10_percent = results_df[results_df['Uncertainty'] >= top_10_percent_threshold]
bottom_10_percent = results_df[results_df['Uncertainty'] <= bottom_10_percent_threshold]

# Output results
print("Top 10% Uncertainty Predictions:")
print(top_10_percent)

print("\nBottom 10% Uncertainty Predictions:")
print(bottom_10_percent)


Top 10% Uncertainty Predictions:
         Predicted  Uncertainty
0      b'neptune.'          0.0
1       b'normal.'          0.0
2        b'smurf.'          0.0
3       b'normal.'          0.0
4       b'normal.'          0.0
...            ...          ...
98800   b'normal.'          0.0
98801    b'smurf.'          0.0
98802    b'smurf.'          0.0
98803   b'normal.'          0.0
98804  b'neptune.'          0.0

[98805 rows x 2 columns]

Bottom 10% Uncertainty Predictions:
         Predicted  Uncertainty
0      b'neptune.'          0.0
1       b'normal.'          0.0
2        b'smurf.'          0.0
3       b'normal.'          0.0
4       b'normal.'          0.0
...            ...          ...
98800   b'normal.'          0.0
98801    b'smurf.'          0.0
98802    b'smurf.'          0.0
98803   b'normal.'          0.0
98804  b'neptune.'          0.0

[98673 rows x 2 columns]


# Exercise 3

In [21]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer

# Load the KDDcup99 dataset
data = fetch_kddcup99()
X, y = pd.DataFrame(data.data, columns=data.feature_names), pd.Series(data.target)

# Encode the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Identify categorical features
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# Create a ColumnTransformer with OneHotEncoder
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_features),  # Apply One-Hot Encoding to categorical features
    remainder='passthrough'  # Keep other features as they are
)

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Feature Selection Method 1: Using Random Forest for feature importance
rf = RandomForestClassifier(random_state=42)
X_train_transformed = preprocessor.fit_transform(X_train)

# Fit the Random Forest model to get feature importances
rf.fit(X_train_transformed, y_train)
importances = rf.feature_importances_

# Get indices of the 10 most important features
indices = np.argsort(importances)[-10:]

# Get the names of the selected features
feature_names = preprocessor.get_feature_names_out()
top_rf_features = feature_names[indices]

# Display the 10 most important features from Random Forest
print("Top 10 Features from Random Forest:")
print(top_rf_features)

# Feature Selection Method 2: Using SelectKBest with chi-squared test
# Note: chi-squared test requires non-negative features; we need to ensure the transformed data meets this condition.
X_train_kbest = X_train_transformed

# Apply SelectKBest
k_best_selector = SelectKBest(score_func=chi2, k=10)
k_best_selector.fit(X_train_kbest, y_train)

# Get the mask of selected features
k_best_mask = k_best_selector.get_support()
top_kbest_features = feature_names[k_best_mask]

# Display the 10 most important features from SelectKBest
print("Top 10 Features from SelectKBest (chi-squared):")
print(top_kbest_features)

# Combine selected features from both methods
selected_features = np.unique(np.concatenate((top_rf_features, top_kbest_features)))

# Create new training and validation sets with only selected features
X_train_selected = X_train_transformed[:, np.isin(feature_names, selected_features)]
X_val_selected = preprocessor.transform(X_val)[:, np.isin(feature_names, selected_features)]

# Retrain the Decision Tree classifier with the selected features
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train_selected, y_train)

# Evaluate the model on the validation set
y_pred_selected = decision_tree.predict(X_val_selected)

# Calculate accuracy
accuracy = np.mean(y_pred_selected == y_val)
print(f"Accuracy of Decision Tree with selected features: {accuracy:.4f}")

Top 10 Features from Random Forest:
['onehotencoder__dst_host_serror_rate_1.0'
 'onehotencoder__same_srv_rate_1.0' 'onehotencoder__logged_in_0'
 'onehotencoder__dst_host_same_src_port_rate_1.0'
 "onehotencoder__service_b'private'" 'onehotencoder__dst_bytes_0'
 'onehotencoder__dst_host_same_src_port_rate_0.0'
 "onehotencoder__service_b'ecr_i'" "onehotencoder__protocol_type_b'tcp'"
 "onehotencoder__protocol_type_b'icmp'"]
Top 10 Features from SelectKBest (chi-squared):
['onehotencoder__src_bytes_28' 'onehotencoder__src_bytes_1480'
 'onehotencoder__src_bytes_54540' 'onehotencoder__dst_bytes_8127'
 'onehotencoder__dst_bytes_8314' 'onehotencoder__land_1'
 'onehotencoder__wrong_fragment_1' 'onehotencoder__wrong_fragment_3'
 'onehotencoder__hot_2' 'onehotencoder__num_compromised_1']
Accuracy of Decision Tree with selected features: 0.9934


# Exercise 4

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the KDDcup99 dataset
data = fetch_kddcup99()
X, y = pd.DataFrame(data.data, columns=data.feature_names), pd.Series(data.target)

# Encode the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Identify categorical features
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# Create a ColumnTransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),  # One-hot encode categorical features
        ('num', StandardScaler(), X.select_dtypes(include=['float64', 'int64']).columns.tolist())  # Scale numerical features
    ],
    remainder='passthrough'  # Keep other features as they are
)

# Preprocess the data (fit and transform)
X_transformed = preprocessor.fit_transform(X)

# Convert to dense array
X_dense = X_transformed.toarray() if hasattr(X_transformed, 'toarray') else X_transformed

# Define clustering algorithms
kmeans = KMeans(n_clusters=10, random_state=42)
agg_clustering = AgglomerativeClustering(n_clusters=10)
dbscan = DBSCAN(eps=0.5, min_samples=5)

# Fit and predict clusters
clusters_kmeans = kmeans.fit_predict(X_dense)
clusters_agg = agg_clustering.fit_predict(X_dense)
clusters_dbscan = dbscan.fit_predict(X_dense)

# Evaluate clustering performance using silhouette score
silhouette_kmeans = silhouette_score(X_dense, clusters_kmeans)
silhouette_agg = silhouette_score(X_dense, clusters_agg)
# Note: DBSCAN may produce noise (-1), so we will only compute the silhouette score for valid clusters
if len(set(clusters_dbscan)) > 1:  # Check if there are multiple clusters
    silhouette_dbscan = silhouette_score(X_dense[clusters_dbscan != -1], clusters_dbscan[clusters_dbscan != -1])
else:
    silhouette_dbscan = -1  # If there's only one cluster or noise

# Output the silhouette scores
print(f"Silhouette Score for K-Means: {silhouette_kmeans:.4f}")
print(f"Silhouette Score for Agglomerative Clustering: {silhouette_agg:.4f}")
print(f"Silhouette Score for DBSCAN: {silhouette_dbscan:.4f}")

# Analyze cluster distribution for K-Means
kmeans_df = pd.DataFrame({'Cluster': clusters_kmeans, 'True Label': y_encoded})
kmeans_distribution = kmeans_df.groupby(['Cluster', 'True Label']).size().unstack(fill_value=0)

# Visualize cluster distribution for K-Means
plt.figure(figsize=(12, 8))
sns.heatmap(kmeans_distribution, annot=True, fmt='d', cmap='Blues')
plt.title("K-Means Cluster Distribution")
plt.ylabel("Cluster")
plt.xlabel("True Label")
plt.show()

# Exercise 5

In [None]:
# Step 1: Retrieve top and bottom 10% instances based on uncertainty (from Exercise 2)
top_10_percent_indices = np.argsort(uncertainties)[-int(0.1 * num_samples):]  # Most uncertain
bottom_10_percent_indices = np.argsort(uncertainties)[:int(0.1 * num_samples)]  # Least uncertain

# Extract the corresponding data
X_top_10 = X_test.iloc[top_10_percent_indices]
X_bottom_10 = X_test.iloc[bottom_10_percent_indices]

# Step 2: Cluster the Top 10% instances
clustering_algorithms = {
    "KMeans": KMeans(n_clusters=3, random_state=42),
    "Agglomerative Clustering": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5)
}

# Analyze clusters for Top 10% uncertain data
top_clusters_comparison = {}
for name, model in clustering_algorithms.items():
    model.fit(X_top_10)
    if name == "DBSCAN":
        labels_top = model.labels_
    else:
        labels_top = model.predict(X_top_10)

    # Store results
    top_clusters_comparison[name] = labels_top

# Analyze clusters for Bottom 10% uncertain data
bottom_clusters_comparison = {}
for name, model in clustering_algorithms.items():
    model.fit(X_bottom_10)
    if name == "DBSCAN":
        labels_bottom = model.labels_
    else:
        labels_bottom = model.predict(X_bottom_10)

    # Store results
    bottom_clusters_comparison[name] = labels_bottom

# Step 3: Examine characteristics of the clusters formed
def analyze_clusters(X_data, labels, title):
    cluster_df = pd.DataFrame({'Cluster': labels})
    cluster_counts = cluster_df['Cluster'].value_counts().sort_index()
    print(f"\n{title} Cluster Sizes:")
    print(cluster_counts)

    # Optionally, display a sample of instances from each cluster
    for cluster in cluster_counts.index:
        print(f"\nSample Instances from Cluster {cluster}:")
        print(X_data[cluster_df['Cluster'] == cluster].head())

# Analyze clusters for the Top 10% uncertain data
print("\nAnalyzing clusters for Top 10% uncertain data:")
for name, labels in top_clusters_comparison.items():
    analyze_clusters(X_top_10, labels, f"Top 10% - {name}")

# Analyze clusters for the Bottom 10% uncertain data
print("\nAnalyzing clusters for Bottom 10% uncertain data:")
for name, labels in bottom_clusters_comparison.items():
    analyze_clusters(X_bottom_10, labels, f"Bottom 10% - {name}")

# Exercise 6

In [None]:
import pandas as pd
from sklearn.datasets import fetch_kddcup99
from sklearn.preprocessing import LabelEncoder

# Step 1: Load the KDDcup99 dataset
data = fetch_kddcup99()
X, y = pd.DataFrame(data.data, columns=data.feature_names), pd.Series(data.target)

# Step 2: Select Features and Check Dimensions
num_features = X.shape[1]  # Number of features in KDDcup99
print(f"Original number of features: {num_features}")

# Step 3: Ensure Features are Discrete or Continuous
# Convert categorical features to numerical values using Label Encoding or One-Hot Encoding
for col in X.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])

# Convert all feature types to integers or floats
X = X.astype(float)  # Convert to float for consistency

# Check the new dimensionality
new_num_features = X.shape[1]  # Number of features after transformation
print(f"New number of features: {new_num_features}")

# Step 4: Create Target Variable
# Target should be 'normal.' or the name of the anomaly type
# Map the original target to 'normal.' and anomaly types
y = y.apply(lambda x: 'normal.' if x == 'normal.' else x)  # Retain anomaly names

# Combine Features and Target
final_dataset = pd.concat([X, y.rename('target')], axis=1)

# Step 5: Output the New Structure
samples_total = final_dataset.shape[0]
dimensionality = final_dataset.shape[1] - 1  # Exclude target column

print(f"Samples total: {samples_total}")
print(f"Dimensionality: {dimensionality}")
print(f"Features: {X.dtypes}")

# Export the transformed dataset to CSV
final_dataset.to_csv('SA.csv', index=False)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, LeaveOneOut
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import classification_report

# Step 1: Load the "SA" dataset (Assuming it's in CSV format)
# Replace 'path_to_sa_dataset.csv' with the actual path to the dataset
sa_data = pd.read_csv('SA.csv')

# Step 2: Preprocess the dataset
# Here, you may need to define your feature columns and label (if applicable)
# For example, if your dataset has a label column named 'label':
features = sa_data.drop(columns=['label'], errors='ignore')  # Assuming 'label' indicates normal/anomaly
labels = sa_data['label'] if 'label' in sa_data.columns else None

# Step 3: Create a subsample of 250 datapoints
sa_sample = features.sample(n=250, random_state=42)

# Step 4: Perform Leave-One-Out Cross-Validation for each algorithm
loo = LeaveOneOut()

# Define anomaly detection algorithms
anomaly_detectors = {
    "Isolation Forest": IsolationForest(contamination=0.1, random_state=42),
    "One-Class SVM": OneClassSVM(gamma='auto', nu=0.1),  # Adjust 'nu' as needed
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.1)
}

# Initialize performance report
performance_report = {}

# Evaluate each algorithm using LOOCV
for name, detector in anomaly_detectors.items():
    y_true = []  # True labels (for comparison, if available)
    y_pred = []  # Predicted labels

    for train_index, test_index in loo.split(sa_sample):
        X_train, X_test = sa_sample.iloc[train_index], sa_sample.iloc[test_index]

        if name == "Local Outlier Factor":
            # LOF does not need to fit on the test point
            detector.fit(X_train)
            y_pred.append(detector.fit_predict(X_test)[0])  # LOF returns -1 for outliers and 1 for inliers
        else:
            detector.fit(X_train)
            y_pred.append(detector.predict(X_test)[0])  # Predict for the test point

        # Record the true label if available
        if labels is not None:
            y_true.append(labels.iloc[test_index[0]])

    # Convert predictions to binary (1: inliers, -1: outliers)
    y_pred_binary = [1 if p == -1 else 0 for p in y_pred]  # Convert outlier predictions to binary

    # If true labels are available, generate classification report
    if labels is not None:
        print(f"\nClassification Report for {name}:")
        print(classification_report(y_true, y_pred_binary))
    performance_report[name] = y_pred_binary  # Store predictions for later use

# Optional: Analyze and visualize the predictions for each detector
# This part can be customized based on the specific needs of the analysis


# Exercise 7 (Redo Exercise 6 with Leave One Out)

# Exercise 8

# Final Results
