# Lab 6

Scikit-learn provides a wide variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (3-classes)

* Regression Datasets: 
    * Boston house prices (removed in scikit-learn 1.2)
    * Diabetes
    * Linnerrud (multiple regression)
    * California Housing

* Image: 
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition
    * Forest covertypes

* NLP:
    * 20 newsgroups
    * Reuters Corpus Volume I 

* Other:
    * Kddcup 99- Intrusion Detection
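
As a quick illustration of how the bundled datasets are accessed, the sketch below loads Iris with `load_iris` and inspects the returned object. Small datasets ship with scikit-learn and are exposed through `load_*` functions, while larger ones such as Kddcup 99 are downloaded on first use through `fetch_*` functions. The variable names (`iris`, `X_iris`, `y_iris`) are only illustrative.

```python
from sklearn.datasets import load_iris

# Small datasets are bundled with the library, so no download is needed
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

print(X_iris.shape)              # number of samples and features
print(list(iris.target_names))   # class names
print(iris.feature_names)        # column names
```
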

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare the classification performance of 3 different classifiers.
    * Separate the data into training, validation, and test sets.
    * Use accuracy as the metric for assessing performance. 
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1? 

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use a feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve using fewer features?
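
Question 7 combines anomaly detection with leave-one-out evaluation. One possible reading, sketched below on made-up 2-D data rather than the "SA" subset, is to fit the detector on all points but one and then ask whether the held-out point is flagged. `LocalOutlierFactor` in novelty mode is used here only as an example detector; the toy data, the variable names, and the choice of `n_neighbors=10` are all assumptions.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import LocalOutlierFactor

# Toy data: 57 inliers around the origin plus 3 obvious anomalies (made-up values)
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(57, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
X_toy = np.vstack([inliers, outliers])
y_toy = np.array([1] * 57 + [-1] * 3)  # +1 = inlier, -1 = anomaly (scikit-learn convention)

preds = np.empty(len(X_toy), dtype=int)
for train_idx, test_idx in LeaveOneOut().split(X_toy):
    # novelty=True allows predict() on points that were not part of fit()
    detector = LocalOutlierFactor(n_neighbors=10, novelty=True)
    detector.fit(X_toy[train_idx])
    preds[test_idx] = detector.predict(X_toy[test_idx])

loo_accuracy = np.mean(preds == y_toy)
outlier_recall = np.mean(preds[y_toy == -1] == -1)
print(f"LOO accuracy: {loo_accuracy:.3f}, anomaly recall: {outlier_recall:.3f}")
```

With 250 points this loop fits the detector 250 times, so a fast detector or a small model keeps the run time reasonable.
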

## Quick look at the data

In [2]:
from sklearn.datasets import fetch_kddcup99
D=fetch_kddcup99()

In [3]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [4]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [6]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [7]:
len(np.unique(D["target"]))

23

In [8]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

# Exercise 1

Logistic Regression

In [9]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer


In [10]:
df = pd.DataFrame(D['data'], columns=D['feature_names'])

#Assign features and target variables
X = df
y = D['target']

#Binarize the target: 1 for normal traffic, 0 for any attack type
y = (y == b'normal.').astype(int)

In [11]:
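#Preprocessing transformations
#Note: with the default arguments D['data'] is an object-dtype array, so every column of X
#may be picked up as categorical here unless the numeric columns are converted first.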
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(exclude=['object']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

X = preprocessor.fit_transform(X)
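
The column indices above 16000 in the sparse output printed in Exercise 2 suggest that most or all columns were one-hot encoded, which is what happens when every column has `object` dtype. A minimal sketch of one way to recover numeric dtypes before selecting columns, using a toy frame with made-up values as a stand-in for `df`: `DataFrame.infer_objects()` converts object columns that hold only numbers. Whether this is sufficient for the real data is an assumption worth checking with `df.dtypes`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for D['data']: an object-dtype array mixing numbers and labels (made-up values)
raw = np.array([[0, 'tcp', 181], [2, 'udp', 239], [0, 'tcp', 235]], dtype=object)
toy_df = pd.DataFrame(raw, columns=['duration', 'protocol_type', 'src_bytes'])
print(toy_df.dtypes)  # dtypes before conversion

# Soft conversion: object columns that contain only numbers become numeric
toy_df = toy_df.infer_objects()
print(toy_df.dtypes)  # dtypes after conversion

# Selecting on "number" avoids relying on the exact dtype used for text columns
numerical = toy_df.select_dtypes(include='number').columns.tolist()
categorical = toy_df.select_dtypes(exclude='number').columns.tolist()
print(numerical, categorical)
```
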

In [12]:
#Train/validation/test split (60% / 20% / 20%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [13]:
#Train a Base Logistic Regression Model
base_lr = LogisticRegression(solver='liblinear')
base_lr.fit(X_train, y_train)

#Evaluate Base Logistic Regression on Validation Set
y_val_pred = base_lr.predict(X_val)
val_acc = accuracy_score(y_val, y_val_pred)
print(f"Base Logistic Regression Validation Accuracy: {val_acc:.7f}\n")

Base Logistic Regression Validation Accuracy: 0.9997065



In [14]:
#Train Logistic Regression with Hyperparameter Tuning
param_grid_lr = {'C': [0.1, 1, 10], 'penalty': ['l2']}
logistic_regression = LogisticRegression(solver='liblinear')
grid_search_lr = GridSearchCV(logistic_regression, param_grid_lr, cv=5, n_jobs=-1)
grid_search_lr.fit(X_train, y_train)

#Best parameters and best score (best_score_ is the mean cross-validated accuracy on the training folds)
print(f"Logistic Regression Best Parameters: {grid_search_lr.best_params_}")
print(f"Logistic Regression Best Validation Accuracy: {grid_search_lr.best_score_:.4f}\n")


best_lr = grid_search_lr.best_estimator_
y_pred = best_lr.predict(X_test)
acc_lr = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Test Accuracy: {acc_lr:.4f}\n")


Logistic Regression Best Parameters: {'C': 10, 'penalty': 'l2'}
Logistic Regression Best Validation Accuracy: 0.9997

Logistic Regression Test Accuracy: 0.9997
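
The exercise asks for optimization over at least 2 hyperparameters and for a separate validation set. The sketch below, on synthetic data from `make_classification`, searches over both `C` and `penalty` and then reports two different numbers: the mean cross-validated accuracy (`best_score_`) and the accuracy on a held-out validation split. The data, the grid values, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem (stand-in for the preprocessed KDD features)
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Two hyperparameters: regularization strength and penalty type (liblinear supports l1 and l2)
grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
search = GridSearchCV(LogisticRegression(solver='liblinear'), grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

cv_acc = search.best_score_  # mean accuracy over the 5 folds of X_tr
val_acc = accuracy_score(y_va, search.best_estimator_.predict(X_va))  # held-out split

print(search.best_params_)
print(f"CV accuracy: {cv_acc:.4f}, validation accuracy: {val_acc:.4f}")
```
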



Support Vector Machine

In [None]:
#Base SVM Model
base_svm = SVC()
base_svm.fit(X_train, y_train)

#Evaluate Base SVM on Validation Set
y_val_pred_svm = base_svm.predict(X_val)
val_acc_svm = accuracy_score(y_val, y_val_pred_svm)
print(f"Base SVM Validation Accuracy: {val_acc_svm:.4f}\n")


In [None]:
#Train SVM with Hyperparameter Tuning
param_grid_svm = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
svm = SVC()
grid_search_svm = GridSearchCV(svm, param_grid_svm, cv=5, n_jobs=-1)
grid_search_svm.fit(X_train, y_train)


In [None]:
#Best parameters and validation score for SVM
print(f"SVM Best Parameters: {grid_search_svm.best_params_}")
print(f"SVM Best Validation Accuracy: {grid_search_svm.best_score_:.4f}\n")


In [None]:
#Evaluate SVM on Test Set
best_svm = grid_search_svm.best_estimator_
y_pred_svm = best_svm.predict(X_test)
acc_svm = accuracy_score(y_test, y_pred_svm)
print(f"SVM Test Accuracy: {acc_svm:.4f}\n")


Decision Tree

In [None]:
#Base Decision Tree Model
base_dt = DecisionTreeClassifier()
base_dt.fit(X_train, y_train)

#Evaluate Base Decision Tree on Validation Set
y_val_pred_dt = base_dt.predict(X_val)
val_acc_dt = accuracy_score(y_val, y_val_pred_dt)
print(f"Base Decision Tree Validation Accuracy: {val_acc_dt:.4f}\n")


In [None]:
#Train Decision Tree with Hyperparameter Tuning
param_grid_dt = {'max_depth': [5, 10, 15], 'min_samples_split': [2, 10, 20]}
dt = DecisionTreeClassifier()
grid_search_dt = GridSearchCV(dt, param_grid_dt, cv=5, n_jobs=-1)
grid_search_dt.fit(X_train, y_train)


In [None]:
#Best parameters and validation score for Decision Tree
print(f"Decision Tree Best Parameters: {grid_search_dt.best_params_}")
print(f"Decision Tree Best Validation Accuracy: {grid_search_dt.best_score_:.4f}\n")


In [None]:
# Decision Tree on Test Set
best_dt = grid_search_dt.best_estimator_
y_pred_dt = best_dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)
print(f"Decision Tree Test Accuracy: {acc_dt:.4f}\n")


# Exercise 2
Logistic Regression

In [15]:
from sklearn.utils import resample

#25 model ensemble 
n_models = 25
models = []

#Train 25 Logistic Regression models, each on a bootstrap resample of the training set
for i in range(n_models):
    X_train_sample, y_train_sample = resample(X_train, y_train, random_state=i)
    lr = LogisticRegression(max_iter=1000, solver='liblinear', C=10, penalty='l2')
    lr.fit(X_train_sample, y_train_sample)
    models.append(lr)


In [16]:
#Initializing matrix to store the probabilities for the models
ensemble_probs = np.zeros((X_test.shape[0], n_models))

#Probability prediction for each model in the ensemble
for i, model in enumerate(models):
    ensemble_probs[:, i] = model.predict_proba(X_test)[:, 1]

#Average prediction
average_probs = np.mean(ensemble_probs, axis=1)

#Uncertainty: per-sample variance of the predicted probabilities across the ensemble
uncertainty = np.var(ensemble_probs, axis=1)

#Sorting of top and bottom
n_top_bottom = int(0.1 * X_test.shape[0])
sorted_indices = np.argsort(uncertainty)

#The top 10% of the data in terms of uncertainty
top_10_percent_indices = sorted_indices[-n_top_bottom:]

#The bottom 10% of the data in terms of uncertainty
bottom_10_percent_indices = sorted_indices[:n_top_bottom]

print("Top 10% most uncertain samples:")
print(X_test[top_10_percent_indices])
print("\nBottom 10% least uncertain samples:")
print(X_test[bottom_10_percent_indices])

Top 10% most uncertain samples:
  (0, 0)	1.0
  (0, 2495)	1.0
  (0, 2512)	1.0
  (0, 2573)	1.0
  (0, 3066)	1.0
  (0, 5875)	1.0
  (0, 16600)	1.0
  (0, 16602)	1.0
  (0, 16605)	1.0
  (0, 16609)	1.0
  (0, 16631)	1.0
  (0, 16637)	1.0
  (0, 16639)	1.0
  (0, 16662)	1.0
  (0, 16664)	1.0
  (0, 16667)	1.0
  (0, 16687)	1.0
  (0, 16705)	1.0
  (0, 16708)	1.0
  (0, 16715)	1.0
  (0, 16716)	1.0
  (0, 16717)	1.0
  (0, 17174)	1.0
  (0, 17644)	1.0
  (0, 17679)	1.0
  :	:
  (9879, 16687)	1.0
  (9879, 16705)	1.0
  (9879, 16708)	1.0
  (9879, 16715)	1.0
  (9879, 16716)	1.0
  (9879, 16717)	1.0
  (9879, 16723)	1.0
  (9879, 17213)	1.0
  (9879, 17679)	1.0
  (9879, 17771)	1.0
  (9879, 17822)	1.0
  (9879, 17899)	1.0
  (9879, 18024)	1.0
  (9879, 18096)	1.0
  (9879, 18173)	1.0
  (9879, 18446)	1.0
  (9879, 18450)	1.0
  (9879, 18704)	1.0
  (9879, 18835)	1.0
  (9879, 18905)	1.0
  (9879, 19006)	1.0
  (9879, 19071)	1.0
  (9879, 19171)	1.0
  (9879, 19256)	1.0
  (9879, 19344)	1.0

Bottom 10% least uncertain samples:
  (0, 0)	
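
Variance across ensemble members measures disagreement between the models. It is not the only possible uncertainty measure: a sample on which every model predicts a probability near 0.5 has low variance yet is still an uncertain decision. The sketch below compares variance with the binary entropy of the averaged probability on a made-up probability matrix. The entropy measure is an alternative offered for comparison, not the method used in the cells above.

```python
import numpy as np

# Rows are samples, columns are ensemble members; values are made-up P(class = 1)
toy_probs = np.array([
    [0.99, 0.98, 0.99, 0.97],  # members agree and are confident
    [0.50, 0.52, 0.49, 0.49],  # members agree but are unsure
    [0.05, 0.95, 0.10, 0.80],  # members disagree
])

# Disagreement between members
variance = toy_probs.var(axis=1)

# Binary entropy (in bits) of the averaged probability; clipping avoids log(0)
mean_p = np.clip(toy_probs.mean(axis=1), 1e-12, 1 - 1e-12)
entropy = -(mean_p * np.log2(mean_p) + (1 - mean_p) * np.log2(1 - mean_p))

most_uncertain_by_variance = int(np.argmax(variance))
most_uncertain_by_entropy = int(np.argmax(entropy))
print(variance.round(4), entropy.round(4))
```

Here the two measures pick different rows, which is a reason to state explicitly which notion of uncertainty the top and bottom 10% refer to.
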

# Exercise 3


In [None]:
from sklearn.feature_selection import RFE

#Feature Selection using RFE
rfe = RFE(estimator=best_lr, n_features_to_select=10)
rfe.fit(X_train, y_train)
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

#Train Logistic Regression using RFE-selected features
lr_rfe = LogisticRegression(max_iter=1000, solver='liblinear', C=10, penalty='l2')
lr_rfe.fit(X_train_rfe, y_train)
y_pred_rfe = lr_rfe.predict(X_test_rfe)
acc_rfe = accuracy_score(y_test, y_pred_rfe)
print(f"Logistic Regression Test Accuracy with RFE-selected features: {acc_rfe:.4f}\n")

In [None]:
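#Feature Selection using Random Forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
feature_importances = rf.feature_importances_
#argsort is ascending, so the last 10 indices are the 10 largest importances
important_features_indices = np.argsort(feature_importances)[-10:]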

#Select the top 10 important features for train and test sets
X_train_rf = X_train[:, important_features_indices]
X_test_rf = X_test[:, important_features_indices]

In [None]:
#Train Logistic Regression using Random Forest-selected features
lr_rf = LogisticRegression(max_iter=1000, solver='liblinear', C=10, penalty='l2')
lr_rf.fit(X_train_rf, y_train)
y_pred_rf = lr_rf.predict(X_test_rf)
acc_rf = accuracy_score(y_test, y_pred_rf)
print(f"Logistic Regression Test Accuracy with Random Forest-selected features: {acc_rf:.4f}\n")

# Exercise 4

In [None]:
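#Clustering Analysis
from scipy.sparse import vstack as sparse_vstack
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score, silhouette_score
X_all = sparse_vstack([X_train, X_test]).tocsr()  #features are sparse after one-hot encoding, so np.vstack would not stack them correctly
y_all = np.hstack([y_train, y_test])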

In [None]:
#K-means Clustering
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X_all)
kmeans_ari = adjusted_rand_score(y_all, kmeans_labels)
kmeans_silhouette = silhouette_score(X_all, kmeans_labels)
print(f"KMeans ARI: {kmeans_ari:.4f}, Silhouette Score: {kmeans_silhouette:.4f}\n")

In [None]:
#Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=2)
agg_labels = agg_clustering.fit_predict(X_all)
agg_ari = adjusted_rand_score(y_all, agg_labels)
agg_silhouette = silhouette_score(X_all, agg_labels)
print(f"Agglomerative Clustering ARI: {agg_ari:.4f}, Silhouette Score: {agg_silhouette:.4f}\n")


In [None]:
#DBSCAN Clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_all)
valid_labels = dbscan_labels != -1 
if np.sum(valid_labels) > 0:
    dbscan_ari = adjusted_rand_score(y_all[valid_labels], dbscan_labels[valid_labels])
    dbscan_silhouette = silhouette_score(X_all[valid_labels], dbscan_labels[valid_labels])
    print(f"DBSCAN ARI: {dbscan_ari:.4f}, Silhouette Score: {dbscan_silhouette:.4f}\n")
else:
    print("DBSCAN did not find enough clusters to evaluate.\n")
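
To answer whether the clusters line up with the classes, a single score such as ARI can be complemented by a contingency table of true class against cluster label. The sketch below uses well-separated synthetic blobs from `make_blobs` and `contingency_matrix` from `sklearn.metrics.cluster`; the data and names are assumptions for illustration, and real KDD clusters are unlikely to be this clean.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Three well-separated synthetic classes
blob_X, blob_y = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
blob_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(blob_X)

# Rows are true classes, columns are cluster ids; each cell counts shared samples
table = contingency_matrix(blob_y, blob_labels)
blob_ari = adjusted_rand_score(blob_y, blob_labels)
print(table)
print(f"ARI: {blob_ari:.4f}")
```

A class that maps onto a single cluster shows one dominant cell in its row. Cluster ids are arbitrary, so the dominant cell need not lie on the diagonal.
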


# Exercise 5

In [None]:
#Clustering Analysis on Top & Bottom 10%
def cluster_analysis(samples, title):
    print(f"Clustering on {title} Samples:\n")

    # KMeans
    kmeans = KMeans(n_clusters=2, random_state=42)
    kmeans_labels = kmeans.fit_predict(samples)
    print(f"KMeans Labels: {np.unique(kmeans_labels)}\n")

    # Agglomerative Clustering
    agg = AgglomerativeClustering(n_clusters=2)
    agg_labels = agg.fit_predict(samples)
    print(f"Agglomerative Clustering Labels: {np.unique(agg_labels)}\n")

    # DBSCAN
    dbscan = DBSCAN(eps=0.5, min_samples=5)
    dbscan_labels = dbscan.fit_predict(samples)
    valid_labels = dbscan_labels != -1
    if np.sum(valid_labels) > 0:
        print(f"DBSCAN Labels: {np.unique(dbscan_labels[valid_labels])}\n")
    else:
        print("DBSCAN did not find enough clusters to evaluate.\n")
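
Printing the unique labels shows only that clusters exist. To describe their characteristics, one option is to summarize the features per cluster. A minimal sketch on a made-up two-feature frame (the column names are borrowed from the dataset, the values are synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two synthetic groups with clearly different feature values
rng = np.random.default_rng(0)
toy_samples = pd.DataFrame({
    'duration': np.concatenate([rng.normal(1, 0.1, 50), rng.normal(20, 0.1, 50)]),
    'src_bytes': np.concatenate([rng.normal(200, 5, 50), rng.normal(5000, 5, 50)]),
})

features = ['duration', 'src_bytes']
toy_samples['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_samples[features])

# Cluster sizes and per-cluster feature means give a first description of each group
sizes = toy_samples['cluster'].value_counts()
profile = toy_samples.groupby('cluster')[features].mean()
print(sizes)
print(profile)
```
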


# Exercise 6