# Lab 6

scikit-learn provides a wide variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer Wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (3-classes)

* Regression Datasets: 
    * Boston house prices 
    * Diabetes
    * Linnerrud (multiple regression)
    * California Housing

* Image: 
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition

* NLP:
    * The 20 newsgroups text dataset
    * Reuters Corpus Volume I 

* Other:
    * Forest covertypes
    * Kddcup 99 - Intrusion Detection

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers. 
    * Separate the data into train, validation, and test. 
    * Use accuracy as the metric for assessing performance. 
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1? 

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 data points and redo question 6, using leave-one-out as the method of evaluation.

8. Use the feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve using fewer features?

## Quick look at the data

In [1]:
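from sklearn.datasets import fetch_kddcup99
# percent10=True is the default, so this loads the 10% subset;
# pass percent10=False to fetch the full dataset asked for in Exercise 1.
D = fetch_kddcup99()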

In [2]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [3]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [5]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [6]:
len(np.unique(D["target"]))

23

In [7]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

# Exercise 1

In [8]:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Load the dataset from D
df = pd.DataFrame(D['data'], columns=D['feature_names'])

# Assign features and target variables
X = df  # All features
y = D['target']  # Target labels

# Convert target variable to binary
y = (y == b'normal.').astype(int)
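
As a quick check of the binarisation step, the sketch below applies the same comparison to a small hand-made label array and counts each class. The five labels are illustrative values in the same byte-string format, not rows taken from the dataset.

```python
import numpy as np

# Hypothetical byte-string labels in the same format as D["target"]
labels = np.array([b"normal.", b"smurf.", b"neptune.", b"normal.", b"smurf."],
                  dtype=object)

# Same conversion as above: 1 = normal traffic, 0 = any attack
y_toy = (labels == b"normal.").astype(int)

# Count how many samples fall in each class (index 0 = attack, 1 = normal)
class_counts = np.bincount(y_toy, minlength=2)
print(y_toy, class_counts)
```

Running the same `np.bincount` on the real `y` shows how skewed the binary target is, which matters when reading the accuracy figures later on.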

In [9]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

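# Preprocessing via transformations.
# D['data'] is an object-dtype array, so every column of X has dtype object:
# numerical_cols is empty here and all features are one-hot encoded.
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(exclude=['object']).columns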

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Apply the preprocessor
X = preprocessor.fit_transform(X)
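
The cell above fits the preprocessor on every row before the data is split. One common alternative is to wrap the preprocessor and the model in a `Pipeline`, so that scaling and encoding are learned only from the rows passed to `fit`. The following is a minimal sketch of that idea on a made-up frame. The column values, the explicit column lists and the names `toy`, `toy_pre` and `clf` are assumptions for illustration, not part of the lab.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    "duration": [0, 5, 2, 40, 1, 33, 3, 28],
    "protocol_type": ["tcp", "udp", "tcp", "icmp", "tcp", "icmp", "udp", "icmp"],
})
toy_y = [1, 1, 1, 0, 1, 0, 1, 0]

# Columns are named explicitly instead of being inferred from dtypes
toy_pre = ColumnTransformer([
    ("num", StandardScaler(), ["duration"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol_type"]),
])

# Inside a Pipeline, fit() only ever sees the rows passed to it,
# here the first six rows standing in for a training split
clf = Pipeline([("pre", toy_pre), ("model", LogisticRegression())])
clf.fit(toy.iloc[:6], toy_y[:6])

# One scaled numeric column plus one column per category seen in training
n_encoded = clf.named_steps["pre"].transform(toy.iloc[:6]).shape[1]
held_out_pred = clf.predict(toy.iloc[6:])
print(n_encoded, held_out_pred)
```

Naming the numeric and categorical columns explicitly also avoids relying on dtype inference, which treats every column as categorical when the frame holds only object columns.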

In [10]:
# Train/validation/test split (60% / 20% / 20%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
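
Because the binary target is imbalanced, it can help to keep the class ratio identical in all three parts. The sketch below shows the same 60/20/20 split with the `stratify` argument of `train_test_split`. The synthetic arrays `X_demo` and `y_demo` are assumptions used only to make the example self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 20% positives, mimicking a skewed target
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 200 + [0] * 800)

# 60/20/20 split; stratify keeps the class ratio the same in every part
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

print(len(y_tr), len(y_va), len(y_te))
print(y_tr.mean(), y_va.mean(), y_te.mean())
```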

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

## Logistic Regression

In [12]:
param_grid_lr = {
    'C': [0.1, 1, 10],
    'penalty': ['l2']
}

lr = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid_lr, cv=5)
lr.fit(X_train, y_train)

# Best parameters and mean cross-validated accuracy (computed on folds of the training set)
print("Logistic Regression Best Parameters:", lr.best_params_)
print("Logistic Regression Best Validation Accuracy:", lr.best_score_)

Logistic Regression Best Parameters: {'C': 10, 'penalty': 'l2'}
Logistic Regression Best Validation Accuracy: 0.9996963684542365


## Stochastic Gradient Descent

In [13]:
# Define the parameter grid
param_grid_sgd = {
    'alpha': [0.0001, 0.001, 0.01],  # Regularization strength
    # Hinge gives a linear SVM, log_loss gives logistic regression
    # (the logistic loss was named 'log' before scikit-learn 1.1)
    'loss': ['hinge', 'log_loss']
}

# Initialize SGDClassifier and perform GridSearchCV
sgd = GridSearchCV(SGDClassifier(max_iter=1000, tol=1e-3), param_grid_sgd, cv=5)
sgd.fit(X_train, y_train)

# Display best hyperparameters and mean cross-validated accuracy
print("SGDClassifier Best Parameters:", sgd.best_params_)
print("SGDClassifier Best Validation Accuracy:", sgd.best_score_)



SGDClassifier Best Parameters: {'alpha': 0.0001, 'loss': 'hinge'}
SGDClassifier Best Validation Accuracy: 0.9995917848406652


## Decision Tree

In [14]:
# Define the parameter grid
param_grid_dt = {
    'max_depth': [5, 10, 15],  # Control the depth of the tree
    'min_samples_split': [2, 10, 20]  # Control how splits occur
}

# Initialize Decision Tree Classifier and perform GridSearchCV
dt = GridSearchCV(DecisionTreeClassifier(), param_grid_dt, cv=5)
dt.fit(X_train, y_train)

# Display best hyperparameters and mean cross-validated accuracy
print("Decision Tree Best Parameters:", dt.best_params_)
print("Decision Tree Best Validation Accuracy:", dt.best_score_)

Decision Tree Best Parameters: {'max_depth': 15, 'min_samples_split': 2}
Decision Tree Best Validation Accuracy: 0.9996693789822373
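
The `best_score_` values printed by `GridSearchCV` are mean accuracies over cross-validation folds of the training data, so the separate validation split is not used by the cells above. If the held-out split is preferred for model selection, a plain loop over the grid is one option. The sketch below does this for the decision tree grid on synthetic data. `X_syn`, `y_syn` and the split sizes are assumptions, and the real `X_train` and `X_val` would take their place.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real X_train / X_val would be used instead
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Two hyperparameters: tree depth and minimum samples needed to split a node
grid = list(product([5, 10, 15], [2, 10, 20]))

val_scores = {}
for max_depth, min_samples_split in grid:
    tree = DecisionTreeClassifier(max_depth=max_depth,
                                  min_samples_split=min_samples_split,
                                  random_state=0)
    tree.fit(X_tr, y_tr)
    # Score on the held-out validation split, not on the training data
    val_scores[(max_depth, min_samples_split)] = tree.score(X_va, y_va)

best_config = max(val_scores, key=val_scores.get)
print(best_config, val_scores[best_config])
```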


## Evaluating Classifiers on Test Set

In [15]:
from sklearn.metrics import accuracy_score

# Logistic Regression on Test Set
y_pred_lr = lr.best_estimator_.predict(X_test)
acc_lr = accuracy_score(y_test, y_pred_lr)

# SGDC on the test set
y_pred_sgd = sgd.best_estimator_.predict(X_test)
acc_sgd = accuracy_score(y_test, y_pred_sgd)

# Decision Tree on the test set
y_pred_dt = dt.best_estimator_.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

# Report test accuracy for each classifier
print(f"SGDClassifier Test Accuracy: {acc_sgd:.3f}")
print(f"Decision Tree Test Accuracy: {acc_dt:.3f}")
print(f"Logistic Regression Test Accuracy: {acc_lr:.3f}")

SGDClassifier Test Accuracy: 1.000
Decision Tree Test Accuracy: 0.999
Logistic Regression Test Accuracy: 1.000


In [16]:
results = pd.DataFrame({
    'Classifier': ['Logistic Regression', 'SGDClassifier', 'Decision Tree'],
    'Accuracy': [acc_lr, acc_sgd, acc_dt]
})

print(results)

            Classifier  Accuracy
0  Logistic Regression  0.999666
1        SGDClassifier  0.999666
2        Decision Tree  0.999494
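
To judge whether these differences are meaningful, the accuracies can be turned back into approximate error counts and a binomial standard error. The sketch below does that arithmetic. The test-set size of 98,805 rows is an assumption derived from a 20% split of the 494,021 rows in the 10% subset, so adjust `n_test` to `X_test.shape[0]` when running it against the real data.

```python
import math

# Test accuracies reported in the table above
acc = {"Logistic Regression": 0.999666,
       "SGDClassifier": 0.999666,
       "Decision Tree": 0.999494}

# Assumption: 98,805 test rows (20% of the 494,021 rows in the 10% subset)
n_test = 98805

summary = {}
for name, a in acc.items():
    errors = round((1 - a) * n_test)           # approximate number of mistakes
    std_err = math.sqrt(a * (1 - a) / n_test)  # binomial standard error
    summary[name] = (errors, std_err)
    print(f"{name}: ~{errors} errors, standard error {std_err:.6f}")
```

Under that assumption the gap between the classifiers comes down to a few dozen test rows, which is worth keeping in mind before declaring a winner.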


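### Logistic Regression and SGDClassifier tie for the best test accuracy (0.999666); the Decision Tree is slightly lower (0.999494). Logistic Regression is carried forward to Exercise 2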

# Exercise 2

In [17]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from sklearn.metrics import accuracy_score

In [18]:
# Set up the number of models in the ensemble
n_models = 25
models = []

# Train 25 Logistic Regression models on bootstrap resamples of the training data
for i in range(n_models):
    # Bootstrap sample (drawn with replacement) from the training data
    X_train_sample, y_train_sample = resample(X_train, y_train, random_state=i)
    
    # Initialize and train a Logistic Regression model; a separate name is
    # used so the fitted GridSearchCV object `lr` is not overwritten
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_sample, y_train_sample)
    
    # Store the trained model
    models.append(model)
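
The manual loop above is a form of bagging. scikit-learn's `BaggingClassifier` builds the same kind of bootstrap ensemble in one call, and the sketch below shows it on synthetic data. `X_syn`, `y_syn` and `bag` are illustrative names, and this is an alternative to the loop, not a record of what the lab ran.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# 25 logistic regressions, each fit on a bootstrap resample of the rows.
# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bag = BaggingClassifier(LogisticRegression(max_iter=1000),
                        n_estimators=25, bootstrap=True, random_state=0)
bag.fit(X_syn, y_syn)

# Class probabilities averaged across the 25 members
bag_probs = bag.predict_proba(X_syn)
print(len(bag.estimators_), bag_probs.shape)
```

The fitted members are available in `bag.estimators_`, so per-model probabilities can still be collected for an uncertainty estimate.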

In [19]:
# Initialize a matrix to store the probabilities for each model's predictions
ensemble_probs = np.zeros((X_test.shape[0], n_models))

# Get probability predictions for each model in the ensemble
for i, model in enumerate(models):
    # Predict probabilities for class 1 (positive class)
    ensemble_probs[:, i] = model.predict_proba(X_test)[:, 1]

# Average the predictions across all models to get the final prediction probabilities
average_probs = np.mean(ensemble_probs, axis=1)

In [20]:
# Calculate the variance of the ensemble predictions for each sample
uncertainty = np.var(ensemble_probs, axis=1)
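
Variance across models measures how much the ensemble members disagree. It stays low when every model outputs roughly 0.5, even though such a prediction is uncertain. The sketch below contrasts variance with the entropy of the averaged probability on a hand-made 4 x 3 array. The numbers in `toy_probs` are invented for illustration.

```python
import numpy as np

# Toy ensemble output: 4 samples x 3 models, probability of class 1
toy_probs = np.array([
    [0.99, 0.98, 0.99],   # all models agree and are confident
    [0.50, 0.52, 0.48],   # all models agree but are unsure
    [0.10, 0.90, 0.50],   # models disagree
    [0.01, 0.02, 0.01],   # all models agree and are confident
])

# Disagreement between models (the measure used above)
var_unc = np.var(toy_probs, axis=1)

# Entropy of the averaged prediction, in bits; clip avoids log(0)
p = np.clip(toy_probs.mean(axis=1), 1e-12, 1 - 1e-12)
entropy_unc = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(var_unc.round(4), entropy_unc.round(4))
```

The second row has almost no variance but close to maximal entropy, so the two measures can rank samples differently.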

In [21]:
# Get the number of samples in the top and bottom 10%
n_top_bottom = int(0.1 * X_test.shape[0])

# Sort the indices based on uncertainty
sorted_indices = np.argsort(uncertainty)

# Get the top 10% of the data in terms of uncertainty
top_10_percent_indices = sorted_indices[-n_top_bottom:]

# Get the bottom 10% of the data in terms of uncertainty
bottom_10_percent_indices = sorted_indices[:n_top_bottom]

# Get the actual samples for the top and bottom 10% uncertain
X_top_10_percent = X_test[top_10_percent_indices]
X_bottom_10_percent = X_test[bottom_10_percent_indices]

# Optionally, print or display the results
print("Top 10% most uncertain samples:")
print(X_top_10_percent)

print("\nBottom 10% least uncertain samples:")
print(X_bottom_10_percent)

Top 10% most uncertain samples:
  (0, 0)	1.0
  (0, 2496)	1.0
  (0, 2543)	1.0
  (0, 2565)	1.0
  (0, 2575)	1.0
  (0, 5875)	1.0
  (0, 16600)	1.0
  (0, 16602)	1.0
  (0, 16605)	1.0
  (0, 16609)	1.0
  (0, 16631)	1.0
  (0, 16637)	1.0
  (0, 16639)	1.0
  (0, 16662)	1.0
  (0, 16664)	1.0
  (0, 16667)	1.0
  (0, 16687)	1.0
  (0, 16705)	1.0
  (0, 16708)	1.0
  (0, 16715)	1.0
  (0, 16716)	1.0
  (0, 16717)	1.0
  (0, 16949)	1.0
  (0, 17211)	1.0
  (0, 17679)	1.0
  :	:
  (9879, 16687)	1.0
  (9879, 16705)	1.0
  (9879, 16708)	1.0
  (9879, 16715)	1.0
  (9879, 16716)	1.0
  (9879, 16717)	1.0
  (9879, 16723)	1.0
  (9879, 17213)	1.0
  (9879, 17679)	1.0
  (9879, 17771)	1.0
  (9879, 17822)	1.0
  (9879, 17899)	1.0
  (9879, 18024)	1.0
  (9879, 18096)	1.0
  (9879, 18173)	1.0
  (9879, 18446)	1.0
  (9879, 18450)	1.0
  (9879, 18704)	1.0
  (9879, 18835)	1.0
  (9879, 18905)	1.0
  (9879, 19006)	1.0
  (9879, 19071)	1.0
  (9879, 19171)	1.0
  (9879, 19256)	1.0
  (9879, 19344)	1.0

Bottom 10% least uncertain samples:
  (0, 0)	
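
Printing the sparse rows directly is hard to read. A compact per-group summary is often more useful. The sketch below computes group size, mean uncertainty and accuracy for the most and least uncertain 10% on made-up arrays. `toy_avg_probs`, `toy_uncertainty` and `toy_y` are random stand-ins for `average_probs`, `uncertainty` and `y_test`, so the printed numbers carry no meaning for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for average_probs, uncertainty and y_test
n = 200
toy_avg_probs = rng.uniform(0, 1, size=n)
toy_uncertainty = rng.uniform(0, 0.05, size=n)
toy_y = (rng.uniform(0, 1, size=n) < toy_avg_probs).astype(int)

# Indices of the least and most uncertain 10% of samples
k = int(0.1 * n)
order = np.argsort(toy_uncertainty)
groups = {"bottom 10%": order[:k], "top 10%": order[-k:]}

summary = {}
for name, idx in groups.items():
    preds = (toy_avg_probs[idx] >= 0.5).astype(int)
    summary[name] = {
        "n": len(idx),
        "mean_uncertainty": toy_uncertainty[idx].mean(),
        "accuracy": (preds == toy_y[idx]).mean(),
    }
    print(name, summary[name])
```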

In [22]:
# Convert the average probabilities to final predictions (using 0.5 as the threshold)
ensemble_predictions = (average_probs >= 0.5).astype(int)

# Calculate the accuracy of the ensemble
ensemble_accuracy = accuracy_score(y_test, ensemble_predictions)
print(f"Ensemble Test Accuracy: {ensemble_accuracy:.3f}")

Ensemble Test Accuracy: 1.000


# Exercise 3

In [23]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [None]:
# Initialize Logistic Regression for RFE
lr = LogisticRegression(max_iter=1000)

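# Perform RFE to select the top 10 features.
# step=0.1 removes 10% of the features per iteration; the default step=1
# refits the model once per removed feature, which is slow on the
# one-hot-encoded matrix
rfe = RFE(estimator=lr, n_features_to_select=10, step=0.1)
rfe.fit(X_train, y_train)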

# Get the indices of the top 10 features
rfe_top_10_indices = np.where(rfe.ranking_ == 1)[0]

# Select the top 10 features for train and test sets
X_train_rfe = X_train[:, rfe_top_10_indices]
X_test_rfe = X_test[:, rfe_top_10_indices]

print("Top 10 features selected by RFE:", rfe_top_10_indices)

In [None]:
# Initialize Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importance from Random Forest
feature_importances = rf.feature_importances_

# Get the indices of the top 10 important features
rf_top_10_indices = np.argsort(feature_importances)[-10:]

# Select the top 10 features for train and test sets
X_train_rf = X_train[:, rf_top_10_indices]
X_test_rf = X_test[:, rf_top_10_indices]

print("Top 10 features selected by Random Forest:", rf_top_10_indices)

In [None]:
# Train Logistic Regression using features from RFE
lr_rfe = LogisticRegression(max_iter=1000)
lr_rfe.fit(X_train_rfe, y_train)
y_pred_lr_rfe = lr_rfe.predict(X_test_rfe)
acc_lr_rfe = accuracy_score(y_test, y_pred_lr_rfe)

# Train Logistic Regression using features from Random Forest
lr_rf = LogisticRegression(max_iter=1000)
lr_rf.fit(X_train_rf, y_train)
y_pred_lr_rf = lr_rf.predict(X_test_rf)
acc_lr_rf = accuracy_score(y_test, y_pred_lr_rf)

print(f"Logistic Regression Accuracy with RFE features: {acc_lr_rfe:.3f}")
print(f"Logistic Regression Accuracy with Random Forest features: {acc_lr_rf:.3f}")

In [None]:
# Train SGDClassifier using features from RFE
sgd_rfe = SGDClassifier(max_iter=1000, tol=1e-3)
sgd_rfe.fit(X_train_rfe, y_train)
y_pred_sgd_rfe = sgd_rfe.predict(X_test_rfe)
acc_sgd_rfe = accuracy_score(y_test, y_pred_sgd_rfe)

# Train SGDClassifier using features from Random Forest
sgd_rf = SGDClassifier(max_iter=1000, tol=1e-3)
sgd_rf.fit(X_train_rf, y_train)
y_pred_sgd_rf = sgd_rf.predict(X_test_rf)
acc_sgd_rf = accuracy_score(y_test, y_pred_sgd_rf)

print(f"SGDClassifier Accuracy with RFE features: {acc_sgd_rfe:.3f}")
print(f"SGDClassifier Accuracy with Random Forest features: {acc_sgd_rf:.3f}")

In [None]:
# Train Decision Tree using features from RFE
dt_rfe = DecisionTreeClassifier()
dt_rfe.fit(X_train_rfe, y_train)
y_pred_dt_rfe = dt_rfe.predict(X_test_rfe)
acc_dt_rfe = accuracy_score(y_test, y_pred_dt_rfe)

# Train Decision Tree using features from Random Forest
dt_rf = DecisionTreeClassifier()
dt_rf.fit(X_train_rf, y_train)
y_pred_dt_rf = dt_rf.predict(X_test_rf)
acc_dt_rf = accuracy_score(y_test, y_pred_dt_rf)

print(f"Decision Tree Accuracy with RFE features: {acc_dt_rfe:.3f}")
print(f"Decision Tree Accuracy with Random Forest features: {acc_dt_rf:.3f}")

In [None]:
results = pd.DataFrame({
    'Classifier': ['Logistic Regression', 'SGDClassifier', 'Decision Tree'],
    'RFE Accuracy': [acc_lr_rfe, acc_sgd_rfe, acc_dt_rfe],
    'Random Forest Accuracy': [acc_lr_rf, acc_sgd_rf, acc_dt_rf]
})

print(results)

# Exercise 4

In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Exercise 5

In [None]:
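# Combine train and test for clustering. The preprocessed matrices are
# scipy sparse matrices, so stack them with scipy.sparse.vstack
from scipy.sparse import vstack
X_all = vstack([X_train, X_test]).tocsr()
y_all = np.hstack([y_train, y_test])  # Keep labels for evaluation only

# Standardize the data; with_mean=False is required for sparse input
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_all)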

## K-Means Clustering

In [None]:
n_classes = len(np.unique(y_all))  # Assuming binary classification, this would be 2

# Apply K-Means
kmeans = KMeans(n_clusters=n_classes, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

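# Evaluate clustering performance using Adjusted Rand Index (ARI) and Silhouette Score.
# The silhouette is estimated on a random sample of 10,000 points because the
# exact score needs all pairwise distances
ari_kmeans = adjusted_rand_score(y_all, kmeans_labels)
silhouette_kmeans = silhouette_score(X_scaled, kmeans_labels, sample_size=10000, random_state=42)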

print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"K-Means Silhouette Score: {silhouette_kmeans:.3f}")

## Agglomerative Clustering

In [None]:
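# Agglomerative clustering needs a dense matrix and memory that grows
# quadratically with the number of samples, so fit it on a random subsample
rng = np.random.default_rng(42)
sub_idx = rng.choice(X_scaled.shape[0], size=2000, replace=False)
X_sub = X_scaled[sub_idx]
X_sub = X_sub.toarray() if hasattr(X_sub, "toarray") else X_sub

# Apply Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=n_classes)
agg_labels = agg_clustering.fit_predict(X_sub)

# Evaluate clustering performance using ARI and Silhouette Score
ari_agg = adjusted_rand_score(y_all[sub_idx], agg_labels)
silhouette_agg = silhouette_score(X_sub, agg_labels)

print(f"Agglomerative Clustering ARI: {ari_agg:.3f}")
print(f"Agglomerative Clustering Silhouette Score: {silhouette_agg:.3f}")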

## DBSCAN Clustering

In [None]:
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)

# Some points might be marked as noise (-1), so we filter those out for evaluation
valid_labels = dbscan_labels != -1

# Default to NaN so the comparison table below still works if evaluation is skipped
ari_dbscan = silhouette_dbscan = np.nan

# The silhouette score needs at least two clusters among the non-noise points
if len(np.unique(dbscan_labels[valid_labels])) >= 2:
    ari_dbscan = adjusted_rand_score(y_all[valid_labels], dbscan_labels[valid_labels])
    silhouette_dbscan = silhouette_score(X_scaled[valid_labels], dbscan_labels[valid_labels])

    print(f"DBSCAN ARI: {ari_dbscan:.3f}")
    print(f"DBSCAN Silhouette Score: {silhouette_dbscan:.3f}")
else:
    print("DBSCAN did not find enough clusters to evaluate.")

## Comparing Performance

In [None]:
results = pd.DataFrame({
    'Clustering Algorithm': ['K-Means', 'Agglomerative Clustering', 'DBSCAN'],
    'Adjusted Rand Index (ARI)': [ari_kmeans, ari_agg, ari_dbscan],
    'Silhouette Score': [silhouette_kmeans, silhouette_agg, silhouette_dbscan]
})

print(results)

## Interpreting Results
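
One way to check whether each class has a matching cluster is a contingency table of true classes against cluster ids. The sketch below builds such a table with `pd.crosstab` on synthetic blobs. The blob centres and the names `X_blobs`, `y_blobs` and `purity_per_class` are assumptions. With the lab data, `y_all` and `kmeans_labels` would be passed to `pd.crosstab` instead.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in: 3 well-separated groups with known labels
X_blobs, y_blobs = make_blobs(n_samples=300,
                              centers=[[0, 0], [5, 5], [-5, 5]],
                              cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X_blobs)

# Rows = true classes, columns = clusters
table = pd.crosstab(pd.Series(y_blobs, name="class"),
                    pd.Series(cluster_ids, name="cluster"))

# Share of each class that lands in its single most common cluster
purity_per_class = table.max(axis=1) / table.sum(axis=1)
print(table)
print(purity_per_class)
```

A class whose rows are spread over several columns has no dedicated cluster, while a value near 1 in `purity_per_class` indicates that one cluster captures it.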

# Exercise 6

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
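
The three detectors imported above share the same label convention: `fit_predict` returns +1 for inliers and -1 for outliers. The sketch below compares them on a small synthetic set as a possible starting point. The data, the 5% contamination setting and the other hyperparameters are assumptions, and the SA subset would replace `X_toy` and `y_toy` in the actual exercise.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for SA: a cloud of inliers plus a few far-away outliers
inliers = rng.normal(0, 1, size=(190, 2))
outliers = rng.uniform(6, 8, size=(10, 2))
X_toy = np.vstack([inliers, outliers])
y_toy = np.array([1] * 190 + [-1] * 10)   # +1 inlier, -1 outlier

detectors = {
    "IsolationForest": IsolationForest(contamination=0.05, random_state=0),
    "OneClassSVM": OneClassSVM(nu=0.05, gamma="scale"),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
}

scores = {}
for name, det in detectors.items():
    # fit_predict labels the same data it was fit on (+1 / -1)
    y_hat = det.fit_predict(X_toy)
    scores[name] = accuracy_score(y_toy, y_hat)
    print(f"{name}: accuracy {scores[name]:.3f}")
```

Since anomalies are rare, accuracy alone can look high even for a detector that flags nothing, so comparing the predicted -1 counts alongside it is a sensible sanity check.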

# Exercise 7
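
A minimal sketch of leave-one-out evaluation for an anomaly detector: the model is fit on all points but one and then asked to label the held-out point. It uses `IsolationForest` on synthetic data with only 40 points to keep it quick. The data, the 10% contamination value and the names `X_small`, `y_small` and `loo_preds` are assumptions, and the exercise itself calls for a 250-point subsample of SA.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)

# Small synthetic sample: 36 inliers and 4 far-away outliers
X_small = np.vstack([rng.normal(0, 1, size=(36, 2)),
                     rng.uniform(6, 8, size=(4, 2))])
y_small = np.array([1] * 36 + [-1] * 4)   # +1 inlier, -1 outlier

loo_preds = np.empty(len(X_small), dtype=int)
for train_idx, test_idx in LeaveOneOut().split(X_small):
    # Fit on all points but one, then label the single held-out point
    det = IsolationForest(n_estimators=25, contamination=0.1, random_state=0)
    det.fit(X_small[train_idx])
    loo_preds[test_idx] = det.predict(X_small[test_idx])

loo_accuracy = np.mean(loo_preds == y_small)
print(f"Leave-one-out accuracy: {loo_accuracy:.3f}")
```

The same loop works for `OneClassSVM`, and for `LocalOutlierFactor` when it is created with `novelty=True` so that `predict` is available on unseen points.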

# Exercise 8