In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd "/content/drive/MyDrive/Data3461_solutions/Labs/Lab.6"

/content/drive/MyDrive/Data3461_solutions/Labs/Lab.6


# Lab 6

scikit-learn provides a wide variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer Wisconsin (2 classes)
    * Iris plants (3 classes)
    * Optical recognition of handwritten digits (10 classes)
    * Wine (3 classes)
    * Forest covertypes (7 classes)

* Regression Datasets:
    * Boston house prices (removed in scikit-learn 1.2)
    * Diabetes
    * Linnerrud (multi-output regression)
    * California Housing

* Image:
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition

* NLP:
    * The 20 newsgroups text dataset
    * Reuters Corpus Volume I

* Other:
    * Kddcup 99 (intrusion detection)

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers.
    * Separate the data into train, validation, and test.
    * Use accuracy as the metric for assessing performance.
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1?

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use a feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve when using fewer features?

## Quick look at the data

In [4]:
from sklearn.datasets import fetch_kddcup99
D=fetch_kddcup99()

In [5]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [6]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [8]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [9]:
len(np.unique(D["target"]))

23
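
The labels are stored as `bytes`, and class sizes in intrusion data can differ widely, which matters when accuracy is the only metric. A minimal sketch of decoding the labels and counting each class follows; the `labels` array is a small made-up stand-in for `D["target"]`, not the real data.

```python
import numpy as np

# Hypothetical stand-in for D["target"]: an object array of bytes labels.
labels = np.array([b'normal.', b'smurf.', b'smurf.', b'neptune.', b'smurf.'], dtype=object)

# Decode bytes to str and drop the trailing dot used in the raw labels.
decoded = np.array([label.decode('utf-8').rstrip('.') for label in labels])

# Count samples per class and convert the counts to fractions.
classes, counts = np.unique(decoded, return_counts=True)
fractions = counts / counts.sum()

for name, count, fraction in zip(classes, counts, fractions):
    print(f"{name:10s} {count:3d} {fraction:.2f}")
```

Applied to the real target, the same three lines would show how much of the data each attack type covers before any split is made.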

In [10]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

In [14]:
import pandas as pd

# Convert features and target to DataFrames
feature_df = pd.DataFrame(D.data, columns=D["feature_names"])
target_df = pd.Series(D.target).rename('target')

# Concatenate features and target into a single DataFrame
data_df = pd.concat([feature_df, target_df], axis=1)

In [15]:
from sklearn.datasets import fetch_kddcup99
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import numpy as np

D = fetch_kddcup99()


categorical_features = ['protocol_type', 'service', 'flag']
numerical_features = [f for f in D["feature_names"] if f not in categorical_features]


categorical_transformer = OneHotEncoder()
numerical_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

X = preprocessor.fit_transform(data_df.iloc[:, :-1])
y = data_df['target'].values
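
In the cell above the preprocessor is fitted on the full dataset before any split, so the scaler statistics also see the rows that later become validation and test data. One way to avoid this, sketched below on a tiny made-up DataFrame (the `toy` data and all variable names are illustrative assumptions), is to wrap the preprocessor and the model in a `Pipeline` and fit it on the training rows only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy data with one categorical and two numeric columns.
toy = pd.DataFrame({
    'protocol_type': ['tcp', 'udp', 'icmp', 'tcp'] * 10,
    'src_bytes': list(range(40)),
    'dst_bytes': [v % 7 for v in range(40)],
})
toy_target = [1 if v >= 20 else 0 for v in range(40)]

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['src_bytes', 'dst_bytes']),
    # handle_unknown='ignore' keeps transform() from failing on unseen categories.
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['protocol_type']),
])
model = Pipeline([('prep', preprocess), ('clf', LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.25, stratify=toy_target, random_state=0)

# Scaling statistics and category lists are learned from X_tr only.
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

A pipeline like this can also be passed directly to `RandomizedSearchCV`, in which case the preprocessing is refitted inside every cross-validation fold.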

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from scipy.stats import randint

# Split the dataset
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Convert target variable to integer labels using LabelEncoder
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)
y_test_encoded = label_encoder.transform(y_test)

# Define classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Support Vector Machine': SVC(),
    'Logistic Regression': LogisticRegression()
}

# Define hyperparameter grids for each classifier
# Tune not more than two params and use simple linear models
param_grids = {
    'Random Forest': {'n_estimators': [50, 100], 'max_depth': [None, 10]},
    'Support Vector Machine': {'C': [1, 10], 'kernel': ['linear']},
    'Logistic Regression': {'C': [0.01, 0.1, 1, 10], 'max_iter': [100, 200]}
}

# Optimize hyperparameters and evaluate on the validation set
# Randomized search with reduced space and verbose output
best_classifiers = {}
for classifier_name, classifier in classifiers.items():
    param_dist = param_grids[classifier_name]
    randomized_search = RandomizedSearchCV(classifier, param_distributions=param_dist, n_iter=5, cv=3, scoring='accuracy', n_jobs=-1, verbose=2)
    randomized_search.fit(X_train, y_train_encoded)

    best_classifier = randomized_search.best_estimator_
    best_classifiers[classifier_name] = best_classifier

    # Evaluate on the validation set
    y_pred_val = best_classifier.predict(X_val)
    accuracy_val = accuracy_score(y_val_encoded, y_pred_val)
    print(f"{classifier_name} - Validation Accuracy: {accuracy_val}")

# Compare performances on the test set
for classifier_name, classifier in best_classifiers.items():
    y_pred_test = classifier.predict(X_test)
    accuracy_test = accuracy_score(y_test_encoded, y_pred_test)
    print(f"{classifier_name} - Test Accuracy: {accuracy_test}")


Fitting 3 folds for each of 4 candidates, totalling 12 fits




Random Forest - Validation Accuracy: 0.9997975790453828
Fitting 3 folds for each of 2 candidates, totalling 6 fits




Support Vector Machine - Validation Accuracy: 0.9993657476755327
Fitting 3 folds for each of 5 candidates, totalling 15 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression - Validation Accuracy: 0.9991768214512233
Random Forest - Test Accuracy: 0.999743603584152
Support Vector Machine - Test Accuracy: 0.9993387671380762
Logistic Regression - Test Accuracy: 0.9991498434632409


In [21]:
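# NOTE: X_train is already the output of the first preprocessor (scaled numeric
# columns followed by the one-hot columns), so positions 1-3 no longer hold
# protocol_type, service and flag. This step one-hot encodes three scaled numeric
# columns and, because the remaining columns are dropped by default, keeps only
# the first six columns. This likely explains the lower accuracy reported for
# the feature-selection experiments further down.
categorical_indices = [1, 2, 3]
numerical_indices = [0, 4, 5]

# Update the preprocessor to ignore unknown categories
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_indices),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_indices)
    ])


preprocessor.fit(X_train)

# Transform both training and test data again
X_train_encoded = preprocessor.transform(X_train)
X_test_encoded = preprocessor.transform(X_test)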

In [22]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

In [23]:
from sklearn.ensemble import RandomForestClassifier

ensemble_clf = RandomForestClassifier(n_estimators=25, random_state=42)

ensemble_clf.fit(X_train_encoded, y_train_encoded)

ensemble_predictions = ensemble_clf.predict_proba(X_test_encoded)
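
For Exercise 2, one common way to turn the ensemble's class probabilities into an uncertainty score is `1 - max probability` (the entropy of the probability vector is another option). The sketch below uses a random made-up `proba` array in place of `ensemble_predictions` and selects the 10% least and most uncertain rows; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for predict_proba output: 200 samples, 4 classes, rows sum to 1.
proba = rng.dirichlet(alpha=[0.3, 0.3, 0.3, 0.3], size=200)

# Uncertainty is low when one class clearly dominates.
uncertainty = 1.0 - proba.max(axis=1)

# Sort once and take the two ends, so each group holds exactly 10% of the rows
# even when many samples share the same score.
order = np.argsort(uncertainty)
n_select = max(1, len(order) // 10)
least_uncertain_idx = order[:n_select]
most_uncertain_idx = order[-n_select:]

print(len(least_uncertain_idx), len(most_uncertain_idx))
print(uncertainty[least_uncertain_idx].max() <= uncertainty[most_uncertain_idx].min())
```

Sorting instead of thresholding with `>=` is a design choice: a forest often assigns probability 1.0 to many rows, and a percentile threshold would then select every tied row rather than 10% of the data.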

In [24]:
# Calculate the maximum predicted probability for each sample
max_probabilities = np.max(ensemble_predictions, axis=1)

# Calculate the thresholds for the top and bottom 10%
top_10_percent_threshold = np.percentile(max_probabilities, 90)
bottom_10_percent_threshold = np.percentile(max_probabilities, 10)

# A high maximum probability means low uncertainty, so the "top" group holds the
# most confident predictions and the "bottom" group the most uncertain ones
top_10_percent_indices = np.where(max_probabilities >= top_10_percent_threshold)[0]
bottom_10_percent_indices = np.where(max_probabilities <= bottom_10_percent_threshold)[0]

# Output the results
print(f"Indices of the 10% least uncertain data: {top_10_percent_indices}")
print(f"Indices of the 10% most uncertain data: {bottom_10_percent_indices}")

Indices of the 10% least uncertain data: [    0     1     2 ... 74100 74101 74102]
Indices of the 10% most uncertain data: [    4    14    15 ... 74098 74099 74103]


In [25]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Instantiate a classifier to use with SelectFromModel
rf_classifier = RandomForestClassifier(n_estimators=50, random_state=42)

# Fit the classifier to get feature importances
rf_classifier.fit(X_train_encoded, y_train_encoded)

selector = SelectFromModel(rf_classifier, max_features=10, prefit=True)


X_train_selected = selector.transform(X_train_encoded)
X_test_selected = selector.transform(X_test_encoded)


rf_classifier.fit(X_train_selected, y_train_encoded)
y_pred_selected = rf_classifier.predict(X_test_selected)


accuracy_selected = accuracy_score(y_test_encoded, y_pred_selected)

In [26]:
rf_classifier.fit(X_train_encoded, y_train_encoded)

importances = rf_classifier.feature_importances_
indices = np.argsort(importances)[::-1]

# Select the top 10 most important features
top_indices = indices[:10]
X_train_top_features = X_train_encoded[:, top_indices]
X_test_top_features = X_test_encoded[:, top_indices]

# Train a classifier on the top features
rf_classifier.fit(X_train_top_features, y_train_encoded)
y_pred_top_features = rf_classifier.predict(X_test_top_features)

# Evaluate performance
accuracy_top_features = accuracy_score(y_test_encoded, y_pred_top_features)
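
Both selections above rank features with the same random-forest importances, so they are not two independent methods. Exercise 3 asks for two different algorithms; a hedged sketch that puts a univariate filter (`SelectKBest` with the ANOVA F-score) next to the forest-based ranking is shown below. It runs on synthetic data from `make_classification`, so the numbers say nothing about KDD Cup 99, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0, random_state=0)

k = 10

# Method 1: univariate filter, scores each feature independently of the others.
filter_selector = SelectKBest(score_func=f_classif, k=k).fit(X_demo, y_demo)
filter_idx = {int(i) for i in filter_selector.get_support(indices=True)}

# Method 2: embedded method, ranks features by random-forest importance.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
forest_idx = {int(i) for i in np.argsort(forest.feature_importances_)[::-1][:k]}

print("Filter picks:", sorted(filter_idx))
print("Forest picks:", sorted(forest_idx))
print("Shared features:", len(filter_idx & forest_idx))
```

Comparing the two index sets shows where the methods agree; retraining on each subset then gives two accuracies that can actually differ.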

In [27]:
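# The first selection uses SelectFromModel (not RFE); both rank features by
# random-forest importance, so matching accuracies are expected
print(f'Accuracy with SelectFromModel features: {accuracy_selected}')
print(f'Accuracy with top-10 importance features: {accuracy_top_features}')

Accuracy with SelectFromModel features: 0.9782332937493253
Accuracy with top-10 importance features: 0.9782332937493253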


In [28]:
number_of_classes = len(np.unique(y_train))


In [29]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score
import numpy as np

In [38]:
from sklearn.metrics import adjusted_rand_score

# Split both the data and the labels
X_sample, _, y_sample, _ = train_test_split(X_train_encoded, y_train, train_size=0.1, random_state=42)


# Apply K-Means clustering
kmeans = KMeans(n_clusters=number_of_classes, random_state=42)
kmeans_labels_sample = kmeans.fit_predict(X_sample)

# Calculate silhouette score
silhouette_kmeans_sample = silhouette_score(X_sample, kmeans_labels_sample)
print(f'Silhouette Score for K-Means (on sample): {silhouette_kmeans_sample}')

# Ensure y_sample is a 1D array of strings
y_sample_str = y_sample.astype(str)

# Calculate ARI
ari_kmeans_sample = adjusted_rand_score(y_sample_str, kmeans_labels_sample)
print(f'Adjusted Rand Index for K-Means (on sample): {ari_kmeans_sample}')




Silhouette Score for K-Means (on sample): 0.8196713888611975
Adjusted Rand Index for K-Means (on sample): 0.7353621402635897


In [37]:
print(type(y_sample))
print(np.unique(y_sample))



<class 'numpy.ndarray'>
[b'back.' b'buffer_overflow.' b'ftp_write.' b'guess_passwd.' b'imap.'
 b'ipsweep.' b'land.' b'loadmodule.' b'multihop.' b'neptune.' b'nmap.'
 b'normal.' b'phf.' b'pod.' b'portsweep.' b'rootkit.' b'satan.' b'smurf.'
 b'teardrop.' b'warezclient.' b'warezmaster.']


In [40]:
from sklearn.model_selection import train_test_split
X_sample, _ = train_test_split(X_train_encoded, train_size=0.1, random_state=42)

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import adjusted_rand_score

X_sample_reduced, _, y_sample_reduced, _ = train_test_split(X_sample, y_sample, train_size=0.05, random_state=42)

y_sample_reduced_str = y_sample_reduced.astype(str)


# Convert to dense format and apply Agglomerative Clustering
X_sample_reduced_dense = X_sample_reduced.toarray()
agg_clustering_reduced = AgglomerativeClustering(n_clusters=number_of_classes)
agg_labels_reduced = agg_clustering_reduced.fit_predict(X_sample_reduced_dense)

# Calculate the silhouette score
silhouette_agg_reduced = silhouette_score(X_sample_reduced_dense, agg_labels_reduced)
print(f'Silhouette Score for Agglomerative Clustering (reduced sample): {silhouette_agg_reduced}')

# Calculate the ARI
ari_agg_reduced = adjusted_rand_score(y_sample_reduced_str, agg_labels_reduced)
print(f'Adjusted Rand Index for Agglomerative Clustering (reduced sample): {ari_agg_reduced}')

Silhouette Score for Agglomerative Clustering (reduced sample): 0.8447762623234524
Adjusted Rand Index for Agglomerative Clustering (reduced sample): 0.7136785767939295


In [43]:
X_train_sample, _ = train_test_split(X_train_encoded, train_size=0.05, random_state=42)  # Sample 5% of the data

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_train_sample)


if len(set(dbscan_labels)) > 1:
    silhouette_dbscan = silhouette_score(X_train_sample, dbscan_labels)
    print(f'Silhouette Score for DBSCAN: {silhouette_dbscan}')
else:
    print("DBSCAN found less than 2 clusters")

Silhouette Score for DBSCAN: 0.8119808331882435


In [45]:
from sklearn.cluster import DBSCAN


X_train_sample, _, y_train_sample, _ = train_test_split(X_train_encoded, y_train, train_size=0.05, random_state=42)

y_train_sample_str = y_train_sample.astype(str)


dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_train_sample)

if len(set(dbscan_labels)) > 1:
    silhouette_dbscan = silhouette_score(X_train_sample, dbscan_labels)
    print(f'Silhouette Score for DBSCAN: {silhouette_dbscan}')
    # Calculate the ARI
    ari_dbscan = adjusted_rand_score(y_train_sample_str, dbscan_labels)
    print(f'Adjusted Rand Index for DBSCAN: {ari_dbscan}')
else:
    print("DBSCAN found less than 2 clusters")

Silhouette Score for DBSCAN: 0.8119808331882435
Adjusted Rand Index for DBSCAN: 0.7378514667998353
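
The Adjusted Rand Index summarises agreement in a single number, but Exercise 4 also asks whether each class gets a cluster of its own. One way to look at this, sketched below with small made-up arrays (`true_labels` and `cluster_ids` are illustrative), is to find the majority class inside every cluster and list the classes that are never a majority.

```python
import numpy as np

# Made-up example: three true classes, three clusters found by some algorithm.
true_labels = np.array(['normal', 'normal', 'smurf', 'smurf', 'smurf',
                        'neptune', 'normal', 'normal'])
cluster_ids = np.array([0, 0, 1, 1, 1, 0, 2, 2])

majority_class = {}
correct = 0
for cluster in np.unique(cluster_ids):
    members = true_labels[cluster_ids == cluster]
    names, counts = np.unique(members, return_counts=True)
    # The most frequent true class inside this cluster.
    majority_class[int(cluster)] = str(names[np.argmax(counts)])
    correct += int(counts.max())

# Purity: fraction of samples that belong to their cluster's majority class.
purity = correct / len(true_labels)

# Classes that never dominate a cluster have no cluster of their own.
unmatched = sorted(set(true_labels.tolist()) - set(majority_class.values()))

print(majority_class)
print(purity)
print(unmatched)
```

For DBSCAN, the noise label `-1` would show up as one more "cluster" in this loop, so it is usually worth filtering it out first.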


In [46]:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import classification_report, roc_auc_score

In [47]:
# Exercise 6 asks for the "SA" subset, so it is requested explicitly here
df = fetch_kddcup99(subset="SA", as_frame=True, random_state=42)

In [48]:
data, target = df['data'], df['target']


In [49]:
SA_data, SA_target = data, target
X_train, X_test, y_train, y_test = train_test_split(SA_data, SA_target, test_size=0.2, random_state=42)

In [50]:
# Anomaly Detection algorithms
iso_forest = IsolationForest()
oc_svm = OneClassSVM()
lof = LocalOutlierFactor()

In [66]:
from sklearn.model_selection import train_test_split

# Sample 10% of the data
X_train_sample, _, y_train_sample, _ = train_test_split(X_train, y_train, train_size=0.1, random_state=42)

# Further split the sampled data
X_sample, _, y_sample, _ = train_test_split(X_train_sample, y_train_sample, train_size=0.1, random_state=42)


In [61]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Assuming 'categorical_features' is a list of categorical column names
categorical_features = ['protocol_type', 'service', 'flag']
numerical_features = [f for f in X_train.columns if f not in categorical_features]

# Column transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Apply one-hot encoding to X_train_sample
X_train_sample_encoded = preprocessor.fit_transform(X_train_sample)

# Now use the encoded data for your models
iso_forest_pred = iso_forest.fit_predict(X_train_sample_encoded)
oc_svm_pred = oc_svm.fit_predict(X_train_sample_encoded)
lof_pred = lof.fit_predict(X_train_sample_encoded)


In [67]:
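# The predictions above were made on X_train_sample, so they are compared with
# y_train_sample (y_sample is a smaller subset and has a different length).
# Labels are mapped to the detectors' convention: 1 = normal, -1 = attack.
y_train_sample_binary = np.array(
    [1 if label in (b'normal.', 'normal.') else -1 for label in y_train_sample])

print("Isolation Forest Performance on Sample:")
print(classification_report(y_train_sample_binary, iso_forest_pred))

print("One-Class SVM Performance on Sample:")
print(classification_report(y_train_sample_binary, oc_svm_pred))

print("Local Outlier Factor Performance on Sample:")
print(classification_report(y_train_sample_binary, lof_pred))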

In [70]:
from sklearn.model_selection import train_test_split, LeaveOneOut
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [71]:
X_subsample, _, y_subsample, _ = train_test_split(X, y, train_size=250, stratify=y, random_state=4)


In [72]:

iso_forest = IsolationForest()
oc_svm = OneClassSVM()
lof = LocalOutlierFactor(novelty=True)

In [73]:
loo = LeaveOneOut()


In [74]:
y_true, iso_forest_preds, oc_svm_preds, lof_preds = [], [], [], []


In [75]:
# Initialize lists for true labels and predictions
y_true = []
iso_forest_preds = []
oc_svm_preds = []
lof_preds = []

In [76]:
for train_index, test_index in loo.split(X_subsample):
    X_train, X_test = X_subsample[train_index], X_subsample[test_index]
    y_train, y_test = y_subsample[train_index], y_subsample[test_index]

    # Append true label for this iteration
    y_true.append(y_test[0])

    # Isolation Forest (unsupervised, so it is fitted on the features only)
    iso_forest.fit(X_train)
    iso_forest_preds.append(iso_forest.predict(X_test)[0])

    # One-Class SVM
    oc_svm.fit(X_train)
    oc_svm_preds.append(oc_svm.predict(X_test)[0])

    # Local Outlier Factor (novelty=True allows predict on the held-out sample)
    lof.fit(X_train)
    lof_preds.append(lof.predict(X_test)[0])

In [77]:
# Convert lists to numpy arrays
y_true = np.array(y_true)
iso_forest_preds = np.array(iso_forest_preds)
oc_svm_preds = np.array(oc_svm_preds)
lof_preds = np.array(lof_preds)

In [78]:
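# NOTE: the labels in y_true are bytes (for example b'smurf.'), while the entries
# below are str, so the membership test never matches and every sample is mapped
# to 1 (normal). The metrics printed further down reflect this and should not be
# read as a real comparison. A mapping such as
# np.where(y_true == b'normal.', 1, -1) avoids the problem.
anomaly_labels = ['smurf.', 'neptune.', ...]
y_true_numeric = np.array([-1 if label in anomaly_labels else 1 for label in y_true])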

In [83]:
iso_forest_predictions = iso_forest_preds
oc_svm_predictions = oc_svm_preds
lof_predictions = lof_preds

# Calculate performance metrics for each model
iso_forest_metrics = {
    "Accuracy": accuracy_score(y_true_numeric, iso_forest_predictions),
    "Precision": precision_score(y_true_numeric, iso_forest_predictions, pos_label=-1, zero_division=1),
    "Recall": recall_score(y_true_numeric, iso_forest_predictions, pos_label=-1, zero_division=1),
    "F1 Score": f1_score(y_true_numeric, iso_forest_predictions, pos_label=-1, zero_division=1),
}

oc_svm_metrics = {
    "Accuracy": accuracy_score(y_true_numeric, oc_svm_predictions),
    "Precision": precision_score(y_true_numeric, oc_svm_predictions, pos_label=-1, zero_division=1),
    "Recall": recall_score(y_true_numeric, oc_svm_predictions, pos_label=-1, zero_division=1),
    "F1 Score": f1_score(y_true_numeric, oc_svm_predictions, pos_label=-1, zero_division=1),
}

lof_metrics = {
    "Accuracy": accuracy_score(y_true_numeric, lof_predictions),
    "Precision": precision_score(y_true_numeric, lof_predictions, pos_label=-1, zero_division=1),
    "Recall": recall_score(y_true_numeric, lof_predictions, pos_label=-1, zero_division=1),
    "F1 Score": f1_score(y_true_numeric, lof_predictions, pos_label=-1, zero_division=1),
}

# Print metrics for each model
print("Isolation Forest Metrics:")
for metric, value in iso_forest_metrics.items():
    print(f"{metric}: {value}")

print("\nOne-Class SVM Metrics:")
for metric, value in oc_svm_metrics.items():
    print(f"{metric}: {value}")

print("\nLocal Outlier Factor Metrics:")
for metric, value in lof_metrics.items():
    print(f"{metric}: {value}")


Isolation Forest Metrics:
Accuracy: 0.908
Precision: 0.0
Recall: 1.0
F1 Score: 0.0

One-Class SVM Metrics:
Accuracy: 0.348
Precision: 0.0
Recall: 1.0
F1 Score: 0.0

Local Outlier Factor Metrics:
Accuracy: 0.7
Precision: 0.0
Recall: 1.0
F1 Score: 0.0


In [84]:
print("Label type:", type(y_subsample[0]))
print("Unique labels:", np.unique(y_subsample))

Label type: <class 'bytes'>
Unique labels: [b'back.' b'ipsweep.' b'neptune.' b'normal.' b'portsweep.' b'satan.'
 b'smurf.' b'warezclient.']
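
Because the labels are `bytes`, a membership test against a list of `str` values such as `'smurf.'` never matches. A minimal sketch of mapping the labels to the detectors' convention (`1` for inliers, `-1` for outliers) is shown below. Treating every label other than `b'normal.'` as an anomaly is an assumption made here, and `raw_labels` and `predictions` are made-up stand-ins for `y_true` and a detector's output.

```python
import numpy as np

# Made-up stand-in for y_true; the real labels are bytes as printed above.
raw_labels = np.array([b'normal.', b'smurf.', b'normal.', b'neptune.', b'normal.'],
                      dtype=object)

# Assumption: everything that is not b'normal.' counts as an anomaly (-1).
y_binary = np.where(raw_labels == b'normal.', 1, -1)

# Made-up detector output in the same +1 / -1 convention.
predictions = np.array([1, -1, -1, 1, 1])

# Precision and recall for the anomaly class (-1), computed by hand.
true_pos = np.sum((predictions == -1) & (y_binary == -1))
precision = true_pos / np.sum(predictions == -1)
recall = true_pos / np.sum(y_binary == -1)

print(y_binary, precision, recall)
```

With labels mapped this way, precision and recall for the anomaly class are computed from real matches rather than falling back to the `zero_division` value.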


In [85]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_subsample_encoded = label_encoder.fit_transform(y_subsample)

In [86]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Initialize RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_subsample, y_subsample_encoded)


# Select the top 5 features
selector = SelectFromModel(rf, max_features=5, prefit=True)
X_subsample_selected = selector.transform(X_subsample)

# Get the indices of the selected features
selected_features_indices = selector.get_support(indices=True)
print("Selected feature indices:", selected_features_indices)

Selected feature indices: [19 20 32 38 55]


In [87]:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

# Isolation Forest
iso_forest_selected = IsolationForest()
iso_forest_selected.fit(X_subsample_selected)

# One-Class SVM
oc_svm_selected = OneClassSVM()
oc_svm_selected.fit(X_subsample_selected)

# Local Outlier Factor
lof_selected = LocalOutlierFactor(novelty=True)
lof_selected.fit(X_subsample_selected)

In [88]:
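# NOTE: X_test was last assigned inside the leave-one-out loop, so it holds a
# single held-out sample here; this is why the reports below show a support of 1
X_test_selected = selector.transform(X_test)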


In [89]:
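# NOTE: as before, the bytes labels in y_test never match these str entries,
# so every sample is mapped to 1 (normal)
anomaly_labels = ['smurf.', 'neptune.', ...]
y_test_binary = np.array([-1 if label in anomaly_labels else 1 for label in y_test])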

In [91]:
from sklearn.metrics import classification_report

iso_forest_pred_test = iso_forest_selected.predict(X_test_selected)
print("Isolation Forest on Selected Features:")
print(classification_report(y_test_binary, iso_forest_pred_test))

oc_svm_pred_test = oc_svm_selected.predict(X_test_selected)
print("One-Class SVM on Selected Features:")
print(classification_report(y_test_binary, oc_svm_pred_test))

lof_pred_test = lof_selected.predict(X_test_selected)
print("Local Outlier Factor on Selected Features:")
print(classification_report(y_test_binary, lof_pred_test))

Isolation Forest on Selected Features:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1

One-Class SVM on Selected Features:
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       1.0

    accuracy                           0.00       1.0
   macro avg       0.00      0.00      0.00       1.0
weighted avg       0.00      0.00      0.00       1.0

Local Outlier Factor on Selected Features:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1



