# Lab 6

Scikit learn provides a large variety of algorithms for some common Machine Learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (n-classes)

* Regression Datasets:
    * Boston house prices
    * Diabetes
    * Linnerrud (multiple regression)
    * California Housing

* Image:
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition
    * Forest covertypes

* NLP:
    * News group
    * Reuters Corpus Volume I

* Other:
    * Kddcup 99- Intrusion Detection

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers.
    * Separate the data into train, validation, and test.
    * Use accuracy as the metric for assessing performance.
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithm to identify the 10 most important features for the task in question 1. Retrain classifiers in question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1?

5. Can you identify any clusters within the top/botton 10% identified in 2. What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use the feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve using less features?

## Quick look at the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, silhouette_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import fetch_kddcup99
D=fetch_kddcup99()

In [2]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [3]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [4]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [5]:
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [6]:
len(np.unique(D["target"]))

23

In [7]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

# Exercise 1

In [8]:
df=pd.DataFrame(D.data, columns=D.feature_names)
df["target"]=D.target
df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,b'normal.'
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,b'normal.'
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,b'normal.'


In [9]:
continuous_columns=["duration", "src_bytes", "dst_bytes", "wrong_fragment", "urgent", "hot",
    "num_failed_logins", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate"]

df[continuous_columns]=df[continuous_columns].apply(pd.to_numeric, errors="coerce")

In [10]:
df.shape

(494021, 42)

In [11]:
df.dtypes

duration                         int64
protocol_type                   object
service                         object
flag                            object
src_bytes                        int64
dst_bytes                        int64
land                            object
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
logged_in                       object
num_compromised                  int64
root_shell                       int64
su_attempted                     int64
num_root                         int64
num_file_creations               int64
num_shells                       int64
num_access_files                 int64
num_outbound_cmds                int64
is_host_login                   object
is_guest_login                  object
count                            int64
srv_count                        int64
serror_rate                    float64
srv_serror_rate          

In [12]:
df["target"]=df["target"].str.decode("utf-8")
target_mapping={
    "smurf.": 0,
    "neptune.": 1,
    "normal.": 2,
    "back.": 3,
    "satan.": 4,
    "ipsweep.": 5,
    "portsweep.": 6,
    "warezclient.": 7,
    "teardrop.": 8,
    "pod.": 9,
    "nmap.": 10,
    "guess_passwd.": 11,
    "buffer_overflow.": 12,
    "land.": 13,
    "warezmaster.": 14,
    "imap.": 15,
    "rootkit.": 16,
    "loadmodule.": 17,
    "ftp_write.": 18,
    "multihop.": 19,
    "phf.": 20,
    "perl.": 21,
    "spy.": 22}

df["target"]=df["target"].replace(target_mapping)

  df["target"]=df["target"].replace(target_mapping)


In [13]:
object_cols=df.select_dtypes(include=["object"]).columns
print(object_cols)

Index(['protocol_type', 'service', 'flag', 'land', 'logged_in',
       'is_host_login', 'is_guest_login'],
      dtype='object')


In [14]:
df["land"]=df["land"].replace({0:0, 1:1})
df["logged_in"]=df["logged_in"].replace({0:0, 1:1})
df["is_host_login"]=df["is_host_login"].replace({0:0, 1:1})
df["is_guest_login"]=df["is_guest_login"].replace({0:0, 1:1})

protocol_mapping={b'icmp':0, b'tcp':1, b'udp':2}

service_mapping={b'ecr_i':0, b'private':1, b'http':2, b'smtp':3, b'other':4,
    b'domain_u':5, b'ftp_data':6, b'eco_i':7, b'ftp':8, b'finger':9,
    b'urp_i':10, b'telnet':11, b'ntp_u':12, b'auth':13, b'pop_3':14,
    b'time':15, b'csnet_ns':16, b'remote_job':17, b'gopher':18, b'imap4':19,
    b'discard':20, b'domain':21, b'iso_tsap':22, b'systat':23, b'shell':24,
    b'echo':25, b'rje':26, b'whois':27, b'sql_net':28, b'printer':29,
    b'nntp':30, b'courier':31, b'sunrpc':32, b'netbios_ssn':33, b'mtp':34,
    b'vmnet':35, b'uucp_path':36, b'uucp':37, b'klogin':38, b'bgp':39,
    b'ssh':40, b'supdup':41, b'nnsp':42, b'login':43, b'hostnames':44,
    b'efs':45, b'daytime':46, b'link':47, b'netbios_ns':48, b'pop_2':49,
    b'ldap':50, b'netbios_dgm':51, b'exec':52, b'http_443':53, b'kshell':54,
    b'name':55, b'ctf':56, b'netstat':57, b'Z39_50':58, b'IRC':59,
    b'urh_i':60, b'X11':61, b'tim_i':62, b'pm_dump':63, b'tftp_u':64, b'red_i':65}

flag_mapping={b'SF':0, b'S0':1, b'REJ':2, b'RSTR':3, b'RSTO':4, b'SH':5, b'S1':6, b'S2':7,
              b'RSTOS0':8, b'S3':9, b'OTH':10}

df["protocol_type"].replace(protocol_mapping, inplace=True)
df["service"].replace(service_mapping, inplace=True)
df["flag"].replace(flag_mapping, inplace=True)


  df["land"]=df["land"].replace({0:0, 1:1})
  df["logged_in"]=df["logged_in"].replace({0:0, 1:1})
  df["is_host_login"]=df["is_host_login"].replace({0:0, 1:1})
  df["is_guest_login"]=df["is_guest_login"].replace({0:0, 1:1})
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["protocol_type"].replace(protocol_mapping, inplace=True)
  df["protocol_type"].replace(protocol_mapping, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplac

In [15]:
df.dtypes

duration                         int64
protocol_type                    int64
service                          int64
flag                             int64
src_bytes                        int64
dst_bytes                        int64
land                             int64
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
logged_in                        int64
num_compromised                  int64
root_shell                       int64
su_attempted                     int64
num_root                         int64
num_file_creations               int64
num_shells                       int64
num_access_files                 int64
num_outbound_cmds                int64
is_host_login                    int64
is_guest_login                   int64
count                            int64
srv_count                        int64
serror_rate                    float64
srv_serror_rate          

Must Change

In [16]:
X=df.drop(columns=["target"], axis=1)
y=df["target"]

X_train, X_temp, y_train, y_temp=train_test_split(X, y, test_size=0.3, random_state=11)
X_val, X_test, y_val, y_test=train_test_split(X_temp, y_temp, test_size=0.5, random_state=11)

In [17]:
def Classifiers(pipeline, param_grid, X_train, y_train, X_val, y_val):
    
    Grid = GridSearchCV(pipeline, param_grid, cv = 3, scoring = "accuracy")
    Grid.fit(X_train, y_train)
    top_model = Grid.best_estimator_
    y_val_pred = top_model.predict(X_val)
    accuracy = accuracy_score(y_val, y_val_pred)
    print(f"Model Accuracy: {accuracy}")
    return top_model

classifiers = {"Logistic Regression": LogisticRegression(), "Random Forest": RandomForestClassifier(),
    "SVM": SVC()}

parameters = {
    "Logistic Regression": {"classifier__C": [0.1, 1, 10], "classifier__solver": ["liblinear"]},
    "Random Forest": {"classifier__n_estimators": [100, 200], "classifier__max_depth": [10, 20, None]},
    "SVM": {"classifier__C": [0.1, 1, 10], "classifier__kernel": ["linear", "rbf"]}}

classifier = {}
for name, clf in classifiers.items():
    print(f"\nTraining {name}...")
    pipeline = Pipeline([("scaler", StandardScaler()),("classifier", clf)])
    top_model = Classifiers(pipeline, parameters[name], X_train, y_train, X_val, y_val)
    classifiers[name] = top_model

for name, model in classifier.items():
    Y_Test_Prediction = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, Y_Test_Prediction)
    print(f"Accuracy of {name}: {test_accuracy}")


Training Logistic Regression...




Model Accuracy: 0.998475095475217

Training Random Forest...
Model Accuracy: 0.9998245685059984

Training SVM...




Model Accuracy: 0.9995546738998421


Random forest performed better.

# Exercise 2


In [19]:
scaler = StandardScaler()
X_Train_Scaled = scaler.fit_transform(X_train)
X_Value_Scaled = scaler.transform(X_val)
X_Test_Scaled = scaler.transform(X_test)
N = 25
Models = []
Predictions = np.zeros((X_Value_Scaled.shape[0], N, len(np.unique(y))))

for i in range(N):
    Model = RandomForestClassifier(n_estimators = 100, random_state = i)
    Model.fit(X_Train_Scaled, y_train)
    Models.append(Model)
    Predictions[:, i, :] = Model.predict_proba(X_Value_Scaled)

In [20]:
def Entropy_of_Predictions(Predictions):
    
    Probability = np.mean(Predictions, axis=1)
    Entropy = -np.sum(Probability*np.log(Probability+1e-10), axis=1)
    return Entropy

Uncertainty = Entropy_of_Predictions(Predictions)
Top_10_Percent  = np.percentile(Uncertainty, 90)
Bottom_10_Percent = np.percentile(Uncertainty, 10)
Top_Uncertainty = np.where(Uncertainty >= Top_10_Percent)[0] 
Bottom_Uncertainty = np.where(Uncertainty <= Bottom_10_Percent)[0]
Top_Uncertainty1 = X_Value_Scaled[Top_Uncertainty]
Bottom_Uncertainty1 = X_Value_Scaled[Bottom_Uncertainty]

print(f'Top 10% Uncertainty of Data: {Top_Uncertainty1.shape[0]}')
print(f'Bottom 10% Uncertainty of Data: {Bottom_Uncertainty1.shape[0]}')

Top 10% Uncertainty of Data: 74103
Bottom 10% Uncertainty of Data: 71900


# Exercise 3


In [21]:
Random_Forest_Classifier = RandomForestClassifier(n_estimators = 100, random_state = 11)
Random_forest_Estimator = RFE(estimator = Random_Forest_Classifier, n_features_to_select = 10)
Random_forest_Estimator.fit(X_train, y_train)
Random_forest_Estimator_Support = Random_forest_Estimator.support_
Random_forest_Estimator_Features = X.columns[Random_forest_Estimator_Support]

print("Top 10 Features with the Random Forest Estimator Model:")
print(Random_forest_Estimator_Features)

Top 10 Features with the Random Forest Estimator Model:
Index(['protocol_type', 'service', 'flag', 'src_bytes', 'count', 'srv_count',
       'same_srv_rate', 'diff_srv_rate', 'dst_host_same_srv_rate',
       'dst_host_same_src_port_rate'],
      dtype='object')


In [22]:
Random_Forest_Classifier.fit(X_train, y_train)
Model = SelectFromModel(Random_Forest_Classifier, prefit = True, threshold = "median")
X_Train_Selected = Model.transform(X_train)
X_Value_Selected = Model.transform(X_val)
Feature_Indices = Model.get_support(indices = True)
Feature_Names = X.columns[Feature_Indices]

print("Top 10 Features using Select From Model:")
print(Feature_Names)

Top 10 Features using Select From Model:
Index(['protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
       'logged_in', 'count', 'srv_count', 'srv_serror_rate', 'same_srv_rate',
       'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
       'dst_host_srv_count', 'dst_host_same_srv_rate',
       'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
       'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'dst_host_rerror_rate'],
      dtype='object')




In [23]:
Features = np.unique(np.concatenate((Random_forest_Estimator_Features, Feature_Names)))
X_Train_Features = X_train.loc[:, np.isin(X.columns, Features)]
X_Value_Features = X_val.loc[:, np.isin(X.columns, Features)]
X_Test_Features = X_test.loc[:, np.isin(X.columns, Features)]
Random_Forest_Classifier_Features = RandomForestClassifier(n_estimators=100, random_state=42)
Random_Forest_Classifier_Features.fit(X_Train_Features, y_train)
Y_Value_Predicted_Features = Random_Forest_Classifier_Features.predict(X_Value_Features)

print(f"Accuracy with Top Features: {accuracy_score(y_val, Y_Value_Predicted_Features)}")
print(classification_report(y_val, Y_Value_Predicted_Features))

Accuracy with Top Features: 0.9998515579666141
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     42026
           1       1.00      1.00      1.00     16233
           2       1.00      1.00      1.00     14538
           3       1.00      1.00      1.00       341
           4       1.00      1.00      1.00       240
           5       0.99      1.00      1.00       176
           6       1.00      1.00      1.00       156
           7       1.00      0.99      1.00       163
           8       1.00      1.00      1.00       134
           9       0.97      1.00      0.99        35
          10       1.00      0.97      0.99        35
          11       1.00      1.00      1.00         9
          12       0.83      1.00      0.91         5
          13       1.00      1.00      1.00         2
          14       0.67      1.00      0.80         2
          15       1.00      1.00      1.00         2
          17       0.00      0.00 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


From the previous question random forest performed at 0.99982 while here it performs at 0.99985 accuracy. Which is much better.

# Exercise 4

In [24]:
# K-Means Clustering
K_Means_Clustering = KMeans(n_clusters = 3, random_state = 1)
K_Means_Clusters = K_Means_Clustering.fit_predict(X_Train_Scaled)

In [25]:
#Gaussian Mixture
Gaussian_Mixture=GaussianMixture(n_components = 3, random_state = 11)
Gaussian_Mixture_Clusters= Gaussian_Mixture.fit_predict(X_Train_Scaled)

In [None]:
#Density-Based Clustering
Density_Based_Clustering = DBSCAN(eps=1, min_samples=10)
Density_Based_Clusters = Density_Based_Clustering.fit_predict(X_Train_Scaled)

In [None]:
K_Means_Score = silhouette_score(X_Train_Scaled, K_Means_Clusters)
Gaussian_Score = silhouette_score(X_Train_Scaled, Gaussian_Mixture_Clusters)
Density_Score = silhouette_score(X_Train_Scaled[dbscan_clusters != -1], Density_Based_Clusters[Density_Based_Clusters != -1])

print(f'Silhouette Score for K-Means: {K_Means_Score:.4f}')
print(f'Silhouette Score for Gaussian Mixture Models (GMM): {Gaussian_Score:.4f}')
print(f'Silhouette Score for DBSCAN: {Density_Score:.4f}')

def plot_clusters(X, clusters, title):
    plt.figure(figsize=(10, 6))
    plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap="viridis", alpha=0.5)
    plt.title(title)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.colorbar(label="Cluster Label")
    plt.show()

plot_clusters(X_Train_Scaled, K_Means_Clusters, "K-Means Clustering")
plot_clusters(X_Train_Scaled, Gaussian_Mixture_Clusters, "Gaussian Mixture Clustering")
plot_clusters(X_Train_Scaled, Density_Based_Clusters, "Density Based Clustering")

In [None]:
results = pd.DataFrame(X_train)
results["True Label"] = y_train.values
results["K-Means Cluster"] = K_Means_Clusters
results["Gaussian Mixture Cluster"] = Gaussian_Mixture_Clusters
results["Density Based Cluster"] = Density_Based_Clusters

print("\nK-Means Clustering vs True Labels:")
print(results.groupby(["True Label", "K-Means Cluster"]).size())
print("\nGaussian Mixture Clustering vs True Labels:")
print(results.groupby(["True Label", "Gaussian Mixture Clustering"]).size())
print("\nDensity Based Clustering vs True Labels:")
print(results.groupby(["True Label", "Density Based Clustering"]).size())

It kept failing here, it keeps making the kernel fail and restart regardless of what other clustering algorithm i use, it takes over 4 hours for it to get here each try, from my laptop, to my desktop, to google colab. Tried this for 3 weeks straight, and it black screened my computer 5 times, i dont want to brick my desktop or laptop. Data set is just too huge. Please use a smaller data set for the next poor souls trying to do this. 