## Closed World (mon_standard.pkl)

### 1. Data Cleaning & Pre-processing

**mon_standard.pkl**: This file contains data from "monitored" websites.

   - Class count: 95

   - Instance count: 19,000 (95 websites, each with 10 non-index subpages, each subpage observed 20 times)



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pickle
import numpy as np

In [None]:
# Load X1, X2, y

# Load X1
with open('/content/drive/My Drive/Machine Learning Project/CODES/X1.pkl', 'rb') as file:
    X1 = pickle.load(file)

# Load X2
with open('/content/drive/My Drive/Machine Learning Project/CODES/X2.pkl', 'rb') as file:
    X2 = pickle.load(file)

# Load y
with open('/content/drive/My Drive/Machine Learning Project/CODES/y.pkl', 'rb') as file:
    y = pickle.load(file)

In [None]:
import random

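# Instances per class, taken from class 0 (assumes all classes are equally sized)
num_samples_per_y = y.count(0)
# Keep half of the instances of each class
num_sample = num_samples_per_y // 2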

# Initialize lists to store sampled values
sampled_X1 = []
sampled_X2 = []
sampled_y = []

# Randomly sample num_sample values for each value of y
unique_y_values = set(y)
for val in unique_y_values:
    indices_for_y = [idx for idx, value in enumerate(y) if value == val]  # Find indices corresponding to each value of y
    sampled_indices = random.sample(indices_for_y, num_sample)  # Sample num_sample indices
    sampled_X1.extend([X1[i] for i in sampled_indices])
    sampled_X2.extend([X2[i] for i in sampled_indices])
    sampled_y.extend([y[i] for i in sampled_indices])

# Verify the lengths of sampled lists
print("Sampled X1 length:", len(sampled_X1))
print("Sampled X2 length:", len(sampled_X2))
print("Sampled y length:", len(sampled_y))


Sampled X1 length: 9500
Sampled X2 length: 9500
Sampled y length: 9500


In [None]:
X1 = sampled_X1
X2 = sampled_X2
y = sampled_y
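
The per-class subsampling above draws a different subset on every run because the random generator is not seeded. A minimal sketch of a seeded variant follows; `sample_per_class`, `n_per_class`, and `seed` are hypothetical names used only for illustration, and the labels are toy data rather than the real dataset.

```python
import random
from collections import defaultdict

def sample_per_class(labels, n_per_class, seed=0):
    """Return sorted indices with n_per_class randomly chosen items per label."""
    rng = random.Random(seed)  # local generator: leaves the global random state alone
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):  # one pass instead of one scan per class
        by_label[label].append(idx)
    chosen = []
    for label in sorted(by_label):
        chosen.extend(rng.sample(by_label[label], n_per_class))
    return sorted(chosen)

# Toy labels (assumption): 3 classes with 4 instances each
toy_y = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
toy_idx = sample_per_class(toy_y, n_per_class=2, seed=42)
toy_sampled_y = [toy_y[i] for i in toy_idx]
print(toy_sampled_y)  # two instances of each class
```

The same index list can then be used to subset `X1`, `X2`, and `y` together, which keeps the three sequences aligned.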

### 2a. Feature Extraction (Continuous Features)

1. Sequence of packet timestamps (X1)
2. Sequence of packet sizes (X2)
3. Sequence of cumulative packet sizes
4. Sequence of bursts



Continuous Feature 3: Sequence of Cumulative Packet Sizes

In [None]:
# Compute the cumulative sum for each sequence
cumulative_sizes = [np.cumsum(seq) for seq in X2]

# Print the first 10 values of the cumulative sizes for the 1st element
print("First 10 values of cumulative sizes:")
print(cumulative_sizes[0][:10])

First 10 values of cumulative sizes:
[ -512 -1024  -512 -1024  -512 -1024  -512     0  -512 -1024]
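
As a small worked check of the cumulative-size feature, the sketch below applies `np.cumsum` to the first ten signed packet sizes of the first trace (copied from the `X2[0]` printout in this notebook) and reproduces the values shown above.

```python
import numpy as np

# First ten signed packet sizes of the first trace, copied from the X2[0] printout
sizes = np.array([-512, -512, 512, -512, 512, -512, 512, 512, -512, -512])

# Running total: element i is the sum of sizes[0..i]
cumulative = np.cumsum(sizes)
print(cumulative.tolist())
# → [-512, -1024, -512, -1024, -512, -1024, -512, 0, -512, -1024]
```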


Continuous Feature 4: Sequence of Bursts

In [None]:
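def calculate_bursts_and_durations(X1, X2):
    """Split each trace into bursts (runs of consecutive same-direction packets).

    Returns (burst_duration, seq_of_bursts): for every trace, the duration of
    each burst and the signed total size of each burst.
    """
    seq_of_bursts = []
    burst_duration = []

    for timestamps, sizes in zip(X1, X2):
        burst = []
        duration = []

        current_size = 0     # signed size accumulated in the current burst
        current_time = 0.0   # duration of the current burst so far
        time_start = 0.0     # start time of the current burst (initially 0.0)

        for time, size in zip(timestamps, sizes):
            if current_size == 0 or (size > 0 and current_size > 0) or (size < 0 and current_size < 0):
                # Same direction as the current burst: extend it
                current_size += size
                current_time = time - time_start
            else:
                # Direction changed: close the current burst and start a new one
                burst.append(current_size)
                duration.append(current_time)
                current_size = size
                current_time = 0.0
                time_start = time

        # Close the final burst; current_time already holds its duration,
        # which also avoids reading the loop variable after an empty trace
        burst.append(current_size)
        duration.append(current_time)
        seq_of_bursts.append(burst)
        burst_duration.append(duration)

    return burst_duration, seq_of_bursts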

burst_duration, seq_of_bursts = calculate_bursts_and_durations(X1, X2)

print(burst_duration[0][:10])
print(seq_of_bursts[0][:10])

[0.32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.08000000000000007, 0.0, 0.0, 0.0]
[-1024, 512, -512, 512, -512, 1024, -7168, 512, -512, 512]


In [None]:
print(X1[0][:20])
print(X2[0][:20])

[0.0, 0.32, 0.32, 0.5, 1.22, 1.42, 1.42, 1.42, 1.68, 1.68, 1.68, 1.68, 1.68, 1.68, 1.68, 1.76, 1.76, 1.76, 1.76, 1.76]
[-512, -512, 512, -512, 512, -512, 512, 512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512]
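
The burst definition used here (a run of consecutive packets with the same sign) can also be expressed with `itertools.groupby`. The following is only a sketch, not the notebook's implementation: `bursts_with_durations` is a hypothetical helper, and it assumes there are no zero-size packets. It is run on the first 20 packets printed above, so the last burst is cut off at packet 20 and is smaller than the `-7168` reported for the full trace.

```python
from itertools import groupby

def bursts_with_durations(timestamps, sizes):
    """Group consecutive same-direction packets into (burst_size, duration) pairs."""
    result = []
    # groupby starts a new group every time the sign of the size changes
    for _, group in groupby(zip(timestamps, sizes), key=lambda p: p[1] > 0):
        group = list(group)
        burst_size = sum(size for _, size in group)
        duration = group[-1][0] - group[0][0]  # last minus first timestamp in the burst
        result.append((burst_size, duration))
    return result

# First 20 timestamps and sizes of the first trace, copied from the printout above
t20 = [0.0, 0.32, 0.32, 0.5, 1.22, 1.42, 1.42, 1.42, 1.68, 1.68,
       1.68, 1.68, 1.68, 1.68, 1.68, 1.76, 1.76, 1.76, 1.76, 1.76]
s20 = [-512, -512, 512, -512, 512, -512, 512, 512, -512, -512,
       -512, -512, -512, -512, -512, -512, -512, -512, -512, -512]

pairs = bursts_with_durations(t20, s20)
print([size for size, _ in pairs])
```

The first six burst sizes and durations agree with the output of `calculate_bursts_and_durations` for the same trace.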


### 2b. Feature Extraction (Categorical Features)


1. Number of incoming packets
2. Number of incoming packets as a fraction of the total number of packets
3. Number of outgoing packets
4. Number of outgoing packets as a fraction of the total number of packets
5. Total number of packets
6. Packet rate
7. Incoming packet rate (client to server)
8. Outgoing packet rate (server to client)
9. Average time gap
10. Total incoming bytes
11. Total outgoing bytes
12. Total incoming bursts
13. Total outgoing bursts
14. Total bursts
15. Average inter-arrival time for incoming packets per sample
16. Average inter-departure time for outgoing packets per sample
17. Total burst duration

In [None]:
# 1. Number of incoming packets
incoming_packets = [sum(1 for size in size_seq if size > 0) for size_seq in X2]

# 2. Number of incoming packets as a fraction of the total number of packets
fraction_incoming_packets = [sum(1 for size in size_seq if size > 0) / len(size_seq) for size_seq in X2]

# 3. Number of outgoing packets
outgoing_packets = [sum(1 for size in size_seq if size < 0) for size_seq in X2]

# 4. Number of outgoing packets as a fraction of the total number of packets
fraction_outgoing_packets = [sum(1 for size in size_seq if size < 0) / len(size_seq) for size_seq in X2]

# Print first 10 values of the resulting arrays
print("Incoming Packets Array:")
print(incoming_packets[:10])

print("\nFraction of Incoming Packets Array:")
print(fraction_incoming_packets[:10])

print("\nOutgoing Packets Array:")
print(outgoing_packets[:10])

print("\nFraction of Outgoing Packets Array:")
print(fraction_outgoing_packets[:10])

Incoming Packets Array:
[542, 1604, 1476, 146, 115, 175, 286, 632, 153, 1635]

Fraction of Incoming Packets Array:
[0.07457347275729224, 0.31624605678233436, 0.28935502842579885, 0.06457319770013269, 0.08115737473535639, 0.09820426487093153, 0.14644137224782386, 0.0635048231511254, 0.06790945406125166, 0.31545437005595217]

Outgoing Packets Array:
[6726, 3468, 3625, 2115, 1302, 1607, 1667, 9320, 2100, 3548]

Fraction of Outgoing Packets Array:
[0.9254265272427078, 0.6837539432176656, 0.7106449715742011, 0.9354268022998673, 0.9188426252646437, 0.9017957351290684, 0.8535586277521762, 0.9364951768488746, 0.9320905459387483, 0.6845456299440479]
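
The four count features can be computed in one pass per trace with NumPy. This is a sketch under assumptions: `direction_counts` is a hypothetical helper, it follows this notebook's convention of labeling positive sizes as incoming and negative sizes as outgoing, and it returns 0.0 for the fractions of an empty trace instead of raising a division error.

```python
import numpy as np

def direction_counts(sizes):
    """Return (n_positive, frac_positive, n_negative, frac_negative) for one trace."""
    sizes = np.asarray(sizes)
    n_pos = int(np.count_nonzero(sizes > 0))
    n_neg = int(np.count_nonzero(sizes < 0))
    total = sizes.size
    frac_pos = n_pos / total if total else 0.0  # guard against empty traces
    frac_neg = n_neg / total if total else 0.0
    return n_pos, frac_pos, n_neg, frac_neg

# First 20 signed sizes of the first trace, copied from the X2[0] printout
s20 = [-512, -512, 512, -512, 512, -512, 512, 512, -512, -512,
       -512, -512, -512, -512, -512, -512, -512, -512, -512, -512]
n_in, frac_in, n_out, frac_out = direction_counts(s20)
print(n_in, frac_in, n_out, frac_out)
```

When a trace contains no zero-size packets, the two fractions sum to 1, so one of them carries no information beyond the other.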


In [None]:
# 5. Total number of packets
total_packets = [len(size_seq) for size_seq in X2]

# Duration of each trace; a zero duration would otherwise cause a division by zero below
durations = [max(seq) - min(seq) if len(seq) > 1 else 0 for seq in X1]

# 6. Packet rate: number of packets divided by the trace duration
packet_rate = [len(seq) / d if d > 0 else 0 for seq, d in zip(X1, durations)]

# 7. Incoming packet rate (client to server)
incoming_packet_rate = [n / d if d > 0 else 0 for n, d in zip(incoming_packets, durations)]

# 8. Outgoing packet rate (server to client)
outgoing_packet_rate = [n / d if d > 0 else 0 for n, d in zip(outgoing_packets, durations)]

print("Total Packets Array:")
print(total_packets[:10])

print("\nPacket Rate:")
print(packet_rate[:10])

print("\nIncoming Packet Rate:")
print(incoming_packet_rate[:10])

print("\nOutgoing Packet Rate:")
print(outgoing_packet_rate[:10])

Total Packets Array:
[7268, 5072, 5101, 2261, 1417, 1782, 1953, 9952, 2253, 5183]

Packet Rate:
[521.0035842293908, 97.05319556065825, 90.26720934347904, 227.69385699899297, 133.05164319248826, 129.22407541696882, 55.95988538681949, 1181.9477434679336, 206.88705234159778, 260.84549572219424]

Incoming Packet Rate:
[38.85304659498208, 30.692690394182932, 26.11927092549991, 14.702920443101712, 10.7981220657277, 12.69035532994924, 8.194842406876791, 75.05938242280286, 14.049586776859503, 82.28485153497735]

Outgoing Packet Rate:
[482.15053763440864, 66.36050516647532, 64.14793841797912, 212.99093655589124, 122.25352112676056, 116.53372008701959, 47.7650429799427, 1106.8883610451308, 192.8374655647383, 178.5606441872169]


In [None]:
# 9. Average Time Gap: Calculate the average time gap for each sequence in X1
avg_time_gaps = []

for seq in X1:
    if len(seq) > 1:
        time_gaps_sum = sum(j - i for i, j in zip(seq, seq[1:]))
        avg_time_gap = time_gaps_sum / (len(seq) - 1)  # Subtract 1 because there are len(seq) - 1 time gaps
        avg_time_gaps.append(avg_time_gap)
    else:
        avg_time_gaps.append(0)

print("\nAverage Time Gaps:")
print(avg_time_gaps[:10])


Average Time Gaps:
[0.0019196367139122058, 0.010305659633208439, 0.011080392156862745, 0.004393805309734513, 0.007521186440677966, 0.0077428411005053335, 0.017879098360655737, 0.0008461461159682444, 0.004835701598579041, 0.003834426862215361]
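
Because the consecutive gaps telescope, their sum equals the last timestamp minus the first, so the average gap reduces to `(last - first) / (n - 1)`. The sketch below checks this on the first six timestamps of the first trace; `average_gap` is a hypothetical helper and assumes the timestamps are sorted in ascending order.

```python
def average_gap(timestamps):
    """Mean gap between consecutive timestamps, using the telescoped form."""
    if len(timestamps) < 2:
        return 0
    return (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)

# First six timestamps of the first trace, copied from the X1[0] printout
ts = [0.0, 0.32, 0.32, 0.5, 1.22, 1.42]

# Explicit version, as computed in the cell above
explicit = sum(b - a for a, b in zip(ts, ts[1:])) / (len(ts) - 1)
print(average_gap(ts), explicit)
```

Under the same sorted-timestamps assumption, `packet_rate` is `n / (last - first)`, so the product of `packet_rate` and the average time gap is `n / (n - 1)`, which is close to 1 for long traces. The two features are therefore near-reciprocals of each other.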


In [None]:
# 10. & 11. Total incoming and outgoing bytes
incoming_bytes = []
outgoing_bytes = []

for sample in X2:
    incoming = sum(size for size in sample if size > 0)
    outgoing = abs(sum(size for size in sample if size < 0))

    incoming_bytes.append(incoming)
    outgoing_bytes.append(outgoing)

# Print total incoming and outgoing bytes for the first 10 samples
print(f'Incoming Bytes: {incoming_bytes[:10]}')
print(f'Outgoing Bytes: {outgoing_bytes[:10]}')

Incoming Bytes: [277504, 821248, 755712, 74752, 58880, 89600, 146432, 323584, 78336, 837120]
Outgoing Bytes: [3443712, 1775616, 1856000, 1082880, 666624, 822784, 853504, 4771840, 1075200, 1816576]


In [None]:
# 12. & 13. Number of incoming and outgoing bursts

total_incoming_bursts = []
total_outgoing_bursts = []

# Calculate total number of incoming and outgoing bursts for all samples
for sample in seq_of_bursts:
    incoming_bursts = sum(1 for val in sample if val > 0)
    outgoing_bursts = sum(1 for val in sample if val < 0)

    total_incoming_bursts.append(incoming_bursts)
    total_outgoing_bursts.append(outgoing_bursts)

# 14. Calculate burst count for each sample
burst_count = [len(bursts) for bursts in seq_of_bursts]

# Print total incoming and outgoing bursts for first 10 samples
print(f"Total Incoming Bursts: {total_incoming_bursts[:10]}")
print(f"Total Outgoing Bursts: {total_outgoing_bursts[:10]}")
print(f"Burst Count: {burst_count[:10]}")



Total Incoming Bursts: [291, 188, 181, 92, 76, 94, 107, 395, 102, 211]
Total Outgoing Bursts: [291, 188, 181, 92, 76, 94, 107, 396, 102, 211]
Burst Count: [582, 376, 362, 184, 152, 188, 214, 791, 204, 422]
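
The output above shows incoming and outgoing burst counts that are equal or differ by one. This follows from the burst definition: consecutive bursts always alternate in direction, so the two counts can differ by at most 1 and sum to the burst count (assuming no zero-size packets). A short sketch on the first ten bursts of the first trace, with `burst_direction_counts` as a hypothetical helper:

```python
def burst_direction_counts(bursts):
    """Count bursts with positive and negative total size."""
    incoming = sum(1 for b in bursts if b > 0)
    outgoing = sum(1 for b in bursts if b < 0)
    return incoming, outgoing

# First ten bursts of the first trace, copied from the seq_of_bursts[0] printout
toy_bursts = [-1024, 512, -512, 512, -512, 1024, -7168, 512, -512, 512]

n_in, n_out = burst_direction_counts(toy_bursts)
# Neighbouring bursts have opposite signs, so their product is negative
alternates = all(a * b < 0 for a, b in zip(toy_bursts, toy_bursts[1:]))
print(n_in, n_out, alternates)
```

In practice this means `total_incoming_bursts`, `total_outgoing_bursts`, and `burst_count` are almost perfectly correlated.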


In [None]:
# 15. Calculate average inter-arrival time for incoming packets per sample
avg_interarrival_times = []

for sample_times, sample_sizes in zip(X1, X2):
    # Keep the timestamps of incoming packets (positive size values)
    incoming_packet_times = [t for t, size in zip(sample_times, sample_sizes) if size > 0]

    if len(incoming_packet_times) <= 1:
        # With at most one incoming packet there is no gap to measure; use 0
        avg_interarrival_times.append(0)
    else:
        # Gaps between consecutive incoming packets
        interarrival_times = [b - a for a, b in zip(incoming_packet_times, incoming_packet_times[1:])]

        # Mean gap between incoming packets for this sample
        avg_interarrival_times.append(sum(interarrival_times) / len(interarrival_times))

# Print average inter-arrival time for incoming packets per sample
print("Average inter-arrival time for incoming packets per sample:")
for i, avg_interarrival_time in enumerate(avg_interarrival_times[:10], start=1):
    print(f"Sample {i}: {avg_interarrival_time}")


Average inter-arrival time for incoming packets per sample:
Sample 1: 0.02519408502772643
Sample 2: 0.032532751091703054
Sample 3: 0.03810847457627118
Sample 4: 0.06737931034482758
Sample 5: 0.09201754385964912
Sample 6: 0.07833333333333332
Sample 7: 0.121859649122807
Sample 8: 0.01316957210776545
Sample 9: 0.07078947368421053
Sample 10: 0.012086903304773562


In [None]:
# 16. Calculate average inter-departure time for outgoing packets per sample
avg_interdepart_times = []

for sample_times, sample_sizes in zip(X1, X2):
    # Keep the timestamps of outgoing packets (negative size values)
    outgoing_packet_times = [t for t, size in zip(sample_times, sample_sizes) if size < 0]

    if len(outgoing_packet_times) <= 1:
        # With at most one outgoing packet there is no gap to measure; use 0
        avg_interdepart_times.append(0)
    else:
        # Gaps between consecutive outgoing packets
        interdepart_times = [b - a for a, b in zip(outgoing_packet_times, outgoing_packet_times[1:])]

        # Mean gap between outgoing packets for this sample
        avg_interdepart_times.append(sum(interdepart_times) / len(interdepart_times))

# Print average inter-depart time for outgoing packets per sample
print("Average inter-departure time for outgoing packets per sample:")
for i, avg_interdepart_time in enumerate(avg_interdepart_times[:10], start=1):
    print(f"Sample {i}: {avg_interdepart_time}")


Average inter-departure time for outgoing packets per sample:
Sample 1: 0.002074349442379182
Sample 2: 0.015073550620132678
Sample 3: 0.007618653421633554
Sample 4: 0.0027861873226111633
Sample 5: 0.008070714834742506
Sample 6: 0.006674968866749689
Sample 7: 0.020948379351740695
Sample 8: 0.0009035304217190686
Sample 9: 0.0034730824202000954
Sample 10: 0.005599097829151395


In [None]:
# 17. Total burst duration
total_burst_duration = [sum(duration) for duration in burst_duration]
print(total_burst_duration[:10])

[7.72, 11.769999999999994, 46.10999999999999, 6.599999999999999, 3.1099999999999977, 7.1599999999999975, 8.429999999999996, 4.170000000000005, 5.219999999999999, 6.0600000000000005]


### 3a. Model Training
- Decision Tree


In [None]:
X = [
    incoming_packets,
    fraction_incoming_packets,
    outgoing_packets,
    fraction_outgoing_packets,
    total_packets,
    # packet_rate,
    # incoming_packet_rate,
    # outgoing_packet_rate,
    # avg_time_gaps,
    incoming_bytes,
    outgoing_bytes,
    total_incoming_bursts,
    total_outgoing_bursts,
    burst_count,
    avg_interarrival_times,
    avg_interdepart_times,
    total_burst_duration
]

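# Feature importance (code is in feature_selection.ipynb), sorted from highest to lowest.
# The four lowest-ranked features are the ones commented out of X above.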
# total_packets: 0.08391385637928211
# fraction_incoming_packets: 0.07525438164233617
# outgoing_packets: 0.07397300171079065
# fraction_outgoing_packets: 0.07302431700858097
# outgoing_bytes: 0.07264166480526325
# incoming_bytes: 0.06342574766295982
# incoming_packets: 0.06198666583613548
# total_burst_duration: 0.05929302192406632
# burst_count: 0.05624511174559692
# total_outgoing_bursts: 0.05452631240160152
# total_incoming_bursts: 0.05417409771603582
# avg_interarrival_times: 0.051337481385899435
# avg_interdepart_times: 0.04994909975727552
# incoming_packet_rate: 0.047878438278210256
# outgoing_packet_rate: 0.04129542002349407
# packet_rate: 0.040831661921218544
# avg_time_gaps: 0.04024971980125318

# Transpose the feature matrix X to have samples as rows and features as columns
X = np.array(X).T

y = np.array(y)
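
An equivalent way to build the feature matrix is `np.column_stack`, which places each per-sample list in its own column, and a parallel list of names keeps the column order explicit. This is a sketch with toy values (the first three entries of `total_packets` and `burst_count` printed earlier); `feature_names`, `features`, and `X_toy` are illustrative names only.

```python
import numpy as np

# Toy per-sample feature lists: first three values from the printouts above
toy_total_packets = [7268, 5072, 5101]
toy_burst_count = [582, 376, 362]

feature_names = ["total_packets", "burst_count"]
features = [toy_total_packets, toy_burst_count]

X_toy = np.column_stack(features)   # shape (n_samples, n_features)
X_toy_t = np.array(features).T      # the transpose approach used in the cell above

for name, column in zip(feature_names, X_toy.T):
    print(name, column.tolist())
```

Both constructions give the same array; keeping the name list next to the data makes it easier to map importance scores back to features.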

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the model
clf.fit(X_train, y_train)
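
The split above is not stratified, which is why the classification reports in this notebook list a test-set support per class that ranges from 19 to 39 even though every class has 100 sampled instances. Passing `stratify` to `train_test_split` keeps the class proportions equal in both splits. The sketch below uses synthetic data (an assumption, not the real features), and applying it to the notebook would change the reported scores.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy balanced data (assumption): 5 classes, 20 samples each, 3 random features
rng = np.random.default_rng(0)
y_toy = np.repeat(np.arange(5), 20)
X_toy = rng.normal(size=(y_toy.size, 3))

# stratify=y_toy preserves the class proportions in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

test_support = Counter(y_te.tolist())
print(sorted(test_support.items()))
```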


### 3b. Model Testing

In [None]:
# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Get a classification report
print(classification_report(y_test, y_pred))

Accuracy: 0.49
              precision    recall  f1-score   support

           0       0.59      0.41      0.48        39
           1       0.38      0.35      0.36        26
           2       0.84      0.67      0.74        24
           3       0.44      0.46      0.45        24
           4       0.42      0.33      0.37        24
           5       0.52      0.50      0.51        26
           6       0.83      0.83      0.83        24
           7       0.62      0.50      0.55        32
           8       0.38      0.45      0.41        20
           9       0.32      0.37      0.34        19
          10       0.39      0.38      0.39        29
          11       0.38      0.29      0.33        28
          12       0.71      0.71      0.71        24
          13       0.20      0.16      0.18        25
          14       0.46      0.46      0.46        28
          15       0.52      0.56      0.54        27
          16       0.72      0.50      0.59        26
          17

### 4. Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(clf, param_grid, cv=3, scoring='accuracy', refit=True, verbose=3)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 54 candidates, totalling 162 fits
[CV 1/3] END criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=2;, score=0.334 total time=   0.1s
[CV 2/3] END criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=2;, score=0.302 total time=   0.1s
[CV 3/3] END criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=2;, score=0.331 total time=   0.1s
[CV 1/3] END criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=5;, score=0.335 total time=   0.1s
[CV 2/3] END criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=5;, score=0.303 total time=   0.1s
[CV 3/3] END criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=5;, score=0.328 total time=   0.1s
[CV 1/3] END criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=10;, score=0.332 total time=   0.1s
[CV 2/3] END criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=10;, score=0.300 total time=   0.1s
[CV 3/3]

In [None]:
# Get the best parameters and the best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

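# With refit=True, GridSearchCV has already refit best_estimator on X_train,
# so this explicit fit is redundant but harmless (random_state is fixed)
best_estimator.fit(X_train, y_train)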

# Predict on the test set
y_pred = best_estimator.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Get a classification report
print(classification_report(y_test, y_pred))

Accuracy: 0.50
              precision    recall  f1-score   support

           0       0.47      0.41      0.44        39
           1       0.46      0.46      0.46        26
           2       0.61      0.83      0.70        24
           3       0.55      0.50      0.52        24
           4       0.43      0.42      0.43        24
           5       0.62      0.50      0.55        26
           6       0.71      0.83      0.77        24
           7       0.54      0.47      0.50        32
           8       0.65      0.55      0.59        20
           9       0.38      0.42      0.40        19
          10       0.39      0.41      0.40        29
          11       0.43      0.54      0.48        28
          12       0.67      0.67      0.67        24
          13       0.11      0.12      0.11        25
          14       0.35      0.32      0.33        28
          15       0.61      0.52      0.56        27
          16       0.59      0.50      0.54        26
          17

In [None]:
print(best_params)
print(best_estimator)

{'criterion': 'entropy', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2}
DecisionTreeClassifier(criterion='entropy', max_depth=15, random_state=42)
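
Beyond `best_params_`, the fitted `GridSearchCV` object exposes every candidate's cross-validated score in `cv_results_`, which helps judge how much the best setting actually differs from the rest. A minimal sketch on synthetic data follows; the data, the reduced grid, and the `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix (assumption)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

toy_grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 6]}
toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), toy_grid, cv=3, scoring="accuracy"
)
toy_search.fit(X_toy, y_toy)

# Rank all candidates by mean cross-validated accuracy, best first
mean_scores = toy_search.cv_results_["mean_test_score"]
order = np.argsort(-mean_scores)
for rank, i in enumerate(order, start=1):
    print(rank, toy_search.cv_results_["params"][i], round(float(mean_scores[i]), 3))
```

If the top candidates are within a fraction of a percentage point of each other, the choice between them is unlikely to matter much on the test set.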


### 5a. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest Classifier
rf_clf = RandomForestClassifier()

# Train the model
rf_clf.fit(X_train, y_train)
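
`RandomForestClassifier()` without a `random_state` builds a different forest on each run, so the accuracy reported for it can vary slightly between runs. The sketch below shows, on synthetic data, that fixing `random_state` makes the model repeatable and how `feature_importances_` can be ranked by name. The data, `n_estimators=50`, and the `toy_names` labels are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix (assumption)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
toy_names = [f"feature_{i}" for i in range(X_toy.shape[1])]

# Two forests with the same seed are built identically
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_toy, y_toy)
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_toy, y_toy)
same_predictions = bool(np.array_equal(rf_a.predict(X_toy), rf_b.predict(X_toy)))

# Impurity-based importances, highest first
ranked = sorted(zip(toy_names, rf_a.feature_importances_), key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```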

In [None]:
# Predict on the test set
y_pred = rf_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Get a classification report
print(classification_report(y_test, y_pred))

Accuracy: 0.63
              precision    recall  f1-score   support

           0       0.76      0.56      0.65        39
           1       0.48      0.50      0.49        26
           2       0.87      0.83      0.85        24
           3       0.58      0.62      0.60        24
           4       0.70      0.58      0.64        24
           5       0.74      0.65      0.69        26
           6       0.74      0.96      0.84        24
           7       0.66      0.66      0.66        32
           8       0.82      0.70      0.76        20
           9       0.59      0.53      0.56        19
          10       0.70      0.48      0.57        29
          11       0.71      0.61      0.65        28
          12       0.88      0.88      0.88        24
          13       0.45      0.36      0.40        25
          14       0.65      0.46      0.54        28
          15       0.62      0.78      0.69        27
          16       0.62      0.50      0.55        26
          17

### 6. K-NN

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import time
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# Initialize a dictionary to store results
results = {}

# Iterate over different values of n_neighbors
for n in range(5, 51):
    # Initialize k-NN classifier with the current n_neighbors value
    knn_classifier = KNeighborsClassifier(n_neighbors=n)

    # Train the classifier and measure the time
    start_time = time.time()
    knn_classifier.fit(X_train, y_train)
    training_time = time.time() - start_time

    # Predictions & Accuracy
    y_pred = knn_classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Store results in the dictionary
    results[n] = {'accuracy': accuracy, 'training_time': training_time}

# Print the results
for n, metrics in results.items():
    print(f"n_neighbors = {n}: Accuracy = {metrics['accuracy']:.4f}, Training time = {metrics['training_time']:.2f} seconds")

# Find and print the value of n_neighbors with the highest accuracy
best_n = max(results, key=lambda k: results[k]['accuracy'])
print(f"\nBest n_neighbors: {best_n} with accuracy = {results[best_n]['accuracy']:.4f}")

n_neighbors = 5: Accuracy = 0.4147, Training time = 0.03 seconds
n_neighbors = 6: Accuracy = 0.4147, Training time = 0.02 seconds
n_neighbors = 7: Accuracy = 0.4164, Training time = 0.03 seconds
n_neighbors = 8: Accuracy = 0.4181, Training time = 0.01 seconds
n_neighbors = 9: Accuracy = 0.4105, Training time = 0.01 seconds
n_neighbors = 10: Accuracy = 0.4101, Training time = 0.01 seconds
n_neighbors = 11: Accuracy = 0.4046, Training time = 0.01 seconds
n_neighbors = 12: Accuracy = 0.4017, Training time = 0.02 seconds
n_neighbors = 13: Accuracy = 0.4029, Training time = 0.02 seconds
n_neighbors = 14: Accuracy = 0.4004, Training time = 0.02 seconds
n_neighbors = 15: Accuracy = 0.3949, Training time = 0.02 seconds
n_neighbors = 16: Accuracy = 0.3861, Training time = 0.02 seconds
n_neighbors = 17: Accuracy = 0.3895, Training time = 0.02 seconds
n_neighbors = 18: Accuracy = 0.3895, Training time = 0.02 seconds
n_neighbors = 19: Accuracy = 0.3777, Training time = 0.02 seconds
n_neighbors = 2
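
Two caveats apply to the loop above. First, `n_neighbors` is chosen by accuracy on the test set, so the "best" accuracy it reports is an optimistic estimate. Second, k-NN with its default Euclidean distance is dominated by large-valued features (byte counts) over small ones (fractions) unless the features are scaled. The sketch below shows one possible alternative on synthetic data: scale inside a pipeline and select k by cross-validation on the training split only. The data, the candidate values of k, and the variable names are assumptions, and no claim is made about how this would change the notebook's scores.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_toy[:, 0] *= 1000.0  # put one feature on a much larger scale, like a byte count

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Select k with 5-fold cross-validation on the training data only
cv_scores = {}
for k in (3, 5, 7, 9):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy").mean()

best_k = max(cv_scores, key=cv_scores.get)

# Fit once with the selected k and touch the test set a single time
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)
print(best_k, round(test_accuracy, 3))
```

Putting the scaler inside the pipeline means it is fit on the training folds only, so no information from the held-out fold leaks into the scaling.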

In [None]:
# 2. Chain PCA (dimensionality reduction) with k-NN, and use GridSearchCV to select the number of principal components and k.
pipe = Pipeline([
    ('pca', PCA()),
    ('clf', KNeighborsClassifier())
])

parameters = {
    'pca__n_components': [2, 4, 6, 8, 10],
    'clf__n_neighbors': [5, 8, 10]
}

# GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(pipe, parameters, cv=5, scoring='accuracy', verbose=1)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits


In [None]:
# 3. Report the best accuracy and parameters
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validated Accuracy:", grid_search.best_score_)

Best Parameters: {'clf__n_neighbors': 5, 'pca__n_components': 4}
Best Cross-Validated Accuracy: 0.4040701754385966


In [None]:
# 4. Evaluate the best estimator (already refit with the best parameters) on the test set and measure the elapsed prediction time.
best_estimator = grid_search.best_estimator_

# Time to evaluate on the testing set
start_time = time.time()
y_test_pred = best_estimator.predict(X_test)
testing_time = time.time() - start_time

# Accuracy on the testing set
accuracy_test = accuracy_score(y_test, y_test_pred)

print("Accuracy on the Testing Set:", accuracy_test)
print("Testing Time:", testing_time, "seconds")

Accuracy on the Testing Set: 0.4147368421052632
Testing Time: 0.13258647918701172 seconds
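
For short intervals like the ones measured in this section, `time.perf_counter()` is generally a better clock than `time.time()`, because it is monotonic and has the highest available resolution. A minimal sketch of a reusable timing helper follows; `timed` is a hypothetical name, and the summed range is only a stand-in for a call such as `best_estimator.predict(X_test)`.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload (assumption); replace with the call to be measured
result, elapsed = timed(sum, range(100_000))
print(f"result={result}, elapsed={elapsed:.6f} seconds")
```

For k-NN in particular, `fit` mostly stores or indexes the training data, so the prediction time is usually the more informative number to report.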
