### We will use:
##### Pandas
- Load and preprocess the dataset
##### sklearn
- Split the data into training, testing, and cross-validation sets
- Train the KNN classifier
- Evaluate the model using accuracy, precision, recall, and F1-score


In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

Define column names for the dataset, since the data file has no header row.

In [44]:
# Ten input features (I1..I10) followed by the class label (o1).
column_names = ["I1", "I2", "I3", "I4", "I5", "I6", "I7", "I8", "I9", "I10", "o1"]
df = pd.read_csv("magic04.data", names=column_names)
# print(df.head())
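
To illustrate why the names are passed explicitly, the sketch below reads a tiny header-less CSV from memory. The two rows are made-up toy values (not taken from magic04.data), and the shortened column list is only for illustration.

```python
import io
import pandas as pd

# Toy, made-up rows standing in for a header-less data file (hypothetical values).
raw = io.StringIO("1.0,2.0,g\n3.0,4.0,h\n")

# Without names=, pandas would treat the first data row as the header.
toy_df = pd.read_csv(raw, names=["I1", "I2", "o1"])

print(toy_df.shape)           # both rows are kept as data
print(list(toy_df.columns))
```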

Downsample the gamma class to match the number of hadron samples to balance the dataset.

In [47]:
gamma_df = df[df["o1"] == "g"]
hadron_df = df[df["o1"] == "h"]
gamma_df = gamma_df.sample(n=len(hadron_df), random_state=42)
balanced_df = pd.concat([gamma_df, hadron_df])
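
A quick way to confirm that the downsampling worked is to compare the class counts afterwards. The sketch below uses a small synthetic frame (hypothetical data with 6 'g' rows and 3 'h' rows) rather than the real dataset.

```python
import pandas as pd

# Synthetic, imbalanced toy data (hypothetical): 6 gamma rows, 3 hadron rows.
toy = pd.DataFrame({"I1": range(9), "o1": ["g"] * 6 + ["h"] * 3})

toy_gamma = toy[toy["o1"] == "g"]
toy_hadron = toy[toy["o1"] == "h"]

# Randomly keep as many gamma rows as there are hadron rows.
toy_gamma = toy_gamma.sample(n=len(toy_hadron), random_state=42)
toy_balanced = pd.concat([toy_gamma, toy_hadron])

counts = toy_balanced["o1"].value_counts()
print(counts)
```

On the real data, the same check would be `balanced_df["o1"].value_counts()`.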

Split the dataset randomly so that the training set forms 70% of the data, the validation set 15%,
and the testing set 15%.

In [49]:
X = balanced_df.drop(columns=["o1"])
Y = balanced_df["o1"]
# Hold out 30% of the data, then split that part in half: 15% validation, 15% test.
x_train, x_, y_train, y_ = train_test_split(X, Y, test_size=0.3, random_state=1)
x_cv, x_test, y_cv, y_test = train_test_split(x_, y_, test_size=0.5, random_state=1)
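
The two-stage split can be checked on synthetic data: holding out 30% and then halving the held-out part should yield roughly 70/15/15. The arrays below are placeholders (hypothetical), not the MAGIC data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (hypothetical): 1000 samples, 2 features, binary labels.
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array(["g", "h"] * 500)

# Stage 1: keep 70% for training, hold out 30%.
xt, x_rest, yt, y_rest = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
# Stage 2: split the held-out 30% in half -> 15% validation, 15% test.
xc, xs, yc, ys = train_test_split(x_rest, y_rest, test_size=0.5, random_state=1)

fractions = [len(part) / len(X_toy) for part in (xt, xc, xs)]
print(fractions)
```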

Iterate over k values from 1 to 20 to find the optimal number of neighbors for KNN.

Train a KNN classifier with the current k value using KNeighborsClassifier from the sklearn library.

Predict labels for the cross-validation set and calculate evaluation metrics (accuracy, precision, recall, F1-score).

$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$

$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$

$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

$$
\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

$$
\text{Confusion Matrix} =
\begin{bmatrix}
TP & FN \\
FP & TN
\end{bmatrix}
$$
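
As a worked check of the formulas above, the sketch below plugs in the k = 1 validation confusion matrix reported in this notebook's output (TP = 822, FN = 211, FP = 314, TN = 659, with 'g' as the positive class). The helper name `metrics_from_counts` is made up for illustration.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts taken from the k = 1 confusion matrix [[822 211], [314 659]].
acc, prec, rec, f1 = metrics_from_counts(tp=822, fn=211, fp=314, tn=659)
print(f"{acc:.4f} {prec:.4f} {rec:.4f} {f1:.4f}")  # 0.7383 0.7236 0.7957 0.7580
```

The printed values match the k = 1 metrics that sklearn reports, which confirms the matrix layout (rows are true labels, columns are predicted labels).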


In [51]:
results = []
for k in range(1, 21):
    # Fit a KNN classifier with k neighbors on the training set.
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_cv)

    # Score on the validation set, treating gamma ('g') as the positive class.
    accuracy = accuracy_score(y_cv, y_pred)
    precision = precision_score(y_cv, y_pred, pos_label='g')
    recall = recall_score(y_cv, y_pred, pos_label='g')
    f1 = f1_score(y_cv, y_pred, pos_label='g')
    cm = confusion_matrix(y_cv, y_pred, labels=['g', 'h'])

    # Keep the fitted model so the best one can be reused on the test set.
    results.append({
        'k': k,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'confusion_matrix': cm,
        'eval model': model
    })

Report the accuracy, precision, recall, and F1-score of every trained model, along with its
confusion matrix.

In [60]:
for result in results:
    print(f"k = {result['k']}")
    print(f"Accuracy: {result['accuracy']:.4f}")
    print(f"Precision: {result['precision']:.4f}")
    print(f"Recall: {result['recall']:.4f}")
    print(f"F1-Score: {result['f1_score']:.4f}")
    print("Confusion Matrix:")
    print(result['confusion_matrix'])
    print("\n")


k = 1
Accuracy: 0.7383
Precision: 0.7236
Recall: 0.7957
F1-Score: 0.7580
Confusion Matrix:
[[822 211]
 [314 659]]


k = 2
Accuracy: 0.7433
Precision: 0.6874
Recall: 0.9197
F1-Score: 0.7867
Confusion Matrix:
[[950  83]
 [432 541]]


k = 3
Accuracy: 0.7657
Precision: 0.7425
Recall: 0.8345
F1-Score: 0.7858
Confusion Matrix:
[[862 171]
 [299 674]]


k = 4
Accuracy: 0.7537
Precision: 0.7062
Recall: 0.8935
F1-Score: 0.7889
Confusion Matrix:
[[923 110]
 [384 589]]


k = 5
Accuracy: 0.7697
Precision: 0.7422
Recall: 0.8470
F1-Score: 0.7911
Confusion Matrix:
[[875 158]
 [304 669]]


k = 6
Accuracy: 0.7627
Precision: 0.7167
Recall: 0.8916
F1-Score: 0.7947
Confusion Matrix:
[[921 112]
 [364 609]]


k = 7
Accuracy: 0.7747
Precision: 0.7485
Recall: 0.8470
F1-Score: 0.7947
Confusion Matrix:
[[875 158]
 [294 679]]


k = 8
Accuracy: 0.7682
Precision: 0.7243
Recall: 0.8877
F1-Score: 0.7977
Confusion Matrix:
[[917 116]
 [349 624]]


k = 9
Accuracy: 0.7702
Precision: 0.7416
Recall: 0.8500
F1-Score: 0.7921
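
Since the printed report is long, a compact table can make the comparison across k easier. The sketch below builds one with pandas from the first three validation results shown above. The list name `toy_results` is hypothetical; in the notebook the same idea would apply to `results` after dropping the non-scalar entries.

```python
import pandas as pd

# First three validation results copied from the printed output above.
toy_results = [
    {"k": 1, "accuracy": 0.7383, "precision": 0.7236, "recall": 0.7957, "f1_score": 0.7580},
    {"k": 2, "accuracy": 0.7433, "precision": 0.6874, "recall": 0.9197, "f1_score": 0.7867},
    {"k": 3, "accuracy": 0.7657, "precision": 0.7425, "recall": 0.8345, "f1_score": 0.7858},
]

# One row per k, sorted so the highest F1-score comes first.
summary = pd.DataFrame(toy_results).set_index("k").sort_values("f1_score", ascending=False)
print(summary)
```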

Select the model with the highest F1-score from the validation results.

Retrieve the trained KNN model that achieved this score and use it for final evaluation on the test set.

Predict labels for the test set and compute evaluation metrics (accuracy, precision, recall, F1-score).


In [70]:
# Pick the validation result with the highest F1-score and reuse its fitted model.
best_result = max(results, key=lambda r: r['f1_score'])
best_model = best_result['eval model']
y_pred = best_model.predict(x_test)
# Score on the held-out test set, treating gamma ('g') as the positive class.
test_accuracy = accuracy_score(y_test, y_pred)
test_precision = precision_score(y_test, y_pred, pos_label='g')
test_recall = recall_score(y_test, y_pred, pos_label='g')
test_f1 = f1_score(y_test, y_pred, pos_label='g')
test_cm = confusion_matrix(y_test, y_pred, labels=['g', 'h'])
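
One detail worth knowing about this selection: when several entries share the maximum key, Python's `max` returns the first one encountered. The toy list below is made up to show that behaviour; it is not the notebook's `results`.

```python
# Made-up entries (hypothetical): k=6 and k=7 tie on F1-score.
toy_results = [
    {"k": 5, "f1_score": 0.79},
    {"k": 6, "f1_score": 0.80},
    {"k": 7, "f1_score": 0.80},
]

# max() keeps the first maximal item, so the smaller k wins the tie here.
best = max(toy_results, key=lambda r: r["f1_score"])
print(best["k"])  # 6
```

Because the loop appends results in increasing k, an exact tie in the notebook would likewise resolve to the smaller k.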


Display the best value of k (number of neighbors) that resulted in the highest F1-score, along with the test-set metrics.

In [75]:
print(f"best k = {best_result['k']}")
print(f"Test accuracy: {test_accuracy:.4f}")
print(f"Test precision: {test_precision:.4f}")
print(f"Test recall: {test_recall:.4f}")
print(f"Test F1-Score: {test_f1:.4f}")
print("Test Confusion Matrix:")
print(test_cm)
print("\n")

best k = 16
Test accuracy: 0.7539
Test precision: 0.6969
Test recall: 0.8740
Test F1-Score: 0.7755
Test Confusion Matrix:
[[853 123]
 [371 660]]


