### K-Nearest Neighbors (KNN)

- KNN is a supervised machine learning method for classification and regression problems.
- It can handle both numerical and categorical data, provided categorical features are encoded numerically or paired with a suitable distance metric.
- It is a non-parametric method that makes predictions based on the similarity of data points in a given dataset.
- With a larger K, KNN is less sensitive to individual outliers, because a single unusual point is outvoted by its neighbors; with a small K it can be strongly affected by noisy points.
- The KNN algorithm works by finding the K nearest neighbors of a given data point according to a distance metric, such as **Euclidean** distance.
- The class (for classification) or value (for regression) of the data point is determined by the majority vote or the average of its K neighbors, respectively.
- This approach allows the algorithm to adapt to different patterns and make predictions based on the local structure of the data.
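
The neighbor search and majority vote described in the bullets can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how scikit-learn implements it: the helper names `euclidean` and `knn_predict` and the toy data points are made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the training points by distance to the query and keep the k closest
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    # Majority vote over the labels of those k neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two features per point, labels 0 or 1
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.2, 0.2), k=3))  # query sits in the label-0 cluster
print(knn_predict(train_X, train_y, (0.9, 0.9), k=3))  # query sits in the label-1 cluster
```

For regression, the vote would be replaced by the mean of the neighbors' target values.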

In [98]:
# Importing Libraries

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [99]:
# Load the dataset and preview the first rows

df = pd.read_csv("500hits.csv", encoding = "latin-1")
df.head()

Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
0,Ty Cobb,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,178,0.366,1
1,Stan Musial,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,31,0.331,1
2,Tris Speaker,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,129,0.345,1
3,Derek Jeter,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,97,0.31,1
4,Honus Wagner,21,2792,10430,1736,3430,640,252,101,0,963,327,722,15,0.329,1


In [100]:
# Look for rows with an unexpected HOF value (the target should be 0 or 1)
df[df["HOF"] == 2]


Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
160,Tim Raines Sr.,23,2502,8872,1571,2605,430,113,170,980,1330,966,808,146,0.294,2


In [101]:
# replace the HOF values from 2 to 1 for clarity

df["HOF"] = df["HOF"].replace(2, 1)
df["HOF"].unique()

array([1, 0], dtype=int64)

In [102]:
# Clean the data: drop columns that are not used as features
df = df.drop(columns = ["PLAYER", "CS"])

In [103]:
# Separate the features (X) from the target (y), then split into train and test sets

# Feature columns
X = df[['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB', 'BA']]
print(X)

# Target column
y = df["HOF"]
print(y)

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10, test_size = 0.2)

     YRS     G     AB     R     H   2B   3B   HR   RBI    BB    SO   SB     BA
0     24  3035  11434  2246  4189  724  295  117   726  1249   357  892  0.366
1     22  3026  10972  1949  3630  725  177  475  1951  1599   696   78  0.331
2     22  2789  10195  1882  3514  792  222  117   724  1381   220  432  0.345
3     20  2747  11195  1923  3465  544   66  260  1311  1082  1840  358  0.310
4     21  2792  10430  1736  3430  640  252  101     0   963   327  722  0.329
..   ...   ...    ...   ...   ...  ...  ...  ...   ...   ...   ...  ...    ...
460   15  1920   6653  1105  1665  285   39  291   964  1224  1427  225  0.250
461   17  1829   6092   900  1664  379   10  275  1065   936  1453   20  0.273
462   15  1834   6499  1062  1661  338   67  210   761   960  1190  315  0.256
463   16  1822   6309   714  1660  254   25   54   593   396   489   74  0.263
464   15  1468   5629   785  1660  247   71   61   499   266   471  267  0.295

[465 rows x 13 columns]
0      1
1      1
2      1


In [104]:
y_test.unique()

array([0, 1], dtype=int64)

In [105]:
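# Scaling the data

# Fit the scaler on the training data only, then reuse it (transform, not
# fit_transform) on the test data so both sets share the same scale
scaleMinMax = MinMaxScaler(feature_range = (0, 1))
X_train = scaleMinMax.fit_transform(X_train)
X_test = scaleMinMax.transform(X_test)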

# Running the KNN predictive model
knn = KNeighborsClassifier(n_neighbors = 8)
knn.fit(X_train, y_train)

# Predict HOF for the test set
y_pred = knn.predict(X_test)
print(y_pred, "\n")

# Checking the accuracy score on the test set

knn_score = knn.score(X_test, y_test) * 100
print(f" The score for the accuracy of prediction from the model is about {knn_score}%")

[0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0
 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0
 0 0 0 0 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0] 

 The score for the accuracy of prediction from the model is about 80.64516129032258%
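
KNN relies on distances, so without scaling a feature with a large range such as `AB` would dominate one with a small range such as `BA`. The sketch below shows what min-max scaling computes, in plain Python rather than with `MinMaxScaler`. The helpers `fit_min_max` and `apply_min_max` and the sample rows are hypothetical, and mapping a constant column to 0.0 is an assumption made here for simplicity. The minimum and maximum are learned from the training rows and then reused for the test rows, which is the usual way to keep both sets on the same scale.

```python
def fit_min_max(rows):
    """Learn a (min, max) pair for every column from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, stats):
    """Scale rows using previously learned (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (hi == lo) is mapped to 0.0
            (value - lo) / (hi - lo) if hi != lo else 0.0
            for value, (lo, hi) in zip(row, stats)
        ])
    return scaled


# Made-up values: columns are games played and batting average
train_rows = [[1000, 0.250], [2000, 0.300], [3000, 0.350]]
test_rows = [[2500, 0.275]]

stats = fit_min_max(train_rows)                # statistics come from training data only
train_scaled = apply_min_max(train_rows, stats)
test_scaled = apply_min_max(test_rows, stats)  # test data reuses the same statistics

print(train_scaled)
print(test_scaled)
```

Because the test rows are scaled with the training statistics, a test value outside the training range can fall slightly below 0 or above 1.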


In [106]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[55,  6],
       [12, 20]], dtype=int64)

| Actual/Predicted   | Predicted Not HOF (0)     | Predicted HOF (1)        |
|--------------------|---------------------------|--------------------------|
| Actual Not HOF (0) | True Negatives (TN) = 55  | False Positives (FP) = 6 |
| Actual HOF (1)     | False Negatives (FN) = 12 | True Positives (TP) = 20 |

**Interpretation**
- True Negatives (TN):
  Count: 55 - Meaning: The model correctly predicted that 55 players are not in the Hall of Fame.

- False Positives (FP):
  Count: 6 - Meaning: The model predicted 6 players as Hall of Fame, but they are not.

- False Negatives (FN):
  Count: 12 - Meaning: The model predicted 12 players as not being in the Hall of Fame, but they actually are.

- True Positives (TP):
  Count: 20 - Meaning: The model correctly predicted that 20 players are in the Hall of Fame.
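
As a quick check, the accuracy reported by `knn.score` can be recovered from the four confusion-matrix counts: it is the share of correct predictions (TN + TP) among all test rows. A small sketch, with the counts copied by hand from the matrix:

```python
# Counts copied from the confusion matrix
tn, fp = 55, 6
fn, tp = 12, 20

total = tn + fp + fn + tp        # number of test rows
accuracy = (tn + tp) / total     # correct predictions / all predictions

print(total)                     # 93
print(f"{accuracy:.4f}")         # 0.8065
```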


In [107]:
# Check Unique Classes in y_test and y_pred
print("Unique classes in y_test:", set(y_test))
print("Unique classes in y_pred:", set(y_pred))

Unique classes in y_test: {0, 1}
Unique classes in y_pred: {0, 1}


In [108]:
# Classification Report

cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.82      0.90      0.86        61
           1       0.77      0.62      0.69        32

    accuracy                           0.81        93
   macro avg       0.80      0.76      0.77        93
weighted avg       0.80      0.81      0.80        93
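
The per-class figures in the report follow from the confusion-matrix counts. The sketch below recomputes them by hand; the helper name `precision_recall_f1` is made up for this example. For class 0 the roles are swapped: the true negatives act as that class's true positives.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class from its TP/FP/FN counts."""
    precision = tp / (tp + fp)   # of everything predicted as this class, how much was right
    recall = tp / (tp + fn)      # of everything actually in this class, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the confusion matrix
tn, fp, fn, tp = 55, 6, 12, 20

# Class 1 (HOF)
p1, r1, f1_1 = precision_recall_f1(tp=tp, fp=fp, fn=fn)

# Class 0 (not HOF): TN are its hits, FN are its false alarms, FP are its misses
p0, r0, f1_0 = precision_recall_f1(tp=tn, fp=fn, fn=fp)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
```

The lower recall for class 1 reflects the 12 Hall of Fame players that the model missed.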

