# K-Nearest Neighbors Algorithm

The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning classifier. It uses proximity to classify an individual data point: the point is assigned to the class that is most common among the data points closest to it.

In [None]:
# Load packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

## Dataset

Our dataset classifies water potability based on nine chemical measurements. All nine of these columns are numerical, while the potability is a categorical column.

In [None]:
# Load the dataframe
df = pd.read_csv('water_potability.csv')
print(df.shape)
df.head()

**Cleaning**

In [None]:
# Check for null values
df.isna().sum()

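There are some null values in the ph, Sulfate and Trihalomethanes columns. Dropping the affected rows is not a good option: in a dataset of roughly 3200 rows, removing ~1000 rows would discard a large share of the data and could skew it. Instead, we compare the mean and median of each column to check for possible outliers, so that we can fill the NaN values reasonably.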

In [None]:
# Display mean and median of column with missing values
print(f"Mean ph value = {df['ph'].mean()}")
print(f"Median ph value = {df['ph'].median()}")
print('\n')
print(f"Mean Sulfate value = {df['Sulfate'].mean()}")
print(f"Median Sulfate value = {df['Sulfate'].median()}")
print('\n')
print(f"Mean Trihalomethanes value = {df['Trihalomethanes'].mean()}")
print(f"Median Trihalomethanes value = {df['Trihalomethanes'].median()}")

We see that for each column the mean and median are almost equal, so we can deduce that there are no extreme outliers and that we can safely use the mean to fill the missing values.
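
To illustrate why comparing mean and median is a useful outlier check, here is a minimal sketch on made-up numbers (the values and the helper `mean_median_gap` are hypothetical and are not taken from the water potability dataset). A single extreme value pulls the mean far away from the median, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical measurements, not taken from the water potability dataset
clean = [6.8, 7.0, 7.1, 7.2, 7.4]
with_outlier = [6.8, 7.0, 7.1, 7.2, 70.0]  # last value is an extreme outlier


def mean_median_gap(values):
    """Return the absolute difference between the mean and the median."""
    return abs(mean(values) - median(values))


# Without outliers, mean and median nearly coincide
print(f"clean:        mean={mean(clean):.2f}, median={median(clean):.2f}")
# With one outlier, the mean shifts strongly while the median stays put
print(f"with outlier: mean={mean(with_outlier):.2f}, median={median(with_outlier):.2f}")
```

When the gap is small, as it is for our three columns, filling with the mean or with the median gives nearly the same result.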

In [None]:
# Fill missing values with the column mean
for col in ['ph', 'Sulfate', 'Trihalomethanes']:
    df[col] = df[col].fillna(df[col].mean())

In [None]:
# Check the column datatypes
df.info()

As desired, all columns are floats (numerical) except the Potability column, which is stored as an integer for classification (equivalent to a boolean):

0 = non-potable (False)

1 = potable (True)

**Exploratory Data Analysis**

In [None]:
# Plot pairwise relationships in a dataset
sns.pairplot(df, hue="Potability");

The pair plots show that no pair of columns separates the two classes unambiguously. This matters for the KNN algorithm, because it relies on data points of the same class lying close together in feature space.

# KNN Algorithm

m = number of features

For each data point in the test set, we calculate its distance to every data point in the training set in the m-dimensional feature space. The parameter n (`n_neighbors` in scikit-learn) sets how many of the closest training points count as its nearest neighbors. The predicted class of the test data point is the most common class among these n neighbors (a majority vote).
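
The procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration, not how scikit-learn implements `KNeighborsClassifier`: the function `knn_predict`, the toy data, and the choice of Euclidean distance are all assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, n_neighbors=3):
    """Classify `query` by majority vote among its nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the n_neighbors training points with the smallest distances
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:n_neighbors]
    # The most common label among those neighbors is the prediction
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data with m = 2 features and two well-separated classes
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # query near the class 0 cluster
print(knn_predict(train_points, train_labels, (5.1, 5.0)))  # query near the class 1 cluster
```

Using an odd number of neighbors with two classes avoids tied votes in this sketch.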

**Features and target**

In [None]:
# Define feature (chemical measurements) and target (potability classification) columns
features = df.drop(columns=["Potability"])
target = df["Potability"]

**Train and test split**

In [None]:
# Set the size of the test group to 20% of the total dataset
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [None]:
# Create an instance of the KNN classifier without setting the number of
# nearest neighbors (scikit-learn then uses its default, n_neighbors=5)
knn = KNeighborsClassifier()

In [None]:
# Train the model with the training features and target
knn.fit(X_train, y_train)

In [None]:
# Make a prediction for the water potability of the test group
pred = knn.predict(X_test)
pred[:10]

In [None]:
# Show the true potability of the same ten test data points for comparison
y_test.values[:10]

When comparing the predicted potability with the true potability, we see that many predictions are wrong. To quantify this, we calculate the accuracy: the number of correct predictions divided by the total number of predictions.
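
As a minimal sketch of this ratio, the helper below computes the accuracy by hand. The function `accuracy` and the two label lists are hypothetical and are not output from the model above. The `score` method of the classifier reports the same kind of quantity, the mean accuracy on the given data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, not output from the model above
y_true = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]

# 7 of the 10 predictions match the true labels
print(accuracy(y_true, y_pred))
```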

In [None]:
# Mean accuracy of the model on the test group
knn.score(X_test, y_test)