# Machine Learning with K-Nearest Neighbors (KNN)

## K-Nearest Neighbors (KNN) Dataset Generator

This script generates synthetic training and test datasets for a K-Nearest Neighbors machine learning algorithm. The datasets consist of:

**1. Two numerical features (feature1 and feature2):**
- Drawn from a normal distribution with mean 0 and standard deviation 0.2
- These could represent any measurable characteristics (e.g., height/weight, temperature/humidity, price/quality scores, etc.)

**2. Binary classification labels ('A' or 'B'):**
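- Randomly assigned with equal probability (50% each), independently of the two features, so no classifier should be expected to do better than roughly 50% accuracy on this data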
- In real scenarios, these might represent categories like pass/fail, spam/not spam, positive/negative sentiment, etc.

**The script creates two separate files:**
- KNNTraining.txt: 20 samples for training the KNN model
- KNNTest.txt: 30 samples for testing the model's performance

**Output format:** Each line follows the format: `feature1,feature2,label`
Example: `-0.123,0.456,A`

**This synthetic data can be used to:**
- Test KNN algorithm implementations
- Demonstrate classification techniques
- Practice data preprocessing and model evaluation

In [1]:
import numpy as np
import codecs

base_path = ""

print("Create training samples")
feature1 = list(np.random.normal(0, 0.2, 20))  # 20 draws from a normal distribution (mean 0, std 0.2); values are not bounded
feature2 = list(np.random.normal(0, 0.2, 20))
label = list(np.random.choice(['A', 'B'], size=20, p=[0.5, 0.5]))  # 20 labels, each 'A' or 'B' with equal probability

print("Save training samples into a file")
with codecs.open(base_path + "KNNTraining.txt", "w", "UTF-8") as file:
    for f1, f2, l in zip(feature1, feature2, label):
        file.write(str(f1) + "," + str(f2) + "," + str(l) + "\n")

print("Create test samples")
feature1 = list(np.random.normal(0, 0.2, 30))
feature2 = list(np.random.normal(0, 0.2, 30))
label = list(np.random.choice(['A', 'B'], size=30, p=[0.5, 0.5]))

print("Save test samples into a file")
with codecs.open(base_path + "KNNTest.txt", "w", "UTF-8") as file:
    for f1, f2, l in zip(feature1, feature2, label):
        file.write(str(f1) + "," + str(f2) + "," + str(l) + "\n")

print("Dataset generation complete!")
print("Files created:")
print("- KNNTraining.txt (20 samples)")
print("- KNNTest.txt (30 samples)")

Create training samples
Save training samples into a file
Create test samples
Save test samples into a file
Dataset generation complete!
Files created:
- KNNTraining.txt (20 samples)
- KNNTest.txt (30 samples)
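
Because the script above draws from NumPy's global random state, every run writes different files. A minimal sketch of a reproducible variant follows, assuming NumPy's `default_rng` generator. The helper name `make_samples` and the seed value 42 are illustrative choices, not part of the original script, and the sketch builds the lines in memory instead of writing files.

```python
import numpy as np

def make_samples(n, rng):
    """Return n lines in the `feature1,feature2,label` format (hypothetical helper)."""
    feature1 = rng.normal(0, 0.2, n)           # mean 0, standard deviation 0.2
    feature2 = rng.normal(0, 0.2, n)
    labels = rng.choice(["A", "B"], size=n)    # uniform choice, so 50% each
    return [f"{float(f1)},{float(f2)},{str(label)}"
            for f1, f2, label in zip(feature1, feature2, labels)]

rng = np.random.default_rng(42)  # assumed seed; any fixed integer works
training_lines = make_samples(20, rng)
test_lines = make_samples(30, rng)
print(training_lines[0])
```

Re-running with the same seed yields the same lines, which makes accuracy figures comparable between runs.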


# Euclidean Distance Function

Calculate the Euclidean distance between two points represented as lists. This function computes the straight-line distance between two points p and q in n-dimensional space using the formula: distance = sqrt((q1-p1)² + (q2-p2)² + ... + (qn-pn)²).

This is commonly used in machine learning algorithms like K-Nearest Neighbors (KNN) to measure similarity between data points.
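
Written in summation form for two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ (symbol names chosen here for illustration), the formula and a worked instance for the points (1, 2) and (4, 6) are:

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
\qquad
d\big((1, 2), (4, 6)\big) = \sqrt{(4 - 1)^2 + (6 - 2)^2} = \sqrt{9 + 16} = 5
```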

<img src="euclidean_distance_example.png" width="auto" height="100" />

In [7]:
import math
from typing import List

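def euclideanDistance(list1: List[float], list2: List[float]) -> float:
    """
    Computes the Euclidean distance between two points of equal dimension.
    """
    sumList = 0  # accumulates the squared differences between coordinates
    # Note: zip stops at the shorter list, so extra coordinates are silently ignored
    for x, y in zip(list1, list2):
        sumList += (y - x) ** 2
    return math.sqrt(sumList)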

# Test the function
point1 = [1, 2]
point2 = [4, 6]
distance = euclideanDistance(point1, point2)
print(f"Distance between {point1} and {point2}: {distance}")

point3 = [0, 0, 0]
point4 = [3, 4, 0]
distance2 = euclideanDistance(point3, point4)
print(f"Distance between {point3} and {point4}: {distance2}")

Distance between [1, 2] and [4, 6]: 5.0
Distance between [0, 0, 0] and [3, 4, 0]: 5.0
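
As a cross-check, Python 3.8 and later ship `math.dist`, which computes the same quantity. The sketch below assumes Python 3.8+ and compares it with a compact restatement of the function above; `euclidean` is a hypothetical name used only here.

```python
import math

def euclidean(p, q):
    """Compact restatement of the hand-written distance (illustrative only)."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

point1, point2 = [1, 2], [4, 6]
print(euclidean(point1, point2))   # 5.0
print(math.dist(point1, point2))   # 5.0
```

One difference to keep in mind: `math.dist` raises `ValueError` when the two points have different dimensions, whereas a `zip`-based loop quietly truncates to the shorter one.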


# KNN Classification Function

This function implements the K-Nearest Neighbors classification algorithm. It takes a test sample and finds the k closest training samples based on Euclidean distance, then returns the most common label (mode) among those k neighbors.

**Parameters:**
- testList: The test sample to classify (list of features)
- trainingLists: List of all training samples
- trainingLabels: Corresponding labels for training samples
- k: Number of nearest neighbors to consider

**Returns:** The predicted class label based on majority vote of k nearest neighbors

In [10]:
from operator import itemgetter # builds a callable that fetches the item at a given index or key
from statistics import mode # returns the most common value in a sequence
from typing import List

def classify(testList, trainingLists: List[List], trainingLabels, k):
    """
    Predicts the label of testList by majority vote among its k nearest training samples
    """
    distance = []  # Store distance and label pairs

    for trainingList, label in zip(trainingLists, trainingLabels):  # Iterate through training data
        dist = euclideanDistance(testList, trainingList)  # Calculate distance to test point
        distance.append((dist, label))  # Store as tuple (distance, label)

    distance.sort(key=itemgetter(0))  # Sort by distance (ascending order)
    voteLabels = []  # Store labels of k nearest neighbors
    for x in distance[:k]:  # Get first k closest neighbors
        voteLabels.append(x[1])  # Extract label from tuple

    return mode(voteLabels)  # Return most frequent label
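
One detail worth checking is what happens when the k neighbors split evenly between labels, which is possible whenever k is even. `statistics.mode` raises `StatisticsError` on such ties before Python 3.8 and returns the first value encountered from 3.8 onward. The following is a minimal sketch, not the notebook's method, that makes the tie rule explicit with `collections.Counter`. The name `vote` is hypothetical, and it assumes the labels arrive sorted from nearest to farthest, as they are inside `classify`.

```python
from collections import Counter

def vote(labels_by_distance):
    """Majority vote; ties go to the label that appears first (the nearest one)."""
    counts = Counter(labels_by_distance)
    best = max(counts.values())
    # Walk labels in nearest-first order and return the first one with the top count.
    for label in labels_by_distance:
        if counts[label] == best:
            return label

print(vote(["B", "A", "A", "B"]))  # B: 2-2 tie, and B is the nearest neighbor
print(vote(["A", "B", "B"]))       # B: clear majority
```

Choosing an odd k for a two-class problem avoids the tie case altogether.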

# KNN Model Implementation and Evaluation

Implementation of k-nearest neighbors - test

This section loads the previously generated training and test datasets, applies the KNN classification algorithm, and evaluates the model's accuracy by comparing predictions with actual labels.

**Definition of accuracy:** number of correct predictions / total number of predictions made
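
As a small illustration of this definition, the sketch below computes accuracy for two label lists. The function name `accuracy` and the example labels are made up for the demonstration.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of zero predictions")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

predicted = ["A", "B", "A", "A"]    # made-up predictions
actual = ["A", "B", "B", "A"]       # made-up ground truth
print(accuracy(predicted, actual))  # 0.75
```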

**Process:**
1. Load training and test data from files
2. Apply KNN classification to each test sample
3. Calculate and display model accuracy

**Summary**
In short, given several labeled vectors in an n-dimensional space, KNN finds the k closest neighbors of a new vector in that space and assigns it the most common label among those k neighbors.

In [12]:
import codecs

training = []
test = []
trainingLabels = []
testLabels = []
k = 5

print("Load training samples")
with codecs.open("KNNTraining.txt", "r", "UTF-8") as file:
    for line in file:
        elements = line.rstrip("\n").split(",")  # Split by comma
        training.append([float(elements[0]), float(elements[1])])  # Features as floats
        trainingLabels.append(elements[2])  # Label as string

print("Load test samples")
with codecs.open("KNNTest.txt", "r", "UTF-8") as file:
    for line in file:
        elements = line.rstrip("\n").split(",")  # Split by comma
        test.append([float(elements[0]), float(elements[1])])  # Features as floats
        testLabels.append(elements[2])  # Label as string

print("Apply the KNN approach over test samples")
correctPredictions = 0
totalPredictions = len(test)

for i, testSample in enumerate(test):  # Iterate through test samples
    prediction = classify(testSample, training, trainingLabels, k)  # Get prediction
    if prediction == testLabels[i]:  # Check if prediction matches actual label
        correctPredictions += 1
    print(f"Test {i+1}: Predicted = {prediction}, Actual = {testLabels[i]}")

print("Model accuracy: " + str(correctPredictions/totalPredictions))

Load training samples
Load test samples
Apply the KNN approach over test samples
Test 1: Predicted = B, Actual = A
Test 2: Predicted = B, Actual = A
Test 3: Predicted = A, Actual = B
Test 4: Predicted = A, Actual = A
Test 5: Predicted = A, Actual = A
Test 6: Predicted = A, Actual = A
Test 7: Predicted = B, Actual = A
Test 8: Predicted = A, Actual = B
Test 9: Predicted = A, Actual = B
Test 10: Predicted = B, Actual = B
Test 11: Predicted = B, Actual = A
Test 12: Predicted = A, Actual = B
Test 13: Predicted = A, Actual = A
Test 14: Predicted = B, Actual = A
Test 15: Predicted = A, Actual = B
Test 16: Predicted = B, Actual = B
Test 17: Predicted = B, Actual = B
Test 18: Predicted = B, Actual = B
Test 19: Predicted = A, Actual = B
Test 20: Predicted = A, Actual = A
Test 21: Predicted = B, Actual = B
Test 22: Predicted = B, Actual = A
Test 23: Predicted = A, Actual = B
Test 24: Predicted = A, Actual = A
Test 25: Predicted = A, Actual = A
Test 26: Predicted = A, Actual = A
Test 27: Predicted

# 6 Implementation in Scikit-learn

With the basic Python implementation done, we will now use the library favored by people working in data science and data analytics as a point of comparison.

## 6.1 Basic KNN parameters

The KNN algorithm is available in scikit-learn through the classes KNeighborsClassifier for classification problems and KNeighborsRegressor for regression problems. The library's kNN implementation is intuitive and flexible, allowing configuration of parameters such as the number of neighbors k, the distance metric (for example, Euclidean or Manhattan), and the way neighbors are weighted. This makes it easy to tailor the model to different kinds of data and specific needs.

Specifically, the KNN parameters are the following:
1. n_neighbors: Specifies the number of nearest neighbors to consider for classification. This parameter controls the value of k in kNN. A commonly used value is 5, which is also the scikit-learn default, but it can be adjusted to the specific problem.
2. weights: Defines how the neighbors are weighted. It can be:
   - uniform: All neighbors have the same weight.
   - distance: Closer neighbors have a larger weight, inversely proportional to their distance.
   - A function that assigns custom weights can also be passed.
3. algorithm: Determines the algorithm used to compute the neighbors. The options are:
   - auto: Automatically chooses the most appropriate algorithm based on the data.
   - ball_tree: Uses a Ball tree search structure.
   - kd_tree: Uses a KD tree search structure.
   - brute: Performs an exhaustive search over all neighbors.
4. p: Specifies the type of distance between vectors. If p=1, the Manhattan distance is used; if p=2, the Euclidean distance is used. For other values, the general Minkowski distance is used.
5. n_jobs: Number of parallel jobs used to run the neighbor search. A value of -1 uses all available processor cores.
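
To connect these parameters to code, here is a minimal sketch using `KNeighborsClassifier`, assuming scikit-learn is installed. The toy points and labels are invented for illustration. Unlike the random labels generated earlier, they form two clearly separated clusters, so the predictions are easy to verify by eye.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data: cluster 'A' near the origin, cluster 'B' near (1, 1).
X_train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
           [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
y_train = ["A", "A", "A", "B", "B", "B"]

model = KNeighborsClassifier(
    n_neighbors=3,       # k
    weights="uniform",   # every neighbor counts equally
    algorithm="auto",    # let scikit-learn pick the search structure
    p=2,                 # Euclidean distance
)
model.fit(X_train, y_train)

X_test = [[0.05, 0.05], [0.95, 0.95]]
print(model.predict(X_test))            # ['A' 'B']
print(model.score(X_test, ["A", "B"]))  # 1.0
```

Setting `weights="distance"` or `p=1` in the constructor is enough to try the other options described above, and `score` reports the same accuracy measure used in the hand-written evaluation loop.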