# Imputing the data

Some fields have missing values, so we need to perform **imputation**: each gap is filled with a substitute value computed with the k-nearest neighbors (KNN) algorithm.

This must be done after normalization so that all features are on the same scale; otherwise, features with larger ranges would dominate the distance computation.
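
To illustrate why scale matters for distance-based methods, here is a minimal sketch on made-up points. The three points and the min-max scaling are assumptions chosen for the example; this notebook does not state which normalization was applied to the dataset.

```python
import numpy as np

# Hypothetical toy points with two features on very different scales
a = np.array([1.0, 1000.0])
b = np.array([2.0, 1000.0])   # differs from a only in the small-scale feature
c = np.array([1.0, 1100.0])   # differs from a only in the large-scale feature

# Raw distances: the large-scale feature dominates
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Min-max scale each feature to [0, 1] (assumed scaling, for illustration only)
data = np.vstack([a, b, c])
mins, maxs = data.min(axis=0), data.max(axis=0)
scaled = (data - mins) / (maxs - mins)

# After scaling, both features contribute comparably
scaled_ab = np.linalg.norm(scaled[0] - scaled[1])
scaled_ac = np.linalg.norm(scaled[0] - scaled[2])

print("raw:", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

Before scaling, `c` looks far from `a` simply because its second feature has a wide range; after scaling, `b` and `c` are equally distant from `a`.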

In [1]:
import pandas as pd
import numpy as np

testSrc = "../data/dataset_test_norm.csv"
df_test = pd.read_csv(testSrc, index_col=0)

In [2]:
complete_rows = df_test.dropna()
incomplete_rows = df_test[df_test.isnull().any(axis=1)]
print("Number of complete rows:", len(complete_rows))
print("Number of incomplete rows:", len(incomplete_rows))

Number of complete rows: 332
Number of incomplete rows: 68


## 1. Euclidean Distance

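This function returns the **Euclidean distance** between two rows, computed only over the specified columns. In the formula, $p_i$ and $q_i$ are the values of the two rows in the $i$-th of the $n$ selected columns.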

$$
dist = \sqrt{\sum_{i=1}^n (q_i - p_i)^2}
$$

In [3]:
def euclidean_distance(row1, row2, columns):
    return np.sqrt(sum((row1[columns] - row2[columns])**2))
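
As a quick check of the function above, the following sketch applies it to two made-up rows (the values are hypothetical, only the column names come from the dataset). It also shows what happens if a column containing a missing value is included, which is why the distance is restricted to the columns that are available.

```python
import numpy as np
import pandas as pd

def euclidean_distance(row1, row2, columns):
    return np.sqrt(sum((row1[columns] - row2[columns])**2))

# Hypothetical rows; "Flying" is missing in p
p = pd.Series({"Herbology": 0.0, "Potions": 0.0, "Flying": np.nan})
q = pd.Series({"Herbology": 3.0, "Potions": 4.0, "Flying": 1.0})

# Distance over the columns that are present in both rows: sqrt(3^2 + 4^2)
d = euclidean_distance(p, q, ["Herbology", "Potions"])

# Including a column with a missing value propagates NaN into the result
d_nan = euclidean_distance(p, q, ["Herbology", "Flying"])

print(d)
print(d_nan)
```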

## 2. Get K Neighbors

Finds the K closest complete rows (neighbors) for a given row.

In [4]:
def get_k_neighbors(target_row, complete_data, columns_to_use, k=5):
    distances = []
    for idx, row in complete_data.iterrows():
        dist = euclidean_distance(target_row, row, columns_to_use)
        distances.append((idx, dist))

    # Sort by distance and get closest neighbors
    distances.sort(key=lambda x: x[1])
    return [idx for idx, _ in distances[:k]]
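
The row-by-row loop above is easy to follow but can be slow on larger datasets. As an alternative sketch, not the notebook's method, the same search can be vectorized with NumPy. The name `get_k_neighbors_vectorized` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def get_k_neighbors_vectorized(target_row, complete_data, columns_to_use, k=5):
    # Differences between every complete row and the target on the chosen columns
    target = target_row[columns_to_use].to_numpy(dtype=float)
    diffs = complete_data[columns_to_use].to_numpy(dtype=float) - target
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    # A stable sort keeps the original row order among ties, like list.sort
    order = np.argsort(dists, kind="stable")[:k]
    return list(complete_data.index[order])

# Hypothetical complete rows and a target row with one missing column
complete = pd.DataFrame(
    {"x": [0.0, 1.0, 5.0, 2.0], "y": [0.0, 1.0, 5.0, 2.0]},
    index=[10, 11, 12, 13],
)
target = pd.Series({"x": 0.9, "y": 0.9, "z": np.nan})

neighbors = get_k_neighbors_vectorized(target, complete, ["x", "y"], k=2)
print(neighbors)
```

The function returns index labels, so the result can be passed to `.loc` in the same way as the output of `get_k_neighbors`.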

## 3. Impute Missing Values

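Imputes missing values using KNN: each missing value is replaced by the mean of that column over the row's k nearest complete neighbors, with distances measured only on the columns the row actually has.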

In [5]:
def impute_missing_values(df, k=5):
    df_imputed = df.copy()
    complete_data = df.dropna()

    # For each row with missing values
    for idx, row in df[df.isnull().any(axis=1)].iterrows():
        # Find missing values in row
        missing_cols = row[row.isnull()].index

        # Find available columns to calculate distance
        available_cols = row[row.notnull()].index

        # Find closest k neighbours
        neighbors_idx = get_k_neighbors(row, complete_data, available_cols, k)

        # Impute each missing value with the mean of its neighbours
        for col in missing_cols:
            neighbor_values = complete_data.loc[neighbors_idx, col]
            df_imputed.loc[idx, col] = neighbor_values.mean()

    return df_imputed
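
To see the whole procedure on something small enough to check by hand, here is a self-contained sketch that repeats the three functions in condensed form and runs them on a made-up four-row frame. The columns `A` and `B` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

def euclidean_distance(row1, row2, columns):
    return np.sqrt(sum((row1[columns] - row2[columns])**2))

def get_k_neighbors(target_row, complete_data, columns_to_use, k=5):
    distances = [
        (idx, euclidean_distance(target_row, row, columns_to_use))
        for idx, row in complete_data.iterrows()
    ]
    distances.sort(key=lambda x: x[1])
    return [idx for idx, _ in distances[:k]]

def impute_missing_values(df, k=5):
    df_imputed = df.copy()
    complete_data = df.dropna()
    for idx, row in df[df.isnull().any(axis=1)].iterrows():
        missing_cols = row[row.isnull()].index
        available_cols = row[row.notnull()].index
        neighbors_idx = get_k_neighbors(row, complete_data, available_cols, k)
        for col in missing_cols:
            df_imputed.loc[idx, col] = complete_data.loc[neighbors_idx, col].mean()
    return df_imputed

# Hypothetical toy frame: row "r3" is missing its value in column "B"
toy = pd.DataFrame(
    {"A": [0.0, 0.1, 1.0, 0.05], "B": [0.0, 0.2, 1.0, np.nan]},
    index=["r0", "r1", "r2", "r3"],
)

toy_imputed = impute_missing_values(toy, k=2)
print(toy_imputed)
```

On column `A`, the two closest complete rows to `r3` are `r0` and `r1`, so the missing `B` value becomes the mean of their `B` values, 0.0 and 0.2. The original frame is left unchanged because the function works on a copy.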

In [6]:
df_test_imputed = impute_missing_values(df_test, k=5)

In [7]:
print("Verify no missing values:")
print(df_test_imputed.isnull().sum())
print("Dataset dimensions:", df_test_imputed.shape)

Verify no missing values:
Arithmancy                       0
Herbology                        0
Defense Against the Dark Arts    0
Divination                       0
Muggle Studies                   0
Ancient Runes                    0
Transfiguration                  0
Potions                          0
Care of Magical Creatures        0
Charms                           0
Flying                           0
dtype: int64
Dataset dimensions: (400, 11)


In [8]:
df_test_imputed.to_csv('../data/dataset_test_norm_imputed.csv')

# Results

- The resulting file `dataset_test_norm_imputed.csv` contains 400 complete, normalized rows.
- The dataset is ready to be used in the classification process.
- The values were normalized with the same parameters that were used for the training dataset.
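
The last point can be sketched as follows: the scaling parameters are estimated on the training data only and then reused for the test data. The min-max scaling and the toy values are assumptions made for illustration; the normalization step itself is outside this notebook.

```python
import pandas as pd

# Hypothetical raw values for a single feature
train = pd.DataFrame({"Flying": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"Flying": [15.0, 40.0]})

# Parameters are estimated on the training set only
train_min = train.min()
train_max = train.max()

# The test set is transformed with the training parameters, not its own
test_norm = (test - train_min) / (train_max - train_min)
print(test_norm)
```

Because the test set reuses the training parameters, a test value outside the training range can fall outside [0, 1]; this is expected and keeps both datasets on the same scale.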