In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np

# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
X = adult.data.features
y = adult.data.targets

# metadata
print(adult.metadata)

# variable information
print(adult.variables)




## Dataset Analysis

### About
We chose the [Adult dataset](https://archive.ics.uci.edu/dataset/2/adult) (dataset ID 2) from the UCI Machine Learning Repository. It compiles US Census information relevant to predicting income. We will use it to predict whether a sample's income exceeds 50,000 USD a year.

### Does it comply with the requirements?
The Adult dataset contains 48,842 rows, well over the required 2,000. It has 14 features; some will not be used, but we will still consider at least 10. Finally, there are several opportunities for encountering data leakage.

First, the feature *fnlwgt* (final weight) is the census sampling weight: an estimate of how many people in the population share this row's characteristics. Because it is computed across many records rather than from the individual alone, it is not a property of the sample itself. Therefore, we will not use this column during training.

Second, the dataset contains several duplicate entries. We remove them so that each distinct sample appears once and carries equal weight.

Finally, there are two columns pertaining to education: a string label (*education*) and an integer encoding of the same labels (*education-num*). Only one of them should be used: keeping both would give education extra weight relative to the other features, and samples with unknown education backgrounds would count for less.

Therefore, this dataset complies with the project requirements.
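
The claim that the two education columns carry the same information can be checked directly. The sketch below uses a small hand-made frame (the rows are illustrative, not taken from the dataset) and tests whether each label maps to exactly one code and vice versa. The same check should work on `X` if the columns are named `education` and `education-num`.

```python
import pandas as pd

# Toy stand-in for the two education columns (rows are illustrative).
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
    "education-num": [13, 9, 13, 14, 9],
})

# Count distinct codes per label and distinct labels per code.
codes_per_label = toy.groupby("education")["education-num"].nunique()
labels_per_code = toy.groupby("education-num")["education"].nunique()

# A one-to-one mapping means one of the two columns is redundant.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print("one-to-one mapping:", is_one_to_one)
```

If the check holds on the real data, dropping either column loses no information.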

In [None]:
# this block will drop some columns

dropped_columns = [
    'fnlwgt',
    'education-num'
]

new_data = X.drop(columns=dropped_columns)

print(new_data.columns)

# now new_data has all the right variables

In [None]:
# drops duplicate entries

print(len(new_data))
new_data = new_data.drop_duplicates()
print(len(new_data))
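
One thing to watch after removing duplicates: the labels in `y` still contain every original row, so they need to be filtered to the surviving rows before training. A minimal sketch on toy data (the frame and series here are hypothetical stand-ins for `new_data` and `y`), relying on `drop_duplicates` keeping the original index labels:

```python
import pandas as pd

# Hypothetical stand-ins for the feature frame and the label column.
features = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "sex": ["Male", "Female", "Male", "Female"],
})
labels = pd.Series(["<=50K", ">50K", "<=50K", ">50K"], name="income")

# drop_duplicates keeps the first occurrence and preserves its index label.
deduped = features.drop_duplicates()

# Select the labels that belong to the surviving rows.
aligned_labels = labels.loc[deduped.index]

print(len(deduped), len(aligned_labels))
```

Note that this deduplicates on features only. Rows with identical features but different labels would collapse to the first one, which may or may not be what is wanted.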

Some columns have missing values. We can handle them in one of two ways:
1. Drop all rows with missing values
2. Replace missing values

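We take the second approach. sklearn's `SimpleImputer`, applied to each group of columns through a `ColumnTransformer`, replaces missing numeric values with the column mean and missing categorical values with the mode (the most common value).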
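
To make the imputation step concrete, here is a small sketch of `SimpleImputer` on a toy frame. The values are made up for illustration, and the two columns are imputed separately rather than through a `ColumnTransformer` to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (values are illustrative).
toy = pd.DataFrame({
    "age": [20.0, 40.0, np.nan],
    "workclass": pd.Series(["Private", np.nan, "Private"], dtype=object),
})

# Numeric column: fill with the mean of the observed values.
num_imputer = SimpleImputer(strategy="mean")
age_filled = num_imputer.fit_transform(toy[["age"]])

# Categorical column: fill with the most frequent observed value.
cat_imputer = SimpleImputer(strategy="most_frequent")
workclass_filled = cat_imputer.fit_transform(toy[["workclass"]])

print(age_filled.ravel())
print(workclass_filled.ravel())
```

In a pipeline the imputer is fitted on the training split only, so the fill values never depend on test rows.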

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

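# Use the cleaned features (columns dropped, duplicates removed) and their matching labels
X_clean = new_data
# Take the single target column as a Series; some labels may carry a trailing
# period (e.g. '>50K.'), so strip it to leave only two classes
y_clean = y.loc[new_data.index].iloc[:, 0].str.rstrip('.')

# 80-20 random split
rng = np.random.default_rng(0)
idx = np.arange(len(X_clean))
rng.shuffle(idx)
split = int(0.8 * len(idx))

train_idx = idx[:split]
test_idx  = idx[split:]

X_train = X_clean.iloc[train_idx]
y_train = y_clean.iloc[train_idx]

X_test = X_clean.iloc[test_idx]
y_test = y_clean.iloc[test_idx]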

num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

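cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Ignore categories that appear only in the test split instead of raising an error
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])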

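# hours-per-week is numeric too; scaling it keeps it from dominating KNN distances
num_cols = ['age', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass', 'education', 'marital-status', 'occupation',
            'relationship', 'race', 'sex', 'native-country']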

#Transform Data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipe, num_cols),  # Impute with the mean, then standardize numerical columns
        ('cat', cat_pipe, cat_cols)   # Impute with the mode, then one-hot encode categorical columns
    ],
    remainder='passthrough'  # Pass any column not listed above through unchanged
)

knn_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("knn", KNeighborsClassifier(n_neighbors=25, weights="distance", p=1)),
])

knn_pipe.fit(X_train, y_train)
pred = knn_pipe.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, pred))