**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
DATA_HOME = "https://github.com/michalgregor/ml_notebooks/blob/main/data/{}?raw=1"

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(DATA_HOME.format("adult_income.zip"), directory="data/adult_income")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Pipelines and KNN: The Adult Income Dataset

As a more practical example of data preprocessing and KNN-based classification, we will use the [Adult Income Dataset](https://archive.ics.uci.edu/ml/datasets/adult), which contains census data. The task is to predict whether a particular person's income is above or below 50,000 dollars.

As usual, we will first display the description of the data.



In [None]:
with open("data/adult_income/adult.names", "r") as file:
    print(file.read())

Let's start by loading the data from CSV files. The dataset comes pre-split into a training and a testing fold, so we will load each separately. *In the testing data, the values in the last column end with an extra period. To make them compatible with the training set, we will remove the period directly after loading the data.*



In [None]:
df_train = pd.read_csv("data/adult_income/adult.data",
                       header=None)
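# skiprows=1: the first line of adult.test is skipped on loading
df_test = pd.read_csv("data/adult_income/adult.test",
                      header=None, skiprows=1)

# remove the trailing period from the labels in the testing set
df_test[14] = df_test[14].apply(lambda x: x[:-1])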
df_train.head()

---
### Task 1: Column Selection

**Our first task, as in the previous example, is to select which columns will be used as inputs and to decide whether they contain numeric or categorical data.**  The desired outputs are in the last column.

---
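
One possible way to get a first guess at the column types is to look at the dtypes that pandas inferred while loading. The sketch below uses a made-up three-column frame (the values and the names `df_toy`, `numeric_cols` and `categorical_cols` are hypothetical, not taken from the Adult data); the integer column labels mimic what `header=None` produces.

```python
import pandas as pd

# Hypothetical toy frame with integer column labels, as with header=None.
df_toy = pd.DataFrame({
    0: [39, 50, 38],                        # numbers
    1: ["Private", "Self-emp", "Private"],  # text
    2: [40, 13, 40],                        # numbers
})

# Columns with a numeric dtype vs. all remaining columns.
numeric_cols = df_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = df_toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

The dtype is only a heuristic: a column that stores category codes as integers would look numeric here, so the result should still be checked against the dataset description.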


In [None]:
categorical_inputs = [           ]  # ----

numeric_inputs = [               ]  # ----

output = 14

---
### Task 2: Creating the Pipeline

**The next step is to create our preprocessing pipeline. You can copy the pipeline from the previous example into the following cell.**  The desired outputs will be preprocessed using the `OrdinalEncoder` transformer.

---
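
As a reminder of the general shape of such a pipeline, here is a minimal sketch on made-up data. It is not the solution to the task: the toy frame, the column lists and the choice of imputation strategies and encoder are assumptions made for illustration, and the pipeline from the previous example may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data; integer column labels as with header=None.
df_toy = pd.DataFrame({
    0: [25.0, np.nan, 40.0, 31.0],  # numeric column with a missing value
    1: ["a", "b", "a", "b"],        # categorical column
})

toy_preproc = make_column_transformer(
    # numeric columns: fill in missing values, then standardize
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), [0]),
    # categorical columns: fill in missing values, then encode as numbers
    (make_pipeline(SimpleImputer(strategy="most_frequent"), OrdinalEncoder()), [1]),
)

X_toy = toy_preproc.fit_transform(df_toy)
print(X_toy.shape)
```

Each tuple pairs a transformer (here a small pipeline) with the list of columns it should be applied to; columns that are not listed are dropped by default.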


In [None]:
input_preproc = make_column_transformer(
    
    
    
    # ----


    

In [None]:
output_enc = OrdinalEncoder()

### Data Preprocessing

We will now use the transformers created above to preprocess our data.
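
To see what the output encoder does and why the result is reshaped afterwards, consider this small sketch. The labels and the names `labels`, `enc`, `encoded` and `flat` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labels; the real ones come from the last column of the dataset.
labels = pd.DataFrame({"income": ["low", "high", "low"]})

enc = OrdinalEncoder()
encoded = enc.fit_transform(labels)  # 2-D array with a single column
flat = encoded.reshape(-1)           # 1-D array, one entry per row

print(enc.categories_[0])  # the categories, in the order of their codes
print(encoded.shape, flat.shape)
print(flat)
```

`OrdinalEncoder` expects and returns 2-D data, which is why a one-column frame is passed in and why the result is flattened into the 1-D label vector that classifiers usually expect.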



In [None]:
X_train = input_preproc.fit_transform(df_train)
Y_train = output_enc.fit_transform(df_train[[output]]).reshape(-1)

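**Keep in mind that we need to use the `transform` method, not `fit_transform`, to preprocess our testing data.** The transformers must reuse the parameters learned from the training data rather than being refitted on the testing data.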
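
The difference can be illustrated with a `StandardScaler` on made-up numbers (the arrays `train` and `test` below are hypothetical): the mean and standard deviation are estimated on the training data only and then reused for the testing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0]])  # mean 5, standard deviation 5
test = np.array([[5.0], [20.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # estimates mean/std from train
test_scaled = scaler.transform(test)        # reuses the training mean/std

print(train_scaled.ravel())  # [-1.  1.]
print(test_scaled.ravel())   # [0. 3.]
```

Calling `fit_transform` on the testing data instead would re-estimate the statistics from the test set, so the same raw value would be mapped to different numbers in training and testing.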



In [None]:
X_test = input_preproc.transform(df_test)
Y_test = output_enc.transform(df_test[[output]]).reshape(-1)

### Training

The code to train the model can be copied from the previous example verbatim.



In [None]:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, Y_train)

### Testing

The code to test the model can be copied verbatim as well.
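
As a quick reminder of how the two evaluation outputs relate, the sketch below computes accuracy both directly and from a confusion matrix. The label vectors `actual` and `predicted` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels.
actual = np.array([0, 0, 1, 1, 1])
predicted = np.array([0, 1, 1, 1, 0])

cm_toy = pd.crosstab(actual, predicted,
                     rownames=["actual"],
                     colnames=["predicted"])

# Correct predictions lie on the diagonal of the confusion matrix.
counts = cm_toy.to_numpy()
acc_from_cm = np.trace(counts) / counts.sum()
acc_direct = accuracy_score(actual, predicted)

print(cm_toy)
print(acc_from_cm, acc_direct)
```

Reading accuracy off the diagonal only works when every class appears among both the actual and the predicted labels, so that the table is square with matching row and column order.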



In [None]:
y_test = model.predict(X_test)

In [None]:
cm = pd.crosstab(Y_test, y_test,
                 rownames=['actual'],
                 colnames=['predicted'])
print(cm)

In [None]:
acc = accuracy_score(Y_test, y_test)
print("Accuracy = {}".format(acc))