KNN classifier

In [1]:
import pandas as pd
import settings
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [2]:
fullTrainData = pd.read_excel(settings.labelledDatapath)

print(fullTrainData.columns)
print("Class balance:")
fullTrainData["class"].value_counts()

  warn("Workbook contains no default style, apply openpyxl's default")


Index(['RowID', 'age', 'workclass', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'class'],
      dtype='object')
Class balance:


<=50K    24720
>50K      7841
Name: class, dtype: int64

In [3]:
#removing redundant columns
fullTrainData = fullTrainData.drop(labels=settings.redundantFeatures, axis=1)
fullTrainData.head()

Unnamed: 0,age,workclass,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
#making labels numerical - 1 for >50K, 0 for <=50K or unknown
#vectorised comparison instead of a per-row lambda; the result is the same
fullTrainData["class"] = (fullTrainData["class"] == ">50K").astype(int)

### Idea for using KNN

The Euclidean distance between vectors $X$ and $Y$ comprising $n$ numerical attributes is given as:
$$d_{num}(X,Y) = \sqrt{\sum_{i=1}^{n}{(X_i-Y_i)^2}}$$

For a numerical attribute, $A$, the values are first scaled by:
$$n(A_i) = \frac{A_i - \min(A)}{\max(A)-\min(A)}$$
(this is what `MinMaxScaler` from `sklearn.preprocessing` computes with its default feature range of $[0, 1]$)
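
As a quick sanity check of the scaling formula, the sketch below compares a manual min-max computation with `MinMaxScaler`. The ages used are illustrative toy values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of ages (illustrative values only, not taken from the dataset)
ages = np.array([[17.0], [39.0], [50.0], [90.0]])

# Manual min-max scaling: (A_i - min(A)) / (max(A) - min(A))
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation via scikit-learn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(ages)

# Both routes should agree up to floating-point rounding
print(np.allclose(manual, scaled))
```

After scaling, the smallest value maps to 0 and the largest to 1, so every numerical attribute contributes at most 1 to the squared distance.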



For categorical attributes, distance can be computed by:
$$d_{cat}(X_i,Y_i) =
    \begin{cases}
        0 & \text{if same category} \\
        b & \text{else}
    \end{cases}
$$


> Potential values for $b$:
>
> - $N_{num}/N_{cat}$, where $N_{num}$ and $N_{cat}$ are the numbers of numerical and categorical attributes respectively.
>     - Every categorical attribute gives the same "maximum" distance when not matched.
>     - The "maximum" distance is the ratio of the number of numerical attributes to the number of categorical attributes.
>         - When there are the same numbers of each type of attribute, each categorical non-match gets a value of 1.
>         - When there are more categorical attributes, a non-match pushes the distance less; when there are fewer categorical attributes, a non-match pushes the distance more.
>     - If every non-matching attribute simply gave a distance of 1, then differing on just a few categorical attributes would make vectors look hugely dissimilar (even if they are very similar in other respects).
>     - Implementation-wise, this amounts to one-hot encoding the categorical features, multiplying each 1 by $0.5 \times N_{num}/N_{cat}$ and then taking the Euclidean distance between the whole vectors. A non-match changes two one-hot entries, so a weight $w$ contributes $w\sqrt{2}$ to the distance: the weight $0.5 \times N_{num}/N_{cat}$ gives a per-attribute distance of $b/\sqrt{2}$, and $w = b/\sqrt{2}$ would be needed to reproduce $b$ exactly.
>
> - $1/n_i$, where $n_i$ is the number of categories for attribute $i$.
>
>     - This corresponds to encoding categorical features as one-hot vectors, dividing each 1 by $2 n_i$, and then taking the Euclidean distance (with the same $\sqrt{2}$ caveat as above).
>     - The more categories an attribute has, the less impact a non-match for that attribute has on the distance.
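
The relationship between the one-hot weight and the resulting mismatch distance can be checked numerically. The sketch below assumes 5 numerical and 7 categorical attributes, which is consistent with the weight of 0.357143 that appears in the notebook output further down; the helper `one_hot` and the category count of 4 are hypothetical and only for illustration.

```python
import numpy as np

def one_hot(index, size, weight=1.0):
    """Return a one-hot vector of the given size with `weight` at `index`."""
    vec = np.zeros(size)
    vec[index] = weight
    return vec

# Attribute counts assumed for illustration (they reproduce the weight 0.357143)
n_num, n_cat = 5, 7
b = n_num / n_cat

# Weight of 0.5 * b: a mismatch differs in two entries, each by w
w_half = 0.5 * b
d_half = np.linalg.norm(one_hot(0, 4, w_half) - one_hot(2, 4, w_half))

# Weight of b / sqrt(2): the mismatch distance comes out as exactly b
w_exact = b / np.sqrt(2)
d_exact = np.linalg.norm(one_hot(0, 4, w_exact) - one_hot(2, 4, w_exact))

print(f"weight 0.5*b     -> mismatch distance {d_half:.6f}")
print(f"weight b/sqrt(2) -> mismatch distance {d_exact:.6f} (b = {b:.6f})")
```

Either weight keeps the intended behaviour of shrinking categorical mismatches as the number of categorical attributes grows; they differ only by a constant factor of $\sqrt{2}$.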


Categorical distance could also be computed using the Jaccard index of the one-hot encoded attributes?
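
To make the Jaccard idea concrete, here is a minimal sketch of the Jaccard distance between two concatenated one-hot vectors. The function name `jaccard_distance` and the toy vectors (three attributes with 3, 2 and 2 categories) are assumptions for illustration; this is not what the notebook goes on to use.

```python
import numpy as np

def jaccard_distance(u, v):
    """Jaccard distance 1 - |u AND v| / |u OR v| for binary vectors."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 0.0  # two all-zero vectors are treated as identical
    intersection = np.logical_and(u, v).sum()
    return 1.0 - intersection / union

# Three categorical attributes, one-hot encoded and concatenated (3 + 2 + 2 columns)
x = [1, 0, 0,  1, 0,  1, 0]
y = [1, 0, 0,  1, 0,  0, 1]   # differs from x only in the third attribute

# Two attributes match (intersection = 2); the union has 4 active columns
dist = jaccard_distance(x, y)
print(dist)
```

With $m$ attributes of which $k$ match, the intersection is $k$ and the union is $2m - k$, so the distance depends only on the number of matching attributes.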



The overall distance between two vectors $X$ and $Y$ with mixed categorical and numerical attributes is:
$$ d(X,Y) = \sqrt{d_{num}(X_{num}, Y_{num})^2 + d_{cat}(X_{cat}, Y_{cat})^2}$$
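
The combined distance can be sketched directly on un-encoded records. The function below is a hypothetical helper, and it assumes that the categorical part is aggregated the same way as the numerical part, i.e. as the square root of the sum of squared per-attribute distances (the text does not state this explicitly).

```python
import numpy as np

def mixed_distance(x_num, y_num, x_cat, y_cat, b):
    """Distance over mixed attributes: sqrt(d_num^2 + d_cat^2).

    x_num, y_num: already min-max scaled numerical values
    x_cat, y_cat: raw category labels
    b: distance contributed by each categorical non-match
    """
    x_num = np.asarray(x_num, dtype=float)
    y_num = np.asarray(y_num, dtype=float)
    d_num_sq = np.sum((x_num - y_num) ** 2)

    # Assumption: each non-matching categorical attribute adds b^2
    mismatches = sum(xc != yc for xc, yc in zip(x_cat, y_cat))
    d_cat_sq = mismatches * b ** 2

    return np.sqrt(d_num_sq + d_cat_sq)

# Toy records: one numerical difference of 0.4 and one categorical mismatch
d = mixed_distance(
    x_num=[0.3, 0.8], y_num=[0.3, 0.4],
    x_cat=["Private", "Male"], y_cat=["Private", "Female"],
    b=1.0,
)
print(d)  # sqrt(0.4**2 + 1.0**2)
```

Working on one-hot encoded, weighted columns with a plain Euclidean metric achieves the same effect inside `KNeighborsClassifier` without a custom metric, which is why the notebook takes that route.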


In [5]:
#converting dataframe to one-hot representation with one column for each possible value of each categorical attribute
fullTrainData = pd.get_dummies(fullTrainData, columns=settings.categoricalFeatures)
fullTrainData.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,class,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,39,13,2174,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,13,0,0,13,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,9,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3,53,7,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,28,13,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [6]:
#weighting the one-hot encoded categorical columns
#every column that is neither numerical nor the label is a one-hot encoded categorical column
oneHotCategorical = [col for col in fullTrainData.columns if col not in settings.numericalFeatures and col != "class"]
weight = 0.5 * len(settings.numericalFeatures)/len(settings.categoricalFeatures)

#scaling each 1 in the one-hot columns down to the chosen weight
fullTrainData[oneHotCategorical] = fullTrainData[oneHotCategorical] * weight
fullTrainData.head()


Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,class,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,39,13,2174,0,40,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.357143,0.0,0.0
1,50,13,0,0,13,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.357143,0.0,0.0
2,38,9,0,0,40,0,0.0,0.0,0.0,0.357143,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.357143,0.0,0.0
3,53,7,0,0,40,0,0.0,0.0,0.0,0.357143,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.357143,0.0,0.0
4,28,13,0,0,40,0,0.0,0.0,0.0,0.357143,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
#normalising numerical attributes
scaler = MinMaxScaler()
scaler.fit(fullTrainData[settings.numericalFeatures])
print(scaler.data_max_)
fullTrainData[settings.numericalFeatures] = scaler.transform(fullTrainData[settings.numericalFeatures])
fullTrainData.head()

[9.0000e+01 1.6000e+01 9.9999e+04 4.3560e+03 9.9000e+01]


Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,class,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,0.30137,0.8,0.02174,0.0,0.397959,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.357143,0.0,0.0
1,0.452055,0.8,0.0,0.0,0.122449,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.357143,0.0,0.0
2,0.287671,0.533333,0.0,0.0,0.397959,0,0.0,0.0,0.0,0.357143,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.357143,0.0,0.0
3,0.493151,0.4,0.0,0.0,0.397959,0,0.0,0.0,0.0,0.357143,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.357143,0.0,0.0
4,0.150685,0.8,0.0,0.0,0.397959,0,0.0,0.0,0.0,0.357143,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
#splitting features and labels
attributes = fullTrainData.drop(labels=["class"], axis=1)
labels = fullTrainData["class"]

#getting test and train set
trainFeatures, testFeatures, trainLabels, testLabels = train_test_split(attributes, labels, test_size=0.2, random_state=5) #random state is like a seed to allow repeatable results
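
Note that the scaler above is fitted on the full dataset before the split, so the test rows influence the scaling ranges. A possible alternative, sketched here on synthetic data rather than the census data, is to put the scaler and classifier in a `Pipeline` so that scaling is learned from the training rows only. The value `n_neighbors=5` is a placeholder and not the value of `settings.numNeighbours`; `stratify=y` is an optional extra that keeps the class ratio similar in both splits.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for five numerical attributes (not the real dataset)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=5
)

# Split first, keeping the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5, stratify=y
)

# The scaler is fitted inside the pipeline, i.e. on the training rows only
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # placeholder neighbour count
])
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

With min-max scaling the difference is usually small, but test values outside the training range can then fall slightly outside $[0, 1]$.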


In [9]:
classifier = KNeighborsClassifier(n_neighbors=settings.numNeighbours)
classifier.fit(trainFeatures, trainLabels)

In [10]:
#getting score for classifier based on test dataset
score = classifier.score(testFeatures, testLabels)

print(f"score: {score}")

score: 0.8357131890066022
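
Because the classes are imbalanced, it helps to read the score against a majority-class baseline. The sketch below uses the class counts printed at the top of the notebook; the baseline is computed on the full dataset, so the exact figure for the 20% test split may differ slightly.

```python
# Class counts as printed by value_counts() earlier in the notebook
class_counts = {"<=50K": 24720, ">50K": 7841}

# Accuracy of a classifier that always predicts the most common class
majority_baseline = max(class_counts.values()) / sum(class_counts.values())

# Score reported by classifier.score above
knn_score = 0.8357131890066022

print(f"majority-class baseline: {majority_baseline:.4f}")
print(f"KNN score:               {knn_score:.4f}")
print(f"gain over baseline:      {knn_score - majority_baseline:.4f}")
```

Accuracy alone does not show how well the minority class (>50K) is recovered; per-class metrics such as precision and recall would give a fuller picture.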
