## Introdução

O objetivo deste trabalho é gerar um classificador knn para a base de dados [adult](http://mlr.cs.umass.edu/ml/datasets/Census+Income). Esta base de dados contém os dados do censo americado de 19xx e 
Para isto o primeiro passo é ler a base de dados e tratá-los para ser possível :
* Ler a base de dados e tratar eventuais inconsistências
* Remove colunas redudantes
* Transformar as categorias em encoding numérico 
* Normalizar atributos numéricos para variarem no intervalo de 0 a 1

### Importando bibliotecas e lendo base de dados

In [175]:
import os
import sys

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing
import scipy.stats as stats


import numbers

prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']
CURRENT_DIR = os.path.abspath(os.path.dirname(__name__))
DATA_DIR = os.path.join(CURRENT_DIR, 'data')

TRAIN_DATA_FILE = os.path.join(DATA_DIR, 'adult.data')
TEST_DATA_FILE = os.path.join(DATA_DIR, 'adult.test')

from collections import OrderedDict
#extracted from 
data_types = OrderedDict([
    ("age", "int"),
    ("workclass", "category"),
    ("final_weight", "int"),  # originally it was called fnlwgt
    ("education", "category"),
    ("education_num", "int"),
    ("marital_status", "category"),
    ("occupation", "category"),
    ("relationship", "category"),
    ("race", "category"),
    ("sex", "category"),
    ("capital_gain", "float"),  # required because of NaN values
    ("capital_loss", "int"),
    ("hours_per_week", "int"),
    ("native_country", "category"),
    ("income_class", "category"),
])
target_column = "income_class"

#reading data
def read_dataset(path):
    data = pd.read_csv(
        path,
        names=data_types,
        index_col=None,
        dtype=data_types,
        comment='|',  
        skipinitialspace=True
    )
    #data = data.drop('final_weight', axis=1)
    return data

train_data = read_dataset(TRAIN_DATA_FILE)
test_data = read_dataset(TEST_DATA_FILE)

#concatena teste a data para avaliar o pré processamento
data = pd.concat([test_data, train_data])
(data.describe(include='all'))

Unnamed: 0,age,workclass,final_weight,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_class
count,48842.0,48842,48842.0,48842,48842.0,48842,48842,48842,48842,48842,48842.0,48842.0,48842.0,48842,48842
unique,,9,,16,,7,15,6,5,2,,,,42,4
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,33906,,15784,,22379,6172,19716,41762,32650,,,,43832,24720
mean,38.643585,,189664.1,,10.078089,,,,,,1079.067626,87.502314,40.422382,,
std,13.71051,,105604.0,,2.570973,,,,,,7452.019058,403.004552,12.391444,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117550.5,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178144.5,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237642.0,,12.0,,,,,,0.0,0.0,45.0,,


Nesta primeira importação vemos que a coluna 'final_weight' contem apenas informações sobre a coleta dos dados e não nos diz nada sobre a variável que queremos aprender.

## Pré-processamento
Nesta etapa vamos
* Remover colunas redundante
* Codificar categorias
* Normalizar categorias
* Lidar com dados faltantes

### Remoção de inconsistências
Vemos de cara que algumas classes estão com valores inconsistencias. Por exemplo a classe alvo 'income_class' tem 4 valores quando na verdade deveriam ter apenas 2.


In [176]:
data.income_class.value_counts()

income_class
<=50K     24720
<=50K.    12435
>50K       7841
>50K.      3846
Name: count, dtype: int64

Observando o value_counts vemos que isto se deve a um '.' adicional em uma das categorias.
Deve ser limpado:

In [177]:
data['income_class'] = data.income_class.str.rstrip('.').astype('category')
data.income_class.value_counts()

income_class
<=50K    37155
>50K     11687
Name: count, dtype: int64

Também notamos que algumas categorias numéricas tem valores '9' de forma muito frequente o que pode indicar um placeholder para valores faltantes nestas categorias.
Observando a frequência dos 5 valores mais comuns de 'capital_gain' e 'hours_per_week'

In [178]:
import heapq
hours_per_week_counts = data.hours_per_week.value_counts()
data.hours_per_week.value_counts()[hours_per_week_counts.index.isin(heapq.nlargest(5, data.hours_per_week.unique()))]

hours_per_week
99    137
98     14
96      9
97      2
95      2
Name: count, dtype: int64

In [179]:
import heapq
capital_gain_counts = data.capital_gain.value_counts()
data.capital_gain.value_counts()[capital_gain_counts.index.isin(heapq.nlargest(5, data.capital_gain.unique()))]

capital_gain
99999.0    244
27828.0     58
25236.0     14
34095.0      6
41310.0      3
Name: count, dtype: int64

Como esperado vemos que estes valores estão muito mais frequentes que suas redondezas, indicando que devem ser removidos.
Para isto vamos substituí-los pela média destas colunas já excluindo estes valores.

In [180]:
capital_mean = np.mean(data.capital_gain[data.capital_gain != 99999])
data['capital_gain'] = data['capital_gain'].replace(99999, capital_mean)
hours_per_week_mean = np.mean(data.hours_per_week[data.hours_per_week != 99])
data['hours_per_week'] = data['hours_per_week'].replace(99, hours_per_week_mean)

### Dados faltantes
Observando os dados vemos que as colunas 'workclass', 'occuptation' e 'native_country' tem dados faltantes indicados por '?'.

In [181]:
(data == '?').sum(axis=0)

age                  0
workclass         2799
final_weight         0
education            0
education_num        0
marital_status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     857
income_class         0
dtype: int64

Nestas categorias vamos substituir os valores faltantes pelos mais frequentes nestas classes.

In [182]:
data['workclass'] = data['workclass'].replace('?', 'Private')
data['occupation'] = data['occupation'].replace('?', 'Prof-specialty')
data['native_country'] = data['native_country'].replace('?', 'United-States')

  data['workclass'] = data['workclass'].replace('?', 'Private')
  data['occupation'] = data['occupation'].replace('?', 'Prof-specialty')


### Avaliação de Classes Correlacionadas

Observando as colunas é imediato que algumas classes devem ter grande correlação entre si. Por exemplo: é de se esperar uma correlação grande entre 'relationship' e 'marital_status', 'education' e 'education_num'.
Para garantir que não estamos introduzindo nenhum viés adicional a base de dados vamos avaliar a correlação entre estas colunas.
Para avaliar a correlação entre duas categorias vamos utilizar a métrica 'Cramers V'.

In [183]:
# le = preprocessing.LabelEncoder()
# marital_status = le.fit_transform(data.marital_status)
# relationship = le.fit_transform(data.relationship)
# stats.pointbiserialr(marital_status, relationship)
def cramers_v(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = stats.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    n = n.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

confusion_matrix = pd.crosstab(data['marital_status'], data['relationship'])
cramers_v(confusion_matrix=confusion_matrix)

np.float64(0.4880589431633566)

Para comparação vamos ver a correlação entre 'marital_status' e 'occupation'


In [184]:
confusion_matrix = pd.crosstab(data['marital_status'], data['occupation'])
cramers_v(confusion_matrix=confusion_matrix)

np.float64(0.12395188062684377)

In [185]:
confusion_matrix = pd.crosstab(data['workclass'], data['occupation'])
cramers_v(confusion_matrix)

np.float64(0.18911202171141478)

Para avaliar a correlação entre 'education' e 'education_num' vamos utilizar a correlação biserial e utilizar uma serialização por LabelEncoder para  'education' apenas para verificar a correlação.

In [186]:
le = preprocessing.LabelEncoder()
# le = preprocessing.LabelEncoder()
education = le.fit_transform(data.education)
stats.pointbiserialr(education, data.education_num)

SignificanceResult(statistic=np.float64(0.3596676843392162), pvalue=np.float64(0.0))

Desta forma, vemos que há uma alta covariância entre as colunas:
* education x education_num
* marital_status x relationship
* ocupation x workclass

Para simplificar o modelo vamos escolher as colunas education_num (por já ser numérica e carregar a informação de "mais anos estudados"), relationship  e occupation (por conterém menos classes) .

### Classes com elementos superrepresentados

Vemos que na coluna 'native_country' 90% dos respondentes tem nacionalidade 'United-States' e além disso, temos 42 categorias diferentes para esta coluna.
Por conta disso vamos trocar essa classe por: native-american com 1 indicando que é americano, 0 indicando que não.

In [187]:
data['native_country'] = data['native_country'].astype('category')
mode = data['native_country'].cat.codes.mode()
usa_map = lambda a : True if a == mode[0] else False
native_usa = data['native_country'].cat.codes.map(usa_map)
data = data.drop('native_country', axis=1)
native_usa_df = pd.DataFrame(data={'native_usa': native_usa})
data = pd.concat([data, native_usa_df], axis=1)


Com isto podemos criar uma função para limpar todos os dados e mapear os valores inteiros entre 0 a 1 dividindo pelo máximo da coluna e utilizar HotEncoding para os valores categócios

In [188]:
data.describe(include='all')

Unnamed: 0,age,workclass,final_weight,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,income_class,native_usa
count,48842.0,48842,48842.0,48842,48842.0,48842,48842,48842,48842,48842,48842.0,48842.0,48842.0,48842,48842
unique,,8,,16,,7,14,6,5,2,,,,2,2
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,<=50K,True
freq,,36705,,15784,,22379,8981,19716,41762,32650,,,,37155,44689
mean,38.643585,,189664.1,,10.078089,,,,,,582.412136,87.502314,40.257612,,
std,13.71051,,105604.0,,2.570973,,,,,,2530.307226,403.004552,11.995659,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117550.5,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178144.5,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237642.0,,12.0,,,,,,0.0,0.0,45.0,,


In [189]:
# exemplo de one hot encoding
marital_oh = pd.get_dummies(data['marital_status'], dummy_na=False)
data = data.drop('marital_status', axis=1)
data = pd.concat([data, marital_oh], axis=1)
data.head(5)
#data = data.join(marital_oh)

Unnamed: 0,age,workclass,final_weight,education,education_num,occupation,relationship,race,sex,capital_gain,...,hours_per_week,income_class,native_usa,Divorced,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,Never-married,Separated,Widowed
0,25,Private,226802,11th,7,Machine-op-inspct,Own-child,Black,Male,0.0,...,40.0,<=50K,True,False,False,False,False,True,False,False
1,38,Private,89814,HS-grad,9,Farming-fishing,Husband,White,Male,0.0,...,50.0,<=50K,True,False,False,True,False,False,False,False
2,28,Local-gov,336951,Assoc-acdm,12,Protective-serv,Husband,White,Male,0.0,...,40.0,>50K,True,False,False,True,False,False,False,False
3,44,Private,160323,Some-college,10,Machine-op-inspct,Husband,Black,Male,7688.0,...,40.0,>50K,True,False,False,True,False,False,False,False
4,18,Private,103497,Some-college,10,Prof-specialty,Own-child,White,Female,0.0,...,30.0,<=50K,True,False,False,False,False,True,False,False


In [None]:
def clean_dataset(data):
    data = data.drop('final_weight', axis=1) # drops final_weight
    data = data.drop('workclass', axis=1) # drops workclass
    data = data.drop('education', axis=1) # drops education
    data = data.drop('relationship', axis=1) #drops  relationship

    data['income_class'] = data.income_class.str.rstrip('.').astype('category')

    capital_mean = np.mean(data.capital_gain[data.capital_gain != 99999])
    data['capital_gain'] = data['capital_gain'].replace(99999, capital_mean)
    hours_per_week_mean = np.mean(data.hours_per_week[data.hours_per_week != 99])
    data['hours_per_week'] = data['hours_per_week'].replace(99, hours_per_week_mean)

    #data['workclass'] = data['workclass'].replace('?', 'Private')
    data['occupation'] = data['occupation'].replace('?', 'Prof-specialty')

    # condensa classe native_country
    data['native_country'] = data['native_country'].replace('?', 'United-States')
    data['native_country'] = data['native_country'].astype('category')
    mode = data['native_country'].cat.codes.mode()
    usa_map = lambda a : True if a == mode[0] else False

    native_usa = data['native_country'].cat.codes.map(usa_map)
    data = data.drop('native_country', axis=1)
    data = pd.concat([data, native_usa], axis=1)

    data['marital_status'] = data['marital_status'].replace('?', 'Married-civ-spouse')
    #normaliza valores numéricos
    data['age'] = data['age']/90
    data['education_num'] = data['education_num']/16
    data['capital_gain'] = data['capital_gain']/41310.0
    data['capital_loss'] = data['capital_loss']/4356.0
    data['hours_per_week'] = data['hours_per_week']/98

    # one hot enconding 
    marital_oh = pd.get_dummies(data['marital_status'])
    data = data.drop('marital_status', axis=1)
    data = pd.concat([data, marital_oh], axis=1)

    occupation_oh = pd.get_dummies(data['occupation'])
    data = data.drop('occupation', axis=1)
    data = pd.concat([data, occupation_oh], axis=1)

    race_oh = pd.get_dummies(data['race'])
    data = data.drop('race', axis=1)
    data = pd.concat([data, race_oh], axis=1)

    sex_oh = pd.get_dummies(data['sex'])
    data = data.drop('sex',axis=1)
    data = pd.concat([data, sex_oh], axis=1)
    #drop duplicates 
    data = data.drop_duplicates()

    #saída 
    y = data['income_class']
    data = data.drop('income_class', axis=1)
    return data, y


In [195]:
clean_test, clean_output = clean_dataset(test_data)
#clean_test.describe(include='all')
clean_test.head(5)

  data['occupation'] = data['occupation'].replace('?', 'Prof-specialty')
  data['native_country'] = data['native_country'].replace('?', 'United-States')


Unnamed: 0,age,education_num,capital_gain,capital_loss,hours_per_week,0,Divorced,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,...,Sales,Tech-support,Transport-moving,Amer-Indian-Eskimo,Asian-Pac-Islander,Black,Other,White,Female,Male
0,0.277778,0.4375,0.0,0.0,0.408163,True,,,,,...,,,,,,,,,,
1,0.422222,0.5625,0.0,0.0,0.510204,True,,,,,...,,,,,,,,,,
2,0.311111,0.75,0.0,0.0,0.408163,True,,,,,...,,,,,,,,,,
3,0.488889,0.625,0.186105,0.0,0.408163,True,,,,,...,,,,,,,,,,
4,0.2,0.625,0.0,0.0,0.306122,True,,,,,...,,,,,,,,,,


In [91]:
clean_train, train_output = clean_dataset(train_data)

  data['occupation'] = data['occupation'].replace('?', 'Prof-specialty')
  data['native_country'] = data['native_country'].replace('?', 'United-States')


38


## Construção do kNN
Para construir o classificador kNN precisamos definir o hiperparametro n.
Para isso vamos utilizar o método k-fold cross validation.
