# PMR3508 - Machine Learning and Pattern Recognition
Analysis and classification using the [Adult](https://www.kaggle.com/c/adult-pmr3508) dataset, also available from the [UCI Repository](https://archive.ics.uci.edu/ml/index.php).

Author: Lucas Tonini Rosenberg Schneider

---

## 1 Initialization

### 1.1 Importing the required packages

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import sklearn
from sklearn import preprocessing as prep

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

%matplotlib inline

### 1.2 Reading the data

In [2]:
x_columns = ['ID', 'Age', 'Workclass', 'Final Weight', 'Education', 'Education Num', 'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital Gain', 'Capital Loss', 'Hours per Week', 'Native Country']
y_column = ['Income']

train_data_raw = pd.read_csv('data/train_data.csv', names = (x_columns + y_column), na_values = '?', header = 0)
test_data_raw = pd.read_csv('data/test_data.csv', names = x_columns, na_values = '?', header = 0)

In [None]:
print(train_data_raw.shape)
train_data_raw.head()

---

## 2 Analysis and understanding

### 2.1 Numeric data

The most relevant numeric features are discussed below; understanding them helps to better understand the problem.

In [4]:
numeric_columns = list(train_data_raw.select_dtypes(include = np.number).columns)
train_data_raw[numeric_columns].describe()

| | ID | Age | Final Weight | Education Num | Capital Gain | Capital Loss | Hours per Week |
|---|---|---|---|---|---|---|---|
| count | 32560.0 | 32560.0 | 32560.0 | 32560.0 | 32560.0 | 32560.0 | 32560.0 |
| mean | 32559.5 | 38.581634 | 189781.8 | 10.08059 | 1077.615172 | 87.306511 | 40.437469 |
| std | 9399.406719 | 13.640642 | 105549.8 | 2.572709 | 7385.402999 | 402.966116 | 12.347618 |
| min | 16280.0 | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 24419.75 | 28.0 | 117831.5 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 32559.5 | 37.0 | 178363.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 40699.25 | 48.0 | 237054.5 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 48839.0 | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [None]:
#sns.set()
#sns.pairplot(train_data_raw, vars = ['Age'], hue = 'Income')

In [3]:
def plot_feature_frequencies(data, feature):
    '''
    Plot the frequency of each value of a feature, split by income class (<=50K and >50K)
    '''

    # Count how often each feature value occurs within each income class
    less_50 = data.loc[data['Income'] == '<=50K', feature].value_counts().rename('<=50K')
    more_50 = data.loc[data['Income'] == '>50K', feature].value_counts().rename('>50K')
    # Keep only values present in both classes and order them along the x axis
    plot_data = pd.concat([less_50, more_50], axis=1).dropna().sort_index()
    plot_data.plot(xlabel = feature, ylabel = 'Frequency')

#### 2.1.1 ID

This feature contains only an identification number, so it is not meaningful to analyze and will be removed during data preparation.

#### 2.1.2 Age

In [None]:
plot_feature_frequencies(train_data_raw, 'Age')

The plot clearly shows that the proportion of people earning more than 50K rises sharply with age. Between 20 and 40 years there is rapid growth, after which the proportion remains more stable.
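
The plot shows raw counts, so the proportion has to be read off by eye. A minimal sketch of how the share of people earning more than 50K per age group could be computed directly is given here. The `toy` DataFrame is a hypothetical stand-in with made-up values, and the 10-year bin width is an assumption, not a choice made in this notebook.

```python
import pandas as pd

# Hypothetical toy sample standing in for train_data_raw (values are made up)
toy = pd.DataFrame({
    'Age':    [19, 23, 27, 31, 36, 38, 44, 47, 52, 58],
    'Income': ['<=50K', '<=50K', '<=50K', '>50K', '<=50K',
               '>50K', '>50K', '<=50K', '>50K', '>50K'],
})

# Group ages into 10-year bins: [10, 20), [20, 30), ..., [60, 70)
age_bins = pd.cut(toy['Age'], bins=list(range(10, 71, 10)), right=False)

# The mean of a boolean column is the fraction of True values,
# i.e. the share of people earning more than 50K in each bin
is_over_50k = toy['Income'] == '>50K'
share_over_50k = is_over_50k.groupby(age_bins, observed=True).mean()
print(share_over_50k)
```

On the real data, replacing `toy` with `train_data_raw` should give one proportion per age bin, which is easier to compare across ages than two overlaid count curves.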

#### 2.1.3 Final Weight

In [None]:
plot_feature_frequencies(train_data_raw, 'Final Weight')


#### 2.1.4 Education Num

In [None]:
plot_feature_frequencies(train_data_raw, 'Education Num')

Two main points stand out here. First, all individuals in the plot who earned more than 50K have an education level greater than 8, which in this dataset's coding means at least a high school diploma (HS-grad is level 9). Second, only from level 14 onward (a master's degree in this coding; bachelor's is level 13) does the number of individuals earning more than 50K exceed the number earning less. Education level is therefore likely to be an important factor when predicting income.
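
The crossover point described above can also be located numerically. The sketch below uses `pd.crosstab` with row normalization to get the income split per education level. The `toy` DataFrame is hypothetical and its values are made up for illustration; it is not the Adult data.

```python
import pandas as pd

# Hypothetical toy sample (made-up values), not the real Adult data
toy = pd.DataFrame({
    'Education Num': [9, 9, 9, 10, 10, 13, 13, 13, 14, 14, 14, 16],
    'Income': ['<=50K', '<=50K', '>50K', '<=50K', '<=50K', '<=50K',
               '<=50K', '>50K', '>50K', '>50K', '<=50K', '>50K'],
})

# Row-normalized contingency table: each row sums to 1
shares = pd.crosstab(toy['Education Num'], toy['Income'], normalize='index')

# Education levels where the >50K group outnumbers the <=50K group
majority_levels = shares.index[shares['>50K'] > shares['<=50K']].tolist()
print(shares)
print(majority_levels)
```

Unlike the frequency plot, this table keeps levels where one of the classes has no members, since `crosstab` fills missing combinations with zero.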

#### 2.1.5 Capital Gain

In [None]:
plot_feature_frequencies(train_data_raw, 'Capital Gain')

#### 2.1.6 Capital Loss

In [None]:
plot_feature_frequencies(train_data_raw, 'Capital Loss')

#### 2.1.7 Hours per Week

In [None]:
plot_feature_frequencies(train_data_raw, 'Hours per Week')

This plot makes it clear that most people work the standard 40 hours per week. As the number of hours worked rises above that, the proportion of people earning more than 50K increases, whereas below 40 hours it is quite small.

### 2.2 Categorical data

In [5]:
categoric_columns = list(train_data_raw.select_dtypes(exclude = np.number).columns)
train_data_raw[categoric_columns].describe()

| | Workclass | Education | Marital Status | Occupation | Relationship | Race | Sex | Native Country | Income |
|---|---|---|---|---|---|---|---|---|---|
| count | 30724 | 32560 | 32560 | 30717 | 32560 | 32560 | 32560 | 31977 | 32560 |
| unique | 8 | 16 | 7 | 14 | 6 | 5 | 2 | 41 | 2 |
| top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
| freq | 22696 | 10501 | 14976 | 4140 | 13193 | 27815 | 21789 | 29169 | 24719 |


---

## 3 Data preparation

### 3.1 Missing data

In [6]:
def count_missing_values(data):
    '''
    Count missing values for each feature and return a sorted DataFrame with the results
    '''

    missing_count = []
    for column in data.columns:
        missing_count.append(data[column].isna().sum())
    missing_count = np.asarray(missing_count)
    missing_count = pd.DataFrame({'feature': data.columns, 'count': missing_count,
                                'freq. [%]': 100*missing_count/data.shape[0]}, index=None)
    missing_count.sort_values('count', ascending=False, inplace=True, ignore_index=True)
    return missing_count

In [None]:
count_missing_values(train_data_raw).head()

In [27]:
def handle_missing_values(original_data, fill_options = None):
    '''
    Choose what to do with the missing values.
    fill_options is a dictionary whose keys are features and whose values say how to fill the missing data. The options are: 'unknown' (fill with the string 'unknown'), 'mean' (fill with the mean value), 'moda' (fill with the mode, i.e. the most frequent value).
    Rows that still contain missing values afterwards are dropped.
    '''

    data = original_data.copy()
    if fill_options is not None:
        for feature, action in fill_options.items():
            # Skip features that are not present (e.g. already dropped)
            if feature not in data.columns:
                continue
            # Assign the result back instead of using inplace=True on a column,
            # which may not update the DataFrame in recent pandas versions
            if action == 'unknown':
                data[feature] = data[feature].fillna('unknown')
            elif action == 'mean':
                data[feature] = data[feature].fillna(data[feature].mean())
            elif action == 'moda':
                top = data[feature].mode()[0]
                data[feature] = data[feature].fillna(top)

    data.dropna(inplace=True)
    return data


In [28]:
test = handle_missing_values(train_data_raw, {'Occupation': 'unknown'})
print(train_data_raw.shape)
print(test.shape)

(32560, 16)
(30168, 16)


### 3.2 Feature handling
Feature selection, new features, and normalization.

In [29]:
def prepare_data(train_raw, test_raw, fill_options = None, drop_columns = ['ID', 'Education']):
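    '''
    Prepare the data to be used in the classifier: drop unwanted columns, handle missing values, encode categorical features and labels, and standardize the features.
    Returns X_train, Y_train, X_test and the fitted label encoder.
    '''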

    train_data = train_raw.copy()
    test_data = test_raw.copy()

    # Remove unwanted columns
    if drop_columns is not None:
        train_data.drop(drop_columns, axis = 1, inplace=True)
        test_data.drop(drop_columns, axis = 1, inplace=True)

    # Handle the missing values
    train_data = handle_missing_values(train_data, fill_options)
    test_data = handle_missing_values(test_data, fill_options)

    categoric_columns = list(test_data.select_dtypes(exclude = np.number).columns)
    label_column = 'Income'

    # Encode the categorical features as numbers
    cat_encoder = prep.OrdinalEncoder()
    cat_encoder.fit(train_data[categoric_columns])
    train_data[categoric_columns] = cat_encoder.transform(train_data[categoric_columns])
    test_data[categoric_columns] = cat_encoder.transform(test_data[categoric_columns])

    # Encode the labels
    label_encoder = prep.LabelEncoder()
    Y_train = label_encoder.fit_transform(train_data[label_column])

    train_data.drop(label_column, axis = 1, inplace=True)

    # Make sure the test and train data have the same number of features
    assert train_data.shape[1] == test_data.shape[1]

    # Standardize the features using statistics computed on the training data only
    scaler = prep.StandardScaler()
    scaler.fit(train_data)
    X_train = scaler.transform(train_data)
    X_test = scaler.transform(test_data)

    return X_train, Y_train, X_test, label_encoder
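
One caveat with fitting the ordinal encoder on the training set only is that `transform` raises an error by default if the test set contains a category never seen during fitting. The sketch below shows one possible way to guard against that, assuming scikit-learn 0.24 or later, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'`. The tiny `train` and `test` frames and the choice of `-1` as the placeholder code are hypothetical, and this is not what `prepare_data` does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames; 'Armed-Forces' does not appear in the training data
train = pd.DataFrame({'Occupation': ['Sales', 'Tech-support', 'Sales']})
test = pd.DataFrame({'Occupation': ['Sales', 'Armed-Forces']})

# Map unseen categories to -1 instead of raising an error (assumed placeholder)
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train)

encoded_test = encoder.transform(test)
print(encoded_test)
```

Note that with a distance-based classifier such as KNN, any ordinal code imposes an arbitrary ordering on the categories, so the placeholder value is a pragmatic choice rather than a principled one.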

In [39]:
fill_options = {'Occupation': 'unknown', 'Workclass': 'unknown', 'Native Country': 'moda'}
drop_columns = ['ID', 'Final Weight', 'Workclass', 'Education', 'Native Country']
X_train, Y_train, X_test, label_encoder = prepare_data(train_data_raw, test_data_raw, fill_options, drop_columns)

In [31]:
print(f'X_train shape is {X_train.shape}')
print(f'Y_train shape is {Y_train.shape}')
print(f'X_test shape is {X_test.shape}')

X_train shape is (32560, 10)
Y_train shape is (32560,)
X_test shape is (16280, 10)


---

## 4 Model

In [36]:
metrics = ['manhattan'] #, 'chebyshev', 'minkowski']
cv_scores = {}

In [37]:
k_range = (30, 35)

In [38]:
for metric in metrics:
    for k_value in range(*k_range):
        knn_clf = KNeighborsClassifier(k_value, metric = metric)
        score = np.mean(cross_val_score(knn_clf, X_train, Y_train, cv=10))
        if metric not in cv_scores.keys():
            cv_scores[metric] = []
        cv_scores[metric].append(score)
        print(f'{metric} with {k_value} neighbors: {score}')

manhattan with 30 neighbors: 0.8429054054054055
manhattan with 31 neighbors: 0.8431511056511056
manhattan with 32 neighbors: 0.8429668304668304
manhattan with 33 neighbors: 0.8430896805896806
manhattan with 34 neighbors: 0.842936117936118


In [None]:
knn_results = pd.DataFrame(cv_scores, index=range(*k_range))
knn_results.plot(xlabel='Number of Neighbors', ylabel='CV score')

In [40]:
metric = 'manhattan'
k_value = 34
knn_clf = KNeighborsClassifier(k_value, metric = metric)
score = np.mean(cross_val_score(knn_clf, X_train, Y_train, cv=10))
print(f'{metric} with {k_value} neighbors: {score}')

manhattan with 34 neighbors: 0.8448402948402949
