# PMR3508 - Machine Learning and Pattern Recognition (2024)

## Second Programming Exercise

The goal of this exercise is to compare different *classifiers* and *regressors*, all evaluated on the same dataset, the Adult dataset.

Lucas Carvalho, 2024

### Environment Setup

In [17]:
# Core libraries
import pandas as pd
import matplotlib.pyplot as plt

# Models used
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Cross-validation utilities
from sklearn.model_selection import cross_val_score, cross_validate

### Data Import and Preprocessing

#### Import

First, the data are imported with more descriptive names for each feature, which makes the code easier to follow.

In [18]:
# Column names shared by both files; only the training file also has "Target"
featureColumns = [
    "Id",
    "Age",
    "Workclass",
    "Fnlwgt",
    "Education",
    "Education-num",
    "Marital-status",
    "Occupation",
    "Relationship",
    "Race",
    "Sex",
    "Capital-gain",
    "Capital-loss",
    "Hours-per-week",
    "Native-country",
]

trainData = pd.read_csv(
    "./input/train_data.csv",
    na_values="?",
    index_col=["Id"],
    header=0,
    names=featureColumns + ["Target"],
)

testData = pd.read_csv(
    "./input/test_data.csv",
    na_values="?",
    index_col=["Id"],
    header=0,
    names=featureColumns,
)
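
To see what the `na_values`, `header` and `names` arguments do together, here is a minimal sketch on a small in-memory CSV. The two sample rows and the shortened column list are invented for illustration; only the `pd.read_csv` arguments mirror the cell above.

```python
import io
import pandas as pd

# Hypothetical two-row file with its own header line and a "?" placeholder
raw = io.StringIO(
    "id,age,workclass\n"
    "1,34,Private\n"
    "2,58,?\n"
)

sample = pd.read_csv(
    raw,
    na_values="?",                     # treat "?" as a missing value
    index_col=["Id"],                  # refers to the new name, not the original "id"
    header=0,                          # the first line is a header and is discarded
    names=["Id", "Age", "Workclass"],  # replacement column names
)

print(sample)
print(sample["Workclass"].isna().sum())
```

Because `header=0` is combined with `names`, the header line in the file is replaced rather than read as data.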

#### Preprocessing

Since the goal is to compare classifiers, the same data preprocessing used previously for KNN is applied here. Note that different encoders are applied, namely dummy (one-hot) variables and binary 0/1 mappings.

In [19]:
# Filling missing values in 'Workclass' and mapping it to a binary flag
trainData["Workclass"] = trainData["Workclass"].fillna("Private")
trainData["Workclass"] = trainData["Workclass"].apply(lambda x: 1 if x == "Private" else 0)
trainData.rename(columns={"Workclass": "isPrivate"}, inplace=True)

# Dropping rows with missing 'Occupation' and creating dummy variables
trainData = trainData.dropna(subset=["Occupation"])
trainData = pd.concat([trainData, pd.get_dummies(trainData["Occupation"], prefix="Occ")], axis=1)
trainData = trainData.drop("Occupation", axis=1)

# Mapping 'Native-country' to a binary flag
trainData["Native-country"] = trainData["Native-country"].apply(lambda x: 1 if x == "United-States" else 0)
trainData.rename(columns={"Native-country": "isFromUS"}, inplace=True)

# Creating dummy variables for 'Marital-status', 'Relationship', 'Race'
trainData = pd.concat([trainData, pd.get_dummies(trainData["Marital-status"], prefix="Mar")], axis=1)
trainData = trainData.drop("Marital-status", axis=1)

trainData = pd.concat([trainData, pd.get_dummies(trainData["Relationship"], prefix="Rel")], axis=1)
trainData = trainData.drop("Relationship", axis=1)

trainData = pd.concat([trainData, pd.get_dummies(trainData["Race"], prefix="Race")], axis=1)
trainData = trainData.drop("Race", axis=1)

# Mapping 'Sex' to a binary flag
trainData["Sex"] = trainData["Sex"].apply(lambda x: 1.0 if x == "Male" else 0.0)
trainData.rename(columns={"Sex": "isMale"}, inplace=True)

# Mapping 'Target' to 1.0 (">50K") or 0.0
trainData["Target"] = trainData["Target"].apply(lambda x: 1.0 if x == ">50K" else 0.0)

# One-hot encoding the remaining text column ('Education'), dropping its first level
trainData_dummies = pd.get_dummies(trainData, drop_first=True)
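
The `concat` / `get_dummies` / `drop` pattern repeated above can also be expressed in a single call through the `columns` and `prefix` parameters of `pd.get_dummies`. The sketch below uses a made-up three-row frame (not the Adult data) to compare both forms; it is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

# Toy data standing in for two of the categorical columns
toy = pd.DataFrame({
    "Age": [34, 58, 25],
    "Race": ["White", "Black", "White"],
    "Relationship": ["Husband", "Wife", "Own-child"],
})

# Step-by-step form, as in the cell above
manual = toy.copy()
for column, prefix in [("Race", "Race"), ("Relationship", "Rel")]:
    manual = pd.concat([manual, pd.get_dummies(manual[column], prefix=prefix)], axis=1)
    manual = manual.drop(column, axis=1)

# Single-call form: the encoded columns are dropped and their dummies appended
compact = pd.get_dummies(
    toy,
    columns=["Race", "Relationship"],
    prefix={"Race": "Race", "Relationship": "Rel"},
)

print(compact.columns.tolist())
```

Both forms produce the same columns, so the choice is mostly a matter of readability.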

The resulting DataFrame after the operations above:

In [20]:
trainData

| Id | Age | isPrivate | Fnlwgt | Education | Education-num | isMale | Capital-gain | Capital-loss | Hours-per-week | isFromUS | ... | Rel_Not-in-family | Rel_Other-relative | Rel_Own-child | Rel_Unmarried | Rel_Wife | Race_Amer-Indian-Eskimo | Race_Asian-Pac-Islander | Race_Black | Race_Other | Race_White |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16280 | 34 | 1 | 204991 | Some-college | 10 | 1.0 | 0 | 0 | 44 | 1 | ... | False | False | True | False | False | False | False | False | False | True |
| 16281 | 58 | 0 | 310085 | 10th | 6 | 1.0 | 0 | 0 | 40 | 1 | ... | False | False | False | False | False | False | False | False | False | True |
| 16282 | 25 | 1 | 146117 | Some-college | 10 | 1.0 | 0 | 0 | 42 | 1 | ... | True | False | False | False | False | False | False | False | False | True |
| 16283 | 24 | 1 | 138938 | Some-college | 10 | 0.0 | 0 | 0 | 40 | 1 | ... | True | False | False | False | False | False | False | False | False | True |
| 16284 | 57 | 0 | 258883 | HS-grad | 9 | 1.0 | 5178 | 0 | 60 | 0 | ... | False | False | False | False | False | False | False | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48835 | 42 | 1 | 384236 | Masters | 14 | 1.0 | 7688 | 0 | 40 | 1 | ... | False | False | False | False | False | False | False | False | False | True |
| 48836 | 23 | 1 | 129042 | HS-grad | 9 | 0.0 | 0 | 0 | 40 | 1 | ... | False | False | False | True | False | False | False | True | False | False |
| 48837 | 30 | 1 | 195488 | HS-grad | 9 | 0.0 | 0 | 0 | 40 | 0 | ... | False | False | True | False | False | False | False | False | False | True |
| 48838 | 18 | 1 | 27620 | HS-grad | 9 | 0.0 | 0 | 0 | 25 | 1 | ... | True | False | False | False | False | False | False | False | False | True |


### Cross-Validation Model

In [21]:
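# Defining parameters
cvFolds = 5                  # number of cross-validation folds
minCorr = 0.1                # features with |correlation to "Target"| <= minCorr are removed
scores = ("accuracy", "f1")  # evaluation metrics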

# Encoding the remaining text columns and computing pairwise correlations
trainData_dummies = pd.get_dummies(trainData, drop_first=True)
correlation = trainData_dummies.corr()
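
`cvFolds` and `scores` are presumably meant for the `cross_validate` function imported at the top; the cells that use them are not shown in this part, so that is an assumption. As a minimal sketch of how such a call behaves, the example below runs a KNN classifier on synthetic data from `make_classification` (not the Adult data) and requests both metrics at once.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

cvFolds = 5
scores = ("accuracy", "f1")

# Synthetic binary problem, used only to keep the example self-contained
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

results = cross_validate(
    KNeighborsClassifier(n_neighbors=5),
    X,
    y,
    cv=cvFolds,
    scoring=scores,
)

# One value per fold is stored under "test_<metric>" for each requested metric
for metric in scores:
    print(metric, results[f"test_{metric}"].mean())
```

Unlike `cross_val_score`, which returns a single array for one metric, `cross_validate` returns a dictionary that also includes fit and score times.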

In [24]:
# Identifying columns to remove, based on their correlation with "Target"
toRemove = correlation["Target"].where(correlation["Target"].abs() <= minCorr).dropna().index

# Separating features (X) and target (Y)
trainY = trainData["Target"]
trainX = trainData.drop("Target", axis=1)

# Making sure only columns that exist in trainX are removed
# (names in toRemove come from trainData_dummies, so some may be absent here)
toRemove_filtered = [col for col in toRemove if col in trainX.columns]

# Removing the filtered columns
trainX.drop(toRemove_filtered, axis=1, inplace=True)
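
The selection rule above keeps a feature only when the absolute value of its Pearson correlation with `Target` exceeds `minCorr`. The following sketch applies the same rule to a small invented frame, where `weak` is constructed to have zero correlation with the target; boolean indexing is used in place of `where(...).dropna()`, which selects the same rows.

```python
import pandas as pd

minCorr = 0.1

# Invented data: "strong" grows with the target, "weak" is unrelated to it
toy = pd.DataFrame({
    "strong": [0, 1, 2, 3, 4, 5, 6, 7],
    "weak":   [1, 0, 0, 1, 1, 0, 0, 1],
    "Target": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Correlation of every column with the target
targetCorr = toy.corr()["Target"]

# Columns whose absolute correlation does not exceed the threshold
toRemove = targetCorr[targetCorr.abs() <= minCorr].index

toyX = toy.drop("Target", axis=1).drop(toRemove, axis=1)
print(toyX.columns.tolist())
```

Note that Pearson correlation only captures linear association, so a feature removed by this rule may still be informative for a non-linear model.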

In [25]:
trainX

| Id | Age | isPrivate | Education | Education-num | isMale | Capital-gain | Capital-loss | Hours-per-week | Occ_Exec-managerial | Occ_Other-service | Occ_Prof-specialty | Mar_Divorced | Mar_Married-civ-spouse | Mar_Never-married | Rel_Husband | Rel_Not-in-family | Rel_Own-child | Rel_Unmarried | Rel_Wife |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16280 | 34 | 1 | Some-college | 10 | 1.0 | 0 | 0 | 44 | True | False | False | True | False | False | False | False | True | False | False |
| 16281 | 58 | 0 | 10th | 6 | 1.0 | 0 | 0 | 40 | False | False | False | False | True | False | True | False | False | False | False |
| 16282 | 25 | 1 | Some-college | 10 | 1.0 | 0 | 0 | 42 | False | False | False | False | False | True | False | True | False | False | False |
| 16283 | 24 | 1 | Some-college | 10 | 0.0 | 0 | 0 | 40 | False | False | False | True | False | False | False | True | False | False | False |
| 16284 | 57 | 0 | HS-grad | 9 | 1.0 | 5178 | 0 | 60 | False | False | False | False | True | False | True | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48835 | 42 | 1 | Masters | 14 | 1.0 | 7688 | 0 | 40 | False | False | True | False | True | False | True | False | False | False | False |
| 48836 | 23 | 1 | HS-grad | 9 | 0.0 | 0 | 0 | 40 | False | False | False | False | False | True | False | False | False | True | False |
| 48837 | 30 | 1 | HS-grad | 9 | 0.0 | 0 | 0 | 40 | False | False | False | False | False | True | False | False | True | False | False |
| 48838 | 18 | 1 | HS-grad | 9 | 0.0 | 0 | 0 | 25 | False | False | False | False | False | True | False | True | False | False | False |
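
The preview shows that `Education` is still stored as text in `trainX`, since its dummy-encoded columns exist only in `trainData_dummies`. scikit-learn estimators generally expect numeric input, so a check like the one below may be useful before fitting. It is a sketch on a made-up frame; dropping `Education` in favour of `Education-num` is a suggestion that assumes both columns carry the same information, and is not a step taken in the cells shown here.

```python
import pandas as pd

# Made-up rows mimicking the column types seen in the preview above
features = pd.DataFrame({
    "Age": [34, 58],
    "Education": ["Some-college", "10th"],
    "Education-num": [10, 6],
    "isMale": [1.0, 1.0],
    "Rel_Wife": [False, False],
})

# Columns that are neither numeric nor boolean cannot be passed to an estimator as-is
nonNumeric = features.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(nonNumeric)

# One possible fix (an assumption, see above): drop the text columns
numericFeatures = features.drop(columns=nonNumeric)
print(numericFeatures.dtypes)
```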


### KNN