# PMR3508 - Machine Learning and Pattern Recognition (2024)

## Third Programming Exercise

The goal of this exercise is to compare different *classifiers* and *regressors*, all evaluated on the same dataset, the Adult dataset.

Lucas Carvalho, 2024

### Environment Setup

In [152]:
# Core libraries
import pandas as pd
import matplotlib.pyplot as plt

# Models used
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Evaluation
from sklearn.metrics import accuracy_score

## Data Import and Preprocessing

### Import

First, the data are imported with clearer names for each feature, which makes the code easier to follow.

In [153]:
trainData = pd.read_csv(
    "./input/train_data.csv",
    na_values= '?',
    index_col= ['Id'],
    header= 0,
    names= [
        "Id",
        "Age",
        "Workclass",
        "Fnlwgt",
        "Education",
        "Education-num",
        "Marital-status",
        "Occupation",
        "Relationship",
        "Race",
        "Sex",
        "Capital-gain",
        "Capital-loss",
        "Hours-per-week",
        "Native-country",
        "Target"
    ]
)

testData = pd.read_csv(
    "./input/test_data.csv",
    na_values= '?',
    index_col= ['Id'],
    header= 0,
    names= [
        "Id",
        "Age",
        "Workclass",
        "Fnlwgt",
        "Education",
        "Education-num",
        "Marital-status",
        "Occupation",
        "Relationship",
        "Race",
        "Sex",
        "Capital-gain",
        "Capital-loss",
        "Hours-per-week",
        "Native-country"
    ]
)
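Both `read_csv` calls above repeat the same column list. A minimal sketch of sharing it follows, assuming the test file has the same columns minus `Target`. The `COLUMNS` constant, the `load_adult` helper, and the one-row in-memory CSV (including its header names and values) are hypothetical and exist only for this illustration.

```python
import io

import pandas as pd

# Column names shared by both files (same names as in the cell above)
COLUMNS = [
    "Id", "Age", "Workclass", "Fnlwgt", "Education", "Education-num",
    "Marital-status", "Occupation", "Relationship", "Race", "Sex",
    "Capital-gain", "Capital-loss", "Hours-per-week", "Native-country",
    "Target",
]


def load_adult(source, has_target=True):
    """Hypothetical helper: read an Adult-style CSV with the renamed columns."""
    names = COLUMNS if has_target else COLUMNS[:-1]
    return pd.read_csv(source, na_values="?", index_col="Id", header=0, names=names)


# Made-up one-row file standing in for ./input/train_data.csv
sample_csv = io.StringIO(
    "Id,age,workclass,fnlwgt,education,education.num,marital.status,occupation,"
    "relationship,race,sex,capital.gain,capital.loss,hours.per.week,"
    "native.country,income\n"
    "16280,34,Private,204991,Some-college,10,Divorced,?,Own-child,White,Male,"
    "0,0,44,United-States,<=50K\n"
)
demo = load_adult(sample_csv)
print(demo.isna().sum())  # the '?' in Occupation is counted as missing
```

With such a helper, the test set could be loaded with `has_target=False`, keeping the two calls consistent.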

### Preprocessing

Since the goal is to compare classifiers, the same preprocessing used earlier for KNN is applied here. It combines different encodings: one-hot (dummy) variables for some categorical columns and binary mappings for others.

In [154]:
# Filling missing values in 'Workclass' and mapping it to a binary flag
trainData["Workclass"] = trainData["Workclass"].fillna("Private")
trainData["Workclass"] = trainData["Workclass"].apply(lambda x: 1 if x == "Private" else 0)
trainData.rename(columns={"Workclass": "isPrivate"}, inplace=True)

# Dropping rows with missing 'Occupation' and creating dummy variables
trainData = trainData.dropna(subset=["Occupation"])
trainData = pd.concat([trainData, pd.get_dummies(trainData["Occupation"], prefix="Occ")], axis=1)
trainData = trainData.drop("Occupation", axis=1)

# Mapping 'Native-country' to a binary flag
trainData["Native-country"] = trainData["Native-country"].apply(lambda x: 1 if x == "United-States" else 0)
trainData.rename(columns={"Native-country": "isFromUS"}, inplace=True)

# Creating dummy variables for 'Marital-status', 'Relationship', 'Race'
trainData = pd.concat([trainData, pd.get_dummies(trainData["Marital-status"], prefix="Mar")], axis=1)
trainData = trainData.drop("Marital-status", axis=1)

trainData = pd.concat([trainData, pd.get_dummies(trainData["Relationship"], prefix="Rel")], axis=1)
trainData = trainData.drop("Relationship", axis=1)

trainData = pd.concat([trainData, pd.get_dummies(trainData["Race"], prefix="Race")], axis=1)
trainData = trainData.drop("Race", axis=1)

# Mapping 'Sex' to a binary flag
trainData["Sex"] = trainData["Sex"].apply(lambda x: 1.0 if x == "Male" else 0.0)
trainData.rename(columns={"Sex": "isMale"}, inplace=True)

# Mapping 'Target' to 1.0 (>50K) or 0.0
trainData["Target"] = trainData["Target"].apply(lambda x: 1.0 if x == ">50K" else 0.0)

trainData_dummies = pd.get_dummies(trainData, drop_first=True)

The same steps are applied to the test set:

In [155]:
# Applying the same filling and mapping steps to testData
testData["Workclass"] = testData["Workclass"].fillna("Private")
testData["Workclass"] = testData["Workclass"].apply(lambda x: 1 if x == "Private" else 0)
testData.rename(columns={"Workclass": "isPrivate"}, inplace=True)

testData = testData.dropna(subset=["Occupation"])
testData = pd.concat([testData, pd.get_dummies(testData["Occupation"], prefix="Occ")], axis=1)
testData = testData.drop("Occupation", axis=1)

testData["Native-country"] = testData["Native-country"].apply(lambda x: 1 if x == "United-States" else 0)
testData.rename(columns={"Native-country": "isFromUS"}, inplace=True)

testData = pd.concat([testData, pd.get_dummies(testData["Marital-status"], prefix="Mar")], axis=1)
testData = testData.drop("Marital-status", axis=1)

testData = pd.concat([testData, pd.get_dummies(testData["Relationship"], prefix="Rel")], axis=1)
testData = testData.drop("Relationship", axis=1)

testData = pd.concat([testData, pd.get_dummies(testData["Race"], prefix="Race")], axis=1)
testData = testData.drop("Race", axis=1)

testData["Sex"] = testData["Sex"].apply(lambda x: 1.0 if x == "Male" else 0.0)
testData.rename(columns={"Sex": "isMale"}, inplace=True)
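The train and test cells above repeat the same encoding steps. As a minimal sketch, those steps could be factored into one function and applied to both frames. The `preprocess` helper and the three-row demo frame are hypothetical, and the resulting column order may differ slightly from the notebook's.

```python
import pandas as pd


def preprocess(df):
    """Hypothetical helper: apply the encoding steps above to a copy of df."""
    df = df.copy()
    # Binary mappings (missing Workclass is treated as "Private", as above)
    df["Workclass"] = (df["Workclass"].fillna("Private") == "Private").astype(int)
    df["Native-country"] = (df["Native-country"] == "United-States").astype(int)
    df["Sex"] = (df["Sex"] == "Male").astype(float)
    df = df.rename(
        columns={"Workclass": "isPrivate", "Native-country": "isFromUS", "Sex": "isMale"}
    )
    # Rows without an occupation are dropped, then the categoricals are one-hot encoded
    df = df.dropna(subset=["Occupation"])
    prefixes = {"Occupation": "Occ", "Marital-status": "Mar", "Relationship": "Rel", "Race": "Race"}
    df = pd.get_dummies(df, columns=list(prefixes), prefix=prefixes)
    # The test set has no Target column, so this step is optional
    if "Target" in df.columns:
        df["Target"] = (df["Target"] == ">50K").astype(float)
    return df


# Made-up rows, only to exercise the function
demo = pd.DataFrame({
    "Workclass": ["Private", None, "State-gov"],
    "Occupation": ["Sales", "Tech-support", None],
    "Marital-status": ["Divorced", "Never-married", "Divorced"],
    "Relationship": ["Own-child", "Not-in-family", "Unmarried"],
    "Race": ["White", "Black", "White"],
    "Sex": ["Male", "Female", "Male"],
    "Native-country": ["United-States", "Mexico", None],
    "Target": [">50K", "<=50K", "<=50K"],
})
encoded = preprocess(demo)
print(encoded.columns.tolist())
```

Calling the same function on both frames keeps the two encodings from drifting apart when one cell is edited and the other is not.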

The resulting DataFrames after the operations above:

In [156]:
trainData

| Id | Age | isPrivate | Fnlwgt | Education | Education-num | isMale | Capital-gain | Capital-loss | Hours-per-week | isFromUS | ... | Rel_Not-in-family | Rel_Other-relative | Rel_Own-child | Rel_Unmarried | Rel_Wife | Race_Amer-Indian-Eskimo | Race_Asian-Pac-Islander | Race_Black | Race_Other | Race_White |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16280 | 34 | 1 | 204991 | Some-college | 10 | 1.0 | 0 | 0 | 44 | 1 | ... | False | False | True | False | False | False | False | False | False | True |
| 16281 | 58 | 0 | 310085 | 10th | 6 | 1.0 | 0 | 0 | 40 | 1 | ... | False | False | False | False | False | False | False | False | False | True |
| 16282 | 25 | 1 | 146117 | Some-college | 10 | 1.0 | 0 | 0 | 42 | 1 | ... | True | False | False | False | False | False | False | False | False | True |
| 16283 | 24 | 1 | 138938 | Some-college | 10 | 0.0 | 0 | 0 | 40 | 1 | ... | True | False | False | False | False | False | False | False | False | True |
| 16284 | 57 | 0 | 258883 | HS-grad | 9 | 1.0 | 5178 | 0 | 60 | 0 | ... | False | False | False | False | False | False | False | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48835 | 42 | 1 | 384236 | Masters | 14 | 1.0 | 7688 | 0 | 40 | 1 | ... | False | False | False | False | False | False | False | False | False | True |
| 48836 | 23 | 1 | 129042 | HS-grad | 9 | 0.0 | 0 | 0 | 40 | 1 | ... | False | False | False | True | False | False | False | True | False | False |
| 48837 | 30 | 1 | 195488 | HS-grad | 9 | 0.0 | 0 | 0 | 40 | 0 | ... | False | False | True | False | False | False | False | False | False | True |
| 48838 | 18 | 1 | 27620 | HS-grad | 9 | 0.0 | 0 | 0 | 25 | 1 | ... | True | False | False | False | False | False | False | False | False | True |


In [157]:
testData

| Id | Age | isPrivate | Fnlwgt | Education | Education-num | isMale | Capital-gain | Capital-loss | Hours-per-week | isFromUS | ... | Rel_Not-in-family | Rel_Other-relative | Rel_Own-child | Rel_Unmarried | Rel_Wife | Race_Amer-Indian-Eskimo | Race_Asian-Pac-Islander | Race_Black | Race_Other | Race_White |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 25 | 1 | 120596 | Bachelors | 13 | 1.0 | 0 | 0 | 44 | 1 | ... | True | False | False | False | False | False | False | False | False | True |
| 1 | 64 | 0 | 152537 | Bachelors | 13 | 1.0 | 0 | 0 | 45 | 1 | ... | False | False | False | False | False | False | False | False | False | True |
| 2 | 31 | 1 | 100135 | Masters | 14 | 0.0 | 0 | 0 | 40 | 1 | ... | True | False | False | False | False | False | False | False | False | True |
| 3 | 45 | 1 | 189123 | HS-grad | 9 | 1.0 | 0 | 0 | 40 | 1 | ... | False | False | True | False | False | False | False | False | False | True |
| 4 | 64 | 0 | 487751 | Bachelors | 13 | 1.0 | 0 | 0 | 50 | 1 | ... | False | False | False | False | False | False | False | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16275 | 40 | 1 | 168113 | HS-grad | 9 | 1.0 | 0 | 0 | 40 | 1 | ... | False | False | False | False | False | False | False | False | False | True |
| 16276 | 30 | 0 | 327203 | HS-grad | 9 | 1.0 | 0 | 0 | 40 | 1 | ... | False | False | False | False | False | False | False | False | False | True |
| 16277 | 25 | 1 | 116358 | HS-grad | 9 | 1.0 | 0 | 0 | 40 | 0 | ... | False | False | True | False | False | False | True | False | False | False |
| 16278 | 60 | 1 | 39263 | Masters | 14 | 0.0 | 3325 | 0 | 35 | 1 | ... | True | False | False | False | False | False | False | False | False | True |


#### Cross-Validation Model

In [158]:
# Defining parameters
cvFolds = 5
minCorr = 0.1
scores = ("accuracy", "f1")

# Converting trainData to dummies and separating features and target
trainData_dummies = pd.get_dummies(trainData, drop_first=True)
trainY = trainData_dummies["Target"]
trainX = trainData_dummies.drop("Target", axis=1)

# Identifying columns to remove, based on their correlation with "Target"
correlation = trainX.corrwith(trainY)
toRemove = correlation.where(correlation.abs() <= minCorr).dropna().index

# Removing the filtered columns
trainX.drop(toRemove, axis=1, inplace=True)
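To make the correlation filter concrete, here is a small worked sketch on made-up data. The column names `strong` and `weak` are hypothetical, and a boolean mask is used in place of `where(...).dropna()`, which selects the same labels.

```python
import pandas as pd

# Made-up target and two features: one tracks the target, one is uncorrelated
target = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)
features = pd.DataFrame({
    "strong": [1, 5, 2, 6, 1, 7, 2, 6],
    "weak": [0, 0, 1, 1, 0, 0, 1, 1],
})

min_corr = 0.1
correlation = features.corrwith(target)

# Columns whose absolute Pearson correlation with the target is at most min_corr
to_remove = correlation[correlation.abs() <= min_corr].index
selected = features.drop(columns=to_remove)

print(correlation)
print(selected.columns.tolist())
```

Only the feature that tracks the target survives the filter. Note that a Pearson threshold captures linear association only, so a column with a purely non-linear relationship to the target could be discarded as well.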


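For the test set, only the column alignment and the feature/target split are performed. The test file has no `Target` column, so `reindex(..., fill_value=0)` creates one filled with zeros. `testY` is therefore all zeros, and the accuracies printed in the sections below measure the fraction of test rows predicted as class 0 rather than agreement with true labels.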

In [159]:
# Creating dummies and aligning the columns with trainData
testData_dummies = pd.get_dummies(testData, drop_first=True)
testData_dummies = testData_dummies.reindex(columns=trainData_dummies.columns, fill_value=0)

# Separating features (X) and target (Y) in both datasets
# Note: trainX is rebuilt from trainData_dummies here, so the correlation-based
# column removal from the previous cell is not applied to the models below
trainX = trainData_dummies.drop("Target", axis=1)
trainY = trainData_dummies["Target"]
testX = testData_dummies.drop("Target", axis=1)
testY = testData_dummies["Target"]  # all zeros: filled in by reindex, not true labels
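The `cvFolds` and `scores` parameters defined earlier suggest a cross-validated evaluation on the labeled training data. A minimal sketch of that idea with scikit-learn's `cross_validate` follows. The synthetic data from `make_classification` is a stand-in for `trainX`/`trainY`, so the printed numbers say nothing about the Adult dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for trainX / trainY (assumption: binary labels)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

cv_folds = 5
scores = ("accuracy", "f1")

# Each fold is held out once while the model is fit on the remaining folds
results = cross_validate(
    KNeighborsClassifier(n_neighbors=17), X, y, cv=cv_folds, scoring=scores
)

for name in scores:
    print(name, results[f"test_{name}"].mean())
```

Because every score is computed on rows with known labels, this kind of estimate does not depend on the unlabeled test file.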

## KNN 

In [160]:
knn_classifier = KNeighborsClassifier(n_neighbors=17) 
knn_classifier.fit(trainX, trainY)

In [161]:
predictions = knn_classifier.predict(testX)
accuracy = accuracy_score(testY, predictions)

print(f"Accuracy: {accuracy}")

Accuracy: 0.929084497845109
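The value `n_neighbors=17` is fixed in the cell above. As a hedged sketch, the number of neighbors could instead be chosen by cross-validation, with scaling inside the pipeline because KNN is distance-based and a wide-range column such as `Fnlwgt` would otherwise dominate the distances. The candidate values and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for trainX / trainY; one column is inflated to mimic Fnlwgt
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1e5

# The scaler is fit inside each fold, so no information leaks from held-out rows
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

candidates = [5, 11, 17, 23]  # hypothetical grid of neighbor counts
search = GridSearchCV(
    pipeline,
    {"kneighborsclassifier__n_neighbors": candidates},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```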


## Neural Network - MLP

An MLP (multilayer perceptron) architecture is used to apply a neural network to this classification problem.

In [162]:
mlp = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=42, activation='relu')

In [163]:
# Training the neural network on the training data
mlp.fit(trainX, trainY)

In [164]:
predictions = mlp.predict(testX)
accuracy = accuracy_score(testY, predictions)

print(f"Accuracy: {accuracy}")

Accuracy: 0.886900875016325


## Logistic Regression

In [165]:
# Training the Logistic Regression model
log_reg = LogisticRegression(max_iter=300, random_state=42)
log_reg.fit(trainX, trainY)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
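The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling option wraps the same estimator in a pipeline with `StandardScaler`. The synthetic data, including the inflated column that mimics `Fnlwgt`, is an assumption standing in for `trainX`/`trainY`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for trainX / trainY; one column is inflated to mimic Fnlwgt
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1e5

# Same estimator settings as the cell above, preceded by standardization
scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=300, random_state=42),
)
scaled_log_reg.fit(X, y)

# n_iter_ reports how many solver iterations were actually used
print(scaled_log_reg[-1].n_iter_)
```

If the printed iteration count stays below `max_iter`, the solver converged before hitting the limit.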


In [166]:
predictions = log_reg.predict(testX)
accuracy = accuracy_score(testY, predictions)

print(f"Accuracy: {accuracy}")

Accuracy: 0.811283792608071


## SVM

In [167]:
# Training the SVM model
svm_classifier = SVC(kernel='rbf')  # 'linear', 'rbf', 'poly', etc.
svm_classifier.fit(trainX, trainY)

In [168]:
predictions = svm_classifier.predict(testX)
accuracy = accuracy_score(testY, predictions)

print(f"Accuracy: {accuracy}")

Accuracy: 0.9608201645553088


## Random Forest

In [169]:
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(trainX, trainY)

In [170]:
predictions = rf_classifier.predict(testX)
accuracy = accuracy_score(testY, predictions)

print(f"Accuracy: {accuracy}")

Accuracy: 0.7919550737886901


## Boosting

In [171]:
gb_classifier = GradientBoostingClassifier(random_state=42)
gb_classifier.fit(trainX, trainY)

In [172]:
predictions = gb_classifier.predict(testX)
accuracy = accuracy_score(testY, predictions)

print(f"Accuracy: {accuracy}")

Accuracy: 0.8087371033041661
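Since the exercise compares classifiers, the per-model cells can be gathered into one loop that scores every model the same way. The sketch below uses the same estimator settings as the cells above, but on synthetic data and with 5-fold cross-validated accuracy, both of which are assumptions. Its output is not a result for the Adult dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for trainX / trainY
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=17),
    "MLP": MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=300, random_state=42),
    "SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Mean cross-validated accuracy per model, computed on identical folds
summary = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(summary.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```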
