# Classification with the Adult Income Dataset
In this notebook, we predict whether a person's income exceeds USD 50,000 per year, based on demographic and employment characteristics.

**Models compared:**
- Decision Tree
- Logistic Regression
- KNN

**Steps:**
- Data loading and cleaning
- Preprocessing (one-hot encoding and standardization)
- Model training
- Evaluation with accuracy and F1-score
- Decision tree tuning (grid search and cost-complexity pruning)

# Dataset: Adult Income

The **Adult Income** dataset (also known as **Census Income**) is used to predict whether a person earns more than **\$50,000** per year based on demographic and employment attributes.

---

## Variables

Column names are written as they appear in the loaded CSV, which uses dots (e.g. `education.num`).

- **age**: Age of the individual.
- **workclass**: Type of employer (e.g. Private, Self-emp, Government).
- **fnlwgt**: Sampling weight (not an informative feature for prediction).
- **education**: Education level (e.g. Bachelors, HS-grad).
- **education.num**: Education level encoded as an integer (numeric version of `education`).
- **marital.status**: Marital status (e.g. Widowed, Divorced).
- **occupation**: Occupation (e.g. Tech-support, Craft-repair).
- **relationship**: Family relationship (e.g. Husband, Not-in-family).
- **race**: Race (e.g. White, Black, Asian-Pac-Islander).
- **sex**: Sex (Male/Female).
- **capital.gain**: Capital gain.
- **capital.loss**: Capital loss.
- **hours.per.week**: Number of hours worked per week.
- **native.country**: Country of origin (e.g. United-States, Mexico).

---

## Target variable (`y`)

- **income**: Income bracket (binary):
  - `<=50K`
  - `>50K`

---

In [25]:
import kagglehub
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier, plot_tree

In [2]:
# Download latest version
path = kagglehub.dataset_download("uciml/adult-census-income")

df = pd.read_csv(path + '/adult.csv')
df.head()



| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


In [3]:
# Treat '?' as missing, drop incomplete rows, and encode the target as 0/1
df.replace('?', np.nan, inplace=True)
df.dropna(inplace=True)
df['income'] = df['income'].map({'>50K': 1, '<=50K': 0})
# Check the remaining row count and the column types
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30162 entries, 1 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             30162 non-null  int64 
 1   workclass       30162 non-null  object
 2   fnlwgt          30162 non-null  int64 
 3   education       30162 non-null  object
 4   education.num   30162 non-null  int64 
 5   marital.status  30162 non-null  object
 6   occupation      30162 non-null  object
 7   relationship    30162 non-null  object
 8   race            30162 non-null  object
 9   sex             30162 non-null  object
 10  capital.gain    30162 non-null  int64 
 11  capital.loss    30162 non-null  int64 
 12  hours.per.week  30162 non-null  int64 
 13  native.country  30162 non-null  object
 14  income          30162 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 3.7+ MB
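
Before dropping rows, it can help to see how many `?` placeholders each column holds and what share of rows `dropna()` removes. The following is a minimal sketch on a small hypothetical frame; the `toy` data and the names `missing_per_column`, `rows_kept` and `share_dropped` are invented for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the '?' placeholder used in adult.csv
toy = pd.DataFrame({
    "age": [90, 82, 66, 54],
    "workclass": ["?", "Private", "?", "Private"],
    "occupation": ["?", "Exec-managerial", "?", "Machine-op-inspct"],
})

# Treat '?' as missing, then count missing values per column
toy_clean = toy.replace("?", np.nan)
missing_per_column = toy_clean.isna().sum()
print(missing_per_column)

# Rows that survive dropna() and the share of rows lost
rows_kept = len(toy_clean.dropna())
share_dropped = 1 - rows_kept / len(toy)
print(f"kept {rows_kept} of {len(toy)} rows ({share_dropped:.0%} dropped)")
```

Running the same two checks on `df` before `dropna()` would show which columns are responsible for the rows removed above.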


In [4]:
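# Split features (X) and target (y)
# Note: fnlwgt stays in X here, although it is described above as uninformative for prediction
X = df.drop(columns='income')
y = df['income']

# One-hot encode the categorical variables
X_encoded = pd.get_dummies(X, drop_first=True)

# Stratified 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize: fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)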
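
The cell above one-hot encodes the full frame and then standardizes every column, including the 0/1 dummy columns. One alternative, shown here only as a sketch on hypothetical toy data, is to wrap encoding and scaling in a `ColumnTransformer` inside a `Pipeline`. Both steps are then fitted on the training split only, and scaling applies only to the numeric columns. The frame `toy` and the names `numeric_cols`, `categorical_cols`, `preprocess` and `clf` are assumptions made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with the same kinds of columns as the notebook
toy = pd.DataFrame({
    "age": [25, 38, 52, 46, 31, 60, 29, 41],
    "hours.per.week": [40, 50, 45, 60, 35, 20, 40, 55],
    "workclass": ["Private", "Private", "Self-emp-inc", "Private",
                  "Local-gov", "Private", "Local-gov", "Self-emp-inc"],
    "income": [0, 1, 1, 1, 0, 0, 0, 1],
})
X_toy = toy.drop(columns="income")
y_toy = toy["income"]

numeric_cols = ["age", "hours.per.week"]
categorical_cols = ["workclass"]

# Scale numeric columns, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Preprocessing and model travel together, so cross-validation refits both per fold
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
clf.fit(X_tr, y_tr)
toy_predictions = clf.predict(X_te)
print(toy_predictions)
```

With `handle_unknown="ignore"`, a category that appears only in the test split is encoded as all zeros instead of raising an error.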

In [5]:
# Train the models
model_dt = DecisionTreeClassifier(random_state=42)
model_knn = KNeighborsClassifier()
model_lr = LogisticRegression(max_iter=1000)

model_dt.fit(X_train_scaled, y_train)
model_knn.fit(X_train_scaled, y_train)
model_lr.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred_dt = model_dt.predict(X_test_scaled)
y_pred_knn = model_knn.predict(X_test_scaled)
y_pred_lr = model_lr.predict(X_test_scaled)

# Evaluate
results = pd.DataFrame({
    'Model': ['Decision Tree', 'KNN', 'Logistic Regression'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_dt),
        accuracy_score(y_test, y_pred_knn),
        accuracy_score(y_test, y_pred_lr)
    ],
    'F1-Score': [
        f1_score(y_test, y_pred_dt),
        f1_score(y_test, y_pred_knn),
        f1_score(y_test, y_pred_lr)
    ]
})
results

| | Model | Accuracy | F1-Score |
|---|---|---|---|
| 0 | Decision Tree | 0.819538 | 0.637192 |
| 1 | KNN | 0.820643 | 0.620173 |
| 2 | Logistic Regression | 0.849597 | 0.66813 |
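
In the table above, F1 is lower than accuracy for every model. With its default settings, `f1_score` scores only the positive class (here `>50K`, mapped to 1) and ignores true negatives, while accuracy counts every correct prediction. The sketch below uses hypothetical labels (`y_true_toy`, `y_pred_toy`) to show how the two metrics diverge and how the confusion matrix explains the gap.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 6 negatives (0) and 4 positives (1)
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)  # TN=5, FP=1, FN=2, TP=2

acc = accuracy_score(y_true_toy, y_pred_toy)    # (TN + TP) / total = 7/10
prec = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP) = 2/3
rec = recall_score(y_true_toy, y_pred_toy)      # TP / (TP + FN) = 2/4
f1 = f1_score(y_true_toy, y_pred_toy)           # 2TP / (2TP + FP + FN) = 4/7

print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Here the five true negatives lift accuracy to 0.70, while F1 stays at about 0.57 because half of the positives are missed.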


In [6]:
# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the model
tree = DecisionTreeClassifier(random_state=42)

# Set up the grid search with cross-validation
grid_search = GridSearchCV(
    estimator=tree,
    param_grid=param_grid,
    cv=5,                      # 5-fold cross-validation
    scoring='f1',              # can be swapped for 'accuracy', 'roc_auc', etc.
    n_jobs=-1,                 # use all available CPU cores
    verbose=1
)

# Fit
grid_search.fit(X_train_scaled, y_train)

# Retrieve the best model (refit on the full training set)
best_tree = grid_search.best_estimator_

# Evaluate on the test set
y_pred_best = best_tree.predict(X_test_scaled)
print("Best parameters:", grid_search.best_params_)
print("F1-score:", round(f1_score(y_test, y_pred_best), 4))
print("Accuracy:", round(accuracy_score(y_test, y_pred_best), 4))

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 10}
F1-score: 0.6789
Accuracy: 0.857
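
The grid has 4 × 3 × 3 = 36 combinations, and 5 folds each give the 180 fits reported above, but the output shows only the winning combination. To see how close the other candidates came, the `cv_results_` attribute can be turned into a table. This is a sketch on synthetic stand-in data from `make_classification`, with a smaller grid; `X_demo`, `y_demo`, `demo_search` and `cv_table` are names assumed for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Adult dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 4]},
    cv=5,
    scoring="f1",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best mean cross-validated F1 first
cv_table = (
    pd.DataFrame(demo_search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table)
```

If the top rows differ by less than their `std_test_score`, the simpler candidate (for example, a shallower tree) may be a reasonable choice.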


### Pruning

In [18]:
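# Compute the cost-complexity pruning path of the unpruned tree
# (named pruning_path so it does not overwrite the dataset `path` defined earlier)
pruning_path = model_dt.cost_complexity_pruning_path(X_train_scaled, y_train)
ccp_alphas = pruning_path.ccp_alphas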

In [19]:
# Train one tree per alpha (this refits the tree once for every value in ccp_alphas)
trees = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    clf.fit(X_train_scaled, y_train)
    trees.append(clf)

In [20]:
# Score each tree on the test set (accuracy)
test_scores = [clf.score(X_test_scaled, y_test) for clf in trees]

In [21]:
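# Pick the alpha with the highest test accuracy
# Caution: this selects a hyperparameter on the test set, so the test metrics
# reported below for the pruned tree are optimistically biased
best_index = np.argmax(test_scores)
best_alpha = ccp_alphas[best_index]
print(f"Best ccp_alpha found: {best_alpha:.5f}")

Best ccp_alpha found: 0.00015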
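
Because the alpha above is chosen by test accuracy, the test set no longer gives an unbiased estimate for the pruned tree. A common alternative is to choose `ccp_alpha` by cross-validation on the training split and use the test split only once at the end. The sketch below illustrates this on synthetic data; it is not the procedure used in this notebook, and `X_demo`, `y_demo`, `candidate_alphas`, `alpha_search` and `cv_best_alpha` are assumed names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Adult dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Candidate alphas come from the pruning path of the training split
demo_path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)

# Drop the last alpha (it prunes the tree to a single node) and clip to guard
# against tiny negative values caused by floating-point error
candidate_alphas = np.unique(np.clip(demo_path.ccp_alphas[:-1], 0, None))

# Choose alpha by 5-fold cross-validation on the training split only
alpha_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"ccp_alpha": candidate_alphas},
    cv=5,
    scoring="accuracy",
)
alpha_search.fit(X_tr, y_tr)
cv_best_alpha = alpha_search.best_params_["ccp_alpha"]
print(f"alpha chosen by cross-validation: {cv_best_alpha:.5f}")

# The test split is used once, only for the final estimate
test_accuracy = alpha_search.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.4f}")
```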


In [22]:
# Train the final pruned tree
model_dt_pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
model_dt_pruned.fit(X_train_scaled, y_train)

# Evaluate the pruned tree on the test set
y_pred_pruned = model_dt_pruned.predict(X_test_scaled)

print("\nTree without pruning:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print(f"F1-score: {f1_score(y_test, y_pred_dt):.4f}")

print("\nTree with pruning:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_pruned):.4f}")
print(f"F1-score: {f1_score(y_test, y_pred_pruned):.4f}")


Tree without pruning:
Accuracy: 0.8195
F1-score: 0.6372

Tree with pruning:
Accuracy: 0.8615
F1-score: 0.6896


In [29]:
# feature_names = X_train.columns

# # Plot the tree without pruning
# plt.figure(figsize=(14,6))
# plot_tree(model_dt, filled=True, max_depth=3, feature_names=feature_names)
# plt.title('Tree without Pruning (limited to 3 levels)')
# plt.show()

# # Plot the tree with pruning
# plt.figure(figsize=(14,6))
# plot_tree(model_dt_pruned, filled=True, max_depth=3, feature_names=feature_names)
# plt.title('Tree with Pruning (limited to 3 levels)')
# plt.show()

In [30]:
model_dt.get_depth()

46

In [31]:
model_dt_pruned.get_depth()

19
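
Depth is only one measure of tree size: pruning took the tree above from depth 46 to depth 19. Leaf and node counts give a fuller picture. The following sketch compares the three measures on synthetic data; `ccp_alpha=0.01` is an arbitrary illustrative value, and `X_demo`, `y_demo`, `full_tree` and `pruned_tree` are assumed names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (an assumption; not the Adult dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, flip_y=0.1, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
# ccp_alpha=0.01 is an arbitrary value chosen only for illustration
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=0.01).fit(X_demo, y_demo)

# Depth, number of leaves, and total number of nodes for each tree
for name, model in [("full", full_tree), ("pruned", pruned_tree)]:
    print(
        f"{name:>6}: depth={model.get_depth()}, "
        f"leaves={model.get_n_leaves()}, nodes={model.tree_.node_count}"
    )
```

The same three calls on `model_dt` and `model_dt_pruned` would show how many leaves the pruning step removed.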

### Grid search with the best alpha fixed

In [34]:
# Define the new hyperparameter grid (ccp_alpha is not searched)
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the model with ccp_alpha fixed
tree_with_fixed_alpha = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)

# New grid search with the fixed alpha
grid_search_fixed_alpha = GridSearchCV(
    estimator=tree_with_fixed_alpha,
    param_grid=param_grid,
    cv=5,                      # 5-fold cross-validation
    scoring='f1',              # or 'accuracy'
    n_jobs=-1,
    verbose=1
)

# Fit
grid_search_fixed_alpha.fit(X_train_scaled, y_train)

# Evaluate the best model
best_tree_final = grid_search_fixed_alpha.best_estimator_

y_pred_final = best_tree_final.predict(X_test_scaled)
print("Final best parameters:", grid_search_fixed_alpha.best_params_)
print("Final accuracy:", round(accuracy_score(y_test, y_pred_final), 4))
print("Final F1-score:", round(f1_score(y_test, y_pred_final), 4))


Fitting 5 folds for each of 36 candidates, totalling 180 fits
Final best parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2}
Final accuracy: 0.8562
Final F1-score: 0.6769
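
To compare all variants side by side, the test-set metrics reported throughout this notebook can be gathered into one table. The sketch below copies the printed values, rounded to four decimals; the row labels and the name `summary` are chosen here for illustration. The pruned tree's alpha was selected on the test set, so its row is likely optimistic.

```python
import pandas as pd

# Test-set metrics copied from the outputs reported in this notebook
summary = pd.DataFrame(
    [
        ("Decision tree (default)", 0.8195, 0.6372),
        ("KNN", 0.8206, 0.6202),
        ("Logistic regression", 0.8496, 0.6681),
        ("Decision tree (grid search)", 0.8570, 0.6789),
        ("Decision tree (ccp_alpha pruning)", 0.8615, 0.6896),
        ("Decision tree (grid search, fixed ccp_alpha)", 0.8562, 0.6769),
    ],
    columns=["Model", "Accuracy", "F1-score"],
)

# Sort by F1-score, best first
summary = summary.sort_values("F1-score", ascending=False).reset_index(drop=True)
print(summary.to_string(index=False))
```

The three tuned tree variants differ by less than 0.013 in F1, so the ranking among them should be read with caution.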
