<a href="https://colab.research.google.com/github/kieskii/Trabalho-WebPython/blob/main/trabalho.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 10. Digits Dataset with Decision Tree and Random Forest

- **Objective:** Train both a decision tree and a Random Forest on the Digits dataset and compare the two models, particularly their ability to generalize and their execution time.

- **Technique:** Comparison between Decision Tree and Random Forest.
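
The comparison described in the objective could be sketched roughly as follows. This is only an illustration, assuming scikit-learn's bundled `load_digits` data, an 80/20 stratified split, and a hypothetical helper named `evaluate`; none of these choices come from the notebook itself.

```python
import time

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the 8x8 handwritten digits dataset bundled with scikit-learn.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def evaluate(model):
    """Hypothetical helper: fit the model, return (train acc, test acc, fit seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, fit_seconds


results = {
    "Decision Tree": evaluate(DecisionTreeClassifier(random_state=42)),
    "Random Forest": evaluate(RandomForestClassifier(n_estimators=100, random_state=42)),
}

# A large gap between train and test accuracy hints at weaker generalization.
for name, (train_acc, test_acc, fit_seconds) in results.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} fit={fit_seconds:.3f}s")
```

Reporting training and test accuracy side by side makes the generalization gap visible, and the fit time gives a rough sense of the extra cost of the ensemble.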

---

## Description of the Challenges

- **Challenge 10:** Introduces simple ensembles and a direct comparison between classifiers to highlight the differences between them.

These challenges continue to use basic, common classifiers, providing a solid foundation in machine learning while exploring techniques such as hyperparameter tuning, ensemble learning, and error analysis.

---

Test content
Question
Use the imbalanced COVID dataset and try to achieve the best result you can. Use techniques such as feature selection, sampling, tuning, ensembling, feature engineering...
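
As an illustration of how feature selection and tuning can be combined in a single search, here is a minimal sketch. It assumes a synthetic imbalanced dataset from `make_classification` as a stand-in for the COVID data, and the grid values and step names (`select`, `clf`) are arbitrary choices, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for an imbalanced dataset (roughly 90% / 10%).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Feature selection lives inside the pipeline, so it is refit on each CV fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

# Tune the number of kept features together with the tree depth.
param_grid = {
    "select__k": [5, 10, 20],
    "clf__max_depth": [5, None],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(X_train, y_train)

test_f1 = f1_score(y_test, search.predict(X_test), average="macro")
print("Best parameters:", search.best_params_)
print(f"Held-out macro F1: {test_f1:.3f}")
```

Keeping the selector inside the pipeline means each cross-validation fold selects features from its own training portion only.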

In [None]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Load the five training folds and stack them into a single DataFrame.
fold_paths = [f'/content/lbp-train-fold_{i}.csv' for i in range(5)]
train_data = pd.concat([pd.read_csv(path) for path in fold_paths], ignore_index=True)

# Separate the features from the target column.
X = train_data.drop(columns=['class'])
y = train_data['class']

# Balance the classes with SMOTE.
# Note: resampling happens before the split, so the validation set below
# also contains synthetic samples.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

X_train, X_val, y_train, y_val = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Fit the scaler on the training split only and reuse it for validation and test.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Baseline: a single decision tree with default hyperparameters.
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_scaled, y_train)
y_pred_dt = dt_model.predict(X_val_scaled)

# Random Forest tuned with a grid search scored by macro F1.
rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='f1_macro')
grid_search.fit(X_train_scaled, y_train)

best_rf_model = grid_search.best_estimator_
y_pred_rf = best_rf_model.predict(X_val_scaled)

# Evaluate on the untouched test set (no resampling applied here).
data_test = pd.read_csv('/content/lbp-test.csv')
X_test = data_test.drop(columns=['class'])
y_test = data_test['class']

X_test_scaled = scaler.transform(X_test)

y_test_pred_dt = dt_model.predict(X_test_scaled)
y_test_pred_rf = best_rf_model.predict(X_test_scaled)

print("Test Set Decision Tree F1-Score Macro:", f1_score(y_test, y_test_pred_dt, average='macro'))
print("Test Set Random Forest F1-Score Macro:", f1_score(y_test, y_test_pred_rf, average='macro'))

Test Set Decision Tree F1-Score Macro: 0.3611531733058017
Test Set Random Forest F1-Score Macro: 0.31667839549612953
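
The cell above resamples before calling `train_test_split`, so synthetic points derived from training rows can end up in the validation split. That is one plausible reason for validation scores being far higher than test scores. A minimal sketch of the alternative ordering, split first and oversample only the training portion, is shown below. It is an illustration under assumptions: the data is synthetic, and plain random oversampling via `sklearn.utils.resample` is swapped in for SMOTE to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the real features and labels.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# 1) Split first, so the validation set holds only original samples.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2) Oversample minority classes in the training split up to the majority count.
classes, counts = np.unique(y_train, return_counts=True)
target = counts.max()
X_parts, y_parts = [], []
for cls, count in zip(classes, counts):
    X_cls = X_train[y_train == cls]
    if count < target:
        # Random oversampling with replacement (not SMOTE).
        X_cls = resample(X_cls, replace=True, n_samples=target, random_state=42)
    X_parts.append(X_cls)
    y_parts.append(np.full(len(X_cls), cls))
X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)

# 3) Train on the balanced data, validate on the untouched split.
model = DecisionTreeClassifier(random_state=42).fit(X_bal, y_bal)
val_f1 = f1_score(y_val, model.predict(X_val), average="macro")
print(f"Validation macro F1 (no resampled rows in validation): {val_f1:.3f}")
```

With imbalanced-learn available, placing `SMOTE` as a step in `imblearn.pipeline.Pipeline` should give the same ordering automatically inside cross-validation, since samplers in that pipeline are applied only when fitting.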


In [None]:

# SVM with probability estimates enabled, which soft voting requires.
svm_model = SVC(probability=True, class_weight='balanced')
svm_model.fit(X_train_scaled, y_train)
y_pred_svm = svm_model.predict(X_val_scaled)

gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train_scaled, y_train)
y_pred_gb = gb_model.predict(X_val_scaled)

# Soft-voting ensemble: averages the predicted class probabilities of all four models.
# VotingClassifier refits clones of each estimator on the training data.
voting_model = VotingClassifier(estimators=[
    ('dt', dt_model),
    ('rf', best_rf_model),
    ('svm', svm_model),
    ('gb', gb_model)],
    voting='soft')
voting_model.fit(X_train_scaled, y_train)
y_pred_voting = voting_model.predict(X_val_scaled)

# Validation predictions for every model.
models = {
    'Decision Tree': y_pred_dt,
    'Random Forest': y_pred_rf,
    'SVM': y_pred_svm,
    'Gradient Boosting': y_pred_gb,
    'Voting Ensemble': y_pred_voting,
}

for model_name, predictions in models.items():
    print(f"{model_name} F1-Score Macro:", f1_score(y_val, predictions, average='macro'))

# Test-set predictions for the models fitted in this cell.
y_test_pred_svm = svm_model.predict(X_test_scaled)
y_test_pred_gb = gb_model.predict(X_test_scaled)
y_test_pred_voting = voting_model.predict(X_test_scaled)

test_models = {
    'Decision Tree': y_test_pred_dt,
    'Random Forest': y_test_pred_rf,
    'SVM': y_test_pred_svm,
    'Gradient Boosting': y_test_pred_gb,
    'Voting Ensemble': y_test_pred_voting,
}

for model_name, predictions in test_models.items():
    print(f"Test Set {model_name} F1-Score Macro:", f1_score(y_test, predictions, average='macro'))


Decision Tree F1-Score Macro: 0.9758799619179772
Random Forest F1-Score Macro: 0.9960020232154271
SVM F1-Score Macro: 0.996035848406948
Gradient Boosting F1-Score Macro: 0.9960415100139687
Voting Ensemble F1-Score Macro: 0.9960876370154178
Test Set Decision Tree F1-Score Macro: 0.3611531733058017
Test Set Random Forest F1-Score Macro: 0.31667839549612953
Test Set SVM F1-Score Macro: 0.38624801413755555
Test Set Gradient Boosting F1-Score Macro: 0.27114019393998806
Test Set Voting Ensemble F1-Score Macro: 0.3611331099604021
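
A single macro F1 value does not show which classes are being missed. For the error analysis step, a per-class breakdown could be produced along these lines. This is a sketch on synthetic three-class data, since the real test predictions are not reproduced here, and the class weights are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced three-class problem (assumed proportions 80/15/5).
X, y = make_classification(
    n_samples=900, n_features=12, n_informative=6, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1, plus macro and weighted averages.
print(classification_report(y_test, y_pred, zero_division=0))
```

Reading the confusion matrix row by row shows where each true class ends up, which helps decide whether sampling, class weights, or new features are the more promising next step.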
