**Ensemble VotingClassifier**

In [90]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import classification_report
import pandas as pd

import imblearn
from imblearn.under_sampling import RandomUnderSampler

In [91]:
# dataset - split into training and test sets
URL = 'https://janeawsdata.s3.us-east-2.amazonaws.com/large_data.csv'
df = pd.read_csv(URL)

X = df.drop('TYPE', axis=1)
y = df.TYPE

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

**Ensemble using the classifiers: Decision Tree, KNN, and Naive Bayes**

In [92]:
model1 = GaussianNB()
model2 = DecisionTreeClassifier(max_depth = 15)
model3 = KNeighborsClassifier(n_neighbors = 5, p=2)
model = VotingClassifier(estimators=[
                                     ('nb', model1), 
                                     ('dt', model2),
                                     ('knn', model3)],
                                     voting='hard') # hard voting gave better results than soft voting
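
The difference between the two `voting` modes can be illustrated with a small standalone sketch. The probabilities below are made up for illustration and do not come from the notebook's models; the point is only that hard voting counts one vote per classifier, while soft voting averages the predicted probabilities, so a single very confident classifier can change the outcome.

```python
from collections import Counter

# Hypothetical class probabilities from three classifiers for ONE sample
# (made-up numbers, for illustration only).
classes = ["ALLERGY", "COLD", "COVID", "FLU"]
probas = [
    [0.05, 0.45, 0.40, 0.10],  # classifier A: narrowly prefers COLD
    [0.05, 0.45, 0.40, 0.10],  # classifier B: narrowly prefers COLD
    [0.00, 0.05, 0.95, 0.00],  # classifier C: strongly prefers COVID
]

def top_class(row):
    """Return the class label with the highest probability in `row`."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return classes[best_index]

# Hard voting: each classifier casts one vote for its top class.
votes = [top_class(row) for row in probas]
hard_winner = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then take the top class.
mean_proba = [sum(column) / len(probas) for column in zip(*probas)]
soft_winner = top_class(mean_proba)

print("hard:", hard_winner)
print("soft:", soft_winner)
```

In this toy case the two modes disagree: two narrow votes for COLD win under hard voting, while the averaged probabilities favor COVID under soft voting. Which mode works better on real data is an empirical question, as the comment in the cell above indicates.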

**TEST 1: Training the model on the training set with [imbalanced classes](https://github.com/janeptn/fit/blob/main/Dataset_Info.ipynb)**

In [93]:
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, labels=df.TYPE.unique())) # evaluation; labels must be passed by keyword in recent scikit-learn

              precision    recall  f1-score   support

     ALLERGY       0.98      0.97      0.98      4110
        COLD       0.46      0.67      0.55       239
       COVID       0.39      0.51      0.44       507
         FLU       0.96      0.92      0.94      6258

    accuracy                           0.92     11114
   macro avg       0.70      0.77      0.73     11114
weighted avg       0.93      0.92      0.92     11114



**TEST 2: Training the model after applying a sampling approach to handle the [imbalanced classes](https://github.com/janeptn/fit/blob/main/Dataset_Info.ipynb)**

In [94]:
# under-sampling technique, applied to the training set only
rus = RandomUnderSampler()
X_res, y_res = rus.fit_resample(X_train, y_train) # fit_sample was renamed to fit_resample in imbalanced-learn
model.fit(X_res, y_res)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, labels=df.TYPE.unique())) # evaluation



              precision    recall  f1-score   support

     ALLERGY       1.00      0.97      0.98      4110
        COLD       0.45      1.00      0.62       239
       COVID       0.50      0.97      0.66       507
         FLU       1.00      0.90      0.95      6258

    accuracy                           0.93     11114
   macro avg       0.74      0.96      0.80     11114
weighted avg       0.97      0.93      0.94     11114
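
The under-sampling step in Test 2 can be sketched with plain pandas. This is only an approximation of what `RandomUnderSampler` does with its default settings (reducing every class to the size of the rarest one), not its actual implementation. The toy DataFrame and the `FEATURE` column are hypothetical, and `groupby(...).sample` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Toy imbalanced training set (made-up values, not the notebook's data).
train = pd.DataFrame({
    "FEATURE": range(11),
    "TYPE": ["FLU"] * 6 + ["ALLERGY"] * 3 + ["COLD"] * 2,
})

# Size of the rarest class; every class is cut down to this many rows.
n_min = train["TYPE"].value_counts().min()

# Draw n_min rows at random, without replacement, from each class.
balanced = train.groupby("TYPE").sample(n=n_min, random_state=0)

print(train["TYPE"].value_counts().to_dict())
print(balanced["TYPE"].value_counts().to_dict())
```

As in the notebook, only the training data is resampled; the test set keeps its original class distribution so the report still reflects the real imbalance.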



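*Note: The second test, which applies under-sampling to the training data, gave better results for the ensemble: macro-average recall rose from 0.77 to 0.96 and macro-average F1 from 0.73 to 0.80, while precision for COLD and COVID remained at or below 0.50.*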
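
To see why the macro average is the row that moves most between the two tests, it helps to recompute it by hand. The sketch below copies the per-class recall values from the two reports above and takes their unweighted mean, which is how a macro average is defined; because the inputs are already rounded to two decimals, the results match the reports only approximately.

```python
# Per-class recall values copied from the two classification reports.
recall_test1 = {"ALLERGY": 0.97, "COLD": 0.67, "COVID": 0.51, "FLU": 0.92}
recall_test2 = {"ALLERGY": 0.97, "COLD": 1.00, "COVID": 0.97, "FLU": 0.90}

def macro_average(per_class):
    """Unweighted mean over classes: every class counts equally."""
    return sum(per_class.values()) / len(per_class)

macro1 = macro_average(recall_test1)
macro2 = macro_average(recall_test2)

print(f"Test 1 macro recall: {macro1:.4f}")
print(f"Test 2 macro recall: {macro2:.4f}")
```

Since COLD and COVID together account for only 746 of the 11114 test samples, their recall gains barely affect accuracy (0.92 versus 0.93) but weigh as much as ALLERGY and FLU in the macro average.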

