# Classification with *Count Vectorizer*

- Version of **19 October 2023**

- Using the __resumo__ (abstract) column

The notebook is split into 2 main parts: importing the dataset and applying the classifiers.

- Description: there are 3 input files (*corpora*) with different levels of preprocessing. Stemming was added to the preprocessing. The text is then vectorized with *Count Vectorizer*, and the classification algorithms 'Multinomial Naive Bayes', 'Complement Naive Bayes', 'KNN', 'SVM', 'Random Forest', and 'AdaBoost' are evaluated. A k-fold with k=5 is used. The sampling **is stratified**.

- Random seed used: 42
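The workflow described above (vectorize, classify, evaluate with a stratified 5-fold) can also be expressed compactly with a scikit-learn `Pipeline` and `cross_validate`. The following is only a minimal sketch under stated assumptions: the toy corpus and its labels are hypothetical and are not the notebook's dataset, and a single classifier is used for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 3 classes with 5 short texts each (NOT the real dataset).
texts = [
    "infrared missile seeker countermeasure", "missile guidance infrared sensor",
    "infrared seeker jamming missile", "missile countermeasure laser infrared",
    "seeker sensor missile tracking",
    "ceramic capacitor dielectric loss", "dielectric ceramic high voltage capacitor",
    "capacitor dielectric permittivity ceramic", "ceramic material dielectric breakdown",
    "high voltage capacitor ceramic loss",
    "radar antenna signal detection", "antenna array radar beam",
    "radar signal processing detection", "beam steering antenna radar",
    "signal detection radar clutter",
]
labels = [1] * 5 + [2] * 5 + [3] * 5

# The vectorizer lives inside the pipeline, so it is re-fitted on each
# training fold and never sees the corresponding test fold.
pipeline = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(pipeline, texts, labels, cv=cv,
                        scoring=["accuracy", "f1_macro"])
print("accuracy per fold:", scores["test_accuracy"])
print("macro F1 per fold:", scores["test_f1_macro"])
```

Keeping the vectorizer inside the pipeline is equivalent in spirit to fitting it on the training split only, which is what the manual loop in this notebook does.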

In [19]:
# dataset.csv   or  dataset_pre_processado_1.csv  or  dataset_pre_processado_stem_2.csv
#     CSV1                  CSV2                                   CSV3
dataset = "dataset.csv"

In [20]:
print("Reminder: we are using the dataset: " + dataset)

Reminder: we are using the dataset: dataset.csv


Naive Bayes (NB) and Support Vector Machine (SVM)

# Classification Task

### Importing libraries:

In [21]:
from sklearn.metrics import classification_report  # validation metrics

In [22]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import cross_val_predict, KFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import naive_bayes, svm
from sklearn.naive_bayes import ComplementNB
from sklearn.ensemble import AdaBoostClassifier

## Loading the Dataset

In [23]:
df = pd.read_csv(dataset)
df.head(2)

Unnamed: 0,id,titulo,autor,url,tipo_documento,rotulo,resumo,texto
0,88,estudo dos efeitos de dircm em mísseis infrave...,"caio augusto de melo silvestre, lester de abre...",https://www.sige.ita.br/edicoes-anteriores/201...,Artigo de Simpósio,1,O crescente emprego de mísseis de ombro infrav...,"Mísseis Infravermelhos, especialmente os do t..."
1,125,caracterização de capacitores cerâmicos na fai...,"silva neto, l. p., rossi, j. o., barroso j. j.",https://www.sige.ita.br/edicoes-anteriores/201...,Artigo de Simpósio,2,Materiais dielétricos com baixas perdas e alta...,I. INTRODUÇÃO Cerâmicas dielétricas encontram ...


## Vectorization

- Vectorization:

In [24]:
# Vectorization using CountVectorizer
# cv = CountVectorizer(max_features = N) # max_features caps the vocabulary size
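To illustrate the commented-out `max_features` option, here is a small sketch on three made-up sentences (hypothetical, not taken from the corpus). With `max_features=N`, `CountVectorizer` keeps only the `N` terms with the highest total count across the fitted documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration only.
docs = [
    "infrared missile seeker",
    "infrared missile countermeasure",
    "ceramic capacitor",
]

full = CountVectorizer().fit(docs)                # unrestricted vocabulary
capped = CountVectorizer(max_features=2).fit(docs)  # keep the 2 most frequent terms

print(len(full.vocabulary_))       # 6 distinct tokens
print(sorted(capped.vocabulary_))  # ['infrared', 'missile']

X = capped.transform(docs)
print(X.shape)                     # (3, 2): one row per document, one column per kept term
print(X.toarray())
```

A smaller vocabulary yields a narrower feature matrix, which may help some classifiers (for example KNN) on small datasets, but whether it helps here would need to be checked experimentally.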

## Classification algorithms

- Classification algorithms:

In [25]:
# Classification algorithms
classifiers = [  # NB, KNN, SVM, ensembles
    ('Multinomial Naive Bayes', MultinomialNB()),
    ('Complement Naive Bayes Classifier', ComplementNB()),
    ('KNN', KNeighborsClassifier()),  # default n_neighbors is 5
    ('SVM', svm.SVC()),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('AdaBoost', AdaBoostClassifier(random_state=42))  # default n_estimators is 50
]

- Creating our DataFrames to store the results

In [26]:
lista_classificador_nome = list()
for classifier_name, classifier in classifiers:
    lista_classificador_nome.append(classifier_name)

In [27]:
df_acc = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])
df_f1 = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])

In [28]:
for classifier_name, classifier in classifiers:
    nova_linha = pd.DataFrame({'Classificador': [classifier_name], 'Rodada 1':[0] , 'Rodada 2':[0], 'Rodada 3':[0], 'Rodada 4':[0], 'Rodada 5':[0], 'Media':[0]})
    df_acc = pd.concat([df_acc, nova_linha], ignore_index=True)
    df_f1 = pd.concat([df_f1, nova_linha], ignore_index=True)
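As an alternative to growing the result tables with `pd.concat` inside a loop, both tables can be built in one step. This is just a sketch: the helper name `empty_results` is hypothetical, and only three classifier names are listed to keep the example short.

```python
import pandas as pd

# Illustrative subset of the classifier names used in this notebook.
classifier_names = ["Multinomial Naive Bayes", "KNN", "SVM"]
round_cols = [f"Rodada {i}" for i in range(1, 6)]

def empty_results(names):
    """Return a table with one row per classifier and float zeros in every score column."""
    table = pd.DataFrame(0.0, index=range(len(names)), columns=round_cols + ["Media"])
    table.insert(0, "Classificador", names)
    return table

# Two independent tables, one per metric.
df_acc = empty_results(classifier_names)
df_f1 = empty_results(classifier_names)
print(df_acc)
```

Initializing the score columns as floats avoids a later dtype change when the accuracy and F1 values are written into the rows.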

In [29]:
df_f1

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0,0,0,0,0,0
1,Complement Naive Bayes Classifier,0,0,0,0,0,0
2,KNN,0,0,0,0,0,0
3,SVM,0,0,0,0,0,0
4,Random Forest,0,0,0,0,0,0
5,AdaBoost,0,0,0,0,0,0


In [30]:
df_acc

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0,0,0,0,0,0
1,Complement Naive Bayes Classifier,0,0,0,0,0,0
2,KNN,0,0,0,0,0,0
3,SVM,0,0,0,0,0,0
4,Random Forest,0,0,0,0,0,0
5,AdaBoost,0,0,0,0,0,0


* K-fold

Note that **stratified sampling was used**.

In [31]:
# Model evaluation using k-fold
k = 5
# kf = KFold(n_splits=k, shuffle=True, random_state=42)
kf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
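The effect of stratification can be checked on synthetic labels. The sketch below assumes class sizes of 94, 41, and 33, chosen so that they add up to the per-class `support` values printed in the classification reports of this notebook; the labels themselves are artificial and the features are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (assumption: sizes mirror the support columns in the reports).
class_sizes = {1: 94, 2: 41, 3: 33}
y = np.concatenate([np.full(n, label) for label, n in class_sizes.items()])
X = np.zeros((len(y), 1))  # the features play no role in how the folds are built

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for fold, (_, test_idx) in enumerate(skf.split(X, y), start=1):
    # Count how many test items of each class landed in this fold.
    counts = {label: int(np.sum(y[test_idx] == label)) for label in class_sizes}
    fold_counts.append(counts)
    print(f"fold {fold}: size={len(test_idx)} per-class counts={counts}")
```

Each test fold should contain roughly one fifth of every class, so the class proportions of the full dataset are preserved in every round. A plain `KFold` gives no such guarantee.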

- Creating an evaluation function:

 I used **'macro'** as the value of the **average** parameter.

**'weighted'**:

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

**'macro'**:

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

 See the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html)
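The difference between the two averages can be reproduced from a confusion matrix. The sketch below uses the matrix printed for Multinomial Naive Bayes in round 1 of this notebook; the helper `labels_from_confusion` is a hypothetical convenience function that expands the matrix back into label lists.

```python
import numpy as np
from sklearn.metrics import f1_score

def labels_from_confusion(cm, labels):
    """Expand a confusion matrix (rows = true, columns = predicted) into label lists."""
    y_true, y_pred = [], []
    for i, true_label in enumerate(labels):
        for j, pred_label in enumerate(labels):
            n = int(cm[i][j])
            y_true.extend([true_label] * n)
            y_pred.extend([pred_label] * n)
    return y_true, y_pred

# Confusion matrix reported for Multinomial Naive Bayes, round 1.
cm = np.array([[18, 1, 0],
               [4, 1, 3],
               [0, 0, 7]])
y_true, y_pred = labels_from_confusion(cm, [1, 2, 3])

per_class = f1_score(y_true, y_pred, average=None)       # one F1 per class
macro = f1_score(y_true, y_pred, average="macro")        # plain mean of the per-class F1
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support

print("per class:", per_class.round(4))
print("macro    :", round(macro, 4))
print("weighted :", round(weighted, 4))
```

Because class 2 has a low F1 and relatively few samples, the macro average (about 0.63) is lower than the weighted average (about 0.71): macro gives every class the same influence regardless of its size.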

In [32]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.metrics import precision_recall_fscore_support as score

def evaluate_model(model, X_test, y_test):
    # Predict the labels
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Accuracy
    acc = accuracy_score(y_test, y_pred)

    # Macro-averaged F1-score
    f1 = f1_score(y_test, y_pred, average='macro')

    # Other metrics; when average is set, support is returned as None
    precision, recall, f1score, support = score(y_test, y_pred, average='macro')
    return cm, acc, f1, precision, recall, f1score, support
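The `_warn_prf(...)` lines that appear in the output further down appear to come from scikit-learn's undefined-metric warning, which is raised when a class is never predicted and its precision is therefore 0/0. One possible way to handle this, shown here only as a sketch and not as part of the original notebook, is to pass `zero_division=0` so the undefined value is set to 0 explicitly. The confusion matrix is the one printed for Multinomial Naive Bayes in round 4; the expansion loop is an illustrative helper.

```python
import warnings

import numpy as np
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_recall_fscore_support

# Confusion matrix reported for Multinomial Naive Bayes, round 4:
# class 2 (middle column) is never predicted.
cm = np.array([[19, 0, 0],
               [5, 0, 3],
               [2, 0, 4]])
labels = [1, 2, 3]

# Expand the matrix (rows = true, columns = predicted) into label lists.
y_true, y_pred = [], []
for i, true_label in enumerate(labels):
    for j, pred_label in enumerate(labels):
        y_true.extend([true_label] * int(cm[i, j]))
        y_pred.extend([pred_label] * int(cm[i, j]))

with warnings.catch_warnings():
    # Turn the undefined-metric warning into an error to show it is not raised.
    warnings.simplefilter("error", category=UndefinedMetricWarning)
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )

print("precision:", precision)
print("recall   :", recall)
print("f1       :", f1)
print("support  :", support)  # None, because an average was requested
```

The numbers are the same as those produced with the default setting; only the warning disappears, since the default also substitutes 0 for the undefined precision.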

In [33]:
classificador = 0
for classifier_name, classifier in classifiers:
    print('---', classifier_name, '---')
    y_true = []
    y_pred = []
    contador = 0
    # Explicit dtype avoids the warning raised for an empty Series without dtype
    serie_acc = pd.Series(dtype=float)
    serie_f1 = pd.Series(dtype=float)
    for train_index, test_index in kf.split(df['resumo'], df['rotulo']):
        contador += 1
        X_train, X_test = df.iloc[train_index]['resumo'], df.iloc[test_index]['resumo']
        y_train, y_test = df.iloc[train_index]['rotulo'], df.iloc[test_index]['rotulo']

        # Vectorize the training data (the vocabulary is learned on this fold only)
        vectorizer = CountVectorizer()
        X_train_vectorized = vectorizer.fit_transform(X_train)

        # Train the model
        classifier.fit(X_train_vectorized, y_train)

        # Vectorize the test data with the vocabulary learned from the training data
        X_test_vectorized = vectorizer.transform(X_test)

        # # Predict the labels
        # y_pred.extend(classifier.predict(X_test_vectorized))
        # y_true.extend(y_test)

        cm, acc, f1, precision, recall, f1score, support = evaluate_model(classifier, X_test_vectorized, y_test)

        print(classifier_name + " Rodada " + str(contador))
        print('Matriz de Confusão:')
        print(cm)
        print('Acurácia:', acc)
        print('F1-Score:', f1)
        print("outras métricas:")
        print('precision:', precision)
        print('recall:', recall)
        print('f1score:', f1score)
        print('support:', support)
        print('-------------------------------------')
        # Store the accuracy and macro F1 of this round
        serie_acc = pd.concat([serie_acc, pd.Series([acc])])
        serie_f1 = pd.concat([serie_f1, pd.Series([f1])])

    # Model evaluation: append the mean of the 5 rounds to each series
    media_acc = serie_acc[:5].mean()
    media_f1 = serie_f1[:5].mean()
    serie_acc = pd.concat([serie_acc, pd.Series([media_acc])])
    serie_f1 = pd.concat([serie_f1, pd.Series([media_f1])])

    # print("Acurácia: " )
    # print(serie_acc)
    # print("F-1: " )
    # print(serie_f1)
    # Write the 5 round scores plus their mean into this classifier's row
    df_acc.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_acc.values
    df_f1.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1.values
    classificador += 1
    print("=======================================================================================")
    # cm = confusion_matrix(y_true, y_pred)
    # acc = accuracy_score(y_true, y_pred)

--- Multinomial Naive Bayes ---
              precision    recall  f1-score   support

           1       0.82      0.95      0.88        19
           2       0.50      0.12      0.20         8
           3       0.70      1.00      0.82         7

    accuracy                           0.76        34
   macro avg       0.67      0.69      0.63        34
weighted avg       0.72      0.76      0.71        34

Multinomial Naive Bayes Rodada 1
Matriz de Confusão:
[[18  1  0]
 [ 4  1  3]
 [ 0  0  7]]
Acurácia: 0.7647058823529411
F1-Score: 0.6338593974175036
outras métricas:
precision: 0.6727272727272728
recall: 0.6907894736842105
f1score: 0.6338593974175036
support: None
-------------------------------------
              precision    recall  f1-score   support

           1       0.68      1.00      0.81        19
           2       0.00      0.00      0.00         8
           3       1.00      0.71      0.83         7

    accuracy                           0.71        34
   macro avg 

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           1       0.73      1.00      0.84        19
           2       0.00      0.00      0.00         8
           3       0.57      0.67      0.62         6

    accuracy                           0.70        33
   macro avg       0.43      0.56      0.49        33
weighted avg       0.52      0.70      0.60        33

Multinomial Naive Bayes Rodada 4
Matriz de Confusão:
[[19  0  0]
 [ 5  0  3]
 [ 2  0  4]]
Acurácia: 0.696969696969697
F1-Score: 0.4866096866096865
outras métricas:
precision: 0.434065934065934
recall: 0.5555555555555555
f1score: 0.4866096866096865
support: None
-------------------------------------
              precision    recall  f1-score   support

           1       0.69      1.00      0.82        18
           2       0.00      0.00      0.00         9
           3       0.71      0.83      0.77         6

    accuracy                           0.70        33
   macro avg       0.47      0.61      0.53    



Complement Naive Bayes Classifier Rodada 2
Matriz de Confusão:
[[19  0  0]
 [ 8  0  0]
 [ 1  0  6]]
Acurácia: 0.7352941176470589
F1-Score: 0.5771958537915984
outras métricas:
precision: 0.5595238095238095
recall: 0.6190476190476191
f1score: 0.5771958537915984
support: None
-------------------------------------
              precision    recall  f1-score   support

           1       0.68      1.00      0.81        19
           2       0.00      0.00      0.00         8
           3       1.00      0.86      0.92         7

    accuracy                           0.74        34
   macro avg       0.56      0.62      0.58        34
weighted avg       0.59      0.74      0.64        34

Complement Naive Bayes Classifier Rodada 3
Matriz de Confusão:
[[19  0  0]
 [ 8  0  0]
 [ 1  0  6]]
Acurácia: 0.7352941176470589
F1-Score: 0.5771958537915984
outras métricas:
precision: 0.5595238095238095
recall: 0.6190476190476191
f1score: 0.5771958537915984
support: None
---------------------------------



              precision    recall  f1-score   support

           1       0.62      0.95      0.75        19
           2       0.00      0.00      0.00         8
           3       1.00      0.57      0.73         7

    accuracy                           0.65        34
   macro avg       0.54      0.51      0.49        34
weighted avg       0.55      0.65      0.57        34

SVM Rodada 2
Matriz de Confusão:
[[18  1  0]
 [ 8  0  0]
 [ 3  0  4]]
Acurácia: 0.6470588235294118
F1-Score: 0.49242424242424243
outras métricas:
precision: 0.5402298850574713
recall: 0.506265664160401
f1score: 0.49242424242424243
support: None
-------------------------------------
              precision    recall  f1-score   support

           1       0.68      1.00      0.81        19
           2       0.00      0.00      0.00         8
           3       1.00      0.86      0.92         7

    accuracy                           0.74        34
   macro avg       0.56      0.62      0.58        34
weighted a



              precision    recall  f1-score   support

           1       0.64      1.00      0.78        18
           2       0.00      0.00      0.00         9
           3       0.80      0.67      0.73         6

    accuracy                           0.67        33
   macro avg       0.48      0.56      0.50        33
weighted avg       0.50      0.67      0.56        33

SVM Rodada 5
Matriz de Confusão:
[[18  0  0]
 [ 8  0  1]
 [ 2  0  4]]
Acurácia: 0.6666666666666666
F1-Score: 0.5032938076416337
outras métricas:
precision: 0.480952380952381
recall: 0.5555555555555555
f1score: 0.5032938076416337
support: None
-------------------------------------
--- Random Forest ---




              precision    recall  f1-score   support

           1       0.76      1.00      0.86        19
           2       0.00      0.00      0.00         8
           3       0.67      0.86      0.75         7

    accuracy                           0.74        34
   macro avg       0.48      0.62      0.54        34
weighted avg       0.56      0.74      0.64        34

Random Forest Rodada 1
Matriz de Confusão:
[[19  0  0]
 [ 5  0  3]
 [ 1  0  6]]
Acurácia: 0.7352941176470589
F1-Score: 0.537878787878788
outras métricas:
precision: 0.47555555555555556
recall: 0.6190476190476191
f1score: 0.537878787878788
support: None
-------------------------------------




              precision    recall  f1-score   support

           1       0.66      1.00      0.79        19
           2       0.00      0.00      0.00         8
           3       1.00      0.71      0.83         7

    accuracy                           0.71        34
   macro avg       0.55      0.57      0.54        34
weighted avg       0.57      0.71      0.61        34

Random Forest Rodada 2
Matriz de Confusão:
[[19  0  0]
 [ 8  0  0]
 [ 2  0  5]]
Acurácia: 0.7058823529411765
F1-Score: 0.5416666666666666
outras métricas:
precision: 0.5517241379310345
recall: 0.5714285714285715
f1score: 0.5416666666666666
support: None
-------------------------------------




              precision    recall  f1-score   support

           1       0.68      1.00      0.81        19
           2       0.00      0.00      0.00         8
           3       1.00      0.86      0.92         7

    accuracy                           0.74        34
   macro avg       0.56      0.62      0.58        34
weighted avg       0.59      0.74      0.64        34

Random Forest Rodada 3
Matriz de Confusão:
[[19  0  0]
 [ 8  0  0]
 [ 1  0  6]]
Acurácia: 0.7352941176470589
F1-Score: 0.5771958537915984
outras métricas:
precision: 0.5595238095238095
recall: 0.6190476190476191
f1score: 0.5771958537915984
support: None
-------------------------------------




              precision    recall  f1-score   support

           1       0.68      1.00      0.81        19
           2       0.00      0.00      0.00         8
           3       0.80      0.67      0.73         6

    accuracy                           0.70        33
   macro avg       0.49      0.56      0.51        33
weighted avg       0.54      0.70      0.60        33

Random Forest Rodada 4
Matriz de Confusão:
[[19  0  0]
 [ 7  0  1]
 [ 2  0  4]]
Acurácia: 0.696969696969697
F1-Score: 0.5119277885235332
outras métricas:
precision: 0.4928571428571429
recall: 0.5555555555555555
f1score: 0.5119277885235332
support: None
-------------------------------------




              precision    recall  f1-score   support

           1       0.67      1.00      0.80        18
           2       0.00      0.00      0.00         9
           3       0.83      0.83      0.83         6

    accuracy                           0.70        33
   macro avg       0.50      0.61      0.54        33
weighted avg       0.52      0.70      0.59        33

Random Forest Rodada 5
Matriz de Confusão:
[[18  0  0]
 [ 8  0  1]
 [ 1  0  5]]
Acurácia: 0.696969696969697
F1-Score: 0.5444444444444444
outras métricas:
precision: 0.5
recall: 0.6111111111111112
f1score: 0.5444444444444444
support: None
-------------------------------------
--- AdaBoost ---
              precision    recall  f1-score   support

           1       0.71      0.79      0.75        19
           2       0.17      0.12      0.14         8
           3       0.71      0.71      0.71         7

    accuracy                           0.62        34
   macro avg       0.53      0.54      0.54        34


In [34]:
df_acc

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0.764706,0.705882,0.735294,0.69697,0.69697,0.719964
1,Complement Naive Bayes Classifier,0.735294,0.735294,0.735294,0.69697,0.69697,0.719964
2,KNN,0.705882,0.617647,0.794118,0.69697,0.545455,0.672014
3,SVM,0.735294,0.647059,0.735294,0.666667,0.666667,0.690196
4,Random Forest,0.735294,0.705882,0.735294,0.69697,0.69697,0.714082
5,AdaBoost,0.617647,0.588235,0.5,0.666667,0.515152,0.57754


In [35]:
df_f1

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0.633859,0.547281,0.577196,0.48661,0.529138,0.554817
1,Complement Naive Bayes Classifier,0.560224,0.577196,0.577196,0.48661,0.529138,0.546073
2,KNN,0.626068,0.483559,0.722507,0.587879,0.472087,0.57842
3,SVM,0.537879,0.492424,0.577196,0.463889,0.503294,0.514936
4,Random Forest,0.537879,0.541667,0.577196,0.511928,0.544444,0.542623
5,AdaBoost,0.535714,0.482385,0.482906,0.56613,0.417457,0.496918
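To compare the classifiers at a glance, the per-round macro F1 values from the table above can be summarized by their mean and ranked. This is a small sketch that copies the printed values by hand; the standard deviation column is an addition for illustration and is not computed in the original notebook.

```python
import pandas as pd

rounds = [f"Rodada {i}" for i in range(1, 6)]

# Macro F1 per round, copied from the df_f1 table above.
f1_rows = {
    "Multinomial Naive Bayes": [0.633859, 0.547281, 0.577196, 0.48661, 0.529138],
    "Complement Naive Bayes Classifier": [0.560224, 0.577196, 0.577196, 0.48661, 0.529138],
    "KNN": [0.626068, 0.483559, 0.722507, 0.587879, 0.472087],
    "SVM": [0.537879, 0.492424, 0.577196, 0.463889, 0.503294],
    "Random Forest": [0.537879, 0.541667, 0.577196, 0.511928, 0.544444],
    "AdaBoost": [0.535714, 0.482385, 0.482906, 0.56613, 0.417457],
}
f1_table = pd.DataFrame.from_dict(f1_rows, orient="index", columns=rounds)

# Mean and spread across the 5 rounds, best mean first.
summary = pd.DataFrame({
    "mean": f1_table.mean(axis=1),
    "std": f1_table.std(axis=1),
}).sort_values("mean", ascending=False)

print(summary.round(4))
```

On these numbers KNN has the highest mean macro F1 but also varies considerably between rounds, so with folds of only 33 or 34 documents the ranking should be read with caution.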
