<a href="https://colab.research.google.com/github/mlfigueiredo/CienciaDosDados/blob/main/Sistema_de_classifica%C3%A7%C3%A3o_de_documentos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color='blue'>Building a Document Classifier with NLP and Naive Bayes</font>

http://scikit-learn.org/stable/modules/naive_bayes.html

# Multinomial Naive Bayes - Scikit-Learn

The Multinomial Naive Bayes classifier is suited to classification with discrete features (for example, word counts in text classification). The multinomial distribution normally requires integer feature counts. In practice, however, fractional counts such as tf-idf weights may also work.

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
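To make the idea concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes using only NumPy. It is not scikit-learn's implementation. The vocabulary, the count matrix, the labels, and the helper names `fit_multinomial_nb` and `predict_multinomial_nb` are all made up for illustration, and additive smoothing with `alpha = 1.0` is assumed.

```python
import numpy as np

# Toy word-count matrix: rows are documents, columns are vocabulary terms.
# Hypothetical vocabulary: ["image", "pixel", "doctor", "patient"]
X = np.array([
    [3, 2, 0, 0],   # graphics-like document
    [2, 4, 0, 1],   # graphics-like document
    [0, 0, 3, 2],   # medicine-like document
    [0, 1, 2, 4],   # medicine-like document
])
y = np.array([0, 0, 1, 1])  # 0 = graphics, 1 = medicine (made-up labels)


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log term probabilities per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Sum the term counts of all documents in each class, then smooth.
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood


def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    """Pick the class with the highest joint log probability."""
    joint = X @ log_likelihood.T + log_prior
    return classes[np.argmax(joint, axis=1)]


# Integer counts, as the multinomial model formally expects.
model = fit_multinomial_nb(X, y)
new_doc = np.array([[0, 0, 1, 2]])  # mostly "doctor"/"patient" counts
prediction = predict_multinomial_nb(new_doc, *model)

# Fractional "counts" go through the same arithmetic without changes.
fractional_model = fit_multinomial_nb(X * 0.5, y)
fractional_prediction = predict_multinomial_nb(new_doc * 0.5, *fractional_model)

print(prediction, fractional_prediction)
```

Nothing in the arithmetic depends on the counts being integers, which is why weighted features such as tf-idf can be fed to the same model.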

# Document Classifier

> Examples: law firms (millions of petitions...), laboratory test diagnostics (Sabin, Exame), academic research, financial reports of publicly traded companies

      



http://qwone.com/~jason/20Newsgroups/

In [2]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [3]:
fetch_20newsgroups

<function sklearn.datasets._twenty_newsgroups.fetch_20newsgroups>

In [7]:
# Defining the categories
# (using only 4 of the 20 available so that classification runs faster)
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [8]:
# Training data
twenty_train = fetch_20newsgroups(subset = 'train', categories = categories, shuffle = True, random_state = 42) 

In [9]:
# Classes
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [10]:
len(twenty_train.data)

2257

In [15]:
# Inspecting some of the data (features)
print("\n".join(twenty_train.data[0].split("\n")[:50]))

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [18]:
# Inspecting the target variable
print(twenty_train.target_names[twenty_train.target[0]])

comp.graphics


In [19]:
# scikit-learn stores the labels as an array of integers for speed
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

In [20]:
# Showing the classes of the first 10 records
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


# Bag of Words

In [35]:
# Tokenizing (the NLP step): CountVectorizer splits each document into word tokens and counts how often each token occurs
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
count_vect.vocabulary_.get(u'algorithm')
X_train_counts.shape

(2257, 35788)
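The shape above reads as (number of documents, vocabulary size). A minimal sketch on two made-up sentences shows what `CountVectorizer` produces. The sentences and the names `docs`, `vect`, `counts`, `vocab`, and `dense` are illustrative only, and the default tokenizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny, made-up documents.
docs = ["the cat sat", "the cat sat on the mat"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)   # sparse matrix: documents x vocabulary terms

vocab = vect.vocabulary_            # maps each term to its column index
dense = counts.toarray()            # dense copy, only sensible for toy data

print(sorted(vocab))                # the learned vocabulary terms
print(dense)                        # one row of term counts per document
```

Each row is one document and each column one vocabulary term, so the second row holds a 2 in the column for "the".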

In [37]:
# From occurrences to frequencies. With use_idf = False this computes only normalized term frequencies, not the full Term Frequency times Inverse Document Frequency (tf-idf)
tf_transformer = TfidfTransformer(use_idf = False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2257, 35788)

In [42]:
# Same steps as the previous cell, combined into a single fit_transform call, but now with IDF weighting enabled (the default), so the values weight each word by its importance; only the shape stays the same
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)
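The two transformers return matrices of the same shape but with different values. The sketch below contrasts them on a tiny hand-written count matrix. The counts and the column meanings are made up, and scikit-learn's default settings (`norm='l2'`, `smooth_idf=True`) are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy counts for 2 documents over 3 hypothetical terms: "the", "cat", "mat".
counts = np.array([
    [2, 1, 0],
    [1, 0, 1],
])

# Term frequencies only: each row is rescaled to unit L2 norm.
tf = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()

# Full tf-idf: terms found in fewer documents receive a larger weight.
tfidf_model = TfidfTransformer().fit(counts)
tfidf = tfidf_model.transform(counts).toarray()
idf = tfidf_model.idf_

print(tf)
print(tfidf)
print(idf)
```

"the" occurs in both documents, so it gets the smallest IDF weight, while "cat" and "mat" each occur in one document and are weighted up.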

In [43]:
# Training the Multinomial Naive Bayes model
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [44]:
# Predictions
docs_new = ['microsoft and internet']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'microsoft and internet' => comp.graphics


In [46]:
# Building a Pipeline (composite classifier). Hyperparameters could also be set on the pipeline steps
# vectorizer => transformer => classifier 
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
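As a sketch of how hyperparameters can be attached to a pipeline, the block below sets them when the steps are constructed and then changes one afterwards through the `step__parameter` naming scheme. The name `pipe` and the specific values are illustrative assumptions, not tuned settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hyperparameters passed directly to each step (illustrative values).
pipe = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultinomialNB(alpha=0.01)),
])

# The same parameters are addressable as "<step name>__<parameter name>".
pipe.set_params(clf__alpha=0.5)
alpha_now = pipe.get_params()['clf__alpha']

print(alpha_now)
```

This double-underscore naming is also what a parameter grid uses to address individual steps of a pipeline.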

In [47]:
# Fit the pipeline on the training data
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [49]:
# Model accuracy on the test set
twenty_test = fetch_20newsgroups(subset = 'test', categories = categories, shuffle = True, random_state = 42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)    

0.8348868175765646

In [50]:
# Metrics
print(metrics.classification_report(twenty_test.target, predicted, target_names = twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

              accuracy                           0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502



In [51]:
# Confusion matrix (rows: true classes, columns: predicted classes)
metrics.confusion_matrix(twenty_test.target, predicted)

array([[192,   2,   6, 119],
       [  2, 347,   4,  36],
       [  2,  11, 322,  61],
       [  2,   2,   1, 393]])
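The per-class precision and recall in the classification report can be recomputed directly from this matrix. The sketch below copies the matrix from the output above and uses only NumPy. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true classes, columns: predicted classes).
cm = np.array([
    [192,   2,   6, 119],
    [  2, 347,   4,  36],
    [  2,  11, 322,  61],
    [  2,   2,   1, 393],
])
labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

correct = np.diag(cm)                   # correctly classified documents per class
recall = correct / cm.sum(axis=1)       # share of each true class that was found
precision = correct / cm.sum(axis=0)    # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>24}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The first row explains the weakest numbers: 119 of the 319 `alt.atheism` documents were predicted as `soc.religion.christian`, which lowers recall for the former and precision for the latter.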

In [53]:
# Parameter grid for GridSearchCV over the Naive Bayes pipeline
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

In [54]:
# GridSearchCV
gs_clf = GridSearchCV(text_clf, parameters, n_jobs = -1)
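To get a feel for how much work the search does, the candidate combinations can be enumerated with `ParameterGrid`. This is a minimal sketch: the grid is repeated from the cell above, and the fold count of 5 is an assumption (it is the default in recent scikit-learn releases, while older releases used 3).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above, repeated so the example is self-contained.
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)           # 2 * 2 * 2 combinations
n_folds = 5                        # assumed number of cross-validation folds
n_fits = n_candidates * n_folds    # pipeline fits run during the search

for candidate in grid:
    print(candidate)
print(n_candidates, n_fits)
```

Every additional value in the grid multiplies the number of fits, which is one reason to search on a subset of the data or to run the fits in parallel with `n_jobs = -1`.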

In [55]:
# Fit on the first 400 training documents only, to keep the search fast
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

In [56]:
# Test
twenty_train.target_names[gs_clf.predict(['good'])[0]]

'soc.religion.christian'

In [57]:
# Best mean cross-validated score
gs_clf.best_score_        

0.9349999999999999

In [58]:
# Best parameters found
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

clf__alpha: 0.01
tfidf__use_idf: True
vect__ngram_range: (1, 2)
