# Activity 8

In this activity you will work on a multiclass text classification problem. Consider the 20 Newsgroups dataset, available in scikit-learn through fetch_20newsgroups:

    "The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."
    
Given this context, choose a single classifier and, without tuning hyperparameters, train and test models using:

    Bag of Words (counts), without preprocessing
    TF-IDF, without preprocessing
    Bag of Words, with preprocessing
    TF-IDF, with preprocessing
    
Use accuracy as the metric and compare the results in a table.

The preprocessing steps must include at least:

    lowercasing
    punctuation removal
    number removal
    stopword removal (hint: use the NLTK library)
    lemmatization or stemming (only one of the two)
    
Other steps you consider necessary may also be used. Create one function for each step and a function named preprocess() that calls all of them.

In [161]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [162]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

twenty_test = fetch_20newsgroups(subset='test',  shuffle=True, random_state=42)

In [163]:
twenty_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [164]:
data_train = twenty_train.data
y_train = twenty_train.target
data_test = twenty_test.data
y_test = twenty_test.target

### Bag of Words (without preprocessing)

In [165]:
from sklearn.feature_extraction.text import CountVectorizer
bw = CountVectorizer()
X_train_bw = bw.fit_transform(data_train)
X_test_bw = bw.transform(data_test)
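
As a rough illustration of what the count-based representation stores, the following sketch builds a bag-of-words matrix by hand with the standard library. The tokenizer regex mirrors CountVectorizer's documented default token pattern (lowercased words of two or more characters). The helper name `bag_of_words` and the toy corpus are assumptions made for this example, and the real class returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words with at least two characters


def tokenize(doc):
    return TOKEN.findall(doc.lower())


def bag_of_words(train_docs, new_docs):
    """Learn a vocabulary on train_docs and count its terms in new_docs."""
    vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for doc in new_docs:
        row = [0] * len(vocab)
        for term, n in Counter(tokenize(doc)).items():
            if term in index:  # words unseen during fitting are ignored
                row[index[term]] = n
        rows.append(row)
    return vocab, rows


train = ["the car is a fast car", "the game was a great game"]
vocab, X_train = bag_of_words(train, train)
_, X_test = bag_of_words(train, ["a fast game of hockey"])
print(vocab)    # ['car', 'fast', 'game', 'great', 'is', 'the', 'was']
print(X_train)  # [[2, 1, 0, 0, 1, 1, 0], [0, 0, 2, 1, 0, 1, 1]]
print(X_test)   # [[0, 1, 1, 0, 0, 0, 0]]
```

Note that the test document is encoded with the vocabulary learned from the training documents only, which is why the notebook calls fit_transform on the training data and transform on the test data.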

In [166]:
# get_feature_names() was deprecated in scikit-learn 1.0 and later removed.
print(bw.get_feature_names_out())



In [167]:
X_train_bw_shape = X_train_bw.shape[1]

In [168]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train_bw, y_train)

MultinomialNB()
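
For intuition about the classifier fitted above, here is a minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of MultinomialNB. The function names `fit_nb` and `predict_nb` and the toy count matrix are assumptions for illustration; scikit-learn's implementation is vectorized and works on sparse input.

```python
import math


def fit_nb(X, y, alpha=1.0):
    """Return log priors and smoothed log likelihoods for each class."""
    log_prior, log_lik = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Per-term counts in class c, each incremented by alpha.
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)
        log_lik[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_lik


def predict_nb(model, x):
    """Pick the class with the highest log posterior for count vector x."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c] + sum(n * w for n, w in zip(x, log_lik[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 1, 1, 2]]
y = [0, 0, 1, 1]
model = fit_nb(X, y)
print(predict_nb(model, [1, 1, 0, 0]))  # 0
print(predict_nb(model, [0, 0, 1, 1]))  # 1
```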

In [169]:
y_predict = clf.predict(X_test_bw)

In [170]:
from sklearn.metrics import accuracy_score
A1 = accuracy_score(y_test, y_predict)
print(f'Accuracy: {round(A1*100, 2)}%')

Accuracy: 77.28%
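
Accuracy here is simply the fraction of test documents whose predicted label matches the true one. A minimal sketch of that calculation with made-up labels follows; the function name `accuracy` is an assumption, and the notebook itself relies on accuracy_score.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print(accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```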


### TF-IDF (without preprocessing)

In [171]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
X_train_tf = tf.fit_transform(data_train)
X_test_tf = tf.transform(data_test)
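
To make the weighting concrete, the sketch below computes TF-IDF by hand using the formula that TfidfVectorizer applies with its default settings, as far as the scikit-learn documentation describes them: raw term counts multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The function name `tfidf` and the toy count matrix are assumptions for illustration.

```python
import math


def tfidf(counts):
    """counts: list of rows (documents) holding raw term counts."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in counts:
        vec = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard empty rows
        weighted.append([v / norm for v in vec])
    return idf, weighted


counts = [[2, 1, 0],
          [0, 1, 3]]
idf, X = tfidf(counts)
print([round(w, 3) for w in idf])
print([[round(v, 3) for v in row] for row in X])
```

The term in the middle column appears in every document, so its idf is the minimum value of 1, while the rarer terms receive a larger weight.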

In [172]:
X_train_tf_shape = X_train_tf.shape[1]

In [173]:
clf = MultinomialNB()
clf.fit(X_train_tf, y_train)

MultinomialNB()

In [174]:
y_predict = clf.predict(X_test_tf)

In [175]:
A2 = accuracy_score(y_test, y_predict)
print(f'Accuracy: {round(A2*100, 2)}%')

Accuracy: 77.39%
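
The two experiments above repeat the same four steps: vectorize, fit, predict, score. One possible way to factor that out is a small helper built on scikit-learn's make_pipeline. The name `evaluate` and the toy documents below are assumptions for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate(vectorizer, train_docs, y_train, test_docs, y_test):
    """Fit vectorizer + MultinomialNB on the train split; return test accuracy."""
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(train_docs, y_train)
    return accuracy_score(y_test, model.predict(test_docs))


# Toy stand-in for data_train / data_test.
train_docs = ["fast car engine", "car wheels and engine",
              "hockey game tonight", "great game and great team"]
train_labels = [0, 0, 1, 1]
test_docs = ["engine of the car", "the team won the game"]
test_labels = [0, 1]

results = {
    name: evaluate(vec, train_docs, train_labels, test_docs, test_labels)
    for name, vec in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]
}
print(results)
```

With the notebook's variables, a call would look like `evaluate(CountVectorizer(), data_train, y_train, data_test, y_test)`.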


### Preprocessing steps

In [176]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [177]:
def remover_pontuacao(text):
    sem_pontuacao="".join([i for i in text if i not in string.punctuation])
    return sem_pontuacao

In [178]:
def lower(text):
    return text.lower()


In [179]:
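import nltk
# Assumes the corpus was fetched once with nltk.download('stopwords').
def remove_stop_words(text):
    # A set gives constant-time membership tests for each word.
    stopwords = set(nltk.corpus.stopwords.words('english'))
    texto = [w for w in text.split() if w not in stopwords]
    frase = " ".join(texto)
    return frase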

In [180]:
import re
def remove_nums(text):
    num_regex = r'\d+'  # raw string avoids the invalid-escape warning
    t = re.sub(num_regex, '', text)
    return t

In [181]:
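from nltk.stem import WordNetLemmatizer
# Assumes the WordNet data was fetched once with nltk.download('wordnet').
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatizer(text):
    # lemmatize() defaults to pos='n', so every word is treated as a noun
    # and verb forms such as 'looked' are left unchanged.
    texto = [wordnet_lemmatizer.lemmatize(word) for word in text.split()]
    frase = " ".join(texto)
    return frase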

In [182]:
def preprocess(text):
    # Apply every step in sequence; each one receives the previous output.
    N = remove_nums(text)
    L = lower(N)
    P = remover_pontuacao(L)
    R = remove_stop_words(P)
    lem = lemmatizer(R)
    return lem
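
Because preprocess() strips punctuation before removing stopwords, contractions lose their apostrophe and may no longer match the stopword list. The sketch below illustrates the effect with a tiny hand-written stopword set (an assumption standing in for the NLTK list) and simplified stand-ins for the step functions.

```python
import string

STOPWORDS = {"don't", "i", "the", "this"}  # hypothetical mini stopword list


def strip_punct(text):
    return "".join(ch for ch in text if ch not in string.punctuation)


def drop_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)


text = "i don't know this car"
punct_first = drop_stopwords(strip_punct(text))
stop_first = strip_punct(drop_stopwords(text))
print(punct_first)  # dont know car  (the contraction survives as "dont")
print(stop_first)   # know car
```

Whether swapping the two steps changes accuracy on this dataset is not measured here; the example only shows that the order affects which tokens survive.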

### Preprocessing the data

In [183]:
# Apply preprocess() to every training document.
X_train_PP = [preprocess(doc) for doc in data_train]

In [184]:
# Apply preprocess() to every test document.
X_test_PP = [preprocess(doc) for doc in data_test]

In [185]:
data_train[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [186]:
X_train_PP[0]

'lerxstwamumdedu wheres thing subject car nntppostinghost racwamumdedu organization university maryland college park line wondering anyone could enlighten car saw day door sport car looked late early called bricklin door really small addition front bumper separate rest body know anyone tellme model name engine spec year production car made history whatever info funky looking car please email thanks il brought neighborhood lerxst'
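
The processed text above contains fused tokens such as `lerxstwamumdedu` and `racwamumdedu`, because deleting punctuation joins the pieces around it. One possible variation, shown here only as a sketch, replaces each punctuation character with a space using str.translate. The function name `punctuation_to_space` is an assumption, and whether this helps accuracy on this dataset has not been measured here.

```python
import string

# Map every punctuation character to a single space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def punctuation_to_space(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())


sample = "lerxst@wam.umd.edu (where's my thing)"
print(punctuation_to_space(sample))  # lerxst wam umd edu where s my thing
```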

### Bag of Words (with preprocessing)

In [187]:
bw = CountVectorizer()
X_train_PP_bw = bw.fit_transform(X_train_PP)
X_test_PP_bw = bw.transform(X_test_PP)

In [188]:
X_train_PP_bw_shape = X_train_PP_bw.shape[1]

In [189]:
print(bw.get_feature_names_out())



In [190]:
clf = MultinomialNB()
clf.fit(X_train_PP_bw, y_train)

MultinomialNB()

In [191]:
y_predict = clf.predict(X_test_PP_bw)

In [192]:
A3 = accuracy_score(y_test, y_predict)
print(f'Accuracy: {round(A3*100, 2)}%')

Accuracy: 81.24%


### TF-IDF (with preprocessing)

In [193]:
tf = TfidfVectorizer()
X_train_PP_tf = tf.fit_transform(X_train_PP)
X_test_PP_tf = tf.transform(X_test_PP)

In [194]:
X_train_PP_tf_shape = X_train_PP_tf.shape[1]

In [195]:
clf = MultinomialNB()
clf.fit(X_train_PP_tf, y_train)

MultinomialNB()

In [196]:
y_predict = clf.predict(X_test_PP_tf)

In [197]:
A4 = accuracy_score(y_test, y_predict)
print(f'Accuracy: {round(A4*100, 2)}%')

Accuracy: 81.43%


## Preprocessing test to remove meaningless words

### Bag of Words 1

In [224]:
bw2 = CountVectorizer(min_df=2)
X_train_PP_bw2 = bw2.fit_transform(X_train_PP)
X_test_PP_bw2 = bw2.transform(X_test_PP)

In [225]:
print(bw2.get_feature_names_out())



In [226]:
X_train_PP_bw2_shape = X_train_PP_bw2.shape[1]

In [227]:
clf2 = MultinomialNB()
clf2.fit(X_train_PP_bw2, y_train)

MultinomialNB()

In [228]:
y_predict = clf2.predict(X_test_PP_bw2)

In [229]:
# Slight improvement in accuracy, but the vocabulary still contained many meaningless words.
A5 = accuracy_score(y_test, y_predict)
print(f'Accuracy: {round(A5*100, 2)}%')

Accuracy: 82.86%


### TF-IDF

In [204]:
tf2 = TfidfVectorizer(min_df=2)
X_train_PP_tf2 = tf2.fit_transform(X_train_PP)
X_test_PP_tf2 = tf2.transform(X_test_PP)

In [205]:
X_train_PP_tf2_shape = X_train_PP_tf2.shape[1]

In [206]:
clf3 = MultinomialNB()
clf3.fit(X_train_PP_tf2, y_train)

MultinomialNB()

In [207]:
y_predict = clf3.predict(X_test_PP_tf2)

In [208]:
A6 = accuracy_score(y_test, y_predict)
print(f'Accuracy: {round(A6*100, 2)}%')

Accuracy: 81.94%


### Bag of Words 2

In [250]:
bw3 = CountVectorizer(min_df=0.01)
X_train_PP_bw3 = bw3.fit_transform(X_train_PP)
X_test_PP_bw3 = bw3.transform(X_test_PP)
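
The two min_df settings used in this section behave differently: an integer is an absolute number of documents, while a float is a proportion of the training documents. A minimal sketch of that filtering rule on a toy corpus follows. The function name `filter_vocab` is an assumption, and the threshold logic reflects the documented behaviour of CountVectorizer as I understand it rather than its source code.

```python
from collections import Counter


def filter_vocab(tokenized_docs, min_df):
    """Keep terms whose document frequency reaches min_df."""
    # Count each term once per document it appears in.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # int -> absolute document count; float -> fraction of all documents
    if isinstance(min_df, int):
        threshold = min_df
    else:
        threshold = min_df * len(tokenized_docs)
    return sorted(term for term, n in df.items() if n >= threshold)


docs = [["car", "engine", "xqz"],
        ["car", "wheel"],
        ["car", "engine"],
        ["game", "car"]]
print(filter_vocab(docs, 2))     # ['car', 'engine']
print(filter_vocab(docs, 0.75))  # ['car']
```

On a corpus of thousands of documents, requiring a term to appear in 1% of them is a far stricter cut than requiring two occurrences.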

In [251]:
print(bw3.get_feature_names_out())



In [252]:
X_train_PP_bw3_shape = X_train_PP_bw3.shape[1]

In [253]:
clf4 = MultinomialNB()  # separate name so the TF-IDF model in clf3 is not overwritten
clf4.fit(X_train_PP_bw3, y_train)

MultinomialNB()

In [254]:
y_predict = clf4.predict(X_test_PP_bw3)

In [255]:
# Managed to remove all the meaningless words, but at a large cost in accuracy.
A7 = accuracy_score(y_test, y_predict)
print(f'Accuracy: {round(A7*100, 2)}%')

Accuracy: 70.14%


## Comparison table

In [237]:
# SP = without preprocessing, PP = with preprocessing,
# PP2 = PP + min_df=2, PP3 = PP + min_df=0.01
data = {'Model': ['BoW(SP)', 'TF-IDF(SP)', 'BoW(PP)', 'TF-IDF(PP)', 'BoW(PP2)', 'TF-IDF(PP2)', 'BoW(PP3)'],
        'Accuracy(%)': [round(A1*100, 2), round(A2*100, 2), round(A3*100, 2), round(A4*100, 2), round(A5*100, 2), round(A6*100, 2), round(A7*100, 2)],
        'Dimensions': [X_train_bw_shape, X_train_tf_shape, X_train_PP_bw_shape, X_train_PP_tf_shape, X_train_PP_bw2_shape, X_train_PP_tf2_shape, X_train_PP_bw3_shape],
        }
tabela = pd.DataFrame(data)
tabela

|   | Model | Accuracy(%) | Dimensions |
|---|-------------|-------|--------|
| 0 | BoW(SP)     | 77.28 | 130107 |
| 1 | TF-IDF(SP)  | 77.39 | 130107 |
| 2 | BoW(PP)     | 81.24 | 111379 |
| 3 | TF-IDF(PP)  | 81.43 | 111379 |
| 4 | BoW(PP2)    | 82.86 | 45567  |
| 5 | TF-IDF(PP2) | 81.94 | 45567  |
| 6 | BoW(PP3)    | 70.14 | 1796   |
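
To read the table more easily, the following sketch recomputes each configuration's accuracy change relative to the BoW(SP) baseline and its vocabulary size as a share of the baseline vocabulary. The numbers are copied from the table above; the variable names are assumptions for this example.

```python
rows = [
    ("BoW(SP)", 77.28, 130107),
    ("TF-IDF(SP)", 77.39, 130107),
    ("BoW(PP)", 81.24, 111379),
    ("TF-IDF(PP)", 81.43, 111379),
    ("BoW(PP2)", 82.86, 45567),
    ("TF-IDF(PP2)", 81.94, 45567),
    ("BoW(PP3)", 70.14, 1796),
]
_, base_acc, base_dim = rows[0]

# (name, accuracy change in points, vocabulary size as % of the baseline)
summary = [
    (name, round(acc - base_acc, 2), round(100 * dim / base_dim, 1))
    for name, acc, dim in rows
]
for name, delta, share in summary:
    print(f"{name:12s} {delta:+6.2f} points  {share:5.1f}% of baseline vocabulary")

best = max(rows, key=lambda r: r[1])[0]
print("best:", best)
```

BoW(PP2) gains 5.58 points over the baseline while keeping roughly 35% of the vocabulary, whereas BoW(PP3) loses 7.14 points.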
