# **Description**

In this activity you will use the classification algorithms we have studied to classify satirical and real news. The notebook we wrote to demonstrate Naïve Bayes and Logistic Regression may help. Here you will find a dataset with real and satirical news collected from mainstream news sites and from Sensacionalista, respectively. Your task is as follows:

- Run a Naïve Bayes model and compute its accuracy;
- Run a Logistic Regression model and compute its accuracy;
- Run a Multi-Layer Neural Network model and compute its accuracy (try different configurations);
- Run the models above:
  - (i) using only the titles;
  - (ii) using only the body text;
  - (iii) using title and body combined.
- For the tasks above, compare binary, TF, and TF-IDF representations with different n-grams (e.g., unigrams, bigrams, etc.) for the input matrix. Also use a train/test split with the following proportions: 75% for training and 25% for testing.
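
The three representations named in the last item can be built with scikit-learn's vectorizers. The sketch below is only an illustration on a made-up two-sentence corpus; the helper name `make_vectorizers` and the choice to treat raw term counts as "TF" are assumptions, not part of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, for illustration only (not the assignment data).
corpus = [
    "governo anuncia novo plano",
    "governo anuncia novo novo imposto",
]

def make_vectorizers(ngram_range):
    """Return one vectorizer per representation for the given n-gram range."""
    return {
        "binary": CountVectorizer(binary=True, ngram_range=ngram_range),
        "tf": CountVectorizer(ngram_range=ngram_range),  # raw term counts
        "tfidf": TfidfVectorizer(ngram_range=ngram_range),
    }

# Matrix shape for every representation / n-gram combination.
shapes = {}
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    for name, vectorizer in make_vectorizers(ngram_range).items():
        X = vectorizer.fit_transform(corpus)
        shapes[(name, ngram_range)] = X.shape
        print(name, ngram_range, X.shape)

# "novo" occurs twice in the second sentence: the count matrix stores 2,
# while the binary matrix only records presence (1).
tf = CountVectorizer()
tf_matrix = tf.fit_transform(corpus)
novo_tf = tf_matrix[1, tf.vocabulary_["novo"]]

binary = CountVectorizer(binary=True)
binary_matrix = binary.fit_transform(corpus)
novo_binary = binary_matrix[1, binary.vocabulary_["novo"]]
print(novo_tf, novo_binary)
```

Each of these matrices can be passed to `train_test_split` in place of the count matrix, which makes it possible to compare the representations under the same 75/25 split.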

### **Optional:**

- Try other classification algorithms available in scikit-learn.

### **How to submit**

Submit the Colab notebook with your analysis and answers by sending its sharing link.

# Imports

In [None]:
import pandas as pd
import nltk
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

# Load Data

In [None]:
df = pd.read_csv('csv_satiras_reais.csv')
df.head(5)

Unnamed: 0,title,text,label
0,crise e tao grande que nem tiozao do pave fez...,a familia guimaraes passou a noite de natal pe...,satire
1,nao me representam diz jesus sobre intolerant...,uma menina de 11 anos apedrejada ao sair de um...,satire
2,marina silva e heloisa helena montam novo par...,insatisfeitas com seus partidos com as siglas ...,satire
3,dez propostas que podem realmente mudar o brasil,o instituto nupal nucleo de pesquisas da ameri...,satire
4,apresentadora do cidade alerta bahia dara cur...,assassinatos sequestros mortes violentas. nen...,satire


# Preprocessing Data

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('portuguese'))
stop_words.update(['que', 'até', 'esse', 
                    'essa', 'pro', 'pra',
                    'oi'])

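def clean_text(text):
  # Lowercase so that stop-word matching is case-insensitive.
  text = text.lower()
  # Drop standalone numbers. The match includes the surrounding whitespace and is
  # replaced with "", so neighbouring words get joined ("de 11 anos" -> "deanos").
  text = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", "", text)
  # Strip common punctuation marks.
  text = re.sub(r'[,.!?;:/_]', '', text)
  # Keep tokens that are not stop words, are longer than 2 characters and contain no digits.
  text = ' '.join(i for i in text.split() if i not in stop_words and len(i) > 2 and not any(c.isdigit() for c in i))
  return text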

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
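
To see what `clean_text` does to a sentence, the sketch below repeats the same steps in a standalone function. The name `clean_text_demo`, the `replacement` parameter and the tiny stop-word set are hypothetical and exist only so the example runs without downloading NLTK data.

```python
import re

# Hypothetical, tiny stop-word set (the notebook uses NLTK's Portuguese list).
demo_stop_words = {"uma", "um", "de", "ao"}

def clean_text_demo(text, replacement=""):
    """Same steps as clean_text, with the digit replacement made configurable."""
    text = text.lower()
    text = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", replacement, text)
    text = re.sub(r"[,.!?;:/_]", "", text)
    return " ".join(
        w for w in text.split()
        if w not in demo_stop_words and len(w) > 2 and not any(c.isdigit() for c in w)
    )

sentence = "Uma menina de 11 anos apedrejada ao sair."

# Replacement "" (as in the notebook) joins the words around the number.
print(clean_text_demo(sentence))
# Replacement " " keeps them apart, so "de" can still be removed as a stop word.
print(clean_text_demo(sentence, replacement=" "))
```

The first call returns `menina deanos apedrejada sair` and the second `menina anos apedrejada sair`, which may be the behaviour one would prefer when numbers sit between two words.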


In [None]:
df['text'] = df['text'].apply(clean_text)
df['title'] = df['title'].apply(clean_text)
df['concat'] = df['title'] + " " + df['text']
df['label'] = df['label'].apply(lambda x: 1 if x == "satire" else 0).values
df.head(5)

Unnamed: 0,title,text,label,concat
0,crise tao grande tiozao pave fez piada noite n...,familia guimaraes passou noite natal perplexa ...,1,crise tao grande tiozao pave fez piada noite n...
1,nao representam diz jesus sobre intolerantes a...,menina deanos apedrejada sair festa candomble ...,1,nao representam diz jesus sobre intolerantes a...
2,marina silva heloisa helena montam novo partid...,insatisfeitas partidos siglas partidos politic...,1,marina silva heloisa helena montam novo partid...
3,dez propostas podem realmente mudar brasil,instituto nupal nucleo pesquisas america latin...,1,dez propostas podem realmente mudar brasil ins...
4,apresentadora cidade alerta bahia dara curso o...,assassinatos sequestros mortes violentas nenhu...,1,apresentadora cidade alerta bahia dara curso o...


# Naive Bayes

## Title

In [None]:
count_vectorizer = CountVectorizer(ngram_range = (1,2))
text_counts = count_vectorizer.fit_transform(df['title'])
text_counts.shape

(10000, 67709)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df.label, test_size = 0.25, random_state=0)
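
The vectorizer above is fitted on all 10,000 documents before the split, so its vocabulary also contains terms that appear only in test documents. One possible alternative, sketched here on made-up sentences, is to split the raw text first and let a scikit-learn `Pipeline` fit the vectorizer on the training part only. The toy texts and the `stratify` argument are assumptions added for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, for illustration only; 1 = satire, 0 = real as in the notebook.
texts = [
    "presidente declara feriado eterno",
    "gato eleito prefeito cidade",
    "cientistas provam segunda ilegal",
    "homem processa proprio espelho",
    "banco central mantem taxa juros",
    "governo publica decreto orcamento",
    "camara aprova projeto lei",
    "inflacao recua segundo instituto",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw strings first, keeping the 75/25 proportion.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, y_train)
predicted = model.predict(test_texts)

# Every vocabulary entry comes from a training document.
vocabulary = model.named_steps["countvectorizer"].vocabulary_
print(len(vocabulary), predicted)
```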

In [None]:
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

MultinomialNB()

In [None]:
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
accuracy_score

0.8596

In [None]:
print(metrics.classification_report(Y_test, predicted, target_names=["real","satire"]))

              precision    recall  f1-score   support

        real       0.85      0.87      0.86      1237
      satire       0.87      0.85      0.86      1263

    accuracy                           0.86      2500
   macro avg       0.86      0.86      0.86      2500
weighted avg       0.86      0.86      0.86      2500
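
`classification_report` assigns `target_names` to the class labels in sorted order, so with the encoding used in this notebook (0 = real, 1 = satire) the first name describes class 0. A small sketch with made-up labels that makes the mapping explicit by also passing `labels`:

```python
from sklearn import metrics

# Toy labels using the notebook's encoding: 1 = satire, 0 = real.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]

# labels and target_names are paired position by position.
report = metrics.classification_report(
    y_true,
    y_pred,
    labels=[0, 1],
    target_names=["real", "satire"],
    output_dict=True,
)

# Three true "real" items and two true "satire" items.
print(report["real"]["support"], report["satire"]["support"])
print(report["real"]["recall"])
```

Checking that the `support` column matches the known class counts is a quick way to confirm that the names are attached to the intended classes.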



## Text

In [None]:
count_vectorizer = CountVectorizer(ngram_range = (1,2))
text_counts = count_vectorizer.fit_transform(df['text'])
text_counts.shape

(10000, 1073990)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df.label, test_size = 0.25, random_state=0)

In [None]:
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

MultinomialNB()

In [None]:
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
accuracy_score

0.9324

In [None]:
print(metrics.classification_report(Y_test, predicted, target_names=["real","satire"]))

              precision    recall  f1-score   support

        real       0.89      0.98      0.94      1237
      satire       0.98      0.88      0.93      1263

    accuracy                           0.93      2500
   macro avg       0.94      0.93      0.93      2500
weighted avg       0.94      0.93      0.93      2500



## Title + Text

In [None]:
count_vectorizer = CountVectorizer(ngram_range = (1,2))
text_counts = count_vectorizer.fit_transform(df['concat'])
text_counts.shape

(10000, 1112290)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df.label, test_size = 0.25, random_state=0)

In [None]:
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

MultinomialNB()

In [None]:
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
accuracy_score

0.9356

In [None]:
print(metrics.classification_report(Y_test, predicted, target_names=["real","satire"]))

              precision    recall  f1-score   support

        real       0.90      0.98      0.94      1237
      satire       0.98      0.89      0.93      1263

    accuracy                           0.94      2500
   macro avg       0.94      0.94      0.94      2500
weighted avg       0.94      0.94      0.94      2500



# Logistic Regression

## Title

In [None]:
count_vectorizer = CountVectorizer(ngram_range = (1,2))
text_counts = count_vectorizer.fit_transform(df['title'])
text_counts.shape

(10000, 67709)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df.label, test_size = 0.25, random_state=0)

In [None]:
LR = LogisticRegression(max_iter=300)
LR.fit(X_train, Y_train)

LogisticRegression(max_iter=300)

In [None]:
predicted = LR.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
accuracy_score

0.868

In [None]:
print(metrics.classification_report(Y_test, predicted, target_names=["real","satire"]))

              precision    recall  f1-score   support

        real       0.85      0.89      0.87      1237
      satire       0.89      0.84      0.87      1263

    accuracy                           0.87      2500
   macro avg       0.87      0.87      0.87      2500
weighted avg       0.87      0.87      0.87      2500



## Text

In [None]:
count_vectorizer = CountVectorizer(ngram_range = (1,2))
text_counts = count_vectorizer.fit_transform(df['text'])
text_counts.shape

(10000, 1073990)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df.label, test_size = 0.25, random_state=0)

In [None]:
LR = LogisticRegression(max_iter=300)
LR.fit(X_train, Y_train)

LogisticRegression(max_iter=300)

In [None]:
predicted = LR.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
accuracy_score

0.9704

In [None]:
print(metrics.classification_report(Y_test, predicted, target_names=["real","satire"]))

              precision    recall  f1-score   support

        real       0.99      0.95      0.97      1237
      satire       0.96      0.99      0.97      1263

    accuracy                           0.97      2500
   macro avg       0.97      0.97      0.97      2500
weighted avg       0.97      0.97      0.97      2500



## Title + Text

In [None]:
count_vectorizer = CountVectorizer(ngram_range = (1,2))
text_counts = count_vectorizer.fit_transform(df['concat'])
text_counts.shape

(10000, 1112290)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df.label, test_size = 0.25, random_state=0)

In [None]:
LR = LogisticRegression(max_iter=300)
LR.fit(X_train, Y_train)

LogisticRegression(max_iter=300)

In [None]:
predicted = LR.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
accuracy_score

0.9712

In [None]:
print(metrics.classification_report(Y_test, predicted, target_names=["real","satire"]))

              precision    recall  f1-score   support

        real       0.99      0.95      0.97      1237
      satire       0.96      0.99      0.97      1263

    accuracy                           0.97      2500
   macro avg       0.97      0.97      0.97      2500
weighted avg       0.97      0.97      0.97      2500



# Neural Network

## Title

In [None]:
count_vectorizer = CountVectorizer(ngram_range = (1,2))
text_counts = count_vectorizer.fit_transform(df['title'])
text_counts.shape

(10000, 67709)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df.label, test_size = 0.25, random_state=0)

In [None]:
reg = MLPRegressor(hidden_layer_sizes=(64, 64, 64), activation="relu", random_state=1, max_iter=50)
reg.fit(X_train, Y_train)

MLPRegressor(hidden_layer_sizes=(64, 64, 64), max_iter=50, random_state=1)

In [None]:
predicted = reg.predict(X_test)
predicted = list(map(lambda x: 0 if x < 0.5 else 1, predicted))
accuracy_score = metrics.accuracy_score(predicted, Y_test)
accuracy_score

0.834

In [None]:
print(metrics.classification_report(Y_test, predicted, target_names=["real","satire"]))

              precision    recall  f1-score   support

        real       0.77      0.95      0.85      1237
      satire       0.94      0.72      0.81      1263

    accuracy                           0.83      2500
   macro avg       0.85      0.84      0.83      2500
weighted avg       0.85      0.83      0.83      2500



## Text

In [None]:
count_vectorizer = CountVectorizer(ngram_range = (1,2))
text_counts = count_vectorizer.fit_transform(df['text'])
text_counts.shape

(10000, 1073990)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df.label, test_size = 0.25, random_state=0)

In [None]:
reg = MLPRegressor(hidden_layer_sizes=(64, 64, 64), activation="relu", random_state=1, max_iter=50)
reg.fit(X_train, Y_train)

MLPRegressor(hidden_layer_sizes=(64, 64, 64), max_iter=50, random_state=1)

In [None]:
predicted = reg.predict(X_test)
predicted = list(map(lambda x: 0 if x < 0.5 else 1, predicted))
accuracy_score = metrics.accuracy_score(predicted, Y_test)
accuracy_score

0.9592

In [None]:
print(metrics.classification_report(Y_test, predicted, target_names=["real","satire"]))

              precision    recall  f1-score   support

        real       0.93      0.99      0.96      1237
      satire       0.99      0.92      0.96      1263

    accuracy                           0.96      2500
   macro avg       0.96      0.96      0.96      2500
weighted avg       0.96      0.96      0.96      2500



## Title + Text

In [None]:
count_vectorizer = CountVectorizer(ngram_range = (1,2))
text_counts = count_vectorizer.fit_transform(df['concat'])
text_counts.shape

(10000, 1112290)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, df.label, test_size = 0.25, random_state=0)

In [None]:
reg = MLPRegressor(hidden_layer_sizes=(64, 64, 64), activation="relu", random_state=1, max_iter=50)
reg.fit(X_train, Y_train)

MLPRegressor(hidden_layer_sizes=(64, 64, 64), max_iter=50, random_state=1)

In [None]:
predicted = reg.predict(X_test)
predicted = list(map(lambda x: 0 if x < 0.5 else 1, predicted))
accuracy_score = metrics.accuracy_score(predicted, Y_test)
accuracy_score

0.9768

In [None]:
print(metrics.classification_report(Y_test, predicted, target_names=["real","satire"]))

              precision    recall  f1-score   support

        real       0.98      0.98      0.98      1237
      satire       0.98      0.98      0.98      1263

    accuracy                           0.98      2500
   macro avg       0.98      0.98      0.98      2500
weighted avg       0.98      0.98      0.98      2500
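
To compare the nine runs side by side, the test accuracies printed in the cells above can be collected into a single table. The sketch below only copies those numbers into a DataFrame; the row and column names are labels chosen for this illustration.

```python
import pandas as pd

# Test accuracies copied from the cell outputs above
# (CountVectorizer with ngram_range=(1, 2), 75/25 split, random_state=0).
summary = pd.DataFrame(
    {
        "title": [0.8596, 0.868, 0.834],
        "text": [0.9324, 0.9704, 0.9592],
        "title + text": [0.9356, 0.9712, 0.9768],
    },
    index=["Naive Bayes", "Logistic Regression", "MLP (64, 64, 64)"],
)
print(summary)

# Best input per model (row-wise) and best model per input (column-wise).
best_input = summary.idxmax(axis=1)
best_model = summary.idxmax(axis=0)
print(best_input)
print(best_model)
```

In these runs every model scores highest on the combined title and body, and the body alone is far more informative than the title alone.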

