# Answer Key - Trainee Final Project - Grupo Turing 2020

Notebook with the answer key for the final project of the Grupo Turing trainee program.

The goal of the project is to prepare trainees to join the group's Natural Language Processing area. Here we present some tasks that are essential background for working with NLP.

**The proposal**: The project consists of a sentiment analysis of the [IMDB Movie Reviews](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) dataset. To do this, we present techniques for *preprocessing*, *feature extraction*, and *word vectors*.

#### Importing libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk 
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
import spacy

#### Importing the data

In [2]:
df = pd.read_csv('IMDB_Dataset.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [3]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

Converting the categorical labels to numeric values:

In [4]:
df['sentiment'] = df['sentiment'].map({'positive' : 1, 'negative' : 0})

In [5]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


## Part 1

### Preprocessing

In [15]:
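# spaCy 3+ no longer accepts the 'en' shortcut; this assumes the small English model is installed
spc = spacy.load('en_core_web_sm')

# Build the stopword set once instead of on every call
stops = set(stopwords.words('english'))

def clean_text(text):
    # Strip HTML tags (naming the parser explicitly avoids a BeautifulSoup warning)
    reviews = BeautifulSoup(text, 'html.parser').get_text()
    
    # Keep letters only
    letters_only = re.sub('[^a-zA-Z]', ' ', reviews)
    
    # Lowercase and tokenize on whitespace
    lower = letters_only.lower().split()
    
    # Remove stopwords
    words = [w for w in lower if w not in stops]
    meaningful_words = ' '.join(words)
    
    # Run the spaCy pipeline on the filtered text
    spc_en = spc(meaningful_words)
    
    # Lemmatize tokens tagged as verbs; keep every other token unchanged
    tokens = [token.lemma_ if token.pos_ == 'VERB' else str(token) for token in spc_en]
    
    return " ".join(tokens)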

In [16]:
print(df.review[1])

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.


In [17]:
print(clean_text(df.review[1]))

wonderful little production filming technique unassume old time bbc fashion give comfort sometimes discomforte sense realism entire piece actors extremely well choose michael sheen get polari voices pat truly see seamless editing guide references williams diary entrie well worth watch terrificly write perform piece masterful production one great master comedy life realism really come home little things fantasy guard rather use traditional dream techniques remain solid disappear play knowledge senses particularly scenes concern orton halliwell sets particularly flat halliwell murals decorate every surface terribly well do


Here we strip the HTML tags, keep only the letters of the text, convert everything to lowercase, remove stopwords, and lemmatize the tokens that spaCy tags as verbs.
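To make the order of these steps concrete, here is a minimal sketch that uses only the standard library. It is not the notebook's `clean_text`: the tag stripping uses a simple regular expression instead of BeautifulSoup, `TOY_STOPWORDS` is a tiny hypothetical list standing in for NLTK's English stopwords, and the lemmatization step is left out because it needs a spaCy model.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"a", "the", "is", "and", "of", "to"}

def toy_clean(text):
    # 1. Replace anything that looks like an HTML tag with a space
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # 2. Replace every non-letter character with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)
    # 3. Lowercase and split on whitespace
    tokens = letters_only.lower().split()
    # 4. Drop stopwords and join back into a single string
    return " ".join(t for t in tokens if t not in TOY_STOPWORDS)

sample = "A wonderful little production. <br /><br />The filming technique is very unassuming."
print(toy_clean(sample))
# wonderful little production filming technique very unassuming
```

Removing punctuation before splitting is why "old-time-BBC" in the real review becomes three separate tokens.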

Now let's prepare the input for the Bag of Words:

In [19]:
df['clean_text'] = df['review'].apply(clean_text)

In [24]:
df.head()

| | review | sentiment | clean_text |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 | one reviewers mention watch oz episode hook ri... |
| 1 | A wonderful little production. <br /><br />The... | 1 | wonderful little production filming technique ... |
| 2 | I thought this was a wonderful way to spend ti... | 1 | think wonderful way spend time hot summer week... |
| 3 | Basically there's a family where a little boy ... | 0 | basically family little boy jake think zombie ... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 | petter mattei love time money visually stunnin... |


### Bag of Words

In [20]:
from sklearn.feature_extraction.text import CountVectorizer 

In [25]:
vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,
                             lowercase = False,  
                             max_features=5000,
                             binary=True)

vector = vectorizer.fit_transform(df['clean_text'])

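With `fit_transform()`, the vectorizer learns the vocabulary (limited by `max_features` to the 5000 most frequent terms) and then transforms our data into vectors. Because `binary=True`, each entry records only whether a word occurs in a review, not how many times. The input must be an iterable of strings.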
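The idea behind a binary Bag of Words can be sketched in plain Python. This is an illustration of the concept, not scikit-learn's implementation: `fit_vocabulary` and `transform_binary` are hypothetical helpers, tokens are split on whitespace only, and ties between equally frequent words are broken alphabetically, which is an assumption of this sketch.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    # Count every word across the whole corpus
    counts = Counter(word for doc in docs for word in doc.split())
    # Keep the most frequent words (ties broken alphabetically: an assumption)
    top = sorted(counts, key=lambda w: (-counts[w], w))[:max_features]
    # Assign column indices in alphabetical order
    return {word: idx for idx, word in enumerate(sorted(top))}

def transform_binary(docs, vocab):
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in set(doc.split()):
            if word in vocab:
                row[vocab[word]] = 1  # presence only, not the count
        rows.append(row)
    return rows

docs = ["great movie great cast", "bad movie", "great plot"]
vocab = fit_vocabulary(docs, max_features=3)
print(vocab)                          # {'bad': 0, 'great': 1, 'movie': 2}
print(transform_binary(docs, vocab))  # [[0, 1, 1], [1, 0, 1], [0, 1, 0]]
```

The first document contains "great" twice but its entry is still 1, and "cast" and "plot" are dropped because they fall outside the three-word vocabulary.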

In [26]:
# Convert the sparse matrix to a dense NumPy array
features = vector.toarray()

In [27]:
print(features.shape)

(50000, 5000)


We can see that there are 50000 rows (one per review) and 5000 columns (one per word in our vocabulary).
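A quick back-of-the-envelope check shows what this shape means in memory once the matrix is dense. The sketch assumes each entry is stored as a 64-bit integer, which is `CountVectorizer`'s default `dtype`; a different dtype changes the result proportionally.

```python
n_reviews, n_features = 50000, 5000

# Assumption: int64 entries (8 bytes each), CountVectorizer's default dtype
bytes_per_value = 8

dense_bytes = n_reviews * n_features * bytes_per_value
print(dense_bytes / 1e9, "GB")  # 2.0 GB
```

The sparse matrix returned by `fit_transform()` stores only the non-zero entries, so it is usually much smaller than its dense counterpart.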

### Applying models

#### Splitting the dataset

In [28]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, df["sentiment"], test_size=0.2, random_state=0)

#### Models

In [29]:
from sklearn.ensemble import RandomForestClassifier 
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

In [30]:
# Random Forest
print('Training Random Forest')
forest = RandomForestClassifier(n_estimators = 100)
forest.fit(x_train, y_train)
forest_result = forest.predict(x_test)

# Linear SVM
print('Training SVM')
SVM = svm.LinearSVC()
SVM.fit(x_train, y_train)
svm_result = SVM.predict(x_test)

# Naive Bayes
print('Training Naive Bayes')
nb = GaussianNB()
nb.fit(x_train, y_train)
naive_result = nb.predict(x_test)

# Logistic Regression
print('Training Logistic Regression')
logreg = LogisticRegression(solver='liblinear')
logreg.fit(x_train, y_train)
logreg_result = logreg.predict(x_test)

Training Random Forest
Training SVM
Training Naive Bayes
Training Logistic Regression


### Measuring model performance

In [31]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Predictions from each model trained above, keyed by display name
results = {
    'Random Forest': forest_result,
    'SVM': svm_result,
    'Naive Bayes': naive_result,
    'Logistic Regression': logreg_result,
}

# Report the same three metrics for every model
for name, y_pred in results.items():
    print('\nResults with ' + name + ': ')
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred))
    print('f1 Score: ', f1_score(y_test, y_pred))


Results with Random Forest: 
Accuracy:  0.8427
Confusion Matrix: 
 [[4295  740]
 [ 833 4132]]
f1 Score:  0.8400935244485107

Results with SVM: 
Accuracy:  0.8605
Confusion Matrix: 
 [[4318  717]
 [ 678 4287]]
f1 Score:  0.8600662052362323

Results with Naive Bayes: 
Accuracy:  0.7836
Confusion Matrix: 
 [[4215  820]
 [1344 3621]]
f1 Score:  0.7699340846268339

Results with Logistic Regression: 
Accuracy:  0.8705
Confusion Matrix: 
 [[4377  658]
 [ 637 4328]]
f1 Score:  0.8698623253944326
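As a sanity check, the accuracy and F1 score can be recomputed by hand from a confusion matrix. The sketch below uses the Logistic Regression matrix reported above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the entries read `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix reported above for Logistic Regression.
# Assumed layout: rows = true labels, columns = predicted labels.
tn, fp = 4377, 658
fn, tp = 637, 4328

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Precision: of the reviews predicted positive, how many really are positive
precision = tp / (tp + fp)
# Recall: of the truly positive reviews, how many were found
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(total)     # 10000, the size of the 20% test split
print(accuracy)  # 0.8705
print(round(f1, 4))
```

The total of 10000 matches the 20% test split of the 50000 reviews, and the recomputed values agree with the figures printed by `accuracy_score` and `f1_score`.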


Comparing the results against a dummy classifier baseline:

In [32]:
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy= 'most_frequent').fit(x_train,y_train)
dummy_result = dummy.predict(x_test)

In [33]:
accuracy_score(y_test, dummy_result)

0.4965
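This baseline value can be reproduced from numbers already shown in the notebook. The sketch below reads the test-set class counts off the rows of the Random Forest confusion matrix, assuming rows correspond to true labels, and derives the training-set counts from the 25000 reviews per class in the full dataset.

```python
# Test-set class counts: each row of a confusion matrix sums to one true class
test_negative = 4295 + 740   # row 0: true label 0
test_positive = 833 + 4132   # row 1: true label 1

# The full dataset has 25000 reviews per class; the rest went to training
train_negative = 25000 - test_negative
train_positive = 25000 - test_positive

# 'most_frequent' always predicts the majority class of the training set
majority_class = 1 if train_positive > train_negative else 0
correct = test_positive if majority_class == 1 else test_negative
baseline_accuracy = correct / (test_negative + test_positive)

print(train_negative, train_positive)     # 19965 20035
print(majority_class, baseline_accuracy)  # 1 0.4965
```

Because positive reviews are slightly more common in the training set but slightly less common in the test set, the baseline ends up just under 50%. All four models beat it by a wide margin.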

Exporting the file with the "clean" reviews:

In [34]:
df.to_csv('clean-imdb-reviews.csv')