# Text classification for sentiment analysis

Final project for the deep learning course of the data science graduate program at FURB.
Professor: @luann.porfirio
Student: João Poffo

# Instructor's instructions

Dataset

Instructions:
- The goal of this assignment is to build a binary machine learning model for text classification.
The model uses the [IMDb](http://ai.stanford.edu/~amaas/data/sentiment/) dataset, which consists of positive and negative movie reviews in text form
- Once trained, the model must expose a `predict` function that takes a string as its parameter and returns 1 or 0, where 1 means a positive review and 0 a negative one
- Preprocessing may be implemented as you see fit (e.g., stopword removal, word embedding, one-hot encoding, char encoding)
- A recurrent model (e.g., RNN, LSTM, GRU) is preferred for the classification step
- Document the code (briefly explain what each function does and comment the relevant code sections)
- **Attention**: Once the final model is trained, save it in your project directory and add a cell at the end of the notebook containing a function that loads this file, together with a call to the `predict` function
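
A minimal sketch of the `predict` contract described in the instructions, assuming the trained model can be wrapped as a function that returns one raw score (logit) per text. The names `make_predict` and `toy_score`, and the two word lists, are hypothetical stand-ins for illustration; they are not the model trained in this notebook.

```python
import math

def make_predict(score_fn, threshold=0.5):
    """Wraps a scoring function (text -> logit) into predict(text) -> 0 or 1."""
    def predict(text: str) -> int:
        logit = score_fn(text)
        probability = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the raw score
        return 1 if probability >= threshold else 0
    return predict

# Hypothetical scorer standing in for a trained network
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "boring", "awful"}

def toy_score(text: str) -> float:
    words = text.lower().split()
    return float(sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words))

predict = make_predict(toy_score)
print(predict("a great film"), predict("boring and bad"))
```

With a real network, `score_fn` would tokenize the text, map tokens to vocabulary indices and run the forward pass; the thresholding step stays the same.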

Suggestions:
- Explore the dataset in the first cells of the notebook to get a better understanding of the problem, the data distribution, etc.
- After building the classification pipeline, run a hyperparameter search and compare the results obtained in different configurations

Deadline:
- 1 August 2021 at 23:59 GMT-3

Preferred submission format:
- Post the link to the GitHub project on the course's Ava portal (or attach the project directly on the Ava portal)

luann.porfirio@gmail.com

In [51]:
# Installing libraries
# Note: the torchtext.legacy module used below ships with torchtext 0.9 to 0.11
!pip install torchtext
!pip install gensim
!pip install pandas
!pip install scikit-learn

# Needed only if we used the pre-trained GloVe vocabulary
#!pip install spacy
#!python -m spacy download en_core_web_sm



In [6]:
# Required libraries
import torch
from torch import nn
import torch.optim as optim

from torchtext.legacy import datasets
from torchtext.legacy import data

import random
import pandas
import gensim

from sklearn.feature_extraction.text import TfidfVectorizer

import time


In [32]:
# DATA PREPARATION
# The first experiment uses gensim for tokenization, chosen because of its stemming support.
# PyTorch example (very didactic!): https://github.com/bentrevett/pytorch-sentiment-analysis
# gensim example: https://rohit-agrawal.medium.com/using-fine-tuned-gensim-word2vec-embeddings-with-torchtext-and-pytorch-17eea2883cd

# Seed and deterministic mode so the experiments can be reproduced. In production it is better and faster to turn both off.
SEED = 789
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Default gensim filters, kept below to document the order in which they are applied.
#gensim.parsing.preprocessing.DEFAULT_FILTERS = [
#    lambda x: x.lower(), strip_tags, strip_punctuation,
#    strip_multiple_whitespaces, strip_numeric,
#    remove_stopwords, strip_short, stem_text
#]

# Tokenization function
def tokenize(sentence):
    return gensim.parsing.preprocessing.preprocess_string(sentence)

# Setting up the torchtext field types

# This is where the text field is tied to the gensim tokenizer, so the vocabulary can be built later
TEXT = data.Field(tokenize=tokenize)
LABEL = data.LabelField(dtype = torch.float)
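
To make the filter order concrete, here is a simplified, standard-library sketch of the same steps. It is an approximation for illustration, not gensim's implementation: `simple_preprocess` and the tiny `STOPWORDS` set are hypothetical, and the final stemming step is left out.

```python
import re
import string

# Tiny illustrative stopword subset (gensim ships a much larger list)
STOPWORDS = {"this", "was", "the", "a", "an", "and", "of", "it", "is"}

def simple_preprocess(text, min_len=3):
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r"<[^>]+>", " ", text)                  # 2. strip HTML tags
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)                          # 3. strip punctuation
    text = re.sub(r"\d+", " ", text)                      # 4. strip numbers
    tokens = text.split()                                 # also collapses repeated whitespace
    # 5. remove stopwords and 6. drop short tokens
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]

tokens = simple_preprocess("<br />This movie was GREAT, 10/10!")
print(tokens)
```

Stemming would then shorten the surviving tokens further, which is why the vocabulary shown later in the notebook contains forms such as `movi` and `charact`.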


In [53]:
# Fetches the data and splits it into train, test and validation sets
# - Kept as a function because the data is reloaded for later experiments.
def train_test_valid():

    # Working with IMDb
    train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

    # Splits the training data 70/30 into train and validation.
    # random.seed() returns None, so the generator is seeded first and its state is passed explicitly.
    random.seed(SEED)
    train_data, valid_data = train_data.split(split_ratio = 0.7, random_state = random.getstate())

    return train_data, test_data, valid_data

# Fetches the data and performs the split
train_data, test_data, valid_data = train_test_valid()


In [47]:
# Checking the data split
print('Training examples (35%):', len(train_data), 'test (50%):', len(test_data), ' and validation (15%):', len(valid_data))

Training examples (35%): 17500 test (50%): 25000  and validation (15%): 7500
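
The percentages printed above follow from the 70/30 split of the 25,000 training reviews, measured against all 50,000 reviews. A quick check of that arithmetic (the constant names are illustrative):

```python
TRAIN_REVIEWS = 25_000   # size of the IMDb training set
TEST_REVIEWS = 25_000    # size of the IMDb test set
SPLIT_RATIO = 0.7        # share of the training set kept for training

n_train = round(TRAIN_REVIEWS * SPLIT_RATIO)
n_valid = TRAIN_REVIEWS - n_train
total = n_train + n_valid + TEST_REVIEWS

shares = {}
for name, n in [("train", n_train), ("valid", n_valid), ("test", TEST_REVIEWS)]:
    shares[name] = f"{n / total:.0%}"
    print(f"{name}: {n} ({shares[name]})")
```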


In [54]:
# Experiment with a 30,000-word vocabulary
MAX_VOCAB_SIZE = 30000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

print("Unique words:", len(TEXT.vocab))
print("Unique classes:", len(LABEL.vocab))
print("20 most common words:", TEXT.vocab.freqs.most_common(20))
print("First 10 vocabulary entries, to inspect the structure:", TEXT.vocab.itos[:10])
print("Class mapping:", LABEL.vocab.stoi)

Unique words: 30002
Unique classes: 2
20 most common words: [('movi', 36384), ('film', 34043), ('like', 16064), ('time', 11242), ('good', 10752), ('charact', 9860), ('watch', 9781), ('stori', 9233), ('scene', 7433), ('look', 7114), ('end', 6838), ('bad', 6587), ('peopl', 6489), ('great', 6383), ('love', 6285), ('think', 6262), ('wai', 6192), ('act', 6183), ('plai', 6165), ('thing', 5778)]
First 10 vocabulary entries, to inspect the structure: ['<unk>', '<pad>', 'movi', 'film', 'like', 'time', 'good', 'charact', 'watch', 'stori']
Class mapping: defaultdict(None, {'neg': 0, 'pos': 1})
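
The vocabulary has 30,002 entries rather than 30,000 because the two special tokens `<unk>` and `<pad>` are added on top of the most frequent words. A standard-library sketch of that idea (the `build_vocab` helper and the toy documents are hypothetical; this is not torchtext's implementation):

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size):
    """Keeps the max_size most frequent tokens, after the two special tokens."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = ["<unk>", "<pad>"] + [w for w, _ in counts.most_common(max_size)]
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

docs = [["movi", "good"], ["movi", "bad"], ["film", "movi", "good"]]
itos, stoi = build_vocab(docs, max_size=2)
print(itos)

# Words that did not make the cut are mapped to the <unk> index
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["movi", "film"]]
print(ids)
```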


In [55]:
# Creating batches of 64 through the iterator for training
BATCH_SIZE = 64
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE)

# Plain, simple recurrent neural network class
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        
        super().__init__()
        
        # The embedding layer turns a sparse dictionary (one-hot) vector into a dense one.
        # In theory, words with a similar impact on the sentiment classification are mapped close to each other.
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):

        #text = [sent len, batch size]
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]        
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        
        return self.fc(hidden.squeeze(0))

# The number of input features is the number of words in our dictionary
INPUT_DIM = len(TEXT.vocab)

# Size of each dense embedding vector
EMBEDDING_DIM = 100

# Size of the hidden layer
HIDDEN_DIM = 256

# Size of the output
OUTPUT_DIM = 1

# Creates the model with the hyperparameters defined above
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

# Prints the architecture
print(model)

# Counts the trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'This model has {count_parameters(model):,} trainable parameters.')

RNN(
  (embedding): Embedding(30002, 100)
  (rnn): RNN(100, 256)
  (fc): Linear(in_features=256, out_features=1, bias=True)
)
This model has 3,092,105 trainable parameters.
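
The parameter count can be reproduced by hand from the layer sizes printed above. The sketch below assumes the default PyTorch layouts: `nn.RNN` keeps an input-to-hidden matrix, a hidden-to-hidden matrix and two bias vectors, and `nn.Linear` keeps one weight matrix and one bias.

```python
vocab, emb, hid, out = 30002, 100, 256, 1

embedding_params = vocab * emb                 # one vector per vocabulary entry
rnn_params = emb * hid + hid * hid + 2 * hid   # W_ih, W_hh and the two biases
fc_params = hid * out + out                    # weights plus bias

total = embedding_params + rnn_params + fc_params
print(f"{embedding_params:,} + {rnn_params:,} + {fc_params:,} = {total:,}")
```

Almost all of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size here.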


In [56]:
# The first experiment uses a simple optimizer: stochastic gradient descent
# - learning_rate = 0.001
optimizer = optim.SGD(model.parameters(), lr=1e-3)

# The loss is torch's BCEWithLogitsLoss, which combines the sigmoid and the binary cross-entropy.
criterion = nn.BCEWithLogitsLoss()

# Computes the binary accuracy.
def binary_accuracy(preds, y):
    """
    Returns the accuracy over the batch. For example, 6 correct out of 10 returns 0.6.
    The sigmoid output is rounded to one of the two classes. For example: 0.6 -> 1 (positive), 0.3 -> 0 (negative).
    """

    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc

# Training function, taking the model, the data, the optimizer and the loss as arguments
def train(model, iterator, optimizer, criterion):
    
    # accumulated loss and accuracy
    epoch_loss = 0
    epoch_acc = 0
    
    # Puts the model in training mode
    model.train()
    
    # For each batch of data
    for batch in iterator:
        
        optimizer.zero_grad()

        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# Validation function
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# Helper that converts the elapsed time of an epoch into minutes and seconds
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# Number of epochs used in training
N_EPOCHS = 5

# The training loop is wrapped in a function because every model reuses it
def process(model, train_fnc=train, evaluate_fnc=evaluate, save_name='tf-model1.pt'):
    # Tracks the best validation loss so that only the best epoch is saved
    best_valid_loss = float('inf')

    for epoch in range(N_EPOCHS):

        start_time = time.time()
        
        train_loss, train_acc = train_fnc(model, train_iterator, optimizer, criterion)
        valid_loss, valid_acc = evaluate_fnc(model, valid_iterator, criterion)
        
        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), save_name)
        
        print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
        print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
        print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

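# Runs the training of the model.
# If this cell runs after TEXT has been redefined with include_lengths=True (as in the LSTM cell further down),
# batch.text is a (text, lengths) tuple and the embedding lookup fails with the TypeError shown below.
process(model)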

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not tuple
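
The loss and the accuracy used in the training cell can be checked by hand. The sketch below is a pure-Python illustration under the usual numerically stable formulation of binary cross-entropy on logits, `max(x, 0) - x*y + log(1 + exp(-|x|))`; the function names are illustrative and this is not PyTorch's code.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy computed directly on the raw score (logit)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def binary_accuracy(logits, targets):
    # sigmoid(x) rounds to 1 only when x > 0, so the sign of the logit decides the class
    preds = [1.0 if x > 0 else 0.0 for x in logits]
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)

logits = [2.0, -1.0, 0.5, -3.0]
targets = [1.0, 0.0, 0.0, 0.0]

losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
print([round(l, 3) for l in losses])
print(binary_accuracy(logits, targets))
```

Working on logits avoids computing `log(sigmoid(x))` explicitly, which would lose precision for large positive or negative scores.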

In [46]:
# Very poor results when using only the N most frequent words

import torch.nn as nn

class MultiLayerRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        
        #text = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(text))
        
        #embedded = [sent len, batch size, emb dim]
        
        #pack sequence
        # lengths need to be on CPU!
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        #output = [sent len, batch size, hid dim * num directions]
        #output over padding tokens are zero tensors
        
        #hidden = [num layers * num directions, batch size, hid dim]
        #cell = [num layers * num directions, batch size, hid dim]
        
        #concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        #and apply dropout
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                
        #hidden = [batch size, hid dim * num directions]
            
        return self.fc(hidden)

def train_with_sequences(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        
        predictions = model(text, text_lengths).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate_with_sequences(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# Improving the network with an LSTM. Packed sequences need the length of each sentence,
# so the field returns it together with the padded batch (include_lengths=True).
TEXT = data.Field(tokenize = tokenize,
                  include_lengths = True)

# The data must be reloaded because the field changed
train_data, test_data, valid_data = train_test_valid()

TEXT.build_vocab(train_data, 
    max_size = MAX_VOCAB_SIZE, 
    # The GloVe experiment is left for later
    #vectors = "glove.6B.100d", 
    #unk_init = torch.Tensor.normal_
    )
LABEL.build_vocab(train_data)

# The vocabulary was rebuilt, so the input size is recomputed; the other sizes are reused from the first model
INPUT_DIM = len(TEXT.vocab)
#EMBEDDING_DIM = 100
#HIDDEN_DIM = 256
#OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model2 = MultiLayerRNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

#pretrained_embeddings = TEXT.vocab.vectors

#print(pretrained_embeddings.shape)

#model2.embedding.weight.data.copy_(pretrained_embeddings)

#UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

#model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
#model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

#print(model.embedding.weight.data)

optimizer = optim.Adam(model2.parameters())

# Packed padded sequences also require the tensors within a batch to be sorted by length. The iterator handles this when sort_within_batch = True.
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True)

process(model2, train_with_sequences, evaluate_with_sequences, 'tut2-model.pt')


Epoch: 01 | Epoch Time: 65m 56s
	Train Loss: 0.641 | Train Acc: 63.00%
	 Val. Loss: 0.572 |  Val. Acc: 72.01%
Epoch: 02 | Epoch Time: 58m 7s
	Train Loss: 0.537 | Train Acc: 72.92%
	 Val. Loss: 0.477 |  Val. Acc: 78.10%
Epoch: 03 | Epoch Time: 57m 58s
	Train Loss: 0.450 | Train Acc: 79.18%
	 Val. Loss: 0.412 |  Val. Acc: 82.54%
Epoch: 04 | Epoch Time: 57m 7s
	Train Loss: 0.414 | Train Acc: 81.33%
	 Val. Loss: 0.387 |  Val. Acc: 83.98%
Epoch: 05 | Epoch Time: 57m 41s
	Train Loss: 0.371 | Train Acc: 84.23%
	 Val. Loss: 0.403 |  Val. Acc: 83.85%
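
The `process` function only saves a checkpoint when the validation loss improves. Replaying that rule on the validation losses logged above shows which epoch ends up in the saved file:

```python
# Validation losses copied from the training log above (epochs 1 to 5)
val_losses = [0.572, 0.477, 0.412, 0.387, 0.403]

best_loss = float("inf")
best_epoch = None
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:       # same condition used inside process()
        best_loss = loss
        best_epoch = epoch     # this is where torch.save would be called

print(best_epoch, best_loss)
```

The validation loss rises again in epoch 5 while the training loss keeps falling, so the saved weights are those of epoch 4.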


In [None]:
# Using gensim's word2vec to train an embedding model
# Main example at: https://www.kaggle.com/guichristmann/lstm-classification-model-with-word2vec
# Note: the `size` argument and `wv.vocab` used in this notebook belong to the gensim 3.x API

embedding_dim = 100
window_size = 10
min_count = 1

w2v_model = gensim.models.Word2Vec(sentences=list(train_data.text), size=embedding_dim, window=window_size, min_count=min_count, workers=4)
w2v_weights = w2v_model.wv.vectors
vocab_size, embedding_size = w2v_weights.shape

print("Vocabulary Size: {} - Embedding Dim: {}".format(vocab_size, embedding_size))

print('good', w2v_model.wv.most_similar('good')) 
print('bad', w2v_model.wv.most_similar('bad'))
# It is not clear why "bad" comes out similar to "good" (possibly because of "not bad"), but overall the similarities look very good. The next step is to see how to train with them.

import torchtext.vocab as vocab


In [None]:
bad_words = w2v_model.wv.most_similar('bad', topn=100)

# w2v_model.wv.sy  (unfinished expression, kept commented so the cell can run)

#from torchtext.vocab import Vectors

# Builds the inputs expected by Vocab.set_vectors:
# a word -> row mapping and one tensor per row
vectors = []
stoi = {}
for word, _similarity in bad_words:
  stoi[word] = len(stoi)
  vectors.append(torch.tensor(w2v_model.wv[word]))


#TEXT.build_vocab(train_data, max_size = len(bad_words_vec), vectors=bad_words_vec)
TEXT.vocab.set_vectors(stoi, vectors, embedding_dim)

In [None]:
# Loads the checkpoint written by process(model), whose default file name is 'tf-model1.pt'
model.load_state_dict(torch.load('tf-model1.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')



In [None]:
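# Turns the data into dataframes so we can play with it
# Assumption: train_iter and test_iter yield (sentiment, text) pairs; they are not defined in the cells above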
columns = ['sentiment', 'text']
train_df = pandas.DataFrame(train_iter, columns=columns)
test_df = pandas.DataFrame(test_iter, columns=columns)

# Taking a look at the records
train_df

In [None]:
# Just checking which labels exist: pos (assumed to be positive) and neg (negative).
train_df.sentiment.value_counts()

In [None]:
#train_model = gensim.models.Word2Vec(train_df[1])

# The default filters are enough for now
#gensim.parsing.preprocessing.DEFAULT_FILTERS = [
#    lambda x: x.lower(), strip_tags, strip_punctuation,
#    strip_multiple_whitespaces, strip_numeric,
#    remove_stopwords, strip_short, stem_text
#]

def preprocessor(text):
  return gensim.parsing.preprocessing.preprocess_string(text)

# As the filter list above shows: lowercase first, then remove tags, punctuation, repeated whitespace, numbers and stopwords, drop short words, and finally apply stemming
train_df['preprocessed_words'] = train_df.text.apply(preprocessor)
test_df['preprocessed_words'] = test_df.text.apply(preprocessor)

train_df

In [None]:
# First experiment: one TF-IDF vectorizer per class (pos/neg)

pos_vectorizer = TfidfVectorizer() #tokenizer=gensim.parsing.preprocessing.preprocess_string)
X = pos_vectorizer.fit_transform(train_df[train_df.sentiment == 'pos'].text)
print(pos_vectorizer.get_feature_names())
print(X.shape)

neg_vectorizer = TfidfVectorizer() #tokenizer=gensim.parsing.preprocessing.preprocess_string)
X = neg_vectorizer.fit_transform(train_df[train_df.sentiment == 'neg'].text)



In [None]:
# Checking the value stored for each word in each class
testing_words = ['good', 'bad', 'nice', 'obnoxious', 'preview', 'old', 'luiz', 'not', 'no']

print('word', 'positive', 'negative')
for w in testing_words : print(w, pos_vectorizer.vocabulary_.get(w,0), neg_vectorizer.vocabulary_.get(w, 0))

# TF-IDF is a poor fit for this goal: it favours words that are frequent in a document but present in few documents, which usually are the words that characterise a single document rather than a sentiment.
# Note that vocabulary_ maps each term to its column index, not to a TF-IDF weight. The numbers printed here come out slightly higher for pos regardless of the word's meaning, which reflects how the two vocabularies are ordered, not the sentiment.
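
To see why rare words get the highest weights, the sketch below computes the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn applies by default (`smooth_idf=True`). The three toy documents and the helper name are hypothetical.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: rare terms get larger values."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

docs = [["good", "movi"], ["good", "film"], ["obnoxi", "film"]]

# Document frequency: in how many documents each term appears
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

idf = {term: smooth_idf(len(docs), df) for term, df in doc_freq.items()}
for term in sorted(idf, key=idf.get, reverse=True):
    print(term, round(idf[term], 3))
```

A sentiment word such as `good` appears in many reviews of both classes, so its weight is low; in a fitted `TfidfVectorizer` the per-term weight is available as `idf_[vocabulary_[term]]`.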

In [None]:
# Scanning the training dataset to analyse the frequency of each word per sentiment
group = train_df.explode('preprocessed_words').groupby(["preprocessed_words", "sentiment"]).count()
group = group.unstack().fillna(0)
group.sort_values([('text', 'neg')], ascending=False) 

# Interestingly, words such as like and good appear with similar frequencies in both sentiments


In [None]:
# Using gensim's word2vec to train an embedding model
# Main example at: https://www.kaggle.com/guichristmann/lstm-classification-model-with-word2vec

embedding_dim = 100
window_size = 10
min_count = 1

w2v_model = gensim.models.Word2Vec(sentences=train_df['preprocessed_words'], size=embedding_dim, window=window_size, min_count=min_count, workers=4)
w2v_weights = w2v_model.wv.vectors
vocab_size, embedding_size = w2v_weights.shape

print("Vocabulary Size: {} - Embedding Dim: {}".format(vocab_size, embedding_size))

In [None]:
print('good', w2v_model.wv.most_similar('good')) 
print('bad', w2v_model.wv.most_similar('bad'))

# It is not clear why "bad" comes out similar to "good" (possibly because of "not bad"), but overall the similarities look very good. The next step is to see how to train with them.
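
`most_similar` ranks words by cosine similarity between their vectors. One likely explanation for "bad" appearing close to "good" is that word2vec learns from surrounding words, and antonyms tend to occur in the same contexts ("a good movie", "a bad movie"). The sketch below uses made-up 3-dimensional vectors purely to illustrate the measure; they are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors: "good" and "bad" share context, "banana" does not
vectors = {
    "good":   [0.9, 0.1, 0.8],
    "bad":    [0.8, -0.2, 0.9],
    "banana": [-0.5, 0.9, 0.0],
}

sim_good_bad = cosine_similarity(vectors["good"], vectors["bad"])
sim_good_banana = cosine_similarity(vectors["good"], vectors["banana"])
print(round(sim_good_bad, 3), round(sim_good_banana, 3))
```

Similarity in this sense means "used in similar contexts", not "same polarity", so the classifier still has to learn the sentiment direction on top of the embeddings.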

In [None]:
import numpy as np 

def encode_word(word):
  return w2v_model.wv.distance(word, 'bad')

def encode_words(words):
  return np.array([encode_word(w) for w in words])

def vocab_size():
  return len(w2v_model.wv.vocab)

# Helper functions to convert a word to its index and an index back to its word
def word2token(word):
    try:
        return w2v_model.wv.vocab[word].index
    except KeyError:
        return -1

def token2word(token):
    return w2v_model.wv.index2word[token]  

def to_x(preprocessed_words):
  # Multi-hot (bag-of-words) vector over the word2vec vocabulary.
  # Unknown words are skipped, since most_similar() cannot be queried with an out-of-vocabulary word.
  data = np.zeros(vocab_size(), dtype=np.float32)
  for w in preprocessed_words:
    index = word2token(w)
    if index >= 0:
      data[index] = 1
  return torch.from_numpy(data)

def to_y(sentiment):
  return torch.from_numpy(np.array([1 if sentiment == 'pos' else 0]))

words = train_df.preprocessed_words[0]
encoded = encode_words(words)

# Testing the encoding
print(words[:20])
print(encoded[:20])


In [None]:
from torch import nn

class WordRNN(nn.Module):

    def __init__(self, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()

        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        input_size = vocab_size()
        
        # LSTM layer -> input_size, hidden_size, num_layers, batch_first
        self.lstm = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
        
        # dropout layer -> drop_prob
        self.dropout = nn.Dropout(p=drop_prob)
        
        # fully connected layer -> n_hidden inputs, input_size outputs
        self.fc = nn.Linear(n_hidden, input_size)
      
    
    def forward(self, x, hidden):
               
        # LSTM layer
        r_output, hidden = self.lstm(x, hidden)
        
        # dropout
        out = self.dropout(r_output)
        
        # flattens to (batch_size * seq_length, n_hidden)
        out = out.contiguous().view(-1, self.n_hidden)
        
        # fully connected layer
        out = self.fc(out)
        
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        # Creates zero tensors of shape n_layers x batch_size x n_hidden (hidden state and cell state)
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        return hidden


In [None]:
class RNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        '''
        input_size - dimension of the input
        hidden_dim - number of features in the hidden state
        n_layers - number of RNN layers (usually between 1 and 3)
        '''
        super(RNN, self).__init__()
        self.hidden_dim = hidden_dim
        #RNN
        #batch_first - batch_size is the first dimension of the input: (batch_size, seq_length, input_size)
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)
        #fully-connected
        self.fc = nn.Linear(hidden_dim, output_size)

        
    def forward(self, x, hidden):
        '''
        x - (batch_size, seq_length, input_size)
        hidden - (n_layers, batch_size, hidden_dim)
        r_out - (batch_size, seq_length, hidden_dim)
        '''
        
        batch_size = x.size(0)
        
        # RNN output
        r_out, hidden = self.rnn(x, hidden)
        
        #reshape: (batch_size*seq_length, hidden_dim)
        r_out = r_out.view(-1, self.hidden_dim)  
        
        #prediction
        output = self.fc(r_out)
        
        return output, hidden

In [None]:
def train(net, data, epochs=10, batch_size=1, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    # Each review is fed as a single multi-hot vector, so batch_size must be 1.
    # val_frac is reserved for a validation split, which this experiment does not implement.
    net.train()
    
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    for e in range(epochs):
        h = net.init_hidden(batch_size)
        
        for counter, row in data.iterrows():
            
            # Multi-hot encoding of the review, shaped (batch, seq_len, vocab)
            x = to_x(row.preprocessed_words).float()
            x = x.reshape(1, 1, len(x))
            y = to_y(row.sentiment).long()
            
            # Detaches the hidden state so gradients do not flow back into previous reviews
            h = tuple([each.data for each in h])

            net.zero_grad()            
            
            # model output: (1, vocab_size) scores
            output, h = net(x, h)
            
            # the target is the class index (0 or 1)
            loss = criterion(output, y)
            loss.backward()
            
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
            
            # progress report
            if counter % print_every == 0:
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}".format(loss.item()))
                
net = WordRNN()
print(net)

n_epochs = 10

train(net, train_df, epochs=n_epochs, batch_size=1, lr=0.001, print_every=10)

In [None]:
def train(rnn, data, n_steps, print_every):
    
    # Assumption: loss and optimizer are created here so they are bound to this network
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(rnn.parameters(), lr=0.001)

    # initializes the hidden state
    hidden = None      

    for step in range(n_steps):

      for counter, row in data.iterrows():

        # (batch=1, seq_len=1, vocab) input and a (1, 1) float target
        x = to_x(row.preprocessed_words).float().reshape(1, 1, -1)
        y = to_y(row.sentiment).float().reshape(1, 1)

        # output of the RNN block
        prediction, hidden = rnn(x, hidden)
        # detaches the hidden state from the graph of the previous review
        hidden = hidden.data

        # computes the loss
        loss = criterion(prediction, y)
        optimizer.zero_grad()

        # backpropagation and weight update
        loss.backward()
        optimizer.step()

        # display
        if counter % print_every == 0:        
            print('Loss: ', loss.item())

    return rnn

# hyperparameters
n_steps = 10
input_size = len(w2v_model.wv.vocab)
output_size = 1
hidden_dim = 255
n_layers = 2
print_every = 30

net = RNN(input_size=input_size, output_size=output_size, hidden_dim=hidden_dim, n_layers = n_layers)
print(net)

trained_rnn = train(net, train_df, n_steps, print_every)

In [None]:
class RNN2(nn.Module):
    def __init__(self, input_size, hidden_size, vocab_size, output_size):
        super(RNN2, self).__init__()
        self.hidden_size = hidden_size
        self.word_embeddings = nn.Embedding(vocab_size, input_size)
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    def forward(self, word, hidden):
        embeds = self.word_embeddings(word)
        combined = torch.cat((embeds.view(1, -1), hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden
    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)


# Think of the embedding as a lookup table (https://www.tensorflow.org/text/guide/word_embeddings)
EMBEDDING_DIM = len(w2v_model.wv.vectors[0])
HIDDEN_DIM = 10
output_size = 2

# creating an instance of RNN
rnn = RNN2(EMBEDDING_DIM, HIDDEN_DIM, vocab_size(), output_size)

print(rnn)

In [None]:
def train(data, epochs, print_every, lr=0.01):
  # Assumption: NLLLoss (it pairs with the LogSoftmax output of RNN2) and plain SGD;
  # the notebook does not define a loss or optimizer for this model
  loss_function = nn.NLLLoss()
  optimizer = torch.optim.SGD(rnn.parameters(), lr=lr)

  for i_epoch in range(epochs):
      if not i_epoch % print_every:
          print("Finished " + str(i_epoch / epochs * 100)  + "% of the epochs")
      
      for i_count, row in data.iterrows():
          # Step 1. Remember that PyTorch accumulates gradients, so they are cleared
          # before each instance. The hidden state is also reset, detaching it
          # from the history of the previous instance.
          hidden = rnn.init_hidden()
          rnn.zero_grad()
          
          # Step 2. Get the inputs ready for the network: a tensor of word indices
          # (unknown words are skipped) and the class index as target.
          indices = [word2token(w) for w in row.preprocessed_words]
          sentence_in = torch.tensor([i for i in indices if i >= 0], dtype=torch.long)
          if len(sentence_in) == 0:
              continue
          target_class = to_y(row.sentiment).long()

          # Step 3. Run the forward pass, one word at a time.
          for i in range(len(sentence_in)):
              class_scores, hidden = rnn(sentence_in[i], hidden)

          # Step 4. Compute the loss and gradients, and update the parameters
          loss = loss_function(class_scores, target_class)
          loss.backward()
          optimizer.step()

In [None]:
# Example from https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/
class SentenceClassifier(nn.Module):
    
    #define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, n_batch_size):
        
        #Constructor
        super().__init__()          
        
        #embedding layer
        #self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        #lstm layer
        self.lstm = nn.LSTM(vocab_size, #embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout,
                           batch_first=True)
        
        #dense layer
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        #activation function
        #self.act = nn.ReLU()
        self.act = nn.Sigmoid()
        
    def forward(self, word_vector): #text, text_lengths):
        
        #text = [batch size,sent_length]
        #embedded = self.embedding(text)
        #embedded = [batch size, sent_len, emb dim]
      
        #packed sequence
        #packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths,batch_first=True)
        #packed_words = nn.utils.rnn.pack_padded_sequence(word_vector, 1, batch_first=True)
        
        packed_output, (hidden, cell) = self.lstm(word_vector) #packed_embedded)
        #hidden = [num layers * num directions, batch size, hid dim] (batch_first does not affect hidden/cell)
        #cell = [num layers * num directions, batch size, hid dim]
        
        #concat the final forward and backward hidden state
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
                
        #hidden = [batch size, hid dim * num directions]
        dense_outputs=self.fc(hidden)

        #Final activation function
        outputs=self.act(dense_outputs)
        
        return outputs



# Example from https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/
class SentenceClassifierEmb(nn.Module):
    
    #define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, n_batch_size):
        
        #Constructor
        super().__init__()          
        
        #embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        #lstm layer
        self.lstm = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout,
                           batch_first=True)
        
        #dense layer
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        #activation function
        #self.act = nn.ReLU()
        self.act = nn.Sigmoid()
        
    def forward(self, word_vector, text_lengths=None):
        
        #word_vector = [batch size, sent_length] (LongTensor of token indices)
        embedded = self.embedding(word_vector)
        #embedded = [batch size, sent_len, emb dim]
      
        #packed sequence: pack_padded_sequence requires the true length of each
        #sequence; when no lengths are given the padded batch goes to the LSTM as-is
        if text_lengths is not None:
            lstm_input = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths,
                                                           batch_first=True, enforce_sorted=False)
        else:
            lstm_input = embedded
        
        packed_output, (hidden, cell) = self.lstm(lstm_input)
        #hidden = [num layers * num directions, batch size, hid dim]
        #cell = [num layers * num directions, batch size, hid dim]
        
        #concat the final forward and backward hidden state
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
                
        #hidden = [batch size, hid dim * num directions]
        dense_outputs=self.fc(hidden)

        #Final activation function
        outputs=self.act(dense_outputs)
        
        return outputs
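
To check the shapes that the `forward` methods above rely on, a small stand-alone experiment can help. This is a sketch with made-up sizes (not the notebook's hyperparameters) that packs a padded batch with its true lengths and inspects the final hidden state of a two-layer bidirectional LSTM.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not the notebook's values)
batch_size, max_len, emb_dim, hid_dim, n_layers = 3, 5, 8, 4, 2

lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers,
               bidirectional=True, batch_first=True)

# A padded batch and the true (unpadded) length of each sequence
embedded = torch.randn(batch_size, max_len, emb_dim)
lengths = torch.tensor([5, 3, 2])

packed = pack_padded_sequence(embedded, lengths,
                              batch_first=True, enforce_sorted=False)
_, (hidden, cell) = lstm(packed)

# hidden is [num_layers * num_directions, batch, hid_dim], even with batch_first=True
print(hidden.shape)

# Last layer: hidden[-2] is the forward direction, hidden[-1] the backward one
final_hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
print(final_hidden.shape)
```

The concatenated tensor has `hid_dim * 2` features per example, which is why the dense layer is declared with `hidden_dim * 2` inputs when the LSTM is bidirectional.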

In [None]:
#define hyperparameters
size_of_vocab = vocab_size()
embedding_dim = len(w2v_model.wv.vectors[0])
num_hidden_nodes = 255
num_output_nodes = 1
num_layers = 2
bidirection = True
dropout = 0.2
batch_size = 200

#instantiate the model
model = SentenceClassifierEmb(size_of_vocab, embedding_dim, num_hidden_nodes, num_output_nodes, num_layers, 
                   bidirectional = bidirection, dropout = dropout, n_batch_size = batch_size)

#architecture
print(model)

#No. of trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
print(f'The model has {count_parameters(model):,} trainable parameters')

#define optimizer and loss
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.BCELoss()

#define metric
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(preds)
    print(preds, y, rounded_preds)
    
    correct = (rounded_preds == y).float() 
    acc = correct.sum() / len(correct)
    return acc

def get_batches(iterator, batch_count):
    counter = 0
    total_count = 0
    x = np.array([])
    y = np.array([])
    for index, row in iterator:
      #print(row)
      counter = counter + 1
      total_count = total_count + 1
      #print((x, row.preprocessed_words))
      #x =  np.concatenate((x, to_x(row.preprocessed_words)))
      x =  np.concatenate((x, row.preprocessed_words))
      #print((y, row.sentiment))
      y = np.concatenate((y, to_y(row.sentiment)))
      #print(row.sentiment, y)
      
      while counter >= batch_count:
        yield x, torch.from_numpy(y).float(), counter, total_count
        #yield torch.from_numpy(x).float(), torch.from_numpy(y).float(), counter, total_count
        counter = 0
        x = np.array([])
        y = np.array([])

    if counter > 0:
      yield x, torch.from_numpy(y).float(), counter, total_count
      #yield torch.from_numpy(x).float(), torch.from_numpy(y).float(), counter, total_count
    
def train(model, dataset, optimizer, criterion, n_batch_count=20):
    
    #initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    #set the model in training phase
    model.train()  
    
    import torchtext

    for batch_x, batch_y, batch_count, total_count in get_batches(dataset.iterrows(), n_batch_count):
    #for batch in torchtext.data.BucketIterator(iterator, n_batch_count):
    #for batch in data.BucketIterator.splits(iterator, n_batch_count):
    

        #resets the gradients after every batch
        optimizer.zero_grad()   
        
        #retrieve text and no. of words

        #x = to_x(row.preprocessed_words).float()
        #print(x)
        #y = to_y(row.sentiment).float()
        
        # [samples,time steps,features]
        x = batch_x #.reshape(batch_count, 1, vocab_size())
        #x = batch_x.reshape(batch_count, 1, vocab_size())
        y = batch_y
        #print(x.shape, y.shape)
        #print(x)

        #convert to 1D tensor
        predictions = model(x).squeeze()
        #print(predictions.shape)
        #print(predictions, y)
        
        #compute the loss
        loss = criterion(predictions, y)
        
        #compute the binary accuracy
        acc = binary_accuracy(predictions, y)
        
        #backpropagate the loss and compute the gradients
        loss.backward()       
        
        #update the weights
        optimizer.step()      
        
        #loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()

        print('batch', total_count, epoch_loss, epoch_acc)

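    #loss and accuracy are per-batch means, so average over the number of batches, not rows
    n_batches = max(1, (total_count + n_batch_count - 1) // n_batch_count)
    return epoch_loss / n_batches, epoch_acc / n_batches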


N_EPOCHS = 5
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
    train_loss, train_acc = train(model, train_df, optimizer, criterion, batch_size)

    #evaluate the model
    #valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    #save the best model
    #if valid_loss < best_valid_loss:
    #    best_valid_loss = valid_loss
    #    torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    #print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
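
`get_batches` above concatenates every row into one flat array, so the boundaries between reviews are lost before the data reaches the embedding layer. One possible alternative is to pad each batch to a common length and keep the true lengths alongside it. The sketch below assumes each review is already a list of integer token indices and that index 0 is reserved for padding. `PAD_INDEX`, `collate_batch` and `iter_batches` are hypothetical names, not part of the notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_INDEX = 0  # hypothetical: assumes index 0 is reserved for padding in the vocabulary

def collate_batch(rows):
    """rows: list of (token_indices, label) pairs -> (padded inputs, lengths, labels)."""
    sequences = [torch.tensor(tokens, dtype=torch.long) for tokens, _ in rows]
    lengths = torch.tensor([len(seq) for seq in sequences])
    # pad every sequence to the longest one in this batch: [batch size, max sent_len]
    padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_INDEX)
    labels = torch.tensor([label for _, label in rows], dtype=torch.float)
    return padded, lengths, labels

def iter_batches(rows, batch_size):
    """Yield one collated batch per slice of batch_size rows (last one may be smaller)."""
    for start in range(0, len(rows), batch_size):
        yield collate_batch(rows[start:start + batch_size])

# toy data: (token indices, sentiment label)
toy_rows = [([4, 9, 2], 1), ([7, 3], 0), ([5, 8, 6, 1], 1)]
batches = list(iter_batches(toy_rows, batch_size=2))

for padded, lengths, labels in batches:
    print(padded.shape, lengths.tolist(), labels.tolist())
```

The `lengths` tensor is what `pack_padded_sequence` expects as its second argument, and the float `labels` match what `nn.BCELoss` requires as a target.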

In [None]:
#Visualizing Word2Vec Embeddings with t-SNE
from sklearn.manifold import TSNE
import random
import matplotlib.pyplot as plt

rnn = WordRNN()

n_samples = 500
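# Sample random words from model dictionary
# (vocab_size is a function elsewhere in this notebook, so it must be called)
random_i = random.sample(range(vocab_size()), n_samples)
random_w = [rnn.token2word(i) for i in random_i]

# Generate Word2Vec embeddings of each word
# (vectors are looked up through .wv; indexing the model directly fails in gensim 4)
word_vecs = np.array([w2v_model.wv[w] for w in random_w])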

# Apply t-SNE to Word2Vec embeddings, reducing to 2 dims
tsne = TSNE()
tsne_e = tsne.fit_transform(word_vecs)

# Plot t-SNE result
plt.figure(figsize=(32, 32))
plt.scatter(tsne_e[:, 0], tsne_e[:, 1], marker='o', c=range(len(random_w)), cmap=plt.get_cmap('Spectral'))

for label, x, y, in zip(random_w, tsne_e[:, 0], tsne_e[:, 1]):
    plt.annotate(label,
                 xy=(x, y), xytext=(0, 15),
                 textcoords='offset points', ha='right', va='bottom',
                 bbox=dict(boxstyle='round, pad=0.2', fc='yellow', alpha=0.1))