## **Introduction to Word Embeddings**

Distributional hypothesis: words have similar meanings when they are used in similar contexts.
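
The distributional hypothesis can be illustrated with a minimal sketch. The corpus below is hypothetical and the helper names (`context_counts`, `cosine`) are made up for this example; it only counts neighbouring words and compares the resulting count vectors, which is far simpler than what Word2Vec does.

```python
from collections import Counter
import math

# Toy corpus (hypothetical sentences, for illustration only)
corpus = [
    "o gato bebe leite",
    "o cachorro bebe agua",
    "o gato come racao",
    "o cachorro come racao",
    "a menina le livro",
    "a menina escreve carta",
]

def context_counts(word, sentences, window=2):
    """Count the words that appear within `window` positions of `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

gato = context_counts("gato", corpus)
cachorro = context_counts("cachorro", corpus)
menina = context_counts("menina", corpus)

sim_gato_cachorro = cosine(gato, cachorro)
sim_gato_menina = cosine(gato, menina)
print(sim_gato_cachorro, sim_gato_menina)
```

In this toy corpus, "gato" and "cachorro" share most of their neighbours, so their context vectors end up closer to each other than to the vector for "menina".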

Language models:
- predict the next word, given a sequence of words;
- are used in speech processing, spelling autocorrection, etc.
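
As a rough illustration of "predicting the next word", the sketch below trains a bigram model on a hypothetical text. The names `train_bigram_model` and `predict_next` are assumptions for this example; real language models are much more elaborate.

```python
from collections import Counter, defaultdict

# Hypothetical training text, for illustration only
training_text = "o gato bebe leite . o gato come racao . o gato bebe agua ."

def train_bigram_model(text):
    """Count, for each word, which words follow it in the text."""
    followers = defaultdict(Counter)
    tokens = text.split()
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

bigram_model = train_bigram_model(training_text)
print(predict_next(bigram_model, "gato"))
```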

But what are **word embeddings**?

They are vector representations of words: machine learning is used to assign a value to each attribute (dimension) of the vector.

- text ➜ numbers

One of the algorithms used for this: *Word2Vec*.

### Using *word2vec* in *spaCy*:

Find pretrained models and download them.
- [NILC Word Embeddings Repository](http://www.nilc.icmc.usp.br/embeddings)

In [1]:
import spacy

In [2]:
!pip install -U spacy

Requirement already up-to-date: spacy in /usr/local/lib/python3.7/dist-packages (3.0.6)


Using the 100-dimensional embeddings as an example:

### **CBOW** model as an example:

In [3]:
!python -m spacy init vectors pt '/content/drive/MyDrive/Cursos/PLN_USP/embeddings/cbow_s100.zip' vec_spacy_cbow_100

2021-05-28 11:06:54.102238: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
ℹ Creating blank nlp object for language 'pt'
[2021-05-28 11:06:55,917] [INFO] Reading vectors from /content/drive/MyDrive/Cursos/PLN_USP/embeddings/cbow_s100.zip
929606it [00:38, 24083.21it/s]
[2021-05-28 11:07:34,642] [INFO] Loaded vectors from /content/drive/MyDrive/Cursos/PLN_USP/embeddings/cbow_s100.zip
✔ Successfully converted 929606 vectors
✔ Saved nlp object with vectors to output directory. You can now use
the path to it in your config as the 'vectors' setting in [initialize].
/content/vec_spacy_cbow_100


In [4]:
from spacy import util as spc_util

texto_simples = 'gato cachorro conversar falar'

pathw2v = 'vec_spacy_cbow_100'

nlpw2v = spc_util.load_model(pathw2v)

In [5]:
docw2v = nlpw2v(texto_simples)

tokens_w2v = [token.orth_ for token in docw2v]

tokens_w2v

['gato', 'cachorro', 'conversar', 'falar']

In [6]:
docw2v[0].vector

array([ 0.236389, -0.603464, -0.161562, -0.19812 ,  0.227765, -0.139105,
        0.166398, -0.296089,  0.128838, -0.240817,  0.195813, -0.047349,
       -0.42046 ,  0.21304 ,  0.117412,  0.088539,  0.148387,  0.052073,
        0.136064,  0.134528,  0.161985,  0.185564, -0.126994,  0.224053,
       -0.267075,  0.052961,  0.055635,  0.076302,  0.095579,  0.240283,
       -0.037003, -0.39441 , -0.098924,  0.247828,  0.124387,  0.260114,
        0.25561 , -0.27736 ,  0.107813,  0.048234, -0.057004,  0.408355,
       -0.093729, -0.122929,  0.072935, -0.08228 ,  0.073258, -0.020372,
       -0.392844, -0.227506, -0.018241, -0.322697,  0.234883, -0.259119,
       -0.403255,  0.010997,  0.352699,  0.090044, -0.122668, -0.146297,
       -0.087796,  0.149924,  0.280711, -0.294955,  0.068481,  0.13756 ,
        0.18189 , -0.31254 ,  0.16442 , -0.216146, -0.10583 ,  0.19913 ,
       -0.07746 , -0.204225, -0.144298, -0.282948, -0.410612, -0.011469,
       -0.475289, -0.290361,  0.116357,  0.198507, 

In [7]:
len(docw2v[0].vector)

100

gato vs. cachorro (cat vs. dog):

In [8]:
docw2v[0].similarity(docw2v[1])

0.76577175

conversar vs. falar (to converse vs. to speak):

In [9]:
docw2v[2].similarity(docw2v[3])

0.7374225
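
The similarity values above are cosine similarities between word vectors. The sketch below computes the same measure by hand on hypothetical 3-dimensional vectors; `cosine_similarity_manual` is a name made up for this example. With the loaded model, passing `docw2v[0].vector` and `docw2v[1].vector` to it should give approximately the value reported by `similarity`.

```python
import math

def cosine_similarity_manual(u, v):
    """Cosine of the angle between vectors u and v (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

sim_same_direction = cosine_similarity_manual(vec_a, vec_b)
sim_orthogonal = cosine_similarity_manual(vec_a, vec_c)
print(sim_same_direction, sim_orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.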

### Scikit-Learn module

In [10]:
!pip install numpy



In [11]:
!pip install -U scikit-learn

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/a8/eb/a48f25c967526b66d5f1fa7a984594f0bf0a5afafa94a8c4dbc317744620/scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3MB)
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
Installing collected packages: threadpoolctl, scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.24.2 threadpoolctl-2.1.0


In [12]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from spacy import util as spc_util

In [13]:
pathw2v = 'vec_spacy_cbow_100'

nlpw2v = spc_util.load_model(pathw2v)

In [14]:
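# look up the lexemes for the analogy: madri - espanha + frança ≈ paris
madri = nlpw2v.vocab['madri']
espanha = nlpw2v.vocab['espanha']
# NOTE: 'franca' lacks the cedilla of 'frança' (France) and is a different word
# in Portuguese, which may explain the low similarity obtained below
franca = nlpw2v.vocab['franca']
paris = nlpw2v.vocab['paris']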

In [15]:
vector_res = np.array(madri.vector) - np.array(espanha.vector) + np.array(franca.vector)

vector_res = vector_res.reshape(1 , -1)

vector_res.shape

(1, 100)

In [16]:
vector_paris = np.array(paris.vector).reshape(1, -1)

similarity = cosine_similarity(vector_res, vector_paris)

print(similarity)

[[0.21776156]]
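
The idea behind the vector arithmetic above (capital minus its country, plus another country, should land near that country's capital) can be sketched with toy vectors. Everything below is hypothetical: the 3-dimensional vectors are handcrafted so that the analogy holds exactly, and `analogy` is a helper name invented for this example. Real embeddings only approximate this behaviour.

```python
import math

# Handcrafted toy vectors: [related to Spain, related to France, is a capital]
toy_vectors = {
    "espanha": [1.0, 0.0, 0.0],
    "madri":   [1.0, 0.0, 1.0],
    "frança":  [0.0, 1.0, 0.0],
    "paris":   [0.0, 1.0, 1.0],
    "gato":    [0.3, 0.3, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c], excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("madri", "espanha", "frança", toy_vectors)
print(result)
```

Searching for the nearest neighbour of the resulting vector, instead of comparing it against a single expected word, is one common way to evaluate analogies of this kind.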


### **BERT - Looking for context**

BERT analyzes text at the semantic level. Goal: try to predict words.

To run the tool, you need:
- Transformers: the architecture and its libraries;
- trained models;
- a module that can execute the models.

Portuguese model: [BERTimbau](https://github.com/neuralmind-ai/portuguese-bert)

In [17]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)

Step by step:
1. import the module that executes the model: PyTorch;
2. load the pretrained model: BERTimbau.

Details of the **text** variable:
- CLS → start of the sentence;
- SEP → end of the sentence;
- MASK → the part of the sentence we want to predict.
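
The three special tokens can be added to a plain sentence with a small helper. This is only a sketch: `build_masked_text` is a hypothetical function, it splits on whitespace, and it masks just the first occurrence of the chosen word.

```python
def build_masked_text(sentence, word_to_mask):
    """Wrap `sentence` in [CLS]/[SEP] and replace the first occurrence of `word_to_mask` with [MASK]."""
    words = sentence.split()
    if word_to_mask not in words:
        raise ValueError(f"{word_to_mask!r} not found in sentence")
    position = words.index(word_to_mask)
    words[position] = "[MASK]"
    return "[CLS] " + " ".join(words) + " [SEP]"

masked_text = build_masked_text("as defesas levaram a melhor sobre os ataques", "defesas")
print(masked_text)
```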

*Note: using corpus_text.txt as the basis:*

### Activity 1: predict the completion of a sentence

In [32]:
def calc_score_tarefa1(text, target_word, tokenizer, model, debug = False):

  # tokenize the text
  tokenized_text = tokenizer.tokenize(text)
  # map the tokens to their indices in the model's vocabulary
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

  # get the index of the target word and the position of the mask
  target_index = tokenizer.convert_tokens_to_ids([target_word])[0]
  masked_index = tokenized_text.index('[MASK]')

  # segment ids: all zeros, since the input is a single sentence
  segments_ids = [0] * len(tokenized_text)

  # convert the indices to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensors = torch.tensor([segments_ids])

  # predict the tokens; the segment ids are passed by keyword because the
  # model's second positional argument is attention_mask, not token_type_ids
  with torch.no_grad():
      predictions = model(tokens_tensor, token_type_ids=segments_tensors)

  # convert the target index back to a token, to see how it was mapped
  expected_token = tokenizer.convert_ids_to_tokens([target_index])[0]

  # squash each logit at the masked position into the range 0 to 1
  predictions_candidates = torch.sigmoid(predictions[0][0][masked_index])

  # the token with the highest value is the model's prediction
  predicted_index = torch.argmax(predictions_candidates).item()

  predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]

  # if the predicted token is the unknown token, return 0
  # (note that the target word is not checked here)
  if predicted_token == '[UNK]':
    return 0

  predictions_candidates = predictions_candidates.cpu().numpy()
  target_bert_confiance = predictions_candidates[target_index]
  predicted_bert_confience = predictions_candidates[predicted_index]

  # otherwise, the score is 1 minus the absolute gap between the two confidences
  score = 1 - abs(target_bert_confiance - predicted_bert_confience)
  if debug:
    print("predicted token ---> ", predicted_token, predicted_bert_confience)
    print("expected token  ---> ",expected_token , target_bert_confiance)
    print("Score:", score)

  return score

In [40]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

# load the pretrained model
tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

model = BertForMaskedLM.from_pretrained('neuralmind/bert-base-portuguese-cased')
model.eval()

text = '[CLS] O jogo continuou amarrado no terceiro quarto, com as [MASK] levando a melhor sobre os ataques [SEP]'
# [MASK] = defesas

target_word = 'amparo'

# compute the score
score = calc_score_tarefa1(text, target_word, tokenizer, model, debug = True)
print("Score: ", score)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


predicted token --->  defesas 0.9999999
expected token  --->  [UNK] 0.99746823
Score: 0.9974683523178101
Score:  0.9974683523178101
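
Both confidences above are very close to 1, even though the expected token was mapped to `[UNK]`. One possible reason is that `torch.sigmoid` squashes each logit independently, so many tokens can score near 1 at once. The sketch below contrasts this with a softmax, which is the usual way to turn logits into a probability distribution over the vocabulary. The logits are hypothetical and the code uses plain Python instead of PyTorch so that it runs on its own.

```python
import math

def sigmoid(x):
    """Squash a single logit into (0, 1), independently of the other logits."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens at the masked position
logits = [16.0, 6.0, 3.0]

sigmoid_scores = [sigmoid(x) for x in logits]
softmax_probs = softmax(logits)

print(sigmoid_scores)  # every value is close to 1
print(softmax_probs)   # almost all probability goes to the first candidate
```

With a softmax, a gap between the predicted and the expected token would show up much more clearly in the score; whether that is desirable depends on what the score is meant to capture.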


### Activity 2: check the contextual similarity between two words (predicted and expected) in a sentence.

In [29]:
def calc_score_tarefa2(text, target_word, predicted_word, tokenizer, model, debug=False):

  # tokenize the text
  tokenized_text = tokenizer.tokenize(text)
  # map the tokens to their indices in the model's vocabulary
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

  # get the indices of both words and the position of the mask
  target_index = tokenizer.convert_tokens_to_ids([target_word])[0]
  predicted_index = tokenizer.convert_tokens_to_ids([predicted_word])[0]
  masked_index = tokenized_text.index('[MASK]')

  # segment ids: all zeros, since the input is a single sentence
  segments_ids = [0] * len(tokenized_text)

  # convert the indices to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensors = torch.tensor([segments_ids])

  # convert the indices back to tokens, to see how the words were mapped
  expected_token = tokenizer.convert_ids_to_tokens([target_index])[0]
  predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]

  # predict the tokens; the segment ids are passed by keyword because the
  # model's second positional argument is attention_mask, not token_type_ids
  with torch.no_grad():
      predictions = model(tokens_tensor, token_type_ids=segments_tensors)

  # squash each logit at the masked position into the range 0 to 1
  predictions_candidates = torch.sigmoid(predictions[0][0][masked_index]).cpu().numpy()

  # if the given predicted word is not in the vocabulary, return 0
  if predicted_token == '[UNK]':
    return 0

  target_bert_confiance = predictions_candidates[target_index]
  predicted_bert_confience = predictions_candidates[predicted_index]

  # otherwise, the score is 1 minus the absolute gap between the two confidences
  score = 1 - abs(target_bert_confiance - predicted_bert_confience)
  if debug:
    print("predicted token ---> ", predicted_token, predicted_bert_confience)
    print("expected token  ---> ",expected_token , target_bert_confiance)
    print("Score:", score)

  return score

In [41]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

import logging
logging.basicConfig(level=logging.INFO)

# load the pretrained model
tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

model = BertForMaskedLM.from_pretrained('neuralmind/bert-base-portuguese-cased')
model.eval()

text = '[CLS] O jogo continuou amarrado no terceiro quarto, com as [MASK] levando a melhor sobre os ataques [SEP]'
# [MASK] = defesas

# in this code, we provide both words ourselves
predicted_word = 'defesas'
target_word = 'proteção'

# compute the score
score = calc_score_tarefa2(text, target_word, predicted_word, tokenizer, model, debug=True)
print("Score: ", score)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


predicted token --->  defesas 0.9999999
expected token  --->  proteção 0.9578208
Score: 0.9578208923339844
Score:  0.9578208923339844
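
The score printed above is 1 minus the absolute difference between the two confidences. The sketch below redoes that arithmetic with the confidences shown in the output; `confidence_gap_score` is a name made up for this example, and small differences in the last digits are expected because the notebook works with 32-bit floats.

```python
def confidence_gap_score(target_confidence, predicted_confidence):
    """Return 1 minus the absolute gap between the two confidences."""
    return 1 - abs(target_confidence - predicted_confidence)

# Confidences copied from the output above (expected: proteção, predicted: defesas)
score_tarefa2 = confidence_gap_score(0.9578208, 0.9999999)
print(score_tarefa2)
```

A score of 1 means the model is equally confident in both words, and the score drops as the gap between them grows.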


*The class took place on May 26, 2021, from 2 p.m. to 6 p.m.*