# Word2Vec: Training Word Embeddings

## Objectives
* Learn how to use spaCy to preprocess text data, along with its advantages and drawbacks
* Learn how to configure the hyperparameters of the Word2Vec model
* Train your own Word2Vec model with Gensim
* Build a text classifier using your own Word2Vec model
* Serve your model in a web application

link: https://cursos.alura.com.br/course/word2vec-treinamento-word-embedding

## Importing the required libraries

In [14]:
! python -m spacy download pt_core_news_sm

Collecting pt_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-2.2.5/pt_core_news_sm-2.2.5.tar.gz (21.2 MB)
✔ Download and installation successful
You can now load the model via spacy.load('pt_core_news_sm')


In [72]:
import numpy as np
import pandas as pd
import gensim
import spacy
import pt_core_news_sm
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from gensim.models.callbacks import CallbackAny2Vec

## Loading the dataset

In [16]:
dados_treino = pd.read_csv('/content/drive/MyDrive/Alura cursos/Word2Vec/treino.csv')
dados_teste = pd.read_csv('/content/drive/MyDrive/Alura cursos/Word2Vec/teste.csv')

In [17]:
dados_treino.head()

Unnamed: 0,title,text,date,category,subcategory,link
0,"Após polêmica, Marine Le Pen diz que abomina n...",A candidata da direita nacionalista à Presidên...,2017-04-28,mundo,,http://www1.folha.uol.com.br/mundo/2017/04/187...
1,"Macron e Le Pen vão ao 2º turno na França, em ...",O centrista independente Emmanuel Macron e a d...,2017-04-23,mundo,,http://www1.folha.uol.com.br/mundo/2017/04/187...
2,"Apesar de larga vitória nas legislativas, Macr...",As eleições legislativas deste domingo (19) na...,2017-06-19,mundo,,http://www1.folha.uol.com.br/mundo/2017/06/189...
3,"Governo antecipa balanço, e Alckmin anuncia qu...",O número de ocorrências de homicídios dolosos ...,2015-07-24,cotidiano,,http://www1.folha.uol.com.br/cotidiano/2015/07...
4,"Após queda em maio, a atividade econômica sobe...","A economia cresceu 0,25% no segundo trimestre,...",2017-08-17,mercado,,http://www1.folha.uol.com.br/mercado/2017/08/1...


In [18]:
dados_teste.head()

Unnamed: 0,title,text,date,category,subcategory,link
0,Grandes irmãos,"RIO DE JANEIRO - O Brasil, cada vez menos famí...",2017-03-06,colunas,ruycastro,http://www1.folha.uol.com.br/colunas/ruycastro...
1,Haddad congela orçamento e suspende emendas de...,"O prefeito de São Paulo, Fernando Haddad (PT),...",2016-08-10,colunas,monicabergamo,http://www1.folha.uol.com.br/colunas/monicaber...
2,Proposta de reforma da Fifa tem a divulgação d...,"A Fifa divulgou, nesta quinta (10), um relatór...",2015-10-09,esporte,,http://www1.folha.uol.com.br/esporte/2015/09/1...
3,"Mercado incipiente, internet das coisas conect...","Bueiros, coleiras, aparelhos hospitalares, ele...",2016-11-09,mercado,,http://www1.folha.uol.com.br/mercado/2016/09/1...
4,"Mortes: Psicanalista, estudou o autismo em cri...",Toda vez que o grupo de amigos de Silvana Rabe...,2017-02-07,cotidiano,,http://www1.folha.uol.com.br/cotidiano/2017/07...


In [19]:
print(f'The training data has {dados_treino.shape[0]} observations and {dados_treino.shape[1]} attributes')
print(f'The test data has {dados_teste.shape[0]} observations and {dados_teste.shape[1]} attributes')

The training data has 90000 observations and 6 attributes
The test data has 20513 observations and 6 attributes


In [20]:
nlp = pt_core_news_sm.load()

In [21]:
nlp

<spacy.lang.pt.Portuguese at 0x7ff257dc3ad0>

In [22]:
texto = 'Rio de Janeiro é uma cidade maravilhosa'
doc = nlp(texto)
doc

Rio de Janeiro é uma cidade maravilhosa

In [23]:
textos_para_tratamento = (titulos.lower() for titulos in dados_treino.title)
textos_para_tratamento

<generator object <genexpr> at 0x7ff257a6c4d0>

In [24]:
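def trata_textos(doc):
  # Keep only alphabetic tokens that are not stop words
  tokens_validos = []
  for token in doc:
    e_valido = not token.is_stop and token.is_alpha
    if e_valido:
      tokens_validos.append(token.text)

  # Titles with two or fewer valid tokens yield None and are dropped later
  if len(tokens_validos) > 2:
    return " ".join(tokens_validos)
  return None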

In [25]:
texto = 'Rio de 5845116 Janeiro é uma cidade!!!! maravilhosa'
doc = nlp(texto)

trata_textos(doc)

'Rio Janeiro cidade maravilhosa'
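
`trata_textos` returns `None` when two or fewer tokens survive the filter, which is why missing values are dropped before training. The sketch below mimics that behavior without spaCy. The regex tokenizer, the `filter_title` name, and the three-word `STOP_WORDS` set are stand-ins assumed for illustration. They are not what `pt_core_news_sm` actually uses.

```python
import re

# Hypothetical, tiny stop-word list; spaCy's Portuguese list is much larger.
STOP_WORDS = {"de", "é", "uma"}

def filter_title(text, min_tokens=3):
    """Keep alphabetic, non-stop-word tokens; return None for short results."""
    # Split into runs of word characters or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    valid = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
    if len(valid) >= min_tokens:
        return " ".join(valid)
    return None

kept = filter_title("Rio de 5845116 Janeiro é uma cidade!!!! maravilhosa")
dropped = filter_title("Rio de Janeiro")
print(kept)
print(dropped)
```

The second call leaves only two valid tokens, so it returns `None`, mirroring the rows that `dropna` removes.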

In [28]:
import time
t0 = time.time()
# n_process=-1 runs the pipeline on all available CPU cores
textos_tratados = [trata_textos(doc) for doc in nlp.pipe(textos_para_tratamento,
                                                         batch_size=1000,
                                                         n_process=-1)]

t1 = time.time()
print(t1 - t0)  # elapsed time in seconds

219.23638200759888


In [31]:
titulos_tratados = pd.DataFrame({'title': textos_tratados})
titulos_tratados.head()

Unnamed: 0,title
0,polêmica marine le pen abomina negacionistas h...
1,macron e le pen a o turno frança revés siglas ...
2,apesar larga vitória legislativas macron terá ...
3,governo antecipa balanço e alckmin anuncia que...
4,queda maio a atividade econômica sobe junho bc


## Starting the model training phase

We use the gensim library, which ships an implementation of the Word2Vec architecture, so we only need to set the network's configuration parameters.
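
Two of the configuration parameters set in this notebook, `sg` and `window`, decide which training examples are drawn from each title. The sketch below is a simplified illustration and not gensim's implementation, which also applies subsampling, random window shrinking, and negative sampling. The function name `training_examples` is hypothetical.

```python
def training_examples(tokens, window, sg):
    """List the examples a window of the given size yields for one sentence.

    sg=0 (CBOW): one (context_words, target) example per position.
    sg=1 (skip-gram): one (target, context_word) pair per context word.
    """
    examples = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        if sg:
            examples.extend((target, c) for c in context)
        else:
            examples.append((context, target))
    return examples

tokens = ["rio", "janeiro", "cidade", "maravilhosa"]
cbow_examples = training_examples(tokens, window=2, sg=0)
skipgram_examples = training_examples(tokens, window=2, sg=1)
print(cbow_examples[0])
print(skipgram_examples[:2])
```

CBOW predicts each word from its neighbors, while skip-gram predicts each neighbor from the word. That is why skip-gram produces more training pairs from the same sentence.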

In [33]:
w2v_modelo

<gensim.models.word2vec.Word2Vec at 0x7ff257a47350>

In [45]:
print(len(titulos_tratados))

titulos_tratados.dropna(inplace=True)
titulos_tratados.drop_duplicates(inplace=True)

print(len(titulos_tratados))

86113
86113


In [38]:
lista_lista_tokens = [titulo.split() for titulo in titulos_tratados.title]

### Training the CBOW model

In [79]:
import logging

#logging.basicConfig(format='%(asctime)s: -  %(message)s', level=logging.INFO)

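w2v_modelo = Word2Vec(sg=0,             # 0 = CBOW, 1 = skip-gram
                      window=2,         # max distance between target and context words
                      size=300,         # embedding dimensionality (vector_size in gensim 4+)
                      min_count=5,      # ignore words with fewer than 5 occurrences
                      alpha=0.03,       # initial learning rate
                      min_alpha=0.007)  # learning rate decays linearly to this value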

w2v_modelo.build_vocab(lista_lista_tokens, progress_per=5000)

2021-08-10 01:46:27,923: -  collecting all words and their counts
2021-08-10 01:46:27,931: -  PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-08-10 01:46:27,953: -  PROGRESS: at sentence #10000, processed 69298 words, keeping 14909 word types
2021-08-10 01:46:27,982: -  PROGRESS: at sentence #20000, processed 138620 words, keeping 20969 word types
2021-08-10 01:46:28,012: -  PROGRESS: at sentence #30000, processed 207976 words, keeping 25453 word types
2021-08-10 01:46:28,040: -  PROGRESS: at sentence #40000, processed 277254 words, keeping 28992 word types
2021-08-10 01:46:28,068: -  PROGRESS: at sentence #50000, processed 346641 words, keeping 31924 word types
2021-08-10 01:46:28,095: -  PROGRESS: at sentence #60000, processed 416318 words, keeping 34458 word types
2021-08-10 01:46:28,122: -  PROGRESS: at sentence #70000, processed 485882 words, keeping 36651 word types
2021-08-10 01:46:28,150: -  PROGRESS: at sentence #80000, processed 555521 words, keeping 38
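
The vocabulary scan counts tens of thousands of distinct word types, yet the training log reports a vocabulary of only 13006 words. The difference comes from `min_count=5`, which discards rare words. Here is a minimal sketch of that pruning rule on a toy corpus. The `prune_vocab` helper and the sample sentences are hypothetical.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Toy corpus: "pen" appears 3 times, "le" twice, every other word once.
sentences = [["macron", "le", "pen"], ["le", "pen", "frança"], ["pen", "turno"]]
vocab = prune_vocab(sentences, min_count=2)
print(sorted(vocab))
```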

In [76]:
dir(w2v_modelo)

['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_adapt_by_suffix',
 '_check_input_data_sanity',
 '_check_training_sanity',
 '_clear_post_train',
 '_do_train_epoch',
 '_do_train_job',
 '_get_job_params',
 '_get_thread_working_mem',
 '_job_producer',
 '_load_specials',
 '_log_epoch_end',
 '_log_epoch_progress',
 '_log_progress',
 '_log_train_end',
 '_minimize_model',
 '_raw_word_count',
 '_save_specials',
 '_set_train_params',
 '_smart_save',
 '_train_epoch',
 '_train_epoch_corpusfile',
 '_update_job_params',
 '_worker_loop',
 '_worker_loop_corpusfile',
 'accuracy',
 'alpha',
 'batch_words',
 'build_vocab',
 'build_vocab_from_freq',
 'ca

In [77]:
w2v_modelo.corpus_count

86113

In [81]:
# Callback that prints the training loss at the end of each epoch
class callback(CallbackAny2Vec):
  def __init__(self):
    self.epoch = 0

  def on_epoch_end(self, model):
    loss = model.get_latest_training_loss()
    if self.epoch == 0:
      print('Loss after epoch {}: {}'.format(self.epoch, loss))
    else:
      print('Loss after epoch {}: {}'.format(self.epoch, loss - self.loss_previous_step))
    self.epoch += 1
    self.loss_previous_step = loss
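
The callback subtracts the previous reading because, as its logic suggests, `get_latest_training_loss()` reports a running total rather than a per-epoch value. The sketch below shows that bookkeeping in isolation. The `per_epoch_losses` helper and the sample readings are made up for illustration.

```python
def per_epoch_losses(cumulative):
    """Convert cumulative loss readings into per-epoch increments."""
    losses = []
    previous = 0.0
    for total in cumulative:
        losses.append(total - previous)
        previous = total
    return losses

# Hypothetical cumulative readings, one per epoch (not taken from this run).
increments = per_epoch_losses([100.0, 180.0, 240.0])
print(increments)  # [100.0, 80.0, 60.0]
```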

In [80]:
w2v_modelo.train(lista_lista_tokens, total_examples=w2v_modelo.corpus_count,
                 epochs=30,
                 compute_loss=True,
                 callbacks=[callback()])

2021-08-10 01:47:03,947: -  training model with 3 workers on 13006 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=2
2021-08-10 01:47:04,975: -  EPOCH 1 - PROGRESS: at 66.94% examples, 330831 words/s, in_qsize 5, out_qsize 0
2021-08-10 01:47:05,419: -  worker thread finished; awaiting finish of 2 more threads
2021-08-10 01:47:05,423: -  worker thread finished; awaiting finish of 1 more threads
2021-08-10 01:47:05,430: -  worker thread finished; awaiting finish of 0 more threads
2021-08-10 01:47:05,431: -  EPOCH - 1 : training on 597929 raw words (502771 effective words) took 1.5s, 341470 effective words/s
2021-08-10 01:47:06,488: -  EPOCH 2 - PROGRESS: at 70.26% examples, 337814 words/s, in_qsize 3, out_qsize 2
2021-08-10 01:47:06,828: -  worker thread finished; awaiting finish of 2 more threads
2021-08-10 01:47:06,859: -  worker thread finished; awaiting finish of 1 more threads
2021-08-10 01:47:06,870: -  worker thread finished; awaiting finish of 0 more t

(15086746, 17937870)

In [87]:
w2v_modelo.wv.most_similar('google')

2021-08-10 01:52:02,705: -  precomputing L2-norms of word weight vectors


[('apple', 0.582836389541626),
 ('facebook', 0.5689347982406616),
 ('uber', 0.4988063871860504),
 ('amazon', 0.4805878698825836),
 ('waze', 0.4691309928894043),
 ('software', 0.46819931268692017),
 ('airbnb', 0.4672732353210449),
 ('fbi', 0.46390458941459656),
 ('walmart', 0.4373021721839905),
 ('apps', 0.4366353750228882)]
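
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces the ranking idea in plain Python. The three-dimensional `toy_vectors` are invented for illustration and have nothing to do with the trained 300-dimensional embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors, for illustration only.
toy_vectors = {
    "google": [1.0, 0.0, 0.0],
    "apple": [0.9, 0.1, 0.0],
    "neymar": [0.0, 1.0, 0.0],
}
ranking = most_similar(toy_vectors, "google", topn=2)
print(ranking)
```

Because cosine similarity ignores vector length, only the direction of each embedding affects the ranking.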

In [98]:
w2v_modelo.wv.most_similar('microsoft')

[('unilever', 0.5961178541183472),
 ('walmart', 0.5732885003089905),
 ('lego', 0.5488518476486206),
 ('tesla', 0.5473916530609131),
 ('amazon', 0.5462477207183838),
 ('inbev', 0.5434845089912415),
 ('spotify', 0.5265981554985046),
 ('sony', 0.5204979181289673),
 ('chrysler', 0.512848973274231),
 ('braskem', 0.5043861865997314)]

In [65]:
w2v_modelo.wv.most_similar('neymar')

[('messi', 0.5437285900115967),
 ('benzema', 0.48244208097457886),
 ('barça', 0.47397497296333313),
 ('valdivia', 0.45517486333847046),
 ('ibrahimovic', 0.45389801263809204),
 ('suárez', 0.44436556100845337),
 ('romário', 0.44328829646110535),
 ('fred', 0.4385612905025482),
 ('cristiano', 0.4366806149482727),
 ('stjd', 0.4346413314342499)]

In [71]:
w2v_modelo.wv.most_similar('gm')

[('chrysler', 0.6733736991882324),
 ('honda', 0.6651701927185059),
 ('volks', 0.6548870205879211),
 ('inbev', 0.6405227184295654),
 ('embraer', 0.6281988024711609),
 ('volkswagen', 0.6216104030609131),
 ('renault', 0.6114151477813721),
 ('toyota', 0.6049923896789551),
 ('braskem', 0.6047013998031616),
 ('csn', 0.5865892171859741)]

### Training the Skip-Gram model

In [91]:
w2v_modelo_sg = Word2Vec(sg=1,
                         window=5,
                         size=300,
                         min_count=5,
                         alpha=0.03,
                         min_alpha=0.007)

w2v_modelo_sg.build_vocab(lista_lista_tokens, progress_per=5000)

w2v_modelo_sg.train(lista_lista_tokens, total_examples=w2v_modelo_sg.corpus_count,
                    epochs=30,
                    compute_loss=True,
                    callbacks=[callback()])

2021-08-10 02:01:43,054: -  collecting all words and their counts
2021-08-10 02:01:43,059: -  PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-08-10 02:01:43,076: -  PROGRESS: at sentence #5000, processed 34716 words, keeping 10129 word types
2021-08-10 02:01:43,091: -  PROGRESS: at sentence #10000, processed 69298 words, keeping 14909 word types
2021-08-10 02:01:43,108: -  PROGRESS: at sentence #15000, processed 103841 words, keeping 18223 word types
2021-08-10 02:01:43,125: -  PROGRESS: at sentence #20000, processed 138620 words, keeping 20969 word types
2021-08-10 02:01:43,142: -  PROGRESS: at sentence #25000, processed 173257 words, keeping 23410 word types
2021-08-10 02:01:43,159: -  PROGRESS: at sentence #30000, processed 207976 words, keeping 25453 word types
2021-08-10 02:01:43,172: -  PROGRESS: at sentence #35000, processed 242567 words, keeping 27263 word types
2021-08-10 02:01:43,195: -  PROGRESS: at sentence #40000, processed 277254 words, keeping 2899

(15088743, 17937870)

In [92]:
w2v_modelo_sg.wv.most_similar('google')

2021-08-10 02:03:40,841: -  precomputing L2-norms of word weight vectors


[('reguladores', 0.4206386208534241),
 ('apple', 0.3930903673171997),
 ('waze', 0.3882356882095337),
 ('facebook', 0.3861532211303711),
 ('concorda', 0.3793504238128662),
 ('patentes', 0.36988765001296997),
 ('android', 0.3693094551563263),
 ('yahoo', 0.365840882062912),
 ('buffett', 0.3651365637779236),
 ('anúncios', 0.36011892557144165)]

In [93]:
w2v_modelo_sg.wv.most_similar('microsoft')

[('linkedin', 0.5354049801826477),
 ('chips', 0.48254701495170593),
 ('software', 0.47764453291893005),
 ('bitcoin', 0.45045530796051025),
 ('fertilizantes', 0.45014694333076477),
 ('verizon', 0.44941723346710205),
 ('investindo', 0.43623656034469604),
 ('fabricar', 0.4343263506889343),
 ('silício', 0.42991238832473755),
 ('syngenta', 0.42629486322402954)]

In [94]:
w2v_modelo_sg.wv.most_similar('neymar')

[('barça', 0.518221914768219),
 ('suárez', 0.4742594361305237),
 ('cavani', 0.4607198238372803),
 ('messi', 0.45851776003837585),
 ('benzema', 0.42924392223358154),
 ('villarreal', 0.4286305606365204),
 ('psg', 0.42676663398742676),
 ('dedada', 0.4090576171875),
 ('barcelona', 0.406423419713974),
 ('fernandinho', 0.3941882848739624)]

In [95]:
w2v_modelo_sg.wv.most_similar('gm')

[('metalúrgicos', 0.5739935636520386),
 ('motors', 0.5112147331237793),
 ('honda', 0.5084389448165894),
 ('airbags', 0.4985079765319824),
 ('coletivas', 0.487079918384552),
 ('cubatão', 0.48182615637779236),
 ('audi', 0.47662973403930664),
 ('bmw', 0.4720497727394104),
 ('compartilhamento', 0.4632718563079834),
 ('fiat', 0.4593982994556427)]

In [96]:
w2v_modelo_sg.wv.most_similar('galo')

[('lilly', 0.5294711589813232),
 ('frevo', 0.5282864570617676),
 ('beth', 0.5243561863899231),
 ('cores', 0.523779571056366),
 ('martinho', 0.5148050785064697),
 ('sarti', 0.5133844017982483),
 ('baralho', 0.4886307716369629),
 ('sambista', 0.4788205921649933),
 ('sereia', 0.4756861925125122),
 ('foliões', 0.4724130928516388)]

In [97]:
w2v_modelo.wv.save_word2vec_format('/content/drive/MyDrive/Alura cursos/Word2Vec/modelo_cbow.txt', binary=False)
w2v_modelo_sg.wv.save_word2vec_format('/content/drive/MyDrive/Alura cursos/Word2Vec/modelo_skip.txt', binary=False)

2021-08-10 02:03:40,970: -  storing 13006x300 projection weights into /content/drive/MyDrive/Alura cursos/Word2Vec/modelo_cbow.txt
2021-08-10 02:03:44,119: -  storing 13006x300 projection weights into /content/drive/MyDrive/Alura cursos/Word2Vec/modelo_skip.txt
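
With `binary=False`, the saved file is plain text: a header line with the vocabulary size and vector dimension (13006 and 300 here), followed by one line per word with its values. The parser below is a hand-written sketch of that layout using an in-memory sample, and `read_word2vec_text` is a hypothetical helper. In practice, gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` reads these files.

```python
import io

def read_word2vec_text(handle):
    """Parse the word2vec text format into a {word: vector} dict."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"expected {vocab_size} words, got {len(vectors)}")
    return vectors

# Invented two-word sample in the same layout as the saved files.
sample = io.StringIO("2 3\ngoogle 0.1 0.2 0.3\napple 0.4 0.5 0.6\n")
loaded = read_word2vec_text(sample)
print(loaded["google"])
```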
