[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github.com/jkanclerz/analiza-tekstu/blob/master/03-Eksploracyjna_analiza_dokumentow-Word-embedings.ipynb)

## Word2Vec

Word2Vec is not a single algorithm but a family of model architectures and optimizations for learning word embeddings from large datasets. Embeddings learned with Word2Vec have proven effective in many natural language processing tasks.

### OneHot vs WordVector

* Lower dimensionality: 8, 50, 100, or 300 vs. the vocabulary size (140,000 for Polish, 350,000 for English)
    * memory-efficient storage
* Semantics / meaning
    * With one-hot vectors, distance(Cat, Dog) == distance(Cat, Money)
    * Intuitively this is wrong, but the machine has no way to infer similarity
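
To make the similarity argument concrete, here is a small sketch comparing one-hot vectors with dense vectors. The three-word vocabulary and the 2-dimensional dense values are invented for illustration; they do not come from a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot: one dimension per vocabulary word (toy vocabulary of 3 words).
one_hot = {
    "cat":   [1, 0, 0],
    "dog":   [0, 1, 0],
    "money": [0, 0, 1],
}

# Dense vectors: hand-picked, hypothetical values chosen so that
# the two animals point in a similar direction.
dense = {
    "cat":   [0.9, 0.1],
    "dog":   [0.8, 0.2],
    "money": [0.1, 0.9],
}

# Every pair of distinct one-hot vectors is orthogonal, so similarity is 0.
one_hot_cat_dog = cosine(one_hot["cat"], one_hot["dog"])
one_hot_cat_money = cosine(one_hot["cat"], one_hot["money"])

# Dense vectors can express that "cat" is closer to "dog" than to "money".
dense_cat_dog = cosine(dense["cat"], dense["dog"])
dense_cat_money = cosine(dense["cat"], dense["money"])

print(one_hot_cat_dog, one_hot_cat_money)
print(dense_cat_dog > dense_cat_money)
```

With one-hot vectors both comparisons give the same value, which is exactly the problem described in the list above.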

We assume that words with similar meanings occur in the same context more often than words that are unrelated to the topic.
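
This assumption can be checked by counting which words appear near each other. The sketch below uses a tiny invented corpus and a hypothetical helper `shared_context`; it only illustrates the idea and is not how Word2Vec itself is trained.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, used only for illustration.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the bank lends money",
]

window = 2  # how many neighbours on each side count as context

# cooc[w][c] = how often word c appeared within the window around word w
cooc = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[w][words[j]] += 1

def shared_context(a, b):
    """Context words seen near both a and b."""
    return set(cooc[a]) & set(cooc[b])

print(shared_context("cat", "dog"))
print(shared_context("cat", "money"))
```

In this toy corpus "cat" and "dog" share context words, while "cat" and "money" share none.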
    

### Continuous Bag-of-Words Model
The CBOW model predicts the middle word from the context words that surround it. The context consists of a few words before and after the current (middle) word. The architecture is called a bag-of-words model because the order of the words in the context does not matter.


### Continuous Skip-gram Model
The skip-gram model predicts the words within a certain range before and after the current word in the same sentence. A worked example follows.

In [4]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


**Now is better than never, although never is often better than right now.**

![](var/word_context.jpeg)

#### Parameters:
* Window size: 2

#### Transformation
* Iteration 1
    * target: Now
    * context: [is, better]
* Iteration 2
    * target: is
    * context: [Now, better, than]
* Iteration 3
    * target: better
    * context: [Now, is, than, never]
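
The iterations listed above can be reproduced with a short helper. This is a minimal sketch: the function name `cbow_examples` is made up, and lowercasing plus whitespace tokenization are simplifying assumptions.

```python
def cbow_examples(tokens, window=2):
    """Yield (context, target) pairs with up to `window` words on each side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

tokens = "now is better than never although never is often better than right now".split()
examples = list(cbow_examples(tokens, window=2))

for context, target in examples[:3]:
    print("target:", target, "context:", context)
```

Words near the sentence boundary simply get a shorter context, which matches the first two iterations in the list.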
    
#### Input data
[(Now, is), (Now, better), (is, Now), (is, better), (is, than)...]

### Skip-gram
* Iteration 1
    * target: [is, better]
    * word: Now
* Iteration 2
    * target: [Now, better, than]
    * word: is
* Iteration 3
    * target: [Now, is, than, never]
    * word: better
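
For skip-gram the direction is reversed: each word is the input and every word in its window is something to predict. A minimal sketch, with the hypothetical helper name `skipgram_pairs` and whitespace tokenization as assumptions:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (input_word, word_to_predict) pairs for skip-gram training."""
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield word, tokens[j]

tokens = "now is better than never".split()
pairs = list(skipgram_pairs(tokens, window=2))
print(pairs[:5])
```

Each input word produces one training pair per neighbour, so skip-gram yields more (but simpler) training examples than CBOW for the same text.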
    


### How it works

Is better _____ never although -> **than**

#### CBOW
OneHot (0,0,1,0,1,1,0....0,1,1) -> Neural network (50 parameters) -> OneHot (..., 1, ...)
#### Skip-gram
OneHot (..., 1, ...) -> Neural network (50 parameters) -> OneHot (0,0,1,0,1,1,0....0,1,1)
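
The two data flows above can be sketched as forward passes in NumPy. This is only an illustration under assumptions: the vocabulary has 10 made-up word ids, the weights are random rather than trained, and "50 parameters" is read here as a hidden (embedding) layer of size 50.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 10, 50   # toy vocabulary; 50 = assumed hidden size

# Input-to-hidden weights: row i is the embedding of word i.
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_size))
# Hidden-to-output weights: one score per vocabulary word.
W_out = rng.normal(scale=0.1, size=(embed_size, vocab_size))

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cbow_forward(context_ids):
    """Average the context embeddings, then score every vocabulary word."""
    hidden = W_in[context_ids].mean(axis=0)
    return softmax(hidden @ W_out)

def skipgram_forward(word_id):
    """Use one word's embedding to score every possible context word."""
    return softmax(W_in[word_id] @ W_out)

p_target = cbow_forward([1, 2, 4, 5])   # ids of four context words
p_context = skipgram_forward(3)         # id of the center word
print(p_target.shape, p_context.shape)
```

Multiplying a one-hot vector by `W_in` selects a single row, which is why the code indexes `W_in` directly instead of building the one-hot vectors.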

![](var/models.jpeg)

### Word meaning

![](var/dimenssions.jpeg)

#### Pretrained models

* [http://dsmodels.nlp.ipipan.waw.pl/](http://dsmodels.nlp.ipipan.waw.pl/)

In [13]:
!wget http://dsmodels.nlp.ipipan.waw.pl/dsmodels/wiki-forms-all-100-cbow-ns-30-it100.txt.gz

--2021-12-04 01:42:59--  http://dsmodels.nlp.ipipan.waw.pl/dsmodels/wiki-forms-all-100-cbow-ns-30-it100.txt.gz
Resolving dsmodels.nlp.ipipan.waw.pl (dsmodels.nlp.ipipan.waw.pl)... 213.135.36.94
Connecting to dsmodels.nlp.ipipan.waw.pl (dsmodels.nlp.ipipan.waw.pl)|213.135.36.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96291106 (92M) [application/octet-stream]
Saving to: ‘wiki-forms-all-100-cbow-ns-30-it100.txt.gz’


2021-12-04 01:43:02 (33,2 MB/s) - ‘wiki-forms-all-100-cbow-ns-30-it100.txt.gz’ saved [96291106/96291106]



In [14]:
pip install gensim

Note: you may need to restart the kernel to use updated packages.


In [16]:
from gensim.models import KeyedVectors

In [17]:
word2vec_model = KeyedVectors.load_word2vec_format('wiki-forms-all-100-cbow-ns-30-it100.txt.gz', binary=False)
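
With `binary=False`, `load_word2vec_format` reads the plain-text word2vec layout: a header line with the vocabulary size and the vector dimensionality, then one line per word containing the word and its space-separated components. The sketch below parses a tiny in-memory sample (the values are invented) to show that layout; `read_word2vec_text` is a hypothetical helper, and real files should still be loaded with gensim.

```python
import io

# Made-up sample in the text format: "<vocab_size> <dim>" then "<word> <v1> ... <vdim>"
sample = io.StringIO(
    "3 4\n"
    "w 0.1 0.2 0.3 0.4\n"
    "i -0.5 0.0 0.5 1.0\n"
    "na 1.5 -1.5 0.25 0.75\n"
)

def read_word2vec_text(handle):
    """Parse the text word2vec format into a {word: [floats]} dictionary."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"header announced {vocab_size} words, found {len(vectors)}")
    return vectors

vectors = read_word2vec_text(sample)
print(len(vectors), vectors["na"])
```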

In [18]:
word2vec_model

<gensim.models.keyedvectors.KeyedVectors at 0x10a6ded90>

In [22]:
word2vec_model.most_similar("komputer")

[('mikroprocesor', 0.8413556218147278),
 ('procesor', 0.8408359289169312),
 ('mikrokomputer', 0.7868935465812683),
 ('serwer', 0.7859148979187012),
 ('sterownik', 0.7709600925445557),
 ('odbiornik', 0.7664448618888855),
 ('modem', 0.76606285572052),
 ('automat', 0.7601091265678406),
 ('interfejs', 0.7560226321220398),
 ('moduł', 0.7477715015411377)]

In [25]:
# kobieta (woman) + król (king) - mężczyzna (man) => ?
word2vec_model.most_similar(positive=['kobieta', 'król'], negative=['mężczyzna'], topn=2)

[('królowa', 0.8121304512023926), ('cesarzowa', 0.7185278534889221)]
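
The query above can be read as vector arithmetic followed by a nearest-neighbour search using cosine similarity. The sketch below shows the idea on hand-made 2-dimensional vectors. All values are invented, the helper `analogy` is hypothetical, and gensim's own implementation differs in details such as vector normalization.

```python
import numpy as np

# Hand-made vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Invented values.
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.7, 0.1]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive + negative]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:topn]

result = analogy(positive=["woman", "king"], negative=["man"])
print(result)
```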

In [29]:
len(word2vec_model.index_to_key)

226396

In [31]:
word2vec_model.index_to_key[:5]

['w', 'i', 'na', 'z', 'do']

In [32]:
word2vec_model['studia']

array([  2.340531,   4.042009,   1.379964,  -2.570875,  -6.386054,
         2.563526,  -2.15919 ,  -5.862146,   0.900218,   4.766161,
         6.316826,  -3.465864,   4.536585,   5.025024,   2.033738,
         5.342262,  -9.382997,   2.485591,   9.384068,  -6.73937 ,
         6.684585,   4.521341,   0.503903,   5.57772 ,  -1.284685,
        -2.736128,   3.997064,  -1.389462,   2.380089,   1.208408,
        -2.977874,  -4.645787,   0.359989,   5.398995, -13.140451,
        -1.73424 ,  -0.928283,  -6.178953,  -5.871259,  -0.229391,
        -6.043469,  -1.246072,  -0.312358,  -0.607806,   5.411756,
       -11.643624,   7.283361,  -3.326613,   4.791219,   4.176166,
         2.003737,  -7.958755,   6.639453,   4.922795,   5.656033,
        -2.38712 ,  -7.016893,  -1.731023,   8.705477,   6.712057,
        -0.108324,   1.07114 ,   1.748102,   1.65885 ,   5.361063,
         1.075046,   1.301678,   2.086038,   0.717669,   0.452681,
         4.019646,   1.437227,   8.210261,   3.675681,   0.358

### Training your own embedding

In [33]:
pip install keras tensorflow


In [199]:
sentences = """
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"""

#### Preparation

In [200]:
sentences = sentences.split('.')

In [201]:
sentences

['\nBeautiful is better than ugly',
 '\nExplicit is better than implicit',
 '\nSimple is better than complex',
 '\nComplex is better than complicated',
 '\nFlat is better than nested',
 '\nSparse is better than dense',
 '\nReadability counts',
 "\nSpecial cases aren't special enough to break the rules",
 '\nAlthough practicality beats purity',
 '\nErrors should never pass silently',
 '\nUnless explicitly silenced',
 '\nIn the face of ambiguity, refuse the temptation to guess',
 '\nThere should be one-- and preferably only one --obvious way to do it',
 "\nAlthough that way may not be obvious at first unless you're Dutch",
 '\nNow is better than never',
 '\nAlthough never is often better than *right* now',
 "\nIf the implementation is hard to explain, it's a bad idea",
 '\nIf the implementation is easy to explain, it may be a good idea',
 "\nNamespaces are one honking great idea -- let's do more of those!"]

In [206]:
import re
# Replace every run of characters other than letters and digits with a single space
sentences = [re.sub(r'[^A-Za-z0-9]+', ' ', sent) for sent in sentences]

In [207]:
sentences

[' Beautiful is better than ugly',
 ' Explicit is better than implicit',
 ' Simple is better than complex',
 ' Complex is better than complicated',
 ' Flat is better than nested',
 ' Sparse is better than dense',
 ' Readability counts',
 ' Special cases aren t special enough to break the rules',
 ' Although practicality beats purity',
 ' Errors should never pass silently',
 ' Unless explicitly silenced',
 ' In the face of ambiguity refuse the temptation to guess',
 ' There should be one and preferably only one obvious way to do it',
 ' Although that way may not be obvious at first unless you re Dutch',
 ' Now is better than never',
 ' Although never is often better than right now',
 ' If the implementation is hard to explain it s a bad idea',
 ' If the implementation is easy to explain it may be a good idea',
 ' Namespaces are one honking great idea let s do more of those ']

In [208]:
# Drop one-character tokens (such as the "t" left over from "aren't") and collapse extra whitespace
sentences = [' '.join(w for w in sent.split() if len(w) > 1) for sent in sentences]

In [209]:
sentences

['Beautiful is better than ugly',
 'Explicit is better than implicit',
 'Simple is better than complex',
 'Complex is better than complicated',
 'Flat is better than nested',
 'Sparse is better than dense',
 'Readability counts',
 'Special cases aren special enough to break the rules',
 'Although practicality beats purity',
 'Errors should never pass silently',
 'Unless explicitly silenced',
 'In the face of ambiguity refuse the temptation to guess',
 'There should be one and preferably only one obvious way to do it',
 'Although that way may not be obvious at first unless you re Dutch',
 'Now is better than never',
 'Although never is often better than right now',
 'If the implementation is hard to explain it bad idea',
 'If the implementation is easy to explain it may be good idea',
 'Namespaces are one honking great idea let do more of those']

In [210]:
sentences = list(map(lambda sent: sent.lower(), sentences))

In [211]:
sentences

['beautiful is better than ugly',
 'explicit is better than implicit',
 'simple is better than complex',
 'complex is better than complicated',
 'flat is better than nested',
 'sparse is better than dense',
 'readability counts',
 'special cases aren special enough to break the rules',
 'although practicality beats purity',
 'errors should never pass silently',
 'unless explicitly silenced',
 'in the face of ambiguity refuse the temptation to guess',
 'there should be one and preferably only one obvious way to do it',
 'although that way may not be obvious at first unless you re dutch',
 'now is better than never',
 'although never is often better than right now',
 'if the implementation is hard to explain it bad idea',
 'if the implementation is easy to explain it may be good idea',
 'namespaces are one honking great idea let do more of those']
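
The preparation steps above can also be collected into one function. This is a sketch under the same assumptions as the cells above (split on `.`, keep only ASCII letters and digits, drop one-character tokens, lowercase); the name `preprocess` is made up.

```python
import re

def preprocess(raw_text):
    """Return a list of cleaned, lowercased sentences."""
    cleaned = []
    for raw in raw_text.split("."):
        # Replace punctuation and other non-alphanumeric runs with a space.
        sent = re.sub(r"[^A-Za-z0-9]+", " ", raw)
        # Drop one-character tokens and lowercase the rest.
        words = [w.lower() for w in sent.split() if len(w) > 1]
        if words:  # skip empty fragments, e.g. after the final period
            cleaned.append(" ".join(words))
    return cleaned

sample = "Now is better than never.\nIf the implementation is hard to explain, it's a bad idea."
result = preprocess(sample)
print(result)
```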

In [215]:
from keras.preprocessing import text
from keras.utils import np_utils
from keras.preprocessing import sequence
import numpy as np
import pandas as pd


# Build the vocabulary: every distinct word gets an integer id (ids start at 1)
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(sentences)
word_ids = tokenizer.word_index

In [216]:
word_ids

{'is': 1,
 'better': 2,
 'than': 3,
 'to': 4,
 'the': 5,
 'although': 6,
 'never': 7,
 'be': 8,
 'one': 9,
 'it': 10,
 'idea': 11,
 'complex': 12,
 'special': 13,
 'should': 14,
 'unless': 15,
 'of': 16,
 'obvious': 17,
 'way': 18,
 'do': 19,
 'may': 20,
 'now': 21,
 'if': 22,
 'implementation': 23,
 'explain': 24,
 'beautiful': 25,
 'ugly': 26,
 'explicit': 27,
 'implicit': 28,
 'simple': 29,
 'complicated': 30,
 'flat': 31,
 'nested': 32,
 'sparse': 33,
 'dense': 34,
 'readability': 35,
 'counts': 36,
 'cases': 37,
 'aren': 38,
 'enough': 39,
 'break': 40,
 'rules': 41,
 'practicality': 42,
 'beats': 43,
 'purity': 44,
 'errors': 45,
 'pass': 46,
 'silently': 47,
 'explicitly': 48,
 'silenced': 49,
 'in': 50,
 'face': 51,
 'ambiguity': 52,
 'refuse': 53,
 'temptation': 54,
 'guess': 55,
 'there': 56,
 'and': 57,
 'preferably': 58,
 'only': 59,
 'that': 60,
 'not': 61,
 'at': 62,
 'first': 63,
 'you': 64,
 're': 65,
 'dutch': 66,
 'often': 67,
 'right': 68,
 'hard': 69,
 'bad': 70,
 '

In [217]:
id_words = {v:k for k,v in word_ids.items()}

In [218]:
id_words

{1: 'is',
 2: 'better',
 3: 'than',
 4: 'to',
 5: 'the',
 6: 'although',
 7: 'never',
 8: 'be',
 9: 'one',
 10: 'it',
 11: 'idea',
 12: 'complex',
 13: 'special',
 14: 'should',
 15: 'unless',
 16: 'of',
 17: 'obvious',
 18: 'way',
 19: 'do',
 20: 'may',
 21: 'now',
 22: 'if',
 23: 'implementation',
 24: 'explain',
 25: 'beautiful',
 26: 'ugly',
 27: 'explicit',
 28: 'implicit',
 29: 'simple',
 30: 'complicated',
 31: 'flat',
 32: 'nested',
 33: 'sparse',
 34: 'dense',
 35: 'readability',
 36: 'counts',
 37: 'cases',
 38: 'aren',
 39: 'enough',
 40: 'break',
 41: 'rules',
 42: 'practicality',
 43: 'beats',
 44: 'purity',
 45: 'errors',
 46: 'pass',
 47: 'silently',
 48: 'explicitly',
 49: 'silenced',
 50: 'in',
 51: 'face',
 52: 'ambiguity',
 53: 'refuse',
 54: 'temptation',
 55: 'guess',
 56: 'there',
 57: 'and',
 58: 'preferably',
 59: 'only',
 60: 'that',
 61: 'not',
 62: 'at',
 63: 'first',
 64: 'you',
 65: 're',
 66: 'dutch',
 67: 'often',
 68: 'right',
 69: 'hard',
 70: 'bad',
 7

In [219]:
text.text_to_word_sequence('ala ma kota a kot ma mleko')

['ala', 'ma', 'kota', 'a', 'kot', 'ma', 'mleko']

In [220]:
[doc for doc in sentences]

['beautiful is better than ugly',
 'explicit is better than implicit',
 'simple is better than complex',
 'complex is better than complicated',
 'flat is better than nested',
 'sparse is better than dense',
 'readability counts',
 'special cases aren special enough to break the rules',
 'although practicality beats purity',
 'errors should never pass silently',
 'unless explicitly silenced',
 'in the face of ambiguity refuse the temptation to guess',
 'there should be one and preferably only one obvious way to do it',
 'although that way may not be obvious at first unless you re dutch',
 'now is better than never',
 'although never is often better than right now',
 'if the implementation is hard to explain it bad idea',
 'if the implementation is easy to explain it may be good idea',
 'namespaces are one honking great idea let do more of those']

In [221]:
sentence_encoded = [[word_ids[w] for w in text.text_to_word_sequence(doc)] for doc in sentences]

In [222]:
sentence_encoded[0][:10]

[25, 1, 2, 3, 26]
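
The ids above are frequency-ranked, with the most common word receiving id 1. A plain-Python approximation of that numbering is sketched below. It is not the Keras implementation, and `build_word_index` is a hypothetical helper; id 0 is deliberately left unused so that it can serve as padding.

```python
from collections import Counter

def build_word_index(sentences):
    """Assign ids 1..N, most frequent word first (ties keep first-seen order)."""
    counts = Counter(w for s in sentences for w in s.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy = ["flat is better than nested", "sparse is better than dense"]
word_index = build_word_index(toy)
encoded = [[word_index[w] for w in s.split()] for s in toy]

print(word_index)
print(encoded)
```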

In [223]:
# Parameters
vocab_size = len(word_ids) + 1  # +1: Tokenizer ids start at 1, index 0 is reserved for padding
embed_size = 16
window_size = 2

In [224]:
def generate_context_word_pairs(corpus, window_size, vocab_size):
    """Yield CBOW training examples (x, y) for every word in the corpus.

    x: ids of the context words, left-padded with 0 to length 2 * window_size
    y: one-hot encoding of the target (center) word
    """
    context_length = window_size*2
    for words in corpus:
        sentence_length = len(words)
        for index, word in enumerate(words):
            context_words = []
            label_word   = []            
            start = index - window_size
            end = index + window_size + 1
            
            # Context: words inside the window, excluding the center word itself
            context_words.append([words[i] 
                                 for i in range(start, end) 
                                 if 0 <= i < sentence_length 
                                 and i != index])
            label_word.append(word)

            x = sequence.pad_sequences(context_words, maxlen=context_length)
            y = np_utils.to_categorical(label_word, vocab_size)
            yield (x, y)

In [225]:
i = 0
for x, y in generate_context_word_pairs(corpus=sentence_encoded, window_size=window_size, vocab_size=vocab_size):
    if 0 not in x[0]:
        print('Context (X):', [id_words[w] for w in x[0]], '-> Target (Y):', id_words[np.argwhere(y[0])[0][0]])
    
        if i == 10:
            break
        i += 1

Context (X): ['beautiful', 'is', 'than', 'ugly'] -> Target (Y): better
Context (X): ['explicit', 'is', 'than', 'implicit'] -> Target (Y): better
Context (X): ['simple', 'is', 'than', 'complex'] -> Target (Y): better
Context (X): ['complex', 'is', 'than', 'complicated'] -> Target (Y): better
Context (X): ['flat', 'is', 'than', 'nested'] -> Target (Y): better
Context (X): ['sparse', 'is', 'than', 'dense'] -> Target (Y): better
Context (X): ['special', 'cases', 'special', 'enough'] -> Target (Y): aren
Context (X): ['cases', 'aren', 'enough', 'to'] -> Target (Y): special
Context (X): ['aren', 'special', 'to', 'break'] -> Target (Y): enough
Context (X): ['special', 'enough', 'break', 'the'] -> Target (Y): to
Context (X): ['enough', 'to', 'the', 'rules'] -> Target (Y): break
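
Pairs like these are what a CBOW network is trained on. The notebook does not build such a network, so the following is only a minimal NumPy sketch under stated assumptions: a toy corpus of made-up word ids, a full softmax output, and plain SGD. Production implementations such as gensim typically replace the full softmax with negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy corpus, already encoded as word ids 1..4 (id 0 is left for padding).
corpus = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 4]]
vocab_size, embed_size, window = 5, 8, 1

W_in = rng.normal(scale=0.5, size=(vocab_size, embed_size))   # word embeddings
W_out = rng.normal(scale=0.5, size=(embed_size, vocab_size))  # output weights

def examples(corpus, window):
    """Yield (context_ids, target_id) pairs without padding."""
    for sent in corpus:
        for i, target in enumerate(sent):
            yield sent[max(0, i - window):i] + sent[i + 1:i + 1 + window], target

def train_epoch(W_in, W_out, lr=0.2):
    """One pass of plain SGD; updates both weight matrices in place."""
    losses = []
    for context, target in examples(corpus, window):
        hidden = W_in[context].mean(axis=0)        # average of context embeddings
        scores = hidden @ W_out
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                        # softmax over the vocabulary
        losses.append(-np.log(probs[target]))       # cross-entropy loss

        grad_scores = probs.copy()
        grad_scores[target] -= 1.0                  # d(loss) / d(scores)
        grad_hidden = W_out @ grad_scores           # computed before W_out changes
        W_out -= lr * np.outer(hidden, grad_scores)
        for c in context:
            W_in[c] -= lr * grad_hidden / len(context)
    return float(np.mean(losses))

history = [train_epoch(W_in, W_out) for _ in range(100)]
print(f"mean loss: first epoch {history[0]:.3f}, last epoch {history[-1]:.3f}")
```

After training, the rows of `W_in` are the learned embeddings, one per word id.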


In [226]:
import multiprocessing

from gensim.models import Word2Vec

In [227]:
cores = multiprocessing.cpu_count()

In [228]:
cores

8

In [229]:
Word2Vec?

In [230]:
sentences = [
    'Ala ma kota a kot ma mleko',
    'Krzyś ma psa a pies zabawkę',
    "Ala ma psa i lubi spacery"
]

In [231]:
sentences_as_list = list(map(lambda x: x.split(), sentences))

In [232]:
sentences_as_list

[['Ala', 'ma', 'kota', 'a', 'kot', 'ma', 'mleko'],
 ['Krzyś', 'ma', 'psa', 'a', 'pies', 'zabawkę'],
 ['Ala', 'ma', 'psa', 'i', 'lubi', 'spacery']]

In [233]:
w2v_model = Word2Vec(min_count=1,
                     window=2,
                     vector_size=4,
                     workers=cores-1)

In [234]:
w2v_model.build_vocab(sentences_as_list, progress_per=1000)

In [235]:
w2v_model.wv.key_to_index, w2v_model.wv.index_to_key

({'ma': 0,
  'psa': 1,
  'a': 2,
  'Ala': 3,
  'spacery': 4,
  'lubi': 5,
  'i': 6,
  'zabawkę': 7,
  'pies': 8,
  'Krzyś': 9,
  'mleko': 10,
  'kot': 11,
  'kota': 12},
 ['ma',
  'psa',
  'a',
  'Ala',
  'spacery',
  'lubi',
  'i',
  'zabawkę',
  'pies',
  'Krzyś',
  'mleko',
  'kot',
  'kota'])

In [236]:
w2v_model.train(sentences_as_list, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

(73, 570)

In [237]:
w2v_model.wv.most_similar('pies', topn=1)

[('Krzyś', 0.9422240257263184)]

#### A worked example

In [238]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

In [239]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [240]:
model = Word2Vec(vector_size=100, window=5, min_count=1, workers=4)

In [241]:
model.build_vocab(common_texts)

In [242]:
model.train(common_texts, total_examples=model.corpus_count, epochs=1)

(3, 29)

In [243]:
vector = model.wv['computer']

In [244]:
model.wv.most_similar('computer', topn=3)

[('system', 0.21617144346237183),
 ('survey', 0.04468921199440956),
 ('interface', 0.015025189146399498)]

In [246]:
word_vectors = model.wv

In [247]:
word_vectors.save("word2vec.wordvectors")

In [248]:
from gensim.models import KeyedVectors

In [249]:
wv_loaded = KeyedVectors.load("word2vec.wordvectors", mmap='r')

In [251]:
wv_loaded.most_similar('system', topn=3)

[('computer', 0.21617144346237183),
 ('response', 0.09291724115610123),
 ('human', 0.07963485270738602)]