[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/jkanclerz/analiza-dokumentow/blob/main/41--word2vec-sentiment.ipynb)

## Word2Vec

Word2Vec is not a single algorithm but a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned with Word2Vec have proven effective in many natural language processing tasks.

### OneHot vs WordVector

* Lower dimensionality: 8, 50, 100, or 300 vs. the vocabulary size (140000 for Polish, 350000 for English)
    * efficient storage in memory
* Semantics / meaning
    * In a dictionary-based (one-hot) encoding, *kot* (cat) vs. *pies* (dog) == *kot* (cat) vs. *pieniądz* (money)
    * Intuitively this is not true, but the machine has no way to reason about similarity
* Context
    * Example: "I could not find a single argument that would let me disagree with you."
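
The semantics point can be made concrete with a small sketch. The three-word vocabulary and the 2-dimensional dense values below are made up for illustration and do not come from any trained model; the helper `euclidean` is likewise just an illustrative name.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vocab = ["kot", "pies", "pieniądz"]  # cat, dog, money

# One-hot: a vector as long as the vocabulary with a single 1.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense vectors, hand-picked for illustration only.
dense = {
    "kot": [0.9, 0.1],
    "pies": [0.8, 0.2],
    "pieniądz": [0.1, 0.9],
}

# Every pair of distinct one-hot vectors is equally far apart (sqrt(2)).
print(euclidean(one_hot["kot"], one_hot["pies"]))
print(euclidean(one_hot["kot"], one_hot["pieniądz"]))

# Dense vectors can place "kot" closer to "pies" than to "pieniądz".
print(euclidean(dense["kot"], dense["pies"]))
print(euclidean(dense["kot"], dense["pieniądz"]))
```

With one-hot vectors the two distances are identical, so no notion of similarity can be read off them; the dense vectors are free to encode it.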

We assume that words with similar meanings occur in the same context more often than words that are completely unrelated to the topic.

The idea is to build a matrix of *word embeddings*, with one vector for each word in a large corpus. Each word is assigned its own vector in such a way that words that often appear together in the same context receive vectors that are close to each other. The result is a model that may not know that a "lion" (*lew*) is an animal, but does know that "lion" lies closer in the vector space to "cat" (*kot*) and "dog" (*pies*) than to "cutlet" (*kotlet*) ``a fine dad joke, milord``

### Continuous Bag-of-Words Model
The continuous bag-of-words (CBOW) model predicts the middle word from the surrounding context words. The context consists of a few words before and after the current (middle) word. The architecture is called a bag-of-words model because the order of the words in the context does not matter.


### Continuous Skip-gram Model
The continuous skip-gram model predicts the words within a certain range before and after the current word in the same sentence. A worked example is given below.

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


**Now is better than never, although never is often better than right now.**

![](https://githubtocolab.com/jkanclerz/analiza-dokumentow/blob/main/images/window.jpg)

#### Parameters:
* Window size -> 2

#### Transformation
* Iteration 1
    * target: Now
    * context: [is, better]
* Iteration 2
    * target: is
    * context: [Now, better, than]
* Iteration 3
    * target: better
    * context: [Now, is, than, never]
    
#### Input data
[(Now, is), (Now, better), (is, Now), (is, better), (is, than)...]
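
The transformation above can be sketched in a few lines of plain Python. The function name `context_windows` is illustrative, and the sketch assumes a fixed, symmetric window; library implementations may differ in detail.

```python
def context_windows(tokens, window=2):
    """Yield (target, context) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "Now is better than never".split()

for target, context in context_windows(tokens, window=2):
    print(target, context)

# Flatten each window into (target, context_word) pairs.
pairs = [
    (target, word)
    for target, context in context_windows(tokens, window=2)
    for word in context
]
print(pairs[:5])
```

The third window printed is `better ['Now', 'is', 'than', 'never']`, and the first five pairs match the input data listed above.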

### Skip-gram
* Iteration 1
    * target: [is, better]
    * word: Now
* Iteration 2
    * target: [Now, better, than]
    * word: is
* Iteration 3
    * target: [Now, is, than, never]
    * word: better
    


### Walkthrough

Is better _____ never although -> **than**

#### CBOW
One-hot (0,0,1,0,1,1,0....0,1,1) -> Neural network (50 parameters) -> One-hot (..., 1, ...)
#### Skip-gram
One-hot (..., 1, ...) -> Neural network (50 parameters) -> One-hot (0,0,1,0,1,1,0....0,1,1)
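
A minimal sketch of the CBOW direction of this schematic, written in plain Python so that every step is visible. It is only a forward pass with randomly initialised weights (no training), the five-word vocabulary and the hidden size of 3 are arbitrary choices for illustration, and it uses a full softmax, whereas practical implementations usually rely on negative sampling or hierarchical softmax instead.

```python
import math
import random

random.seed(0)

vocab = ["now", "is", "better", "than", "never"]
word_to_index = {word: i for i, word in enumerate(vocab)}
vocab_size = len(vocab)
hidden_size = 3  # embedding dimension, kept tiny for readability

# Randomly initialised weights; training would adjust them.
w_in = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        for _ in range(vocab_size)]    # vocab_size x hidden_size
w_out = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)]
         for _ in range(hidden_size)]  # hidden_size x vocab_size

def cbow_forward(context_words):
    """Return a probability for every vocabulary word being the target."""
    rows = [w_in[word_to_index[w]] for w in context_words]
    # Multiplying a one-hot vector by w_in selects a row, so the hidden
    # layer is simply the average of the context words' rows.
    hidden = [sum(col) / len(rows) for col in zip(*rows)]
    scores = [sum(h * w_out[k][j] for k, h in enumerate(hidden))
              for j in range(vocab_size)]
    # Softmax turns the scores into a probability distribution.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["is", "better", "never"])
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

After training, the rows of `w_in` would serve as the word embeddings; skip-gram runs the same two matrices in the opposite direction, from one word to each of its context words.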

![](https://github.com/jkanclerz/analiza-tekstu/raw/master/var/models.jpeg)

### Word meaning

![](https://github.com/jkanclerz/analiza-tekstu/raw/master/var/dimenssions.jpeg)

#### Pretrained models

* [http://dsmodels.nlp.ipipan.waw.pl/](http://dsmodels.nlp.ipipan.waw.pl/)

In [3]:
!wget http://dsmodels.nlp.ipipan.waw.pl/dsmodels/wiki-forms-all-100-cbow-ns-30-it100.txt.gz

--2023-02-11 21:36:00--  http://dsmodels.nlp.ipipan.waw.pl/dsmodels/wiki-forms-all-100-cbow-ns-30-it100.txt.gz
Resolving dsmodels.nlp.ipipan.waw.pl (dsmodels.nlp.ipipan.waw.pl)... 213.135.36.94
Connecting to dsmodels.nlp.ipipan.waw.pl (dsmodels.nlp.ipipan.waw.pl)|213.135.36.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96291106 (92M) [application/octet-stream]
Saving to: ‘wiki-forms-all-100-cbow-ns-30-it100.txt.gz’


2023-02-11 21:36:03 (30,6 MB/s) - ‘wiki-forms-all-100-cbow-ns-30-it100.txt.gz’ saved [96291106/96291106]



In [4]:
pip install gensim

Collecting gensim
  Downloading gensim-4.3.0-cp311-cp311-macosx_10_9_x86_64.whl (24.0 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-6.3.0-py3-none-any.whl (56 kB)
Collecting FuzzyTM>=0.4.0
  Downloading FuzzyTM-2.0.5-py3-none-any.whl (29 kB)
Collecting pyfume
  Downloading pyFUME-0.2.25-py3-none-any.whl (67 kB)
Collecting simpful
  Downloading simpful-2.9.0-py3-none-any.whl (30 kB)
Collecting fst-pso
  Downloading fst-pso-1.8.1.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Collecting miniful
  Downloading miniful-0.0.6.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... done

In [5]:
from gensim.models import KeyedVectors

In [6]:
word2vec_model = KeyedVectors.load_word2vec_format('wiki-forms-all-100-cbow-ns-30-it100.txt.gz', binary=False)

In [7]:
word2vec_model

<gensim.models.keyedvectors.KeyedVectors at 0x127c10b90>

In [11]:
text = "Ala ma kota"

encode = [word2vec_model[word] for word in text.split()]

In [12]:
encode

[array([ 2.60359 , -0.53379 , -0.418538, -0.567743, -1.828997,  3.992933,
         3.004368, -0.910962,  0.458971,  0.270718, -0.51762 ,  5.142121,
        -1.192841,  2.031395, -0.659251, -0.480254, -0.093248,  0.107133,
         1.641894, -1.490792,  0.423677,  2.674135, -4.119183,  2.790525,
         3.60652 ,  1.260155,  0.154874,  0.276685,  1.87403 , -1.119008,
        -2.647022,  1.990544, -1.601704, -0.728631,  1.684662, -1.950735,
         1.468508,  0.509837, -1.053435,  1.315314, -1.520988,  4.385165,
        -6.620793, -0.686365, -1.685816,  0.342647, -0.522114,  0.749273,
        -2.049263,  1.286566, -0.529605, -3.33568 , -0.834344, -1.072513,
         1.789571,  0.529662,  0.876424,  2.529619,  2.496111, -0.204708,
         0.358681,  0.591321, -0.815984,  2.830306, -0.189545,  1.692742,
         0.648947, -2.719239,  1.723135,  1.460818, -1.741183, -0.57327 ,
        -1.564336, -1.385513, -1.558214, -3.22159 , -2.859038,  2.411288,
         1.61544 , -0.036749, -1.58677

In [8]:
word2vec_model['komputer']

array([ 2.556872,  3.016202,  1.341352, -0.296617,  1.487371,  2.657528,
       -3.147548,  0.904789,  3.331265, -1.382941, -1.957119,  0.096157,
        1.998081,  6.033292, -1.769325, -2.343697, -4.029065,  9.055704,
        2.468241,  2.156334,  4.05997 , -3.280162, -7.082334, -1.680108,
       -4.101147, -2.248774, -2.508852,  4.602365,  2.489758,  3.700083,
       -2.901632,  1.26614 ,  0.398787, -2.246749,  3.065425, -2.838281,
        0.539788,  0.363134,  3.505548, -2.425144,  0.816681, -0.928684,
       -3.524406, -0.287948,  5.3212  , -3.676133, -3.609464,  3.185938,
       -1.579331,  1.425323,  1.221283, -0.397343,  1.213347, -1.154641,
        5.246403, -1.284907,  5.472346,  0.265039, -6.560622, -2.674489,
       -2.297376,  1.193259,  0.234391, -0.868071,  3.321518, -2.153233,
       -5.357116, -6.564101,  2.353112,  5.653057,  5.247916,  1.126638,
        1.715316, -0.273689,  1.2469  , -4.295434, -0.050897,  4.484003,
       -5.674363, -4.953731, -3.055965, -1.356919, 

In [14]:
word2vec_model['kot'].shape

(100,)

In [16]:
word2vec_model.similar_by_vector(word2vec_model['kot'])

[('kot', 1.0),
 ('Gołąb', 0.7597211599349976),
 ('piesek', 0.7462971806526184),
 ('Wilk', 0.7396236062049866),
 ('chudy', 0.7388879656791687),
 ('słoń', 0.7344905138015747),
 ('królik', 0.7295685410499573),
 ('pies', 0.7187919616699219),
 ('Kruk', 0.7150323390960693),
 ('ptaszek', 0.7124853134155273)]

In [17]:
word2vec_model.similar_by_key("pies")

[('koń', 0.8431771993637085),
 ('ptak', 0.7892438173294067),
 ('smok', 0.7819583415985107),
 ('myśliwy', 0.7620030045509338),
 ('królik', 0.7560844421386719),
 ('wąż', 0.741938591003418),
 ('pająk', 0.7307115197181702),
 ('kot', 0.7187919616699219),
 ('stwór', 0.7162348628044128),
 ('pirat', 0.7134245038032532)]

In [18]:
from scipy.spatial.distance import cosine

In [30]:
1 - cosine(word2vec_model['pies'], word2vec_model['kot'])

0.7187920212745667

In [31]:
1 - cosine(word2vec_model['klucz'], word2vec_model['czapka'])

0.003615056164562702

In [32]:
1 - word2vec_model.distances('pies', ['kot', 'pies'])

array([0.718792  , 0.99999994], dtype=float32)
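
The cells above turn a cosine distance into a similarity by computing `1 - distance`. A minimal sketch of the underlying formula, cos(u, v) = u·v / (|u| |v|), in plain Python; the three short vectors are made up for illustration and have nothing to do with the loaded model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, for illustration only.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u
w = [3.0, 0.0, -1.0]  # orthogonal to u

print(cosine_similarity(u, v))  # close to 1: same direction
print(cosine_similarity(u, w))  # close to 0: unrelated directions
```

Because only the direction matters, scaling a vector does not change the result, which is why `u` and `v` come out as maximally similar.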

In [15]:
word2vec_model.most_similar("komputer")

[('mikroprocesor', 0.8413556218147278),
 ('procesor', 0.8408359289169312),
 ('mikrokomputer', 0.7868935465812683),
 ('serwer', 0.7859148979187012),
 ('sterownik', 0.7709600925445557),
 ('odbiornik', 0.7664448618888855),
 ('modem', 0.76606285572052),
 ('automat', 0.7601091265678406),
 ('interfejs', 0.7560226321220398),
 ('moduł', 0.7477715015411377)]

In [33]:
# kobieta (woman) + król (king) - mężczyzna (man) => ?
word2vec_model.most_similar(positive=['kobieta', 'król'], negative=['mężczyzna'], topn=2)

[('królowa', 0.8121304512023926), ('cesarzowa', 0.7185278534889221)]
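
The arithmetic behind such an analogy query can be sketched with toy vectors. Everything below is hypothetical: the six words, the hand-picked 3-dimensional values, and the helper `analogy`. It illustrates the idea of ranking words by cosine similarity to `woman + king - man` and is not gensim's exact implementation.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical vectors; the axes loosely mean (royal, male, female).
vectors = {
    "king":     [0.9, 0.9, 0.1],
    "queen":    [0.9, 0.1, 0.9],
    "man":      [0.1, 0.9, 0.1],
    "woman":    [0.1, 0.1, 0.9],
    "prince":   [0.7, 0.9, 0.1],
    "princess": [0.7, 0.1, 0.9],
}

def analogy(positive, negative):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Skip the query words themselves, as gensim's most_similar does.
    candidates = [w for w in vectors if w not in positive + negative]
    return sorted(candidates,
                  key=lambda w: cosine_similarity(query, vectors[w]),
                  reverse=True)

print(analogy(positive=["woman", "king"], negative=["man"]))
# ['queen', 'princess', 'prince']
```

Subtracting "man" removes the male component from "king" and adding "woman" puts the female component in, so the query vector lands on "queen".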

In [34]:
len(word2vec_model.index_to_key)

226396

In [35]:
word2vec_model.index_to_key[:5]

['w', 'i', 'na', 'z', 'do']

In [36]:
word2vec_model['studia']

array([  2.340531,   4.042009,   1.379964,  -2.570875,  -6.386054,
         2.563526,  -2.15919 ,  -5.862146,   0.900218,   4.766161,
         6.316826,  -3.465864,   4.536585,   5.025024,   2.033738,
         5.342262,  -9.382997,   2.485591,   9.384068,  -6.73937 ,
         6.684585,   4.521341,   0.503903,   5.57772 ,  -1.284685,
        -2.736128,   3.997064,  -1.389462,   2.380089,   1.208408,
        -2.977874,  -4.645787,   0.359989,   5.398995, -13.140451,
        -1.73424 ,  -0.928283,  -6.178953,  -5.871259,  -0.229391,
        -6.043469,  -1.246072,  -0.312358,  -0.607806,   5.411756,
       -11.643624,   7.283361,  -3.326613,   4.791219,   4.176166,
         2.003737,  -7.958755,   6.639453,   4.922795,   5.656033,
        -2.38712 ,  -7.016893,  -1.731023,   8.705477,   6.712057,
        -0.108324,   1.07114 ,   1.748102,   1.65885 ,   5.361063,
         1.075046,   1.301678,   2.086038,   0.717669,   0.452681,
         4.019646,   1.437227,   8.210261,   3.675681,   0.358

### Training your own embedding

In [80]:
sentences = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"""

#### Preparation

In [81]:
sentences = sentences.split('\n')

In [82]:
import string
sentences = map(lambda chars: "".join([char for char in chars if char not in string.punctuation]), sentences)


In [83]:
sentences = list(sentences)

In [84]:
sentences

['Beautiful is better than ugly',
 'Explicit is better than implicit',
 'Simple is better than complex',
 'Complex is better than complicated',
 'Flat is better than nested',
 'Sparse is better than dense',
 'Readability counts',
 'Special cases arent special enough to break the rules',
 'Although practicality beats purity',
 'Errors should never pass silently',
 'Unless explicitly silenced',
 'In the face of ambiguity refuse the temptation to guess',
 'There should be one and preferably only one obvious way to do it',
 'Although that way may not be obvious at first unless youre Dutch',
 'Now is better than never',
 'Although never is often better than right now',
 'If the implementation is hard to explain its a bad idea',
 'If the implementation is easy to explain it may be a good idea',
 'Namespaces are one honking great idea  lets do more of those']

In [45]:
import re
sentences = list(map(lambda sent: re.sub('[^A-Za-z0-9]+', ' ', sent), sentences))  # replace each run of non-alphanumeric characters with a single space

In [85]:
sentences

['Beautiful is better than ugly',
 'Explicit is better than implicit',
 'Simple is better than complex',
 'Complex is better than complicated',
 'Flat is better than nested',
 'Sparse is better than dense',
 'Readability counts',
 'Special cases arent special enough to break the rules',
 'Although practicality beats purity',
 'Errors should never pass silently',
 'Unless explicitly silenced',
 'In the face of ambiguity refuse the temptation to guess',
 'There should be one and preferably only one obvious way to do it',
 'Although that way may not be obvious at first unless youre Dutch',
 'Now is better than never',
 'Although never is often better than right now',
 'If the implementation is hard to explain its a bad idea',
 'If the implementation is easy to explain it may be a good idea',
 'Namespaces are one honking great idea  lets do more of those']

In [86]:
sentences = list(map(lambda sent: sent.lower(), sentences))

In [87]:
sentences

['beautiful is better than ugly',
 'explicit is better than implicit',
 'simple is better than complex',
 'complex is better than complicated',
 'flat is better than nested',
 'sparse is better than dense',
 'readability counts',
 'special cases arent special enough to break the rules',
 'although practicality beats purity',
 'errors should never pass silently',
 'unless explicitly silenced',
 'in the face of ambiguity refuse the temptation to guess',
 'there should be one and preferably only one obvious way to do it',
 'although that way may not be obvious at first unless youre dutch',
 'now is better than never',
 'although never is often better than right now',
 'if the implementation is hard to explain its a bad idea',
 'if the implementation is easy to explain it may be a good idea',
 'namespaces are one honking great idea  lets do more of those']
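
The preparation steps above (strip punctuation, lowercase, split into words) can also be folded into one small helper. This is a sketch of an equivalent approach using `str.translate`, not the notebook's own code; the name `tokenize` is illustrative.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_strip_punctuation = str.maketrans("", "", string.punctuation)

def tokenize(sentence):
    """Remove punctuation, lowercase, and split on whitespace."""
    return sentence.translate(_strip_punctuation).lower().split()

raw = "Namespaces are one honking great idea -- let's do more of those!"
print(tokenize(raw))
```

Splitting on whitespace without an argument also swallows the double space left behind by the removed `--`, so no empty tokens appear.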

In [88]:
pip install keras


[notice] A new release of pip available: 22.3.1 -> 23.0
[notice] To update, run: python3.11 -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [89]:
[doc for doc in sentences]

['beautiful is better than ugly',
 'explicit is better than implicit',
 'simple is better than complex',
 'complex is better than complicated',
 'flat is better than nested',
 'sparse is better than dense',
 'readability counts',
 'special cases arent special enough to break the rules',
 'although practicality beats purity',
 'errors should never pass silently',
 'unless explicitly silenced',
 'in the face of ambiguity refuse the temptation to guess',
 'there should be one and preferably only one obvious way to do it',
 'although that way may not be obvious at first unless youre dutch',
 'now is better than never',
 'although never is often better than right now',
 'if the implementation is hard to explain its a bad idea',
 'if the implementation is easy to explain it may be a good idea',
 'namespaces are one honking great idea  lets do more of those']

In [90]:
test_sentences = [
    'Ala ma kota a kot ma mleko',
    'Krzyś ma psa a pies zabawkę',
    "Ala ma psa i lubi spacery"
]

In [93]:
test_sentences_as_list = list(map(lambda x: x.split(), test_sentences))

In [94]:
test_sentences_as_list

[['Ala', 'ma', 'kota', 'a', 'kot', 'ma', 'mleko'],
 ['Krzyś', 'ma', 'psa', 'a', 'pies', 'zabawkę'],
 ['Ala', 'ma', 'psa', 'i', 'lubi', 'spacery']]

In [97]:
as_list = [sent.split() for sent in sentences]

In [98]:
from gensim.models import Word2Vec
w2v_model = Word2Vec(min_count=1,
                     window=2,
                     vector_size=4,
                     workers=1)

In [99]:
w2v_model.build_vocab(as_list, progress_per=1000)

In [100]:
w2v_model.wv.key_to_index, w2v_model.wv.index_to_key

({'is': 0,
  'better': 1,
  'than': 2,
  'the': 3,
  'to': 4,
  'idea': 5,
  'never': 6,
  'although': 7,
  'be': 8,
  'one': 9,
  'obvious': 10,
  'it': 11,
  'do': 12,
  'way': 13,
  'special': 14,
  'now': 15,
  'of': 16,
  'should': 17,
  'may': 18,
  'unless': 19,
  'a': 20,
  'if': 21,
  'implementation': 22,
  'explain': 23,
  'complex': 24,
  'dense': 25,
  'ugly': 26,
  'explicit': 27,
  'errors': 28,
  'purity': 29,
  'implicit': 30,
  'beats': 31,
  'practicality': 32,
  'simple': 33,
  'rules': 34,
  'break': 35,
  'complicated': 36,
  'readability': 37,
  'enough': 38,
  'flat': 39,
  'nested': 40,
  'arent': 41,
  'sparse': 42,
  'cases': 43,
  'pass': 44,
  'counts': 45,
  'those': 46,
  'ambiguity': 47,
  'silently': 48,
  'explicitly': 49,
  'lets': 50,
  'great': 51,
  'honking': 52,
  'are': 53,
  'namespaces': 54,
  'good': 55,
  'easy': 56,
  'bad': 57,
  'its': 58,
  'hard': 59,
  'right': 60,
  'often': 61,
  'dutch': 62,
  'youre': 63,
  'first': 64,
  'at': 65,
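
The mapping above starts with the most frequent words ('is', 'better', 'than'), which suggests that the vocabulary is ordered by descending frequency. A rough standard-library sketch of such an ordering on a three-sentence excerpt; it does not reproduce gensim's internals (for example `min_count` filtering or its handling of ties).

```python
from collections import Counter

corpus = [
    "beautiful is better than ugly".split(),
    "now is better than never".split(),
    "although never is often better than right now".split(),
]

counts = Counter(word for sentence in corpus for word in sentence)

# most_common() sorts by descending count; ties keep first-seen order.
index_to_key = [word for word, _ in counts.most_common()]
key_to_index = {word: i for i, word in enumerate(index_to_key)}

print(index_to_key[:5])       # ['is', 'better', 'than', 'now', 'never']
print(key_to_index["never"])  # 4
```

The two structures are inverses of each other: `index_to_key[key_to_index[word]]` always returns `word`.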

In [101]:
w2v_model.train(as_list, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

(1436, 4080)
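
gensim documents the return value of `train` as a pair of word counts: an effective count and the total raw count. The second number can be checked by hand, assuming it is simply the number of words in the corpus multiplied by the number of epochs. The sketch below repeats the cleaning steps with the standard library only.

```python
import string

zen = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"""

# Same cleaning as in the preparation cells: drop punctuation, lowercase, split.
strip_punctuation = str.maketrans("", "", string.punctuation)
as_list = [line.translate(strip_punctuation).lower().split()
           for line in zen.split("\n")]

epochs = 30
words_per_epoch = sum(len(sentence) for sentence in as_list)

print(words_per_epoch)           # 136
print(words_per_epoch * epochs)  # 4080, the second number reported by train()
```

The first number (1436) is smaller, which is consistent with some word occurrences being skipped during training, for instance through the subsampling of very frequent words.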

In [104]:
w2v_model.wv.most_similar('simple', topn=5)

[('may', 0.9384155869483948),
 ('in', 0.8307865858078003),
 ('better', 0.8039076924324036),
 ('at', 0.7229748964309692),
 ('silenced', 0.702760636806488)]

#### Worked example

In [106]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

In [107]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [108]:
model = Word2Vec(vector_size=100, window=5, min_count=1, workers=4)

In [109]:
model.build_vocab(common_texts)

In [110]:
model.train(common_texts, total_examples=model.corpus_count, epochs=1)

(3, 29)

In [111]:
vector = model.wv['computer']

In [112]:
model.wv.most_similar('computer', topn=3)

[('system', 0.21617141366004944),
 ('survey', 0.044689226895570755),
 ('interface', 0.015025208704173565)]

In [113]:
word_vectors = model.wv

In [114]:
word_vectors.save("word2vec.wordvectors")

In [115]:
from gensim.models import KeyedVectors

In [116]:
wv_loaded = KeyedVectors.load("word2vec.wordvectors", mmap='r')

In [117]:
wv_loaded.most_similar('system', topn=3)

[('computer', 0.21617142856121063),
 ('response', 0.09291722625494003),
 ('human', 0.07963486760854721)]