# Tutorial 4: Vector Semantics - Gensim, SpaCy, Word2Vec and GloVe

After you dealt with synsemantics in the last tutorial, this tutorial is dedicated to vector semantics. Instead of relying on manually created synsemantic networks, in which words are linked to one another by relations, vector semantics derives the semantic relations between words automatically from their co-occurrences in corpora.
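
To make "co-occurrences in corpora" concrete, here is a minimal sketch that counts how often two words appear close to each other. The toy corpus, the window size and the function name `cooccurrence_counts` are assumptions made for illustration. Real models work on far larger corpora and turn such statistics into dense vectors.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right and store each pair in sorted order,
            # so that (a, b) and (b, a) end up in the same counter entry.
            for neighbour in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((word, neighbour)))] += 1
    return counts


# Hypothetical two-sentence corpus, already tokenized and lower-cased.
toy_corpus = [
    ["the", "summer", "was", "hot"],
    ["the", "winter", "was", "cold"],
]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("summer", "the")], counts[("the", "was")])
```

Words that share many neighbours (here "summer" and "winter" both occur next to "the" and "was") are the ones a vector model will tend to place close together.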

Fortunately, there are Python modules that greatly simplify working with word embeddings and also make it straightforward to obtain pre-trained models.
In this tutorial, you will work with Gensim, a Python library specifically
for working with semantic vectors (https://radimrehurek.com/gensim/index.html).

Work through the tasks in a Jupyter notebook.


## 1. Importing the modules and data
### a) Import NLP modules
Import pandas, NumPy, NLTK and re as in the first tutorial.
Install Gensim via pip or conda (in your console). Then import Gensim, KeyedVectors from gensim.models, and gensim.downloader as follows:

```python
import gensim
from gensim.models import KeyedVectors
import gensim.downloader
```

In [1]:
import pandas
import nltk
import re
import gensim
from gensim.models import KeyedVectors

### b) Downloading and setting up models
You have heard about different properties and types of word vector models in the lecture; in this tutorial you will compare several of them. Vector models can be obtained in several ways:
You can train them yourself on custom corpora, build them from published vector files (for example, from https://github.com/stanfordnlp/GloVe ) that are converted into Gensim models, or, most straightforwardly, load them directly from the pre-trained models that libraries like Gensim support natively.

As a GloVe model, we have already prepared the "GloVe 6B 100" model from the source mentioned above for you (https://th-koeln.sciebo.de/f/657059193; the password is the name of this course in capital letters, DI..5). You only need to load it as a model via the "KeyedVectors" class in Gensim:

```python
glove_6b_100_model = KeyedVectors.load_word2vec_format("glove.6B.100d.w2vformat.txt6", binary=False)
```
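
How the prepared file was converted is not stated here. As a rough sketch of what "transforming vectors into a model" can involve: the word2vec text format starts with a header line containing the vocabulary size and the number of dimensions, which raw GloVe text files lack. The helper name `add_word2vec_header` and the toy file contents below are assumptions for illustration only.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe-style text file and prepend the '<vocab_size> <dimensions>' header."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    dimensions = len(lines[0].split()) - 1  # the first field is the word itself
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dimensions}\n")
        dst.write("\n".join(lines) + "\n")
    return len(lines), dimensions


# Tiny hypothetical input with two 3-dimensional vectors.
with tempfile.TemporaryDirectory() as tmp:
    glove_file = os.path.join(tmp, "toy_glove.txt")
    w2v_file = os.path.join(tmp, "toy_glove.w2vformat.txt")
    with open(glove_file, "w", encoding="utf-8") as f:
        f.write("king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")
    shape = add_word2vec_header(glove_file, w2v_file)
    with open(w2v_file, encoding="utf-8") as f:
        header = f.readline().strip()

print(shape, header)
```

Gensim 4 also offers a `no_header=True` option on `load_word2vec_format` for header-less files; check the documentation of your installed version before relying on it.
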
Next, use the following command to output a list of all models available through Gensim:

```python
print(list(gensim.downloader.info()["models"].keys()))
```

Load the "glove-twitter-100" model via "gensim.downloader.load()" as follows:
```python
glove_twitter100_vectors = gensim.downloader.load("glove-twitter-100")
```

**The process can take a while and uses a lot of system memory. If you run into problems, skip downloading the pre-trained model via Gensim!**

The last way of obtaining a vector model that we would like to show you is to train one on a corpus of texts. For example, proceed as follows:
```python
from gensim.models.word2vec import Word2Vec

corpus = gensim.downloader.load('text8')  # downloads the text8 corpus on first use
word2vec = Word2Vec(corpus)  # trains a model with Gensim's default parameters
```

In [2]:
glove_6b_100_model = KeyedVectors.load_word2vec_format("datasets/glove.6B.100d.w2vformat.txt6", binary=False)
glove_6b_50_model = KeyedVectors.load_word2vec_format("datasets/glove.6B.50d.w2vformat.txt6", binary=False)

In [1]:
import gensim.downloader as dl

corpus = dl.load('text8')
from gensim.models.word2vec import Word2Vec
word2vec = Word2Vec(corpus)

collecting all words and their counts
PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
Creating a fresh vocabulary
Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 71290 unique words (28.083071371733357%% of original 253854, drops 182564)', 'datetime': '2022-04-29T10:07:41.376861', 'gensim': '4.1.2', 'python': '3.8.3 (default, Jul  2 2020, 16:21:59) \n[GCC 7.3.0]', 'platform': 'Linux-5.4.0-109-generic-x86_64-with-glibc2.10', 'event': 'prepare_vocab'}
Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 16718844 word corpus (98.3160275555599%% of original 17005207, drops 286363)', 'datetime': '2022-04-29T10:07:41.468123', 'gensim': '4.1.2', 'python': '3.8.3 (default, Jul  2 2020, 16:21:59) \n[GCC 7.3.0]', 'platform': 'Linux-5.4.0-109-generic-x86_64-with-glibc2.10', 'event': 'prepare_vocab'}
deleting the raw counts dictionary of 253854 items
sample=0.001 downsample
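
The log above reports that `effective_min_count=5` retains 71290 of the 253854 word types. A minimal sketch of this kind of frequency filter is shown here. The function name `prune_vocabulary` and the toy sentences are assumptions, and Gensim's own implementation does more than this (for example, downsampling of frequent words).

```python
from collections import Counter


def prune_vocabulary(sentences, min_count=5):
    """Keep only word types that occur at least `min_count` times."""
    raw_counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word: n for word, n in raw_counts.items() if n >= min_count}
    return raw_counts, kept


# Hypothetical corpus; min_count is lowered to 2 so the effect is visible.
toy_sentences = [["a", "b", "a", "c"], ["a", "b", "d"]]
raw_counts, kept = prune_vocabulary(toy_sentences, min_count=2)
retained_share = 100 * len(kept) / len(raw_counts)
print(sorted(kept), f"{retained_share:.1f}% of word types retained")
```

Rare words are dropped because a handful of occurrences gives too little context to learn a reliable vector.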

We can check the vector sizes via "get_vector()":

In [4]:
print('100: ',len(glove_6b_100_model.get_vector('king')))
print('50: ',len(glove_6b_50_model.get_vector('king')))
print('w2v: ',len(word2vec.wv['king']))

100:  100
50:  50
w2v:  100


## 2. Functions of Vector Semantics
You should now have between one and three different vector models loaded.
Many vector-semantic operations are available through the API of Gensim's Word2Vec module (see the Gensim documentation). Don't be confused by the naming: the methods apply to vector models in general, not just to Word2Vec.
### a) Find similar expressions.
Using the similar_by_word() method, determine the 10 most similar terms to the words "summer", "salad", and "python". Discuss in the group: what do you notice? Can you see any differences between the models based on larger corpora (glove.6b and glove twitter100) and the Word2Vec model you trained on the 32MB corpus? How could these differences come about?
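
The scores returned for similar words are cosine similarities between word vectors. The following sketch shows the idea on hand-made toy vectors. The vectors and the helper names `cosine` and `most_similar_words` are assumptions for illustration and are not Gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_words(vectors, word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-dimensional vectors; real models use 50 to 300 dimensions.
toy_vectors = {
    "summer": [0.9, 0.1, 0.0],
    "winter": [0.8, 0.2, 0.0],
    "salad": [0.0, 0.1, 0.9],
    "python": [0.1, 0.9, 0.1],
}
neighbours = most_similar_words(toy_vectors, "summer", topn=2)
print(neighbours)
```

Because only the angle between vectors matters, two words can be highly similar even if their vectors differ in length.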

In [10]:
sim_summer_100 = glove_6b_100_model.similar_by_word("summer") 
sim_summer_50 = glove_6b_50_model.similar_by_word("summer") 
sim_summer_word2vec = word2vec.wv.most_similar('summer', topn=10)

In [11]:
len(sim_summer_100)

10

In [38]:
for i in range(0,10):
    print(i+1,":")
    print("100:",sim_summer_100[i],"50:", sim_summer_50[i],"w2v:", sim_summer_word2vec[i])

1 :
100: ('winter', 0.8896950483322144) 50: ('winter', 0.919983446598053) w2v: ('winter', 0.8647171854972839)
2 :
100: ('spring', 0.8580390214920044) 50: ('spring', 0.8946049809455872) w2v: ('spring', 0.814113438129425)
3 :
100: ('autumn', 0.7742397785186768) 50: ('autumn', 0.8393581509590149) w2v: ('autumn', 0.8006542325019836)
4 :
100: ('weekend', 0.7385303378105164) 50: ('beginning', 0.8159069418907166) w2v: ('olympics', 0.7069157361984253)
5 :
100: ('year', 0.7348463535308838) 50: ('starting', 0.7925456762313843) w2v: ('season', 0.6917258501052856)
6 :
100: ('days', 0.7250120043754578) 50: ('day', 0.7914853096008301) w2v: ('daytime', 0.6646353602409363)
7 :
100: ('beginning', 0.7218300104141235) 50: ('weekend', 0.7911542654037476) w2v: ('seasons', 0.6619733572006226)
8 :
100: ('during', 0.7205086946487427) 50: ('during', 0.7859177589416504) w2v: ('afternoon', 0.6513773798942566)
9 :
100: ('season', 0.7031364440917969) 50: ('days', 0.7764894366264343) w2v: ('weekend', 0.647225081920

In [39]:
sim_salad_100 = glove_6b_100_model.similar_by_word("salad") 
sim_salad_50 = glove_6b_50_model.similar_by_word("salad") 
sim_salad_word2vec = word2vec.wv.most_similar('salad', topn=10)

for i in range(0,10):
    print(i+1,":")
    print("100:",sim_salad_100[i],"50:", sim_salad_50[i],"w2v:", sim_salad_word2vec[i])

1 :
100: ('salads', 0.766255259513855) 50: ('pasta', 0.8427600860595703) w2v: ('soup', 0.7483817338943481)
2 :
100: ('pasta', 0.760292649269104) 50: ('soup', 0.832454264163971) w2v: ('sweet', 0.73320472240448)
3 :
100: ('tomato', 0.7298500537872314) 50: ('salads', 0.8248024582862854) w2v: ('spicy', 0.7315133213996887)
4 :
100: ('vinaigrette', 0.7297168970108032) 50: ('mashed', 0.8081055879592896) w2v: ('pork', 0.725369930267334)
5 :
100: ('lettuce', 0.7067264318466187) 50: ('potato', 0.793901264667511) w2v: ('chicken', 0.723736047744751)
6 :
100: ('dessert', 0.697432279586792) 50: ('fried', 0.7893426418304443) w2v: ('grilled', 0.7175354361534119)
7 :
100: ('sauce', 0.6960574984550476) 50: ('baked', 0.7884256839752197) w2v: ('miso', 0.7116280794143677)
8 :
100: ('spinach', 0.6923962831497192) 50: ('tomato', 0.7836177349090576) w2v: ('roast', 0.7074493169784546)
9 :
100: ('cheese', 0.6896312832832336) 50: ('potatoes', 0.7813866138458252) w2v: ('stew', 0.7054186463356018)
10 :
100: ('pest

In [40]:
sim_python_100 = glove_6b_100_model.similar_by_word("python") 
sim_python_50 = glove_6b_50_model.similar_by_word("python") 
sim_python_word2vec = word2vec.wv.most_similar('python', topn=10)

for i in range(0,10):
    print(i+1,":")
    print("100:",sim_python_100[i],"50:", sim_python_50[i],"w2v:", sim_python_word2vec[i])

1 :
100: ('monty', 0.6886237263679504) 50: ('reticulated', 0.6916365623474121) w2v: ('monty', 0.8617887496948242)
2 :
100: ('php', 0.586538553237915) 50: ('spamalot', 0.6635736227035522) w2v: ('animaniacs', 0.732209324836731)
3 :
100: ('perl', 0.5784407258033752) 50: ('php', 0.6414496898651123) w2v: ('moby', 0.6848551630973816)
4 :
100: ('cleese', 0.5446676015853882) 50: ('owl', 0.6301496028900146) w2v: ('gilliam', 0.6839032173156738)
5 :
100: ('flipper', 0.5112984776496887) 50: ('mouse', 0.6275478601455688) w2v: ('muppet', 0.6801720857620239)
6 :
100: ('ruby', 0.5066928267478943) 50: ('reticulatus', 0.6274471282958984) w2v: ('blaxploitation', 0.6611126065254211)
7 :
100: ('spamalot', 0.505638837814331) 50: ('perl', 0.6267575025558472) w2v: ('slayer', 0.6601446866989136)
8 :
100: ('javascript', 0.5030569434165955) 50: ('monkey', 0.6207211017608643) w2v: ('grail', 0.658697783946991)
9 :
100: ('reticulated', 0.4983375668525696) 50: ('monty', 0.60793536901474) w2v: ('webcomic', 0.65780591

### b) Analogical Inference/ Relational Similarity.
Vector semantic models can be used to infer analogies of the form "A relates to B as A* relates to ...?", to the extent that these relations are present in the information extracted from the corpus. Using the method most_similar() (see the Gensim Word2Vec module API mentioned in the introduction to section 2), you can, for example, have "Paris - France + Italy" return "Rome":

```python
print('glove: Paris - France + Italy: ', glove_6b_100_model.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```
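
The arithmetic behind such a query can be sketched as follows: add the (length-normalized) positive vectors, subtract the negative ones, and look for the nearest remaining word by cosine similarity. The toy vectors, their hand-picked dimensions and the function name `analogy` are assumptions for illustration. This is a simplified picture, not Gensim's exact code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Return the words closest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    exclude = set(positive) | set(negative)  # query words are not valid answers
    scored = [
        (word, sum(a * b for a, b in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical dimensions: French, Italian, capital-ness, country-ness.
toy_vectors = {
    "france": [1.0, 0.0, 0.0, 1.0],
    "paris": [1.0, 0.0, 1.0, 0.0],
    "italy": [0.0, 1.0, 0.0, 1.0],
    "rome": [0.0, 1.0, 1.0, 0.0],
    "milan": [0.0, 1.0, 0.3, 0.0],
}
result = analogy(toy_vectors, positive=["paris", "italy"], negative=["france"])
print(result)
```

In real models the dimensions are not interpretable like this, which is why the nearest neighbours can also surface unexpected or biased associations.
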

This method can also reveal bias contained in the corpus or, more precisely, in the vector models. For dichotomous features such as [man, woman] or [young, old], put one member of the pair in "positive" and the other in "negative". Find out which three nearest associations the models return for "doctor + woman", "housewife + man" and "car + sea", choosing meaningful negatives. Discuss as a group! Can you find any other relational symmetries?

In [47]:
print('glove 100: Paris - France + Italy: ', glove_6b_100_model.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))

glove 100: Paris - France + Italy:  [('rome', 0.8189547061920166), ('milan', 0.7376196980476379), ('naples', 0.7117615342140198)]


In [48]:
print('glove 50: Paris - France + Italy: ', glove_6b_50_model.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))

glove 50: Paris - France + Italy:  [('rome', 0.8465589284896851), ('milan', 0.7766007781028748), ('turin', 0.7666355967521667)]


For the Word2Vec model, add ".wv":

In [50]:
print('Word2Vec: Paris - France + Italy: ', word2vec.wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))

Word2Vec: Paris - France + Italy:  [('venice', 0.7544799447059631), ('vienna', 0.7467787861824036), ('florence', 0.7297069430351257)]


Doctor + woman:

In [58]:
print('glove 50: Doctor - Man + Woman: ', glove_6b_50_model.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))

glove 50: Doctor - Man + Woman:  [('nurse', 0.8404642939567566), ('child', 0.7663259506225586), ('pregnant', 0.7570130228996277)]


In [59]:
print('Word2Vec: Doctor - Man + Woman: ', word2vec.wv.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))

Word2Vec: Doctor - Man + Woman:  [('nurse', 0.6485609412193298), ('teacher', 0.5760821104049683), ('child', 0.5707877278327942)]


Housewife + man:

In [63]:
print('glove 100: Housewife - Woman + Man: ', glove_6b_100_model.most_similar(positive=["housewife", "man"], negative=["woman"], topn=3))

glove 100: Housewife - Woman + Man:  [('homemaker', 0.6182607412338257), ('loner', 0.5810955166816711), ('schoolteacher', 0.5793447494506836)]


In [64]:
print('glove 50: Housewife - Woman + Man: ', glove_6b_50_model.most_similar(positive=["housewife", "man"], negative=["woman"], topn=3))

glove 50: Housewife - Woman + Man:  [('loner', 0.7505239844322205), ('schoolteacher', 0.7436637282371521), ('homemaker', 0.7377645373344421)]


In [65]:
print('Word2Vec: Housewife - Woman + Man: ', word2vec.wv.most_similar(positive=["housewife", "man"], negative=["woman"], topn=3))

Word2Vec: Housewife - Woman + Man:  [('classmate', 0.6363623142242432), ('youngster', 0.6295291185379028), ('joanie', 0.6116021275520325)]


Car + sea:

In [72]:
print('glove 100: Car - Road + Sea: ', glove_6b_100_model.most_similar(positive=["car", "sea"], negative=["road"], topn=3))

glove 100: Car - Road + Sea:  [('tanker', 0.6246403455734253), ('ship', 0.6190789341926575), ('jet', 0.5916841626167297)]


In [73]:
print('glove 50: Car - Road + Sea: ', glove_6b_50_model.most_similar(positive=["car", "sea"], negative=["road"], topn=3))

glove 50: Car - Road + Sea:  [('jet', 0.7649568915367126), ('plane', 0.7592712640762329), ('cargo', 0.7386853098869324)]


In [74]:
print('Word2Vec: Car - Road + Sea: ', word2vec.wv.most_similar(positive=["car", "sea"], negative=["road"], topn=3))

Word2Vec: Car - Road + Sea:  [('carrier', 0.5291203260421753), ('carriers', 0.5250734090805054), ('aircraft', 0.5171101093292236)]


### c) Bonus: (Sentence) Similarity.
The method "n_similarity([], [])" can be used to judge the similarity of two lists of tokens to each other. Can you think of a useful application example? Discuss and try!
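
As far as we can tell from Gensim's implementation, this method compares the mean vectors of the two token lists by cosine similarity. The sketch below reproduces that idea on toy vectors. The vectors and the helper names `mean_vector` and `set_similarity` are assumptions for illustration.

```python
import math


def mean_vector(vectors, tokens):
    """Average the vectors of all tokens, dimension by dimension."""
    rows = [vectors[token] for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def set_similarity(vectors, tokens_a, tokens_b):
    """Cosine similarity between the mean vectors of two token lists."""
    return cosine(mean_vector(vectors, tokens_a), mean_vector(vectors, tokens_b))


# Hypothetical 2-dimensional vectors.
toy_vectors = {
    "the": [1.0, 1.0],
    "yolks": [1.0, 0.0],
    "stance": [0.0, 1.0],
}
content_only = set_similarity(toy_vectors, ["yolks"], ["stance"])
with_shared_word = set_similarity(toy_vectors, ["the", "yolks"], ["the", "stance"])
print(content_only, with_shared_word)
```

In this toy case, a single shared function word lifts the similarity from 0 to about 0.8. This may partly explain why full sentences on unrelated topics still receive high scores, and why removing stop words first can be worthwhile.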


In [19]:
glove_6b_100_model.n_similarity(['game', "deer"], ["veil", "cow"])

0.35846308

In [92]:
sentence1 = "Separate the yolks from the whites and slowly add them slowly."
sentence2 = "Fry the pancakes until they become golden brown."
sentence3 = "If you push with your left foot, your stance is called goofy."

In [80]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\w+")

In [93]:
print("1 and 2:",glove_6b_100_model.n_similarity([a.lower() for a in tokenizer.tokenize(sentence1)], [a.lower() for a in tokenizer.tokenize(sentence2)]))

1 and 2: 0.8566468


In [94]:
print("1 and 3:",glove_6b_100_model.n_similarity([a.lower() for a in tokenizer.tokenize(sentence1)], [a.lower() for a in tokenizer.tokenize(sentence3)]))

1 and 3: 0.80041516


In [95]:
print("2 and 3:",glove_6b_100_model.n_similarity([a.lower() for a in tokenizer.tokenize(sentence2)], [a.lower() for a in tokenizer.tokenize(sentence3)]))

2 and 3: 0.802222


## 3. Pre-trained German vector models and vector-similarity with SpaCy
### a) Import modules and data
As you learned in the second tutorial, spaCy is a powerful tool that has many accessible functionalities built in, including word vector embeddings. Download a large German model, for example by entering the following command in your Jupyter notebook:
```python
!python -m spacy download de_core_news_lg
```

Attention: You will probably have to restart your Jupyter kernel before your system can find the model. Load the model in your notebook with

```python
nlp = spacy.load('de_core_news_lg')
```

In [96]:
import spacy



In [97]:
!python -m spacy download de_core_news_lg

Defaulting to user installation because normal site-packages is not writeable
Collecting de-core-news-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_lg-3.2.0/de_core_news_lg-3.2.0-py3-none-any.whl (572.3 MB)
✔ Download and installation succ

In [98]:
nlp = spacy.load('de_core_news_lg')

### b) Similarity with SpaCy

Using pandas read_csv(), load the list single_term_suggestions.txt from the previous tutorial and save it as a dataframe. Add another column "city" in which you store the similarity of each term (suggestion_ger) to the document "nlp("city")". Do the same for "politics" and "recreation" and store the similarities in meaningfully named columns.
Find the similarity between two texts using the ".similarity" method:
```python
simil = nlp("text number one").similarity(nlp("text number two"))
```
Discuss:
What could you use the particular similarities for?
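
By default, spaCy's document vector is described as the average of the token vectors, and the similarity as the cosine between two such vectors. Terms the model has no vector for then produce a similarity of 0. The sketch below imitates this behaviour with toy data. The vectors and the function names `doc_vector` and `doc_similarity` are assumptions for illustration and are not spaCy's code.

```python
import math


def doc_vector(token_vectors, tokens, dims):
    """Average the token vectors; tokens without a vector contribute zeros."""
    zero = [0.0] * dims
    rows = [token_vectors.get(token, zero) for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


def doc_similarity(token_vectors, tokens_a, tokens_b, dims=2):
    """Cosine similarity of two averaged vectors, 0.0 if either one is all zeros."""
    u = doc_vector(token_vectors, tokens_a, dims)
    v = doc_vector(token_vectors, tokens_b, dims)
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # at least one side has no known vector, nothing to compare
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Hypothetical vocabulary: "aalten" is deliberately missing.
toy_vectors = {"city": [0.9, 0.1], "zwickau": [0.8, 0.3]}
known = doc_similarity(toy_vectors, ["zwickau"], ["city"])
unknown = doc_similarity(toy_vectors, ["aalten"], ["city"])
print(round(known, 3), unknown)
```

A similarity of exactly 0 is therefore better read as "no information" than as "unrelated".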

In [99]:
simil = nlp("text number one").similarity(nlp("text number two"))

In [105]:
print("1 and 2:",nlp(sentence1).similarity(nlp(sentence2)))
print("1 and 3:",nlp(sentence1).similarity(nlp(sentence3)))
print("2 and 3:",nlp(sentence2).similarity(nlp(sentence3)))

1 and 2: 0.8281328035011029
1 and 3: 0.7779656503070352
2 and 3: 0.77883248259888


Now for the task:

In [107]:
import pandas as pd

In [109]:
data_ger = pd.read_csv("single_term_suggestions.txt")

In [111]:
data_ger["city"] = data_ger.apply(lambda row: nlp(row["suggestion_ger"]).similarity(nlp("city")), axis=1)


In [114]:
data_ger["politics"] = data_ger.apply(lambda row: nlp(row["suggestion_ger"]).similarity(nlp("politics")), axis=1)


In [115]:
data_ger["recreation"] = data_ger.apply(lambda row: nlp(row["suggestion_ger"]).similarity(nlp("recreation")), axis=1)


In [118]:
data_ger["privat"] = data_ger.apply(lambda row: nlp(row["suggestion_ger"]).similarity(nlp("privat")), axis=1)


In [120]:
data_ger

| Unnamed: 0 | suggestion_ger | city | politics | recreation | privat |
|---|---|---|---|---|---|
| 0 | aa | 0.027505 | 0.043489 | -0.007406 | 0.128290 |
| 1 | aach | 0.141100 | 0.002952 | -0.003878 | 0.135336 |
| 2 | aalten | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3 | aarburg | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 4 | aaronn | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... |
| 3757 | zwangsdienst | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3758 | zwangshypothek | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3759 | zweibruecken | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3760 | zwickau | 0.312800 | 0.079084 | 0.081898 | 0.322399 |
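
One possible use of these columns, offered as a suggestion for the discussion rather than as the intended solution, is to assign each term the category it is most similar to. The sketch below uses three rows copied from the output above. The function name `best_label` and the rule of skipping all-zero rows are assumptions.

```python
# Three rows copied from the dataframe output above.
rows = [
    {"suggestion_ger": "aa", "city": 0.027505, "politics": 0.043489,
     "recreation": -0.007406, "privat": 0.128290},
    {"suggestion_ger": "aalten", "city": 0.0, "politics": 0.0,
     "recreation": 0.0, "privat": 0.0},
    {"suggestion_ger": "zwickau", "city": 0.312800, "politics": 0.079084,
     "recreation": 0.081898, "privat": 0.322399},
]
LABELS = ["city", "politics", "recreation", "privat"]


def best_label(row, labels=LABELS):
    """Return the label with the highest similarity, or None if all scores are 0."""
    scores = {label: row[label] for label in labels}
    if all(score == 0.0 for score in scores.values()):
        return None  # no vector was available, so there is no basis for a label
    return max(scores, key=scores.get)


assignments = {row["suggestion_ger"]: best_label(row) for row in rows}
print(assignments)
```

Note how small the margins are (for "zwickau", 0.322 versus 0.313), so in practice a minimum score or a minimum margin would probably be needed before trusting such labels.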
