# Computing with word embeddings: Exercises

| Author | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2023-09-24 |

This notebook illustrates how to use `gensim` to compute with word vectors (e.g., word2vec), for example to

- compute the similarity of two words
- find the most similar words for a focal word
- solve analogy tasks

## Setup

Load required modules.

In [2]:
# file in- and export
import os

# for working with word embeddings
import gensim
import gensim.downloader as api

# for using arrays and data frames
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Load a pre-trained word embedding model with `gensim`'s model API.

In [3]:
# load the model and name its instance in our notebook environment 'word2vec'
word2vec = api.load('word2vec-google-news-300')

In [8]:
# load it from disk
from gensim.models import KeyedVectors

model_name = 'word2vec-google-news-300'
model_dir = os.path.join(api.BASE_DIR, model_name)
print(os.listdir(model_dir))
model_path = os.path.join(model_dir, model_name + '.gz') 

model = KeyedVectors.load_word2vec_format(model_path, binary=True)


['word2vec-google-news-300.gz', '__init__.py', '__pycache__']


**_Note_:** You can also load another model if you want. It will still be a `KeyedVectors` object. So you can directly apply what you learned.

In [3]:
# list available models
print(list(api.info()['models'].keys()))

# get detailed info for a specific model
api.info(name='word2vec-google-news-300')

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


{'num_records': 3000000,
 'file_size': 1743563840,
 'base_dataset': 'Google News (about 100 billion words)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/__init__.py',
 'license': 'not found',
 'parameters': {'dimension': 300},
 'description': "Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/).",
 'read_more': ['https://code.google.com/archive/p/word2vec/',
  'https://arxiv.org/abs/1301.3781',
  'https://arxiv.org/abs/1310.4546',
  'https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvec

## 1. Word vector similarities

### Exercise 1

Use the WordNet interface of the `nltk` Python package to find synonyms and antonyms of your choice of focus word(s), and compute their similarities to the focus word.
What do you observe?

**_Hint:_** You can also define your own lists of synonyms and antonyms.

**_Note:_** *WordNet* is a lexical database and semantic network of words and their relationships. 
It was developed to assist natural language processing and computational linguistics applications by providing a structured and comprehensive way to represent the English language vocabulary. 
WordNet was created at Princeton University and has been widely used in various text analysis tasks, including machine learning, information retrieval, and natural language understanding.
([source](https://chat.openai.com/share/1de49018-f487-4789-82cb-98ccf8d47ccb))
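
The similarity scores in this exercise are cosine similarities: `gensim`'s `KeyedVectors.similarity` returns the cosine of the angle between two word vectors. As a minimal sketch of what that means, the following uses made-up three-dimensional vectors (the values and the helper name `cosine_similarity` are illustrative assumptions, not taken from the model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors (hypothetical values, for illustration only)
v_a = [1.0, 2.0, 0.0]
v_b = [2.0, 4.0, 0.0]  # points in the same direction as v_a
v_c = [0.0, 0.0, 3.0]  # orthogonal to v_a

sim_ab = cosine_similarity(v_a, v_b)  # close to 1: same direction
sim_ac = cosine_similarity(v_a, v_c)  # 0: no shared direction
print(sim_ab, sim_ac)
```

With real model vectors, `cosine_similarity(word2vec['nice'], word2vec['decent'])` should agree with `word2vec.similarity('nice', 'decent')` up to floating-point error.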

In [22]:
# import wordnet
from nltk.corpus import wordnet

# TODO: insert your focus word here
focus_word = 'nice' # <== change!

# find the synsets for the focus word of your choice
synsets = wordnet.synsets(focus_word)
for synset in synsets:
    print(synset, synset.definition())

Synset('nice.n.01') a city in southeastern France on the Mediterranean; the leading resort on the French Riviera
Synset('nice.a.01') pleasant or pleasing or agreeable in nature or appearance; - George Meredith
Synset('decent.s.01') socially or conventionally correct; refined or virtuous
Synset('nice.s.03') done with delicacy and skill
Synset('dainty.s.04') excessively fastidious and easily disgusted
Synset('courteous.s.01') exhibiting courtesy and politeness


In [23]:
# find the synonyms and antonyms for <YOUR WORD>
synonyms = []
antonyms = []
for synset in wordnet.synsets(focus_word):
    for lemma in synset.lemmas():
        if focus_word not in lemma.name():
            synonyms.append(lemma.name())
        if lemma.antonyms():
            for a in lemma.antonyms():
                antonyms.append(a.name())

# print the results
print('Synonyms for: ' + focus_word)
print(set(synonyms))
print('Antonyms for: ' + focus_word)
print(set(antonyms))

Synonyms for: nice
{'prissy', 'squeamish', 'Nice', 'gracious', 'courteous', 'decent', 'skillful', 'dainty'}
Antonyms for: nice
{'nasty'}


In [24]:
# TODO: iterate over synonyms and antonyms and print the similarity between focus word and synonym/antonym pairs
for synonym in synonyms:
    # skip words that are not in the model's vocabulary (dict lookup is fast)
    if synonym in word2vec.key_to_index:
        print(f'"{synonym}": {word2vec.similarity(focus_word, synonym)}')

"Nice": 0.46746230125427246
"decent": 0.5993331670761108
"skillful": 0.20048925280570984
"dainty": 0.2785818874835968
"prissy": 0.25895628333091736
"squeamish": 0.19743596017360687
"courteous": 0.28391408920288086
"gracious": 0.4229739308357239


In [7]:
for antonym in antonyms:
    # skip words that are not in the model's vocabulary
    if antonym in word2vec.key_to_index:
        print(f'"{antonym}": {word2vec.similarity(focus_word, antonym)}')

"nasty": 0.3650338053703308


### Exercise 2

Let's implement a classic approach to evaluate how word embeddings capture cultural biases in their training corpora.
Here, we'll focus on **_gender bias_**: the differential association of traits and attributes with women and men (my loose definition).

Compile a list `comparison_words` with occupations, character traits, and other words that might exhibit gender bias.
Then compute how similar each word is to terms like 'man' and 'woman' that indicate the male and female genders.

Which words exhibit gender bias?
And in which direction? 
Do you spot a pattern?

In [27]:
comparison_words = [
    'programmer',
    'scientist',
    'smart',
    'emotional',
    'caring',
    'CEO',
    'annoying',
    'football',
    'soccer',
    'politician',
    'populist',
    'power',
    'nurse',
    'doctor'
    # add more interesting words here
]

In [31]:
# compute similarities to male and female terms
male_terms = ['man', 'men', 'male', 'he', 'him', 'his']
female_terms = ['woman', 'women', 'female', 'she', 'her', 'hers']

s_male = dict()
s_female = dict()

for word in comparison_words:
    s_male[word] = []
    s_female[word] = []
    for w in male_terms:
        s = word2vec.similarity(word, w)
        s_male[word].append(s)
    for w in female_terms:
        s = word2vec.similarity(word, w)
        s_female[word].append(s)

In [33]:
{w: np.mean(s) for w, s in s_female.items()}

{'programmer': 0.029597761,
 'scientist': 0.06876839,
 'smart': 0.041580766,
 'emotional': 0.18292125,
 'caring': 0.20474696,
 'CEO': -0.106728755,
 'annoying': 0.097587876,
 'football': 0.063607715,
 'soccer': 0.13928266,
 'politician': 0.22477663,
 'populist': 0.08099014,
 'power': 0.042475563,
 'nurse': 0.3203691,
 'doctor': 0.23384072}

In [30]:
{w: np.mean(s) for w, s in s_male.items()}

{'programmer': 0.054541085,
 'scientist': 0.10331989,
 'smart': 0.06168256,
 'emotional': 0.15299931,
 'caring': 0.17566599,
 'CEO': -0.0061667506,
 'annoying': 0.07397268,
 'football': 0.17734624,
 'soccer': 0.11919821,
 'politician': 0.22752604,
 'populist': 0.09036921,
 'power': 0.07486963,
 'nurse': 0.19163398,
 'doctor': 0.27489495}

In [None]:
# TODO: summarize the results in a table or figure
# hint: if you have more than one term per gender, you might want to compute the average of comparison term--gender word similarities
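
One possible way to summarize the results is to compute, for each word, the gap between its mean similarity to the female terms and its mean similarity to the male terms, and then sort by that gap. The sketch below uses a subset of the mean similarities printed above (rounded to four decimals). The name `gender_gap` and the table layout are illustrative choices, not part of the original exercise.

```python
# mean similarities copied (rounded) from the outputs above, for a subset of words
mean_female = {'nurse': 0.3204, 'caring': 0.2047, 'doctor': 0.2338,
               'CEO': -0.1067, 'football': 0.0636}
mean_male = {'nurse': 0.1916, 'caring': 0.1757, 'doctor': 0.2749,
             'CEO': -0.0062, 'football': 0.1773}

# positive gap: closer to the female terms; negative gap: closer to the male terms
gender_gap = {w: mean_female[w] - mean_male[w] for w in mean_female}

# sort from most male-leaning to most female-leaning
ranked = sorted(gender_gap.items(), key=lambda item: item[1])

# print a plain-text table
print(f"{'word':<10} {'female':>8} {'male':>8} {'gap':>8}")
for word, gap in ranked:
    print(f"{word:<10} {mean_female[word]:>8.3f} {mean_male[word]:>8.3f} {gap:>8.3f}")
```

In the notebook, the two dictionaries can be replaced by `{w: np.mean(s) for w, s in s_female.items()}` and its `s_male` counterpart to cover all comparison words.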

### Exercise 3

Implement the same logic but now use the gender bias-related word and attribute lists used in Caliskan *et al.*'s paper ["Semantics derived automatically from language corpora contain human-like biases"](https://www.science.org/doi/10.1126/science.aal4230).

You find the word and attribute lists in the folder `./../data/replications/caliskan_semantics_2017/wordlists/` (e.g., the file 'science_arts_male_female.txt').
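
Caliskan *et al.* summarize such comparisons with the Word Embedding Association Test (WEAT). For two sets of target words X and Y and two sets of attribute words A and B, each word's association is its mean similarity to A minus its mean similarity to B. The effect size is the difference between the mean associations of X and Y, divided by the standard deviation of the associations over all target words. The sketch below is a minimal version under stated assumptions: the function names, the toy word lists, and the toy similarity values are hypothetical, and the choice of the sample standard deviation is an assumption of this sketch.

```python
from statistics import mean, stdev

def association(w, A, B, sim):
    """s(w, A, B): mean similarity of w to attributes A minus mean similarity to attributes B."""
    return mean(sim(w, a) for a in A) - mean(sim(w, b) for b in B)

def weat_effect_size(X, Y, A, B, sim):
    """Difference in mean association of targets X and Y, scaled by the spread over all targets."""
    s_X = [association(x, A, B, sim) for x in X]
    s_Y = [association(y, A, B, sim) for y in Y]
    # assumption: sample standard deviation over the union of both target sets
    return (mean(s_X) - mean(s_Y)) / stdev(s_X + s_Y)

# toy similarities (hypothetical values), standing in for word2vec.similarity
toy_sim = {
    ('physics', 'he'): 0.3, ('physics', 'she'): 0.1,
    ('chemistry', 'he'): 0.2, ('chemistry', 'she'): 0.1,
    ('poetry', 'he'): 0.1, ('poetry', 'she'): 0.3,
    ('dance', 'he'): 0.1, ('dance', 'she'): 0.2,
}

def sim(w1, w2):
    return toy_sim[(w1, w2)]

effect = weat_effect_size(
    X=['physics', 'chemistry'], Y=['poetry', 'dance'],
    A=['he'], B=['she'], sim=sim,
)
print(round(effect, 3))
```

With the real model, `sim=word2vec.similarity` can be passed instead of the toy function, with the four word lists read from the files in the folder named above.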

## 2. Nearest neighbors

### Exercise 1

Let's use nearest neighbors search to find conceptually equivalent terms for a "seed" word.

**_Note:_** This is a typical task in expanding keyword lists for dictionaries.

You can choose which seed word you want to start with (see example below for a suggestion).
But while going through nearest neighbors, keep track of how many of the candidate terms in the top-20 or top-50 terms (or so) you would include in your dictionary, and how many you would discard!

**_Example_**: 
Say you want to compile a dictionary that contains typical words used to express *positive emotions*.
In this case, you could start with the seed word 'happy.'

In [17]:
[w for w, s in word2vec.most_similar('happy', topn=20)]
# lots of good suggestions: 'glad', 'pleased', 'ecstatic', 'overjoyed', etc.
# but also conceptual opposites 'disappointed' (should be discarded)
#  and weird artifacts like 'Said_Hirschbeck'

['glad',
 'pleased',
 'ecstatic',
 'overjoyed',
 'thrilled',
 'satisfied',
 'proud',
 'delighted',
 'disappointed',
 'excited',
 'happier',
 'Said_Hirschbeck',
 'elated',
 'thankful',
 'unhappy',
 'enthused',
 'chuffed',
 'grateful',
 'confident',
 'hapy']

In [16]:
[w for w, s in word2vec.most_similar('hate', topn=20)]
# lots of good suggestions. 
# But some like 'whatever_funhouse_mirror' are weird
# also terms like 'racist' make sense in cultural context but are not indicative of the emotion per se
# at the same time, synonym words like 'contempt' are missing

['despise',
 'Hate',
 'detest',
 'hatred',
 'hating',
 'hates',
 'HATE',
 'dislike',
 'love',
 'hated',
 'loathe',
 'hateful',
 'haters',
 'whatever_funhouse_mirror',
 'rascist',
 'embrace_brotherly_coexistence',
 'predjudice',
 'Ignorance_breeds',
 'hatered',
 'hater']

In [15]:
[w for w, s in word2vec.most_similar('autocracy', topn=20)]
# here the question is how you can find words that are  c o n c e p t u a l l y  similar to autocracy
# terms like 'despotism', 'despotic_rule', 'dictatorship', 'authoritarian_rule', 'authoritarian_regime', etc. make sense
# but conceptual opposites like 'democracy' and 'socialism' should be discarded unless
#  you just want to find documents that talk about political regimes 
# also, including adjectives (e.g., 'despotic') might lead to false positives

['despotism',
 'authoritarianism',
 'dictatorship',
 'democracy',
 'dictatorial_regime',
 'dictatorial_rule',
 'tyranny',
 'feudal_monarchy',
 'autocratic_rule',
 'autocracies',
 'despotic_rule',
 'authoritarian',
 'authoritarian_rule',
 'totalitarian_dictatorship',
 'feudalism',
 'dynastic_rule',
 'authoritarian_regime',
 'totalitarianism',
 'socialism',
 'despotic']

In [19]:
[w for w, s in word2vec.most_similar('competent', topn=20)]

['competent', 'qualified', 'trustworthy']

['incompetent',
 'competant',
 'www.CoastalEnergy.com',
 'technically_competent',
 'quartz_pyrite_tourmaline_veins',
 'uncorrupt',
 'Competent',
 'knowledgeable',
 'tactically_proficient',
 'trustworthy',
 'competently',
 'competence',
 'sane',
 'technically_proficient',
 'conscientious',
 'qualified',
 'commercially_astute',
 'skilled',
 'hirable',
 'competency']

### Exercise 2

Discuss with your neighbor how one could improve this nearest neighbor search-based dictionary expansion strategy.
Do you have ideas for automated quality checks?

**Ideas**


- start with a seed term: 'autocracy' ("overlapping")
    - for each candidate term (e.g., 'socialism')
        - get the 10 most similar terms
        - check how large the overlap is with the top-20 most similar terms for 'autocracy'
        - discard if below some threshold *t* (but arbitrary!?!?)

-  start with a seed term: 'competent' ("snowballing")
    - get best fitting terms (manually)

-  start with a seed term: 'competent' ("snowballing 2.0")

- maybe also use distance (to centroid) to weigh terms (??)
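
The "overlapping" idea above can be sketched as follows. This is a minimal illustration, not a tested strategy: the function name `neighbor_overlap`, the `neighbors` callable, and the toy neighbor lists are hypothetical, and the threshold for discarding a candidate is still left to the user.

```python
def neighbor_overlap(seed, candidate, neighbors, topn_seed=20, topn_candidate=10):
    """Share of the candidate's nearest neighbors that are also nearest neighbors of the seed."""
    seed_neighbors = set(neighbors(seed, topn_seed))
    candidate_neighbors = set(neighbors(candidate, topn_candidate))
    if not candidate_neighbors:
        return 0.0
    return len(seed_neighbors & candidate_neighbors) / len(candidate_neighbors)

# toy neighbor lists (hypothetical and shortened), standing in for model output
toy_neighbors = {
    'autocracy': ['despotism', 'dictatorship', 'democracy', 'tyranny', 'socialism'],
    'despotism': ['tyranny', 'dictatorship', 'autocracy', 'oppression'],
    'socialism': ['communism', 'capitalism', 'marxism', 'democracy'],
}

def neighbors(word, topn):
    """Return up to topn neighbor words for a word from the toy lists."""
    return toy_neighbors[word][:topn]

overlap_despotism = neighbor_overlap('autocracy', 'despotism', neighbors)
overlap_socialism = neighbor_overlap('autocracy', 'socialism', neighbors)
print(overlap_despotism, overlap_socialism)
```

To run this against the model, `neighbors` could be defined as `lambda w, n: [t for t, _ in word2vec.most_similar(w, topn=n)]`.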

## 3. Analogies

### Exercise 1

Can you come up with analogy problems involving terms from your discipline or research area?
Can the word embedding model solve these specialized problems?

**_Example:_** In politics "Democrat is to progressive what Republican is to ___?"

In [20]:
v_x1 = word2vec['Democrat']
v_y1 = word2vec['progressive']
v_x2 = word2vec['Republican']

v_q = v_y1 - v_x1 + v_x2

word2vec.similar_by_vector(v_q, topn=5)

[('progressive', 0.7648183107376099),
 ('conservative', 0.525808572769165),
 ('progressive_netroots', 0.5039817094802856),
 ('radical', 0.47010430693626404),
 ('n_CHRIST_EPISCOPAL', 0.46954479813575745)]
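
In the output above, the input word 'progressive' ranks first because the query vector stays close to the vectors it was built from. A common remedy is to drop the analogy's own input words from the result list. `most_similar(positive=..., negative=...)` does this automatically. The helper below (a hypothetical name, sketched for results from `similar_by_vector`) does the same by hand, using the result printed above with rounded similarities:

```python
def drop_input_words(results, input_words):
    """Remove the analogy's own input words from a list of (word, similarity) pairs."""
    excluded = set(input_words)
    return [(w, s) for w, s in results if w not in excluded]

# result copied from the output above (similarities rounded)
results = [
    ('progressive', 0.7648),
    ('conservative', 0.5258),
    ('progressive_netroots', 0.5040),
    ('radical', 0.4701),
    ('n_CHRIST_EPISCOPAL', 0.4695),
]

filtered = drop_input_words(results, ['Democrat', 'progressive', 'Republican'])
print(filtered[0][0])
```

After filtering, 'conservative' is the top-ranked answer to the analogy.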

In [24]:
v_x1 = word2vec['head']
v_y1 = word2vec['helmet']
v_x2 = word2vec['hand']

v_q = v_y1 - v_x1 + v_x2

word2vec.similar_by_vector(v_q, topn=20)

[('helmet', 0.7886247038841248),
 ('helmets', 0.5802411437034607),
 ('helment', 0.5743348598480225),
 ('hand', 0.5650533437728882),
 ('visor', 0.5115036368370056),
 ('protective_splint', 0.5070538520812988),
 ('gloves', 0.5015875697135925),
 ('Helmet', 0.4950107932090759),
 ('machine_guns_bandolier', 0.4937583804130554),
 ('goggles', 0.4788079559803009),
 ('eighth_Verlander_beaned', 0.4763328433036804),
 ('facemask', 0.4681493043899536),
 ('headgear', 0.46554625034332275),
 ('Rawlings_S###', 0.4648207724094391),
 ('mouthguard', 0.46436232328414917),
 ('faceshield', 0.4626709818840027),
 ('protective_padding', 0.4590736925601959),
 ('glove', 0.45742732286453247),
 ('Flak_jackets', 0.45673221349716187),
 ('mitt', 0.4559189975261688)]

In [26]:
v_x1 = word2vec['football']
v_y1 = word2vec['United_States']
v_x2 = word2vec['cricket']

v_q = v_y1 - v_x1 + v_x2

word2vec.similar_by_vector(v_q, topn=20)

[('India', 0.6109928488731384),
 ('subcontinent', 0.5827982425689697),
 ('sub_continent', 0.5815519094467163),
 ('cricket', 0.5808954834938049),
 ('Australia', 0.5685694217681885),
 ('West_Indies', 0.5684044361114502),
 ('Pakistan', 0.566472053527832),
 ('Test_cricket', 0.5551456212997437),
 ('Tendulkar', 0.552420437335968),
 ('United_States', 0.5515602231025696),
 ('Sri_Lanka', 0.548048734664917),
 ('ODI', 0.5434243679046631),
 ('Baggy_Greens', 0.5332558751106262),
 ('Bangladesh', 0.5251384377479553),
 ('ODIs', 0.5249315500259399),
 ('Sachin_Tendulkar', 0.5248597264289856),
 ('dayers', 0.5230942368507385),
 ('ICC_Champions_Trophy', 0.5220580697059631),
 ('New_Zealand', 0.5215911269187927),
 ('Ricky_Ponting', 0.5189464092254639)]

In [33]:
v_x1 = word2vec['German']
v_y1 = word2vec['Germany']
v_x2 = word2vec['Peruvian']

v_q = v_y1 - v_x1 + v_x2

word2vec.similar_by_vector(v_q, topn=20)

[('Peruvian', 0.8159431219100952),
 ('Peru', 0.809283971786499),
 ('Ecuador', 0.7329590320587158),
 ('Chile', 0.7297125458717346),
 ('Costa_Rica', 0.70567387342453),
 ('Guatemala', 0.6708042621612549),
 ('Bolivia', 0.670447587966919),
 ('Brazil', 0.6703616380691528),
 ('Argentina', 0.6612628698348999),
 ('El_Salvador', 0.6595791578292847),
 ('Colombia', 0.6483430862426758),
 ('Honduras', 0.6456732153892517),
 ('Chilean', 0.6242369413375854),
 ('Costa_Rican', 0.6241909265518188),
 ('Bolivian', 0.6210365891456604),
 ('Nicaragua', 0.600653350353241),
 ('Ecuadorian', 0.5938271284103394),
 ('Uruguay', 0.5932496786117554),
 ('Paraguayan', 0.5902215838432312),
 ('Paraguay', 0.5832325220108032)]

In [38]:
v_y1 = word2vec['Germany']
v_x1 = word2vec['german']
v_x2 = word2vec['peruvian']

v_q = v_y1 - v_x1 + v_x2

word2vec.similar_by_vector(v_q, topn=20)

[('United_States', 0.37230148911476135),
 ('Germany', 0.35291966795921326),
 ('United_Kingdom', 0.3246475160121918),
 ('####-####_##Figure', 0.32130444049835205),
 ('WORLD_BRIEFING_EUROPE', 0.3183465301990509),
 ('Brazil', 0.3166411817073822),
 ('depth_PESTLE_Insights', 0.3155994117259979),
 ('Paper_Tableware', 0.3155561089515686),
 ('Markets_#Q.####', 0.3141346871852875),
 ('Meal_Replacement_Products', 0.3130999803543091),
 ('represents_INVOS_System', 0.31308215856552124),
 ('Italy', 0.31219157576560974),
 ('Colour_Cosmetics', 0.31042715907096863),
 ('RTD_Tea', 0.309854656457901),
 ('APGROUP_LatinAmerica_;)_COUNTRY', 0.30961477756500244),
 ('STOCKHOLM_Sweden_Gerhard_Ertl', 0.30582064390182495),
 ('DORTMUND_Reuters', 0.30522432923316956),
 ('Argentina', 0.3050248622894287),
 ('Costa_Rica', 0.3038727939128876),
 ('Rodent_Care', 0.3027800917625427)]

In [44]:
v_y1 = word2vec['fathers']
v_x1 = word2vec['men']
v_x2 = word2vec['mothers']

v_q = v_y1 - v_x1 + v_x2

word2vec.similar_by_vector(v_q, topn=20)

[('mothers', 0.8295677900314331),
 ('fathers', 0.8070769309997559),
 ('dads', 0.7006524801254272),
 ('moms', 0.6297956705093384),
 ('parents', 0.6068122386932373),
 ('fathers_stepfathers', 0.5964891910552979),
 ('parenting', 0.5838446617126465),
 ('mums', 0.5777431130409241),
 ('Mothers', 0.5749648213386536),
 ('latchkey_kids', 0.5629170536994934),
 ('unwed_fathers', 0.5626464486122131),
 ('Stepparents', 0.5505766868591309),
 ('babies', 0.5435649752616882),
 ('children', 0.5428438186645508),
 ('stepdads', 0.5405153632164001),
 ('expectant_dads', 0.5344789028167725),
 ('mothering', 0.5309228897094727),
 ('grandfathers_uncles', 0.5304272174835205),
 ('FATHERS_SUPPORT_GROUP', 0.5285888314247131),
 ('grandparental', 0.5271429419517517)]

### Exercise 2

Take examples from one of the word lists in the folder `./../data/benchmarks/bats/3_encyclopedic_semantics/` to construct analogy tests.
How well does the word2vec model perform on average?

**_Hint:_** Think about possible ways of defining performance.

In [None]:
# example
fp = './../data/benchmarks/bats/3_encyclopedic_semantics/E01 [country - capital].txt'

with open(fp, 'r') as f:
    wordlist = [tuple(line.strip().split('\t')) for line in f]

wordlist[:10]

In [None]:
# note: depending on your evaluation strategy, you might need to change this function
def analogy(x1='man', y1='king', x2='woman', verbose=True):
    """Return the top answer to the query 'y1 is to x1 what WORD is to x2?'"""
    result = word2vec.most_similar(positive=[y1, x2], negative=[x1])
    if verbose:
        print(f"'{x1}' : '{y1}' :: '{x2}' : ?? ==> '{result[0][0]}'")
    return result[0][0]

In [None]:
# analogy(x1='man', y1='king', x2='woman') # :)
analogy(x1='abuja', y1='nigeria', x2='ankara') # :/
analogy(x1='nigeria', y1='abuja', x2='turkey') # x/
analogy(x1='athens', y1='greece', x2='baghdad') # :/
analogy(x1='berlin', y1='germany', x2='paris') # x/
analogy(x1='germany', y1='berlin', x2='france') # x/
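
One possible definition of performance is top-*k* accuracy: the share of analogy items for which the expected answer is among the model's first *k* predictions. The sketch below is a minimal version with a toy solver. The names `top_k_accuracy` and `toy_solve` and the toy predictions are hypothetical and only illustrate the bookkeeping.

```python
def top_k_accuracy(items, solve, k=1):
    """Share of analogy items whose expected answer is among the top-k predictions.

    items: list of (x1, y1, x2, expected) tuples
    solve: callable (x1, y1, x2, k) -> list of up to k predicted words
    """
    if not items:
        return 0.0
    hits = 0
    for x1, y1, x2, expected in items:
        if expected in solve(x1, y1, x2, k):
            hits += 1
    return hits / len(items)

# toy predictions (hypothetical), standing in for the model
toy_predictions = {
    ('Germany', 'Berlin', 'France'): ['Paris', 'Marseille'],
    ('Germany', 'Berlin', 'Greece'): ['Thessaloniki', 'Athens'],
    ('Germany', 'Berlin', 'Nigeria'): ['Lagos', 'Kano'],
}

def toy_solve(x1, y1, x2, k):
    """Return the first k toy predictions for an analogy item."""
    return toy_predictions[(x1, y1, x2)][:k]

items = [
    ('Germany', 'Berlin', 'France', 'Paris'),
    ('Germany', 'Berlin', 'Greece', 'Athens'),
    ('Germany', 'Berlin', 'Nigeria', 'Abuja'),
]

acc_at_1 = top_k_accuracy(items, toy_solve, k=1)
acc_at_2 = top_k_accuracy(items, toy_solve, k=2)
print(acc_at_1, acc_at_2)
```

With the real model, `solve` could wrap `word2vec.most_similar(positive=[y1, x2], negative=[x1], topn=k)` and return only the words. Since the vocabulary is case-sensitive (compare the 'German' and 'german' examples above), the capitalization of the word-list entries may affect the scores.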