# Computing with word embeddings: Exercises

| Author | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2023-09-24 |

This notebook illustrates how to use `gensim` to compute with word vectors (e.g., word2vec), for example to

- compute the similarity of two words
- find the most similar words for a focal word
- solve analogy tasks

## Setup

Load required modules.

In [None]:
# file in- and export
import os

# for working with word embeddings
import gensim
import gensim.downloader as api

# for using arrays and data frames
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Load a pre-trained word embedding model with `gensim`'s model API.

In [None]:
# load the model and name its instance in our notebook environment 'word2vec'
word2vec = api.load('word2vec-google-news-300')

**_Note_:** You can also load another model if you want. It will still be a `KeyedVectors` object, so you can directly apply what you learned.

In [None]:
# list available models
print(list(api.info()['models'].keys()))

# get detailed info for a specific model
api.info(name='word2vec-google-news-300')

## 1. Word vector similarities

### Exercise 1

Use WordNet (available through the `nltk` Python package) to find synonyms and antonyms of the focus word(s) of your choice, and compute their similarities to the focus word.
What do you observe?

**_Hint:_** You can also define your own lists of synonyms and antonyms.

**_Note:_** *WordNet* is a lexical database and semantic network of words and their relationships. 
It was developed to assist natural language processing and computational linguistics applications by providing a structured and comprehensive way to represent the English language vocabulary. 
WordNet was created at Princeton University and has been widely used in various text analysis tasks, including machine learning, information retrieval, and natural language understanding.
([source](https://chat.openai.com/share/1de49018-f487-4789-82cb-98ccf8d47ccb))

In [None]:
# import wordnet
# note: this requires the WordNet corpus; if it is missing, run nltk.download('wordnet') once
from nltk.corpus import wordnet

# TODO: insert your focus word here
focus_word = 'fun' # <== change!

# find the synsets for the focus word of your choice
synsets = wordnet.synsets(focus_word)
for synset in synsets:
    print(synset, synset.definition())

In [None]:
# find the synonyms and antonyms for <YOUR WORD>
synonyms = []
antonyms = []
for synset in wordnet.synsets(focus_word):
    for lemma in synset.lemmas():
        if focus_word not in lemma.name():
            synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())


# print the results
print('Synonyms for: ' + focus_word)
print(set(synonyms))
print('Antonyms for: ' + focus_word)
print(set(antonyms))

In [None]:
# TODO: iterate over synonyms and antonyms and print the similarity between focus word and synonym/antonym pairs
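
One possible way to approach the TODO above is sketched below. It is written against a generic `similarity` function and a `vocab` collection so that it runs without the model; with the model loaded above you could pass `word2vec.similarity` and `word2vec.key_to_index` instead (assuming gensim 4.x). The helper name `report_similarities` and the toy scores are hypothetical and only serve as illustration. The vocabulary check is there because WordNet lemma names can be multi-word expressions joined by underscores, which may be missing from the embedding model.

```python
def report_similarities(focus, candidates, similarity, vocab):
    """Return (candidate, similarity) pairs, most similar first.

    Candidates that are not in `vocab` are skipped.
    """
    scored = [
        (word, similarity(focus, word))
        for word in sorted(set(candidates))
        if word in vocab
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-in scores (made up for illustration, NOT real model output).
toy_scores = {('fun', 'play'): 0.5, ('fun', 'merriment'): 0.25}
toy_vocab = {'fun', 'play', 'merriment'}


def toy_similarity(w1, w2):
    """Look up a made-up similarity score for the pair (w1, w2)."""
    return toy_scores[(w1, w2)]


result = report_similarities(
    'fun', ['play', 'merriment', 'sport_activity'], toy_similarity, toy_vocab
)
for word, score in result:
    print(f'{word}: {score:.2f}')
```

Calling the same helper once with `synonyms` and once with `antonyms` makes it easy to compare the two groups side by side.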

### Exercise 2

Let's implement a classic approach to evaluate how word embeddings capture cultural biases in their training corpora.
Here, we'll focus on **_gender bias_**: the differential association of traits and attributes with women and men (my loose definition).

Compile a list `comparison_words` with occupations, character traits, and other words that might exhibit gender bias.
Then compute how similar each word is to terms like 'man' and 'woman' that indicate the male and female genders.

Which words exhibit gender bias?
And in which direction? 
Do you spot a pattern?

In [None]:
comparison_words = [
    'programmer',
    'scientist',
    'smart',
    'emotional',
    'caring',
    # add more interesting words here
]

In [None]:
# compute similarities to male and female terms
male_terms = ['man']
female_terms = ['woman']

for word in comparison_words:
    # TODO: implement logic
    pass  # placeholder so that the cell runs; replace with your logic

In [None]:
# TODO: summarize the results in a table or figure
# hint: if you have more than one term per gender, you might want to compute the average of comparison term--gender word similarities
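
The averaging idea from the hint could look roughly like the sketch below. It summarizes each comparison word as the mean similarity to the female terms minus the mean similarity to the male terms, so positive values point towards the female terms and negative values towards the male terms. The function name `gender_gap` and the toy scores are hypothetical; with the loaded model you could pass `word2vec.similarity` as the `similarity` argument. This difference score is just one simple option, not the measure used in any particular paper.

```python
from statistics import mean


def gender_gap(word, male_terms, female_terms, similarity):
    """Mean similarity to female terms minus mean similarity to male terms."""
    male = mean(similarity(word, term) for term in male_terms)
    female = mean(similarity(word, term) for term in female_terms)
    return female - male


# Toy stand-in scores (made up for illustration, NOT real model output).
toy_scores = {
    ('programmer', 'man'): 0.5, ('programmer', 'woman'): 0.25,
    ('caring', 'man'): 0.25, ('caring', 'woman'): 0.5,
}


def toy_similarity(w1, w2):
    """Look up a made-up similarity score for the pair (w1, w2)."""
    return toy_scores[(w1, w2)]


gaps = {
    word: gender_gap(word, ['man'], ['woman'], toy_similarity)
    for word in ['programmer', 'caring']
}
for word, gap in gaps.items():
    print(f'{word}: {gap:+.2f}')
```

The resulting dictionary can be turned into a table with `pd.Series(gaps).sort_values()` or plotted as a horizontal bar chart.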

### Exercise 3

Implement the same logic, but now use the gender bias-related word and attribute lists from Caliskan *et al.*'s paper ["Semantics derived automatically from language corpora contain human-like biases"](https://www.science.org/doi/10.1126/science.aal4230).

You can find the word and attribute lists in the folder `./../data/replications/caliskan_semantics_2017/wordlists/` (e.g., the file 'science_arts_male_female.txt').

## 2. Nearest neighbors

### Exercise 1

Let's use nearest neighbors search to find conceptually equivalent terms for a "seed" word.

**_Note:_** This is a typical task in expanding keyword lists for dictionaries.

You can choose which seed word you want to start with (see example below for a suggestion).
But while going through the nearest neighbors, keep track of how many of the candidate terms in the top-20 or top-50 terms (or so) you would include in your dictionary, and how many you would discard!

**_Example_**: 
Say you want to compile a dictionary that contains typical words used to express *positive emotions*.
In this case, you could start with the seed word 'happy.'

In [None]:
[w for w, s in word2vec.most_similar('happy', topn=20)]

In [None]:
# EXAMPLE: my decisions
'glad', # approved
'pleased', # approved
'ecstatic', # approved
'overjoyed', # approved
'thrilled', # approved
'satisfied', # approved
'proud', # approved
'delighted', # approved
'disappointed', # not approved <== !!!
'excited', # approved
'happier', # approved
'Said_Hirschbeck', # not approved <== !!!
# ... and so on

### Exercise 2

Discuss with your neighbor how one could improve this nearest neighbor search-based dictionary expansion strategy.
Do you have ideas for automated quality checks?

## 3. Analogies

### Exercise 1

Can you come up with analogy problems involving terms from your discipline or research area?
Can the word embedding model solve these specialized problems?

**_Example:_** In politics "Democrat is to progressive what Republican is to ___?"

In [None]:
v_x1 = word2vec['Democrat']
v_y1 = word2vec['progressive']
v_x2 = word2vec['Republican']

v_q = v_y1 - v_x1 + v_x2

word2vec.similar_by_vector(v_q, topn=5)
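
Note that `similar_by_vector` only sees the query vector, so the input words themselves (here 'progressive' or 'Republican') can show up among the top hits, whereas `most_similar(positive=..., negative=...)` leaves the input words out of its results. The sketch below illustrates the same vector arithmetic with an explicit exclusion step. It uses tiny made-up vectors and the hypothetical helpers `cosine` and `solve_analogy`, so it runs without the model and is not meant as a reimplementation of what `gensim` does internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, x1, y1, x2, topn=1):
    """Answer 'x1 is to y1 what x2 is to ?', excluding the input words."""
    # query vector: v_y1 - v_x1 + v_x2
    query = [b - a + c for a, b, c in zip(vectors[x1], vectors[y1], vectors[x2])]
    candidates = [word for word in vectors if word not in {x1, y1, x2}]
    ranked = sorted(
        candidates, key=lambda word: cosine(query, vectors[word]), reverse=True
    )
    return ranked[:topn]


# Toy 3-dimensional vectors (made up for illustration, NOT real embeddings).
toy_vectors = {
    'man': [1.0, 0.0, 0.0],
    'woman': [0.0, 1.0, 0.0],
    'king': [1.0, 0.0, 1.0],
    'queen': [0.0, 1.0, 1.0],
    'apple': [1.0, 1.0, 0.0],
}

answer = solve_analogy(toy_vectors, x1='man', y1='king', x2='woman')
print(answer)
```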

### Exercise 2

Take examples from one of the word lists in the folder `./../data/benchmarks/bats/3_encyclopedic_semantics/` to construct analogy tests.
How well does the word2vec model perform on average?

**_Hint:_** Think about possible ways of defining performance.

In [None]:
# example
fp = './../data/benchmarks/bats/3_encyclopedic_semantics/E01 [country - capital].txt'

with open(fp, 'r') as f:
    wordlist = [tuple(line.strip().split('\t')) for line in f]

wordlist[:10]

In [None]:
# note: depending on your evaluation strategy, you might need to change this function
def analogy(x1='man', y1='king', x2='woman', verbose=True):
    """Returns the answer to the query 'x1 is to y1 what x2 is to WORD?'"""
    result = word2vec.most_similar(positive=[y1, x2], negative=[x1])
    if verbose:
        print(f"'{x1}' : '{y1}' :: '{x2}' : ?? ==> '{result[0][0]}'")
    return result[0][0]
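
One way to define performance is top-k accuracy: the share of analogy tests for which an accepted answer appears among the model's k best guesses. The sketch below assumes a hypothetical `solver` callable with the signature `(x1, y1, x2, k) -> list of guesses` and uses canned toy guesses so that it runs without the model. With the model loaded, a solver could wrap `word2vec.most_similar(positive=[y1, x2], negative=[x1], topn=k)` and keep only the words from the returned (word, score) pairs.

```python
def topk_accuracy(tests, solver, k=1):
    """Share of tests with an accepted answer among the solver's top-k guesses.

    tests:  iterable of (x1, y1, x2, accepted), where accepted is a set of answers
    solver: callable (x1, y1, x2, k) -> list of guessed words
    """
    hits = 0
    total = 0
    for x1, y1, x2, accepted in tests:
        guesses = solver(x1, y1, x2, k)
        hits += any(guess in accepted for guess in guesses)
        total += 1
    return hits / total if total else float('nan')


# Canned guesses (made up for illustration, NOT real model output).
toy_guesses = {
    ('germany', 'berlin', 'france'): ['paris', 'marseille'],
    ('nigeria', 'abuja', 'turkey'): ['istanbul', 'ankara'],
}


def toy_solver(x1, y1, x2, k):
    """Return the first k canned guesses for the analogy (x1, y1, x2)."""
    return toy_guesses[(x1, y1, x2)][:k]


toy_tests = [
    ('germany', 'berlin', 'france', {'paris'}),
    ('nigeria', 'abuja', 'turkey', {'ankara'}),
]

top1 = topk_accuracy(toy_tests, toy_solver, k=1)
top2 = topk_accuracy(toy_tests, toy_solver, k=2)
print(top1, top2)
```

Comparing top-1 with top-5 accuracy, for instance, shows how often the correct answer is close to the top even when it is not ranked first.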

In [None]:
# analogy(x1='man', y1='king', x2='woman') # :)
analogy(x1='abuja', y1='nigeria', x2='ankara') # :/
analogy(x1='nigeria', y1='abuja', x2='turkey') # x/
analogy(x1='athens', y1='greece', x2='baghdad') # :/
analogy(x1='berlin', y1='germany', x2='paris') # x/
analogy(x1='germany', y1='berlin', x2='france') # x/