# Class 8: Word Embeddings - Solution


In this exercise we will explore whether word embeddings can be used to:
    
1. ideological scaling of party positions in the Danish parliament 

2. solve analogoes

3. semantic scaling of latent dimensions 

## Setup

In [1]:
# Import basic Python modules
import os
import platform

# Regular expressions
import re

# Data management
import numpy as np
import pandas as pd
from collections import namedtuple

# Progress bars
from tqdm import tqdm

# Gensim
import gensim
import gensim.downloader
from gensim.models.doc2vec import Doc2Vec

# SpaCy
import spacy

# DANLP
from danlp.models.embeddings import load_wv_with_gensim

# Scikit-learn
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

# Plotting
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
sns.set_style('darkgrid')

In [2]:
# # # # Working Directory # # # #

if platform.system() == 'Linux':
    wd = '/home/rask/'
else:
    wd = 'C:/Users/au535365/'

wd = os.path.join(wd, 'Dropbox/teaching/css_fall2023')
    
# Change directory
os.chdir(wd)

# Confirm that the working directory is as intended 
os.getcwd()

'C:\\Users\\au535365\\Dropbox\\teaching\\css_fall2023'

## 1 Ideological Scaling

In this exercise, we will investigate the extent to which word embeddings can be leveraged to generate scaling estimates of parties' ideological positions in the Danish Parliament. The exercise is based on Rheault and Cochrane (2020), which you read for today's class. 

While pretrained embeddings generally perform well (see Rodriguez and Spirling 2022), we will train our own local word embeddings using a model called `Doc2Vec` in `gensim`. Unlike the standard `Word2Vec`, the `Doc2Vec` model enables to include metadata such as the party of the speaker, age, gender, and so on. You can think of it as including control variables in a regression model. Like Rheault and Cochrane (2020), we utilize this feature to scale positions of four Danish parties in a two-dimensional space using PCA-reduced embeddings. To ease the burden of fitting our model, we only scale positions for speeches given by legislators from either
*EL*, *S*, *V*, or *DF*. We could use legislator-level "fixed effects", but we investigate the positions at the party level. Our selection of parties includes the two historical mainstream parties and the farest wing parties in recent decades (at least on average).

We work with the same data (parliamentary speeches in the Danish parliament from 2000-2021) as we have done in the previous exercises. 

1. Read in the data directly from GitHub. Discard each dataset if it has less than $10,000$ speeches.
2. Keep only speeches from *EL*, *S*, *V*, and *DF*
3. Clean, tokenize, and preproces the data. Explain the steps you take. It is on purpose that I don't specify which steps you should take. To practice for the exam, you should be able to account for your choices. The arguments can be short and does not need to hit a homerun. What matters is that you demonstrate that you have thought carefully about what you do. 
4. Prepare your cleaned, preprocessed, and tokenized data for the `Doc2Vec` using the two functions I have provided to you

        generate_indicators(*metatags)
        generate_iterator(tokens, metatags)
    where `generate_indicators` take the metadata we want to include in our model and `generate_iterator` takes the 
    tokens from step 3 (for the tokens argument) and the metatags created from `generate_indicators` (for the 
    metatags argument). You should provide a party indicator as metatags. 

5. Fit the `Doc2Vec` using the specifications to an object called `d2v`:
    - `vector_size=300`
    - `window=20`
    - `min_count=5`
    - `workers=8`
    - `epochs=10`
    - `sample=1e-3`
    
    and save the model after training using `d2v.save(model_fdir)` where you replace *model_fdir* with your own directory


6. Use PCA to reduce the resulting party embeddings to two dimensions. Adapt your code for *class05-exercise/solution* to your new setting.

7. Plot the two-dimensional party embeddings and interpret the results (again: see and adapt code for *class05*)

8. Inspect the top words associated with each of ends of the two components (left-right, north-south). You can copy-paste the PCAInterpret Python class from *class5* to do so. Once again, adapt the code if necessary.



In [None]:
def generate_tags(*metatags):
    """
    Generate indicators by combining multiple input tags.

    This function takes multiple metatags and combines them by joining each element
    with a hyphen ('-'). The result is a list of combined tags, which can be
    used to label samples.

    Parameters:
    *metatags (tuple): A tuple of input tags.

    Returns:
    list: A list of combined indicators.

    Example:
    >>> generate_tags(('A', 'B'), ('1', '2'))
    ['A-1', 'B-2']
    """
    tags = ['-'.join(map(str, t)) for t in zip(*metatags)]
    return tags

def generate_iterator(words, metatags):
    """
    Generate an iterator of namedtuples containing words and tags.

    This function creates an iterator that combines a list of words and a list
    of metatags into namedtuples. Each namedtuple has two fields: 'words' and
    'metatags', where 'words' is a list of tokens and 'metatags' is a list of associated tags.

    Parameters:
    words (nested list): A list of tokenized words.
    metatags (list): A list of tags or indicators used to fit a Doc2Vec.

    Returns:
    list: A list of namedtuples containing 'words' and 'tags'.

    Example:
    >>> generate_iterator(['apple', 'banana'], ['fruit', 'yellow'])
    [docs(words=['apple'], tags=['fruit']), docs(words=['banana'], tags=['yellow'])]
    """
    speech_iterator = namedtuple('docs', 'words tags')
    iterator = [speech_iterator(x, [y]) for x, y in zip(words, metatags)]
    return iterator

#### 1)

In [None]:
# Generate file ids
files = ['20001', 
         '20011',
         '20012',
         '20021',
         '20031',
         '20041',
         '20042',
         '20051',
         '20061',
         '20071',
         '20072',
         '20081',
         '20091',
         '20101',
         '20102',
         '20111',
         '20121',
         '20131',
         '20141',
         '20142',
         '20151',
         '20161',
         '20171',
         '20181',
         '20182',
         '20191',
         '20201',
         '20211']

# Specify base url
base_url = 'https://raw.githubusercontent.com/mraskj/css_fall2023/master/data/ft-speeches/'

# Read in data. Solution here:
df = pd.DataFrame()
for file in tqdm(files):
    df_term = pd.read_csv(base_url + file + '.csv')
    if len(df_term) > 10000:
        df = pd.concat([df, df_term])
df.reset_index(drop=True, inplace=True)

#### 2)

In [None]:
# Define list of parties to keep
parties = ['S', 'DF', 'EL', 'V']
df_filtered = df.loc[df['party'].isin(parties)].reset_index(drop=True)

#### 3)

In [None]:
# Compute length of each speech on the original dataframe
# and discard speeches outside the span of the 25th-75th percentile.
# Note that it is perfectly fine to compute the number of words on the filtered dataframe as well.
# This makes no difference at all. 
df['n_words'] = df.text.apply(lambda x: len(x))
df_filtered['n_words'] = df_filtered.text.apply(lambda x: len(x))
p25, p75 = np.quantile(df['n_words'], q=.25), np.quantile(df['n_words'], q=.75)
df_filtered = df_filtered.loc[(df_filtered['n_words'] <= p75) & (df_filtered['n_words'] >= p25)]

In [None]:
# Define names of legislators, parties, and context-specific stopwords that we want to remove
names_to_remove = [x[:-1] + '[a-z]+' for x in list(df.speaker.unique())]

procedural_to_remove = ['[Ll]ovforsla[a-z]+', 'ordfør[a-z]+', 'spørgsmå[a-z]+',
                        'forsla[a-z]+', 'L', 'B', '[Hh]r', '[Ff]ru', '[Aa]fstemnin[a-z]+',
                        '[Ff]orhandlin[a-z]+', '[Hh]r', '[Ff]ru']
                   
parties_to_remove = ['[Ll]iberal [Aa]llianc[a-z]+', 'LA', '[Dd]et [Kk]onservative [Ff]olkepar[a-z]+', 'KF',
                   '[Dd]e [Kk]onservati[a-z]+', 'Venst[a-z]+', '[Dd]ansk [Ff]olkepart[a-z]+', 
                   '[Nn]ye [Bb]orgerli[a-z]+', '[Dd]e [Rr]adikal[a-z]+', '[Ss]ocialdemokratie[a-z+]',
                   '[Ss]ocialdemokra[a-z]+', '[Ss]ocialistis[a-z]+ [Ff]olkepart[a-z]+', 'SF',
                   '[Aa]lternative[a-z]+', '[Ee]nhedslist[a-z]+', '[Rr]adika[a-z]+']        

removal_words = parties_to_remove + procedural_to_remove + names_to_remove
        
removal_pattern = r'\b(?:' + '|'.join(removal_words) + r')\b'

In [None]:
# Define list to contain our corpus 
corpus = list(df_filtered['text'])

In [None]:
# Conduct text cleaning, preprocessing, and tokenization.
# Note that I use a lazy tokenization version here. This is way more imprecise
# than using SpaCy, but also much faster. For our purpose here, it is sufficient, 
# but ideally you use SpaCy.
# Note also that I do not remove stopwords. It is fine if you remove them.
# but rule-of-thumb is to preprocess text *less* when using word embeddings.
# The reason is that we want to learn representations of language *as it is*.
# Removing text might disturb this goal.

# Remove \xa0 from text
corpus = [re.sub('\xa0', '', doc) for doc in tqdm(corpus)]

# Remove double or more consecutive whitespaces
corpus = [re.sub(' +', ' ', doc) for doc in tqdm(corpus)]

# Remove names defined above
corpus = [re.sub(removal_pattern, '', doc) for doc in tqdm(corpus)]

# Convert to lowercase
corpus = [doc.lower() for doc in tqdm(corpus)]

# Lazy tokenization where I split on whitespace (.split() uses ' ' as the default value)
corpus = [doc.split() for doc in corpus]

# Keep only tokens that contain three or more characters
corpus = [[x for x in doc if len(x) >= 3] for doc in corpus]

#### 4)

In [None]:
# Prepare data for training
covariates = generate_tags(list(df_filtered.party.values))
iterator = generate_iterator(words=corpus, metatags=covariates)

#### 5) 

In [None]:
# Fit the Doc2Vec model with:
#    - `vector_size=300`
#    - `window=20`
#    - `min_count=5`
#    - `workers=8`
#    - `epochs=10`
#    - `sample=1e-3`

d2v = Doc2Vec(iterator,
              vector_size=300, 
              window=20, 
              min_count=5,
              workers=8,
              epochs=10, 
              sample = 1e-3)

In [11]:
wordlist = []
for word, vocab_obj in dd.wv.vocab.items():
    wordlist.append((word, vocab_obj.count))

In [13]:
wordlist = sorted(wordlist, key=lambda tup: tup[1], reverse=True)

In [14]:
wordlist

[('det', 759390),
 ('jeg', 412421),
 ('når', 251909),
 ('men', 211231),
 ('godt', 207787),
 ('altså', 206145),
 ('bare', 173651),
 ('gerne', 150949),
 ('danmark', 150280),
 ('får', 149548),
 ('der', 145175),
 ('regeringen', 140716),
 ('dag', 136713),
 ('helt', 135051),
 ('faktisk', 129595),
 ('forslag', 126099),
 ('mener', 119803),
 ('forhold', 118191),
 ('set', 114745),
 ('siger', 113504),
 ('tror', 110924),
 ('selvfølgelig', 103121),
 ('tak', 102647),
 ('spørgsmål', 96552),
 ('går', 95196),
 ('rigtig', 94350),
 ('står', 89847),
 ('danske', 89332),
 ('tage', 89291),
 ('ting', 87172),
 ('komme', 85628),
 ('måde', 80342),
 ('derfor', 79281),
 ('arbejde', 78843),
 ('hele', 76677),
 ('gang', 74094),
 ('mennesker', 73724),
 ('forslaget', 71530),
 ('for', 69436),
 ('sagt', 68015),
 ('penge', 67883),
 ('ser', 67386),
 ('sikre', 67105),
 ('side', 66970),
 ('del', 66585),
 ('mulighed', 65794),
 ('hvert', 65531),
 ('fald', 65179),
 ('ønsker', 63467),
 ('blevet', 63057),
 ('netop', 62704),
 ('lo

#### 6)

In [None]:
def dimensionality_reduction(embed_model, indicators, n_components=2, return_dataframe=True):
    """
    Perform dimensionality reduction on high-dimensional embeddings using PCA.

    This function takes an embedding model (embed_model) and a list of indicators, and it performs
    dimensionality reduction on the embeddings associated with those indicators
    using PCA.

    Parameters:
    embed_model: An embedding model such as Doc2Vec or Word2Vec from gensim
    indicators (list): A list of indicators or tags.
    n_components (int): The number of principal components to retain (default is 2).
    return_pandas (bool): If True, the result is returned as a pandas DataFrame with
        columns 'PC1' and 'PC2' and an 'indicator' column (default is False).

    Returns:
    Z (array or DataFrame): An array or DataFrame containing the reduced-dimensional
        embeddings. If return_pandas is True, Z is a DataFrame.
    dr (PCA): The PCA model used for dimensionality reduction.

    Example:
    >>> Z, dr = dimensionality_reduction(embed_model, ['tag1', 'tag2'], n_components=2)
    >>> Z, dr = dimensionality_reduction(embed_model, ['tag1', 'tag2'], return_pandas=True)
    """
    z = np.zeros((len(indicators), embed_model.vector_size))

    for i in range(len(indicators)):
        z[i,:] = embed_model.docvecs[indicators[i]]

    dr = PCA(n_components=n_components)

    for i in range(len(indicators)):
        z[i,:] = embed_model.docvecs[indicators[i]]
    Z = dr.fit_transform(z)    
    
    if return_dataframe:
        Z = pd.DataFrame(Z)
        Z.columns = ['PC1', 'PC2']
        Z['indicator'] = indicators
    
    return Z, dr

In [None]:
# Use PCA to reduce the 300-dimensional party embeddings to two dimensions
Z, pca_model = dimensionality_reduction(embed_model=d2v, indicators=parties, n_components=2)

#### 7)

In [None]:
def plot_pca(dataframe, metatags:list, color:list, show=True):
    
    """
    Plot a PCA visualization of data points in a DataFrame.

    This function takes a DataFrame with PCA-reduced data, a list of metatags,
    and a colormap (cmap) to visualize data points in a two-dimensional PCA space.

    Parameters:
    dataframe (DataFrame): A DataFrame containing PCA-reduced data with 'PC1' and 'PC2' columns.
    metatags (list): A list of tags corresponding to data points.
    color (list): A colormap to assign colors to data points based on a category.
    show (bool): If True, the plot is displayed (default is True).

    Returns:
    None

    Example:
    >>> plot_pca(dataframe=Z, metatags=['V', 'S'], color=['red', 'blue'], show=True)
    """
    
    ##if len(dataframe) != metatags:
     #   raise ValueError(f"dataframe and metatags must have the same length")
    
    mpl.rcParams['axes.titlesize'] = 20
    mpl.rcParams['axes.labelsize'] = 20
    mpl.rcParams['font.size'] = 14

    plt.figure(figsize=(22,15))
    plt.scatter(dataframe.PC1, dataframe.PC2, color=color)
    texts=[]
    for label, x, y, c in zip(metatags, Z.PC1, Z.PC2, color):
        plt.annotate(
            label,
            xy=(x, y), xytext=(-20, 20),
            textcoords='offset points', ha='right', va='bottom',
            bbox=dict(boxstyle='round,pad=0.5', fc=c, alpha=0.3),
            arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))

    plt.xlabel("PC1")
    plt.ylabel("PC2")

    if show:
        plt.show()

In [None]:
# Plot the two PCAs with labels for each party position
n_tags = int(len(np.unique(parties)) / len(parties))
cmap = {'V': 'blue', 'DF': 'gold', 'S': 'red', 'EL': 'sandybrown'}
cols = [cmap['EL']]*n_tags + [cmap['S']]*n_tags + [cmap['V']]*n_tags + [cmap['DF']]*n_tags
plot_pca(dataframe=Z, metatags=parties, color=cols, show=True)

#### 8)

In [None]:
class PCA_INTERPRET(object):
    
    """
    Perform interpretation of PCA results for word embeddings.

    This class is designed to interpret PCA results of word embeddings, particularly for
    understanding the semantics of the principal components. It provides methods for
    sorting words based on their association with different components and directions.

    Parameters:
    model (Doc2Vec): A Doc2Vec model.
    parties (list): A list of party labels.
    dr (PCA): A PCA model for dimensionality reduction.
    Z (array): PCA-reduced data.
    labels (list): Labels for data points.
    rev1 (bool): Reverse the first component direction (default is False).
    rev2 (bool): Reverse the second component direction (default is False).
    min_count (int): Minimum word count threshold (default is 100).
    max_count (int): Maximum word count threshold (default is 1000000).
    max_features (int): Maximum number of features (words) to consider (default is 10000).

    Attributes:
    model (Doc2Vec): A Doc2Vec model.
    parties (list): A list of party labels.
    labels (list): Labels for data points.
    P (int): Number of parties.
    M (int): Size of the word vectors.
    voc (list): Sorted vocabulary based on word counts.
    V (int): Number of words in the vocabulary.
    pca (PCA): A PCA model for dimensionality reduction.
    max (array): Maximum values in the reduced data.
    min (array): Minimum values in the reduced data.
    sims (DataFrame): Similarities of words in reduced space.

    Example:
    >>> pca_interpreter = PCA_INTERPRET(d2v_model, parties, pca_model, reduced_data, labels, rev1=False, rev2=True)
    """
    
    def __init__(self, model, parties, dr, Z, labels, rev1=False, rev2=False, min_count=100, max_count = 1000000, max_features=10000):

        self.model = model
        self.parties = parties
        self.labels = labels
        self.P = len(self.parties)
        self.M = self.model.vector_size   
        self.voc = self.sorted_vocab(min_count, max_count, max_features)
        self.V = len(self.voc)   
        self.pca = dr
        self.max = Z.max(axis=0)
        self.min = Z.min(axis=0)
        self.sims = self.compute_sims()
        self.dim1 = rev1
        self.dim2 = rev2
        
    def sorted_vocab(self, min_count=100, max_count=10000, max_features=10000):
        wordlist=[]
        for word, vocab_obj in self.model.wv.vocab.items():
            wordlist.append((word, vocab_obj.count))
        wordlist = sorted(wordlist, key=lambda tup: tup[1], reverse=True)
        return [w for w,c in wordlist if c>min_count and c<max_count and w.count('_')<3][0:max_features]
    
    def compute_sims(self):

        Z = np.zeros((self.V, 2))
        for idx, w in enumerate(self.voc):
            Z[idx, :] = self.pca.transform(self.model.wv[w].reshape(1,-1))
        sims_right = euclidean_distances(Z, np.array([self.max[0],0]).reshape(1, -1))
        sims_left = euclidean_distances(Z, np.array([self.min[0],0]).reshape(1, -1))
        sims_up = euclidean_distances(Z, np.array([0,self.max[1]]).reshape(1, -1))
        sims_down = euclidean_distances(Z, np.array([0,self.min[1]]).reshape(1, -1))
        temp = pd.DataFrame({'word': self.voc, 'right': sims_right[:,0], 'left': sims_left[:,0], 'up': sims_up[:,0], 'down': sims_down[:,0]})
        return temp

    def top_words_list(self, topn=20):

        if self.dim1:
            ordering = ['left','right']
        else:
            ordering = ['right', 'left']
        temp = self.sims.sort_values(by=ordering[0])
        print(80*"-")
        print("Words Associated with Positive Values (Right) on First Component:")
        print(80*"-")
        self.top_positive_dim1 = temp.word.tolist()[0:topn]
        self.top_positive_dim1 = ', '.join([w.replace('_',' ') for w in self.top_positive_dim1])
        print(self.top_positive_dim1)
        temp = self.sims.sort_values(by=ordering[1])
        print(80*"-")
        print("Words Associated with Negative Values (Left) on First Component:")
        print(80*"-")
        self.top_negative_dim1 = temp.word.tolist()[0:topn]
        self.top_negative_dim1 = ', '.join([w.replace('_',' ') for w in self.top_negative_dim1])
        print(self.top_negative_dim1)

        if self.dim2:
            ordering = ['down','up']
        else:
            ordering = ['up', 'down']
        temp = self.sims.sort_values(by=ordering[0])
        print(80*"-")
        print("Words Associated with Positive Values (North) on Second Component:")
        print(80*"-")
        self.top_positive_dim2 = temp.word.tolist()[0:topn]
        self.top_positive_dim2 = ', '.join([w.replace('_',' ') for w in self.top_positive_dim2])
        print(self.top_positive_dim2)
        temp = self.sims.sort_values(by=ordering[1])
        print(80*"-")
        print("Words Associated with Negative Values (South) on Second Component:")
        print(80*"-")
        self.top_negative_dim2 = temp.word.tolist()[0:topn]
        self.top_negative_dim2 = ', '.join([w.replace('_',' ') for w in self.top_negative_dim2])
        print(self.top_negative_dim2)
        print(80*"-")

In [None]:
# Use PCA_INTERPRET class to get words associated with each of the two dimensions and their two directons
PCA_INTERPRET(d2v, parties, pca_model, Z, parties, rev1=False, rev2=False, min_count=100, max_count = 1000000, max_features = 50000).top_words_list(20)

## Exercise 2: Analogies

We start by exploring whether the iconic word analogy can be solved usign pretrained models for English and Danish language.

1) Load in a pretrained model for English of your choice from `gensim` (https://radimrehurek.com/gensim/models/word2vec.html)
    - Available models: ['fasttext-wiki-news-subwords-300',
     'conceptnet-numberbatch-17-06-300',
     'word2vec-ruscorpora-300',
     'word2vec-google-news-300',
     'glove-wiki-gigaword-50',
     'glove-wiki-gigaword-100',
     'glove-wiki-gigaword-200',
     'glove-wiki-gigaword-300',
     'glove-twitter-25',
     'glove-twitter-50',
     'glove-twitter-100',
     'glove-twitter-200']

2) Solve the analogy *"man is to woman as king is to _____"*. Interpret the results.

3) Load in the pretrained model for Danish ['conll17.da.wv'] using the `danlp` module.

4) Repeat step 2). Can the Danish embeddings also solve the analogy? Interpret the result and differences if there are any.

5) Explore whether the changing the wording (e.g. from singular to plural) matters for the Danish model

6) Can you come up with one or two other analogies that word embeddings might solve? Repeat step 2 and 4 for your new analogy.

In [None]:
# 1) Loading pretrained model 'word2vec-google-news-300' from gensim
w2v_en = gensim.downloader.load('word2vec-google-news-300')

In [None]:
# 2) Solve the analogy "man is to woman as king is to ____"
top_similarity_en = w2v_en.most_similar(positive=['king', 'woman'], negative=['man'])
print(top_similarity_en)

In [None]:
# 3) Loading Danish model 'conll17.da.wv'
w2v_da = load_wv_with_gensim('conll17.da.wv')

In [None]:
# 4) Solve the analogy "man is to woman as king is to ____"
top_similarity_da = w2v_da.most_similar(positive=['konge', 'kvinde'], negative=['mand'])
print(top_similarity_da)

In [None]:
# 5) Construct gender pairs to explore sensitivity to wording
gender_pairs = [
    ("mand", "kvinde"),
    ("mænd", "kvinder"),
    ("han", "hun"),
    ("ham", "hende"),
    ("hans", "hendes")]

# Compute similarity for each gender pair
sim_list = []
for gp in gender_pairs:
    sim_list += w2v_da.most_similar(positive=[gp[1], 'konge'], negative=[gp[0]], topn=20)
    
# a) Convert similarity list into dataframe 
# b) Groupby identified words and compute the average cosine similarity
# c) Keep only identified words that appear in 3/5 or more pairs
# d) Sort values by mean
word_df = pd.DataFrame(sim_list, columns=['word', 'score'])
word_df = word_df.groupby('word')['score'].describe().reset_index()
word_df = word_df.loc[word_df['count'] >= 3]
word_df = word_df.sort_values(['mean'], ascending=False)
word_df

In [None]:
# 2.6) Solving: "Copenhagen is to Denmark what London is to ____"
w2v_da.most_similar(positive=['danmark', 'london'], negative=['københavn'])

In [None]:
# 2.6) Solving: "Messi is to football what Karabatic is to ____"
w2v_da.most_similar(positive=['fodbold', 'karabatic'], negative=['messi'])

## 3 Semantic Scaling

With fairly little effort, we were able to obtain a descent representation of the ideological positions of four Danish parties. For the ideological scaling exercise, we *learned* so-called party embeddings, which we then reduced to two dimensions using PCA. That we reduces the space to two-dimensions is a pure modeling choice, which is often used because it is convenient for visualization, and it also makes quite a lot of sense (an economic dimension and a cultural dimension).

Word embeddings can also learn *semantic scales* directly from the learned word representations. In a much-cited paper by Kozlowski et al. (2019) "*The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings*" from 2019, it is shown that word embeddings capture semantic relations between words that map onto a cultural dimension of class. The paper exploits the "analogy-solving" feature of word embeddings (e.g. *"man is to woman as king is to _____"*) to generate group-based dimensions in the vector space such as class, gender, and race. 

In this exercise, we investigate whether we can use pretrained embeddings on Danish language ('conll17.da.wv') to analyze the cultural dimensions of social class. To do so, we construct an *affluence* dimension which basically spans from "rich" to "poor". This is simply done using vector subtraction $\mathbf{a} - \mathbf{b}$ where subtracting vector $\mathbf{b}$ from $\mathbf{a}$ yields a semantic meaningful dimension. Hence, subtracting $affluence - poverty$, we construct an affluence dimension. 

The most iconic example of this logic is the *"man is to woman as king is to _____"* where the correct answer is *queen*. Using word embeddings, this can be approximated by first constructing the gender dimension $gender_{dim} = woman - man$ and then adding the vector for $king$, i.e. $woman - man + king$. This has the effect of starting at $king$ and then taking one step on the gender dimension scaled from $woman$ to $man$.

Below, I have given you a set of word pairs from Kozlowski et al. (2019), which I have translated using ChatGPT. Each word pair corresponds to the *affluence dimension*. This list is called `aff_pairs` and consists of $32$ word pairs. I also provide you three other lists called `sports`, `foods`, and `drinks`, respectively. These consists of words that we *project* onto the *affluence dimension* using cosine similarity. Since the vectors 'conll17.da.wv' are normalized, this corresponds 1:1 to matrix multiplication. 


1) Why do we use more than one word pair to construct the *affluence dimension*? 
2) Write a function that computes the cosine similarity. Use NumPy's `np.dot()` and `np.linalg.norm()` for the individual parts.
3) Check whether each pair of words in `aff_pairs` are present in the vocabulary used to train 'conll17.da.wv'. Filter away pairs where one or both words are missing (if any at all).
4) Construct the semantic dimension for *affluence* (*hint*: All you need to do is to iterate over each word pair subtract the two word vectors, and finally average over the pairs. I suggest a list comprehension.)
5. Project each of the words in `sports`, `foods`, and `drinks` onto the dimension from step 4 using the cosine similarity you wrote in step 2. Save the result in a list and convert to a pandas dataframe with two columns: one with the projected word (e.g. 'håndbold') and one for the projection score (i.e. the cosine similarity).
6. Visualize the results using a barchart sorted from highest to lowest. Interpret and discuss the results.
7. In the exercise, we have projected words onto an *affluence dimension*, but this idea generalizes to any semantic scale. For instance, if you construct a gender dimension for kids ranging from $girl - boy$ and project toys (e.g. dukker, biler, lego, etc.), you should also get meaningful results. Now it's your turn to be creative. Create a semantic scale and project a set of words onto your scale. 

In [None]:
# Affluence word pairs
aff_pairs = [
    ("rig", "fattig"),
    ("rigere", "fattigere"),
    ("rigeste", "fattigste"),
    ("velstand", "fattigdom"),
    ("fordelagtige", "ulemper"),
    ("fordelagtigt", "ulempe"),
    ("fordelagtig", "ulempen"),
    ("velhavende", "hjælpeløs"),
    ("elegant", "uelegant"),
    ("dyr", "billig"),
    ("dyrt", "billigt"),
    ("overklasse", "underklasse"),
    ("eksklusiv", "normalt"),
    ("luksuriøs", "elendig"),
    ("luksus", "billig"),
    ("velhavende", "fattig"),
    ("velstående", "lavindkomst"),
    ("dyr", "enkel"),
    ("værdifuld", "værdiløs"),
    ("privilegeret", "underprivilegeret"),
    ("privilegeret", "uprivilegeret"),
    ("ejendom", "almenbolig"),
    ("udviklet", "underudviklet"),
    ("succesfuld", "usuccesfuld"),
    ("prangende", "simpel"),
    ("velhavende", "trængende"),
    ("rent", "beskidt"),
    ("velhavende", "forarmet"),
    ("luksuriøs", "faldefærdig"),
    ("velholdt", "faldefærdig"),
    ("velholdt", "faldefærdigt"),
    ("overflod", "nødlidende")
]

In [None]:
# Projection words
sports = ['fodbold', 'håndbold', 'ridning', 'hestesport','golf', 'tennis', 
          'boksning', 'badminton', 'svømning', 'ishockey', 'hockey', 'spejder']

foods = ['pizza', 'burger', 'steak', 'slik', 'chips', 'saftevand', 'grøntsager', 'kartofler', 
         'sovs', 'sauce','frikadeller', 'flæskesteg', 'østers', 'muslinger',
         'skaldyr', 'fisk']

drinks = [ 'cola', 'danskvand', 'rødvin', 'rosé', 'fadøl', 'dåseøl', 'cocktails', 'drinks', 'juice']

In [None]:
class EmbeddingProjections:
    
    def __init__(self, model, dimension_words, projection_words):
        
        self.model = model
        self.vocab = list(self.model.vocab.keys())
        self.projection_words = projection_words
        self.dimension_words = [p for p in dimension_words if p[0] in self.vocab and p[1] in self.vocab]
    
        
    @staticmethod
    def cos_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
        
    @staticmethod
    def plot_projections(dataframe, color='skyblue'):
        dataframe = dataframe.sort_values(['score'], ascending=True)
        plt.figure(figsize=(12, 8))
        plt.barh(dataframe['word'], dataframe['score'], color=color)
        plt.xlabel('Cosine similarity')
        plt.title('Word Embedding Scores')
        plt.tight_layout()
    
    
    def semantic_dimension(self, return_vectors=True, return_mean=False):
        dim = np.array([self.model[x[0]] - self.model[x[1]] for x in self.dimension_words])
        
        if return_vectors and not return_mean:
            return dim
        elif return_mean and not return_vectors:
            return np.mean(dim, axis=0)
        else:
            return dim, np.mean(dim, axis=0)
            
    
    def dim_projection(self, return_dataframe=True, verbose=False):
        
        dimension_vectors, dimension_mean = self.semantic_dimension(return_mean=True)
        
        score_list = []
        for word in self.projection_words:
            if word in self.vocab:
                score = self.cos_similarity(self.model[word], dimension_mean)
                score_list += [score]
                if verbose:
                    print(f"{word}: {score}")
            else:
                if verbose:
                    print(f"{word} not in vocab")

        if return_dataframe:
            self.score_df = pd.DataFrame({'word': self.projection_words, 'score': score_list})
            return self.score_df
        else:
            return score_list


In [None]:
# Initiate instance of the EmbeddingProjections class
projector = EmbeddingProjections(model=w2v_da, dimension_words=aff_pairs, projection_words=foods)

# Compute projections
sport_scores = projector.dim_projection()

# Plot projections 
projector.plot_projections(dataframe=sport_scores)

In [None]:
# Identify gendered language for children's toy
gender_pairs = [('drenge', 'piger', 'dreng', 'pige', 'drengene', 'pigerne')]
toys = ['dukke', 'dukker', 'lego', 'biler', 'krig', 'tegne']
projector = EmbeddingProjections(model=w2v_da, dimension_words=gender_pairs, projection_words=toys)
toy_scores = projector.dim_projection()
projector.plot_projections(dataframe=toy_scores)