[SemAxis](https://arxiv.org/pdf/1806.05521.pdf) is a method for scoring terms along a user-defined axis (e.g., positive-negative, concrete-abstract, hot-cold), which can be used to address a range of empirical questions (for one example, see [Kozlowski et al. 2019](https://journals.sagepub.com/doi/full/10.1177/0003122419877135)). In this activity, you'll implement SemAxis using word representations from GloVe, and use it to explore corpus-specific conceptual associations.

Gensim must be installed before running this notebook; if it is not, install it with:

`conda install gensim`


In [1]:
import re
from gensim.models import KeyedVectors
import numpy as np
import numpy.linalg as LA

In this activity, we'll be working with pre-trained word embeddings using the `gensim` library, which provides a number of functions for accessing representations for individual words and comparing them. The representations we'll use come from [GloVe](https://nlp.stanford.edu/projects/glove/); the `6B` in the file name below indicates the release trained on Wikipedia 2014 and Gigaword 5 (6 billion tokens) rather than one of the [Common Crawl](https://en.wikipedia.org/wiki/Common_Crawl) releases.

In [2]:
glove = KeyedVectors.load_word2vec_format("../data/glove.6B.100d.100K.w2v.txt", binary=False)

In [3]:
good_vector=glove["good"]

In [4]:
print(good_vector)

[-0.030769   0.11993    0.53909   -0.43696   -0.73937   -0.15345
  0.081126  -0.38559   -0.68797   -0.41632   -0.13183   -0.24922
  0.441      0.085919   0.20871   -0.063582   0.062228  -0.051234
 -0.13398    1.1418     0.036526   0.49029   -0.24567   -0.412
  0.12349    0.41336   -0.48397   -0.54243   -0.27787   -0.26015
 -0.38485    0.78656    0.1023    -0.20712    0.40751    0.32026
 -0.51052    0.48362   -0.0099498 -0.38685    0.034975  -0.167
  0.4237    -0.54164   -0.30323   -0.36983    0.082836  -0.52538
 -0.064531  -1.398     -0.14873   -0.35327   -0.1118     1.0912
  0.095864  -2.8129     0.45238    0.46213    1.6012    -0.20837
 -0.27377    0.71197   -1.0754    -0.046974   0.67479   -0.065839
  0.75824    0.39405    0.15507   -0.64719    0.32796   -0.031748
  0.52899   -0.43886    0.67405    0.42136   -0.11981   -0.21777
 -0.29756   -0.1351     0.59898    0.46529   -0.58258   -0.02323
 -1.5442     0.01901   -0.015877   0.024499  -0.58017   -0.67659
 -0.040379  -0.44043    0.0

Functions useful for this activity include the following:

In [5]:
# access the representation for a single word
great_vector=glove["great"]

# use numpy to average multiple vector representations together
vecs_to_average=[good_vector, great_vector]
average=np.mean(vecs_to_average, axis=0)

# calculate the cosine similarity between two vectors
cosine_similarity=glove.cosine_similarities(good_vector, [great_vector])

print(good_vector.shape, great_vector.shape, average.shape, cosine_similarity)

(100,) (100,) (100,) [0.7592798]
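To make the last of these functions less of a black box, here is a minimal sketch of the cosine similarity computation written directly in numpy. The helper name `cosine` and the 2-d vectors are made up for illustration; they are not GloVe values or part of gensim.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 2-d vectors (invented for illustration, not real embeddings)
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([-2.0, 0.0])

print(round(cosine(a, b), 3))  # 0.707: the vectors are 45 degrees apart
print(round(cosine(a, c), 3))  # -1.0: opposite directions; length is ignored
```

Because the score depends only on the angle between the vectors, it always falls between -1 and 1, regardless of vector length.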


Implement the [SemAxis](https://arxiv.org/pdf/1806.05521.pdf) method as described in class. Given a set of word embeddings for positive terms $S^+ = \{v_1^+, \ldots, v_n^+\}$ and embeddings for negative terms $S^- = \{v_1^-, \ldots, v_m^-\}$ that define the endpoints of the axis, your output should be a single real-valued score for an input word $w$ with word representation $v_w$:

$$
\textrm{score}(w)_{\mathbf{V}_\textrm{axis}} = \cos(v_w, \mathbf{V}_\textrm{axis})
$$

where:
$$
\mathbf{V}^+ = {1 \over n} \sum_{i=1}^n v_i^+
$$

$$
\mathbf{V}^- = {1 \over m} \sum_{i=1}^m v_i^-
$$

$$
\mathbf{V}_{\textrm{axis}} = \mathbf{V}^+ - \mathbf{V}^-
$$
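The formulas above can be traced by hand on a tiny example. The sketch below uses hypothetical 2-d embeddings (the dictionary `toy` and the function name `semaxis_score` are invented for illustration) so that each intermediate quantity is easy to check; it is not the notebook's reference solution.

```python
import numpy as np

# Hypothetical 2-d embeddings, invented for illustration only
toy = {
    "hot":  np.array([1.0, 0.0]),
    "warm": np.array([1.0, 1.0]),
    "cold": np.array([-1.0, 0.0]),
    "cool": np.array([-1.0, 1.0]),
    "fire": np.array([3.0, 1.0]),
    "ice":  np.array([-2.0, 1.0]),
}

def semaxis_score(emb, positive_terms, negative_terms, target_word):
    # V+ and V-: the mean vector of each pole
    v_pos = np.mean([emb[w] for w in positive_terms], axis=0)
    v_neg = np.mean([emb[w] for w in negative_terms], axis=0)
    # V_axis = V+ - V-
    v_axis = v_pos - v_neg
    # score(w) = cos(v_w, V_axis)
    v_w = emb[target_word]
    return float(np.dot(v_w, v_axis) / (np.linalg.norm(v_w) * np.linalg.norm(v_axis)))

# Here V+ = [1, 0.5], V- = [-1, 0.5], so V_axis = [2, 0]
print(round(semaxis_score(toy, ["hot", "warm"], ["cold", "cool"], "fire"), 3))  # 0.949
print(round(semaxis_score(toy, ["hot", "warm"], ["cold", "cool"], "ice"), 3))   # -0.894
```

Note that the two poles may contain different numbers of terms, since each is averaged separately before the subtraction.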



In [21]:
positive_terms = ["good", "great"]


In [11]:
def get_semaxis_score(vectors, positive_terms=None, negative_terms=None, target_word=None):

    # access the representation of the target word
    target_vector = vectors[target_word]

    # collect the vectors for the terms at each end of the axis
    pos_vectors = [vectors[term] for term in positive_terms]
    neg_vectors = [vectors[term] for term in negative_terms]

    # average each set separately, then subtract to get the axis vector
    v_pos = np.mean(pos_vectors, axis=0)
    v_neg = np.mean(neg_vectors, axis=0)
    v_axis = v_pos - v_neg

    # cosine similarity between the target word and the axis
    score = np.dot(target_vector, v_axis) / (LA.norm(target_vector) * LA.norm(v_axis))

    return float(score)

In [10]:
# should be 0.342
get_semaxis_score(glove, positive_terms=["woman", "women"], negative_terms=["man", "men"], target_word="actress")

Now let's score a set of target terms along that axis, printing them from highest to lowest score:

In [None]:
def score_list_of_targets(vectors, positive_terms=None, negative_terms=None, target_words=None):
    scores=[]
    for target in target_words:
        scores.append((get_semaxis_score(vectors, positive_terms, negative_terms, target), target))

    for k,v in reversed(sorted(scores)):
        print("%.3f\t%s" % (k,v))

In [None]:
targets=["doctor", "nurse", "actor", "actress", "mechanic", "librarian", "architect", "magician", "cook", "chef"]

In [None]:
score_list_of_targets(glove, positive_terms=["woman", "women"], negative_terms=["man", "men"], target_words=targets)

Define **your own concept axis** by selecting a set of positive and negative terms, and illustrate its utility by scoring a set of 10 target terms (as we did above). If you've implemented `get_semaxis_score` above, you only need to add terms to the `positive_terms`, `negative_terms`, and `targets` lists below and execute this cell.

In [None]:
positive_terms=[]
negative_terms=[]
targets=[]

score_list_of_targets(glove, positive_terms=positive_terms, negative_terms=negative_terms, target_words=targets)
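When choosing your own terms, keep in mind that looking up a word with no embedding raises a `KeyError`. One way to check terms beforehand is sketched below. The helper `split_by_vocabulary` is hypothetical, and the sketch assumes only that the embedding object supports the `in` membership test, as a dict does and as gensim's `KeyedVectors` is expected to; a plain dict stands in for `glove` here.

```python
def split_by_vocabulary(vectors, terms):
    """Separate terms into those with an embedding and those without."""
    known = [t for t in terms if t in vectors]
    missing = [t for t in terms if t not in vectors]
    return known, missing

# A plain dict stands in for the embedding table (illustrative values only)
toy_vectors = {"hot": [1.0, 0.0], "cold": [-1.0, 0.0]}

known, missing = split_by_vocabulary(toy_vectors, ["hot", "cold", "lukewarmish"])
print(known)    # ['hot', 'cold']
print(missing)  # ['lukewarmish']
```

Reporting the missing terms, rather than silently dropping them, makes it clear when an axis ends up defined by fewer words than intended.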

Let's assume now that you're able to score all words in a vocabulary along several conceptual dimensions (like the one you've defined) for a given set of word embeddings trained on a dataset.  What could you do with that score? Brainstorm possible applications.