# 'Mikolov Negatives'
If we take the vector of king, subtract the vector of man, and add the vector of woman, how close are we to the vector of queen?
This notebook will implement a function to allow smooth testing of this experiment.

Changes in XXL: No OPT. GPT-3 Davinci and google/flan-t5-xxl (11b)

In [3]:
import os
import torch
import openai
import numpy as np

import sentencepiece    # necessary for proper t5 init.
from transformers import T5Tokenizer, T5EncoderModel

from sklearn.metrics.pairwise import cosine_similarity

# api key set in conda env.
openai.api_key = os.getenv('OPENAI_API_KEY')

## Load Models

### Vocab
Now expanded to the Oxford 5000, plus relevant test words.

In [4]:
vocab = []
with open('./vocab/expanded_vocab.txt', 'r') as f:
    for line in f:
        vocab.append(line.strip())

len(vocab)

5124

### GPT-3
Vectors are normalized. Produced by https://arxiv.org/abs/2201.10005. Note: "Our models achieved new state-of-the-art results in linear-probe classification, text search and code search. We find that our models **underperformed on sentence similarity tasks and observed unexpected training behavior with respect to these tasks**." (emphasis mine) 

In [5]:
# Loading saved embeddings from GPT-3 Davinci
# Davinci embeds: 12,228 dims
dav_embeds = []
with open(u'./gpt/gpt_davinci.txt', 'r') as f:
    for line in f:
        dav_embeds.append([float(x) for x in line.strip().split()])

model_gpt = dict(zip(vocab, dav_embeds))

In [6]:
# for getting new embeddings from OpenAI
def gpt_embed(text, engine='text-similarity-davinci-001'):
    text = text.replace('\n', ' ')
    return openai.Embedding.create(input=[text], engine=engine)['data'][0]['embedding']

### T5
Here is our encoder-decoder model. I will extract embeddings from the encoder only, as per results from https://arxiv.org/pdf/2108.08877.pdf. Note: "When mean pooling is applied to the T5’s encoder outputs, it greatly outperforms the average embeddings of BERT. Notably, even without fine-tuning, the average embeddings of the T5’s encoder-only outputs outperforms SimCSE-RoBERTa, which is fine-tuned on NLI datase."

Seeing as I amd doing no fine tuning, and mean pool OPT decoder outputs, I will do the same with T5's **encoder** outputs.

In [9]:
t5_tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-xxl', cache_dir='/scratch/mbarlow6/.cache')
model_t5_raw = T5EncoderModel.from_pretrained('google/flan-t5-xxl', cache_dir='/scratch/mbarlow6/.cache')

Some weights of the model checkpoint at google/flan-t5-xxl were not used when initializing T5EncoderModel: ['decoder.block.19.layer.1.EncDecAttention.o.weight', 'decoder.block.22.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.8.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.1.EncDecAttention.q.weight', 'decoder.block.20.layer.0.SelfAttention.v.weight', 'decoder.block.17.layer.0.layer_norm.weight', 'decoder.block.19.layer.1.EncDecAttention.q.weight', 'decoder.block.14.layer.1.EncDecAttention.q.weight', 'decoder.block.23.layer.1.layer_norm.weight', 'decoder.block.20.layer.1.EncDecAttention.o.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.5.layer.2.DenseReluDense.wo.weight', 'decoder.block.9.layer.2.layer_norm.weight', 'decoder.block.13.layer.1.EncDecAttention.k.weight', 'decoder.block.2.l

In [10]:
t5_embeds = []
with open('./t5/flan_t5_11b.txt', 'r') as f:
    for line in f:
        t5_embeds.append([float(x) for x in line.strip().split()])
model_t5 = dict(zip(vocab, t5_embeds))

In [11]:
def t5_embed(text, tokenizer=t5_tokenizer, model=model_t5_raw, debug=False):
    inputs = tokenizer(text, return_tensors='pt')
    if debug:
        print('Tokens Requested:')
        print(tokenizer.batch_decode(inputs.input_ids[0]))
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = torch.squeeze(outputs.last_hidden_state, dim=0)
    return np.array(torch.mean(embeddings, dim=0))

## Helpers
Here I will define some functions for the purpose of the investigation.

In [12]:
def positive(words, model='gpt'):
    """
    Args:
        words: iterable
        model: 'gpt' or 't5'
    Returns:
        Positive (summed vectors) of word embeddings of a given list of words from the specified model. Defaults to GPT-3.
    """
    if isinstance(words, str):
        print(f"You requested the positive of the string \"{words}\". Did you mean [\"{words}\"]?")

    out = 0
    for token in words:
        # convert token to string
        word = str(token)
        # do model check - least intensive operation to repeat
        if model.lower() == 'gpt':
            # look for token in cached GPT embeds
            if word in model_gpt:
                ex = model_gpt[word]  # ex for "extracted"
            # if not found, query API
            else:
                ex = gpt_embed(word)
                model_gpt[word] = ex
        elif model.lower() == 't5':
            if word in model_t5:
                ex = model_t5[word]
            else:
                ex = t5_embed(word)
                model_t5[word] = ex
        else:
            raise ValueError('Please provide either gpt, opt, or t5 as a model choice.')

        # construct positive
        if isinstance(out, int):
            out = np.array(ex).reshape(1, -1)
        else:
            out += np.array(ex).reshape(1, -1)
            
    return out if not isinstance(out, int) else np.array([])

In [13]:
def calculate_similarity(words, target, vec=False, model='gpt'):
    """
    Args:
        words: iterable or vector   -> Items to be combined into a positive and compared,
                                       or already constructed vector of matching dimensions.
        target: str     -> single term to calulate similarity to.
        vec: bool       -> true if 'words' is a vector
        model: str      -> 'gpt' or 't5'
    Returns:
        The cosine similarity of the words vector and target term in specified model.
    """
    # get phrase
    phrase = words
    if not vec:
        if isinstance(words, str):
            phrase = positive([words], model)
        else:
            phrase = positive(words, model)
    
    # get target
    target = positive([target], model)

    return cosine_similarity(phrase, target)[0][0]

In [14]:
def negative(start, det, add, end, model='gpt'):
    """
    Args:
        start: str  -> starting word
        det: str    -> word to subtract from start
        add: str    -> word to add to get new direction
        end: str    -> target word
        model: str  -> gpt or t5.
    Returns:
        The cosine similarity between <end> and <start> - <det> + <add>.
    """
    A = positive([start], model)
    B = positive([det], model)
    C = positive([add], model)
    D = positive([end], model)
    neg = (A - B) + C
    return cosine_similarity(neg, D)[0][0]


In [15]:
def mikolov(start, less, more, target):
    """
    Args:
        start: str  -> starting word
        det: str    -> word to subtract from start
        add: str    -> word to add to get new direction
        end: str    -> target word
    Returns:
        None. Wraps negative for each model we have to test.
    """
    print(*[start, less, more], sep=', ', end='')
    print(f" -> {target}")
    for model in ['gpt', 't5']:
        print(f"model: {model}")
        print(f"score: {negative(start, less, more, model)}")
        print(f"benchmark: {start} -> {target} = {calculate_similarity([start], target, model=model)}")
        print('-------------------------------------------------------')

## Tests
Main Categories identified: *opposite-gender*, *capitol-of*, *pluralization*, *adjective-scale*, *possesion*, and *tense*.

Please refer to https://arxiv.org/pdf/1509.01692.pdf and https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/rvecs.pdf for more info.

* opposite-gender

In [16]:
mikolov('king', 'man', 'woman', 'queen')

king, man, woman -> queen
model: gpt
score: 0.7703084983562917
benchmark: king -> queen = 0.8985080894387735
-------------------------------------------------------
model: t5
score: 0.7258212692616726
benchmark: king -> queen = 0.6996615597317424
-------------------------------------------------------


In [17]:
mikolov('chairman', 'man', 'woman', 'chairwoman')

chairman, man, woman -> chairwoman
model: gpt
score: 0.6888063543967273
benchmark: chairman -> chairwoman = 0.9315533026820068
-------------------------------------------------------
model: t5
score: 0.6575280485534069
benchmark: chairman -> chairwoman = 0.6923004904833964
-------------------------------------------------------


In [18]:
mikolov('brother', 'male', 'female', 'sister')

brother, male, female -> sister
model: gpt
score: 0.7901466172516423
benchmark: brother -> sister = 0.9186488343882371
-------------------------------------------------------
model: t5
score: 0.7433035811329919
benchmark: brother -> sister = 0.8771536292487245
-------------------------------------------------------


In [19]:
mikolov('stallion', 'male', 'female', 'mare')

stallion, male, female -> mare
model: gpt
score: 0.7406233952109762
benchmark: stallion -> mare = 0.8411060014518819
-------------------------------------------------------
model: t5
score: 0.6982178873116891
benchmark: stallion -> mare = 0.37250359848041403
-------------------------------------------------------


* capitol-of

In [20]:
mikolov('Madrid', 'Spain', 'France', 'Paris')

Madrid, Spain, France -> Paris
model: gpt
score: 0.7580520368039784
benchmark: Madrid -> Paris = 0.8597558370002809
-------------------------------------------------------
model: t5
score: 0.7321430761748646
benchmark: Madrid -> Paris = 0.68666672706604
-------------------------------------------------------


* pluralization

In [21]:
mikolov('cars', 'car', 'apple', 'apples')

cars, car, apple -> apples
model: gpt
score: 0.7595482455293469
benchmark: cars -> apples = 0.814181320219243
-------------------------------------------------------
model: t5
score: 0.7286188171448085
benchmark: cars -> apples = 0.5622069835662842
-------------------------------------------------------


In [22]:
mikolov('years', 'year', 'law', 'laws')

years, year, law -> laws
model: gpt
score: 0.8080466383177447
benchmark: years -> laws = 0.8577423710934432
-------------------------------------------------------
model: t5
score: 0.7627040460783938
benchmark: years -> laws = 0.542034387588501
-------------------------------------------------------


* adjective-scale

In [23]:
mikolov('hot', 'warm', 'cool', 'cold')

hot, warm, cool -> cold
model: gpt
score: 0.7600496905453323
benchmark: hot -> cold = 0.8196469902160869
-------------------------------------------------------
model: t5
score: 0.7095308319680024
benchmark: hot -> cold = 0.7367349160417564
-------------------------------------------------------


In [24]:
mikolov('best', 'good', 'bad', 'worst')

best, good, bad -> worst
model: gpt
score: 0.7348750679534736
benchmark: best -> worst = 0.8107603269753152
-------------------------------------------------------
model: t5
score: 0.7027772168983535
benchmark: best -> worst = 0.762467757632748
-------------------------------------------------------


* possession

In [25]:
mikolov('city\'s', 'city', 'bank', 'bank\'s')

city's, city, bank -> bank's
model: gpt
score: 0.818953208882454
benchmark: city's -> bank's = 0.8540676399020821
-------------------------------------------------------
model: t5
score: 0.772950553007265
benchmark: city's -> bank's = 0.7040677070617676
-------------------------------------------------------


In [26]:
mikolov('mine', 'me', 'you', 'yours')

mine, me, you -> yours
model: gpt
score: 0.8208659247052847
benchmark: mine -> yours = 0.9076210077153537
-------------------------------------------------------
model: t5
score: 0.7673237665263305
benchmark: mine -> yours = 0.6406882706185066
-------------------------------------------------------


* verb-source

In [27]:
mikolov('walking', 'legs', 'teeth', 'chewing')

walking, legs, teeth -> chewing
model: gpt
score: 0.7054636865837388
benchmark: walking -> chewing = 0.8131267858240695
-------------------------------------------------------
model: t5
score: 0.6845487541675309
benchmark: walking -> chewing = 0.45184335868969294
-------------------------------------------------------


In [28]:
mikolov('walk', 'legs', 'teeth', 'chew')

walk, legs, teeth -> chew
model: gpt
score: 0.7205202625033159
benchmark: walk -> chew = 0.8141899916189553
-------------------------------------------------------
model: t5
score: 0.6997145450187889
benchmark: walk -> chew = 0.6378753246466464
-------------------------------------------------------


In [29]:
mikolov('sailing', 'boat', 'car', 'driving')

sailing, boat, car -> driving
model: gpt
score: 0.7904261042444086
benchmark: sailing -> driving = 0.8523761563932656
-------------------------------------------------------
model: t5
score: 0.7329003237907269
benchmark: sailing -> driving = 0.5727941778241692
-------------------------------------------------------


In [30]:
mikolov('sail', 'boat', 'car', 'drive')

sail, boat, car -> drive
model: gpt
score: 0.7942699441735251
benchmark: sail -> drive = 0.8471531312715113
-------------------------------------------------------
model: t5
score: 0.7359975380845648
benchmark: sail -> drive = 0.5849099740684152
-------------------------------------------------------


# New Observations

This file uses flan-t5-xxl as it's embedding source. 

While the new vector may score higher in similarity on a case like 'years' - 'year' + 'law' = 'laws' than the benchmark of 'years' -> 'laws', each composite vector is scoring about 75% (give or take) similarity. This makes me feel like we are not truly getting closer but moving towards something decisively more average, but not the target. Why do we never see a score about 85, 90%?

This paper has been on my mind - https://doi.org/10.18653/v1%2F2021.emnlp-main.372. Per the abstract, 
>We find that a small number of rogue dimensions, often just 1-3, dominate these measures. Moreover, we find a striking mismatch between the dimensions that dominate similarity measures and those which are important to the behavior of the model. We show that simple postprocessing techniques such as standardization are able to correct for rogue dimensions and reveal underlying representational quality. We argue that accounting for rogue dimensions is essential for any similarity-based analysis of contextual language models.

which leads me to believe embeddings from t5 and opt will not be performant on this task (no matter the size) without post-processing.


I believe GPT-3 embeddings undergo some kind of post-processing like this paper suggests, which could explain our significant results with Curie but not with OPT or t5 equivalents.