# 'Mikolov Negatives'
If we take the vector of king, subtract the vector of man, and add the vector of woman, how close are we to the vector of queen?
This notebook will implement a function to allow smooth testing of this experiment.

In [26]:
import os
import torch
import openai
import numpy as np

import sentencepiece    # necessary for proper t5 init.
from transformers import T5Tokenizer, T5EncoderModel, GPT2Tokenizer, OPTModel

from sklearn.metrics.pairwise import cosine_similarity

# api key set in conda env.
openai.api_key = os.getenv('OPENAI_API_KEY')

## Load Models

### Vocab
Now expanded to the Oxford 5000, plus relevant test words.

In [27]:
vocab = []
with open('./expanded_vocab.txt', 'r') as f:
    for line in f:
        vocab.append(line.strip())

len(vocab)

5124

### GPT-3
Vectors are normalized. Produced by https://arxiv.org/abs/2201.10005. Note: "Our models achieved new state-of-the-art results in linear-probe classification, text search and code search. We find that our models **underperformed on sentence similarity tasks and observed unexpected training behavior with respect to these tasks**." (emphasis mine) 

In [28]:
# Loading saved embeddings from GPT-3 Curie
# Curie embeds: 4098 dims
ada_embeds = []
with open(u'./gpt/gpt_curie.txt', 'r') as f:
    for line in f:
        ada_embeds.append([float(x) for x in line.strip().split()])

model_gpt = dict(zip(vocab, ada_embeds))

In [29]:
# for getting new embeddings from OpenAI
def gpt_embed(text, engine='text-similarity-curie-001'):
    text = text.replace('\n', ' ')
    return openai.Embedding.create(input=[text], engine=engine)['data'][0]['embedding']

### OPT
OPT, like GPT-3, is a Decoder-only Transformer. BERT is encoder only, and T5 is encoder-decoder, making use of both. Likewise, getting embeddings from OPT is a *little* less complicated, as there is only one part of the architecture we look to for embeddings.

In [30]:
# WARNING: OPT-13B uses ~49GB of RAM when loaded into memory.
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('facebook/opt-13b', cache_dir='/scratch/mbarlow6/.cache')
model_opt_raw = OPTModel.from_pretrained('facebook/opt-13b', cache_dir='/scratch/mbarlow6/.cache')

Some weights of the model checkpoint at facebook/opt-13b were not used when initializing OPTModel: ['decoder.final_layer_norm.weight', 'decoder.final_layer_norm.bias']
- This IS expected if you are initializing OPTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing OPTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [31]:
# 5120D vectors here :)
opt_embeds = []
with open(u'./opt/13B.txt', 'r') as f:
    for line in f:
        opt_embeds.append([float(x) for x in line.strip().split()])
model_opt = dict(zip(vocab, opt_embeds))

In [32]:
def opt_embed(text, tokenizer=gpt2_tokenizer, model=model_opt_raw, debug=False):
    inputs = tokenizer(text, return_tensors='pt')
    if debug:
        print('Tokens Requested:')
        print(tokenizer.batch_decode(inputs.input_ids[0]))
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = torch.squeeze(outputs.last_hidden_state, dim=0)
    return np.array(torch.mean(embeddings[1:], dim=0))

### T5
Here is our encoder-decoder model. I will extract embeddings from the encoder only, as per results from https://arxiv.org/pdf/2108.08877.pdf. Note: "When mean pooling is applied to the T5’s encoder outputs, it greatly outperforms the average embeddings of BERT. Notably, even without fine-tuning, the average embeddings of the T5’s encoder-only outputs outperforms SimCSE-RoBERTa, which is fine-tuned on NLI datase."

Seeing as I amd doing no fine tuning, and mean pool OPT decoder outputs, I will do the same with T5's **encoder** outputs.

In [33]:
t5_tokenizer = T5Tokenizer.from_pretrained('t5-3b', cache_dir='/scratch/mbarlow6/.cache')
model_t5_raw = T5EncoderModel.from_pretrained('t5-3b', cache_dir='/scratch/mbarlow6/.cache')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-3b automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Some weights of the model checkpoint at t5-3b were not used when initializing T5EncoderModel: ['decoder.block.16.layer.1.EncDecAttention.q.weight', 'decoder.block.12.layer.1.EncDecAttention.v.weight', 'decoder.block.7.layer.0.SelfAttention.o.weight', 'decoder.block.16.layer.0.SelfAttention.o.weight', 'decoder.block.23.layer.2.DenseReluDense.wi.weight', 'decoder.block.14.layer.1.EncDecAttention.k.weight', 'decoder.block.22.layer.0.SelfAttention.k.weight', 'decoder.block.9.layer.1.EncDecAttention.o.weight', 'decoder.block.1.layer.0.SelfAttention.v.weight', 'decoder.block.6.layer.2.DenseReluDense

In [34]:
t5_embeds = []
with open('./t5/t53b.txt', 'r') as f:
    for line in f:
        t5_embeds.append([float(x) for x in line.strip().split()])
model_t5 = dict(zip(vocab, t5_embeds))

In [35]:
def t5_embed(text, tokenizer=t5_tokenizer, model=model_t5_raw, debug=False):
    inputs = tokenizer(text, return_tensors='pt')
    if debug:
        print('Tokens Requested:')
        print(tokenizer.batch_decode(inputs.input_ids[0]))
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = torch.squeeze(outputs.last_hidden_state, dim=0)
    return np.array(torch.mean(embeddings, dim=0))

## Helpers
Here I will define some functions for the purpose of the investigation.

In [36]:
def positive(words, model='gpt'):
    """
    Args:
        words: iterable
        model: 'gpt', 'opt', or 't5'
    Returns:
        Positive (summed vectors) of word embeddings of a given list of words from the specified model. Defaults to GPT-3.
    """
    if isinstance(words, str):
        print(f"You requested the positive of the string \"{words}\". Did you mean [\"{words}\"]?")

    out = 0
    for token in words:
        # convert token to string
        word = str(token)
        # do model check - least intensive operation to repeat
        if model.lower() == 'gpt':
            # look for token in cached GPT embeds
            if word in model_gpt:
                ex = model_gpt[word]  # ex for "extracted"
            # if not found, query API
            else:
                ex = gpt_embed(word)
                model_gpt[word] = ex
        elif model.lower() == 'opt':
            if word in model_opt:
                ex = model_opt[word]
            else:
                # squeeze!
                ex = opt_embed(word)
                model_opt[word] = ex
        elif model.lower() == 't5':
            if word in model_t5:
                ex = model_t5[word]
            else:
                ex = t5_embed(word)
                model_t5[word] = ex
        else:
            raise ValueError('Please provide either gpt, opt, or t5 as a model choice.')

        # construct positive
        if isinstance(out, int):
            out = np.array(ex).reshape(1, -1)
        else:
            out += np.array(ex).reshape(1, -1)
            
    return out if not isinstance(out, int) else np.array([])

In [37]:
def calculate_similarity(words, target, vec=False, model='gpt'):
    """
    Args:
        words: iterable or vector   -> Items to be combined into a positive and compared,
                                       or already constructed vector of matching dimensions.
        target: str     -> single term to calulate similarity to.
        vec: bool       -> true if 'words' is a vector
        model: str      -> 'gpt', 'opt', or 't5'
    Returns:
        The cosine similarity of the words vector and target term in specified model.
    """
    # get phrase
    phrase = words
    if not vec:
        if isinstance(words, str):
            phrase = positive([words], model)
        else:
            phrase = positive(words, model)
    
    # get target
    target = positive([target], model)

    return cosine_similarity(phrase, target)[0][0]

In [38]:
def negative(start, det, add, end, model='gpt'):
    """
    Args:
        start: str  -> starting word
        det: str    -> word to subtract from start
        add: str    -> word to add to get new direction
        end: str    -> target word
        model: str  -> gpt, opt, or t5.
    Returns:
        The cosine similarity between <end> and <start> - <det> + <add>.
    """
    A = positive([start], model)
    B = positive([det], model)
    C = positive([add], model)
    D = positive([end], model)
    neg = (A - B) + C
    return cosine_similarity(neg, D)[0][0]


In [39]:
def mikolov(start, less, more, target):
    """
    Args:
        start: str  -> starting word
        det: str    -> word to subtract from start
        add: str    -> word to add to get new direction
        end: str    -> target word
    Returns:
        None. Wraps negative for each model we have to test.
    """
    print(*[start, less, more], sep=', ', end='')
    print(f" -> {target}")
    for model in ['gpt', 'opt', 't5']:
        print(f"model: {model}")
        print(f"score: {negative(start, less, more, model)}")
        print(f"benchmark: {start} -> {target} = {calculate_similarity([start], target, model=model)}")
        print('-------------------------------------------------------')

## Tests
Main Categories identified: *opposite-gender*, *capitol-of*, *pluralization*, *adjective-scale*, *possesion*, and *tense*.

Please refer to https://arxiv.org/pdf/1509.01692.pdf and https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/rvecs.pdf for more info.

* opposite-gender

In [40]:
mikolov('king', 'man', 'woman', 'queen')

king, man, woman -> queen
model: gpt
score: 0.6729880798038093
benchmark: king -> queen = 0.8851292569413515
-------------------------------------------------------
model: opt
score: 0.6613467628491987
benchmark: king -> queen = 0.5037397711321339
-------------------------------------------------------
model: t5
score: 0.6648142296896553
benchmark: king -> queen = 0.3860530420267463
-------------------------------------------------------


In [41]:
mikolov('chairman', 'man', 'woman', 'chairwoman')

chairman, man, woman -> chairwoman
model: gpt
score: 0.6193385506374722
benchmark: chairman -> chairwoman = 0.928414866800918
-------------------------------------------------------
model: opt
score: 0.5979181296760168
benchmark: chairman -> chairwoman = 0.9589032532969219
-------------------------------------------------------
model: t5
score: 0.6224928570465704
benchmark: chairman -> chairwoman = 0.48436462087132637
-------------------------------------------------------


In [42]:
mikolov('brother', 'male', 'female', 'sister')

brother, male, female -> sister
model: gpt
score: 0.6799844491903719
benchmark: brother -> sister = 0.8785020289126475
-------------------------------------------------------
model: opt
score: 0.6690131338612086
benchmark: brother -> sister = 0.4470249819252777
-------------------------------------------------------
model: t5
score: 0.6853435881126732
benchmark: brother -> sister = 0.8213804516568233
-------------------------------------------------------


In [43]:
mikolov('stallion', 'male', 'female', 'mare')

stallion, male, female -> mare
model: gpt
score: 0.6556616643566257
benchmark: stallion -> mare = 0.7583042843432317
-------------------------------------------------------
model: opt
score: 0.6225629314446364
benchmark: stallion -> mare = 0.6087521518061494
-------------------------------------------------------
model: t5
score: 0.6508969838508619
benchmark: stallion -> mare = 0.15555480238970334
-------------------------------------------------------


* capitol-of

In [44]:
mikolov('Madrid', 'Spain', 'France', 'Paris')

Madrid, Spain, France -> Paris
model: gpt
score: 0.690129356628566
benchmark: Madrid -> Paris = 0.8690059206865344
-------------------------------------------------------
model: opt
score: 0.6507127559965387
benchmark: Madrid -> Paris = 0.4303607940673828
-------------------------------------------------------
model: t5
score: 0.6979207299067045
benchmark: Madrid -> Paris = 0.6243758797645569
-------------------------------------------------------


* pluralization

In [45]:
mikolov('cars', 'car', 'apple', 'apples')

cars, car, apple -> apples
model: gpt
score: 0.7396348199029129
benchmark: cars -> apples = 0.8229437328841653
-------------------------------------------------------
model: opt
score: 0.6840334182254659
benchmark: cars -> apples = 0.139387309551239
-------------------------------------------------------
model: t5
score: 0.712470370452123
benchmark: cars -> apples = 0.5858688354492188
-------------------------------------------------------


In [46]:
mikolov('years', 'year', 'law', 'laws')

years, year, law -> laws
model: gpt
score: 0.672309158607031
benchmark: years -> laws = 0.8118192521108692
-------------------------------------------------------
model: opt
score: 0.6892114445466884
benchmark: years -> laws = 0.9065991640090942
-------------------------------------------------------
model: t5
score: 0.6708920462923322
benchmark: years -> laws = 0.44145113229751587
-------------------------------------------------------


* adjective-scale

In [47]:
mikolov('hot', 'warm', 'cool', 'cold')

hot, warm, cool -> cold
model: gpt
score: 0.6799035471122096
benchmark: hot -> cold = 0.8094623865808173
-------------------------------------------------------
model: opt
score: 0.684984178459188
benchmark: hot -> cold = 0.927157890849394
-------------------------------------------------------
model: t5
score: 0.6959157535214271
benchmark: hot -> cold = 0.5191256564414523
-------------------------------------------------------


In [48]:
mikolov('best', 'good', 'bad', 'worst')

best, good, bad -> worst
model: gpt
score: 0.6840402511513165
benchmark: best -> worst = 0.8453496084889025
-------------------------------------------------------
model: opt
score: 0.7167266803741168
benchmark: best -> worst = 0.9662132245608877
-------------------------------------------------------
model: t5
score: 0.6998340678906982
benchmark: best -> worst = 0.5163130997210805
-------------------------------------------------------


* possession

In [49]:
mikolov('city\'s', 'city', 'bank', 'bank\'s')

city's, city, bank -> bank's
model: gpt
score: 0.7539958149021666
benchmark: city's -> bank's = 0.8545013438681033
-------------------------------------------------------
model: opt
score: 0.7310323207299744
benchmark: city's -> bank's = 0.7009552717208862
-------------------------------------------------------
model: t5
score: 0.7341099190526892
benchmark: city's -> bank's = 0.7618241310119629
-------------------------------------------------------


In [50]:
mikolov('mine', 'me', 'you', 'yours')

mine, me, you -> yours
model: gpt
score: 0.7058381934793133
benchmark: mine -> yours = 0.9051090576935131
-------------------------------------------------------
model: opt
score: 0.6899145928208112
benchmark: mine -> yours = 0.47416482359328005
-------------------------------------------------------
model: t5
score: 0.6911763131105917
benchmark: mine -> yours = 0.3430048608756724
-------------------------------------------------------


* verb-source

In [51]:
mikolov('walking', 'legs', 'teeth', 'chewing')

walking, legs, teeth -> chewing
model: gpt
score: 0.6499564519125616
benchmark: walking -> chewing = 0.7993996582106503
-------------------------------------------------------
model: opt
score: 0.6117839805130452
benchmark: walking -> chewing = 0.4678374320452787
-------------------------------------------------------
model: t5
score: 0.647341491590882
benchmark: walking -> chewing = 0.2289078643794683
-------------------------------------------------------


In [52]:
mikolov('walk', 'legs', 'teeth', 'chew')

walk, legs, teeth -> chew
model: gpt
score: 0.6553762415589696
benchmark: walk -> chew = 0.784419188617846
-------------------------------------------------------
model: opt
score: 0.6450453502361564
benchmark: walk -> chew = 0.47490602590411346
-------------------------------------------------------
model: t5
score: 0.6540647831609678
benchmark: walk -> chew = 0.3189265948477693
-------------------------------------------------------


In [53]:
mikolov('sailing', 'boat', 'car', 'driving')

sailing, boat, car -> driving
model: gpt
score: 0.6952303539774101
benchmark: sailing -> driving = 0.8182398689225385
-------------------------------------------------------
model: opt
score: 0.6743034696965616
benchmark: sailing -> driving = 0.40121005002385757
-------------------------------------------------------
model: t5
score: 0.6944079497787948
benchmark: sailing -> driving = 0.5557397781760884
-------------------------------------------------------


In [54]:
mikolov('sail', 'boat', 'car', 'drive')

sail, boat, car -> drive
model: gpt
score: 0.6947248628853393
benchmark: sail -> drive = 0.8023252530187783
-------------------------------------------------------
model: opt
score: 0.7073158600982759
benchmark: sail -> drive = 0.4122551928759718
-------------------------------------------------------
model: t5
score: 0.7067052875915694
benchmark: sail -> drive = 0.3872744165407898
-------------------------------------------------------
