# 'Mikolov Negatives'
If we take the vector of king, subtract the vector of man, and add the vector of woman, how close are we to the vector of queen?
This notebook will implement a function to allow smooth testing of this experiment.

In [1]:
import os
import torch
import openai
import numpy as np

import sentencepiece    # necessary for proper t5 init.
from transformers import T5Tokenizer, T5EncoderModel, GPT2Tokenizer, OPTModel

from sklearn.metrics.pairwise import cosine_similarity

# api key set in conda env.
openai.api_key = os.getenv('OPENAI_API_KEY')

## Load Models

### Vocab
Now expanded to the Oxford 5000, plus relevant test words.

In [2]:
vocab = []
with open('./expanded_vocab.txt', 'r') as f:
    for line in f:
        vocab.append(line.strip())

len(vocab)

5124

### GPT-3
Vectors are normalized. Produced by https://arxiv.org/abs/2201.10005. Note: "Our models achieved new state-of-the-art results in linear-probe classification, text search and code search. We find that our models **underperformed on sentence similarity tasks and observed unexpected training behavior with respect to these tasks**." (emphasis mine) 

In [3]:
# Loading saved embeddings from GPT-3 Ada
# Ada embeds: 1024 dims
# Babbage load included below as comment.
"""
# Babbage embeds: 2048 dims
bab_embeds = []
with open(u'/gpfs/fs1/home/mbarlow6/Desktop/Conceptual-Analysis/barlow/gpt/gpt_babbage.txt', 'r') as f:
    for line in f:
        bab_embeds.append([float(x) for x in line.strip().split()])

model_bab = dict(zip(vocab, bab_embeds))
"""
ada_embeds = []
with open(u'./gpt/gpt_ada.txt', 'r') as f:
    for line in f:
        ada_embeds.append([float(x) for x in line.strip().split()])

model_gpt = dict(zip(vocab, ada_embeds))

In [4]:
# for getting new embeddings from OpenAI
def gpt_embed(text, engine='text-similarity-ada-001'):
    text = text.replace('\n', ' ')
    return openai.Embedding.create(input=[text], engine=engine)['data'][0]['embedding']

### OPT
OPT, like GPT-3, is a Decoder-only Transformer. BERT is encoder only, and T5 is encoder-decoder, making use of both. Likewise, getting embeddings from OPT is a *little* less complicated, as there is only one part of the architecture we look to for embeddings.

In [5]:
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('facebook/opt-1.3b', cache_dir='/scratch/mbarlow6/.cache')
model_opt_raw = OPTModel.from_pretrained('facebook/opt-1.3b', cache_dir='/scratch/mbarlow6/.cache')

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/441 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/685 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/653 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.45G [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/opt-1.3b were not used when initializing OPTModel: ['lm_head.weight', 'model.decoder.final_layer_norm.weight', 'model.decoder.final_layer_norm.bias']
- This IS expected if you are initializing OPTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing OPTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
opt_embeds = []
with open(u'./opt/1_3B.txt', 'r') as f:
    for line in f:
        opt_embeds.append([float(x) for x in line.strip().split()])
model_opt = dict(zip(vocab, opt_embeds))

In [7]:
def opt_embed(text, tokenizer=gpt2_tokenizer, model=model_opt_raw, debug=False):
    inputs = tokenizer(text, return_tensors='pt')
    if debug:
        print('Tokens Requested:')
        print(tokenizer.batch_decode(inputs.input_ids[0]))
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = torch.squeeze(outputs.last_hidden_state, dim=0)
    return np.array(torch.mean(embeddings[1:], dim=0))

### T5
Here is our encoder-decoder model. I will extract embeddings from the encoder only, as per results from https://arxiv.org/pdf/2108.08877.pdf. Note: "When mean pooling is applied to the T5’s encoder outputs, it greatly outperforms the average embeddings of BERT. Notably, even without fine-tuning, the average embeddings of the T5’s encoder-only outputs outperforms SimCSE-RoBERTa, which is fine-tuned on NLI datase."

Seeing as I amd doing no fine tuning, and mean pool OPT decoder outputs, I will do the same with T5's **encoder** outputs.

In [8]:
t5_tokenizer = T5Tokenizer.from_pretrained('t5-large', cache_dir='/scratch/mbarlow6/.cache')
model_t5_raw = T5EncoderModel.from_pretrained('t5-large', cache_dir='/scratch/mbarlow6/.cache')

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Downloading:   0%|          | 0.00/2.75G [00:00<?, ?B/s]

Some weights of the model checkpoint at t5-large were not used when initializing T5EncoderModel: ['decoder.block.13.layer.0.SelfAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight', 'decoder.block.14.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.0.SelfAttention.q.weight', 'decoder.block.6.layer.1.EncDecAttention.o.weight', 'decoder.block.21.layer.1.EncDecAttention.k.weight', 'decoder.block.17.layer.1.EncDecAttention.v.weight', 'decoder.block.1.layer.2.DenseReluDense.wi.weight', 'decoder.block.16.layer.0.SelfAttention.q.weight', 'decoder.block.16.layer.1.layer_norm.weight', 'decoder.block.23.layer.0.SelfAttention.v.weight', 'decoder.block.13.layer.2.DenseReluDense.wo.weight', 'decoder.block.7.layer.0.SelfAttention.k.weight', 'decoder.block.7.layer.1.EncDecAttention.o.weight', 'decoder.block.23.layer.2.layer_norm.weight', 'decoder.block.11.layer.0.layer_norm.weight', 'decoder.block.9.layer.2.DenseReluDense.wo.weight', 'decoder.block.2

In [9]:
t5_embeds = []
with open('./t5/t5large.txt', 'r') as f:
    for line in f:
        t5_embeds.append([float(x) for x in line.strip().split()])
model_t5 = dict(zip(vocab, t5_embeds))

In [10]:
def t5_embed(text, tokenizer=t5_tokenizer, model=model_t5_raw, debug=False):
    inputs = tokenizer(text, return_tensors='pt')
    if debug:
        print('Tokens Requested:')
        print(tokenizer.batch_decode(inputs.input_ids[0]))
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = torch.squeeze(outputs.last_hidden_state, dim=0)
    return np.array(torch.mean(embeddings, dim=0))

## Helpers
Here I will define some functions for the purpose of the investigation.

In [11]:
def positive(words, model='gpt'):
    """
    Args:
        words: iterable
        model: 'gpt', 'opt', or 't5'
    Returns:
        Positive (summed vectors) of word embeddings of a given list of words from the specified model. Defaults to GPT-3.
    """
    if isinstance(words, str):
        print(f"You requested the positive of the string \"{words}\". Did you mean [\"{words}\"]?")

    out = 0
    for token in words:
        # convert token to string
        word = str(token)
        # do model check - least intensive operation to repeat
        if model.lower() == 'gpt':
            # look for token in cached GPT embeds
            if word in model_gpt:
                ex = model_gpt[word]  # ex for "extracted"
            # if not found, query API
            else:
                ex = gpt_embed(word)
                model_gpt[word] = ex
        elif model.lower() == 'opt':
            if word in model_opt:
                ex = model_opt[word]
            else:
                # squeeze!
                ex = opt_embed(word)
                model_opt[word] = ex
        elif model.lower() == 't5':
            if word in model_t5:
                ex = model_t5[word]
            else:
                ex = t5_embed(word)
                model_t5[word] = ex
        else:
            raise ValueError('Please provide either gpt, opt, or t5 as a model choice.')

        # construct positive
        if isinstance(out, int):
            out = np.array(ex).reshape(1, -1)
        else:
            out += np.array(ex).reshape(1, -1)
            
    return out if not isinstance(out, int) else np.array([])

In [12]:
def calculate_similarity(words, target, vec=False, model='gpt'):
    """
    Args:
        words: iterable or vector   -> Items to be combined into a positive and compared,
                                       or already constructed vector of matching dimensions.
        target: str     -> single term to calulate similarity to.
        vec: bool       -> true if 'words' is a vector
        model: str      -> 'gpt', 'opt', or 't5'
    Returns:
        The cosine similarity of the words vector and target term in specified model.
    """
    # get phrase
    phrase = words
    if not vec:
        if isinstance(words, str):
            phrase = positive([words], model)
        else:
            phrase = positive(words, model)
    
    # get target
    target = positive([target], model)

    return cosine_similarity(phrase, target)[0][0]

In [13]:
def negative(start, det, add, end, model='gpt'):
    """
    Args:
        start: str  -> starting word
        det: str    -> word to subtract from start
        add: str    -> word to add to get new direction
        end: str    -> target word
        model: str  -> gpt, opt, or t5.
    Returns:
        The cosine similarity between <end> and <start> - <det> + <add>.
    """
    A = positive([start], model)
    B = positive([det], model)
    C = positive([add], model)
    D = positive([end], model)
    neg = (A - B) + C
    return cosine_similarity(neg, D)[0][0]


In [14]:
def mikolov(start, less, more, target):
    """
    Args:
        start: str  -> starting word
        det: str    -> word to subtract from start
        add: str    -> word to add to get new direction
        end: str    -> target word
    Returns:
        None. Wraps negative for each model we have to test.
    """
    print(*[start, less, more], sep=', ', end='')
    print(f" -> {target}")
    for model in ['gpt', 'opt', 't5']:
        print(f"model: {model}")
        print(f"score: {negative(start, less, more, model)}")
        print(f"benchmark: {start} -> {target} = {calculate_similarity([start], target, model=model)}")
        print('-------------------------------------------------------')

## Tests
Main Categories identified: *opposite-gender*, *capitol-of*, *pluralization*, *adjective-scale*, *possesion*, and *tense*.

Please refer to https://arxiv.org/pdf/1509.01692.pdf and https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/rvecs.pdf for more info.

* opposite-gender

In [15]:
mikolov('king', 'man', 'woman', 'queen')

king, man, woman -> queen
model: gpt
score: 0.7378115578440716
benchmark: king -> queen = 0.9117405424813445
-------------------------------------------------------
model: opt
score: 0.7957717588520916
benchmark: king -> queen = 0.6519601615672634
-------------------------------------------------------
model: t5
score: 0.7511311209215747
benchmark: king -> queen = 0.5005843823222568
-------------------------------------------------------


In [16]:
mikolov('chairman', 'man', 'woman', 'chairwoman')

chairman, man, woman -> chairwoman
model: gpt
score: 0.6522159143876429
benchmark: chairman -> chairwoman = 0.9445721548901769
-------------------------------------------------------
model: opt
score: 0.7059410113916196
benchmark: chairman -> chairwoman = 0.9417800696634807
-------------------------------------------------------
model: t5
score: 0.6497875212009107
benchmark: chairman -> chairwoman = 0.6183062413573323
-------------------------------------------------------


In [17]:
mikolov('brother', 'male', 'female', 'sister')

brother, male, female -> sister
model: gpt
score: 0.7596375085364417
benchmark: brother -> sister = 0.9134060809522063
-------------------------------------------------------
model: opt
score: 0.8316455619512485
benchmark: brother -> sister = 0.557895862264566
-------------------------------------------------------
model: t5
score: 0.7632233058668813
benchmark: brother -> sister = 0.7528198508095869
-------------------------------------------------------


In [18]:
mikolov('stallion', 'male', 'female', 'mare')

stallion, male, female -> mare
model: gpt
score: 0.6962794443000064
benchmark: stallion -> mare = 0.8475974889192053
-------------------------------------------------------
model: opt
score: 0.7506368283946538
benchmark: stallion -> mare = 0.5576398415985386
-------------------------------------------------------
model: t5
score: 0.7054738074979054
benchmark: stallion -> mare = 0.31265633795208364
-------------------------------------------------------


* capitol-of

In [19]:
mikolov('Madrid', 'Spain', 'France', 'Paris')

Madrid, Spain, France -> Paris
model: gpt
score: 0.7393247629269616
benchmark: Madrid -> Paris = 0.9007781858064563
-------------------------------------------------------
model: opt
score: 0.8171097543563783
benchmark: Madrid -> Paris = 0.5303939580917358
-------------------------------------------------------
model: t5
score: 0.7522346421892213
benchmark: Madrid -> Paris = 0.7242726683616638
-------------------------------------------------------


* pluralization

In [20]:
mikolov('cars', 'car', 'apple', 'apples')

cars, car, apple -> apples
model: gpt
score: 0.8042591015676499
benchmark: cars -> apples = 0.8524972988346033
-------------------------------------------------------
model: opt
score: 0.8708407098948878
benchmark: cars -> apples = 0.5699345469474792
-------------------------------------------------------
model: t5
score: 0.7876204873860704
benchmark: cars -> apples = 0.6409047842025757
-------------------------------------------------------


In [21]:
mikolov('years', 'year', 'law', 'laws')

years, year, law -> laws
model: gpt
score: 0.7852276945566117
benchmark: years -> laws = 0.8483212083512726
-------------------------------------------------------
model: opt
score: 0.8493390486675807
benchmark: years -> laws = 0.5232529640197754
-------------------------------------------------------
model: t5
score: 0.7946566917630429
benchmark: years -> laws = 0.7154074907302856
-------------------------------------------------------


* adjective-scale

In [22]:
mikolov('hot', 'warm', 'cool', 'cold')

hot, warm, cool -> cold
model: gpt
score: 0.7819205476154103
benchmark: hot -> cold = 0.8576111115834552
-------------------------------------------------------
model: opt
score: 0.8466620250197867
benchmark: hot -> cold = 0.5924510794233862
-------------------------------------------------------
model: t5
score: 0.7903440844728845
benchmark: hot -> cold = 0.6201469350799498
-------------------------------------------------------


In [23]:
mikolov('best', 'good', 'bad', 'worst')

best, good, bad -> worst
model: gpt
score: 0.7502848923165245
benchmark: best -> worst = 0.8677750042777197
-------------------------------------------------------
model: opt
score: 0.8133568359441687
benchmark: best -> worst = 0.8731156467209791
-------------------------------------------------------
model: t5
score: 0.7641626398489444
benchmark: best -> worst = 0.46554383381387376
-------------------------------------------------------


* possession

In [24]:
mikolov('city\'s', 'city', 'bank', 'bank\'s')

city's, city, bank -> bank's
model: gpt
score: 0.8100800352663959
benchmark: city's -> bank's = 0.8612663210375753
-------------------------------------------------------
model: opt
score: 0.8631198519596566
benchmark: city's -> bank's = 0.6846699714660645
-------------------------------------------------------
model: t5
score: 0.8080202576243289
benchmark: city's -> bank's = 0.7735621929168701
-------------------------------------------------------


In [25]:
mikolov('mine', 'me', 'you', 'yours')

mine, me, you -> yours
model: gpt
score: 0.8114813212545702
benchmark: mine -> yours = 0.9033465234916255
-------------------------------------------------------
model: opt
score: 0.880973468945707
benchmark: mine -> yours = 0.5635010564792599
-------------------------------------------------------
model: t5
score: 0.8036020348017502
benchmark: mine -> yours = 0.46227390929504036
-------------------------------------------------------


* verb-source

In [26]:
mikolov('walking', 'legs', 'teeth', 'chewing')

walking, legs, teeth -> chewing
model: gpt
score: 0.7040879899569987
benchmark: walking -> chewing = 0.8454043336269507
-------------------------------------------------------
model: opt
score: 0.7603399214774956
benchmark: walking -> chewing = 0.3978073897494764
-------------------------------------------------------
model: t5
score: 0.7020406968955835
benchmark: walking -> chewing = 0.4715649425101859
-------------------------------------------------------


In [27]:
mikolov('walk', 'legs', 'teeth', 'chew')

walk, legs, teeth -> chew
model: gpt
score: 0.7232827264293642
benchmark: walk -> chew = 0.845130940848593
-------------------------------------------------------
model: opt
score: 0.7858758351719928
benchmark: walk -> chew = 0.403867014311146
-------------------------------------------------------
model: t5
score: 0.7280116643353497
benchmark: walk -> chew = 0.5056241362847337
-------------------------------------------------------


In [28]:
mikolov('sailing', 'boat', 'car', 'driving')

sailing, boat, car -> driving
model: gpt
score: 0.7649509877119995
benchmark: sailing -> driving = 0.8619133991236356
-------------------------------------------------------
model: opt
score: 0.8230902797244916
benchmark: sailing -> driving = 0.35207847641778717
-------------------------------------------------------
model: t5
score: 0.7930588916368582
benchmark: sailing -> driving = 0.6732577969081711
-------------------------------------------------------


In [29]:
mikolov('sail', 'boat', 'car', 'drive')

sail, boat, car -> drive
model: gpt
score: 0.7697334127739649
benchmark: sail -> drive = 0.843451216162298
-------------------------------------------------------
model: opt
score: 0.8280892167138598
benchmark: sail -> drive = 0.30502673815257786
-------------------------------------------------------
model: t5
score: 0.8083683579829373
benchmark: sail -> drive = 0.6057842568936658
-------------------------------------------------------
