# 'Mikolov Negatives'
If we take the vector of king, subtract the vector of man, and add the vector of woman, how close are we to the vector of queen?
This notebook implements helper functions for running this experiment smoothly across several embedding models.
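As a rough sketch of the measurement itself, the analogy test can be written as a cosine similarity between the composed vector and the target word's vector. The `analogy_score` helper and the toy 2-D vectors below are made up for illustration. They are not the functions or embeddings used later in this notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy_score(model, a, b, c, target):
    """Cosine similarity between (a - b + c) and target.

    `model` is assumed to be a dict mapping word -> vector,
    e.g. king - man + woman compared against queen.
    """
    composed = np.asarray(model[a]) - np.asarray(model[b]) + np.asarray(model[c])
    return cosine(composed, np.asarray(model[target]))

# Toy vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "gender".
toy_model = {
    'king':  [1.0,  1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0,  1.0],
    'woman': [0.0, -1.0],
}

score_queen = analogy_score(toy_model, 'king', 'man', 'woman', 'queen')
score_man = analogy_score(toy_model, 'king', 'man', 'woman', 'man')
print(score_queen, score_man)
```

In this toy space the composed vector lands exactly on `queen`, so the first score is 1 up to floating-point error. With real embeddings the interesting question is how far below 1 it falls.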

In [None]:
import os
import torch
import openai
import numpy as np

import sentencepiece    # necessary for proper t5 init.
from transformers import T5Tokenizer, T5EncoderModel, GPT2Tokenizer, OPTModel

from sklearn.metrics.pairwise import cosine_similarity

# api key set in conda env.
openai.api_key = os.getenv('OPENAI_API_KEY')

## Load Models

### Vocab
Now expanded to the Oxford 5000, plus relevant test words.

In [1]:
vocab = []
with open('./expanded_vocab.txt', 'r') as f:
    for line in f:
        vocab.append(line.strip())

len(vocab)

5124

### GPT-3
Vectors are normalized. The embeddings come from the models described in https://arxiv.org/abs/2201.10005. Note: "Our models achieved new state-of-the-art results in linear-probe classification, text search and code search. We find that our models **underperformed on sentence similarity tasks and observed unexpected training behavior with respect to these tasks**." (emphasis mine)
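Because the stored vectors are unit length, the cosine similarity between any two of them is just their dot product. A quick sanity check of that identity, using made-up random vectors rather than real GPT-3 embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random stand-in vectors, scaled to unit length.
u = rng.normal(size=8)
v = rng.normal(size=8)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

# Full cosine formula vs. plain dot product.
cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
dot = np.dot(u, v)
print(np.isclose(cos, dot))
```

Note that a composed vector such as `king - man + woman` is generally not unit length, so it still has to be normalized (or passed through a full cosine computation) before comparison.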

In [None]:
# Loading saved embeddings from GPT-3 Ada
# Ada embeds: 1024 dims
# Babbage load included below as comment.
"""
# Babbage embeds: 2048 dims
bab_embeds = []
with open(u'/gpfs/fs1/home/mbarlow6/Desktop/Conceptual-Analysis/barlow/gpt/gpt_babbage.txt', 'r') as f:
    for line in f:
        bab_embeds.append([float(x) for x in line.strip().split()])

model_bab = dict(zip(vocab, bab_embeds))
"""
ada_embeds = []
with open(u'./gpt/gpt_ada.txt', 'r') as f:
    for line in f:
        ada_embeds.append([float(x) for x in line.strip().split()])

model_gpt = dict(zip(vocab, ada_embeds))

In [None]:
# For getting new embeddings from OpenAI.
# Note: openai.Embedding.create is the legacy (pre-1.0) openai SDK interface.
def gpt_embed(text, engine='text-similarity-ada-001'):
    text = text.replace('\n', ' ')
    return openai.Embedding.create(input=[text], engine=engine)['data'][0]['embedding']

### OPT
OPT, like GPT-3, is a decoder-only Transformer, whereas BERT is encoder-only and T5 is encoder-decoder, making use of both. As a result, getting embeddings from OPT is a *little* less complicated, as there is only one part of the architecture to look to for embeddings.

In [None]:
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('facebook/opt-1.3b', cache_dir='/scratch/mbarlow6/cache')
model_opt_raw = OPTModel.from_pretrained('facebook/opt-1.3b', cache_dir='/scratch/mbarlow6/cache')

In [None]:
opt_embeds = []
with open(u'./opt/1_3B.txt', 'r') as f:
    for line in f:
        opt_embeds.append([float(x) for x in line.strip().split()])
model_opt = dict(zip(vocab, opt_embeds))

In [None]:
def opt_embed(text, tokenizer=gpt2_tokenizer, model=model_opt_raw, debug=False):
    inputs = tokenizer(text, return_tensors='pt')
    if debug:
        print('Tokens Requested:')
        print(tokenizer.batch_decode(inputs.input_ids[0]))
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = torch.squeeze(outputs.last_hidden_state, dim=0)
    # compare with:
    # final_embed = np.array(embeddings[1])
    # if embeddings.shape[0] > 2:
    #     for next_emb in embeddings[2:]:
    #         final_embed += np.array(next_emb)
    #     final_embed /= embeddings.shape[0] - 1
    # return final_embed
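    # Mean-pool over the token axis (dim=0) to get one vector of size hidden_dim;
    # dim=1 would instead average across hidden units, giving one scalar per token.
    return np.array(torch.mean(embeddings, dim=0))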

### T5
Here is our encoder-decoder model. I will extract embeddings from the encoder only, as per results from https://arxiv.org/pdf/2108.08877.pdf. Note: "When mean pooling is applied to the T5’s encoder outputs, it greatly outperforms the average embeddings of BERT. Notably, even without fine-tuning, the average embeddings of the T5’s encoder-only outputs outperforms SimCSE-RoBERTa, which is fine-tuned on NLI dataset."

Seeing as I am doing no fine-tuning, and already mean-pool the OPT decoder outputs, I will do the same with T5's **encoder** outputs.
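For reference, mean pooling here means averaging the per-token hidden states into a single vector whose length is the hidden size. A small sketch of which axis that is, using a made-up array as a stand-in for a squeezed `last_hidden_state` of shape `(seq_len, hidden_dim)`:

```python
import numpy as np

# Stand-in for a squeezed last_hidden_state: 3 tokens, hidden size 4.
# The values are made up for illustration.
hidden_states = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
])

# Mean pooling: average over tokens (axis 0) -> one vector of shape (4,).
pooled = hidden_states.mean(axis=0)

# Averaging over axis 1 instead collapses the hidden dimension,
# giving one scalar per token, shape (3,), which is not a word embedding.
per_token = hidden_states.mean(axis=1)

print(pooled.shape, per_token.shape)
```

The same axis convention applies to `torch.mean(..., dim=0)` on a `(seq_len, hidden_dim)` tensor.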

In [None]:
t5_tokenizer = T5Tokenizer.from_pretrained('t5-large', cache_dir='/scratch/mbarlow6/cache')
model_t5_raw = T5EncoderModel.from_pretrained('t5-large', cache_dir='/scratch/mbarlow6/cache')

In [None]:
def t5_embed(text, tokenizer=t5_tokenizer, model=model_t5_raw, debug=False):
    inputs = tokenizer(text, return_tensors='pt')
    if debug:
        print('Tokens Requested:')
        print(tokenizer.batch_decode(inputs.input_ids[0]))
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = torch.squeeze(outputs.last_hidden_state, dim=0)
    # Mean-pool over the token axis (dim=0) to get one vector of size hidden_dim.
    return np.array(torch.mean(embeddings, dim=0))