# Finding definitions with **S**
If we want to define a term X, how can we do so using only information extracted from a word embedding? I propose, and will explore, a simple combination of the two common vector descriptors: angle and magnitude. To define X, I will assign every word y in our vocabulary a score, **S**, which, after some pruning, will help us pick the words that define X.
```
S(X, y) = cosine_similarity(X, y) * |y|
```
This rests on the assumption that words whose embeddings have a larger magnitude appear more frequently in the training data, and are therefore likely to be simpler (or at least more common). By scaling one quantity by the other, we fold both pieces of information into a single number, and can begin to construct a definition.
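
As a quick sanity check on the formula, here is a minimal sketch with toy 2-D vectors (the vectors and the helper name `big_s` are made up for illustration and are not part of the notebook's pipeline). It also shows that S reduces algebraically to the scalar projection of y onto X, since `cos(X, y) * |y| = (X · y) / |X|`.

```python
import numpy as np

def big_s(x, y):
    """Cosine similarity between x and y, scaled by the magnitude of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cos = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return cos * np.linalg.norm(y)

# Toy vectors, chosen only to make the arithmetic easy to follow.
x = np.array([1.0, 0.0])
near_short = np.array([2.0, 0.0])  # same direction as x (cos = 1.0), magnitude 2
far_long = np.array([3.0, 4.0])    # cos = 0.6 with x, magnitude 5

s_near = big_s(x, near_short)      # 1.0 * 2 = 2.0
s_far = big_s(x, far_long)         # 0.6 * 5 = 3.0

# S equals the scalar projection of y onto x.
projection = x.dot(far_long) / np.linalg.norm(x)
```

In this toy case the longer vector outscores the perfectly aligned one, which is the intended effect of the magnitude term, but also a hint that very large-magnitude words could dominate the ranking.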

How do we avoid orthographic influences? After generating S(X, y) for every y, I will prune out any candidate 'descriptors' that start with the same three-letter sequence as X. Hopefully this will stop "man" from being defined using "manner" or "manuscript", or something similar. Once the list is pruned, I will choose the two highest-scoring remaining terms and output them as our "definition". This will complete the statement "X is a(n) y1 y2" (e.g. "bachelor is an unmarried man").
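
The pruning step can be sketched in isolation as follows. This is a minimal illustration, assuming the scores are already available as `(term, score)` pairs; the helper name `prune_and_pick` and the scores below are invented for the example.

```python
def prune_and_pick(target, scored, n=2, prefix_len=3):
    """Return the n highest-scoring terms that do not share target's leading letters."""
    prefix = target[:prefix_len]
    # Drop every candidate that starts with the same prefix as the target.
    kept = [(term, s) for term, s in scored if term[:prefix_len] != prefix]
    # Highest score first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]

# Made-up scores, for illustration only.
scored = [
    ("manner", 4.0),
    ("manuscript", 3.9),
    ("male", 3.5),
    ("person", 3.2),
    ("tree", 0.4),
]

definition = prune_and_pick("man", scored)
terms = [term for term, _ in definition]
```

With these invented scores, "manner" and "manuscript" are pruned and the two survivors with the highest scores are kept. Because the target shares its own prefix, it is excluded from its own definition as a side effect.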

In [1]:
import os
import torch
import openai
import numpy as np

import sentencepiece    # necessary for proper t5 init.
from transformers import T5Tokenizer, T5EncoderModel, GPT2Tokenizer, OPTModel

from sklearn.metrics.pairwise import cosine_similarity

# api key set in conda env.
openai.api_key = os.getenv('OPENAI_API_KEY')

## Loading Models
You know the drill.

### Vocab

In [2]:
vocab = []
with open('expanded_vocab.txt', 'r') as f:
    for line in f:
        vocab.append(line.strip())

len(vocab)

5124

### OPT-1.3B

In [4]:
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('facebook/opt-1.3b', cache_dir='/scratch/mbarlow6/cache')
model_opt_raw = OPTModel.from_pretrained('facebook/opt-1.3b', cache_dir='/scratch/mbarlow6/cache')


Some weights of the model checkpoint at facebook/opt-1.3b were not used when initializing OPTModel: ['lm_head.weight']
- This IS expected if you are initializing OPTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing OPTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
opt_embeds = []
with open(u'./opt/1_3B.txt', 'r') as f:
    for line in f:
        opt_embeds.append([float(x) for x in line.strip().split()])
model_opt = dict(zip(vocab, opt_embeds))

In [6]:
def opt_embed(text, tokenizer=gpt2_tokenizer, model=model_opt_raw, debug=False):
    inputs = tokenizer(text, return_tensors='pt')
    if debug:
        print('Tokens Requested:')
        print(tokenizer.batch_decode(inputs.input_ids[0]))
    with torch.no_grad():
        outputs = model(**inputs)
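    # drop the batch dimension: (1, seq_len, hidden) -> (seq_len, hidden)
    embeddings = torch.squeeze(outputs.last_hidden_state, dim=0)
    # skip the first position (the start token the OPT tokenizer prepends), then mean-pool
    return np.array(torch.mean(embeddings[1:], dim=0))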

### T5

In [7]:
t5_tokenizer = T5Tokenizer.from_pretrained('t5-large', cache_dir='/scratch/mbarlow6/cache')
model_t5_raw = T5EncoderModel.from_pretrained('t5-large', cache_dir='/scratch/mbarlow6/cache')


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.



Some weights of the model checkpoint at t5-large were not used when initializing T5EncoderModel: ['decoder.block.4.layer.1.EncDecAttention.k.weight', 'decoder.block.23.layer.0.SelfAttention.k.weight', 'decoder.block.7.layer.1.layer_norm.weight', 'decoder.block.23.layer.2.DenseReluDense.wo.weight', 'decoder.block.5.layer.0.SelfAttention.k.weight', 'decoder.block.7.layer.0.SelfAttention.o.weight', 'decoder.block.4.layer.1.EncDecAttention.q.weight', 'decoder.block.11.layer.0.SelfAttention.o.weight', 'decoder.block.6.layer.1.EncDecAttention.o.weight', 'decoder.block.11.layer.2.DenseReluDense.wi.weight', 'decoder.block.16.layer.1.EncDecAttention.k.weight', 'decoder.block.6.layer.1.EncDecAttention.q.weight', 'decoder.block.4.layer.0.SelfAttention.q.weight', 'decoder.block.9.layer.1.layer_norm.weight', 'decoder.block.4.layer.1.layer_norm.weight', 'decoder.block.22.layer.1.EncDecAttention.o.weight', 'decoder.block.5.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.0.SelfAttention.rela

In [8]:
t5_embeds = []
with open('./t5/t5large.txt', 'r') as f:
    for line in f:
        t5_embeds.append([float(x) for x in line.strip().split()])
model_t5 = dict(zip(vocab, t5_embeds))

In [9]:
def t5_embed(text, tokenizer=t5_tokenizer, model=model_t5_raw, debug=False):
    inputs = tokenizer(text, return_tensors='pt')
    if debug:
        print('Tokens Requested:')
        print(tokenizer.batch_decode(inputs.input_ids[0]))
    with torch.no_grad():
        outputs = model(**inputs)
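    # drop the batch dimension: (1, seq_len, hidden) -> (seq_len, hidden)
    embeddings = torch.squeeze(outputs.last_hidden_state, dim=0)
    # mean-pool over every position, including the end-of-sequence token T5 appends
    return np.array(torch.mean(embeddings, dim=0))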

## Helpers
Positive for now.

In [10]:
def positive(words, model='opt'):
    """
    Args:
        words: iterable
        model: 'opt', or 't5'
    Returns:
        Positive (summed vectors) of the word embeddings of a given list of words from the specified model. Defaults to OPT.
    """
    if isinstance(words, str):
        print(f"You requested the positive of the string \"{words}\". Did you mean [\"{words}\"]?")

    out = 0
    for token in words:
        # convert token to string
        word = str(token)
        # do model check - least intensive operation to repeat
        if model.lower() == 'opt':
            if word in model_opt:
                ex = model_opt[word]
            else:
                # squeeze!
                ex = opt_embed(word)
                model_opt[word] = ex
        elif model.lower() == 't5':
            if word in model_t5:
                ex = model_t5[word]
            else:
                ex = t5_embed(word)
                model_t5[word] = ex
        else:
            raise ValueError('Please provide either opt or t5 as a model choice.')

        # construct positive
        if isinstance(out, int):
            out = np.array(ex).reshape(1, -1)
        else:
            out += np.array(ex).reshape(1, -1)
            
    return out if not isinstance(out, int) else np.array([])

In [33]:
def bigs_define(phrase, n=2, model='t5'):
    """
    Args:
        phrase: iterable or string  -> Word(s) to be defined using big S.
        n: int                      -> Number of terms in definition. Defaults to 2.
        model: 'opt' or 't5'        -> Model to use. Defaults to T5.
    Returns:
        Definition of phrase using n terms derived from chosen model's weights.
    """
    # normalize phrase to list
    if isinstance(phrase, str):
        phrase = [phrase]
    pos_phrase = positive(phrase, model=model)

    # initialize S
    S = []

    # score vocab
    for term in vocab:
        b = positive([term], model=model)
        
        mag = np.sqrt(b[0].dot(b[0]))
        s = cosine_similarity(pos_phrase, b)[0][0] * mag

        S.append((term, s))

    # sort S
    S.sort(key=lambda x: x[1])

    # get answer
    ans = []
    while len(ans) < n and len(S) > 0:
        # check the current top-scoring term, S[-1][0], for orthographic similarity
        flag = False
        for word in phrase:
            if word[:3] == S[-1][0][:3]:
                flag = True
                break
        
        if flag:
            S.pop(-1)
            continue

        ans.append(S.pop(-1))

    return ans

## Tests
Now we check if this is viable!

In [34]:
bigs_define(['puppy'])

[('kitten', 2.68356396922264), ('swim', 2.2774726306368316)]

In [35]:
bigs_define(['bachelor'])

[('husband', 3.6575986762274), ('son', 3.5372512505590303)]

In [36]:
bigs_define(['wife'])

[('husband', 3.8932849167995665), ('wives', 3.816325770533681)]

In [37]:
bigs_define(['husband'])

[('father', 4.049301448253738), ('son', 4.0092946583691)]

In [38]:
bigs_define(['king'])

[('atom', 3.165172609289131), ('wing', 3.1279966620710447)]

In [39]:
bigs_define(['duckling'])

[('bandage', 2.7501300118338428), ('applicant', 2.6543564796928423)]

In [40]:
bigs_define(['kitten'])

[('puppy', 2.7245124571632), ('nursery', 2.6223343976805884)]

In [41]:
bigs_define(['knowledge'])

[('understanding', 3.8409909710067476), ('conception', 3.7131432659347747)]

This might have been a terrible idea :)