This notebook builds matrix representations of transitive verbs from the vector embeddings of their arguments, that is, the subject and the object.

Preparation before running this notebook:
1. Extract a 
2. Run the notebook `encode_texts.ipynb` to precompute the text embeddings for efficiency. You can skip this step if you want to compute the embeddings on the fly.

In [None]:
import pandas as pd
import numpy as np
import torch

from pathlib import Path
from tqdm import tqdm

import clip

In [None]:
import pickle

# set the path to the cache file, which should be a pickle file
# of a dictionary where the keys are the strings and the values
# are embeddings
# if you have not generated the cache file, leave it as None
emb_cache_path = None

# load the clip model
# comment out the following 2 lines if you are confident that
# every string is in the cache file
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

model_name = "32" # 32 for ViT-B/32 and 14 for ViT-L/14

# load the cache file
if emb_cache_path is not None:
    with open(emb_cache_path, "rb") as f:
        text_emb_cache = pickle.load(f)
else:
    text_emb_cache = dict()

# make sure that the embeddings are float32 numpy arrays 
text_emb_cache = {k: v.astype(np.float32) for k, v in text_emb_cache.items()}

def get_text_emb(text):
    """
    Args:
        text (str): text to be embedded
    
    Returns:
        emb (np.array): text embedding of shape (d, ) where d is the dimension of the embedding
    """
    if text in text_emb_cache:
        return text_emb_cache[text]
    inputs = clip.tokenize(text).to(device)
    with torch.no_grad():
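        # cast to float32 to match the cached embeddings (CLIP may return float16 on GPU)
        text_features = model.encode_text(inputs).cpu().numpy().flatten().astype(np.float32)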
    text_emb_cache[text] = text_features
    return text_features

def normalize(emb):
    return emb / np.linalg.norm(emb)

### Relational matrix representation

In [None]:
def build_rel_matrices(subj_list, obj_list, norm_vec=False, norm_num=False):
    """
    Build the matrix representation of the verb given the subject-verb-object triples
    found in a corpus. Note that the verb itself is not used in this construction.
    
    The relational matrix is defined as the sum of the outer products of the embeddings
    of the subject and object of each triple:
    M = sum_{k=0}^{n-1} outer_product(emb(subj_list[k]), emb(obj_list[k])).
    The outer product takes two vectors of size d and returns a matrix of size d x d.

    Args:
        subj_list (list): list of subject nouns in the triples, which has the same length as obj_list
        obj_list (list): list of object nouns in the triples, which has the same length as subj_list
        norm_vec (bool): whether to normalize the vectors of the subjects and objects
        norm_num (bool): whether to normalize the matrix by the number of triples
    
    Returns:
        matrix (np.array): the relational matrix representation of the verb
        
    Example:
        >>> # verb is 'chase' and the triples are
        >>> # 'dog chase ball', 'cat chase mouse', 'police chase car'
        >>> subj_list = ['dog', 'cat', 'police']
        >>> obj_list = ['ball', 'mouse', 'car']
        >>> matrix = build_rel_matrices(subj_list, obj_list)
    """
    if len(subj_list) != len(obj_list):
        raise ValueError('subj_list and obj_list must have the same length')
    num_triples = len(subj_list)
   
    if norm_vec: 
        post_process = normalize
    else:
        post_process = lambda x: x

    subj_arr = np.array([post_process(get_text_emb(subj)) for subj in subj_list])
    obj_arr = np.array([post_process(get_text_emb(obj)) for obj in obj_list])
        
    matrix = subj_arr.T @ obj_arr
    if norm_num:
        matrix /= num_triples

    return matrix
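
The single matrix product `subj_arr.T @ obj_arr` used above is equivalent to summing one outer product per triple. The sketch below checks this on toy data: small random vectors stand in for CLIP embeddings (an assumption made purely for illustration), and `sum_of_outer_products` is a hypothetical helper that is not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4  # 3 triples with 4-dimensional toy embeddings (not real CLIP vectors)
subj_arr = rng.normal(size=(n, d)).astype(np.float32)
obj_arr = rng.normal(size=(n, d)).astype(np.float32)

def sum_of_outer_products(S, O):
    """Hypothetical helper: explicit sum of outer products, one per triple."""
    M = np.zeros((S.shape[1], O.shape[1]), dtype=S.dtype)
    for s, o in zip(S, O):
        M += np.outer(s, o)
    return M

explicit = sum_of_outer_products(subj_arr, obj_arr)
batched = subj_arr.T @ obj_arr  # same construction as in build_rel_matrices

print(batched.shape)
print(np.allclose(explicit, batched, atol=1e-5))
```

The batched form avoids a Python loop over triples, which matters when a verb has many triples and `d` is large.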

### Regression-based methods for building verb matrices
The idea is to find a verb matrix `V` such that, when multiplied with the vector of a noun, it gives the vector of the verb-noun pair.
There are two cases:

- subject-verb, e.g. cat eats
- verb-object, e.g. eats fish
    
For a given verb, we gather subject-verb-object triples from a corpus.
We would like the following to hold (approximately):
```
emb(s) @ V  == emb(sv)
V @ emb(o) == emb(vo)
```
    
where `V` is the verb matrix, `emb(s)` is the vector of the subject, `emb(o)` is the vector of the object, `emb(sv)` is the vector of the subject-verb pair, and `emb(vo)` is the vector of the verb-object pair.

Consider a vectorized form of the subject-verb case with `n` subject-verb-object triples and `d` the dimension of the vectors:
``` 
S @ V == SV
```
where S is a matrix of shape (n, d), V is a matrix of shape (d, d), and SV is a matrix of shape (n, d).
Here `S[k,:]` gives the vector of the k-th subject, and `SV[k,:]` gives the vector of the k-th subject-verb pair.

An exact solution for V is most likely not possible, but we can find the best approximation using the least squares method.
The idea is to minimize the following loss function:
```
L = ||S @ V - SV||^2
```
where `||.||` is the Frobenius norm, so `L` is the sum of the squared elements of the residual matrix `S @ V - SV`.

The solution is given by the Moore-Penrose pseudo-inverse of S:
```
V = S^+ @ SV
```
where `S^+` is the pseudo-inverse of S, which can be obtained in numpy with `np.linalg.pinv`.
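
As a sanity check of the pseudo-inverse solution, the sketch below builds a small synthetic problem and compares `pinv(S) @ SV` against `np.linalg.lstsq`. The random matrices, the noise level, and the helper `loss` are assumptions for illustration only; none of them come from the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5  # more triples than dimensions, so the solution is unique
S = rng.normal(size=(n, d))
V_true = rng.normal(size=(d, d))
SV = S @ V_true + 0.01 * rng.normal(size=(n, d))  # noisy targets

# Pseudo-inverse solution, as described in the text
V_pinv = np.linalg.pinv(S) @ SV

# Reference least-squares solver for comparison
V_lstsq, *_ = np.linalg.lstsq(S, SV, rcond=None)

def loss(V):
    """Squared Frobenius norm of the residual S @ V - SV."""
    return np.linalg.norm(S @ V - SV, ord="fro") ** 2

print(np.allclose(V_pinv, V_lstsq))
print(loss(V_pinv) <= loss(V_true))
```

When a verb has fewer triples than embedding dimensions (`n < d`), the system is underdetermined. In that case `np.linalg.pinv` still returns a solution, namely the one with the smallest norm among all matrices that fit the data equally well.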

We can do the same for the verb-object case, but the verb matrix has to be transposed to make the tensor contraction work:
```
O @ V^T == VO
```
The solution is given by an extra transpose: `V = (O^+ @ VO)^T`. 

The transpose is necessary because the object vector is supposed to be on the right side of the verb matrix. Since we are working with a batched version of the problem, `O` is of shape (n, d). For the `d` part to interact with the verb matrix, we need to put it on the right side of `V` and transpose `V` so that the interaction is correct. This transpose is only an artifact of the batched version of the problem and carries no special meaning.
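
The role of the transpose can be checked numerically. The sketch below uses random toy matrices (an assumption for illustration, not the notebook's embeddings) to show that applying `V` to each object vector one at a time gives the same rows as `O @ V.T`, and that the extra transpose recovers `V` when the targets are exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
O = rng.normal(size=(n, d))  # one object vector per row
V = rng.normal(size=(d, d))  # toy verb matrix

# Apply V to each object vector individually: V @ emb(o)
per_vector = np.stack([V @ o for o in O])

# Batched version: objects sit on the left, so V must be transposed
batched = O @ V.T
print(np.allclose(per_vector, batched))

# Solve O @ V^T == VO and transpose back to obtain V
VO = batched
V_recovered = (np.linalg.pinv(O) @ VO).T
print(np.allclose(V_recovered, V))
```

Exact recovery works here only because the toy targets are noise-free and `n >= d`; with real embeddings the result is the least-squares approximation.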

In [None]:
def solve_least_square(A, B):
    """
    Given two matrices A and B, solve for X such that l2_norm(A @ X - B) is minimized.
    """
    return np.linalg.pinv(A) @ B
    
def build_reg_matrices_subj(subj_list, subj_verb_list, norm_vec=False):
    """
    Given a list of subjects and their corresponding subject-verb pairs,
    returns a regression matrix V such that l2_norm(subj_arr @ V - subj_verb_arr) is minimized,
    where subj_arr and subj_verb_arr have shape (num_triples, emb_dim).

    Note:
    The solution can be found by taking the pseudo-inverse of subj_arr and multiplying it with subj_verb_arr.
    subj_arr @ V = subj_verb_arr => V = pinv(subj_arr) @ subj_verb_arr

    Args:
        subj_list (list): list of subject nouns
        subj_verb_list (list): list of subject-verb pairs
        norm_vec (bool): whether to normalize the embeddings of the subjects and subject-verb pairs
    
    Returns:
        matrix (np.array): the regression matrix representation of the verb
    """
    if norm_vec:
        post_process = normalize
    else:
        post_process = lambda x: x
        
    subj_arr = np.array([post_process(get_text_emb(subj)) for subj in subj_list])
    subj_verb_arr = np.array([post_process(get_text_emb(subj_verb)) for subj_verb in subj_verb_list])

    # Approximate solution to S @ V = SV
    # V = pinv(S) @ SV
    matrix = solve_least_square(subj_arr, subj_verb_arr) 
    return matrix
    
def build_reg_matrices_obj(verb_obj_list, obj_list, norm_vec=False):
    """
    Given a list of objects and their corresponding verb-object pairs,
    returns a regression matrix V such that l2_norm(obj_arr @ V^T - verb_obj_arr) is minimized.
    
    Note:
    The solution can be found by taking the pseudo-inverse of obj_arr, multiplying it with
    verb_obj_arr, and transposing the result.
    obj_arr @ V^T = verb_obj_arr => V = (pinv(obj_arr) @ verb_obj_arr)^T

    Args:
        verb_obj_list (list): list of verb-object pairs
        obj_list (list): list of object nouns
        norm_vec (bool): whether to normalize the embeddings of the verb-object pairs and objects
    
    Returns:
        matrix (np.array): the regression matrix representation of the verb
    """
    if norm_vec:
        post_process = normalize
    else:
        post_process = lambda x: x    
        
    verb_obj_arr = np.array([post_process(get_text_emb(verb_obj)) for verb_obj in verb_obj_list])
    obj_arr = np.array([post_process(get_text_emb(obj)) for obj in obj_list])
        
    # Approximate solution to O @ V^T = VO
    matrix = np.transpose(solve_least_square(obj_arr, verb_obj_arr))
    
    return matrix

In [None]:
cleaned_svo_path = Path('experiments/svo_probes/cleaned_svo')

rel_matrices = dict()
rel_matrices_norm = dict()
reg_subj_matrices = dict()
reg_obj_matrices = dict()

files = list(cleaned_svo_path.glob('*.csv'))
for f in tqdm(files):
    df = pd.read_csv(f)
    if df.empty:
        print(f'{f.stem} is empty')
        continue
    subj_list = df['subject_text']
    obj_list = df['object_text']
    verb_list = df['verb_text']
    
    subj_verb_list = subj_list + ' ' + verb_list
    verb_obj_list = verb_list + ' ' + obj_list
    
    rel_matrix = build_rel_matrices(subj_list, obj_list, norm_vec=True, norm_num=False)
    rel_matrix_norm = build_rel_matrices(subj_list, obj_list, norm_vec=True, norm_num=True)
    reg_subj_matrix = build_reg_matrices_subj(subj_list, subj_verb_list, norm_vec=True)
    reg_obj_matrix = build_reg_matrices_obj(verb_obj_list, obj_list, norm_vec=True)
    
    rel_matrices[f.stem] = rel_matrix
    rel_matrices_norm[f.stem] = rel_matrix_norm
    reg_subj_matrices[f.stem] = reg_subj_matrix
    reg_obj_matrices[f.stem] = reg_obj_matrix
    

In [None]:
# save the matrices
rel_matrices_path = cleaned_svo_path.parent / f'embeddings/rel_matrices_{model_name}.pkl'
rel_matrices_norm_path = cleaned_svo_path.parent / f'embeddings/rel_matrices_norm_{model_name}.pkl'
reg_subj_matrices_path = cleaned_svo_path.parent / f'embeddings/reg_subj_matrices_{model_name}.pkl'
reg_obj_matrices_path = cleaned_svo_path.parent / f'embeddings/reg_obj_matrices_{model_name}.pkl'

# all four files live in the same directory, so create it once
rel_matrices_path.parent.mkdir(parents=True, exist_ok=True)

# use context managers so that the files are flushed and closed
for matrices, path in [
    (rel_matrices, rel_matrices_path),
    (rel_matrices_norm, rel_matrices_norm_path),
    (reg_subj_matrices, reg_subj_matrices_path),
    (reg_obj_matrices, reg_obj_matrices_path),
]:
    with open(path, 'wb') as f:
        pickle.dump(matrices, f)
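
To read the matrices back in a later session, the pickle files can be opened the same way they were written. The following is a minimal round-trip sketch, assuming a temporary directory and a dummy identity matrix in place of the real outputs; the names `demo_matrices` and `demo_path` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Dummy stand-in for a dictionary such as rel_matrices (verb name -> matrix)
demo_matrices = {"chase": np.eye(3, dtype=np.float32)}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "embeddings" / "rel_matrices_demo.pkl"
    demo_path.parent.mkdir(parents=True, exist_ok=True)

    # Write the dictionary, closing the file when done
    with open(demo_path, "wb") as f:
        pickle.dump(demo_matrices, f)

    # Read it back
    with open(demo_path, "rb") as f:
        loaded = pickle.load(f)

print(sorted(loaded))
print(np.array_equal(loaded["chase"], demo_matrices["chase"]))
```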