# Setup

In [1]:
import os
import pandas as pd
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
from transformers import RobertaModel, RobertaTokenizer

import utils
import vsm

In [2]:
VSM_HOME = os.path.join('data', 'vsmdata')
DATA_HOME = os.path.join('data', 'wordrelatedness')

In [3]:
utils.fix_random_seeds()

In [4]:
dev_df = pd.read_csv(
    os.path.join(DATA_HOME, "cs224u-wordrelatedness-dev.csv"))

# BERT

When evaluating subword pooling methods, in this case with BERT, our first major decision was which approach to take: decontextualized or aggregated. We decided to focus on the decontextualized approach because it does not require a corpus and, as stated in lecture, produces results comparable to the aggregated approach.

Following this, we evaluated our model using various pooling functions, distance functions, and numbers of layers. The lecture and the papers discussed there concluded that fewer layers and a mean pooling function typically produce the best results. Nevertheless, we decided to test a variety of combinations of these factors.
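To make the pooling options concrete, here is a minimal sketch of how the vectors for a word's subword tokens can be collapsed into a single word vector. The array `subword_vecs` and the `POOLING` dictionary are made-up illustrations rather than part of `vsm`; the pooling functions in `vsm` operate on the model's hidden states and may differ in detail.

```python
import numpy as np

# Hypothetical hidden states for one word split into 3 subword tokens,
# each with a 4-dimensional representation (toy numbers, not BERT output).
subword_vecs = np.array([
    [1.0, -2.0, 0.5, 4.0],
    [3.0,  0.0, 0.5, 2.0],
    [2.0,  2.0, 0.5, 0.0],
])

# Each pooling function collapses the token axis (axis 0) into one vector.
POOLING = {
    "mean": lambda x: x.mean(axis=0),
    "max": lambda x: x.max(axis=0),
    "min": lambda x: x.min(axis=0),
    "last": lambda x: x[-1],
}

pooled = {name: func(subword_vecs) for name, func in POOLING.items()}
for name, vec in pooled.items():
    print(name, vec)
```

Whichever function is chosen, the result has the same dimensionality as a single token vector, so the downstream distance function does not need to change.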

# Decontextualized Approach Model

There are many different options for pre-trained weights. We chose to use 'bert-base-uncased' for our experiments.

In [6]:
bert_weights_name = 'bert-base-uncased'

bert_tokenizer = BertTokenizer.from_pretrained(bert_weights_name)
bert_model = BertModel.from_pretrained(bert_weights_name)

The following is our implementation of some of the distance functions we tried. This includes a k-nearest-neighbors regressor, as well as a function that returns the negated Jaccard score. We negate the value because `vsm.word_relatedness_evaluation`, which receives `distfunc`, returns `-d`, where `d` is the value computed by `distfunc`; it assumes that `distfunc` is a distance value of some kind rather than a relatedness/similarity value.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Module-level regressor shared by create_knn_model and knn_distance.
neigh = KNeighborsRegressor()

def knn_distance(u, v):
    # predict expects a 2D array: one row holding both word vectors.
    features = np.concatenate((u, v), axis=None).reshape(1, -1)
    return neigh.predict(features)[0]

def create_knn_model(vsm_df, dev_df, test_size=0.20):
    X = knn_feature_matrix(vsm_df, dev_df)
    y = dev_df['score']
    # Honor the test_size argument instead of a hard-coded split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    neigh.fit(X_train, y_train)
    print(X_train.shape)

def knn_feature_matrix(vsm_df, rel_df):
    # One row per word pair: the two word vectors concatenated.
    matrix = np.zeros((len(rel_df), len(vsm_df.columns)*2))
    for i, (word1, word2) in enumerate(zip(rel_df['word1'], rel_df['word2'])):
        matrix[i] = knn_represent(word1, word2, vsm_df)
    return matrix

def knn_represent(word1, word2, vsm_df):
    return np.concatenate((vsm_df.loc[word1], vsm_df.loc[word2]), axis=None)

def neg_jaccard(u, v):
    # Negated because vsm.word_relatedness_evaluation negates distfunc's output.
    return -vsm.jaccard(u, v)
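
The sign convention described above can be illustrated with a small stand-in for the evaluation step. This is only a sketch: `predict` is a hypothetical helper that mimics the negation, and the Euclidean and cosine functions are generic examples rather than the `vsm` implementations.

```python
import numpy as np

def euclidean_distance(u, v):
    """A true distance: smaller means more related."""
    return float(np.linalg.norm(u - v))

def cosine_similarity(u, v):
    """A similarity: larger means more related."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def neg_cosine_similarity(u, v):
    """Negated similarity, so that it behaves like a distance."""
    return -cosine_similarity(u, v)

def predict(u, v, distfunc):
    """Hypothetical stand-in for the evaluator's scoring step: returns -d."""
    return -distfunc(u, v)

a = np.array([1.0, 0.0])
close = np.array([0.9, 0.1])
far = np.array([0.0, 1.0])

# With a distance, the negation makes the closer pair score higher.
assert predict(a, close, euclidean_distance) > predict(a, far, euclidean_distance)

# Passing a raw similarity would invert the ranking ...
assert predict(a, close, cosine_similarity) < predict(a, far, cosine_similarity)

# ... so a similarity has to be negated before it is passed in.
assert predict(a, close, neg_cosine_similarity) > predict(a, far, neg_cosine_similarity)
```

The same reasoning applies to any function passed as `distfunc`: what matters is whether larger values mean "more related" or "less related" before the evaluator's negation is applied.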

The following is our implementation of the decontextualized approach to BERT, in which we were able to alter the pooling function, number of layers, and distance function.

In [13]:
def apply_bert(rel_df, layer, pool_func):
    # Build a VSM over every word that appears in the relatedness pairs.
    vocab = set(rel_df.word1.values) | set(rel_df.word2.values)
    pooled_df = vsm.create_subword_pooling_vsm(vocab, bert_tokenizer, bert_model, layer, pool_func)
    return pooled_df

def evaluate_pooled_bert(rel_df, layer, pool_func):
    pooled_df = apply_bert(rel_df, layer, pool_func)
    # The distance function is fixed here; change distfunc to try another one.
    return vsm.word_relatedness_evaluation(rel_df, pooled_df, distfunc=neg_jaccard)

In [14]:
pool_func = vsm.mean_pooling
for layer in range(1, 4):
    pred_df, rho = evaluate_pooled_bert(dev_df, layer, pool_func)
    print(layer, rho)

1 0.43622599933657347
2 0.3473368495338
3 0.3279460629681185


Record of the pooling function, distance function, layer, and resulting rho for each run:

| pooling func | distfunc | layer | rho |
| --- | --- | --- | --- |
| max | cosine | 1 | 0.2707496460162731 |
| max | cosine | 2 | 0.20702414483988724 |
| max | cosine | 3 | 0.17744729074571614 |
| mean | cosine | 1 | 0.2757425333620801 |
| mean | cosine | 2 | 0.217700456830832 |
| mean | cosine | 3 | 0.18617500500667575 |
| mean | euclidean | 1 | 0.28318140326817176 |
| mean | euclidean | 2 | 0.19286314385117495 | 
| mean | euclidean | 3 | 0.1681594482646394 |
| mean | jaccard | 1 | 0.43622599933657347 | 
| mean | jaccard | 2 | 0.3473368495338 |
| mean | jaccard | 3 | 0.3279460629681185 |
| min | cosine | 1 | 0.28747309266119614 |
| min | cosine | 2 | 0.2211592952130484 | 
| min | cosine | 3 | 0.19272403506986122 |
| min | euclidean | 1 | 0.23831264619930617 | 
| min | euclidean | 2 | 0.16104191516635505 |
| min | euclidean | 3 | 0.1326673152179109 |
| last | cosine | 1 | 0.26255946375943245 |
| last | cosine | 2 | 0.20210332109799414 | 
| last | cosine | 3 | 0.1720367373470963 |
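
A table like this can be filled in with a small sweep helper rather than by editing and re-running the cell by hand. The sketch below is an assumption about how such a loop might look, not a record of how the numbers above were obtained: `sweep` and `fake_evaluate` are hypothetical names, and `fake_evaluate` returns invented scores purely so the example runs on its own. In the notebook, the callable would wrap `vsm.create_subword_pooling_vsm` and `vsm.word_relatedness_evaluation`.

```python
from itertools import product

def sweep(evaluate, pool_names, dist_names, layers):
    """Call evaluate(pool, dist, layer) for every combination and
    return the records sorted by rho, best first."""
    records = []
    for pool, dist, layer in product(pool_names, dist_names, layers):
        rho = evaluate(pool, dist, layer)
        records.append(
            {"pooling": pool, "distfunc": dist, "layer": layer, "rho": rho}
        )
    return sorted(records, key=lambda r: r["rho"], reverse=True)

def fake_evaluate(pool, dist, layer):
    """Placeholder scorer with invented values; NOT real results."""
    bonus = 0.1 if pool == "mean" else 0.0
    return round(1.0 / layer + bonus, 3)

results = sweep(fake_evaluate, ["mean", "max"], ["cosine"], [1, 2, 3])
for row in results:
    print(row)
```

Passing names rather than function objects keeps the records easy to print; a dictionary from name to function can map them back to the actual pooling and distance functions.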