# Experiments on the data of Grand et al.

In [1]:
import os
from scipy import stats
import numpy as np 
import pandas as pd
import zipfile
import math
import sklearn
import torch
import torch.optim as optim
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
import matplotlib.pyplot as plt

glove_path = "glove/glove.42B.300d.zip"
glove_file = "glove.42B.300d.txt"

feature_dim = 300

word_vectors = { }

with zipfile.ZipFile(glove_path) as azip:
    with azip.open(glove_file) as f:
        for line in f:
            values = line.split()
            word = values[0].decode()
            vector = np.array(values[1:], dtype=np.float32)
            word_vectors[word] = vector

grandratings_dir = "Grand_etal_csv/"
grandfeatures_path = "/Users/kee252/Data/grand_directions_in_space/features.xlsx"

grandfeatures_df = pd.read_excel(grandfeatures_path)

# reading in Grand data
def read_grand_data(filename, grandratings_dir, grandfeatures_df):
    # extract category and feature
    grandcategory, grandfeature = filename[:-4].split("_")
        
    # read human ratings, make gold column
    df = pd.read_csv(grandratings_dir + filename)
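    # every column after the first holds one speaker's ratings
    nspeakers = len(df.columns) - 1
    # average over all speaker columns instead of a hard-coded column range
    df["Average"] = df.iloc[:, 1:].sum(axis=1) / nspeakers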
    # z-scores of average ratings
    df["Gold"] = (df["Average"] - df["Average"].mean()) / df["Average"].std()
        
    # obtain seed words from excel file
    relevant_row = grandfeatures_df[grandfeatures_df.Dimension == grandfeature]
    seedwords = relevant_row.iloc[:, 1:].values.flatten().tolist()
    pos_seedwords = seedwords[:3]
    neg_seedwords = seedwords[3:]
    
    return (grandcategory, grandfeature, pos_seedwords, neg_seedwords, df)



## Reproducing their results

We reproduce their results from the Nature paper almost perfectly, on both Pearson's r correlation and pairwise order evaluation OC_P. X percent of category/feature pairs show a significant correlation, and average OC_P is Y.
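
For orientation, pairwise order consistency can be read as the fraction of item pairs whose relative order under the model matches their order in the gold ratings. The sketch below illustrates that reading only; it is not the code in `eval_dim`, and the function name `pairwise_order_consistency_sketch` as well as the decision to skip pairs that are tied in the gold ratings are assumptions.

```python
from itertools import combinations


def pairwise_order_consistency_sketch(gold, pred):
    """Fraction of item pairs that gold and pred order the same way.

    Assumptions (not taken from eval_dim): pairs tied in the gold ratings
    are skipped, and a tie in the predictions counts as a miss.
    """
    gold, pred = list(gold), list(pred)
    agree = 0
    total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue  # no gold ordering to compare against
        total += 1
        # same sign of the two differences means the same ordering
        if (gold[i] - gold[j]) * (pred[i] - pred[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")
```

Under this reading, predictions unrelated to the gold ratings would land near 0.5, so values should be judged against 0.5 rather than against 0.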

In [2]:
# script for running on data
def each_grandcondition(grandratings_dir, grandfeatures_df):
    for filename in os.listdir(grandratings_dir): 
        if not filename.endswith("csv"):
            continue

        grandcategory, grandfeature, pos_seedwords, neg_seedwords, df = read_grand_data(filename, 
                                                                    grandratings_dir, 
                                                                    grandfeatures_df)

        # storage for word vectors and gold values for this dataset
        data_vectors = []

        # collect word vectors and gold ratings
        for row in df.itertuples():
            # row.Row is the word. look it up in word_vectors
            data_vectors.append( word_vectors[ row.Row ])
            
        yield (grandcategory, grandfeature, pos_seedwords, neg_seedwords, df, data_vectors)


In [3]:
import compute_dim
import eval_dim
from scipy import stats

results = [ ]

for grandcategory, grandfeature, pos_seedwords, neg_seedwords, df, data_vectors in each_grandcondition(grandratings_dir, grandfeatures_df):
    
    dimension = compute_dim.dimension_seedbased(pos_seedwords, neg_seedwords, word_vectors)
    
    df["Pred"]  = compute_dim.predict_scalarproj(data_vectors, dimension)
    
    ocp = eval_dim.pairwise_order_consistency(df["Gold"], df["Pred"])
    result_obj = stats.pearsonr(df["Gold"], df["Pred"])
    
    results.append({"category": grandcategory,
                    "feature" : grandfeature,
                    "ocp" : ocp,
                    "pearsonr" : result_obj.statistic,
                    "pvalue" : result_obj.pvalue } )
    

In [4]:
# arranging the data in the same order as in the Grand et al paper so we can compare numbers

for r in sorted(results, key = lambda r:(r["feature"], r["category"])):
    starred = "*" if r["pearsonr"] > 0 and r["pvalue"] < 0.05 else ""
    print(r["category"], r["feature"], 
          "r", round(r["pearsonr"], 3), "p=", round(r["pvalue"],3), 
          "ocp", round(r["ocp"], 3), " ", starred)

clothing age r 0.561 p= 0.0 ocp 0.709   *
names age r 0.616 p= 0.0 ocp 0.723   *
professions age r 0.238 p= 0.1 ocp 0.568   
cities arousal r 0.001 p= 0.996 ocp 0.522   
clothing arousal r 0.185 p= 0.199 ocp 0.562   
professions arousal r 0.523 p= 0.0 ocp 0.658   *
sports arousal r 0.279 p= 0.05 ocp 0.597   *
cities cost r -0.136 p= 0.346 ocp 0.47   
clothing cost r -0.087 p= 0.549 ocp 0.471   
states cost r -0.006 p= 0.968 ocp 0.54   
animals danger r 0.599 p= 0.0 ocp 0.695   *
cities danger r 0.715 p= 0.0 ocp 0.774   *
myth danger r 0.723 p= 0.0 ocp 0.779   *
professions danger r 0.446 p= 0.001 ocp 0.641   *
sports danger r 0.379 p= 0.007 ocp 0.633   *
weather danger r 0.79 p= 0.0 ocp 0.797   *
animals gender r 0.7 p= 0.0 ocp 0.731   *
clothing gender r 0.818 p= 0.0 ocp 0.811   *
myth gender r 0.817 p= 0.0 ocp 0.744   *
names gender r 0.94 p= 0.0 ocp 0.873   *
professions gender r 0.916 p= 0.0 ocp 0.839   *
sports gender r 0.854 p= 0.0 ocp 0.817   *
animals intelligence r 0.08 p= 0.6

## Fitted dimensions

Schockaert and colleagues have a series of papers in which they consider interpretable dimensions in space from a knowledge base point of view. Their methods are completely different from the seed-based approach popular in NLP. We adapt an idea of Jameel and Schockaert to fit a dimension in space to best match human ratings.
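
As a rough illustration of what fitting a dimension to ratings involves, the sketch below solves an ordinary least-squares problem with NumPy and splits the solution into a unit-length direction, a weight, and a bias. It is a stand-in under stated assumptions, not the optimisation inside `compute_dim.dimension_fitted_fromratings`; the helper names `fit_dimension_lstsq` and `predict_from_line` and the toy data are hypothetical.

```python
import numpy as np


def fit_dimension_lstsq(vectors, ratings):
    """Least-squares fit of ratings ~ weight * (vector . dimension) + bias."""
    X = np.asarray(vectors, dtype=np.float64)
    y = np.asarray(ratings, dtype=np.float64)
    # append a column of ones so the intercept is fitted jointly
    design = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    w, bias = coef[:-1], coef[-1]
    weight = np.linalg.norm(w)
    dimension = w / weight if weight > 0 else w  # unit-length direction
    return dimension, weight, bias


def predict_from_line(vectors, dimension, weight, bias):
    """Map each vector to a rating via its projection on the dimension."""
    return weight * (np.asarray(vectors, dtype=np.float64) @ dimension) + bias


# toy data (made up): ratings that lie exactly on a line in a 5-d space
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(200, 5))
toy_ratings = toy_vectors @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.7

dimension, weight, bias = fit_dimension_lstsq(toy_vectors, toy_ratings)
toy_pred = predict_from_line(toy_vectors, dimension, weight, bias)
```

The contrast with the seed-based approach is that the ratings themselves determine the direction here; no seed words are involved.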

For every category/feature pair from Grand et al., the fitted dimension correlates significantly with the human ratings, whereas the seed-based dimension does so for only about two thirds of the pairs:

In [5]:
results = [ ]

for grandcategory, grandfeature, pos_seedwords, neg_seedwords, df, data_vectors in each_grandcondition(grandratings_dir, grandfeatures_df):
    
    dimension = compute_dim.dimension_seedbased(pos_seedwords, neg_seedwords, word_vectors)
    
    # seed-based dimension
    df["SPred"]  = compute_dim.predict_scalarproj(data_vectors, dimension)
    sresult_obj = stats.pearsonr(df["Gold"], df["SPred"])
    
    # fitted dimension
    dimension, weight, bias = compute_dim.dimension_fitted_fromratings(data_vectors, df["Gold"], feature_dim)
    df["FPred"] = compute_dim.predict_coord_fromline(data_vectors, dimension, weight, bias)
    fresult_obj = stats.pearsonr(df["Gold"], df["FPred"])
    
    results.append({"category": grandcategory,
                    "feature" : grandfeature,
                    "s_pearsonr" : sresult_obj.statistic,
                    "s_pvalue" : sresult_obj.pvalue,
                    "f_pearsonr" : fresult_obj.statistic,
                    "f_pvalue" : fresult_obj.pvalue } )

In [6]:
print("Percentage of conditions with significant correlations:")
print("Seed-based:", round(len([ r for r in results if r["s_pearsonr"] > 0 and r["s_pvalue"] < 0.05]) / len(results), 3))
print("Fitted:", round(len([ r for r in results if r["f_pearsonr"] > 0 and r["f_pvalue"] < 0.05]) / len(results), 3))


Percentage of conditions with significant correlations:
Seed-based: 0.679
Fitted: 1.0


### Underdetermined dimensions

However, the embeddings give us too much leeway to fit dimensions in space. Even when we scramble the ratings, the model mostly still manages to fit a dimension perfectly. 
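
One way to see why is to count degrees of freedom: if a condition has fewer rated words than the embedding has dimensions (300 here), a linear fit has more free parameters than constraints and can reproduce any assignment of ratings. The sketch below shows this with random vectors and a plain least-squares fit; the sizes and data are made up, and the fit stands in for, rather than reproduces, the fitting procedure in `compute_dim`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_dims = 50, 300  # hypothetical sizes: fewer words than dimensions

vectors = rng.normal(size=(n_words, n_dims))
ratings = rng.normal(size=n_words)
shuffled = rng.permutation(ratings)  # ratings no longer belong to their words

# least-squares fit of the shuffled ratings, with an intercept column
design = np.hstack([vectors, np.ones((n_words, 1))])
coef, *_ = np.linalg.lstsq(design, shuffled, rcond=None)
fitted = design @ coef

max_abs_error = float(np.max(np.abs(fitted - shuffled)))
r_shuffled = float(np.corrcoef(fitted, shuffled)[0, 1])
print("largest residual:", max_abs_error, "correlation:", r_shuffled)
```

The residual is at the level of floating-point noise even though the ratings carry no information about the vectors, so a high training correlation on its own says nothing about the quality of a fitted dimension.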

In [14]:
import random

results = [ ]

for grandcategory, grandfeature, pos_seedwords, neg_seedwords, df, data_vectors in each_grandcondition(grandratings_dir, grandfeatures_df):
    
    data_gold = [row.Gold for row in df.itertuples()]
    random.shuffle(data_gold)
    
    # fitted dimension
    dimension, weight, bias = compute_dim.dimension_fitted_fromratings(data_vectors, data_gold, feature_dim)
    df["Pred"] = compute_dim.predict_coord_fromline(data_vectors, dimension, weight, bias)
    result_obj = stats.pearsonr(data_gold, df["Pred"])
    
    results.append({"category": grandcategory,
                    "feature" : grandfeature,
                    "pearsonr" : result_obj.statistic,
                    "pvalue" : result_obj.pvalue } )

In [15]:
print("Scrambled ratings: percentage of dimensions that could be fitted successfully")
print(round(len([ r for r in results if r["pearsonr"] > 0 and r["pvalue"] < 0.05]) / len(results), 3))

Scrambled ratings: percentage of dimensions that could be fitted successfully
1.0


That also means that the model overfits to the given data, and doesn't generalize well to new datapoints, as we will show below.

To mitigate this problem, we combine the fitted model with seed property words, and we will be able to show that this leads to improved dimensions in space.

# Variants of the fitted model

We consider three variants of the fitted model: one that treats the seeds as words, one that adds the match to the seed-based dimension as part of the loss, and one that does both.
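
To make the second idea concrete, one possible form of such a loss adds a penalty for deviating from the seed-based dimension to the usual rating error. Everything in the sketch below is an assumption for illustration: the function name `combined_loss_sketch`, the cosine-based penalty, and the way `alpha` scales it are not taken from `compute_dim`.

```python
import numpy as np


def combined_loss_sketch(direction, weight, bias, vectors, ratings,
                         seed_dimension, alpha):
    """Rating error plus alpha times a penalty for leaving the seed dimension.

    Hypothetical loss: mean squared error on the ratings, plus
    alpha * (1 - cosine(direction, seed_dimension)).
    """
    X = np.asarray(vectors, dtype=np.float64)
    y = np.asarray(ratings, dtype=np.float64)
    d = np.asarray(direction, dtype=np.float64)
    s = np.asarray(seed_dimension, dtype=np.float64)

    pred = weight * (X @ d) + bias
    rating_loss = np.mean((pred - y) ** 2)

    cosine = (d @ s) / (np.linalg.norm(d) * np.linalg.norm(s))
    seed_loss = 1.0 - cosine  # 0 when the two directions coincide

    return float(rating_loss + alpha * seed_loss)
```

With `alpha = 0` this reduces to the purely fitted model; larger values pull the fitted direction toward the seed-based one, which limits the leeway that allows a fit to scrambled ratings.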

# Evaluating on unseen data

We introduce a train/test split, or rather crossvalidation, to test how well the different models do on unseen data. But once we do that, we can no longer use Pearson's r: with few datapoints in the dataset, the significance computation becomes unreliable.

Instead, we will focus on (a variant of) OC_P, and we add mean squared error (MSE) to the picture. OC_P is highly correlated with Pearson's r; MSE less so.

The variant of OC_P that we use doesn't just compare pairwise orderings of the datapoints in the test set, but also pairwise orderings of test datapoints compared to training datapoints: Do the test datapoints get inserted at the right point in the overall ordering?
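
A minimal sketch of this variant, assuming that training items enter the comparison with their model predictions and that pairs tied in the gold ratings are skipped; the function name `extended_order_consistency_sketch` is hypothetical and the actual implementation may differ:

```python
def extended_order_consistency_sketch(train_gold, train_pred,
                                      test_gold, test_pred):
    """Order consistency over test/test and test/train pairs.

    Train/train pairs are left out, so the score reflects only how well
    the unseen items are placed in the overall ordering.
    """
    train_gold, train_pred = list(train_gold), list(train_pred)
    test_gold, test_pred = list(test_gold), list(test_pred)

    pairs = []
    for i in range(len(test_gold)):
        # test item against every later test item
        for j in range(i + 1, len(test_gold)):
            pairs.append((test_gold[i], test_pred[i],
                          test_gold[j], test_pred[j]))
        # test item against every training item
        for g, p in zip(train_gold, train_pred):
            pairs.append((test_gold[i], test_pred[i], g, p))

    agree = 0
    total = 0
    for g1, p1, g2, p2 in pairs:
        if g1 == g2:
            continue  # no gold ordering for this pair
        total += 1
        if (g1 - g2) * (p1 - p2) > 0:
            agree += 1
    return agree / total if total else float("nan")
```

Including the test/train pairs gives many more comparisons per fold than the test set alone, which helps when each test set is small.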

## Hyperparameter optimization

We make a development set, and use it to set the hyperparameters: offset and jitter for the seeds-as-words, alpha and averaging for seeds-as-dimensions, all of the above for the joint model, alpha for the seed-dimension-attention model. 

See other notebook. We use:

* Fitted model with seed words: offset of 1.3, with jitter (OC_P of 0.61, MSE = 13.6).
* Fitted model with seed dimensions: alpha = 0.01, with averaging (OC_P of 0.69, MSE = 2.0).
* Fitted model with seed words and seed dimensions: alpha = 0.055, with averaging, jitter, and offset of 1.3; only alpha was re-fitted (OC_P of 0.83, MSE = 0.3).


## Crossvalidation on all data except the development data

See other notebook. We obtain:

| Method | OC_P mean | MSE median | MSE mean |
|---|---|---|---|
| Seed-based method | 0.636 (0.12) | 220.202 | 1177228.385 (12986411.98) |
| Fitted method | 0.543 (0.11) | 88.377 | 20095.844 (255193.91) |
| Fitted method with seed words | 0.534 (0.11) | 147.199 | 16850.911 (227645.29) |
| Fitted method with seed dimensions | 0.655 (0.13) | 6.174 | 1998.254 (27162.71) |
| Fitted method, seed words and dimensions | 0.79 (0.08) | 0.624 | 0.74 (0.5) |
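
The crossvalidation set-up behind numbers like these can be sketched as follows. The number of folds, the random seed, and the helper name `kfold_indices` are assumptions for illustration; the actual protocol is in the other notebook.

```python
import numpy as np


def kfold_indices(n_items, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate(
            [folds[m] for m in range(n_folds) if m != k])
        yield train_idx, test_idx


# hypothetical condition with 50 rated words
splits = list(kfold_indices(n_items=50, n_folds=5))
```

Each model would be fitted on the `train_idx` rows of a condition and scored on the `test_idx` rows, with OC_P and MSE then aggregated over folds and conditions.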