# Comparing Free Association Norms to SNLI-derived association

This notebook uses Peter's DT-RNN model, trained on the SNLI dataset, to predict one-word entailments for words in the Free Association Norms dataset. Only words that appear in both the SNLI vocabulary and the Free Association Norms are analyzed. Over 80% of the words in the norms are also present in the SNLI vocabulary.

Preliminary conclusion: the one-word entailment produced for a word is one of that word's associates in fewer than 10% of the tested cases. Possible reasons for this are outlined at the end of this notebook.

In [1]:
import pysem
import pickle
import numpy as np  # needed below for np.unique, np.argmax, np.argsort

from collections import namedtuple
from pysem.corpora import SNLI
from pysem.networks import DependencyNetwork
from pysem.generatives import EmbeddingGenerator, EncoderDecoder

Loading the DT-RNN model pretrained on the SNLI dataset:

In [2]:
TrainingPair = namedtuple('TrainingPair', ['sentence1', 'sentence2', 'label'])  # needed for pickle

with open('../data/train_data.pickle', 'rb') as pfile:
    train_data = pickle.load(pfile)
    
with open('../data/vocab.pickle', 'rb') as pfile:
    vocab_snli = pickle.load(pfile)
    
model = EncoderDecoder(encoder=None, decoder=None, data=train_data)
model.load('enc_model_0006_alt.pickle','dec_model_0006_alt.pickle')

Loading free-associations:

In [3]:
import sparat
path = '/home/ivana/phd/workspace/sparat/data/associationmatrices/'
name = 'freeassoc_asymmetric'

Load one vocabulary containing all words in the Free Association Norms, and another one with animal words:

In [4]:
mat, vocab_free_full, w2i = sparat.load_assoc_mat(path, name)
with open('../data/animal_words.txt', 'r') as afile:  # close the file after reading
    vocab_free_animals = [w.lower().strip() for w in afile]

vocab_snli = [w.lower() for w in vocab_snli]
vocab_free_full = [w.lower() for w in vocab_free_full]

In [5]:
print(vocab_free_full[:6])
print(vocab_free_animals[:4])

['stairway', 'life', 'unsure', 'loft', 'tease', 'flow']
['aardvark', 'alligator', 'ant', 'anteater']


### Comparisons of vocabulary sizes

Show the number of words in all vocabularies (SNLI, norms all, norms animals):

In [6]:
print('Total number of words in a vocabulary\n---')
print('SNLI vocab:', len(np.unique(vocab_snli)))
print('Free associations vocab:', len(vocab_free_full))
print('Animals in free associations vocab:', len(vocab_free_animals))

Total number of words in a vocabulary
---
SNLI vocab: 19306
Free associations vocab: 5018
Animals in free associations vocab: 158


The SNLI vocabulary is almost four times larger, but many of its entries are not proper words: numbers (1.9, 100th), compound expressions (100-meter, 12-hour), fragments (ab, abc.), and multiple grammatical forms of the same word. In other words, it contains a lot of noise.

Now let's compare the number of words contained in both vocabularies:

In [7]:
words_overlap = list(set(vocab_free_full) & set(vocab_snli))
print('Number of words found in free norms vocab and in SNLI vocab:', len(words_overlap))

animals_overlap = list(set(vocab_free_animals) & set(vocab_snli))
print('Number of animals in both datasets:', len(animals_overlap))

Number of words found in free norms vocab and in SNLI vocab: 4208
Number of animals in both datasets: 146


In [8]:
print('Words in free norms but not in SNLI:', len(set(vocab_free_full) - set(vocab_snli)))

Words in free norms but not in SNLI: 810


About 16% of the words in the norms (810 of 5018) are not present in the SNLI vocabulary. Eyeballing the data, the missing words include many related to food and animals, words with potentially prominent affective connotations, and more technical or domain-specific terminology (e.g. in economics, medicine, law).

In [9]:
list(set(vocab_free_full)-set(vocab_snli))[:5]

['obsession', 'jock', 'possibility', 'rhyme', 'poise']
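
As a rough check on the percentages quoted in this section, the counts printed above can be turned into coverage fractions. This is a minimal sketch: the numbers are copied from the cell outputs, and the variable names are introduced here for illustration only.

```python
# Counts copied from the cell outputs above.
n_norms = 5018        # words in the Free Association Norms vocabulary
n_overlap = 4208      # norms words that also occur in the SNLI vocabulary
n_missing = 810       # norms words absent from the SNLI vocabulary
n_animals = 158       # animal words in the norms
n_animals_overlap = 146

# The overlapping and missing words should partition the norms vocabulary.
assert n_overlap + n_missing == n_norms

coverage = n_overlap / n_norms
missing = n_missing / n_norms
animal_coverage = n_animals_overlap / n_animals

print(f'Norms words covered by SNLI: {coverage:.1%}')
print(f'Norms words missing from SNLI: {missing:.1%}')
print(f'Animal words covered by SNLI: {animal_coverage:.1%}')
```

The three fractions come out at roughly 84%, 16% and 92% respectively.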

## Generating one-word entailments

We use Peter's trained model and explore what the decoder produces when a single word is presented to the encoder. We first test some frequent words:

In [10]:
words = ['cat','dog', 'fish', 'octopus', 'eat', 'mother', 'son', 'book', 'happiness', 
         'luck','when', 'why', 'how', 'eat', 'banana']

for word in words:
    assert word in words_overlap, word + ' not in one of the vocabularies'

In [11]:
print('Encoded -> Decoded:\n---')
for word in words:
    model.encode(word)
    print(word, '->', model.decode('dog is similar to cat'))

Encoded -> Decoded:
---
cat -> cat cat furry with cat
dog -> dog dog leashed with dog
fish -> fish fish fishing near fish
octopus -> octopus whale whale underwater octopus
eat -> eating eat eating of eat
mother -> mother mother distraught with mother
son -> father son newborn than son
book -> book book paperback about book
happiness -> marriage affection happy in happiness
luck -> fortune gamble lucky of luck
when -> it later about about other
why -> what answer sad about reasons
how -> what react about about importance
eat -> eating eat eating of eat
banana -> banana banana nutritious of banana


It appears that very frequent words mostly decode to repetitions of themselves (cat -> cat, dog -> dog). The output becomes a bit more interesting for less frequent words and for interrogative words.
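
The amount of self-repetition can be tallied directly from the printed output. The sketch below copies the decoded strings from the cell above (listing the duplicated `eat` row once) and checks whether the encoded word reappears among the decoded tokens. The helper `repeats_input` and the dictionary are illustrative names, not part of pysem.

```python
# Decoder outputs copied from the cell above ('eat' appears twice there, once here).
decoded = {
    'cat': 'cat cat furry with cat',
    'dog': 'dog dog leashed with dog',
    'fish': 'fish fish fishing near fish',
    'octopus': 'octopus whale whale underwater octopus',
    'eat': 'eating eat eating of eat',
    'mother': 'mother mother distraught with mother',
    'son': 'father son newborn than son',
    'book': 'book book paperback about book',
    'happiness': 'marriage affection happy in happiness',
    'luck': 'fortune gamble lucky of luck',
    'when': 'it later about about other',
    'why': 'what answer sad about reasons',
    'how': 'what react about about importance',
    'banana': 'banana banana nutritious of banana',
}


def repeats_input(word, output):
    """True if the encoded word occurs among the decoded tokens."""
    return word in output.split()


self_repeating = [w for w, out in decoded.items() if repeats_input(w, out)]
not_repeating = [w for w in decoded if w not in self_repeating]

print(len(self_repeating), 'of', len(decoded), 'inputs reappear in their own output')
print('No repetition:', not_repeating)
```

On this small sample, 11 of the 14 distinct inputs reappear in their own output, and the three that do not are the interrogative words.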

We now analyse in how many cases the entailed word appears in the list of associates of the given word:

In [15]:
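def get_assoc(word, top=1):
    """Return the strongest associate (top=1) or all associates, strongest first."""
    word = word.upper()  # w2i is indexed with upper-case words
    row = mat[w2i[word]].copy()  # copy so that mat itself is not modified
    row[w2i[word]] = 0  # ignore auto-association
    if top == 1:
        idx = np.argmax(row)
        return vocab_free_full[idx].lower()
    else:
        non_zeros = len(np.nonzero(row)[0])
        # reverse first, then slice: [-non_zeros:] would return every word when non_zeros == 0
        idx = np.argsort(row)[::-1][:non_zeros]
        return [vocab_free_full[i].lower() for i in idx]

counter = 0  # number of decoded words that are associates of the encoded word
for word in words_overlap:
    model.encode(word)
    decoded = model.decode('word')
    if word != decoded:  # skip words that decode to themselves
        list_assoc = get_assoc(word, 0)
        if decoded in list_assoc:
            counter += 1
            if counter < 5:  # print only the first four matches
                print('Encoded word:', word)
                print('Decoded:', decoded)
                print('Associates:', list_assoc, '\n')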

Encoded word: ease
Decoded: soothe
Associates: ['easy', 'relax', 'hard', 'difficult', 'simple', 'slow', 'comfort', 'tense', 'pain', 'soothe', 'help', 'smooth', 'slide', 'ability', 'quick', 'nervous', 'grace', 'complex', 'mind', 'rest'] 

Encoded word: obnoxious
Decoded: annoying
Associates: ['rude', 'loud', 'annoying', 'jerk', 'snotty', 'irritating', 'crazy', 'stupid', 'pain', 'me', 'snob', 'drunk', 'mean', 'bother', 'attitude', 'people', 'arrogant', 'silly', 'nose'] 

Encoded word: poor
Decoded: poverty
Associates: ['rich', 'bad', 'people', 'old', 'money', 'dirty', 'poverty', 'house', 'lonely', 'welfare', 'sad'] 

Encoded word: daily
Decoded: everyday
Associates: ['news', 'weekly', 'routine', 'newspaper', 'everyday', 'always', 'paper', 'today', 'monthly', 'day', 'bread', 'regular', 'often', 'log', 'ritual', 'schedule'] 



The entailed word is an associate of the encoded word in fewer than 10% of cases:

In [16]:
counter/len(words_overlap)

0.08745247148288973
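
To make the metric explicit, here is a self-contained toy version of the same computation. Everything in it is hypothetical: the four-word vocabulary, the association strengths, and the `toy_decoded` dictionary standing in for the model's output. As in the loop above, words that decode to themselves are not counted as hits but still count in the denominator.

```python
import numpy as np

# Hypothetical toy data: rows are cue words, columns are response words.
toy_vocab = ['poor', 'rich', 'money', 'cat']
toy_mat = np.array([
    [0.0, 0.6, 0.3, 0.0],   # poor  -> rich, money
    [0.5, 0.0, 0.4, 0.0],   # rich  -> poor, money
    [0.2, 0.3, 0.0, 0.0],   # money -> rich, poor
    [0.0, 0.0, 0.0, 0.0],   # cat has no recorded associates
])
toy_w2i = {w: i for i, w in enumerate(toy_vocab)}


def associates(word):
    """Return all associates of `word`, strongest first."""
    row = toy_mat[toy_w2i[word]].copy()   # copy: do not modify the matrix
    row[toy_w2i[word]] = 0                # ignore auto-association
    n = np.count_nonzero(row)
    order = np.argsort(row)[::-1][:n]     # empty when the word has no associates
    return [toy_vocab[i] for i in order]


# Hypothetical decoder outputs, standing in for the trained model.
toy_decoded = {'poor': 'money', 'rich': 'cat', 'money': 'money', 'cat': 'rich'}

hits = sum(
    1 for word, out in toy_decoded.items()
    if out != word and out in associates(word)
)
hit_rate = hits / len(toy_decoded)
print(hit_rate)
```

Only `poor -> money` counts as a hit here, so the toy hit rate is 1 out of 4.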

One possible reason for such a small overlap between entailment predictions and free association norms is the dataset used to train the model. The SNLI dataset consists of sentences, generated by captioning images, that usually describe people performing actions. A vocabulary gathered in that way lacks many of the subtleties that arise in discourse or in written text. To alleviate this, one could try training the model on a different dataset.

Another reason for the poor match might be the model architecture, which is designed to process syntax, something less relevant to free association norms. A second idea is therefore to modify the model and train it on more co-occurrence data (although some people have argued strongly against interpreting associations as co-occurrences in text).
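
For concreteness, one simple form of co-occurrence data is a count of word pairs appearing within a fixed window of each other. The sketch below is only an illustration of that idea under assumed choices (whitespace tokenisation, unordered pairs, a window of two tokens). It is not part of the existing pipeline, and the sentences are made up rather than taken from SNLI.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, left in enumerate(tokens):
            # Pair each token with the next `window` tokens to its right.
            for right in tokens[i + 1:i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical caption-like sentences, not taken from SNLI.
toy_sentences = [
    'a dog chases a cat',
    'the cat sleeps',
    'a dog barks',
]

counts = cooccurrence_counts(toy_sentences, window=2)
print(counts[('a', 'dog')])
print(counts[('cat', 'dog')])
```

Pairs are stored in sorted order, so the lookup key for a pair does not depend on which word came first in the sentence.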