# Comparing Free Association Norms to SNLI-derived association

This notebook uses Peter's DT-RNN model, trained on the SNLI dataset, to predict one-word entailments for words in the Free Association Norms dataset. Only words that occur in both the SNLI vocabulary and the Free Association Norms are analyzed. About 84% of the words in the norms (4208 of 5018) are also present in the SNLI vocabulary.

Preliminary conclusion: the one-word entailment is an associate of the input word in fewer than 10% of the tested cases. Possible reasons for this are outlined at the end of this notebook.

In [176]:
import pickle

import numpy as np  # used below (np.unique, np.argmax, np.argsort)
import pysem

from collections import namedtuple
from pysem.corpora import SNLI
from pysem.networks import DependencyNetwork
from pysem.generatives import EmbeddingGenerator, EncoderDecoder

Loading the DT-RNN model pretrained on the SNLI dataset:

In [178]:
TrainingPair = namedtuple('TrainingPair', ['sentence1', 'sentence2', 'label'])  # needed for pickle

with open('../data/train_data.pickle', 'rb') as pfile:
    train_data = pickle.load(pfile)
    
with open('../data/vocab.pickle', 'rb') as pfile:
    vocab_snli = pickle.load(pfile)
    
model = EncoderDecoder(encoder=None, decoder=None, data=train_data)
model.load('enc_model_0006_alt.pickle','dec_model_0006_alt.pickle')

Loading free-associations:

In [179]:
import sparat
path = '/home/ivana/phd/workspace/sparat/data/associationmatrices/'
name = 'freeassoc_asymmetric'

Load one vocabulary containing all words in the Free Association Norms, and another containing only animal words:

In [182]:
mat, vocab_free_full, w2i = sparat.load_assoc_mat(path, name)
with open('../data/animal_words.txt', 'r') as afile:
    vocab_free_animals = [w.lower().strip() for w in afile]

vocab_snli = [w.lower() for w in vocab_snli]
vocab_free_full = [w.lower() for w in vocab_free_full]

In [183]:
print(vocab_free_full[:6])
print(vocab_free_animals[:4])

['stairway', 'life', 'unsure', 'loft', 'tease', 'flow']
['aardvark', 'alligator', 'ant', 'anteater']


### Comparisons of vocabulary sizes

Show the number of words in all vocabularies (SNLI, norms all, norms animals):

In [247]:
print('Total number of words in a vocabulary\n---')
print('SNLI vocab:', len(np.unique(vocab_snli)))
print('Free associations vocab:', len(vocab_free_full))
print('Animals in free associations vocab:', len(vocab_free_animals))

Total number of words in a vocabulary
---
SNLI vocab: 19306
Free associations vocab: 5018
Animals in free associations vocab: 158


The SNLI vocabulary is almost four times larger, but many of its entries are not ordinary words: numbers (1.9, 100th), compound expressions (100-meter, 12-hour), fragments (ab, abc) and multiple inflected forms of the same word. In other words, it contains a lot of noise.
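
One rough way to gauge this noise is to keep only purely alphabetic tokens. The cell below is a minimal sketch and not part of the original analysis: `is_plain_word` is a hypothetical helper, and the sample list mixes the tokens quoted above with two ordinary words.

```python
import re

# Hypothetical helper (an assumption, not from pysem or sparat): a token counts
# as a "plain word" if it consists of lowercase ASCII letters only.
PLAIN_WORD = re.compile(r'^[a-z]+$')

def is_plain_word(token):
    return bool(PLAIN_WORD.match(token))

# Toy sample: noisy tokens mentioned in the text plus two ordinary words
sample_vocab = ['1.9', '100th', '100-meter', '12-hour', 'ab', 'abc', 'cat', 'dog']
plain = [w for w in sample_vocab if is_plain_word(w)]
print(plain)  # ['ab', 'abc', 'cat', 'dog']
```

Note that fragments such as "ab" pass this filter, and inflected forms are not merged, so a count obtained this way would only give a lower bound on the noise.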

Now let's compare the number of words contained in both vocabularies:

In [197]:
words_overlap = list(set(vocab_free_full) & set(vocab_snli))
print('Number of words found in free norms vocab and in SNLI vocab:', len(words_overlap))

animals_overlap = list(set(vocab_free_animals) & set(vocab_snli))
print('Number of animals in both datasets:', len(animals_overlap))

Number of words found in free norms vocab and in SNLI vocab: 4208
Number of animals in both datasets: 146


In [42]:
print('Words in free norms but not in SNLI:', len(set(vocab_free_full) - set(vocab_snli)))

Words in free norms but not in SNLI: 810


About 16% of the words in the norms (810 of 5018) are not present in the SNLI vocabulary. Eyeballing the data, many of the missing words relate to food and animals, carry potentially prominent affective connotations, or belong to more technical or domain-specific terminology (e.g. economics, medicine, law).
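
The coverage figures can be checked directly from the counts printed above. The sketch below uses a hypothetical helper, `coverage`, and a toy vocabulary for illustration. Only the two counts (4208 and 810 out of 5018) come from this notebook.

```python
def coverage(reference_vocab, other_vocab):
    """Fraction of unique words in reference_vocab that also occur in other_vocab."""
    reference = set(reference_vocab)
    return len(reference & set(other_vocab)) / len(reference)

# Toy example (made-up word lists): two of the four reference words are covered
toy_norms = ['cat', 'dog', 'aardvark', 'axon']
toy_snli = ['cat', 'dog', 'fish']
print(coverage(toy_norms, toy_snli))  # 0.5

# Counts reported in this notebook
print(round(4208 / 5018, 3))  # 0.839 -> share of norm words found in SNLI
print(round(810 / 5018, 3))   # 0.161 -> share of norm words missing from SNLI
```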

In [186]:
set(vocab_free_full)-set(vocab_snli)

{'aardvark',
 'abnormal',
 'absurd',
 'abundance',
 'accelerate',
 'accept',
 'accuse',
 'ache',
 'acre',
 'adjective',
 'adjourn',
 'admit',
 'adore',
 'adultery',
 'adverb',
 'advil',
 'affect',
 'agenda',
 'aggravate',
 'ahoy',
 'air force',
 'almanac',
 'almond',
 'amaze',
 'ambition',
 'ancestor',
 'anisette',
 'annihilate',
 'announce',
 'anteater',
 'antidote',
 'antler',
 'antlers',
 'anxiety',
 'apology',
 'appeal',
 'appraise',
 'aright',
 'aroma',
 'arrogant',
 'artery',
 'ashes',
 'atlas',
 'atomic',
 'aura',
 'axon',
 'backbone',
 'bandaid',
 'bargain',
 'barley',
 'barracuda',
 'bashful',
 'beet',
 'betray',
 'beware',
 'bias',
 'bizarre',
 'blackmail',
 'blame',
 'blockade',
 'blonde',
 'blot',
 'blubber',
 'bluejay',
 'bold',
 'bologna',
 'bookbag',
 'bother',
 'bouillon',
 'bounty',
 'bran',
 'bravado',
 'breakable',
 'bribe',
 'bristle',
 'britannica',
 'bullock',
 'bureau',
 'cafe',
 'caffeine',
 'calcium',
 'canary',
 'capability',
 'cardinal',
 'cashew',
 'cauliflo

## Generating one-word entailments

We use Pete's trained model and explore what the decoder produces when a single word is presented to the encoder. We first test some frequent words:

In [225]:
words = ['cat','dog', 'fish', 'octopus', 'eat', 'mother', 'son', 'book', 'happiness', 
         'luck','when', 'why', 'how', 'eat', 'banana']

for word in words:
    assert word in words_overlap, word + ' not in one of the vocabularies'

In [243]:
print('Encoded -> Decoded:\n---')
for word in words:
    model.encode(word)
    print(word, '->', model.decode('dog is similar to cat'))

Encoded -> Decoded:
---
cat -> cat cat furry with cat
dog -> dog dog leashed with dog
fish -> fish fish fishing near fish
octopus -> octopus whale whale underwater octopus
eat -> eating eat eating of eat
mother -> mother mother distraught with mother
son -> father son newborn than son
book -> book book paperback about book
happiness -> marriage affection happy in happiness
luck -> fortune gamble lucky of luck
when -> it later about about other
why -> what answer sad about reasons
how -> what react about about importance
eat -> eating eat eating of eat
banana -> banana banana nutritious of banana


It appears that very frequent words are mostly decoded as repetitions of themselves (cat -> cat, dog -> dog). The output gets more interesting for less frequent words and for interrogative words.

We now analyse how often the entailed word appears in the list of associates for a given word:

In [233]:
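def get_assoc(word, top=1):
    """Return the strongest associate (top=1) or all associates, strongest first."""
    word = word.upper()  # w2i is indexed with upper-case words
    row = mat[w2i[word]].copy()  # copy so that mat itself is not modified
    row[w2i[word]] = 0  # ignore auto-association
    if top == 1:
        idx = np.argmax(row)
        return vocab_free_full[idx].lower()
    else:
        non_zeros = len(np.nonzero(row)[0])
        # slice from the front: [-non_zeros:] would return every word when non_zeros == 0
        idx = np.argsort(row)[len(row) - non_zeros:]
        return [vocab_free_full[i].lower() for i in idx[::-1]]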

counter = 0
for word in words_overlap:
    model.encode(word)
    decoded=model.decode('word')
    #if word!=decoded and word+'ing'!=decoded:
    #    print(word, ',',  decoded, ',', get_strongest_assoc(word))
    if word!=decoded:
        #print(word, '->', decoded)
        list_assoc = get_assoc(word, 0)
        if decoded in list_assoc:
            print('Encoded word:', word)
            print('Decoded:', decoded)
            print('Associates:', list_assoc, '\n')
            counter += 1

Encoded word: daisy
Decoded: flower
Associates: ['flower', 'yellow', 'rose', 'petals', 'gun', 'dog', 'tulip', 'girl'] 

Encoded word: luck
Decoded: gamble
Associates: ['good', 'charm', 'bad', 'chance', 'fortune', 'never', 'skill', 'happiness', 'gamble'] 

Encoded word: topic
Decoded: discussion
Associates: ['subject', 'idea', 'title', 'paper', 'thesis', 'story', 'sentence', 'discussion', 'theme', 'book', 'study', 'meaning', 'issue', 'main', 'conversation'] 

Encoded word: expense
Decoded: cost
Associates: ['money', 'account', 'cost', 'bill', 'pay', 'cheap', 'price', 'clothes', 'car', 'book', 'food', 'school', 'broke', 'buy'] 

Encoded word: beans
Decoded: peas
Associates: ['rice', 'food', 'gas', 'vegetables', 'pork', 'green', 'baked', 'hot dogs', 'black', 'chili', 'peas', 'cool', 'soup', 'potatoes'] 

Encoded word: basement
Decoded: cellar
Associates: ['attic', 'cellar', 'bottom', 'house', 'dark', 'cold', 'stairs', 'floor', 'underground', 'downstairs', 'below', 'low', 'under', 'cement'

Encoded word: spade
Decoded: shovel
Associates: ['cards', 'shovel', 'ace', 'dig', 'garden', 'black', 'dog', 'dirt', 'hoe', 'diamond', 'tool', 'heart'] 

Encoded word: reptile
Decoded: alligator
Associates: ['snake', 'lizard', 'alligator', 'scales', 'mammal', 'animal', 'gross', 'frog'] 

Encoded word: creek
Decoded: river
Associates: ['river', 'water', 'brook', 'stream', 'noise', 'pond', 'lake', 'meadow', 'woods', 'turtle', 'squeak', 'fish', 'swim'] 

Encoded word: hazard
Decoded: risk

Encoded word: haircut
Decoded: hair
Associates: ['short', 'scissors', 'trim', 'hair', 'style', 'no', 'bad', 'shave', 'buzz', 'chop', 'long', 'messy', 'cheap', 'shampoo'] 

Encoded word: division
Decoded: unit
Associates: ['math', 'separate', 'multiply', 'divide', 'addition', 'half', 'unit', 'group', 'split', 'section', 'department', 'bell', 'class', 'border', 'apart', 'subtraction'] 

Encoded word: motor
Decoded: engine
Associates: ['car', 'engine', 'boat', 'motorcycle', 'run', 'vehicle', 'bike', 'oil', 

Encoded word: acorn
Decoded: squirrel
Associates: ['tree', 'squirrel', 'nut', 'squash', 'seed', 'chipmunk'] 

Encoded word: burglar
Decoded: robber
Associates: ['thief', 'robber', 'steal', 'alarm', 'rob', 'crook', 'cat', 'mask', 'robbery', 'bar', 'criminal', 'black', 'house', 'bad', 'cop', 'money', 'crime', 'movie'] 

Encoded word: headache
Decoded: pain
Associates: ['pain', 'aspirin', 'migraine', 'tylenol', 'hurt', 'sinus', 'sick', 'pill', 'medicine', 'brain', 'fever'] 

Encoded word: macaroni
Decoded: pasta
Associates: ['cheese', 'noodles', 'pasta', 'food', 'spaghetti', 'sauce', 'salad'] 

Encoded word: ease
Decoded: soothe
Associates: ['easy', 'relax', 'hard', 'difficult', 'simple', 'slow', 'comfort', 'tense', 'pain', 'soothe', 'help', 'smooth', 'slide', 'ability', 'quick', 'nervous', 'grace', 'complex', 'mind', 'rest'] 

Encoded word: chapter
Decoded: book
Associates: ['book', 'one', 'section', 'novel', 'verse', 'page', 'sorority'] 

Encoded word: granite
Decoded: stone
Associates:

Encoded word: danger
Decoded: risk

Encoded word: arctic
Decoded: snow
Associates: ['cold', 'ocean', 'ice', 'circle', 'snow', 'north', 'bear', 'frigid'] 

Encoded word: verdict
Decoded: jury
Associates: ['guilty', 'decision', 'judge', 'court', 'jury', 'judgment', 'sentence', 'answer', 'law', 'conclusion', 'truth', 'trial', 'end', 'final', 'result', 'outcome'] 

Encoded word: upstream
Decoded: river
Associates: ['downstream', 'river', 'salmon', 'fish', 'water', 'swim', 'down', 'paddle', 'trout', 'current', 'hard', 'boat', 'high', 'fight'] 

Encoded word: ruler
Decoded: king
Associates: ['measure', 'king', 'inch', 'pencil', 'dictator', 'straight', 'stick', 'wood', 'number', 'math', 'tool', 'pen', 'edge', 'line', 'geometry', 'meter'] 

Encoded word: east
Decoded: south
Associates: ['west', 'coast', 'south', 'slow'] 

Encoded word: physician
Decoded: doctor
Associates: ['doctor', 'heal', 'needle', 'smart'] 

Encoded word: chart
Decoded: diagram
Associates: ['graph', 'map', 'table', 'doctor

Encoded word: comprehend
Decoded: understand
Associates: ['understand', 'learn', 'read', 'know'] 

Encoded word: employer
Decoded: employee
Associates: ['boss', 'employee', 'work', 'job', 'worker', 'money', 'check'] 

Encoded word: paddy
Decoded: rice
Associates: ['wagon', 'cake', 'hamburger', 'rice', 'soft', 'cushion', 'cow', 'meat', 'boat', 'pad', 'comfortable', 'bear', 'person'] 

Encoded word: organization
Decoded: group
Associates: ['group', 'club', 'neat', 'together', 'messy', 'tidy', 'order', 'people', 'mafia', 'business', 'chaos', 'skill', 'gang', 'mob', 'team', 'sorority', 'company'] 

Encoded word: slogan
Decoded: banner
Associates: ['ad', 'saying', 'word', 'campaign', 'commercial', 'banner', 'sign', 'product', 'phrase', 'theme', 'coke', 'candidate', 'sell'] 

Encoded word: robin
Decoded: bird
Associates: ['bird', 'hood', 'red', 'eggs', 'nest'] 

Encoded word: alarm
Decoded: alert
Associates: ['clock', 'bell', 'awake', 'sound', 'scare', 'ring', 'fire', 'morning', 'alert', 'fr

Encoded word: slippers
Decoded: shoes
Associates: ['shoes', 'feet', 'robe', 'bedroom', 'fuzzy', 'furry', 'socks', 'bed'] 

Encoded word: cousin
Decoded: brother
Associates: ['relative', 'aunt', 'family', 'uncle', 'friend', 'nephew', 'brother', 'kin', 'kid'] 

Encoded word: autumn
Decoded: summer
Associates: ['fall', 'leaves', 'spring', 'season', 'summer', 'brown', 'color', 'rain', 'tree'] 

Encoded word: butler
Decoded: maid
Associates: ['maid', 'servant', 'rich', 'waiter', 'wealth', 'meat', 'door', 'mansion', 'tuxedo', 'money', 'man'] 

Encoded word: four
Decoded: five
Associates: ['five', 'number', 'three', 'golf', 'fire', 'square', 'score', 'eight', 'fingers', 'two'] 

Encoded word: spice
Decoded: seasoning
Associates: ['sugar', 'salt', 'pepper', 'cinnamon', 'hot', 'food', 'flavor', 'taste', 'seasoning', 'nice', 'garlic', 'herb', 'oregano', 'smell', 'cake', 'sweet', 'rack'] 

Encoded word: fox
Decoded: wolf
Associates: ['hound', 'animal', 'dog', 'sly', 'trot', 'red', 'hunt', 'wolf',

Encoded word: shame
Decoded: regret
Associates: ['guilt', 'bad', 'disgrace', 'doubt', 'embarrass', 'sorrow', 'hide', 'pity', 'blame', 'sorry', 'humiliate', 'wrong', 'regret', 'abuse', 'hurt'] 

Encoded word: bison
Decoded: moose
Associates: ['buffalo', 'animal', 'bull', 'ox', 'big', 'cow', 'snake', 'moose', 'meat', 'indian'] 

Encoded word: filthy
Decoded: dirty
Associates: ['dirty', 'rich', 'dirt', 'mess', 'bum', 'poor', 'clean'] 

Encoded word: science
Decoded: biology
Associates: ['biology', 'fiction', 'art', 'math', 'psychology', 'chemistry', 'experiment', 'technology', 'knowledge', 'study', 'discipline', 'lab', 'discovery', 'scientist', 'fair', 'method', 'astronomy', 'hard'] 

Encoded word: character
Decoded: actor
Associates: ['personality', 'person', 'play', 'actor', 'cartoon', 'act', 'player', 'witness', 'letter', 'figure', 'funny', 'fine', 'movie', 'book', 'story'] 

Encoded word: joker
Decoded: clown
Associates: ['cards', 'clown', 'funny', 'laugh', 'wild', 'ace'] 

Encoded wo

The entailed word (when it differs from the input word) is among the associates in less than 10% of cases:

In [236]:
counter/len(words_overlap)

0.08745247148288973
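
The ratio above counts a hit only when the decoded word differs from the input and appears among its associates. The same measure can be sketched as a standalone function. `hit_rate` is a hypothetical helper, and the dictionaries below are toy data: the daisy and luck entries are copied from the output above, while the cat and dog entries are made up for illustration.

```python
def hit_rate(decoded_by_word, associates_by_word):
    """Share of words whose decoded word differs from the input and is one of its associates."""
    hits = 0
    for word, decoded in decoded_by_word.items():
        if decoded != word and decoded in associates_by_word.get(word, []):
            hits += 1
    return hits / len(decoded_by_word)

# Toy data: 'daisy' and 'luck' follow the output above; 'cat' and 'dog' are invented
decoded = {'daisy': 'flower', 'luck': 'gamble', 'cat': 'cat', 'dog': 'table'}
associates = {
    'daisy': ['flower', 'yellow', 'rose'],
    'luck': ['good', 'charm', 'gamble'],
    'cat': ['dog', 'mouse'],
    'dog': ['cat', 'bone'],
}
print(hit_rate(decoded, associates))  # 0.5
```

Because self-repetitions such as cat -> cat are never counted as hits but still enter the denominator, frequent words that decode to themselves pull the ratio down.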

One possible reason for the small overlap between the entailment predictions and the free association norms is the dataset used to train the model. SNLI contains sentences that usually describe people performing actions, generated by captioning images. A vocabulary gathered in this way lacks many of the nuances that arise in discourse or in written text. To alleviate this, one could train the model on a different dataset.

Another reason for the poor match might be the model architecture, which is designed to process syntax, something less relevant to free association norms. A second idea is therefore to modify the model so that it can be trained on co-occurrence data (although some people have argued strongly against interpreting associations as co-occurrences in text).
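
To make "co-occurrence data" concrete, the cell below counts how often two words appear in the same sentence. It is only a sketch under simple assumptions (whitespace tokenization, sentence-level windows, made-up sentences). `cooccurrence_counts` is a hypothetical helper and not part of pysem or sparat, and nothing here has been run against SNLI.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each unordered word pair occurs in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # unique, sorted words so that each pair has one canonical key
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return counts

# Made-up toy sentences
toy_sentences = [
    'a dog chases a cat',
    'the cat sleeps',
    'a dog sleeps',
]
counts = cooccurrence_counts(toy_sentences)
print(counts[('cat', 'dog')])  # 1
print(counts[('a', 'dog')])    # 2
```

Such counts are symmetric, whereas the free association matrix used here is asymmetric, which is one reason the two kinds of data may not line up directly.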