# Summary

This notebook explores distributional similarity in a dataset of 10,000 Wikipedia articles (4.4M words), building high-dimensional, sparse representations for words from the distinct contexts they appear in. These representations allow for analysis of the most similar words to a given query, and are interpretable with respect to the specific contexts that are most important for determining that two words are similar.

In [1]:
from collections import defaultdict, Counter
import math
import operator
import gzip

In [2]:
window=2
vocabSize=10000

In [30]:
filename="../data/wiki.10K.txt"

wiki_data=open(filename, encoding="utf-8").read().lower().split(" ")


## Clean text

In [15]:
import string
from nltk.corpus import stopwords

In [18]:
puncts = string.punctuation
en_stop_words = set(stopwords.words('english'))  # set for fast membership tests

In [31]:
cleaned_wiki_data = [word for word in wiki_data if (word not in puncts) and 
             (word not in en_stop_words)]
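
One subtlety in the filter above: `puncts` is a string, so `word not in puncts` is a substring test rather than a per-token lookup. The sketch below uses a hypothetical token list and a small stand-in stop-word set (both are assumptions, not the notebook's data) to illustrate the effect: single punctuation marks and empty strings are dropped, while a multi-character token such as `''` passes through. That is consistent with `''` still appearing in the cleaned results further down.

```python
import string

puncts = string.punctuation
# Hypothetical stand-in for the NLTK English stop-word list.
toy_stop_words = {"the", "a", "an", "is"}

# Hypothetical tokens, lowercased and split on spaces as in the notebook.
tokens = ["the", "actor", ",", "''", "is", "", "famous", "."]

# Same filter shape as the notebook: `in puncts` is a substring check on a string.
cleaned = [w for w in tokens if (w not in puncts) and (w not in toy_stop_words)]
print(cleaned)
```

If such tokens should also be removed, one option would be to drop any token made up entirely of punctuation characters instead of relying on the substring check.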

## Define helpers

In [4]:
# We'll only create word representations for the K most frequent words (K = vocabSize)

def create_vocab(data):
    word_representations={}
    vocab=Counter()
    for i, word in enumerate(data):
        vocab[word]+=1

    topK=[k for k,v in vocab.most_common(vocabSize)]
    for k in topK:
        word_representations[k]=defaultdict(float)
    return word_representations

In [5]:
# word representation for a word = its unigram distributional context (the unigrams that show
# up in a window before and after its occurrence)

def count_unigram_context(data, word_representations):
    for i, word in enumerate(data):
        if word not in word_representations:
            continue
        start=i-window if i-window > 0 else 0
        end=i+window+1 if i+window+1 < len(data) else len(data)
        for j in range(start, end):
            if i != j:
                word_representations[word][data[j]]+=1

In [6]:
def count_directional_context(data, word_representations):
    for i, word in enumerate(data):
        if word not in word_representations:
            continue
        start=i-window if i-window > 0 else 0
        end=i+window+1 if i+window+1 < len(data) else len(data)
        left="L: %s" % ' '.join(data[start:i])
        right="R: %s" % ' '.join(data[i+1:end])
        
        word_representations[word][left]+=1
        word_representations[word][right]+=1

In [38]:
import numpy as np

In [40]:
np.power(list({'a': 1, 'b': 2}.values()), 2)

array([1, 4], dtype=int32)

In [41]:
# normalize each word representation vector so that its L2 norm is 1.
# we do this so that the cosine similarity reduces to a simple dot product

def normalize(word_representations):
    for word in word_representations:
        total = sum(np.power(list(word_representations[word].values()), 2))
#         total=0
#         for key in word_representations[word]:
#             total+=word_representations[word][key]*word_representations[word][key]
            
        total=math.sqrt(total)
        for key in word_representations[word]:
            word_representations[word][key]/=total
    
    return word_representations

In [8]:
def dictionary_dot_product(dict1, dict2):
    dot=0
    for key in dict1:
        if key in dict2:
            dot+=dict1[key]*dict2[key]
    return dot
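
To make the role of normalization concrete, here is a minimal sketch on two hypothetical sparse count vectors (the words and counts are invented for illustration). It computes the cosine once with the full formula on raw counts and once as a plain dot product on L2-normalized copies; the helper names `l2_normalize` and `dot` are assumptions for this example and are not the notebook's functions.

```python
import math

def l2_normalize(vec):
    """Return a copy of a sparse dict vector scaled to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()}

def dot(d1, d2):
    """Dot product over the keys the two sparse vectors share."""
    return sum(v * d2[k] for k, v in d1.items() if k in d2)

# Hypothetical context counts for two words.
u = {"film": 3.0, "born": 4.0}
v = {"film": 4.0, "stage": 3.0}

# Full cosine formula on the raw counts.
cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Plain dot product on the normalized vectors.
normalized_dot = dot(l2_normalize(u), l2_normalize(v))

print(round(cosine, 5), round(normalized_dot, 5))
```

Both values agree (0.48 for these counts), which is why normalizing once up front lets every later similarity query be a single sparse dot product.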

In [9]:
def find_sim(word_representations, query):
    if query not in word_representations:
        print("'%s' is not in vocabulary" % query)
        return None
    
    scores={}
    for word in word_representations:
        cosine=dictionary_dot_product(word_representations[query], word_representations[word])
        scores[word]=cosine
    return scores

In [10]:
# Find the K words with highest cosine similarity to a query in a set of word_representations
def find_nearest_neighbors(word_representations, query, K):
    scores=find_sim(word_representations, query)
    if scores is not None:
        sorted_x = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
        for idx, (k, v) in enumerate(sorted_x[:K]):
            print("%s\t%s\t%.5f" % (idx,k,v))

# Unigram vs. directional contexts

Explore the difference between `count_unigram_context` and `count_directional_context` for determining what counts as "context".  `count_unigram_context` counts an individual unigram in the bag of words around a target as a "context" variable, while `count_directional_context` counts the sequence of words before and after the word as a single "context"--and specifies the direction it occurs (to the left or right of the word).
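
As a small illustration of the two context definitions, the sketch below extracts both kinds of features for one occurrence of a target word in a hypothetical sentence (the sentence is an assumption, chosen only for illustration). It mirrors the windowing logic of the helpers above but is not a replacement for them.

```python
from collections import Counter

window = 2  # same window size as the notebook
# Hypothetical sentence, tokenized on spaces.
tokens = "he is an american actor best known for".split(" ")

i = tokens.index("actor")
start = max(0, i - window)
end = min(len(tokens), i + window + 1)

# Unigram contexts: each neighbouring token is its own feature.
unigram = Counter(tokens[j] for j in range(start, end) if j != i)

# Directional contexts: the whole left and right sequences are one feature each.
directional = Counter()
directional["L: %s" % " ".join(tokens[start:i])] += 1
directional["R: %s" % " ".join(tokens[i + 1:end])] += 1

print(sorted(unigram))      # ['american', 'an', 'best', 'known']
print(sorted(directional))  # ['L: an american', 'R: best known']
```

A directional feature only matches when the whole sequence and its side match, so two words will typically share far fewer directional features than unigram features.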

## Unigram context

In [32]:
word_representations=create_vocab(wiki_data)
count_unigram_context(wiki_data, word_representations)
word_representations = normalize(word_representations)

In [33]:
find_nearest_neighbors(word_representations, "actor", 10)

0	actor	1.00000
1	actress	0.94363
2	artist	0.88560
3	writer	0.85602
4	politician	0.84846
5	musician	0.84787
6	entrepreneur	0.84640
7	engineer	0.83586
8	singer	0.82831
9	activist	0.82257


Interestingly, `politician`, `entrepreneur`, `activist` and even `engineer` are near neighbors of `actor`.

In [13]:
# Let's find the contexts shared between two words that contribute the most
# to the cosine similarity

def find_shared_contexts(word_representations, query1, query2, K):
    if query1 not in word_representations:
        print("'%s' is not in vocabulary" % query1)
        return None
    
    if query2 not in word_representations:
        print("'%s' is not in vocabulary" % query2)
        return None
    
    context_scores={}
    dict1=word_representations[query1]
    dict2=word_representations[query2]
    
    for key in dict1:
        if key in dict2:
            score=dict1[key]*dict2[key]
            context_scores[key]=score

    sorted_x = sorted(context_scores.items(), key=operator.itemgetter(1), reverse=True)
    for idx, (k, v) in enumerate(sorted_x[:K]):
        print("%s\t%s\t%.5f" % (idx,k,v))
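
Because the vectors are L2-normalized, the per-context scores computed here are exactly the terms of the dot product, so they add up to the cosine similarity between the two words. The sketch below checks this on two hypothetical count vectors (words and counts are invented; `l2_normalize` is an assumed helper name, not part of the notebook).

```python
import math

def l2_normalize(vec):
    """Return a copy of a sparse dict vector scaled to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()}

# Hypothetical raw context counts for two words, normalized to unit length.
a = l2_normalize({"american": 2.0, "film": 1.0, "born": 2.0})
b = l2_normalize({"american": 2.0, "born": 1.0, "served": 2.0})

# Per-context contribution: product of the two normalized weights.
contributions = {k: a[k] * b[k] for k in a if k in b}

# Summing the contributions recovers the cosine similarity.
cosine = sum(contributions.values())

for k, v in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print("%s\t%.5f" % (k, v))
print("cosine\t%.5f" % cosine)
```

Reading the ranked list this way shows what fraction of a similarity score each shared context accounts for.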

In [35]:
find_shared_contexts(word_representations, "actor", "politician", 10)

0	and	0.22555
1	.	0.18242
2	a	0.10453
3	,	0.09911
4	an	0.07981
5	the	0.05109
6	he	0.01720
7	in	0.01608
8	american	0.01511
9	(	0.01307


In [34]:
find_shared_contexts(word_representations, "actor", "entrepreneur", 10)

0	and	0.23554
1	,	0.18507
2	.	0.13991
3	an	0.09476
4	a	0.05330
5	the	0.04516
6	in	0.01579
7	he	0.01555
8	by	0.00987
9	(	0.00962


It turns out that these word pairs share mostly stop words and punctuation. Let us strip them to see how things change.

In [36]:
# define a helper to avoid code repetition
def cal_normalized_repr(text_data):
    word_representations=create_vocab(text_data)
    count_unigram_context(text_data, word_representations)
    return normalize(word_representations)

In [42]:
clean_word_repr = cal_normalized_repr(cleaned_wiki_data)

In [43]:
find_shared_contexts(clean_word_repr, "actor", "politician", 10)

0	american	0.09663
1	born	0.01678
2	indian	0.00814
3	british	0.00678
4	canadian	0.00559
5	served	0.00542
6	english	0.00415
7	''	0.00373
8	director	0.00322
9	's	0.00280


In [44]:
find_shared_contexts(clean_word_repr, "actor", "entrepreneur", 10)

0	best	0.03721
1	''	0.02212
2	film	0.01609
3	american	0.01509
4	actor	0.01207
5	known	0.01106
6	director	0.00955
7	canadian	0.00905
8	born	0.00905
9	``	0.00654


We start seeing some logic here:
+ actor and entrepreneur both co-occur with *best* and *known*, which makes sense because Wikipedia articles tend to be about the best or best-known people in a field.

+ actor and entrepreneur also both co-occur with *director*, perhaps because some famous actors go on to become directors or entrepreneurs in the film industry.

## Directional context

In [24]:
word_representations=create_vocab(wiki_data)
count_directional_context(wiki_data, word_representations)
normalize(word_representations)

In [25]:
find_nearest_neighbors(word_representations, 'actor', 10)

0	actor	1.00000
1	actress	0.30555
2	perhaps	0.18018
3	filmmaker	0.15327
4	screenwriter	0.13743
5	writer	0.10975
6	producer	0.09931
7	probably	0.07869
8	musician	0.07860
9	consultant	0.07101


In [26]:
find_shared_contexts(word_representations, "actor", "politician", 10)

0	R: new york	0.00383
1	L: 1973 american	0.00191
2	L: 1942 american	0.00096
3	R: appeared tamil	0.00096


In [27]:
find_shared_contexts(word_representations, "actor", "filmmaker", 10)

0	R: best known	0.13411
1	R: film director	0.01916


In [29]:
find_shared_contexts(word_representations, "actor", "consultant", 10)

0	R: best known	0.07101
