# Creating 2D embeddings using TSNE to visualize the data
A 2D embedding makes it much easier to spot patterns in the data visually.

In [74]:
import numpy as np
import random
import pandas as pd 
import altair as alt
from sklearn.manifold import TSNE

Read the words from words.txt and collect the unique characters that appear in them.

In [12]:
# Read one word per line, lowercased; skip empty lines (e.g. from a trailing newline)
with open('words.txt','r') as f:
    words = [w.lower() for w in f.read().split('\n') if w]

# Sorted list of every character that occurs in any word
unique_ch = sorted({c for w in words for c in w})

# Map each character to its position in the count vector
c_idx_map = {c: i for i, c in enumerate(unique_ch)}


The function below vectorizes a word. It is a naive encoding that counts how many times each character appears in the word, so the similarity between two words depends on the characters they contain rather than on their meaning. `get_k_closest` then returns the indices of the k words whose vectors are closest, in Euclidean distance, to the seed word's vector.

In [59]:
def vectorize(word,c_idx_map):
    v = np.zeros(len(c_idx_map),dtype=float)
    for c in word:
        v[c_idx_map[c]]+=1
    return v

def get_k_closest(seed_word,all_word_vecs,k=100):
    # Euclidean distance from the seed word's vector to every word vector
    u = vectorize(seed_word,c_idx_map)
    dists = np.linalg.norm(np.asarray(all_word_vecs)-u, axis=1)
    return dists.argsort()[:k]
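
One consequence of counting characters is that anagrams get identical vectors, so their distance is zero. The sketch below illustrates this on a few hand-picked words. It assumes a fixed a-z alphabet for the index map rather than the character set derived from words.txt, and `count_vector` is an illustrative stand-in for `vectorize`.

```python
import string
import numpy as np

# Assumption: a fixed a-z alphabet instead of the characters found in words.txt.
char_index = {c: i for i, c in enumerate(string.ascii_lowercase)}

def count_vector(word, char_index):
    """Return the character-count vector for `word` (stand-in for vectorize)."""
    v = np.zeros(len(char_index), dtype=float)
    for c in word:
        v[char_index[c]] += 1
    return v

listen = count_vector("listen", char_index)
silent = count_vector("silent", char_index)
apple = count_vector("apple", char_index)
ample = count_vector("ample", char_index)

# Anagrams contain exactly the same characters, so the vectors coincide.
anagram_dist = np.linalg.norm(listen - silent)

# 'ample' replaces one 'p' of 'apple' with an 'm': two counts differ by 1 each.
one_swap_dist = np.linalg.norm(apple - ample)

print(anagram_dist, one_swap_dist)
```

Character order is ignored entirely, which is why the encoding captures spelling overlap and nothing about meaning.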
    

To understand clustering, we create a subset of words with three explicit clusters and check whether we can recover them. We choose 'apple', 'banana' and 'kiwi' as seed words and find the 100 words closest to each of them. Since these seed words share few characters, we expect to see three distinct clusters.

In [60]:
word_vectors = [vectorize(w,c_idx_map) for w in words]

apple_cluster = get_k_closest('apple',word_vectors,k=100)
banana_cluster = get_k_closest('banana',word_vectors,k=100)
kiwi_cluster = get_k_closest('kiwi',word_vectors,k=100)

h = list(apple_cluster) + list(banana_cluster) + list(kiwi_cluster)
print(len(h))
X = np.array([word_vectors[i] for i in h])
lst_w = [words[i] for i in h]
clusters = ['apple']*len(apple_cluster) + ['banana']*len(banana_cluster) + ['kiwi']*len(kiwi_cluster)
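
The three neighbor lists are computed independently, so in principle the same word could be selected for more than one seed and then appear twice with different labels. A quick check along the following lines can confirm whether that happens. This is only a sketch: the index arrays are made-up stand-ins for the outputs of `get_k_closest`, and `shared_indices` is a hypothetical helper.

```python
import numpy as np

# Hypothetical index arrays standing in for the outputs of get_k_closest.
neighbors = {
    "apple": np.array([0, 4, 7, 9]),
    "banana": np.array([2, 4, 5, 8]),
    "kiwi": np.array([1, 3, 6, 9]),
}

def shared_indices(neighbors):
    """Map each pair of seeds to the sorted indices found in both neighbor lists."""
    seeds = list(neighbors)
    shared = {}
    for i, a in enumerate(seeds):
        for b in seeds[i + 1:]:
            common = np.intersect1d(neighbors[a], neighbors[b])
            if common.size:
                shared[(a, b)] = common.tolist()
    return shared

overlap = shared_indices(neighbors)
print(overlap)
```

An empty result means every selected word belongs to exactly one seed's neighbor list.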

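Here we use t-SNE to obtain a 2D representation of the data points (words) so that we can visualize them. t-SNE converts the distances between the given data points into pairwise similarities and looks for a 2D layout in which points that are close in the original space stay close together. It preserves local neighborhoods rather than exact distances, so the spacing between clusters in the plot should not be read literally.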

In [66]:
X_embedded = TSNE(n_components=2, 
                   init='random').fit_transform(X)
X_embedded.shape

(300, 2)
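
With `init='random'` and no seed, the layout changes from run to run. A minimal sketch of pinning it down with `random_state` is shown below. It runs on synthetic points rather than the word vectors, and the `perplexity` value and the `embed` helper are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for X: three groups of 20 points in 5 dimensions.
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
points = np.vstack([c + rng.normal(size=(20, 5)) for c in centers])

def embed(data, seed):
    """Run t-SNE with a fixed seed; perplexity must be below the sample count."""
    tsne = TSNE(n_components=2, init='random', perplexity=10, random_state=seed)
    return tsne.fit_transform(data)

first = embed(points, seed=0)
second = embed(points, seed=0)

# With the same seed, the two runs are expected to produce matching layouts.
print(first.shape, np.allclose(first, second))
```

Fixing the seed makes a plot repeatable, but a different seed can rotate or rearrange the clusters, so only the grouping itself is meaningful.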

Visualize the embedded data, colored by seed word; hovering over a point shows the word. We can clearly see three distinct clusters, as expected.

In [73]:

source = {'X':X_embedded[:,0],'Y':X_embedded[:,1],'Word':lst_w,'Center':clusters}
source = pd.DataFrame(source)
alt.Chart(source).mark_circle(size=60).encode(
    x='X',
    y='Y',
    color='Center',
    tooltip=['Word']
).interactive()