In this lab, we're going to work with word embeddings again. So far we've used embeddings to look at word similarity. Today we're going to use k-means clustering to turn word embeddings into word classes: groups of words that the model thinks have a similar meaning.

Let's start by getting our environment ready.

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd

ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


We'll start by loading our pre-trained embeddings. I'm using the embeddings learned from news articles, but you can use any of the others by changing the file name:

     tweets: "sociolinguistics.english_all.gz"
     hotel reviews: "economic.hotel_reviews.gz"
     news articles: "economic.nyt.1931-2016.gz"
     speeches: "economic.congress.1931-2016.gz"
     19th century books: "stylistics.gutenberg_all.gz"

In [2]:
file = "economic.nyt.1931-2016.gz"
file = os.path.join(ai.data_dir, file)

#df = pd.read_csv(file)
#print(df)
#ai.train_word2vec(df)

ai.word_vectors = ai.deserialize("w2v_embedding", file + ".w2v_embedding.json")
ai.word_vectors_vocab = ai.deserialize("w2v_vocab", file + ".w2v_vocab.json")

#Build an index of each word
y = list(ai.word_vectors_vocab.keys())
    
print(ai.word_vectors)
print(list(ai.word_vectors_vocab.keys())[0:20])

[[-0.1235968   0.04748945  0.16285081 ...  0.03241363 -0.03300684
   0.26601192]
 [-0.0030584   0.04572405  0.10358463 ... -0.00173236 -0.10827369
   0.20938973]
 [-0.19914334  0.03242941  0.0766376  ...  0.12763536  0.02289162
   0.0425519 ]
 ...
 [ 0.05791736  0.24080583  0.11852352 ...  0.13066971  0.03170295
   0.20842037]
 [-0.05558461  0.00972249  0.02807915 ...  0.00696762  0.19038115
   0.08795328]
 [ 0.04808212  0.17134945 -0.06976075 ...  0.02019391  0.03982212
   0.0915439 ]]
['the_DET', 'of_ADP', 'a_DET', 'and_CCONJ', 'in_ADP', 'number_NOUN', 'to_PART', 'to_ADP', 'for_ADP', 'on_ADP', 'at_ADP', 'is_AUX', 'by_ADP', 'was_AUX', 'with_ADP', 'that_SCONJ', 'as_SCONJ', 'his_DET', 'from_ADP', 'it_PRON']


Let's take a look at some of our vocabulary words. Remember that we've joined phrases using PMI and we added part-of-speech tags as an easy way to tell different words apart, even if they have the same string.

In [3]:
import random

# random.sample needs a sequence, so convert the dict keys to a list first
sample = random.sample(list(ai.word_vectors_vocab.keys()), 10)
print(sample)

['banca_PROPN', 'bac_PROPN', 'earthshaking_ADJ', 'kennebunkport_PROPN', 'whangpoo_NOUN', 'readycut_PROPN', 'bayeux_PROPN', 'wilsey_PROPN', 'hulahoop_NOUN', 'tryons_NOUN']
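
Since every vocabulary entry has the form word_TAG, it's easy to pull the word and its part-of-speech tag apart again. Here is a minimal sketch using labels from the output above; the helper name split_label is my own and is not part of text_analytics. It splits at the last underscore, on the assumption that the tag is always the final piece.

```python
# Split "word_TAG" labels into (word, tag) pairs.
# rsplit with maxsplit=1 cuts at the LAST underscore, so the tag stays
# separate even if the word part happened to contain an underscore.
def split_label(label):
    word, tag = label.rsplit("_", 1)
    return word, tag

labels = ["to_PART", "to_ADP", "number_NOUN", "his_DET"]
pairs = [split_label(label) for label in labels]
print(pairs)

# Group tags by word to find strings that appear with more than one tag.
tags_by_word = {}
for word, tag in pairs:
    tags_by_word.setdefault(word, set()).add(tag)

ambiguous = sorted(w for w, tags in tags_by_word.items() if len(tags) > 1)
print(ambiguous)
```

This is why "to" shows up twice in the vocabulary: once as a particle and once as a preposition, each with its own embedding.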


Let's use our k-means clustering that we used before. The only difference is that we're doing it with word embeddings this time!

    Our *x* variable is the word embeddings themselves, a table of vectors for each word.

    Our *y* variable is the word-form that corresponds with each embedding.

    Our *k* variable is the number of clusters we want; this is 1,000 because we have a lot of words.

    And we aren't comparing this with a ground-truth label, so we leave ari set to False.

In [4]:
cluster_df = ai.cluster(x = ai.word_vectors, y = y, k = 1000)
print(cluster_df)

                   Label Cluster
0                the_DET     706
1                 of_ADP     659
2                  a_DET     797
3              and_CCONJ     370
4                 in_ADP     399
...                  ...     ...
281042   hierarchs_PROPN     709
281043     whiteonly_ADJ     622
281044  churubusco_PROPN     799
281045        amwest_ADJ     217
281046     citroens_NOUN      21

[281047 rows x 2 columns]
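
If you're curious what a call like this might be doing, here is a rough sketch using scikit-learn's KMeans on a tiny made-up example. This is an assumption on my part: I'm not showing how ai.cluster is actually implemented, and the toy vectors and labels below are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for ai.word_vectors: two tight groups of 2-d "embeddings".
toy_vectors = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],   # group near the origin
    [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],   # group far from the origin
])

# Toy stand-in for y: one (invented) label per row, in the same order.
toy_labels = ["cat_NOUN", "dog_NOUN", "horse_NOUN",
              "run_VERB", "walk_VERB", "jump_VERB"]

# n_clusters plays the role of k; random_state fixes the random start.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
assignments = kmeans.fit_predict(toy_vectors)

# Pair each label with its cluster id, like the table printed above.
toy_df = pd.DataFrame({"Label": toy_labels, "Cluster": assignments})
print(toy_df)
```

The cluster numbers themselves are arbitrary: what matters is which words share a number, not what the number is.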


And there you go! Now every word has been assigned to one of 1,000 clusters. Putting words in the same cluster means that the machine thinks they go together, based on their distribution: it thinks these words have similar meanings. But do they? Let's take a look. We'll print out a random cluster.

In [5]:
cluster = random.randint(0,999)
pd.set_option('display.max_rows', None)
select_df = cluster_df.loc[cluster_df["Cluster"] == cluster]
print(select_df)

                         Label Cluster
3590             behavior_NOUN      26
4342                 acts_NOUN      26
9804               racism_NOUN      26
9900                 bias_NOUN      26
11105           brutality_NOUN      26
12150         persecution_NOUN      26
12455             slavery_NOUN      26
12464           prejudice_NOUN      26
13196                 sin_NOUN      26
13513        exploitation_NOUN      26
13639                hate_NOUN      26
14118              hatred_NOUN      26
14364           injustice_NOUN      26
14959       homosexuality_NOUN      26
16097            betrayal_NOUN      26
17425                sins_NOUN      26
18446             cruelty_NOUN      26
18771               evils_NOUN      26
19688            disgrace_NOUN      26
20134              morals_NOUN      26
21143             bigotry_NOUN      26
22626         intolerance_NOUN      26
23899            adultery_NOUN      26
26594         retribution_NOUN      26
27743        mistreatment

Every time you run this, you'll get a different cluster, a different group of words, so I don't know exactly what you're seeing. The run printed above drew cluster 26, which mostly groups nouns for wrongdoing and prejudice: racism, bias, brutality, persecution, injustice, bigotry. On another run, I got a cluster containing names of social places: fair, church, stadium, gallery, studio, parade, resort, casino, cinema. These are physical locations or events that have a specific social function, such as entertainment.
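
Instead of drawing one random cluster at a time, you could also group the whole table by cluster id. Here is a small sketch with pandas, assuming a table with the same Label and Cluster columns as cluster_df. The miniature table is made up for illustration: the words come from the examples in this lab, but the cluster id 3 is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of cluster_df, with the same two column names.
toy_df = pd.DataFrame({
    "Label": ["racism_NOUN", "church_NOUN", "bias_NOUN",
              "casino_NOUN", "hatred_NOUN"],
    "Cluster": [26, 3, 26, 3, 26],
})

# Collect the labels of each cluster into one list per cluster id.
words_by_cluster = toy_df.groupby("Cluster")["Label"].apply(list)

# Count how many words landed in each cluster, largest cluster first.
sizes = toy_df["Cluster"].value_counts()

for cluster_id, words in words_by_cluster.items():
    print(cluster_id, len(words), words)
```

On the real table, the size counts are a quick way to spot very large clusters, which may be worth reading through by hand.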

And this tells us that our word embeddings, our distance measure, and our clustering algorithm can work together to tell us about language and about the world!