<a href="https://www.kaggle.com/code/mikedelong/visualize-moby-dick-with-word2vec?scriptVersionId=147131068" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
with open(file='/kaggle/input/moby-dick-herman-melville/melville-moby_dick.txt', mode='r', encoding='utf-8', ) as input_fp:
    data = input_fp.read()

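Our text is oddly formatted, with line breaks that do not line up with sentence boundaries, and gensim's Word2Vec expects a list of tokenized sentences. So we collapse the line breaks and split the text into sentences with spaCy's sentence parser. The result is cached to disk so that reruns skip the parse.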
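To show why the line breaks are collapsed before splitting, here is a minimal standard-library sketch. The regex splitter is a crude stand-in for spaCy's sentence parser, not what this notebook uses, and `naive_sentences` and the sample text are made up for illustration.

```python
import re

def naive_sentences(raw_text):
    """Collapse hard line breaks, then split after sentence-ending punctuation."""
    # Turn every run of whitespace (including newlines) into a single space.
    flat = re.sub(r'\s+', ' ', raw_text).strip()
    # Split at whitespace that follows '.', '!' or '?'.
    return [part for part in re.split(r'(?<=[.!?])\s+', flat) if part]

# Made-up, hard-wrapped sample text (not a quotation from the novel).
sample = 'The ship left port at\ndawn. The crew was\nquiet! Was the whale near?'
sentences = naive_sentences(sample)
for sentence in sentences:
    print(sentence)
```

A rule this simple breaks on abbreviations and quoted speech, which is one reason to prefer a trained parser for a real novel.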

In [2]:
from arrow import now
from spacy import load
from os.path import exists
time_start = now()
outfile = '/kaggle/working/moby-dick-formatted.txt'
if exists(path=outfile):
    with open(file=outfile, mode='r', encoding='utf-8') as input_fp:
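        # splitlines() drops the trailing newlines, so cached and fresh runs yield identical documents
        documents = input_fp.read().splitlines()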
else:
    spacy_model = load('en_core_web_sm')
    spacy_model.max_length = 1200000
    spacy_result = spacy_model(data.replace('\n', ' '))
    documents = [item.text for item in spacy_result.sents]
    with open(file=outfile, mode='w', encoding='utf-8') as output_fp:
        for document in documents:
            print(document, file=output_fp)
print(now() - time_start)

0:00:48.471343


In [3]:
from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import strip_multiple_whitespaces
from gensim.parsing.preprocessing import strip_numeric
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import strip_short
from gensim.parsing.preprocessing import strip_tags
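# Filters run in list order. Because remove_stopwords runs before strip_punctuation,
# a stopword attached to punctuation (e.g. 'them,') is not matched and stays in the text;
# this is why 'them' and 'yet' show up in the similarity results further down.
CUSTOM_FILTERS = [lambda x: x.lower(),
                  remove_stopwords,
                  strip_multiple_whitespaces,
                  strip_numeric,
                  strip_punctuation,
                  strip_short,
                  strip_tags,
                 ]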
texts = [preprocess_string(s=document, filters=CUSTOM_FILTERS) for document in documents]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
print(dictionary)

Dictionary<16573 unique tokens: ['chapter', 'loomings', 'ishmael', 'ago', 'having']...>
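The order of the filters matters. The sketch below uses simplified, hypothetical stand-ins (`drop_stopwords`, `drop_punctuation`, a four-word stopword list) rather than the gensim functions themselves, and it assumes, as we understand gensim's `remove_stopwords`, that stopwords are matched against whitespace-separated tokens.

```python
import string

# Hypothetical mini stopword list; gensim ships a much larger one.
STOPWORDS = {'them', 'yet', 'the', 'and'}

def drop_stopwords(text):
    # Whitespace tokens are compared verbatim, so 'them,' does not match 'them'.
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

def drop_punctuation(text):
    # Replace every punctuation character with a space.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return text.translate(table)

def apply_filters(text, filters):
    # Apply each filter in list order, then tokenize on whitespace.
    for filter_fn in filters:
        text = filter_fn(text)
    return text.split()

sentence = 'yet, the crew saw them, and rowed on.'
stopwords_first = apply_filters(sentence, [str.lower, drop_stopwords, drop_punctuation])
punctuation_first = apply_filters(sentence, [str.lower, drop_punctuation, drop_stopwords])
print(stopwords_first)
print(punctuation_first)
```

With stopwords removed first, 'yet' and 'them' survive because they were glued to commas when the stopword filter ran. Stripping punctuation first removes them.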


In [4]:
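# max_vocab_size caps memory during vocabulary building: when the number of unique words exceeds it,
# gensim prunes the least frequent ones. Together with min_count (default 5) this determines
# how many low-frequency tokens we keep, and so also our runtime.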
MAX_VOCAB_SIZE = 5000

In [5]:
from gensim.models import Word2Vec
time_start = now()
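# note: with workers > 1 the thread scheduling can make results vary between runs, even with a fixed seed
word2vec_model = Word2Vec(sentences=texts, vector_size=100, window=5, workers=4, seed=2023, max_vocab_size=MAX_VOCAB_SIZE)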
print('vocabulary size: {}'.format(len(word2vec_model.wv)))
print(now() - time_start)

vocabulary size: 1459
0:00:00.379682


In [6]:
word2vec_model.wv.most_similar(topn=10, positive=['whale'])

[('sperm', 0.9987097978591919),
 ('white', 0.9982337951660156),
 ('case', 0.9981936812400818),
 ('ships', 0.9981038570404053),
 ('great', 0.9980811476707458),
 ('whalemen', 0.9980697631835938),
 ('right', 0.9980660676956177),
 ('seen', 0.9980177879333496),
 ('teeth', 0.9979925751686096),
 ('back', 0.9979572296142578)]
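The scores returned by `most_similar` are cosine similarities. A minimal standard-library sketch of that measure follows; `cosine_similarity` is an illustrative helper and the 3-d vectors are made up, not taken from the model.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration.
nearly_parallel = cosine_similarity([1.0, 2.0, 3.0], [1.1, 2.0, 2.9])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(nearly_parallel, 4), orthogonal)
```

Scores that all sit above 0.99, as in the output above, mean the vectors point in almost the same direction. That may indicate that training on a single novel has not separated the words much, so the ranking is probably best read with caution.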

In [7]:
word2vec_model.wv.most_similar(topn=10, positive=['air'])

[('beneath', 0.9995913505554199),
 ('hold', 0.9995598793029785),
 ('going', 0.9995550513267517),
 ('harpooneer', 0.9995514154434204),
 ('tail', 0.999510645866394),
 ('yet', 0.9995103478431702),
 ('side', 0.9995073080062866),
 ('water', 0.9995055794715881),
 ('them', 0.9994984865188599),
 ('room', 0.9994925260543823)]

In [8]:
from math import log10
from pandas import DataFrame
from sklearn.manifold import TSNE
time_start = now()
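init = ['pca', 'random'][1]  # index 0 selects PCA initialization, 1 random; choose this to see different shapes
N_COMPONENTS = 3  # we get more diffusion with 3 t-SNE components; must match the three column names below
# note: scikit-learn 1.5 renamed n_iter to max_iter
tsne = TSNE(random_state=2023, n_iter=1000, verbose=1, init=init, n_components=N_COMPONENTS,)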
tsne_result = tsne.fit_transform(X=word2vec_model.wv.vectors)
plot_df = DataFrame(data=tsne_result, columns=['x', 'y', 'z'])
plot_df['word'] = list(word2vec_model.wv.key_to_index.keys())
plot_df['weight'] = plot_df['word'].apply(func=lambda x: word2vec_model.wv[x].sum())
plot_df['count'] = plot_df['word'].apply(func=lambda x: log10(word2vec_model.wv.get_vecattr(key=x, attr='count')))
print(now() - time_start)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1459 samples in 0.001s...
[t-SNE] Computed neighbors for 1459 samples in 0.121s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1459
[t-SNE] Computed conditional probabilities for sample 1459 / 1459
[t-SNE] Mean sigma: 0.026209
[t-SNE] KL divergence after 250 iterations with early exaggeration: 52.385521
[t-SNE] KL divergence after 1000 iterations: 0.499449
0:00:11.624161


In [9]:
from plotly.express import scatter
scatter(data_frame=plot_df, x='y', y='z', hover_name='word', color='weight')

Not surprisingly, the t-SNE layout tracks our vector weights (the sum of each word vector's components).

In [10]:
scatter(data_frame=plot_df, x='y', y='z', hover_name='word', color='count')

The log of the count (log10 of each word's frequency in the corpus) and the weight are related but not the same.
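One way to put a number on "related but not the same" is a Pearson correlation between the two columns. The sketch below is a standard-library illustration; `pearson` is a hypothetical helper and the values are made up, standing in for `plot_df['weight']` and the raw word counts.

```python
from math import log10, sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only; they are not results from this notebook.
toy_weights = [0.9, 1.4, 2.1, 2.0, 3.2]
toy_counts = [5, 12, 40, 150, 900]
toy_log_counts = [log10(count) for count in toy_counts]
r = pearson(toy_weights, toy_log_counts)
print(round(r, 3))
```

In the notebook itself the same quantity should be available as `plot_df['weight'].corr(plot_df['count'])`, since the `count` column already holds log10 values.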

In [11]:
from sklearn.cluster import KMeans
N_CLUSTERS = 50
kmeans_model = KMeans(n_clusters=N_CLUSTERS, verbose=0, max_iter=1000, random_state=2023, n_init='auto')
kmeans_result = kmeans_model.fit_transform(X=word2vec_model.wv.vectors)
plot_df['cluster'] = kmeans_model.labels_
scatter(data_frame=plot_df, x='y', y='z', hover_name='word', color='cluster')
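The scatter plot shows where the clusters fall, but reading them is easier with the words listed per cluster. A minimal sketch, assuming a list of words and a parallel list of cluster labels; `words_by_cluster` is a hypothetical helper and the toy inputs are made up.

```python
from collections import defaultdict

def words_by_cluster(words, labels):
    """Group words by their cluster label, preserving input order within each group."""
    groups = defaultdict(list)
    for word, label in zip(words, labels):
        groups[int(label)].append(word)
    return dict(groups)

# Toy inputs for illustration; they are not the notebook's clustering output.
toy_words = ['whale', 'sperm', 'ship', 'boat', 'sea']
toy_labels = [0, 0, 1, 1, 2]
clusters = words_by_cluster(toy_words, toy_labels)
for label in sorted(clusters):
    print(label, clusters[label])
```

In the notebook this could be called as `words_by_cluster(plot_df['word'], kmeans_model.labels_)` to inspect each of the 50 clusters.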