This notebook takes the following post on unsupervised sentiment analysis via clusters of word vectors as its starting point:
https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483

In [1]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec, KeyedVectors
from glove import Glove
from kmeans_pytorch import kmeans
from sklearn.cluster import KMeans
from sentiment_anomaly.models import train_torch_vocab
import torch

Load the model created by `train_word2vec.py`

In [2]:
word_vectors = KeyedVectors.load_word2vec_format("reddit_w2v_model.bin", binary=True)

We get vectors for emojis too, which is quite nice.

In [3]:
word_vectors.similar_by_vector(word_vectors['🙌'], topn=5)

[('🙌', 1.0),
 ('🙌🏻', 0.6535108685493469),
 ('✋', 0.652143120765686),
 ('💎', 0.6348857879638672),
 ('🍆', 0.630488395690918)]
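
The scores in this output are cosine similarities between the query vector and every vector in the vocabulary. A minimal numpy sketch of that ranking follows, using a made-up three-word vocabulary; the words, the numbers and the helper name `rank_by_cosine` are illustrative and not taken from the model or from gensim.

```python
import numpy as np

# Hypothetical toy vocabulary and vectors, for illustration only.
vocab = ['moon', 'rocket', 'loss']
vectors = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [-1.0, 0.0],
])

def rank_by_cosine(query, topn=2):
    """Rank the toy vocabulary by cosine similarity to `query`."""
    # Scale every vector to unit length so a dot product is a cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = unit @ q
    order = np.argsort(-sims)[:topn]  # highest similarity first
    return [(vocab[i], float(sims[i])) for i in order]

ranked = rank_by_cosine(vectors[0], topn=3)
print(ranked)
```

Querying with a word's own vector puts that word first with a score of 1.0, which is why '🙌' heads the list above.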

Identifying just two clusters, positive and negative, in data as varied as reddit comments is not going to happen. Increasing the number of clusters instead surfaces what look like themes: meta talk, meme talk, apparently-serious talk and stats-heavy talk. These shift quite a lot over time as the dataset grows. One option would be to create sets of embeddings over time slices and look at their distance from reference clusters or from one another. That could be linked to upvote numbers and comment volume over time, but it needs sleeping on.

In [4]:
# Fixed integer seed so the cluster numbering is reproducible between runs
model = KMeans(n_clusters=4, max_iter=1000, random_state=1, n_init=50).fit(X=word_vectors.vectors.astype('double'))

In [5]:
word_vectors.similar_by_vector(model.cluster_centers_[2], topn=20, restrict_vocab=None)

[('money', 0.4843536615371704),
 ('think', 0.43711191415786743),
 ('sell', 0.434577077627182),
 ('lose', 0.4340251088142395),
 ('people', 0.4328373074531555),
 ('loose', 0.42213934659957886),
 ('know', 0.4151310920715332),
 ('happen', 0.4144483804702759),
 ('really', 0.40265196561813354),
 ('want', 0.394616961479187),
 ('gamble', 0.3945499062538147),
 ('make', 0.3851986825466156),
 ('understand', 0.3792017698287964),
 ('possible', 0.37648141384124756),
 ('squeeze', 0.3753488063812256),
 ('blame', 0.37441205978393555),
 ('still', 0.36940059065818787),
 ('stock', 0.36777323484420776),
 ('betting', 0.36734893918037415),
 ('hedges', 0.3668730854988098)]

In [6]:
# One row per vocabulary word, with its vector and the cluster it falls in
words = pd.DataFrame({'words': list(word_vectors.vocab.keys())})
words['vectors'] = words.words.apply(lambda x: word_vectors[x])
words['cluster'] = words.vectors.apply(lambda x: model.predict([np.array(x)])[0])

In [7]:
# Closeness to the nearest cluster centre (inverse distance); as above, what the clusters mean is arbitrary
words['closeness_score'] = words.apply(lambda x: 1/(model.transform([x.vectors]).min()), axis=1)
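
The per-row `apply` calls above can be slow on a large vocabulary. As a sketch of what they compute, the distances from every vector to every centre can be taken in one array operation; `KMeans.transform` returns these same Euclidean distances. The arrays below are made up for illustration and stand in for the real vectors and centres.

```python
import numpy as np

# Made-up stand-ins for word_vectors.vectors and model.cluster_centers_.
vectors = np.array([[0.0, 1.0], [4.0, 4.0], [0.0, 3.0]])
centres = np.array([[0.0, 0.0], [4.0, 0.0]])

# distances[i, j] is the Euclidean distance from vector i to centre j.
distances = np.linalg.norm(vectors[:, None, :] - centres[None, :, :], axis=2)

cluster = distances.argmin(axis=1)           # index of the nearest centre
closeness_score = 1 / distances.min(axis=1)  # inverse distance to that centre

print(cluster)
print(closeness_score)
```

With the fitted model, `model.transform(word_vectors.vectors.astype('double'))` should give the full distance matrix in a single call, assuming the rows of `word_vectors.vectors` are in the same order as the rows of `words`.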

In [8]:
words.head(10)

index,words,vectors,cluster,closeness_score
0,like,"[-0.10896601, -0.4989108, -0.816069, 0.6278063...",2,0.109177
1,people,"[0.05315982, -0.9963219, -1.1672673, -0.550229...",2,0.094861
2,gme,"[0.09548321, -0.55390525, -1.3246208, 0.112597...",2,0.097914
3,money,"[1.0587951, -1.2309575, -1.3329929, 0.24446851...",2,0.084312
4,get,"[-0.59698033, -1.3145387, -1.1092944, -0.36285...",2,0.091951
5,stock,"[1.2761588, -0.9259336, -1.6981211, 0.03803802...",2,0.082326
6,think,"[0.28200936, 0.41803923, 0.26620647, 0.8017928...",2,0.096865
7,would,"[-0.08824884, 0.83257955, -0.52572507, 0.68127...",2,0.080834
8,shares,"[0.23181276, 0.5602848, -1.1762869, -0.5062161...",3,0.072099
9,still,"[-0.6590462, -0.11133627, -0.4280134, 0.384516...",2,0.0864


In [9]:
words[['words', 'cluster']].to_csv('sentiment_dictionary.csv', index=False)

Try the GloVe model created with `train_glove.py`. The results are less intuitive than word2vec's.

In [10]:
glove = Glove.load('reddit.glove.model')

# Only the result of the last expression is displayed, so the output below is for 'apes'
glove.most_similar('🙌', number=10)
glove.most_similar('apes', number=10)

[('together', 0.8503645499106047),
 ('stonker', 0.833585989646089),
 ('strong', 0.7843598709353422),
 ('ape', 0.7447651565484497),
 ('planet', 0.7352073987340947),
 ('purpose', 0.7232556207133627),
 ('nanners', 0.6463758844633478),
 ('band', 0.6295576290415956),
 ('idiots', 0.6164448270558343)]

In [11]:
# It did say this was experimental
# glove.most_similar_paragraph(['apes', 'together', 'strong'])

Another approach to GloVe: a set of word vectors generated separately with the Stanford CoreNLP implementation is passed to a `torchtext.vocab`; see the docstrings in `sentiment_anomaly.models` for details. This is a short-term cop-out, because the vocabulary gets rebuilt on the fly but the vectors don't.

In [12]:
voc = train_torch_vocab(vectors='vectors.txt')

75703lines [00:08, 9003.90lines/s] 


In [13]:
voc.freqs.most_common(10)

[('like', 8250),
 ('people', 7929),
 ('gme', 6691),
 ('money', 6173),
 ('get', 5680),
 ('would', 4762),
 ('stock', 4593),
 ('think', 4521),
 ('one', 4389),
 ("i'm", 4316)]

In [14]:
voc.vectors

tensor([[ 0.1767, -0.0937,  0.0285,  ..., -0.1350, -0.2213,  0.1102],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.6638,  0.0854,  0.9071,  ...,  0.0020,  0.4532, -0.9754],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [15]:
torch.norm(voc.vectors[voc['sell']] - voc.vectors[voc['shares']])

tensor(5.2047)

In [16]:
torch.norm(voc.vectors[voc['🙌']] - voc.vectors[voc['💎']])

tensor(1.5992)

In [17]:
cluster_ids_x, cluster_centers = kmeans(
    X=voc.vectors, num_clusters=4, distance='cosine', device=torch.device('cuda:0'), iter_limit=1000
)

running k-means on cuda:0..


[running kmeans]: 1000it [00:39, 40.03it/s, center_shift=nan, iteration=1000, tol=0.000100]

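The centre shift comes out as `nan` and the run only stops at the iteration limit, so these clusters can't be trusted.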
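
A likely cause of the `nan`, offered as a guess rather than a diagnosis: `voc.vectors` contains all-zero rows (visible in the `voc.vectors` output above), and a cosine distance has to divide each row by its norm, which is 0/0 for those rows. The sketch below reproduces that on a small made-up array in numpy and shows one way to mask the zero rows out before clustering.

```python
import numpy as np

# Made-up stand-in for voc.vectors: rows 1 and 3 are all zeros, like the
# all-zero rows in the real tensor.
vectors = np.array([
    [0.18, -0.09, 0.03],
    [0.00, 0.00, 0.00],
    [-0.66, 0.09, 0.91],
    [0.00, 0.00, 0.00],
])

norms = np.linalg.norm(vectors, axis=1, keepdims=True)

# Scaling to unit length (the first step of a cosine distance) gives
# 0/0 = nan for the zero rows.
with np.errstate(invalid='ignore'):
    unit = vectors / norms
has_nan = np.isnan(unit).any(axis=1)

# Keep only the rows that actually carry a vector before clustering.
keep = norms.ravel() > 0
filtered = vectors[keep]

print(has_nan)         # which rows break under normalisation
print(filtered.shape)  # rows left for clustering
```

On the real tensor the equivalent mask would be something like `voc.vectors.norm(dim=1) > 0`. The ids returned by `kmeans` would then refer to the filtered rows, so the mask needs keeping around to map them back to words.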