This notebook takes as its starting point this post on unsupervised sentiment analysis via clusters of word vectors:
https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483

In [1]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec, KeyedVectors
from sklearn.cluster import KMeans

Load the model created by `train_word2vec.py`

In [2]:
word_vectors = KeyedVectors.load_word2vec_format("reddit_w2v_model.bin", binary=True)

We get vectors for emojis too, which is quite nice.

In [3]:
# Look the emoji's vector up directly, then find its nearest neighbours
word_vectors.similar_by_vector(word_vectors['🙌'], topn=10)

[('🙌', 1.0),
 ('dum', 0.8143207430839539),
 ('💎', 0.8083906173706055),
 ('son', 0.7342033386230469),
 ('carrying', 0.7328704595565796),
 ('jk', 0.7314605712890625),
 ('dry', 0.726060152053833),
 ('drain', 0.7095533013343811),
 ('sept', 0.7064850330352783),
 ('b0t', 0.7048172950744629)]

Identifying just two clusters, one positive and one negative, is not going to happen with Reddit comment data this varied. Instead we can increase the number of clusters and look at what seem like themes: meta talk, meme talk, apparently-serious talk, stats-heavy talk. These themes change quite a lot over time as the dataset grows. That could be linked to upvote numbers and comment volume over time; needs sleeping on.

In [4]:
# Four clusters with a fixed integer seed so the run is reproducible
model = KMeans(n_clusters=4, max_iter=1000, random_state=1, n_init=50).fit(X=word_vectors.vectors.astype('double'))

In [5]:
word_vectors.similar_by_vector(model.cluster_centers_[2], topn=20, restrict_vocab=None)

[('lose)', 0.675972044467926),
 ('valid', 0.6677347421646118),
 ('walking', 0.6415858268737793),
 ('dems', 0.6213172674179077),
 ('nefarious', 0.6207584142684937),
 ('walked', 0.6129131317138672),
 ('loosing', 0.6119007468223572),
 ('infinite', 0.6081537008285522),
 ('hurt', 0.608083963394165),
 ('dare', 0.6052772998809814),
 ('failures', 0.6029701232910156),
 ('lurk', 0.602564811706543),
 ('gaining', 0.6014528274536133),
 ('stack', 0.6014478206634521),
 ('inexperienced', 0.6009200811386108),
 ('>and', 0.5993832945823669),
 ('punishment', 0.5966596603393555),
 ('gamble', 0.5935034155845642),
 ('ruin', 0.5932648777961731),
 ('swinging', 0.5930019617080688)]
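
For reference, `similar_by_vector` ranks vocabulary entries by cosine similarity to the query vector, so the list above is the set of words pointing in roughly the same direction as centre 2. A minimal NumPy sketch of that ranking follows. The four-word vocabulary, its vectors and the `nearest_words` helper are all invented for illustration and are not taken from the Reddit model.

```python
import numpy as np

def nearest_words(query, vocab, vectors, topn=3):
    """Rank vocab by cosine similarity to query (hypothetical helper)."""
    # Normalise rows and the query so a dot product gives cosine similarity
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_vectors @ unit_query
    # Highest similarity first
    order = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in order]

# Toy data, made up for this example only
toy_vocab = ["gain", "lose", "moon", "bot"]
toy_vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.3],
    [-0.2, 1.0],
    [-1.0, -0.5],
])
# Stand-in for model.cluster_centers_[k]: the mean of the first two vectors
toy_centre = toy_vectors[:2].mean(axis=0)

ranked = nearest_words(toy_centre, toy_vocab, toy_vectors)
print(ranked)
```

Running the same lookup for every centre in `model.cluster_centers_`, rather than only index 2, is probably the quickest way to eyeball the theme of each cluster.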

In [6]:
# One row per vocabulary word, with its vector and assigned cluster
words = pd.DataFrame({'words': list(word_vectors.vocab.keys())})
words['vectors'] = words.words.apply(lambda w: word_vectors[w])
# Predict every cluster in a single call instead of once per row
words['cluster'] = model.predict(np.stack(words.vectors.tolist()).astype('double'))

In [7]:
# Closeness score: inverse of the distance to the nearest cluster centre.
# As noted above, the clustering itself is arbitrary.
words['closeness_score'] = words.apply(lambda x: 1/(model.transform([x.vectors]).min()), axis=1)
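
`model.transform` returns each vector's Euclidean distance to every cluster centre, so the score above is the reciprocal of the distance to the nearest centre. The row-wise `apply` can be slow on a large vocabulary. Below is a vectorised NumPy sketch of the same calculation. The `closeness_scores` helper and the toy centres and points are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def closeness_scores(vectors, centres):
    """Return (nearest centre index, 1 / distance to it) for each vector."""
    # Pairwise Euclidean distances, shape (n_vectors, n_centres)
    dists = np.linalg.norm(vectors[:, None, :] - centres[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Reciprocal of the distance to the nearest centre
    scores = 1.0 / dists.min(axis=1)
    return nearest, scores

# Toy data, made up for this example only
toy_centres = np.array([[0.0, 0.0], [10.0, 0.0]])
toy_points = np.array([[3.0, 4.0], [10.0, 2.0], [1.0, 0.0]])

clusters, scores = closeness_scores(toy_points, toy_centres)
print(clusters)
print(scores)
```

With the real model the inputs would presumably be `word_vectors.vectors` and `model.cluster_centers_`. Note that the reciprocal is undefined for a vector sitting exactly on a centre, and that the score has no natural scale, so it is only useful for comparing words with each other.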

In [8]:
words.head(10)

   words   vectors                                            cluster  closeness_score
0  like    [-0.23872057, 0.7668084, -0.06523065, 0.502036...  2        0.121072
1  people  [0.6937657, -0.08785572, -0.29196194, -0.39732...  2        0.126791
2  money   [0.0074918834, 0.2853351, 0.0026786586, 0.3595...  2        0.115928
3  gme     [-0.00830241, -0.24240829, -1.1237249, -0.7349...  0        0.125775
4  get     [0.9307432, -0.7535146, -0.07598917, 0.1320738...  2        0.121322
5  stock   [0.22871694, 0.07093796, -0.7372991, -0.411352...  0        0.110766
6  i'm     [0.43491623, -0.39347756, -0.28166318, 0.50103...  2        0.100814
7  think   [0.13865502, 0.6003073, 0.11263267, 0.3632528,...  2        0.137126
8  still   [0.06291361, 0.48377606, -0.43849382, 0.498615...  0        0.107593
9  would   [0.86194956, 1.0472965, 0.28914222, 0.32642567...  2        0.107005


In [9]:
words[['words', 'cluster']].to_csv('sentiment_dictionary.csv', index=False)