<a href="https://www.kaggle.com/code/mikedelong/eda-with-umap-sentence-transformers?scriptVersionId=158732731" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
%env TOKENIZERS_PARALLELISM=false
!pip install --quiet keybert
print('pip install keybert complete.')

env: TOKENIZERS_PARALLELISM=false
pip install keybert complete.


In [2]:
import pandas as pd
from glob import glob

PATHNAME = '/kaggle/input/sentiment-analysis-datasets/*.csv'

df = pd.concat(axis=1, objs=[pd.read_csv(filepath_or_buffer=input_file) for input_file in glob(pathname=PATHNAME)])
df['date'] = pd.to_datetime(df['Timestamp'])
df['token count'] = df['text'].str.split().apply(func=len)
df = df.drop(columns=['Timestamp', 'Hour', 'Month', 'Year', 'Day', 'worlds', 'value_worlds'])
df.head()

Unnamed: 0,User,Hashtags,Platform,text,Likes,Country,Sentiment,Retweets,date,token count
0,CathedralVisitor,#Awe #ArchitecturalGrandeur,Facebook,Awe-inspired by the grandeur of an ancient cat...,35.0,Czech Republic,Awe,18.0,2018-08-18 14:45:00+00:00,10
1,CathedralVisitor,#Awe #ArchitecturalGrandeur,Facebook,Awe-struck by the grandeur of an ancient cathe...,35.0,Czech Republic,Awe,18.0,2018-08-18 14:45:00+00:00,10
2,GreatWallWalker,#Awe #EngineeringMarvels,Twitter,"Walking the Great Wall of China, each step a t...",35.0,China,Awe,18.0,2017-08-18 19:30:00+00:00,14
3,PuzzleSolver,#Euphoria #PuzzleCompletion,Instagram,Euphoria floods in as the final puzzle piece c...,40.0,Denmark,Euphoria,20.0,2022-06-08 15:30:00+00:00,11
4,PuzzleSolver,#Euphoria #PerfectPuzzle,Instagram,Euphoria floods in as the final puzzle piece f...,40.0,Denmark,Euphoria,20.0,2022-06-08 15:30:00+00:00,10


In [3]:
from plotly.express import histogram
for x in ['Platform', 'Likes', 'token count', 'Country', 'Retweets']:
    histogram(data_frame=df, x=x).show()

In [4]:
df['Sentiment'].nunique(), len(df)

(191, 732)

In [5]:
df['Sentiment'].value_counts().head(n=20)

Sentiment
Positive         45
Joy              44
Excitement       37
Contentment      19
Neutral          18
Gratitude        18
Curiosity        16
Serenity         15
Happy            14
Despair          11
Nostalgia        11
Loneliness        9
Sad               9
Awe               9
Hopeful           9
Grief             9
Embarrassed       8
Confusion         8
Acceptance        8
Determination     7
Name: count, dtype: int64

With 191 distinct sentiment labels across only 732 rows, most classes have just a handful of examples, so we need to pick a subset or combine related labels to get reasonable results.

In [6]:
keep = {'Positive', 'Neutral', 'Despair', 'Sad', 'Loneliness', 'Joy', 'Nostalgia'}
df = df[df['Sentiment'].isin(keep)].copy()
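
The cell above takes the "pick some" route. The "combine some" alternative could be sketched as follows: a hand-written mapping folds fine-grained labels into coarser groups. The `GROUPS` mapping, the group names, and the `combine_labels` helper are illustrative assumptions and not part of the notebook; the sample labels are taken from the value counts above.

```python
from collections import Counter

# Hypothetical mapping from fine-grained labels to coarse groups (an assumption
# made for illustration; the notebook itself only filters, it does not merge).
GROUPS = {
    'Positive': 'positive', 'Joy': 'positive', 'Happy': 'positive',
    'Excitement': 'positive', 'Contentment': 'positive', 'Gratitude': 'positive',
    'Neutral': 'neutral',
    'Despair': 'negative', 'Sad': 'negative', 'Loneliness': 'negative', 'Grief': 'negative',
}


def combine_labels(labels, groups=GROUPS, default='other'):
    """Replace each label with its coarse group; unmapped labels become `default`."""
    return [groups.get(label, default) for label in labels]


sample = ['Joy', 'Sad', 'Neutral', 'Awe', 'Happy', 'Despair']
combined = combine_labels(sample)
counts = Counter(combined)
print(counts)
```

With a DataFrame, the same mapping could be applied with `df['Sentiment'].map(GROUPS).fillna('other')`.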

In [7]:
from arrow import now
from keybert import KeyBERT
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_DF = 1.0
MIN_DF = 2
MODEL = 'all-MiniLM-L12-v2'
STOP_WORDS = 'english'
DOCS = df['text'].values.tolist()

model_start = now()
model = KeyBERT(model=MODEL,)
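# NOTE: this sets an attribute on the KeyBERT wrapper, not on the underlying
# sentence-transformers model, so it probably does not change the sequence limit.
model.max_seq_length = 64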
vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=STOP_WORDS, min_df=MIN_DF, max_df=MAX_DF, )
document_embeddings, word_embeddings = model.extract_embeddings(docs=DOCS, vectorizer=vectorizer, )
print('embedding time: {}'.format(now() - model_start))
print('we have {} documents and {} words.'.format(len(document_embeddings), len(word_embeddings)))
keywords = model.extract_keywords(docs=DOCS, top_n=1, stop_words=STOP_WORDS, vectorizer=vectorizer,
                                  doc_embeddings=document_embeddings, word_embeddings=word_embeddings, min_df=MIN_DF, )
print('model time: {}'.format(now() - model_start))
df['keyword'] = [keyword[0][0] if len(keyword) else '-none-' for keyword in keywords]

embedding time: 0:00:20.020701
we have 147 documents and 193 words.
model time: 0:00:20.148673


In [8]:
import pandas as pd
from plotly.express import scatter
from umap import UMAP

IGNORE = {'-none-', }

umap_start = now()
df['short text'] = df['text'].apply(func=lambda x: ' '.join(x.split()[:20]) + '...' if len(x.split()) > 20 else x)
umap_model = UMAP(n_components=2, random_state=2024, verbose=False, n_jobs=1)
df[['u0', 'u1']] = umap_model.fit_transform(X=document_embeddings)
scatter(data_frame=df[~df['keyword'].isin(IGNORE)], x='u0', y='u1', hover_name='short text',
        hover_data=['keyword', ], color='Sentiment', size='token count').show()
print('UMAP time: {}'.format(now() - umap_start))

UMAP time: 0:00:09.980040


Even with this little model and only a few sentiments, we occasionally see interesting results; for example, sports are associated with sadness and despair.
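
One way to check an observation like this outside the plot is to count how often each keyword co-occurs with each sentiment. The sketch below does that on a few made-up (keyword, sentiment) pairs; the pairs and the `cooccurrence` helper are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def cooccurrence(pairs):
    """Count sentiments per keyword from an iterable of (keyword, sentiment) pairs."""
    table = defaultdict(Counter)
    for keyword, sentiment in pairs:
        table[keyword][sentiment] += 1
    return table


# Made-up pairs standing in for zip(df['keyword'], df['Sentiment']).
pairs = [('sports', 'Sad'), ('sports', 'Despair'), ('sports', 'Sad'),
         ('music', 'Joy'), ('music', 'Positive')]
table = cooccurrence(pairs)
for keyword, sentiments in sorted(table.items()):
    print(keyword, sentiments.most_common())
```

In the notebook itself, `pd.crosstab(df['keyword'], df['Sentiment'])` would give the same counts as a table.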

In [9]:
# let's look at words our little model thinks are related
from sklearn.metrics.pairwise import linear_kernel
words_df = pd.DataFrame(data=linear_kernel(X=word_embeddings), columns=vectorizer.get_feature_names_out())
# we want to ignore self-similarity and focus on medium-strong to strong similarity
words_df = words_df[(words_df < 0.9999) & (words_df > 0.66)]
words = words_df.columns.tolist()

for index, row in words_df.iterrows():
    related = words_df.index[row.notnull()].tolist()
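    # skip alphabetical near-neighbors (within two positions), which are likely inflections of the same word
    related = [item for item in related if abs(item - index) > 2]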
    related_words = [words[item] for item in related]
    if len(related_words):
        print(index, words[index], related_words)

4 age ['old']
7 art ['creativity', 'painting']
25 class ['lecture']
35 creates ['making']
36 creating ['making']
37 creativity ['art']
44 despair ['hopelessness']
47 drowning ['sinking']
53 evening ['night']
66 fitness ['workout']
72 gaming ['sports']
82 hopelessness ['despair']
86 journey ['trip']
92 lecture ['class']
98 loneliness ['solitude']
100 making ['creates', 'creating']
107 music ['tunes']
111 night ['evening']
114 old ['age']
119 painting ['art']
156 sinking ['drowning']
161 solitude ['loneliness']
165 sports ['gaming']
169 sunday ['tonight', 'weekend']
177 tonight ['sunday']
178 trip ['journey']
180 tunes ['music']
186 weekend ['sunday']
188 workout ['fitness']


Our little model does not pick out many true synonyms by cosine similarity, which suggests that most of the semantic work is being done by groups of words rather than by individual words.
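
A note on the similarity measure: the last cell computes plain dot products with `linear_kernel`, and these equal cosine similarities only when the embedding rows have unit length. The self-similarity filter at 0.9999 suggests that holds here, but the notebook does not check it. A minimal sketch of the relationship in pure Python follows; the vectors are made up and the `dot`, `cosine`, and `normalize` helpers are illustrative.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def normalize(u):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]


# Made-up vectors, not real embeddings.
u, v = [3.0, 4.0], [4.0, 3.0]
print(dot(u, v))                        # raw dot product, depends on vector length
print(cosine(u, v))                     # independent of vector length
print(dot(normalize(u), normalize(v)))  # agrees with the cosine once rows are unit length
```

If the rows were not unit length, the 0.66 threshold would be applied to unscaled dot products, so normalizing first (or using a cosine function directly) would be the safer choice.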