
In [1]:
%env TOKENIZERS_PARALLELISM=false
!pip install --quiet keybert
print('pip installed keybert')

env: TOKENIZERS_PARALLELISM=false
pip installed keybert


In [2]:
import re
import pandas as pd
from datetime import datetime
from time import mktime

def clean(arg: str) -> str:
    return re.sub('[^a-zA-Z0-9. /,]', ' ', arg)

def short_review(arg: str) -> str:
    pieces = arg.split()
    return ' '.join(pieces[:20]) + '...'

# https://stackoverflow.com/a/6451892
def since(date):
    return mktime(date.timetuple())

def year_fraction(date) -> float:
    this_year = datetime(year=date.year, month=1, day=1)
    next_year = datetime(year=1 + date.year, month=1, day=1)
    return date.year + (since(date) - since(this_year))/(since(next_year) - since(this_year))

filename = '/kaggle/input/cyberpunk-2077-steam-reviews/cyberpunk_2077_filtered.csv'
usecols = ['language', 'review', 'updated']
df = pd.read_csv(filepath_or_buffer=filename, parse_dates=['updated'], usecols=usecols)
# let's only look at reviews in English 
df = df[df['language'] == 'english']
# not all non-English reviews are labeled correctly
# and we want to remove emoji
# so we need to work a little harder
df['clean'] = df['review'].apply(func=clean)
# we need to be mindful of review length
df['token count'] = df['clean'].str.split().str.len()
df = df[df['token count'] > 3]
# we want to be able to get time slices
df['year'] = df['updated'].apply(func=year_fraction) 
df = df.drop(columns=['language', 'updated'])
# we need to truncate the review we show in the plot
df['short review'] = df['review'].apply(func=short_review)
df.head()

Unnamed: 0,review,clean,token count,year,short review
0,It's very fun. I don't usually like open world...,It s very fun. I don t usually like open world...,42,2023.947945,It's very fun. I don't usually like open world...
6,Coming back to try the game after 2.0 came out...,Coming back to try the game after 2.0 came out...,80,2023.947945,Coming back to try the game after 2.0 came out...
10,i dont even own this fucking game why can i wr...,i dont even own this fucking game why can i wr...,13,2023.947945,i dont even own this fucking game why can i wr...
11,Todo valio la pena al final con el mejor endin...,Todo valio la pena al final con el mejor endin...,50,2023.947945,Todo valio la pena al final con el mejor endin...
12,I am a sneaky boi and I stab people with arm s...,I am a sneaky boi and I stab people with arm s...,13,2023.947945,I am a sneaky boi and I stab people with arm s...
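
A quick check of `clean` on individual strings helps explain the table above. The sketch below repeats the function from the cell and applies it to one made-up input (the emoji string is illustrative, not taken from the dataset) and to the opening words of row 11. Apostrophes and emoji become spaces, while ASCII-only text in another language passes through unchanged, which is one way a non-English review can survive the cleaning step.

```python
import re

def clean(arg: str) -> str:
    # same pattern as in the cell above: keep letters, digits, '.', ' ', '/' and ','
    return re.sub('[^a-zA-Z0-9. /,]', ' ', arg)

# the apostrophe and the emoji are outside the allowed set, so each becomes one space
emoji_review = "It's fun \U0001F600"
cleaned = clean(emoji_review)
print(repr(cleaned))
# whitespace tokens left after cleaning; a review this short fails the `> 3` length filter
print(len(cleaned.split()))

# ASCII-only text in another language is untouched by the regex
spanish_review = 'Todo valio la pena al final'
print(clean(spanish_review) == spanish_review)
```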


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 227200 entries, 0 to 612379
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   review        227200 non-null  object 
 1   clean         227200 non-null  object 
 2   token count   227200 non-null  int64  
 3   year          227200 non-null  float64
 4   short review  227200 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 10.4+ MB


In [4]:
from plotly.express import histogram
histogram(data_frame=df, x='year', log_y=True).show()

These spikes are striking; we might expect them to coincide with events such as updates or minor releases.
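
To compare a spike with a known event date, it helps to turn a fractional year back into a calendar date. The helper below is a hypothetical addition (it is not part of the notebook) and is only an approximate inverse of `year_fraction`, because the original works in local time through `mktime` while this sketch uses naive datetimes.

```python
from datetime import datetime, timedelta

def fraction_to_datetime(year_value: float) -> datetime:
    # hypothetical helper: approximate inverse of year_fraction from the cell above
    year = int(year_value)
    start = datetime(year=year, month=1, day=1)
    end = datetime(year=year + 1, month=1, day=1)
    # scale the length of the year by the fractional part and add it to January 1
    return start + (year_value - year) * (end - start)

# 2023.947945 is the `year` value shown in the df.head() output above
approx = fraction_to_datetime(2023.947945)
print(approx.strftime('%Y-%m-%d %H:%M'))
```

Because the stored value is rounded to six decimals, the result lands within a few seconds of midnight on 13 December 2023 rather than exactly on it.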

In [5]:
histogram(data_frame=df, x='token count', log_y=True).show()

Our BERT model accepts sequences of 128 subwords by default and truncates anything longer, so review length, measured in tokens, may be an issue we need to deal with.
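
One rough way to size the problem is to ask what share of reviews fits under a given limit. The sketch below uses a hypothetical helper, `share_within_limit`, and made-up counts. Note that `token count` in this notebook counts whitespace-separated words, and a subword tokenizer typically produces at least as many pieces as that, so a figure obtained this way should be read as an optimistic estimate.

```python
def share_within_limit(token_counts: list, limit: int) -> float:
    # fraction of documents whose whitespace token count is at or below the limit
    if not token_counts:
        return 0.0
    return sum(count <= limit for count in token_counts) / len(token_counts)

# made-up token counts for illustration; in the notebook this would be df['token count']
example_counts = [4, 13, 42, 50, 80, 92, 130, 600]
for limit in (128, 512):
    print(limit, share_within_limit(example_counts, limit))
```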

In [6]:
# let's take a fixed number of rows starting with the most recent going back in time
sample_df = df.sort_values(ascending=False, by='year').head(n=20000)
sample_df.shape

(20000, 5)

In [7]:
histogram(data_frame=sample_df, x='token count', log_y=True).show()

In [8]:
from arrow import now
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer

COLUMN = 'clean'
MIN_DF = 2
MODEL = 'all-MiniLM-L12-v2'
STOP_WORDS = 'english'

model_start = now()
model = KeyBERT(model=MODEL)
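# if we set this to 512 we get about 99% of the input intact
# caution: this assigns an attribute on the KeyBERT wrapper; the limit that controls truncation
# belongs to the underlying SentenceTransformer (in recent KeyBERT releases it appears to be
# reachable as model.model.embedding_model), so this line may not change the encoder's behavior
model.max_seq_length = 512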
raw_documents = sample_df[COLUMN].values
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=STOP_WORDS, min_df=MIN_DF,)
document_embeddings, word_embeddings = model.extract_embeddings(docs=raw_documents, vectorizer=vectorizer, )
print('embedding time: {}'.format(now() - model_start))
print('we have {} documents and {} words.'.format(len(document_embeddings), len(word_embeddings)))
keywords = model.extract_keywords(docs=raw_documents, top_n=1, stop_words=STOP_WORDS, vectorizer=vectorizer,
                                  doc_embeddings=document_embeddings, word_embeddings=word_embeddings,
                                 min_df=MIN_DF, )
print('model time: {}'.format(now() - model_start))

embedding time: 0:08:35.954924
we have 20000 documents and 13338 words.
model time: 0:08:59.593556
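
As a mental model for what `extract_keywords` returns with `top_n=1`: each review's top keyword is the candidate word whose embedding is most similar, by cosine similarity, to the review's embedding. The sketch below illustrates that ranking with made-up three-dimensional vectors. It is a simplification, not KeyBERT's implementation; in particular it ignores that the candidates for a document are limited to vocabulary words that occur in that document.

```python
from math import sqrt

def cosine(a: list, b: list) -> float:
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_word(doc_vector: list, word_vectors: dict) -> str:
    # pick the word whose embedding is closest to the document embedding
    return max(word_vectors, key=lambda word: cosine(doc_vector, word_vectors[word]))

# toy embeddings, invented for illustration only
toy_doc = [0.9, 0.1, 0.0]
toy_words = {'bug': [1.0, 0.0, 0.0], 'story': [0.0, 1.0, 0.0], 'night': [0.0, 0.0, 1.0]}
print(top_word(toy_doc, toy_words))
```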


In [9]:
TOP_N = 26 # we have 26 colors in our biggest discrete color map so let's use 26 tags
# we are folding together some similar tags somewhat arbitrarily
# this is mostly plural->singular but also 
# accounting for the fact that the game has a two-word name
RESOLVE = {'buggy' : 'bug', 'bugs': 'bug', 'crashes': 'crash', 'crashing': 'crash',
           'games': 'game', 'gameplay': 'game', '2077': 'cyberpunk', 'updates': 'update',}

sample_df['top keyword'] = [item[0][0] if len(item) else 'unknown' for item in keywords]
# fold quasi-synonyms together; keywords without an entry in RESOLVE pass through unchanged
sample_df['tag'] = sample_df['top keyword'].apply(func=lambda x: RESOLVE.get(x, x))
# all of the keywords outside the top N get retagged as unknown
# (build the top-N list once instead of recomputing value_counts for every row)
top_tags = sample_df['tag'].value_counts()[:TOP_N].index.tolist()
sample_df['tag'] = sample_df['tag'].apply(func=lambda x: x if x in top_tags else 'unknown')
# how much of the corpus did we tag?
print(sample_df['tag'].value_counts(normalize=True))
top_keywords = sorted(sample_df['tag'].unique().tolist())
print(top_keywords)

tag
unknown      0.50580
game         0.11270
cyberpunk    0.09050
dlc          0.03415
phantom      0.03275
cdpr         0.02785
bug          0.02700
crash        0.01750
rpg          0.01445
launch       0.01420
witcher      0.01350
update       0.01165
gta          0.00915
patch        0.00855
starfield    0.00755
glitches     0.00730
night        0.00705
release      0.00685
fixed        0.00665
review       0.00655
samurai      0.00590
fun          0.00580
immersive    0.00545
story        0.00530
panam        0.00530
keanu        0.00530
play         0.00525
Name: proportion, dtype: float64
['bug', 'cdpr', 'crash', 'cyberpunk', 'dlc', 'fixed', 'fun', 'game', 'glitches', 'gta', 'immersive', 'keanu', 'launch', 'night', 'panam', 'patch', 'phantom', 'play', 'release', 'review', 'rpg', 'samurai', 'starfield', 'story', 'unknown', 'update', 'witcher']
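
The effect of the two tagging steps is easier to see on a handful of values. The sketch below uses a hypothetical helper, `tag_keywords`, and made-up keywords. It folds synonyms first, keeps the `top_n` most common tags, and relabels the rest as `unknown`.

```python
from collections import Counter

def tag_keywords(keywords: list, resolve: dict, top_n: int) -> list:
    # step 1: fold quasi-synonyms onto a single tag
    folded = [resolve.get(keyword, keyword) for keyword in keywords]
    # step 2: keep only the top_n most common tags, relabel the rest
    keep = {tag for tag, _ in Counter(folded).most_common(top_n)}
    return [tag if tag in keep else 'unknown' for tag in folded]

# toy inputs, invented for illustration only
toy_keywords = ['bugs', 'buggy', 'bug', 'game', 'games', 'keanu']
toy_resolve = {'bugs': 'bug', 'buggy': 'bug', 'games': 'game'}
toy_tags = tag_keywords(toy_keywords, toy_resolve, top_n=2)
print(toy_tags)
print(len(set(toy_tags)))
```

Note that the result can contain `top_n + 1` distinct values once `unknown` is counted, which matches the 27 entries listed above for `TOP_N = 26`.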


In [10]:
from plotly.colors import qualitative
from plotly.express import scatter
from umap import UMAP

umap_start = now()
umap_model = UMAP(n_components=2, random_state=2024, verbose=False, n_jobs=1)
sample_df[['u0', 'u1']] = umap_model.fit_transform(X=document_embeddings)
scatter(data_frame=sample_df, x='u0', y='u1', hover_name='short review', hover_data = ['top keyword'],
        height=900, color='tag', color_discrete_sequence=qualitative.Alphabet).show()
print('UMAP time: {}'.format(now() - umap_start))

UMAP time: 0:00:46.053013


This map gives us a sense of what roughly half the corpus is talking about; laying out the documents this way lets us spot occasional islands of similar documents for further investigation.

We can also see that many documents don't say much and/or are essentially unique for various reasons; we measure that near-uniqueness by looking for keywords with low cardinality, that is, keywords that are the top keyword for only a few documents.
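
One way to put a number on that near-uniqueness is to compute the share of documents whose top keyword is rare. The sketch below uses a hypothetical helper, `rare_keyword_share`, on made-up keywords; in the notebook the input would be `sample_df['top keyword']`.

```python
from collections import Counter

def rare_keyword_share(top_keywords: list, max_count: int = 1) -> float:
    # share of documents whose top keyword occurs at most max_count times overall
    counts = Counter(top_keywords)
    rare_docs = sum(1 for keyword in top_keywords if counts[keyword] <= max_count)
    return rare_docs / len(top_keywords)

# toy keywords, invented for illustration only
toy = ['game', 'game', 'game', 'bug', 'bug', 'antivirus', 'boi', 'ending']
print(rare_keyword_share(toy))
```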

In [11]:
from plotly.graph_objects import Figure
import numpy as np
words = vectorizer.get_feature_names_out()
# let's plot our top keywords plus the quasi-synonyms from our resolver above
top_words = [word for word in top_keywords + list(RESOLVE.keys()) if word != 'unknown']

def plot_words(arg_words: np.ndarray, arg_keywords: list, arg_model: UMAP, arg_embeddings: np.ndarray) -> Figure:
    # map each vocabulary word to its row in the embedding matrix once, instead of searching per keyword
    word_index = {word: index for index, word in enumerate(arg_words)}
    top_indices = [word_index[keyword] for keyword in arg_keywords]
    # project the selected word embeddings with the UMAP model already fitted on the documents
    result_df = pd.DataFrame(data=arg_model.transform(X=arg_embeddings[top_indices]),
                             columns=['u0', 'u1'])
    result_df['word'] = arg_keywords
    return scatter(data_frame=result_df, x='u0', y='u1', text='word', height=900).update_traces(marker={'size': 1})
    
plot_words(arg_words=words, arg_keywords=top_words, arg_model=umap_model,
          arg_embeddings=word_embeddings).show()    


Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.



We can use the same fitted dimension-reduction model to project our keywords and their synonyms into the same artificial two-dimensional space as the documents. Some keywords land near their documents because those documents are tightly clustered; for others the documents are spread out, diffuse, or split across disjoint clusters, so the keyword does not sit close to them.

In [12]:
from math import log
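# find the 100 most common top keywords once, then relabel everything else as unknown
top_100 = sample_df['top keyword'].value_counts()[:100].index.tolist()
sample_df['top 100'] = sample_df['top keyword'].apply(func=lambda x: x if x in top_100 else 'unknown')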
top100_df = sample_df[sample_df['top 100'] != 'unknown'].copy()
density = top100_df['top 100'].value_counts(normalize=True).to_dict()
top100_df['density'] = top100_df['top 100'].map(density)
top100_df['density'] = top100_df['density'].apply(func=lambda x: round(log(x), ndigits=4))
top100_df.head()

Unnamed: 0,review,clean,token count,year,short review,top keyword,tag,u0,u1,top 100,density
0,It's very fun. I don't usually like open world...,It s very fun. I don t usually like open world...,42,2023.947945,It's very fun. I don't usually like open world...,gameplay,game,10.856953,9.648424,gameplay,-3.4217
59,THE BEST GAME EVER\n,THE BEST GAME EVER,4,2023.947945,THE BEST GAME EVER...,game,game,9.127726,12.04633,game,-2.0743
62,"Still buggy, \nI loaded my last save from year...","Still buggy, I loaded my last save from years...",92,2023.947945,"Still buggy, I loaded my last save from years ...",buggy,bug,14.776852,8.467245,buggy,-4.5203
66,PLEASE FIX THE ANTIVIRUS ISSUE :-). For these...,PLEASE FIX THE ANTIVIRUS ISSUE . For these...,13,2023.947945,PLEASE FIX THE ANTIVIRUS ISSUE :-). For these ...,crashes,crash,15.300282,9.195382,crashes,-4.1386
126922,we all know it had a rough start but after 3 y...,we all know it had a rough start but after 3 y...,24,2023.947945,we all know it had a rough start but after 3 y...,patches,unknown,14.320988,8.080556,patches,-6.2181


If we are willing to give up fine distinctions between colors we can use a continuous color map; to do that we need a color value that is a float. The log of the normalized value count (which we are loosely calling the density) is adequate for the task; here we drop everything outside the top 100 keywords for clarity and simplicity.
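
The density calculation can be checked by hand on a tiny example. The sketch below uses a hypothetical helper, `log_density`, on made-up labels and applies the same steps as the cell above: normalized counts, then the natural log rounded to four digits.

```python
from collections import Counter
from math import log

def log_density(labels: list) -> dict:
    # natural log of each label's share of the rows, rounded as in the notebook
    counts = Counter(labels)
    total = len(labels)
    return {label: round(log(count / total), ndigits=4) for label, count in counts.items()}

# toy labels, invented for illustration: 'game' covers 80% of rows, 'bug' covers 20%
toy_labels = ['game'] * 8 + ['bug'] * 2
print(log_density(toy_labels))
```

Reading the values in the other direction, the density of -2.0743 shown for `game` in the table above corresponds to exp(-2.0743), or roughly 12.6% of the rows in `top100_df`.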

In [13]:
scatter(data_frame=top100_df, x='u0', y='u1', color='density', hover_name='top 100', hover_data=['short review'], height=900)

We're now looking at about 65% of the sample (13,045 of 20,000 documents), and a tradeoff becomes visible: the fewer keywords we show, the easier it is to pick out one keyword's documents by color, but we need to show enough documents to see the small clusters that form around moderately obscure keywords.

In [14]:
plot_words(arg_words=words, arg_keywords=top100_df['top 100'].unique().tolist(), arg_model=umap_model,
          arg_embeddings=word_embeddings).show()

Because our documents and our words live in the same space, we can relate what we see here to the document map above: commenters talk about characters and locations, they talk about platforms and other comparable games, they talk about gameplay, and they complain about issues.

In [15]:
top100_df.shape, sample_df.shape

((13045, 11), (20000, 10))