<a href="https://www.kaggle.com/code/mikedelong/explore-with-keybert-umap?scriptVersionId=157717436" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
%env TOKENIZERS_PARALLELISM=false
!pip install --quiet keybert
print('pip installed keybert')

env: TOKENIZERS_PARALLELISM=false
pip installed keybert


In [2]:
import re
import pandas as pd
from datetime import datetime
from time import mktime

def clean(arg: str) -> str:
    # replace everything except ASCII letters, digits, spaces and . / , with a space
    return re.sub(r'[^a-zA-Z0-9. /,]', ' ', arg)

def short_review(arg: str) -> str:
    pieces = arg.split()
    return ' '.join(pieces[:20]) + '...'

# https://stackoverflow.com/a/6451892
def since(date):
    return mktime(date.timetuple())

def year_fraction(date) -> float:
    this_year = datetime(year=date.year, month=1, day=1)
    next_year = datetime(year=1 + date.year, month=1, day=1)
    return date.year + (since(date) - since(this_year))/(since(next_year) - since(this_year))

filename = '/kaggle/input/cyberpunk-2077-steam-reviews/cyberpunk_2077_filtered.csv'
usecols = ['language', 'review', 'updated']
df = pd.read_csv(filepath_or_buffer=filename, parse_dates=['updated'], usecols=usecols)
# keep only the reviews labeled as English
df = df[df['language'] == 'english']
# the language label is not always correct, and we also want to remove emoji,
# so we replace every character outside a small ASCII set with a space
df['clean'] = df['review'].apply(func=clean)
# count whitespace-separated tokens and drop reviews with 3 tokens or fewer
df['token count'] = df['clean'].str.split().str.len()
df = df[df['token count'] > 3]
# convert the timestamp to a fractional year so we can take time slices
df['year'] = df['updated'].apply(func=year_fraction)
df = df.drop(columns=['language', 'updated'])
# the plot hover text shows only the first 20 words of the original review
df['short review'] = df['review'].apply(func=short_review)
df.head()

Unnamed: 0,review,clean,token count,year,short review
0,It's very fun. I don't usually like open world...,It s very fun. I don t usually like open world...,42,2023.947945,It's very fun. I don't usually like open world...
6,Coming back to try the game after 2.0 came out...,Coming back to try the game after 2.0 came out...,80,2023.947945,Coming back to try the game after 2.0 came out...
10,i dont even own this fucking game why can i wr...,i dont even own this fucking game why can i wr...,13,2023.947945,i dont even own this fucking game why can i wr...
11,Todo valio la pena al final con el mejor endin...,Todo valio la pena al final con el mejor endin...,50,2023.947945,Todo valio la pena al final con el mejor endin...
12,I am a sneaky boi and I stab people with arm s...,I am a sneaky boi and I stab people with arm s...,13,2023.947945,I am a sneaky boi and I stab people with arm s...
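
The `year_fraction` helper goes through `mktime`, which interprets the date in the local time zone. As a minimal sketch of an alternative, assuming the timestamps are naive datetimes, the same fraction can be computed with plain `datetime` subtraction. The name `year_fraction_delta` is hypothetical and is not part of the notebook.

```python
from datetime import datetime


def year_fraction_delta(date: datetime) -> float:
    """Return the year plus the elapsed fraction of that year."""
    this_year = datetime(year=date.year, month=1, day=1)
    next_year = datetime(year=date.year + 1, month=1, day=1)
    # dividing one timedelta by another yields a plain float
    return date.year + (date - this_year) / (next_year - this_year)
```

For example, `year_fraction_delta(datetime(2023, 12, 13))` gives roughly 2023.947945, which matches the `year` values in the table above.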


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 227200 entries, 0 to 612379
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   review        227200 non-null  object 
 1   clean         227200 non-null  object 
 2   token count   227200 non-null  int64  
 3   year          227200 non-null  float64
 4   short review  227200 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 10.4+ MB


In [4]:
from plotly.express import histogram
histogram(data_frame=df, x='year', log_y=True).show()

Those spikes are really something. We might expect them to line up with events like updates or minor releases, but we have not checked that here.
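
One way to check that guess is to list the time bins with the most reviews and compare them against known release dates. The following is a minimal sketch on plain Python lists; the helper name `busiest_bins` and the bin width are assumptions for illustration, and in the notebook the input could be `df['year'].tolist()`.

```python
from collections import Counter
from math import floor


def busiest_bins(years: list[float], bins_per_year: int = 52, top: int = 5) -> list[tuple[float, int]]:
    """Return (bin start, review count) for the most crowded time bins."""
    # map each fractional year to the index of its bin
    counts = Counter(floor(year * bins_per_year) for year in years)
    # convert bin indexes back to fractional years for readability
    return [(index / bins_per_year, count) for index, count in counts.most_common(top)]
```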

In [5]:
histogram(data_frame=df, x='token count', log_y=True).show()

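Our BERT models expect sequences of at most 128 subwords by default, and longer input is truncated. The token count here is in whitespace-separated words, which is a lower bound on the subword count, so review length is an issue we need to deal with.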
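
To see how much of the corpus a given limit would truncate, we can compute the share of reviews at or below that length. This is a rough sketch: it counts whitespace-separated words, which undercounts subwords, so the real share that fits is likely somewhat lower. The function name `share_within` is hypothetical.

```python
def share_within(texts: list[str], limit: int) -> float:
    """Return the fraction of texts with at most `limit` whitespace tokens."""
    if not texts:
        return 0.0
    fits = sum(1 for text in texts if len(text.split()) <= limit)
    return fits / len(texts)
```

In the notebook, comparing `share_within(df['clean'].tolist(), 128)` with the same call at 512 would show how much a longer limit helps.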

In [6]:
# take a fixed number of rows, starting with the most recent and going back in time;
# copy so that adding columns later does not trigger a SettingWithCopyWarning
sample_df = df.sort_values(ascending=False, by='year').head(n=20000).copy()
sample_df.shape

(20000, 5)

In [7]:
histogram(data_frame=sample_df, x='token count', log_y=True).show()

In [8]:
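from arrow import now
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

COLUMN = 'clean'
MODEL = 'all-MiniLM-L12-v2'
# max_seq_length belongs to the SentenceTransformer, not to the KeyBERT wrapper,
# so we set it on the embedding model before handing it to KeyBERT
# if we set this to 512 we get about 99% of the input intact
embedding_model = SentenceTransformer(MODEL)
embedding_model.max_seq_length = 512
model = KeyBERT(model=embedding_model)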

model_start = now()
document_embeddings, word_embeddings = model.extract_embeddings(docs=sample_df[COLUMN].values,)
print('embedding time: {}'.format(now() - model_start))
keywords = model.extract_keywords(docs=sample_df[COLUMN].values, top_n=1)
print('model time: {}'.format(now() - model_start))

embedding time: 0:10:03.824444
model time: 0:20:51.105481
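
KeyBERT ranks candidate words by the cosine similarity between each word embedding and the document embedding. The toy sketch below illustrates that ranking step with hand-made two-dimensional vectors; the vectors and the helper `top_keyword` are invented for illustration and do not come from the model.

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_keyword(doc_vector: list[float], word_vectors: dict[str, list[float]]) -> str:
    """Return the candidate word most similar to the document, or 'unknown'."""
    if not word_vectors:
        return 'unknown'
    return max(word_vectors, key=lambda word: cosine(doc_vector, word_vectors[word]))


# invented vectors: 'bug' points in nearly the same direction as the document
doc = [1.0, 0.2]
words = {'bug': [0.9, 0.1], 'story': [0.1, 0.9], 'keanu': [-0.5, 0.5]}
print(top_keyword(doc, words))
```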


In [9]:
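TOP_N = 26 # the Alphabet palette has 26 colors, so we keep the 26 most common tags
# (in this run 'unknown' comes on top of these, so the plot ends up with 27 categories)
# we fold together some similar tags, somewhat arbitrarily:
# mostly plural -> singular, but also '2077' -> 'cyberpunk',
# because the game has a two-word name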
RESOLVE = {'buggy' : 'bug', 'bugs': 'bug', 
           'crashes': 'crash', 'crashing': 'crash',
           'games': 'game', 'gameplay': 'game',
           '2077': 'cyberpunk', 'updates': 'update',}

sample_df['top keyword'] = [item[0][0] if len(item) else 'unknown' for item in keywords]
sample_df['tag'] = sample_df['top keyword'].apply(func=lambda x: RESOLVE.get(x, x))
# all of the keywords out of the top N get retagged as unknown
# (we compute the top N once, instead of once per row)
top_tags = set(sample_df['tag'].value_counts()[:TOP_N].index)
sample_df['tag'] = sample_df['tag'].apply(func=lambda x: x if x in top_tags else 'unknown')
# how much of the corpus did we tag?
print(sample_df['tag'].value_counts(normalize=True))
print(sorted(sample_df['tag'].unique().tolist()))

tag
unknown      0.51505
game         0.10770
cyberpunk    0.08990
dlc          0.03370
phantom      0.03245
cdpr         0.02760
bug          0.02670
crash        0.01730
rpg          0.01415
launch       0.01415
witcher      0.01350
update       0.01140
gta          0.00905
patch        0.00835
starfield    0.00755
glitches     0.00720
night        0.00685
release      0.00680
fixed        0.00650
review       0.00635
samurai      0.00590
fun          0.00565
immersive    0.00545
play         0.00525
story        0.00520
keanu        0.00520
panam        0.00510
Name: proportion, dtype: float64
['bug', 'cdpr', 'crash', 'cyberpunk', 'dlc', 'fixed', 'fun', 'game', 'glitches', 'gta', 'immersive', 'keanu', 'launch', 'night', 'panam', 'patch', 'phantom', 'play', 'release', 'review', 'rpg', 'samurai', 'starfield', 'story', 'unknown', 'update', 'witcher']
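
The folding and top-N steps can be sketched on a plain list of keywords, which makes the logic easy to check on a small input. This is an illustration rather than the notebook's code; `fold_tags` and the shortened mapping are hypothetical.

```python
from collections import Counter


def fold_tags(keywords: list[str], resolve: dict[str, str], top_n: int) -> list[str]:
    """Merge similar keywords, then relabel everything outside the top N as 'unknown'."""
    # fold variants such as plurals into a single tag
    folded = [resolve.get(keyword, keyword) for keyword in keywords]
    # find the most common tags once, up front
    keep = {tag for tag, _ in Counter(folded).most_common(top_n)}
    return [tag if tag in keep else 'unknown' for tag in folded]


resolve = {'bugs': 'bug', 'buggy': 'bug', 'games': 'game'}
keywords = ['bugs', 'buggy', 'bug', 'games', 'game', 'keanu']
print(fold_tags(keywords, resolve, top_n=2))
```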


In [10]:
from plotly.colors import qualitative
from plotly.express import scatter
from umap import UMAP

umap_model = UMAP(n_components=2, random_state=2023, verbose=True, n_jobs=1)
sample_df[['u0', 'u1']] = umap_model.fit_transform(X=document_embeddings)
scatter(data_frame=sample_df, x='u0', y='u1', hover_name='short review', hover_data=['top keyword'],
        height=900, color='tag', color_discrete_sequence=qualitative.Alphabet).show()

UMAP(n_jobs=1, random_state=2023, verbose=True)
Thu Jan  4 19:54:43 2024 Construct fuzzy simplicial set
Thu Jan  4 19:54:43 2024 Finding Nearest Neighbors
Thu Jan  4 19:54:43 2024 Building RP forest with 12 trees
Thu Jan  4 19:54:49 2024 NN descent for 14 iterations
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	 5  /  14
	 6  /  14
	 7  /  14
	Stopping threshold met -- exiting after 7 iterations
Thu Jan  4 19:55:09 2024 Finished Nearest Neighbor Search
Thu Jan  4 19:55:13 2024 Construct embedding


	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Thu Jan  4 19:55:30 2024 Finished embedding
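
One thing to keep in mind when reading this plot: Plotly Express assigns discrete colors by cycling through `color_discrete_sequence`, so once there are more categories than colors, a color is reused. Here we have 26 tags plus `unknown`, one more than a 26-color palette. The sketch below mimics the cycling with a stand-in palette; `assign_colors` and the color names are hypothetical, and the real assignment order depends on how Plotly orders the categories.

```python
def assign_colors(categories: list[str], palette: list[str]) -> dict[str, str]:
    """Assign palette colors to categories in order, wrapping around when the palette runs out."""
    return {category: palette[index % len(palette)] for index, category in enumerate(categories)}


palette = ['red', 'green', 'blue']          # stand-in for a real palette
tags = ['game', 'bug', 'dlc', 'unknown']    # one more category than colors
colors = assign_colors(tags, palette)
print(colors['game'] == colors['unknown'])
```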
