### Steps
- Use a list of **app_id** values to fetch data through the Steam crawler; insert **app_id**, **game_name**, **header_img_url**, **total_positive**, **total_negative**, **total_reviews** into table **games**, and insert **game_id**, **review**, **recommended**, **time** into table **reviews**.
- Preprocess **review** in table **reviews** (add_missing_punct, replace_bullets, remove_url, remove_html_tags, normalize_single_quote, remove_non_ascii, remove_ansi_escape_sequences, remove_multi_whitespaces), then tokenize_sent and remove_leading_symbols to insert **review_id**, **sent** into table **sents**. 
- Preprocess **sent** in table **sents** (lowercase, expand contractions, remove_digits, remove_symbols, remove_multi_whitespaces, lemmatize_text, remove_stopwords) to create **sent_prep** in table **sents**.
- Use **sent_prep**, **review_id** in table **sents** to insert **review_prep** in table **reviews** by joining **sent_prep**.
- Use **review_prep** in table **reviews** to count the bigrams that pass the part-of-speech rules (an adjective or noun followed by a noun), and keep the 50 most frequent as keywords.
- Insert **kw**, **freq** into table **kws**.
- Embed 50 keywords using S-BERT and cluster them using agglomerative clustering with a distance_threshold=0.6. 
	- Insert **cluster_name** (name of the most frequent keyword in cluster) into table **clusters**.
	- Insert **cluster_id** in table **kws**.
- Loop through **sent_prep** in table **sents**, fuzzy-match each **kw** in table **kws**. 
    - Insert **cluster_id**, **sent_id** in table **clusters_sents** to link table **clusters** and **sents**.
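
The steps above name six tables, but the notebook never shows how they are defined. A minimal sketch of a schema that would support these inserts, built in an in-memory SQLite database, could look like the following. The column names come from the steps and the later cells; the column types, the `kw_id` key, the `UNIQUE` constraint and the foreign-key clauses are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: column names follow the steps above, while types,
# keys and constraints are assumptions made for illustration.
SCHEMA = """
CREATE TABLE games (
    game_id INTEGER PRIMARY KEY,
    app_id INTEGER UNIQUE,
    game_name TEXT,
    header_img_url TEXT,
    total_positive INTEGER,
    total_negative INTEGER,
    total_reviews INTEGER
);
CREATE TABLE reviews (
    review_id INTEGER PRIMARY KEY,
    game_id INTEGER REFERENCES games(game_id),
    review TEXT,
    review_prep TEXT,
    recommended INTEGER,
    time INTEGER
);
CREATE TABLE sents (
    sent_id INTEGER PRIMARY KEY,
    review_id INTEGER REFERENCES reviews(review_id),
    sent TEXT,
    sent_prep TEXT
);
CREATE TABLE clusters (
    cluster_id INTEGER PRIMARY KEY,
    game_id INTEGER REFERENCES games(game_id),
    cluster_num INTEGER,
    cluster_name TEXT
);
CREATE TABLE kws (
    kw_id INTEGER PRIMARY KEY,
    cluster_id INTEGER REFERENCES clusters(cluster_id),
    kw TEXT,
    freq INTEGER
);
CREATE TABLE clusters_sents (
    cluster_id INTEGER REFERENCES clusters(cluster_id),
    sent_id INTEGER REFERENCES sents(sent_id)
);
"""

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(SCHEMA)

# List the tables that were created.
table_names = [row[0] for row in demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")]
print(table_names)
```

Declaring the surrogate keys as `INTEGER PRIMARY KEY` lets SQLite assign them automatically, which is what the later `INSERT` statements rely on when they omit `game_id`, `review_id` and `cluster_id`.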

### Step 1:

- Use a list of **app_id** to get info from Steam crawler and insert **app_id**, **game_name**, **header_img_url**, **total_positive**, **total_negative**, **total_reviews** into table **games** and insert **game_id** (fk), **review**, **recommended**, **time** into table **reviews**.
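
As a rough illustration of this step, the sketch below turns crawler-style review records into the rows for table **reviews** and the three totals for table **games**. The field names `voted_up`, `review` and `timestamp_updated` are the ones used by the crawler output in this notebook; the helper name `summarize_reviews` and the sample records are made up for the example.

```python
def summarize_reviews(raw_reviews, game_id):
    """Return ((total_positive, total_negative, total_reviews), review_tuples)."""
    review_tuples = []
    total_positive = 0
    for raw in raw_reviews:
        # Store the recommendation as 0/1 so it fits an INTEGER column.
        recommended = 1 if raw['voted_up'] else 0
        total_positive += recommended
        review_tuples.append(
            (raw['review'], recommended, raw['timestamp_updated'], game_id))
    total_reviews = len(review_tuples)
    totals = (total_positive, total_reviews - total_positive, total_reviews)
    return totals, review_tuples


# Made-up records that mimic the shape of the crawler output.
sample_reviews = [
    {'review': 'Great boss fights.', 'voted_up': True, 'timestamp_updated': 1625560000},
    {'review': 'Too short for the price.', 'voted_up': False, 'timestamp_updated': 1625560100},
    {'review': 'Lovely pixel art.', 'voted_up': True, 'timestamp_updated': 1625560200},
]

totals, rows = summarize_reviews(sample_reviews, game_id=1)
print(totals)
print(rows[0])
```

The tuple order `(review, recommended, time, game_id)` matches the column order of the `INSERT INTO reviews` statement used in this step, so `rows` can be passed straight to `executemany`.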

In [1]:
import requests 
import pandas as pd
import numpy as np

import download_steam_reviews

import sqlite3

import re
from bs4 import BeautifulSoup

import spacy
# nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_trf")

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

from sklearn.cluster import AgglomerativeClustering

from fuzzysearch import find_near_matches

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import flair

sentiment_model = flair.models.TextClassifier.load('sentiment')
sentiment_model_fast = flair.models.TextClassifier.load('sentiment-fast')
senti_analyzer = SentimentIntensityAnalyzer()

import pickle

import plotly.express as px
import plotly.graph_objects as go

from sklearn.cluster import KMeans
import hdbscan
from summa.summarizer import summarize

from scipy.cluster.vq import vq

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap.umap_ as umap

from tqdm import tqdm

from collections import Counter

from fuzzywuzzy import fuzz, process

from spacy import displacy

from pprint import pprint



2021-07-06 09:22:43,919 loading file C:\Users\HuyTran\.flair\models\sentiment-en-mix-distillbert_4.pt
2021-07-06 09:22:48,888 loading file C:\Users\HuyTran\.flair\models\sentiment-en-mix-ft-rnn.pt


In [2]:
# from contractions import contractions
contractions = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
}

In [3]:
conn = sqlite3.connect('./data/steam_reviews.db') 
cursor = conn.cursor()

In [None]:
def print_cursor(cursor): 
    print(cursor.execute("""
        select * from games;
    """).fetchall())

In [None]:
print_cursor(cursor)

In [None]:
print(cursor.execute("""
        select game_id from games 
        where app_id=367520;
    """).fetchall()[0][0])

In [None]:
# 428550 - Momodora: Reverie under the Moonlight
# 367520 - Hollow Knight
app_ids = [428550, 367520]

In [None]:
### get game_name ###

def get_app_list(): 
    app_list_url = 'https://api.steampowered.com/ISteamApps/GetAppList/v2/'
    resp_data = requests.get(app_list_url)
    return resp_data.json()

def get_name(app_id, app_list): 
    for app in app_list['applist']['apps']: 
        if app['appid'] == app_id: 
            return app['name']

In [None]:
### get header_img_url  

def get_header_img_url(app_id): 
    return f'https://cdn.cloudflare.steamstatic.com/steam/apps/{app_id}/header.jpg'

In [None]:
### get total_positive, total_negative, total_reviews in Crawler for table games 
### get game_id, review, recommended, time in Crawler for table reviews 

app_list = get_app_list()

request_params = {
    'language': 'english'
}

# load or download new (maximum ~5000 newest reviews)
load_mode = True

for app_id in app_ids: 
    game_tuples = [] 

    game_name = get_name(app_id, app_list)
    
    header_img_url = get_header_img_url(app_id)
    
    total_positive, total_negative, total_reviews = 0, 0, 0
    
    if load_mode: 
        review_dict = download_steam_reviews.load_review_dict(app_id)['reviews'].values()
    else: 
        review_dict = download_steam_reviews.download_reviews_for_app_id(app_id, 
                                                                     chosen_request_params=request_params, 
                                                                     reviews_limit=5000)[0]['reviews'].values()
    
    review_tuples = []
    
    with conn:
        cursor.execute("""INSERT INTO games (app_id, game_name, header_img_url) VALUES (?, ?, ?);""", 
                       (app_id, game_name, header_img_url))
    
    # get game_id (fk) for table reviews
    game_id = cursor.execute("""SELECT game_id FROM games WHERE app_id=?;""", (app_id,)).fetchone()[0] 
    
    for review_dict_value in review_dict: 
        total_reviews += 1
    
        voted_up = 1 if review_dict_value['voted_up'] else 0
    
        if voted_up: 
            total_positive += 1
        else: 
            total_negative += 1
     
        review = review_dict_value['review']
        recommended = voted_up
        time = review_dict_value['timestamp_updated']
        
        review_tuples.append((review, recommended, time, game_id))
    
    with conn:
        cursor.execute("""UPDATE games SET (total_positive, total_negative, total_reviews) = (?, ?, ?)
                            WHERE game_id=?;""",
                       (total_positive, total_negative, total_reviews, game_id))
    
    with conn:
        cursor.executemany("""INSERT INTO reviews (review, recommended, time, game_id) VALUES 
                                (?, ?, ?, ?);""", review_tuples)    

### Step 2:

- Preprocess **review** in table **reviews** (add_missing_punct, replace_bullets, remove_url, remove_html_tags, normalize_single_quote, remove_non_ascii, remove_ansi_escape_sequences, remove_multi_whitespaces), then tokenize_sent and remove_leading_symbols to insert **review_id**, **sent** into table **sents**. 
- Preprocess **sent** in table **sents** (lowercase, expand contractions, remove_digits, remove_symbols, remove_multi_whitespaces, lemmatize_text, remove_stopwords) to insert **sent_prep** in table **sents**.
- Use **sent_prep**, **review_id** in table **sents** to insert **review_prep** in table **reviews** by joining **sent_prep**.
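
To make the first bullet more concrete, here is a small sketch that runs three of the cleaning functions on a made-up review. It reuses the regular expressions from the cells below (written as raw strings) and leaves out the steps that need BeautifulSoup or spaCy, so it is only a partial picture of the full chain.

```python
import re


def replace_bullets(text):
    # Turn line breaks (and an optional leading +/- bullet) into sentence breaks.
    text = re.sub(r'([A-Za-z0-9])\s*\n+\s*[+-]?\s*', r'\g<1>. ', text)
    text = re.sub(r'\s*([:+-]+)\s*\n+\s*[+-]?\s*', '. ', text)
    return text


def remove_url(text):
    return re.sub(r'http\S+', ' ', text)


def remove_multi_whitespaces(text):
    return re.sub(r'\s+', ' ', text.strip())


# Made-up review with a bullet list and a link.
raw = "Pros:\n- tight controls\n- great music\nSee https://example.com for more"

cleaned = remove_multi_whitespaces(remove_url(replace_bullets(raw)))
print(cleaned)
```

After this pass each bullet becomes its own period-terminated fragment, which gives the sentence tokenizer clear boundaries to split on.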
        

    

In [None]:
df_reviews = pd.read_sql_query("""SELECT review_id, review, game_id 
                    FROM reviews JOIN games USING(game_id);""", conn)

In [None]:
df_reviews

In [None]:
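# append a period when the text ends with a letter or digit
def add_missing_punct(text): 
    return re.sub(r'([A-Za-z0-9])\s*$', r'\g<1>. ', text)


# turn line breaks and leading +/- bullets into sentence breaks
def replace_bullets(text): 
    text = re.sub(r'([A-Za-z0-9])\s*\n+\s*[+-]?\s*', r'\g<1>. ', text)
    text = re.sub(r'\s*([:+-]+)\s*\n+\s*[+-]?\s*', '. ', text) 
    return text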
    
    
# remove url from text
def remove_url(text):
    return re.sub(r"http\S+", ' ', text)


# remove HTML tags
def remove_html_tags(text):
    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    # remove square brackets and characters inside
    text = re.sub(r'\[(.*?)\]', ' ', text)
    return text


# replace ’ with ' 
def normalize_single_quote(text):
    return re.sub('[’‘]', '\'', text)


# remove non english characters effectively
def remove_non_ascii(text): 
    return text.encode("ascii", errors="ignore").decode()
    
    
# remove ANSI escape sequences
def remove_ansi_escape_sequences(text):
    ansi_escape = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
    return ansi_escape.sub('', text)
    
    
# remove multiple whitespaces with single whitespace
def remove_multi_whitespaces(text): 
    return re.sub(r'\s+', ' ', text.strip())

In [None]:
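# drop a leading list number such as "1." or "2)"
def remove_bullet_nums(sent): 
    return re.sub(r'^\s*\d+\n*[.\)]+\s*([^\d])', r'\g<1>', sent)

# drop leading characters that are not letters, digits or quotes
def remove_leading_symbols(sent):
    return re.sub(r'^[^A-Za-z"\'\d]+', '', sent)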

def uppercase_first(sent): 
    return sent[0].upper() + sent[1:] if len(sent) != 0 else sent

In [None]:
def tokenize_sent(text):
    doc = nlp(text, disable=['ner', 'attribute_ruler', 'lemmatizer', 'sentencizer'])
    sents = [str(sent).strip() for sent in doc.sents]
    return sents

In [None]:
def lowercase(text):
    return text.lower()

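def expand_contractions(text):
    # Replace longer contractions first so that "can't've" is expanded
    # before its prefix "can't" can rewrite part of it.
    for key in sorted(contractions, key=len, reverse=True):
        text = text.replace(key, contractions[key])
    return text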

# remove digits 
def remove_digits(text): 
    return re.sub(r'\d+', ' ', text)

# remove symbols 
def remove_symbols(text):
    return re.sub(r'[^A-Za-z,.\s\d]+', ' ', text)

# lemmatization with spacy 
def lemmatize_text(text): 
    doc = nlp(text, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc if token.pos_ != 'PUNCT']
    return ' '.join(lemma)

# remove stop words 
def remove_stopwords(text, word_list=[]):
    stop_words = stopwords.words("english")
    stop_words.extend(word_list)
    stop_words = set(stop_words)
    return ' '.join(e.lower() for e in text.split() if e.lower() not in stop_words)

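# treat "game" and the words of the game's own name as extra stop words
def get_extra_stopwords(game_name): 
    # named extra_stopwords to avoid shadowing the nltk `stopwords` import
    extra_stopwords = set(['game'])
    doc = nlp(game_name.lower(), disable=['parser', 'ner'])
    for token in doc: 
        if token.pos_ not in {'PUNCT', 'NUM'}:
            extra_stopwords.add(token.text)
    return extra_stopwords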

In [None]:
sent_tuples = []

for game_id in df_reviews['game_id'].unique():
    with conn: 
        game_name = cursor.execute("""SELECT game_name FROM games WHERE game_id=?;""", (int(game_id),)).fetchone()[0]
    
    extra_stopwords = get_extra_stopwords(game_name)
    
    df_reviews_game = df_reviews[df_reviews['game_id'] == game_id]
    
    reviews_game_cleaned = df_reviews_game['review'].map(add_missing_punct)\
                    .map(replace_bullets)\
                    .map(remove_url)\
                    .map(remove_html_tags)\
                    .map(normalize_single_quote)\
                    .map(remove_non_ascii)\
                    .map(remove_ansi_escape_sequences)\
                    .map(remove_multi_whitespaces)
    
    for review_id, review in zip(df_reviews_game['review_id'], reviews_game_cleaned):
        sents = pd.Series(tokenize_sent(review)).map(remove_bullet_nums)\
                                                .map(remove_leading_symbols)\
                                                .map(uppercase_first)\
                                                .map(add_missing_punct)\
                                                .map(remove_multi_whitespaces)

        sents_prep = sents.map(lowercase)\
                        .map(expand_contractions)\
                        .map(remove_digits)\
                        .map(remove_symbols)\
                        .map(remove_multi_whitespaces)\
                        .map(lemmatize_text)\
                        .map(lambda x: remove_stopwords(x, word_list=extra_stopwords))

        for sent, sent_prep in zip(sents, sents_prep):
            sent_tuples.append((review_id, sent, sent_prep))

In [None]:
with conn:
    cursor.executemany("""INSERT INTO sents (review_id, sent, sent_prep) VALUES 
                        (?, ?, ?);""", sent_tuples)    

### Step 3: 
- Use **review_prep** in table **reviews** to count the bigrams that pass the part-of-speech rules (an adjective or noun followed by a noun), and keep the 50 most frequent as keywords.
- Insert **kw**, **freq** into table **kws**.
- Embed 50 keywords using S-BERT and cluster them using agglomerative clustering with a distance_threshold=0.6. 
	- Insert **cluster_name** (name of the most frequent keyword in cluster) into table **clusters**.
	- Insert **cluster_id** in table **kws**.
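
The frequency count in the first bullet can be sketched with the standard library alone. The version below is a simplified stand-in for the `CountVectorizer`-based `get_ngram` used in the cells that follow: it splits on whitespace instead of using scikit-learn's tokenizer, and it skips the part-of-speech filter. The function name `bigram_counts` and the sample sentences are made up for illustration.

```python
from collections import Counter


def bigram_counts(sentences, min_df=1):
    """Count bigrams over all sentences, keeping those seen in >= min_df sentences."""
    total = Counter()      # total occurrences of each bigram
    doc_freq = Counter()   # number of sentences containing each bigram
    for sentence in sentences:
        tokens = sentence.split()
        bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
        total.update(bigrams)
        doc_freq.update(set(bigrams))
    # most_common() sorts by descending frequency.
    return [(bg, n) for bg, n in total.most_common() if doc_freq[bg] >= min_df]


# Made-up preprocessed sentences.
sample_sents_prep = [
    'great pixel art',
    'pixel art beautiful',
    'boss fight hard',
    'final boss fight',
]

top = bigram_counts(sample_sents_prep, min_df=2)
print(top)
```

Here `min_df` plays the same role as in `get_ngram`: a bigram that appears in fewer than `min_df` sentences is dropped before the ranking, which removes one-off phrases early.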

In [14]:
df_reviews = pd.read_sql_query("""SELECT review_id, review, game_id 
                    FROM reviews JOIN games USING(game_id);""", conn)

In [795]:
def get_ngram(x, ngram, min_df=1):
    vec = CountVectorizer(ngram_range=[ngram, ngram], min_df=min_df).fit(x)
    bow = vec.transform(x)
    sum_words = bow.sum(axis = 0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)
    return words_freq

def bigram_rules(bigram): 
    first_pos = set(['ADJ', 'NOUN'])
    second_pos = set(['NOUN'])
    
    tags = [token.pos_ for token in nlp(bigram, disable=['parser','ner'])]
    
    return tags[0] in first_pos and tags[1] in second_pos

In [796]:
def unigram_rules(unigram): 
    first_pos = set(['NOUN', 'PROPN'])
    tags = [token.pos_ for token in nlp(unigram, disable=['parser', 'ner'])]
    return tags[0] in first_pos

In [807]:
unigram_freq = get_ngram(sents_prep, 1, 3)
unigram_df = pd.DataFrame(unigram_freq, columns=['unigram', 'freq'])

In [None]:
bigram_vocab = ' '.join(bigram_df['bigram'])

In [810]:
unigram_df_50 = []
count = 0
for unigram, freq in zip(unigram_df['unigram'], unigram_df['freq']): 
    if unigram_rules(unigram) and unigram not in bigram_vocab:
        unigram_df_50.append((unigram, freq))

        count += 1

        if count == 50: 
            break

In [4]:
df_sent_prep = pd.read_sql_query("""SELECT sent, sent_prep, game_id
                    FROM sents JOIN reviews USING(review_id) JOIN games USING(game_id);""", conn)

In [5]:
sents_prep = df_sent_prep[df_sent_prep['game_id'] == 1]['sent_prep']

In [6]:
sents = df_sent_prep[df_sent_prep['game_id'] == 1]['sent']

In [75]:
sents_10 = sents[:100]

In [76]:
for sent in sents_10: 
    print(f'> {sent}')

> Metroidvania with some influences from Dark Souls.
> I had to check some guide to see the true ending.
> A short but sweet metroidvania that is worth your time, especially if you like getting difficult achievements!
> It's a 2-d action platformer with a simple yet rewarding combat system and difficulty options aplenty.
> Other than that, there isn't a ton of replay value unless you want to find all the hidden collectibles.
> If you like games like Rabi-Ribi or Metroidvania games in general you'd def like this game!!
> This game is awesome, it has great art design and an insane diffculty, when played on hard mode, it become a personal fave for its la dark souls but in 16 bit.
> A charming and difficult pixel action platformer.
> The boss fights felt so smooth, were rewarding, and had great music to them.
> Worth the pain of repeated deaths for a perfect boss fight.
> Some sound effects were mistimed.
> Momodora doesn't even take that long to beat with my playthrough clocking at just a

In [77]:
def get_tokens_index_dict(string):
    index_dict = {}
    index = 0 
    for i, char in enumerate(string): 
        if char == ' ':
            index += 1
        else: 
            index_dict[i] = index
    
    return index_dict
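
To see what this mapping is for, the sketch below runs it on a made-up tag string and uses it to recover the tokens behind a regex match. The pattern here is a simplified assumption (adjectives followed by nouns, no punctuation handling), not the full pattern used in the next cell, and the tokens and tags stand in for a real spaCy parse.

```python
import re


def get_tokens_index_dict(string):
    # Map each non-space character offset to the index of the token it belongs to.
    index_dict = {}
    index = 0
    for i, char in enumerate(string):
        if char == ' ':
            index += 1
        else:
            index_dict[i] = index
    return index_dict


# Made-up tokens and tags standing in for a spaCy parse of one sentence.
tokens = ['great', 'pixel', 'art', 'is', 'nice']
tags_str = 'ADJ NOUN NOUN AUX ADJ'

# Simplified pattern (assumption): optional adjectives, then one or more nouns.
simple_pattern = re.compile(r'(?:ADJ )*(?:(?:NOUN|PROPN) )*(?:NOUN|PROPN)')

index_dict = get_tokens_index_dict(tags_str)
phrases = []
for match in simple_pattern.finditer(tags_str):
    # Character offset of the match -> index of its first token.
    first_token = index_dict[match.start()]
    n_tags = len(match.group(0).split())
    phrases.append(' '.join(tokens[first_token:first_token + n_tags]))
print(phrases)
```

Matching on the tag string and then translating character offsets back to token positions keeps the grammar readable as a regular expression while still returning the original words.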

In [258]:
kws = []

noun_types = set(['NOUN', 'PROPN'])
adj_types = set(['ADJ'])
pattern = re.compile('(ADJ (PUNCT ADJ )*)*((NOUN|PROPN) (PUNCT (NOUN|PROPN) )*(PUNCT )?)*(NOUN|PROPN)')

for sent in sents: 
    doc = nlp(sent, disable=['parser', 'ner'])
    doc_len = len(doc)

    i = 0 
    
    tokens = [token.text for token in doc]
    
    tags_str = ' '.join([token.pos_ for token in doc])
    
    tokens_index_dict = get_tokens_index_dict(tags_str)
    
    matches = pattern.finditer(tags_str)
    
    for match in matches: 
        tags = [elem for elem in match.group(0).split()]
        
        kw = [] 
        index = match.start()
        for i, tag in enumerate(tags): 
            kw_token = tokens[tokens_index_dict[index]]
            
#             print(f'{tags[i - 1]} {kw_token}')
            
            if kw_token in (',', ';') and tags[i - 1] in ('PROPN', 'NOUN'):
                kws.append(' '.join(kw))
                kw = []
#             elif kw_token not in ('-', ',', ';'):
#                 kw.append(kw_token)
            else:
                kw.append(kw_token)
            index += len(tag) + 1
    
            
        kws.append(' '.join(kw))

In [221]:
# kws = set(kws)

In [277]:
# remove messy keywords 
def remove_messy_kws(kws): 
    return [kw for kw in kws if not re.search(r'[^a-zA-Z0-9\s\-,:]', kw)]

def clean_kw(kw):
    res = re.sub(' - ', '-', kw)
    res = re.sub(r' ([,:] )', r'\g<1>', res)
    return res

In [278]:
cleaned_kws = remove_messy_kws(kws)

In [279]:
cleaned_kws = pd.Series(cleaned_kws)

In [281]:
cleaned_kws = cleaned_kws.map(clean_kw)

In [298]:
kws_freq = pd.DataFrame(cleaned_kws.value_counts()).reset_index()

In [299]:
kws_freq.columns = ['kw', 'freq']

In [None]:
# stop words: game, games, lot, bit, way, one, fun, name of the game

In [367]:
cleaned_kws.value_counts()['game']

889

In [302]:
for kw, freq in zip(kws_freq['kw'], kws_freq['freq']): 
    print(f'{kw}: {freq}')

game: 889
Dark Souls: 343
lot: 145
boss fights: 131
bosses: 128
music: 124
time: 98
gameplay: 97
combat: 96
games: 95
enemies: 87
story: 86
bit: 83
pixel art: 79
hours: 78
level design: 75
final boss: 75
art style: 72
difficulty: 72
dark souls: 70
Hollow Knight: 70
Momodora: 66
attacks: 63
price: 59
first time: 58
money: 57
Cave Story: 57
art: 57
hard mode: 56
fun: 54
way: 54
Moonlight: 52
true ending: 51
metroidvania: 50
items: 47
maple leaf: 47
boss: 47
challenge: 46
fan: 46
little bit: 45
secrets: 44
Castlevania: 44
Night: 43
controls: 43
atmosphere: 42
people: 42
world: 41
genre: 41
tight controls: 41
characters: 40
Metroidvania: 40
Momodora: Reverie: 40
exploration: 38
full price: 38
animations: 38
bow: 38
attack: 37
replay value: 35
damage: 34
artstyle: 33
things: 33
long time: 33
normal difficulty: 31
animation: 30
boss battles: 30
boss fight: 29
sound design: 29
Symphony: 29
areas: 28
main character: 28
great soundtrack: 27
one: 27
dodge roll: 27
Bloodborne: 27
easy mode: 26
me

Great animation work: 1
impressive pixel artistry: 1
devastating ability: 1
dark souls clone: 1
dangerous beastiary: 1
slapping boobs: 1
Demon: 1
Great emphasis: 1
few puzzles: 1
metroidvania perspective: 1
long story: 1
simple combat mechanics: 1
Map: 1
main bone: 1
great little metroidvania: 1
end game equipment: 1
bloodborne aesthetic: 1
dark souls thirst: 1
few new enemies: 1
solid little game: 1
single hit death: 1
challenging metroidvania: 1
famous games: 1
pixel art 2D: 1
best element: 1
RB: 1
SOUND: 1
Jumping: 1
Great level: 1
Speedrunners: 1
open world level design: 1
old Megaman series: 1
compressed experience: 1
hyper light drifter: 1
few others: 1
good games: 1
gorgeous animation: 1
getting new abilities: 1
great OST: 1
Criticisms: 1
easy boss fights: 1
flawless gameplay: 1
interesting bosses: 1
pen: 1
retro style graphics: 1
other consumable item: 1
Fun Metroidvania: 1
Beautiful Pixel Art: 1
achievements content: 1
action-platformers: 1
stellar pixel art: 1
writer: 1
open 

slight influence: 1
tar: 1
major positive: 1
equipable items: 1
second run: 1
deep messages: 1
interesting dynamic: 1
great responsive gameplay: 1
little damage: 1
Excellent combat system: 1
passive item system: 1
passive item support: 1
Gameplay-Combat system: 1
metacritic: 1
first hour: 1
ambientacin: 1
customization: 1
Undead Queen: 1
metroidvania upgrades: 1
tiddies: 1
subtle story: 1
leaves: 1
bad time: 1
3d games: 1
tingle: 1
distinct benefits: 1
Siekro: 1
Overall Momodora: RUtM: 1
surprising dark setting: 1
second opinion: 1
many boss fights: 1
lush backgrounds: 1
fresh start: 1
evident nods: 1
good anime: 1
split second: 1
great blend: 1
cave story mod: 1
city walls: 1
nice sprite art: 1
enemy designs: 1
solid animation: 1
excitement: 1
perfect incarnation: 1
low top: 1
time playthroughs: 1
healing options: 1
expansive 2D labyrinth: 1
overhaul: 1
sacred maple leaf: 1
fantastic design: 1
all-nighter game squad: 1
story development: 1
false hope: 1
environmental storytelling: 1
B

temporary attack boosts: 1
good metrovania: 1
Metroidvanyia elements: 1
Bloodborn: 1
gameplay mechanics: 1
favourite part: 1
old Metroid series: 1
walls beam: 1
fun metroidvania style game: 1
little secrets: 1
right moment: 1
cancer: 1
decent exploration: 1
biggest gripe: 1
2D platforming environment: 1
heavy emphasis: 1
hour sitting: 1
world-building: 1
air dodges: 1
ambiguity: 1
great interconnected level design: 1
bittersweet world: 1
leaf-smacking: 1
pattern-reaction dodging action: 1
defense: 1
RPG elements: 1
different art style: 1
second boss: 1
GOOD Beautiful art: 1
lovely art style: 1
soulsborne-style: 1
Excellent gameplay: 1
big boob witch: 1
Metriodvania: 1
pause menu: 1
Best feeling platformer: 1
lovely pixel graphics: 1
random attack orders: 1
onyl run: 1
witch boobs: 1
cutesy art style: 1
Solid action-platformer metroidvania: 1
such inspiration: 1
dev team: 1
recognizable patterns: 1
Circle: 1
loose recommendation: 1
whole year: 1
more ways: 1
Amazing game: 1
passive effe

In [337]:
kws_ = kws_freq['kw']

In [338]:
len(kws_)

5670

In [339]:
kw_embeddings = embedder.encode(kws_)
    
# Normalize the embeddings to unit length
kw_embeddings = kw_embeddings /  np.linalg.norm(kw_embeddings, axis=1, keepdims=True)

In [None]:
# pca = PCA(n_components=50)
# kw_embeddings_pca = pca.fit_transform(kw_embeddings)

# total_var = pca.explained_variance_ratio_.sum() * 100

In [353]:
clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=0.6)
clustering_model.fit(kw_embeddings)
cluster_assignment = clustering_model.labels_

kw_clusters = {}
for kw_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in kw_clusters:
        kw_clusters[cluster_id] = []

    kw_clusters[cluster_id].append(kws_[kw_id])
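
For intuition about what `linkage='average'` with `distance_threshold=0.6` does, here is a small pure-Python sketch of average-linkage clustering on cosine distance. It is not scikit-learn's implementation: it is a naive version for toy inputs, and the function names and the 2-D vectors are made up. It follows the scikit-learn convention that clusters are not merged once their linkage distance is at or above the threshold.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(vectors, distance_threshold):
    """Merge the closest pair of clusters until none is closer than the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average pairwise distance between the members of the two clusters.
            d = sum(cosine_distance(vectors[i], vectors[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d >= distance_threshold:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy 2-D "embeddings": two vectors near the x axis, two near the y axis.
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
toy_clusters = sorted(sorted(c) for c in average_linkage(toy_vectors, 0.6))
print(toy_clusters)
```

Because the embeddings in the notebook are normalized to unit length first, the cosine distance between two keywords reduces to one minus their dot product, so the threshold of 0.6 corresponds to an average cosine similarity of 0.4 between clusters.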

In [356]:
kw_clusters

{16: ['game',
  'gameplay',
  'games',
  'first playthrough',
  'great game',
  'Great game',
  'fun game',
  'good game',
  'Gameplay',
  'short game',
  'other games',
  'Game',
  'new game',
  'play',
  'solid gameplay',
  'favorite games',
  'playthrough',
  'great gameplay',
  'good gameplay',
  'playstyle',
  'fun gameplay',
  'fourth game',
  'Good game',
  'previous games',
  'amazing game',
  'best game',
  'whole game',
  'game play',
  'adventure',
  'satisfying gameplay',
  'Fantastic game',
  'most games',
  'game mechanics',
  'best games',
  'New Game',
  'Great gameplay',
  'Beautiful game',
  'atmospheric games',
  'like game',
  'second game',
  'playtime',
  'challenging gameplay',
  'difficult games',
  'excellent game',
  'must play',
  'previous game',
  'game design',
  'adventure game',
  'Wonderful game',
  'many games',
  'action game',
  'entire game',
  'challenging game',
  'Pretty fun game',
  'sidescrolling adventure',
  'normal playthrough',
  'first gam

In [645]:

sents_prep = df_sent_prep[df_sent_prep['game_id'] == 1]['sent_prep']

unigram_freq = get_ngram(sents_prep, 1, 3)
bigram_freq = get_ngram(sents_prep, 2, 3)
bigram_df = pd.DataFrame(bigram_freq, columns=['bigram', 'freq'])

bigram_df_50 = []
count = 0

# keep the 50 most frequent bigrams that pass the part-of-speech rules
for bigram, freq in zip(bigram_df['bigram'], bigram_df['freq']): 
    if bigram_rules(bigram):
        bigram_df_50.append((bigram, freq))

        count += 1

        if count == 50: 
            break

bigram_df = pd.DataFrame(bigram_df_50, columns=['bigram', 'freq'])

kws = bigram_df['bigram']
# kw_embeddings = embedder.encode(kws)

# # Normalize the embeddings to unit length
# kw_embeddings = kw_embeddings /  np.linalg.norm(kw_embeddings, axis=1, keepdims=True)

In [646]:
pca = PCA(n_components=50)
kw_embeddings_pca = pca.fit_transform(kw_embeddings)

total_var = pca.explained_variance_ratio_.sum() * 100

In [649]:
clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=0.6)
clustering_model.fit(kw_embeddings)
cluster_assignment = clustering_model.labels_

kw_clusters = {}
for kw_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in kw_clusters:
        kw_clusters[cluster_id] = []

    kw_clusters[cluster_id].append(kws[kw_id])

In [650]:
kw_clusters

{10: ['dark soul', 'soul series'],
 9: ['boss fight', 'final boss', 'boss battle', 'enemy boss'],
 0: ['pixel art',
  'art style',
  'art animation',
  'art music',
  'beautiful pixel'],
 15: ['level design'],
 19: ['metroidvania style',
  'good metroidvania',
  'great metroidvania',
  'little metroidvania',
  'metroidvania genre',
  'fun metroidvania'],
 3: ['hollow knight', 'cave story'],
 6: ['hard mode', 'easy mode'],
 7: ['hard difficulty',
  'insane difficulty',
  'normal difficulty',
  'high difficulty'],
 17: ['action platformer'],
 20: ['replay value'],
 23: ['true ending'],
 11: ['tight control'],
 12: ['maple leaf'],
 18: ['first time'],
 13: ['fun play', 'lot fun'],
 1: ['worth price', 'full price', 'worth time', 'worth money'],
 21: ['dodge roll'],
 2: ['great soundtrack', 'symphony night'],
 14: ['little bit'],
 4: ['range attack', 'melee attack', 'combat system'],
 16: ['passive item'],
 8: ['sound effect', 'sound design'],
 5: ['main character'],
 22: ['long time']}

In [250]:
clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=0.6)
clustering_model.fit(kw_embeddings)
cluster_assignment = clustering_model.labels_

kw_clusters = {}
for kw_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in kw_clusters:
        kw_clusters[cluster_id] = []

    kw_clusters[cluster_id].append(kws[kw_id])

In [651]:
# clustering_model = hdbscan.HDBSCAN(min_cluster_size=2)
# clustering_model.fit(kw_embeddings)
# cluster_assignment = clustering_model.labels_

# kw_clusters = {}
# for kw_id, cluster_id in enumerate(cluster_assignment):
#     if cluster_id not in kw_clusters:
#         kw_clusters[cluster_id] = []

#     kw_clusters[cluster_id].append(kws[kw_id])

In [18]:
cluster_tuples = []
kw_tuples = []

for game_id in df_sent_prep['game_id'].unique():
    sents_prep = df_sent_prep[df_sent_prep['game_id'] == game_id]['sent_prep']
    
    bigram_freq = get_ngram(sents_prep, 2, 3)
    bigram_df = pd.DataFrame(bigram_freq, columns=['bigram', 'freq'])
    
    bigram_df_50 = []
    count = 0
    
    for bigram, freq in zip(bigram_df['bigram'], bigram_df['freq']): 
        if bigram_rules(bigram):
            bigram_df_50.append((bigram, freq))

            count += 1

            if count == 50: 
                break
                
    bigram_df = pd.DataFrame(bigram_df_50, columns=['bigram', 'freq'])
    
    kws = bigram_df['bigram']
    kw_embeddings = embedder.encode(kws)
    
    # Normalize the embeddings to unit length
    kw_embeddings = kw_embeddings /  np.linalg.norm(kw_embeddings, axis=1, keepdims=True)
    
    # perform agglomerative clustering
    clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=0.6)
    clustering_model.fit(kw_embeddings)
    cluster_assignment = clustering_model.labels_
    
    kw_clusters = {}
    for kw_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in kw_clusters:
            kw_clusters[cluster_id] = []

        kw_clusters[cluster_id].append(kws[kw_id])
        
    bigram_df['cluster_num'] = cluster_assignment
    
    for cluster_num, cluster_val in kw_clusters.items():
        cluster_tuples.append((int(game_id), int(cluster_num), cluster_val[0]))
    
    for kw, freq, cluster_num in zip(bigram_df['bigram'], bigram_df['freq'], bigram_df['cluster_num']):
        kw_tuples.append((kw, freq, int(cluster_num), int(game_id)))

In [None]:
with conn:
    cursor.executemany("""INSERT INTO clusters (game_id, cluster_num, cluster_name) VALUES 
                        (?, ?, ?);""", cluster_tuples)    

In [None]:
with conn:
    cursor.executemany("""INSERT INTO kws (kw, freq, cluster_id) VALUES 
                        (?, ?, (SELECT cluster_id FROM clusters WHERE cluster_num=? AND game_id=?));""",
                       kw_tuples)

### Step 4: 
- Loop through **sent_prep** in table **sents**, fuzzy-match each **kw** in table **kws**. 
    - Insert **cluster_id**, **sent_id** in table **clusters_sents** to link table **clusters** and **sents**.
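
The linking step above can be sketched without any third-party dependency. The snippet below is only an illustration: `min_substring_distance` and `link_clusters` are hypothetical helpers, not part of this notebook, and the first one is a standard-library stand-in for what `find_near_matches(kw, sent_prep, max_l_dist=1)` checks, namely whether some substring of the sentence is within one edit of the keyword. The toy keywords and sentences are made up.

```python
def min_substring_distance(pattern, text):
    """Smallest Levenshtein distance between pattern and any substring of text."""
    # prev[i] = cost of aligning pattern[:i] with a substring ending at the previous text position
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        curr = [0]  # a match may start at any position, so the empty prefix costs nothing
        for i, p in enumerate(pattern, start=1):
            cost = 0 if p == ch else 1
            curr.append(min(prev[i] + 1,          # extra character in the text
                            curr[i - 1] + 1,      # missing character in the text
                            prev[i - 1] + cost))  # match or substitution
        best = min(best, curr[-1])
        prev = curr
    return best


def link_clusters(kw_clusters, sents, max_dist=1):
    """Return (cluster_id, sent_id) pairs, at most one per cluster and sentence."""
    links = []
    for sent_id, sent in sents.items():
        for cluster_id, cluster_kws in kw_clusters.items():
            # any() stops at the first keyword that matches, like the `break` in the loop below
            if any(min_substring_distance(kw, sent) <= max_dist for kw in cluster_kws):
                links.append((cluster_id, sent_id))
    return links


# toy data (hypothetical): cluster_id -> keywords, sent_id -> preprocessed sentence
toy_clusters = {1: ["dark soul", "soul like"], 2: ["boss fight"]}
toy_sents = {10: "feel like dark soul", 11: "bos fight hard", 12: "nice music"}

toy_links = link_clusters(toy_clusters, toy_sents)
print(toy_links)
```

A sentence that mentions several keywords of the same cluster is linked to that cluster only once, while a sentence can still be linked to several different clusters.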

In [None]:
df_sent_prep = pd.read_sql_query("""SELECT game_id, sent_prep, sent_id
                    FROM sents JOIN reviews USING(review_id) JOIN games USING(game_id);""", conn)

df_kw = pd.read_sql_query("""SELECT game_id, cluster_id, kw 
                    FROM kws LEFT JOIN clusters USING(cluster_id);""", conn)

In [None]:
cluster_sent_tuples = []

for game_id in df_sent_prep['game_id'].unique():
    sents_prep = df_sent_prep[df_sent_prep['game_id'] == game_id]
    kws = df_kw[df_kw['game_id'] == game_id]
    
    kw_clusters = {}
    for kw, cluster_id in zip(kws['kw'], kws['cluster_id']):
        if cluster_id not in kw_clusters:
            kw_clusters[cluster_id] = []

        kw_clusters[cluster_id].append(kw)
        
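    for sent_id, sent_prep in zip(sents_prep['sent_id'], sents_prep['sent_prep']):
        # cluster_kws avoids shadowing the kws DataFrame defined above
        for cluster_id, cluster_kws in kw_clusters.items():
            for kw in cluster_kws:
                # allow at most one edit (Levenshtein distance <= 1) between kw and a substring
                matches = find_near_matches(kw, sent_prep, max_l_dist=1)

                # link the sentence to the cluster once, on the first matching keyword
                if len(matches) != 0: 
                    cluster_sent_tuples.append((cluster_id, sent_id))
                    break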

In [None]:
with conn:
    cursor.executemany("""INSERT INTO clusters_sents  (cluster_id, sent_id) VALUES 
                        (?, ?);""", cluster_sent_tuples)

### Step 5:
- Remove every sentence from table **sents** whose **sent_id** does not appear in table **clusters_sents**.
    - Insert **senti** (the Flair sentiment label: 1, 0, or -1) and **sent_embedding** into table **sents**.
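
As a rough, self-contained sketch of this step, the snippet below runs the same delete-then-update pattern against an in-memory SQLite database. Everything in it is illustrative: the toy tables hold only the columns needed here, `to_label` is a hypothetical helper that mirrors the thresholding done on the Flair score, and the signed scores are invented rather than produced by a model.

```python
import sqlite3


def to_label(signed_score, threshold=0.9):
    """Map a signed confidence in [-1, 1] to 1, -1, or 0 (neutral)."""
    if signed_score > threshold:
        return 1
    if signed_score < -threshold:
        return -1
    return 0


demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# toy schema: only the columns this step touches
demo_cur.execute("CREATE TABLE sents (sent_id INTEGER PRIMARY KEY, sent TEXT, senti INTEGER)")
demo_cur.execute("CREATE TABLE clusters_sents (cluster_id INTEGER, sent_id INTEGER)")
demo_cur.executemany("INSERT INTO sents (sent_id, sent) VALUES (?, ?)",
                     [(1, "Great soundtrack."),
                      (2, "I bought it on sale."),
                      (3, "Boss fights are unfair.")])
demo_cur.executemany("INSERT INTO clusters_sents (cluster_id, sent_id) VALUES (?, ?)",
                     [(7, 1), (2, 3)])

# drop sentences that were never linked to a keyword cluster
demo_cur.execute("""DELETE FROM sents
                    WHERE sent_id NOT IN (SELECT DISTINCT sent_id FROM clusters_sents);""")

# made-up signed scores standing in for the sentiment model output
demo_scores = {1: 0.98, 3: -0.95}
demo_cur.executemany("UPDATE sents SET senti = ? WHERE sent_id = ?;",
                     [(to_label(score), sent_id) for sent_id, score in demo_scores.items()])
demo_conn.commit()

demo_rows = demo_cur.execute("SELECT sent_id, senti FROM sents ORDER BY sent_id;").fetchall()
print(demo_rows)
```

Sentence 2 is removed because no cluster references it, and the two remaining sentences receive a label only because their scores clear the 0.9 threshold; a weaker score would be stored as 0.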

In [None]:
# remove all sentences that contain no keyword from table sents
with conn: 
    cursor.execute("""
        DELETE FROM sents 
        WHERE sent_id NOT IN (
            SELECT DISTINCT sent_id
            FROM clusters_sents);
    """)

In [None]:
df_sent = pd.read_sql_query("""SELECT sent_id, sent, recommended
                                FROM sents JOIN reviews USING(review_id);""", conn)

In [None]:
df_sent

In [None]:
def get_score_flair(sent, threshold=0.9): 
    """Return 1 (positive), -1 (negative) or 0 (neutral) for a sentence.

    The label is neutral unless Flair's confidence exceeds `threshold`.
    """
    sent_flair = flair.data.Sentence(sent)
    sentiment_model_fast.predict(sent_flair)
    
    label = sent_flair.labels[0]
    
    # signed confidence: positive for 'POSITIVE', negative otherwise
    score_flair = label.score if label.value == 'POSITIVE' else -label.score
    
    if score_flair > threshold:
        return 1
    if score_flair < -threshold:
        return -1
    return 0

In [None]:
sentis = df_sent['sent'].map(get_score_flair) 

In [None]:
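sent_embeddings = embedder.encode(df_sent['sent'])

# normalize the embeddings to unit length
sent_embeddings = sent_embeddings / np.linalg.norm(sent_embeddings, axis=1, keepdims=True)

# pickle each vector to bytes (not str, despite the name) so it can be stored in SQLite
sent_embeddings_str = [pickle.dumps(sent_embedding) for sent_embedding in sent_embeddings]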

In [None]:
sent_tuples = []

for senti, sent_embedding_str, sent_id in zip(sentis, sent_embeddings_str, df_sent['sent_id']): 
    sent_tuples.append((senti, sent_embedding_str, sent_id))     

In [None]:
with conn:
    cursor.executemany("""UPDATE sents SET (senti, sent_embedding) = (?, ?)
                        WHERE sent_id=?;""", sent_tuples)

In [None]:
df_sent_embedding = pd.read_sql_query("""SELECT sent_embedding FROM sents;""", conn)
sent_embeddings = np.array([pickle.loads(sent_embedding) for sent_embedding in df_sent_embedding['sent_embedding']])

In [256]:
df_sent = pd.read_sql_query("""SELECT cluster_id, sent_id, sent, sent_embedding, senti, cluster_name, game_id
                                FROM clusters_sents LEFT JOIN(sents) USING(sent_id)
                                                    LEFT JOIN(clusters) USING(cluster_id);""", conn)

In [257]:
def decode_sent_embeddings(sent_embeddings):
    return np.array([pickle.loads(sent_embedding) for sent_embedding in sent_embeddings])

In [258]:
game_id = 1

In [259]:
df_sent = df_sent[df_sent['game_id'] == game_id]

In [260]:
df_sent

Unnamed: 0,cluster_id,sent_id,sent,sent_embedding,senti,cluster_name,game_id
0,1,1,Metroidvania with some influences from Dark So...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0,dark soul,1
1,11,3,I had to check some guide to see the true ending.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0,true ending,1
2,16,21,A short but sweet metroidvania that is worth y...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,worth price,1
3,9,25,It's a 2-d action platformer with a simple yet...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,action platformer,1
4,20,25,It's a 2-d action platformer with a simple yet...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,range attack,1
...,...,...,...,...,...,...,...
4188,2,22073,From a metroidvania perspective the gameplay h...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,-1,boss fight,1
4189,3,22080,"Con un diseo ""pixel-art"", escenarios en 2D, y ...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,pixel art,1
4190,1,22085,One of the first wave of dark souls inspired m...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,dark soul,1
4191,6,22086,I rate it much higher than Hollow Knight.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0,hollow knight,1


In [30]:
df_sent_1 = df_sent[['senti', 'cluster_name']].groupby(['cluster_name', 'senti']).size().reset_index(name='count')

df_sent_1['sum_count'] = df_sent_1['count'].groupby(df_sent_1['cluster_name']).transform('sum')

df_sent_1 = df_sent_1.sort_values(by=['sum_count', 'senti'], ascending=[False, True])

df_sent_1

Unnamed: 0,cluster_name,senti,count,sum_count
6,dark soul,-1,77,675
7,dark soul,0,299,675
8,dark soul,1,299,675
51,pixel art,-1,22,606
52,pixel art,0,64,606
...,...,...,...,...
38,long time,1,24,40
41,main character,1,19,40
48,passive item,-1,10,39
49,passive item,0,19,39


In [641]:
colors_dict = {
    1: '#26de81',
    0: '#fed330',
    -1: '#fc5c65'
}

names_dict = {
    1: 'positive',
    0: 'neutral',
    -1: 'negative'
}

fig = go.Figure()

# one trace per sentiment label, so each cluster gets a three-part stacked bar
for senti in (1, 0, -1):
    df_senti = df_sent_1[df_sent_1['senti'] == senti]
    fig.add_trace(go.Bar(name=names_dict[senti], x=df_senti['cluster_name'], y=df_senti['count'], marker_color=colors_dict[senti]))

# Change the bar mode
fig.update_layout(barmode='stack')
fig.show()

In [90]:
df_sent_cluster = df_sent[(df_sent['cluster_name'] == 'dark soul') & (df_sent['senti'] == -1)]

In [91]:
for sent in df_sent_cluster['sent']: 
    print(f'>>> {sent}')

>>> The Bad: low monster variety, combat is simplistic , the minimalist story that would make Dark Souls seem overly detailed, characters were forgettable.
>>> The difficulty is a bit extreme for me, the combat almost feels a bit Dark Souls-ish which I am personally not a fan of due to its frustrating nature.
>>> It's not Dark Souls hard.
>>> Difficulty wise, I'd put it about the same as "first time though Dark Souls": The game WILL give you a beating, but it does use patterns that you can pick up on and figure routes around or though encounters with minimal effort.
>>> The music and the setting try to be Dark Souls, but the parade of anime dolls and the bouncy pink blocks feel like... something else.
>>> The game is difficult to the point where you might smash your controller at a point, but it gets better with powerups, it reminded me of Dark Souls in a way, and I love it for that reason, but after completing it I don't think I'd play it again.
>>> Game is trying to be Dark Souls but

In [93]:
sents = df_sent_cluster['sent'].reset_index(drop = True)
sent_embeddings = decode_sent_embeddings(df_sent_cluster['sent_embedding'])

In [95]:
sents

0     The Bad: low monster variety, combat is simpli...
1     The difficulty is a bit extreme for me, the co...
2                             It's not Dark Souls hard.
3     Difficulty wise, I'd put it about the same as ...
4     The music and the setting try to be Dark Souls...
                            ...                        
72    All of these things exist outside of Dark Soul...
73    This game feels like the lovechild of Dark Sou...
74        Not like Dark Souls (someone else said that).
75    1 and 2 were Cave Story-esque without the focu...
76    I never really got into Dark Souls, and I thin...
Name: sent, Length: 77, dtype: object

In [97]:
summary = []

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters, random_state=0)
clustering_model.fit(sent_embeddings)

cluster_assignment = clustering_model.labels_
centroids = clustering_model.cluster_centers_

# group the sentences by their assigned cluster
clustered_sentences = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(sents[sentence_id])

# for each centroid, vq returns the index of the nearest sentence embedding and its distance
closest, distances = vq(centroids, sent_embeddings)

# the sentence nearest to each centroid represents its cluster in the summary
for index in closest: 
    summary.append(sents[index])

In [99]:
summary

["With some distinct pages taken out of the Dark Souls series' book (illusory walls, to name just one), I feel that taking another hint from those games and implementing some more in-depth item descriptions would've helped enhance the story and perhaps give further incentive for completion of optional tasks and areas.",
 'Game is trying to be Dark Souls but feels more like Blasphemous (which was trying to be Dark Souls).',
 "Some people compared this game is almost difficult as Dark Souls (which if you don't know it's the standard of very difficult games), and I do think so too.",
 'Feels sort of like half Megaman and half Castlevania, with a dash of Dark Souls thrown in with the combat evading.',
 'Dark Souls but not Dark Souls.']

In [50]:
for cluster in clustered_sentences: 
    print(len(cluster))

65
64
46
50
74


In [51]:
# PCA with 3 dimensions
pca = PCA(n_components=3)
components = pca.fit_transform(sent_embeddings)

total_var = pca.explained_variance_ratio_.sum() * 100

In [640]:

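# sents must have one entry per row of components (same DataFrame, same order);
# the ValueError below was raised because their lengths differed
# np.str was removed in NumPy 1.24; the built-in str is equivalent
fig = px.scatter_3d(
    components, x=0, y=1, z=2, color=clustering_model.labels_.astype(str),
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}, 
    hover_data=[sents],
)
fig.show()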

ValueError: All arguments should have the same length. The length of argument `hover_data_0` is 3774, whereas the length of  previously-processed arguments ['0', '1', '2'] is 299

In [638]:
df_sent = pd.read_sql_query("""SELECT sent, sent_embedding, game_id
                                FROM sents JOIN reviews USING(review_id)
                                WHERE game_id=1;""", conn)

In [639]:
df_sent

Unnamed: 0,sent,sent_embedding,game_id
0,Metroidvania with some influences from Dark So...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
1,I had to check some guide to see the true ending.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
2,A short but sweet metroidvania that is worth y...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
3,It's a 2-d action platformer with a simple yet...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
4,"Other than that, there isn't a ton of replay v...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
...,...,...,...
3769,From a metroidvania perspective the gameplay h...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
3770,"Con un diseo ""pixel-art"", escenarios en 2D, y ...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
3771,One of the first wave of dark souls inspired m...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
3772,I rate it much higher than Hollow Knight.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1


In [291]:
sents = df_sent['sent'].reset_index(drop = True)
sent_embeddings = decode_sent_embeddings(df_sent['sent_embedding'])
n_sents = 5

In [288]:
df_sent_cluster = df_sent

In [597]:
def get_centroid(arr):
    # mean of the row vectors, computed column by column
    return np.mean(arr, axis=0)

# return, for each centroid, the index of the nearest vector in corpus_embeddings
def get_nearest_indexes(centroids, corpus_embeddings):
    return vq(centroids, corpus_embeddings)[0]

In [598]:
def hdbscan_sum(sents, sent_embeddings, n_sents):
    # Perform hdbscan clustering
    clustering_model = hdbscan.HDBSCAN(min_cluster_size=2)
    clustering_model.fit(sent_embeddings)
    cluster_assignment = clustering_model.labels_

    clustered_sents = {}
    clustered_sent_embeddings = {}

    for sent_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in clustered_sents:
            clustered_sents[cluster_id] = []
            clustered_sent_embeddings[cluster_id] = []

        clustered_sents[cluster_id].append(sents[sent_id])
        clustered_sent_embeddings[cluster_id].append(sent_embeddings[sent_id])

    # visit clusters from largest to smallest
    unique, counts = np.unique(cluster_assignment, return_counts=True)
    unique_counts = sorted(zip(unique, counts), key=lambda x: x[1], reverse=True)

    sents_sum = []

    for cluster_num, _ in unique_counts:
        # HDBSCAN labels noise points as -1; they do not form a cluster, so skip them
        if cluster_num == -1:
            continue

        cluster_embeddings = np.array(clustered_sent_embeddings[cluster_num])
        centroid = np.array([get_centroid(cluster_embeddings)])

        # the sentence closest to the cluster centroid represents the cluster
        index = get_nearest_indexes(centroid, cluster_embeddings)[0]
        sents_sum.append(clustered_sents[cluster_num][index])

        if len(sents_sum) == n_sents: 
            break
    
    return sents_sum

In [600]:
hdbscan_sum(sents,  sent_embeddings, 10)

["It's worth the price.",
 'Great soundtrack.',
 'Also, you can kill a Witch by slapping her in the boobs with a giant maple leaf.',
 'Metroidvania with some influences from Dark Souls.',
 'Great metroidvania.',
 "One of the best metroidvania games I've played.",
 'A fantastic, very well made action platformer with great tone and engaging gameplay.',
 'Good replay value.',
 'Fun boss fights.',
 'Hollow Knight.']

In [606]:
def kmeans_sum(sents, sent_embeddings, n_sents=5):
    """Summarize with one sentence per k-means cluster: the one nearest to each centroid."""
    clustering_model = KMeans(n_clusters=n_sents, random_state=0)
    clustering_model.fit(sent_embeddings)

    centroids = clustering_model.cluster_centers_

    # index of the sentence embedding nearest to each centroid
    indexes = get_nearest_indexes(centroids, sent_embeddings)

    return [sents[index] for index in indexes]

In [607]:
kmeans_sum(sents, sent_embeddings, 10)

['Its a little short (3-5 hours on your first playthrough), but it has different endings and a decent replay value, overall its really fun.',
 'Has some challenging boss fights.',
 'The art style is amazing, and along with the soundtrack it creates a unique atmosphere.',
 'Speaking of that the game is also fairly inspired by Dark Souls.',
 'Very good metroidvania styled game.',
 'Solid game, worth the price.',
 'So you have a melee attack and a ranged attack.',
 'Awesome game, this game has great controls and the artstyle is quite pretty.',
 'Fantastic pixel art.',
 "Hard mode is actually fairly challenging as well and I've been having fun playing though it."]

In [636]:
def agglo_sum(sents, sent_embeddings, n_sents=5, threshold=0.3): 
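    # Perform agglomerative clustering; the number of clusters is decided by distance_threshold
    # note: scikit-learn >= 1.2 renames `affinity` to `metric` (`affinity` was removed in 1.4)
    clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=threshold)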
    clustering_model.fit(sent_embeddings)
    cluster_assignment = clustering_model.labels_

    clustered_sents = {}
    clustered_sent_embeddings = {}

    for sent_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in clustered_sents:
            clustered_sents[cluster_id] = []
            clustered_sent_embeddings[cluster_id] = []

        clustered_sents[cluster_id].append(sents[sent_id])
        clustered_sent_embeddings[cluster_id].append(sent_embeddings[sent_id])

    # visit clusters from largest to smallest
    unique, counts = np.unique(cluster_assignment, return_counts=True)
    unique_counts = sorted(zip(unique, counts), key=lambda x: x[1], reverse=True)

    sents_sum = []

    # agglomerative clustering assigns every sentence to a cluster, so there is no noise label to skip
    for cluster_num, _ in unique_counts:
        cluster_embeddings = np.array(clustered_sent_embeddings[cluster_num])
        centroid = np.array([get_centroid(cluster_embeddings)])

        # the sentence closest to the cluster centroid represents the cluster
        index = get_nearest_indexes(centroid, cluster_embeddings)[0]
        sents_sum.append(clustered_sents[cluster_num][index])

        if len(sents_sum) == n_sents: 
            break
    
    return sents_sum