### Steps
- Use a list of **app_id** to get info from Steam crawler and insert **app_id**, **game_name**, **header_img_url**, **total_positive**, **total_negative**, **total_reviews** into table **games** and insert **game_id**, **review**, **recommended**, **time** into table **reviews**.
- Preprocess **review** in table **reviews** (add_missing_punct, replace_bullets, remove_url, remove_html_tags, normalize_single_quote, remove_non_ascii, remove_ansi_escape_sequences, remove_multi_whitespaces), then tokenize_sent and remove_leading_symbols to insert **review_id**, **sent** into table **sents**. 
- Preprocess **sent** in table **sents** (lowercase, expand contractions, remove_digits, remove_symbols, remove_multi_whitespaces, lemmatize_text, remove_stopwords) to create **sent_prep** in table **sents**.
- Use **sent_prep**, **review_id** in table **sents** to insert **review_prep** in table **reviews** by joining **sent_prep**.
- Use **review_prep** in table **reviews** to calculate special bigrams frequency, get 50 most frequent keywords.
- Insert **kw**, **freq** into table **kws**.
- Embed 50 keywords using S-BERT and cluster them using agglomerative clustering with a distance_threshold=0.6. 
	- Insert **cluster_name** (name of the most frequent keyword in cluster) into table **clusters**.
	- Insert **cluster_id** in table **kws**.
- Loop through **sent_prep** in table **sents**, fuzzy-match each **kw** in table **kws**. 
    - Insert **cluster_id**, **sent_id** in table **clusters_sents** to link table **clusters** and **sents**.
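
The table layout implied by the steps above can be sketched as follows. The notebook never shows its `CREATE TABLE` statements, so every column type, key and constraint here is an assumption inferred from the `INSERT`, `UPDATE` and `JOIN` statements in the later cells. Only the columns those cells actually touch are listed, and the link table is named `kws_sents` as in the code. The in-memory database is for illustration only.

```python
import sqlite3

# Hypothetical schema: column names come from the notebook's queries,
# types and constraints are assumptions.
SCHEMA = """
CREATE TABLE games (
    game_id INTEGER PRIMARY KEY AUTOINCREMENT,
    app_id INTEGER UNIQUE,
    game_name TEXT,
    header_img_url TEXT
);
CREATE TABLE reviews (
    review_id INTEGER PRIMARY KEY AUTOINCREMENT,
    game_id INTEGER REFERENCES games(game_id),
    review TEXT,
    recommended INTEGER,
    time INTEGER
);
CREATE TABLE sents (
    sent_id INTEGER PRIMARY KEY AUTOINCREMENT,
    review_id INTEGER REFERENCES reviews(review_id),
    sent TEXT,
    senti INTEGER,
    sent_embedding BLOB
);
CREATE TABLE clusters (
    cluster_id INTEGER PRIMARY KEY AUTOINCREMENT,
    cluster_name TEXT
);
CREATE TABLE kws (
    kw_id INTEGER PRIMARY KEY AUTOINCREMENT,
    kw TEXT,
    cluster_id INTEGER REFERENCES clusters(cluster_id)
);
CREATE TABLE kws_sents (
    kw_id INTEGER REFERENCES kws(kw_id),
    sent_id INTEGER REFERENCES sents(sent_id),
    rank INTEGER
);
"""

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(SCHEMA)

# List the user tables that were created (internal sqlite_* tables excluded).
table_names = sorted(
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type='table' AND name NOT LIKE 'sqlite_%'"
    )
)
print(table_names)
```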

### Step 1:

- Use a list of **app_id** to get info from Steam crawler and insert **app_id**, **game_name**, **header_img_url**, **total_positive**, **total_negative**, **total_reviews** into table **games** and insert **game_id** (fk), **review**, **recommended**, **time** into table **reviews**.

In [14]:
import requests 
import pandas as pd
import numpy as np

import download_steam_reviews

import sqlite3

import re
from bs4 import BeautifulSoup

import spacy

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

from sklearn.cluster import AgglomerativeClustering

from fuzzysearch import find_near_matches

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import flair

sentiment_model = flair.models.TextClassifier.load('sentiment')
sentiment_model_fast = flair.models.TextClassifier.load('sentiment-fast')
senti_analyzer = SentimentIntensityAnalyzer()

import pickle

import plotly.express as px
import plotly.graph_objects as go

from sklearn.cluster import KMeans
import hdbscan
from summa.summarizer import summarize

from scipy.cluster.vq import vq

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# import umap.umap_ as umap

from tqdm import tqdm

from collections import Counter

from fuzzywuzzy import fuzz, process

from spacy import displacy

from pprint import pprint

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.sum_basic import SumBasicSummarizer

import plotly.graph_objects as go
from math import ceil

2021-08-19 22:00:11,585 loading file C:\Users\HuyTran\.flair\models\sentiment-en-mix-distillbert_4.pt
2021-08-19 22:00:14,959 loading file C:\Users\HuyTran\.flair\models\sentiment-en-mix-ft-rnn.pt


In [15]:
conn = sqlite3.connect('./data/steam_reviews_new.db') 
cursor = conn.cursor()

In [84]:
# 428550 - Momodora: Reverie under the Moonlight
# 367520 - Hollow Knight 
# 736260 - Baba Is You 
# 501300 - What Remains of Edith Finch
# 504230 - Celeste
# 22000 - World of Goo
# 40700 - Machinarium
# 26800 - Braid
# 1222700 - A Way Out
# 1225570 - Unravel Two
app_ids = [428550, 367520, 736260, 501300, 504230, 22000, 40700, 26800, 1222700, 1225570]

In [22]:
### get game_name ###

def get_app_list(): 
    app_list_url = 'https://api.steampowered.com/ISteamApps/GetAppList/v2/'
    resp_data = requests.get(app_list_url)
    return resp_data.json()

def get_name(app_id, app_list): 
    for app in app_list['applist']['apps']: 
        if app['appid'] == app_id: 
            return app['name']
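
`get_name` scans the whole app list once per `app_id`. A possible alternative, sketched below, builds a lookup dictionary once and reuses it. `build_name_index` and the sample payload are hypothetical; the payload only mimics the `applist` / `apps` shape that `get_name` reads. If the same `appid` appeared twice, the dictionary would keep the last entry while `get_name` returns the first.

```python
def build_name_index(app_list):
    """Map appid -> name once, so each later lookup is a dict access."""
    return {app['appid']: app['name'] for app in app_list['applist']['apps']}

# Hand-written sample in the shape that get_name() expects.
sample_app_list = {
    'applist': {
        'apps': [
            {'appid': 504230, 'name': 'Celeste'},
            {'appid': 26800, 'name': 'Braid'},
        ]
    }
}

name_index = build_name_index(sample_app_list)
print(name_index.get(504230))  # known ID
print(name_index.get(999))     # unknown ID gives None, like get_name()
```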

In [23]:
### get header_img_url  

def get_header_img_url(app_id): 
    return f'https://cdn.cloudflare.steamstatic.com/steam/apps/{app_id}/header.jpg'

In [24]:
# number of distinct space-separated tokens (not the character length)
def get_text_length(text): 
    return len(set(text.split(' ')))
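
Despite its name, `get_text_length` counts distinct tokens, so repeated words do not add to the result. A small check with made-up strings:

```python
def get_text_length(text):
    return len(set(text.split(' ')))

repeated = get_text_length('good good good')      # one distinct token
varied = get_text_length('short but fun game')    # four distinct tokens
print(repeated, varied)
```

This is why the later filter `get_text_length(...) > 2` also drops reviews that repeat one or two words many times.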

In [85]:
### get total_positive, total_negative, total_reviews in Crawler for table games 
### get game_id, review, recommended, time in Crawler for table reviews 

app_list = get_app_list()

request_params = {
    'language': 'english'
}

# load or download new (maximum ~5000 newest reviews)
load_mode = False

for app_id in app_ids: 
    game_tuples = [] 

    game_name = get_name(app_id, app_list)
    
    header_img_url = get_header_img_url(app_id)

    
    if load_mode: 
        review_dict = download_steam_reviews.load_review_dict(app_id)['reviews'].values()
    else: 
        review_dict = download_steam_reviews.download_reviews_for_app_id(app_id, 
                                                                     chosen_request_params=request_params)[0]['reviews'].values()
    
    with conn:
        cursor.execute("""INSERT INTO games (app_id, game_name, header_img_url) VALUES (?, ?, ?);""", 
                       (app_id, game_name, header_img_url))
    
    # get game_id (fk) for table reviews
    game_id = cursor.execute("""SELECT game_id FROM games WHERE app_id=?;""", (app_id,)).fetchone()[0] 
    
    review_tuples = []
    
    for review_dict_value in review_dict:     
        review = review_dict_value['review']
        recommended = 1 if review_dict_value['voted_up'] else 0
        time = review_dict_value['timestamp_updated']
        
        review_tuples.append((review, recommended, time, game_id))
        
        
#         if get_text_length(review) < 3:
#             print(f'> {review}')
            
    
    with conn:
        cursor.executemany("""INSERT INTO reviews (review, recommended, time, game_id) VALUES 
                                (?, ?, ?, ?);""", review_tuples)    

[appID = 1145360] expected #reviews = 85063
502 Bad Gateway for appID = 1145360 and cursor = AoJ4i+rv0/oCeuSj7QI=. Cooldown: 10 seconds
Number of queries 150 reached. Cooldown: 310 seconds
Number of queries 150 reached. Cooldown: 310 seconds
Number of queries 150 reached. Cooldown: 310 seconds
Number of queries 150 reached. Cooldown: 310 seconds
Number of queries 150 reached. Cooldown: 310 seconds
502 Bad Gateway for appID = 1145360 and cursor = AoJwhPXv/e4Cfpvt4wE=. Cooldown: 10 seconds
[appID = 275850] expected #reviews = 126533
502 Bad Gateway for appID = 275850 and cursor = AoJ4zYu8+voCca/C8AI=. Cooldown: 10 seconds
502 Bad Gateway for appID = 275850 and cursor = AoJ4k+2IsvYCdfuougI=. Cooldown: 10 seconds
Number of queries 150 reached. Cooldown: 310 seconds
502 Bad Gateway for appID = 275850 and cursor = AoJw38ibgfYCe4bAsQI=. Cooldown: 10 seconds
502 Bad Gateway for appID = 275850 and cursor = AoJ41pXYo/MCe+z9lAI=. Cooldown: 10 seconds
Number of queries 150 reached. Cooldown: 310 s

### Step 2:

- Preprocess **review** in table **reviews** (add_missing_punct, replace_bullets, remove_url, remove_html_tags, normalize_single_quote, remove_non_ascii, remove_ansi_escape_sequences, remove_multi_whitespaces), then tokenize_sent and remove_leading_symbols to insert **review_id**, **sent** into table **sents**. 
- Preprocess **sent** in table **sents** (lowercase, expand contractions, remove_digits, remove_symbols, remove_multi_whitespaces, lemmatize_text, remove_stopwords) to insert **sent_prep** in table **sents**.
- Use **sent_prep**, **review_id** in table **sents** to insert **review_prep** in table **reviews** by joining **sent_prep**.
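
To make the review-level cleaning concrete, here is a small worked example that chains three of the helpers named above on a made-up review. The function bodies mirror the notebook cells (written with raw strings); the sample text is an assumption, and the helpers that need BeautifulSoup or spaCy are left out to keep the sketch standard-library only.

```python
import re

def remove_url(text):
    return re.sub(r"http\S+", ' ', text)

def add_missing_punct(text):
    # add ". " after a line (or the whole text) that ends without punctuation
    return re.sub(r'([A-Za-z0-9])\s*(\n+|$)', r'\g<1>. ', text)

def remove_multi_whitespaces(text):
    return re.sub(r'\s+', ' ', text.strip())

raw_review = "Great game\nCheck http://example.com now"

cleaned = raw_review
for step in (remove_url, add_missing_punct, remove_multi_whitespaces):
    cleaned = step(cleaned)

print(cleaned)
```

The line break becomes a sentence boundary, which is what lets the sentence tokenizer split the review afterwards.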
        

    

In [165]:
df_reviews = pd.read_sql_query("""SELECT review_id, review, game_id 
                    FROM reviews JOIN games USING(game_id);""", conn)

In [166]:
df_reviews

| | review_id | review | game_id |
|---|---|---|---|
| 0 | 1 | Metroidvania with some influences from Dark So... | 1 |
| 1 | 2 | It's ok. Frustrating mechanics turned me off o... | 1 |
| 2 | 3 | Cat | 1 |
| 3 | 4 | Has problems but overall pretty good. I'd sugg... | 1 |
| 4 | 5 | This is a very good and very short game. If yo... | 1 |
| ... | ... | ... | ... |
| 277499 | 277500 | Quite nice, actually. | 12 |
| 277500 | 277501 | Rest in peace Harambe. Your name will always b... | 12 |
| 277501 | 277502 | 1ST PUBLIC REVIEW!\n\nEDIT: Quick tips, left c... | 12 |
| 277502 | 277503 | Its pretty good tbh\nEDIT: Its pretty good but... | 12 |


In [27]:
def add_missing_punct(text): 
    return re.sub(r'([A-Za-z0-9])\s*(\n+|$)', r'\g<1>. ', text)


def replace_bullets(text): 
    text = re.sub(r'([A-Za-z0-9])\s*\n+\s*[+-]?\s*', r'\g<1>. ', text)
    text = re.sub(r'\s*([:+-]+)\s*\n+\s*[+-]?\s*', '. ', text) 
    return text
    
def replace_colons(text): 
    return re.sub(r'\s*:\s*(\n+)\s*', r'.\g<1>', text) 
    
# remove url from text
def remove_url(text):
    return re.sub(r"http\S+", ' ', text)


def remove_square_brackets(text): 
    return re.sub(r'\[(.*?)\]', ' ', text)

# remove HTML tags
def remove_html_tags(text):
    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()  
    return text


# replace ’ with ' 
def normalize_single_quote(text):
    return re.sub('[’‘]', '\'', text)


# remove non english characters effectively
def remove_non_ascii(text): 
    return text.encode("ascii", errors="ignore").decode()
    
    
# remove ANSI escape sequences
def remove_ansi_escape_sequences(text):
    ansi_escape = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
    return ansi_escape.sub('', text)
    
    
# remove multiple whitespaces with single whitespace
def remove_multi_whitespaces(text): 
    return re.sub(r'\s+', ' ', text.strip())

In [28]:
def remove_bullet_nums(sent): 
    return re.sub(r'^\s*\d+\s*[.\)]+\s*([^\d])', r'\g<1>', sent)

def remove_leading_symbols(sent):
    return re.sub(r'^[^A-Za-z\"\'\d]+', '', sent)

def uppercase_first(sent): 
    return sent[0].upper() + sent[1:] if len(sent) != 0 else sent

In [168]:
# ['transformer', 'parser']
# ['tok2vec', 'parser']
# ['sentencizer']

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("sentencizer")

def tokenize_sent(text, pipes=['tok2vec', 'parser']):
    with nlp.select_pipes(enable=pipes):
        doc = nlp(text)
        sents = [str(sent).strip() for sent in doc.sents]
    return sents

In [169]:
def lowercase(text):
    return text.lower()

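# `contractions` is not defined in the cells shown here; it is expected to be
# a dict mapping each contraction to its expansion
def expand_contractions(text):
    for key in contractions:
        value = contractions[key]
        text = text.replace(key, value)
    return text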

# remove digits 
def remove_digits(text): 
    return re.sub(r'\d+', ' ', text)

# remove symbols 
def remove_symbols(text):
    return re.sub(r'[^A-Za-z,.\s\d]+', ' ', text)

# lemmatization with spacy 
def lemmatize_text(text): 
    doc = nlp(text, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc if token.pos_ != 'PUNCT']
    return ' '.join(lemma)

# remove stop words 
def remove_stopwords(text, word_list=[]):
    stop_words = stopwords.words("english")
    stop_words.extend(word_list)
    stop_words = set(stop_words)
    return ' '.join(e.lower() for e in text.split() if e.lower() not in stop_words)

def get_extra_stopwords(game_name): 
    # local name chosen so it does not shadow the imported nltk `stopwords`
    extra_stopwords = set(['game', 'lot', 'bit', 'way'])
    doc = nlp(game_name, disable=['parser', 'ner'])
    for token in doc: 
        if token.pos_ not in {'PUNCT', 'NUM'}:
            extra_stopwords.add(token.text.lower())
    return extra_stopwords
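
The sentence-level steps that need no model can be traced on a single made-up sentence. This is a minimal sketch: the two-entry `contractions` dict is a hypothetical stand-in for the notebook's real mapping, which is not shown, and lemmatization and stop-word removal are skipped because they need spaCy and NLTK.

```python
import re

# Hypothetical, tiny stand-in for the notebook's `contractions` dict.
contractions = {"don't": "do not", "it's": "it is"}

def expand_contractions(text):
    for key, value in contractions.items():
        text = text.replace(key, value)
    return text

def remove_digits(text):
    return re.sub(r'\d+', ' ', text)

def remove_symbols(text):
    return re.sub(r'[^A-Za-z,.\s\d]+', ' ', text)

def remove_multi_whitespaces(text):
    return re.sub(r'\s+', ' ', text.strip())

sent = "I don't like the 2 bosses!!"

prep = sent.lower()
for step in (expand_contractions, remove_digits, remove_symbols,
             remove_multi_whitespaces):
    prep = step(prep)

print(prep)
```

Lowercasing has to come before `expand_contractions` here because the lookup is a plain, case-sensitive `str.replace`.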

In [176]:
sent_tuples = set()

# filter out short reviews (len < 3)
df_reviews = df_reviews[df_reviews.apply(lambda x: get_text_length(x['review']) > 2, axis=1)]

for game_id in [11, 12]:
# for game_id in df_reviews['game_id'].unique():
    df_reviews_game = df_reviews[df_reviews['game_id'] == game_id]
    
    reviews_game_cleaned = df_reviews_game['review'].map(remove_square_brackets)\
                        .map(remove_html_tags)\
                        .map(remove_url)\
                        .map(normalize_single_quote)\
                        .map(remove_non_ascii)\
                        .map(remove_ansi_escape_sequences)\
                        .map(add_missing_punct)\
                        .map(replace_colons)
    
    for review_id, review in zip(df_reviews_game['review_id'], reviews_game_cleaned):
        sents = pd.Series(tokenize_sent(review)).map(remove_bullet_nums)\
                                                .map(remove_leading_symbols)\
                                                .map(uppercase_first)\
                                                .map(add_missing_punct)\
                                                .map(remove_multi_whitespaces)
        
        # filter out short sentences (len < 3) and sentences with uneven number of ", ), ( 
        for sent in sents:
            if get_text_length(sent) > 2 and sent.count('"') % 2 == 0 and sent.count('(') == sent.count(')'):
                sent_tuples.add((review_id, sent))
                
sent_tuples = list(sent_tuples)


"...  " looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.





In [181]:
# with conn:
#     cursor.executemany("""INSERT INTO sents (review_id, sent, sent_prep) VALUES 
#                         (?, ?, ?);""", sent_tuples)    

with conn:
    cursor.executemany("""INSERT INTO sents (review_id, sent) VALUES 
                        (?, ?);""", sent_tuples)    

### Step 3: 
- Use **review_prep** in table **reviews** to calculate special bigrams frequency, get 50 most frequent keywords.
- Insert **kw**, **freq** into table **kws**.
- Embed 50 keywords using S-BERT and cluster them using agglomerative clustering with a distance_threshold=0.6. 
	- Insert **cluster_name** (name of the most frequent keyword in cluster) into table **clusters**.
	- Insert **cluster_id** in table **kws**.

In [251]:
df_sents_ = pd.read_sql_query("""SELECT sent, sent_id, game_id
                    FROM sents JOIN reviews USING(review_id) JOIN games USING(game_id);""", conn)

In [223]:
def get_tokens_index_dict(string):
    index_dict = {}
    index = 0 
    for i, char in enumerate(string): 
        if char == ' ':
            index += 1
        else: 
            index_dict[i] = index
    
    return index_dict

In [224]:
def get_stopwords(word_list=[]):
    stopwords = nlp.Defaults.stop_words.copy()  
    stopwords |= set(word_list)
    return stopwords

In [225]:
def get_pos_tags(tokens): 
    # `stop_words` is a global set; it must be assigned (see the per-game loop) before this is called
    pos_tags = []
    
    for token in tokens: 
        if token.text == ',':
            pos_tags.append('COMMA')
        elif token.text == '-': 
            pos_tags.append('HYPHEN')
        elif token.text.lower() in stop_words: 
            pos_tags.append('STOP')
        elif token.pos_ in ('NOUN', 'PROPN'):
            pos_tags.append('NOUN')
        else: 
            pos_tags.append(token.pos_)
    
    return pos_tags

In [226]:
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("sentencizer")

def get_kws_sent_ids(df_sents):
    kws_sent_ids = set()

    # one keyword = optional adjectives (comma-separated) followed by one or more nouns (optionally hyphenated)
    pattern = re.compile('(ADJ (COMMA ADJ )*)*(NOUN (HYPHEN NOUN )*(HYPHEN )?)*NOUN')

    for sent, sent_id in zip(df_sents['sent'], df_sents['sent_id']): 
        with nlp.select_pipes(enable=['transformer', 'tagger', 'attribute_ruler']):
            doc = nlp(sent)

        tokens = [token.text for token in doc]
        tags_str = ' '.join(get_pos_tags([token for token in doc]))
        tokens_index_dict = get_tokens_index_dict(tags_str)
        matches = pattern.finditer(tags_str)
        
        for match in matches: 
            tags = [elem for elem in match.group(0).split()]
            
            index = match.start()
            
            kw = []
            
            for i, tag in enumerate(tags): 
                kw_token = tokens[tokens_index_dict[index]]
                
                index += len(tag) + 1
                
                if tag in ('NOUN', 'ADJ'):
                    kw.append(kw_token.lower())
                
            kw = ' '.join(kw)
            
            if len(kw) > 1: 
                kws_sent_ids.add((kw, sent_id))
            
        if len(kws_sent_ids) % 1000 == 0:
            print(f'Processed {len(kws_sent_ids)}')
    
    return kws_sent_ids
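
The offset bookkeeping in `get_kws_sent_ids` is easier to follow on a toy input. In the sketch below the tokens and tags are written by hand instead of coming from spaCy, and `extract_keywords` is a hypothetical helper that repeats only the matching logic: the regex runs over the space-joined tag string, and `get_tokens_index_dict` maps each character offset in that string back to a token position.

```python
import re

PATTERN = re.compile('(ADJ (COMMA ADJ )*)*(NOUN (HYPHEN NOUN )*(HYPHEN )?)*NOUN')

def get_tokens_index_dict(string):
    """Map each non-space character offset to the index of its token."""
    index_dict = {}
    index = 0
    for i, char in enumerate(string):
        if char == ' ':
            index += 1
        else:
            index_dict[i] = index
    return index_dict

def extract_keywords(tokens, tags):
    tags_str = ' '.join(tags)
    index_dict = get_tokens_index_dict(tags_str)
    keywords = []
    for match in PATTERN.finditer(tags_str):
        offset = match.start()
        words = []
        for tag in match.group(0).split():
            # keep adjectives and nouns, skip COMMA / HYPHEN placeholders
            if tag in ('NOUN', 'ADJ'):
                words.append(tokens[index_dict[offset]].lower())
            offset += len(tag) + 1  # jump to the start of the next tag
        keywords.append(' '.join(words))
    return keywords

# Hand-tagged example sentence (tags are assumptions, not spaCy output).
tokens = ['Great', 'music', 'and', 'tight', ',', 'responsive', 'controls']
tags = ['ADJ', 'NOUN', 'CCONJ', 'ADJ', 'COMMA', 'ADJ', 'NOUN']

keywords = extract_keywords(tokens, tags)
print(keywords)
```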

In [227]:
def get_kws_sent_ids_group(kws_sent_ids):
    res={}
    
    for kw_sent_id in kws_sent_ids:
        if kw_sent_id[0] not in res:
            res[kw_sent_id[0]] = [kw_sent_id[1]]
        else:
            res[kw_sent_id[0]].append(kw_sent_id[1])
            
    return res
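
The same grouping can be written with `collections.defaultdict`. This is an equivalent sketch rather than the notebook's code; `group_sent_ids_by_kw` and the sample pairs are made up, and the last line repeats the "seen in more than one sentence" filter that the per-game loop applies later.

```python
from collections import defaultdict

def group_sent_ids_by_kw(kws_sent_ids):
    """Turn (kw, sent_id) pairs into {kw: [sent_id, ...]}."""
    grouped = defaultdict(list)
    for kw, sent_id in kws_sent_ids:
        grouped[kw].append(sent_id)
    return dict(grouped)

pairs = [('boss fight', 3), ('music', 7), ('boss fight', 12)]
grouped = group_sent_ids_by_kw(pairs)

# keep only keywords that occur in more than one sentence
frequent = {kw: ids for kw, ids in grouped.items() if len(ids) > 1}
print(grouped)
print(frequent)
```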

In [228]:
# get the list of biggest clusters 
def get_biggest_clusters(kw_clusters, top_n=50):
    cluster_nums_sizes = []

    for cluster_num, kws_sent_ids in kw_clusters.items():
        if cluster_num != -1:
            cluster_size = sum([len(kw_sent_ids[1]) for kw_sent_ids in kws_sent_ids])
            cluster_nums_sizes.append((cluster_num, cluster_size))

    cluster_nums_sizes_sorted = sorted(cluster_nums_sizes, key=lambda x: x[1], reverse=True)
    return set([cluster_num_size[0] for cluster_num_size in cluster_nums_sizes_sorted[:top_n]])

In [229]:
# get the next auto-increment id of the primary key of a table
def get_current_auto_pk(table_name):
    # bind the table name as a parameter instead of formatting it into the SQL
    cursor.execute("SELECT seq FROM sqlite_sequence WHERE name=?", (table_name,))
    pk_id = cursor.fetchone() 
    return pk_id[0] + 1 if pk_id else 1
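
How `sqlite_sequence` behaves can be checked on a throwaway in-memory database. The sketch assumes the real tables declare their keys with `AUTOINCREMENT` (otherwise SQLite does not track them in `sqlite_sequence`); `next_auto_pk` and the `clusters` definition are illustrative, not the notebook's schema.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    "CREATE TABLE clusters ("
    "cluster_id INTEGER PRIMARY KEY AUTOINCREMENT, cluster_name TEXT)"
)

def next_auto_pk(conn, table_name):
    """Next id SQLite would hand out, or 1 if the table was never filled."""
    row = conn.execute(
        "SELECT seq FROM sqlite_sequence WHERE name = ?", (table_name,)
    ).fetchone()
    return row[0] + 1 if row else 1

before = next_auto_pk(demo_conn, 'clusters')   # no row in sqlite_sequence yet
demo_conn.executemany(
    "INSERT INTO clusters (cluster_name) VALUES (?)",
    [('music',), ('story',)],
)
after = next_auto_pk(demo_conn, 'clusters')    # seq is now 2
print(before, after)
```

Predicting ids this way only holds while a single writer inserts the rows in the same order as the tuples, which is the situation in this notebook.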

In [230]:
# get tuples for 3 tables: clusters, kws and kws_sents
def get_cluster_kw_tuples(game_id, kw_clusters, biggest_clusters):
    cluster_tuples, kw_tuples, kw_sent_tuples = [], [], []
    
    # get cluster_id and kw_id
    cluster_id = get_current_auto_pk('clusters')
    kw_id = get_current_auto_pk('kws')
    
    for cluster_num, kws_sent_ids in kw_clusters.items():         
        if cluster_num in biggest_clusters:
            kws_sent_ids_sorted = sorted(kws_sent_ids, key=lambda x: len(x[1]), reverse=True)
            cluster_name = kws_sent_ids_sorted[0][0]
            
            # clusters
            cluster_tuples.append((cluster_name,))            

            for kw_sent_ids in kws_sent_ids: 
                kw = kw_sent_ids[0]
                sent_ids = kw_sent_ids[1]
                
                # kws
                kw_tuples.append((kw, cluster_id))
                
                for sent_id in sent_ids: 
                    kw_sent_tuples.append((sent_id, kw_id))
                    
                kw_id += 1
                
            cluster_id += 1
    
    return cluster_tuples, kw_tuples, kw_sent_tuples

In [231]:
def cluster_kws(kws): 
    # note: reads the global `kws_sent_ids_group` built in the per-game loop
    kw_embeddings = embedder.encode(kws)

    # Normalize the embeddings to unit length
    kw_embeddings = kw_embeddings / np.linalg.norm(kw_embeddings, axis=1, keepdims=True)

    # perform agglomerative clustering
    # (scikit-learn 1.2 renamed `affinity` to `metric`; `affinity` was removed in 1.4)
    clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=0.6)
    clustering_model.fit(kw_embeddings)
    cluster_assignment = clustering_model.labels_

    kw_clusters = {}
    for kw_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in kw_clusters:
            kw_clusters[cluster_id] = []

        kw = kws[kw_id]
        kw_clusters[cluster_id].append((kw, kws_sent_ids_group[kw]))
    
    return kw_clusters
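
What `distance_threshold=0.6` means for cosine distance can be worked out by hand. With average linkage, two clusters are merged only while the mean pairwise distance between their members stays below the threshold, so 0.6 corresponds to an average cosine similarity above 0.4. The sketch below uses toy 2-D vectors as stand-ins for S-BERT embeddings and plain Python instead of scikit-learn.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_distance(u, v):
    # for unit-length vectors, cosine distance is 1 minus the dot product
    u, v = normalize(u), normalize(v)
    return 1.0 - sum(a * b for a, b in zip(u, v))

DISTANCE_THRESHOLD = 0.6  # value used in cluster_kws

close_pair = cosine_distance([1.0, 0.0], [1.0, 1.0])  # 45 degrees apart
far_pair = cosine_distance([1.0, 0.0], [0.0, 1.0])    # orthogonal

print(round(close_pair, 3), close_pair < DISTANCE_THRESHOLD)
print(round(far_pair, 3), far_pair < DISTANCE_THRESHOLD)
```

The first pair would be allowed to merge, the second would not.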

In [233]:
df_sents

| | sent | sent_id |
|---|---|---|
| 197224 | Do I recommend this game? Well...at 10-15$, I'... | 197225 |
| 197226 | Watched some recent videos and decided to buy ... | 197227 |
| 197228 | Get it. Don't get it yet (fixes), but get it. | 197229 |
| 197229 | After almost 14 hours of gameplay, I cannot re... | 197230 |
| 197233 | Unfortunately it cost me $60 dollars and 13 ho... | 197234 |
| ... | ... | ... |
| 375778 | As many others have described this game, a mil... | 375779 |
| 375782 | Bought this game in 2016 and it was absolutely... | 375783 |
| 375785 | The best space/survival/exploraion game ever. | 375786 |
| 375787 | Initial release: I still enjoyed the game, but... | 375788 |


In [252]:
game_ids = df_sents_['game_id'].unique()

In [236]:
for game_id in game_ids: 
    df_sents = df_sents_[df_sents_['game_id'] == game_id][['sent', 'sent_id']][:1000]

    # remove stopwords before grouping
    stop_words = get_stopwords(['game', 'games', 'lot', 'lots', 'ton', 'tons', 'bit', 'bits', 'fun', 
                                'way', 'ways', 'thing', 'things', 'time', 'times', 'type', 'types', 
                                   'opinion', 'opinions', 'sense', 'terms', 'lack', 'fact'])

    kws_sent_ids = get_kws_sent_ids(df_sents)
    kws_sent_ids_group = get_kws_sent_ids_group(kws_sent_ids)
    kws_sent_ids_group = {key: kws_sent_ids_group[key] for key, val in kws_sent_ids_group.items() if len(val) > 1}

    kws = list(kws_sent_ids_group.keys())

    kw_clusters = cluster_kws(kws)

    biggest_clusters = get_biggest_clusters(kw_clusters, 100)
    cluster_tuples, kw_tuples, kw_sent_tuples = get_cluster_kw_tuples(game_id, kw_clusters, biggest_clusters)
    
    with conn:
        cursor.executemany("""INSERT INTO clusters (cluster_name) VALUES 
                            (?);""", cluster_tuples)    
    
    with conn:
        cursor.executemany("""INSERT INTO kws (kw, cluster_id) VALUES 
                            (?, ?);""", kw_tuples)

    with conn:
        cursor.executemany("""INSERT INTO kws_sents (sent_id, kw_id) VALUES 
                            (?, ?);""", kw_sent_tuples)

Processed 2000
Processed 2000


### Step 5:
- Remove all **sent_id** in table **sents** if they don't exist in table **kws_sents**.
    - Insert **score_flair**, **score_vader**, **recommended**, **score_total**, **sent_embedding** in table **sents**.

In [255]:
with conn: 
    cursor.execute("""
                        DELETE FROM sents 
                        WHERE sent_id NOT IN (
                        SELECT DISTINCT sent_id
                        FROM kws_sents);
                   """)

In [137]:
# df_sents = pd.read_sql_query("""SELECT cluster_id, sent_id, game_id, cluster_name, sent, sent_embedding, senti
#                                 FROM clusters_sents 
#                                 JOIN clusters USING(cluster_id)
#                                 JOIN sents USING(sent_id);""", conn)

In [258]:
df_sents = pd.read_sql_query("""SELECT game_id, sent, sent_id 
                                FROM reviews JOIN sents USING(review_id);""", conn)

In [259]:
df_sents

| | game_id | sent | sent_id |
|---|---|---|---|
| 0 | 5 | Music is brilliant. | 1 |
| 1 | 5 | Don't buy it on sale because the developer des... | 5 |
| 2 | 4 | Calvin Finch died from falling off a cliff fro... | 6 |
| 3 | 7 | It's simple and puzzle solutions don't make yo... | 7 |
| 4 | 5 | Short version: Great story, great difficulty, ... | 9 |
| ... | ... | ... | ... |
| 136105 | 7 | Short (4 - 8 hours) | 197219 |
| 136106 | 9 | Fun to figure out puzzles with a friend. | 197220 |
| 136107 | 9 | The story was overall alright, at times predic... | 197221 |
| 136108 | 4 | Each room was preserved after the death of the... | 197223 |


In [265]:
def decode_sent_embeddings(sent_embeddings):
    return np.array([pickle.loads(sent_embedding) for sent_embedding in sent_embeddings])

In [260]:
def get_score_flair(sent, threshold=0.9): 
    sent_flair = flair.data.Sentence(sent)
    sentiment_model_fast.predict(sent_flair)
    
    value_flair = sent_flair.labels[0].value
    score_flair = sent_flair.labels[0].score
    
    value_flair = 1 if value_flair == 'POSITIVE' else -1
    
    score_flair = value_flair*score_flair
    
    return 1 if score_flair > threshold else -1 if score_flair < -threshold else 0
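
The mapping from Flair's label and confidence to the three-valued `senti` can be isolated from the model. `to_ternary` below is a hypothetical helper that reproduces only the thresholding step of `get_score_flair`; the label/confidence pairs are made up, not model output.

```python
def to_ternary(label, confidence, threshold=0.9):
    """1 = confidently positive, -1 = confidently negative, 0 = unsure."""
    signed = confidence if label == 'POSITIVE' else -confidence
    if signed > threshold:
        return 1
    if signed < -threshold:
        return -1
    return 0

print(to_ternary('POSITIVE', 0.97))
print(to_ternary('NEGATIVE', 0.95))
print(to_ternary('POSITIVE', 0.70))
```

With the default threshold, any prediction whose confidence is 0.9 or lower is stored as neutral.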

In [261]:
sentis = df_sents['sent'].map(get_score_flair) 

In [262]:
sent_embeddings = embedder.encode(df_sents['sent'])

sent_embeddings = sent_embeddings / np.linalg.norm(sent_embeddings, axis=1, keepdims=True)

sent_embeddings_str = [pickle.dumps(sent_embedding) for sent_embedding in sent_embeddings]

In [263]:
sent_tuples = []

for senti, sent_embedding_str, sent_id in zip(sentis, sent_embeddings_str, df_sents['sent_id']): 
    sent_tuples.append((senti, sent_embedding_str, sent_id))     

In [264]:
with conn:
    cursor.executemany("""UPDATE sents SET (senti, sent_embedding) = (?, ?)
                        WHERE sent_id=?;""", sent_tuples)
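
The round trip of a pickled embedding through a BLOB column can be tried on an in-memory database. This sketch simplifies two things: the vector is a plain Python list rather than a NumPy array, and the `sents` table has only the columns needed here (an assumed layout). The `UPDATE` is written column by column, which does the same as the row-value form used above.

```python
import pickle
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    "CREATE TABLE sents ("
    "sent_id INTEGER PRIMARY KEY, senti INTEGER, sent_embedding BLOB)"
)
demo_conn.execute("INSERT INTO sents (sent_id) VALUES (1)")

embedding = [0.6, 0.8]            # stand-in for a unit-length embedding
blob = pickle.dumps(embedding)    # bytes are stored as a BLOB

demo_conn.execute(
    "UPDATE sents SET senti = ?, sent_embedding = ? WHERE sent_id = ?",
    (1, blob, 1),
)

stored = demo_conn.execute(
    "SELECT sent_embedding FROM sents WHERE sent_id = 1"
).fetchone()[0]
restored = pickle.loads(stored)
print(type(stored).__name__, restored)
```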

In [267]:
df_sents = pd.read_sql_query("""SELECT game_id, cluster_id, cluster_name, kw_id, kw, sent_id, sent, sent_embedding, senti
                                FROM reviews
                                JOIN sents USING(review_id) 
                                JOIN kws_sents USING(sent_id)
                                JOIN kws USING(kw_id)
                                JOIN clusters USING(cluster_id);""", conn)

In [268]:
df_sents

| | game_id | cluster_id | cluster_name | kw_id | kw | sent_id | sent | sent_embedding | senti |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | couple | 1 | love | 194652 | You can just tell it was a labor of love, and ... | `b'\x80\x03cnumpy.core.multiarray\n_reconstruct...` | 1 |
| 1 | 1 | 1 | couple | 1 | love | 21009 | Lots of love went into this game. | `b'\x80\x03cnumpy.core.multiarray\n_reconstruct...` | 1 |
| 2 | 1 | 1 | couple | 1 | love | 102171 | Started the game and played a few minutes and ... | `b'\x80\x03cnumpy.core.multiarray\n_reconstruct...` | 0 |
| 3 | 1 | 1 | couple | 1 | love | 169784 | I was genuinely surprised by how much I enjoye... | `b'\x80\x03cnumpy.core.multiarray\n_reconstruct...` | 1 |
| 4 | 1 | 1 | couple | 1 | love | 41086 | It is very polished, you can feel it has been ... | `b'\x80\x03cnumpy.core.multiarray\n_reconstruct...` | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 267019 | 10 | 1000 | phyreengine | 17325 | phyreengine | 70798 | PhyreEngine is a free to use game engine from ... | `b'\x80\x03cnumpy.core.multiarray\n_reconstruct...` | 0 |
| 267020 | 10 | 1000 | phyreengine | 17325 | phyreengine | 107612 | You made a nice world using the PhyreEngine. | `b'\x80\x03cnumpy.core.multiarray\n_reconstruct...` | 1 |
| 267021 | 10 | 1000 | phyreengine | 17325 | phyreengine | 100522 | PhyreEngine has been adopted by several game s... | `b'\x80\x03cnumpy.core.multiarray\n_reconstruct...` | 0 |
| 267022 | 10 | 1000 | phyreengine | 17325 | phyreengine | 3304 | You made a nice world using the PhyreEngine. | `b'\x80\x03cnumpy.core.multiarray\n_reconstruct...` | 1 |


In [297]:
game_id = 1

In [298]:
df_sent = df_sents[df_sents['game_id'] == game_id]

In [299]:
cluster_names = list(df_sent['cluster_name'].value_counts().index)

In [300]:
cluster_names = list(df_sent['cluster_name'].value_counts().index)
max_sents = 5

if len(cluster_names) >= max_sents: 
    n_sents = 1
    top_n_cluster = max_sents
else: 
    n_sents = ceil(max_sents / len(cluster_names)) 
    top_n_cluster = len(cluster_names)

cluster_name_vals = []
sent_vals = []


for cluster_name in cluster_names[:top_n_cluster]: 
    df_sent_cluster = df_sent[df_sent['cluster_name'] == cluster_name]
    
    sents = df_sent_cluster['sent'].reset_index(drop = True)
    sent_embeddings = decode_sent_embeddings(df_sent_cluster['sent_embedding'])
    
    summary_sents = hybrid_sum(sents, sent_embeddings, n_sents)

    cluster_name_vals.extend([cluster_name] * n_sents)
    sent_vals.extend(summary_sents)
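
The budget logic at the top of the previous cell (how many clusters to show and how many sentences per cluster) can be pulled into a small function and checked on its own. `plan_summary` is a hypothetical name, and the guard for zero clusters is an addition; the cell itself would divide by zero in that case.

```python
from math import ceil

def plan_summary(n_clusters, max_sents=5):
    """Return (sentences per cluster, number of clusters to summarize)."""
    if n_clusters == 0:
        return 0, 0  # guard added in this sketch
    if n_clusters >= max_sents:
        return 1, max_sents
    return ceil(max_sents / n_clusters), n_clusters

many = plan_summary(8)   # enough clusters: one sentence from each of the top 5
few = plan_summary(3)    # few clusters: several sentences from each
print(many, few)
```

With 3 clusters the plan asks for 2 sentences each, so the table can hold up to 6 rows, one more than `max_sents`.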

In [301]:
fig = go.Figure(data=[go.Table(header=dict(values=['Aspect', 'Sentence']),
                 cells=dict(values=[cluster_name_vals, sent_vals]))
                     ])
fig.show()

In [307]:
cluster_sent_tuples = []

for game_id in range(0, 11):
    df_sent = df_sents[(df_sents['game_id'] == game_id)]
    n_sents = 5
    cluster_names = list(df_sent['cluster_name'].value_counts().index)

    for cluster_name in cluster_names:     
    #     print(f'Summary for cluster: {cluster_name}')
        df_sent_cluster_name = df_sent[df_sent['cluster_name'] == cluster_name]

        sents = df_sent_cluster_name['sent'].reset_index(drop = True)
        sent_embeddings = decode_sent_embeddings(df_sent_cluster_name['sent_embedding'])

        # summarize with hybrid clustering
        sents_sum = hybrid_sum(sents, sent_embeddings, n_sents)

    #     pprint(sents_sum)

        for i, sent in enumerate(sents_sum): 
            df_sent_rows = df_sent_cluster_name[df_sent_cluster_name['sent'] == sent]

            kw_id = df_sent_rows['kw_id'].iloc[0]
            sent_id = df_sent_rows['sent_id'].iloc[0]

            cluster_sent_tuples.append([int(i + 1), int(kw_id), int(sent_id)])

In [309]:
# 500 sents per game
len(cluster_sent_tuples)

5000

In [310]:
with conn:
    cursor.executemany("""UPDATE kws_sents SET rank = ?
                        WHERE kw_id=? AND sent_id=?;""",
                   (cluster_sent_tuples))

In [274]:
def get_centroid(arr):
    # mean of the row vectors, one value per dimension
    return arr.mean(axis=0)

# return index of the vectors in corpus_embeddings nearest to the centroid
def get_nearest_indexes(centroids, corpus_embeddings):
    return vq(centroids, corpus_embeddings)[0]
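
The idea behind these two helpers, picking the sentence whose embedding lies closest to the cluster centroid, can be shown without NumPy or SciPy. The sketch uses Euclidean distance, which is also what `scipy.cluster.vq.vq` uses; `centroid_of`, `nearest_index` and the 2-D points are illustrative.

```python
def centroid_of(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def nearest_index(point, vectors):
    """Index of the vector with the smallest squared distance to point."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(point, v))
    return min(range(len(vectors)), key=lambda i: sq_dist(vectors[i]))

embeddings = [[0.0, 0.0], [4.0, 0.0], [2.0, 1.0]]
center = centroid_of(embeddings)
representative = nearest_index(center, embeddings)
print(center, representative)
```

The third vector is the representative: it is the only one near the middle of the group.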

In [275]:
def kmeans_sum(sents, sent_embeddings, n_sents=5):
    sents_sum = []

    clustering_model = KMeans(n_clusters=n_sents, random_state=0)
    clustering_model.fit(sent_embeddings)

    cluster_assignment = clustering_model.labels_
    centroids = clustering_model.cluster_centers_

    clustered_sentences = [[] for i in range(n_sents)]
    for sentence_id, cluster_id in enumerate(cluster_assignment):
        clustered_sentences[cluster_id].append(sents[sentence_id])

    indexes = get_nearest_indexes(centroids, sent_embeddings)
    
#     closest, distances = vq(centroids, sent_embeddings_pca)

    for index in indexes: 
        sents_sum.append(sents[index])
        
    return sents_sum

In [292]:
def agglo_cluster(sents, sent_embeddings, n_sents, threshold=0.5): 
    clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=threshold)
    clustering_model.fit(sent_embeddings)
    cluster_assignment = clustering_model.labels_

    clustered_sents = {}
    clustered_sent_embeddings = {}

    for sent_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in clustered_sents:
            clustered_sents[cluster_id] = []
            clustered_sent_embeddings[cluster_id] = []

        clustered_sents[cluster_id].append(sents[sent_id])
        clustered_sent_embeddings[cluster_id].append(sent_embeddings[sent_id])
    
#     pprint(clustered_sents)
    
    return (cluster_assignment, clustered_sents, clustered_sent_embeddings) if len(clustered_sents) >= n_sents else agglo_cluster(sents, sent_embeddings, n_sents, threshold=threshold - 0.1)

In [293]:
def hdbscan_cluster(sents, sent_embeddings, n_sents):
    clustering_model = hdbscan.HDBSCAN(min_cluster_size=2)
    clustering_model.fit(sent_embeddings)
    cluster_assignment = clustering_model.labels_

    clustered_sents = {}
    clustered_sent_embeddings = {}

    for sent_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in clustered_sents:
            clustered_sents[cluster_id] = []
            clustered_sent_embeddings[cluster_id] = []

        clustered_sents[cluster_id].append(sents[sent_id])
        clustered_sent_embeddings[cluster_id].append(sent_embeddings[sent_id])
        
#     pprint(clustered_sents)
        
    return (cluster_assignment, clustered_sents, clustered_sent_embeddings) 

In [295]:
def hybrid_cluster(sents, sent_embeddings, n_sents): 
    clustering_model = hdbscan.HDBSCAN(min_cluster_size=2)
    clustering_model.fit(sent_embeddings)
    cluster_assignment = clustering_model.labels_

    clustered_sents = {}
    clustered_sent_embeddings = {}

    for sent_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in clustered_sents:
            clustered_sents[cluster_id] = []
            clustered_sent_embeddings[cluster_id] = []

        clustered_sents[cluster_id].append(sents[sent_id])
        clustered_sent_embeddings[cluster_id].append(sent_embeddings[sent_id])
    
#     pprint(clustered_sents)
    
    # Drop HDBSCAN's noise cluster (label -1) before counting clusters
    clustered_sents = {k: v for k, v in clustered_sents.items() if k != -1}
    
    if len(clustered_sents) >= n_sents:
        # Enough real clusters: strip the noise label from the other outputs as well
        cluster_assignment = [e for e in cluster_assignment if e != -1]
        clustered_sent_embeddings = {k: v for k, v in clustered_sent_embeddings.items() if k != -1}
        return cluster_assignment, clustered_sents, clustered_sent_embeddings
    else:
        # Too few clusters: fall back to agglomerative clustering on all sentences
        return agglo_cluster(sents, sent_embeddings, n_sents)
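
The three clustering functions above repeat the same label-to-items grouping loop. A minimal sketch of how that loop could be factored out is shown here; `group_by_cluster` and its `drop_noise` flag are hypothetical names, not part of the notebook, and the only library fact relied on is that HDBSCAN marks noise points with the label -1.

```python
from collections import defaultdict


def group_by_cluster(items, labels, drop_noise=False):
    """Group items by cluster label (hypothetical helper, not in the notebook).

    items  : sequence of sentences or embeddings
    labels : one cluster label per item, e.g. clustering_model.labels_
    drop_noise : if True, skip items labelled -1 (HDBSCAN's noise label)
    """
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        if drop_noise and label == -1:
            continue
        groups[int(label)].append(item)
    return dict(groups)


demo_sents = ["a", "b", "c", "d"]
demo_labels = [0, -1, 1, 0]

with_noise = group_by_cluster(demo_sents, demo_labels)
without_noise = group_by_cluster(demo_sents, demo_labels, drop_noise=True)
print(with_noise)
print(without_noise)
```

Calling it once for the sentences and once for the embeddings would reproduce `clustered_sents` and `clustered_sent_embeddings`, since both are grouped with the same labels.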

In [296]:
def sort_clusters(cluster_assignment):
    cluster_nums, counts = np.unique(cluster_assignment, return_counts=True)
    cluster_nums_counts = list(zip(cluster_nums, counts))
    
    return [cluster_num for cluster_num, _ in sorted(cluster_nums_counts, key=lambda x: x[1], reverse=True)]
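
To make the ordering concrete, here is a small worked example of `sort_clusters` on made-up labels. The function body is copied from the cell above (with the labels converted to plain `int` for printing); the label list is invented for illustration.

```python
import numpy as np


def sort_clusters(cluster_assignment):
    # Count how many items carry each label, then order labels by count, largest first
    cluster_nums, counts = np.unique(cluster_assignment, return_counts=True)
    cluster_nums_counts = list(zip(cluster_nums, counts))
    return [int(cluster_num) for cluster_num, _ in sorted(cluster_nums_counts, key=lambda x: x[1], reverse=True)]


# Made-up labels: cluster 3 has 4 items, cluster 0 has 3, cluster 1 has 2, cluster 2 has 1
example_labels = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
ranked = sort_clusters(example_labels)
print(ranked)  # [3, 0, 1, 2]
```

Because `gen_sum` walks this list from the front, the summary draws its sentences from the most populated clusters first.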

In [280]:
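# Generate a summary: one representative sentence from each of the n_sents largest clusters
def gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents):
    # Cluster labels ordered from largest to smallest cluster
    cluster_nums = sort_clusters(cluster_assignment)
    
    sents_count = 0

    sents_sum = []

    for cluster_num in cluster_nums:
        # Centroid of the cluster's embeddings, wrapped in an outer array
        centroid = get_centroid(np.array(clustered_sent_embeddings[cluster_num]))
        centroid = np.array([centroid])

        # Keep the sentence whose embedding is ranked first (nearest to the centroid)
        index = get_nearest_indexes(centroid, clustered_sent_embeddings[cluster_num])[0]
        sents_sum.append(clustered_sents[cluster_num][index])

        sents_count += 1
        if sents_count == n_sents:
            break
    
    return sents_sum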
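
`gen_sum` relies on `get_centroid` and `get_nearest_indexes`, which are defined elsewhere in the notebook and not shown in this part. The sketch below is a hypothetical stand-in that illustrates the idea of picking the sentence closest to a cluster centroid. It assumes the centroid is the mean embedding and that closeness is cosine similarity; the notebook's own helpers may differ.

```python
import numpy as np


def get_centroid(embeddings):
    # Assumption: the centroid is the element-wise mean of the cluster's embeddings
    return np.asarray(embeddings).mean(axis=0)


def get_nearest_indexes(centroid, embeddings):
    # Assumption: rank embeddings by cosine similarity to the centroid, most similar first
    embeddings = np.asarray(embeddings, dtype=float)
    centroid = np.asarray(centroid, dtype=float).reshape(1, -1)
    emb_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cen_unit = centroid / np.linalg.norm(centroid, axis=1, keepdims=True)
    sims = (emb_unit @ cen_unit.T).ravel()
    return np.argsort(-sims)


# Toy 2-d "embeddings" for one cluster
cluster_embeddings = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
centroid = np.array([get_centroid(np.array(cluster_embeddings))])
nearest = int(get_nearest_indexes(centroid, cluster_embeddings)[0])
print(nearest)  # 1
```

In this toy cluster the second vector points closest to the mean direction, so it would be chosen as the cluster's representative sentence.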

In [281]:
def agglo_sum(sents, sent_embeddings, n_sents=5, threshold=0.5):
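    # Nothing to summarize when the request covers every sentence (same guard as hybrid_sum and hdbscan_sum)
    if len(sents) < 2 or n_sents >= len(sents):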
        return sents
    
    cluster_assignment, clustered_sents, clustered_sent_embeddings = agglo_cluster(sents, sent_embeddings, n_sents, threshold)
    return gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents)   

In [282]:
def hybrid_sum(sents, sent_embeddings, n_sents=5): 
    if len(sents) < 2 or n_sents >= len(sents): 
        return sents
    
    cluster_assignment, clustered_sents, clustered_sent_embeddings = hybrid_cluster(sents, sent_embeddings, n_sents)
    return gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents)

In [283]:
def hdbscan_sum(sents, sent_embeddings, n_sents=5): 
    if len(sents) < 2 or n_sents >= len(sents): 
        return sents
    
    cluster_assignment, clustered_sents, clustered_sent_embeddings = hdbscan_cluster(sents, sent_embeddings, n_sents)
    return gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents)

In [427]:
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.',
          'I hate the whole universe.',
          'The car is badly broken.',
          'Summer is the hottest season.'
          ]

corpus_embeddings = embedder.encode(corpus)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
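
A short check of what the unit-length normalization buys, using random vectors in place of real S-BERT embeddings (an assumption made only so the snippet runs on its own). After normalization, dot products equal cosine similarities, and squared Euclidean distance becomes `2 - 2 * cosine`. That matters for `hdbscan.HDBSCAN`, whose default metric is Euclidean, since it then ranks pairs the same way cosine distance would.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))  # stand-in for embedder.encode(...) output

# Same normalization as in the cell above
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Cosine similarity computed directly from the raw vectors
norms = np.linalg.norm(emb, axis=1)
cos = (emb @ emb.T) / np.outer(norms, norms)

# Pairwise squared Euclidean distances between the unit vectors
diffs = unit[:, None, :] - unit[None, :, :]
sq_dists = (diffs ** 2).sum(axis=-1)

dot_matches_cos = np.allclose(unit @ unit.T, cos)
dist_matches_cos = np.allclose(sq_dists, 2 - 2 * cos)
print(dot_matches_cos, dist_matches_cos)
```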

In [428]:
agglo_sum(corpus, corpus_embeddings, 5)

{0: ['A man is eating food.',
     'A man is eating a piece of bread.',
     'A man is eating pasta.'],
 1: ['The girl is carrying a baby.', 'The baby is carried by the woman'],
 2: ['A monkey is playing drums.',
     'Someone in a gorilla costume is playing a set of drums.'],
 3: ['A cheetah is running behind its prey.',
     'A cheetah chases prey on across a field.'],
 4: ['I hate the whole universe.'],
 5: ['The car is badly broken.'],
 6: ['A man is riding a horse.',
     'A man is riding a white horse on an enclosed ground.'],
 7: ['Summer is the hottest season.']}


['A man is eating food.',
 'The girl is carrying a baby.',
 'Someone in a gorilla costume is playing a set of drums.',
 'A cheetah is running behind its prey.',
 'A man is riding a white horse on an enclosed ground.']

In [429]:
hybrid_sum(corpus, corpus_embeddings, 5)

{-1: ['I hate the whole universe.',
      'The car is badly broken.',
      'Summer is the hottest season.'],
 0: ['The girl is carrying a baby.', 'The baby is carried by the woman'],
 1: ['A man is eating food.',
     'A man is eating a piece of bread.',
     'A man is eating pasta.'],
 2: ['A man is riding a horse.',
     'A man is riding a white horse on an enclosed ground.'],
 3: ['A monkey is playing drums.',
     'Someone in a gorilla costume is playing a set of drums.'],
 4: ['A cheetah is running behind its prey.',
     'A cheetah chases prey on across a field.']}


['A man is eating food.',
 'The girl is carrying a baby.',
 'A man is riding a white horse on an enclosed ground.',
 'Someone in a gorilla costume is playing a set of drums.',
 'A cheetah is running behind its prey.']

In [430]:
hdbscan_sum(corpus, corpus_embeddings, 5)

ValueError: Epsilon must be a float value greater than or equal to 0!

In [455]:
agglo_sum(sents, sent_embeddings, 100)

{0: ["It's a dark and depressing world, and the themes I've encountered "
     'thusfar are good, while the boss battles are good for getting you '
     'psyched and pumped up for a fight.',
     'If you like beating a tough boss fight, then getting killed by some '
     'random monster only to have to redo that fight again, this game is for '
     'you.',
     "There's one boss that has an enemy in the bit of the room prior to the "
     'arena and that enemy will make the fight immensely painful if not '
     'genuinely impossible if you decide not to kill it.',
     "There's also an optional extra boss if you want one more offensive spell "
     'in your inventory (or just seek one more fight), because the boss fights '
     'are just that well-designed.',
     'Pretty much all bosses have random attack orders and this could end up '
     'in your "favor", allowing you to just spam your ranged attack to kill '
     'them... or it can end up just requiring you to run around for 80% o

      'In addition to some new abilities you will gain later on to help with '
      'traversal and combat.',
      'The items available make quite a difference in combat and movement, '
      'giving it a good feeling of progression.',
      'These modifiers range from combat oriented to exploration oriented.',
      'You get to mix and match passive items in order to change your attacks '
      'up a bit, and theres a fair amount of challenge.',
      'Over the course of the game you\'ll acquire a number of "active" items '
      'that you can equip and use to do things like heal yourself or increase '
      'your attack power, and you\'ll acquire more "passive" items that add '
      'effects such as poison to your ranged attacks.',
      'Pick up collectables to enhance your abilities, and switch and swap '
      'passive and active skills to give you just the right advantage to face '
      'the enemy that lays before you.',
      'You have your map, your short ranged and long ran

      'attack, one of which just makes your melee weapon do more damage, and '
      'only two of which actually provide new mobility options.',
      'The controls are simple but effective, and besides one or two design '
      "quirks that personally I would've changed (the interaction between the "
      'attack and roll animations, and the invincibility frames of the latter) '
      "there's nothing to complain about - you have melee and ranged attacks, "
      'dodge, double-jump and items with various active and passive effects to '
      'use.',
      'The ability to use both melee and ranged attacks since the beginning is '
      'quite neat, since it offers more options in the way you engage in '
      'battles.',
      'Combat has you dodging enemies attack with a dodge roll (and later, an '
      'air dodge), attacking enemies with either melee or ranged attacks, '
      'depending on how you want to play.',
      'Enemy damage is way too high (a lot of regular enemies can 2

      'variation, the enemy moves fast and attacks fast, outspeeds us in '
      'general, badly designed bosses.',
      'From the in-game items to the enemy types nothing feels out of place in '
      'the setting presented to you.',
      'Some might say not having different armour and weapons adds to the '
      "simplicity and the charm of this game, but the game doesn't make up for "
      'it.',
      'I often find in games that ranged weapons are normally only put there '
      'for very specific situations, or for people to mess around.',
      "There's no change in weapons but it didn't seem needed; the game felt "
      'well balanced around the moves and weapons you have.',
      'Gameplay elements are much more compacted from past games with all the '
      'gadgets and whatnot removed, but the two weapons with modifiable '
      'properties feel great and their expanded complexities and use-cases '
      'easily make up for the losses.',
      "No real progression in term

['So you have a melee attack and a ranged attack.',
 'Fun game with great visuals and satisfying combat.',
 'Fun and challenging enemies, especially the bosses.',
 'Combat has a nice visceral feel to it.',
 'Few variety of enemies.',
 'This is a great MetroidVania with some interesting combat and a compelling in game universe.',
 'Except using a bow and your leaf to attack enemies, consumables option and even some handy items/spells you can use during combat.',
 "I really tried to like this game, but I can't find any redeeming qualities - the combat is basic and not very satisfying, the platforming feels unresponsive, the story (at least as far as I played) seems non-existant...",
 'This game is fun, yet challenging, and I will admit that I am having some trouble adapting to the enemies attacks, and it leads to a quick death.',
 'The reason is that you will turn your bow from a decent secondary option, into your main weapon that destroys most bosses and enemies with little effort.',
 '

In [351]:
hybrid_sum(sents, sent_embeddings, 100)

NameError: name 'sent_embeddings' is not defined

In [630]:
doc_big = ' '.join(sents)
# Build a sumy parser from the joined sentences (plain-string input)
parser = PlaintextParser.from_string(doc_big, Tokenizer("english"))

In [632]:
# Using LexRank
summarizer = LexRankSummarizer()
# Summarize the document in 5 sentences
summary = summarizer(parser.document, 5)

for sentence in summary:
    print(sentence)

2D Dark Souls.
Dark Souls.
Dark Souls but not Dark Souls.
It's like Dark Souls.
The Dark Souls of 2D.


In [633]:
summarizer_1 = LuhnSummarizer()
summary_1 = summarizer_1(parser.document, 5)
for sentence in summary_1:
    print(sentence)

This game is make but 1 guy but the 1st game was like cave story mod (he said it was mod) and the the second game was different and 1,2 is on itch but with a different name and free then a third game he change his name (i think pze comment on this if i was incorrect) and it's on steam same name of the game's and this is the fourth with a different game and the game it have 2 ending's (i did the good ending by being me) and 1,2,3 game's look cute and dark at the same time but this game is dark pf the darkness and cute the maker of this game's you can find all the game's he did on it's only 5 game's now.
Animations look great, but end up being super clunky in the combat the game sets out, some enemies have contact damage, some dont(even bosses choose randomly on this), some enemies that look exactly the same as others have entirely different attacks that need to to already be doing something that counters them, and sometimes the action that counters one, and the action that counters some

In [634]:
summarizer_2 = LsaSummarizer()
summary_2 = summarizer_2(parser.document, 5)
for sentence in summary_2:
    print(sentence)

Expect 6-8 hours of gameplay on your first run through, even with harder difficulties subsequent runs should take an hour or so since it doesn't change too much between.
After long season of Hollow Knight, Momodora: Reverie Under The Moonlight feels too short but gameplay is fun so you can always try on hard mode and hunt some achievement!
If you like tough as hell action games with c r a z y boss battles, rolling, double-jumping and (later) mid-air dodging.....metroidvania style progression.........beautiful graphics........smooth animation....uhhh???
The story itself isn't really deep, and I would have loved if we could learn just a little bit more about each character, but everything still plays out perfectly.
Now, that 'bad' list seems to offset the good by a large margin, but it is also very dependent on what I find annoying and unnecessary for modern action platformers.


In [637]:
summarizer_3 = TextRankSummarizer()
summary_3 = summarizer_3(parser.document, 5)
for sentence in summary_3:
    print(sentence)

This game is make but 1 guy but the 1st game was like cave story mod (he said it was mod) and the the second game was different and 1,2 is on itch but with a different name and free then a third game he change his name (i think pze comment on this if i was incorrect) and it's on steam same name of the game's and this is the fourth with a different game and the game it have 2 ending's (i did the good ending by being me) and 1,2,3 game's look cute and dark at the same time but this game is dark pf the darkness and cute the maker of this game's you can find all the game's he did on it's only 5 game's now.
Momodora: Reverie Under the Moonlight is a cutesy pixel Metroidvania game with few memorable boss battle with suitable background music while you hack and slash your way to the next boss battle, which you need the skill to dodge the attack because you'll be doing a lot of trial and error even tho the game is short but there is a few flaws on the game, the level design is alright but can 

In [644]:
summarizer_4 = SumBasicSummarizer()
summary_4 = summarizer_4(parser.document, 10)
for sentence in summary_4:
    print(f'> {sentence}')

> Fun metroidvania-style game.
> It manages it to make Dark Souls 2d.
> If you like action platformers this one is for you.
> Good Pixel art.
> 2D metroidvania with a darksouls feel.
> The music, the artstyle, the mechanics!
> and a great soundtrack.
> Very challenging, but lots of fun!
> Boss fights.
> I want more games in that style.


In [671]:
agglo_sum(sents, sent_embeddings, 10, 0.5)

['Very good metroidvania styled game.',
 'Has some challenging boss fights.',
 'Fantastic pixel art.',
 'It has some game mechanics just like Dark Souls .',
 'Great game with fantastic art and music.',
 'Dark Souls.',
 'So you have a melee attack and a ranged attack.',
 "I'm playing the game on normal difficulty and it's already quite challenging.",
 "It reminds me most of Hollow Knight, although it isn't quite as good as that game.",
 'I can say this game is worth of its full price.']