### Steps
- Take a list of **app_id** values, fetch their data with the Steam crawler, and insert **app_id**, **game_name**, **header_img_url**, **total_positive**, **total_negative**, **total_reviews** into table **games**, and **game_id**, **review**, **recommended**, **time** into table **reviews**.
- Preprocess **review** in table **reviews** (add_missing_punct, replace_bullets, remove_url, remove_html_tags, normalize_single_quote, remove_non_ascii, remove_ansi_escape_sequences, remove_multi_whitespaces), then apply tokenize_sent and remove_leading_symbols and insert **review_id**, **sent** into table **sents**.
- Preprocess **sent** in table **sents** (lowercase, expand_contractions, remove_digits, remove_symbols, remove_multi_whitespaces, lemmatize_text, remove_stopwords) to create **sent_prep** in table **sents**.
- Join the **sent_prep** values that share a **review_id** in table **sents** and store the result as **review_prep** in table **reviews**.
- Use **review_prep** in table **reviews** to compute the frequency of special bigrams and keep the 50 most frequent keywords.
- Insert **kw**, **freq** into table **kws**.
- Embed the 50 keywords with S-BERT and cluster them using agglomerative clustering with distance_threshold=0.6.
    - Insert **cluster_name** (the most frequent keyword in the cluster) into table **clusters**.
    - Insert **cluster_id** into table **kws**.
- Loop through **sent_prep** in table **sents** and fuzzy-match each **kw** in table **kws**.
    - Insert **cluster_id**, **sent_id** into table **clusters_sents** to link tables **clusters** and **sents**.
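
The steps above imply a small relational schema. The following is a minimal sketch of it in SQLite, not the schema actually used by the notebook: the column types, the primary-key columns (**review_id**, **sent_id**, **kw_id**, **cluster_id**) and the foreign-key constraints are assumptions, since the steps only name the columns that get inserted.

```python
import sqlite3

# Assumed schema: only the column names come from the steps above;
# types, primary keys and foreign keys are illustrative guesses.
SCHEMA = """
CREATE TABLE games (
    game_id        INTEGER PRIMARY KEY,
    app_id         INTEGER UNIQUE,
    game_name      TEXT,
    header_img_url TEXT,
    total_positive INTEGER,
    total_negative INTEGER,
    total_reviews  INTEGER
);
CREATE TABLE reviews (
    review_id   INTEGER PRIMARY KEY,
    game_id     INTEGER REFERENCES games(game_id),
    review      TEXT,
    review_prep TEXT,
    recommended INTEGER,
    time        INTEGER
);
CREATE TABLE sents (
    sent_id   INTEGER PRIMARY KEY,
    review_id INTEGER REFERENCES reviews(review_id),
    sent      TEXT,
    sent_prep TEXT
);
CREATE TABLE clusters (
    cluster_id   INTEGER PRIMARY KEY,
    cluster_name TEXT
);
CREATE TABLE kws (
    kw_id      INTEGER PRIMARY KEY,
    kw         TEXT,
    freq       INTEGER,
    cluster_id INTEGER REFERENCES clusters(cluster_id)
);
CREATE TABLE clusters_sents (
    cluster_id INTEGER REFERENCES clusters(cluster_id),
    sent_id    INTEGER REFERENCES sents(sent_id)
);
"""

# Build the schema in a throwaway in-memory database.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(SCHEMA)

table_names = sorted(
    row[0]
    for row in demo_conn.execute("SELECT name FROM sqlite_master WHERE type='table';")
)
print(table_names)
```

The junction table **clusters_sents** carries no data of its own; it only records which sentences mention a keyword of which cluster.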

### Step 1:

- Use a list of **app_id** to get info from Steam crawler and insert **app_id**, **game_name**, **header_img_url**, **total_positive**, **total_negative**, **total_reviews** into table **games** and insert **game_id** (fk), **review**, **recommended**, **time** into table **reviews**.

In [9]:
import requests 
import pandas as pd
import numpy as np

import download_steam_reviews

import sqlite3

import re
from bs4 import BeautifulSoup

import spacy
nlp = spacy.load("en_core_web_sm")
# nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("sentencizer")

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

from sklearn.cluster import AgglomerativeClustering

from fuzzysearch import find_near_matches

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import flair

sentiment_model = flair.models.TextClassifier.load('sentiment')
sentiment_model_fast = flair.models.TextClassifier.load('sentiment-fast')
senti_analyzer = SentimentIntensityAnalyzer()

import pickle

import plotly.express as px
import plotly.graph_objects as go

from sklearn.cluster import KMeans
import hdbscan
from summa.summarizer import summarize

from scipy.cluster.vq import vq

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap.umap_ as umap

from tqdm import tqdm

from collections import Counter

from fuzzywuzzy import fuzz, process

from spacy import displacy

from pprint import pprint

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.sum_basic import SumBasicSummarizer

from math import ceil

2021-07-28 00:30:10,602 loading file C:\Users\HuyTran\.flair\models\sentiment-en-mix-distillbert_4.pt
2021-07-28 00:30:14,198 loading file C:\Users\HuyTran\.flair\models\sentiment-en-mix-ft-rnn.pt


In [10]:
# nlp = spacy.load("en_core_web_sm")
# nlp = spacy.load("en_core_web_md")
# nlp = spacy.load("en_core_web_lg")
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x1a6db066588>

In [938]:
print(nlp.pipe_names)

['transformer', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'sentencizer']


In [11]:
# from contractions import contractions
contractions = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
}

In [12]:
conn = sqlite3.connect('./data/steam_reviews.db') 
cursor = conn.cursor()

In [None]:
def print_cursor(cursor): 
    print(cursor.execute("""
        select * from games;
    """).fetchall())

In [None]:
print_cursor(cursor)

In [None]:
print(cursor.execute("""
        select game_id from games 
        where app_id=367520;
    """).fetchall()[0][0])

In [825]:
# 428550 - Momodora: Reverie under the Moonlight
# 367520 - Hollow Knight
app_ids = [428550, 367520]

In [826]:
### get game_name ###

def get_app_list(): 
    app_list_url = 'https://api.steampowered.com/ISteamApps/GetAppList/v2/'
    resp_data = requests.get(app_list_url)
    return resp_data.json()

def get_name(app_id, app_list): 
    for app in app_list['applist']['apps']: 
        if app['appid'] == app_id: 
            return app['name']
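
`get_name` scans the whole app list on every call. When many **app_id** values are looked up, building a dictionary once may be preferable. A minimal sketch, assuming the response has the shape that `get_name` already relies on; `build_name_index` and `sample_app_list` are hypothetical names, and the sample data is hand-written rather than fetched from the API.

```python
def build_name_index(app_list):
    """Map appid -> name once, so that each later lookup is a dict access."""
    return {app['appid']: app['name'] for app in app_list['applist']['apps']}


# Hand-written sample in the same shape as the GetAppList response used above.
sample_app_list = {
    'applist': {
        'apps': [
            {'appid': 428550, 'name': 'Momodora: Reverie Under the Moonlight'},
            {'appid': 367520, 'name': 'Hollow Knight'},
        ]
    }
}

name_index = build_name_index(sample_app_list)
print(name_index.get(367520))  # Hollow Knight
print(name_index.get(1))       # None, like get_name when nothing matches
```

One difference to keep in mind: if an appid appeared twice in the list, the dictionary would keep the last name, whereas `get_name` returns the first.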

In [827]:
### get header_img_url  

def get_header_img_url(app_id): 
    return f'https://cdn.cloudflare.steamstatic.com/steam/apps/{app_id}/header.jpg'

In [850]:
### get total_positive, total_negative, total_reviews from the crawler for table games 
### get game_id, review, recommended, time from the crawler for table reviews 

app_list = get_app_list()

request_params = {
    'language': 'english'
}

# load or download new (maximum ~5000 newest reviews)
load_mode = True

for app_id in app_ids: 
    game_tuples = [] 

    game_name = get_name(app_id, app_list)
    
    header_img_url = get_header_img_url(app_id)
    
    total_positive, total_negative, total_reviews = 0, 0, 0
    
    if load_mode: 
        review_dict = download_steam_reviews.load_review_dict(app_id)['reviews'].values()
    else: 
        review_dict = download_steam_reviews.download_reviews_for_app_id(app_id, 
                                                                     chosen_request_params=request_params, 
                                                                     reviews_limit=5000)[0]['reviews'].values()
    
    review_tuples = []
    
    with conn:
        cursor.execute("""INSERT INTO games (app_id, game_name, header_img_url) VALUES (?, ?, ?);""", 
                       (app_id, game_name, header_img_url))
    
    # get game_id (fk) for table reviews
    game_id = cursor.execute("""SELECT game_id FROM games WHERE app_id=?;""", (app_id,)).fetchone()[0] 
    
    for review_dict_value in review_dict: 
        total_reviews += 1
    
        voted_up = 1 if review_dict_value['voted_up'] else 0
    
        if voted_up: 
            total_positive += 1
        else: 
            total_negative += 1
     
        review = review_dict_value['review']
        recommended = voted_up
        time = review_dict_value['timestamp_updated']
        
        # keep only reviews with at least 3 distinct tokens (get_text_length is defined in Step 2)
        if get_text_length(review) >= 3:
            review_tuples.append((review, recommended, time, game_id))
    
    with conn:
        cursor.execute("""UPDATE games SET (total_positive, total_negative, total_reviews) = (?, ?, ?)
                            WHERE game_id=?;""",
                       (total_positive, total_negative, total_reviews, game_id))
    
    with conn:
        cursor.executemany("""INSERT INTO reviews (review, recommended, time, game_id) VALUES 
                                (?, ?, ?, ?);""", review_tuples)    

> Cat
> awooooo
> .
> Good game
> anime tiddy
> kinda fun
> good
> awesome sauce
> :)
> Nice.
> Solid platformer
> Fun. Entertaining.
> Gud.
> very cool
> booba
> ⠀
> Fun game.
> _
> very cool
> good game
> bom jogo
> 8/10
> Very nice

> yes
> .
> Great game.
> h
> https://store.steampowered.com/app/367520/Hollow_Knight/
> great game

> Lubella pog
> good game
> yes
> love
> ^‿^
> Nice game
> good shit
> everything
> Good Game
> good
> w
> yes
> Beautiful
> good
> mario pissing
> no
> pretty good

> Love
> :megusta:
> tachilo
> :]
> good
> 7
> <3
> GOOD
> Good shit
> great metroidvania
> 通关一次之后看了疯狂模式速通视频感觉就这操作自己也可以做到而兴致勃勃的去尝试的我深深的发现了一件事
我是傻逼
> Loved it
>  .
> Noice
> fantastic game!
> yeet
> is good
> good
> Fantastic Metroidvania
> leaf game
> Meh...
> nicu
> [i]YES[/i]
> งานภาพและดนตรีประกอบคือสมบรูณ์แบบ
เกมนี้มีจบสองแบบ
มีความท้าทาย ในการเดินทางและต่อสู้เพราะการsaveต้องทำที่จุดsaveในแมพ
โดยรวมถือว่าเยี่ยมยอดแต่ศัตรูจับทางง่ายไปหน่อย
> nice
> 
> Good Game.
> Nice.
> Absolutely amazin

> Hard. Good.
> video game
> good :)
> git gud
> This
game
is
beautiful!
> fart fart fart fart fart fart fart fart fart fart fart fart fart fart fart fart fart poo fart fart fart fart fart fart fart fart fart fart fart fart fart poo fart
> Great Game
> It's good.
> .
> Masterpiece
> ss
> its alright
> a
> stunningly beautiful
> very fun
> goated
> absolute masterpiece!
> cool

> good :thumbsup:
> mantul
> Best game
> veery good
> good
> 10/10
> good
> Get it
> e
> VERY VERY egg
> ndsnlddsln
> is aight
> penis
> poo poo man
> Beautiful
> magnificent
> perfecto

> good
> brilliant
> very good
> game geud

> game good
> good

> I'm good.
> g.o.a.t
> QUIRREL
> fun

> s
> goood
> 再坚持一下就出新手村了，加油！
> 11/10
> games good
> yes
> git gud!
> Get it.
> 10/10
> 10/10
> yes
> dtyftgyfgvbcxfgh
> very poggers
> meh
> Its Pog
> pog gaem
> funny bugs
> It's alright.
> amaxing
> It's fun.
> awsome ga<me
> just wow
> shad o
> quality game
> aaaaaaaaaaaaaaaaa

> Amazing!
> Cool game
> it good
> its ok
> gAM
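
The tallying and filtering done in the Step 1 loop can be isolated into a pure function, which makes it easy to check without touching the database. This is a sketch under assumptions: `tally_reviews`, `min_tokens` and `sample_reviews` are hypothetical, the sample timestamps are made up, and the sketch assumes that reviews with fewer than 3 distinct tokens (like those listed above) are the ones to leave out of table **reviews**. The dictionary keys are the ones read in the loop.

```python
def get_text_length(text):
    # number of distinct space-separated tokens, as defined in Step 2
    return len(set(text.split(' ')))


def tally_reviews(review_dicts, game_id, min_tokens=3):
    """Return (total_positive, total_negative, total_reviews) and the rows to insert."""
    total_positive = total_negative = 0
    review_tuples = []
    for review_dict_value in review_dicts:
        recommended = 1 if review_dict_value['voted_up'] else 0
        total_positive += recommended
        total_negative += 1 - recommended
        review = review_dict_value['review']
        # every review counts towards the totals, but only longer ones are stored
        if get_text_length(review) >= min_tokens:
            review_tuples.append(
                (review, recommended, review_dict_value['timestamp_updated'], game_id)
            )
    totals = (total_positive, total_negative, total_positive + total_negative)
    return totals, review_tuples


# Made-up sample reviews for illustration only.
sample_reviews = [
    {'voted_up': True, 'review': 'Good game', 'timestamp_updated': 1627400000},
    {'voted_up': True, 'review': 'Excellent platforming and tight combat',
     'timestamp_updated': 1627400100},
    {'voted_up': False, 'review': 'meh', 'timestamp_updated': 1627400200},
]

totals, rows = tally_reviews(sample_reviews, game_id=2)
print(totals)  # (2, 1, 3)
print(rows)    # only the five-word review is kept
```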

### Step 2:

- Preprocess **review** in table **reviews** (add_missing_punct, replace_bullets, remove_url, remove_html_tags, normalize_single_quote, remove_non_ascii, remove_ansi_escape_sequences, remove_multi_whitespaces), then tokenize_sent and remove_leading_symbols to insert **review_id**, **sent** into table **sents**. 
- Preprocess **sent** in table **sents** (lowercase, expand contractions, remove_digits, remove_symbols, remove_multi_whitespaces, lemmatize_text, remove_stopwords) to insert **sent_prep** in table **sents**.
- Use **sent_prep**, **review_id** in table **sents** to insert **review_prep** in table **reviews** by joining **sent_prep**.
        

    

In [5]:
df_reviews = pd.read_sql_query("""SELECT review_id, review, game_id 
                    FROM reviews JOIN games USING(game_id);""", conn)

In [122]:
# add a period where a line (or the whole text) ends on a letter or digit
def add_missing_punct(text): 
    return re.sub(r'([A-Za-z0-9])\s*(\n+|$)', r'\g<1>. ', text)


# turn newline-separated bullet items ("+ item", "- item") into sentences
def replace_bullets(text): 
    text = re.sub(r'([A-Za-z0-9])\s*\n+\s*[+-]?\s*', r'\g<1>. ', text)
    text = re.sub(r'\s*([:+-]+)\s*\n+\s*[+-]?\s*', '. ', text) 
    return text
    
# replace a colon at the end of a line with a period
def replace_colons(text): 
    return re.sub(r'\s*:\s*(\n+)\s*', r'.\g<1>', text) 
    
# remove URLs from text
def remove_url(text):
    return re.sub(r"http\S+", ' ', text)


# remove square-bracket markup such as [i] ... [/i]
def remove_square_brackets(text): 
    return re.sub(r'\[(.*?)\]', ' ', text)

# remove HTML tags
def remove_html_tags(text):
    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()  
    return text


# replace ’ and ‘ with ' 
def normalize_single_quote(text):
    return re.sub('[’‘]', '\'', text)


# drop every non-ASCII character (this also removes non-English scripts)
def remove_non_ascii(text): 
    return text.encode("ascii", errors="ignore").decode()
    
    
# remove ANSI escape sequences
def remove_ansi_escape_sequences(text):
    ansi_escape = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
    return ansi_escape.sub('', text)
    
    
# replace runs of whitespace with a single space
def remove_multi_whitespaces(text): 
    return re.sub(r'\s+', ' ', text.strip())
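
To see how the cleaning functions compose, here is a small self-contained sketch that chains four of them on a made-up review. The function bodies repeat the definitions above; `clean_review` and the sample text are hypothetical and only serve the illustration.

```python
import re


def remove_url(text):
    return re.sub(r"http\S+", ' ', text)


def normalize_single_quote(text):
    return re.sub('[’‘]', "'", text)


def remove_non_ascii(text):
    return text.encode("ascii", errors="ignore").decode()


def remove_multi_whitespaces(text):
    return re.sub(r'\s+', ' ', text.strip())


def clean_review(text):
    # apply the steps in the same relative order as the notebook's .map() chain
    for step in (remove_url, normalize_single_quote,
                 remove_non_ascii, remove_multi_whitespaces):
        text = step(text)
    return text


raw = "Great  game’s art!\nSee https://example.com now ★"
cleaned = clean_review(raw)
print(cleaned)  # Great game's art! See now
```

The order matters: `normalize_single_quote` has to run before `remove_non_ascii`, otherwise the curly apostrophe is dropped and "game’s" turns into "games".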

In [103]:
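# strip a leading list number such as "1." or "2)"
def remove_bullet_nums(sent): 
    return re.sub(r'^\s*\d+\s*[.\)]+\s*([^\d])', r'\g<1>', sent)

# strip leading characters that are not letters, digits, or quotes
def remove_leading_symbols(sent):
    return re.sub(r'^[^A-Za-z"\'\d]+', '', sent)

# capitalize the first character, leaving the rest untouched
def uppercase_first(sent): 
    return sent[0].upper() + sent[1:] if len(sent) != 0 else sent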
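
A short sketch of what these three sentence-level helpers do when chained in the order used later in the notebook. The function bodies repeat the definitions above; `tidy_sentence` and the example strings are hypothetical.

```python
import re


def remove_bullet_nums(sent):
    return re.sub(r'^\s*\d+\s*[.\)]+\s*([^\d])', r'\g<1>', sent)


def remove_leading_symbols(sent):
    return re.sub(r'^[^A-Za-z"\'\d]+', '', sent)


def uppercase_first(sent):
    return sent[0].upper() + sent[1:] if len(sent) != 0 else sent


def tidy_sentence(sent):
    return uppercase_first(remove_leading_symbols(remove_bullet_nums(sent)))


examples = ["1. great soundtrack", "- tight controls", "10/10 would play again"]
tidied = [tidy_sentence(s) for s in examples]
print(tidied)
# ['Great soundtrack', 'Tight controls', '10/10 would play again']
```

The last example is left alone because a list number is only removed when it is followed by `.` or `)`, so scores such as "10/10" survive.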

In [104]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'ner',
 'attribute_ruler',
 'lemmatizer',
 'sentencizer']

In [200]:
reviews = df_reviews['review']

In [203]:
# ['transformer', 'parser']
# ['tok2vec', 'parser']
# ['sentencizer']

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("sentencizer")

def tokenize_sent(text, pipes=['tok2vec', 'parser']):
    with nlp.select_pipes(enable=pipes):
        doc = nlp(text)
        sents = [str(sent).strip() for sent in doc.sents]
    return sents

In [204]:
reviews_cleaned = reviews.map(remove_square_brackets)\
                        .map(remove_html_tags)\
                        .map(remove_url)\
                        .map(normalize_single_quote)\
                        .map(remove_non_ascii)\
                        .map(remove_ansi_escape_sequences)\
                        .map(add_missing_punct)\
                        .map(replace_colons)


"." looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.


"https://store.steampowered.com/app/367520/Hollow_Knight/" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.


" ." looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.


"..." looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.


"/" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.


"https://www.youtube.com/watch?v=AB6sOhQan9Y" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.



In [237]:
sents_list = []
for review in reviews_cleaned: 
    sents = pd.Series(tokenize_sent(review)).map(remove_bullet_nums)\
                                            .map(remove_leading_symbols)\
                                            .map(uppercase_first)\
                                            .map(add_missing_punct)\
                                            .map(remove_multi_whitespaces)

    for sent in sents:
        if get_text_length(sent) > 2 and sent.count('"') % 2 == 0 and sent.count('(') == sent.count(')'):
            sents_list.append(sent)





In [16]:
def lowercase(text):
    return text.lower()

def expand_contractions(text):
    for key in contractions:
        value = contractions[key]
        text = text.replace(key, value)
    return text

# remove digits 
def remove_digits(text): 
    return re.sub('\d+', ' ', text)

# remove symbols 
def remove_symbols(text):
    return re.sub('[^A-Za-z,.\s\d]+', ' ', text)

# lemmatization with spacy 
def lemmatize_text(text): 
    doc = nlp(text, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc if token.pos_ != 'PUNCT']
    return ' '.join(lemma)

# remove stop words 
def remove_stopwords(text, word_list=[]):
    stop_words = stopwords.words("english")
    stop_words.extend(word_list)
    stop_words = set(stop_words)
    return ' '.join(e.lower() for e in text.split() if e.lower() not in stop_words)

def get_extra_stopwords(game_name): 
    # domain stop words plus every word of the game's own title
    extra_stopwords = set(['game', 'lot', 'bit', 'way'])
    doc = nlp(game_name, disable=['parser', 'ner'])
    for token in doc: 
        if token.pos_ not in {'PUNCT', 'NUM'}:
            extra_stopwords.add(token.text.lower())
    return extra_stopwords
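
One caveat with `expand_contractions` as written: it replaces keys in dictionary order, so "can't" is expanded before "can't've" is ever tried, and the longer form ends up as "cannot've". A possible alternative, sketched here under the assumption that longest-match-first is the intended behaviour, compiles all keys into one regular expression. `expand_contractions_regex` and `sample_contractions` are hypothetical names; the four entries are taken from the dictionary above.

```python
import re

# Four entries from the contractions dictionary, in the same relative order.
sample_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "he'd": "he would",
    "he'd've": "he would have",
}


def expand_naive(text, mapping):
    # same logic as expand_contractions above
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text


def expand_contractions_regex(text, mapping):
    # longest keys first, so "can't've" wins over "can't"
    keys = sorted(mapping, key=len, reverse=True)
    # the lookarounds keep a key from matching inside a longer word
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: mapping[m.group(0)], text)


text = "i can't've done it"
print(expand_naive(text, sample_contractions))               # i cannot've done it
print(expand_contractions_regex(text, sample_contractions))  # i cannot have done it
```

In a real run the pattern would be compiled once outside the function rather than on every call.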

In [17]:
def get_text_length(text): 
    return len(set(text.split(' ')))
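
Note that `get_text_length` counts distinct tokens and splits on single spaces only, which has two consequences worth seeing on examples. The inputs below are made up or adapted from the reviews shown earlier.

```python
def get_text_length(text):
    return len(set(text.split(' ')))


# repeated words count once
repeated = get_text_length("good good good")
# words separated by newlines are not split at all
multiline = get_text_length("This\ngame\nis\nbeautiful!")
# an ordinary sentence: 8 distinct tokens
ordinary = get_text_length("It is my favorite game of all time.")

print(repeated, multiline, ordinary)  # 1 1 8
```

So a review that repeats one word many times, or that puts each word on its own line, is treated as short and filtered out by the `> 2` checks.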

In [247]:
sent_tuples = set()

# filter out short reviews (len < 3)
df_reviews = df_reviews[df_reviews.apply(lambda x: get_text_length(x['review']) > 2, axis=1)]

for game_id in df_reviews['game_id'].unique():
    df_reviews_game = df_reviews[df_reviews['game_id'] == game_id]
    
    reviews_game_cleaned = df_reviews_game['review'].map(remove_square_brackets)\
                        .map(remove_html_tags)\
                        .map(remove_url)\
                        .map(normalize_single_quote)\
                        .map(remove_non_ascii)\
                        .map(remove_ansi_escape_sequences)\
                        .map(add_missing_punct)\
                        .map(replace_colons)
    
    for review_id, review in zip(df_reviews_game['review_id'], reviews_game_cleaned):
        sents = pd.Series(tokenize_sent(review)).map(remove_bullet_nums)\
                                                .map(remove_leading_symbols)\
                                                .map(uppercase_first)\
                                                .map(add_missing_punct)\
                                                .map(remove_multi_whitespaces)
        
        

#         sents_prep = sents.map(lowercase)\
#                         .map(expand_contractions)\
#                         .map(remove_digits)\
#                         .map(remove_symbols)\
#                         .map(remove_multi_whitespaces)\
#                         .map(lemmatize_text)\
#                         .map(lambda x: remove_stopwords(x, word_list=extra_stopwords))

#         for sent, sent_prep in zip(sents, sents_prep):
#             sent_tuples.append((review_id, sent, sent_prep))
        
        # filter out short sentences (len < 3)
        for sent in sents:
            if get_text_length(sent) > 2 and sent.count('"') % 2 == 0 and sent.count('(') == sent.count(')'):
                sent_tuples.add((review_id, sent))
                
sent_tuples = list(sent_tuples)





In [249]:
len(sent_tuples)

29788

In [250]:
# with conn:
#     cursor.executemany("""INSERT INTO sents (review_id, sent, sent_prep) VALUES 
#                         (?, ?, ?);""", sent_tuples)    

with conn:
    cursor.executemany("""INSERT INTO sents (review_id, sent) VALUES 
                        (?, ?);""", sent_tuples)    

### Step 3: 
- Use **review_prep** in table **reviews** to compute the frequency of special bigrams and keep the 50 most frequent keywords.
- Insert **kw**, **freq** into table **kws**.
- Embed the 50 keywords with S-BERT and cluster them using agglomerative clustering with distance_threshold=0.6.
    - Insert **cluster_name** (the most frequent keyword in the cluster) into table **clusters**.
    - Insert **cluster_id** into table **kws**.

In [289]:
df_sents = pd.read_sql_query("""SELECT sent, sent_id, game_id
                    FROM sents JOIN reviews USING(review_id) JOIN games USING(game_id);""", conn)

In [290]:
game_id = 2

In [291]:
df_sents = df_sents[df_sents['game_id'] == game_id][['sent', 'sent_id']]

In [292]:
df_sents

Unnamed: 0,sent,sent_id
5,"The atmosphere is unique and well executed, fr...",6
7,Its the best game ever.,8
10,Excellent platforming and combat are just the ...,11
19,"Higher beings, these words are for you alone.",20
23,Has great soundtracks and artworks.,24
...,...,...
29770,Ok back to pain.,29771
29772,It is my favorite game of all time.,29773
29774,"Dope game, beautiful graphics, great music, cr...",29775
29775,Buy the game.,29776


In [293]:
def get_tokens_index_dict(string):
    index_dict = {}
    index = 0 
    for i, char in enumerate(string): 
        if char == ' ':
            index += 1
        else: 
            index_dict[i] = index
    
    return index_dict
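
`get_tokens_index_dict` maps every non-space character offset of a space-joined string to the index of the token it belongs to. This is what later lets a regex match on the POS-tag string be translated back to token positions. A small sketch on a hand-written tag string (the function body repeats the definition above):

```python
def get_tokens_index_dict(string):
    index_dict = {}
    index = 0
    for i, char in enumerate(string):
        if char == ' ':
            index += 1
        else:
            index_dict[i] = index
    return index_dict


# offsets: DET=0..2, ADJ=4..6, NOUN=8..11, AUX=13..15, ADJ=17..19
tags_str = "DET ADJ NOUN AUX ADJ"
index_dict = get_tokens_index_dict(tags_str)

print(index_dict[4])    # 1, offset 4 starts the second tag
print(index_dict[8])    # 2
print(index_dict[17])   # 4
print(3 in index_dict)  # False, offset 3 is a space
```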

In [294]:
def get_stopwords(word_list=[]):
    stopwords = nlp.Defaults.stop_words.copy()  
    stopwords |= set(word_list)
    return stopwords

# remove stopwords before grouping
stop_words = get_stopwords(['game', 'games', 'lot', 'lots', 'ton', 'tons', 'bit', 'bits', 'fun', 
                            'way', 'ways', 'thing', 'things', 'time', 'times', 'type', 'types', 
                               'opinion', 'opinions', 'sense', 'terms', 'lack', 'fact'])

In [295]:
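# check if a keyword is messy (contains anything other than letters, digits, whitespace, or - : , ;)
def is_messy(kw): 
    return re.search(r'[^a-zA-Z0-9\s\-:,;]', kw)

# lowercase a token and strip hyphens and colons
def clean_token(kw_token):
    kw_token = kw_token.lower()
    kw_token = re.sub('[-:]', '', kw_token)
    return kw_token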

In [258]:
# text = "Awesome games are hard to come by."

In [259]:
# nlp = spacy.load("en_core_web_trf")
# nlp.add_pipe("sentencizer")

# with nlp.select_pipes(enable=['transformer', 'tagger', 'attribute_ruler']):
#     doc = nlp(text)
#     tokens = [(token.text, token.pos_)  for token in doc]
    
# pprint(tokens)    

In [260]:
# nlp = spacy.load("en_core_web_md")
# nlp.add_pipe("sentencizer")

# with nlp.select_pipes(enable=['tok2vec', 'tagger', 'attribute_ruler']):
#     doc = nlp(text)
#     tokens = [(token.text, token.pos_)  for token in doc]
    
# pprint(tokens)    

In [298]:
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("sentencizer")

def get_kws_sent_ids(df_sents):
    kws_sent_ids = set()

    noun_types = set(['NOUN', 'PROPN'])
    adj_types = set(['ADJ'])
    pattern = re.compile('(ADJ (PUNCT ADJ )*)*((NOUN|PROPN) (PUNCT (NOUN|PROPN) )*(PUNCT )?)*(NOUN|PROPN)')

    for sent, sent_id in zip(df_sents['sent'], df_sents['sent_id']): 
        with nlp.select_pipes(enable=['transformer', 'tagger', 'attribute_ruler']):
            doc = nlp(sent)

        tokens = [token.text for token in doc]
        tags_str = ' '.join([token.pos_ for token in doc])
        tokens_index_dict = get_tokens_index_dict(tags_str)
        matches = pattern.finditer(tags_str)
        
        for match in matches: 
            tags = [elem for elem in match.group(0).split()]
            
            kw = [] 
            index = match.start()
            
            for i, tag in enumerate(tags): 
                kw_token = clean_token(tokens[tokens_index_dict[index]])
                index += len(tag) + 1
                
                if kw_token in (',', ';') and tags[i - 1] in ('PROPN', 'NOUN') and len(kw) != 0:
                    kw = ' '.join(kw)

                    if not is_messy(kw):
                        kws_sent_ids.add((kw, sent_id))

                    kw = []
                
                # drop the adjectives collected so far when they precede a stop-word noun; 
                # otherwise many keywords would consist of adjectives only
                elif kw_token in stop_words and tag in ('PROPN', 'NOUN') and tags[i - 1] == 'ADJ': 
                    kw = []
                    
                elif kw_token not in (',', ';') and kw_token not in stop_words and len(kw_token) != 0:
                    kw.append(kw_token)

                    
            if len(kw) != 0:
                kw = ' '.join(kw)

                if not is_messy(kw):
                    kws_sent_ids.add((kw, sent_id))
        
        if len(kws_sent_ids) % 1000 == 0:
            print(f'Processed {len(kws_sent_ids)}')
    
    return kws_sent_ids
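
The core idea of `get_kws_sent_ids` is to run a regular expression over the space-joined POS tags and map each match back to the tokens. The sketch below shows just that idea on hand-assigned tags, so it runs without a spaCy model. It is a simplification, not the function above: `extract_candidates` is hypothetical, it locates the first token by counting the spaces before the match instead of using `get_tokens_index_dict`, and it skips the stop-word, punctuation and `is_messy` handling.

```python
import re

# same pattern as in get_kws_sent_ids: optional adjectives followed by nouns
pattern = re.compile(
    '(ADJ (PUNCT ADJ )*)*((NOUN|PROPN) (PUNCT (NOUN|PROPN) )*(PUNCT )?)*(NOUN|PROPN)'
)


def extract_candidates(tokens, tags):
    tags_str = ' '.join(tags)
    candidates = []
    for match in pattern.finditer(tags_str):
        # number of spaces before the match == index of its first token
        first = tags_str[:match.start()].count(' ')
        length = len(match.group(0).split())
        candidates.append(' '.join(tokens[first:first + length]).lower())
    return candidates


# Tags are assigned by hand here; in the notebook they come from token.pos_.
tokens = ['The', 'game', 'has', 'beautiful', 'art', 'style', 'and', 'great', 'music', '.']
tags = ['DET', 'NOUN', 'VERB', 'ADJ', 'NOUN', 'NOUN', 'CCONJ', 'ADJ', 'NOUN', 'PUNCT']

candidates = extract_candidates(tokens, tags)
print(candidates)  # ['game', 'beautiful art style', 'great music']
```

In the full function, 'game' would be discarded because it is in `stop_words`.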

In [299]:
def get_kws_sent_ids_group(kws_sent_ids):
    res={}
    
    for kw_sent_id in kws_sent_ids:
        if kw_sent_id[0] not in res:
            res[kw_sent_id[0]] = [kw_sent_id[1]]
        else:
            res[kw_sent_id[0]].append(kw_sent_id[1])
            
    return res
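
The same grouping can be written a little more compactly with `collections.defaultdict`. A sketch with made-up pairs; `group_sent_ids_by_kw` and `sample_pairs` are hypothetical names.

```python
from collections import defaultdict


def group_sent_ids_by_kw(kws_sent_ids):
    """Group (kw, sent_id) pairs into {kw: [sent_id, ...]}."""
    groups = defaultdict(list)
    for kw, sent_id in kws_sent_ids:
        groups[kw].append(sent_id)
    return dict(groups)


# Made-up (kw, sent_id) pairs; a set, like the return value of get_kws_sent_ids.
sample_pairs = {('art style', 6), ('art style', 24), ('soundtrack', 24)}

groups = group_sent_ids_by_kw(sample_pairs)
print(sorted(groups['art style']))  # [6, 24]

# keep only keywords that occur in more than one sentence
frequent = {kw: ids for kw, ids in groups.items() if len(ids) > 1}
print(list(frequent))  # ['art style']
```

Because the input is a set, the order of the sentence ids inside each list is not guaranteed, hence the `sorted` call in the example.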

In [300]:
kws_sent_ids = get_kws_sent_ids(df_sents)
kws_sent_ids_group = get_kws_sent_ids_group(kws_sent_ids)

Processed 4000
Processed 4000
Processed 6000
Processed 8000
Processed 9000
Processed 9000
Processed 11000
Processed 15000
Processed 15000
Processed 17000
Processed 18000


In [309]:
kws_sent_ids_group = {key: kws_sent_ids_group[key] for key, val in kws_sent_ids_group.items() if len(val) > 1}

In [310]:
# 46959
len(kws_sent_ids)

18253

In [311]:
# 13673
len(kws_sent_ids_group)

1641

In [308]:
# pprint(kws_sent_ids_group)

In [305]:
kws = list(kws_sent_ids_group.keys())

In [306]:
# kws

In [312]:
kw_embeddings = embedder.encode(kws)

# Normalize the embeddings to unit length
kw_embeddings = kw_embeddings / np.linalg.norm(kw_embeddings, axis=1, keepdims=True)
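
Normalizing to unit length does not change the cosine similarity between two embeddings; it only makes the plain dot product equal to it. A dependency-free sketch of that fact on two toy 2-D vectors (the notebook does the same thing row-wise with NumPy; `normalize` and `dot` are hypothetical helpers):

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


u, v = [3.0, 4.0], [4.0, 3.0]

# cosine similarity computed from the raw vectors
cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# after normalization the dot product alone gives the same value
u_hat, v_hat = normalize(u), normalize(v)
print(round(cosine, 6), round(dot(u_hat, v_hat), 6))  # 0.96 0.96
```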

In [313]:
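# perform agglomerative clustering
# note: scikit-learn 1.2 deprecated `affinity` in favour of `metric` and 1.4 removed it; 
# on newer versions pass metric='cosine' instead
clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=0.6)
clustering_model.fit(kw_embeddings)
cluster_assignment = clustering_model.labels_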

# clustering_model = hdbscan.HDBSCAN(min_cluster_size=2)
# clustering_model.fit(kw_embeddings)
# cluster_assignment = clustering_model.labels_
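
To make the role of `distance_threshold` concrete, here is a naive pure-Python sketch of average-linkage agglomerative clustering on cosine distance. It is meant as an illustration of the technique, not as scikit-learn's implementation: `agglomerate` and the toy vectors are hypothetical, and the quadratic search for the closest pair is only suitable for tiny inputs.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def average_linkage(c1, c2, vectors):
    # mean of all pairwise distances between the two clusters
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(vectors, distance_threshold):
    clusters = [[i] for i in range(len(vectors))]  # start with singletons
    while len(clusters) > 1:
        # find the closest pair of clusters
        (a, b), dist = min(
            (((a, b), average_linkage(clusters[a], clusters[b], vectors))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        # clusters at or above the threshold are not merged
        if dist >= distance_threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Toy vectors: the first two are close (distance 0.2), the third is far away.
toy_vectors = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
toy_clusters = agglomerate(toy_vectors, distance_threshold=0.6)
print(toy_clusters)  # [[0, 1], [2]]
```

After the first merge, the average distance between `[0, 1]` and `[2]` is (1.0 + 0.4) / 2 = 0.7, which is above the threshold, so merging stops with two clusters. A lower threshold yields more, tighter clusters; a higher one yields fewer, broader clusters.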

In [314]:
kw_clusters = {}
for kw_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in kw_clusters:
        kw_clusters[cluster_id] = []

    kw = kws[kw_id]
    kw_clusters[cluster_id].append((kw, kws_sent_ids_group[kw]))
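
The steps describe **cluster_name** as the most frequent keyword in the cluster. Given the `(kw, sent_ids)` pairs collected in `kw_clusters`, one way to pick it is sketched below. This assumes that "frequency" means the number of sentences a keyword occurs in; `name_cluster` is a hypothetical helper and the sample cluster uses made-up sentence ids.

```python
def name_cluster(kw_cluster):
    """Return the keyword with the most sentence ids from a list of (kw, sent_ids) pairs."""
    kw, _ = max(kw_cluster, key=lambda pair: len(pair[1]))
    return kw


# Made-up cluster in the same shape as the values of kw_clusters.
sample_clusters = {
    2: [
        ('platform', [1, 2]),
        ('platforming', [3, 4, 5]),
        ('great platformer', [6, 7]),
    ],
}

cluster_names = {
    cluster_id: name_cluster(members)
    for cluster_id, members in sample_clusters.items()
}
print(cluster_names)  # {2: 'platforming'}
```

On ties, `max` returns the first keyword encountered, so the result then depends on the order in which keywords were added to the cluster.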

In [315]:
print(len(kw_clusters))

314


In [316]:
pprint(kw_clusters)

{0: [('markoth', [15284, 18427, 13583, 23311]),
     ('masochist', [10813, 4330, 19718])],
 1: [('chore',
      [1625, 8478, 4502, 19132, 29061, 5971, 22705, 8810, 1081, 19634]),
     ('pog', [9665, 12842, 193, 14340, 8020, 14822, 12109]),
     ('poggers', [11524, 8802, 14340, 21256]),
     ('horas', [13224, 7707])],
 2: [('platform', [4646, 12196, 15209, 10443, 15155]),
     ('great platformer', [10121, 19773, 28407, 7956, 24901]),
     ('platforming',
      [13472,
       1568,
       28148,
       16417,
       365,
       23573,
       13650,
       10643,
       926,
       4163,
       13506,
       2500,
       6123,
       18235,
       9305,
       15339,
       13391,
       839,
       14615,
       10764,
       9776,
       4140,
       3677,
       21064,
       19551,
       7998,
       21416,
       25385,
       6828,
       15674,
       3334,
       21273,
       15820,
       16574,
       18790,
       21949,
       25360,
       8376,
       18998,
       11117,


       8810]),
     ('unique art style', [29543, 7463, 17540]),
     ('great art',
      [18805, 5203, 22297, 27106, 29245, 22556, 12290, 25165, 20069, 5561]),
     ('beautiful art style',
      [17119, 18607, 2394, 20010, 5689, 19695, 7511, 26301, 6178]),
     ('great design', [19186, 20546, 7546, 13754]),
     ('good design', [1352, 2244, 20223]),
     ('true masterpiece', [1632, 3753, 3121, 6491, 14611]),
     ('beauty',
      [4958, 7185, 15423, 12955, 15674, 24951, 29766, 19625, 27510, 8420]),
     ('designs', [5229, 18202, 21798, 10481, 16685, 13705]),
     ('great world design', [10981, 15211, 2139, 24721]),
     ('charming art style', [29641, 11262, 12197]),
     ('decoration', [17540, 15423]),
     ('amazing artstyle', [16452, 3431]),
     ('world design', [19712, 21236, 279, 7532, 15524, 29766]),
     ('visual style', [29572, 13250, 21008]),
     ('aesthetic', [10735, 27365, 9742, 1811, 9644]),
     ('fucking masterpiece', [25011, 20437, 14727]),
     ('modern classic', [1635

        24089,
        5333,
        20234,
        10190,
        8172,
        22675,
        25161,
        7221,
        25040,
        10877,
        25896,
        19645,
        2748,
        4725,
        16685,
        9241,
        1163,
        15519,
        25065,
        2194,
        23356,
        19326,
        12352,
        20597,
        27123,
        18533,
        17741,
        26513,
        20198,
        7747,
        16897,
        484,
        1558,
        23203,
        26043,
        18951,
        9938,
        14254,
        2276,
        17720,
        19880,
        21362,
        18041,
        20297,
        8770,
        1347]),
      ('story',
       [23791,
        21996,
        17250,
        22056,
        8540,
        4954,
        29537,
        23746,
        8420,
        13433,
        11435,
        3806,
        6689,
        10541,
        18662,
        24591,
        17741,
        22261,
        19794,
        23427,
        18713

        16623,
        1079,
        5984,
        17398,
        16861,
        25176,
        11391,
        24084,
        24851,
        7724,
        6076,
        11222,
        14970,
        20650,
        13095,
        11472,
        4414,
        19831,
        19795,
        23549,
        22969,
        21772,
        18618,
        6784,
        6868,
        3095,
        8493,
        29091,
        13959,
        20500,
        26936,
        25893]),
      ('dollars',
       [21215,
        22641,
        17255,
        29167,
        18046,
        21442,
        26780,
        1666,
        3719,
        17128,
        13502,
        1257,
        925,
        17515,
        28559,
        9507,
        3168,
        18765,
        818,
        28646,
        25797,
        3303,
        17026,
        6154,
        26352]),
      ('penny',
       [8391,
        21185,
        3852,
        3415,
        14094,
        444,
        29579,
        13218,
        2973

        17382,
        18447,
        1003,
        19172,
        24833,
        27995,
        24806,
        16949,
        14246,
        19933,
        28513,
        9424,
        15366,
        404,
        15417]),
      ('computer', [4884, 22572, 8872, 7378, 17181, 5802, 14553]),
      ('spare computer', [28763, 8534, 20262, 8462]),
      ('internet', [13705, 25755, 4976, 7768, 11617])],
 89: [('movement',
       [3478,
        13808,
        9607,
        1127,
        15446,
        26326,
        7355,
        4750,
        8654,
        9874,
        1054,
        20034,
        13906,
        16966,
        28692,
        18566,
        13140,
        2744,
        3616,
        22518,
        25573,
        9100,
        20768,
        21136,
        26020,
        12356,
        7703,
        26621,
        2510,
        10285,
        18790,
        9217,
        7668,
        28032,
        29520,
        4003,
        19953,
        3309,
        19461,
        27491

         20717,
         10649,
         161,
         21798,
         3937,
         14022,
         6142,
         28829,
         3440,
         20202]),
       ('knight',
        [946,
         4141,
         16933,
         1999,
         509,
         24903,
         2378,
         18707,
         28229,
         11569,
         19575,
         11619,
         4942,
         18410,
         28455,
         23693,
         5684,
         20034,
         7282,
         4491,
         8866,
         28523]),
       ('watcher knights', [8577, 3825, 17548, 23173, 6483]),
       ('little knight', [810, 26313, 20420]),
       ('hollow knight silksong', [25910, 23435, 10104]),
       ('hallow knight', [21011, 17019, 14565]),
       ('funny knight', [21121, 28275]),
       ('playing hollow knight', [8608, 24214, 2278]),
       ('shovel knight', [19886, 27894])],
 111: [('content',
        [12945,
         13944,
         24851,
         17036,
         6239,
         20195,
         11910

 178: [('waste', [27839, 14093, 23414, 24369, 29609]),
       ('complete waste', [14990, 18618, 24878]),
       ('utter trash', [16644, 15789]),
       ('disposal', [1999, 14369]),
       ('garbage', [347, 9147])],
 179: [('test', [2873, 26859, 10004, 2046]),
       ('measures', [24433, 4476]),
       ('determination', [8899, 13033])],
 180: [('different npcs', [20033, 24036]),
       ('npcs', [16815, 11675, 17960, 23998, 20705, 6883, 25100, 2198]),
       ('npc', [2495, 15060]),
       ('interesting npcs', [12434, 6985])],
 181: [('notch', [5988, 23168, 1873, 15228, 19154, 2097])],
 182: [('understatement', [21916, 22416, 23316, 18343, 2930, 23221])],
 183: [('loads', [27947, 10836, 20195, 26386, 127, 21185])],
 184: [('level',
        [19671,
         24598,
         11921,
         18196,
         6410,
         16861,
         14615,
         28986,
         3708,
         1379,
         5278,
         16895,
         9736,
         16197,
         29766,
         15162,
         1

         25130,
         16354,
         3675,
         20861,
         5173,
         28355,
         4823,
         15902,
         6321,
         4056,
         8217,
         2479,
         6588,
         22985,
         28869,
         15615]),
       ('combat system',
        [26071, 745, 3854, 8054, 26150, 13538, 24253, 23673, 9874]),
       ('enemies',
        [8975,
         24630,
         22814,
         8669,
         22677,
         10432,
         3060,
         25480,
         13969,
         14794,
         11244,
         6802,
         24037,
         9990,
         29509,
         627,
         6642,
         13327,
         6586,
         9794,
         7274,
         21983,
         14054,
         10409,
         1693,
         23197,
         12258,
         16197,
         13186,
         28129,
         1725,
         3537,
         20954,
         8064,
         7511,
         2495,
         10836,
         17455,
         19169,
         18248,
         7312,

In [317]:
# Get the set of cluster numbers of the top_n largest clusters,
# where a cluster's size is the total number of sentences matched by its keywords
def get_biggest_clusters(kw_clusters, top_n=50):
    cluster_nums_sizes = []

    for cluster_num, kws_sent_ids in kw_clusters.items():
        # skip cluster -1
        if cluster_num != -1:
            cluster_size = sum(len(sent_ids) for _, sent_ids in kws_sent_ids)
            cluster_nums_sizes.append((cluster_num, cluster_size))

    cluster_nums_sizes_sorted = sorted(cluster_nums_sizes, key=lambda x: x[1], reverse=True)
    return {cluster_num for cluster_num, _ in cluster_nums_sizes_sorted[:top_n]}
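
To make the expected shape of `kw_clusters` concrete, here is a minimal sketch that applies the same ranking to a toy dictionary. The keywords, sentence ids and the helper name `biggest_clusters` are invented for illustration; only the structure (cluster number mapped to a list of `(keyword, [sent_ids])` pairs) follows the output printed above.

```python
# Toy input: cluster number -> list of (keyword, [sentence ids]) pairs.
# All values are made up for illustration.
toy_clusters = {
    -1: [("misc", [1, 2, 3, 4, 5, 6])],            # cluster -1 is skipped
    0: [("boss", [10, 11, 12]), ("bosses", [13])],  # size 4
    1: [("music", [20, 21])],                       # size 2
    2: [("map", [30])],                             # size 1
}

def biggest_clusters(kw_clusters, top_n=2):
    # Cluster size = total number of sentence ids over all of its keywords.
    sizes = {
        num: sum(len(sent_ids) for _, sent_ids in kws)
        for num, kws in kw_clusters.items()
        if num != -1
    }
    ranked = sorted(sizes, key=sizes.get, reverse=True)
    return set(ranked[:top_n])

top = biggest_clusters(toy_clusters, top_n=2)
print(top)
```

Because the result is a set, it only answers "is this cluster among the biggest?"; the size ordering itself is not preserved.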

In [318]:
biggest_clusters = get_biggest_clusters(kw_clusters, 100)

In [319]:
len(biggest_clusters)

100

In [320]:
# Build the row tuples for the three tables clusters, kws and clusters_sents
def get_cluster_kw_tuples(game_id, kw_clusters, biggest_clusters):
    game_id = int(game_id)
    cluster_tuples, kw_tuples, cluster_sent_tuples = [], [], []

    for cluster_num, kws_sent_ids in kw_clusters.items():
        cluster_num = int(cluster_num)

        if cluster_num in biggest_clusters:
            # name the cluster after its most frequent keyword
            kws_sent_ids_sorted = sorted(kws_sent_ids, key=lambda x: len(x[1]), reverse=True)
            cluster_name = kws_sent_ids_sorted[0][0]
            cluster_tuples.append((game_id, cluster_num, cluster_name))

            for kw, sent_ids in kws_sent_ids:
                # keyword frequency = number of sentences it matched
                freq = len(sent_ids)
                kw_tuples.append((kw, freq, cluster_num, game_id))

                for sent_id in sent_ids:
                    cluster_sent_tuples.append((sent_id, cluster_num, game_id))

    return cluster_tuples, kw_tuples, cluster_sent_tuples

In [321]:
cluster_tuples, kw_tuples, cluster_sent_tuples = get_cluster_kw_tuples(game_id, kw_clusters, biggest_clusters)

In [322]:
cluster_tuples

[(2, 215, 'people'),
 (2, 159, 'silksong'),
 (2, 6, 'controls'),
 (2, 194, 'bosses'),
 (2, 66, 'plenty'),
 (2, 201, 'areas'),
 (2, 12, 'difficulty'),
 (2, 112, 'kingdom'),
 (2, 35, 'story'),
 (2, 9, 'art'),
 (2, 93, 'music'),
 (2, 22, 'kind'),
 (2, 40, 'reason'),
 (2, 106, 'metroidvania'),
 (2, 25, 'gameplay'),
 (2, 4, 'hours'),
 (2, 219, 'team cherry'),
 (2, 223, 'combat'),
 (2, 31, 'direction'),
 (2, 117, 'love'),
 (2, 111, 'content'),
 (2, 49, 'exploration'),
 (2, 73, 'switch'),
 (2, 110, 'hollow knight'),
 (2, 15, 'experience'),
 (2, 47, 'player'),
 (2, 68, 'dark souls'),
 (2, 2, 'platforming'),
 (2, 21, 'problem'),
 (2, 82, 'hand'),
 (2, 55, 'money'),
 (2, 39, 'idea'),
 (2, 118, 'depth'),
 (2, 42, 'charms'),
 (2, 52, 'price'),
 (2, 45, 'tears'),
 (2, 58, 'bugs'),
 (2, 16, 'rest'),
 (2, 122, 'dlc'),
 (2, 61, 'world'),
 (2, 165, 'life'),
 (2, 104, 'words'),
 (2, 129, 'graphics'),
 (2, 27, 'upgrades'),
 (2, 54, 'atmosphere'),
 (2, 23, 'hallownest'),
 (2, 238, 'review'),
 (2, 282, 'fa

In [323]:
# 50 1527 23779
print(len(cluster_tuples), len(kw_tuples), len(cluster_sent_tuples))
print(cluster_tuples[0])
print(kw_tuples[0])
print(cluster_sent_tuples[0])

100 1159 11931
(2, 215, 'people')
('people', 103, 215, 2)
(3866, 215, 2)


In [378]:
len(cluster_tuples)

100

In [324]:
with conn:
    cursor.executemany("""INSERT INTO clusters (game_id, cluster_num, cluster_name) VALUES 
                        (?, ?, ?);""", cluster_tuples)    
    
with conn:
    cursor.executemany("""INSERT INTO kws (kw, freq, cluster_id) VALUES 
                    (?, ?, (SELECT cluster_id FROM clusters WHERE cluster_num=? AND game_id=?));""",
                   kw_tuples)

with conn:
    cursor.executemany("""INSERT INTO clusters_sents (sent_id, cluster_id) VALUES 
                        (?, (SELECT cluster_id FROM clusters WHERE cluster_num=? AND game_id=?));""", cluster_sent_tuples)
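
The inserts above resolve the database-assigned **cluster_id** through a scalar subquery on `(cluster_num, game_id)`, since the clustering code only knows its own `cluster_num`. The sketch below reproduces that pattern on an in-memory database. The schema is a hypothetical minimal one (the real tables have more columns), and the rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical minimal schema, for illustration only.
cur.execute("""CREATE TABLE clusters (
                   cluster_id INTEGER PRIMARY KEY AUTOINCREMENT,
                   game_id INTEGER, cluster_num INTEGER, cluster_name TEXT);""")
cur.execute("CREATE TABLE kws (kw TEXT, freq INTEGER, cluster_id INTEGER);")

with conn:
    # The same cluster_num can occur for different games,
    # so the lookup below needs both cluster_num and game_id.
    cur.executemany(
        "INSERT INTO clusters (game_id, cluster_num, cluster_name) VALUES (?, ?, ?);",
        [(1, 215, "secrets"), (2, 215, "people")])

    cur.executemany("""INSERT INTO kws (kw, freq, cluster_id) VALUES
                       (?, ?, (SELECT cluster_id FROM clusters
                               WHERE cluster_num=? AND game_id=?));""",
                    [("people", 103, 215, 2)])

rows = cur.execute("SELECT kw, freq, cluster_id FROM kws;").fetchall()
print(rows)  # [('people', 103, 2)]
```

If no cluster matches, the subquery yields `NULL`, so a keyword would be stored without a cluster rather than raising an error (unless the column is declared `NOT NULL`).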

### Step 5:
- Delete every row in table **sents** whose **sent_id** does not appear in table **clusters_sents**.
    - Insert **score_flair**, **score_vader**, **recommended**, **score_total**, **sent_embedding** in table **sents**.

In [379]:
with conn: 
    cursor.execute("""
                        DELETE FROM sents 
                        WHERE sent_id NOT IN (
                        SELECT DISTINCT sent_id
                        FROM clusters_sents);
                   """)
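
As a quick check of what this `DELETE ... NOT IN` does, the following sketch runs it against a throwaway in-memory database. The two-column schema and the rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical minimal schema, for illustration only.
cur.execute("CREATE TABLE sents (sent_id INTEGER PRIMARY KEY, sent TEXT);")
cur.execute("CREATE TABLE clusters_sents (sent_id INTEGER, cluster_id INTEGER);")

with conn:
    cur.executemany("INSERT INTO sents VALUES (?, ?);",
                    [(1, "a"), (2, "b"), (3, "c")])
    # Sentence 3 belongs to two clusters, sentence 2 to none.
    cur.executemany("INSERT INTO clusters_sents VALUES (?, ?);",
                    [(1, 10), (3, 10), (3, 11)])

    cur.execute("""DELETE FROM sents
                   WHERE sent_id NOT IN (
                       SELECT DISTINCT sent_id FROM clusters_sents);""")

remaining = [row[0] for row in cur.execute("SELECT sent_id FROM sents ORDER BY sent_id;")]
print(remaining)  # [1, 3]
```

One caveat worth knowing: if the subquery ever returns a `NULL` **sent_id**, `NOT IN` evaluates to `NULL` for every row and nothing is deleted.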

In [382]:
def get_score_flair(sent, threshold=0.9):
    # Map Flair's (label, confidence) prediction to -1 / 0 / 1:
    # confident POSITIVE -> 1, confident NEGATIVE -> -1, otherwise 0
    sent_flair = flair.data.Sentence(sent)
    sentiment_model_fast.predict(sent_flair)

    value_flair = sent_flair.labels[0].value
    score_flair = sent_flair.labels[0].score

    # sign the confidence by the predicted label
    value_flair = 1 if value_flair == 'POSITIVE' else -1
    score_flair = value_flair * score_flair

    if score_flair > threshold:
        return 1
    if score_flair < -threshold:
        return -1
    return 0
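
The thresholding step can be tried without loading a Flair model. The sketch below isolates it in a hypothetical helper, `to_ternary`, that takes the predicted label and confidence directly; the name and the sample values are not part of the notebook.

```python
def to_ternary(label, confidence, threshold=0.9):
    # Sign the confidence by the label, then keep only confident predictions.
    signed = confidence if label == "POSITIVE" else -confidence
    if signed > threshold:
        return 1
    if signed < -threshold:
        return -1
    return 0

print(to_ternary("POSITIVE", 0.95))  # 1
print(to_ternary("NEGATIVE", 0.95))  # -1
print(to_ternary("POSITIVE", 0.60))  # 0
```

Both comparisons are strict, so a confidence exactly equal to the threshold maps to 0.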

In [383]:
sentis = df_sent['sent'].map(get_score_flair) 

In [384]:
sent_embeddings = embedder.encode(df_sent['sent'])

# Normalize the embeddings to unit length
sent_embeddings = sent_embeddings / np.linalg.norm(sent_embeddings, axis=1, keepdims=True)

# Serialize each vector to bytes so it can be stored in SQLite
sent_embeddings_str = [pickle.dumps(sent_embedding) for sent_embedding in sent_embeddings]
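
A small sketch of the same normalize-then-pickle round trip on two hand-written vectors, so the result can be checked by eye. The vectors are illustrative stand-ins for S-BERT output.

```python
import pickle
import numpy as np

# Two toy "embeddings"; real ones come from embedder.encode(...).
vectors = np.array([[3.0, 4.0],
                    [0.0, 2.0]])

# Divide each row by its L2 norm so every vector has length 1.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Serialize each row to bytes, then restore it the way decode_sent_embeddings does.
blobs = [pickle.dumps(row) for row in unit]
restored = np.array([pickle.loads(blob) for blob in blobs])

print(unit)                            # rows [0.6, 0.8] and [0.0, 1.0]
print(np.array_equal(restored, unit))  # True
```

Two things to keep in mind: an all-zero vector would produce NaNs in the division, and `pickle.loads` should only be used on data the pipeline itself wrote.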

In [385]:
sent_tuples = []

for senti, sent_embedding_str, sent_id in zip(sentis, sent_embeddings_str, df_sent['sent_id']): 
    sent_tuples.append((senti, sent_embedding_str, sent_id))     

In [386]:
with conn:
    cursor.executemany("""UPDATE sents SET (senti, sent_embedding) = (?, ?)
                        WHERE sent_id=?;""", sent_tuples)

In [480]:
# df_sent_cluster = df_sent[(df_sent['cluster_name'] == 'dark soul') & (df_sent['senti'] == -1)]

In [None]:
# for sent in df_sent_cluster['sent']: 
#     print(f'>>> {sent}')

In [481]:
# sents = df_sent_cluster['sent'].reset_index(drop = True)
# sent_embeddings = decode_sent_embeddings(df_sent_cluster['sent_embedding'])

In [416]:
# sents

In [None]:
# # PCA with 3 dimensions
# pca = PCA(n_components=3)
# components = pca.fit_transform(sent_embeddings)

# total_var = pca.explained_variance_ratio_.sum() * 100

In [None]:
# fig = px.scatter_3d(
#     components, x=0, y=1, z=2, color=clustering_model.labels_.astype(str),
#     title=f'Total Explained Variance: {total_var:.2f}%',
#     labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}, 
#     hover_data=[sents],
# )
# fig.show()

In [187]:
df_sents = pd.read_sql_query("""SELECT cluster_id, sent_id, game_id, cluster_name, sent, sent_embedding, senti
                                FROM clusters_sents 
                                JOIN clusters USING(cluster_id)
                                JOIN sents USING(sent_id);""", conn)

In [210]:
game_id = 2

In [188]:
def decode_sent_embeddings(sent_embeddings):
    return np.array([pickle.loads(sent_embedding) for sent_embedding in sent_embeddings])

In [189]:
df_sents

Unnamed: 0,cluster_id,sent_id,game_id,cluster_name,sent,sent_embedding,senti
0,1,25698,1,secrets,You can tell which rooms you've explored and w...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0
1,1,13408,1,secrets,Tons of secrets to be discovered.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
2,1,13973,1,secrets,"This game is a perfect 10/10, short enough for...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
3,1,28086,1,secrets,There are plenty of secrets and upgrades scatt...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0
4,1,3217,1,secrets,Easygoing 2d metroidvania that will take rough...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1
...,...,...,...,...,...,...,...
40772,200,19397,2,patience,I have no idea how people manage the patience ...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,-1
40773,200,24786,2,patience,"That being said, this game takes far more pati...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0
40774,200,23586,2,patience,"Challenging combat, but do-able with enough pl...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0
40775,200,28285,2,patience,Just a side note: this game will keep getting ...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1


In [190]:
df_sent = df_sents[df_sents['game_id'] == game_id]

In [191]:
df_sent

Unnamed: 0,cluster_id,sent_id,game_id,cluster_name,sent,sent_embedding,senti
28846,101,3866,2,people,Reasonable people can reasonably disagree--lik...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0
28847,101,29079,2,people,I can only concure with the overwhelming posit...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0
28848,101,22188,2,people,"Very fun, but not for people who dont like cha...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0
28849,101,17832,2,people,Not easy but not as hard as people say.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,-1
28850,101,5679,2,people,You should feel threatened towards people like...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,-1
...,...,...,...,...,...,...,...
40772,200,19397,2,patience,I have no idea how people manage the patience ...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,-1
40773,200,24786,2,patience,"That being said, this game takes far more pati...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0
40774,200,23586,2,patience,"Challenging combat, but do-able with enough pl...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0
40775,200,28285,2,patience,Just a side note: this game will keep getting ...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1


In [192]:
from math import ceil

# cluster names ordered by number of sentences, largest first
cluster_names = list(df_sent['cluster_name'].value_counts().index)
max_sents = 5

if len(cluster_names) >= max_sents: 
    n_sents = 1
    top_n_cluster = max_sents
else: 
    n_sents = ceil(max_sents / len(cluster_names))
    top_n_cluster = len(cluster_names)

cluster_name_vals = []
sent_vals = []


for cluster_name in cluster_names[:top_n_cluster]: 
    df_sent_cluster = df_sent[df_sent['cluster_name'] == cluster_name]
    
    sents = df_sent_cluster['sent'].reset_index(drop = True)
    sent_embeddings = decode_sent_embeddings(df_sent_cluster['sent_embedding'])
    
    summary_sents = hybrid_sum(sents, sent_embeddings, n_sents)

    cluster_name_vals.extend([cluster_name] * n_sents)
    sent_vals.extend(summary_sents)
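
The branching at the top of this cell decides how many clusters to show and how many sentences to take from each, so that the table holds roughly `max_sents` rows. A minimal sketch of that rule as a standalone function follows; `plan_summary` is a hypothetical name, and it assumes at least one cluster.

```python
from math import ceil

def plan_summary(n_clusters, max_sents=5):
    """Return (sentences per cluster, number of clusters to show).

    Assumes n_clusters >= 1.
    """
    if n_clusters >= max_sents:
        # Enough clusters: one sentence from each of the max_sents largest.
        return 1, max_sents
    # Few clusters: spread max_sents over all of them, rounding up.
    return ceil(max_sents / n_clusters), n_clusters

print(plan_summary(100))  # (1, 5)
print(plan_summary(2))    # (3, 2)
```

Rounding up means the table can exceed `max_sents` rows when there are few clusters: two clusters yield up to 3 sentences each, so 6 rows.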

In [193]:
fig = go.Figure(data=[go.Table(header=dict(values=['Aspect', 'Sentence']),
                 cells=dict(values=[cluster_name_vals, sent_vals]))
                     ])
fig.show()

In [211]:
df_sent = df_sents[(df_sents['game_id'] == game_id)]
n_sents = 5

In [214]:
cluster_sent_tuples = []

for cluster_name in cluster_names:
#     print(f'Summary for cluster: {cluster_name}')
    df_sent_cluster_name = df_sent[df_sent['cluster_name'] == cluster_name]

    sents = df_sent_cluster_name['sent'].reset_index(drop=True)
    sent_embeddings = decode_sent_embeddings(df_sent_cluster_name['sent_embedding'])
    cluster_id = df_sent_cluster_name['cluster_id'].iloc[0]

#     # summarize with agglomerative clustering
#     sum_type = 'agglo'
#     sents_sum = agglo_sum(sents, sent_embeddings, n_sents)

#     for i, sent in enumerate(sents_sum):
#         sent_id = df_sent_cluster_name[df_sent_cluster_name['sent'] == sent]['sent_id'].iloc[0]

#         cluster_sent_tuples.append([cluster_id, sent_id, sum_type, i + 1])

    # summarize with hybrid clustering
    sents_sum = hybrid_sum(sents, sent_embeddings, n_sents)

#     pprint(sents_sum)

    # rank = 1-based position of the sentence in the summary
    for i, sent in enumerate(sents_sum):
        # look up sent_id by sentence text (first match if the text repeats)
        sent_id = df_sent_cluster_name[df_sent_cluster_name['sent'] == sent]['sent_id'].iloc[0]

        cluster_sent_tuples.append([int(i + 1), int(cluster_id), int(sent_id)])

In [215]:
len(cluster_sent_tuples)

500

In [216]:
with conn:
    cursor.executemany("""UPDATE clusters_sents SET rank = ?
                        WHERE cluster_id=? AND sent_id=?;""",
                   cluster_sent_tuples)

In [61]:
from scipy.cluster.vq import vq

def get_centroid(arr):
    # mean of the row vectors, one value per embedding dimension
    return arr.mean(axis=0)

# return, for each centroid, the index of the nearest vector in corpus_embeddings
def get_nearest_indexes(centroids, corpus_embeddings):
    return vq(centroids, corpus_embeddings)[0]
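
`vq` assigns each row of its first argument to the nearest row of its second by Euclidean distance, which is why passing the centroids first returns sentence indexes. For a single cluster the same selection can be sketched with NumPy alone; `nearest_to_centroid` and the 2-D points below are illustrative, not part of the notebook.

```python
import numpy as np

def nearest_to_centroid(embeddings):
    # Centroid = mean of all row vectors.
    centroid = embeddings.mean(axis=0)
    # Euclidean distance from every row to the centroid.
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    return int(np.argmin(distances))

points = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0],
                   [0.4, 0.4]])

idx = nearest_to_centroid(points)
print(idx)  # 3: the centroid is (0.35, 0.35), closest to (0.4, 0.4)
```

The selected row is the "most central" sentence of the cluster, which is what the summarizers below use as the cluster's representative.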

In [62]:
def kmeans_sum(sents, sent_embeddings, n_sents=5):
    # one k-means cluster per summary sentence
    clustering_model = KMeans(n_clusters=n_sents, random_state=0)
    clustering_model.fit(sent_embeddings)
    centroids = clustering_model.cluster_centers_

    # for each centroid, pick the sentence whose embedding lies closest to it
    indexes = get_nearest_indexes(centroids, sent_embeddings)

    return [sents[index] for index in indexes]

In [63]:
# kmeans_sum(sents, sent_embeddings, 10)

In [143]:
len(sents)

452

In [64]:
def agglo_cluster(sents, sent_embeddings, n_sents, threshold=0.5):
    # average-linkage clustering on cosine distance, cut at `threshold`
    # (scikit-learn 1.2+ renames the `affinity` argument to `metric`)
    clustering_model = AgglomerativeClustering(
        n_clusters=None, affinity='cosine', linkage='average', distance_threshold=threshold
    )
    clustering_model.fit(sent_embeddings)
    cluster_assignment = clustering_model.labels_

    clustered_sents = {}
    clustered_sent_embeddings = {}

    for sent_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in clustered_sents:
            clustered_sents[cluster_id] = []
            clustered_sent_embeddings[cluster_id] = []

        clustered_sents[cluster_id].append(sents[sent_id])
        clustered_sent_embeddings[cluster_id].append(sent_embeddings[sent_id])

#     pprint(clustered_sents)

    # too few clusters for the requested summary length: retry with a tighter threshold
    if len(clustered_sents) < n_sents:
        return agglo_cluster(sents, sent_embeddings, n_sents, threshold=threshold - 0.1)

    return cluster_assignment, clustered_sents, clustered_sent_embeddings

In [65]:
def hybrid_cluster(sents, sent_embeddings, n_sents): 
    clustering_model = hdbscan.HDBSCAN(min_cluster_size=2)
    clustering_model.fit(sent_embeddings)
    cluster_assignment = clustering_model.labels_

    clustered_sents = {}
    clustered_sent_embeddings = {}

    for sent_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in clustered_sents:
            clustered_sents[cluster_id] = []
            clustered_sent_embeddings[cluster_id] = []

        clustered_sents[cluster_id].append(sents[sent_id])
        clustered_sent_embeddings[cluster_id].append(sent_embeddings[sent_id])
    
#     pprint(clustered_sents)
    
    # drop label -1, which HDBSCAN assigns to points it treats as noise
    clustered_sents = {k: v for k, v in clustered_sents.items() if k != -1}

    if len(clustered_sents) >= n_sents:
        cluster_assignment = [e for e in cluster_assignment if e != -1]
        clustered_sent_embeddings = {k: v for k, v in clustered_sent_embeddings.items() if k != -1}
        return cluster_assignment, clustered_sents, clustered_sent_embeddings
    else:
        # HDBSCAN found too few clusters: fall back to agglomerative clustering
        return agglo_cluster(sents, sent_embeddings, n_sents)

In [66]:
def sort_clusters(cluster_assignment):
    cluster_nums, counts = np.unique(cluster_assignment, return_counts=True)
    cluster_nums_counts = list(zip(cluster_nums, counts))
    
    return [cluster_num for cluster_num, _ in sorted(cluster_nums_counts, key=lambda x: x[1], reverse=True)]
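
The ordering produced by `sort_clusters` (largest cluster first) can also be sketched with the standard library alone. `sort_clusters_by_size` is a hypothetical stand-in, and the label list is invented.

```python
from collections import Counter

def sort_clusters_by_size(labels):
    # most_common() yields (label, count) pairs, largest count first.
    return [label for label, _ in Counter(labels).most_common()]

order = sort_clusters_by_size([2, 0, 2, 1, 2, 0])
print(order)  # [2, 0, 1]
```

The two versions can differ on ties: `Counter.most_common` keeps equally sized clusters in first-seen order, whereas the `np.unique`-based version keeps them in ascending label order.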

In [67]:
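# Generate the summary: walk the clusters from largest to smallest and, for each,
# take the sentence closest to the cluster centroid until n_sents are collected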
def gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents):
    cluster_nums = sort_clusters(cluster_assignment)
    
    sents_count = 0 

    sents_sum = []

    for cluster_num in cluster_nums:
        centroid = get_centroid(np.array(clustered_sent_embeddings[cluster_num]))
        centroid = np.array([centroid])

        index = get_nearest_indexes(centroid, clustered_sent_embeddings[cluster_num])[0]
        sents_sum.append(clustered_sents[cluster_num][index])

        sents_count += 1
        if sents_count == n_sents: 
            break
    
    return sents_sum

In [68]:
def agglo_sum(sents, sent_embeddings, n_sents=5, threshold=0.5):
    # nothing to summarize if there are no more sentences than requested (same guard as hybrid_sum)
    if len(sents) < 2 or n_sents >= len(sents):
        return sents

    cluster_assignment, clustered_sents, clustered_sent_embeddings = agglo_cluster(sents, sent_embeddings, n_sents, threshold)
    return gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents)

In [69]:
def hybrid_sum(sents, sent_embeddings, n_sents=5): 
    if len(sents) < 2 or n_sents >= len(sents): 
        return sents
    
    cluster_assignment, clustered_sents, clustered_sent_embeddings = hybrid_cluster(sents, sent_embeddings, n_sents)
    return gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents)

In [70]:
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.',
          'I hate the whole universe.',
          'The car is badly broken.',
          'Summer is the hottest season.'
          ]

corpus_embeddings = embedder.encode(corpus)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings /  np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

In [73]:
agglo_sum(corpus, corpus_embeddings, 5)

['A man is eating food.',
 'The girl is carrying a baby.',
 'Someone in a gorilla costume is playing a set of drums.',
 'A cheetah is running behind its prey.',
 'A man is riding a white horse on an enclosed ground.']

In [455]:
agglo_sum(sents, sent_embeddings, 100)

{0: ["It's a dark and depressing world, and the themes I've encountered "
     'thusfar are good, while the boss battles are good for getting you '
     'psyched and pumped up for a fight.',
     'If you like beating a tough boss fight, then getting killed by some '
     'random monster only to have to redo that fight again, this game is for '
     'you.',
     "There's one boss that has an enemy in the bit of the room prior to the "
     'arena and that enemy will make the fight immensely painful if not '
     'genuinely impossible if you decide not to kill it.',
     "There's also an optional extra boss if you want one more offensive spell "
     'in your inventory (or just seek one more fight), because the boss fights '
     'are just that well-designed.',
     'Pretty much all bosses have random attack orders and this could end up '
     'in your "favor", allowing you to just spam your ranged attack to kill '
     'them... or it can end up just requiring you to run around for 80% o

      'In addition to some new abilities you will gain later on to help with '
      'traversal and combat.',
      'The items available make quite a difference in combat and movement, '
      'giving it a good feeling of progression.',
      'These modifiers range from combat oriented to exploration oriented.',
      'You get to mix and match passive items in order to change your attacks '
      'up a bit, and theres a fair amount of challenge.',
      'Over the course of the game you\'ll acquire a number of "active" items '
      'that you can equip and use to do things like heal yourself or increase '
      'your attack power, and you\'ll acquire more "passive" items that add '
      'effects such as poison to your ranged attacks.',
      'Pick up collectables to enhance your abilities, and switch and swap '
      'passive and active skills to give you just the right advantage to face '
      'the enemy that lays before you.',
      'You have your map, your short ranged and long ran

      'attack, one of which just makes your melee weapon do more damage, and '
      'only two of which actually provide new mobility options.',
      'The controls are simple but effective, and besides one or two design '
      "quirks that personally I would've changed (the interaction between the "
      'attack and roll animations, and the invincibility frames of the latter) '
      "there's nothing to complain about - you have melee and ranged attacks, "
      'dodge, double-jump and items with various active and passive effects to '
      'use.',
      'The ability to use both melee and ranged attacks since the beginning is '
      'quite neat, since it offers more options in the way you engage in '
      'battles.',
      'Combat has you dodging enemies attack with a dodge roll (and later, an '
      'air dodge), attacking enemies with either melee or ranged attacks, '
      'depending on how you want to play.',
      'Enemy damage is way too high (a lot of regular enemies can 2

      'variation, the enemy moves fast and attacks fast, outspeeds us in '
      'general, badly designed bosses.',
      'From the in-game items to the enemy types nothing feels out of place in '
      'the setting presented to you.',
      'Some might say not having different armour and weapons adds to the '
      "simplicity and the charm of this game, but the game doesn't make up for "
      'it.',
      'I often find in games that ranged weapons are normally only put there '
      'for very specific situations, or for people to mess around.',
      "There's no change in weapons but it didn't seem needed; the game felt "
      'well balanced around the moves and weapons you have.',
      'Gameplay elements are much more compacted from past games with all the '
      'gadgets and whatnot removed, but the two weapons with modifiable '
      'properties feel great and their expanded complexities and use-cases '
      'easily make up for the losses.',
      "No real progression in term

['So you have a melee attack and a ranged attack.',
 'Fun game with great visuals and satisfying combat.',
 'Fun and challenging enemies, especially the bosses.',
 'Combat has a nice visceral feel to it.',
 'Few variety of enemies.',
 'This is a great MetroidVania with some interesting combat and a compelling in game universe.',
 'Except using a bow and your leaf to attack enemies, consumables option and even some handy items/spells you can use during combat.',
 "I really tried to like this game, but I can't find any redeeming qualities - the combat is basic and not very satisfying, the platforming feels unresponsive, the story (at least as far as I played) seems non-existant...",
 'This game is fun, yet challenging, and I will admit that I am having some trouble adapting to the enemies attacks, and it leads to a quick death.',
 'The reason is that you will turn your bow from a decent secondary option, into your main weapon that destroys most bosses and enemies with little effort.',
 '

In [456]:
hybrid_sum(sents, sent_embeddings, 100)

{-1: ['Come to think of it, that is actually funny and not annoying, but no '
      'way I am going to do that fight when parents are around.',
      'One of the reasons I got the chance to listen to that music so much was '
      'because that fight was kicking my sweet ass all kinds of ways.',
      "It's a dark and depressing world, and the themes I've encountered "
      'thusfar are good, while the boss battles are good for getting you '
      'psyched and pumped up for a fight.',
      'You explore fight, collect power-ups and items to improve you '
      'character, and run around trying to figure out where to go next.',
      "I didn't even use all 10 of the healing items I had in that fight, and "
      'I had 4 full heals in reserve just in case that got never used....',
      'That fight was amazing.',
      'It was something simple and predictable but enjoyable, especially that '
      'fight in the monestary, by far best part of the game.',
      "Also useless, because you

      'dramatically alter how you will approach or revisit a level.',
      'I think it could have used a couple more weapons and skills(like wall '
      'jump) but all in all it as pretty sweet.',
      "You're constantly being shot at by things off screen, and most of the "
      'projectiles pass through scenery so there are just projectiles coming '
      'from all over the place seemingly at random.',
      'The challenge in that game comes from learning the patterns of the '
      'ennemies and how to respond to them, in Momodora you just spam the '
      'thing till their health is gone while the projectiles they send to you '
      'is so stupidly easy to avoid that you forget halfway through the game '
      "that there's a ROLL button.",
      'In fact, almost all the bosses in this game encourage you to just stand '
      'and spam projectiles.',
      "Their projectiles show some variation, but aren't very challenging.",
      'The only complaint I have on this front is th

       "error, as not understanding how the boss' attacks work can get you "
       'instantly killed.',
       'There are lots of unfair enemy attacks and placements that are '
       'unpredictable without having played the level previously, so there is '
       'a lot of trial and error.'],
 149: ['The spooky music and enemies, the graphics, texture, style of combat.',
       'There are wonderfully designed enemies, environments, and music.',
       'The spooky music and enemies, the graphics, texture, style of combat.',
       'Beautiful artwork, captivating soundtrack, and ridiculously cool '
       'enemies.',
       "It's a polished and punishing little piece with good enemies, art and "
       'an amazing soundtrack.'],
 150: ['My one gripe with the game is that the instant you learn the attacks '
       'of bosses they become relatively simple, even on Insane.',
       "Difficulty-wise, I think it's moderate, leaning to easy, since the "
       'attack patterns of monsters & b

       'ambushes, two-to-three hit deaths, the lack of healing in the first '
       'two areas, envirnomental hazards that sometimes kill you and sometimes '
       'do nothing because you have to go through them, enemies who can '
       'surprise-lunge at you or hide their projectiles in the foreground '
       'objects, enemies that spam projectiles at you from offscreen, early '
       'bosses that rely on rote-memorization because they have more attacks '
       'that kill you out of nowhere than you have hit points...',
       "Enemies are placed in obnoxious areas, the dodge roll doesn't feel "
       'right, and the little shield/knife throwy enemies can screw off.',
       'Once you get past the dodgy hitboxes, uncomfortably stiff controls, '
       'enemies that only telegraph some of their attacks, a dodge roll that '
       'leaves you vulnerable to contact damage with enemies, bullshit '
       'ambushes, two-to-three hit deaths, the lack of healing in the first '
       

['Fun and challenging enemies, especially the bosses.',
 'Combat is satisfying.',
 'Fun game with great visuals and satisfying combat.',
 'Mostly limited to leaf and bow, almost no side weapons/abilities.',
 'Few variety of enemies.',
 'The Maple Leaf seems like an odd choice for a weapon, but in practice it whips out similar to some sword animations in other games, and it has a familiar three hit combo you can mess with.',
 'Enemies tend to deal quite a bit of damage.',
 "Those fights are at least kind of interesting conceptually, but the vast majority of the game is not in those fights, and as outlined above, there isn't much else to it.",
 'The spooky music and enemies, the graphics, texture, style of combat.',
 'The reason is that you will turn your bow from a decent secondary option, into your main weapon that destroys most bosses and enemies with little effort.',
 'Very few ways to mix up combat, in some respects.',
 'There are two types of attack a melee attack with your sacred 

In [630]:
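from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# Join all sentences of the cluster into one document and parse it from a string
doc_big = ' '.join(sents)
parser = PlaintextParser.from_string(doc_big, Tokenizer("english"))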

In [632]:
from sumy.summarizers.lex_rank import LexRankSummarizer

# Using LexRank
summarizer = LexRankSummarizer()
# Summarize the document with 5 sentences
summary = summarizer(parser.document, 5)

for sentence in summary:
    print(sentence)

2D Dark Souls.
Dark Souls.
Dark Souls but not Dark Souls.
It's like Dark Souls.
The Dark Souls of 2D.


In [633]:
# Luhn: scores sentences by their density of significant (frequent) words
summarizer_1 = LuhnSummarizer()
summary_1 = summarizer_1(parser.document, 5)
for sentence in summary_1:
    print(sentence)

This game is make but 1 guy but the 1st game was like cave story mod (he said it was mod) and the the second game was different and 1,2 is on itch but with a different name and free then a third game he change his name (i think pze comment on this if i was incorrect) and it's on steam same name of the game's and this is the fourth with a different game and the game it have 2 ending's (i did the good ending by being me) and 1,2,3 game's look cute and dark at the same time but this game is dark pf the darkness and cute the maker of this game's you can find all the game's he did on it's only 5 game's now.
Animations look great, but end up being super clunky in the combat the game sets out, some enemies have contact damage, some dont(even bosses choose randomly on this), some enemies that look exactly the same as others have entirely different attacks that need to to already be doing something that counters them, and sometimes the action that counters one, and the action that counters some

In [634]:
# LSA: ranks sentences via a singular value decomposition of the term-sentence matrix
summarizer_2 = LsaSummarizer()
summary_2 = summarizer_2(parser.document, 5)
for sentence in summary_2:
    print(sentence)

Expect 6-8 hours of gameplay on your first run through, even with harder difficulties subsequent runs should take an hour or so since it doesn't change too much between.
After long season of Hollow Knight, Momodora: Reverie Under The Moonlight feels too short but gameplay is fun so you can always try on hard mode and hunt some achievement!
If you like tough as hell action games with c r a z y boss battles, rolling, double-jumping and (later) mid-air dodging.....metroidvania style progression.........beautiful graphics........smooth animation....uhhh???
The story itself isn't really deep, and I would have loved if we could learn just a little bit more about each character, but everything still plays out perfectly.
Now, that 'bad' list seems to offset the good by a large margin, but it is also very dependent on what I find annoying and unnecessary for modern action platformers.


In [637]:
# TextRank: PageRank-style ranking over a sentence similarity graph
summarizer_3 = TextRankSummarizer()
summary_3 = summarizer_3(parser.document, 5)
for sentence in summary_3:
    print(sentence)

This game is make but 1 guy but the 1st game was like cave story mod (he said it was mod) and the the second game was different and 1,2 is on itch but with a different name and free then a third game he change his name (i think pze comment on this if i was incorrect) and it's on steam same name of the game's and this is the fourth with a different game and the game it have 2 ending's (i did the good ending by being me) and 1,2,3 game's look cute and dark at the same time but this game is dark pf the darkness and cute the maker of this game's you can find all the game's he did on it's only 5 game's now.
Momodora: Reverie Under the Moonlight is a cutesy pixel Metroidvania game with few memorable boss battle with suitable background music while you hack and slash your way to the next boss battle, which you need the skill to dodge the attack because you'll be doing a lot of trial and error even tho the game is short but there is a few flaws on the game, the level design is alright but can 

In [644]:
# SumBasic: word-frequency based; request 10 sentences here instead of 5
summarizer_4 = SumBasicSummarizer()
summary_4 = summarizer_4(parser.document, 10)
for sentence in summary_4:
    print(f'> {sentence}')

> Fun metroidvania-style game.
> It manages it to make Dark Souls 2d.
> If you like action platformers this one is for you.
> Good Pixel art.
> 2D metroidvania with a darksouls feel.
> The music, the artstyle, the mechanics!
> and a great soundtrack.
> Very challenging, but lots of fun!
> Boss fights.
> I want more games in that style.
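
The SumBasic output above is made up of short sentences built from frequent words, unlike the long run-on reviews picked by Luhn and TextRank. A minimal sketch of a SumBasic-style selection loop may help show why. It is a simplified illustration and not sumy's implementation: the function name `sumbasic_sketch` and the regex tokenizer are assumptions, and it omits stop-word removal and the original algorithm's requirement that the chosen sentence contain the currently most probable word.

```python
import re
from collections import Counter


def sumbasic_sketch(sentences, n_sentences):
    """Select sentences with a simplified SumBasic-style scoring (hypothetical helper)."""
    # Naive tokenizer (assumption): lowercase alphabetic words and apostrophes.
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # Unigram probability of every word over the whole document.
    counts = Counter(w for words in tokenized for w in words)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    def score(i):
        # Average word probability: long sentences get diluted by rare words.
        return sum(prob[w] for w in tokenized[i]) / len(tokenized[i])

    chosen = []
    remaining = [i for i, words in enumerate(tokenized) if words]
    while remaining and len(chosen) < n_sentences:
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
        # Square the probability of the words just used, so the next pick
        # is pushed towards content that has not been covered yet.
        for w in set(tokenized[best]):
            prob[w] **= 2

    # Sentences are returned in the order they were selected.
    return [sentences[i] for i in chosen]


reviews = [
    "Great game.",
    "Great game with great music.",
    "The story is long and the pacing drags in the final hours of the campaign.",
]
for sentence in sumbasic_sketch(reviews, 2):
    print(f'> {sentence}')
```

Because the score is an average, a two-word sentence made of common words wins the first round, and the squaring step then penalizes the near-duplicate second sentence in favour of new content.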


In [671]:
agglo_sum(sents, sent_embeddings, 10, 0.5)

['Very good metroidvania styled game.',
 'Has some challenging boss fights.',
 'Fantastic pixel art.',
 'It has some game mechanics just like Dark Souls .',
 'Great game with fantastic art and music.',
 'Dark Souls.',
 'So you have a melee attack and a ranged attack.',
 "I'm playing the game on normal difficulty and it's already quite challenging.",
 "It reminds me most of Hollow Knight, although it isn't quite as good as that game.",
 'I can say this game is worth of its full price.']
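
The definition of `agglo_sum` is not shown in this section. As a rough illustration only, the sketch below shows one way such a function could work: merge sentence embeddings bottom-up with average-linkage agglomerative clustering on cosine distance, stop once the closest pair of clusters is at or above a threshold, then return one representative sentence from each of the largest clusters. Everything here is an assumption, including the name `agglo_sum_sketch`, the reading of `10` as the number of sentences and `0.5` as the distance threshold, and the choice of representative. It is written in pure Python for clarity and would be slow on thousands of sentences.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglo_sum_sketch(sents, embeddings, n_sentences, distance_threshold):
    """Hypothetical cluster-based extractive summary (not the notebook's agglo_sum)."""
    # Start with every sentence in its own cluster (lists of sentence indices).
    clusters = [[i] for i in range(len(sents))]

    def linkage(a, b):
        # Average linkage: mean pairwise distance between two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in a for j in b]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist >= distance_threshold:
            break  # no remaining pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    # Largest clusters first: these are the most frequently repeated opinions.
    clusters.sort(key=len, reverse=True)

    summary = []
    for members in clusters[:n_sentences]:
        # Representative = member with the smallest total distance to its cluster.
        rep = min(
            members,
            key=lambda i: sum(cosine_distance(embeddings[i], embeddings[j]) for j in members),
        )
        summary.append(sents[rep])
    return summary


# Toy 2-D "embeddings" (made up for illustration, not S-BERT output).
toy_sents = ["Great art.", "Fantastic pixel art.", "Lovely art style.",
             "Hard bosses.", "Tough boss fights.", "Too short."]
toy_emb = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05),
           (0.0, 1.0), (0.1, 0.9), (-1.0, -1.0)]
print(agglo_sum_sketch(toy_sents, toy_emb, 2, 0.5))
```

Picking one sentence per cluster avoids the redundancy seen in the LexRank output above, where all five sentences are variations of "Dark Souls".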