### Steps
- Use a list of **app_id** to get info from Steam crawler and insert **app_id**, **game_name**, **header_img_url**, **total_positive**, **total_negative**, **total_reviews** into table **games** and insert **game_id**, **review**, **recommended**, **time** into table **reviews**.
- Preprocess **review** in table **reviews** (add_missing_punct, replace_bullets, remove_url, remove_html_tags, normalize_single_quote, remove_non_ascii, remove_ansi_escape_sequences, remove_multi_whitespaces), then tokenize_sent and remove_leading_symbols to insert **review_id**, **sent** into table **sents**. 
- Preprocess **sent** in table **sents** (lowercase, expand contractions, remove_digits, remove_symbols, remove_multi_whitespaces, lemmatize_text, remove_stopwords) to create **sent_prep** in table **sents**.
- Use **sent_prep**, **review_id** in table **sents** to insert **review_prep** in table **reviews** by joining **sent_prep**.
- Use **review_prep** in table **reviews** to count the bigrams that pass the part-of-speech rules (ADJ or NOUN followed by NOUN) and keep the 50 most frequent as keywords.
- Insert **kw**, **freq** into table **kws**.
- Embed 50 keywords using S-BERT and cluster them using agglomerative clustering with a distance_threshold=0.6. 
	- Insert **cluster_name** (name of the most frequent keyword in cluster) into table **clusters**.
	- Insert **cluster_id** in table **kws**.
- Loop through **sent_prep** in table **sents**, fuzzy-match each **kw** in table **kws**. 
    - Insert **cluster_id**, **sent_id** in table **clusters_sents** to link table **clusters** and **sents**.
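
The steps above write to six tables. A minimal sketch of a schema that would support them is given here; the table and column names come from the steps and cells in this notebook, while the column types, the key constraints and the `kw_id` column are assumptions, since the actual `CREATE TABLE` statements are not shown.

```python
import sqlite3

# Assumed schema: names follow the notebook, types and constraints are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS games (
    game_id        INTEGER PRIMARY KEY,
    app_id         INTEGER UNIQUE,
    game_name      TEXT,
    header_img_url TEXT,
    total_positive INTEGER,
    total_negative INTEGER,
    total_reviews  INTEGER
);
CREATE TABLE IF NOT EXISTS reviews (
    review_id   INTEGER PRIMARY KEY,
    game_id     INTEGER REFERENCES games(game_id),
    review      TEXT,
    review_prep TEXT,
    recommended INTEGER,
    time        INTEGER
);
CREATE TABLE IF NOT EXISTS sents (
    sent_id        INTEGER PRIMARY KEY,
    review_id      INTEGER REFERENCES reviews(review_id),
    sent           TEXT,
    sent_prep      TEXT,
    senti          INTEGER,
    sent_embedding BLOB
);
CREATE TABLE IF NOT EXISTS clusters (
    cluster_id   INTEGER PRIMARY KEY,
    game_id      INTEGER REFERENCES games(game_id),
    cluster_num  INTEGER,
    cluster_name TEXT
);
CREATE TABLE IF NOT EXISTS kws (
    kw_id      INTEGER PRIMARY KEY,  -- hypothetical surrogate key
    kw         TEXT,
    freq       INTEGER,
    cluster_id INTEGER REFERENCES clusters(cluster_id)
);
CREATE TABLE IF NOT EXISTS clusters_sents (
    cluster_id INTEGER REFERENCES clusters(cluster_id),
    sent_id    INTEGER REFERENCES sents(sent_id)
);
"""


def create_schema(conn):
    """Create all six tables on the given connection."""
    conn.executescript(SCHEMA)


# Try it on an in-memory database so no file is touched.
demo_conn = sqlite3.connect(':memory:')
create_schema(demo_conn)
tables = sorted(row[0] for row in demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table';"))
print(tables)
```

Plain `INTEGER PRIMARY KEY` columns let SQLite assign the ids, which is what the later `SELECT game_id FROM games WHERE app_id=?` lookup relies on.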

### Step 1:

- Use a list of **app_id** to get info from Steam crawler and insert **app_id**, **game_name**, **header_img_url**, **total_positive**, **total_negative**, **total_reviews** into table **games** and insert **game_id** (fk), **review**, **recommended**, **time** into table **reviews**.

In [1]:
import requests 
import pandas as pd
import numpy as np

import download_steam_reviews

import sqlite3

import re
from bs4 import BeautifulSoup

import spacy
# nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_trf")

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

from sklearn.cluster import AgglomerativeClustering

from fuzzysearch import find_near_matches

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import flair

sentiment_model = flair.models.TextClassifier.load('sentiment')
sentiment_model_fast = flair.models.TextClassifier.load('sentiment-fast')
senti_analyzer = SentimentIntensityAnalyzer()

import pickle

import plotly.express as px
import plotly.graph_objects as go

from sklearn.cluster import KMeans
import hdbscan
from summa.summarizer import summarize

from scipy.cluster.vq import vq

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap.umap_ as umap

from tqdm import tqdm

from collections import Counter

from fuzzywuzzy import fuzz, process

from spacy import displacy

from pprint import pprint



2021-07-16 10:44:55,391 loading file C:\Users\HuyTran\.flair\models\sentiment-en-mix-distillbert_4.pt
2021-07-16 10:45:01,322 loading file C:\Users\HuyTran\.flair\models\sentiment-en-mix-ft-rnn.pt


In [2]:
# from contractions import contractions
contractions = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
}

In [3]:
conn = sqlite3.connect('./data/steam_reviews.db') 
cursor = conn.cursor()

In [None]:
def print_cursor(cursor): 
    print(cursor.execute("""
        select * from games;
    """).fetchall())

In [None]:
print_cursor(cursor)

In [None]:
print(cursor.execute("""
        select game_id from games 
        where app_id=367520;
    """).fetchall()[0][0])

In [None]:
# 428550 - Momodora: Reverie under the Moonlight
# 367520 - Hollow Knight
app_ids = [428550, 367520]

In [None]:
### get game_name ###

def get_app_list(): 
    app_list_url = 'https://api.steampowered.com/ISteamApps/GetAppList/v2/'
    resp_data = requests.get(app_list_url)
    return resp_data.json()

def get_name(app_id, app_list): 
    for app in app_list['applist']['apps']: 
        if app['appid'] == app_id: 
            return app['name']

In [None]:
### get header_img_url  

def get_header_img_url(app_id): 
    return f'https://cdn.cloudflare.steamstatic.com/steam/apps/{app_id}/header.jpg'

In [None]:
### get total_positive, total_negative, total_reviews from the crawler for table games 
### get game_id, review, recommended, time from the crawler for table reviews 

app_list = get_app_list()

request_params = {
    'language': 'english'
}

# load or download new (maximum ~5000 newest reviews)
load_mode = True

for app_id in app_ids: 
    game_tuples = [] 

    game_name = get_name(app_id, app_list)
    
    header_img_url = get_header_img_url(app_id)
    
    total_positive, total_negative, total_reviews = 0, 0, 0
    
    if load_mode: 
        review_dict = download_steam_reviews.load_review_dict(app_id)['reviews'].values()
    else: 
        review_dict = download_steam_reviews.download_reviews_for_app_id(app_id, 
                                                                     chosen_request_params=request_params, 
                                                                     reviews_limit=5000)[0]['reviews'].values()
    
    review_tuples = []
    
    with conn:
        cursor.execute("""INSERT INTO games (app_id, game_name, header_img_url) VALUES (?, ?, ?);""", 
                       (app_id, game_name, header_img_url))
    
    # get game_id (fk) for table reviews
    game_id = cursor.execute("""SELECT game_id FROM games WHERE app_id=?;""", (app_id,)).fetchone()[0] 
    
    for review_dict_value in review_dict: 
        total_reviews += 1
    
        voted_up = 1 if review_dict_value['voted_up'] else 0
    
        if voted_up: 
            total_positive += 1
        else: 
            total_negative += 1
     
        review = review_dict_value['review']
        recommended = voted_up
        time = review_dict_value['timestamp_updated']
        
        review_tuples.append((review, recommended, time, game_id))
    
    with conn:
        cursor.execute("""UPDATE games SET (total_positive, total_negative, total_reviews) = (?, ?, ?)
                            WHERE game_id=?;""",
                       (total_positive, total_negative, total_reviews, game_id))
    
    with conn:
        cursor.executemany("""INSERT INTO reviews (review, recommended, time, game_id) VALUES 
                                (?, ?, ?, ?);""", review_tuples)    

### Step 2:

- Preprocess **review** in table **reviews** (add_missing_punct, replace_bullets, remove_url, remove_html_tags, normalize_single_quote, remove_non_ascii, remove_ansi_escape_sequences, remove_multi_whitespaces), then tokenize_sent and remove_leading_symbols to insert **review_id**, **sent** into table **sents**. 
- Preprocess **sent** in table **sents** (lowercase, expand contractions, remove_digits, remove_symbols, remove_multi_whitespaces, lemmatize_text, remove_stopwords) to insert **sent_prep** in table **sents**.
- Use **sent_prep**, **review_id** in table **sents** to insert **review_prep** in table **reviews** by joining **sent_prep**.
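
No cell below carries out the last bullet, so here is a minimal sketch of one way to build **review_prep**. It assumes that sentences are joined with a single space in **sent_id** order, and it runs against a throwaway in-memory database with made-up rows rather than `steam_reviews.db`.

```python
import sqlite3
from itertools import groupby

demo_conn = sqlite3.connect(':memory:')
demo_conn.executescript("""
    CREATE TABLE reviews (review_id INTEGER PRIMARY KEY, review TEXT, review_prep TEXT);
    CREATE TABLE sents (sent_id INTEGER PRIMARY KEY, review_id INTEGER, sent TEXT, sent_prep TEXT);
    INSERT INTO reviews (review_id, review) VALUES
        (1, 'Great music. Hard bosses.'),
        (2, 'Too short.');
    INSERT INTO sents (review_id, sent, sent_prep) VALUES
        (1, 'Great music.', 'great music'),
        (1, 'Hard bosses.', 'hard boss'),
        (2, 'Too short.', 'short');
""")

# Fetch sentences ordered by review, then by position, so groupby sees each review once.
rows = demo_conn.execute(
    "SELECT review_id, sent_prep FROM sents ORDER BY review_id, sent_id;").fetchall()

review_tuples = [
    (' '.join(sent_prep for _, sent_prep in group), review_id)
    for review_id, group in groupby(rows, key=lambda row: row[0])
]

with demo_conn:
    demo_conn.executemany(
        "UPDATE reviews SET review_prep=? WHERE review_id=?;", review_tuples)

result = demo_conn.execute(
    "SELECT review_id, review_prep FROM reviews ORDER BY review_id;").fetchall()
print(result)
```

Joining in Python rather than with SQLite's `group_concat` keeps the sentence order explicit.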
        

    

In [None]:
df_reviews = pd.read_sql_query("""SELECT review_id, review, game_id 
                    FROM reviews JOIN games USING(game_id);""", conn)

In [None]:
df_reviews

In [4]:
def add_missing_punct(text): 
    # raw strings keep \s and \g<1> from being read as string escapes
    return re.sub(r'([A-Za-z0-9])\s*$', r'\g<1>. ', text)


def replace_bullets(text): 
    text = re.sub(r'([A-Za-z0-9])\s*\n+\s*[+-]?\s*', r'\g<1>. ', text)
    text = re.sub(r'\s*([:+-]+)\s*\n+\s*[+-]?\s*', '. ', text) 
    return text
    
    
# remove url from text
def remove_url(text):
    return re.sub(r"http\S+", ' ', text)


# remove HTML tags
def remove_html_tags(text):
    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    # remove square brackets and characters inside
    text = re.sub(r'\[(.*?)\]', ' ', text)
    return text


# replace ’ with ' 
def normalize_single_quote(text):
    return re.sub('[’‘]', '\'', text)


# remove non-ASCII characters (this also drops accented and non-Latin letters)
def remove_non_ascii(text): 
    return text.encode("ascii", errors="ignore").decode()
    
    
# remove ANSI escape sequences
def remove_ansi_escape_sequences(text):
    ansi_escape = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
    return ansi_escape.sub('', text)
    
    
# remove multiple whitespaces with single whitespace
def remove_multi_whitespaces(text): 
    return re.sub(r'\s+', ' ', text.strip())

In [5]:
def remove_bullet_nums(sent): 
    return re.sub(r'^\s*\d+\n*[.\)]+\s*([^\d])', r'\g<1>', sent)

def remove_leading_symbols(sent):
    return re.sub(r'^[^A-Za-z\"\'\d]+', '', sent)

def uppercase_first(sent): 
    return sent[0].upper() + sent[1:] if len(sent) != 0 else sent

In [6]:
def tokenize_sent(text):
    doc = nlp(text, disable=['ner', 'attribute_ruler', 'lemmatizer', 'sentencizer'])
    sents = [str(sent).strip() for sent in doc.sents]
    return sents

In [7]:
def lowercase(text):
    return text.lower()

def expand_contractions(text):
    # replace longer keys first so that e.g. "can't've" is not pre-empted by "can't"
    for key in sorted(contractions, key=len, reverse=True):
        text = text.replace(key, contractions[key])
    return text

# remove digits 
def remove_digits(text): 
    return re.sub(r'\d+', ' ', text)

# remove symbols 
def remove_symbols(text):
    return re.sub(r'[^A-Za-z,.\s\d]+', ' ', text)

# lemmatization with spacy 
def lemmatize_text(text): 
    doc = nlp(text, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc if token.pos_ != 'PUNCT']
    return ' '.join(lemma)

# remove stop words 
def remove_stopwords(text, word_list=[]):
    stop_words = stopwords.words("english")
    stop_words.extend(word_list)
    stop_words = set(stop_words)
    return ' '.join(e.lower() for e in text.split() if e.lower() not in stop_words)

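def get_extra_stopwords(game_name): 
    # local name avoids shadowing the nltk `stopwords` import;
    # 'way' has no trailing space, otherwise it would never match a token
    extra_stopwords = set(['game', 'lot', 'bit', 'way'])
    doc = nlp(game_name, disable=['parser', 'ner'])
    for token in doc: 
        if token.pos_ not in {'PUNCT', 'NUM'}:
            extra_stopwords.add(token.text.lower())
    return extra_stopwords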

In [None]:
sent_tuples = []

for game_id in df_reviews['game_id'].unique():
    with conn: 
        game_name = cursor.execute("""SELECT game_name FROM games WHERE game_id=?;""", (int(game_id),)).fetchone()[0]
    
    extra_stopwords = get_extra_stopwords(game_name)
    
    df_reviews_game = df_reviews[df_reviews['game_id'] == game_id]
    
    reviews_game_cleaned = df_reviews_game['review'].map(add_missing_punct)\
                    .map(replace_bullets)\
                    .map(remove_url)\
                    .map(remove_html_tags)\
                    .map(normalize_single_quote)\
                    .map(remove_non_ascii)\
                    .map(remove_ansi_escape_sequences)\
                    .map(remove_multi_whitespaces)
    
    for review_id, review in zip(df_reviews_game['review_id'], reviews_game_cleaned):
        sents = pd.Series(tokenize_sent(review)).map(remove_bullet_nums)\
                                                .map(remove_leading_symbols)\
                                                .map(uppercase_first)\
                                                .map(add_missing_punct)\
                                                .map(remove_multi_whitespaces)

        sents_prep = sents.map(lowercase)\
                        .map(expand_contractions)\
                        .map(remove_digits)\
                        .map(remove_symbols)\
                        .map(remove_multi_whitespaces)\
                        .map(lemmatize_text)\
                        .map(lambda x: remove_stopwords(x, word_list=extra_stopwords))

        for sent, sent_prep in zip(sents, sents_prep):
            sent_tuples.append((review_id, sent, sent_prep))

In [None]:
with conn:
    cursor.executemany("""INSERT INTO sents (review_id, sent, sent_prep) VALUES 
                        (?, ?, ?);""", sent_tuples)    

### Step 3: 
- Use **review_prep** in table **reviews** to count the bigrams that pass the part-of-speech rules (ADJ or NOUN followed by NOUN) and keep the 50 most frequent as keywords.
- Insert **kw**, **freq** into table **kws**.
- Embed 50 keywords using S-BERT and cluster them using agglomerative clustering with a distance_threshold=0.6. 
	- Insert **cluster_name** (name of the most frequent keyword in cluster) into table **clusters**.
	- Insert **cluster_id** in table **kws**.
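
The frequency part of the first bullet can be sketched with the standard library alone. This is an illustrative stand-in, not the notebook's method: it counts every adjacent word pair and skips the part-of-speech filter, and the helper name `top_bigrams` and the sample strings are made up.

```python
from collections import Counter


def top_bigrams(texts, top_n=50):
    """Count adjacent word pairs over all texts and return the top_n most frequent."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip(tokens, tokens[1:]) yields each pair of neighbouring tokens
        counts.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(top_n)


# hypothetical preprocessed reviews (stand-ins for review_prep)
review_preps = [
    'final boss hard final boss fun',
    'pixel art beautiful final boss',
    'pixel art great',
]
top = top_bigrams(review_preps, top_n=2)
print(top)
```

Each `(bigram, count)` pair corresponds to one `(kw, freq)` row in table **kws**.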

In [None]:
# df_reviews = pd.read_sql_query("""SELECT review_id, review, game_id 
#                     FROM reviews JOIN games USING(game_id);""", conn)

In [None]:
# def get_ngram(x, ngram, min_df=1):
#     vec = CountVectorizer(ngram_range=[ngram, ngram], min_df=min_df).fit(x)
#     bow = vec.transform(x)
#     sum_words = bow.sum(axis = 0)
#     words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
#     words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)
#     return words_freq

# def bigram_rules(bigram): 
#     first_pos = set(['ADJ', 'NOUN'])
#     second_pos = set(['NOUN'])
    
#     tags = [token.pos_ for token in nlp(bigram, disable=['parser','ner'])]
    
#     return tags[0] in first_pos and tags[1] in second_pos

In [None]:
# def unigram_rules(unigram): 
#     first_pos = set(['NOUN', 'PROPN'])
#     tags = [token.pos_ for token in nlp(unigram, disable=['parser', 'ner'])]
#     return tags[0] in first_pos

In [617]:
df_sents = pd.read_sql_query("""SELECT sent, sent_id, game_id
                    FROM sents JOIN reviews USING(review_id) JOIN games USING(game_id);""", conn)

In [618]:
# keep only the sentences of game_id 1 (df_sents was loaded in the previous cell)
df_sents = df_sents[df_sents['game_id'] == 1][['sent', 'sent_id']]

In [None]:
for sent in sents_10: 
    print(f'> {sent}')

In [23]:
def get_tokens_index_dict(string):
    index_dict = {}
    index = 0 
    for i, char in enumerate(string): 
        if char == ' ':
            index += 1
        else: 
            index_dict[i] = index
    
    return index_dict

In [558]:
def get_stopwords(word_list=[]):
    stopwords = nlp.Defaults.stop_words.copy()  
    stopwords |= set(word_list)
    return stopwords

# remove stopwords before grouping
stop_words = get_stopwords(['game', 'games', 'lot', 'lots', 'ton', 'tons', 'bit', 'bits', 'fun', 
                            'way', 'ways', 'thing', 'things', 'time', 'times', 'type', 'types', 
                               'opinion', 'opinions', 'sense', 'terms', 'lack', 'fact'])

In [553]:
# check if a keyword is messy
def is_messy(kw): 
    return re.search(r'[^a-zA-Z0-9\s\-:,;]', kw)

def clean_token(kw_token):
    kw_token = kw_token.lower()
    kw_token = re.sub('[-:]', '', kw_token)
    return kw_token

In [554]:
def get_kws_sent_ids(df_sents):
    kws_sent_ids = set()

    noun_types = set(['NOUN', 'PROPN'])
    adj_types = set(['ADJ'])
    pattern = re.compile('(ADJ (PUNCT ADJ )*)*((NOUN|PROPN) (PUNCT (NOUN|PROPN) )*(PUNCT )?)*(NOUN|PROPN)')

    for sent, sent_id in zip(df_sents['sent'], df_sents['sent_id']): 
        doc = nlp(sent, disable=['parser', 'ner'])

        tokens = [token.text for token in doc]
        tags_str = ' '.join([token.pos_ for token in doc])
        tokens_index_dict = get_tokens_index_dict(tags_str)
        matches = pattern.finditer(tags_str)
        
        for match in matches: 
            tags = [elem for elem in match.group(0).split()]
            
            kw = [] 
            index = match.start()
            
            for i, tag in enumerate(tags): 
                kw_token = clean_token(tokens[tokens_index_dict[index]])
                index += len(tag) + 1
                
                if kw_token in (',', ';') and tags[i - 1] in ('PROPN', 'NOUN') and len(kw) != 0:
                    kw = ' '.join(kw)

                    if not is_messy(kw):
                        kws_sent_ids.add((kw, sent_id))

                    kw = []
                
                # drop the adjectives collected before a stop-word noun; otherwise many keywords would be adjectives alone 
                elif kw_token in stop_words and tag in ('PROPN', 'NOUN') and tags[i - 1] == 'ADJ': 
                    kw = []
                    
                elif kw_token not in (',', ';') and kw_token not in stop_words and len(kw_token) != 0:
                    kw.append(kw_token)

                    
            if len(kw) != 0:
                kw = ' '.join(kw)

                if not is_messy(kw):
                    kws_sent_ids.add((kw, sent_id))
    
    return kws_sent_ids

In [555]:
def get_kws_sent_ids_group(kws_sent_ids):
    res={}
    
    for kw_sent_id in kws_sent_ids:
        if kw_sent_id[0] not in res:
            res[kw_sent_id[0]] = [kw_sent_id[1]]
        else:
            res[kw_sent_id[0]].append(kw_sent_id[1])
            
    return res

In [556]:
kws_sent_ids = get_kws_sent_ids(df_sents[:1000])
kws_sent_ids_group = get_kws_sent_ids_group(kws_sent_ids)

In [343]:
# 1000 sents > 1562 
# 100 sents > 243 

In [559]:
pprint(kws_sent_ids_group)

{'2d': [4718],
 '2d dark souls': [4724, 3720, 1967, 2384, 4948],
 '2d dark souls vibe': [1611],
 '2d feel': [5219],
 '2d metroidvania': [3566],
 '2d pixelated art style': [2707],
 '2d sidescrolling metroidvania style': [4839],
 '2nd try': [1091],
 'aaa': [1827],
 'abilities': [517, 399, 2767, 3063],
 'ability': [1934, 1754, 314, 3115, 1759, 5236],
 'access': [345],
 'accounts': [4280],
 'accuracy': [6270],
 'achievement': [6069, 5173],
 'achievementhunting': [3896],
 'achievements': [2183, 534, 3920, 5623, 3830],
 'achivements': [2549],
 'action': [4491],
 'action adventure': [5538],
 'action platform': [2913],
 'action platformer': [1800, 1905, 5190, 1969],
 'action platformer itch': [3128],
 'action platformers': [2673, 3955],
 'active item slot': [3481],
 'active item slots': [3481, 3480, 4460],
 'active items': [5710],
 'actives': [4460],
 'actual achievements': [2440],
 'actual metroid': [4807],
 'actual shards': [1558],
 'addition': [804],
 'additional challanges': [3830],
 'ador

 'fate': [5550, 2758],
 'favorites': [423],
 'favourite boss fights': [2658],
 'favourite elements': [1375],
 'feature': [5272],
 'females': [1997],
 'fennel': [5580],
 'fidget animations': [3333],
 'fight': [1384, 4461, 5648, 5943, 6376, 4431],
 'fighting': [4587, 3312],
 'fighting bosses': [3491],
 'fights': [4465],
 'filthy casuals': [3597],
 'final boss': [5696,
                1764,
                5242,
                4486,
                1091,
                3583,
                4859,
                3660,
                2199,
                5580,
                4720,
                5695,
                3652,
                2278,
                5116,
                821,
                3578,
                1546,
                4145,
                490,
                1470,
                3738,
                801,
                5549,
                3021,
                3654,
                5875,
                5550,
                1553,
                36

 'publisher': [3024],
 'punishing dark souls feel': [2758],
 'purpose': [5701],
 'puzzles': [5370, 1375],
 'quality': [2946, 4590, 1037],
 'quality pixel art animations': [2874],
 'quarrel': [3876],
 'questionable design choices': [2545],
 'questions': [1670],
 'quick experience': [2168],
 'rabi ribi': [33],
 'radar': [2622, 6349],
 'rage': [5786],
 'random spot': [1384],
 'randomness': [2757],
 'range': [3456],
 'range attacks': [2760],
 'ranged attacks': [1934],
 'rdein': [5239],
 'reaching new heights': [4491],
 'reactive controls': [3144],
 'real aerial combos': [1552],
 'real letdown': [3738],
 'real reason': [5116],
 'real treat': [2102],
 'real upgrades': [341],
 'reason': [5200, 1408, 1048],
 'reasonable price': [3691],
 'redeeming aspect': [3327],
 'regards': [3480, 3468],
 'regrets': [4932],
 'repeat playthroughs': [5623],
 'replay value': [2102,
                  6069,
                  4490,
                  5301,
                  4482,
                  5620,
           

In [560]:
kws = list(kws_sent_ids_group.keys())

In [561]:
kw_embeddings = embedder.encode(kws)

# Normalize the embeddings to unit length
kw_embeddings = kw_embeddings /  np.linalg.norm(kw_embeddings, axis=1, keepdims=True)

In [562]:
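# perform agglomerative clustering
# note: scikit-learn 1.2 renamed `affinity` to `metric` and 1.4 removed `affinity`; use metric='cosine' there
clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=0.6)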
clustering_model.fit(kw_embeddings)
cluster_assignment = clustering_model.labels_

In [563]:
kw_clusters = {}
for kw_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in kw_clusters:
        kw_clusters[cluster_id] = []

    kw = kws[kw_id]
    kw_clusters[cluster_id].append((kw, kws_sent_ids_group[kw]))

In [564]:
print(len(kw_clusters))

306


In [None]:
# with conn:
#     cursor.executemany("""INSERT INTO clusters (game_id, cluster_num, cluster_name) VALUES 
#                         (?, ?, ?);""", cluster_tuples)    
    
# with conn:
#     cursor.executemany("""INSERT INTO kws (kw, freq, cluster_id) VALUES 
#                     (?, ?, (SELECT cluster_id FROM clusters WHERE cluster_num=? AND game_id=?));""",
#                    kw_tuples)

# with conn:
#     cursor.executemany("""INSERT INTO clusters_sents  (cluster_id, sent_id) VALUES 
#                         (?, ?);""", cluster_sent_tuples)

In [431]:
# pprint(kw_clusters)

In [565]:
kw_clusters

{170: [('music',
   [3714,
    2384,
    5239,
    1205,
    5797,
    6101,
    5483,
    2398,
    2645,
    3846,
    3822,
    532,
    2531,
    2567,
    550,
    2584,
    3870,
    6195,
    585,
    4754,
    2953,
    415,
    6260,
    2565,
    2346,
    990,
    4940,
    1018,
    1082,
    2126,
    2022,
    2658,
    1291,
    1771,
    4564,
    3557,
    1680,
    4098,
    6342,
    202,
    506]),
  ('sound design',
   [4477, 5344, 2126, 5050, 1772, 2693, 585, 2398, 3095, 3822]),
  ('special sound effect', [5943]),
  ('great music',
   [3005, 3925, 4826, 3128, 5306, 5215, 572, 5873, 4213, 3169, 41]),
  ('extra sounds', [5963]),
  ('pretty sound track', [6367]),
  ('sound', [2167, 3105, 2713]),
  ('sound effects', [646, 2713, 415, 48, 2531, 2224, 746, 5963, 4754, 2323]),
  ('great soundtrack',
   [3889, 2514, 5358, 423, 2736, 682, 5457, 5837, 4055, 1771, 5372, 1160]),
  ('soundtrack', [5194, 746, 3137, 5542]),
  ('similar sound', [4452]),
  ('resonant music', [152])

In [600]:
# get the list of biggest clusters 
def get_biggest_clusters(kw_clusters, top_n=50):
    cluster_nums_sizes = []

    for cluster_num, kws_sents in kw_clusters.items():
        sents_num = sum([len(kw_sents[1]) for kw_sents in kws_sents])
        cluster_nums_sizes.append((cluster_num, sents_num))

    cluster_nums_sizes_sorted = sorted(cluster_nums_sizes, key=lambda x: x[1], reverse=True)
    return set([cluster_num_size[0] for cluster_num_size in cluster_nums_sizes_sorted[:top_n]])

In [603]:
biggest_clusters = get_biggest_clusters(kw_clusters)

In [604]:
game_id = 1

cluster_tuples = []
kw_tuples = []
clusters_sents = []

for cluster_num, kws_sent_ids in kw_clusters.items(): 
    
    if cluster_num in biggest_clusters:
        kws_sent_ids_sorted = sorted(kws_sent_ids, key=lambda x: len(x[1]), reverse=True)
        cluster_name = kws_sent_ids_sorted[0][0]
        cluster_tuples.append((game_id, cluster_num, cluster_name))

        sents_num = sum([len(kw_sent_ids[1]) for kw_sent_ids in kws_sent_ids])

        for kw_sent_ids in kws_sent_ids: 
            kw = kw_sent_ids[0]
            sent_ids = kw_sent_ids[1]
            freq = len(kw_sent_ids[1])
            kw_tuples.append((kw, freq, cluster_num, game_id))

            for sent_id in sent_ids: 
                clusters_sents.append((sent_id, cluster_num, game_id))

In [611]:
print(cluster_tuples[0])
print(kw_tuples[0])
print(clusters_sents[0])

(1, 170, 'music')
('music', 41, 170, 1)
(3714, 170, 1)


In [474]:
print(cluster_tuples[0])
print(kw_tuples[0])
print(clusters_sents[0])

(1, 130, 'music')
('music', 41, 130, 1)
(3714, 130, 1)


In [None]:
with conn:
    cursor.executemany("""INSERT INTO clusters (game_id, cluster_num, cluster_name) VALUES 
                        (?, ?, ?);""", cluster_tuples)    
    
with conn:
    cursor.executemany("""INSERT INTO kws (kw, freq, cluster_id) VALUES 
                    (?, ?, (SELECT cluster_id FROM clusters WHERE cluster_num=? AND game_id=?));""",
                   kw_tuples)

# clusters_sents holds (sent_id, cluster_num, game_id) tuples built in the cell above
with conn:
    cursor.executemany("""INSERT INTO clusters_sents  (sent_id, cluster_id) VALUES 
                        (?, (SELECT cluster_id FROM clusters WHERE cluster_num=? AND game_id=?));""",
                       clusters_sents)

In [None]:
# cluster_tuples = []
# kw_tuples = []

# for game_id in df_sent_prep['game_id'].unique():
#     sents_prep = df_sent_prep[df_sent_prep['game_id'] == game_id]['sent_prep']
    
#     bigram_freq = get_ngram(sents_prep, 2, 3)
#     bigram_df = pd.DataFrame(bigram_freq, columns=['bigram', 'freq'])
    
#     bigram_df_50 = []
#     count = 0
    
#     for bigram, freq in zip(bigram_df['bigram'], bigram_df['freq']): 
#         if bigram_rules(bigram):
#             bigram_df_50.append((bigram, freq))

#             count += 1

#             if count == 50: 
#                 break
                
#     bigram_df = pd.DataFrame(bigram_df_50, columns=['bigram', 'freq'])
    
#     kws = bigram_df['bigram']
#     kw_embeddings = embedder.encode(kws)
    
#     # Normalize the embeddings to unit length
#     kw_embeddings = kw_embeddings /  np.linalg.norm(kw_embeddings, axis=1, keepdims=True)
    
#     # perform agglomerative clustering
#     clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=0.6)
#     clustering_model.fit(kw_embeddings)
#     cluster_assignment = clustering_model.labels_
    
#     kw_clusters = {}
#     for kw_id, cluster_id in enumerate(cluster_assignment):
#         if cluster_id not in kw_clusters:
#             kw_clusters[cluster_id] = []

#         kw_clusters[cluster_id].append(kws[kw_id])
        
#     bigram_df['cluster_num'] = cluster_assignment
    
#     for cluster_num, cluster_val in kw_clusters.items():
#         cluster_tuples.append((int(game_id), int(cluster_num), cluster_val[0]))
    
#     for kw, freq, cluster_num in zip(bigram_df['bigram'], bigram_df['freq'], bigram_df['cluster_num']):
#         kw_tuples.append((kw, freq, int(cluster_num), int(game_id)))

In [None]:
# with conn:
#     cursor.executemany("""INSERT INTO clusters (game_id, cluster_num, cluster_name) VALUES 
#                         (?, ?, ?);""", cluster_tuples)    
    
# with conn:
#     cursor.executemany("""INSERT INTO kws (kw, freq, cluster_id) VALUES 
#                     (?, ?, (SELECT cluster_id FROM clusters WHERE cluster_num=? AND game_id=?));""",
#                    kw_tuples)

### Step 4: 
- Loop through **sent_prep** in table **sents**, fuzzy-match each **kw** in table **kws**. 
    - Insert **cluster_id**, **sent_id** in table **clusters_sents** to link table **clusters** and **sents**.
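
The commented-out cells below use `find_near_matches(kw, sent_prep, max_l_dist=1)` from `fuzzysearch`. As a rough illustration of what such a match means, here is a standard-library sketch that tests whether a keyword occurs in a sentence within a given Levenshtein distance. It is a stand-in written for this note, not the library's implementation, and `has_near_match`, `link_clusters_to_sents` and the sample data are hypothetical.

```python
def has_near_match(pattern, text, max_l_dist=1):
    """True if some substring of text is within max_l_dist edits of pattern."""
    # prev[j] = fewest edits aligning the pattern prefix seen so far with a
    # substring of text that ends at position j; row 0 is all zeros because
    # a match may start anywhere in the text.
    prev = [0] * (len(text) + 1)
    for i, p_char in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, t_char in enumerate(text, start=1):
            cost = 0 if p_char == t_char else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # pattern char missing from text
                          curr[j - 1] + 1)     # extra char in text
        prev = curr
    return min(prev) <= max_l_dist


def link_clusters_to_sents(kw_clusters, sents_prep, max_l_dist=1):
    """Return (cluster_id, sent_id) pairs, at most one per cluster and sentence."""
    pairs = []
    for sent_id, sent_prep in sents_prep.items():
        for cluster_id, kws in kw_clusters.items():
            if any(has_near_match(kw, sent_prep, max_l_dist) for kw in kws):
                pairs.append((cluster_id, sent_id))
    return pairs


# hypothetical keywords per cluster and preprocessed sentences
kw_clusters = {1: ['boss fight', 'final boss'], 2: ['soundtrack', 'music']}
sents_prep = {10: 'final bos fight hard', 11: 'love soundtrak', 12: 'nice art style'}
cluster_sent_pairs = link_clusters_to_sents(kw_clusters, sents_prep)
print(cluster_sent_pairs)
```

A distance of 1 tolerates a single typo such as `soundtrak`, at the cost of occasional false matches on very short keywords.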

In [None]:
# df_sent_prep = pd.read_sql_query("""SELECT game_id, sent_prep, sent_id
#                     FROM sents JOIN reviews USING(review_id) JOIN games USING(game_id);""", conn)

# df_kw = pd.read_sql_query("""SELECT game_id, cluster_id, kw 
#                     FROM kws LEFT JOIN clusters USING(cluster_id);""", conn)

In [None]:
# cluster_sent_tuples = []

# for game_id in df_sent_prep['game_id'].unique():
#     sents_prep = df_sent_prep[df_sent_prep['game_id'] == game_id]
#     kws = df_kw[df_kw['game_id'] == game_id]
    
#     kw_clusters = {}
#     for kw, cluster_id in zip(kws['kw'], kws['cluster_id']):
#         if cluster_id not in kw_clusters:
#             kw_clusters[cluster_id] = []

#         kw_clusters[cluster_id].append(kw)
        
#     for sent_id, sent_prep in zip(sents_prep['sent_id'], sents_prep['sent_prep']):
#         for cluster_id, kws in kw_clusters.items():
#             for kw in kws:
#                 matches = find_near_matches(kw, sent_prep, max_l_dist=1)

#                 if len(matches) != 0: 
#                     cluster_sent_tuples.append((cluster_id, sent_id))
#                     break

In [None]:
# with conn:
#     cursor.executemany("""INSERT INTO clusters_sents  (cluster_id, sent_id) VALUES 
#                         (?, ?);""", cluster_sent_tuples)

### Step 5:
- Delete every row in table **sents** whose **sent_id** does not appear in table **clusters_sents**.
    - Insert **score_flair**, **score_vader**, **recommended**, **score_total**, **sent_embedding** in table **sents**.
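
The pruning in this step can be tried out on a throwaway in-memory database first. The sketch below assumes a reduced, hypothetical schema with only the columns needed for the example (the real tables carry more columns) and made-up rows. It shows that sentences not referenced by **clusters_sents** are removed.

```python
import sqlite3

# in-memory database, so the real data is never touched
conn_demo = sqlite3.connect(":memory:")
cur = conn_demo.cursor()
cur.execute("CREATE TABLE sents (sent_id INTEGER PRIMARY KEY, sent TEXT);")
cur.execute("CREATE TABLE clusters_sents (cluster_id INTEGER, sent_id INTEGER);")
cur.executemany("INSERT INTO sents (sent_id, sent) VALUES (?, ?);",
                [(1, "Great pixel art."), (2, "I bought it on sale."), (3, "Hard boss fights.")])
cur.executemany("INSERT INTO clusters_sents (cluster_id, sent_id) VALUES (?, ?);",
                [(3, 1), (2, 3), (5, 3)])

# delete every sentence that is not linked to any cluster
with conn_demo:
    cur.execute("""
        DELETE FROM sents
        WHERE sent_id NOT IN (SELECT sent_id FROM clusters_sents);
    """)

remaining = [row[0] for row in cur.execute("SELECT sent_id FROM sents ORDER BY sent_id;")]
print(remaining)
```

One caveat of `NOT IN`: if `clusters_sents.sent_id` could ever hold a NULL, the condition would match no rows at all; a `NOT EXISTS` subquery avoids that.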

In [None]:
# remove all sentences that contain no keyword from table sents
with conn: 
    cursor.execute("""
        DELETE FROM sents 
        WHERE sent_id NOT IN (
            SELECT DISTINCT sent_id
            FROM clusters_sents);
    """)

In [None]:
df_sent = pd.read_sql_query("""SELECT sent_id, sent, recommended
                                FROM sents JOIN reviews USING(review_id);""", conn)

In [None]:
df_sent

In [None]:
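def get_score_flair(sent, threshold=0.9): 
    """Return 1 (positive), -1 (negative) or 0 (neutral) for a sentence."""
    sent_flair = flair.data.Sentence(sent)
    sentiment_model_fast.predict(sent_flair)
    
    # predicted label and the model's confidence in it
    value_flair = sent_flair.labels[0].value
    score_flair = sent_flair.labels[0].score
    
    # sign the confidence: positive labels stay positive, anything else becomes negative
    value_flair = 1 if value_flair == 'POSITIVE' else -1
    score_flair = value_flair*score_flair
    
    # predictions whose confidence does not exceed the threshold count as neutral
    return 1 if score_flair > threshold else -1 if score_flair < -threshold else 0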
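
The mapping from a signed confidence to the three sentiment classes can be checked in isolation. The sketch below assumes only that the classifier returns a label (`'POSITIVE'` or something else) and a confidence between 0 and 1. `to_senti` is a hypothetical helper and does not call flair.

```python
def to_senti(label, confidence, threshold=0.9):
    """Map (label, confidence) to 1 (positive), -1 (negative) or 0 (neutral)."""
    signed = confidence if label == "POSITIVE" else -confidence
    if signed > threshold:
        return 1
    if signed < -threshold:
        return -1
    return 0


# made-up confidences, for illustration only
results = [to_senti("POSITIVE", 0.97), to_senti("NEGATIVE", 0.95),
           to_senti("POSITIVE", 0.6), to_senti("NEGATIVE", 0.9)]
print(results)
```

The comparisons are strict, so a confidence exactly equal to the threshold falls into the neutral class.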

In [None]:
sentis = df_sent['sent'].map(get_score_flair) 

In [None]:
sent_embeddings = embedder.encode(df_sent['sent'])

sent_embeddings = sent_embeddings /  np.linalg.norm(sent_embeddings, axis=1, keepdims=True)

sent_embeddings_str = [pickle.dumps(sent_embedding) for sent_embedding in sent_embeddings]

In [None]:
sent_tuples = []

for senti, sent_embedding_str, sent_id in zip(sentis, sent_embeddings_str, df_sent['sent_id']): 
    sent_tuples.append((senti, sent_embedding_str, sent_id))     

In [None]:
with conn:
    cursor.executemany("""UPDATE sents SET (senti, sent_embedding) = (?, ?)
                        WHERE sent_id=?;""", sent_tuples)

In [None]:
df_sent_embedding = pd.read_sql_query("""SELECT sent_embedding FROM sents;""", conn)
sent_embeddings = np.array([pickle.loads(sent_embedding) for sent_embedding in df_sent_embedding['sent_embedding']])

In [475]:
df_sent = pd.read_sql_query("""SELECT cluster_id, sent_id, sent, sent_embedding, senti, cluster_name, game_id
                                FROM clusters_sents LEFT JOIN(sents) USING(sent_id)
                                                    LEFT JOIN(clusters) USING(cluster_id);""", conn)

In [12]:
def decode_sent_embeddings(sent_embeddings):
    return np.array([pickle.loads(sent_embedding) for sent_embedding in sent_embeddings])

In [None]:
game_id = 1

In [477]:
df_sent = df_sent[df_sent['game_id'] == game_id]

In [479]:
df_sent

Unnamed: 0,cluster_id,sent_id,sent,sent_embedding,senti,cluster_name,game_id
0,1,1,Metroidvania with some influences from Dark So...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0,dark soul,1
1,11,3,I had to check some guide to see the true ending.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0,true ending,1
2,16,21,A short but sweet metroidvania that is worth y...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,worth price,1
3,9,25,It's a 2-d action platformer with a simple yet...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,action platformer,1
4,20,25,It's a 2-d action platformer with a simple yet...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,range attack,1
...,...,...,...,...,...,...,...
4188,2,22073,From a metroidvania perspective the gameplay h...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,-1,boss fight,1
4189,3,22080,"Con un diseo ""pixel-art"", escenarios en 2D, y ...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,pixel art,1
4190,1,22085,One of the first wave of dark souls inspired m...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,1,dark soul,1
4191,6,22086,I rate it much higher than Hollow Knight.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...,0,hollow knight,1


In [478]:
df_sent_1 = df_sent[['senti', 'cluster_name']].groupby(['cluster_name', 'senti']).size().reset_index(name='count')

df_sent_1['sum_count'] = df_sent_1['count'].groupby(df_sent_1['cluster_name']).transform('sum')

df_sent_1 = df_sent_1.sort_values(by=['sum_count', 'senti'], ascending=[False, True])

df_sent_1

Unnamed: 0,cluster_name,senti,count,sum_count
6,dark soul,-1,77,675
7,dark soul,0,299,675
8,dark soul,1,299,675
51,pixel art,-1,22,606
52,pixel art,0,64,606
...,...,...,...,...
38,long time,1,24,40
41,main character,1,19,40
48,passive item,-1,10,39
49,passive item,0,19,39


In [None]:
colors_dict  = {
    1:'#26de81',
    0:'#fed330',
    -1:'#fc5c65'
}
names_dict = {1: 'positive', 0: 'neutral', -1: 'negative'}

# one bar trace per sentiment class, stacked per cluster
fig = go.Figure()
for senti in (1, 0, -1):
    df_senti = df_sent_1[df_sent_1['senti'] == senti]
    fig.add_trace(go.Bar(name=names_dict[senti], x=df_senti['cluster_name'],
                         y=df_senti['count'], marker_color=colors_dict[senti]))

# Change the bar mode
fig.update_layout(barmode='stack')
fig.show()

In [480]:
df_sent_cluster = df_sent[(df_sent['cluster_name'] == 'dark soul') & (df_sent['senti'] == -1)]

In [None]:
for sent in df_sent_cluster['sent']: 
    print(f'>>> {sent}')

In [481]:
sents = df_sent_cluster['sent'].reset_index(drop = True)
sent_embeddings = decode_sent_embeddings(df_sent_cluster['sent_embedding'])

In [482]:
sents

0     The Bad: low monster variety, combat is simpli...
1     The difficulty is a bit extreme for me, the co...
2                             It's not Dark Souls hard.
3     Difficulty wise, I'd put it about the same as ...
4     The music and the setting try to be Dark Souls...
                            ...                        
72    All of these things exist outside of Dark Soul...
73    This game feels like the lovechild of Dark Sou...
74        Not like Dark Souls (someone else said that).
75    1 and 2 were Cave Story-esque without the focu...
76    I never really got into Dark Souls, and I thin...
Name: sent, Length: 77, dtype: object

In [None]:
# # PCA with 3 dimensions
# pca = PCA(n_components=3)
# components = pca.fit_transform(sent_embeddings)

# total_var = pca.explained_variance_ratio_.sum() * 100

In [None]:
# fig = px.scatter_3d(
#     components, x=0, y=1, z=2, color=clustering_model.labels_.astype(str),
#     title=f'Total Explained Variance: {total_var:.2f}%',
#     labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}, 
#     hover_data=[sents],
# )
# fig.show()

In [8]:
df_sent = pd.read_sql_query("""SELECT sent, sent_embedding
                                FROM sents JOIN reviews USING(review_id)
                                WHERE game_id=1;""", conn)

In [9]:
df_sent

Unnamed: 0,sent,sent_embedding
0,Metroidvania with some influences from Dark So...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...
1,I had to check some guide to see the true ending.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...
2,A short but sweet metroidvania that is worth y...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...
3,It's a 2-d action platformer with a simple yet...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...
4,"Other than that, there isn't a ton of replay v...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...
...,...,...
3769,From a metroidvania perspective the gameplay h...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...
3770,"Con un diseo ""pixel-art"", escenarios en 2D, y ...",b'\x80\x03cnumpy.core.multiarray\n_reconstruct...
3771,One of the first wave of dark souls inspired m...,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...
3772,I rate it much higher than Hollow Knight.,b'\x80\x03cnumpy.core.multiarray\n_reconstruct...


In [13]:
sents = df_sent['sent'].reset_index(drop = True)
sent_embeddings = decode_sent_embeddings(df_sent['sent_embedding'])
n_sents = 5

In [15]:
sent_embeddings

array([[-0.04324087,  0.02180931,  0.01068075, ...,  0.03434775,
        -0.04854265, -0.00114909],
       [-0.03150932, -0.00418032,  0.04931263, ...,  0.06524638,
        -0.02565461,  0.02001331],
       [-0.05950123,  0.02035298,  0.03827898, ..., -0.00917198,
        -0.0430759 ,  0.03698024],
       ...,
       [-0.03187256, -0.01025104, -0.00273902, ...,  0.06177759,
        -0.06792822,  0.01506269],
       [ 0.04929354, -0.02860888, -0.03428772, ..., -0.12428585,
        -0.01833793, -0.07858311],
       [-0.10368083, -0.00016783,  0.05353399, ..., -0.01143   ,
        -0.02192041,  0.03859407]], dtype=float32)

In [11]:
def get_centroid(arr):
    # component-wise mean of the row vectors in arr
    return arr.mean(axis=0)

# return index of the vectors in corpus_embeddings nearest to the centroid
def get_nearest_indexes(centroids, corpus_embeddings):
    return vq(centroids, corpus_embeddings)[0]
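
What the two helpers compute can be seen on a tiny example. This is a dependency-free sketch using plain lists and made-up 2-D vectors; `centroid` and `nearest_index` are hypothetical stand-ins for `get_centroid` and the `vq` lookup, assuming Euclidean distance.

```python
def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def nearest_index(point, vectors):
    """Index of the vector with the smallest squared Euclidean distance to `point`."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(point, v))
    return min(range(len(vectors)), key=lambda i: sq_dist(vectors[i]))


vectors = [[0.0, 0.0], [4.0, 0.0], [2.0, 1.0], [2.0, 5.0]]
c = centroid(vectors)            # mean of the four points
idx = nearest_index(c, vectors)  # the point that best represents the group
print(c, idx)
```

In a summary, the sentence at `idx` is the one reported for the cluster, since the centroid itself does not correspond to any real sentence.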

In [254]:
def kmeans_sum(sents, sent_embeddings, n_sents=5):
    # cluster the sentences into n_sents groups
    clustering_model = KMeans(n_clusters=n_sents, random_state=0)
    clustering_model.fit(sent_embeddings)
    centroids = clustering_model.cluster_centers_

    # for each centroid, pick the sentence whose embedding lies closest to it
    indexes = get_nearest_indexes(centroids, sent_embeddings)

    return [sents[index] for index in indexes]

In [228]:
# kmeans_sum(sents, sent_embeddings, 10)

In [351]:
def agglo_cluster(sents, sent_embeddings, n_sents, threshold=0.5): 
    # note: scikit-learn >= 1.2 renames `affinity` to `metric` (`affinity` was removed in 1.4)
    clustering_model = AgglomerativeClustering(n_clusters=None, affinity='cosine', linkage='average', distance_threshold=threshold)
    clustering_model.fit(sent_embeddings)
    cluster_assignment = clustering_model.labels_

    clustered_sents = {}
    clustered_sent_embeddings = {}

    for sent_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in clustered_sents:
            clustered_sents[cluster_id] = []
            clustered_sent_embeddings[cluster_id] = []

        clustered_sents[cluster_id].append(sents[sent_id])
        clustered_sent_embeddings[cluster_id].append(sent_embeddings[sent_id])
    
    # retry with a tighter threshold until there are at least n_sents clusters
    if len(clustered_sents) >= n_sents:
        return cluster_assignment, clustered_sents, clustered_sent_embeddings
    return agglo_cluster(sents, sent_embeddings, n_sents, threshold=threshold - 0.1)
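
The threshold-relaxation idea used by `agglo_cluster` can be illustrated on its own, written as a loop instead of recursion. The sketch below is an assumption-laden toy: `gap_cluster` is a hypothetical 1-D clustering that stands in for agglomerative clustering, and the values are made up. Only the "lower the threshold until enough clusters appear" logic carries over.

```python
def gap_cluster(values, threshold):
    """Toy 1-D clustering: neighbouring sorted values closer than
    `threshold` end up in the same cluster."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] < threshold:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def cluster_at_least(values, n_min, threshold=0.5, step=0.1):
    """Lower `threshold` by `step` until at least `n_min` clusters appear."""
    if n_min > len(values):
        raise ValueError("cannot form more clusters than there are values")
    while True:
        clusters = gap_cluster(values, threshold)
        if len(clusters) >= n_min:
            return threshold, clusters
        # a threshold <= 0 puts every value in its own cluster, so the loop ends
        threshold -= step


threshold, clusters = cluster_at_least([0.0, 0.15, 0.5, 0.85, 1.0], n_min=3)
print(threshold, clusters)
```

The guard on `n_min` plays the same role as the length check in `agglo_sum`: without it, no threshold could ever produce enough clusters.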

In [352]:
def hybrid_cluster(sents, sent_embeddings, n_sents): 
    clustering_model = hdbscan.HDBSCAN(min_cluster_size=2, cluster_selection_method='leaf')
    clustering_model.fit(sent_embeddings)
    cluster_assignment = clustering_model.labels_

    clustered_sents = {}
    clustered_sent_embeddings = {}

    for sent_id, cluster_id in enumerate(cluster_assignment):
        if cluster_id not in clustered_sents:
            clustered_sents[cluster_id] = []
            clustered_sent_embeddings[cluster_id] = []

        clustered_sents[cluster_id].append(sents[sent_id])
        clustered_sent_embeddings[cluster_id].append(sent_embeddings[sent_id])
    
    # remove unidentified cluster -1
    clustered_sents = {k: v for k, v in clustered_sents.items() if k != -1}
    
    if len(clustered_sents) >= n_sents:
        cluster_assignment = [e for e in cluster_assignment if e != -1]
        clustered_sent_embeddings = {k: v for k, v in clustered_sent_embeddings.items() if k != -1}
        return cluster_assignment, clustered_sents, clustered_sent_embeddings
    else: 
        return agglo_cluster(sents, sent_embeddings, n_sents)

In [353]:
def sort_clusters(cluster_assignment):
    cluster_nums, counts = np.unique(cluster_assignment, return_counts=True)
    cluster_nums_counts = list(zip(cluster_nums, counts))
    
    return [cluster_num for cluster_num, _ in sorted(cluster_nums_counts, key=lambda x: x[1], reverse=True)]

In [354]:
# generate summary
def gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents):
    # largest clusters first; take one representative sentence from each
    cluster_nums = sort_clusters(cluster_assignment)

    sents_sum = []

    for cluster_num in cluster_nums[:n_sents]:
        embeddings = np.array(clustered_sent_embeddings[cluster_num])
        centroid = np.array([get_centroid(embeddings)])

        # the sentence nearest to the cluster centroid represents the cluster
        index = get_nearest_indexes(centroid, embeddings)[0]
        sents_sum.append(clustered_sents[cluster_num][index])

    return sents_sum

In [367]:
def agglo_sum(sents, sent_embeddings, n_sents=5, threshold=0.5):
    if len(sents) < 2 or n_sents > len(sents): 
        return sents
    
    cluster_assignment, clustered_sents, clustered_sent_embeddings = agglo_cluster(sents, sent_embeddings, n_sents, threshold)
    return gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents)   

In [368]:
def hybrid_sum(sents, sent_embeddings, n_sents=5): 
    if len(sents) < 2 or n_sents >= len(sents): 
        return sents
    
    cluster_assignment, clustered_sents, clustered_sent_embeddings = hybrid_cluster(sents, sent_embeddings, n_sents)
    return gen_sum(cluster_assignment, clustered_sents, clustered_sent_embeddings, n_sents)

In [380]:
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.',
          'I hate the whole universe.',
          'The car is badly broken.',
          'Summer is the hottest season.'
          ]

corpus_embeddings = embedder.encode(corpus)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings /  np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
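
Why the normalization matters: for unit-length vectors the cosine similarity is just the dot product, and the squared Euclidean distance equals 2 - 2·cos, so Euclidean-based steps (KMeans, nearest-centroid lookup) rank neighbours the same way cosine similarity would. A small dependency-free check with made-up vectors (`normalize` and `dot` are hypothetical helpers):

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos_sim = dot(a, b)                                 # cosine similarity of unit vectors
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean distance
print(round(cos_sim, 6), round(sq_dist, 6), round(2 - 2 * cos_sim, 6))
```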

In [381]:
agglo_sum(corpus, corpus_embeddings, 5)

['A man is eating food.',
 'The girl is carrying a baby.',
 'Someone in a gorilla costume is playing a set of drums.',
 'A cheetah is running behind its prey.',
 'A man is riding a white horse on an enclosed ground.']

In [382]:
hybrid_sum(corpus, corpus_embeddings, 5)

['A man is eating food.',
 'The girl is carrying a baby.',
 'A man is riding a white horse on an enclosed ground.',
 'Someone in a gorilla costume is playing a set of drums.',
 'A cheetah is running behind its prey.']

In [638]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.sum_basic import SumBasicSummarizer

In [630]:
doc_big = ' '.join(sents)
# build a sumy document from the joined plain-text string
parser = PlaintextParser.from_string(doc_big, Tokenizer("english"))

In [632]:
# Using LexRank
summarizer = LexRankSummarizer()
# Summarize the document with 5 sentences
summary = summarizer(parser.document, 5)

for sentence in summary:
    print(sentence)

2D Dark Souls.
Dark Souls.
Dark Souls but not Dark Souls.
It's like Dark Souls.
The Dark Souls of 2D.


In [633]:
summarizer_1 = LuhnSummarizer()
summary_1 = summarizer_1(parser.document, 5)
for sentence in summary_1:
    print(sentence)

This game is make but 1 guy but the 1st game was like cave story mod (he said it was mod) and the the second game was different and 1,2 is on itch but with a different name and free then a third game he change his name (i think pze comment on this if i was incorrect) and it's on steam same name of the game's and this is the fourth with a different game and the game it have 2 ending's (i did the good ending by being me) and 1,2,3 game's look cute and dark at the same time but this game is dark pf the darkness and cute the maker of this game's you can find all the game's he did on it's only 5 game's now.
Animations look great, but end up being super clunky in the combat the game sets out, some enemies have contact damage, some dont(even bosses choose randomly on this), some enemies that look exactly the same as others have entirely different attacks that need to to already be doing something that counters them, and sometimes the action that counters one, and the action that counters some

In [634]:
summarizer_2 = LsaSummarizer()
summary_2 = summarizer_2(parser.document, 5)
for sentence in summary_2:
    print(sentence)

Expect 6-8 hours of gameplay on your first run through, even with harder difficulties subsequent runs should take an hour or so since it doesn't change too much between.
After long season of Hollow Knight, Momodora: Reverie Under The Moonlight feels too short but gameplay is fun so you can always try on hard mode and hunt some achievement!
If you like tough as hell action games with c r a z y boss battles, rolling, double-jumping and (later) mid-air dodging.....metroidvania style progression.........beautiful graphics........smooth animation....uhhh???
The story itself isn't really deep, and I would have loved if we could learn just a little bit more about each character, but everything still plays out perfectly.
Now, that 'bad' list seems to offset the good by a large margin, but it is also very dependent on what I find annoying and unnecessary for modern action platformers.


In [637]:
summarizer_3 = TextRankSummarizer()
summary_3 = summarizer_3(parser.document, 5)
for sentence in summary_3:
    print(sentence)

This game is make but 1 guy but the 1st game was like cave story mod (he said it was mod) and the the second game was different and 1,2 is on itch but with a different name and free then a third game he change his name (i think pze comment on this if i was incorrect) and it's on steam same name of the game's and this is the fourth with a different game and the game it have 2 ending's (i did the good ending by being me) and 1,2,3 game's look cute and dark at the same time but this game is dark pf the darkness and cute the maker of this game's you can find all the game's he did on it's only 5 game's now.
Momodora: Reverie Under the Moonlight is a cutesy pixel Metroidvania game with few memorable boss battle with suitable background music while you hack and slash your way to the next boss battle, which you need the skill to dodge the attack because you'll be doing a lot of trial and error even tho the game is short but there is a few flaws on the game, the level design is alright but can 

In [644]:
summarizer_4 = SumBasicSummarizer()
summary_4 = summarizer_4(parser.document, 10)
for sentence in summary_4:
    print(f'> {sentence}')

> Fun metroidvania-style game.
> It manages it to make Dark Souls 2d.
> If you like action platformers this one is for you.
> Good Pixel art.
> 2D metroidvania with a darksouls feel.
> The music, the artstyle, the mechanics!
> and a great soundtrack.
> Very challenging, but lots of fun!
> Boss fights.
> I want more games in that style.


In [671]:
agglo_sum(sents, sent_embeddings, 10, 0.5)

['Very good metroidvania styled game.',
 'Has some challenging boss fights.',
 'Fantastic pixel art.',
 'It has some game mechanics just like Dark Souls .',
 'Great game with fantastic art and music.',
 'Dark Souls.',
 'So you have a melee attack and a ranged attack.',
 "I'm playing the game on normal difficulty and it's already quite challenging.",
 "It reminds me most of Hollow Knight, although it isn't quite as good as that game.",
 'I can say this game is worth of its full price.']