<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# CAPSTONE : Ubisoft's Skull & Bones

# Part 6 – Supervised Model

Our final step is to use supervised models to process the data and categorise the topic(s) of each review.

For this experiment, we will be using three different multi-label approaches:
1) One-vs-Rest Classifier
2) Classifier Chain
3) Label Powerset
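
To make the difference between the three approaches concrete, here is a rough, library-free sketch of how each one frames the prediction target. The four toy rows and all variable names are made up for illustration; the actual models used later come from scikit-learn and scikit-multilearn.

```python
# Toy multi-label data: each review can carry several topics (made-up rows).
labels = [
    ["pirate theme"],
    ["pirate theme", "music"],
    ["music"],
    ["pirate theme", "music"],
]
topics = sorted({topic for row in labels for topic in row})

# One-vs-rest: one independent binary target per topic.
binary_targets = {
    topic: [int(topic in row) for row in labels] for topic in topics
}

# Classifier chain: the same binary targets, but during training the model
# for each topic also receives the earlier topics in the chain as features.
chain_order = topics
extra_features_for_last = [
    [binary_targets[t][i] for t in chain_order[:-1]]
    for i in range(len(labels))
]

# Label powerset: every distinct combination of topics becomes one class.
combos = [tuple(sorted(row)) for row in labels]
class_ids = {combo: i for i, combo in enumerate(sorted(set(combos)))}
powerset_target = [class_ids[combo] for combo in combos]

print(binary_targets)
print(extra_features_for_last)
print(powerset_target)
```

One-vs-rest ignores any relationship between topics, the chain passes earlier labels forward, and label powerset treats each combination as a single class, which can produce many rare classes when there are a lot of topics.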

In [82]:
# Importing all libraries used: 

import requests
import pandas as pd 
import numpy as np
from datetime import datetime
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk import ngrams, FreqDist
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from skmultilearn.model_selection import IterativeStratification
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB

from skmultilearn.problem_transform import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report #precision+recall+f1-score
from skmultilearn.problem_transform import LabelPowerset

## Importing Data

In [2]:
reviews = pd.read_csv('../data/output/top_1500_reviews_categorised.csv')
pd.set_option('display.max_columns', None)
reviews.head()

Unnamed: 0,categories,review
0,['movement'],the only game where i avoid fast travel
1,"['assassin theme', 'pirate theme']",this is the best assassins creed game and prob...
2,['no topic'],best game ever iv 'e been playing from day one...
3,['pirate theme'],shanties before panties
4,"['pirate theme', 'music']",best part of the game are the sea shanties low...


For the categories to be processed correctly, each entry has to be a Python list. Let's check what type the column currently holds.

In [3]:
type(reviews['categories'][1])

str

Each entry is currently stored as a string, so the apostrophes, commas, and brackets are all treated as part of the text.

To clean this up, we first need to remove the apostrophes and brackets.

In [4]:
def remove_apostrophes_and_brackets(text):
    text = re.sub(r"['\[\]]", '', text)
    return text

In [5]:
reviews['categories'] = reviews['categories'].apply(remove_apostrophes_and_brackets)

In [6]:
reviews.head()

Unnamed: 0,categories,review
0,movement,the only game where i avoid fast travel
1,"assassin theme, pirate theme",this is the best assassins creed game and prob...
2,no topic,best game ever iv 'e been playing from day one...
3,pirate theme,shanties before panties
4,"pirate theme, music",best part of the game are the sea shanties low...


Next, we convert each entry back into a list by splitting the string on every ', '.

In [7]:
# Split every string on ', ' in one step; this avoids row-by-row chained assignment
reviews['categories'] = reviews['categories'].str.split(', ')

reviews.head()

Unnamed: 0,categories,review
0,[movement],the only game where i avoid fast travel
1,"[assassin theme, pirate theme]",this is the best assassins creed game and prob...
2,[no topic],best game ever iv 'e been playing from day one...
3,[pirate theme],shanties before panties
4,"[pirate theme, music]",best part of the game are the sea shanties low...


In [8]:
type(reviews['categories'][1])

list

It is now registered as a list.
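
As an aside, the same string-to-list conversion could probably be done in one step with the standard library's `ast.literal_eval`, which parses the text as a Python literal. This is only a sketch of an alternative, not what the notebook does; the second string is a hypothetical label containing a comma, included to show where the two approaches would differ.

```python
import ast

# A value in the same shape as the raw 'categories' column
raw = "['assassin theme', 'pirate theme']"
parsed = ast.literal_eval(raw)
print(parsed, type(parsed))

# Hypothetical label that itself contains ', ': literal_eval keeps it whole,
# whereas stripping brackets and splitting on ', ' would cut it in two.
tricky = "['price, value']"
print(ast.literal_eval(tricky))
```

On the real DataFrame this would be applied to the raw column with `reviews['categories'].apply(ast.literal_eval)`, before any of the cleaning above.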

Now, as mentioned in the previous notebook, we have to combine and drop certain topics so that we have a narrower list of topics for our labelling.

To do this, we first split the categories list into one column per category, each holding a binary (1 or 0) indicator.

In [9]:
# Create a set of all unique categories
all_categories = set(category for categories_list in reviews['categories'] for category in categories_list)

# Initialize a dictionary to hold the boolean values for each category
category_columns = {category: [] for category in all_categories}

# Iterate through the DataFrame and populate the dictionary
for index, row in reviews.iterrows():
    categories_list = row['categories']
    for category in all_categories:
        if category in categories_list:
            category_columns[category].append(1)
        else:
            category_columns[category].append(0)

# Create new DataFrame with the boolean columns
category_df = pd.DataFrame(category_columns)

# Concatenate the original DataFrame and the new category DataFrame
result_df = pd.concat([reviews, category_df], axis=1)

result_df


Unnamed: 0,categories,review,enjoyment,audio,length,difficulity,game length,review.1,character,cinematic /art.,price,optimization,fun,matchmaking,music,full screen,customer support,others,status,assassin theme,cutscenes,multiplayer,uplay,worth buying,specs,gameplay,series,shanties,progression,entertainment value,pirate theme,ship gameplay,price / quality,graphics/art style,graphics,game time / length,animals,salt level,details,emotional,storyline,exploration,requirements,music /sound.,difficulty,combat,errors,pc requirements,soundtrack,windowed mode,bugs,naval,ubisoft connect,singleplayer,worth paying for,cloud mechanics,game time/length,grinding,treasurehunt,movement,story,game time,servers,craft,conclusion,stealth,no topic,grind,audience,replayability,frame rate
0,[movement],the only game where i avoid fast travel,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,"[assassin theme, pirate theme]",this is the best assassins creed game and prob...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,[no topic],best game ever iv 'e been playing from day one...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,[pirate theme],shanties before panties,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"[pirate theme, music]",best part of the game are the sea shanties low...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1474,"[storyline, pirate theme, animals]",just like all assassin creed games you go into...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1475,"[graphics, entertainment value, music, pirate ...",excellent graphics excellent game play i love ...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1476,[no topic],this is a fantastic game but for ubisoft to co...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1477,"[errors, combat]",bugs bugs everywhere ship fights are good tho,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
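
The same one-column-per-category table can also be produced with `MultiLabelBinarizer`, which is already imported above. The sketch below runs on a three-row toy frame rather than the real data, and the names `toy` and `indicator_df` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the 'categories' column (lists of labels)
toy = pd.DataFrame({
    "categories": [["movement"],
                   ["assassin theme", "pirate theme"],
                   ["no topic"]],
})

mlb = MultiLabelBinarizer()
indicators = mlb.fit_transform(toy["categories"])

# mlb.classes_ holds the column order (sorted label names)
indicator_df = pd.DataFrame(indicators, columns=mlb.classes_, index=toy.index)
print(indicator_df)
```

Unlike the set-based loop above, the column order here is deterministic, because `classes_` is sorted.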


First, let's take a look at the list of column names.

In [10]:
list(result_df.columns)

['categories',
 'review',
 'enjoyment',
 'audio',
 'length',
 'difficulity',
 'game length',
 'review',
 'character',
 'cinematic /art.',
 'price',
 'optimization',
 'fun',
 'matchmaking',
 'music',
 'full screen',
 'customer support',
 'others',
 'status',
 'assassin theme',
 'cutscenes',
 'multiplayer',
 'uplay',
 'worth buying',
 'specs',
 'gameplay',
 'series',
 'shanties',
 'progression',
 'entertainment value',
 'pirate theme',
 'ship gameplay',
 'price / quality',
 'graphics/art style',
 'graphics',
 'game time / length',
 'animals',
 'salt level',
 'details',
 'emotional',
 'storyline',
 'exploration',
 'requirements',
 'music /sound.',
 'difficulty',
 'combat',
 'errors',
 'pc requirements',
 'soundtrack',
 'windowed mode',
 'bugs',
 'naval',
 'ubisoft connect',
 'singleplayer',
 'worth paying for',
 'cloud mechanics',
 'game time/length',
 'grinding',
 'treasurehunt',
 'movement',
 'story',
 'game time',
 'servers',
 'craft',
 'conclusion',
 'stealth',
 'no topic',
 'grind',


Notice that there are two columns named 'review'. One holds the original review text we got from Steam; the other comes from the topic labelling, where 'review' was used as a category. Since two columns with the same name are awkward to work with, we will rename the original Steam 'review' column to 'reviews', with an s.

In [12]:
len(result_df.columns)

71

In [11]:
result_df.columns = ['categories',
 'reviews',   # This is the only value that is changing from the original columns list
 'enjoyment',
 'audio',
 'length',
 'difficulity',
 'game length',
 'review',
 'character',
 'cinematic /art.',
 'price',
 'optimization',
 'fun',
 'matchmaking',
 'music',
 'full screen',
 'customer support',
 'others',
 'status',
 'assassin theme',
 'cutscenes',
 'multiplayer',
 'uplay',
 'worth buying',
 'specs',
 'gameplay',
 'series',
 'shanties',
 'progression',
 'entertainment value',
 'pirate theme',
 'ship gameplay',
 'price / quality',
 'graphics/art style',
 'graphics',
 'game time / length',
 'animals',
 'salt level',
 'details',
 'emotional',
 'storyline',
 'exploration',
 'requirements',
 'music /sound.',
 'difficulty',
 'combat',
 'errors',
 'pc requirements',
 'soundtrack',
 'windowed mode',
 'bugs',
 'naval',
 'ubisoft connect',
 'singleplayer',
 'worth paying for',
 'cloud mechanics',
 'game time/length',
 'grinding',
 'treasurehunt',
 'movement',
 'story',
 'game time',
 'servers',
 'craft',
 'conclusion',
 'stealth',
 'no topic',
 'grind',
 'audience',
 'replayability',
 'frame rate']
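
Retyping all 71 names works, but a shorter route is to copy the existing names and change only the one position that needs renaming. The sketch below shows this on a small made-up frame with the same duplicate-name problem; it assumes, as in this notebook, that the Steam text is the column at position 1.

```python
import pandas as pd

# Made-up frame with two columns both called 'review'
toy = pd.DataFrame(
    [[["movement"], "the only game where i avoid fast travel", 0, 0]],
    columns=["categories", "review", "audio", "review"],
)

# Copy the current names, then rename by position so only one column changes.
# (rename(columns={'review': 'reviews'}) would rename both duplicates.)
new_columns = list(toy.columns)
new_columns[1] = "reviews"
toy.columns = new_columns

print(list(toy.columns))
```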

In [12]:
result_df.head()

Unnamed: 0,categories,reviews,enjoyment,audio,length,difficulity,game length,review,character,cinematic /art.,price,optimization,fun,matchmaking,music,full screen,customer support,others,status,assassin theme,cutscenes,multiplayer,uplay,worth buying,specs,gameplay,series,shanties,progression,entertainment value,pirate theme,ship gameplay,price / quality,graphics/art style,graphics,game time / length,animals,salt level,details,emotional,storyline,exploration,requirements,music /sound.,difficulty,combat,errors,pc requirements,soundtrack,windowed mode,bugs,naval,ubisoft connect,singleplayer,worth paying for,cloud mechanics,game time/length,grinding,treasurehunt,movement,story,game time,servers,craft,conclusion,stealth,no topic,grind,audience,replayability,frame rate
0,[movement],the only game where i avoid fast travel,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,"[assassin theme, pirate theme]",this is the best assassins creed game and prob...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,[no topic],best game ever iv 'e been playing from day one...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,[pirate theme],shanties before panties,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"[pirate theme, music]",best part of the game are the sea shanties low...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Now, we will define a function to combine topics. The main topic is set to True (1) on every row where any of its sub_topics is True (1).

In [13]:
def combine_topics_into_main(main_topic, sub_topics):
    for sub_topic in sub_topics:
        for index in range(len(result_df)):
            # If this row is tagged with the sub-topic,
            # also tag it with the main topic
            if result_df[sub_topic][index] == 1:
                result_df[main_topic][index] = 1


Testing out this function on some topics below:

In [14]:
combine_topics_into_main('graphics', ['graphics/art style', 'cinematic /art.'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  result_df[main_topic][index] = 1
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  result_df[main_topic][index] = 1
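
The message above is pandas' SettingWithCopyWarning, raised because `result_df[main_topic][index] = 1` is a chained assignment. A possible way to avoid it is to build a boolean mask and assign through `.loc` in a single step. The sketch below uses a four-row toy frame and a hypothetical helper name, `merge_sub_topics`; it is one way this could be written, not the function used in the rest of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "graphics":           [1, 0, 0, 0],
    "graphics/art style": [0, 1, 0, 0],
    "cinematic /art.":    [0, 0, 1, 0],
})

def merge_sub_topics(df, main_topic, sub_topics):
    """Set main_topic to 1 wherever any sub-topic column equals 1."""
    # Boolean Series: True for rows tagged with at least one sub-topic
    tagged = df[sub_topics].eq(1).any(axis=1)
    # Single .loc assignment on the original frame (no chained indexing)
    df.loc[tagged, main_topic] = 1
    return df

merge_sub_topics(toy, "graphics", ["graphics/art style", "cinematic /art."])
print(toy["graphics"].tolist())
```

Because the mask is computed for all rows at once, this also removes the need for the row-by-row loop.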


In [15]:
# Checking to see if the function works
result_df[result_df['graphics'] == 1]

Unnamed: 0,categories,reviews,enjoyment,audio,length,difficulity,game length,review,character,cinematic /art.,price,optimization,fun,matchmaking,music,full screen,customer support,others,status,assassin theme,cutscenes,multiplayer,uplay,worth buying,specs,gameplay,series,shanties,progression,entertainment value,pirate theme,ship gameplay,price / quality,graphics/art style,graphics,game time / length,animals,salt level,details,emotional,storyline,exploration,requirements,music /sound.,difficulty,combat,errors,pc requirements,soundtrack,windowed mode,bugs,naval,ubisoft connect,singleplayer,worth paying for,cloud mechanics,game time/length,grinding,treasurehunt,movement,story,game time,servers,craft,conclusion,stealth,no topic,grind,audience,replayability,frame rate
5,"[graphics, combat, difficulty, uplay, cloud me...",pros -beautiful graphics -huge artistic work -...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,"[assassin theme, combat, stealth, exploration,...",the best assassin 's creed game since assassin...,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
11,"[storyline, exploration, graphics, movement, e...",black flag is competing to be the best assassi...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
12,"[storyline, combat, optimization, movement, gr...",only halfway through but some impressions so f...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
14,"[difficulty, graphics, music, story, price, re...",difficulty my 90 year old grandma could play ...,0,0,1,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1458,"[series, exploration, graphics]",this is the best assassin 's creed i 've playe...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1459,"[pirate theme, storyline, errors, combat, expl...",i feel this is a large improvement over iii th...,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1469,"[storyline, graphics, assassin theme, pirate t...",i like my games like i like my women saucy and...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1470,"[graphics, combat, assassin theme, uplay, seri...",great graphic beautiful landscape good ship ba...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [16]:
result_df[((result_df['graphics/art style'] == 1) & (result_df['graphics'] == 1))]

Unnamed: 0,categories,reviews,enjoyment,audio,length,difficulity,game length,review,character,cinematic /art.,price,optimization,fun,matchmaking,music,full screen,customer support,others,status,assassin theme,cutscenes,multiplayer,uplay,worth buying,specs,gameplay,series,shanties,progression,entertainment value,pirate theme,ship gameplay,price / quality,graphics/art style,graphics,game time / length,animals,salt level,details,emotional,storyline,exploration,requirements,music /sound.,difficulty,combat,errors,pc requirements,soundtrack,windowed mode,bugs,naval,ubisoft connect,singleplayer,worth paying for,cloud mechanics,game time/length,grinding,treasurehunt,movement,story,game time,servers,craft,conclusion,stealth,no topic,grind,audience,replayability,frame rate
29,"[graphics/art style, gameplay, audio, music, p...",---{graphics/art style it 's the matrix beau...,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,1,0


In [17]:
result_df[(result_df['cinematic /art.'] == 1) & (result_df['graphics'] == 1)]

Unnamed: 0,categories,reviews,enjoyment,audio,length,difficulity,game length,review,character,cinematic /art.,price,optimization,fun,matchmaking,music,full screen,customer support,others,status,assassin theme,cutscenes,multiplayer,uplay,worth buying,specs,gameplay,series,shanties,progression,entertainment value,pirate theme,ship gameplay,price / quality,graphics/art style,graphics,game time / length,animals,salt level,details,emotional,storyline,exploration,requirements,music /sound.,difficulty,combat,errors,pc requirements,soundtrack,windowed mode,bugs,naval,ubisoft connect,singleplayer,worth paying for,cloud mechanics,game time/length,grinding,treasurehunt,movement,story,game time,servers,craft,conclusion,stealth,no topic,grind,audience,replayability,frame rate
88,"[no topic, graphics, price, requirements, diff...",player bases kids everyone mature casual p...,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0


Having confirmed that the function does what it is supposed to, let's extend it so that it also drops each of the 'sub_topics' columns once they have been merged into the main topic.

In [18]:
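# Defining a function to print the number of rows tagged with each topic.
def count_rows(df):
    for category in df.columns:
        # Note: with `or` this condition is always True, so 'categories' and
        # 'reviews' are printed as well (with 0 rows); `and` would skip them.
        if category != 'categories' or category != 'reviews':
            print(f"{category} has {df[df[category] == 1].shape[0]} rows")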

In [19]:
def combine_topics_into_main(main_topic, sub_topics):
    for sub_topic in sub_topics:
        for index in range(len(result_df)):
            # If this row is tagged with the sub-topic,
            # also tag it with the main topic
            if result_df[sub_topic][index] == 1:
                result_df[main_topic][index] = 1
        # Drop the sub-topic column once it has been merged
        result_df.drop(columns=sub_topic, inplace=True)
    # Print the updated row count for every remaining column
    return count_rows(result_df)

In [20]:
combine_topics_into_main('graphics', ['graphics/art style', 'cinematic /art.'])

categories has 0 rows
reviews has 0 rows
enjoyment has 1 rows
audio has 8 rows
length has 5 rows
difficulity has 4 rows
game length has 2 rows
review has 1 rows
character has 153 rows
price has 14 rows
optimization has 142 rows
fun has 5 rows
matchmaking has 1 rows
music has 116 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
cutscenes has 1 rows
multiplayer has 1 rows
uplay has 243 rows
worth buying has 5 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 233 rows
pirate theme has 454 rows
ship gameplay has 1 rows
price / quality has 1 rows
graphics has 370 rows
game time / length has 1 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 536 rows
exploration has 279 rows
requirements has 6 rows
music /sound. has 1 rows
difficulty has 40 rows
combat has 304 rows
errors has 146 rows
pc re


The main aim here is to ensure that each topic has at least 20 rows. The topics can either be combined or dropped, depending on the relevance to the other topics.
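
Rather than reading the printed counts by eye, the topics that fall short of the threshold could be listed directly. The sketch below does this on a small made-up frame; `MIN_ROWS`, `topic_counts`, and `rare_topics` are illustrative names, and the threshold is lowered to 2 only so that the toy data has something to report.

```python
import pandas as pd

toy = pd.DataFrame({
    "categories": [["combat"], ["combat", "craft"], ["music"]],
    "reviews": ["a", "b", "c"],
    "combat": [1, 1, 0],
    "craft":  [0, 1, 0],
    "music":  [0, 0, 1],
})

MIN_ROWS = 2  # the notebook's target is 20; lowered for this toy example

# Sum each 0/1 topic column to get the number of rows tagged with it
topic_counts = toy.drop(columns=["categories", "reviews"]).sum().sort_values()

# Topics that still need to be combined with another topic or dropped
rare_topics = topic_counts[topic_counts < MIN_ROWS].index.tolist()

print(topic_counts)
print(rare_topics)
```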

In [21]:
combine_topics_into_main('game length', ['game time', 'game time / length', 'game time/length'])

categories has 0 rows
reviews has 0 rows
enjoyment has 1 rows
audio has 8 rows
length has 5 rows
difficulity has 4 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 14 rows
optimization has 142 rows
fun has 5 rows
matchmaking has 1 rows
music has 116 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
cutscenes has 1 rows
multiplayer has 1 rows
uplay has 243 rows
worth buying has 5 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 233 rows
pirate theme has 454 rows
ship gameplay has 1 rows
price / quality has 1 rows
graphics has 370 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 536 rows
exploration has 279 rows
requirements has 6 rows
music /sound. has 1 rows
difficulty has 40 rows
combat has 304 rows
errors has 146 rows
pc requirements has 8 rows
soundtr

In [22]:
combine_topics_into_main('difficulty', ['difficulity'])

categories has 0 rows
reviews has 0 rows
enjoyment has 1 rows
audio has 8 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 14 rows
optimization has 142 rows
fun has 5 rows
matchmaking has 1 rows
music has 116 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
cutscenes has 1 rows
multiplayer has 1 rows
uplay has 243 rows
worth buying has 5 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 233 rows
pirate theme has 454 rows
ship gameplay has 1 rows
price / quality has 1 rows
graphics has 370 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 536 rows
exploration has 279 rows
requirements has 6 rows
music /sound. has 1 rows
difficulty has 44 rows
combat has 304 rows
errors has 146 rows
pc requirements has 8 rows
soundtrack has 2 rows
windowed


In [23]:
combine_topics_into_main('entertainment value', ['fun'])

categories has 0 rows
reviews has 0 rows
enjoyment has 1 rows
audio has 8 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 14 rows
optimization has 142 rows
matchmaking has 1 rows
music has 116 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
cutscenes has 1 rows
multiplayer has 1 rows
uplay has 243 rows
worth buying has 5 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 238 rows
pirate theme has 454 rows
ship gameplay has 1 rows
price / quality has 1 rows
graphics has 370 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 536 rows
exploration has 279 rows
requirements has 6 rows
music /sound. has 1 rows
difficulty has 44 rows
combat has 304 rows
errors has 146 rows
pc requirements has 8 rows
soundtrack has 2 rows
windowed mode has 1 row

In [24]:
combine_topics_into_main('uplay', ['ubisoft connect'])

categories has 0 rows
reviews has 0 rows
enjoyment has 1 rows
audio has 8 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 14 rows
optimization has 142 rows
matchmaking has 1 rows
music has 116 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
cutscenes has 1 rows
multiplayer has 1 rows
uplay has 243 rows
worth buying has 5 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 238 rows
pirate theme has 454 rows
ship gameplay has 1 rows
price / quality has 1 rows
graphics has 370 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 536 rows
exploration has 279 rows
requirements has 6 rows
music /sound. has 1 rows
difficulty has 44 rows
combat has 304 rows
errors has 146 rows
pc requirements has 8 rows
soundtrack has 2 rows
windowed mode has 1 row


'craft' has no related sub-topics to be grouped with and has fewer than 10 rows, so let's drop it.

In [25]:
result_df.drop(columns='craft', inplace=True)

In [26]:
combine_topics_into_main('storyline', ['story', 'cutscenes'])

categories has 0 rows
reviews has 0 rows
enjoyment has 1 rows
audio has 8 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 14 rows
optimization has 142 rows
matchmaking has 1 rows
music has 116 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
worth buying has 5 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 238 rows
pirate theme has 454 rows
ship gameplay has 1 rows
price / quality has 1 rows
graphics has 370 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
music /sound. has 1 rows
difficulty has 44 rows
combat has 304 rows
errors has 146 rows
pc requirements has 8 rows
soundtrack has 2 rows
windowed mode has 1 rows
bugs has 17 rows
na

In [27]:
combine_topics_into_main('errors', ['bugs'])

categories has 0 rows
reviews has 0 rows
enjoyment has 1 rows
audio has 8 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 14 rows
optimization has 142 rows
matchmaking has 1 rows
music has 116 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
worth buying has 5 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 238 rows
pirate theme has 454 rows
ship gameplay has 1 rows
price / quality has 1 rows
graphics has 370 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
music /sound. has 1 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
pc requirements has 8 rows
soundtrack has 2 rows
windowed mode has 1 rows
naval has 1 rows
si

In [28]:
combine_topics_into_main('price', ['price / quality', 'worth buying', 'worth paying for'])

categories has 0 rows
reviews has 0 rows
enjoyment has 1 rows
audio has 8 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
matchmaking has 1 rows
music has 116 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 238 rows
pirate theme has 454 rows
ship gameplay has 1 rows
graphics has 370 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
music /sound. has 1 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
pc requirements has 8 rows
soundtrack has 2 rows
windowed mode has 1 rows
naval has 1 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
g

In [29]:
combine_topics_into_main('soundtrack', ['music', 'audio', 'music /sound.'])

categories has 0 rows
reviews has 0 rows
enjoyment has 1 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
matchmaking has 1 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 238 rows
pirate theme has 454 rows
ship gameplay has 1 rows
graphics has 370 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
pc requirements has 8 rows
soundtrack has 125 rows
windowed mode has 1 rows
naval has 1 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
treasurehunt has 1 rows
movement has 101

Replayability is rarely mentioned: it has only 7 rows and does not fit into any of the other categories. Let's drop this column.

In [30]:
result_df.drop(columns='replayability', inplace = True)
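
The merge calls in this section emit pandas' chained-assignment warning because `result_df[main_topic][index] = 1` writes through two indexing steps. Below is a minimal sketch of a warning-free alternative. It assumes, going by the row counts printed above, that `combine_topics_into_main` sets the main-topic flag wherever any sub-topic flag is 1 and then drops the sub-topic columns. The name `merge_topics` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def merge_topics(df, main_topic, sub_topics):
    """Hypothetical helper: fold 0/1 sub-topic columns into main_topic."""
    # Only use sub-topic columns that actually exist in the frame
    present = [col for col in sub_topics if col in df.columns]
    # Row-wise maximum of 0/1 flags acts as a logical OR
    merged = df[[main_topic] + present].max(axis=1)
    # drop() returns a new frame, so the assignment below is not chained
    out = df.drop(columns=present)
    out[main_topic] = merged
    return out

toy = pd.DataFrame({
    'errors': [1, 0, 0, 1],
    'bugs':   [0, 1, 0, 1],
    'combat': [0, 0, 1, 0],
})
toy = merge_topics(toy, 'errors', ['bugs', 'glitches'])
print(toy['errors'].sum())   # number of rows flagged 'errors' after the merge
```

Because overlapping rows are counted once, the merged total can be smaller than the sum of the individual counts, which matches what the 'price' and 'soundtrack' merges show.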

In [31]:
combine_topics_into_main('entertainment value', ['enjoyment'])

categories has 0 rows
reviews has 0 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
matchmaking has 1 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 239 rows
pirate theme has 454 rows
ship gameplay has 1 rows
graphics has 370 rows
animals has 16 rows
salt level has 1 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
pc requirements has 8 rows
soundtrack has 125 rows
windowed mode has 1 rows
naval has 1 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
treasurehunt has 1 rows
movement has 101 rows
servers has 1 r

'salt level' has only 1 row and does not relate to any other topic. Let's drop it.

In [32]:
result_df.drop(columns='salt level', inplace = True)

In [33]:
combine_topics_into_main('exploration', ['treasurehunt'])

categories has 0 rows
reviews has 0 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
matchmaking has 1 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 239 rows
pirate theme has 454 rows
ship gameplay has 1 rows
graphics has 370 rows
animals has 16 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
pc requirements has 8 rows
soundtrack has 125 rows
windowed mode has 1 rows
naval has 1 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
movement has 101 rows
servers has 1 rows
conclusion has 1 rows
stealth has 57 rows


In [34]:
combine_topics_into_main('full screen', ['windowed mode'])

categories has 0 rows
reviews has 0 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
matchmaking has 1 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 1 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 239 rows
pirate theme has 454 rows
ship gameplay has 1 rows
graphics has 370 rows
animals has 16 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
pc requirements has 8 rows
soundtrack has 125 rows
naval has 1 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
movement has 101 rows
servers has 1 rows
conclusion has 1 rows
stealth has 57 rows
no topic has 262 rows
gri

In [35]:
combine_topics_into_main('specs', ['pc requirements'])

categories has 0 rows
reviews has 0 rows
length has 5 rows
game length has 12 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
matchmaking has 1 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 9 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 239 rows
pirate theme has 454 rows
ship gameplay has 1 rows
graphics has 370 rows
animals has 16 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
soundtrack has 125 rows
naval has 1 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
movement has 101 rows
servers has 1 rows
conclusion has 1 rows
stealth has 57 rows
no topic has 262 rows
grind has 3 rows
audience has 

In [36]:
combine_topics_into_main('game length', ['length'])

categories has 0 rows
reviews has 0 rows
game length has 17 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
matchmaking has 1 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
status has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 9 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
progression has 1 rows
entertainment value has 239 rows
pirate theme has 454 rows
ship gameplay has 1 rows
graphics has 370 rows
animals has 16 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
soundtrack has 125 rows
naval has 1 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
movement has 101 rows
servers has 1 rows
conclusion has 1 rows
stealth has 57 rows
no topic has 262 rows
grind has 3 rows
audience has 7 rows
frame rate 

The 'servers', 'status' and 'progression' topics have only 1 row each, and 'audience' has only 7 rows. Let's drop these.

In [37]:
result_df.drop(columns=['servers', 'status', 'progression', 'audience'], inplace = True)

In [38]:
combine_topics_into_main('storyline', ['conclusion'])

categories has 0 rows
reviews has 0 rows
game length has 17 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
matchmaking has 1 rows
full screen has 32 rows
customer support has 91 rows
others has 1 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 9 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
entertainment value has 239 rows
pirate theme has 454 rows
ship gameplay has 1 rows
graphics has 370 rows
animals has 16 rows
details has 1 rows
emotional has 1 rows
storyline has 557 rows
exploration has 279 rows
requirements has 6 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
soundtrack has 125 rows
naval has 1 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
movement has 101 rows
stealth has 57 rows
no topic has 262 rows
grind has 3 rows
frame rate has 45 rows


'others' and 'matchmaking' only have 1 row each, and they are not related to any other topic. We will drop these columns as well.

In [39]:
result_df.drop(columns=['others','matchmaking'], inplace = True)

In [40]:
combine_topics_into_main('storyline', ['details', 'emotional'])

categories has 0 rows
reviews has 0 rows
game length has 17 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
full screen has 32 rows
customer support has 91 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 9 rows
gameplay has 19 rows
series has 397 rows
shanties has 2 rows
entertainment value has 239 rows
pirate theme has 454 rows
ship gameplay has 1 rows
graphics has 370 rows
animals has 16 rows
storyline has 558 rows
exploration has 279 rows
requirements has 6 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
soundtrack has 125 rows
naval has 1 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
movement has 101 rows
stealth has 57 rows
no topic has 262 rows
grind has 3 rows
frame rate has 45 rows


In [41]:
combine_topics_into_main('pirate theme', ['naval', 'ship gameplay', 'shanties'])

categories has 0 rows
reviews has 0 rows
game length has 17 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
full screen has 32 rows
customer support has 91 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 9 rows
gameplay has 19 rows
series has 397 rows
entertainment value has 239 rows
pirate theme has 456 rows
graphics has 370 rows
animals has 16 rows
storyline has 558 rows
exploration has 279 rows
requirements has 6 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
soundtrack has 125 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
movement has 101 rows
stealth has 57 rows
no topic has 262 rows
grind has 3 rows
frame rate has 45 rows


In [42]:
combine_topics_into_main('specs', ['requirements'])

categories has 0 rows
reviews has 0 rows
game length has 17 rows
review has 1 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
full screen has 32 rows
customer support has 91 rows
assassin theme has 190 rows
multiplayer has 1 rows
uplay has 243 rows
specs has 14 rows
gameplay has 19 rows
series has 397 rows
entertainment value has 239 rows
pirate theme has 456 rows
graphics has 370 rows
animals has 16 rows
storyline has 558 rows
exploration has 279 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
soundtrack has 125 rows
singleplayer has 1 rows
cloud mechanics has 20 rows
grinding has 1 rows
movement has 101 rows
stealth has 57 rows
no topic has 262 rows
grind has 3 rows
frame rate has 45 rows


Dropping the remaining topics that have fewer than 10 rows.

In [43]:
result_df.drop(columns=['grinding','multiplayer', 'grind', 'singleplayer', 'review'], inplace = True)
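
The columns to drop can also be found programmatically instead of being read off the printed counts. This is a minimal sketch, assuming every topic column holds 0/1 flags. The toy frame and the `min_rows` name are illustrative and not taken from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'reviews':   ['a', 'b', 'c', 'd'],
    'storyline': [1, 1, 1, 0],
    'grind':     [0, 0, 1, 0],
    'combat':    [1, 0, 1, 1],
})

min_rows = 2                                   # illustrative threshold (the notebook uses 10)
topic_cols = toy.columns.drop('reviews')       # every column except the review text
counts = toy[topic_cols].sum()                 # rows flagged per topic
rare = counts[counts < min_rows].index.tolist()

toy = toy.drop(columns=rare)
print(rare)
```

Listing the rare topics before dropping them still leaves room to merge some of them into a related topic by hand, as was done above.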

In [48]:
count_rows(result_df)

reviews has 0 rows
game length has 17 rows
character has 153 rows
price has 16 rows
optimization has 142 rows
full screen has 32 rows
customer support has 91 rows
assassin theme has 190 rows
uplay has 243 rows
specs has 14 rows
gameplay has 19 rows
series has 397 rows
entertainment value has 239 rows
pirate theme has 456 rows
graphics has 370 rows
animals has 16 rows
storyline has 558 rows
exploration has 279 rows
difficulty has 44 rows
combat has 304 rows
errors has 163 rows
soundtrack has 125 rows
cloud mechanics has 20 rows
movement has 101 rows
stealth has 57 rows
no topic has 262 rows
frame rate has 45 rows


In [44]:
result_df.shape

(1479, 28)

Our final output has 28 columns, including the 'reviews' and 'categories' columns. Now that the topic columns are cleaned up, let's drop the original 'categories' column.

In [45]:
result_df.drop(columns = 'categories', inplace = True)

## Multi-Label Classification

To perform multi-label classification, our 'y' data needs to be in a single column, with each entry a list of topics, similar to how we started out earlier in the notebook.

In [49]:
result_df.columns

Index(['reviews', 'game length', 'character', 'price', 'optimization',
       'full screen', 'customer support', 'assassin theme', 'uplay', 'specs',
       'gameplay', 'series', 'entertainment value', 'pirate theme', 'graphics',
       'animals', 'storyline', 'exploration', 'difficulty', 'combat', 'errors',
       'soundtrack', 'cloud mechanics', 'movement', 'stealth', 'no topic',
       'frame rate'],
      dtype='object')

For the pre-processing stage, our reviews need to be tokenized and stemmed, with stopwords removed, so that the model has less noise to deal with and can produce better results.

In [53]:
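# Tokenizing the sentences: words, dollar amounts (e.g. $9.99), or any other run of non-whitespace characters.
# A raw string keeps the backslashes in the pattern from being read as string escapes.
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')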
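
To see what the pattern does, it can be tried with the standard `re` module. As far as we understand, `RegexpTokenizer` in its default mode simply returns the pattern's matches in order, so `re.findall` should behave the same way here. A small sketch with a made-up review string:

```python
import re

pattern = r'\w+|\$[\d\.]+|\S+'
sample = "pro -beautiful graphics, cost $9.99!"   # made-up review text

# Each alternative is tried left to right at every position;
# whitespace matches none of them and is skipped.
tokens = re.findall(pattern, sample)
print(tokens)
# ['pro', '-beautiful', 'graphics', ',', 'cost', '$9.99', '!']
```

A leading hyphen stays attached to its word because only the `\S+` branch can match at '-', which is why tokens such as '-beauti' show up in the cleaned reviews further down.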

In [54]:
pstem = PorterStemmer()

In [55]:
# Setup: Checking through the stopwords in the original library.
stopword = stopwords.words('english')
stopword

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

We want to add some of our own stopwords to this list, including some which are more tailored to the gaming topic such as 'spoiler'.

In [56]:
# Let's add some stop words, more specific to this particular dataset.
for word in ['really', 'spoiler', 'it', 'my', 'will', 'this', 'of', 'but', 'was', 'for']:
    stopword.append(word)
stopword

We also do not want to lose too much of the context of the reviews, so we will take the words listed below back out of the stopword list.

In [57]:
# Now let's remove some words from the stopword list.
# List of words to be removed
remove = ['against', 'between', 'into', 'through', 'during', 'before', 'after',
 'above', 'below', 'up', 'down', 'out', 'off', 'over', 'under', 'again',
 'further', 'then', 'once', 'here','why', 'when', 'where', 'how', 'any', 'both', 'each', 'few', 'more', 'most',
 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'than', 'too', 'very', 'will', 'don', "don't", 'should', "should've", 'now',
 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
 "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't",
 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

In [58]:
# Removing words from stopwords
print(len(stopword)) #length of stopwords before removal
for x in remove:
    stopword.remove(x)

print(len(stopword)) #length of stopwords after removal.

189
105
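
Because `stopword` is a plain list, appending words that are already present (such as 'it' or 'my') creates duplicates, and `list.remove` deletes only the first match, which is why 'will' still appears in the final list. Below is a minimal sketch of the same add-then-remove logic with sets, which avoids both effects. The short `base` list stands in for the NLTK list and is an assumption for illustration only.

```python
base = ['i', 'it', 'my', 'will', 'not', 'no', 'very', 'the']   # stand-in for the NLTK list
extra = ['really', 'spoiler', 'it', 'my', 'will']              # words to add
keep_in_text = ['not', 'no', 'very', 'will', 'won']            # words to take back out

# Set union ignores words that are already present;
# set difference removes every occurrence and ignores missing words.
custom = (set(base) | set(extra)) - set(keep_in_text)

print(sorted(custom))
# ['i', 'it', 'my', 'really', 'spoiler', 'the']
```

With sets, 'will' ends up removed, whereas the list version above keeps the appended copy, so the two are not drop-in equivalents.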


In [59]:
stopword

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'to',
 'from',
 'in',
 'on',
 'there',
 'all',
 'so',
 's',
 't',
 'can',
 'just',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'really',
 'spoiler',
 'it',
 'my',
 'will',
 'this',
 'of',
 'but',
 'was',
 'for']

The following cleaning steps should already have been done in the data cleaning notebook. However, we repeat them here in case anything slipped through.

In [60]:
# Compiling the emoji pattern once, so it is not rebuilt for every review
emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)

# Defining a function to remove emojis
def remove_emoji(text):
    clean_list = []
    for sublist in text:
        # Purely numeric tokens are kept as they are; emojis are stripped from everything else
        clean_words = [word if re.match(r'^\d+$', word) else emoji_pattern.sub('', word) for word in sublist]
        clean_list.append(clean_words)
    return clean_list

In [62]:
# Defining a function to remove unwanted punctuations
def remove_punc_lines(text):
    clean_list = []
    for sublist in text:
        cleaned_sublist = []
        for line in sublist:
            # Keep tokens that contain '/', contain at least one alphanumeric character, or are empty / whitespace-only
            if '/' in line or any(char.isalnum() for char in line) or not line.strip():
                cleaned_sublist.append(line)
        clean_list.append(cleaned_sublist)
    return clean_list
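
To make the filter's behaviour concrete, here is the same condition applied to a few made-up tokens. The function is repeated in compact form so the snippet runs on its own; the sample tokens are illustrative.

```python
def remove_punc_lines(text):
    clean_list = []
    for sublist in text:
        # Keep tokens containing '/', any alphanumeric character, or nothing but whitespace
        kept = [line for line in sublist
                if '/' in line or any(char.isalnum() for char in line) or not line.strip()]
        clean_list.append(kept)
    return clean_list

sample = [['best', '!!!', '-uplay', '...', '24/7', '(', '']]
print(remove_punc_lines(sample))
# [['best', '-uplay', '24/7', '']]
```

Tokens made only of punctuation are dropped, while mixed tokens such as '-uplay' keep their punctuation.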


In [63]:
# Joining the tokens of each review back into a single string
def join(token_lists):     # Parameter renamed so it no longer shadows the built-in `list`
    clean_joined = []

    for tokens in token_lists:
        clean_joined.append(" ".join(tokens))

    return clean_joined

In [64]:
# Combining this into a function for the whole process
def stop_stem(df, df_column, stopwords_list):    # Using the option to provide a list of stopwords
    series = list(df[df_column].astype(str))     # Converting the column into a list of strings
    token = [tokenizer.tokenize(text.lower()) for text in series]    # Step 1: Tokenizing the lowercased reviews

    no_stop = []
    for row_tokens in token:     # Step 2: Removing stopwords (tokens are already lowercase)
        no_stop.append([sample for sample in row_tokens if sample not in stopwords_list])

    stemmed = []
    for row_tokens in no_stop:     # Step 3: Stemming each remaining token
        stemmed.append([pstem.stem(sample) for sample in row_tokens])

    no_emoji = remove_emoji(stemmed)    # Step 4: Removing emojis

    no_punc = remove_punc_lines(no_emoji)    # Step 5: Removing punctuation-only tokens

    clean_text = join(no_punc)     # Step 6: Joining each review back together

    return clean_text


In [65]:
stop_stem(result_df, 'reviews', stopword)

['onli game where avoid fast travel',
 'best assassin creed game probabl best pirat game ever',
 "best game ever iv 'e play day one (love 'm 75 year young arrrrr matey",
 'shanti befor panti',
 "best part game sea shanti lowkey 1700 's jam poppin",
 'pro -beauti graphic -huge artist work -vast world -naval battl awesomeeee -main quest captiv con -uplay -same old basic combat system -lack difficulti -littl no chang gameplay compar older ac -uplay -uplay -seriously... uplay advic disabl cloud save sync lost 15 hour gameplay idiot set avoid simpli disabl cloud sync',
 "best assassin 's creed game sinc assassin 's creed 2 come someon hate pirat game come three part assassin ship captain ubi mean abstergo employe assassin part similar previou game improv sword fight feel alot better than previou game more batman -like (arkham game less counter -kill spam rang weapon power not over power final put stealth mechan game game assassin (you hide bush long grass abl move around mission design good

## Train Test Split

Before starting the modelling process, we need to split our data into training and test sets so that there is no data leakage.

In [69]:
modelling = result_df

In [70]:
modelling.shape

(1479, 27)

In [71]:
modelling['reviews'] = stop_stem(result_df, 'reviews', stopword)

In [72]:
modelling.columns

Index(['reviews', 'game length', 'character', 'price', 'optimization',
       'full screen', 'customer support', 'assassin theme', 'uplay', 'specs',
       'gameplay', 'series', 'entertainment value', 'pirate theme', 'graphics',
       'animals', 'storyline', 'exploration', 'difficulty', 'combat', 'errors',
       'soundtrack', 'cloud mechanics', 'movement', 'stealth', 'no topic',
       'frame rate'],
      dtype='object')

In [73]:
X = modelling['reviews']
y = modelling.drop(columns = 'reviews')

Converting 'y' into a list of topic names for each review, a format that our models can accept.

In [76]:
# For each review, collect the names of the topic columns flagged with 1
labels = [
    [category for category in y.columns if row[category] == 1]
    for _, row in y.iterrows()
]

labels

[['movement'],
 ['assassin theme', 'pirate theme'],
 ['no topic'],
 ['pirate theme'],
 ['pirate theme', 'soundtrack'],
 ['uplay', 'graphics', 'difficulty', 'combat', 'cloud mechanics'],
 ['character',
  'optimization',
  'assassin theme',
  'uplay',
  'graphics',
  'storyline',
  'exploration',
  'combat',
  'stealth'],
 ['uplay'],
 ['optimization', 'series', 'pirate theme', 'storyline'],
 ['pirate theme', 'soundtrack'],
 ['optimization', 'customer support', 'uplay'],
 ['entertainment value', 'graphics', 'storyline', 'exploration', 'movement'],
 ['optimization',
  'customer support',
  'entertainment value',
  'graphics',
  'storyline',
  'combat',
  'movement',
  'stealth'],
 ['series', 'pirate theme', 'storyline', 'exploration'],
 ['game length',
  'price',
  'specs',
  'entertainment value',
  'graphics',
  'storyline',
  'difficulty',
  'soundtrack'],
 ['storyline', 'soundtrack'],
 ['series'],
 ['gameplay', 'graphics', 'storyline', 'soundtrack'],
 ['character',
  'optimization',
  

We then add a 'labels' column that holds the list of topics for each review.

In [77]:
# Assign the whole column at once; the row-by-row chained assignment
# (modelling['labels'][index] = ...) raises SettingWithCopyWarning
modelling['labels'] = pd.Series(labels, index=modelling.index)

modelling.head()

Unnamed: 0,reviews,game length,character,price,optimization,full screen,customer support,assassin theme,uplay,specs,gameplay,series,entertainment value,pirate theme,graphics,animals,storyline,exploration,difficulty,combat,errors,soundtrack,cloud mechanics,movement,stealth,no topic,frame rate,labels
0,onli game where avoid fast travel,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,[movement]
1,best assassin creed game probabl best pirat ga...,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,"[assassin theme, pirate theme]"
2,best game ever iv 'e play day one (love 'm 75 ...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,[no topic]
3,shanti befor panti,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,[pirate theme]
4,best part game sea shanti lowkey 1700 's jam p...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,"[pirate theme, soundtrack]"


In [79]:
modelling_proper = modelling[['reviews','labels']]
modelling_proper.head()

Unnamed: 0,reviews,labels
0,onli game where avoid fast travel,[movement]
1,best assassin creed game probabl best pirat ga...,"[assassin theme, pirate theme]"
2,best game ever iv 'e play day one (love 'm 75 ...,[no topic]
3,shanti befor panti,[pirate theme]
4,best part game sea shanti lowkey 1700 's jam p...,"[pirate theme, soundtrack]"


In [80]:
y = modelling_proper['labels']
X = modelling_proper['reviews']

We will need to use the MultiLabelBinarizer to transform the label lists into a binary indicator matrix before we perform the train test split.

In [83]:
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y)
labels = list(mlb.classes_)
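
As a small illustration of what the binarizer does, the sketch below runs it on three made-up label lists. The `toy_` names and values are assumptions for the example only and are not part of the notebook's data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists shaped like the 'labels' column (illustrative values only)
toy_labels = [
    ["movement"],
    ["assassin theme", "pirate theme"],
    ["pirate theme", "soundtrack"],
]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_labels)

# classes_ gives the column order of the indicator matrix (sorted alphabetically)
print(list(toy_mlb.classes_))
# One row per review, one 0/1 column per label
print(toy_y)
# inverse_transform maps indicator rows back to tuples of label names
print(toy_mlb.inverse_transform(toy_y))
```

Keeping `classes_` around, as the cell above does with `labels`, is what lets the classification reports show topic names instead of column numbers.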

In [85]:
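# Two folds holding 75% and 25% of the rows, stratified on the label matrix
stratifier = IterativeStratification(n_splits=2, order=1, sample_distribution_per_fold=[0.75, 0.25])

# Each iteration overwrites the previous one, so the last split is the one kept
for train, test in stratifier.split(X, y):
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]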
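
One way to sanity-check a stratified split is to compare how often each label appears in the two sets. The helper below is a minimal sketch with made-up indicator matrices; `label_share` and the `toy_` arrays are hypothetical names, and in the notebook one would pass `y_train` and `y_test` instead.

```python
import numpy as np

def label_share(y_part):
    """Fraction of rows in which each label is present."""
    return np.asarray(y_part).mean(axis=0)

# Toy indicator matrices standing in for y_train / y_test (illustrative values)
toy_y_train = np.array([[1, 0], [1, 1], [0, 1], [1, 0], [0, 0], [1, 1]])
toy_y_test = np.array([[1, 0], [0, 1]])

train_share = label_share(toy_y_train)
test_share = label_share(toy_y_test)

# Small gaps suggest the labels are spread evenly across the two sets
gap = np.abs(train_share - test_share)
print(train_share, test_share, gap)
```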

### One Vs Rest Classifier 

The [One Vs Rest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) fits one binary classifier per label. We will use it to perform multi-label classification.
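
To make the "one classifier per label" idea concrete, here is a minimal sketch on toy data with two labels. The arrays are invented for illustration; only `OneVsRestClassifier` and `BernoulliNB` come from the notebook.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB

# Toy binary features and a two-label indicator target (illustrative values)
toy_X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])
toy_y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

ovr = OneVsRestClassifier(BernoulliNB())
ovr.fit(toy_X, toy_y)

# One fitted binary estimator per label column
print(len(ovr.estimators_))
# Predictions come back as an indicator matrix with one column per label
print(ovr.predict(toy_X).shape)
```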

In [88]:
tvec = TfidfVectorizer()

X_train = tvec.fit_transform(X_train)
X_test = tvec.transform(X_test)
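
The vectorizer is fitted on the training reviews only and then reused on the test reviews, which keeps test vocabulary out of training. A small sketch of the consequence, using made-up stemmed texts (the `toy_` names are assumptions for the example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["best pirat game", "sea shanti best part"]
toy_test = ["naval battl awesom"]  # shares no words with the training texts

toy_tvec = TfidfVectorizer()
toy_train_vec = toy_tvec.fit_transform(toy_train)  # learns vocabulary and IDF weights
toy_test_vec = toy_tvec.transform(toy_test)        # reuses them unchanged

# Vocabulary contains only words seen in training
print(sorted(toy_tvec.vocabulary_))
# Words unseen in training are dropped, so this test row has no non-zero entries
print(toy_test_vec.nnz)
```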

We will start with a BernoulliNB classifier as the base estimator.

In [89]:
clf = OneVsRestClassifier(BernoulliNB())

In [90]:
clf.fit(X_train, y_train)

y_train_pred_orc = clf.predict(X_train)
y_test_pred_orc = clf.predict(X_test)

In [92]:
print(classification_report(y_train, y_train_pred_orc,zero_division=1,target_names=labels))
print("\n")

                     precision    recall  f1-score   support

            animals       0.18      0.25      0.21        12
     assassin theme       0.34      0.17      0.23       142
          character       0.41      0.53      0.46       115
    cloud mechanics       0.12      0.20      0.15        15
             combat       0.68      0.60      0.63       228
   customer support       0.29      0.41      0.34        68
         difficulty       0.25      0.45      0.32        33
entertainment value       0.36      0.36      0.36       179
             errors       0.37      0.25      0.30       122
        exploration       0.61      0.58      0.59       209
         frame rate       0.17      0.32      0.22        34
        full screen       0.10      0.17      0.13        24
        game length       0.18      0.17      0.17        12
           gameplay       0.14      0.20      0.17        15
           graphics       0.71      0.58      0.63       278
           movement    

Using the F1 score as our metric, the train results were better for topics that had more rows. Taking 60% as the pass mark, only the topics 'combat', 'graphics', and 'storyline' did well enough with this model.

Let's compare this to our test results.

In [93]:
print(classification_report(y_test, y_test_pred_orc,zero_division=1,target_names=labels))
print("\n")

                     precision    recall  f1-score   support

            animals       0.00      0.00      1.00         4
     assassin theme       0.17      0.08      0.11        48
          character       0.39      0.58      0.47        38
    cloud mechanics       0.00      0.00      1.00         5
             combat       0.62      0.66      0.64        76
   customer support       0.23      0.43      0.30        23
         difficulty       0.19      0.45      0.27        11
entertainment value       0.34      0.38      0.36        60
             errors       0.13      0.10      0.11        41
        exploration       0.59      0.61      0.60        70
         frame rate       0.08      0.18      0.11        11
        full screen       0.06      0.12      0.08         8
        game length       0.00      0.00      1.00         5
           gameplay       0.00      0.00      1.00         4
           graphics       0.74      0.71      0.72        92
           movement    

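We will disregard the F1 scores of 1.00, as the precision and recall for those labels were both 0.00. With both at zero the F1 score is undefined (0/0), and the report fills in the value we passed as `zero_division=1`. The zero precision and recall themselves are due to the lack of data for these labels in either the train set or test set, or both.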
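
As a rough illustration of how an undefined F1 score can end up reported as 1.00, the helper below recomputes F1 from precision and recall with a fallback value. It is a hand-written sketch, not scikit-learn's own code, and `f1_from` is a hypothetical name.

```python
def f1_from(precision, recall, zero_division=1.0):
    """Harmonic mean of precision and recall, with a fallback when both are 0."""
    if precision + recall == 0:
        # 0/0 case: return the chosen fallback instead of dividing by zero
        return zero_division
    return 2 * precision * recall / (precision + recall)

# Precision and recall both 0.00, as for 'animals' in the test report
print(f1_from(0.0, 0.0, zero_division=1.0))
# The same row with a fallback of 0 would read as a failure instead
print(f1_from(0.0, 0.0, zero_division=0.0))
# A well-defined case: 'combat' in the test report (precision 0.62, recall 0.66)
print(round(f1_from(0.62, 0.66), 2))
```

Passing `zero_division=0` to `classification_report` is one option to consider if undefined scores should count against a label rather than for it.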

For our test data, 'combat', 'exploration', 'graphics', and 'storyline' were well labelled. 

The train and test scores are similar and low for most labels, so our model may be underfit.

Let's try using a Logistic Regression classifier in our One Vs Rest.

#### One Vs Rest: Logistic Regression

In [94]:
clf_lr = OneVsRestClassifier(LogisticRegression())

In [95]:
clf_lr.fit(X_train, y_train)

y_train_pred_orc_lr = clf_lr.predict(X_train)
y_test_pred_orc_lr = clf_lr.predict(X_test)

In [96]:
print(classification_report(y_train, y_train_pred_orc_lr,zero_division=1,target_names=labels))
print("\n")

                     precision    recall  f1-score   support

            animals       1.00      0.00      0.00        12
     assassin theme       0.68      0.12      0.20       142
          character       1.00      0.06      0.11       115
    cloud mechanics       1.00      0.00      0.00        15
             combat       0.97      0.60      0.74       228
   customer support       1.00      0.00      0.00        68
         difficulty       1.00      0.06      0.11        33
entertainment value       1.00      0.07      0.13       179
             errors       1.00      0.09      0.17       122
        exploration       0.96      0.44      0.61       209
         frame rate       1.00      0.00      0.00        34
        full screen       1.00      0.00      0.00        24
        game length       1.00      0.00      0.00        12
           gameplay       1.00      0.00      0.00        15
           graphics       0.96      0.62      0.75       278
           movement    

The train results for this model are better than for the previous one.
'combat', 'exploration', 'graphics', 'pirate theme', 'series', 'storyline', and 'uplay' all scored above 60%.

Labels with more than 220 training rows all scored above 70%. Let's take a look at our test results.

In [98]:
print(classification_report(y_test, y_test_pred_orc_lr,zero_division=1,target_names=labels))
print("\n")

                     precision    recall  f1-score   support

            animals       1.00      0.00      0.00         4
     assassin theme       0.67      0.12      0.21        48
          character       0.00      0.00      1.00        38
    cloud mechanics       1.00      0.00      0.00         5
             combat       0.85      0.53      0.65        76
   customer support       1.00      0.00      0.00        23
         difficulty       1.00      0.09      0.17        11
entertainment value       1.00      0.02      0.03        60
             errors       0.50      0.02      0.05        41
        exploration       0.73      0.34      0.47        70
         frame rate       1.00      0.00      0.00        11
        full screen       1.00      0.00      0.00         8
        game length       1.00      0.00      0.00         5
           gameplay       1.00      0.00      0.00         4
           graphics       0.95      0.67      0.79        92
           movement    

For our test data, 'combat', 'graphics', 'pirate theme', and 'storyline' had good results above 60%. 

'graphics' and 'storyline' have F1 scores close to or above 80%, a clear improvement over the BernoulliNB model.

### Classifier Chain

The second model we will test is the [Classifier Chain](https://scikit-learn.org/stable/auto_examples/multioutput/plot_classifier_chain_yeast.html). We will use Logistic Regression as the base estimator in this Classifier Chain as well.
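
In a classifier chain, each label's classifier also receives the earlier labels in the chain as extra features. The sketch below shows this on toy data using scikit-learn's `sklearn.multioutput.ClassifierChain` (the class from the linked example), which is a swapped-in stand-in for the `skmultilearn` class used in this notebook; the arrays are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Toy features and a two-label indicator target (illustrative values)
toy_X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0], [0.0, 0.0]])
toy_y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

# order=[0, 1]: label 0 is modelled first, then passed on as a feature for label 1
chain = ClassifierChain(LogisticRegression(), order=[0, 1])
chain.fit(toy_X, toy_y)

# The first link sees 2 features; the second sees those 2 plus the earlier label
print(chain.estimators_[0].coef_.shape, chain.estimators_[1].coef_.shape)
print(chain.predict(toy_X).shape)
```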

In [99]:
# initialize classifier chains multi-label classifier
classifier = ClassifierChain(LogisticRegression())
# Training logistic regression model on train data
classifier.fit(X_train, y_train)

In [100]:
X_train_pred_lr = classifier.predict(X_train)
X_test_pred_lr = classifier.predict(X_test)

In [101]:
print(classification_report(y_train, X_train_pred_lr,zero_division=1,target_names=labels))
print("\n")

                     precision    recall  f1-score   support

            animals       1.00      0.00      0.00        12
     assassin theme       0.68      0.12      0.20       142
          character       1.00      0.02      0.03       115
    cloud mechanics       1.00      0.00      0.00        15
             combat       0.98      0.48      0.65       228
   customer support       1.00      0.00      0.00        68
         difficulty       1.00      0.06      0.11        33
entertainment value       1.00      0.03      0.06       179
             errors       1.00      0.07      0.12       122
        exploration       0.97      0.29      0.44       209
         frame rate       1.00      0.00      0.00        34
        full screen       1.00      0.00      0.00        24
        game length       1.00      0.17      0.29        12
           gameplay       1.00      0.00      0.00        15
           graphics       0.94      0.22      0.35       278
           movement    

Using the F1 score again as our comparison metric, we can see that 'combat', 'no topic', 'pirate theme', and 'storyline' performed well. This is not too far off from our One Vs Rest Classifier, except that the 'no topic' score is higher here and fewer labels performed well overall.

Let's take a look at the test results.

In [102]:
print(classification_report(y_test, X_test_pred_lr,zero_division=1,target_names=labels))
print("\n")

                     precision    recall  f1-score   support

            animals       1.00      0.00      0.00         4
     assassin theme       0.67      0.12      0.21        48
          character       0.00      0.00      1.00        38
    cloud mechanics       1.00      0.00      0.00         5
             combat       0.86      0.42      0.57        76
   customer support       1.00      0.00      0.00        23
         difficulty       1.00      0.09      0.17        11
entertainment value       1.00      0.00      0.00        60
             errors       1.00      0.02      0.05        41
        exploration       0.72      0.30      0.42        70
         frame rate       1.00      0.00      0.00        11
        full screen       1.00      0.00      0.00         8
        game length       1.00      0.20      0.33         5
           gameplay       1.00      0.00      0.00         4
           graphics       0.93      0.30      0.46        92
           movement    

Unfortunately, in the test results only the labels 'pirate theme' and 'storyline' had a decent score. This model does not perform as well as our One Vs Rest Classifier.

### Label Powerset

Last but not least, we will try the [Label Powerset](http://scikit.ml/api/skmultilearn.problem_transform.lp.html) model.
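
Label Powerset turns the multi-label problem into a single multi-class problem by treating every distinct combination of labels as its own class. A minimal plain-Python sketch of that transformation, with made-up indicator rows (all names here are hypothetical and this is not the `skmultilearn` implementation):

```python
# Toy indicator rows (illustrative values): each row is one review's label set
toy_y = [
    (1, 0, 0),
    (1, 0, 1),
    (1, 0, 0),
    (0, 1, 0),
    (1, 0, 1),
]

# Assign one class id to every distinct label combination, in order of appearance
combo_to_class = {}
powerset_classes = []
for row in toy_y:
    if row not in combo_to_class:
        combo_to_class[row] = len(combo_to_class)
    powerset_classes.append(combo_to_class[row])

print(powerset_classes)     # single-label, multi-class target
print(len(combo_to_class))  # number of distinct combinations seen

# Decoding a predicted class id gives back a full label row
class_to_combo = {cls: combo for combo, cls in combo_to_class.items()}
print(class_to_combo[1])
```

With 26 topics and long label lists per review, many combinations are likely to appear only once or twice, which may make this approach harder to train on a dataset of this size.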

In [105]:
# initialize label powerset multi-label classifier
lps = LabelPowerset(LogisticRegression())
# train
lps.fit(X_train, y_train)


In [106]:
X_train_pred_lps = lps.predict(X_train)
X_test_pred_lps = lps.predict(X_test)

In [107]:
print(classification_report(y_train, X_train_pred_lps,zero_division=1,target_names=labels))
print("\n")

                     precision    recall  f1-score   support

            animals       1.00      0.00      0.00        12
     assassin theme       0.67      0.15      0.25       142
          character       1.00      0.00      0.00       115
    cloud mechanics       1.00      0.00      0.00        15
             combat       1.00      0.00      0.00       228
   customer support       1.00      0.00      0.00        68
         difficulty       1.00      0.00      0.00        33
entertainment value       1.00      0.03      0.06       179
             errors       0.71      0.20      0.31       122
        exploration       1.00      0.00      0.00       209
         frame rate       1.00      0.00      0.00        34
        full screen       1.00      0.00      0.00        24
        game length       1.00      0.00      0.00        12
           gameplay       1.00      0.00      0.00        15
           graphics       1.00      0.02      0.04       278
           movement    

The Label Powerset model has done the worst so far: no label reaches the 60% F1 pass mark, and the only label with an F1 score above 50% is 'pirate theme'.

In [108]:
print(classification_report(y_test, X_test_pred_lps,zero_division=1,target_names=labels))
print("\n")

                     precision    recall  f1-score   support

            animals       1.00      0.00      0.00         4
     assassin theme       0.31      0.10      0.16        48
          character       1.00      0.00      0.00        38
    cloud mechanics       1.00      0.00      0.00         5
             combat       1.00      0.00      0.00        76
   customer support       1.00      0.00      0.00        23
         difficulty       1.00      0.00      0.00        11
entertainment value       1.00      0.00      0.00        60
             errors       0.29      0.10      0.15        41
        exploration       1.00      0.00      0.00        70
         frame rate       1.00      0.00      0.00        11
        full screen       1.00      0.00      0.00         8
        game length       1.00      0.00      0.00         5
           gameplay       1.00      0.00      0.00         4
           graphics       1.00      0.00      0.00        92
           movement    

As with the train results, the test results are poor, with only the 'pirate theme' label barely passing the 60% F1 score mark. Overall, this model has performed the worst out of the three.

## Conclusion

With the F1 score as the main performance metric, our aim is a classifier that can correctly perform multi-label classification on as many labels as possible.

As such, we can conclude that the One Vs Rest Classifier using Logistic Regression performed the best, with 4 labels in the test set having an F1 score above 60%. Moreover, 2 of those labels could perform close to or above 80%.
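
The comparison used here (counting the labels whose F1 score clears the 60% mark) can be sketched as a small helper. `labels_passing` and the toy arrays are hypothetical; in the notebook one would pass `y_test`, a model's test predictions, and `labels`.

```python
import numpy as np
from sklearn.metrics import f1_score

def labels_passing(y_true, y_pred, label_names, threshold=0.6):
    """Return the labels whose per-label F1 score reaches the threshold."""
    # zero_division=0 so that undefined scores never count as a pass
    scores = f1_score(y_true, y_pred, average=None, zero_division=0)
    return [name for name, score in zip(label_names, scores) if score >= threshold]

# Toy indicator matrices with two labels (illustrative values)
toy_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
toy_pred = np.array([[1, 0], [1, 0], [0, 0], [1, 1]])

print(labels_passing(toy_true, toy_pred, ["graphics", "uplay"]))
```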

Therefore, we are able to use the One Vs Rest Classifier with Logistic Regression to identify reviews with the labels 'graphics', 'combat', 'pirate theme' and 'storyline' confidently. 

Ubisoft can then use this model to identify which specific topics people raised in their reviews. This can help the game developers identify and solve issues, or continue to provide what consumers want. The marketing and PR team can also use this information to release appropriate statements addressing these specific topics, showing that Ubisoft cares about its consumers.

## Future Improvements

There are many things that can be improved with this model. The future aim is for every label to reach a decent F1 score, so that the model can be used on new reviews to correctly identify the topics mentioned.

One way to improve the model is to use more data. Because our previous unsupervised model took a lot of time and money to run, we only ran a sample and the top weighted reviews through it rather than all the data. Future work would be to:
1) Run all the rows through the unsupervised model and label every one of them
2) Narrow down all the labels into the top used labels
3) Run the prompt through the unsupervised model again with a list of specific labels.