<div id="container" style="position:relative;">
<div style="float:left">

***Kazi Shahid***

***BrainStation Data Science Diploma Candidate***

***Capstone Project***

=============================================================

***Project SteamBuzz: Will Our Game Create a Buzz in the Steam community?***

***Part 5: Neural Network - Word2Vec***
</div>
<div style="position:relative; float:right"><img style="height:100px" src ="https://i.ibb.co/mcvpL4Z/Steam-Buzz-logo.png" />
</div>
</div>

---
# Overview

Our ML models have given us some insights so far. In this part of the project, we check whether an appropriate algorithm can surface further insights, at least at a high level. If we can learn which words strategy gamers associate with a given term or piece of jargon, we may uncover gameplay features or attributes for our game that we missed along the way.

---
# Algorithm for Word Associations - *Word2Vec*

[Word2Vec](https://en.wikipedia.org/wiki/Word2vec) is a technique for natural language processing published in 2013, constituting a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words by learning word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.

As the name implies, Word2Vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.
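
To make the cosine similarity measure concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hypothetical values invented for illustration (vectors learned by Word2Vec typically have many more dimensions), and `cosine_similarity` is a helper written for this example, not a Gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "orc": [0.9, 0.1, 0.3],
    "goblin": [0.8, 0.2, 0.4],
    "coupon": [-0.1, 0.9, -0.5],
}

sim_close = cosine_similarity(toy_vectors["orc"], toy_vectors["goblin"])
sim_far = cosine_similarity(toy_vectors["orc"], toy_vectors["coupon"])

print(f"orc vs goblin: {sim_close:.3f}")
print(f"orc vs coupon: {sim_far:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score near 0 or below.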

---
# Gensim Library

[Gensim](https://radimrehurek.com/gensim/intro.html) is a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible. It is designed to process raw, unstructured digital texts ("plain text") using unsupervised machine learning algorithms.

We will start this part of the project with installing the `gensim` library.

In [1]:
# Installing gensim library
!pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.1.2-cp38-cp38-win_amd64.whl (24.0 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 4.0.1
    Uninstalling gensim-4.0.1:
      Successfully uninstalled gensim-4.0.1
Successfully installed gensim-4.1.2


In [1]:
# Importing the necessary data analysis toolkits
import pandas as pd

# To display up to 100 characters of the content of each column of the dataframes
pd.set_option('display.max_colwidth', 100)

# Filtering out potential warnings
import warnings
warnings.filterwarnings('ignore')

---
# Loading the Text Dataset to Train our Model on

Now we will load the text dataset to train our Word2Vec model on. This is the cleaned up review text data from Part 3 of this project.

In [2]:
# Importing the dataset from its pickle file into a pandas DataFrame

###############################################################

## Option 1: Loading from local hard drive
## Uncomment the below (edit the path if necessary) and run

# df = pd.read_pickle("data/steam_review_strategy_cleaned.pkl")

###############################################################

## Option 2: Working on cloud (e.g., Google Colab/Pro/Pro+) and loading from the cloud (e.g., Google Drive)
## Uncomment the below (edit the path as necessary) and run

# from google.colab import drive
# drive.mount('/content/gdrive')
# df = pd.read_pickle("/content/gdrive/MyDrive/Colab Notebooks/steam_review_strategy_cleaned.pkl")

In [3]:
# Showing the top 5 rows of the dataset
print("Top 5 rows of the dataset:")
df.head(5)

Top 5 rows of the dataset:


| | title | review_text | recommended |
|---|---|---|---|
| 0 | Might & Magic(r) Heroes(r) VII | might magic hero v great might magic hero vi epic ♥♥♥♥ fulli releas month still game break bug t... | 0 |
| 1 | Galactic Civilizations III | hate write realli want like game stand game complet unplay bought game sale month releas end abl... | 0 |
| 2 | Game Dev Tycoon | recommend game anyon like awesom game | 1 |
| 3 | Banished | extrem bore almost varieri game year go littl noth happen would recommend anno 1404 1701 way int... | 0 |
| 4 | Elite Dangerous | care game peopl preorder still receiv actual game content got framework 400 billion star system ... | 0 |


In [4]:
# Displaying the total number of rows and columns in the dataset (with thousand separators)
print(f"Total number of rows in the strategy game review data: {len(df):,}")
print(f"Total number of columns in the strategy game review data: {len(df.columns):,}")

Total number of rows in the strategy game review data: 690,592
Total number of columns in the strategy game review data: 3


# Data Preprocessing

In order to feed the data into our model, which does not take a whole Pandas DataFrame, we need to process it into a proper form. We will create the corpus to train our model with.

In [5]:
corpus = df['review_text'].tolist()

## Tokenizing Review Text

Now we will tokenize the corpus.

In [6]:
# Importing NLTK library to aid in tokenizing
import nltk
nltk.download('punkt')

# Tokenizing each review into a list of word tokens
corpus = df['review_text'].apply(nltk.word_tokenize)

[nltk_data] Downloading package punkt to C:\Users\KTS-
[nltk_data]     TSK\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
# Printing the corpus
print("The corpus:")
corpus

The corpus:


0         [might, magic, hero, v, great, might, magic, hero, vi, epic, ♥♥♥♥, fulli, releas, month, still, ...
1         [hate, write, realli, want, like, game, stand, game, complet, unplay, bought, game, sale, month,...
2                                                                [recommend, game, anyon, like, awesom, game]
3         [extrem, bore, almost, varieri, game, year, go, littl, noth, happen, would, recommend, anno, 140...
4         [care, game, peopl, preorder, still, receiv, actual, game, content, got, framework, 400, billion...
                                                         ...                                                 
690587    [got, good, kink, still, playabl, great, everi, kink, could, rememb, iron, game, better, ever, s...
690588    [would, recommend, game, whole, thing, reli, skill, activ, opposit, realiti, small, number, play...
690589                 [coupon, 50, game, one, want, trade, alreadi, game, need, coupon, great, game, though]
690590    

We can see that the corpus is a pandas Series in which each entry is a list of tokens. Word2Vec accepts an iterable of token lists, so this is an appropriate form of input for the model.
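
As a rough illustration of the input shape, the sketch below builds a list of token lists from two cleaned review strings taken from the output above. It assumes a plain whitespace split is an adequate stand-in on text that has already been cleaned and stemmed. The notebook itself uses `nltk.word_tokenize`, and `simple_tokenize` is a hypothetical helper.

```python
# Two cleaned review strings copied from the dataset preview above
reviews = [
    "recommend game anyon like awesom game",
    "coupon 50 game one want trade alreadi game need coupon great game though",
]

def simple_tokenize(text):
    """Split already-cleaned text on whitespace (an assumption, not nltk.word_tokenize)."""
    return text.split()

# One list of string tokens per review: the shape expected by Word2Vec
sentences = [simple_tokenize(review) for review in reviews]

for tokens in sentences:
    print(tokens)
```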

# Word2Vec Model Creation

Now, using Word2Vec, we will train our model on the corpus of tokens we created above.

In [8]:
from gensim.models import Word2Vec

# Instantiating a Word2Vec object
## The input will be the corpus defined above, constituting all the review text
## Setting min_count to 5, i.e., only the words appearing at least 5 times in the corpus will be included in the model
w2v = Word2Vec(corpus, min_count=5)
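
To show what a `min_count` threshold does to the vocabulary, here is a small sketch using only the standard library. The toy sentences are hypothetical, the threshold is lowered to 2 so the effect is visible on so little data, and `vocabulary_after_min_count` is a helper written for this example. It mimics the filtering idea only and is not how Gensim is implemented.

```python
from collections import Counter

def vocabulary_after_min_count(sentences, min_count):
    """Keep tokens seen at least min_count times, most frequent first."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = [word for word, count in counts.items() if count >= min_count]
    # Sort by descending frequency, breaking ties alphabetically
    return sorted(kept, key=lambda word: (-counts[word], word))

# Hypothetical tokenized reviews, invented for illustration only
toy_sentences = [
    ["great", "game"],
    ["game", "bug"],
    ["great", "game", "crash"],
]

toy_vocab = vocabulary_after_min_count(toy_sentences, min_count=2)
print(toy_vocab)
```

Here "bug" and "crash" each occur once and are dropped. In the same way, rare tokens such as one-off misspellings fall out of the model's vocabulary when `min_count=5`.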

In [9]:
# Creating a list of all the unique words appearing at least 5 times in our corpus
vocab = w2v.wv.index_to_key

# Vocabulary Learned by the Model

Below is a collection of the vocabulary the model has learned:

In [11]:
print(vocab)



# Using Word2Vec Model Towards Solving Our Business Problem

Now that our Word2Vec model is trained, we return to our problem space. The ML models showed which words are most predictive of positive or negative sentiment, but the words at the top of those lists were mostly generic. Many reviewers stop at "this game is awful" without explaining why they found it awful. That extra information would have been valuable to developers, who could have addressed the underlying issues to make the game more appealing, turning negative sentiment positive or further boosting positive sentiment.

The Word2Vec model can help by revealing terms that gamers use in similar contexts in their reviews, and which the model therefore treats as closely associated. This gives us an extra layer of functionality.

For example, suppose we have included the Orcs as a playable race in the game we want to launch. Using our artistic freedom, we have made the Orcs look more like humans (brown skin and a "human"-like rather than "humanoid" build), perhaps to make them less scary and easier for gamers to relate to.

Now, if we search for the word "orc" (the stemmed form, since we applied stemming as part of the text clean-up process in Part 3 of this project), we get the following results:

In [12]:
# Showing word vectors (hence "wv") that are most similar to the word "orc"
w2v.wv.most_similar("orc")

[('undead', 0.6112743616104126),
 ('dwarv', 0.6008398532867432),
 ('vampir', 0.56852787733078),
 ('ork', 0.5491148829460144),
 ('greenskin', 0.5479283332824707),
 ('goblin', 0.5442926287651062),
 ('warmag', 0.5372846722602844),
 ('elf', 0.5229054093360901),
 ('necromanc', 0.5204229354858398),
 ('demon', 0.5193759202957153)]

As we can see, neither "human" nor any similar word appears among the ten nearest neighbours of "orc", and nothing resembling "brown skin" appears either. Instead we see words such as "greenskin", "undead", "goblin", "elf" and "demon". This:

- Gives an indication that our game may fail if we keep the Orcs human-like.
- Provides some suggestions as to what we can incorporate in our game (from the "Orcs" standpoint) that may boost positive sentiment towards our game.
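
The check described above can be sketched in a few lines. The neighbour list is copied from the `most_similar("orc")` output shown earlier, with scores rounded to four decimal places. The terms of interest and the helper `split_by_presence` are hypothetical choices made for this illustration.

```python
# Neighbours of "orc" copied from the notebook output, scores rounded to 4 d.p.
orc_neighbours = [
    ("undead", 0.6113), ("dwarv", 0.6008), ("vampir", 0.5685),
    ("ork", 0.5491), ("greenskin", 0.5479), ("goblin", 0.5443),
    ("warmag", 0.5373), ("elf", 0.5229), ("necromanc", 0.5204),
    ("demon", 0.5194),
]

def split_by_presence(neighbours, terms_of_interest):
    """Return (found, missing) for the terms we hoped to see near the query word."""
    neighbour_words = {word for word, _score in neighbours}
    found = [term for term in terms_of_interest if term in neighbour_words]
    missing = [term for term in terms_of_interest if term not in neighbour_words]
    return found, missing

found, missing = split_by_presence(orc_neighbours, ["human", "greenskin", "goblin"])
print("found:", found)
print("missing:", missing)
```

Note that this only inspects the ten neighbours returned by default, so a missing term means "not in the top ten" rather than "unrelated".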

Similarly, moving from race (the Orcs) to gameplay elements, we can look up artillery, a feature common in war and fantasy games. The model returns the following words in the same space:

In [13]:
w2v.wv.most_similar("artilleri")

[('infantri', 0.8367178440093994),
 ('artillari', 0.83668053150177),
 ('mortar', 0.8122435212135315),
 ('arti', 0.7918648719787598),
 ('antiair', 0.7607645988464355),
 ('artileri', 0.757856547832489),
 ('bomber', 0.7544510364532471),
 ('airstrik', 0.7522616386413574),
 ('tank', 0.7426561713218689),
 ('cannon', 0.7365334033966064)]

Supply or resource collection and utilization have historically been a major feature in strategy games. If we want to see which gameplay attributes or features strategy gamers associate with "suppli" (the stemmed form of "supplies"), we find the following:

In [14]:
w2v.wv.most_similar("suppli")

[('fuel', 0.6948274970054626),
 ('resourc', 0.6788396239280701),
 ('replenish', 0.6741318106651306),
 ('manpow', 0.6686011552810669),
 ('stockpil', 0.6582539677619934),
 ('food', 0.653714656829834),
 ('ressourc', 0.6525422930717468),
 ('supli', 0.6484736800193787),
 ('resuppli', 0.6381763815879822),
 ('ammunit', 0.6169068217277527)]

What do strategy gamers think of when they hear "guns"?

In [17]:
w2v.wv.most_similar("gun")

[('gunner', 0.7430365085601807),
 ('pistol', 0.7331417202949524),
 ('weapon', 0.7299442291259766),
 ('minigun', 0.7134132385253906),
 ('rifl', 0.7053825855255127),
 ('recoil', 0.7041664123535156),
 ('smg', 0.6933473944664001),
 ('bipod', 0.6868212223052979),
 ('flamethrow', 0.679459273815155),
 ('handgun', 0.679399311542511)]

---
# Conclusion

A Word2Vec model opens up many such possibilities for addressing our business problem. With suggestions like those seen above, we can plan our game development work and add, remove or modify gameplay features, with the aim of drawing positive sentiment from our target strategy gamers.