# Video Game Playability Analysis Based on Players’ Reviews with PySpark

## Big Data Computing final project - A.Y. 2022-2023

Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author

Ilaria De Sio - [desio.2064970@studenti.uniroma1.it](mailto:desio.2064970@studenti.uniroma1.it)

The project is based on the paper *A Data-Driven Approach for Video Game
Playability Analysis Based on Players’ Reviews*. In this case study, playability
is defined through three basic concepts, **functionality**, **usability**, and
**gameplay**, taken from the *framework of Paavilainen*.

The goal is to obtain an explicit and simplified framework that gives an
intuitive, quantified assessment of the overall playability of the chosen game
and also exposes its positive and negative aspects. To do so, reviews are
classified as “playability-informative” or “non-playability-informative”, and
the informative ones are divided into the three classes listed above.

## Define some global constants

## Import PySpark packages and other dependencies

In [1]:
!pip install pyspark



In [2]:
import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
# Create the session
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '4G').\
                set('spark.driver.memory', '45G').\
                set('spark.driver.maxResultSize', '10G').\
                setAppName("PySparkTutorial").\
                setMaster("local[*]")

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/06/16 15:03:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## 1. Dataset Initialization
I chose the dataset [https://doi.org/10.6084/m9.figshare.14222531.v1](https://doi.org/10.6084/m9.figshare.14222531.v1), provided directly by the authors of the paper, which contains the Steam review data for **No Man’s Sky** that the paper analyzes in terms of playability.
This case study is interesting because the game was released in 2016, after social media “hype” had raised expectations to an unprecedented level.
The release itself was disastrous, but over the following four years the game has been continuously maintained and its quality has gradually improved, which makes it a unique case where changes in game quality are observable.



In [4]:
game_dataset=pd.read_csv("/Users/ilariadesio/Desktop/Computerscience/Firstyear/Secondsemester/BigData/Projects/Video_Game_Playability_Analysis/input/data_clean.csv")
game_dataset.head()

Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_last_played
0,70427607,english,This game has the elements of many games sewn ...,2020-06-07 07:33:21,2020-06-07 07:33:21,True,0,0,0.0,0,False,False,False,143,1,14368,1041,2020-06-02 07:05:32
1,70426209,english,game is k. random gen from presets. no voice a...,2020-06-07 06:48:46,2020-06-07 06:48:46,True,0,0,0.0,0,False,False,False,133,5,3927,51,2020-06-02 11:31:12
2,70425814,english,I first played 2 years ago and it was fun for ...,2020-06-07 06:35:34,2020-06-07 06:35:34,True,0,0,0.0,0,True,False,False,670,31,1942,987,2020-06-07 06:21:38
3,70425169,english,I wasn't sure if I'd like this--survival stuff...,2020-06-07 06:13:16,2020-06-07 06:13:16,True,0,0,0.0,0,False,False,False,98,37,121,121,2020-06-07 01:31:17
4,70425032,english,"This is an amazing game, Where to start?\r\nYo...",2020-06-07 06:08:28,2020-06-07 06:08:28,True,0,0,0.0,0,True,False,False,16,6,5887,5887,2020-06-07 07:18:45


## 1.1 Dataset Shape and Schema

The dataset contains approximately 100k records (99,993 rows) of Steam reviews.


* ```recommendationid```: The review ID;
* ```language```: Review language;
* ```review```: The text of the user review;
* ```timestamp_created```: The date the review was posted;
* ```timestamp_updated```: The date the review was last updated;
* ```voted_up```: True means it was a positive recommendation;
* ```votes_up```: The number of other users who found this review helpful;
* ```votes_funny```: The number of other users who found this review funny;
* ```weighted_vote_score```: Helpfulness score;
* ```comment_count```: The number of comments on the review;
* ```steam_purchase```: Whether the game was purchased on Steam;
* ```received_for_free```: Whether the game was received for free;
* ```written_during_early_access```: Whether the review was written during early access;
* ```author_num_games_owned```: Number of games owned by the author;
* ```author_num_reviews```: Number of reviews written by the author;
* ```author_playtime_forever```: Total time played by the author (reported by Steam in minutes);
* ```author_playtime_last_two_weeks```: Time played by the author in the last two weeks (reported by Steam in minutes);
* ```author_last_played```: The date the author last played the game.

-------





In [5]:
print("The shape of the dataset is {:d} rows by {:d} columns".format(game_dataset.shape[0], game_dataset.shape[1]))

The shape of the dataset is 99993 rows by 18 columns


In [6]:
game_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99993 entries, 0 to 99992
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   recommendationid                99993 non-null  int64  
 1   language                        99993 non-null  object 
 2   review                          99957 non-null  object 
 3   timestamp_created               99993 non-null  object 
 4   timestamp_updated               99993 non-null  object 
 5   voted_up                        99993 non-null  bool   
 6   votes_up                        99993 non-null  int64  
 7   votes_funny                     99993 non-null  int64  
 8   weighted_vote_score             99993 non-null  float64
 9   comment_count                   99993 non-null  int64  
 10  steam_purchase                  99993 non-null  bool   
 11  received_for_free               99993 non-null  bool   
 12  written_during_early_access     

## 2. Data Preprocessing
This phase involves cleaning and transforming the raw data to ensure its quality and compatibility with the analysis.



Convert the columns ```timestamp_created``` and ```timestamp_updated``` to datetime.

In [7]:
# Convert to datetime
game_dataset['timestamp_created'] = pd.to_datetime(game_dataset['timestamp_created'])
game_dataset['timestamp_updated'] = pd.to_datetime(game_dataset['timestamp_updated'])
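
As a hedged variation on the conversion above, the parsing can be made stricter by passing an explicit format and `errors="coerce"`, so that malformed timestamps become `NaT` instead of raising an exception. The toy values and the format string below are assumptions based on the sample rows shown earlier; they are not part of the original notebook.

```python
import pandas as pd

# Toy timestamps (assumption: same layout as the sample rows, plus one malformed value)
raw = pd.Series(["2020-06-07 07:33:21", "2020-06-07 06:48:46", "not a date"])

# An explicit format avoids ambiguous parsing; errors="coerce" turns bad values into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Count the values that could not be parsed
n_invalid = int(parsed.isna().sum())

print(parsed)
print("Unparseable timestamps:", n_invalid)
```

Counting the `NaT` values after the conversion gives a quick check that no timestamp was silently lost.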

## 2.1 Data Cleaning

From the data info above, we can already see that there are missing values in ```review```. Since our work relies heavily on this column, we have to remove the rows where it is missing. In addition, we need to check for duplicated values, following the standard data cleaning procedure.

In [8]:
game_dataset[game_dataset['review'].isna()]


Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_last_played
5115,66456723,english,,2020-04-02 22:35:41,2020-04-02 22:35:41,True,0,0,0.0,0,True,False,False,29,1,28002,4295,2020-06-07 06:16:39
5191,66346015,english,,2020-04-01 14:17:30,2020-04-01 14:17:30,True,0,0,0.0,0,False,False,False,318,3,9758,0,2020-04-30 03:09:27
5232,66276955,english,,2020-03-31 18:23:38,2020-03-31 18:23:38,True,0,0,0.0,0,True,False,False,91,3,6001,0,2020-03-31 18:45:40
5541,65678049,english,,2020-03-24 03:33:57,2020-03-24 03:33:57,True,0,0,0.0,0,True,False,False,240,5,8316,0,2020-04-26 03:39:39
5636,65449433,english,,2020-03-21 05:37:26,2020-03-21 05:37:26,True,0,0,0.0,0,True,False,False,109,6,601,0,2020-03-21 16:45:18
5865,65060556,english,,2020-03-15 02:27:56,2020-03-15 02:27:56,True,0,0,0.0,0,True,False,False,38,3,1292,74,2020-05-26 20:24:50
6190,64664205,english,,2020-03-07 18:10:21,2020-03-07 18:10:21,True,0,0,0.0,0,False,False,False,33,2,4114,0,2020-03-17 19:50:32
6618,64241196,english,,2020-02-28 12:35:57,2020-02-28 12:35:57,True,1,0,0.525862,0,True,False,False,22,1,3288,46,2020-06-03 23:05:22
6675,64210477,english,,2020-02-27 20:27:52,2020-02-27 20:27:52,True,0,0,0.0,0,True,False,False,58,1,5675,0,2020-04-09 01:00:06
6840,64111598,english,,2020-02-25 19:02:33,2020-02-25 19:02:33,True,0,0,0.0,0,True,False,False,57,11,6720,0,2020-04-20 15:29:47


In [9]:
game_dataset.isna().sum()

In [10]:
# Drop rows with missing reviews (only the 'review' column is checked)
game_dataset.dropna(subset=['review'], inplace=True)

# Sanity check
game_dataset.isna().sum()

recommendationid                  0
language                          0
review                            0
timestamp_created                 0
timestamp_updated                 0
voted_up                          0
votes_up                          0
votes_funny                       0
weighted_vote_score               0
comment_count                     0
steam_purchase                    0
received_for_free                 0
written_during_early_access       0
author_num_games_owned            0
author_num_reviews                0
author_playtime_forever           0
author_playtime_last_two_weeks    0
author_last_played                0
dtype: int64
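
The difference between dropping rows on any missing value and dropping them only when the review text is missing can be seen on a small made-up frame. This is only a sketch: the toy data below is invented for illustration and does not come from the dataset.

```python
import pandas as pd

# Toy data (made up): one missing review, one missing value in another column
toy = pd.DataFrame({
    "review": ["great game", None, "boring"],
    "votes_up": [3, 1, None],
})

# Dropping on every column removes both incomplete rows
all_cols = toy.dropna()

# Restricting the check to 'review' removes only the row whose text is missing
review_only = toy.dropna(subset=["review"])

print(len(all_cols), len(review_only))
```

Restricting the check to the column the analysis depends on avoids discarding reviews whose text is intact.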

In [11]:
game_dataset.count()

recommendationid                  99957
language                          99957
review                            99957
timestamp_created                 99957
timestamp_updated                 99957
voted_up                          99957
votes_up                          99957
votes_funny                       99957
weighted_vote_score               99957
comment_count                     99957
steam_purchase                    99957
received_for_free                 99957
written_during_early_access       99957
author_num_games_owned            99957
author_num_reviews                99957
author_playtime_forever           99957
author_playtime_last_two_weeks    99957
author_last_played                99957
dtype: int64

Rows with null values have been deleted correctly; the dataset now has 99957 rows.
Now let's check for duplicates.

In [12]:
game_dataset.duplicated().sum()

0

It seems that there are no duplicated rows. But are there duplicated reviews?

In [13]:
game_dataset.duplicated(subset='review').sum()

3903

In [14]:
game_dataset[game_dataset.duplicated(subset='review',keep=False)].sample(10)

Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_last_played
60353,25270550,english,terrible,2016-08-31 15:41:26,2016-08-31 15:41:26,False,12,0,0.46483,0,True,False,False,51,1,4263,0,2017-09-07 19:03:59
23966,48664965,english,.,2019-01-31 11:23:23,2019-01-31 11:23:23,True,2,0,0.530201,0,True,False,False,221,2,3603,0,2019-08-23 12:29:53
54131,25913113,english,This game sucks,2016-10-07 22:21:48,2016-10-07 22:21:48,False,5,1,0.475396,0,True,True,False,164,3,312,0,2018-08-04 19:47:52
96424,24856048,english,ass,2016-08-12 20:29:23,2016-08-12 20:29:23,False,3,0,0.541667,0,False,False,False,200,6,133,0,2018-08-03 21:49:11
59561,25290753,english,:),2016-09-01 18:49:24,2016-09-01 18:49:24,False,5,0,0.495651,0,True,False,False,59,1,83,0,2016-08-17 19:37:37
7612,63718641,english,it's good,2020-02-18 11:34:16,2020-02-18 11:34:16,True,2,0,0.504505,0,False,False,False,587,11,1251,0,2020-05-12 09:33:54
36558,43738867,english,Cool,2018-07-25 15:18:20,2018-07-25 15:18:20,True,1,0,0.0,0,True,False,False,5,5,7138,0,2020-04-25 15:09:29
46703,29147995,english,rubbish,2017-01-09 19:42:36,2017-01-09 19:42:36,False,1,0,0.498247,0,True,False,False,219,3,3433,0,2019-09-12 13:12:33
1613,68709052,english,its fun\r\n,2020-05-07 03:24:45,2020-05-07 03:24:45,True,0,0,0.0,0,True,False,False,1,1,1685,0,2020-05-08 02:54:34
52310,26103579,english,balls,2016-10-18 16:56:06,2016-10-18 16:56:06,False,1,0,0.52381,0,False,False,False,101,13,133,0,2016-08-17 05:29:51


As we can see, the duplicates are not copies of the same record: they are mostly very short reviews, such as 'good' or 'amazing', that different entries happen to word identically. These reviews are still important for our classification task, so we will not drop them.
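
A minimal sketch of how this observation could be checked, assuming a toy frame with made-up rows: `keep=False` marks every member of a duplicated group, and the counts and lengths of the duplicated texts show whether they are short, generic reviews.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "recommendationid": [1, 2, 3, 4, 5],
    "review": ["Cool", "terrible", "Cool", "A long, detailed review", "terrible"],
})

# keep=False flags every member of a duplicated group, not only the later copies
dup_mask = toy.duplicated(subset="review", keep=False)

# How often each duplicated text occurs, and how long the duplicated texts are
dup_counts = toy.loc[dup_mask, "review"].value_counts()
dup_lengths = toy.loc[dup_mask, "review"].str.len()

print(dup_counts.to_dict())
print("Longest duplicated review:", dup_lengths.max(), "characters")
```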

Note, however, that some reviews consist only of special characters. These reviews should be removed, because smileys and special characters carry little information and can have multiple or ambiguous meanings, which makes accurate interpretation difficult.

In [None]:
import re

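# Define a function to remove special characters from a string
def remove_special_characters(text):
    pattern = r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

# Apply the function to the 'review' column
game_dataset['review'] = game_dataset['review'].apply(remove_special_characters)

# Drop reviews left empty after cleaning (they contained only special characters)
game_dataset = game_dataset[game_dataset['review'].str.strip() != '']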
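
An alternative, sketched here under the assumption that only symbol-only reviews should be discarded while the other texts are left untouched, is to filter with a predicate instead of rewriting every review. The helper name `is_symbol_only` is hypothetical, and the sample strings are taken from the duplicated reviews shown earlier.

```python
import re

# Matches strings that contain no ASCII letter or digit at all
NO_ALNUM = re.compile(r"[^a-zA-Z0-9]*")

def is_symbol_only(text):
    """Return True when the review has no letters or digits (e.g. '.', ':)')."""
    return NO_ALNUM.fullmatch(text) is not None

reviews = [".", ":)", "it's good", "its fun\r\n", ""]

# Keep only the reviews that contain at least one letter or digit
kept = [r for r in reviews if not is_symbol_only(r)]
print(kept)
```

With pandas, the same predicate could be applied through `Series.apply` and used as a boolean mask.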

### 2.2.1 First Hypothesis
**Does there exist a correlation between the number of hours a person played a game and the sentiment of the review?**
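
One simple way to approach this question, shown here only as a sketch on made-up numbers, is to compute the Pearson correlation between playtime and the 0/1 recommendation flag (for a binary variable this is the point-biserial correlation) and to compare the median playtime of the two groups. The toy values below are invented and say nothing about the real dataset.

```python
import pandas as pd

# Toy data (made up for illustration; not taken from the dataset)
toy = pd.DataFrame({
    "author_playtime_forever": [30, 120, 600, 2400, 5000, 90],
    "voted_up": [False, False, True, True, True, False],
})

# Pearson correlation between playtime and the 0/1 recommendation flag
r = toy["author_playtime_forever"].corr(toy["voted_up"].astype(int))

# Median playtime per sentiment group is more robust to extreme playtimes
medians = toy.groupby("voted_up")["author_playtime_forever"].median()

print("Correlation:", round(r, 3))
print(medians)
```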

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Initialize the Spark session
spark = SparkSession.builder.getOrCreate()

# Convert the pandas DataFrame into a Spark DataFrame
df = spark.createDataFrame(game_dataset)

# Assumption: the original cell never defines 'label'; here the recommendation
# flag 'voted_up' is used as the binary label (1.0 = positive, 0.0 = negative)
df = df.withColumn('label', col('voted_up').cast('double'))

# Define the tokenizer, stop-word remover and vectorizer
tokenizer = RegexTokenizer(inputCol='review', outputCol='words', pattern='\\W')
remover = StopWordsRemover(inputCol='words', outputCol='filtered_words')
vectorizer = CountVectorizer(inputCol='filtered_words', outputCol='features')

# Define the classification model
classifier = NaiveBayes(featuresCol='features', labelCol='label')

# Build the pipeline for preprocessing and classification
pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, classifier])

# Split the dataset into a training set and a test set
train_data, test_data = df.randomSplit([0.8, 0.2], seed=123)

# Train the model on the training set
model = pipeline.fit(train_data)

# Make predictions on the test set
predictions = model.transform(test_data)

# Evaluate the model with a binary classification evaluator
# (its default metric is the area under the ROC curve, not accuracy)
evaluator = BinaryClassificationEvaluator(labelCol='label')
auc = evaluator.evaluate(predictions)

# Select the reviews predicted as positive and convert them to a pandas DataFrame
positive_df = predictions.filter(predictions.prediction == 1).select('review').toPandas()

# Select the reviews predicted as negative and convert them to a pandas DataFrame
negative_df = predictions.filter(predictions.prediction == 0).select('review').toPandas()

# Print the pandas DataFrames with the positive and negative reviews
print("Positive reviews:")
print(positive_df)

print("Negative reviews:")
print(negative_df)
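
To make the first three pipeline stages more concrete, the following plain-Python sketch approximates what tokenization, stop-word removal and term counting do to a single review. It is not the Spark implementation: the stop-word set is a tiny assumed list (Spark uses a much longer default one), the function names are hypothetical, and `CountVectorizer` additionally maps each term to an index in a vocabulary learned from the training data.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; not Spark's default list)
STOP_WORDS = {"the", "is", "a", "this", "it", "and"}

def tokenize(text):
    # Lowercase and split on non-word characters, dropping empty tokens
    return [tok for tok in re.split(r"\W", text.lower()) if tok]

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop-word list
    return [tok for tok in tokens if tok not in STOP_WORDS]

def count_terms(tokens):
    # Bag of words: term -> number of occurrences in the review
    return Counter(tokens)

review = "This game is fun, and the fun never stops"
counts = count_terms(remove_stop_words(tokenize(review)))
print(counts)
```

The resulting counts are the kind of non-negative features a multinomial Naive Bayes classifier works with.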


## 2.2 Data Exploration
In this phase I will analyze different hypotheses of correlation between the variables, to test whether or not they are actually correlated as each hypothesis suggests.