# Video Game Playability Analysis Based on Players’ Reviews with PySpark

## Big Data Computing final project - A.Y. 2022-2023

Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author

Ilaria De Sio - [desio.2064970@studenti.uniroma1.it](mailto:desio.2064970@studenti.uniroma1.it)

The project is based on the paper *A Data-Driven Approach for Video Game
Playability Analysis Based on Players’ Reviews*. In this case study, playability
is defined through three basic concepts, **functionality**, **usability**, and
**gameplay**, as set out in the *framework of Paavilainen*.

The goal is to obtain an explicit and simplified framework that not only gives an
intuitive, quantified assessment of the overall playability of the chosen game,
but also makes its positive and negative aspects visible. At the same time,
review content is classified as “playability-informative” or
“non-playability-informative” and divided into the classes listed above.

## Define some global constants

In [9]:
# GITHUB
DATASET_URL = "https://raw.githubusercontent.com/iladesio/iladesio/Video_Game_Playability_Analysis/tree/main/dataset/data_clean.csv"

# GOOGLE DRIVE
GDRIVE_DIR = "/content/gdrive" # Your own mount point on Google Drive
GDRIVE_HOME_DIR = GDRIVE_DIR  # Your own home directory

# File Variable
GDRIVE_DATASET_FILE = GDRIVE_HOME_DIR + "/" + DATASET_URL.split("/")[-1]
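
As a quick check of how the file path above is derived, the sketch below repeats the same expression on the URL and prints the result. Using `posixpath.join` instead of manual string concatenation is a substitution made here for illustration, not part of the notebook.

```python
import posixpath

# Same values as the constants defined above
DATASET_URL = "https://raw.githubusercontent.com/iladesio/iladesio/Video_Game_Playability_Analysis/tree/main/dataset/data_clean.csv"
GDRIVE_HOME_DIR = "/content/gdrive"

# The last URL segment is the file name
file_name = DATASET_URL.split("/")[-1]

# Joining with posixpath avoids hand-written "/" separators
dataset_file = posixpath.join(GDRIVE_HOME_DIR, file_name)
print(dataset_file)  # /content/gdrive/data_clean.csv
```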

## Import PySpark packages and other dependencies

In [None]:
!pip install pyspark

In [11]:
import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [12]:
# Create the session
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '4G').\
                set('spark.driver.memory', '45G').\
                set('spark.driver.maxResultSize', '10G').\
                setAppName("PySparkTutorial").\
                setMaster("local[*]")

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/06/15 22:16:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## 1.  Dataset initialization
I chose the dataset [https://doi.org/10.6084/m9.figshare.14222531.v1](https://doi.org/10.6084/m9.figshare.14222531.v1), provided directly by the authors of the paper. It contains the Steam reviews of **No Man’s Sky** that the paper analyzes in terms of playability.
This case study is particularly interesting because the game was released in 2016, after social media “hype” had raised expectations to an unprecedented level.
Against those expectations, the release was disastrous. Over the following four years, however, the
game has been continuously maintained and its quality has gradually improved, which makes it a unique case where changes in game quality are observable.



In [13]:
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google.colab'

In [None]:
game_dataset = pd.read_csv("drive/My Drive/data_clean.csv")
game_dataset.head()
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import DoubleType
from textblob import TextBlob

# Initialize the Spark session
spark = SparkSession.builder.getOrCreate()

# Convert the pandas DataFrame into a Spark DataFrame
# (rows without a review are dropped so the column holds strings only)
df = spark.createDataFrame(game_dataset.dropna(subset=["review"]))

# Sentiment analysis with TextBlob: polarity ranges from -1 to 1
def analyze_sentiment(text):
    if text is None:
        return None
    return float(TextBlob(text).sentiment.polarity)

# Register the function as a UDF (User-Defined Function) that returns a double
spark.udf.register("analyze_sentiment_udf", analyze_sentiment, DoubleType())

# Apply the sentiment UDF to the review column
df_with_sentiment = df.withColumn("sentiment_score", expr("analyze_sentiment_udf(review)"))

# Keep the rows with a positive sentiment score and convert them to pandas
positive_df = df_with_sentiment.filter(df_with_sentiment.sentiment_score > 0).select('review').toPandas()

# Keep the rows with a negative sentiment score and convert them to pandas
negative_df = df_with_sentiment.filter(df_with_sentiment.sentiment_score < 0).select('review').toPandas()

# Print the pandas DataFrames with the positive and negative reviews
print("Positive reviews:")
print(positive_df)

print("Negative reviews:")
print(negative_df)


Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_last_played
0,70427607,english,This game has the elements of many games sewn ...,2020-06-07 07:33:21,2020-06-07 07:33:21,True,0,0,0.0,0,False,False,False,143,1,14368,1041,2020-06-02 07:05:32
1,70426209,english,game is k. random gen from presets. no voice a...,2020-06-07 06:48:46,2020-06-07 06:48:46,True,0,0,0.0,0,False,False,False,133,5,3927,51,2020-06-02 11:31:12
2,70425814,english,I first played 2 years ago and it was fun for ...,2020-06-07 06:35:34,2020-06-07 06:35:34,True,0,0,0.0,0,True,False,False,670,31,1942,987,2020-06-07 06:21:38
3,70425169,english,I wasn't sure if I'd like this--survival stuff...,2020-06-07 06:13:16,2020-06-07 06:13:16,True,0,0,0.0,0,False,False,False,98,37,121,121,2020-06-07 01:31:17
4,70425032,english,"This is an amazing game, Where to start?\r\nYo...",2020-06-07 06:08:28,2020-06-07 06:08:28,True,0,0,0.0,0,True,False,False,16,6,5887,5887,2020-06-07 07:18:45


In [None]:
print(type(game_dataset))

<class 'pandas.core.frame.DataFrame'>


## 1.1 Dataset Shape and Schema

The dataset contains approximately 100k Steam reviews (99,993 rows before cleaning).


* ```recommendationid```: The review ID;
* ```language```: Review language;
* ```review```: The text of the user review;
* ```timestamp_created```: The date the review was posted;
* ```timestamp_updated```: The date the review was last updated;
* ```voted_up```: True means it was a positive recommendation;
* ```votes_up```: The number of other users who found this review helpful;
* ```votes_funny```: The number of other players who found the review funny;
* ```weighted_vote_score```: Helpfulness score;
* ```comment_count```: The number of comments other players left on the review;
* ```steam_purchase```: Whether the game was purchased on Steam;
* ```received_for_free```: Whether the game was received for free;
* ```written_during_early_access```: Whether the review was written while the game was in Early Access;
* ```author_num_games_owned```: Number of games owned by the author;
* ```author_num_reviews```: Number of reviews written by the author;
* ```author_playtime_forever```: Total playtime of the author (Steam reports this value in minutes);
* ```author_playtime_last_two_weeks```: Playtime of the author in the last two weeks (Steam reports this value in minutes);
* ```author_last_played```: The last time the author played the game.
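
To make the schema concrete, here is a minimal sketch of how a few of these columns are typed when read with pandas. The two rows are made up for illustration and are not taken from the dataset; `parse_dates` is used so the timestamp arrives as a datetime instead of a plain string.

```python
import io
import pandas as pd

# Tiny made-up sample that mimics a few columns of the schema above
sample_csv = io.StringIO(
    "recommendationid,review,timestamp_created,voted_up,author_playtime_forever\n"
    "1,fun game,2020-06-07 07:33:21,True,14368\n"
    "2,,2016-08-14 12:03:37,False,121\n"
)

# parse_dates converts the listed columns to datetime while reading
sample = pd.read_csv(sample_csv, parse_dates=["timestamp_created"])

# Inspect the inferred column types and count the missing reviews
print(sample.dtypes)
print(sample["review"].isna().sum())
```

An empty `review` field is read as a missing value, which is the case the data cleaning step has to handle.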

-------





In [None]:
print("The shape of the dataset is {:d} rows by {:d} columns".format(game_dataset.shape[0], game_dataset.shape[1]))

The shape of the dataset is 99993 rows by 18 columns


In [None]:
game_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99993 entries, 0 to 99992
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   recommendationid                99993 non-null  int64  
 1   language                        99993 non-null  object 
 2   review                          99957 non-null  object 
 3   timestamp_created               99993 non-null  object 
 4   timestamp_updated               99993 non-null  object 
 5   voted_up                        99993 non-null  bool   
 6   votes_up                        99993 non-null  int64  
 7   votes_funny                     99993 non-null  int64  
 8   weighted_vote_score             99993 non-null  float64
 9   comment_count                   99993 non-null  int64  
 10  steam_purchase                  99993 non-null  bool   
 11  received_for_free               99993 non-null  bool   
 12  written_during_early_access     

## 1.2 Data Cleaning

From the data info above, we can already see that the ```review``` column has missing values. Since our work relies heavily on this column, we have to remove the rows where it is missing. In addition, following the standard data cleaning procedure, we also need to check for duplicates.



In [None]:
game_dataset[game_dataset['review'].isna()]


Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_last_played
5115,66456723,english,,2020-04-02 22:35:41,2020-04-02 22:35:41,True,0,0,0.0,0,True,False,False,29,1,28002,4295,2020-06-07 06:16:39
5191,66346015,english,,2020-04-01 14:17:30,2020-04-01 14:17:30,True,0,0,0.0,0,False,False,False,318,3,9758,0,2020-04-30 03:09:27
5232,66276955,english,,2020-03-31 18:23:38,2020-03-31 18:23:38,True,0,0,0.0,0,True,False,False,91,3,6001,0,2020-03-31 18:45:40
5541,65678049,english,,2020-03-24 03:33:57,2020-03-24 03:33:57,True,0,0,0.0,0,True,False,False,240,5,8316,0,2020-04-26 03:39:39
5636,65449433,english,,2020-03-21 05:37:26,2020-03-21 05:37:26,True,0,0,0.0,0,True,False,False,109,6,601,0,2020-03-21 16:45:18
5865,65060556,english,,2020-03-15 02:27:56,2020-03-15 02:27:56,True,0,0,0.0,0,True,False,False,38,3,1292,74,2020-05-26 20:24:50
6190,64664205,english,,2020-03-07 18:10:21,2020-03-07 18:10:21,True,0,0,0.0,0,False,False,False,33,2,4114,0,2020-03-17 19:50:32
6618,64241196,english,,2020-02-28 12:35:57,2020-02-28 12:35:57,True,1,0,0.525862,0,True,False,False,22,1,3288,46,2020-06-03 23:05:22
6675,64210477,english,,2020-02-27 20:27:52,2020-02-27 20:27:52,True,0,0,0.0,0,True,False,False,58,1,5675,0,2020-04-09 01:00:06
6840,64111598,english,,2020-02-25 19:02:33,2020-02-25 19:02:33,True,0,0,0.0,0,True,False,False,57,11,6720,0,2020-04-20 15:29:47


In [None]:
game_dataset.isna().sum()

recommendationid                   0
language                           0
review                            36
timestamp_created                  0
timestamp_updated                  0
voted_up                           0
votes_up                           0
votes_funny                        0
weighted_vote_score                0
comment_count                      0
steam_purchase                     0
received_for_free                  0
written_during_early_access        0
author_num_games_owned             0
author_num_reviews                 0
author_playtime_forever            0
author_playtime_last_two_weeks     0
author_last_played                 0
dtype: int64

In [None]:
# Drop rows with missing reviews
game_dataset.dropna(subset=['review'], inplace=True)

# Sanity check
game_dataset.isna().sum()

recommendationid                  0
language                          0
review                            0
timestamp_created                 0
timestamp_updated                 0
voted_up                          0
votes_up                          0
votes_funny                       0
weighted_vote_score               0
comment_count                     0
steam_purchase                    0
received_for_free                 0
written_during_early_access       0
author_num_games_owned            0
author_num_reviews                0
author_playtime_forever           0
author_playtime_last_two_weeks    0
author_last_played                0
dtype: int64

In [None]:
game_dataset.count()

recommendationid                  99957
language                          99957
review                            99957
timestamp_created                 99957
timestamp_updated                 99957
voted_up                          99957
votes_up                          99957
votes_funny                       99957
weighted_vote_score               99957
comment_count                     99957
steam_purchase                    99957
received_for_free                 99957
written_during_early_access       99957
author_num_games_owned            99957
author_num_reviews                99957
author_playtime_forever           99957
author_playtime_last_two_weeks    99957
author_last_played                99957
dtype: int64

The rows with null values have been removed correctly; 99957 rows remain.
Now let's check for duplicates.

In [None]:
game_dataset.duplicated().sum()

0

It seems that there are no duplicated rows. But are there duplicated reviews?

In [None]:
game_dataset.duplicated(subset='review').sum()

3903

As the sample below shows, these are not copies of the same review but short, generic comments (such as "good" or "cool") that different players happened to write identically.

In [None]:
game_dataset[game_dataset.duplicated(subset='review',keep=False)].sample(10)

Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_last_played
7870,63359369,english,good,2020-02-11 22:49:20,2020-02-11 22:49:20,True,0,0,0.0,0,True,False,False,50,8,4171,0,2020-05-14 23:52:43
3931,67094213,english,cool,2020-04-11 13:37:17,2020-04-11 13:37:17,True,0,0,0.0,0,True,False,False,68,10,661,0,2020-04-30 10:42:08
61280,25250624,english,nice,2016-08-30 15:01:21,2020-01-26 15:39:01,True,1,1,0.52381,0,True,False,False,154,5,2205,0,2020-01-26 17:50:43
58795,25316553,english,gg,2016-09-03 02:34:36,2016-09-03 02:34:36,False,1,1,0.502488,0,False,False,False,37,1,1839,0,2020-03-21 01:04:18
12087,59470832,english,fun game,2019-12-08 05:00:34,2019-12-08 05:00:34,True,0,0,0.0,0,True,False,False,136,4,6543,0,2020-04-12 02:14:45
9344,62001007,english,it good,2020-01-17 23:12:48,2020-01-17 23:12:48,True,1,0,0.52381,0,True,False,False,177,10,285,0,2020-05-06 00:54:33
812,69794406,english,nice game,2020-05-25 18:32:39,2020-05-25 18:32:39,True,0,0,0.479655,0,False,False,False,27,5,4322,2784,2020-06-02 18:20:31
82754,24909840,english,love it.,2016-08-14 12:03:37,2016-08-14 12:03:37,True,0,0,0.432099,0,True,False,False,84,4,4551,0,2020-05-08 19:05:55
6697,64201403,english,its fun,2020-02-27 16:15:41,2020-02-27 16:15:41,True,0,0,0.0,0,False,False,False,59,10,1954,0,2020-02-29 18:32:39
13745,58536748,english,Cool,2019-11-29 08:33:05,2019-11-29 08:33:05,True,1,0,0.487388,0,True,False,False,169,1,1074,0,2019-11-29 08:51:55
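
The sample contains both "cool" and "Cool", which `duplicated` treats as different strings. The sketch below, on made-up reviews, contrasts exact matching with matching after lowercasing and trimming whitespace. The normalisation step is an assumption added for illustration and is not applied in the notebook.

```python
import pandas as pd

# Made-up reviews in the spirit of the sample above
reviews = pd.DataFrame({
    "review": ["cool", "Cool", "good", "good", " nice game ", "nice game",
               "A long, unique review."]
})

# Exact duplicates: only identical strings match
exact = reviews.duplicated(subset="review").sum()

# Normalised duplicates: ignore case and surrounding whitespace
normalised = reviews["review"].str.strip().str.lower()
loose = normalised.duplicated().sum()

print(exact, loose)  # 1 3
```

With the default `keep='first'`, each group of identical reviews contributes all of its members except the first to the count.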


## 2. Data Exploration

Convert the ```timestamp_created``` and ```timestamp_updated``` columns to datetime.

In [None]:
# Convert to datetime
game_dataset['timestamp_created'] = pd.to_datetime(game_dataset['timestamp_created'])
game_dataset['timestamp_updated'] = pd.to_datetime(game_dataset['timestamp_updated'])
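
Once the timestamps are datetimes, the `.dt` accessor makes time-based grouping straightforward. As a minimal sketch on made-up rows (not the real data), the share of positive recommendations per year could be computed as follows, which is one way to look at the change in game quality over time mentioned in Section 1.

```python
import pandas as pd

# Made-up rows: two reviews from 2016 and two from 2020
toy = pd.DataFrame({
    "timestamp_created": pd.to_datetime([
        "2016-08-14 12:03:37", "2016-09-03 02:34:36",
        "2020-02-11 22:49:20", "2020-04-11 13:37:17",
    ]),
    "voted_up": [True, False, True, True],
})

# The mean of a boolean column is the fraction of True values
positive_share = toy.groupby(toy["timestamp_created"].dt.year)["voted_up"].mean()
print(positive_share)
```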

In [None]:
game_dataset = game_dataset.dropna(subset=['review'])


## Preprocessing


In [None]:
# Number of characters in each review
game_dataset['length'] = game_dataset['review'].apply(len)

In [None]:
# summary statistics
game_dataset[['votes_funny','votes_up','author_playtime_forever','length']].describe()

Unnamed: 0,votes_funny,votes_up,author_playtime_forever,length
count,99957.0,99957.0,99957.0,99957.0
mean,300778.7,6.41882,5791.863101,456.517713
std,35940920.0,123.939431,12466.496919,763.350112
min,0.0,0.0,1.0,1.0
25%,0.0,0.0,992.0,61.0
50%,0.0,1.0,2639.0,192.0
75%,0.0,3.0,6280.0,521.0
max,4294967000.0,16993.0,645618.0,8135.0
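
The maximum of ```votes_funny``` (shown rounded as 4294967000.0) is close to 2^32 - 1 = 4294967295, the largest unsigned 32-bit integer. This suggests, though the source does not confirm it, that a few rows hold an overflow or sentinel value and not a real vote count, which would also explain the inflated mean and standard deviation. A minimal sketch on made-up values shows how such rows could be flagged and excluded from the statistics:

```python
import pandas as pd

# Largest unsigned 32-bit integer; assumed here to be a sentinel value
UINT32_MAX = 2**32 - 1  # 4294967295

# Made-up vote counts with one suspicious entry
toy = pd.DataFrame({"votes_funny": [0, 0, 1, 3, UINT32_MAX]})

# Flag the rows that carry the suspected sentinel
suspicious = toy["votes_funny"] == UINT32_MAX
print(suspicious.sum())

# Mean of the remaining, plausible counts
clean_mean = toy.loc[~suspicious, "votes_funny"].mean()
print(clean_mean)
```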
