# Video Game Playability Analysis Based on Players’ Reviews with PySpark

## Big Data Computing final project - A.Y. 2022-2023

Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author

Ilaria De Sio - [desio.2064970@studenti.uniroma1.it](mailto:desio.2064970@studenti.uniroma1.it)

The project is based on the paper *A Data-Driven Approach for Video Game
Playability Analysis Based on Players’ Reviews*. In this case study, playability
is defined through three basic concepts, **functionality**, **usability**, and
**gameplay**, as defined by the *framework of Paavilainen*.

The goal is to obtain an explicit and simplified framework that not only yields
an intuitive, quantified assessment of the overall playability of the chosen game,
but also makes it possible to analyze and view its positive and negative aspects.
To this end, the reviews are classified as "playability-informative" or
"non-playability-informative", and the informative ones are divided into
the classes listed above.

## Define some global constants

## Import PySpark packages and other dependencies

In [1]:
!pip install -q pyspark==3.3.0 spark-nlp==4.3.2

! cd ~/.ivy2/cache/com.johnsnowlabs.nlp/spark-nlp_2.12/jars && ls -lt

from sparknlp.pretrained import ResourceDownloader
ResourceDownloader.showPublicPipelines(lang="en")

from sparknlp.pretrained import PretrainedPipeline

zsh:cd:1: no such file or directory: /Users/ilariadesio/.ivy2/cache/com.johnsnowlabs.nlp/spark-nlp_2.12/jars


AssertionError: 

In [None]:
import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

import re

import nltk
from nltk import *

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
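# Create the session
# Assumption: the Spark NLP JAR is fetched from Maven (same version as the pip package)
# so that the annotators used later for lemmatization are available on the JVM classpath
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '4G').\
                set('spark.driver.memory', '45G').\
                set('spark.driver.maxResultSize', '10G').\
                set('spark.jars.packages', 'com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2').\
                setAppName("PySparkTutorial").\
                setMaster("local[*]")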

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

## 1.  Dataset initialization
I chose to use the dataset [https://doi.org/10.6084/m9.figshare.14222531.v1](https://doi.org/10.6084/m9.figshare.14222531.v1), provided directly by the authors of the paper, which contains the Steam review data for **No Man’s Sky** used to study its playability as perceived by users.
This case study is particularly interesting because the game was released in 2016 after a social media “hype” that had raised unprecedentedly high expectations.
The release itself was disastrous, but for the last four years the
game has been continuously maintained and its quality has gradually increased, which makes it a unique case where the changes in game quality are observable.



In [None]:
PATH="/Users/ilariadesio/Desktop/Computerscience/Firstyear/Secondsemester/BigData/Projects/Video_Game_Playability_Analysis/input/data_clean.csv"

# Load the CSV with pandas for the initial, more visual exploration.
# The Spark DataFrame is created later from this pandas DataFrame,
# so the file does not need to be read twice.
game_dataset = pd.read_csv(PATH)
game_dataset.head()

## 1.1 Dataset Shape and Schema

The dataset contains approximately 99k records of Steam reviews.


* ```recommendationid```: The review ID;
* ```language```: The review language;
* ```review```: The text of the user review;
* ```timestamp_created```: The date the review was posted;
* ```timestamp_updated```: The date the review was last updated;
* ```voted_up```: True means it was a positive recommendation;
* ```votes_up```: The number of other users who found this review helpful;
* ```votes_funny```: The number of other players who found the review funny;
* ```weighted_cote_score```: Helpfulness score;
* ```comment_count```: The number of comments other players left on the review;
* ```steam_purchase```: Whether the game was purchased on Steam;
* ```received_for_free```: Whether the game was received for free;
* ```written_during_early_access```: Whether the review was written during Early Access;
* ```author_num_games_owned```: Number of games owned by the author;
* ```author_num_reviews```: Number of reviews written by the author;
* ```author_playtime_forever```: Number of total hours played by the author;
* ```author_playtime_last_two_weeks```: Number of hours played by the author in the last two weeks;
* ```author_last_played```: When the author last played the game.

-------





In this initial, more visual phase the pandas DataFrame is used; the text processing and the rest of the project use the Spark DataFrame instead.

In [None]:
print(type(game_dataset))

## 2. Data Pre-processing
This phase involves cleaning and transforming the raw data to ensure its quality and compatibility with the analysis.



Convert the columns ```timestamp_created``` and ```timestamp_updated``` to datetime.
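
No code cell accompanies this step, so the following is only a minimal sketch of the conversion. It assumes that the two columns store Unix epoch seconds, which is how the Steam review API reports them; the dataset itself does not confirm this. The helper name `epoch_to_datetime` and the sample value are hypothetical.

```python
from datetime import datetime, timezone

def epoch_to_datetime(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime.

    Assumption: the timestamp columns store epoch seconds.
    """
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)

# Hypothetical sample value, not taken from the dataset
sample = epoch_to_datetime(1470960000)
print(sample.isoformat())
```

Under the same assumption, the conversion on the pandas DataFrame would be `pd.to_datetime(game_dataset['timestamp_created'], unit='s')`, and likewise for `timestamp_updated`.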

## 2.1 Data Cleaning

From the data info above, we can already notice that there are missing values in the ```review``` column. Since our work relies heavily on this column, we have to remove these missing values. In addition, following the standard data cleaning procedure, we also need to check for duplicated values.

In [None]:
game_dataset[game_dataset['review'].isna()]


In [None]:
game_dataset.isna().sum()

In [None]:
# Drop every row that contains a missing value in any column
# (this includes the rows with a missing review)
game_dataset.dropna(inplace=True)

# Sanity check
game_dataset.isna().sum()

In [None]:
game_dataset.count()

The rows with null values have been deleted correctly; 99957 rows remain.
Now let's check for duplicates.

In [None]:
game_dataset.duplicated().sum()

It seems that there are no duplicated rows. But are there duplicated reviews?

In [None]:
game_dataset.duplicated(subset='review').sum()

In [None]:
game_dataset[game_dataset.duplicated(subset='review',keep=False)].sample(10)

As we can see, these are not copies of the same review but texts that happen to coincide: most of them are very short reviews such as 'good' or 'amazing'. These reviews are still important for our classification task, so we will not drop them.

## Text-processing
Some reviews consist only of special characters. These reviews should be removed, because smileys and other special characters are not significant and may have multiple or ambiguous meanings, making accurate interpretation difficult.
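
A minimal sketch of how such reviews could be detected: a review is kept only if it contains at least one ASCII letter or digit. The function name `has_alphanumeric` is hypothetical, and the rule is an assumption about what counts as a meaningful review, not the filter used in the paper.

```python
import re

# Matches any ASCII letter or digit
_ALNUM = re.compile(r'[A-Za-z0-9]')

def has_alphanumeric(text):
    """Return True if the review contains at least one letter or digit."""
    return text is not None and _ALNUM.search(text) is not None

# Hypothetical examples, not taken from the dataset
reviews = ["great game", ":)", "!!!", "10/10", None]
kept = [r for r in reviews if has_alphanumeric(r)]
print(kept)
```

Wrapped with `udf(has_alphanumeric, BooleanType())`, the same function could be passed to `DataFrame.filter` on the Spark DataFrame.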

### Convert all the text of the review to lowercase

In [None]:
from pyspark.sql.functions import col, lower

# Convert the pandas DataFrame into a Spark DataFrame for the text-processing steps
spark_df = spark.createDataFrame(game_dataset)

# Spark DataFrames are immutable, but withColumn() returns a new DataFrame,
# so reusing the name 'review' replaces the column without a temporary one.
# The built-in lower() also avoids the serialization overhead of a Python UDF.
spark_df = spark_df.withColumn('review', lower(col('review')))

### Removal of extra spaces
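
This step has no code cell, so here is one possible sketch: every run of whitespace (spaces, tabs, newlines) is collapsed into a single space and the ends are trimmed. The function name `remove_extra_spaces` is hypothetical.

```python
import re

# Matches one or more consecutive whitespace characters
_WHITESPACE_RUN = re.compile(r'\s+')

def remove_extra_spaces(text):
    """Collapse whitespace runs into one space and strip both ends."""
    if text is None:
        return None
    return _WHITESPACE_RUN.sub(' ', text).strip()

cleaned = remove_extra_spaces("  this   game\n\nis \t great  ")
print(cleaned)
```

On the Spark DataFrame the same effect should be obtainable without a UDF through `trim(regexp_replace(col('review'), r'\s+', ' '))` from `pyspark.sql.functions`.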

### Removal of non-ASCII characters
For example, accented letters such as à, é, ô, and many special symbols such as ☺ (smiley face) are not included in standard ASCII.

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import re

def remove_non_ascii(text):
    # Missing values are passed through unchanged
    if text is None:
        return None
    # Regular expression matching every character outside the ASCII range
    non_ascii_regex = re.compile(r'[^\x00-\x7F]')
    # Replace the non-ASCII characters with an empty string
    cleaned_text = non_ascii_regex.sub('', text)
    return cleaned_text

# Definition of the UDF (User-Defined Function)
remove_non_ascii_udf = udf(remove_non_ascii, StringType())

# Application of the UDF to the 'review' column, replacing it in place
spark_df = spark_df.withColumn('review', remove_non_ascii_udf(spark_df['review']))

# show() prints the rows itself and returns None, so its result is not printed again
spark_df.select('review').show(n=5, truncate=False)


### Sentence Tokenization
In this phase, each review in the DataFrame is divided into sentence-level review instances, because a review with multiple sentences can contain multiple topics and various sentiments.

In [None]:
nltk.download('punkt')
def sent_tokenize(text):
    return nltk.sent_tokenize(text)

# Creating the UDF function for the phrase tokenizer.
sent_tokenize_udf = udf(sent_tokenize, ArrayType(StringType()))

# Applying the phrase tokenizer to the 'review' column of the DataFrame.
spark_df = spark_df.withColumn('sentences', sent_tokenize_udf(spark_df['review']))

columns = spark_df.columns
columns.remove('sentences')
spark_df = spark_df.select(columns[:3] + ['sentences'] + columns[3:])

# show() prints the rows itself and returns None, so its result is not printed again
spark_df.select('sentences').show(n=5, truncate=False)

spark_df.printSchema()
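
The cell above stores the sentences of each review as an array in a single row. To obtain actual sentence-level instances, the array still has to be flattened into one row per sentence. The plain-Python sketch below illustrates that reshaping; the function name `explode_sentences` and the sample rows are hypothetical.

```python
def explode_sentences(rows):
    """Yield one (recommendationid, sentence) pair per sentence.

    `rows` is an iterable of (recommendationid, sentences) pairs, where
    `sentences` is the list produced by the sentence tokenizer.
    """
    for review_id, sentences in rows:
        # Reviews without sentences (None or empty list) produce no rows
        for sentence in sentences or []:
            yield review_id, sentence

# Hypothetical tokenized reviews, not taken from the dataset
rows = [(1, ["great game.", "the controls are clunky."]), (2, ["good"]), (3, None)]
sentence_rows = list(explode_sentences(rows))
print(sentence_rows)
```

In Spark, this reshaping corresponds to `explode(col('sentences'))` from `pyspark.sql.functions`, which likewise produces no rows for null or empty arrays.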

### Remove Stop-words (NOT USED by Project Choice)
I initially tried removing stop words but, seeing the results, I noticed that it might cause the loss of some relevant information about sentence structure, so I decided not to use it in this case.
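
A toy example of this kind of information loss: negations such as "not" commonly appear in stop-word lists, so removing them can make two sentences with opposite meanings identical. The word set below is a hand-picked assumption for illustration, not NLTK's actual list.

```python
# Hypothetical, hand-picked subset of an English stop-word list (assumption)
STOP_WORDS = {"this", "is", "not", "the", "a"}

def remove_stopwords(tokens):
    """Drop every token that appears in STOP_WORDS (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

positive = remove_stopwords("this game is good".split())
negative = remove_stopwords("this game is not good".split())
print(positive, negative)
```

Both sentences are reduced to the same tokens, so a downstream classifier could no longer tell them apart.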

In [None]:
"""
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords_eng = stopwords.words('english')

# Define a UDF function to remove stop-words.
def remove_stopwords(sentence):
    if sentence is not None:
        return [word for word in sentence if word.lower() not in stopwords_eng]
    return None

remove_stopwords_udf = udf(remove_stopwords, ArrayType(StringType()))

# Apply stop-word screening to the DataFrame.
spark_df = spark_df.withColumn('filtered_sentences', remove_stopwords_udf(spark_df['sentences']))

first_row = spark_df.select('filtered_sentences').show(n=5, truncate=False)
print(first_row)
spark_df.printSchema();
"""

### Lemmatization

In [None]:
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline


# Create the Spark NLP annotators
document_assembler = DocumentAssembler().setInputCol("review").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained().setInputCols(["token"]).setOutputCol("lemma")

# Create the pipeline
pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])

# Apply the pipeline to the DataFrame
processed_df = pipeline.fit(spark_df).transform(spark_df)

# Display the results
processed_df.select("review", "lemma.result").show(truncate=False)



## 2.2 Data Exploration
In this phase I will analyze different hypotheses about correlations between the variables, testing whether or not they are actually correlated as each hypothesis states.

### 2.2.1 First Hypothesis
**Does there exist a correlation between the number of hours a person played a game and the sentiment of the review?**

In [None]:
import nltk

nltk.download("vader_lexicon")



In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from nltk.sentiment import SentimentIntensityAnalyzer

# Convert the pandas DataFrame into a PySpark DataFrame
spark_df = spark.createDataFrame(game_dataset)

# Initialize NLTK's SentimentIntensityAnalyzer (VADER)
sia = SentimentIntensityAnalyzer()

# Define the sentiment analysis function
def analyze_sentiment(review):
    # Compute the compound sentiment score of the review with VADER
    sentiment = sia.polarity_scores(review)["compound"]

    # Label the review according to the sign of the compound score
    if sentiment > 0:
        return "positive"
    elif sentiment < 0:
        return "negative"
    else:
        return "neutral"

# Register the function as a UDF (User-Defined Function)
sentiment_udf = udf(analyze_sentiment, StringType())

# Apply the sentiment analysis to the DataFrame
classified_df = spark_df.withColumn("sentiment", sentiment_udf(spark_df["review"]))

# Split the DataFrame into two separate DataFrames for positive and negative reviews
positive_reviews = classified_df.filter(classified_df["sentiment"] == "positive")
negative_reviews = classified_df.filter(classified_df["sentiment"] == "negative")


# Show the first positive reviews as a pandas DataFrame;
# limit(5) avoids collecting the whole DataFrame to the driver
positive_reviews_pandas = positive_reviews.limit(5).toPandas()
print("Positive Reviews:")
print(positive_reviews_pandas)

# Show the first negative reviews as a pandas DataFrame
negative_reviews_pandas = negative_reviews.limit(5).toPandas()
print("Negative Reviews:")
print(negative_reviews_pandas)
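
The cell above labels each review but does not yet measure the relationship with playtime. One possible way to quantify it, sketched here under stated assumptions, is the Pearson correlation between the hours played and the numeric VADER compound score. The function name `pearson` and all sample values are hypothetical and do not come from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of the products of the deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: hours played and VADER compound score per review
playtime = [1.0, 5.0, 20.0, 80.0, 200.0]
compound = [-0.6, -0.2, 0.1, 0.5, 0.8]
r = pearson(playtime, compound)
print(round(r, 3))
```

On the Spark DataFrame, once the compound score is stored as a numeric column, the same coefficient could be obtained with `DataFrame.stat.corr` between that column and `author_playtime_forever`.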
