# Video Game Playability Analysis Based on Players’ Reviews with PySpark

## Big Data Computing final project - A.Y. 2022-2023

Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author

Ilaria De Sio - [desio.2064970@studenti.uniroma1.it](mailto:desio.2064970@studenti.uniroma1.it)

The project is based on the paper *A Data-Driven Approach for Video Game
Playability Analysis Based on Players’ Reviews*. In this case study, playability
is defined through three basic concepts, **functionality**, **usability**, and
**gameplay**, as defined by the *framework of Paavilainen*.

The goal is to obtain an explicit and simplified framework that not only gives an
intuitive, quantified assessment of the overall playability of the chosen game, but
also makes its positive and negative aspects visible. To this end, review content is
classified as "playability-informative" or "non-playability-informative", and the
informative content is divided into the classes listed above.

## Define some global constants

In [19]:
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [1]:
# GITHUB
DATASET_URL = "https://raw.githubusercontent.com/iladesio/iladesio/Video_Game_Playability_Analysis/tree/main/dataset/data_clean.csv"

# GOOGLE DRIVE
GDRIVE_DIR = "/content/gdrive" # Your own mount point on Google Drive
GDRIVE_HOME_DIR = GDRIVE_DIR  # Your own home directory

# File Variable
GDRIVE_DATASET_FILE = GDRIVE_HOME_DIR + "/" + DATASET_URL.split("/")[-1]

## Import PySpark packages and other dependencies

In [2]:
!pip install pyspark



In [3]:
import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

In [4]:
import pandas as pd

In [5]:
# Create the session
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '4G').\
                set('spark.driver.memory', '45G').\
                set('spark.driver.maxResultSize', '10G').\
                setAppName("PySparkTutorial").\
                setMaster("local[*]")

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/05/18 14:41:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## 1.  Dataset initialization
I chose the dataset [https://doi.org/10.6084/m9.figshare.14222531.v1](https://doi.org/10.6084/m9.figshare.14222531.v1), provided directly by the authors of the paper, which contains the Steam review data for **No Man’s Sky** that they analyzed in terms of playability.
This case study is particularly interesting because the game was released in 2016, after social media "hype" had raised unprecedentedly high expectations.
The release was disastrous, but over the four years that followed the game was continuously maintained and its quality gradually increased, which makes it a unique case in which changes in game quality are observable.



In [6]:
game_dataset=pd.read_csv("/Users/ilariadesio/Desktop/Computerscience/First year/Second semester/Big Data/Projects/Video_Game_Playability_Analysis/dataset/data_clean.csv")
game_dataset.head()

Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_last_played
0,70427607,english,This game has the elements of many games sewn ...,2020-06-07 07:33:21,2020-06-07 07:33:21,True,0,0,0.0,0,False,False,False,143,1,14368,1041,2020-06-02 07:05:32
1,70426209,english,game is k. random gen from presets. no voice a...,2020-06-07 06:48:46,2020-06-07 06:48:46,True,0,0,0.0,0,False,False,False,133,5,3927,51,2020-06-02 11:31:12
2,70425814,english,I first played 2 years ago and it was fun for ...,2020-06-07 06:35:34,2020-06-07 06:35:34,True,0,0,0.0,0,True,False,False,670,31,1942,987,2020-06-07 06:21:38
3,70425169,english,I wasn't sure if I'd like this--survival stuff...,2020-06-07 06:13:16,2020-06-07 06:13:16,True,0,0,0.0,0,False,False,False,98,37,121,121,2020-06-07 01:31:17
4,70425032,english,"This is an amazing game, Where to start?\r\nYo...",2020-06-07 06:08:28,2020-06-07 06:08:28,True,0,0,0.0,0,True,False,False,16,6,5887,5887,2020-06-07 07:18:45


In [7]:
print(type(game_dataset))

<class 'pandas.core.frame.DataFrame'>


## 1.1 Dataset Shape and Schema

The dataset contains approximately 100k Steam reviews (99,993 rows before cleaning).


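* ```recommendationid```: The review ID;
* ```language```: The review language;
* ```review```: The text of the user review;
* ```timestamp_created```: The date the review was posted;
* ```timestamp_updated```: The date the review was last updated;
* ```voted_up```: Whether the review recommends the game (`True` for a positive review);
* ```votes_up```: How many other players found the review helpful;
* ```votes_funny```: How many other players found the review funny;
* ```weighted_vote_score```: The helpfulness score Steam assigns to the review;
* ```comment_count```: How many comments the review received;
* ```steam_purchase```: Whether the author purchased the game on Steam;
* ```received_for_free```: Whether the author received the game for free;
* ```written_during_early_access```: Whether the review was written while the game was in Early Access;
* ```author_num_games_owned```: How many games the author owns;
* ```author_num_reviews```: How many reviews the author has written;
* ```author_playtime_forever```: The author's total playtime in this game;
* ```author_playtime_last_two_weeks```: The author's playtime in this game over the last two weeks;
* ```author_last_played```: When the author last played the game.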
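
The column list above can be made explicit at load time, so that timestamps and text columns get the intended types from the start. The following is only a minimal sketch: the two rows are invented, only a subset of the columns is included, and an in-memory CSV stands in for `data_clean.csv`.

```python
import io

import pandas as pd

# Two invented rows that mimic a subset of the columns listed above.
sample_csv = io.StringIO(
    "recommendationid,language,review,timestamp_created,voted_up,votes_up\n"
    "1,english,good game,2020-06-07 07:33:21,True,0\n"
    "2,english,,2020-06-07 06:48:46,False,3\n"
)

reviews = pd.read_csv(
    sample_csv,
    # Declare the text columns explicitly instead of relying on 'object'.
    dtype={"recommendationid": "int64", "language": "string", "review": "string"},
    # Parse the timestamp while reading, so no later conversion is needed.
    parse_dates=["timestamp_created"],
)

print(reviews.dtypes)
# An empty review field is read as a missing value.
print(reviews["review"].isna().sum())
```

Declaring the types up front makes a missing review show up as a proper missing value instead of an empty string, which is what the cleaning step relies on.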

-------





In [8]:
print("The shape of the dataset is {:d} rows by {:d} columns".format(game_dataset.shape[0], game_dataset.shape[1]))

The shape of the dataset is 99993 rows by 18 columns


In [9]:
game_dataset.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99993 entries, 0 to 99992
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   recommendationid                99993 non-null  int64  
 1   language                        99993 non-null  object 
 2   review                          99957 non-null  object 
 3   timestamp_created               99993 non-null  object 
 4   timestamp_updated               99993 non-null  object 
 5   voted_up                        99993 non-null  bool   
 6   votes_up                        99993 non-null  int64  
 7   votes_funny                     99993 non-null  int64  
 8   weighted_vote_score             99993 non-null  float64
 9   comment_count                   99993 non-null  int64  
 10  steam_purchase                  99993 non-null  bool   
 11  received_for_free               99993 non-null  bool   
 12  written_during_early_access     

## 2.1 Data Cleaning

The summary above already shows missing values in the ```review``` column (99,957 non-null entries out of 99,993). Since the analysis relies heavily on this column, rows with a missing review must be removed. Following standard data-cleaning practice, we also check for duplicated values.



In [10]:
game_dataset[game_dataset['review'].isna()]


Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_last_played
5115,66456723,english,,2020-04-02 22:35:41,2020-04-02 22:35:41,True,0,0,0.0,0,True,False,False,29,1,28002,4295,2020-06-07 06:16:39
5191,66346015,english,,2020-04-01 14:17:30,2020-04-01 14:17:30,True,0,0,0.0,0,False,False,False,318,3,9758,0,2020-04-30 03:09:27
5232,66276955,english,,2020-03-31 18:23:38,2020-03-31 18:23:38,True,0,0,0.0,0,True,False,False,91,3,6001,0,2020-03-31 18:45:40
5541,65678049,english,,2020-03-24 03:33:57,2020-03-24 03:33:57,True,0,0,0.0,0,True,False,False,240,5,8316,0,2020-04-26 03:39:39
5636,65449433,english,,2020-03-21 05:37:26,2020-03-21 05:37:26,True,0,0,0.0,0,True,False,False,109,6,601,0,2020-03-21 16:45:18
5865,65060556,english,,2020-03-15 02:27:56,2020-03-15 02:27:56,True,0,0,0.0,0,True,False,False,38,3,1292,74,2020-05-26 20:24:50
6190,64664205,english,,2020-03-07 18:10:21,2020-03-07 18:10:21,True,0,0,0.0,0,False,False,False,33,2,4114,0,2020-03-17 19:50:32
6618,64241196,english,,2020-02-28 12:35:57,2020-02-28 12:35:57,True,1,0,0.525862,0,True,False,False,22,1,3288,46,2020-06-03 23:05:22
6675,64210477,english,,2020-02-27 20:27:52,2020-02-27 20:27:52,True,0,0,0.0,0,True,False,False,58,1,5675,0,2020-04-09 01:00:06
6840,64111598,english,,2020-02-25 19:02:33,2020-02-25 19:02:33,True,0,0,0.0,0,True,False,False,57,11,6720,0,2020-04-20 15:29:47


In [11]:
game_dataset.isna().sum()

recommendationid                   0
language                           0
review                            36
timestamp_created                  0
timestamp_updated                  0
voted_up                           0
votes_up                           0
votes_funny                        0
weighted_vote_score                0
comment_count                      0
steam_purchase                     0
received_for_free                  0
written_during_early_access        0
author_num_games_owned             0
author_num_reviews                 0
author_playtime_forever            0
author_playtime_last_two_weeks     0
author_last_played                 0
dtype: int64

In [12]:
# Drop rows with missing reviews (review is the only column with missing values)
game_dataset.dropna(subset=['review'], inplace=True)

# Sanity check
game_dataset.isna().sum()

recommendationid                  0
language                          0
review                            0
timestamp_created                 0
timestamp_updated                 0
voted_up                          0
votes_up                          0
votes_funny                       0
weighted_vote_score               0
comment_count                     0
steam_purchase                    0
received_for_free                 0
written_during_early_access       0
author_num_games_owned            0
author_num_reviews                0
author_playtime_forever           0
author_playtime_last_two_weeks    0
author_last_played                0
dtype: int64

In [13]:
game_dataset.count()

recommendationid                  99957
language                          99957
review                            99957
timestamp_created                 99957
timestamp_updated                 99957
voted_up                          99957
votes_up                          99957
votes_funny                       99957
weighted_vote_score               99957
comment_count                     99957
steam_purchase                    99957
received_for_free                 99957
written_during_early_access       99957
author_num_games_owned            99957
author_num_reviews                99957
author_playtime_forever           99957
author_playtime_last_two_weeks    99957
author_last_played                99957
dtype: int64

The rows with null values have been dropped correctly: 99,957 rows remain.
Now let's check for duplicates.

In [14]:
game_dataset.duplicated().sum()

0

It seems that there are no duplicated rows. But are there duplicated reviews?

In [15]:
game_dataset.duplicated(subset='review').sum()

3903

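As the sample below shows, the repeated texts are not substantive reviews posted twice, only short, generic phrases (such as "meh" or "good game") that different users happened to write.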

In [16]:
game_dataset[game_dataset.duplicated(subset='review',keep=False)].sample(10)

Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_last_played
42884,34196360,english,it's good now,2017-08-12 17:45:30,2017-08-12 17:45:30,True,15,3,0.490296,0,False,False,False,202,3,2517,0,2019-08-30 00:57:33
66821,25189530,english,meh,2016-08-27 20:03:22,2016-08-27 20:03:22,False,3,0,0.508897,0,True,False,False,217,1,215,0,2016-08-13 00:15:29
85513,24892722,english,It's pretty good,2016-08-13 19:00:33,2019-08-22 03:02:52,True,0,1,0.0,0,True,False,False,218,5,6732,0,2020-04-21 22:42:56
3455,67271745,english,nice game,2020-04-14 04:24:12,2020-04-14 04:24:12,True,0,0,0.0,0,True,False,False,7,1,5520,0,2020-05-09 17:47:27
9575,61838448,english,they made it good,2020-01-14 11:37:36,2020-01-14 11:37:36,True,0,0,0.487805,0,False,False,False,191,32,598,0,2019-08-14 18:28:31
8822,62417391,english,It's getting there,2020-01-26 01:14:07,2020-01-26 01:14:07,True,0,0,0.0,0,True,False,False,42,5,3400,0,2020-01-26 03:32:27
18210,54670036,english,Too many bugs,2019-08-17 20:23:29,2019-08-17 20:23:29,False,1,0,0.5,0,False,False,False,370,13,277,0,2019-08-23 15:28:30
32048,43849604,english,Yes,2018-07-31 07:42:10,2018-07-31 07:42:10,True,0,1,0.456432,0,False,False,False,282,11,2288,0,2020-01-10 07:45:36
8787,62445708,english,<3,2020-01-26 12:24:33,2020-01-26 12:24:33,True,1,0,0.47619,0,True,False,False,57,2,14843,0,2020-04-29 18:11:57
7549,63774438,english,good game,2020-02-19 12:54:00,2020-02-19 12:54:00,True,0,0,0.0,0,False,False,False,9,1,8379,76,2020-05-31 17:13:53
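
To check more systematically that the repeated texts are short, generic phrases, one option is to count how often each repeated text occurs and how many words it contains. This is a minimal sketch on an invented toy DataFrame, not on the real `game_dataset`; the variable names are illustrative.

```python
import pandas as pd

# Invented toy reviews: two generic phrases repeat, one long review is unique.
toy = pd.DataFrame({
    "review": [
        "good game",
        "meh",
        "good game",
        "A long, detailed review about crashes.",
        "meh",
        "good game",
    ],
})

# keep=False marks every member of a duplicate group, not only the repeats.
dupes = toy[toy.duplicated(subset="review", keep=False)]

# How often each repeated text occurs, most frequent first.
top_repeated = dupes["review"].value_counts()

# Number of extra copies, the same quantity as duplicated(subset='review').sum().
extra_copies = toy.duplicated(subset="review").sum()

# Longest repeated text, measured in whitespace-separated words.
max_words = dupes["review"].str.split().str.len().max()

print(top_repeated)
print(extra_copies, max_words)
```

If the most frequent repeated texts are only a few words long, keeping them is a defensible choice, since they are independent opinions and not accidental copies.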


## 2. Data Exploration

Convert the ```timestamp_created``` and ```timestamp_updated``` columns to datetime.

In [17]:
# Convert to datetime
game_dataset['timestamp_created'] = pd.to_datetime(game_dataset['timestamp_created'])
game_dataset['timestamp_updated'] = pd.to_datetime(game_dataset['timestamp_updated'])
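
With datetime columns in place, the reviews can be aggregated over time, for example to see how the share of positive reviews changes from month to month. This is a minimal sketch on invented toy rows; the output column names `reviews` and `positive_share` are illustrative choices.

```python
import pandas as pd

# Invented toy rows with the two columns needed for the aggregation.
toy = pd.DataFrame({
    "timestamp_created": [
        "2016-08-13 19:00:33",
        "2016-08-27 20:03:22",
        "2020-01-14 11:37:36",
        "2020-01-26 01:14:07",
    ],
    "voted_up": [True, False, True, True],
})
toy["timestamp_created"] = pd.to_datetime(toy["timestamp_created"])

# Group by calendar month; the mean of a boolean column is the share of True.
monthly = (
    toy.groupby(toy["timestamp_created"].dt.to_period("M"))["voted_up"]
       .agg(reviews="count", positive_share="mean")
)

print(monthly)
```

Applied to the real data, a table like this would show whether the positive share rises over time.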

## Preprocessing


In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import NaiveBayes
from pyspark.ml import Pipeline

# Create a Spark session
spark = SparkSession.builder.appName("EMNaiveBayesExample").getOrCreate()

# Load the Steam reviews dataset
dataset_path = "/Users/ilariadesio/Desktop/Computerscience/First year/Second semester/Big Data/Projects/Video_Game_Playability_Analysis/dataset/data_clean.csv"
df = spark.read.csv(dataset_path, header=True, inferSchema=True)

# Text preprocessing: split on whitespace, remove stop words, build count vectors
tokenizer = RegexTokenizer(inputCol="review", outputCol="words", pattern=r"\s+", gaps=True)
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
vectorizer = CountVectorizer(inputCol="filtered_words", outputCol="features")

# Create the Naive Bayes classifier (the base model for the EM approach)
nb = NaiveBayes(featuresCol="features", labelCol="label", predictionCol="prediction", probabilityCol="probability", rawPredictionCol="rawPrediction", smoothing=1.0, modelType="multinomial", thresholds=None, weightCol=None)

# Create the training pipeline
pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, nb])

# Split the dataset into training and test sets
train_data, test_data = df.randomSplit([0.8, 0.2], seed=123)

# Train the model
model = pipeline.fit(train_data)

# Evaluate the model
predictions = model.transform(test_data)
predictions.select("review", "label", "prediction").show()

# Example: classify new reviews
new_reviews = spark.createDataFrame([
    ("This game is very enjoyable and has great playability.", ),
    ("I couldn't even play the game due to constant crashes.", ),
    ("The controls are intuitive and the gameplay is smooth.", )
], ["review"])

new_predictions = model.transform(new_reviews)
new_predictions.select("review", "prediction").show()

# Stop the Spark session
spark.stop()




                                                                                

23/05/18 15:23:40 WARN StopWordsRemover: Default locale set was [en_IT]; however, it was not found in available locales in JVM, falling back to en_US locale. Set param `locale` in order to respect another locale.



23/05/18 15:23:42 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID 16)
org.apache.spark.SparkException: Failed to execute user defined function (RegexTokenizer$$Lambda$2997/0x0000000801a15fe8: (string) => array<string>)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:190)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scal

Py4JJavaError: An error occurred while calling o45.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 2.0 failed 1 times, most recent failure: Lost task 7.0 in stage 2.0 (TID 16) (10.22.246.184 executor driver): org.apache.spark.SparkException: Failed to execute user defined function (RegexTokenizer$$Lambda$2997/0x0000000801a15fe8: (string) => array<string>)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:190)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.NullPointerException: Cannot invoke "String.toLowerCase()" because "originStr" is null
	at org.apache.spark.ml.feature.RegexTokenizer.$anonfun$createTransformFunc$2(Tokenizer.scala:146)
	... 19 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
	at org.apache.spark.rdd.RDD.count(RDD.scala:1274)
	at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:233)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function (RegexTokenizer$$Lambda$2997/0x0000000801a15fe8: (string) => array<string>)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:190)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	... 1 more
Caused by: java.lang.NullPointerException
	at org.apache.spark.ml.feature.RegexTokenizer.$anonfun$createTransformFunc$2(Tokenizer.scala:146)
	... 19 more
