# Final Project Planning Document

For my final project, I would like to use my Project 4 dataset, which contains user preference data from 73,516 users on 12,294 anime. In Project 4, I was unable to perform certain operations in Pandas on the full dataset due to memory constraints on my computer. In Project 5, I was unable to compute cosine similarity even on a smaller dataset because my local Spark session would crash. I would therefore like to learn to overcome both obstacles by leveraging AWS alongside more optimized code. In Project 4, my content-based recommender system used the genre column to determine similarity between anime. Since the data was sourced from myanimelist.net, which also provides anime descriptions, I will additionally scrape the description data and use it for the TF-IDF and cosine-similarity analysis. In this version of the project, I will also exclude certain genres of anime, as I would like to build a family-friendly recommender system.
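To put the memory constraint in rough numbers, here is a back-of-the-envelope sketch. It assumes (my assumption, not something measured in Project 4) that the expensive step is materializing dense float64 matrices, such as a user-by-anime pivot or an anime-by-anime similarity matrix. `dense_gib` is a helper name introduced only for this illustration.

```python
# Rough memory estimate for dense float64 matrices (8 bytes per cell).
N_USERS = 73_516
N_ANIME = 12_294
BYTES_PER_FLOAT64 = 8


def dense_gib(rows: int, cols: int, bytes_per_cell: int = BYTES_PER_FLOAT64) -> float:
    """Size in GiB of a dense rows x cols matrix."""
    return rows * cols * bytes_per_cell / 2**30


user_item_gib = dense_gib(N_USERS, N_ANIME)  # user x anime pivot
item_item_gib = dense_gib(N_ANIME, N_ANIME)  # anime x anime similarity

print(f"user x anime:  {user_item_gib:.2f} GiB")
print(f"anime x anime: {item_item_gib:.2f} GiB")
```

Under these assumptions the pivot alone comes to several GiB, which would be consistent with running out of memory on a laptop; sparse representations avoid most of that cost.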

To scrape the data, I will use BeautifulSoup in Python and parse the text from the description HTML tags. The id in the scraped dataset will match the anime_id in the original dataset.

My plan is to use Amazon S3 for data storage and build the recommender system on Amazon EC2. This should be possible within the AWS free tier. I will use scikit-learn for the TF-IDF analysis. I may use Amazon SageMaker as an alternative to EC2, as it can offer more memory and also has built-in Spark containers. However, this may incur additional costs beyond the free tier.

As suggested by a classmate on our discussion board, my strategy will be to develop the code locally with PySpark on a smaller subset of the data, and then move it to AWS once I have a working prototype. Spark set-up and some preliminary pre-processing steps are shown below.

# Spark Set-Up

In [151]:
import os

import findspark
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import (IntegerType, StringType, StructField, StructType,
                               BooleanType, FloatType)
from pyspark.sql.functions import (monotonically_increasing_id, concat_ws, col, lit, mean,
                                   count_distinct, when, split, udf, coalesce)

In [3]:
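findspark.init()
# Set the scratch directory before the JVM starts.
os.environ["SPARK_LOCAL_DIRS"] = "C:\\Temp\\spark-temp"

# Build the session before any SparkContext exists; otherwise spark.driver.memory has no effect.
spark = SparkSession.builder.config("spark.driver.memory", "6g").config("spark.executor.memory", "6g").getOrCreate()
sc = spark.sparkContext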
spark

# New Data

In [7]:
anime_desc_schema = StructType([
    StructField('anime_id', IntegerType()), 
    StructField('description', StringType())
])
anime_descriptions = spark.read.option("sep", ";").csv("anime_summaries.csv", schema = anime_desc_schema)

In [17]:
anime_descriptions = anime_descriptions.filter(col("anime_id").isNotNull())

# Pre-Processing

In [18]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as KNN
import matplotlib.pyplot as plt
import kagglehub
import gc
# Authenticate
# kagglehub.login()
path = kagglehub.dataset_download("CooperUnion/anime-recommendations-database")
print(path)
path_anime = os.path.join(path, 'anime.csv')
path_rating = os.path.join(path, 'rating.csv')

C:\Users\Kim\.cache\kagglehub\datasets\CooperUnion\anime-recommendations-database\versions\1


In [19]:
#anime = pd.read_csv(path_anime, header = 0)
anime = spark.read.csv(path_anime, header = True)

In [20]:
anime.printSchema() 

root
 |-- anime_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- type: string (nullable = true)
 |-- episodes: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- members: string (nullable = true)



In [21]:
anime.columns

['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members']

In [22]:
schema = StructType([
    StructField('anime_id', IntegerType()), 
    StructField('name', StringType()),
    StructField('genre', StringType()),
    StructField('type', StringType()),
    StructField('episodes', FloatType()),
    StructField('rating', FloatType()),
    StructField('members', IntegerType())
])

In [23]:
anime = spark.read.csv(path_anime, schema = schema, header = True)

In [24]:
anime.printSchema()

root
 |-- anime_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- type: string (nullable = true)
 |-- episodes: double (nullable = true)
 |-- rating: double (nullable = true)
 |-- members: integer (nullable = true)



In [40]:
anime.count()

12294

In [39]:
# Count anime whose genre string mentions any of the excluded genres.
anime.filter(col('genre').rlike('Ecchi|Hentai|Erotica')).count()

1778
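The count above uses substring matching on the genre string. A minimal, Spark-free sketch of the same family-friendly rule applied to exact genre tokens is below. The excluded set mirrors the three genres in the filter above, while `is_family_friendly` and the sample rows are hypothetical names and data used only for illustration.

```python
EXCLUDED_GENRES = {"Ecchi", "Hentai", "Erotica"}


def is_family_friendly(genre):
    """Return True when none of the comma-separated genres is excluded.

    A missing genre (None) is kept here; whether to keep such titles
    is a policy choice, not something the data decides.
    """
    if genre is None:
        return True
    tokens = {g.strip() for g in genre.split(",")}
    return tokens.isdisjoint(EXCLUDED_GENRES)


# Made-up rows in the same shape as the name and genre columns.
sample = [
    ("Title A", "Action, Drama, Sci-Fi"),
    ("Title B", "Comedy, Ecchi, School"),
    ("Title C", None),
]
kept = [name for name, genre in sample if is_family_friendly(genre)]
print(kept)
```

Matching on whole tokens rather than substrings avoids accidentally excluding a genre whose name merely contains one of the excluded words.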

In [26]:
anime_full = anime.join(anime_descriptions, on = "anime_id", how = "inner")

In [27]:
anime_full.count()

10640

In [262]:
new_anime_full = anime_full.sample(withReplacement=False, fraction=0.0003)
new_anime_full.show()

+--------+--------------------+--------------------+-----+--------+------+-------+--------------------+
|anime_id|                name|               genre| type|episodes|rating|members|         description|
+--------+--------------------+--------------------+-----+--------+------+-------+--------------------+
|     483|Kurau Phantom Memory|Action, Drama, Sc...|   TV|    24.0|  7.46|  17463|It is the year 21...|
|   16790|    Super Samchongsa|Action, Mecha, Sh...|Movie|     1.0|  4.33|     63|A Korean animated...|
|    8184|  Bouken Gabotenjima|           Adventure|   TV|    39.0|  6.58|    140|Bouken Gabotenjim...|
|   13501|  Cofun Gal no Coffy|Comedy, Historica...|  ONA|    10.0|  4.45|     75|Tumulus Gal Coffy...|
+--------+--------------------+--------------------+-----+--------+------+-------+--------------------+



In [43]:
#rating = pd.read_csv(path_ratings, header = 0)
rating = spark.read.csv(path_rating, header = True)
rating.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- anime_id: string (nullable = true)
 |-- rating: string (nullable = true)



In [44]:
rating_schema = StructType([
    StructField('user_id', IntegerType()), 
    StructField('anime_id', IntegerType()),
    StructField('rating', FloatType())
])

rating = spark.read.csv(path_rating, schema = rating_schema, header = True)
#rating = rating.loc[rating['rating'] != -1]
rating = rating.filter(col('rating') != -1)
rating = rating.dropna()

In [45]:
#random_selection = pd.DataFrame(rating['user_id'].unique()).sample(frac = .2, random_state = 63)
#new_rating = rating[rating['user_id'].isin(random_selection[0])]
#new_rating
new_rating = rating.sample(withReplacement=False, fraction=0.05)

I split the 5% sample of the ratings into a training set (80%) and a test set (20%).

In [111]:
#df_random = new_rating.sample(frac = .2, random_state = 63) # for the sake of this exercise, going to use only 20% of the dataset due to size
#split_size = int(0.8*len(df_random)) # designate split size (80%)
#train_df = df_random[:split_size] # split dataset into 80% train and 20% test
#test_df = df_random[split_size:]
#train_df = pd.DataFrame(train_df)
#test_df = pd.DataFrame(test_df)

train_df, test_df = new_rating.randomSplit([0.8,0.2], seed = 63)

In [112]:
train_df.show()

+-------+--------+------+
|user_id|anime_id|rating|
+-------+--------+------+
|      3|   14075|   8.0|
|      3|   28223|   6.0|
|      3|   28891|   9.0|
|      5|      47|   8.0|
|      5|     527|   6.0|
|      5|     898|   5.0|
|      5|    1674|   5.0|
|      5|    2961|   6.0|
|      5|    2962|   6.0|
|      5|    3503|   1.0|
|      5|    3702|   9.0|
|      5|   11113|   7.0|
|      5|   12447|   4.0|
|      5|   13367|   2.0|
|      5|   16067|   5.0|
|      5|   16417|   3.0|
|      5|   17875|   5.0|
|      5|   23447|   5.0|
|      5|   25159|   2.0|
|      5|   28677|   5.0|
+-------+--------+------+
only showing top 20 rows


In [143]:
train_means = train_df.select(mean('rating')).collect()[0][0]
print(train_means)

# User bias: each user's mean training rating minus the global training mean.
user_bias = train_df.groupBy("user_id").agg(mean("rating").alias("user_bias")).orderBy("user_id")
user_bias = user_bias.withColumn("user_bias", col("user_bias") - train_means)
user_bias.show()

# Anime bias: each anime's mean training rating minus the global training mean.
anime_bias = train_df.groupBy("anime_id").agg(mean("rating").alias("anime_bias")).orderBy("anime_id")
anime_bias = anime_bias.withColumn("anime_bias", col("anime_bias") - train_means)
anime_bias.show()

7.812962801637774
+-------+--------------------+
|user_id|           user_bias|
+-------+--------------------+
|      3|-0.14629613497110672|
|      5|  -2.871786331049538|
|      7| -0.7541392722260092|
|     10|  1.1870371983622263|
|     11| 0.18703719836222632|
|     14|-0.42834741702238865|
|     17|   -1.28915327782825|
|     20|  2.1870371983622263|
|     21| -0.3129628016377737|
|     23|  1.6870371983622263|
|     24| -1.4796294683044406|
|     27|  0.6870371983622263|
|     29| 0.18703719836222632|
|     30|  1.1870371983622263|
|     31| 0.18703719836222632|
|     32|  2.1870371983622263|
|     37| -1.8129628016377737|
|     38| -2.3129628016377737|
|     39|  2.1870371983622263|
|     41|-0.47962946830444064|
+-------+--------------------+
only showing top 20 rows
+--------+--------------------+
|anime_id|          anime_bias|
+--------+--------------------+
|       1|  1.0832957017635874|
|       5|  0.5894762227524701|
|       6|  0.6015449703829514|
|       7| -0.4349140
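
Written out, the quantities computed in this cell are as follows. Here $\mu$ is `train_means`, $R$ is the set of training ratings, $I_u$ is the set of anime rated by user $u$ in training, and $U_i$ is the set of users who rated anime $i$ in training. The symbols are notation introduced here for clarity; they do not appear in the notebook.

```latex
\mu = \frac{1}{|R|} \sum_{(u,i) \in R} r_{ui}, \qquad
b_u = \frac{1}{|I_u|} \sum_{i \in I_u} r_{ui} - \mu, \qquad
b_i = \frac{1}{|U_i|} \sum_{u \in U_i} r_{ui} - \mu
```

Note that both biases are measured against the same global mean, so the anime bias is not a residual after removing the user bias.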

In [153]:
global_baseline_train = train_df
global_baseline_train = global_baseline_train.join(user_bias, on = "user_id", how = "left")
global_baseline_train = global_baseline_train.join(anime_bias, on = "anime_id", how = "left")
global_baseline_train = global_baseline_train.withColumn("prediction", 
                                                         coalesce(global_baseline_train["user_bias"],lit(0)) + 
                                                         coalesce(global_baseline_train["anime_bias"],lit(0)) + 
                                                         train_means)

global_baseline_train = global_baseline_train.withColumn("prediction", when(global_baseline_train["prediction"] < 1, 1)
    .when(global_baseline_train["prediction"] > 10, 10)
    .otherwise(global_baseline_train["prediction"]))

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="rating", metricName="rmse")
rmse_gb_train = evaluator.evaluate(global_baseline_train)
print(rmse_gb_train)

1.1129095572088195


In [154]:
global_baseline_test = test_df
global_baseline_test = global_baseline_test.join(user_bias, on = "user_id", how = "left")
global_baseline_test = global_baseline_test.join(anime_bias, on = "anime_id", how = "left")
global_baseline_test = global_baseline_test.withColumn("prediction", 
                                                       coalesce(global_baseline_test["user_bias"],lit(0)) + 
                                                       coalesce(global_baseline_test["anime_bias"],lit(0)) + 
                                                       train_means)
global_baseline_test = global_baseline_test.withColumn("prediction", when(global_baseline_test["prediction"] < 1, 1)
    .when(global_baseline_test["prediction"] > 10, 10)
    .otherwise(global_baseline_test["prediction"]))

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="rating", metricName="rmse")
rmse_gb_test = evaluator.evaluate(global_baseline_test)
print(rmse_gb_test)

1.384266137831554
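
The two cells above implement a global baseline: the prediction is the training mean plus a user bias plus an anime bias, with a bias of 0 for users or anime not seen in training (the `coalesce(..., lit(0))` calls), clamped to the 1 to 10 rating scale. A toy, Spark-free sketch of the same logic is below. The ratings are made up and all function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def fit_baseline(train):
    """train: list of (user_id, anime_id, rating) -> (mu, user_bias, anime_bias)."""
    mu = sum(r for _, _, r in train) / len(train)
    by_user, by_anime = defaultdict(list), defaultdict(list)
    for u, a, r in train:
        by_user[u].append(r)
        by_anime[a].append(r)
    user_bias = {u: sum(rs) / len(rs) - mu for u, rs in by_user.items()}
    anime_bias = {a: sum(rs) / len(rs) - mu for a, rs in by_anime.items()}
    return mu, user_bias, anime_bias


def predict(mu, user_bias, anime_bias, u, a):
    # Unseen users or anime contribute a bias of 0, like coalesce(..., lit(0)).
    raw = mu + user_bias.get(u, 0.0) + anime_bias.get(a, 0.0)
    return min(10.0, max(1.0, raw))  # clamp to the rating scale


def rmse(rows, mu, user_bias, anime_bias):
    sq_errors = [(predict(mu, user_bias, anime_bias, u, a) - r) ** 2 for u, a, r in rows]
    return sqrt(sum(sq_errors) / len(sq_errors))


train = [(1, 10, 8.0), (1, 11, 6.0), (2, 10, 10.0), (2, 12, 4.0)]
test = [(1, 12, 5.0), (3, 10, 9.0)]  # user 3 never appears in train

mu, ub, ab = fit_baseline(train)
print(mu, ub, ab)
print(rmse(train, mu, ub, ab), rmse(test, mu, ub, ab))
```

Because the biases are fitted on the training set only, a higher test RMSE than training RMSE, as seen above, is the expected direction of the gap.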


In [164]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.linalg import DenseVector, Vectors

In [284]:
tokenizer = Tokenizer(inputCol="description", outputCol="words")
anime_df_processed = tokenizer.transform(new_anime_full)
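# NOTE: numFeatures=1 hashes every token into a single bucket, so each vector is just the
# token count; a larger value (e.g. the commented-out 500) is needed for meaningful features.
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=1) #numFeatures=500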
anime_features_df = hashingTF.transform(anime_df_processed)
anime_features_df.show()

+--------+--------------------+--------------------+-----+--------+------+-------+--------------------+--------------------+---------------+
|anime_id|                name|               genre| type|episodes|rating|members|         description|               words|       features|
+--------+--------------------+--------------------+-----+--------+------+-------+--------------------+--------------------+---------------+
|     483|Kurau Phantom Memory|Action, Drama, Sc...|   TV|    24.0|  7.46|  17463|It is the year 21...|[it, is, the, yea...|(1,[0],[106.0])|
|   16790|    Super Samchongsa|Action, Mecha, Sh...|Movie|     1.0|  4.33|     63|A Korean animated...|[a, korean, anima...| (1,[0],[24.0])|
|    8184|  Bouken Gabotenjima|           Adventure|   TV|    39.0|  6.58|    140|Bouken Gabotenjim...|[bouken, gabotenj...| (1,[0],[20.0])|
|   13501|  Cofun Gal no Coffy|Comedy, Historica...|  ONA|    10.0|  4.45|     75|Tumulus Gal Coffy...|[tumulus, gal, co...| (1,[0],[65.0])|
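+--------+--------------------+--------------------+-----+--------+------+-------+--------------------+--------------------+---------------+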
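
The `features` column above has a single bucket because `numFeatures=1`. The hashing trick maps each token to the index `hash(token) mod numFeatures`, so with one bucket every token lands at index 0 and the value is simply the token count (106, 24, 20 and 65 here). A small Spark-free sketch of the idea is below. It uses `zlib.crc32` as a stand-in hash purely for illustration (Spark's `HashingTF` uses MurmurHash3, so its bucket indices will differ), the sentence is made up, and `hashing_tf` is a hypothetical helper.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Map tokens to {bucket_index: count} with the hashing trick."""
    counts = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens)
    return dict(counts)


tokens = "it is the year 2100 and it is raining".split()

one_bucket = hashing_tf(tokens, 1)      # every token collides in bucket 0
many_buckets = hashing_tf(tokens, 500)  # tokens spread over up to 500 buckets

print(one_bucket)
print(many_buckets)
```

With more buckets the total count is unchanged, but it is spread over several indices, which is what lets later steps tell documents apart.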

In [285]:
idf = IDF(inputCol="features", outputCol="IDF_features")
idf_model = idf.fit(anime_features_df)
tfidf_df = idf_model.transform(anime_features_df)
tfidf_df = tfidf_df.select("anime_id","IDF_features")
tfidf_df.show()

+--------+-------------+
|anime_id| IDF_features|
+--------+-------------+
|     483|(1,[0],[0.0])|
|   16790|(1,[0],[0.0])|
|    8184|(1,[0],[0.0])|
|   13501|(1,[0],[0.0])|
+--------+-------------+
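
Every `IDF_features` value is 0.0 because, with a single hash bucket, that bucket occurs in all four documents. Spark's `IDF` uses the smoothed formula idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term, so df = m gives log(1) = 0. The sketch below checks the arithmetic; `spark_style_idf` is a hypothetical helper name.

```python
from math import log


def spark_style_idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return log((num_docs + 1) / (doc_freq + 1))


m = 4                                # four sampled anime descriptions
idf_all_docs = spark_style_idf(m, 4)  # bucket present in every document
idf_one_doc = spark_style_idf(m, 1)   # for contrast: bucket present in one document

tf_counts = [106.0, 24.0, 20.0, 65.0]  # term frequencies from the features column
tfidf = [tf * idf_all_docs for tf in tf_counts]

print(idf_all_docs, idf_one_doc)
print(tfidf)
```

With a larger `numFeatures`, terms would typically fall into different buckets, most of which do not occur in every document, and would therefore receive non-zero weights.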



In [286]:
from pyspark.ml.feature import Normalizer
normalizer = Normalizer(inputCol="IDF_features", outputCol="norm")
normalized_tfidf_df = normalizer.transform(tfidf_df)

In [287]:
normalized_tfidf_df_2 = normalized_tfidf_df.withColumnRenamed("anime_id", "anime_id_2").withColumnRenamed("norm","norm_2")
cosine_sim = normalized_tfidf_df.crossJoin(normalized_tfidf_df_2)

In [288]:
# After L2 normalization, the dot product of two vectors equals their cosine similarity.
sim_cos = udf(lambda x,y : float(x.dot(y)), FloatType())

# Execute the UDF on the DataFrame
cosine_sim = cosine_sim.withColumn("similarity", sim_cos(col("norm"),col("norm_2")))

In [292]:
cosine_sim.printSchema()

root
 |-- anime_id: integer (nullable = true)
 |-- IDF_features: vector (nullable = true)
 |-- norm: vector (nullable = true)
 |-- anime_id_2: integer (nullable = true)
 |-- IDF_features: vector (nullable = true)
 |-- norm_2: vector (nullable = true)
 |-- similarity: float (nullable = true)
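
Because `Normalizer` rescales each vector to unit L2 length by default, the dot product in the UDF equals cosine similarity. The cross join, however, produces n² rows, which for the 10,640 joined anime would be about 113 million pairs. A Spark-free sketch of both points is below, using dicts as sparse vectors. The vectors are made up and the helper names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def l2_normalize(vec):
    """Scale a sparse vector {index: value} to unit L2 length."""
    norm = sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return dict(vec)  # sketch's choice: leave an all-zero vector unchanged
    return {k: v / norm for k, v in vec.items()}


def dot(a, b):
    """Sparse dot product; iterates over the shorter vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


vectors = {483: {0: 1.0, 3: 2.0}, 16790: {0: 2.0, 3: 4.0}, 8184: {5: 1.0}}
unit = {anime_id: l2_normalize(v) for anime_id, v in vectors.items()}

# Similarity is symmetric, so only unordered pairs (i < j) are computed.
sims = {(i, j): dot(unit[i], unit[j]) for i, j in combinations(sorted(unit), 2)}
print(sims)

n = 10_640
print(n * n, n * (n - 1) // 2)  # rows in a full cross join vs. unordered pairs
```

Restricting the join to unordered pairs roughly halves the work, though the number of pairs still grows quadratically with the number of anime.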



In [293]:
cosine_sim.show()

Py4JJavaError: An error occurred while calling o36892.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 497.0 failed 1 times, most recent failure: Lost task 0.0 in stage 497.0 (TID 1010) (DESKTOP-K2VAS2U executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed). Consider setting 'spark.sql.execution.pyspark.udf.faulthandler.enabled' or'spark.python.worker.faulthandler.enabled' configuration to 'true' for the better Python traceback.
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:599)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:123)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:532)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:402)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
	at org.apache.spark.scheduler.Task.run(Task.scala:147)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:386)
	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:104)
	... 24 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$3(DAGScheduler.scala:2935)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2935)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2927)
	at scala.collection.immutable.List.foreach(List.scala:334)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2927)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1295)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1295)
	at scala.Option.foreach(Option.scala:437)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1295)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3207)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3141)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3130)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:50)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1009)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2484)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2505)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2524)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:544)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:497)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:58)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:402)
	at org.apache.spark.sql.execution.adaptive.ResultQueryStageExec.$anonfun$doMaterialize$1(QueryStageExec.scala:325)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$4(SQLExecution.scala:318)
	at org.apache.spark.sql.execution.SQLExecution$.withSessionTagsApplied(SQLExecution.scala:268)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$3(SQLExecution.scala:316)
	at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:312)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed). Consider setting 'spark.sql.execution.pyspark.udf.faulthandler.enabled' or'spark.python.worker.faulthandler.enabled' configuration to 'true' for the better Python traceback.
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:599)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:123)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:532)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:402)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
	at org.apache.spark.scheduler.Task.run(Task.scala:147)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
	... 3 more
Caused by: java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:386)
	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:104)
	... 24 more
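
The error message itself names two settings that make Spark print the Python-side traceback when a worker dies. A hedged sketch of collecting them is below. Pointing `PYSPARK_PYTHON` at the notebook's interpreter reflects my assumption about one possible cause (a driver/worker Python mismatch), not a confirmed diagnosis, and `apply_conf` is a hypothetical helper that works with any builder exposing `.config(key, value)`, such as `SparkSession.builder`.

```python
import os
import sys

# Setting names copied from the error message above.
DEBUG_CONF = {
    "spark.sql.execution.pyspark.udf.faulthandler.enabled": "true",
    "spark.python.worker.faulthandler.enabled": "true",
}

# Assumption: make the Python workers use the same interpreter as the driver.
if sys.executable:
    os.environ.setdefault("PYSPARK_PYTHON", sys.executable)


def apply_conf(builder, conf):
    """Apply each key/value pair through builder.config and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Intended use, simplest when the session is first created:
# spark = apply_conf(SparkSession.builder, DEBUG_CONF).getOrCreate()
```

With the fault handler enabled, the next crash should include a Python traceback that shows whether the failure comes from the UDF itself or from the worker environment.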


# Data Scraping