
# Spark Preparation

We first check whether the notebook is running in Google Colab. If it is, we install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.3.2 with Hadoop 3.3, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out from inside the Colab notebook itself. To learn more, see "A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!".
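
The Colab check described above could be sketched as follows. This is a minimal sketch, assuming that the presence of the `google.colab` package is a good enough signal. The names `running_in_colab` and `IN_COLAB` are hypothetical and do not appear elsewhere in this notebook.

```python
import importlib.util
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab package is loaded or importable."""
    if "google.colab" in sys.modules:
        return True
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # find_spec raises when the parent package "google" is not installed.
        return False


# Hypothetical flag that later cells could use to guard Colab-only installs.
IN_COLAB = running_in_colab()
print("Running in Colab:", IN_COLAB)
```

Install steps that only make sense in Colab can then be guarded with `if IN_COLAB:`, so the same notebook also runs on a local machine.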


In [4]:

!pip install -q findspark


In [2]:

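import os

# Point PySpark at the Java and Spark installations.
# These are Homebrew paths from a local macOS machine; adjust them for your environment.
os.environ["JAVA_HOME"] = "/opt/homebrew/Caskroom/openjdk/21.0.2"
os.environ["SPARK_HOME"] = "/opt/homebrew/Caskroom/apache-spark/3.5.1/libexec"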
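
Because these paths are machine-specific, it can help to confirm that they exist before starting Spark. The following is a minimal sketch using only the standard library. The helper name `missing_spark_paths` is hypothetical and not part of findspark or PySpark.

```python
import os


def missing_spark_paths(names=("JAVA_HOME", "SPARK_HOME")):
    """Return {variable: value} for variables that are unset or not directories."""
    problems = {}
    for name in names:
        value = os.environ.get(name)
        if not value or not os.path.isdir(value):
            problems[name] = value
    return problems


# Report every variable that does not point at an existing directory.
for name, value in missing_spark_paths().items():
    print(f"{name} is not a valid directory: {value!r}")
```

If nothing is printed, both variables point at existing directories. This does not guarantee that the installations themselves are complete.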



## Start a Local Cluster


In [None]:

import findspark
findspark.init()


In [None]:

spark_url = "local"


In [None]:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master(spark_url)
    .appName('Spark Tutorial')
    .config('spark.ui.port', '4040')
    .getOrCreate()
)



# Spark Assignment

Based on the movie review dataset in 'netflix-rotten-tomatoes-metacritic-imdb.csv', answer the questions below.

Note: do not clean or remove missing data.


In [None]:

path = 'netflix-rotten-tomatoes-metacritic-imdb.csv'
df = spark.read.option("header", True).csv(path)


In [None]:

df.show(10)


In [None]:

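# Note: this import shadows Python's built-in min and max for the rest of the notebook.
from pyspark.sql.functions import avg, min, max, countDistinct, explode, split, col

# The CSV was read without schema inference, so every column is a string.
# Cast to double so that max() compares numerically rather than lexicographically.
gem_score = col('Hidden Gem Score').cast('double')

avg_gem = df.select(avg(gem_score))
max_gem = df.select(max(gem_score))

avg_gem.show()
max_gem.show()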


In [None]:

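# Count the rows whose Languages value contains the substring 'Korea' (this matches 'Korean').
# Rows with a missing Languages value do not match and are not counted.
count = df.filter(df['Languages'].contains('Korea')).count()
print(count)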


In [None]:

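# Average Hidden Gem Score per director; avg() implicitly casts the string column to double.
avg_director = df.groupBy('Director').agg(avg('Hidden Gem Score'))
# Highest of those per-director averages.
max_score = avg_director.select(max('avg(Hidden Gem Score)')).collect()[0][0]

# Show every director whose average equals the highest value (ties included).
avg_director.filter(avg_director['avg(Hidden Gem Score)'] == max_score).show()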


In [None]:

# Split the comma-separated Genre string into one row per genre, then count the distinct genres.
df.withColumn("Genre", explode(split(col("Genre"), ", "))).select('Genre').distinct().count()
