# Spark Preparation
First, we check whether the notebook is running in Google Colab. If so, we install all the necessary packages.

To run Spark in Colab, we first install its dependencies in the Colab environment: Apache Spark 3.5.1 prebuilt for Hadoop 3, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out from inside the Colab notebook itself.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

In [1]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

In [8]:
if IN_COLAB:
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !wget -q https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
    !tar xf spark-3.5.1-bin-hadoop3.tgz
    !mv spark-3.5.1-bin-hadoop3 spark
    !pip install -q findspark
    import os
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["SPARK_HOME"] = "/content/spark"
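
Before starting Spark, it can help to confirm that the two environment variables point at directories that actually exist. The following is a minimal sketch and not part of the original notebook: `missing_dirs` is a hypothetical helper name, and the check is only meaningful in Colab, where the cell above sets both variables.

```python
import os

def missing_dirs(var_names, environ=os.environ):
    """Return the variables that are unset or do not point at an existing directory."""
    missing = []
    for name in var_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# Report problems here instead of hitting a less obvious error when Spark starts.
problems = missing_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Not set or not a directory:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME look usable")
```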


# Start a Local Cluster

In [4]:
!pip install -q pyspark


In [18]:
from pyspark.sql.functions import col, avg, split, explode

In [9]:
import findspark
from pyspark.sql import SparkSession

findspark.init()
spark = SparkSession.builder.master('local').appName('Assignment 10 Spark').getOrCreate()

# Spark Assignment

Based on the movie review dataset in 'netflix-rotten-tomatoes-metacritic-imdb.csv', answer the questions below.

**Note:** Do not clean or remove missing data.

In [10]:
df = spark.read.option("header", True).csv('netflix-rotten-tomatoes-metacritic-imdb.csv')

## What are the maximum and average of the overall hidden gem score?

In [11]:
hiddenGemScore = df.select("Hidden Gem Score")

print("maximum of the overall hidden gem score:", hiddenGemScore.agg({"Hidden Gem Score": "max"}).collect()[0][0])
print("average of the overall hidden gem score:", hiddenGemScore.agg({"Hidden Gem Score": "avg"}).collect()[0][0])

maximum of the overall hidden gem score: 9.8
average of the overall hidden gem score: 5.937551386501226
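
One caveat: because the CSV is read with only the `header` option, Spark treats every column as a string. `avg` still returns a number because Spark casts the strings for that aggregate, but `max` on a string column compares text character by character. The plain-Python sketch below, using made-up score strings, shows how the two kinds of maximum could differ if a value such as `10.0` were present. In PySpark, casting the column first, for example with `col("Hidden Gem Score").cast("double")`, is one way to get the numeric comparison.

```python
# Hypothetical score strings, as they would look in a CSV read without a schema.
scores = ["9.8", "10.0", "3.5"]

# Text comparison goes character by character, so "9.8" sorts above "10.0".
string_max = max(scores)

# Converting to float first gives the numeric maximum.
numeric_max = max(float(s) for s in scores)

print(string_max)
print(numeric_max)
```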


## How many movies are available in Korea?

In [14]:
print("amount of movies that are available in Korea: ", df.filter(col('Languages').like('%Korean%')).count())

amount of movies that are available in Korea:  735
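
Note that the filter above counts titles whose `Languages` value mentions Korean, which is not necessarily the same as being available in Korea. The plain-Python sketch below contrasts the two readings on made-up rows. The column name `Country Availability` and all values are assumptions for illustration; check the actual CSV header before relying on them. In PySpark, the membership test could be written with `split` and `array_contains` from `pyspark.sql.functions`.

```python
# Hypothetical rows; "Country Availability" is an assumed column name and the
# values are invented for illustration.
rows = [
    {"Title": "A", "Languages": "Korean", "Country Availability": "South Korea,Japan"},
    {"Title": "B", "Languages": "Korean, English", "Country Availability": "United States"},
    {"Title": "C", "Languages": "English", "Country Availability": "South Korea"},
    {"Title": "D", "Languages": "Korean", "Country Availability": None},
]

def countries(row):
    """Split the comma-separated availability list; missing values give an empty list."""
    value = row["Country Availability"]
    return [c.strip() for c in value.split(",")] if value else []

# Titles whose language list mentions Korean (what a LIKE '%Korean%' filter counts).
in_korean = [r["Title"] for r in rows if "Korean" in (r["Languages"] or "")]

# Titles whose availability list contains South Korea as an exact entry.
in_korea = [r["Title"] for r in rows if "South Korea" in countries(r)]

print(in_korean)
print(in_korea)
```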


## Which director has the highest average hidden gem score?

In [17]:
top_director = (
    df.groupBy('Director')
      .agg(avg('Hidden Gem Score').alias('Average Hidden Gem Score'))
      .orderBy('Average Hidden Gem Score', ascending=False)
      .first()['Director']
)
print(top_director, "has the highest average hidden gem score")

Dorin Marcu has the highest average hidden gem score
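
When reading this result, keep in mind that an average over a single title can outrank an average over many. The plain-Python sketch below uses invented director names and scores to show the effect by reporting the number of titles next to each average. In PySpark, adding `count('*')` to the same `agg` call would likely expose the same information.

```python
from collections import defaultdict

# Hypothetical (director, score) pairs.
titles = [
    ("Director X", 9.0),
    ("Director Y", 8.0),
    ("Director Y", 7.0),
    ("Director Y", 9.0),
]

scores_by_director = defaultdict(list)
for director, score in titles:
    scores_by_director[director].append(score)

# (director, average, number of titles), highest average first.
ranking = sorted(
    ((d, sum(s) / len(s), len(s)) for d, s in scores_by_director.items()),
    key=lambda item: item[1],
    reverse=True,
)
for director, average, n_titles in ranking:
    print(director, average, n_titles)
```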


## How many genres are there in the dataset?

In [23]:
tmp = df.withColumn('Genre', split(df['Genre'], ', '))

print(tmp.select(explode(tmp['Genre']).alias('Genre')).distinct().count(), "genres are there in the dataset")

28 genres are there in the dataset
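
The count above depends on every genre list using exactly a comma followed by one space as its separator. The plain-Python sketch below mirrors the split, explode, and distinct steps on made-up values and shows one way to tolerate inconsistent spacing by splitting on the comma alone and stripping each piece. The sample values are assumptions; the real file may already be consistent.

```python
# Hypothetical Genre values, including a missing one and inconsistent spacing.
genre_column = ["Comedy, Drama", "Drama,Thriller", None, "Comedy ,  Romance"]

distinct_genres = set()
for value in genre_column:
    if value is None:
        continue  # explode() likewise produces no rows for a null array
    for genre in value.split(","):
        distinct_genres.add(genre.strip())

print(sorted(distinct_genres))
print(len(distinct_genres))
```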
