# Spark Preparation
First, check whether the notebook is running in Google Colab. If it is, install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.5.1 (prebuilt for Hadoop 3), Java 8, and findspark, which locates the Spark installation on the system. All of these can be installed from inside the Colab notebook.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

In [None]:
try:
  import google.colab  # only importable inside Google Colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

In [None]:
if IN_COLAB:
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !wget -q https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
    !tar xf spark-3.5.1-bin-hadoop3.tgz
    !mv spark-3.5.1-bin-hadoop3 spark
    !pip install -q findspark
    import os
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["SPARK_HOME"] = "/content/spark"
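
Before starting Spark, it can help to confirm that the two environment variables set above point to directories that actually exist. The following is a minimal sketch in plain Python; the helper name `missing_spark_paths` is hypothetical and not part of Spark or findspark.

```python
import os
from typing import Mapping


def missing_spark_paths(env: Mapping[str, str]) -> list:
    """Return the required variables that are unset or do not name a directory.

    Hypothetical helper for illustration only.
    """
    required = ("JAVA_HOME", "SPARK_HOME")
    return [name for name in required if not os.path.isdir(env.get(name, ""))]


problems = missing_spark_paths(os.environ)
if problems:
    print("Check these variables before starting Spark:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```

Outside Colab the variables are typically unset, so the first message is expected there; it only signals a problem when the Colab installation cell has already run.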

# Start a Local Cluster

In [None]:
# Install PySpark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=50581ce2fa36933cfb53e42c89a4f84938e8f663477f27903c160209c7814bbc
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [None]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("Spark Assignment") \
    .getOrCreate()

# Spark Assignment

Based on the movie review dataset in 'netflix-rotten-tomatoes-metacritic-imdb.csv', answer the questions below.

**Note:** Do not clean or remove missing data.

In [None]:
# Load the dataset
df = spark.read.csv("netflix-rotten-tomatoes-metacritic-imdb.csv", header=True)

In [25]:
df.show(5)

+-------------------+--------------------+--------------------+----------------+---------------+----------------+--------------------+------------+---------------+--------------------+--------------------+-----------+----------+---------------------+----------------+---------------+--------------------+----------+------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------------+------------+
|              Title|               Genre|                Tags|       Languages|Series or Movie|Hidden Gem Score|Country Availability|     Runtime|       Director|              Writer|              Actors|View Rating|IMDb Score|Rotten Tomatoes Score|Metacritic Score|Awards Received|Awards Nominated For| Boxoffice|Release Date|Netflix Release Date|    Production House|        Netflix Link|           IMDb Link|             Summary|IMDb Votes|               Image|              

## What are the maximum and average of the overall hidden gem score?

In [None]:
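# The CSV was read without inferSchema, so every column is a string; cast before aggregating
scores = df.select(df["Hidden Gem Score"].cast("double").alias("Hidden Gem Score"))
max_hidden_gem_score = scores.agg({"Hidden Gem Score": "max"}).collect()[0][0]
avg_hidden_gem_score = scores.agg({"Hidden Gem Score": "avg"}).collect()[0][0]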

In [None]:
print("Maximum Hidden Gem Score:", max_hidden_gem_score)
print("Average Hidden Gem Score:", round(avg_hidden_gem_score, 2))

Maximum Hidden Gem Score: 9.8
Average Hidden Gem Score: 5.94
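
Because the dataset is loaded with `header=True` but without `inferSchema`, Spark reads every column as a string. A maximum taken over strings compares text character by character, which can differ from the numeric maximum. The sketch below shows the difference in plain Python; the sample values are hypothetical and not taken from the dataset.

```python
# Hypothetical sample values; they are not taken from the dataset.
raw_scores = ["9.8", "10.0", "3.5", None]

# Aggregates skip missing values, so drop them first.
present = [s for s in raw_scores if s is not None]

lexicographic_max = max(present)                # compares text: "9..." sorts after "1..."
numeric_max = max(float(s) for s in present)    # compares the parsed numbers
numeric_avg = sum(float(s) for s in present) / len(present)

print("Max as text:   ", lexicographic_max)
print("Max as number: ", numeric_max)
print("Average:       ", round(numeric_avg, 2))
```

If every score in the column is below 10, the two maxima agree, so the printed result above is unaffected in that case.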


## How many movies are available in Korea?

In [None]:
# Interpreted here as: count the titles (series or movies) whose Languages column contains "Korean"
korean_series_movies_count = df.filter(df.Languages.contains("Korean")).count()

In [None]:
print("Number of series or movies available in Korean language:", korean_series_movies_count)

Number of series or movies available in Korean language: 735
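
Note that `contains("Korean")` is a substring match on the whole `Languages` cell, so it also matches any entry that merely includes the word. A stricter alternative is to split the cell on `", "` and compare whole entries. The sketch below contrasts the two in plain Python, assuming hypothetical cell values that are not taken from the dataset.

```python
# Hypothetical Languages cells; they are not taken from the dataset.
languages = ["English, Korean", "Korean Sign Language", "Korean", "Japanese", None]


def mentions(cell, language):
    """Substring test, similar in spirit to Column.contains; missing cells never match."""
    return cell is not None and language in cell


def lists_language(cell, language):
    """Exact test against the comma-separated entries of the cell."""
    return cell is not None and language in cell.split(", ")


substring_count = sum(mentions(cell, "Korean") for cell in languages)
token_count = sum(lists_language(cell, "Korean") for cell in languages)

print("Substring matches:", substring_count)
print("Exact entry matches:", token_count)
```

Whether the two counts differ on the real data depends on the values present in the column.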


## Which director has the highest average hidden gem score?

In [26]:
director_avg_hidden_gem_score = df.groupBy("Director") \
                                  .agg({"Hidden Gem Score": "avg"}) \
                                  .orderBy("avg(Hidden Gem Score)", ascending=False) \
                                  .first()

In [27]:
print("Director with the highest average hidden gem score:", director_avg_hidden_gem_score["Director"])

Director with the highest average hidden gem score: Dorin Marcu
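
A ranking by average favours directors with very few titles, since a single high score is not pulled down by any others. The sketch below reproduces the group, average, and pick-the-top steps in plain Python on hypothetical rows (not taken from the dataset) and also reports how many titles stand behind each average.

```python
from collections import defaultdict

# Hypothetical (director, score) rows; they are not taken from the dataset.
rows = [
    ("Director A", 9.0),
    ("Director B", 8.5),
    ("Director B", 8.7),
    ("Director B", 8.9),
    ("Director C", None),  # a missing score does not contribute to an average
]

sums = defaultdict(float)
title_counts = defaultdict(int)
for director, score in rows:
    if score is None:
        continue
    sums[director] += score
    title_counts[director] += 1

averages = {d: sums[d] / title_counts[d] for d in sums}
best_director = max(averages, key=averages.get)

print("Top director:", best_director)
print("Titles behind that average:", title_counts[best_director])
```

Here the top spot goes to a director with a single scored title, which is worth keeping in mind when reading the result above.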


## How many genres are there in the dataset?

In [32]:
from pyspark.sql.functions import split, explode

# Split the Genre column by comma and explode to separate individual genres
df_genres = df.withColumn("Genre", explode(split(df["Genre"], ", ")))

In [33]:
df_genres.show(10)

+-------------------+---------+--------------------+----------------+---------------+----------------+--------------------+------------+---------------+--------------------+--------------------+-----------+----------+---------------------+----------------+---------------+--------------------+----------+------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------------+------------+
|              Title|    Genre|                Tags|       Languages|Series or Movie|Hidden Gem Score|Country Availability|     Runtime|       Director|              Writer|              Actors|View Rating|IMDb Score|Rotten Tomatoes Score|Metacritic Score|Awards Received|Awards Nominated For| Boxoffice|Release Date|Netflix Release Date|    Production House|        Netflix Link|           IMDb Link|             Summary|IMDb Votes|               Image|              Poster|        TMDb Tr

In [34]:
# Count distinct genres
distinct_genre_count = df_genres.select("Genre").distinct().count()

In [35]:
print("Number of distinct genres in the dataset:", distinct_genre_count)

Number of distinct genres in the dataset: 28
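
The split, explode, and distinct steps can be illustrated without Spark. The following plain Python sketch uses hypothetical `Genre` cells (not taken from the dataset) and mirrors the fact that rows with a missing genre produce no exploded rows.

```python
# Hypothetical Genre cells; they are not taken from the dataset.
genre_cells = ["Comedy, Drama", "Drama", "Action, Comedy", None]

# "Explode": one entry per genre per title; cells with no genre yield nothing.
exploded = [
    genre
    for cell in genre_cells
    if cell is not None
    for genre in cell.split(", ")
]

# "Distinct": a set keeps each genre once.
distinct_genres = set(exploded)

print("Exploded rows:", len(exploded))
print("Distinct genres:", sorted(distinct_genres))
```

The split relies on the separator being exactly a comma followed by one space; entries with different spacing would be counted as separate genres.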
