# Spark Preparation
First, check whether the notebook is running in Google Colab. If it is, install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.1.1 with Hadoop 3.2, Java 8, and findspark, which locates Spark on the system. The installation can be carried out from inside the Colab notebook itself.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

In [None]:
try:
  import google.colab  # Only importable inside Google Colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False
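
A variation on the check above, as a minimal sketch: `importlib.util.find_spec` reports whether a module can be found without importing the module itself (parent packages are still imported). The names `is_module_available` and `IN_COLAB_SKETCH` are illustrative and not part of the original notebook.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if the module `name` can be found on this system."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. `google`) raises ModuleNotFoundError,
        # which is a subclass of ImportError.
        return False


# Hypothetical stand-in for the IN_COLAB flag set above.
IN_COLAB_SKETCH = is_module_available("google.colab")
print(IN_COLAB_SKETCH)
```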

In [None]:
if IN_COLAB:
  !apt-get update # Refresh the apt package index.
  !apt-get install openjdk-8-jdk-headless -qq > /dev/null # Install Java 8.
  !wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz # Download Apache Spark.
  !tar xf spark-3.1.1-bin-hadoop3.2.tgz # Extract the tgz archive.
  !pip install -q findspark # Install findspark, which adds PySpark to sys.path at runtime.

  # Set environment variables
  import os
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
  os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:5 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [2,173 kB]
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [61.3 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy-
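
The download URL, the archive name, and `SPARK_HOME` in the install cell above all repeat the same version string, so they have to be changed together. The following is a minimal sketch of deriving all three from one place. `spark_paths` and its `base_dir` parameter are hypothetical helpers, not part of the original notebook, and the URL layout is simply the one used in the install cell.

```python
SPARK_VERSION = "3.1.1"   # version downloaded in the install cell
HADOOP_VERSION = "3.2"


def spark_paths(spark_version: str, hadoop_version: str, base_dir: str = "/content") -> dict:
    """Build the archive name, download URL, and SPARK_HOME from the versions."""
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    return {
        "archive": f"{name}.tgz",
        "url": f"http://archive.apache.org/dist/spark/spark-{spark_version}/{name}.tgz",
        "home": f"{base_dir}/{name}",
    }


paths = spark_paths(SPARK_VERSION, HADOOP_VERSION)
print(paths["url"])
print(paths["home"])
```

With this in place, `os.environ["SPARK_HOME"]` could be set from `paths["home"]` so that it cannot drift from the downloaded archive.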

# Start a Local Cluster

In [None]:
# Initialize findspark
import findspark
findspark.init()
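
`findspark.init()` has to locate a Spark installation, and with the setup above that is the directory named by `SPARK_HOME`. As a hedged sketch, a check like the following could run beforehand to catch a failed download or extraction early. `missing_dirs` is a hypothetical helper, not part of findspark.

```python
import os


def missing_dirs(var_names, environ=os.environ):
    """Return the variables that are unset or do not point to an existing directory."""
    return [name for name in var_names if not os.path.isdir(environ.get(name, ""))]


problems = missing_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these environment variables:", ", ".join(problems))
```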

In [None]:
# Create a PySpark session
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local[*]")\
        .appName('NetflixDE')\
        .getOrCreate()
spark

# Spark Assignment

Based on the movie review dataset in 'netflix-rotten-tomatoes-metacritic-imdb.csv', answer the questions below.

**Note:** Do not clean or remove missing data.

In [None]:
import requests

url = 'https://raw.githubusercontent.com/pvateekul/2110446_DSDE_2023s2/main/code/Week10_Spark/netflix-rotten-tomatoes-metacritic-imdb.csv'
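response = requests.get(url, timeout=60)
response.raise_for_status()  # Stop here if the download failed, rather than saving an error page as CSV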

local_path = '/tmp/netflix-rotten-tomatoes-metacritic-imdb.csv'
with open(local_path, 'wb') as file:
    file.write(response.content)

df = spark.read.csv(local_path, header=True, inferSchema=True)
df.show(5)

+-------------------+--------------------+--------------------+----------------+---------------+----------------+--------------------+------------+---------------+--------------------+--------------------+-----------+----------+---------------------+----------------+---------------+--------------------+----------+------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------------+------------+
|              Title|               Genre|                Tags|       Languages|Series or Movie|Hidden Gem Score|Country Availability|     Runtime|       Director|              Writer|              Actors|View Rating|IMDb Score|Rotten Tomatoes Score|Metacritic Score|Awards Received|Awards Nominated For| Boxoffice|Release Date|Netflix Release Date|    Production House|        Netflix Link|           IMDb Link|             Summary|IMDb Votes|               Image|              

In [None]:
# Import the Spark SQL functions used below.
# Note: min and max imported here shadow Python's built-in min and max.
from pyspark.sql.functions import col, min, max, avg, explode, split, countDistinct

## What are the maximum and average of the overall Hidden Gem Score?

In [None]:
df.select(max('Hidden Gem Score'), avg('Hidden Gem Score')).show()

+---------------------+---------------------+
|max(Hidden Gem Score)|avg(Hidden Gem Score)|
+---------------------+---------------------+
|                  9.8|    5.937551386501234|
+---------------------+---------------------+
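
Because the assignment asks that missing data be left in place, it helps to remember that Spark's `max` and `avg` aggregates skip null values rather than treating them as zero. The plain-Python sketch below mimics that behavior on a few made-up scores. It illustrates the semantics only and is not Spark code; `max_and_avg_skipping_nulls` is a hypothetical name.

```python
import builtins  # the notebook imports Spark's max, which shadows the built-in


def max_and_avg_skipping_nulls(values):
    """Return (max, average) over the non-None values, or (None, None) if there are none."""
    present = [v for v in values if v is not None]
    if not present:
        return None, None
    return builtins.max(present), sum(present) / len(present)


scores = [8.0, None, 6.0, 4.0]  # made-up sample with one missing value
print(max_and_avg_skipping_nulls(scores))  # (8.0, 6.0)
```

The average is taken over the three present values, not over all four rows.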



## How many movies are available in Korea?

In [None]:
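# Count the titles whose Languages column contains "Korean"
# (language is the criterion used here; series and movies are both included).
korea_avail = df.filter(col("Languages").contains("Korean")).count()
korea_avail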

735
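
The count above is based on the `Languages` column. If "available in Korea" were instead read as availability by country, the `Country Availability` column would be the one to inspect, and restricting to movies would use `Series or Movie`. The plain-Python sketch below contrasts the two readings on made-up rows. The sample values, including the country name `South Korea`, are assumptions for illustration and are not taken from the dataset.

```python
rows = [  # hypothetical rows, not real data
    {"Title": "A", "Series or Movie": "Movie",
     "Languages": "Korean, English", "Country Availability": "Japan,South Korea"},
    {"Title": "B", "Series or Movie": "Series",
     "Languages": "Korean", "Country Availability": "South Korea"},
    {"Title": "C", "Series or Movie": "Movie",
     "Languages": "English", "Country Availability": "South Korea,Thailand"},
    {"Title": "D", "Series or Movie": "Movie",
     "Languages": None, "Country Availability": None},
]


def items(cell):
    """Split a comma-separated cell into trimmed items; None gives an empty list."""
    return [part.strip() for part in cell.split(",")] if cell else []


# Reading 1: any title with Korean among its languages.
by_language = [r["Title"] for r in rows if "Korean" in items(r["Languages"])]

# Reading 2: movies only, listed as available in South Korea.
movies_by_country = [
    r["Title"] for r in rows
    if r["Series or Movie"] == "Movie" and "South Korea" in items(r["Country Availability"])
]

print(by_language)        # ['A', 'B']
print(movies_by_country)  # ['A', 'C']
```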

## Which director has the highest average hidden gem score?

In [None]:
from pyspark.sql import functions as F

# Average Hidden Gem Score per director, highest first
highest_avg = (
    df.groupby('Director')
      .agg(F.avg('Hidden Gem Score').alias('Avg Hidden Gem Score'))
      .orderBy(F.desc('Avg Hidden Gem Score'))
)
highest_avg.show(1)

+-----------+--------------------+
|   Director|Avg Hidden Gem Score|
+-----------+--------------------+
|Dorin Marcu|                 9.8|
+-----------+--------------------+
only showing top 1 row
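
`show(1)` prints only the first row of the sorted result, so if several directors share the top average, only one of them appears. The following plain-Python sketch performs the same group-and-average step on made-up data and returns every tied leader. `top_groups` is a hypothetical helper, and the director names are placeholders.

```python
import builtins  # the notebook imports Spark's max, which shadows the built-in
from collections import defaultdict


def top_groups(pairs):
    """Average the scores per key; return (best_average, sorted keys that reach it)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for key, score in pairs:
        if score is None:  # skip missing scores, as Spark's avg does
            continue
        totals[key][0] += score
        totals[key][1] += 1
    averages = {key: s / n for key, (s, n) in totals.items()}
    if not averages:
        return None, []
    best = builtins.max(averages.values())
    return best, sorted(key for key, value in averages.items() if value == best)


sample = [
    ("Director X", 9.0), ("Director X", 7.0),
    ("Director Y", 8.0),
    ("Director Z", 8.0), ("Director Z", None),
    ("Director W", 5.0),
]
print(top_groups(sample))  # (8.0, ['Director X', 'Director Y', 'Director Z'])
```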



## How many genres are there in the dataset?

In [None]:
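# Split the comma-separated Genre column, explode it to one genre per row,
# then count the distinct values. The raw string avoids an invalid "\s" escape.
genres_count = (
    df.withColumn("Genre", explode(split(col("Genre"), r",\s*")))
      .agg(countDistinct(col("Genre")).alias("Distinct Genres Count"))
)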

genres_count.show()

+---------------------+
|Distinct Genres Count|
+---------------------+
|                   28|
+---------------------+
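
The `split`, `explode`, and `countDistinct` steps above amount to breaking each comma-separated `Genre` cell into individual genres and counting the unique values. Here is a plain-Python sketch of that logic on made-up cells; the sample strings are illustrative and not taken from the dataset, and `distinct_genres` is a hypothetical name.

```python
import re


def distinct_genres(cells):
    """Collect the unique genres from comma-separated cells, ignoring missing cells."""
    genres = set()
    for cell in cells:
        if cell is None:  # explode produces no rows for a null array
            continue
        genres.update(re.split(r",\s*", cell))  # same pattern as the Spark split
    return genres


sample = ["Crime, Drama", "Comedy,Drama", None, "Comedy"]
print(len(distinct_genres(sample)))  # 3
```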

