# DataFrames Basics Exercises

## Prerequisites

Install Spark and Java in VM

In [1]:
# install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.3.1
!wget -q https://apache.osuosl.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop2.tgz

In [2]:
ls -l # check the .tgz is there

total 267680
drwxr-xr-x 1 root root      4096 Dec  7 14:41 sample_data/
-rw-r--r-- 1 root root 274099817 Oct 15 10:53 spark-3.3.1-bin-hadoop2.tgz


In [3]:
# extract the archive
!tar xf spark-3.3.1-bin-hadoop2.tgz

In [4]:
!pip install -q findspark

Defining the environment

In [5]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop2"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"
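
Before starting the session, it can help to confirm that these variables point to directories that actually exist, since a wrong path tends to surface later as a less obvious error. The following is a minimal sketch in plain Python; `missing_env_dirs` is a hypothetical helper, not part of Spark or findspark.

```python
import os

def missing_env_dirs(names):
    """Return the names whose value is unset or is not an existing directory."""
    return [n for n in names if not os.path.isdir(os.environ.get(n, ""))]

# Report any variable that does not resolve to a directory on this machine.
problems = missing_env_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables before starting Spark:", problems)
```

An empty list means both directories were found; it does not verify that the installations themselves are complete.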

Start Spark Session

---

In [6]:
import findspark
findspark.init("spark-3.3.1-bin-hadoop2")  # relative path to SPARK_HOME

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("DataFramesBasics Exercises") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.3.1'

In [7]:
spark

In [8]:
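# Enable Arrow for faster Spark <-> pandas conversion
# (in Spark 3.x the setting is spark.sql.execution.arrow.pyspark.enabled; the older name is deprecated)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")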

In [9]:
# Import sql functions
from pyspark.sql.functions import *

Download datasets

In [10]:
!mkdir -p /dataset
!wget -q https://raw.githubusercontent.com/paponsro/spark_edem_2022/master/datasets/movies.json -P /dataset
!wget -q https://raw.githubusercontent.com/paponsro/spark_edem_2022/master/datasets/cars.json -P /dataset

## DataFrames Basics Exercises

1) Create a manual DF describing smartphones
  - maker
  - model
  - screen dimension
  - camera megapixels
  
2) Read another file from the dataset/ folder, e.g. movies.json
  - print its schema
  - count the number of rows, call count()
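
For exercise 1, the schema is inferred from the values, so each column should hold a single type across all rows. As a rough pre-check in plain Python (not a Spark API), the sketch below flags columns with mixed types; `inconsistent_columns` is a hypothetical helper and the second row is deliberately wrong for illustration.

```python
def inconsistent_columns(header, rows):
    """Return header names whose values do not all share one Python type (None is ignored)."""
    bad = []
    for i, name in enumerate(header):
        types = {type(r[i]) for r in rows if r[i] is not None}
        if len(types) > 1:
            bad.append(name)
    return bad

header = ["Maker", "Model", "Screen Dimension", "Camera Megapixels"]
rows = [
    ("Apple", "iPhone 11", "512x256", 25),
    ("Sony", "Z", "300x256", "35"),  # megapixels given as a string by mistake
]

print(inconsistent_columns(header, rows))  # ['Camera Megapixels']
```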

In [16]:
header = ["Maker", "Model", "Screen Dimension", "Camera Megapixels"]
values = [
        ("Apple", "iPhone 11", "512x256", 25),  # create your data here, be consistent in the types.
        ("Sony", "Z", "300x256", 35),
        ("Samsung", "Edge", "300x1256", 45)
    ]

df = spark.createDataFrame(values,header)

df.printSchema()
df.show()



import json

smartphones = [{"maker":"Nokia", "model":"7200", "screen dimension":"200x400", "camera megapixels":24},
               {"maker":"Apple", "model":"iPhone X", "screen dimension":"500x450", "camera megapixels":20},
               {"maker":"Samsung", "model":"Galaxy 2", "screen dimension":"400x300", "camera megapixels":28},
               {"maker":"Sony", "model":"Xperia Z", "screen dimension":"200x300", "camera megapixels":14}]

# Alternative version: serialize each dict to a JSON string so Spark parses one record per element
df_phone = spark.read.json(spark.sparkContext.parallelize([json.dumps(p) for p in smartphones]))

root
 |-- Maker: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- Screen Dimension: string (nullable = true)
 |-- Camera Megapixels: long (nullable = true)

+-------+---------+----------------+-----------------+
|  Maker|    Model|Screen Dimension|Camera Megapixels|
+-------+---------+----------------+-----------------+
|  Apple|iPhone 11|         512x256|               25|
|   Sony|        Z|         300x256|               35|
|Samsung|     Edge|        300x1256|               45|
+-------+---------+----------------+-----------------+



In [17]:
df = spark.read.json("/dataset/movies.json")

df.printSchema()
df.show()
df.count()

root
 |-- Creative_Type: string (nullable = true)
 |-- Director: string (nullable = true)
 |-- Distributor: string (nullable = true)
 |-- IMDB_Rating: double (nullable = true)
 |-- IMDB_Votes: long (nullable = true)
 |-- MPAA_Rating: string (nullable = true)
 |-- Major_Genre: string (nullable = true)
 |-- Production_Budget: long (nullable = true)
 |-- Release_Date: string (nullable = true)
 |-- Rotten_Tomatoes_Rating: long (nullable = true)
 |-- Running_Time_min: long (nullable = true)
 |-- Source: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- US_DVD_Sales: long (nullable = true)
 |-- US_Gross: long (nullable = true)
 |-- Worldwide_Gross: long (nullable = true)

+--------------------+-----------------+--------------+-----------+----------+-----------+-----------+-----------------+------------+----------------------+----------------+-------------------+--------------------+------------+--------+---------------+
|       Creative_Type|         Director|   Distributor|

3201

## Columns and Expressions Exercises

1. Read the movies DF and select 2 columns of your choice
2. Create another column with the total profit of each movie = US_Gross + Worldwide_Gross + US_DVD_Sales. Are you obtaining nulls? How can you solve it?
3. Select all COMEDY movies with IMDB rating above 6

Use as many versions as possible
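
On the null question in exercise 2: with SQL semantics, adding a null to a number gives null, so one missing figure makes the whole total null. The sketch below models that behaviour in plain Python, using `None` for null. `sql_add` and this `coalesce` are hypothetical stand-ins, not the Spark functions, and the numbers are made up for illustration.

```python
def sql_add(*values):
    """Add values the way a SQL `+` does: any None (null) makes the result None."""
    if any(v is None for v in values):
        return None
    return sum(values)

def coalesce(*values):
    """Return the first non-None value, or None if every value is None."""
    for v in values:
        if v is not None:
            return v
    return None

# One movie row with a missing DVD figure (illustrative numbers only).
row = {"US_Gross": 100, "Worldwide_Gross": 250, "US_DVD_Sales": None}

naive = sql_add(row["US_Gross"], row["Worldwide_Gross"], row["US_DVD_Sales"])
fixed = sql_add(*(coalesce(row[c], 0) for c in row))

print(naive)  # None
print(fixed)  # 350
```

Replacing each null with 0 before summing treats a missing figure as no revenue, which is a modelling choice rather than the only possible answer.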

In [47]:
# Exercise 1: select two columns
dfcolumn = df.select("Director", "US_Gross")

dfcolumn.show()

import pyspark.sql.functions as f

# Exercise 2: a null in any operand makes the sum null, so replace nulls with 0 before adding
df1 = df.withColumn(
    "new_count",
    f.coalesce(f.col("US_Gross"), f.lit(0))
    + f.coalesce(f.col("Worldwide_Gross"), f.lit(0))
    + f.coalesce(f.col("US_DVD_Sales"), f.lit(0)),
)
df1.show()

# Exercise 3, version 1: SQL expression string
dffilter = df1.filter("IMDB_Rating > 6 AND Major_Genre == 'Comedy'")

dffilter.show()

# Exercise 3, version 2: column expressions
ComediesDF3 = df.select("Title", "Major_Genre", "IMDB_Rating") \
    .where((f.col("IMDB_Rating") > 6) & (f.col("Major_Genre") == "Comedy"))

ComediesDF3.show()



+-----------------+--------+
|         Director|US_Gross|
+-----------------+--------+
|             null|  146083|
|             null|   10876|
|             null|  203134|
|             null|  373615|
|             null| 1009819|
|             null|   24551|
|Christopher Nolan|   44705|
|             null| 6026908|
|   Roman Polanski| 1641825|
|             null|20400000|
|             null|37600000|
|             null|37402877|
|             null|13129846|
|Richard Fleischer|29548291|
|             null| 5228617|
|             null| 3000000|
|             null| 2000000|
|    Blake Edwards| 5000000|
|             null|80000000|
|     Sidney Lumet|       0|
+-----------------+--------+
only showing top 20 rows

+--------------------+-----------------+--------------+-----------+----------+-----------+-----------+-----------------+------------+----------------------+----------------+-------------------+--------------------+------------+--------+---------------+---------+
|       Creativ