## Test notebook for the cluster created with Docker Compose

The cluster dashboard is port-mapped from the Spark master container and should be reachable at `localhost:8080`.

Inside the Docker network created by `docker compose up`, the Spark master address is `spark://spark-master:7077`.

### Create the Spark session

In [None]:
# import the python libraries to create/connect to a Spark Session
from pyspark.sql import SparkSession

# from the CLIENT, create a Spark Session
# pointing to the Spark master address
#
#   - master  --> the Spark master URL (spark://HOST:PORT)
#   - appName --> the Spark application name
#   - config  --> a configuration key/value pair (here, the memory per executor)
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("Test spark application")
    .config("spark.executor.memory", "512m")
    .getOrCreate()
)

In [None]:
# check the newly created spark session
# this is the entry point to all other Spark functionalities
spark

The __Spark UI__ link shown in the session summary won't work from your browser, because it points to the Docker container running the Spark application.

Port `4040` is mapped to the host, however, so you can open the application's Spark UI by pointing your browser to `localhost:4040`.
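
As a small illustration of the point above, the sketch below rewrites a Spark UI URL so that it uses `localhost` while keeping the port. The helper name `local_ui_url` and the container hostname `3f2a9c1b7d4e` are hypothetical and used only for illustration; the sketch also assumes the UI port is mapped one-to-one to the host, as described for `4040`.

```python
from urllib.parse import urlsplit, urlunsplit


def local_ui_url(ui_web_url, host="localhost"):
    """Swap the hostname of a Spark UI URL for `host`, keeping scheme and port.

    Hypothetical helper: assumes the container port is mapped to the
    same port number on the host.
    """
    parts = urlsplit(ui_web_url)
    netloc = f"{host}:{parts.port}" if parts.port else host
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# "3f2a9c1b7d4e" stands in for whatever container hostname the link shows
example = local_ui_url("http://3f2a9c1b7d4e:4040")
print(example)
```

In the notebook, the same helper could be applied to `spark.sparkContext.uiWebUrl`, which holds the URL behind the __Spark UI__ link.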

In [None]:
# from the Spark Session we can access the Spark Context
# the Spark Context is our connection to the cluster: we use it to create RDDs and submit jobs
sc = spark.sparkContext

## Testing basic parallelization

In [None]:
# distribute a simple list over the cluster and count its elements in parallel
sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8]).count()
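
To make the idea behind a parallel count concrete, here is a pure-Python sketch that needs no cluster: the list is cut into partitions, each partition is counted on its own, and the partial counts are summed. The function `split_into_partitions` and the choice of 3 partitions are assumptions made for illustration; this is not Spark's actual implementation, and Spark decides the partitioning itself.

```python
def split_into_partitions(data, num_partitions):
    """Cut `data` into `num_partitions` contiguous, roughly equal slices."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


data = [1, 2, 3, 4, 5, 6, 7, 8]

# each slice plays the role of one partition handled by one executor task
partitions = split_into_partitions(data, 3)

# every "task" counts only its own partition...
partial_counts = [len(p) for p in partitions]

# ...and the driver adds up the partial results
total = sum(partial_counts)

print(partitions)      # [[1, 2], [3, 4, 5], [6, 7, 8]]
print(partial_counts)  # [2, 3, 3]
print(total)           # 8
```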

## Accessing files

In [None]:
# create an RDD by reading an input file
rdd = sc.textFile("../../datasets/lecture1/file_1.txt")

In [None]:
# take the first 2 elements of the dataset
# `take(n)` returns them to the driver as a list; it plays the role that `show` plays for DataFrames
rdd.take(2)

In [None]:
# create an RDD by reading ALL files in the folder whose names match `file_*.txt`
rdd = sc.textFile("/mapd-workspace/datasets/lecture1/file_*.txt")
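
The wildcard in the path selects files by name. The sketch below uses Python's `fnmatch` module to show which names a pattern like `file_*.txt` would pick up. The file names are hypothetical, and `fnmatch` is only a stand-in here: Spark resolves the pattern through its own file-system layer, whose glob rules may differ in corner cases.

```python
from fnmatch import fnmatchcase

# hypothetical folder content, for illustration only
names = ["file_1.txt", "file_2.txt", "file_10.txt", "notes.txt", "file_1.csv"]

# keep only the names matching the wildcard pattern
matched = [name for name in names if fnmatchcase(name, "file_*.txt")]

print(matched)  # ['file_1.txt', 'file_2.txt', 'file_10.txt']
```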

In [None]:
# print the number of elements in the RDD
rdd.count()

In [None]:
# check how many partitions Spark has used to subdivide the dataset
rdd.getNumPartitions()