### Basic practice in using Lambda + map functions with PySpark

Package Import

In [10]:
### 
# You might have noticed this code in the screencast.
#
# import findspark
# findspark.init('spark-2.3.2-bin-hadoop2.7')
#
# The findspark Python module makes it easier to install
# Spark in local mode on your computer. This is convenient
# for practicing Spark syntax locally. 
# However, the workspaces already have Spark installed and you do not
# need to use the findspark module
#
###

from pyspark import SparkConf, SparkContext

Setting up Spark Context

In [4]:
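config = SparkConf().setAppName("maps_and_lazy_evaluation_example").setMaster("local[*]")

# Only one SparkContext can be active at a time, so stop any existing one first
# (for example, when this cell is re-run).
if 'sc' in locals():
    sc.stop()

sc = SparkContext(conf=config)

# Simple method:
# sc = SparkContext(appName="maps_and_lazy_evaluation_example")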

Importing Data

In [5]:
log_of_songs = [
        "Despacito",
        "Nice for what",
        "No tears left to cry",
        "Despacito",
        "Havana",
        "In my feelings",
        "Nice for what",
        "despacito",
        "All the stars"
]

Loading data and operations into Spark - to be processed in a parallel fashion

---

Question: What is Spark actually doing here? Does it split up the file into smaller chunks for parallel processing? What are the other options for loading data into Spark?
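
One way to picture the answer: `sc.parallelize` takes a local Python collection and distributes it across a number of partitions, which Spark can then process in parallel. The number of partitions can be passed as the optional second argument (`numSlices`), and `rdd.getNumPartitions()` or `rdd.glom().collect()` can be used to inspect the result. Data can also be loaded from files, for example with `sc.textFile(path)`. The plain-Python sketch below only illustrates the idea of slicing a list into contiguous partitions. The helper `slice_into_partitions` is hypothetical and is not Spark's implementation, so the real partition contents may differ.

```python
def slice_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal slices.

    Hypothetical helper for illustration only; not how Spark does it internally.
    """
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


log_of_songs = [
    "Despacito",
    "Nice for what",
    "No tears left to cry",
    "Despacito",
    "Havana",
    "In my feelings",
    "Nice for what",
    "despacito",
    "All the stars",
]

# Assume 4 partitions (an arbitrary choice for this sketch).
partitions = slice_into_partitions(log_of_songs, 4)
for index, part in enumerate(partitions):
    print(index, part)
```

Each slice could then be handed to a separate worker, which is roughly what happens when a `map` runs over an RDD.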

In [6]:
# Note: these commands appear to run instantly because Spark evaluates lazily.
# parallelize() and map() only record instructions in the DAG; Spark plans and
# executes them once an action (such as collect) is called.

# Note: I think it is important to create the RDD in the same cell as the
# transformations applied to it. The map() line below reassigns the variable, so
# re-running that line alone would add another map() to the DAG. Re-running this
# whole cell rebuilds the RDD from log_of_songs, so the DAG starts fresh.

# Parallelize log_of_songs so Spark can process it as an RDD
distributed_song_log = sc.parallelize(log_of_songs)

# Define the transformation (lowercase every title); nothing is executed yet
distributed_song_log = distributed_song_log.map(lambda x: x.lower())

In [7]:
# collect() is an action: it forces Spark to run the transformations defined above
# and returns the results to the driver as a Python list.
results = distributed_song_log.collect()
results

['despacito',
 'nice for what',
 'no tears left to cry',
 'despacito',
 'havana',
 'in my feelings',
 'nice for what',
 'despacito',
 'all the stars']
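
The lazy behaviour seen here has a rough analogy in plain Python, where the built-in `map` also does no work until its result is consumed. The sketch below needs no Spark; the `calls` list and the `to_lower` function are hypothetical names used only to make the timing visible. Spark additionally records a lineage and distributes the work, which this analogy does not capture.

```python
# Plain-Python analogy of lazy evaluation (no Spark involved).
log_of_songs = ["Despacito", "Nice for what", "Havana"]

calls = []  # records every title the function actually processes


def to_lower(title):
    calls.append(title)
    return title.lower()


# Like an RDD transformation: this only describes the work.
lazy_result = map(lambda title: to_lower(title), log_of_songs)
calls_before = len(calls)  # still 0, nothing has run

# Like collect(): consuming the iterator triggers the work.
results = list(lazy_result)
calls_after = len(calls)  # every title has now been processed

print(calls_before, calls_after, results)
```

The function is called only when `list()` consumes the iterator, much as the `map` on the RDD runs only when `collect()` is called.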

# Using SparkSession to create a DataFrame

Importing modules

In [11]:
from pyspark.sql import SparkSession

In [15]:
# Note that Spark allows only one active SparkContext at a time.
# In the code below, getOrCreate() returns the existing Spark session if there is one
# (applying the config options given here to it); otherwise it creates a new session.

sparkSesh = SparkSession \
    .builder \
    .appName("app Name") \
    .config('config option','config value') \
    .getOrCreate()

Look at the parameters of the Spark context

In [17]:
sparkSesh.sparkContext.getConf().getAll()

[('spark.driver.host', '192.168.0.234'),
 ('spark.driver.extraJavaOptions',
  '-XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED'),
 ('spark.app.startTime', '1661614686745'),
 ('spark.app.id', 'local-1661614686774'),
 ('spark.executor.id', 'driver'),
 ('spark.app.name', 'maps_and_lazy_evaluation_example'),
 ('spark.app.sub