### Basic practice in using Lambda + map functions with PySpark

Package Import

In [1]:
### 
# You might have noticed this code in the screencast.
#
# import findspark
# findspark.init('spark-2.3.2-bin-hadoop2.7')
#
# The findspark Python module makes it easier to install
# Spark in local mode on your computer. This is convenient
# for practicing Spark syntax locally. 
# However, the workspaces already have Spark installed and you do not
# need to use the findspark module
#
###

from pyspark import SparkConf, SparkContext

Setting up Spark Context

In [None]:
config = SparkConf().setAppName("maps_and_lazy_evaluation_example").setMaster("local[*]")

# Stop any SparkContext left over from a previous run; only one can be active at a time.
if 'sc' in locals():
    sc.stop()

sc = SparkContext(conf=config)

# Simpler alternative:
# sc = SparkContext(appName="maps_and_lazy_evaluation_example")

Importing Data

In [None]:
log_of_songs = [
        "Despacito",
        "Nice for what",
        "No tears left to cry",
        "Despacito",
        "Havana",
        "In my feelings",
        "Nice for what",
        "despacito",
        "All the stars"
]

Loading data and operations into Spark - to be processed in a parallel fashion

---

Question: What is Spark actually doing here? Does it split up the file into smaller chunks for parallel processing? What are the other options for loading data into Spark?
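
A partial answer, as a rough sketch: the data here is an in-memory Python list rather than a file, and `sc.parallelize(data, numSlices)` distributes it across partitions that separate tasks can process independently. `rdd.getNumPartitions()` reports how many partitions were created, and `sc.textFile(path)` is one of the other ways to load data (from a file instead of a list). The plain-Python snippet below only illustrates the idea of partitioning. `split_into_partitions` is a hypothetical helper, and its slicing rule is an assumption for illustration, not Spark's actual logic.

```python
def split_into_partitions(items, num_partitions):
    """Hypothetical helper: cut a list into roughly equal contiguous slices."""
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


log_of_songs = [
    "Despacito", "Nice for what", "No tears left to cry",
    "Despacito", "Havana", "In my feelings",
    "Nice for what", "despacito", "All the stars",
]

# Split the 9 songs into 4 partitions.
partitions = split_into_partitions(log_of_songs, 4)
partition_sizes = [len(part) for part in partitions]

# Each partition is processed on its own; in Spark, separate tasks would do this in parallel.
lowered_partitions = [[song.lower() for song in part] for part in partitions]

# Gathering the partitions back in order gives the same result as a single pass.
flattened = [song for part in lowered_partitions for song in part]

print(partition_sizes)
print(flattened)
```

Because each partition is transformed without looking at the others, the work can be spread across cores or machines and the pieces recombined afterwards.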

In [None]:
# Note: these commands return almost immediately because nothing is computed yet.
# Transformations such as map() are lazy: they only record instructions for Spark.
# Once an action (e.g. collect()) is called, Spark plans the recorded DAG and runs it.

# Note: I think it's important to create the RDD in the same cell as the operations on it.
# Re-running this cell then rebuilds the whole DAG from scratch, so there is little risk
# of stacking the same operation onto the RDD a second time.


# parallelize the log_of_songs to use with Spark
distributed_song_log = sc.parallelize(log_of_songs)

# map() is a transformation: this line only defines the step, it does not run it yet
distributed_song_log = distributed_song_log.map(lambda x: x.lower())

In [None]:
#To force Spark to perform the operations you've specified, use the 'collect' method:
results = distributed_song_log.collect()
results
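
The lazy behaviour noted in the comments above can be mimicked without Spark. The sketch below uses a toy class, `LazyPipeline`, which is a hypothetical stand-in for an RDD and not how Spark is implemented: `map` only records a function, and `collect` is the point where the recorded functions actually run. The `calls` list and `lower_and_log` function are also invented here, purely to make the timing visible.

```python
class LazyPipeline:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: record the function and return a new pipeline; nothing runs.
        return LazyPipeline(self._data, self._steps + (func,))

    def collect(self):
        # Action: only now are the recorded functions applied to the data.
        results = []
        for item in self._data:
            for func in self._steps:
                item = func(item)
            results.append(item)
        return results


calls = []  # records every time the mapped function actually executes


def lower_and_log(song):
    calls.append(song)
    return song.lower()


songs = ["Despacito", "Havana", "despacito"]
pipeline = LazyPipeline(songs).map(lower_and_log)

calls_before_collect = len(calls)  # map() has only recorded the step
results = pipeline.collect()       # the function runs here, once per song
calls_after_collect = len(calls)

print(calls_before_collect, results, calls_after_collect)
# prints: 0 ['despacito', 'havana', 'despacito'] 3
```

Since `map` returns a new pipeline rather than changing the old one, reassigning a variable to the result (as the cell above does with `distributed_song_log`) is what makes re-running a single line stack an extra step.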

Spark Broadcast practice

In [3]:
#Setting up Spark Context:
config = SparkConf().setAppName("spark_broadcast_example").setMaster("local[*]")

if 'sc' in locals():
    sc.stop()

sc = SparkContext(conf=config)


In [4]:
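# Baseline for the broadcast exercise: no broadcast variable is created yet.
my_list = [1, 2, 3, 4]
my_list_rdd = sc.parallelize(my_list)

# Identity map: each element is returned unchanged.
result = my_list_rdd.map(lambda x: x).collect()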
print(result)

[1, 2, 3, 4]


                                                                                