# Getting Started With Spark using Python

In this version of the Python API, we have to create our own SparkContext.

In [1]:
from pyspark import SparkContext
sc = SparkContext()
sc

### The Python API

Spark is written in Scala, which compiles to Java bytecode, but you can write Python code that communicates with the Java virtual machine through a library called Py4J.  The Python API is rich, but it can be somewhat limiting if you need a method that it does not expose, or if you need to write a specialized piece of code.  The latency of communicating back and forth with the JVM can also make the code run slower.

An exception to this is the Spark SQL library, which has an execution planning engine that plans and optimizes queries before running them.  Even with this optimization, there are cases where the code may run slower than the native Scala version.

The general recommendation for PySpark code is to use the "out of the box" methods as much as possible, and to avoid overly frequent (iterative) calls to Spark methods.  If you need to write really high-performance or specialized code, try doing it in Scala.

But hey, we know.  Python rules, and the plotting libraries are way better.  So, it's up to you, really!

## Creating RDDs

For demonstration purposes, we create an RDD here by calling `sc.parallelize()`, which distributes a local collection across a given number of partitions (4 in this example).

In [2]:
data = range(1,30)
# print the first element of the range
print(data[0])
len(data)
xrangeRDD = sc.parallelize(data, 4)

# this will let us know that we created an RDD
xrangeRDD

1


PythonRDD[1] at RDD at PythonRDD.scala:48

## Transformations

A transformation is an operation on an RDD that results in a new RDD.  The transformed RDD is generated very rapidly, because the new RDD is *lazily evaluated*, which means that the calculation is not carried out when the new RDD is generated.  The RDD will contain a series of transformations, or computation instructions, that will only be carried out when an action is called.

In [3]:
subRDD = xrangeRDD.map(lambda x: x-1)
filteredRDD = subRDD.filter(lambda x : x<10)
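
The lazy evaluation described above can be mimicked without Spark. The following is only a plain-Python analogy using generators, not how Spark is implemented: the names `calls` and `subtract_one` are hypothetical helpers introduced for illustration. Building the generators plays the role of the transformations, and `list()` plays the role of an action.

```python
# Plain-Python analogy for lazy evaluation; no Spark required.
calls = []  # records every time the mapping function actually runs


def subtract_one(x):
    calls.append(x)
    return x - 1


data = range(1, 30)

# "Transformations": building the generators does no work yet.
mapped = (subtract_one(x) for x in data)
filtered = (x for x in mapped if x < 10)
calls_before = len(calls)  # nothing has been computed so far

# "Action": consuming the pipeline forces the computation.
result = list(filtered)
calls_after = len(calls)  # one call per input element

print(calls_before, calls_after, result)
```

One difference worth keeping in mind: a Python generator can be consumed only once, whereas an RDD can be evaluated again by calling another action on it.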


## Actions 

An action triggers the pending transformations and returns a result to the driver.

In [4]:
print(filteredRDD.collect())
filteredRDD.count()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


10

## Caching Data

This simple example shows how to create an RDD and cache it.  The timing improvement looks small, but that is mostly because the main latency here is in sending the results back to the driver.  To see the actual computation time, browse to the Spark UI, which by default is served on port 4040 of the driver host (host:4040).  You'll see that the second calculation took much less time!

In [5]:
import time 

test = sc.parallelize(range(1,50000),50)
test.cache()

t1 = time.time()
# first count will trigger evaluation of count *and* cache
count1 = test.count()
dt1 = time.time() - t1
print("dt1: ", dt1)

t2 = time.time()
# second count operates on cached data only
count2 = test.count()
dt2 = time.time() - t2
print("dt2: ", dt2)

test.count()

dt1:  1.1904611587524414
dt2:  0.48461413383483887


49999

## Spark SQL

In order to work with the extremely powerful SQL engine in Apache Spark, you will need to create a Spark Session.

In [6]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [7]:
spark

### Create Your First DataFrame!

You can create a structured data set (much like a database table) in Spark.  Once you have done that, you can use powerful SQL tools to query and join your DataFrames.

In [8]:
df = spark.read.json("people.json")
df.show()
df.printSchema()

# Register the DataFrame as a SQL temporary view
# (createOrReplaceTempView lets the cell be re-run without a "view already exists" error)
df.createOrReplaceTempView("people")

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



### Some More DataFrame Examples


In [9]:
df.select("name").show()
df.select(df["name"]).show()
spark.sql("SELECT name FROM people").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



In [10]:
df.filter(df["age"] > 21).show()
spark.sql("SELECT age, name FROM people WHERE age > 21").show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+



In [11]:
df.groupBy("age").count().show()
spark.sql("SELECT age, COUNT(age) as count FROM people GROUP BY age").show()

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    0|
|  30|    1|
+----+-----+
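
The two results above differ in the `null` row: `groupBy("age").count()` counts the rows in each group, while `COUNT(age)` counts only the non-null `age` values, so the `null` group gets 0. The sketch below reproduces that distinction with Python's built-in `sqlite3` module standing in for Spark SQL. This is an assumption made only so the example runs without a Spark installation; the table contents mirror the `people` output shown earlier.

```python
import sqlite3

# In-memory table mirroring the people data shown earlier (Michael has no age).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(None, "Michael"), (30, "Andy"), (19, "Justin")],
)

# COUNT(age) skips NULL values; COUNT(*) counts every row in the group.
count_col = dict(conn.execute("SELECT age, COUNT(age) FROM people GROUP BY age").fetchall())
count_star = dict(conn.execute("SELECT age, COUNT(*) FROM people GROUP BY age").fetchall())
conn.close()

print(count_col)
print(count_star)
```

In the Spark query, replacing `COUNT(age)` with `COUNT(*)` should give the same counts as the DataFrame version.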

