# Demonstrating sparkmagic

## This notebook demonstrates how you can use sparkmagic to intersperse your Python code with code that runs against a Spark cluster

Maybe you start by processing some data in regular Python.

**Notice that this is regular Python code.**

In [1]:
for i in range(10):
    print("2015-03-{}, 12:00, {}, {}, {}".format(i, i+10, i+5, i))

2015-03-0, 12:00, 10, 5, 0
2015-03-1, 12:00, 11, 6, 1
2015-03-2, 12:00, 12, 7, 2
2015-03-3, 12:00, 13, 8, 3
2015-03-4, 12:00, 14, 9, 4
2015-03-5, 12:00, 15, 10, 5
2015-03-6, 12:00, 16, 11, 6
2015-03-7, 12:00, 17, 12, 7
2015-03-8, 12:00, 18, 13, 8
2015-03-9, 12:00, 19, 14, 9
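The loop above formats the day number directly, so the first line reads `2015-03-0`, which is not a real calendar date. If valid dates matter downstream, a variant along these lines could be used. The `make_rows` helper and the choice to start on 1 March are assumptions made for illustration, not part of the original notebook.

```python
from datetime import date, timedelta

def make_rows(n, start=date(2015, 3, 1)):
    """Return n comma-separated sample rows with valid ISO dates."""
    rows = []
    for i in range(n):
        # Advance one calendar day per row instead of formatting i directly.
        day = start + timedelta(days=i)
        rows.append("{}, 12:00, {}, {}, {}".format(day.isoformat(), i + 10, i + 5, i))
    return rows

for row in make_rows(3):
    print(row)
```

Using `datetime.date` also keeps the rows correct if the range ever crosses a month boundary.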


Then, **you probably need Spark to analyze some data**. So, we'll load `sparkmagic` to be able to talk to Spark from our Python notebook.

In [2]:
%load_ext sparkmagic.magics

With it, the `%manage_spark` line magic and the `%%spark` magic are available.

The `%manage_spark` line magic lets you manage Livy endpoints and Spark sessions. You can create and delete sessions for an endpoint from it.

To start using Spark on the kernel, we'll first add an endpoint.

An endpoint is a [Livy](https://github.com/cloudera/livy) installation running on a Spark cluster.
sparkmagic allows you to specify the Livy endpoint along with a username and password to authenticate to it. If the Livy endpoint is on your local machine or has no password, simply leave the username and password fields blank.
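To make the endpoint and credentials more concrete, here is a minimal sketch of an HTTP request against a Livy endpoint, built with the standard library only. Livy exposes `GET /sessions` for listing sessions. The use of HTTP Basic authentication here is an assumption, as are the `build_sessions_request` helper and the placeholder credentials; the request is built but never sent.

```python
import base64
import urllib.request

def build_sessions_request(endpoint, username="", password=""):
    """Build (but do not send) a GET request for the endpoint's session list."""
    request = urllib.request.Request(endpoint.rstrip("/") + "/sessions", method="GET")
    if username or password:
        # Assumption: the endpoint accepts HTTP Basic authentication.
        token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
        request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request

# Placeholder credentials, for illustration only.
req = build_sessions_request("https://localhost:8998", "alice", "secret")
print(req.get_method(), req.full_url)
```

Leaving both credentials empty produces a request without an `Authorization` header, which mirrors leaving the fields blank in the widget.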

In [3]:
%manage_spark

Added endpoint https://localhost:8998
Creating SparkContext as 'sc'
Creating HiveContext as 'sqlContext'


![add_endpoint](images/addendpoint.PNG)

Now, add a session to the endpoint you added. The name you give to the session will be used with the `%%spark` magic to run Spark code. You can also specify the configuration you want to start the session with. You can create either Python (PySpark) or Scala (Spark) sessions.

A SparkContext will be created with the `sc` name. A HiveContext will be created with the `sqlContext` name.

We'll start by adding a PySpark session first.

![add_session](images/addsession.PNG)
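Behind the widget, a session is ultimately a Livy session. As a rough sketch, the JSON body accepted by Livy's `POST /sessions` carries a `kind` (`pyspark` or `spark` for the two session types above) and an optional `conf` map of Spark properties. The `session_payload` helper and the example property value are assumptions; how sparkmagic itself assembles this request is not shown in this notebook.

```python
import json

# Livy session kinds for the two session types mentioned above.
LIVY_KINDS = {"python": "pyspark", "scala": "spark"}

def session_payload(language, conf=None):
    """Return a JSON body suitable for Livy's POST /sessions."""
    body = {"kind": LIVY_KINDS[language]}
    if conf:
        # Optional Spark configuration properties for the new session.
        body["conf"] = dict(conf)
    return json.dumps(body, sort_keys=True)

print(session_payload("python"))
print(session_payload("scala", {"spark.executor.memory": "2g"}))
```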

You can now run Spark code on the IPython kernel. To see which commands are available, run `%spark?`

In [4]:
%spark?

## PySpark

From here on, I just need to reference my session by its name, `my_pyspark`.

>Note that if there's only one session for the notebook, you do not need to specify a session name.

Notice that to send my code to the Spark cluster, I simply add `%%spark` at the beginning of the cell, and the magics handle running the code on the Spark cluster.

In the following cell, I'll create a Resilient Distributed Dataset (RDD) called fruits, and print its first element.

In [5]:
%%spark
fruits = sc.textFile('wasb:///example/data/fruits.txt')
print('First element of fruits is {} and its description is:\n{}'.format(fruits.first(), fruits.toDebugString()))

First element of fruits is apple and its description is:
(2) wasb:///example/data/fruits.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
 |  wasb:///example/data/fruits.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []

Now, you've created your session and executed some statements. If you want to look at the Livy logs for this session, simply run a cell like so:

In [6]:
%spark logs

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/spark/lib/spark-examples-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/06/03 22:04:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/03 22:04:18 INFO TimelineClientImpl: Timeline service address: http://localhost:8188/ws/v1/timeline/
16/06/03 22:04:19 INFO MetricsConfig: loaded properties from hadoop-metrics2-azure-file-system.properties
16/06/03 22:04:19 INFO WasbAzureIaasSink: Init starting.
16/06/03 22:04:19 INFO AzureIaasSink: Init starti
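Log output like this is easier to scan once it is split by level. The following sketch assumes you have pasted some of the log text into a Python string by hand (the `sample_log` variable, the `parse_log` helper and the regular expression are illustrative, not part of sparkmagic) and that the lines follow the `date time LEVEL Source: message` layout seen above.

```python
import re
from collections import Counter

# A few lines copied from the log output above.
sample_log = """\
SLF4J: Class path contains multiple SLF4J bindings.
16/06/03 22:04:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/03 22:04:18 INFO TimelineClientImpl: Timeline service address: http://localhost:8188/ws/v1/timeline/
16/06/03 22:04:19 INFO MetricsConfig: loaded properties from hadoop-metrics2-azure-file-system.properties
16/06/03 22:04:19 INFO WasbAzureIaasSink: Init starting.
"""

LOG_LINE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\w+): (.*)$")

def parse_log(text):
    """Yield (timestamp, level, source, message) for lines in the layout above."""
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:  # lines without a timestamp, such as the SLF4J notices, are skipped
            yield match.groups()

entries = list(parse_log(sample_log))
level_counts = Counter(level for _, level, _, _ in entries)
warnings = [message for _, level, _, message in entries if level == "WARN"]
print(level_counts)
print(warnings)
```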

## SparkSQL

Thanks to SparkSQL, I can use SQL to simply query the data I have cached in my Spark cluster. I can then analyze it and get automatic visualizations.

All Spark sessions will have a HiveContext to run queries on top of. Simply select it by passing the `--context` or `-c` argument with the `sql` value to the `%%spark` magic.

First, let's see what tables we have defined:

In [7]:
%%spark -c sql
SHOW TABLES

Unnamed: 0,isTemporary,tableName
0,False,hivesampletable


Now, let's query one of the available tables.

Notice that we are passing the `--output` or `-o` parameter with a `df_hvac` value so that the output of our SQL query is saved in the `df_hvac` variable in the IPython kernel context as a [Pandas](http://pandas.pydata.org/) DataFrame.

In [8]:
%%spark -c sql -o df_hvac --maxrows 10
SELECT * FROM hivesampletable

Unnamed: 0,clientid,country,devicemake,devicemodel,deviceplatform,market,querydwelltime,querytime,sessionid,sessionpagevieworder,state
0,8,United States,Samsung,SCH-i500,Android,en-US,13.920401,2016-06-03 18:54:20,0,0,California
1,23,United States,HTC,Incredible,Android,en-US,,2016-06-03 19:19:44,0,0,Pennsylvania
2,23,United States,HTC,Incredible,Android,en-US,1.475742,2016-06-03 19:19:46,0,1,Pennsylvania
3,23,United States,HTC,Incredible,Android,en-US,0.245968,2016-06-03 19:19:47,0,2,Pennsylvania
4,28,United States,Motorola,Droid X,Android,en-US,20.309534,2016-06-03 01:37:50,1,1,Colorado
5,28,United States,Motorola,Droid X,Android,en-US,16.298167,2016-06-03 00:53:31,0,0,Colorado
6,28,United States,Motorola,Droid X,Android,en-US,1.771523,2016-06-03 00:53:50,0,1,Colorado
7,28,United States,Motorola,Droid X,Android,en-US,11.675599,2016-06-03 16:44:21,2,1,Utah
8,28,United States,Motorola,Droid X,Android,en-US,36.944689,2016-06-03 16:43:41,2,0,Utah
9,28,United States,Motorola,Droid X,Android,en-US,28.981142,2016-06-03 01:37:19,1,0,Colorado


>SQL queries also have other parameters you can pass in, like `--samplemethod`, `--maxrows`, `--samplefraction`, and `--quiet`.
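To build some intuition for these parameters, here is a purely local analogy in plain Python. It assumes that `--samplemethod` chooses between taking the first rows (`take`) and drawing a random sample (`sample`), that `--samplefraction` is the fraction of rows kept when sampling, and that `--maxrows` caps the result. The `limit_rows` function and its handling of a negative `maxrows` are hypothetical; the real work happens on the cluster, not in the notebook.

```python
import random

def limit_rows(rows, samplemethod="take", maxrows=10, samplefraction=0.1, seed=None):
    """Local analogy for row limiting; the real sampling happens on the cluster."""
    if samplemethod == "sample":
        rng = random.Random(seed)
        # Keep each row independently with probability samplefraction.
        rows = [row for row in rows if rng.random() < samplefraction]
    elif samplemethod != "take":
        raise ValueError("samplemethod must be 'take' or 'sample'")
    # Assumption: a negative maxrows is treated here as "no limit".
    return rows if maxrows < 0 else rows[:maxrows]

data = list(range(1000))
print(limit_rows(data))
print(len(limit_rows(data, samplemethod="sample", maxrows=-1, samplefraction=0.5, seed=0)))
```

The point of the cap is that only a bounded number of rows travels from the cluster to the local kernel, however large the table is.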

We can now simply use the Pandas dataframe from the IPython notebook.

In [9]:
df_hvac.head()

Unnamed: 0,clientid,country,devicemake,devicemodel,deviceplatform,market,querydwelltime,querytime,sessionid,sessionpagevieworder,state
0,8,United States,Samsung,SCH-i500,Android,en-US,13.920401,2016-06-03 18:54:20,0,0,California
1,23,United States,HTC,Incredible,Android,en-US,,2016-06-03 19:19:44,0,0,Pennsylvania
2,23,United States,HTC,Incredible,Android,en-US,1.475742,2016-06-03 19:19:46,0,1,Pennsylvania
3,23,United States,HTC,Incredible,Android,en-US,0.245968,2016-06-03 19:19:47,0,2,Pennsylvania
4,28,United States,Motorola,Droid X,Android,en-US,20.309534,2016-06-03 01:37:50,1,1,Colorado


However, you might want to visualize the data in the pandas dataframe. You can write your own code with any number of visualization libraries or you can rely on a widget we've created:

In [10]:
from autovizwidget.widget.utils import display_dataframe
display_dataframe(df_hvac)

![widget](images/widget.PNG)

>You could also choose to have this widget display by default for *all* Pandas dataframes from here on by running this piece of code:

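```
from autovizwidget.widget.utils import display_dataframe

# Register the widget as the default display for every Pandas DataFrame.
ip = get_ipython()
ip.display_formatter.ipython_display_formatter.for_type_by_name('pandas.core.frame', 'DataFrame', display_dataframe)
```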

## Scala support

Maybe you are more comfortable writing Spark code in Scala. You can easily do that too, interspersing your regular Python code with Scala code that runs against your Spark cluster.

Let's add a Scala session:

In [11]:
%manage_spark

Creating SparkContext as 'sc'
Creating HiveContext as 'sqlContext'


![add_session](images/addsession_s.PNG)

Now just run some Spark code. Notice that we now specify the session we want to use with `-s my_spark`.

In [12]:
%%spark -s my_spark
val fruitsText = sc.textFile("wasb:///example/data/fruits.txt")
fruitsText.first()

res0: String = apple

Now, we can query the table with **SparkSQL** too:

In [13]:
%%spark -s my_spark -c sql -o my_df_from_scala --maxrows 10
SELECT * FROM hivesampletable

Unnamed: 0,clientid,country,devicemake,devicemodel,deviceplatform,market,querydwelltime,querytime,sessionid,sessionpagevieworder,state
0,8,United States,Samsung,SCH-i500,Android,en-US,13.920401,2016-06-03 18:54:20,0,0,California
1,23,United States,HTC,Incredible,Android,en-US,,2016-06-03 19:19:44,0,0,Pennsylvania
2,23,United States,HTC,Incredible,Android,en-US,1.475742,2016-06-03 19:19:46,0,1,Pennsylvania
3,23,United States,HTC,Incredible,Android,en-US,0.245968,2016-06-03 19:19:47,0,2,Pennsylvania
4,28,United States,Motorola,Droid X,Android,en-US,20.309534,2016-06-03 01:37:50,1,1,Colorado
5,28,United States,Motorola,Droid X,Android,en-US,16.298167,2016-06-03 00:53:31,0,0,Colorado
6,28,United States,Motorola,Droid X,Android,en-US,1.771523,2016-06-03 00:53:50,0,1,Colorado
7,28,United States,Motorola,Droid X,Android,en-US,11.675599,2016-06-03 16:44:21,2,1,Utah
8,28,United States,Motorola,Droid X,Android,en-US,36.944689,2016-06-03 16:43:41,2,0,Utah
9,28,United States,Motorola,Droid X,Android,en-US,28.981142,2016-06-03 01:37:19,1,0,Colorado
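The magic line above combines four flags used in this notebook: `-s` for the session, `-c` for the context, `-o` for the local output variable and `--maxrows` for the row cap. As a way to read such a line, here is a small `argparse` sketch that splits it into its parts. It is not sparkmagic's actual parser; the `parse_spark_line` helper and the `None` defaults are assumptions made for illustration.

```python
import argparse
import shlex

def parse_spark_line(line):
    """Parse the text after `%%spark` using only the flags seen in this notebook."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-s", dest="session", default=None)   # session name
    parser.add_argument("-c", "--context", default=None)      # e.g. sql
    parser.add_argument("-o", "--output", default=None)       # local variable name
    parser.add_argument("--maxrows", type=int, default=None)  # row cap
    return parser.parse_args(shlex.split(line))

args = parse_spark_line("-s my_spark -c sql -o my_df_from_scala --maxrows 10")
print(args.session, args.context, args.output, args.maxrows)
```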


And we can still access the result of the SparkSQL query from the Scala session as a Pandas dataframe!

In [14]:
my_df_from_scala.head()

Unnamed: 0,clientid,country,devicemake,devicemodel,deviceplatform,market,querydwelltime,querytime,sessionid,sessionpagevieworder,state
0,8,United States,Samsung,SCH-i500,Android,en-US,13.920401,2016-06-03 18:54:20,0,0,California
1,23,United States,HTC,Incredible,Android,en-US,,2016-06-03 19:19:44,0,0,Pennsylvania
2,23,United States,HTC,Incredible,Android,en-US,1.475742,2016-06-03 19:19:46,0,1,Pennsylvania
3,23,United States,HTC,Incredible,Android,en-US,0.245968,2016-06-03 19:19:47,0,2,Pennsylvania
4,28,United States,Motorola,Droid X,Android,en-US,20.309534,2016-06-03 01:37:50,1,1,Colorado


## Cleaning up

Now that you've created sessions, make sure to clean them up so that they stop holding resources on your cluster.

Simply click on the `Delete` buttons!

In [None]:
%manage_spark

![clean_up](images/cleanup.PNG)