# Working with an existing remote Spark via HTTP (sample 2)

IBM Watson Studio lets Python notebooks work with an existing remote Spark installation over an HTTP connection, using user-friendly sparkmagic commands. This sample notebook shows how to send an SQL request to remote Spark and get the result back as a DataFrame.

The remote Spark installation in this sample uses Hortonworks Data Platform (HDP), which exposes Spark through the Livy HTTP REST API. Livy is an open source REST interface for interacting with [Apache Spark](http://spark.apache.org) from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in [Apache Hadoop YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).
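To make the REST interface more concrete, the following minimal sketch builds the two HTTP requests that a Livy client typically issues: one to start a session and one to run a code snippet in it. It only constructs the requests and does not send them. The endpoint URL, the session ID, and the helper names are assumptions made for illustration, and the code uses the Python 3 standard library even though this notebook's kernel is Python 2. In this notebook, sparkmagic issues such requests for you.

```python
import json
from urllib.request import Request

# Hypothetical Livy endpoint; replace it with your cluster's Livy service URL.
LIVY_URL = "http://livy.example.com:8998"


def create_session_request(base_url, kind="pyspark"):
    """Build (but do not send) the POST that asks Livy to start a session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(
        base_url + "/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_statement_request(base_url, session_id, code):
    """Build (but do not send) the POST that runs a code snippet in a session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return Request(
        "{}/sessions/{}/statements".format(base_url, session_id),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


session_req = create_session_request(LIVY_URL)
# 135 is an illustrative session ID, not a value you should rely on.
statement_req = run_statement_request(
    LIVY_URL, 135, "sqlContext.sql('SHOW TABLES').show()"
)

print(session_req.get_method(), session_req.full_url)
print(statement_req.get_method(), statement_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it. A real cluster usually also requires authentication or proxy headers, which this sketch leaves out.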

This notebook runs on Python 2.

## Table of contents

   1.  [Load sparkmagic](#load-sparkmagic)<br>
   2.  [Create a connection to remote Spark](#connection-to-remote-spark)<br>
   3.  [Send an SQL request to show information about all tables](#show-inf-all-tables)<br>
   4.  [Send an SQL request to show the contents of a specific table](#show-inf-one-table)<br>
   5.  [Print out information from the returned DataFrame](#print-data-frame)<br>
   6.  [Visualize your data using Brunel](#visualize-data)<br>
   7.  [Delete the remote Spark session](#delete-session)<br>

<a id="load-sparkmagic"></a>
## 1. Load sparkmagic

Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.


In [1]:
%load_ext sparkmagic.magics
import dsx_core_utils
dsx_core_utils.setup_livy_sparkmagic()
%reload_ext sparkmagic.magics

11/01/2017 09:03:17 PM - proxy_util - INFO - Set custom headers.
11/01/2017 09:03:17 PM - proxy_util - INFO - Set proxy user to be useed with Livy.
11/01/2017 09:03:17 PM - proxy_util - INFO - proxy settings set.


<a id="connection-to-remote-spark"></a>
## 2. Create a connection to remote Spark

Run the following cell to invoke the user interface for managing Spark. In the user interface, perform the following tasks to create a connection to the remote Spark:
 * Check the **Manage Endpoints** tab. If an endpoint is already defined there, your Watson Studio administrator has configured a default Watson Studio endpoint.
 * Otherwise, select the **Add Endpoint** tab to create the endpoint of the Livy service URL. Type the Livy service URL in the **Address** field, select the authentication type, and specify the authentication credentials if required. Then, select the **Add endpoint** button.
 * Select the **Add Session** tab to create a session. Choose the endpoint, type the session name, and choose the language. Then, select the **Create Session** button. 

In [2]:
%manage_spark

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
135,application_1509308217126_0029,pyspark,idle,Link,Link,✔


SparkContext available as 'sc'.
HiveContext available as 'sqlContext'.


<a id="show-inf-all-tables"></a>
## 3. Send an SQL request to show information about all tables

Send an SQL request to remote Spark to list all tables.

In [3]:
%%spark -c sql
SHOW TABLES

Unnamed: 0,tableName,isTemporary
0,sample_07,False
1,sample_08,False
2,sample_09,False


<a id="show-inf-one-table"></a>
## 4. Send an SQL request to show the contents of a specific table

Send an SQL request to remote Spark to show the __sample_07__ table contents, and return the DataFrame to the local notebook.

In [4]:
%%spark -c sql -o df_tab07 --maxrows 10
SELECT * FROM sample_07

Unnamed: 0,code,description,total_emp,salary
0,00-0000,All Occupations,134354250,40690
1,11-0000,Management occupations,6003930,96150
2,11-1011,Chief executives,299160,151370
3,11-1021,General and operations managers,1655410,103780
4,11-1031,Legislators,61110,33880
5,11-2011,Advertising and promotions managers,36300,91100
6,11-2021,Marketing managers,165240,113400
7,11-2022,Sales managers,322170,106790
8,11-2031,Public relations managers,47210,97170
9,11-3011,Administrative services managers,239360,76370
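One way to picture what `-o df_tab07` and `--maxrows 10` do is that the remote side returns the result as one record per row, capped at the requested number of rows, and the local side parses those records. The following sketch illustrates that idea with the standard library only. The JSON-lines wire format and the `parse_records` helper are assumptions for illustration and are not sparkmagic's actual implementation; the three records are copied by hand from the output above.

```python
import json

# Assumed wire format: one JSON object per line, one line per table row.
remote_output = "\n".join([
    '{"code": "00-0000", "description": "All Occupations", '
    '"total_emp": 134354250, "salary": 40690}',
    '{"code": "11-0000", "description": "Management occupations", '
    '"total_emp": 6003930, "salary": 96150}',
    '{"code": "11-1011", "description": "Chief executives", '
    '"total_emp": 299160, "salary": 151370}',
])


def parse_records(text, maxrows):
    """Parse at most `maxrows` non-empty JSON lines into a list of dicts."""
    records = []
    for line in text.splitlines():
        if len(records) >= maxrows:
            break  # Honor the row cap, in the spirit of --maxrows.
        if line.strip():
            records.append(json.loads(line))
    return records


rows = parse_records(remote_output, maxrows=2)
print(len(rows))
print(rows[0]["description"])
```

In the notebook itself you do not need any of this: the `%%spark` magic stores the result in the local variable `df_tab07`.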


<a id="print-data-frame"></a>
## 5. Print out information from the returned DataFrame

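Print the first three rows of the returned DataFrame, and then its total number of rows. Because the query in Step 4 used `--maxrows 10`, the local DataFrame holds at most 10 rows.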

In [5]:
df_tab07.head(3)

Unnamed: 0,code,description,total_emp,salary
0,00-0000,All Occupations,134354250,40690
1,11-0000,Management occupations,6003930,96150
2,11-1011,Chief executives,299160,151370


In [6]:
print(len(df_tab07))

10


<a id="visualize-data"></a>
## 6. Visualize your data using Brunel

Apply the Brunel functions to the DataFrame and draw a diagram.

In [7]:
import pandas as pd
import brunel

%brunel data('df_tab07') x(description) y(salary) mean(salary) bar tooltip(#all) :: width=300, height=300

<IPython.core.display.Javascript object>
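The Brunel command above groups the rows by `description` and plots the mean of `salary` for each group as a bar. As a rough sketch of that aggregation, the following plain Python code computes the same per-group means for the first three rows of the table. The `mean_salary_by_description` helper is hypothetical and is not part of Brunel.

```python
from collections import defaultdict

# (description, salary) pairs copied from the first three rows of sample_07.
rows = [
    ("All Occupations", 40690),
    ("Management occupations", 96150),
    ("Chief executives", 151370),
]


def mean_salary_by_description(pairs):
    """Group salaries by description and return the mean of each group."""
    groups = defaultdict(list)
    for description, salary in pairs:
        groups[description].append(salary)
    return {
        description: sum(values) / float(len(values))
        for description, values in groups.items()
    }


bars = mean_salary_by_description(rows)
for description, mean_salary in sorted(bars.items()):
    print(description, mean_salary)
```

Each description occurs only once in these rows, so every mean equals the single salary for that occupation.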

<a id="delete-session"></a>
## 7. Delete the remote Spark session

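Run the following cell to remove your remote Spark session, replacing `SESSION_NAME` with the name that you chose when you created the session. You can also delete the session from the Spark management user interface that you opened in Step 2.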

In [10]:
%spark delete -s SESSION_NAME
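At the REST level, Livy removes a session when it receives an HTTP `DELETE` for that session's URL. The following sketch builds such a request without sending it. The endpoint URL, the session ID, and the helper name are assumptions for illustration, and the code targets the Python 3 standard library rather than this notebook's Python 2 kernel.

```python
from urllib.request import Request

# Hypothetical Livy endpoint; replace it with your cluster's Livy service URL.
LIVY_URL = "http://livy.example.com:8998"


def delete_session_request(base_url, session_id):
    """Build (but do not send) the DELETE that asks Livy to end a session."""
    return Request(
        "{}/sessions/{}".format(base_url, session_id),
        method="DELETE",
    )


# 135 is an illustrative session ID; use the ID of your own session.
req = delete_session_request(LIVY_URL, 135)
print(req.get_method(), req.full_url)
```

The `%spark delete` magic remains the simpler route inside the notebook, because sparkmagic already knows the endpoint and credentials that were configured in Step 2.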

## Summary

In this notebook, you learned how to send an SQL request to remote Spark to get a DataFrame.

<div class="alert alert-block alert-info"> Note: To save resources and get the best performance, run the code below to stop the kernel before you exit your notebook.</div>

In [None]:
%%javascript
Jupyter.notebook.session.delete();

<hr>
Copyright &copy; IBM Corp. 2017. Released as licensed Sample Materials.