# Working with an existing remote Spark via HTTP (sample 3)

IBM Watson Studio provides an interface for Python notebooks to work with an existing remote Spark through an HTTP connection and user-friendly sparkmagic commands. This sample notebook shows how to work with remote Spark using the Livy Spark kernel.

The remote Spark in this sample is installed on Hortonworks Data Platform (HDP), which uses the Livy HTTP REST API. Livy is an open source REST interface for interacting with [Apache Spark](http://spark.apache.org) from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in [Apache Hadoop YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).
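
To make the REST interaction concrete, the following sketch builds the two Livy requests that sit underneath an interactive session: `POST /sessions` to ask for a session and `POST /sessions/{sessionId}/statements` to run a code snippet. It is a minimal illustration written for Python 3 and is independent of this notebook. The endpoint `http://livy.example.com:8998` and the helper names are assumptions, and the requests are only constructed, never sent.

```python
import json
from urllib.request import Request

# Hypothetical Livy endpoint; replace with your cluster's Livy service URL.
LIVY_URL = "http://livy.example.com:8998"
JSON_HEADERS = {"Content-Type": "application/json"}


def new_session_request(kind="pyspark"):
    """Build (but do not send) a POST /sessions request."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(LIVY_URL + "/sessions", data=body,
                   headers=JSON_HEADERS, method="POST")


def statement_request(session_id, code):
    """Build (but do not send) a POST /sessions/{id}/statements request."""
    body = json.dumps({"code": code}).encode("utf-8")
    url = "{}/sessions/{}/statements".format(LIVY_URL, session_id)
    return Request(url, data=body, headers=JSON_HEADERS, method="POST")


req = statement_request(0, "sc.version")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sparkmagic takes care of these calls for you, so the notebook cells below never touch the REST API directly.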

This notebook runs on Python 2 with Livy Spark.

## Table of contents 

   1.  [Load sparkmagic](#load-sparkmagic)<br>
   2.  [Create a connection to remote Spark by using the command mode](#connection-to-remote-spark)<br>
   3.  [View the sparkmagic help](#view-help)<br>
   4.  [View the tables](#view-tables)<br>
   5.  [Select data from a table and return the DataFrame](#select-data-data-frame)<br>
   6.  [Print out rows from the returned DataFrame](#print-rows)<br>
   7.  [Visualize your data using Brunel](#visualize-data)<br>
   8.  [Show the default context of cells in remote Spark](#show-default-context)<br>
   9.  [Show a missing DataFrame](#missing-DataFrame)<br>
   10. [Return the DataFrame in the local notebook](#return-DataFrame)<br>
   11. [Print the DataFrame in remote Spark context](#DataFrame-remote-Spark)<br>
   12. [Print the DataFrame in the local notebook](#print-DataFrame-local)<br>
   13. [Delete the remote Spark session](#delete-session)<br>

<a id="load-sparkmagic"></a>
## 1. Load sparkmagic

Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.


In [1]:
%load_ext sparkmagic.magics
import dsx_core_utils
dsx_core_utils.setup_livy_sparkmagic()
%reload_ext sparkmagic.magics

11/02/2017 01:16:26 AM - proxy_util - INFO - Set custom headers.
11/02/2017 01:16:26 AM - proxy_util - INFO - Set proxy user to be useed with Livy.
11/02/2017 01:16:26 AM - proxy_util - INFO - proxy settings set.


<a id="connection-to-remote-spark"></a>
## 2. Create a connection to remote Spark by using the command mode

Run the following cell to open the user interface for managing Spark. In the user interface, perform the following tasks to create a connection to the remote Spark:
 * Check **Manage Endpoints**. If an endpoint is already defined, your Watson Studio administrator has configured a default Watson Studio endpoint.
 * Otherwise, select the **Add Endpoint** tab to create an endpoint for the Livy service URL. Type the Livy service URL in the **Address** field, select the authentication type, and specify the authentication credentials if required. Then, select the **Add endpoint** button.
 * Select the **Add Session** tab to create a session. Choose the endpoint, type the session name, and choose the language. Then, select the **Create Session** button.

In [2]:
%manage_spark

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
146,application_1509308217126_0040,pyspark,idle,Link,Link,✔


SparkContext available as 'sc'.
HiveContext available as 'sqlContext'.
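
The session table above reports the state `idle`, which is the Livy state for a session that is ready to accept code. As a rough sketch of how that state could be checked programmatically, the following Python 3 snippet filters a hand-written sample shaped like a Livy `GET /sessions` response. The sample values are copied from the table above, the function name is hypothetical, and a real response carries more fields.

```python
import json

# Hand-written sample shaped like a Livy GET /sessions response.
# The values come from the session table above; real responses have more fields.
sample_response = json.loads("""
{
  "from": 0,
  "total": 1,
  "sessions": [
    {"id": 146, "appId": "application_1509308217126_0040",
     "kind": "pyspark", "state": "idle"}
  ]
}
""")


def idle_sessions(response):
    """Return the IDs of sessions that are ready to accept code."""
    return [s["id"] for s in response["sessions"] if s["state"] == "idle"]


print(idle_sessions(sample_response))  # [146]
```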


<a id="view-help"></a>
## 3. View the sparkmagic help

To view the sparkmagic help topics, execute the help command.

In [3]:
%help

Magic,Example,Explanation
info,%%info,Outputs session information for the current Livy endpoint.
cleanup,%%cleanup -f,"Deletes all sessions for the current Livy endpoint, including this notebook's session. The force flag is mandatory."
delete,%%delete -f -s 0,Deletes a session by number for the current Livy endpoint. Cannot delete this kernel's session.
logs,%%logs,Outputs the current session's Livy logs.
configure,"%%configure -f {""executorMemory"": ""1000M"", ""executorCores"": 4}",Configure the session creation parameters. The force flag is mandatory if a session has already been  created and the session will be dropped and recreated. Look at Livy's POST /sessions Request Body for a list of valid parameters. Parameters must be passed in as a JSON string.
spark,%%spark -o df df = spark.read.parquet('...,"Executes spark commands.  Parameters:  -o VAR_NAME: The Spark dataframe of name VAR_NAME will be available in the %%local Python context as a  Pandas dataframe with the same name.  -m METHOD: Sample method, either take or sample.  -n MAXROWS: The maximum number of rows of a dataframe that will be pulled from Livy to Jupyter.  If this number is negative, then the number of rows will be unlimited.  -r FRACTION: Fraction used for sampling."
sql,%%sql -o tables -q SHOW TABLES,"Executes a SQL query against the variable sqlContext (Spark v1.x) or spark (Spark v2.x).  Parameters:  -o VAR_NAME: The result of the SQL query will be available in the %%local Python context as a  Pandas dataframe.  -q: The magic will return None instead of the dataframe (no visualization).  -m, -n, -r are the same as the %%spark parameters above."
local,%%local a = 1,All the code in subsequent lines will be executed locally. Code must be valid Python code.
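
The help text notes that `%%configure` parameters must be passed as a JSON string. One way to avoid quoting mistakes is to build that string with `json.dumps`, as in this small Python 3 sketch. The two keys are the ones from the help example; the variable names are illustrative only.

```python
import json

# Session parameters named in the help text above.
session_params = {"executorMemory": "1000M", "executorCores": 4}

# json.dumps produces valid JSON with double-quoted keys and strings.
configure_arg = json.dumps(session_params)
print("%%configure -f " + configure_arg)
```

The printed line can be pasted into a notebook cell as-is.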


<a id="view-tables"></a>
## 4. View the tables

Execute the following SQL command to view the tables.

In [4]:
%%spark -c sql
SHOW TABLES


pandas.lib is deprecated and will be removed in a future version.
You can access infer_dtype as pandas.api.types.infer_dtype



<a id="select-data-data-frame"></a>
## 5. Select data from a table and return the DataFrame

Execute the following SQL command to select data from the __sample_07__ table and return the DataFrame __df_tab07__ to the local notebook.

__Note__: Click the different buttons to view data using different types of visualizations, such as pie chart, scatter chart, line chart, area chart, and bar chart.

In [5]:
%%spark -c sql -o df_tab07 --maxrows 10
SELECT * FROM sample_07

<a id="print-rows"></a>
## 6. Print out rows from the returned DataFrame

Print out the first 3 rows and the total number of rows from the returned DataFrame.

In [6]:
%%local
df_tab07.head(3)

In [7]:
%%local
print(len(df_tab07))

10
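
The row count of 10 matches the `--maxrows 10` option used in step 5. The following Python 3 sketch mimics the documented `-m`, `-n`, and `-r` options on a plain list so that their interaction is easier to see. It is a local analogy under stated assumptions, not sparkmagic's implementation, and `pull_rows` is a hypothetical helper.

```python
import random


def pull_rows(rows, maxrows, method="take", fraction=None, seed=0):
    """Mimic the documented -m, -n, and -r options on a plain list of rows."""
    if method == "sample":
        # Keep each row with probability `fraction`.
        rng = random.Random(seed)
        rows = [r for r in rows if rng.random() < fraction]
    elif method != "take":
        raise ValueError("method must be 'take' or 'sample'")
    # A negative maxrows means no limit, as the help text describes.
    return list(rows) if maxrows < 0 else list(rows[:maxrows])


table = list(range(25))           # stand-in for the rows of a remote table
print(len(pull_rows(table, 10)))  # 10
print(len(pull_rows(table, -1)))  # 25
```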


<a id="visualize-data"></a>
## 7. Visualize your data using Brunel

Apply Brunel to the DataFrame to visualize data. 

In [8]:
%%local
import pandas as pd
import brunel

%brunel data('df_tab07') x(description) y(salary) mean(salary) bar tooltip(#all) :: width=300, height=300

<IPython.core.display.Javascript object>

<a id="show-default-context"></a>
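## 8. Show the default context of cells in remote Spark

Cells that do not start with a magic run in the remote Spark context by default. The following cells create and inspect the __df2__ DataFrame on the remote Spark.
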
In [9]:
df2 = sqlContext.sql("SELECT * FROM sample_08")

In [10]:
df2.head(3)

[Row(code=u'00-0000', description=u'All Occupations', total_emp=135185230, salary=42270), Row(code=u'11-0000', description=u'Management occupations', total_emp=6152650, salary=100310), Row(code=u'11-1011', description=u'Chief executives', total_emp=301930, salary=160440)]

<a id="missing-DataFrame"></a>
## 9. Show a missing DataFrame

Run the following command to confirm that the __df2__ DataFrame, which was created in the remote Spark context, is not available in the local notebook.

In [11]:
%%local
df2.head(3)

NameError: name 'df2' is not defined

<a id="return-DataFrame"></a>
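## 10. Return the DataFrame in the local notebook

Run the following command to make the __sample_08__ data available in the local notebook. The `-o dd` option returns the Spark DataFrame __dd__ to the local notebook as a pandas DataFrame with the same name.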

In [12]:
%%spark -o dd
dd = sqlContext.sql("SELECT * FROM sample_08")
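
Steps 8 to 10 depend on the remote Spark context and the local notebook keeping separate sets of variables. The following Python 3 sketch models that idea with two plain dictionaries used as namespaces. It is an analogy only: the lists stand in for DataFrames, and the real `-o` option converts the Spark DataFrame to a pandas DataFrame rather than copying a list.

```python
remote_ns = {}   # stands in for the remote Spark session's variables
local_ns = {}    # stands in for the local notebook's variables

# A cell without a magic runs in the remote context.
exec("df2 = ['row1', 'row2', 'row3']", remote_ns)

# A cell marked as local cannot see names defined remotely.
try:
    exec("df2[:3]", local_ns)
except NameError as err:
    print("NameError:", err)

# The -o option makes a remote result available locally under the same name.
exec("dd = ['row1', 'row2', 'row3']", remote_ns)
local_ns["dd"] = list(remote_ns["dd"])
exec("print(dd[:3])", local_ns)
```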

<a id="DataFrame-remote-Spark"></a>
## 11. Print the DataFrame in remote Spark context

In [13]:
dd.show()

+-------+--------------------+---------+------+
|   code|         description|total_emp|salary|
+-------+--------------------+---------+------+
|00-0000|     All Occupations|135185230| 42270|
|11-0000|Management occupa...|  6152650|100310|
|11-1011|    Chief executives|   301930|160440|
|11-1021|General and opera...|  1697690|107970|
|11-1031|         Legislators|    64650| 37980|
|11-2011|Advertising and p...|    36100| 94720|
|11-2021|  Marketing managers|   166790|118160|
|11-2022|      Sales managers|   333910|110390|
|11-2031|Public relations ...|    51730|101220|
|11-3011|Administrative se...|   246930| 79500|
|11-3021|Computer and info...|   276820|118710|
|11-3031|  Financial managers|   500590|110640|
|11-3041|Compensation and ...|    38810| 93410|
|11-3042|Training and deve...|    29350| 93830|
|11-3049|Human resources m...|    60980|103920|
|11-3051|Industrial produc...|   154030| 91200|
|11-3061| Purchasing managers|    67150| 94300|
|11-3071|Transportation, s...|    96300|

<a id="print-DataFrame-local"></a>
## 12. Print the DataFrame in the local notebook

In [14]:
%%local
dd.head(3)

<a id="delete-session"></a>
## 13. Delete the remote Spark session

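Run the following command to delete the remote session, replacing `SESSION_NAME` with the name of your session. You can also delete the session from the Spark management user interface that you opened in step 2.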

In [None]:
%spark delete -s SESSION_NAME
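
At the REST level, Livy ends a session through `DELETE /sessions/{sessionId}`. The Python 3 sketch below builds such a request without sending it. The endpoint URL is an assumption, and 146 is the session ID shown in the step 2 output; treat the mapping between the magic and this request as illustrative rather than a description of sparkmagic internals.

```python
from urllib.request import Request

# Hypothetical Livy endpoint; 146 is the session ID shown in step 2.
LIVY_URL = "http://livy.example.com:8998"
session_id = 146

# Build (but do not send) the request; urllib.request.urlopen(req) would send it.
req = Request("{}/sessions/{}".format(LIVY_URL, session_id), method="DELETE")
print(req.get_method(), req.full_url)
```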

## Summary

In this notebook, you learned how to work with remote Spark using Python 2 with the Livy Spark kernel.

<hr>
Copyright &copy; IBM Corp. 2017. Released as licensed Sample Materials.