# Plotting With SparkMagic on Hops

To run large-scale computations on a Hops cluster from Jupyter, we use sparkmagic, a Livy REST server, and the PySpark kernel.

Because computation on a cluster is distributed over several machines by default, tasks such as plotting work a little differently than they do when running code locally.

This notebook illustrates how you can combine plotting and large-scale computations on a Hops cluster in a single notebook.

In [None]:
# When the PySpark kernel starts, a Spark session is created for us automatically
spark

## Check Which "Magic" Functions Are Available From SparkMagic

In [None]:
%%help

## Load a CSV file in Spark from your Project

In [None]:
from hops import hdfs
df = spark.read.format("csv").option("header", "true").load(hdfs.project_path() + "TestJob/data/visualization/Pokemon.csv")

In [None]:
df.count()

In [None]:
df.show(5)

## Name the Spark DataFrame to Be Able to Use SQL

In [None]:
df.createOrReplaceTempView("pokemons")

## Use SparkMagic to Collect the Spark DataFrame as a Pandas DataFrame Locally

The command below sends the query result (capped here at 10 rows by `--maxrows 10`) from the cluster to the server where Jupyter is running and converts it into a pandas DataFrame. This is only suitable for smaller datasets. A common practice is to run Spark jobs that process a large dataset and shrink it before plotting.

In [None]:
%%sql -c sql -o python_df --maxrows 10
SELECT * FROM pokemons
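The practice of shrinking data on the cluster before collecting it can be sketched as follows. This is a minimal illustration rather than part of the original notebook: `register_counts` is a hypothetical helper name, and it assumes `df` is a Spark DataFrame such as the one loaded above.

```python
def register_counts(df, column, view_name):
    """Aggregate on the cluster and expose the small result as a temp view.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame.
    """
    # groupBy(...).count() runs distributed and yields one row per distinct value
    counts = df.groupBy(column).count()
    # Naming the result makes it reachable from a %%sql cell
    counts.createOrReplaceTempView(view_name)
    return counts
```

Calling `register_counts(df, "Type 1", "type_counts")` in a cluster cell and then selecting from `type_counts` in a `%%sql -o` cell would bring back one row per type instead of the raw table.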

## The Pandas DataFrame Is Now Available in %%local Mode

In [None]:
%%local
python_df.head()

In [None]:
%%local
python_df["Name"].values

In [None]:
%%bash
pip install --user matplotlib
pip install --user seaborn

## Local Plotting with Matplotlib and Seaborn

Once the data has been loaded locally as a pandas DataFrame, it can be plotted on the Jupyter server with libraries such as matplotlib and seaborn. Putting the magic "%%local" at the top of a cell makes the code in that cell execute locally on the Jupyter server, rather than remotely through Livy on the Spark cluster.
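To see the split between remote and local execution in practice, a small probe like the one below can be run twice, once in a plain cell and once under "%%local". It uses only the standard library; `where_am_i` is a hypothetical helper name, and whether the two runs report different hosts depends on how the cluster is deployed.

```python
import os
import socket
import sys


def where_am_i():
    """Describe the process that is executing this cell."""
    return {
        "host": socket.gethostname(),
        "pid": os.getpid(),
        "python": "{}.{}".format(*sys.version_info[:2]),
    }


print(where_am_i())
```

Differences in the reported Python version are worth noting, since packages installed with pip on the Jupyter server are only visible to the local interpreter.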

In [None]:
%%local
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
%%local
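# Columns 5 through 10 are the stats to plot
stat_columns = python_df.columns[5:11]
plt.figure(figsize=(25, 20))

for ii, stat in enumerate(stat_columns):
    title = "Distribution of {stat}".format(stat=stat)
    plt.subplot(3, 3, ii + 1)
    plt.title(title)
    # Convert the column to an integer array (a bare map() is lazy on Python 3)
    values = python_df[stat].astype(int).values
    # distplot is deprecated since seaborn 0.11; sns.histplot(values, kde=True) replaces it
    sns.distplot(values)
    x = plt.gca().get_xlim()[1] * .6
    y = plt.gca().get_ylim()[1] * .9
    # Raw string keeps the mathtext backslashes intact
    plt.text(x, y, r'$\mu: {mu: .2f}, \sigma: {sigma: .2f}$'.format(mu=values.mean(), sigma=values.std()))

plt.tight_layout()
plt.show()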
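The annotation in the plots above reports the mean and the standard deviation as NumPy computes them by default, which is the population standard deviation (`ddof=0`). The following standard-library sketch reproduces the same arithmetic on a few made-up values; `describe` is a hypothetical helper and the sample numbers are illustrative, not taken from the dataset.

```python
import statistics


def describe(values):
    """Return (mean, population std) for numbers or numeric strings."""
    numbers = [int(v) for v in values]
    # pstdev divides by N, matching numpy's default std(ddof=0)
    return statistics.fmean(numbers), statistics.pstdev(numbers)


# Made-up sample values, given as strings to mimic untyped CSV input
mu, sigma = describe(["45", "60", "80", "80"])
label = r"$\mu: {mu:.2f}, \sigma: {sigma:.2f}$".format(mu=mu, sigma=sigma)
print(label)
```

If a sample standard deviation is wanted instead, `statistics.stdev` (or `ddof=1` in NumPy) divides by N - 1.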

In [None]:
%%local
# View the number of pokemons for Type 1 and Type 2 in one figure
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 8), sharex=True)

# Pass the column as x=...; newer seaborn versions require keyword arguments here
sns.countplot(x='Type 1', data=python_df, ax=ax1)
sns.countplot(x='Type 2', data=python_df, ax=ax2)

In [None]:
%%local
sns.catplot(x='Legendary',kind='count',data=python_df,height=5,aspect=1)