# Spark Demo

## Install Java, Spark, and Findspark
This installs Apache Spark 2.4.3, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz -O spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

## Set Environment Variables
Set the locations where Spark and Java are installed.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7"
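
Both paths are specific to this Colab image and Spark version, and a wrong one typically only surfaces later, when Spark is started. A minimal sketch of an early check follows; `missing_dirs` is a hypothetical helper, not part of Spark or Findspark.

```python
import os


def missing_dirs(names, env=None):
    """Return the variable names that are unset or do not point at a directory."""
    env = os.environ if env is None else env
    return [name for name in names if not os.path.isdir(env.get(name, ""))]


# An empty list means both locations exist on this machine.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

If the list is not empty, re-check the install cell before calling `findspark.init()`.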

## Start a SparkSession
Findspark adds PySpark to Python's import path; the builder then starts a local Spark session that uses all available cores (`local[*]`).

In [None]:
import os
import subprocess

import findspark

findspark.init()

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

In [None]:
spark

In [None]:
!which python

In [None]:
!hostname
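
The ngrok cells in this notebook hard-code three different hostnames, which suggests the value changes between runtimes. A small sketch that builds the Spark UI address from the current hostname instead; `spark_ui_url` is a hypothetical helper, and port 4040 is taken from the URLs used in this notebook.

```python
import socket


def spark_ui_url(host=None, port=4040):
    """Build the in-container Spark UI address, defaulting to this machine's hostname."""
    host = host or socket.gethostname()
    return f"http://{host}:{port}/"


# Uses the hostname of the machine running the notebook.
print(spark_ui_url())
```

The result could be passed to ngrok in place of a pasted hostname, for example `subprocess.Popen(["./ngrok", "http", spark_ui_url()])`.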

In [None]:
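# Requires the ngrok binary, which is downloaded and unzipped in later cells.
# The hostname is the one printed by `!hostname`; replace it with your own.
ngrok = subprocess.Popen(["./ngrok", "http", "http://a78eefd263d2:4040/"])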

In [None]:
ngrok.kill()

In [None]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip > /dev/null 2>&1

In [None]:
!unzip ngrok-stable-linux-amd64.zip > /dev/null 2>&1

In [None]:
# Get the public URL of the ngrok tunnel that exposes the Spark web UI
! curl -s http://localhost:4041/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

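The parsing step of the shell pipeline can also be written as a plain Python function, which is easier to reuse. This is a sketch that assumes the response has the shape the one-liner relies on (a `tunnels` list whose entries carry a `public_url`); `first_public_url` is a hypothetical name, and the sample payload is made up for illustration.

```python
import json


def first_public_url(payload):
    """Return the public_url of the first tunnel in an ngrok /api/tunnels response."""
    tunnels = json.loads(payload).get("tunnels", [])
    if not tunnels:
        raise ValueError("no active tunnels; is ngrok running?")
    return tunnels[0]["public_url"]


# Made-up sample payload, for illustration only.
sample = '{"tunnels": [{"public_url": "https://example.ngrok.io"}]}'
print(first_public_url(sample))  # prints https://example.ngrok.io
```

For a live tunnel, the payload would come from the same endpoint the `curl` command queries, `http://localhost:4041/api/tunnels`.
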
In [None]:
get_ipython().system_raw('./ngrok http http://d36a079c99f5:4040/ &')

In [None]:
# Foreground variant: this cell blocks until ngrok is interrupted.
!./ngrok http d36a079c99f5:4040

In [None]:
import IPython

In [None]:
def display_tensorboard(port, height):
    import IPython.display
    shell = """
    <div id="root"></div>
    <script>
        (function() {
        window.TENSORBOARD_ENV = window.TENSORBOARD_ENV || {};
        window.TENSORBOARD_ENV["IN_COLAB"] = true;
        document.querySelector("base").href = "https://localhost:%PORT%/#/overview";
        function executeAllScripts(root) {
            // When `script` elements are inserted into the DOM by
            // assigning to an element's `innerHTML`, the scripts are not
            // executed. Thus, we manually re-insert these scripts so that
            // TensorBoard can initialize itself.
            for (const script of root.querySelectorAll("script")) {
            const newScript = document.createElement("script");
            newScript.type = script.type;
            newScript.textContent = script.textContent;
            root.appendChild(newScript);
            script.remove();
            }
        }
        function setHeight(root, height) {
            // We set the height dynamically after the TensorBoard UI has
            // been initialized. This avoids an intermediate state in
            // which the container plus the UI become taller than the
            // final height and cause the Colab output frame to be
            // permanently resized, eventually leading to an empty
            // vertical gap below the TensorBoard UI. It's not clear
            // exactly what causes this problematic intermediate state,
            // but setting the height late seems to fix it.
            root.style.height = `${height}px`;
        }
        const root = document.getElementById("root");
        fetch(".")
            .then((x) => x.text())
            .then((html) => void (root.innerHTML = html))
            .then(() => executeAllScripts(root))
            .then(() => setHeight(root, %HEIGHT%));
        })();
    </script>
    """.replace("%PORT%", "%d" % port).replace("%HEIGHT%", "%d" % height)
    html = IPython.display.HTML(shell)
    return html

In [None]:
display_tensorboard(8080, 400)

In [None]:
# Spark UI address inside the container (hostname differs per runtime):
# http://d3f38a4386f9:4040/

## Use Spark!
That's all there is to it - you're ready to use Spark!

In [None]:
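from pyspark.sql import Row

# Row avoids the deprecation warning Spark 2.x emits when inferring a schema from dicts.
df = spark.createDataFrame([Row(hello="world") for _ in range(1000)])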
df.show(3)