diff --git a/samples/features/sql-big-data-cluster/machine-learning/spark/h2o/h2o-automl-powerplant.ipynb b/samples/features/sql-big-data-cluster/machine-learning/spark/h2o/h2o-automl-powerplant.ipynb index 4f2aa6d349..8f33420586 100644 --- a/samples/features/sql-big-data-cluster/machine-learning/spark/h2o/h2o-automl-powerplant.ipynb +++ b/samples/features/sql-big-data-cluster/machine-learning/spark/h2o/h2o-automl-powerplant.ipynb @@ -1,549 +1,215 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Powerplant Output Prediction\n", - "- This notebook is based on the power plant output prediction example presented in H2O’s [blog post on H2O AutoML in Spark](https://www.h2o.ai/blog/h2os-automl-in-spark/).\n", - "- Run this notebook in Azure Data Studio connected to a SQL Server 2019 Big Data Cluster by following the instructions [here](https://docs.microsoft.com/en-us/sql/big-data-cluster/notebooks-guidance?view=sqlallproducts-allversions)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Spark Configuration\n", - "- We can control the Spark Driver and Executor memory, cores, and number of executors per pod using the “%%configure” cell magic\n", - "- Additional configuration settings are listed at the end of this notebook\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "language": "python" - }, - "outputs": [ - { - "data": { - "text/html": [ - "Current session configs: {'executorMemory': '4g', 'driverMemory': '4g', 'executorCores': 2, 'driverCores': 2, 'numExecutors': 2, 'kind': 'pyspark3'}
" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" + "metadata": { + "kernelspec": { + "name": "pyspark3kernel", + "display_name": "PySpark3" + }, + "language_info": { + "name": "pyspark3", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "python", + "version": 3 + }, + "pygments_lexer": "python3" + } }, - { - "data": { - "text/html": [ - "No active sessions." - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "%%configure -f\n", - "{\n", - " \"executorMemory\": \"4g\",\n", - " \"driverMemory\": \"4g\",\n", - " \"executorCores\": 2,\n", - " \"driverCores\": 2,\n", - " \"numExecutors\": 2\n", - "}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Install H2O\n", - "- This cell downloads the h2o_pysparkling_2.3 python package and installs it on the pod where the Spark driver is currently running, if it is not already installed. Propagating the software to additional pods is handled automatically once we launch H2O.\n", - "- For an enterprise scenario where we cannot reach out to the PyPi repository on the Internet, pip3 can be pointed to a local copy." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "language": "python" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Starting Spark application\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
IDYARN Application IDKindStateSpark UIDriver logCurrent session?
24application_1543381571657_0025pyspark3idleLinkLink
" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "SparkSession available as 'spark'.\n", - "Collecting h2o_pysparkling_2.3\n", - "Requirement already satisfied (use --upgrade to upgrade): tabulate in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): six in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): future in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): requests in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): colorama>=0.3.8 in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): pyspark<=2.3.2,>=2.3.0 in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): certifi>=2017.4.17 in /usr/local/lib/python3.5/dist-packages (from requests->h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.5/dist-packages (from requests->h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.5/dist-packages (from requests->h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): idna<2.8,>=2.5 in /usr/local/lib/python3.5/dist-packages (from requests->h2o_pysparkling_2.3)\n", - "Requirement already satisfied (use --upgrade to upgrade): py4j==0.10.7 in /usr/local/lib/python3.5/dist-packages (from pyspark<=2.3.2,>=2.3.0->h2o_pysparkling_2.3)\n", - "Installing collected packages: h2o-pysparkling-2.3\n", - "Successfully installed h2o-pysparkling-2.3-2.3.18\n", - "You are using pip version 8.1.1, however version 18.1 is available.\n", - "You should consider upgrading via the 'pip install --upgrade pip' command." - ] - } - ], - "source": [ - "import subprocess\n", - "\n", - "# Install H2O PySparkling\n", - "stdout = subprocess.check_output(\n", - " \"pip3 install h2o_pysparkling_2.3\",\n", - " stderr=subprocess.STDOUT,\n", - " shell=True).decode(\"utf-8\")\n", - "print(stdout)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Download and copy data to HDFS" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "language": "python" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "ls: `/tmp/powerplant_output.csv': No such file or directory\n", - "--2018-12-06 19:29:35-- https://raw.githubusercontent.com/h2oai/h2o-tutorials/master/h2o-world-2017/automl/data/powerplant_output.csv\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.\n", - "HTTP request sent, awaiting response... 200 OK\n", - "Length: 308777 (302K) [text/plain]\n", - "Saving to: 'powerplant_output.csv'\n", - "\n", - " 0K .......... .......... .......... .......... .......... 16% 574K 0s\n", - " 50K .......... .......... .......... .......... .......... 33% 1.14M 0s\n", - " 100K .......... .......... .......... .......... .......... 49% 30.9M 0s\n", - " 150K .......... .......... .......... .......... .......... 66% 1.16M 0s\n", - " 200K .......... .......... .......... .......... .......... 82% 42.7M 0s\n", - " 250K .......... .......... .......... .......... .......... 99% 77.2M 0s\n", - " 300K . 100% 2937G=0.2s\n", - "\n", - "2018-12-06 19:29:36 (1.68 MB/s) - 'powerplant_output.csv' saved [308777/308777]" - ] - } - ], - "source": [ - "dataFileName = \"powerplant_output.csv\"\n", - "dataFileUrl = \"https://raw.githubusercontent.com/h2oai/h2o-tutorials/master/h2o-world-2017/automl/data/\" + dataFileName\n", - "\n", - "# Download data file and copy to HDFS, if not already there\n", - "cmd = 'hdfs dfs -ls /tmp/' + dataFileName + ' || ' \\\n", - " '(wget ' + dataFileUrl + ' && ' \\\n", - " 'hdfs dfs -copyFromLocal ' + dataFileName + ' /tmp && ' \\\n", - " 'rm ' + dataFileName + ')'\n", - "\n", - "stdout = subprocess.check_output(\n", - " cmd,\n", - " stderr=subprocess.STDOUT,\n", - " shell=True).decode(\"utf-8\")\n", - "print(stdout)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Start H2O engine" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "language": "python" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Connecting to H2O server at http://10.244.0.66:54323... successful.\n", - "-------------------------- ---------------------------------------------------\n", - "H2O cluster uptime: 13 secs\n", - "H2O cluster timezone: Etc/UTC\n", - "H2O data parsing timezone: UTC\n", - "H2O cluster version: 3.22.0.2\n", - "H2O cluster version age: 14 days, 12 hours and 24 minutes\n", - "H2O cluster name: sparkling-water-root_application_1543381571657_0021\n", - "H2O cluster total nodes: 2\n", - "H2O cluster free memory: 6.928 Gb\n", - "H2O cluster total cores: 32\n", - "H2O cluster allowed cores: 4\n", - "H2O cluster status: accepting new members, healthy\n", - "H2O connection url: http://10.244.0.66:54323\n", - "H2O connection proxy:\n", - "H2O internal security: False\n", - "H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4\n", - "Python version: 3.5.2 final\n", - "-------------------------- ---------------------------------------------------\n", - "\n", - "Sparkling Water Context:\n", - " * H2O name: sparkling-water-root_application_1543381571657_0021\n", - " * cluster size: 2\n", - " * list of used nodes:\n", - " (executorId, host, port)\n", - " ------------------------\n", - " (1,mssql-storage-pool-default-1.service-storage-pool-default.test.svc.cluster.local,54321)\n", - " (2,mssql-storage-pool-default-0.service-storage-pool-default.test.svc.cluster.local,54321)\n", - " ------------------------\n", - "\n", - " Open H2O Flow in browser: http://10.244.0.66:54323 (CMD + click in Mac OSX)\n", - "\n", - " \n", - " * Yarn App ID of Spark application: application_1543381571657_0021" - ] - } - ], - "source": [ - "from pysparkling import H2OContext\n", - "\n", - "hc = H2OContext.getOrCreate(spark)" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "language": "python" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "mssql-storage-pool-default-0" - ] - } - ], - "source": [ - "# Print the hostname of the pod where the driver is running\n", - "stdout = subprocess.check_output(\n", - " \"hostname\",\n", - " stderr=subprocess.STDOUT,\n", - " shell=True).decode(\"utf-8\")\n", - "print(stdout)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Read and split data" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "language": "python" - }, - "outputs": [], - "source": [ - "powerplant_df = spark.read.option(\"inferSchema\", \"true\").csv(\"/tmp/powerplant_output.csv\", header=True)\n", - "\n", - "splits = powerplant_df.randomSplit([0.8, 0.2], seed=1)\n", - "train = splits[0]\n", - "for_predictions = splits[1]\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Training and Prediction\n", - "- Fit AutoML model on training data\n", - "- Generate predictions on \"for_predictions\" data" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "language": "python" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------------+---------------+-----------------------+----------------+--------------------+--------------------+\n", - "|TemperatureCelcius|ExhaustVacuumHg|AmbientPressureMillibar|RelativeHumidity|HourlyEnergyOutputMW| prediction_output|\n", - "+------------------+---------------+-----------------------+----------------+--------------------+--------------------+\n", - "| 10.01| 41.17| 1018.78| 86.84| 479.4|[477.07336140008584]|\n", - "| 10.02| 39.66| 1016.34| 79.98| 480.05| [476.9418514217447]|\n", - "| 10.03| 43.13| 1014.85| 70.09| 482.16| [476.1059605448468]|\n", - "| 10.04| 41.62| 1013.36| 95.17| 463.87| [470.5313939263834]|\n", - "| 10.05| 41.58| 1021.35| 95.19| 469.03| [469.2805814291723]|\n", - "| 10.06| 34.69| 1027.9| 71.73| 477.68| [477.8063480544951]|\n", - "| 10.08| 37.92| 1010.47| 66.37| 474.63| [475.4396167315097]|\n", - "| 10.08| 41.16| 1023.14| 96.03| 469.17|[471.84298416925895]|\n", - "| 10.09| 41.01| 1019.89| 96.55| 471.15| [471.4895037099167]|\n", - "| 10.1| 41.4| 1024.29| 85.94| 474.28|[477.99005354475656]|\n", - "| 10.11| 39.35| 1015.19| 90.74| 479.83| [478.0068968374228]|\n", - "| 10.11| 39.72| 1019.1| 69.68| 476.8|[474.19764922516293]|\n", - "| 10.11| 42.49| 1010.22| 82.11| 483.56|[477.02467192323263]|\n", - "| 10.12| 41.55| 1005.78| 62.34| 475.46|[475.25159731408183]|\n", - "| 10.12| 41.78| 1013.43| 73.47| 477.67| [475.3766641350098]|\n", - "| 10.13| 39.18| 1024.09| 85.48| 479.42|[478.11614912148826]|\n", - "| 10.15| 39.22| 1020.09| 68.75| 474.87| [477.1365182805492]|\n", - "| 10.15| 41.46| 1019.78| 83.56| 481.31|[479.42546663321224]|\n", - "| 10.15| 43.41| 1018.4| 82.07| 473.43|[476.34953390951085]|\n", - "| 10.16| 39.3| 1019.71| 81.21| 480.74|[476.95324226094823]|\n", - "+------------------+---------------+-----------------------+----------------+--------------------+--------------------+\n", - "only showing top 20 rows" - ] - } - ], - "source": [ - "from pyspark.ml.feature import SQLTransformer\n", - "from pysparkling.ml import H2OAutoML\n", - "from pyspark.ml import Pipeline\n", - "\n", - "temperatureTransformer = SQLTransformer(statement=\"SELECT * FROM __THIS__ WHERE TemperatureCelcius > 10\")\n", - "\n", - "automlEstimator = H2OAutoML(maxModels=2, predictionCol=\"HourlyEnergyOutputMW\", seed=1)\n", - "\n", - "pipeline = Pipeline(stages=[temperatureTransformer, automlEstimator])\n", - "\n", - "# Fit AutoML model\n", - "model = pipeline.fit(train)\n", - "\n", - "# Generate predictions using fitted model\n", - "predicted = model.transform(for_predictions)\n", - "\n", - "predicted.show()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Display the leaderboard metrics" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "language": "python" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------------------------------------------------+----------------------+------------------+------------------+------------------+--------------------+\n", - "|model_id |mean_residual_deviance|rmse |mse |mae |rmsle |\n", - "+---------------------------------------------------+----------------------+------------------+------------------+------------------+--------------------+\n", - "|StackedEnsemble_BestOfFamily_AutoML_20181206_193040|11.204658852035337 |3.3473360829225585|11.204658852035337|2.509389117612746 |0.007425043374216511|\n", - "|StackedEnsemble_AllModels_AutoML_20181206_193040 |11.204658852035337 |3.3473360829225585|11.204658852035337|2.509389117612746 |0.007425043374216511|\n", - "|DRF_1_AutoML_20181206_193040 |11.349494687812056 |3.3689011098297406|11.349494687812056|2.5426288374345605|0.007472705853530634|\n", - "|XRT_1_AutoML_20181206_193040 |11.464865035526516 |3.3859806608317355|11.464865035526516|2.545269555169042 |0.007510102104757881|\n", - "+---------------------------------------------------+----------------------+------------------+------------------+------------------+--------------------+" - ] - } - ], - "source": [ - "automlEstimator.leaderboard().show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Evaluate predictions on held-out data\n", - "- As expected, we find that the mean absolute error (mae) on the for_predictions data is similar to the leaderboard mae" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "language": "python" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Mean absolute error: 2.3167231443313843" - ] - } - ], - "source": [ - "from pyspark.sql.functions import *\n", - "from pyspark.ml.evaluation import RegressionEvaluator\n", - "\n", - "scores = predicted.select(predicted['HourlyEnergyOutputMW'], predicted['prediction_output']['value'].alias('prediction'))\n", - "\n", - "evaluator = RegressionEvaluator(predictionCol=\"prediction\",\n", - " labelCol=\"HourlyEnergyOutputMW\",\n", - " metricName=\"mae\")\n", - "\n", - "mae = evaluator.evaluate(scores)\n", - "\n", - "print(\"Mean absolute error:\", mae)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Configuration settings for scaling to larger data\n", - "\n", - "## Number and size of nodes in our Kubernetes cluster\n", - "We can control the number and size of nodes in our Kubernetes cluster via the node-vm-size and node-count switches in our `aks create` command:\n", - "\n", - "`az aks create --name mycluster --resource-group myrg --generate-ssh-keys --node-vm-size Standard_DS14_v2 --node-count 3 --kubernetes-version 1.10.9`\n", - "\n", - "More information is available [here](https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-on-aks?view=sqlallproducts-allversions#create-a-kubernetes-cluster).\n", - "\n", - "## Number of Spark pods\n", - "We can control the number of Spark pods via the CLUSTER_STORAGE_POOL_REPLICAS environment variable used by `mssqlctl create cluster`:\n", - "\n", - "SET CLUSTER_STORAGE_POOL_REPLICAS=2\n", - "\n", - "## YARN scheduler memory and cores\n", - "We can control the YARN scheduler memory and cores via the following environment variable used by `mssqlctl create cluster`:\n", - "\n", - "- YARN_SCHEDULER_MAX_MEMORY\n", - "- YARN_SCHEDULER_MAX_VCORES\n", - "- YARN_NODEMANAGER_RESOURCE_MEMORY\n", - "- YARN_NODEMANAGER_RESOURCE_VCORES\n", - "\n", - "Further information regarding mssqlctl environtment variables is available [here](https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-guidance?view=sqlallproducts-allversions#define-environment-variables).\n", - "\n", - "## Livy timeout\n", - "The Livy timeout sets a limit on the runtime of a cell in a PySpark3 Jupyter notebook. In SQL Server 2019 Big Data CTP 2.1, the Livy timeout defaults to 1 hour. In CTP 2.2, it defaults to 24 days. One can modify this as follows:\n", - "\n", - "- Log into the mssql-master-pool-0 pod using this command (requires permission to run kubectl):\n", - "\n", - "```\n", - "kubectl exec -it mssql-master-pool-0 -n -- /bin/bash\n", - "```\n", - "- To set the Livy timeout to 24 days, run the following command or edit /livy/conf/livy.conf accordingly:\n", - "\n", - "```\n", - "echo 'livy.server.session.timeout = 24d' | cat >> /livy/conf/livy.conf \n", - "```\n", - "- Then restart the Livy server by running the following command:\n", - "\n", - "```\n", - "supervisorctl restart livy\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Monitoring and Diagnostics\n", - "## YARN UI\n", - "\n", - "Access from the \"View Yarn History\" button in Azure Data Studio (ADS) or at `https://:30443/gateway/default/yarn`\n", - "\n", - "## Spark UI\n", - "\n", - "Access from the \"Spark UI\" link that appears after running the first cell in a notebook in a (Py)Spark kernel in ADS or by clicking the ApplicationMaster link of a running application in the YARN UI\n", - "\n", - "## Spark History\n", - "\n", - "Access from the \"View Spark History\" button in ADS or at `https://:30443/gateway/default/sparkhistory`\n", - "\n", - "## H2O Flow UI\n", - "- The command `H2OContext.getOrCreate(spark)` outputs the IP address and port number for connection to H2O’s Flow UI, for example:\n", - "\n", - " `H2O connection url: http://10.244.0.16:54325`\n", - "\n", - "- This connection can be forwarded to one’s workstation using this command (requires permission to run kubectl):\n", - "\n", - " `kubectl -n test port-forward `\n", - "\n", - " Here is an example:\n", - "\n", - " `kubectl -n test port-forward mssql-storage-pool-default-0 54325`\n", - "\n", - " The port number is the number after the colon in the H2O connection URL.\n", - " \n", - " To determine ``, run this command in the PySpark3 kernel in the ADS notebook:\n", - "\n", - "```python\n", - " # Print the hostname of the pod where the driver is running\n", - " stdout = subprocess.check_output( \n", - " \"hostname\", \n", - " stderr=subprocess.STDOUT, \n", - " shell=True).decode(\"utf-8\") \n", - " print(stdout) \n", - "```\n", - "\n", - "- After setting up port forwarding, the Flow UI can be accessed at `http://localhost:`; for example, `http://localhost:54325`" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "PySpark3", - "language": "", - "name": "pyspark3kernel" - }, - "language_info": { - "codemirror_mode": { - "name": "python", - "version": 3 - }, - "mimetype": "text/x-python", - "name": "pyspark3", - "pygments_lexer": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} + "nbformat_minor": 2, + "nbformat": 4, + "cells": [ + { + "cell_type": "markdown", + "source": "# Powerplant Output Prediction\n- This notebook is based on the power plant output prediction example presented in H2O’s [blog post on H2O AutoML in Spark](https://www.h2o.ai/blog/h2os-automl-in-spark/).\n- Run this notebook in Azure Data Studio connected to a SQL Server 2019 Big Data Cluster by following the instructions [here](https://docs.microsoft.com/en-us/sql/big-data-cluster/notebooks-guidance?view=sqlallproducts-allversions).", + "metadata": {} + }, + { + "cell_type": "markdown", + "source": "## Spark Configuration\n- We can control the Spark Driver and Executor memory, cores, and number of executors per pod using the “%%configure” cell magic\n- Additional configuration settings are listed at the end of this notebook\n", + "metadata": {} + }, + { + "cell_type": "code", + "source": "%%configure -f\n{\n \"executorMemory\": \"4g\",\n \"driverMemory\": \"4g\",\n \"executorCores\": 2,\n \"driverCores\": 2,\n \"numExecutors\": 2\n}", + "metadata": { + "language": "python" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/html": "Current session configs: {'executorMemory': '4g', 'driverMemory': '4g', 'executorCores': 2, 'driverCores': 2, 'numExecutors': 2, 'kind': 'pyspark3'}
", + "text/plain": "" + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/html": "No active sessions.", + "text/plain": "" + }, + "metadata": {} + } + ], + "execution_count": 1 + }, + { + "cell_type": "markdown", + "source": "# Install H2O\n- This cell downloads the h2o_pysparkling_* python package and installs it on the pod where the Spark driver is currently running, if it is not already installed. Propagating the software to additional pods is handled automatically once we launch H2O.\n- _**Be sure to match the version of pysparkling to the version of Spark on your cluster:**_\n - For clusters running Spark 2.3 (SQL Server 2019 CTP 2.3 and earlier), use \"pip3 install h2o_pysparkling_2.3\"\n - For clusters running Spark 2.4 (SQL Server 2019 CTP 2.4 and later), use \"pip3 install h2o_pysparkling_2.4\"\n- For an enterprise scenario where we cannot reach out to the PyPi repository on the Internet, pip3 can be pointed to a local copy.", + "metadata": {} + }, + { + "cell_type": "code", + "source": "import subprocess\n\n# Install H2O PySparkling\nstdout = subprocess.check_output(\n \"pip3 install h2o_pysparkling_2.3\",\n stderr=subprocess.STDOUT,\n shell=True).decode(\"utf-8\")\nprint(stdout)\n", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": "Starting Spark application\n" + }, + { + "output_type": "display_data", + "data": { + "text/html": "\n
IDYARN Application IDKindStateSpark UIDriver logCurrent session?
24application_1543381571657_0025pyspark3idleLinkLink
", + "text/plain": "" + }, + "metadata": {} + }, + { + "output_type": "stream", + "name": "stdout", + "text": "SparkSession available as 'spark'.\nCollecting h2o_pysparkling_2.3\nRequirement already satisfied (use --upgrade to upgrade): tabulate in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): six in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): future in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): requests in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): colorama>=0.3.8 in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): pyspark<=2.3.2,>=2.3.0 in /usr/local/lib/python3.5/dist-packages (from h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): certifi>=2017.4.17 in /usr/local/lib/python3.5/dist-packages (from requests->h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.5/dist-packages (from requests->h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.5/dist-packages (from requests->h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): idna<2.8,>=2.5 in /usr/local/lib/python3.5/dist-packages (from requests->h2o_pysparkling_2.3)\nRequirement already satisfied (use --upgrade to upgrade): py4j==0.10.7 in /usr/local/lib/python3.5/dist-packages (from pyspark<=2.3.2,>=2.3.0->h2o_pysparkling_2.3)\nInstalling collected packages: h2o-pysparkling-2.3\nSuccessfully installed h2o-pysparkling-2.3-2.3.18\nYou are using pip version 8.1.1, however version 18.1 is available.\nYou should consider upgrading via the 'pip install --upgrade pip' command." + } + ], + "execution_count": 1 + }, + { + "cell_type": "markdown", + "source": "# Download and copy data to HDFS", + "metadata": {} + }, + { + "cell_type": "code", + "source": "dataFileName = \"powerplant_output.csv\"\ndataFileUrl = \"https://raw.githubusercontent.com/h2oai/h2o-tutorials/master/h2o-world-2017/automl/data/\" + dataFileName\n\n# Download data file and copy to HDFS, if not already there\ncmd = 'hdfs dfs -ls /tmp/' + dataFileName + ' || ' \\\n '(wget ' + dataFileUrl + ' && ' \\\n 'hdfs dfs -copyFromLocal ' + dataFileName + ' /tmp && ' \\\n 'rm ' + dataFileName + ')'\n\nstdout = subprocess.check_output(\n cmd,\n stderr=subprocess.STDOUT,\n shell=True).decode(\"utf-8\")\nprint(stdout)", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": "ls: `/tmp/powerplant_output.csv': No such file or directory\n--2018-12-06 19:29:35-- https://raw.githubusercontent.com/h2oai/h2o-tutorials/master/h2o-world-2017/automl/data/powerplant_output.csv\nResolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133\nConnecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 308777 (302K) [text/plain]\nSaving to: 'powerplant_output.csv'\n\n 0K .......... .......... .......... .......... .......... 16% 574K 0s\n 50K .......... .......... .......... .......... .......... 33% 1.14M 0s\n 100K .......... .......... .......... .......... .......... 49% 30.9M 0s\n 150K .......... .......... .......... .......... .......... 66% 1.16M 0s\n 200K .......... .......... .......... .......... .......... 82% 42.7M 0s\n 250K .......... .......... .......... .......... .......... 99% 77.2M 0s\n 300K . 100% 2937G=0.2s\n\n2018-12-06 19:29:36 (1.68 MB/s) - 'powerplant_output.csv' saved [308777/308777]" + } + ], + "execution_count": 1 + }, + { + "cell_type": "markdown", + "source": "# Start H2O engine", + "metadata": {} + }, + { + "cell_type": "code", + "source": "from pysparkling import H2OContext\n\nhc = H2OContext.getOrCreate(spark)", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": "Connecting to H2O server at http://10.244.0.66:54323... successful.\n-------------------------- ---------------------------------------------------\nH2O cluster uptime: 13 secs\nH2O cluster timezone: Etc/UTC\nH2O data parsing timezone: UTC\nH2O cluster version: 3.22.0.2\nH2O cluster version age: 14 days, 12 hours and 24 minutes\nH2O cluster name: sparkling-water-root_application_1543381571657_0021\nH2O cluster total nodes: 2\nH2O cluster free memory: 6.928 Gb\nH2O cluster total cores: 32\nH2O cluster allowed cores: 4\nH2O cluster status: accepting new members, healthy\nH2O connection url: http://10.244.0.66:54323\nH2O connection proxy:\nH2O internal security: False\nH2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4\nPython version: 3.5.2 final\n-------------------------- ---------------------------------------------------\n\nSparkling Water Context:\n * H2O name: sparkling-water-root_application_1543381571657_0021\n * cluster size: 2\n * list of used nodes:\n (executorId, host, port)\n ------------------------\n (1,mssql-storage-pool-default-1.service-storage-pool-default.test.svc.cluster.local,54321)\n (2,mssql-storage-pool-default-0.service-storage-pool-default.test.svc.cluster.local,54321)\n ------------------------\n\n Open H2O Flow in browser: http://10.244.0.66:54323 (CMD + click in Mac OSX)\n\n \n * Yarn App ID of Spark application: application_1543381571657_0021" + } + ], + "execution_count": 1 + }, + { + "cell_type": "code", + "source": "# Print the hostname of the pod where the driver is running\nstdout = subprocess.check_output(\n \"hostname\",\n stderr=subprocess.STDOUT,\n shell=True).decode(\"utf-8\")\nprint(stdout)", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": "mssql-storage-pool-default-0" + } + ], + "execution_count": 1 + }, + { + "cell_type": "markdown", + "source": "# Read and split data", + "metadata": {} + }, + { + "cell_type": "code", + "source": "powerplant_df = spark.read.option(\"inferSchema\", \"true\").csv(\"/tmp/powerplant_output.csv\", header=True)\n\nsplits = powerplant_df.randomSplit([0.8, 0.2], seed=1)\ntrain = splits[0]\nfor_predictions = splits[1]\n", + "metadata": {}, + "outputs": [], + "execution_count": 1 + }, + { + "cell_type": "markdown", + "source": "# Training and Prediction\n- Fit AutoML model on training data\n- Generate predictions on \"for_predictions\" data", + "metadata": {} + }, + { + "cell_type": "code", + "source": "from pyspark.ml.feature import SQLTransformer\nfrom pysparkling.ml import H2OAutoML\nfrom pyspark.ml import Pipeline\n\ntemperatureTransformer = SQLTransformer(statement=\"SELECT * FROM __THIS__ WHERE TemperatureCelcius > 10\")\n\nautomlEstimator = H2OAutoML(maxModels=2, predictionCol=\"HourlyEnergyOutputMW\", seed=1)\n\npipeline = Pipeline(stages=[temperatureTransformer, automlEstimator])\n\n# Fit AutoML model\nmodel = pipeline.fit(train)\n\n# Generate predictions using fitted model\npredicted = model.transform(for_predictions)\n\npredicted.show()\n", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": "+------------------+---------------+-----------------------+----------------+--------------------+--------------------+\n|TemperatureCelcius|ExhaustVacuumHg|AmbientPressureMillibar|RelativeHumidity|HourlyEnergyOutputMW| prediction_output|\n+------------------+---------------+-----------------------+----------------+--------------------+--------------------+\n| 10.01| 41.17| 1018.78| 86.84| 479.4|[477.07336140008584]|\n| 10.02| 39.66| 1016.34| 79.98| 480.05| [476.9418514217447]|\n| 10.03| 43.13| 1014.85| 70.09| 482.16| [476.1059605448468]|\n| 10.04| 41.62| 1013.36| 95.17| 463.87| [470.5313939263834]|\n| 10.05| 41.58| 1021.35| 95.19| 469.03| [469.2805814291723]|\n| 10.06| 34.69| 1027.9| 71.73| 477.68| [477.8063480544951]|\n| 10.08| 37.92| 1010.47| 66.37| 474.63| [475.4396167315097]|\n| 10.08| 41.16| 1023.14| 96.03| 469.17|[471.84298416925895]|\n| 10.09| 41.01| 1019.89| 96.55| 471.15| [471.4895037099167]|\n| 10.1| 41.4| 1024.29| 85.94| 474.28|[477.99005354475656]|\n| 10.11| 39.35| 1015.19| 90.74| 479.83| [478.0068968374228]|\n| 10.11| 39.72| 1019.1| 69.68| 476.8|[474.19764922516293]|\n| 10.11| 42.49| 1010.22| 82.11| 483.56|[477.02467192323263]|\n| 10.12| 41.55| 1005.78| 62.34| 475.46|[475.25159731408183]|\n| 10.12| 41.78| 1013.43| 73.47| 477.67| [475.3766641350098]|\n| 10.13| 39.18| 1024.09| 85.48| 479.42|[478.11614912148826]|\n| 10.15| 39.22| 1020.09| 68.75| 474.87| [477.1365182805492]|\n| 10.15| 41.46| 1019.78| 83.56| 481.31|[479.42546663321224]|\n| 10.15| 43.41| 1018.4| 82.07| 473.43|[476.34953390951085]|\n| 10.16| 39.3| 1019.71| 81.21| 480.74|[476.95324226094823]|\n+------------------+---------------+-----------------------+----------------+--------------------+--------------------+\nonly showing top 20 rows" + } + ], + "execution_count": 1 + }, + { + "cell_type": "markdown", + "source": "# Display the leaderboard metrics", + "metadata": {} + }, + { + "cell_type": "code", + "source": "automlEstimator.leaderboard().show(truncate=False)", + "metadata": { + "language": "python" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": "+---------------------------------------------------+----------------------+------------------+------------------+------------------+--------------------+\n|model_id |mean_residual_deviance|rmse |mse |mae |rmsle |\n+---------------------------------------------------+----------------------+------------------+------------------+------------------+--------------------+\n|StackedEnsemble_BestOfFamily_AutoML_20181206_193040|11.204658852035337 |3.3473360829225585|11.204658852035337|2.509389117612746 |0.007425043374216511|\n|StackedEnsemble_AllModels_AutoML_20181206_193040 |11.204658852035337 |3.3473360829225585|11.204658852035337|2.509389117612746 |0.007425043374216511|\n|DRF_1_AutoML_20181206_193040 |11.349494687812056 |3.3689011098297406|11.349494687812056|2.5426288374345605|0.007472705853530634|\n|XRT_1_AutoML_20181206_193040 |11.464865035526516 |3.3859806608317355|11.464865035526516|2.545269555169042 |0.007510102104757881|\n+---------------------------------------------------+----------------------+------------------+------------------+------------------+--------------------+" + } + ], + "execution_count": 1 + }, + { + "cell_type": "markdown", + "source": "# Evaluate predictions on held-out data\n- As expected, we find that the mean absolute error (mae) on the for_predictions data is similar to the leaderboard mae", + "metadata": {} + }, + { + "cell_type": "code", + "source": "from pyspark.sql.functions import *\nfrom pyspark.ml.evaluation import RegressionEvaluator\n\nscores = predicted.select(predicted['HourlyEnergyOutputMW'], predicted['prediction_output']['value'].alias('prediction'))\n\nevaluator = RegressionEvaluator(predictionCol=\"prediction\",\n labelCol=\"HourlyEnergyOutputMW\",\n metricName=\"mae\")\n\nmae = evaluator.evaluate(scores)\n\nprint(\"Mean absolute error:\", mae)", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": "Mean absolute error: 2.3167231443313843" + } + ], + "execution_count": 1 + }, + { + "cell_type": "markdown", + "source": "# Configuration settings for scaling to larger data\n\n## Number and size of nodes in our Kubernetes cluster\nWe can control the number and size of nodes in our Kubernetes cluster via the node-vm-size and node-count switches in our `aks create` command:\n\n`az aks create --name mycluster --resource-group myrg --generate-ssh-keys --node-vm-size Standard_DS14_v2 --node-count 3 --kubernetes-version 1.10.9`\n\nMore information is available [here](https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-on-aks?view=sqlallproducts-allversions#create-a-kubernetes-cluster).\n\n## Number of Spark pods\nWe can control the number of Spark pods via the CLUSTER_STORAGE_POOL_REPLICAS environment variable used by `mssqlctl create cluster`:\n\nSET CLUSTER_STORAGE_POOL_REPLICAS=2\n\n## YARN scheduler memory and cores\nWe can control the YARN scheduler memory and cores via the following environment variable used by `mssqlctl create cluster`:\n\n- YARN_SCHEDULER_MAX_MEMORY\n- YARN_SCHEDULER_MAX_VCORES\n- YARN_NODEMANAGER_RESOURCE_MEMORY\n- YARN_NODEMANAGER_RESOURCE_VCORES\n\nFurther information regarding mssqlctl environtment variables is available [here](https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-guidance?view=sqlallproducts-allversions#define-environment-variables).\n\n## Livy timeout\nThe Livy timeout sets a limit on the runtime of a cell in a PySpark3 Jupyter notebook. In SQL Server 2019 Big Data CTP 2.1, the Livy timeout defaults to 1 hour. In CTP 2.2, it defaults to 24 days. One can modify this as follows:\n\n- Log into the mssql-master-pool-0 pod using this command (requires permission to run kubectl):\n\n```\nkubectl exec -it mssql-master-pool-0 -n -- /bin/bash\n```\n- To set the Livy timeout to 24 days, run the following command or edit /livy/conf/livy.conf accordingly:\n\n```\necho 'livy.server.session.timeout = 24d' | cat >> /livy/conf/livy.conf \n```\n- Then restart the Livy server by running the following command:\n\n```\nsupervisorctl restart livy\n```", + "metadata": {} + }, + { + "cell_type": "markdown", + "source": "# Monitoring and Diagnostics\n## YARN UI\n\nAccess from the \"View Yarn History\" button in Azure Data Studio (ADS) or at `https://:30443/gateway/default/yarn`\n\n## Spark UI\n\nAccess from the \"Spark UI\" link that appears after running the first cell in a notebook in a (Py)Spark kernel in ADS or by clicking the ApplicationMaster link of a running application in the YARN UI\n\n## Spark History\n\nAccess from the \"View Spark History\" button in ADS or at `https://:30443/gateway/default/sparkhistory`\n\n## H2O Flow UI\n- The command `H2OContext.getOrCreate(spark)` outputs the IP address and port number for connection to H2O’s Flow UI, for example:\n\n `H2O connection url: http://10.244.0.16:54325`\n\n- This connection can be forwarded to one’s workstation using this command (requires permission to run kubectl):\n\n `kubectl -n test port-forward `\n\n Here is an example:\n\n `kubectl -n test port-forward mssql-storage-pool-default-0 54325`\n\n The port number is the number after the colon in the H2O connection URL.\n \n To determine ``, run this command in the PySpark3 kernel in the ADS notebook:\n\n```python\n # Print the hostname of the pod where the driver is running\n stdout = subprocess.check_output( \n \"hostname\", \n stderr=subprocess.STDOUT, \n shell=True).decode(\"utf-8\") \n print(stdout) \n```\n\n- After setting up port forwarding, the Flow UI can be accessed at `http://localhost:`; for example, `http://localhost:54325`", + "metadata": {} + } + ] +} \ No newline at end of file