Update Spark example
maystery committed Jan 18, 2021
1 parent aae1539 commit b3f33a7
Showing 6 changed files with 11 additions and 15 deletions.
19 changes: 7 additions & 12 deletions sphinx/source/tutorial-bigdata-ai.rst
@@ -6,7 +6,7 @@ Big Data and AI applications
Apache Hadoop cluster
~~~~~~~~~~~~~~~~~~~~~

This tutorial sets up a complete Apache Hadoop infrastructure. It contains a Hadoop Master node and Hadoop Slave worker nodes, which can be scaled up or down. To register Hadoop Slave nodes Consul is used.
This tutorial sets up a complete Apache Hadoop (version **2.10.1**) infrastructure. It contains a Hadoop Master node and Hadoop Slave worker nodes, which can be scaled up or down. Consul is used to register the Hadoop Slave nodes.
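
Deploying and resizing the cluster is done with the Occopus command-line tools. The snippet below is only a minimal sketch: the descriptor file name is illustrative, and the ``occopus-scale`` options are written down as assumptions, so check them against ``occopus-scale --help`` and the Occopus manual before use.

.. code:: bash

   # Build the Hadoop infrastructure from the tutorial's infra descriptor
   # (the file name below is illustrative -- use the descriptor shipped in the package).
   occopus-build infra-hadoop-cluster.yaml

   # Request a change in the number of hadoop-slave nodes; the -i/-n/-c options
   # are assumptions, verify the exact syntax with occopus-scale --help.
   occopus-scale -i <infra id> -n hadoop-slave -c 2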

**Features**

@@ -19,7 +19,6 @@ This tutorial sets up a complete Apache Hadoop infrastructure. It contains a Had

- accessing a cloud through an Occopus-compatible interface (e.g. EC2, Nova, Azure, etc.)
- target cloud contains a base Ubuntu OS image with cloud-init support
- generated ssh key-pair (or for testing purposes one is attached)

**Download**

@@ -113,16 +112,12 @@ You can download the example as `tutorial.examples.hadoop-cluster <https://raw.g
14032858-d628-40a2-b611-71381bd463fa
#. You can check the health and statistics of the cluster through the following web pages:

- Health of nodes: ``http://[HadoopMasterIP]:50070``
- Job statistics: ``http://[HadoopMasterIP]:8088``

#. To launch a Hadoop MapReduce job copy your input and executable files to the Hadoop Master node, and perform the submission described `here <https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html>`_ . To login to the Hadoop Master node use the private key attached to the tutorial package:

.. code:: bash

   ssh -i builtin_hadoop_private_key hduser@[HadoopMaster ip]
#. To launch a Hadoop MapReduce job, copy your input and executable files to the Hadoop Master node and perform the submission described `here <https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html>`_ (see the sketch after this list).

#. Finally, you may destroy the infrastructure using the infrastructure id returned by ``occopus-build``.
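
A minimal end-to-end sketch of the last steps, assuming you have SSH access to the Master node (the ``hduser`` account name follows the earlier revision of this tutorial and may differ), that the stock Hadoop example jar is available under the Hadoop installation directory, and that the key, input and output names are illustrative:

.. code:: bash

   # Quick reachability check of the NameNode web UI from your workstation
   curl -s -o /dev/null -w "%{http_code}\n" http://[HadoopMasterIP]:50070

   # Copy the input data to the Master node and log in
   scp -i <your private key> input.txt hduser@[HadoopMasterIP]:
   ssh -i <your private key> hduser@[HadoopMasterIP]

   # On the Master node: load the input into HDFS and run the stock wordcount example
   # ($HADOOP_HOME is assumed to point at the Hadoop installation directory)
   hdfs dfs -mkdir -p input
   hdfs dfs -put input.txt input/
   yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
       wordcount input output
   hdfs dfs -cat output/part-r-00000 | head

   # Back on your workstation: destroy the infrastructure with the id returned by occopus-build
   occopus-destroy -i 14032858-d628-40a2-b611-71381bd463fa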

@@ -134,7 +129,7 @@ You can download the example as `tutorial.examples.hadoop-cluster <https://raw.g
Apache Spark cluster with RStudio Stack
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. For more information visit the `official Apache Spark page <https://spark.apache.org>`_ .
This tutorial sets up a complete Apache Spark (version **2.4.7**) infrastructure with HDFS (Hadoop Distributed File System) (version **2.10.1**) and RStudio server. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. For more information visit the `official Apache Spark page <https://spark.apache.org>`_ .

An Apache Spark cluster together with HDFS (Hadoop Distributed File System) represents one of the most important tools for Big Data and machine learning applications, enabling the parallel processing of large data sets on many virtual machines that run Spark workers. On the other hand, setting up a Spark cluster with HDFS on clouds is not straightforward and requires deep knowledge of both cloud and Apache Spark architecture. To spare scientists this work, we have created and made public the required infrastructure descriptors, with which Occopus can automatically deploy Spark clusters with the number of workers specified by the user.
One of the most typical application areas of Big Data technology is statistical data processing, which is usually done in the programming language R. In order to facilitate the work of statisticians using Spark in the cloud, we have created an extended version of the Spark infrastructure descriptors that places the sparklyr library on the Spark workers as well. Finally, we have also integrated the user-friendly RStudio interface into the Spark system. As a result, researchers using the statistical R package can easily and quickly deploy a complete R-oriented Spark cluster on clouds, containing the following components: RStudio, R, sparklyr, Spark and HDFS.
@@ -187,8 +182,7 @@ You can download the example as `tutorial.examples.spark-cluster-with-r <https:/
If you want Occopus to monitor (health_check) your Spark Master and it is to be deployed in a different network, make sure you assign a public (floating) IP to the Master node.


#. Generally speaking, a Spark cluster and its services are not deployed on the public internet. They are generally private services, and should only be accessible within the network of the organization that deploys Spark. Access to the hosts and ports used by Spark services should be limited to origin hosts that need to access the services.
This means that you need to create a firewall rule to allow **all traffic between Spark nodes** and the **required ports** [web UI and job submission port(s)] should be allowed **only from your IP address**.
#. Generally speaking, a Spark cluster and its services are not deployed on the public internet. They are generally private services and should only be accessible within the network of the organization that deploys Spark. Access to the hosts and ports used by Spark services should be limited to the origin hosts that need to access them. This means that you need to create a firewall rule that allows **all traffic between the Spark nodes**, while the **required ports** [web UI and job submission port(s)] should be opened **only to your IP address**.
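
As an illustration only, on an OpenStack (Nova) based cloud such rules could be added to the security group attached to the Spark nodes roughly as follows; the group name and port value are placeholders, and other clouds (EC2, Azure, etc.) offer equivalent mechanisms:

.. code:: bash

   # Allow all TCP traffic between members of the Spark nodes' security group
   openstack security group rule create --protocol tcp --dst-port 1:65535 \
       --remote-group <spark-secgroup> <spark-secgroup>

   # Open a required port (e.g. a web UI port) only to your own IP address
   openstack security group rule create --protocol tcp --dst-port <ui port> \
       --remote-ip <your public IP>/32 <spark-secgroup>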

**Main UI port list:**

@@ -355,7 +349,7 @@ This means that you need to create a firewall rule to allow **all traffic betwee
Apache Spark cluster with Jupyter notebook and PySpark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. For more information visit the `official Apache Spark page <https://spark.apache.org>`_ .
This tutorial sets up a complete Apache Spark (version **2.4.7**) infrastructure with HDFS (Hadoop Distributed File System) (version **2.10.1**) and PySpark. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. For more information visit the `official Apache Spark page <https://spark.apache.org>`_ .

An Apache Spark cluster together with HDFS (Hadoop Distributed File System) represents one of the most important tools for Big Data and machine learning applications, enabling the parallel processing of large data sets on many virtual machines that run Spark workers. On the other hand, setting up a Spark cluster with HDFS on clouds is not straightforward and requires deep knowledge of both cloud and Apache Spark architecture. To spare scientists this work, we have created and made public the required infrastructure descriptors, with which Occopus can automatically deploy Spark clusters with the number of workers specified by the user.
Spark also provides a special library called “Spark MLlib” for supporting machine learning applications. Similarly to the R-oriented Spark environment, we have developed the infrastructure descriptors for creating a machine learning environment in the cloud. Here, the programming language is Python and the user programming environment is Jupyter. The complete machine learning environment consists of the following components: Jupyter, Python, Spark and HDFS. Deploying this machine learning environment is also done automatically by Occopus, and the number of Spark workers can be defined by the user.
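
Once such a cluster is up, a Python application can be handed to it with ``spark-submit``. The sketch below is an assumption-laden example: it presumes the standalone master listens on its default port 7077, and the script name is hypothetical.

.. code:: bash

   # Submit a PySpark application to the Spark standalone master
   # (run this on a node where Spark is installed, e.g. the Spark Master).
   spark-submit \
       --master spark://[SparkMasterIP]:7077 \
       --deploy-mode client \
       my_mllib_job.py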
@@ -407,6 +401,7 @@ This tutorial sets up a complete Apache Spark infrastructure integrated with HDF


#. Generally speaking, a Spark cluster and its services are not deployed on the public internet. They are generally private services, and should only be accessible within the network of the organization that deploys Spark. Access to the hosts and ports used by Spark services should be limited to origin hosts that need to access the services.

This means that you need to create a firewall rule that allows **all traffic between the Spark nodes**, while the **required ports** [web UI and job submission port(s)] should be opened **only to your IP address**.

**Main UI port list:**
Binary file modified tutorials/spark-cluster-with-python.tar.gz
Binary file not shown.
@@ -16,7 +16,7 @@ variables:
HADOOP_VERSION: 2.10.1
SPARK_VERSION: 2.4.7
SPARK_HADOOP_VERSION: 2.7
CONSUL_VERSION: 1.8.5
CONSUL_VERSION: 1.9.1
CONSUL_TEMPLATE_VERSION: 0.25.1

dependencies:
Binary file modified tutorials/spark-cluster-with-r.tar.gz
Binary file not shown.
3 changes: 2 additions & 1 deletion tutorials/spark-cluster-with-r/infra-spark-cluster.yaml
@@ -16,8 +16,9 @@ variables:
HADOOP_VERSION: 2.10.1
SPARK_VERSION: 2.4.7
SPARK_HADOOP_VERSION: 2.7
CONSUL_VERSION: 1.8.5
CONSUL_VERSION: 1.9.1
CONSUL_TEMPLATE_VERSION: 0.25.1
RSTUDIO_VERSION: 1.3.1073-amd64

dependencies:
-
@@ -14,7 +14,7 @@ write_files:
SPARK_HADOOP_VERSION={{variables.SPARK_HADOOP_VERSION}}
CONSUL_VERSION={{variables.CONSUL_VERSION}}
CONSUL_TEMPLATE_VERSION={{variables.CONSUL_TEMPLATE_VERSION}}
RSTUDIO_VERSION=1.3.1073-amd64
RSTUDIO_VERSION={{variables.RSTUDIO_VERSION}}
echo "Creating SPARKUSER starts."
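
For reference, with the variable values defined in the infra descriptor above, this fragment of the cloud-init file is expected to render roughly as the following environment settings once Occopus substitutes the ``{{variables.*}}`` placeholders (a sketch of the expanded output, not taken from the repository):

.. code:: bash

   HADOOP_VERSION=2.10.1
   SPARK_VERSION=2.4.7
   SPARK_HADOOP_VERSION=2.7
   CONSUL_VERSION=1.9.1
   CONSUL_TEMPLATE_VERSION=0.25.1
   RSTUDIO_VERSION=1.3.1073-amd64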

0 comments on commit b3f33a7
