[SW-340] Various updates to README file.

- Split README to more files
- Convert to rst
- Fix some incorrect versions
- Remove some non-relevant parts
h2oai · Jun 14, 2017 · a1dd57c
1 parent d388c52
commit a1dd57c
Showing 9 changed files with 824 additions and 639 deletions.
diff --git a/README.md b/README.md
diff --git a/README.rst b/README.rst
@@ -0,0 +1,212 @@
+Sparkling Water
+===============
+
+|Join the chat at https://gitter.im/h2oai/sparkling-water| |image1|
+|image2| |image3| |Powered by H2O.ai|
+
+Sparkling Water integrates |H2O|'s fast scalable machine learning engine with Spark. It provides:
+
+- Utilities to publish Spark data structures (RDDs, DataFrames, Datasets) as H2O's frames and vice versa.
+- DSL to use Spark data structures as input for H2O's algorithms.
+- Basic building blocks to create ML applications utilizing Spark and H2O APIs.
+- Python interface enabling use of Sparkling Water directly from PySpark.
+
+Getting Started
+---------------
+
+Select the right version
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sparkling Water is developed in multiple parallel branches. Each
+branch corresponds to a Spark major release (e.g., branch **rel-2.1**
+provides the implementation of Sparkling Water for Spark **2.1**).
+
+Please switch to the right branch:
+
+- For Spark 2.1 use branch `rel-2.1 <https://github.com/h2oai/sparkling-water/tree/rel-2.1>`__
+- For Spark 2.0 use branch `rel-2.0 <https://github.com/h2oai/sparkling-water/tree/rel-2.0>`__
+- For Spark 1.6 use branch `rel-1.6 <https://github.com/h2oai/sparkling-water/tree/rel-1.6>`__ (Only critical fixes)
+
+   **Note:** The `master <https://github.com/h2oai/sparkling-water/tree/master>`__
+   branch includes the latest changes for the latest Spark version.
+   They are back-ported into older Sparkling Water versions.
+
+Requirements
+~~~~~~~~~~~~
+
+-  Linux/OS X/Windows
+-  Java 7+
+-  Python 2.6+ for the Python version of Sparkling Water (PySparkling)
+-  `Spark 1.6+ <https://spark.apache.org/downloads.html>`__
+
+   -  The ``SPARK_HOME`` shell variable must point to your local Spark installation
+
+Download Binaries
+~~~~~~~~~~~~~~~~~
+
+For each Sparkling Water version, you can download binaries here:
+
+- `Sparkling Water - Latest version <http://h2o-release.s3.amazonaws.com/sparkling-water/master/latest.html>`__
+- `Sparkling Water - Latest 2.1 version <http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.1/latest.html>`__
+- `Sparkling Water - Latest 2.0 version <http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.0/latest.html>`__
+- `Sparkling Water - Latest 1.6 version <http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.6/latest.html>`__
+
+Maven
+~~~~~
+
+Each Sparkling Water release is published to Maven Central. Artifacts are published for the following Scala
+versions:
+
+- Sparkling Water 2.1.x - Scala 2.11
+- Sparkling Water 2.0.x - Scala 2.11
+- Sparkling Water 1.6.x - Scala 2.10
+
+The artifact coordinates are:
+
+- ``ai.h2o:sparkling-water-core_{{scala_version}}:{{version}}`` - includes the core of Sparkling Water.
+- ``ai.h2o:sparkling-water-examples_{{scala_version}}:{{version}}`` - includes example applications.
+
+   **Note:** ``{{version}}`` refers to a release version of Sparkling Water and ``{{scala_version}}``
+   refers to the Scala base version (``2.10`` or ``2.11``). For example:
+   ``ai.h2o:sparkling-water-examples_2.11:2.1.0``
+
+The full list of published packages is available
+`here <http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22ai.h2o%22%20AND%20a%3Asparkling-water*>`__.
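+
+As a minimal sketch of how these coordinates might be used, the core artifact can be passed to a regular
+Spark shell via the ``--packages`` option. The version ``2.1.0`` is taken from the example in the note above
+and is only an assumption; use the release that matches your Spark installation.
+
+.. code:: bash
+
+   # Assumes SPARK_HOME points to a Spark 2.1 installation built with Scala 2.11.
+   # The version 2.1.0 is illustrative; substitute the release you actually need.
+   "$SPARK_HOME/bin/spark-shell" --packages ai.h2o:sparkling-water-core_2.11:2.1.0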
+
+---------------
+
+Use Sparkling Water
+-------------------
+
+Sparkling Water is distributed as a Spark application library which can be used by any Spark application.
+Furthermore, we also provide a zip distribution which bundles the library and shell scripts.
+
+There are several ways of using Sparkling Water:
+
+- Sparkling Shell
+- Sparkling Water driver
+- Spark Shell with the Sparkling Water library included via the ``--jars`` or ``--packages`` option
+- Spark Submit with the Sparkling Water library included via the ``--jars`` or ``--packages`` option
+- PySpark with PySparkling
+
+
+Run Sparkling shell
+~~~~~~~~~~~~~~~~~~~
+
+The Sparkling Shell encapsulates a regular Spark shell and appends the Sparkling Water library to the classpath via the ``--jars`` option.
+The Sparkling Shell supports creation of an |H2O| cloud and execution of |H2O| algorithms.
+
+1. Either download or build Sparkling Water
+2. Configure the location of the Spark cluster:
+
+   .. code:: bash
+
+      export SPARK_HOME="/path/to/spark/installation"
+      export MASTER="local[*]"
+
+
+   In this case, ``local[*]`` points to an embedded single node cluster.
+
+3. Run Sparkling Shell:
+
+   .. code:: bash
+
+      bin/sparkling-shell
+
+   Sparkling Shell accepts common Spark Shell arguments. For example, to increase the memory allocated to each executor, use the ``spark.executor.memory`` parameter: ``bin/sparkling-shell --conf "spark.executor.memory=4g"``
+
+4. Initialize H2OContext
+
+   .. code:: scala
+
+      import org.apache.spark.h2o._
+      val hc = H2OContext.getOrCreate(spark)
+
+   ``H2OContext`` starts H2O services on top of the Spark cluster and provides primitives for transformations between |H2O| and Spark data structures.
+
+
+Use Sparkling Water with PySpark
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Sparkling Water can also be used directly from PySpark; this integration is called PySparkling.
+
+See `PySparkling README <py/README.rst>`__ to learn about PySparkling.
+
+Use Sparkling Water via Spark Packages
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To see how Sparkling Water can be used as a Spark package, please see `Use as Spark Package <doc/spark_package.rst>`__.
+
+Use Sparkling Water in Windows environments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+See `Windows Tutorial <doc/windows_manual.rst>`__ to learn how to use Sparkling Water in Windows environments.
+
+Sparkling Water examples
+~~~~~~~~~~~~~~~~~~~~~~~~
+To see how to run examples for Sparkling Water, please see `Running Examples <doc/running_examples.rst>`__.
+
+--------------
+
+Sparkling Water Backends
+------------------------
+
+Sparkling Water supports two backend/deployment modes - internal and
+external. Sparkling Water applications are independent of the selected
+backend. The backend can be specified before the creation of the
+``H2OContext``.
+
+For more details regarding the internal or external backend, please see
+`Backends <doc/backends.rst>`__.
+
+--------------
+
+FAQ
+---
+
+A list of all frequently asked questions is available at `FAQ <doc/FAQ.rst>`__.
+
+--------------
+
+Development
+-----------
+
+Build Sparkling Water
+~~~~~~~~~~~~~~~~~~~~~
+
+To see how to build Sparkling Water, please see `Build Sparkling Water <doc/build.rst>`__.
+
+Develop applications with Sparkling Water
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An application using Sparkling Water is a regular Spark application which
+bundles the Sparkling Water library. See the Sparkling Water Droplet, which provides
+an example application, `here <https://github.com/h2oai/h2o-droplets/tree/master/sparkling-water-droplet>`__.
+
+Contributing
+~~~~~~~~~~~~
+
+Look at our `list of JIRA
+tasks <https://0xdata.atlassian.net/issues/?filter=13600>`__ for new
+contributors or send your idea to support@h2o.ai.
+
+Issues
+~~~~~~
+
+To report issues, please use our JIRA page at
+`http://jira.h2o.ai/ <https://0xdata.atlassian.net/projects/SW/issues>`__.
+
+We also respond to questions tagged with ``sparkling-water`` and ``h2o`` on
+`Stack
+Overflow <https://stackoverflow.com/questions/tagged/sparkling-water>`__.
+
+--------------
+
+.. |Join the chat at https://gitter.im/h2oai/sparkling-water| image:: https://badges.gitter.im/Join%20Chat.svg
+   :target: https://gitter.im/h2oai/sparkling-water?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge
+.. |image1| image:: https://travis-ci.org/h2oai/sparkling-water.svg?branch=master
+   :target: https://travis-ci.org/h2oai/sparkling-water
+.. |image2| image:: https://maven-badges.herokuapp.com/maven-central/ai.h2o/sparkling-water-core_2.11/badge.svg
+   :target: http://search.maven.org/#search%7Cgav%7C1%7Cg:%22ai.h2o%22%20AND%20a:%22sparkling-water-core_2.11%22
+.. |image3| image:: https://img.shields.io/badge/License-Apache%202-blue.svg
+   :target: LICENSE
+.. |Powered by H2O.ai| image:: https://img.shields.io/badge/powered%20by-h2oai-yellow.svg
+   :target: https://github.com/h2oai/
+.. |H2O| replace:: H\ :sub:`2`\ O
diff --git a/doc/FAQ.rst b/doc/FAQ.rst
@@ -0,0 +1,110 @@
+Frequently Asked Questions
+--------------------------
+
+-  Where do I find the Spark logs?
+
+    **Standalone mode**: Spark executor logs are located in the
+    directory ``$SPARK_HOME/work/app-<AppName>`` (where ``<AppName>`` is
+    the name of your application). This location also contains
+    stdout/stderr from H2O.
+
+    **YARN mode**: The executor logs are available via the
+    ``yarn logs -applicationId <appId>`` command. By default, driver logs
+    are printed to the console; however, H2O also writes logs into
+    ``current_dir/h2ologs``.
+
+    The location of H2O driver logs can be controlled via the Spark property
+    ``spark.ext.h2o.client.log.dir`` (passed via the ``--conf`` option).
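+
+    As a minimal sketch, the property could be passed when launching
+    Sparkling Shell. The directory ``/tmp/h2ologs`` is a hypothetical
+    placeholder, not a default value:
+
+    .. code:: bash
+
+        # /tmp/h2ologs is an illustrative path; choose any writable directory
+        bin/sparkling-shell --conf "spark.ext.h2o.client.log.dir=/tmp/h2ologs"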
+
+-  How to display Sparkling Water information in the Spark History
+   Server?
+
+    Sparkling Water already reports the information; you just
+    need to add the Sparkling Water classes to the classpath of the Spark
+    History Server. To see how to configure a Spark application for
+    logging into the History Server, please see `Spark Monitoring
+    Configuration <http://spark.apache.org/docs/latest/monitoring.html>`__.
+
+-  Spark is very slow during initialization or H2O does not form a
+   cluster. What should I do?
+
+    Configure the Spark variable ``SPARK_LOCAL_IP``. For example:
+
+    .. code:: bash
+
+        export SPARK_LOCAL_IP='127.0.0.1'
+
+
+-  How do I increase the amount of memory assigned to the Spark
+   executors in Sparkling Shell?
+
+    Sparkling Shell accepts common Spark Shell arguments. For example,
+    to increase the amount of memory allocated to each executor, use the
+    ``spark.executor.memory`` parameter:
+    ``bin/sparkling-shell --conf "spark.executor.memory=4g"``
+
+-  How do I change the base port H2O uses to find available ports?
+
+    H2O accepts the ``spark.ext.h2o.port.base`` parameter via Spark
+    configuration properties:
+    ``bin/sparkling-shell --conf "spark.ext.h2o.port.base=13431"``. For
+    a complete list of configuration options, refer to `Devel
+    Documentation <https://github.com/h2oai/sparkling-water/blob/master/DEVEL.md#sparkling-water-configuration-properties>`__.
+
+-  How do I use Sparkling Shell to launch a Scala ``test.script`` that I
+   created?
+
+    Sparkling Shell accepts common Spark Shell arguments. To pass your
+    script, please use the ``-i`` option of Spark Shell:
+    ``bin/sparkling-shell -i test.script``
+
+-  How do I increase PermGen size for Spark driver?
+
+    Specify
+    ``--conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m"``
+
+-  How do I add Apache Spark classes to Python path?
+
+    Configure the Python path variable ``PYTHONPATH``:
+
+    .. code:: bash
+
+        export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
+        export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
+
+-  Trying to import a class from the ``hex`` package in Sparkling Shell
+   but getting a weird error:
+   ``missing arguments for method hex in object functions; follow this method with '_' if you want to treat it as a partially applied``
+
+    In this case you are probably using Spark 1.5+, which imports SQL
+    functions into the Spark Shell environment. Please use the following
+    syntax to import a class from the ``hex`` package:
+
+    .. code:: scala
+
+        import _root_.hex.tree.gbm.GBM
+
+
+-  Trying to run Sparkling Water on an HDP YARN cluster, but getting the error: ``java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig``
+
+    The YARN timeline service is not compatible with libraries provided by Spark. Please disable the timeline service by setting
+    ``spark.hadoop.yarn.timeline-service.enabled=false``. For more details, please visit
+    https://issues.apache.org/jira/browse/SPARK-15343
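+
+    For example, assuming the application is launched through Sparkling
+    Shell, the property could be passed with the ``--conf`` option:
+
+    .. code:: bash
+
+        # Disables the YARN timeline service for this Spark application only
+        bin/sparkling-shell --conf "spark.hadoop.yarn.timeline-service.enabled=false"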
+
+-  Getting non-deterministic H2O Frames after the Spark Data Frame to
+   H2O Frame conversion.
+
+    This is caused by what we think is a bug in Apache Spark. On
+    specific kinds of data combined with a higher number of partitions, we
+    can see non-determinism in BroadcastHashJoins. This leads to
+    jumbled rows and columns in the output H2O frame. We recommend
+    disabling broadcast-based joins, which seem to be non-deterministic, as follows:
+
+    .. code:: scala
+
+        sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
+
+    The issue can be tracked as
+    `PUBDEV-3808 <https://0xdata.atlassian.net/browse/PUBDEV-3808>`__.
+    On the Spark side, the following issues are related to the problem:
+    `Spark-17806 <https://issues.apache.org/jira/browse/SPARK-17806>`__