A very simple Spring web app that uses Apache Spark Hive.

Author: Julien Diener
First, the spark assembly jar (provided in the `lib` folder of the Spark installation) should be added to the local Maven repo. The command is given in the Note on spark dependencies section below.
Packaging and install:

Run using the jetty plugin:
To run the code on a local hard disk (no HDFS), the folder `/user/hive/warehouse` should be created:

```shell
sudo mkdir -p /user/hive/warehouse
sudo chown $USER /user/hive/warehouse
```
Then start jetty:

```shell
cd webapp
mvn jetty:run
```
Then go to http://localhost:8080/gen; it will create a default table (and a folder in `/user/hive/warehouse`).
This Spring package implements two web services:
- `gen`: generates a table (overwriting it if necessary) and returns the created data as JSON. The (optional) parameters are:
  - `name`: the name of the table (default `"mytable"`),
  - `n`: the number of rows to add to the table,
  - `namenode`: the HDFS namenode to use (default `"file:///"`), and
  - `master`: the Spark master to use (default `"local"`).
- `request`: returns a table's content as JSON. The (optional) parameter is:
  - `name`: the name of the table to request content from.

  Note that the master is always local (I just did not implement the option). Likewise, I did not implement a `namenode` parameter, so the last one used with `gen` is kept. The table is looked up on the Hadoop or local file system it was written to, so `request` only works on tables of the file system last used with `gen`.
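For illustration, here is how the two service URLs can be assembled from the parameters above (a sketch; the host and port assume the default jetty setup, and the parameter values are just examples):

```shell
# Hypothetical example: build the gen and request URLs from the parameters
# described above (default jetty host/port assumed).
BASE="http://localhost:8080"

# gen: create a table named "mytable" with 10 rows on the local file system.
GEN_URL="$BASE/gen?name=mytable&n=10&namenode=file:///&master=local"

# request: read the content of that table back as JSON.
REQ_URL="$BASE/request?name=mytable"

echo "$GEN_URL"
echo "$REQ_URL"
# With the app running, fetch them with e.g.: curl "$GEN_URL"
```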
To use a (real) Spark cluster, the Spark master URL should be given (using the `master` parameter), exactly as it is written on the Spark master web UI page.
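For example (the host and port here are hypothetical; use the `spark://...` URL shown on your own master's web UI):

```shell
# Hypothetical master URL, as displayed on the Spark master web UI.
MASTER="spark://master-host:7077"
GEN_URL="http://localhost:8080/gen?master=$MASTER"
echo "$GEN_URL"
```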
This web app works with spark-hive, without a Hive server: it uses the Hadoop Hive dependencies, which (to my understanding) contain a, probably simple, embedded Hive server. Hive creates a `metastore_db` folder and a `derby.log` file in the running folder. Tables are then stored in the `/user/hive/warehouse` folder of either the local file system or the HDFS given by the `namenode` parameter. These folders should thus exist and have suitable write permissions.
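A sketch of the required setup (the `PREFIX` sandbox is only for illustration; on a real machine the folder is created at the root, typically with `sudo`, and on HDFS with the hadoop CLI as shown in the comments):

```shell
# Sketch: the warehouse folder must exist and be writable by the app's user.
# A sandbox prefix is used here for illustration; in a real local setup PREFIX
# would be empty and the mkdir would typically need sudo.
PREFIX=$(mktemp -d)
mkdir -p "$PREFIX/user/hive/warehouse"

# On HDFS, the equivalent (assuming a configured hadoop CLI) would be:
# hdfs dfs -mkdir -p /user/hive/warehouse
# hdfs dfs -chown $USER /user/hive/warehouse
```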
I found that requests on HDFS sometimes fail, though not often; re-running the request or refreshing the page in the browser works. It looks like some kind of instability in the communication with HDFS...
### Note on spark dependencies
The Maven Spark dependencies are not usable at run time (at least, I did not manage to make them work). The solution is to use the spark assembly jar provided by the Spark installation, as it is the actual jar dependency required to run a standalone app.
To do so, there are two solutions:
- add the jar to the generated war; however, it is quite a big jar, or
- add it to the servlet container.

Here I use the second option: I use the maven jetty plugin (see the `pom.xml` file) and added the jar to the plugin dependencies. It is thus necessary to first add the spark assembly jar to the local Maven repo, using the following command:
```shell
mvn install:install-file \
    -Dfile=$SPARK_HOME/lib/spark-assembly-1.1.1-hadoop2.4.0.jar \
    -DgroupId=org.apache.spark \
    -DartifactId=spark-assembly-jar \
    -Dversion=1.1.1 \
    -Dpackaging=jar
```
See this SO question for more details.
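Once installed, the assembly can be referenced from the jetty plugin in the `pom.xml`. A sketch of what that declaration might look like (the plugin coordinates are an assumption; the dependency coordinates match the install command above — check against the actual `pom.xml`):

```xml
<!-- Sketch: declaring the installed spark assembly as a jetty plugin dependency. -->
<plugin>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-maven-plugin</artifactId>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly-jar</artifactId>
      <version>1.1.1</version>
    </dependency>
  </dependencies>
</plugin>
```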
To run Hive, the datanucleus core (3.3.2), api-jdo (3.2.1) and rdbms (3.2.1) artifacts are added as Maven dependencies. The versions were chosen to match those included in the spark assembly jar.
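For reference, the corresponding declarations might look like this in the `pom.xml` (a sketch using the standard `org.datanucleus` artifact names and the versions listed above — check against the actual `pom.xml`):

```xml
<!-- Sketch: datanucleus dependencies needed to run hive. -->
<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-core</artifactId>
  <version>3.3.2</version>
</dependency>
<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-api-jdo</artifactId>
  <version>3.2.1</version>
</dependency>
<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-rdbms</artifactId>
  <version>3.2.1</version>
</dependency>
```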