Implementation notes
- Each data source is configured independently; its files are downloaded and their timestamps parsed by a dedicated bolt. Files for each data source are written to a different HDFS path, with one Avro file per month.
- An Apache Kafka topic is devoted to each data source, so further processing for that data source is performed in a dedicated topology
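The per-month Avro layout described above could be sketched as follows. This is a minimal illustration: the base directory, the function name and the exact naming scheme are assumptions, not the actual paths used by the ingestion topology.

```python
from datetime import datetime, timezone

def monthly_avro_path(datasource: str, ts_millis: int,
                      base: str = "/user/cloudera/bicingbcn") -> str:
    """Return a hypothetical HDFS path for the monthly Avro file of a data source.

    The base directory and naming scheme are illustrative only.
    """
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    return f"{base}/{datasource}/{dt.year:04d}-{dt.month:02d}.avro"
```

With this scheme, every record timestamp falling inside the same UTC month maps to the same Avro file for its data source.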
The methods close() and cleanup() for Storm bolts and spouts are useless in cluster mode, as "there is no guarantee that close [cleanup] will be called, because the supervisor kill -9's worker processes on the cluster". As a result, the Avro file stored in HDFS can be left corrupted, since the writer is never closed properly. An alternative would be writing to an HBase table instead of HDFS.
With Storm 0.9.1-incubating, topologies work in local mode, but in cluster mode, or simply when run with storm jar, Storm uses the libraries coming from its lib folder instead of the versions used by the components of the topology. This breaks the program with errors such as missing fields in classes. As a workaround, in particular for the storm-ingestion topology, the jars for guava, httpclient and httpcore in the lib folder of the Storm installation can be replaced by the following:
- Guava 17 from http://code.google.com/p/guava-libraries/, instead of Guava 13
- Apache HttpClient and Fluent version 4.3.2 instead of 4.1.1, which involves three jars instead of two
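As a sanity check before swapping jars, a small script can list which of the conflicting libraries are currently present in Storm's lib folder. The helper name and the prefix list are assumptions for illustration; the actual jar file names depend on the Storm distribution.

```python
from pathlib import Path

def conflicting_jars(lib_dir: str,
                     prefixes=("guava", "httpclient", "httpcore", "fluent")):
    """Return the names of jars in lib_dir matching the conflicting libraries.

    Hypothetical diagnostic helper; prefixes cover the libraries named above.
    """
    return sorted(p.name for p in Path(lib_dir).glob("*.jar")
                  if p.name.startswith(prefixes))
```

Running it against the Storm installation's lib folder shows exactly which versions are on the worker classpath and would shadow the topology's own dependencies.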
I haven't been able to find any web page hosting the javadoc; the closest thing is the javadoc linked from the old Apache MetaModel site, but that is for MetaModel version 3.4.11. My solution to this is: 1) download the source for MetaModel-4.1.0-RC1-incubating-source-release.zip; 2) unzip it and execute mvn site (in the pom it can be seen that javadoc generation is triggered by this goal; mvn javadoc:javadoc generates the documentation for each project in a separate folder, but with mvn site all the javadocs are generated together in target/site/apidocs); 3) upload that folder to my github.io site as a mirror.
To test a Spark Streaming program reading from a Kafka topic produced by the Storm ingestion:
- Ensure the Redis service is started
$ sudo service redis start
- Ensure Kafka is started
[cloudera@localhost kafka_2.10-0.8.1.1]$ pwd
/usr/lib/bicingbcn/lib/kafka_2.10-0.8.1.1
[cloudera@localhost kafka_2.10-0.8.1.1]$ bin/kafka-server-start.sh config/server.properties
- Start the test bicing service; this server is configured as a datasource whose data is sent to the Kafka topic test_bicing_station_data
[cloudera@localhost resources]$ pwd
/home/cloudera/git/bicing-bcn/storm/ingestion/src/test/resources
[cloudera@localhost resources]$ ./bicing_test_service.py 9999
- Opening a Kafka console consumer in a terminal is useful for debugging, as it shows that Kafka messages are being sent correctly
[cloudera@localhost kafka_2.10-0.8.1.1]$ pwd
/usr/lib/bicingbcn/lib/kafka_2.10-0.8.1.1
[cloudera@localhost kafka_2.10-0.8.1.1]$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test_bicing_station_data
- Now launch the Storm ingestion as a local process for testing. Within seconds the terminal where bicing_test_service.py is running should start showing activity, and the terminal for kafka-console-consumer.sh should start printing the received messages
[cloudera@localhost ingestion]$ pwd
/home/cloudera/git/bicing-bcn/storm/ingestion
[cloudera@localhost ingestion]$ storm jar target/storm-ingestion-0.0.1-SNAPSHOT.jar org.collprod.bicingbcn.ingestion.IngestionTopology
Saiku Server 2.5 with the Foodmart example database is installed at /usr/lib/bicingbcn/lib/saiku-server as part of the bicing installation, which also includes:
- Installing the jars for Phoenix into Saiku:
[cloudera@localhost saiku-server]$ pwd
/usr/lib/bicingbcn/lib/saiku-server
[cloudera@localhost saiku-server]$ cp ../phoenix-3.0.0-incubating/common/phoenix-3.0.0-incubating-client-without-hbase.jar tomcat/lib/
[cloudera@localhost saiku-server]$ cp /usr/lib/hbase/hbase.jar tomcat/lib/
[cloudera@localhost saiku-server]$ cp /usr/lib/hadoop/hadoop-common.jar tomcat/lib/
[cloudera@localhost saiku-server]$ cp /usr/lib/hadoop/hadoop-auth.jar tomcat/lib/
- The Saiku datasource declaration for Phoenix is available at
/usr/lib/bicingbcn/lib/saiku-server/tomcat/webapps/saiku/WEB-INF/classes/saiku-datasources/phoenix
- The Mondrian definition for the BicingBCN cubes is available at
/usr/lib/bicingbcn/lib/saiku-server/tomcat/webapps/saiku/WEB-INF/classes/phoenix/Phoenix.xml
- WARNING: Saiku runs at port 8080, just like YARN in CDH4, so YARN should be off while Saiku is running. In a sensible deployment Saiku would run on a "frontend" server, separate from the cluster
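Before starting Saiku it can help to verify that nothing (for example YARN) is still listening on port 8080. A minimal sketch, with a hypothetical helper name:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. when a server is listening on the port
        return s.connect_ex((host, port)) == 0
```

If port_in_use(8080) returns True while YARN is supposed to be stopped, something else is still holding the port and Saiku's Tomcat will fail to bind.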
In order to start Saiku while monitoring its execution, open 3 horizontal Terminator terminals at /usr/lib/bicingbcn/lib/saiku-server and execute in two of them:
tailf tomcat/logs/catalina.out
tailf tomcat/logs/saiku.log
- Leave the remaining terminal for starting and stopping Saiku with ./start-saiku.sh and ./stop-saiku.sh. That leaves Saiku running at localhost:8080; the default login is admin / admin
If something goes wrong, note that the Tomcat server for Saiku appears as Bootstrap in the output of jps. Also note that Saiku uses Phoenix, and therefore HBase, as its datasource, so HBase must be running correctly in Cloudera Manager for Saiku to start properly.
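Spotting the Saiku Tomcat process in the jps output can be scripted. This sketch assumes the usual "pid main-class" line format of jps; the helper name is hypothetical.

```python
from typing import Optional

def find_jps_pid(jps_output: str, name: str = "Bootstrap") -> Optional[int]:
    """Scan jps output ("<pid> <main class>" per line) for a process name.

    Returns the pid of the first match, or None if not found.
    """
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == name:
            return int(parts[0])
    return None
```

For example, feeding it the captured output of jps returns the pid of the Bootstrap process, which can then be inspected or killed if Saiku hangs.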
Note: for reasons unknown, the only deployment of Saiku that works with Phoenix is the one at /home/cloudera/Sistemas/Saiku/saiku-server1; it does not work even when the same folder is copied somewhere else, so a symlink to that folder is used instead.