
Implementation notes


Scalability

Ingestion Scalability

  • Each data source is configured independently; its files are downloaded and their timestamps parsed by a dedicated bolt. For each data source (see the layout sketch below):
    • files are written to a different HDFS path, with one Avro file per month;
    • an Apache Kafka topic is devoted to the data source, so further processing for it is performed in a dedicated topology.
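
A minimal sketch of that layout; the helper name and path pattern are hypothetical, not the actual project code:

```java
// Hypothetical helper illustrating the per-datasource layout described above:
// one Kafka topic per data source, and one Avro file per month under a
// per-datasource HDFS path. Names and the path pattern are assumptions.
public final class DatasourceLayout {

    private DatasourceLayout() {}

    /** The Kafka topic devoted to a data source, e.g. "bicing_station_data". */
    public static String kafkaTopic(String datasourceId) {
        return datasourceId;
    }

    /** HDFS path of the monthly Avro file, e.g. "/bicingbcn/bicing/2014-07.avro". */
    public static String monthlyAvroPath(String datasourceId, int year, int month) {
        return String.format("/bicingbcn/%s/%d-%02d.avro", datasourceId, year, month);
    }
}
```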

Problems with Storm

Failed raw data ingestion

The methods close() and cleanup() for Storm bolts and spouts are useless in cluster mode, as "there is no guarantee that close [cleanup] will be called, because the supervisor kill -9's worker processes on the cluster". As a result, the Avro files stored in HDFS end up corrupted, because they are never properly closed. An alternative would be writing to an HBase table instead of HDFS.
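
A minimal sketch of the problem, assuming a bolt that appends tuples to an Avro DataFileWriter (the bolt and field names are illustrative, not the actual ingestion code). Flushing an Avro sync marker after every append would at least keep the file readable up to the last flush, even if cleanup() never runs:

```java
import java.io.IOException;
import java.util.Map;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericRecord;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// Illustrative bolt, not the project's actual code.
public class AvroWriterBolt extends BaseRichBolt {

    private transient DataFileWriter<GenericRecord> writer;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // open this.writer on the HDFS file for the data source (omitted)
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            writer.append((GenericRecord) tuple.getValueByField("record"));
            // Emit an Avro sync marker so the file stays readable up to this
            // point even after a kill -9. Note that on HDFS the underlying
            // stream must also be hflush()-ed for the bytes to be visible.
            writer.flush();
            collector.ack(tuple);
        } catch (IOException e) {
            collector.fail(tuple);
        }
    }

    @Override
    public void cleanup() {
        // NOT guaranteed to run in cluster mode: the supervisor kill -9's the
        // worker processes, so without the per-tuple flush above the file is
        // left without a proper close and becomes corrupted.
        try {
            writer.close();
        } catch (IOException e) {
            // best effort
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing to declare
    }
}
```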

Library conflicts with Storm's jars

With Storm 0.9.1-incubating, topologies work in local mode, but in cluster mode, or just when run through storm jar, Storm uses the libraries coming from the lib folder of the Storm installation instead of the versions used by the components of the topology. This breaks the program with errors like missing fields in classes. As a workaround, in particular for the storm-ingestion topology, the jars for guava, httpclient and httpcore in the lib folder of the Storm installation can be replaced with the versions the topology components were built against.

Apache Metamodel

I haven't been able to find any web page hosting the javadoc; the closest thing is the javadoc linked from the old Apache Metamodel site, but that is for MetaModel version 3.4.11. My solution to this:

  • download the source for MetaModel-4.1.0-RC1-incubating-source-release.zip;
  • unzip and execute mvn site: as can be seen in the pom, javadoc generation is triggered by this goal. While mvn javadoc:javadoc generates the documentation for each project in a separate folder, with mvn site all the javadocs are generated together in target/site/apidocs;
  • upload that folder to my github.io site as a mirror.

Testing Spark Streaming

To test a Spark Streaming program reading from a Kafka topic generated by the Storm ingestion (a minimal consumer sketch follows this list):

  • Ensure the Redis service is started
$ sudo service redis start
  • Ensure Kafka is started
[cloudera@localhost kafka_2.10-0.8.1.1]$ pwd
/usr/lib/bicingbcn/lib/kafka_2.10-0.8.1.1
[cloudera@localhost kafka_2.10-0.8.1.1]$ bin/kafka-server-start.sh config/server.properties 
  • Start the test bicing service; this server is configured as a data source whose data is sent to the Kafka topic test_bicing_station_data
[cloudera@localhost resources]$ pwd
/home/cloudera/git/bicing-bcn/storm/ingestion/src/test/resources
[cloudera@localhost resources]$ ./bicing_test_service.py 9999
  • Opening a Kafka console listener in a terminal is useful for debugging, as it proves that the Kafka messages are being sent correctly
[cloudera@localhost kafka_2.10-0.8.1.1]$ pwd
/usr/lib/bicingbcn/lib/kafka_2.10-0.8.1.1
[cloudera@localhost kafka_2.10-0.8.1.1]$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test_bicing_station_data
  • Now launch the Storm ingestion as a local process for testing. Within seconds the terminal where bicing_test_service.py is running should start showing the messages being served, and so should the terminal for kafka-console-consumer.sh
[cloudera@localhost ingestion]$ pwd
/home/cloudera/git/bicing-bcn/storm/ingestion
[cloudera@localhost ingestion]$ storm jar target/storm-ingestion-0.0.1-SNAPSHOT.jar org.collprod.bicingbcn.ingestion.IngestionTopology
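
With all of that running, a minimal Spark Streaming consumer for the topic can be sketched as follows, assuming the Spark 1.x receiver-based Kafka API and the spark-streaming-kafka artifact (the class and consumer group names are made up for the example):

```java
import java.util.Collections;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Illustrative consumer: prints the messages that the Storm ingestion
// publishes on the test topic.
public class TestBicingStream {

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TestBicingStream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));

        // Same ZooKeeper the Kafka broker above registers in; one receiver
        // thread for the test topic
        JavaPairDStream<String, String> stream = KafkaUtils.createStream(
                jssc, "localhost:2181", "bicing-test-group",
                Collections.singletonMap("test_bicing_station_data", 1));

        // Drop the Kafka key and print a sample of each 5 second batch
        stream.map(new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> kv) {
                return kv._2();
            }
        }).print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```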

Running Saiku

Saiku Server 2.5 with the Foodmart example database is installed at /usr/lib/bicingbcn/lib/saiku-server as part of the bicing installation, which also includes:

  • Installing the jars for Phoenix into Saiku:
[cloudera@localhost saiku-server]$ pwd
/usr/lib/bicingbcn/lib/saiku-server
[cloudera@localhost saiku-server]$ cp ../phoenix-3.0.0-incubating/common/phoenix-3.0.0-incubating-client-without-hbase.jar tomcat/lib/
[cloudera@localhost saiku-server]$ cp /usr/lib/hbase/hbase.jar tomcat/lib/
[cloudera@localhost saiku-server]$ cp /usr/lib/hadoop/hadoop-common.jar tomcat/lib/
[cloudera@localhost saiku-server]$ cp /usr/lib/hadoop/hadoop-auth.jar tomcat/lib/
  • The Saiku datasource declaration for Phoenix is available at /usr/lib/bicingbcn/lib/saiku-server/tomcat/webapps/saiku/WEB-INF/classes/saiku-datasources/phoenix (a sketch of its general shape follows this list)
  • The Mondrian definition for the BicingBCN cubes is available at /usr/lib/bicingbcn/lib/saiku-server/tomcat/webapps/saiku/WEB-INF/classes/phoenix/Phoenix.xml
  • WARNING: Saiku runs at port 8080 just like YARN in CDH4, so YARN should be off when Saiku is running. In a sensible deployment Saiku would run on a "frontend" server, separate from the cluster
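
For reference, Saiku 2.x datasource declarations are plain property files; the Phoenix one presumably follows this general shape, where the exact JDBC URL and catalog path are assumptions rather than a copy of the project's file:

```
type=OLAP
name=phoenix
driver=mondrian.olap4j.MondrianOlap4jDriver
location=jdbc:mondrian:Jdbc=jdbc:phoenix:localhost;Catalog=res:phoenix/Phoenix.xml;JdbcDrivers=org.apache.phoenix.jdbc.PhoenixDriver;
username=
password=
```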

In order to start Saiku while monitoring its execution, open 3 horizontal Terminator terminals at /usr/lib/bicingbcn/lib/saiku-server and execute in them:

  • tailf tomcat/logs/catalina.out
  • tailf tomcat/logs/saiku.log
  • Leave the remaining terminal for starting and stopping Saiku with ./start-saiku.sh and ./stop-saiku.sh

That leaves Saiku running at localhost:8080; the default login is admin / admin.

If something goes wrong, note that the Tomcat server for Saiku appears as Bootstrap in jps. Also note that Saiku uses Phoenix, and therefore HBase, as its datasource, so HBase must be running OK in Cloudera Manager for Saiku to start properly.
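
A quick way to check the Phoenix side before starting Saiku is a JDBC smoke test like the following hypothetical sketch (the query just touches the Phoenix system catalog):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

// Hypothetical smoke test: if this connects and returns a row, Phoenix (and
// hence HBase) is up, so Saiku should be able to open its Phoenix datasource.
public class PhoenixSmokeTest {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             ResultSet rs = conn.createStatement()
                     .executeQuery("SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 1")) {
            System.out.println("Phoenix reachable: " + rs.next());
        }
    }
}
```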

Note: for reasons unknown, the only deployment of Saiku that works with Phoenix is the one at /home/cloudera/Sistemas/Saiku/saiku-server1; it doesn't work even when the same folder is copied somewhere else, so a symlink is used instead.