This is WIP.

## Getting Started
Current infrastructure:
- TweetCollector serializes tweets to Avro (without code generation; see the sketch after this list) and sends them to Kafka
- TweetAnalyzer picks up the serialized tweets and monitors them for unexpected volume in Spark
- Volume thresholds and detected alerts are managed in HDFS
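Serializing "without code generation" means building Avro `GenericRecord`s against a parsed schema instead of compiling `.avsc` files into classes. A minimal Scala sketch, assuming an illustrative schema (the project's actual fields will differ):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

object TweetSchema {
  // Hypothetical schema; the real field set lives in the project's schema file.
  val schema: Schema = new Schema.Parser().parse(
    """{
      |  "type": "record", "name": "Tweet", "namespace": "example",
      |  "fields": [
      |    {"name": "id",   "type": "long"},
      |    {"name": "user", "type": "string"},
      |    {"name": "text", "type": "string"}
      |  ]
      |}""".stripMargin)

  // Build a GenericRecord directly from the parsed schema --
  // no generated Tweet class is needed.
  def toRecord(id: Long, user: String, text: String): GenericRecord = {
    val r = new GenericData.Record(schema)
    r.put("id", id)
    r.put("user", user)
    r.put("text", text)
    r
  }
}
```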
- Get Twitter credentials, fill them into `reference.conf.example`, and rename the file to `reference.conf` (a template sketch follows below)
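The key names below are a guess at a typical Typesafe Config layout for Twitter credentials; use whatever names `reference.conf.example` actually lists:

```hocon
# Hypothetical key names -- check reference.conf.example for the real ones.
twitter {
  consumerKey       = "YOUR_CONSUMER_KEY"
  consumerSecret    = "YOUR_CONSUMER_SECRET"
  accessToken       = "YOUR_ACCESS_TOKEN"
  accessTokenSecret = "YOUR_ACCESS_TOKEN_SECRET"
}
```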
- Start Kafka (instructions) in single-node mode on localhost
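For a quick local setup, the scripts bundled with the Kafka distribution work; older Kafka versions need ZooKeeper running first (paths are relative to the Kafka install directory):

```sh
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties
```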
- Start TweetCollector with `./gradlew collect`. This starts reading recent tweets, encodes them to Avro, and sends them to the Kafka cluster in binary format (`Array[Byte]`).
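The encode-and-send path might look like the following sketch, which reuses the `TweetSchema` object from above and assumes a plain `KafkaProducer` with byte-array serializers (the `tweets` topic name is a placeholder):

```scala
import java.io.ByteArrayOutputStream
import java.util.Properties
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TweetProducerSketch {
  // Encode a GenericRecord to Avro's binary format -> Array[Byte].
  def toBytes(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](record.getSchema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)

    // "tweets" is a placeholder topic name.
    val record = TweetSchema.toRecord(1L, "someone", "example tweet text")
    producer.send(new ProducerRecord("tweets", toBytes(record)))
    producer.close()
  }
}
```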
- Start TweetAnalyzer with `./gradlew analyze`. This runs Spark Streaming connected to the Kafka queue. In 5-second intervals the program reads tweets from Kafka, analyzes the tweet texts, and prints the 10 most-tweeted companies of the S&P 400.
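A sketch of the analyzer's shape, assuming the `spark-streaming-kafka-0-10` direct stream; the `companyNames` list, topic name, and matching logic are placeholders, and the HDFS-backed thresholds and alerts are omitted:

```scala
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.kafka.common.serialization.ByteArrayDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object TweetAnalyzerSketch {
  // Decode Avro binary back to the tweet text, mirroring the producer sketch.
  def decodeTweetText(bytes: Array[Byte]): String = {
    val reader = new GenericDatumReader[GenericRecord](TweetSchema.schema)
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder).get("text").toString
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("TweetAnalyzer").setMaster("local[*]"),
      Seconds(5)) // 5-second batch interval, as described above

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[ByteArrayDeserializer],
      "value.deserializer" -> classOf[ByteArrayDeserializer],
      "group.id"           -> "tweet-analyzer")

    // Hypothetical company list; the project presumably loads the S&P 400 names from HDFS.
    val companyNames = Seq("SomeCorp", "OtherCo")

    val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte]](
      ssc, PreferConsistent,
      Subscribe[Array[Byte], Array[Byte]](Seq("tweets"), kafkaParams))

    stream
      .map(record => decodeTweetText(record.value))                  // Avro bytes -> tweet text
      .flatMap(text => companyNames.filter(name => text.contains(name))) // naive name matching
      .map(name => (name, 1))
      .reduceByKey(_ + _)
      .foreachRDD { counts =>
        // Print the 10 most-mentioned companies in this 5-second batch.
        counts.top(10)(Ordering.by[(String, Int), Int](_._2)).foreach(println)
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```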