I will admit I am a huge fan of Twitter and of the data it produces, because it holds many interesting facts and opinions on almost any subject from people all around the world. During Thanksgiving 2015 (November 26), while everyone was eating turkey, I fired up Spark to capture tweets containing the keyword ‘thanksgiving’.
I did this work because I was interested in exploring the tweets generated during that period, in particular topics such as the most common retweet and the most common hashtag. I also wanted to try Apache Zeppelin, a web-based notebook (similar to IPython or Jupyter) for interactive data analytics.
- Apache Zeppelin and the Spark interpreter
- Spark Streaming
The dataset consists of 177,955 tweets collected on November 26, 2015.
This repo holds an export of the Zeppelin notebook, the Scala code used to capture the tweets, a Pig script used for merging all the tweets into one file, and the dataset.
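To give a feel for the kind of question explored here, such as finding the most common hashtag, the following is a minimal sketch in plain Python. It is not the notebook's code (the analysis in this repo runs in Zeppelin with Spark). The sample tweets, the `top_hashtags` helper, and the case-insensitive matching are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample texts for illustration; not taken from the real dataset.
sample_tweets = [
    "Happy #Thanksgiving everyone! #turkey",
    "RT @someone: So thankful today #thanksgiving",
    "Too much pie... #Thanksgiving #food",
]

# A hashtag is treated here as '#' followed by word characters (an assumption;
# Twitter's own tokenization rules are more involved).
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(texts, n=3):
    """Return the n most frequent hashtags, compared case-insensitively."""
    counts = Counter(
        tag.lower()
        for text in texts
        for tag in HASHTAG.findall(text)
    )
    return counts.most_common(n)


print(top_hashtags(sample_tweets, n=1))  # [('thanksgiving', 3)]
```

The same counting idea should carry over to the full dataset, where the grouping and counting would be done by Spark rather than by an in-memory `Counter`.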