JFall Presentation: Sentiment Analysis of Social Media Posts with Apache Spark

This repository contains the sample source code and the presentation used in the ignite session I gave at JFall 2015. I also wrote a blog post on the subject, which you can find here.

Presentation

The presentation (as PDF) can be found here.

Spark Hello World

A small runnable example of how to do a word-count analysis is shown in HelloSparkWorld.java.
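HelloSparkWorld.java in this repository is the authoritative version; as a rough illustration of the same flatMap / group / count shape without a Spark dependency, a word count can be sketched in plain Java (the class and method names here are invented for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Split lines into words, lower-case them, and count occurrences —
    // the same flatMap / reduceByKey shape the Spark example uses.
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(Arrays.asList("hello spark", "hello world"));
        System.out.println(counts.get("hello")); // prints 2
    }
}
```

In Spark the same pipeline runs distributed over an RDD instead of a local stream, but the per-element logic is identical.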

Running the analysis

Downloading the data

The 5GB dataset can be downloaded with your favorite torrent client using this link.

You should end up with a file named RC_2015-01.bz2, around 5GB in size.

The application.properties file has the default input set to /tmp/RC_2015-01.bz2. If you downloaded the file to a different location please change the properties file accordingly.

Configuration

The application has two configuration settings, both in application.properties, that you need to set yourself if their defaults do not match your setup.

The input property should point to the RC_2015-01.bz2 file you just downloaded. The output property should point to an empty directory; the application will create the directory if possible.
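As a sketch, the file might look like this (the input path is the default the text states; the output path is only an example):

```properties
# application.properties
# input: the downloaded Reddit comment archive (default location)
input=/tmp/RC_2015-01.bz2
# output: an empty directory for the results (example path, adjust as needed)
output=/tmp/sentiment-output
```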

Running the Analysis

You can run the analysis by simply running the Main class. It starts a Spark context and kicks off an analysis run; you can then connect to http://localhost:4040/ to follow the progress. Keep in mind that this process will take quite some time, more than an hour on my machine.

First the application reads all the JSON, parses it into internal comment structures, and analyses these. The resulting data is stored in a temporary object-store location. This step isn't strictly necessary, but since it takes by far the most time, caching is done for convenience: running new reduce operations on the cached dataset takes far less time than going through the entire deserialization again.
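The payoff of this caching step, parsing the expensive input once and then rerunning cheap aggregations from the serialized form (what Spark's saveAsObjectFile / objectFile pair provides), can be illustrated with plain Java serialization. All class and method names below are hypothetical, and the tab-separated "parsing" stands in for the real JSON parsing:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CacheSketch {
    // A parsed comment record; in the real application this is the internal
    // comment structure produced from the raw JSON.
    public record Comment(String subreddit, String body) implements Serializable {}

    // Expensive step, done once: "parse" raw lines into Comment records
    // and persist them, mirroring rdd.saveAsObjectFile(path).
    public static void parseAndCache(List<String> rawLines, Path cache) throws IOException {
        List<Comment> parsed = new ArrayList<>();
        for (String line : rawLines) {
            String[] parts = line.split("\t", 2); // stand-in for JSON parsing
            parsed.add(new Comment(parts[0], parts[1]));
        }
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(cache))) {
            out.writeObject(parsed);
        }
    }

    // Cheap step, repeatable: reload the parsed records (the objectFile
    // analogue) and run a new reduction without touching the raw input again.
    @SuppressWarnings("unchecked")
    public static List<Comment> reload(Path cache) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(cache))) {
            return (List<Comment>) in.readObject();
        }
    }
}
```

Spark does the same thing at cluster scale: the object file skips the bz2 decompression and JSON deserialization on every subsequent run.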

The object file is then used for the count and sentiment reductions, whose results are written to their corresponding output files.
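The two reductions named above could look roughly like this over the cached records. The Comment shape and the per-comment sentiment score are invented for illustration; the real application's field names and scoring will differ:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ReductionSketch {
    // Hypothetical cached record: where the comment was posted and its score.
    public record Comment(String subreddit, double sentiment) {}

    // Count reduction: comments per subreddit, analogous to a Spark
    // reduceByKey over (subreddit, 1) pairs.
    public static Map<String, Long> countBySubreddit(List<Comment> comments) {
        return comments.stream()
                .collect(Collectors.groupingBy(Comment::subreddit, Collectors.counting()));
    }

    // Sentiment reduction: one plausible form, the average score per subreddit.
    public static Map<String, Double> sentimentBySubreddit(List<Comment> comments) {
        return comments.stream()
                .collect(Collectors.groupingBy(Comment::subreddit,
                        Collectors.averagingDouble(Comment::sentiment)));
    }
}
```

Because both reductions read the same cached object file, adding a third aggregation later costs only the reduction itself, not another pass over the raw 5GB archive.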

Links