Spark project written in Scala used to perform real-time sentiment analysis on top of Twitter's streaming API

Roma

Machine learning and big data project built on top of Apache Spark and written in Scala. Roma performs real-time sentiment analysis using Twitter's streaming API as the data source and a Support Vector Machine (SVM) as the machine learning algorithm.

You can check the results of the real-time sentiment analysis at https://pedrovgs.github.io/Roma/

A more detailed blog post about how we developed this project can be found here.

web

How does it work?

Using Twitter as the input data and an already trained SVM model, we classify incoming tweets into two different classes: positive and negative. Content for which the prediction is not accurate enough, where we can't guarantee whether the content is positive or negative, is not classified. The classification result is persisted into Firebase.

We can split the project components into two main entities:

  • RomaApplication: Responsible for the streaming and real-time tweet classification.
  • RomaMachineLearningTrainer: Responsible for the machine learning training and classification metrics analysis.

Implementation details related to feature extraction and the training process are quite interesting. We strongly recommend taking a look at the implementation.

Real-time classification

Using the Spark Streaming API, we've connected our application to the official Twitter streaming API. For every batch, we use our already trained SVM model to classify the streamed tweets as positive or negative. Tweets classified as positive or negative are persisted into Firebase.

The following screencast shows the console output during the classification process:

classificationScreencast

The steps followed to classify tweets are:

  • Load the application configuration needed to initialize the project.
  • Configure the tweets stream using the Twitter OAuth configuration.
  • Load the Firebase connection and the already trained classification model.
  • For every batch of tweets, we apply our SVM model to predict the classification, transforming raw tweets into classified tweets.

As SVM is a binary classification model and we can't always guarantee whether the content is positive or negative, we need to discard some tweets. To do so, we apply a threshold to the prediction score generated by the Support Vector Machine.
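
A minimal sketch of this classification step, assuming a HashingTF-based feature extractor and the thresholds from the application.conf example shown later in this README (this is not the exact project code):

import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.streaming.dstream.DStream
import twitter4j.Status

// Thresholds matching the machineLearning section of application.conf.
val positiveThreshold = 0.3
val negativeThreshold = -0.2

def classify(tweets: DStream[Status], model: SVMModel, tf: HashingTF): DStream[(String, String)] = {
  model.clearThreshold() // make predict() return the raw SVM score instead of a 0/1 class
  tweets.map { status =>
    val features = tf.transform(status.getText.toLowerCase.split("\\s+").toSeq)
    val score = model.predict(features)
    val label =
      if (score >= positiveThreshold) "positive"
      else if (score <= negativeThreshold) "negative"
      else "discarded" // the score is not conclusive enough to classify the tweet
    (status.getText, label)
  }
}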

The classification results and stats are saved into Firebase, and the classification is performed on an AWS cluster. We decided to use these platforms because they are free for some time and we don't have any sponsor for this project. Firebase does not support aggregation, so we have to implement it manually. Even though the performance is not the best, this implementation is the best we could get for free. In the future, if we find a sponsor for the infrastructure, we will move the implementation to another database.
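
To illustrate the manual aggregation, counters can be updated with Firebase transactions so concurrent batches don't lose updates. A minimal sketch using the Firebase Admin Java SDK; the stats/positiveTweets path is a made-up example, and FirebaseApp must already be initialized with your credentials:

import com.google.firebase.database._

def incrementCounter(path: String, by: Long): Unit = {
  val ref = FirebaseDatabase.getInstance().getReference(path)
  ref.runTransaction(new Transaction.Handler {
    override def doTransaction(data: MutableData): Transaction.Result = {
      // Read the current counter value (null on the first update) and add the batch count.
      val current = Option(data.getValue(classOf[java.lang.Long])).map(_.longValue).getOrElse(0L)
      data.setValue(current + by)
      Transaction.success(data)
    }
    override def onComplete(error: DatabaseError, committed: Boolean, snapshot: DataSnapshot): Unit = ()
  })
}

// Example: incrementCounter("stats/positiveTweets", positiveTweetsInBatch)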

Training

As Support Vector Machine is a supervised classification model, we need to train it. To generate a trained model we need a corpus based on training and testing tweet datasets that are already classified. Thanks to Sentiment140 we've got a training dataset composed of 1,600,000 tweets (distributed as a GZIP file) and a test dataset composed of 359 tweets.

The steps followed to train our model (sketched in code after the list) are:

  • Read the training and testing datasets.
  • Filter and tokenize the tweets' content to be able to extract features easily.
  • Extract tweet features.
  • Train the model.
  • Measure classification results using our testing dataset.
  • Save the model and use it to classify new tweets.
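
As mentioned above, here is a minimal sketch of this pipeline, assuming the Sentiment140 rows have already been parsed into (label, text) pairs and using MLlib's HashingTF for feature extraction (this is not the exact project code):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainAndSave(sc: SparkContext, corpus: RDD[(Double, String)], outputPath: String): SVMModel = {
  val tf = new HashingTF()
  val training = corpus.map { case (label, text) =>
    val tokens = text.toLowerCase.split("\\s+").toSeq // filter + tokenize the tweet content
    LabeledPoint(label, tf.transform(tokens))         // extract tweet features
  }.cache()
  val numberOfIterations = 100 // the default from application.conf
  val model = SVMWithSGD.train(training, numberOfIterations)
  model.save(sc, outputPath) // e.g. "./outputs/svmModel"
  model
}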

The following screenshots show the console output during the training:

  • Features extraction:

featuresExtraction

  • Training process:

trainingProcess

  • Classification metrics:

classification

One important detail is related to the usage of Apache Spark: as we are using this big data and machine learning framework, all the code related to training or classification is executed in parallel.

The current training accuracy is close to 80% based on the Sentiment140 training and testing datasets. In the future we'd like to improve our model with a better training dataset and improved feature extraction.
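
For reference, accuracy can be measured by comparing the model's predictions against the labeled test dataset. A minimal sketch (not the exact metrics code used by the project):

import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Assumes the model's default 0/1 threshold is set, so predict() returns class labels.
def accuracy(model: SVMModel, test: RDD[LabeledPoint]): Double = {
  val predictionAndLabel = test.map(point => (model.predict(point.features), point.label))
  predictionAndLabel.filter { case (prediction, label) => prediction == label }.count.toDouble / test.count()
}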

Build and test this project

To build and test this project you can execute sbt test. You can also use sbt's interactive mode (just execute sbt in your terminal) and then use triggered execution to run your tests with the following commands inside the interactive mode:

~ test
~ test-only *AnySpec

If you just need to assemble the jar you have to send to the Apache Spark cluster, you can simply execute sbt assembly. If you need to repeatedly build the binary, you can execute sbt to enter the interactive mode and then execute ~ assembly.

Configuration

This project needs some private Twitter OAuth configuration, which you can set by creating a new file named application.conf in the /src/main/resources/ folder. Here is an example:

roma {
  twitter4j {
    oauth {
      consumerKey = YOUR_CONSUMER_KEY
      consumerSecret = YOUR_CONSUMER_SECRET
      accessToken = YOUR_ACCESS_TOKEN
      accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
    }
  }
  
  firebase {
    databaseUrl = YOUR_FIREBASE_DATABASE_URL
  } 

  machineLearning {
    numberOfIterations = 100
    outputFolder = "./outputs/"
    modelFileName = "svmModel"
    positiveThreshold = 0.3
    negativeThreshold = -0.2
  }
}  
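
For reference, these values can be read with the Typesafe Config library, which loads application.conf from the classpath. A minimal sketch:

import com.typesafe.config.ConfigFactory

val config = ConfigFactory.load()
val consumerKey = config.getString("roma.twitter4j.oauth.consumerKey")
val positiveThreshold = config.getDouble("roma.machineLearning.positiveThreshold")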

If you want to review the classification results from a simple website, you'll have to modify the file web/js/main.js and initialize the Firebase configuration using your own credentials. Here is an example:

var config = {
        apiKey: "YOUR API KEY",
        authDomain: "YOUR AUTH DOMAIN",
        databaseURL: "YOUR DATABASE URL",
        projectId: "YOUR PROJECT ID",
        storageBucket: "YOUR STORAGE BUCKET",
        messagingSenderId: "YOUR MESSAGING SENDER ID"
    };

You'll also need to add a new file named firebaseCredentials.json to the /src/main/resources/ folder. You can get this file from your Firebase console.

You can get these credentials by creating a Twitter application here and a new Firebase application here.

Training our SVM model

If, after some changes to the feature extraction or the training dataset, you need to regenerate the saved model, you can easily follow these steps:

  • Assemble the application executing sbt assembly.
  • Run the training script ./trainModel.sh.
  • Review the classification metrics shown in the console logs.
  • Move the recently generated model into the resources folder.

After the training process, you'll see a new folder named outputs/svmModel appear and some classification metrics shown in the console.

Running on a cluster

Spark applications are developed to run on a cluster. Before running your app you need to generate a .jar file you can submit to Spark for execution. You can generate the roma.jar file by executing sbt assembly. This will generate a binary file you can submit using the spark-submit command. Ensure your local Spark version is Spark 2.1.1.

You can submit this application to your local Spark installation by executing these commands:

sbt assembly
./submitToLocalSpark.sh
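
Under the hood, submitToLocalSpark.sh boils down to a spark-submit invocation along these lines (the main class name here is an assumption; check the script for the exact values):

spark-submit \
  --class com.github.pedrovgs.roma.RomaApplication \
  --master "local[*]" \
  target/scala-2.11/roma.jar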

If at some point you need to run this on a real cluster, my recommendation is to use Amazon EMR. To get the job running on a cluster, follow these steps:

  • Configure the application.conf and firebaseCredentials.json files as described above.
  • Create an AWS account.
  • Build the project using sbt assembly.
  • Upload the binary you'll find in target/scala-2.11/roma.jar to an Amazon S3 bucket.
  • Upload the svmModel folder found inside the resources directory to an Amazon S3 bucket.
  • Create a cluster using Spark as the application.
  • Add a step to the cluster using the following configuration:

addStep

Once the step starts executing the Spark application, you'll see the website update in real time with the classification stats and the classified tweets computed in the cluster.

Contributors

References

Developed By

Follow me on Twitter. Add me on LinkedIn.

License

Copyright 2017 Pedro Vicente Gómez Sánchez

Licensed under the GNU General Public License, Version 3 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.gnu.org/licenses/gpl-3.0.en.html

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.