PredictionIO word2vec engine template (Scala-based parallelized engine)


This template demonstrates how to integrate the Word2Vec implementation from deeplearning4j with PredictionIO.

The Word2Vec algorithm takes a corpus of text and computes a vector representation for each word. These representations can be subsequently used in many natural language processing applications and for further research.

Creating the project

To copy the template run the following command:

> pio template get pawel-n/template-scala-parallel-word2vec <YourEngineDir>
> cd <YourEngineDir>

Now create a new app:

> pio app new <AppName>
[INFO] [App$] Initialized Event Store for this app ID: 2.
[INFO] [App$] Created new app:
[INFO] [App$]       Name: <AppName>
[INFO] [App$]         ID: <AppId>
[INFO] [App$] Access Key: 

Make sure engine.json matches your app ID:

    "appId": <AppId>
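For reference, engine.json in PredictionIO templates typically has the following overall shape; everything here other than the appId field is illustrative and should be left as the template provides it:

```json
{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "<the template's engine factory class>",
  "datasource": {
    "params": {
      "appId": <AppId>
    }
  }
}
```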

Importing the data

The example data set is a list of tweets with their sentiments. We will be using just the text. First download the file:

> wget
> unzip
> mv Sentiment\ Analysis\ Dataset.csv data/dataset.csv
> rm

Now run the "data/" script to import events:

> ./data/ --access_key=<AccessKey>

This can take a while. Feel free to make yourself a cup of tea.

Build, train, deploy

If your engine runs out of memory, increase the limits with the "--executor-memory" and "--driver-memory" options:

pio build
pio train -- --executor-memory=10GB --driver-memory=10GB
pio deploy -- --executor-memory=10GB --driver-memory=10GB


Once the engine is deployed you can query it with the "data/" script:

> ./data/

The script will ask you for a word and give you a list of similar words. The distance between two words is computed as the cosine similarity of their vector representations.
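The similarity computation can be sketched in plain Scala. This is a toy illustration of cosine similarity, not the deeplearning4j implementation, and the example vectors are made up:

```scala
object CosineDemo {
  // Cosine similarity between two word vectors: dot(a, b) / (|a| * |b|).
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    // Toy 3-dimensional "embeddings" (made-up numbers for illustration).
    val vectors = Map(
      "good"  -> Array(0.9, 0.1, 0.0),
      "great" -> Array(0.8, 0.2, 0.1),
      "car"   -> Array(0.1, 0.9, 0.4)
    )
    // With these vectors, "good" is closer to "great" than to "car".
    println(cosine(vectors("good"), vectors("great")))
    println(cosine(vectors("good"), vectors("car")))
  }
}
```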


Due to a large number of dependency conflicts, the build uses a custom merge strategy. It is also necessary to exclude a few dependencies of deeplearning4j-nlp.


In this section we briefly describe the algorithm of this engine. The remaining DASE components are straightforward.

As the first step we define an input preprocessor. InputHomogenization normalizes the input sentences by removing punctuation and converting words to lower case.

object PreProcessor extends SentencePreProcessor {
  override def preProcess(s: String): String =
    new InputHomogenization(s).transform()
}

The train method collects the sentences from Spark and wraps them in a sentence iterator:

override def train(sc: SparkContext, data: PreparedData): Model = {
    val sentences = data.sentences.collect.toSeq.asJavaCollection
    val sentenceIterator = new CollectionSentenceIterator(PreProcessor, sentences)

After we have normalized the sentences, the next step is to split each sentence into a list of words. Apache UIMA provides a tokenizer that will take care of this.

    val tokenizerFactory = new UimaTokenizerFactory()
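Conceptually, tokenization just splits a normalized sentence into a list of words. A naive whitespace-based version looks like the following; the engine itself delegates this work to UIMA, so this is only an illustration of the step's input and output:

```scala
object TokenizeDemo {
  // Naive tokenizer: lower-case, strip punctuation, split on whitespace.
  // The real engine uses Apache UIMA; this only illustrates the shape
  // of the tokenization step.
  def tokenize(sentence: String): Seq[String] =
    sentence.toLowerCase
      .replaceAll("""[^\p{L}\p{Nd}\s]""", "")
      .split("""\s+""")
      .filter(_.nonEmpty)
      .toSeq

  def main(args: Array[String]): Unit =
    println(tokenize("Hello, Word2Vec world!"))
}
```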

We create a new Word2Vec object with our parameters:

    val word2vec = new Word2Vec.Builder()
      .iterate(sentenceIterator).tokenizerFactory(tokenizerFactory).build()

Finally, we can train it and save it as our model:

    word2vec.fit()
    new Model(word2vec)

The predict method simply calls a Word2Vec method to find the most similar words:

override def predict(model: Model, query: Query): PredictedResult = {
    val nearest = model.word2vec.wordsNearest(query.word, query.num)
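For context, the query and result types in templates like this one are plain case classes. A sketch of the shapes involved follows; the field names and the toy neighbour table are assumptions, and the real lookup is the wordsNearest call shown above:

```scala
// Hypothetical shapes, mirroring typical PredictionIO templates;
// the actual definitions live in the template's engine sources.
case class Query(word: String, num: Int)
case class PredictedResult(words: Array[String])

object PredictDemo {
  // Stand-in for model.word2vec.wordsNearest: looks up `num` words in a
  // toy neighbour table instead of real embeddings.
  val neighbours = Map("good" -> Seq("great", "nice", "fine"))

  def predict(query: Query): PredictedResult =
    PredictedResult(
      neighbours.getOrElse(query.word, Seq.empty).take(query.num).toArray)

  def main(args: Array[String]): Unit =
    println(predict(Query("good", 2)).words.mkString(", "))
}
```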