Text similarity based on Word2Vec vectors.


Text Similarity Based on Word2Vec

A text similarity engine based on the Word2Vec algorithm. It builds vectors for full documents during the training phase and finds similar documents during the query phase.
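As a rough illustration of the two phases, the sketch below builds a document vector by averaging word vectors and ranks documents by cosine similarity. The averaging step, the toy `WORD_VECTORS` table, and all function names are assumptions made for illustration. The template's Scala implementation may combine word vectors differently.

```python
import math

# Toy word vectors standing in for a trained Word2Vec model (hypothetical values).
WORD_VECTORS = {
    "quick": [1.0, 0.0],
    "fast": [0.9, 0.1],
    "fox": [0.0, 1.0],
    "dog": [0.1, 0.9],
}


def doc_vector(text):
    """Average the vectors of known words; unknown words are skipped."""
    known = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    dim = len(next(iter(WORD_VECTORS.values())))
    if not known:
        # No known words: return a zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def most_similar(query, documents, limit=3):
    """Return (score, doc_id) pairs for the best matches, best first."""
    q = doc_vector(query)
    scored = [(cosine(q, doc_vector(text)), doc_id) for doc_id, text in documents.items()]
    return sorted(scored, reverse=True)[:limit]
```

In this sketch, a query made up only of words missing from the vocabulary gets a zero vector and therefore a score of 0 against every document.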

Model Training Modes

The similarity model can be trained from two sources of information, which correspond to the two training modes described below.

The training mode is selected in the engine.json configuration file. It cannot be changed through PubNub queues.

Basic Training

Basic training uses only the "text" field of each training event. In this mode, a score of 0 means that the model has no training information for the query phrase. Results and matches in that case are random.

Composite Training

In this mode there are two sources of information: the fields "text" and "extTrainWords" are concatenated. This gives much more flexibility. To distinguish two quite similar phrases, use the second field to add deciding information so that matching is precise and fits your needs.

Note that both fields can contain free text; they are not limited to single words.
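The difference between the two modes can be sketched as a small function that picks the training text from an event's properties. The function name, the `composite` flag, and joining the fields with a single space are assumptions for illustration; the source only states that the two fields are concatenated.

```python
def training_text(properties, composite=False):
    """Return the text used for training from one event's properties."""
    text = properties.get("text", "")
    if not composite:
        # Basic training: only the "text" field is used.
        return text
    # Composite training: "text" followed by "extTrainWords".
    extra = properties.get("extTrainWords", "")
    return " ".join(part for part in (text, extra) if part)


# Two similar phrases, told apart by the second field (made-up sample data).
event_a = {"text": "reset my password", "extTrainWords": "email account login"}
event_b = {"text": "reset my password", "extTrainWords": "router wifi network"}
```

With `composite=False` both sample events yield the same training text. With `composite=True` the extra words make them distinct.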

Engine Configuration

The engine.json file contains the engine configuration.


Only tokens whose size is equal to or greater than minTokenSize are used to train the similarity model.
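A minimal sketch of this filter, assuming tokens are produced by splitting on whitespace (the template's actual tokenizer is not described here, and the function name is hypothetical):

```python
def filter_tokens(text, min_token_size):
    """Keep only tokens whose length is at least min_token_size."""
    # Assumption: whitespace tokenization.
    return [token for token in text.split() if len(token) >= min_token_size]
```

For example, with a minimum size of 2, the phrase `"a quick ox"` keeps `quick` and `ox` and drops `a`.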


If set to true, the query result displays the "text" field from training events.


If set to true, the query result displays the "desc" field from training events.


If set to true, composite training is used: the concatenated "text" and "extTrainWords" fields are the source of training data. If set to false, only the "text" field is used for training.


Default: false. If set to true, all texts used for training are stored inside the model together with their vectors. The model is kept in memory, so a huge training set can fill a significant portion of memory. If set to false, the model is very memory efficient: only vectors of doubles are stored. By default, vectors have 100 dimensions; this is configurable using "vectorSize".
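To get a feel for the memory used when only vectors are stored, the arithmetic can be sketched as follows. This is a lower-bound estimate that assumes 8 bytes per double and ignores document ids and any runtime object overhead; the function name is hypothetical.

```python
BYTES_PER_DOUBLE = 8  # size of a 64-bit floating point value


def vector_bytes(num_documents, vector_size=100):
    """Raw size in bytes of num_documents vectors of vector_size doubles."""
    return num_documents * vector_size * BYTES_PER_DOUBLE
```

With the default of 100 dimensions, each document needs 800 bytes of raw vector data, so one million documents need about 800 MB before any overhead.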

Docker Part

docker pull goliasz/docker-predictionio
docker run --hostname tc1 --name tc1 -it goliasz/docker-predictionio /bin/bash

PIO Part

root@tc1:/# pio-start-all
root@tc1:/# mkdir MyEngine
root@tc1:/# cd MyEngine
root@tc1:/MyEngine# pio template get goliasz/pio-template-text-similarity --version "0.9.1" textsim
root@tc1:/MyEngine# cd textsim
root@tc1:/MyEngine/textsim# vi engine.json

Set the application name in engine.json to "textsim".

root@tc1:/MyEngine/textsim# pio build --verbose
root@tc1:/MyEngine/textsim# pio app new textsim --access-key 1234
root@tc1:/MyEngine/textsim# sh ./data/import_test.sh 1
root@tc1:/MyEngine/textsim# pio train
root@tc1:/MyEngine/textsim# pio deploy --port 8000 &


Event Server Status

curl -i -X GET http://localhost:7070

Event Server: get all events (replace YOUR_ACCESS_KEY with the access key from the "pio app new textsim" output)

curl -i -X GET "http://localhost:7070/events.json?accessKey=YOUR_ACCESS_KEY"

Query the similarity score for a text that is slightly similar to id:6

curl -X POST -H "Content-Type: application/json" -d '{"doc": "DJs flock by when MTV ax quiz prog. Five quacking zephyrs jolt my wax bed.", "limit": 3}' http://localhost:8000/queries.json
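The same query can be sent from a script. The sketch below uses only the Python standard library and mirrors the request fields ("doc" and "limit") and the URL of the curl call. The function names are hypothetical, and the shape of the response is not assumed: the parsed JSON is returned as-is.

```python
import json
import urllib.request


def build_query(doc, limit=3):
    """Encode the query body as UTF-8 JSON bytes."""
    return json.dumps({"doc": doc, "limit": limit}).encode("utf-8")


def query_engine(doc, limit=3, url="http://localhost:8000/queries.json"):
    """POST the query to the deployed engine and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=build_query(doc, limit),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Once the engine is deployed on port 8000, calling `query_engine("Five quacking zephyrs jolt my wax bed.")` returns whatever the engine replies with. Building the body with `json.dumps` avoids hand-written JSON mistakes such as a comma in place of a colon.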


Template versions >= 0.8 require Spark 1.6.1. PredictionIO bundled with Spark 1.6.1 is available here: https://hub.docker.com/r/goliasz/docker-predictionio-dev/


This software is licensed under the Apache License, Version 2.0, found here: http://www.apache.org/licenses/LICENSE-2.0