
compute semantic similarity between arbitrary words and phrases in many languages


linguatools/disco


DISCO API

Java API for word embeddings

This is the source code repository for the linguatools DISCO API. For more information on DISCO visit http://www.linguatools.de/disco/disco_en.html.

Quickstart

Install DISCO API

Download the source code by cloning this repository:

git clone git@github.com:linguatools/disco.git

Go into the repository folder and build the executable jar with dependencies:

cd disco/
./gradlew shadowJar

For command-line usage instructions, run the DISCO API jar without any parameters:

java -jar build/libs/disco-3.0.0-all.jar

or consult the web page.

Import a vector file from fastText

Download a fastText vector file in text format and unpack it:

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.de.300.vec.gz
gunzip cc.de.300.vec.gz
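
For orientation, the unpacked .vec file is plain text: the first line holds the vocabulary size and the vector dimension, and every following line holds a word followed by its space-separated vector components. The sketch below reads that layout in plain Java. It is only an illustration of the file format, not part of DISCO or DISCO Builder; the class name `VecFormatSketch` and the three-dimensional sample data are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the DISCO API.
public class VecFormatSketch {

    /** Reads a fastText-style text stream: header "count dim", then "word v1 ... vdim". */
    static Map<String, float[]> read(BufferedReader in) throws IOException {
        String headerLine = in.readLine();
        if (headerLine == null) {
            throw new IOException("empty input");
        }
        // The second header field is the vector dimension.
        int dim = Integer.parseInt(headerLine.trim().split(" ")[1]);
        Map<String, float[]> vectors = new LinkedHashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.trim().split(" ");
            if (parts.length != dim + 1) {
                continue; // skip lines that do not match the declared dimension
            }
            float[] v = new float[dim];
            for (int i = 0; i < dim; i++) {
                v[i] = Float.parseFloat(parts[i + 1]);
            }
            vectors.put(parts[0], v);
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data in the same layout as a .vec file.
        String sample = "2 3\nHaus 0.1 0.2 0.3\nWohnung 0.2 0.1 0.4\n";
        Map<String, float[]> vectors = read(new BufferedReader(new StringReader(sample)));
        System.out.println(vectors.keySet());
    }
}
```

For the real file, DISCO Builder's Import command does this work and writes the result in DISCO's own storage format.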

Download DISCO Builder:

wget http://www.linguatools.de/disco/DISCOBuilder-1.1.1.tar.bz2
tar jxf DISCOBuilder-1.1.1.tar.bz2

Convert the vector file into a DISCO DenseMatrix:

java -Xmx8g -cp DISCOBuilder-1.1.1/DISCOBuilder-1.1.1-all.jar de.linguatools.disco.builder.Import -in cc.de.300.vec -out cc.de.300.col.denseMatrix -wsType COL 

Query the new DISCO word space from the command line with the DISCO API:

java -Xmx4g -jar ~/repos-linguatools/disco/build/libs/disco-3.0.0-all.jar cc.de.300.col.denseMatrix/cc.de.300-COL.denseMatrix -s Haus Wohnung COSINE
0.64413786
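
The number printed above is the cosine similarity of the two word vectors. As a rough illustration of what that measure computes, here is a minimal sketch of the textbook formula (dot product divided by the product of the vector lengths) on plain float arrays. It is not DISCO's implementation; the class name `CosineSketch` and the example vectors are assumptions made for this example.

```java
// Hypothetical helper, not part of the DISCO API.
public class CosineSketch {

    /** Cosine similarity of two equal-length vectors; returns 0 if either vector is all zeros. */
    static float cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0f; // similarity is undefined for a zero vector; 0 is a convention here
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    public static void main(String[] args) {
        // Made-up vectors, for illustration only.
        float[] u = {1f, 2f, 3f};
        float[] v = {2f, 4f, 6f};
        System.out.println(cosine(u, v)); // parallel vectors, so the value is close to 1
    }
}
```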

Java API

To include DISCO in your Maven or Gradle project see below or visit the DISCO page on JitPack.

Gradle

Add this to your build.gradle file:

repositories {
    maven { url 'https://jitpack.io' }
}
dependencies {
    implementation 'com.github.linguatools:disco:v3.0.0'
}

Maven

Add this to your pom.xml file:

<repositories>
	<repository>
	    <id>jitpack.io</id>
	    <url>https://jitpack.io</url>
	</repository>
</repositories>

<dependency>
    <groupId>com.github.linguatools</groupId>
    <artifactId>disco</artifactId>
    <version>v3.0.0</version>
</dependency>

Example Java code

// load a word space from disk
DISCO disco = DISCO.load("cc.de.300-COL.denseMatrix");
// cosine similarity between two words
float sim = disco.semanticSimilarity("Haus", "Häuschen",
        DISCO.getVectorSimilarity(SimilarityMeasure.COSINE));
System.out.println("similarity between 'Haus' and 'Häuschen': " + sim);
// get word vector for "Haus" as map
Map<String,Float> wordVectorHaus = disco.getWordvector("Haus");
// get word embedding for "Haus" as float array (the cast requires a DenseMatrix word space)
float[] wordEmbeddingHaus = ((DenseMatrix) disco).getWordEmbedding("Haus");
// solve analogy: x is to "Frau" as "König" is to "Mann"
List<ReturnDataCol> result = Compositionality.solveAnalogy("Frau", "König", "Mann", disco);
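
To make the analogy query more concrete, the sketch below shows the commonly used vector-offset approach: build the target vector König - Mann + Frau and return the word whose vector is most similar to it by cosine. Whether `Compositionality.solveAnalogy` works exactly this way is an assumption; the class `AnalogySketch`, its `solve` method and the two-dimensional toy vectors are invented for this illustration and are not part of DISCO.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the vector-offset method; not the DISCO implementation.
public class AnalogySketch {

    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Solves "x is to a as b is to c" by finding the word closest to b - c + a. */
    static String solve(String a, String b, String c, Map<String, float[]> space) {
        float[] va = space.get(a), vb = space.get(b), vc = space.get(c);
        float[] target = new float[va.length];
        for (int i = 0; i < target.length; i++) {
            target[i] = vb[i] - vc[i] + va[i];
        }
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> e : space.entrySet()) {
            String word = e.getKey();
            if (word.equals(a) || word.equals(b) || word.equals(c)) {
                continue; // the query words themselves are not valid answers
            }
            double sim = cosine(target, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy vectors (made up): first axis "royalty", second axis "gender".
        Map<String, float[]> space = new LinkedHashMap<>();
        space.put("Mann", new float[] {0f, 1f});
        space.put("Frau", new float[] {0f, -1f});
        space.put("König", new float[] {1f, 1f});
        space.put("Königin", new float[] {1f, -1f});
        space.put("Haus", new float[] {-1f, 0f});
        System.out.println(solve("Frau", "König", "Mann", space));
    }
}
```

With these toy vectors the target is exactly the vector of "Königin", so that word is returned. On a real word space the result is a ranked list of candidates, which is what the `List<ReturnDataCol>` returned above holds.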

Documentation

How to get word spaces for DISCO?

Features

  • native Java API
  • the API provides many useful methods for computing text similarity, solving analogies, clustering of similar words, compositional semantics, etc.
  • efficient storage of high-dimensional sparse matrices (distributional count vectors) as well as low-dimensional dense matrices (word embeddings)
  • higher-order word similarities can be stored and retrieved efficiently
  • the API is open source under the Apache license