JLanI 2

Reimplementation of the venerable language identification tool JLanI. It identifies the most likely language of an unknown text.

JLanI by default uses the wordlist files contained in ./resources/jlani/wordlists. The included languages are Chinese, English, French, German, Korean, Japanese, Mandalorian and Spanish.

Example:

import com.wn.nlp.jlani.Request;
import com.wn.nlp.jlani.JLanI;

class Example {
	public static void main(String[] args) {
		var jLanI = new JLanI();
		
		var request = new Request("JLanI identifies the most likely language of an unknown text");
		var response = jLanI.evaluate(request);
		
		var language = response.getMostLikelyResult().getKey();
		System.out.println(language.name()); // prints 'en'
	}
}

Other languages

Create a wordlist from your own reference corpus:

import com.wn.nlp.jlani.WordListCreator;
import com.wn.nlp.jlani.util.IOUtil;

import java.io.IOException;
import java.nio.file.Path;

class TextWordListCreator {
	public static void main(String[] args) throws IOException {
		var text = """
				any text representing the target language as best it can
				""";
		try (var writer = IOUtil.newFileWriter(Path.of("./lang.txt"), true)) {
			new WordListCreator().createFromText(text, writer);
		}
	}
}

and drop it in ./resources/jlani/wordlists to be available on startup.

A good starting point for corpora is the Leipzig Corpora Collection. Use WortschatzWordListCreator to create a wordlist from one of their word files.

Using resources

If you don't want to use actual files, you can for example deliver your wordlist files as resources in your JAR file. Simply read the wordlists and pass them to the JLanI constructor:

import com.wn.nlp.jlani.JLanI;
import com.wn.nlp.jlani.WordList;
import com.wn.nlp.jlani.WordLists;
import com.wn.nlp.jlani.impl.SimpleWordList;
import com.wn.nlp.jlani.util.IOUtil;
import com.wn.nlp.jlani.value.Language;

import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

class ResourcesExample {
	public static void main(String[] args) throws IOException {
		var de = wordListOfResource("de.txt.gz");
		var en = wordListOfResource("en.txt.gz");
		var wordLists = new WordLists(List.of(de, en));
		
		var jLanI = new JLanI(wordLists);
	}
	
	private static WordList wordListOfResource(final String name) throws IOException {
		var path = Path.of(name);
		try (var reader = IOUtil.newResourceReader(path)) {
			return SimpleWordList.ofWordlistFileReader(reader, Language.ofPath(path));
		}
	}
}

Configuration

There are three ways to configure JLanI:

Edit the resources file src/main/resources/config/jlani/lanikernel.ini.
Create the external file ./config/jlani/lanikernel.ini. If it exists, it takes precedence over the resources file.
Set the configuration programmatically via the Preferences class.

Any changes are applied only to new JLanI instances.

Changes from the original

Preprocessing:
- Splits words on any whitespace of any length
- Performs Unicode normalization to NFKC
- Removes numbers
- Removes empty strings
- Samples from the whole input text instead of sampling the first n words

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
.github/workflows		.github/workflows
resources/jlani/wordlists		resources/jlani/wordlists
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JLanI 2

Other languages

Using resources

Configuration

Changes from the original

About

Releases 1

Packages

Languages

License

nilswende/JLanI

Folders and files

Latest commit

History

Repository files navigation

JLanI 2

Other languages

Using resources

Configuration

Changes from the original

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages