LLClass is a Java tool that can be used for a number of text classification problems, including:
- Language Identification (LID), especially for Twitter data
- Unreliable article style detection: a model trained on in-house hackathon training data is available for download
- Automatic text difficulty assessment
- Sentiment analysis
It includes a number of classifiers, including MIRA, SVM, and a perceptron.
It also includes a simple REST service for doing classification, along with some pre-trained models.
More documentation can be found under docs.
See below for some performance benchmarks.
Requirements:
- Scala 2.11.8
- sbt
- Java 1.8
At the top-level directory, type:
sbt assembly
Running this command will cause SBT to download some dependencies; this may take some time depending on your internet connection.
If you use a proxy, you may need to adjust your local proxy settings to allow SBT to fetch dependencies.
If you are behind a firewall, you may need to add a configuration file in your ~/.sbt
directory. See the SBT Behind a Firewall section for more details.
This creates a jar under target, as reported in the build output:
[info] Packaging ... target/scala-2.11/LLClass-assembly-1.1.jar
For the examples below, you can create a symbolic link to the jar (from the top-level directory):
ln -s target/scala-2.11/LLClass-assembly-1.1.jar LLClass.jar
- The data should have one line per example, with the label and document separated by a tab or whitespace (a sketch for generating such a file follows the example):
en this is english.
fr quelle langue est-elle?
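For illustration, here is a minimal sketch (a hypothetical helper, not part of LLClass) that writes labeled examples in this label<TAB>document format and gzips the result, so the file can be passed via -all, -train, or -test:

```scala
import java.io.{FileOutputStream, OutputStreamWriter, PrintWriter}
import java.util.zip.GZIPOutputStream

// Hypothetical helper: write label<TAB>document lines, gzipped, for LID input.
object MakeTrainingFile extends App {
  val examples = Seq(
    "en" -> "this is english.",
    "fr" -> "quelle langue est-elle?")
  val out = new PrintWriter(new OutputStreamWriter(
    new GZIPOutputStream(new FileOutputStream("myfile.tsv.gz")), "UTF-8"))
  try examples.foreach { case (label, doc) => out.println(s"$label\t$doc") }
  finally out.close()
}
```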
Default parameters:
- split: 0.10 (90/10 train/test split)
- word-ngram-order: 1
- char-ngram-order: 3
- bkg-min-count: 2
- slack: 0.01
- iterations: 20
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz
Output (truncated):
2015-10-05 15:56:25.912 [INFO] Completed training
2015-10-05 15:56:25.912 [INFO] Training complete.
2015-10-05 15:56:27.325 [INFO] # of trials: 200
2015-10-05 15:56:27.325 [INFO] ru fa es dar N class %
2015-10-05 15:56:27.325 [INFO] ru 50 0 0 0 50 1.000000
2015-10-05 15:56:27.325 [INFO] fa 0 46 0 4 50 0.920000
2015-10-05 15:56:27.326 [INFO] es 0 0 50 0 50 1.000000
2015-10-05 15:56:27.326 [INFO] dar 0 0 0 50 50 1.000000
2015-10-05 15:56:27.326 [INFO] accuracy = 0.985
- Model - save a LID model for later application onto new data
- Log - log the parameters, per-language accuracy, overall accuracy, and debugging information
- Score - generate a file with LID scores on each sentence
- Data - use (-all) to generate models or do a train/test split. Data should be in TSV format and gzipped (gzip myfile.tsv)
- Train/Test - if not using (-all with optional -split), then specify separate train and test sets (-train mytrain.tsv.gz -test mytest.tsv.gz). This is useful for training out-of-domain followed by testing in-domain.
- Output Files - Scored files, model files, and log files are only saved when the user specifies them on the command line
Train and test with an 85/15 split and 10 iterations:
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz -split 0.15 -iterations 10
To save the resulting model, log, and score files, specify them on the command line (this example also uses an 85/15 train/test split and 30 iterations):
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz -split 0.15 -iterations 30 -model news4L.mod -log news4L.log -score news4L.score
Apply a saved model to a new test set:
java -jar LLClass.jar LID -test test/no_nl_da_en_5K.tsv.gz -model models/fourLang.mod.gz
Train on one dataset and test on another:
java -jar LLClass.jar LID -train test/no_nl_da_en_5k.tsv.gz -test test/no_nl_da_en_500.tsv.gz
There are two main functions to score text.
- textLID() returns the language code and a confidence value for that code.
- textLIDFull() returns a set of language labels ranked by most likely to least likely and a confidence value for each one. The confidence values range [0,1] where larger numbers imply higher confidence.
- Import the LLClass language ID package:
import mitll.lid
- Create an instance of the Scorer class and specify the LID model:
var newsRunner = new lid.Scorer("path/to/lid/model")
- Call the function mitll.lid.Scorer.textLID():
var (language, confidence) = newsRunner.textLID("what language is this text string?")
- Or call the function mitll.lid.Scorer.textLIDFull():
var langConfArray: Array[(Symbol, Double)] = newsRunner.textLIDFull("what language is this text string?")
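Putting the steps together, here is a minimal self-contained sketch; the model path news4L.mod is a placeholder for a model you trained earlier (e.g. with -model news4L.mod) or downloaded from the releases:

```scala
import mitll.lid

// End-to-end sketch of the steps above; the model path is a placeholder.
object LidExample extends App {
  val newsRunner = new lid.Scorer("news4L.mod")

  // Best label and its confidence
  val (language, confidence) = newsRunner.textLID("what language is this text string?")
  println(s"$language ($confidence)")

  // All labels, ranked most to least likely, each with a confidence in [0,1]
  val ranked: Array[(Symbol, Double)] = newsRunner.textLIDFull("what language is this text string?")
  ranked.foreach { case (lang, score) => println(s"$lang -> $score") }
}
```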
- Start the service
java -jar LLClass.jar REST
- Simple text call in a browser (http://localhost:8080/classify?q=...)
http://localhost:8080/classify?q=No%20quiero%20pagar%20la%20cuenta%20del%20hospital.
es
- Curl
curl --noproxy localhost http://localhost:8080/classify?q=Necesito%20pagar%20los%20servicios%20de%20electricidad%20y%20cable.
es
- Return JSON (http://localhost:8080/classify/json?q=...)
http://localhost:8080/classify/json?q=%22Necesito%20pagar%20los%20servicios%20de%20electricidad%20y%20cable.%22
{
class: "es",
confidence: "47.82"
}
- Model labels
http://localhost:8080/labels
{
labels: [
"dar",
"es",
"fa",
"ru"
]
}
- JSON scores for all labels in model (http://localhost:8080/classify/all/json?q=...)
http://localhost:8080/classify/all/json?q=%22Necesito%20pagar%20los%20servicios%20de%20electricidad%20y%20cable.%22
{
results: [
{
class: "es",
confidence: 0.7623826612993418
},
{
class: "dar",
confidence: -0.128890820239746
},
{
class: "ru",
confidence: -0.2614167250657798
},
{
class: "fa",
confidence: -0.37207511599381426
}
]
}
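For programmatic access, here is a minimal sketch of calling the classify endpoint from Scala; it assumes the service was started with `java -jar LLClass.jar REST` and is listening on localhost:8080 as in the examples above:

```scala
import java.net.URLEncoder
import scala.io.Source

// Query the running REST service for a single label (assumes localhost:8080).
object RestClientExample extends App {
  val q = URLEncoder.encode("Necesito pagar los servicios de electricidad y cable.", "UTF-8")
  val label = Source.fromURL(s"http://localhost:8080/classify?q=$q", "UTF-8").mkString
  println(label) // expected: es
}
```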
- See RESTServiceSpec for details
- LIDSpec has more usage examples.
- EvalSpec has tests that show swapping out an LLClass classifier for a langid.py service classifier.
- TwitterEvalSpec has tests for running against Twitter data from "Evaluating language identification performance"
- RESTServiceSpec shows variations on running the RESTService
### SBT behind a firewall
- You may need to add a repositories file like this under your ~/.sbt directory:
[repositories]
local
my-ivy-proxy-releases: http://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
my-maven-proxy-releases: http://repo1.maven.org/maven2/
Tweet normalization was done by the LLString tweet normalizer. See that repo for instructions on how to install it and run the python normalization script.
The benchmarks below use datasets from Evaluating language identification performance.
Note that the precision_oriented dataset had 69,000 tweets, but we were only able to download 51,567 of them.
Labels with fewer than 500 examples were excluded.
Info | Value |
---|---|
Train | 44352 |
Test | 6654 |
Labels(44) | ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,und,ur,vi,zh-CN,zh-TW |
Accuracy | 0.86654645 |
The und label marks undefined tweets that could match several languages.
Info | Value |
---|---|
Train | 34546 |
Test | 5183 |
Labels(43) | ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,ur,vi,zh-CN,zh-TW |
Accuracy | 0.949836 |
This model can be found in the release directory if you want to try it yourself.
We ran a Python script to attempt to normalize tweet text, removing markup, hashtags, etc.
Info | Value |
---|---|
Train | 44080 |
Test | 6614 |
Labels(44) | ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,und,ur,vi,zh-CN,zh-TW |
Accuracy | 0.86815846 |
Info | Value |
---|---|
Train | 34540 |
Test | 5183 |
Labels(43) | ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,ur,vi,zh-CN,zh-TW |
Accuracy | 0.95504534 |
Initial data had 72000 of 87585 tweets from recall_oriented.
Info | Value |
---|---|
Train | 71196 |
Test | 10682 |
Labels (67) | am, ar, bg, bn, bo, bs, ca, ckb, cs, cy, da, de, dv, el, en, es, et, eu, fa, fi, fr, gu, he, hi, hi-Latn, hr, ht, hu, hy, id, is, it, ja, ka, km, kn, ko, lo, lv, ml, mr, my, ne, nl, no, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sr, sv, ta, te, th, tl, tr, uk, ur, vi, zh-CN, zh-TW |
Accuracy | 0.9237971 |
See model in releases.
Initial data had 72000 of 87585 tweets from recall_oriented.
Info | Value |
---|---|
Train | 71187 |
Test | 10680 |
Labels (67) | am, ar, bg, bn, bo, bs, ca, ckb, cs, cy, da, de, dv, el, en, es, et, eu, fa, fi, fr, gu, he, hi, hi-Latn, hr, ht, hu, hy, id, is, it, ja, ka, km, kn, ko, lo, lv, ml, mr, my, ne, nl, no, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sr, sv, ta, te, th, tl, tr, uk, ur, vi, zh-CN, zh-TW |
Accuracy | 0.9238764 |
Info | Value |
---|---|
Train | 76442 |
Test | 11467 |
Labels (13) | ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr, und |
Accuracy | 0.9098282 |
Train/Test 85/15 split all labels, no text normalization, minimum 500 examples per label, no und label
Info | Value |
---|---|
Train | 76442 |
Test | 10402 |
Labels (12) | ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr |
Accuracy | 0.9699096 |
Info | Value |
---|---|
Train | 69344 |
Test | 11228 |
Labels (13) | ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr, und |
Accuracy | 0.9291058 |
Train/Test 85/15 split all labels, with text normalization, minimum 500 examples per label, no und label
Info | Value |
---|---|
Train | 69338 |
Test | 10401 |
Labels (12) | ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr |
Accuracy | 0.9808672 |
Info | Value |
---|---|
Train | 74854 |
Test | 3150 |
Labels (21) | bg, cs, da, de, el, en, es, et, fi, fr, hu, it, lt, lv, nl, pl, pt, ro, sk, sl, sv |
Accuracy | 0.99841267 |
See releases for model.
2016-04-15 16:11:37.257 [INFO] # of trials: 825
2016-04-15 16:11:37.258 [INFO] zh uk ru no nl ko id fa en da ar N class %
2016-04-15 16:11:37.258 [INFO] zh 74 1 0 0 0 0 0 0 0 0 0 75 0.986667
2016-04-15 16:11:37.258 [INFO] uk 1 40 28 2 0 0 0 2 0 2 0 75 0.533333
2016-04-15 16:11:37.258 [INFO] ru 0 4 71 0 0 0 0 0 0 0 0 75 0.946667
2016-04-15 16:11:37.258 [INFO] no 1 1 0 28 11 1 10 0 8 15 0 75 0.373333
2016-04-15 16:11:37.258 [INFO] nl 1 0 0 3 57 0 6 0 4 4 0 75 0.760000
2016-04-15 16:11:37.258 [INFO] ko 0 0 0 0 0 74 0 0 1 0 0 75 0.986667
2016-04-15 16:11:37.258 [INFO] id 1 0 0 1 2 0 68 0 2 1 0 75 0.906667
2016-04-15 16:11:37.258 [INFO] fa 1 1 0 0 0 0 1 69 0 0 3 75 0.920000
2016-04-15 16:11:37.258 [INFO] en 1 1 0 3 0 0 3 0 57 10 0 75 0.760000
2016-04-15 16:11:37.258 [INFO] da 0 0 0 9 1 0 4 1 6 54 0 75 0.720000
2016-04-15 16:11:37.258 [INFO] ar 0 0 0 0 0 0 0 1 0 0 74 75 0.986667
2016-04-15 16:11:37.258 [INFO] accuracy = 0.807273