HTTP API over Weka for training and serving classifiers — single-user local dev, no auth, persistence on the host filesystem.
- Prerequisites
- Local setup with Docker Compose
- Running the test suite
- Project layout
- Configuration
- API reference
- End-to-end walkthrough
- Troubleshooting
- Security note
- Docker Desktop (or any Docker engine) with Compose v2 (
docker compose ..., notdocker-compose). - About 1 GB free disk for the first image build.
- Port
7070free on127.0.0.1.
To run the JUnit test suite outside Docker you also need:
- JDK 17 (
java -versionshould report17.x). - Maven 3.8+ (
mvn -v).
Install on macOS: brew install --cask docker and brew install maven.
From the repo root (the directory containing this README):
docker compose up --buildWhat this does:
- Builds the
api/image via a multi-stage Dockerfile (Maven build oneclipse-temurin:17-jdk, runtime on:17-jre). - Produces a shaded uber jar at
/app/app.jarinside the container. - Starts the container as
weka-api, binds only127.0.0.1:7070 → 7070. - Mounts
./models→/app/modelsand./data→/app/dataso trained models and uploaded datasets persist across restarts.
The first build downloads Weka + Javalin + Jackson into the local Maven cache and takes a few minutes. Subsequent builds reuse the cache.
Verify it's up:
curl http://localhost:7070/health
# → {"status":"ok","wekaVersion":"3.9.6"}# foreground, with logs
docker compose up --build
# background
docker compose up --build -d
# follow logs
docker compose logs -f weka-api
# stop (preserves models/ and data/)
docker compose down
# stop and remove built image (keep data)
docker compose down --rmi local
# rebuild from scratch
docker compose build --no-cache && docker compose upmodels/*.model, models/*.header, and data/*.{arff,csv} live on the host. docker compose down does not delete them. To start clean:
docker compose down
rm -f models/* data/* # keep the .gitkeep filesThe suite uses Javalin's in-process test runner — no running container required — and currently includes 22 tests covering health, algorithms, security, the train → predict → evaluate flow, EDA, transform (apply + preview), filter metadata, leakage-safe filtered training, model-structure (tree/graph), and the five diagnostics endpoints.
cd api
mvn testExpected output ends with Tests run: 22, Failures: 0, Errors: 0, Skipped: 0.
If you'd rather not install Java locally, run Maven inside a throwaway container against your source tree:
cd /path/to/repo
docker run --rm \
-v "$(pwd)/api":/build \
-v ~/.m2:/root/.m2 \
-w /build \
maven:3.9-eclipse-temurin-17 \
mvn -B test--rmdeletes the container when it exits-v "$(pwd)/api":/buildmounts the source-v ~/.m2:/root/.m2shares your Maven cache so re-runs are fast (~2s after warm-up)maven:3.9-eclipse-temurin-17is the official Maven image bundled with JDK 17
Target a single class with -Dtest:
docker run --rm -v "$(pwd)/api":/build -v ~/.m2:/root/.m2 -w /build \
maven:3.9-eclipse-temurin-17 \
mvn -B test -Dtest=DiagnosticsITIf you'd like docker compose run test to be the canonical command, append a service to compose.yaml:
test:
image: maven:3.9-eclipse-temurin-17
working_dir: /build
volumes:
- ./api:/build
- maven-cache:/root/.m2
command: ["mvn", "-B", "test"]
profiles: ["test"]
volumes:
maven-cache:profiles: ["test"] keeps the service out of docker compose up. Run it explicitly:
docker compose run --rm test # all tests
docker compose run --rm test mvn -B test -Dtest=EdaIT # one classHealthControllerTest—GET /healthreturns 200 with a non-blankwekaVersion.AlgorithmControllerTest—GET /algorithmsincludesweka.classifiers.trees.J48.SecurityTest—algorithm=java.lang.Runtime,modelName=../escape, andname=../fooupload all return 400.TrainAndPredictIT— upload iris → train J48 → predict 2 instances; distributions sum to ~1.0.EvaluateIT— train on iris, evaluate on iris, assert accuracy > 0.9.EdaIT— exercises all 5 EDA endpoints +INVALID_ATTRIBUTEguard.TransformIT—/filterslisting, Normalize chain, andINVALID_FILTERguard.TransformPreviewIT— preview returns sample rows, caps head at 200, never writes toDATA_DIR.FilterMetadataIT— listing exposessupervised/levelflags; per-filter metadata returns description + options schema; allowlist + missing-param errors.TrainFilteredIT— train + predict + evaluate + Drawable extraction throughFilteredClassifier; rejects non-Weka filters; multi-filter chains echoed in response.ModelGraphIT— J48 returns DOT tree, Logistic returnsNOT_DRAWABLE.DiagnosticsIT— exercises all 5 diagnostics endpoints +INVALID_CLASS_VALUEguard.
A sample iris.arff ships at api/src/test/resources/iris.arff and is reused by every test.
.
├── SPEC.md # source-of-truth spec
├── README.md # this file
├── compose.yaml # Docker Compose service definition
├── api/ # Javalin project
│ ├── Dockerfile # multi-stage: build → runtime
│ ├── pom.xml # Maven build, shade plugin → uber jar
│ └── src/
│ ├── main/java/com/wekaapi/
│ │ ├── App.java # Javalin bootstrap + routes
│ │ ├── config/Config.java # env-var loader
│ │ ├── controller/ # HTTP handlers
│ │ ├── service/ # business logic (Weka calls)
│ │ ├── dto/ # request shapes
│ │ └── error/ # ApiException + handler
│ └── test/ # JUnit 5 tests + iris.arff
├── models/ # mounted into container, ignored by git
└── data/ # mounted into container, ignored by git
All vars have defaults — docker compose up works without any .env file.
| Variable | Default | Purpose |
|---|---|---|
PORT |
7070 |
HTTP port Javalin binds to |
MODELS_DIR |
/app/models |
Where serialized .model files live |
DATA_DIR |
/app/data |
Where uploaded ARFF/CSV files live |
MAX_UPLOAD_MB |
100 |
Reject dataset uploads above this size |
LOG_LEVEL |
INFO |
Root Logback level |
To override, edit the environment: block in compose.yaml or pass via -e:
docker compose run --rm -e LOG_LEVEL=DEBUG -p 127.0.0.1:7070:7070 weka-apiBase URL: http://localhost:7070. All bodies are application/json unless stated. Errors return {"error": "...", "code": "..."} with an HTTP status from the table at the end of this section.
Liveness probe and Weka version.
curl http://localhost:7070/health{ "status": "ok", "wekaVersion": "3.9.6" }Lists Weka classifier classnames grouped by family (trees, bayes, functions, lazy, rules, meta, misc). Cached for the process lifetime.
curl http://localhost:7070/algorithms{
"classifiers": {
"bayes": ["weka.classifiers.bayes.NaiveBayes", "..."],
"functions": ["weka.classifiers.functions.Logistic", "..."],
"lazy": ["weka.classifiers.lazy.IBk", "..."],
"meta": ["weka.classifiers.meta.AdaBoostM1", "..."],
"rules": ["weka.classifiers.rules.JRip", "..."],
"trees": ["weka.classifiers.trees.J48", "weka.classifiers.trees.RandomForest", "..."]
}
}multipart/form-data:
| Field | Required | Notes |
|---|---|---|
file |
yes | An ARFF (.arff) or CSV (.csv) file. |
name |
no | Stored filename base (no extension). Defaults to the uploaded filename minus its extension. Must not contain /, \, or ... |
curl -F file=@api/src/test/resources/iris.arff \
-F name=iris \
http://localhost:7070/datasets201 Created:
{
"name": "iris",
"path": "iris.arff",
"format": "arff",
"numInstances": 150,
"numAttributes": 5,
"classAttribute": "class"
}The last attribute is treated as the class by convention.
curl http://localhost:7070/datasets{ "datasets": [{ "name": "iris", "format": "arff", "sizeBytes": 7045 }] }curl http://localhost:7070/datasets/iris{
"name": "iris",
"format": "arff",
"numInstances": 150,
"attributes": [
{ "name": "sepallength", "type": "numeric" },
{ "name": "sepalwidth", "type": "numeric" },
{ "name": "petallength", "type": "numeric" },
{ "name": "petalwidth", "type": "numeric" },
{ "name": "class", "type": "nominal",
"values": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"] }
],
"classAttribute": "class"
}curl -X DELETE http://localhost:7070/datasets/iris
# 204 No ContentRequest body:
| Field | Required | Notes |
|---|---|---|
dataset |
yes | Name of an uploaded dataset. |
algorithm |
yes | Fully-qualified Weka classname; must start with weka.classifiers. (allowlist). |
options |
no | Weka CLI-style options as a string array, e.g. ["-C","0.25","-M","2"]. |
modelName |
yes | Filename to persist to (no extension, no path separators). |
classIndex |
no | Zero-based index. Default -1 = last attribute. |
curl -X POST http://localhost:7070/train \
-H 'Content-Type: application/json' \
-d '{
"dataset": "iris",
"algorithm": "weka.classifiers.trees.J48",
"options": ["-C","0.25","-M","2"],
"modelName": "iris-j48"
}'201 Created:
{
"modelName": "iris-j48",
"algorithm": "weka.classifiers.trees.J48",
"trainedOn": "iris",
"trainingTimeMs": 142,
"summary": "J48 pruned tree\n------------------\n..."
}The classifier and dataset header are saved to MODELS_DIR/iris-j48.model and MODELS_DIR/iris-j48.header.
curl http://localhost:7070/models{ "models": [{ "name": "iris-j48", "sizeBytes": 8123 }] }curl http://localhost:7070/models/iris-j48{
"name": "iris-j48",
"algorithm": "weka.classifiers.trees.J48",
"summary": "J48 pruned tree\n------------------\n..."
}Deletes the .model and .header files and invalidates the in-memory cache entry.
curl -X DELETE http://localhost:7070/models/iris-j48
# 204 No Content| Field | Required | Notes |
|---|---|---|
model |
yes | Name of a trained model. |
instances |
yes | Non-empty array. Each object maps attribute name → value. Missing keys are treated as missing values. |
curl -X POST http://localhost:7070/predict \
-H 'Content-Type: application/json' \
-d '{
"model": "iris-j48",
"instances": [
{"sepallength":5.1,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2},
{"sepallength":6.7,"sepalwidth":3.0,"petallength":5.2,"petalwidth":2.3}
]
}'200 OK (nominal class):
{
"model": "iris-j48",
"predictions": [
{ "predictedClass": "Iris-setosa",
"distribution": { "Iris-setosa": 1.0, "Iris-versicolor": 0.0, "Iris-virginica": 0.0 } },
{ "predictedClass": "Iris-virginica",
"distribution": { "Iris-setosa": 0.0, "Iris-versicolor": 0.02, "Iris-virginica": 0.98 } }
]
}For a numeric class problem, distribution is omitted and predictedClass is the numeric value rendered as a string.
| Field | Required | Notes |
|---|---|---|
model |
yes | Trained model name. |
dataset |
yes | Test dataset name. Class index is taken from the model's saved header. |
curl -X POST http://localhost:7070/evaluate \
-H 'Content-Type: application/json' \
-d '{"model":"iris-j48","dataset":"iris"}'200 OK:
{
"model": "iris-j48",
"dataset": "iris",
"numInstances": 150,
"correct": 147,
"incorrect": 3,
"accuracy": 0.98,
"kappa": 0.97,
"weightedFMeasure": 0.98,
"confusionMatrix": [[50,0,0],[0,49,1],[0,2,48]],
"classLabels": ["Iris-setosa","Iris-versicolor","Iris-virginica"],
"summary": "Correctly Classified Instances ..."
}All five EDA endpoints take a dataset name in the path. None of them mutate state. They share a common sampling convention: pass sample (max 5000, default 500) and seed (default 42) to control the random shuffle used for scatter-style endpoints.
Per-attribute summary using Weka's AttributeStats.
curl 'http://localhost:7070/datasets/iris/attribute-stats?attribute=petallength'Numeric:
{
"name": "petallength",
"type": "numeric",
"count": 150, "missing": 0, "distinct": 43, "unique": 2,
"numeric": { "min": 1.0, "max": 6.9, "mean": 3.7587, "stdDev": 1.7644, "sum": 563.8 }
}Nominal:
{
"name": "class",
"type": "nominal",
"count": 150, "missing": 0, "distinct": 3, "unique": 0,
"nominalCounts": { "Iris-setosa": 50, "Iris-versicolor": 50, "Iris-virginica": 50 }
}Bulk version — returns the stats above for every attribute on the dataset.
curl http://localhost:7070/datasets/iris/summaryFor numeric attributes, equal-width bins between min and max. For nominal, one bin per value. Pass groupBy=class to break each bin down by class label.
curl 'http://localhost:7070/datasets/iris/histogram?attribute=petallength&bins=5&groupBy=class'{
"attribute": "petallength",
"type": "numeric",
"bins": [
{ "lo": 1.0, "hi": 2.18, "count": 50, "byClass": { "Iris-setosa": 50, "Iris-versicolor": 0, "Iris-virginica": 0 } },
{ "lo": 2.18, "hi": 3.36, "count": 0, "byClass": { "Iris-setosa": 0, "Iris-versicolor": 0, "Iris-virginica": 0 } }
],
"missing": 0,
"classLabels": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
}Per-point JSON for an x/y plot. Points carry a class field when the dataset has a nominal class. Set jitter=true to add Gaussian noise when an axis is nominal (does not mutate the dataset).
curl 'http://localhost:7070/datasets/iris/scatter?x=petallength&y=petalwidth&sample=200'{
"x": "petallength", "y": "petalwidth",
"xType": "numeric", "yType": "numeric",
"classAttribute": "class",
"totalInstances": 150, "sampled": 150,
"points": [{ "x": 1.4, "y": 0.2, "class": "Iris-setosa" }]
}All unordered pairs of the listed attributes (server caps at 6 attributes → up to 15 pairs).
curl 'http://localhost:7070/datasets/iris/scatter-matrix?attributes=sepallength,sepalwidth,petallength&sample=200'{
"attributes": ["sepallength", "sepalwidth", "petallength"],
"classAttribute": "class",
"totalInstances": 150, "sampled": 150,
"pairs": [
{ "x": "sepallength", "y": "sepalwidth", "points": [{ "x": 5.1, "y": 3.5, "class": "Iris-setosa" }] }
]
}Lists every Weka filter discoverable on the classpath, grouped by family (unsupervised.attribute, unsupervised.instance, supervised.attribute, supervised.instance, plus misc for top-level filters like MultiFilter). Each entry carries flags so the client picker can distinguish leakage-prone supervised filters from safe unsupervised ones.
curl http://localhost:7070/filters{
"filters": {
"unsupervised.attribute": [
{ "classname": "weka.filters.unsupervised.attribute.Normalize",
"supervised": false,
"level": "attribute" }
],
"supervised.attribute": [
{ "classname": "weka.filters.supervised.attribute.Discretize",
"supervised": true,
"level": "attribute" }
],
"misc": [
{ "classname": "weka.filters.AllFilter",
"supervised": null,
"level": null }
]
}
}supervised is derived from the SupervisedFilter / UnsupervisedFilter marker interfaces — null for top-level filters that implement neither.
Per-filter introspection — globalInfo() description and the full listOptions() schema, so a client picker can render an options form without hardcoding any filter knowledge.
curl 'http://localhost:7070/filters/metadata?filter=weka.filters.unsupervised.attribute.Normalize'{
"classname": "weka.filters.unsupervised.attribute.Normalize",
"supervised": false,
"level": "attribute",
"family": "unsupervised.attribute",
"description": "Normalizes all numeric values in the given dataset (apart from the class attribute, if set)...",
"options": [
{ "name": "S", "synopsis": "-S <num>", "description": "The scaling factor (default 1.0).", "numArguments": 1, "default": "1.0" },
{ "name": "T", "synopsis": "-T <num>", "description": "The translation (default 0.0).", "numArguments": 1, "default": "0.0" },
{ "name": "unset-class-temporarily", "synopsis": "-unset-class-temporarily", "description": "...", "numArguments": 0, "default": false }
]
}For boolean flags (numArguments == 0), default is the literal true/false — Weka boolean flags default to off when absent. For value flags (numArguments >= 1), default is the stringified value (omitted when no default is set).
400 INVALID_FILTER if the FQN is outside the weka.filters. allowlist or doesn't resolve to a Filter. 400 BAD_REQUEST if the filter query param is missing.
Apply a chain of filters to an existing dataset and persist the result as a new dataset.
| Field | Required | Notes |
|---|---|---|
dataset |
yes | Source dataset name. |
filters |
yes | Non-empty array. Each element: {"filter": "<weka.filters.* FQN>", "options": [...]}. Filter must start with weka.filters. (allowlist). |
outputName |
yes | New dataset name (no extension, no path separators). |
format |
no | "arff" (default) or "csv". |
curl -X POST http://localhost:7070/transform \
-H 'Content-Type: application/json' \
-d '{
"dataset": "iris",
"filters": [
{"filter": "weka.filters.unsupervised.attribute.Normalize", "options": []}
],
"outputName": "iris-norm"
}'201 Created:
{
"name": "iris-norm",
"format": "arff",
"path": "iris-norm.arff",
"numInstances": 150,
"numAttributes": 5,
"filtersApplied": ["weka.filters.unsupervised.attribute.Normalize"]
}The new dataset is immediately usable for /train, /evaluate, EDA endpoints, etc.
Note on PCA: weka.filters.unsupervised.attribute.PrincipalComponents works on datasets with a nominal class — it ignores the class attribute when computing components and passes it through unchanged.
Same request body as POST /transform minus outputName (ignored if present). Runs the chain in memory and returns metadata plus a small sample of transformed rows. Nothing is written to DATA_DIR — use this to iterate on a filter chain before committing.
| Query | Default | Bounds |
|---|---|---|
head |
20 | 1 to 200 |
seed |
42 | any long (controls the row shuffle) |
curl -X POST 'http://localhost:7070/transform/preview?head=10' \
-H 'Content-Type: application/json' \
-d '{
"dataset": "iris",
"filters": [
{"filter": "weka.filters.unsupervised.attribute.Normalize", "options": []}
]
}'{
"dataset": "iris",
"numInstances": 150,
"totalInstances": 150,
"sampled": 10,
"numAttributes": 5,
"attributes": [
{ "name": "petallength", "type": "numeric" },
{ "name": "class", "type": "nominal", "values": ["Iris-setosa","Iris-versicolor","Iris-virginica"] }
],
"filtersApplied": ["weka.filters.unsupervised.attribute.Normalize"],
"head": [
{ "sepallength": 0.222, "sepalwidth": 0.625, "petallength": 0.068, "petalwidth": 0.042, "class": "Iris-setosa" }
]
}There are two ways to apply preprocessing before training, and they're not equivalent.
Use when the chain is purely unsupervised (Normalize, Standardize, Discretize unsupervised, PrincipalComponents, ReplaceMissingValues, etc.). The filter is fit on the full dataset once, the result is persisted, and you train on it like any other dataset.
# 1. upload raw data
curl -F file=@iris.arff -F name=iris http://localhost:7070/datasets
# 2. iterate on the filter chain in memory
curl -X POST 'http://localhost:7070/transform/preview?head=5' \
-H 'Content-Type: application/json' \
-d '{"dataset":"iris","filters":[{"filter":"weka.filters.unsupervised.attribute.Normalize","options":[]}]}'
# 3. happy with the chain → commit it
curl -X POST http://localhost:7070/transform \
-H 'Content-Type: application/json' \
-d '{"dataset":"iris","filters":[{"filter":"weka.filters.unsupervised.attribute.Normalize","options":[]}],"outputName":"iris-norm"}'
# 4. train on the new dataset
curl -X POST http://localhost:7070/train \
-H 'Content-Type: application/json' \
-d '{"dataset":"iris-norm","algorithm":"weka.classifiers.trees.J48","modelName":"iris-norm-j48"}'
# 5. predict/evaluate — features sent in are the NORMALIZED ones
curl -X POST http://localhost:7070/evaluate \
-H 'Content-Type: application/json' \
-d '{"model":"iris-norm-j48","dataset":"iris-norm"}'Path A is unsafe when the filter is supervised — i.e. when it looks at the class attribute to decide its parameters. Common offenders:
weka.filters.supervised.attribute.Discretize— Fayyad–Irani MDL discretization uses class infoweka.filters.supervised.attribute.AttributeSelection— feature selection by class-driven scoringweka.filters.supervised.attribute.NominalToBinary— class-aware one-hotweka.filters.supervised.attribute.MergeNominalValues
Running these via /transform on your full dataset and then training on the result leaks class signal into the features. Cross-validation on the transformed dataset will report optimistic accuracy that won't hold on held-out data. The supervised: true flag in /filters is your signal to use Path B instead.
Use whenever any filter in your chain is supervised. The request looks the same as a normal /train, but with an extra filters array. The API wraps your classifier in weka.classifiers.meta.FilteredClassifier; the filter is fit on each training fold's data only, not the whole dataset.
# 1. train — filters embedded in the model, raw dataset name
curl -X POST http://localhost:7070/train \
-H 'Content-Type: application/json' \
-d '{
"dataset": "iris",
"algorithm": "weka.classifiers.trees.J48",
"modelName": "iris-fc-j48",
"filters": [
{"filter": "weka.filters.supervised.attribute.AttributeSelection",
"options": ["-E","weka.attributeSelection.CfsSubsetEval","-S","weka.attributeSelection.BestFirst"]}
]
}'
# 2. predict on RAW features — the model applies the filter chain internally
curl -X POST http://localhost:7070/predict \
-H 'Content-Type: application/json' \
-d '{"model":"iris-fc-j48","instances":[{"sepallength":5.1,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2}]}'
# 3. evaluate on the RAW dataset — same story
curl -X POST http://localhost:7070/evaluate \
-H 'Content-Type: application/json' \
-d '{"model":"iris-fc-j48","dataset":"iris"}'The training response echoes the applied filter chain in a filters field. All downstream endpoints — /predict, /evaluate, /diagnostics/*, /models/{name}/tree — work transparently; the wrapped classifier delegates Drawable/Classifier calls through to the base learner.
Single-filter chains skip MultiFilter for efficiency; chains of two or more get bundled into a weka.filters.MultiFilter automatically.
For classifiers that implement weka.core.Drawable (most tree classifiers and Bayes nets), you can extract the model structure as Graphviz DOT.
curl http://localhost:7070/models/iris-j48/drawable-type{ "name": "iris-j48", "type": "tree" }Possible types: "tree", "graph" (Bayes net), "newick", or "none".
Returns the classifier's tree in Graphviz DOT for tree-based classifiers (J48, RandomTree, REPTree, M5P, LMT, HoeffdingTree). 400 NOT_DRAWABLE if the classifier isn't a tree.
curl http://localhost:7070/models/iris-j48/tree{
"name": "iris-j48",
"type": "tree",
"format": "dot",
"graph": "digraph J48Tree {\nN0 [label=\"petalwidth\"]\n..."
}Same shape, for Bayes-net classifiers (weka.classifiers.bayes.BayesNet).
All five POST endpoints share the same request shape:
| Field | Required | Notes |
|---|---|---|
model |
yes | Trained model name. |
dataset |
yes | Evaluation dataset. |
classValue |
no | Nominal class label (defaults to index 0). |
bins |
no | Number of bins for calibration (default 10, max 100). |
sample |
no | Max points to return for /errors (default 500, max 5000). |
seed |
no | Random seed for sampling (default 42). |
Per-instance predicted vs. actual — Weka's "Visualize classifier errors".
curl -X POST http://localhost:7070/diagnostics/errors \
-H 'Content-Type: application/json' \
-d '{"model":"iris-j48","dataset":"iris"}'{
"model": "iris-j48", "dataset": "iris",
"classType": "nominal",
"totalInstances": 150, "sampled": 150,
"points": [
{ "index": 0, "actual": "Iris-setosa", "predicted": "Iris-setosa", "correct": true }
]
}For numeric class: points[i] has actual, predicted, and error (|p - a|).
ROC / threshold curve for a chosen positive class. Includes AUC. Numeric-class models → 400 NOT_NOMINAL_CLASS.
curl -X POST http://localhost:7070/diagnostics/threshold-curve \
-H 'Content-Type: application/json' \
-d '{"model":"iris-j48","dataset":"iris","classValue":"Iris-versicolor"}'{
"model": "iris-j48", "dataset": "iris",
"classValue": "Iris-versicolor",
"auc": 0.99,
"points": [
{ "threshold": 0.0, "truePositiveRate": 1.0, "falsePositiveRate": 1.0,
"precision": 0.33, "recall": 1.0, "fMeasure": 0.5 }
]
}Cumulative margin distribution — Weka's MarginCurve. Useful for ensemble diagnostics (AdaBoost, Bagging) to see whether margins improve as boosting iterations accrue. Numeric-class models → 400 NOT_NOMINAL_CLASS.
curl -X POST http://localhost:7070/diagnostics/margin-curve \
-H 'Content-Type: application/json' \
-d '{"model":"iris-j48","dataset":"iris"}'{
"model": "iris-j48",
"dataset": "iris",
"points": [
{ "margin": -1.0, "current": 0.0, "cumulative": 0.0 },
{ "margin": 0.92, "current": 3.0, "cumulative": 0.02 },
{ "margin": 1.0, "current": 147.0, "cumulative": 1.0 }
]
}Field meaning: margin is the difference between the probability assigned to the actual class and the highest probability for any other class. current and cumulative together describe the empirical CDF of margins.
Drummond–Holte cost curve for a chosen positive class. Plots the Normalized Expected Cost against the Probability Cost Function — i.e. how sensitive the model's expected cost is to changes in class skew. Numeric-class models → 400 NOT_NOMINAL_CLASS.
curl -X POST http://localhost:7070/diagnostics/cost-curve \
-H 'Content-Type: application/json' \
-d '{"model":"iris-j48","dataset":"iris","classValue":"Iris-versicolor"}'{
"model": "iris-j48",
"dataset": "iris",
"classValue": "Iris-versicolor",
"points": [
{ "probabilityCostFunction": 0.0, "normalizedExpectedCost": 0.0 },
{ "probabilityCostFunction": 0.5, "normalizedExpectedCost": 0.04 },
{ "probabilityCostFunction": 1.0, "normalizedExpectedCost": 0.0 }
]
}Plot probabilityCostFunction on the x-axis and normalizedExpectedCost on the y-axis. A flat curve near zero means the model is robust to class skew; a curve that spikes near the centre means cost is highly sensitive.
Reliability diagram + Brier score. Manually binned because Weka doesn't ship a first-party utility.
curl -X POST http://localhost:7070/diagnostics/calibration \
-H 'Content-Type: application/json' \
-d '{"model":"iris-j48","dataset":"iris","classValue":"Iris-versicolor","bins":10}'{
"model": "iris-j48", "dataset": "iris",
"classValue": "Iris-versicolor",
"brierScore": 0.04,
"totalInstances": 150,
"bins": [
{ "bin": 0, "lo": 0.0, "hi": 0.1, "count": 99, "predictedProb": 0.01, "observedFraction": 0.0 }
]
}| Code | HTTP | When |
|---|---|---|
DATASET_NOT_FOUND |
404 | No file at DATA_DIR/{name}.{ext} |
MODEL_NOT_FOUND |
404 | No file at MODELS_DIR/{name}.model |
INVALID_NAME |
400 | Name contains /, \, or .. |
INVALID_ALGORITHM |
400 | Classname outside weka.classifiers. allowlist |
INVALID_FORMAT |
400 | Unsupported file extension |
UPLOAD_TOO_LARGE |
413 | Exceeds MAX_UPLOAD_MB |
TRAINING_FAILED |
422 | Weka threw during buildClassifier |
PREDICTION_FAILED |
422 | Weka threw during prediction |
EVALUATION_FAILED |
422 | Weka threw during evaluation |
INVALID_FILTER |
400 | Filter outside weka.filters. allowlist or unknown class |
TRANSFORM_FAILED |
422 | Weka threw applying a filter chain |
INVALID_ATTRIBUTE |
400 | Attribute name not on dataset |
INVALID_CLASS_VALUE |
400 | Class value not in the class attribute's domain |
NOT_DRAWABLE |
400 | Classifier doesn't implement Drawable / wrong graph type |
NOT_NOMINAL_CLASS |
400 | Diagnostic requires a nominal class but the model's class is numeric |
NOT_NUMERIC_CLASS |
400 | Reserved for future numeric-class diagnostics |
BAD_REQUEST |
400 | Malformed JSON or missing required fields |
INTERNAL_ERROR |
500 | Anything uncaught (also logged with full stacktrace) |
Once the stack is up:
# 1. upload the sample iris dataset that ships with the repo
curl -F file=@api/src/test/resources/iris.arff \
-F name=iris \
http://localhost:7070/datasets
# 2. train a J48 decision tree
curl -X POST http://localhost:7070/train \
-H 'Content-Type: application/json' \
-d '{"dataset":"iris","algorithm":"weka.classifiers.trees.J48","modelName":"iris-j48"}'
# 3. predict on two flowers
curl -X POST http://localhost:7070/predict \
-H 'Content-Type: application/json' \
-d '{"model":"iris-j48","instances":[
{"sepallength":5.1,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2},
{"sepallength":6.7,"sepalwidth":3.0,"petallength":5.2,"petalwidth":2.3}
]}'
# 4. evaluate against the training set (sanity check)
curl -X POST http://localhost:7070/evaluate \
-H 'Content-Type: application/json' \
-d '{"model":"iris-j48","dataset":"iris"}'
# 4a. EDA on the dataset
curl 'http://localhost:7070/datasets/iris/summary'
curl 'http://localhost:7070/datasets/iris/histogram?attribute=petallength&bins=10&groupBy=class'
curl 'http://localhost:7070/datasets/iris/scatter?x=petallength&y=petalwidth'
# 4b. apply a filter to produce a new dataset
curl -X POST http://localhost:7070/transform \
-H 'Content-Type: application/json' \
-d '{"dataset":"iris","filters":[{"filter":"weka.filters.unsupervised.attribute.Normalize","options":[]}],"outputName":"iris-norm"}'
# 4c. post-training diagnostics
curl -X POST http://localhost:7070/diagnostics/threshold-curve \
-H 'Content-Type: application/json' \
-d '{"model":"iris-j48","dataset":"iris","classValue":"Iris-versicolor"}'
curl http://localhost:7070/models/iris-j48/tree
# 5. confirm persistence: stop and restart
docker compose down
docker compose up -d
curl http://localhost:7070/models # iris-j48 still listeddocker: command not found / Docker isn't running.
Install/start Docker Desktop. docker info should return a daemon block.
docker compose complains it isn't a recognised command.
You're on Compose v1. Either upgrade to Compose v2 (ships with Docker Desktop) or substitute docker-compose everywhere in this README.
bind: address already in use on port 7070.
Something already owns that port. Either kill it (lsof -nP -iTCP:7070 -sTCP:LISTEN) or change the host port mapping in compose.yaml:
ports:
- "127.0.0.1:7171:7070" # external 7171 → internal 7070First build hangs at dependency:go-offline.
Maven is fetching Weka (~20 MB) and its transitive deps. Give it 2–5 minutes on a fresh cache. Subsequent builds reuse the BuildKit cache mount.
Build fails resolving weka-dev:3.9.7-SNAPSHOT.
The repo defaults to the stable 3.9.6 release for exactly this reason. If you want to try the snapshot, set <weka.version>3.9.7-SNAPSHOT</weka.version> in api/pom.xml — it's pulled from https://oss.sonatype.org/content/repositories/snapshots/.
GET /algorithms returns a short list.
The endpoint relies on Weka's ClassDiscovery. If it can't enumerate the classpath at runtime, the controller falls back to a curated set of common classifiers (J48, RandomForest, NaiveBayes, Logistic, IBk, etc.). All Weka classifiers on the classpath are still usable via /train — the listing is for convenience.
/predict returns 400 BAD_REQUEST "attribute X value not in domain".
Nominal attributes only accept values from the set declared in the training data. Check GET /datasets/{name} to see the allowed values.
/predict returns 404 MODEL_NOT_FOUND after restart.
Make sure ./models is bind-mounted (it is by default in compose.yaml). If you started the container with docker run directly without the mount, models won't persist.
mvn test fails with class file has wrong version.
You're on a JDK older than 17. Install JDK 17 (brew install --cask temurin@17) and either set JAVA_HOME or use jenv/sdkman to switch.
Container starts but curl hangs.
Check the logs: docker compose logs weka-api. If you see Address already in use, see the port note above. If the JVM crashed during init, the stack trace will be there.
Builds against the stable weka-dev release (3.9.6) from Maven Central. The spec preferred 3.9.7-SNAPSHOT but allows the documented downgrade when the snapshot is unavailable; flip <weka.version> in api/pom.xml if you want the snapshot.