A tool for converting Mallet InstanceList files between versions using JSON as an intermediate format.
Mallet 2.0.8 used GNU Trove for primitive collections, while Mallet 2.1+ uses HPPC. This change breaks Java serialization compatibility, so data files written by the old version cannot be loaded directly by the new one. This tool converts old Mallet data files to the new format.
Requires Java 11+ and Maven.
# Build reader (for Mallet 2.0.8 files)
mvn clean package -P mallet-208 -DskipTests
# Build writer (for Mallet 2.1 files)
mvn clean package -P mallet-21 -DskipTests
Note: The mallet-21 profile expects Mallet 2.1.0 at ../Mallet/target/mallet-2.1.0.jar. Adjust the path in pom.xml if needed.
Step 1: Convert old format to JSON (using reader built with Mallet 2.0.8)
mvn package -P mallet-208 -DskipTests
java -cp "target/mallet-json-1.0.0.jar:$(mvn -P mallet-208 dependency:build-classpath -q -Dmdep.outputFile=/dev/stdout)" \
cc.mallet.json.MalletJsonConverter to-json \
-i old_data.mallet \
-o data.json \
--pretty
Step 2: Convert JSON to new format (using writer built with Mallet 2.1)
mvn package -P mallet-21 -DskipTests
java -cp "target/mallet-json-1.0.0.jar:$(mvn -P mallet-21 dependency:build-classpath -q -Dmdep.outputFile=/dev/stdout)" \
cc.mallet.json.MalletJsonConverter from-json \
-i data.json \
-o new_data.mallet
Commands:
to-json - Convert Mallet binary to JSON
from-json - Convert JSON to Mallet binary
convert - Direct conversion (requires same Mallet version for read/write)
Options:
-i, --input - Input file (required)
-o, --output - Output file (required)
--pretty - Pretty-print JSON output
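The commands and options above can be assembled programmatically when the two conversion steps need to be scripted. The following is a minimal sketch in Python, not part of the tool: the helper name `converter_command` is hypothetical, and the classpath argument is a placeholder that in practice must also include the Mallet dependencies for the chosen profile. It only builds the argument list and does not run Java.

```python
import shlex

# Main class and subcommands as listed in this README.
MAIN_CLASS = "cc.mallet.json.MalletJsonConverter"
MODES = ("to-json", "from-json", "convert")


def converter_command(mode, input_path, output_path, classpath, pretty=False):
    """Build the java argument list for one converter invocation.

    Hypothetical helper: it mirrors the documented flags (-i, -o, --pretty)
    and performs no conversion itself.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    cmd = ["java", "-cp", classpath, MAIN_CLASS, mode,
           "-i", input_path, "-o", output_path]
    if pretty:
        # --pretty is documented as affecting JSON output only.
        cmd.append("--pretty")
    return cmd


if __name__ == "__main__":
    # Placeholder classpath: the real one also needs the profile's dependencies.
    step1 = converter_command("to-json", "old_data.mallet", "data.json",
                              "target/mallet-json-1.0.0.jar", pretty=True)
    print(shlex.join(step1))
```

Returning a list rather than a single string keeps the result safe to pass to `subprocess.run` without shell quoting concerns.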
Supported data:
InstanceList with FeatureSequence or FeatureVector data
Alphabet and LabelAlphabet
Label and LabelVector targets
Instance properties and weights
The intermediate JSON format captures:
{
"version": "1.0",
"alphabets": {
"data": { "id": "...", "entries": ["word1", "word2", ...] },
"target": { "id": "...", "entries": ["label1", "label2", ...] }
},
"instances": [
{
"name": "doc1",
"data": { "type": "FeatureSequence", "features": [0, 5, 12, ...] },
"target": { "type": "Label", "index": 0 }
}
]
}
MIT License - see LICENSE file.
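Because the intermediate file is plain JSON, it can be sanity-checked between the two conversion steps. The sketch below is an illustration in Python, not part of the tool. It assumes the layout shown in the JSON sample above and only inspects `FeatureSequence` data and `Label` targets, since those are the only shapes documented here; other types are skipped. The function name `summarize`, the sample document, and its `id` values are hypothetical.

```python
import json

# Hypothetical sample following the documented layout; doc2 deliberately
# contains a feature index (7) that is outside the 3-entry data alphabet.
SAMPLE = """
{
  "version": "1.0",
  "alphabets": {
    "data":   {"id": "a1", "entries": ["apple", "banana", "cherry"]},
    "target": {"id": "a2", "entries": ["fruit", "other"]}
  },
  "instances": [
    {"name": "doc1",
     "data": {"type": "FeatureSequence", "features": [0, 2, 2]},
     "target": {"type": "Label", "index": 0}},
    {"name": "doc2",
     "data": {"type": "FeatureSequence", "features": [1, 7]},
     "target": {"type": "Label", "index": 1}}
  ]
}
"""


def summarize(doc):
    """Count instances and tokens, and report indices outside the alphabets."""
    vocab = doc["alphabets"]["data"]["entries"]
    labels = doc["alphabets"]["target"]["entries"]
    tokens = 0
    problems = []
    for pos, inst in enumerate(doc["instances"]):
        name = inst.get("name") or f"instance #{pos}"
        data = inst.get("data") or {}
        if data.get("type") == "FeatureSequence":
            for idx in data["features"]:
                tokens += 1
                if not 0 <= idx < len(vocab):
                    problems.append(
                        f"{name}: feature index {idx} outside vocabulary of size {len(vocab)}")
        target = inst.get("target") or {}
        if target.get("type") == "Label":
            idx = target["index"]
            if not 0 <= idx < len(labels):
                problems.append(
                    f"{name}: label index {idx} outside label alphabet of size {len(labels)}")
    return {"instances": len(doc["instances"]), "vocabulary": len(vocab),
            "labels": len(labels), "tokens": tokens, "problems": problems}


if __name__ == "__main__":
    report = summarize(json.loads(SAMPLE))
    print(report)
```

To check a real file, replace `json.loads(SAMPLE)` with `json.load(open("data.json"))`. An empty `problems` list means every inspected index resolves to an alphabet entry.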