Releases: kmpoon/hlta
v2.3
Version 2.3 Hotfix 3 (updated on Jan 28, 2019)
Requirement: Java 8
Features:
Switched to using Stanford NLP to preprocess English
Added correlation test in HLTA model building (optional)
Removed unnecessary .dict.csv files in converting text to data
Seed words can now read .dict.csv files directly
Reduced HLTA-deps.jar size
Bug fixes:
Upgraded pdfbox in an attempt to resolve the pdfbox.baseParser.pushBackSize issue
Fixed the processing of words hyphenated across line breaks in PDF (e.g. "bro-\r\nken word")
Fixed the library issue in TopicCompactness
Fixed the issue of having infinity in TopicCoherence
Fixed wrong directory for website dependencies
Fixed large memory consumption in reading .sparse.txt
This version includes all functions and new features from v2.2
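The PDF fix listed above concerns words that are hyphenated across a line break, such as "bro-\r\nken word". The release notes do not describe how HLTA handles this, so the following is only a minimal sketch of one common approach, assuming a regular expression that removes a hyphen and line break sitting between a letter and a lowercase letter. The class name Dehyphenator and the method join are hypothetical and are not part of HLTA.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not HLTA's actual implementation.
final class Dehyphenator {
    // Matches a hyphen followed by a line break (\r\n, \n or \r), but only when
    // it is preceded by a letter and followed by a lowercase letter.
    private static final Pattern BROKEN_WORD =
            Pattern.compile("(?<=\\p{L})-(?:\\r\\n|\\n|\\r)(?=\\p{Ll})");

    // Removes the hyphen and the line break so the two halves form one word.
    static String join(String text) {
        return BROKEN_WORD.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(join("bro-\r\nken word")); // prints "broken word"
    }
}
```

A limitation of this sketch is that a genuine compound broken at its hyphen (for example "well-" at the end of a line followed by "known") also loses its hyphen; telling the two cases apart would need a dictionary lookup.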
v2.2: multi-core version release
Features:
- Multi-core training
- Removed island bridging (old feature)
- Released multi-machine support via MPI on the MultiMachineByMPI branch
- Updated the readme (explains the options and assembly)
- Released a new HLTA.jar
Note:
- The HLTA.jar of this version is only for training (subroutine2); please use the HLTA.jar from an older version for topic building and prediction.
v2.1
Please download the HLTA-deps.jar from v2.0
Features:
Added log-likelihood evaluation support for various data formats
Added support for splitting data into a training set and a testing set
Made the number of keywords in NDT Doc2VecAssignment adjustable
Bug fixes:
Fixed a bug that turned all LDA data into binary values
Fixed failure in ExtractTopicTree --broad option
Fixed failure in preprocessing Chinese
2.0
Requirement: Java 8
New Features:
- All-in-one command for hierarchical topic detection
- Webpage visualization with direct link to corresponding documents
- Evaluation metrics: topic coherence, topic compactness (Scala version)
- Allow input document to be listed line by line
- Supports non-ASCII characters
- Supports LDA data format
- Added option to skip tree level
- Simplified HLTA parameters
- Supports seed words of any word length
- Parallel computation in computing word-pair MI
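The last item above mentions computing word-pair mutual information (MI) in parallel. The release notes give no implementation details, so the following is only a sketch under stated assumptions: documents are represented as a binary word-occurrence matrix, MI is computed from the empirical 2x2 joint distribution of each word pair, and the outer loop over words is parallelized with a Java parallel stream. The class PairwiseMI and its methods are hypothetical names, not HLTA's API.

```java
import java.util.stream.IntStream;

// Hypothetical illustration; HLTA's own implementation may differ.
final class PairwiseMI {
    // docs[d][w] is true when word w occurs in document d (assumed input format).
    static double mutualInformation(boolean[][] docs, int a, int b) {
        int n = docs.length;
        double[][] joint = new double[2][2]; // empirical joint distribution of (a, b)
        for (boolean[] doc : docs) {
            joint[doc[a] ? 1 : 0][doc[b] ? 1 : 0] += 1.0 / n;
        }
        double mi = 0.0;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                double pxy = joint[x][y];
                double px = joint[x][0] + joint[x][1]; // marginal of word a
                double py = joint[0][y] + joint[1][y]; // marginal of word b
                if (pxy > 0) {
                    mi += pxy * Math.log(pxy / (px * py)); // natural log, in nats
                }
            }
        }
        return mi;
    }

    // Fills a symmetric matrix of pairwise MI values. Each pair (a, b) with a < b
    // is handled by exactly one task, so no two tasks write to the same cell.
    static double[][] allPairs(boolean[][] docs, int vocabSize) {
        double[][] mi = new double[vocabSize][vocabSize];
        IntStream.range(0, vocabSize).parallel().forEach(a -> {
            for (int b = a + 1; b < vocabSize; b++) {
                double v = mutualInformation(docs, a, b);
                mi[a][b] = v;
                mi[b][a] = v;
            }
        });
        return mi;
    }
}
```

Parallelizing over the first word of each pair keeps the tasks independent, which avoids any locking; the rows get shorter as the index grows, so the work per task is uneven, and a real implementation might balance it differently.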
Other changes:
- Narrowly Defined Topics are now used by default
- Scala calls use Stepwise EM for parameter estimation
- User defined encoding scheme in data conversion
- Pre-processor now removes punctuation instead of replacing it with underscores
- Subroutines now accept all data formats, with sparse data as the default format
- Data Conversion now outputs only the sparse data format by default
- Data Conversion now reads PDF directly
- Sparse data format now counts docId from 0
- HLCM data format now uses extension .hlcm
- Legacy fixes for collisions with .bif format reserved words
- Fixed invalid JSON format
1.4.1
1.4
1.3
1.2
General improvements and fixes:
- Added support for StochasticPEM
- Added support for narrowly defined topics
- Updated the n-gram algorithm, which now controls the number of concatenations used for building n-grams
- Moved the Convert object from tm.pdf package to tm.text package
- Combined the steps for extracting topics and generating the JavaScript topic tree
- Topic sizes in the topic tree are now shown to 3 decimal places
Only the HLTA.jar has been updated. The HLTA-deps.jar in the previous release can be used.