Releases: kmpoon/hlta

v2.3

04 Nov 07:47

Version 2.3 Hotfix 3 (updated on Jan 28, 2019)

Requirement: Java 8

Features:
Switched to Stanford NLP for preprocessing English text
Added an optional correlation test to HLTA model building
Removed unnecessary .dict.csv files when converting text to data
Seed words can now be read directly from .dict.csv files
Reduced HLTA-deps.jar size

Bug fixes:
Upgraded pdfbox in an attempt to solve the pdfbox.baseParser.pushBackSize issue
Fixed the processing of words hyphenated across a line break (e.g. "bro-\r\nken word") in PDFs
Fixed the library issue in TopicCompactness
Fixed the issue of having infinity in TopicCoherence
Fixed wrong directory for website dependencies
Fixed large memory consumption in reading .sparse.txt
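
The "bro-\r\nken word" fix above concerns words that PDF extraction splits with a hyphen at the end of a line. The release notes do not show how HLTA handles this, so the following is only a minimal sketch of one common approach. The class name Dehyphenate and the method rejoin are hypothetical and are not part of HLTA.

```java
import java.util.regex.Pattern;

public class Dehyphenate {
    // Matches a hyphen followed by a line break (\r\n, \n or \r),
    // but only when a letter sits directly on both sides of it.
    private static final Pattern BROKEN_WORD =
            Pattern.compile("(?<=\\p{L})-(?:\\r\\n|\\n|\\r)(?=\\p{L})");

    /** Rejoins words that were split by a hyphen at the end of a line. */
    public static String rejoin(String text) {
        return BROKEN_WORD.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(rejoin("bro-\r\nken word")); // prints "broken word"
    }
}
```

A rule this simple also drops the hyphen from genuine compounds that happen to wrap at a line end (for example "well-\nknown"), so a real preprocessor may need a dictionary check on top of it.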

This version includes all functions and new features from v2.2

v2.2: release multi-core version

13 Aug 14:01

Features:

  1. Added multi-core training
  2. Removed island bridging (old feature)
  3. Released multi-machine training via MPI on the MultiMachineByMPI branch
  4. Updated the readme (explains the options and assembly)
  5. Released a new HLTA.jar

Note:

  1. The HLTA.jar of this version is only for training (subroutine2). Please use the HLTA.jar from an older version for topic building and prediction.

v2.1

06 May 10:10
47e3f01

Please download the HLTA-deps.jar from v2.0

Features:
Added loglikelihood evaluation support for various data formats
Added support for training set and testing set split
Made number of keywords in NDT Doc2VecAssignment adjustable

Bug fixes:
Fixed a bug where all LDA data became binary
Fixed failure in ExtractTopicTree --broad option
Fixed failure in preprocessing Chinese

2.0

19 Mar 14:25

Requirement: Java 8

New Features:

  • All-in-one command for hierarchical topic detection
  • Webpage visualization with direct link to corresponding documents
  • Evaluation metrics: topic coherence, topic compactness (Scala version)
  • Allows input documents to be listed line by line
  • Supports non-ASCII characters
  • Supports LDA data format
  • Added option to skip tree level
  • Simplified HLTA parameters
  • Supports seedwords of any word length
  • Parallel computation in computing word-pair MI
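
As an illustration of the last item, word-pair mutual information (MI) for binary word-occurrence data can be computed independently per pair, which makes it easy to parallelize. The sketch below is not HLTA's implementation. It assumes documents are given as boolean occurrence vectors and uses Java 8 parallel streams. The class WordPairMI and both method names are hypothetical.

```java
import java.util.stream.IntStream;

public class WordPairMI {
    /**
     * Mutual information (in nats) of two binary variables,
     * given a 2x2 table of joint counts indexed as counts[a][b].
     */
    public static double mutualInformation(long[][] counts) {
        double n = counts[0][0] + counts[0][1] + counts[1][0] + counts[1][1];
        double mi = 0.0;
        for (int a = 0; a < 2; a++) {
            for (int b = 0; b < 2; b++) {
                if (counts[a][b] == 0) continue; // treat 0 * log 0 as 0
                double pab = counts[a][b] / n;
                double pa = (counts[a][0] + counts[a][1]) / n;
                double pb = (counts[0][b] + counts[1][b]) / n;
                mi += pab * Math.log(pab / (pa * pb));
            }
        }
        return mi;
    }

    /**
     * docs[d][w] is true when word w occurs in document d.
     * Returns a symmetric numWords x numWords matrix of pairwise MI.
     */
    public static double[][] pairwiseMI(boolean[][] docs, int numWords) {
        double[][] mi = new double[numWords][numWords];
        // Each value of i handles the pairs (i, j) with j > i, so no two
        // parallel tasks ever write to the same matrix cell.
        IntStream.range(0, numWords).parallel().forEach(i -> {
            for (int j = i + 1; j < numWords; j++) {
                long[][] counts = new long[2][2];
                for (boolean[] doc : docs) {
                    counts[doc[i] ? 1 : 0][doc[j] ? 1 : 0]++;
                }
                double v = mutualInformation(counts);
                mi[i][j] = v;
                mi[j][i] = v;
            }
        });
        return mi;
    }
}
```

Two words that always occur together in half of the documents give an MI of ln 2, while two independent words give 0.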

Other changes:

  • Narrowly Defined Topics are now used by default
  • Scala calls use Stepwise EM for parameter estimation
  • User-defined encoding scheme in data conversion
  • Pre-processor now removes punctuation instead of replacing it with an underscore
  • Subroutines now accept all data formats, with sparse data as the default format
  • Data conversion now outputs only the sparse data format by default
  • Data Conversion now reads PDF directly
  • Sparse data format now counts docId from 0
  • HLCM data format now uses extension .hlcm
  • Legacy fixes of collision with .bif format reserved words
  • Fixed invalid JSON format

1.4.1

27 Mar 14:14

Bug fixes:

  • Fixed a bug that caused frequent words to be ignored when building n-grams
  • Fixed a bug where an output filename could crash the reader of the HLCM data format

(HLTA-deps.jar has not been changed)

1.4

08 Mar 04:17

Bug fixes

1.3

25 Jan 07:35

Performance update:

  • Added StepwiseEMHLTA
  • Conversion now also outputs the sparse data format

1.2

16 Dec 07:02

General improvements and fixes:

  • Added support for StochasticPEM
  • Added support for narrowly defined topics
  • Updated the n-gram algorithm, which now controls the number of concatenations used to build n-grams
  • Moved the Convert object from tm.pdf package to tm.text package
  • Combined the steps for extracting topics and generating Javascript topic tree
  • Shows 3 decimal places in topic size in topic tree

Only the HLTA.jar has been updated. The HLTA-deps.jar in the previous release can be used.

1.1

31 Jul 15:15
  • Split package files into core file (HLTA.jar) and dependency file (HLTA-deps.jar).
  • Reduced memory footprint substantially for tm.pdf.Convert.
  • Better option parser and logger support for tm.pdf.Convert.

1.0.1

21 Jul 16:17

Fixed some possibly missing dependencies.