non-parametric Bayesian text segmenter implemented in Perl

Obsolete Perl code for non-parametric Bayesian text segmentation

This repository contains Perl code I no longer use, including

  • a group of Dirichlet/Pitman-Yor processes,
  • a character-bigram-based zerogram word model, and
  • unigram/bigram word models with token-based, block, and type-based sampling.

Requirements

The following CPAN modules are required:

  • Math::GSL
  • Math::Cephes
  • Regexp::Assemble
  • Carp::Assert

Run a sample script

% perl -Ilib scripts/sample-token.pl --seed=1 --type=Dirichlet --input=samples/alice.unseg --iter=100 --nested --debug --randInit=0.1