unicode input to mitlm #16

GoogleCodeExporter · 2015-10-08T09:12:01Z

Dear Sir,

I would like to make a request.

Can you modify mitlm to take Unicode files as input?

Original issue reported on code.google.com by gsrvijay...@gmail.com on 22 Jan 2010 at 6:07


GoogleCodeExporter · 2015-10-08T09:12:01Z

Hi, you could probably work around this by mapping all Unicode words in your
corpora to IDs and running mitlm on the mapped corpora. The resulting language
model will contain only IDs, so you also need to map them back to Unicode
strings afterwards.
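
A minimal Python sketch of that mapping idea, in case it helps. The `w<N>` ID scheme and the function names are my own assumptions, not anything mitlm provides; tokens that are not in the table (for example `<s>` or the numbers in an ARPA file) are passed through unchanged.

```python
def build_vocab(lines):
    """Assign an ASCII-only ID to every distinct whitespace-separated token."""
    word_to_id = {}
    for line in lines:
        for word in line.split():
            if word not in word_to_id:
                # The "w" prefix keeps IDs from looking like plain numbers.
                word_to_id[word] = "w%d" % len(word_to_id)
    return word_to_id


def translate(line, table):
    """Replace each token found in `table`; leave unknown tokens as they are."""
    return " ".join(table.get(tok, tok) for tok in line.split())


corpus = ["příliš žluťoučký kůň", "kůň úpěl"]
word_to_id = build_vocab(corpus)
id_to_word = {i: w for w, i in word_to_id.items()}

mapped = [translate(line, word_to_id) for line in corpus]    # feed this to mitlm
restored = [translate(line, id_to_word) for line in mapped]  # inverse mapping
```

The same `translate` call with `id_to_word` can be run over each line of the resulting model file to get the Unicode strings back.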

If your Unicode characters can be converted to some 8-bit charset, try iconv.
For Czech I use
'iconv -f utf8 -t iso8859-2 < corpora_unicode.txt > corpora_iso.txt'
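
If iconv is not available, roughly the same conversion can be sketched in Python. The function names below are hypothetical; the default strict error handling means a character with no ISO-8859-2 equivalent raises an error instead of being silently dropped.

```python
def utf8_to_latin2(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-2 (Latin-2).

    Raises UnicodeEncodeError if a character has no Latin-2 code point.
    """
    return data.decode("utf-8").encode("iso-8859-2")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a whole corpus file, reading and writing raw bytes."""
    with open(src_path, "rb") as src:
        converted = utf8_to_latin2(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)


# Example: convert_file("corpora_unicode.txt", "corpora_iso.txt")
```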

Good luck,
Miso

Original comment by michal.f...@gmail.com on 23 Feb 2010 at 9:22

GoogleCodeExporter · 2015-10-08T09:12:01Z

UTF-8 without BOM seems to work fine under Windows (0.4) and under Ubuntu 
(r50). Interestingly, it doesn't work under WINE (0.4).

This is the beauty of UTF-8.

Note: I haven't checked UTF-8 with BOM. It would be nice if mitlm ignored the
BOM, if it doesn't already.
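
Until that is confirmed, the BOM can be stripped before the file reaches mitlm. A small Python sketch (the function name is my own; this is preprocessing, not a mitlm feature):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Opening the file in text mode with `encoding="utf-8-sig"` has the same effect when reading.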

Original comment by adubin...@almson.net on 2 Nov 2012 at 7:58

GoogleCodeExporter · 2015-10-08T09:12:01Z

Issue 21 has been merged into this issue.

Original comment by giuliop...@gmail.com on 3 Feb 2013 at 6:27

GoogleCodeExporter · 2015-10-08T09:12:01Z

I'm going to open a separate ticket for this, but in case anyone else is
looking at this issue for a solution to MITLM seemingly taking a dislike to
certain characters in its input: I found that when using count files, a #
character at the start of a token causes it to crash with:

estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)

I'm guessing it interprets # as a comment if it occurs at the start of a line 
of text in the counts file. Not very helpful, especially since estimate-ngram 
-wc will itself write out lines beginning with # if a token beginning with # 
(like a hashtag) occurs in the source text.
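
One possible workaround, assuming the guess above is right and only a leading # causes trouble: escape it in the source text before running mitlm and restore it afterwards. This is an untested sketch; the `__HASH__` placeholder is arbitrary and must not occur in the corpus.

```python
PLACEHOLDER = "__HASH__"  # hypothetical marker, chosen only for illustration


def escape_line(line):
    """Replace a leading '#' on each token with the placeholder."""
    return " ".join(
        PLACEHOLDER + tok[1:] if tok.startswith("#") else tok
        for tok in line.split()
    )


def unescape_line(line):
    """Undo escape_line on text read back from the counts or model file."""
    return " ".join(
        "#" + tok[len(PLACEHOLDER):] if tok.startswith(PLACEHOLDER) else tok
        for tok in line.split()
    )
```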

Original comment by matt...@swiftkey.com on 10 Feb 2015 at 6:31

GoogleCodeExporter added Priority-Medium Type-Defect auto-migrated labels Oct 8, 2015

GoogleCodeExporter mentioned this issue Oct 8, 2015

UTF-8 support #21

Closed
