unicode input to mitlm #16

Open
GoogleCodeExporter opened this issue Oct 8, 2015 · 4 comments

Comments

@GoogleCodeExporter

Dear Sir,

I would like to make a request.

Could you modify mitlm to accept Unicode files as input?

Original issue reported on code.google.com by gsrvijay...@gmail.com on 22 Jan 2010 at 6:07

@GoogleCodeExporter

Hi, you could probably work around this by mapping every Unicode word in your corpus to an ID and running mitlm on the mapped corpus. The resulting language model will contain only IDs, so you also need to map them back to Unicode strings afterwards.
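
A minimal sketch of the mapping idea above, in Python. The helper names (`build_vocab`, `encode_lines`, `decode_line`) and the `w<N>` ID scheme are assumptions made for illustration; they are not part of mitlm. It also assumes tokens are separated by whitespace.

```python
import re


def build_vocab(lines):
    """Assign an ASCII ID (w0, w1, ...) to each distinct whitespace-separated token."""
    word_to_id = {}
    for line in lines:
        for word in line.split():
            if word not in word_to_id:
                word_to_id[word] = f"w{len(word_to_id)}"
    return word_to_id


def encode_lines(lines, word_to_id):
    """Rewrite each corpus line using only the ASCII IDs."""
    return [" ".join(word_to_id[w] for w in line.split()) for line in lines]


# Matches a whole token of the hypothetical form w<N>, bounded by whitespace.
_ID_TOKEN = re.compile(r"(?<!\S)w\d+(?!\S)")


def decode_line(line, id_to_word):
    """Map IDs in one line of model output back to the original words.

    Anything that is not a known ID (probabilities, <s>, </s>) is left alone,
    and the original spacing, including tabs, is preserved.
    """
    return _ID_TOKEN.sub(lambda m: id_to_word.get(m.group(0), m.group(0)), line)
```

Usage would be roughly: build the vocabulary from the Unicode corpus, write the encoded lines to a file for mitlm, then run `decode_line` over each line of the resulting model with the inverted dictionary (`{v: k for k, v in word_to_id.items()}`).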

If your Unicode characters can be converted to an 8-bit charset, try iconv. For Czech I use 'iconv -f utf8 -t iso8859-2 < corpora_unicode.txt > corpora_iso.txt'
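
Before converting, it may help to check whether the corpus actually fits in the target charset. The following is a small Python sketch under that assumption; `unencodable_chars` is a hypothetical helper, and `iso8859-2` is used only because it matches the Czech example above.

```python
def unencodable_chars(text, charset="iso8859-2"):
    """Return the set of characters in `text` that `charset` cannot represent."""
    bad = set()
    for ch in set(text):
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            bad.add(ch)
    return bad
```

If the returned set is empty, the conversion should be lossless; otherwise the listed characters would need to be replaced or handled with the ID-mapping approach instead.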

Good luck,
Miso

Original comment by michal.f...@gmail.com on 23 Feb 2010 at 9:22

@GoogleCodeExporter

UTF-8 without a BOM seems to work fine under Windows (0.4) and under Ubuntu (r50). Interestingly, it doesn't work under WINE (0.4).

This is the beauty of UTF-8.

Note: I haven't checked UTF-8 with a BOM. It would be nice if mitlm ignored the BOM, if it doesn't already.
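
Until BOM handling in mitlm is confirmed, one option is to strip the BOM from the input beforehand. This is a minimal Python sketch of that preprocessing step; `strip_utf8_bom` is a hypothetical helper, not something mitlm provides.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

When reading text rather than bytes, opening the file with `encoding="utf-8-sig"` has the same effect, since that codec skips a leading BOM on decode.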

Original comment by adubin...@almson.net on 2 Nov 2012 at 7:58

@GoogleCodeExporter

Issue 21 has been merged into this issue.

Original comment by giuliop...@gmail.com on 3 Feb 2013 at 6:27

@GoogleCodeExporter

I'm going to open a separate ticket for this, but in case anyone else lands on this issue looking for a solution to MITLM seemingly taking a dislike to certain characters in its input: I found that when using count files, a # character at the start of a token will cause it to crash with:

estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)

I'm guessing it interprets # as a comment marker if it occurs at the start of a line of text in the counts file. That's not very helpful, especially since estimate-ngram -wc will itself write out lines beginning with # if a token beginning with # (like a hashtag) occurs in the source text.
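
To see whether a counts file is affected, the lines that begin with # can be listed before passing the file to estimate-ngram. This is a rough Python sketch; `find_hash_lines` is a hypothetical helper, and it relies only on the first character of each line, not on any particular counts-file layout.

```python
def find_hash_lines(lines):
    """Return (line_number, line) pairs for lines whose first character is '#'."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if line.startswith("#")
    ]
```

It can be fed an open file directly, for example `find_hash_lines(open(path, encoding="utf-8"))`. Whether such lines are really treated as comments is still only a guess, so this detects candidates rather than confirming the cause.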

Original comment by matt...@swiftkey.com on 10 Feb 2015 at 6:31
