In the Corpus Formatting Notes, it states that the input data should be tokenized. I'm unclear what the actual formatting of the input text should be. Is it sufficient to have each sentence on a separate line? Does each sentence need to be word tokenized?
It's generally good practice to tokenize first so that "foo," turns into "foo ," instead of "foo," and "foo" being treated as separate words. However, nobody can agree on what the exact tokenization rules should be. https://github.com/kpu/preprocess has some tools based on the Moses tokenizer.
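To make the "foo," versus "foo ," point concrete, here is a minimal sketch in Python. It is not the Moses tokenizer and not anything from kpu/preprocess; `simple_tokenize` is a hypothetical helper that only splits punctuation away from words with a regular expression, so treat its rules as an assumption for illustration.

```python
import re

# Hypothetical, deliberately naive rule (an assumption, not the Moses rules):
# a token is either a run of word characters or a single punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(line: str) -> str:
    """Return the line with tokens separated by single spaces."""
    return " ".join(_TOKEN_RE.findall(line))


# "foo," becomes the two tokens "foo" and ",", so "foo" is counted as the
# same word whether or not a comma follows it.
print(simple_tokenize("foo, bar."))  # prints: foo , bar .
```

A rule this simple also splits contractions and abbreviations ("don't" becomes "don ' t"), which is one reason real tokenizers carry many special cases and why people disagree about the details.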