feat(languages): Add english-ngrams #109

Merged: 1 commit into max-niederman:main on Feb 5, 2024

heysokam
Contributor

Based on the app and wordlist from:
https://github.com/ranelpadon/ngram-type

max-niederman left a comment
Owner

I agree that this is a useful dictionary to add, but I'm not sure about the naming. "N-gram" can refer to sequences of any kind of symbol, including words. In fact, all of our dictionaries are lists of common 1-grams, where the symbols are words. I suggest english-nchars, unless you have another idea.
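
To make the distinction concrete, here is a minimal sketch. The `ngrams` helper is hypothetical and written only for illustration; it is not part of this project. It shows that the same definition covers both character n-grams and word 1-grams:

```python
def ngrams(symbols, n):
    """Return every contiguous run of n symbols as a tuple."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


# Character bigrams: the symbols are letters.
char_bigrams = ngrams("the", 2)

# Word unigrams: the symbols are whole words.
word_unigrams = ngrams("the quick fox".split(), 1)

print(char_bigrams)   # [('t', 'h'), ('h', 'e')]
print(word_unigrams)  # [('the',), ('quick',), ('fox',)]
```

Only the choice of symbol changes between the two calls, which is why the term by itself does not say whether a list is character-based or word-based.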

heysokam commented Feb 2, 2024
Contributor Author

They are still n-grams, not n-chars, even if they are using characters as their symbols.
The term N-gram for this language set is technically more correct than calling a language dataset a "unigram where each symbol is a word".
N-grams whose symbols are words are the rare case, not the other way around.

I think the name is intuitive: it shows up on Google for the person who doesn't know what they are, and Wikipedia itself gives the right description for the concept (and even explains the context of unigrams where symbols are words). So I would say the more intuitive and pre-existing meaning should be kept.

max-niederman
Owner

> They are still n-grams, not n-chars, even if they are using characters as their symbols. The term N-gram for this language set is technically more correct than calling a language dataset a "unigram where each symbol is a word".

This is not true; neither is more technically correct because "n-gram" is a very broad term and applies to both. That's why I'm hesitant to call only one "n-gram" as its distinguishing feature.

> I think the name is intuitive: it shows up on Google for the person who doesn't know what they are, and Wikipedia itself gives the right description for the concept (and even explains the context of unigrams where symbols are words). So I would say the more intuitive and pre-existing meaning should be kept.

This is a valid point, though. "N-gram" is more searchable, at the very least because of ngram-type. I'm going to go ahead and merge this, although in v2 I think this'll need to be replaced by n-gram generation, which is already planned.

@max-niederman max-niederman merged commit 2bd7555 into max-niederman:main Feb 5, 2024
3 checks passed