
Tokenize words #52

Closed
aljbri opened this issue Mar 29, 2021 · 1 comment
Comments

aljbri commented Mar 29, 2021

In the tokenize part, it doesn't separate the character و from the word when it is not part of the original word, as in this example:

>>> from pyarabic.araby import tokenize, is_arabicrange, strip_tashkeel
>>> text = u"ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
>>> tokenize(text, conditions=is_arabicrange, morphs=strip_tashkeel)
['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']
@linuxscout (Owner)
Salam,
Thank you for your message.
The tokenization process only separates words from text; it doesn't perform any analysis on the words.
If you want to get lemmas or stems from words, I suggest using the Qalsadi morphological analyzer.
Or you can use just a stemmer such as Tashaphyne to extract stems.
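If a full analyzer is too heavy, a very naive workaround is to post-process the token list yourself. The sketch below is not part of pyarabic: it splits a leading conjunction waw from a token only when the remainder appears in a caller-supplied vocabulary (the vocabulary and the `split_conjunction` helper are illustrative assumptions). A real solution needs morphological analysis, since a leading waw may belong to the original word.

```python
# Naive illustration (not pyarabic API): detach a leading conjunction waw
# from tokens when the remainder is a known word. A morphological analyzer
# such as Qalsadi handles the general case correctly.
WAW = "\u0648"  # Arabic letter waw (و)

def split_conjunction(tokens, vocabulary):
    """Split a leading waw only when the rest of the token is in vocabulary."""
    result = []
    for token in tokens:
        rest = token[1:]
        if token.startswith(WAW) and rest in vocabulary:
            result.append(WAW)   # keep the conjunction as its own token
            result.append(rest)  # keep the bare word
        else:
            result.append(token)
    return result

# Tokens from the example above; the vocabulary is a hypothetical word list.
vocab = {"اسم", "الكلب", "في", "اللغة", "الإنجليزية", "الحمار"}
tokens = ["اسم", "الكلب", "في", "اللغة", "الإنجليزية", "واسم", "الحمار"]
print(split_conjunction(tokens, vocab))
# → ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'و', 'اسم', 'الحمار']
```

This only works when the bare word is already in the vocabulary; words like وليد, where the waw is part of the root, are left untouched because stripping them would require real analysis.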
