
Tokenize words #52

Closed
aljbri opened this issue Mar 29, 2021 · 1 comment
Comments

aljbri commented Mar 29, 2021

In the tokenize part, it doesn't separate the character و from the word when it is not part of the original word, as in this example:

>>> from pyarabic.araby import tokenize, is_arabicrange, strip_tashkeel
>>> text = u"ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
>>> tokenize(text, conditions=is_arabicrange, morphs=strip_tashkeel)
['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']
@linuxscout (Owner)
Salam,
Thank you for your message.
The tokenization process only separates words from text; it doesn't perform any analysis on the words.
If you want to get lemmas or stems from words, I suggest using the Qalsadi morphological analyzer.
Or you can use just a stemmer such as Tashaphyne to extract stems.
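If a full analyzer is too heavy, a very naive workaround is to post-process the token list yourself. The sketch below is not part of pyarabic: it splits a leading conjunction waw from a token only when the remainder appears in a caller-supplied vocabulary (the vocabulary and the `split_conjunction` helper are illustrative assumptions). A real solution needs morphological analysis, since a leading waw may belong to the original word.

```python
# Naive illustration (not pyarabic API): detach a leading conjunction waw
# from tokens when the remainder is a known word. A morphological analyzer
# such as Qalsadi handles the general case correctly.
WAW = "\u0648"  # Arabic letter waw (و)

def split_conjunction(tokens, vocabulary):
    """Split a leading waw only when the rest of the token is in vocabulary."""
    result = []
    for token in tokens:
        rest = token[1:]
        if token.startswith(WAW) and rest in vocabulary:
            result.append(WAW)   # keep the conjunction as its own token
            result.append(rest)  # keep the bare word
        else:
            result.append(token)
    return result

# Tokens from the example above; the vocabulary is a hypothetical word list.
vocab = {"اسم", "الكلب", "في", "اللغة", "الإنجليزية", "الحمار"}
tokens = ["اسم", "الكلب", "في", "اللغة", "الإنجليزية", "واسم", "الحمار"]
print(split_conjunction(tokens, vocab))
# → ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'و', 'اسم', 'الحمار']
```

This only works when the bare word is already in the vocabulary; words like وليد, where the waw is part of the root, are left untouched because stripping them would require real analysis.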
