ARASTEM is a corpus for Arabic stemming. It consists of several documents in which semantically and morphologically related words are grouped together.
The corpus was built entirely by hand by native Arabic speakers, from texts collected on different Arabic discussion forums. It contains words from Standard Arabic, Dialectal Arabic and Modern Pseudo Arabic.
- Ibtissem Abainia
- Ahmed Kedaya
- Chouaib Fellah
- Otman Bordjiba
- Reviewed by Taha Zerrouki
The new reviewed version is divided into two parts:
- Roots oriented Data: words are grouped according to their roots
- Stems oriented Data: words are grouped according to their lemma
The data was developed to evaluate the ARLStem stemmer.
The ARLStem stemmer is included in the NLTK framework.
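As a rough illustration of how grouped words can be used to evaluate a stemmer, the sketch below scores a stemmer by how often the words of one group end up with the same stem. The metric, the function names (`group_agreement`, `mean_agreement`), the toy word groups and the `toy_stem` stand-in are all assumptions made for this example; they are not taken from the ARASTEM files or from the evaluation protocol of the paper.

```python
from collections import Counter
from typing import Callable, List, Sequence


def group_agreement(group: Sequence[str], stem: Callable[[str], str]) -> float:
    """Share of words in one group that map to the group's most frequent stem."""
    stems = Counter(stem(word) for word in group)
    return stems.most_common(1)[0][1] / len(group)


def mean_agreement(groups: List[Sequence[str]], stem: Callable[[str], str]) -> float:
    """Average of group_agreement over all non-empty groups."""
    scores = [group_agreement(g, stem) for g in groups if g]
    return sum(scores) / len(scores) if scores else 0.0


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strips a leading conjunction and article."""
    for prefix in ("و", "ال"):
        # Only strip when enough letters remain to form a plausible stem.
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            word = word[len(prefix):]
    return word


# Hypothetical toy groups for illustration; not read from the corpus.
toy_groups = [
    ["كتاب", "الكتاب", "والكتاب"],
    ["مدرسة", "المدرسة"],
]

print(mean_agreement(toy_groups, toy_stem))
```

To try this with the NLTK implementation instead of the toy stemmer, pass `ARLSTem().stem` (from `nltk.stem.arlstem`) as the `stem` argument, and replace `toy_groups` with groups loaded from the corpus files in whatever layout they actually use.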
To cite this corpus, use:
@article{abainia2017novel,
title={A novel robust Arabic light stemmer},
author={Abainia, Kheireddine and Ouamour, Siham and Sayoud, Halim},
journal={Journal of Experimental \& Theoretical Artificial Intelligence},
volume={29},
number={3},
pages={557--573},
year={2017},
publisher={Taylor \& Francis}
}