

linuxscout/ARASTEM-corpus

 
 


AraStem-corpus

ARASTEM is a corpus dedicated to Arabic stemming. It contains several documents of grouped words that are semantically and morphologically related.

The corpus was constructed manually by native Arabic speakers from texts collected from different Arabic discussion forums. It contains words belonging to Standard Arabic, Dialectal Arabic and Modern Pseudo Arabic.

Contributors:

  • Ibtissem Abainia
  • Ahmed Kedaya
  • Chouaib Fellah
  • Otman Bordjiba
  • Reviewed by Taha Zerrouki

Parts

The new reviewed version is divided into two parts:

  • Roots oriented Data: words are grouped according to their roots
  • Stems oriented Data: words are grouped according to their lemma

Links

The data was developed to evaluate the ARLStem stemmer.

The ARLStem stemmer is included in the NLTK framework.
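
Since the corpus groups related words together, one simple way to use it for evaluation is to check how often a stemmer maps the words of a group to the same stem. The snippet below is only a minimal sketch of that idea, not the evaluation procedure used by the authors. The function name `group_agreement`, the toy stemmer and the in-memory sample groups are all hypothetical; the actual file layout of the corpus is not described here, so loading the data is left out.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence


def group_agreement(groups: Iterable[Sequence[str]],
                    stem: Callable[[str], str]) -> float:
    """Fraction of words whose stem equals the most common stem of their group."""
    matched = 0
    total = 0
    for words in groups:
        if not words:
            continue  # ignore empty groups
        stems = Counter(stem(w) for w in words)
        # Count the words that agree with the dominant stem of this group.
        matched += stems.most_common(1)[0][1]
        total += len(words)
    return matched / total if total else 0.0


# Hypothetical toy stemmer (assumption): it only strips the definite article.
def toy_stem(word: str) -> str:
    return word[2:] if word.startswith("ال") and len(word) > 3 else word


# Hypothetical sample groups (assumption), not taken from the corpus files.
sample_groups = [
    ["كتاب", "الكتاب"],
    ["مدرسة", "المدرسة", "مدارس"],
]

score = group_agreement(sample_groups, toy_stem)
print(f"group agreement: {score:.2f}")
```

Any callable that maps a word to its stem can be passed in place of `toy_stem`. With NLTK installed, for example, `ARLSTem().stem` from `nltk.stem.arlstem` should work. A higher score only means that grouped words are merged more often; it says nothing about unrelated words being merged by mistake.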

Citation

The data was developed to evaluate the ARLStem stemmer. To cite this corpus, use:

@article{abainia2017novel,
  title={A novel robust Arabic light stemmer},
  author={Abainia, Kheireddine and Ouamour, Siham and Sayoud, Halim},
  journal={Journal of Experimental \& Theoretical Artificial Intelligence},
  volume={29},
  number={3},
  pages={557--573},
  year={2017},
  publisher={Taylor \& Francis}
}
