Awesome Persian NLP/IR, Tools And Resources

This list is a curation of the best, not of everything. Please participate in its development. Thanks to ACL WEB.

Tools

Part-of-Speech Tagger

farsiNLPTools - Open-source dependency parser, part-of-speech tagger, and text normalizer for Farsi (Persian).
HAZM - Python library for digesting Persian text.
Persian Language Model for HunPoS - HunPoS (Halacsy et al., 2007) is an open-source reimplementation of the statistical part-of-speech tagger Trigrams'n'Tags, also called TnT (Brants, 2000), that lets the user tune the tagger with different feature settings.
Maryam Tavafi POS Tagger - An implementation of a Persian part-of-speech tagger based on structured support vector machines.
[Perstem](https://sourceforge.net/projects/perstem/) - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.
Persianp Toolbox - Multi-purpose Persian NLP toolbox.
UM-wtlab pos tagger - A C# implementation of the Viterbi and Brill part-of-speech taggers.

Language Detection

Google language detect (python port) - Lightweight language detector; its performance on Persian is excellent.

Tokenization & Segmentation

HAZM - Python library for digesting Persian text.
polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).
tok-tok - Tok-tok is a fast, simple, multilingual tokenizer (a single .pl file).
segmental - Trains a text segmentation model from a plain-text corpus using a deep learning platform.
Persian Sentence Segmenter and Tokenizer: SeTPer - Regex-based sentence segmenter.
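To give a feel for what a regex-based segmenter and tokenizer does, here is a minimal Python sketch. It is not SeTPer's implementation; the function names `split_sentences` and `tokenize` and the chosen punctuation set are assumptions made for illustration only.

```python
import re

# Assumed sentence-final marks: period, exclamation mark,
# Latin question mark and Persian question mark (U+061F).
_SENTENCE_END = re.compile(r"(?<=[.!?\u061f])\s+")


def split_sentences(text):
    """Split after a sentence-final mark that is followed by whitespace."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def tokenize(sentence):
    """Return words and punctuation marks as separate tokens.

    The zero-width non-joiner (U+200C) is kept inside a word, since
    Persian uses it within words such as plural forms.
    """
    return re.findall(r"[\w\u200c]+|[^\w\s]", sentence)


if __name__ == "__main__":
    for sentence in split_sentences("سلام. حال شما چطور است؟ خوبم!"):
        print(tokenize(sentence))
```

A rule this simple will also split after abbreviations that end in a period, which is one reason dedicated tools exist.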

Normalizer And Text Cleaner

HAZM - Python library for digesting Persian text.
Persian Pre-processor: PrePer - Another single-file .pl tool that normalizes Persian text.
virastar - Cleans up Persian text: replaces double dashes with en dashes and triple dashes with em dashes, replaces English digits with their Persian equivalents, corrects spacing around :;,.?! (one space after and no space before), replaces the English percent sign with its Persian equivalent, and applies many other normalizations. Virastar is written in Ruby.
Virastyar - A collection of C# libraries for Persian text processing (spell checking, purification, punctuation correction, Persian character standardization, Pinglish conversion, and more).
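The normalization rules listed for virastar above can be sketched in a few lines of Python. This is an illustrative approximation, not virastar's code (which is Ruby); the function name `normalize` and the decision to leave a period between two digits untouched are assumptions.

```python
import re

# Map ASCII digits 0-9 to Persian digits U+06F0..U+06F9.
_DIGITS = str.maketrans(
    "0123456789", "".join(chr(0x06F0 + i) for i in range(10))
)
_PUNCT = r":;,.?!"


def normalize(text):
    """Apply a small subset of the cleanup rules described above."""
    text = text.replace("---", "\u2014")  # triple dash -> em dash (must run first)
    text = text.replace("--", "\u2013")   # double dash -> en dash
    text = text.translate(_DIGITS)        # English digits -> Persian digits
    text = text.replace("%", "\u066a")    # percent sign -> Persian percent sign
    # No whitespace before punctuation.
    text = re.sub(rf"\s+([{_PUNCT}])", r"\1", text)
    # Exactly one space after punctuation, unless the next character is
    # more punctuation or a digit (assumption: keeps decimals intact).
    text = re.sub(rf"([{_PUNCT}])[ \t]*(?=[^\s{_PUNCT}\d])", r"\1 ", text)
    return text


if __name__ == "__main__":
    print(normalize("قیمت 20% کم شد ,عالی--خوب"))
```

Order matters here: the triple dash has to be replaced before the double dash, otherwise every `---` would become an en dash followed by a hyphen.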

Transliterator

Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.

Named Entity Recognition

Embeddings

Morphological Analysis

polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).

Stemmer

Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.
polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).

Sentiment Analysis

polyglot (polarity) - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).
NRC-Persian-Lexicon - NRC Word-Emotion Association Lexicon, useful for Persian sentiment analysis.

Data

Part-of-Speech Tagger

Bijankhan Corpus - The Bijankhan corpus is a tagged corpus suitable for natural language processing research on the Persian (Farsi) language. The collection is gathered from daily news and common texts. All documents are categorized into different subjects such as political, cultural and so on; in total, there are 4300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words with a tag set of 40 Persian POS tags.
Mojgan Seraji Corpus - Uppsala Persian Corpus (UPC) is a large, freely available Persian corpus. The corpus is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization containing 2,704,028 tokens and annotated with 31 part-of-speech tags. The part-of-speech tags are listed with explanations in this table.

Dependency Parsing

Persian Syntactic Dependency Treebank - This treebank is supplied for free noncommercial use. For commercial uses feel free to contact us. It contains 29,982 annotated sentences, including samples from almost all verbs of the Persian valency lexicon.
Uppsala Persian Dependency Treebank: UPDT - Dependency-based syntactically annotated corpus.
Pretrained model
Universal Dependencies 1.3 - Multilingual corpus that holds IOB gold data for dependency parsing.
HamleDT 3.0 - HArmonized Multi-LanguagE Dependency Treebank is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style.

Text Categorization and Classification

Hamshahri - The Hamshahri collection is a standard, reliable Persian text collection that was used at the Cross Language Evaluation Forum (CLEF) in 2008 and 2009 for the evaluation of Persian information retrieval systems.
Bijankhan Corpus - The Bijankhan corpus is a tagged corpus suitable for natural language processing research on the Persian (Farsi) language. The collection is gathered from daily news and common texts. All documents are categorized into different subjects such as political, cultural and so on; in total, there are 4300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words with a tag set of 40 Persian POS tags.

Spell Checking

FAspell - The FASpell dataset was developed for the evaluation of spell checking algorithms. It contains a set of pairs of misspelled Persian words and their corresponding corrected forms, similar to the ASpell dataset used for English.

Machine Translation

TEP: Tehran English-Persian Parallel Corpus - First free English-Persian corpus.
OPUS: the open parallel corpus - OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and the corpus is also delivered as an open content package. We used several tools to compile the current collection. All pre-processing is done automatically. No manual corrections have been carried out.

Web Collected

W2C – Web to Corpus – Corpora - A set of corpora for 120 languages automatically collected from Wikipedia and the web.
dotIR Collection - dotIR is a standard Persian test collection suitable for evaluating web information retrieval algorithms on the Iranian web. It contains many Persian web pages, including their text, links, and metadata, stored in XML format. It is prepared to be a good representative of the Iranian web and is a good test bed for evaluating link-based information retrieval algorithms. It includes enough queries and relevance judgments for a valid evaluation, and it is small enough not to require heavy processing resources.

IR Ranking Evaluation

Hamshahri - The Hamshahri collection is a standard, reliable Persian text collection that was used at the Cross Language Evaluation Forum (CLEF) in 2008 and 2009 for the evaluation of Persian information retrieval systems.
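Test collections of this kind pair queries with relevance judgments, so a ranked result list can be scored. The sketch below computes mean average precision (MAP), one common ranking metric; it is offered as an illustration only. The dictionary layouts for `runs` and `qrels` and the document ids are hypothetical and do not reflect the Hamshahri file format.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over every query that has judgments.

    runs:  query id -> ranked list of document ids (system output)
    qrels: query id -> iterable of relevant document ids (judgments)
    """
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical query and document ids, for illustration only.
    runs = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4"]}
    qrels = {"q1": ["d1", "d3"], "q2": ["d4"]}
    print(mean_average_precision(runs, qrels))
```

A query with judgments but no retrieved documents scores zero here, which penalizes a system for skipping it.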

IR Crawling And Linking Evaluation

dotIR Collection - dotIR is a standard Persian test collection suitable for evaluating web information retrieval algorithms on the Iranian web. It contains many Persian web pages, including their text, links, and metadata, stored in XML format. It is prepared to be a good representative of the Iranian web and is a good test bed for evaluating link-based information retrieval algorithms. It includes enough queries and relevance judgments for a valid evaluation, and it is small enough not to require heavy processing resources.

MISC

PersPred - PersPred is the first online multilingual syntactic and semantic database of Persian compound verbs (complex predicates), developed by the members of the research unit Mondes iranien et indien (CNRS, Sorbonne Nouvelle, Inalco, EPHE) within the ANR-DFG project PERGRAM (2008-2012) and the LR4.1 work package of Strand 6 of the Labex Empirical Foundations of Linguistics (EFL).
mhbashari stopword list - Experimental list of stopwords that is suitable for topic modelling and word embedding.
Hazm stop words - Stop words list, good for IR.
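A stopword list is typically applied as a simple set filter over tokens before indexing or topic modelling. The sketch below assumes a list with one word per line, which may not match the actual layout of either list above; `load_stopwords` and `remove_stopwords` are hypothetical helper names.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


if __name__ == "__main__":
    # A tiny in-memory sample stands in for a real stopword file.
    sample = ["و\n", "در\n", "به\n"]
    stopwords = load_stopwords(sample)
    print(remove_stopwords(["کتاب", "و", "قلم"], stopwords))
```

Passing an open file object to `load_stopwords` works the same way, since iterating a text file yields its lines.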

Papers

Contribute

Contributions welcome! Read the contribution guidelines first.

htaghizadeh/awesome-persian-nlp-ir
