lucene-arabic-analyzer

Apache Lucene analyzer for Arabic language with root based stemmer.

lucene-arabic-analyzer

Introduction

Stemming algorithms are used in information retrieval systems, text classifiers, indexers and text mining to extract roots of different words, so that words derived from the same stem or root are grouped together.

Version 2.x is based on Alkhlil Morpho System.
Version 1.x is based on Khoja stemmer.

ArabicRootExtractorAnalyzer is responsible to do the following:

Normalize input text by removing diacritics: e.g. "الْعَالَمِينَ" will be converted to "العالمين".
Extract word's root: e.g. "العالمين" will be converted to "علم".

This way, documents will be indexed depending on its words roots, so, when you want to search in the index, you can input "علم" or "عالم" to get all documents containing "الْعَالَمِينَ".

Installation

Maven

<dependency>
  <groupId>com.github.msarhan</groupId>
  <artifactId>lucene-arabic-analyzer</artifactId>
  <version>[VERSION]</version>
</dependency>

Usage

//Initialize the index
Directory index = new RAMDirectory();
Analyzer analyzer = new ArabicRootExtractorAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, config);

Document doc = new Document();
doc.add(new StringField("number", "1", Field.Store.YES));
doc.add(new TextField("title", "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ", Field.Store.YES));
writer.addDocument(doc);

doc = new Document();
doc.add(new StringField("number", "2", Field.Store.YES));
doc.add(new TextField("title", "الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ", Field.Store.YES));
writer.addDocument(doc);

doc = new Document();
doc.add(new StringField("number", "3", Field.Store.YES));
doc.add(new TextField("title", "الرَّحْمَنِ الرَّحِيمِ", Field.Store.YES));
writer.addDocument(doc);
writer.close();
//~

//Query the index
String queryStr = "راحم";
Query query = new QueryParser("title", analyzer)
    .parse(queryStr);

int hitsPerPage = 5;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(query, hitsPerPage, Sort.INDEXORDER);

ScoreDoc[] hits = docs.scoreDocs;
//~

//Print results
System.out.println("Found " + hits.length + " hits:");
for (ScoreDoc hit : hits) {
    int docId = hit.doc;
    Document d = searcher.doc(docId);
    System.out.printf("\t(%s): %s\n", d.get("number"), d.get("title"));
}
//~

Usage of `ArabicRootExtractorStemmer`

ArabicRootExtractorStemmer stemmer = new ArabicRootExtractorStemmer();

assertTrue(stemmer.stem("الرَّحْمَنِ").stream().anyMatch(s -> s.equals("رحم")));
assertTrue(stemmer.stem("الْعَالَمِينَ").stream().anyMatch(s -> s.equals("علم")));
assertTrue(stemmer.stem("الْمُؤْمِنِينَ").stream().anyMatch(s -> s.equals("ءمن")));
assertTrue(stemmer.stem("يَتَنَازَعُونَ").stream().anyMatch(s -> s.equals("نزع")));

Integration with Elasticsearch

To use this Analyzer with Elasticsearch, use elasticsearch-arabic-analyzer plugin.

Building

# Install AlKhalil jar files in your local maven repository
cd alkhalil && ./maven-install.sh

# The resulting jar file will include Alkhalil dependencies
mvn package

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
alkhalil		alkhalil
src		src
.dockerignore		.dockerignore
.editorconfig		.editorconfig
.gitignore		.gitignore
.travis.yml		.travis.yml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
Tester.java		Tester.java
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

alkhalil

alkhalil

src

src

.dockerignore

.dockerignore

.editorconfig

.editorconfig

.gitignore

.gitignore

.travis.yml

.travis.yml

Dockerfile

Dockerfile

LICENSE

LICENSE

README.md

README.md

Tester.java

Tester.java

pom.xml

pom.xml

Repository files navigation

lucene-arabic-analyzer

Introduction

Installation

Usage

Usage of `ArabicRootExtractorStemmer`

Integration with Elasticsearch

Building

About

Releases

Packages

Contributors 2

Languages

License

msarhan/lucene-arabic-analyzer

Folders and files

Latest commit

History

Repository files navigation

lucene-arabic-analyzer

Introduction

Installation

Usage

Usage of ArabicRootExtractorStemmer

Integration with Elasticsearch

Building

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

Usage of `ArabicRootExtractorStemmer`