# Text preprocessing using LangPi

**The API needs more work**

The preprocessing classes are located in the "kariminf.langpi.basic" package.
It has one sub-package per language: arabic, basque, bulgarian, catalan, chinese, czech, dutch, english, finnish,
french, german, greek, hebrew, hindi, hungarian, indonesian, italian, japanese, korean, norwegian, nynorsk,
persian, portuguese, romanian, russian, spanish, swedish, thai, turkish.

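Each language provides these classes: Info, Normalizer, Segmenter, Stemmer, SWEliminator. In the examples below, the class names carry a language prefix (for example ArInfo, EnNormalizer).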

In [1]:
%%pom
repositories:
  - id: jitpack
    layout: default
    url: https://jitpack.io
    snapshots:
      enabled: false

dependencies:
    - com.github.kariminf:k-toolja:1.1.0
    - com.github.kariminf:langpi:1.1.5
        
# downloading the module
    
#%maven com.github.kariminf:k-hebmorph:2.0.3 //for hebrew
#%maven com.github.kariminf:k-jhazm:1.0.2 //for persian (farsi)
#%maven com.github.kariminf:k-opennlp1.4:1.4.4 //for Thai segmentation
    

In [2]:
%%pom
dependencies:
    - org.apache.lucene:lucene-core:4.10.2
    - org.apache.lucene:lucene-analyzers-common:4.10.2
    - org.apache.lucene:lucene-analyzers-kuromoji:4.10.2 # for japanese
    - org.apache.opennlp:opennlp-maxent:3.0.2-incubating
    - org.apache.opennlp:opennlp-tools:1.7.2


# for chinese
# org.apache.lucene:lucene-analyzers-smartcn:4.10.2


In [3]:
%%java
import kariminf.langpi.basic.arabic.ArInfo;
import kariminf.langpi.basic.english.EnInfo;
import kariminf.langpi.basic.french.FrInfo;
import kariminf.langpi.basic.japanese.JaInfo;
import kariminf.langpi.basic.BasicInfo;


BasicInfo[] infos = new BasicInfo[]{new ArInfo(), new EnInfo(), new FrInfo(), new JaInfo()};

for (BasicInfo info: infos){
    System.out.println("-------------------------------------");
    System.out.println("Code: " + info.getIndicator());
    System.out.println("English name: " + info.getLangEnglishName());
    System.out.println("Original name: " + info.getLangName());
}


-------------------------------------
Code: ar
English name: Arabic
Original name: العربية
-------------------------------------
Code: en
English name: English
Original name: English
-------------------------------------
Code: fr
English name: French
Original name: français
-------------------------------------
Code: ja
English name: Japanese
Original name: 日本語


## I. Text normalization

In [4]:
%%java
// Arabic text normalization: delete diacritics and line breaks
import kariminf.langpi.basic.arabic.ArNormalizer;
import kariminf.langpi.basic.Normalizer;

String input = "سُنًتَدِرب على الرماية.";
System.out.println(ArNormalizer.removeDiacritics(input));

Normalizer norm = new ArNormalizer();
System.out.println(norm.normalize(input));

سنتدرب على الرماية.
سنتدرب على الرماية.
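
To make the idea of diacritics removal concrete, here is a minimal standalone sketch in plain Java. It is not LangPi's implementation: the class name `DiacriticsSketch` is hypothetical, and the character range (U+064B to U+0652, the common Arabic short-vowel marks) is an assumption about which marks count as diacritics.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of LangPi.
class DiacriticsSketch {
    // Assumed range: fathatan (U+064B) through sukun (U+0652).
    private static final Pattern DIACRITICS = Pattern.compile("[\\u064B-\\u0652]");

    // Returns the text with every character in the assumed range removed.
    static String removeDiacritics(String text) {
        return DIACRITICS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeDiacritics("سُنًتَدِرب على الرماية."));
    }
}
```

A real normalizer may cover more marks (for example the tatweel or Quranic annotation signs), so the range above should be treated as a starting point only.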


In [5]:
%%java
// English normalization: delete line breaks and multiple spaces
import kariminf.langpi.basic.english.EnNormalizer;
import kariminf.langpi.basic.Normalizer;

String input = "This             is a text\n with return line.";
Normalizer norm = new EnNormalizer();
System.out.println(norm.normalize(input));

This is a text with return line.
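
The same effect can be approximated without the library. The sketch below assumes that normalization here amounts to collapsing every run of whitespace (spaces, tabs, line breaks) into one space; `WhitespaceSketch` is a hypothetical name, and EnNormalizer may do more than this.

```java
// Hypothetical helper, not part of LangPi.
class WhitespaceSketch {
    // Collapses each run of whitespace (including line breaks) into a single space.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("This             is a text\n with return line."));
    }
}
```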


## II. Text segmentation

In [6]:
%%java
// Arabic example : RegEx based
import kariminf.langpi.basic.arabic.ArSegmenter;
import kariminf.langpi.basic.Segmenter;

String in = "أنا ذاهب إلى السوق. هل تريد أن أحضر لك شيء ما؟ هكذا إذن! نلتقي بعد أن أعود.";
Segmenter seg = new ArSegmenter();
System.out.println(seg.splitToSentences(in));
System.out.println(seg.segmentWords(in));

[أنا ذاهب إلى السوق., هل تريد أن أحضر لك شيء ما؟, هكذا إذن!, نلتقي بعد أن أعود.]
[أنا, ذاهب, إلى, السوق, هل, تريد, أن, أحضر, لك, شيء, ما, هكذا, إذن, نلتقي, بعد, أن, أعود]
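
As an illustration of what a RegEx-based sentence splitter can look like, here is a minimal sketch in plain Java. It is an assumption about the general technique, not ArSegmenter's actual rule set: the text is cut at whitespace that follows a period, an exclamation mark, a Latin question mark, or the Arabic question mark (U+061F). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of LangPi.
class SentenceSplitSketch {
    // Splits at whitespace preceded by '.', '!', '?' or the Arabic question mark.
    static List<String> splitSentences(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?\\u061F])\\s+"));
    }

    public static void main(String[] args) {
        String in = "أنا ذاهب إلى السوق. هل تريد أن أحضر لك شيء ما؟ هكذا إذن! نلتقي بعد أن أعود.";
        System.out.println(splitSentences(in));
    }
}
```

A rule this simple also cuts after abbreviations that end with a period, which is one reason to prefer a trained model for languages where such abbreviations are common.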


In [7]:
%%java
// English example: OpenNLP based
import kariminf.langpi.basic.english.EnSegmenter;
import kariminf.langpi.basic.Segmenter;

String in = "This is a sentence. It contains some words from Dr. Who.";
Segmenter seg = new EnSegmenter();
System.out.println(seg.splitToSentences(in));
System.out.println(seg.segmentWords(in));

[This is a sentence., It contains some words from Dr. Who.]
[This, is, a, sentence, It, contains, some, words, from, Dr., Who]


## III. Stop-word filtering

In [8]:
%%java
import kariminf.langpi.basic.arabic.ArSWEliminator;
import kariminf.langpi.basic.SWEliminator;
import java.util.*;

List<String> tstList = new ArrayList<String>();
tstList.add("أنا");
tstList.add("سأذهب");
tstList.add("إلى");
tstList.add("المحل");
tstList.add("المجاور");
tstList.add("ثم");
tstList.add("أعود");
tstList.add("بعد");
tstList.add("ذلك");
tstList.add("للعمل");
tstList.add("من");
tstList.add("جديد");

SWEliminator eliminator = new ArSWEliminator();
eliminator.deleteSW(tstList);

System.out.println(tstList);

[سأذهب, المحل, المجاور, أعود, للعمل, جديد]
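
Stop-word elimination of this kind can be sketched as a set lookup that edits the list in place, the way `deleteSW` does in the cell above. The stop list below is a tiny made-up English sample, and `StopWordSketch` is a hypothetical name; LangPi's own lists and matching rules may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of LangPi.
class StopWordSketch {
    // Illustrative stop list only; a real one is much longer.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("i", "to", "the", "then", "after", "that", "from"));

    // Removes stop words from the given list in place (case-insensitive).
    static void deleteStopWords(List<String> words) {
        words.removeIf(w -> STOP_WORDS.contains(w.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList("I", "go", "to", "the", "shop"));
        deleteStopWords(words);
        System.out.println(words);
    }
}
```

Because the list is modified in place, it has to be mutable: passing the fixed-size list returned by `Arrays.asList` directly would throw an `UnsupportedOperationException` as soon as a word is removed.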


## IV. Text stemming

In [9]:
%%java
import kariminf.langpi.basic.arabic.ArStemmer;
import kariminf.langpi.basic.Stemmer;
import java.util.*;

// Lowercase variable name, so it does not hide the Stemmer type
ArStemmer stemmer = new ArStemmer();
String arabicWord = "تستعمل";
List<String> lst = new ArrayList<String>();
lst.add("تستعمل");
lst.add("المعلوماتية");
lst.add("العليا");
lst.add("للإعلام");

// Stem a single word, then a whole list
System.out.println(stemmer.stemWord(arabicWord));
System.out.println(lst);
lst = stemmer.stemListWords(lst);
System.out.println(lst);

تستعمل
[تستعمل, المعلوماتية, العليا, للإعلام]
[تستعمل, علم, علا, علم]
