# Text preprocessing using OpenNLP

Downloading the API from the Maven repository may take some time. 
Pre-trained models are available at https://opennlp.apache.org/models.html 
The SourceForge models can be downloaded from http://opennlp.sourceforge.net/models-1.5/

In [7]:
%%pom
dependencies:
    - org.apache.opennlp:opennlp-tools:1.9.3

# With Maven, the equivalent dependency declaration is:
#<dependency>
#    <groupId>org.apache.opennlp</groupId>
#    <artifactId>opennlp-tools</artifactId>
#    <version>1.9.3</version>
#</dependency>

## I. Language detection

### I.1. Detection using a trained model
Here, we use the **langdetect** model available at https://opennlp.apache.org/models.html 

In [5]:
%%java
import java.io.*;
import opennlp.tools.langdetect.*;

String[] texts = new String[]{
    "A computer is a machine that can be programmed to carry out sequences of arithmetic or logical operations automatically.",
    "Un ordinateur est un système de traitement de l'information programmable tel que défini par Alan Turing et qui fonctionne par la lecture séquentielle d'un ensemble d'instructions.",
    "La computadora también denominada computador​ u ordenador es una máquina digital programable que ejecuta una serie de comandos para procesar los datos de entrada, obteniendo convenientemente información que posteriormente se envía a las unidades de salida.",
    "الحَاسُوب هو آلة إلكترونية لها قابلية استقبال البيانات ومعالجتها إلى معلومات ذات قيمة.",
    "رایانِه یا کامپیوتِر دستگاهی الکترونیک است که می‌تواند برنامه‌ریزی شود تا دستور های ریاضیاتی و منطقی را به‌صورت خودکاره از طریق برنامه‌نویسی انجام دهد.",
    "コンピュータは、主にトランジスタを含む電子回路を応用し、数値計算、情報処理、データ処理、文書作成、動画編集、遊戯など、複雑な（広義の）計算を高速、大量におこなうことを目的として開発された機械である。",
    "电子计算机是利用数字电子技术，根据一系列指令指示並且自动执行任意算术或逻辑操作序列的设备。",
};
//english, french, spanish, arabic, persian, japanese, chinese
    
try (InputStream modelIn = new FileInputStream("/home/kariminf/Data/OpenNLP/langdetect-183.bin")) {
    // Load the pre-trained model; try-with-resources closes the stream afterwards
    LanguageDetectorModel model = new LanguageDetectorModel(modelIn);
    LanguageDetectorME detector = new LanguageDetectorME(model);
    for (String text: texts){
        System.out.println("----------------------------------------------------");
        System.out.println(text);
        // predictLanguage returns the most probable language and its confidence
        Language bestLanguage = detector.predictLanguage(text);
        System.out.println("Best language: " + bestLanguage.getLang());
        System.out.println("Best language confidence: " + bestLanguage.getConfidence());
    }
}
catch(IOException e){
    // Raised when the model file is missing or cannot be read
    System.out.println("Could not load the model: " + e.getMessage());
}

----------------------------------------------------
A computer is a machine that can be programmed to carry out sequences of arithmetic or logical operations automatically.
Best language: eng
Best language confidence: 0.13386670528129072
----------------------------------------------------
Un ordinateur est un système de traitement de l'information programmable tel que défini par Alan Turing et qui fonctionne par la lecture séquentielle d'un ensemble d'instructions.
Best language: fra
Best language confidence: 0.3152233825312163
----------------------------------------------------
La computadora también denominada computador​ u ordenador es una máquina digital programable que ejecuta una serie de comandos para procesar los datos de entrada, obteniendo convenientemente información que posteriormente se envía a las unidades de salida.
Best language: spa
Best language confidence: 0.43167282617777747
----------------------------------------------------
الحَاسُوب هو آلة إلكترونية لها قابلي

### I.2. Training a model
Let's build a model that detects numeral systems: binary (BIN), decimal (DEC) and hexadecimal (HEX). In the training file, each line holds one example: the language code, followed by a tab character, followed by the example text.
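
To make the file format concrete, here is a minimal sketch that generates such a training file with the standard library only. The class name `NumeralTrainingFile`, the output file name `num-generated.txt`, the number of samples, and the choice of putting several numbers on each line are all assumptions made for illustration; they do not describe the `num.txt` file used in this notebook.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class NumeralTrainingFile {

    // Builds one training line: label, a tab character, then the sample text.
    static String formatLine(String label, String sample) {
        return label + "\t" + sample;
    }

    // Builds a sample made of `count` random numbers written in the given radix.
    static String sample(Random random, int radix, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toString(random.nextInt(1_000_000), radix));
        }
        return sb.toString();
    }

    // Generates `perLabel` lines for each of the labels BIN, DEC and HEX.
    static List<String> buildLines(int perLabel, long seed) {
        Random random = new Random(seed);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < perLabel; i++) {
            lines.add(formatLine("BIN", sample(random, 2, 5)));
            lines.add(formatLine("DEC", sample(random, 10, 5)));
            lines.add(formatLine("HEX", sample(random, 16, 5)));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output file; pass it to MarkableFileInputStreamFactory to train on it.
        Path out = Paths.get("num-generated.txt");
        Files.write(out, buildLines(200, 42L), StandardCharsets.UTF_8);
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}
```

A larger file like this one may help when the training log reports many dropped events: with only a handful of short samples, most features probably occur fewer times than the cutoff and are discarded.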

In [19]:
%%java
import java.io.*;
import opennlp.tools.langdetect.*;
import opennlp.tools.util.*;
import java.nio.charset.StandardCharsets;
import opennlp.tools.util.model.ModelUtil;

// Read the training file: one labelled numeral example per line
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("num.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
ObjectStream<LanguageSample> sampleStream = new LanguageDetectorSampleStream(lineStream);
 
// Train a model on the numeral examples, using the default training parameters
LanguageDetectorModel model = LanguageDetectorME.train(sampleStream,
    ModelUtil.createDefaultTrainingParameters(), new LanguageDetectorFactory());
 
// Serialize the model to a file so that it does not have to be trained again:
// next time, the file can be loaded directly into a LanguageDetectorModel.
//model.serialize(new File("num.bin"));

Indexing events with OnePass using cutoff of 5

	Computing event counts...  done. 20 events
	Indexing...  Dropped event BIN:[]
Dropped event BIN:[]
Dropped event BIN:[]
Dropped event BIN:[]
Dropped event DEC:[]
Dropped event DEC:[]
Dropped event DEC:[]
Dropped event DEC:[]
Dropped event DEC:[]
Dropped event DEC:[]
Dropped event DEC:[]
Dropped event DEC:[]
Dropped event HEX:[]
Dropped event HEX:[d]
Dropped event HEX:[f]
done.
Sorting and merging events... done. Reduced 5 events to 1.
Done indexing in 0.00 s.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 1
	    Number of Outcomes: 3
	  Number of Predicates: 1
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-5.493061443340549	0.0
  2:  ... loglikelihood=-2.5541281188299547	1.0
  3:  ... loglikelihood=-1.6823611831060645	1.0
  4:  ... loglikelihood=-1.2565721414045303	1.0
  5:  ... loglikelihood=-1.0033534773107562	1.0
  6:  ... loglikelihood=-0.8352704233158316	1.0
 

## II. Sentence boundary detection

### II.1. Detection using a model

Here, we use the English sentence detection model available at https://opennlp.apache.org/models.html 

In [22]:
%%java
import java.io.*;
import opennlp.tools.sentdetect.*;

String text = "This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.";
    
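try (InputStream modelIn = new FileInputStream("/home/kariminf/Data/OpenNLP/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin")) {
    // Load the pre-trained model; try-with-resources closes the stream afterwards
    SentenceModel model = new SentenceModel(modelIn);
    SentenceDetectorME detector = new SentenceDetectorME(model);
    // sentDetect returns one string per detected sentence
    String[] sentences = detector.sentDetect(text);
    for (String sentence: sentences){
        System.out.println(sentence);
    }
}
catch(IOException e){
    // Raised when the model file is missing or cannot be read
    System.out.println("Could not load the model: " + e.getMessage());
}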

This is a text written by Mr. Aries.
It uses U.S. english to illustrate sentence tokenization.
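
For comparison, the same text can be segmented without any trained model. The sketch below uses the JDK's rule-based `java.text.BreakIterator`, which is a different technique from OpenNLP's statistical detector and is shown only as a baseline; the class name `RuleBasedSentences` is made up for this example. How it handles abbreviations such as "Mr." and "U.S." depends on the JDK's built-in rules, so it is worth running it on the text above and comparing the result with the model's output.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RuleBasedSentences {

    // Splits the text into consecutive segments using the JDK's rule-based
    // sentence iterator. Segments keep their trailing whitespace, so joining
    // them gives back the original text.
    static List<String> split(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        iterator.setText(text);
        List<String> segments = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.";
        for (String segment : split(text)) {
            System.out.println(segment.trim());
        }
    }
}
```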


### II.2. Training a model
Let's build a model to detect sentence boundaries. In the training file, each line contains one sentence.

In [27]:
%%java
import java.io.*;
import opennlp.tools.sentdetect.*;
import opennlp.tools.util.*;
import java.nio.charset.StandardCharsets;

// read the sentences file
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("en-sent.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);

SentenceModel model;

try (ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream)) {
  model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
}


try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("en-detect.bin"))) {
  model.serialize(modelOut);
}

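// TODO: fix; training currently fails with "Training data must contain more than one outcome" (see the output below)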

Indexing events with TwoPass using cutoff of 5

	Computing event counts...  done. 7 events
	Indexing...  done.
Sorting and merging events... done. Reduced 7 events to 3.
Done indexing in 0.00 s.


jdk.jshell.EvalException: Training data must contain more than one outcome
	at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:78)
	at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:93)
	at opennlp.tools.sentdetect.SentenceDetectorME.train(SentenceDetectorME.java:325)
	at opennlp.tools.sentdetect.SentenceDetectorME.train(SentenceDetectorME.java:310)
	at .(#385:1)
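
The exception above says that the training events carry a single outcome. A likely cause, stated here as an assumption and not verified against `en-sent.txt`, is that every line contains exactly one end-of-sentence character, at its very end, so the trainer only sees "split" examples and never a "no split" one (such as the periods in "Mr." or "U.S."). The sketch below is a stdlib-only diagnostic for that situation; the class name `SentenceDataCheck`, the sample lines, and the set of end-of-sentence characters (`.`, `!`, `?`) are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

class SentenceDataCheck {

    // Assumption: these characters are the candidate sentence ends.
    static final String EOS_CHARS = ".!?";

    // Counts the candidate characters in one line.
    static int countCandidates(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (EOS_CHARS.indexOf(line.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }

    // The last candidate of each line counts as a "split" example.
    static int countSplits(List<String> lines) {
        int splits = 0;
        for (String line : lines) {
            if (countCandidates(line) > 0) {
                splits++;
            }
        }
        return splits;
    }

    // Every candidate before the last one in a line counts as a "no split" example.
    static int countNoSplits(List<String> lines) {
        int noSplits = 0;
        for (String line : lines) {
            noSplits += Math.max(0, countCandidates(line) - 1);
        }
        return noSplits;
    }

    // True when the data provides examples of both outcomes.
    static boolean hasBothOutcomes(List<String> lines) {
        return countSplits(lines) > 0 && countNoSplits(lines) > 0;
    }

    public static void main(String[] args) {
        // Hypothetical training lines, one sentence per line.
        List<String> lines = Arrays.asList(
            "This is a text written by Mr. Aries.",
            "It uses U.S. english.");
        System.out.println("split examples: " + countSplits(lines));
        System.out.println("no-split examples: " + countNoSplits(lines));
        System.out.println("both outcomes present: " + hasBothOutcomes(lines));
    }
}
```

If the check reports no "no split" examples for the real training file, adding sentences that contain abbreviations or other sentence-internal periods may be enough to get past the error.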


## III. Word tokenization
Here, we use the English word tokenization model available at https://opennlp.apache.org/models.html 

In [30]:
%%java
import java.io.*;
import opennlp.tools.tokenize.*;

String text = "This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.";
    
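try (InputStream modelIn = new FileInputStream("/home/kariminf/Data/OpenNLP/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin")) {
    // Load the pre-trained model; try-with-resources closes the stream afterwards
    TokenizerModel model = new TokenizerModel(modelIn);
    Tokenizer tokenizer = new TokenizerME(model);
    // tokenize returns one string per token, punctuation included
    String[] words = tokenizer.tokenize(text);
    for (String word: words){
        System.out.print(word + " | ");
    }
}
catch(IOException e){
    // Raised when the model file is missing or cannot be read
    System.out.println("Could not load the model: " + e.getMessage());
}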

This | is | a | text | written | by | Mr. | Aries | . | It | uses | U.S. | english | to | illustrate | sentence | tokenization | . | 
