# Text preprocessing using CoreNLP

Available models include Arabic, Chinese, English, French, German and Spanish.
Stanford CoreNLP provides a pipeline of annotation tasks.
In this tutorial, we select only the preprocessing tasks: tokenization, sentence splitting, part-of-speech tagging and lemmatization.

In [1]:
%%pom
dependencies:
    - edu.stanford.nlp:stanford-corenlp:4.2.2
    - groupId: edu.stanford.nlp
      artifactId: stanford-corenlp
      version: 4.2.2
      classifier: models
    - groupId: edu.stanford.nlp
      artifactId: stanford-corenlp
      version: 4.2.2
      classifier: models-arabic

# Equivalent Maven pom.xml declaration, for reference:
# <dependencies>
# <dependency>
#     <groupId>edu.stanford.nlp</groupId>
#     <artifactId>stanford-corenlp</artifactId>
#     <version>4.2.2</version>
# </dependency>
# <dependency>
#     <groupId>edu.stanford.nlp</groupId>
#     <artifactId>stanford-corenlp</artifactId>
#     <version>4.2.2</version>
#     <classifier>models</classifier>
# </dependency>
# </dependencies>

## I. Preprocessing pipeline

Here, we show how to choose the pipeline tasks (annotators) and launch the pipeline.
Sentence splitting ("ssplit") depends on word tokenization ("tokenize").
Lemmatization ("lemma") depends on tokenization and on part-of-speech annotation ("pos").
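
The ordering rules above can be sketched as a small check on the "annotators" string. This is only an illustration: `AnnotatorOrderCheck` and `missingPrerequisites` are hypothetical names, not part of CoreNLP, and the table encodes just the two dependencies stated in this tutorial rather than CoreNLP's full requirement list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not a CoreNLP class): verifies that every annotator
// is listed after the annotators it depends on.
class AnnotatorOrderCheck {

    // Assumption: only the dependencies mentioned in this tutorial.
    static final Map<String, List<String>> PREREQUISITES = Map.of(
        "ssplit", List.of("tokenize"),
        "lemma", List.of("tokenize", "pos"));

    // Returns messages such as "lemma needs pos"; empty when the order is fine.
    static List<String> missingPrerequisites(String annotators) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String raw : annotators.split(",")) {
            String name = raw.trim();
            for (String required : PREREQUISITES.getOrDefault(name, List.of())) {
                if (!seen.contains(required)) {
                    problems.add(name + " needs " + required);
                }
            }
            seen.add(name);
        }
        return problems;
    }

    public static void main(String[] args) {
        // The list used in this tutorial satisfies both rules.
        System.out.println(missingPrerequisites("tokenize,ssplit,pos,lemma")); // []
        // Dropping "pos" breaks the rule for "lemma".
        System.out.println(missingPrerequisites("tokenize,lemma")); // [lemma needs pos]
    }
}
```

Such a check could be run on the value passed to `props.setProperty("annotators", ...)` before the pipeline is built.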

In [2]:
%%java

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ie.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.*;
import java.util.*;

String text = "This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.";

// set up pipeline properties
Properties props = new Properties();

// set the list of annotators to run
props.setProperty("annotators", "tokenize,ssplit,pos,lemma");

// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// create a document object
CoreDocument document = new CoreDocument(text);

// annotate the document
pipeline.annotate(document);


## II. Get tokens

A "CoreDocument" holds a list of "CoreLabel" objects, one per token.
This list can be obtained using the method "tokens()".
A "CoreLabel" provides several accessors:
- word(): returns the word; available if the "tokenize" task has been chosen.
- lemma(): returns the lemma; available if the "lemma" task has been chosen.
- ner(): returns the class of the named entity; available if the "ner" task has been chosen.
- tag(): returns the word's part-of-speech tag; available if the "pos" task has been chosen.

Here, we just use "word()" to recover the word of each "CoreLabel".

In [3]:
%%java
import java.util.*;

List<String> words = new ArrayList<String>();

for (CoreLabel token: document.tokens()){
    words.add(token.word());
}

words;

[This, is, a, text, written, by, Mr., Aries, ., It, uses, U.S., english, to, illustrate, sentence, tokenization, .]

## III. Get sentences

A "CoreDocument" also holds a list of "CoreSentence" objects.
This list can be obtained using the method "sentences()".
A "CoreSentence" provides several accessors:
- text(): returns the sentence as a string.
- tokens(): returns the sentence's tokens as a list of "CoreLabel"; available if the "tokenize" task has been chosen.
- posTags(): returns the part-of-speech tags of the sentence's tokens; available if the "pos" task has been chosen.
- nerTags(): returns the named-entity classes of the sentence's tokens; available if the "ner" task has been chosen.

Here, we use "text()" to recover each sentence as a string, and "tokens()" with "word()" to recover its words.

In [4]:
%%java
import java.util.*;

//to get sentences as texts
List<String> sentences = new ArrayList<String>();
//to get sentences as tokens
List<List<String>> sentencesWords = new ArrayList<>();

for (CoreSentence sentence: document.sentences()){
    sentences.add(sentence.text());
    List<String> words = new ArrayList<String>();
    sentencesWords.add(words);
    for(CoreLabel token: sentence.tokens()){
        words.add(token.word());
    }
}

System.out.println("---------- Sentences ------------");
System.out.println(sentences);

System.out.println("---------- Words in each sentence ------------");
System.out.println(sentencesWords);

---------- Sentences ------------
[This is a text written by Mr. Aries., It uses U.S. english to illustrate sentence tokenization.]
---------- Words in each sentence ------------
[[This, is, a, text, written, by, Mr., Aries, .], [It, uses, U.S., english, to, illustrate, sentence, tokenization, .]]
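
The output above suggests why sentence splitting depends on tokenization: "Mr." and "U.S." are kept as single tokens, so their periods are not treated as sentence ends. The following plain-Java sketch contrasts a naive split on periods with grouping the token list printed in section II. `SentenceSplitSketch` is a hypothetical class and a strong simplification of what "ssplit" does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration (not CoreNLP code).
class SentenceSplitSketch {

    // Naive approach: cut at every whitespace that follows a period.
    static List<String> naiveSplit(String text) {
        return Arrays.asList(text.split("(?<=\\.)\\s+"));
    }

    // Token-based approach: close a sentence only when a token is exactly ".".
    static List<List<String>> groupByBoundaryToken(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (token.equals(".")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens without a final period
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "This is a text written by Mr. Aries. "
            + "It uses U.S. english to illustrate sentence tokenization.";
        // Token list copied from the output of section II.
        List<String> tokens = List.of("This", "is", "a", "text", "written", "by",
            "Mr.", "Aries", ".", "It", "uses", "U.S.", "english", "to",
            "illustrate", "sentence", "tokenization", ".");

        System.out.println(naiveSplit(text).size());             // 4 pieces
        System.out.println(groupByBoundaryToken(tokens).size()); // 2 sentences
    }
}
```

The naive split cuts after "Mr." and "U.S.", while the token-based grouping recovers the same two sentences as the pipeline.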


## IV. Get lemmas

In [5]:
import java.util.*;

List<String> lemmas = new ArrayList<String>();

for (CoreLabel token: document.tokens()){
    lemmas.add(token.lemma());
}

lemmas;

[this, be, a, text, write, by, Mr., Aries, ., it, use, U.S., english, to, illustrate, sentence, tokenization, .]
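
To see what lemmatization changed, the word list from section II can be compared with the lemma list above, position by position. The sketch below uses only the two printed lists; `LemmaDiffSketch` and `changedForms` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not CoreNLP code).
class LemmaDiffSketch {

    // Returns "word -> lemma" for every position where the two forms differ.
    static List<String> changedForms(List<String> words, List<String> lemmas) {
        if (words.size() != lemmas.size()) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        List<String> changes = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            if (!words.get(i).equals(lemmas.get(i))) {
                changes.add(words.get(i) + " -> " + lemmas.get(i));
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // Both lists are copied from the outputs shown in this tutorial.
        List<String> words = List.of("This", "is", "a", "text", "written", "by",
            "Mr.", "Aries", ".", "It", "uses", "U.S.", "english", "to",
            "illustrate", "sentence", "tokenization", ".");
        List<String> lemmas = List.of("this", "be", "a", "text", "write", "by",
            "Mr.", "Aries", ".", "it", "use", "U.S.", "english", "to",
            "illustrate", "sentence", "tokenization", ".");

        // prints [This -> this, is -> be, written -> write, It -> it, uses -> use]
        System.out.println(changedForms(words, lemmas));
    }
}
```

In this example, the lemmatizer lowercases the sentence-initial words and reduces inflected verbs to their base form, while proper nouns and abbreviations are left unchanged.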

## V. Other languages

We will try Arabic. Lemmatization is not provided for this language.

In [6]:
%%java
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ie.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.*;
import java.util.*;

String text = "أنا ذاهب إلى السوق. هل تريد أن أحضر لك شيء ما؟ هكذا إذن! نلتقي بعد أن أعود.";

// set up pipeline properties
Properties props = new Properties();

// set the list of annotators to run
props.setProperty("annotators", "tokenize,ssplit");
props.setProperty("tokenize.language", "ar");
props.setProperty("segment.model", "edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz");
props.setProperty("ssplit.boundaryTokenRegex", "[.]|[!?]+|[!\u061F]+");

// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// create a document object
CoreDocument document = new CoreDocument(text);

// annotate the document
pipeline.annotate(document);

for (CoreSentence sentence: document.sentences()){
    System.out.println("-------------------------------------");
    System.out.println(sentence.text());
    for(CoreLabel token: sentence.tokens()){
        System.out.print(token.word() +  ", ");
    }
    System.out.println();
}

-------------------------------------
أنا ذاهب إلى السوق.
انا, ذاهب, الى, السوق, ., 
-------------------------------------
هل تريد أن أحضر لك شيء ما؟
هل, تريد, ان, احضر, ل, ك, شيء, ما, ?, 
-------------------------------------
هكذا إذن!
هكذا, اذن, !, 
-------------------------------------
نلتقي بعد أن أعود.
نلتقي, بعد, ان, اعود, ., 
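
The value given to "ssplit.boundaryTokenRegex" in In [6] can be tried on its own with `java.util.regex`. The sketch below assumes that a token closes a sentence only when the whole token matches the expression; `BoundaryRegexSketch` is a hypothetical class, not CoreNLP code.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration (not CoreNLP code).
class BoundaryRegexSketch {

    // Same expression as the "ssplit.boundaryTokenRegex" value in In [6];
    // \u061F is the Arabic question mark.
    static final Pattern BOUNDARY = Pattern.compile("[.]|[!?]+|[!\u061F]+");

    // Assumption: the full token must match, not just a part of it.
    static boolean isBoundary(String token) {
        return BOUNDARY.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String token : List.of(".", "?", "!", "?!", "\u061F", "U.S.", "Mr.")) {
            System.out.println(token + " -> " + isBoundary(token));
        }
    }
}
```

Under that assumption, the punctuation tokens, including the Arabic question mark, are boundaries, while "U.S." and "Mr." are not, because their periods are only part of the token.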


In [7]:
// Here, we pass the name of the properties file located inside the Arabic models jar,
// named "StanfordCoreNLP-arabic.properties", which contains the configuration
// build pipeline
StanfordCoreNLP pipeline2 = new StanfordCoreNLP("arabic");

// create a document object
CoreDocument document2 = new CoreDocument(text);

// annotate the document
pipeline2.annotate(document2);

for (CoreSentence sentence: document2.sentences()){
    System.out.println("-------------------------------------");
    System.out.println(sentence.text());
    for(CoreLabel token: sentence.tokens()){
        System.out.print(token.word() + ", ");
    }
    System.out.println();
}

-------------------------------------
أنا ذاهب إلى السوق.
انا, ذاهب, الى, السوق, ., 
-------------------------------------
هل تريد أن أحضر لك شيء ما؟
هل, تريد, ان, احضر, ل, ك, شيء, ما, ?, 
-------------------------------------
هكذا إذن!
هكذا, اذن, !, 
-------------------------------------
نلتقي بعد أن أعود.
نلتقي, بعد, ان, اعود, ., 
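
A properties file is plain `key = value` text that `java.util.Properties` can load. The sketch below builds such text from three of the keys set by hand in In [6] and reads it back. The content is hypothetical: it is not a copy of "StanfordCoreNLP-arabic.properties", whose actual entries may differ, and `PropertiesFileSketch` is a name introduced for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration (not CoreNLP code).
class PropertiesFileSketch {

    // Assumption: file content made of keys used in In [6] only.
    static final String CONTENT = String.join("\n",
        "annotators = tokenize,ssplit",
        "tokenize.language = ar",
        "segment.model = edu/stanford/nlp/models/segmenter/arabic/"
            + "arabic-segmenter-atb+bn+arztrain.ser.gz");

    // Parses "key = value" lines into a Properties object.
    static Properties load(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(CONTENT);
        System.out.println(props.getProperty("annotators"));        // tokenize,ssplit
        System.out.println(props.getProperty("tokenize.language")); // ar
    }
}
```

A `Properties` object loaded this way could then be passed to `new StanfordCoreNLP(props)`, as in In [6].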
