Source: Natural Language Processing with Java Cookbook [Packt](https://www.packtpub.com/product/natural-language-processing-with-java-cookbook/9781789801156)

## Tokenizing with OpenNLP Library

### Load POM dependencies from [OpenNLP](https://opennlp.apache.org/maven-dependency.html)

In [6]:
%%loadFromPOM
<repositories>
  <repository>
    <id>apache opennlp snapshot</id>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
  </repository>
</repositories>

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>

Use **SimpleTokenizer**

In [7]:
import opennlp.tools.tokenize.SimpleTokenizer;

public void tokenizeSentence(String sentence) {
    SimpleTokenizer simpletkn = SimpleTokenizer.INSTANCE;
    String tokenList[] = simpletkn.tokenize(sentence);
    for (String token: tokenList) {
        System.out.println(token);
    }
}

String phrase = "This is the best day of my life, as some would say.";
tokenizeSentence(phrase);

This
is
the
best
day
of
my
life
,
as
some
would
say
.
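
The output above shows punctuation split off from the neighbouring words. A minimal sketch of that idea, using only the standard library, is to start a new token whenever the character class (whitespace, letter, digit, other) changes. The class name `CharClassTokenizerSketch` and its helpers are hypothetical, and the code is an approximation for illustration, not OpenNLP's implementation; in particular, it always emits each punctuation character as its own token.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: approximates character-class tokenization
// with the standard library. Not OpenNLP's implementation.
class CharClassTokenizerSketch {

    private enum CharClass { WHITESPACE, LETTER, DIGIT, OTHER }

    private static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the current token, -1 when between tokens
        CharClass previous = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            CharClass current = classify(text.charAt(i));
            // A boundary occurs when the class changes; in this sketch every
            // punctuation character also starts a token of its own.
            boolean boundary = current != previous || current == CharClass.OTHER;
            if (boundary) {
                if (start >= 0) tokens.add(text.substring(start, i));
                start = (current == CharClass.WHITESPACE) ? -1 : i;
            }
            previous = current;
        }
        if (start >= 0) tokens.add(text.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        String phrase = "This is the best day of my life, as some would say.";
        for (String token : tokenize(phrase)) {
            System.out.println(token);
        }
    }
}
```

Because the rule looks only at character classes, a mixed token such as `42nd` is split into `42` and `nd`; whether that is desirable depends on the text being processed.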


## Tokenizing with OpenNLP's Maximum Entropy Tokenizer

In [8]:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

In [9]:
public void tokenizeMaxEntropy(String phrase){
    try (InputStream modelInputStream = new FileInputStream(new File("../models/", "en-token.bin"))) {
        TokenizerModel tknModel = new TokenizerModel(modelInputStream);
        Tokenizer tokenizer = new TokenizerME(tknModel);
        
        String tokenList[] = tokenizer.tokenize(phrase);
        for (String token: tokenList) { System.out.println(token);}
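    } catch (FileNotFoundException e) {
        System.out.println("Model file not found: " + e.getMessage());
    } catch (IOException e) {
        // Report read/parse failures instead of silently swallowing them
        System.out.println("Could not load tokenizer model: " + e.getMessage());
    }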
}

String sampleText = "This is the best day indeed!";
tokenizeMaxEntropy(sampleText);

This
is
the
best
day
indeed
!


## Tokenizing manually with Scanner

In [10]:
import java.util.ArrayList;
import java.util.Scanner;

public void tokenizeManually(String phrase){
    Scanner scanner = new Scanner(phrase);
    ArrayList<String> list = new ArrayList<>();
    while (scanner.hasNext()) {
        String token = scanner.next();
        list.add(token);
    }
    
    for (String token : list) { System.out.println(token); }
}

String phrase = "This is the best day of my life, as some would say.";
tokenizeManually(phrase);

This
is
the
best
day
of
my
life,
as
some
would
say.
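
`Scanner` splits on whitespace by default, which is why `life,` and `say.` keep their punctuation in the output above. One possible way to separate punctuation without any library is a regular expression that matches either a run of word characters or a single non-space symbol. The class name `RegexTokenizerSketch` is hypothetical and the pattern is an assumption chosen for illustration, not something taken from the source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: regex-based tokenization that keeps punctuation as separate tokens.
class RegexTokenizerSketch {

    // Either a run of word characters, or one character that is neither word nor whitespace
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String phrase = "This is the best day of my life, as some would say.";
        for (String token : tokenize(phrase)) {
            System.out.println(token);
        }
    }
}
```

This simple pattern also breaks contractions such as `don't` into three tokens, so it is a starting point rather than a general-purpose tokenizer.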


### Training a tokenizer on specialized text
- training text in [`../data/training-data.train`](../data/training-data.train)
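
The training file itself is not reproduced here. OpenNLP's tokenizer training format expects one sentence per line, with tokens separated either by whitespace or by a `<SPLIT>` tag where two tokens touch. The sketch below shows how such a line could be produced from a sentence and its tokens; the class `SplitAnnotatorSketch` and the method `annotate` are hypothetical helpers, and the sample sentence is made up for illustration rather than taken from the actual training data.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: builds one training line in the <SPLIT> format.
class SplitAnnotatorSketch {

    // Walks through the sentence token by token and inserts <SPLIT>
    // wherever a token starts directly after the previous one.
    static String annotate(String sentence, List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int pos = 0; // index just past the previous token
        for (String token : tokens) {
            int start = sentence.indexOf(token, pos);
            if (start < 0) {
                throw new IllegalArgumentException("Token not found in order: " + token);
            }
            if (out.length() > 0) {
                // Touching tokens need the explicit marker; otherwise one space suffices
                out.append(start == pos ? "<SPLIT>" : " ");
            }
            out.append(token);
            pos = start + token.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sentence = "In addition, the rook moved.";
        List<String> tokens = Arrays.asList("In", "addition", ",", "the", "rook", "moved", ".");
        // Prints: In addition<SPLIT>, the rook moved<SPLIT>.
        System.out.println(annotate(sentence, tokens));
    }
}
```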

In [11]:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

In [13]:
// Create input text to train on
InputStreamFactory inputStreamFactory = new InputStreamFactory() {
    public InputStream createInputStream() throws FileNotFoundException {
        return new FileInputStream("../data/training-data.train");
    }
};

try (
    ObjectStream<String> stringObjectStream = new PlainTextByLineStream(inputStreamFactory, "UTF-8");
    ObjectStream<TokenSample> tokenSampleStream = new TokenSampleStream(stringObjectStream);) {
    // create model   
    TokenizerModel tokenizerModel = TokenizerME.train(tokenSampleStream,
                                                      new TokenizerFactory("en", null, true, null), 
                                                      TrainingParameters.defaultParams());
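    // try-with-resources closes (and flushes) the stream once the model is written
    try (BufferedOutputStream modelOutputStream = new BufferedOutputStream(
            new FileOutputStream(new File("../models/mymodel.bin")))) {
        tokenizerModel.serialize(modelOutputStream);
    }
} catch (IOException ex) {
    System.out.println("Training or saving the model failed: " + ex.getMessage());
}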

Indexing events with TwoPass using cutoff of 5

	Computing event counts...  done. 36 events
	Indexing...  done.
Sorting and merging events... done. Reduced 36 events to 12.
Done indexing in 0.03 s.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 12
	    Number of Outcomes: 2
	  Number of Predicates: 9
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-24.95329850015802	0.8611111111111112
  2:  ... loglikelihood=-14.200654164477221	0.8611111111111112
  3:  ... loglikelihood=-11.526745527757855	0.8611111111111112
  4:  ... loglikelihood=-9.984657035211438	0.8888888888888888
  5:  ... loglikelihood=-8.837634767583115	0.8888888888888888
  6:  ... loglikelihood=-7.925934782768229	0.8888888888888888
  7:  ... loglikelihood=-7.182391009502338	0.8888888888888888
  8:  ... loglikelihood=-6.565411011241649	0.8888888888888888
  9:  ... loglikelihood=-6.045907913839374	0.9166666666666666
 10:  ... loglikelihood=-5.602806368635076

In [15]:
// Test the trained model
String sampleText = "In addition, the rook was moved too far to be effective.";
try (InputStream modelInputStream = new FileInputStream(
        new File("../models", "mymodel.bin"))) {
    TokenizerModel tokenizerModel = new TokenizerModel(modelInputStream);
    Tokenizer tokenizer = new TokenizerME(tokenizerModel);
    String tokenList[] = tokenizer.tokenize(sampleText);
    for (String token : tokenList) {
        System.out.println(token);
    }
} catch (FileNotFoundException e) {
    System.out.println("Model file not found: " + e.getMessage());
} catch (IOException e) {
    System.out.println("Could not load tokenizer model: " + e.getMessage());
}

In
addition
,
the
rook
was
moved
too
far
to
be
effective
.


### Stemming using OpenNLP's `PorterStemmer`

In [16]:
import opennlp.tools.stemmer.PorterStemmer;

String words[] = {"draft", "drafted", "drafting", "drafts", "drafty", "draftsman"};
PorterStemmer porterStemmer = new PorterStemmer();
for (String word:words) {
    String stem = porterStemmer.stem(word);
    System.out.println("The stem of " + word + " is " + stem);
}

The stem of draft is draft
The stem of drafted is draft
The stem of drafting is draft
The stem of drafts is draft
The stem of drafty is drafti
The stem of draftsman is draftsman
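
The stem `drafti` for `drafty` can look surprising. It is consistent with one rule of Porter's original algorithm (step 1c), which replaces a final `y` with `i` when the rest of the word contains a vowel. The sketch below implements only that single rule to make the behaviour visible; the class `PorterStep1cSketch` is hypothetical and is not a substitute for `PorterStemmer`, which applies many more steps.

```java
// Illustrative sketch: only step 1c of Porter's algorithm, not a full stemmer.
class PorterStep1cSketch {

    // a, e, i, o, u are vowels; y counts as a vowel when it follows a consonant
    private static boolean isVowel(String s, int i) {
        char c = s.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return true;
        return c == 'y' && i > 0 && !isVowel(s, i - 1);
    }

    private static boolean containsVowel(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (isVowel(s, i)) return true;
        }
        return false;
    }

    // Replaces a trailing y with i if the remaining stem contains a vowel
    static String step1c(String word) {
        if (word.length() > 1 && word.endsWith("y")) {
            String stem = word.substring(0, word.length() - 1);
            if (containsVowel(stem)) return stem + "i";
        }
        return word;
    }

    public static void main(String[] args) {
        for (String word : new String[] { "drafty", "happy", "sky", "draft" }) {
            System.out.println(word + " -> " + step1c(word));
        }
    }
}
```

A stem is therefore not guaranteed to be a dictionary word, which is one reason to prefer lemmatization when readable base forms are needed.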


### Determining the lexical meaning of a word (lemmatization)

In [17]:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;

In [21]:
// Keep all use of `lemmatizer` inside the try block so the variable is in scope
try (InputStream modelInputStream = new FileInputStream("../models/en-lemmatizer.bin")) {
    LemmatizerModel lemmatizerModel = new LemmatizerModel(modelInputStream);
    LemmatizerME lemmatizer = new LemmatizerME(lemmatizerModel);

    String[] tokens = new String[] { 
        "The", "girls", "were", "leaving", "the", 
        "clubhouse", "for", "another", "adventurous", 
        "afternoon", "." };
    // One part-of-speech tag per token, in the same order
    String[] posTags = new String[] { "DT", "NNS", "VBD", 
        "VBG", "DT", "NN", "IN", "DT", "JJ", "NN", "." };
    String[] lemmas = lemmatizer.lemmatize(tokens, posTags);
    for (int i = 0; i < tokens.length; i++) {
        System.out.println(tokens[i] + " - " + lemmas[i]);
    }
} catch (FileNotFoundException e) {
    System.out.println("Model file not found: " + e.getMessage());
} catch (IOException e) {
    System.out.println("Could not load lemmatizer model: " + e.getMessage());
}
