## Extracting Sentences from Text (Sentence Boundary Disambiguation, `SBD`)
- Finding sentences using the Java core API
- Performing SBD using the BreakIterator class
- Using OpenNLP to perform SBD
- Using the Stanford NLP API to perform SBD
- Using LingPipe and chunking to perform SBD
- Performing SBD on specialized text
- Training an OpenNLP sentence detector model to perform SBD with specialized text


### Using the Core Java Library

In [1]:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

In [2]:
String text = "We will start with a simple sentence. However, is it "
    + "possible for a sentence to end with a question "
    + "mark? Obviously that is possible! Another "
    + "complication is the use of a number such as 56.32 "
    + "or ellipses such as ... Ellipses may be found ... "
    + "with a sentence! Of course, we may also find the "
    + "use of abbreviations such as Mr. Smith or "
    + "Dr. Jones.";

In [3]:
// Naive approach: treat every '.', '?' or '!' as a sentence boundary.
String sentenceDelimiters = "[.?!]";
String[] sentences = text.split(sentenceDelimiters);
for (String sentence : sentences) {
    System.out.println(sentence);
}

We will start with a simple sentence
 However, is it possible for a sentence to end with a question mark
 Obviously that is possible
 Another complication is the use of a number such as 56
32 or ellipses such as 


 Ellipses may be found 


 with a sentence
 Of course, we may also find the use of abbreviations such as Mr
 Smith or Dr
 Jones


In [4]:
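// Match a run of whitespace, then non-terminator characters, then one terminator.
// Because of the leading \\s+, the first word ("We") is skipped: no whitespace precedes it.
Pattern sentencePattern = Pattern.compile("\\s+[^.!?]*[.!?]");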
Matcher matcher = sentencePattern.matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group());
}

 will start with a simple sentence.
 However, is it possible for a sentence to end with a question mark?
 Obviously that is possible!
 Another complication is the use of a number such as 56.
 or ellipses such as .
 Ellipses may be found .
 with a sentence!
 Of course, we may also find the use of abbreviations such as Mr.
 Smith or Dr.
 Jones.
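
Both attempts above treat every terminator as a boundary, which is why `56.32` is cut in two. A slightly less naive sketch, still using only the core library, splits only where a terminator is followed by whitespace and an uppercase letter. The class name `LookaroundSplitSketch` and the sample text are made up for illustration; this is one possible refinement, not a complete solution.

```java
import java.util.Arrays;
import java.util.List;

public class LookaroundSplitSketch {
    // Split on whitespace that follows '.', '?' or '!' and precedes an uppercase letter.
    // The lookbehind and lookahead keep the terminator and the capital letter in the output.
    private static final String BOUNDARY = "(?<=[.?!])\\s+(?=[A-Z])";

    static List<String> split(String text) {
        return Arrays.asList(text.split(BOUNDARY));
    }

    public static void main(String[] args) {
        String sample = "The value is 56.32 today. Is that right? Ask Mr. Smith.";
        for (String sentence : split(sample)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

The decimal number now survives, but `Mr. Smith` is still split after `Mr.` because a capitalized word follows the period. Abbreviations are the main reason rule-based splitting is usually replaced by the library approaches in the following sections.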


### Using `BreakIterator` Class

In [8]:
import java.text.BreakIterator;

String text = "We will start with a simple sentence. However, is it "
    + "possible for a sentence to end with a question "
    + "mark? Obviously that is possible! Another "
    + "complication is the use of a number such as 56.32 "
    + "or ellipses such as ... Ellipses may be found ... "
    + "with a sentence! Of course, we may also find the "
    + "use of abbreviations such as Mr. Smith or "
    + "Dr. Jones.";

BreakIterator breakIterator = BreakIterator.getSentenceInstance();
breakIterator.setText(text);

In [10]:
int startPosition = breakIterator.first();
int endingPosition = breakIterator.first();

while (true) {
    endingPosition = breakIterator.next();
    if (endingPosition == BreakIterator.DONE) {
        break;
    } else {
        System.out.println(startPosition + "-" + endingPosition + " [" + text.substring(startPosition, endingPosition) + "]");
        startPosition = endingPosition;
    }
}

0-38 [We will start with a simple sentence. ]
38-106 [However, is it possible for a sentence to end with a question mark? ]
106-134 [Obviously that is possible! ]
134-216 [Another complication is the use of a number such as 56.32 or ellipses such as ... ]
216-259 [Ellipses may be found ... with a sentence! ]
259-324 [Of course, we may also find the use of abbreviations such as Mr. ]
324-337 [Smith or Dr. ]
337-343 [Jones.]
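
The output above shows `BreakIterator` breaking after `Mr.` and `Dr.`. One way to work around this, sketched here under the assumption that a small hand-maintained abbreviation set is acceptable, is to merge a segment into the next one whenever it ends with a known abbreviation. The class name `AbbreviationAwareBreaks` and the abbreviation set are hypothetical and exist only for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AbbreviationAwareBreaks {
    static List<String> sentences(String text, Set<String> abbreviations) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        StringBuilder pending = new StringBuilder(); // segments not yet emitted
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            pending.append(text, start, end);
            String candidate = pending.toString().trim();
            String lastWord = candidate.substring(candidate.lastIndexOf(' ') + 1);
            // Emit the candidate unless it ends with a known abbreviation.
            if (!abbreviations.contains(lastWord)) {
                if (!candidate.isEmpty()) {
                    result.add(candidate);
                }
                pending.setLength(0);
            }
        }
        // Flush whatever is left if the text ended on an abbreviation.
        String rest = pending.toString().trim();
        if (!rest.isEmpty()) {
            result.add(rest);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Mr.", "Dr."));
        String sample = "Ask Mr. Smith today. Then leave.";
        for (String sentence : sentences(sample, abbreviations)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

The merge step only helps for abbreviations listed in the set, and it would wrongly join two sentences when the first genuinely ends with one of them, so it is a stopgap rather than a replacement for a trained model.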


In [16]:
// Get the last sentence: last() returns the end of the text,
// previous() steps back to the start of the final sentence.
breakIterator.setText(text);

int endingPosition = breakIterator.last();
int startingPosition = breakIterator.previous();
System.out.println(startingPosition + "-" + endingPosition + " [" + text.substring(startingPosition, endingPosition) + "] ");

337-343 [Jones.] 


### Using OpenNLP's `SentenceDetectorME`

In [19]:
%%loadFromPOM
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>

In [20]:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

In [22]:
String text =
    "We will start with a simple sentence. However, is it "
    + "possible for a sentence to end with a question "
    + "mark? Obviously that is possible! Another "
    + "complication is the use of a number such as 56.32 "
    + "or ellipses such as ... Ellipses may be found ... "
    + "with a sentence! Of course, we may also find the "
    + "use of abbreviations such as Mr. Smith or "
    + "Dr. Jones.";

In [33]:
try (InputStream inputStream = new FileInputStream(new File("../models/en-sent.bin"))) {
    // Load the pre-trained English sentence model and print each detected sentence
    SentenceModel sentenceModel = new SentenceModel(inputStream);
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentenceModel);
    String[] sentences = sentenceDetector.sentDetect(text);
    for (String sentence : sentences) {
        System.out.println("[" + sentence + "]");
    }

    // Span objects hold the starting and ending position of each sentence
    Span[] spans = sentenceDetector.sentPosDetect(text);
    for (Span span : spans) {
        System.out.println(span);
    }

    // Probability assigned to each sentence by the most recent detection call
    double[] probabilities = sentenceDetector.getSentenceProbabilities();
    for (int i = 0; i < sentences.length; i++) {
        System.out.printf("Sentence %d: %6.4f\n", i, probabilities[i]);
    }

} catch (FileNotFoundException ex) {
    System.out.println("Model file not found");
} catch (IOException ex) {
    System.out.println("Could not read the model: " + ex.getMessage());
}

[We will start with a simple sentence.]
[However, is it possible for a sentence to end with a question mark?]
[Obviously that is possible!]
[Another complication is the use of a number such as 56.32 or ellipses such as ... Ellipses may be found ... with a sentence!]
[Of course, we may also find the use of abbreviations such as Mr. Smith or Dr. Jones.]
[0..37)
[38..105)
[106..133)
[134..258)
[259..343)
Sentence 0: 0.9999
Sentence 1: 0.8117
Sentence 2: 0.9898
Sentence 3: 0.9953
Sentence 4: 0.9706


### Using Stanford CoreNLP's `WordToSentenceProcessor`

In [40]:
%%loadFromPOM
<!-- https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.2.0</version>
</dependency>


In [41]:
import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;

In [42]:
String text = "We will start with a simple sentence. However, is it "
    + "possible for a sentence to end with a question "
    + "mark? Obviously that is possible! Another "
    + "complication is the use of a number such as 56.32 "
    + "or ellipses such as ... Ellipses may be found ... "
    + "with a sentence! Of course, we may also find the "
    + "use of abbreviations such as Mr. Smith or "
    + "Dr. Jones.";

PTBTokenizer<CoreLabel> ptbTokenizer = new PTBTokenizer<CoreLabel>(new StringReader(text),new CoreLabelTokenFactory(), null);
WordToSentenceProcessor<CoreLabel> wordToSentenceProcessor = new WordToSentenceProcessor<CoreLabel>();
List<List<CoreLabel>> sentenceList = wordToSentenceProcessor.process(ptbTokenizer.tokenize());

for (List<CoreLabel> sentence : sentenceList) {
    System.out.println(sentence);
}

[We, will, start, with, a, simple, sentence, .]
[However, ,, is, it, possible, for, a, sentence, to, end, with, a, question, mark, ?]
[Obviously, that, is, possible, !]
[Another, complication, is, the, use, of, a, number, such, as, 56.32, or, ellipses, such, as, ..., Ellipses, may, be, found, ..., with, a, sentence, !]
[Of, course, ,, we, may, also, find, the, use, of, abbreviations, such, as, Mr., Smith, or, Dr., Jones, .]


In [43]:
for (List<CoreLabel> sentence : sentenceList) {
    for (CoreLabel coreLabel : sentence) {
        System.out.print(coreLabel + " ");
    }
    System.out.println();
}

We will start with a simple sentence . 
However , is it possible for a sentence to end with a question mark ? 
Obviously that is possible ! 
Another complication is the use of a number such as 56.32 or ellipses such as ... Ellipses may be found ... with a sentence ! 
Of course , we may also find the use of abbreviations such as Mr. Smith or Dr. Jones . 


In [44]:
// Print each token with its character offsets (begin:end) in the original text
for (List<CoreLabel> sentence : sentenceList) {
    for (CoreLabel coreLabel : sentence) {
        System.out.print(coreLabel.word() + " - " +  
            coreLabel.beginPosition() + ":" +
            coreLabel.endPosition() + " ");
    }
}
System.out.println();

We - 0:2 will - 3:7 start - 8:13 with - 14:18 a - 19:20 simple - 21:27 sentence - 28:36 . - 36:37 However - 38:45 , - 45:46 is - 47:49 it - 50:52 possible - 53:61 for - 62:65 a - 66:67 sentence - 68:76 to - 77:79 end - 80:83 with - 84:88 a - 89:90 question - 91:99 mark - 100:104 ? - 104:105 Obviously - 106:115 that - 116:120 is - 121:123 possible - 124:132 ! - 132:133 Another - 134:141 complication - 142:154 is - 155:157 the - 158:161 use - 162:165 of - 166:168 a - 169:170 number - 171:177 such - 178:182 as - 183:185 56.32 - 186:191 or - 192:194 ellipses - 195:203 such - 204:208 as - 209:211 ... - 212:215 Ellipses - 216:224 may - 225:228 be - 229:231 found - 232:237 ... - 238:241 with - 242:246 a - 247:248 sentence - 249:257 ! - 257:258 Of - 259:261 course - 262:268 , - 268:269 we - 270:272 may - 273:276 also - 277:281 find - 282:286 the - 287:290 use - 291:294 of - 295:297 abbreviations - 298:311 such - 312:316 as - 317:319 Mr. - 320:323 Smith - 324:329 or - 330:332 Dr. - 333:336 Jone

### Using LingPipe & Chunking

In [45]:
%%loadFromPOM
<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>aliasi-lingpipe</artifactId>
    <version>4.1.0</version>
</dependency>

In [46]:
import java.util.ArrayList;
import java.util.List;
import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

In [53]:
String text = 
    "We will start with a simple sentence. However, is it "
    + "possible for a sentence to end with a question "
    + "mark? Obviously that is possible! Another "
    + "complication is the use of a number such as 56.32 "
    + "or ellipses such as ... Ellipses may be found ... "
    + "with a sentence! Of course, we may also find the "
    + "use of abbreviations such as Mr. Smith or "
    + "Dr. Jones.";
TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel = new IndoEuropeanSentenceModel();

List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();

In [54]:
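// tokenize() fills tokenList with the tokens and whiteList with the whitespace around them
Tokenizer tokenizer = tokenizerFactory.tokenizer(text.toCharArray(), 0, text.length());
tokenizer.tokenize(tokenList, whiteList);

// Each entry is the index (into tokenList) of the last token of a sentence
int[] sentenceBoundaries = sentenceModel.boundaryIndices(
    tokenList.toArray(new String[tokenList.size()]),
    whiteList.toArray(new String[whiteList.size()]));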

In [55]:
// display sentences
int start = 0;
for (int boundary : sentenceBoundaries) {
    System.out.print("[");
    while (start <= boundary) {
        System.out.print(tokenList.get(start) + 
            whiteList.get(start + 1));
        start++;
    }
    System.out.println("]");
}

[We will start with a simple sentence. ]
[However, is it possible for a sentence to end with a question mark? ]
[Obviously that is possible! ]
[Another complication is the use of a number such as 56.32 or ellipses such as ... Ellipses may be found ... with a sentence! ]
[Of course, we may also find the use of abbreviations such as Mr. Smith or Dr. Jones.]


In [57]:
// Print sentence boundaries as token-index ranges (first token:last token)
int begin = 0;
for (int boundary : sentenceBoundaries) {
    System.out.println(begin + ":" + boundary);
    begin = boundary + 1;
}

0:7
8:22
23:27
28:52
53:73
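
The values printed above are token indices, not character offsets like the spans reported by OpenNLP. A minimal sketch of converting them into character spans follows. It assumes, as the display loop above does, that the whitespace list holds one more entry than the token list and that entry `i + 1` is the whitespace following token `i`. The class `TokenBoundaryOffsets` and the hand-built lists are stand-ins for illustration and do not call LingPipe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenBoundaryOffsets {
    // Returns {start, end} character offsets (end exclusive), one pair per sentence.
    static List<int[]> toCharSpans(List<String> tokens, List<String> whites, int[] boundaries) {
        List<int[]> spans = new ArrayList<>();
        int position = whites.get(0).length(); // skip any leading whitespace
        int sentenceStart = position;
        int next = 0; // index of the next boundary to look for
        for (int i = 0; i < tokens.size() && next < boundaries.length; i++) {
            position += tokens.get(i).length();
            boolean isBoundary = (i == boundaries[next]);
            if (isBoundary) {
                spans.add(new int[] { sentenceStart, position });
                next++;
            }
            position += whites.get(i + 1).length();
            if (isBoundary) {
                sentenceStart = position; // next sentence starts after the whitespace
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Hand-built stand-ins for the tokenizer output of "Hi there. Bye!"
        List<String> tokens = Arrays.asList("Hi", "there", ".", "Bye", "!");
        List<String> whites = Arrays.asList("", " ", "", " ", "", "");
        int[] boundaries = { 2, 4 }; // indices of the sentence-final tokens
        for (int[] span : toCharSpans(tokens, whites, boundaries)) {
            System.out.println(span[0] + ".." + span[1]);
        }
    }
}
```

For the sample lists this prints `0..9` and `10..14`, which are the offsets of `Hi there.` and `Bye!` in the original string.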


### Using LingPipe for Specialized Text
For example, medical literature, handled here with `MedlineSentenceModel`.

In [58]:
%%loadFromPOM
<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>aliasi-lingpipe</artifactId>
    <version>4.1.0</version>
</dependency>

In [59]:
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.sentences.MedlineSentenceModel;
import com.aliasi.sentences.SentenceChunker;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

In [60]:
String text = "In total, 33 patients with AIS and 39 healthy "
    + "controls were enrolled in this study. The major "
    + "findings were as follows: (1) The stroke group had "
    + "a significantly lower level of serum Hg (6.4?±?4.3 "
    + "µg/L vs. 9.8?±?7.0 µg/L, P =?0.032, OR?=?0.90, 95% "
    + "CI?=?0.81–0.99) and a lower level of urine Hg "
    + "(0.7?±?0.7 µg/L vs. 1.2?±?0.6 µg/L, P =?0.006, "
    + "OR?=?0.27, 95% CI?=?0.11–0.68) than the control "
    + "group. (2) No significant difference in serum "
    + "Pb (S-Pb), As (S-As), and Cd (S-Cd) levels and "
    + "urine Pb (U-Pb), As (U-As) and Cd (U-Cd) levels "
    + "was observed in either group.";

In [61]:
TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
MedlineSentenceModel medlineSentenceModel = new MedlineSentenceModel();

In [63]:
// SentenceChunker combines the tokenizer and the sentence model;
// each resulting Chunk is one sentence with character offsets.
SentenceChunker sentenceChunker = new SentenceChunker(tokenizerFactory, medlineSentenceModel);
Chunking chunking = sentenceChunker.chunk(text.toCharArray(), 0, text.length());
String slice = chunking.charSequence().toString();

for (Chunk chunk : chunking.chunkSet()) {
    System.out.println("[" + slice.substring(chunk.start(), chunk.end()) + "]");
}

[In total, 33 patients with AIS and 39 healthy controls were enrolled in this study.]
[The major findings were as follows: (1) The stroke group had a significantly lower level of serum Hg (6.4?±?4.3 µg/L vs. 9.8?±?7.0 µg/L, P =?0.032, OR?=?0.90, 95% CI?=?0.81–0.99) and a lower level of urine Hg (0.7?±?0.7 µg/L vs. 1.2?±?0.6 µg/L, P =?0.006, OR?=?0.27, 95% CI?=?0.11–0.68) than the control group.]
[(2) No significant difference in serum Pb (S-Pb), As (S-As), and Cd (S-Cd) levels and urine Pb (U-Pb), As (U-As) and Cd (U-Cd) levels was observed in either group.]


### Training an OpenNLP sentence detector for specialized text

In [64]:
%%loadFromPOM
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>


In [65]:
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

// Build a tiny training set: every sample sentence paired with every terminator, one per line
String[] terminators = { ".", "!", "?", "..." };
String[] sampleSentences = {"A simple sentence", "Another sentence a bit longer", "Last sentence"};

StringBuilder stringBuilder = new StringBuilder();
for (String sentenceTerminator : terminators) {
    for (String sentence : sampleSentences) {
        stringBuilder.append(sentence).append(sentenceTerminator);
        stringBuilder.append(System.lineSeparator());
    }
}

String trainingSentences = stringBuilder.toString();

In [74]:
try (ObjectStream<String> lineStream = new PlainTextByLineStream(
        () -> new ByteArrayInputStream(trainingSentences.getBytes(Charset.forName("UTF-8"))),
        Charset.forName("UTF-8"));
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream)) {
    // Train a model for "en" with no abbreviation dictionary or custom end-of-sentence characters
    SentenceDetectorFactory sentenceDetectorFactory =
        new SentenceDetectorFactory("en", true, null, null);
    SentenceModel sentenceModel = SentenceDetectorME.train(
        "en", sampleStream, sentenceDetectorFactory, TrainingParameters.defaultParams());

    // Save the trained model; try-with-resources flushes and closes the stream
    try (OutputStream modelOutputStream = new BufferedOutputStream(
            new FileOutputStream("../models/modelFile"))) {
        sentenceModel.serialize(modelOutputStream);
    }

    String text = "We will start with a simple sentence. However, is it "
    + "possible for a sentence to end with a question "
    + "mark? Obviously that is possible! Another "
    + "complication is the use of a number such as 56.32 "
    + "or ellipses such as ... Ellipses may be found ... "
    + "with a sentence! Of course, we may also find the "
    + "use of abbreviations such as Mr. Smith or "
    + "Dr. Jones.";

    // Reload the model from disk and use it to split the text
    try (InputStream inputStream = new FileInputStream("../models/modelFile")) {
        sentenceModel = new SentenceModel(inputStream);
    }
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentenceModel);
    String[] sentences = sentenceDetector.sentDetect(text);
    for (String sentence : sentences) {
        System.out.println("[" + sentence + "]");
    }
} catch (FileNotFoundException ex) {
    System.out.println("No such file exists!");
} catch (IOException ex) {
    System.out.println("I/O error: " + ex.getMessage());
}

Indexing events with TwoPass using cutoff of 5

	Computing event counts...  done. 18 events
	Indexing...  done.
Sorting and merging events... done. Reduced 18 events to 14.
Done indexing in 0.01 s.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 14
	    Number of Outcomes: 2
	  Number of Predicates: 10
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-12.476649250079015	0.6666666666666666
  2:  ... loglikelihood=-10.1788351655278	0.6666666666666666
  3:  ... loglikelihood=-9.365516819465263	0.7222222222222222
  4:  ... loglikelihood=-8.806470120346262	0.7222222222222222
  5:  ... loglikelihood=-8.370920558495053	0.7222222222222222
  6:  ... loglikelihood=-8.016876393960429	0.7222222222222222
  7:  ... loglikelihood=-7.721000129401042	0.7222222222222222
  8:  ... loglikelihood=-7.468442127395042	0.7222222222222222
  9:  ... loglikelihood=-7.249170139289655	0.8333333333333334
 10:  ... loglikelihood=-7.05609282775734	0