# Parsing using CoreNLP

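Available models are: Arabic, Chinese, English, French, German and Spanish.
Stanford CoreNLP provides a pipeline of annotators that run in sequence.
In this tutorial, we run only the annotators needed for parsing: tokenization, sentence splitting and POS tagging, followed by either constituency parsing (`parse`) or dependency parsing (`depparse`).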

In [1]:
%%pom
dependencies:
    - edu.stanford.nlp:stanford-corenlp:4.2.2
    - groupId: edu.stanford.nlp
      artifactId: stanford-corenlp
      version: 4.2.2
      classifier: models
    - groupId: edu.stanford.nlp
      artifactId: stanford-corenlp
      version: 4.2.2
      classifier: models-arabic

# Equivalent Maven pom.xml entries (library and default models):
# <dependencies>
# <dependency>
#     <groupId>edu.stanford.nlp</groupId>
#     <artifactId>stanford-corenlp</artifactId>
#     <version>4.2.2</version>
# </dependency>
# <dependency>
#     <groupId>edu.stanford.nlp</groupId>
#     <artifactId>stanford-corenlp</artifactId>
#     <version>4.2.2</version>
#     <classifier>models</classifier>
# </dependency>
# </dependencies>

## I. Constituency Parsing


In [18]:
%%java

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.*;
import java.util.*;

String text = "This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.";

// set up pipeline properties
Properties props = new Properties();

// set the list of annotators to run
props.setProperty("annotators", "tokenize,ssplit,pos,parse");

// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(text);

// annotate
pipeline.annotate(document);

document

This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.

In [19]:
%%java
import edu.stanford.nlp.trees.*;

// get trees
for (CoreMap sentence: document.get(CoreAnnotations.SentencesAnnotation.class)){
    Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
    System.out.println(tree);
}


(ROOT (S (NP (DT This)) (VP (VBZ is) (NP (NP (DT a) (NN text)) (VP (VBN written) (PP (IN by) (NP (NNP Mr.) (NNP Aries)))))) (. .)))
(ROOT (S (NP (PRP It)) (VP (VBZ uses) (S (NP (NNP U.S.) (NNP english)) (VP (TO to) (VP (VB illustrate) (NP (NN sentence) (NN tokenization)))))) (. .)))


In [20]:
%%java
// get the first sentence's tree
Tree tree = document.get(CoreAnnotations.SentencesAnnotation.class).get(0)
                      .get(TreeCoreAnnotations.TreeAnnotation.class);
Set<Constituent> treeConstituents = tree.constituents(new LabeledScoredConstituentFactory());

for (Constituent constituent : treeConstituents) {
    if (constituent.label() != null && (constituent.label().toString().equals("VP") || constituent.label().toString().equals("NP"))) {
        System.out.println("found constituent: " + constituent.toString());
        System.out.println(tree.getLeaves().subList(constituent.start(), constituent.end()+1));
    }
}

found constituent: NP(0,0)
[This]
found constituent: NP(2,3)
[a, text]
found constituent: NP(6,7)
[Mr., Aries]
found constituent: NP(2,7)
[a, text, written, by, Mr., Aries]
found constituent: VP(4,7)
[written, by, Mr., Aries]
found constituent: VP(1,7)
[is, a, text, written, by, Mr., Aries]
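
The spans printed above, such as `NP(2,3)`, are inclusive token positions counted over the leaves of the tree. To make that link explicit, here is a minimal sketch that recovers the same spans directly from the bracketed string printed earlier, using only the Java standard library. The class `BracketSpans` and its `spans` method are hypothetical helpers written for this illustration; they are not part of CoreNLP, and they assume a well-formed bracketing in which every `(` is followed by a label.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not a CoreNLP class.
class BracketSpans {

    // Returns "LABEL(start,end)" for every node whose label is in `wanted`.
    // start and end are inclusive leaf (token) positions, in closing order.
    static List<String> spans(String bracketed, Set<String> wanted) {
        List<String> result = new ArrayList<>();
        Deque<String> labels = new ArrayDeque<>();   // labels of the open nodes
        Deque<Integer> starts = new ArrayDeque<>();  // first leaf index of each open node

        // Put spaces around the parentheses so that a whitespace split isolates them.
        String[] tokens = bracketed.replace("(", " ( ").replace(")", " ) ")
                                   .trim().split("\\s+");
        int leafCount = 0;
        for (int i = 0; i < tokens.length; i++) {
            String tok = tokens[i];
            if (tok.equals("(")) {
                labels.push(tokens[++i]);  // the token after "(" is the node label
                starts.push(leafCount);
            } else if (tok.equals(")")) {
                String label = labels.pop();
                int start = starts.pop();
                if (wanted.contains(label)) {
                    result.add(label + "(" + start + "," + (leafCount - 1) + ")");
                }
            } else {
                leafCount++;               // any other token is a leaf (a word)
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String tree = "(ROOT (S (NP (DT This)) (VP (VBZ is) (NP (NP (DT a) (NN text)) "
                    + "(VP (VBN written) (PP (IN by) (NP (NNP Mr.) (NNP Aries)))))) (. .)))";
        System.out.println(spans(tree, new HashSet<>(Arrays.asList("NP", "VP"))));
    }
}
```

The sketch reports nodes in the order their closing parenthesis is met, so the order can differ from the `Set` iteration order of `tree.constituents(...)`, while the spans themselves are the same six pairs.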


## II. Dependency Parsing

In [21]:
%%java

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.*;
import java.util.*;

String text = "This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.";

// set up pipeline properties
Properties props = new Properties();

// set the list of annotators to run
props.setProperty("annotators", "tokenize,ssplit,pos,depparse");

// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(text);

// annotate
pipeline.annotate(document);

document

This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.

In [23]:
%%java
import edu.stanford.nlp.semgraph.*;

// get the dependency graph of each sentence
for (CoreMap sentence: document.get(CoreAnnotations.SentencesAnnotation.class)){
    SemanticGraph dependencies = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
    System.out.println(dependencies);
}

-> text/NN (root)
  -> This/DT (nsubj)
  -> is/VBZ (cop)
  -> a/DT (det)
  -> written/VBN (acl)
    -> Aries/NNP (obl:by)
      -> by/IN (case)
      -> Mr./NNP (compound)
  -> ./. (punct)

-> uses/VBZ (root)
  -> It/PRP (nsubj)
  -> english/NNP (obj)
    -> U.S./NNP (compound)
  -> illustrate/VB (xcomp)
    -> to/TO (mark)
    -> tokenization/NN (obj)
      -> sentence/NN (compound)
  -> ./. (punct)



In [25]:
%%java

SemanticGraph dependencies = document.get(CoreAnnotations.SentencesAnnotation.class).get(0)
                                     .get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);

for (SemanticGraphEdge edge : dependencies.edgeListSorted()) {
    String reln = edge.getRelation().toString();
    String gov = (edge.getSource()).word();
    //int govIdx = (edge.getSource()).index();
    String dep = (edge.getTarget()).word();
    //int depIdx = (edge.getTarget()).index();
    System.out.println(gov + " ---- " + reln + " ---> " + dep);
}

text ---- nsubj ---> This
text ---- cop ---> is
text ---- det ---> a
text ---- acl ---> written
Aries ---- case ---> by
Aries ---- compound ---> Mr.
written ---- obl:by ---> Aries
text ---- punct ---> .
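
The flat edge list above and the indented graph printed earlier describe the same structure. As a minimal sketch of that correspondence, the code below rebuilds the indented view from (governor, relation, dependent) triples using only the Java standard library. `DepTreePrinter` and `render` are hypothetical names introduced for this illustration, not CoreNLP API. The sketch also assumes that each word form is unique within the sentence and that the edges form a tree; real code would key nodes on the token index (see the commented-out `index()` calls above) and would guard against cycles, which enhanced dependency graphs can contain.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a CoreNLP class.
class DepTreePrinter {

    // Each edge is {governor, relation, dependent}.
    // Assumes unique word forms and a tree-shaped (acyclic) edge set.
    static String render(String root, List<String[]> edges) {
        // Group the outgoing edges by governor, keeping the input order.
        Map<String, List<String[]>> children = new LinkedHashMap<>();
        for (String[] e : edges) {
            children.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e);
        }
        StringBuilder sb = new StringBuilder();
        sb.append("-> ").append(root).append(" (root)\n");
        appendChildren(root, 1, children, sb);
        return sb.toString();
    }

    // Depth-first walk: print each dependent, then its own dependents one level deeper.
    private static void appendChildren(String gov, int depth,
                                       Map<String, List<String[]>> children,
                                       StringBuilder sb) {
        for (String[] e : children.getOrDefault(gov, Collections.<String[]>emptyList())) {
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            sb.append("-> ").append(e[2]).append(" (").append(e[1]).append(")\n");
            appendChildren(e[2], depth + 1, children, sb);
        }
    }

    public static void main(String[] args) {
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"text", "nsubj", "This"});
        edges.add(new String[] {"text", "cop", "is"});
        edges.add(new String[] {"text", "det", "a"});
        edges.add(new String[] {"text", "acl", "written"});
        edges.add(new String[] {"Aries", "case", "by"});
        edges.add(new String[] {"Aries", "compound", "Mr."});
        edges.add(new String[] {"written", "obl:by", "Aries"});
        edges.add(new String[] {"text", "punct", "."});
        System.out.print(render("text", edges));
    }
}
```

Unlike the `SemanticGraph` printout, this sketch omits the POS tags, since the triples carry only the word forms and the relation names.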
