<a href="https://colab.research.google.com/github/prakashsukhwal/Machine-Learning-in-Healthcare/blob/main/Stanza_OpenIE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stanza: A Tutorial on the Python CoreNLP Interface

![Latest Version](https://img.shields.io/pypi/v/stanza.svg?colorB=bc4545)
![Python Versions](https://img.shields.io/pypi/pyversions/stanza.svg?colorB=bc4545)

While the Stanza library implements accurate neural network modules for basic functionalities such as part-of-speech tagging and dependency parsing, the [Stanford CoreNLP Java library](https://stanfordnlp.github.io/CoreNLP/) has been developed for years and offers complementary features such as coreference resolution and relation extraction. To unlock these features, the Stanza library also offers an officially maintained Python interface to the CoreNLP Java library. This interface allows you to get NLP annotations from CoreNLP by writing native Python code.


This tutorial walks you through the installation, setup and basic usage of this Python CoreNLP interface. If you want to learn how to use the neural network components in Stanza, please refer to other tutorials.

## 1. Installation

Before the installation starts, please make sure that you have Python 3 and Java installed on your computer. Since Colab already has them installed, we'll skip this procedure in this notebook.

### Installing Stanza

Installing and importing Stanza is as simple as running the following commands:

In [None]:
# Install stanza; note that the prefix "!" is not needed if you are running in a terminal
!pip install stanza

# Import stanza
import stanza

Collecting stanza
  Downloading stanza-1.3.0-py3-none-any.whl (432 kB)
     |████████████████████████████████| 432 kB 5.1 MB/s 
Collecting emoji
  Downloading emoji-1.6.1.tar.gz (170 kB)
     |████████████████████████████████| 170 kB 54.2 MB/s 
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.6.1-py3-none-any.whl size=169314 sha256=78f53629e97b010467fcb491653d9f10f3b9f24000f0cc224270c9b72438c572
  Stored in directory: /root/.cache/pip/wheels/ea/5f/d3/03d313ddb3c2a1a427bb4690f1621eea60fe6f2a30cc95940f
Successfully built emoji
Installing collected packages: emoji, stanza
Successfully installed emoji-1.6.1 stanza-1.3.0


### Setting up Stanford CoreNLP

In order for the interface to work, the Stanford CoreNLP library has to be installed and a `CORENLP_HOME` environment variable has to be pointed to the installation location.

Here we are going to show you how to download and install the CoreNLP library on your machine, with Stanza's installation command:

In [None]:
# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)

# Set the CORENLP_HOME environment variable to point to the installation location
import os
os.environ["CORENLP_HOME"] = corenlp_dir

2021-10-19 05:25:56 INFO: Installing CoreNLP package into ./corenlp...


Downloading https://huggingface.co/stanfordnlp/CoreNLP/resolve/main/stanford-corenlp-latest.zip:   0%|        …



That's all for the installation! 🎉  We can now double check if the installation is successful by listing files in the CoreNLP directory. You should be able to see a number of `.jar` files by running the following command:

In [None]:
# Examine the CoreNLP installation folder to make sure the installation is successful
!ls $CORENLP_HOME

build.xml				  jollyday.jar
corenlp.sh				  LIBRARY-LICENSES
CoreNLP-to-HTML.xsl			  LICENSE.txt
ejml-core-0.39.jar			  Makefile
ejml-core-0.39-sources.jar		  patterns
ejml-ddense-0.39.jar			  pom-java-11.xml
ejml-ddense-0.39-sources.jar		  pom-java-17.xml
ejml-simple-0.39.jar			  pom.xml
ejml-simple-0.39-sources.jar		  protobuf-java-3.11.4.jar
input.txt				  README.txt
input.txt.out				  RESOURCE-LICENSES
input.txt.xml				  SemgrexDemo.java
istack-commons-runtime-3.0.7.jar	  ShiftReduceDemo.java
istack-commons-runtime-3.0.7-sources.jar  slf4j-api.jar
javax.activation-api-1.2.0.jar		  slf4j-simple.jar
javax.activation-api-1.2.0-sources.jar	  stanford-corenlp-4.3.1.jar
javax.json-api-1.0-sources.jar		  stanford-corenlp-4.3.1-javadoc.jar
javax.json.jar				  stanford-corenlp-4.3.1-models.jar
jaxb-api-2.4.0-b180830.0359.jar		  stanford-corenlp-4.3.1-sources.jar
jaxb-api-2.4.0-b180830.0359-sources.jar   StanfordCoreNlpDemo.java
jaxb-impl-2.4.0-b180830.0438.jar	  StanfordDependenciesManual.p
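
The same check can be done from Python, which is handy outside a notebook where `!ls` is not available. The following is a minimal sketch using only the standard library; the helper name `find_corenlp_jars` is hypothetical and not part of Stanza.

```python
import os
from pathlib import Path


def find_corenlp_jars(corenlp_home=None):
    """Return the sorted names of .jar files found directly under the CoreNLP folder."""
    # Fall back to the CORENLP_HOME environment variable when no path is given
    home = corenlp_home or os.environ.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    folder = Path(home)
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} is not a directory")
    return sorted(p.name for p in folder.glob("*.jar"))
```

Calling `find_corenlp_jars()` after the installation cell should return a non-empty list; an empty list suggests the download did not complete or the variable points at the wrong folder.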

**Note 1**:
If you want to use the interface in a terminal (instead of a Colab notebook), you can set the `CORENLP_HOME` environment variable with:

```bash
export CORENLP_HOME=path_to_corenlp_dir
```

Here we instead set this variable with the Python `os` library, simply because the `export` command is not well supported in Colab notebooks.


**Note 2**:
The `stanza.install_corenlp()` function has only been available since Stanza v1.1.1. If you are using an earlier version of Stanza, please check out our [manual installation page](https://stanfordnlp.github.io/stanza/client_setup.html#manual-installation) for how to install CoreNLP on your computer.

**Note 3**:
Besides the installation function, we also provide a `stanza.download_corenlp_models()` function to help you download additional CoreNLP models for different languages that are not shipped with the default installation. Check out our [automatic installation website page](https://stanfordnlp.github.io/stanza/client_setup.html#automated-installation) for more information on how to use it.

## 2. Annotating Text with CoreNLP Interface

### Constructing CoreNLPClient

At a high level, the CoreNLP Python interface works by first starting a background Java CoreNLP server process, and then initializing a client instance in Python which can pass the text to the background server process, and accept the returned annotation results.

We wrap these functionalities in a `CoreNLPClient` class. Therefore, we need to start by importing this class from Stanza.

In [None]:
# Import client module
from stanza.server import CoreNLPClient

After the import is done, we can construct a `CoreNLPClient` instance. The constructor takes a Python list of annotator names as an argument. Here let's explore some basic annotators including tokenization, sentence splitting, part-of-speech tagging, lemmatization and named entity recognition (NER).

Additionally, the client constructor accepts a `memory` argument, which specifies how much memory will be allocated to the background Java process. An `endpoint` option specifies the URL, including the port number, that the client and the server use to communicate. The default port is 9000. However, since this port is already occupied by a system process in Colab, we'll manually set it to 9001 in the following example.

Also, here we manually set `be_quiet=True` to avoid an IO issue in Colab notebooks. You should be able to use `be_quiet=False` on your own computer, which will print detailed logging information from CoreNLP during usage.

For more options in constructing the clients, please refer to the [CoreNLP Client Options List](https://stanfordnlp.github.io/stanza/corenlp_client.html#corenlp-client-options).
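
Before picking a port for `endpoint`, it can help to check whether something is already listening on it. The sketch below uses only the standard `socket` module; the helper names `port_in_use` and `first_free_port` are hypothetical, and the check only detects ports that accept TCP connections, so treat it as a rough guide rather than a guarantee.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


def first_free_port(candidates=(9000, 9001, 9002, 9003)):
    """Return the first candidate port with no listener, or None if all are taken."""
    for port in candidates:
        if not port_in_use(port):
            return port
    return None
```

The returned port could then be formatted into the endpoint string, for example `f"http://localhost:{port}"`.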

In [None]:
# Construct a CoreNLPClient with basic annotators plus OpenIE and coreference,
# a memory allocation of 8GB, and port number 9001
client = CoreNLPClient(
    annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'openie', 'coref'],
    memory='8G',
    endpoint='http://localhost:9001',
    be_quiet=True)
print(client)

# Start the background server and wait for some time
# Note that in practice this is totally optional, as by default the server will be started when the first annotation is performed
client.start()
import time; time.sleep(10)

2021-10-19 05:26:21 INFO: Writing properties to tmp file: corenlp_server-57073e27b8714bce.props
2021-10-19 05:26:21 INFO: Starting server with command: java -Xmx8G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-57073e27b8714bce.props -annotators tokenize,ssplit,pos,lemma,ner,openie,coref -preload -outputFormat serialized


<stanza.server.client.CoreNLPClient object at 0x7f5a39003310>


In [None]:
# import nltk
# nltk.download('punkt')

In [None]:
# # try coref using corenlp
# from nltk import tokenize
# def pronoun_resolution(text):

#     ann = client.annotate(text)
#     modified_text = tokenize.sent_tokenize(text)

#     for coref in ann.corefChain:

#         antecedent = []
#         for mention in coref.mention:
#             phrase = []
#             for i in range(mention.beginIndex, mention.endIndex):
#                 phrase.append(ann.sentence[mention.sentenceIndex].token[i].word)
#             if antecedent == []:
#                 antecedent = ' '.join(word for word in phrase)
#             else:
#                 anaphor = ' '.join(word for word in phrase)
#                 modified_text[mention.sentenceIndex] = modified_text[mention.sentenceIndex].replace(anaphor, antecedent)

#     modified_text = ' '.join(modified_text)

#     return modified_text

# text = '''Panax ginseng C.A.Mey. (Korea red ginseng) has been used in Asia to treat inflammatory skin diseases. 
# Recently, it is emerging as a good candidate for treating atopic dermatitis (AD) because of its anti-allergic and anti-inflammatory effects.
# Despite much effort, no systemic prevention strategy has been established for AD currently.'''
# pronoun_resolution(text)

After the above code block finishes executing, if you print the background processes, you should be able to find the Java CoreNLP server running.

In [None]:
# Print background processes and look for java
# You should be able to see a StanfordCoreNLPServer java process running in the background
!ps -o pid,cmd | grep java

    139 java -Xmx8G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-57073e27b8714bce.props -annotators tokenize,ssplit,pos,lemma,ner,openie,coref -preload -outputFormat serialized
    160 /bin/bash -c ps -o pid,cmd | grep java
    162 grep java


### Annotating Text

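Annotating a piece of text is as simple as passing the text to the client's `annotate` function. By default, a `Document` object containing all annotations is returned once annotation completes; the cell below instead passes `output_format='json'`, in which case the result is a Python dictionary parsed from CoreNLP's JSON output.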

Note that although in general annotations are very fast, the first annotation might take a while to complete in the notebook. Please stay patient.

In [None]:
# text = """

# Panax ginseng C.A.Mey. (Korea red ginseng) has been used in Asia to treat inflammatory skin diseases. 
# Recently, it is emerging as a good candidate for treating atopic dermatitis (AD) because of its anti-allergic and anti-inflammatory effects.
# Despite much effort, no systemic prevention strategy has been established for AD currently. 
# Therefore, the aim of this study was to determine the preventive effect of a combination of KRG extract and probiotics on AD-like skin lesions of mice. Forty NC/Nga mice were randomly divided into eight groups: Sham, AD control, Cyclosporine, KRG, Duolac ATP® (ATP), BYO Plant Origin Skin Probiotics (BYO), KRG + ATP, and KRG + BYO. Mice were administered orally with KRG and/or other agents using a gastric tube for 5 days prior to challenge with 1-chloro-2,4-dinitrobenzene (DNCB). AD-like skin lesions were induced by percutaneous challenge with DNCB on ears and backs of NC/Nga mice. Effects of each treatment were evaluated based on the following: Clinical severity score, ear thickness, transepidermal water loss (TEWL), total serum Immunoglobulin E (IgE) level, mRNA expression levels and immunohistochemistry analysis of IFN-γ, IL-4, and TSLP in cutaneous lesions. TEWL, serum IgE level, and expression of immunohistopathologic markers were more improved in the group using KRG combined with probiotics than in the group using KRG or probiotics alone. ATP, KRG + ATP, and KRG + BYO groups showed reduced TEWL increase (ΔTEWL) at 48 h (p < 0.005). KRG + ATP showed a preventive effect on the increase of serum IgE level (p = 0.009). In immunohistopathologic analysis, KRG, ATP, BYO, KRG + ATP, and KRG + BYO groups showed significantly reduced expression levels of IFN-γ at 1 h, 6 h, and 48 h (all p < 0.05). KRG, ATP, BYO, and KRG + BYO groups showed reduced expression levels of IL-4 compared to the AD control group at 6 h and 24 h. KRG, ATP, BYO, KRG + ATP, and KRG + BYP groups showed significantly lower expression levels of TSLP than the AD control group at 1 h and 24 h. KRG can suppress increases of allergic and inflammatory cytokines and increase of TEWL. A combination of KRG and probiotics might have better effects than KRG or probiotics alone for preventing an AD flare-up.

# """

text = """  
Panax ginseng C.A.Mey . ( Korea red ginseng ) has been used in Asia to treat inflammatory skin diseases . Recently , Korea red ginseng ( KRG means Korea red ginseng ) is emerging as a good candidate for treating atopic dermatitis ( AD means atopic dermatitis ) because of its anti-allergic and anti-inflammatory effects . Despite much effort , no systemic prevention strategy has been established for AD means atopic dermatitis currently . Therefore , the aim of this study was to determine the preventive effect of a combination of KRG means Korea red ginseng extract and probiotics on AD-like skin lesions of mice . Forty NC/Nga mice were randomly divided into eight groups : Sham , AD means atopic dermatitis control , Cyclosporine , KRG means Korea red ginseng , Duolac ATP means ATP® ® ( ATP means ATP® ) , BYO means BYO Plant Origin Skin Probiotics Plant Origin Skin Probiotics ( BYO means BYO Plant Origin Skin Probiotics ) , KRG means Korea red ginseng   +   ATP means ATP® , and KRG means Korea red ginseng   +   BYO means BYO Plant Origin Skin Probiotics . Mice were administered orally with KRG means Korea red ginseng and/or other agents using a gastric tube for 5 days prior to challenge with 1-chloro-2,4-dinitrobenzene ( DNCB means days prior to challenge with 1-chloro-2,4-dinitrobenzene ) . AD-like skin lesions were induced by percutaneous challenge with DNCB means days prior to challenge with 1-chloro-2,4-dinitrobenzene on ears and backs of NC/Nga mice . Effects of each treatment were evaluated based on the following : Clinical severity score , ear thickness , transepidermal water loss ( TEWL means transepidermal water loss ) , total serum Immunoglobulin E ( IgE means Immunoglobulin E ) level , mRNA expression levels and immunohistochemistry analysis of IFN-γ , IL-4 , and TSLP in cutaneous lesions . 
TEWL means transepidermal water loss , serum IgE means Immunoglobulin E level , and expression of immunohistopathologic markers were more improved in the group using KRG means Korea red ginseng combined with probiotics than in the group using KRG means Korea red ginseng or probiotics alone . ATP means ATP® , KRG means Korea red ginseng   +   ATP means ATP® , and KRG means Korea red ginseng   +   BYO means BYO Plant Origin Skin Probiotics groups showed reduced TEWL means transepidermal water loss increase ( ΔTEWL ) at 48   h ( p   & lt ;   0.005 ) . KRG means Korea red ginseng   +   ATP means ATP® showed a preventive effect on the increase of serum IgE means Immunoglobulin E level ( p   =   0.009 ) . In immunohistopathologic analysis , KRG means Korea red ginseng , ATP means ATP® , BYO means BYO Plant Origin Skin Probiotics , KRG means Korea red ginseng   +   ATP means ATP® , and KRG means Korea red ginseng   +   BYO means BYO Plant Origin Skin Probiotics groups showed significantly reduced expression levels of IFN-γ at 1   h , 6   h , and 48   h ( all p   & lt ;   0.05 ) . KRG means Korea red ginseng , ATP means ATP® , BYO means BYO Plant Origin Skin Probiotics , and KRG means Korea red ginseng   +   BYO means BYO Plant Origin Skin Probiotics groups showed reduced expression levels of IL-4 compared to the AD means atopic dermatitis control group at 6   h and 24   h. KRG means Korea red ginseng , ATP means ATP® , BYO means BYO Plant Origin Skin Probiotics , KRG means Korea red ginseng   +   ATP means ATP® , and KRG means Korea red ginseng   +   BYP groups showed significantly lower expression levels of TSLP than the AD means atopic dermatitis control group at 1   h and 24   h. KRG means Korea red ginseng can suppress increases of allergic and inflammatory cytokines and increase of TEWL means transepidermal water loss . 
A combination of KRG means Korea red ginseng and probiotics might have better effects than KRG means Korea red ginseng or probiotics alone for preventing an AD means atopic dermatitis flare-up 
"""
document = client.annotate(text, output_format='json')
triples = []
for sentence in document['sentences']:
    for triple in sentence['openie']:
        triples.append({
            'subject': triple['subject'],
            'relation': triple['relation'],
            'object': triple['object']
        })
print(triples)

[{'subject': 'Korea red ginseng', 'relation': 'treat', 'object': 'inflammatory skin diseases'}, {'subject': 'Korea red ginseng', 'relation': 'treat', 'object': 'skin diseases'}, {'subject': 'Korea red ginseng', 'relation': 'has', 'object': 'has used'}, {'subject': 'Korea red ginseng', 'relation': 'has', 'object': 'has used in Asia'}, {'subject': 'Korea ginseng', 'relation': 'Recently is emerging for', 'object': 'treating atopic dermatitis'}, {'subject': 'Korea ginseng', 'relation': 'is emerging for', 'object': 'atopic dermatitis'}, {'subject': 'Korea ginseng', 'relation': 'Recently is emerging as', 'object': 'candidate'}, {'subject': 'Korea ginseng', 'relation': 'Recently is emerging for', 'object': 'dermatitis'}, {'subject': 'Korea ginseng', 'relation': 'is emerging for', 'object': 'dermatitis'}, {'subject': 'Korea ginseng', 'relation': 'is emerging as', 'object': 'good candidate'}, {'subject': 'Korea ginseng', 'relation': 'is emerging because', 'object': 'its effects'}, {'subject': '
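
OpenIE tends to emit several overlapping triples for the same fact, such as "inflammatory skin diseases" and "skin diseases" in the output above. The following is a minimal sketch of one possible way to thin them out, by dropping a triple when another triple with the same subject and relation has a longer object that contains it. The function names are hypothetical, the heuristic is an assumption rather than anything CoreNLP prescribes, and the sample dictionary is hand-written from the first two triples shown above, using only the three keys read in the cell.

```python
def extract_triples(document):
    """Collect subject/relation/object dicts from a CoreNLP JSON-style dict."""
    triples = []
    for sentence in document.get("sentences", []):
        for triple in sentence.get("openie", []):
            triples.append({key: triple[key] for key in ("subject", "relation", "object")})
    return triples


def drop_contained_objects(triples):
    """Drop a triple when another triple with the same subject and relation
    has a different object that contains this triple's object."""
    kept = []
    for t in triples:
        contained = any(
            o["subject"] == t["subject"]
            and o["relation"] == t["relation"]
            and o["object"] != t["object"]
            and t["object"] in o["object"]
            for o in triples
        )
        # Also skip exact duplicates that were already kept
        if not contained and t not in kept:
            kept.append(t)
    return kept


# Hand-written sample mimicking the dictionary shape used above
sample = {
    "sentences": [
        {
            "openie": [
                {"subject": "Korea red ginseng", "relation": "treat",
                 "object": "inflammatory skin diseases"},
                {"subject": "Korea red ginseng", "relation": "treat",
                 "object": "skin diseases"},
            ]
        }
    ]
}

filtered = drop_contained_objects(extract_triples(sample))
print(filtered)
```

On this sample only the triple with the object "inflammatory skin diseases" survives. Substring matching is deliberately crude; whether it removes too much depends on the downstream use of the triples.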

In [None]:
document

In [None]:
# triples

In [None]:
import csv
import pandas as pd

# Write the triples to a CSV file with one column per field.
# Note that fields are quoted with '|' here, so readers must use the same quotechar.
with open("output.csv", "w", newline="") as f:
    fieldnames = ["subject", "relation", "object"]
    cw = csv.DictWriter(f, fieldnames, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    cw.writeheader()
    cw.writerows(triples)


In [None]:
!ls

corenlp  corenlp_server-57073e27b8714bce.props	output.csv  sample_data


In [None]:
# pd.read_csv('/content/output.csv')

In [None]:
# Write the same triples as tab-separated values, with columns in alphabetical order
with open('output.tsv', 'w', newline='') as output_file:
    dw = csv.DictWriter(output_file, sorted(triples[0].keys()), delimiter='\t')
    dw.writeheader()
    dw.writerows(triples)

In [None]:
d = pd.read_csv('output.tsv', sep='\t')
d

Unnamed: 0,object,relation,subject
0,inflammatory skin diseases,treat,Korea red ginseng
1,skin diseases,treat,Korea red ginseng
2,has used,has,Korea red ginseng
3,has used in Asia,has,Korea red ginseng
4,treating atopic dermatitis,Recently is emerging for,Korea ginseng
...,...,...,...
193,Korea ginseng,means,combination
194,probiotics,combination,Korea ginseng
195,probiotics,combination,Korea red ginseng
196,Korea ginseng,means,KRG
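
With nearly two hundred rows, a quick tally of relation strings gives a feel for what was extracted. The sketch below reads tab-separated triples with the standard `csv` module and counts relations; the helper name `count_relations` is hypothetical, and the inline rows are a few entries copied from the table above so that the example runs without the notebook's `output.tsv`.

```python
import csv
import io
from collections import Counter

# A few rows copied from the table above, in the same tab-separated layout
tsv_text = (
    "object\trelation\tsubject\n"
    "inflammatory skin diseases\ttreat\tKorea red ginseng\n"
    "skin diseases\ttreat\tKorea red ginseng\n"
    "has used\thas\tKorea red ginseng\n"
    "Korea ginseng\tmeans\tKRG\n"
)


def count_relations(tsv_file):
    """Count how often each relation string occurs in a TSV of triples."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row["relation"] for row in reader)


relation_counts = count_relations(io.StringIO(tsv_text))
print(relation_counts.most_common())
```

To run it on the real file, pass an open file handle instead, for example `open('output.tsv', newline='')`.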


In [None]:
# # Annotate some text
# text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
# document = client.annotate(text)
# print(type(document))

## 3. Accessing Annotations

Annotations can be accessed from the `Document` object that `client.annotate(text)` returns when no `output_format` is specified.

A `Document` contains a list of `Sentence`s, which contain a list of `Token`s. Here let's first explore the annotations stored in all tokens.

In [None]:
# # Iterate over all tokens in all sentences, and print out the word, lemma, pos and ner tags
# print("{:12s}\t{:12s}\t{:6s}\t{}".format("Word", "Lemma", "POS", "NER"))

# for i, sent in enumerate(document.sentence):
#     print("[Sentence {}]".format(i+1))
#     for t in sent.token:
#         print("{:12s}\t{:12s}\t{:6s}\t{}".format(t.word, t.lemma, t.pos, t.ner))
#     print("")
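
When the annotation was requested with `output_format='json'`, the result is a dictionary rather than a `Document`, so attribute access such as `sent.token` does not apply. The following is a minimal sketch of the equivalent loop over a dictionary. The key names `tokens`, `word`, `lemma`, `pos` and `ner` are assumptions about the CoreNLP JSON layout that should be checked against actual output, the helper name `token_rows` is hypothetical, and the sample is hand-written rather than real server output.

```python
def token_rows(document):
    """Yield (sentence_number, word, lemma, pos, ner) for every token in a JSON-style dict."""
    for i, sentence in enumerate(document.get("sentences", []), start=1):
        for token in sentence.get("tokens", []):
            # .get() keeps the sketch usable when an annotator was not run
            yield (i, token.get("word"), token.get("lemma"), token.get("pos"), token.get("ner"))


# Hand-written sample; the tag values are illustrative only
sample_doc = {
    "sentences": [
        {
            "tokens": [
                {"word": "Albert", "lemma": "Albert", "pos": "NNP", "ner": "PERSON"},
                {"word": "Einstein", "lemma": "Einstein", "pos": "NNP", "ner": "PERSON"},
            ]
        }
    ]
}

rows = list(token_rows(sample_doc))
for sentence_number, word, lemma, pos, ner in rows:
    print("{}\t{:12s}\t{:12s}\t{:6s}\t{}".format(sentence_number, word, lemma, pos, ner))
```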

Alternatively, you can browse the NER results by iterating over the entity mentions in each sentence. For example:

In [None]:
# # Iterate over all detected entity mentions
# print("{:30s}\t{}".format("Mention", "Type"))

# for sent in document.sentence:
#     for m in sent.mentions:
#         print("{:30s}\t{}".format(m.entityMentionText, m.entityType))

To print all annotations of a sentence, token or mention, you can simply print the corresponding object.

In [None]:
# # Print annotations of a token
# print(document.sentence[0].token[0])

# # Print annotations of a mention
# print(document.sentence[0].mentions[0])

**Note**: Since the Stanza CoreNLP client interface simply ports the CoreNLP annotation results to native Python objects, for a comprehensive list of available annotators and how their annotation results can be accessed, you will need to visit the [Stanford CoreNLP website](https://stanfordnlp.github.io/CoreNLP/).

## 4. Shutting Down the CoreNLP Server

To shut down the background CoreNLP server process, simply call the `stop` function of the client. Note that once a server is shut down, you'll have to restart it with the `start()` function before requesting any further annotation.

In [None]:
# Shut down the background CoreNLP server
client.stop()

time.sleep(10)
!ps -o pid,cmd | grep java

### More Information

For more information on how to use the `CoreNLPClient`, please go to the [CoreNLPClient documentation page](https://stanfordnlp.github.io/stanza/corenlp_client.html).

## 5. Simplifying Client Usage with the Python `with` statement

In the above demo, we explicitly called the `client.start()` and `client.stop()` functions to start and stop a client-server connection. However, doing this in practice is usually suboptimal, since you may forget to call the `stop()` function at the end, resulting in an unused server process occupying your machine memory.

To solve this, a simple solution is to use the client interface with the [Python `with` statement](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement). The `with` statement provides an elegant way to automatically start and stop the server process in your Python program, without you needing to worry about this. The following code snippet demonstrates how to establish a client, annotate an example text and then stop the server with a simple `with` statement. Note that we **always recommend** using the `with` statement when working with the Stanza CoreNLP client interface.

In [None]:
# print("Starting a server with the Python \"with\" statement...")
# with CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], 
#                    memory='4G', endpoint='http://localhost:9001', be_quiet=True) as client:
#     text = "Albert Einstein was a German-born theoretical physicist."
#     document = client.annotate(text)

#     print("{:30s}\t{}".format("Mention", "Type"))
#     for sent in document.sentence:
#         for m in sent.mentions:
#             print("{:30s}\t{}".format(m.entityMentionText, m.entityType))

# print("\nThe server should be stopped upon exit from the \"with\" statement.")
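
To see why the `with` statement is safer than manual `start()` and `stop()` calls, it helps to look at the context manager protocol in isolation. The sketch below uses a stand-in class, `FakeServerClient`, which is purely hypothetical and does not reproduce how `CoreNLPClient` is implemented; it only illustrates that the cleanup step runs even when the body raises an exception.

```python
class FakeServerClient:
    """Stand-in for a client that owns a background process (not Stanza's class)."""

    def __init__(self):
        self.running = False
        self.events = []

    def start(self):
        self.running = True
        self.events.append("start")

    def stop(self):
        self.running = False
        self.events.append("stop")

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always stop, whether or not the body raised
        self.stop()
        return False  # do not swallow exceptions


demo = FakeServerClient()
try:
    with demo:
        raise ValueError("annotation failed")
except ValueError:
    pass

print(demo.events, demo.running)  # prints ['start', 'stop'] False
```

With manual calls, the same exception would skip the `stop()` line entirely unless it were wrapped in `try`/`finally`.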

## 6. Other Resources

- [Stanza Homepage](https://stanfordnlp.github.io/stanza/)
- [FAQs](https://stanfordnlp.github.io/stanza/faq.html)
- [GitHub Repo](https://github.com/stanfordnlp/stanza)
- [Reporting Issues](https://github.com/stanfordnlp/stanza/issues)
