# Stanza: A Tutorial on the Python CoreNLP Interface

![Latest Version](https://img.shields.io/pypi/v/stanza.svg?colorB=bc4545)
![Python Versions](https://img.shields.io/pypi/pyversions/stanza.svg?colorB=bc4545)

## Source: https://github.com/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb

While the Stanza library implements accurate neural network modules for basic functionalities such as part-of-speech tagging and dependency parsing, the [Stanford CoreNLP Java library](https://stanfordnlp.github.io/CoreNLP/) has been developed for years and offers complementary features such as coreference resolution and relation extraction. To unlock these features, Stanza also offers an officially maintained Python interface to the CoreNLP Java library. This interface lets you obtain NLP annotations from CoreNLP by writing native Python code.


This tutorial walks you through the installation, setup and basic usage of this Python CoreNLP interface. If you want to learn how to use the neural network components in Stanza, please refer to other tutorials.

## 1. Installation (taken from the official source)

Before the installation starts, please make sure that you have Python 3 and Java installed on your computer. Since Colab already has them installed, we'll skip this procedure in this notebook.

### Installing Stanza

Installing and importing Stanza is as simple as running the following commands:

In [1]:
# Install stanza; note that the prefix "!" is not needed if you are running in a terminal
!pip install stanza

# Import stanza
import stanza

Collecting stanza
  Downloading stanza-1.6.1-py3-none-any.whl (881 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.8.0 stanza-1.6.1


### Setting up Stanford CoreNLP

For the interface to work, the Stanford CoreNLP library must be installed and the `CORENLP_HOME` environment variable must point to the installation location.

Here we are going to show you how to download and install the CoreNLP library on your machine, with Stanza's installation command:

In [2]:
# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)

# Set the CORENLP_HOME environment variable to point to the installation location
import os
os.environ["CORENLP_HOME"] = corenlp_dir

INFO:stanza:Installing CoreNLP package into ./corenlp


Downloading https://huggingface.co/stanfordnlp/CoreNLP/resolve/main/stanford-corenlp-latest.zip:   0%|        …



That's all for the installation! 🎉  We can now double-check that the installation succeeded by listing the files in the CoreNLP directory. Running the following command should show a number of `.jar` files:

In [3]:
# Examine the CoreNLP installation folder to make sure the installation is successful
!ls $CORENLP_HOME

build.xml				  LIBRARY-LICENSES
corenlp.sh				  LICENSE.txt
CoreNLP-to-HTML.xsl			  Makefile
ejml-core-0.39.jar			  patterns
ejml-core-0.39-sources.jar		  pom-java-11.xml
ejml-ddense-0.39.jar			  pom-java-17.xml
ejml-ddense-0.39-sources.jar		  pom.xml
ejml-simple-0.39.jar			  protobuf-java-3.19.6.jar
ejml-simple-0.39-sources.jar		  README.txt
input.txt				  RESOURCE-LICENSES
input.txt.out				  sample-project-pom.xml
input.txt.xml				  SemgrexDemo.java
istack-commons-runtime-3.0.7.jar	  ShiftReduceDemo.java
istack-commons-runtime-3.0.7-sources.jar  slf4j-api.jar
javax.activation-api-1.2.0.jar		  slf4j-simple.jar
javax.activation-api-1.2.0-sources.jar	  stanford-corenlp-4.5.5.jar
javax.json-api-1.0-sources.jar		  stanford-corenlp-4.5.5-javadoc.jar
javax.json.jar				  stanford-corenlp-4.5.5-models.jar
jaxb-api-2.4.0-b180830.0359.jar		  stanford-corenlp-4.5.5-sources.jar
jaxb-api-2.4.0-b180830.0359-sources.jar   StanfordCoreNlpDemo.java
jaxb-impl-2.4.0-b180830.0438.jar	  StanfordDependenci
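
The same check can be scripted in Python rather than read off the `ls` listing. The sketch below is illustrative only: `list_corenlp_jars` is a hypothetical helper name, not part of Stanza, and it assumes that `CORENLP_HOME` (or the path passed in) is the unpacked CoreNLP directory.

```python
import os
from pathlib import Path


def list_corenlp_jars(corenlp_home=None):
    """Return the sorted names of the .jar files found in a CoreNLP folder.

    Hypothetical helper (not part of Stanza). Falls back to the
    CORENLP_HOME environment variable when no path is given.
    """
    home = corenlp_home or os.environ.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    folder = Path(home)
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} is not a directory")
    # Only the top level is inspected, mirroring `ls $CORENLP_HOME`
    return sorted(p.name for p in folder.glob("*.jar"))
```

Calling `list_corenlp_jars()` after the installation cell should return a non-empty list; an empty list suggests the archive was not unpacked where expected.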

### Annotating in Spanish with CoreNLP

In [4]:
nlp = stanza.Pipeline('es')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/tokenize/ancora.pt:   0%|      …

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/mwt/ancora.pt:   0%|          |…

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/pos/ancora_charlm.pt:   0%|    …

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/lemma/ancora_nocharlm.pt:   0%|…

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/constituency/combined_charlm.pt…

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/depparse/ancora_charlm.pt:   0%…

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/sentiment/tass2020.pt:   0%|   …

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/ner/conll02.pt:   0%|          …

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/pretrain/fasttextwiki.pt:   0%|…

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/backward_charlm/newswiki.pt:   …

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/pretrain/conll17.pt:   0%|     …

Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.6.0/models/forward_charlm/newswiki.pt:   0…

INFO:stanza:Loading these models for language: es (Spanish):
| Processor    | Package         |
----------------------------------
| tokenize     | ancora          |
| mwt          | ancora          |
| pos          | ancora_charlm   |
| lemma        | ancora_nocharlm |
| constituency | combined_charlm |
| depparse     | ancora_charlm   |
| sentiment    | tass2020        |
| ner          | conll02         |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


In [5]:
doc = nlp('Albert Einstein fue un físico teórico alemán. Desarrolló la Teoría de la Relatividad.')

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)

Albert Albert PROPN
Einstein Einstein PROPN
fue ser AUX
un uno DET
físico físico NOUN
teórico teórico ADJ
alemán alemán ADJ
. . PUNCT
Desarrolló desarrollar VERB
la el DET
Teoría Teoría PROPN
de de ADP
la el DET
Relatividad Relatividad PROPN
. . PUNCT
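
Beyond printing each word, the same nested loop can feed an aggregate. As a minimal sketch, the hypothetical helper `count_upos` below tallies universal POS tags; it assumes only that the document exposes `.sentences`, that each sentence exposes `.words`, and that each word carries a `.upos` string, as in the cells of this notebook.

```python
from collections import Counter


def count_upos(doc):
    """Count universal POS tags over every word in an annotated document.

    `doc` is expected to look like a Stanza Document: it has `.sentences`,
    each sentence has `.words`, and each word has a `.upos` string.
    """
    return Counter(
        word.upos for sentence in doc.sentences for word in sentence.words
    )
```

With the `doc` produced by the cell above, `count_upos(doc).most_common(3)` would list the three most frequent tags.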


In [6]:
for sentence in doc.sentences:
    print(sentence.ents)

[{
  "text": "Albert Einstein",
  "type": "PER",
  "start_char": 0,
  "end_char": 15
}]
[{
  "text": "Teoría de la Relatividad",
  "type": "MISC",
  "start_char": 60,
  "end_char": 84
}]
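
The `start_char` and `end_char` values in this output are character offsets into the original input string, with `end_char` exclusive. A quick way to see this, using only the input text and the offsets printed above, is ordinary Python slicing:

```python
text = "Albert Einstein fue un físico teórico alemán. Desarrolló la Teoría de la Relatividad."

# (start_char, end_char) pairs copied from the entity output above
spans = [(0, 15), (60, 84)]

# end_char is exclusive, so plain slicing recovers each mention
mentions = [text[start:end] for start, end in spans]
print(mentions)  # ['Albert Einstein', 'Teoría de la Relatividad']
```

This makes the offsets convenient for highlighting mentions in the source text without re-tokenizing it.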


In [None]:
# Exercise 1: Process a text with CoreNLP and extract a dictionary whose keys are the entity types and whose values are the lists of mentions of that type found in the text.


In [12]:
doc = nlp('Albert Einstein fue un físico teórico de Alemania. Desarrolló la Teoría de la Relatividad. Murió en Suiza. Pedro Sánchez es el presidente de España.')
diccionario = dict()
for sentence in doc.sentences:
    for entity in sentence.ents:
        entity_text = entity.text  # Text of the entity mention
        entity_type = entity.type  # Entity type (e.g. PER, LOC, MISC)
        # Append the mention to the list for its type, creating the list if needed
        if entity_type in diccionario:
            diccionario[entity_type].append(entity_text)
        else:
            diccionario[entity_type] = [entity_text]

print(diccionario)

{'PER': ['Albert Einstein', 'Pedro Sánchez'], 'LOC': ['Alemania', 'Suiza', 'España'], 'MISC': ['Teoría de la Relatividad']}
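
The grouping step can also be written with `collections.defaultdict`, which removes the explicit membership test. The function name `group_entities_by_type` is a hypothetical choice for this sketch; it assumes entities expose `.type` and `.text`, as in the cell above.

```python
from collections import defaultdict


def group_entities_by_type(doc):
    """Map each entity type to the list of mention texts, in document order."""
    grouped = defaultdict(list)
    for sentence in doc.sentences:
        for entity in sentence.ents:
            # Missing keys start as an empty list automatically
            grouped[entity.type].append(entity.text)
    # Return a plain dict so later lookups of unknown types raise KeyError
    return dict(grouped)
```

Wrapping the logic in a function makes it easy to reuse on several documents, for example `group_entities_by_type(nlp(other_text))`.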


In [16]:
for sentence in doc.sentences:
    print(sentence.dependencies)

[({
  "id": 5,
  "text": "físico",
  "lemma": "físico",
  "upos": "NOUN",
  "xpos": "ncms000",
  "feats": "Gender=Masc|Number=Sing",
  "head": 0,
  "deprel": "root",
  "start_char": 23,
  "end_char": 29
}, 'nsubj', {
  "id": 1,
  "text": "Albert",
  "lemma": "Albert",
  "upos": "PROPN",
  "xpos": "np00000",
  "head": 5,
  "deprel": "nsubj",
  "start_char": 0,
  "end_char": 6
}), ({
  "id": 1,
  "text": "Albert",
  "lemma": "Albert",
  "upos": "PROPN",
  "xpos": "np00000",
  "head": 5,
  "deprel": "nsubj",
  "start_char": 0,
  "end_char": 6
}, 'flat', {
  "id": 2,
  "text": "Einstein",
  "lemma": "Einstein",
  "upos": "PROPN",
  "head": 1,
  "deprel": "flat",
  "start_char": 7,
  "end_char": 15
}), ({
  "id": 5,
  "text": "físico",
  "lemma": "físico",
  "upos": "NOUN",
  "xpos": "ncms000",
  "feats": "Gender=Masc|Number=Sing",
  "head": 0,
  "deprel": "root",
  "start_char": 23,
  "end_char": 29
}, 'cop', {
  "id": 3,
  "text": "fue",
  "lemma": "ser",
  "upos": "AUX",
  "xpos": "v
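
Because each dependency is printed as a full `(head, relation, dependent)` triple, the raw output gets long quickly. A more compact view can be sketched as follows; `format_dependencies` is a hypothetical helper, and it assumes only that both words in each triple expose a `.text` attribute.

```python
def format_dependencies(dependencies):
    """Render (head, relation, dependent) triples as one compact line each.

    Each triple is expected to hold two word-like objects with a `.text`
    attribute and a relation label string, mirroring the output above.
    """
    return [
        f"{dependent.text} <-{relation}- {head.text}"
        for head, relation, dependent in dependencies
    ]
```

For instance, `print("\n".join(format_dependencies(sentence.dependencies)))` inside the loop above would print one arc per line.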

In [33]:
# Exercise 2 (extension): Process a text with CoreNLP and extract the list of (verb, subject) tuples from its sentences.
doc = nlp('Luis cocina mucho. Albert Einstein fue un físico teórico alemán. La niña fue al colegio. El perro sale a pasear.')
lista = []
for sentence in doc.sentences:
    verbo = None
    sujeto = None
    for word in sentence.words:
        # Keep the last verb or auxiliary seen in the sentence
        if word.upos in ("VERB", "AUX"):
            verbo = word.text
        # `and` binds tighter than `or`: any PROPN counts as a subject,
        # while a NOUN must also carry an nsubj relation
        if "PROPN" in word.upos or ("NOUN" in word.upos and "nsubj" in word.deprel):
            sujeto = word.text
    if verbo and sujeto:
        lista.append((verbo, sujeto))

print(lista)


[('cocina', 'Luis'), ('fue', 'Einstein'), ('fue', 'niña'), ('pasear', 'perro')]
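
The token-level heuristic above keeps the last verb and the last matching noun of each sentence, which is why it reports `pasear` rather than `sale` and `Einstein` (a `flat` dependent) rather than `Albert`. An alternative is to read the pairs directly off the `nsubj` arcs of the dependency parse. The sketch below is one possible approach, not the notebook's official solution: `verb_subject_pairs` is a hypothetical name, and it assumes that `word.head` is the 1-based `id` of the governing word (0 for the root) and that copular clauses attach the subject to the nominal predicate, as in the dependency output shown earlier.

```python
def verb_subject_pairs(doc):
    """Collect (predicate, subject) text pairs from nsubj dependency arcs.

    Assumes Stanza-style words: `.id` is the 1-based position in the
    sentence, `.head` is the id of the governing word (0 for the root),
    and `.deprel` is the Universal Dependencies relation label.
    """
    pairs = []
    for sentence in doc.sentences:
        words = sentence.words
        for word in words:
            if not word.deprel.startswith("nsubj") or word.head == 0:
                continue
            head = words[word.head - 1]
            if head.upos not in ("VERB", "AUX"):
                # Copular clause: the subject attaches to the nominal
                # predicate, so report the copula verb instead if present.
                cop = next(
                    (w for w in words if w.head == head.id and w.deprel == "cop"),
                    None,
                )
                head = cop or head
            pairs.append((head.text, word.text))
    return pairs
```

Only the head of a multi-word name is returned as the subject; the `flat` dependents would need to be joined separately if the full name is required.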
