# Demonstration: Running Unitex CasSys cascades with Python

Unitex/GramLab must be installed before running this notebook: https://unitexgramlab.org/.

This notebook shows how to use the Python code available in this repository to execute Unitex CasSys cascades for annotating texts. 
The `Tutorial` CasSys cascade provided in the [Tutorial_CasSys+Graphs.zip](https://github.com/ludovicmoncla/python-unitex-cassys/tree/main/Unitex-CasSys) folder is used as an example. It illustrates how dates and addresses can be annotated and parsed.

In the current version, [spaCy](https://spacy.io) must be installed in your Python environment for the POS tagging preprocessing step.
Functions for converting the output of other tools such as TreeTagger and Stanza will be available soon.

## Import Python libraries

In [None]:
import os
import spacy
from bs4 import BeautifulSoup

from scripts.Unitex import Unitex
from scripts.posTagger_to_unitex import spacy2unitex

In [None]:
# uncomment to install spaCy and download the English and French models
#!pip install -U spacy
#!python -m spacy download en_core_web_sm
#!python -m spacy download fr_core_news_sm
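
Before loading a model, it can help to check that it is actually installed. The snippet below is a minimal sketch, assuming that models fetched with `spacy download` are installed as regular Python packages; the helper name `is_model_installed` is hypothetical and not part of this repository.

```python
import importlib.util


def is_model_installed(model_name):
    """Return True if a Python package called `model_name` can be found."""
    return importlib.util.find_spec(model_name) is not None


# report the status of the two models mentioned above
for model in ("en_core_web_sm", "fr_core_news_sm"):
    status = "installed" if is_model_installed(model) else "missing"
    print(f"{model}: {status}")
```

If a model is reported as missing, uncomment the corresponding download command above and run it once.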

## Run the spaCy POS tagger and create the Unitex-compliant input txt file

In [None]:
text = ("Mark your calendars for an exciting event at 35 Charles Street, London, on the 3rd of June 2023.")

# load the spaCy model
nlp = spacy.load("en_core_web_sm")

# run spaCy
doc = nlp(text)

# convert the output to Unitex format
unitex_input = spacy2unitex(doc)

# show the result
print(unitex_input)

# save the txt file on disk (create the output directory if it does not exist)
filename = 'tmp'
os.makedirs('output', exist_ok=True)
filepath = os.path.join('output', filename + '.txt')
with open(filepath, 'w') as f:
    f.write(unitex_input)

## Run Unitex as a Python code snippet

In [None]:
# configuration
version = "Tutorial"    # name of the directory in '{unitex-directory}/{language}/CasSys/' and '{unitex-directory}/{language}/Graphs/'
lang = 'English'        # name of the language directory in '{unitex-directory}/'

install_path = "{replace by your Unitex/GramLab personal working directory}"     # Unitex/GramLab personal working directory (containing language directories and cascades and graphs)
install_path_app = "{replace by your Unitex/GramLab installation directory}/App" # Unitex/GramLab installation directory (containing the App directory)
delete_tmp_files = True

# path of the input file, without extension (the script writes its output with the same base name in the same directory)
filepath = os.path.join('output', filename)

# run Unitex CasSys cascades
unitex = Unitex(version, lang, install_path, install_path_app, delete_tmp_files)
doc = unitex.run(filepath)

# show the result
print(doc)
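
If the run fails, a wrong `install_path`, `lang` or `version` is a likely cause. The following sketch checks the directory layout described in the configuration comments above, assuming that a directory named after the cascade exists under both `CasSys` and `Graphs`. The helper `missing_cascade_dirs` is hypothetical and the path passed to it is a placeholder.

```python
import os


def missing_cascade_dirs(install_path, lang, version):
    """Return the expected cascade and graph directories that do not exist."""
    expected = [
        os.path.join(install_path, lang, "CasSys", version),
        os.path.join(install_path, lang, "Graphs", version),
    ]
    return [path for path in expected if not os.path.isdir(path)]


# placeholder values: replace the first argument with your own working directory
missing = missing_cascade_dirs("path/to/unitex-working-directory", "English", "Tutorial")
if missing:
    print("Missing directories:")
    for path in missing:
        print("  " + path)
else:
    print("Cascade and graph directories found.")
```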

## Parse the XML result

The CasSys cascades annotate the text and return the result in XML format.
This section shows how to parse that XML with Python (using the [BeautifulSoup library](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)).

In this tutorial, the `address` graph (see figure below) adds the tag `address` when it matches a specific pattern. 
![The address graph of the Tutorial cascade](https://github.com/ludovicmoncla/python-unitex-cassys/blob/be6f59cd5b82aeb6e4e64fe221fa48a8836f8cb7/Unitex-CasSys/img/Tutorial_address_graph.png?raw=true)

The `synthesis` cascade transforms the output of the `analysis` cascade into valid XML, following the annotation tagset defined in the analysis cascade. In this case, the annotation produces \<address\> elements.

Thus, once the XML output is produced, you can look for all the XML elements named `address` and get their content. As a reminder, every token is annotated with its `pos` and `lemma` in a \<w\> element.
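
To make the target structure concrete, here is a small standalone sketch. The XML fragment is hypothetical and written by hand: it assumes that `pos` and `lemma` are attributes of \<w\> and that an \<address\> element wraps its tokens, so the real output of the `Tutorial` cascade may differ. It uses the standard-library `xml.etree.ElementTree` module rather than BeautifulSoup so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# hypothetical fragment, written by hand for illustration only
sample = (
    "<doc>"
    '<w pos="ADP" lemma="at">at</w> '
    "<address>"
    '<w pos="NUM" lemma="35">35</w> '
    '<w pos="PROPN" lemma="Charles">Charles</w> '
    '<w pos="PROPN" lemma="Street">Street</w>'
    "</address>"
    "</doc>"
)

sample_root = ET.fromstring(sample)

# for each <address> element, collect (form, pos, lemma) for every <w> token
for address in sample_root.iter("address"):
    tokens = [(w.text, w.get("pos"), w.get("lemma")) for w in address.iter("w")]
    print(" ".join(form for form, _, _ in tokens))
    print(tokens)
```

The same traversal (find the annotation element, then iterate over its \<w\> children) applies whichever XML library is used.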



In [None]:
# parse the string output into an XML object (with the BeautifulSoup library)
root = BeautifulSoup(doc, 'xml')

# show the XML content
print(root.prettify())

In [None]:
# return the text content of every element named `tag` found under `root`
def get_element_values(root, tag):
    values = []
    for element in root.find_all(tag):
        content = ''
        # concatenate the text of all <w> elements within the current element
        for w in element.find_all('w'):
            content += w.get_text()
        # Python strings have strip(), not trim()
        values.append(content.strip())
    return values

print('text:', text)
# print the values of date and address elements in the XML output (if any)
print("dates:", get_element_values(root, 'date'))
print("addresses:", get_element_values(root, 'address'))


## Run `Unitex.py` as a Python script

In [None]:
# configuration (replace the two paths below with your own Unitex/GramLab directories)
version = "Tutorial"
lang = 'English'
install_path = "/Users/lmoncla/Nextcloud-LIRIS/Programmes/Unitex-GramLab-3.2"
install_path_app = "/Users/lmoncla/Programmes/Unitex-GramLab-3.2/App"
filepath = os.path.join('output', filename)

!python scripts/Unitex.py -i $filepath -l $lang -c $version --install_path $install_path --install_path_app $install_path_app

# read the XML file written by the script
with open(filepath + "_csc_csc.xml", "r") as file:
    doc = file.read()
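
The `!python` shell call above breaks if one of the paths contains a space. As a possible alternative, the sketch below builds the same command as an argument list for `subprocess.run`, reusing only the flags shown in the cell above. The helper `build_unitex_command` is hypothetical and the paths are placeholders.

```python
import subprocess
import sys


def build_unitex_command(filepath, lang, version, install_path, install_path_app):
    """Build the argument list for scripts/Unitex.py with the flags used above."""
    return [
        sys.executable, "scripts/Unitex.py",
        "-i", filepath,
        "-l", lang,
        "-c", version,
        "--install_path", install_path,
        "--install_path_app", install_path_app,
    ]


# placeholder values: a path containing a space stays a single argument
command = build_unitex_command(
    "output/tmp", "English", "Tutorial",
    "/path/to/My Unitex", "/path/to/Unitex-GramLab/App",
)
print(command)

# uncomment to run the script from the repository root
# subprocess.run(command, check=True)
```

Passing a list instead of a single string means no shell quoting is needed, and `check=True` raises an exception if the script exits with an error.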
