# Testing a rule-based system

### Context

This notebook walks you through the usage of Corleone and Express, two software tools developed at the Joint Research Center (European Commission, Ispra, Italy) by Jakub Piskorski. These tools are used in production applications, notably the Europe Media Monitor. They are free for academic use, but you should request a license if you want to use them beyond this tutorial.

### Tools

- **Corleone** (Core Linguistic Entity Online Extraction) is a set of lightweight linguistic processing components (text scanner, tokenizer, sentence splitter, morphological analysis and gazetteer lookup).
- **Express** (Extraction Pattern Recognition Engine and Specification Suite) is an information extraction grammar engine, which consists of a grammar parser and a grammar interpreter.

### Objectives

We will not manage to build a full information extraction pipeline within the allocated time (we would need at least a week!). Here the objective is to **give you an idea** of how things work. We will therefore focus on 2 components, gazetteers and grammars, and try to build a small engine that recognizes (some) person names. We will rely on an already compiled tokeniser. You will develop a person name gazetteer and 2 or 3 grammar rules relying on it. Let's get started.

## 0. Setup

In [None]:
import os

### Folder structure

In both the corleone and express repositories, you will find the following structure:

```bash 
.
├── compiled-resources # this is where your compiled resources will go
├── documentation # user-guide is available here
├── experiments # a playground folder, already with some inputs
│   ├── input
│   └── output
├── resources # the 'raw' resources, i.e. gazetteer and grammar files before they get compiled
└── scripts # the scripts to use to compile or apply the components
```

### Download the JARs

In [None]:
# the URL to insert in this variable will be communicated
# during the workshop as the tool license does not allow us to
# further distribute it (i.e. putting it on GitHub)
download_link = ""

In [None]:
!wget {download_link} -O ../libraries.zip

In [None]:
!unzip ../libraries.zip -d ../

In [None]:
!cp ../librairies/*.jar ../rule-based/lib/

In [None]:
ls -al ../rule-based/lib/

In [None]:
# clean up
! rm -r ../librairies/
! rm ../libraries.zip

## 1. CORLEONE: creating, compiling and applying a gazetteer

As a first exercise, we will create a small gazetteer for person names.

> The CorLEONE gazetteer look-up (dictionary look-up) component matches an input stream of characters or tokens against a gazetteer (dictionary) list, and produces an adequate annotation for the matched text fragment. It allows for associating each entry in the gazetteer with a list of arbitrary flat attribute-value pairs. (Corleone documentation, Piskorski, 2018.)

### 1.1 Creating a person name gazetteer

The resources you need to manipulate are in the `resources` folder:

- The **raw gazetteer file**, e.g. `person_name_gazetteer.txt`, is the entry file you edit to add gazetteer entries. Each line represents a single gazetteer entry in the following format: `keyword (attribute:value)+`.

```bash
# Example of gazetteer, one entry per line, where the input separator is "|",
# and the attribute/value separator is ":"
New York | GTYPE:location | SUBTYPE:city | CONTINENT: north america
G. Bush  | GTYPE: person | SUBTYPE: politician | position: president 
# => here we are declaring that the string "New York" has the GTYPE 'location', the SUBTYPE 'city', etc.

# for ambiguous forms, one line per referent:
Washington | GTYPE:city | LOCATION:USA | SUBTYPE:cap_city 
Washington | GTYPE:person | GENDER:m_f 
Washington | GTYPE:organization | SUBTYPE:commercial 
Washington | GTYPE:region | LOCATION:US
```

- The **attribute file** lists all attribute names, where each line stands for a single attribute name. 

  ```bash
  # for our gazetteer above, we need to declare the following attributes:
  GTYPE
  SUBTYPE
  CONTINENT
  LOCATION
  GENDER
  # => these are the attributes with which we want to describe our gazetteer entries
  ```
  
- The **type file** (optional) can be used to enforce a stricter encoding of the gazetteer entries. It specifies: (a) the attribute that encodes the type of an entry, which must then be provided in every entry of the entry file, and (b) the list of appropriate attributes for each type.

  ```bash
  # for our example above, the type file would contain:
  GTYPE # means that all entries need this type, and they can have 'city', 'person' or 'region' as values
  city location subtype # means that if an entry is of GTYPE city, it can have the 'location' and 'subtype' attributes
  person gender subtype position 
  region location
  
  # the type file is more specific than the attribute file: it declares which attributes are allowed for each type.
  # it is not mandatory, so we can skip it for our exercise.
  ```
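
To make the entry format and the role of the attribute file concrete, here is a minimal Python sketch. It is not Corleone's own parser: the function names `parse_entry` and `undeclared_attributes` are hypothetical, and the case-insensitive comparison of attribute names is an assumption made for this illustration.

```python
def parse_entry(line, entry_sep="|", attr_sep=":"):
    """Split one raw gazetteer line into (keyword, {attribute: value})."""
    keyword, *pairs = [part.strip() for part in line.split(entry_sep)]
    attributes = {}
    for pair in pairs:
        name, _, value = pair.partition(attr_sep)
        attributes[name.strip()] = value.strip()
    return keyword, attributes


def undeclared_attributes(attributes, declared):
    """Return the attribute names that are missing from the attribute file.

    Assumption: names are compared case-insensitively.
    """
    known = {name.upper() for name in declared}
    return sorted(name for name in attributes if name.upper() not in known)


# Attribute names copied from the attribute file example above
declared = ["GTYPE", "SUBTYPE", "CONTINENT", "LOCATION", "GENDER"]

for line in [
    "New York | GTYPE:location | SUBTYPE:city | CONTINENT: north america",
    "G. Bush  | GTYPE: person | SUBTYPE: politician | position: president ",
]:
    keyword, attributes = parse_entry(line)
    print(keyword, attributes, undeclared_attributes(attributes, declared))
```

For the second entry the check reports `position`, because the attribute file example above does not list it. A check like this can help spot such omissions before compiling.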

### 1.2 Compiling a person name gazetteer 

A very small person name gazetteer already exists. Let's try to use it.

`<digression type='short'>`

**What's going on?** In the cells below we are using a special syntax to run **bash** commands (e.g. `cd`, `ls -la`, etc.) **from within** the notebook.

By using the `!` prefix at the beginning of a line we tell Jupyter that the line content should be interpreted and executed as a bash command (rather than a Python statement).

`</digression>`
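
To make the `!` shortcut less magical, here is a rough sketch of the same idea in plain Python, using the standard `subprocess` module. It is only an approximation: IPython passes the line to a shell and also substitutes Python variables written as `{name}`. The `echo` command and the `component_alias` value are placeholders chosen for this illustration.

```python
import subprocess

component_alias = "demo"  # hypothetical value, for illustration only

# Rough equivalent of the notebook line:  ! echo compiling {component_alias}
result = subprocess.run(
    ["echo", "compiling", component_alias],
    capture_output=True,  # collect stdout/stderr instead of printing them
    text=True,            # decode the output bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)
print(result.stdout.strip())
```

In the notebook itself the `!` form is shorter and shows the output directly under the cell, which is why the cells below use it.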

In [None]:
NOTEBOOK_HOME = "~/notebooks/"
CORLEONE_RESOURCES_DIR = "/home/jovyan/rule-based/corleone/resources/"
CORLEONE_SCRIPTS_DIR = "/home/jovyan/rule-based/corleone/scripts/"
CORLEONE_COMPILED_RESOURCES = "/home/jovyan/rule-based/corleone/compiled-resources/"
EXPERIMENTS_OUTPUT_DIR = "/home/jovyan/rule-based/corleone/experiments/output/"
# TODO: add other dirs here below and change to ~

In [None]:
# go in the resource folder and look at the person_name_gazetteer.txt file
os.chdir(CORLEONE_RESOURCES_DIR)
! ls -la
! head person_name_gazetteer.txt

Look at the configuration files if you wish: they are already set up, so you do not need to edit them.

In [None]:
# go in the /scripts folder of corleone
os.chdir(CORLEONE_SCRIPTS_DIR)

In [None]:
# execute the 'compile component script' with the component alias (basicGazetteer)
# and the component configuration file (located in the resource folder)

component_alias = "basicGazetteer"
component_config_file = "../resources/person_nameGazetteer.cfg"

! ./compileComp.sh {component_alias} {component_config_file}

In [None]:
# go in the compiled resources and check if your compiled component is there
os.chdir(CORLEONE_COMPILED_RESOURCES)
! ls -la

### 1.3 Apply the compiled gazetteer to some inputs

In [None]:
# go in the /scripts folder of corleone

os.chdir(CORLEONE_SCRIPTS_DIR)

In [None]:
# execute the 'apply component script' with the component alias
# (basicGazetteer) and the component configuration file (located in the resource folder)

! ./applyComp.sh basicGazetteer ../resources/person_name_gaz_application.cfg

In [None]:
# go in the experiment folder and check the output

! head '../experiments/output/\person_gazetteer_input.txt.out'

### 1.4 Iterate 

Now that you have completed a first edit-compile-apply cycle, you can go back to the entry file and add more entries. The information you enter will be used by the grammar rules that you will develop next.
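
If you prefer to prepare new entries programmatically, the following sketch formats entries in the `keyword | attribute:value` layout shown in section 1.1 and skips lines that are already present. The helper names `format_entry` and `add_entries` are hypothetical, and the attributes used here come from the generic example. The attributes actually used in `person_name_gazetteer.txt` may differ, so check the file with `head` first.

```python
def format_entry(keyword, attributes, entry_sep=" | ", attr_sep=":"):
    """Build one raw gazetteer line from a keyword and its attributes."""
    parts = [keyword] + [f"{name}{attr_sep}{value}" for name, value in attributes.items()]
    return entry_sep.join(parts)


def add_entries(existing_lines, new_lines):
    """Append new lines, skipping duplicates and keeping the original order."""
    seen = set(existing_lines)
    merged = list(existing_lines)
    for line in new_lines:
        if line not in seen:
            merged.append(line)
            seen.add(line)
    return merged


# Hypothetical content, for illustration only
existing = ["Washington | GTYPE:person | GENDER:m_f"]
candidates = [
    format_entry("Washington", {"GTYPE": "person", "GENDER": "m_f"}),  # duplicate, skipped
    format_entry("G. Bush", {"GTYPE": "person", "SUBTYPE": "politician"}),
]
for line in add_entries(existing, candidates):
    print(line)
```

After editing the entry file, remember to recompile the gazetteer and apply it again.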

## 2. Express: creating, compiling and applying a grammar

### 2.1 Some explanations


For now we will apply a grammar containing 2 rules, which you can already use. The explanations below give you some background.

A grammar cascade definition consists of three main ingredients:

#### 1. Definition of feature structure types

The type specification file contains the types and their attributes which are used in the grammars.
It is in this file that you declare a type such as 'person' or 'organisation', together with its attributes, for example 'firstname' or 'headquarters'.

Express has an interface with the Corleone components and is able to "understand" some Corleone types by default. For example, the Corleone tokenizer has the type "token" with the attributes "type" and "surface".

Given our person name gazetteer and the Corleone modules that we use, we define our grammar type file as follows:

```
basic-token := [SURFACE] # coming from Corleone component
token := [TYPE,SURFACE] # coming from Corleone component, a more refined tokenizer
person := [NAME,FIRST_NAME,LAST_NAME,INITIAL,TITLE,RULE] # the type we want to manipulate in our person grammar file
gaz_given_name := [SURFACE] # coming from the person gazetteer
gaz_title := [GNUMBER,SURFACE] # coming from the person gazetteer
gaz_initial := [SURFACE] # coming from the person gazetteer
gaz_name_infix := [SURFACE] # coming from the person gazetteer
```

This type file is already defined in the folder structure of the hands-on. Normally there is no need to change anything (unless you add more types to the gazetteer).
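
To see how such type definitions map onto a simple data structure, here is a small Python sketch that reads lines of the form `name := [ATTR1,ATTR2]` into a dictionary. This is only an illustration and not how Express reads the file: the function name `parse_type_definitions` is hypothetical, and the sketch simply ignores anything after the closing bracket.

```python
import re

# name := [ATTR1,ATTR2,...]  (anything after the closing bracket is ignored)
TYPE_DEFINITION = re.compile(r"^\s*([\w-]+)\s*:=\s*\[([^\]]*)\]")


def parse_type_definitions(text):
    """Map each type name to the list of its attribute names."""
    types = {}
    for line in text.splitlines():
        match = TYPE_DEFINITION.match(line)
        if match:
            name, attribute_list = match.groups()
            types[name] = [a.strip() for a in attribute_list.split(",") if a.strip()]
    return types


type_file = """
basic-token := [SURFACE] # coming from Corleone component
token := [TYPE,SURFACE]
person := [NAME,FIRST_NAME,LAST_NAME,INITIAL,TITLE,RULE]
gaz_title := [GNUMBER,SURFACE]
"""

types = parse_type_definitions(type_file)
print(types["person"])
```

A dictionary like this makes it easy to check that every attribute you use in a rule is actually declared for the corresponding type.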

#### 2. Set of grammar specifications


This is the grammar file per se, which consists of 2 parts:

**A. Setting part** 

**Normally you do not need to change the setting part for the hands-on.**

- MODULES: to specify the list of pre-processing modules which will be applied before the grammar interpreter. In our case, we use Corleone modules, including the person gazetteer compiled in the previous step. These pre-processing components provide the grammar interpreter with a stream of input feature structures.

- SEARCH_MODE: defines the matching strategy. Here we choose "longest match".

- OUTPUT: defines what the interpreter should output: its own feature structures (grammar), the feature structures of all applied components (all), or the feature structures of applied and non-applied components (grammar_and_unconsumed).
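
The general idea behind a "longest match" strategy can be sketched in a few lines of Python. This is an illustration over plain token tuples, not the Express interpreter, which matches grammar patterns over feature structures. The function name `longest_match` and the sample entries are hypothetical.

```python
def longest_match(tokens, entries):
    """Scan left to right; at each position keep the longest matching entry."""
    max_len = max((len(entry) for entry in entries), default=0)
    matches, i = [], 0
    while i < len(tokens):
        # try the longest candidate first, then shorter ones
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in entries:
                matches.append((i, candidate))
                i += length  # consume the matched tokens
                break
        else:
            i += 1  # no entry starts here, move on by one token
    return matches


entries = {("New", "York"), ("New", "York", "Times"), ("York",)}
tokens = "She reads the New York Times in York".split()
print(longest_match(tokens, entries))
# [(3, ('New', 'York', 'Times')), (7, ('York',))]
```

Note how "New York Times" wins over the shorter "New York" and "York", which would also match at the same position.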


**B. Rule part**

**This is the part you might want to update**

The part between **PATTERNS** and **END_PATTERNS** contains the rule definitions.
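
If you want to look only at the rules of a grammar file, a sketch like the following can extract the lines between the two markers. It assumes that `PATTERNS` and `END_PATTERNS` each stand alone on their own line, which you should verify against the actual grammar file. The function name `extract_rule_part` is hypothetical and the sample text is a placeholder, not real Express syntax.

```python
def extract_rule_part(grammar_text, start="PATTERNS", end="END_PATTERNS"):
    """Return the lines found between the start and end marker lines.

    Assumption: each marker stands alone on its own line.
    Raises ValueError if a marker is missing.
    """
    lines = grammar_text.splitlines()
    stripped = [line.strip() for line in lines]
    begin = stripped.index(start)
    finish = stripped.index(end, begin + 1)
    return lines[begin + 1:finish]


# Placeholder content, NOT real Express grammar syntax
sample = """\
(setting part goes here)
PATTERNS
(rule 1 goes here)
(rule 2 goes here)
END_PATTERNS
"""

print(extract_rule_part(sample))
```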

#### 3. Set of **rule prioritisation** definitions
We do not use this part in the hands-on.


### 2.2 Using a grammar

A grammar already exists in our folder; we will try to use it.

**Let's first have a look at the grammar file:**

In [None]:
# defining our paths
EXPRESS_RESOURCES_DIR = os.path.expanduser("~/rule-based/express/resources/")
EXPRESS_SCRIPTS_DIR = os.path.expanduser("~/rule-based/express/scripts/")
EXPRESS_COMPILED_RESOURCES = os.path.expanduser("~/rule-based/express/compiled-resources/")
EXPRESS_OUTPUT_DIR = os.path.expanduser("~/rule-based/express/experiments/output/")

In [None]:
# we go in the express resource folder
os.chdir(EXPRESS_RESOURCES_DIR)

In [None]:
# we look at the rule file
! cat grammar_person_rules_0.txt

**Let's check the resources of the grammar**

In [None]:
# First we copy our compiled person gazetteer ".gaz" into the express
# compiled resources folder, so that the grammar interpreter can use it.

! cp {CORLEONE_COMPILED_RESOURCES}person_names.gaz {EXPRESS_COMPILED_RESOURCES}data/

**Let's list our compiled resource folder:**
```bash
.
├── data # contains the compiled resources, this is where the person gaz and grammar compiled files go.
│   ├── BasicTokenizer.btk
│   ├── ClassifyingTokenizer_EN.tok
│   ├── grammar_person.grm
│   ├── person_grammar_types.txt
│   └── person_names.gaz
└── modules_grammar # contains the configuration for the pre-processing modules, no need to change
    ├── basic_tokenizer_configuration.cfg
    ├── classifying_tokenizer_configuration.cfg
    └── gazetteerPersonConfig.cfg
```

**Let's compile our grammar**

In [None]:
os.chdir(EXPRESS_SCRIPTS_DIR)

In [None]:
! ./runParser.sh ../resources/compilation.cfg

**Let's see if the grammar file `*.grm` is in the compiled resources folder**

In [None]:
ll ../compiled-resources/data/ 

**We can now apply our grammar on texts**

The input folder in experiments contains a small file with some person names. Let's open it:


In [None]:
less ../experiments/input/grammar_person_input.txt 

**Let's try to apply the grammar on this file first**. The output will appear in the `experiments/output` folder.

In [None]:
! ./runInterpreter.sh ../resources/execution.cfg

**Let's observe the results**

In [None]:
# note: here there is the same problem with the "\" in the output file name
less ../experiments/output\grammar_person_input.txt-result.txt

**What you can try next:**

- After this first application, try to change the OUTPUT attribute from 'grammar' to 'all' in the grammar file, recompile the grammar and apply it again on the same file. You should see more information in the output file.
- Move more input files into the "input" folder (some examples in different languages are just one level up) and observe the results.
- Try to write a rule to recognize the missed names in ``
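
For the second suggestion, copying the extra input files can also be done from Python. The sketch below is a minimal example under assumptions: the helper name `copy_text_files` and the sample file names are hypothetical, and it assumes the extra examples are `.txt` files located directly in the folder above `input`. The demo runs in a temporary directory so that it does not touch your experiments folder.

```python
import shutil
import tempfile
from pathlib import Path


def copy_text_files(source_dir, target_dir, pattern="*.txt"):
    """Copy files matching `pattern` from source_dir into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(source_dir).glob(pattern)):
        if path.is_file():
            shutil.copy(path, target / path.name)
            copied.append(path.name)
    return copied


# Self-contained demo with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    experiments = Path(tmp)
    (experiments / "sample_fr.txt").write_text("Emmanuel Macron", encoding="utf-8")
    (experiments / "notes.md").write_text("not an input", encoding="utf-8")
    copied = copy_text_files(experiments, experiments / "input")
    print(copied)
```

In the notebook, a call such as `copy_text_files("../experiments", "../experiments/input")` from the scripts folder would do the same for the real files, provided the assumption about their location holds.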
