# Testing a rule-based system

## Context

This notebook guides you through the use of Corleone and Express, two software tools developed at the Joint Research Centre (European Commission, Ispra, Italy) by Jakub Piskorski. They are used in production applications, notably the Europe Media Monitor. They are free for academic use, but you should request a license if you want to use them beyond this tutorial.

## Tools

- **Corleone** (Core Linguistic Entity Online Extraction) is a set of lightweight linguistic processing components (text scanner, tokenizer, sentence splitter, morphological analysis and gazetteer lookup).
- **Express** (Extraction Pattern Recognition Engine and Specification Suite) is an information extraction grammar engine, which consists of a grammar parser and a grammar interpreter.

## Objectives

We will not manage to build a full information extraction pipeline within the allocated time (we would need at least a week!). The objective here is to **give you an idea** of how things work. We will therefore focus on two components, gazetteers and grammars, and try to build a small engine that recognizes (some) person names. We will rely on an already compiled tokeniser. You will develop a person name gazetteer and 2 or 3 grammar rules relying on it. Let's get started.

## Setup

In [1]:
import os

### Folder structure

In both the corleone and express repositories, you will find the following structure:

```bash 
.
├── compiled-resources # this is where your compiled resources will go
├── documentation # user-guide is available here
├── experiments # a playground folder, already with some inputs
│   ├── input
│   └── output
├── resources # the 'raw' resources, i.e. gazetteer and grammar files before they get compiled
└── scripts # the scripts to use to compile or apply the components
```

### Download the JARs

In [1]:
# the URL to insert in this variable will be communicated
# during the workshop as the tool license does not allow us to
# further distribute it (i.e. putting it on GitHub)
download_link = ""

In [3]:
!wget {download_link} -O ../libraries.zip

--2019-07-09 11:13:04--  https://filesender.switch.ch/filesender/download.php?vid=367de57b-dd7b-f828-0a16-00004c01ecfc
Resolving filesender.switch.ch (filesender.switch.ch)... 86.119.34.170
Connecting to filesender.switch.ch (filesender.switch.ch)|86.119.34.170|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2220575 (2.1M) [application/octet-stream]
Saving to: ‘../libraries.zip’


2019-07-09 11:13:18 (169 KB/s) - ‘../libraries.zip’ saved [2220575/2220575]



In [4]:
!unzip ../libraries.zip -d ../

Archive:  ../libraries.zip
  inflating: ../lib-rulebased/brics_automaton.jar  
  inflating: ../lib-rulebased/corleone_6_20_2019.jar  
  inflating: ../lib-rulebased/express_6_20_2019.jar  
  inflating: ../lib-rulebased/log4j-1.2.16.jar  


In [5]:
!cp ../lib-rulebased/*.jar ../rule-based/lib/

In [10]:
# clean up
! rm -r ../lib-rulebased/
! rm ../libraries.zip

rm: cannot remove '../libraries.zip': No such file or directory


## CORLEONE: creating, compiling and applying a gazetteer

As a first exercise, we will create a small gazetteer for person names.

> The CorLEONE gazetteer look-up (dictionary look-up) component matches an input stream of characters or tokens against a gazetteer (dictionary) list, and produces an adequate annotation for the matched text fragment. It allows for associating each entry in the gazetteer with a list of arbitrary flat attribute-value pairs*. (Corleone documentation, Piskorski, 2018.)

### Creating a person name gazetteer

The resources you need to work with are in the `resources` folder:

- The **raw gazetteer file**, e.g. `person_name_gazetteer.txt`, is the entry file to which you add gazetteer entries. Each line represents a single gazetteer entry in the following format: `keyword (attribute:value)+`.

```bash
# Example of gazetteer, one entry per line, where the input separator is "|",
# and the attribute/value separator is ":"
New York | GTYPE:location | SUBTYPE:city | CONTINENT: north america
G. Bush  | GTYPE: person | SUBTYPE: politician | position: president 
# => here we are declaring that the string "New York" has the GTYPE 'location', the SUBTYPE 'city', etc.

# for ambiguous forms, one line per referent:
Washington | GTYPE:city | LOCATION:USA | SUBTYPE:cap_city 
Washington | GTYPE:person | GENDER:m_f 
Washington | GTYPE:organization | SUBTYPE:commercial 
Washington | GTYPE:region | LOCATION:US
```

- The **attribute file** lists all attribute names, where each line stands for a single attribute name. 

  ```bash
  # for our gazetteer above, we need to declare the following attributes:
  GTYPE
  SUBTYPE
  CONTINENT
  LOCATION
  GENDER
  # => these are the attributes with which we want to describe our gazetteer entries
  ```
  
- The **type file** (optional) enforces a stricter encoding of the gazetteer entries. It specifies (a) the attribute that encodes the type of an entry, which must then be provided in every entry of the entry file, and (b) the list of appropriate attributes for each type.

  ```bash
  # for our example above, the type file would contain:
  GTYPE # the attribute that encodes the type: every entry must provide it, here with values such as 'city', 'person' or 'region'
  city location subtype # means that if an entry is of GTYPE city, it can have the 'location' and 'subtype' attributes
  person gender subtype position 
  region location
  
  # this is more specific than the attribute file: it declares which attributes each type may have.
  # this is not mandatory, we can skip it for our exercise.
  ```
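
To make the entry format concrete, here is a minimal Python sketch that splits a raw entry line into a keyword and its attribute-value pairs, then checks the attribute names against the attribute file. The helper names `parse_gazetteer_entry` and `undeclared_attributes` are hypothetical and not part of Corleone. The check is a plain, case-sensitive string comparison, which is an assumption about how the compiler treats attribute names.

```python
def parse_gazetteer_entry(line, field_sep="|", attr_sep=":"):
    """Split one raw gazetteer line into (keyword, {attribute: value})."""
    fields = [field.strip() for field in line.split(field_sep)]
    keyword, attributes = fields[0], {}
    for field in fields[1:]:
        if not field:
            continue  # tolerate a trailing separator
        name, _, value = field.partition(attr_sep)
        attributes[name.strip()] = value.strip()
    return keyword, attributes


def undeclared_attributes(attributes, declared):
    """Return the attribute names that are not listed in the attribute file."""
    return sorted(name for name in attributes if name not in declared)


# Attribute names taken from the attribute file example above.
declared = {"GTYPE", "SUBTYPE", "CONTINENT", "LOCATION", "GENDER"}
keyword, attributes = parse_gazetteer_entry(
    "G. Bush  | GTYPE: person | SUBTYPE: politician | position: president"
)
print(keyword, attributes)
print(undeclared_attributes(attributes, declared))
```

With the example files as written, the sketch reports `position` as undeclared, so that attribute would presumably have to be added to the attribute file as well.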

### Compiling a person name gazetteer 

A very small person name gazetteer already exists. Let's try to use it.

`<digression type'short'>`

**What's going on?** In the cells below we are using a special syntax to run **bash** commands (e.g. `cd`, `ls -la`, etc.) **from within** the notebook.

By using the `!` prefix at the beginning of a line, we tell Jupyter that the line content should be interpreted and executed as a bash command (rather than as a Python statement).
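
Outside a notebook the `!` prefix is not available. A rough plain-Python equivalent, given here only as a sketch, uses the standard `subprocess` module. `run_shell` is a hypothetical helper name, not something provided by Jupyter.

```python
import subprocess


def run_shell(command):
    """Run a shell command and return its exit code and captured stdout."""
    completed = subprocess.run(command, shell=True, capture_output=True, text=True)
    return completed.returncode, completed.stdout


# Roughly what `! echo hello` does in a notebook cell.
exit_code, output = run_shell("echo hello")
print(exit_code, output.strip())
```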

`</digression>`

In [20]:
NOTEBOOK_HOME = os.path.expanduser("~/notebooks/")
CORLEONE_RESOURCES_DIR = os.path.expanduser("~/rule-based/corleone/resources/")
CORLEONE_SCRIPTS_DIR = os.path.expanduser("~/rule-based/corleone/scripts/")
CORLEONE_COMPILED_RESOURCES = os.path.expanduser("~/rule-based/corleone/compiled-resources/")
EXPERIMENTS_INPUT_DIR = os.path.expanduser("~/rule-based/corleone/experiments/")
EXPERIMENTS_OUTPUT_DIR = os.path.expanduser("~/rule-based/corleone/experiments/output/")
# TODO: add other dirs here below and change to ~
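
Since every later cell changes into one of these folders, it can help to check the paths once up front. The following is a small sketch, where `missing_directories` is a hypothetical helper and the second path is deliberately made up so that the example reports something:

```python
import os


def missing_directories(paths):
    """Return the entries of `paths` that are not existing directories."""
    return [path for path in paths if not os.path.isdir(path)]


# The current directory exists; the second path is invented for this example.
print(missing_directories([os.getcwd(), "/no/such/folder/for/this/sketch"]))
```

In the notebook, the same function could be called with the `CORLEONE_*` and `EXPERIMENTS_*` variables defined above.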

In [21]:
# go in the resource folder and look at the person_name_gazetteer.txt file
os.chdir(CORLEONE_RESOURCES_DIR)
! ls -la
! head person_name_gazetteer.txt

total 36
drwxr-xr-x 1 jovyan jovyan 4096 Jul  9 06:42 .
drwxr-xr-x 1 jovyan jovyan 4096 Jul  9 06:42 ..
-rwxr-xr-x 1 jovyan jovyan   24 Jul  9 06:42 PersonNameAttributes.txt
-rwxr-xr-x 1 jovyan jovyan  392 Jul  9 06:42 person_name_gaz_application.cfg
-rwxr-xr-x 1 jovyan jovyan  334 Jul  9 06:42 person_nameGazetteer.cfg
-rwxr-xr-x 1 jovyan jovyan 7895 Jul  9 06:42 person_name_gazetteer.txt
-rwxr-xr-x 1 jovyan jovyan  105 Jul  9 06:42 types.txt

John| GTYPE>gaz_given_name | SURFACE>John
Philip| GTYPE>gaz_given_name | SURFACE>Philip
Edward| GTYPE>gaz_given_name | SURFACE>Edward

A | GTYPE>gaz_initial | SURFACE>A
B | GTYPE>gaz_initial | SURFACE>B
C | GTYPE>gaz_initial | SURFACE>C
D | GTYPE>gaz_initial | SURFACE>D
E | GTYPE>gaz_initial | SURFACE>E


Have a look at the configuration files if you wish: they are already set up, so you do not need to edit them.

In [22]:
# go in the /scripts folder of corleone
os.chdir(CORLEONE_SCRIPTS_DIR)

In [23]:
# execute the 'compile component script' with the component alias (basicGazetteer)
# and the component configuration file (located in the resource folder)

component_alias = "basicGazetteer"
component_config_file = "../resources/person_nameGazetteer.cfg"

! ./compileComp.sh {component_alias} {component_config_file}

2019-07-09 11:18:20,712: INFO main it.jrc.lt.core.component.Component - Create an instance of: basicGazetteer
2019-07-09 11:18:20,725: INFO main it.jrc.lt.core.component.Component - Read configuration from file: ../resources/person_nameGazetteer.cfg
2019-07-09 11:18:20,728: INFO main it.jrc.lt.core.component.Component - Compile resources
2019-07-09 11:18:20,729: INFO main it.jrc.lt.core.component.Component - Read gazetteer resources from file
2019-07-09 11:18:20,729: INFO main it.jrc.lt.core.component.Component - Reading attributes from: ../resources/PersonNameAttributes.txt
2019-07-09 11:18:20,731: INFO main it.jrc.lt.core.component.Component - Reading entries from: ../resources/person_name_gazetteer.txt
2019-07-09 11:18:20,732: INFO main it.jrc.lt.core.component.Component - Building the gazetteer
2019-07-09 11:18:20,732: INFO main it.jrc.lt.core.component.Component - Constructing the gazetteer
2019-07-09 11:18:20,732: INFO main it.jrc.lt.core.component.Component - Analyzing the entri

In [24]:
# go in the compiled resources and check if your compiled component is there
os.chdir(CORLEONE_COMPILED_RESOURCES)
! ls -la

total 740
drwxr-xr-x 1 jovyan jovyan   4096 Jul  9 11:18 .
drwxr-xr-x 1 jovyan jovyan   4096 Jul  9 06:42 ..
-rwxr-xr-x 1 jovyan jovyan 131306 Jul  9 06:42 BasicScanner_EN.bsc
-rwxr-xr-x 1 jovyan jovyan 131112 Jul  9 06:42 BasicTokenizer.btk
-rwxr-xr-x 1 jovyan jovyan 469326 Jul  9 06:42 ClassifyingTokenizer_EN.tok
-rw-r--r-- 1 jovyan jovyan   1781 Jul  9 11:18 person_names.gaz


### Apply the compiled gazetteer to some inputs

Let's have a look at the input:

In [25]:
ls -la {EXPERIMENTS_INPUT_DIR}

total 44
drwxr-xr-x 1 jovyan jovyan 4096 Jul  9 06:42 ./
drwxr-xr-x 1 jovyan jovyan 4096 Jul  9 06:42 ../
-rwxr-xr-x 1 jovyan jovyan 1079 Jul  9 06:42 basicTokenizerInput.txt*
-rwxr-xr-x 1 jovyan jovyan   10 Jul  9 06:42 classifyingTokenizerInput1.txt*
-rwxr-xr-x 1 jovyan jovyan 1079 Jul  9 06:42 classifyingTokenizerInput2.txt*
drwxr-xr-x 1 jovyan jovyan 4096 Jul  9 06:42 output/
-rwxr-xr-x 1 jovyan jovyan   68 Jul  9 06:42 person_gazetteer_input.txt*
-rwxr-xr-x 1 jovyan jovyan   62 Jul  9 06:42 sample_person_names_1.txt*
-rwxr-xr-x 1 jovyan jovyan  298 Jul  9 06:42 sample_person_names_2.txt*
-rwxr-xr-x 1 jovyan jovyan 1554 Jul  9 06:42 sentenceSplitterInput1.txt*


In [26]:
! cat {EXPERIMENTS_INPUT_DIR}/sample_person_names_2.txt

Roberta Mugabe
Robertem Mugabe
Robercie Mugabe
Mugabe Roberta
G. Pilarz
Pilarz Grzegorz
Grzegorza Pilarza
Grzegorzowi Pilarzowi
Grzegorzem Pilarzem
Grzegorzu Pilarzu
Trump
D. Trump
Donald John Trump
Donald J. Trump
Mr. Trump
Mr. Trump's
Mr Trump
Mr Trump's
Trump's
Trump Donald


In [27]:
# go in the /scripts folder of corleone

os.chdir(CORLEONE_SCRIPTS_DIR)

In [28]:
# execute the 'apply component script' with the component alias
# (basicGazetteer) and the component configuration file (located in the resource folder)

! ./applyComp.sh basicGazetteer ../resources/person_name_gaz_application.cfg

2019-07-09 11:18:29,736: INFO main it.jrc.lt.core.component.Component - Create an instance of: basicGazetteer
2019-07-09 11:18:29,749: INFO main it.jrc.lt.core.component.Component - Read configuration from file: ../resources/person_name_gaz_application.cfg
2019-07-09 11:18:29,751: INFO main it.jrc.lt.core.component.Component - Initialize the instance of: basicGazetteer
2019-07-09 11:18:29,823: INFO main it.jrc.lt.core.component.Component - Current Characterset is: UTF-8
2019-07-09 11:18:29,823: INFO main it.jrc.lt.core.component.Component - Processing file: /home/jovyan/rule-based/corleone/scripts/../experiments/person_gazetteer_input.txt
2019-07-09 11:18:29,832: INFO main it.jrc.lt.core.component.Component - Writing result to the file: ../experiments/output/\person_gazetteer_input.txt.out


In [29]:
# go in the experiment folder and check the output

! head '../experiments/output/\person_gazetteer_input.txt.out'

journalist [START: 7, END: 16, SURFACE: journalist, GTYPE: gaz_title, GNUMBER: sg]
----------------------------------
John [START: 18, END: 21, SURFACE: John, GTYPE: gaz_given_name]
----------------------------------
K [START: 23, END: 23, SURFACE: K, GTYPE: gaz_initial]
----------------------------------
P [START: 25, END: 25, SURFACE: P, GTYPE: gaz_initial]
----------------------------------
Professor [START: 42, END: 50, SURFACE: Professor, GTYPE: gaz_title]
----------------------------------
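
If you want to inspect these annotations programmatically, the lines can be parsed with a few lines of Python. This is only a sketch based on the layout visible above (`text [KEY: value, ...]`, with dashed separator lines). `parse_annotation` is a hypothetical helper, and it would break on values that themselves contain `, `.

```python
import re

# One annotation per line: matched text, then bracketed "KEY: value" pairs.
ANNOTATION = re.compile(r"^(?P<text>.+?) \[(?P<attributes>.*)\]$")


def parse_annotation(line):
    """Parse one 'text [KEY: value, ...]' line; return None for other lines."""
    match = ANNOTATION.match(line.strip())
    if match is None:
        return None
    record = {"text": match.group("text")}
    for pair in match.group("attributes").split(", "):
        key, _, value = pair.partition(": ")
        record[key] = int(value) if key in ("START", "END") else value
    return record


sample = """journalist [START: 7, END: 16, SURFACE: journalist, GTYPE: gaz_title, GNUMBER: sg]
----------------------------------
John [START: 18, END: 21, SURFACE: John, GTYPE: gaz_given_name]
----------------------------------"""

records = [r for r in map(parse_annotation, sample.splitlines()) if r is not None]
for record in records:
    print(record["text"], record["GTYPE"], record["START"], record["END"])
```

Note that in the output above the single-character matches have identical START and END values, which suggests that the END offset is inclusive.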


### Iterate 

Now that you have completed a first edit-compile-apply cycle, you can go back to the entry file and add more entries. The information you enter will be used by the grammar rules that you will develop next.
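
When adding many entries, generating the lines can be less error-prone than typing them. The sketch below builds lines in the layout seen in `person_name_gazetteer.txt` above (`|` between fields, `>` between attribute and value). `format_entry` is a hypothetical helper and the names are only illustrative.

```python
def format_entry(keyword, gtype, **extra):
    """Build one raw gazetteer line in the 'keyword | ATTR>value' layout."""
    attributes = {"GTYPE": gtype, "SURFACE": keyword, **extra}
    fields = [f"{name}>{value}" for name, value in attributes.items()]
    return " | ".join([keyword] + fields)


# Illustrative given names; the lines would then be appended to the entry file.
new_lines = [format_entry(name, "gaz_given_name") for name in ["Barack", "Dominique", "Bob"]]
print("\n".join(new_lines))
```

After appending such lines to the entry file, the gazetteer has to be recompiled with `compileComp.sh` before the new entries take effect.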

## EXPRESS: creating, compiling and applying a grammar

### Some explanations


For now we will apply a grammar containing 2 rules, which you can already use. The explanations below give you some background.

A grammar cascade definition consists of three main ingredients:

#### Definition of feature structure types

The type specification file contains the types and their attributes which are used in the grammars.
This is where you declare a type such as 'person' or 'organisation', together with its attributes, for example 'firstname' or 'headquarters'.

Express has an interface with Corleone components and is able to "understand" some Corleone types by default. For example, the Corleone tokenizer provides the type "token" with the attributes "type" and "surface".

Given our person name gazetteers and the Corleone modules that we use, we define our grammar type file as follows:

```
basic-token := [SURFACE] # coming from Corleone component
token := [TYPE,SURFACE] # coming from Corleone component, a more refined tokenizer
person := [NAME,FIRST_NAME,LAST_NAME,INITIAL,TITLE,RULE] # the type we want to manipulate in our person grammar file
gaz_given_name := [SURFACE] # coming from the person gazetteer
gaz_title := [GNUMBER,SURFACE] # coming from the person gazetteer
gaz_initial := [SURFACE] # coming from the person gazetteer
gaz_name_infix := [SURFACE] # coming from the person gazetteer
```

This type file is already defined in the folder structure of the hands-on. Normally there is no need to change anything (unless you add more types to the gazetteer).
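
To see which attributes a rule may fill for a given type, the definitions can be read into a dictionary. This is a sketch only: `parse_type_file` is a hypothetical helper, and it assumes that each definition sits on one line in the `name := [A,B,...]` layout shown above, with anything after the closing bracket ignored.

```python
import re

# A type name, ':=', then a bracketed comma-separated attribute list.
TYPE_DEFINITION = re.compile(r"^\s*([\w-]+)\s*:=\s*\[([^\]]*)\]")


def parse_type_file(text):
    """Map each type name to its list of attributes."""
    types = {}
    for line in text.splitlines():
        match = TYPE_DEFINITION.match(line)
        if match:
            attributes = [a.strip() for a in match.group(2).split(",") if a.strip()]
            types[match.group(1)] = attributes
    return types


type_text = """
basic-token := [SURFACE] # coming from Corleone component
token := [TYPE,SURFACE]
person := [NAME,FIRST_NAME,LAST_NAME,INITIAL,TITLE,RULE]
gaz_title := [GNUMBER,SURFACE]
"""
types = parse_type_file(type_text)
print(types["person"])
```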

#### Set of grammar specifications


This is the grammar file per se, which consists of 2 parts:

**A. Setting part** 

**Normally you do not need to change the setting part for the hands-on.**

- MODULES: to specify the list of pre-processing modules which will be applied before the grammar interpreter. In our case, we use Corleone modules, including the person gazetteer compiled in the previous step. These pre-processing components provide the grammar interpreter with a stream of input feature structures.

- SEARCH_MODE: defines the matching strategy. Here we choose "longest match".

- OUTPUT: defines what the interpreter should output: its own feature structures (grammar), the feature structures of all applied components (all), or the feature structures of applied and non-applied components (grammar_and_unconsumed).
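
To give an intuition for a longest-match strategy, here is a toy Python illustration. It is not how Express is implemented; it simply shows the general idea of keeping, at each position, the longest candidate and discarding the candidates it overlaps. The function name, the candidate labels and the inclusive offsets are assumptions made for this sketch.

```python
def longest_match(candidates):
    """Greedy left-to-right selection: at each start keep the longest span.

    `candidates` is a list of (start, end, label) with inclusive offsets.
    """
    selected, position = [], 0
    # Sort by start offset, and by decreasing length for equal starts.
    for start, end, label in sorted(candidates, key=lambda c: (c[0], -c[1])):
        if start >= position:
            selected.append((start, end, label))
            position = end + 1
    return selected


# Three overlapping candidates over the string "Philip Kowalski".
candidates = [
    (0, 5, "given_name"),
    (0, 14, "given_name + last_name"),
    (7, 14, "capitalised_word"),
]
print(longest_match(candidates))
```

Only the span covering the whole name survives, because the two shorter candidates overlap it.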


**B. Rule part**

**This is the part you might want to update**

The part between **PATTERNS** and **END_PATTERNS** contains the rule definitions.

#### Set of **rule prioritisation** definition
We do not use this part in the hands-on.


### Using a grammar

A grammar already exists in our folder; we will try to use it.

**Let's first have a look at the grammar file:**

In [30]:
# defining our paths
EXPRESS_RESOURCES_DIR = os.path.expanduser("~/rule-based/express/resources/")
EXPRESS_SCRIPTS_DIR = os.path.expanduser("~/rule-based/express/scripts/")
EXPRESS_COMPILED_RESOURCES = os.path.expanduser("~/rule-based/express/compiled-resources/")
EXPRESS_OUTPUT_DIR = os.path.expanduser("~/rule-based/express/experiments/output/")

In [31]:
# we go in the express resource folder
os.chdir(EXPRESS_RESOURCES_DIR)

In [32]:
# we look at the rule file
! cat grammar_person_rules_0.txt

%
% Very basic grammar for person names
%

SETTINGS:

{
% Modules (currently used Corleone components)
% the 'gazetteerPersons' modules will load the gazetteer defined previously

    MODULES: <CorleoneTokenizer>,<CorleoneBasicTokenizer>,<gazetteerPersons>

    %SEARCH_MODE: all_longest_matches
    SEARCH_MODE: longest_match
    %SEARCH_MODE: all_match
    % Output

    OUTPUT: grammar
}

PATTERNS:

% Example 1
% This is a basic pattern for detecting person names based on known first names,
% followed by 1 to 3 first capital words identified thanks to the tokenizer.
% The surface of identified tokens are kept by defining them as 'variables' in the rule (#last_name).
% These variable are "feeding" a new feature structure "person" composed of NAME, FIRST_NAME, LAST_NAME and RULE.
% The NAME attribute of the type "person" is built by aplying a concatenation of the first and last name
% (thanks to a grammar functional operator called "ConcWithBlanks")


perso

**Let's check the resources of the grammar**

In [33]:
# First we copy our compiled person gazetteer ".gaz" into the express
# compiled resources folder, so that the grammar interpreter can use it.

! cp {CORLEONE_COMPILED_RESOURCES}person_names.gaz {EXPRESS_COMPILED_RESOURCES}data/

**Let's list our compiled resource folder:**
``` bash
.
├── data # contains the compiled resources, this is where the person gaz and grammar compiled files go.
│   ├── BasicTokenizer.btk
│   ├── ClassifyingTokenizer_EN.tok
│   ├── grammar_person.grm
│   ├── person_grammar_types.txt
│   └── person_names.gaz
└── modules_grammar # contains the configuration for the pre-processing modules, no need to change
    ├── basic_tokenizer_configuration.cfg
    ├── classifying_tokenizer_configuration.cfg
    └── gazetteerPersonConfig.cfg
```

**Let's compile our grammar**

In [34]:
os.chdir(EXPRESS_SCRIPTS_DIR)

In [35]:
! ./runParser.sh ../resources/compilation.cfg

Parsing the type file: ../resources/grammar_person_TypeFile.txt
Parsing grammar file: ../resources/grammar_person_rules_0.txt
Syntax of the types and grammar files is correct
Perform semantic analysis
Type file: ../resources/grammar_person_TypeFile.txt
Grammar file: ../resources/grammar_person_rules_0.txt
Checking grammar
Checking output types
Checking search mode
Checking component names
Checking rules
Convert each rule to a finite-state representation
Converting rule: person_name_rule1
Converting rule: person_name_rule2
Conversion of rule automata into Rule Filtering automaton
Determinisation of Rule Filtering automaton
Converting Deterministic Rule Filtering automaton into efficient representation
Initialise auxilliary data structures
Parsing succesfull
Parsing succesfull
Compilation succesfull


**Let's see if the grammar file (`*.grm`) is in the compiled resources folder**

In [36]:
ll ../compiled-resources/data/ | grep grm

-rw-r--r-- 1 jovyan   2598 Jul  9 11:18 grammar_person.grm


**We can now apply our grammar on texts**

The input folder in experiments contains a small file with some person names. Let's open it:


In [37]:
ls -la ../experiments/input

total 64
drwxr-xr-x 1 jovyan jovyan  4096 Jul  9 06:42 ./
drwxr-xr-x 1 jovyan jovyan  4096 Jul  9 06:42 ../
-rw-r--r-- 1 jovyan jovyan  1411 Jul  9 06:42 die-zeit-2019-07-06.txt
-rw-r--r-- 1 jovyan jovyan 14035 Jul  9 06:42 GDL-1815-02-21-a-i0011-p3.txt
-rw-r--r-- 1 jovyan jovyan 14189 Jul  9 06:42 GDL-1815-02-21-a-i0011.txt
-rwxr-xr-x 1 jovyan jovyan   310 Jul  9 06:42 grammar_person_input.txt*
-rw-r--r-- 1 jovyan jovyan  1474 Jul  9 06:42 guardian-2019-07-06.txt
-rw-r--r-- 1 jovyan jovyan  5445 Jul  9 06:42 JDG-1977-03-24-a-i0080-p11.txt
-rw-r--r-- 1 jovyan jovyan   460 Jul  9 06:42 luxwort-1938-01-19-a-i0031-p3.txt


In [38]:
! cat ../experiments/input/grammar_person_input.txt 

Police apprehended Philip Kowalski
Breaking News: Professor Kowalski was taken into custody by the Ispra Police Department.
Former President Barack Obama attended the funerals of former President G. W. Bush.
Dominique de Villepin is about to pronounce a speech at UN Council, journalist Bob Manello reports.

In [39]:
mkdir ../experiments/output

**Let's try to apply the grammar to this file first**. The output will appear in the `experiments/output` folder.

In [40]:
! ./runInterpreter.sh ../resources/execution.cfg

2019-07-09 11:19:30,845: INFO main it.jrc.lt.regexpfs.GrammarInterpreter - Launching the grammar interpreter
2019-07-09 11:19:30,846: INFO main it.jrc.lt.regexpfs.GrammarInterpreter - Loading resources started
2019-07-09 11:19:30,846: INFO main it.jrc.lt.regexpfs.GrammarInterpreter - Reading configuration properties
2019-07-09 11:19:30,846: INFO main it.jrc.lt.regexpfs.GrammarInterpreter - Load cascaded grammar from the file: ../compiled-resources/data/grammar_person.grm
2019-07-09 11:19:30,912: INFO main it.jrc.lt.regexpfs.GrammarInterpreter - Initialize modules specified in the module configuration directory: ../compiled-resources/modules_grammar
2019-07-09 11:19:30,913: INFO main it.jrc.lt.regexpfs.GrammarInterpreter - Launching module specified in the file: /home/jovyan/rule-based/express/scripts/../compiled-resources/modules_grammar/classifying_tokenizer_configuration.cfg
2019-07-09 11:19:31,009: INFO main it.jrc.lt.regexpfs.module.Module - Module named: CorleoneTokenizer has been

**Let's observe the results**

In [41]:
! cat ../experiments/output\\grammar_person_input.txt-result.txt # the output file name contains a literal "\", as with the Corleone output above

"Philip Kowalski Breaking News"	19	48	person	NAME	Philip Kowalski Breaking News	FIRST_NAME	Philip	LAST_NAME	Kowalski Breaking News	RULE	person_name_rule1
"Professor Kowalski"	51	68	person	NAME	Kowalski	FIRST_NAME	Kowalski	LAST_NAME		INITIAL		TITLE	Professor	RULE	person_name_rule2
"journalist Bob Manello"	279	300	person	NAME	Bob Manello	FIRST_NAME	Bob	LAST_NAME	Manello	INITIAL		TITLE	journalist	RULE	person_name_rule2

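The result lines can also be loaded into Python for closer inspection. The sketch below assumes the layout visible above: tab-separated fields with the matched text, start offset, end offset and type first, followed by alternating attribute names and (possibly empty) values. `parse_result_line` is a hypothetical helper, not part of Express.

```python
def parse_result_line(line):
    """Parse one tab-separated result line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    text, start, end, fs_type = fields[:4]
    record = {"text": text.strip('"'), "start": int(start), "end": int(end), "type": fs_type}
    # The remaining fields alternate attribute name and (possibly empty) value.
    rest = fields[4:]
    record["attributes"] = dict(zip(rest[0::2], rest[1::2]))
    return record


line = (
    '"Professor Kowalski"\t51\t68\tperson\tNAME\tKowalski\tFIRST_NAME\tKowalski'
    "\tLAST_NAME\t\tINITIAL\t\tTITLE\tProfessor\tRULE\tperson_name_rule2"
)
record = parse_result_line(line)
print(record["text"], record["attributes"]["TITLE"], record["attributes"]["RULE"])
```

Grouping the parsed records by their `RULE` attribute is a convenient way to see which rule produced which match.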

**What you can try next:**

- After this first application, try changing the OUTPUT attribute from 'grammar' to 'all' in the grammar file, then recompile the grammar and apply it again to the same file. You should see more information in the output file.
- Move more input files into the "input" folder (some examples in different languages are just one level up), and observe the results.
- try to write a rule to recognize the missed names in ``
