In [1]:
# --- This is not needed if NXP is installed --- #
import sys
import os.path as op

# add the src/ directory to the Python path
sys.path.insert(0,op.realpath('../src'))
import nxp


# Matching expressions with NXP

In this short tutorial, we will see how to define and match text patterns using NXP.


## Matching numbers

As a first example, we will try to find numbers in a given string. <br>
Numbers in text are usually integers or floating-point numbers, so we first define a regular expression for each case separately:

In [3]:
from nxp import Regex

num_integer = Regex( r'-?\d+' )
num_float = Regex( r'-?\d*\.\d+([eE][-+]?\d+)?' )

To match integers and floating-point numbers alike, these expressions need to be combined. To do so, we use the alias `Either`, which is equivalent to (but clearer than) `Set( [TokenList], max=1 )`.

In [4]:
from nxp import Either

number = Either( num_integer, num_float )
print(number) 

{-?\d+, -?\d*\.\d+([eE][-+]?\d+)?}


Notice how, when printed, the two patterns are grouped in a list delimited by curly brackets; in NXP, token sets are represented with curly brackets `{}`, and sequences with square brackets `[]`.

Great, now let's try to match a string with numbers in it!

In [5]:
matches = nxp.findall( number, 'sqrt(2) is approximately equal to 1.414' )
print(matches)

[<nxp.expr.match.TMatch object at 0x7fd788f63d18>, <nxp.expr.match.TMatch object at 0x7fd788f63db8>, <nxp.expr.match.TMatch object at 0x7fd788f63e58>]


Ok, why are there 3 matches?

In [6]:
for match in matches: 
    print(match)

(0, 5) - (0, 6) 2
(0, 34) - (0, 35) 1
(0, 35) - (0, 39) .414


Hmm. It looks like the integer and decimal parts of `1.414` were matched separately.  Weird.

We will find out what went wrong soon enough, but first notice that the string representation of a match is formatted as follows:
```
position_begin - position_end text_matched
```
where the positions have the format `(line,col)`. That's cool, but wouldn't it be better to show the match within the surrounding text? 

Well actually, this is not directly possible, which allows us to make an important point: matches only carry restricted information in order to remain lightweight objects, and in particular they have no knowledge of the surrounding text. To make them aware, and place a match within its context, it is necessary to provide the `Buffer` object that contains the entire text.

_What buffer object? We just gave a string to nxp.findall!_

That's right, but under the hood, a buffer had to be created in order to wrap this string, and to generate a cursor pointing to that buffer. This might sound complicated, but in short, if you want to show more information about your matches, here is what you should do:

In [7]:
text = 'sqrt(2) is approximately equal to 1.414'
cursor = nxp.make_cursor(text)
matches = number.findall(cursor)

for match in matches:
    print(match.insitu(cursor.buffer))

sqrt(2) is approxim
     -             
ely equal to 1.414
             -    
ly equal to 1.414
             ----


That's better! Notice the three main differences:

1. We had to manually create a cursor for the text, using `nxp.make_cursor()`.
2. We used the token directly to find matches, with `number.findall(cursor)`, instead of calling `nxp.findall( number, text )`.
3. Detailed information about the match is provided by the method `match.insitu(buffer)`.
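
To make the idea behind `insitu` more concrete, here is a minimal sketch in plain Python of how a match could be displayed within its surrounding text. This is not NXP's implementation: the function `show_in_context` and its `width` parameter are hypothetical, the window of 13 characters on each side is only inferred from the output above, and the sketch assumes a single-line text.

```python
def show_in_context(text, start, end, width=13):
    """Return the text around text[start:end], with the match underlined.

    Hypothetical helper for illustration only, not part of NXP.
    """
    lo = max(0, start - width)        # left edge of the window
    hi = min(len(text), end + width)  # right edge of the window
    underline = ' ' * (start - lo) + '-' * (end - start) + ' ' * (hi - end)
    return text[lo:hi] + '\n' + underline

text = 'sqrt(2) is approximately equal to 1.414'
print(show_in_context(text, 5, 6))    # the "2" in sqrt(2)
print(show_in_context(text, 34, 39))  # the number 1.414
```

The helper needs the full text in addition to the match positions, which is why a match on its own cannot produce this kind of display.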

Great, now back to the main question: **why are there 3 matches?**

Well, the [documentation](https://jhadida.github.io/nxp/#/expr/intro?id=composition) says that tokens in a set are matched *sequentially*, i.e. in the order specified. This is NOT to say that the tokens _have to_ match in that order (in fact they do not), but rather that we _check_ for each of them in that order, one after the other. Read this again to make sure you understand the distinction.

Because of this, we can diagnose why the integer and decimal parts of `1.414` were matched separately: it is because the first token `num_integer` was able to match the integer part before `num_float` was checked, and by then the cursor had already moved on to `.414`, which is actually a valid floating-point number. Does that make sense?

This teaches us an important lesson when combining patterns: **when several tokens in a set can match overlapping strings, it is important to list them in the "right" order**. In our case, we just need to reorder the tokens within the set to fix the problem.

In [8]:
number = Either( num_float, num_integer )
matches = number.findall( cursor.reset() )

for match in matches:
    print(match.insitu(cursor.buffer))

sqrt(2) is approxim
     -             
ely equal to 1.414
             -----


Now that's the output we expected!
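
The ordering effect is not specific to NXP. As a rough illustration, the sketch below uses only Python's standard `re` module to try a list of patterns in order at each position, and to move past whatever matched first. The function `scan` is hypothetical and is only an assumption about the general mechanism, not a description of how NXP is implemented.

```python
import re

INTEGER = r'-?\d+'
FLOAT = r'-?\d*\.\d+(?:[eE][-+]?\d+)?'

def scan(patterns, text):
    """Try each pattern in order at the current position; keep the first hit.

    Hypothetical helper for illustration only, not part of NXP.
    """
    compiled = [re.compile(p) for p in patterns]
    pos, found = 0, []
    while pos < len(text):
        for pattern in compiled:
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                found.append(m.group())
                pos = m.end()  # move past the match before checking again
                break
        else:
            pos += 1           # nothing matched here, advance by one character
    return found

text = 'sqrt(2) is approximately equal to 1.414'
integer_first = scan([INTEGER, FLOAT], text)
float_first = scan([FLOAT, INTEGER], text)
print(integer_first)  # ['2', '1', '.414']
print(float_first)    # ['2', '1.414']
```

With the integer pattern first, `1` is consumed before the float pattern is ever tried at that position, which reproduces the three matches seen earlier.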

If you feel like practicing on a more complicated example, try to write an expression to capture numbers in scientific notation. E.g.: `"The Avogadro constant is exactly equal to 6.022 140 76×10^23"`

## With repetitions

To match a pattern several times in a row in NXP, we can use the alias `Many`, which looks for two or more consecutive repetitions of a pattern (separated here by whitespace, as specified with `sep`):

In [9]:
from nxp import Regex, Many, make_cursor

text = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'
cursor = make_cursor(text)
expr = Many( r'chuck', sep=r'\s+' )

for match in expr.finditer(cursor):
    print(match.insitu(cursor.buffer))

 would a woodchuck chuck if a woodchu
             -----------             
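
For comparison, a similar result can be obtained with Python's standard `re` module. This is only a rough analogue, based on the behaviour observed in the output above (isolated occurrences of `chuck` are not reported); it makes no claim about how `Many` works internally.

```python
import re

text = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'

# "chuck", followed one or more times by whitespace and another "chuck",
# i.e. at least two occurrences in a row
pattern = re.compile(r'chuck(?:\s+chuck)+')

matches = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
print(matches)  # [(26, 37, 'chuck chuck')]
```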


## Case insensitive

This last example illustrates how to create case-insensitive patterns (by default, patterns are case-sensitive):

In [10]:
from nxp import Regex, Either, Many, make_cursor

text = 'Abracadabra! Abraham Lincoln had a cadillac.'
cursor = make_cursor(text)

expr1 = Either( Regex('abra',case=True), Regex('cad') )
expr2 = Either( Regex('abra',case=False), Regex('cad') )

print('## CASE SENSITIVE ##')
for match in Many(expr1).finditer(cursor.reset()):
    print(match.insitu(cursor.buffer))
    
print('## CASE INSENSITIVE ##')
for match in expr2.finditer(cursor.reset()):
    print(match.insitu(cursor.buffer))

## CASE SENSITIVE ##
Abracadabra! Abraham Lin
    -------             
## CASE INSENSITIVE ##
Abracadabra! Abra
----             
Abracadabra! Abraham
    ---             
Abracadabra! Abraham Lin
       ----             
Abracadabra! Abraham Lincoln h
             ----             
incoln had a cadillac.
             ---      


There are two things to notice here:<br><br>

- Firstly, notice the difference between the case-sensitive and case-insensitive results. As expected, the second expression matches the pattern `Abra` with a capital A, whereas the first expression does not.<br><br>

- Secondly, notice that we used `Many` in the first case, which led to matching contiguous occurrences of `expr1`. In contrast, without `Many` in the second case, `cad` and `abra` are distinct matches.
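
Both observations can be reproduced with Python's standard `re` module, as sketched below. This is only an analogy for readers familiar with `re`: the scoped inline flag `(?i:...)` plays the role of `case=False`, and the quantifier `{2,}` stands in for `Many`, on the assumption (drawn from the outputs above) that `Many` requires at least two contiguous occurrences.

```python
import re

text = 'Abracadabra! Abraham Lincoln had a cadillac.'

# Case-sensitive: two or more contiguous occurrences of "abra" or "cad"
sensitive = re.findall(r'(?:abra|cad){2,}', text)

# "abra" in any case, or "cad"; each occurrence is reported separately
insensitive = re.findall(r'(?i:abra)|cad', text)

print(sensitive)    # ['cadabra']
print(insensitive)  # ['Abra', 'cad', 'abra', 'Abra', 'cad']
```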