In [1]:
# --- This is not needed if NXP is installed --- #
import sys
import os.path as op

# add the src/ directory to the Python path
sys.path.insert(0,op.realpath('../src'))


# Matching expressions with NXP

In this short tutorial, we will see how to define and match text patterns using NXP.


In [2]:
import nxp

## Matching numbers

As a first example, we will try to find numbers in a given string. <br>
Numbers in text are usually integers or floating-point numbers, so we first define a regular expression for each case separately:

In [3]:
from nxp import Regex

num_integer = Regex( r'-?\d+' )
num_float = Regex( r'-?\d*\.\d+([eE][-+]?\d+)?' )
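
Before combining the two patterns, it can help to check what each one accepts on its own. The sketch below uses Python's standard `re` module instead of NXP, on the assumption that `Regex` accepts the same pattern syntax; the helper `accepts` and the sample strings are illustrative only.

```python
import re

# Same pattern strings as above, compiled with the standard library
# (assumption: NXP's Regex uses the same syntax as Python's re module).
INTEGER = re.compile(r'-?\d+')
FLOAT = re.compile(r'-?\d*\.\d+([eE][-+]?\d+)?')

def accepts(pattern, text):
    """Return True if the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None

samples = ['42', '-7', '1.414', '.5', '-2.5e-3', '1.', 'abc']
for s in samples:
    print(f'{s!r:>10}  integer={accepts(INTEGER, s)}  float={accepts(FLOAT, s)}')
```

Note that the floating-point pattern requires at least one digit after the decimal point, so `'1.'` is rejected, and that neither pattern accepts the other's inputs in full.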

To match integers and floating-point numbers alike, these expressions need to be combined. To do so, we use the alias `Either`, which is equivalent to (but clearer than) `Set( [TokenList], max=1 )`.

In [4]:
from nxp import Either

number = Either( num_integer, num_float )
print(number) 

{-?\d+, -?\d*\.\d+([eE][-+]?\d+)?}


Notice how, when printed, the two patterns are grouped in a list delimited by curly brackets; in NXP, token sets are represented with curly brackets `{}`, and sequences with square brackets `[]`.

Great, now let's try to match a string with numbers in it!

In [5]:
matches = nxp.findall( number, 'sqrt(2) is approximately equal to 1.414' )
print(matches)

[<nxp.expr.match.TMatch object at 0x7fd0602e6508>, <nxp.expr.match.TMatch object at 0x7fd063b61548>, <nxp.expr.match.TMatch object at 0x7fd06025b208>]


Ok. That seems to have worked, but why are there 3 matches?

In [6]:
for match in matches: 
    print(match)

[0] (0, 5) - (0, 6) 2
[0] (0, 34) - (0, 35) 1
[0] (0, 35) - (0, 39) .414


Hmm. It looks like the integer and decimal parts of `1.414` were matched separately.  Weird.

We will find out what went wrong soon enough, but there is another issue: this printing isn't particularly useful, and what are these `[0]` at the beginning of each line? 
The string representation of the matches seems to be formatted as follows:
```
[0] position_begin - position_end text_matched
```
where the positions have the format `(line,col)`. That seems reasonable, except for the `[0]`, but it would be better to show the match within the surrounding text, wouldn't it? 

This actually leads to an important point: matches only carry restricted information in order to remain lightweight objects. In particular, they have no knowledge of the surrounding text. In order to have this information, it is necessary to provide the `Buffer` object that contains the entire text.

_What buffer object? We just gave a string to nxp.findall!_

Yes, but under the hood, a buffer had to be created in order to wrap this string, and to generate a cursor pointing to that buffer. This probably sounds more complicated than it is, but in short, if you want to show more information about your matches, here is what you should do:

In [7]:
text = 'sqrt(2) is approximately equal to 1.414'
cursor = nxp.make_cursor(text)
matches = number.findall(cursor)

for match in matches:
    print(match.insitu(cursor.buffer))

Pattern: {-?\d+, -?\d*\.\d+([eE][-+]?\d+)?}
	[0] sqrt(2) is approxim
	         -             
Pattern: {-?\d+, -?\d*\.\d+([eE][-+]?\d+)?}
	[0] ely equal to 1.414
	                 -    
Pattern: {-?\d+, -?\d*\.\d+([eE][-+]?\d+)?}
	[0] ly equal to 1.414
	                 ----


Notice the three main differences:

1. We had to manually create a cursor for the text, using `nxp.make_cursor()`.
2. We used the token directly to find matches, with `number.findall(cursor)`, instead of calling `nxp.findall( number, text )`.
3. Detailed information about the match is provided by the method `match.insitu(buffer)`.
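
To see why the surrounding text has to be supplied separately, here is a minimal sketch in plain Python of how a position-only match can be displayed in context. The helper `show_in_context` is hypothetical and is not how `insitu` is implemented in NXP; it only illustrates that begin and end positions are not enough without the original text.

```python
def show_in_context(text, start, end, width=12):
    """Return an excerpt of `text` around [start, end), with the span underlined."""
    left = max(0, start - width)
    right = min(len(text), end + width)
    excerpt = text[left:right]
    # Pad the underline so that it sits below the matched characters.
    underline = ' ' * (start - left) + '-' * (end - start)
    return excerpt + '\n' + underline

text = 'sqrt(2) is approximately equal to 1.414'
span = (35, 39)  # columns reported earlier for the match '.414'
print(show_in_context(text, *span))
```

The span alone, `(35, 39)`, says nothing about what was matched; the excerpt and underline only become available once the text is passed in as well.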

Now about the `[0]`: remember that a match can have **several** repetitions (see the [docs](https://jhadida.github.io/nxp/#/expr/intro?id=multiplicity)). So the reason we only see `[0]` here is simply that each of these 3 matches only captured a single occurrence of a number. We will see an example of matching with repetitions later on. <br> <br>

Great, now that this is sorted, let us tackle the main question. **Why are there 3 matches?**

Well, the documentation also says that tokens in a set are matched *sequentially*, i.e. in the order specified. This is NOT to say that the tokens _have to_ match in that order (in fact they do not), but rather that we _check_ each of them in that order, one after the other. This is an important distinction to understand.

Because of this, we can diagnose why the integer and decimal parts of `1.414` were matched separately: it is because the first token `num_integer` was able to match the integer part before `num_float` was checked, and by then the cursor had already moved on to `.414`, which is actually a valid floating-point number. Does that make sense?

This teaches us an important lesson when combining patterns: **when successive tokens can match overlapping strings, it is important to list them in the "right" order**. In our case, we just need to reorder the tokens within the set to fix the problem.

In [8]:
number = Either( num_float, num_integer )
matches = number.findall( cursor.reset() )

for match in matches:
    print(match.insitu(cursor.buffer))

Pattern: {-?\d*\.\d+([eE][-+]?\d+)?, -?\d+}
	[0] sqrt(2) is approxim
	         -             
Pattern: {-?\d*\.\d+([eE][-+]?\d+)?, -?\d+}
	[0] ely equal to 1.414
	                 -----
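
The same ordering effect can be reproduced outside NXP. The sketch below uses alternation in Python's standard `re` module, where alternatives are also tried from left to right; it is an analogy and not a description of NXP's internals, and the variable names are illustrative.

```python
import re

text = 'sqrt(2) is approximately equal to 1.414'
integer = r'-?\d+'
floating = r'-?\d*\.\d+(?:[eE][-+]?\d+)?'

# At each position, alternatives are tried left to right,
# and the first one that succeeds is kept.
integer_first = re.findall(f'{integer}|{floating}', text)
float_first = re.findall(f'{floating}|{integer}', text)

print(integer_first)  # ['2', '1', '.414']
print(float_first)    # ['2', '1.414']
```

With the integer pattern first, `1` is consumed before the floating-point pattern is ever tried, which leaves `.414` to be matched on its own.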


If you feel like practicing on a more complicated example, try to write an expression to capture numbers in scientific notation.<br>
E.g.: `"The Avogadro constant is exactly equal to 6.022 140 76×10^23"`
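
As a possible starting point, here is a sketch of one pattern that handles this particular example, written with Python's standard `re` module rather than NXP. The names `SCIENTIFIC` and `to_float` are hypothetical, and the pattern only covers the `×10^` notation used above.

```python
import re

text = 'The Avogadro constant is exactly equal to 6.022 140 76×10^23'

# Hypothetical pattern: a mantissa with optional space-separated digit groups,
# followed by '×10^' and a signed integer exponent.
SCIENTIFIC = re.compile(r'-?\d+(?:\.\d+(?:\s\d+)*)?\s*×\s*10\^[-+]?\d+')

def to_float(matched):
    """Convert a string such as '6.022 140 76×10^23' to a float."""
    mantissa, exponent = matched.split('×')
    mantissa = re.sub(r'\s', '', mantissa)     # drop the digit-group spaces
    exponent = exponent.strip().split('^')[1]  # keep what follows '10^'
    return float(f'{mantissa}e{exponent}')

found = SCIENTIFIC.search(text).group()
print(found, '->', to_float(found))
```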

## With repetitions

One of the weird things with the previous example was the presence of a `[0]` prefix when printing the matches. We said that this was because each match has its own *multiplicity*, which allows for contiguous repetitions of the same pattern (see the [docs](https://jhadida.github.io/nxp/#/expr/intro?id=multiplicity)).

In order to allow a pattern to match several times (once or more) in NXP, we can use the alias `Many`:

In [9]:
from nxp import Regex, Many, make_cursor

text = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'
cursor = make_cursor(text)
expr = Regex( r'chuck\s*' )

for match in Many(expr).finditer(cursor):
    print(match.insitu(cursor.buffer))

Pattern: chuck\s*
	[0]  would a woodchuck chuck if a wo
	                 ------             
	[1]  a woodchuck chuck if a woodchuc
	                 ------             
Pattern: chuck\s*
	[0] uck if a woodchuck could chuck w
	                 ------             
Pattern: chuck\s*
	[0] dchuck could chuck wood?
	                 ------     


Notice how the first match now has _two_ repetitions of the pattern, listed with `[0]` and `[1]`. It is important to understand the difference between multiple matches, and the multiplicity of a match; in practice, both will often be possible, and it will be up to you to specify which output you expect.
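
The distinction between multiple matches and multiplicity also exists in ordinary regular expressions. The sketch below uses Python's standard `re` module, with a `+` quantifier standing in loosely for `Many`; it is an analogy only, and the variable names are illustrative.

```python
import re

text = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'
unit = r'chuck\s*'

# One match per contiguous run of the unit (loosely comparable to Many(expr)).
runs = [m.group() for m in re.finditer(f'(?:{unit})+', text)]

# Within each run, count the individual repetitions.
repetitions = [len(re.findall(unit, run)) for run in runs]

# Without the quantifier, every occurrence is a separate match.
singles = re.findall(unit, text)

print(runs)          # ['chuck chuck ', 'chuck ', 'chuck ']
print(repetitions)   # [2, 1, 1]
print(len(singles))  # 4
```

Three matches with repetition counts 2, 1 and 1 correspond to the NXP output above, whereas matching the unit alone yields four separate matches.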

Finally, just note that using `Many(token)` doesn't actually change the input token, but instead creates a new one with multiplicity `'1+'`. You can check this by looking at the `mul` property:

In [10]:
print(expr.mul)
print(Many(expr).mul)

[(1, 1)]
[(1, inf)]


## Case insensitive

This last example illustrates how to create case-insensitive patterns (by default, patterns are case-sensitive). We also use this opportunity to emphasize once more the difference between multiple matches and the multiplicity of a match:

In [11]:
from nxp import Regex, Either, Many, make_cursor

text = 'Abracadabra! Abraham Lincoln had a cadillac.'
cursor = make_cursor(text)

expr1 = Either( Regex('abra',case=True), Regex('cad') )
expr2 = Either( Regex('abra',case=False), Regex('cad') )

print('## CASE SENSITIVE ##')
for match in Many(expr1).finditer(cursor.reset()):
    print(match.insitu(cursor.buffer))
    
print('## CASE INSENSITIVE ##')
for match in expr2.finditer(cursor.reset()):
    print(match.insitu(cursor.buffer))

## CASE SENSITIVE ##
Pattern: {abra, cad}
	[0] Abracadabra! Abraham
	        ---             
	[1] Abracadabra! Abraham Lin
	           ----             
Pattern: {abra, cad}
	[0] incoln had a cadillac.
	                 ---      
## CASE INSENSITIVE ##
Pattern: {abra, cad}
	[0] Abracadabra! Abra
	    ----             
Pattern: {abra, cad}
	[0] Abracadabra! Abraham
	        ---             
Pattern: {abra, cad}
	[0] Abracadabra! Abraham Lin
	           ----             
Pattern: {abra, cad}
	[0] Abracadabra! Abraham Lincoln h
	                 ----             
Pattern: {abra, cad}
	[0] incoln had a cadillac.
	                 ---      


There are two things to notice here:<br><br>

- First, notice the difference between the case-sensitive and case-insensitive results. As expected, the second expression matches the text `Abra` with a capital A, whereas the first expression does not.<br><br>

- Second, notice that we used the modifier `Many` in the first case; since the first two matches are contiguous in the text, they were grouped as successive repetitions of the same pattern. In contrast, without `Many` in the second case, the first three matches are distinct, even though they are contiguous in the text. As noted before, it will be up to you in practice to decide which output you want depending on the situation.
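
For comparison, both behaviours can be approximated with Python's standard `re` module. This is an analogy and not NXP's implementation: the scoped inline flag `(?i:...)` (available from Python 3.6) plays the role of `case=False` on a single sub-pattern, and the `+` quantifier loosely plays the role of `Many`.

```python
import re

text = 'Abracadabra! Abraham Lincoln had a cadillac.'

# Case-sensitive alternatives, with contiguous repetitions grouped into one match.
sensitive_runs = [m.group() for m in re.finditer(r'(?:abra|cad)+', text)]

# (?i:...) makes only 'abra' case-insensitive; there is no quantifier,
# so each occurrence is its own match.
insensitive_hits = [m.group() for m in re.finditer(r'(?i:abra)|cad', text)]

print(sensitive_runs)    # ['cadabra', 'cad']
print(insensitive_hits)  # ['Abra', 'cad', 'abra', 'Abra', 'cad']
```

The two matches in the first list and the five in the second mirror the NXP output above.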