# Matching expressions with NXP

In this short tutorial, we will see how to match expressions using NXP.

In [1]:
# --- This is not needed if NXP is installed --- #

# add the src/ directory to the Python path
import sys
import os.path as op
sys.path.insert(0,op.realpath('../src'))

In [2]:
import nxp

## Matching numbers

As a first example, we will try to find numbers in a given string. 
Numbers in text are usually integers or floating-point numbers, so we first define regular expressions for both cases:

In [3]:
from nxp import Regex

num_integer = Regex( r'-?\d+' )
num_float = Regex( r'-?\d*\.\d+([eE][-+]?\d+)?' )
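
To get a feel for what these two patterns accept, here is a small check written with Python's standard `re` module rather than NXP. It assumes that NXP's `Regex` follows the same pattern syntax as `re`; the helper `accepts` and the constant names are made up for this illustration.

```python
import re

# Same patterns as in the cell above, compiled with the standard library.
# Assumption: nxp.Regex interprets patterns like Python's `re` module.
INT_PATTERN = re.compile(r'-?\d+')
FLOAT_PATTERN = re.compile(r'-?\d*\.\d+([eE][-+]?\d+)?')

def accepts(pattern, s):
    """Return True if the whole string s is matched by pattern (hypothetical helper)."""
    return pattern.fullmatch(s) is not None

for s in ['42', '-7', '1.414', '.5', '-2.5e-3', '1.', '1e5']:
    print(f'{s!r:10} integer={accepts(INT_PATTERN, s)} float={accepts(FLOAT_PATTERN, s)}')
```

Note that strings such as `1.` or `1e5` are accepted in full by neither pattern, because the floating-point pattern requires a dot followed by at least one digit.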

In order to match integers and floating-point numbers alike, these expressions need to be combined, e.g. using the alias `Either`.

In [4]:
from nxp import Either

number = Either( num_integer, num_float )

# equivalent to (but clearer than): 
# number = nxp.Set( [num_integer, num_float], max=1 )

print(number) # sets are represented with {}, and sequences with []

{-?\d+, -?\d*\.\d+([eE][-+]?\d+)?}


Great, now let's try to match a string with numbers in it!

In [5]:
matches = nxp.findall( number, 'sqrt(2) is approximately equal to 1.414' )
print(matches)

[<nxp.expr.match.TElement object at 0x7fce1f1251c8>, <nxp.expr.match.TElement object at 0x7fce1f057488>, <nxp.expr.match.TElement object at 0x7fce0ef19d88>]


Ok. That seems to have worked, but why are there 3 matches?

In [6]:
for match in matches: 
    print(match)

[0] (0, 5) - (0, 6) 2
[0] (0, 34) - (0, 35) 1
[0] (0, 35) - (0, 39) .414


Hmm. It looks like the integer and decimal parts of `1.414` were matched separately.  Weird.

We will find out what went wrong soon enough, but there is another issue: this printing isn't particularly useful, and what are these `[0]` at the beginning of each line? 
The string representation of the matches seems to be formatted as follows:
```
[0] position_begin - position_end text_matched
```
where the positions have the format `(line,col)`. That seems fairly reasonable, apart from the `[0]`, but it would be better to show the match within the surrounding text, wouldn't it? 

This actually leads to an important point: matches only carry a minimum amount of information in order to remain lightweight objects. In particular, they have no knowledge of the surrounding text. In order to have this information, it is necessary to provide the `Buffer` object that contains the entire text.

_What buffer object? We just gave a string to nxp.findall!_

Yes, but under the hood, a buffer had to be created in order to wrap this string, and to generate a cursor pointing to that buffer. This probably sounds more complicated than it is, but in short, if you want to show more information about your matches, here is what you should do instead:

In [7]:
text = 'sqrt(2) is approximately equal to 1.414'
cursor = nxp.make_cursor(text)
matches = number.findall( cursor )

for match in matches:
    print(match.insitu(cursor.buffer))

Pattern: {-?\d+, -?\d*\.\d+([eE][-+]?\d+)?}
	[0] sqrt(2) is approxim
	         -             
Pattern: {-?\d+, -?\d*\.\d+([eE][-+]?\d+)?}
	[0] ely equal to 1.414
	                 -    
Pattern: {-?\d+, -?\d*\.\d+([eE][-+]?\d+)?}
	[0] ly equal to 1.414
	                 ----


Notice the three main differences:

1. We had to manually create a cursor for the text, using `nxp.make_cursor()`.
2. We used the token directly in order to find matches, with `number.findall( cursor )`, instead of calling `nxp.findall( token, text )`.
3. The information about the match is now provided by the method `match.insitu( cursor.buffer )`, which needs the buffer in order to show the surrounding text.
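
For intuition about why the surrounding text has to be supplied separately, here is a rough stand-alone sketch of an in-situ display. It is not NXP's implementation: the function `show_in_context` and its `width` parameter are hypothetical, and the match is reduced to a pair of column positions on a single line.

```python
def show_in_context(text, start, end, width=12):
    """Return a window of text around [start, end) with the match underlined.

    Hypothetical helper: the match itself only stores positions,
    so the full text must be passed in to display any context.
    """
    lo = max(0, start - width)
    hi = min(len(text), end + width)
    snippet = text[lo:hi]
    underline = ' ' * (start - lo) + '-' * (end - start)
    return snippet + '\n' + underline

text = 'sqrt(2) is approximately equal to 1.414'
print(show_in_context(text, 34, 39))  # columns 34 to 39 cover '1.414'
```

The positions alone are enough to identify a match, but rendering it requires access to the text, which is the role played by the buffer above.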

Now about the `[0]`: remember that a match can have **several** repetitions (see the [docs]()). So the reason we only see `[0]` here is simply that each of the three matches only captured a single instance of a number. We will see an example of matching with repetitions later on.

Great, now that we have cleared this up, let us tackle the main question. **Why are there 3 matches?**

Well, the documentation also says that tokens in a set are matched *sequentially*, in the order specified. This is NOT to say that the tokens _have to_ match in that order — and in fact they do not — but rather that we _check_ each of them, one after the other, in the order specified. This is an important distinction.

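The reason for the split is that the integer pattern, which is checked first, already matches the leading `1` of `1.414`. The floating-point pattern therefore never gets a chance to match the whole number, and only matches the remaining `.414` afterwards. To fix the problem, we just need to reorder the tokens within the set, so that the floating-point pattern is checked first.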

In [8]:
number = Either( num_float, num_integer )
matches = number.findall( cursor.reset() )

for match in matches:
    print(match.insitu(cursor.buffer))

Pattern: {-?\d*\.\d+([eE][-+]?\d+)?, -?\d+}
	[0] sqrt(2) is approxim
	         -             
Pattern: {-?\d*\.\d+([eE][-+]?\d+)?, -?\d+}
	[0] ely equal to 1.414
	                 -----
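
The same ordering effect can be reproduced outside NXP with an ordered alternation in Python's standard `re` module. This is only an analogy, assuming that alternatives are tried left to right in both cases; the optional exponent group is written as non-capturing so that `re.findall` returns the whole match.

```python
import re

text = 'sqrt(2) is approximately equal to 1.414'

INT = r'-?\d+'
FLOAT = r'-?\d*\.\d+(?:[eE][-+]?\d+)?'

# Alternatives are tried left to right at each position.
int_first = re.findall(f'{INT}|{FLOAT}', text)
float_first = re.findall(f'{FLOAT}|{INT}', text)

print(int_first)    # ['2', '1', '.414']
print(float_first)  # ['2', '1.414']
```

With the integer pattern first, the `1` is consumed before the floating-point alternative is ever tried at that position, which mirrors the three matches seen earlier.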


## Matching with repetitions

In [9]:
from nxp import Regex, Many, make_cursor

text = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'
cursor = make_cursor(text)
expr = Regex( r'chuck\s*' )

for match in Many(expr).finditer(cursor):
    print(match.insitu(cursor.buffer))

Pattern: chuck\s*
	[0]  would a woodchuck chuck if a wo
	                 ------             
	[1]  a woodchuck chuck if a woodchuc
	                 ------             
Pattern: chuck\s*
	[0] uck if a woodchuck could chuck w
	                 ------             
Pattern: chuck\s*
	[0] dchuck could chuck wood?
	                 ------     


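Notice the `[1]` in the first match: `Many` captured two consecutive repetitions of the pattern there (the `chuck` of `woodchuck`, followed by the word `chuck`), whereas the other two matches contain a single repetition each.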
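
As a point of comparison, the grouping into runs of repetitions can be sketched with the standard `re` module. This is not how NXP implements `Many`; it only illustrates the idea of one match holding several consecutive repetitions, and the variable names are chosen for this example.

```python
import re

text = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'
unit = r'chuck\s*'

# Each match of (?:unit)+ is a run of consecutive repetitions;
# splitting the run with the unit pattern recovers the individual repetitions.
runs = [re.findall(unit, m.group()) for m in re.finditer(f'(?:{unit})+', text)]

for run in runs:
    print(len(run), run)
```

The first run contains two repetitions and the other two contain one each, which corresponds to the `[0]` and `[1]` entries shown in the output above.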