# Finding things

> Computer languages of the future will be more concerned with goals and less with procedures specified by the programmer.
\--_Marvin Minsky_

Just like with indexing, there seems to be a baffling array of ways to locate items in arrays in APL. As stated in [The Zen of Python](https://zen-of-python.info/): 

> There should be one-- and preferably only one --obvious way to do it.

Aha. About that...

![laugh](https://media.giphy.com/media/JmD9mkDmzvXE7nxy7j/giphy-downsized-medium.gif)

In [1]:
⎕IO ← 0
]box on
]rows on

## Equality `=`

So how do you locate where an element resides in an array in APL? Well, the obvious way to do it is to exploit equality and scalar pervasion. Where is the number 2?

In [1]:
]DISPLAY ↑(2=data)(data ← 4 1 22 20 16 10 25 7 18 11 15 2 12 23 9 17 6 14 21 19 3 8 5 13 24)

That's surely sufficiently _Zen_ in the _Zen of Python_ sense. Hands up every item that equals 2. We can find out the actual index, too, by using _Iota underbar_, `⍸`, which, as you may recall from an [earlier chapter](./manip.ipynb), in its monadic form is appropriately enough called [_Where_](https://help.dyalog.com/latest/index.htm#Language/Primitive%20Functions/Where.htm):

In [3]:
⍸2=data ⍝ Where is data = 2?

As we should expect by now, scalar pervasion lets scalar functions penetrate any level of nesting:

In [13]:
⎕ ← nested ← (1 2 (3 4 5))(4 (1 2(3 4)))
4=nested

## Match `≡`

[_Match_](http://help.dyalog.com/latest/index.htm#Language/Primitive%20Functions/Match.htm) we've already met. It is like equality, but for non-scalar things. You could be forgiven for thinking that if you look for a particular vector in a vector-of-vectors, you should be able to use equality, but that doesn't work:

In [2]:
strings ← 'aaa' 'bbb' 'ccc' 'ddd' 'ccc' 'aaa' 'ccc' 'bbb'
'ccc'=strings  ⍝ LENGTH ERROR!

LENGTH ERROR
      'ccc'=strings  ⍝ LENGTH ERROR!
           ∧


Instead we need either _Jot match each_ (`∘≡¨`) or _Enclose match each_:

In [6]:
⍸'ccc'∘≡¨strings
⍸(⊂'ccc')≡¨strings
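
The difference between the two functions is easiest to see on two strings of the same length, where _Equal_ does not error. A minimal sketch (the literals are made up for illustration):

```apl
'ccc'='cac'                  ⍝ 1 0 1: Equal compares item by item
'ccc'≡'cac'                  ⍝ 0: Match compares the two arrays as wholes
'ccc'≡'ccc'                  ⍝ 1
```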

## Index of `A⍳B`

Another way we can find the index of things is via [_Index of_](http://help.dyalog.com/latest/index.htm#Language/Primitive%20Functions/Index%20Of.htm), dyadic `⍳`, which we have actually met before:

In [14]:
data⍳2

Dyadic _iota_ finds the _first_ index of each item or, if the item is not found, 1 plus the last valid index. This has the nice feature that we can feed it an array to the right:

In [18]:
data⍳2 3
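
The "not found" value is easy to overlook, so here is a minimal sketch of how one might test for it. It repeats `data` so that it stands alone; the names `idx` and `found` are only for illustration:

```apl
⎕IO ← 0                      ⍝ same index origin as the rest of this chapter
data ← 4 1 22 20 16 10 25 7 18 11 15 2 12 23 9 17 6 14 21 19 3 8 5 13 24

idx ← data⍳2 99              ⍝ 2 lives at index 11; 99 is absent
idx                          ⍝ 11 25
found ← idx<≢data            ⍝ 1 0: a result equal to ≢data means "not found"
found/idx                    ⍝ 11
```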

Looking up several items at once with equality requires a bit more dexterity, but that approach also spots multiples:

In [3]:
⍸∨⌿2 3 ∘.= data
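
Since `data` holds no repeated values, the difference only shows on a vector with duplicates. A small sketch, assuming a made-up vector `dups`; the last line uses _Membership_ (`∊`) as one possible shorter spelling of the same question:

```apl
⎕IO ← 0                      ⍝ same index origin as the rest of this chapter
dups ← 3 1 2 3 2             ⍝ hypothetical data with repeated values

dups⍳2 3                     ⍝ 2 0: first occurrences only
⍸∨⌿2 3∘.=dups                ⍝ 0 2 3 4: every position holding a 2 or a 3
⍸dups∊2 3                    ⍝ 0 2 3 4: same answer via Membership
```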

What about if we crank the rank a bit?

In [26]:
⎕ ← mat ← 4 4⍴2 15 16 14 11 9 12 10 13 1 7 5 4 8 6 3

In [29]:
⎕ ← mask ← 7=mat
⍸mask
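
For a matrix, _Where_ returns one coordinate pair per hit, and those pairs can be used directly as an index. A minimal sketch, repeating `mat` so that it stands alone; `coords` is an illustrative name:

```apl
⎕IO ← 0                      ⍝ same index origin as the rest of this chapter
mat ← 4 4⍴2 15 16 14 11 9 12 10 13 1 7 5 4 8 6 3

coords ← ⍸mat>14             ⍝ (0 1)(0 2): one row/column pair per hit
mat[coords]                  ⍝ 15 16: the pairs index straight back into mat
```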

Gotta love APL. 

## Find `⍷`

There is of course also a glyph that's called [_Find_](http://help.dyalog.com/latest/index.htm#Language/Primitive%20Functions/Find.htm) (`⍷`) which locates the start-points of subsequences:

In [4]:
'ana' ⍷ 'banana'

That combines nicely with `⍸` as a tacit atop to give a bit of an idiom to commit to memory:

In [5]:
substr ← ⍸⍷
'ana' substr 'banana'
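
Two details are worth checking for yourself: overlapping occurrences are all reported, and a needle that does not occur gives an empty result. A small sketch, with `'xyz'` as a made-up needle:

```apl
⎕IO ← 0                      ⍝ same index origin as the rest of this chapter
substr ← ⍸⍷

'ana' substr 'banana'        ⍝ 1 3: the two occurrences overlap, both are found
≢'xyz' substr 'banana'       ⍝ 0: no occurrence, so the result is empty
```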

We can use find to locate arrays-in-arrays of higher ranks, too:

In [1]:
]DISPLAY needle ← 2 2⍴0 1 1 0
]DISPLAY haystack ← 4 4⍴0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0

In [7]:
needle ⍷ haystack

The 1s represent the top-left corners of the 'needle' array in 'haystack':

![find1](./IMG/find1.png)

and we can pick out the actual coordinates using the same idiom we used for `substr` above:

In [10]:
needle (⍸⍷) haystack

## Regular expressions

Dyalog supports regular expressions through the most excellent [PCRE](http://help.dyalog.com/latest/index.htm#Language/Appendices/PCRE%20Regular%20Expression%20Details.htm) engine. Dyalog's docs on the topic can be found [here](http://help.dyalog.com/latest/index.htm#Language/System%20Functions/r.htm), and an APL Orchard cultivation was dedicated to the [topic](https://chat.stackexchange.com/rooms/52405/conversation/lesson-24-r-and-s), too. How regexes themselves work is beyond the scope of this book, but an exceptional reference that belongs on every programmer's bookshelf is Jeffrey Friedl's [Mastering Regular Expressions](http://regex.info/book.html).

Let's take a brief look at how regexes are integrated in Dyalog.

Two system operators, `⎕S` and `⎕R`, implement regex search and replace respectively. They're both dyadic operators, taking regular expression(s) to the left and transformation(s) to the right. The derived function can be applied to text data.

Here's a simple example. Using a transformation string of `&`, `⎕S` returns a vector of what was matched:

In [11]:
'll.\sw'⎕S'&' ⊢ 'hello world well  worn'

The left operand can be a nested vector of regexes:

In [24]:
'he' 'wo' 'll'⎕S'&' ⊢ 'hello world well  worn'

The right operand can be a function, too. This is where it gets a bit <s>hairy</s> flexible. We could have written the above as:

In [4]:
'he' 'wo' 'll'⎕S{⍵.Match}'hello world well  worn'

In other words, the right operand function gets passed a _namespace_ representing the match at that point. In this case, `⍵.Match` holds "what the current regex matched", the same as the magic transformation shorthand `'&'`.
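
The namespace carries more than the matched text. As a hedged sketch, `PatternNum` and `Offsets` are two further fields described in the Dyalog documentation for `⎕S`; the variable names `text`, `which` and `where` are made up here:

```apl
text ← 'hello world well  worn'

⍝ Which of the three patterns produced each match (counted from zero)
which ← 'he' 'wo' 'll'⎕S{⍵.PatternNum}⊢text
which                        ⍝ 0 2 1 2 1

⍝ Where each match starts; the first item of Offsets is the whole match
where ← 'he' 'wo' 'll'⎕S{⊃⍵.Offsets}⊢text
where                        ⍝ 0 2 6 14 18
```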


For capture groups, we have numbered references `'\1', '\2'` etc, as per Perl:

In [13]:
'(a{2,}|b{2,})'⎕S'\1' ⊢'aaabababbbbbaaaa'

When using a function operand, we have the opportunity to apply the full might of APL to the matches. Let's find stretches of 2 or longer of `a` or `b`, and turn them to upper-case (using [_Case convert_, `⎕C`](http://help.dyalog.com/latest/index.htm#Language/System%20Functions/c.htm)):

In [22]:
'(a{2,}|b{2,})'⎕S{1⎕C ⍵.Match}'aaabababbbbbaaaa'

The regex replacement operator, `⎕R`, operates much in the same way, but here the right operand represents a substitution into the right argument:

In [19]:
'a(.)a'⎕R'A\1A'⊢'abababadabaaba' ⍝ \1 is the first capture group
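
A function operand works for `⎕R` as well: whatever character vector the function returns is substituted for the match. A minimal sketch reusing the upper-casing example from `⎕S` (the name `shouted` is illustrative):

```apl
⍝ Upper-case every run of two or more a's or b's, leaving the rest alone
shouted ← '(a{2,}|b{2,})'⎕R{1⎕C ⍵.Match}⊢'aaabababbbbbaaaa'
shouted                      ⍝ AAAbabaBBBBBAAAA
```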

A handy trick is to split strings based on a regular expression, removing the separators in the process. This is trickier than you might guess in Dyalog. Here's what [APL Cart](https://aplcart.info/?q=regex%20split#) suggests:

In [1]:
rsplit ← {(⊢/¨r)↓¨⍵⊂⍨(⍳≢⍵)∊⎕IO+⊃¨r←(⍺,'|^')⎕S 0 1⊢⍵} ⍝ APL Cart, with 1+ generalised to ⎕IO+ (this chapter uses ⎕IO←0)

In [2]:
'\d+' rsplit 'aaaa6666bbb1cccc999eee87'

Note the empty segment at the end.
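
To see what `rsplit` is built on, it helps to look at the numeric transformation codes it passes to `⎕S`: `0` asks for the offset of each match and `1` for its length. A small sketch on the same input (the name `text` is made up here):

```apl
text ← 'aaaa6666bbb1cccc999eee87'

'\d+'⎕S 0 1⊢text             ⍝ (4 4)(11 1)(16 3)(22 2): offset and length per match
'\d+|^'⎕S 0 1⊢text           ⍝ same, preceded by (0 0) for the empty match at ^
```

The extra `|^` alternative gives a zero-length match at the very start, so the first segment has something to begin at. The offsets say where to cut, and the lengths say how many separator characters to drop from the front of each piece.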

### Overlapping matches

In normal operation, a regex "consumes" what it matches, and any subsequent match will start where the previous one ended. For example, if we want to capture pairs of letters starting with an `a`:

In [1]:
'a.'⎕S'&'⊢'abaac' ⍝ Won't return ab aa ac

which won't capture the last pair `ac` as it overlaps with the previous match `aa`. If we want to capture potentially overlapping matches, we have two options. Option 1 is the time-honoured technique borrowed from Perl, using a capture group inside a [_zero-width lookahead assertion_](https://www.regular-expressions.info/lookaround.html):

In [2]:
'(?=(a.))'⎕S'\1'⊢'abaac'

Zero-width lookaheads (and lookbehinds) work just like normal patterns, except that they don't _consume_ what they match.

Option 2 is to tell the regex engine that we want to allow overlapping matches via a [_Variant_](http://help.dyalog.com/latest/index.htm#Language/Primitive%20Operators/Variant.htm) (`⍠`) setting to `⎕S`:

In [3]:
'a.'⎕S'&'⍠'OM'1⊢'abaac' 

This is both elegant and clear: variant `OM` is _Overlapping Matches_. See the [docs](http://help.dyalog.com/latest/index.htm#Language/System%20Functions/r.htm) for more details on the various options that can be enabled with _Variant_.