Add 11: "Regular Extremism"

janlelis committed May 11, 2015
1 parent bb02ae4 commit 2ac97dd2eac279a8c4fc2355004e141cb15406a9
Showing with 106 additions and 0 deletions.
  1. +106 −0 source/posts/11-regular-extremism.html.md
@@ -0,0 +1,106 @@
---
title: Regular Extremism
date: 2015-05-11
tags: strings, regex
---

You are here for a collection of 10 advanced features of regular expressions in Ruby!

ARTICLE

## Regex Conditionals

Regular expressions can have embedded conditionals (*if-then-else*) with `(?(ref)then|else)`. "ref" stands for a group reference (number or name of a capture group):

    # will match everything if string contains "ä", or only match first two chars
    regex = /(?=(.*ä))?(?(1).*|..)/
    "Ruby"[regex] #=> "Ru"
    "Idiosyncrätic"[regex] #=> "Idiosyncrätic"

## Keep Expressions

The possible ways to [look around](http://www.regular-expressions.info/lookaround.html) within a regex are:

Syntax   | Description         | Example
---------|---------------------|-------------------------------
`(?=X)`  | Positive lookahead  | `"Ruby"[/.(?=b)/] #=> "u"`
`(?!X)`  | Negative lookahead  | `"Ruby"[/.(?!u)/] #=> "u"`
`(?<=X)` | Positive lookbehind | `"Ruby"[/(?<=u)./] #=> "b"`
`(?<!X)` | Negative lookbehind | `"Ruby"[/(?<!R|^)./] #=> "b"`

But Ruby also has an additional shortcut syntax to do *positive lookbehinds* via `\K`. Everything to the left of `\K` must match, but is not included in the result:

    "Ruby"[/Ru\Kby/] #=> "by"
    "Ruby"[/ru\Kby/] #=> nil

## Character Class Intersections

You can nest character classes and AND-connect them with `&&`. Matching all lowercase non-vowels here:

    "Idiosyncratic".scan(/[[a-z]&&[^aeiou]]+/)
    # => ["d", "syncr", "t", "c"]
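
The same trick works with shorthand classes, which lets you subtract one class from another. A short sketch with made-up input strings:

```ruby
# Hex digits (\h) that are not decimal digits (\d), i.e. only the letters a-f
hex_letters = "c0ffee42".scan(/[\h&&[^\d]]+/)        #=> ["c", "ffee"]

# Word characters without underscores and digits
words = "snake_case_2".scan(/[\w&&[^_\d]]+/)         #=> ["snake", "case"]
```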

## Regex Sub-Expressions

You can recursively apply regex groups again with `\g<ref>`. "ref" stands for a group reference (number or name of a capture group). This is different from back-references (`\1` .. `\9`), which will re-match the already matched string, instead of executing the regex again:

    # match any number of sequences of 3 identical chars
    regex = /((.)\2{2})\g<1>*/
    "aaa"[regex] #=> "aaa"
    "abc"[regex] #=> nil
    "aaab"[regex] #=> "aaa"
    "aaabbb"[regex] #=> "aaabbb"
    "aaabbbc"[regex] #=> "aaabbb"
    "aaabbbccc"[regex] #=> "aaabbbccc"

## Match Characters that Belong Together

`\X` treats combined characters as a single character. See [grapheme clusters](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) for more information.

    string = "R\u{030A}uby"
    string[/./] #=> "R"
    string[/.../] #=> "R̊u"
    string[/\X\X/] #=> "R̊u"
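
One practical consequence, sketched with the same string as above: scanning for `\X` lets you count or iterate over what a reader would perceive as characters, instead of code points:

```ruby
string = "R\u{030A}uby"

code_points = string.length      #=> 5
clusters    = string.scan(/\X/)  # one element per perceived character
clusters.length                  #=> 4
clusters.first == "R\u{030A}"    #=> true
```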

## Relative Back-References

Back-refs can be relatively referenced from the current position via `\k<-n>`, which refers to the n-th capture group counting backwards from where the reference appears:

    "Ruby by"[/(R)(u)(by) \k<-1>/] #=> "Ruby by"
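
## Deactivate Backtracking

[Atomic groups](http://www.regular-expressions.info/atomic.html), defined via `(?>X)`, never give back what they have matched: once the group has matched, the regex engine will not backtrack into it to try another alternative or a different number of repetitions. In the second example, `u*` matches the empty string, so the `ü` alternative is never tried:

    "Rüby"[/R(u*|ü)by/] #=> "Rüby"
    "Rüby"[/R(?>u*|ü)by/] #=> nil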
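
## Turn On Unicode-Matching for `\w`, `\d`, `\s`, and `\b`

By default, these shorthands only consider ASCII characters. The `(?u)` option makes them Unicode-aware:

    "Rüby"[/\w*/] #=> "R"
    "Rüby"[/(?u)\w*/] #=> "Rüby"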

## Continue Matching at Last Match Position

When using a method that matches a regex multiple times against a string (like `String#gsub` or `String#scan`), `\G` anchors the match to the position where the previous match ended (or to the start of the string for the first match):

    "923823723".scan(/\G(.)23/) #=> [["9"], ["8"], ["7"]]

## `String#split` with Capture Groups

The normal way of using `String#split` is this:

    "0-0".split(/-/) #=> ["0", "0"]

But if you want to make your code as hard to read as possible, remember that captured groups will be added to the resulting array:

    "0-0".split(/(-)/) #=> ["0", "-", "0"]
    "0-0".split(/-(?=(.))/) #=> ["0", "0", "0"]
    "0-0".split(/(((-)))/) #=> ["0", "-", "-", "-", "0"]

## Resources

- [RDoc: Regexp](http://ruby-doc.org/core-2.2.2/Regexp.html)
- [Onigmo Documentation](https://github.com/k-takata/Onigmo/blob/master/doc/RE)

4 comments on commit 2ac97dd

leriksen replied May 12, 2015

line 30 - the negative lookbehind in the "syntax" column is incorrect - you've replicated the negative look-ahead syntax

(?!X) => (?<!X)

http://www.regular-expressions.info/lookaround.html

But other than that, nice to see someone talking about these more advanced patterns - maybe just caution the performance implications - I believe lookbehinds can be orders of magnitude slower in some cases (read Friedl - I can recall the details)

janlelis replied May 12, 2015

Thanks for the feedback, I've fixed it in the article!

lazylester replied Jun 14, 2015

the negative lookahead example result is not confirmed by Rubular. Not sure what the correct syntax should be.

janlelis replied Jun 16, 2015

@lazylester I tried it out, and it highlights "u", "b" and "y", which is correct (all letters, not followed by a "u").

The above example uses String#[] for matching, so only the first match gets returned.