Add 11: "Regular Extremism"
janlelis committed May 11, 2015
1 parent bb02ae4 commit 2ac97dd
Showing 1 changed file with 106 additions and 0 deletions.
106 changes: 106 additions & 0 deletions source/posts/11-regular-extremism.html.md
@@ -0,0 +1,106 @@
---
title: Regular Extremism
date: 2015-05-11
tags: strings, regex
---

You are here for a collection of 10 advanced features of regular expressions in Ruby!

ARTICLE

## Regex Conditionals

Regular expressions can have embedded conditionals (*if-then-else*) with `(?(ref)then|else)`. "ref" stands for a group reference (number or name of a capture group). If the referenced group has matched, the *then* branch is used, otherwise the *else* branch:

    # will match everything if string contains "ä", or only match first two chars
    regex = /(?=(.*ä))?(?(1).*|..)/

    "Ruby"[regex] #=> "Ru"
    "Idiosyncrätic"[regex] #=> "Idiosyncrätic"

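A common use of conditionals is requiring a closing delimiter only if the opening one was present. The following is a minimal sketch; the `balanced` pattern is a made-up illustration, not part of the original article:

```ruby
# Hypothetical pattern: digits, optionally wrapped in one pair of parentheses.
# Group 1 captures the opening "(", and the conditional (?(1)\)) demands
# a closing ")" only if group 1 took part in the match.
balanced = /\A(\()?\d+(?(1)\))\z/

"42"[balanced]   #=> "42"
"(42)"[balanced] #=> "(42)"
"(42"[balanced]  #=> nil
"42)"[balanced]  #=> nil
```
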
## Keep Expressions

The possible ways to [look around](http://www.regular-expressions.info/lookaround.html) within a regex are:

Syntax | Description | Example
---------|---------------------|-------------------------------
`(?=X)` | Positive lookahead | `"Ruby"[/.(?=b)/] #=> "u"`
`(?!X)` | Negative lookahead | `"Ruby"[/.(?!u)/] #=> "u"`
`(?<=X)` | Positive lookbehind | `"Ruby"[/(?<=u)./] #=> "b"`
`(?<!X)` | Negative lookbehind | `"Ruby"[/(?<!R|^)./] #=> "b"`

But Ruby also has an additional shortcut syntax to do *positive lookbehinds* via `\K`. Everything in front of `\K` has to match, but is dropped from the result:

    "Ruby"[/Ru\Kby/] #=> "by"
    "Ruby"[/ru\Kby/] #=> nil

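Because `\K` only resets the reported start of the match, it also works nicely together with replacements, and the part in front of it can have variable length. A small sketch with made-up example strings:

```ruby
# Replace only the number; the text before \K must match but is left untouched.
updated = "price: 100".sub(/price: \K\d+/, "200")
updated #=> "price: 200"

# The prefix before \K may contain quantifiers like "+".
digits = "fo=1, fooo=2".scan(/fo+=\K\d/)
digits #=> ["1", "2"]
```
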
## Character Class Intersections

You can nest character classes and intersect them (logical AND) with `&&`. Matching all runs of lowercase non-vowels here:

    "Idiosyncratic".scan(/[[a-z]&&[^aeiou]]+/)
    # => ["d", "syncr", "t", "c"]
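
The same idea works for any pair of classes, including shorthand classes such as `\w`. This is just an illustrative sketch; the variable names are invented for the example:

```ruby
# Digits, but with 4 to 6 carved out of the range.
not_mid = /[0-9&&[^4-6]]+/
runs = "0123456789".scan(not_mid)
runs #=> ["0123", "789"]

# Word characters that are neither digits nor underscores.
letters = "a1_b2".scan(/[\w&&[^\d_]]/)
letters #=> ["a", "b"]
```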

## Regex Sub-Expressions

You can apply a group's pattern again with `\g<ref>`, which also makes recursive patterns possible. "ref" stands for a group reference (number or name of a capture group). This is different from back-references (`\1` .. `\9`), which re-match the exact string the group has already captured, instead of executing its pattern again:

    # match any number of sequences of 3 identical chars
    regex = /((.)\2{2})\g<1>*/
    "aaa"[regex] #=> "aaa"
    "abc"[regex] #=> nil
    "aaab"[regex] #=> "aaa"
    "aaabbb"[regex] #=> "aaabbb"
    "aaabbbc"[regex] #=> "aaabbb"
    "aaabbbccc"[regex] #=> "aaabbbccc"

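The difference to back-references might be easier to see with a named group. In this sketch (the `octet` group name and both patterns are invented for illustration), `\g<octet>` runs the `\d{1,3}` pattern again, while `\k<octet>` insists on the very same text:

```ruby
# Re-execute the pattern of the "octet" group three more times.
by_call = /(?<octet>\d{1,3})\.\g<octet>\.\g<octet>\.\g<octet>/

# Re-match the text captured by "octet" three more times.
by_backref = /(?<octet>\d{1,3})\.\k<octet>\.\k<octet>\.\k<octet>/

"192.168.0.1"[by_call]    #=> "192.168.0.1"
"192.168.0.1"[by_backref] #=> nil
"7.7.7.7"[by_backref]     #=> "7.7.7.7"
```
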
## Match Characters that Belong Together

`\X` treats combined characters as a single character. See [grapheme clusters](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) for more information.

    string = "R\u{030A}uby"
    string[/./] #=> "R"
    string[/.../] #=> "R̊u"
    string[/\X\X/] #=> "R̊u"
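
One practical use is counting what a reader would perceive as characters. A minimal sketch, reusing the string from above (the variable names are only for illustration):

```ruby
string = "R\u{030A}uby"

# String#length counts codepoints: "R", the combining ring, "u", "b", "y".
codepoints = string.length
codepoints #=> 5

# Scanning with \X yields one element per grapheme cluster.
clusters = string.scan(/\X/)
clusters.length #=> 4
```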

## Relative Back-References

Back-references can also be relative: `\k<-n>` refers to the n-th capture group, counted backwards from the current position:

    "Ruby by"[/(R)(u)(by) \k<-1>/] #=> "Ruby by"

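Relative references can come in handy when a regex fragment gets embedded into a larger one, because the absolute group numbers shift, but the relative distance does not. A sketch under that assumption (`doubled` and `combined` are hypothetical names):

```ruby
# Fragment: a word followed by the same word again.
doubled = /(\w+) \k<-1>/

# Embedded after another group, (\w+) becomes group 2,
# but \k<-1> still points at it.
combined = /(\d+): #{doubled}/

"by by"[doubled]     #=> "by by"
"7: by by"[combined] #=> "7: by by"
"7: by 7"[combined]  #=> nil
```
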

## Deactivate Backtracking

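[Atomic groups](http://www.regular-expressions.info/atomic.html), defined via `(?>X)`, disable backtracking into the group: once `X` has matched, the regex engine keeps that match and will not go back to try another alternative, even if the rest of the pattern fails:

    "Rüby"[/R(u*|ü)by/] #=> "Rüby"
    "Rüby"[/R(?>u*|ü)by/] #=> nil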

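The same effect shows up with a greedy quantifier inside the atomic group. In this small sketch, `a+` grabs all three "a" characters and is not allowed to give one back:

```ruby
# Normal group: a+ backtracks from "aaa" to "aa", so the literal "ab" can match.
greedy = "aaab"[/a+ab/]
greedy #=> "aaab"

# Atomic group: a+ keeps "aaa", so the following "a" can never match.
atomic = "aaab"[/(?>a+)ab/]
atomic #=> nil
```
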
## Turn On Unicode-Matching for `\w`, `\d`, `\s`, and `\b`

By default, `\w` only matches ASCII word characters. The `(?u)` option turns on Unicode matching:

    "Rüby"[/\w*/] #=> "R"
    "Rüby"[/(?u)\w*/] #=> "Rüby"

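If you would rather not switch options inside the regex, Unicode properties and POSIX brackets are alternatives that are Unicode-aware on their own. This is a sketch of that alternative approach, not something the `(?u)` option depends on:

```ruby
word = "R\u00FCby" # "Rüby"

# \w alone stops at the first non-ASCII character.
ascii_only = word[/\w*/]
ascii_only #=> "R"

# The Unicode property \p{Word} and the POSIX bracket [[:word:]] include "ü".
by_property = word[/\p{Word}*/]
by_bracket  = word[/[[:word:]]*/]
```

Both `by_property` and `by_bracket` contain the whole string.
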
## Continue Matching at Last Match Position

When using a method that matches a regex multiple times against a string (like `String#gsub` or `String#scan`), `\G` anchors the match at the position where the previous match ended (or at the start of the string for the first match):

    "923823723".scan(/\G(.)23/) #=> [["9"], ["8"], ["7"]]

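The effect becomes visible once something interrupts the chain of matches. In this sketch, an "x" is inserted into the example string (the input is made up for illustration):

```ruby
input = "923823x723"

# Without \G, scan skips over the "x" and finds the last triple as well.
unanchored = input.scan(/(.)23/)
unanchored #=> [["9"], ["8"], ["7"]]

# With \G, every match has to start exactly where the previous one ended,
# so scanning stops at the "x".
anchored = input.scan(/\G(.)23/)
anchored #=> [["9"], ["8"]]
```
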
## `String#split` with Capture Groups

The normal way of using `String#split` is this:

    "0-0".split(/-/) #=> ["0", "0"]

But if you want to make your code as hard to read as possible, remember that captured groups will be added to the resulting array:

    "0-0".split(/(-)/) #=> ["0", "-", "0"]
    "0-0".split(/-(?=(.))/) #=> ["0", "0", "0"]
    "0-0".split(/(((-)))/) #=> ["0", "-", "-", "-", "0"]

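There is also a less obscure use: keeping the delimiters while tokenizing. And if you need grouping without this side effect, a non-capturing group `(?:...)` avoids it. A short sketch with a made-up input:

```ruby
expression = "1+2-3"

# The capture group keeps the operators in the result.
tokens = expression.split(/([+-])/)
tokens #=> ["1", "+", "2", "-", "3"]

# A non-capturing group groups without adding anything to the result.
numbers = expression.split(/(?:\+|-)/)
numbers #=> ["1", "2", "3"]
```
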
## Resources

- [RDoc: Regexp](http://ruby-doc.org/core-2.2.2/Regexp.html)
- [Onigmo Documentation](https://github.com/k-takata/Onigmo/blob/master/doc/RE)

4 comments on commit 2ac97dd

@leriksen

line 30 - the negative lookbehind in the "syntax" column is incorrect - you've replicated the negative look-ahead syntax

(?!X) => (?<!X)

http://www.regular-expressions.info/lookaround.html

But other than that, nice to see someone talking about these more advanced patterns - maybe just caution about the performance implications - I believe lookbehinds can be orders of magnitude slower in some cases (read Friedl - I can recall the details)

@janlelis
Owner Author


Thanks for the feedback, I've fixed it in the article!

@lazylester

the negative lookahead example result is not confirmed by Rubular. Not sure what the correct syntax should be.

@janlelis
Owner Author


@lazylester I tried it out, and it highlights "u", "b" and "y", which is correct (all letters, not followed by a "u").

The above example uses String#[] for matching, so only the first match gets returned.
