Skip to content
Permalink
Browse files
Add 11: "Regular Extremism"
  • Loading branch information
janlelis committed May 11, 2015
1 parent bb02ae4 commit 2ac97dd2eac279a8c4fc2355004e141cb15406a9
Showing with 106 additions and 0 deletions.
  1. +106 −0 source/posts/11-regular-extremism.html.md
@@ -0,0 +1,106 @@
---
title: Regular Extremism
date: 2015-05-11
tags: strings, rexex
---

You are here for a collection of 10 advanced features of regular expressions in Ruby!

ARTICLE

## Regex Conditionals

Regular expressions can have embedded conditionals (*if-then-else*) with `(?ref)then|else`. "ref" stands for a group reference (number or name of a capture group):

# will match everything if string contains "ä", or only match first two chars
regex = /(?=(.*ä))?(?(1).*|..)/

"Ruby"[regex] #=> "Ru"
"Idiosyncrätic"[regex] #=> "Idiosyncrätic"

## Keep Expressions

The possible ways to [look around](http://www.regular-expressions.info/lookaround.html) within a regex are:

Syntax | Description | Example
---------|---------------------|-------------------------------
`(?=X)` | Positive lookahead | `"Ruby"[/.(?=b)/] #=> "u"`
`(?!X)` | Negative lookahead | `"Ruby"[/.(?!u)/] #=> "u"`
`(?<=X)` | Positive lookbehind | `"Ruby"[/(?<=u)./] #=> "b"`
`(?!X)` | Negative lookbehind | `"Ruby"[/(?<!R|^)./] #=> "b"`

But Ruby also has an additional shortcut syntax to do *positive lookbehinds* via `\K`:

"Ruby"[/Ru\Kby/] #=> "by"
"Ruby"[/ru\Kby/] #=> nil

## Character Class Intersections

You can nest character classes and AND-connect them with `&&`. Matching all non-vowels here:

"Idiosyncratic".scan /[[a-z]&&[^aeiou]]+/
# => ["d", "syncr", "t", "c"]

## Regex Sub-Expressions

You can recursively apply regex groups again with `\g<ref>`. "ref" stands for a group reference (number or name of a capture group). This is different from back-references (`\1` .. `\9`), which will re-match the already matched string, instead of executing the regex again:

# match any number of sequences of 3 identical chars
regex = /((.)\2{2})\g<1>*/
"aaa"[regex] #=> "aaa"
"abc"[regex] #=> nil
"aaab"[regex] #=> "aaa"
"aaabbb"[regex] #=> "aaabbb"
"aaabbbc"[regex] #=> "aaabbb"
"aaabbbccc"[regex] #=> "aaabbbccc"

## Match Characters that Belong Together

`\X` treats combined characters as a single character. See [grapheme clusters](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) for more information.

string = "R\u{030A}uby"
string[/./] #=> "R"
string[/.../] #=> "R̊u"
string[/\X\X/] #=> "R̊u"

## Relative Back-References

Back-refs can be relatively referenced from the current position via `\k<-n>`:

"Ruby by"[/(R)(u)(by) \k<-1>/] #=> "Ruby by"


## Deactivate Backtracking

[Atomic groups](http://www.regular-expressions.info/atomic.html), defined via `(?>X)`, will always try to match the first of all alternatives:

"Rüby"[/R(u*|ü)by/] #=> "Rüby"
"Rüby"[/R(?>u*|ü)by/] #=> nil

## Turn On Unicode-Matching for `\w`, `\d`, `\s`, and `\b`

"Rüby"[/\w*/] #=> "R"
"Rüby"[/(?u)\w*/] #=> "Rüby"

## Continue Matching at Last Match Position

When using a method that matches a regex multiple times against a string (like `String#gsub` or `String#scan`), you can reference the position of the last match via `\G`:

"923823723".scan(/\G(.)23/) #=> [["9"], ["8"], ["7"]]

## `String#split` with Capture Groups

The normal way of using `String#split` is this:

"0-0".split(/-/) #=> ["0", "0"]

But if you want to make your code as hard to read as possible, remember that captured groups will be added to the resulting array:

"0-0".split(/(-)/) #=> ["0", "-", "0"]
"0-0".split(/-(?=(.))/) #=> ["0", "0", "0"]
"0-0".split(/(((-)))/) #=> ["0", "-", "-", "-", "0"]

## Resources

- [RDoc: Regexp](http://ruby-doc.org/core-2.2.2/Regexp.html)
- [Onigmo Documentation](https://github.com/k-takata/Onigmo/blob/master/doc/RE)

4 comments on commit 2ac97dd

@leriksen

This comment has been minimized.

Copy link

@leriksen leriksen replied May 12, 2015

line 30 - the negative lookbehind in the "syntax" column is incorrect - you've replicated the negative look-ahead syntax

(?!X) => (?<!X)

http://www.regular-expressions.info/lookaround.html

But other than tat, nice to see someone talking about these more advance patterns - maybe just caution the performance implecations - I believe lookbehinds can be orders of magnitude slower in some cases (read Friedl - I can recall the details)

@janlelis

This comment has been minimized.

Copy link
Owner Author

@janlelis janlelis replied May 12, 2015

Thanks for the feedback, I've fixed it in the article!

@lazylester

This comment has been minimized.

Copy link

@lazylester lazylester replied Jun 14, 2015

the negative lookahead example result is not confirmed by Rubular. Not sure what the correct syntax should be.

@janlelis

This comment has been minimized.

Copy link
Owner Author

@janlelis janlelis replied Jun 16, 2015

@lazylester I tried it out, and it highlights "u", "b" and "y", which is correct (all letters, not followed by a "u").

The above example uses String#[] for matching, so only the first match gets returned.

Please sign in to comment.