Add 11: "Regular Extremism"
Showing 1 changed file with 106 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
--- | ||
title: Regular Extremism | ||
date: 2015-05-11 | ||
tags: strings, regex | ||
--- | ||
|
||
You are here for a collection of 10 advanced features of regular expressions in Ruby! | ||
|
||
ARTICLE | ||
|
||
## Regex Conditionals | ||
|
||
Regular expressions can have embedded conditionals (*if-then-else*) with `(?(ref)then|else)`. "ref" stands for a group reference (number or name of a capture group); the "then" branch is used if that group took part in the match, the "else" branch otherwise: | ||
|
||
# matches the whole string if it contains "ä", otherwise only the first two chars | ||
regex = /(?=(.*ä))?(?(1).*|..)/ | ||
|
||
"Ruby"[regex] #=> "Ru" | ||
"Idiosyncrätic"[regex] #=> "Idiosyncrätic" | ||
|
||
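The else branch is optional. As a minimal sketch (the constant name `NUMBER` and the sample strings are made up for illustration), this pattern only demands a closing parenthesis if an opening one was captured:

```ruby
# Hypothetical example: a number that may be wrapped in parentheses.
# Group 1 captures an optional "("; the conditional (?(1)\)) requires ")"
# only if group 1 took part in the match.
NUMBER = /\A(\()?\d+(?(1)\))\z/

"42"[NUMBER]    #=> "42"
"(42)"[NUMBER]  #=> "(42)"
"(42"[NUMBER]   #=> nil
"42)"[NUMBER]   #=> nil
```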
## Keep Expressions | ||
|
||
The possible ways to [look around](http://www.regular-expressions.info/lookaround.html) within a regex are: | ||
|
||
Syntax | Description | Example | ||
---------|---------------------|------------------------------- | ||
`(?=X)` | Positive lookahead | `"Ruby"[/.(?=b)/] #=> "u"` | ||
`(?!X)` | Negative lookahead | `"Ruby"[/.(?!u)/] #=> "u"` | ||
`(?<=X)` | Positive lookbehind | `"Ruby"[/(?<=u)./] #=> "b"` | ||
`(?<!X)` | Negative lookbehind | `"Ruby"[/(?<!R|^)./] #=> "b"` | ||
|
||
But Ruby also has an additional shortcut syntax to do *positive lookbehinds* via `\K`, which drops everything matched before it from the result: | ||
|
||
"Ruby"[/Ru\Kby/] #=> "by" | ||
"Ruby"[/ru\Kby/] #=> nil | ||
|
||
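One situation where `\K` can be handy is a prefix without a fixed length, since lookbehinds in Ruby are restricted in what they may contain (unbounded quantifiers like `\s*` are rejected). A small sketch, using a made-up config line as sample data:

```ruby
# Hypothetical sample data, for illustration only
line = "timeout =   30"

# The prefix contains \s*, which has no fixed length, so \K is used instead
# of a lookbehind
line[/\w+\s*=\s*\K\d+/]            #=> "30"

# Only the part after \K counts as the match, so only that part is replaced
line.sub(/\w+\s*=\s*\K\d+/, "60")  #=> "timeout =   60"
```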
## Character Class Intersections | ||
|
||
You can nest character classes and AND-connect them with `&&`. Matching runs of lowercase consonants (letters in `a-z` that are not vowels) here: | ||
|
||
"Idiosyncratic".scan /[[a-z]&&[^aeiou]]+/ | ||
# => ["d", "syncr", "t", "c"] | ||
|
||
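Two more intersections in the same spirit, as a quick sketch with arbitrary sample strings:

```ruby
# Lowercase letters that are not vowels, one match per character
"regexp".scan(/[a-z&&[^aeiou]]/)        #=> ["r", "g", "x", "p"]

# Hex digits that are not decimal digits, which leaves only a-f and A-F
"c0ffee".scan(/[[:xdigit:]&&[^0-9]]/)   #=> ["c", "f", "f", "e", "e"]
```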
## Regex Sub-Expressions | ||
|
||
You can recursively apply regex groups again with `\g<ref>`. "ref" stands for a group reference (number or name of a capture group). This is different from back-references (`\1` .. `\9`), which will re-match the already matched string, instead of executing the regex again: | ||
|
||
# match any number of sequences of 3 identical chars | ||
regex = /((.)\2{2})\g<1>*/ | ||
"aaa"[regex] #=> "aaa" | ||
"abc"[regex] #=> nil | ||
"aaab"[regex] #=> "aaa" | ||
"aaabbb"[regex] #=> "aaabbb" | ||
"aaabbbc"[regex] #=> "aaabbb" | ||
"aaabbbccc"[regex] #=> "aaabbbccc" | ||
|
||
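The difference to back-references is easiest to see side by side. The sketch below also lets a group call itself to match balanced parentheses; the name `BALANCED` and the group name `paren` are invented for this example:

```ruby
# A back-reference must repeat the captured text,
# a group call runs the group's pattern again
"1-1"[/(\d)-\1/]     #=> "1-1"
"1-2"[/(\d)-\1/]     #=> nil
"1-2"[/(\d)-\g<1>/]  #=> "1-2"

# A group may also call itself (hypothetical example):
# "(" followed by non-parens or nested groups, followed by ")"
BALANCED = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

"(a(b)c)"[BALANCED]  #=> "(a(b)c)"
"(a(b c"[BALANCED]   #=> nil
```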
## Match Characters that Belong Together | ||
|
||
`\X` treats combined characters as a single character. See [grapheme clusters](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) for more information. | ||
|
||
string = "R\u{030A}uby" | ||
string[/./] #=> "R" | ||
string[/.../] #=> "R̊u" | ||
string[/\X\X/] #=> "R̊u" | ||
|
||
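To make the difference between code points and grapheme clusters visible, here is a short sketch that counts both for the same string as above:

```ruby
string = "R\u{030A}uby"

# Code points: the combining ring is counted separately
string.chars.length              #=> 5

# Grapheme clusters: "R" and the ring form a single unit
string.scan(/\X/).length         #=> 4
string.scan(/\X/).map(&:length)  #=> [2, 1, 1, 1]
```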
## Relative Back-References | ||
|
||
Back-references can also be relative: `\k<-n>` refers to the n-th capture group counting backwards from the current position: | ||
|
||
"Ruby by"[/(R)(u)(by) \k<-1>/] #=> "Ruby by" | ||
|
||
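A few more cases, sketched with throwaway strings, to show which group each relative reference resolves to:

```ruby
"abb"[/(a)(b)\k<-1>/]  #=> "abb"  (\k<-1> is group 2 here)
"aba"[/(a)(b)\k<-2>/]  #=> "aba"  (\k<-2> is group 1 here)
"aba"[/(a)(b)\k<-1>/]  #=> nil    (group 2 captured "b", not "a")
```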
|
||
## Deactivate Backtracking | ||
|
||
[Atomic groups](http://www.regular-expressions.info/atomic.html), defined via `(?>X)`, discard all backtracking positions inside the group once it has matched. Below, `u*` succeeds with an empty match, so the `ü` alternative is never tried: | ||
|
||
"Rüby"[/R(u*|ü)by/] #=> "Rüby" | ||
"Rüby"[/R(?>u*|ü)by/] #=> nil | ||
|
||
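The same effect shows up with greedy quantifiers. In this sketch (sample string chosen arbitrarily), the atomic group refuses to hand back a character it has already consumed:

```ruby
# Normal group: a+ first takes "aaa", then gives one "a" back so "ab" can match
"aaab"[/a+ab/]      #=> "aaab"

# Atomic group: a+ takes "aaa" and keeps it, so the following "a" never matches
"aaab"[/(?>a+)ab/]  #=> nil

# Without the need to give anything back, the atomic group matches fine
"aaab"[/(?>a+)b/]   #=> "aaab"
```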
## Turn On Unicode-Matching for `\w`, `\d`, `\s`, and `\b` | ||
|
||
"Rüby"[/\w*/] #=> "R" | ||
"Rüby"[/(?u)\w*/] #=> "Rüby" | ||
|
||
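As the example above shows, `\w` only matches ASCII word characters unless `(?u)` is set. A slightly longer sketch with `String#scan` (the sample sentence is arbitrary):

```ruby
text = "Rüby ist schön"

# ASCII-only \w: the umlauts split the words apart
text.scan(/\w+/)      #=> ["R", "by", "ist", "sch", "n"]

# With (?u), \w also covers non-ASCII letters
text.scan(/(?u)\w+/)  #=> ["Rüby", "ist", "schön"]
```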
## Continue Matching at Last Match Position | ||
|
||
When using a method that matches a regex multiple times against a string (like `String#gsub` or `String#scan`), you can anchor each match to the position where the previous match ended via `\G`: | ||
|
||
"923823723".scan(/\G(.)23/) #=> [["9"], ["8"], ["7"]] | ||
|
||
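The effect of `\G` only becomes visible once something interrupts the chain of matches. A sketch with a stray "x" added to the sample string:

```ruby
# Without \G, scan skips over the "x" and finds the third match
"923823x723".scan(/(.)23/)    #=> [["9"], ["8"], ["7"]]

# With \G, each match has to start exactly where the previous one ended,
# so scanning stops at the "x"
"923823x723".scan(/\G(.)23/)  #=> [["9"], ["8"]]
```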
## `String#split` with Capture Groups | ||
|
||
The normal way of using `String#split` is this: | ||
|
||
"0-0".split(/-/) #=> ["0", "0"] | ||
|
||
But if you want to make your code as hard to read as possible, remember that captured groups will be added to the resulting array: | ||
|
||
"0-0".split(/(-)/) #=> ["0", "-", "0"] | ||
"0-0".split(/-(?=(.))/) #=> ["0", "0", "0"] | ||
"0-0".split(/(((-)))/) #=> ["0", "-", "-", "-", "0"] | ||
|
||
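A case where this is arguably useful is splitting while keeping the delimiters, sketched here with a tiny arithmetic expression as made-up input:

```ruby
# No capture group: the operators are dropped
"1+2-3".split(/[+-]/)      #=> ["1", "2", "3"]

# Capture group: the operators are kept as separate elements
"1+2-3".split(/([+-])/)    #=> ["1", "+", "2", "-", "3"]

# A non-capturing group (?:...) behaves like no group at all
"1+2-3".split(/(?:[+-])/)  #=> ["1", "2", "3"]
```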
## Resources | ||
|
||
- [RDoc: Regexp](http://ruby-doc.org/core-2.2.2/Regexp.html) | ||
- [Onigmo Documentation](https://github.com/k-takata/Onigmo/blob/master/doc/RE) |
2ac97dd
line 30 - the negative lookbehind in the "syntax" column is incorrect - you've replicated the negative look-ahead syntax
(?!X) => (?<!X)
http://www.regular-expressions.info/lookaround.html
But other than that, nice to see someone talking about these more advanced patterns - maybe just caution about the performance implications - I believe lookbehinds can be orders of magnitude slower in some cases (read Friedl - I can't recall the details)
2ac97dd
Thanks for the feedback, I've fixed it in the article!
2ac97dd
The negative lookahead example result is not confirmed by Rubular. Not sure what the correct syntax should be.
2ac97dd
@lazylester I tried it out, and it highlights "u", "b" and "y", which is correct (all letters, not followed by a "u").
The above example uses `String#[]` for matching, so only the first match gets returned.
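To see both behaviours next to each other, here is a quick sketch of the negative lookahead example with `String#[]` and with `String#scan`:

```ruby
# String#[] returns only the first match
"Ruby"[/.(?!u)/]      #=> "u"

# String#scan returns every match: all characters not followed by a "u"
"Ruby".scan(/.(?!u)/) #=> ["u", "b", "y"]
```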