# Regular expressions

Regular expressions (RegExes) are patterns used to match strings.

RegExes are useful wherever text needs to be searched; in bioinformatics, for example, we can use them to:

* Match each line that begins with an A and ends with a T
* Check whether a DNA sequence contains non-standard characters
* Check whether a sequence is DNA or protein
* Find any sequence headers that contain the string “Bacillus”
* Find all lines that contain the string “recA”


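__Matches in Perl are absolute:__ a pattern either matches or it doesn't; there are no partial matches.

The left-most substring that satisfies the pattern is always matched, and quantifiers such as `*` and `+` are greedy by default, so they take as many characters as possible at that position.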
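
As a small sketch of the left-most rule (the sequence and the helper name `first_match` are made up for illustration; `$&`, listed under the automatic match variables, holds the text that matched):

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by /G+A/, or undef if there is no match
sub first_match {
    local $_ = shift;
    return /G+A/ ? $& : undef;
}

# "GGA" (starting at position 1) and "GGGA" (starting at position 6) would both
# satisfy the pattern; the left-most candidate wins, even though a longer one
# exists further to the right
print first_match("AGGATAGGGATT"), "\n";   # prints GGA
```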


__The easiest RegEx is one that literally matches a string, e.g.:__

```perl
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

$_ = "AGGATAGGATATTA";

if (/GGA/) {
    print "It matched!\n";
}
```

```
It matched!
```


* The special variable `$_` holds the string that the match is performed on

* The forward slashes `//` act as the match operator and enclose the RegEx that is applied to `$_`. With the explicit form `m//`, other delimiters can be used if that helps readability, e.g. `m%GGA%` or `m{GGA}`

Whitespace matters in RegExes:

`/GGA/` ≠ `/G GA/`

Capitalization also matters:

`/GGA/` ≠ `/gga/`

The match operator `//` is similar to double quotes `""` in that special backslash escape characters such as newline `\n`, as well as variables, are __interpolated__.

|Special backslash escape character|Matches|
|----------------------------------|-------|
|`\w`|any single character classified as a "word" character (alphanumeric or `_`)|
|`\W`|any non-word character|
|`\s`|any whitespace character (space, tab, newline)|
|`\S`|any non-whitespace character|
|`\d`|any digit character, equivalent to `[0-9]`|
|`\D`|any non-digit character|
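
To see these escapes in action, here is a minimal sketch. The header line and the variable `$genus` are invented for this example; the last pattern also shows a variable being interpolated into the RegEx.

```perl
use strict;
use warnings;

# Hypothetical header line, made up for this example
$_ = "seq_42 Bacillus subtilis";
my $genus = "Bacillus";

my @hits;
push @hits, '\w\s\w'   if /\w\s\w/;     # word character, whitespace, word character ("2 B")
push @hits, '\d\d'     if /\d\d/;       # two digits in a row ("42")
push @hits, '\d\d\d'   if /\d\d\d/;     # three digits in a row: not present
push @hits, '$genus\s' if /$genus\s/;   # $genus is interpolated, so this is /Bacillus\s/

foreach my $pattern (@hits) {
    print "/$pattern/ matched\n";
}
```

Only the three-digit pattern fails to match, so it is the only one missing from the output.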

__An empty pattern matches any string.__ In Perl, however, a literal `//` reuses the last successfully matched pattern if there is one.

## A matching test program
Write the following lines of code in your text editor and save them as `check_match.pl`:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

while (<STDIN>) {
  chomp;
  if (/YOUR_PATTERN_HERE/) {
    print "Matched!\n";
  } else {
    print "No match :(\n";
  }
}
```

## Metacharacters

Metacharacters are used to build more sophisticated and useful RegExes. To match any of them literally, escape it with a backslash `\`, e.g. `/\./` matches a literal dot.

|Metacharacter|Matches|
|-------------|-------|
|`^`|beginning of string|
|`$`|end of string|
|`.`|any character except newline|
|`*`|match zero or more times|
|`+`|match one or more times|
|`?`|match zero or one times; placed after another quantifier (e.g. `*?`), it makes that quantifier match as few times as possible (= shortest match)|
|<code>&#124;</code> |alternative|
|`()`|grouping; storing|
|`[]`|set of characters|
|`{}`|repetition modifier|
|`\`|quote or special|

### Example use of metacharacters



```
/^Hello/	matches "Hello, World!" but not "World, I say Hello"

/Hello$/	matches "World, I say Hello" but not "Hello, World!"

/H.llo/		matches "Hello", "Hallo", "H3llo", and "H llo"

/Hel+o/		matches "Helo", "Hello", and "Hellllo"

/Hel?o/		matches "Helo" and "Heo", but not "Hello"

/He|allo/	matches any string containing "He" or "allo", e.g. "Hello" and "Hallo", but also "Help"
```
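
These claims are easy to check by running a few strings through the patterns. The following is a minimal sketch; the hash name `%result` is just a choice made for this example.

```perl
use strict;
use warnings;

my %result;   # string => patterns that matched it

foreach ("Hello", "Helo", "Heo", "Hallo") {
    my @matched;
    push @matched, '/H.llo/' if /H.llo/;   # any one character between H and llo
    push @matched, '/Hel+o/' if /Hel+o/;   # one or more l
    push @matched, '/Hel?o/' if /Hel?o/;   # zero or one l
    $result{$_} = "@matched";
    print "$_: @matched\n";
}
```

This prints:

```
Hello: /H.llo/ /Hel+o/
Helo: /Hel+o/ /Hel?o/
Heo: /Hel?o/
Hallo: /H.llo/
```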

## Repetition operators

These are used in RegExes when the goal is to match the same character, or set of characters, a certain number of times:

|Repetition Operator|matches|
|-------------------|-------|
|a{m}|exactly m times|
|a{m,}| at least m times|
|a{m,n}|at least m times, up to n times|

## Grouping

Grouping with parentheses `()` defines exactly what a quantifier or alternative applies to:

`/Hello+/` repeats only the final `o`, so it also matches `"Hellooooooo"`

`/(Hello)+/` repeats the whole group, so it matches `"HelloHelloHello"`

Parentheses also capture: Perl automatically saves the text matched by a capture group in a special variable called `$1`.
If more than one capture group is used, the matches are saved in `$1, $2, $3`, etc., numbered by the position of each opening parenthesis in the RegEx pattern.
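
As an illustration, the sketch below pulls three fields out of a sequence header. The header line and the variable names are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical FASTA-style header, made up for this example
$_ = ">seq1 Bacillus subtilis recA";

my ($id, $genus, $species);
if (/^>(\w+)\s(\w+)\s(\w+)/) {
    # $1, $2 and $3 hold what the first, second and third group matched
    ($id, $genus, $species) = ($1, $2, $3);
    print "ID: $id, genus: $genus, species: $species\n";
}
```

This prints `ID: seq1, genus: Bacillus, species: subtilis`. Copying `$1`, `$2`, `$3` into named variables right away is a common habit, since the next successful match overwrites them.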

## Back referencing
Capture groups can be used to match the text captured by a group again, by back referencing with `\1`, `\2`, `\3`, etc.:

`/(.)\1/`
This matches "Hello" and "deep sea" (= any character that occurs twice in a row)


The back reference does not have to follow immediately:

`/(ll)ow\s.{1,3}\1/`
This matches "Yellow Mellow" and "fellow swallow"
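
A minimal sketch that tries both kinds of back reference on a few strings. The labels and the hash name `%result` are invented for this example, and the second pattern allows up to three characters (`.{1,3}`) between the whitespace and the repeated "ll", which is what "swallow" needs.

```perl
use strict;
use warnings;

my %result;   # string => description of the patterns that matched it

foreach ("Hello", "deep sea", "World", "Yellow Mellow", "fellow swallow") {
    my @matched;
    push @matched, 'doubled character' if /(.)\1/;              # same character twice in a row
    push @matched, 'll...ll'           if /(ll)ow\s.{1,3}\1/;   # "ll" captured, then required again
    $result{$_} = join(", ", @matched);
    print "$_: ", (@matched ? $result{$_} : "no match"), "\n";
}
```

This prints:

```
Hello: doubled character
deep sea: doubled character
World: no match
Yellow Mellow: doubled character, ll...ll
fellow swallow: doubled character, ll...ll
```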


## Character classes

Character classes are used in RegExes to match one character out of a set, e.g. characters of a certain kind (digits only, etc.):

|Character class|Matches|
|---------------|-------|
|`[characters]`|any one of the characters given in brackets|
|`[\-]`|hyphen character `-`|
|`[\n]`|newline character `\n`|
|`[^something]`|any one character that is not listed after the `^`|
|`[a-zA-Z]`|any uppercase or lowercase letter|
|`[0-9]`|any digit from 0 to 9|
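
A negated class gives one possible way to check a DNA sequence for non-standard characters. This is only a sketch: the helper name `has_nonstandard` and the test sequences are made up, and "standard" is assumed to mean uppercase A, C, G and T only.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the sequence contains anything other than A, C, G or T
sub has_nonstandard {
    local $_ = shift;
    return /[^ACGT]/ ? 1 : 0;
}

foreach my $seq ("ATGCGT", "ATGNNT", "atgcgt") {
    print "$seq: ",
        (has_nonstandard($seq) ? "non-standard characters found" : "only A, C, G, T"),
        "\n";
}
```

Because capitalization matters, the lowercase sequence is reported as non-standard as well.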

## Match modifiers

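Modifiers are appended after the closing delimiter and control the matching behaviour:

* Case-insensitive matching: `/i`
* `.` also matches the newline character: `/s`
* Ignore whitespace in the pattern, so that it can be laid out more readably: `/x`

Match modifiers can be combined, e.g. `/isx`

Two related anchors are written inside the pattern rather than after it:

* Match at the beginning of the string: `\A` (same as `^` unless the `/m` modifier is in effect)
* Match at the end of the string, or just before a final newline: `\Z` (same as `$` unless the `/m` modifier is in effect)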
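
The effect of the `/i`, `/s` and `/x` modifiers can be sketched as follows; the two-line test string and the array name `@notes` are invented for this example.

```perl
use strict;
use warnings;

$_ = "atg\nGGA";   # two lines, separated by a newline

my @notes;
push @notes, '/ATG/ matched'      if /ATG/;        # no: capitalization matters
push @notes, '/ATG/i matched'     if /ATG/i;       # yes: case is ignored
push @notes, '/atg.GGA/ matched'  if /atg.GGA/;    # no: . does not match the newline
push @notes, '/atg.GGA/s matched' if /atg.GGA/s;   # yes: . may now match the newline
push @notes, '/ a t g /x matched' if / a t g /x;   # yes: whitespace in the pattern is ignored

foreach my $note (@notes) {
    print "$note\n";
}
```

Only the three patterns that carry a modifier produce a line of output.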


## Binding operator
So far, we have matched against the string contained in Perl’s special variable `$_`.

However, the binding operator `=~` lets us apply the pattern on its right to any string on its left:


`$string =~ /pattern/`


For example:

```perl
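# ^ and $ anchor the class to the whole string; without them, a single
# A, C, G or T anywhere in the string would be enough to match
if ($string =~ /^[ACGT]+$/) {
	print "String is a nucleotide sequence\n";
}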
```
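
Building on this, a rough way to tell DNA from protein could look like the sketch below. The sub name `sequence_type` and the test sequences are made up, and the protein test is a deliberately crude assumption (any string of uppercase letters that is not pure A, C, G, T), not a real validation.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a sequence with two anchored patterns
sub sequence_type {
    my ($seq) = @_;
    return "DNA"     if $seq =~ /^[ACGT]+$/;   # nothing but A, C, G, T
    return "protein" if $seq =~ /^[A-Z]+$/;    # crude assumption: other uppercase letters
    return "unknown";
}

foreach my $seq ("ATTTGACTATA", "MKVLAAGIVG", "ATG-123") {
    print "$seq: ", sequence_type($seq), "\n";
}
```

The order of the two tests matters: a DNA sequence would also satisfy the second pattern, so it has to be checked first.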

## Automatic match variables

`$_` holds the default string for matching

`$1, $2, $3`, etc. hold the strings that matched the capture groups

`$&` holds the part of the string that actually matched the pattern

`` $` `` holds the part of the string before the matched portion (backtick)

`$'` holds the part of the string after the matched portion (single quote)
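
A minimal sketch showing the three position-related variables after one match; the sequence and the variable names are chosen for this example only.

```perl
use strict;
use warnings;

$_ = "AGGATAGGATATTA";

my ($before, $match, $after);
if (/TAG/) {
    # Copy the match variables right away; the next successful match overwrites them
    ($before, $match, $after) = ($`, $&, $');
    print "before: $before\n";
    print "match:  $match\n";
    print "after:  $after\n";
}
```

This prints:

```
before: AGGA
match:  TAG
after:  GATATTA
```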

## Substring manipulation

Perl also lets you change the matched portion of a string:

The `s///` (substitution) operator matches a pattern and replaces the matched portion:

`s/RegEx/REPLACEMENT/`

e.g. `s/CDS/coding sequence/` will replace the first "CDS" with "coding sequence"

__`s///` replaces the matched portion in the variable that holds the string, i.e. the original string is modified!__

e.g.:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

$_ = "CDS CDS CDS";
s/CDS/cds/;
print "$_\n";
```

```
cds CDS CDS
```


To replace all matches within a string, add the global modifier `/g`:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

$_ = "CDS CDS CDS";
s/CDS/cds/g;
print "$_\n";
```

```
cds cds cds
```


The binding operator can also be used with the substitution operator, e.g.:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

my $sequence = "ATTTGACTATA";
print "DNA: $sequence\n";
$sequence =~ s/T/U/g;
print "RNA: $sequence\n";
```

```
DNA: ATTTGACTATA
RNA: AUUUGACUAUA
```
