***
# <center>Regular Expressions</center>
***

# Tutorial Outline
1. <a href='#section1'>Getting ready</a>
1. <a href='#section2'>Regular Expressions</a>
    1. <a href='#section2a'>Escape sequences</a>
    1. <a href='#section2b'>Quantifiers</a>
    1. <a href='#section2c'>Position of pattern within the string</a>
    1. <a href='#section2d'>Operators</a>
    1. <a href='#section2e'>Character classes</a>
1. <a href='#section3'>Regular expressions in R</a>
    1. <a href='#section3a'>General modes for patterns</a>
    1. <a href='#section3b'>Examples using `grep()`</a>
    1. <a href='#section3c'>Examples using `sub()` to format or find/replace data</a>
1. <a href='#section4'>There's more...</a>


# <a id='section1'><font color=black>1. Getting ready</font></a>
***
Regular expressions (regex or regexp) are extremely useful for extracting information from text by searching for one or more matches of a specific search pattern (i.e. a specific sequence of ASCII or Unicode characters).

Fields of application include validation, string parsing, translating data to other formats, and web scraping.

One of the most interesting features is that once you’ve learned the syntax, you can use this tool in (almost) all programming languages __(C / C++, bash, Python, Perl, Ruby, R, and many others)__ and text editors __(Notepad++, BBEdit, and jEdit)__, with only slight differences in syntax and in which advanced features each engine supports.

In today's exercise we will learn to use regular expressions with the command-line program __`grep`__ in bash, and in R with __`grep()`__ and __`sub()`__.

Before we begin, let's set up the notebook environment and make R accessible to the notebook.

In [5]:
#set up notebook environment
import rpy2.rinterface
%load_ext rpy2.ipython

***
# <a id='section2'><font color=black>2. Regular Expressions</font></a>

A regular expression is a pattern that describes a specific set of strings with a common structure. Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string. This is accomplished with the help of __metacharacters__ that have a specific meaning. We will use some small examples to introduce regular expression syntax.

In [5]:
%%bash 

# look at the first lines of a file
head basic.txt

Hi 
this 
is test	file 
to carry out few regular expressions 
practical "with" grep 
123 456 
Abcd
ABCD EFG

In [6]:
%%bash 

# grep searches for a string or pattern in a file or input text and returns any lines that have a match
grep "few" basic.txt

to carry out few regular expressions 


In [7]:
%%bash 

# return any line with a t character
grep "t" basic.txt

this 
is test	file 
to carry out few regular expressions 
practical "with" grep 


## <a id='section2a'><font color=black>__A. Escape sequences__</font></a>
There are some special characters in computer languages that cannot be typed directly into a string. These characters must be escaped. This rule applies to all string functions in R, including regular expressions. For a complete list of R escape sequences, see https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html

- `\'`: single quote. You don’t need to escape a single quote inside a double-quoted string, so `"'"` also works.
- `\"`: double quote. Similarly, a double quote needs no escape inside a single-quoted string, i.e. `'"'`.
- `\n`: newline.
- `\r`: carriage return.
- `\t`: tab character.

<div class="alert alert-block alert-info">
    <b>Note:</b> In bash, `grep` does not recognize `\n` or `\r` as valid search strings for newline or carriage return
</div>
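
The tab escape `\t` is likewise not something to rely on inside a plain `grep` pattern. One common workaround, sketched below on a made-up two-line sample (the variable names are only for illustration), is to let `printf` produce a real tab character and pass that to `grep`.

```shell
# Hypothetical two-line sample; only the first line contains a tab.
sample=$(printf 'is test\tfile\nno tab here')

# printf turns \t into a real tab character that grep can match literally.
tab=$(printf '\t')
with_tab=$(printf '%s\n' "$sample" | grep "$tab")

# Prints only the line that contains the tab.
echo "$with_tab"
```

The same idea should work against `basic.txt`, where the third line contains a tab between `test` and `file`.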

In [8]:
%%bash 

# return any line containing a double quote
grep "\"" basic.txt

practical "with" grep 


## <a id='section2b'><font color=black>__B. Quantifiers__</font></a>
Quantifiers specify how many times the preceding character or class must repeat.

- `*`: zero or more matches of the preceding character or class
- `?`: zero or one match of the preceding character or class
- `+`: one or more matches of the preceding character or class
- `{n}`: exactly n matches of the preceding character or class
- `{min,}`: min or more matches of the preceding character or class
- `{min,max}`: at least min but no more than max matches of the preceding character or class
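
As a small sketch of how `?`, `*` and `{min,max}` differ, the block below runs three patterns over a made-up word list (the words and variable names are assumptions for illustration, not part of the exercise data). The `-x` option makes `grep` accept a line only if the whole line matches, which keeps the effect of each quantifier easy to see.

```shell
# Hypothetical sample words, one per line.
words=$(printf '%s\n' color colour colouur ct cat caat caaat)

# ? : the u may appear zero or one time
optional_u=$(printf '%s\n' "$words" | grep -Ex 'colou?r')

# * : zero or more a characters between c and t
any_a=$(printf '%s\n' "$words" | grep -Ex 'ca*t')

# {2,3} : at least two and at most three a characters
two_to_three_a=$(printf '%s\n' "$words" | grep -Ex 'ca{2,3}t')

printf '%s\n' "? matches:" "$optional_u" "* matches:" "$any_a" "{2,3} matches:" "$two_to_three_a"
```

Here `colou?r` keeps `color` and `colour` but not `colouur`, `ca*t` keeps every c...t word including `ct`, and `ca{2,3}t` keeps only `caat` and `caaat`.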


In [9]:
%%bash

# the -E option tells grep to use extended regular expressions, which support more syntax than the default
# return any line with one or more s characters in a row
grep -E "s+" basic.txt

this 
is test	file 
to carry out few regular expressions 


In [10]:
%%bash

# return any line with exactly 2 s characters in a row
grep -E "s{2}" basic.txt

to carry out few regular expressions 


## <a id='section2c'><font color=black>__C. Position of pattern within the string__</font></a>

### Anchors

- ^: matches the start of the string.
- $: matches the end of the string.
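
With `grep`, the "string" is each line of input. The sketch below, using three made-up lines (assumed sample data), shows `^` and `$` separately and then combined to match a line that consists of the pattern and nothing else.

```shell
# Hypothetical sample lines.
lines=$(printf '%s\n' 'grep is here' 'here is grep' 'grep')

# ^ : the line must start with grep
starts=$(printf '%s\n' "$lines" | grep '^grep')

# $ : the line must end with grep
ends=$(printf '%s\n' "$lines" | grep 'grep$')

# ^...$ : the line must be exactly grep
whole=$(printf '%s\n' "$lines" | grep '^grep$')

printf '%s\n' "starts:" "$starts" "ends:" "$ends" "whole:" "$whole"
```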

In [11]:
%%bash

# return any line that starts with an i character
grep "^i" basic.txt

is test	file 



### Boundaries

- \b: matches the empty string at either edge of a word. Don’t confuse it with ^ and $, which mark the edges of a string.
- \B: matches the empty string provided it is not at an edge of a word.
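
The difference between `\b` and `\B` is easiest to see side by side. This is a minimal sketch that assumes GNU `grep` (where `\b` and `\B` are available as extensions) and uses three made-up words in which `re` sits at the start, in the middle, or at the end.

```shell
# Hypothetical sample words; "re" appears in a different position in each.
words=$(printf '%s\n' regular prepare bare)

# \bre : re at the start of a word
starts_re=$(printf '%s\n' "$words" | grep '\bre')

# re\b : re at the end of a word
ends_re=$(printf '%s\n' "$words" | grep 're\b')

# \Bre\B : re strictly inside a word, touching neither edge
inner_re=$(printf '%s\n' "$words" | grep '\Bre\B')

printf '%s\n' "start:" "$starts_re" "end:" "$ends_re" "inside:" "$inner_re"
```

Only `regular` starts with `re`, both `prepare` and `bare` end with it, and only `prepare` has an `re` with word characters on both sides.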

In [13]:
%%bash

# return any line that has a word that starts with the characters re
grep "\bre" basic.txt

to carry out few regular expressions 


In [15]:
%%bash

# return any line that has an s character that is not at the end of a word
grep "s\B" basic.txt

is test	file 
to carry out few regular expressions 


## <a id='section2d'><font color=black>__D. Operators__</font></a>

- .: matches any single character.
- [...]: a character list, matches any one of the characters inside the square brackets. We can also use - inside the brackets to specify a range of characters.
- [^...]: an inverted character list, similar to [...], but matches any character except those inside the square brackets.
- \: suppresses the special meaning of metacharacters in a regular expression, i.e. __`$ * + . ? [ ] ^ { } | ( ) \`__, similar to its usage in escape sequences. Since \ itself needs to be escaped in R, we need to escape these metacharacters with a double backslash, like \\$.
- |: an “or” operator, matches patterns on either side of the |.
- (...): grouping in regular expressions. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them for building up a new string. Each group can then be referred to using \\N, where N is the number of the group, counting (...) from left to right. This is called a backreference.
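
The operators above can be tried out on a small made-up word list, as in the sketch below (the words and variable names are assumptions for illustration). In the shell a single backslash is enough; the doubled backslash is only needed inside R strings. The backreference in the last pattern is widely supported by `grep -E`, including GNU `grep`, although POSIX does not require it for extended regular expressions.

```shell
# Hypothetical sample words.
words=$(printf '%s\n' Bog bog dog log book b.g)

# [bB]og : a character list, b or B followed by og
bracket=$(printf '%s\n' "$words" | grep '[bB]og')

# [^bB]og : an inverted list, any character except b or B followed by og
negated=$(printf '%s\n' "$words" | grep '[^bB]og')

# b.g : the dot matches any single character
any_char=$(printf '%s\n' "$words" | grep 'b.g')

# b\.g : the backslash makes the dot literal
literal_dot=$(printf '%s\n' "$words" | grep 'b\.g')

# (d|l)og : grouping limits the alternation to the first letter
either=$(printf '%s\n' "$words" | grep -E '(d|l)og')

# (.)\1 : a captured character followed by the same character again
doubled=$(printf '%s\n' "$words" | grep -E '(.)\1')

printf '%s\n' "$bracket" "$negated" "$any_char" "$literal_dot" "$either" "$doubled"
```

Note how `b.g` matches both `bog` and `b.g`, while `b\.g` matches only the literal `b.g`, and how `(.)\1` picks out `book` because of its doubled `o`.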

In [3]:
%%bash

# return any line that has an h or a t character followed by an i character
# the parentheses limit the | to the two letters; without them, "h|ti" would match h or ti
grep -E "(h|t)i" basic.txt

this 
practical "with" grep 


In [30]:
%%bash

# return any line that has an i character followed by one or more characters of any kind followed by another i character
grep -E "i.+i" basic.txt

is test	file 
practical "with" grep 


## <a id='section2e'><font color=black>__E. Character classes__</font></a>
Character classes allow you to – surprise! – specify entire classes of characters, such as numbers, letters, etc. There are two flavors of character classes: one uses [: and :] around a predefined name inside square brackets, and the other uses \ and a special character. They are sometimes interchangeable.

- [:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9].
- \D: non-digits, equivalent to [^0-9].
- [:lower:]: lower-case letters, equivalent to [a-z].
- [:upper:]: upper-case letters, equivalent to [A-Z].
- [:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z].
- [:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9].
- \w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_].
- \W: not word, equivalent to [^A-Za-z0-9_].
- [:xdigit:]: hexadecimal digits (base 16), 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f, equivalent to [0-9A-Fa-f].
- [:blank:]: blank characters, i.e. space and tab.
- [:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space.
- \s: whitespace characters, equivalent to [[:space:]].
- \S: non-whitespace characters.
- [:punct:]: punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
- [:graph:]: graphical (human readable) characters: equivalent to [[:alnum:][:punct:]].
- [:print:]: printable characters, equivalent to [[:graph:]] plus the space character.
- [:cntrl:]: control characters, like \n or \r, [\x00-\x1F\x7F].

Note:

- [:...:] has to be used inside square brackets, e.g. [[:digit:]].
- \ itself is a special character in R strings and needs to be escaped, e.g. \\d. Do not confuse these regular expressions with R escape sequences such as \t.
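
A range such as `[A-z]` is sometimes used as shorthand for "any letter", but in ASCII order it also spans the characters `[ \ ] ^ _` and the backtick, which sit between `Z` and `a`. The sketch below illustrates the difference on made-up samples; it sets `LC_ALL=C` so that ranges follow ASCII order, which is an assumption about the environment rather than something the exercise requires.

```shell
# Hypothetical samples: letters only, letters with an underscore, brackets, digits.
samples=$(printf '%s\n' 'abc' 'a_b' '[x]' '123')

# [A-z] covers every ASCII character from A to z, including [ \ ] ^ _ and the backtick.
loose=$(printf '%s\n' "$samples" | LC_ALL=C grep -Ex '[A-z]+')

# [[:alpha:]] covers letters only.
strict=$(printf '%s\n' "$samples" | LC_ALL=C grep -Ex '[[:alpha:]]+')

printf '%s\n' "[A-z] accepts:" "$loose" "[[:alpha:]] accepts:" "$strict"
```

The named class accepts only `abc`, while the range also lets `a_b` and `[x]` through, so `[[:alpha:]]` or `[A-Za-z]` is the safer choice when only letters are wanted.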

In [26]:
%%bash

# return any line that has an alphabetic character
grep -E "[[:alpha:]]" basic.txt

Hi 
this 
is test	file 
to carry out few regular expressions 
practical "with" grep 
Abcd
ABCD EFG


In [29]:
%%bash

# return any line that has a digit character
grep -E "[[:digit:]]" basic.txt

123 456 


# <a id='section3'><font color=black>3. Regular expressions in R</font></a>

## <a id='section3a'><font color=black>__A. General modes for patterns__</font></a>

There are different syntax standards for regular expressions, and R offers two:

- POSIX extended regular expressions (default)
- Perl-like regular expressions.

You can easily switch between them by specifying perl = FALSE/TRUE in base R functions, such as grep() and sub(). For functions in the stringr package, wrap the pattern with regex() (older stringr versions used perl()). The syntax of the two standards sometimes differs, and for this tutorial we will only use R’s Perl-like regular expressions. Because `\` must itself be escaped inside an R string, shorthand classes such as `\w` are written with an additional backslash (`\\w`) for R to recognize the metacharacter.

#### Example for alphanumeric search
- POSIX
    `grep("[[:alnum:]]", string)`

- Perl-like (note that `\\w` also matches the underscore)
   `grep("\\w", string, perl=TRUE)`

There’s one last type of regular expression – “fixed”, meaning that the pattern should be taken literally. Specify this via fixed = TRUE (base R functions) or wrapping with fixed() (stringr functions). For example, "A.b" as a regular expression will match a string with “A” followed by any single character followed by “b”, but as a fixed pattern, it will only match a literal “A.b”.
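
The command-line `grep` has an analogous switch, `-F`, which treats the pattern as a fixed string. The sketch below uses it to illustrate the "A.b" example on made-up input (the sample strings and variable names are assumptions); in R the corresponding call would pass fixed = TRUE instead.

```shell
# Hypothetical sample strings.
samples=$(printf '%s\n' 'A.b' 'Axb' 'A-b' 'Ab')

# As a regular expression, the dot matches any single character.
as_regex=$(printf '%s\n' "$samples" | grep 'A.b')

# As a fixed string (-F), only the literal text A.b matches.
as_fixed=$(printf '%s\n' "$samples" | grep -F 'A.b')

printf '%s\n' "regex:" "$as_regex" "fixed:" "$as_fixed"
```

The regular expression matches `A.b`, `Axb` and `A-b`, while the fixed pattern matches only `A.b`.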

## <a id='section3b'><font color=black>__B. Examples using  `grep()`__</font></a>

In [7]:
%%R

#Using grep, what indexes contain bog or Bog?

  string <- c("mog","Bog", "hog", "fog", "Log", "bog", "dog", "smog")

#pattern: [bB]og
grep("[bB]og",string,perl=TRUE)

[1] 2 6


In [15]:
%%R
    #how many genes are on chromosome 8 in Populus?
  
    #read populus annotation file

populus <- read.table("Ptrichocarpa_210_annotation_primary.txt", sep="\t", header=TRUE)

    #pattern: Potri\\.008G (the dot is escaped so that it matches a literal "." rather than any character)

indexes <- grep("Potri\\.008G", populus$gene)

length(indexes)

[1] 2267


In [33]:
%%R
    #Which genes have a description that contains the term F-box

Fbox <- grep("F-box", populus$description, perl=TRUE)

head(populus[Fbox,c("gene", "description")])

                gene
94  Potri.001G009400
215 Potri.001G021500
216 Potri.001G021600
224 Potri.001G022400
355 Potri.001G035500
745 Potri.001G074500
                                                    description
94                                         F-box family protein
215                                        F-box family protein
216                                        F-box family protein
224                                        F-box family protein
355 F-box and associated interaction domains-containing protein
745                          F-box_RNI-like superfamily protein


In [23]:
%%R
    #What genes were annotated as gibberellin?
    #Note the [Gg] specifies a custom search that allows for a match of either upper or lower case g

    #pattern: [Gg]ibberellin

GA <- grep("[Gg]ibberellin", populus$description)

length(GA)

head(populus[GA,c("gene", "description")])

                 gene                                  description
1766 Potri.001G176600                      gibberellin 3-oxidase 1
2541 Potri.001G254100         Gibberellin-regulated family protein
2977 Potri.001G297700         Gibberellin-regulated family protein
3155 Potri.001G315500         Gibberellin-regulated family protein
3506 Potri.001G350600         Gibberellin-regulated family protein
3784 Potri.001G378400 Arabidopsis thaliana gibberellin 2-oxidase 1


## <a id='section3c'><font color=black>__C. Examples using `sub()` to format or find/replace data__</font></a>

The _`sub()`_ function replaces the first match in a string; if the input is a character vector, it replaces the first match in each element. Information is captured using parentheses (...) and reused in the replacement using \\1, \\2, etc. Capture groups are numbered from left to right by their opening parenthesis, so an outer group gets a lower number than any group nested inside it.

```
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
```

- pattern: regular expression, or a literal string when fixed = TRUE
- replacement: string, the replacement for the matched pattern
- x: character vector to search
- ignore.case: logical. If TRUE, matching is case-insensitive
- perl: logical. Should Perl-compatible regexps be used?
- fixed: logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments
- useBytes: logical. If TRUE the matching is done byte-by-byte rather than character-by-character


In [30]:
%%R

    #Convert mm/dd/yyyy to dd_mm_yyyy using sub

dates <- c("01/19/2016", "12/03/2012", "08/28/1993")

    #pattern: (\\d+)\\/(\\d+)\\/(\\d+)
    #replacement: \\2_\\1_\\3

new_date_format <- sub("(\\d+)\\/(\\d+)\\/(\\d+)", "\\2_\\1_\\3", dates, perl=TRUE)

new_date_format

[1] "19_01_2016" "03_12_2012" "28_08_1993"


In [31]:
%%R
    #Abbreviate the genus in each scientific name
    #Example: Agalma elegans -> A. elegans

Scientific_names <- c("Liriodendron tulipifera", "Liquidambar styraciflua", "Eucalyptus grandis", "Populus trichocarpa", "Salix aegyptiaca", "Salix purpurea")

    #pattern: (\\w)\\w+   (the first word, i.e. the genus; sub() only replaces the first match)
    #replacement: \\1.

sub("(\\w)\\w+", "\\1.", Scientific_names, perl=TRUE)


[1] "L. tulipifera"  "L. styraciflua" "E. grandis"     "P. trichocarpa"
[5] "S. aegyptiaca"  "S. purpurea"   


In [32]:
%%R
    #Remove the transcript number from the ATG gene names in populus$ATG
    #Example AT5G17230.3 -> AT5G17230

sub("(\\w+)\\.\\d+", "\\1", populus$ATG, perl = TRUE)


    [1] "AT2G01050" "AT2G01050" ""          "AT2G30942" "AT1G55570" "AT1G55570"
    [7] "AT2G32080" "AT1G55550" "AT3G13410" ""          "AT2G03090" "AT1G55540"
   [13] "AT4G26410" "AT1G55535" "AT5G56340" "AT5G56350" "AT5G56360" ""         
   [19] ""          "AT3G13460" ""          "AT4G26330" "AT5G56460" "AT5G64810"
   [25] "AT3G13470" "AT5G56510" "AT1G55480" "AT5G36110" ""          "AT5G36110"
   [31] "AT5G36110" "AT4G04220" "AT3G13480" "AT4G09830" "AT5G56520" "AT5G64150"
   [37] "AT5G23710" "AT1G55360" "AT1G55350" "AT1G55340" "AT5G56540" "AT1G55325"
   [43] "AT4G26310" ""          "AT5G19760" "AT5G19770" "AT5G19790" "AT5G19820"
   [49] ""          "AT5G19830" "AT3G13540" ""          "AT5G19850" "AT5G11950"
   [55] "AT4G08810" "AT3G13600" "AT4G25650" ""          "AT4G25650" "AT2G42820"
   [61] "AT2G41890" "AT3G13600" "AT1G55300" "AT3G13570" "AT3G13560" "AT1G55320"
   [67] "AT1G55320" "AT3G13610" "AT3G13610" "AT1G15670" "AT1G55290" "AT3G13620"
   [73] "AT3G13620" "AT5G09280" ""      

# <a id='section4'><font color=black>4. There's more...</font></a>
***
We only scratched the surface of regular expressions in this exercise. There are many other commands that can be combined in an infinite number of ways.

Here are a few references:
- RegEx for bioinformatics at http://library.open.oregonstate.edu/computationalbiology/chapter/patterns-regular-expressions/
- R regular expressions at https://stat545.com/block022_regular-expression.html
- R manual page at https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
