#### R regex
- [POSIX Basic Regular Expressions (BRE)](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions)
- [POSIX Extended Regular Expressions (ERE)](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions): some backslash escapes required in BRE are dropped.  
- [R uses TRE, a POSIX-like engine](https://www.regular-expressions.info/posix.html)  
> The best way to use regular expressions with R is to pass the perl=TRUE parameter. This tells R to use the PCRE regular expressions library. When this website talks about R, it assumes you’re using the perl=TRUE parameter. Starting with R 4.0.0, passing perl=TRUE makes R use the PCRE2 library.


##### [R base regex page](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html)
> Two types of regular expressions are used in R, extended regular expressions (the default) and Perl-like regular expressions used by `perl = TRUE`. There is also `fixed = TRUE` which can be considered to use a literal regular expression.
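
A minimal sketch of how the three modes behave, using `grepl()` on a made-up vector `x` (the strings are illustrative and not taken from the R manual):

```r
x <- c("a.b", "a+b", "axb")

# Default: extended regular expressions (TRE); "." matches any character.
default_match <- grepl("a.b", x)

# perl = TRUE: PCRE; "." is still a metacharacter here.
perl_match <- grepl("a.b", x, perl = TRUE)

# fixed = TRUE: the pattern is taken literally, so "." only matches ".".
fixed_match <- grepl("a.b", x, fixed = TRUE)

default_match  # TRUE TRUE TRUE
perl_match     # TRUE TRUE TRUE
fixed_match    # TRUE FALSE FALSE
```

The default and PCRE engines agree on simple patterns like this one; they differ on features such as lookaround, which only `perl = TRUE` provides.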

R's default extended regular expressions follow the
> POSIX 1003.2 standard  
- [ASCII character set](https://www.ibm.com/docs/en/xl-fortran-linux/15.1.0?topic=appendix-ascii-ebcdic-character-sets) and [wiki control characters](https://en.wikipedia.org/wiki/Control_character) 
> Escaping non-metacharacters with a backslash is implementation-dependent. (this concerns control characters) 

In [31]:
# testing of metacharacters using `cat()` 
m1 <- '|'
# m2 <- '\|'  # error: '\|' is an unrecognized escape in an R character string
m3 <- '\\|'
cat(m1)
cat('\n')
cat(m3)

|
\|

In [42]:
metachar <- c('\\.', '\\\\', '\\|', '\\(', '\\)', '\\[', '\\{', '\\^', '\\$', '\\*', '\\+', '\\?')
for (i in metachar) {cat(i); cat('\n')}

\.
\\
\|
\(
\)
\[
\{
\^
\$
\*
\+
\?
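
The `cat()` output above shows what the regex engine actually receives. A short sketch of why the doubled backslash matters when the string is used as a pattern (the protein ID is one of the examples used later in these notes):

```r
# In R source, "\\|" is a two-character string: a backslash and a pipe.
pattern <- "\\|"
n_chars <- nchar(pattern)

# The regex engine sees \| and treats the pipe as a literal character.
parts <- strsplit("sp|Q7KZ85|SPT6H_HUMAN", pattern)[[1]]

# fixed = TRUE avoids escaping altogether for literal patterns.
parts_fixed <- strsplit("sp|Q7KZ85|SPT6H_HUMAN", "|", fixed = TRUE)[[1]]

# An unescaped "." matches every character; the escaped form matches only the dot.
all_replaced <- gsub(".", "-", "a.b")
dot_replaced <- gsub("\\.", "-", "a.b")

n_chars       # 2
parts         # "sp" "Q7KZ85" "SPT6H_HUMAN"
all_replaced  # "---"
dot_replaced  # "a-b"
```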


- named classes of characters, i.e. character classes  
    - `[:alnum:]`
        - `[:alpha:]` + `[:digit:]` = `[:lower:]` + `[:upper:]` + `[:digit:]`
    - `[:graph:]` = `[:alnum:]` + `[:punct:]`  
    - `[:print:]` = `[:alnum:]` + `[:punct:]` + space
    - `[:punct:]`
        - `` ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ ``
    - `[:space:]`: tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters.


- `[[:alnum:]_]` = `\w`; `[^[:alnum:]_]` = `\W`   
> Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket list.
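
A small sketch of the named classes in use, with the double brackets the note above calls for. The input strings are made up for illustration:

```r
x <- c("abc_123", "hello world", "a-b")

# [[:alnum:]_] and \w should accept the same "word" characters.
posix_word <- grepl("^[[:alnum:]_]+$", x)
shorthand_word <- grepl("^\\w+$", x)

# [[:punct:]] covers both "|" and "_", so each is replaced by a space.
cleaned <- gsub("[[:punct:]]", " ", "sp|Q7KZ85|SPT6H_HUMAN")

posix_word      # TRUE FALSE FALSE
shorthand_word  # TRUE FALSE FALSE
cleaned         # "sp Q7KZ85 SPT6H HUMAN"
```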

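- repetition
    - `?`: at most once (≤1)
    - `*`: zero or more times (≥0)
    - `+`: one or more times (≥1)
    - `{3}`: exactly 3 times; `{3,10}`: between 3 and 10 times  
> By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)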
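
A brief sketch of greedy versus minimal matching and of counted repetition, using R's default engine and made-up inputs:

```r
x <- "<a><b>"

# Greedy: ".+" grabs as much as possible, so the whole string is one match.
greedy <- sub("<.+>", "X", x)

# Minimal: ".+?" stops at the first ">" that completes a match.
minimal <- sub("<.+?>", "X", x)

# Counted repetition: exactly three, or two to three, copies of "a".
y <- c("aa", "aaa", "aaaa")
exactly_three <- grepl("^a{3}$", y)
two_to_three <- grepl("^a{2,3}$", y)

greedy         # "X"
minimal        # "X<b>"
exactly_three  # FALSE TRUE FALSE
two_to_three   # TRUE TRUE FALSE
```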

- concatenation

- alternation through the infix operator `|`

> Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules.
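
The precedence rules quoted above can be checked with a few made-up strings. This is only an illustrative sketch:

```r
x <- c("ab", "abab", "abb", "cd", "ad")

# Repetition binds tighter than concatenation: "+" applies to "b" only.
rep_on_b <- grepl("^ab+$", x)

# Parentheses make "+" apply to the whole group "ab".
rep_on_group <- grepl("^(ab)+$", x)

# Alternation binds loosest: this reads as (^ab) OR (cd$).
loose_alt <- grepl("^ab|cd$", x)

# Parentheses restrict the alternation so both anchors apply to each branch.
grouped_alt <- grepl("^(ab|cd)$", x)

rep_on_b      # TRUE FALSE TRUE FALSE FALSE
rep_on_group  # TRUE TRUE FALSE FALSE FALSE
loose_alt     # TRUE TRUE TRUE TRUE FALSE
grouped_alt   # TRUE FALSE FALSE TRUE FALSE
```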


#### PCRE
- Perl 5.x syntax and semantics are enabled with `perl = TRUE`  
> For complete details please consult the man pages for PCRE, especially man pcrepattern and man pcreapi, on your system or from the sources at https://www.pcre.org. (The version in use can be found by calling extSoftVersion. It need not be the version described in the system's man page. PCRE1 (reported as version < 10.00 by extSoftVersion) has been feature-frozen for some time (essentially 2012), the man pages at https://www.pcre.org/original/doc/html/ should be a good match. PCRE2 (PCRE version >= 10.00) has man pages at https://www.pcre.org/current/doc/html/).
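
A one-line check of which PCRE version the running R session links against, via the `extSoftVersion()` call mentioned in the quote. The exact string depends on the installation, so no output is shown:

```r
# extSoftVersion() returns a named character vector; "PCRE" is one of its names.
pcre_version <- extSoftVersion()[["PCRE"]]
pcre_version
```

A version of 10.00 or higher indicates PCRE2, per the quote above.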

- positive and negative lookahead: `(?=...)` and `(?!...)`  
- positive and negative lookbehind: `(?<=...)` and `(?<!...)`. 
> Patterns ‘(?=...)’ and ‘(?!...)’ are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed. Patterns ‘(?<=...)’ and ‘(?<!...)’ are the lookbehind equivalents: they do not allow repetition quantifiers nor ‘\C’ in ....
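
A minimal sketch of lookaround in base R. It assumes `perl = TRUE`, since these assertions are a PCRE feature. The second element of `db` is a hypothetical placeholder ID invented for the example, not a real UniProt entry:

```r
ids <- c("sp|Q7KZ85|SPT6H_HUMAN", "sp|Q71U36|TBA1A_HUMAN")

# Lookbehind requires "sp|" before the match and lookahead requires "|" after it;
# neither is included in the matched text.
m <- regexpr("(?<=sp\\|)[^|]+(?=\\|)", ids, perl = TRUE)
accessions <- regmatches(ids, m)

# Negative lookahead: TRUE for strings that do NOT start with "sp|".
db <- c("sp|Q7KZ85|SPT6H_HUMAN", "tr|XXXXXX|NAME_HUMAN")  # second ID is hypothetical
not_swissprot <- grepl("^(?!sp\\|)", db, perl = TRUE)

accessions     # "Q7KZ85" "Q71U36"
not_swissprot  # FALSE TRUE
```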

In [3]:
library(tidyverse)

-- Attaching packages ------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.6     v purrr   0.3.4
v tibble  3.1.7     v dplyr   1.0.9
v tidyr   1.2.0     v stringr 1.4.0
v readr   2.1.2     v forcats 0.5.1

-- Conflicts ---------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()



In [1]:
# MaxQuant output, proteinGroups.txt, `Majority protein IDs`
# tibble() is namespace-qualified so the cell also runs before library(tidyverse) is attached

tbl <- tibble::tibble(protid = c("sp|Q7KZ85|SPT6H_HUMAN", "sp|Q71U36|TBA1A_HUMAN;sp|P0DPH8|TBA3D_HUMAN;sp|P0DPH7|TBA3C_HUMAN;sp|Q9NY65|TBA8_HUMAN;sp|Q6PEY2|TBA3E_HUMAN"))

# capture the UniProt accession and the gene name of the first protein in each group
tidyr::extract(tbl, protid, into = c('uniprotID', 'geneName'),  regex = "(?<=\\|)([^|]+)(?=\\|)\\|(?<=\\|)([^\\_]+)(?=\\_HUMAN)")


In [46]:
# extract group and sample/replicate info to construct experiment matrix
cnames <- c("Intensity.6A_7", "Intensity.6A_8", "Intensity.6A_9", "Intensity.6B_7", "Intensity.6B_8", "Intensity.6B_9")

stringr::str_match(cnames, "^Intensity\\.(.*)_(.*)")[, c(2,3)]

| group | replicate |
|---|---|
| 6A | 7 |
| 6A | 8 |
| 6A | 9 |
| 6B | 7 |
| 6B | 8 |
| 6B | 9 |
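
For comparison, a base-R sketch of the same capture-group extraction with `regexec()` and `regmatches()`. The column names `column`, `group` and `replicate` are assumptions chosen for the example, and only three of the column names are used to keep it short:

```r
cnames <- c("Intensity.6A_7", "Intensity.6A_8", "Intensity.6B_9")

# regexec() reports the full match plus each capture group;
# regmatches() turns those positions into character vectors.
m <- regmatches(cnames, regexec("^Intensity\\.(.*)_(.*)$", cnames))

# Elements 2 and 3 of each result are the two capture groups.
captured <- do.call(rbind, lapply(m, function(v) v[2:3]))
colnames(captured) <- c("group", "replicate")

design <- data.frame(column = cnames, captured, stringsAsFactors = FALSE)
design
```

The result keeps the original column name next to its group and replicate, which is one possible starting point for an experiment design table.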


#### `sprintf` C-style formatting
|||
