In [14]:
import re

# **REGULAR EXPRESSIONS** - Guide VR

# Intro

Regular expressions are patterns for finding and replacing parts of text in a string or file. To work with them, import the **"re"** module from the Python standard library.

In [10]:
import re

Regular expressions are most often used to search within a string, split strings, and replace parts of strings.
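
As a quick illustration of these three common uses, here is a minimal sketch built on `re.search`, `re.split` and `re.sub`. The sample text and the variable names are made up for this example.

```python
import re

text = 'apples, pears;  plums'  # hypothetical sample text

# Search: the first word that starts with "p" (\b marks a word boundary)
first_p_word = re.search(r'\bp\w+', text).group()

# Split: break the text on commas or semicolons followed by optional spaces
parts = re.split(r'[,;]\s*', text)

# Replace: collapse every run of whitespace into a single space
cleaned = re.sub(r'\s+', ' ', text)

print(first_p_word)  # pears
print(parts)         # ['apples', 'pears', 'plums']
print(cleaned)       # apples, pears; plums
```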

Also called "regex" or "regexp". Think of it as a smart "find" or "search" over text.

The language dates from the 1960s. It is **character** based.

[Documentation](https://docs.python.org/3/library/re.html)

This module provides regular expression matching operations similar to those found in Perl.

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
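
The rule about not mixing types can be checked with a small sketch like the one below. The sample data and the `mixed_error` name are only illustrative, and the exact error message may differ between Python versions.

```python
import re

str_result = re.findall('ab', 'abcab')       # str pattern on a str
bytes_result = re.findall(b'ab', b'abcab')   # bytes pattern on bytes

print(str_result)
print(bytes_result)

mixed_error = None
try:
    re.findall(b'ab', 'abcab')               # bytes pattern on a str: not allowed
except TypeError as err:
    mixed_error = err
    print('We got error:', err)
```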

## **Escape sequences `\`**

Regular expressions use the backslash character `\` to indicate special forms or to allow special characters to be used without invoking their special meaning. 

> The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. [source](https://docs.python.org/3/reference/lexical_analysis.html)

This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write `\\\\` as the pattern string, because the regular expression must be `\\`, and each backslash must be expressed as `\\` (to let Python know we want just one `\`) inside a regular Python string literal. 

In [227]:
import re
a = '\\'
b = re.findall('\\\\', a)
print(b)
print(b[0])

['\\']
\


Also, please note that any invalid escape sequence in a Python string literal (a backslash followed by a character that Python does not treat as special) now generates a `DeprecationWarning` (a `SyntaxWarning` since Python 3.12), and in the future this will become a `SyntaxError`. This happens even if the sequence is a valid escape sequence for a regular expression.
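
To see the warning without cluttering the notebook, one option is to evaluate the literal from source text and record what Python reports. This is only a sketch: the `source` string is a made-up example, the warning category depends on the Python version, and a future version may raise `SyntaxError` here instead.

```python
import warnings

# Source text of a non-raw literal: \d is a valid regex escape,
# but not an escape sequence that Python string literals know.
source = r"'\d'"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    value = eval(source)  # compiling the literal triggers the warning

print(repr(value), len(value))  # the backslash is kept, so the length is 2
for w in caught:
    print(w.category.__name__, ':', w.message)
```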

### Raw String usage `r'string'`

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with `'r'`. So `r"\n"` is a two-character string containing `'\'` and `'n'`, while `"\n"` is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.

In [93]:
a = '\n'
b = r'\n'
print(f"a: '{a}' and its length is {len(a)}")
print(f"b: '{b}' and its length is {len(b)}")

a: '
' and its length is 1
b: '\n' and its length is 2


### Even number of `\` rule

Beware of misusing `\`, even in raw strings. Take a look at these examples:

In [145]:
import re
a = '\\'
re.findall(r'\', a)

SyntaxError: unterminated string literal (detected at line 3) (2321787335.py, line 3)

In [148]:
import re
a = r'\'
re.findall(r'\\', a)

SyntaxError: unterminated string literal (detected at line 2) (1393654224.py, line 2)

A raw string cannot end in a single backslash, since the backslash would escape the following quote character. This means: 
> **use only an even** (not odd) **number of `\` at the end of a string, even if it is a raw string**.

Raw strings are not 100% raw; there is still some rudimentary backslash processing.  
[source](https://stackoverflow.com/questions/647769/why-cant-pythons-raw-string-literals-end-with-a-single-backslash)

In [240]:
import re
a = r'abc\\\'  # 3 back-slashes
re.findall(r'\\', a)

SyntaxError: unterminated string literal (detected at line 2) (837609679.py, line 2)
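
If a string really has to end with an odd number of backslashes, one possible workaround (a sketch, not the only option) is to join a raw part that ends in an even number of `\` with an ordinary literal that holds the last one.

```python
import re

# Raw part with two trailing backslashes + ordinary literal with one more
a = r'abc\\' + '\\'   # 'abc' followed by 3 back-slashes
b = re.findall(r'\\', a)

print(a, len(a))      # 6 characters: 'abc' plus 3 back-slashes
print(len(b))         # 3 matches, one per back-slash
```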

In [218]:
import re
a = '\\\\'  # 4 back-slashes
b = re.findall(r'\\', a)

print('1)', a)
print('2)', b, '\tand it only looks like two pairs in the list')
print('3)', b[0], 'and', b[1], '\t\twe have two lonely back-slashes in the list')


1) \\
2) ['\\', '\\'] 	and it only looks like two pairs in the list
3) \ and \ 		we have two lonely back-slashes in the list


### More examples

In [237]:
import re
a = '\\'
print('1)', a)
print('2)', re.findall(r'\\', a))
try:
    print('3)', re.findall('\\', a))
except Exception as err:
    print('We got error:', err)

1) \
2) ['\\']
We got error: bad escape (end of pattern) at position 0


Why does the third case raise an error? Without the `r` prefix, Python turns `'\\'` into a single backslash before the `re` module sees it, so the pattern ends with a lone `\` that has nothing left to escape (see the beginning of this section).

Let's look at the problem from another side. As the four-backslash example showed, the pair `\\` in the raw pattern passed to `.findall` matches one backslash of the given string `a`, which the list output displays as `'\\'`. When we remove the `r` prefix from the pattern, Python collapses the `\\` into a single `\` (the first `\` of the pair escapes the second one).

Thus, as we already know, to get the same result as with `r'string'` we need to type four `\` (two pairs) in the ordinary pattern string: each pair produces one `\`, so the `re` module receives the pair `\\` it needs to match one backslash in the given string `a`.

The following examples demonstrate the same point:

In [220]:
import re
a = '\\'
print('1)', a)
print('2)', re.findall(r'\\', a))
print('3)', re.findall('\\\\', a))  # 4 back-slashes match one given back-slash

1) \
2) ['\\']
3) ['\\']


In [223]:
import re
a = r'\\'
print('1)', a)
print('2)', re.findall(r'\\', a))
print('3)', re.findall('\\\\', a))  # 4 back-slashes match two given back-slashes (one at a time)

1) \\
2) ['\\', '\\']
3) ['\\', '\\']


So, in the `.findall` raw string, unlike the given raw string `a`, we always need an additional `\` to escape the `\` that follows. Moreover, an ordinary (non-raw) `.findall` pattern needs twice as many `\` as the raw one to match the same backslash in the given string.
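
When counting backslashes by hand becomes error-prone, the standard `re.escape` function can build the pattern text for you. The sketch below assumes a made-up Windows-style path as the given string.

```python
import re

a = r'C:\new\table'         # hypothetical given string with two real back-slashes
pattern = re.escape('\\')   # pattern text that matches one literal back-slash

print(pattern, len(pattern))   # the pattern itself is two characters: \\
print(re.findall(pattern, a))  # one match per back-slash in `a`
```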

One last example to reinforce the point:

In [248]:
import re
a = r'\n'
print(a, len(a))
print(re.findall(r'\n', a), '\t\tan empty list')
print(re.findall(r'\\n', a), ':', re.findall(r'\\n', a)[0], '\ta desired match')

\n 2
[] 		an empty list
['\\n'] : \n 	a desired match


# 1. Regular Expression Syntax

Regular expressions can contain both **special** and **ordinary** characters. 

## 1.1 **Types of characters**

### Ordinary characters

Most ordinary characters, like `'A'`, `'a'`, or `'0'`, are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so `last` matches the string `'last'`. 
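
A minimal sketch of ordinary characters matching themselves (the sample sentences are invented for this example):

```python
import re

both = re.findall('last', 'at last, the last one')  # every occurrence of 'last'
none = re.findall('last', 'Last')                    # matching is case-sensitive by default

print(both)  # ['last', 'last']
print(none)  # []
```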

In the rest of this section, we’ll write RE’s in this `special style`, usually without quotes, and strings to be matched `'in single quotes'`.

### Special characters

Special characters (like `'|'` or `'('`) 
- either stand for classes of ordinary characters, 
- or affect how the regular expressions around them are interpreted.
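
The difference between a special character and its escaped, literal form can be sketched with `|`. The sample strings are assumptions made for this example.

```python
import re

either = re.findall('cat|dog', 'cat, dog, bird')  # | means "this or that"
literal = re.findall(r'\|', 'a|b|c')              # \| is just the pipe character

print(either)   # ['cat', 'dog']
print(literal)  # ['|', '|']
```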

### Quantifiers

**Repetition operators** or **quantifiers** (`*`, `+`, `?`, `{m,n}`, etc) cannot be directly nested. This avoids ambiguity with the **non-greedy** modifier suffix `?`, and with other modifiers in other implementations. 

To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression 

`(?:a{6})*` 

matches any multiple of six `'a'` characters.
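
Here is a small sketch of that behaviour. It uses `re.fullmatch` so the whole string has to fit the pattern, and it also shows what happens when the repetition is nested directly. The variable names are illustrative only.

```python
import re

pattern = r'(?:a{6})*'

twelve = re.fullmatch(pattern, 'a' * 12)  # 12 is a multiple of 6
seven = re.fullmatch(pattern, 'a' * 7)    # 7 is not

print(bool(twelve))  # True
print(bool(seven))   # False

nested_error = None
try:
    re.compile(r'a{6}*')                  # repetition applied directly to a repetition
except re.error as err:
    nested_error = err
    print('We got error:', err)
```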

## 1.2 **Special Characters**

### **`.`**

(Dot.) In the default mode, this matches **any character except a newline**. 

If the `DOTALL` flag has been specified, this matches any character including a newline.
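
The effect of `DOTALL` can be sketched on a two-line string (the sample text is made up for this example):

```python
import re

text = 'a\nb'

default_mode = re.findall('a.b', text)                  # . does not match the newline
dotall_mode = re.findall('a.b', text, flags=re.DOTALL)  # . matches the newline too

print(default_mode)  # []
print(dotall_mode)   # ['a\nb']
```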

In [21]:
import re
print(re.findall('h..', 'hey Hey Hey Hey'))
print(re.findall('e..', 'hey Hey Hey Hey'))
print(re.findall('a..', 'hey Hey Hey Hey'))

['hey']
['ey ', 'ey ', 'ey ']
[]


### **`^`**
(Caret.) Matches the start of the string, and in [MULTILINE](https://docs.python.org/3/library/re.html#re.MULTILINE) mode also matches immediately after each newline.

In [67]:
import re
print(re.findall('^.', 'hey Hey Hey Hey'))
print(re.findall('^..', 'hey Hey Hey Hey'))

['h']
['he']


In [66]:
import re
print(re.findall('^..', '\nhey Hey Hey Hey'))
print(re.findall('^..', '\nhey Hey Hey Hey', flags=re.MULTILINE))

[]
['he']


### **`$`**
Matches the **end of the string or just before the newline at the end of the string**, and in [MULTILINE](https://docs.python.org/3/library/re.html#re.MULTILINE) mode _also_ matches before a newline. 

The pattern `foo` matches both `'foo'` and `'foobar'`, while the regular expression `foo$` matches only `'foo'`:

In [249]:
import re
print(re.findall('foo', 'foo'))
print(re.findall('foo$', 'foo'))

['foo']
['foo']


In [250]:
import re
print(re.findall('foo', 'foobar'))
print(re.findall('foo$', 'foobar'))

['foo']
[]


More interestingly, 

- searching for `foo.$` in `'foo1\nfoo2\n'` matches only `'foo2'` normally, but also `'foo1'` in [MULTILINE](https://docs.python.org/3/library/re.html#re.MULTILINE) mode:

In [62]:
import re
print(re.findall('foo.$', 'foo1\nfoo2\n'))
print(re.findall('foo.$', 'foo1\nfoo2\n', flags=re.MULTILINE))

['foo2']
['foo1', 'foo2']


- searching for a single `$` in `'foo\n'` will find two (empty) matches: one just before the newline, and one at the end of the string.

In [253]:
import re
print(re.findall('$', 'foo'))
print(re.findall('$', 'foo\n'))

['']
['', '']


In [61]:
import re
print(re.findall('$', 'foo1\nfoo2\n'))
print(re.findall('$', 'foo1\nfoo2\n', flags=re.MULTILINE))

['', '']
['', '', '']
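
Since the matches above are empty strings, it can help to print where they occur. The sketch below uses `re.finditer` and the `start()` method of each match object to list the positions on the same sample string.

```python
import re

text = 'foo1\nfoo2\n'  # 10 characters; the newlines sit at positions 4 and 9

default_pos = [m.start() for m in re.finditer('$', text)]
multiline_pos = [m.start() for m in re.finditer('$', text, flags=re.MULTILINE)]

print(default_pos)    # [9, 10]: before the final newline and at the very end
print(multiline_pos)  # [4, 9, 10]: additionally before the first newline
```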
