# Strings II

In [None]:
import string

import time
from IPython.display import YouTubeVideo
from IPython.display import clear_output

In [None]:
word = "silencio"

## Regular Expressions

[HOWTO](https://docs.python.org/3/howto/regex.html), [doc](https://docs.python.org/3/library/re.html)  
[Coding Train series](https://www.youtube.com/watch?v=7DG3kCDx53c&list=PLRqwX-V7Uu6YEypLuls7iidwHMdCM6o2w)  
[w<sub>3</sub> tutorial](https://www.w3schools.com/python/python_regex.asp), [RealPython tutorial 1](https://realpython.com/regex-python/), [tutorial 2](https://realpython.com/regex-python-part-2/)  
[old ugly website with *everything*](https://www.regular-expressions.info/), [test online website](https://regex101.com/)

In [None]:
import re

### Functions

[doc](https://docs.python.org/3/library/re.html#functions)

#### `re.search`

[(doc)](https://docs.python.org/3/library/re.html#re.search), returns a match if the pattern is found anywhere in the string.

In [None]:
print(word)
match = re.search("si", word)
print(match)
# use match. + tab to search options

**Note**:  
[walrus doc](https://docs.python.org/3/reference/expressions.html#grammar-token-python-grammar-assignment_expression)  
[RealPython walrus tutorial](https://realpython.com/python-walrus-operator/)

In [None]:
# the so-called walrus operator: executes the right-hand side
# and saves the result for reuse in one fell swoop!
if m := re.search("si", word):
    print(f"match found, span: {m.span()} ('{m[0]}')") 
    print(f"(span is just start: {m.start()}, end: {m.end()})")
    print(f"string used: {m.string}")
    print(f"regex used: {m.re}")
    print(f"we can now slice {word} using indices: {word[m.start():m.end()]} + {word[m.end():]}")

#### `re.finditer`

[(doc)](https://docs.python.org/3/library/re.html#re.finditer), returns an **iterator** with all the match objects (like `re.search`, but returning all matches, not just the first).

In [None]:
pat = re.compile("i")
print(f"Using `finditer` with '{pat.pattern}' on '{word}':")
for m in re.finditer(pat, word):
    print(f" - {m[0]} at {m.span()} ({m})")

In [None]:
# how to get a list out of it?
list(re.finditer("i", word))

##### Extra: More

[`re.match` doc](https://docs.python.org/3/library/re.html#re.match) is like `re.search`, but the pattern must match the *beginning* of the string.  
[`re.fullmatch` doc](https://docs.python.org/3/library/re.html#re.fullmatch) is like `re.search`, but the pattern must match the *entirety* of the string.
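
A minimal sketch of the difference between the three functions, reusing the `word` defined at the top of the notebook (the variable names are only illustrative):

```python
import re

word = "silencio"

# "len" occurs in the middle of the word
found_anywhere = re.search("len", word)      # a match object
found_at_start = re.match("len", word)       # None: the word does not start with "len"
starts_with_si = re.match("si", word)        # a match: the word starts with "si"
whole_is_si = re.fullmatch("si", word)       # None: "si" is not the whole word
whole_si_rest = re.fullmatch("si.*", word)   # a match: ".*" consumes the rest

print(found_anywhere, found_at_start, starts_with_si, whole_is_si, whole_si_rest, sep="\n")
```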

[`re.compile` doc](https://docs.python.org/3/library/re.html#re.compile)

Sometimes it can be useful to 'compile' the search pattern into an object, and reuse it (to perform the same search multiple times, say).

In [None]:
# create a regex object
pat = re.compile("si")

# same as:
# print(re.match(pat, word))
print(pat.match(word))

[`re.findall` (doc)](https://docs.python.org/3/library/re.html#re.findall) returns a **list** with all the string snippets found when performing the search.  
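
A short sketch of `re.findall` on the same `word`. One thing to watch for: when the pattern contains a group, the list holds what the group captured rather than the whole match.

```python
import re

word = "silencio"

# no group: a list of the matched substrings
vowels = re.findall(r"[aeiou]", word)
print(vowels)

# one group: a list of what the group captured (the character after each "i")
after_i = re.findall(r"i(\w)", word)
print(after_i)
```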

#### `re.split`

[re.split doc](https://docs.python.org/3/library/re.html#re.split)  

A more powerful version of `str.split()`...

In [None]:
# pattern, string to split
re.split("i", word)

In [None]:
# adding groups (parentheses) will make it return the delimiter as well!
# https://stackoverflow.com/a/2136580
re.split(r"(\s)", "Longtemps je me couchai de bonne heure.")
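
As one illustration of what "more powerful" can mean here, the sketch below splits on several possible delimiters at once, something `str.split()` cannot do in a single call. The sample string is made up for the example.

```python
import re

messy = "apples, pears;plums ,  figs"

# split on a comma or a semicolon, swallowing any whitespace around it
parts = re.split(r"\s*[,;]\s*", messy)
print(parts)

# maxsplit limits the number of splits performed
first, rest = re.split(r"\s*[,;]\s*", messy, maxsplit=1)
print(first, "|", rest)
```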

#### `re.sub`

[re.sub doc](https://docs.python.org/3/library/re.html#re.sub)  

A more powerful version of `str.replace()`...

In [None]:
# pattern, replacement, string
re.sub("i", "i" * 5, word)
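
Two further options of `re.sub`, sketched under the assumption that `word` is still `"silencio"`: the replacement can be a function that receives each match object, and `count` limits how many replacements are made. The helper name `shout` is hypothetical.

```python
import re

word = "silencio"

def shout(match):
    # called once per match; the returned string replaces the matched text
    return match[0].upper()

upper_vowels = re.sub(r"[aeiou]", shout, word)
print(upper_vowels)

# count=1: only the first occurrence is replaced
first_only = re.sub(r"i", "1", word, count=1)
print(first_only)
```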

### Syntax

[HOWTO metacharacters](https://docs.python.org/3/howto/regex.html#more-metacharacters)

The power of regular expressions comes from the ability to build complex search patterns using a dedicated syntax.

- `.`: represents any char except the newline
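- `^`: start of the string, and with `re.MULTILINE` also the start of each line (compare `\A`, always the start of the entire string)
- `$`: end of the string or just before a trailing newline, and with `re.MULTILINE` also the end of each line (compare `\Z`, always the end of the entire string)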
- `\b` / `\B`: a word boundary / its complement (not a boundary)

In [None]:
print(re.match(r".", word))
print(re.match(r"^s", word))
print(re.search(r"o$", word))

print(re.search(r"len", word))
# 'len' sits inside 'silencio', not at a word boundary, so \b prevents a match
print(re.search(r"\blen", word))

#### Extra: Flags

[HOWTO flags](https://docs.python.org/3/howto/regex.html#compilation-flags)

- `re.DOTALL`: the dot also matches newlines
- `re.IGNORECASE`: matches letters regardless of case
- `re.MULTILINE`: `^` / `$` also match at the beginning / end of each line, not only at the beginning / end of the whole string

These can also be set inside the pattern itself: search for `(?aiLmsux)` [here](https://docs.python.org/3/library/re.html#regular-expression-syntax) for the reference.

In [None]:
# S does not match with s
pat = re.compile(r"S")
print(re.match(pat, word))

# now it does – same as re.compile(r"(?i)S")
pat = re.compile(r"S", re.IGNORECASE)
print(re.match(pat, word))
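
The other two flags can be tried on a small two-line string. This is only a sketch, and `text` is made up for the example:

```python
import re

text = "first line\nsecond line"

# default: ^ only matches at the very start of the string
default_starts = re.findall(r"^\w+", text)
# MULTILINE: ^ also matches right after each newline
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)
print(default_starts, multiline_starts)

# default: the dot stops at the newline
dot_default = re.match(r".*", text)[0]
# DOTALL: the dot crosses the newline
dot_all = re.match(r".*", text, re.DOTALL)[0]
print(repr(dot_default), repr(dot_all))
```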

#### Character classes

[HOWTO matching characters](https://docs.python.org/3/howto/regex.html#matching-characters)

- `\d` / `\D`: a digit / not a digit
- `\s` / `\S`: a whitespace character (` \t\n\r\f\v`) / not a whitespace character
- `\w` / `\W`: a word character (letters, digits and the underscore) / not a word character
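
A quick sketch of these classes in action, on a made-up string:

```python
import re

s = "Room 42, floor 3\t(west wing)"

digits = re.findall(r"\d", s)      # each single digit
numbers = re.findall(r"\d+", s)    # runs of digits
words = re.findall(r"\w+", s)      # runs of word characters
no_spaces = re.sub(r"\s", "", s)   # drop all whitespace, including the tab

print(digits, numbers, words, no_spaces, sep="\n")
```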


#### Repeaters (greedy / lazy)

[HOWTO repeating things](https://docs.python.org/3/howto/regex.html#repeating-things)  
[HOWTO greedy/non-greedy](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy)  

| matches | greedy<sup>1</sup> | lazy<sup>2</sup> |
| --- | --- | --- |
| zero or one of<br>previous item<sup>3</sup> | `?` | `??` |
| zero or more of<br>previous item<sup>3</sup>  |`*` | `*?` |
| one or more of<br>previous item<sup>3</sup> | `+` | `+?` |
|  from `m` to `n` repetitions | `{m,n}` |  `{m,n}?` |

1: **greedy**: matches as much text as possible  
2: **lazy**: as little as possible  
3: this is either a character (`ab+` matches `abbbbb`), or a group (`a(bc)+` matches `abcbcbcbc`)

In [None]:
s = "what what what what is the word"
match_greedy = re.match(r"(what )+", s)
match_lazy = re.match(r"(what )+?", s)

print(f"Greedy match: {match_greedy}")
print(f"Lazy match: {match_lazy}")
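
The difference tends to matter most when a pattern has a closing delimiter. A minimal sketch, using a made-up snippet of markup:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# greedy: .+ runs to the *last* ">" in the string
greedy = re.findall(r"<.+>", html)
# lazy: .+? stops at the *first* ">" it can
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```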

#### Extra: Capture groups

[HOWTO grouping](https://docs.python.org/3/howto/regex.html#grouping)

- `[xyz]`: defines a *character set*, i.e. any one of the characters x, y or z (like the character classes above, but custom)  
  `[a-z]`: for a range (matches all chars from a to z)  
  `[^xyz]`: the negative of a character set, i.e. all chars that are *not* x, y or z
- `()`: define a *group* (see below)  
  `\(\)`: to search for actual parentheses (they must be *escaped* using a backslash)
- `(x|y|z)`: defines a regex *OR* (match either x, y or z)  
  `\|`: to search for an actual vertical bar (same, must be *escaped*)


In [None]:
s = "Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways."

print(s)
print()

# selecting only snippets of the text using the letters forming `Jeremie` (caps/no-caps)
print(re.sub(r"[^jeremieJEREMIE]+", " ■ ", s))
print()

# selecting the complement (only the snippets that do *not* use those letters)
print(re.sub(r"[jeremieJEREMIE]+", " □ ", s))

In [None]:
# source: https://resources.saylor.org/wwwresources/archived/site/wp-content/uploads/2011/01/Waiting-for-Godot.pdf
s = """
POZZO:
(peremptory). Who is Godot?
ESTRAGON:
Godot?
POZZO:
You took me for Godot.
VLADIMIR:
Oh no, Sir, not for an instant, Sir.
"""

# searching for either V or E speaking, a newline, and the entire line
# up to the end (what they say): selecting their lines only, leaving out Pozzo
for m in re.finditer(r"(VLADIMIR:|ESTRAGON:)\n.*", s):
    print(m)
print()

# searching for one or more word characters inside actual parentheses
print(re.search(r"\(\w+\)", s))

In [None]:
messy_string = "2234    234 58948   2080 345 455    0345"

# (\d): a first group containing a single digit (recalled with \1)
# \s+: one or more whitespace characters
# (\d): a second group containing a single digit (recalled with \2)
pat = re.compile(r"(\d)\s+(\d)")

print(pat.groups)

# replace matches with group one (\1), first digit, a comma, then group two (\2), second digit
clean_string = re.sub(pat, r"\1,\2", messy_string)

print(clean_string)

#### More

If you want even more:
- [non-capturing and named groups](https://docs.python.org/3/howto/regex.html#non-capturing-and-named-groups) (sometimes it's useful to define a group just for syntactic reasons, without it being counted; sometimes you want to give a name to your group, to be able to retrieve it with the name instead of the index)
- [lookaheads & lookbehinds](https://docs.python.org/3/howto/regex.html#lookahead-assertions) (sometimes you want to match only something that follows/precedes an expression: for instance, above, you could retrieve only what Estragon says, without the 'ESTRAGON:' included)
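
A hedged sketch of both ideas, on a shortened version of the Godot excerpt used earlier. The group names `speaker` and `line` are chosen for this example only.

```python
import re

s = """
POZZO:
(peremptory). Who is Godot?
ESTRAGON:
Godot?
VLADIMIR:
Oh no, Sir, not for an instant, Sir.
"""

# named groups: retrieve parts by name instead of by index
pat = re.compile(r"(?P<speaker>[A-Z]+):\n(?P<line>.*)")
for m in pat.finditer(s):
    print(m["speaker"], "->", m["line"])

# lookbehind: match only what follows "ESTRAGON:\n", without including it
estragon_lines = re.findall(r"(?<=ESTRAGON:\n).*", s)
print(estragon_lines)

# non-capturing group: groups the alternation without capturing it,
# so findall returns the whole match (name plus colon)
speakers = re.findall(r"(?:VLADIMIR|ESTRAGON):", s)
print(speakers)
```

Note that Python's `re` only accepts lookbehinds of fixed width, which is why the lookbehind above names one speaker rather than an alternation of names with different lengths.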

For even more functionality, see the [regex](https://pypi.org/project/regex/) module (install with `pip`, backward compatible with `re`).

## Advanced: Unicode, encoding and UTF-8 

[doc (aka the gory detail)](https://docs.python.org/3/howto/unicode.html)  
[`ord()` doc](https://docs.python.org/3/library/functions.html#ord)  
[`chr()` doc](https://docs.python.org/3/library/functions.html#chr)  
[`bin()` doc](https://docs.python.org/3/library/functions.html#bin)  
[`int()` doc](https://docs.python.org/3/library/functions.html#int)

- `unicode`: worldwide standard assigning one **number** to one **character**
- encoding: the way you **implement** this in computers (how to organise the 0s and 1s, binary representation, so as to get the computer to handle text properly)
- `utf-8` (Unicode Transformation Format – 8-bit): one specific encoding, which stores each character in one to four 8-bit bytes (a single byte for ASCII characters, more for most others)


In [None]:
YouTubeVideo("MijmeoH9LT4", width=853, height=480) #  Characters, Symbols and the Unicode Miracle - Computerphile 

In [None]:
print("the Unicode code point for 'a' is:", ord("a"))

In [None]:
print("the character for Unicode code point 97:", chr(97))

In [None]:
print("binary representation of 97 is:", bin(97))

In [None]:
print("converting from binary to integer (base 10):", int('10', 2)) # try '0', '1', '10', '11' ...
print("'converting' from base 10 to integer (also base 10):", int('10', 10)) 

In [None]:
print("converting the binary string for 97 back to 97:", int(bin(97), 2))

In [None]:
print("converting the binary string for 97 back to 'a':", chr(int(bin(97), 2)))

In [None]:
# to convert a string to binary,
# first 'encode' to bytes
byte_string = "a".encode("utf8") # try adding more letters
# then turn the bytes into binary
# (the '0b' only indicates this is a binary string)
# (see the `int()` doc for details)
list_of_binary_strings = [bin(byte) for byte in byte_string]
print(list_of_binary_strings)

In [None]:
# Chinese characters take more than one byte!
byte_string = "龙".encode("utf8")
list_of_binary_strings = [bin(byte) for byte in byte_string]
print(list_of_binary_strings)
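
To see how the number of bytes varies, here is a small sketch comparing a few characters. Two of them are written as escapes so the code points are explicit, and the selection is arbitrary.

```python
# "\u00e9" is é, "\U0001F642" is a smiley emoji
chars = ["a", "\u00e9", "龙", "\U0001F642"]

byte_counts = {}
for char in chars:
    encoded = char.encode("utf-8")
    byte_counts[char] = len(encoded)
    print(f"{char!r}: code point {ord(char)}, {len(encoded)} byte(s): {encoded}")

# decoding reverses the encoding
round_trip = "龙".encode("utf-8").decode("utf-8")
print(round_trip)
```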

Jörg Piringer's [unicode](https://joerg.piringer.net/index.php?href=unicode/unicode.xml), going through numbers `0 - 65536 (49571 characters)`:

In [None]:
YouTubeVideo("Z_sl99D2a18", width=853, height=480) #  unicode 

In [None]:
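# many characters are also space / invisible / etc., so won't display anything
for i in range(65536):
    # skip lone surrogates (0xD800-0xDFFF): they are not characters
    # and may raise an encoding error when printed
    if 0xD800 <= i <= 0xDFFF:
        continue
    print(chr(i))
    # try this if you also want the index
    # print(i, chr(i))
    time.sleep(.09)
    clear_output(wait=True)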

## Extra: `str.translate`

[translate doc](https://docs.python.org/3/library/stdtypes.html#str.translate)  
[maketrans doc](https://docs.python.org/3/library/stdtypes.html#str.maketrans)

A useful tool to translate strings character by character (e.g. all "e"s become "a"s, or "remove all punctuation", where every punctuation character becomes `''`).

In [None]:
# translate expects a dict keyed by Unicode code points (hence `ord`)!
word.translate(
    {
        ord("i"): ord("1"),
        ord("e"): ord("3"),
        ord("c"): ord("<"),
        ord("o"): ord("0"),
    }
)

In [None]:
word.translate(
    str.maketrans("ieco", "13<0")
)

In [None]:
# maketrans can take a third argument:
# here we say: translate everything as is ("" to "")
# and the last argument lists *every character that needs to be removed*
# (equivalent to setting `ord(","): None` for all punctuation characters)
print(str.maketrans("", "", string.punctuation))

In [None]:
# now, we can remove all punctuation from text
print("Hello there! How are you? Yes, you...".translate(
    str.maketrans("", "", string.punctuation)
))
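
The values in a translation table do not have to be code points: per the `str.translate` doc, they can also be strings of any length, or `None` to delete the character. A minimal sketch (the mapping itself is arbitrary):

```python
word = "silencio"

table = {
    ord("s"): "SS",      # one character replaced by several
    ord("i"): None,      # character removed
    ord("o"): ord("0"),  # code point replaced by another code point
}

translated = word.translate(table)
print(translated)
```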