## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Regular Expressions

With **Regular Expressions** (often abbreviated as **regex**), users have the capability _to search_ for _specific strings_ using a wide range of rules they devise. For instance, they can locate phone numbers within a string.

Regular expressions are infamous for their _intricate syntax_. This complexity stems from their versatility: they must be able to describe any conceivable string pattern, which requires an expressive pattern format.

Regular expressions are handled using Python's built-in **re** library. See [the docs](https://docs.python.org/3/library/re.html) for more information.

## Searching for Basic Patterns

Suppose we have the following string:

In [None]:
txt = "Hey, my new phone number is 99-98123-4150. Call me later!"

One way to find out if a word is inside the text is:
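A minimal sketch of such a check, assuming Python's `in` operator is the intended approach (the variable names are illustrative):

```python
txt = "Hey, my new phone number is 99-98123-4150. Call me later!"

# the `in` operator checks whether an exact substring occurs in the text
has_phone = "phone" in txt
has_email = "email" in txt

print(has_phone)  # True
print(has_email)  # False
```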

While valid, this solution _only works_ for checking **exact words or substrings**. It is not possible, for example, to check if there is a generic phone number that follows a given pattern/format. We would only be able to check if the _exact number_ `"99-98123-4150"` is present in the text.

<br/>

Let's start with **regular expressions** for simple cases and move on to more complex cases later.

In [None]:
# built-in Python module


In [None]:
# define the regex/pattern to search
# in this case, the exact word 'number'


In [None]:
# search for the pattern in the string/text


`re.search()` takes the _pattern_, _scans_ the text, and returns a **Match object** if the pattern is found. The Match object stores the **span**, that is, the **starting index** (included) and **ending index** (excluded) where the pattern was found.
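As a sketch of these steps, the snippet below searches for the exact word `number` and inspects the span of the resulting Match object. The variable names are illustrative choices.

```python
import re  # built-in Python module

txt = "Hey, my new phone number is 99-98123-4150. Call me later!"
regex = "number"  # the exact word to search for

match = re.search(regex, txt)
print(match)         # a Match object describing where the pattern was found
print(match.span())  # (18, 24): start index included, end index excluded

# slicing the text with the span recovers the matched word
print(txt[match.start():match.end()])  # number
```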

In [None]:
# repeat the search and save it in a variable


In [None]:
# span


<br/>

What if the pattern/regex **is not found**?

In [None]:
# a word not present in the text


When the pattern is not found, the return of `re.search` is `None`. In Jupyter Notebook this just means that nothing is output below the cell.
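A short sketch of the not-found case, assuming the word `email` as an example of something absent from the text. Checking for `None` before calling Match methods avoids an `AttributeError`.

```python
import re

txt = "Hey, my new phone number is 99-98123-4150. Call me later!"

match = re.search("email", txt)  # "email" does not occur in txt
print(match)  # None

# guard against None before calling methods such as .span() or .group()
if match is None:
    print("pattern not found")
else:
    print(match.span())
```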

In [None]:
match

In [None]:
print(match)

<br/>

But what if the pattern _occurs_ **more than once**?

In [None]:
txt = "I like my friends, they are good friends."

In [None]:
len(txt)

In [None]:
# search for the word "friends"


Note that the function _only matches_ the **first instance**.

If we want a **list of *all matches***, we can use the `re.findall()` function:
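For instance, a minimal sketch using the sentence about friends (the variable `all_matches` is an illustrative name):

```python
import re

txt = "I like my friends, they are good friends."

# findall returns a list with the text of every non-overlapping match
all_matches = re.findall("friends", txt)
print(all_matches)       # ['friends', 'friends']
print(len(all_matches))  # 2
```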

<br/>

To get a **match object** for each match, we can use the **iterator** returned by `re.finditer()`:
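A small sketch of iterating over the matches, again on the sentence about friends. Collecting the spans in a list is only done here for illustration.

```python
import re

txt = "I like my friends, they are good friends."

# finditer yields one Match object per occurrence, from left to right
for match in re.finditer("friends", txt):
    print(match.span(), match.group())

# the spans can also be collected in a list
spans = [m.span() for m in re.finditer("friends", txt)]
print(spans)
```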

<br/>

Can the pattern `"friends"` also match the word "friends" written in other ways, for example with different capitalization?

In [None]:
txt

**Answer:** **NO**, it can't. This is a simple _regex_ that only matches the ***exact word/pattern***.

But, in this specific case involving _lowercase and uppercase letters_, we can simply **ignore the _case_** by passing an extra argument to `re.search`.
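A minimal sketch with the `re.IGNORECASE` flag. The mixed-case sentence below is a hypothetical text chosen for illustration.

```python
import re

# hypothetical text with the word written in different cases
txt = "I like my Friends, they are good FRIENDS."

# by default the search is case-sensitive, so nothing is found
print(re.search("friends", txt))  # None

# re.IGNORECASE (or its short form re.I) ignores the case
match = re.search("friends", txt, re.IGNORECASE)
print(match.group())  # Friends

all_matches = re.findall("friends", txt, flags=re.IGNORECASE)
print(all_matches)  # ['Friends', 'FRIENDS']
```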

Now, the pattern was found ignoring the _case_.

For other _flags_, refer to: https://docs.python.org/3/library/re.html#flags

Let's see how to generalize patterns.

## Patterns

We only learned how to search for a basic pattern -- the exact word/string -- so far. What about **a more complex (and *generic*) pattern**? For example, trying to find if there is a phone number in the text, without necessarily knowing the exact number.

We're going to start learning the regex syntax to define _generic patterns_. This syntax may hurt at first, but be strong... you'll love it later 🤓

For more details, refer to: https://www.w3schools.com/python/python_regex.asp

### Special Sequences
A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning:

| **Character** | **Description**                                                                        | **Pattern Example** | **Match Example** |
|---------------|----------------------------------------------------------------------------------------|---------------------|-------------------|
|       \d      |                               A digit (numbers from 0-9)                               |  "student-id-\d\d"  |   student-id-23   |
|       \D      |                                       A non digit                                      |        "\D\D"       |         Zy        |
|       \w      | Alphanumeric (letters a-z and A-Z, digits 0-9, and the underscore _ character) |      "\w\w\w\w"     |        Xa2y       |
|       \W      |                                    Non-alphanumeric                                    |       "\W\W\W"      |        -+=        |
|       \s      |                                       White space                                      |     "I\sF\sS\sP"    |      I F S P      |
|       \S      |                                     Non-whitespace                                     |        "\S\S"       |         a0        |
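The rows of the table can be checked with a short sketch like the one below. It assumes raw strings (`r"..."`), which is the usual way to write patterns in Python so that backslashes are not interpreted as string escapes.

```python
import re

# (pattern, text) pairs taken from the table above
examples = [
    (r"student-id-\d\d", "student-id-23"),
    (r"\D\D", "Zy"),
    (r"\w\w\w\w", "Xa2y"),
    (r"\W\W\W", "-+="),
    (r"I\sF\sS\sP", "I F S P"),
    (r"\S\S", "a0"),
]

# every pattern should match its whole example text
results = [re.search(pattern, text).group() for pattern, text in examples]
for (pattern, text), found in zip(examples, results):
    print(f"{pattern!r} matched {found!r}")
```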

### Sets
A set is a set of characters inside a pair of square brackets `[]` with a special meaning:

| **Set**    | **Description**                                                                                                                                                   |
|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|    [arn]   |         Returns a match where one of the specified characters (`a`, `r`, **or** `n`) is present. <br/> You can choose any character and how many you want.        |
|    [a-n]   |                                         Returns a match for any lower case character, alphabetically between `a` and `n`.                                         |
|   [^arn]   |         Returns a match for any character **EXCEPT** `a`, `r`, and `n`. <br/> You can choose any character and how many you want. <br/> `^` means EXCEPT.         |
|   [0123]   |          Returns a match where any of the specified digits (`0`, `1`, `2`, **or** `3`) are present. <br/> You can choose any digit and how many you want.         |
|    [0-9]   |                                         Returns a match for any digit between `0` and `9`. <br/> `-` defines an interval.                                         |
| [0-5][0-9] | Returns a match for any **two-digit number** from `00` to `59`. <br/> The first digit can be in the interval `0-5` whereas the second can be in the interval `0-9`. |
|  [a-zA-Z]  |                               Returns a match for ***any* character alphabetically** between `a` and `z`, lower case OR upper case.                               |
|     [+]    |             In sets, `+`, `*`, `.`, `\|`, `()`, `$`, `{}` have no special meaning, so `[+]` means: return a match for any `+` character in the string.             |
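A few of these sets can be tried out as follows. The sample strings are hypothetical and chosen only to make the results easy to read.

```python
import re

only_arn = re.findall(r"[arn]", "brazilian")   # ['r', 'a', 'a', 'n']
not_arn = re.findall(r"[^arn]", "brazilian")   # ['b', 'z', 'i', 'l', 'i']
minutes = re.findall(r"[0-5][0-9]", "17:45")   # ['17', '45']
letters = re.findall(r"[a-zA-Z]", "R2-D2")     # ['R', 'D']
pluses = re.findall(r"[+]", "1+1+1")           # ['+', '+']

for name, found in [("[arn]", only_arn), ("[^arn]", not_arn),
                    ("[0-5][0-9]", minutes), ("[a-zA-Z]", letters),
                    ("[+]", pluses)]:
    print(name, found)
```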

#### Examples

##### 1) Find the phone number.
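One possible pattern for this example, assuming the phone format is two digits, a hyphen, five digits, a hyphen, and four digits, repeats `\d` once per digit:

```python
import re

txt = "Hey, my new phone number is 99-98123-4150. Call me later!"

# assumed format: DD-DDDDD-DDDD, written with one \d per digit
regex = r"\d\d-\d\d\d\d\d-\d\d\d\d"

match = re.search(regex, txt)
print(match.group())  # 99-98123-4150
```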

In [None]:
txt = "Hey, my new phone number is 99-98123-4150. Call me later!"

In [None]:
# regex/pattern for the phone number


In [None]:
match = re.search(regex, txt)
print(match)
print(match.group())

`.group()`: returns the part of the string where there was a match

##### 2) Old Brazilian Licence Plate: three capital letters followed by four digits.
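A sketch of a pattern for this plate format, built only from sets and special sequences (one element per character):

```python
import re

# three capital letters followed by four digits
regex = r"[A-Z][A-Z][A-Z]\d\d\d\d"

valid = re.search(regex, "My licence plate is: RIO2000")
print(valid.group())  # RIO2000

# lowercase letters or too few digits do not match
print(re.search(regex, "abc9123"))  # None
print(re.search(regex, "ABCD123"))  # None
```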

In [None]:
match = re.search(regex, 'ABC9123')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'BEE1010')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'My licence plate is: RIO2000')
print(match)
print(match.group())

In [None]:
# invalid pattern
match = re.search(regex, 'abc9123')
print(match)

In [None]:
# invalid pattern
match = re.search(regex, 'ABCD123')
print(match)

In [None]:
# invalid pattern
match = re.search(regex, 'A text without a licence plate.')
print(match)

<br/>

Note that in the previous examples, the pattern elements (such as `\d` and `[A-Z]`) were **repeated multiple times**. That is a bit of an _annoyance_, especially when searching for _very long_ strings of numbers. Let's explore the available **quantifiers**.

### Quantifiers
Quantifiers specify how many times the preceding pattern element must occur.

| **Character** |                          **Description**                         | **Pattern Example** | **Match Example** |
|:-------------:|:----------------------------------------------------------------:|:-------------------:|:-----------------:|
|       *       |                     Zero or more occurrences.                    |       I*F*S*P*      |       IIIFSS      |
|       +       |                      One or more occurrences                     |        IFS+P+       |       IFSSSP      |
|       ?       |                     Zero or one occurrences.                     |        birds?       |        bird       |
|      {4}      |    Exactly four occurrences. <br/> You can specify any number.   |        \d{4}        |        4029       |
|     {1,5}     |   From 1 to 5 occurrences. <br/> You can specify any numbers.    |    [a-zA-Z]{1,5}    |        nLp        |
|      {2,}     |                      2 or more occurrences.                      |        \w{2,}       |      anytext      |

#### Examples

##### 1) Find the phone number.
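With quantifiers, the phone pattern can be written more compactly. The sketch below assumes the same format as before (two digits, five digits, four digits, separated by hyphens).

```python
import re

txt = "Hey, I have two phone numbers: 99-98123-4150 and 99-12345-6789!"

# {n} repeats the preceding element exactly n times
regex = r"\d{2}-\d{5}-\d{4}"

first = re.search(regex, txt)
print(first.group())  # 99-98123-4150

phone_numbers = re.findall(regex, txt)
print(phone_numbers)  # ['99-98123-4150', '99-12345-6789']
```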

In [None]:
txt = "Hey, I have two phone numbers: 99-98123-4150 and 99-12345-6789!"

In [None]:
# regex for the phone number


In [None]:
# return the first found regex/pattern
match = re.search(regex, txt)
print(match)
print(match.group())

In [None]:
phone_numbers = re.findall(regex, txt)
phone_numbers

In [None]:
for match in re.finditer(regex, txt):
    print(match)
    print(match.group())
    print()

##### 2) Old Brazilian Licence Plate: three capital letters followed by four digits.
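The licence plate pattern could be rewritten with quantifiers along these lines:

```python
import re

# exactly three capital letters followed by exactly four digits
regex = r"[A-Z]{3}\d{4}"

valid = re.search(regex, "My licence plate is: RIO2000")
print(valid.group())  # RIO2000

print(re.search(regex, "abc9123"))  # None
print(re.search(regex, "ABCD123"))  # None
```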

In [None]:
match = re.search(regex, 'ABC9123')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'BEE1010')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'My licence plate is: RIO2000')
print(match)
print(match.group())

In [None]:
# invalid pattern
match = re.search(regex, 'abc9123')
print(match)

In [None]:
# invalid pattern
match = re.search(regex, 'ABCD123')
print(match)

In [None]:
# invalid pattern
match = re.search(regex, 'A text without a licence plate.')
print(match)

##### 3) Define a regex for the following pattern:
FullName Age ID

FullName = FirstName LastName (at least three letters for each, with the first letter capitalized and the other lowercase.) <br/>
Age = Integer number. <br/>
ID = two capital letters followed by three digits.

Ex: `"Luke Skywalker 16 LS987"`

**PS:** In case of a _single_ **white space**, we can use the white space directly in the string instead of `\s`.
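One way to express these rules, as a sketch: each name is a capital letter followed by at least two lowercase letters, the age is one or more digits, and single spaces are written literally.

```python
import re

# FirstName LastName Age ID
regex = r"[A-Z][a-z]{2,} [A-Z][a-z]{2,} \d+ [A-Z]{2}\d{3}"

match = re.search(regex, "Here is your data: Leia Morgana 16 LM234")
print(match.group())  # Leia Morgana 16 LM234

# missing space between the names, or a lowercase last name, do not match
print(re.search(regex, "LukeSkywalker 16 LS987"))   # None
print(re.search(regex, "Luke skywalker 16 LS987"))  # None
```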

In [None]:
match = re.search(regex, "Luke Skywalker 16 LS987")
print(match)
print(match.group())

In [None]:
match = re.search(regex, "Here is your data: Leia Morgana 16 LM234")
print(match)
print(match.group())

In [None]:
# invalid pattern
match = re.search(regex, "LukeSkywalker 16 LS987")
print(match)

In [None]:
# invalid pattern
match = re.search(regex, "Luke skywalker 16 LS987")
print(match)

## Groups

We can perform a more advanced search by grouping parts of a regular expression. This lets us extract both the entire match and specific parts of it, for example the _International Dialling Code (IDC)_, the _dialling code (DC)_, and the _phone number_.

We can separate **groups** of _regular expressions_ using **parentheses**.

full phone number: IDC DC PHONENUMBER <br/>
- IDC: + and from one to three digits
- DC: two digits
- PHONENUMBER: five digits, hyphen, four digits
- E.g: +55 19 12345-6789

**PS:** Use the backslash `\` to escape regex symbols such as `+`.
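A sketch of a grouped pattern for this format, with one pair of parentheses per part. It assumes the three parts are separated by single spaces, as in the example number.

```python
import re

# group 1: IDC, group 2: DC, group 3: phone number
regex = r"(\+\d{1,3}) (\d{2}) (\d{5}-\d{4})"

match = re.search(regex, "My number is: +55 19 12345-6789")

print(match.group())   # full match: +55 19 12345-6789
print(match.group(0))  # same as match.group()
print(match.groups())  # ('+55', '19', '12345-6789')
print(match.group(1))  # +55
print(match.group(2))  # 19
print(match.group(3))  # 12345-6789
```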

In [None]:
match = re.search(regex, "My number is: +55 19 12345-6789")
print(match)

In [None]:
# return the full found regex/pattern


# or


In [None]:
# return all the groups (parts of the regex) --> tuple


In [None]:
# return the first group


In [None]:
# return the second group


In [None]:
# return the third group


## Advanced Regex Syntax

### Or operator `|`
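The `|` operator matches either the expression on its left or the one on its right. A minimal sketch, assuming the two alternatives are the words `cat` and `dog`:

```python
import re

regex = r"cat|dog"

# search returns only the first occurrence of either alternative
first = re.search(regex, "I have a cat and a dog.")
print(first.group())  # cat

# findall returns every occurrence
both = re.findall(regex, "I have a cat and a dog.")
print(both)  # ['cat', 'dog']

# the plural "cats" still contains "cat"
plural = re.search(regex, "I have three cute cats.")
print(plural.group())  # cat
```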

In [None]:
match = re.search(regex, 'I have a cute cat.')
print(match)
print(match.group())

In [None]:
# cats - plural
match = re.search(regex, 'I have three cute cats.')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'I have a dangerous dog.')
print(match)
print(match.group())

In [None]:
# return the first occurrence of the regex
match = re.search(regex, 'I have a cat and a dog.')
print(match)
print(match.group())

In [None]:
# return all occurrences of the regex
re.findall(regex, 'I have a cat and a dog.')

### The Wildcard Character `.`
The "wildcard" symbol `.` matches **_any character_**, including alphanumeric characters, symbols, and spaces, except the newline character.
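As an illustration, the pattern below (an assumed choice) matches `He`, then any two characters, then `o`:

```python
import re

regex = r"He..o"

texts = ["Hello, world!", "He+9o, world!", "He0 o, world!", "He12o+-23, world!"]
found = [re.search(regex, text).group() for text in texts]
print(found)  # ['Hello', 'He+9o', 'He0 o', 'He12o']
```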

In [None]:
match = re.search(regex, 'Hello, world!')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'He+9o, world!')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'He0 o, world!')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'He12o+-23, world!')
print(match)
print(match.group())

### Starts With `^` and Ends With `$`

We can use `^` to anchor the pattern to the start of the string, and `$` to anchor it to the end.

Example with `^` (starting):
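A sketch of a pattern requiring the text to start with three capital letters:

```python
import re

# ^ anchors the three capital letters to the beginning of the text
regex = r"^[A-Z]{3}"

match = re.search(regex, "HELLO, I am Samuka, your professor!")
print(match.group())  # HEL

# only the first letter is a capital, so there is no match
print(re.search(regex, "Hello, I am Samuka, your professor!"))  # None
```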

In [None]:
# the string/text must start with 3 capital letters


In [None]:
match = re.search(regex, 'HELLO, I am Samuka, your professor!')
print(match)
print(match.group())

In [None]:
# invalid
match = re.search(regex, 'Hello, I am Samuka, your professor!')
print(match)

Example with `$` (ending):
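A sketch of a pattern requiring the text to end with two digits:

```python
import re

# $ anchors the two digits to the end of the text
regex = r"\d{2}$"

age = re.search(regex, "I am 32")
print(age.group())  # 32

# only the last two digits of the year are matched
year = re.search(regex, "I was born in 1990")
print(year.group())  # 90

print(re.search(regex, "I am 32 years old"))  # None
```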

In [None]:
# the string/text must end with 2 digits


In [None]:
match = re.search(regex, 'I am 32')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'I was born in 1990')
print(match)
print(match.group())

In [None]:
# invalid
match = re.search(regex, 'I am 32 years old')
print(match)

Example with both `^` and `$`: <br/>
Brazilian CPF: XXX.XXX.XXX-XX
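A possible pattern for this format, assuming each `X` stands for a digit. The dots are escaped because an unescaped `.` would match any character.

```python
import re

# the whole text must be exactly DDD.DDD.DDD-DD
regex = r"^\d{3}\.\d{3}\.\d{3}-\d{2}$"

match = re.search(regex, "123.456.789-00")
print(match.group())  # 123.456.789-00

# extra text before the number breaks the ^ anchor
print(re.search(regex, "My CPF is: 123.456.789-00"))  # None
```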

In [None]:
match = re.search(regex, '123.456.789-00')
print(match)
print(match.group())

In [None]:
# invalid
match = re.search(regex, 'My CPF is: 123.456.789-00')
print(match)

##### **Example**

Define a regex for files with the following rules:
- The filename must have at least one character and can contain any characters except spaces.
- It must have a file extension formed by a dot '.' followed by at least one character.
- The first character of the extension must be a letter and the remaining ones can be alphanumeric.
- The string must only contain the filename

Exs:
- `image01.png`
- `audio.MP3`
- `102_abc.d2`
- `veryLong-fil3N4m3___.XYs234c`

**OBS:** In order to consider the literal value of a regex symbol, use the backslash `\` _to escape it_.
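One pattern consistent with the examples is sketched below. It assumes that the rule about the first character being a letter refers to the extension (since `102_abc.d2` is listed as valid), and that "alphanumeric" means letters and digits only.

```python
import re

# name: one or more non-space characters
# extension: a dot, one letter, then zero or more letters/digits
regex = r"^\S+\.[a-zA-Z][a-zA-Z0-9]*$"

valid_names = ["image01.png", "audio.MP3", "102_abc.d2", "veryLong-fil3N4m3___.XYs234c"]
matched = [re.search(regex, name).group() for name in valid_names]
print(matched)

# extension starting with a digit, or extra text around the filename
print(re.search(regex, "image01.1png"))                 # None
print(re.search(regex, "My filename is: image01.png"))  # None
```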

In [None]:
match = re.search(regex, 'image01.png')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'audio.MP3')
print(match)
print(match.group())

In [None]:
match = re.search(regex, '102_abc.d2')
print(match)
print(match.group())

In [None]:
match = re.search(regex, 'veryLong-fil3N4m3___.XYs234c')
print(match)
print(match.group())

In [None]:
# invalid
match = re.search(regex, 'image01.1png')
print(match)

In [None]:
# invalid
match = re.search(regex, 'My filename is: image01.png')
print(match)

Our regex included the entire sentence.

### Exclusion

We can use the `^` symbol inside brackets `[]` to **exclude characters**. <br/>
Anything inside the brackets is excluded.

Ex: Removing punctuation:
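A sketch of one way to do this: collect every run of characters that are not `!`, `.`, `?`, or `,`, and join the runs back together. Joining with an empty string is an illustrative choice.

```python
import re

txt = 'What an amazing day! I am really happy today, my friends. And you?'

# one or more characters that are NOT ! . ? ,
regex = r"[^!.?,]+"

chunks = re.findall(regex, txt)
print(chunks)

clean_txt = "".join(chunks)
print(clean_txt)  # What an amazing day I am really happy today my friends And you
```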

In [None]:
txt = 'What an amazing day! I am really happy today, my friends. And you?'

In [None]:
# exclude ! . ? ,


### Multiple Options `()`

To consider **multiple options** for _matching_, we need to list the options inside `()`, separated by `|`.
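A sketch for the words `musical`, `musically`, and `musician`. The alternatives are tried from left to right, so the longer `ally` is listed before `al`; otherwise `musically` would only be matched up to `musical`.

```python
import re

txt_1 = 'They shared similar musical tastes.'
txt_2 = 'She laughed musically.'
txt_3 = 'Your father was a fine musician.'

# common prefix followed by one of the listed endings
regex = r"music(ally|al|ian)"

found = [re.search(regex, txt).group() for txt in (txt_1, txt_2, txt_3)]
print(found)  # ['musical', 'musically', 'musician']
```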

In [None]:
txt_1 = 'They shared similar musical tastes.'
txt_2 = 'She laughed musically.'
txt_3 = 'Your father was a fine musician.'

In [None]:
# pattern for: musical, musically, musician


In [None]:
match = re.search(regex, txt_1)
print(match)
print(match.group())

In [None]:
match = re.search(regex, txt_2)
print(match)
print(match.group())

In [None]:
match = re.search(regex, txt_3)
print(match)
print(match.group())

## Other useful RegEx functions
https://www.w3schools.com/python/python_regex.asp

### `split()`
Returns a list where the string has been split at each match. <br/>
Unlike the `split()` method of strings, it lets us define a _regex_ as the **separator** instead of a _fixed string/separator_.

##### Ex 1) Split by a comma or a semicolon followed by any number of spaces

In [None]:
txt = 'apples, oranges, bananas; grapes'
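A possible separator pattern for this example, assuming the separator is a comma or a semicolon followed by zero or more whitespace characters:

```python
import re

txt = 'apples, oranges, bananas; grapes'

# [,;] is either separator character, \s* absorbs the spaces after it
fruits = re.split(r"[,;]\s*", txt)
print(fruits)  # ['apples', 'oranges', 'bananas', 'grapes']
```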

##### Ex 2) Split by one or multiple hyphens

In [None]:
string = '1-2--3---4----5------6'

<br/>

Note the difference with the `split` of `string`.
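The comparison can be sketched as follows. The names `regex_parts` and `plain_parts` are illustrative.

```python
import re

string = '1-2--3---4----5------6'

# -+ treats each run of hyphens as a single separator
regex_parts = re.split(r"-+", string)
print(regex_parts)  # ['1', '2', '3', '4', '5', '6']

# str.split splits at every single hyphen, leaving empty strings
# between consecutive hyphens
plain_parts = string.split("-")
print(plain_parts)
```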

### `sub()`
Replaces one or many matches with a string. <br/>
Unlike the `replace()` method of strings, it lets us replace whatever matches a _regex_ instead of a _fixed substring_.

##### Ex 1) Replace one or multiple hyphens with a space

In [None]:
string = '1-2--3---4----5------6'
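A minimal sketch of this replacement with `re.sub(pattern, replacement, text)`:

```python
import re

string = '1-2--3---4----5------6'

# every run of one or more hyphens becomes a single space
spaced = re.sub(r"-+", " ", string)
print(spaced)  # 1 2 3 4 5 6
```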

##### Ex 2) Replace only the first occurrences of the regex for the previous string
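The `count` argument of `re.sub` limits how many matches are replaced. The sketch below assumes that "the first occurrences" means the first two; any other number works the same way.

```python
import re

string = '1-2--3---4----5------6'

# replace only the first two runs of hyphens (count=2 is an assumption)
partly_spaced = re.sub(r"-+", " ", string, count=2)
print(partly_spaced)  # 1 2 3---4----5------6
```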

# Keep studying
https://docs.python.org/3/howto/regex.html

# Exercises
https://www.w3resource.com/python-exercises/re/