# Regular Expression

A regular expression, often shortened to regex, is a sequence of characters that defines a pattern. It is used to check whether a **pattern exists in a given text (string) or not**, or to extract the text that matches.

Regular expressions are used on the server side to validate the format of email addresses or passwords during registration, and to parse text data files in order to find, replace, or delete certain strings.

Very common use cases of regular expressions:
- Password validation
- Email validation
- Valid date format
- Empty string validation
- Phone number/Credit card number validation
- ...


![](https://imgs.xkcd.com/comics/regular_expressions.png)



In conclusion: regular expressions help in manipulating textual data, which is often a prerequisite for **data science projects involving text mining**.

# Python ``re`` module

The **re library** in Python provides several functions for working with regular expressions, which makes it a skill worth mastering. You will look closely at some of them in this tutorial.

### re.search

* **`re.search(pattern, string)`**
Scan through the string looking for the first location where the regular expression
pattern produces a match, and return a corresponding **match object**. Return `None` if
no position in the string matches the pattern.


In [2]:
import re

In [9]:
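pattern = r'Cookie'  # matching is case-sensitive by default

sequence = 'In this store we sell cookie'

match_obj = re.search(pattern, sequence)
print(match_obj)  # None: 'Cookie' with a capital C does not occur in the sequence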


None


In [11]:
pattern = r'cookie'
sequence = 'In this Cookie store we sell cookie'
match_obj = re.search(pattern, sequence)
print(match_obj)

<re.Match object; span=(29, 35), match='cookie'>


Calling **.group()** on the **match object** returns the **part of the string** that matched.

In [12]:
match_obj.group()

'cookie'

**.span()** returns a tuple containing the start and end positions of the match.

In [13]:
match_obj.span()

(29, 35)
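
Because `re.search` returns `None` when nothing matches, calling `.group()` directly on the result raises an `AttributeError` in that case. The following is a minimal sketch of a guarded lookup. The helper name `find_first` is hypothetical (it is not part of the `re` module) and is only used here for illustration.

```python
import re

def find_first(pattern, text):
    """Hypothetical helper: return (matched_text, start, end) or None."""
    match_obj = re.search(pattern, text)
    if match_obj is None:
        # No match: return early instead of calling .group() on None
        return None
    start, end = match_obj.span()
    # The end position is exclusive, so slicing reproduces .group()
    assert text[start:end] == match_obj.group()
    return match_obj.group(), start, end

sequence = 'In this Cookie store we sell cookie'
print(find_first(r'cookie', sequence))   # ('cookie', 29, 35)
print(find_first(r'biscuit', sequence))  # None
```

Checking for `None` before using the match object is the usual way to keep a notebook cell from failing on lines that do not contain the pattern.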

What is the ``r`` at the start of the pattern ``r'Cookie'``?
This is called a **raw string literal**. It changes how the string literal is interpreted: backslashes are kept as literal characters instead of starting [escape sequences](https://www.w3schools.com/python/gloss_python_escape_characters.asp) such as `\t` or `\n`. Regex syntax relies heavily on backslashes, so you should write patterns with the ``r`` prefix.

In [16]:
pattern='A word\tAnother word\nA new line'
print(pattern)

A word	Another word
A new line


In [17]:
pattern=r'A word\tAnother word\nA new line'
print(pattern)

A word\tAnother word\nA new line
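
The difference matters for regex patterns in particular. As a small sketch: in a normal string literal `'\b'` is the backspace character, while in a raw string it stays as backslash plus `b`, which the regex engine reads as a word boundary. The variable names below are illustrative only.

```python
import re

text = 'cookie jar'

plain = '\bcookie\b'   # Python turns each '\b' into a backspace character
raw = r'\bcookie\b'    # backslashes are kept, so regex sees word boundaries

print(len(plain), len(raw))    # 8 10
print(re.search(plain, text))  # None (the text contains no backspace)
print(re.search(raw, text))    # a match object for 'cookie'
```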


# Wildcard characters

The following table lists a few of these characters that are commonly useful:

|Character classes||Quantifiers & Alternation||
|--- |--- |--- |--- |
|`.`|any character except newline|`a*` `a+` `a?`|0 or more a / 1 or more a / 0 or 1 a|
|`\w` `\d` `\s`|word / digit / whitespace|`a{5}` `a{2,}`|exactly five / two or more|
|`\W` `\D` `\S`|not word / not digit / not whitespace|`a{1,3}`|between one & three|
|`[abc]`|any of a, b, or c|`a+?` `a{2,}?`|match as few as possible (non-greedy)|
|`[^abc]`|not a, b, or c|`(cat\|dog)`|match 'cat' or 'dog'|
|`[a-g]`|character between a & g|||
|**Anchors**||**Escaped characters**||
|`^abc$`|start / end of the string|`\.` `\*` `\\`|`\` escapes special characters, e.g. `\*` matches a literal `*`|
|`\b`|word boundary|`\t` `\n` `\r`|tab, linefeed, carriage return|



| Character | Description | Example |
|------------|-----------|------------|
| ? | Match zero or one repetitions of preceding |  "ab?" matches "a" or "ab" |
| * | Match zero or more repetitions of preceding | "ab*" matches "a", "ab", "abb", "abbb"... |
| + | Match one or more repetitions of preceding |  "ab+" matches "ab", "abb", "abbb"... but not "a" |
| {n} | Match exactly n repetitions of preceding | "ab{2}" matches "abb" |
| {m,n} | Match between m and n repetitions of preceding |  "ab{2,3}" matches "abb" or "abbb" |


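**Note**: `\w` (word character) matches any single letter, digit, or underscore. For ASCII text this is the same as `[a-zA-Z0-9_]`; with `str` patterns in Python 3 it also matches Unicode letters and digits unless the `re.ASCII` flag is used.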
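
The quantifier examples in the table above can be checked directly. This is a small sketch that uses `re.fullmatch`, which succeeds only when the entire string matches the pattern; the `candidates` list and the `results` dictionary are made up for illustration.

```python
import re

candidates = ['a', 'ab', 'abb', 'abbb', 'abbbb']
patterns = [r'ab?', r'ab*', r'ab+', r'ab{2}', r'ab{2,3}']

results = {}
for pattern in patterns:
    # Keep only the candidates that the pattern matches in full
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(f'{pattern:10} {results[pattern]}')

# Greedy vs. non-greedy: + takes as many b's as possible, +? as few as possible
greedy = re.search(r'ab+', 'abbb').group()
lazy = re.search(r'ab+?', 'abbb').group()
print(greedy, lazy)  # abbb ab
```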


# Use case: Phone number validation (US)


## Read a text file in Python

In [18]:
!wget -q -c https://raw.githubusercontent.com/anhquan0412/dataset/main/sample_text.txt

The '**!**' prefix is used in Jupyter Notebook or JupyterLab environments to run the rest of the line in the system shell (command-line interface) rather than interpreting it as Python code.

**wget**: retrieves files from the web using the HTTP, HTTPS, or FTP protocols.

**-q**: This option stands for "quiet" and tells wget to operate in silent mode, suppressing its output and progress indicators.

**-c**: This option stands for "continue" and instructs wget to resume an interrupted download, picking up from where it left off if the file is partially downloaded. If the complete file already exists locally, it is not downloaded again.


How can you read a text file line by line?

1. Using file readlines

In [19]:
with open('sample_text.txt') as f:
    sequences = f.readlines()

print(sequences)

['This is my phone number 2816837760.\n', 'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.\n', 'This is another phone format: 281-683-7760.\n']


In [20]:
# readlines() keeps the trailing newline character, so strip it from every line
sequences = [line.strip() for line in sequences]
print(sequences)

['This is my phone number 2816837760.', 'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.', 'This is another phone format: 281-683-7760.']


2. Without using readlines

In [21]:
with open('sample_text.txt') as f:
    sequences = []
    for line in f:
        sequences.append(line.strip())

In [22]:
sequences

['This is my phone number 2816837760.',
 'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.',
 'This is another phone format: 281-683-7760.']

## Design a regex pattern

In [23]:
pattern=r'\d\d\d\d\d\d\d\d\d\d'

In [24]:
sequences[0]

'This is my phone number 2816837760.'

In [25]:
match_obj = re.search(pattern,sequences[0])
match_obj.group()

'2816837760'

In [26]:
sequences[1]

'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.'

In [27]:
match_obj = re.search(pattern,sequences[1])
match_obj.group()

'2816837760'

Note that `re.search` scans through the string and returns only the **first matched location**, even though this line contains two phone numbers.

## re.findall

You can use `re.findall` to return all matches.


**`re.findall(pattern, string)`**
Return all **non-overlapping matches** of pattern in string, as a list of strings. The
string is scanned **left-to-right**, and matches are returned in the order found. If one
or more groups are present in the pattern, return a list of groups; this will be a list
of tuples if the pattern has more than one group. Empty matches are included in the result.


In [28]:
match_list = re.findall(pattern,sequences[1])
match_list

['2816837760', '2811234567']
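
The remark about groups in the `re.findall` description is easy to overlook. Groups are written with parentheses and are covered in more detail later in this tutorial. The sketch below, which reuses the second line of the sample file as a hard-coded string, shows how the shape of the result changes with the number of groups.

```python
import re

sequence = ('I am at 123 Main street, NY 10010, phone number is 2816837760, '
            'but I have an alternative: 2811234567.')

# No group: each item is the whole match
no_group = re.findall(r'\d{10}', sequence)
print(no_group)      # ['2816837760', '2811234567']

# One group: each item is only the text captured by that group
one_group = re.findall(r'(\d{3})\d{7}', sequence)
print(one_group)     # ['281', '281']

# Several groups: each item is a tuple with one entry per group
many_groups = re.findall(r'(\d{3})(\d{3})(\d{4})', sequence)
print(many_groups)   # [('281', '683', '7760'), ('281', '123', '4567')]
```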

The pattern works but is repetitive. We can write a more concise pattern using a quantifier.

In [30]:
pattern=r'\d{10}'

In [31]:
match_list = re.findall(pattern,sequences[0])
match_list

['2816837760']

In [32]:
match_list = re.findall(pattern,sequences[1])
match_list

['2816837760', '2811234567']

In [33]:
sequences[2]

'This is another phone format: 281-683-7760.'

In [34]:
match_list = re.findall(pattern,sequences[2])
match_list

[]

In [35]:
for line in sequences:
    match_list = re.findall(pattern, line)
    if match_list:  # an empty list is falsy, so lines without a match are skipped
        print(match_list)

['2816837760']
['2816837760', '2811234567']
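
The third line is skipped because `\d{10}` does not allow dashes. One possible way to cover both formats, shown here as a sketch rather than a complete US phone number validator, is to make each dash optional with `-?` and to add `\b` so that a match cannot start or end inside a longer run of digits. The helper `is_us_phone` is hypothetical and uses `re.fullmatch` to validate a whole string instead of searching inside it.

```python
import re

sequences = [
    'This is my phone number 2816837760.',
    'I am at 123 Main street, NY 10010, phone number is 2816837760, '
    'but I have an alternative: 2811234567.',
    'This is another phone format: 281-683-7760.',
]

# -? makes each dash optional; \b is a word boundary on both sides
pattern = r'\b\d{3}-?\d{3}-?\d{4}\b'

found = [re.findall(pattern, line) for line in sequences]
for match_list in found:
    print(match_list)
# ['2816837760']
# ['2816837760', '2811234567']
# ['281-683-7760']

def is_us_phone(text):
    """Hypothetical validator: True if the whole string looks like a phone number."""
    return re.fullmatch(r'\d{3}-?\d{3}-?\d{4}', text) is not None

print(is_us_phone('281-683-7760'))  # True
print(is_us_phone('28168377601'))   # False (11 digits)
```

This pattern also accepts mixed forms such as `281683-7760`; whether that is acceptable depends on how strict the validation needs to be.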


## Group regex with parentheses

You can use parentheses to extract a sub-match (group) of a whole match. To extract this sub-match, call **.group(index)** or **.groups()** on the **match object**.

In [36]:
pattern=r'(\d\d\d)(\d\d\d)(\d\d\d\d)'

In [37]:
sequences[0]

'This is my phone number 2816837760.'

In [38]:
match_obj = re.search(pattern,sequences[0])
match_obj.group()

'2816837760'

In [39]:
match_obj.groups()

('281', '683', '7760')

In [42]:
# alternative

print(match_obj.group(1))
print(match_obj.group(2))
print(match_obj.group(3))

281
683
7760


## Replace phone numbers

**`re.sub(pattern, repl, string)`**
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern
in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.

In [43]:
sequences[1]

'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.'

In [44]:
pattern=r'\d{10}'

result = re.sub(pattern,r'REDACTED',sequences[1])
print(result)


I am at 123 Main street, NY 10010, phone number is REDACTED, but I have an alternative: REDACTED.


You can also refer to a sub-match (group) inside the replacement string with a backreference: **\position**
- \1: use group 1 to substitute
- \2: use group 2 to substitute
- ...


In [47]:
pattern=r'(\d\d\d)(\d\d\d)(\d\d\d\d)'

result = re.sub(pattern,r'\1',sequences[1])
print(result)

I am at 123 Main street, NY 10010, phone number is 281, but I have an alternative: 281.
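
Backreferences can also be combined with literal text in the replacement. As a sketch, again using the second sample line as a hard-coded string, the same grouped pattern can reformat every number with dashes or mask everything except the last four digits.

```python
import re

sequence = ('I am at 123 Main street, NY 10010, phone number is 2816837760, '
            'but I have an alternative: 2811234567.')
pattern = r'(\d{3})(\d{3})(\d{4})'

# Rebuild each number from its three groups, separated by dashes
formatted = re.sub(pattern, r'\1-\2-\3', sequence)
print(formatted)
# I am at 123 Main street, NY 10010, phone number is 281-683-7760, but I have an alternative: 281-123-4567.

# Keep only group 3 (the last four digits) and mask the rest
masked = re.sub(pattern, r'XXX-XXX-\3', sequence)
print(masked)
# I am at 123 Main street, NY 10010, phone number is XXX-XXX-7760, but I have an alternative: XXX-XXX-4567.
```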


# Practice time!!!

Regular expressions might be overwhelming, as you have to remember the wildcard characters and know how to apply them in your application. That is why there are a lot of resources to help you write and test your regular expression patterns. No matter which one you use, to use regular expressions well, you need to **practice**!

You will learn and practice regular expression using this website: [https://regexone.com/lesson/introduction_abcs](https://regexone.com/lesson/introduction_abcs)


And here is a tool to check your regular expression patterns: [https://regex101.com/](https://regex101.com/)

If you want to read more about regex, here is a good resource: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285