# Regular expression

## Introduction
Regular expressions (also known as "regex") are powerful tools that provide a way to match patterns within strings of text. They offer an efficient way to search, replace, and parse text data based on patterns defined by a specific syntax.

A regular expression is a sequence of characters that forms a search pattern. It can be used to perform 'match' operations that check whether a string contains the specified search pattern.



## Regular Expression Patterns
Regular expressions use specific syntax to represent patterns:

`.`: Matches any character except newline.

`^`: Matches start of the string.

`$`: Matches end of the string.

`*`: Matches 0 or more repetitions.

`+`: Matches 1 or more repetitions.

`?`: Matches 0 or 1 repetitions.

`{n}`: Exactly n repetitions.

`{n,}`: n or more repetitions.

`{,n}`: Less than or equal to n repetitions.

`{m,n}`: Between m and n repetitions.

`\`: Escape special characters.

`[]`: Indicates a set of characters.

`|`: Means OR (matches any one of the alternative patterns separated by it).

`()`: Group sub-patterns.
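As a rough illustration of a few of the symbols listed above, the following sketch runs them through `re.findall()` and `re.search()`. The sample text and variable names are made up for this example.

```python
import re

text = "cat bat rat 3.14 3x14 aa aaa aaaa"

# `|` chooses between whole alternatives, not single characters.
alternatives = re.findall(r"cat|rat", text)
print(alternatives)  # ['cat', 'rat']

# `\.` matches a literal dot; an unescaped `.` also matches the 'x'.
literal_dot = re.findall(r"3\.14", text)
any_char = re.findall(r"3.14", text)
print(literal_dot)  # ['3.14']
print(any_char)     # ['3.14', '3x14']

# `{2,3}` allows 2 or 3 repetitions; `\b` keeps 'aaaa' from matching.
runs = re.findall(r"\ba{2,3}\b", text)
print(runs)  # ['aa', 'aaa']

# `()` groups a sub-pattern so that a quantifier applies to all of it.
grouped = re.search(r"(ab)+", "xababab").group()
print(grouped)  # ababab
```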

### Special Sequences

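`\d`: Matches any decimal digit; for ASCII text this is equivalent to the class `[0-9]` (Python 3 string patterns also match other Unicode digits unless the `re.ASCII` flag is used).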

`\D`: Matches any non-digit character.

`\s`: Matches any whitespace character.

`\S`: Matches any non-whitespace character.

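`\w`: Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class `[a-zA-Z0-9_]` (Python 3 string patterns also match Unicode letters and digits unless the `re.ASCII` flag is used).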

`\W`: Matches any non-alphanumeric character.
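The sketch below shows how each sequence and its uppercase complement divide up a sample string. It also shows that, for Python 3 string patterns, `\w` accepts accented letters unless the `re.ASCII` flag is passed. The sample strings are invented for illustration.

```python
import re

text = "Room 42, floor_3!"

digits = re.findall(r"\d+", text)      # runs of digits
non_digits = re.findall(r"\D+", text)  # runs of everything else
words = re.findall(r"\w+", text)       # letters, digits, underscore
non_words = re.findall(r"\W+", text)   # punctuation and whitespace

print(digits)      # ['42', '3']
print(non_digits)  # ['Room ', ', floor_', '!']
print(words)       # ['Room', '42', 'floor_3']
print(non_words)   # [' ', ', ', '!']

# Accented letters are written as escapes so the example is encoding-safe.
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, re.ASCII)
print(unicode_words)  # two words, accents included
print(ascii_words)    # ['na', 've', 'caf']
```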


Several online visual tools are available for checking the patterns we have written. One of them is https://regex101.com/. Paste in the text and a pattern, and the tool highlights which parts of the text match.

## Python's re Module
Python's built-in re module allows us to work with regular expressions. To use it, you need to import the module using the import statement:

In [2]:
import re


## Using the re Module
The re module offers functions like match(), search(), findall(), split(), sub(), which we'll dive into during this lecture.

### re.match()
The re.match() function tries to match the regular expression pattern at the beginning of the string, with optional flags. It returns a match object on success and `None` otherwise.

In [18]:
import re

pattern = r"Python"
string = "Python is amazing"

match = re.match(pattern, string)

if match:
    print("Match found!")
else:
    print("No match found.")


Match found!


Here, r"Python" is a raw string, the usual way to write regular expressions in Python: backslashes are kept as-is, so sequences such as \d reach the regex engine unchanged. The match() function only checks whether the RE matches at the beginning of the string.
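To make the "beginning of the string" rule concrete, this sketch tries the same string with a pattern that occurs at the start and one that occurs later. It also contrasts `re.match()` with `re.fullmatch()`, which these notes do not otherwise cover and which requires the whole string to match.

```python
import re

string = "Python is amazing"

at_start = re.match(r"Python", string)       # pattern begins at index 0
not_at_start = re.match(r"amazing", string)  # pattern occurs later

print(at_start is not None)  # True
print(not_at_start is None)  # True

# match() anchors only the start, so trailing text is allowed.
prefix_ok = re.match(r"Python", "Python3") is not None
# fullmatch() requires the pattern to cover the entire string.
whole_ok = re.fullmatch(r"Python", "Python3") is not None

print(prefix_ok)  # True
print(whole_ok)   # False
```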

### re.search()
The re.search() function is similar to match(), but it is not limited to the beginning of the string: it can locate a match anywhere in the string.

The function scans the string and stops at the first location where the pattern produces a match.

If the search is successful, `re.search()` returns a match object; if not, it returns `None`.

In [4]:
import re

pattern = r"amazing"
string = "Python is amazing"

match = re.search(pattern, string)

if match:
    print("Match found!")
else:
    print("No match found.")


Match found!
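The match object returned by `re.search()` carries more than a yes/no answer. A minimal sketch of the most commonly used accessors, reusing the string from the example above:

```python
import re

match = re.search(r"amazing", "Python is amazing")

if match is not None:
    print(match.group())  # amazing  (the matched text)
    print(match.start())  # 10       (index where the match begins)
    print(match.end())    # 17       (index just past the match)
    print(match.span())   # (10, 17)
```

Checking for `None` before calling these methods avoids an `AttributeError` when the pattern is not found.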


### re.findall()
The re.findall() function returns all non-overlapping matches of the RE pattern in string as a list of strings:

In [5]:
import re

pattern = r"\b\w{4}\b"
string = "This is a Python code"

matches = re.findall(pattern, string)

print(matches)  # ['This', 'code']


['This', 'code']


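The pattern \b\w{4}\b matches any run of exactly four word characters (letters, digits, or underscores) bounded by word boundaries, so here it picks out the four-letter words.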
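One behavior of `re.findall()` that often surprises people: when the pattern contains capturing groups, the function returns the group contents rather than the whole match. The sketch below compares the three cases on an invented string.

```python
import re

string = "x=1, y=22, z=333"

# No groups: each item is the whole match.
no_group = re.findall(r"\w=\d+", string)
# One group: each item is the text captured by that group.
one_group = re.findall(r"\w=(\d+)", string)
# Several groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w)=(\d+)", string)

print(no_group)    # ['x=1', 'y=22', 'z=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('x', '1'), ('y', '22'), ('z', '333')]
```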

### re.split()
The re.split() function splits the string by the occurrences of the pattern:

In [6]:
import re

pattern = r"\s"
string = "This is a Python code"

split = re.split(pattern, string)

print(split)  # ['This', 'is', 'a', 'Python', 'code']


['This', 'is', 'a', 'Python', 'code']
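Splitting on a single `\s` produces empty strings when separators repeat. The following sketch, using a made-up string with mixed separators, shows one way to handle that with a character set and `+`, and how the optional `maxsplit` argument limits the number of splits.

```python
import re

string = "This  is a,Python;code"

# Two consecutive spaces leave an empty string between them.
single = re.split(r"\s", string)
# A set of separators followed by + treats each run as one delimiter.
multi = re.split(r"[\s,;]+", string)
# maxsplit=2 stops after two splits; the rest stays in one piece.
limited = re.split(r"[\s,;]+", string, maxsplit=2)

print(single)   # ['This', '', 'is', 'a,Python;code']
print(multi)    # ['This', 'is', 'a', 'Python', 'code']
print(limited)  # ['This', 'is', 'a,Python;code']
```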


### re.sub()
The re.sub() function replaces occurrences of the RE pattern in string with repl. All occurrences are replaced unless the optional count argument is provided; count defaults to 0, which means no limit:

In [7]:
import re

pattern = r"Python"
repl = "JavaScript"
string = "This is a Python code"

new_string = re.sub(pattern, repl, string)

print(new_string)  # 'This is a JavaScript code'


This is a JavaScript code
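The `count` argument of `re.sub()` limits how many replacements are made, and the replacement string may refer back to captured groups with `\1`, `\2`, and so on. A small sketch with invented strings:

```python
import re

string = "one fish two fish red fish"

# count omitted: every occurrence is replaced.
all_replaced = re.sub(r"fish", "cat", string)
# count=1: only the first occurrence is replaced.
first_only = re.sub(r"fish", "cat", string, count=1)

print(all_replaced)  # one cat two cat red cat
print(first_only)    # one cat two fish red fish

# Backreferences reuse captured text: turn "key=value" into "value=key".
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "a=1 b=2")
print(swapped)  # 1=a 2=b
```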


Note: The pattern examples later in this section mostly use the findall() function from Python's re module, which returns all non-overlapping matches of the pattern in the string as a list of strings.

### re.subn()

`re.subn()` is similar to `re.sub()`, except that it returns a tuple of 2 items containing the new string and the number of substitutions made.

In [8]:

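# Program to remove all whitespace
import re

# The backslash at the end of the first line continues the literal,
# so 'abc 12' and 'de 23' are joined without a line break
string = 'abc 12\
de 23 \n f45 6'

# matches runs of whitespace characters (raw string keeps \s intact)
pattern = r'\s+'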

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

# Output: ('abc12de23f456', 4)


('abc12de23f456', 4)
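Because `re.subn()` returns a tuple, it is often convenient to unpack the two values directly. A minimal sketch on a different, made-up string:

```python
import re

# Three runs of whitespace: a space, a tab, and two newlines.
cleaned, substitutions = re.subn(r"\s+", "", "a b\tc\n\nd")

print(cleaned)        # abcd
print(substitutions)  # 3
```

The count reflects runs of whitespace rather than individual characters, because `\s+` consumes each run as a single match.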


Matching Any Single Character (.):

In [9]:
import re

pattern = r"a.b"  # Matches 'a', then any single character, then 'b'
string = "acb axb a2b a_b aeb a-zb"

matches = re.findall(pattern, string)
print(matches)  # Output: ['acb', 'axb', 'a2b', 'a_b', 'aeb'] ('a-zb' has two characters between 'a' and 'b')


['acb', 'axb', 'a2b', 'a_b', 'aeb']


In this case, the dot . matches any character except the newline character (\n).
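The newline exception can be lifted with the `re.DOTALL` flag. The sketch below compares both behaviors on a string that contains a newline between `a` and `b`.

```python
import re

string = "a\nb a b"

# By default the dot refuses to match the newline.
default = re.findall(r"a.b", string)
# With re.DOTALL the dot matches every character, newline included.
dotall = re.findall(r"a.b", string, re.DOTALL)

print(default)  # ['a b']
print(dotall)   # ['a\nb', 'a b']
```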

Matching Start and End of String (^ and $):

In [10]:
import re

pattern = r"^Hello"  # Matches strings that start with 'Hello'
string1 = "Hello world"
string2 = "world Hello"

print(bool(re.search(pattern, string1)))  # Output: True
print(bool(re.search(pattern, string2)))  # Output: False


True
False


The ^ character matches the start of the string. On the other hand, $ matches the end of the string.
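A matching sketch for `$`, plus the `re.MULTILINE` flag, which makes `^` and `$` apply at the start and end of every line rather than only the whole string. The sample strings are invented.

```python
import re

ends_with = bool(re.search(r"world$", "Hello world"))
not_at_end = bool(re.search(r"world$", "world Hello"))
print(ends_with)   # True
print(not_at_end)  # False

text = "first line\nsecond line"

# Without the flag, ^ matches only at the very start of the string.
string_start = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

print(string_start)  # ['first']
print(line_starts)   # ['first', 'second']
```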

Character Sets ([]):

In [11]:
import re

pattern = r"[A-Z]"  # Matches any uppercase letter
string = "Hello World"

matches = re.findall(pattern, string)
print(matches)  # Output: ['H', 'W']


['H', 'W']


Here, [A-Z] matches any uppercase letter.
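Character sets can also list individual characters, combine several ranges, or be negated with a leading `^`. A short sketch on an invented string:

```python
import re

string = "Hello World 2024!"

# An explicit list of characters.
vowels = re.findall(r"[aeiou]", string)
# Two ranges combined, repeated to collect whole words.
letters = re.findall(r"[A-Za-z]+", string)
# A leading ^ inside the brackets negates the set.
non_letters = re.findall(r"[^A-Za-z]", string)
# Most special characters lose their meaning inside a set.
punctuation = re.findall(r"[.!?]", string)

print(vowels)       # ['e', 'o', 'o']
print(letters)      # ['Hello', 'World']
print(non_letters)  # [' ', ' ', '2', '0', '2', '4', '!']
print(punctuation)  # ['!']
```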

Quantifiers (*, +, ?, {n}):

In [12]:
import re

pattern = r"a{3}"  # Matches 'a' repeated exactly 3 times
string = "aaaa baa aaa aa a aaa aaaa"

matches = re.findall(pattern, string)
print(matches)  # Output: ['aaa', 'aaa', 'aaa', 'aaa']


['aaa', 'aaa', 'aaa', 'aaa']


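Quantifiers define how many times a character, group, or character set should be matched. Note that `a{3}` is not anchored, so it also matches the first three characters of each `aaaa`, which is why the output contains four matches.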
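The other quantifiers can be compared side by side on one string. This sketch also shows how adding `\b` word boundaries restricts `a{3}` to runs of exactly three; the `ct cat caat caaat` string is made up for the comparison.

```python
import re

string = "ct cat caat caaat"

star = re.findall(r"ca*t", string)        # zero or more 'a'
plus = re.findall(r"ca+t", string)        # one or more 'a'
optional = re.findall(r"ca?t", string)    # zero or one 'a'
ranged = re.findall(r"ca{2,3}t", string)  # two or three 'a'

print(star)      # ['ct', 'cat', 'caat', 'caaat']
print(plus)      # ['cat', 'caat', 'caaat']
print(optional)  # ['ct', 'cat']
print(ranged)    # ['caat', 'caaat']

# Word boundaries exclude the longer runs of 'a'.
exact = re.findall(r"\ba{3}\b", "aaaa baa aaa aa a aaa aaaa")
print(exact)  # ['aaa', 'aaa']
```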

Special Sequences (\d, \w, \s, etc.):

In [13]:
import re

pattern = r"\d+"  # Matches one or more digits
string = "123 abc 456 def 789"

matches = re.findall(pattern, string)
print(matches)  # Output: ['123', '456', '789']


['123', '456', '789']


In this example, \d matches any digit, and + specifies one or more occurrences.
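The whitespace sequences work the same way. A brief sketch, using an invented string that mixes a tab, a double space, and a newline:

```python
import re

string = "tab\there  and\nnewline"

# \S+ collects runs of non-whitespace characters.
tokens = re.findall(r"\S+", string)
# \s+ collects the runs of whitespace between them.
gaps = re.findall(r"\s+", string)

print(tokens)  # ['tab', 'here', 'and', 'newline']
print(gaps)    # ['\t', '  ', '\n']
```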

Remember, these are just a few examples of regex patterns. Regular expressions are extremely flexible and powerful, capable of matching and manipulating complex patterns of text data.

Extracting email addresses:

In [19]:
import re

pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
string = "My emails are example@gmail.com and test123@domain.co.uk."

matches = re.findall(pattern, string)
print(matches)


['example@gmail.com', 'test123@domain.co.uk']


This regular expression matches most common email addresses, although it does not cover every address the email standards allow. The character set [A-Za-z0-9._%+-]+ matches the local part of the email, @ matches the at sign, and [A-Za-z0-9.-]+ matches the domain name, which may itself contain dots. Finally, \.[A-Za-z]{2,} matches a period followed by a top-level domain of at least two letters (like .com, or the .uk in .co.uk). Note that | has no special meaning inside a character set, so writing [A-Z|a-z] would also accept a literal | character.


Extracting dates in the YYYY-MM-DD format:

In [20]:
import re

pattern = r"\b\d{4}-\d{2}-\d{2}\b"
string = "The dates are 2023-07-08, 2023.10.20,  and 2021-05-12."

matches = re.findall(pattern, string)
print(matches)  # Output: ['2023-07-08', '2021-05-12']


['2023-07-08', '2021-05-12']


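This regex pattern matches any text shaped like a date in the YYYY-MM-DD format. \d{4} matches a four-digit year, followed by a hyphen; each \d{2} then matches a two-digit month or day, separated by a hyphen. The pattern checks only the digit layout, so it would also accept an impossible date such as 2023-99-99, and it skips 2023.10.20 because that uses dots instead of hyphens.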
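Since the pattern only constrains the digit layout, a second step is needed if impossible dates should be rejected. One possible approach, sketched here, passes each candidate to `datetime.strptime()` from the standard library. The helper name `is_real_date` and the sample string are assumptions made for this illustration.

```python
import re
from datetime import datetime

pattern = r"\b\d{4}-\d{2}-\d{2}\b"
string = "Candidates: 2023-07-08, 2023-13-45, 2021-05-12."


def is_real_date(text):
    """Return True if text is a valid calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Step 1: the regex finds everything with the right shape.
shaped = re.findall(pattern, string)
# Step 2: the calendar check drops month 13 / day 45.
real = [d for d in shaped if is_real_date(d)]

print(shaped)  # ['2023-07-08', '2023-13-45', '2021-05-12']
print(real)    # ['2023-07-08', '2021-05-12']
```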

Validating a Password:

Let's say you want to check if a password string is strong. For this example, a strong password is one that contains at least 8 characters, has both uppercase and lowercase characters, has at least one digit, and at least one special character.

In [16]:
import re

pattern = r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*#?&])[A-Za-z\d@$!#%*?&]{8,}$"
password = "Password123!"

match = re.match(pattern, password)
if match:
    print("Strong password.")
else:
    print("Weak password.")


Strong password.
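Each `(?=...)` in the pattern is a lookahead: it checks that something appears later in the string without consuming any characters. To show what each part contributes, the sketch below spells out roughly the same rules as separate checks. The function name `is_strong` is hypothetical. The two versions agree on the sample passwords here, but they are not guaranteed to be identical in every edge case (for example, `$` also matches just before a trailing newline).

```python
import re

pattern = r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*#?&])[A-Za-z\d@$!#%*?&]{8,}$"


def is_strong(password):
    """Apply the rules of the combined pattern one check at a time."""
    checks = [
        len(password) >= 8,                                     # {8,}
        re.search(r"[A-Z]", password) is not None,              # (?=.*[A-Z])
        re.search(r"[a-z]", password) is not None,              # (?=.*[a-z])
        re.search(r"\d", password) is not None,                 # (?=.*\d)
        re.search(r"[@$!%*#?&]", password) is not None,         # special char
        re.fullmatch(r"[A-Za-z\d@$!#%*?&]+", password) is not None,  # allowed set
    ]
    return all(checks)


samples = ["Password123!", "password123!", "Pass1!", "Password123", "Password 123!"]
for sample in samples:
    # Print the step-by-step verdict next to the single-pattern verdict.
    print(sample, is_strong(sample), bool(re.match(pattern, sample)))
```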


Matching a Hexadecimal Color Code:

Hexadecimal color codes are used in web development and design to specify colors. They consist of a hash sign # followed by 6 hexadecimal digits, or 3 digits in the shorthand form (for example #FFF). The pattern below accepts both lengths.

In [17]:
import re

pattern = r"^#(?:[0-9a-fA-F]{3}){1,2}$"
color_code = "#1F1F1F"

match = re.match(pattern, color_code)
if match:
    print("Valid color code.")
else:
    print("Invalid color code.")


Valid color code.
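Running the same pattern over a handful of candidates shows which shapes it accepts. The group `(?:[0-9a-fA-F]{3})` matches three hex digits, and `{1,2}` allows that group once or twice, giving 3 or 6 digits in total. The candidate list is made up for this sketch.

```python
import re

pattern = r"^#(?:[0-9a-fA-F]{3}){1,2}$"
candidates = ["#1F1F1F", "#fff", "#12345", "1F1F1F", "#GGGGGG"]

# Map each candidate to True/False depending on whether it matches.
results = {code: bool(re.match(pattern, code)) for code in candidates}

for code, valid in results.items():
    print(code, valid)
# #1F1F1F True
# #fff True
# #12345 False   (5 digits is neither 3 nor 6)
# 1F1F1F False   (missing the leading #)
# #GGGGGG False  (G is not a hexadecimal digit)
```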


# Exercise

https://web.stanford.edu/~jurafsky/slp3/2.pdf

1. All three-letter words
2. Dollar amounts like $8 and $3.99
3. Two consecutive repeated words
4. All single-syllable words