# Regex (Regular Expressions)

Regular expressions (often called regex for short) allow a user to search **strings** using almost any sort of **rule** they can come up with, for example finding all capital letters in a string, or finding a phone number in a document.

Regular expressions are notorious for their seemingly **complex syntax**. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to **describe almost any string pattern** you can imagine, which is why the pattern format itself is so dense.

In Python, regex support lives in the built-in `re` module.

More **regex** information is available in the official **Python documentation**: [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)

In [1]:
# Importing a Python Regex library:
import re

## Simple Regex pattern searches in strings

In [5]:
# Example 01 - searching for a word in a string:

string = "The agent's phone number is (555)-123-4567"
pattern = 'phone'

# Using Python's 'in' operator:
print('phone' in string)

# Using Regex search (the match object also reports the span, i.e. the start and end indexes in the string):
match = re.search(pattern, string)
match

True


<re.Match object; span=(12, 17), match='phone'>

In [6]:
# Outputting other valuable information - span, start and end:
print( match.span() )
print( match.start() )
print( match.end() )

(12, 17)
12
17


In [8]:
# Example 02 - pattern occurring more than once:
string = "my phone is a new phone"
match = re.search('phone', string)
match  ## This outputs only the first occurrence.

<re.Match object; span=(3, 8), match='phone'>

In [16]:
matches = re.findall('phone', string)
print('Occurrences:', matches)
print('Number of occurrences:', len(matches))

Occurrences: ['phone', 'phone']
Number of occurrences: 2


In [17]:
# Iterating through all the occurrences:
for match in re.finditer('phone', string):
    print(match.span())

(3, 8)
(18, 23)


In [18]:
# Returning only the matched text (here 'match' is the last match from the loop above):
match.group()

'phone'
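
One detail the examples above do not show is what happens when nothing matches: `re.search` returns `None`, so calling `.span()` or `.group()` on the result would raise an error. The sketch below guards against that; `find_span` is a hypothetical helper name used only for illustration.

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None when nothing matches.

    Hypothetical helper, not part of the re module.
    """
    match = re.search(pattern, text)
    if match is None:  # re.search returns None when the pattern is not found
        return None
    return match.span()

found = find_span('phone', "my phone is a new phone")
missing = find_span('tablet', "my phone is a new phone")

print(found)    # (3, 8)
print(missing)  # None
```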

## Regex **Patterns**

| Character | Description | Example Pattern Code | Example Match |
| -- | -- | -- | -- |
| \d | A digit | file_\d\d | file_25 |
| \w | Alphanumeric (letter, digit, or underscore) | \w-\w\w\w | A-b_1 |
| \s | Whitespace | a\sb\sc | a b c |
| \D | A non-digit | \D\D\D | ABC |
| \W | Non-alphanumeric | \W\W\W\W\W | *-+=) |
| \S | Non-whitespace | \S\S\S\S | Yoyo |

In [38]:
# Example 01 - find a digit pattern:
phone_number = '555-123-4567'
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
match = re.search(pattern, phone_number)
print(match)
print('The match:', match.group())

<re.Match object; span=(0, 12), match='555-123-4567'>
The match: 555-123-4567
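
The remaining character classes from the table can be tried the same way. This is a small sketch on a made-up sample string (`sample` is an assumption, not taken from the examples above).

```python
import re

sample = "Room 42, floor_B!"  # made-up text for illustration

digits = re.findall(r'\d', sample)       # every single digit
word_runs = re.findall(r'\w+', sample)   # runs of letters, digits, and underscores
spaces = re.findall(r'\s', sample)       # every whitespace character
non_word = re.findall(r'\W', sample)     # everything that is not \w

print(digits)     # ['4', '2']
print(word_runs)  # ['Room', '42', 'floor_B']
print(spaces)     # [' ', ' ']
print(non_word)   # [' ', ',', ' ', '!']
```

Note that the underscore counts as a `\w` character, which is why `floor_B` stays in one piece.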


## Regex **Quantifiers**

| Character | Description | Example Pattern Code | Example Match |
| -- | -- | -- | -- |
| + | Occurs one or more times | Version \w-\w+ | Version A-b1_1 |
| {3} | Occurs exactly 3 times | \D{3} | abc |
| {2,4} | Occurs 2 to 4 times | \d{2,4} | 123 |
| {3,} | Occurs 3 or more times | \w{3,} | anycharacters |
| \* | Occurs zero or more times | A\*B\*C\* | AAACC |
| ? | Occurs once or not at all | plurals? | plural |

In [41]:
# The above example, just with quantifiers (Example 01 - find a digit pattern):
phone_number = '555-123-4567'
pattern = r'\d{3}-\d{3}-\d{4}'
match = re.search(pattern, phone_number)
print(match)
print('The match:', match.group())

<re.Match object; span=(0, 12), match='555-123-4567'>
The match: 555-123-4567
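
To see how the other quantifiers from the table behave, here is a short sketch on made-up strings. It also illustrates that quantifiers are greedy by default, i.e. they take as many characters as the rule allows.

```python
import re

# ? makes the preceding character optional
plurals = re.findall(r'plurals?', 'plural and plurals')

# {2,4} takes between two and four digits, as many as it can get
chunks = re.findall(r'\d{2,4}', '1 22 333 4444 55555')

# + needs at least one 'b', * is also satisfied with none
one_or_more = re.findall(r'ab+', 'a ab abb')
zero_or_more = re.findall(r'ab*', 'a ab abb')

print(plurals)       # ['plural', 'plurals']
print(chunks)        # ['22', '333', '4444', '5555']
print(one_or_more)   # ['ab', 'abb']
print(zero_or_more)  # ['a', 'ab', 'abb']
```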


## Regex **Groups**

**Groups** let you match a whole pattern while also capturing its **sub-patterns**, i.e. **without** having to **write a separate pattern** for each sub-search.

Groups are created by wrapping part of the pattern in parentheses `()`. Using `re.compile` is optional: it only pre-compiles the pattern into a reusable object.

In [47]:
phone_number = '555-123-4567'
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

results = phone_pattern.search(phone_number)
print(results)

# The whole match (group 0):
print('Whole match:', results.group() )

# The group 1:
print('Group 1:', results.group(1) )

# The group 2:
print('Group 2:', results.group(2) )

# The group 3:
print('Group 3:', results.group(3) )

<re.Match object; span=(0, 12), match='555-123-4567'>
Whole match: 555-123-4567
Group 1: 555
Group 2: 123
Group 3: 4567
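
As a possible extension of the example above, all groups can be fetched at once with `.groups()`, and groups can be given names with the `(?P<name>...)` syntax. The group names `area`, `prefix`, and `line` below are assumptions chosen for illustration.

```python
import re

# Same phone pattern, but each group gets a (hypothetical) name.
phone_pattern = re.compile(r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})')
result = phone_pattern.search('Call 555-123-4567 today')

parts = result.groups()        # all captured groups as a tuple
area = result.group('area')    # access a group by name instead of by number
as_dict = result.groupdict()   # mapping of group name to captured text

print(parts)    # ('555', '123', '4567')
print(area)     # 555
print(as_dict)  # {'area': '555', 'prefix': '123', 'line': '4567'}
```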


## Regex **Additional operators**

### Logical operator **OR**

In [48]:
re.search(r"man|woman", "This man was here.")

<re.Match object; span=(5, 8), match='man'>
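
A short sketch of how the alternatives are tried may help here: at each position the engine tests the options from left to right and takes the first one that fits. The strings below are made up for illustration.

```python
import re

# At the position of 'w', 'man' fails and 'woman' succeeds.
first = re.search(r'man|woman', 'This woman was here.').group()

# findall collects every non-overlapping hit.
all_hits = re.findall(r'man|woman', 'A man and a woman.')

# Order matters when options share a beginning: 'cat' wins before 'catfish' is tried.
short_first = re.search(r'cat|catfish', 'catfish').group()
long_first = re.search(r'catfish|cat', 'catfish').group()

print(first)        # woman
print(all_hits)     # ['man', 'woman']
print(short_first)  # cat
print(long_first)   # catfish
```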

### Regex **Wildcard characters**

The wildcard `.` matches any single character (except a newline, by default) at its position in the pattern.

In [50]:
# Example 01 - find all occurrences of any character followed by "at", i.e. cat, hat, sat ...:
re.findall(r".at", "The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [52]:
# Example 02 - using more than one wildcard:
re.findall(r"...at", "The bat went splat")

['e bat', 'splat']

In [54]:
# Previous example with only words 
# (here \S+ stands for 'one or more non-whitespace characters'):
re.findall(r'\S+at', "The bat went splat")

['bat', 'splat']
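
Because `.` is a wildcard, matching a literal period requires escaping it as `\.`. The following sketch, on a made-up string, contrasts the two.

```python
import re

text = 'v1.2 and 3x4'  # made-up text for illustration

# Unescaped dot: any character may sit between the two digits.
wildcard_hits = re.findall(r'\d.\d', text)

# Escaped dot: only a literal period is accepted.
literal_hits = re.findall(r'\d\.\d', text)

print(wildcard_hits)  # ['1.2', '3x4']
print(literal_hits)   # ['1.2']
```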

### **Starts with** and **Ends With**

- **Starts with** - **^**
- **Ends with** - **$**
- Both anchors apply to the entire string, not to individual words!

In [56]:
# Starts with a Number:
re.findall(r'^\d','1 is the loneliest number.')

['1']

In [57]:
# Ends with a Number:
re.findall(r'\d$','This ends with a number 2')

['2']
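
To illustrate that the anchors refer to the whole string rather than to words inside it, here is a small sketch on made-up strings. Combining both anchors is a common way to check that an entire string fits a pattern.

```python
import re

# The digits are in the middle of the string, so ^ finds nothing.
starts = re.findall(r'^\d', 'Route 66 starts here')

# Only the very last character is tested by $.
ends = re.findall(r'\d$', 'Route 66')

# ^ and $ together: the whole string must consist of digits.
all_digits = re.search(r'^\d+$', '12345') is not None
not_all_digits = re.search(r'^\d+$', '123a45') is not None

print(starts)          # []
print(ends)            # ['6']
print(all_digits)      # True
print(not_all_digits)  # False
```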

### **Brackets** for **Exclusion**

- To exclude characters, we can use the **^** symbol as the first character inside a set of brackets **[]**.
- Such a set matches any single character that is **not** listed inside the brackets.

In [59]:
# Example 01 - find all characters that are not digits:
phrase = "There are 3 numbers 34 inside 5 this sentence."
re.findall(r'[^\d]', phrase)

['T',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

In [61]:
# Above example with characters grouped into runs (not single characters), using the + sign for one or more occurrences:
re.findall(r'[^\d]+', phrase)

['There are ', ' numbers ', ' inside ', ' this sentence.']

In [62]:
# Example 02 - removing punctuation:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

re.findall('[^!.? ]+', test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [66]:
# Above example with the whole sentence together, just without punctuation, by
# joining the occurrences with whitespace:
clean = ' '.join(re.findall('[^!.? ]+', test_phrase))
clean

'This is a string But it has punctuation How can we remove it'
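
An alternative sketch for the same task uses `re.sub`, which replaces every match with a given string. This is not the approach used above, just another option that avoids the split-and-join step.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every '!', '.' or '?' with an empty string; spaces are left untouched.
clean = re.sub(r'[!.?]', '', test_phrase)

print(clean)  # This is a string But it has punctuation How can we remove it
```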

### **Brackets** for **Grouping**

We can use **brackets** `[]` to **group together options**: a bracketed set matches any single character listed inside it.

In [68]:
# Example 01 - finding hyphenated words:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

re.findall(r'[\w]+-[\w]+', text)

['hyphen-words', 'long-ish']
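
Brackets can also list individual characters or ranges such as `a-f` or `0-9`. The sketch below uses made-up strings to show both forms.

```python
import re

# A set of explicit characters: any single vowel.
vowels = re.findall(r'[aeiou]', 'regular expression')

# Ranges inside a set: runs of digits or the letters a to f.
hex_like = re.findall(r'[0-9a-f]+', 'ff 1a2b xyz')

print(vowels)    # ['e', 'u', 'a', 'e', 'e', 'i', 'o']
print(hex_like)  # ['ff', '1a2b']
```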

### **Parentheses** for Multiple Options

If we have **multiple options** for **matching**, we can use **parentheses** together with `|` to **list out** these **options**.

In [69]:
# Example 01 - finding words that start with cat and end with one of these options: 
# 'fish','nap', or 'claw':
text = 'Hello, would you like some catfish?'

re.search(r'cat(fish|nap|claw)', text)

<re.Match object; span=(27, 34), match='catfish'>
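
One thing to watch for with this pattern: when a pattern contains a capturing group, `re.findall` returns the group contents rather than the whole match. A non-capturing group `(?:...)` keeps the whole match instead. The sketch below uses a made-up sentence.

```python
import re

text = 'catfish, catnap and catclaw'  # made-up text for illustration

# With a capturing group, findall returns only what the group captured.
suffixes = re.findall(r'cat(fish|nap|claw)', text)

# With a non-capturing group, findall returns the whole match.
whole_words = re.findall(r'cat(?:fish|nap|claw)', text)

# On a match object, group(1) tells which option was chosen.
chosen = re.search(r'cat(fish|nap|claw)', text).group(1)

print(suffixes)     # ['fish', 'nap', 'claw']
print(whole_words)  # ['catfish', 'catnap', 'catclaw']
print(chosen)       # fish
```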