# Regular Expressions

This notebook introduces the basics of searching text with regular expressions in Python: the core `re` functions, the pattern identifiers, and worked examples.

Overview of contents:

1. Basic Search Functions
2. Patterns
    - 2.1 Identifiers & Quantifiers
    - 2.2 Groups
    - 2.3 OR Statements: `|`
    - 2.4 Wildcards: `.`
    - 2.5 Starts with and Ends with: `^,$`
    - 2.6 Exclusion: `[^]`
    - 2.7 Brackets for Grouping (Words): `[]+`
    - 2.8 Parentheses for Multiple Options
    - 2.9 Example: Find Emails

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided via a download link; I haven't found a repository to fork from.*

## 1. Basic Search Functions

In [78]:
# Example text
text = "The agent's phone number is 408-555-1234. Call soon!"

In [79]:
# Python's built-in `in` operator already checks for a literal substring
"408-555-1234" in text

True

In [80]:
# Python native library for regular expressions
import re

In [81]:
pattern = 'phone'

In [82]:
# re.search looks for the pattern in the text and returns a match object
# with many methods & attributes (or None if nothing matches)
match = re.search(pattern,text)

In [83]:
match.span()

(12, 17)

In [84]:
match.start()

12

In [85]:
match.end()

17

In [86]:
# Get the matched text
# In this case it's trivial, but with regular expression patterns
# we don't know the matched string in advance
match.group()

'phone'

In [87]:
# Several matches: finditer yields one match object per occurrence
text = "my phone is a new phone"
pattern = 'phone'
for match in re.finditer(pattern,text):
    print(match.span())

(3, 8)
(18, 23)


In [88]:
# findall returns the matched strings directly, without match objects
# This is most useful with real patterns rather than literal words
re.findall(pattern,text)

['phone', 'phone']
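One detail worth keeping in mind with the functions above: `re.search` returns `None` when the pattern is not found, so calling `.group()` on the result directly would raise an `AttributeError`. A minimal sketch of a guard follows; `find_first` is a hypothetical helper name used only for illustration, not part of the `re` module.

```python
import re

def find_first(pattern, text):
    """Return the first matched string, or None if there is no match."""
    match = re.search(pattern, text)
    # re.search returns None when nothing matches, so check before .group()
    return match.group() if match else None

found = find_first("phone", "my phone is a new phone")
missing = find_first("tablet", "my phone is a new phone")
print(found)    # phone
print(missing)  # None
```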

## 2. Patterns

### 2.1 Identifiers & Quantifiers

To define a pattern, we use a raw string:

    r'mypattern'
    r"mypattern"

Symbol types are written with a leading `\` (for example `\d`). The `r` prefix makes the string a raw string, so Python passes the backslash through to `re` unchanged instead of interpreting it as a string escape sequence such as `\n` or `\b`.
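To make the role of the `r` prefix concrete, here is a small sketch comparing a normal string with a raw string. It uses `\b`, which is a backspace character in a normal Python string but a word boundary in a regular expression; `\b` is not covered elsewhere in this notebook and serves only as an illustration.

```python
import re

# In a normal string, "\b" is the backspace character (a single character)
normal = "\bcat\b"
# In a raw string the backslash is kept, so re receives \b (word boundary)
raw = r"\bcat\b"

print(len(normal), len(raw))  # 5 7

text = "the cat sat"
print(re.findall(normal, text))  # []
print(re.findall(raw, text))     # ['cat']
```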

Typical character types, also known as **identifiers**:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (including underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>

<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>

<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
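The rows of the table above can be checked directly. The sketch below pairs each example pattern with its example match and uses `re.fullmatch`, which succeeds only if the whole string matches the pattern; the variable names are illustrative.

```python
import re

# (pattern, example match) pairs taken from the identifier table
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

for pattern, candidate in examples:
    # fullmatch returns a match object only if the entire string matches
    print(pattern, repr(candidate), bool(re.fullmatch(pattern, candidate)))

# A counter-example: 'a' is not a digit, so \d\d does not match '2a'
print(re.fullmatch(r"file_\d\d", "file_2a"))  # None
```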

When an identifier repeats, we can use **quantifiers**:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>

<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>

<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
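A short sketch of how the quantifiers in the table behave on runs of the same character. The `\b` word boundary is an addition not covered in this notebook; it is used here only so that each run of `a` is matched as a whole word.

```python
import re

runs = "a aa aaa aaaa aaaaa"

exactly_three = re.findall(r"\ba{3}\b", runs)
two_to_four = re.findall(r"\ba{2,4}\b", runs)
three_or_more = re.findall(r"\ba{3,}\b", runs)
one_or_more = re.findall(r"\ba+\b", runs)

print(exactly_three)  # ['aaa']
print(two_to_four)    # ['aa', 'aaa', 'aaaa']
print(three_or_more)  # ['aaa', 'aaaa', 'aaaaa']
print(one_or_more)    # ['a', 'aa', 'aaa', 'aaaa', 'aaaaa']

# ? makes the preceding character optional
print(re.findall(r"plurals?", "plural plurals"))  # ['plural', 'plurals']

# * allows zero occurrences, so the missing B is fine
print(bool(re.fullmatch(r"A*B*C*", "AAACC")))  # True
```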

In [21]:
text = "My telephone number is 408-555-1234"

In [22]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

In [23]:
phone = re.search(pattern,text)

In [25]:
# With patterns, we don't really know the content found
# We can access it with group()
phone.group()

'408-555-1234'

In [29]:
# Quantifiers: when an identifier repeats,
# we put the repeat count in curly braces
phone = re.search(r'\d{3}-\d{3}-\d{4}',text)

In [30]:
phone.group()

'408-555-1234'

### 2.2 Groups

In [42]:
# We can group pattern parts in () inside their definition
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [43]:
results = phone_pattern.search(text)

In [44]:
# The entire result
results.group()

'408-555-1234'

In [45]:
# We can also retrieve each group by position.
# Groups are delimited by parentheses ()
# Note that group numbering starts at 1;
# passing in 0 returns the entire match
results.group(1)

'408'

In [46]:
results.group(2)

'555'

In [47]:
results.group(3)

'1234'
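Besides positional access, the `re` module also supports `groups()` to get all groups at once and named groups via `(?P<name>...)`. A minimal sketch, assuming the same phone text as above; the group names `area`, `prefix` and `line` are hypothetical labels chosen for this example.

```python
import re

text = "My telephone number is 408-555-1234"

# Same pattern as before, but each group gets a name (names are illustrative)
phone_pattern = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")
match = phone_pattern.search(text)

print(match.groups())       # ('408', '555', '1234')
print(match.group("area"))  # 408
print(match.groupdict())    # {'area': '408', 'prefix': '555', 'line': '1234'}
```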

### 2.3 OR Statements: `|`

In [48]:
re.search(r"man|woman","This man was here.")

<re.Match object; span=(5, 8), match='man'>

In [49]:
re.search(r"man|woman","This woman was here.")

<re.Match object; span=(5, 10), match='woman'>
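In the second example above, `woman` is found even though `man` is listed first: the search scans from left to right, and at index 5 only the `woman` alternative matches. At a given position, however, the alternatives are tried in the order written and the first one that succeeds wins. The sketch below illustrates both points with made-up strings.

```python
import re

# The leftmost match wins, whichever alternative it comes from
both = re.findall(r"man|woman", "A man and a woman")
print(both)  # ['man', 'woman']

# At the same position, the first alternative that succeeds is used
short_first = re.search(r"a|ab", "ab").group()
long_first = re.search(r"ab|a", "ab").group()
print(short_first)  # a
print(long_first)   # ab
```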

### 2.4 Wildcards: `.`

In [50]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [54]:
# Note that . means exactly one character
re.findall(r".at","The bat went splat")

['bat', 'lat']

In [55]:
# Similarly, n dots match exactly n characters (spaces included)
re.findall(r"...at","The bat went splat")

['e bat', 'splat']

In [56]:
# Solution: \S+
# One or more non-whitespace characters followed by 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']
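Two further properties of the wildcard are worth a quick sketch: by default `.` does not match a newline unless the `re.DOTALL` flag is passed, and a literal dot has to be escaped as `\.`. The sample strings are made up for illustration.

```python
import re

text = "cat\nat"

# By default the dot matches any character except a newline
default_dot = re.findall(r".at", text)
# With re.DOTALL the dot also matches the newline
dotall = re.findall(r".at", text, re.DOTALL)
print(default_dot)  # ['cat']
print(dotall)       # ['cat', '\nat']

# An unescaped dot matches any character; \. matches only a literal dot
print(re.findall(r"1.5", "1.5 and 125"))   # ['1.5', '125']
print(re.findall(r"1\.5", "1.5 and 125"))  # ['1.5']
```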

### 2.5 Starts with and Ends with: `^,$`

In [57]:
# Ends with: $
# Here: the string ends with a digit
re.findall(r'\d$','This ends with a number 2')

['2']

In [58]:
# Starts with: ^
# Here: the string starts with a digit
re.findall(r'^\d','1 is the loneliest number.')

['1']
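Note that `^` and `$` refer to the start and end of the whole string by default. With the `re.MULTILINE` flag they also match at the start and end of each line. A small sketch with a made-up three-line string:

```python
import re

lines = "1 apple\n2 pears\nno number here"

# Default: ^ only matches at the very start of the string
start_default = re.findall(r"^\d", lines)
# MULTILINE: ^ matches at the start of every line
start_multi = re.findall(r"^\d", lines, re.MULTILINE)
print(start_default)  # ['1']
print(start_multi)    # ['1', '2']

# The same holds for $ and line endings
end_default = re.findall(r"\w+$", lines)
end_multi = re.findall(r"\w+$", lines, re.MULTILINE)
print(end_default)  # ['here']
print(end_multi)    # ['apple', 'pears', 'here']
```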

### 2.6 Exclusion: `[^]`

In [59]:
# To exclude characters, 
# we can use the ^ symbol in conjunction with a set of brackets []
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [61]:
# Match: everything except digits
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

In [62]:
# Keep runs of non-digit characters together: +
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

In [64]:
# We can use this to remove punctuation from a sentence
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [65]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [66]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))

In [67]:
clean

'This is a string But it has punctuation How can we remove it'
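An alternative to the `findall` plus `join` approach is `re.sub`, which replaces every match with a given string. This is a sketch of a different technique than the one used above, shown only for comparison; it produces the same result for this phrase.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each '!', '.' or '?' with an empty string; spaces are left untouched
clean_sub = re.sub(r'[!.?]', '', test_phrase)
print(clean_sub)
# This is a string But it has punctuation How can we remove it
```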

### 2.7 Brackets for Grouping (Words): `[]+`

In [68]:
text = 'Only find the hypen-words in this sentence. But you do not know how long-ish they are'

In [70]:
# We want to find words with a hyphen
# [\w]+: one or more alphanumeric characters (the brackets are optional here)
re.findall(r'[\w]+-[\w]+',text)

['hypen-words', 'long-ish']
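Brackets become more useful when they list several characters or ranges, since any one character from the set may match. A brief sketch with a made-up sentence:

```python
import re

orders = "Order A12 shipped, order b7 pending, order C345 lost"

# One uppercase letter followed by one or more digits
upper_codes = re.findall(r"[A-Z][0-9]+", orders)
# Ranges can be combined inside the same brackets
all_codes = re.findall(r"[A-Za-z][0-9]+", orders)
print(upper_codes)  # ['A12', 'C345']
print(all_codes)    # ['A12', 'b7', 'C345']

# An explicit set of characters: any single vowel
print(re.findall(r"[aeiou]", "regex"))  # ['e', 'e']
```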

### 2.8 Parentheses for Multiple Options

In [89]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [90]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [91]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [92]:
# No match: re.search returns None, so the cell shows no output
re.search(r'cat(fish|nap|claw)',textthree)
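One caveat when combining parentheses with `findall`: if the pattern contains a capturing group, `findall` returns the group contents rather than the whole match. A non-capturing group `(?:...)` keeps the alternation but returns the full match. A minimal sketch with a made-up sentence:

```python
import re

animals = "catfish, catnap and caterpillar"

# Capturing group: findall returns only what the group matched
captured = re.findall(r"cat(fish|nap|claw)", animals)
# Non-capturing group: findall returns the entire match
whole = re.findall(r"cat(?:fish|nap|claw)", animals)

print(captured)  # ['fish', 'nap']
print(whole)     # ['catfish', 'catnap']
```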

### 2.9 Example: Find Emails

In [96]:
text = "This is a nice email: name@service.com"

In [97]:
# Escape the dot (\.) so it matches a literal '.', not any character
re.findall(r'\w+@\w+\.\D{3}',text)

['name@service.com']
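The pattern above only covers simple addresses with a three-character ending. The sketch below is a somewhat more permissive variant; it is my own assumption for illustration, not the course's pattern, and it is far from a complete email validator. The sample addresses are invented.

```python
import re

# Local part: word characters, dots, plus and hyphen
# Domain: one or more labels separated by literal dots
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

message = "Contact name@service.com or first.last@mail.example.org today."
emails = email_pattern.findall(message)
print(emails)  # ['name@service.com', 'first.last@mail.example.org']
```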