## Raw Strings

In [None]:
print("Hello\nWorld")

In [None]:
print(r"Hello\nWorld")

We will use the `re.compile()` function from the `re` module. `re.compile(pattern)` converts a regular expression pattern into a regular expression object. This lets us save regular expressions as objects and reuse them later for pattern matching through methods such as `.match()`, `.search()`, `.findall()`, and `.finditer()`.
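
As a minimal sketch of how one compiled pattern can be reused, the cell below calls each of the four methods on a made-up string. The `sample` text and the variable names are illustrative assumptions and do not come from `Data/email.txt`.

```python
import re

# Hypothetical sample text, used only for illustration
sample = "the cat sat on the mat"

# Compile once, then reuse the pattern object
regex = re.compile(r'[a-z]at')

at_start = regex.match(sample)            # only tries to match at position 0
first_anywhere = regex.search(sample)     # first match anywhere in the string
all_strings = regex.findall(sample)       # list of matched substrings
all_spans = [m.span() for m in regex.finditer(sample)]  # finditer yields match objects

print(at_start)                 # None, because the string starts with "the"
print(first_anywhere.group())   # cat
print(all_strings)              # ['cat', 'sat', 'mat']
print(all_spans)                # [(4, 7), (8, 11), (19, 22)]
```

`.match()` returns `None` here because it is anchored to the start of the string, while `.search()` scans forward until it finds a match.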

### List Comprehension vs Generator Expression

In [None]:
name = "John Smith"
listA = [c for c in name if c not in ['i', ' ']]
listA

In [None]:
name = "John Smith"
listB = (c for c in name if c not in ['i', ' '])
listB

In [None]:
next(listB)

In [None]:
next(listB)

In [None]:
for i in listB:
    print(i, end=" ")

### Regular Expressions

In [None]:
import re

In [None]:
with open('Data/email.txt') as f:
    content = f.read()

In [None]:
print(content)

In [None]:
regex = re.compile('A')
matches = regex.finditer(content)
for match in matches:
    print(match)

### Metacharacters

`. ^ $ * + ? { } [ ] \ | ( )`


#### `\`

- `\d` - Matches any decimal digit; this is equivalent to the set `[0-9]`

- `\D` - Matches any non-digit character; this is equivalent to the set `[^0-9]`

- `\s` - Matches any whitespace character; this is equivalent to the set `[ \t\n\r\f\v]`

- `\S` - Matches any non-whitespace character; this is equivalent to the set `[^ \t\n\r\f\v]`

- `\w` - Matches any alphanumeric character and the underscore; this is equivalent to the set `[a-zA-Z0-9_]`

- `\W` - Matches any non-alphanumeric character; this is equivalent to the set `[^a-zA-Z0-9_]`
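
To see the six shorthand classes side by side, the sketch below runs each one against a short made-up string. The `sample` string and the `results` dictionary are illustrative assumptions.

```python
import re

# Hypothetical sample string, not taken from Data/email.txt
sample = "Room 42_B\tok!"

# Map each shorthand to the list of characters it matches in the sample
shorthands = [r'\d', r'\D', r'\s', r'\S', r'\w', r'\W']
results = {s: re.findall(s, sample) for s in shorthands}

for shorthand, found in results.items():
    print(shorthand, found)
```

Each upper-case class is the complement of its lower-case partner, so the two lists together always cover every character in the string. For `str` patterns these classes also match non-ASCII digits, letters, and whitespace unless `re.ASCII` is passed, so the bracketed sets in the list describe the ASCII-only behaviour.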


In [None]:
regex = re.compile(r'\d')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
print(counter)

In [None]:
regex = re.compile(r'\D')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
print(counter)

In [None]:
regex = re.compile(r'\s')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
print(counter)

In [None]:
regex = re.compile(r'\S')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
print(counter)

In [None]:
regex = re.compile(r'\w')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
print(counter)

In [None]:
regex = re.compile(r'\W')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
print(counter)

### Finding Phone Numbers

In [None]:
regex = re.compile(r'\(\d\d\d\)-\d\d\d-\d\d\d\d')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
regex = re.compile(r'\(\d\d\d\)\d\d\d-\d\d\d\d')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
regex = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\d')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
content[810:822]

### Character Sets

`{} []`
* The sequence `{m}` specifies that exactly m copies of the previous regular expression should be matched. For example, `\d{3}` matches exactly 3 copies of `\d`, so it is equivalent to `\d\d\d`.
* Character sets are specified using the `[]` metacharacters and indicate a set of characters, any one of which may match at that position.
* `.` is a wildcard: by default it matches any single character except a newline.
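
The difference between the wildcard and a character set is easiest to see on a small example. The sketch below uses made-up phone-like strings; `sample` and the variable names are assumptions for illustration only.

```python
import re

# Hypothetical phone-like strings for illustration
sample = "555-123-4567 555.123.4567 555*123*4567"

# \d{3} is shorthand for \d\d\d, so both patterns find the same text
braces = re.findall(r'\d{3}-\d{3}-\d{4}', sample)
spelled_out = re.findall(r'\d\d\d-\d\d\d-\d\d\d\d', sample)

# The unescaped dot is a wildcard: it accepts '-', '.', and '*' as separators
wildcard = re.findall(r'\d{3}.\d{3}.\d{4}', sample)

# A character set restricts the separator to '-' or '.'
restricted = re.findall(r'\d{3}[-.]\d{3}[-.]\d{4}', sample)

print(braces == spelled_out)  # True
print(wildcard)               # all three numbers
print(restricted)             # only the '-' and '.' versions
```

Inside `[]` the dot loses its wildcard meaning, so `[-.]` matches a literal hyphen or period without any escaping.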

In [None]:
regex = re.compile(r'\d{3}.\d{3}.\d{4}')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
regex = re.compile(r'\d{3}[)\.-]+\d{3}.\d{4}')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

### Finding Complex Patterns

* What if a character is optional? We can express this with the `?` metacharacter. The `?` matches 0 or 1 repetitions of the preceding regular expression. For example, the regular expression `ab?` will match either `a` or `ab`. In other words, the `?` after the `b` indicates that the `b` after the `a` is optional.

* The `*` metacharacter matches 0 or more repetitions of the preceding regular expression, taking as many repetitions as possible. For example, the regular expression `ab*` will match `a`, or `a` followed by any number of `b`'s, such as `ab` or `abbbbb`.

* The `+` metacharacter matches 1 or more repetitions of the preceding regular expression. For example, the regular expression `ab+` will match `ab` or `abbbbb`, but not `a` on its own.
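
The three quantifiers `?`, `*`, and `+` can be compared on the same inputs. This is a small sketch with made-up test strings; it uses `re.fullmatch()` so that a string counts only when the whole string fits the pattern.

```python
import re

# Hypothetical test strings
candidates = ["a", "ab", "abbbbb"]

# fullmatch requires the entire string to match the pattern
optional = [s for s in candidates if re.fullmatch(r'ab?', s)]  # 0 or 1 'b'
star = [s for s in candidates if re.fullmatch(r'ab*', s)]      # 0 or more 'b'
plus = [s for s in candidates if re.fullmatch(r'ab+', s)]      # 1 or more 'b'

print(optional)  # ['a', 'ab']
print(star)      # ['a', 'ab', 'abbbbb']
print(plus)      # ['ab', 'abbbbb']
```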

In [None]:
regex = re.compile(r'[)\.-]?\d{3}[)\.-]+\d{3}.\d{4}')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

### Flags

The `re.compile(pattern, flags)` function has a `flags` parameter that allows more flexible matching. For example, the `re.IGNORECASE` flag performs case-insensitive matching. The cells below search `content` for the name Henry, first with two case-sensitive patterns and then with `re.IGNORECASE`.
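
Because the contents of `Data/email.txt` are not shown here, the following sketch illustrates the effect of the flag on a made-up string. The `sample` text is an assumption for illustration.

```python
import re

# Hypothetical sample text; the casing variants are made up
sample = "Henry met HENRY and henry."

case_sensitive = re.compile(r'Henry')
case_insensitive = re.compile(r'Henry', re.IGNORECASE)  # re.I is the short alias

sensitive_hits = case_sensitive.findall(sample)
insensitive_hits = case_insensitive.findall(sample)

print(sensitive_hits)    # ['Henry']
print(insensitive_hits)  # ['Henry', 'HENRY', 'henry']
```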

In [None]:
regex = re.compile(r'Henry')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
regex = re.compile(r'henry')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
regex = re.compile(r'Henry', re.IGNORECASE)

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

### Finding Emails

In [None]:
regex = re.compile(r'\w+\.\w+@\w+\.com')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
# (?:\.\w+)? makes a ".name" part optional, so both first.last@ and first@ match
regex = re.compile(r'\w+(?:\.\w+)?@\w+\.com')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

### Finding Dates

In [None]:
regex = re.compile(r'\d{2}/\d{2}/\d{4}')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1

In [None]:
# Month 01-12, day 00-39 (loose), then '20' followed by one or more digits 0-8
regex = re.compile(r'(?:0[1-9]|1[0-2])/[0-3][0-9]/20[0-8]+')

matches = regex.finditer(content)

counter = 0
for match in matches:
    print(match)
    counter+= 1