Review [this guide](https://www.codingforentrepreneurs.com/blog/python-regular-expressions/) as a reference for this notebook.

The following are all valid written phone number formats:

- +1-555-555-3121
- 1-555-555-3121
- 555-555-3121
- +1(555)-555-3121
- +15555553121
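A minimal sketch, assuming we only care about the digits: stripping every non-digit character shows that all five formats carry the same ten-digit number. The names `formats`, `digits_only`, and `national` are made up for this illustration.

```python
import re

formats = [
    "+1-555-555-3121",
    "1-555-555-3121",
    "555-555-3121",
    "+1(555)-555-3121",
    "+15555553121",
]

# \D matches any non-digit character, so removing those leaves only digits.
digits_only = [re.sub(r"\D", "", f) for f in formats]

# Keep the last ten digits to ignore the optional country code.
national = {d[-10:] for d in digits_only}

print(digits_only)
print(national)  # {'5555553121'}
```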

In [1]:
my_phone_number = "555-867-5309"

In [2]:
numbers = []
for char in my_phone_number:
    number_val = None
    try:
        number_val = int(char)
    except ValueError:  # non-digit characters such as "-" are skipped
        pass
    if number_val is not None:
        numbers.append(number_val)

numbers_as_str = "".join([f"{x}" for x in numbers])
numbers_as_str

'5558675309'

In [3]:
numbers_as_str2 = my_phone_number.replace("-", "")
numbers_as_str2

'5558675309'

In [4]:
numbers_as_str3 = "".join([f"{x}" for x in my_phone_number if x.isdigit()])
numbers_as_str3

'5558675309'

In [5]:
import re
pattern = r"\d+"
re.findall(pattern, my_phone_number)

['555', '867', '5309']

In [6]:
my_other_phone_numbers = "Hi there, my home number is 3123123asdfasdf3123 555-867-5309 and my cell number is +1-555-555-0007."

pattern = r"\d+"
re.findall(pattern, my_other_phone_numbers)

['3123123', '3123', '555', '867', '5309', '1', '555', '555', '0007']

In [7]:
meeting_str = "Hey, give me a call at 8:30 on my cell at +1-555-555-0007."

pattern = r"\d"
re.findall(pattern, meeting_str)

['8', '3', '0', '1', '5', '5', '5', '5', '5', '5', '0', '0', '0', '7']

In [8]:
meeting_str = "Hey, give me a call at 8:30 on my cell at +1-555-555-0007 1-555-555-0007."

pattern = r"\+\d{1}-\d{3}-\d{3}-\d{4}"  # raw string: backslashes are passed to the regex as-is (unlike "\n")

re.findall(pattern, meeting_str)

['+1-555-555-0007']

- `\+` -> escapes the `+` so it matches a literal plus sign instead of acting as a quantifier.
- `\d` -> matches any single digit.
- `{1}` -> `{n}` -> matches exactly `n` repetitions of the preceding token (here, one digit).
- `-` -> matches a literal dash.
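Each of these pieces can be tried on its own. This is a small sketch under assumed sample strings (`text` and the variable names are hypothetical, not from the notebook):

```python
import re

text = "call +1-555-555-0007 or 1-555-555-0007"

# \+ matches a literal plus sign (a bare + would be a quantifier).
plus_signs = re.findall(r"\+", text)

# \d{3} matches exactly three digits in a row; leftovers are skipped.
triples = re.findall(r"\d{3}", "12345678")

# Outside a character class, - simply matches a literal dash.
dashes = re.findall(r"-", "555-555-0007")

print(plus_signs)  # ['+']
print(triples)     # ['123', '456']
print(dashes)      # ['-', '-']
```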


- Chunk 1 -> `\+\d{1}-`
- Chunk 2 -> `\d{3}-`
- Chunk 3 -> `\d{3}-`
- Chunk 4 -> `\d{4}`



In [9]:
meeting_str = "Hey, give me a call at 8:30 on my cell at +1-555-555-0007 1-555-555-0007."

chunk_1 = r"\+\d{1}-"  # raw string: backslashes are passed to the regex as-is (unlike "\n")

re.findall(chunk_1, meeting_str)

['+1-']

In [10]:
chunk_1 = r"\+?" + r"\d{1}" + "-?"
chunk_2 = r"\d{3}" + "-?"
chunk_3 = r"\d{3}-?"
chunk_4 = r"\d{4}"

pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"

meeting_str = "Hey, give me a call at 8:30 on my cell at +1-555-555-0007 1-555-555-0007 +15555553121."

regex = re.compile(pattern)
re.findall(regex, meeting_str)

['+1-555-555-0007', '1-555-555-0007', '+15555553121']

In [11]:
phone_number = "+1(555)-555-3121"

chunk_1 = r"\+?" + r"\d{1}" + "-?"
chunk_2 = r"\(\d{3}\)" + "-?"
chunk_3 = r"\d{3}-?"
chunk_4 = r"\d{4}"

pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"
regex = re.compile(pattern)
re.findall(regex, phone_number)

['+1(555)-555-3121']

In [12]:
meeting_str = "Hey, give me a call at 8:30 on my cell at +1-555-555-0007 1-555-555-0007 +15555553121 +1(555)-555-3121."

chunk_1 = r"\+?" + r"\d{1}" + "-?"
chunk_2 = r"\(?" + r"\d{3}" + r"\)?" + "-?"
chunk_3 = r"\d{3}-?"
chunk_4 = r"\d{4}"

pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"
regex = re.compile(pattern)
re.findall(regex, meeting_str)

['+1-555-555-0007', '1-555-555-0007', '+15555553121', '+1(555)-555-3121']

In [13]:
re.compile(pattern)

re.compile(r'\+?\d{1}-?\(?\d{3}\)?-?\d{3}-?\d{4}', re.UNICODE)

- `\d` -> `[0-9]` (for ASCII digits), so the two are interchangeable in this pattern
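One caveat worth a quick sketch: for `str` patterns in Python 3, `\d` also matches decimal digits from other scripts, while `[0-9]` does not. The sample string below is an assumption chosen to show the difference; passing `re.ASCII` restricts `\d` to `[0-9]`.

```python
import re

# ASCII digit 3 next to ARABIC-INDIC DIGIT THREE (U+0663).
sample = "3 and \u0663"

unicode_digits = re.findall(r"\d", sample)           # both digits
ascii_digits = re.findall(r"\d", sample, re.ASCII)   # only the ASCII digit
class_digits = re.findall(r"[0-9]", sample)          # only the ASCII digit

print(len(unicode_digits), ascii_digits, class_digits)  # 2 ['3'] ['3']
```

For phone numbers typed on a US keyboard the two behave the same, which is why the swap in this notebook gives identical results.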

In [14]:
meeting_str = "Hey, give me a call at 8:30 on my cell at +1-555-555-0007 1-555-555-0007 +15555553121 +1(555)-555-3121."

chunk_1 = r"\+?" + "[0-9]{1}" + "-?"
chunk_2 = r"\(?" + "[0-9]{3}" + r"\)?" + "-?"
chunk_3 = "[0-9]{3}-?"
chunk_4 = "[0-9]{4}"

pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"
regex = re.compile(pattern)
re.findall(regex, meeting_str)

['+1-555-555-0007', '1-555-555-0007', '+15555553121', '+1(555)-555-3121']

In [15]:
re.compile(pattern)

re.compile(r'\+?[0-9]{1}-?\(?[0-9]{3}\)?-?[0-9]{3}-?[0-9]{4}', re.UNICODE)

In [21]:
meeting_str = "Hey, give me a call at 8:30 on my cell at +1-255-555-0007 1-555-555-0007 +15555553121 +1(555)-555-3121."

chunk_1 = r"\+?" + r"\d{1}" + "-?"
# chunk_2 = r"\(?" + "[13579]{3}" + r"\)?" + "-?"
chunk_2 = r"\(?" + "[1-4]{3}" + r"\)?" + "-?"
chunk_3 = r"\d{3}-?"
chunk_4 = r"\d{4}"

pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"
regex = re.compile(pattern)
re.findall(regex, meeting_str)

[]

In [22]:
meeting_str = "Hey, give me a call at 8:30 on my cell at +1-255-555-0007 1-555-555-0007 +15555553121 +1(555)-555-3121."

chunk_1 = r"\+?" + r"\d{1}" + "-?"
# chunk_2 = r"\(?" + "[13579]{3}" + r"\)?" + "-?"
chunk_2 = r"\(?" + "[0-2]{1}" + "[0-9]{2}" + r"\)?" + "-?"
chunk_3 = r"\d{3}-?"
chunk_4 = r"\d{4}"

pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"
regex = re.compile(pattern)
re.findall(regex, meeting_str)

['+1-255-555-0007']

In [23]:
re.compile(pattern)

re.compile(r'\+?\d{1}-?\(?[0-2]{1}[0-9]{2}\)?-?\d{3}-?\d{4}', re.UNICODE)

In [31]:
meeting_str = "Hey, give me a call at 8:30 on my cell at +1-213-555-0007 1-909-555-0007 +15555553121 +1(555)-555-3121."

chunk_1 = r"\+?" + r"\d{1}" + "-?"
chunk_2 = r"\(?" + "(?:213|212|909)" + r"\)?" + "-?"
chunk_3 = r"\d{3}-?"
chunk_4 = r"\d{4}"

pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"
regex = re.compile(pattern)
re.findall(regex, meeting_str)

['+1-213-555-0007', '1-909-555-0007']

In [38]:
meeting_str = "Hey, give me a call at 8:30 on my cell at +1-213-555-0007 1-909-555-0007 +15555553121 +1(555)-555-3121."

chunk_1 = r"\+?" + r"\d{1}" + "-?"
chunk_2 = r"\(?" + "(?:2|3|9)" + r"\d{2}" + r"\)?" + "-?"
chunk_3 = r"\d{3}-?"
chunk_4 = r"\d{4}"

pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"
regex = re.compile(pattern)
re.findall(regex, meeting_str)

['+1-213-555-0007', '1-909-555-0007']

## Groups
With regex, we can group parts of a pattern so they are easier to identify. A phone number is structured as:

```
<country-code>-<area-code>-<exchange-code>-<line-number>
```

A number in this format looks like:

```
1-212-555-5123
```
- `1` is the country code
- `212` is the area code
- `555` is the exchange code
- `5123` is the line number
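A minimal sketch of how capture groups change what `re` returns, assuming a strict dash-separated pattern. The names `grouped`, `text`, and `found` are only for illustration.

```python
import re

grouped = r"(\d{1})-(\d{3})-(\d{3})-(\d{4})"
text = "1-212-555-5123 1-213-555-0007"

# With capture groups, findall returns one tuple of groups per match
# instead of the full matched string.
all_parts = re.findall(grouped, text)
print(all_parts)
# [('1', '212', '555', '5123'), ('1', '213', '555', '0007')]

# match only tries at the very start of the string; search scans forward.
at_start = re.match(grouped, "call 1-212-555-5123")
found = re.search(grouped, "call 1-212-555-5123")
print(at_start)        # None
print(found.group(0))  # 1-212-555-5123 (the entire match)
print(found.groups())  # ('1', '212', '555', '5123')
```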

In [40]:
chunk_1 = r"\d{1}-?"
chunk_2 = r"\d{3}-?"
chunk_3 = r"\d{3}-?"
chunk_4 = r"\d{4}"


example = "1-212-555-5123"
pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"

print('example', re.compile(pattern).findall(example))

example ['1-212-555-5123']


In [51]:
group_1 = r"(\+?\d{1}-?)"
group_2 = r"(\d{3}-?)"
group_3 = r"(\d{3}-?)"
group_4 = r"(\d{4})"


example = "1-212-555-5123 1-212-555-5123"
grouped_pattern = f"{group_1}{group_2}{group_3}{group_4}"

matched = re.compile(grouped_pattern).match(example)
if matched:
    print('group', matched.group())
    print('groups', matched.groups())
    print(matched.group(4))  # group(0) is the entire match; group(1) through group(n) are the n capture groups

# print('example', re.compile(grouped_pattern).findall(example))

group 1-212-555-5123
groups ('1-', '212-', '555-', '5123')
5123


In [54]:
group_1 = r"(\d{1})-?"
group_2 = r"(\d{3})-?"
group_3 = r"(\d{3})-?"
group_4 = r"(\d{4})"


example = "1-212-555-5123"
grouped_pattern = f"{group_1}{group_2}{group_3}{group_4}"

matched = re.compile(grouped_pattern).match(example)
country_code = matched.group(1)
print('country_code', country_code)

area_code = matched.group(2)
print('area_code', area_code)

exchange_code = matched.group(3)
print('exchange_code', exchange_code)

line_number = matched.group(4)
print('line_number', line_number)

country_code 1
area_code 212
exchange_code 555
line_number 5123


### Named Groups

In [63]:
chunk_1 = r"\+?-?" + r"(?P<country_code>\d{1})" + "-?"
chunk_2 = r"\(" + r"(?P<region_code>\d{3})" + r"\)" + "-?"
chunk_3 = r"(?P<exchange_code>\d{3})-?"
chunk_4 = r"(?P<line_number>\d{4})"


example = "+1-(212)-555-5123 +1-(212)-555-5123 +1-(212)-555-5123 +1-(212)-555-5123 +1-(212)-555-5123"
pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"

for m in re.compile(pattern).finditer(example):
    print(m.group(0))
    print(m.groupdict())

+1-(212)-555-5123
{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123'}
+1-(212)-555-5123
{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123'}
+1-(212)-555-5123
{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123'}
+1-(212)-555-5123
{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123'}
+1-(212)-555-5123
{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123'}
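Besides `groupdict()`, a named group can be read individually. The sketch below reuses the group names from the cell above on an assumed sample string, with the pattern slightly simplified:

```python
import re

pattern = re.compile(
    r"\+?(?P<country_code>\d)-?"
    r"\((?P<region_code>\d{3})\)-?"
    r"(?P<exchange_code>\d{3})-?"
    r"(?P<line_number>\d{4})"
)

m = pattern.search("cell: +1-(212)-555-5123")

region = m.group("region_code")  # look a group up by its name
line = m["line_number"]          # Match objects also support indexing (Python 3.6+)
first = m.group(1)               # named groups keep their positional number too

print(region, line, first)  # 212 5123 1
```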


In [64]:
group_1 = r"(\+?\d{1}-?)"
group_2 = r"(\d{3}-?)"
group_3 = r"(\d{3}-?)"
group_4 = r"(\d{4})"


example = "1-212-555-5123 1-212-555-5123"
grouped_pattern = f"{group_1}{group_2}{group_3}{group_4}"
for m in re.compile(grouped_pattern).finditer(example):
    print(m.group(0))
    print(m.groupdict())

1-212-555-5123
{}
1-212-555-5123
{}


In [68]:
chunk_1 = r"\+?-?" + r"(?P<country_code>\d{1})" + "-?"
chunk_2 = r"\(" + r"(?P<region_code>\d{3})" + r"\)" + "-?"
chunk_3 = r"(?P<exchange_code>\d{3})-?"
chunk_4 = r"(?P<line_number>\d{4})"


example = "+1-(212)-555-5123 +1-(212)-555-5123 +1-(212)-555-5123 +1-(212)-555-5123 +1-(212)-555-5123"
pattern = f"{chunk_1}{chunk_2}{chunk_3}{chunk_4}"

datas = []

for m in re.compile(pattern).finditer(example):
#     print(m.group(0))
    # print(m.groupdict())
    data = {**m.groupdict()}
    full_number = m.group(0)
    data['full_number'] = full_number
    print(data)
    datas.append(data)

{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123', 'full_number': '+1-(212)-555-5123'}
{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123', 'full_number': '+1-(212)-555-5123'}
{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123', 'full_number': '+1-(212)-555-5123'}
{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123', 'full_number': '+1-(212)-555-5123'}
{'country_code': '1', 'region_code': '212', 'exchange_code': '555', 'line_number': '5123', 'full_number': '+1-(212)-555-5123'}


In [None]:
# pd.DataFrame(datas)

### What about letters?

In [77]:
my_text = "Hello world. I have a score of 100/100. How cool is that?"

pattern = "[a-zA-Z]+"

re.findall(pattern, my_text)

['Hello',
 'world',
 'I',
 'have',
 'a',
 'score',
 'of',
 'How',
 'cool',
 'is',
 'that']

In [79]:
my_text = "Hello world. I have a score of 100/100. How cool is that?"

pattern = r"\w+"

re.findall(pattern, my_text)

['Hello',
 'world',
 'I',
 'have',
 'a',
 'score',
 'of',
 '100',
 '100',
 'How',
 'cool',
 'is',
 'that']

In [84]:
my_text = "Hello world. I have a score of 100/100. How cool is that?"

pattern =  "[0-9a-zA-Z .]"

"".join(re.findall(pattern, my_text))

'Hello world. I have a score of 100100. How cool is that'

In [105]:
my_text = "Hello world. I have a score of 100/100. How cool is that? \\backslash"

pattern =  r"[0-9a-zA-Z .\/\\\?]"

print("".join(re.findall(pattern, my_text)))

Hello world. I have a score of 100/100. How cool is that? \backslash



## Metacharacters

Here are a few metacharacters that you can use:

- `^` - matches at the start of a string
- `[^0-9]` - inside brackets, `^` negates the class, so this matches any character except `0`-`9`; likewise, `[^a-z]` matches anything that's not a lowercase letter.
- `$` - matches at the end of a string
- `+` - one or more of the preceding token
- `*` - zero or more of the preceding token
- `?` - makes the token before `?` optional (as discussed above)
- `|` - the or operator (from above as well)
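The list above can be sketched with one tiny check per metacharacter. The sample strings and the `checks` dictionary are assumptions made up for this illustration:

```python
import re

checks = {
    # ^ anchors the match to the start of the string
    "starts_with_capital": bool(re.search(r"^[A-Z]", "Another one")),
    # $ anchors the match to the end of the string
    "ends_with_period": bool(re.search(r"\.$", "Call me.")),
    # + needs at least one "a"
    "one_or_more": re.findall(r"a+", "caaat bat ct"),
    # * also accepts zero "a" characters, so "ct" matches
    "zero_or_more": re.findall(r"ca*t", "caaat cat ct"),
    # ? makes the "u" optional
    "optional": re.findall(r"colou?r", "color colour"),
    # | matches either alternative
    "either": re.findall(r"cat|dog", "cat, dog, bird"),
    # [^0-9] matches anything that is not a digit
    "negated_class": re.findall(r"[^0-9]", "a1b2"),
}

for name, result in checks.items():
    print(name, result)
```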

In [112]:
pattern = r"^[A-Z]" # re_path -> url -> parse urls
print(re.findall(pattern, "Another one here"))
print(re.findall(pattern, "yet another. One here"))

['A']
[]


In [120]:
pattern = r"[^0-9a-zA-Z ]"
long_example =  "Hi there, my home number *&*#*&@# : is 1-555-867-5309 and my cell number is 1-555-555-0007."

list(set(re.findall(pattern, long_example)))

['@', '-', '.', '*', ':', '&', ',', '#']

In [126]:
pattern = r"[^0-9a-zA-Z]\$"
long_example =  "Hi there, my home number :$ *&*#*&@# : is 1-555-867-5309 and my cell number is 1-555-555-0007?"


re.findall(pattern, long_example)

[':$']

In [127]:
pattern = "(?:abc|123)"

In [132]:
pattern = r"\w+" # [0-9a-zA-Z_]+
example = "This is going to work!"
re.findall(pattern, example)


['This', 'is', 'going', 'to', 'work']

In [133]:
pattern = r"\w" # [0-9a-zA-Z_]
example = "This is going to work!"
re.findall(pattern, example)

['T',
 'h',
 'i',
 's',
 'i',
 's',
 'g',
 'o',
 'i',
 'n',
 'g',
 't',
 'o',
 'w',
 'o',
 'r',
 'k']

In [139]:
pattern = "[ABc ]*" # zero or more of A, B, c, or space; can also match the empty string
example = "AAA AA bbb CCC"
re.findall(pattern, example)

['AAA AA ', '', '', '', ' ', '', '', '', '']

- `\d` -> `[0-9]`
- `\D` -> `[^0-9]`
- `\s` -> whitespace character
- `\S` -> non-whitespace character
- `\w` -> word characters `[0-9a-zA-Z_]` (plus other Unicode letters and digits for `str` patterns)
- `\W` -> non-word characters `[^0-9a-zA-Z_]`
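A short sketch of these shorthand classes side by side, using an assumed sample string (`sample` and the result names are hypothetical). Note that the underscore counts as a word character:

```python
import re

sample = "user_1 ok\t42!"

digits = re.findall(r"\d", sample)           # ['1', '4', '2']
spaces = re.findall(r"\s", sample)           # [' ', '\t']
words = re.findall(r"\w+", sample)           # ['user_1', 'ok', '42']
non_words = re.findall(r"\W", sample)        # [' ', '\t', '!']
non_digit_runs = re.findall(r"\D+", sample)  # ['user_', ' ok\t', '!']
non_space_runs = re.findall(r"\S+", sample)  # ['user_1', 'ok', '42!']

for result in (digits, spaces, words, non_words, non_digit_runs, non_space_runs):
    print(result)
```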