# Pattern matching with regex

In the previous tutorial, we showed how to find text with a regex. Now suppose we want not only to find the text, but also to use a pattern to extract specific parts (subgroups) of it.

## A simple example

Suppose we have a phone number composed of three parts:

1. three digits: country code
2. three digits: region code
3. four digits: user id

For example, a valid phone number looks like this: 415-555-4242

Task 1: check whether a text contains a phone number.

Task 2: if there is a phone number, extract the country code, region code, and user id.

### Task 1

With the code below, we can check whether the text contains a phone number. If it does, `search` returns a **re.Match** object; otherwise it returns None.

In [2]:
import re

In [3]:
# \d matches any digit.
# The r"" prefix creates a raw string, so Python does not interpret "\" as the start of an escape sequence.
# We could also write "[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
phoneRegex=re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")

In [2]:
text1="My number is : none of your business"

text2="My number is : 415-555-8888"

text3="My number is : 415-55-8888"

In [4]:
result1=phoneRegex.search(text1)

print(result1)
if result1:
    print(result1.group())

None


In [5]:
result2=phoneRegex.search(text2)

print(result2)
if result2:
    targetText=result2.group()
    print(f"type: {type(targetText)}, value : {targetText}")

<re.Match object; span=(15, 27), match='415-555-8888'>
type: <class 'str'>, value : 415-555-8888


In [6]:
result3=phoneRegex.search(text3)

print(result3)
if result3:
    print(result3.group())

None
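The pattern above repeats `\d` once per digit. As a minimal sketch, the same check can be written with the `{n}` quantifier, which means "repeat the previous item n times". The variable names below are illustrative and not part of the original notebook. Note that in Python 3, `\d` also matches non-ASCII Unicode digits while `[0-9]` does not, so the three spellings are only guaranteed to agree on ASCII input like the samples here.

```python
import re

# Three spellings of the same phone pattern (names are illustrative only)
verbose = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
char_class = re.compile(r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
quantified = re.compile(r"\d{3}-\d{3}-\d{4}")

samples = [
    "My number is : none of your business",
    "My number is : 415-555-8888",
    "My number is : 415-55-8888",
]

for sample in samples:
    # bool(...) is True when search() found a match, False when it returned None
    found = [bool(p.search(sample)) for p in (verbose, char_class, quantified)]
    print(sample, "->", found)
```

On these three samples all spellings agree: only the second text contains a valid phone number.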


### Task 2

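Now we want to extract the country code, region code, and user id.

The **re.Match** object provides three useful methods for extracting subgroups:
- group(): returns the whole matched text when called without arguments, or the text of subgroup n when called as group(n)
- groups(): returns a tuple containing the text of all subgroups
- groupdict(): returns a dictionary that maps the name of each named subgroup to its text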


In [7]:
result=phoneRegex.search(text2)
countryCode=result.group(1)
regionCode=result.group(2)
userId=result.group(3)

print(f"country code: {countryCode}")
print(f"region code: {regionCode}")
print(f"user Id : {userId}")

IndexError: no such group

In [8]:
resultTuple=result.groups()
print(f"type: {type(resultTuple)}, value: {resultTuple}")

type: <class 'tuple'>, value: ()


In [9]:
resultDict=result.groupdict()
print(resultDict)

{}


#### Grouped regex

What happened? Why does it not work?

Let's check our regex "\d\d\d-\d\d\d-\d\d\d\d". It is a single pattern without any groups, so the returned **re.Match** object does not contain any subgroups.

Let's try another version of the regex, "(\d\d\d)-(\d\d\d)-(\d\d\d\d)". This time, we use () to define three subgroups.

In [10]:
phoneGroupedRegex=re.compile(r"(\d\d\d)-(\d\d\d)-(\d\d\d\d)")

groupedResu=phoneGroupedRegex.search(text2)
countryCode=groupedResu.group(1)
regionCode=groupedResu.group(2)
userId=groupedResu.group(3)

print(f"country code: {countryCode}")
print(f"region code: {regionCode}")
print(f"user Id : {userId}")

country code: 415
region code: 555
user Id : 8888
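The cell above calls `group()` directly on the search result, which raises an `AttributeError` when the text contains no phone number, because `search` returns None in that case. A minimal sketch of a guarded version follows; `extract_phone_parts` is a hypothetical helper name, not something defined earlier in this tutorial.

```python
import re

phone_pattern = re.compile(r"(\d\d\d)-(\d\d\d)-(\d\d\d\d)")

def extract_phone_parts(text):
    """Return (country, region, user) as strings, or None if no number is found."""
    match = phone_pattern.search(text)
    if match is None:
        # No phone number in the text: nothing to extract
        return None
    return match.group(1), match.group(2), match.group(3)

valid_parts = extract_phone_parts("My number is : 415-555-8888")
invalid_parts = extract_phone_parts("My number is : 415-55-8888")

print(valid_parts)    # ('415', '555', '8888')
print(invalid_parts)  # None
```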


In [11]:
resultTuple=groupedResu.groups()
print(f"type: {type(resultTuple)}, value: {resultTuple}")

type: <class 'tuple'>, value: ('415', '555', '8888')


In [18]:
resultDict=groupedResu.groupdict()
print(f"type: {type(resultDict)}, value: {resultDict}")

type: <class 'dict'>, value: {}


The **groupdict** method still returns an empty dictionary, because it only contains named subgroups and none of our subgroups has a name yet.

#### Named group

Notice that so far we can only access the subgroups by their position. Can we name the subgroups?

Yes, we can. Check the regex below.

> To name a subgroup, we put **?P<name>** right after the opening parenthesis, where name is the tag of the subgroup. For example, we tag the country code subgroup with `country`: `(?P<country>\d\d\d)`

In [12]:
phoneNamedGroupedRegex=re.compile(r"(?P<country>\d\d\d)-(?P<region>\d\d\d)-(?P<user>\d\d\d\d)")

namedResu=phoneNamedGroupedRegex.search(text2)
countryCode=namedResu.group("country")
regionCode=namedResu.group("region")
userId=namedResu.group("user")

print(f"country code: {countryCode}")
print(f"region code: {regionCode}")
print(f"user Id : {userId}")

country code: 415
region code: 555
user Id : 8888


In [13]:
resultTuple=namedResu.groups()
print(f"type: {type(resultTuple)}, value: {resultTuple}")

type: <class 'tuple'>, value: ('415', '555', '8888')
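The output above shows that named groups still appear in `groups()`, so naming a group does not remove its positional index. The sketch below illustrates this with a hypothetical pattern in which the middle group is deliberately left unnamed; it also shows that `groupdict()` only reports the named groups.

```python
import re

# Hypothetical pattern for illustration: the region group has no name
mixed_pattern = re.compile(r"(?P<country>\d\d\d)-(\d\d\d)-(?P<user>\d\d\d\d)")
mixed_match = mixed_pattern.search("My number is : 415-555-8888")

# A named group can be read by name or by position
same_value = mixed_match.group(1) == mixed_match.group("country")
print(same_value)               # True

# groups() contains every group, named or not
print(mixed_match.groups())     # ('415', '555', '8888')

# groupdict() contains only the named groups
print(mixed_match.groupdict())  # {'country': '415', 'user': '8888'}
```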


With named groups, we can use the syntax **\g<name>** to refer to the tagged group in substitution text:


In [15]:
# Use a raw string so that Python does not treat "\g" as an (invalid) string escape
outputPattern= r"\g<country> \g<region> \g<user>"
output=phoneNamedGroupedRegex.sub(outputPattern,text2)

print(output)

My number is : 415 555 8888


In [16]:
outputPattern= r"country code: \g<country>, region code: \g<region>, user id: \g<user>"
output=phoneNamedGroupedRegex.sub(outputPattern,text2)

print(output)

My number is : country code: 415, region code: 555, user id: 8888
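Besides a template string, `sub` also accepts a function that receives each match and returns the replacement text, and a match object can fill a template on its own through `expand`. The following is a minimal sketch; `mask_user` is a hypothetical helper introduced only for this illustration.

```python
import re

phone_pattern = re.compile(r"(?P<country>\d\d\d)-(?P<region>\d\d\d)-(?P<user>\d\d\d\d)")
text = "My number is : 415-555-8888"

match = phone_pattern.search(text)

# expand() fills the template from this single match and returns
# only the filled template, not the surrounding text
expanded = match.expand(r"\g<country> \g<region> \g<user>")
print(expanded)  # 415 555 8888

def mask_user(m):
    # Hypothetical helper: keep country and region, hide the user id
    return f"{m.group('country')}-{m.group('region')}-****"

# sub() calls mask_user once per match and inserts its return value
masked = phone_pattern.sub(mask_user, text)
print(masked)  # My number is : 415-555-****
```

A replacement function is useful when the new text cannot be expressed as a fixed template, for example when part of the match has to be transformed or hidden.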


In [17]:
resultDict=namedResu.groupdict()
print(f"type: {type(resultDict)}, value: {resultDict}")

type: <class 'dict'>, value: {'country': '415', 'region': '555', 'user': '8888'}
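`search` only returns the first match. When a text may contain several phone numbers, one possible approach is to iterate over all matches with `finditer` and collect one dictionary per number. The sample text below is made up for this sketch.

```python
import re

phone_pattern = re.compile(r"(?P<country>\d\d\d)-(?P<region>\d\d\d)-(?P<user>\d\d\d\d)")

# Made-up sample text containing two phone numbers
text = "Office: 415-555-8888, home: 212-555-0199"

# finditer() yields one re.Match object per non-overlapping match, left to right
records = [m.groupdict() for m in phone_pattern.finditer(text)]

for record in records:
    print(record)
# {'country': '415', 'region': '555', 'user': '8888'}
# {'country': '212', 'region': '555', 'user': '0199'}
```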


## Another example

Now we want to match a file name and split it into subgroups. An example file name is CRTO_CT_Bio_G2_2016.

The file name contains the following parts:
- domain = CRTO_CT
- table_name = Bio
- carto_version = 2
- year = 2016

In [6]:
fileRegex = re.compile(r"(?P<domain>CRTO_CT)_(?P<table>\w+)_G(?P<version>[0-9])_(?P<year>2[0-1][0-5][0-9])")
text="CRTO_CT_Bio_G2_2016"

In [7]:
resu= fileRegex.search(text)

In [8]:
print(resu.groupdict())

{'domain': 'CRTO_CT', 'table': 'Bio', 'version': '2', 'year': '2016'}
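All values returned by `groupdict()` are strings, and `search` accepts a match anywhere inside the text. As a sketch under those observations, the hypothetical helper `parse_file_name` below uses `fullmatch` so that the whole file name must follow the pattern, and converts the version and year to integers. Whether trailing text should be rejected depends on the use case, so treat the `fullmatch` choice as an assumption.

```python
import re

file_pattern = re.compile(
    r"(?P<domain>CRTO_CT)_(?P<table>\w+)_G(?P<version>[0-9])_(?P<year>2[0-1][0-5][0-9])"
)

def parse_file_name(name):
    """Return the file name parts as a dict, or None if the name does not fit."""
    # fullmatch() requires the entire string to match the pattern
    match = file_pattern.fullmatch(name)
    if match is None:
        return None
    fields = match.groupdict()
    # groupdict() values are strings; convert the numeric parts
    fields["version"] = int(fields["version"])
    fields["year"] = int(fields["year"])
    return fields

parsed = parse_file_name("CRTO_CT_Bio_G2_2016")
rejected = parse_file_name("CRTO_CT_Bio_G2_2016_backup")

print(parsed)    # {'domain': 'CRTO_CT', 'table': 'Bio', 'version': 2, 'year': 2016}
print(rejected)  # None
```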
