Metacharacters

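| Metacharacter | Description |
|---|---|
| `.` | Matches any single character except a newline |
| `^` | Matches the start of the string |
| `$` | Matches the end of the string |
| `*` | Matches the preceding item zero or more times |
| `+` | Matches the preceding item one or more times |
| `?` | Matches the preceding item zero or one time |
| `{}` | Matches the preceding item a specified number of times, e.g. `{3}` or `{1,3}` |
| `[]` | Specifies a set of characters to match |
| `\` | Escapes a metacharacter so that it is matched literally |
| `\|` | Matches either alternative, e.g. a or b in `(a\|b)` |
| `()` | Groups the enclosed pattern and captures what it matches |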
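
As a rough illustration of the table above, the sketch below runs several of the metacharacters against one made-up string. The `sample` text and the variable names are assumptions chosen for this example only.

```python
import re

# Hypothetical sample text, chosen only to exercise the metacharacters above.
sample = "cat bat rat caat ct"

# '.' matches any single character (except a newline by default).
dot_matches = re.findall(r"c.t", sample)
# '*' allows zero or more 'a'; '+' requires at least one; '?' allows at most one.
star_matches = re.findall(r"ca*t", sample)
plus_matches = re.findall(r"ca+t", sample)
optional_matches = re.findall(r"ca?t", sample)
# '{2}' requires exactly two repetitions of the preceding item.
exact_matches = re.findall(r"ca{2}t", sample)
# '[]' defines a set of characters; '|' chooses between alternatives.
set_matches = re.findall(r"[br]at", sample)
alt_matches = re.findall(r"bat|rat", sample)
# '^' and '$' anchor the match to the start and end of the string.
starts_with_cat = bool(re.search(r"^cat", sample))
ends_with_ct = bool(re.search(r"ct$", sample))

print(star_matches)      # ['cat', 'caat', 'ct']
print(plus_matches)      # ['cat', 'caat']
print(optional_matches)  # ['cat', 'ct']
```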

In [1]:
import re

In [2]:
test_string = "my word is text"
pattern = re.compile(r'\w')  # \w matches one word character, so findall returns each word character separately
matches = pattern.findall(test_string)
print(matches)

['m', 'y', 'w', 'o', 'r', 'd', 'i', 's', 't', 'e', 'x', 't']


In [3]:
test_string = "my word is text"
# r'\w\d' matches a word character immediately followed by a digit,
# in that order, e.g. "c2" or "a1".
# test_string contains no digits, so nothing matches.
pattern = re.compile(r'\w\d')

matches = pattern.findall(test_string)
print(matches)

[]


In [4]:
test_string = "my word is text"
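# r'[\w\d]' is a character class: it matches ONE character that is either
# a word character or a digit. Since \w already includes digits, this is
# equivalent to r'\w'.
# Every non-space character in test_string qualifies.
pattern = re.compile(r'[\w\d]')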

matches = pattern.findall(test_string)
print(matches)

['m', 'y', 'w', 'o', 'r', 'd', 'i', 's', 't', 'e', 'x', 't']


In [5]:
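test_string = "7st art of a str ing"
pattern = re.compile(r'^[789]')  # a leading 7, 8, or 9
# Pattern.sub takes the replacement first, then the string to search.
result = pattern.sub('', test_string)
print(result)

st art of a str ing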


---

In [6]:
test_string = "7st art of a str ing"
pattern = re.compile(r'.')
matches = pattern.findall(test_string)
print(matches)

['7', 's', 't', ' ', 'a', 'r', 't', ' ', 'o', 'f', ' ', 'a', ' ', 's', 't', 'r', ' ', 'i', 'n', 'g']


In [7]:
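ip = "blah blah 192.168.0.185 blah blah"
# (?:...) groups the "1-3 digits and a dot" unit without capturing it, so {3}
# can repeat it. With a capturing group, (\d{1,3}\.){3}\d{1,3}, findall would
# return only the group's last repetition ('0.') instead of the whole address.
pattern = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")
match = pattern.search(ip)
print(match.group())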

192.168.0.185


In [8]:
s = 'news/100'
# pattern = r'\w+/(\d+)'    # with a capturing group (\d+)
pattern = r'\w+/(?:\d+)'    # with a non-capturing group (?:\d+)
# pattern = r'\w+/\d+'      # without any group
matches = re.finditer(pattern, s)
for match in matches:
    print(match.group(0))

news/100


In [17]:
# Named capturing groups
# https://www.pythontutorial.net/python-regex/python-regex-capturing-group/
#
# Format: (?P<name>rule)
#   - () indicates a capturing group.
#   - ?P<name> specifies the name of the capturing group.
#   - rule is the sub-pattern that the group matches.
#
# Example: r'(?P<resource>\w+)/(?P<id>\d+)'
# Here resource names the first capturing group and id names the second.
# The groupdict() method of the Match object returns all named groups
# of a match as a dictionary.
s = 'news/100'
pattern = r'(?P<resource>\w+)/(?P<id>\d+)'
matches = re.finditer(pattern, s)
for match in matches:
    print(match.groupdict())

{'resource': 'news', 'id': '100'}


In [18]:
s = 'news/2021/12/31'
pattern = r'(?P<resource>\w+)/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})'

matches = re.finditer(pattern, s)
for match in matches:
    print(match.groupdict())

{'resource': 'news', 'year': '2021', 'month': '12', 'day': '31'}


## Capturing and Non-capturing Groups

###   What are they?
*  The pattern that you pass to the regex engine can be divided into groups using parentheses, e.g. '(\d+)[-./]\d{1,2}[-./]\d{1,2}'
*  Note that the entire match always counts as group 0 and is returned by group(0).
*  In the example given, '(\d+)[-./]\d{1,2}[-./]\d{1,2}', the parenthesized (\d+) creates group(1), in addition to the whole match, which is group(0).
*  (\d+) is a capturing group. This means the text it matches is stored, much like a variable, and can be referenced later in the same pattern through a backreference (\1, \2, etc. in Python) to require that same text again.
*  Being stored and referenceable later in the pattern is what makes them capturing groups.
*  Capturing groups are numbered 1, 2, 3, etc. in the order of their opening parentheses, with group(0) reserved for the entire match.
*  Capturing groups can be named and referenced by their names: (?P<name>rule)
*  If you need parentheses for grouping but do not want the group to be captured, define it as (?:rule). This creates a non-capturing group, which gets no number and cannot be backreferenced later in the pattern.
*  The distinguishing factor is the ?: immediately after the opening parenthesis. Its presence marks a non-capturing group; its absence marks a capturing group.
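
To make the difference concrete, here is a small sketch that runs the date pattern from the bullets above with and without a capturing group. The `date_text` value and the variable names are assumptions made for this example.

```python
import re

date_text = "2021-12-31"  # hypothetical input

# One capturing group: only the year is stored as group(1).
capturing = re.search(r"(\d+)[-./]\d{1,2}[-./]\d{1,2}", date_text)
whole_match = capturing.group(0)    # the entire match
year = capturing.group(1)           # the first capturing group
group_count = capturing.re.groups   # number of capturing groups in the pattern

# Same pattern with (?:...): the parentheses still group, but nothing is stored.
non_capturing = re.search(r"(?:\d+)[-./]\d{1,2}[-./]\d{1,2}", date_text)
non_capturing_count = non_capturing.re.groups

print(whole_match, year, group_count, non_capturing_count)
# 2021-12-31 2021 1 0
```

Calling `non_capturing.group(1)` here would raise `IndexError`, since the pattern defines no numbered group.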

Summary

* Place a rule of a pattern inside parentheses () to create a capturing group.
* Use the group() method of the Match object to get the subgroup by an index.
* Use the (?P<name>rule) to create a named capturing group for the rule in a pattern.
* Use the groupdict() method of the Match object to get the named subgroups as a dictionary.

Backreferences
* Backreferences work much like variables in Python: they let you refer to an earlier capturing group within a regular expression.
* The syntax of a backreference is ```\N``` inside a pattern. In the replacement string of sub(), you can write either ```\N``` or ```\g<N>```.
* In this syntax, N is 1, 2, 3, etc. and identifies the corresponding capturing group.
* Note that \g<0> refers to the entire match, which has the same value as match.group(0). It is valid only in replacement strings, not inside a pattern.
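
The two places a backreference can appear (inside the pattern and inside a replacement string) can be sketched as follows. The input strings are made up for illustration.

```python
import re

text = "the the quick brown fox fox"  # hypothetical input with doubled words

# \1 inside the pattern must repeat whatever text group 1 matched.
doubled = re.findall(r"\b(\w+) \1\b", text)

# \g<1> (or \1) inside the replacement string inserts what group 1 matched.
deduplicated = re.sub(r"\b(\w+) \1\b", r"\g<1>", text)

# \g<0> in a replacement string stands for the entire match.
bracketed = re.sub(r"\d+", r"[\g<0>]", "room 42")

print(doubled)       # ['the', 'fox']
print(deduplicated)  # the quick brown fox
print(bracketed)     # room [42]
```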

Using Python regex backreferences to get text inside quotes
* Suppose you want to get the text within the double quotes in: "Python's awesome". She said
* You might try the pattern ```'[\'"](.*?)[\'"]'```, but this pattern also matches text that starts with a single quote (') and ends with a double quote ("), or vice versa.
* For the input above it matches "Python' rather than "Python's awesome".
* To fix it, you can use a backreference: ```r'([\'"]).*?\1'```
* The backreference \1 refers to the first capturing group. So if that group matched a single quote, \1 will match only a single quote. And if it matched a double quote, \1 will match only a double quote.
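
The quote example above can be tried directly. This sketch adds a second group, `(.*?)`, around the quoted text so the content can be read without the quotes. That extra group is an addition for illustration and is not part of the pattern given above.

```python
import re

# Hypothetical sentence based on the example above.
s = '"Python\'s awesome". She said'

# Without a backreference, either quote character may close the match.
loose = re.search(r'[\'"](.*?)[\'"]', s)
# With \1, the closing quote must be the same character that opened the match.
strict = re.search(r'([\'"])(.*?)\1', s)

print(loose.group(0))   # "Python'
print(strict.group(0))  # "Python's awesome"
print(strict.group(2))  # Python's awesome
```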

In [32]:
import re

text = '2004-04-02  was great as well as 2012-05-01 '
# pattern = re.compile(r'(\d+?)[-\.\/](\d{1,2})[-\.\/]\d{1,2}')
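# \2 repeats the separator matched by group 2. \3 repeats the exact text
# matched by group 3 (the month), so the day would have to equal the month.
# Neither date in text satisfies that, so search() returns None.
pattern = re.compile(r'(\d+?)([-\./])(\d{1,2})\2\3')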
# matches = pattern.findall(text)
matches = pattern.search(text)
print(matches)

None
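
The cell above prints None because `\3` forces the day to repeat the month's digits. If the intent was only to require the same separator on both sides of the month, one possible variant keeps `\2` and gives the day its own group. This is a sketch under that assumption, not necessarily what the original cell was meant to do.

```python
import re

text = '2004-04-02  was great as well as 2012-05-01 '

# \2 reuses the separator; the day gets its own group instead of \3.
pattern = re.compile(r'(\d+?)([-./])(\d{1,2})\2(\d{1,2})')

first = pattern.search(text)
all_dates = [m.group(0) for m in pattern.finditer(text)]

print(first.group(0))   # 2004-04-02
print(first.groups())   # ('2004', '-', '04', '02')
print(all_dates)        # ['2004-04-02', '2012-05-01']
```

With this pattern a mixed-separator string such as '2004-04/02' does not match, because `\2` must repeat the first separator.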
