
# Regular expressions tutorial 

[Cheat Sheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)

In [1]:
import re

## findall, match, search, finditer

In [None]:
test_string = 'abc123ABC599abc_#abcc'
pattern = re.compile(r'abc')

In [None]:
# findall 
matches = pattern.findall(test_string)
type(matches), type(matches[0]) # findall returns a list of the matched substrings

(list, str)

In [None]:
# match 
# returns match obj if pattern matches the beginning of string 
# else returns None
pattern.match(test_string), pattern.match('hello_abc')

(<re.Match object; span=(0, 3), match='abc'>, None)

In [None]:
# search
# like match, but scans the whole string and returns the first occurrence found anywhere in it
pattern.search(test_string)

<re.Match object; span=(0, 3), match='abc'>

In [None]:
# finditer
matches = pattern.finditer(test_string)
for match in matches:
  print(match)

<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(12, 15), match='abc'>
<re.Match object; span=(17, 20), match='abc'>


A match object provides these methods:
* `span()` -- returns the `(start, end)` tuple of the match
* `start()` -- returns the index where the match starts
* `end()` -- returns the index just past the end of the match
* `group()` -- returns the matched string
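
The methods listed above can be tried on a single match. This is a minimal sketch that reuses the tutorial's test string; the variable name `first` is chosen only for this illustration.

```python
import re

test_string = 'abc123ABC599abc_#abcc'
first = re.search(r'abc', test_string)  # `first` is an illustrative name

print(first.span())   # (start, end) tuple
print(first.start())  # index where the match begins
print(first.end())    # index just past the last matched character
print(first.group())  # the matched text itself

# The span can be used to slice the original string
start, end = first.span()
print(test_string[start:end] == first.group())
```

Because `end()` points just past the match, slicing with `span()` recovers exactly the text that `group()` returns.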

In [None]:
# the module-level function gives the same result without compiling the pattern first
matches = re.finditer(r'abc', test_string)
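
The same equivalence should hold for the other functions in this section. The sketch below compares the compiled-pattern methods with their module-level counterparts; the names `compiled_spans` and `module_spans` are made up for the comparison.

```python
import re

test_string = 'abc123ABC599abc_#abcc'
pattern = re.compile(r'abc')

# Collect spans from the compiled pattern and from the module-level function
compiled_spans = [m.span() for m in pattern.finditer(test_string)]
module_spans = [m.span() for m in re.finditer(r'abc', test_string)]
print(compiled_spans == module_spans)

# findall and search behave the same way in both forms
same_findall = pattern.findall(test_string) == re.findall(r'abc', test_string)
same_search = pattern.search(test_string).span() == re.search(r'abc', test_string).span()
print(same_findall, same_search)
```

Compiling is mostly worthwhile when the same pattern is reused many times.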

## Example with meta character

In [None]:
pattern = re.compile(r'ABC$') # $ anchors the match to the end of the string
str1 = 'helloABC_testABC'
str2 = 'ABCABC_'
for match in pattern.finditer(str1):
  print(match)

<re.Match object; span=(13, 16), match='ABC'>


In [None]:
for match in pattern.finditer(str2):
  print(match) # never runs: there is no match, so the iterator is empty and nothing is printed
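
When nothing matches, `finditer` and `search` signal it differently. A small sketch, reusing `str2` from the cell above, makes the difference visible; `found` and `result` are names chosen for this example.

```python
import re

pattern = re.compile(r'ABC$')
str2 = 'ABCABC_'

found = list(pattern.finditer(str2))  # materialize the iterator to inspect it
print(found)                          # an empty list, so a for loop over it does nothing

result = pattern.search(str2)
print(result)                         # search returns None when there is no match
```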



*   `\d` matches any digit character
*   `\D` matches any non-digit character
*   `\s` matches any whitespace character (space, tab, newline)
*   `\S` matches any non-whitespace character
*   `\b` matches a word boundary: the position between a word character (a letter, digit, or underscore) and a non-word character or an end of the string. Placed before a pattern, it anchors the match to the start of a word


    Example: re.finditer(r'\bhello', 'hhello world _hello helloworld hihello') will find only one match (the one in 'helloworld'), because '_' also counts as a word character


*   Placed after a pattern, `\b` anchors the match to the end of a word
*   We can also pass a set of characters to look for: `r'[xyz]'`, `r'[a-z]'`, `r'[a-zA-Z]'`, `r'[0-9]'`; `r'[0-9-]'` also matches `-`
*   Quantifiers (listed in the cheat sheet):


    r'\d+' matches a whole run of consecutive digits as a single match
    r'_?\d' matches a digit with an optional leading underscore


*   `\d{3}` matches exactly three digits. Ranges also work: `\d{1,3}` matches runs of one to three digits. Write the range without a space after the comma: Python treats `{1, 3}` as literal text, not as a quantifier
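
The character classes and quantifiers above can be compared side by side. The following is a small sketch; the string `sample` and the result names are invented for this illustration, while `words` is the string from the `\b` example.

```python
import re

sample = 'id_7 tel 555-1234 x42'  # made-up input for this sketch

single_digits = re.findall(r'\d', sample)      # one match per digit
digit_runs = re.findall(r'\d+', sample)        # consecutive digits joined into one match
optional_us = re.findall(r'_?\d', sample)      # a digit, with '_' included when it precedes it
exactly_three = re.findall(r'\d{3}', sample)   # only complete groups of three digits
one_to_three = re.findall(r'\d{1,3}', sample)  # greedy groups of one to three digits
with_hyphen = re.findall(r'[0-9-]+', sample)   # a set containing digits and '-'

for name, value in [('\\d', single_digits), ('\\d+', digit_runs), ('_?\\d', optional_us),
                    ('\\d{3}', exactly_three), ('\\d{1,3}', one_to_three),
                    ('[0-9-]+', with_hyphen)]:
    print(name, value)

# Word boundary: only the 'hello' that starts the word 'helloworld' qualifies
words = 'hhello world _hello helloworld hihello'
boundary_spans = [m.span() for m in re.finditer(r'\bhello', words)]
print(boundary_spans)
```

Note how `\d{3}` leaves the trailing `4` of `1234` unmatched, while `\d{1,3}` picks it up as a separate match.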

## Conditions, grouping 

In [4]:
test_string = """
8/20/2022
Hello world!
Mr Smith
Mr. Brown
Mrs Stone 
Ms Grunfeld 
Mrs. Brown
"""

pattern = re.compile(r'(Mr|Mrs|Ms)(\.?\s)(\w+)') # we have 3 groups here
matches = pattern.finditer(test_string)
for match in matches:
  print(match)
  print('\t', match.group(1)) # only the first group

<re.Match object; span=(24, 32), match='Mr Smith'>
	 Mr
<re.Match object; span=(33, 42), match='Mr. Brown'>
	 Mr
<re.Match object; span=(43, 52), match='Mrs Stone'>
	 Mrs
<re.Match object; span=(54, 65), match='Ms Grunfeld'>
	 Ms
<re.Match object; span=(67, 77), match='Mrs. Brown'>
	 Mrs
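
The other groups can be read the same way. As a brief sketch on a single input (the name `match` and the one-line input are chosen for the example), `group(0)` is the whole match, `groups()` returns all groups at once, and `group` accepts several indices:

```python
import re

pattern = re.compile(r'(Mr|Mrs|Ms)(\.?\s)(\w+)')
match = pattern.search('Mrs. Brown')

print(match.group(0))     # the whole match
print(match.group(1))     # the title
print(match.group(3))     # the surname
print(match.groups())     # all three groups as a tuple
print(match.group(1, 3))  # several groups in one call
```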


## split, sub

In [5]:
test_string = '123abchelloworldabc789'
pattern = re.compile(r'abc')
parts = pattern.split(test_string) # the pattern acts as the separator
parts

['123', 'helloworld', '789']

In [7]:
substituted = pattern.sub('xyz', test_string)
substituted

'123xyzhelloworldxyz789'
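
Both methods also accept an optional limit: `maxsplit` for `split` and `count` for `sub`. The sketch below applies them to the same string; `first_split` and `first_sub` are illustrative names.

```python
import re

test_string = '123abchelloworldabc789'
pattern = re.compile(r'abc')

# Split only at the first occurrence of the pattern
first_split = pattern.split(test_string, maxsplit=1)
print(first_split)

# Replace only the first occurrence of the pattern
first_sub = pattern.sub('xyz', test_string, count=1)
print(first_sub)
```

With the default value of 0, both parameters mean "no limit", which is the behavior shown in the cells above.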

### Example:


In [11]:
urls = """
https://www.google.com
http://youtube.com
https://www.some-website.de
"""

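# groups: 1 = optional 'www.', 2 = domain name, 3 = top-level domain
# note: the range must be A-Z; A-z would also match [, \, ], ^, _ and `
pattern = re.compile(r'https?://(www\.)?([a-zA-Z-]+)(\.[a-zA-Z]+)')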
subbed = pattern.sub(r'\2\3', urls)
print(subbed)


google.com
youtube.com
some-website.de
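
The same groups can be read directly from match objects instead of being referenced in the replacement string. This is a sketch under the same pattern, written here with the `[a-zA-Z-]` character class; `domains` and `has_www` are names introduced for the example.

```python
import re

urls = """
https://www.google.com
http://youtube.com
https://www.some-website.de
"""

pattern = re.compile(r'https?://(www\.)?([a-zA-Z-]+)(\.[a-zA-Z]+)')

# Join the domain name (group 2) and the top-level domain (group 3)
domains = [m.group(2) + m.group(3) for m in pattern.finditer(urls)]
print(domains)

# An optional group that did not take part in the match is None
has_www = [m.group(1) is not None for m in pattern.finditer(urls)]
print(has_www)
```

Unlike `sub`, this returns a list of values rather than a rewritten string, so the blank lines around the URLs do not appear in the result.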



## [Compilation flags](https://docs.python.org/3/howto/regex.html#compilation-flags)
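
Two commonly used flags are `re.IGNORECASE` and `re.MULTILINE`. The sketch below shows their effect on a made-up three-line string `text`; the result names are also chosen for this example. The linked HOWTO covers the full list of flags.

```python
import re

text = 'ABC\nabc\nAbC_'  # made-up input with three lines

case_sensitive = re.findall(r'abc', text)
ignore_case = re.findall(r'abc', text, re.IGNORECASE)  # letter case is ignored

end_of_string = re.findall(r'ABC$', text)              # $ matches only at the end of the string
end_of_line = re.findall(r'ABC$', text, re.MULTILINE)  # $ also matches at the end of each line

# Flags are combined with the | operator
both = re.findall(r'abc$', text, re.IGNORECASE | re.MULTILINE)

print(case_sensitive, ignore_case)
print(end_of_string, end_of_line)
print(both)
```

The flags can be passed to `re.compile` in the same way, as its second argument.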