# Python 3 RegEx Playbook

## Special Characters

### Wildcards
- \w   -  Matches alphanumeric characters and underscore
- \d   -  Matches any decimal digit
- \s   -  Any whitespace character such as space, tab, or newline
- \b   -  word boundary
- \B   -  Not a word boundary
- \W   -  Any non-word character
- \D   -  Matches any character that is not a decimal digit
- \S   -  Matches any non-whitespace character
- .    -  Any character except newline(\n)
- [ae] -  Character class/set that matches "e" or "a"
- [a-z] - Matches any lowercase character from a to z
- [a-gA-G] - Matches all characters from a to g/A to G
- [^abc] - Matches all characters except "a","b" or "c"
- [aA]\d - Matches either "a" or "A" followed by a digit
- (abc) - treat "abc" as a single unit
- (abc|def) - matches either string "abc" or "def"
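
The entries above can be tried out with `re.findall`. This is a small sketch; the `sample` string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen to exercise several entries from the list.
sample = "Room a7 and Room B12, call 555 now_please!"

# \d+ : runs of decimal digits
digits = re.findall(r"\d+", sample)

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)

# [aA]\d : "a" or "A" followed by a single digit
a_digit = re.findall(r"[aA]\d", sample)

# \bRoom\b : "Room" as a whole word
whole_word = re.findall(r"\bRoom\b", sample)

# (abc|def) : either "abc" or "def"
alternation = re.findall(r"(abc|def)", "abc xyz def")

print(digits)       # ['7', '12', '555']
print(a_digit)      # ['a7']
print(alternation)  # ['abc', 'def']
```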


### Anchors

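- $ - matches at the end of a string (or just before a trailing newline); with re.MULTILINE, also at the end of each line
- ^ - matches at the start of a string; with re.MULTILINE, also at the start of each line
- \A - matches only at the start of the string, regardless of flags
- \Z - matches only at the end of the string, regardless of flags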
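
A short sketch of how the anchors behave with and without `re.MULTILINE`. The two-line `text` value is an illustrative assumption.

```python
import re

# Hypothetical two-line string that ends with a newline.
text = "first line\nsecond line\n"

# Without flags, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# $ matches at the end of the string and also just before a trailing newline.
dollar_matches = re.search(r"line$", text) is not None

# \Z matches only at the very end, so the trailing newline blocks the match.
z_matches = re.search(r"line\Z", text) is not None

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(dollar_matches, z_matches)  # True False
```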


### Repetitions

- \+  - matching one or more times
- \*  - matching 0 or more times
- ?   - matching 0 or 1 time
- {n} - matching exactly n times
- {m,n} - matching m to n times
- .+\\.txt - matches text filenames e.g. data.txt
- colou?r - matches "colour" or "color"
- (ha){2,3} - matches "haha" or "hahaha"
- ^#\d{6}$ - matches a 6-character hex color code made up of digits only, e.g. #123456
- ^#(\d|[a-fA-F]){6}$ - matches a 6-character hex color code with digits as well as letters from a/A to f/F
- ^#((\d|[a-fA-F]){3}){1,2}$ - matches a hex color code of either length 3 or length 6
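
As a rough check of the last pattern, the sketch below validates a few candidate strings. It swaps in the equivalent class `[0-9a-fA-F]` with a non-capturing group; in Python 3, `\d` on `str` patterns also matches non-ASCII decimal digits, which `[0-9]` avoids. The `candidates` list is made up.

```python
import re

# Same idea as ^#((\d|[a-fA-F]){3}){1,2}$, written with one character class
# and a non-capturing group.
HEX_COLOR = re.compile(r"^#(?:[0-9a-fA-F]{3}){1,2}$")

# Hypothetical inputs: two valid codes and four invalid ones.
candidates = ["#fff", "#FFAA00", "#12345", "#12G", "123456", "#abcd"]

valid = [c for c in candidates if HEX_COLOR.search(c)]
print(valid)  # ['#fff', '#FFAA00']
```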


### RegEx flags

- Case Insensitivity - re.I or re.IGNORECASE
- Multiline - re.M or re.MULTILINE makes ^ and $ match at the start and end of each line instead of only the start and end of the whole string
- Dotall Matching - re.S or re.DOTALL makes the . character match any character, including newlines
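
The three flags can be compared side by side. This is a minimal sketch on a made-up three-line `text`; flags are combined with `|`.

```python
import re

# Hypothetical multi-line input.
text = "Error: disk full\nerror: retrying\nDone"

# re.IGNORECASE: "error" matches regardless of case.
case_insensitive = re.findall(r"error", text, re.IGNORECASE)

# re.MULTILINE: ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Without re.DOTALL, "." stops at newlines, so this cannot span lines.
without_dotall = re.search(r"Error.*Done", text)

# With re.DOTALL, "." also matches newlines.
with_dotall = re.search(r"Error.*Done", text, re.DOTALL)

# Flags can be combined with the | operator.
combined = re.findall(r"^error", text, re.IGNORECASE | re.MULTILINE)

print(case_insensitive)  # ['Error', 'error']
print(line_starts)       # ['Error', 'error', 'Done']
print(without_dotall)    # None
print(combined)          # ['Error', 'error']
```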



- https://www.regular-expressions.info/tutorial.html
- https://learnbyexample.github.io/py_regular_expressions/re-introduction.html


**Capturing Groups**

In [32]:
import re

string = "contact for john doe : 093-098-8949"

phone_full_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")
full_match_groups=re.search(phone_full_pattern,string)
print("findall output: ",re.findall(phone_full_pattern,string))
print("group 0: ",full_match_groups.group(0))
print("groups: ",full_match_groups.groups())

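# captures the middle three and the last four digits (the first three are matched but not captured)
phone_capture_first_3 =  re.compile(r"\d{3}-(\d{3})-(\d{4})")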
print("findall output: ",re.findall(phone_capture_first_3,string))
partial_match_groups = re.search(phone_capture_first_3,string)
print("group 0: ",partial_match_groups.group(0))
print("group 1: ",partial_match_groups.group(1))
print("group 2: ",partial_match_groups.group(2))
print("groups: ",partial_match_groups.groups())

print(re.sub(phone_capture_first_3,"XXX",string))

findall output:  ['093-098-8949']
group 0:  093-098-8949
groups:  ()
findall output:  [('098', '8949')]
group 0:  093-098-8949
group 1:  098
group 2:  8949
groups:  ('098', '8949')
contact for john doe : XXX
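
Captured groups can also be named and reused in a replacement. The sketch below is one possible variation on the cell above: the group names `area`, `prefix` and `line` are illustrative, and `\1` in the replacement string refers back to the first captured group.

```python
import re

string = "contact for john doe : 093-098-8949"

# Named groups make the parts of a match accessible by name.
# The names area/prefix/line are hypothetical labels.
named = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")
parts = named.search(string).groupdict()

# Keep only the last four digits: \1 is the text captured by the first group.
masked = re.sub(r"\d{3}-\d{3}-(\d{4})", r"XXX-XXX-\1", string)

print(parts)   # {'area': '093', 'prefix': '098', 'line': '8949'}
print(masked)  # contact for john doe : XXX-XXX-8949
```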


**Non-Capturing Groups**

In [26]:
phone_full_pattern = re.compile(r"(\d{3}-){2}\d{4}")
print("findall output: ",re.findall(phone_full_pattern,string))

phone_full_pattern = re.compile(r"(?:\d{3}-){2}\d{4}")
print("findall output: ",re.findall(phone_full_pattern,string))

findall output:  ['098-']
findall output:  ['093-098-8949']
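
If the capturing group has to stay, `re.finditer` is one way to still get the full match, since each match object exposes `group(0)`. A small sketch reusing the string from the cells above:

```python
import re

string = "contact for john doe : 093-098-8949"
pattern = re.compile(r"(\d{3}-){2}\d{4}")

# group(0) is always the whole match, even when the pattern has capturing groups.
full_matches = [m.group(0) for m in pattern.finditer(string)]

# group(1) holds only the last repetition of the repeated group,
# which is what findall returned in the first call above.
last_repetition = [m.group(1) for m in pattern.finditer(string)]

print(full_matches)     # ['093-098-8949']
print(last_repetition)  # ['098-']
```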


**Lookarounds**

- Negative Lookaround
  - Lookahead : (?!pat)
  - Lookbehind : (?<!pat)
- Positive Lookaround
  - Lookahead : (?=pat)
  - Lookbehind : (?<=pat)
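
The four lookaround forms can be sketched on one string. The `prices` text is made up for illustration; note that Python's `re` module requires lookbehind patterns to have a fixed width.

```python
import re

# Hypothetical input mixing "$"-prefixed, bare, and "USD"-suffixed numbers.
prices = "apple $30, banana 45, cherry $7, date 100USD"

# Positive lookbehind: digits preceded by "$".
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: digits NOT preceded by "$" or by another digit.
no_dollar = re.findall(r"(?<![\$\d])\d+", prices)

# Positive lookahead: digits followed by "USD".
before_usd = re.findall(r"\d+(?=USD)", prices)

# Negative lookahead: digits NOT followed by "USD" or by another digit.
not_usd = re.findall(r"\d+(?!USD|\d)", prices)

print(after_dollar)  # ['30', '7']
print(no_dollar)     # ['45', '100']
print(before_usd)    # ['100']
print(not_usd)       # ['30', '45', '7']
```

The extra `\d` inside the negative lookarounds stops the engine from backtracking into the middle of a number.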

In [5]:
# words containing 'b' and 'e' and 't' in any order
import re
words = ['sequoia', 'subtle', 'questionable', 'exhibit', 'equation']
pattern = re.compile(r"(?=.*e)(?=.*b).*t")
res = [word for word in words if re.search(pattern,word)]
print(res)

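# words containing ('ab' or 'at') and 'q' but not ending in 'n'
# (lookbehinds in Python must be fixed-width, so lookaheads are used here)
pattern = re.compile(r"(?=.*(?:ab|at))(?=.*q)(?!.*n$)")
res = [word for word in words if re.search(pattern,word)]
print(res)

['subtle', 'questionable', 'exhibit']
['questionable']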


**Searching/Matching Patterns in Files**

In [35]:
string = """
  number 1: 343-009-6432
  number 2: 009.765.1246
  number 3: (455)654-8754
"""

phone_pattern = r"\(?\d{3}[-.)]\d{3}[-.]\d{4}"

print(re.findall(phone_pattern,string,re.MULTILINE))


['343-009-6432', '009.765.1246', '(455)654-8754']


In [None]:
import re

# search a file to find the first match
pattern = re.compile("sample")
with open("./sample_data/README.md") as f:
  for line in f:
    match  = re.search(pattern,line)
    if match:
      print("Found match in the line ", line)
      break

Found match in the line  This directory includes a few sample datasets to get you started.



**Extract product codes from file**

Product codes examples
- RE45-TG78
- 546-989

In [None]:
product_list = """
Product list:
Product 1:  456-009
Product 2:  WE44-TE33
Product 3:  WW67-TN33
Product 4:  544-353
Product 5:  5456-666

stored in sample.txt
"""

In [None]:
pattern = re.compile(r"\b(\d{3}-\d{3}|[A-Z]{2}[0-9]{2}-[A-Z]{2}[0-9]{2})\b")
with open("./sample.txt",'r') as f:
  data = f.read()
  products = re.findall(pattern,data)
  print(products)

# pattern_2 = r"\b[A-Z]{2}\d{2}-[A-Z]{2}\d{2}\b|\b\d{3}-\d{3}\b"

['456-009', 'WE44-TE33', 'WW67-TN33', '544-353']


**Matching specific patterns, e.g. email, zip code, phone, order no.**

In [None]:
phone_pattern = r'\d{3}-\d{3}-\d{4}'
email_pattern = r'\b[A-Za-z0-9._%+-]+@'
zip_pattern = r''
order_pattern = r''
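
One possible way to complete the email and zip patterns is sketched below. These are assumptions rather than the notebook's own patterns: the email pattern is a common simplified form, not a full RFC-compliant matcher, and the zip pattern assumes US ZIP or ZIP+4 codes. The order-number format is not given in the source, so it is left out. The `record` string is made up.

```python
import re

# Simplified email pattern (assumption): local part, "@", domain, dot, TLD.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

# US ZIP code (assumption): five digits with an optional "-1234" extension.
US_ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")

# Phone pattern from the cell above, with word boundaries added.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

# Hypothetical record to test against.
record = "Jane Roe, jane.roe@example.com, 093-098-8949, Springfield 62704-1234"

emails = EMAIL.findall(record)
zips = US_ZIP.findall(record)
phones = PHONE.findall(record)

print(emails)  # ['jane.roe@example.com']
print(zips)    # ['62704-1234']
print(phones)  # ['093-098-8949']
```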

**Parsing Application Logs**
- Parse log files for data types
- Parse log files for error codes
- Live Logging filters for errors
- Split log entries into fields, i.e. date, message, log level, IP addresses


In [42]:
import re

def extract_ip_address(log_line):
  #pattern = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
  pattern = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")
  ip_addresses = re.findall(pattern,log_line)
  for ip in  ip_addresses:
    print("ip address: ",ip)


def extract_http_code(log_line):
  pattern = re.compile(r"(?<=(?:HTTP/1\.1\W\s))\d+\b")
  http_codes = re.findall(pattern,log_line)
  for code in http_codes:
    print("http code :",code)

def extract_log_level(log_line):
  if log_line.strip() == '': return
  pattern = re.compile(r"(INFO|ERROR|DEBUG|WARNING)")
  log_level = re.search(pattern,log_line)
  # skip lines that carry no recognised log level
  if log_level is None: return
  print("log level: ",log_level.group(1))

In [33]:
log_file = """
83.149.9.216 - - [17/May/2015:10:05:56 +0000] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
24.236.252.67 - - [17/May/2015:10:05:40 +0000] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0"
93.114.45.13 - - [17/May/2015:10:05:14 +0000] "GET /articles/dynamic-dns-with-dhcp/ HTTP/1.1" 200 18848 "http://www.google.ro/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCwQFjAB&url=http%3A%2F%2Fwww.semicomplete.com%2Farticles%2Fdynamic-dns-with-dhcp%2F&ei=W88AU4n9HOq60QXbv4GwBg&usg=AFQjCNEF1X4Rs52UYQyLiySTQxa97ozM4g&bvm=bv.61535280,d.d2k" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
93.114.45.13 - - [17/May/2015:10:05:04 +0000] "GET /reset.css HTTP/1.1" 200 1015 "http://www.semicomplete.com/articles/dynamic-dns-with-dhcp/" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
93.114.45.13 - - [17/May/2015:10:05:45 +0000] "GET /style2.css HTTP/1.1" 200 4877 "http://www.semicomplete.com/articles/dynamic-dns-with-dhcp/" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
93.114.45.13 - - [17/May/2015:10:05:14 +0000] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
93.114.45.13 - - [17/May/2015:10:05:17 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "http://www.semicomplete.com/articles/dynamic-dns-with-dhcp/" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
93.114.45.13 - - [17/May/2015:10:05:21 +0000] "GET /images/web/2009/banner.png HTTP/1.1" 200 52315 "http://www.semicomplete.com/style2.css" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
66.249.73.135 - - [17/May/2015:10:05:40 +0000] "GET /blog/tags/ipv6 HTTP/1.1" 200 12251 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
50.16.19.13 - - [17/May/2015:10:05:10 +0000] "GET /blog/tags/puppet?flav=rss20 HTTP/1.1" 200 14872 "http://www.semicomplete.com/blog/tags/puppet?flav=rss20" "Tiny Tiny RSS/1.11 (http://tt-rss.org/)"
66.249.73.185 - - [17/May/2015:10:05:37 +0000] "GET / HTTP/1.1" 200 37932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
110.136.166.128 - - [17/May/2015:10:05:35 +0000] "GET /projects/xdotool/ HTTP/1.1" 200 12292 "http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&sqi=2&ved=0CFYQFjAE&url=http%3A%2F%2Fwww.semicomplete.com%2Fprojects%2Fxdotool%2F&ei=6cwAU_bRHo6urAeI0YD4Ag&usg=AFQjCNE3V_aCf3-gfNcbS924S6jZ6FqffA&bvm=bv.61535280,d.bmk" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0"
46.105.14.53 - - [17/May/2015:10:05:03 +0000] "GET /blog/tags/puppet?flav=rss20 HTTP/1.1" 200 14872 "-" "UniversalFeedParser/4.2-pre-314-svn +http://feedparser.org/"
110.136.166.128 - - [17/May/2015:10:05:06 +0000] "GET /reset.css HTTP/1.1" 200 1015 "http://www.semicomplete.com/projects/xdotool/" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0"
110.136.166.128 - - [17/May/2015:10:05:03 +0000] "GET /style2.css HTTP/1.1" 200 4877 "http://www.semicomplete.com/projects/xdotool/" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0"
110.136.166.128 - - [17/May/2015:10:05:41 +0000] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0"
110.136.166.128 - - [17/May/2015:10:05:32 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "http://www.semicomplete.com/projects/xdotool/" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0"
"""

log_file_2 = """
2023-07-16 10:49:43.000 [INFO] [main] [com.example.myapp.MyApp] - Starting MyApp on myhost with PID 1234 (/opt/myapp/myapp.jar started by user in /opt/myapp)
2023-07-16 10:49:43.000 [DEBUG] [main] [com.example.myapp.MyApp] - Running with Spring Boot v2.5.4, Spring v5.3.9
2023-07-16 10:49:43.000 [INFO] [main] [com.example.myapp.MyApp] - No active profile set, falling back to default profiles: default
2023-07-16 10:49:44.000 [INFO] [main] [com.example.myapp.MyApp] - Started MyApp in 1.234 seconds (JVM running for 1.567)
"""

In [43]:
for line in log_file.split("\n"):
  extract_ip_address(line)
  extract_http_code(line)

for line in log_file_2.split("\n"):
  extract_log_level(line)


ip address:  83.149.9.216
http code : 200
ip address:  24.236.252.67
http code : 200
ip address:  93.114.45.13
http code : 200
ip address:  93.114.45.13
http code : 200
ip address:  93.114.45.13
http code : 200
ip address:  93.114.45.13
http code : 200
ip address:  93.114.45.13
http code : 200
ip address:  93.114.45.13
http code : 200
ip address:  66.249.73.135
http code : 200
ip address:  50.16.19.13
http code : 200
ip address:  66.249.73.185
http code : 200
ip address:  110.136.166.128
http code : 200
ip address:  46.105.14.53
http code : 200
ip address:  110.136.166.128
http code : 200
ip address:  110.136.166.128
http code : 200
ip address:  110.136.166.128
http code : 200
ip address:  110.136.166.128
http code : 200
log level:  INFO
log level:  DEBUG
log level:  INFO
log level:  INFO
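
To split an application log entry into its fields in one pass, named groups are one option. The sketch below assumes every line follows the layout seen in `log_file_2` (date, time, level, thread, logger, message); the group names are illustrative.

```python
import re

# Assumed layout: "<date> <time> [LEVEL] [thread] [logger] - message"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"\[(?P<thread>[^\]]+)\] "
    r"\[(?P<logger>[^\]]+)\] - "
    r"(?P<message>.*)$"
)

# One line taken from log_file_2 above.
line = ("2023-07-16 10:49:43.000 [DEBUG] [main] [com.example.myapp.MyApp] - "
        "Running with Spring Boot v2.5.4, Spring v5.3.9")

match = LOG_LINE.match(line)
# match is None for lines that do not follow the assumed layout.
fields = match.groupdict() if match else {}

print(fields["date"], fields["level"])  # 2023-07-16 DEBUG
print(fields["message"])
```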


**Replacing Misspelled Words**

In [3]:
import re
def replace_misspellings(corrections,text):
  for misspelled,correct in corrections.items():
    # re.escape keeps any regex metacharacters in the key from being interpreted
    pattern = re.compile(rf"\b{re.escape(misspelled)}\b",re.IGNORECASE)
    text = pattern.sub(correct,text)
  return text


In [5]:
corrections = {
    'comon' : 'common',
    'structre':'structure',
    'basicaly':'basically'
}

text = """
  CTE or Comon Table Expression is way of writing sql in order to make it more
  readable and easier to understand. It involves creating intermediate resultset
  using with clause before writing the main query.
  Writing complex sql queries basicaly involves breaking down the problem/ask in
 small steps and then structre the sql. This is can be achieved by either using subquery or using CTE.
"""

new_text = replace_misspellings(corrections,text)
print(new_text)


  CTE or common Table Expression is way of writing sql in order to make it more 
  readable and easier to understand. It involves creating intermediate resultset 
  using with clause before writing the main query.
  Writing complex sql queries basically involves breaking down the problem/ask in
 small steps and then structure the sql. This is can be achieved by either using subquery or using CTE.
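
In the output above, "Comon" becomes lowercase "common". If the original capitalisation should be kept, one option is to pass a function to `sub`. This is a sketch under the assumption that all dictionary keys are lowercase; `replace_preserving_case` is a hypothetical helper, not part of the notebook.

```python
import re

def replace_preserving_case(corrections, text):
    # One alternation over all (escaped) misspellings; keys are assumed lowercase.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, corrections)) + r")\b",
        re.IGNORECASE,
    )

    def fix(match):
        word = match.group(0)
        replacement = corrections[word.lower()]
        # Capitalise the replacement when the misspelled word started with a capital.
        return replacement.capitalize() if word[0].isupper() else replacement

    return pattern.sub(fix, text)

fixed = replace_preserving_case(
    {"comon": "common", "basicaly": "basically"},
    "Comon Table Expression is basicaly a named subquery",
)
print(fixed)  # Common Table Expression is basically a named subquery
```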

