##Python RegEx
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

Python has a built-in package called re, which can be used to work with Regular Expressions.

Import the re module

ref : https://www.w3schools.com/python/python_regex.asp

In [None]:
#import module
import re

##RegEx Functions

Function | Description
---------|---------------------
findall  | Returns a list containing all matches
search   | Returns a Match object if there is a match anywhere in the string
split    | Returns a list where the string has been split at each match
sub      | Replaces one or many matches with a string
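
As a quick illustration of the four functions listed above, here is a minimal sketch run against one sample sentence. The variable names (`all_ai`, `first_space`, `words`, `dashed`) are made up for this example and are not part of the `re` module.

```python
import re

txt = "The rain in Spain"

# findall: every non-overlapping match, returned as a list of strings
all_ai = re.findall("ai", txt)

# search: a Match object for the first match anywhere in the string, or None
first_space = re.search(r"\s", txt)

# split: cut the string at each match
words = re.split(r"\s", txt)

# sub: replace each match with the given string
dashed = re.sub(r"\s", "-", txt)

print(all_ai)               # ['ai', 'ai']
print(first_space.start())  # 3
print(words)                # ['The', 'rain', 'in', 'Spain']
print(dashed)               # The-rain-in-Spain
```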

In [None]:
# General form of a regex call:
# re.function(pattern, string)

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")
else:
  print("No match")


YES! We have a match!


In [None]:
thai_txt = 'ทนงทวยคงควรคอย'
y = re.search("^ทนง.*คอย$", thai_txt)

In [None]:
print(y)

<re.Match object; span=(0, 14), match='ทนงทวยคงควรคอย'>


In [None]:
if y:
  print("YES! We have a match!",' ', thai_txt)
else:
  print("No match")

YES! We have a match!   ทนงทวยคงควรคอย
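
The Match object printed above carries more than a yes/no answer. The sketch below, which uses an illustrative pattern chosen only for this example, shows how the matched text and its position can be read back with `group()`, `span()`, `start()`, and `end()`.

```python
import re

txt = "The rain in Spain"

# Look for a word that starts with an upper-case "S"
m = re.search(r"\bS\w+", txt)

if m:
    print(m.group())           # Spain   (the matched text)
    print(m.span())            # (12, 17) (start and end positions)
    print(m.start(), m.end())  # 12 17
else:
    print("No match")
```

Because `re.search` returns `None` when nothing matches, checking `if m:` before calling these methods avoids an `AttributeError`.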


##Metacharacters
Metacharacters are characters with a special meaning:<br><br>

Character          | Description         |  Example 
-------------------|---------------------|-----------
[]                 | A set of characters | "[a-m]"
\                  | Signals a special sequence <br> (can also be used to escape special characters) | "\d"
.                  | Any character (except newline character) | "he..o"
^                  | Starts with         | "^hello"
Dollar sign        | Ends with           | "planet$"
*                  | Zero or more occurrences | "he.*o"
+                  | One or more occurrences  | "he.+o"
?                  | Zero or one occurrences  | "he.?o"
{}                 | Exactly the specified number of occurrences| "he.{2}o"
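
To make the difference between the quantifiers `*`, `+`, `?`, and `{}` concrete, the sketch below tries the four example patterns from the table on a handful of made-up sample strings. `re.fullmatch` is used here so that a pattern only counts when it covers the whole sample; the `samples` list is an assumption for illustration.

```python
import re

samples = ["heo", "helo", "hello", "helllo"]
patterns = ["he.*o", "he.+o", "he.?o", "he.{2}o"]

# For each pattern, collect the samples it matches from start to end
results = {p: [s for s in samples if re.fullmatch(p, s)] for p in patterns}

for p, matched in results.items():
    print(p, "->", matched)

# he.*o -> ['heo', 'helo', 'hello', 'helllo']
# he.+o -> ['helo', 'hello', 'helllo']
# he.?o -> ['heo', 'helo']
# he.{2}o -> ['hello']
```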




In [None]:
text = "คุณอดิศรพรสว่างไสวเจิดจ้านครินชัยบดินสินบดีมาเธอร์เอ็ดเวิร์ดโรเบิร์ดฮุค"

#find ก-ง
p = '[ก-ง]'
r = re.findall(p, text)
print(r)

['ค', 'ง', 'ค', 'ค']


In [None]:
#find ชัยบดิน
p = 'ชัย.*ดิน'
r = re.findall(p, text)
print(r)

['ชัยบดิน']


In [None]:
#find เอ็ดเวิร์ดโรเบิร์ดฮุค
p = 'เอ็ดเวิร์ดโรเบิร์ดฮุค$'
r = re.findall(p, text)
print(r)

['เอ็ดเวิร์ดโรเบิร์ดฮุค']


##Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:<br><br>

Character          | Description         |  Example 
-------------------|---------------------|-----------
\A                 | Matches only if the specified characters are at the beginning of the string | r"\AThe"
\b                 | Matches where the specified characters are at the beginning or at the end of a word <br>(the r prefix makes Python treat the pattern as a raw string, so \b is not read as a backspace character) | r"\bain" <br> r"ain\b"
\B                 | Matches where the specified characters are present, but NOT at the beginning (or at the end) of a word | r"\Bain" <br> r"ain\B"
\d                 | Matches any digit character (0-9) | r"\d"
\D                 | Matches any character that is NOT a digit | r"\D"
\s                 | Matches any white space character | r"\s"
\S                 | Matches any character that is NOT white space | r"\S"
\w                 | Matches any word character <br> (letters, digits, and the underscore _ character; in Python 3 this includes letters from other scripts, such as Thai) | r"\w"
\W                 | Matches any character that is NOT a word character | r"\W"
\Z                 | Matches only if the specified characters are at the end of the string | r"Spain\Z"
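
The anchoring sequences in the table are easier to follow with a small run. This is a sketch only, using a sample sentence and variable names chosen for illustration; it also shows why the raw-string prefix matters for `\b`.

```python
import re

txt = "The rain in Spain"

# Without the r prefix, "\b" is a single backspace character, not a regex token
print(len("\b"), len(r"\b"))  # 1 2

starts = re.findall(r"\bain", txt)     # "ain" at the start of a word
ends = re.findall(r"ain\b", txt)       # "ain" at the end of a word
at_start = re.findall(r"\AThe", txt)   # "The" at the very start of the string
at_end = re.findall(r"Spain\Z", txt)   # "Spain" at the very end of the string
chunks = re.findall(r"\S+", txt)       # runs of non-whitespace characters

print(starts)    # []
print(ends)      # ['ain', 'ain']
print(at_start)  # ['The']
print(at_end)    # ['Spain']
print(chunks)    # ['The', 'rain', 'in', 'Spain']
```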

In [None]:
# Collect every digit in the text

text = 'bd84uureoiufkd8jh38fj928790j3hhkfjhlkahnvnnnmeoismfuckyouman69uehjhhbn77'
pattern = r'\d'  # raw string, so the backslash reaches the regex engine unchanged

r = re.findall(pattern, text)
print(r)

['8', '4', '8', '3', '8', '9', '2', '8', '7', '9', '0', '3', '6', '9', '7', '7']


In [None]:
r = re.findall(r'\d\d', text)  # pairs of adjacent digits
print(r) 

['84', '38', '92', '87', '90', '69', '77']


In [None]:
r = re.findall(r'\D', text)  # every character that is not a digit
print(r)

['b', 'd', 'u', 'u', 'r', 'e', 'o', 'i', 'u', 'f', 'k', 'd', 'j', 'h', 'f', 'j', 'j', 'h', 'h', 'k', 'f', 'j', 'h', 'l', 'k', 'a', 'h', 'n', 'v', 'n', 'n', 'n', 'm', 'e', 'o', 'i', 's', 'm', 'f', 'u', 'c', 'k', 'y', 'o', 'u', 'm', 'a', 'n', 'u', 'e', 'h', 'j', 'h', 'h', 'b', 'n']


##Sets
A set is a set of characters inside a pair of square brackets [] with a special meaning:<br><br>

Set                | Description         
-------------------|---------------------
[arn]              | Returns a match where one of the specified characters (a, r, or n) is present 
[a-n]              | Returns a match for any lower case character, alphabetically between a and n
[^arn]             | Returns a match for any character EXCEPT a, r, and n
[0123]             | Returns a match where any of the specified digits (0, 1, 2, or 3) are present
[0-9]              | Returns a match for any digit between 0 and 9
[0-5][0-9]         | Returns a match for any two-digit number from 00 to 59
[a-zA-Z]           | Returns a match for any character alphabetically between a and z, lower case OR upper case
[+]                | In sets, +, *, ., (), Dollar sign, and {} have no special meaning, so [+] means: return a match for any + character in the string
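
A few of the set forms from the table, tried on short made-up strings. The inputs and variable names below are assumptions picked to keep the results easy to check by eye.

```python
import re

# One of the listed characters
any_arn = re.findall("[arn]", "rain")

# Any character EXCEPT the listed ones
not_arn = re.findall("[^arn]", "rain")

# Two-digit numbers from 00 to 59 (for example, minutes)
minutes = re.findall("[0-5][0-9]", "07:45 and 99")

# Inside a set, + is a literal plus sign
pluses = re.findall("[+]", "8+11+2")

print(any_arn)  # ['r', 'a', 'n']
print(not_arn)  # ['i']
print(minutes)  # ['07', '45']
print(pluses)   # ['+', '+']
```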

In [None]:
text = '@มีดที่ว่าคม ยังแค่บาดนิ้ว แต่เธอ so cute มันช่างบาดใจ!! : 0887776969'
text2 = 'รถติด คือมรดกไทย อนุรักษ์ไว้ให้ลูกหลาน'

p = '[a-z]'

r = re.findall(p, text)
print(r)

['s', 'o', 'c', 'u', 't', 'e']


In [None]:
p = '[0-9][0-9][0-9]'
r = re.findall(p, text)
print(r)

['088', '777', '696']
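
Repeating a set three times can also be written with the `{}` quantifier from the metacharacter table. The sketch below uses a shortened, made-up sample string (only the phone number is taken from the text above) and also shows `+` for grabbing the whole run of digits.

```python
import re

sample = "call 0887776969 now"

# [0-9]{3} is equivalent to [0-9][0-9][0-9]
triples = re.findall(r"[0-9]{3}", sample)

# [0-9]+ takes one or more digits, so the full number comes back as one match
whole = re.findall(r"[0-9]+", sample)

print(triples)  # ['088', '777', '696']
print(whole)    # ['0887776969']
```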
