In [2]:
import re

The `re` module offers a set of functions that allows us to search a string for a match:
--------------------------
|Function|Description|
|:---|:---|
|findall|Returns a list containing all matches|
|search|Returns a Match object if there is a match anywhere in the string|
|split|Returns a list where the string has been split at each match|
|sub|Replaces one or many matches with a string|



## `Metacharacters`

Metacharacters are characters with a special meaning:
------------------------------------
|Character|	Description|	Example|
|:---|:---|:---|
|[]|A set of characters|"[a-m]"|
|\\ |Signals a special sequence (can also be used to escape special characters)|"\d"|
|.|Any character (except newline character)|"he..o"|
|^|Starts with|"^hello"|
|$|Ends with|"planet$"|
|*|Zero or more occurrences|"he.*o"|
|+|One or more occurrences|"he.+o"|
|?|Zero or one occurrences|"he.?o"|
|{}|Exactly the specified number of occurrences|"he.{2}o"|
|\||Either or|"falls\|stays"|
|()|Capture and group| |


In [12]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or 1  (any) character, and an "o":

x = re.findall("he.?o", txt) # ? means 0 or 1

print(x)

#No match: there are two characters ("ll") between "he" and the "o", and ? allows at most one

[]


In [11]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 1 or more  (any) characters, and an "o":

x = re.findall("he.+o", txt)

print(x)

['hello']


In [10]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or more  (any) characters, and an "o":

x = re.findall("he.*o", txt)

print(x)

['hello']


In [9]:
import re

txt = "hello planet"

#Check if the string ends with 'planet':

x = re.findall("planet$", txt)
if x:
  print("Yes, the string ends with 'planet'")
else:
  print("No match")

Yes, the string ends with 'planet'


In [8]:
import re

txt = "hello planet"

#Check if the string starts with 'hello':

x = re.findall("^hello", txt)
if x:
  print("Yes, the string starts with 'hello'")
else:
  print("No match")


Yes, the string starts with 'hello'


In [7]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":

x = re.findall("he..o", txt)
print(x)

['hello']


In [6]:
# \ signals a special sequence (can also be used to escape special characters)
txt = "That will be 59 dollars"

#Find all digit characters:

x = re.findall(r"\d", txt) # raw string, so Python passes the backslash through to re unchanged
print(x)


['5', '9']


In [1]:
# [] A set of characters 
import re

txt = "The rain in Spain"

#Find all lower case characters alphabetically between "a" and "m":

x = re.findall("[a-m]", txt)
print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']
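
The cells above do not exercise `{}`, `|`, or `()` from the metacharacter table. The following is a minimal sketch of those three, reusing the same `txt`; the variable names and the `planet|world` pattern are illustrative choices, not part of the original notebook.

```python
import re

txt = "hello planet"

# {} : "he", then exactly two (any) characters, then "o"
exact = re.findall(r"he.{2}o", txt)

# | : either "planet" or "world"
either = re.findall(r"planet|world", txt)

# () : with one group, findall returns only the captured part
group = re.findall(r"he(..)o", txt)

print(exact)   # ['hello']
print(either)  # ['planet']
print(group)   # ['ll']
```

Note that once a pattern contains a capturing group, `findall` returns the group contents rather than the whole match.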


## `Special Sequences`


A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:
------------------------------------
|Character	|Description	|Example|
|:---|:---|:---|
|\A|Returns a match if the specified characters are at the beginning of the string|r"\AThe"|
|\b|Returns a match where the specified characters are at the beginning or at the end of a word (the "r" prefix makes Python treat the pattern as a "raw string"; without it, "\b" would be read as a backspace character)|r"\bain" r"ain\b"|
|\B|Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word|r"\Bain" r"ain\B"|
|\d|Matches any digit (0-9)|r"\d"|
|\D|Matches any character that is NOT a digit|r"\D"|
|\s|Matches any white space character|r"\s"|
|\S|Matches any character that is NOT a white space character|r"\S"|
|\w|Matches any word character (letters a to z and A to Z, digits 0-9, and the underscore _ character)|r"\w"|
|\W|Matches any character that is NOT a word character|r"\W"|
|\Z|Returns a match if the specified characters are at the end of the string|r"Spain\Z"|
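
To make the table concrete, here is a short sketch that applies several of these special sequences to the sentence used elsewhere in this notebook. The variable names are illustrative only, and every pattern is written as a raw string.

```python
import re

txt = "The rain in Spain"

at_start = re.findall(r"\AThe", txt)     # "The" at the very beginning
word_start = re.findall(r"\bain", txt)   # "ain" at the start of a word
word_end = re.findall(r"ain\b", txt)     # "ain" at the end of a word
not_start = re.findall(r"\Bain", txt)    # "ain" NOT at the start of a word
digits = re.findall(r"\d", txt)          # digit characters
spaces = re.findall(r"\s", txt)          # white space characters
at_end = re.findall(r"Spain\Z", txt)     # "Spain" at the very end

print(at_start)    # ['The']
print(word_start)  # []
print(word_end)    # ['ain', 'ain']
print(not_start)   # ['ain', 'ain']
print(digits)      # []
print(spaces)      # [' ', ' ', ' ']
print(at_end)      # ['Spain']
```

`\bain` finds nothing because "ain" only ever appears in the middle or at the end of "rain" and "Spain", never at the start of a word.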


## `Sets`


A set is a set of characters inside a pair of square brackets [] with a special meaning:
---------------------------
|Set	|Description|
|:---|:---|
|[arn]|Returns a match where one of the specified characters (a, r, or n) is present|
|[a-n]|Returns a match for any lower case character, alphabetically between a and n|
|[^arn]|Returns a match for any character EXCEPT a, r, and n|
|[0123]|Returns a match where any of the specified digits (0, 1, 2, or 3) are present|
|[0-9]|Returns a match for any digit between 0 and 9|
|[0-5][0-9]|Returns a match for any two-digit number from 00 to 59|
|[a-zA-Z]|Returns a match for any character alphabetically between a and z, lower case OR upper case|
|[+]|In sets, +, *, ., (), $, {} have no special meaning, so [+] means: return a match for any + character in the string|
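
A few of these sets in action, as a rough sketch. The sample strings (including the made-up time "8:45") and the variable names are assumptions chosen for illustration.

```python
import re

txt = "The rain in Spain at 8:45"

# [arn]: any one of the characters a, r, or n
any_of = re.findall(r"[arn]", txt)

# [^arn]: any character EXCEPT a, r, and n
none_of = re.findall(r"[^arn]", "rain")

# [0-9]: any single digit
digits = re.findall(r"[0-9]", txt)

# [0-5][0-9]: a two-digit number from 00 to 59
two_digit = re.findall(r"[0-5][0-9]", txt)

# [+]: inside a set, + is just a literal plus sign
plus = re.findall(r"[+]", "1+1=2")

print(any_of)     # ['r', 'a', 'n', 'n', 'a', 'n', 'a']
print(none_of)    # ['i']
print(digits)     # ['8', '4', '5']
print(two_digit)  # ['45']
print(plus)       # ['+']
```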

In [13]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


In [14]:
import re

txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]


In [15]:
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3
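
`re.search` returns `None` when nothing matches, so calling `.start()` directly on the result would raise an `AttributeError` in that case. A minimal sketch of guarding against it, assuming we look for a digit that the sentence does not contain (the `message` variable is illustrative):

```python
import re

txt = "The rain in Spain"

# There are no digits in txt, so search() returns None instead of a Match object
match = re.search(r"\d", txt)

if match is None:
    message = "No digit found"
else:
    message = f"The first digit is located in position: {match.start()}"

print(message)  # No digit found
```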


In [16]:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


In [18]:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=2) # split at the first 2 matches only
print(x)

['The', 'rain', 'in Spain']


In [20]:
import re

txt = "The rain in Spain"
x = re.sub("rain", "9", txt)
print(x)

The 9 in Spain
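
`re.sub` replaces every match by default; its `count` parameter limits how many matches are replaced. A brief sketch, replacing white space with "9" as an arbitrary illustrative choice:

```python
import re

txt = "The rain in Spain"

# Replace every white space character
all_replaced = re.sub(r"\s", "9", txt)

# Replace only the first two white space characters
first_two = re.sub(r"\s", "9", txt, count=2)

print(all_replaced)  # The9rain9in9Spain
print(first_two)     # The9rain9in Spain
```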
