## Python was originally created to work with text

Parsing text is one of the areas where Python excels. You should learn a few ways to take advantage of this and use them, or at least understand what is possible.

In [1]:
import re  # The standard library for regular expressions

Here we are searching a text string for the word "France". Since there was no match the search returns None.

In [2]:
string = "The rain in Spain"
result = re.search("France", string)
print('result with "France":', result)

result with "France": None


Here we search for "rain" and find a match, so a re.Match object is returned.

In [3]:
result = re.search("rain", string)
print('result with "rain":', result)

result with "rain": <re.Match object; span=(4, 8), match='rain'>


So far, printing the result variable has shown the match object, not the string that was found. To get the matched string we need to extract it from the returned object with the .group() method.

We can also use other methods on the returned object to find out where the match occurred.

In [4]:
print('result', result)
print('result.group():', result.group())
print('result.start():', result.start())
print('result.span():', result.span())

result <re.Match object; span=(4, 8), match='rain'>
result.group(): rain
result.start(): 4
result.span(): (4, 8)
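
As a small sketch of how these positions relate to the original string, the span can be used to slice the text directly. The variable names `match` and `sliced` are only illustrative.

```python
import re

string = "The rain in Spain"
match = re.search("rain", string)

# span() returns (start, end); end is the index just past the match.
start, end = match.span()
sliced = string[start:end]

print(match.end())              # 8
print(sliced == match.group())  # True: slicing with the span reproduces the match
```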


Typical use requires checking whether the returned value is None to tell if a match was found. The most common syntax looks something like this.

In [5]:
if result is not None:
    print(f'\nFound "rain" in "{string}"')
else:
    print('We did not find a match')


Found "rain" in "The rain in Spain"


Or you can use try/except error handling. Trying to call .group() on None raises an _AttributeError_ exception. The try block catches it and the except clause is executed. Choose your own adventure.

In [6]:
try:
    found = result.group()  # Keep the match object in result so the cell can be re-run
    print(f'\nFound {found} in {string}')
except AttributeError:
    print('We did not find a match')


Found rain in The rain in Spain
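
Either style can be wrapped in a small helper. The sketch below assumes Python 3.8 or later for the `:=` assignment expression, and `find_text` is a hypothetical name, not something provided by the re module.

```python
import re


def find_text(pattern, string):
    """Return the matched text, or None when the pattern is not found."""
    # Hypothetical helper: assign and test the match object in one step.
    if (match := re.search(pattern, string)) is not None:
        return match.group()
    return None


print(find_text("rain", "The rain in Spain"))    # rain
print(find_text("France", "The rain in Spain"))  # None
```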


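We can also find multiple matches with re.findall(), which returns a list of the matched strings. Notice how the search is case sensitive unless the re.IGNORECASE flag is passed.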

In [7]:
text = 'those THAT tilt that That 8'
print('re.findall(r"that", text):', re.findall(r"that", text))
print('re.findall(r"that", text, flags=re.IGNORECASE):',
      re.findall(r"that", text, flags=re.IGNORECASE))

re.findall(r"that", text): ['that']
re.findall(r"that", text, flags=re.IGNORECASE): ['THAT', 'that', 'That']
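
re.findall() returns only the matched strings. If the positions are needed too, one option is re.finditer(), which yields a match object for each hit. This is a minimal sketch using the same text; the `spans` list is only illustrative.

```python
import re

text = 'those THAT tilt that That 8'

# Collect (span, matched text) pairs for every case-insensitive match.
spans = [(m.span(), m.group())
         for m in re.finditer(r"that", text, flags=re.IGNORECASE)]

for span, found in spans:
    print(span, found)
```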


This finds all instances of upper or lower case "t", first with a character class and then with a compiled pattern that uses the re.IGNORECASE flag.

In [8]:
print('re.findall(r"[tT]",text):', re.findall(r"[tT]", text))
pattern = re.compile(r"t", flags=re.IGNORECASE)
print('re.findall(pattern, text):', re.findall(pattern, text))

re.findall(r"[tT]",text): ['t', 'T', 'T', 't', 't', 't', 't', 'T', 't']
re.findall(pattern, text): ['t', 'T', 'T', 't', 't', 't', 't', 'T', 't']
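
A compiled pattern also has its own methods, so it does not have to be passed back into the module-level functions. A short sketch, with `via_module` and `via_method` as illustrative names:

```python
import re

text = 'those THAT tilt that That 8'
pattern = re.compile(r"t", flags=re.IGNORECASE)

via_module = re.findall(pattern, text)  # module-level function, as above
via_method = pattern.findall(text)      # method on the compiled pattern

print(via_module == via_method)  # True
print(len(via_method))           # 9
```

Compiling once and calling the methods is mostly a readability choice when the same pattern is reused many times.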


This finds all word characters using escape character notation. "\w" stands for any single alphanumeric character or underscore, so it matches the digit "8" as well as the letters.

In [9]:
print(re.findall(r"[\w]", text))

['t', 'h', 'o', 's', 'e', 'T', 'H', 'A', 'T', 't', 'i', 'l', 't', 't', 'h', 'a', 't', 'T', 'h', 'a', 't', '8']
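
To make the difference between "\w" and a plain letter range concrete, here is a small sketch on a made-up sample string (`sample` is not part of the notebook above).

```python
import re

sample = "snake_case 42"  # illustrative input

word_chars = re.findall(r"\w", sample)          # letters, digits and underscore
letters_only = re.findall(r"[a-zA-Z]", sample)  # ASCII letters only

print("".join(word_chars))    # snake_case42
print("".join(letters_only))  # snakecase
```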


Here we take the single string and split it into a list of strings. For this string the result is the same as calling string.split(). The "\s" matches any single whitespace character (space, tab, newline and so on).

In [10]:
string = "The rain in Spain"
re.split(r"\s", string)  # Raw string so "\s" reaches the regex engine unchanged

['The', 'rain', 'in', 'Spain']

This splits only at the first whitespace character and leaves the rest of the string as is.

In [11]:
re.split(r"\s", string, maxsplit=1)

['The', 'rain in Spain']
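
The two ways of splitting diverge when whitespace repeats. The sketch below uses a made-up string with a double space (`spaced`) to show the difference, and how "\s+" brings the two back in line.

```python
import re

spaced = "The  rain in Spain"  # illustrative input: two spaces after "The"

single = re.split(r"\s", spaced)   # splits at every whitespace character
runs = re.split(r"\s+", spaced)    # treats a run of whitespace as one separator
builtin = spaced.split()           # str.split() with no arguments

print(single)   # ['The', '', 'rain', 'in', 'Spain']
print(runs)     # ['The', 'rain', 'in', 'Spain']
print(builtin)  # ['The', 'rain', 'in', 'Spain']
```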

We can also replace a string with another string.

In [12]:
string = "The rain in Spain"
print(re.sub(r"\s", "_", string))  # Replace every whitespace character with an underscore
print(re.sub(r"\s", "_", string, count=1))  # Replace only the first whitespace character
print(re.sub(" in ", "_IN_", string))  # Replace " in ", including the surrounding spaces

The_rain_in_Spain
The_rain in Spain
The rain_IN_Spain
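
If we also want to know how many replacements were made, re.subn() returns the new string together with a count. A minimal sketch, with `new_string` and `count` as illustrative names:

```python
import re

string = "The rain in Spain"

# subn() behaves like sub() but returns (new_string, number_of_replacements).
new_string, count = re.subn(r"\s", "_", string)

print(new_string)  # The_rain_in_Spain
print(count)       # 3
```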


We can start to use the power of regular expressions with special characters.
The ^ character anchors the pattern to the start of the text and the $ character anchors it to the end. If no match is found, re.match() returns None.

In [13]:
text = 'someText'
pattern = r"^someText$"
search_result = re.match(pattern, text)
print(search_result.group())

text = '  someText'
search_result = re.match(pattern, text)
print(search_result)

someText
None
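
The second result is None because of the two leading spaces. It may help to compare the three related functions here: re.match() only tries at the start of the string, re.search() scans for the first position that works, and re.fullmatch() requires the whole string to match. A short sketch with illustrative variable names:

```python
import re

text = '  someText'

at_start = re.match(r"someText", text)        # fails: text starts with spaces
anywhere = re.search(r"someText", text)       # succeeds: found at index 2
whole = re.fullmatch(r"\s*someText", text)    # succeeds: pattern covers everything

print(at_start)          # None
print(anywhere.start())  # 2
print(repr(whole.group()))
```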


The . character matches any single character except a newline. The character after the . says how many to match. In this case * means match 0 or more of them, so the pattern can also match nothing at all.

In [14]:
text = 'someTextsomeText'
pattern = r".*"
search_result = re.match(pattern, text)
search_result.group()

'someTextsomeText'

In this case + means match 1 or more of them.

In [15]:
pattern = r".+"
search_result = re.match(pattern, text)
search_result.group()

'someTextsomeText'
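
On this text the two quantifiers give the same answer. The difference shows up on an empty string, and the newline limitation of "." shows up on multi-line text. A small sketch on made-up inputs:

```python
import re

star_on_empty = re.match(r".*", "")  # * accepts zero characters
plus_on_empty = re.match(r".+", "")  # + needs at least one character
first_line = re.match(r".*", "ab\ncd").group()  # . stops at the newline

print(repr(star_on_empty.group()))  # ''
print(plus_on_empty)                # None
print(first_line)                   # ab
```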

Now we can add some complexity. This looks for a string starting with "some".

In [16]:
pattern = r"^some.*"
search_result = re.match(pattern, text)
search_result.group()

'someTextsomeText'

Match anything up to an "x" followed by one or more "t" characters at the end of the string. Because .* is greedy and $ anchors the end, the match runs to the last "xt" in the text, not the first.

In [17]:
pattern = r".*xt+$"
search_result = re.match(pattern, text)
search_result.group()

'someTextsomeText'

Here the ? makes the "t" optional, so the pattern is anything followed by "x" and then zero or one "t". The text contains "xt" twice, and because .* is greedy it runs to the last "x". The ? then matches the final "t", so the whole string is returned.

In [18]:
pattern = r".*xt?"
search_result = re.match(pattern, text)
search_result.group()

'someTextsomeText'

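We can now say how many times a pattern should repeat with {}. The quantifier applies only to the element immediately before it, so this looks for "ab" followed by exactly one "c". Because .* is greedy, the match extends to the last "abc" in the text.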

In [19]:
text = 'abcabcABCCCCCCCabc'

pattern1 = r".*abc{1}"
search_result = re.match(pattern1, text)
search_result.group()

'abcabcABCCCCCCCabc'

Here we ask for one to five occurrences of the "C" in "ABC". Notice that there are more than five C's in the text but only five are returned.

In [20]:
pattern2 = r".*ABC{1,5}"
search_result = re.match(pattern2, text)
search_result.group()

'abcabcABCCCCC'

Here we use parentheses to group what the quantifier applies to. Instead of just the one character before {}, it now applies to "BC".

In [21]:
pattern3 = r".*A(BC){1}"
search_result = re.match(pattern3, text)
search_result.group()

'abcabcABC'
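
The effect of grouping is easier to see with a repeat count of 2. In this sketch, "abc{2}" means "ab" followed by two "c" characters, while "(abc){2}" means "abc" twice. The variable names are illustrative.

```python
import re

text = 'abcabcABCCCCCCCabc'

char_repeat = re.match(r"abc{2}", "abcc")   # {2} applies to "c" only
char_on_text = re.match(r"abc{2}", text)    # text starts "abca...", so no match
group_repeat = re.match(r"(abc){2}", text)  # {2} applies to the whole group

print(char_repeat.group())   # abcc
print(char_on_text)          # None
print(group_repeat.group())  # abcabc
```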

Here we search for "B" or "C". Because .* is greedy, it consumes as much as it can, so the match ends at the last "B" or "C" in the text rather than the first.

In [22]:
pattern5 = r".*(B|C)"
search_result = re.match(pattern5, text)
search_result.group()

'abcabcABCCCCCCC'

Here we look for an "A" followed by one letter in the range B to J.

In [23]:
pattern6 = r".*A[B-J]"
search_result = re.match(pattern6, text)
search_result.group()

'abcabcAB'

Let's do a more complicated search to highlight some important aspects of regular expressions, in particular the idea that quantifiers are greedy by default.

We start by matching the first word character.

In [24]:
text = 'more complicated string with 3456 to search.'
pattern = r"\w"
search_result = re.match(pattern, text)
search_result.group()

'm'

Search for 0 or more characters ending in a word character. Notice how the zero-or-more search is greedy: it returns as much as possible, stopping only before the final period, which is not a word character.

In [25]:
pattern = r".*\w"
search_result = re.match(pattern, text)
search_result.group()

'more complicated string with 3456 to search'

Here we are searching for 0 or more characters ending in a digit rather than a word character. The zero-or-more search is still greedy and returns the full number in the string.

In [26]:
pattern = r".*\d"
search_result = re.match(pattern, text)
search_result.group()

'more complicated string with 3456'
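
Greedy matching can be switched off by adding a ? after the quantifier, which makes it match as little as possible. A minimal sketch comparing the two on the same text, with `greedy` and `lazy` as illustrative names:

```python
import re

text = 'more complicated string with 3456 to search.'

greedy = re.match(r".*\d", text).group()  # runs to the last digit
lazy = re.match(r".*?\d", text).group()   # stops at the first digit

print(greedy)  # more complicated string with 3456
print(lazy)    # more complicated string with 3
```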

Here things start to run together. The search contains special characters (".", "*", "+"), some literal text ("with" and " "), and some escape characters. Escape characters start with \ to say that the next character is not to be matched literally but names a category to search for: \d is a digit, \w is a word character, and \s is a whitespace character. This pattern says find 0 or more characters up to a whitespace character followed by "with " followed by two digits followed by 1 or more characters, which translates into the entire string.

In [27]:
pattern = r".*\swith \d\d.+"
search_result = re.match(pattern, text)
search_result.group()

'more complicated string with 3456 to search.'

Here we search for two word characters from the beginning. 

In [28]:
text = 'abcdeABC123aB'

pattern = r"^\w{2}"
search_result = re.match(pattern, text)
search_result.group()

'ab'

This uses ranges of characters: [a-z] means all lower-case letters and [A-Z] means all upper-case letters. When the brackets are followed by "+", one or more of those characters must be found. The same goes for the \d digit and \w word characters. In addition, the parentheses group the results so each part is returned individually. To get the groups we call .groups() on the search result, which returns a tuple of matches. If there is no match, search_result is set to None. The commented-out patterns match the same string with different syntax. Note that the third one splits the groups differently, because [^a-z] also matches digits.

In [29]:
pattern = r"([a-z]+)([A-Z]+)(\d+)(\w+)$"
# pattern = r"^([a-z]{1,5})([A-Z]{1,5})(\d{1,5})(\w{1,5})$"
# pattern = r"([^A-Z]+)([^a-z]+)(\d+)(\w+)$"  # groups differ: ('abcde', 'ABC12', '3', 'aB')
# pattern = r"([^A-Z]+)([A-Z]+)(\d+)(\w+)$"
search_result = re.match(pattern, text)
print(search_result.groups())
print(type(search_result.groups()))

('abcde', 'ABC', '123', 'aB')
<class 'tuple'>


We can use standard Python syntax to unpack the tuple into variables directly. We just need to make sure the number of variables matches the number of groups we expect to find.

In [30]:
first, second, third, fourth = search_result.groups()
print(first, second, third, fourth)

abcde ABC 123 aB
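
An alternative to positional unpacking is to name the groups with the (?P<name>...) syntax and read them by name. The group names below (`lower`, `upper`, `digits`, `rest`) are made up for this sketch.

```python
import re

text = 'abcdeABC123aB'

# Same pattern as above, with a name attached to each group.
named_pattern = r"(?P<lower>[a-z]+)(?P<upper>[A-Z]+)(?P<digits>\d+)(?P<rest>\w+)$"
named_result = re.match(named_pattern, text)

print(named_result.group("upper"))  # ABC
print(named_result.groupdict())
```

Named groups keep working if a group is later added or reordered, which positional unpacking does not.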
