Python's standard library already ships a module, <code>re</code>, that covers a lot of what you might want to do with text. Let's take a look at it and at some of its most interesting functions.

In [9]:
import re

# Search
Wanna know if a string exists inside another one? Search no more, for you've found your Match! :)

In [13]:
string1 = 'Três pratos de trigo para três tigres tristes'
string2 = 'pratos'
string3 = 'pirulito'

if re.search(string2,string1):
    print(f'{string2} is in {string1}')
else:
    print(f'{string2} is NOT in {string1}')

if re.search(string3,string1):
    print(f'{string3} is in {string1}')
else:
    print(f'{string3} is NOT in {string1}')

pratos is in Três pratos de trigo para três tigres tristes
pirulito is NOT in Três pratos de trigo para três tigres tristes
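
One caveat, shown here as a minimal sketch: <code>re.search()</code> treats its first argument as a pattern, not as plain text, so characters such as `.` or `+` carry a special meaning. For a purely literal test the `in` operator is enough, and `re.escape()` neutralises special characters when you do want to stay with `re`. The strings `price` and `needle` are made up for illustration.

```python
import re

price = 'Total: 3x50 reais'   # hypothetical sample text
needle = '3.50'               # the '.' is a wildcard in a regex

# As a pattern, '3.50' matches '3x50' because '.' stands for any character
as_pattern = re.search(needle, price) is not None

# re.escape() turns the needle into a literal pattern
as_literal = re.search(re.escape(needle), price) is not None

# The plain `in` operator never interprets special characters
as_substring = needle in price

print(as_pattern, as_literal, as_substring)
```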


It's worth noting that re.search() actually returns a <code>re.Match</code> object when it finds something, or <code>None</code> when it doesn't.

In [15]:
print(type(re.search(string2,string1)))
print(type(re.search(string3,string1)))

<class 're.Match'>
<class 'NoneType'>


This object has some cool methods. Let's take a look at a few. There are more than the ones shown below; these are just some quick examples.

In [23]:
search_res = re.search(string2,string1)
print(search_res.start())
print(search_res.end())
print(search_res.span())

5
11
(5, 11)
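
Two more things the match object can do, as a small sketch reusing the same strings: `group()` returns the text that matched, and the numbers from `span()` can be used to slice the original string.

```python
import re

string1 = 'Três pratos de trigo para três tigres tristes'
match = re.search('pratos', string1)

# group() returns the text that matched
matched_text = match.group()

# span() gives (start, end), which can slice the original string
start, end = match.span()
sliced = string1[start:end]

print(matched_text, sliced)
```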


# Findall()
This will return every non-overlapping match of a given term in a string. It might look a bit dumb, but if you count the number of elements in the returned list you can find out how many times a certain term occurs inside a string.

It also becomes more interesting when we're looking for something that is best described using <b>repetition syntax</b>. More about it below.

In [24]:
re.findall('tr','Três pratos de trigo para três tigres tristes')

['tr', 'tr', 'tr']
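
The counting idea can be sketched like this. Note that the search is case-sensitive, so the capital 'Tr' in 'Três' is not counted unless the `re.IGNORECASE` flag is passed. The variable names are just for illustration.

```python
import re

text = 'Três pratos de trigo para três tigres tristes'

# Case-sensitive count: the capital 'Tr' in 'Três' is not matched
count_exact = len(re.findall('tr', text))

# re.IGNORECASE also counts 'Tr'
count_any_case = len(re.findall('tr', text, flags=re.IGNORECASE))

print(count_exact, count_any_case)
```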

# Repetition syntax

In [46]:
#'a' followed by zero or more 'n's
re.findall('an*',"annie I'm gonna kick your ass as an ant")

['ann', 'a', 'a', 'a', 'an', 'an']

In [47]:
#'a' followed by one or more 'n's
re.findall('an+',"annie I'm gonna kick your ass as an ant")

['ann', 'an', 'an']

In [50]:
#'a' followed by zero or one 'n'
re.findall('an?',"annie I'm gonna kick your ass as an ant")

['an', 'a', 'a', 'a', 'an', 'an']

In [53]:
#'a' followed by 'n' a specific number of times (here, exactly two)
re.findall('an{2}',"annie I'm gonna kick your ass as an ant")

['ann']

In [57]:
#'a' followed by 'n' a number of times within a range (here, zero to three)
re.findall('an{0,3}',"annie I'm gonna kick your ass as an ant")

['ann', 'a', 'a', 'a', 'an', 'an']
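
In all the examples above the repetition applies only to the character right before it (the 'n'), not to the whole 'an'. As a rough sketch, wrapping the sequence in a non-capturing group `(?:...)` makes the repetition apply to the whole thing. The sample text is made up.

```python
import re

text = 'banana bandana'   # hypothetical sample text

# 'an+' repeats only the 'n': an 'a' followed by one or more 'n's
single_char = re.findall('an+', text)

# '(?:an)+' repeats the whole 'an' sequence
whole_group = re.findall('(?:an)+', text)

print(single_char)
print(whole_group)
```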

# Character sets
What if you want to check whether one character or another exists? Put the candidates inside square brackets.

In [59]:
#Let's look for 'a's and 'n's
re.findall('[an]','another day in paradise')

['a', 'n', 'a', 'n', 'a', 'a']

In [61]:
#We can combine this with repetition syntax: an 'a' followed by one or more 'a's or 'n's.
#The '+' goes outside the brackets; inside them it would be a literal '+'.
re.findall('a[an]+','another day in paradise, annie')

['an', 'ann']
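
A quick sketch of why the position of the `+` matters: inside square brackets most special characters are taken literally, so a `+` there is just one more member of the set. The sample text is made up to include a literal '+'.

```python
import re

text = 'a+ an ann'   # hypothetical sample text

# Inside the brackets, '+' is just another character in the set
inside = re.findall('a[n+]', text)

# Outside the brackets, '+' repeats the set
outside = re.findall('a[n]+', text)

print(inside)
print(outside)
```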

# Exclusion
Wanna skip certain characters? Use <code>[^...]</code>, which matches anything that is not in the set. Let's match runs of characters that are neither spaces nor 'a's. This could be useful to get rid of punctuation.

In [69]:
re.findall('[^a ]+','três patas de tigre para três patos tristes')

['três', 'p', 't', 's', 'de', 'tigre', 'p', 'r', 'três', 'p', 'tos', 'tristes']
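
`findall()` only lists the pieces that are left. To actually get a string back with the characters removed, one option is `re.sub()`, which replaces every match with something else (here, an empty string). A minimal sketch, with variable names chosen for illustration:

```python
import re

text = 'três patas de tigre para três patos tristes'

# Replace every space or 'a' with an empty string
without_a_and_spaces = re.sub('[a ]', '', text)

# Same idea for punctuation: drop anything that is neither
# a word character nor whitespace
cleaned = re.sub(r'[^\w\s]', '', 'Hi Annie! LOL!')

print(without_a_and_spaces)
print(cleaned)
```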

# Character ranges
What if you wanna test against a lot of letters? Typing all of them would be kinda dumb. Use character ranges!

In [80]:
# Here we want sequences of lower case letters
re.findall('[a-z]+','Hi Annie! It looks like we will have another day in paradise! LOL!')

['i',
 'nnie',
 't',
 'looks',
 'like',
 'we',
 'will',
 'have',
 'another',
 'day',
 'in',
 'paradise']

In [81]:
# Here we want sequences of upper case letters
re.findall('[A-Z]+','Hi Annie! It looks like we will have another day in paradise! LOL!')

['H', 'A', 'I', 'LOL']

In [82]:
# Here we want an upper case letter followed by one or more lower case letters
re.findall('[A-Z][a-z]+','Hi Annie! It looks like we will have another day in paradise! LOL!')

['Hi', 'Annie', 'It']

In [85]:
# Here we want sequences of upper or lower case letters
re.findall('[A-Za-z]+','Hi Annie! It looks like we will have another day in paradise! LOL!')

['Hi',
 'Annie',
 'It',
 'looks',
 'like',
 'we',
 'will',
 'have',
 'another',
 'day',
 'in',
 'paradise',
 'LOL']
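
Keep in mind that `[a-z]` only covers the plain letters a to z, so accented letters like the 'ê' in 'três' fall outside it. One possible workaround, sketched below, is to add a second range for accented lower case letters. The range `à-ÿ` is an assumption that fits Portuguese text; it also happens to include the '÷' sign.

```python
import re

word = 'três'

# [a-z] only covers the ASCII letters a to z, so 'ê' splits the word
ascii_only = re.findall('[a-z]+', word)

# Adding the range à-ÿ covers the lower case Latin-1 accented letters
with_accents = re.findall('[a-zà-ÿ]+', word)

print(ascii_only)
print(with_accents)
```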

# Escape codes
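Regular expressions also understand "categories" of characters, written as a backslash followed by a letter. The upper case version of each code matches the opposite of the lower case one.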

In [88]:
#This looks for digits
re.findall(r'\d','Annie posted #Annieday 456 times.')

['4', '5', '6']
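
A side note on how these patterns are written: patterns with backslashes are usually typed as raw strings (with an `r` prefix), so Python hands the backslash to `re` untouched instead of interpreting it first. The sketch below uses `\b` to show the difference. In a regex it means a word boundary (not covered in this notebook), while in a normal Python string it is the backspace character.

```python
import re

post = 'Annie posted #Annieday 456 times.'

# In a normal string, '\b' is the backspace character, so nothing matches
normal = re.findall('\bAnn', post)

# In a raw string, the backslash reaches the regex engine untouched
raw = re.findall(r'\bAnn', post)

print(normal)
print(raw)
```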

In [89]:
#This looks for non-digits
re.findall(r'\D','Annie posted #Annieday 456 times.')

['A',
 'n',
 'n',
 'i',
 'e',
 ' ',
 'p',
 'o',
 's',
 't',
 'e',
 'd',
 ' ',
 '#',
 'A',
 'n',
 'n',
 'i',
 'e',
 'd',
 'a',
 'y',
 ' ',
 ' ',
 't',
 'i',
 'm',
 'e',
 's',
 '.']

In [90]:
#This looks for white spaces, tabs, line breaks...
re.findall(r'\s','Annie posted #Annieday 456 times.')

[' ', ' ', ' ', ' ']

In [91]:
#This looks for the opposite of the above: anything that is not whitespace
re.findall(r'\S','Annie posted #Annieday 456 times.')

['A',
 'n',
 'n',
 'i',
 'e',
 'p',
 'o',
 's',
 't',
 'e',
 'd',
 '#',
 'A',
 'n',
 'n',
 'i',
 'e',
 'd',
 'a',
 'y',
 '4',
 '5',
 '6',
 't',
 'i',
 'm',
 'e',
 's',
 '.']

In [92]:
#This looks for alphanumerics (and the underscore)
re.findall(r'\w','Annie posted #Annieday 456 times.')

['A',
 'n',
 'n',
 'i',
 'e',
 'p',
 'o',
 's',
 't',
 'e',
 'd',
 'A',
 'n',
 'n',
 'i',
 'e',
 'd',
 'a',
 'y',
 '4',
 '5',
 '6',
 't',
 'i',
 'm',
 'e',
 's']

In [93]:
#This looks for non-alphanumerics
re.findall(r'\W','Annie posted #Annieday 456 times.')

[' ', ' ', '#', ' ', ' ', '.']
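
These codes get a lot more useful once they're combined with the repetition syntax from earlier. A short sketch on the same sentence (the variable names are just for illustration):

```python
import re

post = 'Annie posted #Annieday 456 times.'

# One or more digits in a row: the whole number instead of single digits
numbers = re.findall(r'\d+', post)

# A '#' followed by one or more word characters: a hashtag
hashtags = re.findall(r'#\w+', post)

# Runs of non-whitespace characters: a rough split into tokens
tokens = re.findall(r'\S+', post)

print(numbers)
print(hashtags)
print(tokens)
```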