# String Manipulation and Regular Expressions

**Whirlwind Tour of Python** by Jake VanderPlas

https://jakevdp.github.io/WhirlwindTourOfPython/14-strings-and-regular-expressions.html

GitHub resources:

https://github.com/jakevdp/WhirlwindTourOfPython

Other resources:
- **Regular Expressions: Regexes in Python (Part 1)** https://realpython.com/regex-python/

- **Python Book - Regular expressions** https://book.onpy.dev/ch02-numbers-strings/s09-regular-expressions



## Simple String Manipulation in Python

### Formatting strings: Adjusting case

Python makes it quite easy to adjust the case of a string. Here we'll look at the `upper()`, `lower()`, `capitalize()`, `title()`, and `swapcase()` methods, using the following messy string as an example:

In [1]:
fox = "tHe qUICk bROWn fOx."

In [2]:
fox.upper()

'THE QUICK BROWN FOX.'

In [3]:
fox.lower()

'the quick brown fox.'

In [4]:
fox.title()

'The Quick Brown Fox.'

In [5]:
fox.capitalize()

'The quick brown fox.'

In [6]:
fox.swapcase()

'ThE QuicK BrowN FoX.'
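
A common use of these methods is case-insensitive comparison. The following is a minimal sketch; the variable names `a` and `b` and the second string are illustrative and not part of the original example.

```python
# Two strings that differ only in letter case (illustrative values).
a = "tHe qUICk bROWn fOx."
b = "THE QUICK BROWN FOX."

# A direct comparison is case-sensitive.
print(a == b)                   # False

# Normalizing both sides to lower case makes the comparison case-insensitive.
print(a.lower() == b.lower())   # True

# The case methods return new strings; the original is left unchanged.
print(a)                        # tHe qUICk bROWn fOx.
```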

### Formatting strings: Adding and removing spaces

Another common need is to remove spaces (or other characters) from the beginning or end of the string. The basic method of removing characters is the `strip()` method, which strips whitespace from the beginning and end of the line:

In [7]:
line = '         this is the content         '
line.strip()

'this is the content'

In [8]:
line.rstrip()

'         this is the content'

In [9]:
line.lstrip()

'this is the content         '

In [10]:
num = "000000000000435"
num.strip('0')

'435'

The opposite of this operation, adding spaces or other characters, can be accomplished using the `center()`, `ljust()`, and `rjust()` methods.

For example, we can use the `center()` method to center a given string within a given number of spaces:

In [11]:
line = "this is the content"
line.center(30)

'     this is the content      '

In [12]:
line.ljust(30)

'this is the content           '

In [13]:
'435'.rjust(10, '0')

'0000000435'

In [14]:
'435'.zfill(10)

'0000000435'
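
As a small sketch of where padding is useful, `ljust()` and `rjust()` can line up columns of text. The `rows` data and the column widths below are made up for illustration.

```python
# Hypothetical (name, count) rows, used only for illustration.
rows = [("apples", 3), ("kiwis", 12), ("figs", 435)]

lines = []
for name, count in rows:
    # Left-justify the name in 8 columns, right-justify the count in 5.
    lines.append(name.ljust(8) + str(count).rjust(5))

# Every line now has the same width, so the columns line up when printed.
print("\n".join(lines))
```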

### Finding and replacing substrings

If you want to find occurrences of a certain character or substring in a string, the `find()`/`rfind()`, `index()`/`rindex()`, and `replace()` methods are the built-in tools to reach for.

`find()` and `index()` are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:



In [15]:
line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')

16

In [16]:
line.index('fox')

16

The only difference between `find()` and `index()` is their behavior when the search string is not found; `find()` returns `-1`, while `index()` raises a `ValueError`:

In [17]:
line.find('bear')

-1

In [18]:
line.index('bear')

ValueError: substring not found
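
The difference matters when handling a missing substring. Here is a minimal sketch of both styles; the flag `raised` is a hypothetical name introduced for this example.

```python
line = 'the quick brown fox jumped over a lazy dog'

# With find(), a missing substring is signalled by the return value -1.
pos = line.find('bear')
if pos == -1:
    print("find: 'bear' not found")

# With index(), a missing substring is signalled by an exception.
raised = False
try:
    line.index('bear')
except ValueError:
    raised = True
    print("index: 'bear' not found")

# If only a yes/no answer is needed, the `in` operator is simpler.
print('fox' in line)   # True
```

Because `-1` is also a valid index (it refers to the last character), using the result of `find()` without checking it can silently produce a wrong answer, whereas `index()` fails loudly.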

The related `rfind()` and `rindex()` work similarly, except that they search from the end of the string rather than the beginning, returning the index of the last occurrence:

In [19]:
line.rfind('a')

35

For the special case of checking for a substring at the beginning or end of a string, Python provides the `startswith()` and `endswith()` methods:

In [21]:
line.endswith('dog')

True

In [22]:
line.startswith('fox')

False

To go one step further and replace a given substring with a new string, you can use the `replace()` method. Here, let's replace 'brown' with 'red':

In [23]:
line.replace('brown', 'red')

'the quick red fox jumped over a lazy dog'

In [24]:
line.replace('o', '--')

'the quick br--wn f--x jumped --ver a lazy d--g'
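
By default `replace()` substitutes every occurrence. As a short sketch, the optional third argument of `str.replace()` limits how many replacements are made; the name `first_only` is just for illustration.

```python
line = 'the quick brown fox jumped over a lazy dog'

# Count how many times 'o' occurs in the line.
print(line.count('o'))   # 4

# Replace only the first occurrence by passing a count of 1.
first_only = line.replace('o', '--', 1)
print(first_only)        # the quick br--wn fox jumped over a lazy dog

# replace() returns a new string; the original line is unchanged.
print(line)
```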

### Splitting and partitioning strings

If you would like to find a substring and then split the string based on its location, the `partition()` and/or `split()` methods are what you're looking for. Both will return a sequence of substrings.

The `partition()` method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:

In [25]:
line.partition('fox')

('the quick brown ', 'fox', ' jumped over a lazy dog')

The `rpartition()` method is similar, but searches from the right of the string.
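
A brief sketch of the difference between the two, using a made-up `setting` string that contains the separator twice:

```python
# Hypothetical "key=value" text in which the separator appears twice.
setting = "path=/usr/local=bin"

# partition() splits at the FIRST occurrence of the separator.
key, sep, value = setting.partition('=')
print(key, value)    # path /usr/local=bin

# rpartition() splits at the LAST occurrence of the separator.
head, sep, tail = setting.rpartition('=')
print(head, tail)    # path=/usr/local bin

# If the separator is absent, partition() still returns three elements:
# the whole string followed by two empty strings.
missing = "no separator".partition('=')
print(missing)       # ('no separator', '', '')
```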

The `split()` method is perhaps more useful; it finds all instances of the split-point and returns the substrings in between. The default is to split on any whitespace, returning a list of the individual words in a string:

In [26]:
line.split()

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

In [27]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

haiku.splitlines()

['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']

Note that if you would like to undo a `split()`, you can use the `join()` method, which returns a string built from a split-point and an iterable:

In [28]:
'--'.join(['1', '2', '3'])

'1--2--3'

A common pattern is to use the special character "\n" (newline) to join together lines that have been previously split, and recover the input:

In [29]:
print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))

matsushima-ya
aah matsushima-ya
matsushima-ya
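
The round trip can be checked directly. The sketch below also contrasts `split()` with no argument against `split(' ')`, a distinction the text above does not spell out; the `spaced` string is illustrative.

```python
haiku = "matsushima-ya\naah matsushima-ya\nmatsushima-ya"

# Splitting into lines and joining with "\n" recovers the input exactly.
lines = haiku.splitlines()
restored = "\n".join(lines)
print(restored == haiku)            # True

# With no argument, split() treats any run of whitespace as one separator.
spaced = "a  b"
default_split = spaced.split()
print(default_split)                # ['a', 'b']

# With an explicit separator, every occurrence counts, so empty strings appear.
explicit_split = spaced.split(' ')
print(explicit_split)               # ['a', '', 'b']
```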


## Format Strings

In the preceding methods, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats. Another use of string methods is to manipulate string representations of values of other types. Of course, string representations can always be found using the `str()` function; for example:

In [30]:
pi = 3.14159
str(pi)

'3.14159'

In [31]:
"The value of pi is " + str(pi)

'The value of pi is 3.14159'

In [32]:
"The value of pi is {}".format(pi)

'The value of pi is 3.14159'

In [33]:
"""First letter: {0}. Last letter: {1}.""".format('A', 'Z')

'First letter: A. Last letter: Z.'

In [34]:
"pi = {0:.3f}".format(pi)

'pi = 3.142'

As before, here the "0" refers to the index of the value to be inserted. The ":" marks that format codes will follow. The ".3f" encodes the desired precision: three digits beyond the decimal point, floating-point format.
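
The same format codes work in a few equivalent spellings. The sketch below shows positional, named, and f-string forms; f-strings require Python 3.6 or later and are not covered in the text above, and the names `by_name`, `f_string`, and `padded` are illustrative.

```python
pi = 3.14159

# Positional index, as in the example above.
by_index = "pi = {0:.3f}".format(pi)

# A keyword name can be used instead of an index.
by_name = "pi = {value:.3f}".format(value=pi)

# An f-string evaluates the expression inline with the same format code.
f_string = f"pi = {pi:.3f}"

print(by_index, by_name, f_string, sep="\n")   # each line reads: pi = 3.142

# Format codes can also set a field width: right-align in 10 columns, 2 decimals.
padded = "[{0:>10.2f}]".format(pi)
print(padded)                                  # [      3.14]
```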

## Flexible Pattern Matching with Regular Expressions

Fundamentally, regular expressions are a means of flexible pattern matching in strings. If you frequently use the command line, you are probably familiar with this type of flexible matching with the "*" character, which acts as a wildcard.

Regular expressions generalize this "wildcard" idea to a wide range of flexible string-matching syntaxes. The Python interface to regular expressions is contained in the standard library's `re` module; as a simple example, let's use it to duplicate the functionality of the string `split()` method:

In [35]:
import re
regex = re.compile(r'\s+')
regex.split(line)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

Here we've first compiled a regular expression, then used it to split a string. Just as Python's `split()` method returns a list of all substrings between whitespace, the regular expression `split()` method returns a list of all substrings between matches to the input pattern.

In this case, the input is "\s+": "\s" is a special character that matches any whitespace (space, tab, newline, etc.), and the "+" is a character that indicates one or more of the entity preceding it. Thus, the regular expression matches any substring consisting of one or more whitespace characters.

The `split()` method here is basically a convenience routine built upon this pattern matching behavior; more fundamental is the `match()` method, which will tell you whether the beginning of a string matches the pattern:

In [36]:
for s in ["     ", "abc  ", "  abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' matches
'abc  ' does not match
'  abc' matches
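
Since `match()` only looks at the beginning of the string, it helps to compare it with `search()` and `fullmatch()`, two other methods of compiled patterns in the `re` module. This is a small sketch; the result variable names are illustrative.

```python
import re

regex = re.compile(r'\s+')
s = "abc  "

# match() only succeeds if the pattern matches at the START of the string.
at_start = regex.match(s)
print(at_start)              # None

# search() scans the whole string for the first match anywhere.
anywhere = regex.search(s)
print(anywhere.span())       # (3, 5)

# fullmatch() requires the ENTIRE string to match the pattern.
print(regex.fullmatch("     ") is not None)   # True
print(regex.fullmatch("  abc") is not None)   # False
```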


As with `split()`, there are similar convenience routines to find the first match (like `str.index()` or `str.find()`) or to find and replace (like `str.replace()`). We'll again use the line from before:

In [37]:
line = 'the quick brown fox jumped over a lazy dog'

In [38]:
line.index('fox')

16

In [39]:
regex = re.compile('fox')
match = regex.search(line)
match.start()

16

Similarly, the `regex.sub()` method operates much like `str.replace()`:

In [40]:
line.replace('fox', 'BEAR')

'the quick brown BEAR jumped over a lazy dog'

In [41]:
regex.sub('BEAR', line)

'the quick brown BEAR jumped over a lazy dog'
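
Unlike `str.replace()`, `sub()` can replace anything the pattern matches, not just one fixed substring. As a hedged sketch, the following collapses runs of mixed whitespace into single spaces; the `messy` string is made up for illustration.

```python
import re

# Illustrative input with spaces, a tab, and a newline mixed together.
messy = "the   quick\tbrown\n fox"

whitespace = re.compile(r'\s+')

# Replace every run of whitespace with a single space.
collapsed = whitespace.sub(' ', messy)
print(collapsed)          # the quick brown fox

# The optional count argument limits the number of replacements.
only_first = whitespace.sub(' ', messy, count=1)
print(repr(only_first))   # 'the quick\tbrown\n fox'
```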

### Regular expressions offer far more flexibility

Regular expressions go well beyond what the string methods can do. For example, the following pattern matches simple email addresses of the form `name@domain.xyz`:

In [42]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')

In [43]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)

['guido@python.org', 'guido@google.com']
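
This pattern is deliberately simple and does not handle every valid address. The sketch below shows one limitation and one possible adjustment; the address `barack.obama@whitehouse.gov` and the name `email2` are used here only for illustration.

```python
import re

email = re.compile(r'\w+@\w+\.[a-z]{3}')
text = "To email Guido, try guido@python.org or the older address guido@google.com."

# sub() can mask every address the pattern finds.
masked = email.sub('--@--.--', text)
print(masked)    # To email Guido, try --@--.-- or the older address --@--.--.

# "\w" does not match a period, so a dotted user name is only partly matched.
partial = email.findall('barack.obama@whitehouse.gov')
print(partial)   # ['obama@whitehouse.gov']

# Allowing periods in the user name with a character class fixes this case.
email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
full = email2.findall('barack.obama@whitehouse.gov')
print(full)      # ['barack.obama@whitehouse.gov']
```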

### Basics of regular expression syntax

#### Simple strings are matched directly


In [44]:
regex = re.compile('ion')
regex.findall('Great Expectations')

['ion']


#### Some characters have special meanings

There are a handful of characters that have special meanings within regular expressions: **. ^ $ * + ? { } [ ] \ | ( )**

You can escape them with a back-slash:

In [45]:
regex = re.compile(r'\$')
regex.findall("the cost is $20")

['$']

The `r` prefix in `r'\$'` indicates a raw string; in standard Python strings, the backslash is used to indicate special characters. For example, a tab is indicated by "\t":

In [46]:
print('a\tb\tc')

a	b	c


In [47]:
print(r'a\tb\tc')

a\tb\tc
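
The difference is easy to see by comparing string lengths. The sketch below also uses `re.escape()`, a standard `re` function not mentioned above, to escape special characters automatically; the `literal` value is illustrative.

```python
import re

# In a standard string, "\t" is a single tab character.
print(len('\t'))           # 1

# In a raw string, the backslash and the "t" are two separate characters.
print(len(r'\t'))          # 2

# A raw string is just a more convenient way to write doubled backslashes.
print(r'\$' == '\\$')      # True

# re.escape() adds the backslashes needed to match text literally.
literal = "$20"
pattern = re.compile(re.escape(literal))
found = pattern.findall("the cost is $20")
print(found)               # ['$20']
```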


#### Special characters can match character groups

Special characters match specified groups of characters, and we've seen them before. In the email address regexp from before, we used the character "\w", which is a special marker matching any alphanumeric character (letters, digits, and the underscore). Similarly, in the simple `split()` example, we also saw "\s", a special marker indicating any whitespace character.

In [48]:
regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')

['e f', 'x i', 's 9', 's o']

The following table lists a few of these characters that are commonly useful:

| Character | Description                 | Character | Description                     |
|:----------|:----------------------------|:----------|:--------------------------------|
| `\d`      | Match any digit             | `\D`      | Match any non-digit             |
| `\s`      | Match any whitespace        | `\S`      | Match any non-whitespace        |
| `\w`      | Match any alphanumeric char | `\W`      | Match any non-alphanumeric char |

This is not a comprehensive list or description; for more details, see Python's regular expression syntax documentation. 

https://docs.python.org/3/library/re.html#re-syntax
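
A quick sketch of these markers in action, combined with the "+" repetition marker; the sample strings are made up for illustration.

```python
import re

# Illustrative text mixing words, digits, and punctuation.
text = "Order 66 shipped on 2024-01-15"

digits = re.findall(r'\d+', text)
print(digits)        # ['66', '2024', '01', '15']

non_digits = re.findall(r'\D+', text)
print(non_digits)    # ['Order ', ' shipped on ', '-', '-']

chunks = re.findall(r'\S+', text)
print(chunks)        # ['Order', '66', 'shipped', 'on', '2024-01-15']

# "\w" also matches the underscore, but not the hyphen.
words = re.findall(r'\w+', "snake_case and kebab-case")
print(words)         # ['snake_case', 'and', 'kebab', 'case']
```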

#### Square brackets match custom character groups

The following will match any lower-case vowel:

In [49]:
regex = re.compile('[aeiou]')
regex.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

Similarly, you can use a dash to specify a range: for example, "[a-z]" will match any lower-case letter, and "[1-3]" will match any of "1", "2", or "3":

In [50]:
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6')

['G2', 'H6']
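
Two further features of square brackets, not covered above, are negation with a leading "^" and combining several ranges in one group. A minimal sketch with illustrative inputs:

```python
import re

# A leading "^" inside the brackets negates the group:
# match any character that is neither a lower-case vowel nor whitespace.
consonants = re.findall(r'[^aeiou\s]', 'the fox')
print(consonants)    # ['t', 'h', 'f', 'x']

# Several ranges can be combined in a single group.
tokens = re.findall(r'[A-Za-z0-9]+', 'G2, H6!')
print(tokens)        # ['G2', 'H6']
```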

#### Wildcards match repeated characters

If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, "\w\w\w". Because this is such a common need, there is a specific syntax to match repetitions – curly braces with a number:


In [51]:
regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')

['The', 'qui', 'bro', 'fox']

There are also markers available to match any number of repetitions – for example, the "+" character will match one or more repetitions of what precedes it:

In [52]:
regex = re.compile(r'\w+')
regex.findall('The quick brown fox')

['The', 'quick', 'brown', 'fox']

The following is a table of the repetition markers available for use in regular expressions:

| **Character** | **Description**                                 | **Example**                                       |
|:--------------|:------------------------------------------------|:--------------------------------------------------|
| `?`           | Match zero or one repetitions of preceding      | "ab?" matches "a" or "ab"                         |
| `*`           | Match zero or more repetitions of preceding     | "ab*" matches "a", "ab", "abb", "abbb"...         |
| `+`           | Match one or more repetitions of preceding      | "ab+" matches "ab", "abb", "abbb"... but not "a"  |
| `{n}`         | Match n repetitions of preceding                | "ab{2}" matches "abb"                             |
| `{m,n}`       | Match between m and n repetitions of preceding  | "ab{2,3}" matches "abb" or "abbb"                 |
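
The examples in the table can be checked with `re.fullmatch()`, which requires the whole string to match. The helper `matches()` and the `candidates` list below are hypothetical names introduced only for this sketch.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a", "ab", "abb", "abbb", "abbbb"]

print(matches(r'ab?', candidates))       # ['a', 'ab']
print(matches(r'ab*', candidates))       # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(matches(r'ab+', candidates))       # ['ab', 'abb', 'abbb', 'abbbb']
print(matches(r'ab{2}', candidates))     # ['abb']
print(matches(r'ab{2,3}', candidates))   # ['abb', 'abbb']
```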


#### Parentheses indicate groups to extract

For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to group the results:

In [53]:
email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

In [54]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email3.findall(text)

[('guido', 'python', 'org'), ('guido', 'google', 'com')]

We can go a bit further and name the extracted components using the "(?P<name> )" syntax, in which case the groups can be extracted as a Python dictionary:

In [55]:
email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
match = email4.match('guido@python.org')
match.groupdict()

{'user': 'guido', 'domain': 'python', 'suffix': 'org'}
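
Named groups combine naturally with `finditer()`, which yields one match object per match in a longer text. The sketch below also shows that `match()` returns `None` when nothing matches, so the result should be checked before calling methods on it; the names `found` and `no_match` are illustrative.

```python
import re

email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
text = "To email Guido, try guido@python.org or the older address guido@google.com."

# Collect a dictionary of named groups for every address in the text.
found = [m.groupdict() for m in email4.finditer(text)]
for entry in found:
    print(entry['user'], entry['domain'], entry['suffix'])

# A single named group can be read with group().
first = email4.search(text)
print(first.group('domain'))   # python

# match() returns None when the pattern does not match at the start.
no_match = email4.match('not an address')
print(no_match)                # None
```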