# Strings

## Strings as sequences of characters

For the purposes of extracting characters and substrings, strings can be considered to be sequences of characters, which means that you can use index or slice notation: 

In [3]:
x = "Hello"

print(x[0])   # 'H'
print(x[-1])  # 'o'
print(x[1:])  # 'ello'

H
o
ello


One use for slice notation with strings is to chop the newline off the end of a string (usually, a line that’s just been read from a file): 

In [1]:
x = "Goodbye\n"
print(x) # "Goodbye\n"

x = x[:-1]
print(x) # 'Goodbye'

# This code is just an example. You should know that Python strings have other, better
# methods to strip unwanted characters, but this example illustrates the usefulness of 
# slicing. 
x = "Goodbye\n"
x = x.strip()
print(x)

Goodbye

Goodbye
Goodbye


You can also determine how many characters are in a string by using the `len` function, just as you would find the number of elements in a list:

In [2]:
print(len("Goodbye"))  # 7

7


But strings aren’t lists of characters. The most noticeable difference between strings and lists is that unlike lists, strings can’t be modified. Attempting to say something like `string.append('c')` or `string[0] = 'H'` results in an error. You’ll notice in the previous example that I stripped off the newline from the string by creating a string that was a slice of the previous one, not by modifying the previous string directly. This is a basic Python restriction, imposed for efficiency reasons. 
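
A minimal sketch of that immutability: item assignment raises a `TypeError`, and the usual workaround is to build a new string from slices. The variable names are illustrative only.

```python
greeting = "hello"

# Item assignment is rejected because str objects are immutable.
try:
    greeting[0] = "H"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new string from a replacement character plus a slice of the old one.
capitalized = "H" + greeting[1:]

print(assignment_failed)  # True
print(capitalized)        # Hello
print(greeting)           # hello (the original is unchanged)
```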

## Basic String Operations

In [3]:
x = "Hello " + "World"
print(x)  # 'Hello World'

Hello World


In [4]:
print(8 * "x")  # 'xxxxxxxx'

xxxxxxxx


You’ve already seen a few of the character sequences that Python regards as special when used within strings: `\n` represents the newline character, and `\t` represents the tab character. Sequences of characters that start with a backslash and that are used to represent other characters are called escape sequences. Escape sequences are generally used to represent special characters—that is, characters (such as tab and newline) that don’t have a standard one-character printable representation.

Table 6.1. Escape sequences for string and bytes literals

| Escape sequence | Character represented |
|-----------------|-----------------------|
| \' 	| Single-quote character |
| \" 	| Double-quote character |
| \\ 	| Backslash character |
| \a 	| Bell character |
| \b 	| Backspace character |
| \f 	| Formfeed character |
| \n 	| Newline character |
| \r 	| Carriage-return character (not the same as \n) |
| \t 	| Tab character |
| \v 	| Vertical tab character |
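
The short sketch below shows a few of these escape sequences in use. It checks that each escape sequence counts as a single character; the sample strings are made up for illustration.

```python
# Each escape sequence stands for exactly one character.
tab_and_newline = "a\tb\n"
print(len(tab_and_newline))  # 4: 'a', tab, 'b', newline

# Escaped quotes let a literal contain its own delimiter.
quoted = "She said \"hi\""
print(quoted)                # She said "hi"

# A doubled backslash produces a single backslash character.
path = "C:\\temp"
print(path)                  # C:\temp
print(len(path))             # 7
```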

Because all strings in Python 3 are Unicode strings, they can also contain almost every character from every language available. Although a discussion of the Unicode system is far beyond the scope of this book, the following examples illustrate that you can also escape any Unicode character, either by Unicode name (`\N{...}`) or by number (`\u` followed by four hexadecimal digits):

In [5]:
unicode_a ='\N{LATIN SMALL LETTER A}'
print(unicode_a)  # 'a'

unicode_a_with_acute = '\N{LATIN SMALL LETTER A WITH ACUTE}'
print(unicode_a_with_acute)  # 'á'

print("\u00E1")  # 'á'

a
á
á


## String methods

For the purposes of this section, you need only remember that most string methods are attached to the string object they operate on by a dot (`.`), as in `x.upper()`. That is, they’re prepended with the string object followed by a dot. Because strings are immutable, the string methods are used only to obtain their return value and don’t modify the string object they’re attached to in any way. 

Anyone who works with strings is almost certain to find the `split` and `join` methods invaluable. They’re the inverse of one another: `split` returns a list of substrings in the string, and `join` takes a list of strings and puts them together to form a single string with the original string between each element. Typically, `split` uses whitespace as the delimiter to the strings it’s splitting, but you can change that behavior via an optional argument.
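
As a rough illustration of the inverse relationship, the sketch below splits a string on a delimiter and joins the pieces back together. The sample strings are assumptions chosen for the example.

```python
record = "alpha,beta,gamma"

fields = record.split(",")   # ['alpha', 'beta', 'gamma']
rebuilt = ",".join(fields)   # 'alpha,beta,gamma'

print(fields)
print(rebuilt == record)     # True

# With no argument, split collapses runs of whitespace, so the
# round trip normalizes spacing instead of reproducing it exactly.
messy = "one   two\tthree"
normalized = " ".join(messy.split())
print(normalized)            # one two three
```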

String concatenation using `+` is useful but **not efficient** for joining large numbers of strings into a single string, because each time `+` is applied, a **new string object** is created. The previous “Hello World” example produces two string objects, one of which is immediately discarded. A better option is to use the `join` method:


In [6]:
print(" ".join(["join", "puts", "spaces", "between", "elements"])) # 'join puts spaces between elements'

join puts spaces between elements


In [7]:
print("::".join(["Separated", "with", "colons"])) # 'Separated::with::colons'

Separated::with::colons


In [10]:
print("".join(["Separated", "by", "nothing"]))  # 'Separatedbynothing'

Separatedbynothing


In [23]:
%%timeit
# try different n!
s=""
for i in range(10000): 
    s += str(i)

3.28 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [24]:
%timeit "".join((str(i) for i in range(10000)))

2.28 ms ± 79.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The most common use of `split` is probably as a **simple parsing** mechanism for string-delimited records stored in text files. By default, `split` splits on any whitespace, not just a single space character, but you can also tell it to split on a particular sequence by passing it an optional argument:

In [25]:
x = "You\t\t can have tabs\t\n \t and newlines \n\n " \
               "mixed in"
print(x.split())  # ['You', 'can', 'have', 'tabs', 'and', 'newlines', 'mixed', 'in']

x = "Mississippi"
print(x.split("ss"))  # ['Mi', 'i', 'ippi']

['You', 'can', 'have', 'tabs', 'and', 'newlines', 'mixed', 'in']
['Mi', 'i', 'ippi']
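
As a sketch of the parsing use mentioned above, the snippet below splits a small block of delimited records into fields. The sample data and the `:` delimiter are made up for illustration.

```python
# Hypothetical colon-delimited records, one per line.
raw = "alice:30:Oslo\nbob:25:Lima\n"

records = []
for line in raw.splitlines():
    name, age, city = line.split(":")
    records.append((name, int(age), city))

print(records)  # [('alice', 30, 'Oslo'), ('bob', 25, 'Lima')]
```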


If you need more complex rules for how to split, consider the `re` module:

In [28]:
import re

x = "Mississippi"
pattern = re.compile(r'[is]', flags=re.IGNORECASE)
print(pattern.split(x)) # ['M', '', '', '', '', '', '', 'pp', '']

['M', '', '', '', '', '', '', 'pp', '']


Sometimes, it’s useful to permit the last field in a joined string to contain arbitrary text, perhaps including substrings that may match what `split` splits on when reading in that data. You can do this by specifying how many splits `split` should perform when it’s generating its result, via an optional second argument. If you specify `n` splits, `split` goes along the input string until it has performed `n` splits (generating a list with `n+1` substrings as elements) or until it runs out of string. Here are some examples: 

In [1]:
x = 'a b c d'
print(x.split(' ', 1))  # ['a', 'b c d']

print(x.split(' ', 2))  # ['a', 'b', 'c d']

print(x.split(' ', 9))  # ['a', 'b', 'c', 'd']

['a', 'b c d']
['a', 'b', 'c d']
['a', 'b', 'c', 'd']
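
One place where limiting the number of splits may help is a record whose last field is free-form text. The log line format below is hypothetical and serves only to show the idea.

```python
# Hypothetical log line: timestamp, level, then free-form text.
line = "2024-01-01 ERROR disk is almost full"

# Two splits give three fields; the message keeps its internal spaces.
timestamp, level, message = line.split(" ", 2)

print(timestamp)  # 2024-01-01
print(level)      # ERROR
print(message)    # disk is almost full
```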


You can use the functions `int` and `float` to convert strings to integer or floating-point numbers, respectively. If they’re passed a string that can’t be interpreted as a number of the given type, these functions raise a ValueError exception.

In [6]:
print(float('123.456'))  # 123.456

#print(float('xxyy'))  #  ValueError

print(int('3333'))  # 3333

#print(int('123.456'))  # ValueError

print(int('10000', 8))  # 4096

print(int('101', 2))  # 5

print(int('ff', 16))  # 255

#print(int('123456', 6))  # ValueError


123.456
3333
4096
5
255
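
Because a failed conversion raises `ValueError`, code that handles untrusted text usually wraps the call in `try`/`except`. The helper below, `parse_number`, is a hypothetical name introduced for this sketch and is not part of Python.

```python
def parse_number(text):
    """Return an int or float parsed from text, or None if neither fits."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue  # try the next conversion
    return None

print(parse_number("3333"))     # 3333
print(parse_number("123.456"))  # 123.456
print(parse_number("xxyy"))     # None
```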


A trio of surprisingly useful simple methods are the `strip`, `lstrip`, and `rstrip` functions. `strip` returns a new string that’s the same as the original string, except that any whitespace at the beginning or end of the string has been removed. `lstrip` and `rstrip` work similarly, except that they remove whitespace only at the left or right end of the original string, respectively: 

In [8]:
x = "  Hello,    World\t\t "

print(x.strip())  # 'Hello,    World'

print(x.lstrip())  # 'Hello,    World\t\t '

print(x.rstrip())  # '  Hello,    World'

Hello,    World
Hello,    World		 
  Hello,    World


In this example, tab characters are considered to be whitespace. When called without arguments, the `strip` methods remove any character that Python classifies as whitespace. The ASCII whitespace characters are listed in the `string.whitespace` constant, which has the same value on every platform:

In [11]:
import string
print(repr(string.whitespace))  # ' \t\n\r\x0b\x0c'

' \t\n\r\x0b\x0c'


You can change which characters `strip`, `rstrip`, and `lstrip` remove by passing a string containing the characters to be removed as an extra parameter: 

In [15]:
x = "www.python.org"

print(x.strip("w"))  # '.python.org'

print(x.strip("gor"))  # 'www.python.'

print(x.strip(".gorw"))  # 'python'

.python.org
www.python.
python


Note that `strip` removes any and all of the characters in the extra parameter string, no matter in which order they occur.
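
The sketch below illustrates that the argument acts as a set of characters rather than as a prefix or suffix. The sample strings other than `www.python.org` are assumptions made for the example.

```python
x = "www.python.org"

# The order of the characters in the argument does not matter.
same = x.strip("gor") == x.strip("rog") == x.strip("org")
print(same)  # True

# Stripping stops at the first character not in the set, so
# characters in the middle of the string are never touched.
print("xxhixhixx".strip("x"))  # hixhi

# A surprise when the set is mistaken for a suffix: the 'o' before
# ".org" is also in the set, so it is removed too.
print("photo.org".rstrip(".org"))  # phot
```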

<h3> Quick Check </h3>

If the string `x` equals `"(name, date),\n"`, which of the following would return a string containing `"name, date"`? 

1. `x.rstrip("),")`
2. `x.strip("),\n")`
3. `x.strip("\n)(,")`

The four basic string-searching methods are similar: `find`, `rfind`, `index`, and `rindex`. A related method, `count`, counts how many times a substring can be found in another string. I describe `find` in detail and then examine how the other methods differ from it.

`find` takes one required argument: the substring being searched for. `find` returns the position of the first character of the first instance of the substring in the string object, or `-1` if the substring doesn’t occur in the string:

In [17]:
x = "Mississippi"

print(x.find("ss"))  # 2

print(x.find("zz"))  # -1

2
-1


`find` can also take one or two additional, optional arguments. The first of these arguments, if present, is an integer `start`; it causes `find` to ignore all characters before position `start` in the string when searching for the substring. The second optional argument, if present, is an integer `end`; it causes `find` to ignore characters at or after position `end` in the string:

In [18]:
x = "Mississippi"

print(x.find("ss", 3))  # 5

print(x.find("ss", 0, 3))  # -1

5
-1
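
One use of the `start` argument is to walk through every occurrence of a substring by restarting the search just past the previous match. This is a sketch under that assumption, not the only way to do it.

```python
x = "Mississippi"

positions = []
start = 0
while True:
    found = x.find("ss", start)
    if found == -1:          # no more occurrences
        break
    positions.append(found)
    start = found + 1        # resume just past the last match

print(positions)  # [2, 5]
```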


`rfind` is almost the same as `find`, except that it starts its search at the **end of string** and so returns the position of the first character of the last occurrence of substring in string:


In [20]:
x = "Mississippi"
print(x.rfind("ss"))  # 5

5


`index` and `rindex` are identical to `find` and `rfind`, respectively, except for one difference: if `index` or `rindex` fails to find an occurrence of the substring in the string, it doesn’t return `-1` but raises a `ValueError` **exception**.
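
A short sketch of that difference, using the same `Mississippi` string as the other examples:

```python
x = "Mississippi"

# find signals a miss with -1 ...
print(x.find("zz"))    # -1

# ... while index raises ValueError, which must be caught.
try:
    position = x.index("zz")
except ValueError:
    position = None

print(position)        # None
print(x.index("ss"))   # 2
print(x.rindex("ss"))  # 5
```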

`count` is used identically to any of the previous four functions, but returns the number of **non-overlapping** times the given substring occurs in the given string: 

In [21]:
x = "Mississippi"

print(x.count("ss"))  # 2

2
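
The effect of counting only non-overlapping occurrences is easiest to see when matches could overlap. The strings below are chosen to show that case.

```python
# 'aa' could match at positions 0, 1, and 2, but after the match at 0
# the search resumes at position 2, so only two matches are counted.
pairs = "aaaa".count("aa")
print(pairs)  # 2

# The second 'issi' (position 4) overlaps the first (position 1).
issi = "Mississippi".count("issi")
print(issi)   # 1
```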


You can use two other string methods to search strings: `startswith`  and `endswith`. These methods return a `True` or `False` result, depending on whether the string they’re used on starts or ends with one of the strings given as parameters: 

In [23]:
x = "Mississippi"

print(x.startswith("Miss"))  # True

print(x.startswith("Mist"))  # False

print(x.endswith("pi"))  # True

print(x.endswith("p"))  # False

True
False
True
False


Both `startswith` and `endswith` can look for more than one string at a time. If the parameter is a `tuple` of strings, both methods check for all the strings in the tuple and return `True` if any one of them is found: 

In [25]:
print(x.endswith(("i", "u")))  # True

True
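
A typical use of the tuple form is to filter names by several possible suffixes at once. The file names below are hypothetical.

```python
# Hypothetical file names used only for illustration.
filenames = ["report.txt", "photo.png", "notes.md", "logo.jpg"]

# Keep the names that end with any of the listed extensions.
images = [name for name in filenames if name.endswith((".png", ".jpg"))]

print(images)  # ['photo.png', 'logo.jpg']
```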


### Modifying strings

Strings are **immutable**, but string objects have several methods that can operate on that string and return a new string that’s a modified version of the original string. This provides much the same effect as direct modification for most purposes. You can find a more complete description of these methods in the documentation.

You can use the `replace` method to replace occurrences of substring (its first argument) in the string with newstring (its second argument). This method also takes an optional third argument (see the documentation for details):


In [27]:
x = "Mississippi"

print(x.replace("ss", "+++"))  # 'Mi+++i+++ippi'

x = "a wonderful filename with whitespace"

print(x.replace(" ", "_"))  # 'a_wonderful_filename_with_whitespace'

Mi+++i+++ippi
a_wonderful_filename_with_whitespace


In [29]:
x = "~x ^ (y % z)" # a string in the old source language

table = x.maketrans("~^()", "!&[]")  # a table with a mapping between the two languages

print(x.translate(table))  # '!x & [y % z]'

!x & [y % z]


The second line uses `maketrans` to make up a translation table from its two string arguments. The two arguments must each contain the same number of characters, and a table is made such that looking up the nth character of the first argument in that table gives back the nth character of the second argument.

Next, the table produced by `maketrans` is passed to `translate`. Then `translate` goes over each of the characters in its string object and checks to see whether they can be found in the table given as its argument. If a character can be found in the translation table, `translate` replaces that character with the corresponding character looked up in the table to produce the translated string.

You can give `maketrans` an optional third argument to specify characters that should be removed from the string. See the documentation for details.
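
In Python 3, one way to delete characters during translation is to pass them as a third argument to `str.maketrans`. The sketch below reuses the expression from the previous example and drops the parentheses instead of replacing them.

```python
x = "~x ^ (y % z)"

# Map '~' to '!' and '^' to '&', and delete both parentheses.
table = str.maketrans("~^", "!&", "()")

translated = x.translate(table)
print(translated)  # !x & y % z
```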

Other string methods perform more specialized tasks. `lower` converts all alphabetic characters in a string to lowercase, and `upper` does the opposite. `capitalize` capitalizes the first character of a string and lowercases the rest, and `title` capitalizes all words in a string. `swapcase` converts lowercase characters to uppercase and uppercase to lowercase in the same string. `expandtabs` gets rid of tab characters in a string by replacing each tab with enough spaces to reach the next tab stop (every 8 columns by default). `ljust`, `rjust`, and `center` pad a string with spaces to justify it in a certain field width. `zfill` left-pads a numeric string with zeros. Refer to the documentation for details on these methods.
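
The following sketch runs each of these methods once on small sample strings. The `repr` calls are there only to make the padding spaces visible.

```python
s = "hello World"

print(s.lower())       # hello world
print(s.upper())       # HELLO WORLD
print(s.capitalize())  # Hello world
print(s.title())       # Hello World
print(s.swapcase())    # HELLO wORLD

print("a\tb".expandtabs(4))  # a   b
print(repr("hi".ljust(6)))   # 'hi    '
print(repr("hi".rjust(6)))   # '    hi'
print(repr("hi".center(6)))  # '  hi  '
print("42".zfill(5))         # 00042
```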


### Modifying strings with list manipulations

Because strings are immutable objects, you have no way to manipulate them directly in the same way that you can manipulate lists. Although the operations that produce new strings (leaving the original strings unchanged) are useful for many things, sometimes you want to be able to manipulate a string as though it were a list of characters. In that case, turn the string into a list of characters, do whatever you want, and then turn the resulting list back into a string: 

In [11]:
text = "Hello, World"
wordList = list(text)  # ['H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd']

wordList[6:] = []  # ['H', 'e', 'l', 'l', 'o', ',']
wordList.reverse() 

text = "".join(wordList)

print(text) # ,olleH

,olleH


### Useful methods and constants

`string` objects also have several useful methods to report various characteristics of the string, such as whether it consists of digits or alphabetic characters, or is all uppercase or lowercase: 

In [12]:
x = "123"
print(x.isdigit())  # True

print(x.isalpha())  # False

x = "M"
print(x.islower())  # False

print(x.isupper())  # True

True
False
False
True


For a list of all the possible string methods, refer to the string section of the official Python documentation.

Finally, the `string` module defines some useful constants. You’ve already seen `string.whitespace`, which is a string made up of the ASCII characters that Python treats as whitespace. `string.digits` is the string `'0123456789'`. `string.hexdigits` includes all the characters in `string.digits`, as well as `'abcdefABCDEF'`, the extra characters used in hexadecimal numbers. `string.octdigits` contains `'01234567'`, only those digits used in octal numbers. `string.ascii_lowercase` contains all lowercase ASCII letters; `string.ascii_uppercase` contains all uppercase ASCII letters; `string.ascii_letters` contains all the characters in `string.ascii_lowercase` and `string.ascii_uppercase`. You might be tempted to try assigning to these constants to change the behavior of the language. Python would let you get away with this action, but it probably would be a bad idea.


Remember that strings are **sequences of characters**, so you can use the convenient Python `in` operator to test for a character’s membership in any of these strings, although usually the existing string methods are simpler and easier.
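
A brief sketch of membership tests against the `string` constants, next to a method-based check for comparison. The sample characters are arbitrary.

```python
import string

ch = "7"
in_digits = ch in string.digits
print(in_digits)     # True
print(ch.isdigit())  # True, a method-based check for the same character

print("G" in string.hexdigits)   # False
print(" " in string.whitespace)  # True

# in also tests for substrings, not just single characters.
print("ss" in "Mississippi")     # True
```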

### Common string operations

Note that these methods don’t change the string itself; they return either a location in the string or a new string.

| String operation | Explanation | Example |
|------------------|-------------|---------|
| `+`     | Adds two strings together         | `x = "hello " + "world"` |
| `*`     | Replicates a string               | `x = " " * 20`           |
| `upper` | Converts a string to uppercase    | `x.upper()` | 
| `lower` | Converts a string to lowercase 	| `x.lower()` |
| `title` | Capitalizes the first letter of each word in a string 	|`x.title()` |
| `find`, `index` | Searches for the target in a string | `x.find(y)` `x.index(y)` | 
| `rfind`, `rindex` | Searches for the target in a string, from the end of the string | `x.rfind(y)` `x.rindex(y)` |
| `startswith`, `endswith` | Checks the beginning or end of a string for a match | `x.startswith(y)` `x.endswith(y)` |
| `replace` | Replaces the target with a new string | `x.replace(y, z)` |
| `strip`, `rstrip`, `lstrip` | Removes whitespace or other characters from the ends of a string | `x.strip()` |
| `encode` | Converts a Unicode string to a bytes object | `x.encode("utf_8")` |

#### Try This

Suppose that you have a list of strings in which some (but not necessarily all) of the strings begin and end with the double quote character: 

```python
x = ['"abc"', 'def', '"ghi"', '"klm"', 'nop']
```

What code would you use on each element to remove just the double quotes?

In [15]:
x = ['"abc"', 'def', '"ghi"', '"klm"', 'nop']



What code could you use to find the position of the last `p` in `Mississippi`? When you’ve found that position, what code would you use to remove just that letter? 

In [17]:
s = "Mississippi"



## Converting from objects to strings

In Python, almost anything can be converted to some sort of a string representation by using the built-in `repr` function. Lists are the only complex Python data types you’re familiar with so far, so here, I turn some lists into their representations: 

In [19]:
print(repr([1, 2, 3]))  # '[1, 2, 3]'

x = [1]
x.append(2)
x.append([3, 4])

print('the list x is ' + repr(x))  # 'the list x is [1, 2, [3, 4]]'

[1, 2, 3]
the list x is [1, 2, [3, 4]]


I’ve covered how Python can convert any object to a string that describes that object. The truth is, Python can do this in either of two ways. The `repr` function always returns what might be loosely called the **formal string representation** of a Python object. More specifically, `repr` returns a **string representation of a Python object** from which the original object can be rebuilt. For large, complex objects, this may not be the sort of thing you want to see in debugging output or status reports.

Python also provides the built-in `str` function. In contrast to `repr`, `str` is intended to produce printable string representations, and it can be applied to any Python object. `str` returns what might be called the informal string representation of the object. A `string` returned by `str` need not define an object fully and is intended to be **read by humans**, not by Python code.

You won’t notice much difference between `repr` and `str` when you start using them, because for many built-in objects, such as numbers and lists, the two return the same text. Strings themselves are a visible exception: `repr` adds quotes and escapes, whereas `str` returns the string unchanged. The difference becomes most important when you start defining your own classes, which can supply separate `str` and `repr` results.


In [32]:
from pathlib import Path

p = Path.cwd()  # cwd is a class method, so no instance is needed

print(repr(p))
print(str(p))

WindowsPath('c:/Users/micha/work/git/python-complete/nb')
c:\Users\micha\work\git\python-complete\nb
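
To see how a class of your own can give `str` and `repr` different results, consider the minimal sketch below. `Point` is a hypothetical class written only for this illustration; it defines the special methods `__repr__` and `__str__`, which `repr` and `str` call.

```python
class Point:
    """Hypothetical class used only to illustrate str versus repr."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # Formal representation: looks like the constructor call.
        return "Point(" + repr(self.x) + ", " + repr(self.y) + ")"

    def __str__(self):
        # Informal representation: meant for human readers.
        return "(" + str(self.x) + ", " + str(self.y) + ")"

p = Point(1, 2)
print(repr(p))  # Point(1, 2)
print(str(p))   # (1, 2)
print([p])      # [Point(1, 2)]  (a list shows its elements with repr)
```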


## String interpolation 

Starting in Python 3.6, there’s a way to create string constants containing arbitrary values, which is called **string interpolation**. String interpolation is a way to include the values of Python expressions inside literal strings. These **f-strings**, as they’re commonly called because they are prefixed with `f`, use a syntax similar to that of the format method, but with a little less overhead. The following examples should give you a basic idea of how f-strings work: 

In [33]:
value = 42
message = f"The answer is {value}"

print(message)  # The answer is 42

The answer is 42
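
The braces can hold any expression, not just a variable name. The sketch below shows a method call, function calls, and arithmetic inside f-strings; the variable names are illustrative.

```python
name = "world"
items = [3, 1, 2]

print(f"Hello, {name.upper()}!")                # Hello, WORLD!
print(f"{len(items)} items, max {max(items)}")  # 3 items, max 3
print(f"2 * 21 = {2 * 21}")                     # 2 * 21 = 42

# !r applies repr to the value before it is inserted.
print(f"name is {name!r}")                      # name is 'world'
```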


Just as with the format method, format specifiers may be added:

In [55]:
import math
pi = math.pi

print(f"pi is {pi}")  # pi is 3.141592653589793
print(f"pi is {pi:10.2f}")  # pi is       3.14
print(f"pi is {pi:1e}")  # pi is 3.141593e+00

pi is 3.141592653589793
pi is       3.14
pi is 3.141593e+00
