# Chapter 11 - Strings
---------

## Strings
A string is a sequence of characters. In Python, a string is a data type whose value is enclosed in single or double quotes. A string can contain letters, digits, and symbols, and can be a single character long or hold the text of an entire book.

The basic usage of a string is as follows:

In [12]:
# We can declare a string using single or double quotes, there is no difference.
course = "Data Processing"
university = 'Tilburg University'

However, if your string contains a quote-symbol it can lead to errors if you enclose it with the wrong quotes.

In [11]:
# Run this cell to see the error generated by the following line.
restaurant = 'Wendy's'

SyntaxError: invalid syntax (<ipython-input-11-025f82182cb9>, line 2)

In the example above, the error points at the letter s. The second single quote closes the string we started, so Python does not expect anything that follows it.

To solve this we can enclose the string in double-quotes, as follows:

In [107]:
restaurant = "Wendy's"
# Similarly we can also enclose a string containing double-quotes with single-quotes as such:
quotes = 'Using "double" quotes enclosed by a single quote.'
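
A third option, not used elsewhere in this notebook, is to escape the quote with a backslash so that Python treats it as part of the string rather than as the closing quote. The sketch below reuses the example strings from above; the variable names are only for illustration.

```python
# A backslash before a quote tells Python the quote belongs to the string.
restaurant = 'Wendy\'s'
quotes = "Using \"double\" quotes inside double quotes."

print(restaurant)
print(quotes)

# The escaped version is the same string as the one enclosed in the other quote type.
print(restaurant == "Wendy's")
```

Escaping works, but choosing the other quote type is usually easier to read.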

#### Multi-line strings

Strings in Python can also span multiple lines, which is useful when you have a very long string or when you want to format the output in a certain way. This can be achieved in two ways:

1. With single or double quotes, where we manually indicate with a backslash that the string continues on the next line.
2. With three single or double quotes.

We will first demonstrate how this works with a single pair of double or single quotes.

In [109]:
# This example also works with single-quotes.
longString = "A very long string\n\
can be split into multiple\n\
sentences by appending a backslash\n\
to the end of the line."

print(longString)

A very long string
can be split into multiple
sentences by appending a backslash
to the end of the line.


The \n (**n**ew**l**ine) symbol indicates that the rest of the text should start on a new line in the string. The \ that follows it indicates that the string continues on the next line of the code. The difference can be hard to see at first, and is best illustrated with an example that leaves out the \n symbol.

In [112]:
longString = "A very long string \
can be split into multiple \
sentences by appending a backslash \
to the end of the line."

print(longString)

A very long string can be split into multiple sentences by appending a backslash to the end of the line.


As you can see, Python now treats this example as a single line of text. If we use the recommended way to write multi-line strings in Python, with triple double or single quotes, the \n or newline symbol is included automatically.

In [116]:
longString = """A very long string
can also be split into multiple 
sentences by enclosing the string
with three double or single quotes."""

print(longString)

print()

anotherLongString = '''A very long string
can also be split into multiple 
sentences by enclosing the string
with three double or single quotes.'''

print(anotherLongString)

A very long string
can also be split into multiple 
sentences by enclosing the string
with three double or single quotes.

A very long string
can also be split into multiple 
sentences by enclosing the string
with three double or single quotes.
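
To check that a triple-quoted string really stores the newline symbol, we can compare it with the same text written on one line with an explicit \n. This is a small sketch with made-up text; repr() is a built-in function that shows a string the way you would type it in code.

```python
triple = """first line
second line"""
single = "first line\nsecond line"

print(triple == single)  # both strings contain exactly the same characters
print(repr(triple))      # repr() makes the hidden \n visible
```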


## Using strings

### String Indices
Each symbol in a string has a position, which can be referred to by its index number. Index numbers start at 0 and run up to the length of the string minus one. Negative index numbers count backwards from the end of the string, starting at -1. The following table shows the word "Oranges" in the first row, the positive index of each letter in the second row, and the negative index in the third row:

| O | r | a | n | g | e | s |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| -7 | -6 | -5 | -4 | -3 | -2 | -1 |

We can extract letters from a string by referring to them by their index number.

In [34]:
fruit = "Oranges"
print(fruit[0]) # O
print(fruit[1]) # r
# ...
print(fruit[6]) # s
print()
# Besides using positive numbers it is also possible to refer to positions using a negative index number
print(fruit[-7]) # O
print(fruit[-6]) # r
# ...
print(fruit[-1]) # s

O
r
s

O
r
s
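
Asking for an index outside the valid range raises an IndexError. The sketch below catches the error with try/except (a construct not covered in this chapter) only so that the cell keeps running after the first error.

```python
fruit = "Oranges"

# Valid positive indices run from 0 to len(fruit) - 1, so 7 is out of range.
try:
    print(fruit[7])
except IndexError as error:
    message = str(error)
    print("IndexError:", message)

# The same holds on the negative side: -7 is the first letter, -8 does not exist.
try:
    print(fruit[-8])
except IndexError as error:
    print("IndexError:", error)
```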


Besides single indices, we can also extract a range (a slice) from a string. A range is written with a colon (:), where the left number is the starting index and the right number is the end index (exclusive). If the left index is left out, the range starts at the beginning of the string; if the right index is left out, the range runs through the last character of the string.

In [35]:
fruit = "Oranges"
print(fruit[:])
print(fruit[0:])
print(fruit[:7])
print()
print(fruit[:6])
print()
print(fruit[1:-1])
print()
print(fruit[2], fruit[1:6])

Oranges
Oranges
Oranges

Orange

range

a range


The length of a string can be found using the len() function:

In [214]:
fruit = "Oranges"
print(len(fruit))
print(fruit[len(fruit)-1]) # This can be used to find the last letter, but it is easier to just use -1

7
s


### Traversing strings

Because strings are sequences, we can traverse them and visit each symbol. The simplest way of doing this is with a loop:

In [42]:
fruit = 'Apple'
for char in fruit:
    print(char, '- ', end='') # end='' allows us to not print every letter on a new line
print()

for i in range(len(fruit)):
    print(fruit[i], '- ', end='')
print()

A - p - p - l - e - 
A - p - p - l - e - 
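
When both the index and the character are needed, the built-in enumerate() function combines the two loops above. This is an optional alternative that the rest of the chapter does not rely on; the variable name line is only for illustration.

```python
fruit = 'Apple'

# enumerate() yields (index, character) pairs while traversing the string.
line = ''
for i, char in enumerate(fruit):
    line += '{}:{} '.format(i, char)
print(line)
```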


Similarly, we can also iterate over multiple characters at a time.

In [188]:
fruit = 'Orange'
for i in range(0,len(fruit),3):
    print(fruit[i:i+3])
print()

for i in range(len(fruit)-2):
    print(fruit[i:i+3])
print()

i = 0
while i <= len(fruit):
    print( fruit[i:], "X", fruit[:i] )
    i += 1
    
print()

for i in range(len(fruit)+1):
    print( fruit[i:], "X", fruit[:i] )

Ora
nge

Ora
ran
ang
nge

Orange X 
range X O
ange X Or
nge X Ora
ge X Oran
e X Orang
 X Orange

Orange X 
range X O
ange X Or
nge X Ora
ge X Oran
e X Orang
 X Orange


#### Extended slices

Slices in Python can take a third argument: the step size, or stride, taken between indices. This is similar to the third argument of the range() function. The format for slices then becomes string[begin:end:step], where begin, end and step are numbers we can choose. By default the step size is 1.

In [223]:
fruit = "Pineapple"
print(fruit[4:9])
# is the same as
print(fruit[4:9:1])

apple
apple


The stride can be increased to only take some letters.

In [233]:
fruit = "Banana"
print(fruit[1:6:2]) # take every other letter in the range 1 to 6

aaa


Additionally, the stride can be used with an unspecified range (that is, over the entire string). It is also possible to specify a negative stride, which walks through the string backwards.

In [241]:
fruit = "Banana"
print(fruit[::-2]) # We leave the range empty to get the entire string, similar to: fruit[:]
print()
print(fruit[::-1]) # This is a common way in python to reverse strings

aaa

ananaB


Reversing a string using [::-1] is conceptually similar to traversing the string from the last character to the first in steps of -1.

In [243]:
fruit = "Banana"
for i in range(5, -1, -1):
    print(fruit[i])

a
n
a
n
a
B
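
One use of the reversing slice is checking whether a word reads the same in both directions. The function name is_palindrome below is made up for this illustration.

```python
def is_palindrome(word):
    # Compare the word with its reversed copy, ignoring upper/lowercase differences.
    word = word.lower()
    return word == word[::-1]

print(is_palindrome("Level"))
print(is_palindrome("Banana"))
```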


### Comparing strings

In Python it is possible to use comparison operators (as used in conditional statements) on strings. These operators are: ==, !=, <, <=, >, and >=

Strings are compared character by character. Within the same case the order is alphabetical, but all uppercase letters come before all lowercase letters.

In [245]:
print('a' == 'a')
print('a' != 'b')
print('a' == 'A')
print('a' > 'A') # True: 'a' is greater than 'A', because lowercase comes after uppercase
print()
print('orange' == 'Orange')
print('orange' > 'Orange')
print('orange' < 'Orange')

True
True
False
True

False
True
False
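
Because uppercase and lowercase letters compare differently, a common approach for case-insensitive comparison is to convert both strings to the same case first. The sketch below uses the lower() string method for this; the example words are arbitrary.

```python
first = 'orange'
second = 'Orange'

print(first == second)                    # False: 'o' and 'O' are different characters
print(first.lower() == second.lower())    # True: both become 'orange'

print('Zebra' < 'apple')                  # True: uppercase 'Z' comes before lowercase 'a'
print('Zebra'.lower() < 'apple'.lower())  # False: 'zebra' comes after 'apple'
```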


Another way of comparing strings is to check whether one string is part of another. This can be done using the 'in' operator:

In [248]:
print('a' in "abcdefg")
# Or for longer strings:
print('word' in 'A sentence with words.')

True
True


### Immutability

The mutability of an object, such as a string, refers to whether the object can be changed after it has been created. Strings are immutable: we can create a new string based on an old one, but we cannot modify a string in place. The cell below demonstrates this.

In [11]:
fruit = 'guanabana'
island = fruit[:5] # This is fine, because we are creating a new string
print(island, 'island')
print()
fruit = fruit[5:] + 'na' # This works because we are creating a new string and overwriting our old one
print(fruit)
fruit[4:5] = 'an' # This does not work because now we are trying to change an existing string
# If we want to do this then we need to build a new string from the parts around the slice:
fruit = fruit[:4] + 'an' + fruit[5:]

guana island

banana


TypeError: 'str' object does not support item assignment

The reasons why strings are immutable are beyond the scope of this notebook. Just remember that to modify a string you need to overwrite the entire string; you cannot modify individual indices.
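
The pattern of building a new string around a slice can be wrapped in a small helper. replace_slice is a hypothetical name chosen for this sketch, not a built-in function.

```python
def replace_slice(text, start, end, new):
    # Build a new string: everything before the slice, the new text, everything after.
    return text[:start] + new + text[end:]

fruit = 'banana'
changed = replace_slice(fruit, 4, 5, 'an')
print(changed)  # the new string
print(fruit)    # the original string is unchanged
```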

### String methods

The string type (str) comes with a collection of built-in methods that operate on strings. See the reference for an overview of string methods: https://docs.python.org/3/library/stdtypes.html#string-methods

All of these methods can be applied to a string to perform some operation. Some examples are shown below.

In [201]:
s = ' Humpty Dumpty sat on the wall '
print( s )
s = s.strip() # This method removes leading and trailing whitespace
print( s )

print( s.upper() )
print( s.lower() )

print( s.find( 'sat' ) )
print( s.find( 't', 12 ) )
print( s.find( 'q', 12 ) )
print( s.replace( 'sat on', 'fell off' ) )

words = s.split() # This returns a list; lists are outside the scope of this notebook.
for word in words: # But you can iterate over each word in this manner
    print( word.capitalize() )
print( '-'.join( words ) )

 Humpty Dumpty sat on the wall 
Humpty Dumpty sat on the wall
HUMPTY DUMPTY SAT ON THE WALL
humpty dumpty sat on the wall
14
16
-1
Humpty Dumpty fell off the wall
Humpty
Dumpty
Sat
On
The
Wall
Humpty-Dumpty-sat-on-the-wall


split() allows you to split a string on a given separator. By default it splits on any run of whitespace, but you can pass any character or combination of characters as the separator.

In [208]:
s = 'Humpty Dumpty sat on the wall'
splits = s.split('umpty')
for word in splits:
    print(word.strip()) # We use strip() here because splitting on 'umpty' retains the whitespace

H
D
sat on the wall
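
The difference between the default behaviour and an explicit space separator shows up when a string contains repeated spaces. A minimal sketch with a made-up string:

```python
s = 'Humpty   Dumpty  sat'   # note the repeated spaces

default_split = s.split()      # no separator: a run of whitespace counts as one split point
space_split = s.split(' ')     # explicit ' ': every single space is a split point

print(default_split)
print(space_split)
```

With the explicit separator, the empty strings between consecutive spaces end up in the result, which is rarely what you want when splitting text into words.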


A very useful property of splitting is that it lets us decode some basic file formats. For example, a comma-separated values (CSV) file is a very simple format in which each line consists of values separated by commas.

These values can be split from each other using the split() method.

In [213]:
csv = "2015, September, 28, Data Processing, Tilburg University, Tilburg"
values = csv.split(',')
for value in values:
    print(value.strip())

2015
September
28
Data Processing
Tilburg University
Tilburg
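
Splitting on a comma breaks down as soon as a value itself contains a comma. For such cases Python's standard library has a csv module. The sketch below is a minimal illustration with a made-up line, and it assumes the value containing a comma is enclosed in double quotes, as is the usual CSV convention.

```python
import csv

line = '2015, September, 28, "Tilburg, The Netherlands"'

# A plain split cuts the quoted value in two.
naive = line.split(',')
print(naive)

# csv.reader expects an iterable of lines, so we wrap our single line in a list.
# skipinitialspace=True ignores the space that follows each comma.
rows = list(csv.reader([line], skipinitialspace=True))
print(rows[0])
```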


#### ASCII

A string is a sequence of characters, and each character can be represented as a number using an encoding scheme. A commonly used encoding scheme is ASCII (more at: https://en.wikipedia.org/wiki/ASCII). In Python it is possible to convert back and forth between characters and their codes using the chr() and ord() functions. chr() returns the character that corresponds to the code passed as input, whereas ord() does the opposite: it returns the code for a given character. Strictly speaking, both functions work with Unicode code points, which coincide with ASCII for the characters shown here.

In [87]:
print(ord('A'))
print(ord('a'))

print()

print(chr(65))
print(chr(97))

65
97

A
a


As you can see the character 'A' is converted to the number 65, and the character 'a' is converted to the number 97. This is also the explanation why the comparison:

In [None]:
print('a' > 'A')

returns the value 'True', because the ASCII code for the character 'a' is greater than that of the character 'A'. 
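
Because characters map to numbers, we can also do arithmetic on them. The sketch below is only an illustration of ord() and chr() working together; in practice the upper() method is the normal way to change case.

```python
# In ASCII the lowercase letters sit exactly 32 positions after the uppercase ones.
offset = ord('a') - ord('A')
print(offset)

# Subtracting the offset turns a lowercase letter into its uppercase form.
letter = 'h'
upper_letter = chr(ord(letter) - offset)
print(upper_letter)

# Adding 1 to a code gives the next letter of the alphabet.
next_letter = chr(ord('A') + 1)
print(next_letter)
```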

Below is an ASCII chart generated using Python code.

In [244]:
print( ' ', end='' )
for i in range(16):
    if i < 10:
        print( ' '+chr(ord('0')+i), end='' )
    else:
        print( ' '+chr(ord('A')+i-10), end='' )
print()
for i in range(2,8):
    print( i, end='' )
    for j in range(16):
        c = i*16+j
        if chr(c) in ' \t\n\r\f':
            print( '  ', end='' )
        else:
            print( ' '+chr(c), end='' )
    print()

  0 1 2 3 4 5 6 7 8 9 A B C D E F
2   ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ 


## String formatting

As you might have noticed in previous notebooks, strings were sometimes followed by the format() method, with the following notation:

In [97]:
"text to {}".format("print")

"{} {} {}".format("text", "to", "print")

'text to print'

While formatted strings are not very useful in these cases, there are times when they make the code much easier to read and the output nicely formatted. When you have some information, there are different ways of placing it in a string, with and without format():

In [200]:
name = 'Bruce Banner'
alterego = 'The Hulk'
colour = 'Green'
country = 'USA'

# By setting the separator to '' we can concatenate strings without white space in between and print them
print("His name is ", name, " and his alter ego is ", alterego, 
      ", a big ", colour, " superhero from the ", country, ".", sep='')
# another way is:
print("His name is " + name +" and his alter ego is " + alterego + 
      ", a big " + colour + " superhero from the " + country +".")
# and yet another way, this time using format is:
print("His name is {} and his alter ego is {}, a big {} superhero from the {}.".format(name, alterego, colour, country))

His name is Bruce Banner and his alter ego is The Hulk, a big Green superhero from the USA.
His name is Bruce Banner and his alter ego is The Hulk, a big Green superhero from the USA.
His name is Bruce Banner and his alter ego is The Hulk, a big Green superhero from the USA.
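
Since Python 3.6 there is also a fourth way, the formatted string literal or f-string, which this notebook does not use elsewhere. It places the variable names directly inside the placeholders. A short sketch with the same variables:

```python
name = 'Bruce Banner'
alterego = 'The Hulk'
colour = 'Green'
country = 'USA'

# The f prefix lets us put expressions directly inside the {} placeholders.
sentence = f"His name is {name} and his alter ego is {alterego}, a big {colour} superhero from the {country}."
print(sentence)
```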


{} is a placeholder that indicates where in the text the formatted value should be placed.
The placeholder can be modified using the format specification to do more advanced formatting; some examples are shown below. More information about the format specification can be found at: https://docs.python.org/3/library/string.html#formatspec

In [144]:
# Numbered formatting
print("{}{}{}".format("a", "b", "c"))
print("{0} {1} {2}".format("a", "b", "c"))
print("{2} {1} {0}".format("a", "b", "c"))
print()
# Named formatting
print("My name is {name} and I am {years} years old".format(years=21, name="Bobby Hill"))
# Adjusting number precision
print()
print("{:d} or {:.2f} or {:.6f} or {:f}".format(10, 10, 10, 10))
# Adjusting the alignment and width
print()
print("Potatoes: {:>5}.".format(100))
print("Potatoes: {:>5}.".format(10000))
print("Potatoes: {:<5}.".format(100))
print("Potatoes: {:<5}.".format(10000))
print()
print("Potatoes: {:>2}.".format(10000)) # This works, but defeats the purpose of aligning
print("Potatoes: {:>2}.".format(1)) # because it is not lined up with this line

abc
a b c
c b a

My name is Bobby Hill and I am 21 years old

10 or 10.00 or 10.000000 or 10.000000

Potatoes:   100.
Potatoes: 10000.
Potatoes: 100  .
Potatoes: 10000.

Potatoes: 10000.
Potatoes:  1.
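
The format specification supports more options than the ones shown above. The sketch below illustrates a few that are described in the linked documentation; the variable names and values are only for illustration.

```python
# ^ centres the value within the given width.
centred = "[{:^9}]".format("mid")
# A fill character can be placed before the alignment symbol.
filled = "{:*<8}".format("ab")
# A leading 0 in the width pads numbers with zeros.
padded = "{:08.3f}".format(3.14159)
# A comma adds a thousands separator.
grouped = "{:,}".format(1234567)

print(centred)
print(filled)
print(padded)
print(grouped)
```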


Formatting strings can be especially useful when we repeatedly print information, because output of consistent width is easier to read.

In [185]:
maximum = 1000
steps = 75
for i in range(0, maximum, steps):
    print("Step {:>3} is {:>6.2%} of the total, with {:>7.2%} remaining.".format(i, 
                                                                i/maximum, (maximum-i)/maximum))

Step   0 is  0.00% of the total, with 100.00% remaining.
Step  75 is  7.50% of the total, with  92.50% remaining.
Step 150 is 15.00% of the total, with  85.00% remaining.
Step 225 is 22.50% of the total, with  77.50% remaining.
Step 300 is 30.00% of the total, with  70.00% remaining.
Step 375 is 37.50% of the total, with  62.50% remaining.
Step 450 is 45.00% of the total, with  55.00% remaining.
Step 525 is 52.50% of the total, with  47.50% remaining.
Step 600 is 60.00% of the total, with  40.00% remaining.
Step 675 is 67.50% of the total, with  32.50% remaining.
Step 750 is 75.00% of the total, with  25.00% remaining.
Step 825 is 82.50% of the total, with  17.50% remaining.
Step 900 is 90.00% of the total, with  10.00% remaining.
Step 975 is 97.50% of the total, with   2.50% remaining.


This output is more consistent, and easier to compare between lines, than when we print the lines without a fixed width and alignment:

In [187]:
maximum = 1000
steps = 75
for i in range(0, maximum, steps):
    print("Step {} is {:.2%} of the total, with {:.2%} remaining.".format(i, 
                                                                i/maximum, (maximum-i)/maximum))

Step 0 is 0.00% of the total, with 100.00% remaining.
Step 75 is 7.50% of the total, with 92.50% remaining.
Step 150 is 15.00% of the total, with 85.00% remaining.
Step 225 is 22.50% of the total, with 77.50% remaining.
Step 300 is 30.00% of the total, with 70.00% remaining.
Step 375 is 37.50% of the total, with 62.50% remaining.
Step 450 is 45.00% of the total, with 55.00% remaining.
Step 525 is 52.50% of the total, with 47.50% remaining.
Step 600 is 60.00% of the total, with 40.00% remaining.
Step 675 is 67.50% of the total, with 32.50% remaining.
Step 750 is 75.00% of the total, with 25.00% remaining.
Step 825 is 82.50% of the total, with 17.50% remaining.
Step 900 is 90.00% of the total, with 10.00% remaining.
Step 975 is 97.50% of the total, with 2.50% remaining.


----------
# Exercises chapter 11

* Ex 1: In the following code cell you are given a string that contains the same spelling error several times (vox instead of fox). Replace all occurrences of this spelling error with the correct spelling. Hint: use [str.replace()](https://docs.python.org/3/library/stdtypes.html#str.replace).

In [191]:
text = """The quick, brown vox jumps over a lazy dog. DJs flock by when MTV ax quiz prog. 
Junk MTV quiz graced by vox whelps. Bawds jog, flick quartz, vex nymphs. 
Waltz, bad nymph, for quick jigs vex! Vox nymphs grab quick-jived waltz. 
Brick quiz whangs jumpy veldt vox. """

# Your code here.

* Ex 2: In the following cell you are given a string that contains an @. Print all the text that occurs before this @. Hint: use [str.find()](https://docs.python.org/3/library/stdtypes.html#str.find) and string indices.

In [199]:
text = """Tilburg University’s goal is to actively contribute to society. 
@ We want to serve society and make it a better place for all citizens. 
Our university has always actively promoted ways to firmly embed our education and research in society."""

# Your code here

* Ex 3: Check (using Python) whether the string given in the following cell contains the word "Semantics". Use an if-statement and print "yes" if it does and "no" if it does not.

In [249]:
text = """Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. """

# Your code here

* Ex 4: Count how many of each vowel (a,e,i,o,u) there are in the text string in the next cell, and print the count for each vowel with a single formatted string. Remember that vowels can be both lower and uppercase.

In [251]:
text = """But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or pursues or desires to obtain pain of itself, because it is pain, but because occasionally circumstances occur in which toil and pain can procure him some great pleasure. To take a trivial example, which of us ever undertakes laborious physical exercise, except to obtain some advantage from it? But who has any right to find fault with a man who chooses to enjoy a pleasure that has no annoying consequences, or one who avoids a pain that produces no resultant pleasure? On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain."""

# Your code here

* Ex 5: The text string in the next cell contains several words that are enclosed by square brackets ('[' and ']'). Scan through the string and print out all words that are between square brackets. For example, if the text string is "[a]n example [string]", you are expected to print out "a string".

In [266]:
text = """The quick, brown fox jumps over a lazy dog. DJs flock by when MTV ax quiz prog. Junk MTV quiz graced by fox whelps. [Never gonna ] Bawds jog, flick quartz, vex nymphs. [give you up ] Waltz, bad nymph, for quick jigs vex! Fox nymphs grab quick-jived waltz. Brick quiz whangs jumpy veldt fox. [Never ] Bright vixens jump; [gonna let ] dozy fowl quack. Quick wafting zephyrs vex bold Jim. Quick zephyrs blow, vexing daft Jim. Charged [you down ] fop blew my junk TV quiz. How quickly daft jumping zebras vex. Two driven jocks help fax my big quiz. Quick, Baz, get my woven flax jodhpurs! "Now fax quiz Jack!" my brave ghost pled. [Never ] Five quacking zephyrs jolt my wax bed. [gonna ] Flummoxed by job, kvetching W. zaps Iraq. Cozy sphinx waves quart jug of bad milk. [run around ] A very bad quack might jinx zippy fowls. Few quips galvanized the mock jury box. Quick brown dogs jump over the lazy fox. The jay, pig, fox, zebra, and my wolves quack! [and desert you] Blowzy red vixens fight for a quick jump. Joaquin Phoenix was gazed by MTV for luck. A wizard’s job is to vex chumps quickly in fog. Watch "Jeopardy!", Alex Trebek's fun TV quiz game."""

# Your code here