# Notebook 3.1: String operations

The code in this notebook corresponds to notes in lecture 3. In this notebook you can follow along and execute or modify code as we go. All code in this notebook uses the `Python3` standard library. 

## Working with strings in Python 

A "string" is the name Python uses for text: a sequence of characters such as a word, a sentence, or a whole paragraph. It is one of the most basic data types and one that Python is very good at dealing with. In fact, the ease with which Python can be used to manipulate text is one of the primary reasons it has become such a popular language for both scientific programming and web development. 

### Strings as variables
Your reading this week introduced the concept of storing variables in Python. We assign a value to a variable using the `=` sign. Below a string variable named `mystring` is created in this way. The name is arbitrary and could be almost anything.

In [1]:
mystring = "a string of text"

### Manipulating strings
Strings are an *indexed* datatype that is *immutable*. This means that we can select portions of the text using indexed numbering, but we cannot modify individual parts of it. We can however make new strings from existing ones and assign them to a variable. This is demonstrated below. 

In [2]:
## indexing: select the first element of a string (Python index starts at 0)
mystring[0]

'a'

In [3]:
## indexing: select the last element (negative indexes start from the end)
mystring[-1]

't'

In [4]:
## indexing: select a range of elements with a slice (e.g., first:last)
mystring[0:8]

'a string'

In [5]:
## indexing: the above could also be written as [:8] meaning from start to index 8
mystring[:8]

'a string'

In [6]:
## indexing: this means index -3 until the end.
mystring[-3:]

'ext'

### Strings can be combined or reassigned


In [7]:
## create new string objects and assign to variables
string1 = "a new string "
string2 = "bigger than the last string"


In [8]:
## combine them and assign the result to another variable
newstring = string1 + string2
newstring

'a new string bigger than the last string'

### You cannot mutate string objects
Attempts to assign a value to an index or slice of a string will raise an error because strings are *immutable*. We can take an index or slice of a string to return part of it, but cannot change it *in place*. Instead, we can replace the variable storing a string object with a new string. 

In [9]:
## error: you cannot assign to an index inside a string.
mystring[0] = "A"

TypeError: 'str' object does not support item assignment
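
As a minimal sketch of what this error means in practice, the cell below wraps the same kind of assignment in a `try`/`except` block (a construct we have not covered yet) so that the `TypeError` is caught instead of halting execution. The variable names `text` and `message` are chosen only for this illustration.

```python
## a hypothetical variable used only for this illustration
text = "a string of text"

try:
    ## this assignment fails because strings are immutable
    text[0] = "A"
except TypeError as err:
    ## store and print the error message instead of stopping
    message = str(err)
    print(message)

## the original string is unchanged
print(text)
```

Catching the error is not a way around immutability; it only shows that the string is left exactly as it was.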

### How do you modify strings then?
You assign a new string to a variable name, which can be the same name as before. The new string can be built using indexing, concatenation, or several other approaches that we'll see below. This may seem like a subtle distinction, but you'll see later how it varies among different objects, some of which are mutable and others which are not. The example below creates a new string by adding "A" to the previous string indexed from 1 until the end.

In [10]:
## you could do this instead
newstring = "A" + mystring[1:]
newstring

'A string of text'

## Built-in string functions
String variables are a type of data object in Python, and just like all objects in Python this means that they store more information than just the value that we assigned to them. For example, a string object knows the length of the string it stores, and it has many `functions` (often called *methods* when they belong to an object) that can be called to manipulate that string. These attributes and functions can be accessed by using tab-completion on an object in Jupyter. Place your cursor after the period at the end of `mystring` below and press tab to see the options displayed. 

In [None]:
## put your cursor after the period below and press <tab> to see available options
mystring.
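
Tab-completion only works in an interactive environment. As a complementary sketch, the built-in `len()` and `dir()` functions report similar information in any Python session. The variable names `length` and `names` are illustrative.

```python
mystring = "a string of text"

## the built-in len() function returns the number of characters
length = len(mystring)
print(length)

## dir() returns a list of the attribute and function names of an object
names = dir(mystring)
print("capitalize" in names)
```

Names in the `dir()` listing that begin and end with double underscores are used internally by Python and can be ignored for now.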

### Example builtins
The examples below call `functions` associated with string objects.

In [11]:
## this returns a capitalized version of the string
mystring.capitalize()

'A string of text'

In [12]:
## this splits the string into a 'list' wherever there is whitespace
mystring.split()

['a', 'string', 'of', 'text']

In [13]:
## this centers the text across a width of 40
mystring.center(40)

'            a string of text            '

In [14]:
## this returns the index where the searched word first occurs, counting from the left
mystring.find("of")

9
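
One detail worth knowing, sketched below under the assumption that you only need to locate or detect a substring: `find()` returns `-1` when the substring is absent, and the `in` operator gives a plain True/False answer. The variable names here are illustrative.

```python
mystring = "a string of text"

## find() returns the index where the substring starts
found = mystring.find("text")

## find() returns -1 when the substring is not present
missing = mystring.find("xyz")

## the `in` operator answers the simpler yes/no question
has_of = "of" in mystring

print(found, missing, has_of)
```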

### Modifying strings
In the examples above we applied a function to `mystring` which returned a new object, but we didn't store that result in a variable. Instead, we just let it be returned and displayed as the output of the cell. Because the returned value wasn't saved to a variable, Python is free to "garbage-collect" it, meaning that its space in memory can be reclaimed. If we want to keep the result we need to assign it to a variable. This is done below, where you see `mystring` is changed after we run the command.

In [15]:
## here mystring is replaced by a new string where the first character is capitalized.
mystring = mystring.capitalize()
mystring

'A string of text'

In [16]:
## let's create a new variable like the original that is not capitalized
lower_string = mystring.lower()
lower_string

'a string of text'
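
The sketch below makes the same point from the other direction: calling a string function never changes the string it was called on, so the original remains available unless you reassign its name. The variable names `original`, `shouted`, and `title_words` are illustrative.

```python
original = "a string of text"

## upper() returns a new string; it does not change `original`
shouted = original.upper()
print(original)
print(shouted)

## calls can be chained because each one returns a new object
title_words = original.title().split()
print(title_words)
```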

### The `print()` function
You can display the value of a variable by executing the variable itself as the last line of a cell, or by using the `print()` function, as shown below. You can see that the first example shows the quotes around the text to indicate it is a string. The second way prints just the text. 

In [17]:
## you return a variable's value by entering just the variable
mystring

'A string of text'

In [18]:
## or you can use the print() function, which prints it to stdout
print(mystring)

A string of text


In [20]:
## in Python 3 (unlike Python 2), print() with multiple arguments prints them separated by a space
print(mystring, lower_string)

A string of text a string of text
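
As a small sketch of the difference between the two displays, the built-in `repr()` function returns the quoted form that a notebook cell shows, and the `sep` argument of `print()` controls what is placed between multiple arguments. The variable name `quoted` is illustrative.

```python
mystring = "A string of text"

## repr() returns the quoted form that a notebook cell displays
quoted = repr(mystring)
print(quoted)

## the sep argument changes what print() places between its arguments
print(mystring, mystring.lower(), sep=" | ")
```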


### Single, double, and triple quotes
What is the difference? There is little difference, but having several options makes it
easier to write strings that include quotes inside them, as shown
in the examples below. 

In [21]:
## examples of printing strings
print("hello world")
print('printing in single quotes is the same as in double quotes')
print("'you can make strings that include single quotes by putting them inside doubles'")
print('"you can make strings that include double quotes by putting them inside singles"')
print("""
multi-line string with mixed single and double
quotes can all be captured inside triple-quotes.
This also interprets the starting line as a newline.
""")

hello world
printing in single quotes is the same as in double quotes
'you can make strings that include single quotes by putting them inside doubles'
"you can make strings that include double quotes by putting them inside singles"

multi-line string with mixed single and double
quotes can all be captured inside triple-quotes.
This also interprets the starting line as a newline.
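
Another option, sketched here as a complement to mixing quote styles, is to escape a character with a backslash. The strings and variable names below are made up for illustration.

```python
## a backslash escapes a quote so it can appear inside matching quotes
escaped = 'it\'s a "quoted" word'
print(escaped)

## "\n" inserts a newline inside an ordinary single-line string
two_lines = "first line\nsecond line"
print(two_lines)
```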



### Parsing a string document
In many cases an entire data set, or document page, will be stored as a string, and so it is really useful to know some common workflows for parsing strings of text into other usable forms. 



In [22]:
## a string that is like a full page document
page = """
This is a multi-line document.
This is the second line.
The last line is here.
"""

In [23]:
## the string variable looks like this. Newlines are represented as '\n'
page

'\nThis is a multi-line document.\nThis is the second line.\nThe last line is here.\n'

In [24]:
## strip() removes whitespace, including newline characters, from the beginning and end
page.strip()

'This is a multi-line document.\nThis is the second line.\nThe last line is here.'

In [25]:
## split can take an argument to split on a specific character.
## Here we enter '\n' to split on new lines. 
## This parses the string into a 'list' object of lines. 
## We'll discuss lists more later. 
page.strip().split('\n')

['This is a multi-line document.',
 'This is the second line.',
 'The last line is here.']
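
A possible next step in this workflow, sketched below, is to process each line after splitting. The example uses `splitlines()`, which splits on line boundaries without needing `'\n'` as an argument, and the list function `append()`, which we'll cover when we discuss lists. The variable names `lines` and `word_counts` are illustrative.

```python
page = """
This is a multi-line document.
This is the second line.
The last line is here.
"""

## splitlines() splits the stripped string into a list of lines
lines = page.strip().splitlines()

## count the words on each line by splitting each line on whitespace
word_counts = []
for line in lines:
    word_counts.append(len(line.split()))
print(word_counts)
```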

### Comments (#)
Written code often contains comment strings or lines. These are not meant
to be interpreted by the program, but instead to provide hints to the person reading
the code. You've seen this already in the lecture notes and example code we have examined, 
including in most cells of this notebook. Comments begin with the `#` character, and the Jupyter notebook
interface colors them a lighter blue-grey. In the examples below you can see that the comments are not interpreted when written in-line either. 

In [26]:
newstring = "a new string"            ## creating a new string variable
newstring = newstring.capitalize()    ## return capitalized version
newstring.startswith("A")             ## ask whether it starts with an "A"

True

### Using print() for debugging

A common use for the `print()` function when writing code is to use it for *debugging*. Essentially, this is a way of asking "what is happening in the code right now?". It is a good way to ensure that the code is running how you want it to, or to find the bugs in your code if they exist. We'll use `print` in this way below while discussing `for-loops`. 
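
As a minimal sketch of this idea, the cell below prints a value before and after an operation to confirm that the operation did what we expected. The variable `word` and its value are made up for illustration; `repr()` is used so that the quotes make any surrounding whitespace visible.

```python
## a hypothetical input with stray whitespace and mixed case
word = "  Apples "

## print the value before the change; repr() shows the quotes and spaces
print("before:", repr(word))

## remove surrounding whitespace and convert to lowercase
word = word.strip().lower()

## print the value again to confirm the change worked
print("after:", repr(word))
```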

### Strings are iterable
Strings, like many other Python data objects, are *iterable*, which means that we can step through their elements one at a time. The elements in a string are characters. We can write a for-loop like the one below to iterate over the elements in the string object. 

In [27]:
## define a string variable
stringvar = "apples orange grapes"

## a for-loop iterating over elements in stringvar
for x in stringvar:
    print(x)

a
p
p
l
e
s
 
o
r
a
n
g
e
 
g
r
a
p
e
s


### More on for-loops
In lecture we discussed operators for things like addition, subtraction, and for performing comparisons, such as `==`, `>`, and `<`. Below we combine these in a for-loop to perform a more complex operation. Following the format we used above to write a for-loop, it's important to recognize that the variable in the loop is being reassigned on every iteration. In this loop we name the variable `char`, and if `char` is a vowel we print it.

In [28]:
## find vowels in stringvar
for char in stringvar:
    for vowel in "aeiou":
        if char == vowel:
            print(char)

a
e
o
a
e
a
e


### A simpler way to do the same thing

In [29]:
## find vowels in stringvar
for char in stringvar:
    if char in "aeiou":
        print(char)

a
e
o
a
e
a
e
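
The same loop can also accumulate a result instead of printing each match. The sketch below counts the vowels by updating a counter variable on each iteration; the name `nvowels` is illustrative.

```python
stringvar = "apples orange grapes"

## start a counter at zero and add one for every vowel found
nvowels = 0
for char in stringvar:
    if char in "aeiou":
        nvowels += 1
print(nvowels)
```

Keeping a running count inside a loop is a common pattern for summarizing the contents of a string.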


# Challenges

A. Create a variable named 'varstring' and assign it the value "apples"


B. Use indexing to create a new variable of only the first two fruits


C. Use indexing to create a new variable of only the last two fruits


D. Split the string on whitespace to create a list


E. Iterate over varstring and print every element that is not a vowel


F. Create a variable that is assigned the following string and print it: `They asked, "what's your name?"`

G. Count the number of differences between these two DNA strings


In [None]:
dna1 = "ACAGAGTTGCCAGGAGATGACAGAAAGGTGTGGGTTACAACTCTCTCTAATTTAAGGGCCAATTAACATT"
dna2 = "ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTAAAATAAGGGCCAATTAACGTT"
