# Introduction to Strings

In this notebook we are going to learn about strings, a data type that will continue to build up your power in Python.

## Objectives

At the end of this notebook you should be able to:

- use string operations
- address individual characters in strings
- describe the different types of string formatting


## Strings

First, we are going to learn about another common data type, strings. From a high-level perspective, a string is just a bit of text. This could be text that you have read in from a file, HTML that you have pulled from the Internet, or any other text. From Python's perspective, a string (type `str`) is simply a collection of encoded characters. Wait, what's an encoding...?

An encoding is just a fancy name for a set of rules that turns characters into the raw bytes that are actually stored in a file or sent over a network (and back again). In Python 3, every `str` is a sequence of Unicode characters, so encodings aren't something you will run into when defining your own strings. They matter when text enters or leaves your program as bytes, for example when reading a file or pulling text from the Internet; two common encodings are `ASCII` and `UTF-8`. It's worth noting because there is a good chance that sometime in your Python career, you will end up with Python telling you it can't decode a certain character, and an unexpected encoding will most likely be at the heart of that error.
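To make the idea of an encoding a little more concrete, here is a minimal sketch of converting between text (`str`) and raw bytes (`bytes`) in Python 3. The sample text and the variable names are made up for illustration, and the `try`/`except` construct is something we haven't covered; it is only there so the example runs without stopping on the error.

```python
# Text in Python 3 is a str; encoding it produces raw bytes.
text = 'café'
utf8_bytes = text.encode('utf-8')      # 'é' takes two bytes in UTF-8
latin1_bytes = text.encode('latin-1')  # 'é' takes one byte in Latin-1

print(len(text))          # 4 characters
print(len(utf8_bytes))    # 5 bytes
print(len(latin1_bytes))  # 4 bytes

# Decoding with the matching encoding gives the original text back.
round_trip = utf8_bytes.decode('utf-8')
print(round_trip == text)  # True

# Decoding with the wrong encoding is where errors tend to come from.
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # True
```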

In Python, strings are recognized as a collection of characters surrounded by a set of either single quotation marks (`'...'`) or double quotation marks (`"..."`). So long as you open and close your string with a **matching** set of single or double quotation marks, you are free to use either. The single caveat to that is that if you are writing an expression with a single quotation mark in it (such as "Don't do that"), the simplest fix is to surround it with a matching set of **double** quotation marks (escaping the inner quotation mark with a backslash, `\'`, also works). Let's experiment with some strings...


In [1]:
'This is a string.'

'This is a string.'

In [2]:
"This is another string, but this time with double quotation marks."

'This is another string, but this time with double quotation marks.'

In [3]:
'They told me not to do this, but I didn't listen.' 

SyntaxError: unterminated string literal (detected at line 1) (876804734.py, line 1)

Just like we expected, we can use both single and double quotation marks. What happened in the 3rd case there? Well, we opened the string with a single quotation mark, and Python started looking for the next single quotation mark to close the string. When it found that quotation mark in the word `didn't`, it assumed the string was closed after `didn`. As a result, this left `t listen.'` just hanging out, and Python didn't know how to interpret that, resulting in our error. The solution to this, as mentioned above, is to use double quotation marks in any case where your text will have single quotation marks in it. For example...

In [4]:
"Now that I've got double quotes, I can use all the contractions!"

"Now that I've got double quotes, I can use all the contractions!"

In [None]:
"Can't, won't, didn't, don't... all the contractions!"

As a final note before we dive into string operations, we can store strings in variables in the exact same way that we can store an `int`, `float`, or `complex`.

In [None]:
my_str_variable = 'This is a string variable.' 


In [None]:
my_str_variable # my_str_variable holds the string that we put in it in the above cell. 
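One more note on quotation marks before moving on. Double quotes are not the only way to handle an apostrophe: as a small sketch, a backslash in front of a quotation mark (an **escape**) tells Python to treat it as part of the text rather than as the end of the string. The variable names below are just for illustration.

```python
# Escaping the apostrophe lets us keep single quotes on the outside.
escaped = 'They told me not to do this, but I didn\'t listen.'
double_quoted = "They told me not to do this, but I didn't listen."

# Both spellings produce exactly the same string.
print(escaped)
print(escaped == double_quoted)  # True
```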

### String Operations

Surprisingly, a couple of our standard mathematical operations will work on strings, namely `+` and `*`. We can use the `+` operator to add two strings together (this is known as string **concatenation**), and we can use the `*` operator to repeat a string a given number of times. Let's take a look...

In [None]:
'My first string' + 'My second string'

In [None]:
'Repeating string' * 3

Note that Python didn't put spaces between the strings with either the `+` operator or the `*` operator. Why not? Because it wasn't told to! In this case, and in programming in general, we have to be extremely explicit about what we want the computer to do. To fix this, we can add a space in the middle of the first case, and then add a space to the end of our string in the second case.

In [None]:
'My first string' + ' ' + 'My second string'

In [None]:
'Repeating string ' * 3

That looks much better! But, what about that pesky little space at the end of our second string: `'Repeating string Repeating string Repeating string '`. Is there a way to remove this? It turns out there is! One of the methods (a name for a function that is attached to a particular object) that we can call on strings is the `strip()` method. Methods are something that we will cover in much more depth later, but for now just note that we call them on our objects through **dot notation**. We simply place a `.` at the end of our object (`str`, `int`, `float`, any variable, etc.), and then call the method by name. Here's how the use of this **dot notation** looks in practice.

In [None]:
'Repeating string Repeating string Repeating string '.strip()

In [None]:
' Repeating string Repeating string Repeating string '.strip()

So, what did the `strip()` method do? In the first example, it removed the trailing space from the string. In the second example, it removed both the leading and trailing spaces. This is exactly what the `strip()` method does - by default (without any arguments) it removes leading and trailing whitespace (*note, the method can actually remove any leading or trailing characters if you pass them to `strip()`, but whitespace is the default character that it removes*).
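As a quick sketch of the note above, here is `strip()` with and without an argument. The related methods `lstrip()` and `rstrip()`, which only work on the left or right end, are included for comparison; the sample strings are made up for illustration, and `repr()` is used only to make the surrounding spaces visible.

```python
padded = '  Repeating string  '

print(repr(padded.strip()))   # 'Repeating string'
print(repr(padded.lstrip()))  # 'Repeating string  '
print(repr(padded.rstrip()))  # '  Repeating string'

# Passing characters tells strip() what to remove from both ends.
dotted = '...Repeating string...'
stripped_dots = dotted.strip('.')
print(stripped_dots)  # Repeating string
```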

Are there other things that we can do with strings? There are tons! Let's store our string in a variable below, so we can get some exposure working with strings in variables.

In [None]:
my_str_variable = 'this IS my STRING to PLAY around WITH.'

In [None]:
my_str_variable.capitalize()

In [None]:
my_str_variable.upper()

In [None]:
my_str_variable.lower()

In [None]:
my_str_variable.replace('STR', 'fl')

In [None]:
my_str_variable.split()

These are some of the most commonly used string methods. You can see above what they do by default: `capitalize()` capitalizes the first letter of the string and lowercases the rest; `upper()` converts all the letters in the string to uppercase, and `lower()` to lowercase; `replace()` replaces all instances of a given substring in your string with another given substring; finally, `split()` splits the string on a given separator string (whitespace by default, just as with `strip()`). There are many more string methods available, and you can check them out in the [docs](https://docs.python.org/3/library/stdtypes.html#string-methods).
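For reference, here is a short sketch of what a few of these calls return, plus `split()` with an explicit separator. One thing worth noticing is that none of these methods change the original string; each one hands back a new value. The names `lowered`, `replaced`, `words`, and `pieces` are just illustrative.

```python
my_str_variable = 'this IS my STRING to PLAY around WITH.'

lowered = my_str_variable.lower()
replaced = my_str_variable.replace('STR', 'fl')
words = my_str_variable.split()
pieces = my_str_variable.split('my')  # split on an explicit separator

print(lowered)   # this is my string to play around with.
print(replaced)  # this IS my flING to PLAY around WITH.
print(words)     # ['this', 'IS', 'my', 'STRING', 'to', 'PLAY', 'around', 'WITH.']
print(pieces)    # ['this IS ', ' STRING to PLAY around WITH.']

# The original string is untouched: each method returned a new value.
print(my_str_variable)  # this IS my STRING to PLAY around WITH.
```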

Alternatively, you can find out what methods are available to call on strings from the IPython terminal itself (this is one of the really awesome features of IPython)! This also works in an IPython notebook like this one. Using tab completion, if you have a string stored in a variable, you can type the variable name followed by a period, and then use tab complete to see all the methods available for strings! For display purposes, we're showing below roughly what you would see if you tab completed in IPython (if you tab completed in an IPython notebook instead, you would get a dropdown menu showing what's available on that variable). The exact list depends on your Python version; for example, `decode` is only available on Python 2 strings.

```python
In [1]: my_str.  # Hit tab now!

my_str.capitalize  my_str.isalnum     my_str.lstrip      my_str.splitlines
my_str.center      my_str.isalpha     my_str.partition   my_str.startswith
my_str.count       my_str.isdigit     my_str.replace     my_str.strip
my_str.decode      my_str.islower     my_str.rfind       my_str.swapcase
my_str.encode      my_str.isspace     my_str.rindex      my_str.title
my_str.endswith    my_str.istitle     my_str.rjust       my_str.translate
my_str.expandtabs  my_str.isupper     my_str.rpartition  my_str.upper
my_str.find        my_str.join        my_str.rsplit      my_str.zfill
my_str.format      my_str.ljust       my_str.rstrip      
my_str.index       my_str.lower       my_str.split
```


**Note**: This works for all of our variable types! Not only that, but we can also tab complete the names of the variables that IPython currently knows about (those in the **namespace**).
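If tab completion isn't available (for example, in a plain Python script), the built-in `dir()` function offers a rough equivalent. The sketch below is one possible way to use it; the name `public_methods` is made up for illustration, and the square-bracket filtering syntax is something we haven't covered yet.

```python
my_str = 'hello'

# dir() returns the names of the attributes available on an object.
# Skipping names that start with an underscore leaves the everyday methods.
public_methods = [name for name in dir(my_str) if not name.startswith('_')]

print('upper' in public_methods)  # True
print('strip' in public_methods)  # True
print(public_methods[:3])
```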

### Working with individual characters in strings

We know how to work with an entire string via some of the methods that we've discussed, but what if we wanted to work with the individual characters? There are a couple of ways to do this, but the first we'll focus on is through indexing. We know that to Python, a string is just a collection of characters. It turns out that we can access the individual characters simply by asking Python for a given numbered element in our collection (i.e. the string).  We do this by placing the element number that we want in square brackets, `[]`,  right after our string (or variable, if it's stored in one). This element number is referred to as the **index** of the character (or element, if it's not a string - more on this soon).

In [None]:
my_str_variable = 'Test String'
my_str_variable[1]

In [None]:
my_str_variable[5]

In [None]:
my_str_variable[-1]

In [None]:
my_str_variable[-3]

Using indices like this, we can access any element of a string. But why is the element at index 1 `e`, and not `T`? After all, `T` is the first element in the string. Also, what are those negative numbers doing? In the case of the former, it turns out that Python (and many programming languages) starts indexing at 0, which means that the first element in our string (and any collection that supports indexing) is accessed via indexing at 0. We refer to languages that work this way as **zero indexed**. As for the negative numbers, this is a way to access elements starting from the end of the string, rather than the beginning. Indexing from the end starts from -1 and continues downwards from there. So, we would use -2 to access the `n` in the string.
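The sketch below puts positive and negative indices side by side. It borrows the `len()` function, which is covered a little later in this notebook, to show how a negative index lines up with a positive one, and it uses `try`/`except` (not covered yet) only so that the example keeps running when an index is out of range.

```python
my_str_variable = 'Test String'

first = my_str_variable[0]            # 'T'
last = my_str_variable[-1]            # 'g'
second_to_last = my_str_variable[-2]  # 'n'

# A negative index -k refers to the same character as len(...) - k.
same_char = my_str_variable[-3] == my_str_variable[len(my_str_variable) - 3]
print(first, last, second_to_last, same_char)  # T g n True

# Asking for an index past the end raises an IndexError.
try:
    my_str_variable[11]
    out_of_range = False
except IndexError:
    out_of_range = True
print(out_of_range)  # True
```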

Note that we can also access any given number of the characters (any **substring**) by combining multiple index numbers separated by a colon `:`. For example:

In [None]:
my_str_variable[1:3]

In [None]:
my_str_variable[5:9]

In [None]:
my_str_variable[-6:-1]

In [None]:
my_str_variable[1:]

In [None]:
my_str_variable[:-1]

This indexing turns out to be pretty useful. You might notice, though, that when indexing from `[1:3]`, only the letters at index 1 and 2 are returned; when indexing from `[5:9]`, we get the letters at indices 5, 6, 7, and 8. This is because the indices that you pass in are inclusive on the left side, and exclusive on the right side. This means that when you index, you will grab letters from the starting index that you give up to but not including letters at the ending index that you give. 

What about those last two examples, where there isn't an ending index or a starting one? If you don't give an ending index, then Python runs the slice all the way through the end of the string, including the last character. Similarly, if you don't give a starting index, Python assumes that your starting index is the first index in the string. Remember, this is the zeroth index in Python (don't worry if this feels confusing, you'll get used to it quickly).
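For reference, here is what each of the slices above evaluates to for `'Test String'`, along with one extra case (leaving out both indices) that is not in the cells above.

```python
my_str_variable = 'Test String'

print(my_str_variable[1:3])    # es
print(my_str_variable[5:9])    # Stri
print(my_str_variable[-6:-1])  # Strin
print(my_str_variable[1:])     # est String
print(my_str_variable[:-1])    # Test Strin

# Leaving out both indices gives back the whole string.
whole = my_str_variable[:]
print(whole == my_str_variable)  # True
```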

Is there a way to grab elements at regular intervals in a string? For example, what if we wanted to grab every second letter? Python allows us to do this by passing in an optional third number while indexing. This optional third number, also separated by a colon (`:`), tells Python the step size by which to move through the string when indexing. So, if we wanted to grab every second letter from the beginning to end, we could index with `[::2]`. If we wanted to grab every 3rd letter starting from the letter at index 2 and stopping before index 10, we could use the indexing `[2:10:3]`.

In [None]:
my_str_variable[::2]

In [None]:
my_str_variable[2:10:3]
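Here is a small sketch of what those two step slices produce. The last line goes slightly beyond the text above and is included only as an aside: a negative step walks through the string backwards.

```python
my_str_variable = 'Test String'

every_second = my_str_variable[::2]    # indices 0, 2, 4, 6, 8, 10
every_third = my_str_variable[2:10:3]  # indices 2, 5, 8
print(every_second)  # Ts tig
print(every_third)   # sSi

# Aside: a step of -1 moves from the end of the string to the beginning.
backwards = my_str_variable[::-1]
print(backwards)  # gnirtS tseT
```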

Got it, enough indexing already! Is there a way to cycle (or step through) each one of the letters one by one, and do something with the conditional logic we learned, rather than just grabbing a certain letter or group of letters? Of course! (Why would I ask a question for which the answer was no? That would be lame.)

### Iteration and Strings

We can cycle through all of the letters in our string (a process called **iteration**) in one of a couple of different ways. Let's first look at cycling through with a `while` loop.

In [None]:
my_str, idx = 'hello', 0
while idx < 5:
    print(my_str[idx])
    idx += 1


This while loop will **iterate** over the letters of our string `hello`, printing each one until `idx` reaches the value 5, at which point the condition `idx < 5` is no longer true and the loop stops. Since we knew the length of our string (i.e., it's 5 letters long), we knew that we could use the condition `while idx < 5:` for our loop checking, and ensure that all the letters would be printed. What if we didn't know the length ahead of time, though? There is actually a function that we can use to figure this out (we'll talk much more about functions and how they work later). It's `len()`, and we simply call `len()` with our string passed as an argument, and it returns the length of our string.

In [None]:
my_str = 'hello'

len(my_str)

Now, we can write our `while` loop to be a little bit more general:

```python
my_str, idx = 'hello', 0
while idx < len(my_str):
    print(my_str[idx])
    idx += 1
```
As this is a Codealong, you will learn more by typing the code yourself. The next cell is there for you to do exactly that and see the result.

Great! But we did mention that there are other ways to iterate over the letters in our string.

The other way that we can iterate over the letters in our string is to use a `for` loop. `for` loops are built off of the same idea of `while` loops (doing something over and over again), but instead of continuing until some condition is no longer met, `for` loops operate directly on iterables. This leaves the concern about when to stop for Python to figure out. With a `for` loop, we don't have to care how many iterations/cycles the loop will go through. Let's look at the syntax of a `for` loop.   

```python
my_str = 'hello'
for idx in range(len(my_str)):
    print(my_str[idx])
```

**Note**: the `range()` function (which we will cover in more depth when we get to functions) as used above simply gives us a sequence of numbers from 0 up to but not including the inputted number. In the case above, since `len(my_str)` is 5, `range(len(my_str))` gives us the integers from 0 to 4.

This `for` loop does the exact same thing as the `while` loop we wrote above, but with slightly different syntax. How does it work? At each iteration of the loop, `idx` is assigned one of the values in `range(len(my_str))`, and then the code within the indented block is run with that value of `idx`. How does Python know what the values of `idx` will be? Python simply goes through the values of whatever comes after the `in` keyword **in order**, and assigns those values to `idx`, one at a time through each iteration of the loop. Since `range(len(my_str))` gives us the integers from 0 to 4, those values get assigned to `idx` as we run through the `for` loop. Let's look at one of our favorite kinds of tables to view this:

| After loop # | idx | What's Printed |
| ------------ |:---:|:--------------:|
|      1       |  0  |       'h'      |
|      2       |  1  |       'e'      |
|      3       |  2  |       'l'      |
|      4       |  3  |       'l'      |  
|      5       |  4  |       'o'      |
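To check the table for yourself, the sketch below looks at what `range(len(my_str))` actually contains. In Python 3, `range()` produces a `range` object rather than a list, so we wrap it in `list()` to see the numbers. The `seen` list and its `append()` method are used here only to record each pass through the loop; lists haven't been introduced yet.

```python
my_str = 'hello'
indices = range(len(my_str))

print(indices)        # range(0, 5)
print(list(indices))  # [0, 1, 2, 3, 4]

# Record what the loop sees on each pass, mirroring the table above.
seen = []
for idx in indices:
    seen.append((idx, my_str[idx]))
print(seen)  # [(0, 'h'), (1, 'e'), (2, 'l'), (3, 'l'), (4, 'o')]
```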

Note that with our `for` loop, the `idx` variable is automatically changed, rather than us having to manually update it (like we did in the `while` loop). This is one of the incredibly nice aspects of `for` loops! But wait, it gets even better!

It turns out that the above implementation of our `for` loop is actually considered to be non-Pythonic. This is because the way that `for` loops are constructed allows us to achieve the same output as above by writing the following:

```python
my_str = 'hello'
for char in my_str:
    print(char)
```

What's going on here!? Well, instead of iterating over all of the integers in a `range(len(my_str))` call like we did in our first `for` loop, we've gotten Python to simply iterate over all of the individual characters in our string, `my_str`. In each iteration of this `for` loop, `char` stores a different letter of `my_str`, and then the call `print(char)` prints that character. In the end, we get the same result as either of our `while` loops above, and the less Pythonic `for` loop that we wrote above. This way is considered to be the Pythonic way to iterate over a string, and so it's an important concept to grasp.

Why is it more Pythonic? That's a good question. When we say that something is more "Pythonic", we mean that we are using the language in a way that both makes our code more readable and uses Python's power to make our solutions more efficient. Let's look at how this applies to the final implementation of our `for` loop.

We can see that it is more readable since we don't have to index into our string anymore. This means that there is less to follow along with and keep track of; rather than keeping track of both the current index we are on and what letter that index corresponds to in our string, all we have to keep track of is the current letter we're on. We can also note that our code just looks cleaner and simpler, too. In terms of making our code more efficient, since we no longer have to index into the string to grab characters, we have fewer steps in each iteration of the loop. This means less work for Python to do.
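As a small sketch of combining this style of loop with conditional logic, the example below counts how many times the letter `l` appears in `hello`. The second loop uses the built-in `enumerate()` function, which we haven't discussed; it is shown only as one option for the cases where you really do need the index alongside the character. The names `l_count` and `positions` are made up for illustration.

```python
my_str = 'hello'

# Count the characters that match a condition.
l_count = 0
for char in my_str:
    if char == 'l':
        l_count += 1
print(l_count)  # 2

# enumerate() hands back the index and the character together.
positions = []
for idx, char in enumerate(my_str):
    if char == 'l':
        positions.append(idx)
print(positions)  # [2, 3]
```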

#### A Quick Aside on String Formatting 

There's one more thing that we should talk about before moving on from our discussion of strings - string formatting. String formatting is going to allow us to format strings in certain ways. Probably most usefully, it's going to allow us to insert variable contents into strings dynamically. Let's look at the syntax of it all.  

```python
In [1]: my_name = 'Sean'

In [2]: print('Hello %s' % my_name)

In [3]: print('Hello {}'.format(my_name))

In [4]: print(f'Hello {my_name}')
```

How is this working? Well, in each case, it's filling in a given part of our string with the value of our variable. In the first case, we use a `%` sign to denote where the replacement should happen, followed by a letter to denote what type of value will be passed in there (`s` is used for a string, `d` for a decimal integer, etc.). You can find what each letter denotes [here](https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting). In the second case, we use curly braces `{}` to denote where the replacement should take place. In the third case, an f-string (a string literal prefixed with `f`), we write the variable name directly inside the braces (`f"{variable}"`). With the `format()` method, we can also place numbers or names inside the braces and refer to them in the call to `format()`.

```python
In [1]: print('Hello {0}'.format(my_name))


In [2]: print('Hello {name}'.format(name=my_name))

```

This is something that we don't use much past pretty simple cases, but there are many more things you can do with it - you can read about them [here](https://docs.python.org/3/library/string.html#format-specification-mini-language). In general, though, string formatting is much more readable and flexible than a bunch of concatenation.
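To give a flavor of what else is possible, here is a brief sketch that fills in more than one value and uses a couple of simple format specifications. The values of `my_age` and `my_height` are made up for illustration.

```python
my_name = 'Sean'
my_age = 30         # hypothetical value for illustration
my_height = 1.8288  # hypothetical value for illustration

# All three styles can fill in several values at once.
old_style = 'Hello %s, you are %d' % (my_name, my_age)
new_style = 'Hello {}, you are {}'.format(my_name, my_age)
f_string = f'Hello {my_name}, you are {my_age}'
print(old_style == new_style == f_string)  # True

# A format specification after a colon controls how a value is displayed.
rounded = f'{my_height:.2f}'      # two digits after the decimal point
padded = '{:>8}'.format(my_name)  # right-aligned in a field 8 characters wide
print(rounded)       # 1.83
print(repr(padded))  # '    Sean'
```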

## Check your understanding
**String Operation Questions**

1. When does the distinction between using single and double quotes to build a string matter?
2. Fix the following string to be considered valid and not throw an error when run.
    * `'They told me not to do this, but I didn't listen.'`
3. Create a variable that holds a string of your name.
4. Create another variable that holds a string of your best friend's name.
5. Now, use string concatenation (i.e., addition) to add 'Hello, ' before your name.
6. Given the string 'Hello, Sean', replace each letter 'e' with a 't'.
7. Use the `.split()` method on the string 'Hello, Sean' to split it by the comma (`','`).
    * What happens if you split by a comma and a space (`', '`)?