# Strings

Lesson goals:

1.  Examine the string class in greater detail.
2.  Use `open()` to open, read, and write to files.


To start understanding the string type, let's use the built-in help system.

In [None]:
help(str)

The help page for string is very long, and it may be easier to keep it open
in a browser window by going to the [online Python
documentation](http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange)
while we talk about its properties.

At its heart, a string is just a sequence of characters. Basic strings are
defined using single or double quotes.

In [None]:
s = "This is a string."
s2 = 'This is another string that uses single quotes'

The reason for having two types of quotes to define a string is
emphasized in these examples:

In [None]:
s = "Bob's mom called to say hello."
s = 'Bob's mom called to say hello.'

The second line is a syntax error: Python reads `'Bob'` as a complete string, and the
rest of the line is not valid Python.
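
When a string contains an apostrophe, there are a few ways to write it safely. The sketch below is only an illustration (the variable names are made up for this example), and it borrows the backslash escape that is introduced later in this lesson:

```python
# Three ways to write the same text containing an apostrophe.
a = "Bob's mom called to say hello."       # double quotes around a single quote
b = 'Bob\'s mom called to say hello.'      # backslash-escaped single quote
c = '''Bob's mom called to say hello.'''   # triple quotes allow either kind inside

print(a == b == c)
```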

The ASCII character set defines 128 character codes (0 through 127) and is
supported by virtually all modern programming languages and computers.
Unfortunately, ASCII has no room for accented letters or non-Latin scripts.
In Python 3, every string is a Unicode string, so such characters can appear
directly in a literal. Python 2 required a leading u for Unicode strings, and
Python 3 still accepts the prefix for compatibility:

In [None]:
u = u'abcdé'

For the rest of this lecture, we will deal with ASCII strings, because
most scientific data that is stored as text is stored with ASCII.
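
To see the difference between characters and bytes, here is a small sketch. It assumes Python 3 (which the `print()` calls in this lesson suggest) and UTF-8 as the encoding; the variable names are illustrative only:

```python
# '\u00e9' is the escape for the accented letter e used in the example above.
plain = 'abcd\u00e9'
prefixed = u'abcd\u00e9'        # the u prefix makes no difference in Python 3
print(plain == prefixed)

# ord() gives the code point of a single character; ASCII stops at 127.
print(ord('a'), ord('\u00e9'))

# Encoding turns text into bytes; the accented letter needs two bytes in UTF-8.
encoded = plain.encode('utf-8')
print(len(plain), len(encoded))
```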

## Escape Characters 
How can you have multiline strings in Python? We can represent an "enter" using the escape sequence '\n'. An [escape character](https://en.wikipedia.org/wiki/Escape_character) sequence starts with a '\\' and is followed by another character, which gives the pair an alternative interpretation within the string. Try running the example below to see how \n changes the string: 

In [None]:
s = "Hello\n World"
print(s)

Notice how it didn't print \n, but replaced it with a newline. There are more escape sequences like this, such as \t, which is replaced with a tab, and \b, which is a backspace. Use [this guide](https://linuxconfig.org/list-of-python-escape-sequence-characters-with-examples) to print the following output as one string:

"If you think you can do a thing or think you can't do a thing, you're right."  

    - Henry Ford
 /\\
/  \

In [None]:
# TODO: Modify the code below
s = "Your string here"
print(s) # do not modify the print statement

## Working with Strings

Strings are iterables, which means many of the ideas from lists can also
be applied directly to string manipulation. For instance, characters can
be accessed individually or in sequences:

In [None]:
s = 'abcdefghijklmnopqrstuvwxyz'
s[0]

In [None]:
s[-1]

In [None]:
s[1:4] # this includes the char at index 1, but excludes the char at index 4

They can also be compared using the ordering and equality operators.

In [None]:
'str1' == 'str2'

In [None]:
'str1' == 'str1'

In [None]:
'str1' < 'str2'
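
String ordering works character by character, using each character's code point. The following sketch illustrates a few consequences; the example words are arbitrary:

```python
# The first differing characters decide the result.
print(ord('1'), ord('2'))
print('str1' < 'str2')

# Uppercase letters have lower code points than lowercase ones.
print('Zebra' < 'apple')

# Digits inside strings are compared as characters, not as numbers.
print('10' < '9')

# sorted() uses the same ordering.
ordered = sorted(['pear', 'Apple', 'fig'])
print(ordered)
```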

In the help screen, which we looked at above, there are lots of
functions that look like this:

    |  __add__(...)
    |      x.__add__(y) <==> x+y

    |  __lt__(...)
    |      x.__lt__(y) <==> x<y

These are special Python functions that interpret operations like < and \+.
We'll talk more about these in the next lecture on Classes.
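
As a quick check of this correspondence, the sketch below calls a few of these special methods directly and compares the result with the usual operator. Calling them by hand is only for illustration; normal code uses the operators:

```python
x, y = 'abc', 'abd'

# Each special method gives the same answer as its operator or built-in.
print(x.__add__(y) == x + y)
print(x.__lt__(y) == (x < y))
print(x.__le__(y) == (x <= y))
print(x.__len__() == len(x))
```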

Strings also come with many handy text-processing methods.

**Hands on example**

Try each of the following functions on a few strings. What do these
functions do?

In [None]:
s = "This is a string   "

In [None]:
s.startswith("This")

In [None]:
s.split(" ")

In [None]:
s.strip() # This won't change every string!

In [None]:
s.capitalize()

In [None]:
s.lower()

In [None]:
s.upper()
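
Once you have tried the methods above, two details are worth checking. This sketch suggests that string methods return new strings rather than changing the original, and that `split(" ")` and `split()` treat repeated spaces differently:

```python
s = "This is a string   "

# strip() returns a new string; s itself is unchanged.
stripped = s.strip()
print(repr(s))
print(repr(stripped))

# Splitting on a single space keeps empty strings for the trailing spaces.
by_space = s.split(" ")
print(by_space)

# With no argument, split() treats any run of whitespace as one separator.
by_whitespace = s.split()
print(by_whitespace)

# join() is the inverse operation: it glues a list of strings together.
print("-".join(by_whitespace))
```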

## Formatting
Try printing "Dave was traveling at 50 mph for 4.5 hours" using these given variables:

In [None]:
name = "Dave"
v = 50
t = 4.5

An easy way to do this is with string formatting:

In [None]:
print("%s was traveling at %d mph for %.1f hours" % (name, v, t))

**%s** is used to represent a string (name), **%d** is used to represent an integer (v), and **%.1f** is replaced with a float (t) rounded to one decimal place (a plain **%f** would print six decimal places, 4.500000). Now, try printing this data as "Dave drove 10 miles faster than Sally for 4.5 hours."

In [None]:
name = "Sally"
v1 = 40
print("Dave drove %d miles faster than %s for %.1f hours" % (v-v1, name, t))
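
Python offers other ways to build the same sentence. The sketch below compares `%` formatting with `str.format()` and f-strings; f-strings assume Python 3.6 or newer, and the variable names `old_style`, `with_format`, and `f_string` are just labels for this example:

```python
name = "Dave"
v = 50
t = 4.5

# printf-style formatting, with .1 limiting the float to one decimal place
old_style = "%s was traveling at %d mph for %.1f hours" % (name, v, t)

# str.format() uses {} placeholders with the same kind of format codes
with_format = "{} was traveling at {:d} mph for {:.1f} hours".format(name, v, t)

# f-strings put the variable names directly inside the braces
f_string = f"{name} was traveling at {v:d} mph for {t:.1f} hours"

print(old_style)
print(with_format)
print(f_string)

# Without a precision, %f prints six digits after the decimal point.
print("%f" % t)
```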

## File I/O

Python has a built-in function called "open()" that can be used to
manipulate files. The help information for open is below:

In [None]:
help(open)

The two main parameters we'll need to worry about are 'file', the name of the
file, and 'mode', which determines whether we can read from or write to the file. `open(...)` returns a file object, which acts like a pointer into the file, similarly to how an assigned variable can 'point' to a list/array.  
An example will make this clear. In the code below, I've opened a file
that contains two lines:

    $ cat ./OtherFiles/testfile.txt
    abcde
    fghij

Now let's open this file in Python:

In [None]:
f = open('./OtherFiles/testfile.txt','r')

The second input, 'r', means I want to open the file for reading only. I
cannot write to this handle. The read() method will read a specified
number of characters (one byte each for ASCII text):

In [None]:
s = f.read(3)
print(s)

We read the first three characters, where each character is a byte long.
We can see that the file handle points to the 4th byte (index number 3)
in the file:

In [None]:
f.tell() # which index we are pointing at

In [None]:
f.read(1) # read the next byte, starting from where the file handle is pointing

In [None]:
f.close() # close the old handle

In [None]:
f.read()  # can't read anymore because the file is closed.
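
A common way to avoid reading from a closed handle, or forgetting to close one, is the `with` statement, which closes the file automatically when the block ends. The sketch below is a self-contained illustration: it writes its own stand-in copy of the test file to a temporary directory, so the path is an assumption and not the lesson's `OtherFiles` folder:

```python
import os
import tempfile

# Create a small stand-in file so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), 'testfile.txt')
with open(path, 'w') as f:
    f.write('abcde\nfghij\n')

# The file is open only inside the with block.
with open(path, 'r') as f:
    first = f.read(3)       # read three characters
    position = f.tell()     # where the handle points now
    rest = f.read()         # read everything that is left

print(first, position)
print(f.closed)             # the handle was closed when the block ended
```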

The file we are using is a single series of characters, but two of the
characters are newline characters. If we looked at the file in
sequence, it would look like "abcde\nfghij\n". Separating a file into
lines is popular enough that there are two ways to read whole lines in a
file. The first is to use the `readlines()` method:

In [None]:
f = open('OtherFiles/testfile.txt','r')
lines = f.readlines()
print(lines)
f.close() # Always close the file when you are done with it

A very important point about the `readlines()` method is that it *keeps* the
newline character at the end of each line. You can use the `strip()`
method to get rid of whitespace, including newlines, at the beginning and end of the string.
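
Note that `strip()` removes all leading and trailing whitespace, not only the newline. If indentation or trailing tabs matter, `rstrip('\n')` is a narrower alternative. A short sketch with made-up lines:

```python
lines = ['  abcde\n', 'fghij\t\n']

# strip() removes spaces, tabs, and newlines from both ends.
fully_stripped = [line.strip() for line in lines]
print(fully_stripped)

# rstrip('\n') removes only trailing newline characters.
newline_only = [line.rstrip('\n') for line in lines]
print(newline_only)
```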

File handles are also iterable, which means we can use them in for loops
or list comprehensions. You will learn more about this iteration later:

In [None]:
f = open('OtherFiles/testfile.txt','r')
lines = [line.strip() for line in f]
f.close()
print(lines)

In [None]:
lines = []
f = open('OtherFiles/testfile.txt','r')
for line in f:
    lines.append(line.strip())
f.close()
print(lines)

These are equivalent operations. It's often best to handle a file one
line at a time, particularly when the file is so large it might not fit
in memory.
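
When the lines themselves do not need to be kept, a running summary avoids building a list at all. In this sketch, `io.StringIO` stands in for an open text file so the example is self-contained; with a real file you would iterate over the handle returned by `open()` in the same way:

```python
import io

# io.StringIO behaves like an open text file; it stands in for a large file here.
f = io.StringIO('abcde\nfghij\n')

line_count = 0
char_count = 0
for line in f:                        # only one line is held in memory at a time
    line_count += 1
    char_count += len(line.strip())   # count characters, ignoring the newline
f.close()

print(line_count, char_count)
```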

The other half of the story is writing output to files. We'll talk about
two techniques: writing to the shell and writing to files directly.

If your program only creates one stream of output, it's often a good
idea to write to the shell using the print function. There are several
advantages to this strategy, including the fact that it allows the user
to select where they want to store the output without worrying about any
command line flags. You can use "\>" to direct the output of your
program to a file or use "|" to pipe it to another program (this was covered in the 01-shell notebook).
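
One way to keep that flexibility is to write results to standard output and send diagnostics to standard error, which ">" does not capture. The sketch below assumes a hypothetical helper named `emit_results`; it is not part of any library:

```python
import io
import sys

def emit_results(values, out=sys.stdout):
    """Write one value per line to the given stream (hypothetical helper)."""
    for value in values:
        print(value, file=out)

# Default: standard output, which the shell can redirect with > or |.
emit_results([1, 2, 3])

# Diagnostics go to standard error so they stay out of redirected output.
print("processed 3 values", file=sys.stderr)

# The same function can write into any file-like object.
buffer = io.StringIO()
emit_results([1, 2, 3], out=buffer)
```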

Sometimes, you need to direct your output directly to a file handle. For
instance, if your program produces two output streams, you may want to
assign two open file handles. Opening a file for writing simply requires
changing the second option from 'r' to 'w' or 'a'.

*Caution!* Opening a file with the 'w' option erases any existing contents
and starts writing *at the beginning*. If you want to append
to the file without losing what is already there, open it with 'a'.

Writing to a file uses the `write()` command, which accepts a string. Check outfile.txt before and after running the following code.

In [None]:
outfile = open('OtherFiles/outfile.txt','w')
outfile.write('This is the first line!')
outfile.close()

Another way to write to a file is to use `writelines()`, which accepts a
list of strings and writes them in order.   

*Caution!* `writelines()` does not
append newlines. If you really want to write a newline at the end of
each string in the list, add it yourself.
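
A minimal sketch of adding the newlines yourself is shown below. It writes to a file in a temporary directory (an assumption made so the example is self-contained) and reads the result back to check it:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'outfile.txt')
rows = ['first', 'second', 'third']

# writelines() writes each string as-is, so append '\n' to each one.
with open(path, 'w') as outfile:
    outfile.writelines(row + '\n' for row in rows)

# Read the file back to confirm that each row landed on its own line.
with open(path, 'r') as infile:
    contents = infile.read()

print(repr(contents))
```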

### Aside About File Editing

How can you edit a file in place? In append mode you cannot: you can use `f.seek()`
and `f.tell()` to verify that even if your file handle is pointing to the
middle of a file, write commands go to the end of the file. The best way
to change a file is to open a temporary file in
/tmp/, fill it, and then move it to overwrite the original. On large
clusters, /tmp/ is often local to each node, which means it reduces I/O
bottlenecks associated with writing large amounts of data.
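
The write-then-replace pattern might look like the sketch below. It differs from the description above in one labeled way: the temporary file is created next to the original rather than in /tmp/, because `os.replace()` needs both paths on the same filesystem (`shutil.move()` is an option when they are not). The file name `data.txt` and the uppercase edit are placeholders:

```python
import os
import tempfile

# Set up a stand-in original file in a scratch directory.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, 'data.txt')
with open(original, 'w') as f:
    f.write('abcde\nfghij\n')

# Write the edited copy to a temporary file, one line at a time.
fd, tmp_path = tempfile.mkstemp(dir=workdir)
with os.fdopen(fd, 'w') as tmp, open(original, 'r') as src:
    for line in src:
        tmp.write(line.upper())     # placeholder edit: uppercase every line

# Swap the finished temporary file into place over the original.
os.replace(tmp_path, original)

with open(original, 'r') as f:
    result = f.read()
print(repr(result))
```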

## Exercise
Find the index of the string 'needle' in the file OtherFiles/haystack.txt using the `str.find()` method of a string. Recall that `file.read()` will return the file as a string.

Note: if you run `file.read()` twice on the same handle, what does the second call return? Is it different from the output of the first call? 

In [None]:
# TODO: Modify the code below
f = None  # open the file here
n = 0 # find the index here
print("Needle at Index %d" % n) # n should be 185
f.close()

Create a new file OtherFiles/haystack1.txt by opening it in "w+" mode. "a+" also creates the file if it does not exist, whereas "r+" raises an error when the file is missing. Write the contents of haystack.txt into this new file and add an extra 'needle' at the end.

In [None]:
# TODO: Modify the code below
f = None
f.close()

# Prints the file to check your answer; don't write anything here
f = open("OtherFiles/haystack1.txt","r")
print(f.read())
f.close()