Now that we have covered some coding basics, we have a better understanding of syntax, and we can refer back to our earlier examples when the programming gets more complicated. In Part II, we will start on some more science-specific applications. To do that, we will need some additional Python functions.

For our first analysis tasks, we will work with text files. 

# Opening text files

To work with a text file in Python, we start with the `open()` function. The docstring for `open` has a lot of info if you want more details. Here we will cover the basic syntax for reading text files.

```
f = open(filename, mode)
```
- `f` is a variable that corresponds to a file object. It is not the same as the info in the file. But we can use the file object to do basic operations with the file.
- `filename` is a string with the path to the file, either a full path or one relative to the current folder. If the file is in a folder, list the folders and separate them with slashes, like `"folder/subfolder/more_subfolders/file.txt"`.
- `mode` must be one of a few options of strings. For our purposes, we will use `'r'`, `'w'`, or `'a'` for "read", "write", and "append" modes, respectively. Depending on the mode you choose, the functions for handling `f` will be different.


### <mark>EX 2.1 - open a file</mark>

*Using the syntax above, open the file "quotes.txt" in the folder "data". Try printing it and see what happens.*

# Reading text files

We want to get the text from the file into a string, or a list of strings, so we can start to parse it. There are a few ways to do this. All the functions are methods of `f`, so we call them by using `f.function_name()`.

- `f.read(N)` - this will read the next `N` characters into a single string, starting from the current position in the file. The `N` argument is optional. If you call `f.read()` on a freshly opened file, you get the whole file text.
- `f.readline()` - this will read the next line, up to and including the next "\n" (or new line). Calling it repeatedly gives you each subsequent line.
- `f.readlines()` - this will read all the remaining lines into a list of strings, one string per line. Each string keeps its trailing "\n".

Each of these picks up where the previous read stopped. You can think of the file as having a text cursor: reading moves the cursor along until it reaches the end of the file. After that, any further read returns an empty string (or an empty list, for `readlines`). So once one method has read the whole file, the others have nothing left to read unless you reopen the file.

After you're done reading a file, you should close it: `f.close()`
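
As a minimal sketch of the cursor idea, the cell below first writes a small throwaway file (the name `cursor_demo.txt` and its contents are made up for illustration, and the file lands in your current folder), then mixes the three reading methods on one open file. Each call continues from where the one before it stopped.

```python
# Create a small throwaway file to read (hypothetical name and contents).
f = open('cursor_demo.txt', 'w')
f.write('first line\nsecond line\nthird line\n')
f.close()

f = open('cursor_demo.txt', 'r')
first_five = f.read(5)        # the first 5 characters
rest_of_line = f.readline()   # the rest of line 1, including its '\n'
remaining = f.readlines()     # every line that is left, as a list
at_end = f.read()             # nothing left: the cursor is at the end
f.close()

# repr() shows the quotes and any '\n' characters explicitly
print(repr(first_five))
print(repr(rest_of_line))
print(remaining)
print(repr(at_end))
```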


### <mark>EX 2.2 - read and print a file</mark>
*Read and print the contents of the file you opened in the previous exercise. Close it when you are done.*

# Strings

When we read our file, the text is usually not formatted in a nice way for handling data. It likely includes extraneous whitespace characters: normal spaces, and also '\n' (new line) or '\t' (tab). Some lines might be totally empty. An empty string is just `''` or `""`.

Here are some examples of what you can do with strings. Take a look at each output. We will use these functions to read a file into a specific format.

In [None]:
s = '        test string, to do some tests   '
print(s)

In [None]:
# replacing text
print(s.replace('test', 'TEST'))

In [None]:
# removing whitespace
print(s.strip())

In [None]:
# split into a list  
print(s.split(','))   # split on comma
print(s.split(' '))   # split on one space

# by default, we split on all whitespace.
# this can get rid of extra spaces!
print(s.split())
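
These methods can also be chained, since each one returns a new string (or list) that the next one can act on. Here is a small sketch using a made-up line of text, `raw_line`, similar to what you might get back from `f.readline()`. The contents are purely illustrative.

```python
# A made-up line of text, with stray spaces and a trailing new line.
raw_line = '  carbon , 12.011 , nonmetal \n'

# remove every space, then strip the trailing '\n'
compact = raw_line.replace(' ', '').strip()
print(repr(compact))

# split what is left on the commas
pieces = compact.split(',')
print(pieces)

# a line holding only whitespace strips down to the empty string
blank_line = '   \n'
is_blank = blank_line.strip() == ''
print(is_blank)
```

Note that removing every space would also squash multi-word text together, so this particular chain only suits lines like the one above.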

# Lists

Lists are probably the most important variable type to understand when working with scientific data. We use lists to handle parameters, variables, and tables of information to analyze. We often want to re-format file text into some kind of list or list of lists (i.e. a table).

Like strings, lists have their own functions and methods that are important for expertly implementing them. 

**Indexing**

The most fundamental list principle is probably "indexing", or accessing each individual element. The elements of a Python list `lst` are denoted as `lst[0], lst[1], lst[2], ...` until you reach the end of the list. The number of elements in a list can be found as `len(lst)`. Since the first element of a list has index `0`, the final element has index `len(lst)-1`. This might seem weirdly complicated--why not start with 1?--but it actually becomes useful for iterating through the list in "for" loops with the "range" function. You can also go backwards using negative indices. The last element also has index `-1`.
```
first_element =         lst[0]
last_element =          lst[len(lst)-1]
also_the_last_element = lst[-1]
```

**Slicing**

The simplest form of indexing uses just one number. We can also use multiple numbers, separated by a colon, to grab part of a list. For example, `lst[0:N]` will be a list of the first $N$ elements of `lst`. (Note that since the first element has index 0, this sublist includes `lst[0]` but omits `lst[N]`). If we want to index from the start of the list or to the end of the list, we can also omit the index entirely and just use the colon.
```
N = len(lst)//2
whole_list = lst[:]
first_half = lst[:N]
last_half =  lst[N:]
middle =     lst[N//2 : N + N//2]
```

**Direction**

Finally, we can add a second colon followed by a step, which sets the direction and stride we use to move through the list. This can help us skip over values, getting every 2nd, 3rd, ... Nth value. A negative step goes backwards, which flips the order of the list.
```
every_other_element = lst[::2]
reversed_order      = lst[::-1]
```

We can combine indexing, slicing, and direction by placing or omitting values between the colons we want to use.
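
For instance, here is a short sketch on a made-up list of the numbers 0 through 9, chosen so that each value equals its own index:

```python
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

odd_middle = lst[1:8:2]     # start at index 1, stop before index 8, step by 2
first_evens = lst[:5:2]     # the first five elements, taking every other one
back_to_start = lst[5::-1]  # start at index 5 and walk backwards to the start
last_three = lst[-3:]       # negative indices work in slices too

print(odd_middle)
print(first_evens)
print(back_to_start)
print(last_three)
```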


**Sorting**

We can call the `lst.sort()` method to alphabetize (for string elements) or order (for number elements) a list. It will not work on lists that mix strings and numbers, but we can mix floats and ints. The method returns `None`, and it stores the newly sorted list in the old variable `lst`. Some functions return a result and leave the original variable unchanged, while others change the original variable, so it is important to keep track of which is which. You can always look things up if you forget (as I often do).

Also note that an empty list is just `[]`. It has length 0.
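
To make the difference between "changes the original" and "returns something new" concrete, here is a small sketch with made-up values. It compares `sort()` on a list with `replace()` on a string.

```python
numbers = [3, 1.5, 2]   # ints and floats can be mixed

# sort() changes the list itself and returns None
result = numbers.sort()
print(result)
print(numbers)

# replace() returns a new string and leaves the original unchanged
word = 'test'
new_word = word.replace('t', 'T')
print(word)
print(new_word)
```

A common slip is writing `numbers = numbers.sort()`, which throws the list away and leaves `numbers` equal to `None`.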

Example usages:

In [None]:
l = ['test','list','for','testing']
print(l)

In [None]:
# index
for ind in range(len(l)):
    print('index', ind, l[ind])

In [None]:
# alphabetize
l.sort()  # note, this function returns "None" and changes the original variable
print(l)

In [None]:
# reverse
l_reversed = l[::-1]
print(l_reversed)

There are some other key list operations worth knowing. See below.

In [None]:
# join a list of strings into one string
# (join is a method of the separator string)
print(' '.join(l))
print('_SPACE_'.join(l))

In [None]:
# check the number of elements in a list (length)
length = len(l)
print(length)

In [None]:
# `append` or add an element to the end
l.append('an appended string')
print(l)
l.append('a second string')
print(l)

In [None]:
# `pop` or remove the last element and assign it to a variable
element_last = l.pop()
print(element_last)
print(l)

### <mark>EX 2.3 - read a file into a list</mark>

*Using the methods we've learned above, do the following:*

1. Open, read, and close the text content of "quotes.txt"
2. Format the text into a list of each line. Do not include blank lines.
3. Format each element of your main list into a sublist with three elements, [quote, author, book]
4. Store this final list in a variable of your choice.

*There are multiple ways to do this! Try it on your own, then compare with a neighbor.
As you write your code, if you get confused, I like to remember this rhyme:*

![when in doubt print](imgs/print.png)

*Also, feel free to check the file content by double clicking it in your file browser on the left, if you want. This will help you get an idea of its structure. This is often critical for helping read files into the proper structure.*

# Writing text files

Similar to reading existing text files, we can create and write to new text files. This uses the same syntax for opening a file:
```
f = open(filename, mode)
```
We can specify the mode as `'w'`, to write to a file. This overwrites any existing file contents.
We could also use `'a'`, for appending text to the end of the existing file.
Both modes create the file if it does not exist yet.
The path to the file (the folders and subfolders) must already exist, but the file itself does not need to exist. You should include an extension like ".txt" when you choose a new filename.

(You can actually create the folders if you want. We will get to that in part 3, when we talk about importing modules. For now, let's just use existing ones).


For writing data to our file, the functions look very similar to those for reading data:
- `f.write(my_string)` - this will write a string to a file.
- `f.writelines(my_list)` - this will write a list of strings to a file, one after another. It does not add "\n" between them.

Check the currently existing files in the "data" folder. Then run the code below and see what happens.



In [None]:
f = open('data/test.txt', 'w')
f.writelines(l)
f.close()

If we want to write multiple lines, we need to make sure to include new line '\n' characters. Let's try this instead:

In [None]:
f = open('data/test.txt', 'w')
f.write('\n'.join(l))   # join gives one string, so we use write
f.close()
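
To see the difference between `'w'` and `'a'`, here is a minimal sketch that writes one line, reopens the same file in append mode to add a second, and then reads the result back. The file name `append_demo.txt` is made up for illustration, and the file is created in your current folder.

```python
f = open('append_demo.txt', 'w')   # 'w' starts from an empty file
f.write('line one\n')
f.close()

f = open('append_demo.txt', 'a')   # 'a' keeps what is already there
f.write('line two\n')
f.close()

f = open('append_demo.txt', 'r')
contents = f.read()
f.close()
print(contents)
```

If you change the second `'a'` to `'w'` and run the cell again, only "line two" should survive.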

# Formatted printing

Formatting strings is also useful for writing to text files.
You can insert variables into strings using format strings (often called f-strings), which look like `print(f"your stuff here {a_variable} {another_variable}")`.
You start with an `f` (no quotes) immediately before a string (with quotes), and enclose each variable in curly brackets `{}`.
Depending on the variable type, you can add extra formatting inside the brackets. For floats, you can round to a fixed number of decimals.


In [None]:
number = 10.123456789
name = "Dorothy"
location = "not Kansas"

# manipulating decimals
print(f'Three decimals - {number:.3f}')
print(f'Zero decimals - {number:.0f}')
print()

# manipulating total characters, right justify numbers
print(f'Eight total characters - {number:8.0f}')
print(f'Eight total characters - {number:8.3f}')
print()

# combining strings
print(f'{name} is located in {location}!')
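
Fixed widths are handy for lining up columns before writing them to a file. The sketch below builds three aligned rows from made-up labels and values. Note that `.2f` rounds (22.256 becomes 22.26) rather than chopping off the extra digits.

```python
# made-up labels and measurements, for illustration only
labels = ['a', 'bb', 'ccc']
values = [1.5, 22.256, 333.1]

rows = []
for i in range(len(labels)):
    # label padded to 5 characters (strings are left-justified by default),
    # value right-justified in 8 characters with 2 decimals
    rows.append(f'{labels[i]:5}{values[i]:8.2f}')

for row in rows:
    print(row)
```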

### <mark>EX 2.4 - write a new file</mark>

*Let's change our text file into something new and output it into a new file. Pick one of the quotes from the file we read in earlier.*

1. Choose a word (or two) and switch it to a new word. The more absurd the better. Try to have it make sense.
2. Change the author and book to your own name.
3. Write your new quote to a file "quotes_new.txt" in the "data" folder.
4. If you have extra time, append other modified quotes.

*Some words are capitalized and others are lowercase. Verbs have different tenses. Make sure to handle these different cases.*

# Putting everything together

Not all data is nicely formatted. For example, our file reading functions always read in string types, but sometimes we want to work with numbers and must do type conversions. At the same time, numbers and strings might be interspersed in the same line. Other lines might be exceptions that do not include all the data we want. (This is common in science, especially clinical or experimental work. You might try to keep track of all the data, but miss elements for certain patients or trials).
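
As a rough sketch of what this can look like, the cell below parses a few made-up tab-separated lines. The column names and the `'NA'` marker for a missing measurement are assumptions for illustration; real files will have their own conventions.

```python
# made-up lines, as f.readlines() might return them; trial 2 has no measurement
lines = ['trial\tmass_g\n',
         '1\t4.25\n',
         '2\tNA\n',
         '3\t5.75\n']

trials = []
masses = []
for line in lines[1:]:                  # skip the header line
    parts = line.strip().split('\t')
    if parts[1] != 'NA':                # only keep lines with a real measurement
        trials.append(int(parts[0]))    # convert the string '1' to the integer 1
        masses.append(float(parts[1]))  # convert the string '4.25' to the float 4.25

print(trials)
print(masses)
```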


### <mark>EX 2.5 - Complicated text file</mark>

*Look at the file "olympics.txt". Each line has multiple elements separated by tabs. The first line is a header telling what each element is. Note that some of the games were cancelled due to world wars.*

1. Read in "olympics.txt".
2. Create a list of lists tabulating the data.
3. Calculate the average and standard deviation of the years in which the Olympics were held.
4. Print your results out to 3 decimal places.

*Remember the equations for average $\overline{x}$ and standard deviation $\sigma$ over $N$ numbers:*
$$
\overline{x} = \frac{1}{N} \sum_i^N x_i
$$
$$
\sigma = \sqrt{\frac{\sum_i^N (x_i - \overline{x})^2}{N}}
$$
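
*If you want numbers to test your code against, here is a small worked check using eight made-up values (2, 4, 4, 4, 5, 5, 7, 9). The average is the sum divided by $N = 8$, and the standard deviation uses each value's squared distance from that average:*
$$
\overline{x} = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = \frac{40}{8} = 5
$$
$$
\sigma = \sqrt{\frac{9 + 1 + 1 + 1 + 0 + 0 + 4 + 16}{8}} = \sqrt{\frac{32}{8}} = 2
$$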

*This is more complex. Think about the task on your own for a moment, then consult a buddy nearby. If you want hints, see the last markdown box of this file.*


### <mark>EX 2.6 - Bonus</mark>

*If you finish the above task early, try computing the number of Olympics in each continent. Which continent had the most Olympics? The least?*

## Hints
You'll probably want to initialize numbers at 0 or lists at [] outside of a "for" loop.

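Everything read from a file is a string, so convert the years with `int()` or `float()` before doing arithmetic. In Python 3, dividing with `/` always gives a float, even for two integers, while `//` rounds down to a whole number.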

Some Olympics were cancelled, and these lines have unique formatting. This might require an "if" statement.