# [CptS 215 Data Analytics Systems and Algorithms](https://github.com/gsprint23/cpts215)
[Washington State University](https://wsu.edu)

[Gina Sprint](http://eecs.wsu.edu/~gsprint/)
# Python Basics II

Learner objectives for this lesson:
* Understand file I/O in Python
* Apply basic string operations

Content used in this lesson is based upon information in the following sources:
* None to report

## File I/O
A simple way to store data is in a *text file*, such as [transactions.txt](https://raw.githubusercontent.com/gsprint23/cpts215/master/lessons/files/transactions.txt), which stores an individual's credit card transaction history. Each line in the file represents a transaction price.

To process data in a file, we typically take the following approach:
1. Open the file
1. Process the file
    * Read data (doesn't modify the file) or
    * Write data (overwrite existing file) or
    * Append data (retains existing information and adds new data)
1. Close the file

### Opening a File
Before we can read from a file or write to a file, we first need to open the file and get a file object (AKA handle). We do this with the built-in function `open()`:

In [1]:
# in_file is our variable connecting our program to transactions.txt
# transactions.txt is a file I have in a files folder in the same folder as this running Python file
in_file = open(r"files\transactions.txt", "r")

#### File Modes
The first argument to `open()` is a string representing the path to the file and the second argument is the file opening *mode*: 
1. "r" for reading
    1. File must exist or you will get an error
1. "w" for writing
    1. If the file does not exist, it is created
    1. If the file does exist, it is cleared!
1. "a" is for appending
    1. If the file does not exist, it is created
    1. If the file does exist, new data written to the file is added at the end of the file

You can read more about modes [here](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files). 

`open()` returns an object that represents the connection between our program and transactions.txt.
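
The three modes can be compared in a short, self-contained sketch. It assumes a hypothetical scratch file named `demo.txt` in a temporary directory (not part of the lesson's files folder) so that it runs anywhere without touching transactions.txt:

```python
import os
import tempfile

# hypothetical scratch file in a temporary directory (not part of the lesson files)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# "w" creates the file because it does not exist yet
out_file = open(demo_path, "w")
out_file.write("13.42\n")
out_file.close()

# "a" keeps the existing line and adds new data at the end
out_file = open(demo_path, "a")
out_file.write("27.19\n")
out_file.close()

# "r" reads the file without modifying it
in_file = open(demo_path, "r")
after_append = in_file.read()
in_file.close()
print(repr(after_append))

# "w" on an existing file clears it before writing
out_file = open(demo_path, "w")
out_file.write("9.98\n")
out_file.close()

in_file = open(demo_path, "r")
after_overwrite = in_file.read()
in_file.close()
print(repr(after_overwrite))
```

After the append, the file holds both lines; after the second `"w"` open, only the newly written line remains.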

#### Paths
The directory (or folder) where your Python script is running is called the *current directory*. When you open a file, Python looks for it in the current directory. 

If a file you want to open is in a directory other than the current directory, you will have to specify its path. 

Note: On a Windows machine, folders and file names in a path are separated by backslashes "\". The backslash has a special purpose in Python strings: it escapes certain characters, such as a newline "\n". Therefore, you have to escape each backslash in a path with a second backslash ("`\\`"): `"files\\transactions.txt"`. Alternatively, you can specify your path as a raw string: `r"files\transactions.txt"`. On a Unix-based machine (e.g. Mac, Linux distributions), the forward slash "/" is used in paths and you don't have to worry about this issue.
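
One way to sidestep the separator issue is to let Python build the path. The sketch below uses `os.path.join()` from the standard library, which this lesson does not otherwise cover; the folder and file names are the ones used above:

```python
import os

# os.path.join() inserts the separator that is correct for the current operating system
portable_path = os.path.join("files", "transactions.txt")
print(portable_path)

# the escaped-backslash string and the raw string spell the same Windows path
escaped = "files\\transactions.txt"
raw = r"files\transactions.txt"
print(escaped == raw)
```

The printed path uses a backslash on Windows and a forward slash on Mac or Linux, so the same code can run on either.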

### Closing a File
When we are done with a file, we should close it with `close()`:

In [2]:
in_file.close()
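
Python also offers the `with` statement, which closes the file automatically when its indented block ends, even if an error occurs inside the block. This lesson uses explicit `close()` calls, so treat the sketch below as an optional alternative. It assumes a hypothetical scratch file in a temporary directory so that it is self-contained:

```python
import os
import tempfile

# hypothetical scratch file (not part of the lesson files)
demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(demo_path, "w") as out_file:
    out_file.write("13.42\n")
# the with block has ended, so out_file is already closed
print(out_file.closed)

with open(demo_path, "r") as in_file:
    first_line = in_file.readline()
print(repr(first_line), in_file.closed)
```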

### Processing a File
Once a file is open, we want to process the data inside the file (reading) or save data to file (writing). Consider the example [transactions.txt](https://raw.githubusercontent.com/gsprint23/cpts111/master/lessons/transactions.txt) we opened earlier.

#### Reading from a File
We will use the `readline()` function to read in a *single* line in the file (in transactions.txt this is the purchase price as a **string including the newline character \n**):

In [4]:
in_file = open(r"files\transactions.txt", "r")
transaction = in_file.readline()
# note the newline printed!! repr() shows non-printable characters like \n
print(transaction, repr(transaction), type(transaction))
transaction = float(transaction)
print(transaction, type(transaction))

13.42
 '13.42\n' <class 'str'>
13.42 <class 'float'>


#### Writing to a File
Now, let's use the `write()` function to write the transaction price we just read in to an output file called single_transaction.txt:

In [5]:
# creates the file if it does not exist
# overwrites the file contents if it does exist
out_file = open(r"files\single_transaction.txt", "w")
# save the value of transaction as string
out_file.write("%.2f" %(transaction))

# close file because we are done with out_file
out_file.close()

### Example Problem
On average, how much money do I spend per credit card transaction?

Algorithm:
1. For each transaction
    1. Read in the purchase price from file
    1. Accumulate the total money spent so far
1. Divide total money spent by total number of transactions
1. Write the average transaction to file

In [12]:
def read_transaction_price(in_file):
    '''
    Reads one line from in_file and returns the purchase price as a float.
    '''
    # readline() returns a string, including the newline character
    price = in_file.readline()
    # we need to convert the string returned by readline() to a numeric value
    return float(price)

def compute_total_spent():
    '''
    Reads the 5 transactions in transactions.txt and returns their total.
    '''
    total_spent = 0.0
    
    in_file = open(r"files\transactions.txt", "r")

    # read in all 5 transactions in the file
    for i in range(5):
        total_spent += read_transaction_price(in_file)
    
    # close the file before in_file goes out of scope
    in_file.close()
    
    return total_spent

total_spent = compute_total_spent()

avg_spent_per_transaction = total_spent / 5.0

out_file = open(r"files\avg_transaction.txt", "w")
out_file.write("On average, you spend %.2f per transaction" %(avg_spent_per_transaction))
out_file.close()

## File Reading
### `for` Loops
Let's rewrite our transaction code to read in as many transactions as there are in the file (instead of the hard-coded 5). Using a `for` loop, `<sequence>` will be all of the lines in the input file, which we can get with a call to `in_file.readlines()`. Our `for` loop will walk through each line one at a time with a loop control variable called `line`.

In [13]:
def compute_avg_spent():
    '''
    Reads every transaction in transactions.txt and returns the average price.
    '''
    # accumulator variable
    total_spent = 0.0
    # count the transactions
    num_transactions = 0

    # open the file here so every call starts reading at the first transaction
    in_file = open(r"files\transactions.txt", "r")

    # the input file contains lines that we will iterate through as our items
    for line in in_file.readlines():
        print(line)
        total_spent += float(line)
        num_transactions += 1
    
    # close the file before in_file goes out of scope
    in_file.close()
    
    return total_spent / num_transactions

avg_spent_per_transaction = compute_avg_spent()

print("On average, you spend %.2f per transaction" %(avg_spent_per_transaction))

13.42

27.19

9.98

48.56

33.71
On average, you spend 26.57 per transaction


Note: There is another file function, `read()`, that will read the entire file into a single string, rather than a list of lines as `readlines()` returns or a single line as `readline()` returns. When we learn more about strings, this function will be more useful. For now, let's stick with `readlines()`.
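
To make the difference between the three reading functions concrete, here is a minimal sketch. It assumes a hypothetical scratch file holding three transaction prices, created in a temporary directory so the example is self-contained:

```python
import os
import tempfile

# hypothetical scratch file holding three transaction prices
demo_path = os.path.join(tempfile.mkdtemp(), "three_transactions.txt")
out_file = open(demo_path, "w")
out_file.write("13.42\n27.19\n9.98\n")
out_file.close()

# readline() returns a single line as a string
in_file = open(demo_path, "r")
one_line = in_file.readline()
in_file.close()

# readlines() returns a list with one string per line
in_file = open(demo_path, "r")
all_lines = in_file.readlines()
in_file.close()

# read() returns the entire file as one string
in_file = open(demo_path, "r")
whole_file = in_file.read()
in_file.close()

print(repr(one_line))
print(all_lines)
print(repr(whole_file))
```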

### `while` Loops 
Let's rewrite our transaction processing code to use a `while` loop. `readline()` will return an empty string when the end of the file is reached. This can be used in our Boolean condition:

In [14]:
def compute_avg_spent():
    '''
    Reads every transaction in transactions.txt and returns the average price.
    '''
    # accumulator variable
    total_spent = 0.0
    # count the transactions
    num_transactions = 0
    
    in_file = open(r"files\transactions.txt", "r")

    # read the first line in the file
    spent = in_file.readline()
    # test if this line is the empty string, meaning the end of file has been reached
    while spent != "":
        # not end of file, process this transaction
        print(spent)
        total_spent += float(spent)
        num_transactions += 1
        # progress toward Boolean condition being False here is progress through the file
        spent = in_file.readline()
    
    # close the file before in_file goes out of scope
    in_file.close()
    
    return total_spent / num_transactions

avg_spent_per_transaction = compute_avg_spent()

print("On average, you spend %.2f per transaction" %(avg_spent_per_transaction))

13.42

27.19

9.98

48.56

33.71
On average, you spend 26.57 per transaction


## The File "Cursor"
When you open a file for reading ("r" mode), the cursor marking the current position at which to read starts at the beginning of the file (position 0). As `readline()` is called, the cursor moves through the file. To find out the position of the cursor, you can call `tell()`:

In [16]:
in_file = open(r"files\transactions.txt", "r")

print("File cursor is at position: %d" %(in_file.tell()))

# reading data from the file advances the cursor by a certain number of bytes, depending on the number of characters in the line
transaction = in_file.readline()
print("File cursor is at position: %d" %(in_file.tell()))
# %r placeholder displays all characters in a string. we use it to see the newline character as \n
print("First line contains: %r which contains %d characters (including newline)" %(transaction, len(transaction)))
in_file.close()

File cursor is at position: 0
File cursor is at position: 7
First line contains: '13.42\n' which contains 6 characters (including newline)


To move the cursor back to the beginning of the file, you can either:
1. Close the file and re-open it
1. Use `seek(0,0)`:

In [17]:
in_file = open(r"files\transactions.txt", "r")

print("File cursor is at position: %d" %(in_file.tell()))

# reading data from the file advances the cursor by a certain number of bytes, depending on the number of characters in the line
transaction = in_file.readline()
print("File cursor is at position: %d" %(in_file.tell()))
# %r placeholder displays all characters in a string. we use it to see the newline character as \n
# len() returns the number of characters in the string
print("First line contains: %r which contains %d characters (including newline)" %(transaction, len(transaction)))
# move the cursor back to the beginning of the file
in_file.seek(0,0) 
print("File cursor is at position: %d" %(in_file.tell()))
in_file.close()

File cursor is at position: 0
File cursor is at position: 7
First line contains: '13.42\n' which contains 6 characters (including newline)
File cursor is at position: 0
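
Rewinding with `seek(0,0)` is handy when a program needs two passes over the same file. The following is a minimal sketch under the assumption of a hypothetical scratch file with three transaction prices; the first pass counts the lines and the second pass totals them:

```python
import os
import tempfile

# hypothetical scratch file holding three transaction prices
demo_path = os.path.join(tempfile.mkdtemp(), "two_pass.txt")
out_file = open(demo_path, "w")
out_file.write("13.42\n27.19\n9.98\n")
out_file.close()

in_file = open(demo_path, "r")

# first pass: count the transactions
num_transactions = len(in_file.readlines())

# the cursor is now at the end of the file, so another read returns the empty string
at_end = in_file.readline()

# rewind, then make a second pass to total the transactions
in_file.seek(0, 0)
total_spent = 0.0
for line in in_file.readlines():
    total_spent += float(line)
in_file.close()

print(num_transactions, repr(at_end), "%.2f" %(total_spent))
```

Without the call to `seek(0,0)`, the second loop would have nothing left to read and the total would stay at 0.0.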


Note: In the code above I used a built-in function called [`len()`](https://docs.python.org/3/library/functions.html#len). `len()` accepts a string as an argument and returns the number of characters in the string.

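Digression: On Windows, newlines are actually represented by \r\n (carriage return and newline). When reading a file in text mode, Python converts each \r\n into a single \n for us, so we don't have to worry about this. Knowing this at least helps explain the cursor position of 7 above: the first line occupies 7 bytes in the file, even though the string returned by `readline()` has 6 characters.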

|Position|0|1|2|3|4|5|6|7|8|...|
|-|-|-|-|-|-|-|-|-|-|-|
|Character|1|3|.|4|2|\r|\n|2|7|...|
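
The sketch below makes the hidden carriage return visible. It relies on the binary modes `"wb"` and `"rb"`, which this lesson does not otherwise cover, and on a hypothetical scratch file; treat it as an illustration rather than something you need for the transaction examples:

```python
import os
import tempfile

# hypothetical scratch file (not part of the lesson files)
demo_path = os.path.join(tempfile.mkdtemp(), "windows_newlines.txt")

# write raw bytes so the file contains \r\n regardless of the operating system
out_file = open(demo_path, "wb")
out_file.write(b"13.42\r\n27.19\r\n")
out_file.close()

# binary mode ("rb") shows the bytes exactly as they are stored
in_file = open(demo_path, "rb")
raw_line = in_file.readline()
in_file.close()

# text mode ("r") translates \r\n into a single \n
in_file = open(demo_path, "r")
text_line = in_file.readline()
in_file.close()

print(raw_line, len(raw_line))
print(repr(text_line), len(text_line))
```

The binary read reports 7 bytes for the first line, while the text read reports 6 characters.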

## Removing Leading and Trailing Whitespace Characters
We can remove whitespace characters (like `\n`) with a call to a string function `strip()`:

In [18]:
in_file = open(r"files\transactions.txt", "r")

# reading data from the file advances the cursor by a certain number of bytes, 
# depending on the number of characters in the line
transaction = in_file.readline()

# repr() displays all characters in a string. we use it to see the newline character as \n
print("First line: ", repr(transaction))
in_file.close()
print("First line stripping leading/trailing whitespace characters: ", repr(transaction.strip()))

First line:  '13.42\n'
First line stripping leading/trailing whitespace characters:  '13.42'
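
`strip()` has two relatives, `lstrip()` and `rstrip()`, that remove whitespace from only the left or only the right side. A short sketch, using a made-up padded string, shows the difference and also that whitespace inside the string is left alone:

```python
# made-up example string with leading spaces, a tab, and a trailing newline
padded = "  \t13.42\n"

both_sides = padded.strip()    # removes leading and trailing whitespace
right_only = padded.rstrip()   # removes trailing whitespace only
left_only = padded.lstrip()    # removes leading whitespace only
inner_kept = "  cpts 215  ".strip()  # the space between the words stays

print(repr(both_sides))
print(repr(right_only))
print(repr(left_only))
print(repr(inner_kept))
```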


## Revisiting `print()`
There are several ways to print strings with the `print()` function. It is helpful to be aware of other printing approaches, especially when you want to format output a particular way. Check out these different ways to print:

In [19]:
# format string and placeholders
print("Integer: %d, Float: %f, Float with 1 decimal: %.1f, String: %s" %(7, 8.4898899, 3.14, ":)"))

# arguments displayed separated by spaces
print(4, 5.5, ":P", 8)
# specifying the delimiter between arguments (a comma and a space)
print("A comma", "separated", "list", sep=", ")

# specifying the string concatenated to the end
print("A string without the added newline character", end="")
print("This sentence runs into the previous", end="!\n")

# https://docs.python.org/3/library/string.html
print("An {} form of placeholders {:.1f}. You can also use keywords {name}".format("alternative", 9.99, name="cpts215"))

# alternative way to write to a file using print() instead of write()
outfile = open(r"files\out_demo.txt", "w")
print("Writing this output via print()", file=outfile)
outfile.close()

Integer: 7, Float: 8.489890, Float with 1 decimal: 3.1, String: :)
4 5.5 :P 8
A comma, separated, list
A string without the added newline characterThis sentence runs into the previous!
An alternative form of placeholders 10.0. You can also use keywords cpts215


## Practice Problem
For the following problems, we will need to download a file: [words.txt](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/words.txt). This file contains 113,809 official crossword words, one per line. Using words.txt, write a program with the following functionality:
1. A function called `open_input_file()` that opens words.txt for reading and returns the file object associated with words.txt
1. A function called `close_file()` that accepts the file object as an argument and closes the file
1. A function called `first_five_words()` that displays the first 5 words of the file. Try to display the words one on each line, without an extra newline between the words like:
```
aa
aah
aahed
aahing
aahs
```
Hint: read the [Python input/output tutorial](https://docs.python.org/3/tutorial/inputoutput.html) for more info about how to do this with `print()`.

In [20]:
def open_input_file(fname):
    '''
    Opens the file named fname for reading and returns the file object.
    '''
    in_file = open(fname, "r")
    return in_file

def close_file(file_2_close):
    '''
    Closes the file object file_2_close.
    '''
    file_2_close.close()
    
def first_five_words(in_file):
    '''
    Displays the first 5 words in in_file, one per line.
    '''
    for i in range(5):
        print(in_file.readline(), end="")

def main():
    '''
    Opens words.txt, displays its first 5 words, and closes the file.
    '''
    fin = open_input_file(r"files\words.txt")
    first_five_words(fin)
    close_file(fin)
    
main()

aa
aah
aahed
aahing
aahs


## Strings
A long time ago, we learned a string is a *sequence of characters*. We learned how to use strings in print statements, how to assign a string to a variable, and how to typecast to a string using `str()`.

In [6]:
print("PYTHON")
my_string = "PYTHON"
print(my_string)
print("%s %s" %(my_string, str(100.0)))

PYTHON
PYTHON
PYTHON 100.0


### String Concatenation
We also learned how to make new strings using string concatenation (the `+` operator): `new_string = old_string1 + old_string2`:

In [7]:
new_string = "cpts" + "215" + " uses " + my_string + "!"
print(new_string)

cpts215 uses PYTHON!


### Strings and `for` Loops
More recently, we learned about sequences in the context of `for` loops:

```
for <item> in <sequence>:
    <body>
```

We can also use `for` loops to iterate through each character in a string:

In [8]:
for character in "PYTHON":
    print(character, end=" ")

P Y T H O N 

## String Indexing
Logically, the string `"PYTHON"` is organized as follows:

|Index:|0|1|2|3|4|5|
|-|-|-|-|-|-|-|
|Character:|P|Y|T|H|O|N|

In general, a string that has `n` characters has valid indices of 0, 1, ..., `n` - 1.

We can access a single character in the string using indexing notation: `[<index>]` (square brackets):

In [10]:
my_string = "PYTHON"

print(my_string[0])
print(my_string[5])
print(my_string[-1])

P
N
N
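
Indexing past the last valid index raises an `IndexError`. The sketch below catches the error with `try`/`except`, which this lesson does not cover, only so that the example keeps running and can report what happened:

```python
my_string = "PYTHON"

# the last valid index is one less than the number of characters
last_index = len(my_string) - 1
last_char = my_string[last_index]
print(last_index, last_char)

# index 6 is one past the end of a 6 character string
caught_error = False
try:
    my_string[len(my_string)]
except IndexError as error:
    caught_error = True
    print("IndexError:", error)
```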


## String Length
We can find out the length of a string, meaning the number of characters in the string, by calling the built-in `len(<string>)` function:

In [11]:
length = len(my_string)
print("The length of %s is: %d" %(my_string, length))

The length of PYTHON is: 6


## String Slicing
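We can use the `:` operator to select a *slice* of a string: `<string variable>[start_index:end_index + 1]`. The slice begins at `start_index` and stops just before the number written after the colon, so to include the character at `end_index` we write `end_index + 1`.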

For example:

In [12]:
# Get the YT of PYTHON
print(my_string[1:3])

course = "CptS215"

# get a slice of the 215
course_num = course[4:7]
print(course_num)

YT
215


Omitting the start index implies a 0 for a start index and omitting an end index implies a `len(<string variable>)` for an end index:

In [9]:
# these two are the same
print(my_string[0:2])
print(my_string[:2])

# these two are the same
print(my_string[2:len(my_string)])
print(my_string[2:])

PY
PY
THON
THON


Note: We can also use negative indices! The last index of a string is -1, the second to last index is -2, and so on, until the first index is `-len(<string variable>)`

|Index:|0|1|2|3|4|5|
|-|-|-|-|-|-|-|
|Character:|P|Y|T|H|O|N|
|Index:|-6|-5|-4|-3|-2|-1|

In [10]:
print(my_string[-1])
print(my_string[-6])

N
P
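
Negative indices also work inside slices, which gives a convenient way to count from the end of a string. A brief sketch using the same `"PYTHON"` string:

```python
my_string = "PYTHON"

last_three = my_string[-3:]     # from the third to last character to the end
all_but_last = my_string[:-1]   # everything except the last character
middle = my_string[1:-1]        # drops the first and the last character

print(last_three)
print(all_but_last)
print(middle)

# index -6 and index 0 refer to the same character
same_char = my_string[-6] == my_string[0]
print(same_char)
```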


## Immutability of Strings
Strings are *immutable*, meaning they can't be changed. This means we cannot re-assign a character of a string:

In [13]:
# crashes because strings are immutable
my_string[0] = "p"

TypeError: 'str' object does not support item assignment

To "change" a string (remember strings are immutable), we can make a new string that is a variant of the old string:

In [14]:
new_string = "p" + my_string[1:]
print(new_string)

pYTHON
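
The same idea works for a character at any position. The function below, `replace_char_at()`, is a hypothetical helper written for this illustration (it is not a built-in): it builds a new string from the slice before the index, the new character, and the slice after the index.

```python
def replace_char_at(text, index, new_char):
    '''
    Hypothetical helper: returns a new string equal to text, except that
    the character at index is replaced with new_char.
    '''
    return text[:index] + new_char + text[index + 1:]

my_string = "PYTHON"
new_string = replace_char_at(my_string, 2, "t")

print(new_string)
# the original string is unchanged because strings are immutable
print(my_string)
```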


## String Comparison
Often we need to compare two strings in order to determine if the strings are equal or not (e.g. a 20 questions game or a hangman game). We can use string comparison operators to do this:
* == (equality)
* != (not equal)
* < > (less than or greater than)

Note: String comparisons are performed by comparing character [Unicode values](http://dev.networkerror.org/utf8/?start=33&end=133&cols=4&search=&show_uni_int=on&show_uni_hex=on&show_html_ent=on&show_raw_hex=on&show_raw_bin=on), index by index.

In [15]:
print("cpts" == "cpts")
print("cpts" == "CptS")
print("cpts" < "CptS")
print("cpts" > "CptS")

True
False
False
True
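
The built-in function `ord()` returns the Unicode value of a single character, which helps explain the results above: lowercase letters have larger values than uppercase letters. The sketch also shows one common way to compare strings while ignoring case, by converting both to lowercase first:

```python
# Unicode values of the first characters being compared
lower_c = ord("c")
upper_c = ord("C")
print(lower_c, upper_c)

# "c" has a larger value than "C", so "cpts" is greater than "CptS"
first_char_decides = "cpts" > "CptS"
print(first_char_decides)

# compare without regard to case by converting both strings to lowercase
same_ignoring_case = "cpts".lower() == "CptS".lower()
print(same_ignoring_case)

# when one string is a prefix of the other, the shorter string is smaller
prefix_is_smaller = "apple" < "apples"
print(prefix_is_smaller)
```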


## String Methods
A function that is associated with an object is called a *method*. To call a method, we use the form `<object>.<method name>()`. We have been using methods when we interact with file objects. Recall:

```
file_object = open("file.txt", "r")
line = file_object.readline()
file_object.close()
```

There are several string methods that provide useful string operations. To use a string object method, we use the form `<string variable>.<method name>()`. For example, we have been using the string method `strip()`:

In [16]:
my_string = "           python           "
my_string = my_string.strip()
print(my_string)

python


Useful string methods include (but are not limited to):
* `upper()`: returns the string in uppercase
* `lower()`: returns the string in lowercase
* `find(<character to find>)`: returns the index of the first occurrence of `<character to find>` if it is in the string, or -1 if it is not (the argument can also be a longer substring)
* `replace(<substring to replace>, <string to replace with>)`: returns a string with all occurrences of `<substring to replace>` replaced with `<string to replace with>`
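
A short sketch of these four methods in action, using a made-up example string. Because strings are immutable, each method returns a new string (or an index) and leaves the original unchanged:

```python
course = "CptS215 uses Python"

upper_course = course.upper()
lower_course = course.lower()
index_of_2 = course.find("2")     # index of the first "2"
index_of_z = course.find("z")     # -1 because "z" is not in the string
replaced = course.replace("Python", "PYTHON")

print(upper_course)
print(lower_course)
print(index_of_2, index_of_z)
print(replaced)
# the original string is unchanged
print(course)
```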

Read more about string methods in the [Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).