##### 28 Oct 2019

# Reading Data from Files

#### Reading:  _PCfB_ Ch 10 

#### Today's Topics: 

* recap: text files
* File objects
* File methods
  * `read`
  * `readline`
* Useful string methods
* CSV files
* Type conversion

#### Data Files

If you want to execute the examples in this notebook, download these files from the Bi 410 server:
```
quote.txt
species.csv
checking_account.csv
```

## Aside: Shell Window in JupyterLab

As we look at how Python interacts with the file system it's helpful to have a terminal session running at the same time

One idea is to open a second terminal emulator and `cd` to the same directory as your notebook

We can also run shell commands directly from Jupyter

### Shell Commands in Jupyter Notebooks 

If a code cell begins with an exclamation mark, Jupyter treats it as a shell command
* when the code cell is executed, the contents are sent to the host OS instead of the Python kernel

If you are running Jupyter in the Bi 410 Docker container:
* the underlying OS is Linux, and the commands that follow the exclamation mark should be `bash` commands
* the working directory is the directory where Jupyter found the notebook

In [1]:
! pwd

/home/jovyan/Bi410


In [2]:
! ls *.csv

amino_acid_table.csv  checking_account.csv  Singh2015.csv


### Shell Panel in JupyterLab 

JupyterLab also lets us run a shell session in the browser
* open a new Launcher (click the + button below the main menu)
* click on Terminal
* click on the original notebook, then drag the terminal tab over the notebook
* you can place the terminal anywhere you want (below the notebook, to the right, ...) and resize it after you place it

## Review: Text Files 

Most of the data we work with in Bioinformatics are in **plain text** files (as opposed to **binary** files or structured documents)

Some common formats and filename extensions:
* unstructured text, e.g. English paragraphs (`.txt`)
* HTML documents (`.html`)
* Jupyter notebooks (`.ipynb`)
* DNA and protein sequences (`.fasta`, `.fa`)
* short sequence reads (`.fastq`, `.fq`)
* comma-separated values (`.csv`)
* tab-separated values (`.tsv`)

### What's in a File?

There are several shell commands that will tell us what's in a file
* `ls -l` (list the files in long form) prints the file size
* `wc` (word count) tells us how many lines, words, and characters are in a file
* `file` will print a description of the file contents

And if the file is short enough, we can just use `cat` to print the entire contents

#### Examples

```
$ file quote.txt
quote.txt: ASCII text
```

```
$ wc quote.txt
       4      35     188 quote.txt
```

```
$ cat quote.txt
If you have no confidence in self,
  you are twice defeated in the race of life.
With confidence, you have won even before you have started.
    -- Marcus Tullius Cicero (106 BC -- 43 BC)
```

## Working with Files in Python

Working with a file in Python is analogous to using a book in real life
* call a function to **open** the file
* use methods to **read** or **write** data
* when we're done we have to **close** the file
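
The three steps above can be sketched in a few lines. This is only an illustration under assumptions: the file name `demo.txt` and its contents are made up, and the sketch first creates the file in a scratch directory so it can run anywhere.

```python
import os
import tempfile

# Create a small text file to work with (hypothetical name and contents).
folder = tempfile.mkdtemp()            # a scratch directory
path = os.path.join(folder, 'demo.txt')
out = open(path, 'w')                  # 'w' opens the file for writing
out.write('hello\n')
out.close()

f = open(path)                         # 1. open the file (reading is the default)
text = f.read()                        # 2. read the data
f.close()                              # 3. close the file

print(repr(text))
```

The same open, use, close pattern applies whether the file is being read or written.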

### Paths 

Files are identified by a **path**

In a Python program, a path is a string
* a name by itself refers to a file in the current directory
* the string can contain dots and slashes

```
'quote.txt'
'../../2_python/lectures/images/ipython_kernel.jpg'
```

Paths like these, which do not begin with a slash, are relative to the directory where the Python program was started
* when we're working in Jupyter, paths are relative to the directory that contains the notebook
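
One way to check how a relative path will be interpreted is to ask Python for the current working directory. The sketch below uses `os.getcwd` and `os.path.abspath` from the standard library; `quote.txt` is used only as an example name, and the file does not have to exist for these calls to work.

```python
import os

# The current working directory: relative paths start from here.
cwd = os.getcwd()
print(cwd)

# abspath shows the full path that a relative path refers to.
full = os.path.abspath('quote.txt')
print(full)

# '..' means "the parent directory".
parent = os.path.abspath('../quote.txt')
print(parent)
```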

## File Objects 

To open a file, call the built-in function named `open`
* pass it a string representing the path to the file
* get back a **File object**

Like other objects in Python (strings, lists, etc), we carry out operations with a file by calling methods

### `read` 

The simplest method is `read`
* it returns the entire contents of the file as one long string

This snippet shows how to open the file named `quote.txt`, fetch the contents with a call to `read`, and save the result as a string named `q`
* if you use `ls -l` or `wc` you'll see this file has 188 characters
* we should get them all when we call `read`

In [5]:
f = open('quote.txt')
q = f.read()


In [4]:
type(f)


Let's take a look at what's in `q`:

In [None]:
len(q)

In [None]:
type(q)

In [None]:
q[0:20]

### Aside: Newline Characters 

If we print the string `q` we'll see the entire text, and it will look just like it does in a text editor
* you can open the file with Notepad++ (Windows) or BBEdit (Mac)
* you can also edit it with Jupyter

In [None]:
print(q)

But if we just ask Python to show us the value of `q` we'll see something different:

In [None]:
q

Look carefully at the string `q` and you will see four places where Python printed `\n`

These are **newline** characters
* they indicate where lines end
* a newline is a single character with its own numeric code (10 in ASCII)

Python displays a newline as the two-character escape sequence `\n` because the character has no visible symbol of its own
* there are several other characters like this
* tab, return, delete, bell, ...
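
A short sketch of the characters just listed. The string `specials` is made up for this illustration; the escape sequences themselves (`\n`, `\t`, `\r`, `\a`, `\x7f`) are standard Python.

```python
specials = '\n\t\r\a\x7f'    # newline, tab, return, bell, delete

codes = []
for ch in specials:
    # each escape sequence is one character; ord gives its numeric code
    print(repr(ch), len(ch), ord(ch))
    codes.append(ord(ch))
```

Each pass through the loop prints a length of 1, even though the escape sequence takes two or more characters to type.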

In [None]:
s = 'Hello\nworld'

In [None]:
s

In [None]:
print(s)

Even though Python prints `\n` to show where the character is, it is still just a single character:

In [None]:
s

In [None]:
len(s)

In [None]:
s[4]

In [None]:
s[5]

In [None]:
s[6]

### Closing a File 

Once we're done reading from a file we should close it

In [None]:
f.close()

#### What Happens if We Forget to Close a File? 

Usually nothing.  The kernel will automatically close any open files when it shuts down.

Lots of open files clutter up the kernel, and it's considered "best practice" to close files when we're done with them

Also, forgetting to close a file we're writing to can be a problem
* if the program crashes or we restart the kernel when an output file is still open the data might be lost
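
Here is a small sketch of writing to a file and closing it. The file name `results.txt` is hypothetical, and the file is created in a scratch directory. Closing the file makes sure any output Python is still holding in memory is actually written to disk.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                   # a scratch directory
path = os.path.join(folder, 'results.txt')    # hypothetical output file

out = open(path, 'w')      # 'w' opens the file for writing
out.write('line one\n')
out.write('line two\n')
print(out.closed)          # False: the file is still open
out.close()                # any buffered output is saved to the file
print(out.closed)          # True

f = open(path)
saved = f.read()
f.close()
print(saved)
```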

## Reading One Line at a Time

In most applications we don't need the entire contents of a file all at once

It's more common to read the file one line at a time

With very large data files this can be much more efficient

### `readline` 

To get a single line from a file call a method named `readline`

In [None]:
f = open('quote.txt')
s = f.readline()

In [None]:
len(s)

In [None]:
s

Note that the newline character is included in the line

In [None]:
s[-1]

### Files Are Read Sequentially

Each time we call `readline` we'll get the next line in the file

In [None]:
f.readline()

Execute this code cell several times -- you should see each successive line, and eventually an empty string:

In [None]:
f.readline()

### Empty String 

When there are no more lines to read the method returns an empty string
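
A blank line in the middle of a file is not the same as the end of the file. The sketch below, which uses a made-up three-line file, shows that a blank line comes back as `'\n'` while the end of the file comes back as `''`.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                      # a scratch directory
path = os.path.join(folder, 'blank_line.txt')    # hypothetical file name
out = open(path, 'w')
out.write('first\n\nthird\n')                    # the second line is blank
out.close()

f = open(path)
lines = []
for i in range(5):               # ask for more lines than the file has
    lines.append(f.readline())
f.close()

print(lines)   # ['first\n', '\n', 'third\n', '', '']
```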

One way (but not the best way) to read all the lines one at a time:

In [None]:
f = open('quote.txt')
s = f.readline()
n = 1
while len(s) > 0:
    print('[', n, ']: ', s)
    s = f.readline()
    n += 1
f.close()

#### Spacing 

Did you notice how there is extra space between the lines in the output area of the previous code cell?  Can you explain why?  

Below we'll see where this space comes from and how to deal with it.

## Iterating Over a File 

A _much_ better way to read lines one at a time is with a `for` loop
* we prefer `for` loops because they're shorter and less error-prone

Python makes it easy for us:
```
for s in open(fn):
   ...
```

The call to `open` returns a file object, and the `for` loop iterates over the object just like it iterates over a list:
* read the first line, save it in `s`
* execute the body of the loop
* go back to the top of the loop, get the next line
* exit the loop when there are no more lines to read

In [None]:
n = 1
for line in open('quote.txt'):
    print('[', n, ']: ', line)
    n += 1

Here's another loop, this time to count the number of characters
* the value printed should be the same number shown by `ls -l` or `wc`

In [None]:
count = 0
for line in open('quote.txt'):
    count += len(line)
print('total number of chars:', count)

### File Not Closed 

A problem with that `for` loop is the file is opened but not closed
* we'll see how to fix this problem in the next notebook, by using a `with` statement
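
As a brief preview, here is a rough sketch of what the `with` version might look like. The file name and contents are made up; the point is only that the file is closed automatically when the indented block ends.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                # a scratch directory
path = os.path.join(folder, 'demo.txt')    # hypothetical file name
out = open(path, 'w')
out.write('one\ntwo\nthree\n')
out.close()

count = 0
with open(path) as f:          # f is closed automatically after the block
    for line in f:
        count += len(line)

print('total number of chars:', count)
print('closed?', f.closed)
```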

## Working with Lines 

After we read a line we'll want to do something with it

### `strip` 

The first step is often to remove the newline from the end

A string method named `strip` does the job
* it removes "whitespace" (non-printing) characters from both ends of a string
* whitespace includes spaces, tabs, and newlines

In [None]:
q = '  ab cd  '

In [None]:
q.strip()

In [None]:
for line in open('quote.txt'):
    s = line.strip()
    print(s)

Compare this output with the result of the previous loop
* newlines have been removed from the ends, so the extra space problem is solved
* it also removed the space from the fronts of lines (e.g. line 2)

There are also methods named `lstrip` and `rstrip` if you just want to remove whitespace from the left or right side
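
To see the difference between the three methods, here is a sketch that applies each one to the second line of `quote.txt` (typed in by hand here, so no file is needed). Using `repr` makes the leading spaces and the newline visible.

```python
line = '  you are twice defeated in the race of life.\n'

print(repr(line.strip()))    # whitespace removed from both ends
print(repr(line.lstrip()))   # only the leading spaces removed
print(repr(line.rstrip()))   # only the trailing newline removed
```

`rstrip` is a reasonable choice when the indentation at the front of a line should be preserved.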

### `split` 

Another useful method is `split`
* a call to `split` will break a line into smaller pieces
* the return value is a list of smaller strings
* the original string is not affected (strings are immutable)

By default the string is split according to word breaks -- all whitespace characters are removed, and the result is a list of words

In [None]:
s = "fee fie foe fum"

In [None]:
s.split()

In [None]:
t = 'One, two,    three,      go!'

In [None]:
t.split()
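
The next sketch checks two of the claims above with a made-up string: any mix of spaces, tabs, and newlines counts as a word break, and the original string is unchanged after the call.

```python
s = 'fee  fie\tfoe\nfum\n'    # two spaces, a tab, and newlines between words

words = s.split()             # split on runs of whitespace
print(words)                  # ['fee', 'fie', 'foe', 'fum']
print(repr(s))                # the original string is unchanged
```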

## CSV Files 

A common data format:  comma-separated values (CSV)
* plain text files
* can be opened and edited with a text editor
* common filename extension is `.csv`

There is one line (_aka_ **record**) for each piece of data

Every line has a sequence of values (_aka_ **fields**) separated by commas

### Example 

A file named `species.csv` has records that describe species of bacteria
* the first field is a species ID
* the second is the genus name
* the third is the species name

Here are the first 5 lines in the file:
```
$ head -5 species.csv

577,Persephonella,marina
2812,Bifidobacterium,longum
5012,Salmonella,enterica
1405,Bordetella,holmesii
2147,Methylocella,silvestris
```

### Assumptions 

CSV files are usually generated automatically 
* programs like BLAST usually have options for printing results in CSV form
* spreadsheets and databases export data in CSV format

It's usually safe to assume
* one line per record
* the same number of fields on each line
* no spaces or extra characters surrounding the commas

For our projects we'll keep things simple and skip the error checking

For "production" programs we'll want to include statements that test for and handle formatting errors (more on this later in the term)
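
As a rough idea of what such a test could look like, the sketch below counts the fields on each line and reports any record that does not have the expected number. Everything here is an assumption made for illustration: the file `species_demo.csv`, its deliberately broken second line, and the variable names. Calling `split(',')` breaks a line into pieces at the commas.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                          # a scratch directory
path = os.path.join(folder, 'species_demo.csv')      # hypothetical file
out = open(path, 'w')
out.write('577,Persephonella,marina\n')
out.write('2812,Bifidobacterium\n')                  # one field is missing
out.write('5012,Salmonella,enterica\n')
out.close()

expected = 3          # assumed number of fields per record
bad = []              # line numbers of records that fail the check
n = 0
f = open(path)
for line in f:
    n += 1
    fields = line.strip().split(',')
    if len(fields) != expected:
        bad.append(n)
        print('line', n, 'has', len(fields), 'fields:', fields)
f.close()
```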

### Splitting CSV Lines 

It's easy to break an input line into separate items:  just pass a comma as an argument to `split`

In [None]:
for line in open('species.csv'):
    recs = line.split(',')
    print(len(recs), recs)

Oops, forgot to strip the newlines -- notice how the last item in each line still has a `\n`

Instead of adding an extra statement to this loop, a common technique is to "chain" the calls to `strip` and `split`:

In [None]:
for line in open('species.csv'):
    recs = line.strip().split(',')
    print(recs)

Do you see why this works?
* `line.strip()` returns a string
* we can call the `split` method on this string
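
To make the chaining explicit, this sketch does the work both ways on one record from `species.csv` (typed in as a string, so no file is needed) and compares the results. The variable names are made up for the illustration.

```python
line = '577,Persephonella,marina\n'

# Two separate steps ...
stripped = line.strip()
two_step = stripped.split(',')

# ... and the same thing as one chained expression
chained = line.strip().split(',')

print(two_step)
print(chained)
```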

## Type Conversion 

An important thing to remember about lines from a text file:
* all data is in the form of strings
* things that look like numbers are really sequences of digits

Note the difference between these two assignment statements:

In [None]:
n = 123

In [None]:
n * 3

In [None]:
n = '123'

In [None]:
n * 3

The first defines `n` to be an integer, the second defines it to be a sequence of characters

When we call `split` to separate out the fields of a record in a CSV file we often need to convert strings of digits into numbers.

### Example:  Bank Transactions 

Banks and credit card companies often allow customers to download records in CSV format
* spreadsheets and other applications import CSV

Here are the lines in the file named `checking_account.csv`:
```
$ cat checking_account.csv

12/02/2018,REGISTER GUARD,11.96
12/03/2018,APL* ITUNES.COM/BILL,0.99
12/04/2018,CHEVRON 0204468,35.52
12/08/2018,MARKET OF CHOICE #9,11.21
12/09/2018,FOOD FOR LANE COUNTY,5.00
12/14/2018,CHEVRON 0204468,16.25
12/21/2018,MARKET OF CHOICE #9,38.76
12/28/2018,MARKET OF CHOICE #9,18.78
12/31/2018,KING ESTATE WINERY,44.40
```

We want this loop to compute the sum of all the transactions:

In [None]:
total = 0.0

for line in open('checking_account.csv'):
    rec = line.strip().split(',')
    print(rec)
    total += rec[2]                         # <--- bug
    
print('Total payments:', total)

Running this cell raises a `TypeError` on the first pass through the loop.  What does that error message tell us?  Can you see where the error is?

#### Add a call to `print` 

If you're not sure where the problem is, try adding a `print` statement
* a good choice would be to print the value of `rec` so we can see if the split worked the way we expected

So it looks like `split` did the right thing -- the problem is that the amount in `rec[2]` is a string.
* verify that by adding `print(type(rec[2]))` to the loop

### `int` and `float` 

The two type names for numbers are also the names of functions
* call `int(s)` to convert the string of digits `s` into an integer
* call `float(s)` to convert `s` into a floating point number

In [None]:
s = '123'

In [None]:
n = int(s)

In [None]:
n

In [None]:
n / 4

In [None]:
int(s) * 2
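
The choice between `int` and `float` matters when the string contains a decimal point. In the sketch below (the values are taken from the examples in this notebook, the variable `failed` is made up), `float` accepts both kinds of string, but `int` raises a `ValueError` when given `'11.96'`.

```python
print(int('42'))         # 42
print(float('42'))       # 42.0
print(float('11.96'))    # 11.96

# int() expects the text of a whole number, so a decimal point is an error
failed = False
try:
    int('11.96')
except ValueError as err:
    failed = True
    print('cannot convert:', err)
```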

Here's the loop from above, this time converting the third field (which is in `rec[2]`) to a float before adding it to `total`:

In [None]:
total = 0.0

for line in open('checking_account.csv'):
    rec = line.strip().split(',')
    print(rec)
    total += float(rec[2])         #  <--- the bug has been fixed
    
print('Total payments:', total)
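
Floating point arithmetic is not exact, so a sum like this one may be displayed with extra digits. One possible way to tidy up the output is to format the total with two decimal places. In this sketch the list `amounts` holds the third field of each record in `checking_account.csv`, typed in by hand so the code runs without the file.

```python
amounts = ['11.96', '0.99', '35.52', '11.21', '5.00',
           '16.25', '38.76', '18.78', '44.40']   # third fields from the file

total = 0.0
for a in amounts:
    total += float(a)

print('Total payments:', total)                    # may show rounding error
print('Total payments: {:.2f}'.format(total))      # two decimal places
```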