# File IO in Python

We'll have a quick look at reading data in from, and writing data out to, files. The *UsingBokeh* notebook demonstrates how to do this with the *Pandas* module, which is often, but not always, more convenient for subsequent data processing.

Here we'll restrict ourselves to pure Python.

We'll also cover some of the more common actions that are taken when processing file input:

* Recognising 'comment' and blank lines
* Removing unwanted 'whitespace' at the ends of lines
* Splitting 'delimited' lines into lists of words or numbers

Generally with file handling you can either read the file in as a whole, usually storing the lines in a list for later processing, or read it in a line at a time and process each line as you go along. We'll look at both.

## 1. Opening and closing a file 

We first need to 'open' the file. This returns a Python 'object' that we can iterate (or loop) through in a similar way to a list. The `open` function is usually called with 2 parameters. The first, pretty obviously, is a file name (don't forget the file extension if it has one). The second indicates what 'mode' we want to open the file in. We use "r" if we just want to read from it, "w" if we only want to create/write to it (overwriting anything that might be in it if it exists), "r+" to both read and write, and "a" if we want to append stuff to the end of it. If you don't include a mode, "r" is assumed.

> For the techies amongst you, by default, the file is opened in 'text' mode. If you add a 'b' to the mode letter it is opened in binary mode - which can be useful if, for example, you need to look at the individual bytes of data making up a JPEG or EXE file.  The line ends are indicated with special characters - `\n` (Unix/Linux and Mac) or `\r\n` (Windows) depending on what platform we are using. In text mode Python converts these to and from `\n` for you, so this isn't particularly important for text files. The same conversion would corrupt binary data, which is why binary mode leaves the bytes untouched.
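
As a minimal sketch of the difference between the two modes, the cell below writes a short line in text mode and reads it back in binary mode. The file name `mode_demo.txt` and the use of a temporary directory are assumptions made for illustration, so that none of your own files are touched.

```python
import os
import tempfile

# Work in a throwaway directory (an assumption for this sketch).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mode_demo.txt')  # hypothetical file name

    # Text mode: we write a str, and '\n' is translated to the
    # platform's own line ending on the way out.
    with open(path, mode='w') as text_file:
        text_file.write('abc\n')

    # Binary mode: we get the raw bytes back, with no translation.
    with open(path, mode='rb') as binary_file:
        raw = binary_file.read()

print(type(raw))
print(raw)  # b'abc\n' on Unix/Linux and Mac, b'abc\r\n' on Windows
```

Note that binary mode hands back a `bytes` object rather than a string.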

So, a very simple example of file IO  using Python is:

In [1]:
my_file = open('demo_file.txt', 'r')
print(type(my_file))
# Here is where you could do more stuff with the file's contents.
my_file.close()

<class '_io.TextIOWrapper'>


The `open` function returns the file object **`my_file`** which we can use to read data from the file. If you open a file for reading and it does not exist, you will get a `FileNotFoundError`. As a matter of good practice, after processing the file you should *close* it by using (in this case) **`my_file.close()`**.
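
If you would rather deal with a missing file yourself than let the program stop, one possible approach is to catch the error with `try`/`except`. This is only a sketch: the file name `no_such_file.txt` is hypothetical and is assumed not to exist in the current directory.

```python
# 'no_such_file.txt' is a hypothetical name, assumed not to exist.
file_name = 'no_such_file.txt'

try:
    my_file = open(file_name, 'r')
except FileNotFoundError:
    # We end up here if the file could not be found.
    found = False
    print('Could not find', file_name)
else:
    # We only get here if open() succeeded.
    found = True
    print('Opened', file_name)
    my_file.close()
```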

To make sure you don't inadvertently leave the file open you can use the `with` statement form, which automatically closes the file after processing the indented block following it:


In [2]:
with open('demo_file.txt', mode='r') as my_file:
    # Do some stuff here
    print(type(my_file))
    
# demo_file.txt is now closed.

<class '_io.TextIOWrapper'>
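
One way to convince yourself that `with` really does close the file is to look at the file object's `closed` attribute inside and after the block. The sketch below assumes a hypothetical file called `closed_demo.txt`, created in a temporary directory so that nothing of yours is overwritten.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'closed_demo.txt')  # hypothetical file name

    with open(path, mode='w') as my_file:
        my_file.write('some text\n')
        inside = my_file.closed   # False: still inside the 'with' block

    outside = my_file.closed      # True: 'with' has closed the file for us

print('Closed inside the block?', inside)
print('Closed after the block? ', outside)
```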


## 2. Reading and Writing

How to write data to files and extract the data that files contain.

### 2.1 Writing (and appending)

We'll look at writing to a file now. 

Let's assume we have a list of lines of text we want to store more permanently for future use. A very simple way to accomplish this task is:


In [3]:
my_data = ['The sky is so blue.', 
          'The sun is so warm up high.', 
          'I love the summer.']

with open('haiku_1.txt', mode='w') as haiku_file:
    haiku_file.writelines(my_data)
    # Note we don't need haiku_file.close() because we are using 'with'


This code creates a file called `haiku_1.txt` in the current directory, which contains a single line: 

    The sky is so blue.The sun is so warm up high.I love the summer. 

This probably wasn't what was intended. `writelines()` does not add line endings, so the list items have been concatenated into a single line. We need to add an 'end of line' character (`\n`) onto the end of each line. A couple of ways we could do this (using the same data) are:


In [4]:
with open('haiku_1.txt', mode='w') as haiku_file:
    for line in my_data:
        haiku_file.write(line+'\n')


Which writes a file containing:
    
    The sky is so blue.
    The sun is so warm up high.
    I love the summer.

Note that this approach also adds a `\n` to the final line. Alternatively, we could use:


In [5]:
with open('haiku_1.txt', mode='w') as haiku_file:
    haiku_file.write('\n'.join(my_data))


In this example the **`join`** method joins all the items in the list together with a `\n` between them. The result is the same, but without the final `\n`.
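
All of the examples so far use mode "w", which starts the file afresh each time. A minimal sketch of appending with mode "a" is shown below. The file name `haiku_append.txt` and the temporary directory are assumptions for illustration only.

```python
import os
import tempfile

first_lines = ['The sky is so blue.', 'The sun is so warm up high.']
extra_line = 'I love the summer.'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'haiku_append.txt')  # hypothetical file name

    # 'w' creates the file (or empties it if it already exists).
    with open(path, mode='w') as haiku_file:
        for line in first_lines:
            haiku_file.write(line + '\n')

    # 'a' keeps what is already there and adds to the end.
    with open(path, mode='a') as haiku_file:
        haiku_file.write(extra_line + '\n')

    # Read it back to check the result.
    with open(path, mode='r') as haiku_file:
        contents = haiku_file.read()

print(contents)
```

Because the first two lines were written with a trailing `\n`, the appended line starts on a new line of its own. Had the file ended without one, the new text would have been glued onto the end of the last line.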



### 2.2 Reading

The code for reading from a file looks very similar to that used for writing. We'll look first at reading the whole file in one go.


In [6]:
with open('haiku_1.txt', mode='r') as haiku_file:
    my_file = haiku_file.read()
    print(my_file)

The sky is so blue.
The sun is so warm up high.
I love the summer.



But notice that this has read the file in as a single string (the `\n` characters just format the output by inserting line breaks). What if we want the lines as separate elements of a list? We can use the **`readlines()`** method of the `file` object (`haiku_file`, in this case) to do this.


In [7]:
with open('haiku_1.txt', mode='r') as haiku_file:
    my_file = haiku_file.readlines()
    print(my_file)

['The sky is so blue.\n', 'The sun is so warm up high.\n', 'I love the summer.']



This is fine - except that the `\n` is still attached to every line that ended with one. If this is a problem, then we can read the whole file and then break it apart using the **`splitlines()`** method:


In [8]:
with open('haiku_1.txt', mode='r') as haiku_file:
    my_file = haiku_file.read().splitlines()
    print(my_file)

['The sky is so blue.', 'The sun is so warm up high.', 'I love the summer.']



Finally, how would we process the file one line at a time? In fact, it is possible to _iterate_ over a `file` object. If we do that, we retrieve the next line from the file on each iteration.


In [9]:
with open('haiku_1.txt', mode='r') as haiku_file:
    count = 1
    for line in haiku_file:
        print('Line',count,'is:', line)
        count += 1

Line 1 is: The sky is so blue.

Line 2 is: The sun is so warm up high.

Line 3 is: I love the summer.
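
The blank lines in the output above appear because each `line` still carries its own `\n`, and `print` then adds another. As a hedged sketch of one way to tidy this up, the cell below removes the line ending with `rstrip('\n')` and lets the built-in `enumerate()` do the counting. The file name `haiku_enumerate.txt` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'haiku_enumerate.txt')  # hypothetical file name
    with open(path, mode='w') as haiku_file:
        haiku_file.write('The sky is so blue.\n'
                         'The sun is so warm up high.\n'
                         'I love the summer.')

    numbered = []
    with open(path, mode='r') as haiku_file:
        # enumerate() does the counting; start=1 makes it begin at 1, not 0.
        for count, line in enumerate(haiku_file, start=1):
            text = 'Line ' + str(count) + ' is: ' + line.rstrip('\n')
            numbered.append(text)
            print(text)
```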


### Exercise 2.2

Create and write a file containing a short poem of your choice. 

Read it back in, print it out, line by line, and count the number of lines in it.


In [10]:
# Write your code here...

## 3. Processing the lines

What can we do with the data once we have it?


We've provided a file with some test lines in it called `demo_file.txt`. It contains this:

```
# This is a demo file
# With some comment lines indicated by a '#', blank lines, text lines,
# numbers and a couple of 'comma delimited' lines

The boy stood on the burning deck.
How now brown cow.
 The quick brown fox jumped over the lazy dogs

# Some numbers
3.14159
2.71828

# Some 'comma' delimited lines
1,2,3,4,5,6,7,8,9,10
1,4,9,16,25,36,49,64,81,100

The last line
```

There are some comment lines marked with a `#` and blank lines that we want to ignore. But notice one line begins with a space - this is probably an error and we may want to just remove it. At some point we will want to extract the 2 comma delimited lines and split each into a list of numbers.

This file structure implies that we need to look at each line in turn and treat them differently, so we'll go with the line-by-line processing approach. 

For a short file like this, we could equally well just read the whole file into a list and process each line in the list. However, for very large files (like astronomical catalogues) it is often better to take the line-by-line approach that we'll show here. We'll start simply by reading in each line, checking it isn't a comment or blank line and appending it to a working list.


In [11]:
lines=[]
with open('demo_file.txt', mode='r') as f:
    for lin in f:
        if lin[0] == '#' or lin == '':
            continue # ignore comment and empty lines
        else:
            lines.append(lin)

print(lines)
        

['\n', 'The boy stood on the burning deck.\n', 'How now brown cow.\n', ' The quick brown fox jumped over the lazy dogs\n', '\n', '3.14159\n', '2.71828\n', '\n', '1,2,3,4,5,6,7,8,9,10\n', '1,4,9,16,25,36,49,64,81,100\n', '\n', 'The last line']



Well that's a good start but the blank lines are still there: a blank line read from a file is `'\n'`, not `''`, so the `lin == ''` test never matches. We've also still got those pesky `\n` end-of-lines, and note the space before the `' The quick brown fox ...'` line. Luckily we have an ideal method available, **`strip()`**, which removes all 'whitespace' (spaces, `\n`, etc.) from both ends of a line. 
> There are also two other methods, `lstrip()` and `rstrip()`, which remove whitespace from the left and right ends of the line, respectively.
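
As a quick sketch of how the three differ, the cell below applies each one to the same made-up string and uses `repr()` so that the quotes and any remaining `\n` are visible.

```python
line = '  an irritating line   \n'

left = line.lstrip()    # whitespace removed from the left end only
right = line.rstrip()   # whitespace removed from the right end only
both = line.strip()     # whitespace removed from both ends

# repr() shows the quotes and any '\n', so the differences are visible.
print(repr(left))
print(repr(right))
print(repr(both))
```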

Our improved code now looks like:


In [12]:
lines=[]
with open('demo_file.txt', mode='r') as f:
    for lin in f:
        linstr = lin.strip()
        if linstr == '' or linstr[0] == '#':  # The order is important here. If linstr
            # is an empty string, linstr[0] raises an IndexError. In this order the 'blank'
            # is checked first and 'continue' triggered before we get to the test for '#'
            continue # ignore comment and empty lines
        else:
            lines.append(linstr)

print(lines)

['The boy stood on the burning deck.', 'How now brown cow.', 'The quick brown fox jumped over the lazy dogs', '3.14159', '2.71828', '1,2,3,4,5,6,7,8,9,10', '1,4,9,16,25,36,49,64,81,100', 'The last line']


 
Great! This is exactly what we want. We could now go on to process the whole list of 'useful' lines.
 

### Exercise 3

Read in the file as we've just done but **count** the number of lines AT THE START of the file that you encounter and ignore before getting to the first 'useful' line (`'The boy stood ...'`). You'll find this useful later on in the Module.

In [13]:
# Write your code here...

_Hint: you can't just count all the lines you've ignored at the 'continue' statement as this would count subsequent comment lines that appear after the first useful line._

_Hint: you'll probably have to set a 'flag' variable to True (something like first_time=True), count skipped lines whilst this is True and then set it to False once you get to the first useful line (the first time you actually append anything)._

_Hint: the result should be 4._

In [14]:
lines=[]
first_time = True
count=0
with open('demo_file.txt', mode='r') as f:
    for lin in f:
        linstr = lin.strip()
        if linstr == '' or linstr[0] == '#':
            if first_time:
                count += 1
            continue # ignore comment and empty lines
        else:
            if first_time:
                first_time=False
            lines.append(linstr)

print(lines)
print('We had', count, 'discarded lines at the beginning of the file')

['The boy stood on the burning deck.', 'How now brown cow.', 'The quick brown fox jumped over the lazy dogs', '3.14159', '2.71828', '1,2,3,4,5,6,7,8,9,10', '1,4,9,16,25,36,49,64,81,100', 'The last line']
We had 4 discarded lines at the beginning of the file


## 4. Chopping up a line

Finally, this section introduces a couple of useful functions that are often used in file IO.


### 4.1 `split()`

Now we'll examine the rather useful method named **`split()`**. Not because it's specifically related to file IO, but because it's very often used when processing textual file input. It takes a single line of text and splits it into a list. **`split()`** separates the line wherever it comes across a delimiter that the programmer specifies. In scientific data, appropriate delimiters include commas (`,`), tab characters (`\t`) or spaces (` `).
 
We'll use **`split()`** to operate on the first line in our list (index 0) and on the 6th (index 5).

In [15]:
print(lines[0])
split_line = lines[0].split(' ') # split it where there are spaces
print(split_line)

print(lines[5])
split_line = lines[5].split(',') # split it where there is a ','
print(split_line)


The boy stood on the burning deck.
['The', 'boy', 'stood', 'on', 'the', 'burning', 'deck.']
1,2,3,4,5,6,7,8,9,10
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
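
Notice that `split()` always gives back strings, even when they look like numbers. A minimal sketch of the remaining step, turning a comma delimited line into a list of numbers, is shown below, along with what happens when `split()` is called with no delimiter at all. The variable names and the extra-spaced example line are made up for illustration.

```python
number_line = '1,4,9,16,25,36,49,64,81,100'

# split(',') gives a list of strings...
words = number_line.split(',')

# ...and int() turns each one into a whole number we can do sums with.
numbers = [int(word) for word in words]
total = sum(numbers)

print(numbers)
print(total)

# With a ' ' delimiter, repeated spaces produce empty strings.
# With no delimiter, split() breaks on any run of whitespace instead.
spaced_line = 'How   now brown  cow.'
by_space = spaced_line.split(' ')
by_whitespace = spaced_line.split()

print(by_space)
print(by_whitespace)
```

For lines such as `3.14159`, `float()` would be used in place of `int()`.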


### 4.2 `strip()` (which we've seen earlier)

You'll often get annoying spaces at the beginning and end of a line and sometimes the line you've read will have an 'end-of-line' character attached (seen as `\n`). You really don't need these characters (known as 'whitespace' - and there are others). In this case, **`strip()`** is your friend.

Here, we use **`strip()`** to remove whitespace from both ends of **`line`**.


In [16]:
line = '  an irritating line   \n'
print(line + '< this should be the end')

stripped_line = line.strip()
print(stripped_line + '< this should be the end')

  an irritating line   
< this should be the end
an irritating line< this should be the end
