# File IO in Python

We'll have a quick look at reading from and writing to files. Later we'll see how to do this in the Pandas module, but here we restrict ourselves to pure Python.

We'll also cover some of the more common actions that are taken when processing file input:

* Recognising 'comment' and blank lines
* Removing unwanted 'whitespace' at the ends of lines
* Splitting 'delimited' lines into lists of words or numbers

Generally with file handling you can either read the file in as a whole (usually storing the lines in a list for later processing), or read it in a line at a time and process each line as you go along. We'll look at both.


## Opening and closing a file 

We first need to 'open' the file. This returns a Python 'object' that we can iterate, or loop, through in a similar way to a list. The open function is usually given 2 parameters. The first, pretty obviously, is a file name (don't forget the file extension if it has one). The second indicates what 'mode' we want to open the file in: "r" if we just want to read from it, "w" if we only want to create/write to it (overwriting anything that might be in it if it exists), "r+" to both read and write, and "a" if we want to append stuff to the end of it. If you don't include a mode, "r" is assumed.

_For the techies amongst you: by default, the file is opened in 'text' mode. If you add a 'b' to the mode letter it is opened in binary mode, which is useful, for example, for looking at the individual bytes of data making up a JPEG or EXE file. Lines are ended with a '\\n' (Unix/Linux and Mac) or '\\r\\n' (Windows) depending on what platform we are using. This isn't particularly important for text files, because text mode translates the line endings for us, but it can be a real problem with binary files._

So, the simplest form is:


In [1]:
my_file = open('demo_file.txt', 'r')
print(type(my_file))
# more stuff
my_file.close()

<class '_io.TextIOWrapper'>


This returns the file object **my_file**, which we can use to read stuff from the file. If the file does not exist you will get a 'FileNotFoundError' error. As a matter of good practice, after processing the file you should 'close' it by using (in this case) **my_file.close()**.
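If you'd rather not have the program stop when the file is missing, one option is to catch the error. The following is only a minimal sketch: the name 'no_such_file.txt' is made up for illustration and is assumed not to exist in your working directory.

```python
# 'no_such_file.txt' is a hypothetical name, assumed NOT to exist
file_name = 'no_such_file.txt'

try:
    my_file = open(file_name, 'r')
except FileNotFoundError:
    # open() failed, so there is no file object to close
    message = 'Could not find ' + file_name
else:
    # Only reached if the file opened successfully
    message = 'Opened ' + file_name
    my_file.close()

print(message)
```

The 'else' part only runs when open() succeeds, so close() is never called on a file that was never opened.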

To make sure you don't inadvertently leave the file open you can use the 'with' statement, which automatically closes the file after processing the indented block following it:


In [2]:
with open('demo_file.txt', mode='r') as my_file:
    # Do some stuff here
    print(type(my_file))


<class '_io.TextIOWrapper'>
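If you want to convince yourself that 'with' really has closed the file, you can look at the file object's 'closed' attribute. A small sketch, assuming a throwaway file (the name 'scratch.txt' and the temporary directory are just for illustration):

```python
import os
import tempfile

# Hypothetical scratch file, placed in a temporary directory
scratch_name = os.path.join(tempfile.mkdtemp(), 'scratch.txt')

with open(scratch_name, mode='w') as scratch_file:
    scratch_file.write('hello\n')
    inside = scratch_file.closed   # False: still open inside the block

outside = scratch_file.closed      # True: closed once the block has ended

print('Closed inside the block?', inside)
print('Closed after the block? ', outside)
```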


## Writing (and appending)

We'll look at writing to a file now. 

Let's assume we have a list of lines of text we want to store more permanently for future use. In its simplest form we could do:


In [3]:
my_data = ['The sky is so blue.', 
          'The sun is so warm up high.', 
          'I love the summer.']

with open('haiku_1.txt', mode='w') as haiku_file:
    haiku_file.writelines(my_data)
    # Note we don't need haiku_file.close() because we are using 'with'


This gives a file containing a single line: _'The sky is so blue.The sun is so warm up high.I love the summer.'_, which probably wasn't what was intended. The list lines have been concatenated into a single string. We need to add an 'end of line' character ('\\n') onto the end of each line. A couple of ways we could do this are (using the same data):


In [4]:
with open('haiku_1.txt', mode='w') as haiku_file:
    for line in my_data:
        haiku_file.write(line+'\n')


Which gives:
    
    The sky is so blue.
    The sun is so warm up high.
    I love the summer.

Note this adds a '\\n' to the final line. Or we could use:


In [5]:
with open('haiku_1.txt', mode='w') as haiku_file:
    haiku_file.write('\n'.join(my_data))


Which joins all the list items together with a '\\n' between them. The result is the same, but without the final '\\n'.
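Appending works in just the same way, except that the "a" mode keeps whatever is already in the file and adds to the end of it. Here is a rough sketch, assuming a made-up file name ('haiku_2.txt') in a temporary directory so nothing real gets overwritten:

```python
import os
import tempfile

# Hypothetical file name; the temporary directory keeps the sketch tidy
haiku_name = os.path.join(tempfile.mkdtemp(), 'haiku_2.txt')

# 'w' creates the file (or overwrites it if it already exists)
with open(haiku_name, mode='w') as haiku_file:
    haiku_file.write('The sky is so blue.\n')

# 'a' keeps what is already there and adds to the end
with open(haiku_name, mode='a') as haiku_file:
    haiku_file.write('The sun is so warm up high.\n')
    haiku_file.write('I love the summer.\n')

with open(haiku_name, mode='r') as haiku_file:
    contents = haiku_file.read()

print(contents)
```

Note that if the existing file does not end with a '\\n', the appended text simply carries on along the same line, so it is worth deciding whether or not your files end with a final '\\n'.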



## Reading

This is similar to writing really. We'll look first at reading the whole file into a list of lines.


In [6]:
with open('haiku_1.txt', mode='r') as haiku_file:
    my_file = haiku_file.read()
    print(my_file)

The sky is so blue.
The sun is so warm up high.
I love the summer.



But, you notice this has read the file in as a single text string (the \\n characters just format it by inserting line breaks). What if we want the lines as separate items of a list:


In [7]:
with open('haiku_1.txt', mode='r') as haiku_file:
    my_file = haiku_file.readlines()
    print(my_file)

['The sky is so blue.\n', 'The sun is so warm up high.\n', 'I love the summer.']



This is fine - except that the \\n is still appended to each list line. If this is a problem, then use:


In [8]:
with open('haiku_1.txt', mode='r') as haiku_file:
    my_file = haiku_file.read().splitlines()
    print(my_file)

['The sky is so blue.', 'The sun is so warm up high.', 'I love the summer.']



Finally, how would we process the file a line at a time?:


In [9]:
with open('haiku_1.txt', mode='r') as haiku_file:
    count = 1
    for line in haiku_file:
        print('Line',count,'is:', line)
        count += 1

Line 1 is: The sky is so blue.

Line 2 is: The sun is so warm up high.

Line 3 is: I love the summer.
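The blank lines in that output appear because each line read from the file still carries its own '\\n', and print() then adds another. One possible way round this is sketched below, using rstrip('\\n') to drop the end-of-line and the built-in enumerate() in place of the hand-made counter. The file is recreated in a temporary directory so the block stands on its own, and 'numbered' is just a name introduced for illustration.

```python
import os
import tempfile

# Recreate the haiku file in a temporary directory (illustrative only)
haiku_name = os.path.join(tempfile.mkdtemp(), 'haiku_1.txt')
with open(haiku_name, mode='w') as haiku_file:
    haiku_file.write('\n'.join(['The sky is so blue.',
                                'The sun is so warm up high.',
                                'I love the summer.']))

numbered = []
with open(haiku_name, mode='r') as haiku_file:
    # enumerate() hands back (count, line) pairs; start=1 counts from 1
    for count, line in enumerate(haiku_file, start=1):
        text = line.rstrip('\n')   # drop the end-of-line, keep any other spaces
        numbered.append((count, text))
        print('Line', count, 'is:', text)
```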


### Exercise

Create and write a file containing a short poem of your choice. 

Read it back in, print it out, line by line, and count the number of lines in it.


## Processing the lines


We've provided a file with some test lines in it called 'demo_file.txt'. It contains this:

```
# This is a demo file
# With some comment lines indicated by a '#', blank lines, text lines,
# numbers and a couple of 'comma delimited' lines

The boy stood on the burning deck.
How now brown cow.
 The quick brown fox jumped over the lazy dogs

# Some numbers
3.14159
2.71828

# Some 'comma' delimited lines
1,2,3,4,5,6,7,8,9,10
1,4,9,16,25,36,49,64,81,100

The last line
```

There are some comment lines marked with a '#' and blank lines that we want to ignore. But notice one line begins with a blank - this is probably an error and we may want to just remove it. At some time we will want to extract the 2 comma delimited lines and split each into a list of numbers.

This looks like we need to look at each line in turn and treat them differently, so we'll go with the line-by-line processing approach. (We could, of course, just read the whole file into a list and process each line in the list.)

Often, for very large files it is better to take the line-by-line approach anyway. We'll start simply by reading in each line, checking it isn't a comment or blank line and appending it to a working list.


In [10]:
lines=[]
with open('demo_file.txt', mode='r') as f:
    for lin in f:
        if lin[0] == '#' or lin == '':
            continue # ignore comment and empty lines
        else:
            lines.append(lin)

print(lines)
        

['\n', 'The boy stood on the burning deck.\n', 'How now brown cow.\n', ' The quick brown fox jumped over the lazy dogs\n', '\n', '3.14159\n', '2.71828\n', '\n', '1,2,3,4,5,6,7,8,9,10\n', '1,4,9,16,25,36,49,64,81,100\n', '\n', 'The last line']



Well that's a good start but we've still got those pesky '\n' end-of-lines. The blank lines have also slipped through: a blank line read from a file is '\n', not '', so the test lin == '' never matches. Also, note the space before the ' The quick brown fox ...' line. Luckily we have an ideal function available, strip(), which removes all 'whitespace' (spaces, \n, etc.) from both ends of a line. (There are also lstrip() and rstrip() to do it from just the left or just the right end.)
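To see the difference between the three, here is a small sketch applied to the awkward line from the demo file. The variable names are just for illustration, and repr() is used so that the quotes and any '\n' show up in the printout.

```python
raw_line = ' The quick brown fox jumped over the lazy dogs\n'

both_ends = raw_line.strip()    # removes the leading space and the '\n'
left_only = raw_line.lstrip()   # removes the leading space only
right_only = raw_line.rstrip()  # removes the trailing '\n' only

# repr() shows the quotes and any '\n', so the difference is visible
print(repr(both_ends))
print(repr(left_only))
print(repr(right_only))

# A 'blank' line from a file is really '\n'; strip() turns it into ''
blank = '\n'.strip()
print(repr(blank))
```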

So:


In [11]:
lines=[]
with open('demo_file.txt', mode='r') as f:
    for lin in f:
        linstr = lin.strip()
        if linstr == '' or linstr[0] == '#':  # The order is important here. If lin.strip()
            # returns an empty string, linstr[0] will raise an IndexError. In this order the
            # 'blank' is checked first and 'continue' triggered before we get to the '#'
            continue # ignore comment and empty lines
        else:
            lines.append(linstr)

print(lines)

['The boy stood on the burning deck.', 'How now brown cow.', 'The quick brown fox jumped over the lazy dogs', '3.14159', '2.71828', '1,2,3,4,5,6,7,8,9,10', '1,4,9,16,25,36,49,64,81,100', 'The last line']


 
Which is what we want. We could now go on to process the whole list of 'useful' lines.
 


### Exercise

Read in the file as we've just done but count the number of lines AT THE START of the file that you encounter and ignore before getting to the first 'useful' line ('The boy stood ...'). You'll find this useful later on in the Module.

Hint: you can't just count all the lines you've ignored at the 'continue' statement as this would count subsequent comment lines.

Hint: you'll probably have to set a 'flag' variable to True (something like first_time=True), count skipped lines whilst this is True and then set it to False once you get to the first useful line (the first time you actually append anything).

Hint: the result should be 4.


In [12]:
lines=[]
first_time = True
count=0
with open('demo_file.txt', mode='r') as f:
    for lin in f:
        linstr = lin.strip()
        if linstr == '' or linstr[0] == '#':
            if first_time:
                count += 1
            continue # ignore comment and empty lines
        else:
            if first_time:
                first_time=False
            lines.append(linstr)

print(lines)
print('We had', count, 'discarded lines at the beginning of the file')

['The boy stood on the burning deck.', 'How now brown cow.', 'The quick brown fox jumped over the lazy dogs', '3.14159', '2.71828', '1,2,3,4,5,6,7,8,9,10', '1,4,9,16,25,36,49,64,81,100', 'The last line']
We had 4 discarded lines at the beginning of the file



Finally, we'll now introduce the rather useful function **'split()'** here. Not because it's anything to do with file IO, but because it's very often used when processing file input. The function will take a line and split it into a list of strings, separating the line wherever it comes across a delimiter - usually a ',' but sometimes a tab character (\\t) or a space.

We'll look at the first line in our list (index 0) and the 6th (index 5).

In [13]:
print(lines[0])
split_line = lines[0].split(' ') # split it where there are spaces
print(split_line)

print(lines[5])
split_line = lines[5].split(',') # split it where there is a ','
print(split_line)


The boy stood on the burning deck.
['The', 'boy', 'stood', 'on', 'the', 'burning', 'deck.']
1,2,3,4,5,6,7,8,9,10
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
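Notice that split() hands back a list of strings, even when the line looks like numbers. If we want real numbers we have to convert each item ourselves. The following is a small sketch of one way to do that, using the second comma delimited line from the demo file; the variable names are just for illustration.

```python
line = '1,4,9,16,25,36,49,64,81,100'

# split() always returns strings ...
as_text = line.split(',')

# ... so convert each piece if numbers are wanted
as_ints = [int(item) for item in as_text]

print(as_ints)
print(sum(as_ints))   # now we can do arithmetic with them

# With no argument, split() separates on any run of whitespace,
# which copes with stray or repeated spaces
words = ' The quick  brown fox '.split()
print(words)
```

Use float() instead of int() for lines such as '3.14159'.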
