# 7. File Handling
<span id="chapters_ch8_file_handling_file_handling"> </span>
<span id="chapters_ch8_file_handling__doc"> </span>

All variables used in a program are kept in the main memory and they are
*volatile*, i.e., their values are lost when the program ends. Even if
you write a program that runs forever, your data will be lost in case of a
shutdown or power failure.

Another drawback of the main memory is the capacity limitation. In the
extreme case, when you need more than a couple of gigabytes for your
variables, it will be difficult to keep all of them in the main memory.
In particular, infrequently required data is better kept on an
external storage device instead of the main memory.

*Files* provide a mechanism for storing data *persistently* on hard drives, which provide significantly larger storage than the main memory.
These devices are also called *secondary storage* devices. The data you
put in a file will stay on the hard drive until someone overwrites or
deletes the file (or until the hard drive fails, which is an unfortunate but rare
case).

A *file* is a sequence of bytes stored on secondary storage,
typically a hard drive (alternative secondary storage devices include CDs,
DVDs, USB disks, and tape drives). Data in a file differs from data in
memory (variables) in the following ways:
  1. A file is just a sequence of bytes. Therefore, data in a file is
     unorganized, there is no data type, and there are no variable boundaries.
  1. Data must be accessed indirectly, using I/O functions. For example, 
     updating a value in a file requires reading it into the main memory, updating it in the main
     memory, and then writing it back into the file.
  1. Accessing and updating data are significantly slower since it is on a slow
     external device.

Keeping data in a file instead of the main memory has the following use
cases:

  1. Data needs to be persistent. Data will be in the file when you
     restart your program, reboot your machine, or find your
     ancient laptop in the basement 30 years later. (It will probably not
     be there when an archaeologist in AD 3000 finds your laptop on an
     excavation site: hard disks are not that durable, so persistence is
     bounded.)

  1. You need to exchange data with another program.  
     Examples:
     *  You download data from the Web and your program gets it as input.
     *  You would like to generate data in your program and put it on a
        spreadsheet for further processing.  

  1. You have a large amount of data that does not fit in the main memory.
     In this case, you will probably use a library or software like a
     database management system to access data in a faster and more organized
     way. Files are the most primitive, basic way of achieving this.

In this chapter, we will cover simple file access: opening, closing,
reading, and writing files. The examples in this chapter create and
modify files when run; we strongly encourage you to check the contents
of the created files.



## 7.1 First Example
<span id="chapters_ch8_file_handling_first_example"> </span>

Let us quickly look at a simple example to get a feeling for the
different steps involved in working with files.

In [1]:
fpointer = open('firstexample.txt',"w")
fpointer.write("hello\n")
fpointer.write("how are\n")
fpointer.write("you?\n")
fpointer.close()

The program above will create a file in the current directory with
filename `firstexample.txt`. You can open it with your favorite text
editor (there are plenty of text editors for all operating systems: notepad,
wordpad, textedit, nano, vim) to see and edit it. The content will look
like this:

```
hello
how are
you?
```

The first line of the program is
`fpointer = open('firstexample.txt',"w")`. This line opens the file
named `firstexample.txt` for writing to it. If the file exists, its
content will be erased (and it will be an empty file afterward). The result
of `open()` is a file object that we will use in the following lines.
This object is assigned to the variable `fpointer`.
In the following lines, all functions we call with this file object
`fpointer` will work on the corresponding file (i.e., `firstexample.txt`). This
special *dot* notation lets us call functions in the scope of the
file: `fpointer.`*functionname*`()` calls the *functionname*
function for this file. The `write(string)` function writes the
`string` content to the file. Each call to `write(string)`
appends the `string` to the file, so the file grows. At the end,
when we are done, we call `close()` to finish accessing the file so
that the operating system knows and can take the necessary actions.
All open files are closed when your program terminates. However,
calling `close()` after you finish writing is good programming
practice.
Now, let us read this file:

In [2]:
fp = open("firstexample.txt","r")
content = fp.read()
fp.close()
print(content)

hello
how are
you?




In this case, we called `open()` with the argument `"r"`, which tells the interpreter
that we are going to read the file (or use it as an input source). If
you skip the second argument in `open()`, it is assumed to be `"r"`,
so `open("firstexample.txt")` is equivalent.

The `read()` function
takes an optional argument, which is the number of characters to read (or the number of bytes, for a file opened in binary mode). If you
skip it, `read()` reads the entire file content and returns it as a string.
Therefore, after the call, the `content` variable will be a string
with the file content.
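
The remark about always calling `close()` can also be handled with Python's `with` statement, which closes the file automatically when its block ends. The following is a minimal sketch rather than the method used in this chapter; the temporary directory and the variable `closed_after` are our own additions so that the example does not overwrite any existing file.

```python
import os
import tempfile

# Work in a temporary directory so that no existing file is overwritten.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "firstexample.txt")

    # The file is closed automatically when the `with` block ends.
    with open(path, "w") as fp:
        fp.write("hello\n")
        fp.write("how are\n")
        fp.write("you?\n")

    with open(path) as fp:          # the mode defaults to "r"
        content = fp.read()

    closed_after = fp.closed        # True: no explicit close() was needed

print(content)
```

The explicit `open()`/`close()` pairs used in the rest of the chapter do the same work; the `with` form only makes it harder to forget the `close()` call.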



## 7.2 Files and Sequential Access
<span id="chapters_ch8_file_handling_files_and_sequential_access"> </span>

A file consists of bytes, and `read/write` operations access those
bytes *sequentially*. In sequential access, the current I/O operation
updates the file state so that the next I/O operation will resume from the
end of the current I/O operation.

Assume that you have an old MP3 player that supports only a “*play the next 10
seconds*” operation on a button. Pressing it will play the next 10
seconds of the song. When you press it again, it will resume from where it
left off and play for another 10 seconds. This continues until the song is
over. Sequential access is similar. A *file pointer* keeps the
current offset in the file, and each I/O operation advances it so that the
next call will read or write from this new offset (see <a href="#chapters_ch8_file_handling_ch8_file_handling">Fig. 8.1</a>).


<figure>
<span id="chapters_ch8_file_handling_id1"> </span>
<span id="chapters_ch8_file_handling_ch8_file_handling"> </span>

<center><img src="img/fig8-1.png" width="450pt"></center>

<figcaption>Figure 8.1: Sequential reading of a file </figcaption>
</figure>

The following is a sample program illustrating sequential access:

In [3]:
fp = open("firstexample.txt","r") # the example file we created above

for i in range(3):          # repeat 3 times
   content = fp.read(4)     # read 4 characters in each step
   print("> ", content)     # output the 4 characters preceded by >

fp.close()

>  hell
>  o
ho
>  w ar



The text in the file was:


```
hello
how are
you?
```

The first `read()` reads `'hell'`, the second reads `'o\nho'` (note that `\n` stands for a new line, so `ho` is printed on a
new line), and the third reads `'w ar'`. After these operations, the
file offset is left at a position such that the following reads will
resume from the content `'e\nyou?\n'`.

We provided the example with 4-character read operations. However, for text
files, the typical scenario is reading line by line instead
of reading fixed-size strings.
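
To see where the file offset is left after the three reads, the sketch below repeats the experiment and then reads whatever remains. It recreates the example file in a temporary directory (our own addition) so that it can be run on its own; the names `chunks`, `rest`, and `at_end` are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "firstexample.txt")

    # Recreate the example file with the same three lines.
    fp = open(path, "w")
    fp.write("hello\nhow are\nyou?\n")
    fp.close()

    fp = open(path)
    chunks = [fp.read(4) for _ in range(3)]   # three sequential 4-character reads
    rest = fp.read()                          # resumes from the current offset
    at_end = fp.read()                        # nothing is left to read
    fp.close()

print(chunks)
print(repr(rest))
print(repr(at_end))
```

Each call continues from where the previous one stopped, so `rest` holds only the part of the file that the first three reads did not consume.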



## 7.3 Data Conversion and Parsing
<span id="chapters_ch8_file_handling_data_conversion_and_parsing"> </span>

A file, specifically a text file, consists of strings. However,
especially in engineering and science, we work with numbers. A number is
represented in a text file as a sequence of characters including digits,
a sign prefix (`'-'` or `'+'`), and at most one occurrence of a dot
(`'.'`). That means you may use $\pi$ as
`3.1416` in your Python program. However, in the text file, you store `'3.1416'`, which is a
string consisting of the characters `'3','.','1','4','1','6'`.


In [4]:
pi = 3.1416
pistr = '3.1416'
print(pi+pi,':', pi * 3)
print(pistr+pistr,':', pistr *3)

6.2832 : 9.4248
3.14163.1416 : 3.14163.14163.1416



Note that the second line of output above is a result of Python
interpreting `+` operator as string concatenation, and `*` as
adjoining multiple copies of a string. If we need to treat numbers as
numbers, we need to convert them from strings. There are two handy
functions for this: `int()` and `float()` convert a string into an
integer and a floating point value, respectively. Here is an
illustration:


In [5]:
pistr = '  0.31416E01  '
nstr = ' 47 '

# Convert numbers in the strings into numerical data types:
piflt = float(pistr)
nint = int(nstr)

print(piflt*2, nint*2)

6.2832 94



Note that we cannot call `int('3.1416')` since the string is not a
valid integer. That brings us to another challenge: making sure that
the strings we need to convert are actually numbers. Obviously,
`int('hello')` and `float('one point five')` will not work either.
The mechanisms for dealing with such errors are left for the next chapter. In this
chapter, we assume that our data is carefully created and all
conversions work without any error.

Our next challenge is having multiple numbers in a string separated by
special characters or simply spaces, e.g., `'10.0 5.0 5.0'`. In this case,
we need to decompose a string into string pieces representing numbers,
so that we will have `'10.0','5.0','5.0'` for the above string. The
next step will be converting them into numbers:

`'10.0 5.0 5.0' `$\overset{Step\;1}{\longrightarrow}$ `['10.0','5.0','5.0']` $\overset{Step\;2}{\longrightarrow}$ `[10.0, 5.0, 5.0]`

For the first step, we will use the `split()` method of a string.
A string, or a variable containing the string, is followed by
`.split(delimiter)`, which returns a list of the substrings that were separated by the
given delimiter. The delimiters are removed and all values in between
are put in the list. For example:


In [6]:
print('a:b:c'.split(':'))
print('hello darkness, my old friend'.split(' '))
print('a <=> b <=> c'.split(' <=> '))
print('multiple       spaces          are         tricky'.split(' '))
a = '10.0 5.0 5.0'
print(a.split(' '))

['a', 'b', 'c']
['hello', 'darkness,', 'my', 'old', 'friend']
['a', 'b', 'c']
['multiple', '', '', '', '', '', '', 'spaces', '', '', '', '', '', '', '', '', '', 'are', '', '', '', '', '', '', '', '', 'tricky']
['10.0', '5.0', '5.0']



For the second step, we will apply the `float()` function to each element of the list (or
the `int()` function if you have a list of integers). We have a couple
of options for this. One is to start from an empty list and append the
converted value at each step:


In [7]:
instr = '10.0 5.0 5.0'
outlst = []

# Go over each substring
for substr in instr.split(' '):
  outlst += [float(substr)]    # Convert each element to float and append it to the list

print(outlst)

[10.0, 5.0, 5.0]



A more practical and faster alternative is a list comprehension, which is
a compact way of mapping each value to another:


In [8]:
instr = '10.0 5.0 5.0'

outlst = [float(substr) for substr in instr.split(' ')]

print(outlst)

[10.0, 5.0, 5.0]



As we have explained in Section 5.2.6, the syntax is similar to the set/list
notation in mathematics:

$\left\{float(s) \; \vert \; s \in S\right\}$
vs. `[float(s) for s in S]`

If the values may be separated by multiple spaces, you can use
“`import re`” and call “`re.split(' +', inputstr)`” instead of
“`inputstr.split(' ')`”. This will split the
`'multiple        spaces   are    tricky'` example above into 4 words
without spaces. How it works is beyond the scope of the book. Curious
readers can refer to the “`re`” and “`parse`” modules for more advanced
forms of input parsing. These are not trivial modules for beginners.
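
As a small illustration of the difference, the sketch below splits the same string in three ways. The variable names are our own; the third variant, calling `split()` without any argument, is a built-in alternative that splits on runs of whitespace.

```python
import re

text = 'multiple        spaces   are    tricky'

by_single_space = text.split(' ')      # empty strings appear between repeated spaces
by_space_runs = re.split(' +', text)   # ' +' matches one or more consecutive spaces
by_whitespace = text.split()           # no argument: split on runs of any whitespace

print(by_single_space)
print(by_space_runs)
print(by_whitespace)
```

For input that contains only spaces between the values, the last two calls give the same four words.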

Now, let us consider the reverse operation: Assume we have a list
of numbers and we would like to convert it into a string that can be written
to a file. For this, we follow the inverse of the steps:

`[10.0, 5.0, 5.0]`$\overset{Step\;1}{\longrightarrow}$ `["10.0","5.0","5.0"]` $\overset{Step\;2}{\longrightarrow}$ `"10.0 5.0 5.0"`

The first step will be handled with the `str()` function which
converts any Python value into a human-readable string:


In [9]:
inlst = [10.0, 5.0, 5.0]

outlst = [str(num) for num in inlst]

print(outlst)

['10.0', '5.0', '5.0']



The next step is to join those elements with a delimiter, which is
the reverse of the `split()` operation. Not by accident, the name of this
operation is `join()`. `join()` is a method of the delimiter string,
and the list is its argument. For example,
`':'.join(['hello','how','are','you?'])` returns
`'hello:how:are:you?'`.


In [10]:
inlst = [10.0, 5.0, 5.0]

outlst = [str(num) for num in inlst]

print(' '.join(outlst))

10.0 5.0 5.0



A more advanced way of converting values into strings is called
*formatted output* and is briefly introduced in <a href="#chapters_ch8_file_handling_formatting_files">Section 7.7</a>.
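
The two directions of conversion in this section can be wrapped into a pair of small helper functions. This is only a sketch: the function names `numbers_to_line` and `line_to_numbers` are hypothetical, and it assumes that all values are valid numbers separated by exactly one delimiter.

```python
def numbers_to_line(numbers, delimiter=' '):
    """Convert a list of numbers into one delimiter-separated string."""
    return delimiter.join([str(num) for num in numbers])


def line_to_numbers(line, delimiter=' '):
    """Convert a delimiter-separated string back into a list of floats."""
    return [float(piece) for piece in line.split(delimiter)]


line = numbers_to_line([10.0, 5.0, 5.0])
values = line_to_numbers(line)

print(repr(line))
print(values)
```

Converting a list into a string and back returns the original values, which is what we need when data is written to a file and read again later.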



## 7.4 Accessing Text Files Line by Line
<span id="chapters_ch8_file_handling_accessing_text_files_line_by_line"> </span>

Files consisting of human-readable strings are called *text files*.
Text files consist of strings separated by the *end-of-line* character
`'\n'`, also known as *new line*. The sequence of characters in a file
contains the end-of-line characters so that a text editor will end the
current line and show the following characters on a new line. We use
end-of-line characters so that logically relevant data is on the same
line. For example:


```
4
10.0 20.0
15.5 22.2
3 44
10 10.5
```

Let us assume the integer value `4` on the first line denotes how many
lines will follow. Assume also that each of the following 4 lines has
two real values, denoting $x$ and $y$ values of a point. In
this way, we can represent our input separated by the end-of-line characters
for each point and by the space character for each value in a line.

Let us create such a text file from a Python list. Please note that the
file `read()` function returns a string and the `write()` function expects
a string argument. In other words, calling `write(3.14)` will fail. In order to
make the conversion, we use the `str()` function for numeric values
and call `write(str(3.14))` instead. Another tricky point is that
`write()` does not put the end-of-line character automatically. You need
to insert it in the output string or call an extra `write("\n")`.


In [11]:
pointlist = [(0,0), (10,0), (10,10), (0,10)]

fp = open("pointlist.txt", "w")       # open file for writing
fp.write(str(len(pointlist)))         # write list length
fp.write('\n')

# Go over each point in the list
for (x,y) in pointlist:              # for each x,y value in the list
    fp.write(str(x))                   # write x
    fp.write(' ')                      # space as number separator
    fp.write(str(y))                   # write y
    fp.write('\n')                     # \n as line separator

fp.close()

# let us read the content to verify what we wrote
fp = open("pointlist.txt")           # open for reading
content = fp.read()
print(content)
fp.close()

4
0 0
10 0
10 10
0 10




Using `read()` will get the whole content of the file; if the file is
large, your program will use too much memory and processing the data
will be difficult. In such a situation, we can access a text file line by
line using the `readline()` function.
Let us write a program to read and output the content of a text file. We
need a loop that reads the file line by line and outputs each line. Here, deciding when
to stop the loop is crucial. Python’s `read()` and `readline()`
functions return an empty string (`''`) when there is nothing left to
read. We can use this to stop reading:


In [12]:
fp = open("pointlist.txt")              # open file for reading

nextline = fp.readline()                # read the first line
while nextline != '':                   # while read is successful
    print(nextline)                       # output the line
    nextline = fp.readline()              # read the nextline

fp.close()                              # when nextline == '' loop terminates

4

0 0

10 0

10 10

0 10




Please note the empty lines between each output line. This is due to
the `'\n'` character at the end of the string that `readline()` returns.
In other words, `readline()` keeps the new line character it reads.
`print()` puts an end-of-line after the output (this can be suppressed
by adding an `end=''` argument). As a result, we have the extra
end-of-line at the end of each line. In order to avoid it, you can call
`rstrip('\n')` on the returned string to remove the end-of-line. The new
code will be:


In [13]:
fp = open("pointlist.txt")              # open file for reading

nextline = fp.readline()                # read the first line
while nextline != '':                   # while read is successful
    nextline = nextline.rstrip('\n')      # remove occurrences of '\n' at the end
    print(nextline)                       # output the line
    nextline = fp.readline()              # read the nextline

fp.close()

4
0 0
10 0
10 10
0 10
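
As an alternative to the explicit `readline()` loop, a Python file object can also be used directly in a `for` loop, which yields one line per iteration and stops at the end of the file. The sketch below is not the approach followed in this chapter; it recreates `pointlist.txt` in a temporary directory (our own addition) so that it can be run on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "pointlist.txt")

    # Recreate the example file used above.
    fp = open(path, "w")
    fp.write("4\n0 0\n10 0\n10 10\n0 10\n")
    fp.close()

    lines = []
    fp = open(path)
    for nextline in fp:                       # one line per iteration, '\n' included
        lines.append(nextline.rstrip('\n'))   # remove the end-of-line character
    fp.close()

print(lines)
```

The loop ends by itself when no lines are left, so no comparison with the empty string is needed.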



Converting this file into the initial Python list
`[(0,0), (10,0), (10,10), (0,10)]` is our next challenge. This
requires converting a string such as `"0 0\n"` into `(0,0)`. The first
one is of type `str` whereas the second is a tuple of numeric values.
We can use the `int()` or `float()` functions to convert strings into
numbers. Note that the string should contain a valid representation of a
Python numeric value: `int("hello")` will raise an error.

The second issue is separating two numbers in the same string. We can
call the `split()` method with the separator string as its argument, as in
`nextline.split(' ')`. This call returns a list of strings.
If the separator does not occur in the string, it will return a list with one element.
If there is one separator, it will return two elements. For $n$ occurrences of the separator, it will
return a list with $n+1$ elements.
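
The relation between the number of separators and the number of returned elements can be checked quickly. In this sketch, the sample strings and the names `samples` and `counts` are our own.

```python
samples = ['10.0', '10.0 5.0', '10.0 5.0 5.0']

counts = []
for text in samples:
    n = text.count(' ')            # number of separators in the string
    pieces = text.split(' ')       # list of substrings between separators
    counts.append((n, len(pieces)))
    print(n, 'separator(s) ->', len(pieces), 'element(s):', pieces)
```

In each case, the list has one more element than the number of separators.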

Here is the solution in Python:


In [14]:
fp = open("pointlist.txt")              # open file for reading

pointlist = []                          # start with empty list

nextline = fp.readline()                # read the first line
n = int(nextline)                       # find number of lines to read

for i in range(n):                      # repeat n times
    nextline = fp.readline()              # read the nextline
    nextline = nextline.rstrip('\n')      # remove occurrences of '\n' at the end
    (x, y) = nextline.split(' ')          # get x and y (note that they are still strings)
    x = float(x)                          # convert them into real values
    y = float(y)
    pointlist.append( (x,y) )             # add tuple at the end

fp.close()
print(pointlist)                        # output the resulting list

[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]


## 7.5 Termination of Input
<span id="chapters_ch8_file_handling_termination_of_input"> </span>

There are two ways to stop reading input:
  1. By reading a definite number of items.
  1. By the end of the file.

In our previous examples, we read an integer that told us how many lines
followed in the file. Then, we called `readline()` in a `for` loop
with the given number of lines. This is an example of the first case
which provides a definite number of items.

The alternative is to read lines in a `while` loop until a termination
condition arises. The termination condition is usually the *end of
file*, the case where functions like `read()` and `readline()`
return an empty string `''`.


In [15]:
fp = open("pointlist.txt")              # open file for reading

pointlist = []                          # start with empty list
nextline = fp.readline()                # skip the first line (4) since we don't need it

nextline = fp.readline()                # read the first point
while nextline != '':                   # until end of file
    nextline = nextline.rstrip('\n')    # remove occurrences of '\n' at the end
    (x, y) = nextline.split(' ')        # get x and y (note that they are still strings)
    x = float(x)                        # convert them into real values
    y = float(y)
    pointlist.append( (x,y) )           # add tuple at the end
    nextline = fp.readline()              # read the nextline

fp.close()
print(pointlist)

[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]



Note that the example above skips (reads and throws away) the first line in our input file
so that the integer on the first line is ignored. When your input does
not contain such an unnecessary value, you can delete this line.

Sometimes termination can be marked explicitly by a *sentinel value*
which is a value marking the end of values. This is especially useful
when you have multiple objects to read:


In [16]:
# First, create a file named `twopointlists.txt`
fp = open("twopointlists.txt", "w")
fp.write("""3 0
3.4 2.1
5.1 3.2
EOLIST
1 1.5
2.0 2.5""")
fp.close()

In this input, there are two lists of arbitrary sizes, and we use a word of our choice, `EOLIST`, to separate them.
We can use the `EOLIST` word to read the two lists separately as follows:

In [17]:
fp = open("twopointlists.txt")
pntlst1 = []                            # start with empty list
pntlst2 = []                            # start with empty list

nextline = fp.readline()                # read the first line
while nextline != 'EOLIST\n':           # sentinel value
    nextline = nextline.rstrip('\n')    # remove occurrences of '\n' at the end
    (x, y) = nextline.split(' ')        # get x and y (note that they are still strings)
    x = float(x)                        # convert them into real values
    y = float(y)
    pntlst1.append( (x,y) )             # add tuple at the end
    nextline = fp.readline()            # read the nextline

# first list has been read, now continue with the second list from the same file
nextline = fp.readline() 
while nextline != '':                   # until end of file
    nextline = nextline.rstrip('\n')    # remove occurrences of '\n' at the end
    (x, y) = nextline.split(' ')        # get x and y (note that they are still strings)
    x = float(x)                        # convert them into real values
    y = float(y)
    pntlst2.append( (x,y) )             # add tuple at the end
    nextline = fp.readline()            # read the nextline

fp.close()
print('List 1:', pntlst1)
print('List 2:', pntlst2)

List 1: [(3.0, 0.0), (3.4, 2.1), (5.1, 3.2)]
List 2: [(1.0, 1.5), (2.0, 2.5)]
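
The same idea extends to any number of lists separated by the sentinel. The following is a sketch under our own assumptions: the function name `read_point_lists` is hypothetical, every non-sentinel line is assumed to contain exactly two numbers separated by one space, and the input is given as a list of lines instead of a file to keep the example self-contained.

```python
def read_point_lists(lines, sentinel='EOLIST'):
    """Group lines of 'x y' values into lists, starting a new list at each sentinel."""
    all_lists = [[]]                       # start with one empty list
    for line in lines:
        line = line.rstrip('\n')           # remove the end-of-line character, if any
        if line == sentinel:
            all_lists.append([])           # sentinel found: start the next list
        else:
            (x, y) = line.split(' ')       # x and y are still strings here
            all_lists[-1].append((float(x), float(y)))
    return all_lists


sample = """3 0
3.4 2.1
5.1 3.2
EOLIST
1 1.5
2.0 2.5"""

lists = read_point_lists(sample.split('\n'))
print(lists)
```

Since the function only loops over its argument, an open file object could be passed instead of a list of lines.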


## 7.6 Example: Processing CSV Files
<span id="chapters_ch8_file_handling_example_processing_csv_files"> </span>
**CSV** stands for *Comma-Separated Values*; it is a text-based format
for exporting/importing *spreadsheet* (e.g., Excel) data. Rows in a CSV file are separated by newlines, and the columns within a row are separated by commas
(`,`). Actually, the format is more complex, but for the time being, let
us ignore commas that might appear inside strings and focus on a simple
form as follows:


```
Name,Surname,Age
Ada,Lovelace,37
John,von Neumann,53
Alan,Turing,42
Edsger W.,Dijkstra,72
Donald,Knuth,85
Dennis,Ritchie,70
```

Usually, the first line is reserved for the names of the columns in a spreadsheet. Now,
let us create this file:


In [18]:
content = '''Name,Surname,Age
Ada,Lovelace,37
John,von Neumann,53
Alan,Turing,42
Edsger W.,Dijkstra,72
Donald,Knuth,85
Dennis,Ritchie,70'''
fp = open("first.csv", "w")    # open for writing
fp.write(content)              # write in a single operation, practical for small files
fp.close()


Our next task is to read this file into memory as a list of dictionaries, such as `[{"Name":"Ada", "Surname":"Lovelace","Age":"37"},...]`.
We need to read the file line by line, extract the components using the
`split()` method, and then create a dictionary for each row. Then, we can append it
to a result list. For example:


In [19]:
fp = open("first.csv","r")              # open for reading

line =  fp.readline()                   # read column names
line = line.rstrip('\n')                # get rid of new line
colnames = line.split(',')              # list of column names

result = []                             # resulting list of dictionaries
line = fp.readline()
while line != '':                       # end-of-file check
    line = line.rstrip('\n')
    entry = {}                          # start with empty dictionary
    c = 0                               # a counter to address column number
    for v in line.split(','):           # in a loop process each column of the row
        entry[colnames[c]] = v          # column name is index, value is from current row
        c += 1
    result.append(entry)                # add dictionary to result
    line = fp.readline()                # read next line

fp.close()
print(type(result))
print(result)

<class 'list'>
[{'Name': 'Ada', 'Surname': 'Lovelace', 'Age': '37'}, {'Name': 'John', 'Surname': 'von Neumann', 'Age': '53'}, {'Name': 'Alan', 'Surname': 'Turing', 'Age': '42'}, {'Name': 'Edsger W.', 'Surname': 'Dijkstra', 'Age': '72'}, {'Name': 'Donald', 'Surname': 'Knuth', 'Age': '85'}, {'Name': 'Dennis', 'Surname': 'Ritchie', 'Age': '70'}]
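
The simple `split(',')` approach breaks when a value itself contains a comma, which CSV files handle by quoting the value. Python's standard `csv` module takes care of such cases. The sketch below is only a pointer for curious readers: the two-column sample data is made up for illustration, and `io.StringIO` is used in place of a real file.

```python
import csv
import io

# Illustrative data: the names are quoted because they contain a comma.
content = 'Name,Age\n"Lovelace, Ada",37\n"Turing, Alan",42\n'

# A plain split(',') cuts the quoted name into two pieces.
naive = content.split('\n')[1].split(',')

# csv.DictReader uses the first row as column names and respects the quotes.
rows = list(csv.DictReader(io.StringIO(content)))

print(naive)
print(rows[0]['Name'], rows[0]['Age'])
```

As in the hand-written version, all values are still strings and have to be converted with `int()` or `float()` before doing arithmetic on them.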



Let us improve this example by adding a column as a result of a
computation. Let us calculate the average age and show the difference of each age
from the average as a new column. We need to go over all age values in
the list, convert them to real values (so that we can do arithmetic on
them), calculate the average, and then go over all rows to add the new column.
Finally, we go over the list again to export/write it into a new CSV file.


In [20]:
n = 0
# Calculate the average
total = 0                              # named `total` to avoid shadowing the built-in sum()
for entry in result:
    total += float(entry['Age'])
    n += 1
average = total / n

# Calculate the difference of each age from the average
for entry in result:
    entry['Avgdiff'] = str(float(entry['Age']) - average)

# Write the updated content into another CSV file
fp = open('second.csv', 'w')
colnames = result[0].keys()            # keys (column names) of the first row; all rows share them
fp.write(','.join(colnames) + '\n')    # write this as the first line with comma separated values
for entry in result: # Go over each row
    vals = []
    for key in colnames: # Write each column on this row
        vals.append(entry[key])        # extract values of entry, entry.values() is a short version of this
    fp.write(','.join(vals) + '\n')

# Finished, close the file
fp.close()


After this code runs, the content of the file `second.csv` is as follows:


```
Name,Surname,Age,Avgdiff
Ada,Lovelace,37,-22.833333333333336
John,von Neumann,53,-6.833333333333336
Alan,Turing,42,-17.833333333333336
Edsger W.,Dijkstra,72,12.166666666666664
Donald,Knuth,85,25.166666666666664
Dennis,Ritchie,70,10.166666666666664
```

## 7.7 Formatting Files
<span id="chapters_ch8_file_handling_formatting_files"> </span>

Sometimes readability is important for text files, especially if data is
in a tabular form. For example, seeing all related data in a column
starting at the same position can improve readability significantly. The
following shows the unformatted and formatted versions of the same data
side by side:


```
        UNFORMATTED                                            FORMATTED
Name,Surname,Age,Avgdiff                   Name      , Surname             , Age  , Avgdiff
Ada,Lovelace,37,-22.833333333333336        Ada       , Lovelace            ,    37, -22.833
John,von Neumann,53,-6.833333333333336     John      , von Neumann         ,    53,  -6.833
Alan,Turing,42,-17.833333333333336         Alan      , Turing              ,    42, -17.833
Edsger W.,Dijkstra,72,12.166666666666664   Edsger W. , Dijkstra            ,    72,  12.167
Donald,Knuth,85,25.166666666666664         Donald    , Knuth               ,    85,  25.167
Dennis,Ritchie,70,10.166666666666664       Dennis    , Ritchie             ,    70,  10.167
```

In order to achieve this, you can use the `format()` method of a
template string as in

`'{:10}, {:20}, {:5d}, {:7.3f}'.format('Ada', 'Lovelace', 37, -22.833333333336)`.

Each `{}` in the template matches a data value in the arguments and specifies how the data will be converted as follows:

  *  The value after `:` denotes the (minimum) width of the data.
     When a text is shorter than the specified width, spaces are added to its right side,
     aligning the text to the left.
  *  For integers, the width
     is followed by a `d` to format the value as a decimal number, space-padded on
     the left (right aligned).
  *  For floating point values, you can specify the number of digits after the decimal point by appending a period (`.`) followed by the desired digit count and the letter `f`, which marks the value as a float. The fraction part of the number is rounded to the specified number of digits.
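
The effect of each kind of field can be seen in isolation with a short sketch. The `|` character is added by us only to make the padding visible, and the variable names are illustrative.

```python
template = '{:10}, {:20}, {:5d}, {:7.3f}'
row = template.format('Ada', 'Lovelace', 37, -22.833333333333336)

left = '{:6}|'.format('ab')                      # text: padded on the right
right = '{:6d}|'.format(42)                      # integer: padded on the left
rounded = '{:7.3f}|'.format(12.166666666666664)  # float: 3 digits after the point

print(row)
print(left)
print(right)
print(rounded)
```

Since the widths are minimum widths, a value that is longer than its field is not cut; it simply pushes the rest of the line to the right.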
     

A detailed description of `format()` is beyond the scope of the book; please refer to the Python reference manuals.

Let us rewrite the output part of the code using formatted output:


In [21]:
template = '{:10}, {:20}, {:5d}, {:7.3f}\n'
fp = open('third.csv', 'w')
colnames = result[0].keys()               # keys (column names) of the first row
fp.write('{:10}, {:20}, {:5}, {:7}\n'.format(*colnames) )      # header, composed of column names
for entry in result:
    fp.write(template.format(entry['Name'],entry['Surname'],int(entry['Age']),float(entry['Avgdiff']))) 
             # convert strings to numbers to respect number formatting
fp.close()

This produces the file `third.csv` with the following content:
```
Name      , Surname             , Age  , Avgdiff
Ada       , Lovelace            ,    37, -22.833
John      , von Neumann         ,    53,  -6.833
Alan      , Turing              ,    42, -17.833
Edsger W. , Dijkstra            ,    72,  12.167
Donald    , Knuth               ,    85,  25.167
Dennis    , Ritchie             ,    70,  10.167
```

## 7.8 Binary Files

<span id="chapters_ch8_file_handling_binary_files"> </span>

So far, we have only looked at text files where all values are
represented as human-readable text and all numerical values are
represented as decimal strings. However, as you may remember from earlier in the book, computers do not store and process numbers as decimal digit
sequences. They store numbers in binary format using the two’s complement method
and the IEEE 754 floating point standard. In order to read, process, and
write decimal data in text, programming languages and libraries have to convert
data between the human-readable form and the internal form. Even though you will not notice the time spent on conversion for small amounts of data, if you
read 10 million numbers, you start spending a significant amount of
CPU time on converting data.

Binary files, on the other hand, store numbers as they are stored in
the computer’s memory. They are still sequences of bytes, but in a more
structured way. For example, a 4-byte integer is kept as a sequence of 4
bytes, each byte is a part of the number in two’s complement form.
Reading a binary file is simply copying data to the memory; while doing so, either no
conversion is performed or only the order of bytes is changed.

A floating point number `0` takes 1 byte in a text file, but the
number `3.1415926535897932384626433832795028` takes 36 bytes. In a
binary file, the size of a number is fixed by the IEEE 754
format: both `0` and the number
$\pi$ are stored in 4 bytes for single precision and 8 bytes for
double precision.
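
These sizes can be checked with the `struct` module, which is introduced later in this section. In the sketch below, the `'<'` prefix in the format strings is our own choice; it requests the standard sizes so that the result does not depend on the machine.

```python
import struct

pi_text = '3.1415926535897932384626433832795028'

# In a text file, every character of the number takes one byte.
text_sizes = (len('0'), len(pi_text))

# In binary form, the size depends only on the precision, not on the value.
single = (len(struct.pack('<f', 0.0)), len(struct.pack('<f', 3.141592653589793)))
double = (len(struct.pack('<d', 0.0)), len(struct.pack('<d', 3.141592653589793)))

print(text_sizes, single, double)
```

Note that a double precision value cannot keep all the digits of the long text form above; the fixed size comes with a fixed precision.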

Keeping values in binary files has the following advantages:

  1. It is more compact: Data occupies less space in the file.

  1. No decimal-to-binary conversion is required, which is more efficient in terms
     of CPU usage.

  1. Since sizes are fixed, randomly jumping to a location and reading
     relevant data is possible. In a text file, you have to start from the
     beginning and read all lines up to the relevant data. This kind of
     usage is a more advanced case and harder to understand for beginners.

On the other hand, using text files has the following advantages:
  1. Files are human-readable and editable. Users can change data using a
     standard editor. In binary files, special software has to be used.

  1. The file format is more flexible: using `variablename:value` patterns
     in the file, data can be stored in any order. This
     is why text files are often used as configuration files.

Most of the special formats with `.exe`, `.xls`, `.zip`, `.pdf`
extensions are binary file formats.

**Note:** Binary files are kept out of the scope of this book. The following
paragraphs give a couple of pointers for curious readers.

In order to use binary files:
  1. You need to add the `'b'` character to the second argument of the
     `open()` function, as in `open('test.bin','rb')` or
     `open('test.bin','wb')`.

  1. Binary I/O requires `bytes` typed values instead of `str` typed
     values. `bytes` is a sequence of bytes. In contrast to `str`, the elements of a byte sequence
     are not necessarily printable characters.

  1. Python has the `struct` module for converting values into a `bytes`
     value. `struct.pack(format, values)` converts values into
     `bytes`. Computationally, this conversion is much cheaper
     than the binary-to-decimal conversion needed for writing text.

  1. `struct.unpack(format, bytesval)` can be used to convert a `bytes`
     value into Python values. It is much cheaper than the decimal-to-binary
     conversion needed for reading text.

  1. `read()` and `write()` can be used as usual. In `read(nbytes)`,
     the data size should be given. `struct.calcsize(format)` can be used to
     calculate the data size from the format.
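
Before using these functions with a file, it may help to try them on values in memory. The sketch below uses the `'<'` prefix in the format strings, which is our own choice to fix the byte order (little-endian) and the sizes so that the output is the same on every machine.

```python
import struct

packed = struct.pack('<i', 4)            # one integer -> 4 bytes
(n,) = struct.unpack('<i', packed)       # unpack always returns a tuple
as_ints = list(packed)                   # each element of a bytes value is a number in 0..255

pair = struct.pack('<dd', 2.5, 3.4)      # two double precision values -> 16 bytes
(x, y) = struct.unpack('<dd', pair)
pair_size = struct.calcsize('<dd')       # size computed from the format alone

print(packed, n, as_ints)
print(len(pair), pair_size, (x, y))
```

Unpacking gives back exactly the values that were packed, since no decimal text is involved in between.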

The following is an example of binary I/O. Assume the binary file
contains an integer $N$, for the number of points, followed by
$2 \times N$ floating point values. Let us write and then read
this data:


In [22]:
import struct

points = [(1,1), (2.5, 3.4), (5.4,3.3), (2.2, 1.121)]

# 1- Open and write the binary file
fp = open("points.bin", "wb")
fp.write(struct.pack('i', len(points)))   # 'i' denotes a single integer value is converted into bytes

for (x,y) in points:
    fp.write(struct.pack('dd', x, y))     # 'dd' denotes two floating point values are converted into bytes

fp.close()

# 2- Open and read the binary file
fp = open("points.bin", "rb")             # open same file for reading
content = fp.read(struct.calcsize('i'))   # read as many bytes as one packed integer occupies
(n,) = struct.unpack('i', content)        # unpack returns a tuple; a 1-tuple in this case

newpoints = []
for i in range(n):                        # n times
    content = fp.read(struct.calcsize('dd'))
    (x,y) = struct.unpack('dd', content)  # read two floats
    newpoints += [(x,y)]                  # append value at the end

fp.close()

# 3- Print the read and converted values
print("The read & converted points are:", newpoints)

print("This is what binary data looks like:")
fp = open("points.bin", "rb")
print(fp.read())
fp.close()

The read & converted points are: [(1.0, 1.0), (2.5, 3.4), (5.4, 3.3), (2.2, 1.121)]
This is what binary data looks like:
b'\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x04@333333\x0b@\x9a\x99\x99\x99\x99\x99\x15@ffffff\n@\x9a\x99\x99\x99\x99\x99\x01@V\x0e-\xb2\x9d\xef\xf1?'
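
Because every point occupies the same number of bytes in such a file, the random access mentioned among the advantages of binary files can be sketched with the `seek()` method of a file object, which moves the file offset to a given byte position. This is an illustration under our own assumptions: the file is written in a temporary directory, the `'<'` prefix fixes the byte order and sizes, and `k` is the (zero-based) index of the point we want.

```python
import os
import struct
import tempfile

points = [(1, 1), (2.5, 3.4), (5.4, 3.3), (2.2, 1.121)]
header_size = struct.calcsize('<i')      # bytes used by the point count
record_size = struct.calcsize('<dd')     # bytes used by one (x, y) pair

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "points.bin")

    # Write the count followed by the points, as in the example above.
    fp = open(path, "wb")
    fp.write(struct.pack('<i', len(points)))
    for (x, y) in points:
        fp.write(struct.pack('<dd', x, y))
    fp.close()

    k = 2                                            # index of the point to read
    fp = open(path, "rb")
    fp.seek(header_size + k * record_size)           # jump directly to point k
    kth_point = struct.unpack('<dd', fp.read(record_size))
    fp.close()

print(kth_point)
```

Only one point is read from the file; the points before it are skipped without being read or converted.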
