## File I/O with NumPy

Once you are more comfortable with Python and NumPy arrays, you will very likely want to do one of the following at some point:

- Work with a preexisting dataset or matrix or save your results for future use
- Export your results for publication or to share them with a friend or colleague

---

### Writing a NumPy Array to a File

Let's say we have an array `a` that we would like to export to a file.

In [None]:
import numpy as np

In [None]:
a = np.random.random((5,5))

In [None]:
print(a)

[[0.62372319 0.64372406 0.84033869 0.78156386 0.92517313]
 [0.20593054 0.26040217 0.27569802 0.96550772 0.17126956]
 [0.90450099 0.29947368 0.30778193 0.74892675 0.68625247]
 [0.91097039 0.97825094 0.98221419 0.09753638 0.07094166]
 [0.06387507 0.27941961 0.0156213  0.97393562 0.60169168]]


One option would be to use `np.save()` which saves the array to a binary `.npy` file.

In [None]:
np.save('array1', a)

When you open the **File Browser** from the left-hand menu, you should now see `array1.npy` along with some other files.

If you **downloaded** this notebook and are running it on a local instance of Jupyter, the `array1.npy` file is now saved into the same folder containing this notebook. You can use your system file browser (*Explorer* or *Finder*) to locate the file and take a look.

If you are running this notebook using **Binder** or **Google Colab**, the `array1.npy` file is temporarily stored on the server running the notebook. You can view the file *only* using the built-in file browser accessible via the left-hand menu. Also note that this file along with any other files you might create will be deleted from the server after you close the notebook. You can download any files you would like to save on your computer by *right-clicking* on the file in the left-hand browser and then selecting *Download*.

Note that if you try opening `array1.npy` in a text editor, it will not display properly. That is because the file is in **binary** format, meaning it is not human-readable and has to be decoded by a program that understands the `.npy` format, such as NumPy. Saving NumPy arrays in binary format is a good option if you care about speed and efficiency and are only planning on using NumPy to work with the data.
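To get a feel for what "binary" means here, we can peek at the first few bytes of a `.npy` file. The sketch below is only an illustration: the temporary directory and the file name `demo.npy` are assumptions made for this example and are not part of the workshop files.

```python
import os
import tempfile

import numpy as np

arr = np.arange(6, dtype=float).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.npy')  # hypothetical file name
    np.save(path, arr)

    # Open the file in binary mode ('rb') and read its first 6 bytes.
    with open(path, 'rb') as f:
        magic = f.read(6)

# .npy files start with a fixed marker rather than readable numbers.
print(magic)  # b'\x93NUMPY'
```

Everything after this marker is a short header followed by the raw array data, which is why a text editor shows mostly unreadable characters.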

However, in many cases you might actually want to be able to see the contents of the file and use it with other programs like MATLAB. In that case, it makes much more sense to save the NumPy array as a human-readable text file. This can be done using `np.savetxt()`.

In [None]:
np.savetxt('array2.txt', a)

Because `array2.txt` is a human-readable text file, you can take a look at it by opening it with a text editor or *double-clicking* on it in the built-in file explorer on the left. Note how by default, the values are separated by spaces and the numbers are formatted using scientific notation.
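If scientific notation is not what you want, `np.savetxt()` also accepts a `fmt` argument that controls how each number is written. The following is a small sketch using made-up values and a hypothetical file name in a temporary directory.

```python
import os
import tempfile

import numpy as np

values = np.array([[0.5, 1234.5678], [0.000123, 42.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'formatted.txt')  # hypothetical file name
    # '%.3f' requests fixed-point notation with 3 digits after the decimal point.
    np.savetxt(path, values, fmt='%.3f')

    with open(path) as f:
        text = f.read()

print(text)
# 0.500 1234.568
# 0.000 42.000
```

Note how `0.000123` is written as `0.000`: a shorter format is easier to read but throws away digits.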

To change the separator between the array items in the outputted text file, we can use the `delimiter` argument. For example, to produce a **CSV** (Comma-Separated Values) file, we can specify `delimiter = ","`.



In [None]:
np.savetxt('array3.csv', a, delimiter=',')

Now take a look at `array3.csv` and see how setting the delimiter has changed the appearance of the output.
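To compare separators side by side without opening each file by hand, we can write the same small array with different delimiters and read back the first line of each file. This is a sketch with illustrative file names; `fmt='%.1f'` is used here only to keep the printed lines short.

```python
import os
import tempfile

import numpy as np

values = np.array([[1.0, 2.0], [3.0, 4.0]])
first_lines = {}

with tempfile.TemporaryDirectory() as tmp:
    for name, sep in [('space', ' '), ('comma', ','), ('tab', '\t')]:
        path = os.path.join(tmp, name + '.txt')  # hypothetical file names
        np.savetxt(path, values, delimiter=sep, fmt='%.1f')

        # Keep only the first line of each file, without the trailing newline.
        with open(path) as f:
            first_lines[name] = f.readline().rstrip('\n')

print(first_lines)
# {'space': '1.0 2.0', 'comma': '1.0,2.0', 'tab': '1.0\t2.0'}
```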

---

### Reading a NumPy Array from a File

To read a binary `.npy` file into a NumPy array, we can use `np.load()`.

In [None]:
b = np.load('array1.npy')

In [None]:
b

array([[0.62372319, 0.64372406, 0.84033869, 0.78156386, 0.92517313],
       [0.20593054, 0.26040217, 0.27569802, 0.96550772, 0.17126956],
       [0.90450099, 0.29947368, 0.30778193, 0.74892675, 0.68625247],
       [0.91097039, 0.97825094, 0.98221419, 0.09753638, 0.07094166],
       [0.06387507, 0.27941961, 0.0156213 , 0.97393562, 0.60169168]])
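Besides the values themselves, a `.npy` file also records the shape and datatype of the array, so both come back unchanged. A minimal sketch, assuming a small integer array and a hypothetical file name:

```python
import os
import tempfile

import numpy as np

ints = np.arange(12, dtype=np.int16).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'ints.npy')  # hypothetical file name
    np.save(path, ints)
    restored = np.load(path)

# Shape and dtype are stored in the file, so nothing has to be specified on load.
print(restored.shape, restored.dtype)  # (3, 4) int16
```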

To read data from a text file into a NumPy array, we can use either `np.loadtxt()` or `np.genfromtxt()`.

- `np.loadtxt()` is the simpler function; it is designed as a fast reader for simply formatted files without missing values
- `np.genfromtxt()` is more customizable and can handle missing values

Hence `np.genfromtxt()` is the safer default when a file might contain gaps, while `np.loadtxt()` is sufficient for clean, fully populated files. When using either function, you have to specify the `delimiter` argument if the file uses anything other than whitespace.
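The handling of missing values is easiest to see on a tiny example. The sketch below reads from an in-memory string via `io.StringIO` instead of a real file (an assumption made to keep the example self-contained); the second row has an empty field between two commas.

```python
from io import StringIO

import numpy as np

# The middle value of the second row is missing.
text = "1.0,2.0,3.0\n4.0,,6.0\n"

# By default, a missing entry in a float column becomes nan.
with_nan = np.genfromtxt(StringIO(text), delimiter=',')

# filling_values replaces missing entries with a value of our choice.
filled = np.genfromtxt(StringIO(text), delimiter=',', filling_values=0.0)

print(with_nan)
print(filled)
```

Whether `nan` or a fill value is the better choice depends on what the gaps mean in your data.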



In [None]:
c = np.loadtxt('array2.txt')

In [None]:
c

array([[0.62372319, 0.64372406, 0.84033869, 0.78156386, 0.92517313],
       [0.20593054, 0.26040217, 0.27569802, 0.96550772, 0.17126956],
       [0.90450099, 0.29947368, 0.30778193, 0.74892675, 0.68625247],
       [0.91097039, 0.97825094, 0.98221419, 0.09753638, 0.07094166],
       [0.06387507, 0.27941961, 0.0156213 , 0.97393562, 0.60169168]])

In [None]:
d = np.genfromtxt('array3.csv', delimiter=',')

In [None]:
d

array([[0.62372319, 0.64372406, 0.84033869, 0.78156386, 0.92517313],
       [0.20593054, 0.26040217, 0.27569802, 0.96550772, 0.17126956],
       [0.90450099, 0.29947368, 0.30778193, 0.74892675, 0.68625247],
       [0.91097039, 0.97825094, 0.98221419, 0.09753638, 0.07094166],
       [0.06387507, 0.27941961, 0.0156213 , 0.97393562, 0.60169168]])

An important thing to note when saving floating-point arrays to text files is ***loss of significance***. Because we can only store a set number of significant digits in the text file, it is possible that the number of significant digits will be reduced when writing data to a file, introducing round-off errors and causing precision loss.

Note that this is not the case when using the binary `.npy` format.

In [None]:
a == b

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])
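For larger arrays, scanning a grid of `True` values is not practical. As a sketch, `np.array_equal()` reduces the comparison to a single boolean, and `np.allclose()` does the same while tolerating tiny differences. The arrays below are made up for illustration.

```python
import numpy as np

x = np.array([[0.1, 0.2], [0.3, 0.4]])
y = x.copy()
z = x + 1e-12  # almost, but not exactly, the same values

print(np.array_equal(x, y))  # True: every element matches exactly
print(np.array_equal(x, z))  # False: the tiny offset breaks exact equality
print(np.allclose(x, z))     # True: equal within the default tolerances
```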

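When writing to a text file using the default format (`'%.18e'`, scientific notation with 18 digits after the decimal point), precision loss does not occur for standard 64-bit floats, which need at most 17 significant digits to be reproduced exactly. However, note that this is dependent on the *datatype* of your array: a wider type, such as an extended-precision float, may not survive the round trip unchanged.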
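The effect of the output format on precision can be checked directly. The sketch below saves the same 64-bit float array twice, once with the default format of `np.savetxt()` (`'%.18e'`) and once with a deliberately short format, and then reads both files back. The seed and file names are arbitrary choices for this example.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
data = rng.random((5, 5))

with tempfile.TemporaryDirectory() as tmp:
    full_path = os.path.join(tmp, 'full.txt')    # hypothetical file names
    short_path = os.path.join(tmp, 'short.txt')

    np.savetxt(full_path, data)                # default format '%.18e'
    np.savetxt(short_path, data, fmt='%.3f')   # only 3 decimal places

    full = np.loadtxt(full_path)
    short = np.loadtxt(short_path)

# Exact round trip with the default format, small errors with the short one.
print(np.array_equal(data, full))
print(np.abs(data - short).max())
```

With three decimal places, each value can be off by up to about 0.0005, which may or may not matter depending on what the data is used for.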

---

### Advanced: File I/O With Python

But what exactly happens when we use `np.genfromtxt()` to read data from a file? We can get a high-level overview of the mechanisms that take place in the background when we try to recreate the functionality using standard Python.

First, we have to open the file in order to be able to read data from it.

In [None]:
file = open('array3.csv')

Now we have a **file object** called `file` that gives us access to `array3.csv`. Using `.readlines()` with a file object, we can read all the lines from a file into a list.

In [None]:
lines = file.readlines()

In [None]:
lines

['6.237231918648519224e-01,6.437240634816261409e-01,8.403386867865210164e-01,7.815638583940204276e-01,9.251731338779494163e-01\n',
 '2.059305424121925521e-01,2.604021728216241449e-01,2.756980151420885816e-01,9.655077170546886300e-01,1.712695565775376183e-01\n',
 '9.045009869690763260e-01,2.994736781812755710e-01,3.077819344613993424e-01,7.489267526670634334e-01,6.862524720236342635e-01\n',
 '9.109703911404098964e-01,9.782509379064019406e-01,9.822141916960781538e-01,9.753637709497453567e-02,7.094166292186709910e-02\n',
 '6.387507415508975050e-02,2.794196053372264288e-01,1.562129838304104901e-02,9.739356195997656007e-01,6.016916803512697420e-01\n']

Now we have a list called `lines`, where each element is a line from the file `array3.csv`. Note that some cleaning needs to be done as these lines still contain whitespace characters like newlines.

In [None]:
cleaned_lines = []
for line in lines:
    line = line.strip()
    cleaned_lines.append(line)

In [None]:
cleaned_lines

['6.237231918648519224e-01,6.437240634816261409e-01,8.403386867865210164e-01,7.815638583940204276e-01,9.251731338779494163e-01',
 '2.059305424121925521e-01,2.604021728216241449e-01,2.756980151420885816e-01,9.655077170546886300e-01,1.712695565775376183e-01',
 '9.045009869690763260e-01,2.994736781812755710e-01,3.077819344613993424e-01,7.489267526670634334e-01,6.862524720236342635e-01',
 '9.109703911404098964e-01,9.782509379064019406e-01,9.822141916960781538e-01,9.753637709497453567e-02,7.094166292186709910e-02',
 '6.387507415508975050e-02,2.794196053372264288e-01,1.562129838304104901e-02,9.739356195997656007e-01,6.016916803512697420e-01']

The next step would be to convert each line to a list by splitting the string on the separator. This will lead to a list of lists, which is already quite similar to a two-dimensional NumPy array.

In [None]:
lists = []
for line in cleaned_lines:
    lst = line.split(',')
    lists.append(lst)

In [None]:
lists

[['6.237231918648519224e-01',
  '6.437240634816261409e-01',
  '8.403386867865210164e-01',
  '7.815638583940204276e-01',
  '9.251731338779494163e-01'],
 ['2.059305424121925521e-01',
  '2.604021728216241449e-01',
  '2.756980151420885816e-01',
  '9.655077170546886300e-01',
  '1.712695565775376183e-01'],
 ['9.045009869690763260e-01',
  '2.994736781812755710e-01',
  '3.077819344613993424e-01',
  '7.489267526670634334e-01',
  '6.862524720236342635e-01'],
 ['9.109703911404098964e-01',
  '9.782509379064019406e-01',
  '9.822141916960781538e-01',
  '9.753637709497453567e-02',
  '7.094166292186709910e-02'],
 ['6.387507415508975050e-02',
  '2.794196053372264288e-01',
  '1.562129838304104901e-02',
  '9.739356195997656007e-01',
  '6.016916803512697420e-01']]

Note how all the elements still have the type `str`, meaning they are text, not numbers. Luckily there is an easy fix for that.

In [None]:
type(lists[0][0])

str

In [None]:
float_lists = []
for lst in lists:
    flst = []
    for element in lst:
        element = float(element)
        flst.append(element)
    float_lists.append(flst)

In [None]:
float_lists

[[0.6237231918648519,
  0.6437240634816261,
  0.840338686786521,
  0.7815638583940204,
  0.9251731338779494],
 [0.20593054241219255,
  0.26040217282162414,
  0.2756980151420886,
  0.9655077170546886,
  0.17126955657753762],
 [0.9045009869690763,
  0.29947367818127557,
  0.30778193446139934,
  0.7489267526670634,
  0.6862524720236343],
 [0.9109703911404099,
  0.9782509379064019,
  0.9822141916960782,
  0.09753637709497454,
  0.0709416629218671],
 [0.06387507415508975,
  0.27941960533722643,
  0.015621298383041049,
  0.9739356195997656,
  0.6016916803512697]]

In [None]:
type(float_lists[0][0])

float
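The three passes so far (strip, split, convert) can also be folded into a single nested list comprehension. This is just an alternative sketch; the two short lines below stand in for the output of `file.readlines()`.

```python
# Stand-in for the list returned by file.readlines().
lines = ['1.5,2.5,3.5\n', '4.0,5.0,6.0\n']

# Outer comprehension: one list per line.
# Inner comprehension: strip the newline, split on commas, convert each piece.
float_lists = [
    [float(item) for item in line.strip().split(',')]
    for line in lines
]

print(float_lists)  # [[1.5, 2.5, 3.5], [4.0, 5.0, 6.0]]
```

The explicit loops are easier to follow when you are starting out, while the comprehension is the more compact form you will often see in existing code.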

Now we can use this list of lists to create a NumPy array.

In [None]:
e = np.array(float_lists)

In [None]:
e

array([[0.62372319, 0.64372406, 0.84033869, 0.78156386, 0.92517313],
       [0.20593054, 0.26040217, 0.27569802, 0.96550772, 0.17126956],
       [0.90450099, 0.29947368, 0.30778193, 0.74892675, 0.68625247],
       [0.91097039, 0.97825094, 0.98221419, 0.09753638, 0.07094166],
       [0.06387507, 0.27941961, 0.0156213 , 0.97393562, 0.60169168]])

We can confirm that we got the same result as we would have gotten using `np.genfromtxt()` by comparing it to the array `d` from before.

In [None]:
e == d

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

Finally we have to remember to close the file. Closing releases the operating-system resources tied to the file, and for files opened for writing it makes sure that any buffered data actually ends up on disk.

In [None]:
file.close()

Forgetting to close the file could lead to various issues, such as data that was never written to disk or files that stay locked. Hence, it is commonplace to use `open()` in conjunction with a `with` statement. Any code executed within the block defined by the `with` statement has access to the open file, and the file is closed automatically as soon as the block ends, even if an error occurs inside it. This reduces the potential for errors and does not require you to manually close the connection to the file.
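We can watch the `with` statement do its job by checking the `closed` attribute of the file object inside and outside the block. The sketch below first creates a small CSV file in a temporary directory (the file name and contents are assumptions for this example) so that it runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')  # hypothetical file name

    # Create a tiny file to read from.
    with open(path, 'w') as f:
        f.write('1.0,2.0\n3.0,4.0\n')

    with open(path) as f:
        lines = f.readlines()
        inside = f.closed   # still open inside the block

    after = f.closed        # closed automatically once the block ends

print(inside, after)  # False True
```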

Also note how our previous processing involved looping over basically the same list numerous times. We can simplify this a little by looping over indices instead.

In [None]:
with open('array3.csv') as f:
    lines = f.readlines()

In [None]:
lines

['6.237231918648519224e-01,6.437240634816261409e-01,8.403386867865210164e-01,7.815638583940204276e-01,9.251731338779494163e-01\n',
 '2.059305424121925521e-01,2.604021728216241449e-01,2.756980151420885816e-01,9.655077170546886300e-01,1.712695565775376183e-01\n',
 '9.045009869690763260e-01,2.994736781812755710e-01,3.077819344613993424e-01,7.489267526670634334e-01,6.862524720236342635e-01\n',
 '9.109703911404098964e-01,9.782509379064019406e-01,9.822141916960781538e-01,9.753637709497453567e-02,7.094166292186709910e-02\n',
 '6.387507415508975050e-02,2.794196053372264288e-01,1.562129838304104901e-02,9.739356195997656007e-01,6.016916803512697420e-01\n']

In [None]:
for i in range(len(lines)):
    lines[i] = lines[i].strip().split(',')
    for j in range(len(lines[i])):
        lines[i][j] = float(lines[i][j])

In [None]:
lines

[[0.6237231918648519,
  0.6437240634816261,
  0.840338686786521,
  0.7815638583940204,
  0.9251731338779494],
 [0.20593054241219255,
  0.26040217282162414,
  0.2756980151420886,
  0.9655077170546886,
  0.17126955657753762],
 [0.9045009869690763,
  0.29947367818127557,
  0.30778193446139934,
  0.7489267526670634,
  0.6862524720236343],
 [0.9109703911404099,
  0.9782509379064019,
  0.9822141916960782,
  0.09753637709497454,
  0.0709416629218671],
 [0.06387507415508975,
  0.27941960533722643,
  0.015621298383041049,
  0.9739356195997656,
  0.6016916803512697]]

In [None]:
arr = np.array(lines)

In [None]:
arr

array([[0.62372319, 0.64372406, 0.84033869, 0.78156386, 0.92517313],
       [0.20593054, 0.26040217, 0.27569802, 0.96550772, 0.17126956],
       [0.90450099, 0.29947368, 0.30778193, 0.74892675, 0.68625247],
       [0.91097039, 0.97825094, 0.98221419, 0.09753638, 0.07094166],
       [0.06387507, 0.27941961, 0.0156213 , 0.97393562, 0.60169168]])

We can confirm that the result is indeed the same as before.

In [None]:
arr == e

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

Note that you can condense this even more by using `map()` with `lambda` and remembering that `np.array()` has a `dtype` argument.

In [None]:
with open('array3.csv') as f:
    arr2 = np.array(list(map(lambda x : x.strip().split(','), f.readlines())), dtype=float)

In [None]:
arr2

array([[0.62372319, 0.64372406, 0.84033869, 0.78156386, 0.92517313],
       [0.20593054, 0.26040217, 0.27569802, 0.96550772, 0.17126956],
       [0.90450099, 0.29947368, 0.30778193, 0.74892675, 0.68625247],
       [0.91097039, 0.97825094, 0.98221419, 0.09753638, 0.07094166],
       [0.06387507, 0.27941961, 0.0156213 , 0.97393562, 0.60169168]])

In [None]:
arr == arr2

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])
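Another option worth knowing is Python's built-in `csv` module, which takes care of splitting each line on the separator for us. The sketch below is an alternative to the manual approach, not a description of what `np.genfromtxt()` does internally; it writes a small made-up array to a temporary CSV file first so that it is self-contained.

```python
import csv
import os
import tempfile

import numpy as np

original = np.array([[0.25, 0.5], [0.75, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')  # hypothetical file name
    np.savetxt(path, original, delimiter=',')

    # csv.reader yields one list of strings per line of the file.
    with open(path, newline='') as f:
        rows = list(csv.reader(f))

# As before, dtype=float converts the strings to numbers.
arr_csv = np.array(rows, dtype=float)

print(np.array_equal(arr_csv, original))  # True
```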