# File handling in Python

Suppose we have a text file like the one below, from which we would like to extract temperature and density data:

$$
# Density of water at different temperatures, at 1 atm pressure
# Column 1: temperature in Celsius degrees
# Column 2: density in kg/m^3
  0.0   999.8425
  4.0   999.9750
 15.0   999.1026
 20.0   998.2071
 25.0   997.0479
 37.0   993.3316
 50.0   988.04
100.0   958.3665
# Source: Wikipedia (keyword Density)
$$

## Python

It is usually best to read and write files using the **with statement**, which keeps the code concise and closes the file automatically when the block ends, even if an error occurs inside it.
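
The automatic closing can be checked directly. The following is a minimal sketch, assuming a scratch file created in a temporary directory (the path and file name are made up for this demonstration). It also shows roughly what we would have to write without the with statement.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

with open(path, 'w') as file:
    file.write('first line\n')
    print(file.closed)  # the file is still open inside the block

# Leaving the block closed the file for us
print(file.closed)

# Roughly equivalent code without the with statement
file = open(path, 'r')
try:
    content = file.read()
finally:
    file.close()  # runs even if read() raises an exception
```

The with version is shorter and makes it impossible to forget the call to close.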

Our file contains 3 comment lines at the beginning and 1 at the end, which we do not want to extract. The temperature and density values are separated by whitespace. Here is an example of how we could extract the data into two arrays:

In [33]:
import numpy as np

with open('density_water.dat', 'r') as file:  # 'r' for read
    # skip the 3 header lines and the 1 footer line by slicing the list of lines
    data = file.readlines()[3:-1]

# initialise 2 arrays of the right length to store our data
temp, density = np.zeros(len(data)), np.zeros(len(data))

for i, line in enumerate(data):
    values = line.split()  # split the line on whitespace
    temp[i] = float(values[0])
    density[i] = float(values[1])
        
print(temp)
print(density)

[  0.   4.  15.  20.  25.  37.  50. 100.]
[999.8425 999.975  999.1026 998.2071 997.0479 993.3316 988.04   958.3665]


The data variable in the above example is a list containing the lines with numbers, but each line is stored as a string; its contents are not recognised as numbers at this point. Note that the temperature and density arrays are initialised with a specific length, rather than as empty lists to which we then append numbers. It is good practice to initialise arrays like this when we know exactly how many elements the final array will have.

The for loop cycles through all elements of the data list. Each element is a string, which we split into a new list with one element per whitespace-separated value in the line, in our case 2. Finally, we convert those values from strings to floats and assign them to elements of the temperature and density arrays.

Writing to a file works in the same way, with the file opened in write mode. Note that 'w' overwrites the file if it already exists:

In [34]:
with open('output.txt', 'w') as file:  # w for write
    file.write('Hello world!')
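
The same approach extends to numeric data. The sketch below writes a small table with formatted columns and reads it back. It assumes a scratch path in a temporary directory and a shortened copy of the data, both chosen only for illustration. Unlike the slicing approach above, it skips comment lines by checking their content rather than their position, so it does not depend on knowing how many header and footer lines there are.

```python
import os
import tempfile

# Hypothetical output path in a scratch directory
path = os.path.join(tempfile.mkdtemp(), 'density_copy.dat')

temp = [0.0, 4.0, 15.0]
density = [999.8425, 999.9750, 999.1026]

with open(path, 'w') as file:
    file.write('# Column 1: temperature, Column 2: density\n')
    for t, rho in zip(temp, density):
        # fixed-width columns: 5 characters with 1 decimal, 10 with 4 decimals
        file.write(f'{t:5.1f} {rho:10.4f}\n')

# Read the file back, skipping comment lines and blank lines
temp_back, density_back = [], []
with open(path, 'r') as file:
    for line in file:
        if line.startswith('#') or not line.strip():
            continue
        t, rho = line.split()
        temp_back.append(float(t))
        density_back.append(float(rho))

print(temp_back)
print(density_back)
```

Lists are used here because the number of data lines is not known before the file has been read.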

## NumPy

NumPy's [**genfromtxt**](https://numpy.org/doc/1.18/reference/generated/numpy.genfromtxt.html) (generate from text) function offers fine-grained control over how a text file is parsed and returns the data as a NumPy array. Some of its parameters include:

- dtype: the data type of the resulting array. float by default; if set to None, the data type is determined automatically for each column.
- comments: the string that marks the start of a comment; everything after it on a line is ignored. '#' by default.
- skip_header: number of lines to skip at the beginning of the file
- skip_footer: number of lines to skip at the end of the file
- delimiter: the string used to separate values. Any whitespace, by default.
    
For our file, we do not have to change any of the defaults, since comments are marked by a # at the beginning of the line and the values are separated by whitespace.

In [42]:
import numpy as np

data = np.genfromtxt('density_water.dat')
print(data)

[[  0.     999.8425]
 [  4.     999.975 ]
 [ 15.     999.1026]
 [ 20.     998.2071]
 [ 25.     997.0479]
 [ 37.     993.3316]
 [ 50.     988.04  ]
 [100.     958.3665]]
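
When the defaults do not fit, the parameters listed earlier can be set explicitly. The sketch below is an illustration under a few assumptions: it uses a shortened copy of the data held in memory through io.StringIO instead of a file on disk, and the comma-separated variant is a made-up example. The unpack parameter, not mentioned above, transposes the result so that each column can be assigned to its own variable.

```python
import io

import numpy as np

# A shortened copy of the data file, held in memory for illustration
text = """# Column 1: temperature in Celsius degrees
# Column 2: density in kg/m^3
  0.0   999.8425
  4.0   999.9750
 15.0   999.1026
# Source: Wikipedia (keyword Density)
"""

# Spell out the defaults and unpack the two columns into separate arrays
temp, density = np.genfromtxt(io.StringIO(text), comments='#',
                              delimiter=None, unpack=True)
print(temp)
print(density)

# Hypothetical comma-separated variant with a header row instead of comments
csv_text = "temperature,density\n0.0,999.8425\n4.0,999.9750\n"
table = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
print(table.shape)
```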


We can save our array to a file in multiple ways, as shown in the examples below. If we plan on using the array in another Python program, it is probably best to save it as a .npy file, which we can later load to reconstruct the original array. The reader is encouraged to read about **pickling** in Python, which allows most Python objects, not only NumPy arrays, to be saved in a similar way.

In [45]:
np.savetxt('data.txt', data)  # save data array to a text file
np.save('data.npy', data)  # save data array to a .npy file

A = np.load('data.npy')

print(A)

[[  0.     999.8425]
 [  4.     999.975 ]
 [ 15.     999.1026]
 [ 20.     998.2071]
 [ 25.     997.0479]
 [ 37.     993.3316]
 [ 50.     988.04  ]
 [100.     958.3665]]
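
The three options can be compared side by side. The following is a minimal sketch, assuming a shortened copy of the array and scratch files in a temporary directory; the file names and the contents of the pickled dictionary are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

data = np.array([[0.0, 999.8425], [4.0, 999.9750], [15.0, 999.1026]])
folder = tempfile.mkdtemp()  # scratch directory for this demonstration

# Text file: human-readable; the header is written as a '#' comment line
txt_path = os.path.join(folder, 'data.txt')
np.savetxt(txt_path, data, fmt='%.4f', header='temperature density')
from_txt = np.loadtxt(txt_path)  # comment lines are skipped on loading

# .npy file: binary, keeps the exact values, shape and dtype
npy_path = os.path.join(folder, 'data.npy')
np.save(npy_path, data)
from_npy = np.load(npy_path)

# Pickle: works for general Python objects, here a dictionary
record = {'source': 'Wikipedia', 'values': data}
pkl_path = os.path.join(folder, 'record.pkl')
with open(pkl_path, 'wb') as file:  # 'wb' for write, binary
    pickle.dump(record, file)
with open(pkl_path, 'rb') as file:  # 'rb' for read, binary
    restored = pickle.load(file)

print(np.allclose(from_txt, data))
print(np.array_equal(from_npy, data))
print(restored['source'])
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.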


## Pandas

Despite its name, pandas' [**read_csv**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) can read many kinds of delimited text files, including our .dat file. The DataFrame is the primary data structure in pandas, and this function returns one. It is a very powerful object with many capabilities not found in NumPy, deserving of its own notebook.

In [38]:
from pandas import read_csv

# sep=r'\s+' splits the columns on whitespace (the default separator is a comma)
df = read_csv('density_water.dat', comment='#', sep=r'\s+', header=None)
print(df)

       0         1
0    0.0  999.8425
1    4.0  999.9750
2   15.0  999.1026
3   20.0  998.2071
4   25.0  997.0479
5   37.0  993.3316
6   50.0  988.0400
7  100.0  958.3665
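
Once the data is in a DataFrame, the columns can be given names and accessed by them. The sketch below makes a few assumptions: it reads a shortened copy of the file from memory through io.StringIO, and the column names temperature and density are chosen here for illustration. Because the values are separated by whitespace rather than commas, sep=r'\s+' is passed so that each line is split into two numeric columns.

```python
import io

import pandas as pd

# A shortened copy of the data file, held in memory for illustration
text = """# Column 1: temperature in Celsius degrees
# Column 2: density in kg/m^3
  0.0   999.8425
  4.0   999.9750
 15.0   999.1026
# Source: Wikipedia (keyword Density)
"""

df = pd.read_csv(io.StringIO(text), comment='#', sep=r'\s+',
                 header=None, names=['temperature', 'density'])

# Both columns are parsed as floating-point numbers
print(df.dtypes)

# Select the row with the highest density
densest = df.loc[df['density'].idxmax()]
print(densest['temperature'])

# Convert back to a plain NumPy array when needed
array = df.to_numpy()
print(array.shape)
```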
