# File IO Solution

## EOSC 211

**Week 12 Day 2**

**Learning Objectives:** 

1. Convert python datatypes (arrays, datetimes, slices) into basic
   strings, lists, tuples and dicts that can be saved to a text file in json format
2. Write a function to open a json file and convert the file contents back
   into python objects.

## Introduction

In this course we've usually started a worksheet or lab by reading a data file into python.  For example,
in [wk05_indexing](https://phaustin.github.io/eosc211_students/wk05/wk05_indexing.html) we used
`np.load` to read in a dictionary containing two numpy binary files containing the arrays `flux` and `temp`.
As I mentioned in that notebook, these files were written by  [np.savez](https://numpy.org/doc/stable/reference/generated/numpy.savez.html).  In [week 11](https://phaustin.github.io/eosc211_students/wk11/pythia_pandas.html) you used `pd.read_csv` to read in
a text file written in spreadsheet `comma separated value` format.  This [enso_data.csv](https://github.com/phaustin/eosc211_students/blob/e211_live_main/wk11/enso_data.csv) file was written
from a dataframe using [pandas.DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).  

Writing binary numpy files and csv files from dataframes works, but there is a significant drawback:
you need numpy and some kind of spreadsheet to read the data.  How would you write a file
that any text editor or programming language could work with?

## Background -- text vs. binary

What is the difference between a text file like `enso_data.csv` and a binary file like `temp_flux.npz`?  At the
machine level, there is no difference: like everything else on your computer, they consist only of a
long string of 1's and 0's.  The difference is in how a computer program interprets those 1's and 0's.

In a text file, the 8 bit bytes are mapped to characters in a human language using a lookup table.
Python (and all modern programming languages) uses the [unicode character table](https://unicode-table.com/en/blocks/), which includes not only languages but also [math](https://unicode-table.com/en/blocks/mathematical-operators/) and [emojis, musical notes and chess pieces](https://unicode-table.com/en/blocks/miscellaneous-symbols/).

In a binary file, the bits are read into memory with no assumption about a mapping. You need to know
whether the bits were written as 64 bit floats, 32 bit ints, or 1 bit logical values.  This quickly gets
complicated -- there is a brief note about packages that write common binary data formats at the 
end of the worksheet.
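
As a minimal sketch of the text vs. binary distinction, the cell below encodes the characters `3.5` as text bytes and then stores the number 3.5 as a 64 bit float.  The variable names are made up for illustration, and the integer values you get when reinterpreting the float bytes depend on your machine's byte order, so they aren't listed here.

```python
import numpy as np

# Text: each character is looked up in the unicode table and encoded as bytes
text_bytes = "3.5".encode("utf-8")
print(list(text_bytes))        # one byte per character here: [51, 46, 53]

# Binary: the same number stored as a 64 bit float always takes 8 bytes
float_bytes = np.float64(3.5).tobytes()
print(len(float_bytes))

# The same 8 bytes give different values depending on the type you assume
as_float = np.frombuffer(float_bytes, dtype=np.float64)
as_ints = np.frombuffer(float_bytes, dtype=np.int32)
print(as_float, as_ints)
```

Reading the bytes back with the wrong `dtype` doesn't raise an error, it just produces meaningless numbers, which is why a binary file is only useful if you know how it was written.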

## A basic text file example

Here's a basic write of a numpy array to a text file using the folders we created in the week 11 lab.

In [1]:
from pathlib import Path
import numpy as np
myhome = Path.home()

# create the folder
text_dir = myhome / 'eosc211/week12'
text_dir.mkdir(parents=True, exist_ok=True)

# make some data and write it out
simple_vec = np.arange(3,40,2)
filename = text_dir / 'simplefile.txt'
with open(filename,'w') as outfile:
    for the_num in simple_vec:
        outfile.write(f"{the_num}\n")

The last block of code above uses `open` to open a file for writing (which will
overwrite anything that was already there) and then converts each integer in the array into
a string and writes that to the file, followed by a newline `\n`.  The `with` statement
handles closing the file safely once you've left the `with` context block.

Check that this works by launching a text editor to look at the file:

```
cd ~/eosc211/week12
start simplefile.txt  (for windows)
open simplefile.txt  (for macs)
```

or using the text file editor on our jupyterhub.
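
You can also check the file from python.  The sketch below repeats the write and then reads the numbers back in, converting each line from a string to an integer by hand.  It assumes a temporary folder (from the standard library `tempfile` module) in place of `text_dir` so that it runs anywhere, and `read_vec` is a name made up for this example.

```python
import tempfile
from pathlib import Path
import numpy as np

# assumption: a temporary folder stands in for text_dir
with tempfile.TemporaryDirectory() as tmp:
    filename = Path(tmp) / 'simplefile.txt'
    simple_vec = np.arange(3, 40, 2)
    with open(filename, 'w') as outfile:
        for the_num in simple_vec:
            outfile.write(f"{the_num}\n")

    # every line arrives as a string, so convert each one back to an int
    with open(filename) as infile:
        read_vec = np.array([int(line) for line in infile])

print(np.array_equal(simple_vec, read_vec))
```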

## An easier way


That's quite a bit of work to write out a few numbers.  Python provides a variety of modules
that allow you to skip all this bookkeeping, as long as your data consists of basic python types
(strings, numbers, lists, tuples, or dictionaries) and you want to save using a common format (like csv).
One very common general text file format is `json` (javascript object notation).  Here is the same
kind of write in json, this time using a 3-d array.  We first need to turn the array into
a list, then dump the list to the file using [json.dump](https://docs.python.org/3/library/json.html).

In [2]:
import json
#
# make a 3-d array
#
simple_vec = np.arange(0,100)
simple_vec = simple_vec.reshape(5,5,4)
#
# write out as a json list
#
json_file = text_dir / 'simplefile.json'
with open(json_file, 'w') as json_out:
    my_list = simple_vec.tolist()
    json.dump(my_list, json_out)

Take a look at that file, and double check that you can read it back in. Note that the
numpy `tolist` method correctly handles the nested rows, and the numpy `array` constructor
correctly round-trips the list back to a numpy array.

In [3]:
with open(json_file) as json_in:
    new_vec = json.load(json_in)
new_vec = np.array(new_vec)
print(new_vec[:,:2,:2])

[[[ 0  1]
  [ 4  5]]

 [[20 21]
  [24 25]]

 [[40 41]
  [44 45]]

 [[60 61]
  [64 65]]

 [[80 81]
  [84 85]]]
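
To see why the `tolist` step is needed, here is a small sketch that uses `json.dumps` and `json.loads` (the string versions of `dump` and `load`) so that no file is involved.  Passing the raw array raises a `TypeError`, while the nested list round-trips cleanly.

```python
import json
import numpy as np

simple_vec = np.arange(0, 100).reshape(5, 5, 4)

# json only knows basic python types, so a raw numpy array is rejected
try:
    json.dumps(simple_vec)
except TypeError as err:
    print(f"json could not serialize the array: {err}")

# converting to nested lists first makes the round trip work
json_text = json.dumps(simple_vec.tolist())
new_vec = np.array(json.loads(json_text))
print(new_vec.shape, np.array_equal(simple_vec, new_vec))
```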


## Including metadata

You want to avoid writing out naked lists of raw numbers without extra information (called metadata) like
physical units, uncertainty estimates, etc.  You can do this with json using a dictionary that contains
your list along with other strings, lists and dictionaries if needed:

In [4]:
my_data = {}
my_data['units'] = 'deg C'
my_data['plot_title'] = "incubator temperature"
my_data['valid range'] = (-10,100)
my_data['missing_values'] = -999
my_data['comment']='first run incubator temperature test'
my_data['first_run'] = simple_vec.tolist()

In [5]:
meta_file = text_dir / 'metadata.json'
#
# indent 4 spaces
#
with open(meta_file,'w') as meta_out:
    json.dump(my_data, meta_out, indent=4)

Take a look at this file, and notice that the `indent=4` added nested spaces to make the various
levels more readable.  Since whitespace isn't significant in json files, those spaces will be ignored
when you read it back in:

In [6]:
with open(meta_file) as meta_in:
    new_dict = json.load(meta_in)
print(new_dict.keys())

dict_keys(['units', 'plot_title', 'valid range', 'missing_values', 'comment', 'first_run'])
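
One detail to watch for when reading metadata back: json has no tuple type, so a tuple such as the `valid range` entry comes back as a list.  The sketch below uses a shortened, made-up version of the dictionary and round-trips it through a string rather than a file.

```python
import json
import numpy as np

my_data = {
    'units': 'deg C',
    'valid range': (-10, 100),      # a tuple going in
    'missing_values': -999,
    'first_run': np.arange(0, 6).reshape(2, 3).tolist(),
}

new_dict = json.loads(json.dumps(my_data, indent=4))

# the tuple has become a list, everything else is unchanged
print(type(new_dict['valid range']), new_dict['valid range'])

# the nested list needs np.array to become an array again
first_run = np.array(new_dict['first_run'])
print(first_run.shape, new_dict['units'])
```

If your code needs a tuple, convert it back yourself with `tuple(new_dict['valid range'])`.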


## Summary

To write general data out to a text file you need to choose a text file format (we've seen csv and json, but
there are hundreds of different special formats) and then convert your data into datatypes that
the output file format can understand.  Since this is such a common task, almost all python objects have
some way to represent themselves as lists, dicts, or strings.

## Question 1

Put the following items in a dictionary, and write that dictionary to a json file called
`parameters.json` in your week12 folder:

1) A list of 3 datetime objects -- with the key `the_dates`  
2) the start and stop values for a slice, with keys `start` and `stop`


Recall that we introduced a couple of different ways to convert datetime objects to numbers or strings in
the [week 6 lab](https://phaustin.github.io/eosc211_students/lab_keys/week6_lab/lab_wk6.html#part-1-notes-on-time-series-data).  Using an integer or floating point number for the datetime, as produced by
`datetime.toordinal` or `datetime.timestamp`, saves space in a binary file, since a binary float64
uses just 8 bytes.  Using a string, as produced by `datetime.isoformat`, gives human readable dates
in a text file, at the cost of 19 bytes per date.  Feel free to use
whichever format you think fits best with your application.
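
As a quick comparison of the three representations, the sketch below converts a single date each way and then converts two of them back.  The variable names are made up for this example.  The timestamp value isn't checked, because for a datetime without a timezone it depends on your computer's local time setting.

```python
from datetime import datetime as dt

a_date = dt(2021, 3, 1)

iso_string = a_date.isoformat()     # human readable text
ordinal = a_date.toordinal()        # integer day count, drops the time of day
stamp = a_date.timestamp()          # float seconds since 1970, local time assumed

print(iso_string, len(iso_string))
print(ordinal, stamp)

# each representation has a matching function to get back to a datetime
print(dt.fromisoformat(iso_string) == a_date)
print(dt.fromordinal(ordinal) == a_date)
```

Note that `toordinal` only round-trips exactly here because the time of day is midnight.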

In [7]:
# solution

from datetime import datetime as dt
the_dates = [dt(2021,3,1), dt(2021,4,2),dt(2021,5,4)]
date_list = []
for a_date in the_dates:
    date_list.append(a_date.isoformat())

out_dict = {}
out_dict['start'] = 5
out_dict['stop'] = 20
out_dict['the_dates'] = date_list

param_file = text_dir / 'parameters.json'
with open(param_file,'w') as file_out:
    json.dump(out_dict,file_out)

## Question 2

Write a function called `process_params` that takes the filename (as a Path object) and returns a dictionary
holding a list of datetime objects (dictionary key: `the_dates`) converted from their isoformat strings, and the
start and stop values converted to a slice object stored in dictionary key `the_slice`.  
The datetime module provides the function `datetime.datetime.fromisoformat`
to go from iso strings back to datetimes.

In [8]:
# solution
from datetime import datetime as dt  # dt.fromisoformat is used below
def process_params(infile):
    with open(infile) as file_in:
        in_dict = json.load(file_in)
    the_dates = []
    for a_date in in_dict['the_dates']:
        the_dates.append(dt.fromisoformat(a_date))
    the_slice = slice(in_dict['start'],in_dict['stop'])
    out_dict = dict(the_dates=the_dates, the_slice = the_slice)
    return out_dict
 
params = process_params(param_file)
print(params)

{'the_dates': [datetime.datetime(2021, 3, 1, 0, 0), datetime.datetime(2021, 4, 2, 0, 0), datetime.datetime(2021, 5, 4, 0, 0)], 'the_slice': slice(5, 20, None)}
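
Once the start and stop values are stored as a slice object, you can use that object directly as an index.  A minimal sketch, with a hand-built dictionary standing in for the one returned by `process_params` and a made-up `data` array:

```python
import numpy as np

# hypothetical stand-in for the dictionary returned by process_params
params = {'the_slice': slice(5, 20)}

data = np.arange(0, 100)
subset = data[params['the_slice']]     # same as data[5:20]
print(subset[0], subset[-1], len(subset))
```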


## Postscript: writing binary files

This is all fine for text files, but what if you need to write out 3 Mbytes of binary arrays?  Converting
all of that to and from text is a major waste of cpu time and disk space.   For this course, the choice would
be to write out an npz file and accompany it with a json file holding whatever metadata
you want to include (especially the file name of the npz file, so you don't lose track).  Once you
get beyond this course, however, it's good to know that there is a common workflow.  Specifically,
we recommend the following steps:

1) Convert your numpy array to an [xarray Dataset](https://foundations.projectpythia.org/core/xarray/xarray.html)  
2) Use xarray to write out the data in a common binary format. The two most common binary data formats
   in the earth sciences are [netcdf](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html)
   and [zarr](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_zarr.html)
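
For the in-course option described above (an npz file plus a json file of metadata), a minimal sketch might look like the cell below.  The file names `incubator.npz` and `incubator.json`, the `npz_file` key and the `temp` array are all made up for illustration, and a temporary folder stands in for your week12 folder.

```python
import json
import tempfile
from pathlib import Path
import numpy as np

temp = np.linspace(-5.0, 30.0, 8)

# assumption: a temporary folder stands in for your week12 folder
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    npz_file = out_dir / 'incubator.npz'      # hypothetical file names
    meta_file = out_dir / 'incubator.json'

    # binary arrays go in the npz file
    np.savez(npz_file, temp=temp)

    # metadata, including the npz file name, goes in the json file
    metadata = {'units': 'deg C', 'npz_file': npz_file.name}
    with open(meta_file, 'w') as meta_out:
        json.dump(metadata, meta_out, indent=4)

    # to read: start from the json file and follow it to the npz file
    with open(meta_file) as meta_in:
        new_meta = json.load(meta_in)
    with np.load(out_dir / new_meta['npz_file']) as npz_in:
        new_temp = npz_in['temp']

print(new_meta['units'], np.array_equal(temp, new_temp))
```

Storing only the file name (not the full path) in the json file means the pair of files can be moved to another folder together without breaking the link between them.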