# Working with Datafiles

In [1]:
from PyDatSci.FileReport import FileReport
import PyDatSci.FileReport as fr
import pickle

Let's read the file report we created in the `FileReport` notebook:

In [2]:
report = FileReport.read('/work/ch0636/eddy/pool/sims/cmip5/file-report-cmip5.fr')

In [None]:
print(report)

## Scan for NetCDF attributes

In [None]:
datafiles = fr.scan_ncattrs(report['.nc'])

We quickly dump the `datafiles` list, which now contains the NetCDF attributes, to a file.

In [None]:
# Use a context manager so the file is closed (and flushed) after writing
with open('datafiles-cmip5', 'wb') as f:
    pickle.dump(datafiles, f)

## Sorting

Reload our datafiles:

In [None]:
with open('datafiles-cmip5', 'rb') as f:
    datafiles = pickle.load(f)

We can create a dictionary that groups the datafiles by an arbitrary list of attributes, e.g.:

In [None]:
sorted_files = fr.sort_by_attrs(datafiles, ['institute_id'])

In [None]:
from PyDatSci.tools import print_dict
print_dict(sorted_files)

Ok, there are files without an `institute_id`! Let's check them out:

In [None]:
for datafile in sorted_files['no institute_id']:
    print(datafile)

These are obviously not files that are supposed to conform to the CMOR standard.
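To illustrate how grouping by attributes with a fallback key such as `no institute_id` might work, here is a minimal sketch in plain Python. It is not the `PyDatSci` implementation: the function `group_by_attrs` and the `SimpleNamespace` stand-ins for `DataFile` objects are assumptions made for illustration only.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_attrs(items, attrs):
    """Group items into nested dicts, one nesting level per attribute name."""
    if not attrs:
        return list(items)
    first, rest = attrs[0], attrs[1:]
    groups = defaultdict(list)
    for item in items:
        # Items that lack the attribute are collected under a fallback key.
        key = getattr(item, first, None)
        if key is None:
            key = 'no ' + first
        groups[key].append(item)
    return {key: group_by_attrs(members, rest) for key, members in groups.items()}


# Hypothetical stand-ins for DataFile objects (not real CMIP5 files).
files = [
    SimpleNamespace(filename='a.nc', institute_id='MPI-M', model_id='MPI-ESM-LR'),
    SimpleNamespace(filename='b.nc', institute_id='MPI-M', model_id='MPI-ESM-MR'),
    SimpleNamespace(filename='c.nc'),  # carries no attributes at all
]

grouped = group_by_attrs(files, ['institute_id'])
print(sorted(grouped.keys()))
```

Passing a longer attribute list to the sketch simply adds one dictionary level per attribute, with the file lists as leaves.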

Let's sort our files in more detail, e.g.:

In [None]:
sorted_files = fr.sort_by_attrs(datafiles, ['institute_id','model_id','experiment_id','frequency','varname'])

The dictionary `sorted_files` now contains all datafiles sorted by a list of NetCDF attributes. To check out its contents, you can print the dictionary using:

In [None]:
print_dict(sorted_files)

You can see that this gives an overview of the available files, grouped by the NetCDF attributes that we defined above. Another way to walk through the nested dictionary is to check out its keys level by level.

Check out which `institute_id`s are available:

In [None]:
print(sorted_files.keys())

Assume we are interested in the files with `institute_id='MPI-M'`. Let's see what kind of `model_id`s we get:

In [None]:
print(sorted_files['MPI-M'].keys())

... and so on ...

In [None]:
print(sorted_files['MPI-M']['MPI-ESM-LR'].keys())

In [None]:
print(sorted_files['MPI-M']['MPI-ESM-LR']['historical'].keys())

In [None]:
print(sorted_files['MPI-M']['MPI-ESM-LR']['historical']['day'].keys())
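The chained lookups above can also be wrapped in a small helper that follows a list of keys. This is only a sketch: `get_path`, `available_keys`, and the miniature `example` dictionary are hypothetical names and data, not part of `PyDatSci`.

```python
def get_path(nested, keys):
    """Follow a sequence of keys into a nested dictionary."""
    node = nested
    for key in keys:
        node = node[key]
    return node


def available_keys(nested, keys=()):
    """Return the sorted keys found one level below the given key path."""
    return sorted(get_path(nested, keys).keys())


# Hypothetical miniature of a nested dictionary, with filename lists as leaves.
example = {
    'MPI-M': {
        'MPI-ESM-LR': {'historical': ['file1.nc'], 'rcp85': ['file2.nc']},
        'MPI-ESM-MR': {'historical': ['file3.nc']},
    },
}

print(available_keys(example))
print(available_keys(example, ['MPI-M']))
print(available_keys(example, ['MPI-M', 'MPI-ESM-LR']))
```

A missing key raises a `KeyError`, just like the direct lookups, which makes typos in a key path easy to spot.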

Now, let's see what the dictionary actually contains for us:

In [None]:
print(sorted_files['MPI-M']['MPI-ESM-LR']['historical']['day']['pr'])

We can see that the file list contains objects of type `DataFile`. This is basically a container that stores some meta information about the actual file and its NetCDF attributes. Let's print the filenames:

In [None]:
for datafile in sorted_files['MPI-M']['MPI-ESM-LR']['historical']['day']['pr']:
    print(datafile.filename)

We can see that `pr_mon_MPI-ESM-LR_historical_r1i1p1_18500101-20051231.nc` and `pr_year_MPI-ESM-LR_historical_r1i1p1_18500101-20051231.nc` have inconsistent NetCDF attributes and filenames: their filenames indicate different frequencies than the `frequency` attribute stored in the file.

We can check this, since we stored the NetCDF attributes in the `DataFile` container, e.g.

In [None]:
for datafile in sorted_files['MPI-M']['MPI-ESM-LR']['historical']['day']['pr']:
    print(datafile.filename)
    print(datafile.frequency)

In [None]:
yearly_file = sorted_files['MPI-M']['MPI-ESM-LR']['historical']['day']['pr'][0]
print(yearly_file.history)

The `yearly_file` is the result of several cdo commands that created yearly mean data from daily input data; it inherited the attribute `frequency=day` from that input. This might become a problem when we want to filter datafiles by frequency.
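Such mismatches could be flagged automatically by comparing the filename with the stored attribute. The sketch below assumes that the second underscore-separated token of the filename encodes the frequency, as in the filenames shown above. The helper names and the sample objects are hypothetical, and the `pr_day_...` filename is made up for contrast.

```python
import os
from types import SimpleNamespace


def frequency_from_filename(filename):
    """Return the second underscore-separated token of the basename, if any."""
    parts = os.path.basename(filename).split('_')
    return parts[1] if len(parts) > 1 else None


def find_inconsistent(datafiles):
    """Return the files whose filename token differs from their frequency attribute."""
    return [
        datafile for datafile in datafiles
        if frequency_from_filename(datafile.filename) != getattr(datafile, 'frequency', None)
    ]


# Hypothetical stand-ins for DataFile objects.
sample = [
    SimpleNamespace(
        filename='pr_day_MPI-ESM-LR_historical_r1i1p1_18500101-20051231.nc',
        frequency='day'),
    SimpleNamespace(
        filename='pr_year_MPI-ESM-LR_historical_r1i1p1_18500101-20051231.nc',
        frequency='day'),
]

for datafile in find_inconsistent(sample):
    print(datafile.filename, 'has attribute frequency =', datafile.frequency)
```

Only the `pr_year_...` file is reported here, since its filename token and its `frequency` attribute disagree.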

## Filtering

The sorting gives a good overview of what data is available. We can also define an arbitrary filter directly and apply it to the datafiles:

In [None]:
attr_filter = {'institute_id':'MPI-M', 'model_id':'MPI-ESM-LR', 'experiment_id':'historical', 'frequency':'day', 'varname':'pr'}
filtered = fr.filter_by_attrs(datafiles, **attr_filter)

which results in the same list of files as above:

In [None]:
for datafile in filtered:
    print(datafile.filename)
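Conceptually, a keyword-based filter like this keeps every object whose attributes match all of the given values. The following is a minimal sketch of that idea and not the `PyDatSci` implementation: `filter_items` and the sample objects are hypothetical.

```python
from types import SimpleNamespace


def filter_items(items, **criteria):
    """Keep the items whose attributes equal every given keyword value."""
    return [
        item for item in items
        if all(getattr(item, name, None) == value
               for name, value in criteria.items())
    ]


# Hypothetical stand-ins for DataFile objects.
sample = [
    SimpleNamespace(filename='a.nc', institute_id='MPI-M', varname='pr'),
    SimpleNamespace(filename='b.nc', institute_id='MPI-M', varname='tas'),
    SimpleNamespace(filename='c.nc', institute_id='MOHC', varname='pr'),
]

selected = filter_items(sample, institute_id='MPI-M', varname='pr')
print([item.filename for item in selected])

# A filter that matches nothing yields an empty list.
print(filter_items(sample, institute_id='GERICS'))
```

Objects that lack one of the requested attributes are treated as non-matching in this sketch.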

Are there any files from us?

In [None]:
gerics_files = fr.filter_by_attrs(datafiles, institute_id='GERICS')

In [None]:
print(gerics_files)

Obviously not! :(