# Load And Dump Arrays, Sessions, Axes And Groups


LArray provides methods and functions to load and dump LArray, Session, Axis and Group objects to several formats such as Excel, CSV and HDF5. The HDF5 file format is designed to store and organize large amounts of data. It allows data to be read and written much faster than with CSV and Excel files.


In [None]:
# run this cell to avoid annoying warnings
import warnings
warnings.filterwarnings("ignore", message=r'.*numpy.dtype size changed.*')

In [None]:
# first of all, import the LArray library
from larray import *

Check the version of LArray:

In [None]:
from larray import __version__
__version__

## Loading and Dumping Arrays


### Loading Arrays - Basic Usage (CSV, Excel, HDF5)

To read an array from a CSV file, you must use the ``read_csv`` function:


In [None]:
csv_dir = get_example_filepath('examples')

# read the array pop from the file 'pop.csv'.
# The data of the array below is derived from a subset of the demo_pjan table from Eurostat
pop = read_csv(csv_dir + '/pop.csv')
pop

To read an array from a sheet of an Excel file, you can use the ``read_excel`` function:

In [None]:
filepath_excel = get_example_filepath('examples.xlsx')

# read the array from the sheet 'births' of the Excel file 'examples.xlsx'
# The data of the array below is derived from a subset of the demo_fasec table from Eurostat
births = read_excel(filepath_excel, 'births')
births

The ``open_excel`` function in combination with the ``load`` method allows you to load several arrays from the same Workbook without opening and closing it several times:


```python
# open the Excel file 'examples.xlsx' and keep it open for the duration of the indented block.
# The Python keyword ``with`` ensures that the Excel file is properly closed even if an error occurs
with open_excel(filepath_excel) as wb:
    # load the array 'pop' from the sheet 'pop' 
    pop = wb['pop'].load()
    # load the array 'births' from the sheet 'births'
    births = wb['births'].load()
    # load the array 'deaths' from the sheet 'deaths'
    deaths = wb['deaths'].load()

# the Workbook is automatically closed when leaving the block defined by the with statement
```

<div class="alert alert-warning">
  **Warning:** `open_excel` only works on Windows and requires the ``xlwings`` library to be installed.
</div>

The `HDF5` file format is specifically designed to store and organize large amounts of data. 
Reading and writing data in this file format is much faster than with CSV or Excel. 
An HDF5 file can contain multiple arrays, each array being associated with a key.
To read an array from an HDF5 file, you must use the ``read_hdf`` function and provide the key associated with the array:

In [None]:
filepath_hdf = get_example_filepath('examples.h5')

# read the array from the file 'examples.h5' associated with the key 'deaths'
# The data of the array below is derived from a subset of the demo_magec table from Eurostat
deaths = read_hdf(filepath_hdf, 'deaths')
deaths

### Dumping Arrays - Basic Usage (CSV, Excel, HDF5)

To write an array in a CSV file, you must use the ``to_csv`` method:


In [None]:
# save the array pop in the file 'pop.csv'
pop.to_csv('pop.csv')

To write an array to a sheet of an Excel file, you can use the ``to_excel`` method:

In [None]:
# save the array pop in the sheet 'pop' of the Excel file 'population.xlsx' 
pop.to_excel('population.xlsx', 'pop')

Note that ``to_excel`` creates a new Excel file if it does not exist yet. 
If the file already exists, a new sheet is added after the existing ones if that sheet does not already exist:


In [None]:
# add a new sheet 'births' to the file 'population.xlsx' and save the array births in it
births.to_excel('population.xlsx', 'births')

To reset an Excel file, you simply need to set the `overwrite_file` argument to True:


In [None]:
# 1. reset the file 'population.xlsx' (all sheets are removed)
# 2. create a sheet 'pop' and save the array pop in it
pop.to_excel('population.xlsx', 'pop', overwrite_file=True)

The ``open_excel`` function in combination with the ``dump()`` method allows you to open a Workbook and to export several arrays at once. If the Excel file doesn't exist, the ``overwrite_file`` argument must be set to True.

<div class="alert alert-warning">
  **Warning:** The ``save`` method must be called at the end of the block defined by the *with* statement to actually write data in the Excel file, otherwise you will end up with an empty file.
</div>


```python
# to create a new Excel file, argument overwrite_file must be set to True
with open_excel('population.xlsx', overwrite_file=True) as wb:
    # add a new sheet 'pop' and dump the array pop in it 
    wb['pop'] = pop.dump()
    # add a new sheet 'births' and dump the array births in it 
    wb['births'] = births.dump()
    # add a new sheet 'deaths' and dump the array deaths in it 
    wb['deaths'] = deaths.dump()
    # actually write data in the Workbook
    wb.save()
    
# the Workbook is automatically closed when leaving the block defined by the with statement
```

To write an array to an HDF5 file, you must use the ``to_hdf`` method and provide the key that will be associated with the array:

In [None]:
# save the array pop in the file 'population.h5' and associate it with the key 'pop'
pop.to_hdf('population.h5', 'pop')

### Specifying Wide VS Narrow format (CSV, Excel)

By default, all reading functions assume that arrays are stored in the ``wide`` format, meaning that their last axis is represented horizontally:

| country \\ time | 2013     | 2014     | 2015     |
| --------------- | -------- | -------- | -------- |
| Belgium         | 11137974 | 11180840 | 11237274 |
| France          | 65600350 | 65942267 | 66456279 |

By setting the ``wide`` argument to False, reading functions will assume instead that arrays are stored in the ``narrow`` format, i.e. one column per axis plus one value column:

| country | time | value    |
| ------- | ---- | -------- |
| Belgium | 2013 | 11137974 |
| Belgium | 2014 | 11180840 |
| Belgium | 2015 | 11237274 |
| France  | 2013 | 65600350 |
| France  | 2014 | 65942267 |
| France  | 2015 | 66456279 |


In [None]:
# set 'wide' argument to False to indicate that the array is stored in the 'narrow' format
pop_BE_FR = read_csv(csv_dir + '/pop_narrow_format.csv', wide=False)
pop_BE_FR

In [None]:
# same for the read_excel function
pop_BE_FR = read_excel(filepath_excel, sheet='pop_narrow_format', wide=False)
pop_BE_FR
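
To make the relationship between the two layouts concrete, the sketch below pivots the narrow rows shown above back into the wide layout using only the Python standard library. It is an illustration of the idea, not LArray's implementation; `narrow_to_wide` is a hypothetical helper and assumes that every label combination is present.

```python
from collections import defaultdict

# Rows of a narrow-format table: one column per axis plus one value column.
narrow_rows = [
    ("Belgium", 2013, 11137974),
    ("Belgium", 2014, 11180840),
    ("Belgium", 2015, 11237274),
    ("France", 2013, 65600350),
    ("France", 2014, 65942267),
    ("France", 2015, 66456279),
]


def narrow_to_wide(rows):
    """Pivot (row_label, column_label, value) records into a wide table.

    Returns the column labels and a dict mapping each row label to its
    list of values, both in order of first appearance.
    Hypothetical helper: assumes no label combination is missing.
    """
    columns = []
    table = defaultdict(dict)
    for row_label, col_label, value in rows:
        if col_label not in columns:
            columns.append(col_label)
        table[row_label][col_label] = value
    wide = {label: [cells[c] for c in columns] for label, cells in table.items()}
    return columns, wide


columns, wide = narrow_to_wide(narrow_rows)
print(columns)
for country, values in wide.items():
    print(country, values)
```

The last axis (`time`) ends up spread horizontally, which is what the ``wide`` format means.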

By default, writing functions will set the name of the column containing the data to 'value'. You can choose the name of this column by using the ``value_name`` argument. For example, using ``value_name='population'`` you can export the previous array as:

| country | time | population |
| ------- | ---- | ---------- |
| Belgium | 2013 | 11137974   |
| Belgium | 2014 | 11180840   |
| Belgium | 2015 | 11237274   |
| France  | 2013 | 65600350   |
| France  | 2014 | 65942267   |
| France  | 2015 | 66456279   |


In [None]:
# dump the array pop_BE_FR in a narrow format (one column per axis plus one value column).
# By default, the name of the column containing data is set to 'value'
pop_BE_FR.to_csv('pop_narrow_format.csv', wide=False)

# same but replace 'value' by 'population'
pop_BE_FR.to_csv('pop_narrow_format.csv', wide=False, value_name='population')

In [None]:
# same for the to_excel method
pop_BE_FR.to_excel('population.xlsx', 'pop_narrow_format', wide=False, value_name='population')
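
As a rough picture of what a narrow export contains, the following standard-library sketch writes the same table as CSV text with a configurable value column name. `wide_to_narrow_csv` and its parameters are hypothetical names chosen for this illustration; the code does not call LArray and only mimics the layout described above.

```python
import csv
import io


def wide_to_narrow_csv(row_axis, col_axis, col_labels, wide_rows, value_name="value"):
    """Return CSV text in narrow format for a 2D wide table (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # header: one column per axis plus the value column
    writer.writerow([row_axis, col_axis, value_name])
    for row_label, values in wide_rows:
        for col_label, value in zip(col_labels, values):
            writer.writerow([row_label, col_label, value])
    return buffer.getvalue()


text = wide_to_narrow_csv(
    "country", "time", [2013, 2014, 2015],
    [("Belgium", [11137974, 11180840, 11237274]),
     ("France", [65600350, 65942267, 66456279])],
    value_name="population",
)
print(text)
```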

Like with the ``to_excel`` method, it is possible to export arrays in a ``narrow`` format using ``open_excel``. 
To do so, you must set the ``wide`` argument of the ``dump`` method to False:


```python
with open_excel('population.xlsx') as wb:
    # dump the array pop_BE_FR in a narrow format: 
    # one column per axis plus one value column.
    # Argument value_name can be used to change the name of the 
    # column containing the data (default name is 'value')
    wb['pop_narrow_format'] = pop_BE_FR.dump(wide=False, value_name='population')
    # don't forget to call save()
    wb.save()

# in the sheet 'pop_narrow_format', data is written as:
# | country | time | population |
# | ------- | ---- | ---------- |
# | Belgium | 2013 | 11137974   |
# | Belgium | 2014 | 11180840   |
# | Belgium | 2015 | 11237274   |
# | France  | 2013 | 65600350   |
# | France  | 2014 | 65942267   |
# | France  | 2015 | 66456279   |
```

### Specifying Position in Sheet (Excel)

If you want to read an array from an Excel sheet which does not start at cell `A1` (when there is more than one array stored in the same sheet for example), you will need to use the ``range`` argument. 

<div class="alert alert-warning">
  **Warning:** Note that the ``range`` argument is only available if you have the library ``xlwings`` installed (Windows).
</div>

```python
# the 'range' argument must be used to load data not starting at cell A1.
# This is useful when there are several arrays stored in the same sheet
births = read_excel(filepath_excel, sheet='pop_births_deaths', range='A9:E15')
```

Using ``open_excel``, ranges are passed in brackets:

```python
with open_excel(filepath_excel) as wb:
    # store sheet 'pop_births_deaths' in a temporary variable sh
    sh = wb['pop_births_deaths']
    # load the array pop from range A1:E7
    pop = sh['A1:E7'].load()
    # load the array births from range A9:E15
    births = sh['A9:E15'].load()
    # load the array deaths from range A17:E23
    deaths = sh['A17:E23'].load()

# the Workbook is automatically closed when leaving the block defined by the with statement
```
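
When choosing a range, it can help to check how many rows and columns it covers. The sketch below is a small standard-library helper for that purpose; `column_index` and `range_shape` are hypothetical names and are not part of LArray or xlwings.

```python
import re


def column_index(letters):
    """Convert Excel column letters to a zero-based index (A -> 0, Z -> 25, AA -> 26)."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def range_shape(cell_range):
    """Return (n_rows, n_columns) covered by a range such as 'A9:E15'."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", cell_range)
    if match is None:
        raise ValueError(f"not a valid cell range: {cell_range!r}")
    first_col, first_row, last_col, last_row = match.groups()
    n_rows = int(last_row) - int(first_row) + 1
    n_columns = column_index(last_col) - column_index(first_col) + 1
    return n_rows, n_columns


for cell_range in ("A1:E7", "A9:E15", "A17:E23"):
    print(cell_range, range_shape(cell_range))
```

The three ranges used above have the same shape (7 rows by 5 columns) and are separated by one empty row.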

When exporting arrays to Excel files, data is written starting at cell `A1` by default. Using the ``position`` argument of the ``to_excel`` method, it is possible to specify the top left cell of the dumped data. This can be useful when you want to export several arrays in the same sheet, for example:

<div class="alert alert-warning">
  **Warning:** Note that the ``position`` argument is only available if you have the library ``xlwings`` installed (Windows).
</div>

```python
filename = 'population.xlsx'
sheetname = 'pop_births_deaths'

# save the arrays pop, births and deaths in the same sheet 'pop_births_deaths'.
# The 'position' argument is used to shift the location of the second and third arrays to be dumped
pop.to_excel(filename, sheetname)
births.to_excel(filename, sheetname, position='A9')
deaths.to_excel(filename, sheetname, position='A17')
```

Using ``open_excel``, the position is passed in brackets (this also allows you to add extra information): 


```python
with open_excel('population.xlsx') as wb:
    # add a new sheet 'pop_births_deaths' and write 'population' in the first cell
    # note: you can use wb['new_sheet_name'] = '' to create an empty sheet
    wb['pop_births_deaths'] = 'population'
    # store sheet 'pop_births_deaths' in a temporary variable sh
    sh = wb['pop_births_deaths']
    # dump the array pop in sheet 'pop_births_deaths' starting at cell A2
    sh['A2'] = pop.dump()
    # add 'births' in cell A10
    sh['A10'] = 'births'
    # dump the array births in sheet 'pop_births_deaths' starting at cell A11 
    sh['A11'] = births.dump()
    # add 'deaths' in cell A19
    sh['A19'] = 'deaths'
    # dump the array deaths in sheet 'pop_births_deaths' starting at cell A20
    sh['A20'] = deaths.dump()
    # don't forget to call save()
    wb.save()
    
# the Workbook is automatically closed when leaving the block defined by the with statement
```
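
The cell addresses used in the two examples above follow from simple arithmetic. The sketch below assumes that each dumped array occupies 7 rows (as the ranges `A1:E7`, `A9:E15` and `A17:E23` read earlier suggest); `stacked_positions` is a hypothetical helper, not an LArray function.

```python
def stacked_positions(block_heights, column="A", first_row=1, gap=1):
    """Return the top-left cell of each block when stacking blocks vertically.

    block_heights: number of rows taken by each block (header included)
    gap: number of rows left between two consecutive blocks
    """
    positions = []
    row = first_row
    for height in block_heights:
        positions.append(f"{column}{row}")
        row += height + gap
    return positions


# arrays separated by one empty row, as with the 'position' argument of to_excel
print(stacked_positions([7, 7, 7]))
# one title row above each array plus one empty row, as in the open_excel example
print(stacked_positions([7, 7, 7], first_row=2, gap=2))
```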

### Exporting data without headers (Excel)

For some reason, you may want to export only the data of an array without its axes. For example, you may want to insert a new column containing extra information. As an exercise, let us assume that we want to add the capital city of each country present in the array containing the total population by country:

| country | capital city | 2013     | 2014     | 2015     |
| ------- | ------------ | -------- | -------- | -------- |
| Belgium | Brussels     | 11137974 | 11180840 | 11237274 |
| France  | Paris        | 65600350 | 65942267 | 66456279 |
| Germany | Berlin       | 80523746 | 80767463 | 81197537 |

Assuming you have prepared an Excel sheet as below: 

| country | capital city | 2013     | 2014     | 2015     |
| ------- | ------------ | -------- | -------- | -------- |
| Belgium | Brussels     |          |          |          |
| France  | Paris        |          |          |          |
| Germany | Berlin       |          |          |          |

you can then dump the data at the right place by setting the ``header`` argument of ``to_excel`` to False and specifying the position of the data in the sheet:


```python
pop_by_country = pop.sum('gender')

# export only the data of the array pop_by_country starting at cell C2
pop_by_country.to_excel('population.xlsx', 'pop_by_country', header=False, position='C2')
```
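
The effect of exporting data without headers can be pictured with a plain list of lists standing in for the prepared sheet. This is only a sketch of the idea with a hypothetical `paste` helper; it does not use LArray or Excel.

```python
def paste(sheet, top, left, block):
    """Write a 2D block into a list-of-lists sheet.

    (top, left) is the zero-based position of the top-left cell of the block.
    """
    for i, row in enumerate(block):
        for j, value in enumerate(row):
            sheet[top + i][left + j] = value


# the prepared sheet: headers and label columns are already filled in
sheet = [
    ["country", "capital city", 2013, 2014, 2015],
    ["Belgium", "Brussels", None, None, None],
    ["France", "Paris", None, None, None],
    ["Germany", "Berlin", None, None, None],
]
# only the data, without any axis labels
data = [
    [11137974, 11180840, 11237274],
    [65600350, 65942267, 66456279],
    [80523746, 80767463, 81197537],
]
# cell C2 corresponds to row 1, column 2 (zero-based)
paste(sheet, top=1, left=2, block=data)
for row in sheet:
    print(row)
```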

Using ``open_excel``, you can easily prepare the sheet and then export only the data at the right place, either by setting the ``header`` argument of the ``dump`` method to False or by not calling ``dump`` at all:


```python
with open_excel('population.xlsx') as wb:
    # create new empty sheet 'pop_by_country'
    wb['pop_by_country'] = ''
    # store sheet 'pop_by_country' in a temporary variable sh
    sh = wb['pop_by_country']
    # write extra information (description)
    sh['A1'] = 'Population at 1st January by country'
    # export column names
    sh['A2'] = ['country', 'capital city']
    sh['C2'] = pop_by_country.time.labels
    # export countries as first column
    sh['A3'].options(transpose=True).value = pop_by_country.country.labels
    # export capital cities as second column
    sh['B3'].options(transpose=True).value = ['Brussels', 'Paris', 'Berlin']
    # export only data of pop_by_country
    sh['C3'] = pop_by_country.dump(header=False)
    # or equivalently
    sh['C3'] = pop_by_country
    # don't forget to call save()
    wb.save()
    
# the Workbook is automatically closed when leaving the block defined by the with statement
```

### Specifying the Number of Axes at Reading (CSV, Excel)

By default, ``read_csv`` and ``read_excel`` search for the first cell containing the special character ``\`` in the header line in order to determine the number of axes of the array to read. The special character ``\`` separates the names of the last two axes. If there is no special character ``\``, ``read_csv`` and ``read_excel`` will consider that the array to read has only one dimension. For an array stored as:

| country | gender \\ time | 2013     | 2014     | 2015     |
| ------- | -------------- | -------- | -------- | -------- |
| Belgium | Male           | 5472856  | 5493792  | 5524068  |
| Belgium | Female         | 5665118  | 5687048  | 5713206  |
| France  | Male           | 31772665 | 31936596 | 32175328 |
| France  | Female         | 33827685 | 34005671 | 34280951 |
| Germany | Male           | 39380976 | 39556923 | 39835457 |
| Germany | Female         | 41142770 | 41210540 | 41362080 |

``read_csv`` and ``read_excel`` will find the special character ``\`` in the second cell, meaning that they expect three axes (country, gender and time). 
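
The rule described above can be sketched in a few lines of plain Python. `infer_nb_axes` is a hypothetical function written for this illustration; it mirrors the description in the text and is not LArray's actual code.

```python
def infer_nb_axes(header_cells):
    """Guess the number of axes from the header line of a wide-format table."""
    for position, cell in enumerate(header_cells):
        if "\\" in str(cell):
            # each cell before the marker names one axis;
            # the cell holding the marker names the last two axes
            return position + 2
    # no marker found: the array is considered one-dimensional
    return 1


print(infer_nb_axes(["country", "gender\\time", 2013, 2014, 2015]))  # 3
print(infer_nb_axes(["country\\time", 2013, 2014, 2015]))            # 2
print(infer_nb_axes(["country", "gender", 2013, 2014, 2015]))        # 1
```

The last call shows why a header without the ``\`` character needs extra information to be read as a multi-dimensional array.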

Sometimes, you need to read an array for which the name of the last axis is implicit: 

| country | gender | 2013     | 2014     | 2015     |
| ------- | ------ | -------- | -------- | -------- |
| Belgium | Male   | 5472856  | 5493792  | 5524068  |
| Belgium | Female | 5665118  | 5687048  | 5713206  |
| France  | Male   | 31772665 | 31936596 | 32175328 |
| France  | Female | 33827685 | 34005671 | 34280951 |
| Germany | Male   | 39380976 | 39556923 | 39835457 |
| Germany | Female | 41142770 | 41210540 | 41362080 |

In such a case, you have to tell ``read_csv`` and ``read_excel`` the number of axes of the output array by setting the ``nb_axes`` argument:

In [None]:
# read the 3 x 2 x 3 array stored in the file 'pop_missing_axis_name.csv' without using the 'nb_axes' argument.
pop = read_csv(csv_dir + '/pop_missing_axis_name.csv')
# shape and data type of the output array are not what we expected
pop.info

In [None]:
# by setting the 'nb_axes' argument, you can indicate to read_csv the number of axes of the output array
pop = read_csv(csv_dir + '/pop_missing_axis_name.csv', nb_axes=3)

# give a name to the last axis
pop = pop.rename(-1, 'time')

# shape and data type of the output array are what we expected
pop.info

In [None]:
# same for the read_excel function
pop = read_excel(filepath_excel, sheet='pop_missing_axis_name', nb_axes=3)
pop = pop.rename(-1, 'time')
pop.info

### NaNs and Missing Data Handling at Reading (CSV, Excel)

Sometimes, there is no data available for some label combinations. In the example below, the rows corresponding to `France - Male` and `Germany - Female` are missing:

| country | gender \\ time | 2013     | 2014     | 2015     |
| ------- | -------------- | -------- | -------- | -------- |
| Belgium | Male           | 5472856  | 5493792  | 5524068  |
| Belgium | Female         | 5665118  | 5687048  | 5713206  |
| France  | Female         | 33827685 | 34005671 | 34280951 |
| Germany | Male           | 39380976 | 39556923 | 39835457 |

By default, ``read_csv`` and ``read_excel`` will fill cells associated with missing label combinations with nans. 
Be aware that, in that case, an int array will be converted to a float array.

In [None]:
# by default, cells associated with missing label combinations are filled with nans.
# In that case, the output array is converted to a float array
read_csv(csv_dir + '/pop_missing_values.csv')

However, it is possible to choose which value to use to fill missing cells using the ``fill_value`` argument:

In [None]:
read_csv(csv_dir + '/pop_missing_values.csv', fill_value=0)

In [None]:
# same for the read_excel function
read_excel(filepath_excel, sheet='pop_missing_values', fill_value=0)
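
The filling behaviour can be illustrated without LArray. The sketch below builds every country/gender combination from the rows of the table above and fills the absent ones with a chosen value; `complete` and `unique_in_order` are hypothetical helpers used only for this illustration.

```python
import math
from itertools import product

# rows actually present in the file: (country, gender) -> values for 2013, 2014, 2015
records = {
    ("Belgium", "Male"): [5472856, 5493792, 5524068],
    ("Belgium", "Female"): [5665118, 5687048, 5713206],
    ("France", "Female"): [33827685, 34005671, 34280951],
    ("Germany", "Male"): [39380976, 39556923, 39835457],
}


def unique_in_order(labels):
    """Remove duplicates while keeping the order of first appearance."""
    return list(dict.fromkeys(labels))


def complete(records, n_columns, fill_value=math.nan):
    """Return a table holding every label combination, filling the missing ones."""
    countries = unique_in_order(country for country, _ in records)
    genders = unique_in_order(gender for _, gender in records)
    return {
        key: records.get(key, [fill_value] * n_columns)
        for key in product(countries, genders)
    }


filled_nan = complete(records, 3)
filled_zero = complete(records, 3, fill_value=0)
print(filled_nan[("France", "Male")])
print(filled_zero[("France", "Male")])
```

Since nan is a floating point value, an array that contains it cannot keep an integer data type, which is why the conversion to float happens by default. Filling with an integer such as 0 avoids that.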

### Sorting Axes at Reading (CSV, Excel, HDF5)

The ``sort_rows`` and ``sort_columns`` arguments of the reading functions allow you to sort rows and columns alphabetically:

In [None]:
# sort labels at reading --> Male and Female labels are inverted
read_csv(csv_dir + '/pop.csv', sort_rows=True)

In [None]:
read_excel(filepath_excel, sheet='births', sort_rows=True)

In [None]:
read_hdf(filepath_hdf, key='deaths').sort_axes()
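
To see why the `Male` and `Female` labels end up inverted, here is a minimal plain-Python sketch that sorts a few rows (taken from the country/gender table shown earlier) on their labels. It only mimics the idea of sorting rows and does not call LArray; note that Python compares strings by code point, which matches alphabetical order for these labels.

```python
rows = [
    ("Belgium", "Male", [5472856, 5493792, 5524068]),
    ("Belgium", "Female", [5665118, 5687048, 5713206]),
    ("France", "Male", [31772665, 31936596, 32175328]),
    ("France", "Female", [33827685, 34005671, 34280951]),
]

# sort on the row labels (country first, then gender); values follow their labels
sorted_rows = sorted(rows, key=lambda row: (row[0], row[1]))
for country, gender, values in sorted_rows:
    print(country, gender, values)
```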

### Metadata (HDF5)

Since version 0.29 of LArray, it is possible to add metadata to arrays:

In [None]:
pop.meta.title = 'Population at 1st January'
pop.meta.origin = 'Table demo_pjan from Eurostat'

pop.info

These metadata are automatically saved and loaded when working with the HDF5 file format:  

In [None]:
pop.to_hdf('population.h5', 'pop')

new_pop = read_hdf('population.h5', 'pop')
new_pop.info

<div class="alert alert-warning">
  **Warning:** Currently, metadata associated with arrays cannot be saved and loaded when working with CSV and Excel files.
  This restriction does not apply however to metadata associated with sessions.
</div>

## Loading and Dumping Sessions

One of the main advantages of grouping arrays, axes and groups in session objects is that you can load and save all of them in one shot. As with arrays, it is possible to associate metadata with a session. These can be saved and loaded in all file formats. 

### Loading Sessions (CSV, Excel, HDF5)

To load the items of a session, you have two options:

1) Instantiate a new session and pass the path to the Excel/HDF5 file or to the directory containing CSV files to the Session constructor:

In [None]:
# create a new Session object and load all arrays, axes, groups and metadata 
# from all CSV files located in the passed directory
csv_dir = get_example_filepath('demography_eurostat')
session = Session(csv_dir)

# create a new Session object and load all arrays, axes, groups and metadata
# stored in the passed Excel file
filepath_excel = get_example_filepath('demography_eurostat.xlsx')
session = Session(filepath_excel)

# create a new Session object and load all arrays, axes, groups and metadata
# stored in the passed HDF5 file
filepath_hdf = get_example_filepath('demography_eurostat.h5')
session = Session(filepath_hdf)

print(session.summary())

2) Call the ``load`` method on an existing session and pass the path to the Excel/HDF5 file or to the directory containing CSV files as first argument:

In [None]:
# create a session containing 3 axes, 2 groups and one array 'pop'
filepath = get_example_filepath('pop_only.xlsx')
session = Session(filepath)

print(session.summary())

In [None]:
# call the load method on the previous session and add the 'births' and 'deaths' arrays to it
filepath = get_example_filepath('births_and_deaths.xlsx')
session.load(filepath)

print(session.summary())

The ``load`` method offers some options:

1) Using the ``names`` argument, you can specify which items to load:

In [None]:
session = Session()

# use the names argument to only load births and deaths arrays
session.load(filepath_hdf, names=['births', 'deaths'])

print(session.summary())

2) Setting the ``display`` argument to True, the ``load`` method will print a message each time a new item is loaded:  

In [None]:
session = Session()

# with display=True, the load method will print a message
# each time a new item is loaded
session.load(filepath_hdf, display=True)

### Dumping Sessions (CSV, Excel, HDF5)

To save a session, you need to call the ``save`` method. The first argument is the path to an Excel/HDF5 file or to a directory if items are saved to CSV files:

In [None]:
# save items of a session in CSV files.
# Here, the save method will create a 'population' directory in which CSV files will be written 
session.save('population')

# save session to an HDF5 file
session.save('population.h5')

# save session to an Excel file
session.save('population.xlsx')

# load session saved in 'population.h5' to see its content
Session('population.h5')

<div class="alert alert-info">
  Note: Concerning the CSV and Excel formats:  
  
  - all Axis objects are saved together in the same Excel sheet (CSV file) named `__axes__(.csv)`  
  - all Group objects are saved together in the same Excel sheet (CSV file) named `__groups__(.csv)`  
  - metadata is saved in one Excel sheet (CSV file) named `__metadata__(.csv)`  
  
  These sheet (CSV file) names cannot be changed. 
</div>

The ``save`` method has several arguments:

1) Using the ``names`` argument, you can specify which items to save:

In [None]:
# use the names argument to only save births and deaths arrays
session.save('population.h5', names=['births', 'deaths'])

# load session saved in 'population.h5' to see its content
Session('population.h5')

2) By default, dumping a session to an Excel or HDF5 file will overwrite it. By setting the ``overwrite`` argument to False, you can choose to update the existing Excel or HDF5 file: 

In [None]:
pop = read_csv('./population/pop.csv')
ses_pop = Session([('pop', pop)])

# by setting overwrite to False, the destination file is updated instead of overwritten.
# The items already stored in the file but not present in the session are left intact. 
# On the contrary, the items that exist in both the file and the session are completely overwritten.
ses_pop.save('population.h5', overwrite=False)

# load session saved in 'population.h5' to see its content
Session('population.h5')

3) Setting the ``display`` argument to True, the ``save`` method will print a message each time an item is dumped:  

In [None]:
# with display=True, the save method will print a message
# each time an item is dumped
session.save('population.h5', display=True)
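
The difference between overwriting and updating a file, as described for the ``overwrite`` argument above, can be pictured with plain dictionaries standing in for the stored file and the session. `save_items` is a hypothetical function for illustration only and does not reflect how LArray writes files.

```python
def save_items(stored, session_items, overwrite=True):
    """Return the content of the 'file' after saving session_items into it."""
    if overwrite:
        # the previous content is discarded
        return dict(session_items)
    # items only present in the file are kept, items present in both are replaced
    updated = dict(stored)
    updated.update(session_items)
    return updated


# the file holds births and deaths, the session only holds pop
stored = {"births": "births array", "deaths": "deaths array"}
session_items = {"pop": "pop array"}

print(sorted(save_items(stored, session_items, overwrite=False)))
print(sorted(save_items(stored, session_items, overwrite=True)))
```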