![logo](../../images/logo_diive1_128px.png)

<span style='font-size:32px; display:block;'>
<b>
    Load and save Parquet file
</b>
</span>

---
**Notebook version**: `2` (24 Oct 2023)  
**Author**: Lukas Hörtnagl (holukas@ethz.ch)  

</br>

# **Background**

> Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...

source: https://parquet.apache.org/

- Parquet files are typically much smaller than equivalent CSV files and faster to load and save, because the data are stored column by column in a compressed binary format

</br>

# **Imports**

In [None]:
import importlib.metadata
import os
from datetime import datetime
from pathlib import Path

from diive.core.io.filereader import MultiDataFileReader
from diive.core.io.files import save_parquet, load_parquet

version_diive = importlib.metadata.version("diive")
print(f"diive version: v{version_diive}")

</br>

# **Docstring** of `save_parquet`

In [None]:
# help(save_parquet)

</br>

# **Docstring** of `load_parquet`

In [None]:
# help(load_parquet)

</br>

# **Input data**

## Source and output folder

In [None]:
DIR = Path(r'F:\TMP\example')

## Search files in folder

In [None]:
# Collect all CSV files in DIR whose filename contains 'fluxnet'
# (sorted, because os.listdir returns entries in arbitrary order)
filepaths = sorted(DIR / f for f in os.listdir(DIR)
                   if f.endswith(".csv") and 'fluxnet' in f)
for f in filepaths:
    print(f)
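
The file search can also be written with `pathlib` alone. The sketch below is self-contained and runs in a temporary folder; the helper `find_files` and the filenames are hypothetical and not part of `diive`. Note that it matches `fluxnet` against the filename only, not against the full path.

```python
import tempfile
from pathlib import Path


def find_files(folder: Path, extension: str = ".csv", contains: str = "fluxnet") -> list[Path]:
    """Return sorted paths in `folder` with `extension` whose filename contains `contains`."""
    return sorted(p for p in folder.glob(f"*{extension}") if contains in p.name)


# Demonstration in a temporary folder with hypothetical filenames
with tempfile.TemporaryDirectory() as tmpdir:
    folder = Path(tmpdir)
    for name in ["site_fluxnet_2022.csv", "site_fluxnet_2023.csv",
                 "site_meteo_2023.csv", "notes_fluxnet.txt"]:
        (folder / name).touch()  # create empty placeholder files
    found = [p.name for p in find_files(folder)]

print(found)  # ['site_fluxnet_2022.csv', 'site_fluxnet_2023.csv']
```

In the notebook, the equivalent call would be `filepaths = find_files(DIR)`.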

## Load and merge data from files
- Read datafiles in `filepaths` using `diive`
- All datafiles are in the `EDDYPRO-FLUXNET-CSV-30MIN` format

In [None]:
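# Read all files in `filepaths` and merge them into one dataframe
loaddatafile = MultiDataFileReader(filetype='EDDYPRO-FLUXNET-CSV-30MIN',
                                   filepaths=filepaths,
                                   output_middle_timestamp=False)
# The merged data are available as a dataframe
df = loaddatafile.data_df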

## Check data

In [None]:
df

</br>

# **Save dataframe as Parquet file**

## (1) Save to specific folder

In [None]:
filepath = save_parquet(outpath=DIR, filename="output_file", data=df)

## (2) Save to same folder as notebook

In [None]:
# filepath = save_parquet(filename="output_file", data=df)

## Filepath

In [None]:
filepath

</br>

# **Load data from Parquet file**

In [None]:
data_from_parquet = load_parquet(filepath=filepath, output_middle_timestamp=False)

In [None]:
data_from_parquet

</br>

# **Finish**

In [None]:
dt_string = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Finished {dt_string}")