![logo](../../images/logo_diive1_128px.png)

<span style='font-size:32px; display:block;'>
<b>
    Load and save Parquet file
</b>
</span>

---
**Notebook version**: `2` (24 Oct 2023)  
**Author**: Lukas Hörtnagl (holukas@ethz.ch)  

</br>

# **Background**

> Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...

source: https://parquet.apache.org/

- Parquet files are typically much smaller than equivalent CSV files and faster to load and save, because the data are stored column by column in a compressed binary format
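
The size difference can be checked on any dataframe. The following is a minimal sketch that uses plain `pandas` rather than `diive`. The helper names `make_example_frame` and `file_sizes` and the synthetic data are assumptions made for illustration, and `DataFrame.to_parquet` needs a Parquet engine such as `pyarrow` to be installed. Actual savings depend on the data.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def make_example_frame(n_rows: int = 20_000) -> pd.DataFrame:
    """Build a synthetic half-hourly dataframe (made-up values, illustration only)."""
    rng = np.random.default_rng(seed=42)
    index = pd.date_range("2024-01-01 00:30", periods=n_rows, freq="30min",
                          name="TIMESTAMP_END")
    return pd.DataFrame(
        {
            "RANDOM_VALUE": rng.normal(size=n_rows),  # varying measurements
            "CONSTANT_VALUE": 47.4189,                # repeated metadata value
            "FLAG": rng.integers(0, 3, size=n_rows),  # small integer flags
        },
        index=index,
    )


def file_sizes(df: pd.DataFrame, folder: Path) -> dict:
    """Write df as CSV and as Parquet into folder, return both file sizes in bytes."""
    csv_path = folder / "example.csv"
    parquet_path = folder / "example.parquet"
    df.to_csv(csv_path)
    df.to_parquet(parquet_path)  # raises ImportError if no Parquet engine is installed
    return {"csv": csv_path.stat().st_size, "parquet": parquet_path.stat().st_size}


with tempfile.TemporaryDirectory() as tmp:
    try:
        sizes = file_sizes(make_example_frame(), Path(tmp))
        print(f"CSV: {sizes['csv']} bytes, Parquet: {sizes['parquet']} bytes")
    except ImportError:
        print("No Parquet engine (e.g. pyarrow) installed, comparison skipped.")
```

The repeated metadata column and the small integer flags are the kind of content that column-wise encoding and compression handle well, which is why they are included in the synthetic frame.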

</br>

# **Imports**

In [1]:
import importlib.metadata
import os
from datetime import datetime
from pathlib import Path

from diive.core.io.filereader import MultiDataFileReader
from diive.core.io.files import save_parquet, load_parquet

version_diive = importlib.metadata.version("diive")
print(f"diive version: v{version_diive}")

diive version: v0.85.0


</br>

# **Docstring** of `save_parquet`

In [2]:
# help(save_parquet)

</br>

# **Docstring** of `load_parquet`

In [3]:
# help(load_parquet)

</br>

# **Input data**

## Source folder

In [4]:
DIR = Path(r"..\..\diive\configs\exampledata\EDDYPRO-FLUXNET-CSV-30MIN_multiple")

## Search files in folder

In [5]:
# Collect all CSV files in DIR whose filename contains 'fluxnet', sorted by name
filepaths = [DIR / f for f in sorted(os.listdir(DIR)) if f.endswith(".csv") and 'fluxnet' in f]
# Show the files that were found
for f in filepaths:
    print(f)

..\..\diive\configs\exampledata\EDDYPRO-FLUXNET-CSV-30MIN_multiple\eddypro_CH-HON_FR-20240818-090003_fluxnet_2024-08-18T090018_adv.csv
..\..\diive\configs\exampledata\EDDYPRO-FLUXNET-CSV-30MIN_multiple\eddypro_CH-HON_FR-20240819-090003_fluxnet_2024-08-19T090019_adv.csv
..\..\diive\configs\exampledata\EDDYPRO-FLUXNET-CSV-30MIN_multiple\eddypro_CH-HON_FR-20240820-090004_fluxnet_2024-08-20T090021_adv.csv
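
The same kind of search can also be written with `pathlib` alone. This is a small sketch, not part of `diive`: the helper name `find_files` and the example filenames are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_files(folder: Path, pattern: str = "*fluxnet*.csv") -> list[Path]:
    """Return all files in folder whose name matches pattern, sorted by name."""
    return sorted(p for p in folder.glob(pattern) if p.is_file())


# Demonstration in a temporary folder with made-up filenames
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["a_fluxnet_1.csv", "b_fluxnet_2.csv", "a_full_output.csv", "notes.txt"]:
        (folder / name).touch()
    for found in find_files(folder):
        print(found.name)
```

`Path.glob` matches against filenames only, so a folder name that happens to contain the search term does not affect the result.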


## Load and merge data from files
- Read the data files in `filepaths` using `diive`'s `MultiDataFileReader`
- All data files are in the `EDDYPRO-FLUXNET-CSV-30MIN` format

In [6]:
loaddatafile = MultiDataFileReader(filetype='EDDYPRO-FLUXNET-CSV-30MIN', 
                                   filepaths=filepaths,
                                   output_middle_timestamp=False)
df = loaddatafile.data_df

Reading file eddypro_CH-HON_FR-20240818-090003_fluxnet_2024-08-18T090018_adv.csv ...
Reading file eddypro_CH-HON_FR-20240819-090003_fluxnet_2024-08-19T090019_adv.csv ...
Reading file eddypro_CH-HON_FR-20240820-090004_fluxnet_2024-08-20T090021_adv.csv ...
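
Conceptually, merging several files means reading each one and combining the records along the time axis. The sketch below shows this idea with plain `pandas`. It is not how `MultiDataFileReader` is implemented, and the helper name `merge_sources`, the simplified two-column CSV layout, and the values are assumptions made for illustration. In particular, keeping the last record for a duplicated timestamp is a choice made here, not documented `diive` behavior.

```python
import io

import pandas as pd

# Two made-up CSV "files" that overlap in one timestamp
FILE_A = "TIMESTAMP_END,ZL\n2024-08-15 01:30:00,2.7\n2024-08-15 02:00:00,14.3\n"
FILE_B = "TIMESTAMP_END,ZL\n2024-08-15 02:00:00,14.3\n2024-08-15 02:30:00,3.2\n"


def merge_sources(sources) -> pd.DataFrame:
    """Read each CSV source and merge all records into one time-sorted dataframe."""
    frames = [pd.read_csv(src, index_col="TIMESTAMP_END", parse_dates=True)
              for src in sources]
    merged = pd.concat(frames)
    # If a timestamp occurs in more than one file, keep the record read last
    merged = merged[~merged.index.duplicated(keep="last")]
    return merged.sort_index()


merged = merge_sources([io.StringIO(FILE_A), io.StringIO(FILE_B)])
print(merged)
```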


## Check data

In [7]:
df

| TIMESTAMP_END | AIR_MV | AIR_DENSITY | AIR_RHO_CP | AIR_CP | AOA_METHOD | AXES_ROTATION_METHOD | BOWEN | BURBA_METHOD | BADM_LOCATION_LAT | BADM_LOCATION_LONG | BADM_LOCATION_ELEV | BADM_HEIGHTC | BADM_INST_SAMPLING_INT | BADM_INST_AVERAGING_INT | BADM_INST_MODEL_SA | ... | W_T_SONIC_COV_IBROM_N0004 | W_NUM_SPIKES | WD_FILTER_NREX | W_SPIKE_NREX | W_ABSLIM_NREX | W_VM97_TEST | W_LGD | W_KID | W_ZCD | W_ITC | W_ITC_TEST | WBOOST_APPLIED | WPL_APPLIED | ZL | ZL_UNCORR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-08-15 01:30:00 | 0.026289 | 1.10046 | 1119.00 | 1016.85 | 0 | 1 | -0.397400 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 1 | 0 | 1 | 0 | 801000011 | 0.0 | 3.92798 | 1626 | 16 | 2 | 0 | 1 | 2.708700 | 2.398160 |
| 2024-08-15 02:00:00 | 0.026281 | 1.10081 | 1119.08 | 1016.59 | 0 | 1 | -0.067049 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 1 | 0 | 1 | 0 | 800000111 | 0.0 | 5.27995 | 1161 | 87 | 5 | 0 | 1 | 14.255900 | 8.720120 |
| 2024-08-15 02:30:00 | 0.026254 | 1.10196 | 1120.16 | 1016.51 | 0 | 1 | -0.113645 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 0 | 0 | 0 | 0 | 801000011 | 0.0 | 4.23637 | 1109 | 13 | 1 | 0 | 1 | 3.178640 | 2.030050 |
| 2024-08-15 03:00:00 | 0.026242 | 1.10246 | 1120.76 | 1016.60 | 0 | 1 | -0.906105 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 1 | 0 | 3 | 0 | 800000011 | 0.0 | 5.62788 | 2537 | 5 | 1 | 0 | 1 | 4.078210 | 3.826030 |
| 2024-08-15 03:30:00 | 0.026237 | 1.10266 | 1121.05 | 1016.68 | 0 | 1 | -0.898270 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 0 | 0 | 0 | 0 | 800000011 | 0.0 | 4.53595 | 1836 | 2 | 1 | 0 | 1 | 3.585550 | 3.362940 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-08-19 17:00:00 | 0.026464 | 1.09337 | 1110.11 | 1015.31 | 0 | 1 | -0.976782 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 0 | 0 | 0 | 0 | 800000100 | 0.0 | 14.18050 | 134 | 41 | 3 | 0 | 1 | 0.057519 | 0.054155 |
| 2024-08-19 17:30:00 | 0.026443 | 1.09423 | 1111.11 | 1015.43 | 0 | 1 | -0.509692 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 3 | 0 | 3 | 0 | 800000100 | 0.0 | 11.72300 | 117 | 22 | 2 | 0 | 1 | 0.058328 | 0.051651 |
| 2024-08-19 18:00:00 | 0.026424 | 1.09501 | 1111.99 | 1015.51 | 0 | 1 | -0.281252 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 7 | 0 | 10 | 0 | 800000000 | 0.0 | 15.29790 | 175 | 15 | 1 | 0 | 1 | 0.032435 | 0.025775 |
| 2024-08-19 18:30:00 | 0.026384 | 1.09665 | 1113.67 | 1015.52 | 0 | 1 | -0.277239 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 6 | 0 | 9 | 0 | 800000100 | 0.0 | 13.36820 | 41 | 10 | 1 | 0 | 1 | 0.020879 | 0.016491 |


</br>

# **Save dataframe as parquet file**

## (1) Save to specific folder

In [8]:
filepath = save_parquet(outpath=r"F:\TMP", filename="output_file", data=df)

Saved file F:\TMP\output_file.parquet (14.830 seconds).


## (2) Save to same folder as notebook

In [9]:
# filepath = save_parquet(filename="output_file", data=df)

## Filepath

In [10]:
filepath

'F:\\TMP\\output_file.parquet'

</br>

# **Load data from parquet file**

In [11]:
data_from_parquet = load_parquet(filepath=filepath, output_middle_timestamp=False)

Loaded .parquet file F:\TMP\output_file.parquet (0.043 seconds).
    --> Detected time resolution of <30 * Minutes> / 30min 
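
The log output above reports the detected time resolution of the loaded data. One simple way to estimate the resolution of a timestamp index is to take the most frequent spacing between consecutive timestamps. The sketch below illustrates that idea with plain `pandas`; it is an assumption about one possible approach, not a description of how `diive` detects the resolution, and `detect_time_resolution` is a hypothetical helper name.

```python
import pandas as pd


def detect_time_resolution(index: pd.DatetimeIndex) -> pd.Timedelta:
    """Return the most frequent spacing between consecutive timestamps."""
    deltas = index.to_series().diff().dropna()
    return deltas.value_counts().idxmax()


# Half-hourly example index with one missing record (made-up timestamps)
index = pd.date_range("2024-08-15 01:30", periods=10, freq="30min").delete(4)
print(detect_time_resolution(index))
```

Using the most frequent spacing instead of the smallest or the mean keeps the estimate stable when a few records are missing.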


In [12]:
data_from_parquet

| TIMESTAMP_END | AIR_MV | AIR_DENSITY | AIR_RHO_CP | AIR_CP | AOA_METHOD | AXES_ROTATION_METHOD | BOWEN | BURBA_METHOD | BADM_LOCATION_LAT | BADM_LOCATION_LONG | BADM_LOCATION_ELEV | BADM_HEIGHTC | BADM_INST_SAMPLING_INT | BADM_INST_AVERAGING_INT | BADM_INST_MODEL_SA | ... | W_T_SONIC_COV_IBROM_N0004 | W_NUM_SPIKES | WD_FILTER_NREX | W_SPIKE_NREX | W_ABSLIM_NREX | W_VM97_TEST | W_LGD | W_KID | W_ZCD | W_ITC | W_ITC_TEST | WBOOST_APPLIED | WPL_APPLIED | ZL | ZL_UNCORR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-08-15 01:30:00 | 0.026289 | 1.10046 | 1119.00 | 1016.85 | 0 | 1 | -0.397400 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 1 | 0 | 1 | 0 | 801000011 | 0.0 | 3.92798 | 1626 | 16 | 2 | 0 | 1 | 2.708700 | 2.398160 |
| 2024-08-15 02:00:00 | 0.026281 | 1.10081 | 1119.08 | 1016.59 | 0 | 1 | -0.067049 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 1 | 0 | 1 | 0 | 800000111 | 0.0 | 5.27995 | 1161 | 87 | 5 | 0 | 1 | 14.255900 | 8.720120 |
| 2024-08-15 02:30:00 | 0.026254 | 1.10196 | 1120.16 | 1016.51 | 0 | 1 | -0.113645 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 0 | 0 | 0 | 0 | 801000011 | 0.0 | 4.23637 | 1109 | 13 | 1 | 0 | 1 | 3.178640 | 2.030050 |
| 2024-08-15 03:00:00 | 0.026242 | 1.10246 | 1120.76 | 1016.60 | 0 | 1 | -0.906105 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 1 | 0 | 3 | 0 | 800000011 | 0.0 | 5.62788 | 2537 | 5 | 1 | 0 | 1 | 4.078210 | 3.826030 |
| 2024-08-15 03:30:00 | 0.026237 | 1.10266 | 1121.05 | 1016.68 | 0 | 1 | -0.898270 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 0 | 0 | 0 | 0 | 800000011 | 0.0 | 4.53595 | 1836 | 2 | 1 | 0 | 1 | 3.585550 | 3.362940 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-08-19 17:00:00 | 0.026464 | 1.09337 | 1110.11 | 1015.31 | 0 | 1 | -0.976782 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 0 | 0 | 0 | 0 | 800000100 | 0.0 | 14.18050 | 134 | 41 | 3 | 0 | 1 | 0.057519 | 0.054155 |
| 2024-08-19 17:30:00 | 0.026443 | 1.09423 | 1111.11 | 1015.43 | 0 | 1 | -0.509692 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 3 | 0 | 3 | 0 | 800000100 | 0.0 | 11.72300 | 117 | 22 | 2 | 0 | 1 | 0.058328 | 0.051651 |
| 2024-08-19 18:00:00 | 0.026424 | 1.09501 | 1111.99 | 1015.51 | 0 | 1 | -0.281252 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 7 | 0 | 10 | 0 | 800000000 | 0.0 | 15.29790 | 175 | 15 | 1 | 0 | 1 | 0.032435 | 0.025775 |
| 2024-08-19 18:30:00 | 0.026384 | 1.09665 | 1113.67 | 1015.52 | 0 | 1 | -0.277239 | 0 | 47.4189 | 8.49131 | 527.0 | 0.5 | 20 | 30 |  | ... |  | 6 | 0 | 9 | 0 | 800000100 | 0.0 | 13.36820 | 41 | 10 | 1 | 0 | 1 | 0.020879 | 0.016491 |
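
Comparing the two printed tables by eye only covers the rows and columns that are displayed. A programmatic comparison checks every value. The following is a minimal sketch using `pandas.testing.assert_frame_equal`; the helper name `frames_identical` and the small example dataframe are assumptions made for illustration.

```python
import pandas as pd


def frames_identical(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True if both dataframes have equal index, columns, dtypes and values."""
    try:
        # check_freq=False: ignore the index 'freq' attribute, which can
        # differ after a round trip even when all timestamps are the same
        pd.testing.assert_frame_equal(a, b, check_freq=False)
    except AssertionError:
        return False
    return True


# Made-up example data with one missing value
original = pd.DataFrame(
    {"ZL": [2.7, None, 3.2]},
    index=pd.date_range("2024-08-15 01:30", periods=3, freq="30min", name="TIMESTAMP_END"),
)
print(frames_identical(original, original.copy()))
```

In this notebook the corresponding call would be `frames_identical(df, data_from_parquet)`. Missing values in the same position are treated as equal by `assert_frame_equal`.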


</br>

# **Finish**

In [13]:
dt_string = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Finished {dt_string}")

Finished 2025-01-24 23:44:13
