![lop](../../images/logo_diive1_128px.png)

<span style='font-size:32px; display:block;'>
<b>
    Load and save Parquet file
</b>
</span>

---
**Notebook version**: `2` (24 Oct 2023)  
**Author**: Lukas Hörtnagl (holukas@ethz.ch)  

</br>

# **Background**

> Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...

source: https://parquet.apache.org/

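- Parquet files are typically much smaller than the equivalent CSV files and faster to load and save, because the data are stored column by column in a compressed binary format instead of as plain text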

</br>

# **Imports**

In [1]:
import os
import importlib.metadata
from datetime import datetime
from pathlib import Path
from diive.core.io.files import save_parquet, load_parquet
from diive.core.io.filereader import MultiDataFileReader

version_diive = importlib.metadata.version("diive")
print(f"diive version: v{version_diive}")

diive version: v0.74.0


</br>

# **Docstring** of `save_parquet`

In [2]:
help(save_parquet)

Help on function save_parquet in module diive.core.io.files:

save_parquet(filename: str, data: pandas.core.frame.DataFrame, outpath: str = None) -> str
    Save pandas Series or DataFrame as parquet file
    
    Args:
        filename: str
            Name of the generated parquet file.
        data: pandas Series or DataFrame
        outpath: str or None
            If *None*, file is saved to system default folder. When used within
            a notebook, the file is saved in the same location as the notebook.
    
    Returns:
        str, filepath to parquet file



</br>

# **Docstring** of `load_parquet`

In [3]:
help(load_parquet)

Help on function load_parquet in module diive.core.io.files:

load_parquet(filepath: str) -> pandas.core.frame.DataFrame
    Load data from Parquet file to pandas DataFrame
    
    Args:
        filepath: str
            filepath to parquet file
    
    Returns:
        pandas DataFrame, data from Parquet file as pandas DataFrame



</br>

# **Input data**

## Source and output folder

In [8]:
DIR = Path(r'F:\Sync\luhk_work\TMP\FRU')

## Search files in folder

In [11]:
# Collect all CSV files in DIR whose filename contains 'fluxnet'
filepaths = sorted(f for f in DIR.glob("*.csv") if 'fluxnet' in f.name)
for f in filepaths:
    print(f)

F:\Sync\luhk_work\TMP\FRU\2020_eddypro_CH-FRU_FR-20230425-205325_fluxnet_2023-04-26T220030_adv.csv
F:\Sync\luhk_work\TMP\FRU\2021_eddypro_CH-FRU_FR-20230425-205131_fluxnet_2023-04-26T214014_adv.csv
F:\Sync\luhk_work\TMP\FRU\2022_eddypro_CH-FRU_FR-20230425-205042_fluxnet_2023-04-26T222351_adv.csv

## Load and merge data from files
- Read datafiles in `filepaths` using `diive`
- All datafiles are in the `EDDYPRO-FLUXNET-CSV-30MIN` format

In [12]:
loaddatafile = MultiDataFileReader(filetype='EDDYPRO-FLUXNET-CSV-30MIN', filepaths=filepaths)
df = loaddatafile.data_df

Reading file 2020_eddypro_CH-FRU_FR-20230425-205325_fluxnet_2023-04-26T220030_adv.csv ...
Reading file 2021_eddypro_CH-FRU_FR-20230425-205131_fluxnet_2023-04-26T214014_adv.csv ...
Reading file 2022_eddypro_CH-FRU_FR-20230425-205042_fluxnet_2023-04-26T222351_adv.csv ...
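
Conceptually, reading and merging several files amounts to parsing each file into a dataframe with a timestamp index and concatenating the results along the time axis. The following is a minimal plain-`pandas` sketch of that idea, not the implementation of `MultiDataFileReader`. The in-memory CSV snippets, the column `FC` and the helper `read_and_merge` are hypothetical.

```python
import io

import pandas as pd


def read_and_merge(sources, timestamp_col="TIMESTAMP_MIDDLE"):
    """Read CSV sources, concatenate them along the time axis and sort by timestamp."""
    frames = [pd.read_csv(src, index_col=timestamp_col, parse_dates=True) for src in sources]
    merged = pd.concat(frames, axis=0)
    # Drop duplicated timestamps (keep the first occurrence), then sort chronologically
    merged = merged[~merged.index.duplicated(keep="first")]
    return merged.sort_index()


# Two tiny in-memory "files" standing in for yearly CSV files (made-up values)
csv_2020 = io.StringIO("TIMESTAMP_MIDDLE,FC\n2020-12-31 23:15:00,1.0\n2020-12-31 23:45:00,2.0\n")
csv_2021 = io.StringIO("TIMESTAMP_MIDDLE,FC\n2021-01-01 00:15:00,3.0\n2021-01-01 00:45:00,4.0\n")

# The order of the sources does not matter because the result is sorted
merged_df = read_and_merge([csv_2021, csv_2020])
print(merged_df)
```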


## Check data

In [13]:
df

| TIMESTAMP_MIDDLE | AIR_MV | AIR_DENSITY | AIR_RHO_CP | AIR_CP | AOA_METHOD | AXES_ROTATION_METHOD | ... | W_ITC | W_ITC_TEST | WBOOST_APPLIED | WPL_APPLIED | ZL | ZL_UNCORR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-12-31 19:15:00 | 0.024907 | 1.16085 | 1169.83 | 1007.74 | 0.0 | 1.0 | ... | 6.0 | 1.0 | 0.0 | 1.0 | 3.079130 | 3.006370 |
| 2019-12-31 19:45:00 | 0.024865 | 1.16271 | 1171.77 | 1007.79 | 0.0 | 1.0 | ... | 47.0 | 3.0 | 0.0 | 1.0 | -0.096300 | -0.099549 |
| 2019-12-31 20:15:00 | 0.024910 | 1.16061 | 1169.68 | 1007.81 | 0.0 | 1.0 | ... | 27.0 | 2.0 | 0.0 | 1.0 | -0.780695 | -0.771121 |
| 2019-12-31 20:45:00 | 0.024886 | 1.16183 | 1170.78 | 1007.70 | 0.0 | 1.0 | ... | 21.0 | 2.0 | 0.0 | 1.0 | 0.540130 | 0.545173 |
| 2019-12-31 21:15:00 | 0.024874 | 1.16239 | 1171.37 | 1007.73 | 0.0 | 1.0 | ... | 9.0 | 1.0 | 0.0 | 1.0 | -0.135396 | -0.095764 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-12-31 22:45:00 | 0.025906 | 1.11442 | 1125.59 | 1010.02 | 0.0 | 1.0 | ... | 9.0 | 1.0 | 0.0 | 1.0 | 0.316949 | 0.309181 |
| 2022-12-31 23:15:00 | 0.025770 | 1.12033 | 1131.50 | 1009.97 | 0.0 | 1.0 | ... | 60.0 | 4.0 | 0.0 | 1.0 | -0.057874 | -0.056127 |
| 2022-12-31 23:45:00 | 0.025757 | 1.12077 | 1132.08 | 1010.09 | 0.0 | 1.0 | ... | 8.0 | 1.0 | 0.0 | 1.0 | 1.716370 | 1.677440 |
| 2023-01-01 00:15:00 | 0.026678 | 1.08509 | 1096.57 | 1010.58 | 0.0 | 1.0 | ... | 25.0 | 2.0 | 0.0 | 1.0 | 0.378159 | 0.374705 |


</br>

# **Save dataframe as Parquet file**

## (1) Save to specific folder

In [14]:
filepath = save_parquet(outpath=DIR, filename="CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN", data=df)

Saved file F:\Sync\luhk_work\TMP\FRU\CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.parquet (0.953 seconds).


## (2) Save to same folder as notebook

In [15]:
# filepath = save_parquet(filename="data", data=df)
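
According to the docstring of `save_parquet`, omitting `outpath` stores the file in the system default folder, which for a notebook is the notebook's own location. A minimal sketch of how such a fallback could be resolved with `pathlib` is shown here. `resolve_parquet_path` is a hypothetical helper for illustration and not part of `diive`, and using the current working directory as the default is an assumption.

```python
from pathlib import Path


def resolve_parquet_path(filename, outpath=None):
    """Build the full path of the output file, falling back to the working directory."""
    folder = Path(outpath) if outpath is not None else Path.cwd()
    # Append the extension to the name as a string: Path.with_suffix() would
    # cut off everything after the last dot in names such as "FP2022.5_30MIN"
    return folder / f"{filename}.parquet"


# No outpath given: the file would end up in the current working directory
print(resolve_parquet_path("data"))

# Explicit (hypothetical) output folder and a filename that contains a dot
print(resolve_parquet_path("CH-DAV_FP2022.5_30MIN", outpath="out"))
```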

## Filepath

In [16]:
filepath

'F:\\Sync\\luhk_work\\TMP\\FRU\\CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.parquet'

</br>

# **Load data from Parquet file**

In [17]:
data_from_parquet = load_parquet(filepath=filepath)

Loaded .parquet file F:\Sync\luhk_work\TMP\FRU\CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.parquet (0.127 seconds). Detected time resolution of <30 * Minutes> / 30min 
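
The load message reports a detected time resolution of 30 minutes. One simple way to detect the resolution of a time series is to take the most frequent difference between consecutive timestamps. The sketch below illustrates this with the standard library only. It is not necessarily how `diive` detects the resolution, and `detect_resolution` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime, timedelta


def detect_resolution(timestamps):
    """Return the most frequent difference between consecutive timestamps."""
    ordered = sorted(timestamps)
    deltas = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not deltas:
        raise ValueError("At least two timestamps are needed to detect a resolution.")
    return Counter(deltas).most_common(1)[0][0]


# Half-hourly timestamps with one missing record (a one-hour gap after the fifth record)
start = datetime(2019, 12, 31, 19, 15)
timestamps = [start + timedelta(minutes=30 * i) for i in range(10) if i != 5]

resolution = detect_resolution(timestamps)
print(resolution)
```

Using the most frequent difference instead of the first one keeps the result stable when individual records are missing.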


In [18]:
data_from_parquet

| TIMESTAMP_MIDDLE | AIR_MV | AIR_DENSITY | AIR_RHO_CP | AIR_CP | AOA_METHOD | AXES_ROTATION_METHOD | ... | W_ITC | W_ITC_TEST | WBOOST_APPLIED | WPL_APPLIED | ZL | ZL_UNCORR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-12-31 19:15:00 | 0.024907 | 1.16085 | 1169.83 | 1007.74 | 0.0 | 1.0 | ... | 6.0 | 1.0 | 0.0 | 1.0 | 3.079130 | 3.006370 |
| 2019-12-31 19:45:00 | 0.024865 | 1.16271 | 1171.77 | 1007.79 | 0.0 | 1.0 | ... | 47.0 | 3.0 | 0.0 | 1.0 | -0.096300 | -0.099549 |
| 2019-12-31 20:15:00 | 0.024910 | 1.16061 | 1169.68 | 1007.81 | 0.0 | 1.0 | ... | 27.0 | 2.0 | 0.0 | 1.0 | -0.780695 | -0.771121 |
| 2019-12-31 20:45:00 | 0.024886 | 1.16183 | 1170.78 | 1007.70 | 0.0 | 1.0 | ... | 21.0 | 2.0 | 0.0 | 1.0 | 0.540130 | 0.545173 |
| 2019-12-31 21:15:00 | 0.024874 | 1.16239 | 1171.37 | 1007.73 | 0.0 | 1.0 | ... | 9.0 | 1.0 | 0.0 | 1.0 | -0.135396 | -0.095764 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-12-31 22:45:00 | 0.025906 | 1.11442 | 1125.59 | 1010.02 | 0.0 | 1.0 | ... | 9.0 | 1.0 | 0.0 | 1.0 | 0.316949 | 0.309181 |
| 2022-12-31 23:15:00 | 0.025770 | 1.12033 | 1131.50 | 1009.97 | 0.0 | 1.0 | ... | 60.0 | 4.0 | 0.0 | 1.0 | -0.057874 | -0.056127 |
| 2022-12-31 23:45:00 | 0.025757 | 1.12077 | 1132.08 | 1010.09 | 0.0 | 1.0 | ... | 8.0 | 1.0 | 0.0 | 1.0 | 1.716370 | 1.677440 |
| 2023-01-01 00:15:00 | 0.026678 | 1.08509 | 1096.57 | 1010.58 | 0.0 | 1.0 | ... | 25.0 | 2.0 | 0.0 | 1.0 | 0.378159 | 0.374705 |


</br>

# **Finish**

In [19]:
dt_string = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Finished {dt_string}")

Finished 2024-04-20 19:05:25
