![logo](../../images/logo_diive1_128px.png)
# **Load and save Parquet file**

**Notebook creation date**: 11 Aug 2023  
**Author**: Lukas Hörtnagl (holukas@ethz.ch)

---
Example notebook for the functions `save_parquet` and `load_parquet` from `diive.core.io.files`, part of the time series processing library `diive`.

# **Background**

> Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...

source: https://parquet.apache.org/

- Parquet files are typically much smaller than the equivalent CSV files, and they are faster to load and save

# **Imports**

In [14]:
import os
from datetime import datetime
from importlib.metadata import version
from pathlib import Path

from diive.core.io.files import save_parquet, load_parquet
from diive.core.io.filereader import MultiDataFileReader

dt_string = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"This page was last modified on: {dt_string}")
version_diive = version("diive")  # version of the installed diive package
print(f"Used diive version: v{version_diive}")

This page was last modified on: 2023-08-14 15:55:21
Used diive version: v0.55.0


## Show function docstring of `save_parquet`

In [2]:
docstring = save_parquet.__doc__
print(docstring)


    Save pandas Series or DataFrame as parquet file

    Args:
        filename: str
            Name of the generated parquet file.
        data: pandas Series or DataFrame
        outpath: str or None
            If *None*, file is saved to system default folder. When used within
            a notebook, the file is saved in the same location as the notebook.

    Returns:
        str, filepath to parquet file
    


## Show function docstring of `load_parquet`

In [3]:
docstring = load_parquet.__doc__
print(docstring)


    Load data from Parquet file to pandas DataFrame

    Args:
        filepath: str
            filepath to parquet file

    Returns:
        pandas DataFrame, data from Parquet file as pandas DataFrame
    


# **Input data**

## Source and output folder

In [4]:
DIR = Path(r'F:\Sync\luhk_work\TMP')

## Search files in folder

In [5]:
# Collect all CSV files in the folder as full paths
filepaths = [DIR / f for f in os.listdir(DIR) if f.endswith(".csv")]
for f in filepaths:
    print(f)

F:\Sync\luhk_work\TMP\2020_eddypro_CH-FRU_FR-20230425-205325_fluxnet_2023-04-26T220030_adv.csv
F:\Sync\luhk_work\TMP\2021_eddypro_CH-FRU_FR-20230425-205131_fluxnet_2023-04-26T214014_adv.csv
F:\Sync\luhk_work\TMP\2022_eddypro_CH-FRU_FR-20230425-205042_fluxnet_2023-04-26T222351_adv.csv
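
The same search can also be written with `pathlib` alone. This is a minimal sketch that uses a temporary folder with made-up file names in place of `DIR`. `Path.glob` returns matches in arbitrary order, so the result is sorted here.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)  # stands in for DIR

    # Create a few empty example files (hypothetical names)
    for name in ["2021_fluxes.csv", "2020_fluxes.csv", "notes.txt"]:
        (folder / name).touch()

    # Collect all CSV files in the folder as Path objects, sorted by name
    filepaths = sorted(folder.glob("*.csv"))
    found = [f.name for f in filepaths]

for name in found:
    print(name)
```

Sorting by name puts yearly files in chronological order as long as the file names start with the year, as they do in this notebook.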

## Load and merge data from files
- Read datafiles in `filepaths` using `diive`
- All datafiles are in the `EDDYPRO_FLUXNET_30MIN` format

In [6]:
loaddatafile = MultiDataFileReader(filetype='EDDYPRO_FLUXNET_30MIN', filepaths=filepaths)
df = loaddatafile.data_df

Reading file 2020_eddypro_CH-FRU_FR-20230425-205325_fluxnet_2023-04-26T220030_adv.csv ...
Reading file 2021_eddypro_CH-FRU_FR-20230425-205131_fluxnet_2023-04-26T214014_adv.csv ...
Reading file 2022_eddypro_CH-FRU_FR-20230425-205042_fluxnet_2023-04-26T222351_adv.csv ...
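
For readers without `diive` at hand, the idea of reading several yearly files into one dataframe can be sketched with plain `pandas`. This is an illustration only and not how `MultiDataFileReader` is implemented. The CSV contents and the `FLUX` column below are invented, and real EddyPro output files have a more complex layout.

```python
import io

import pandas as pd

# Two small in-memory "files" with a timestamp column (invented example data)
csv_2020 = io.StringIO(
    "TIMESTAMP_MIDDLE,FLUX\n"
    "2020-12-31 23:15:00,1.0\n"
    "2020-12-31 23:45:00,2.0\n"
)
csv_2021 = io.StringIO(
    "TIMESTAMP_MIDDLE,FLUX\n"
    "2021-01-01 00:15:00,3.0\n"
    "2021-01-01 00:45:00,4.0\n"
)

# Read each file with the timestamp as a datetime index
frames = [
    pd.read_csv(f, index_col="TIMESTAMP_MIDDLE", parse_dates=True)
    for f in (csv_2021, csv_2020)  # deliberately out of order
]

# Stack the files and sort by timestamp to get one continuous time series
merged = pd.concat(frames).sort_index()
print(merged)
```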


## Check data

In [7]:
df

TIMESTAMP_MIDDLE,AIR_MV,AIR_DENSITY,AIR_RHO_CP,AIR_CP,AOA_METHOD,AXES_ROTATION_METHOD,BOWEN,BURBA_METHOD,BADM_LOCATION_LAT,BADM_LOCATION_LONG,BADM_LOCATION_ELEV,BADM_HEIGHTC,BADM_INST_SAMPLING_INT,BADM_INST_AVERAGING_INT,BADM_INST_MODEL_SA,...,W_T_SONIC_COV_IBROM_N0004,W_NUM_SPIKES,WD_FILTER_NREX,W_SPIKE_NREX,W_ABSLIM_NREX,W_VM97_TEST,W_LGD,W_KID,W_ZCD,W_ITC,W_ITC_TEST,WBOOST_APPLIED,WPL_APPLIED,ZL,ZL_UNCORR
2019-12-31 19:15:00,0.024907,1.16085,1169.83,1007.74,0.0,1.0,5.765210,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,-0.003629,3.0,0.0,5.0,0.0,800000011.0,0.0,6.40148,630.0,6.0,1.0,0.0,1.0,3.079130,3.006370
2019-12-31 19:45:00,0.024865,1.16271,1171.77,1007.79,0.0,1.0,1.627180,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.002473,0.0,0.0,0.0,0.0,800000011.0,0.0,7.97684,255.0,47.0,3.0,0.0,1.0,-0.096300,-0.099549
2019-12-31 20:15:00,0.024910,1.16061,1169.68,1007.81,0.0,1.0,-5.226190,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.018346,6.0,0.0,9.0,0.0,800000111.0,0.0,8.32701,470.0,27.0,2.0,0.0,1.0,-0.780695,-0.771121
2019-12-31 20:45:00,0.024886,1.16183,1170.78,1007.70,0.0,1.0,2.296260,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,-0.001395,5.0,0.0,9.0,0.0,800000011.0,0.0,8.49884,514.0,21.0,2.0,0.0,1.0,0.540130,0.545173
2019-12-31 21:15:00,0.024874,1.16239,1171.37,1007.73,0.0,1.0,-0.195424,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.005141,4.0,0.0,5.0,0.0,800000111.0,0.0,11.34600,266.0,9.0,1.0,0.0,1.0,-0.135396,-0.095764
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-31 22:45:00,0.025906,1.11442,1125.59,1010.02,0.0,1.0,-3.645490,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,-0.000189,13.0,0.0,22.0,0.0,800001101.0,0.0,23.89100,424.0,9.0,1.0,0.0,1.0,0.316949,0.309181
2022-12-31 23:15:00,0.025770,1.12033,1131.50,1009.97,0.0,1.0,-1.667900,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.005310,4.0,0.0,7.0,0.0,800000011.0,0.0,10.83960,1305.0,60.0,4.0,0.0,1.0,-0.057874,-0.056127
2022-12-31 23:45:00,0.025757,1.12077,1132.08,1010.09,0.0,1.0,61.622800,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,-0.030341,4.0,0.0,6.0,0.0,800000111.0,0.0,10.21180,822.0,8.0,1.0,0.0,1.0,1.716370,1.677440
2023-01-01 00:15:00,0.026678,1.08509,1096.57,1010.58,0.0,1.0,13.422700,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.000116,5.0,0.0,13.0,0.0,800000111.0,0.0,7.66713,584.0,25.0,2.0,0.0,1.0,0.378159,0.374705


# **Save dataframe as Parquet file**

## (1) Save to specific folder

In [8]:
filepath = save_parquet(outpath=DIR, filename="data", data=df)

Saved file F:\Sync\luhk_work\TMP\data.parquet (0.949 seconds).


## (2) Save to same folder as notebook

In [9]:
filepath = save_parquet(filename="data", data=df)

Saved file data.parquet (0.902 seconds).


## Filepath

In [10]:
filepath

'data.parquet'
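
The returned `'data.parquet'` is a relative path, which Python resolves against the current working directory (for a notebook, typically the notebook's folder). A short sketch to show where such a file ends up. It only inspects the path and does not require the file to exist.

```python
from pathlib import Path

filepath = "data.parquet"  # relative path, as returned above

# Resolve against the current working directory to get the absolute location
absolute_path = Path(filepath).resolve()
print(absolute_path)
print(f"Folder: {absolute_path.parent}")
```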

# **Load data from Parquet file**

In [11]:
data_from_parquet = load_parquet(filepath=filepath)

Saved file data.parquet (0.280 seconds).


In [12]:
data_from_parquet

TIMESTAMP_MIDDLE,AIR_MV,AIR_DENSITY,AIR_RHO_CP,AIR_CP,AOA_METHOD,AXES_ROTATION_METHOD,BOWEN,BURBA_METHOD,BADM_LOCATION_LAT,BADM_LOCATION_LONG,BADM_LOCATION_ELEV,BADM_HEIGHTC,BADM_INST_SAMPLING_INT,BADM_INST_AVERAGING_INT,BADM_INST_MODEL_SA,...,W_T_SONIC_COV_IBROM_N0004,W_NUM_SPIKES,WD_FILTER_NREX,W_SPIKE_NREX,W_ABSLIM_NREX,W_VM97_TEST,W_LGD,W_KID,W_ZCD,W_ITC,W_ITC_TEST,WBOOST_APPLIED,WPL_APPLIED,ZL,ZL_UNCORR
2019-12-31 19:15:00,0.024907,1.16085,1169.83,1007.74,0.0,1.0,5.765210,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,-0.003629,3.0,0.0,5.0,0.0,800000011.0,0.0,6.40148,630.0,6.0,1.0,0.0,1.0,3.079130,3.006370
2019-12-31 19:45:00,0.024865,1.16271,1171.77,1007.79,0.0,1.0,1.627180,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.002473,0.0,0.0,0.0,0.0,800000011.0,0.0,7.97684,255.0,47.0,3.0,0.0,1.0,-0.096300,-0.099549
2019-12-31 20:15:00,0.024910,1.16061,1169.68,1007.81,0.0,1.0,-5.226190,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.018346,6.0,0.0,9.0,0.0,800000111.0,0.0,8.32701,470.0,27.0,2.0,0.0,1.0,-0.780695,-0.771121
2019-12-31 20:45:00,0.024886,1.16183,1170.78,1007.70,0.0,1.0,2.296260,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,-0.001395,5.0,0.0,9.0,0.0,800000011.0,0.0,8.49884,514.0,21.0,2.0,0.0,1.0,0.540130,0.545173
2019-12-31 21:15:00,0.024874,1.16239,1171.37,1007.73,0.0,1.0,-0.195424,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.005141,4.0,0.0,5.0,0.0,800000111.0,0.0,11.34600,266.0,9.0,1.0,0.0,1.0,-0.135396,-0.095764
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-31 22:45:00,0.025906,1.11442,1125.59,1010.02,0.0,1.0,-3.645490,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,-0.000189,13.0,0.0,22.0,0.0,800001101.0,0.0,23.89100,424.0,9.0,1.0,0.0,1.0,0.316949,0.309181
2022-12-31 23:15:00,0.025770,1.12033,1131.50,1009.97,0.0,1.0,-1.667900,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.005310,4.0,0.0,7.0,0.0,800000011.0,0.0,10.83960,1305.0,60.0,4.0,0.0,1.0,-0.057874,-0.056127
2022-12-31 23:45:00,0.025757,1.12077,1132.08,1010.09,0.0,1.0,61.622800,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,-0.030341,4.0,0.0,6.0,0.0,800000111.0,0.0,10.21180,822.0,8.0,1.0,0.0,1.0,1.716370,1.677440
2023-01-01 00:15:00,0.026678,1.08509,1096.57,1010.58,0.0,1.0,13.422700,0.0,47.1158,8.53778,982.0,0.4,20.0,30.0,,...,0.000116,5.0,0.0,13.0,0.0,800000111.0,0.0,7.66713,584.0,25.0,2.0,0.0,1.0,0.378159,0.374705


# **Finish**

In [13]:
dt_string = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Finished {dt_string}")

Finished 2023-08-14 15:52:03
