![diive logo](../../images/logo_diive1_128px.png)
# **Load and save Parquet file**

**Notebook creation date**: 11 Aug 2023  
**Author**: Lukas Hörtnagl (holukas@ethz.ch)

---
Example notebook for functions `save_parquet` and `load_parquet` in `diive.core.io.files` of the time series processing library `diive`.  

# **Background**

> Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...

source: https://parquet.apache.org/

- Parquet files are typically much smaller than equivalent CSV files and faster to load and save, because the data is stored column by column in a compressed binary format

# **Imports**

In [1]:
import os
from datetime import datetime
from importlib.metadata import version
from pathlib import Path
from diive.core.io.files import save_parquet, load_parquet
from diive.core.io.filereader import MultiDataFileReader

dt_string = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"This page was last modified on: {dt_string}")
version_diive = version("diive")
print(f"Used diive version: v{version_diive}")



This page was last modified on: 2023-09-07 12:48:00
Used diive version: v0.57.0


## Show function docstring of `save_parquet`

In [2]:
docstring = save_parquet.__doc__
print(docstring)


    Save pandas Series or DataFrame as parquet file

    Args:
        filename: str
            Name of the generated parquet file.
        data: pandas Series or DataFrame
        outpath: str or None
            If *None*, file is saved to system default folder. When used within
            a notebook, the file is saved in the same location as the notebook.

    Returns:
        str, filepath to parquet file
    


## Show function docstring of `load_parquet`

In [3]:
docstring = load_parquet.__doc__
print(docstring)


    Load data from Parquet file to pandas DataFrame

    Args:
        filepath: str
            filepath to parquet file

    Returns:
        pandas DataFrame, data from Parquet file as pandas DataFrame
    


# **Input data**

## Source and output folder

In [4]:
DIR = Path(r'F:\TMP\dav')

## Search files in folder

In [5]:
filepaths = [DIR / f for f in os.listdir(DIR) if f.endswith(".csv")]
for f in filepaths:
    print(f)

F:\TMP\dav\exampledata_REDDYPROC_30MIN_CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.csv

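
The same kind of search can also be written with `pathlib` alone. A minimal sketch, assuming only the standard library; the temporary folder and the file names are hypothetical stand-ins for the real data folder:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create a few empty placeholder files (hypothetical names)
    for name in ("site_a.csv", "site_b.csv", "notes.txt"):
        (folder / name).touch()

    # glob() yields full paths; sorted() makes the order deterministic
    csv_files = sorted(folder.glob("*.csv"))
    found_names = [f.name for f in csv_files]

print(found_names)  # ['site_a.csv', 'site_b.csv']
```

Because `glob()` already returns `Path` objects, no separate conversion step is needed before passing the list on.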

## Load and merge data from files
- Read datafiles in `filepaths` using `diive`
- All datafiles are in the `REDDYPROC_30MIN` format

In [6]:
loaddatafile = MultiDataFileReader(filetype='REDDYPROC_30MIN', filepaths=filepaths)
df = loaddatafile.data_df

Reading file exampledata_REDDYPROC_30MIN_CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.csv ...


## Check data

In [7]:
df

| TIMESTAMP_MIDDLE | E_0_CUT_REF | ET_f | FP_alpha | FP_alpha_sd | FP_beta | FP_beta_sd | FP_dRecPar | FP_E0 | FP_E0_sd | FP_errorcode | FP_GPP2000 | FP_k | FP_k_sd | FP_NEEnight_CUT_REF | FP_qc | ... | Tair_fsd | Tair_fwin | Tair_orig | Unnamed: 0 | Ustar_CUT_REF_fqc | Ustar_CUT_REF_Thres | VPD_f | VPD_fall | VPD_fall_qc | VPD_fmeth | VPD_fnum | VPD_fqc | VPD_fsd | VPD_fwin | VPD_orig |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1997-01-01 00:15:00 | 188.45 | -0.062663 | | | | | 1154.0 | | | | | | | | 2.0 | ... | | | -7.850000 | 1.0 | 1.0 | 0.287644 | 0.903000 | 0.903000 | | | | 0.0 | | | 0.903000 |
| 1997-01-01 00:45:00 | 188.45 | -0.063766 | | | | | 1153.0 | | | | | | | | 2.0 | ... | | | -8.130000 | 2.0 | 1.0 | 0.287644 | 0.680000 | 0.680000 | | | | 0.0 | | | 0.680000 |
| 1997-01-01 01:15:00 | 188.45 | -0.004956 | | | | | 1152.0 | | | | | | | | 2.0 | ... | | | -7.890000 | 3.0 | 1.0 | 0.287644 | 0.688000 | 0.688000 | | | | 0.0 | | | 0.688000 |
| 1997-01-01 01:45:00 | 188.45 | 0.485472 | | | | | 1151.0 | | | | | | | | 2.0 | ... | | | -8.490000 | 4.0 | 1.0 | 0.287644 | 0.454000 | 0.454000 | | | | 0.0 | | | 0.454000 |
| 1997-01-01 02:15:00 | 188.45 | -0.063766 | | | | | 1150.0 | | | | | | | | 2.0 | ... | | | -8.130000 | 5.0 | 1.0 | 0.287644 | 0.369000 | 0.369000 | | | | 0.0 | | | 0.369000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-12-31 21:45:00 | 188.45 | -0.016362 | | | | | 10.0 | | | | | | | | 1.0 | ... | | | 3.463333 | 455804.0 | 0.0 | 0.287644 | 2.210036 | 2.210036 | | | | 0.0 | | | 2.210036 |
| 2022-12-31 22:15:00 | 188.45 | 0.105363 | | | | | 9.0 | | | | | | | 3.51795 | 1.0 | ... | | | 5.283333 | 455805.0 | 0.0 | 0.287644 | 3.327497 | 3.327497 | | | | 0.0 | | | 3.327497 |
| 2022-12-31 22:45:00 | 188.45 | 0.219531 | | | | | 8.0 | | | | | | | | 1.0 | ... | | | 5.536667 | 455806.0 | 0.0 | 0.287644 | 3.466505 | 3.466505 | | | | 0.0 | | | 3.466505 |
| 2022-12-31 23:15:00 | 188.45 | 0.059158 | | | | | 7.0 | | | | | | | | 1.0 | ... | | | 4.963333 | 455807.0 | 0.0 | 0.287644 | 3.010161 | 3.010161 | | | | 0.0 | | | 3.010161 |


# **Save dataframe as parquet file**

## (1) Save to specific folder

In [11]:
filepath = save_parquet(outpath=DIR, filename="CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN", data=df)

Saved file F:\TMP\dav\CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.parquet (1.132 seconds).


## (2) Save to same folder as notebook

In [None]:
filepath = save_parquet(filename="data", data=df)
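
How a file name and an optional output folder could combine into the final file path is sketched below. This is an assumption for illustration, not the implementation used in `diive`: `build_parquet_path` is a hypothetical helper, and falling back to the current working directory is only one plausible reading of the docstring's "system default folder".

```python
from pathlib import Path
from typing import Optional, Union


def build_parquet_path(filename: str, outpath: Optional[Union[str, Path]] = None) -> Path:
    """Hypothetical helper: combine folder and file name into a .parquet path."""
    # Fall back to the current working directory, which for a notebook
    # is normally the folder the notebook kernel was started in
    folder = Path(outpath) if outpath is not None else Path.cwd()
    # Append the extension as text: the file name may itself contain dots
    # (e.g. "FP2022.5"), which Path.with_suffix() would treat as a suffix
    return folder / f"{filename}.parquet"


name = "CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN"
print(build_parquet_path(name, outpath="out").name)
print(Path(name).with_suffix(".parquet").name)  # cuts the name off at the last dot
```

The second `print` shows why the extension is appended as plain text here: with a dot inside the file name, `with_suffix()` would silently shorten it to `CH-DAV_FP2022.parquet`.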

## Filepath

In [12]:
filepath

'F:\\TMP\\dav\\CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.parquet'

# **Load data from parquet file**

In [13]:
data_from_parquet = load_parquet(filepath=filepath)

Loaded .parquet file F:\TMP\dav\CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.parquet (0.381 seconds). Detected time resolution of <30 * Minutes> / 30T 
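
The load message reports a detected time resolution of 30 minutes. How such a resolution can be inferred from a timestamp index is sketched below with `pandas.infer_freq`. This illustrates the idea for a regular index and is not necessarily the method `diive` uses; the short index is synthetic.

```python
import pandas as pd

# Synthetic, regular half-hourly index (illustrative only)
index = pd.date_range("1997-01-01 00:15", periods=48, freq="30min", name="TIMESTAMP_MIDDLE")

# infer_freq() returns a frequency string, or None if no regular frequency is found
freq = pd.infer_freq(index)

# The step between two consecutive timestamps gives the same information as a Timedelta
step = index[1] - index[0]

print(freq)  # spelled "30T" or "30min" depending on the pandas version
print(step)  # 0 days 00:30:00
```

For records with gaps or irregular timestamps `infer_freq` may return `None`, so a real implementation would need a fallback.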


In [14]:
data_from_parquet

| TIMESTAMP_MIDDLE | E_0_CUT_REF | ET_f | FP_alpha | FP_alpha_sd | FP_beta | FP_beta_sd | FP_dRecPar | FP_E0 | FP_E0_sd | FP_errorcode | FP_GPP2000 | FP_k | FP_k_sd | FP_NEEnight_CUT_REF | FP_qc | ... | Tair_fsd | Tair_fwin | Tair_orig | Unnamed: 0 | Ustar_CUT_REF_fqc | Ustar_CUT_REF_Thres | VPD_f | VPD_fall | VPD_fall_qc | VPD_fmeth | VPD_fnum | VPD_fqc | VPD_fsd | VPD_fwin | VPD_orig |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1997-01-01 00:15:00 | 188.45 | -0.062663 | | | | | 1154.0 | | | | | | | | 2.0 | ... | | | -7.850000 | 1.0 | 1.0 | 0.287644 | 0.903000 | 0.903000 | | | | 0.0 | | | 0.903000 |
| 1997-01-01 00:45:00 | 188.45 | -0.063766 | | | | | 1153.0 | | | | | | | | 2.0 | ... | | | -8.130000 | 2.0 | 1.0 | 0.287644 | 0.680000 | 0.680000 | | | | 0.0 | | | 0.680000 |
| 1997-01-01 01:15:00 | 188.45 | -0.004956 | | | | | 1152.0 | | | | | | | | 2.0 | ... | | | -7.890000 | 3.0 | 1.0 | 0.287644 | 0.688000 | 0.688000 | | | | 0.0 | | | 0.688000 |
| 1997-01-01 01:45:00 | 188.45 | 0.485472 | | | | | 1151.0 | | | | | | | | 2.0 | ... | | | -8.490000 | 4.0 | 1.0 | 0.287644 | 0.454000 | 0.454000 | | | | 0.0 | | | 0.454000 |
| 1997-01-01 02:15:00 | 188.45 | -0.063766 | | | | | 1150.0 | | | | | | | | 2.0 | ... | | | -8.130000 | 5.0 | 1.0 | 0.287644 | 0.369000 | 0.369000 | | | | 0.0 | | | 0.369000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-12-31 21:45:00 | 188.45 | -0.016362 | | | | | 10.0 | | | | | | | | 1.0 | ... | | | 3.463333 | 455804.0 | 0.0 | 0.287644 | 2.210036 | 2.210036 | | | | 0.0 | | | 2.210036 |
| 2022-12-31 22:15:00 | 188.45 | 0.105363 | | | | | 9.0 | | | | | | | 3.51795 | 1.0 | ... | | | 5.283333 | 455805.0 | 0.0 | 0.287644 | 3.327497 | 3.327497 | | | | 0.0 | | | 3.327497 |
| 2022-12-31 22:45:00 | 188.45 | 0.219531 | | | | | 8.0 | | | | | | | | 1.0 | ... | | | 5.536667 | 455806.0 | 0.0 | 0.287644 | 3.466505 | 3.466505 | | | | 0.0 | | | 3.466505 |
| 2022-12-31 23:15:00 | 188.45 | 0.059158 | | | | | 7.0 | | | | | | | | 1.0 | ... | | | 4.963333 | 455807.0 | 0.0 | 0.287644 | 3.010161 | 3.010161 | | | | 0.0 | | | 3.010161 |


# **Finish**

In [15]:
dt_string = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Finished {dt_string}")

Finished 2023-09-07 12:50:45
