# Optimizing and parquetizing NinaPro data

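Here we optimize the per-subject NinaPro CSV files once they have been generated: each column is cast to a smaller dtype and the result is saved as gzip-compressed Parquet. Note that the original CSV file is deleted after conversion.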

***NOTE:*** To use this notebook, apart from pandas, you also need a Parquet engine: install `pyarrow` or `fastparquet` (or both) using pip or conda.
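
Before running the conversion, it can help to confirm that a Parquet engine is importable, since pandas can write Parquet through either `pyarrow` or `fastparquet`. The following is a minimal sketch using only the standard library; the helper name `available_parquet_engines` is hypothetical and not part of this notebook or of pandas.

```python
import importlib.util


def available_parquet_engines():
    """Return the names of the Parquet engines that can be imported."""
    candidates = ("pyarrow", "fastparquet")
    # find_spec returns None when a top-level package is not installed
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print("Parquet engines found:", ", ".join(engines))
else:
    print("No Parquet engine found; install pyarrow or fastparquet.")
```

If the list comes back empty, `df.to_parquet` will raise an `ImportError` when the conversion cell runs.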

In [1]:
import os
import pandas as pd

In [2]:
# Label and metadata columns stored as uint8 (integers in 0..255)
uint8_cols = ['subject','exercise','repetition','rerepetition','stimulus','restimulus']
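
To illustrate why these columns are worth downcasting, here is a small sketch on synthetic data, not on the actual NinaPro files. The column `emg_0` and the row count are made up for the example. It checks that the label values fit in the `uint8` range before casting, then compares memory use before and after.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one subject's table; "emg_0" is a hypothetical signal column
n_rows = 1000
df = pd.DataFrame({
    "subject": np.ones(n_rows, dtype="int64"),
    "stimulus": np.arange(n_rows, dtype="int64") % 53,
    "emg_0": np.linspace(-1.0, 1.0, n_rows),  # float64 by default
})

label_cols = ["subject", "stimulus"]
# uint8 can only hold 0..255, so check the range before casting
fits_uint8 = bool(((df[label_cols] >= 0) & (df[label_cols] <= 255)).all().all())

bytes_before = int(df.memory_usage(index=False).sum())
if fits_uint8:
    df = df.astype({"subject": "uint8", "stimulus": "uint8", "emg_0": "float32"})
bytes_after = int(df.memory_usage(index=False).sum())

print(f"{bytes_before} bytes -> {bytes_after} bytes")
```

Two 8-byte integer columns shrink to 1 byte per value and the float column to 4 bytes per value, so the in-memory size of this toy table drops to a quarter. The range check matters because casting out-of-range integers to `uint8` can corrupt the labels rather than fail loudly.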

In [3]:
homedir = "../data/ninapro_db5"
listdirs = sorted(os.listdir(homedir))
for subject_dir in listdirs:
    subject_path = os.path.join(homedir, subject_dir)
    # Skip stray files and folders with no CSV (e.g. already converted on a previous run)
    if not os.path.isdir(subject_path):
        continue
    csvfiles = [f for f in os.listdir(subject_path) if f.endswith('.csv')]
    if not csvfiles:
        continue
    print("Optimizing " + subject_path)
    # Only the first CSV found in each subject folder is converted
    csv_path = os.path.join(subject_path, csvfiles[0])
    out_base = os.path.splitext(csv_path)[0]
    # Read data
    df = pd.read_csv(csv_path)
    # Optimize columns: labels as uint8, everything else as float32
    df[uint8_cols] = df[uint8_cols].astype('uint8')
    rest_cols = [c for c in df.columns if c not in uint8_cols]
    df[rest_cols] = df[rest_cols].astype('float32')
    # Save parquet format (comment to deactivate)
    df.to_parquet(out_base + ".parquet", compression='gzip')
    # Save CSV format but lighter (comment to deactivate)
    # df.to_csv(out_base + "_light.csv", index=False, header=True, float_format='%.6f')
    # Delete the original CSV (comment to keep it)
    os.remove(csv_path)

Optimizing ../data/ninapro_db5//s1
Optimizing ../data/ninapro_db5//s10
Optimizing ../data/ninapro_db5//s2
Optimizing ../data/ninapro_db5//s3
Optimizing ../data/ninapro_db5//s4
Optimizing ../data/ninapro_db5//s5
Optimizing ../data/ninapro_db5//s6
Optimizing ../data/ninapro_db5//s7
Optimizing ../data/ninapro_db5//s8
Optimizing ../data/ninapro_db5//s9
