# 1 - Convert Data from VED
This short notebook converts the Vehicle Energy Dataset (VED) from its original multi-file CSV format to a more convenient single Parquet file. Parquet is a binary, columnar format that stores the column types alongside the data, so loading it skips the text parsing and type inference that CSV files require and is typically much faster.

Start by downloading the data from https://github.com/gsoh/VED into the `data` folder (create it if it does not exist). After extracting all the CSV files, run the code below.

Note: Please install the `pyarrow` package before running this notebook.
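As an optional check before running the conversion, the sketch below tests whether `pyarrow` can be imported in the current environment. The helper name `has_module` is introduced here for illustration only and is not part of the original notebook.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if has_module("pyarrow"):
    print("pyarrow is available")
else:
    # Assumption: pip is the package manager in use; adapt for conda or others.
    print("pyarrow is missing; install it, for example with: pip install pyarrow")
```

Running this first avoids finding out about the missing dependency only at the final `to_parquet` call, after all the CSV files have already been read.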

In [None]:
import numpy as np
import pandas as pd
import os

Set the data path and target file name.

In [None]:
data_path = "./data"
parquet_file = os.path.join(data_path, "ved.parquet")

The `read_data_frame` function reads a single VED CSV file into its own DataFrame object, keeping only the columns needed here. It is meant to be passed to the `map` function so that it is applied to every file in turn (see below).

In [None]:
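def read_data_frame(filename):
    """Read one VED CSV file, keeping only the columns of interest."""
    # Columns to load; every other column in the file is skipped.
    columns = ['DayNum', 'VehId', 'Trip', 'Timestamp(ms)', 'Latitude[deg]', 'Longitude[deg]',
               'Vehicle Speed[km/h]']
    # Force the identifier and timestamp columns to 64-bit integers.
    types = {'VehId': np.int64,
             'Trip': np.int64,
             'Timestamp(ms)': np.int64}
    df = pd.read_csv(filename, usecols=columns, dtype=types)
    return df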
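To illustrate what the `usecols` and `dtype` arguments do, here is a minimal sketch that applies the same settings to a small in-memory CSV. The two data rows and the extra `Engine RPM[RPM]` column are made up for this example and do not come from the dataset.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample with one extra column that should be dropped on load.
sample_csv = io.StringIO(
    "DayNum,VehId,Trip,Timestamp(ms),Latitude[deg],Longitude[deg],"
    "Vehicle Speed[km/h],Engine RPM[RPM]\n"
    "1.5,8,706,0,42.27,-83.69,50.0,1500\n"
    "1.5,8,706,1000,42.28,-83.70,52.0,1550\n"
)

columns = ['DayNum', 'VehId', 'Trip', 'Timestamp(ms)', 'Latitude[deg]',
           'Longitude[deg]', 'Vehicle Speed[km/h]']
types = {'VehId': np.int64, 'Trip': np.int64, 'Timestamp(ms)': np.int64}

# Same call as in read_data_frame, but reading from the in-memory buffer.
sample_df = pd.read_csv(sample_csv, usecols=columns, dtype=types)

print(sample_df.columns.tolist())
print(sample_df.dtypes)
```

Only the seven listed columns are loaded, and the three identifier and timestamp columns come back as 64-bit integers instead of whatever type pandas would otherwise infer.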

Read all the files into a single DataFrame, sort the rows by vehicle, day, and timestamp, and write the result to a single Parquet file.

In [None]:
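# Collect the full path of every CSV file in the data folder.
files = [os.path.join(data_path, file) for file in os.listdir(data_path) if file.endswith(".csv")]
# Read each file and stack the results into one DataFrame.
df = pd.concat(map(read_data_frame, files), ignore_index=True)
# Order the rows by vehicle, then day, then timestamp within the day.
df = df.sort_values(by=['VehId', 'DayNum', 'Timestamp(ms)'])
df.to_parquet(parquet_file)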
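The concatenate-then-sort step above can be sketched on two tiny made-up frames that stand in for two per-file DataFrames. The values are invented for illustration. The final `reset_index` call is an optional extra that the notebook itself does not perform.

```python
import pandas as pd

# Two made-up frames standing in for the DataFrames read from two CSV files.
part_a = pd.DataFrame({"VehId": [10, 10],
                       "DayNum": [2.0, 2.0],
                       "Timestamp(ms)": [1000, 0]})
part_b = pd.DataFrame({"VehId": [8, 8],
                       "DayNum": [1.5, 1.5],
                       "Timestamp(ms)": [0, 1000]})

# Stack the frames and number the rows 0..n-1.
combined = pd.concat([part_a, part_b], ignore_index=True)

# Sort by vehicle, then day, then timestamp.
ordered = combined.sort_values(by=["VehId", "DayNum", "Timestamp(ms)"])

# sort_values keeps the original row labels, so the index is now out of order.
print(ordered.index.tolist())

# Optional: renumber the rows so the index follows the sorted order.
renumbered = ordered.reset_index(drop=True)
print(renumbered)
```

Because the rows are sorted after concatenation, the order in which `os.listdir` returns the files does not affect the row order of the result. The written file can later be loaded with `pd.read_parquet(parquet_file)`.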