## Data conversion

MARO comes with a simple command-line tool that converts a CSV file into a binary file.

``` sh
    maro data convert
```


Converting a CSV file into binary requires a meta file in YAML format. It contains 2 parts:

1. entity (required)

The entity part groups columns from the source CSV file and specifies the data type and field name that will be used in the program.

Each key under "entity" is the field name used in the program, and each field may contain:

column (required): specifies which CSV column is mapped to this field
dtype (optional): data type of this field; defaults to int32 if not provided. Currently only "i", "i2", "i4", "i8", "f" and "d" are supported (same as numpy).
tzone (optional): for the timestamp field only; specifies the timezone of the datetime values in the CSV file. The convert function converts these times into UTC. Defaults to UTC if not provided.
_event (optional): reserved key; specifies which column's values should be mapped to events.

2. events (optional)

The events part defines the events that will be used in the program. Each event has a type name and a display name.

type name (required): the key under the events part is the type name of the event
display name (required): a user-friendly name used for display
value_in_csv (optional): specifies which CSV value maps to this event; it can be empty, in which case the convert function will not map any value to this event.

Finally, there is a reserved key "_default" that specifies the default event to use when a value cannot be mapped to any event definition.
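The rules for the two parts can be sketched in plain Python. This is only an illustration of the description above, not MARO's implementation: the `meta` dictionary is a hypothetical in-memory form of a meta file, and `check_entity` and `resolve_event` are made-up helper names.

```python
SUPPORTED_DTYPES = {"i", "i2", "i4", "i8", "f", "d"}

# Hypothetical meta content, shaped like a parsed YAML meta file.
meta = {
    "entity": {
        "timestamp": {"column": "start_time", "dtype": "i8"},
        "durations": {"column": "duration"},  # dtype omitted on purpose
    },
    "events": {
        "RequireBike": {"display_name": "require_bike", "value_in_csv": 1},
        "ReturnBike": {"display_name": "return_bike", "value_in_csv": 2},
        "RebalanceBike": {"display_name": "rebalance_bike"},  # no CSV value
        "_default": "RequireBike",
    },
}


def check_entity(entity):
    """Return {field name: dtype code} after applying the entity rules."""
    resolved = {}
    for field, spec in entity.items():
        if "column" not in spec:
            raise ValueError(f"field '{field}' has no 'column'")
        dtype = spec.get("dtype", "i")  # "i" is the numpy code for int32
        if dtype not in SUPPORTED_DTYPES:
            raise ValueError(f"field '{field}' has unsupported dtype '{dtype}'")
        resolved[field] = dtype
    return resolved


def resolve_event(events, csv_value):
    """Map a CSV value to an event type name, falling back to "_default"."""
    for type_name, spec in events.items():
        if type_name == "_default":
            continue
        # Events without value_in_csv are never matched by a CSV value.
        if "value_in_csv" in spec and spec["value_in_csv"] == csv_value:
            return type_name
    return events.get("_default")


print(check_entity(meta["entity"]))
print(resolve_event(meta["events"], 2))
print(resolve_event(meta["events"], 99))
```

In this sketch a field without `dtype` falls back to `"i"`, and a CSV value that matches no event resolves to the `_default` type name.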


**NOTE**:

1. Currently the convert function requires the meta file to contain a field named "timestamp", which is converted into UTC timestamps (in seconds).

2. The convert function does not sort the input CSV files, so these files may need to be sorted beforehand.
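Both notes can be illustrated with the standard library alone. The sketch below is an assumption-laden example, not MARO code: `to_utc_seconds` and `sort_rows` are hypothetical helpers, the datetime format is taken from the sample CSV used later in this section, and the timezone is modelled as a fixed UTC offset in hours for simplicity.

```python
from datetime import datetime, timedelta, timezone


def to_utc_seconds(text, utc_offset_hours=0):
    """Convert 'YYYY-MM-DD HH:MM:SS' in a fixed-offset zone to UTC seconds."""
    local = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    aware = local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return int(aware.timestamp())


def sort_rows(rows, key="start_time"):
    """Return the rows ordered by their timestamp column."""
    return sorted(rows, key=lambda row: to_utc_seconds(row[key]))


# Rows deliberately out of order, as they might appear in a raw CSV file.
rows = [
    {"start_time": "2019-01-01 00:05:00", "duration": "5"},
    {"start_time": "2019-01-01 00:00:00", "duration": "5"},
    {"start_time": "2019-01-01 00:01:00", "duration": "5"},
]

ordered = sort_rows(rows)
print([to_utc_seconds(row["start_time"]) for row in ordered])
# [1546300800, 1546300860, 1546301100]
```

With a non-zero offset the same wall-clock time maps to a different UTC value, for example `to_utc_seconds("2019-01-01 00:00:00", -5)` gives `1546318800`.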


A full example looks like the following.

In [1]:
# show an example of meta file
import sys

if sys.platform == "win32":
  !type ..\\..\\tests\\data\\citi_bike\\trips.meta.yml
else:
  !cat ../../tests/data/citi_bike/trips.meta.yml


# events filed used to define event name and related value in csv file
events:
  # event name in scenario -> related value in csv
  RequireBike: # type name
    # display_name should not contains special charactors, spaces, treat it as a variable
    display_name: "require_bike" # can be empty, then will be same as type name (key)
    # value_in_csv: 1 # used to mapping csv value into event type
  ReturnBike:
    display_name: "return_bike"
    # value_in_csv: 2 # if event have no value
  RebalanceBike:
    display_name: "rebalance_bike"
  DeliverBike:
    display_name: "deliver_bike"
    
  "_default": "RequireBike" # default event type if not event type in column, such as citi_bike scenario, all the rows are trip_requirement, so we do not need to specified event column
# entity used to specified which columns need to be extracted into binary, and its data type
# current supported data types are: i, i2, i4, i8, f, d
entity:
  timestamp:
    column: 'start_time'
    dtype: 'i8'
    # 

Once the meta file is ready, use the convert command. It takes 3 parameters, as shown below.

In [2]:
!maro data convert -h

usage: maro data convert [-h] [--meta META] [--file FILE [FILE ...]]
                         [--output OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  --meta META           Meta file of used for converting
  --file FILE [FILE ...]
                        Path to csv file(s) used to convert, you can save your
                        files' name into a file and call with prefix @ to read
                        files list from your file, like 'maro data convert
                        --meta meta.yml --output o.bin --file @files.txt'
  --output OUTPUT       Path (with file name) to safe the binary output.
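The `@` prefix described in the help text behaves like the `fromfile_prefix_chars` option of Python's standard `argparse` module. Whether MARO relies on that option is an assumption; the sketch below only reproduces the described behaviour with a stand-in parser (`convert-sketch` is a made-up program name).

```python
import argparse
import os
import tempfile


def build_parser():
    """Stand-in parser with the same three options as the help output."""
    parser = argparse.ArgumentParser(
        prog="convert-sketch",
        fromfile_prefix_chars="@",  # "@path" is replaced by the lines of that file
    )
    parser.add_argument("--meta")
    parser.add_argument("--file", nargs="+")
    parser.add_argument("--output")
    return parser


with tempfile.TemporaryDirectory() as tmp:
    # One CSV path per line, as the help text suggests for files.txt.
    list_path = os.path.join(tmp, "files.txt")
    with open(list_path, "w") as handle:
        handle.write("a.csv\nb.csv\n")

    args = build_parser().parse_args(
        ["--meta", "meta.yml", "--output", "o.bin", "--file", "@" + list_path]
    )

print(args.file)
# ['a.csv', 'b.csv']
```

Each line of the list file becomes one argument, so the file should contain one path per line with no extra whitespace.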


In [3]:
# we have a csv file for testing, use it to convert
import pandas

pandas.read_csv("../../tests/data/citi_bike/case_1/trips.csv")

Unnamed: 0,start_time,duration,start_station_index,end_station_index
0,2019-01-01 00:00:00,5,0,1
1,2019-01-01 00:01:00,5,0,1
2,2019-01-01 00:01:00,5,1,0
3,2019-01-01 00:05:00,5,0,1


In [4]:
# convert csv file(s) into binary
!maro data convert --output trips.bin --meta ../../tests/data/citi_bike/trips.meta.yml --file ../../tests/data/citi_bike/case_1/trips.csv

In [1]:
from maro.datalib.binary_reader import BinaryReader

reader = BinaryReader('trips.bin')

reader
    

<maro.datalib.binary_reader.BinaryReader at 0x288dcffe518>

In [2]:

# timezone same as meta file
print("start time", reader.start_datetime)
print("end time", reader.end_datetime)


start time 2019-01-01 00:00:00+00:00
end time 2019-01-01 00:05:00+00:00


In [3]:
# get items within a tick range
# by default, all items are returned
for item in reader.items():
    print(item)

Item(timestamp=1546300800, durations=5, src_station=0, dest_station=1)
Item(timestamp=1546300860, durations=5, src_station=0, dest_station=1)
Item(timestamp=1546300860, durations=5, src_station=1, dest_station=0)
Item(timestamp=1546301100, durations=5, src_station=0, dest_station=1)
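As a sanity check, the integer timestamps in this output can be turned back into UTC datetimes with the standard library. The helper name `to_utc_text` is hypothetical; the values are copied from the output above.

```python
from datetime import datetime, timezone


def to_utc_text(timestamp):
    """Render a UTC timestamp in seconds as 'YYYY-MM-DD HH:MM:SS+00:00'."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat(sep=" ")


for ts in [1546300800, 1546300860, 1546300860, 1546301100]:
    print(ts, "->", to_utc_text(ts))
```

The first and last values correspond to the start and end times reported by the reader, 2019-01-01 00:00:00+00:00 and 2019-01-01 00:05:00+00:00.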


In [7]:
# the reader must be reset if you want to read from the beginning
reader.reset()

# tick 0 is the timestamp of the first item in the binary file
# use time_unit to specify the unit of a tick; currently "s", "m", "h" and "d" are supported
for item in reader.items(0, 1, time_unit="m"):
    print(item)

Item(timestamp=1546300800, durations=5, src_station=0, dest_station=1)
Item(timestamp=1546300860, durations=5, src_station=0, dest_station=1)
Item(timestamp=1546300860, durations=5, src_station=1, dest_station=0)
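The relationship between timestamps and ticks can be sketched as follows. The floor division and the inclusive end tick are inferred from the output above, not taken from the MARO source, and `tick_of` and `select` are hypothetical helpers.

```python
# Seconds per tick for each supported time unit.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def tick_of(timestamp, start_timestamp, time_unit="s"):
    """Tick index of a timestamp, where tick 0 is the first item's timestamp."""
    return (timestamp - start_timestamp) // UNIT_SECONDS[time_unit]


def select(timestamps, start_tick, end_tick, time_unit="s"):
    """Keep the timestamps whose tick lies in [start_tick, end_tick]."""
    start = timestamps[0]
    return [
        ts
        for ts in timestamps
        if start_tick <= tick_of(ts, start, time_unit) <= end_tick
    ]


timestamps = [1546300800, 1546300860, 1546300860, 1546301100]
print(select(timestamps, 0, 1, "m"))
# [1546300800, 1546300860, 1546300860]
```

The last item is 300 seconds after the first, so with `time_unit="m"` it falls on tick 5 and is outside the range 0 to 1.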


In [12]:
reader.reset()

# another method returns an item picker, which provides a method to get the items of a specified tick sequentially
picker = reader.items_tick_picker(0, 1, time_unit="m")

# NOTE: ticks must be specified sequentially
for item in picker.items(0):
    print("tick 0:", item)

for item in picker.items(1):
    print("tick 1:", item)

tick 0: Item(timestamp=1546300800, durations=5, src_station=0, dest_station=1)
tick 1: Item(timestamp=1546300860, durations=5, src_station=0, dest_station=1)
tick 1: Item(timestamp=1546300860, durations=5, src_station=1, dest_station=0)
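One way to see why ticks have to be requested in order is a forward-only cursor over sorted items. The sketch below is a guess at the general idea, not the actual picker implementation, and `make_picker` is a hypothetical name; the real picker may behave differently when ticks are requested out of order.

```python
def make_picker(timestamps, unit_seconds):
    """Return items(tick), which reads sorted timestamps with a forward-only cursor."""
    start = timestamps[0]
    position = 0  # index of the next unread timestamp

    def items(tick):
        nonlocal position
        # Skip anything that belongs to ticks before the requested one.
        while (position < len(timestamps)
               and (timestamps[position] - start) // unit_seconds < tick):
            position += 1
        picked = []
        # Collect everything that falls exactly on the requested tick.
        while (position < len(timestamps)
               and (timestamps[position] - start) // unit_seconds == tick):
            picked.append(timestamps[position])
            position += 1
        return picked

    return items


items = make_picker([1546300800, 1546300860, 1546300860, 1546301100], 60)
print("tick 0:", items(0))
print("tick 1:", items(1))
print("tick 0 again:", items(0))
```

In this sketch the cursor never moves backwards, so asking for tick 0 a second time returns an empty list.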
