In [91]:
import timeit
import pytz
import glob
import pandas as pd

# Data Ingestion

### Arctic
[Arctic](http://github.com/manahl/arctic) is a "High performance datastore for time series and tick data", developed by the rather large hedge fund Man AHL on top of MongoDB. Generally speaking, I have avoided MongoDB in the past because, well, [it doesn't do what it says on the tin](https://aphyr.com/posts/322-jepsen-mongodb-stale-reads). This seems to have been [fixed](https://jepsen.io/analyses/mongodb-3-4-0-rc3), and in fairness, losing a handful of rows out of a few billion rows of time-series data is not too important.

There is a great video on Arctic's design: https://vimeo.com/album/3660528/video/145842301

Arctic's strong focus on working with Pandas is attractive.


Arctic expects MongoDB to be reachable on port `27017`, so spin up that container with:
`sudo docker-compose up -d mongo`
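
Before initializing Arctic, it can help to confirm that something is actually listening on that port. The snippet below is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name, and `localhost` is an assumption about where the container is exposed.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumes the mongo container publishes 27017 on the local machine
    print("MongoDB reachable:", port_is_open("localhost", 27017))
```

This only checks that the TCP port accepts connections; it does not verify that the process behind it is MongoDB.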


#### ChunkStore or TickStore?
According to [this issue](https://github.com/manahl/arctic/issues/197), the TickStore is best for continuously reading/writing data (ticks), and the ChunkStore is best for writing large blocks at once. Arctic's authors use Kafka to queue and batch-write ~3 hours' worth of ticks, I believe into a ChunkStore. Because the API is so easy to work with, we might as well test both approaches.
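
To make the batching idea concrete, here is a rough sketch of grouping incoming ticks into fixed 3-hour windows so that each window can be written in one call. This is not Arctic's or Kafka's code: `batch_ticks` is a hypothetical helper, the `(timestamp, bid, ask)` tuple layout is an assumption, and windows are aligned to the Unix epoch purely for simplicity.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def batch_ticks(ticks, window=timedelta(hours=3)):
    """Group (timestamp, bid, ask) tuples into fixed windows keyed by window start."""
    batches = defaultdict(list)
    for tick in ticks:
        timestamp = tick[0]
        # Integer division of two timedeltas gives the index of the window
        window_index = (timestamp - EPOCH) // window
        batches[EPOCH + window_index * window].append(tick)
    # Return the batches ordered by window start
    return dict(sorted(batches.items()))


if __name__ == "__main__":
    utc = timezone.utc
    sample = [
        (datetime(2017, 1, 2, 0, 30, tzinfo=utc), 0.7201, 0.7203),
        (datetime(2017, 1, 2, 2, 59, tzinfo=utc), 0.7202, 0.7204),
        (datetime(2017, 1, 2, 3, 0, tzinfo=utc), 0.7205, 0.7207),
    ]
    for start, batch in batch_ticks(sample).items():
        print(start.isoformat(), len(batch))
```

In a real pipeline each batch would be converted to a DataFrame and handed to the store in a single write, which is the access pattern the issue describes as a good fit for the ChunkStore.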

In [97]:
from arctic import TICK_STORE, CHUNK_STORE
from arctic import Arctic

# Initialize Arctic
arctic_db = Arctic('localhost')
arctic_db.initialize_library("fx_tickstore", lib_type=TICK_STORE)
arctic_db.initialize_library("fx_chunkstore", lib_type=CHUNK_STORE)
arctic_tick_lib = arctic_db["fx_tickstore"]
arctic_chunk_lib = arctic_db["fx_chunkstore"]

# Load the raw data
all_files = glob.glob("./raw_data/*.csv")
# Read every CSV, discard the redundant symbol column, and stack the results
raw = pd.concat(
    pd.read_csv(f, header=None, names=["symbol", "date", "bid", "ask"]).drop(columns="symbol")
    for f in all_files
)

raw["date"] = pd.to_datetime(raw["date"])
raw = raw.set_index("date").tz_localize(pytz.utc)

# Save some information about how much data we have.
num_rows = len(raw)
num_bytes = raw.memory_usage().sum()

print("Raw data has %s million rows and takes up %s megabytes." % (num_rows/1000000, num_bytes/1024/1024))

Raw data has 193.483396 million rows and takes up 4428.48348999 megabytes.


In [None]:
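# Write the dataframe to Arctic
def insert_to_arctic_tick():
    # Delete any existing symbol first so each run writes from scratch
    arctic_tick_lib.delete("AUDUSD")
    arctic_tick_lib.write("AUDUSD", raw)

def insert_to_arctic_chunk():
    # Same delete-then-write pattern against the ChunkStore library
    arctic_chunk_lib.delete("AUDUSD")
    arctic_chunk_lib.write("AUDUSD", raw)

# Time a single full write; only the ChunkStore insert is timed here
timeit.timeit('insert_to_arctic_chunk()', 'from __main__ import insert_to_arctic_chunk', number=1)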
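
Since the goal is to compare both stores, a small wrapper that turns elapsed time into throughput makes the results easier to read side by side. This is only a sketch: `time_call` is a hypothetical helper, and it assumes the `num_rows` and `num_bytes` values computed earlier are passed in.

```python
import time


def time_call(func, num_rows, num_bytes):
    """Run func() once and report elapsed seconds plus rows/s and MB/s throughput."""
    start = time.perf_counter()
    func()
    # Guard against a zero reading on very coarse clocks
    elapsed = max(time.perf_counter() - start, 1e-9)
    return {
        "seconds": elapsed,
        "rows_per_sec": num_rows / elapsed,
        "mb_per_sec": num_bytes / 1024 / 1024 / elapsed,
    }


if __name__ == "__main__":
    # Stand-in workload; in the notebook this would be one of the insert functions
    stats = time_call(lambda: time.sleep(0.05), num_rows=1000, num_bytes=24000)
    print(stats)
```

In the notebook, `time_call(insert_to_arctic_tick, num_rows, num_bytes)` and `time_call(insert_to_arctic_chunk, num_rows, num_bytes)` would give comparable numbers for the two stores. A single run is noisy, so treat the figures as rough.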

In [98]:
raw.memory_usage().sum()

4643601504
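
As a sanity check, this byte count is what you would expect if every row costs three 8-byte values. The per-row layout below (an 8-byte datetime index plus `float64` bid and ask) is my assumption about the dtypes, not something printed by the notebook.

```python
num_rows = 193_483_396   # row count reported when the raw data was loaded
bytes_per_row = 3 * 8    # assumed: 8-byte datetime index + float64 bid + float64 ask

total_bytes = num_rows * bytes_per_row
print(total_bytes)                 # 4643601504
print(total_bytes / 1024 / 1024)   # roughly 4428.48 megabytes
```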