The first few cells just start a local MongoDB instance
===

In [1]:
!mongod --version

db version v3.4.0
git version: f4240c60f005be757399042dc12f6addbc3170c1
OpenSSL version: OpenSSL 1.0.2l  25 May 2017
allocator: system
modules: none
build environment:
    distarch: x86_64
    target_arch: x86_64


In [2]:
# Kill mongod if it's running (makes this notebook easily re-runnable).
# Beware: this kills every mongod process you own, including ones you run for other reasons!
!ps x | grep mongod | grep -v grep | awk '{print $1}' | xargs kill

In [3]:
!cat ./mongo.sh

#!/usr/bin/env bash

rm -rf data
mkdir -p data/db
mongod --dbpath data/db --logpath data/mongo.log &
sleep 5


In [4]:
!./mongo.sh

In [5]:
!tail -10 data/mongo.log

2017-09-19T21:41:42.505-0400 I CONTROL  [initandlisten] 
2017-09-19T21:41:42.506-0400 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2017-09-19T21:41:42.506-0400 I CONTROL  [initandlisten] 
2017-09-19T21:41:42.573-0400 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory 'data/db/diagnostic.data'
2017-09-19T21:41:42.600-0400 I INDEX    [initandlisten] build index on: admin.system.version properties: { v: 2, key: { version: 1 }, name: "incompatible_with_version_32", ns: "admin.system.version" }
2017-09-19T21:41:42.600-0400 I INDEX    [initandlisten] 	 building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2017-09-19T21:41:42.611-0400 I INDEX    [initandlisten] build index done.  scanned 0 total records. 0 secs
2017-09-19T21:41:42.611-0400 I COMMAND  [initandlisten] setting featureCompatibilityVersion to 3.4
2017-09-19T21:41:42.611-0400 I NETWORK  [thread1]

In [6]:
import time

import arctic
import numpy as np
import pandas as pd

Create an example dataframe
===

In [7]:
days = 365*10
n_securities = 3000
df_wide = pd.DataFrame(data=np.random.rand(days, n_securities), index=pd.date_range('2000', periods=days))
df_wide.columns = ['security_{}'.format(i) for i in range(1, n_securities+1)]
df_wide.head()

,security_1,security_2,security_3,security_4,security_5,security_6,security_7,security_8,security_9,security_10,...,security_2991,security_2992,security_2993,security_2994,security_2995,security_2996,security_2997,security_2998,security_2999,security_3000
2000-01-01,0.360503,0.540044,0.175147,0.667505,0.05244,0.320114,0.479942,0.976082,0.452705,0.633895,...,0.522094,0.934895,0.853929,0.766299,0.278149,0.435728,0.415771,0.18901,0.939602,0.086047
2000-01-02,0.116827,0.442737,0.213746,0.512537,0.023199,0.589539,0.825675,0.990114,0.937921,0.541192,...,0.501659,0.993732,0.957384,0.383764,0.800484,0.583847,0.835349,0.596753,0.251528,0.107287
2000-01-03,0.77583,0.290438,0.821741,0.175384,0.065735,0.591823,0.224915,0.689926,0.529868,0.529271,...,0.091219,0.887558,0.211488,0.139867,0.12993,0.650067,0.53768,0.021839,0.36895,0.28927
2000-01-04,0.800512,0.252834,0.024264,0.742498,0.455012,0.157255,0.010719,0.46918,0.811157,0.613518,...,0.608599,0.027016,0.513667,0.033696,0.26222,0.637498,0.121702,0.443866,0.40591,0.688135
2000-01-05,0.295997,0.981768,0.989,0.393383,0.871876,0.518018,0.553047,0.667828,0.29535,0.528792,...,0.485422,0.89518,0.388003,0.862983,0.673129,0.805076,0.198353,0.593167,0.398647,0.621736


Using VersionStore to read/write wide data
===

In [8]:
db = arctic.Arctic("localhost")
print("Libraries: {}".format(db.list_libraries()))
db.initialize_library('libvs1', lib_type='VersionStore')
libvs1 = db['libvs1']
print("Symbols in {}: {}".format('libvs1', libvs1.list_symbols()))

Library created, but couldn't enable sharding: no such command: 'enablesharding', bad cmd: '{ enablesharding: "arctic" }'. This is OK if you're not 'admin'


Libraries: []
Symbols in libvs1: []


In [9]:
def get_size(lib):
    ''' helper to get size of an arctic library in mongo '''
    byts = 0.
    for c in lib._arctic_lib._library_coll.database.collection_names():
        if lib._collection.name in c:
            byts += lib._arctic_lib._library_coll.database.command('collstats', c)['storageSize']
    return '{} megabytes'.format(byts / 1e6)

In [10]:
size_before = get_size(libvs1)
%time libvs1.write('wide_item1', df_wide)
size_after = get_size(libvs1)

print('')
print('Size before: {}'.format(size_before))
print('Size after: {}'.format(size_after))
print('Symbols in {}: {}'.format('libvs1', libvs1.list_symbols()))

CPU times: user 687 ms, sys: 271 ms, total: 959 ms
Wall time: 1.41 s

Size before: 0.024576 megabytes
Size after: 0.024576 megabytes
Symbols in libvs1: ['wide_item1']


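Note: the reported size is identical before and after the write (0.024576 megabytes), which is far more space-efficient than I'd expect for a DataFrame that takes 83.6 MB in memory. Possibly I messed up the size calculation; I will dig a bit deeper when I have a moment.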
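
As a rough sanity check on the figure above, the raw payload of `df_wide` can be worked out by hand. This is only a sketch: it assumes 8 bytes per float64 value, ignores the index and any compression, and the variable names are illustrative.

```python
# Back-of-the-envelope size of df_wide, ignoring the index and any compression.
days = 365 * 10          # rows, as in the notebook
n_securities = 3000      # columns, as in the notebook
bytes_per_value = 8      # float64

raw_bytes = days * n_securities * bytes_per_value
raw_megabytes = raw_bytes / 1e6   # same unit convention as get_size()

reported_megabytes = 0.024576     # "Size after" printed above
print("raw payload: {} megabytes".format(raw_megabytes))
print("raw / reported: {:.0f}x".format(raw_megabytes / reported_megabytes))
```

Uniformly random floats compress poorly, so a ratio in the thousands more likely means the reported figure is not capturing the written data than that the library compressed it that well.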

In [11]:
# note: VersionStore wraps results in a VersionedItem; .data holds the object we want (a DataFrame in this case)
%time rb_wide = libvs1.read('wide_item1').data

CPU times: user 558 ms, sys: 682 ms, total: 1.24 s
Wall time: 1.39 s


In [12]:
rb_wide.head()

index,security_1,security_2,security_3,security_4,security_5,security_6,security_7,security_8,security_9,security_10,...,security_2991,security_2992,security_2993,security_2994,security_2995,security_2996,security_2997,security_2998,security_2999,security_3000
2000-01-01,0.360503,0.540044,0.175147,0.667505,0.05244,0.320114,0.479942,0.976082,0.452705,0.633895,...,0.522094,0.934895,0.853929,0.766299,0.278149,0.435728,0.415771,0.18901,0.939602,0.086047
2000-01-02,0.116827,0.442737,0.213746,0.512537,0.023199,0.589539,0.825675,0.990114,0.937921,0.541192,...,0.501659,0.993732,0.957384,0.383764,0.800484,0.583847,0.835349,0.596753,0.251528,0.107287
2000-01-03,0.77583,0.290438,0.821741,0.175384,0.065735,0.591823,0.224915,0.689926,0.529868,0.529271,...,0.091219,0.887558,0.211488,0.139867,0.12993,0.650067,0.53768,0.021839,0.36895,0.28927
2000-01-04,0.800512,0.252834,0.024264,0.742498,0.455012,0.157255,0.010719,0.46918,0.811157,0.613518,...,0.608599,0.027016,0.513667,0.033696,0.26222,0.637498,0.121702,0.443866,0.40591,0.688135
2000-01-05,0.295997,0.981768,0.989,0.393383,0.871876,0.518018,0.553047,0.667828,0.29535,0.528792,...,0.485422,0.89518,0.388003,0.862983,0.673129,0.805076,0.198353,0.593167,0.398647,0.621736


In [13]:
np.all(rb_wide == df_wide)

True

In [14]:
df_wide.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3650 entries, 2000-01-01 to 2009-12-28
Freq: D
Columns: 3000 entries, security_1 to security_3000
dtypes: float64(3000)
memory usage: 83.6 MB


In [15]:
rb_wide.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3650 entries, 2000-01-01 to 2009-12-28
Columns: 3000 entries, security_1 to security_3000
dtypes: float64(3000)
memory usage: 83.6 MB


In [16]:
# note: rb_wide's index has lost the metadata about its frequency,
# but in this instance it can be inferred. It would still
# be good to change arctic to keep this.
rb_wide.index.inferred_freq

'D'
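
Until the library preserves the frequency, it can be re-attached after reading. The sketch below uses plain pandas on a small stand-in frame (the names `readback` and `restored` are illustrative, and no arctic call is involved); it assumes the index is regular enough for `inferred_freq` to return a value.

```python
import numpy as np
import pandas as pd

# Small stand-in for a frame read back from the library:
# same daily dates, but the index carries no freq.
idx = pd.date_range('2000', periods=10)
readback = pd.DataFrame({'security_1': np.arange(10.0)},
                        index=pd.DatetimeIndex(idx.values))
print(readback.index.freq)

# Re-attach the inferred frequency. Note that asfreq would insert
# NaN rows if any dates were missing from the range.
restored = readback.asfreq(readback.index.inferred_freq)
print(restored.index.freq)
```

If `inferred_freq` returns `None` (irregular dates), this approach does not apply and the frequency would have to be stored separately, for example in the symbol's metadata.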

Using ChunkStore to read/write tall data
===

In [17]:
# reshape wide to tall
%time df_tall = df_wide.stack().reset_index().rename(columns={'level_0': 'date', 'level_1': 'security_id', 0: 'vals'})

df_tall.head()

CPU times: user 639 ms, sys: 430 ms, total: 1.07 s
Wall time: 1.07 s


,date,security_id,vals
0,2000-01-01,security_1,0.360503
1,2000-01-01,security_2,0.540044
2,2000-01-01,security_3,0.175147
3,2000-01-01,security_4,0.667505
4,2000-01-01,security_5,0.05244
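
To make the reshape easier to follow, here is the same `stack` / `reset_index` / `rename` chain on a tiny frame, followed by a pivot back to wide as a check. The 2x3 frame and the names `wide`, `tall` and `back` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Tiny wide frame: 2 dates x 3 securities.
wide = pd.DataFrame(
    np.arange(6.0).reshape(2, 3),
    index=pd.date_range('2000', periods=2),
    columns=['security_1', 'security_2', 'security_3'],
)

# Same reshape as in the cell above: one row per (date, security) pair.
tall = wide.stack().reset_index().rename(
    columns={'level_0': 'date', 'level_1': 'security_id', 0: 'vals'})

# Pivot back to wide to confirm that no values were lost.
back = tall.pivot(index='date', columns='security_id', values='vals')
print(tall)
print(np.array_equal(back.values, wide.values))
```

The tall frame has rows x columns entries (here 6; in the notebook 3650 x 3000 = 10,950,000), and each row repeats the date and the security name, which is why it takes more memory than the wide frame.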


In [18]:
db.initialize_library('libcs1', lib_type='ChunkStoreV1')
libcs = db['libcs1']
print('Symbols in {}: {}'.format('libcs1', libcs.list_symbols()))

Symbols in libcs1: []


In [19]:
size_before = get_size(libcs)
%time libcs.write('tall_item1', df_tall, chunk_size='A')
size_after = get_size(libcs)

print('')
print('Size before: {}'.format(size_before))
print('Size after: {}'.format(size_after))
print('Symbols in {}: {}'.format('libcs1', libcs.list_symbols()))

You can access infer_dtype as pandas.api.types.infer_dtype
  if pd.lib.infer_dtype(a) == 'mixed':


CPU times: user 2.96 s, sys: 1.07 s, total: 4.03 s
Wall time: 5.62 s

Size before: 0.016384 megabytes
Size after: 104.968192 megabytes
Symbols in libcs1: ['tall_item1']
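
For intuition about `chunk_size='A'`: assuming it denotes annual chunks, as the pandas-style alias suggests, the ten-year daily range would be split into one piece per calendar year. The sketch below does not call arctic; it only groups a small tall frame by year in plain pandas, and the names `tall` and `rows_per_year` are illustrative.

```python
import pandas as pd

# Ten years of daily dates, matching the notebook's date range,
# with a single security for brevity.
dates = pd.date_range('2000', periods=365 * 10)
tall = pd.DataFrame({'date': dates, 'security_id': 'security_1', 'vals': 0.0})

# Group rows by calendar year: one group per would-be annual chunk.
rows_per_year = tall.groupby(tall['date'].dt.year).size()
print(rows_per_year)
print("number of annual groups:", len(rows_per_year))
```

How ChunkStore actually lays out each chunk in MongoDB is not shown here; this only illustrates the date ranges an annual split implies.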


In [20]:
%time rb_tall = libcs.read('tall_item1')

rb_tall.head()

CPU times: user 2.54 s, sys: 2.19 s, total: 4.73 s
Wall time: 5.08 s


,date,security_id,vals
0,2000-01-01,security_1,0.360503
1,2000-01-01,security_2,0.540044
2,2000-01-01,security_3,0.175147
3,2000-01-01,security_4,0.667505
4,2000-01-01,security_5,0.05244


In [21]:
np.all(df_tall == rb_tall)

True

In [22]:
df_tall.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10950000 entries, 0 to 10949999
Data columns (total 3 columns):
date           datetime64[ns]
security_id    object
vals           float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 250.6+ MB


In [23]:
rb_tall.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10950000 entries, 0 to 10949999
Data columns (total 3 columns):
date           datetime64[ns]
security_id    object
vals           float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 250.6+ MB


Writing tall data to VersionStore (perf ok, tall is less space efficient than wide format for VersionStore)
===

In [24]:
get_size(libvs1)

'0.024576 megabytes'

In [25]:
%time libvs1.write('tall_item1', df_tall)

CPU times: user 5.83 s, sys: 2.16 s, total: 7.99 s
Wall time: 7.99 s


VersionedItem(symbol=tall_item1,library=arctic.libvs1,data=<class 'NoneType'>,version=1,metadata=None

In [26]:
%time rb_vs_tall = libvs1.read('tall_item1')

CPU times: user 2.78 s, sys: 2.6 s, total: 5.38 s
Wall time: 5.99 s


In [27]:
np.all(df_tall == rb_vs_tall.data)

True

In [28]:
get_size(libvs1)

'0.024576 megabytes'

Writing wide data to ChunkStore (perf bad, not great on space either)
===

In [29]:
get_size(libcs)

'113.000448 megabytes'

In [30]:
df_wide.index.name = 'date'  # chunkstore's date chunker is picky about having an index or column called 'date'

In [31]:
%time libcs.write('wide_item1', df_wide, chunk_size='A')

CPU times: user 39.5 s, sys: 1min 2s, total: 1min 41s
Wall time: 1min 42s


In [32]:
%time rb_cs_wide = libcs.read('wide_item1')

CPU times: user 778 ms, sys: 468 ms, total: 1.25 s
Wall time: 1.39 s


In [33]:
np.all(df_wide == rb_cs_wide)

True

In [34]:
get_size(libcs)

'209.985536 megabytes'
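
To pull the ChunkStore numbers together, the arithmetic below uses only figures copied from the `get_size()` and `%time` outputs above; the variable names are illustrative.

```python
# Figures copied from the outputs above (megabytes and seconds).
cs_tall_growth = 104.968192 - 0.016384     # ChunkStore, tall write
cs_wide_growth = 209.985536 - 113.000448   # ChunkStore, wide write

vs_wide_write_s = 1.41                     # VersionStore, wide write
cs_wide_write_s = 1 * 60 + 42              # ChunkStore, wide write (1min 42s)

print("ChunkStore tall growth: {:.1f} megabytes".format(cs_tall_growth))
print("ChunkStore wide growth: {:.1f} megabytes".format(cs_wide_growth))
print("wide write slowdown, ChunkStore vs VersionStore: {:.0f}x".format(
    cs_wide_write_s / vs_wide_write_s))
```

The reported ChunkStore library size moved from 104.968192 to 113.000448 megabytes between cells 19 and 29 with no ChunkStore write in between, so these deltas should be read as approximate.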