Support for reading multiple symbols with a single query #814

Closed
Saturnix opened this issue Sep 6, 2019 · 7 comments

Saturnix commented Sep 6, 2019

Arctic Version: Latest
Arctic Store: VersionStore

Description of problem and/or code sample that reproduces the issue

Pulling many different symbols (300 or more) in a loop is slow; it can take several seconds:

dfs = []
for i in mylist:
	dfs.append(library.read(i))

This is especially noticeable against a remote MongoDB instance. I suppose this is because each .read() issues a separate query?

My solution is to store the dataframes concatenated:

import pandas as pd

# Write everything once as a single concatenated frame, keyed by symbol name.
con = pd.concat(my_dfs, keys=mylist)
library.write("con", con)

# Read it back with one query and split it per symbol.
con = library.read("con").data
dfs = []
for i in mylist:
    dfs.append(con.loc[i].copy())

Is there a way to do something like this without the performance hit relative to my solution above?

df1, df2, df3 = library.read(["df1", "df2", "df3"])
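
For reference, the closest thing I can do today is a small wrapper that still issues one query per symbol; a minimal sketch, where read_many is just a made-up helper name:

# Not an existing Arctic API: a user-side wrapper that still issues one
# query per symbol, just packaged as a single call.
def read_many(library, symbols):
    return [library.read(symbol).data for symbol in symbols]

df1, df2, df3 = read_many(library, ["df1", "df2", "df3"])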

bmoscon (Collaborator) commented Sep 6, 2019

Parallel reads would work great. A process pool or thread pool should help a lot with this many symbols.
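
As an illustrative sketch (not code from this thread), a thread-pooled read could look like the following, assuming library is a VersionStore handle and symbols is the list of symbol names:

# Sketch only: overlap the per-symbol queries with a thread pool.
# `library` and `symbols` are assumed names, not objects from this thread.
from concurrent.futures import ThreadPoolExecutor

def read_one(symbol):
    # Each call still issues its own query, but the calls run concurrently.
    return library.read(symbol).data

with ThreadPoolExecutor(max_workers=16) as pool:
    dfs = list(pool.map(read_one, symbols))

A process pool works the same way; for an I/O-bound workload like this, threads are usually sufficient.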

Saturnix (Author) commented Sep 6, 2019

Thanks for your reply! I've run some tests. I changed my code to:

from multiprocessing.pool import ThreadPool

p = ThreadPool(300)

def write(_i, _df, _lib):
    # Each dataframe is written under its index as the symbol name.
    _lib.write(str(_i), _df)

arr = []
for i in range(len(mylist)):
    arr.append((i, mylist[i], lib))

pool_output = p.starmap(write, arr)

def read(_i, _lib):
    # Read the symbols back in parallel.
    return _lib.read(str(_i)).data

arr = []
for i in range(len(mylist)):
    arr.append((i, lib))

pool_output = p.starmap(read, arr)

Here are the results, where localhost is a local instance of MongoDB and remote is a MongoDB cluster hosted on Atlas with the free tier.

Concat and singledfs are the 2 solutions in my first post. Threaded is the solution in this post.

localhost
writing threaded: 1.605351
reading threaded: 0.712160
preparing aggregation: 1.238278

writing singledfs: 1.963724
reading singledfs: 0.809319
preparing aggregation: 1.176273

writing concatenated: 3.038349
reading concatenated: 1.831411
preparing aggregation: 1.855426


Remote threaded
writing threaded: 21.979554
reading threaded: 3.176056
preparing aggregation: 0.698157

Remote singledfs
writing singledfs: 15.967014
reading singledfs: 4.898361
preparing aggregation: 0.692155

Remote concat
writing concatenated: 1.685871
reading concatenated: 1.240596
preparing aggregation: 1.031221

The performance boost of the threaded read on localhost compared with the other approaches makes me think the remote results are limited by the free tier rather than by other factors (network overhead and the like). Hopefully I'll get better results with a paid instance; I'll update here if I decide to go this route.

Any input on the matter is very appreciated.

edit: disregard "preparing aggregation". I used that to compare performance between iterating over concat and an array of dfs.

edit2: in a few minutes I'll both test on a better remote instance and publish the complete code so the results can be reproduced...

Saturnix (Author) commented Sep 6, 2019

So, I've upgraded the remote MongoDB instance to a more powerful one and changed the number of txt files from 150(ish) to 280. Here are the benchmarks:

localhost
writing threaded: 1.605351
reading threaded: 0.712160
preparing aggregation: 1.238278

writing singledfs: 1.963724
reading singledfs: 0.809319
preparing aggregation: 1.176273

writing concatenated: 3.038349
reading concatenated: 1.831411
preparing aggregation: 1.855426



remote
writing threaded: 12.210598
reading threaded: 1.907261
preparing aggregation: 1.230276

writing singledfs: 26.320585
reading singledfs: 8.096666
preparing aggregation: 1.181274

writing concatenated: 4.295626
reading concatenated: 2.486687
preparing aggregation: 1.853415

The complete code I'm using, in case someone wants to reproduce, is here: https://github.com/Saturnix/ArcticFiddle

bmoscon (Collaborator) commented Sep 8, 2019

So it seems like some improvement, no?

Saturnix (Author) commented Sep 8, 2019

Yes, it is! Thanks for the tip!

bmoscon (Collaborator) commented Sep 8, 2019

I think the concatenated DFs might get better performance simply because you'll probably get better compression on one much larger dataframe than on many small ones compressed independently.
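
As a rough illustration of that point (not Arctic's actual storage path; Arctic compresses with LZ4, and zlib is used here only to keep the sketch dependency-free), one can compare compressing many small chunks independently against compressing them as one block:

# Illustrative sketch with synthetic data: compare per-chunk compression
# against compressing one concatenated block.
import zlib
import numpy as np

rng = np.random.default_rng(0)
chunks = [rng.normal(size=1000).round(2).tobytes() for _ in range(300)]

many_small = sum(len(zlib.compress(c)) for c in chunks)
one_large = len(zlib.compress(b"".join(chunks)))

print("compressed independently:", many_small, "bytes")
print("compressed as one block: ", one_large, "bytes")

How large the difference is depends entirely on the data; repeated patterns shared across symbols are what a single large block can exploit and many independently compressed small blocks cannot.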

shashank88 (Contributor) commented

@Saturnix closing this issue; feel free to reopen if you have more questions regarding this.
