Support for reading multiple symbols with a single query #814

Closed
Saturnix opened this issue Sep 6, 2019 · 7 comments

Saturnix commented Sep 6, 2019

Arctic Version: Latest
Arctic Store: VersionStore

Description of problem and/or code sample that reproduces the issue

Pulling many different symbols (300 or more) in a loop is slow; it can take several seconds:

dfs = []
for i in mylist:
	dfs.append(library.read(i))

This is especially noticeable against a remote MongoDB instance. I suppose this is because each .read() issues a separate query?

My solution is to store the dataframes concatenated:

import pandas as pd

# Write everything once as a single concatenated frame, keyed by symbol name.
con = pd.concat(my_dfs, keys=mylist)
library.write("con", con)

# Read it back with one query and split it per symbol.
con = library.read("con").data
dfs = []
for i in mylist:
    dfs.append(con.loc[i].copy())

Is there a way to do something like this without the performance hit relative to my solution above?

df1, df2, df3 = library.read(["df1", "df2", "df3"])
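
For reference, the closest thing I can do today is a small wrapper that still issues one query per symbol; a minimal sketch, where read_many is just a made-up helper name:

# Not an existing Arctic API: a user-side wrapper that still issues one
# query per symbol, just packaged as a single call.
def read_many(library, symbols):
    return [library.read(symbol).data for symbol in symbols]

df1, df2, df3 = read_many(library, ["df1", "df2", "df3"])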

bmoscon (Collaborator) commented Sep 6, 2019

Parallel reads would work great. A process pool or thread pool should help a lot with this many symbols.
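
As an illustrative sketch (not code from this thread), a thread-pooled read could look like the following, assuming library is a VersionStore handle and symbols is the list of symbol names:

# Sketch only: overlap the per-symbol queries with a thread pool.
# `library` and `symbols` are assumed names, not objects from this thread.
from concurrent.futures import ThreadPoolExecutor

def read_one(symbol):
    # Each call still issues its own query, but the calls run concurrently.
    return library.read(symbol).data

with ThreadPoolExecutor(max_workers=16) as pool:
    dfs = list(pool.map(read_one, symbols))

A process pool works the same way; for an I/O-bound workload like this, threads are usually sufficient.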

Saturnix (Author) commented Sep 6, 2019

Thanks for your reply! I've run some tests. I changed my code to:

from multiprocessing.pool import ThreadPool

p = ThreadPool(300)

def write(_i, _df, _lib):
    # Each dataframe is written under its index as the symbol name.
    _lib.write(str(_i), _df)

arr = []
for i in range(len(mylist)):
    arr.append((i, mylist[i], lib))

pool_output = p.starmap(write, arr)

def read(_i, _lib):
    # Read the symbols back in parallel.
    return _lib.read(str(_i)).data

arr = []
for i in range(len(mylist)):
    arr.append((i, lib))

pool_output = p.starmap(read, arr)

Here are the results, where localhost is a local instance of MongoDB and remote is a MongoDB cluster hosted on Atlas with the free tier.

Concat and singledfs are the 2 solutions in my first post. Threaded is the solution in this post.

localhost
writing threaded: 1.605351
reading threaded: 0.712160
preparing aggregation: 1.238278

writing singledfs: 1.963724
reading singledfs: 0.809319
preparing aggregation: 1.176273

writing concatenated: 3.038349
reading concatenated: 1.831411
preparing aggregation: 1.855426


Remote threaded
writing threaded: 21.979554
reading threaded: 3.176056
preparing aggregation: 0.698157

Remote singledfs
writing singledfs: 15.967014
reading singledfs: 4.898361
preparing aggregation: 0.692155

Remote concat
writing concatenated: 1.685871
reading concatenated: 1.240596
preparing aggregation: 1.031221

The performance boost of the threaded read on localhost compared with the other approaches makes me think the remote results are limited by the free tier rather than by other factors (network overhead and the like). Hopefully I'll get better results with a paid instance; I'll update here if I decide to go this route.

Any input on the matter is very appreciated.

edit: disregard "preparing aggregation". I used that to compare performance between iterating over concat and an array of dfs.

edit2: in a few minutes I'll both test on a better remote instance and publish the complete code so the results can be reproduced...

Saturnix (Author) commented Sep 6, 2019

So, I've upgraded the remote MongoDB instance to a more powerful one and changed the number of txt files from 150(ish) to 280. Here are the benchmarks:

localhost
writing threaded: 1.605351
reading threaded: 0.712160
preparing aggregation: 1.238278

writing singledfs: 1.963724
reading singledfs: 0.809319
preparing aggregation: 1.176273

writing concatenated: 3.038349
reading concatenated: 1.831411
preparing aggregation: 1.855426



remote
writing threaded: 12.210598
reading threaded: 1.907261
preparing aggregation: 1.230276

writing singledfs: 26.320585
reading singledfs: 8.096666
preparing aggregation: 1.181274

writing concatenated: 4.295626
reading concatenated: 2.486687
preparing aggregation: 1.853415

The complete code I'm using, in case someone wants to reproduce, is here: https://github.com/Saturnix/ArcticFiddle

bmoscon (Collaborator) commented Sep 8, 2019

So it seems like some improvement, no?

Saturnix (Author) commented Sep 8, 2019

Yes, it is! Thanks for the tip!

bmoscon (Collaborator) commented Sep 8, 2019

I think the concatenated DFs might get better performance simply because you'll probably get better compression on one much larger dataframe than on many small ones compressed independently.
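
As a rough illustration of that point (not Arctic's actual storage path; Arctic compresses with LZ4, and zlib is used here only to keep the sketch dependency-free), one can compare compressing many small chunks independently against compressing them as one block:

# Illustrative sketch with synthetic data: compare per-chunk compression
# against compressing one concatenated block.
import zlib
import numpy as np

rng = np.random.default_rng(0)
chunks = [rng.normal(size=1000).round(2).tobytes() for _ in range(300)]

many_small = sum(len(zlib.compress(c)) for c in chunks)
one_large = len(zlib.compress(b"".join(chunks)))

print("compressed independently:", many_small, "bytes")
print("compressed as one block: ", one_large, "bytes")

How large the difference is depends entirely on the data; repeated patterns shared across symbols are what a single large block can exploit and many independently compressed small blocks cannot.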

shashank88 (Contributor) commented

@Saturnix closing this issue; feel free to reopen if you have more questions regarding this.
