
MemoryError when saving a dataframe with large strings to TickStore #810

Closed

alanbogossian opened this issue Aug 4, 2019 · 12 comments

@alanbogossian

alanbogossian commented Aug 4, 2019

Arctic Version

1.79.2

Arctic Store

TickStore

Platform and version

Python 3.6.7, Linux Mint 19 Cinnamon 64-bit

Description of problem and/or code sample that reproduces the issue

Hi,
I'm trying to save the following data:
https://drive.google.com/file/d/1dWWBNvx6vjyNK4kjZTVL4-YM0fmWxT5b/view?usp=sharing

to TickStore, code:
https://pastebin.com/jEqXxq2t

and I'm getting a MemoryError; see the stack traces:
https://pastebin.com/Uy4pYAfH

I'm quite new to Arctic, so I might be doing something wrong; I would appreciate it if you could guide me through this.

Side question:
Considering the nature of my data (two columns: a timestamp and a long string containing a raw JSON message), what is the best way to store it with Arctic?

Thanks,
Alan

@shashank88
Contributor

TickStore is probably not what you want if you are storing strings in your dataset. VersionStore (which is the default) or ChunkStore should be more suitable. I can take a further look at your data later.
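
Something along these lines should work for either store; this is a minimal sketch against a local MongoDB, and the library and symbol names are just examples:

import pandas as pd
from arctic import Arctic, VERSION_STORE, CHUNK_STORE

# Toy stand-in for the real data: a DatetimeIndex plus a string column.
df = pd.DataFrame(
    {'response': ['{"jsonrpc": "2.0"}'] * 3},
    index=pd.date_range('2019-08-03 10:05:36', periods=3, freq='S'),
)
df.index.name = 'date'  # ChunkStore's default chunker looks for a 'date' index

store = Arctic('localhost')  # assumes MongoDB running locally

# VersionStore (the default library type).
store.initialize_library('test.versioned', lib_type=VERSION_STORE)
vlib = store['test.versioned']
vlib.write('FX_BTC_JPY', df)
print(vlib.read('FX_BTC_JPY').data)  # read() returns a VersionedItem

# ChunkStore: stores the frame in date-based chunks.
store.initialize_library('test.chunked', lib_type=CHUNK_STORE)
clib = store['test.chunked']
clib.write('FX_BTC_JPY', df, chunk_size='D')  # one chunk per day
print(clib.read('FX_BTC_JPY'))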

@alanbogossian
Author

alanbogossian commented Aug 9, 2019 via email

@alanbogossian
Author

Hi @shashank88

Just to give a bit more background: I am recording tick by tick market data and saving to arctic TickStore every minute.

I then read (in issue #301) that it is possible to achieve some compression by reading back the data for the entire day and saving it again into Arctic.

That is what I am trying to do here, and I am getting the MemoryError on that last step.
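
For reference, a minimal sketch of that read-back-and-rewrite step (the library and symbol names are placeholders, not my real ones):

from datetime import datetime as dt

import pytz
from arctic import Arctic
from arctic.date import DateRange

store = Arctic('localhost')
lib = store['tick.minutely']  # the TickStore library I write to every minute

# Read the whole day back, delete the minutely buckets, then rewrite in
# one go so TickStore can build larger, better-compressed buckets.
day = DateRange(dt(2019, 8, 3, tzinfo=pytz.utc), dt(2019, 8, 4, tzinfo=pytz.utc))
df = lib.read('FX_BTC_JPY', date_range=day)
lib.delete('FX_BTC_JPY', date_range=day)
lib.write('FX_BTC_JPY', df)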
I would like to confirm whether ChunkStore and VersionStore are appropriate for that.

Also, it would be great if you could look into why we are hitting this memory issue.
I've tried saving slices of my data frame, but the error does not consistently happen on the same row; it seems to occur at random, so I was not able to isolate a row that causes the issue.

Thanks!

@alanbogossian
Author

Hi @shashank88,

I've tried storing with VERSION_STORE (reading all the minute-by-minute input and saving it at once for the whole day), and the disk usage is actually worse than TICK_STORE with input saved every minute.

I've then tried storing with CHUNK_STORE, and this seems to be even worse: either no disk savings on some data (which I have not shared here) or the same MemoryError on the data shared in the first post, although the traceback this time is slightly different:

  File "/home/alan/test.py", line 68, in __init__
    self.store_lib_chunk_symb.write(symbol_, df)
  File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/chunkstore/chunkstore.py", line 357, in write
    data = self.serializer.serialize(record)
  File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/serialization/numpy_arrays.py", line 188, in serialize
    ret = self.converter.docify(df)
  File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/serialization/numpy_arrays.py", line 124, in docify
    raise e
  File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/serialization/numpy_arrays.py", line 119, in docify
    arrays.append(arr.tostring())
MemoryError

@bmoscon
Collaborator

bmoscon commented Aug 11, 2019

You don’t have enough memory for the operation. An entire day of data is likely very large, and pandas is very memory intensive.
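
A rough illustration of why (the numbers here are invented, but in the ballpark of a ~60,000-row frame of large strings): when an object column of strings is serialized to a NumPy array, every row gets padded to the width of the longest string, so a single outsized message inflates the whole allocation.

import pandas as pd

# 59,999 modest messages plus one very large one.
df = pd.DataFrame({'response': ['x' * 2_000] * 59_999 + ['y' * 1_000_000]})

max_len = int(df['response'].str.len().max())
est = len(df) * max_len * 4  # '<U' unicode dtype: 4 bytes per character
print(f'fixed-width array would need ~{est / 1e9:.0f} GB')  # ~240 GB
# ...and arr.tostring() then copies that buffer again.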

@alanbogossian
Author

Hi @bmoscon, thanks for your reply

The tests I ran (on the data provided in the first message) were actually on just two hours of recording.

Also, saving to VERSION_STORE works without issue, as does saving to CSV.
The problematic data frame has fewer than 60,000 rows. The values are large strings though (raw JSON messages coming from the exchanges).

@bmoscon
Collaborator

bmoscon commented Aug 12, 2019

If you want to store that in Arctic you should probably parse the dictionary and store the data in columns. Strings can be problematic, especially very large ones like you have here. If you look at the code, VersionStore operates in a wholly different manner than TickStore and ChunkStore, so it's not surprising that one works and the others don't.
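
For example, something like this; the field names are guesses from your sample output, so adjust them to your actual message schema:

import json

import pandas as pd

# Hypothetical raw messages in the JSON-RPC shape from the sample data.
raw = [
    '{"jsonrpc": "2.0", "method": "channelMessage",'
    ' "params": {"channel": "lightning_board_FX_BTC_JPY",'
    ' "message": {"mid_price": 1226000.0, "bids": [], "asks": []}}}',
]

records = []
for line in raw:
    msg = json.loads(line)['params']['message']
    records.append({
        'mid_price': msg['mid_price'],
        'n_bids': len(msg['bids']),
        'n_asks': len(msg['asks']),
    })

df = pd.DataFrame(records)  # plain numeric columns, which Arctic handles well
print(df.dtypes)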

@alanbogossian
Author

Thanks for your reply.

Candid question (I'm a complete novice in this area): why is storing a string problematic?

I thought about storing the parsed message; however, the raw message may contain data that I currently ignore but might need in the future, so I thought I should store the raw message. Also, if I ever found a bug in my parsing function, it would be safer to have the raw data. So would you recommend not using Arctic if I want to store raw messages? What should I use instead? Or should I avoid storing raw messages altogether?

@bmoscon
Collaborator

bmoscon commented Aug 25, 2019

If you want to store raw data I'd recommend something else, like Redis or memcached.

@alanbogossian
Author

Thanks, Bryant. We want to store historical data spanning several years, so we should probably not use Redis or memcached?

By the way, have you found any issue with the data I tried to save? It is still not clear to me why we got this error in the first place. I understand your comment that we should not store raw messages in Arctic, but I would still be interested to know what caused the error.

@bmoscon
Collaborator

bmoscon commented Aug 25, 2019

It works for me:

In [1]: import arctic

In [2]: import pandas as pd

In [3]: df = pd.read_csv('tick_store_bitflyer_FX_BTC_JPY_lightning_board.csv')

In [4]: df.index = pd.to_datetime(df.index, utc=True)

In [5]: a = arctic.Arctic('127.0.0.1')

In [7]: a.initialize_library('temp-test', arctic.TICK_STORE)

In [8]: lib = a['temp-test']

In [9]: lib.write('testdata', df)
UserWarning: Discarding nonzero nanoseconds in conversion
UserWarning: Discarding nonzero nanoseconds in conversion
  bucket, initial_image = TickStore._pandas_to_bucket(x[i:i + self._chunk_size], symbol, initial_image)
NB treating all values as 'exists' - no longer sparse
FutureWarning: The 'convert_datetime64' parameter is deprecated and will be removed in a future version
  recs = df.to_records(convert_datetime64=False)
FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.
  array = TickStore._ensure_supported_dtypes(recs[col])

In [10]: lib.read('testdata')
 FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(dtype, int):
Out[10]:
                                                 Unnamed: 0                                           response
1969-12-31 19:00:00-05:00  2019-08-03 10:05:36.666000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 10:05:36.777000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 10:05:36.880000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 10:05:37.056000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
...                                                     ...                                                ...
1969-12-31 19:00:00-05:00  2019-08-03 12:14:59.066000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 12:14:59.185000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...

[54679 rows x 2 columns]


@bmoscon
Collaborator

bmoscon commented Sep 8, 2019

Closing this, as it’s not reproducible.

@bmoscon bmoscon closed this as completed Sep 8, 2019