[Feature Request] In-memory database #32

Techno-coder · 2021-05-18T10:14:25Z

Hi there, thanks for making this library! I was wondering if it's possible to add an option for the database to be forcibly created in memory:

class ExecutionContext(object):
    # ...
    def __init__(self, path, schema, auto_commit=True):
        source = sqlite3.connect(str(path))
        self.conn = sqlite3.connect(':memory:')
        source.backup(self.conn)
        # ...

I added this snippet to puchikarui.py and it sped up lookups by about 30-40% (7.7 seconds down to ~4 seconds). Of course, ideally the database would be kept outside of the context construction. I'm currently have reuse of contexts enabled.

The text was updated successfully, but these errors were encountered:

letuananh · 2021-05-18T11:38:25Z

Hi @Techno-coder thank you for the suggestion. I really appreciate it

I think holding the whole database in memory and query from there directly is a great idea. Can you elaborate on how one may use this feature?
Are we building the DB from source XML files into :memory: before use every time and use it? I'm thinking may be we can also dump the pre-built database into an .sql file and insert it in batch mode. May be that will be faster than importing from XML directly.
How about loading the prebuilt database into memory directly? That feature is available in Python 3.7 backup() function but we may have to do something for earlier versions like 3.5 or 3.6.

Anyway this is great. I'd love to work on this feature. Please let me know what you think.

Techno-coder · 2021-05-18T12:32:46Z

At the moment I'm loading the prebuilt database from jamdict-data. The main reason I want the whole database in memory is because I am doing a lot of lookups (in the thousands) and the disk I/O time incurs some significant overhead in lookup time. With regards to startup time I imagine that loading the prebuilt database will be faster than building from the XML files.

This answer on StackOverflow dumps the database into a .sql file as you suggest before loading it into memory. It should work on nearly all Python versions.

letuananh · 2021-05-18T12:42:44Z

I use :memory: quite often when I'm testing Jamdict code so the fucntionality is actually already there. Let me try how to make it convenient I'll update you on this soon. Thanks again for suggesting this feature.

letuananh · 2021-05-19T15:25:51Z

Hi @Techno-coder again.

I have added a new data source type into puchikarui based on your suggestion. The actual code change is in on this line https://github.com/letuananh/puchikarui/blob/ec4f445d29373c98de64e51aee2760cb58ab2258/puchikarui/puchikarui.py#L273

This data source is available from puchikarui 0.2a1 on PyPI: https://pypi.org/project/puchikarui/0.2a1/

To use this feature one can just pass in a MemorySource object as path, for example:

from puchikarui import Database, MemorySource

db = Database(MemorySource("source.db"))
ctx = db.open()  # this ctx will operate on :memory:
...

With this, I added a new kwarg memory_mode to Jamdict. When memory_mode=True Jamdict database will be loaded into memory before use. For example:

jam = Jamdict(memory_mode=True)

On my computer this takes about a minute to load. I haven't tested to see how much the query performance is improved.
I'll get back to you on this.

This change will be available on Jamdict >= 0.1a10. I only deployed it to TestPyPI though: https://test.pypi.org/project/jamdict/0.1a10/ You can install this from Github or TestPyPI

# install from (my) Github repo
pip install git+https://github.com/letuananh/jamdict.git

# install from TestPyPI
pip install -i https://test.pypi.org/simple/ jamdict==0.1a10

I'll test this change a little bit more before releasing it to PyPI. Mean while it's best if you can help me to test it out to see how it works on your side. Thank you.

Techno-coder · 2021-05-19T22:46:33Z

Thank you very much for working on this feature! It looks like it's working as intended on my end.

Without memory_mode enabled:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   174423    3.739    0.000    3.739    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
        1    0.523    0.523    0.766    0.766 vocab.pyx:468(from_disk)
   174423    0.444    0.000    5.494    0.000 puchikarui.py:506(execute)
   174423    0.402    0.000    0.630    0.000 puchikarui.py:300(build_select)
   174423    0.395    0.000    0.625    0.000 puchikarui.py:118(<listcomp>)
   174423    0.350    0.000    7.242    0.000 puchikarui.py:446(select_record)
   178597    0.318    0.000    0.318    0.000 {method 'format' of 'str' objects}
   177570    0.228    0.000    0.530    0.000 __init__.py:1284(getLogger)
    10589    0.183    0.000    8.012    0.001 dictionary.py:10(find_definitions)
     3840    0.182    0.000    6.291    0.002 jmdict_sqlite.py:172(get_entry)

With memory_mode enabled:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   174423    0.798    0.000    0.798    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
        1    0.522    0.522    0.766    0.766 vocab.pyx:468(from_disk)
        1    0.490    0.490    0.490    0.490 {method 'backup' of 'sqlite3.Connection' objects}
   174423    0.371    0.000    0.584    0.000 puchikarui.py:300(build_select)
   174423    0.359    0.000    2.232    0.000 puchikarui.py:506(execute)
   174423    0.304    0.000    3.652    0.000 puchikarui.py:446(select_record)
   178597    0.235    0.000    0.235    0.000 {method 'format' of 'str' objects}
   174423    0.232    0.000    0.420    0.000 puchikarui.py:118(<listcomp>)
   177571    0.209    0.000    0.460    0.000 __init__.py:1284(getLogger)
    10589    0.175    0.000    4.819    0.000 dictionary.py:10(find_definitions)

The time spent in sqlite3 has been reduced by 80%. Load time for me is negligible for both versions (about a second or two).

By the way, I had to enable auto_config=False otherwise it would throw this exception:

  File "/Users/z/.local/share/virtualenvs/yomeru-nlp-HqGNwivf/lib/python3.9/site-packages/jamdict/util.py", line 249, in __init__
    self.jmd_xml_file = jmd_xml_file if jmd_xml_file else config.get_file('JMDICT_XML') if auto_config else None
  File "/Users/z/.local/share/virtualenvs/yomeru-nlp-HqGNwivf/lib/python3.9/site-packages/jamdict/config.py", line 130, in get_file
    _config = read_config()
  File "/Users/z/.local/share/virtualenvs/yomeru-nlp-HqGNwivf/lib/python3.9/site-packages/jamdict/config.py", line 98, in read_config
    __app_config.load()
TypeError: load() missing 1 required positional argument: 'file_path'

Not sure why I'm getting this error since it doesn't look like the code's changed between versions. Maybe I've messed up my environment somehow?

letuananh · 2021-05-20T02:02:18Z

Hi again. Thanks for your performance test log. This looks great.

As for the read_config() it was indeed a bug as config file is no longer mandatory with the release of the jamdict-data package. Thanks for spotting it out. I fixed it but somehow I forgot to deploy the patch. I updated the release on Test PyPI (version 0.1a10.post2) and hopefully this will solve the issue you are encountering.
https://test.pypi.org/project/jamdict/0.1a10.post2/

Techno-coder · 2021-05-20T02:33:41Z

Yep, updating to post2 fixed it :)

letuananh · 2021-05-20T02:38:05Z

Yay, this is great, thanks! I will test it a little bit more, and will deploy it to PyPI soon.

letuananh · 2021-05-21T14:50:14Z

Hi, I have tested Jamdict 0.1a10 on several OSes with different setup and now it's live on PyPI

https://pypi.org/project/jamdict/0.1a10/

If you have any issue please feel free to open this thread to continue or create a new one. Thanks again for suggesting this feature.

letuananh added the enhancement label May 18, 2021

letuananh self-assigned this May 18, 2021

letuananh added this to the jamdict 0.1 official milestone May 18, 2021

letuananh mentioned this issue May 19, 2021

Add feature to load a database into memory to boost up performance letuananh/puchikarui#9

Closed

letuananh closed this as completed May 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] In-memory database #32

[Feature Request] In-memory database #32

Techno-coder commented May 18, 2021

letuananh commented May 18, 2021

Techno-coder commented May 18, 2021

letuananh commented May 18, 2021

letuananh commented May 19, 2021

Techno-coder commented May 19, 2021 •

edited

Loading

letuananh commented May 20, 2021

Techno-coder commented May 20, 2021

letuananh commented May 20, 2021

letuananh commented May 21, 2021

[Feature Request] In-memory database #32

[Feature Request] In-memory database #32

Comments

Techno-coder commented May 18, 2021

letuananh commented May 18, 2021

Techno-coder commented May 18, 2021

letuananh commented May 18, 2021

letuananh commented May 19, 2021

Techno-coder commented May 19, 2021 • edited Loading

letuananh commented May 20, 2021

Techno-coder commented May 20, 2021

letuananh commented May 20, 2021

letuananh commented May 21, 2021

Techno-coder commented May 19, 2021 •

edited

Loading