Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] In-memory database #32

Closed
Techno-coder opened this issue May 18, 2021 · 9 comments
Closed

[Feature Request] In-memory database #32

Techno-coder opened this issue May 18, 2021 · 9 comments
Assignees

Comments

@Techno-coder
Copy link

Hi there, thanks for making this library! I was wondering if it's possible to add an option for the database to be forcibly created in memory:

class ExecutionContext(object):
    # ...
    def __init__(self, path, schema, auto_commit=True):
        source = sqlite3.connect(str(path))
        self.conn = sqlite3.connect(':memory:')
        source.backup(self.conn)
        # ...

I added this snippet to puchikarui.py and it sped up lookups by about 30-40% (7.7 seconds down to ~4 seconds). Of course, ideally the database would be kept outside of the context construction. I'm currently have reuse of contexts enabled.

@letuananh
Copy link
Contributor

Hi @Techno-coder thank you for the suggestion. I really appreciate it

I think holding the whole database in memory and query from there directly is a great idea. Can you elaborate on how one may use this feature?
Are we building the DB from source XML files into :memory: before use every time and use it? I'm thinking may be we can also dump the pre-built database into an .sql file and insert it in batch mode. May be that will be faster than importing from XML directly.
How about loading the prebuilt database into memory directly? That feature is available in Python 3.7 backup() function but we may have to do something for earlier versions like 3.5 or 3.6.

Anyway this is great. I'd love to work on this feature. Please let me know what you think.

@letuananh letuananh self-assigned this May 18, 2021
@Techno-coder
Copy link
Author

At the moment I'm loading the prebuilt database from jamdict-data. The main reason I want the whole database in memory is because I am doing a lot of lookups (in the thousands) and the disk I/O time incurs some significant overhead in lookup time. With regards to startup time I imagine that loading the prebuilt database will be faster than building from the XML files.

This answer on StackOverflow dumps the database into a .sql file as you suggest before loading it into memory. It should work on nearly all Python versions.

@letuananh
Copy link
Contributor

I use :memory: quite often when I'm testing Jamdict code so the fucntionality is actually already there. Let me try how to make it convenient I'll update you on this soon. Thanks again for suggesting this feature.

@letuananh
Copy link
Contributor

Hi @Techno-coder again.

I have added a new data source type into puchikarui based on your suggestion. The actual code change is in on this line https://github.com/letuananh/puchikarui/blob/ec4f445d29373c98de64e51aee2760cb58ab2258/puchikarui/puchikarui.py#L273

This data source is available from puchikarui 0.2a1 on PyPI: https://pypi.org/project/puchikarui/0.2a1/

To use this feature one can just pass in a MemorySource object as path, for example:

from puchikarui import Database, MemorySource

db = Database(MemorySource("source.db"))
ctx = db.open()  # this ctx will operate on :memory:
... 

With this, I added a new kwarg memory_mode to Jamdict. When memory_mode=True Jamdict database will be loaded into memory before use. For example:

jam = Jamdict(memory_mode=True)

On my computer this takes about a minute to load. I haven't tested to see how much the query performance is improved.
I'll get back to you on this.

This change will be available on Jamdict >= 0.1a10. I only deployed it to TestPyPI though: https://test.pypi.org/project/jamdict/0.1a10/ You can install this from Github or TestPyPI

# install from (my) Github repo
pip install git+https://github.com/letuananh/jamdict.git

# install from TestPyPI
pip install -i https://test.pypi.org/simple/ jamdict==0.1a10

I'll test this change a little bit more before releasing it to PyPI. Mean while it's best if you can help me to test it out to see how it works on your side. Thank you.

@Techno-coder
Copy link
Author

Techno-coder commented May 19, 2021

Thank you very much for working on this feature! It looks like it's working as intended on my end.

Without memory_mode enabled:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   174423    3.739    0.000    3.739    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
        1    0.523    0.523    0.766    0.766 vocab.pyx:468(from_disk)
   174423    0.444    0.000    5.494    0.000 puchikarui.py:506(execute)
   174423    0.402    0.000    0.630    0.000 puchikarui.py:300(build_select)
   174423    0.395    0.000    0.625    0.000 puchikarui.py:118(<listcomp>)
   174423    0.350    0.000    7.242    0.000 puchikarui.py:446(select_record)
   178597    0.318    0.000    0.318    0.000 {method 'format' of 'str' objects}
   177570    0.228    0.000    0.530    0.000 __init__.py:1284(getLogger)
    10589    0.183    0.000    8.012    0.001 dictionary.py:10(find_definitions)
     3840    0.182    0.000    6.291    0.002 jmdict_sqlite.py:172(get_entry)

With memory_mode enabled:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   174423    0.798    0.000    0.798    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
        1    0.522    0.522    0.766    0.766 vocab.pyx:468(from_disk)
        1    0.490    0.490    0.490    0.490 {method 'backup' of 'sqlite3.Connection' objects}
   174423    0.371    0.000    0.584    0.000 puchikarui.py:300(build_select)
   174423    0.359    0.000    2.232    0.000 puchikarui.py:506(execute)
   174423    0.304    0.000    3.652    0.000 puchikarui.py:446(select_record)
   178597    0.235    0.000    0.235    0.000 {method 'format' of 'str' objects}
   174423    0.232    0.000    0.420    0.000 puchikarui.py:118(<listcomp>)
   177571    0.209    0.000    0.460    0.000 __init__.py:1284(getLogger)
    10589    0.175    0.000    4.819    0.000 dictionary.py:10(find_definitions)

The time spent in sqlite3 has been reduced by 80%. Load time for me is negligible for both versions (about a second or two).

By the way, I had to enable auto_config=False otherwise it would throw this exception:

  File "/Users/z/.local/share/virtualenvs/yomeru-nlp-HqGNwivf/lib/python3.9/site-packages/jamdict/util.py", line 249, in __init__
    self.jmd_xml_file = jmd_xml_file if jmd_xml_file else config.get_file('JMDICT_XML') if auto_config else None
  File "/Users/z/.local/share/virtualenvs/yomeru-nlp-HqGNwivf/lib/python3.9/site-packages/jamdict/config.py", line 130, in get_file
    _config = read_config()
  File "/Users/z/.local/share/virtualenvs/yomeru-nlp-HqGNwivf/lib/python3.9/site-packages/jamdict/config.py", line 98, in read_config
    __app_config.load()
TypeError: load() missing 1 required positional argument: 'file_path'

Not sure why I'm getting this error since it doesn't look like the code's changed between versions. Maybe I've messed up my environment somehow?

@letuananh
Copy link
Contributor

Hi again. Thanks for your performance test log. This looks great.

As for the read_config() it was indeed a bug as config file is no longer mandatory with the release of the jamdict-data package. Thanks for spotting it out. I fixed it but somehow I forgot to deploy the patch. I updated the release on Test PyPI (version 0.1a10.post2) and hopefully this will solve the issue you are encountering.
https://test.pypi.org/project/jamdict/0.1a10.post2/

@Techno-coder
Copy link
Author

Yep, updating to post2 fixed it :)

@letuananh
Copy link
Contributor

Yay, this is great, thanks! I will test it a little bit more, and will deploy it to PyPI soon.

@letuananh
Copy link
Contributor

Hi, I have tested Jamdict 0.1a10 on several OSes with different setup and now it's live on PyPI

https://pypi.org/project/jamdict/0.1a10/

If you have any issue please feel free to open this thread to continue or create a new one. Thanks again for suggesting this feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants