Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String property max chars when using MLMD with PostgreSQL #194

Open
tarilabs opened this issue Mar 3, 2024 · 1 comment
Open

String property max chars when using MLMD with PostgreSQL #194

tarilabs opened this issue Mar 3, 2024 · 1 comment

Comments

@tarilabs
Copy link
Contributor

tarilabs commented Mar 3, 2024

  • When you attempt to store a string based MLMD property backed by MySQL, MEDIUMTEXT is max ~16 M chars; the indexes make use of the first 255 chars (here) and you can store "~arbitrarily long text" in those string based MLMD properties.
    This is working fine.
  • When instead you attempt to store a string based MLMD property backed by PostgreSQL, TEXT is max ~536 M chars; however, the index by default would limit you to merely ~2690 characters because is not taking the substring approach above (here) and as a net result because of default's btree limit (here) you can store at max ~2690 chars.

To me, this seems to indicate a potential typo/overlook/misconfiguration of the PostreSQL index, because as of today can store in practice only 2k chars of text when MLMD backed by PostgreSQL, when practically MLMD backed by MySQL you could do 1000x magnitude.

Potential solutions provided in PR: #195

@tarilabs
Copy link
Contributor Author

tarilabs commented Mar 3, 2024

Reproducer

With PostgreSQL ready, for instance:

docker run --name some-postgres \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=mypsw \
-e PGUSER=postgres \
-e PGPASSWORD=mypsw \
-p 5432:5432 \
-d "postgres:12"

Run the following Python script:

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2
from faker import Faker
fake = Faker()

def create_store():
    connection_config = metadata_store_pb2.ConnectionConfig()
    connection_config.postgresql.host = 'localhost'
    connection_config.postgresql.port = '5432'
    connection_config.postgresql.user = 'postgres'
    connection_config.postgresql.password = 'mypsw'
    connection_config.postgresql.dbname = 'pgserver'
    # connection_config.postgresql.skip_db_creation = 'false'
    # connection_config.postgresql.ssloption.sslmode = '...' # disable, allow, verify-ca, verify-full, etc.
    # connection_config.postgresql.ssloption.sslcert = '...'
    # connection_config.postgresql.ssloption.sslkey = '...'
    # connection_config.postgresql.ssloption.sslpassword = '...'
    # connection_config.postgresql.ssloption.sslrootcert = '...'
    store = metadata_store.MetadataStore(connection_config, enable_upgrade_migration=True)
    return store

def create_ctx_type(store):
    experiment_type = metadata_store_pb2.ContextType()
    experiment_type.name = "Experiment"
    experiment_type.properties["note"] = metadata_store_pb2.STRING
    experiment_type_id = store.put_context_type(experiment_type)
    results = store.get_context_types()
    print(results)
    return experiment_type_id

def working(experiment_type_id):
    my_experiment = metadata_store_pb2.Context()
    my_experiment.type_id = experiment_type_id
    my_experiment.name = "exp1"
    my_experiment.properties["note"].string_value = "My first experiment."
    [experiment_id] = store.put_contexts([my_experiment])
    print(experiment_id)
    result = store.get_contexts_by_id([experiment_id])
    print(result)

def lorem(max_chars):
    result = ""
    while len(result) < max_chars:
        result += fake.sentence() + " "
    return result[:max_chars]

def not_working(experiment_type_id):
    my_experiment = metadata_store_pb2.Context()
    my_experiment.type_id = experiment_type_id
    my_experiment.name = "exp_longnote"
    my_experiment.properties["note"].string_value = lorem(2690) # up to ~2680 works
    [experiment_id] = store.put_contexts([my_experiment])
    print(experiment_id)
    result = store.get_contexts_by_id([experiment_id])
    print(result)

if __name__ == '__main__':
    store = create_store()
    experiment_type_id = create_ctx_type(store)
    working(experiment_type_id)
    not_working(experiment_type_id)

Resulting in:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0303 16:19:42.571434 13378 postgresql_metadata_source.cc:208] Connecting to database. 
I0303 16:19:42.576880 13378 postgresql_metadata_source.cc:215] Connection to database succeed.
I0303 16:19:43.322311 13378 postgresql_metadata_source.cc:208] Connecting to database. 
I0303 16:19:43.327381 13378 postgresql_metadata_source.cc:215] Connection to database succeed.
[id: 10
name: "Experiment"
properties {
  key: "note"
  value: STRING
}
]
1
[id: 1
type_id: 10
name: "exp1"
properties {
  key: "note"
  value {
    string_value: "My first experiment."
  }
}
type: "Experiment"
create_time_since_epoch: 1709479183548
last_update_time_since_epoch: 1709479183548
]
E0303 16:19:43.555275 13378 postgresql_metadata_source.cc:128] Execution failed: ERROR:  index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_context_property_string"
DETAIL:  Index row references tuple (0,2) in relation "contextproperty".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
WARNING:absl:mlmd client InternalError: PostgreSQL metadata source error: ERROR:  index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_context_property_string"
DETAIL:  Index row references tuple (0,2) in relation "contextproperty".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

Traceback (most recent call last):
  File "/home/tarilabs/git/demo20231009-mlmdpg/python/test3.py", line 61, in <module>
    not_working(experiment_type_id)
  File "/home/tarilabs/git/demo20231009-mlmdpg/python/test3.py", line 52, in not_working
    [experiment_id] = store.put_contexts([my_experiment])
  File "/home/tarilabs/git/demo20231009-mlmdpg/venv/lib64/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py", line 520, in put_contexts
    self._call('PutContexts', request, response)
  File "/home/tarilabs/git/demo20231009-mlmdpg/venv/lib64/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py", line 203, in _call
    return self._call_method(method_name, request, response)
  File "/home/tarilabs/git/demo20231009-mlmdpg/venv/lib64/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py", line 224, in _call_method
    self._pywrap_cc_call(cc_method, request, response)
  File "/home/tarilabs/git/demo20231009-mlmdpg/venv/lib64/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py", line 255, in _pywrap_cc_call
    raise errors.make_exception(error_message.decode('utf-8'), status_code)
ml_metadata.errors.InternalError: PostgreSQL metadata source error: ERROR:  index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_context_property_string"
DETAIL:  Index row references tuple (0,2) in relation "contextproperty".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

Demonstrating is potentially a typo/overlook/misconfiguration of the Index which is limiting the amount of chars which can be stored in the MLMD string property:

PostgreSQL metadata source error: ERROR: index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_context_property_string"
DETAIL: Index row references tuple (0,2) in relation "contextproperty".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant