fix(clickhouse): make arrays non nullable #8501

Merged


mneedham
Contributor

Description of changes

This PR makes ClickHouse arrays non-nullable. I hit the issue while converting text to an array of float values as part of a RAG pipeline, which failed with this error:

File ~/Library/Caches/pypoetry/virtualenvs/ibis-clickhouse-ZMCVYiHj-py3.11/lib/python3.11/site-packages/clickhouse_connect/driver/httpclient.py:361, in HttpClient._error_handler(self, response, retried)
    359     err_msg = common.format_error(err_content.decode(errors='backslashreplace'))
    360     err_str = f':{err_str}\n {err_msg}'
--> 361 raise OperationalError(err_str) if retried else DatabaseError(err_str) from None

DatabaseError: :HTTPDriver for http://localhost:8123 returned response code 500)
 Code: 43. DB::Exception: Nested type Array(Nullable(Float64)) cannot be inside Nullable type. (ILLEGAL_TYPE_OF_ARGUMENT) (version 24.2.1.1933 (official build))

I think treating arrays the way that maps are treated should handle this.
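To illustrate the idea, here is a minimal sketch of a type renderer that never wraps arrays in `Nullable`, while still allowing nullable elements. The `Scalar`, `Array`, and `to_clickhouse` names are hypothetical and exist only for this example; this is not the actual Ibis implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scalar:
    """Hypothetical scalar type, e.g. Float64 or String."""
    name: str
    nullable: bool = True


@dataclass(frozen=True)
class Array:
    """Hypothetical array type; its own nullability is ignored when rendering."""
    value_type: "Scalar | Array"
    nullable: bool = True


def to_clickhouse(dtype) -> str:
    """Render a ClickHouse type string, never emitting Nullable(Array(...))."""
    if isinstance(dtype, Array):
        # Only the element type may be wrapped in Nullable, not the array itself.
        return f"Array({to_clickhouse(dtype.value_type)})"
    return f"Nullable({dtype.name})" if dtype.nullable else dtype.name


print(to_clickhouse(Array(Scalar("Float64"))))  # Array(Nullable(Float64))
```

With this approach the array column is declared as `Array(Nullable(Float64))` instead of `Nullable(Array(Nullable(Float64)))`, which is the type the server rejected in the error above.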

Contributor

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

@mneedham changed the title from "ClickHouse: Make arrays non nullable" to "fix: ClickHouse: Make arrays non nullable" on Feb 29, 2024
@lostmygithubaccount changed the title from "fix: ClickHouse: Make arrays non nullable" to "fix(clickhouse): make arrays non nullable" on Feb 29, 2024
@mneedham force-pushed the clickhouse-arrays-not-nullable branch 5 times, most recently from 2bb0f1f to f27f842, on February 29, 2024 15:55
@cpcloud
Member

cpcloud commented Feb 29, 2024

@mneedham Thanks for the PR!

I can take care of fixing up the tests there.

@cpcloud added the "clickhouse" label (The ClickHouse backend) on Feb 29, 2024
@cpcloud added this to the 9.0 milestone on Feb 29, 2024
@cpcloud added the "bug" label (Incorrect behavior inside of ibis) on Feb 29, 2024
@cpcloud
Member

cpcloud commented Feb 29, 2024

Seems like I can't push to the PR, so I'll try to make suggestions instead.

@cpcloud
Member

cpcloud commented Feb 29, 2024

@mneedham Can you apply this patch to your PR?

diff --git a/ibis/backends/tests/test_array.py b/ibis/backends/tests/test_array.py
index 584c4ac3e..156c31bda 100644
--- a/ibis/backends/tests/test_array.py
+++ b/ibis/backends/tests/test_array.py
@@ -938,8 +938,8 @@ def flatten_data():
             marks=[
                 pytest.mark.notyet(
                     ["clickhouse"],
-                    reason="doesn't support nullable array elements",
-                    raises=ClickHouseDatabaseError,
+                    reason="Arrays are never nullable",
+                    raises=AssertionError,
                 )
             ],
         ),
@@ -950,8 +950,8 @@ def flatten_data():
             marks=[
                 pytest.mark.notyet(
                     ["clickhouse"],
-                    reason="doesn't support nullable array elements",
-                    raises=ClickHouseDatabaseError,
+                    reason="Arrays are never nullable",
+                    raises=AssertionError,
                 )
             ],
         ),

@mneedham force-pushed the clickhouse-arrays-not-nullable branch from f27f842 to d40a8e5 on February 29, 2024 16:18
@mneedham
Contributor Author

@cpcloud I've applied it.

Member

@cpcloud left a comment

Thanks!

@cpcloud
Member

cpcloud commented Feb 29, 2024

@mneedham Just out of curiosity are Nullable(Array(T)) types ever going to be implemented in ClickHouse? Or is that something that is "done" for the foreseeable future?

@cpcloud enabled auto-merge (squash) on February 29, 2024 16:21
@mneedham
Contributor Author

@mneedham Just out of curiosity are Nullable(Array(T)) types ever going to be implemented in ClickHouse? Or is that something that is "done" for the foreseeable future?

As far as I know, there aren't any plans to add that functionality in the near future. I think people usually use an empty array for null use cases.

I see someone did try to implement it, but that seems to have stalled - ClickHouse/ClickHouse#53443
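As a rough illustration of the empty-array convention, the sketch below normalizes missing arrays to `[]` before rows are written. The `null_to_empty` helper is hypothetical and not part of Ibis or any ClickHouse client.

```python
from typing import Optional, Sequence


def null_to_empty(values: Optional[Sequence[float]]) -> list[float]:
    """Map a missing array to an empty one; copy anything else into a list."""
    return [] if values is None else list(values)


rows = [[0.1, 0.2], None, []]
cleaned = [null_to_empty(row) for row in rows]
print(cleaned)  # [[0.1, 0.2], [], []]
```

One consequence of this convention is that a missing array and a genuinely empty array become indistinguishable after the conversion.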

@cpcloud merged commit 1caf6de into ibis-project:main on Feb 29, 2024
74 checks passed
@mneedham
Contributor Author

mneedham commented Mar 1, 2024

@cpcloud hey - not sure where to ask this question, so gonna ask it here. I have this script:

import os
import ibis
import ollama

from dotenv import load_dotenv
from pathlib import Path

import pandas as pd
import ibis.expr.datatypes as dt

rag_con = ibis.connect("clickhouse://localhost")


for table in rag_con.list_tables():
    rag_con.drop_table(table)

rag_con.list_tables()


table_name = "docs5"
for filepath in Path("/Users/markhneedham/projects/clickhouse-docs/docs").glob("**/*.md"):
    contents = filepath.read_text()

    data = {
        "filepath": [str(filepath).split("docs/")[-1]],
        "contents": [contents],
    }

    t = pd.DataFrame(data)
    schema = ibis.schema(
        names=["filepath", "contents"], 
        types=[dt.String(nullable=False), dt.String(nullable=False)]
    )

    if table_name not in rag_con.list_tables():
        rag_con.create_table(name=table_name, obj=t, schema=schema)
    else:
        rag_con.insert(name=table_name, obj=t)

rag_con.list_tables()


t = rag_con.table(table_name)

t = (
    t.mutate(tokens_estimate=t["contents"].length() // 4)
    .order_by(ibis._["tokens_estimate"].desc())
    .relocate("filepath", "tokens_estimate")
)
t

def _embed(text: str) -> list[float]:
    """Text to fixed-length array embedding."""
    text = text.replace("\n", " ")
    try:
        return (
            ollama.embeddings(model='nomic-embed-text', prompt=text)['embedding']
        )
    except Exception as e:
        print(e)
    return [0.0] * 768


@ibis.udf.scalar.python
def embed(text: str, tokens_estimate: int) -> list[float]:
    """Text to fixed-length array embedding."""
    if 0 < tokens_estimate < 8191:
        return _embed(text)
    return [0.0] * 768


t = t.mutate(
    embedding=embed(t["contents"], t["tokens_estimate"])
).cache()
t

I'm now running against ibis-framework from the GitHub main branch, and I get this error:

Traceback (most recent call last):
  File "/Users/markhneedham/projects/examples/LearnClickHouseWithMark/ibis/app_ch.py", line 101, in <module>
    ).cache()
      ^^^^^^^
  File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/ibis-clickhouse-ZMCVYiHj-py3.11/lib/python3.11/site-packages/ibis/expr/types/relations.py", line 3414, in cache
    return current_backend._cached(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/ibis-clickhouse-ZMCVYiHj-py3.11/lib/python3.11/site-packages/ibis/backends/__init__.py", line 1128, in _cached
    self._query_cache.store(expr)
  File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/ibis-clickhouse-ZMCVYiHj-py3.11/lib/python3.11/site-packages/ibis/common/caching.py", line 122, in store
    self.populate(name, input)
  File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/ibis-clickhouse-ZMCVYiHj-py3.11/lib/python3.11/site-packages/ibis/backends/sql/__init__.py", line 216, in _load_into_cache
    self.create_table(name, expr, schema=expr.schema(), temp=True)
  File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/ibis-clickhouse-ZMCVYiHj-py3.11/lib/python3.11/site-packages/ibis/backends/clickhouse/__init__.py", line 722, in create_table
    self.con.raw_query(sql, external_data=external_data)
  File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/ibis-clickhouse-ZMCVYiHj-py3.11/lib/python3.11/site-packages/clickhouse_connect/driver/httpclient.py", line 472, in raw_query
    return self._raw_request(body, params, fields=fields).data
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/ibis-clickhouse-ZMCVYiHj-py3.11/lib/python3.11/site-packages/clickhouse_connect/driver/httpclient.py", line 437, in _raw_request
    self._error_handler(response)
  File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/ibis-clickhouse-ZMCVYiHj-py3.11/lib/python3.11/site-packages/clickhouse_connect/driver/httpclient.py", line 361, in _error_handler
    raise OperationalError(err_str) if retried else DatabaseError(err_str) from None
clickhouse_connect.driver.exceptions.DatabaseError: :HTTPDriver for http://localhost:8123 returned response code 404)
 Code: 46. DB::Exception: Unknown function embed_0: While processing filepath, CAST(floor(length(contents) / 4), 'Nullable(Int64)') AS tokens_estimate, contents, embed_0(contents, CAST(floor(length(contents) / 4), 'Nullable(Int64)')) AS embedding. (UNKNOWN_FUNCTION) (version 24.2.1.1933 (official build))

How do I go about debugging that? Is there a way to see a list of the functions that have been registered? I can't tell whether my function failed to register or whether it's registered with an unexpected signature.
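One way to check what the server knows about is to query ClickHouse's `system.functions` table. The sketch below is only an assumption about how that lookup could be wrapped: `find_functions` is a hypothetical helper, the `origin` column is assumed to be available on the server version in use, and `client` is assumed to behave like a `clickhouse_connect` client whose `query` method returns an object with `result_rows`.

```python
from typing import Any

# Lookup against ClickHouse's system.functions table. The `origin` column is
# assumed to exist on the server version being queried.
LIST_FUNCTIONS_SQL = (
    "SELECT name, origin FROM system.functions "
    "WHERE name ILIKE {pattern:String} ORDER BY name"
)


def find_functions(client: Any, pattern: str) -> list[tuple[str, str]]:
    """Return (name, origin) pairs for functions whose name matches `pattern`.

    `client` is assumed to expose `query(sql, parameters=...)` and to return
    an object with a `result_rows` attribute, as clickhouse_connect does.
    """
    result = client.query(LIST_FUNCTIONS_SQL, parameters={"pattern": pattern})
    return [(name, origin) for name, origin in result.result_rows]
```

For example, `find_functions(clickhouse_connect.get_client(host="localhost"), "embed%")` would list any functions whose names start with `embed`. An empty list would suggest that nothing named `embed_0` was ever registered on the server.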

@cpcloud
Member

cpcloud commented Mar 1, 2024

@mneedham I will move these comments to an issue. That's the best place for bug reports. Alternatively, a GitHub Discussion would also be fine!
