LanceDB's only_if("col IN ('x', 'y')") fails if col is of type dictionary #7002

@valkum

LanceDB's only_if("col IN ('x', 'y')") fails if col has a dictionary type, because safe_coerce_scalar is missing a Dictionary arm.

We have a LanceDB dataset with a dictionary-encoded column that we would like to use in a filter.

The call chain is:

Query::only_if("etld IN ('com', 'de')").execute_query
 -> Scanner::create_plan
 -> Scanner::create_filter_plan
 -> ExprFilter::to_datafusion
 -> Planner::parse_filter
 -> resolve_expr
 -> coerce_expr
 -> resolve_value
 -> safe_coerce_scalar

safe_coerce_scalar has no match arm for dictionary types, so the literals in the IN list cannot be coerced to the column's type.
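To illustrate what such an arm might do, here is a minimal sketch on a toy model. It is not Lance's code: ToyType, ToyScalar, coerce_literal, and coerce_in_list are hypothetical stand-ins for arrow's DataType, DataFusion's ScalarValue, and safe_coerce_scalar. The assumption is that a literal compared against a dictionary column can be coerced against the dictionary's value type. Whether the real fix should also wrap the result in a dictionary scalar is not something this issue states.

```rust
// Toy model only: simplified stand-ins, not the real arrow `DataType`
// or DataFusion `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int16,
    Int64,
    Utf8,
    // (key type, value type), mirroring the shape of DataType::Dictionary.
    Dictionary(Box<ToyType>, Box<ToyType>),
}

#[derive(Debug, Clone, PartialEq)]
enum ToyScalar {
    Int64(i64),
    Utf8(String),
}

// Hypothetical coercion helper: returns None when the literal cannot be
// represented in the target type.
fn coerce_literal(value: &ToyScalar, target: &ToyType) -> Option<ToyScalar> {
    match (value, target) {
        (ToyScalar::Utf8(_), ToyType::Utf8) => Some(value.clone()),
        (ToyScalar::Int64(_), ToyType::Int64) => Some(value.clone()),
        // The kind of arm the issue reports as missing: ignore the key type
        // and coerce the literal against the dictionary's value type.
        (_, ToyType::Dictionary(_key_type, value_type)) => coerce_literal(value, value_type),
        // Every other combination is treated as not coercible in this sketch.
        _ => None,
    }
}

// Coerces every literal of an IN list; fails as a whole if any literal fails.
fn coerce_in_list(values: &[ToyScalar], target: &ToyType) -> Option<Vec<ToyScalar>> {
    values.iter().map(|v| coerce_literal(v, target)).collect()
}
```

In this model, coercing the literals 'com' and 'de' against Dictionary(Int16, Utf8) succeeds because the Dictionary arm delegates to the Utf8 case. Without that arm, the same call would fall through to the catch-all and return None.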

Test:

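// Builds a 30-row dataset with a single dictionary-encoded string column.
// Assumes the enclosing test module already imports Arc, ArrowSchema, ArrowField,
// DataType, StringArray, RecordBatch, RecordBatchIterator, and Dataset.
async fn dictionary_string_dataset() -> Dataset {
    use arrow_array::{Int16Array, Int16DictionaryArray};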

    let schema = Arc::new(ArrowSchema::new(vec![ArrowField::new(
        "etld",
        DataType::Dictionary(Box::new(DataType::Int16), Box::new(DataType::Utf8)),
        false,
    )]));

    // Three distinct values, cycled over 30 rows, so each value appears 10 times.
    let dictionary = Arc::new(StringArray::from(vec!["a", "b", "c"]));
    let indices = Int16Array::from((0..30).map(|i| i % 3).collect::<Vec<_>>());
    let dict_array = Int16DictionaryArray::try_new(indices, dictionary).unwrap();

    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(dict_array)]).unwrap();
    let reader = RecordBatchIterator::new(vec![Ok(batch)], schema.clone());
    Dataset::write(reader, "memory://test_dict_filter", None)
        .await
        .unwrap()
}


#[tokio::test]
async fn test_filter_on_dictionary_string_column() {
    let dataset = dictionary_string_dataset().await;

    // Equality predicate.
    let count = dataset
        .scan()
        .filter("etld = 'a'")
        .unwrap()
        .try_into_batch()
        .await
        .unwrap()
        .num_rows();
    assert_eq!(count, 10);

    // IN-list predicate.
    let count = dataset
        .scan()
        .filter("etld IN ('a', 'b')")
        .unwrap()
        .try_into_batch()
        .await
        .unwrap()
        .num_rows();
    assert_eq!(count, 20);
}
