Enhance support for parameterised SQL query datasets #1089

Closed
jstammers opened this issue Dec 3, 2021 · 6 comments
Labels: Issue: Feature Request (New feature or improvement to existing feature)

Comments

@jstammers
Contributor

Description

It would be useful to extend SQLQueryDataset to support parameterised queries whose parameters can be supplied at runtime.

Context

With the current SQLQueryDataset, parameters can be used in some cases. For example, parameters can be passed in from globals.yml:

# globals.yml
foo: bar
# catalog.yml

sql:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM table WHERE column = ${foo}"

However, when these values are loaded from a YAML file, the string representation of the corresponding Python object is used. This is a problem for lists, as the following would not produce a valid SQL query:

# globals.yml
foo:
  - a
  - b
  - c
# catalog.yml

sql:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM table WHERE column in ${foo};" # parsed as "SELECT * FROM table WHERE column in ['a', 'b', 'c'];"

Possible Implementation

The jinjasql package provides utilities for rendering SQL templates written in Jinja syntax.

from typing import Tuple

import jinjasql


def render_template_query(template: str, **kwargs) -> Tuple[str, list]:
    """Renders a query from a SQL template.

    Examples
    --------
    Additional keyword arguments are used to fill in parameters
    >>> template = "SELECT * FROM table WHERE column = {{foo}}"
    >>> query, params = render_template_query(template, foo='bar')

    Columns and tables must be marked as 'sqlsafe' to be parameterised.
    >>> template = "SELECT * FROM {{table | sqlsafe}} WHERE {{col | sqlsafe}} = {{foo}}"
    >>> query, params = render_template_query(template, table='mytable', col='column', foo='bar')

    Collections of values can be used to parameterise in-clauses
    >>> template = "SELECT * FROM table WHERE column in {{values | inclause}}"
    >>> query, params = render_template_query(template, values=['a','b','c'])
    """
    # pyodbc uses qmark parameter syntax, so the bind parameters come back as a list
    jinja = jinjasql.JinjaSql(param_style="qmark")
    query, params = jinja.prepare_query(template, kwargs)
    return query, params

The rendered query and parameters can then be passed to pandas.read_sql_query as follows:

import pandas as pd

con = ...  # DBAPI connection (e.g. pyodbc) or SQLAlchemy connectable
template = "SELECT * FROM table WHERE column in {{values | inclause}}"
query, params = render_template_query(template, values=['a', 'b', 'c'])
df = pd.read_sql_query(query, con=con, params=params)

From a configuration point of view, it might be useful to add a keyword to SQLQueryDataset to make it explicit that the query is a template rather than a valid SQL string, e.g.

sql:
  type: pandas.SQLQueryDataset
  template: "SELECT * FROM table WHERE column in {{foo | inclause}};" 

I haven't given much thought yet as to how this could take values from runtime parameters. I think it would require some additional validation, e.g. checking that every parameter in the template has been given a value.
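For illustration, here is a rough sketch of what such a dataset could look like, written as a standalone custom dataset rather than a change to SQLQueryDataset itself. The class name, constructor arguments and validation logic below are hypothetical; it assumes jinjasql and Kedro's AbstractDataSet base class, and that the credentials entry provides a connection string as it does for SQLQueryDataset.

from typing import Any, Dict

import jinjasql
import pandas as pd
from jinja2 import Environment, meta
from kedro.io import AbstractDataSet


class TemplatedSQLQueryDataSet(AbstractDataSet):
    """Hypothetical dataset that renders a jinjasql template at load time."""

    def __init__(self, template: str, credentials: Dict[str, Any], params: Dict[str, Any] = None):
        self._template = template
        self._con = credentials["con"]  # connection string, as for SQLQueryDataset
        self._params = params or {}

    def _validate_params(self) -> None:
        # Check that every placeholder in the template has been given a value.
        ast = Environment().parse(self._template)
        missing = meta.find_undeclared_variables(ast) - set(self._params)
        if missing:
            raise ValueError(f"Missing template parameters: {sorted(missing)}")

    def _load(self) -> pd.DataFrame:
        self._validate_params()
        # qmark assumes a driver with positional '?' placeholders (e.g. pyodbc, sqlite);
        # adjust param_style to match the database driver in use.
        query, bind_params = jinjasql.JinjaSql(param_style="qmark").prepare_query(
            self._template, self._params
        )
        return pd.read_sql_query(query, con=self._con, params=bind_params)

    def _save(self, data: pd.DataFrame) -> None:
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self) -> Dict[str, Any]:
        return {"template": self._template, "params": self._params}

A catalog entry would then reference this class via type: and supply template, params and credentials keys; injecting runtime parameters would amount to resolving params before the dataset is instantiated.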

@jstammers added the Issue: Feature Request label on Dec 3, 2021
@antonymilne
Contributor

Super easy solution: don't use a yaml list, just a literal string, which makes the SQL query valid:

# globals.yml
foo: "('a', 'b', 'c')"

This is undoubtedly quite ugly and not a very general solution, though. However, I think the existing Jinja2 support in TemplatedConfigLoader might already enable you to do at least some of the fancier things that jinjasql offers.
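For example, something along these lines might build the in-clause at configuration-loading time. This is only a sketch: it assumes Jinja2 rendering of the catalog is enabled via TemplatedConfigLoader and I haven't tested it, so treat the exact syntax as illustrative.

# catalog.yml -- rendered by the Jinja2 pass before being parsed as YAML
{% set values = ["a", "b", "c"] %}
sql:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM table WHERE column in ({% for v in values %}'{{ v }}'{% if not loop.last %}, {% endif %}{% endfor %})"

which would render to sql: "SELECT * FROM table WHERE column in ('a', 'b', 'c')".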

@datajoely will be able to advise better here, but when it comes to templating SQL queries I think you may be straying into the territory of dbt rather than kedro. This is something they do really well. We're hoping to improve SQL support in kedro, but I think it's very feasible (even recommended?) to use dbt as well as kedro if you're trying to do relatively complex SQL stuff.

@datajoely
Contributor

I think currently we're taking the view that SQL in Kedro should be mostly about extraction and saving, rather than transformation and business logic. It took me a while to be convinced of the reasoning, but it can best be explained as this:

We want to limit the number of 'out of DAG' operations that make pipelines difficult to reproduce.

This isn't 100% adhered to, since we support things like UPSERTs and INSERTs in some cases. In general, SQL isn't Kedro's strong point, and Kedro is best used when you need DataFrame-like APIs in the Python world.

@jstammers
Contributor Author

Thanks for the tips. I think your suggestion @AntonyMilneQB should work for my particular use-case.
I can see the reasoning behind limiting the usage of SQL in Kedro. Perhaps I need to look into dbt and see if that's more suitable.

@merelcht
Member

Hi @jstammers, do you need more help with this issue, or can we close it?

@jstammers
Contributor Author

Hi @MerelTheisenQB, please feel free to close this issue. I've been able to resolve my problem elsewhere.

@merelcht
Member

Okay great! I'm glad it's resolved 🙂
