Enhance support for parameterised SQL query datasets #1089

Closed
jstammers opened this issue Dec 3, 2021 · 6 comments
Labels: Issue: Feature Request (New feature or improvement to existing feature)

Comments

@jstammers
Contributor

Description

It would be useful to extend SQLQueryDataset to support parameterised queries whose parameters can be supplied at runtime.

Context

With the current SQLQueryDataset, parameters can be used in some cases. For example, parameters can be passed in from globals.yml:

# globals.yml
foo: bar
# catalog.yml

sql:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM table WHERE column = ${foo}"

However, when these values are loaded from a YAML file, the string representation of the corresponding Python object is used. This is a problem for lists, as the following would not produce a valid SQL query:

# globals.yml
foo:
  - a
  - b
  - c
# catalog.yml

sql:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM table WHERE column in ${foo};" # parsed as "SELECT * FROM table WHERE column in ['a', 'b', 'c'];"

Possible Implementation

The jinjasql package provides utilities for rendering SQL templates written in Jinja syntax.

from typing import Tuple

import jinjasql


def render_template_query(template: str, **kwargs) -> Tuple[str, list]:
    """Renders a query from a SQL template.

    Examples
    --------
    Additional keyword arguments are used to fill in parameters
    >>> template = "SELECT * FROM table WHERE column = {{foo}}"
    >>> query, params = render_template_query(template, foo='bar')

    Columns and tables must be marked as 'sqlsafe' to be parameterised.
    >>> template = "SELECT * FROM {{table | sqlsafe}} WHERE {{col | sqlsafe}} = {{foo}}"
    >>> query, params = render_template_query(template, table='mytable', col='column', foo='bar')

    Collections of values can be used to parameterise in-clauses
    >>> template = "SELECT * FROM table WHERE column in {{values | inclause}}"
    >>> query, params = render_template_query(template, values=['a','b','c'])
    """
    # pyodbc uses qmark parameter syntax, so the bind parameters come back as a list
    jinja = jinjasql.JinjaSql(param_style="qmark")
    query, params = jinja.prepare_query(template, kwargs)
    return query, params

The rendered query and parameters can then be passed to pandas.read_sql_query as follows:

import pandas as pd

con = ...  # DBAPI connection (e.g. pyodbc) or SQLAlchemy connectable
template = "SELECT * FROM table WHERE column in {{values | inclause}}"
query, params = render_template_query(template, values=['a', 'b', 'c'])
df = pd.read_sql_query(query, con=con, params=params)

From a configuration point of view, it might be useful to add a keyword to SQLQueryDataset to make it explicit that the query is a template rather than a valid SQL string, e.g.

sql:
  type: pandas.SQLQueryDataset
  template: "SELECT * FROM table WHERE column in {{foo | inclause}};" 

I haven't given much thought yet as to how this could take values from runtime parameters. I think it would require some additional validation, e.g. checking that every parameter in the template has been given a value.
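For illustration, here is a rough sketch of what such a dataset could look like, written as a standalone custom dataset rather than a change to SQLQueryDataset itself. The class name, constructor arguments and validation logic below are hypothetical; it assumes jinjasql and Kedro's AbstractDataSet base class, and that the credentials entry provides a connection string as it does for SQLQueryDataset.

from typing import Any, Dict

import jinjasql
import pandas as pd
from jinja2 import Environment, meta
from kedro.io import AbstractDataSet


class TemplatedSQLQueryDataSet(AbstractDataSet):
    """Hypothetical dataset that renders a jinjasql template at load time."""

    def __init__(self, template: str, credentials: Dict[str, Any], params: Dict[str, Any] = None):
        self._template = template
        self._con = credentials["con"]  # connection string, as for SQLQueryDataset
        self._params = params or {}

    def _validate_params(self) -> None:
        # Check that every placeholder in the template has been given a value.
        ast = Environment().parse(self._template)
        missing = meta.find_undeclared_variables(ast) - set(self._params)
        if missing:
            raise ValueError(f"Missing template parameters: {sorted(missing)}")

    def _load(self) -> pd.DataFrame:
        self._validate_params()
        # qmark assumes a driver with positional '?' placeholders (e.g. pyodbc, sqlite);
        # adjust param_style to match the database driver in use.
        query, bind_params = jinjasql.JinjaSql(param_style="qmark").prepare_query(
            self._template, self._params
        )
        return pd.read_sql_query(query, con=self._con, params=bind_params)

    def _save(self, data: pd.DataFrame) -> None:
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self) -> Dict[str, Any]:
        return {"template": self._template, "params": self._params}

A catalog entry would then reference this class via type: and supply template, params and credentials keys; injecting runtime parameters would amount to resolving params before the dataset is instantiated.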

@jstammers added the Issue: Feature Request label on Dec 3, 2021
@antonymilne
Contributor

Super easy solution: don't use a yaml list, just a literal string, which makes the SQL query valid:

# globals.yml
foo: "('a', 'b', 'c')"

This is undoubtedly quite ugly and not a very general solution, though. However, I think the existing Jinja2 support in TemplatedConfigLoader might already enable you to do at least some of the fancier things that jinjasql offers.
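For example, something along these lines might build the in-clause at configuration-loading time. This is only a sketch: it assumes Jinja2 rendering of the catalog is enabled via TemplatedConfigLoader and I haven't tested it, so treat the exact syntax as illustrative.

# catalog.yml -- rendered by the Jinja2 pass before being parsed as YAML
{% set values = ["a", "b", "c"] %}
sql:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM table WHERE column in ({% for v in values %}'{{ v }}'{% if not loop.last %}, {% endif %}{% endfor %})"

which would render to sql: "SELECT * FROM table WHERE column in ('a', 'b', 'c')".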

@datajoely will be able to advise better here, but when it comes to templating SQL queries I think you may be straying into the territory of dbt rather than kedro. This is something they do really well. We're hoping to improve SQL support in kedro, but I think it's very feasible (even recommended?) to use dbt as well as kedro if you're trying to do relatively complex SQL stuff.

@datajoely
Contributor

I think currently we're taking the view that SQL in Kedro should be mostly about extraction and saving, rather than transformation and business logic. It took me a while to be convinced of the reasoning, but it can best be explained as this:

We want to limit the number of 'out of DAG' operations that make pipelines difficult to reproduce.

This isn't 100% adhered to, since we support things like UPSERTs and INSERTs in some cases. In general, SQL isn't Kedro's strong point, and Kedro is best used when you need DataFrame-like APIs in the Python world.

@jstammers
Contributor Author

Thanks for the tips. I think your suggestion @AntonyMilneQB should work for my particular use-case.
I can see the reasoning behind limiting the usage of SQL in Kedro. Perhaps I need to look into dbt and see if that's more suitable.

@merelcht
Member

Hi @jstammers, do you need more help with this issue, or can we close it?

@jstammers
Contributor Author

Hi @MerelTheisenQB, please feel free to close this issue. I've been able to resolve my problem elsewhere.

@merelcht
Member

Okay great! I'm glad it's resolved 🙂
