bug: regression: slow to generate SQL, generated SQL is much slower #8484

Closed
1 task done
NickCrews opened this issue Feb 27, 2024 · 37 comments · Fixed by #8592
Labels: bug (incorrect behavior inside of ibis), performance (issues related to ibis's performance), regression (issues related to things that used to work but don't anymore)

NickCrews commented Feb 27, 2024

What happened?

I have a very complicated ibis expression.

On 8.0.0:

print(ibis.to_sql(expr)) # 0.1 sec
expr.cache()  # 1.5 sec

and I get the following SQL:

WITH t0 AS (
  SELECT
    t9.prefix AS prefix,
    CASE
      WHEN (
        TRIM(
          LOWER(
            NULLIF(
              TRIM(
                REGEXP_REPLACE(
                  REGEXP_REPLACE(REGEXP_REPLACE(t9.first_name, '[^\\x00-\\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                  '\\s+',
                  ' ',
                  'g'
                ),
                ' 	

'
              ),
              ''
            )
          ),
          ' 	

'
        ) = TRIM(
          LOWER(
            NULLIF(
              TRIM(
                REGEXP_REPLACE(
                  REGEXP_REPLACE(REGEXP_REPLACE(t9.last_name, '[^\\x00-\\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                  '\\s+',
                  ' ',
                  'g'
                ),
                ' 	

'
              ),
              ''
            )
          ),
          ' 	

'
        )
        AND COALESCE(
          ARRAY_LENGTH(
            STR_SPLIT(
              NULLIF(
                TRIM(
                  REGEXP_REPLACE(
                    TRIM(
                      LOWER(
                        NULLIF(
                          TRIM(
                            REGEXP_REPLACE(
                              REGEXP_REPLACE(REGEXP_REPLACE(t9.first_name, '[^\\x00-\\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                              '\\s+',
                              ' ',
                              'g'
                            ),
                            ' 	

'
                          ),
                          ''
                        )
                      ),
                      ' 	

'
                    ),
                    '\\s+',
                    ' ',
                    'g'
                  ),
                  ' 	

'
                ),
                ''
              ),
              ' '
            )
          ),
          CAST(0 AS TINYINT)
        ) = CAST(1 AS TINYINT)
        AND COALESCE(
          ARRAY_LENGTH(
            STR_SPLIT(
              NULLIF(
                TRIM(
                  REGEXP_REPLACE(
                    TRIM(
                      LOWER(
                        NULLIF(
                          TRIM(
                            REGEXP_REPLACE(
                              REGEXP_REPLACE(REGEXP_REPLACE(t9.last_name, '[^\\x00-\\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                              '\\s+',
                              ' ',
                              'g'
                            ),
                            ' 	

'
                          ),
                          ''
                        )
                      ),
                      ' 	

'
                    ),
                    '\\s+',
                    ' ',
                    'g'
                  ),
                  ' 	

'
                ),
                ''
              ),
              ' '
            )
          ),
          CAST(0 AS TINYINT)
        ) = CAST(1 AS TINYINT)
      )
      THEN NULL
      ELSE t9.first_name
    END AS first_name,
    t9.middle_name AS middle_name,
    t9.last_name AS last_name,
    t9.suffix AS suffix,
    t9.nickname AS nickname
  FROM main.ibis_cache_m5utd5pimjfz7jmp2ojvapbyim AS t9
), t1 AS (
  SELECT
    t0.prefix AS prefix,
    t0.first_name AS first_name,
    t0.middle_name AS middle_name,
    t0.last_name AS last_name,
    t0.suffix AS suffix,
    t0.nickname AS nickname,
    STR_SPLIT(t0.prefix, ' ') AS prefix_tokens,
    STR_SPLIT(t0.first_name, ' ') AS first_name_tokens,
    STR_SPLIT(t0.middle_name, ' ') AS middle_name_tokens,
    STR_SPLIT(t0.last_name, ' ') AS last_name_tokens,
    STR_SPLIT(t0.suffix, ' ') AS suffix_tokens,
    STR_SPLIT(t0.nickname, ' ') AS nickname_tokens
  FROM t0
), t2 AS (
  SELECT
    t1.prefix AS prefix,
    t1.first_name AS first_name,
    t1.middle_name AS middle_name,
    t1.last_name AS last_name,
    t1.suffix AS suffix,
    t1.nickname AS nickname,
    t1.prefix_tokens AS prefix_tokens,
    t1.first_name_tokens AS first_name_tokens,
    t1.middle_name_tokens AS middle_name_tokens,
    t1.last_name_tokens AS last_name_tokens,
    t1.suffix_tokens AS suffix_tokens,
    t1.nickname_tokens AS nickname_tokens,
    CASE
      WHEN (
        ARRAY_LENGTH(t1.prefix_tokens) = CAST(1 AS TINYINT)
      )
      THEN UPPER(t1.prefix)
      ELSE NULL
    END AS prefix_single,
    CASE
      WHEN (
        ARRAY_LENGTH(t1.first_name_tokens) = CAST(1 AS TINYINT)
      )
      THEN UPPER(t1.first_name)
      ELSE NULL
    END AS first_name_single,
    CASE
      WHEN (
        ARRAY_LENGTH(t1.middle_name_tokens) = CAST(1 AS TINYINT)
      )
      THEN UPPER(t1.middle_name)
      ELSE NULL
    END AS middle_name_single,
    CASE
      WHEN (
        ARRAY_LENGTH(t1.last_name_tokens) = CAST(1 AS TINYINT)
      )
      THEN UPPER(t1.last_name)
      ELSE NULL
    END AS last_name_single,
    CASE
      WHEN (
        ARRAY_LENGTH(t1.suffix_tokens) = CAST(1 AS TINYINT)
      )
      THEN UPPER(t1.suffix)
      ELSE NULL
    END AS suffix_single,
    CASE
      WHEN (
        ARRAY_LENGTH(t1.nickname_tokens) = CAST(1 AS TINYINT)
      )
      THEN UPPER(t1.nickname)
      ELSE NULL
    END AS nickname_single
  FROM t1
), t3 AS (
  SELECT
    t2.prefix AS prefix,
    t2.first_name AS first_name,
    t2.middle_name AS middle_name,
    t2.last_name AS last_name,
    t2.suffix AS suffix,
    t2.nickname AS nickname,
    t2.prefix_tokens AS prefix_tokens,
    t2.first_name_tokens AS first_name_tokens,
    t2.middle_name_tokens AS middle_name_tokens,
    t2.last_name_tokens AS last_name_tokens,
    t2.suffix_tokens AS suffix_tokens,
    t2.nickname_tokens AS nickname_tokens,
    t2.prefix_single AS prefix_single,
    t2.first_name_single AS first_name_single,
    t2.middle_name_single AS middle_name_single,
    t2.last_name_single AS last_name_single,
    t2.suffix_single AS suffix_single,
    t2.nickname_single AS nickname_single,
    CAST([t2.first_name_single, t2.middle_name_single, t2.last_name_single, t2.suffix_single, t2.nickname_single] AS TEXT[]) AS singles_except_prefix,
    CAST([t2.prefix_single, t2.middle_name_single, t2.last_name_single, t2.suffix_single, t2.nickname_single] AS TEXT[]) AS singles_except_first_name,
    CAST([t2.prefix_single, t2.first_name_single, t2.last_name_single, t2.suffix_single, t2.nickname_single] AS TEXT[]) AS singles_except_middle_name,
    CAST([t2.prefix_single, t2.first_name_single, t2.middle_name_single, t2.suffix_single, t2.nickname_single] AS TEXT[]) AS singles_except_last_name,
    CAST([t2.prefix_single, t2.first_name_single, t2.middle_name_single, t2.last_name_single, t2.nickname_single] AS TEXT[]) AS singles_except_suffix,
    CAST([t2.prefix_single, t2.first_name_single, t2.middle_name_single, t2.last_name_single, t2.suffix_single] AS TEXT[]) AS singles_except_nickname
  FROM t2
), t4 AS (
  SELECT
    t3.prefix AS prefix,
    t3.first_name AS first_name,
    t3.middle_name AS middle_name,
    t3.last_name AS last_name,
    t3.suffix AS suffix,
    t3.nickname AS nickname,
    t3.prefix_tokens AS prefix_tokens,
    t3.first_name_tokens AS first_name_tokens,
    t3.middle_name_tokens AS middle_name_tokens,
    t3.last_name_tokens AS last_name_tokens,
    t3.suffix_tokens AS suffix_tokens,
    t3.nickname_tokens AS nickname_tokens,
    t3.prefix_single AS prefix_single,
    t3.first_name_single AS first_name_single,
    t3.middle_name_single AS middle_name_single,
    t3.last_name_single AS last_name_single,
    t3.suffix_single AS suffix_single,
    t3.nickname_single AS nickname_single,
    t3.singles_except_prefix AS singles_except_prefix,
    t3.singles_except_first_name AS singles_except_first_name,
    t3.singles_except_middle_name AS singles_except_middle_name,
    t3.singles_except_last_name AS singles_except_last_name,
    t3.singles_except_suffix AS singles_except_suffix,
    t3.singles_except_nickname AS singles_except_nickname,
    LIST_FILTER(
      t3.prefix_tokens,
      __ibis_param_token__ -> NOT ARRAY_CONTAINS(t3.singles_except_prefix, UPPER(__ibis_param_token__))
    ) AS prefix_tokens_filtered,
    LIST_FILTER(
      t3.first_name_tokens,
      __ibis_param_token__ -> NOT ARRAY_CONTAINS(t3.singles_except_first_name, UPPER(__ibis_param_token__))
    ) AS first_name_tokens_filtered,
    LIST_FILTER(
      t3.middle_name_tokens,
      __ibis_param_token__ -> NOT ARRAY_CONTAINS(t3.singles_except_middle_name, UPPER(__ibis_param_token__))
    ) AS middle_name_tokens_filtered,
    LIST_FILTER(
      t3.last_name_tokens,
      __ibis_param_token__ -> NOT ARRAY_CONTAINS(t3.singles_except_last_name, UPPER(__ibis_param_token__))
    ) AS last_name_tokens_filtered,
    LIST_FILTER(
      t3.suffix_tokens,
      __ibis_param_token__ -> NOT ARRAY_CONTAINS(t3.singles_except_suffix, UPPER(__ibis_param_token__))
    ) AS suffix_tokens_filtered,
    LIST_FILTER(
      t3.nickname_tokens,
      __ibis_param_token__ -> NOT ARRAY_CONTAINS(t3.singles_except_nickname, UPPER(__ibis_param_token__))
    ) AS nickname_tokens_filtered
  FROM t3
), t5 AS (
  SELECT
    ARRAY_AGGR(t4.prefix_tokens_filtered, 'string_agg', ' ') AS prefix,
    ARRAY_AGGR(t4.first_name_tokens_filtered, 'string_agg', ' ') AS first_name,
    ARRAY_AGGR(t4.middle_name_tokens_filtered, 'string_agg', ' ') AS middle_name,
    ARRAY_AGGR(t4.last_name_tokens_filtered, 'string_agg', ' ') AS last_name,
    ARRAY_AGGR(t4.suffix_tokens_filtered, 'string_agg', ' ') AS suffix,
    ARRAY_AGGR(t4.nickname_tokens_filtered, 'string_agg', ' ') AS nickname,
    t4.prefix_tokens AS prefix_tokens,
    t4.first_name_tokens AS first_name_tokens,
    t4.middle_name_tokens AS middle_name_tokens,
    t4.last_name_tokens AS last_name_tokens,
    t4.suffix_tokens AS suffix_tokens,
    t4.nickname_tokens AS nickname_tokens,
    t4.prefix_single AS prefix_single,
    t4.first_name_single AS first_name_single,
    t4.middle_name_single AS middle_name_single,
    t4.last_name_single AS last_name_single,
    t4.suffix_single AS suffix_single,
    t4.nickname_single AS nickname_single,
    t4.singles_except_prefix AS singles_except_prefix,
    t4.singles_except_first_name AS singles_except_first_name,
    t4.singles_except_middle_name AS singles_except_middle_name,
    t4.singles_except_last_name AS singles_except_last_name,
    t4.singles_except_suffix AS singles_except_suffix,
    t4.singles_except_nickname AS singles_except_nickname,
    t4.prefix_tokens_filtered AS prefix_tokens_filtered,
    t4.first_name_tokens_filtered AS first_name_tokens_filtered,
    t4.middle_name_tokens_filtered AS middle_name_tokens_filtered,
    t4.last_name_tokens_filtered AS last_name_tokens_filtered,
    t4.suffix_tokens_filtered AS suffix_tokens_filtered,
    t4.nickname_tokens_filtered AS nickname_tokens_filtered
  FROM t4
), t6 AS (
  SELECT
    t5.prefix AS prefix,
    t5.first_name AS first_name,
    t5.middle_name AS middle_name,
    t5.last_name AS last_name,
    t5.suffix AS suffix,
    t5.nickname AS nickname
  FROM t5
), t7 AS (
  SELECT
    TRIM(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              ARRAY_AGGR(
                LIST_APPLY(
                  STR_SPLIT(REGEXP_REPLACE(t6.prefix, '[^a-zA-Z \\-'']', '', 'g'), ' '),
                  __ibis_param_t__ -> CASE
                    WHEN (
                      LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(2 AS TINYINT)
                      AND LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^a-z]', '', 'g')) >= CAST(1 AS TINYINT)
                      OR LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(1 AS TINYINT)
                      AND REGEXP_MATCHES(__ibis_param_t__, '[a-z]')
                    )
                    THEN __ibis_param_t__
                    ELSE CONCAT(UPPER(SUBSTR(__ibis_param_t__, 1, 1)), LOWER(SUBSTR(__ibis_param_t__, 2)))
                  END
                ),
                'string_agg',
                ' '
              ),
              '\\s+',
              ' ',
              'g'
            ),
            '(\\W) (\\W)',
            '\\1\\2',
            'g'
          ),
          '(\\w) (\\W)',
          '\\1\\2',
          'g'
        ),
        '(\\W) (\\w)',
        '\\1\\2',
        'g'
      ),
      ' 	

'
    ) AS prefix,
    TRIM(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              ARRAY_AGGR(
                LIST_APPLY(
                  STR_SPLIT(REGEXP_REPLACE(t6.first_name, '[^a-zA-Z \\-'']', '', 'g'), ' '),
                  __ibis_param_t__ -> CASE
                    WHEN (
                      LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(2 AS TINYINT)
                      AND LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^a-z]', '', 'g')) >= CAST(1 AS TINYINT)
                      OR LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(1 AS TINYINT)
                      AND REGEXP_MATCHES(__ibis_param_t__, '[a-z]')
                    )
                    THEN __ibis_param_t__
                    ELSE CONCAT(UPPER(SUBSTR(__ibis_param_t__, 1, 1)), LOWER(SUBSTR(__ibis_param_t__, 2)))
                  END
                ),
                'string_agg',
                ' '
              ),
              '\\s+',
              ' ',
              'g'
            ),
            '(\\W) (\\W)',
            '\\1\\2',
            'g'
          ),
          '(\\w) (\\W)',
          '\\1\\2',
          'g'
        ),
        '(\\W) (\\w)',
        '\\1\\2',
        'g'
      ),
      ' 	

'
    ) AS first_name,
    TRIM(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              ARRAY_AGGR(
                LIST_APPLY(
                  STR_SPLIT(REGEXP_REPLACE(t6.middle_name, '[^a-zA-Z \\-'']', '', 'g'), ' '),
                  __ibis_param_t__ -> CASE
                    WHEN (
                      LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(2 AS TINYINT)
                      AND LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^a-z]', '', 'g')) >= CAST(1 AS TINYINT)
                      OR LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(1 AS TINYINT)
                      AND REGEXP_MATCHES(__ibis_param_t__, '[a-z]')
                    )
                    THEN __ibis_param_t__
                    ELSE CONCAT(UPPER(SUBSTR(__ibis_param_t__, 1, 1)), LOWER(SUBSTR(__ibis_param_t__, 2)))
                  END
                ),
                'string_agg',
                ' '
              ),
              '\\s+',
              ' ',
              'g'
            ),
            '(\\W) (\\W)',
            '\\1\\2',
            'g'
          ),
          '(\\w) (\\W)',
          '\\1\\2',
          'g'
        ),
        '(\\W) (\\w)',
        '\\1\\2',
        'g'
      ),
      ' 	

'
    ) AS middle_name,
    TRIM(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              ARRAY_AGGR(
                LIST_APPLY(
                  STR_SPLIT(REGEXP_REPLACE(t6.last_name, '[^a-zA-Z \\-'']', '', 'g'), ' '),
                  __ibis_param_t__ -> CASE
                    WHEN (
                      LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(2 AS TINYINT)
                      AND LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^a-z]', '', 'g')) >= CAST(1 AS TINYINT)
                      OR LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(1 AS TINYINT)
                      AND REGEXP_MATCHES(__ibis_param_t__, '[a-z]')
                    )
                    THEN __ibis_param_t__
                    ELSE CONCAT(UPPER(SUBSTR(__ibis_param_t__, 1, 1)), LOWER(SUBSTR(__ibis_param_t__, 2)))
                  END
                ),
                'string_agg',
                ' '
              ),
              '\\s+',
              ' ',
              'g'
            ),
            '(\\W) (\\W)',
            '\\1\\2',
            'g'
          ),
          '(\\w) (\\W)',
          '\\1\\2',
          'g'
        ),
        '(\\W) (\\w)',
        '\\1\\2',
        'g'
      ),
      ' 	

'
    ) AS last_name,
    TRIM(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              ARRAY_AGGR(
                LIST_APPLY(
                  STR_SPLIT(REGEXP_REPLACE(t6.suffix, '[^a-zA-Z \\-'']', '', 'g'), ' '),
                  __ibis_param_t__ -> CASE
                    WHEN (
                      LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(2 AS TINYINT)
                      AND LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^a-z]', '', 'g')) >= CAST(1 AS TINYINT)
                      OR LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(1 AS TINYINT)
                      AND REGEXP_MATCHES(__ibis_param_t__, '[a-z]')
                    )
                    THEN __ibis_param_t__
                    ELSE CONCAT(UPPER(SUBSTR(__ibis_param_t__, 1, 1)), LOWER(SUBSTR(__ibis_param_t__, 2)))
                  END
                ),
                'string_agg',
                ' '
              ),
              '\\s+',
              ' ',
              'g'
            ),
            '(\\W) (\\W)',
            '\\1\\2',
            'g'
          ),
          '(\\w) (\\W)',
          '\\1\\2',
          'g'
        ),
        '(\\W) (\\w)',
        '\\1\\2',
        'g'
      ),
      ' 	

'
    ) AS suffix,
    TRIM(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              ARRAY_AGGR(
                LIST_APPLY(
                  STR_SPLIT(REGEXP_REPLACE(t6.nickname, '[^a-zA-Z \\-'']', '', 'g'), ' '),
                  __ibis_param_t__ -> CASE
                    WHEN (
                      LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(2 AS TINYINT)
                      AND LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^a-z]', '', 'g')) >= CAST(1 AS TINYINT)
                      OR LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(1 AS TINYINT)
                      AND REGEXP_MATCHES(__ibis_param_t__, '[a-z]')
                    )
                    THEN __ibis_param_t__
                    ELSE CONCAT(UPPER(SUBSTR(__ibis_param_t__, 1, 1)), LOWER(SUBSTR(__ibis_param_t__, 2)))
                  END
                ),
                'string_agg',
                ' '
              ),
              '\\s+',
              ' ',
              'g'
            ),
            '(\\W) (\\W)',
            '\\1\\2',
            'g'
          ),
          '(\\w) (\\W)',
          '\\1\\2',
          'g'
        ),
        '(\\W) (\\w)',
        '\\1\\2',
        'g'
      ),
      ' 	

'
    ) AS nickname
  FROM t6
)
SELECT
  t8.prefix,
  t8.first_name,
  CASE
    WHEN (
      COALESCE(t8.nickname LIKE CONCAT(t8.middle_name, '%'), CAST(FALSE AS BOOLEAN))
      AND NOT t8.first_name LIKE CONCAT(t8.middle_name, '%')
    )
    THEN t8.nickname
    ELSE t8.middle_name
  END AS middle_name,
  t8.last_name,
  t8.suffix,
  t8.nickname
FROM (
  SELECT
    CASE
      WHEN (
        UPPER(t7.prefix) = 'JR'
      )
      THEN 'Jr'
      WHEN (
        UPPER(t7.prefix) = 'SR'
      )
      THEN 'Sr'
      WHEN (
        UPPER(t7.prefix) = 'PHD'
      )
      THEN 'PhD'
      WHEN (
        UPPER(t7.prefix) = 'MR'
      )
      THEN 'Mr'
      WHEN (
        UPPER(t7.prefix) = 'MRS'
      )
      THEN 'Mrs'
      WHEN (
        UPPER(t7.prefix) = 'MS'
      )
      THEN 'Ms'
      WHEN (
        UPPER(t7.prefix) = 'DR'
      )
      THEN 'Dr'
      ELSE UPPER(t7.prefix)
    END AS prefix,
    t7.first_name AS first_name,
    t7.middle_name AS middle_name,
    t7.last_name AS last_name,
    CASE
      WHEN (
        UPPER(t7.suffix) = 'JR'
      )
      THEN 'Jr'
      WHEN (
        UPPER(t7.suffix) = 'SR'
      )
      THEN 'Sr'
      WHEN (
        UPPER(t7.suffix) = 'PHD'
      )
      THEN 'PhD'
      WHEN (
        UPPER(t7.suffix) = 'MR'
      )
      THEN 'Mr'
      WHEN (
        UPPER(t7.suffix) = 'MRS'
      )
      THEN 'Mrs'
      WHEN (
        UPPER(t7.suffix) = 'MS'
      )
      THEN 'Ms'
      WHEN (
        UPPER(t7.suffix) = 'DR'
      )
      THEN 'Dr'
      ELSE UPPER(t7.suffix)
    END AS suffix,
    t7.nickname AS nickname
  FROM t7
) AS t8
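
The query above stages each step as a CTE (`t0` through `t7`), so every intermediate column is written once and later referred to by name. A rough way to see why that matters for query size is the toy model below. It is only a sketch and not how ibis compiles anything: `Step`, `inlined_size`, `cte_size`, and all the numbers are hypothetical and chosen purely for illustration. It compares the text size of a pipeline when each reference to the previous step is a short name with the size when each reference is replaced by the previous step's full text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One stage of a pipeline (hypothetical toy model, not an ibis class)."""
    name: str      # label of the stage, e.g. "t3"
    refs: int      # how many times this stage mentions the previous stage
    own_size: int  # characters of SQL this stage contributes by itself


def inlined_size(steps):
    """Total size when every mention of the previous stage is expanded in place."""
    size = 0  # the base table is treated as free in this model
    for step in steps:
        # Each mention drags in a full copy of everything built so far.
        size = step.own_size + step.refs * size
    return size


def cte_size(steps, ref_size=10):
    """Total size when each stage is defined once and mentioned by a short name."""
    return sum(step.own_size + step.refs * ref_size for step in steps)


# Made-up pipeline: eight stages, most of which mention the previous one 3 times.
steps = [Step("t0", refs=1, own_size=50)] + [
    Step(f"t{i}", refs=3, own_size=50) for i in range(1, 8)
]

print("staged (CTE-like) size:", cte_size(steps))
print("fully inlined size:    ", inlined_size(steps))
```

In this model the staged size grows linearly with the number of stages, while the inlined size is multiplied by `refs` at every stage, which is the kind of growth to expect whenever a reused intermediate column is expanded at each use.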

On main (b5e6373):

print(ibis.to_sql(expr)) # 1 sec
expr.cache()  # 8.4 sec

and I get the following SQL. The actual output is much longer, but GitHub won't let me post something that long. You can still see that it uses nested expressions rather than CTEs:

SELECT
  CASE
    WHEN UPPER(
      TRIM(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                ARRAY_TO_STRING(
                  LIST_APPLY(
                    STR_SPLIT(
                      REGEXP_REPLACE(
                        ARRAY_TO_STRING(
                          LIST_FILTER(
                            STR_SPLIT("t0"."prefix", ' '),
                            __ibis_param_token__ -> NOT (
                              ARRAY_CONTAINS(
                                [CASE
                                  WHEN ARRAY_LENGTH(
                                    STR_SPLIT(
                                      CASE
                                        WHEN (
                                          (
                                            TRIM(
                                              LOWER(
                                                NULLIF(
                                                  TRIM(
                                                    REGEXP_REPLACE(
                                                      REGEXP_REPLACE(REGEXP_REPLACE("t0"."first_name", '[^\x00-\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                                                      '\s+',
                                                      ' ',
                                                      'g'
                                                    ),
                                                    ' 	

'
                                                  ),
                                                  ''
                                                )
                                              ),
                                              ' 	

'
                                            ) = TRIM(
                                              LOWER(
                                                NULLIF(
                                                  TRIM(
                                                    REGEXP_REPLACE(
                                                      REGEXP_REPLACE(REGEXP_REPLACE("t0"."last_name", '[^\x00-\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                                                      '\s+',
                                                      ' ',
                                                      'g'
                                                    ),
                                                    ' 	

'
                                                  ),
                                                  ''
                                                )
                                              ),
                                              ' 	

'
                                            )
                                          )
                                          AND (
                                            COALESCE(
                                              ARRAY_LENGTH(
                                                STR_SPLIT(
                                                  NULLIF(
                                                    TRIM(
                                                      REGEXP_REPLACE(
                                                        TRIM(
                                                          LOWER(
                                                            NULLIF(
                                                              TRIM(
                                                                REGEXP_REPLACE(
                                                                  REGEXP_REPLACE(REGEXP_REPLACE("t0"."first_name", '[^\x00-\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                                                                  '\s+',
                                                                  ' ',
                                                                  'g'
                                                                ),
                                                                ' 	

'
                                                              ),
                                                              ''
                                                            )
                                                          ),
                                                          ' 	

'
                                                        ),
                                                        '\s+',
                                                        ' ',
                                                        'g'
                                                      ),
                                                      ' 	

'
                                                    ),
                                                    ''
                                                  ),
                                                  ' '
                                                )
                                              ),
                                              CAST(0 AS TINYINT)
                                            ) = CAST(1 AS TINYINT)
                                          )
                                        )
                                        AND (
                                          COALESCE(
                                            ARRAY_LENGTH(
                                              STR_SPLIT(
                                                NULLIF(
                                                  TRIM(
                                                    REGEXP_REPLACE(
                                                      TRIM(
                                                        LOWER(
                                                          NULLIF(
                                                            TRIM(
                                                              REGEXP_REPLACE(
                                                                REGEXP_REPLACE(REGEXP_REPLACE("t0"."last_name", '[^\x00-\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                                                                '\s+',
                                                                ' ',
                                                                'g'
                                                              ),
                                                              ' 	

'
                                                            ),
                                                            ''
                                                          )
                                                        ),
                                                        ' 	

'
                                                      ),
                                                      '\s+',
                                                      ' ',
                                                      'g'
                                                    ),
                                                    ' 	

'
                                                  ),
                                                  ''
                                                ),
                                                ' '
                                              )
                                            ),
                                            CAST(0 AS TINYINT)
                                          ) = CAST(1 AS TINYINT)
                                        )
                                        THEN NULL
                                        ELSE "t0"."first_name"
                                      END,
                                      ' '
                                    )
                                  ) = CAST(1 AS TINYINT)
                                  THEN UPPER(
                                    CASE
                                      WHEN (
                                        (
                                          TRIM(
                                            LOWER(
                                              NULLIF(
                                                TRIM(
                                                  REGEXP_REPLACE(
                                                    REGEXP_REPLACE(REGEXP_REPLACE("t0"."first_name", '[^\x00-\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                                                    '\s+',
                                                    ' ',
                                                    'g'
                                                  ),
                                                  ' 	

'
                                                ),
                                                ''
                                              )
                                            ),
                                            ' 	

'
                                          ) = TRIM(
                                            LOWER(
                                              NULLIF(
                                                TRIM(
                                                  REGEXP_REPLACE(
                                                    REGEXP_REPLACE(REGEXP_REPLACE("t0"."last_name", '[^\x00-\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                                                    '\s+',
                                                    ' ',
                                                    'g'
                                                  ),
                                                  ' 	

'
                                                ),
                                                ''
                                              )
                                            ),
                                            ' 	

'
                                          )
                                        )
                                        AND (
                                          COALESCE(
                                            ARRAY_LENGTH(
                                              STR_SPLIT(
                                                NULLIF(
                                                  TRIM(
                                                    REGEXP_REPLACE(
                                                      TRIM(
                                                        LOWER(
                                                          NULLIF(
                                                            TRIM(
                                                              REGEXP_REPLACE(
                                                                REGEXP_REPLACE(REGEXP_REPLACE("t0"."first_name", '[^\x00-\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                                                                '\s+',
                                                                ' ',
                                                                'g'
                                                              ),
                                                              ' 	

'
                                                            ),
                                                            ''
                                                          )
                                                        ),
                                                        ' 	

'
                                                      ),
                                                      '\s+',
                                                      ' ',
                                                      'g'
                                                    ),
                                                    ' 	

'
                                                  ),
                                                  ''
                                                ),
                                                ' '
                                              )
                                            ),
                                            CAST(0 AS TINYINT)
                                          ) = CAST(1 AS TINYINT)
                                        )
                                      )
                                      AND (
                                        COALESCE(
                                          ARRAY_LENGTH(
                                            STR_SPLIT(
                                              NULLIF(
                                                TRIM(
                                                  REGEXP_REPLACE(
                                                    TRIM(
                                                      LOWER(
                                                        NULLIF(
                                                          TRIM(
                                                            REGEXP_REPLACE(
                                                              REGEXP_REPLACE(REGEXP_REPLACE("t0"."last_name", '[^\x00-\x7F]+', '', 'g'), '[^A-Za-z0-9]+', ' ', 'g'),
                                                              '\s+',
                                                              ' ',
                                                              'g'
                                                            ),
                                                            ' 	

'
                                                          ),
                                                          ''
                                                        )
                                                      ),
                                                      ' 	

'
                                                    ),
                                                    '\s+',
                                                    ' ',
                                                    'g'
                                                  ),
                                                  ' 	

'
                                                ),
                                                ''
                                              ),
                                              ' '
                                            )
                                          ),
                                          CAST(0 AS TINYINT)
                                        ) = CAST(1 AS TINYINT)
                                      )
                                      THEN NULL
                                      ELSE "t0"."first_name"
                                    END
                                  )
                                  ELSE NULL
                                END, CASE
                                  WHEN ARRAY_LENGTH(STR_SPLIT("t0"."middle_name", ' ')) = CAST(1 AS TINYINT)
                                  THEN UPPER("t0"."middle_name")
                                  ELSE NULL
                                END, CASE
                                  WHEN ARRAY_LENGTH(STR_SPLIT("t0"."last_name", ' ')) = CAST(1 AS TINYINT)
                                  THEN UPPER("t0"."last_name")
                                  ELSE NULL
                                END, CASE
                                  WHEN ARRAY_LENGTH(STR_SPLIT("t0"."suffix", ' ')) = CAST(1 AS TINYINT)
                                  THEN UPPER("t0"."suffix")
                                  ELSE NULL
                                END, CASE
                                  WHEN ARRAY_LENGTH(STR_SPLIT("t0"."nickname", ' ')) = CAST(1 AS TINYINT)
                                  THEN UPPER("t0"."nickname")
                                  ELSE NULL
                                END],
                                UPPER(__ibis_param_token__)
                              )
                            )
                          ),
                          ' '
                        ),
                        '[^a-zA-Z \-'']',
                        '',
                        'g'
                      ),
                      ' '
                    ),
                    __ibis_param_t__ -> CASE
                      WHEN (
                        (
                          LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(2 AS TINYINT)
                        )
                        AND (
                          LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^a-z]', '', 'g')) >= CAST(1 AS TINYINT)
                        )
                      )
                      OR (
                        (
                          LENGTH(REGEXP_REPLACE(__ibis_param_t__, '[^A-Z]', '', 'g')) >= CAST(1 AS TINYINT)
                        )
                        AND REGEXP_MATCHES(__ibis_param_t__, '[a-z]')
                      )
                      THEN __ibis_param_t__
                      ELSE UPPER(
                        CASE
                          WHEN (
                            CAST(0 AS TINYINT) + 1
                          ) >= 1
                          THEN SUBSTRING(__ibis_param_t__, CAST(0 AS TINYINT) + 1, CAST(1 AS TINYINT))
                          ELSE SUBSTRING(
                            __ibis_param_t__,
                            CAST(0 AS TINYINT) + 1 + LENGTH(__ibis_param_t__),
                            CAST(1 AS TINYINT)
                          )
                        END
                      ) || LOWER(
                        CASE
                          WHEN (
                            CAST(1 AS TINYINT) + 1
                          ) >= 1
                          THEN SUBSTRING(__ibis_param_t__, CAST(1 AS TINYINT) + 1, LENGTH(__ibis_param_t__))
                          ELSE SUBSTRING(
                            __ibis_param_t__,
                            CAST(1 AS TINYINT) + 1 + LENGTH(__ibis_param_t__),
                            LENGTH(__ibis_param_t__)
                          )
                        END
                      )
                    END
                  ),
                  ' '
                ),
                '\s+',
                ' ',
                'g'
              ),
              '(\W) (\W)',
              '\1\2',
              'g'
            ),
            '(\w) (\W)',
            '\1\2',
            'g'
          ),
          '(\W) (\w)',
          '\1\2',
          'g'
        ),
        ' 	
.....

'
  ) AS "nickname"
FROM "ibis_cache_v5hid6mlqzejtef24szwfrzhty" AS "t0"

Originally I had an even more complex expression, but it took minutes to compile to SQL, and I had to ctrl-c before it finished. The .cache() also took longer than I had patience for. I then added some .cache() calls in the middle of the chain to break it up, which gave me the expression shown here. The more .cache() calls I add to the chain, the smaller the timing difference between the two versions becomes.
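To illustrate why breaking the chain up might help, here is a toy model in plain Python (not ibis code, and only a guess at the mechanism). It assumes each step in a chain refers to the previous step's column a couple of times, and compares inlining every reference into one SELECT against giving each step its own stage. The names `inline_chain`, `staged_chain`, `F`, `s0` and `x` are all made up for the sketch.

```python
# Toy model only: plain strings stand in for SQL fragments.
# Nothing here calls ibis or reflects how its compiler is implemented.


def inline_chain(depth: int, refs_per_step: int = 2) -> str:
    """Inline every reference: each step repeats the whole previous expression."""
    expr = "s0.x"
    for _ in range(depth):
        expr = "F(" + ", ".join([expr] * refs_per_step) + ")"
    return expr


def staged_chain(depth: int, refs_per_step: int = 2) -> str:
    """Give each step its own CTE, so a reference is just a short column name."""
    ctes = []
    for i in range(1, depth + 1):
        args = ", ".join([f"s{i - 1}.x"] * refs_per_step)
        ctes.append(f"s{i} AS (SELECT F({args}) AS x FROM s{i - 1})")
    return "WITH " + ", ".join(ctes) + f" SELECT x FROM s{depth}"


print(inline_chain(2))  # F(F(s0.x, s0.x), F(s0.x, s0.x))
for depth in (4, 8, 12):
    # Inlined text grows geometrically with depth, staged text grows linearly.
    print(depth, len(inline_chain(depth)), len(staged_chain(depth)))
```

In this model a .cache() plays the same role as a stage boundary: whatever comes after it refers to a short column name instead of repeating the full upstream expression.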

What version of ibis are you using?

b5e6373 vs 8.0.0

What backend(s) are you using, if any?

duckdb

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Feb 27, 2024
@NickCrews
Contributor Author

I can work around this regression by adding more intermediate .cache() calls, but

  1. this is annoying to have to do
  2. I suspect many users won't notice or think about this, and will end up with suboptimal SQL. It would be better if the default behavior produced good SQL.

The code I have that generates that expression isn't shareable for a repro, but I think it would be valuable to

  1. come up with some actual benchmark code that does something meaningful (e.g. cleaning names, as I do here). It should be moderately complex, like this expression.
  2. add this to the suite of benchmarks in ibis.
  3. use this to improve how we are compiling expressions.

I can help with task 1, but 2 and 3 should probably be done by a maintainer.
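For task 1, the timing half of such a benchmark could look roughly like the sketch below. The helper names (`time_compile`, `build_expr`, `compile_fn`, `build_nested`) are hypothetical, and the workload is a stand-in nested tuple "compiled" with `repr()`, not a real ibis expression. In a real benchmark, `build_expr` would build the name-cleaning expression and `compile_fn` would presumably be `ibis.to_sql`, as timed at the top of this issue.

```python
import statistics
import time


def time_compile(build_expr, compile_fn, repeats=5):
    """Return the median wall-clock time of compile_fn(build_expr())."""
    samples = []
    for _ in range(repeats):
        # Rebuild each time so state cached on the expression cannot hide the cost.
        expr = build_expr()
        start = time.perf_counter()
        compile_fn(expr)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def build_nested(depth=50):
    """Stand-in workload (not ibis): a chain of nested tuples."""
    expr = ("col", "first_name")
    for _ in range(depth):
        expr = ("regexp_replace", expr, r"\s+", " ")
    return expr


median_seconds = time_compile(build_nested, repr)
print(f"median compile time: {median_seconds:.6f}s")
```

Only the compile call is timed, so expression construction and SQL generation can be measured separately if both turn out to matter.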

@kszucs
Member

kszucs commented Feb 27, 2024

Do you have a reproducer, perhaps? If not, could you post the repr() of the expression for the two versions?

@NickCrews
Contributor Author

Sorry, I posted more info in a follow-up comment. I have no repro on hand, so I will have to make one. Since I have to do that, do you have any guidelines for it? For example, in choosing a scale for it, should it take ~0.1 seconds to compile on 8.0.0?

@NickCrews
Contributor Author

NickCrews commented Feb 27, 2024

on 9.0.0.dev337:

r0 := DatabaseTable: ibis_cache_v5hid6mlqzejtef24szwfrzhty
  prefix      string
  first_name  string
  middle_name string
  last_name   string
  suffix      string
  nickname    string

r1 := Project[r0]
  prefix:      r0.prefix
  first_name:  IfElse(bool_expr=Strip(Lowercase(NullIf(Strip(RegexReplace(RegexReplace(RegexReplace(r0.first_name, 
pattern='[^\\x00-\\x7F]+', replacement=''), pattern='[^A-Za-z0-9]+', replacement=' '), pattern='\\s+', 
replacement=' ')), null_if_expr=''))) == 
Strip(Lowercase(NullIf(Strip(RegexReplace(RegexReplace(RegexReplace(r0.last_name, pattern='[^\\x00-\\x7F]+', 
replacement=''), pattern='[^A-Za-z0-9]+', replacement=' '), pattern='\\s+', replacement=' ')), null_if_expr=''))) &
Coalesce([ArrayLength(StringSplit(NullIf(Strip(RegexReplace(Strip(Lowercase(NullIf(Strip(RegexReplace(RegexReplace(
RegexReplace(r0.first_name, pattern='[^\\x00-\\x7F]+', replacement=''), pattern='[^A-Za-z0-9]+', replacement=' '), 
pattern='\\s+', replacement=' ')), null_if_expr=''))), pattern='\\s+', replacement=' ')), null_if_expr=''), 
delimiter=' ')), 0]) == 1 & 
Coalesce([ArrayLength(StringSplit(NullIf(Strip(RegexReplace(Strip(Lowercase(NullIf(Strip(RegexReplace(RegexReplace(
RegexReplace(r0.last_name, pattern='[^\\x00-\\x7F]+', replacement=''), pattern='[^A-Za-z0-9]+', replacement=' '), 
pattern='\\s+', replacement=' ')), null_if_expr=''))), pattern='\\s+', replacement=' ')), null_if_expr=''), 
delimiter=' ')), 0]) == 1, true_expr=None, false_null_expr=r0.first_name)
  middle_name: r0.middle_name
  last_name:   r0.last_name
  suffix:      r0.suffix
  nickname:    r0.nickname

r2 := Project[r1]
  prefix:             r1.prefix
  first_name:         r1.first_name
  middle_name:        r1.middle_name
  last_name:          r1.last_name
  suffix:             r1.suffix
  nickname:           r1.nickname
  prefix_tokens:      StringSplit(r1.prefix, delimiter=' ')
  first_name_tokens:  StringSplit(r1.first_name, delimiter=' ')
  middle_name_tokens: StringSplit(r1.middle_name, delimiter=' ')
  last_name_tokens:   StringSplit(r1.last_name, delimiter=' ')
  suffix_tokens:      StringSplit(r1.suffix, delimiter=' ')
  nickname_tokens:    StringSplit(r1.nickname, delimiter=' ')

r3 := Project[r2]
  prefix:             r2.prefix
  first_name:         r2.first_name
  middle_name:        r2.middle_name
  last_name:          r2.last_name
  suffix:             r2.suffix
  nickname:           r2.nickname
  prefix_tokens:      r2.prefix_tokens
  first_name_tokens:  r2.first_name_tokens
  middle_name_tokens: r2.middle_name_tokens
  last_name_tokens:   r2.last_name_tokens
  suffix_tokens:      r2.suffix_tokens
  nickname_tokens:    r2.nickname_tokens
  prefix_single:      IfElse(bool_expr=ArrayLength(r2.prefix_tokens) == 1, true_expr=Uppercase(r2.prefix), 
false_null_expr=None)
  first_name_single:  IfElse(bool_expr=ArrayLength(r2.first_name_tokens) == 1, true_expr=Uppercase(r2.first_name), 
false_null_expr=None)
  middle_name_single: IfElse(bool_expr=ArrayLength(r2.middle_name_tokens) == 1, 
true_expr=Uppercase(r2.middle_name), false_null_expr=None)
  last_name_single:   IfElse(bool_expr=ArrayLength(r2.last_name_tokens) == 1, true_expr=Uppercase(r2.last_name), 
false_null_expr=None)
  suffix_single:      IfElse(bool_expr=ArrayLength(r2.suffix_tokens) == 1, true_expr=Uppercase(r2.suffix), 
false_null_expr=None)
  nickname_single:    IfElse(bool_expr=ArrayLength(r2.nickname_tokens) == 1, true_expr=Uppercase(r2.nickname), 
false_null_expr=None)

r4 := Project[r3]
  prefix:                     r3.prefix
  first_name:                 r3.first_name
  middle_name:                r3.middle_name
  last_name:                  r3.last_name
  suffix:                     r3.suffix
  nickname:                   r3.nickname
  prefix_tokens:              r3.prefix_tokens
  first_name_tokens:          r3.first_name_tokens
  middle_name_tokens:         r3.middle_name_tokens
  last_name_tokens:           r3.last_name_tokens
  suffix_tokens:              r3.suffix_tokens
  nickname_tokens:            r3.nickname_tokens
  prefix_single:              r3.prefix_single
  first_name_single:          r3.first_name_single
  middle_name_single:         r3.middle_name_single
  last_name_single:           r3.last_name_single
  suffix_single:              r3.suffix_single
  nickname_single:            r3.nickname_single
  singles_except_prefix:      Array([r3.first_name_single, r3.middle_name_single, r3.last_name_single, 
r3.suffix_single, r3.nickname_single])
  singles_except_first_name:  Array([r3.prefix_single, r3.middle_name_single, r3.last_name_single, 
r3.suffix_single, r3.nickname_single])
  singles_except_middle_name: Array([r3.prefix_single, r3.first_name_single, r3.last_name_single, r3.suffix_single,
r3.nickname_single])
  singles_except_last_name:   Array([r3.prefix_single, r3.first_name_single, r3.middle_name_single, 
r3.suffix_single, r3.nickname_single])
  singles_except_suffix:      Array([r3.prefix_single, r3.first_name_single, r3.middle_name_single, 
r3.last_name_single, r3.nickname_single])
  singles_except_nickname:    Array([r3.prefix_single, r3.first_name_single, r3.middle_name_single, 
r3.last_name_single, r3.suffix_single])

r5 := Project[r4]
  prefix:                      r4.prefix
  first_name:                  r4.first_name
  middle_name:                 r4.middle_name
  last_name:                   r4.last_name
  suffix:                      r4.suffix
  nickname:                    r4.nickname
  prefix_tokens:               r4.prefix_tokens
  first_name_tokens:           r4.first_name_tokens
  middle_name_tokens:          r4.middle_name_tokens
  last_name_tokens:            r4.last_name_tokens
  suffix_tokens:               r4.suffix_tokens
  nickname_tokens:             r4.nickname_tokens
  prefix_single:               r4.prefix_single
  first_name_single:           r4.first_name_single
  middle_name_single:          r4.middle_name_single
  last_name_single:            r4.last_name_single
  suffix_single:               r4.suffix_single
  nickname_single:             r4.nickname_single
  singles_except_prefix:       r4.singles_except_prefix
  singles_except_first_name:   r4.singles_except_first_name
  singles_except_middle_name:  r4.singles_except_middle_name
  singles_except_last_name:    r4.singles_except_last_name
  singles_except_suffix:       r4.singles_except_suffix
  singles_except_nickname:     r4.singles_except_nickname
  prefix_tokens_filtered:      ArrayFilter(r4.prefix_tokens, body=Not(ArrayContains(r4.singles_except_prefix, 
other=Uppercase(Argument(name='token', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string)))), param='__ibis_param_token__')
  first_name_tokens_filtered:  ArrayFilter(r4.first_name_tokens, 
body=Not(ArrayContains(r4.singles_except_first_name, other=Uppercase(Argument(name='token', 
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string)))), param='__ibis_param_token__')
  middle_name_tokens_filtered: ArrayFilter(r4.middle_name_tokens, 
body=Not(ArrayContains(r4.singles_except_middle_name, other=Uppercase(Argument(name='token', 
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string)))), param='__ibis_param_token__')
  last_name_tokens_filtered:   ArrayFilter(r4.last_name_tokens, body=Not(ArrayContains(r4.singles_except_last_name,
other=Uppercase(Argument(name='token', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string)))), param='__ibis_param_token__')
  suffix_tokens_filtered:      ArrayFilter(r4.suffix_tokens, body=Not(ArrayContains(r4.singles_except_suffix, 
other=Uppercase(Argument(name='token', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string)))), param='__ibis_param_token__')
  nickname_tokens_filtered:    ArrayFilter(r4.nickname_tokens, body=Not(ArrayContains(r4.singles_except_nickname, 
other=Uppercase(Argument(name='token', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string)))), param='__ibis_param_token__')

r6 := Project[r5]
  prefix:                      ArrayStringJoin(r5.prefix_tokens_filtered, sep=' ')
  first_name:                  ArrayStringJoin(r5.first_name_tokens_filtered, sep=' ')
  middle_name:                 ArrayStringJoin(r5.middle_name_tokens_filtered, sep=' ')
  last_name:                   ArrayStringJoin(r5.last_name_tokens_filtered, sep=' ')
  suffix:                      ArrayStringJoin(r5.suffix_tokens_filtered, sep=' ')
  nickname:                    ArrayStringJoin(r5.nickname_tokens_filtered, sep=' ')
  prefix_tokens:               r5.prefix_tokens
  first_name_tokens:           r5.first_name_tokens
  middle_name_tokens:          r5.middle_name_tokens
  last_name_tokens:            r5.last_name_tokens
  suffix_tokens:               r5.suffix_tokens
  nickname_tokens:             r5.nickname_tokens
  prefix_single:               r5.prefix_single
  first_name_single:           r5.first_name_single
  middle_name_single:          r5.middle_name_single
  last_name_single:            r5.last_name_single
  suffix_single:               r5.suffix_single
  nickname_single:             r5.nickname_single
  singles_except_prefix:       r5.singles_except_prefix
  singles_except_first_name:   r5.singles_except_first_name
  singles_except_middle_name:  r5.singles_except_middle_name
  singles_except_last_name:    r5.singles_except_last_name
  singles_except_suffix:       r5.singles_except_suffix
  singles_except_nickname:     r5.singles_except_nickname
  prefix_tokens_filtered:      r5.prefix_tokens_filtered
  first_name_tokens_filtered:  r5.first_name_tokens_filtered
  middle_name_tokens_filtered: r5.middle_name_tokens_filtered
  last_name_tokens_filtered:   r5.last_name_tokens_filtered
  suffix_tokens_filtered:      r5.suffix_tokens_filtered
  nickname_tokens_filtered:    r5.nickname_tokens_filtered

r7 := Project[r6]
  prefix:      r6.prefix
  first_name:  r6.first_name
  middle_name: r6.middle_name
  last_name:   r6.last_name
  suffix:      r6.suffix
  nickname:    r6.nickname

r8 := Project[r7]
  prefix:      
Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(ArrayMap(StringSplit(RegexReplace(r7.pref
ix, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x1044682e0>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string))), param='__ibis_param_t__'), sep=' '), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
  first_name:  
Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(ArrayMap(StringSplit(RegexReplace(r7.firs
t_name, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x1044682e0>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string))), param='__ibis_param_t__'), sep=' '), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
  middle_name: 
Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(ArrayMap(StringSplit(RegexReplace(r7.midd
le_name, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x1044682e0>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string))), param='__ibis_param_t__'), sep=' '), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
  last_name:   
Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(ArrayMap(StringSplit(RegexReplace(r7.last
_name, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x1044682e0>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string))), param='__ibis_param_t__'), sep=' '), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
  suffix:      
Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(ArrayMap(StringSplit(RegexReplace(r7.suff
ix, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x1044682e0>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string))), param='__ibis_param_t__'), sep=' '), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
  nickname:    
Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(ArrayMap(StringSplit(RegexReplace(r7.nick
name, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x1044682e0>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x1044682e0>, 
dtype=string))), param='__ibis_param_t__'), sep=' '), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))

r9 := Project[r8]
  prefix:      SearchedCase(cases=[Uppercase(r8.prefix) == 'JR', Uppercase(r8.prefix) == 'SR', Uppercase(r8.prefix)
== 'PHD', Uppercase(r8.prefix) == 'MR', Uppercase(r8.prefix) == 'MRS', Uppercase(r8.prefix) == 'MS', 
Uppercase(r8.prefix) == 'DR'], results=['Jr', 'Sr', 'PhD', 'Mr', 'Mrs', 'Ms', 'Dr'], default=Uppercase(r8.prefix))
  first_name:  r8.first_name
  middle_name: r8.middle_name
  last_name:   r8.last_name
  suffix:      SearchedCase(cases=[Uppercase(r8.suffix) == 'JR', Uppercase(r8.suffix) == 'SR', Uppercase(r8.suffix)
== 'PHD', Uppercase(r8.suffix) == 'MR', Uppercase(r8.suffix) == 'MRS', Uppercase(r8.suffix) == 'MS', 
Uppercase(r8.suffix) == 'DR'], results=['Jr', 'Sr', 'PhD', 'Mr', 'Mrs', 'Ms', 'Dr'], default=Uppercase(r8.suffix))
  nickname:    r8.nickname

Project[r9]
  prefix:      r9.prefix
  first_name:  r9.first_name
  middle_name: IfElse(bool_expr=Coalesce([StringSQLLike(r9.nickname, pattern=StringConcat([r9.middle_name, '%'])), 
False]) & Not(StringSQLLike(r9.first_name, pattern=StringConcat([r9.middle_name, '%']))), true_expr=r9.nickname, 
false_null_expr=r9.middle_name)
  last_name:   r9.last_name
  suffix:      r9.suffix
  nickname:    r9.nickname

on 8.0.0:

r0 := DatabaseTable: ibis_cache_rubcpix3p5fkjmjhurv34vxf3u
  prefix      string
  first_name  string
  middle_name string
  last_name   string
  suffix      string
  nickname    string

r1 := Selection[r0]
  selections:
    prefix:      r0.prefix
    first_name:  
IfElse(bool_expr=Strip(Lowercase(NullIf(Strip(RegexReplace(RegexReplace(RegexReplace(r0.first_name, 
pattern='[^\\x00-\\x7F]+', replacement=''), pattern='[^A-Za-z0-9]+', replacement=' '), pattern='\\s+', 
replacement=' ')), null_if_expr=''))) == 
Strip(Lowercase(NullIf(Strip(RegexReplace(RegexReplace(RegexReplace(r0.last_name, pattern='[^\\x00-\\x7F]+', 
replacement=''), pattern='[^A-Za-z0-9]+', replacement=' '), pattern='\\s+', replacement=' ')), null_if_expr=''))) &
Coalesce([ArrayLength(StringSplit(NullIf(Strip(RegexReplace(Strip(Lowercase(NullIf(Strip(RegexReplace(RegexReplace(
RegexReplace(r0.first_name, pattern='[^\\x00-\\x7F]+', replacement=''), pattern='[^A-Za-z0-9]+', replacement=' '), 
pattern='\\s+', replacement=' ')), null_if_expr=''))), pattern='\\s+', replacement=' ')), null_if_expr=''), 
delimiter=' ')), 0]) == 1 & 
Coalesce([ArrayLength(StringSplit(NullIf(Strip(RegexReplace(Strip(Lowercase(NullIf(Strip(RegexReplace(RegexReplace(
RegexReplace(r0.last_name, pattern='[^\\x00-\\x7F]+', replacement=''), pattern='[^A-Za-z0-9]+', replacement=' '), 
pattern='\\s+', replacement=' ')), null_if_expr=''))), pattern='\\s+', replacement=' ')), null_if_expr=''), 
delimiter=' ')), 0]) == 1, true_expr=None, false_null_expr=r0.first_name)
    middle_name: r0.middle_name
    last_name:   r0.last_name
    suffix:      r0.suffix
    nickname:    r0.nickname

r2 := Selection[r1]
  selections:
    r1
    prefix_tokens:      StringSplit(r1.prefix, delimiter=' ')
    first_name_tokens:  StringSplit(r1.first_name, delimiter=' ')
    middle_name_tokens: StringSplit(r1.middle_name, delimiter=' ')
    last_name_tokens:   StringSplit(r1.last_name, delimiter=' ')
    suffix_tokens:      StringSplit(r1.suffix, delimiter=' ')
    nickname_tokens:    StringSplit(r1.nickname, delimiter=' ')

r3 := Selection[r2]
  selections:
    r2
    prefix_single:      IfElse(bool_expr=ArrayLength(r2.prefix_tokens) == 1, true_expr=Uppercase(r2.prefix), 
false_null_expr=None)
    first_name_single:  IfElse(bool_expr=ArrayLength(r2.first_name_tokens) == 1, 
true_expr=Uppercase(r2.first_name), false_null_expr=None)
    middle_name_single: IfElse(bool_expr=ArrayLength(r2.middle_name_tokens) == 1, 
true_expr=Uppercase(r2.middle_name), false_null_expr=None)
    last_name_single:   IfElse(bool_expr=ArrayLength(r2.last_name_tokens) == 1, true_expr=Uppercase(r2.last_name), 
false_null_expr=None)
    suffix_single:      IfElse(bool_expr=ArrayLength(r2.suffix_tokens) == 1, true_expr=Uppercase(r2.suffix), 
false_null_expr=None)
    nickname_single:    IfElse(bool_expr=ArrayLength(r2.nickname_tokens) == 1, true_expr=Uppercase(r2.nickname), 
false_null_expr=None)

r4 := Selection[r3]
  selections:
    r3
    singles_except_prefix:      Array([r3.first_name_single, r3.middle_name_single, r3.last_name_single, 
r3.suffix_single, r3.nickname_single])
    singles_except_first_name:  Array([r3.prefix_single, r3.middle_name_single, r3.last_name_single, 
r3.suffix_single, r3.nickname_single])
    singles_except_middle_name: Array([r3.prefix_single, r3.first_name_single, r3.last_name_single, 
r3.suffix_single, r3.nickname_single])
    singles_except_last_name:   Array([r3.prefix_single, r3.first_name_single, r3.middle_name_single, 
r3.suffix_single, r3.nickname_single])
    singles_except_suffix:      Array([r3.prefix_single, r3.first_name_single, r3.middle_name_single, 
r3.last_name_single, r3.nickname_single])
    singles_except_nickname:    Array([r3.prefix_single, r3.first_name_single, r3.middle_name_single, 
r3.last_name_single, r3.suffix_single])

r5 := Selection[r4]
  selections:
    r4
    prefix_tokens_filtered:      ArrayFilter(r4.prefix_tokens, body=Not(ArrayContains(r4.singles_except_prefix, 
other=Uppercase(Argument(name='token', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string)))), param='__ibis_param_token__')
    first_name_tokens_filtered:  ArrayFilter(r4.first_name_tokens, 
body=Not(ArrayContains(r4.singles_except_first_name, other=Uppercase(Argument(name='token', 
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string)))), param='__ibis_param_token__')
    middle_name_tokens_filtered: ArrayFilter(r4.middle_name_tokens, 
body=Not(ArrayContains(r4.singles_except_middle_name, other=Uppercase(Argument(name='token', 
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string)))), param='__ibis_param_token__')
    last_name_tokens_filtered:   ArrayFilter(r4.last_name_tokens, 
body=Not(ArrayContains(r4.singles_except_last_name, other=Uppercase(Argument(name='token', 
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string)))), param='__ibis_param_token__')
    suffix_tokens_filtered:      ArrayFilter(r4.suffix_tokens, body=Not(ArrayContains(r4.singles_except_suffix, 
other=Uppercase(Argument(name='token', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string)))), param='__ibis_param_token__')
    nickname_tokens_filtered:    ArrayFilter(r4.nickname_tokens, body=Not(ArrayContains(r4.singles_except_nickname,
other=Uppercase(Argument(name='token', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string)))), param='__ibis_param_token__')

r6 := Selection[r5]
  selections:
    prefix:                      ArrayStringJoin(sep=' ', arg=r5.prefix_tokens_filtered)
    first_name:                  ArrayStringJoin(sep=' ', arg=r5.first_name_tokens_filtered)
    middle_name:                 ArrayStringJoin(sep=' ', arg=r5.middle_name_tokens_filtered)
    last_name:                   ArrayStringJoin(sep=' ', arg=r5.last_name_tokens_filtered)
    suffix:                      ArrayStringJoin(sep=' ', arg=r5.suffix_tokens_filtered)
    nickname:                    ArrayStringJoin(sep=' ', arg=r5.nickname_tokens_filtered)
    prefix_tokens:               r5.prefix_tokens
    first_name_tokens:           r5.first_name_tokens
    middle_name_tokens:          r5.middle_name_tokens
    last_name_tokens:            r5.last_name_tokens
    suffix_tokens:               r5.suffix_tokens
    nickname_tokens:             r5.nickname_tokens
    prefix_single:               r5.prefix_single
    first_name_single:           r5.first_name_single
    middle_name_single:          r5.middle_name_single
    last_name_single:            r5.last_name_single
    suffix_single:               r5.suffix_single
    nickname_single:             r5.nickname_single
    singles_except_prefix:       r5.singles_except_prefix
    singles_except_first_name:   r5.singles_except_first_name
    singles_except_middle_name:  r5.singles_except_middle_name
    singles_except_last_name:    r5.singles_except_last_name
    singles_except_suffix:       r5.singles_except_suffix
    singles_except_nickname:     r5.singles_except_nickname
    prefix_tokens_filtered:      r5.prefix_tokens_filtered
    first_name_tokens_filtered:  r5.first_name_tokens_filtered
    middle_name_tokens_filtered: r5.middle_name_tokens_filtered
    last_name_tokens_filtered:   r5.last_name_tokens_filtered
    suffix_tokens_filtered:      r5.suffix_tokens_filtered
    nickname_tokens_filtered:    r5.nickname_tokens_filtered

r7 := Selection[r6]
  selections:
    prefix:      r6.prefix
    first_name:  r6.first_name
    middle_name: r6.middle_name
    last_name:   r6.last_name
    suffix:      r6.suffix
    nickname:    r6.nickname

r8 := Selection[r7]
  selections:
    prefix:      Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(sep=' ', 
arg=ArrayMap(StringSplit(RegexReplace(r7.prefix, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x10b960b80>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string))), param='__ibis_param_t__')), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
    first_name:  Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(sep=' ', 
arg=ArrayMap(StringSplit(RegexReplace(r7.first_name, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x10b960b80>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string))), param='__ibis_param_t__')), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
    middle_name: Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(sep=' ', 
arg=ArrayMap(StringSplit(RegexReplace(r7.middle_name, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x10b960b80>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string))), param='__ibis_param_t__')), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
    last_name:   Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(sep=' ', 
arg=ArrayMap(StringSplit(RegexReplace(r7.last_name, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x10b960b80>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string))), param='__ibis_param_t__')), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
    suffix:      Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(sep=' ', 
arg=ArrayMap(StringSplit(RegexReplace(r7.suffix, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x10b960b80>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string))), param='__ibis_param_t__')), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))
    nickname:    Strip(RegexReplace(RegexReplace(RegexReplace(RegexReplace(ArrayStringJoin(sep=' ', 
arg=ArrayMap(StringSplit(RegexReplace(r7.nickname, pattern="[^a-zA-Z \\-']", replacement=''), delimiter=' '), 
body=IfElse(bool_expr=StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 
0x10b960b80>, dtype=string), pattern='[^A-Z]', replacement='')) >= 2 & StringLength(RegexReplace(Argument(name='t',
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[^a-z]', replacement='')) >= 1 
| StringLength(RegexReplace(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string), pattern='[^A-Z]', replacement='')) >= 1 & RegexSearch(Argument(name='t', 
shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), pattern='[a-z]'), 
true_expr=Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, dtype=string), 
false_null_expr=Capitalize(Argument(name='t', shape=<ibis.expr.datashape.Columnar object at 0x10b960b80>, 
dtype=string))), param='__ibis_param_t__')), pattern='\\s+', replacement=' '), pattern='(\\W) (\\W)', 
replacement='\\1\\2'), pattern='(\\w) (\\W)', replacement='\\1\\2'), pattern='(\\W) (\\w)', replacement='\\1\\2'))

r9 := Selection[r8]
  selections:
    prefix:      SearchedCase(cases=[Uppercase(r8.prefix) == 'JR', Uppercase(r8.prefix) == 'SR', 
Uppercase(r8.prefix) == 'PHD', Uppercase(r8.prefix) == 'MR', Uppercase(r8.prefix) == 'MRS', Uppercase(r8.prefix) ==
'MS', Uppercase(r8.prefix) == 'DR'], results=['Jr', 'Sr', 'PhD', 'Mr', 'Mrs', 'Ms', 'Dr'], 
default=Uppercase(r8.prefix))
    first_name:  r8.first_name
    middle_name: r8.middle_name
    last_name:   r8.last_name
    suffix:      SearchedCase(cases=[Uppercase(r8.suffix) == 'JR', Uppercase(r8.suffix) == 'SR', 
Uppercase(r8.suffix) == 'PHD', Uppercase(r8.suffix) == 'MR', Uppercase(r8.suffix) == 'MRS', Uppercase(r8.suffix) ==
'MS', Uppercase(r8.suffix) == 'DR'], results=['Jr', 'Sr', 'PhD', 'Mr', 'Mrs', 'Ms', 'Dr'], 
default=Uppercase(r8.suffix))
    nickname:    r8.nickname

Selection[r9]
  selections:
    prefix:      r9.prefix
    first_name:  r9.first_name
    middle_name: IfElse(bool_expr=Coalesce([StringSQLLike(r9.nickname, pattern=StringConcat([r9.middle_name, 
'%'])), False]) & Not(StringSQLLike(r9.first_name, pattern=StringConcat([r9.middle_name, '%']))), 
true_expr=r9.nickname, false_null_expr=r9.middle_name)
    last_name:   r9.last_name
    suffix:      r9.suffix
    nickname:    r9.nickname

@cpcloud
Member

cpcloud commented Feb 27, 2024

If you can post the timings for each of the different versions that would be helpful.

@NickCrews
Contributor Author

@cpcloud what timings do you mean? In the original post I have the timings for compiling the SQL and for .cache()ing it. The user code that generates the expressions is the same between 8.0.0 and main.

@cpcloud
Member

cpcloud commented Feb 27, 2024

Ah I see the timings now in a comment.

~1sec for 8.0
More than minutes for main

@NickCrews
Contributor Author

NickCrews commented Feb 27, 2024

ah, just to be clear:

I started with e1, which took ~10 sec to compile for 8.0.0, many minutes for main. So, I threw some cache statements into that pipeline to make it simpler. That gave me e2, which gave me the timings I posted originally (0.1 sec vs 1 sec to compile, 1 sec vs 8.4 sec to execute).

@kszucs
Member

kszucs commented Feb 28, 2024

@NickCrews could you try to call expr.unbind(), then pickle it and share that with us? Then we would have the expression and could inspect where the slowdown is.

@cpcloud
Member

cpcloud commented Feb 28, 2024

I suspect the CTE generation, performance of generating SQL and the execution performance are all related. If we're repeating the same subquery multiple times, the database may not automatically extract them into CTEs. We've already seen this with SQLite.

@NickCrews
Contributor Author

Let me actually craft some user code that can repro this. Then it will be much easier for you to understand and tweak, and it can get turned into a benchmark.

@NickCrews
Contributor Author

sql_benchmark.zip

OK, here is a notebook that can repro the issue, and some test data of names (from the Alaska Division of Elections voterfile, this is public data)

@lostmygithubaccount lostmygithubaccount added this to the 9.0 milestone Mar 1, 2024
@lostmygithubaccount lostmygithubaccount added the regression Issues related to things that used to work but don't anymore label Mar 1, 2024
@NickCrews
Contributor Author

It looks like y'all are doing some work on this. I really appreciate that, this looks tricky. I'm curious how y'all are thinking about the goal here. My thoughts would be that we should be aiming to generate SQL that is

  1. as fast as possible to execute
    1. on which backends?
    2. for what sorts of queries?
  2. as a second priority, fast to generate
  3. as a third priority, fairly readable

How are y'all thinking about this? This should probably get written down somewhere in the docs as the official stance of ibis; it feels important and it will impact all sorts of other decisions. For example, if "fast as possible to execute" isn't the number 1 priority, some people might be unwilling to use Ibis, and might just hand-roll their own SQL.

@cpcloud
Member

cpcloud commented Mar 7, 2024

I think you've got the priorities in the correct order.

In this particular case, my hunch is that the first two priorities are related:

By failing to recursively extract CTEs, we duplicate subqueries which leads to more code to generate (and thus slower to generate said code) and engines that do not perform common expression elimination are doomed to execute identical queries more than once.

@cpcloud
Member

cpcloud commented Mar 7, 2024

Trino is the only backend I'm aware of that executes CTEs once for each reference (i.e., equivalent to repeating the subquery)

@cpcloud
Member

cpcloud commented Mar 7, 2024

Agree that we should write this down somewhere, perhaps in one of the concepts docs?

@kszucs
Member

kszucs commented Mar 8, 2024

Thanks Nick for sharing, I am able to reproduce. I'm currently inspecting what happens: the ibis IR and the rewrite just before the sqlglot translation are instant, so the slowdown is related to the sqlglot object creation. The memory also explodes, so this is an important issue to fix.

Just throwing the profiling result here:

33.892 <module>  slow.py:1
└─ 33.892 to_sql  ibis/expr/sql.py:342
   ├─ 17.801 transpile  sqlglot/__init__.py:133
   │     [1558 frames hidden]  sqlglot
   └─ 16.077 Backend._to_sql  ibis/backends/sql/__init__.py:124
      └─ 16.076 Backend.compile  ibis/backends/sql/__init__.py:115
         ├─ 10.697 Backend._to_sqlglot  ibis/backends/duckdb/__init__.py:106
         │  └─ 10.697 Backend._to_sqlglot  ibis/backends/sql/__init__.py:92
         │     └─ 10.697 DuckDBCompiler.translate  ibis/backends/sql/compiler.py:422
         │        └─ 10.628 Select.map  ibis/common/graph.py:232
         │           └─ 10.622 fn  ibis/backends/sql/compiler.py:455
         │              └─ 10.621 DuckDBCompiler.visit_node  ibis/backends/sql/compiler.py:494
         │                 ├─ 5.161 DuckDBCompiler.impl  ibis/backends/sql/compiler.py:350
         │                 │  └─ 5.160 <lambda>  ibis/backends/sql/compiler.py:97
         │                 │     └─ 5.160 func  sqlglot/expressions.py:6874
         │                 │           [162 frames hidden]  sqlglot, copy
         │                 ├─ 4.496 DuckDBCompiler.visit_Select  ibis/backends/sql/compiler.py:1101
         │                 │  ├─ 2.340 Select.from_  sqlglot/expressions.py:2812
         │                 │  │     [201 frames hidden]  sqlglot, copy
         │                 │  └─ 2.156 DuckDBCompiler._dedup_name  ibis/backends/sql/compiler.py:1074
         │                 │     └─ 2.156 If.as_  sqlglot/expressions.py:735
         │                 │           [239 frames hidden]  sqlglot, copy
         │                 └─ 0.380 DuckDBCompiler.visit_Coalesce  ibis/backends/sql/compiler.py:956
         │                    └─ 0.380 <lambda>  ibis/backends/sql/compiler.py:97
         │                       └─ 0.380 func  sqlglot/expressions.py:6874
         │                             [37 frames hidden]  sqlglot, copy
         └─ 5.379 Select.sql  sqlglot/expressions.py:513
               [276 frames hidden]  sqlglot, copy

@kszucs
Member

kszucs commented Mar 8, 2024

My guess is that every time we construct a sqlglot expression we create a full copy of the previous sqlglot expression, hence the memory explosion. The sqlglot functions and methods have a copy=True argument, so I assume we need to try not to copy by default.
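
A minimal sketch of why copying on every construction could hurt, using only the standard library. The `Node` class and the `wrap`/`build` helpers are hypothetical stand-ins for illustration, not sqlglot's real classes or API; the only thing borrowed from the comment above is the idea of a copy-by-default flag.

```python
import copy


class Node:
    """Toy stand-in for a SQL AST node (hypothetical, not a sqlglot class)."""

    def __init__(self, name, child=None):
        self.name = name
        self.child = child


def size(node):
    """Count the nodes in a single-child chain."""
    n = 0
    while node is not None:
        n += 1
        node = node.child
    return n


def wrap(name, child, copy_child=True):
    """Wrap `child` in a new node, deep-copying it first when copy_child is set."""
    if copy_child and child is not None:
        child = copy.deepcopy(child)
    return Node(name, child)


def build(depth, copy_child):
    """Nest `depth` wrappers and count how many nodes get allocated overall."""
    allocated = 0
    expr = None
    for i in range(depth):
        if copy_child:
            allocated += size(expr)  # nodes duplicated by the deep copy
        expr = wrap(f"f{i}", expr, copy_child=copy_child)
        allocated += 1  # the new wrapper itself
    return expr, allocated


_, with_copy = build(50, copy_child=True)
_, without_copy = build(50, copy_child=False)
print(with_copy, without_copy)
```

In this toy model the copying variant allocates a number of nodes that grows quadratically with nesting depth, while the non-copying variant grows linearly. Real expression trees branch, so the actual numbers would differ.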

@lostmygithubaccount
Member

agree on a concepts article -- perhaps replacing the current one on internals: https://ibis-project.org/concepts/internals

kszucs added a commit that referenced this issue Mar 13, 2024
…reate a sqlglot object (#8592)

Still profiling and adding more `copy=False` options, apparently it
greatly improves the performance.

According to profiling the recursive generation of sqlglot is still a
bottleneck for big queries which cannot be fixed on the ibis side. There
could be one option though to compile the fragments in a greedy fashion
which are going to be cached by `Node.map()` and inject those as
arbitrary strings to other sqlglot expressions.

Now it uses 30% of the memory it used before. Apparently there are still
a lot of sqlglot Literal objects around, looking into that.

fixes #8484

---------

Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
@cpcloud
Member

cpcloud commented Mar 13, 2024

@kszucs Was this resolved by 461293b?

I thought we still had to merge #8633 before this would be resolved

@NickCrews
Contributor Author

Can we re-open this? This is currently the number one blocker for me, my app has basically become unusable due to this. Is there any temp workaround I can do on my ibis fork that will make CTEs get extracted in the short term?

@cpcloud cpcloud added the performance Issues related to ibis's performance label Mar 23, 2024
@kszucs
Member

kszucs commented Mar 24, 2024

@NickCrews the real bottleneck here is sqlglot. We now avoid excessive sqlglot deepcopying and generate non-pretty SQL by default, which made the overall compilation much faster. I can think of a single additional optimization: eagerly compile the ibis expression to SQL strings and use those strings in other sqlglot expressions, ensuring that the same expression is compiled only once.

I would suggest raising an issue in sqlglot upstream. In the meantime we may get more information about better usage of, or possible optimizations for, sqlglot.
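
The eager-compilation idea above could look roughly like the sketch below. It is a toy, not ibis code: expressions are plain tuples of the form `(op, *args)`, column names are strings, and `compile_expr` is a hypothetical helper. The cache is keyed by object identity, which is enough here because a shared subexpression is the same Python object.

```python
def compile_expr(expr, cache, stats):
    """Render a toy expression to a SQL-like string, compiling each node once."""
    if isinstance(expr, str):  # leaf: a column name
        return f'"{expr}"'
    key = id(expr)
    if key in cache:
        stats["hits"] += 1  # reuse the already-rendered string
        return cache[key]
    op, *args = expr
    rendered = f"{op}({', '.join(compile_expr(a, cache, stats) for a in args)})"
    stats["compiled"] += 1
    cache[key] = rendered
    return rendered


# `upper` is referenced twice: directly and through `trimmed`.
upper = ("UPPER", "species")
trimmed = ("TRIM", upper)
both = ("CONCAT", upper, trimmed)

stats = {"compiled": 0, "hits": 0}
sql = compile_expr(both, {}, stats)
print(sql)
print(stats)
```

Note that this only saves compilation work. The rendered string still contains the repeated `UPPER(...)` text, so it would not by itself change what the database has to execute.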

@NickCrews
Contributor Author

Gotcha. If I understand correctly, even if we do that, then there still won't be CTEs extracted. I guess, but haven't benchmarked, that the actual execution was the cause of my slowdown, since we moved from having a lot of CTEs to no CTEs. Did you investigate this when you were benchmarking above, or should I? If we find the execution is the bottleneck, would you consider going back to the CTE-heavy format?

@kszucs
Member

kszucs commented Mar 25, 2024

Well, we do extract CTEs for unique selections occurring more than once, see the implementation here:

def extract_ctes(node):
    result = []
    cte_types = (Select, ops.Aggregate, ops.JoinChain, ops.Set, ops.Limit, ops.Sample)
    dont_count = (ops.Field, ops.CountStar, ops.CountDistinctStar)
    g = Graph.from_bfs(node, filter=~InstanceOf(dont_count))
    for node, dependents in g.invert().items():
        if isinstance(node, ops.View) or (
            len(dependents) > 1 and isinstance(node, cte_types)
        ):
            result.append(node)
    return result
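
For readers without the ibis internals at hand, here is a standalone toy version of the same idea: walk the plan breadth-first, record which nodes depend on each node, and keep the ones with more than one dependent. The `plan` dictionary and `extract_shared` are hypothetical names for illustration; the real function above additionally filters by node type.

```python
from collections import defaultdict

# Hypothetical toy plan: each node maps to the nodes it reads from.
plan = {
    "final": ["join"],
    "join": ["cleaned", "agg"],
    "agg": ["cleaned"],
    "cleaned": ["raw"],
    "raw": [],
}


def extract_shared(plan, root):
    """Return nodes reachable from `root` that have more than one dependent."""
    dependents = defaultdict(set)
    order = [root]  # nodes in breadth-first discovery order
    seen = {root}
    position = 0
    while position < len(order):
        node = order[position]
        position += 1
        for child in plan[node]:
            dependents[child].add(node)  # invert the edge: child -> its users
            if child not in seen:
                seen.add(child)
                order.append(child)
    return [n for n in order if len(dependents[n]) > 1]


shared = extract_shared(plan, "final")
print(shared)
```

Here only `cleaned` is read by two different nodes, so it is the only candidate for a CTE. `raw` is reachable along two paths but has a single direct dependent.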

@cpcloud
Member

cpcloud commented Mar 25, 2024

Thinking about a possible "extreme" solution here: what if we make every Select into a CTE?

@NickCrews
Contributor Author

NickCrews commented Mar 25, 2024

I do this sort of thing a lot in my processing code:

import ibis

t = ibis.examples.penguins.fetch()
t = t.mutate(species2=t.species.upper())
t = t.mutate(species3=t.species2.strip())
ibis.to_sql(t)

e.g. transform some expression, then transform that result again.
This generates:

SELECT
  "t0"."species",
  "t0"."island",
  "t0"."bill_length_mm",
  "t0"."bill_depth_mm",
  "t0"."flipper_length_mm",
  "t0"."body_mass_g",
  "t0"."sex",
  "t0"."year",
  UPPER("t0"."species") AS "species2",
  TRIM(UPPER("t0"."species"), ' 	

') AS "species3"
FROM "penguins" AS "t0"

Note how the UPPER("t0"."species") occurs twice. I would expect that to get extracted into a CTE. In real code, instead of UPPER(x), it is a giant hairball.

Note that this might not actually be a problem for execution, at least duckdb appears to be smart enough to only execute this once?

import ibis
import pyarrow.compute as pc

i = 0


@ibis.udf.scalar.pyarrow
def plus1(x: int) -> int:
    global i
    i += 1
    return pc.add(x, 1)


t = ibis.examples.penguins.fetch().head(10)
t = t.mutate(body_mass_g2=plus1(t.body_mass_g))
t = t.mutate(body_mass_g3=t.body_mass_g2 * 2)
t.execute()
print(i) # shows 1

But even if execution is smart enough, perhaps this is also the cause of the SQL generation slowdown, since I think this would be a branching_factor**depth explosion of what needs to get generated?
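
A rough illustration of that explosion, as a sketch under simple assumptions: every step references the previous column twice (branching factor 2), and `inlined` and `with_ctes` are hypothetical helpers that build SQL-like strings, not anything ibis generates.

```python
def inlined(depth):
    """Inline every step into one SELECT; the expression doubles per step."""
    expr = '"x"'
    for _ in range(depth):
        expr = f"({expr} + {expr})"
    return f'SELECT {expr} AS "y" FROM src'


def with_ctes(depth):
    """Give every step its own CTE (depth >= 1); each step stays constant size."""
    ctes, prev, col = [], "src", "x"
    for i in range(depth):
        name = f"t{i}"
        ctes.append(f'{name} AS (SELECT ("{col}" + "{col}") AS "y" FROM {prev})')
        prev, col = name, "y"
    return "WITH " + ", ".join(ctes) + f' SELECT "{col}" FROM {prev}'


for depth in (1, 5, 10):
    print(depth, len(inlined(depth)), len(with_ctes(depth)))
```

With these assumptions the inlined text references `"x"` `2**depth` times, while the CTE form grows linearly with depth.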

> Thinking about a possible "extreme" solution here: what if we make every Select into a CTE?

Sorry I don't think I follow, I think we are suggesting the same thing? eg

WITH t99 AS (
  SELECT
    "penguins"."species",
    UPPER("penguins"."species") AS "species2"
  FROM "penguins"
)
SELECT
  "t99"."species",
  "t99"."species2",
  TRIM("t99"."species2", ' 	

') AS "species3"
FROM "t99"

@NickCrews
Contributor Author

NickCrews commented Mar 25, 2024

OK, here is actually a problematic case where duckdb isn't smart enough:

import ibis

a = ibis.array([1, 2, 3])

i = 0


@ibis.udf.scalar.python
def plus1(x: int) -> int:
    global i
    i += 1
    return x + 1


b = a.map(plus1).unique()
b.execute()
print(i) # 24

SELECT
  CASE
    WHEN LIST_APPLY(
      [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
      __ibis_param_x__ -> PLUS1_5(__ibis_param_x__)
    ) IS NULL
    THEN NULL
    ELSE LIST_DISTINCT(
      LIST_APPLY(
        [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
        __ibis_param_x__ -> PLUS1_5(__ibis_param_x__)
      )
    ) + CASE
      WHEN LIST_COUNT(
        LIST_APPLY(
          [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
          __ibis_param_x__ -> PLUS1_5(__ibis_param_x__)
        )
      ) < LENGTH(
        LIST_APPLY(
          [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
          __ibis_param_x__ -> PLUS1_5(__ibis_param_x__)
        )
      )
      THEN [NULL]
      ELSE []
    END
  END AS "ArrayDistinct(ArrayMap(Array(), plus1_5(x)))"

I don't quite understand why it is 24, I would expect 3*4=12 times.

Perhaps this existed before the sqlglot refactor and I just didn't notice. But it seems like the same sort of question of "under what circumstances should we extract CTEs?"

@cpcloud
Member

cpcloud commented Mar 25, 2024

All bets are off with global mutable state. It's not the same problem, because the function is impure.

There's no reasonable way we can write a function that would determine whether global state is mutated.

If you mutate state like that you're 100% at the mercy of the query engine, its calling convention, and probably its optimizer.

@NickCrews
Contributor Author

NickCrews commented Mar 25, 2024

If I pass side_effects=False during UDF registration (which is a lie of course for this scenario), I would think that this would enable duckdb to optimize this, but I still see that i is 24 at the end. In my real use case my function is pure, so even though I can theoretically take advantage of the optimization, duckdb still isn't smart enough.

@NickCrews
Contributor Author

Oh, I see, I think you are talking about the feasibility of doing the optimization on the ibis side. Hmmm, that is trickier. What about non-UDFs: can we assume that all non-UDFs are pure and thus we can extract them?

@NickCrews
Contributor Author

Wait, I'm not sure that the pureness of a function should decide our behavior here. If someone writes a UDF (either pure or impure), we don't make any promise in our docs as to the number of times it will get called, and I don't think we should. Therefore, we have the freedom to call it as many times as we want, and I think the natural preference for us should be to call it as few times as possible.

@cpcloud
Member

cpcloud commented Mar 25, 2024

The number of times a function gets called is definitely not guaranteed, especially since Ibis is effectively never the caller. The only thing that should be guaranteed is the correct result.

I probably should have said something like: global state introduces dependencies on how something is done, and that will lead to invalid assumptions about function calls in SELECT statements, which are mostly free of global state.

@cpcloud
Member

cpcloud commented Mar 25, 2024

I think we're getting way into the weeds without a clear indication of what the problem is:

Right now we know the following:

  1. sqlglot is slow to generate because of $REASONS
  2. When code is finally generated, it's slower than it was before we started using sqlglot.

In theory, we can fix number one without fixing number 2 (which wouldn't be a useful thing to do IMO).

@kszucs We still need to address this issue, and it's still not clear how addressing sqlglot's sql generation speed would help address the execution speed.

@NickCrews
Contributor Author

NickCrews commented Mar 25, 2024

> it's slower than it was before we started using sqlglot.

It may have been slow before, I may just not have noticed this problem. But regardless, I think this is still something that we need to address.

@cpcloud
Member

cpcloud commented Mar 25, 2024

Well, now I'm confused :)

I thought you had something that was working (and reasonably fast) that is now unusable.

@NickCrews
Contributor Author

NickCrews commented Mar 25, 2024

ugh sorry yes I'm being confusing. Indeed, if you try out the repro example I shared above (with a few steps commented out to make it actually finish on main)

on 8.0.0:

  • 0.0 sec to generate 425 lines of SQL
  • .6 sec to .cache() the result

on main:

  • 0.0 sec to generate 4184 lines of SQL
  • 47 sec to .cache() the result

The thing that is the same in 8.0.0 vs main is that there are still some subexpressions that are repeated many times, when they possibly could/should be extracted into a single instance. But there are just many many more repeats on main.

@kszucs
Member

kszucs commented Apr 8, 2024

Resolved by #8825

@kszucs kszucs closed this as completed Apr 8, 2024