[query] implement `hl.dummy_code` #13601

danking · 2023-09-11T14:49:24Z

What happened?

Categorical data requires the user to preprocess their data. The subtle distinctions between dummy coding and one-hot encoding are not obvious to all users. We should provide a simple method, clear docs, and clear examples to ease the analysis of categorical variables.

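Here's a prototype implementation:

import hail as hl


def dummy_code(mt: hl.MatrixTable, *fields: str):
    # Collect the distinct values of each categorical column field.
    # aggregate_cols takes a single expression, so bundle the aggregations in a struct.
    field_categories = mt.aggregate_cols(hl.struct(**{
        field: hl.agg.collect_as_set(mt[field]) for field in fields
    }))

    # One boolean indicator per (field, category) pair.
    dummy_codes = {
        f'{field}_{category}': mt[field] == category
        for field in fields
        for category in field_categories[field]
    }

    mt = mt.annotate_cols(**dummy_codes)

    dummy_code_fields = list(dummy_codes)

    # Returns the categories seen per field, the new field names, and the annotated MatrixTable.
    return field_categories, dummy_code_fields, mt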


# Example

_, dummy_code_fields, mt = hl.dummy_code(mt, 'breed', 'color')
ht = hl.linear_regression_rows(
    y=mt.pheno,  # placeholder: the issue does not name the response field
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, *(mt[f] for f in dummy_code_fields)]
)

References

How to handle categorical variables manually in Hail. https://discuss.hail.is/t/how-do-i-include-a-categorical-variable-as-a-covariate-in-my-logistic-or-linear-regression/1362
Dummy coding vs one-hot encoding. https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn
A recent user request for this feature. https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/categorical.20covariates.20in.20regression

Version

0.2.122

Relevant log output

No response

Will-Tyler · 2024-02-07T20:04:57Z

After some reading, I am still not sure what exactly the difference is between dummy coding and one-hot encoding.

Suppose there is a categorical variable with $n$ categories. The referenced Stack Exchange question suggests that a one-hot encoding converts the categorical variable to $n$ indicator variables (one for each category) and that a dummy coding converts the categorical variable to $n-1$ indicator variables. With these definitions, the dummy coding is the one-hot encoding without one of the indicator variables.
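To make the two definitions above concrete, here is a minimal plain-Python sketch. It is not Hail code and not the proposed API; the helper name `indicator_columns`, its `drop_first` parameter, and the sample data are all made up for illustration.

```python
def indicator_columns(values, drop_first=False):
    """Build 0/1 indicator columns for one categorical variable.

    Hypothetical helper for illustration only; not part of Hail.
    """
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the implicit reference level.
        categories = categories[1:]
    return {c: [int(v == c) for v in values] for c in categories}


breeds = ['collie', 'poodle', 'beagle', 'poodle']

# One-hot encoding: n indicator columns for n categories.
one_hot = indicator_columns(breeds)

# Dummy coding (per the Stack Exchange definition): n - 1 indicator columns.
dummy = indicator_columns(breeds, drop_first=True)
```

Under these definitions the only difference is the missing column for the reference category, here `beagle`.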

However, from the prototype implementation in this issue, the scikit-learn one-hot encoder documentation, and the dummy variable Wikipedia article, I get the impression that dummy coding and one-hot encoding are synonyms and that there is no real distinction.

Anyway, I would like to work on this issue. I will base my implementation on the prototype, and perhaps we can add a parameter to drop one of the indicator variables similar to what the scikit-learn one-hot encoder has.
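One reason a drop option might matter: the example in this issue includes an intercept term (`1.0`) among the covariates, and the full set of $n$ indicators always sums to that intercept column. The plain-Python sketch below checks this on made-up data; the helper `one_hot_rows` is hypothetical and is not Hail or scikit-learn code.

```python
def one_hot_rows(values):
    """Return one row of 0/1 indicators per sample (hypothetical helper)."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


colors = ['black', 'white', 'brown', 'black']
rows = one_hot_rows(colors)
intercept = [1] * len(colors)

# Each sample belongs to exactly one category, so the indicators in every
# row sum to 1, which reproduces the intercept column exactly.
row_sums = [sum(r) for r in rows]

# Dropping the first indicator removes that exact linear dependence.
dummy_rows = [r[1:] for r in rows]
dummy_sums = [sum(r) for r in dummy_rows]
```

With all $n$ indicators plus an intercept, the covariates are linearly dependent, which is the usual argument for keeping only $n-1$ of them in a regression.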


danking added the new-feature label Sep 11, 2023

danking added the query label Oct 23, 2023

Will-Tyler added a commit to Will-Tyler/hail that referenced this issue Feb 7, 2024

Add dummy code method

53375bf

Closes hail-is#13601

Will-Tyler linked a pull request Feb 7, 2024 that will close this issue

Add convenience method to dummy code categorical variables #14269

Open

Will-Tyler added a commit to Will-Tyler/hail that referenced this issue Mar 8, 2024

Add dummy code method

1459b83

Closes hail-is#13601

Will-Tyler added a commit to Will-Tyler/hail that referenced this issue Mar 9, 2024

Add dummy code method

0930617

Closes hail-is#13601
