[query] implement hl.dummy_code #13601

Open
danking opened this issue Sep 11, 2023 · 1 comment · May be fixed by #14269

Comments

@danking
Contributor

danking commented Sep 11, 2023

What happened?

Analyzing categorical data currently requires users to preprocess it themselves. The subtle distinctions between dummy coding and one-hot encoding are not obvious to all users. We should provide a simple method, clear docs, and clear examples to ease the analysis of categorical variables.

Here's a prototype implementation:

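import hail as hl
from typing import List, Tuple


def dummy_code(mt: hl.MatrixTable, *fields: str) -> Tuple[hl.Struct, List[str], hl.MatrixTable]:
    # Collect the set of distinct values observed in each column field.
    # aggregate_cols takes a single expression, so the per-field sets are wrapped in a struct.
    field_categories = mt.aggregate_cols(hl.struct(**{
        field: hl.agg.collect_as_set(mt[field]) for field in fields
    }))

    # One boolean indicator per (field, category) pair, named '<field>_<category>'.
    dummy_codes = {
        f'{field}_{category}': mt[field] == category
        for field in fields
        for category in field_categories[field]
    }

    mt = mt.annotate_cols(**dummy_codes)

    dummy_code_fields = list(dummy_codes)

    return field_categories, dummy_code_fields, mt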


# Example

_, dummy_code_fields, mt = hl.dummy_code(mt, 'breed', 'color')
ht = hl.linear_regression_rows(
    y=mt.pheno,  # placeholder: the issue does not name a response field
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, *(mt[field] for field in dummy_code_fields)]
)

References

Version

0.2.122

Relevant log output

No response

@Will-Tyler
Contributor

After some reading, I am still not sure what exactly the difference is between dummy coding and one-hot encoding.

Suppose there is a categorical variable with $n$ categories. The referenced Stack Exchange question suggests that a one-hot encoding converts the categorical variable to $n$ indicator variables (one for each category) and that a dummy coding converts the categorical variable to $n-1$ indicator variables. With these definitions, the dummy coding is the one-hot encoding without one of the indicator variables.
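To make the two definitions above concrete, here is a minimal pure-Python sketch (not Hail code). The function names `one_hot` and `dummy` are made up for illustration, and choosing the first category in sorted order as the dropped one is an assumption; any category could serve as the reference.

```python
from typing import Dict, List, Sequence


def one_hot(values: Sequence[str]) -> Dict[str, List[int]]:
    """Return one 0/1 indicator column per observed category (n columns)."""
    categories = sorted(set(values))
    return {c: [int(v == c) for v in values] for c in categories}


def dummy(values: Sequence[str]) -> Dict[str, List[int]]:
    """Return the one-hot columns minus one reference category (n - 1 columns)."""
    columns = one_hot(values)
    reference = min(columns)  # assumption: drop the first category in sorted order
    return {c: col for c, col in columns.items() if c != reference}


breeds = ["lab", "pug", "lab", "corgi"]
print(one_hot(breeds))  # three indicator columns: corgi, lab, pug
print(dummy(breeds))    # two indicator columns: lab, pug
```

In the one-hot output every row sums to 1, so the indicators together duplicate an intercept column. Under the dummy coding, a row of all zeros identifies the dropped reference category.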

However, from the prototype implementation in this issue, the scikit-learn one-hot encoder documentation, and the dummy variable Wikipedia article, I get the impression that dummy coding and one-hot encoding are synonyms and that there is no real distinction.

Anyway, I would like to work on this issue. I will base my implementation on the prototype, and perhaps we can add a parameter to drop one of the indicator variables similar to what the scikit-learn one-hot encoder has.
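As a rough sketch of how such a parameter might affect which indicator fields get created, the pure-Python helper below mirrors the prototype's `'<field>_<category>'` naming. The helper `indicator_field_names` and its `drop_first` flag are hypothetical and only loosely modeled on the `drop` option of the scikit-learn encoder; they are not part of the prototype or of Hail.

```python
from typing import Dict, List, Set


def indicator_field_names(
    field_categories: Dict[str, Set[str]],
    drop_first: bool = False,  # hypothetical parameter, not part of the prototype
) -> List[str]:
    """List the indicator fields to create, named '<field>_<category>'."""
    names: List[str] = []
    for field, categories in field_categories.items():
        ordered = sorted(categories)  # sets are unordered, so fix an order first
        if drop_first:
            ordered = ordered[1:]  # the dropped category becomes the reference level
        names.extend(f"{field}_{category}" for category in ordered)
    return names


categories = {"breed": {"lab", "pug", "corgi"}, "color": {"black", "tan"}}
print(indicator_field_names(categories))
print(indicator_field_names(categories, drop_first=True))
```

Sorting before dropping keeps the choice of reference category deterministic across runs, which matters because the categories are collected as sets.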

Will-Tyler added a commit to Will-Tyler/hail that referenced this issue Feb 7, 2024
Will-Tyler added a commit to Will-Tyler/hail that referenced this issue Mar 8, 2024
Will-Tyler added a commit to Will-Tyler/hail that referenced this issue Mar 9, 2024