API/ENH: from_dummies #8745

Closed
jreback opened this issue Nov 6, 2014 · 33 comments

@jreback
Contributor

jreback commented Nov 6, 2014

Motivating from SO

This is the inverse of pd.get_dummies. So maybe invert_dummies is better?
I think this name makes more sense though.

This seems a reasonable way to do it. Am I missing anything?

In [46]: s = Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]: 
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

NB. this is buggy ATM.

In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1), categories=df.columns))
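
For reference, the stack-based inversion above can be written as a self-contained script. This is only a minimal sketch, assuming a single set of dummy columns with exactly one entry set per row; the variable names `hits` and `recovered` are illustrative and not part of any proposal in this thread.

```python
import pandas as pd

s = pd.Series(list("aaabbbccd"))
df = pd.get_dummies(s)

# Stack to a Series indexed by (row label, column label), then keep
# only the entries that are set (works for both 0/1 and bool dummies).
x = df.stack()
hits = x[x.astype(bool)]

# Level 1 of the index holds the column label of each set entry;
# the dummy columns themselves supply the categories.
recovered = pd.Series(
    pd.Categorical(hits.index.get_level_values(1), categories=df.columns),
    index=hits.index.get_level_values(0),
)
```

Rows with no entry set would simply be absent from `recovered`, so this form is only suitable when every row belongs to exactly one category.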
@jreback jreback added the Bug, Enhancement, Reshaping, API Design, and Categorical labels Nov 6, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 6, 2014
@TomAugspurger
Contributor

We'll need to handle the case of a DataFrame with dummy columns and non-dummy columns.

@jorisvandenbossche
Member

@TomAugspurger Can't we say that it is up to the user to provide the correct selection of columns? (and so error on non-dummy columns?)

I am not really sold on get_categories (this could also mean a lot of other things, since you can get categories from other types of data than dummies), so something with 'dummies' in the name feels better (invert_dummies, from_dummies, ... or something with the meaning of 'condense/melt dummies').

@TomAugspurger
Contributor

@jorisvandenbossche, yeah, by "handle" I meant think about, and I think raising is the best solution, sorry.

What to do with NaNs?

pd.get_dummies(['a', 'b', np.nan], dummy_na=True)

We should probably have a symmetrical argument for from_dummies. (I'm not sure how Categorical handles a NaN as a category.)
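
To make the NaN question concrete, here is a small sketch of what `dummy_na=True` produces and one way it could be inverted. It assumes exactly one column is set per row; the inversion shown is an illustration, not a proposed `from_dummies` behaviour.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan])

# dummy_na=True adds one extra indicator column whose label is NaN.
dummies = pd.get_dummies(s, dummy_na=True)

# Invert by picking, for each row, the label of the column that is set.
positions = dummies.to_numpy().argmax(axis=1)
recovered = pd.Series(dummies.columns.to_numpy()[positions], index=dummies.index)
```

Because the NaN column label maps straight back to a missing value, no separate argument is needed in this simple case; a symmetrical argument would matter mainly for validation.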

@jreback jreback changed the title API/ENH: get_categories API/ENH: from_dummies Nov 6, 2014
@jreback
Contributor Author

jreback commented Nov 6, 2014

I like from_dummies

@jreback jreback modified the milestones: 0.16.0, 0.15.2 Nov 30, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@metasyn

metasyn commented May 17, 2015

+1

@pkch

pkch commented Nov 30, 2015

Should the milestone be modified from 0.16.0 to 0.18.0?

@hayd
Contributor

hayd commented Dec 30, 2015

Here's a function for DataFrames (again from SO):

from collections import defaultdict

import numpy as np
import pandas as pd

def reverse_dummy(df_dummies):
    pos = defaultdict(list)
    vals = defaultdict(list)

    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            pos["_"].append(i)

    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                      for k in vals})

    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df

@TomAugspurger
Contributor

What kind of roundtrip-ability can we hope for here? Ideally we have

x == pd.from_dummies(pd.get_dummies(x))

The problem is we lose the Categorical information when calling get_dummies.
In order to fully reconstruct a Categorical we would need to include the categories (if any, remember get_dummies will work on non-categorical) and the ordering when calling from_dummies.

def from_dummies(data, categories, ordered):
   ...

Additionally it could be that data came from a DataFrame, so there might be multiple sets of dummy columns and non-dummy columns. In this case we have something like

def from_dummies(data, categories, ordered, prefixes):
    pass

Where all of prefixes, categories and ordered are scalars or lists of the same length (with a special case for categories and ordered as scalars and prefixes=None, to handle inverting pd.get_dummies(Series)).

Thoughts? That's kind of messy, but I don't see any way around it and I think we should shoot for perfect roundtrip-ability.

@jreback
Contributor Author

jreback commented Jan 9, 2016

you can simply infer the categories (as they are the labels of the matrix).

@TomAugspurger
Contributor

Categories you can get, but not whether it's ordered and what the ordering is if they are ordered.

EDIT: Oh, you can't necessarily infer categories even since pd.get_dummies(['a', 'a', 'b']) is the same as pd.get_dummies(pd.Series(pd.Categorical(['a', 'a', 'b'])))
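
The point in the EDIT can be checked directly. This sketch compares only the dummy values and the column labels; whether the column index itself carries any dtype information may depend on the pandas version, so that is deliberately not asserted here.

```python
import numpy as np
import pandas as pd

from_list = pd.get_dummies(["a", "a", "b"])
from_categorical = pd.get_dummies(pd.Series(pd.Categorical(["a", "a", "b"])))

# Same indicator values and same column labels for both inputs.
same_values = np.array_equal(from_list.to_numpy(), from_categorical.to_numpy())
same_labels = list(from_list.columns) == list(from_categorical.columns)
```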


@jorisvandenbossche
Member

@TomAugspurger What does the signature look like in the version you are working on?
Is the purpose to detect the different sets of dummies based on the column names (as the output of get_dummies looks like)?
Would it return object or category columns?

@TomAugspurger
Contributor

Current signature

def from_dummies(data, categories=None, ordered=None, prefixes=None):
    '''
    The inverse transformation of ``pandas.get_dummies``.

    Parameters
    ----------
    data : DataFrame
    categories : Index or list of Indexes
    ordered : boolean or list of booleans
    prefixes : str or list of str

    Returns
    -------
    transformed : Series or DataFrame

    Notes
    -----
    To recover a Categorical, you must provide the categories and
    maybe whether it is ordered (default False). To invert a DataFrame
    that includes either multiple sets of dummy-encoded columns or a
    mixture of dummy-encoded columns and regular columns, you must
    specify ``prefixes``.
    '''

The default will be to return a regular Series where the values are the column labels (so int or str probably). To return a Categorical you pass in the categories. If I switched to returning a Categorical by default, we would need to provide a flag like return_categorical to disable that.
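
The default-versus-Categorical behaviour described above could look roughly like the following. This is a hedged sketch for a single set of dummy columns only: the name `from_dummies_sketch` is hypothetical, `prefixes` is left out, and it is not the code from the branch discussed here.

```python
import pandas as pd

def from_dummies_sketch(data, categories=None, ordered=None):
    """Hypothetical single-set inverse of get_dummies (one set entry per row)."""
    # Label of the column that is set in each row.
    labels = data.columns.to_numpy()[data.to_numpy().argmax(axis=1)]
    if categories is None:
        # Default: a regular Series whose values are the column labels.
        return pd.Series(labels, index=data.index)
    # Categories (and optionally ordering) supplied: return categorical data.
    cat = pd.Categorical(labels, categories=categories, ordered=bool(ordered))
    return pd.Series(cat, index=data.index)

s = pd.Series(pd.Categorical(list("abca"), categories=list("abc"), ordered=True))
dummies = pd.get_dummies(s)

plain = from_dummies_sketch(dummies)
roundtrip = from_dummies_sketch(dummies, categories=list("abc"), ordered=True)
```

Only the second call recovers the ordering, which is the roundtrip cost being discussed: the caller has to carry `categories` and `ordered` alongside the dummy matrix.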

Is the purpose to detect the different sets of dummies based on the column names

That's what my prefixes argument is for. If you have multiple dummy-encoded sets you use prefixes=["first_dummy_set", "second_set", ...] and that will find all the columns with that prefix. This may fail (or succeed silently!) if you have a column name that happens to share a prefix... This is beginning to look pretty complicated.
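
The silent-success risk is easy to reproduce. In this sketch the column `color_depth` is an ordinary numeric column that merely shares the prefix; the 0/1 check is one possible guard, shown as an assumption rather than as part of the proposed signature.

```python
import pandas as pd

df = pd.DataFrame({
    "color_red": [1, 0],
    "color_blue": [0, 1],
    "color_depth": [8, 24],  # a regular column, not a dummy
})

prefix = "color"
# Prefix matching alone picks up all three columns.
matched = [c for c in df.columns if c.startswith(prefix + "_")]

# A basic sanity check: genuine dummy columns contain only 0/1 values.
looks_like_dummy = {c: bool(df[c].isin([0, 1]).all()) for c in matched}
```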

@jpgrossman

This is exactly what I'm looking for... any progress? Beta?

Thanks!

@TomAugspurger
Contributor

@jpgrossman I have a branch at https://github.com/TomAugspurger/pandas/tree/from_dummies, though it's been a while since I've looked at that. There are several changes I would make to that, so if you're interested you could use that as a starting point (maybe just the tests).

@jpgrossman

jpgrossman commented Oct 25, 2016

Thank you Tom – will have a look at this soon.

@jreback
Contributor Author

jreback commented Jan 28, 2017

pull requests are welcome!

@liorshk

liorshk commented Jun 20, 2017

Any update here?
@TomAugspurger Your link doesn't work anymore

@TomAugspurger
Contributor

@liorshk I haven't had time. Would you have a chance to submit a PR?

@kevin-winter

Here is a quick-and-dirty solution for the easiest case, using no prefix.

import numpy as np
import pandas as pd

def from_dummies(data, categories, prefix_sep='_'):
    out = data.copy()
    for cat in categories:
        prefix = cat + prefix_sep
        # Dummy columns for this variable, and their labels with the prefix stripped
        cols = [c for c in data.columns if c.startswith(prefix)]
        labs = [c[len(prefix):] for c in cols]
        out[cat] = pd.Categorical(np.array(labs)[np.argmax(data[cols].to_numpy(), axis=1)])
        out.drop(cols, axis=1, inplace=True)
    return out

Usage:

categorical_cols = df.columns[df.dtypes.astype(str) == "category"]
dummies = pd.get_dummies(df)
original_df = from_dummies(dummies, categories=categorical_cols)

Please note that the transformed columns are appended at the end, so the columns of the resulting DataFrame will not be in the original order. I hope that helps some of you!
Cheers!

@joshlk

joshlk commented May 21, 2018

Would it make more sense to provide an option in get_dummies to also output a map between the original column name, new column name and categories? This could then be used to feed the reverse from_dummies function to recreate the old dataframe
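
One way such a mapping could work is sketched below. Both helpers are hypothetical names invented for this illustration (`get_dummies` has no such option in the source discussion), and the sketch assumes string-valued columns with no missing values.

```python
import numpy as np
import pandas as pd

def get_dummies_with_map(df, columns, prefix_sep="_"):
    """Hypothetical helper: dummy-encode `columns` and record the mapping."""
    dummies = pd.get_dummies(df, columns=columns, prefix_sep=prefix_sep)
    # original column -> {dummy column name: category value}
    mapping = {
        col: {f"{col}{prefix_sep}{val}": val for val in sorted(df[col].dropna().unique())}
        for col in columns
    }
    return dummies, mapping

def from_dummies_with_map(dummies, mapping):
    """Hypothetical helper: rebuild the original columns from the mapping."""
    out = dummies.drop(columns=[c for m in mapping.values() for c in m])
    for col, m in mapping.items():
        labels = np.array(list(m.values()), dtype=object)
        out[col] = labels[dummies[list(m)].to_numpy().argmax(axis=1)]
    return out

df = pd.DataFrame({"size": ["s", "m", "s"], "color": ["red", "blue", "red"], "n": [1, 2, 3]})
dummies, mapping = get_dummies_with_map(df, ["size", "color"])
restored = from_dummies_with_map(dummies, mapping)
```

Carrying the mapping avoids any guessing from column names, at the cost of a second return value that the caller has to keep around.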

@raam93

raam93 commented Sep 1, 2018

I have edited @kevin-winter 's code in case someone has drop_first=True in pd.get_dummies():
i.e., dummies = pd.get_dummies(df, drop_first=True)

def from_dummies(data, categorical_cols, categorical_cols_first, prefix_sep='_'):
    out = data.copy()

    for col_parent in categorical_cols:
        
        filter_col = [col for col in data if col.startswith(col_parent + prefix_sep)]
        cols_with_ones = np.argmax(data[filter_col].values, axis=1)
        
        org_col_values = []
        for row, col in enumerate(cols_with_ones):
            if col == 0 and data[filter_col].iloc[row, col] < 1:
                org_col_values.append(categorical_cols_first.get(col_parent))
            else:
                org_col_values.append(data[filter_col].columns[col].split(col_parent+prefix_sep,1)[1])
        
        out[col_parent] = pd.Series(org_col_values).values
        out.drop(filter_col, axis=1, inplace=True)    
        
    return out

categorical_cols_first is a dictionary mapping each categorical variable to its first level, which is the level dropped by pd.get_dummies(drop_first=True).

categorical_cols_first = []
for col in categorical_cols:
    categorical_cols_first.append(df[col].value_counts().sort_index().keys()[0])
categorical_cols_first = dict(zip(categorical_cols, categorical_cols_first))

Wrote it quickly, so please comment if there is any bug. It worked for me though.
Hope this helps!
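
The same drop_first idea can be expressed without a Python-level loop over rows. This is a minimal sketch for a single variable, assuming the dropped baseline level is known in advance (here hard-coded as "a"); it is not a drop-in replacement for the function above.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"])
baseline = "a"  # the level removed by drop_first=True in this example

dummies = pd.get_dummies(s, drop_first=True)  # columns: b, c
values = dummies.to_numpy().astype(bool)

# argmax finds the set column per row; rows with nothing set fall back
# to the dropped baseline level.
labels = dummies.columns.to_numpy()[values.argmax(axis=1)]
recovered = pd.Series(np.where(values.any(axis=1), labels, baseline), index=s.index)
```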

@andreaaraldo

I would raise an exception in @kevin-winter's function when data[cols] is empty, explaining that one of the provided column names is incorrect.

@MarcoGorelli
Member

Seems like a popular request, I'll start working on this

@clbarnes
Contributor

clbarnes commented May 19, 2020

I failed to find this on a search, and so created a duplicate issue.

My approach was to add from_dummies as an alternate constructor for Categorical: that way it's clear what it creates, it's easy to discover and to find documentation for, and the additional arguments are passed straight to that object. And let's not forget, "Namespaces are one honking great idea -- let's do more of those!".

This implementation minimises loops in python (although there are a couple of whole-dataframe copies), but doesn't do a lot of nannying for incorrect inputs:

import numpy as np 
import pandas as pd

class Categorical:
    ...
    
    @classmethod
    def from_dummies(cls, df: pd.DataFrame, **kwargs):
        onehot = df.astype(bool)

        if (onehot.sum(axis=1) > 1).any():
            raise ValueError("Some rows belong to >1 category")

        index_into = pd.Series([np.nan] + list(onehot.columns))
        mult_by = np.arange(1, len(index_into))

        indexes = (onehot.astype(int) * mult_by).sum(axis=1)
        values = index_into[indexes]

        return cls(values, df.columns, **kwargs)
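
For anyone wanting to try the idea without patching the class, a free-function variant is sketched below. Note the swap: it builds the result with `pd.Categorical.from_codes` and a code of -1 for empty rows, rather than the index-into-a-Series trick above, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd

def categorical_from_dummies(df, **kwargs):
    """Hypothetical free-function variant; kwargs go to Categorical.from_codes."""
    onehot = df.astype(bool).to_numpy()
    if (onehot.sum(axis=1) > 1).any():
        raise ValueError("Some rows belong to >1 category")
    # Code -1 marks a missing value for rows with no category set.
    codes = np.where(onehot.any(axis=1), onehot.argmax(axis=1), -1)
    return pd.Categorical.from_codes(codes, categories=df.columns, **kwargs)

df = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0]})
cat = categorical_from_dummies(df, ordered=True)
```

The last row of `df` has no category set, so it comes back as a missing value rather than raising.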

@clbarnes
Contributor

Think I'm taking this on, should be able to have a go tomorrow. For the sake of symmetry, I'd also like to give Categorical a to_dummies. If we go down that route, it might be nice to eventually deprecate the get_dummies free function so as to keep categorical-related functionality on the Categorical class and not duplicate API surface.

Also just to check - strictly, dummy variables are of float type, and valued 0 and 1, where one-hot encoded variables are of binary type? Is that a distinction we want to keep here? Users can always .astype(bool) on it.

@TomAugspurger
Contributor

Also just to check - strictly, dummy variables are of float type, and valued 0 and 1, where one-hot encoded variables are of binary type? Is that a distinction we want to keep here?

Why do you say they're float dtype?

In [4]: pd.get_dummies(pd.Series([1, 2, 3])).dtypes
Out[4]:
1    uint8
2    uint8
3    uint8
dtype: object
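
The uint8 result above is the default of the pandas version used at the time; newer releases default to bool instead. A sketch of pinning the dtype explicitly, using the `dtype` parameter of `get_dummies`, so the result does not depend on the installed version:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# Request an explicit dtype instead of relying on the version-specific default.
as_uint8 = pd.get_dummies(s, dtype="uint8")
as_float = pd.get_dummies(s, dtype=float)

# Any of these can be converted to booleans for one-hot style use.
as_bool = as_uint8.astype(bool)
```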

@clbarnes
Contributor

I just had a look through some docs and it looked like the term "dummy variable" is used mainly in regression, in cases where you have a categorical variable but need to encode it as continuous (i.e. floating) for the purposes of that regression. The term "one-hot encoding" seems more commonly used in applications which deal in actual booleans. For both of them, the information itself is binary, of course.

I may be completely making up that distinction, though.

@TomAugspurger
Contributor

In my experience "one-hot encoding" and "dummy variables" are synonymous.

@MarcoGorelli
Member

In my experience "one-hot encoding" and "dummy variables" are synonymous.

Seems the scikit-learn docs would agree

The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter)

@clbarnes
Contributor

take

@MarcoGorelli MarcoGorelli removed their assignment May 27, 2020
@mroeschke mroeschke removed the API Design, Reshaping, and Bug labels Apr 11, 2021
@pckSF pckSF mentioned this issue Jun 9, 2021
10 tasks
@jreback jreback modified the milestones: Contributions Welcome, 1.5 Feb 1, 2022
@mroeschke
Member

Closed by #41902
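
For readers arriving from a search: pandas 1.5 and later ship a top-level `pd.from_dummies`. The sketch below assumes such a version is installed and shows the `sep` and `default_category` parameters on a small made-up frame; the result holds plain label columns, not Categorical ones.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "a"], "col2": ["b", "a", "c"]})
dummies = pd.get_dummies(df)  # columns like col1_a, col1_b, col2_a, ...

# sep tells from_dummies where the prefix ends and the category label begins.
restored = pd.from_dummies(dummies, sep="_")

# With drop_first=True the dropped level must be supplied as default_category.
reduced = pd.get_dummies(df, drop_first=True)
restored_reduced = pd.from_dummies(
    reduced, sep="_", default_category={"col1": "a", "col2": "a"}
)
```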
