Wishlist: make get_dummies() usable for train / test framework #8918

Closed
chrish42 opened this Issue Nov 28, 2014 · 17 comments

chrish42 commented Nov 28, 2014

Having get_dummies() in pandas is really nice, but to be useful for machine learning it would need to be usable in a train/test framework (or "fit_transform" and "transform", in sklearn terminology). Let me know if this needs more explanation.

So, I guess this is a wishlist bug report to add that functionality to Pandas. I can even create a pull request, if people agree this would be something useful to have in Pandas (and are willing to coach a bit and do code review for what would be my first contribution to this project).

jreback commented Nov 28, 2014

How about a pseudo-code example, with inputs and outputs from a sample frame? That would be useful.

TomAugspurger commented Nov 29, 2014

@chrish42, an example would be great.

FYI scikit-learn has the OneHotEncoder class which fits into their pipeline.

Something like this should work?

import pandas as pd
from sklearn.base import TransformerMixin

class DummyEncoder(TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def transform(self, X, y=None, **kwargs):
        return pd.get_dummies(X, columns=self.columns)

    def fit(self, X, y=None, **kwargs):
        return self

Giving

In [15]: df
Out[15]: 
   A  B  C
0  1  a  a
1  2  b  a

In [16]: DummyEncoder().transform(df)
Out[16]: 
   A  B_a  B_b  C_a
0  1    1    0    1
1  2    0    1    1

Be careful with the ordering of the columns.
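(Editor's note: a caveat worth spelling out, as a minimal sketch reusing the df above. Because fit stores no state, the encoder is stateless, so a frame with different observed levels produces different dummy columns — exactly the train/test mismatch discussed below. The df_test frame here is hypothetical.)

df_test = pd.DataFrame({'A': [3], 'B': ['b'], 'C': ['c']})

# Stateless encoding: only the levels observed in this frame get columns.
DummyEncoder().transform(df_test)
#    A  B_b  C_c
# 0  3    1    1
# Training produced A, B_a, B_b, C_a -- the shapes no longer line up.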

chrish42 commented Jan 6, 2015

@TomAugspurger, actually the compatibility with the sklearn processing pipeline itself is not the part that interests me. What I would like is the ability to save the transformation done by get_dummies() on a dataset, and then apply said transformation as-is (creating the exact same columns), even if the second dataset has a subset of the values of the first one in some column, etc. That's actually what I meant by "usable in a train/test framework". Is this explanation clearer? (I can add an example if someone thinks that's still needed.)

I'm aware of the OneHotEncoder class in sklearn, but it has other limitations.
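(Editor's note: one way to get that behavior with plain pandas — a sketch of the idea, not an API that exists in pandas. Record the dummy columns produced on the first dataset and reindex every later frame against them; the PersistentDummies name is invented for illustration.)

import pandas as pd

class PersistentDummies:
    """Learn the dummy columns on train; reapply them verbatim on test."""

    def fit_transform(self, X):
        dummies = pd.get_dummies(X)
        self.columns_ = dummies.columns  # remember the training layout
        return dummies

    def transform(self, X):
        # Levels unseen at fit time are dropped; levels missing from X
        # come back as all-zero columns, so the shapes always match.
        return pd.get_dummies(X).reindex(columns=self.columns_, fill_value=0)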

davidbp commented Oct 5, 2015

I stumbled upon the same problem as @chrish42 and found get_dummies giving me some headaches.

Example of the limitations of the current get_dummies

Let us assume we work with data from the following df_train DataFrame

df_train = pandas.DataFrame({"car": ["seat", "bmw"], "color": ["red", "green"]})
pandas.get_dummies(df_train)

   car_bmw  car_seat  color_green  color_red
0        0         1            0          1
1        1         0            1          0

Then we are provided with

df_test = pandas.DataFrame({"car": ["seat", "mercedes"], "color": ["red", "green"]})
pandas.get_dummies(df_test)

   car_mercedes  car_seat  color_green  color_red
0             0         1            0          1
1             1         0            1          0

Since I have never observed a "mercedes" value for the variable "car" in df_train, I would like to be able to get the following one-hot encoding:

   car_bmw  car_seat  color_green  color_red
0        0         1            0          1
1        0         0            1          0

Where the column car_mercedes actually never appears.

This could be solved by allowing get_dummies to receive an input dictionary stating the accepted values that we allow for each column.

Returning to the previous example, we could give as input to get_dummies the following dict of sets

accepted_values_per_column = {'car': {'bmw', 'seat'}, 'color': {'green', 'red'}}

and we would expect get_dummies to return

get_dummies(df_test, accepted_values_per_column=accepted_values_per_column)

   car_bmw  car_seat  color_green  color_red
0        0         1            0          1
1        0         0            1          0

and expect get_dummies(df_test) to return what it already returns.
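(Editor's note: the proposed accepted_values_per_column argument does not exist in pandas, but the desired output above can already be reproduced by reindexing the test dummies against the training columns — a sketch using the frames from this comment.)

# Encode train once, remember its columns, then force test into that layout.
train_columns = pandas.get_dummies(df_train).columns
pandas.get_dummies(df_test).reindex(columns=train_columns, fill_value=0)
#    car_bmw  car_seat  color_green  color_red
# 0        0         1            0          1
# 1        0         0            1          0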

jreback commented Oct 5, 2015

You simply need to make your variables Categorical if you want to specify possibly-unobserved categories. This can be done at creation time or afterwards; see the docs.

In [5]: df_train = pd.DataFrame({"car": pd.Series(["seat", "bmw"]).astype('category', categories=['seat', 'bmw', 'mercedes']),
   ...:                          "color": ["red", "green"]})

In [6]: df_train
Out[6]: 
    car  color
0  seat    red
1   bmw  green

In [7]: pd.get_dummies(df_train )
Out[7]: 
   car_seat  car_bmw  car_mercedes  color_green  color_red
0         1        0             0            0          1
1         0        1             0            1          0

The original question is not well specified, so closing.
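(Editor's note: a sketch extending this answer to the test frame from the previous comment. Restricting the test column to the categories fixed at train time makes get_dummies emit exactly the training column set, with unseen values encoded as all zeros.)

In [8]: df_test = pd.DataFrame({"car": ["seat", "mercedes"], "color": ["red", "green"]})

# Keep only the categories observed in training; "mercedes" becomes NaN
# and therefore gets no dummy column of its own.
In [9]: df_test['car'] = pd.Categorical(df_test['car'], categories=['seat', 'bmw'])

In [10]: pd.get_dummies(df_test)
Out[10]: 
   car_seat  car_bmw  color_green  color_red
0         1        0            0          1
1         0        0            1          0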

TomAugspurger commented Oct 5, 2015

And when you're going the other way, from the encoding back to Categorical, you'll use Categorical.from_codes.

One more bit of unsolicited advice: if you care at all about accurate estimates of the coefficients on the categoricals, drop one of the encoded columns, or else you'll have multicollinearity with the intercept (if you have one).
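(Editor's note: newer pandas versions expose this directly as get_dummies(..., drop_first=True), which drops the first level of each encoded column. A sketch on the df_train frame from the previous comment:)

In [8]: pd.get_dummies(df_train, drop_first=True)
Out[8]: 
   car_bmw  car_mercedes  color_red
0        0             0          1
1        1             0          0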


aileronajay commented Jan 13, 2017

@TomAugspurger @jreback I think I have run into the same problem lately and I would like to give an example:

train_a = pd.DataFrame({"IsBadBuy": [0, 1, 0], "Make": ['Toyota', 'Mazda', 'BMW']})
print(pd.get_dummies(train_a, columns=['Make']))

   IsBadBuy  Make_BMW  Make_Mazda  Make_Toyota
0         0         0           0            1
1         1         0           1            0
2         0         1           0            0

test_a = pd.DataFrame({"Make": ['Toyota', 'BMW']})
print(pd.get_dummies(test_a, columns=['Make']))

   Make_BMW  Make_Toyota
0         0            1
1         1            0

Here, ideally, the Make_Mazda column should be preserved, as the ML algorithm expects the same number of features, and the values we see at test time will be a subset of those in train.
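(Editor's note: a sketch of a workaround with the existing API — fix the Make categories from train before encoding test, continuing this comment's frames.)

# Learn the full set of Make levels from the training frame.
make_categories = train_a['Make'].astype('category').cat.categories
test_a['Make'] = pd.Categorical(test_a['Make'], categories=make_categories)
print(pd.get_dummies(test_a, columns=['Make']))
#    Make_BMW  Make_Mazda  Make_Toyota
# 0         0           0            1
# 1         1           0            0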

TomAugspurger commented Jan 13, 2017

aileronajay commented Jan 13, 2017

brifordwylie commented May 10, 2017

The PyData Chicago 2016 talk given by @TomAugspurger was really well done. He did a fantastic job of illustrating all the reasons why this issue/request should not be closed. IMHO, his DummyEncoder class, or some reasonable equivalent, should be included in pandas proper. Yes, I can go to his GitHub and copy/emulate his class, but it would be much nicer to have it supported within the library.

TomAugspurger commented May 11, 2017

brifordwylie commented May 11, 2017

@TomAugspurger Ah, good point, I hadn't considered the dependency issue; okay, I'll retract my suggestion to put it into pandas proper.

joeddav commented Oct 17, 2017

Here's a little solution some of us worked on that may be helpful for some here: dummy variables with fit/transform capabilities.

https://github.com/joeddav/get_smarties

Feedback and contributions would be helpful!

Aylr commented Nov 1, 2017

This appears related to #14017

yashu-seth commented Dec 14, 2017

I have created a solution that may help with exactly this problem: one-hot encoding categorical variables in a train/test framework. It can also handle cases where the dataset is too large to fit in the machine's memory.

https://github.com/yashu-seth/dummyPy

You can also find a small tutorial on this here.

md733406 commented Jan 11, 2018

@TomAugspurger This code doesn't work. When I transform my production single-record data, it only gives me the one-hot-encoded column for the single value that is present.
What am I missing?

import pyodbc
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.pipeline import make_pipeline

class DummyEncoder(TransformerMixin):
    def fit(self, X, y=None):
        self.index_ = X.index
        self.columns_ = X.columns
        self.cat_columns_ = X.select_dtypes(include=['category']).columns
        self.non_cat_columns_ = X.columns.drop(self.cat_columns_)

        self.cat_map_ = {col: X[col].cat for col in self.cat_columns_}

        left = len(self.non_cat_columns_)
        self.cat_blocks_ = {}
        for col in self.cat_columns_:
            right = left + len(X[col].cat.categories)
            self.cat_blocks_[col], left = slice(left, right), right
        return self

    def transform(self, X, y=None):
        return np.asarray(pd.get_dummies(X))

    def inverse_transform(self, X):
        non_cat = pd.DataFrame(X[:, :len(self.non_cat_columns_)],
                               columns=self.non_cat_columns_)
        cats = []
        for col, cat in self.cat_map_.items():
            slice_ = self.cat_blocks_[col]
            codes = X[:, slice_].argmax(1)
            series = pd.Series(pd.Categorical.from_codes(
                codes, cat.categories, ordered=cat.ordered
            ), name=col)
            cats.append(series)
        df = pd.concat([non_cat] + cats, axis=1)[self.columns_]
        return df

# import data from SQL into a pandas DataFrame
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER={XXXXX};DATABASE={ML_Learn_Taxi};UID={XXXX};PWD={XXXX}')
sql ="""
SELECT top 1 CONVERT(int, [order_key]) order_key
,CONVERT(int, [service_date_key]) service_date_key
,[order_source_desc]
,1 as 'return_flag'
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc]='Online'
UNION
SELECT top 2 CONVERT(int, [order_key])
,CONVERT( int, [service_date_key])
,[order_source_desc]
,2
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc]='Inbound Call'
UNION
SELECT top 1 CONVERT(int, [order_key])
,CONVERT(int, [service_date_key])
,[order_source_desc]
,1
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc]='Outbound Call'
"""

prod_sql ="""
SELECT top 1 CONVERT(int, [order_key]) order_key
,CONVERT(int, [service_date_key]) service_date_key
,[order_source_desc]
,1 as 'return_flag'
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc]='Online'
"""

InputDataSet = pd.read_sql(sql, cnxn)
ProdDataSet = pd.read_sql(prod_sql, cnxn)

print("*************** Data ******************")
print(InputDataSet)

print("******** Category Columns Info ***********")
columns = ['order_source_desc']
InputDataSet[columns] = InputDataSet[columns].apply(lambda x: x.astype('category'))

InputDataSet.info()

print("******** Linear Regression ***********")

X = InputDataSet.drop('return_flag', axis=1)
y = InputDataSet['return_flag']

A = ProdDataSet.drop('return_flag', axis=1)
B = ProdDataSet['return_flag']

enc = DummyEncoder()
enc.fit(X)
# train = enc.transform(X)
Prod = enc.transform(A)

print(Prod)

OUTPUT: *************** Data ******************
   order_key  service_date_key order_source_desc  return_flag
0   10087937          20151214            Online            1
1   10088174          20151201      Inbound Call            2
2   10088553          20151217      Inbound Call            2
3     663478          20160806     Outbound Call            1
******** Category Columns Info ***********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
order_key            4 non-null int64
service_date_key     4 non-null int64
order_source_desc    4 non-null category
return_flag          4 non-null int64
dtypes: category(1), int64(3)
memory usage: 284.0 bytes
******** Linear Regression ***********
[[10087937 20151214 1]]
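(Editor's note, not a reply from the original thread: the likely cause is that transform calls pd.get_dummies(X) on the raw production frame, whose order_source_desc arrives from SQL as plain object dtype with a single observed value, so only one dummy column is produced; the categories captured in fit are never reapplied. A hedged fix is to recast with the fitted categories inside transform:)

def transform(self, X, y=None):
    X = X.copy()
    for col, cat in self.cat_map_.items():
        # Reapply the categories learned in fit(), so a single-row
        # production frame still yields one dummy column per training level.
        X[col] = pd.Categorical(X[col], categories=cat.categories,
                                ordered=cat.ordered)
    return np.asarray(pd.get_dummies(X))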
