
Pandas get_dummies() and n-1 Categorical Encoding Option to avoid Collinearity? #12042

Closed
jaradc opened this issue Jan 15, 2016 · 3 comments
Labels: Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

@jaradc

jaradc commented Jan 15, 2016

When encoding categorical variables for linear regression, perfect collinearity (the dummy-variable trap) can be a problem. The standard workaround is to use n-1 dummy columns per variable. It would be useful if pd.get_dummies() had a boolean parameter that returned n-1 columns for each categorical column it encodes.

Example:

>>> df
    Account  Network      Device
0  Account1   Search  Smartphone
1  Account1  Display      Tablet
2  Account2   Search  Smartphone
3  Account3  Display  Smartphone
4  Account2   Search      Tablet
5  Account3   Search  Smartphone
>>> pd.get_dummies(df)
   Account_Account1  Account_Account2  Account_Account3  Network_Display  \
0                 1                 0                 0                0   
1                 1                 0                 0                1   
2                 0                 1                 0                0   
3                 0                 0                 1                1   
4                 0                 1                 0                0   
5                 0                 0                 1                0   

   Network_Search  Device_Smartphone  Device_Tablet  
0               1                  1              0  
1               0                  0              1  
2               1                  1              0  
3               0                  1              0  
4               1                  0              1  
5               1                  1              0 

Instead, I'd like to have some parameter such as drop_first=True in get_dummies() and it does something like this:

>>> new_df = pd.DataFrame(index=df.index)
>>> for col in df:
...     new_df = new_df.join(pd.get_dummies(df[col]).iloc[:, 1:])


>>> new_df
   Account2  Account3  Search  Tablet
0         0         0       1       0
1         0         0       0       1
2         1         0       1       0
3         0         1       0       0
4         1         0       1       1
5         0         1       1       0

Sources
http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
http://stackoverflow.com/questions/31498390/how-to-get-pandas-get-dummies-to-emit-n-1-variables-to-avoid-co-lineraity
http://dss.princeton.edu/online_help/analysis/dummy_variables.htm

@TomAugspurger TomAugspurger added Stats Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 15, 2016
@TomAugspurger
Contributor

Sounds good, interested in submitting a pull request?

@StephenKappel
Contributor

👍

@jreback jreback added this to the 0.18.0 milestone Jan 27, 2016
@jreback jreback closed this as completed in 62363d2 Feb 8, 2016
cldy pushed a commit to cldy/pandas that referenced this issue Feb 11, 2016

ENH: GH12042 Add parameter `drop_first` to get_dummies to get k-1 variables out of n levels.

closes pandas-dev#12042    Sometimes it's useful to only accept n-1 variables
out of n categorical levels.

Author: Bran Yang <yangbo.84@gmail.com>

Closes pandas-dev#12092 from BranYang/master and squashes the following commits:

0528c57 [Bran Yang] Compare with empty DataFrame, not just check empty
0d99c2a [Bran Yang] Test the case that `drop_first` is on and categorical variable only has one level.
45f14e8 [Bran Yang] ENH: GH12042 Add parameter `drop_first` to get_dummies to get k-1 variables out of n levels.
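With that change merged, the requested behavior is available directly from pd.get_dummies. A quick sketch, assuming pandas >= 0.18.0 (the milestone where `drop_first` landed), using the same data as the example above:

```python
import pandas as pd

# Same data as the original example
df = pd.DataFrame({
    'Account': ['Account1', 'Account1', 'Account2', 'Account3', 'Account2', 'Account3'],
    'Network': ['Search', 'Display', 'Search', 'Display', 'Search', 'Search'],
    'Device':  ['Smartphone', 'Tablet', 'Smartphone', 'Smartphone', 'Tablet', 'Smartphone'],
})

# drop_first=True keeps k-1 indicator columns per encoded column,
# dropping the first level of each (Account1, Display, Smartphone here)
dummies = pd.get_dummies(df, drop_first=True)
print(sorted(dummies.columns))
# ['Account_Account2', 'Account_Account3', 'Device_Tablet', 'Network_Search']
```

Note that which level is dropped is fixed: always the first, which motivates the follow-up comment below about choosing the reference level.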
@jcress

jcress commented Nov 9, 2017

It would be advantageous to allow dropping a specific level, not just the first.

The omitted category (reference group) influences the interpretation of coefficients.

For example, one best practice is to omit the largest level as the reference category:


hot = df[['vol_k', 'activation']]

cat_vars = list(df.select_dtypes(include=['category']).columns)
for var in cat_vars:
    new = pd.get_dummies(df[var])
    hot = hot.join(new)

    # drop the most frequent level as the reference category
    drop_col = df.groupby(var).size().idxmax()
    hot = hot.drop(drop_col, axis=1)

    print(var + " dropping " + str(drop_col))
    print(df.groupby(var).size())
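One way to pick a specific reference level while still using the built-in `drop_first` (rather than manual joins and drops) is to reorder the categories so the desired reference comes first, since get_dummies emits columns in category order. A minimal sketch of that idea, assuming a categorical Series:

```python
import pandas as pd

s = pd.Series(['Search', 'Display', 'Search', 'Display', 'Search', 'Search'],
              dtype='category')

# Make the most frequent level the first category so drop_first omits it
ref = s.value_counts().idxmax()
s = s.cat.reorder_categories([ref] + [c for c in s.cat.categories if c != ref])

dummies = pd.get_dummies(s, drop_first=True)
print(list(dummies.columns))  # the reference level ('Search') is dropped
```

This keeps the one-liner API while giving control over which level anchors the regression coefficients.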
