
Pandas get_dummies() and n-1 Categorical Encoding Option to avoid Collinearity? #12042

Closed
jaradc opened this issue Jan 15, 2016 · 3 comments
Labels: Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

@jaradc

jaradc commented Jan 15, 2016

When encoding categorical variables for linear regression, perfect collinearity (the dummy-variable trap) can be a problem. The standard workaround is to use n-1 dummy columns per variable. It would be useful if pd.get_dummies() had a boolean parameter that returned n-1 columns for each categorical column it encodes.

Example:

>>> df
    Account  Network      Device
0  Account1   Search  Smartphone
1  Account1  Display      Tablet
2  Account2   Search  Smartphone
3  Account3  Display  Smartphone
4  Account2   Search      Tablet
5  Account3   Search  Smartphone
>>> pd.get_dummies(df)
   Account_Account1  Account_Account2  Account_Account3  Network_Display  \
0                 1                 0                 0                0   
1                 1                 0                 0                1   
2                 0                 1                 0                0   
3                 0                 0                 1                1   
4                 0                 1                 0                0   
5                 0                 0                 1                0   

   Network_Search  Device_Smartphone  Device_Tablet  
0               1                  1              0  
1               0                  0              1  
2               1                  1              0  
3               0                  1              0  
4               1                  0              1  
5               1                  1              0 

Instead, I'd like to have some parameter such as drop_first=True in get_dummies() and it does something like this:

>>> new_df = pd.DataFrame(index=df.index)
>>> for col in df:
...     new_df = new_df.join(pd.get_dummies(df[col]).iloc[:, 1:])


>>> new_df
   Account2  Account3  Search  Tablet
0         0         0       1       0
1         0         0       0       1
2         1         0       1       0
3         0         1       0       0
4         1         0       1       1
5         0         1       1       0

Sources
http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
http://stackoverflow.com/questions/31498390/how-to-get-pandas-get-dummies-to-emit-n-1-variables-to-avoid-co-lineraity
http://dss.princeton.edu/online_help/analysis/dummy_variables.htm

@TomAugspurger TomAugspurger added Stats Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 15, 2016
@TomAugspurger
Contributor

Sounds good, interested in submitting a pull request?

@StephenKappel
Contributor

👍

@jreback jreback added this to the 0.18.0 milestone Jan 27, 2016
@jreback jreback closed this as completed in 62363d2 Feb 8, 2016
cldy pushed a commit to cldy/pandas that referenced this issue Feb 11, 2016

ENH: GH12042 Add parameter `drop_first` to get_dummies to get k-1 variables out of n levels.

closes pandas-dev#12042    Sometimes it's useful to only accept n-1 variables
out of n categorical levels.

Author: Bran Yang <yangbo.84@gmail.com>

Closes pandas-dev#12092 from BranYang/master and squashes the following commits:

0528c57 [Bran Yang] Compare with empty DataFrame, not just check empty
0d99c2a [Bran Yang] Test the case that `drop_first` is on and categorical variable only has one level.
45f14e8 [Bran Yang] ENH: GH12042 Add parameter `drop_first` to get_dummies to get k-1 variables out of n levels.
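With that change merged, the requested behavior is available directly from pd.get_dummies. A quick sketch, assuming pandas >= 0.18.0 (the milestone where `drop_first` landed), using the same data as the example above:

```python
import pandas as pd

# Same data as the original example
df = pd.DataFrame({
    'Account': ['Account1', 'Account1', 'Account2', 'Account3', 'Account2', 'Account3'],
    'Network': ['Search', 'Display', 'Search', 'Display', 'Search', 'Search'],
    'Device':  ['Smartphone', 'Tablet', 'Smartphone', 'Smartphone', 'Tablet', 'Smartphone'],
})

# drop_first=True keeps k-1 indicator columns per encoded column,
# dropping the first level of each (Account1, Display, Smartphone here)
dummies = pd.get_dummies(df, drop_first=True)
print(sorted(dummies.columns))
# ['Account_Account2', 'Account_Account3', 'Device_Tablet', 'Network_Search']
```

Note that which level is dropped is fixed: always the first, which motivates the follow-up comment below about choosing the reference level.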
@jcress

jcress commented Nov 9, 2017

It would be advantageous to allow dropping a specific level, not just the first.

The omitted category (reference group) influences the interpretation of coefficients.

For example, one best practice is to omit the largest level as the reference category:


hot = df[['vol_k', 'activation']]

cat_vars = list(df.select_dtypes(include=['category']).columns)
for var in cat_vars:
    new = pd.get_dummies(df[var])
    hot = hot.join(new)

    # drop the most frequent level as the reference category
    drop_col = df.groupby(var).size().idxmax()
    hot = hot.drop(drop_col, axis=1)

    print(var + " dropping " + str(drop_col))
    print(df.groupby(var).size())
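One way to pick a specific reference level while still using the built-in `drop_first` (rather than manual joins and drops) is to reorder the categories so the desired reference comes first, since get_dummies emits columns in category order. A minimal sketch of that idea, assuming a categorical Series:

```python
import pandas as pd

s = pd.Series(['Search', 'Display', 'Search', 'Display', 'Search', 'Search'],
              dtype='category')

# Make the most frequent level the first category so drop_first omits it
ref = s.value_counts().idxmax()
s = s.cat.reorder_categories([ref] + [c for c in s.cat.categories if c != ref])

dummies = pd.get_dummies(s, drop_first=True)
print(list(dummies.columns))  # the reference level ('Search') is dropped
```

This keeps the one-liner API while giving control over which level anchors the regression coefficients.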
