BUG: Apply pd.get_dummies() on string type columns of pandas dataframe do nothing? #44965

Closed
3 tasks done
stevesolun opened this issue Dec 18, 2021 · 17 comments · Fixed by #45516
Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, Strings (string extension data type and string data)

Comments

@stevesolun

stevesolun commented Dec 18, 2021

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'a': [6.6, -5.2, 2.1, 3.3, 1.1],
                   'b': ['a', 'a', 'c', 'b', 'a'],
                   'c': ['kfr', 'kfr', 'lu', 'ku', 'lu'],
                   'd': ['t', 's', 's', 't', 'a']})

Columns b, c, and d all hold strings.

If I call df = df.convert_dtypes(), and then call pd.get_dummies(), nothing happens:

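import pandas as pd

df = pd.DataFrame({'a': [6.6, -5.2, 2.1, 3.3, 1.1],
                   'b': ['a', 'a', 'c', 'b', 'a'],
                   'c': ['kfr', 'kfr', 'lu', 'ku', 'lu'],
                   'd': ['t', 's', 's', 't', 'a']})
df = df.convert_dtypes()  # string columns become the "string" extension dtype
pd.get_dummies(df)        # returns the frame unchanged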

But if I change df.convert_dtypes() to df.convert_dtypes(convert_string=False), it works as expected.
Why is this happening? Is it a bug?
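
A minimal sketch of the workaround described above, on a trimmed two-column version of the reported frame. The variable names `converted`, `kept`, and `dummies` are illustrative and not from the report. The plain `convert_dtypes()` result is only printed for inspection, since the dtypes it produces and how `get_dummies` treats them may differ across pandas versions.

```python
import pandas as pd

# Trimmed version of the frame from the report (columns "a" and "b" only).
df = pd.DataFrame({
    "a": [6.6, -5.2, 2.1, 3.3, 1.1],
    "b": ["a", "a", "c", "b", "a"],
})

# Inspect what convert_dtypes() does to each column.
converted = df.convert_dtypes()
print(converted.dtypes)

# Workaround from the report: skip the string conversion,
# then one-hot encode as usual.
kept = df.convert_dtypes(convert_string=False)
dummies = pd.get_dummies(kept)
print(dummies.columns.tolist())
```

If the workaround behaves as reported, column `b` is expanded into one indicator column per distinct value while column `a` is passed through.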

Expected Behavior

pd.get_dummies() should also work on columns of string dtype.

Installed Versions

1.3.5
@stevesolun added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Dec 18, 2021
@cheruiyot-amon

When you write pd.get_dummies() without the dataframe inside the brackets, it will throw errors. The best you can do is create a variable, say "dummies":
dummies = pd.get_dummies(name_of_dataframe)

@stevesolun
Author

When you write pd.get_dummies() without the dataframe inside the brackets, it will throw errors. The best you can do is create a variable, say "dummies": dummies = pd.get_dummies(name_of_dataframe)

How does that fix the bug?

@cheruiyot-amon

cheruiyot-amon commented Dec 21, 2021 via email

@cheruiyot-amon

Just write this code:
dummies = pd.get_dummies(df)
print(dummies)
This will fix your problem.

@stevesolun
Author

I am sorry, but that is not the problem here.
The problem is that if I call convert_dtypes() on my df before get_dummies(), get_dummies() returns the same df without any change. Please read the question again and go through it step by step. Your solution doesn't solve the bug.

@cheruiyot-amon

For your case, I think it is not possible to convert strings to integers, float, or even boolean. However, you can convert an integer to float and vice versa.

@stevesolun
Author

Did you run my example?
I can work around it if I explicitly tell convert_dtypes() to skip converting strings.

@cheruiyot-amon

Yes, I did run your example. Could you please send a screenshot of the code so that I can have a look at it?

@stevesolun
Author

I am sorry, but if you were able to reproduce the issue, can you please suggest a fix? Are you one of the pandas developers?
Maybe we can set up a Zoom call?

@cheruiyot-amon

cheruiyot-amon commented Dec 22, 2021 via email

@MarcoGorelli
Member

MarcoGorelli commented Dec 22, 2021

Hi @stevesolun

I think @cheruiyot-amon's point is that your example doesn't run as written. To expedite resolution, could you please fix it up so it can be easily copied and pasted?

@stevesolun
Author

stevesolun commented Dec 22, 2021

@MarcoGorelli Sure. Done.

@asishm
Contributor

asishm commented Dec 22, 2021

To minimize this:

>>> df = pd.DataFrame({'a': ['a', 'b']})
>>> print(pd.get_dummies(df))
   a_a  a_b
0    1    0
1    0    1
>>> print(pd.get_dummies(df.convert_dtypes()))
   a
0  a
1  b

@MarcoGorelli
Member

I see. Thanks for the report, and thanks @asishm for the minimal code!

@asishm
Contributor

asishm commented Dec 22, 2021

The issue is that, when passed a DataFrame, pd.get_dummies only selects columns of dtype object or category to encode, but after convert_dtypes the string columns are of dtype string. With a Series, however, it factorizes regardless of the dtype.

dtypes_to_encode = ["object", "category"]
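
A hedged sketch of what this explanation implies, using the minimal frame from earlier in the thread. The `select_dtypes` call is only an approximation of the dtype-based selection described above, and the names `selected`, `explicit`, and `from_series` are illustrative. It assumes that passing `columns=` explicitly names the columns to encode instead of relying on dtype selection, and that a Series is encoded regardless of dtype, as noted above.

```python
import pandas as pd

df = pd.DataFrame({"a": ["a", "b"]}).convert_dtypes()

# Approximation of the dtype-based selection, using the list quoted above.
# On the pandas version in the report this is expected to select nothing,
# because the column is no longer object or category.
selected = df.select_dtypes(include=["object", "category"])
print(list(selected.columns))

# Naming the columns explicitly avoids relying on dtype-based selection.
explicit = pd.get_dummies(df, columns=["a"])
print(explicit.columns.tolist())

# Passing a Series encodes it regardless of its dtype.
from_series = pd.get_dummies(df["a"])
print(from_series.columns.tolist())
```

Either form can serve as a workaround when the string conversion from `convert_dtypes()` needs to be kept.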

@lithomas1 added the Algos and Strings labels and removed the Needs Triage label on Dec 23, 2021
@lithomas1 added this to the Contributions Welcome milestone on Dec 23, 2021
@stevesolun
Author

@asishm @lithomas1 Thanks a lot!
Is it a bug or by design? Does this mean the dev team will change the behavior of get_dummies to support strings as well?

@lukemanley
Member

take
