Last DataFrame column missing when using get_dummies successively #17542

@aoussou

Hello,

I am trying to create dummy columns for the Titanic Kaggle problem data set. The test.csv data set in the code below can be obtained here.

I am applying get_dummies successively on columns Embarked and Pclass:

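import pandas as pd

# Read the header row first, then reload the data with explicit column names.
column_names = pd.read_csv("test.csv", nrows=1).columns
df = pd.read_csv("test.csv", skipinitialspace=True,
                 skiprows=1, names=column_names, na_values=-1)

# Each result is truncated with .head() and assigned back to df.
df = pd.get_dummies(df, columns=["Embarked"]).head()
df = pd.get_dummies(df, columns=["Pclass"]).head()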

The initial columns, as obtained using df.columns, are:

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

After running the above code the columns are now:

Index(['PassengerId', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Pclass_2',
       'Pclass_3'],
      dtype='object')

Since drop_first is not set to True, there should be a column named Pclass_1.

In fact, we should get the same result as when calling get_dummies on both columns at once:

df = pd.get_dummies(df, columns=["Embarked","Pclass"]).head()

which outputs:

Index(['PassengerId', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Pclass_1',
       'Pclass_2', 'Pclass_3'],
      dtype='object')
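
One possible explanation worth checking: get_dummies only creates columns for values that actually occur in the data it receives (for non-categorical columns), and in the successive version the result of .head() is assigned back to df before the second call. If none of the first five rows has Pclass equal to 1, Pclass_1 would never be created. The sketch below reproduces this with a small made-up DataFrame; the data is a hypothetical stand-in for test.csv, not the real file.

```python
import pandas as pd

# Hypothetical stand-in for test.csv: Pclass 1 and Embarked "C"
# first appear in the sixth row, outside the range kept by .head().
df = pd.DataFrame({
    "Pclass": [3, 3, 2, 3, 3, 1, 2],
    "Embarked": ["Q", "S", "Q", "S", "S", "C", "S"],
})

# Successive calls, each truncated with .head(), as in the report.
step1 = pd.get_dummies(df, columns=["Embarked"]).head()  # only 5 rows remain
successive = pd.get_dummies(step1, columns=["Pclass"]).head()

# Single call, with .head() applied only to the final result.
at_once = pd.get_dummies(df, columns=["Embarked", "Pclass"]).head()

# Successive calls without the intermediate .head().
no_trunc = pd.get_dummies(
    pd.get_dummies(df, columns=["Embarked"]), columns=["Pclass"]
)

print(list(successive.columns))
print(list(at_once.columns))
print(list(no_trunc.columns))
```

If this is what happens with the real data, dropping the intermediate .head() (or not assigning it back to df) should make the successive calls produce the same columns as the single call.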

This is a problem because:

  1. it's not possible to drop one categorical level for some columns but not for others
  2. one might not notice this behaviour if there are many columns in the original data set
  3. if one writes a pre-processing function using the above code, the DataFrame structure could differ depending on which column gets truncated.
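
Regarding the third point, a possible way to keep the output structure stable is to convert the column to a categorical dtype with an explicit list of categories before calling get_dummies, since categories that are absent from the data still get a column. The following is a minimal sketch under that assumption; the category list [1, 2, 3] and the sample data are chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample in which Pclass 1 never occurs.
df = pd.DataFrame({"Pclass": [3, 3, 2, 3, 3]})

# Declare every expected level up front (assumed here to be 1, 2 and 3).
df["Pclass"] = pd.Categorical(df["Pclass"], categories=[1, 2, 3])

dummies = pd.get_dummies(df, columns=["Pclass"])

# A Pclass_1 column is present even though no row has that value;
# it simply contains no positive entries.
print(list(dummies.columns))
```

With this approach the set of dummy columns depends on the declared categories rather than on which rows happen to be present.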

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-33-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
