Description
Hello,
I am trying to create dummy columns for the Titanic Kaggle problem data set. The test.csv data set in the code below can be obtained here.
I am applying get_dummies successively on columns Embarked and Pclass:
import pandas as pd

column_names = pd.read_csv("test.csv", nrows=1).columns
df = pd.read_csv("test.csv", skipinitialspace=True,
                 skiprows=1, names=column_names, na_values=-1)
df = pd.get_dummies(df, columns=["Embarked"]).head()
df = pd.get_dummies(df, columns=["Pclass"]).head()
The initial columns, as obtained using df.columns, are:
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
After running the above code the columns are now:
Index(['PassengerId', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
'Cabin', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Pclass_2',
'Pclass_3'],
dtype='object')
Since drop_first is not set to True, there should be a column for Pclass_1.
In fact, we should get the same result as when using get_dummies on both columns at once:
df = pd.get_dummies(df, columns=["Embarked","Pclass"]).head()
which outputs:
Index(['PassengerId', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
'Cabin', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Pclass_1',
'Pclass_2', 'Pclass_3'],
dtype='object')
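For anyone trying to reproduce this without test.csv, here is a minimal sketch on a small, hypothetical frame (the values are made up, chosen so that Pclass 1 first appears after the fifth row). It compares the successive calls with and without the intermediate .head(), and the single combined call. One thing it highlights: .head() keeps only the first five rows, so a level absent from those rows gets no dummy column in the second call. Whether that fully accounts for the missing Pclass_1 above is an assumption, not something confirmed in this report.

```python
import pandas as pd

# Hypothetical stand-in for test.csv: Pclass 1 first appears in row 6.
df = pd.DataFrame({
    "PassengerId": range(1, 9),
    "Pclass": [3, 3, 2, 3, 3, 1, 2, 1],
    "Embarked": ["Q", "S", "Q", "S", "S", "C", "S", "C"],
})

# Successive calls with .head() in between, as in the code above.
step = pd.get_dummies(df, columns=["Embarked"]).head()
with_head = pd.get_dummies(step, columns=["Pclass"])

# Successive calls on the full frame (no truncation in between).
step_full = pd.get_dummies(df, columns=["Embarked"])
successive = pd.get_dummies(step_full, columns=["Pclass"])

# Both columns encoded in a single call.
combined = pd.get_dummies(df, columns=["Embarked", "Pclass"])

print(list(with_head.columns))
print(list(successive.columns))
print(list(combined.columns))
```

On this toy frame the truncated variant lacks Pclass_1, while the untruncated successive calls and the combined call yield the same columns.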
This is a problem because:
- it's not possible to drop one categorical level for some columns but not for others
- one might not notice this behaviour if there are many columns in the original data set
- if one writes a pre-processing function using the above code, the DataFrame structure could differ depending on which columns get truncated.
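Regarding the last point, one possible way to keep the output structure stable is to declare the levels up front with pd.Categorical, since get_dummies emits a column for every declared category even when a level is absent from the rows at hand. The sketch below is only an illustration: the function name add_dummies and the LEVELS mapping are hypothetical, and the level lists are assumed to be known in advance.

```python
import pandas as pd

# Hypothetical level lists; in practice they would come from the full data set.
LEVELS = {"Embarked": ["C", "Q", "S"], "Pclass": [1, 2, 3]}


def add_dummies(frame, levels=LEVELS):
    """Encode each listed column using a fixed set of categories."""
    out = frame.copy()
    for col, cats in levels.items():
        # Values not listed in cats become NaN and get all-zero dummies.
        out[col] = pd.Categorical(out[col], categories=cats)
    return pd.get_dummies(out, columns=list(levels))


# A sample that contains neither Pclass 1 nor Embarked C.
sample = pd.DataFrame({"Pclass": [3, 3, 2], "Embarked": ["Q", "S", "Q"]})
encoded = add_dummies(sample)
print(list(encoded.columns))
```

With the categories fixed this way, the resulting columns do not depend on which rows happen to be present in the input.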
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None