Hello there,

I noticed that some of the boolean features (X10 to X385) are constant within the classes of certain categorical variables, in particular X0, X1, X2 and X5. The effect is strongest for X2 and X0, with 65 and 19 features involved respectively. These columns vary across the dataset as a whole, but within each class of the aforementioned categorical variables they stay fixed at either 0 or 1.
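
To make "constant within the classes" concrete, here is a minimal sketch on made-up toy data (the column names `cat`, `flag` and `free` are hypothetical and not part of the competition dataset). It counts the distinct values each boolean column takes inside each class, which is an alternative to the std-based check used in the notebook code:

```python
import pandas as pd

# Toy data, made up for illustration only (not the competition dataset).
toy = pd.DataFrame({
    'cat':  ['a', 'a', 'b', 'b', 'c', 'c'],
    'flag': [1, 1, 0, 0, 1, 1],   # varies overall, but constant inside each class of `cat`
    'free': [0, 1, 0, 0, 1, 1],   # takes both values inside class 'a'
})

# Number of distinct values each boolean column takes inside each class.
distinct_per_class = toy.groupby('cat')[['flag', 'free']].nunique()

# A column is constant within the classes when that count never exceeds 1.
is_constant = distinct_per_class.max() == 1
print(is_constant)
```

In this toy case `flag` qualifies and `free` does not, because `free` takes both 0 and 1 inside class `a`.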

What does it mean? Could it be that some of these categorical variables are actually encoding some of the boolean features? 

Here is the code that led to my analysis. Please let me know if there is something wrong with it :D and please upvote this notebook if you liked it!

Cheers!

In [None]:
# load modules
import pandas as pd
import numpy as np

In [None]:
# load training data (we could also include the test data in our analysis, it won't change much)
train_df  = pd.read_csv('../input/train.csv')

# remove ID, y and constant columns 
df = train_df.drop(['ID','y'], axis = 1)
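# keep only the columns that differ from the first row somewhere (.ix was removed from pandas, so use .iloc)
df = df.loc[:, (df != df.iloc[0]).any()]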

In [None]:
# now let's loop over the categorical variables
categorical = ['X0','X1','X2','X3','X4','X5','X6','X8']
# keep only the numeric (boolean) columns: std() is not defined for the categorical ones
numeric_df = df.select_dtypes('number')
for cat in categorical:
    # per-class std of every boolean column; a column is constant within the classes of `cat`
    # when all of its per-class stds are 0 (single-row classes give NaN, which mean() skips)
    temp = (numeric_df.groupby(df[cat]).std().mean() == 0)
    constant_cols = temp[temp].index.tolist()
    print('{1} constant columns across {0}\n'.format(cat, len(constant_cols)))
    print(constant_cols)
    print('********************************')

In [None]:
# let's see for instance the columns which are constant across X0 (taken from above)
const_cols_across_X0 = ['X29', 'X54', 'X76', 'X118', 'X119', 'X136', 'X186', 
                         'X187', 'X194', 'X231', 'X232', 'X236', 'X263', 'X277', 
                         'X279', 'X313', 'X314', 'X315', 'X316']
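# select the boolean columns before averaging, so mean() is never applied to the categorical columns
df.groupby('X0')[const_cols_across_X0].mean()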

See? The mean values are either 0 or 1!
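
If a categorical variable really does encode some boolean features, those features should be recoverable from the class label alone. The following is a rough sketch of that test on made-up toy data (the values are invented; only the column names mimic the dataset): it checks that every group mean is exactly 0 or 1, turns the group means into a lookup table, and rebuilds the boolean columns from `X0`.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'X0':  ['a', 'a', 'b', 'b', 'c'],
    'X29': [1, 1, 0, 0, 1],
    'X54': [0, 0, 0, 0, 1],
})
bool_cols = ['X29', 'X54']

# Per-class means: exactly 0 or 1 when a column is constant inside every class.
group_means = toy.groupby('X0')[bool_cols].mean()
all_binary = bool(group_means.isin([0, 1]).all().all())

# Use the means as a lookup table from class label to feature value.
lookup = group_means.astype('int64')

# Rebuild the boolean columns using nothing but the X0 label of each row.
reconstructed = lookup.loc[toy['X0']].reset_index(drop=True)
matches = bool((reconstructed.values == toy[bool_cols].values).all())

print('all group means are 0 or 1:', all_binary)
print('boolean columns recovered from X0 alone:', matches)
```

When both checks pass, the selected boolean columns carry no information beyond what the categorical label already provides, at least on the rows examined.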

### Let's consider also the test dataset (as kindly suggested by Mike)

In [None]:
# load test data
test_df  = pd.read_csv('../input/test.csv')

# remove ID, y, combine datasets and remove constant columns 
df = pd.concat([train_df.drop(['ID','y'], axis = 1),test_df.drop(['ID'], axis = 1)]).reset_index(drop = True)
df = df.loc[:, (df != df.iloc[0]).any()]

In [None]:
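# now let's loop over the categorical variables again, this time on train + test
categorical = ['X0','X1','X2','X3','X4','X5','X6','X8']
# keep only the numeric (boolean) columns: std() is not defined for the categorical ones
numeric_df = df.select_dtypes('number')
for cat in categorical:
    # per-class std of every boolean column; a column is constant within the classes of `cat`
    # when all of its per-class stds are 0 (single-row classes give NaN, which mean() skips)
    temp = (numeric_df.groupby(df[cat]).std().mean() == 0)
    constant_cols = temp[temp].index.tolist()
    print('{1} constant columns across {0}\n'.format(cat, len(constant_cols)))
    print(constant_cols)
    print('********************************')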

Things don't seem to change that much after all (?)
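
Rather than comparing the two printouts by eye, one option is to collect the constant columns as sets and diff them. This is only a sketch on made-up toy frames: `constant_within_classes` is a hypothetical helper, not something from the notebook, and it uses a distinct-value count instead of the std-based check, so unlike the code above it does not first drop globally constant columns.

```python
import pandas as pd

def constant_within_classes(frame, cat):
    """Return the set of numeric columns taking a single value inside every class of `cat`."""
    numeric = frame.select_dtypes('number')
    distinct = numeric.groupby(frame[cat]).nunique()
    mask = distinct.max() == 1
    return set(mask[mask].index)

# Toy frames, made up for illustration only.
train_toy = pd.DataFrame({
    'X0':  ['a', 'a', 'b', 'b'],
    'X10': [1, 1, 0, 0],
    'X11': [0, 0, 1, 1],
})
test_toy = pd.DataFrame({
    'X0':  ['a', 'b'],
    'X10': [1, 0],
    'X11': [1, 1],   # breaks the pattern for class 'a'
})
combined_toy = pd.concat([train_toy, test_toy]).reset_index(drop=True)

before = constant_within_classes(train_toy, 'X0')
after = constant_within_classes(combined_toy, 'X0')
print('lost after adding test rows:', sorted(before - after))
print('gained after adding test rows:', sorted(after - before))
```

Applied to the real train and train + test frames, the two set differences would show exactly which columns stop (or start) being constant once the test rows are included.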