### Kaggle Predicting Red Hat Business Value

As this my first kernel, I would very much appreciate some feedback.  While i'm confident that most of my process, particularly the initial clearning and transformation from cateogrical to numeric are correct, i'm more unsure about the steps I took in re: to train, test, split, etc.  

Thank you to everyone who contributes with tips and kernels of their own!

In [1]:
# importing libraries

import numpy as np
import pandas as pd
from IPython.display import display, HTML

from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier



In [2]:
# reading in data

people = pd.read_csv('../input/people.csv')
activity_train = pd.read_csv('../input/act_train.csv')
activity_test = pd.read_csv('../input/act_test.csv')

In [3]:
# merging the dataframes into train, test

df = activity_train.merge(people, how='left', on='people_id' )
df_test = activity_test.merge(people, how='left', on='people_id' )

In [4]:
# the shape of the dataframes

print (df.shape)
print (df_test.shape)

(2197291, 55)
(498687, 54)


In [5]:
# filling NaN values first

df = df.fillna('0', axis=0)
df_test = df_test.fillna('0', axis=0)

In [6]:
# taking a look at the first few rows

df.head()

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
0,ppl_100,act2_1734928,2023-08-26,type 4,0,0,0,0,0,0,...,False,True,True,False,False,True,True,True,False,36
1,ppl_100,act2_2434093,2022-09-27,type 2,0,0,0,0,0,0,...,False,True,True,False,False,True,True,True,False,36
2,ppl_100,act2_3404049,2022-09-27,type 2,0,0,0,0,0,0,...,False,True,True,False,False,True,True,True,False,36
3,ppl_100,act2_3651215,2023-08-04,type 2,0,0,0,0,0,0,...,False,True,True,False,False,True,True,True,False,36
4,ppl_100,act2_4109017,2023-08-26,type 2,0,0,0,0,0,0,...,False,True,True,False,False,True,True,True,False,36


In [7]:
df_test.head()

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
0,ppl_100004,act1_249281,2022-07-20,type 1,type 5,type 10,type 5,type 1,type 6,type 1,...,True,True,True,True,True,True,True,True,True,76
1,ppl_100004,act2_230855,2022-07-20,type 5,0,0,0,0,0,0,...,True,True,True,True,True,True,True,True,True,76
2,ppl_10001,act1_240724,2022-10-14,type 1,type 12,type 1,type 5,type 4,type 6,type 1,...,False,True,True,True,True,True,True,True,True,90
3,ppl_10001,act1_83552,2022-11-27,type 1,type 20,type 10,type 5,type 4,type 6,type 1,...,False,True,True,True,True,True,True,True,True,90
4,ppl_10001,act2_1043301,2022-10-15,type 5,0,0,0,0,0,0,...,False,True,True,True,True,True,True,True,True,90


### preprocessing

In [8]:
# a multi-column LabelEncoder()

# this solution for applying LabelEncoder() across multiple columns was suggested in the following thread
# http://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn

# I like this solution but is it the most efficient?  Would another method be more practical, particularly if 
# applied to different type of model 

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns 

    def fit(self,X,y=None):
        return self

    def transform(self,X):
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

In [9]:
# defining a processor 

def processor(data):
    data = MultiColumnLabelEncoder(columns = ['people_id','activity_id', 'activity_category', 'date_x', 'char_1_x', 'char_2_x',
                                        'char_3_x', 'char_4_x', 'char_5_x', 'char_6_x', 'char_7_x', 'char_8_x', 'char_9_x',
                                        'char_10_x', 'char_1_y', 'group_1', 'char_2_y', 'date_y', 'char_3_y', 'char_4_y',
                                        'char_5_y', 'char_6_y', 'char_7_y', 'char_8_y', 'char_9_y']).fit_transform(df)
    
    bool_map = {True:1, False:0}

    data = data.applymap(lambda x: bool_map.get(x,x))
    
    return data

In [10]:
# applying processor to training data

df_encoded = processor(df)

In [11]:
df_encoded.head()

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
0,0,503691,405,3,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
1,0,832759,72,1,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
2,0,1289703,72,1,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
3,0,1406406,383,1,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
4,0,1623050,405,1,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36


In [12]:
df_encoded.dtypes

people_id            int64
activity_id          int64
date_x               int64
activity_category    int64
char_1_x             int64
char_2_x             int64
char_3_x             int64
char_4_x             int64
char_5_x             int64
char_6_x             int64
char_7_x             int64
char_8_x             int64
char_9_x             int64
char_10_x            int64
outcome              int64
char_1_y             int64
group_1              int64
char_2_y             int64
date_y               int64
char_3_y             int64
char_4_y             int64
char_5_y             int64
char_6_y             int64
char_7_y             int64
char_8_y             int64
char_9_y             int64
char_10_y            int64
char_11              int64
char_12              int64
char_13              int64
char_14              int64
char_15              int64
char_16              int64
char_17              int64
char_18              int64
char_19              int64
char_20              int64
c

In [13]:
# applying processor to test data

df_test_encoded = processor(df_test)

In [14]:
df_test_encoded.head()

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
0,0,503691,405,3,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
1,0,832759,72,1,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
2,0,1289703,72,1,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
3,0,1406406,383,1,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
4,0,1623050,405,1,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36


In [15]:
df_test_encoded.dtypes

people_id            int64
activity_id          int64
date_x               int64
activity_category    int64
char_1_x             int64
char_2_x             int64
char_3_x             int64
char_4_x             int64
char_5_x             int64
char_6_x             int64
char_7_x             int64
char_8_x             int64
char_9_x             int64
char_10_x            int64
outcome              int64
char_1_y             int64
group_1              int64
char_2_y             int64
date_y               int64
char_3_y             int64
char_4_y             int64
char_5_y             int64
char_6_y             int64
char_7_y             int64
char_8_y             int64
char_9_y             int64
char_10_y            int64
char_11              int64
char_12              int64
char_13              int64
char_14              int64
char_15              int64
char_16              int64
char_17              int64
char_18              int64
char_19              int64
char_20              int64
c

## modeling


In [16]:
# defining X and y (features and target label)

X = df_encoded
y = X.pop('outcome')

In [17]:
# shape of X and y

print (X.shape)
print (y.shape)

(2197291, 54)
(2197291,)


In [18]:
'''

train, test, split the data.  hold out 25% for test

generally if not provided a test set, this would be the way to move forward.
yet I feel something is off in my process and would love feedback!

'''

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

In [19]:
# random forest classifier

model = RandomForestClassifier(77, n_jobs=-1, random_state=7)
model.fit(X_train, y_train)
print ("model score ", model.score(X_test, y_test))

model score  0.992874865971


In [20]:
# predicting test data

pred = model.predict(X_test)
pred

array([0, 0, 0, ..., 0, 0, 0])

### Final thoughts

While the X_test a split of our traning data, as achieved through train, test, split, it is not the same as the actual test set provided in the data by kaggle.  For submission, this would be an issue.  For me, this competition is unique in although fairly straightforward, I haven't had to label encode categoricals on a provided test dataset.  Generally these are untouched until model prediction time.

Hopefully some other less experienced users such as myself can make some use of my inital sets in this process.