## Kaggle Competition (since Aug. 7, 2016)

### Predicting Red Hat Business Value

Like most companies, Red Hat is able to gather a great deal of information over time about the behavior of individuals who interact with them. They’re in search of better methods of using this behavioral data to predict which individuals they should approach—and even when and how to approach them.

In this competition, Kagglers are challenged to <b>create a classification algorithm</b> that accurately identifies which <u>customers have the most potential business value</u> for Red Hat based on their characteristics and activities.

With an improved prediction model in place, Red Hat will be able to more efficiently prioritize resources to generate more business and better serve their customers.

### Data Samples
This competition uses two separate data files that may be joined together to create a single, unified data table: a people file and an activity file.

The **people file** contains all of the **unique people** (and the corresponding characteristics) that have performed activities over time. Each row in the people file represents a unique person. Each person has a unique **people_id**.

The **activity file** contains *all of the unique activities* (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents **a unique activity** performed by a person on **a certain date**. Each activity has **a unique activity_id**.

The challenge of this competition is **to predict the potential business value of a person who has performed a specific activity**. *The business value outcome is defined by a yes/no field attached to each unique activity in the activity file*. The outcome field indicates whether or not each person has completed the outcome within a fixed window of time after each unique activity was performed.

The activity file contains *several different categories of activities*. Type 1 activities are different from type 2-7 activities because there are more known characteristics associated with type 1 activities (nine in total) than type 2-7 activities (which have only one associated characteristic).

To develop a predictive model with this data, you will likely **need to join the files together into a single data set**. The two files can be joined together using people_id as the common key. All variables are categorical, with the exception of 'char_38' in the people file, which is a continuous numerical variable.
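
As a minimal sketch of that join, the toy frames below (hypothetical values, not taken from the competition files) show how pandas handles a column name that exists in both tables: the people-side column gets the `_x` suffix and the activity-side column gets `_y`.

```python
import pandas as pd

# Hypothetical miniature versions of the people and activity files.
people_toy = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2"],
    "char_1": ["type 2", "type 1"],
    "char_38": [36, 95],
})
activity_toy = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_1", "ppl_2"],
    "activity_id": ["act2_1", "act2_2", "act1_3"],
    "char_1": [None, None, "type 5"],
    "outcome": [0, 0, 1],
})

# Inner join on the shared key; overlapping column names receive suffixes.
joined = pd.merge(people_toy, activity_toy, on="people_id", how="inner")
print(joined.columns.tolist())
```

Each activity row picks up the characteristics of the person who performed it, so the joined table has one row per activity.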

In short: the target is the yes/no outcome field attached to each unique activity, which indicates whether or not the person completed the outcome within a fixed window of time after that activity.



In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

In [2]:
# Load data files : people and an activity
import pandas as pd

people = pd.read_csv("./data/people.csv")
activity = pd.read_csv("./data/act_train.csv")

In [3]:
# Inner join on people_id; overlapping columns get '_x' (people) and '_y' (activity).
combined = pd.merge(people, activity, how='inner', on='people_id',
                    sort=True, suffixes=('_x', '_y'))

In [4]:
combined.head(3)

Unnamed: 0,people_id,char_1_x,group_1,char_2_x,date_x,char_3_x,char_4_x,char_5_x,char_6_x,char_7_x,...,char_2_y,char_3_y,char_4_y,char_5_y,char_6_y,char_7_y,char_8_y,char_9_y,char_10_y,outcome
0,ppl_100,type 2,group 17304,type 2,2021-06-29,type 5,type 5,type 5,type 3,type 11,...,,,,,,,,,type 76,0
1,ppl_100,type 2,group 17304,type 2,2021-06-29,type 5,type 5,type 5,type 3,type 11,...,,,,,,,,,type 1,0
2,ppl_100,type 2,group 17304,type 2,2021-06-29,type 5,type 5,type 5,type 3,type 11,...,,,,,,,,,type 1,0


In [5]:
combined = combined.fillna('type 0')

In [6]:
combined.columns

Index([u'people_id', u'char_1_x', u'group_1', u'char_2_x', u'date_x',
       u'char_3_x', u'char_4_x', u'char_5_x', u'char_6_x', u'char_7_x',
       u'char_8_x', u'char_9_x', u'char_10_x', u'char_11', u'char_12',
       u'char_13', u'char_14', u'char_15', u'char_16', u'char_17', u'char_18',
       u'char_19', u'char_20', u'char_21', u'char_22', u'char_23', u'char_24',
       u'char_25', u'char_26', u'char_27', u'char_28', u'char_29', u'char_30',
       u'char_31', u'char_32', u'char_33', u'char_34', u'char_35', u'char_36',
       u'char_37', u'char_38', u'activity_id', u'date_y', u'activity_category',
       u'char_1_y', u'char_2_y', u'char_3_y', u'char_4_y', u'char_5_y',
       u'char_6_y', u'char_7_y', u'char_8_y', u'char_9_y', u'char_10_y',
       u'outcome'],
      dtype='object')

In [10]:
colnames = [u'people_id', u'group_1', u'date_x', u'date_y', u'activity_id', u'char_1_x', u'char_2_x',
       u'char_3_x', u'char_4_x', u'char_5_x', u'char_6_x', u'char_7_x',
       u'char_8_x', u'char_9_x', u'char_10_x', u'char_11', u'char_12',
       u'char_13', u'char_14', u'char_15', u'char_16', u'char_17', u'char_18',
       u'char_19', u'char_20', u'char_21', u'char_22', u'char_23', u'char_24',
       u'char_25', u'char_26', u'char_27', u'char_28', u'char_29', u'char_30',
       u'char_31', u'char_32', u'char_33', u'char_34', u'char_35', u'char_36',
       u'char_37', u'char_38', u'activity_category',
       u'char_1_y', u'char_2_y', u'char_3_y', u'char_4_y', u'char_5_y',
       u'char_6_y', u'char_7_y', u'char_8_y', u'char_9_y', u'char_10_y',
       u'outcome']

In [11]:
combined = combined[colnames]

In [12]:
combined = combined.sort_values(['people_id', 'date_y'], ascending=[1, 1])

In [None]:
#frame['a'].corr(frame['b'], method='spearman')

In [16]:
# Tree-based feature selection (scikit-learn user guide, section 1.13.4.3)
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

clf = ExtraTreesClassifier()




In [17]:
# subset for classifier
df = combined[[u'activity_id',
       u'char_1_x', u'char_2_x', u'char_3_x', u'char_4_x', u'char_5_x',
       u'char_6_x', u'char_7_x', u'char_8_x', u'char_9_x', u'char_10_x',
       u'char_11', u'char_12', u'char_13', u'char_14', u'char_15', u'char_16',
       u'char_17', u'char_18', u'char_19', u'char_20', u'char_21', u'char_22',
       u'char_23', u'char_24', u'char_25', u'char_26', u'char_27', u'char_28',
       u'char_29', u'char_30', u'char_31', u'char_32', u'char_33', u'char_34',
       u'char_35', u'char_36', u'char_37', u'char_38', u'activity_category',
       u'char_1_y', u'char_2_y', u'char_3_y', u'char_4_y', u'char_5_y',
       u'char_6_y', u'char_7_y', u'char_8_y', u'char_9_y', u'char_10_y','outcome']]

In [18]:
df = df.set_index('activity_id')

In [21]:
df.head(3)

activity_id,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,char_7_x,char_8_x,char_9_x,char_10_x,...,char_2_y,char_3_y,char_4_y,char_5_y,char_6_y,char_7_y,char_8_y,char_9_y,char_10_y,outcome
act2_2434093,type 2,type 2,type 5,type 5,type 5,type 3,type 11,type 2,type 2,True,...,type 0,type 0,type 0,type 0,type 0,type 0,type 0,type 0,type 1,0
act2_3404049,type 2,type 2,type 5,type 5,type 5,type 3,type 11,type 2,type 2,True,...,type 0,type 0,type 0,type 0,type 0,type 0,type 0,type 0,type 1,0
act2_3651215,type 2,type 2,type 5,type 5,type 5,type 3,type 11,type 2,type 2,True,...,type 0,type 0,type 0,type 0,type 0,type 0,type 0,type 0,type 1,0


In [19]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

In [23]:
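def label_encoding(df):
    # Work on a copy so the caller's frame (often a slice) is not modified in place.
    df = df.copy()
    for col in df.columns:
        if df[col].dtype == 'object':
            # Note: a fresh encoder is fit per column and per call, so the integer
            # codes are only comparable between rows encoded in the same call.
            df[col] = LabelEncoder().fit_transform(df[col])
    return df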

In [24]:
df = label_encoding(df)
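
Because `label_encoding` fits a new `LabelEncoder` on every call, calling it once on the training frame and once on the test frame can map the same category to different integers. One possible way around this, sketched below on toy data, is to fit each encoder on the train and test values combined and then apply it to both frames. The helper names `fit_label_encoders` and `apply_label_encoders` are hypothetical and are not part of this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def fit_label_encoders(train, test):
    """Fit one LabelEncoder per non-numeric column on train and test values combined."""
    encoders = {}
    for col in train.columns:
        if col in test.columns and not pd.api.types.is_numeric_dtype(train[col]):
            enc = LabelEncoder()
            enc.fit(pd.concat([train[col], test[col]]).astype(str))
            encoders[col] = enc
    return encoders


def apply_label_encoders(frame, encoders):
    """Return a copy of frame with the fitted encoders applied."""
    out = frame.copy()
    for col, enc in encoders.items():
        out[col] = enc.transform(out[col].astype(str))
    return out


# Toy frames: 'type 5' must receive the same code in both.
train_toy = pd.DataFrame({"char_1": ["type 2", "type 5", "type 2"]})
test_toy = pd.DataFrame({"char_1": ["type 5", "type 9"]})

encoders = fit_label_encoders(train_toy, test_toy)
train_enc = apply_label_encoders(train_toy, encoders)
test_enc = apply_label_encoders(test_toy, encoders)
print(train_enc["char_1"].tolist(), test_enc["char_1"].tolist())
```

Fitting on the combined values only looks at the category labels of the test set, never at its outcomes, but whether that is acceptable depends on the competition rules.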

In [25]:
df.head(3)

activity_id,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,char_7_x,char_8_x,char_9_x,char_10_x,...,char_2_y,char_3_y,char_4_y,char_5_y,char_6_y,char_7_y,char_8_y,char_9_y,char_10_y,outcome
act2_2434093,1,1,38,20,4,2,2,1,1,True,...,0,0,0,0,0,0,0,0,1,0
act2_3404049,1,1,38,20,4,2,2,1,1,True,...,0,0,0,0,0,0,0,0,1,0
act2_3651215,1,1,38,20,4,2,2,1,1,True,...,0,0,0,0,0,0,0,0,1,0


In [26]:
X_train = df.iloc[:, :49]   # the 49 feature columns
y_train = df.iloc[:, 49]    # 'outcome' as a 1-D Series, the shape scikit-learn expects

In [46]:
X_train.shape

(2197291, 49)

In [27]:
clf = clf.fit(X_train, y_train)



In [28]:
clf.feature_importances_  

array([ 0.02213771,  0.1390756 ,  0.031476  ,  0.03238435,  0.02954832,
        0.04772588,  0.05281376,  0.03076296,  0.02499934,  0.00811377,
        0.00591582,  0.00521429,  0.03875568,  0.0085889 ,  0.00499068,
        0.00470058,  0.00478325,  0.00619121,  0.00335824,  0.00479402,
        0.00290265,  0.00453703,  0.00537238,  0.00463775,  0.0315355 ,
        0.00462386,  0.00665525,  0.00304273,  0.00499287,  0.00452099,
        0.00662452,  0.01295464,  0.00451513,  0.03123183,  0.00607981,
        0.01622538,  0.01182498,  0.26853896,  0.01644082,  0.00152432,
        0.00157687,  0.00155558,  0.00108644,  0.00125954,  0.00131785,
        0.00140051,  0.0014261 ,  0.00172259,  0.03354279])

In [29]:
model = SelectFromModel(clf, prefit=True)

In [30]:
model

SelectFromModel(estimator=ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
        prefit=True, threshold=None)

In [32]:
X_new = model.transform(X_train)

In [33]:
X_new.shape

(2197291, 14)
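
The drop from 49 to 14 columns can be reproduced from the importances printed above. With `threshold=None` and a tree-based estimator, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below copies the printed values and assumes the feature order is the column order of `X_train` (person characteristics, then `activity_category`, then activity characteristics).

```python
import numpy as np

# Values copied from the clf.feature_importances_ output above.
importances = np.array([
    0.02213771, 0.1390756, 0.031476, 0.03238435, 0.02954832,
    0.04772588, 0.05281376, 0.03076296, 0.02499934, 0.00811377,
    0.00591582, 0.00521429, 0.03875568, 0.0085889, 0.00499068,
    0.00470058, 0.00478325, 0.00619121, 0.00335824, 0.00479402,
    0.00290265, 0.00453703, 0.00537238, 0.00463775, 0.0315355,
    0.00462386, 0.00665525, 0.00304273, 0.00499287, 0.00452099,
    0.00662452, 0.01295464, 0.00451513, 0.03123183, 0.00607981,
    0.01622538, 0.01182498, 0.26853896, 0.01644082, 0.00152432,
    0.00157687, 0.00155558, 0.00108644, 0.00125954, 0.00131785,
    0.00140051, 0.0014261, 0.00172259, 0.03354279,
])

# Assumed column order of X_train (see the column subset defined earlier).
feature_names = (
    ["char_%d_x" % i for i in range(1, 11)]
    + ["char_%d" % i for i in range(11, 39)]
    + ["activity_category"]
    + ["char_%d_y" % i for i in range(1, 11)]
)

threshold = importances.mean()  # roughly 1/49, since importances sum to about 1
selected = [n for n, imp in zip(feature_names, importances) if imp >= threshold]
print(len(selected))
print(selected)
```

Under that ordering the two largest importances belong to `char_38` and `char_2_x`. Since the extra-trees model was fit without a fixed `random_state`, a rerun may select a slightly different set.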

In [34]:
X_new

array([[  1.00000000e+00,   1.00000000e+00,   3.80000000e+01, ...,
          1.00000000e+00,   3.60000000e+01,   1.00000000e+00],
       [  1.00000000e+00,   1.00000000e+00,   3.80000000e+01, ...,
          1.00000000e+00,   3.60000000e+01,   1.00000000e+00],
       [  1.00000000e+00,   1.00000000e+00,   3.80000000e+01, ...,
          1.00000000e+00,   3.60000000e+01,   1.00000000e+00],
       ..., 
       [  1.00000000e+00,   2.00000000e+00,   1.10000000e+01, ...,
          1.00000000e+00,   9.50000000e+01,   4.23000000e+02],
       [  1.00000000e+00,   2.00000000e+00,   1.10000000e+01, ...,
          1.00000000e+00,   9.50000000e+01,   1.00000000e+00],
       [  1.00000000e+00,   2.00000000e+00,   1.10000000e+01, ...,
          1.00000000e+00,   9.50000000e+01,   5.38300000e+03]])

In [36]:
from sklearn.ensemble import RandomForestClassifier

# rclf = Pipeline([
#   ('feature_selection', SelectFromModel(LinearSVC(penalty="l1"))),
#   ('classification', RandomForestClassifier())
# ])

rclf = RandomForestClassifier()
rclf = rclf.fit(X_new, y_train)



In [37]:
rclf

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [30]:
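from IPython import display
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from sklearn import tree, utils
import pydot

In [31]:
# Generate a plot of one decision tree from the ensemble.
# export_graphviz accepts a single fitted tree, not the whole ExtraTreesClassifier.
dot_data = StringIO()
tree.export_graphviz(clf.estimators_[0], out_file=dot_data)

In [32]:
# Depending on the pydot version, this returns a single graph or a list of graphs.
graph = pydot.graph_from_dot_data(dot_data.getvalue())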

In [None]:
# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
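
`cross_val_score` is imported here but never called in the cells that follow. A minimal sketch of how it could be used to estimate model quality before submitting is given below, on synthetic data from `make_classification` rather than the competition files. The `roc_auc` scoring choice and the 3-fold split are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, y_train); not the competition data.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0)

# One score per fold; the mean gives a rough estimate of held-out performance.
scores = cross_val_score(forest, X_toy, y_toy, cv=3, scoring="roc_auc")
print(scores, scores.mean())
```

On the real data, rows belonging to the same person share all of that person's characteristics, so a plain K-fold split may give an optimistic estimate.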

In [67]:
# min_samples_split must be at least 2 (a node needs two samples to be split)
clf = RandomForestClassifier(n_estimators=10, max_depth=10, min_samples_split=2, random_state=0)
clf = clf.fit(X_train, y_train)



In [65]:
df_test = pd.read_csv("./data/act_test.csv")

In [66]:
# Same join as for the training data; note that sort=True reorders rows by people_id.
df_test = pd.merge(people, df_test, how='inner', on='people_id',
                   sort=True, suffixes=('_x', '_y'))

In [67]:
df_test.head(3)

Unnamed: 0,people_id,char_1_x,group_1,char_2_x,date_x,char_3_x,char_4_x,char_5_x,char_6_x,char_7_x,...,char_1_y,char_2_y,char_3_y,char_4_y,char_5_y,char_6_y,char_7_y,char_8_y,char_9_y,char_10_y
0,ppl_100004,type 2,group 22593,type 3,2022-07-20,type 40,type 25,type 9,type 4,type 16,...,type 5,type 10,type 5,type 1,type 6,type 1,type 1,type 7,type 4,
1,ppl_100004,type 2,group 22593,type 3,2022-07-20,type 40,type 25,type 9,type 4,type 16,...,,,,,,,,,,type 682
2,ppl_10001,type 2,group 25417,type 3,2022-10-14,type 6,type 6,type 4,type 1,type 1,...,type 12,type 1,type 5,type 4,type 6,type 1,type 1,type 13,type 10,


In [68]:
df_test = df_test[[u'activity_id',
       u'char_1_x', u'char_2_x', u'char_3_x', u'char_4_x', u'char_5_x',
       u'char_6_x', u'char_7_x', u'char_8_x', u'char_9_x', u'char_10_x',
       u'char_11', u'char_12', u'char_13', u'char_14', u'char_15', u'char_16',
       u'char_17', u'char_18', u'char_19', u'char_20', u'char_21', u'char_22',
       u'char_23', u'char_24', u'char_25', u'char_26', u'char_27', u'char_28',
       u'char_29', u'char_30', u'char_31', u'char_32', u'char_33', u'char_34',
       u'char_35', u'char_36', u'char_37', u'char_38', u'activity_category',
       u'char_1_y', u'char_2_y', u'char_3_y', u'char_4_y', u'char_5_y',
       u'char_6_y', u'char_7_y', u'char_8_y', u'char_9_y', u'char_10_y']]

In [69]:
df_test = df_test.set_index('activity_id')

In [70]:
df_test = label_encoding(df_test)



In [71]:
df_test.head(3)

activity_id,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,char_7_x,char_8_x,char_9_x,char_10_x,...,char_1_y,char_2_y,char_3_y,char_4_y,char_5_y,char_6_y,char_7_y,char_8_y,char_9_y,char_10_y
act1_249281,1,2,34,17,8,3,7,1,1,True,...,42,2,7,1,6,1,1,16,14,0
act2_230855,1,2,34,17,8,3,7,1,1,True,...,0,0,0,0,0,0,0,0,0,3135
act1_240724,1,2,37,21,3,0,0,1,1,True,...,4,1,7,4,6,1,1,5,2,0


In [72]:
df_test.shape

(498687, 49)

In [73]:
X_test = model.transform(df_test)

In [74]:
y_test = rclf.predict(X_test)

In [75]:
y_test[:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
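
The predictions above are hard 0/1 labels. If the competition is scored with a ranking metric such as ROC AUC (an assumption, not stated in this notebook), submitting the predicted probability of the positive class via `predict_proba` is usually the better fit, because it preserves the ordering between activities. The sketch below illustrates the difference on synthetic data only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data with a simple fit/hold-out split.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_fit, y_fit = X_toy[:200], y_toy[:200]
X_hold, y_hold = X_toy[200:], y_toy[200:]

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_fit, y_fit)

hard_labels = forest.predict(X_hold)                # 0/1 class labels
positive_prob = forest.predict_proba(X_hold)[:, 1]  # P(outcome == 1)

print("AUC from hard labels:  ", roc_auc_score(y_hold, hard_labels))
print("AUC from probabilities:", roc_auc_score(y_hold, positive_prob))
```

In this notebook the equivalent call would be `rclf.predict_proba(X_test)[:, 1]`.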

In [76]:
submit = pd.read_csv("./data/act_test.csv")

In [77]:
submit = submit[['activity_id']]

In [78]:
submit.columns

Index([u'activity_id'], dtype='object')

In [79]:
submit.shape

(498687, 1)

In [80]:
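# df_test was re-sorted by the merge, so align predictions to submit by activity_id
# rather than by row position.
submit['outcome'] = pd.Series(y_test, index=df_test.index).reindex(submit['activity_id']).values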

In [81]:
submit = submit.sort_values(by='activity_id')

In [82]:
submit.head()

Unnamed: 0,activity_id,outcome
240682,act1_1,1
79698,act1_100006,0
358220,act1_100050,0
59778,act1_100065,0
117803,act1_100068,0


In [83]:
submit.to_csv(path_or_buf='./data/submit_3.csv',index=False) # feature selection

In [57]:
#submit.to_csv(path_or_buf='./data/submit_1.csv',index=False) # 0.854019