Use a C4.5 decision tree to predict the conflict management outcome from the provided features, and visualize the tree.

In [1]:
from modshogun import *
import pandas as pd
import numpy as np
import csv

Read in ICM2004.csv using pandas. Clean the data by deleting the rows where cm14 is empty. cm14 is a categorical variable that records the conflict management outcome on a scale of 0-5 (0: no management, 1: offered only, 2: unsuccessful, 3: cease-fire, 4: partial agreement, 5: full settlement).

In [2]:
#read in csv
df = pd.read_csv('ICM2004.csv')
#delete rows where cm14 is empty
df = df[np.isfinite(df['cm14'])]
#print df.shape
#extract labels and features
label = df['cm14'].values
#transpose so that rows are features and columns are samples, the layout Shogun expects
feats = df.loc[:,df.columns !='cm14'].values.T

print df.shape

(5004, 238)


Divide the labels and features into train (80%) and test (20%) sets.

In [58]:
#divide into train and test
#keep 20% for test
ntest = int(label.shape[0]*0.2)
ntrain = label.shape[0]-ntest
subset = np.int32(np.random.permutation(label.shape[0]))
train_feats = feats[:,subset[0:ntrain]]
train_labels = label[subset[0:ntrain]]
test_feats = feats[:,subset[ntrain:ntrain+ntest]]
test_labels = label[subset[ntrain:ntrain+ntest]]
print train_feats
print train_labels

[[   1.  168.   93. ...,  220.  248.  325.]
 [  13.    2.    9. ...,   15.    6.    0.]
 [  45.   75.   62. ...,   82.   86.   98.]
 ..., 
 [  nan   nan   nan ...,   nan   nan   nan]
 [  nan   nan   nan ...,   nan   nan   nan]
 [  nan   nan   nan ...,   nan   nan   nan]]
[ 2.  2.  2. ...,  2.  1.  5.]


We then split the training set into training and validation subsets and build the feature-type array.

In [45]:
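#split into train and validation
subset = np.int32(np.random.permutation(ntrain))
#keep 20% of the training samples for validation
#(samples are columns, so use ntrain rather than train_feats.shape[0], which counts features)
nvalidation = int(ntrain*0.2)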
# form training subset and validation subset
train_subset = subset[0:ntrain-nvalidation]
validation_subset = subset[ntrain-nvalidation:ntrain]

#create a feature type np array. All our features are categorical data, so all are set to True
feat_types = np.full((train_feats.shape[0],),True,dtype=bool)
#print feat_types.shape

print train_feats.shape

(237, 4004)
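
The two-stage split (test first, then validation out of what remains) can be sketched with plain NumPy index arrays. This is a minimal sketch rather than the notebook's exact code: the helper name `three_way_split` and the fixed `seed` are assumptions added for reproducibility, and both fractions are taken over the number of samples, not the number of features.

```python
import numpy as np

def three_way_split(n_samples, test_frac=0.2, val_frac=0.2, seed=0):
    """Return disjoint (train_idx, val_idx, test_idx) index arrays.

    Hypothetical helper: test_frac is taken from all samples,
    val_frac from the samples that remain after removing the test set.
    """
    rng = np.random.RandomState(seed)   # seeded so the split is repeatable
    order = rng.permutation(n_samples)  # shuffled sample indices
    n_test = int(n_samples * test_frac)
    n_rest = n_samples - n_test         # samples left for train + validation
    n_val = int(n_rest * val_frac)
    train_idx = order[:n_rest - n_val]
    val_idx = order[n_rest - n_val:n_rest]
    test_idx = order[n_rest:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(5004)
print(len(train_idx), len(val_idx), len(test_idx))  # 3204 800 1000
```

Note that these indices all refer to the full data set, whereas in the notebook the validation indices refer to columns of the already-extracted training matrix. Either convention works as long as it is applied consistently.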


Write a function for training the decision tree, then create RealFeatures and MulticlassLabels objects for the Shogun tree.

In [46]:
def train_tree(feats,types,labels):
    #initialize a tree object
    tree = C45ClassifierTree()
    #set labels
    tree.set_labels(labels)
    # supply attribute types
    tree.set_feature_types(types)
    #supply training matrix and train
    tree.train(feats)
    
    return tree

#create shogun features and labels from given data
#training data
train_feats = RealFeatures(train_feats)
train_labels = MulticlassLabels(train_labels)

#test data
test_feats = RealFeatures(test_feats)
test_labels = MulticlassLabels(test_labels)


Train the decision tree on the training subset, then use the validation subset to prune the tree.

In [47]:
#restrict the data to the training subset before training the tree
train_feats.add_subset(train_subset)
train_labels.add_subset(train_subset)
#train the tree
C45Tree = train_tree(train_feats,feat_types,train_labels)

# lift the training-subset restriction so that all training data is visible again
train_feats.remove_subset()
train_labels.remove_subset()

# add validation subset
train_feats.add_subset(validation_subset)
train_labels.add_subset(validation_subset)

#prune the tree
C45Tree.prune_tree(train_feats,train_labels)

train_feats.remove_subset()
train_labels.remove_subset()

In [48]:
def classify_data(tree,data):
    #get classification labels
    output = tree.apply_multiclass(data)
    #get classification certainty
    output_certainty = tree.get_certainty_vector()
    return output, output_certainty


Check the predictions on the test set and compute the accuracy.

In [49]:
# get results
output, output_certainty = classify_data(C45Tree,test_feats)
accuracy = MulticlassAccuracy()
print 'Accuracy : ' + str(accuracy.evaluate(output, test_labels))

Accuracy : 0.818
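
As a sanity check, multiclass accuracy is simply the fraction of test samples whose predicted label equals the true label. The sketch below recomputes it with NumPy on made-up label vectors (not the ICM data) and also reports per-class accuracy, which a single overall number can hide when classes are imbalanced. The helper name `accuracy_and_per_class` is hypothetical; in the notebook the two arrays would presumably come from `output.get_labels()` and `test_labels.get_labels()`, which is an assumption about the Shogun label objects.

```python
import numpy as np

def accuracy_and_per_class(y_true, y_pred):
    """Return (overall accuracy, {class: accuracy on samples of that class})."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # overall accuracy: share of positions where prediction matches truth
    overall = float(np.mean(y_true == y_pred))
    per_class = {}
    for c in np.unique(y_true):
        mask = y_true == c  # samples whose true class is c
        per_class[int(c)] = float(np.mean(y_pred[mask] == c))
    return overall, per_class

# made-up labels on the same 1-5 scale as cm14
y_true = [2, 2, 1, 5, 3, 2, 4, 1]
y_pred = [2, 2, 1, 5, 2, 2, 4, 2]
overall, per_class = accuracy_and_per_class(y_true, y_pred)
print(overall)    # 0.75
print(per_class)  # class 3 is never predicted correctly here
```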


I wanted to visualize the tree, but I could not find a way of doing it with Shogun.
I then tried sklearn, but sklearn's DecisionTreeClassifier does not handle NaN values.
If we delete every row that contains a NaN value, only one sample is left, which is obviously not ideal.
I ended up deleting the columns that contain NaN values instead, which leaves 123 features for 5004 samples.
As this is only meant to illustrate plotting, I fitted the tree on all of the data.

In [26]:
#delete rows where cm14 is empty
sklearn_df = df[np.isfinite(df['cm14'])]
sklearn_df = sklearn_df.dropna(axis=1)
print sklearn_df.shape

sklearn_labels = sklearn_df['cm14'].values

sklearn_feats = sklearn_df.ix[:,sklearn_df.columns !='cm14'].values
sklearn_feature_name = list(sklearn_df.ix[:,sklearn_df.columns !='cm14'].columns[:])
print sklearn_feats.shape
print sklearn_feature_name

(5004, 124)
(5004, 123)
['d1', 'd2a', 'd2b', 'd3a', 'd3b', 'd4', 'd4a', 'd5a', 'd7', 'd8', 'd9', 'd10', 'd11a', 'd11b', 'd11c', 'd11d', 'd11e', 'd12', 'd13', 'd14', 'd14a', 'd14b', 'd15', 'd16', 'd17', 'd18', 'd18a', 'd19', 'd20', 'd21', 'd22', 'd23', 'd24', 'd25', 'd26', 'd27', 'd29', 'p1', 'p2', 'p3', 'p4a', 'p4b', 'p5a', 'p5b', 'p6a', 'p6b', 'p7', 'p8a', 'p8b', 'p9a', 'p9b', 'p10b', 'p11a', 'p11b', 'p12', 'p13a', 'p13b', 'p14a', 'p14b', 'p17a', 'p17b', 'p19a', 'p19b', 'p20b', 'p21b', 'p22b', 'p23a', 'p23b', 'p24', 'p25', 'p26', 'p27', 'p28', 'p29', 'p30', 'p31', 'p32', 'p33', 'rdcm', 'cm2c', 'cm3', 'cm4', 'cm9', 'cm10a', 'cm10b', 'cm44', 'd14c', 'd14d', 'p4c', 'p6ac', 'p6bc', 'p6c', 'p6d', 'p8c', 'p8d', 'p9c', 'p9d', 'p11ac', 'p11bc', 'p11', 'p14ac', 'p14bc', 'p14c', 'p14d', 'p17ac', 'p17bc', 'p17c', 'p19ac', 'p19bc', 'p19c', 'p19d', 'p20bc', 'p21bc', 'p22bc', 'p22bd', 'p23ac', 'p23bc', 'p23', 'p26a', 'p28a', 'p29a', 'cm10c', 'cm14b']
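
The trade-off between dropping rows and dropping columns can be illustrated on a toy array. This is a sketch with made-up values, not the ICM data. It shows why removing every row that contains a NaN can wipe out almost all samples when the missing values are spread over several columns, while removing the affected columns keeps every sample. The two masks correspond to what `dropna(axis=0)` and `dropna(axis=1)` do in pandas.

```python
import numpy as np

nan = np.nan
# 4 samples (rows) x 4 features (columns); NaNs are scattered over two columns
X = np.array([
    [1.0, nan, 3.0, 4.0],
    [2.0, 5.0, nan, 4.0],
    [3.0, 6.0, 7.0, 4.0],
    [4.0, nan, nan, 4.0],
])

missing = np.isnan(X)

# keep only rows with no NaN at all (like dropna(axis=0))
rows_kept = X[~missing.any(axis=1)]
# keep only columns with no NaN at all (like dropna(axis=1))
cols_kept = X[:, ~missing.any(axis=0)]

print(rows_kept.shape)  # (1, 4): a single sample survives
print(cols_kept.shape)  # (4, 2): all samples survive, with fewer features
```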



Fit the tree and visualize it. This writes a .dot file, which can be converted to a PS or PNG file with Graphviz (for example, dot -Tpng tree.dot -o tree.png) and viewed locally. I checked that a .dot file can also be read in with d3 for visualization, but I have not looked into it further. Please see

In [20]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_split=20, random_state=99,max_depth=3)
clf = clf.fit(sklearn_feats,sklearn_labels)
tree.export_graphviz(clf, out_file='tree.dot',feature_names=sklearn_feature_name)

<img src="tree.png">

The tree was limited to max_depth = 3 because otherwise it would grow too big to display. The tree lists 5 classes although cm14 has 6 possible values. This is because every conflict in the data received some kind of management, so there are 0 cases with cm14 = 0. The first line in each box is the feature used for the split. For example, d25 represents UN involvement, with 1 meaning "involvement" and 2 meaning "no involvement". The gini value is the Gini impurity of the samples at that node (0 means all samples belong to one class); each split is chosen to reduce it.
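
To make the gini value shown in each box concrete: for a node whose samples fall into classes with proportions p_k, the Gini impurity is 1 - sum(p_k^2). The sketch below computes it for made-up class counts and compares a parent node with a candidate split. The function names `gini_impurity` and `weighted_gini` are hypothetical helpers for illustration, not part of sklearn.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a list of per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / float(total)) ** 2 for c in class_counts)

def weighted_gini(left_counts, right_counts):
    """Impurity after a split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = float(n_left + n_right)
    return (n_left * gini_impurity(left_counts)
            + n_right * gini_impurity(right_counts)) / n

# made-up counts: a perfectly mixed two-class parent and one candidate split
parent = [50, 50]
left, right = [40, 10], [10, 40]

print(gini_impurity(parent))       # 0.5, the maximum for two classes
print(weighted_gini(left, right))  # about 0.32, so the split reduces impurity
```

A pure node, such as counts of `[30, 0]`, has impurity 0, which is why boxes near the leaves tend to show smaller gini values.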

Other plans:
1. Apply dimensionality reduction, such as PCA, to the data.
2. Create a heat map to visualize the places of conflict on a world map.
