In this assignment, you will build a classification framework. You need to implement the decision tree algorithm, using the Gini index as the attribute selection measure.
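
As a rough illustration of the selection measure named above, the sketch below computes the Gini index of a set of labels and the weighted Gini index of a multiway split on one categorical attribute. The function names `gini` and `gini_split` and the toy data are assumptions made for illustration; they are not part of the assignment.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(values, labels):
    """Weighted Gini index after a multiway split on one categorical attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / n * gini(g) for g in groups.values())

# Toy data (hypothetical): one attribute with three values, two classes.
values = [1, 1, 2, 2, 3, 3]
labels = [1, 1, 2, 2, 1, 2]
print(gini(labels))                # impurity before the split
print(gini_split(values, labels))  # weighted impurity after the split
```

A tree builder would evaluate `gini_split` for every candidate attribute and pick the one with the lowest weighted impurity.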

You will be given a training dataset "training.txt" and a testing dataset "testing.txt". In the training dataset, the data format is
```
<label> <index1>:<value1> <index2>:<value2> ...

......
```

Each line contains one instance and ends with a '\n' character. `<label>` is an integer indicating the class label. Each pair `<index>:<value>` gives a feature (attribute) value: `<index>` is a non-negative integer and `<value>` is a number (we only consider categorical attributes in this assignment). Note that an attribute may have more than two possible values, meaning it is a multi-valued categorical attribute.
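
To make the format concrete, here is a minimal parsing sketch. The helper name `parse_line` and its `has_label` flag are hypothetical, and the sample line is made up; the sketch assumes fields are separated by whitespace and that values are integers.

```python
def parse_line(line, has_label=True):
    """Split one instance into (label, {index: value}); label is None for unlabeled lines."""
    tokens = line.strip().split()
    label = None
    if has_label:
        label = int(tokens[0])
        tokens = tokens[1:]
    features = {}
    for token in tokens:
        index, value = token.split(":")
        features[int(index)] = int(value)
    return label, features

# A made-up training line with three attributes.
label, features = parse_line("4 0:3 1:2 2:2\n")
print(label, features)
```

Passing `has_label=False` handles lines that carry only `<index>:<value>` pairs.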

In the testing dataset, the data format is
```
<index1>:<value1> <index2>:<value2> ...

......
```

The testing lines do not contain the `<label>` field.

You need to submit a file titled `"result.txt"`. Each line contains one integer representing the predicted label of a testing sample.

You will be graded based on whether your file format is correct and on the precision of your classifier on the testing dataset. You will get a full score as long as your precision is above a certain threshold.
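
The text does not define "precision". One common reading is the fraction of testing samples whose predicted label matches the true label (overall accuracy). Under that assumption, the metric could be computed as in the sketch below; the function name `accuracy` and the label lists are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty label list")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up labels for illustration only.
predicted = [1, 4, 3, 4, 2]
actual = [1, 4, 2, 4, 2]
print(accuracy(predicted, actual))  # 4 of the 5 labels match
```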

## Data Engineering

In [98]:
import pandas as pd
import numpy as np

In [51]:
train_untouched = pd.read_csv('./training.txt', sep=" ", header=None)
train_untouched.head()

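| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0:3 | 1:2 | 2:2 | 3:2 | 4:1 | 5:2 | 6:2 | 7:3 | 8:1 | ... | 118:2 | 119:2 | 120:2 | 121:2 | 122:2 | 123:2 | 124:1 | 125:2 | 126:2 | 127:2 |
| 1 | 4 | 0:2 | 1:3 | 2:2 | 3:1 | 4:2 | 5:3 | 6:2 | 7:3 | 8:2 | ... | 118:1 | 119:1 | 120:2 | 121:3 | 122:1 | 123:3 | 124:2 | 125:3 | 126:3 | 127:1 |
| 2 | 2 | 0:2 | 1:2 | 2:2 | 3:1 | 4:1 | 5:1 | 6:2 | 7:3 | 8:1 | ... | 118:3 | 119:2 | 120:2 | 121:2 | 122:2 | 123:2 | 124:2 | 125:2 | 126:2 | 127:2 |
| 3 | 4 | 0:1 | 1:2 | 2:2 | 3:2 | 4:2 | 5:3 | 6:2 | 7:2 | 8:2 | ... | 118:1 | 119:1 | 120:2 | 121:2 | 122:2 | 123:2 | 124:2 | 125:3 | 126:3 | 127:3 |
| 4 | 4 | 0:2 | 1:2 | 2:2 | 3:2 | 4:1 | 5:2 | 6:2 | 7:2 | 8:1 | ... | 118:2 | 119:2 | 120:3 | 121:2 | 122:1 | 123:1 | 124:1 | 125:2 | 126:2 | 127:1 |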


In [52]:
# remove labels before processing
df_train = train_untouched.iloc[:, 1:]
df_train.head()

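| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0:3 | 1:2 | 2:2 | 3:2 | 4:1 | 5:2 | 6:2 | 7:3 | 8:1 | 9:2 | ... | 118:2 | 119:2 | 120:2 | 121:2 | 122:2 | 123:2 | 124:1 | 125:2 | 126:2 | 127:2 |
| 1 | 0:2 | 1:3 | 2:2 | 3:1 | 4:2 | 5:3 | 6:2 | 7:3 | 8:2 | 9:2 | ... | 118:1 | 119:1 | 120:2 | 121:3 | 122:1 | 123:3 | 124:2 | 125:3 | 126:3 | 127:1 |
| 2 | 0:2 | 1:2 | 2:2 | 3:1 | 4:1 | 5:1 | 6:2 | 7:3 | 8:1 | 9:1 | ... | 118:3 | 119:2 | 120:2 | 121:2 | 122:2 | 123:2 | 124:2 | 125:2 | 126:2 | 127:2 |
| 3 | 0:1 | 1:2 | 2:2 | 3:2 | 4:2 | 5:3 | 6:2 | 7:2 | 8:2 | 9:3 | ... | 118:1 | 119:1 | 120:2 | 121:2 | 122:2 | 123:2 | 124:2 | 125:3 | 126:3 | 127:3 |
| 4 | 0:2 | 1:2 | 2:2 | 3:2 | 4:1 | 5:2 | 6:2 | 7:2 | 8:1 | 9:2 | ... | 118:2 | 119:2 | 120:3 | 121:2 | 122:1 | 123:1 | 124:1 | 125:2 | 126:2 | 127:1 |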


In [75]:
# https://stackoverflow.com/questions/25900332/find-last-word-in-a-string-within-a-list-pandas-python-3

# function to extract the value out of every cell
def extract_value(col):
    return col.str.split(':').str[-1]
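
To see what `extract_value` does, the sketch below applies the same `str.split(':').str[-1]` chain to a tiny in-memory frame built from two made-up lines in the training format. The sample text and the variable names `sample`, `raw`, and `values` are assumptions for illustration.

```python
import io
import pandas as pd

# Two made-up instances in the training format: label, then index:value pairs.
sample = io.StringIO("4 0:3 1:2 2:2\n2 0:2 1:2 2:1\n")
raw = pd.read_csv(sample, sep=" ", header=None)

def extract_value(col):
    # keep only the text after the last ':' in every cell of the column
    return col.str.split(':').str[-1]

# Column 0 holds the label; the remaining columns hold "index:value" strings.
values = raw.iloc[:, 1:].apply(extract_value, axis=0).apply(pd.to_numeric)
values['label'] = raw.iloc[:, 0]
print(values.to_numpy())
```

Each row of the printed matrix holds the attribute values followed by the label in the last column.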

In [76]:
# apply function to extract values, and attach label to the end
train_values = df_train.apply(extract_value, axis=0)
train_values['label'] = train_untouched.iloc[:,0]
# convert everything to integer
train_values = train_values.apply(pd.to_numeric)

# convert to numpy matrix (DataFrame.as_matrix() was removed in pandas 1.0)
train = train_values.to_numpy() # last column is the label

In [78]:
print(train.shape)
train

(3000, 129)


array([[3, 2, 2, ..., 2, 2, 4],
       [2, 3, 2, ..., 3, 1, 4],
       [2, 2, 2, ..., 2, 2, 2],
       ...,
       [2, 2, 2, ..., 2, 1, 3],
       [2, 3, 2, ..., 2, 2, 4],
       [1, 2, 2, ..., 2, 1, 4]])

## Same Process for Test Data

In [73]:
test_untouched = pd.read_csv('./testing.txt', sep=" ", header=None)
test_untouched.head()

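| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0:3 | 1:1 | 2:1 | 3:2 | 4:2 | 5:1 | 6:2 | 7:2 | 8:2 | 9:2 | ... | 118:2 | 119:3 | 120:1 | 121:2 | 122:2 | 123:2 | 124:2 | 125:2 | 126:3 | 127:2 |
| 1 | 0:2 | 1:2 | 2:2 | 3:1 | 4:2 | 5:2 | 6:1 | 7:2 | 8:2 | 9:2 | ... | 118:3 | 119:2 | 120:2 | 121:2 | 122:3 | 123:3 | 124:3 | 125:2 | 126:2 | 127:3 |
| 2 | 0:2 | 1:2 | 2:2 | 3:2 | 4:2 | 5:2 | 6:3 | 7:3 | 8:2 | 9:1 | ... | 118:3 | 119:3 | 120:2 | 121:2 | 122:2 | 123:2 | 124:2 | 125:1 | 126:2 | 127:1 |
| 3 | 0:2 | 1:1 | 2:2 | 3:2 | 4:2 | 5:2 | 6:2 | 7:2 | 8:2 | 9:2 | ... | 118:2 | 119:2 | 120:2 | 121:1 | 122:3 | 123:2 | 124:1 | 125:2 | 126:3 | 127:2 |
| 4 | 0:1 | 1:2 | 2:2 | 3:1 | 4:2 | 5:2 | 6:2 | 7:2 | 8:3 | 9:1 | ... | 118:1 | 119:1 | 120:1 | 121:2 | 122:2 | 123:2 | 124:2 | 125:1 | 126:2 | 127:2 |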


In [79]:
# apply function to extract values (the test data has no label column)
test_values = test_untouched.apply(extract_value, axis=0)

# convert everything to integer
test_values = test_values.apply(pd.to_numeric)

# convert to numpy matrix (DataFrame.as_matrix() was removed in pandas 1.0)
test = test_values.to_numpy() # feature columns only, no label

In [80]:
print(test.shape)
test

(1000, 128)


array([[3, 1, 1, ..., 2, 3, 2],
       [2, 2, 2, ..., 2, 2, 3],
       [2, 2, 2, ..., 1, 2, 1],
       ...,
       [2, 1, 1, ..., 2, 2, 2],
       [2, 3, 3, ..., 2, 2, 2],
       [2, 1, 2, ..., 1, 2, 2]])

## Fit Model & Make Predictions

In [91]:
# http://scikit-learn.org/stable/modules/tree.html
from sklearn import tree

In [92]:
# assign train and test data
X_train = train[:, 0:128]
y_train = train[:, 128]
X_test = test

In [93]:
# fit Decision Tree Classifier
clf = tree.DecisionTreeClassifier() # by default "gini" is the criterion
clf = clf.fit(X_train, y_train)

In [94]:
# make predictions
preds = clf.predict(X_test)

In [101]:
print(type(preds))
print(type(preds[1]))
preds

<class 'numpy.ndarray'>
<class 'numpy.int64'>


array([1, 4, 3, 4, 3, 3, 2, 4, 1, 2, 4, 3, 4, 3, 3, 4, 1, 2, 3, 4, 2, 3,
       4, 2, 3, 4, 2, 2, 1, 1, 2, 3, 1, 4, 2, 3, 4, 3, 4, 2, 3, 1, 4, 4,
       1, 2, 4, 1, 4, 2, 2, 3, 1, 3, 4, 2, 4, 1, 2, 4, 2, 4, 1, 4, 3, 4,
       2, 2, 3, 1, 4, 2, 2, 4, 1, 4, 1, 4, 1, 4, 2, 1, 3, 1, 4, 2, 4, 4,
       2, 3, 2, 3, 3, 3, 1, 1, 2, 1, 1, 1, 3, 1, 1, 1, 2, 4, 2, 2, 4, 4,
       1, 2, 2, 3, 1, 4, 2, 4, 1, 2, 3, 1, 2, 1, 2, 2, 1, 1, 2, 4, 2, 2,
       1, 3, 3, 4, 2, 3, 3, 4, 2, 1, 1, 4, 2, 3, 1, 2, 4, 4, 1, 3, 2, 2,
       4, 1, 2, 3, 2, 1, 4, 3, 2, 3, 3, 1, 2, 4, 4, 1, 1, 4, 2, 1, 2, 1,
       2, 2, 1, 1, 2, 4, 4, 3, 2, 2, 2, 1, 3, 4, 3, 2, 4, 4, 4, 2, 2, 2,
       2, 1, 3, 2, 2, 3, 3, 1, 4, 1, 3, 2, 4, 4, 1, 3, 3, 3, 3, 3, 2, 1,
       1, 2, 1, 4, 2, 4, 1, 3, 4, 1, 1, 3, 4, 1, 2, 1, 1, 1, 1, 4, 3, 3,
       1, 4, 3, 2, 1, 3, 2, 4, 3, 3, 4, 3, 3, 4, 3, 4, 3, 2, 1, 1, 4, 1,
       1, 1, 3, 1, 1, 4, 2, 3, 3, 4, 1, 1, 1, 2, 3, 2, 3, 1, 3, 1, 4, 1,
       3, 3, 2, 2, 3, 3, 1, 1, 4, 4, 3, 4, 4, 1, 1,

In [102]:
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html
np.savetxt("./result.txt", X=preds, newline='\n', fmt='%d')
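
Because the grade depends partly on file format, it may be worth reading the file back before submitting. The sketch below writes a few made-up predictions to a temporary file and checks that every line holds a single integer. `check_result_file` is a hypothetical helper; for the testing set shown above the expected line count would be 1000.

```python
import os
import tempfile

def check_result_file(path, expected_lines):
    """Return True if the file has exactly expected_lines lines, each a single integer."""
    with open(path) as handle:
        lines = handle.read().splitlines()
    if len(lines) != expected_lines:
        return False
    for line in lines:
        try:
            int(line)
        except ValueError:
            return False
    return True

# Made-up predictions written one per line, mimicking the layout of result.txt.
demo_preds = [1, 4, 3, 4]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"{p}\n" for p in demo_preds))
    demo_path = tmp.name

ok = check_result_file(demo_path, expected_lines=len(demo_preds))
os.remove(demo_path)
print(ok)
```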

## Try RandomForestClassifier for better results?

In [103]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=50)
clf = clf.fit(X_train, y_train)

# make predictions
preds = clf.predict(X_test)

# https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html
# note: this overwrites the decision tree predictions saved to result.txt above
np.savetxt("./result.txt", X=preds, newline='\n', fmt='%d')
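
Whether the forest is actually better cannot be read off `result.txt`, since the testing labels are not available. One option is to compare both models with cross-validation on the training data. The sketch below does this on synthetic data that only mimics the shape of the real matrices (categorical values 1 to 3, four classes); with the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`. The synthetic data, the fold count, and the `random_state` values are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training matrix: 300 rows, 20 categorical features in {1, 2, 3}.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 4, size=(300, 20))
# Labels in {1, 2, 3, 4} that depend on the first two features, so there is signal to learn.
y_demo = (X_demo[:, 0] + X_demo[:, 1]) % 4 + 1

models = {
    "decision tree": DecisionTreeClassifier(criterion="gini", random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy, computed on the training data only
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

Scores on synthetic data say nothing about the real dataset; the point is only the comparison procedure.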