### Determine the most important feature with a Decision Tree (Section 6.5)
* Purpose: Demonstrate the benefit of a decision tree's simple structure: the decision at the root node can be read out directly to determine the most informative feature.
* Matrix: Classification matrix for the current state ("current") and with the feature history of the mozilla-central repository
* Features: Includes, Function Calls, Definitions, Names, Conditions
* Model: Decision Tree Classifier
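
Reading out the root decision can be illustrated with scikit-learn directly. This is only a minimal sketch: how `PredictionHelper` computes `most_important_feature` is not shown in this notebook, and the helper `root_feature_name`, the toy matrix, and the header names below are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def root_feature_name(fitted_tree, feature_names):
    """Return the name of the feature tested at the root node (None for a single-leaf tree)."""
    root_index = fitted_tree.tree_.feature[0]  # node 0 is always the root
    if root_index < 0:  # negative values mark leaf nodes
        return None
    return feature_names[root_index]


# Hypothetical toy data: rows are files, columns are binary "includes" features.
feature_names = ["a.h", "b.h", "c.h"]
X = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
])
y = X[:, 0]  # the label follows the first column, so "a.h" separates the classes perfectly

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
root = root_feature_name(clf, feature_names)
print(root)
```

Because the root split is chosen to reduce impurity the most over the full training set, it is a natural candidate for the single most informative feature.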

#### Setup
* Training set/test set: Stratified sampling on a single matrix (2/3 : 1/3)
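
A stratified 2/3 : 1/3 split keeps the class ratio the same in both parts. The sketch below uses scikit-learn's `train_test_split` on made-up data; whether `PredictionHelper` performs the split this way is an assumption, since its implementation is not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 30 samples, 6 of them positive (20 %).
X = np.arange(60).reshape(30, 2)
y = np.array([1] * 6 + [0] * 24)

# stratify=y preserves the share of positives in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1.0 / 3, stratify=y, random_state=0
)

print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```

Stratification matters when one class is rare: a purely random split could leave the test set with very few positive samples, which would make precision and recall unstable.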

#### Results
Table listing the most informative feature for each feature type, for both "current" and feature history. Precision and recall are also reported for each feature type.
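
The cell below takes precision and recall from index 1 of the arrays returned by `precision_recall_curve`. Assuming the decision tree's predictions are hard 0/1 labels and both labels occur among the predictions, that index corresponds to the single real operating point and matches `precision_score` and `recall_score`. A small check on made-up labels:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Hypothetical ground truth and hard 0/1 predictions.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

precision, recall, thresholds = precision_recall_curve(y_true, y_pred)

# Index 0 is the "predict everything positive" point, index 1 is the threshold
# at 1 (the classifier's actual predictions), and the last entry is the
# appended end point (precision 1, recall 0).
curve_precision = precision[1]
curve_recall = recall[1]

direct_precision = precision_score(y_true, y_pred)
direct_recall = recall_score(y_true, y_pred)

print(curve_precision, direct_precision)
print(curve_recall, direct_recall)
```

Calling `precision_score` and `recall_score` directly states the intent more clearly and does not rely on the layout of the curve arrays.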

In [7]:
import numpy as np
from prettytable import PrettyTable

from imports.matrix_helper import MatrixHelper
from imports.prediction_helper import PredictionHelper
from sklearn.metrics import precision_recall_curve

matrix_helper = MatrixHelper()

features = [('incl', 'Includes'), ('cond', 'Conditions'), ('defs', 'Definitions'), ('names', 'Names'), ('calls', 'Function Calls')]
table = PrettyTable(['Features', 'Precision', 'Recall', 'Most important feature', 'time for fitting'])
table.align["Features"] = "l"
table.align["Most important feature"] = "l"
counter = 1

for feature in features:
    for h_type in ['current', 'history']:
        # Read pickle
        matrices = matrix_helper.load_from_parse('data/matrices/matrix_cla_{}_{}.pickle'.format(feature[0], h_type))

        # Instantiate Prediction Helper Class and predict values for compare matrix with a DT
        prediction_helper = PredictionHelper()
        prediction_helper.calculate_validation_compare_matrix(matrices, sampling_factor=(2.0/3), model_type='DT')
        compare_matrix = prediction_helper.get_compare_matrix()

        # Compute precision and recall: column 2 is passed as y_true, column 1 as the prediction.
        # Index 1 of the returned arrays is used below as the operating point of the classifier.
        precision, recall, thresholds = precision_recall_curve(np.array(compare_matrix[:, 2], dtype='f'), np.array(compare_matrix[:, 1], dtype='f'))

        feature_name = "{} ({})".format(feature[1], h_type)
        precision = "{:.3f}".format(precision[1])
        recall = "{:.3f}".format(recall[1])
        # Named fit_time to avoid shadowing the standard `time` module
        fit_time = "{:.2f}min".format((prediction_helper.time_fitting / 60.0))
        
        table.add_row([feature_name, precision, recall, prediction_helper.most_important_feature, fit_time])
        print(' * {} - done {}/{}'.format(feature_name, counter, 2 * len(features)))
        counter += 1
print(table)

 * Includes (current) - done 1/10
 * Includes (history) - done 2/10
 * Conditions (current) - done 3/10
 * Conditions (history) - done 4/10
 * Definitions (current) - done 5/10
 * Definitions (history) - done 6/10
 * Names (current) - done 7/10
 * Names (history) - done 8/10
 * Function Calls (current) - done 9/10
 * Function Calls (history) - done 10/10
+--------------------------+-----------+--------+------------------------+------------------+
| Features                 | Precision | Recall | Most important feature | time for fitting |
+--------------------------+-----------+--------+------------------------+------------------+
| Includes (current)       |   0.456   | 0.367  | nsContentUtils.h       |     1.81min      |
| Includes (history)       |   0.569   | 0.554  | jscntxt.h              |     2.36min      |
| Conditions (current)     |   0.600   | 0.156  | DEBUG                  |     9.31min      |
| Conditions (history)     |   0.733   | 0.324  | DEBUG                  |     