## Classification with additional features (Section 6.3)
* Purpose: Validate classification using features other than includes and function calls.
* Matrix: Classification matrix for the current state ("current"), as well as with the feature history ("history") of the mozilla-central repository
* Features: Includes, Function Calls, Definitions, Names, Conditions
* Model: Support Vector Machine classifier (`LinearSVC`)

### Setup
* Training set / test set: stratified sampling on a single matrix (2/3 : 1/3)
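
As an illustration of the 2/3 : 1/3 stratified split, here is a minimal standard-library sketch. The function `stratified_split` and the toy labels are hypothetical and are not part of `PredictionHelper`, whose internal sampling is not shown in this notebook. With scikit-learn, `train_test_split(..., stratify=y)` offers the same idea.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=2.0 / 3, seed=None):
    """Return (train_indices, test_indices) with class proportions preserved.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every class.
        cut = int(round(len(indices) * train_fraction))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy labels: 30 negative (0) and 6 positive (1) samples.
labels = [0] * 30 + [1] * 6
train_idx, test_idx = stratified_split(labels, seed=0)
print(len(train_idx), len(test_idx))
```

Because the split is done per class, the rare positive class keeps the same share in both sets, which a plain random split would not guarantee.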

#### Required matrices
* ```data/matrices/matrix_cla_incl_current.pickle```
* ```data/matrices/matrix_cla_incl_history.pickle```
* ```data/matrices/matrix_cla_cond_current.pickle```
* ```data/matrices/matrix_cla_cond_history.pickle```
* ```data/matrices/matrix_cla_defs_current.pickle```
* ```data/matrices/matrix_cla_defs_history.pickle```
* ```data/matrices/matrix_cla_names_current.pickle```
* ```data/matrices/matrix_cla_names_history.pickle```
* ```data/matrices/matrix_cla_calls_current.pickle```
* ```data/matrices/matrix_cla_calls_history.pickle```
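
Before starting the experiments it can help to check that all ten files are present. The sketch below derives the paths from the naming pattern in the list above. The names `expected_matrix_paths` and `missing_matrix_paths` are hypothetical helpers, not part of `MatrixHelper`.

```python
import os

# Feature keys and states as they appear in the file names listed above.
FEATURE_KEYS = ['incl', 'cond', 'defs', 'names', 'calls']
HISTORY_TYPES = ['current', 'history']
PATH_TEMPLATE = 'data/matrices/matrix_cla_{}_{}.pickle'


def expected_matrix_paths():
    """Return the ten expected matrix paths (hypothetical helper)."""
    return [PATH_TEMPLATE.format(key, h_type)
            for key in FEATURE_KEYS
            for h_type in HISTORY_TYPES]


def missing_matrix_paths(base_dir='.'):
    """Return the expected paths that do not exist below base_dir."""
    return [path for path in expected_matrix_paths()
            if not os.path.isfile(os.path.join(base_dir, path))]


for path in missing_matrix_paths():
    print('missing: {}'.format(path))
```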

### Results
Tabular comparison of the mean precision and recall values for the different features over n=5 experiments. The number of extracted features and the mean time for training the model are also listed.
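
To make the reported quantities concrete, the following standard-library sketch computes precision and recall from binary labels and then the mean and standard deviation over several runs. The helper names and the toy data are assumptions for illustration. The standard deviation is the population form, which is what `np.std` computes by default.

```python
import math


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_and_sd(values):
    """Mean and population standard deviation of a list of numbers."""
    mean = sum(values) / float(len(values))
    variance = sum((v - mean) ** 2 for v in values) / float(len(values))
    return mean, math.sqrt(variance)


# Toy example: the same true labels, predictions from two hypothetical runs.
y_true = [1, 1, 1, 0, 0, 0]
runs = [[1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0]]
precisions = [precision_recall(y_true, y_pred)[0] for y_pred in runs]
print('precision mean/sd: {:.3f} / {:.3f}'.format(*mean_and_sd(precisions)))
```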

In [1]:
from __future__ import print_function  # keeps the print() calls below valid on Python 2 and 3

import matplotlib.pyplot as plt
import numpy as np
from prettytable import PrettyTable
from sklearn.metrics import precision_recall_curve

from imports.matrix_helper import MatrixHelper
from imports.prediction_helper import PredictionHelper

matrix_helper = MatrixHelper()
experiments_per_feature = 5

# (file name key, display name) for each feature type
features = [('incl', 'Includes'), ('cond', 'Conditions'), ('defs', 'Definitions'), ('names', 'Names'), ('calls', 'Function Calls')]
table = PrettyTable(['Features', 'Feature count', 'Precision', 'Recall', 'Precision sd', 'Recall sd', 'Time'])
table.align["Features"] = "l"

for feature_key, feature_label in features:
    for h_type in ['current', 'history']:
        # Load the pickled matrices for this feature type and state
        matrices = matrix_helper.load_from_parse('data/matrices/matrix_cla_{}_{}.pickle'.format(feature_key, h_type))

        feature_name = "{} ({})".format(feature_label, h_type)
        precision_list = []
        recall_list = []
        time_list = []

        for i in range(experiments_per_feature):
            # Progress indicator; '\r' returns to the start of the line so the next update overwrites it
            print('* {:20}: {:2}/{:2}\r'.format(feature_name, i + 1, experiments_per_feature), end='')
            # Instantiate the prediction helper and predict the values of the compare matrix with an SVM
            prediction_helper = PredictionHelper()
            prediction_helper.calculate_validation_compare_matrix(matrices, sampling_factor=(2.0/3), model_type='LinearSVC')
            compare_matrix = prediction_helper.get_compare_matrix()

            # precision_recall_curve expects (true labels, predicted scores):
            # column 2 is passed as the true labels, column 1 as the predictions
            precision, recall, thresholds = precision_recall_curve(np.array(compare_matrix[:, 2], dtype='f'), np.array(compare_matrix[:, 1], dtype='f'))
            # Report the second point of the curve (index 1)
            precision_list.append(precision[1])
            recall_list.append(recall[1])
            time_list.append(prediction_helper.time_fitting)
        print()

        # Mean and standard deviation over all experiments for this feature
        mean_precision = "{:.3f}".format(np.mean(precision_list))
        mean_recall = "{:.3f}".format(np.mean(recall_list))
        precision_sd = '{:.3f}'.format(np.std(precision_list))
        recall_sd = '{:.3f}'.format(np.std(recall_list))
        # time_fitting is divided by 60 to report minutes
        mean_time = "{:.2f}min".format(np.mean(time_list) / 60.0)

        table.add_row([feature_name, len(matrices[2]), mean_precision, mean_recall, precision_sd, recall_sd, mean_time])

print(table)

* Includes (current)  :  5/ 5
* Includes (history)  :  5/ 5
* Conditions (current):  5/ 5
* Conditions (history):  5/ 5
* Definitions (current):  5/ 5
* Definitions (history):  5/ 5
* Names (current)     :  5/ 5
* Names (history)     :  5/ 5
* Function Calls (current):  5/ 5
* Function Calls (history):  5/ 5
+--------------------------+---------------+-----------+--------+--------------+-----------+---------+
| Features                 | Feature count | Precision | Recall | Precision sd | Recall sd |   Time  |
+--------------------------+---------------+-----------+--------+--------------+-----------+---------+
| Includes (current)       |     15362     |   0.669   | 0.361  |    0.019     |   0.019   | 0.04min |
| Includes (history)       |     16383     |   0.748   | 0.525  |    0.016     |   0.011   | 0.06min |
| Conditions (current)     |     19569     |   0.761   | 0.144  |    0.032     |   0.010   | 0.06min |
| Conditions (history)     |     20081     |   0.800   | 0.291  |    0.0