# Lesson 02 - Basic classification

In [None]:
import pandas as pd
import numpy as np

## Load data and basic info

Let's load the same dataset as in Lesson 01.

In [None]:
bugs = pd.read_csv('./data/bugs_train.csv', parse_dates=['Opened', 'Changed'], index_col=None)

In [None]:
bugs.head(4)

## The classification task

Let's assume we would like to predict the *resolution* of a defect report based on the other columns.

## Data preparation

Sklearn algorithms generally expect numeric input and cannot handle string features directly. Therefore, we need to convert them to numbers.

We will start by predicting the resolution based only on Component and Severity.

In [None]:
# take a copy so that later column assignments do not operate on a view of bugs
bugs_small = bugs[['Component', 'Severity', 'Resolution']].copy()

We will first convert the Component feature to integers. Component is a nominal (categorical) variable.

In [None]:
bugs['Component'].unique()

If we simply mapped each Component value to an integer (e.g., Debug = 0, UI = 1, ...), the classifier could assume that the values are ordered (UI > Debug). To avoid this problem, we usually use so-called one-hot encoding: each possible value of the feature becomes a 0/1 feature of its own. So we will have features like Debug, UI, Core, ...

We could use the sklearn class OneHotEncoder; however, since we use the pandas library to manipulate our data, it can be done directly with the get_dummies function (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html).
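For comparison, here is a minimal sketch of the OneHotEncoder route mentioned above. The toy data frame and its values are hypothetical and only mimic the Component column; the `Component_` column names are built by hand to resemble what get_dummies produces.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the values only mimic the Component column.
toy = pd.DataFrame({'Component': ['UI', 'Debug', 'Core', 'UI']})

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy[['Component']]).toarray()

# one column per category, in the sorted order stored in categories_
columns = ['Component_' + c for c in encoder.categories_[0]]
toy_encoded = pd.DataFrame(encoded, columns=columns, index=toy.index)
print(toy_encoded)
```

A fitted encoder can be reused on new data with `transform`, which is the main reason to prefer it over get_dummies when the training and test sets are encoded separately.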

In [None]:
bugs_small = pd.get_dummies(bugs_small, columns=['Component'], prefix="Component")
bugs_small.head(4)

We could do the same with Severity. In this case, however, converting it to a number is not a bad idea, since it is an ordinal variable. We only need to map it to numbers in a way that preserves the order of the scale.

In [None]:
bugs_small['Severity'].unique()

We would like to have the following order: 
'enhancement', 'trivial', 'minor', 'normal', 'major', 'critical', 'blocker'

In [None]:
bugs_small['Severity'] = bugs_small['Severity'].map(
    {'enhancement':0, 'trivial':1, 'minor':2, 'normal':3, 'major':4, 'critical':5, 'blocker':6})
bugs_small.head(4)
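One thing worth checking after such a mapping: `map` with a dictionary turns every value that is missing from the dictionary into NaN. The sketch below shows one way to detect that. The toy column is hypothetical, and 'urgent' is deliberately a value outside the scale.

```python
import pandas as pd

severity_order = ['enhancement', 'trivial', 'minor', 'normal',
                  'major', 'critical', 'blocker']
# build the same mapping as above from the ordered list
severity_map = {name: rank for rank, name in enumerate(severity_order)}

# Hypothetical toy column; 'urgent' is deliberately not part of the scale.
toy_severity = pd.Series(['minor', 'blocker', 'urgent', 'normal'])

mapped = toy_severity.map(severity_map)

# values missing from the dictionary become NaN after map()
unmapped = toy_severity[mapped.isna()].unique()
print(unmapped)
```

If `unmapped` is not empty, the mapping should be extended before training, since most sklearn estimators reject NaN values.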

Finally, we need to convert our decision class, Resolution, to numbers. We can use LabelEncoder from sklearn to do it. Technically, the operation is the same as what we did with Severity; however, LabelEncoder also lets us easily go back from numbers to labels.

But first, let's split the bugs_small data frame into X (features) and Y (decision class, Resolution).

In [None]:
Y = bugs_small['Resolution']

In [None]:
X = bugs_small.drop(['Resolution'], axis=1, inplace=False)

Now, convert Y to ints.

In [None]:
Y.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# create an instance of the class
y_encoder = LabelEncoder()

# fit the converter to the data
y_encoder.fit(Y)

# let's see the mapping
for y_label in Y.unique():
    print(y_label, y_encoder.transform([y_label]))

In [None]:
# convert y to numbers
Y = y_encoder.transform(Y)
Y
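Going back from numbers to labels is done with `inverse_transform`. A minimal sketch on hypothetical labels (they only mimic the Resolution column and need not match the values in the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels that only mimic the Resolution column.
toy_labels = ['FIXED', 'DUPLICATE', 'FIXED', 'WONTFIX']

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

# classes_ holds the labels in sorted order; the position is the integer code
print(list(toy_encoder.classes_))
print(list(toy_codes))

# inverse_transform maps the integer codes back to the original labels
decoded = toy_encoder.inverse_transform(toy_codes)
print(list(decoded))
```

The same call works on predictions, so the output of a classifier trained on encoded classes can be reported with the original label names.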

## Training a classifier

Let's train a random forest classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# create an instance of the classifier: a forest of 20 trees
random_forest = RandomForestClassifier(n_estimators=20)

In [None]:
# now, let's randomly split our data into a training and testing set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=10)

In [None]:
# let's train our random forest 
random_forest.fit(X_train, y_train)

In [None]:
# we can use the trained model to classify new instances
y_pred = random_forest.predict(X_test)
y_pred

In [None]:
# since we know the true classes, we can calculate several prediction quality measures

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average='macro')
rec = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

"Accuracy = {:.3f}, Precision = {:.3f}, Recall = {:.3f}, F1-score = {:.3f}".format(acc, prec, rec, f1)

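The `average='macro'` argument used above means that the measure is computed for each class separately and the per-class values are then averaged without weighting. The sketch below illustrates this for precision on hypothetical true and predicted classes.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical true and predicted classes for a 3-class problem.
toy_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2])

# average=None returns one precision value per class
per_class = precision_score(toy_true, toy_pred, average=None)

# macro averaging is the unweighted mean of the per-class values
macro = precision_score(toy_true, toy_pred, average='macro')
print(per_class, macro)
```

Because every class counts equally, rare classes influence the macro average as much as frequent ones, which is why it can differ noticeably from accuracy on imbalanced data.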
We can also analyze the predictions using a confusion matrix.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    np.set_printoptions(precision=2)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


In [None]:
from sklearn.metrics import confusion_matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
plt.figure(figsize=(15,6))

plt.subplot(1, 2, 1)
plot_confusion_matrix(cnf_matrix, classes=y_encoder.classes_,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.subplot(1, 2, 2)
plot_confusion_matrix(cnf_matrix, classes=y_encoder.classes_, normalize=True,
                      title='Normalized confusion matrix')
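The normalization used in the plots divides each row by the number of instances of that true class, so every row sums to 1 and the diagonal holds the recall of each class. A small sketch on hypothetical true and predicted classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted classes for a 3-class problem.
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 1, 1, 2, 2, 2]

# rows correspond to true classes, columns to predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)

# divide each row by the number of instances of that true class
toy_cm_norm = toy_cm.astype(float) / toy_cm.sum(axis=1)[:, np.newaxis]
print(toy_cm_norm)
```

The normalized view makes classes of different sizes comparable, while the raw counts show how many instances stand behind each cell.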

Here, we estimated prediction quality using a single train/test split. However, we very often use cross-validation for that purpose, because every instance is then used for testing exactly once.

In [None]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(random_forest, X, Y, cv=10)
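To make it clearer what `cross_val_predict` does, here is a rough sketch of the same procedure written as an explicit loop. It runs on synthetic data from `make_classification` rather than on the bug reports, and it uses a decision tree with a fixed `random_state` instead of the random forest so that both variants are deterministic. These choices are assumptions of the sketch, not part of the lesson's pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X and Y (an assumption of this sketch).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

model = DecisionTreeClassifier(random_state=0)
folds = StratifiedKFold(n_splits=5)

manual_pred = np.empty_like(y_toy)
for train_idx, test_idx in folds.split(X_toy, y_toy):
    # train a fresh copy of the model on the training folds only
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    # predict the held-out fold; every sample is predicted exactly once
    manual_pred[test_idx] = fold_model.predict(X_toy[test_idx])

# the library call performs the same steps with the same folds
library_pred = cross_val_predict(model, X_toy, y_toy, cv=folds)
print((manual_pred == library_pred).all())
```

Each prediction comes from a model that never saw the corresponding instance during training, so the combined predictions can be scored against the full set of true classes.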

In [None]:
acc = accuracy_score(Y, y_pred)
prec = precision_score(Y, y_pred, average='macro')
rec = recall_score(Y, y_pred, average='macro')
f1 = f1_score(Y, y_pred, average='macro')

"Accuracy = {:.3f}, Precision = {:.3f}, Recall = {:.3f}, F1-score = {:.3f}".format(acc, prec, rec, f1)

In [None]:
cnf_matrix = confusion_matrix(Y, y_pred)

plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
plot_confusion_matrix(cnf_matrix, classes=y_encoder.classes_,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.subplot(1, 2, 2)
plot_confusion_matrix(cnf_matrix, classes=y_encoder.classes_, normalize=True,
                      title='Normalized confusion matrix')

## Tasks

Task 1. Prepare a new training set which also includes *Priority* as a feature.

Task 2. Compare prediction quality of random forest with 10, 20, 30 trees using 10-fold cross-validation.

Task 3. Train a new classifier, DecisionTreeClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier), and compare its accuracy to that of the random forest.