# Lesson 03 - Classification, text, dates

In [2]:
import pandas as pd
import numpy as np

## Load data and basic info

Let's load the same dataset as in Lesson 01.

In [None]:
bugs = pd.read_csv('./data/bugs_train.csv', parse_dates=['Opened', 'Changed'], index_col=None)

In [None]:
bugs.head(4)

## The classification task (the problem to solve)

Our task remains the same for this lesson: predict the resolution of a defect report (y) based on the description of the defect (X).

## Data preparation (features)

Let's quickly replicate processing of the Component and Severity features, as well as converting the decision class.

In [None]:
# we will make an explicit copy of the selected columns, so that later assignments do not touch `bugs`
bugs_small = bugs[["Component", "Severity", "Status", "Priority", "Opened", "Changed", "Summary", "Resolution"]].copy()

# Component
bugs_small = pd.get_dummies(bugs_small, columns=['Component'], prefix="Component")

# Severity
bugs_small['Severity'] = bugs_small['Severity'].map(
    {'enhancement':0, 'trivial':1, 'minor':2, 'normal':3, 'major':4, 'critical':5, 'blocker':6})

# Status
bugs_small['Status'] = bugs_small['Status'].map(
    {'VERIFIED':0, 'RESOLVED':1, 'CLOSED':2})

# Priority
bugs_small['Priority'] = bugs_small['Priority'].map(
    {'P1':1, 'P2':2, 'P3':3, 'P4':4, 'P5':5})

y = bugs_small['Resolution']
X = bugs_small.drop(['Resolution'], axis=1, inplace=False)

from sklearn.preprocessing import LabelEncoder

# create an instance of the class
y_encoder = LabelEncoder()

# fit the converter to the data
y_encoder.fit(y)

# let's see the mapping
for y_label in y.unique():
    print(y_label, y_encoder.transform([y_label]))

# convert y to numbers
y = y_encoder.transform(y)

In [None]:
X.head(4)

In [None]:
y
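
To make the encoding step easier to follow, here is a minimal sketch of how `LabelEncoder` maps string labels to integers and back. The label values below are hypothetical and chosen only for illustration; the actual values of Resolution in the dataset may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration; the real Resolution values may differ.
labels = ["FIXED", "WONTFIX", "FIXED", "DUPLICATE"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# Classes are stored in sorted order; the position of a class is its integer code.
print(encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings.
print(encoder.inverse_transform(codes))
```

The same `inverse_transform` call can be applied to predictions later on, to read them as class names rather than numbers.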

### Dates - days being processed

Let's focus on features we could create from the two dates, Opened and Changed. In their current form they are not usable as features. We could convert each of them to a set of features, such as year, month, day, etc. We can also think about new features that combine both dates. Let's create a feature holding the number of days between the moment the defect was opened and its last change, i.e., how long the defect was being processed.
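
As a rough sketch of both ideas, the snippet below derives calendar components from a single date and a day difference from two dates, using the pandas `.dt` accessor. The `toy` dataframe and the new column names (`Opened_year`, `Opened_month`) are made up for illustration; only `Opened`, `Changed` and `Days` mirror the lesson.

```python
import pandas as pd

# Hypothetical toy data; the date column names mirror the lesson's dataset.
toy = pd.DataFrame({
    "Opened": pd.to_datetime(["2010-01-05", "2011-03-20"]),
    "Changed": pd.to_datetime(["2010-01-15", "2011-04-01"]),
})

# Calendar components of a single date, via the .dt accessor.
toy["Opened_year"] = toy["Opened"].dt.year
toy["Opened_month"] = toy["Opened"].dt.month

# A feature combining both dates: whole days between them (vectorized, no loop).
toy["Days"] = (toy["Changed"] - toy["Opened"]).dt.days

print(toy)
```

The vectorized subtraction should give the same values as the row-wise variants in the cells that follow, and is usually faster on large frames.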

In [None]:
# using a lambda function
X['Days'] = X.apply(lambda x: (x.Changed - x.Opened).days, axis=1)
X.head(2)

In [None]:
# using iteration
days_processed = [x.days for x in (X['Changed'] - X['Opened'])]
X['Days'] = pd.Series(days_processed)
X.head(2)

In [None]:
# remove Changed and Opened
X.drop(["Changed", "Opened"], inplace=True, axis=1)
X.head(2)

### Text - summary

Textual features are among the more challenging types of features to analyze. It does not make sense to one-hot encode longer texts, since it is very unlikely that exactly the same text appears twice.

The simplest method to extract features from a text is the so-called bag of words. First, a vocabulary is created, and each word in the vocabulary constitutes a feature on its own. Usually, we limit the number of features and exclude "stop words" (words/tokens that appear very often without carrying any special meaning). In some cases it is also useful to include not only single words as features but also pairs or triples of consecutive words, called n-grams.
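
To make the idea concrete before turning to scikit-learn, here is a minimal hand-rolled sketch of a bag of words in plain Python. The sentences and the two-word stop list are invented for illustration, and the whitespace tokenization is deliberately naive.

```python
from collections import Counter

# Hypothetical example sentences, not taken from the dataset.
docs = ["the build fails on linux", "the editor crashes on save", "build crashes"]
stop_words = {"the", "on"}  # a tiny hand-picked stop-word list

# Tokenize by whitespace and drop stop words.
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# The vocabulary is the sorted set of remaining words; each word becomes a column.
vocabulary = sorted({w for doc in tokenized for w in doc})

# Each document becomes a vector of word counts over the vocabulary.
vectors = [[Counter(doc)[w] for w in vocabulary] for doc in tokenized]

print(vocabulary)
for v in vectors:
    print(v)
```

Each row has one entry per vocabulary word, so most entries are zero; this is why library implementations typically return a sparse matrix.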

Let's create a simple bag of words for Summary using the CountVectorizer class (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [None]:
# let's first make sure that we don't have any NaN values as summaries.
X['Summary'] = X['Summary'].fillna('')

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# create a CountVectorizer instance; we keep only the 5000 most frequent terms
count_vect = CountVectorizer(max_features=5000, stop_words="english")

# The fit method of CountVectorizer extracts the vocabulary, while transform performs the transformation.
# There is also the fit_transform method that does both. The result is a sparse matrix,
# which we convert to a dense NumPy array.
bag_of_words = count_vect.fit_transform(list(X['Summary'])).toarray()

# We create a list of names of columns 
colnames = ["Summary_"+x for x in sorted(count_vect.vocabulary_.keys())]

# Finally, we create a dataframe with bag of words features
summary_bow = pd.DataFrame(bag_of_words, columns=colnames)
summary_bow.head(2)

In [None]:
# now merge the bag of words with X
X = pd.concat([X.reset_index(drop=True), summary_bow], axis=1)
X.drop(["Summary"], inplace=True, axis=1)
X.head(2)

## Training a classifier

Let's train a random forest classifier.

In [None]:
# create an instance of the classifier; a forest of 60 trees
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=60)

In [None]:
# now, let's randomly split our data into a training and testing set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=10)

In [None]:
# let's train our random forest 
random_forest.fit(X_train, y_train)

In [None]:
# we can use the trained model to classify new instances
y_pred = random_forest.predict(X_test)
y_pred

In [None]:
# since we know the true classes, we can calculate various prediction quality measures

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average='macro')
rec = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

"Accuracy = {:.3f}, Precision = {:.3f}, Recall = {:.3f}, F1-score = {:.3f}".format(acc, prec, rec, f1)
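
The `average='macro'` setting used above computes the metric separately for each class and then takes the unweighted mean, so rare classes count as much as frequent ones. A small sketch with made-up labels (the names `y_true_toy` and `y_pred_toy` are illustrative only):

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels for three classes, only to illustrate the averaging.
y_true_toy = [0, 0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 0, 1, 1, 2, 2]

# average=None returns one precision value per class.
per_class = precision_score(y_true_toy, y_pred_toy, average=None)

# average='macro' is the plain mean of the per-class values.
macro = precision_score(y_true_toy, y_pred_toy, average='macro')

print(per_class)
print(macro, per_class.mean())
```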

We can also analyze the prediction errors using a confusion matrix.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    np.set_printoptions(precision=2)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


In [None]:
from sklearn.metrics import confusion_matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
cnf_matrix
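
The `normalize=True` option of `plot_confusion_matrix` divides each row by its sum, so every cell becomes the fraction of instances of the true class that received a given prediction. The sketch below shows this on made-up labels; it also compares the result with the `normalize='true'` argument of `confusion_matrix`, assuming a scikit-learn version that supports it (0.22 or newer).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels, only to illustrate the normalization.
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 1, 0]

# Rows correspond to true classes, columns to predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Divide each row by its sum; every row of the result sums to 1.
cm_norm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]

# Assumes scikit-learn >= 0.22, where the normalize argument is available.
cm_norm_sklearn = confusion_matrix(y_true_toy, y_pred_toy, normalize='true')

print(cm)
print(cm_norm)
```

With row normalization, the diagonal holds the per-class recall.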

In [None]:
plt.figure(figsize=(15,6))

plt.subplot(1, 2, 1)
plot_confusion_matrix(cnf_matrix, classes=y_encoder.classes_,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.subplot(1, 2, 2)
plot_confusion_matrix(cnf_matrix, classes=y_encoder.classes_, normalize=True,
                      title='Normalized confusion matrix')

Here, we estimated accuracy using a single train/test split. In practice, cross-validation is often used for that purpose.

In [None]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(random_forest, X, y, cv=10)
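
As a rough sketch of what `cross_val_predict` does: the data is split into folds, and each instance is predicted by a model trained on the folds that do not contain it. The snippet below reproduces this by hand on synthetic data. It assumes that, for a classifier and an integer `cv`, scikit-learn uses unshuffled stratified folds, and it swaps in a decision tree with a fixed `random_state` so that both runs are deterministic. All `*_toy` names are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to keep the example self-contained.
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Manual version: train on k-1 folds, predict the held-out fold.
manual_pred = np.empty_like(y_toy)
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    manual_pred[test_idx] = fold_model.predict(X_toy[test_idx])

# Library version: one prediction per instance, each made by a model that did not see it.
library_pred = cross_val_predict(model, X_toy, y_toy, cv=5)

print((manual_pred == library_pred).all())
```

Because every instance gets exactly one out-of-fold prediction, the metrics in the next cell can be computed against the full `y` rather than only a test subset.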

In [None]:
acc = accuracy_score(y, y_pred)
prec = precision_score(y, y_pred, average='macro')
rec = recall_score(y, y_pred, average='macro')
f1 = f1_score(y, y_pred, average='macro')

"Accuracy = {:.3f}, Precision = {:.3f}, Recall = {:.3f}, F1-score = {:.3f}".format(acc, prec, rec, f1)

In [None]:
cnf_matrix = confusion_matrix(y, y_pred)

plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
plot_confusion_matrix(cnf_matrix, classes=y_encoder.classes_,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.subplot(1, 2, 2)
plot_confusion_matrix(cnf_matrix, classes=y_encoder.classes_, normalize=True,
                      title='Normalized confusion matrix')

## Tasks

Task 1. Look into the documentation of the CountVectorizer class 
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) 
and change the code creating the bag of words so that it:
- takes bi-grams into account
- produces 0/1 features (whether a word is present in the text, not how often it occurs)

Task 2. There is also a column 'Assignee' that we didn't use. Create two features:
- Eclipse Assignee - 1 if 'Assignee' ends with 'eclipse' (str.endswith('eclipse'))
- Inbox Assignee - 1 if 'Assignee' ends with '-inbox' (str.endswith('-inbox'))

Add the new features and check whether they improve the accuracy.

Task 3. Transform the 'Text' column of the `texts_df` dataframe defined below to a bag of words form. Experiment with different n-grams (unigrams, bigrams).

In [3]:
texts = ["Ann has a dog.", "Dog likes to eat.", "Ann likes to play with a dog."]
texts_df = pd.DataFrame(pd.Series(texts, name="Text"))