# Document Classification - Basic Steps

## Import all important packages

In [17]:
import numpy as np
import pandas as pd
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score # import KFold

from datetime import datetime 

## Load dataset

### Read data files

Use the Python package `pandas` to read the files. This dataset consists of 2 text files, one containing 5,331 positive sentences and the other 5,331 negative sentences.

In [2]:
df_sent_pos = pd.read_csv('data/sentence-polarity-dataset/sentence-polarity.pos', sep='\t', header=None)
df_sent_neg = pd.read_csv('data/sentence-polarity-dataset/sentence-polarity.neg', sep='\t', header=None)

df_sent_pos.head()

Unnamed: 0,0
0,the rock is destined to be the 21st century's ...
1,"the gorgeously elaborate continuation of "" the..."
2,effective but too-tepid biopic
3,if you sometimes like to go to the movies to h...
4,"emerges as something rare , an issue movie tha..."


### Create internal representation of dataset

For training and testing, we want two lists, one containing all sentences and another containing the respective labels (here `0` represents negative and `1` represents positive sentences). Note that there is nothing special about how the classes are labelled. We could equally use the strings `"negative"` and `"positive"`. Some additional explanations:

- The list method `A.extend(B)` attaches list `B` to list `A`

- `[0]*len(df_sent_neg)` creates a list `[0, 0, 0, 0, 0, ...]` of length $N$, with $N$ being the number of (here, negative) sentences

- `np.array(A)` converts a normal n-dimensional Python list into an n-dimensional numpy array (see `import numpy as np` above). This is not crucial since the methods for training and testing accept both standard lists and numpy arrays as input, but numpy arrays come with a long list of useful functions and features.

In [3]:
# Create a list for all sentences and add the sentences from both files
sentences = []
sentences.extend(df_sent_neg[0].tolist())
sentences.extend(df_sent_pos[0].tolist())

# Create a list for all labels
polarities = []
polarities.extend([0]*len(df_sent_neg))
polarities.extend([1]*len(df_sent_pos))

# Convert from lists to numpy arrays
sentences = np.array(sentences)
polarities = np.array(polarities)

Right now, the dataset contains 5,331 negative sentences followed by 5,331 positive sentences. Before we can split the dataset into training and test data, we first have to shuffle the order so that both the training and the test data contain a balanced mix of the two classes. Some additional explanations:

- `combined = list(zip(sentences, polarities))`: We have two lists containing the sentences and the labels. Of course, we have to ensure that both lists are shuffled the same way so that each label stays associated with the same sentence. The `zip()` function accomplishes this, both zipping and unzipping.

- `random.seed(int)` (optional): the `shuffle()` function does not truly randomize the order of the elements of a list, but generates a "pseudo-randomized" order. By providing a fixed seed, we can therefore ensure that shuffling always returns the same "random" order. This makes the whole process deterministic and can be useful for finding problems.

In [8]:
combined = list(zip(sentences, polarities))

# random.seed(1)  # optional: uncomment for a reproducible shuffle
random.shuffle(combined)

# split the "zipped" list into the two lists of sentences and labels/polarities
sentences[:], polarities[:] = zip(*combined)
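
The effect of the seed and of shuffling the zipped pairs can be checked on a toy example. The following is a minimal sketch in plain Python; the sentences, the labels, and the helper `shuffle_together` are made up for illustration, and it uses a separate seeded `random.Random` generator instead of the global `random.seed()` so that it does not interfere with the rest of the notebook.

```python
import random

# Hypothetical toy data: label 1 = positive, 0 = negative
toy_sentences = ["good", "great", "fine", "bad", "awful", "poor"]
toy_labels = [1, 1, 1, 0, 0, 0]
label_of = dict(zip(toy_sentences, toy_labels))


def shuffle_together(sentences, labels, seed):
    """Shuffle both lists with the same permutation (the inputs stay untouched)."""
    combined = list(zip(sentences, labels))
    # Seeded generator, independent of the global random state
    random.Random(seed).shuffle(combined)
    shuffled_sentences, shuffled_labels = zip(*combined)
    return list(shuffled_sentences), list(shuffled_labels)


first_s, first_l = shuffle_together(toy_sentences, toy_labels, seed=1)
second_s, second_l = shuffle_together(toy_sentences, toy_labels, seed=1)

# The same seed yields the same order
print(first_s == second_s and first_l == second_l)
# Every sentence still carries its original label
print(all(label_of[s] == l for s, l in zip(first_s, first_l)))
```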

### Generate training and test data

Given 100% of the data, a common way is to split it into 80% (or 90%) training data and 20% (or 10%) test data. Training is done using only the training data, and testing using only the test data. Making meaningful statements about the effectiveness of a classifier requires (at least) that the testing is done using data the classifier has never seen before. Some additional explanations:

- `A[:n]` returns the first $n$ elements of list A

- `A[n:]` returns all the elements after the $n$-th element of list A

In [9]:
# Let's go for an 80%/20% split -- you can change the value and see its effects
train_test_ratio = 0.8

# Calculate the size of the training data (the size of the test data is also implicitly given)
train_set_size = int(train_test_ratio * len(sentences))

# Split data and labels into training and test data with respect to the size of the training data
X_train, X_test = sentences[:train_set_size], sentences[train_set_size:]
y_train, y_test = polarities[:train_set_size], polarities[train_set_size:]

print("Size of training set: {}".format(len(X_train)))
print("Size of test: {}".format(len(X_test)))

Size of training set: 8529
Size of test: 2133
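
The manual shuffle-and-slice approach can also be delegated to scikit-learn. The sketch below uses `train_test_split` from `sklearn.model_selection` on made-up toy data (all variable names are chosen for this example only). The `stratify` argument additionally keeps the class proportions identical in both parts, which plain shuffling only achieves approximately.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 negative (0) and 10 positive (1) "sentences"
toy_sentences = ["sentence {}".format(i) for i in range(20)]
toy_labels = [0] * 10 + [1] * 10

# 80%/20% split; shuffling is done internally, random_state makes it repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_sentences, toy_labels, test_size=0.2, stratify=toy_labels, random_state=1
)

print("Size of training set: {}".format(len(X_tr)))
print("Size of test set: {}".format(len(X_te)))
print("Positive labels in test set: {}".format(sum(y_te)))
```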


## Generate feature set

The feature set is the document n-gram matrix with n-grams of size 1 to 3. You can change this value, e.g., `ngram_range=(1, 1)` to consider only unigrams (i.e., individual words/tokens) or `ngram_range=(1, 2)` to consider only unigrams and bigrams. Larger values are less common since the size of the feature vectors quickly explodes.

First, we define the `TfidfVectorizer` given our specification...

In [10]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3))

...and then generate the feature set as the document n-gram matrix.

In [11]:
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
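
How quickly the number of features grows with the n-gram size can be seen on a tiny corpus. This is only an illustrative sketch: the two documents are made up, and the counts hold for the default tokenizer settings of `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of two short documents
toy_corpus = ["the movie was great", "the movie was boring"]

vocabulary_sizes = {}
for max_n in [1, 2, 3]:
    vectorizer = TfidfVectorizer(ngram_range=(1, max_n))
    matrix = vectorizer.fit_transform(toy_corpus)
    # One row per document, one column per distinct n-gram
    vocabulary_sizes[max_n] = matrix.shape[1]
    print("ngram_range=(1, {}): matrix shape {}".format(max_n, matrix.shape))
```

Even with only five distinct words, adding bigrams and trigrams more than doubles the number of columns; on the real dataset the effect is far larger.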

## Train classifier

We use the Multinomial Naive Bayes (`MultinomialNB`) classifier, which usually provides good results for word/n-gram features.

In [19]:
mnb_classifier = MultinomialNB().fit(X_train_tfidf, y_train)

## Test classifier

We first need to generate the document n-gram matrix from the test data using the vocabulary that the vectorizer derived from the training data: `transform()` instead of `fit_transform()`.

In [20]:
X_test_tfidf = tfidf_vectorizer.transform(X_test)
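
The difference between `fit_transform()` and `transform()` is easiest to see with a test sentence that contains a word the vectorizer has never seen. The following sketch uses made-up toy data and unigrams only, purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data
toy_train = ["good movie", "bad movie"]
toy_test = ["good film"]  # "film" never occurs in the training data

vectorizer = TfidfVectorizer()  # unigrams only, to keep the example small
train_matrix = vectorizer.fit_transform(toy_train)  # learns vocabulary and idf weights
test_matrix = vectorizer.transform(toy_test)        # reuses them, learns nothing new

print(sorted(vectorizer.vocabulary_))         # vocabulary learned from the training data
print(train_matrix.shape, test_matrix.shape)  # same number of columns
print(test_matrix.nnz)                        # number of non-zero entries in the test row
```

The unseen word is silently dropped, so the test matrix has exactly the columns the classifier was trained on.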

We can now use the trained classifier to predict the polarities for the test data.

In [21]:
y_pred = mnb_classifier.predict(X_test_tfidf)

`classification_report()` is a useful function to quickly show the results of the evaluation by means of precision, recall, and f1-score for each class (here only two) as well as the average precision, recall, and f1-score. By default, the average is a weighted average. The weight is the support, i.e., the number of data items for each class.

In [22]:
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.78      0.82      0.80      1074
          1       0.81      0.77      0.79      1059

avg / total       0.79      0.79      0.79      2133
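
To make the numbers in such a report less of a black box, the sketch below recomputes precision, recall, f1-score, and the support-weighted average by hand. It is plain Python on made-up labels; the helper `per_class_scores` is hypothetical and not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1, tp + fn


# Hypothetical toy labels and predictions
toy_true = [0, 0, 0, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 0, 0]

scores = {label: per_class_scores(toy_true, toy_pred, label) for label in (0, 1)}
for label, (precision, recall, f1, support) in scores.items():
    print("class {}: precision={:.2f} recall={:.2f} f1={:.2f} support={}".format(
        label, precision, recall, f1, support))

# Weighted average: each class contributes in proportion to its support
total = sum(s[3] for s in scores.values())
weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total
print("weighted f1: {:.2f}".format(weighted_f1))
```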



## Training using k-fold cross validation

K-fold cross validation splits a dataset into $k$ equally sized blocks and performs $k$ training-testing cycles, each time using $k-1$ blocks for training and the remaining block for testing, so that every block serves as the test block exactly once.
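
The block structure can be made visible with the `KFold` class imported at the top. This is a small sketch on a made-up dataset of 10 items with $k=5$ and no shuffling, so the blocks are simply consecutive index ranges.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy dataset with 10 items, identified by their index
toy_items = np.arange(10)

k_fold = KFold(n_splits=5)  # no shuffling: the blocks are consecutive
folds = []
for train_index, validation_index in k_fold.split(toy_items):
    folds.append((train_index.tolist(), validation_index.tolist()))
    print("train: {} validation: {}".format(train_index.tolist(), validation_index.tolist()))
```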

- `cross_val_score()` is a handy function to automate the k-fold cross validation

Note that in this example we use only the training data (`X_train_tfidf` and `y_train`) for the cross validation, which is only 80% of the whole dataset. In each fold, this training data is in turn split into the data actually used for training and the validation data.

In [23]:
f1_scores_list = cross_val_score(MultinomialNB(), X_train_tfidf, y_train, cv=10, scoring ='f1')

print(f1_scores_list)

[0.7957497  0.7719715  0.79627474 0.77180406 0.77974087 0.7688734
 0.76407186 0.78271309 0.77584204 0.78708134]


Usually, the reported results are the average score and its standard deviation.

In [24]:
print("f1 score (mean/average): {:.3f}".format(f1_scores_list.mean()))
print("f1 score (standard deviation): {:.3f}".format(f1_scores_list.std()))

f1 score (mean/average): 0.779
f1 score (standard deviation): 0.010


A low standard deviation for the results is a good sign. If some scores differ too much from the rest, this usually indicates that (a) the dataset is not well shuffled, so that the training and test data are not balanced in every fold, or that (b) the dataset is simply not large enough to properly learn in all cases.

For the sake of completeness, the block below performs the k-fold cross validation over the whole dataset.

In [25]:
# Convert all sentences (100% of the dataset) into the feature set
sentences_tfidf = tfidf_vectorizer.fit_transform(sentences)

# Perform 10-fold cross validation over all sentences
f1_scores_list = cross_val_score(MultinomialNB(), sentences_tfidf, polarities, cv=10, scoring ='f1')

# Print reported numbers
print("f1 score (mean/average): {:.3f}".format(f1_scores_list.mean()))
print("f1 score (standard deviation): {:.3f}".format(f1_scores_list.std()))

f1 score (mean/average): 0.783
f1 score (standard deviation): 0.013


The results should be a little bit better since we simply use more data for the training.
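
One caveat with the cells above: the vectorizer is fitted once on all sentences before the folds are created, so the vocabulary and idf weights of every fold have also seen the validation sentences. A possible way to avoid this, sketched below under the assumption that a scikit-learn `Pipeline` is acceptable here, is to pass the raw text to `cross_val_score()` and let the vectorizer be refitted inside each fold. The toy sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 6 positive (1) and 6 negative (0) sentences
toy_sentences = [
    "a great movie", "great acting and a great plot", "truly great fun",
    "great characters", "a great surprise", "simply great",
    "a boring movie", "boring acting and a boring plot", "truly boring stuff",
    "boring characters", "a boring mess", "simply boring",
]
toy_labels = [1] * 6 + [0] * 6

# Vectorizer and classifier are refitted together on the training part of each fold
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
toy_scores = cross_val_score(pipeline, toy_sentences, toy_labels, cv=3, scoring="f1")

print(toy_scores)
print("f1 score (mean/average): {:.3f}".format(toy_scores.mean()))
```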

## A full and proper example

In [26]:
best_score = -1.0
best_classifier = None
best_ngram_size = -1

ngram_sizes = [1, 2, 3, 4]
classifiers = [LinearSVC(), MultinomialNB(), DecisionTreeClassifier()]
#classifiers = [LinearSVC(), MultinomialNB() ]

for classifier in classifiers:
    for s in ngram_sizes:
        # Save the start time
        start_time = datetime.now()
        # Define vectorizer to generate feature set as document n-gram matrix
        tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, s))
        # Generate feature set as document n-gram matrix
        X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
        # Perform 10-fold cross validation only over the training data
        f1_scores_list = cross_val_score(classifier, X_train_tfidf, y_train, cv=10, scoring ='f1')
        # Calculate the average score
        average_f1_score = f1_scores_list.mean()
        # Calculate the required runtime for this parameter setting
        time_elapsed = datetime.now() - start_time
        # Print results for current setting
        print("Classifier: {}, n-gram size: {} ==> f1-score: {:.3f} [{}]".format(type(classifier).__name__, s, average_f1_score, time_elapsed))
        # If the average score is better than the current best score, save all the current parameter values
        if average_f1_score > best_score:
            best_score = average_f1_score
            best_ngram_size = s
            best_classifier = classifier

print()
print("Best f1-score: {:.3f} [classifier: {},n-gram size: {}]".format(best_score, type(best_classifier).__name__, best_ngram_size))

Classifier: LinearSVC, n-gram size: 1 ==> f1-score: 0.770 [0:00:00.977359]
Classifier: LinearSVC, n-gram size: 2 ==> f1-score: 0.777 [0:00:01.729566]
Classifier: LinearSVC, n-gram size: 3 ==> f1-score: 0.772 [0:00:02.716950]
Classifier: LinearSVC, n-gram size: 4 ==> f1-score: 0.770 [0:00:03.818618]
Classifier: MultinomialNB, n-gram size: 1 ==> f1-score: 0.777 [0:00:00.466956]
Classifier: MultinomialNB, n-gram size: 2 ==> f1-score: 0.782 [0:00:01.180710]
Classifier: MultinomialNB, n-gram size: 3 ==> f1-score: 0.779 [0:00:02.177325]
Classifier: MultinomialNB, n-gram size: 4 ==> f1-score: 0.778 [0:00:03.262065]
Classifier: DecisionTreeClassifier, n-gram size: 1 ==> f1-score: 0.611 [0:00:30.683293]
Classifier: DecisionTreeClassifier, n-gram size: 2 ==> f1-score: 0.615 [0:01:42.058731]
Classifier: DecisionTreeClassifier, n-gram size: 3 ==> f1-score: 0.608 [0:03:09.032940]
Classifier: DecisionTreeClassifier, n-gram size: 4 ==> f1-score: 0.599 [0:04:56.483723]

Best f1-score: 0.782 [classifie

Two important things to notice:

- More features do not automatically yield better results

- More features usually also result in longer training times
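
The nested loops over classifiers and n-gram sizes can also be expressed with scikit-learn's `GridSearchCV`. The following is only a sketch on made-up toy data, not a replacement for the run above: the step names `"tfidf"` and `"clf"` are chosen freely for this example, and only two of the classifiers are included to keep it short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 6 positive (1) and 6 negative (0) sentences
toy_sentences = [
    "a great movie", "great acting and a great plot", "truly great fun",
    "great characters", "a great surprise", "simply great",
    "a boring movie", "boring acting and a boring plot", "truly boring stuff",
    "boring characters", "a boring mess", "simply boring",
]
toy_labels = [1] * 6 + [0] * 6

# Step names "tfidf" and "clf" are arbitrary labels used in the parameter grid
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# Every combination of n-gram size and classifier is evaluated by cross validation
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf": [MultinomialNB(), LinearSVC()],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(toy_sentences, toy_labels)

print("Best f1-score: {:.3f}".format(search.best_score_))
print("Best setting: {}".format(search.best_params_))
```

After fitting, `search.best_estimator_` holds the best pipeline retrained on all the data passed to `fit()`, which corresponds to the retraining step done by hand in the next section.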

### Testing with the best parameter setting(s)

Having identified the best parameter settings (i.e., which classifier and which n-gram size), we can train a classifier over the whole training data. Note that this is the one and only time we touch the test data `X_test`. The k-fold cross validation was done using `X_train` only, splitting it into the actual training set and the validation set for each fold.

In [27]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, best_ngram_size))

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

best_classifier = best_classifier.fit(X_train_tfidf, y_train)
y_pred = best_classifier.predict(X_test_tfidf)



print(classification_report(y_test, y_pred))
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

             precision    recall  f1-score   support

          0       0.79      0.82      0.80      1074
          1       0.81      0.78      0.79      1059

avg / total       0.80      0.80      0.80      2133

Accuracy: 0.797


Again, the final results are a bit better than the ones from the cross validation since the training has been done over the whole training data. These are the numbers that are usually reported. That the average f1-score and the accuracy are almost identical is not a given, but is the result of an almost perfectly balanced dataset.

## Interpretation of results

### A random classifier (flipping a fair coin - 2 classes!) and a stupid classifier (always says "positive")

In [17]:
def predict_random(y):
    y_pred_random = []
    for _ in range(len(y)):
        y_pred_random.append(random.randint(0, 1))
    return y_pred_random
    
def predict_stupid(y):
    y_pred_stupid = []
    for _ in range(len(y)):
        y_pred_stupid.append(1)
    return y_pred_stupid

In [18]:
y_pred_random = predict_random(y_test)
y_pred_stupid = predict_stupid(y_test)
    
print("Results for random classifier:")
print(classification_report(y_test, y_pred_random))
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred_random)))
print()
print("------------------------------------------------------------------")
print()
print("Results for stupid classifier:")
print(classification_report(y_test, y_pred_stupid))
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred_stupid)))
print()

Results for random classifier:
             precision    recall  f1-score   support

          0       0.48      0.48      0.48      1045
          1       0.50      0.49      0.49      1088

avg / total       0.49      0.49      0.49      2133

Accuracy: 0.488

------------------------------------------------------------------

Results for stupid classifier:
             precision    recall  f1-score   support

          0       0.00      0.00      0.00      1045
          1       0.51      1.00      0.68      1088

avg / total       0.26      0.51      0.34      2133

Accuracy: 0.510
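
scikit-learn ships comparable baselines in `sklearn.dummy.DummyClassifier`. The sketch below shows the two strategies that roughly correspond to the hand-written functions above, using made-up, perfectly balanced toy labels. The feature matrix is a placeholder, since these baselines ignore the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels; the features are ignored by DummyClassifier
toy_labels = np.array([0, 1] * 50)
toy_features = np.zeros((len(toy_labels), 1))

# Counterpart of the random classifier: each class is predicted with equal probability
random_baseline = DummyClassifier(strategy="uniform", random_state=1)
# Counterpart of the classifier that always says "positive"
constant_baseline = DummyClassifier(strategy="constant", constant=1)

for name, baseline in [("random", random_baseline), ("constant", constant_baseline)]:
    baseline.fit(toy_features, toy_labels)
    predictions = baseline.predict(toy_features)
    print("{} baseline accuracy: {:.3f}".format(name, accuracy_score(toy_labels, predictions)))

constant_predictions = constant_baseline.predict(toy_features)
```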





### Generation of skewed dataset

The following steps generate an imbalanced dataset. Since we do not really train anything here, we only need to do this for the labels, and the order does not matter at all. So the final list of labels contains, e.g., 90% 1's and 10% 0's.

In [19]:
# Set the ratio by which the labels are skewed towards "positive"
skew_ratio = 0.9

pos_set_size = int(skew_ratio * len(y_test))

y_test_pos = [1] * pos_set_size
y_test_neg = [0] * (len(y_test) - pos_set_size)

y_test_skewed = np.concatenate((y_test_pos, y_test_neg))

We can now evaluate the random and stupid classifiers over the skewed dataset.

In [20]:
y_pred_skewed_random = predict_random(y_test_skewed)
y_pred_skewed_stupid = predict_stupid(y_test_skewed)
    
print("Results for random classifier:")
print(classification_report(y_test_skewed, y_pred_skewed_random))
print("Accuracy: {:.3f}".format(accuracy_score(y_test_skewed, y_pred_skewed_random)))
print()
print("------------------------------------------------------------------")
print()
print("Results for stupid classifier:")
print(classification_report(y_test_skewed, y_pred_skewed_stupid))
print("Accuracy: {:.3f}".format(accuracy_score(y_test_skewed, y_pred_skewed_stupid)))
print()

Results for random classifier:
             precision    recall  f1-score   support

          0       0.11      0.54      0.18       214
          1       0.91      0.50      0.65      1919

avg / total       0.83      0.51      0.60      2133

Accuracy: 0.508

------------------------------------------------------------------

Results for stupid classifier:
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       214
          1       0.90      1.00      0.95      1919

avg / total       0.81      0.90      0.85      2133

Accuracy: 0.900





The most important result is that the stupid classifier has a good accuracy since it is right most of the time due to the skewed dataset. Just looking at the accuracy can lead to wrong conclusions. In reality, most datasets are not perfectly balanced -- at least if they are not carefully handcrafted. The weighted average scores also look pretty good, with an f1-score of 0.85 (compared to the 0.79 of the true classifier). Only the individual results for each class show the real picture.
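
One way to condense the per-class picture into a single number is to average the per-class scores without weighting them by support (a macro average). The sketch below does this for recall in plain Python on made-up skewed labels; the helper functions are hypothetical. scikit-learn offers the same quantity as `balanced_accuracy_score` in `sklearn.metrics`.

```python
def accuracy(y_true, y_pred):
    """Fraction of all items that are predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall_for(y_true, y_pred, label):
    """Fraction of the items of class `label` that are predicted as `label`."""
    predictions_for_class = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(1 for p in predictions_for_class if p == label) / len(predictions_for_class)


# Hypothetical skewed labels: 90% positive, 10% negative
skewed_true = [1] * 90 + [0] * 10
always_positive = [1] * len(skewed_true)

plain_accuracy = accuracy(skewed_true, always_positive)
# Macro average of the per-class recalls: both classes count equally
macro_recall = (recall_for(skewed_true, always_positive, 0)
                + recall_for(skewed_true, always_positive, 1)) / 2

print("Accuracy: {:.3f}".format(plain_accuracy))
print("Macro-averaged recall: {:.3f}".format(macro_recall))
```

For the always-positive classifier the macro-averaged recall is no better than that of a coin flip, even though the accuracy looks excellent.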