# COSI140b: A Brief Tutorial on Machine Learning with NLTK/Python
Keigh Rim, 4/21/2017


## Intro

This is a tutorial on machine learning using NLTK in Python. It is based on Python 2.7 and NLTK 3.

The tutorial is about 
* preparing a proper representation of data for machine learning
* training machine learning algorithms in NLTK
* testing the trained algorithms 

It is not about 
* engineering features
* theoretical aspects of machine learning

Now, let's start with importing nltk.

In [None]:
import nltk

## Supervised Classification

Here we will write Python code to train and test out a classifier using NLTK packages.

Our example task is to train a Naive Bayes classifier that predicts the gender of the author given a blog post. 
As our dataset, we will use [`blog-gender-dataset`](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html). For details, refer to the paper:
* Mukherjee, A., & Liu, B. (2010, October). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (pp. 207-217). Association for Computational Linguistics.

In the dataset, we have around 3,000 blog posts, their raw unicode texts and their labels ('F', 'M') indicating the authors' gender. Let's take a peek at it. (The file is in CSV format.)


In [None]:
import csv

with open('blog-gender-dataset.csv') as dataset: 
    for i, row in enumerate(csv.reader(dataset)):
        print(row)
        if i > 0: break

These are our *raw* data.

### Feature Representation

So, as we saw above, our *raw* dataset is simply a set of <**unicode string**, **label**> pairs.

Now we are turning this into a <**features**, **label**> set. That is, we will write this piece of code. 

In [None]:
def normalize_label(label):
    return label.strip().upper()

def unicodify(text):
    return unicode(text, 'utf-8')
    
def extract_features(text):
    raise NotImplementedError
    
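# consider lazy evaluation for the real implementation
with open('blog-gender-dataset.csv') as dataset_file:
    raw_data = [(unicodify(text), normalize_label(label)) 
                for text, label in csv.reader(dataset_file) 
                if label]  # some items are missing label

# a list of (features, label) pairs; the features will be dicts, 
# which are not hashable and so cannot serve as dict keys
# (this raises NotImplementedError until extract_features() is written)
feature_representation = [(extract_features(text), label) 
                          for text, label in raw_data]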
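To see what the label clean-up does without opening the real file, here is a small sketch on made-up rows. The texts and labels in `sample_lines` are invented for illustration, and the `label.strip()` check is a slightly stricter variant of the `if label` test above (it also drops labels that are only whitespace).

```python
import csv

# Hypothetical lines standing in for the CSV file; not taken from the dataset.
sample_lines = [
    'I love gardening,F',
    '"Fixed my bike, again", m ',   # quoted text with a comma, untidy label
    'No label here,',               # empty label
    'Blank label,  ',               # whitespace-only label
]

def normalize_label(label):
    return label.strip().upper()

# csv.reader accepts any iterable of lines, so a list works as well as a file.
cleaned = [(text, normalize_label(label))
           for text, label in csv.reader(sample_lines)
           if label.strip()]

print(cleaned)
```

Only the first two rows survive, and their labels come out as `'F'` and `'M'`.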

#### Defining features
Then, what is a feature? A feature is simply a fragmentary/single-faceted description of an instance/item in the data. It can be anything, for example: 
* *Does the document have a proper name in it? - **Yes***  (Binary feature)
* *How many characters are appearing in the document? - **3***  (Discrete numerical feature)
* *What is the most frequent non-stopword in the document? - **'time'*** (Nominal feature)

A set of features is exactly the way we want to describe an item. And the set of feature values for an item is the representation of that item in the model we design. Thus, after all, we are transforming a document into a set of name-value pairs.

However, to statistically model the data, each description (feature value) **needs to be a number**! Binary and numerical features are inherently convertible to numbers. Nominal or ordinal features can be mapped to numbers using random variables. Designing such random variables is beyond the scope of this tutorial and, for the sake of simplicity, we will be using only binary features in the rest of this section. 
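Just to give a flavor of how a nominal feature can end up as numbers, here is a minimal sketch of one common approach (one binary indicator per possible value, often called one-hot encoding). The function name `one_hot` and the list of possible values are assumptions made for this illustration; the rest of the tutorial does not use them.

```python
def one_hot(feature_name, value, possible_values):
    """Expand one nominal feature into one binary feature per possible value."""
    return {feature_name + '=' + candidate: (value == candidate)
            for candidate in possible_values}

# Hypothetical nominal feature: the most frequent non-stopword in a document.
encoded = one_hot('most_frequent_word', 'time', ['time', 'arrow', 'fun'])

for name in sorted(encoded):
    print(name + ': ' + str(encoded[name]))
```

Exactly one of the resulting binary features is `True`, so the nominal value is now expressed with the same kind of features we use in the rest of this section.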



#### Getting feature values
Next question: how can we get values of the features? For instance, say we have a binary feature that represents whether a document contains the word 'time'. How do we get the value for it? Let's take this feature function as an answer. 

In [None]:
def has_time(document):
    # returns a feature (yes | no) value directly
    return 'time' in document.split()  

Besides the `has_time()` function, suppose we also have `has_flies()`, `has_like()`, `has_an()`, and `has_arrow()` functions. 

In [None]:
def has_flies(document):
    return 'flies' in document.split()
def has_like(document):
    return 'like' in document.split()
def has_an(document):
    return 'an' in document.split()
def has_arrow(document):
    return 'arrow' in document.split()

Then these two short documents 

In [None]:
document1 = 'time flies like an arrow'
document2 = 'time flies when you are having fun'

can be transformed into feature representations, or projected into the feature space, like this: 

In [None]:
doc1_features = {'has_time' : has_time(document1) ,
                'has_flies' : has_flies(document1) ,
                'has_like' : has_like(document1) ,
                'has_an' : has_an(document1) ,
                'has_arrow' : has_arrow(document1)}
doc2_features = {'has_time' : has_time(document2) ,
                'has_flies' : has_flies(document2) ,
                'has_like' : has_like(document2) ,
                'has_an' : has_an(document2) ,
                'has_arrow' : has_arrow(document2)}
print 'document1: ', doc1_features
print 'document2: ', doc2_features

Then, we can factor out the common "extracting" procedure as our `extract_features()` function.

In [None]:
feature_functions = [has_time, 
                     has_flies, 
                     has_like, 
                     has_an, 
                     has_arrow ] 

def extract_features(document):
    return {feature_function.__name__: feature_function(document) 
            for feature_function in feature_functions}

print 'document1: ', extract_features(document1)

*Note that each feature function here returns the value of the feature directly.*
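Writing one `has_...()` function per word by hand gets tedious quickly. As a sketch of one way around that, the feature functions can be generated from a word list. The names `make_has_word`, `generated_functions`, and `extract_generated_features` are made up for this example so that they do not clash with the definitions above.

```python
def make_has_word(word):
    """Build a binary feature function that checks for one specific word."""
    def has_word(document):
        return word in document.split()
    # Give the function a readable name, since the name becomes the feature name.
    has_word.__name__ = 'has_' + word
    return has_word

generated_functions = [make_has_word(word)
                       for word in ['time', 'flies', 'like', 'an', 'arrow']]

def extract_generated_features(document):
    return {function.__name__: function(document)
            for function in generated_functions}

features = extract_generated_features('time flies when you are having fun')
print(features)
```

For this document, `has_time` and `has_flies` are `True` and the other three features are `False`.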

Okay, that was a silly example. Now let's consider a more realistic one using `blog-gender-dataset`. 

Let's say we want to use all the unigrams as our features, using simple whitespace tokenization. Then we are going to have ...

In [None]:
vocabulary = set()
for document, _ in raw_data:
    vocabulary.update(document.split())
print(len(vocabulary))

Yup, we are going to have 128k features. So, do we need to write more than a hundred thousand feature functions? Of course not. We can rewrite the `extract_features()` function so that we do not have to write 128k other functions, like so: 

In [None]:
def extract_features(document):
    # now each feature_function can handle multiple features
    # and should return the values wrapped in a dict
    features = {}
    for feature_function in feature_functions:
        features.update(feature_function(document))  
    return features

def unigram_bow(document):
    return {token: (token in document.split()) for token in vocabulary}
    
feature_functions = [unigram_bow]

Now, how can you write another feature function `bigram_bow()` and incorporate it into `extract_features()`? And then how many features are we going to have? If we iterate through all three thousand documents in the dataset and compute all the features, how long would it take? Can you write something more efficient? For example, are all the word types in the corpus, including stopwords, going to be helpful features? How about stemming and lemmatization?
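As a starting point for the first question, here is one possible sketch of `bigram_bow()`, not the only answer. It assumes the same whitespace tokenization as before and lists only the bigrams that actually occur in the document. The key is built by joining the two tokens with a space, which cannot collide with a unigram key because whitespace tokens never contain a space.

```python
def bigram_bow(document):
    """Return a binary feature for every adjacent token pair in the document."""
    tokens = document.split()
    # zip(tokens, tokens[1:]) pairs each token with its right neighbor.
    return {first + ' ' + second: True
            for first, second in zip(tokens, tokens[1:])}

bigram_features = bigram_bow('time flies like an arrow')
for name in sorted(bigram_features):
    print(name)
```

Since it returns a dict, it can sit next to `unigram_bow` in the `feature_functions` list and `extract_features()` will merge the two.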

#### Feature representation for NLTK classifiers
Naive Bayes and MaxEnt in NLTK take a list of (feature_representation, label) tuples as a training set. And they are smart enough to automatically treat unspecified features as having *null* values (`False`, `None`, `0`, etc.) by default. As a result, to get `unigram_bow()` features, for example, we don't need to iterate over the whole vocabulary over and over. Rather, we rewrite the function:

In [None]:
def unigram_bow_rewrite(document):
    return {token: True for token in document.split()}


feature_functions = [unigram_bow_rewrite]
print(extract_features(raw_data[0][0]))

Putting it all together, we finally translate the entire dataset into this `feature_representation`. 

In [None]:
feature_representation = [(extract_features(raw_text), label) 
                          for raw_text, label in raw_data]

Lastly, let's split the dataset into the training set and the test set for testing the algorithm.

In [None]:
testset_percentage = 10

def split_dataset(dataset, testset_percentage):
    # floor division keeps the cutoff an integer, so it is a valid slice index
    cutoff = testset_percentage * len(dataset) // 100
    return dataset[cutoff:], dataset[:cutoff]

trainset, testset = split_dataset(feature_representation, 
                                  testset_percentage)

print len(trainset)
print len(testset)
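Note that `split_dataset()` simply takes the first 10% of the items as the test set. If the file happens to be ordered in some way (an assumption; we have not checked this for our dataset), a shuffled split may be more representative. Here is a minimal sketch; `shuffled_split` and its `seed` parameter are names made up for this example.

```python
import random

def shuffled_split(dataset, testset_percentage, seed=0):
    """Shuffle a copy of the dataset, then split it into (trainset, testset)."""
    shuffled = list(dataset)               # copy, so the original order is kept
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cutoff = testset_percentage * len(shuffled) // 100
    return shuffled[cutoff:], shuffled[:cutoff]

toy_dataset = list(range(20))
toy_train, toy_test = shuffled_split(toy_dataset, 10)
print(len(toy_train))
print(len(toy_test))
```

Using a fixed seed means the same items land in the test set on every run, which keeps the accuracy figures comparable between experiments.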

### Training classifiers

Once we have the feature representation, training a classifier is straightforward. We only need to call the `train()` method on our data. 


In [None]:
nb_classifier = nltk.classify.NaiveBayesClassifier.train(trainset)

Other than Naive Bayes, NLTK comes with MaxEnt, decision tree, and many other classifier implementations. For details, refer to [the official documentation](http://www.nltk.org/api/nltk.classify.html). 

### Testing a classifier

Now let's see how good our Naive Bayes classifier is. After training, the classifier can classify documents through `classify()` calls. 

In [None]:
predictions = [nb_classifier.classify(test_document) 
               for test_document, _ in testset]

We can measure its performance in many different ways. First, let's see its accuracy.

In [None]:
def accuracy(predictions, golds):
    return (len([p for p, g in zip(predictions, golds) if p == g]) 
            / float(len(golds)))

golds = [label for _ , label in testset]
print(accuracy(predictions, golds))

NLTK provides convenient functions to print a confusion matrix and to compute precision and recall. However, the precision and recall functions need a different data structure: for each label, the set of IDs of the documents that carry that label. 

In [None]:
from collections import defaultdict as ddict

def to_labeled_ids(labeled_data):
    d = ddict(set)
    for doc_id, label in enumerate(labeled_data):
        d[label].add(doc_id)
    return d

def print_clf_scores(gld, hyp):
    print nltk.ConfusionMatrix(gld, hyp)

    gld_sets = to_labeled_ids(gld)
    hyp_sets = to_labeled_ids(hyp)

    from nltk.metrics import scores
    for label in set(gld):
        r = scores.recall(gld_sets[label], hyp_sets[label])
        p = scores.precision(gld_sets[label], hyp_sets[label])
        f = scores.f_measure(gld_sets[label], hyp_sets[label])
        print('<{}> P: {:.2}, R: {:.2}, F: {:.2}'.format(label, p, r, f))
 
print_clf_scores(golds, predictions)
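To make the set-based scores less of a black box, here is a small hand-rolled sketch of precision and recall over ID sets. The functions `set_precision` and `set_recall` and the toy labels are made up for this example; they follow the usual definitions and are not NLTK's own code.

```python
def set_precision(reference, test):
    """Fraction of predicted IDs that are correct (None if nothing was predicted)."""
    return len(reference & test) / float(len(test)) if test else None

def set_recall(reference, test):
    """Fraction of gold IDs that were found (None if there are no gold IDs)."""
    return len(reference & test) / float(len(reference)) if reference else None

toy_gold = ['F', 'M', 'F', 'M', 'F']
toy_pred = ['F', 'F', 'F', 'F', 'M']

# Document IDs are just positions in the list.
gold_f = set(i for i, label in enumerate(toy_gold) if label == 'F')
pred_f = set(i for i, label in enumerate(toy_pred) if label == 'F')

p = set_precision(gold_f, pred_f)
r = set_recall(gold_f, pred_f)
f = 2 * p * r / (p + r)   # balanced F-measure (harmonic mean of P and R)
print('<F> P: {:.2}, R: {:.2}, F: {:.2}'.format(p, r, f))
```

Here two of the four documents predicted as 'F' are correct (P = 0.5), and two of the three gold 'F' documents are found (R = 2/3).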

Last but definitely not least, NLTK provides a method to rank the helpfulness of features.

In [None]:
nb_classifier.show_most_informative_features(10)

### Using `scikit-learn` wrapper in `NLTK`

NLTK also has a wrapper class that wraps around `scikit-learn` classifiers. Here's an example of using a support vector classifier from scikit-learn (`sklearn.svm.SVC`). Note that you should have `scikit-learn` installed.

In [None]:
from sklearn.svm import SVC
from nltk.classify.scikitlearn import SklearnClassifier

linearsvm_clf = SklearnClassifier(SVC(kernel='linear')).train(trainset)

The way it wraps a `scikit-learn` classifier is pretty straightforward: 1) import the classifier class (*scikit-learn uses the term `estimator`*), 2) create an instance of it, and 3) wrap it in the `nltk.classify.scikitlearn.SklearnClassifier` class. 

Now let's have another SVM using a different kernel function, just for fun. 

In [None]:
radialsvm_clf = SklearnClassifier(SVC(kernel='rbf')).train(trainset)

And let's see how they work. Note that here we use the `classify_many` method. 

In [None]:
svm_predictions = [clf.classify_many([test_doc for test_doc, _ in testset]) 
                   for clf in [linearsvm_clf, radialsvm_clf]]
print(accuracy(svm_predictions[0], golds))
print(accuracy(svm_predictions[1], golds))

So far, we have seen how to write Python code to transform documents into feature sets and to train/test a classifier for a simple classification problem. Now let's take a look at different types of problems.

## Clustering

### k-means clustering

NLTK clustering implementations require a **complete vector representation** (using `numpy.ndarray`) of the corpus, not `dict`-based feature sets.

Let's do this with 128k unigram features, which will result in a very sparse feature vector. 

In [None]:
import numpy 

feature_indices = list(vocabulary) # we need indexed feature names
# map each word type to its column once; list.index() would rescan the list per word
feature_positions = {word_type: i for i, word_type in enumerate(feature_indices)}

feature_vectors = []   # this will be a #documents X #features matrix
                       # a huge and sparse one
    
def generate_document_vector(document):
    document_vector = numpy.zeros(len(feature_indices))
    for word_type in set(document.split()):
        document_vector[feature_positions[word_type]] = 1
    return document_vector

feature_vectors = [generate_document_vector(document) 
                   for document, _ in raw_data[:10]]

By the way, [`numpy.ndarray`](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html) is a matrix-like Python object that is highly optimized for numerical computation. Linear algebraic operations, such as arithmetic on matrices or matrix manipulations, are super fast with ndarrays, outperforming list comprehensions in Python by a huge margin (not to mention looping). That is why most scientific packages in Python, including all widely used machine learning packages (scikit-learn, theano, tensorflow, you name it), rely heavily on `numpy` data structures and functions. 

Next, we will build 2 clusters using the k-means algorithm, based on the cosine distance between these vectors. Note that the k-means algorithm starts with random seeds and is not guaranteed to converge to the global optimum. Thus, one might want to repeat the algorithm and then take the most common result as a 'good enough' clustering (by using the `repeats` parameter).

In [None]:
dist = nltk.cluster.cosine_distance
kmc = nltk.cluster.kmeans.KMeansClusterer(2, dist, repeats=10)

clustered = kmc.cluster(feature_vectors, True)
print(clustered)

gold_labels = [int(label=='M') for _, label in raw_data[:10]]
print(gold_labels)
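For intuition about the distance measure, here is a plain-Python sketch of cosine distance (one minus the cosine of the angle between two vectors) applied to our two toy documents from earlier. The function `toy_cosine_distance` and the ten-word toy vocabulary are assumptions for this illustration; in the cell above we use NLTK's own `cosine_distance` on `numpy` arrays.

```python
import math

def toy_cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

toy_vocabulary = ['time', 'flies', 'like', 'an', 'arrow',
                  'when', 'you', 'are', 'having', 'fun']

def toy_vector(document):
    words = set(document.split())
    return [1 if word in words else 0 for word in toy_vocabulary]

vector1 = toy_vector('time flies like an arrow')
vector2 = toy_vector('time flies when you are having fun')

distance = toy_cosine_distance(vector1, vector2)
print(round(distance, 3))
```

The two documents share only 'time' and 'flies', so the distance is fairly large, while the distance from a document to itself is (up to rounding) zero. Also keep in mind that cluster IDs are arbitrary: cluster 0 may correspond to either 'M' or 'F' when you compare against the gold labels.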

Once trained, the clusterer can be used as a sort of classifier, like so:

In [None]:
kmc.classify(numpy.array(generate_document_vector(raw_data[111][0])))

## Supervised Sequential Tagging

### HMM tagger

Here, we are going to train an HMM sequence tagger for Chinese word segmentation, using **BIO tagging**. Included in the directory is the word segmentation annotation of around 20k Chinese sentences from news articles, a small part of the [Chinese Treebank corpus](http://www.cs.brandeis.edu/~clp/ctb/). 
Let's take a look at the annotation.

In [None]:
with open('tagged.seg') as seg_annotation:
    for i, line in enumerate(seg_annotation):
        print line
        if i > 1: break

For details on CTB, see the paper: 
* Xue, N., Xia, F., Chiou, F. D., & Palmer, M. (2005). The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural language engineering, 11(02), 207-238.

The original word segmentation annotation is done by simply inserting whitespaces, and I did BIO tagging based on the annotation last night. It was a complete hack job, and all errors in BIO tagging are my fault. 

To feed the data into the supervised training process implemented in NLTK, we need to prepare sequences of <**observation, state**> tuples. In this case, our observations are simply the unicode code points of each character, and the states are word boundaries encoded with BIO tags. 

In [None]:
tagged_segmentation = []
with open('tagged.seg') as annotation:
    for line in annotation:
        tagged_segmentation.append(
            [tuple(token.split("_")) for token in line.split()])
    
print(len(tagged_segmentation))
print(tagged_segmentation[234])
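In case the tagging scheme is not obvious, here is a minimal sketch of how a whitespace-segmented sentence can be turned into a BIO-tagged character sequence: the first character of each word gets `B`, the rest get `I`. The function `segmentation_to_bio` is made up for this example (it is not the script that produced `tagged.seg`), and Latin letters stand in for Chinese characters. It assumes unicode strings, so that iterating over a word yields characters rather than bytes.

```python
def segmentation_to_bio(segmented_sentence):
    """Turn 'ab c def' into [('a', 'B'), ('b', 'I'), ('c', 'B'), ...]."""
    tagged = []
    for word in segmented_sentence.split():
        for position, char in enumerate(word):
            tagged.append((char, 'B' if position == 0 else 'I'))
    return tagged

toy_tagged = segmentation_to_bio(u'ab c def')
print(toy_tagged)

# The same sequence in the char_TAG layout that the parsing code above expects.
toy_line = ' '.join(char + '_' + tag for char, tag in toy_tagged)
print(toy_line)
```

Each sentence thus becomes one list of (character, tag) tuples, which is the shape of the items in `tagged_segmentation`.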



Okay, now we have about 20k tagged sequences for word segmentation. As always, training starts with splitting the data into the training set and the test set.

In [None]:
trainset, testset = split_dataset(tagged_segmentation, 
                                  testset_percentage)  # = 10

Again, training is fairly straightforward. We need to create a trainer object and then call its `train()` method.

In [None]:
from nltk.tag.hmm import HiddenMarkovModelTrainer
hmm_tagger = HiddenMarkovModelTrainer().train(trainset)
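Conceptually, supervised HMM training mostly boils down to counting: how often one tag follows another (transitions) and how often a tag is paired with a character (emissions). The sketch below shows that idea on toy data. It is only an illustration under simplifying assumptions: it ignores initial-state probabilities and smoothing, and the names `count_hmm_events` and `transition_probability` are made up here.

```python
from collections import Counter

def count_hmm_events(tagged_sequences):
    """Count tag-to-tag transitions and tag-to-symbol emissions."""
    transitions = Counter()
    emissions = Counter()
    for sequence in tagged_sequences:
        previous_tag = None
        for symbol, tag in sequence:
            emissions[(tag, symbol)] += 1
            if previous_tag is not None:
                transitions[(previous_tag, tag)] += 1
            previous_tag = tag
    return transitions, emissions

def transition_probability(transitions, previous_tag, tag):
    """Relative frequency of `tag` among everything that follows `previous_tag`."""
    total = sum(count for (first, _), count in transitions.items()
                if first == previous_tag)
    return transitions[(previous_tag, tag)] / float(total)

toy_sequences = [[('a', 'B'), ('b', 'I'), ('c', 'B')],
                 [('a', 'B'), ('c', 'B'), ('b', 'I')]]
transitions, emissions = count_hmm_events(toy_sequences)
print(transition_probability(transitions, 'B', 'I'))
```

In the toy data `B` is followed by `I` twice and by `B` once, so the estimate for `B` to `I` is 2/3. Unseen characters get a zero count in this sketch, which is one reason real implementations offer smoothing.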

Finally, we have a tagger that can perform word segmentation on an arbitrary Chinese sentence. Let's try it.

In [None]:
def glue_chars(chars_seq):
    return "".join(chars_seq)

def bio_to_whtspc(tagged_seq):
    return glue_chars([" " + char if tag == "B" else char 
                       for char, tag in tagged_seq]).strip()

test_sent_idx = 2

print "ori:", glue_chars([char for char, tag in testset[test_sent_idx]])
print "gld:", bio_to_whtspc(testset[test_sent_idx])
print "hyp:", bio_to_whtspc(hmm_tagger.tag([char for char, tag in testset[test_sent_idx]]))

But how well does it perform in general? We can measure the tagger's accuracy with a simple computation, as we did above. 

In [None]:
predictions= []
gold = []

for sent in testset:
    predictions.extend([predicted for _, predicted 
                        in hmm_tagger.tag([char for char, tag in sent])])
    gold.extend([gold_tag for _, gold_tag in sent])

print "accuracy:", accuracy(predictions, gold)

We can also print the confusion matrix and the P/R/F measures.

In [None]:
print_clf_scores(gold, predictions)