In [None]:
from IPython.display import Image

# Text classification

Detecting patterns and using them to classify text is a key part of NLP. It is also 
an example of one of the central problems of machine learning: "given some data, can I
identify some features that help me to classify that data as having some property that I care about?"

In the context of NLP this might take the form of deciding whether an e-mail message is spam or not;
deciding whether a customer review is positive or negative; deciding what a person's gender is
based on their name; or which language a word is from.

The goal of classification is to choose the correct _label_ for a given _input_. 
In basic classification tasks, each input is considered independently of all
other inputs, and the set of labels is defined in advance. More complex 
classification tasks include multi-class classification (here meaning that more 
than one label is allowed per input, which is often called multi-label 
classification); open-class classification (the set of labels is not defined in 
advance); and sequence classification (a list of inputs is jointly classified).


## Supervised classification
If a classifier is built on training data with the correct label for each input,
it is known as _supervised_ classification.

The general framework for supervised classification involves a training stage and a prediction
stage. During training, a feature extractor is used to convert each input value to a feature set. 
These feature sets capture the basic information about each input that should be used to classify it. 
Pairs of feature sets and labels are fed into the machine learning algorithm to generate a 
classification model.

During the prediction stage, the same feature extractor is used to convert 
previously unseen inputs to feature sets. These feature sets are then 
fed into the model, which generates predicted labels.



![Supervised classification workflow](supervised-classification.png)

Figure from nltk.org/book
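
The two stages can be sketched in a few lines of plain Python. This is only an illustration of the workflow in the figure, not how NLTK implements it: `toy_extractor`, `train_majority_model` and `predict` are hypothetical names, the spam/ham messages are made up, and the "model" simply remembers the most common label seen for each feature set.

```python
from collections import Counter, defaultdict

def toy_extractor(text):
    # Hypothetical feature extractor with a single boolean feature.
    return {'mentions_prize': 'prize' in text.lower()}

def train_majority_model(labeled_inputs, extractor):
    # Training stage: convert each input to a feature set and count labels.
    counts = defaultdict(Counter)
    for text, label in labeled_inputs:
        key = tuple(sorted(extractor(text).items()))
        counts[key][label] += 1
    # The "model" maps each feature set to its most frequent label.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def predict(model, text, extractor, default=None):
    # Prediction stage: the SAME extractor is applied to the unseen input.
    key = tuple(sorted(extractor(text).items()))
    return model.get(key, default)

# Made-up training data, for illustration only.
training_data = [('Claim your prize now', 'spam'),
                 ('Prize draw winner', 'spam'),
                 ('Meeting at noon', 'ham'),
                 ('Lunch tomorrow?', 'ham')]

model = train_majority_model(training_data, toy_extractor)
print(predict(model, 'You won a prize', toy_extractor))
print(predict(model, 'See you at lunch', toy_extractor))
```

Note that the extractor is passed to both functions: if prediction used a different extractor from training, the model would be looking up feature sets it had never seen.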



## Building a gender classifier for names

In some languages male and female names have some distinctive features: 
For example, female names are more likely to end in _a_, _e_, and _i_, relative to male names
which are more likely to end in _k_, _o_, _r_, _s_, or _t_.

We're going to build a name gender classifier. The first thing we need for this is a 'feature extractor'. 
This is a function that we can give an input (e.g. a name) and it gives back one or more features of the input (e.g. the last letter of the name).

To work with the NLTK classifier that we're going to use, our feature extractor needs to produce output that looks like a dictionary of the form `{'feature name': 'feature value'}`.

In [None]:
def gender_features(word):
  # Return a one-feature dictionary holding the last character of the input
  return {'last_letter': word[-1]}

gender_features('Dion')

Now we need to prepare a list of labeled data to use for training and testing our 
name classifier.

We're going to start with a list of names from the NLTK corpus. They're probably mostly
European, or at least from the global North.

In [None]:
from nltk.corpus import names

# For all the names in the dataset male.txt, label them as male and store them as a pair of the form `('name', 'gender')`
# Do the same for the names in the dataset female.txt
# Then join the two datasets together.
labeled_nltk_names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])
print(len(labeled_nltk_names))


# Finally, shuffle the list of labeled names
import random
random.shuffle(labeled_nltk_names)
print(labeled_nltk_names[:5])

Now that we have some labeled inputs, we're ready to apply our feature extractor `gender_features`.

It's a good idea to split the names data into a training set and a test set.

We're also going to apply the feature extractor to get some data that looks like `({'feature name': 'feature value'}, 'label')`

In [None]:
# apply the feature extractor to our name data
featuresets_nltk = [(gender_features(name),gender) for (name, gender) in labeled_nltk_names] # a tuple of features and labels
print(len(featuresets_nltk),'names in the NLTK set')

#split the data into test and training sets
train_set_nltk_names, test_set_nltk_names = featuresets_nltk[:5000], featuresets_nltk[5000:]

print(train_set_nltk_names[0:3])
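
Because the shuffle above is unseeded, the split (and so every accuracy figure later on) will change from run to run. If we want repeatable numbers, one option is a small helper like the sketch below. `split_train_test` is a hypothetical name, the seed value is arbitrary, and the toy items stand in for the real feature sets.

```python
import random

def split_train_test(items, n_train, seed=0):
    # Shuffle a copy with a private, seeded generator so that the caller's
    # list is left untouched and the split is the same on every run.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Toy stand-in for a list of (featureset, label) pairs.
toy_items = [('name{}'.format(i), 'label') for i in range(10)]
train, test = split_train_test(toy_items, n_train=7)
print(len(train), len(test))
```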

Then we will use the feature set we just produced to train a 'feature classifier'. In this case we're going to use a _naive Bayes classifier_.

In [None]:
from nltk import NaiveBayesClassifier, classify
nb_classifier_nltk = NaiveBayesClassifier.train(train_set_nltk_names)
nb_classifier_nltk.show_most_informative_features(8)
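
To get a feel for what a naive Bayes classifier is doing: for each label it multiplies the prior probability of the label by the probability of each observed feature value given that label (treating the features as independent of each other), and then picks the label with the highest score. The sketch below is a minimal version of that idea with add-one smoothing. It is not NLTK's implementation (NLTK uses its own probability estimates, so the numbers will differ), and `train_naive_bayes` and `classify_naive_bayes` are hypothetical names.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(labeled_featuresets):
    # Assumes every feature set uses the same feature names.
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values_seen = defaultdict(set)       # feature name -> set of values seen
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            values_seen[fname].add(fval)
    return label_counts, value_counts, values_seen

def classify_naive_bayes(model, features):
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        # Work with log probabilities to avoid multiplying tiny numbers.
        score = math.log(n_label / total)
        for fname, fval in features.items():
            count = value_counts.get((label, fname), {}).get(fval, 0)
            # Add-one smoothing; the extra +1 leaves room for unseen values.
            n_values = len(values_seen.get(fname, ())) + 1
            score += math.log((count + 1) / (n_label + n_values))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data, for illustration only.
toy_train = [({'last_letter': 'a'}, 'female'),
             ({'last_letter': 'a'}, 'female'),
             ({'last_letter': 'e'}, 'female'),
             ({'last_letter': 'k'}, 'male'),
             ({'last_letter': 'o'}, 'male'),
             ({'last_letter': 'a'}, 'male')]

toy_model = train_naive_bayes(toy_train)
print(classify_naive_bayes(toy_model, {'last_letter': 'a'}))
print(classify_naive_bayes(toy_model, {'last_letter': 'k'}))
```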

Finally, it's time to test our classifier.

In [None]:
nb_classifier_nltk.classify(gender_features('Dion'))
# note that the classifier uses the features extracted from the name, not the name itself.
# this means that it is really doing something like
#nb_classifier_nltk.classify({'last_letter': 'n'})

Let's try the classifier on some names that it might not have seen in the training data 
(we could go and check that these names aren't actually in the training data if we wanted to).

In [None]:

print(nb_classifier_nltk.classify(gender_features('Kahu')))
print(nb_classifier_nltk.classify(gender_features('Kiri')))
print(nb_classifier_nltk.classify(gender_features('Manakore')))
print(nb_classifier_nltk.classify(gender_features('Aperahama')))


## Ko wai tō ingoa?

Next, we're going to load in a list of "Māori names" that we scraped from a website. We can have a look at the code for scraping these names in the notebook `getCorpora.ipynb`.

Before we do anything with them, it's worth thinking a bit about where these came from and any issues that might be related to using this data.

In [None]:
# load some "Māori names" that we scraped from a US website

import ast # we'll use ast for turning strings that look like tuples into tuples

with open('maoriNames.txt', 'r') as f:
  labeled_maori_names = f.read().splitlines()
  labeled_maori_names = [ast.literal_eval(item) for item in labeled_maori_names]

random.shuffle(labeled_maori_names)
print(labeled_maori_names[:4])
print(len(labeled_maori_names))


We'll follow the same process as we did above.
First we're going to use the feature extractor to convert the labeled names to a feature set.

In [None]:
featuresets_maori = [(gender_features(name),gender) for (name, gender) in labeled_maori_names] # a tuple of features and labels
print(len(featuresets_maori),'names')

#split the data into test and training sets
train_set_maori_names, test_set_maori_names = featuresets_maori[:70], featuresets_maori[70:]

But rather than training our classifier on the new data, we're first going to look at how it performs if
we apply it directly to the new data set.

We can do this by using a tool from NLTK that uses our labeled test data to calculate the accuracy.

In [None]:

print('Accuracy:',classify.accuracy(nb_classifier_nltk, test_set_maori_names),'\n')

How does this compare with the accuracy of the NLTK names?

In [None]:

print('Accuracy:',classify.accuracy(nb_classifier_nltk, test_set_nltk_names),'\n')
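
The accuracy figures above are simply the fraction of labeled test items for which the predicted label matches the given label. The sketch below spells out that calculation. `simple_accuracy` and `LastLetterRule` are hypothetical names, the rule-based classifier is a stand-in for any object with a `classify(featureset)` method, and the four test items are made up.

```python
def simple_accuracy(classifier, labeled_featuresets):
    # Fraction of items whose predicted label matches the given label.
    correct = sum(1 for features, label in labeled_featuresets
                  if classifier.classify(features) == label)
    return correct / len(labeled_featuresets)

class LastLetterRule:
    # Hypothetical stand-in classifier: names ending in a vowel are 'female'.
    def classify(self, features):
        return 'female' if features['last_letter'] in 'aeiou' else 'male'

# Made-up test items, for illustration only.
toy_gold = [({'last_letter': 'a'}, 'female'),
            ({'last_letter': 'n'}, 'male'),
            ({'last_letter': 'i'}, 'male'),
            ({'last_letter': 'u'}, 'male')]

print(simple_accuracy(LastLetterRule(), toy_gold))  # 2 of the 4 items are correct
```

With a test set as small as the Māori names one, a single misclassified name shifts this number noticeably, so small differences between runs should not be over-interpreted.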

Now let's look at what happens if we train the classifier on the set of Māori names.

In [None]:
nb_classifier_maori = NaiveBayesClassifier.train(train_set_maori_names)
nb_classifier_maori.show_most_informative_features(8)

In [None]:

print('Accuracy:',classify.accuracy(nb_classifier_maori, test_set_maori_names),'\n')

In [None]:

print('Accuracy:',classify.accuracy(nb_classifier_maori, test_set_nltk_names),'\n')

It looks like training the classifier on the same sort of data that we intend 
to use it on has very little impact on its performance. 
(This is probably due to a combination of a very small data set and uninformative features for the problems we're trying to solve.) 
And it has made the performance on the NLTK data set worse.

Clearly, the "look at the last letter" heuristic doesn't always work well.

Try modifying the feature extractor part of the code below to see if you can come up with something that performs better for the Māori names data set.

In [None]:
def gender_features(name):
  return {'last_letter': name[-1],
          'first_letter': name[0],
          'length': len(name)}

featuresets_maori = [(gender_features(name),gender) for (name, gender) in labeled_maori_names] # a tuple of features and labels

#split the data into test and training sets
train_set_maori_names, test_set_maori_names = featuresets_maori[:70], featuresets_maori[70:]
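# train on the training set so that the accuracy below is measured on names the classifier has not seen
nb_classifier = NaiveBayesClassifier.train(train_set_maori_names)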
nb_classifier.show_most_informative_features(9)
print('Accuracy:',classify.accuracy(nb_classifier, test_set_maori_names),'\n')

# if you want you can also test how well your classifier works on the NLTK names
#print('Accuracy:',classify.accuracy(nb_classifier, test_set_nltk_names),'\n')

### Aside: Choosing the right features

Selecting relevant features can have an enormous impact on a learning method's ability to
extract a good model. Most of the work of building a good classifier is in deciding which features might be relevant
and how they can best be represented in conjunction with other features.

Designing a feature extractor involves trial and error. 
A common approach is to start with all the possible features that you can think of
and then refine the feature extractor to keep only those which are actually helpful.
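
One way to guide that trial and error is to list the items that the current classifier gets wrong and look for patterns that the features are missing. The sketch below assumes only that the classifier has a `classify(featureset)` method, as the NLTK classifiers used above do. `collect_errors`, `VowelRule` and `last_letter` are hypothetical names, and the four labeled names are made up.

```python
def collect_errors(classifier, extractor, labeled_items):
    # Return (given_label, predicted_label, item) for each misclassified item.
    errors = []
    for item, label in labeled_items:
        guess = classifier.classify(extractor(item))
        if guess != label:
            errors.append((label, guess, item))
    return sorted(errors)

class VowelRule:
    # Hypothetical stand-in classifier: names ending in a vowel are 'female'.
    def classify(self, features):
        return 'female' if features['last_letter'] in 'aeiou' else 'male'

def last_letter(name):
    return {'last_letter': name[-1].lower()}

# Made-up held-out names, for illustration only.
held_out = [('Anna', 'female'), ('Luca', 'male'),
            ('Mark', 'male'), ('Ruth', 'female')]

for given, guess, name in collect_errors(VowelRule(), last_letter, held_out):
    print('given={:<8} guess={:<8} name={}'.format(given, guess, name))
```

Doing this on a separate development set, rather than on the final test set, keeps the test accuracy an honest estimate of how the classifier behaves on unseen data.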

However, there are limits to the number of features that you should train on.
Providing too many features risks _over-fitting_: the classifier picks up quirks of the 
training set that do not generalize, and it can then perform worse on the test set 
than a classifier that uses a smaller number of features.

_You can try building and using a feature extractor that checks for the presence of every letter of the alphabet
as a feature._

Hint: you can use something like
```
def extract_features(name):
  features = {}
  for letter in 'abcdefghijklmnopqrstuvwxyz':
    features['includes({})'.format(letter)] = letter in name.lower()
  return features
```

In [None]:

def extract_features(name):
  features = {}
  for letter in 'abcdefghijklmnopqrstuvwxyz':
    features['includes({})'.format(letter)] = letter in name.lower()
  return features

extract_features('Dion')

In [None]:
# You can test the feature extractor above on the names data, or write your own one.

## Classifying reo

Now that we've had an introduction to how classifiers work and how they can be applied to text, we can look at trying to build a reo classifier that can be given a word and will classify it as either Māori or English.

To start with, we're going to need some labeled data. We've made this by taking two documents, converting them to words and then tagging those words. One document we have assumed is entirely kupu Māori, the other we have assumed is entirely English words.

If you want, you can look more closely at how the text data was collected.
Details are in `getCorpora.ipynb`.

In [None]:
# load and format labeled kupu Māori
with open('kupuMaori.txt', 'r', encoding='utf8') as f:
  labeled_maori_text = f.read().splitlines()
  labeled_maori_text = [ast.literal_eval(item) for item in labeled_maori_text]

random.shuffle(labeled_maori_text)
print(labeled_maori_text[:5])
print(len(labeled_maori_text))

# load and format labeled English words
with open('englishWords.txt', 'r', encoding='utf8') as f:
  labeled_english_text = f.read().splitlines()
  labeled_english_text = [ast.literal_eval(item) for item in labeled_english_text]

random.shuffle(labeled_english_text)
print(labeled_english_text[:5])
print(len(labeled_english_text))

# combine the Māori and English kupu
labeled_text = labeled_maori_text + labeled_english_text
random.shuffle(labeled_text)
print(labeled_text[:5])


Next, we need a feature extractor to use with our text. 
One possibility is to start with the example above - `extract_features()` where the presence or absence of each letter is a feature.

Or we could try something new of our own devising...

In [None]:
def extract_kupu_features(kupu):
  features = {}
  features['first_letter'] = kupu[0]
  features['last_letter'] = kupu[-1]
  features['has_macron'] = any(macron_letter in kupu.lower() for macron_letter in 'āēīōū')
  features['english_only_letters'] = any(english_letter in kupu.lower() for english_letter in 'bcdfjlqsvxyz')
  return features

In [None]:
#check our possible feature extractors

print(extract_features('consideration'))

print(extract_kupu_features('consideration'))

print(extract_features('rōpū'))

print(extract_kupu_features('rōpū'))
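
As one more possibility, we could use features based on the shape of the word rather than on individual letters. Kupu Māori end in a vowel and do not contain consonant clusters (the digraphs _ng_ and _wh_ each count as a single consonant), whereas many English words break one or both of those patterns. The sketch below is one way to turn that into features. `extract_shape_features` is a hypothetical name, and this is an untested suggestion rather than a feature set that is known to perform better on our data.

```python
VOWELS = set('aeiouāēīōū')

def extract_shape_features(kupu):
    word = kupu.lower()
    # Collapse the digraphs 'ng' and 'wh' into single placeholder consonants
    # so that they are not counted as consonant clusters.
    collapsed = word.replace('ng', 'N').replace('wh', 'W')
    letters = [ch for ch in collapsed if ch.isalpha()]
    # A cluster is any pair of adjacent letters that are both consonants.
    has_cluster = any(a not in VOWELS and b not in VOWELS
                      for a, b in zip(letters, letters[1:]))
    return {
        'ends_in_vowel': bool(letters) and letters[-1] in VOWELS,
        'has_consonant_cluster': has_cluster,
    }

print(extract_shape_features('consideration'))
print(extract_shape_features('rōpū'))
print(extract_shape_features('whakarongo'))
```

These two features would not separate every word (plenty of English words, such as _banana_, also fit the Māori pattern), so they are probably most useful in combination with the other features.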

Now let's try applying one of our feature extractors to our corpus of mixed Māori and English words.
We'll follow a similar process to above where we split the data into some for training and some for testing.

In [None]:

featureset_kupu = [(extract_kupu_features(kupu.lower()), reo) for (kupu, reo) in labeled_text] # a tuple of features and labels
print(len(featureset_kupu))

#split the data into test and training sets
train_set_kupu, test_set_kupu = featureset_kupu[:3000], featureset_kupu[3000:]
print(labeled_text[:3])
print(featureset_kupu[:3])

# train the classifier
nb_classifier = NaiveBayesClassifier.train(train_set_kupu)
nb_classifier.show_most_informative_features(10)

Let's test our classifier on some kupu.

In [None]:
test_kupu = 'maramatanga'

#check that our test word isn't one that is already in the list of kupu that could be part of the training set
print('Is \'{}\' in the training data?:'.format(test_kupu), test_kupu in [kupu for (kupu, reo) in labeled_text])

# classify the kupu
# use the same feature extractor that the classifier was trained with
nb_classifier.classify(extract_kupu_features(test_kupu))


We can also look at the accuracy of our classifier using the test set from the data set that we split.

In [None]:

print('Accuracy:',classify.accuracy(nb_classifier, test_set_kupu),'\n')
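
A single accuracy number doesn't tell us which reo is being misclassified. One way to see that is to tally each pair of given label and predicted label. In the sketch below, `confusion_counts` and `MacronRule` are hypothetical names, the rule-based classifier is a stand-in for any object with a `classify(featureset)` method, and the label strings `'maori'` and `'english'` are assumptions that may not match the labels used in our data files.

```python
from collections import Counter

def confusion_counts(classifier, labeled_featuresets):
    # Count how often each (given_label, predicted_label) pair occurs.
    return Counter((label, classifier.classify(features))
                   for features, label in labeled_featuresets)

class MacronRule:
    # Hypothetical stand-in classifier: any word with a macron is 'maori'.
    def classify(self, features):
        return 'maori' if features['has_macron'] else 'english'

# Made-up test items, for illustration only.
toy_test = [({'has_macron': True}, 'maori'),
            ({'has_macron': False}, 'maori'),
            ({'has_macron': False}, 'english'),
            ({'has_macron': False}, 'english')]

counts = confusion_counts(MacronRule(), toy_test)
for (given, predicted), n in sorted(counts.items()):
    print(given, '->', predicted, ':', n)
```

In this toy example the overall accuracy is 0.75, but the tally shows that all of the errors are kupu Māori without macrons being labeled as English.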

In [None]:

# We can also print out the result of classifying some words from an untagged corpus with a mix of reo

#load and format our untagged text
with open('mixedWords.txt', 'r', encoding='utf8') as f:
  mixedText = f.read().splitlines()
  mixedText = [ast.literal_eval(item) for item in mixedText]

print(mixedText[:5])
print(len(mixedText))

# apply the classifier and print some results
for (word, tag) in mixedText[:20]:
  # lowercase each word first, to match how the training features were extracted
  print("Word:",word, " - Reo:", nb_classifier.classify(extract_kupu_features(word.lower())))


How do the results above compare with the accuracy that we estimated from the tagged test data?