Today we'll walk through two implementations of naïve Bayes classifiers, both in Python.

# Data

NLTK has a [corpus of pet names coded for sex](http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/0.html). We will try to predict the sex using simple features of the name.

In [1]:
import random

import nltk

# I had to install the corpus ahead of time. But you only have to do this once.
assert nltk.download("names")

female = nltk.corpus.names.words("female.txt")
print(f"Loaded {len(female)} female names")
male = nltk.corpus.names.words("male.txt")
print(f"Loaded {len(male)} male names")

# Then, we'll concatenate the data and turn it into (x, y) pairs.
name_list = [(name, "female") for name in female] + [(name, "male") for name in male]

# Then, we'll shuffle the data.
random.seed(562)
random.shuffle(name_list)  # This works in place.

# Then, we'll split it into training and test data.
train = name_list[:-1000]
test = name_list[-1000:]

Loaded 5001 female names
Loaded 2943 male names


[nltk_data] Downloading package names to
[nltk_data]     /home/kbg/.anaconda3/share/nltk_data...
[nltk_data]   Package names is already up-to-date!


# Feature extraction

Now let's define a feature function. This one takes a name as input and returns a dictionary with string keys. I've adapted this from [chapter six of the NLTK book](https://www.nltk.org/book/ch06.html), but I tried to improve the code there a bit. Naturally you can always use tricks to avoid

In [2]:
import string

from typing import Dict


FeatureVector = Dict[str, object]


def extract_features(name: str) -> FeatureVector:
    """Extracts features for a single example."""
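    # NOTE: `name_lowercase` is never used below; the checks run on `name`,
    # so an uppercase initial is never matched (see `count(b)` for "Bodie").
    name_lowercase = name.casefold()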
    vowels = frozenset("aeiou")
    features = {}
    features["startswith(vowel)"] = name[0] in vowels
    features["endswith(vowel)"] = name[-1] in vowels
    for char in string.ascii_lowercase:
        count = name.count(char)
        features[f"count({char})"] = count
        features[f"has({char})"] = bool(count)
    return features


# An example dictionary vector.
print(extract_features("Bodie"))

{'startswith(vowel)': False, 'endswith(vowel)': True, 'count(a)': 0, 'has(a)': False, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 1, 'has(d)': True, 'count(e)': 1, 'has(e)': True, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 0, 'has(h)': False, 'count(i)': 1, 'has(i)': True, 'count(j)': 0, 'has(j)': False, 'count(k)': 0, 'has(k)': False, 'count(l)': 0, 'has(l)': False, 'count(m)': 0, 'has(m)': False, 'count(n)': 0, 'has(n)': False, 'count(o)': 1, 'has(o)': True, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 0, 'has(r)': False, 'count(s)': 0, 'has(s)': False, 'count(t)': 0, 'has(t)': False, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}


# In NLTK

The [NLTK implementation](https://www.nltk.org/book/ch06.html) of naïve Bayes supports:

* features of arbitrary types (so long as they're hashable)
* multinomial classification (though we'll just give it a binary classification problem)

[Internally](https://www.nltk.org/_modules/nltk/classify/naivebayes.html), it simply constructs $\hat{P}(y = y')$ and $\hat{P}(F_i \mid y = y')$, using $c = .5$ (I believe) by default. Then, to predict, it computes the log posterior distribution $\log \hat{P}(y = y' \mid \mathcal{F})$ and returns the argmax. It is written in pure Python: consider taking a look!
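
To make that recipe concrete, here is a minimal sketch of a categorical naïve Bayes classifier with additive smoothing. It is not NLTK's code: the class name `TinyNaiveBayes`, its methods, and the toy data are made up for illustration, and NLTK's estimator may handle details such as missing features differently.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple

Features = Dict[str, Hashable]
Example = Tuple[Features, str]  # (feature vector, label)


class TinyNaiveBayes:
    """Illustrative categorical naive Bayes with additive smoothing."""

    def __init__(self, c: float = 0.5) -> None:
        self.c = c  # Smoothing constant (an assumption, mirroring the text).
        self.label_counts: Counter = Counter()
        # Maps (label, feature name) to a Counter over observed values.
        self.value_counts = defaultdict(Counter)
        # Maps feature name to the set of values seen in training.
        self.values = defaultdict(set)

    def train(self, data: List[Example]) -> None:
        for features, label in data:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.value_counts[label, name][value] += 1
                self.values[name].add(value)

    def log_scores(self, features: Features) -> Dict[str, float]:
        """Returns the unnormalized log posterior for each label."""
        total = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # Log prior.
            for name, value in features.items():
                if name not in self.values:
                    continue  # Ignores features never seen in training.
                count = self.value_counts[label, name][value]
                bins = len(self.values[name])
                score += math.log((count + self.c) / (n + self.c * bins))
            scores[label] = score
        return scores

    def classify(self, features: Features) -> str:
        scores = self.log_scores(features)
        return max(scores, key=scores.get)  # The argmax over labels.


# Hypothetical toy data in the (feature vector, label) layout.
toy = [
    ({"last": "a"}, "female"),
    ({"last": "a"}, "female"),
    ({"last": "k"}, "male"),
    ({"last": "o"}, "male"),
]
clf = TinyNaiveBayes()
clf.train(toy)
print(clf.classify({"last": "a"}))
```

Working in log space avoids underflow when many small probabilities are multiplied together, and the smoothing constant keeps unseen feature values from zeroing out a label.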

NLTK expects the training and test data laid out as a list of (feature vector, string label) pairs.

In [3]:
nltk_train_set = [(extract_features(name), label) for (name, label) in train]

# An example pair.
print(nltk_train_set[0])

({'startswith(vowel)': False, 'endswith(vowel)': False, 'count(a)': 1, 'has(a)': True, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 1, 'has(d)': True, 'count(e)': 2, 'has(e)': True, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 0, 'has(h)': False, 'count(i)': 0, 'has(i)': False, 'count(j)': 0, 'has(j)': False, 'count(k)': 0, 'has(k)': False, 'count(l)': 1, 'has(l)': True, 'count(m)': 0, 'has(m)': False, 'count(n)': 1, 'has(n)': True, 'count(o)': 0, 'has(o)': False, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 1, 'has(r)': True, 'count(s)': 0, 'has(s)': False, 'count(t)': 0, 'has(t)': False, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 1, 'has(x)': True, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}, 'male')


Next we train.

In [4]:
# This is actually a class method that returns a model object of 
# type `NaiveBayesClassifier`.
nltk_clf = nltk.classify.naivebayes.NaiveBayesClassifier.train(nltk_train_set)

To predict, we simply call the `classify` method with a feature vector.

In [5]:
lambda_features = extract_features("Lambda")  # Weird pet name, I know.
print(nltk_clf.classify(lambda_features))

female


The NLTK implementation can also show us which features it determined to be most informative, presenting each one as an "X to 1" ratio.

In [6]:
nltk_clf.show_most_informative_features()

Most Informative Features
                count(y) = 2              female : male   =      5.1 : 1.0
                count(a) = 3              female : male   =      4.7 : 1.0
                count(f) = 2                male : female =      3.9 : 1.0
                  has(f) = True             male : female =      3.7 : 1.0
                count(e) = 3              female : male   =      3.6 : 1.0
                count(f) = 1                male : female =      3.6 : 1.0
                count(w) = 1                male : female =      3.3 : 1.0
                  has(w) = True             male : female =      3.3 : 1.0
                count(u) = 2                male : female =      3.2 : 1.0
                count(i) = 3                male : female =      3.1 : 1.0
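
My reading of these rows is that each ratio compares the estimated probability of a feature value under one label with its probability under the other. The sketch below computes such a ratio from raw counts; the function name, the counts, and the smoothing defaults are all hypothetical, and NLTK's own estimates may differ in detail.

```python
def likelihood_ratio(
    count_a: int,
    total_a: int,
    count_b: int,
    total_b: int,
    c: float = 0.5,
    bins: int = 2,
) -> float:
    """Ratio of smoothed P(value | label A) to smoothed P(value | label B)."""
    p_a = (count_a + c) / (total_a + c * bins)
    p_b = (count_b + c) / (total_b + c * bins)
    return p_a / p_b


# Hypothetical counts: the value occurs in 120 of 4000 names with label A
# and in 20 of 3000 names with label B.
ratio = likelihood_ratio(120, 4000, 20, 3000)
print(f"A : B = {ratio:.1f} : 1.0")
```

A ratio far from 1 means that observing this feature value shifts the posterior strongly toward one label.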


Finally, we evaluate. Accuracy (the proportion of correct classifications) on a held-out test set seems like an obvious measure here.

In [7]:
nltk_test_set = [(extract_features(name), label) for (name, label) in test]
print(f"{nltk.classify.util.accuracy(nltk_clf, nltk_test_set):.4f}")

0.7370


# In Scikit-learn

The [Scikit-learn implementation](https://scikit-learn.org/stable/modules/naive_bayes.html) of naïve Bayes supports:

* numerical features only
* multinomial classification (though we'll just give it a binary classification problem)

The desired value of $c$ is passed to the constructor using the `alpha` keyword (and defaults to $c = 1$).

Whereas NLTK wanted data laid out as a list of (feature vector, string label) pairs, Scikit-learn wants separate lists (or arrays) of numerical feature vectors ($X$ in their notation) and labels ($Y$).

In [8]:
import sklearn.feature_extraction

skl_vectorizer = sklearn.feature_extraction.DictVectorizer(sparse=False)
# This first prepares the encoding (`fit`) then applies it to the data (`transform`).
skl_train_x = skl_vectorizer.fit_transform(extract_features(name) for (name, _) in train)

# An example numerical feature vector.
print(skl_train_x[0])

[1. 0. 0. 1. 2. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 1. 0. 0. 0.]
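
To see roughly what this encoding step does, here is a pure-Python sketch of dictionary vectorization. It is an approximation for illustration, not Scikit-learn's implementation; the helper names `fit_vocabulary` and `vectorize` are made up. It assumes that numeric and boolean values pass through as floats, that string values become one indicator column per (name, value) pair, and that columns are sorted by name.

```python
from typing import Dict, Iterable, List


def fit_vocabulary(examples: Iterable[Dict[str, object]]) -> List[str]:
    """Collects the sorted list of column names from the training data."""
    columns = set()
    for features in examples:
        for name, value in features.items():
            # String values get one indicator column per (name, value) pair.
            columns.add(f"{name}={value}" if isinstance(value, str) else name)
    return sorted(columns)


def vectorize(features: Dict[str, object], vocabulary: List[str]) -> List[float]:
    """Encodes one feature dictionary as a dense row of floats."""
    index = {column: i for i, column in enumerate(vocabulary)}
    row = [0.0] * len(vocabulary)
    for name, value in features.items():
        if isinstance(value, str):
            column, cell = f"{name}={value}", 1.0
        else:
            column, cell = name, float(value)
        if column in index:  # Columns unseen during fitting are dropped.
            row[index[column]] = cell
    return row


# Hypothetical feature dictionaries.
examples = [
    {"count(a)": 2, "has(a)": True, "endswith": "a"},
    {"count(a)": 0, "has(a)": False, "endswith": "k"},
]
vocabulary = fit_vocabulary(examples)
print(vocabulary)
print(vectorize(examples[0], vocabulary))
```

The sorted column order is why the first five entries of the vector above line up with `count(a)` through `count(e)`.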


In [9]:
# We also need to pull out Y; we'll encode it using a boolean where female == True.
skl_train_y = [sex == "female" for (_, sex) in train]

# An example boolean class vector.
print(skl_train_y[0])

False


In [10]:
import sklearn.naive_bayes

# Unlike in NLTK, we create the classifier object before we train.
skl_clf = sklearn.naive_bayes.MultinomialNB()
_ = skl_clf.fit(skl_train_x, skl_train_y)  # Assignment just to silence it.

To predict, we simply call the `predict` method with an iterable of feature vectors. Note this is unlike NLTK's `classify` method, which only takes a single feature vector at a time.

In [11]:
skl_clf.predict(skl_vectorizer.transform(extract_features("Lambda")))

array([ True])

Finally we can evaluate, using test accuracy.

In [12]:
import sklearn.metrics

# `fit` or `fit_transform` should only be called once; after that we can only `transform`.
skl_test_x = skl_vectorizer.transform(extract_features(name) for (name, _) in test)
skl_test_y = [sex == "female" for (_, sex) in test]
# Predicts the labels for the test data.
skl_pred_y = skl_clf.predict(skl_test_x)
print(f"{sklearn.metrics.accuracy_score(skl_test_y, skl_pred_y):.4f}")

0.7460


Accuracy is quite similar to that of the NLTK implementation.

## Stretch goals

Improve the above examples in one or more of the following ways:

* compute an absolute baseline: how hard is this task if we just guessed the most likely label?
* write functions for repeated operations like feature and/or class encoding
* write classes that wrap the two classifiers with a friendlier interface that automatically handles feature encoding for you
* modify the feature extractor for better accuracy
* try another type of classifier from NLTK and/or Scikit-learn
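
As a starting point for the first item, here is one way a majority-class baseline might be sketched. The function name `majority_baseline` is hypothetical. The label lists are built from the corpus sizes printed earlier (5001 female, 2943 male), so the figure describes the whole corpus rather than the 1000-name test split.

```python
from collections import Counter
from typing import List, Tuple


def majority_baseline(
    train_labels: List[str], test_labels: List[str]
) -> Tuple[str, float]:
    """Always guesses the most frequent training label; returns it and its accuracy."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(label == majority_label for label in test_labels)
    return majority_label, correct / len(test_labels)


# Label lists matching the corpus sizes reported when the data was loaded.
labels = ["female"] * 5001 + ["male"] * 2943
guess, baseline_accuracy = majority_baseline(labels, labels)
print(guess, f"{baseline_accuracy:.4f}")  # female 0.6295
```

With the real split, one would pass the labels from `train` and `test` instead, for example `[label for (_, label) in train]`.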

In [34]:
import random

import nltk

# I had to install the corpus ahead of time. But you only have to do this once.
assert nltk.download("names")

female = nltk.corpus.names.words("female.txt")
print(f"Loaded {len(female)} female names")
male = nltk.corpus.names.words("male.txt")
print(f"Loaded {len(male)} male names")

# Then, we'll concatenate the data and turn it into (x, y) pairs.
name_list = [(name, "female") for name in female] + [(name, "male") for name in male]
#
#print(len(name_list))
# Then, we'll shuffle the data.
random.seed(0)
random.shuffle(name_list)  # This works in place.
print(name_list[:10])

# random.shuffle(name_list)  # This works in place.
# print(name_list[:10])

# Then, we'll split it into training and test data.
train = name_list[:-1000]
test = name_list[-1000:]

[nltk_data] Downloading package names to /Users/hmghaly/nltk_data...
[nltk_data]   Package names is already up-to-date!
Loaded 5001 female names
Loaded 2943 male names
[('Gilburt', 'male'), ('Robina', 'female'), ('Kissie', 'female'), ('Allegra', 'female'), ('Melantha', 'female'), ('Carmen', 'female'), ('Verile', 'female'), ('Ric', 'male'), ('Sela', 'female'), ('Tine', 'female')]
[('Garp', 'male'), ('Elli', 'female'), ('Weidar', 'male'), ('Dosi', 'female'), ('Osbert', 'male'), ('Freemon', 'male'), ('Gusty', 'female'), ('Aurore', 'female'), ('Mella', 'female'), ('Mabel', 'female')]


In [36]:
test[:10]

[('Lois', 'female'),
 ('Jacynth', 'female'),
 ('Andrea', 'male'),
 ('Cordy', 'female'),
 ('Benton', 'male'),
 ('Sherri', 'female'),
 ('Quint', 'male'),
 ('Karrie', 'female'),
 ('Briggs', 'male'),
 ('Bearnard', 'male')]

In [5]:
male[100:110]

['Alwin',
 'Amadeus',
 'Ambros',
 'Ambrose',
 'Ambrosi',
 'Ambrosio',
 'Ambrosius',
 'Amery',
 'Amory',
 'Amos']

In [30]:
import random
name_list = [(name, "female") for name in female] + [(name, "male") for name in male]
random.seed(0)
random.shuffle(name_list)  # This works in place.
print(name_list[:10])


[('Gilburt', 'male'), ('Robina', 'female'), ('Kissie', 'female'), ('Allegra', 'female'), ('Melantha', 'female'), ('Carmen', 'female'), ('Verile', 'female'), ('Ric', 'male'), ('Sela', 'female'), ('Tine', 'female')]


In [33]:
import random
#name_list = [(name, "female") for name in female] + [(name, "male") for name in male]
random.seed(0)
random.shuffle(name_list)  # This works in place.
print(name_list[:10])


[('Corinne', 'female'), ('Reece', 'male'), ('Hazel', 'male'), ('Wakefield', 'male'), ('Elly', 'female'), ('Rockwell', 'male'), ('Zelig', 'male'), ('Eddie', 'male'), ('Dell', 'female'), ('Salem', 'male')]


In [45]:
import string

from typing import Dict


FeatureVector = Dict[str, object]


def extract_features(name: str) -> FeatureVector:
    """Extracts features for a single example."""
    name_lowercase = name.casefold()
    vowels = frozenset("aeiou")
    features = {}
    features["startswith(vowel)"] = name_lowercase[0] in vowels
    features["endswith(vowel)"] = name_lowercase[-1] in vowels
    features["endswith(a)"] = name_lowercase[-1] == "a"
    features["endswith(o)"] = name_lowercase[-1] == "o"    
    features["endswith"] = name_lowercase[-1]
    for char in string.ascii_lowercase:
        count = name_lowercase.count(char)
        features[f"count({char})"] = count
        features[f"has({char})"] = bool(count)
    return features


# An example dictionary vector, printed one feature per line.
for key, value in extract_features("Carlo").items():
    print(key, value)

startswith(vowel) False
endswith(vowel) True
endswith(a) False
endswith(o) True
endswith o
count(a) 1
has(a) True
count(b) 0
has(b) False
count(c) 1
has(c) True
count(d) 0
has(d) False
count(e) 0
has(e) False
count(f) 0
has(f) False
count(g) 0
has(g) False
count(h) 0
has(h) False
count(i) 0
has(i) False
count(j) 0
has(j) False
count(k) 0
has(k) False
count(l) 1
has(l) True
count(m) 0
has(m) False
count(n) 0
has(n) False
count(o) 1
has(o) True
count(p) 0
has(p) False
count(q) 0
has(q) False
count(r) 1
has(r) True
count(s) 0
has(s) False
count(t) 0
has(t) False
count(u) 0
has(u) False
count(v) 0
has(v) False
count(w) 0
has(w) False
count(x) 0
has(x) False
count(y) 0
has(y) False
count(z) 0
has(z) False


In [47]:
nltk_train_set = [(extract_features(name), label) for (name, label) in train]

# An example pair.
print(nltk_train_set[0])
print(train[0])

({'startswith(vowel)': False, 'endswith(vowel)': False, 'endswith(a)': False, 'endswith(o)': False, 'endswith': 'p', 'count(a)': 1, 'has(a)': True, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 0, 'has(d)': False, 'count(e)': 0, 'has(e)': False, 'count(f)': 0, 'has(f)': False, 'count(g)': 1, 'has(g)': True, 'count(h)': 0, 'has(h)': False, 'count(i)': 0, 'has(i)': False, 'count(j)': 0, 'has(j)': False, 'count(k)': 0, 'has(k)': False, 'count(l)': 0, 'has(l)': False, 'count(m)': 0, 'has(m)': False, 'count(n)': 0, 'has(n)': False, 'count(o)': 0, 'has(o)': False, 'count(p)': 1, 'has(p)': True, 'count(q)': 0, 'has(q)': False, 'count(r)': 1, 'has(r)': True, 'count(s)': 0, 'has(s)': False, 'count(t)': 0, 'has(t)': False, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}, 'male')
('Garp', 'male')


In [48]:
print(nltk_train_set[100])
print(train[100])

({'startswith(vowel)': False, 'endswith(vowel)': True, 'endswith(a)': False, 'endswith(o)': False, 'endswith': 'e', 'count(a)': 1, 'has(a)': True, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 0, 'has(d)': False, 'count(e)': 2, 'has(e)': True, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 0, 'has(h)': False, 'count(i)': 0, 'has(i)': False, 'count(j)': 0, 'has(j)': False, 'count(k)': 0, 'has(k)': False, 'count(l)': 0, 'has(l)': False, 'count(m)': 0, 'has(m)': False, 'count(n)': 2, 'has(n)': True, 'count(o)': 0, 'has(o)': False, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 0, 'has(r)': False, 'count(s)': 0, 'has(s)': False, 'count(t)': 2, 'has(t)': True, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}, 'female')
('Nanette', 'female')


In [50]:
nltk_clf = nltk.classify.naivebayes.NaiveBayesClassifier.train(nltk_train_set)


In [61]:
name = "Campbel"
features = extract_features(name)
print(nltk_clf.classify(features))

male


In [63]:
nltk_clf.show_most_informative_features(20)

Most Informative Features
                endswith = 'k'              male : female =     71.6 : 1.0
                endswith = 'a'            female : male   =     35.4 : 1.0
             endswith(a) = True           female : male   =     35.3 : 1.0
                endswith = 'f'              male : female =     16.9 : 1.0
                endswith = 'p'              male : female =     11.3 : 1.0
                endswith = 'd'              male : female =     11.1 : 1.0
                endswith = 'v'              male : female =     10.0 : 1.0
                endswith = 'm'              male : female =      8.6 : 1.0
             endswith(o) = True             male : female =      7.8 : 1.0
                endswith = 'o'              male : female =      7.8 : 1.0
                endswith = 'r'              male : female =      6.6 : 1.0
                endswith = 'w'              male : female =      6.3 : 1.0
                endswith = 'g'              male : female =      4.8 : 1.0

In [64]:
nltk_test_set = [(extract_features(name), label) for (name, label) in test]
accuracy = nltk.classify.util.accuracy(nltk_clf, nltk_test_set)
print(accuracy)

0.757
