## Tf-idf demo on the books in the Gutenberg corpus in NLTK

*By Mathias Creutz*

Let's compute tf-idf scores for words occurring in the different books in the NLTK version of the Gutenberg corpus.

First we take a look at **what books there are** in the corpus:

In [1]:
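import nltk
# If the corpus is missing, fetch it once with nltk.download("gutenberg");
# sents() additionally needs NLTK's Punkt sentence tokenizer data.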

booknames = nltk.corpus.gutenberg.fileids()

for name in booknames:
    print("{:24s} {:s}".format(name, " ".join(nltk.corpus.gutenberg.sents(name)[0])))

austen-emma.txt          [ Emma by Jane Austen 1816 ]
austen-persuasion.txt    [ Persuasion by Jane Austen 1818 ]
austen-sense.txt         [ Sense and Sensibility by Jane Austen 1811 ]
bible-kjv.txt            [ The King James Bible ]
blake-poems.txt          [ Poems by William Blake 1789 ]
bryant-stories.txt       [ Stories to Tell to Children by Sara Cone Bryant 1918 ]
burgess-busterbrown.txt  [ The Adventures of Buster Bear by Thornton W . Burgess 1920 ]
carroll-alice.txt        [ Alice ' s Adventures in Wonderland by Lewis Carroll 1865 ]
chesterton-ball.txt      [ The Ball and The Cross by G . K . Chesterton 1909 ]
chesterton-brown.txt     [ The Wisdom of Father Brown by G . K . Chesterton 1914 ]
chesterton-thursday.txt  [ The Man Who Was Thursday by G . K . Chesterton 1908 ]
edgeworth-parents.txt    [ The Parent ' s Assistant , by Maria Edgeworth ]
melville-moby_dick.txt   [ Moby Dick by Herman Melville 1851 ]
milton-paradise.txt      [ Paradise Lost by John Milton 1667 ]
shakespe

**How many word tokens** do these books contain?

In [2]:
for name in booknames:
    print("{:24s} {:7d}".format(name, len(nltk.corpus.gutenberg.words(name))))

austen-emma.txt           192427
austen-persuasion.txt      98171
austen-sense.txt          141576
bible-kjv.txt            1010654
blake-poems.txt             8354
bryant-stories.txt         55563
burgess-busterbrown.txt    18963
carroll-alice.txt          34110
chesterton-ball.txt        96996
chesterton-brown.txt       86063
chesterton-thursday.txt    69213
edgeworth-parents.txt     210663
melville-moby_dick.txt    260819
milton-paradise.txt        96825
shakespeare-caesar.txt     25833
shakespeare-hamlet.txt     37360
shakespeare-macbeth.txt    23140
whitman-leaves.txt        154883


Let's see **how many word tokens there are in all books combined**.

In [3]:
ntot_tokens = sum(len(nltk.corpus.gutenberg.words(name)) for name in booknames)
print("Total number of word tokens:", ntot_tokens)

Total number of word tokens: 2621613


What is the **size of the vocabulary**? How many unique word forms (word types) are there in this corpus? Note that the words are lowercased before counting.

In [4]:
ntot_types = len(set(w.lower() for bookname in booknames
                     for w in nltk.corpus.gutenberg.words(bookname)))
print("Total number of word types (which are our terms):", ntot_types)

Total number of word types (which are our terms): 42339
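The difference between tokens and types is easy to see on a tiny example. The following is a minimal sketch on a made-up token list (the list is an assumption for illustration and is not taken from the corpus); it applies the same lowercasing as the cell above.

```python
# Toy token list (hypothetical, for illustration only).
tokens = ["The", "cat", "saw", "the", "other", "cat", "."]

# Every occurrence counts as one token, punctuation included.
n_tokens = len(tokens)

# Types are the distinct lowercased forms.
types = set(w.lower() for w in tokens)
n_types = len(types)

print("tokens:", n_tokens)
print("types: ", n_types, sorted(types))
```

"The" and "the" collapse into one type, and the repeated "cat" is counted only once, so the number of types is smaller than the number of tokens.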


### Term frequencies

Let's look at which words are the most common in the corpus as a whole (**term frequencies in the entire corpus**). 

In [5]:
fdist = nltk.FreqDist(w.lower() for bookname in booknames
                      for w in nltk.corpus.gutenberg.words(bookname))

print("The 70 most common word types in the Gutenberg corpus are:")
for w, f in fdist.most_common(70):
    print("{:8s} {:6d}".format(w, f))

The 70 most common word types in the Gutenberg corpus are:
,        186091
the      133583
and       95442
.         73746
of        71267
to        48057
:         47406
a         33960
in        33580
i         30265
that      28798
;         27329
he        25857
it        22303
his       21402
'         19873
for       19527
was       18715
with      17599
not       17373
is        16435
you       16398
be        16115
"         15422
as        14528
but       13944
all       13725
they      13104
him       13012
shall     11682
her       11552
my        10511
had       10317
them      10245
have      10116
s          9896
me         9469
said       9429
at         9174
she        9078
from       9076
unto       9010
this       8915
-          8850
which      8775
on         8596
lord       8590
by         8512
their      7806
so         7791
will       7368
?          7342
were       6871
are        6854
thou       6759
there      6380
when       6330
or         6325
one        62
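`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be tried out without NLTK. The sketch below uses a made-up token list (an assumption for illustration) and shows that punctuation marks are counted like any other token, which is why they rank so high in the list above.

```python
from collections import Counter

# Toy token list (hypothetical, for illustration only).
tokens = ["The", "cat", "saw", "the", "other", "cat", ".", "The", "end", "."]

# Count lowercased tokens, just like the FreqDist call above.
counts = Counter(w.lower() for w in tokens)

# The three most frequent types with their counts.
top = counts.most_common(3)
for w, f in top:
    print("{:8s} {:6d}".format(w, f))
```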

However, for the calculation of the tf-idf scores we need **document-specific term frequencies**. Let's look at the 20 most frequent words in a few of our documents (= books). Do we see any big differences between the documents?

In [6]:
some_booknames = [
    'austen-emma.txt', 'bible-kjv.txt', 'burgess-busterbrown.txt',
    'carroll-alice.txt', 'melville-moby_dick.txt', 'shakespeare-caesar.txt',
]

for name in some_booknames:
    fdist = nltk.FreqDist(w.lower() for w in nltk.corpus.gutenberg.words(name))
    print("Most frequent words in {:s}:".format(name))
    for w, f in fdist.most_common(20):
        print("{:8s} {:6d}".format(w, f))
    print()

Most frequent words in austen-emma.txt:
,         11454
.          6928
to         5239
the        5201
and        4896
of         4291
i          3178
a          3129
it         2528
her        2469
was        2398
she        2340
;          2199
in         2188
not        2140
"          2004
you        1980
be         1975
that       1806
he         1806

Most frequent words in bible-kjv.txt:
,         70509
the       64023
and       51696
:         43766
of        34670
.         26160
to        13580
that      12912
in        12667
he        10419
;         10084
shall      9838
unto       8997
for        8971
i          8854
his        8473
a          8177
lord       7964
they       7376
be         7013

Most frequent words in burgess-busterbrown.txt:
.           823
,           822
he          678
the         660
and         516
to          436
'           401
of          342
that        308
it          300
a           287
was         274
buster      229
had         211
little  

### Inverse document frequencies

To obtain inverse document frequencies, we first compute the **document frequencies**: for every word type, we count the number of documents (= books) it occurs in:

In [7]:
from collections import defaultdict

df = defaultdict(int)

for name in booknames:
    for word in set(w.lower() for w in nltk.corpus.gutenberg.words(name)):
        df[word] += 1
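To see what the loop above does, here is a minimal sketch on three made-up documents (the names `toy_docs` and `toy_df` and the token lists are assumptions for illustration). Converting each document to a set first ensures that a word is counted at most once per document, no matter how often it occurs there.

```python
from collections import defaultdict

# Three hypothetical toy documents, already tokenized and lowercased.
toy_docs = {
    "doc_a": ["the", "whale", "and", "the", "sea"],
    "doc_b": ["the", "queen", "and", "the", "king"],
    "doc_c": ["the", "whale", "sings"],
}

toy_df = defaultdict(int)
for tokens in toy_docs.values():
    # set() removes repeats within a document, so each document
    # contributes at most 1 to a word's document frequency.
    for word in set(tokens):
        toy_df[word] += 1

for word in sorted(toy_df, key=lambda w: toy_df[w], reverse=True):
    print("{:8s} {:d}".format(word, toy_df[word]))
```

Although "the" occurs five times in total, its document frequency is 3, the number of documents that contain it.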

From the document frequencies we compute the corresponding **inverse document frequency** (idf) values.

In [8]:
import math

ndocs = len(booknames)
idf = defaultdict(float)

# idf = ln(N / df), using the natural logarithm
for w in df:
    idf[w] = math.log(ndocs / df[w])
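As a quick check of the formula, the sketch below evaluates it for a few document frequencies, assuming 18 documents as in the book listing at the top. The variable name `idf_by_df` is introduced here for illustration only.

```python
import math

ndocs = 18  # number of books in the corpus, as listed at the top

# idf for words that occur in 18, 6, 2 and 1 of the books.
idf_by_df = {d: math.log(ndocs / d) for d in (18, 6, 2, 1)}

for d, value in idf_by_df.items():
    print("df = {:2d}   idf = {:.3f}".format(d, value))
```

Because the natural logarithm is used, a word that occurs in every book gets idf = ln(1) = 0, and a word that occurs in a single book gets idf = ln(18) ≈ 2.890, which is the largest value possible in this corpus.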

Next we sort the words by their document frequencies:

In [9]:
df_sorted_words = sorted(df.keys(), key=lambda w: df[w], reverse=True)

Let's take a look at some **words that occur in all the books**. What are their document and inverse document frequencies?

In [10]:
print("Examples of words that occur in all books (with df and idf values):")
for w in df_sorted_words[0:10]:
    print("{:16s} {:6d} {:12.3f}".format(w, df[w], idf[w]))

Examples of words that occur in all books (with df and idf values):
bed                  18        0.000
i                    18        0.000
some                 18        0.000
such                 18        0.000
tongue               18        0.000
want                 18        0.000
about                18        0.000
off                  18        0.000
am                   18        0.000
sweet                18        0.000


And some words that **occur in only one book**:

In [11]:
print("Examples of words that occur in only one book (with df and idf values):")
for w in df_sorted_words[-10:]:
    print("{:16s} {:6d} {:12.3f}".format(w, df[w], idf[w]))

Examples of words that occur in only one book (with df and idf values):
mizra                 1        2.890
impell                1        2.890
trippers              1        2.890
imprints              1        2.890
peelings              1        2.890
stevedores            1        2.890
machinists            1        2.890
nativities            1        2.890
laugher               1        2.890
pregnancy             1        2.890


As well as some words that **occur in a few books**:

In [12]:
print("Examples of words that occur in some, but not all books:")
for w in df_sorted_words[5000:5010]:
    print("{:16s} {:6d} {:12.3f}".format(w, df[w], idf[w]))

Examples of words that occur in some, but not all books:
jokes                 6        1.099
roused                6        1.099
discerned             6        1.099
meditations           6        1.099
lists                 6        1.099
.,                    6        1.099
tax                   6        1.099
unkind                6        1.099
resisted              6        1.099
intolerable           6        1.099


### Tf-idf: combining tf with idf

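What happens when we multiply tf by idf to get **tf-idf scores for all the words in the documents**? Here tf is the log-scaled count 1 + log10(f), where f is the number of times the word occurs in the document. We display the 20 top-scoring words for each of the selected documents.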

In [13]:
# For each document (book) ...
for bookname in some_booknames:
    tf = {}
    tfidf = {}
    fdist = nltk.FreqDist(w.lower() for w in nltk.corpus.gutenberg.words(bookname))

    # For each word in the document ...
    for w, f in fdist.most_common():
        # Compute the log-scaled term frequency, 1 + log10(count):
        tf[w] = 1 + math.log10(f)
        # ... as well as the tf-idf score:
        tfidf[w] = tf[w] * idf[w]
    
    # Sort the words by tf-idf
    tfidf_sorted_words = sorted(tfidf.keys(), key=lambda w: tfidf[w], reverse=True)
    
    # Show the highest scoring words in this document
    print("Highest tf-idf scoring words in {:s}:".format(bookname))
    for w in tfidf_sorted_words[0:20]:
        print("{:16s} {:6.3f} (tf: {:.3f}, idf: {:.3f})".format(w, tfidf[w], tf[w], idf[w]))
    print()

Highest tf-idf scoring words in austen-emma.txt:
harriet          10.706 (tf: 3.704, idf: 2.890)
weston           10.531 (tf: 3.643, idf: 2.890)
knightley        10.376 (tf: 3.590, idf: 2.890)
elton            10.363 (tf: 3.585, idf: 2.890)
woodhouse        10.103 (tf: 3.496, idf: 2.890)
fairfax           9.775 (tf: 3.382, idf: 2.890)
hartfield         9.261 (tf: 3.204, idf: 2.890)
highbury          8.951 (tf: 3.097, idf: 2.890)
emma              8.651 (tf: 3.937, idf: 2.197)
randalls          8.539 (tf: 2.954, idf: 2.890)
cole              8.051 (tf: 2.785, idf: 2.890)
goddard           8.009 (tf: 2.771, idf: 2.890)
campbell          7.826 (tf: 2.708, idf: 2.890)
donwell           7.776 (tf: 2.690, idf: 2.890)
dixon             7.641 (tf: 2.643, idf: 2.890)
enscombe          7.389 (tf: 2.556, idf: 2.890)
churchill         7.361 (tf: 3.350, idf: 2.197)
campbells         6.980 (tf: 2.415, idf: 2.890)
bates             6.966 (tf: 3.170, idf: 2.197)
_she_             6.880 (tf: 2.380, idf
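The arithmetic behind a single score can be reproduced by hand. The sketch below wraps the two formulas from the cell above in a hypothetical helper, `tf_idf`, and evaluates it for a word that occurs 506 times in one book and in no other book. The count 506 is an assumption inferred from the printed tf of 3.704 for "harriet"; it is not read from the corpus.

```python
import math

def tf_idf(count, doc_freq, ndocs):
    """Return (tf, idf, tf-idf) using the same formulas as the notebook cell."""
    tf = 1 + math.log10(count)           # log-scaled term frequency (base 10)
    idf_value = math.log(ndocs / doc_freq)  # inverse document frequency (natural log)
    return tf, idf_value, tf * idf_value

# A word occurring 506 times in one book and in none of the other 17.
tf, idf_value, score = tf_idf(count=506, doc_freq=1, ndocs=18)
print("tf: {:.3f}, idf: {:.3f}, tf-idf: {:.3f}".format(tf, idf_value, score))

# A word occurring 1000 times, but present in all 18 books.
_, _, common_score = tf_idf(count=1000, doc_freq=18, ndocs=18)
print("tf-idf of a word found in every book: {:.3f}".format(common_score))
```

Note that tf uses a base-10 logarithm while idf uses the natural logarithm. This only rescales the scores by a constant factor per component and does not change the ranking within a document.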

Let's rerun the tf-idf calculations, but instead of showing the highest-scoring words in each document, we show the scores of some (arbitrarily) selected words:

In [14]:
selected_words = [ "god", "jesus", "rich", "handsome", "romantic", "secret",
                   "honey", "bear", "ahab", "white", "whale", "alice",
                   "wonderland", "king", "queen", "lazy", "sin", "paradise" ]

for bookname in some_booknames:
    tf = {}
    tfidf = {}
    fdist = nltk.FreqDist(w.lower() for w in nltk.corpus.gutenberg.words(bookname))
    for w, f in fdist.most_common():
        tf[w] = 1 + math.log10(f)
        tfidf[w] = tf[w] * idf[w]

    print("Tf-idf scores of some selected words in {:s}:".format(bookname))
    for w in selected_words:
        if w in tfidf:
            print("{:16s} {:5.3f} (tf: {:.3f}, idf: {:.3f})"
                  .format(w, tfidf[w], tf[w], idf[w]))
        else:
            print("{:16s} missing".format(w))
    print()

Tf-idf scores of some selected words in austen-emma.txt:
god              0.224 (tf: 1.903, idf: 0.118)
jesus            missing
rich             0.123 (tf: 2.146, idf: 0.057)
handsome         1.270 (tf: 2.580, idf: 0.492)
romantic         0.944 (tf: 1.000, idf: 0.944)
secret           0.000 (tf: 2.322, idf: 0.000)
honey            missing
bear             0.155 (tf: 2.716, idf: 0.057)
ahab             missing
white            0.217 (tf: 1.845, idf: 0.118)
whale            missing
alice            missing
wonderland       missing
king             0.237 (tf: 1.301, idf: 0.182)
queen            0.182 (tf: 1.000, idf: 0.182)
lazy             0.944 (tf: 1.000, idf: 0.944)
sin              0.693 (tf: 1.000, idf: 0.693)
paradise         missing

Tf-idf scores of some selected words in bible-kjv.txt:
god              0.548 (tf: 4.651, idf: 0.118)
jesus            5.114 (tf: 3.993, idf: 1.281)
rich             0.166 (tf: 2.908, idf: 0.057)
handsome         missing
romantic         missing
secr

Does this make sense?

How come the tf-idf score is zero for some words? What might be the consequences of this? Are there any weaknesses in this approach? 
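One possible direction to explore when thinking about these questions is a smoothed idf. The sketch below compares the plain idf used in this notebook with the variant idf = ln((1 + N) / (1 + df)) + 1, which is the formulation scikit-learn uses when `smooth_idf=True`. This variant is not part of the notebook; the helper names `idf_plain` and `idf_smooth` are assumptions for illustration.

```python
import math

def idf_plain(doc_freq, ndocs):
    # The idf used in this notebook: zero for words found in every document.
    return math.log(ndocs / doc_freq)

def idf_smooth(doc_freq, ndocs):
    # Smoothed variant (as in scikit-learn with smooth_idf=True):
    # never zero, so words found in every document keep a small weight.
    return math.log((1 + ndocs) / (1 + doc_freq)) + 1

ndocs = 18
for doc_freq in (18, 6, 1):
    print("df = {:2d}   plain: {:.3f}   smooth: {:.3f}".format(
        doc_freq, idf_plain(doc_freq, ndocs), idf_smooth(doc_freq, ndocs)))
```

With the plain formula, a word such as "secret" that occurs in all 18 books scores zero in every book, regardless of how often it is used. The smoothed variant keeps such words distinguishable by their term frequency, at the cost of giving very common function words a nonzero score as well.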