### Tf-idf demo on the books in the Gutenberg corpus in NLTK

Let's compute tf-idf scores for words occurring in the different books in the NLTK version of the Gutenberg corpus.

First we take a look at what books there are and how many words each of them contains:

In [1]:
import nltk

booknames = nltk.corpus.gutenberg.fileids()

for name in booknames:
    print("{:s} contains {:d} word tokens.".format(name, len(nltk.corpus.gutenberg.words(name))))

austen-emma.txt contains 192427 word tokens.
austen-persuasion.txt contains 98171 word tokens.
austen-sense.txt contains 141576 word tokens.
bible-kjv.txt contains 1010654 word tokens.
blake-poems.txt contains 8354 word tokens.
bryant-stories.txt contains 55563 word tokens.
burgess-busterbrown.txt contains 18963 word tokens.
carroll-alice.txt contains 34110 word tokens.
chesterton-ball.txt contains 96996 word tokens.
chesterton-brown.txt contains 86063 word tokens.
chesterton-thursday.txt contains 69213 word tokens.
edgeworth-parents.txt contains 210663 word tokens.
melville-moby_dick.txt contains 260819 word tokens.
milton-paradise.txt contains 96825 word tokens.
shakespeare-caesar.txt contains 25833 word tokens.
shakespeare-hamlet.txt contains 37360 word tokens.
shakespeare-macbeth.txt contains 23140 word tokens.
whitman-leaves.txt contains 154883 word tokens.


Let's look at how many word tokens and word types there are in all books combined, starting with the tokens.

In [2]:
ntot_tokens = sum(len(nltk.corpus.gutenberg.words(name)) for name in booknames)
print("Total number of word tokens:", ntot_tokens)

Total number of word tokens: 2621613


Can we handle all word types or do we need to drop some?

In [3]:
ntot_types = len(set(w.lower() for bookname in booknames for w in nltk.corpus.gutenberg.words(bookname)))
print("Total number of word types:", ntot_types)

Total number of word types: 42339


With roughly 42,000 word types, the vocabulary is small enough that we can keep all of them.
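
To make the token/type distinction concrete, here is a minimal sketch on a made-up sentence. The `toy_tokens` list is an assumption for illustration only and is not taken from the corpus; lowercasing mirrors what the cells in this notebook do.

```python
# Hypothetical toy token list (not corpus data).
toy_tokens = ["The", "cat", "saw", "the", "other", "cat", "."]

# Tokens: every occurrence counts, including punctuation.
n_tokens = len(toy_tokens)

# Types: distinct forms after lowercasing, as in the notebook cells.
n_types = len(set(t.lower() for t in toy_tokens))

print("tokens:", n_tokens, "types:", n_types)
```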

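Let's look at which words are the most common in the corpus as a whole (**term frequencies** in the entire corpus). Note that the NLTK tokenization treats punctuation marks as word tokens, so they show up in the counts as well.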

In [4]:
fdist = nltk.FreqDist(w.lower() for bookname in booknames for w in nltk.corpus.gutenberg.words(bookname))

print("The 100 most common word types in the Gutenberg corpus are:")
for w, f in fdist.most_common(100):
    print(w, f)

The 100 most common word types in the Gutenberg corpus are:
, 186091
the 133583
and 95442
. 73746
of 71267
to 48057
: 47406
a 33960
in 33580
i 30265
that 28798
; 27329
he 25857
it 22303
his 21402
' 19873
for 19527
was 18715
with 17599
not 17373
is 16435
you 16398
be 16115
" 15422
as 14528
but 13944
all 13725
they 13104
him 13012
shall 11682
her 11552
my 10511
had 10317
them 10245
have 10116
s 9896
me 9469
said 9429
at 9174
she 9078
from 9076
unto 9010
this 8915
- 8850
which 8775
on 8596
lord 8590
by 8512
their 7806
so 7791
will 7368
? 7342
were 6871
are 6854
thou 6759
there 6380
when 6330
or 6325
one 6218
no 5846
what 5792
out 5771
." 5751
! 5730
man 5615
thy 5609
-- 5445
god 5287
if 5245
we 5134
your 5121
then 5091
up 5023
an 4897
thee 4807
do 4707
," 4702
now 4677
ye 4674
upon 4587
who 4262
into 4098
more 4062
would 4046
very 4009
come 3642
could 3594
like 3468
before 3449
been 3410
any 3357
came 3337
day 3326
king 3149
did 3121
little 3065
know 3057
am 2987
than 2908
house 2900


What about **document frequencies**? For every word type, we count the number of documents (= books) it occurs in:

In [5]:
from collections import defaultdict

df = defaultdict(int)

for name in booknames:
    for word in set(w.lower() for w in nltk.corpus.gutenberg.words(name)):
        df[word] += 1
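
The same counting idea can be tried on a tiny example. This is only a sketch: the `toy_docs` dictionary is made up, and `collections.Counter` is used here as an alternative to the `defaultdict` in the cell above, not as a replacement for it.

```python
from collections import Counter

# Hypothetical toy "books", each already tokenized and lowercased.
toy_docs = {
    "doc_a": ["the", "whale", "the", "sea"],
    "doc_b": ["the", "king", "sea"],
    "doc_c": ["the", "queen"],
}

toy_df = Counter()
for tokens in toy_docs.values():
    # set() makes sure each word counts at most once per document.
    toy_df.update(set(tokens))

print(toy_df["the"], toy_df["sea"], toy_df["whale"])
```

Converting each document to a set first is what turns a term count into a document count: "the" occurs twice in `doc_a`, but still contributes only one to its document frequency.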

Next we sort the words by their document frequencies:

In [6]:
df_sorted_words = sorted(df.keys(), key=lambda w: df[w], reverse=True)

Let's take a look at some words that occur in _all_ the books:

In [7]:
print("Examples of words that occur in all books:")
for w in df_sorted_words[0:10]:
    print(w, df[w])

Examples of words that occur in all books:
' 18
those 18
by 18
then 18
death 18
all 18
foot 18
alone 18
fast 18
where 18


And some words that occur in only one book:

In [8]:
print("Examples of words that occur in only one book:")
for w in df_sorted_words[-10:]:
    print(w, df[w])

Examples of words that occur in only one book:
democracies 1
perusing 1
manhattan 1
constipation 1
desolated 1
venereal 1
leadballs 1
palmerin 1
ejaculate 1
inauguration 1


And some words from the middle of the ranking, which occur in several books but not in all of them:

In [9]:
print("Examples of words that occur in some, but not all books:")
for w in df_sorted_words[5000:5010]:
    print(w, df[w])

Examples of words that occur in some, but not all books:
improving 6
scrape 6
rack 6
clearness 6
pressure 6
praising 6
provocation 6
writer 6
pardoned 6
differ 6


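From the document frequencies we compute the corresponding **inverse document frequency** (idf) values: the natural logarithm of the total number of documents divided by the document frequency of the word.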

In [10]:
import math

ndocs = len(booknames)
idf = defaultdict(float)

for w in df.keys():
    idf[w] = math.log(ndocs / df[w])

What do the idf values look like for words that occur in all documents?

In [11]:
print("Show the IDF values of some words that occur in all books:")
for w in df_sorted_words[0:10]:
    print(w, idf[w])

Show the IDF values of some words that occur in all books:
' 0.0
those 0.0
by 0.0
then 0.0
death 0.0
all 0.0
foot 0.0
alone 0.0
fast 0.0
where 0.0


Or words that occur in one document only?

In [12]:
print("Show the IDF values of some words that occur in only one book:")
for w in df_sorted_words[-10:]:
    print(w, idf[w])

Show the IDF values of some words that occur in only one book:
democracies 2.8903717578961645
perusing 2.8903717578961645
manhattan 2.8903717578961645
constipation 2.8903717578961645
desolated 2.8903717578961645
venereal 2.8903717578961645
leadballs 2.8903717578961645
palmerin 2.8903717578961645
ejaculate 2.8903717578961645
inauguration 2.8903717578961645


Or somewhere in between?

In [13]:
print("Show the IDF values of some words that occur in some, but not all books:")
for w in df_sorted_words[5000:5010]:
    print(w, idf[w])

Show the IDF values of some words that occur in some, but not all books:
improving 1.0986122886681098
scrape 1.0986122886681098
rack 1.0986122886681098
clearness 1.0986122886681098
pressure 1.0986122886681098
praising 1.0986122886681098
provocation 1.0986122886681098
writer 1.0986122886681098
pardoned 1.0986122886681098
differ 1.0986122886681098
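
The three idf values seen so far follow directly from the formula with 18 books. As a small check, the sketch below recomputes them for document frequencies 18, 6 and 1; the name `idf_by_df` is introduced here just for illustration.

```python
import math

ndocs = 18  # number of books in the NLTK Gutenberg corpus, as listed above

# idf for a word found in all books, in 6 books, and in a single book.
idf_by_df = {doc_freq: math.log(ndocs / doc_freq) for doc_freq in (18, 6, 1)}

for doc_freq, value in idf_by_df.items():
    print(doc_freq, value)
```

Since the idf depends only on the document frequency, all words with the same document frequency share exactly the same idf value, which is why the lists above look so uniform.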


For the calculation of the tf-idf scores we also need the *document-specific term frequencies*. Let's look at the 20 most frequent words in each of the documents. Do we see any big differences between the documents?

In [14]:
for name in booknames:
    fdist = nltk.FreqDist(w.lower() for w in nltk.corpus.gutenberg.words(name))
    print("Most frequent words in {:s}:".format(name))
    for w, f in fdist.most_common(20):
        print(w, f)
    print()

Most frequent words in austen-emma.txt:
, 11454
. 6928
to 5239
the 5201
and 4896
of 4291
i 3178
a 3129
it 2528
her 2469
was 2398
she 2340
; 2199
in 2188
not 2140
" 2004
you 1980
be 1975
that 1806
he 1806

Most frequent words in austen-persuasion.txt:
, 6750
the 3329
to 2808
and 2801
. 2741
of 2570
a 1595
in 1389
was 1337
; 1290
her 1204
had 1186
she 1146
i 1124
it 1038
he 961
be 950
not 934
that 882
as 809

Most frequent words in austen-sense.txt:
, 9397
to 4116
the 4105
. 3975
of 3572
and 3491
her 2551
a 2092
i 2004
in 1979
was 1861
it 1757
she 1613
" 1506
; 1419
that 1385
be 1305
for 1262
not 1248
as 1221

Most frequent words in bible-kjv.txt:
, 70509
the 64023
and 51696
: 43766
of 34670
. 26160
to 13580
that 12912
in 12667
he 10419
; 10084
shall 9838
unto 8997
for 8971
i 8854
his 8473
a 8177
lord 7964
they 7376
be 7013

Most frequent words in blake-poems.txt:
, 680
the 439
and 348
. 201
of 146
in 141
i 130
a 128
to 111
' 104
; 98
my 83
: 75
with 66
? 65
! 59
his 57
he 56
is 52
" 51


What does the picture look like when we compute **tf-idf** scores for the words in each document and compare? We display the 20 top-scoring words for each document.
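
As a warm-up, the scoring used in the cell below can be tried on a single word first. This is a minimal sketch under the weighting this notebook uses (term frequency scaled as 1 + log10 of the count, idf as the natural log of the document ratio); the function name `tfidf_score` and the counts passed to it are hypothetical.

```python
import math

def tfidf_score(count, doc_freq, ndocs):
    """Log-scaled tf times idf; `count` and `doc_freq` must be at least 1."""
    tf = 1 + math.log10(count)
    idf = math.log(ndocs / doc_freq)
    return tf * idf

# Hypothetical word seen 100 times in one book, in a corpus of 18 books.
unique_to_book = tfidf_score(100, 1, 18)    # occurs in this book only
in_every_book = tfidf_score(100, 18, 18)    # occurs in all 18 books

print(unique_to_book, in_every_book)
```

The two calls use the same count, so any difference in the score comes from the idf factor alone.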

In [15]:
# For each document (book) ...
for bookname in booknames:
    tf = {}
    tfidf = {}
    fdist = nltk.FreqDist(w.lower() for w in nltk.corpus.gutenberg.words(bookname))

    # For each word in the document ...
    for w, f in fdist.most_common():
        # Compute the log-scaled term frequency (base-10 log; f >= 1, so tf >= 1):
        tf[w] = 1 + math.log10(f)
        # ... as well as the tf-idf score:
        tfidf[w] = tf[w] * idf[w]
    
    # Sort the words by tf-idf
    tfidf_sorted_words = sorted(tfidf.keys(), key=lambda w: tfidf[w], reverse=True)
    
    # Show the highest scoring words in this document
    print("Highest tf-idf scoring words in {:s}:".format(bookname))
    for w in tfidf_sorted_words[0:20]:
        print("{:s}: {:.3f} (tf: {:.3f}, idf: {:.3f})".format(w, tfidf[w], tf[w], idf[w]))
    print()

Highest tf-idf scoring words in austen-emma.txt:
harriet: 10.706 (tf: 3.704, idf: 2.890)
weston: 10.531 (tf: 3.643, idf: 2.890)
knightley: 10.376 (tf: 3.590, idf: 2.890)
elton: 10.363 (tf: 3.585, idf: 2.890)
woodhouse: 10.103 (tf: 3.496, idf: 2.890)
fairfax: 9.775 (tf: 3.382, idf: 2.890)
hartfield: 9.261 (tf: 3.204, idf: 2.890)
highbury: 8.951 (tf: 3.097, idf: 2.890)
emma: 8.651 (tf: 3.937, idf: 2.197)
randalls: 8.539 (tf: 2.954, idf: 2.890)
cole: 8.051 (tf: 2.785, idf: 2.890)
goddard: 8.009 (tf: 2.771, idf: 2.890)
campbell: 7.826 (tf: 2.708, idf: 2.890)
donwell: 7.776 (tf: 2.690, idf: 2.890)
dixon: 7.641 (tf: 2.643, idf: 2.890)
enscombe: 7.389 (tf: 2.556, idf: 2.890)
churchill: 7.361 (tf: 3.350, idf: 2.197)
campbells: 6.980 (tf: 2.415, idf: 2.890)
bates: 6.966 (tf: 3.170, idf: 2.197)
_she_: 6.880 (tf: 2.380, idf: 2.890)

Highest tf-idf scoring words in austen-persuasion.txt:
elliot: 10.003 (tf: 3.461, idf: 2.890)
wentworth: 9.649 (tf: 3.338, idf: 2.890)
musgrove: 9.000 (tf: 3.114, idf

Highest tf-idf scoring words in chesterton-thursday.txt:
syme: 10.738 (tf: 3.715, idf: 2.890)
marquis: 8.241 (tf: 2.851, idf: 2.890)
gogol: 7.423 (tf: 2.568, idf: 2.890)
anarchists: 7.279 (tf: 2.519, idf: 2.890)
ratcliffe: 6.931 (tf: 2.398, idf: 2.890)
renard: 6.712 (tf: 2.322, idf: 2.890)
gregory: 6.700 (tf: 3.049, idf: 2.197)
ducroix: 6.110 (tf: 2.114, idf: 2.890)
anarchist: 5.786 (tf: 2.633, idf: 2.197)
chairman: 5.648 (tf: 1.954, idf: 2.890)
witherspoon: 5.648 (tf: 1.954, idf: 2.890)
dynamiter: 5.501 (tf: 1.903, idf: 2.890)
anarchism: 5.333 (tf: 1.845, idf: 2.890)
beastly: 5.333 (tf: 1.845, idf: 2.890)
eustache: 5.333 (tf: 1.845, idf: 2.890)
cabs: 5.333 (tf: 1.845, idf: 2.890)
embankment: 5.140 (tf: 1.778, idf: 2.890)
secretary: 5.064 (tf: 2.826, idf: 1.792)
dynamiters: 4.911 (tf: 1.699, idf: 2.890)
bomb: 4.781 (tf: 2.176, idf: 2.197)

Highest tf-idf scoring words in edgeworth-parents.txt:
cecilia: 9.374 (tf: 3.243, idf: 2.890)
piedro: 9.269 (tf: 3.207, idf: 2.890)
jem: 8.941 (tf: 

Let's rerun the tf-idf calculations, but instead of showing the highest-scoring words in each document, we show the scores of some (arbitrarily) selected words:

In [16]:
selected_words = [ "god", "jesus", "rich", "handsome", "romantic", "secret",
                   "honey", "bear", "ahab", "white", "whale", "alice",
                   "wonderland", "king", "queen", "lazy", "sin", "paradise" ]

for bookname in booknames:
    tf = {}
    tfidf = {}
    fdist = nltk.FreqDist(w.lower() for w in nltk.corpus.gutenberg.words(bookname))
    for w, f in fdist.most_common():
        tf[w] = 1 + math.log10(f)
        tfidf[w] = tf[w] * idf[w]

    print("Tf-idf scores of some selected words in {:s}:".format(bookname))
    for w in selected_words:
        if w in tfidf:
            print("{:s}: {:.3f} (tf: {:.3f}, idf: {:.3f})".format(w, tfidf[w], tf[w], idf[w]))
        else:
            print("{:s}: missing".format(w))
    print()

Tf-idf scores of some selected words in austen-emma.txt:
god: 0.224 (tf: 1.903, idf: 0.118)
jesus: missing
rich: 0.123 (tf: 2.146, idf: 0.057)
handsome: 1.270 (tf: 2.580, idf: 0.492)
romantic: 0.944 (tf: 1.000, idf: 0.944)
secret: 0.000 (tf: 2.322, idf: 0.000)
honey: missing
bear: 0.155 (tf: 2.716, idf: 0.057)
ahab: missing
white: 0.217 (tf: 1.845, idf: 0.118)
whale: missing
alice: missing
wonderland: missing
king: 0.237 (tf: 1.301, idf: 0.182)
queen: 0.182 (tf: 1.000, idf: 0.182)
lazy: 0.944 (tf: 1.000, idf: 0.944)
sin: 0.693 (tf: 1.000, idf: 0.693)
paradise: missing

Tf-idf scores of some selected words in austen-persuasion.txt:
god: 0.236 (tf: 2.000, idf: 0.118)
jesus: missing
rich: 0.117 (tf: 2.041, idf: 0.057)
handsome: 1.122 (tf: 2.279, idf: 0.492)
romantic: 0.944 (tf: 1.000, idf: 0.944)
secret: 0.000 (tf: 1.954, idf: 0.000)
honey: missing
bear: 0.133 (tf: 2.322, idf: 0.057)
ahab: missing
white: 0.200 (tf: 1.699, idf: 0.118)
whale: missing
alice: missing
wonderland: missing
king:

Tf-idf scores of some selected words in shakespeare-hamlet.txt:
god: 0.284 (tf: 2.415, idf: 0.118)
jesus: missing
rich: 0.084 (tf: 1.477, idf: 0.057)
handsome: missing
romantic: missing
secret: 0.000 (tf: 1.602, idf: 0.000)
honey: missing
bear: 0.057 (tf: 1.000, idf: 0.057)
ahab: missing
white: 0.189 (tf: 1.602, idf: 0.118)
whale: 1.429 (tf: 1.301, idf: 1.099)
alice: missing
wonderland: missing
king: 0.590 (tf: 3.236, idf: 0.182)
queen: 0.347 (tf: 1.903, idf: 0.182)
lazy: missing
sin: missing
paradise: missing

Tf-idf scores of some selected words in shakespeare-macbeth.txt:
god: 0.253 (tf: 2.146, idf: 0.118)
jesus: missing
rich: 0.057 (tf: 1.000, idf: 0.057)
handsome: missing
romantic: missing
secret: 0.000 (tf: 1.477, idf: 0.000)
honey: missing
bear: 0.057 (tf: 1.000, idf: 0.057)
ahab: missing
white: 0.118 (tf: 1.000, idf: 0.118)
whale: missing
alice: missing
wonderland: missing
king: 0.500 (tf: 2.740, idf: 0.182)
queen: missing
lazy: missing
sin: missing
paradise: missing

Tf-idf sc

Does this make sense?

How come the tf-idf score is zero for some words? What might be the consequences of this? Are there any weaknesses in this approach? 
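
One way to explore these questions is to compare the plain idf used in this demo with a smoothed variant. The sketch below is not part of the original demo: the smoothed formula, log((1 + N) / (1 + df)) + 1, is just one convention among several (it is the one scikit-learn documents for `smooth_idf=True`), and the function names are hypothetical.

```python
import math

def plain_idf(doc_freq, ndocs):
    # The definition used throughout this demo.
    return math.log(ndocs / doc_freq)

def smoothed_idf(doc_freq, ndocs):
    # Assumed alternative: add-one smoothing plus a constant offset of 1.
    return math.log((1 + ndocs) / (1 + doc_freq)) + 1

ndocs = 18
for doc_freq in (18, 6, 1):
    print(doc_freq,
          round(plain_idf(doc_freq, ndocs), 3),
          round(smoothed_idf(doc_freq, ndocs), 3))
```

With the plain definition, a word that occurs in every book gets idf 0, so its tf-idf score is 0 no matter how often it occurs in a particular book. The smoothed variant keeps such words at a small positive weight while preserving the ordering of rarer words above more common ones.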