# Chapter 5: Words and Counting
## Demo for analyzing word counts in a document using Python and WordNet
### Code accompanies Section 5.1 Word Counts and Text Analysis

This notebook demonstrates basic word-counting techniques in Python. The sample code uses the built-in Text objects that ship with NLTK.

In [1]:
# imports used in the notebook

from nltk.book import *
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### Preprocessing

The built-in texts are already tokenized and stored as NLTK Text objects.

A minimal amount of preprocessing is done on text4, the Inaugural Address Corpus:
* lowercase the tokens (text4 is already tokenized)
* count the number of tokens with len()
* count the number of unique tokens with set()


In [3]:
# lowercase the text
tokens4 = [t.lower() for t in text4]

print("\nThe number of tokens in text4: ", len(tokens4))
set4 = set(tokens4)
print("\nThe number of unique tokens in text4:", len(set4))
print("\nThe first 5 unique tokens in text4:", sorted(set4)[:5])




The number of tokens in text4:  149797

The number of unique tokens in text4: 9216

The first 5 unique tokens in text4: ['!', '"', '";', '"?', '$']


### More Preprocessing

In the output above, it seems there are a lot of tokens that are punctuation. Let's do more preprocessing:

* keep only the tokens that are alphabetic and are not stopwords
* create the counts again
* display the first few

In [5]:
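# get rid of punctuation and stopwords
# build the stopword set once; calling stopwords.words() for every token is slow
stop_set = set(stopwords.words('english'))
tokens4 = [t for t in tokens4 if t.isalpha() and t not in stop_set]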
print("\nThe number of important words in text4:", len(tokens4))
print("\nThe number of unique important words in text4:", len(set(tokens4)))
print("\nThe first 10 important words in text4:", tokens4[:10])



The number of important words in text4: 64336

The number of unique important words in text4: 8973

The first 10 important words in text4: ['fellow', 'citizens', 'senate', 'house', 'representatives', 'among', 'vicissitudes', 'incident', 'life', 'event']


### Lexical diversity

There are many definitions and formulas for lexical diversity, but they all try to measure how diverse or limited the vocabulary is. Here is one formula, the ratio of unique tokens to total tokens:

In [6]:
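# lexical diversity: unique important words divided by total important words
# (set4 was built before filtering, so the set is rebuilt from the filtered tokens4)
print("\nLexical diversity: %.2f" % (len(set(tokens4)) / len(tokens4)))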


Lexical diversity: 0.14
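
The ratio above can be wrapped in a small helper so it is easy to reuse on other token lists. This is a minimal sketch, not code from the book: the function name `lexical_diversity` and the toy token list are assumptions made for illustration.

```python
def lexical_diversity(tokens):
    """Return unique tokens divided by total tokens (hypothetical helper)."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty document
    return len(set(tokens)) / len(tokens)


# toy token list, not taken from text4: 10 tokens, 5 of them unique
sample = ["we", "the", "people", "of", "the",
          "nation", "we", "the", "people", "of"]
print("Lexical diversity: %.2f" % lexical_diversity(sample))  # Lexical diversity: 0.50
```

With the notebook's own data the call would be `lexical_diversity(tokens4)`.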


### Lemmas

A lemma is the dictionary (base) form of a word. The next chunk of code reduces the tokens to their lemmas, in order to get a better picture of the kinds of things these documents are 'about'.

Future notebooks look at WordNet in more detail; this notebook uses only its lemmatizer.

In [7]:
# get the lemmas
wnl = WordNetLemmatizer()
lemmas = [wnl.lemmatize(t) for t in tokens4]
# make unique
lemmas_unique = list(set(lemmas))
print("\nThe number of unique lemmas in text4: ", len(lemmas_unique))


The number of unique lemmas in text4:  7935


### Dictionary of counts

How common is each lemma? We can make a dictionary where the key is the lemma and the value is a count of how many times tokens in the documents have that lemma.

In [8]:
# make a dictionary of counts
counts = {t:lemmas.count(t) for t in lemmas_unique}
print('citizen', counts['citizen'])

citizen 303
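
The dictionary comprehension above calls `lemmas.count()` once per unique lemma, so it rescans the whole list each time. As an alternative sketch, `collections.Counter` from the standard library builds the same lemma-to-count mapping in a single pass. The toy lemma list and the name `toy_counts` are assumptions for illustration, not data from text4.

```python
from collections import Counter

# toy lemma list, not taken from text4
toy_lemmas = ["citizen", "nation", "citizen", "state", "nation", "citizen"]

# Counter is a dict subclass: keys are lemmas, values are their counts
toy_counts = Counter(toy_lemmas)
print('citizen', toy_counts['citizen'])  # citizen 3
```

Because a Counter behaves like a dict, code that later calls `.items()` on the counts should work unchanged.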


### Print the least and most common words

The book goes into detail about this line: `sorted_counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)`

Here are the key points:
* `sorted()` returns a list of tuples, `[('citizen', 303), ...]`, because the dict itself is not ordered by count
* `key=lambda x: x[1]` means to sort on the second value of each tuple, which is the count
* `reverse=True` means sort from high count to low count
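
To see the three points above in isolation, here is a small sketch on a three-entry dictionary. The dictionary `small_counts` and its values are made up for illustration.

```python
# toy counts, not taken from text4
small_counts = {"state": 2, "citizen": 5, "journal": 1}

# items() yields (lemma, count) pairs; x[1] picks the count as the sort key
pairs = sorted(small_counts.items(), key=lambda x: x[1], reverse=True)
print(pairs)      # [('citizen', 5), ('state', 2), ('journal', 1)]
print(pairs[0])   # most common: ('citizen', 5)
print(pairs[-1])  # least common: ('journal', 1)
```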



In [9]:
# print the 5 most common and 5 least common words
# a dict is not ordered by value, so we sort its items into a list of tuples
sorted_counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)
print("5 most common words:")
for i in range(5):
    print(sorted_counts[i])

print("\n5 least common words:")
for i in range(-1,-6, -1):
    print(sorted_counts[i])

5 most common words:
('government', 651)
('people', 623)
('nation', 515)
('u', 478)
('state', 442)

5 least common words:
('childish', 1)
('trim', 1)
('unrepealed', 1)
('journal', 1)
('fifth', 1)


### NLP is never perfect but improvements can be made

The code above discovered that 'u' was in the 5 most common lemmas. The reality of NLP projects is:

* NLP results are not perfect because language is messy.
* NLP results are not perfect because the available tools are not perfect.
* NLP is perfectible, meaning that results can be incrementally improved with hard work, patience, and persistence.

The remaining code blocks in this notebook do some detective work to see what happened with that 'u'.

In [10]:
# find all words of length 1 or 2 that start with u
x = set([t for t in text4 if t.startswith('u') and len(t) < 3])
x


{'up', 'us'}

In [11]:
# what happens when 'us' is lemmatized?
wnl.lemmatize('us')

'u'

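Aha. 'Us' is the problem. WordNetLemmatizer treats every word as a noun unless told otherwise, so it appears to strip the final 's' as though 'us' were a plural. There are a few possible ways to deal with this: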

* Use a customized list of stop words, and include 'us'.
* Remove words of length 1 from the set of unique lemmas. 
* Try different lemmatizers to see if better results can be achieved.

Adding to the list of stopwords is straightforward:

```
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words += ['us', 'may', 'must', 'every', 'one']  # add 'us' and other stop words
```
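
The first two remedies listed above can be combined in one helper. The following is a rough sketch under stated assumptions, not the book's code: `clean_lemmas`, its `min_length` parameter, and `toy_lemmatize` are hypothetical names, and `toy_lemmatize` is only a stand-in that imitates the 'us' to 'u' behavior so the example runs without NLTK data.

```python
def clean_lemmas(tokens, stop_words, lemmatize, min_length=2):
    """Filter tokens, lemmatize them, and drop very short lemmas (hypothetical helper)."""
    stop_set = set(stop_words)  # set membership tests are fast
    kept = [t for t in tokens if t.isalpha() and t not in stop_set]
    lemmas = [lemmatize(t) for t in kept]
    # drop leftovers such as 'u' that slipped past the stop word list
    return [lem for lem in lemmas if len(lem) >= min_length]


def toy_lemmatize(word):
    # stand-in for wnl.lemmatize: naively strips a final 's'
    return word[:-1] if word.endswith("s") else word


# toy tokens and stop words, not taken from text4 or from NLTK
toy_tokens = ["let", "us", "unite", "the", "states", ",", "as", "citizens"]
toy_stop_words = ["the", "as", "us"]
print(clean_lemmas(toy_tokens, toy_stop_words, toy_lemmatize))
# ['let', 'unite', 'state', 'citizen']
```

In the notebook itself, the call would presumably look like `clean_lemmas(tokens4, stop_words, wnl.lemmatize)`. Even if 'us' is left out of the stop word list, the length filter still removes the resulting 'u'.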