# Loading data as a collection of documents

In this chapter, we represent each document as a list of words.

With this representation, your own corpus and the corpora provided by NLTK can be processed in the same way.

In [None]:
import nltk

In [None]:
# Each line of the file is treated as one document and tokenized into a list of words.
with open('bbc.txt') as f:
    bbc_docs = [nltk.word_tokenize(line) for line in f]

In [None]:
len(bbc_docs)

In [None]:
bbc_docs[0]

In [None]:
from nltk.corpus import brown
brown_docs = [brown.words(fileid) for fileid in brown.fileids()]

In [None]:
len(brown_docs)

In [None]:
brown_docs[0]

# Counting Words

`nltk.FreqDist` is an object type that computes and stores the frequency of each element in a given collection.
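
Since `nltk.FreqDist` is a subclass of the standard library's `collections.Counter`, the counting behaviour can be sketched without NLTK. The token list and the name `toy_count` below are made up for illustration and are not part of the BBC data.

```python
from collections import Counter

# A made-up token list standing in for one tokenized document.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Counter maps each distinct token to the number of times it occurs.
toy_count = Counter(tokens)

print(toy_count["the"])          # frequency of a single token
print(toy_count["dog"])          # tokens that never occur count as 0
print(toy_count.most_common(2))  # the two most frequent tokens with their counts
```

`FreqDist` adds NLP-oriented helpers on top of this, but indexing and `most_common` work the same way.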

In [None]:
count = nltk.FreqDist(bbc_docs[0])

In [None]:
count

In [None]:
count.most_common(30)

### Case-insensitive count

Simply convert every word to lower case before counting.

In [None]:
count = nltk.FreqDist([w.lower() for w in bbc_docs[0]])

In [None]:
count.most_common(30)

# Word Filtering by Regular Expression

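The `re` library contains functions for working with regular expressions.

The regular expression `[a-zA-Z]+` matches one or more ASCII letters. Because `re.match` only anchors the pattern at the start of the string, the filter below keeps every token that begins with a letter.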
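
To see what the pattern accepts in practice, here is a small sketch comparing `re.match` with `re.fullmatch`. The sample tokens are invented for illustration; `re.fullmatch` is shown only as a possible stricter alternative, not as what the rest of this chapter uses.

```python
import re

pattern = "[a-zA-Z]+"
samples = ["election", "Tory", "2004", "mid-term", "3rd", "n't", "'s"]

# re.match succeeds if the pattern matches at the beginning of the token.
starts_with_letters = [w for w in samples if re.match(pattern, w)]

# re.fullmatch succeeds only if the entire token matches the pattern.
letters_only = [w for w in samples if re.fullmatch(pattern, w)]

print(starts_with_letters)
print(letters_only)
```

Tokens such as `mid-term` and `n't` pass the `re.match` filter because they start with a letter, while `re.fullmatch` would reject them.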

In [None]:
import re
count = nltk.FreqDist([w.lower() for w in bbc_docs[0] if re.match('[a-zA-Z]+', w)])
count.most_common(30)

# Stop Words

Stop words are words that are considered irrelevant to the topic of a document. The list is compiled by hand, without any theoretical justification.

In [None]:
from nltk.corpus import stopwords
# The file IDs of the stopwords corpus are lower-case, e.g. 'english'.
stopwords.words('english')

In [None]:
sws = set(stopwords.words('english'))

In [None]:
count = nltk.FreqDist([w.lower() for w in bbc_docs[0]
                       if re.match('[a-zA-Z]+', w) and w.lower() not in sws])
count.most_common(30)

# Stemming

In [None]:
count['republican']

In [None]:
count['republicans']

## Porter Stemmer

An algorithmic approach based on heuristics. It strips suffixes in five steps:

- Step 1 gets rid of plurals and -ed or -ing.
- Step 2 maps double suffixes to single ones, so -ization (= -ize plus -ation) maps to -ize, etc.
- Step 3 deals with -ic-, -full, -ness, etc.
- Step 4 takes off -ant, -ence, etc.
- Step 5 removes a final -e and changes -ll to -l.
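
As a rough illustration of what a single step looks like, the sketch below implements only the plural-stripping rules of step 1 (often called step 1a in Porter's description: SSES → SS, IES → I, SS → SS, S → nothing). The function name `porter_step_1a` is hypothetical, and NLTK's `PorterStemmer` applies further steps and some extensions of its own, so its output for a given word may differ.

```python
def porter_step_1a(word):
    """Apply the plural-stripping rules of Porter's step 1a to a lower-case word."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress stays unchanged
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word            # no plural suffix found


for w in ["caresses", "ponies", "caress", "cats", "sword"]:
    print(w, "->", porter_step_1a(w))
```

The rules are tried from the longest suffix to the shortest, so that `caress` is not mistaken for a plural.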

In [None]:
test_raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
test_tokens = nltk.word_tokenize(test_raw)

In [None]:
porter = nltk.PorterStemmer()
stemmed = [porter.stem(w) for w in test_tokens]
print(stemmed)

## Lancaster Stemmer (Paice/Husk Stemmer)

A fully rule-based approach that uses externally stored rules. For example, the rules include:

- -ied > -y
- -ceed > -cess
- -eed > -ee
- -ed > -
- -hood > -
- -e > -
- -lief > -liev
- -if > -
- -ing > -
- -iag > -y
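
The idea of a stored rule table can be sketched as follows. This is a deliberately simplified toy, not the Lancaster algorithm: it applies only the first matching rule once, whereas the actual stemmer applies its rules repeatedly and under additional conditions, so `nltk.LancasterStemmer` will generally give different results. The names `RULES` and `apply_first_rule` are hypothetical.

```python
# The ten example rules listed above, as (suffix, replacement) pairs in the same order.
RULES = [
    ("ied", "y"), ("ceed", "cess"), ("eed", "ee"), ("ed", ""), ("hood", ""),
    ("e", ""), ("lief", "liev"), ("if", ""), ("ing", ""), ("iag", "y"),
]


def apply_first_rule(word, rules=RULES):
    """Rewrite the ending of `word` using the first rule whose suffix matches."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule applies


for w in ["carried", "proceed", "agreed", "walked", "childhood", "belief", "lying"]:
    print(w, "->", apply_first_rule(w))
```

Because the rules live in a plain data structure, the behaviour can be changed by editing the table rather than the code, which is the main design point of this family of stemmers.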

In [None]:
lancaster = nltk.LancasterStemmer()
stemmed = [lancaster.stem(w) for w in test_tokens]
print(stemmed)

# Lemmatization

## WordNet Lemmatizer

A lemmatizer that uses WordNet as its reference dictionary.
WordNet is a dictionary that records semantic relationships between words, so it can also be used as a thesaurus.

In [None]:
wnl = nltk.WordNetLemmatizer()
lemmas = [wnl.lemmatize(w) for w in test_tokens]
print(lemmas)

In [None]:
count = nltk.FreqDist([wnl.lemmatize(w.lower()) for w in bbc_docs[0]
                       if re.match('[a-zA-Z]+', w) and w.lower() not in sws])
count.most_common(30)

In [None]:
count['republican']

In [None]:
count['republicans']

# Define your counting function

In [None]:
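def count_words(doc):
    """Count lower-cased, lemmatized words in doc, skipping stop words and tokens that do not start with a letter."""
    wnl = nltk.WordNetLemmatizer()
    return nltk.FreqDist([wnl.lemmatize(w.lower()) for w in doc
                          if re.match('[a-zA-Z]+', w) and w.lower() not in sws])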

In [None]:
count = count_words(brown_docs[0])
count.most_common(30)

# Finding Salient Words by using Likelihood Ratio

The important words in a document are those that are used frequently in that document but infrequently in *other* documents.
The **likelihood ratio** is a quantity that directly reflects this idea.

In our case, the likelihood ratio is the ratio of two likelihoods:
- The likelihood of a word $w$ under the word distribution estimated from **the given document $d$**, denoted by $\tilde{p}_d(w)$.
- The likelihood of a word $w$ under the word distribution that governs **every possible document in the world**, denoted by $p_{all}(w)$.
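
Before turning to the real corpora, the ratio can be worked through on a toy example. The token lists are invented, and the distributions are assumed to be estimated by plain relative frequency (count divided by total), which is the same kind of estimate this chapter later obtains with `nltk.MLEProbDist`.

```python
from collections import Counter

# Made-up token lists: one "document" and a small "corpus" that contains it.
doc = ["tax", "vote", "tax", "the", "the"]
others = ["the", "the", "the", "game", "vote", "the", "film", "the", "the", "music"]
corpus = doc + others


def relative_freq(tokens):
    """Estimate a word distribution by relative frequency: count / total."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}


p_doc = relative_freq(doc)
p_all = relative_freq(corpus)

# A ratio above 1 means the word is relatively more frequent in the document
# than in the corpus as a whole.
toy_ratios = {w: p_doc[w] / p_all[w] for w in p_doc}
for w, r in sorted(toy_ratios.items(), key=lambda x: -x[1]):
    print(w, round(r, 2))
```

Here `tax` gets the highest ratio because it appears only in the document, while `the` falls below 1 because it is even more common elsewhere.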

### Estimating the *general* word distribution

We want to know the *true* word distribution $p_{all}$ that all text in the world follows.
But we only have a limited sample of that text, so we *estimate* the distribution from the corpus and denote the estimate by $\tilde{p}_{all}$.

In [None]:
all_docs = brown_docs + bbc_docs
all_count = count_words([w for doc in all_docs for w in doc])

In [None]:
all_count.most_common(30)

### Maximum Likelihood Estimation for $\tilde{p}_{all}$

In [None]:
all_dist = nltk.MLEProbDist(all_count)

In [None]:
all_dist.prob('one')

In [None]:
all_dist.prob('republican')

### Maximum Likelihood Estimation for $\tilde{p}_{d}$

In [None]:
dist = nltk.MLEProbDist(count_words(bbc_docs[0]))

In [None]:
dist.prob('one')

In [None]:
dist.prob('republican')

### Computing the Likelihood Ratio $\frac{\tilde{p}_{d}(w)}{\tilde{p}_{all}(w)}$ for $w = $ `republican`

In [None]:
dist.prob('republican') / all_dist.prob('republican')

### Computing the Likelihood Ratio $\frac{\tilde{p}_{d}(w)}{\tilde{p}_{all}(w)}$ for all $w$

In [None]:
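# Iterate over the words of the document itself: each of them also occurs in
# all_count, so the denominator is never zero under maximum likelihood estimation.
ratios = [(w, dist.prob(w) / all_dist.prob(w)) for w in dist.samples()]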

In [None]:
sorted(ratios, key=lambda x: -x[1])

### Laplace Estimation

Laplace estimation is a way of estimating the word distribution that simply adds 1 to the frequency of every word.
This mitigates the *peaky* estimates caused by a small corpus. If the corpus is sufficiently large, the additional counts have almost no effect on the estimate.
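
Concretely, if a word occurs $c(w)$ times among $N$ tokens and the vocabulary has $B$ entries, the add-one estimate is $\frac{c(w) + 1}{N + B}$. The sketch below computes this by hand on invented data; the names `laplace_prob`, `toy_counts`, and `toy_vocab` are hypothetical and only serve the illustration.

```python
from collections import Counter


def laplace_prob(counts, word, n_bins):
    """Add-one estimate: (count + 1) / (total count + number of bins)."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + n_bins)


toy_counts = Counter(["tax", "vote", "tax", "budget"])
toy_vocab = {"tax", "vote", "budget", "film", "music"}  # includes unseen words

for w in sorted(toy_vocab):
    print(w, laplace_prob(toy_counts, w, len(toy_vocab)))

# The smoothed probabilities over the whole vocabulary still sum to 1
# (up to floating-point rounding).
print(sum(laplace_prob(toy_counts, w, len(toy_vocab)) for w in toy_vocab))
```

Unseen words such as `film` now receive a small non-zero probability, which is what keeps a likelihood ratio well defined for every word in the vocabulary.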

In [None]:
# all_count was built from both the Brown and the BBC documents, so its keys
# already cover every word that count_words() can produce from either corpus.
vocab = set(all_count.keys())
n_vocab = len(vocab)

In [None]:
all_dist = nltk.LaplaceProbDist(all_count, bins=n_vocab)

In [None]:
all_dist.prob('one')

In [None]:
dist = nltk.LaplaceProbDist(count_words(bbc_docs[0]), bins=n_vocab)

In [None]:
dist.prob('one')

In [None]:
ratios = [(w, dist.prob(w) / all_dist.prob(w)) for w in vocab]

In [None]:
sorted(ratios, key=lambda x: -x[1])