# NLTK Complete Tutorial

**Summary:**
This notebook offers a complete tutorial on NLTK, covering text preprocessing, tokenization, tagging, and analysis for English texts.

This comprehensive notebook covers text mining and natural language processing using NLTK (Natural Language Toolkit).

## Overview

NLTK is a platform for building Python programs to analyze natural language data. It provides tools for:
- Text exploration and analysis
- Tokenization and text processing
- Finding collocations (word pairs)
- Part-of-speech tagging
- Named entity recognition
- Text classification using machine learning

## Contents

1. **Introduction & Setup** - Import libraries and download necessary NLTK data
2. **Part 1: Text Exploration** - Concordance, similarity, dispersion plots, lexical diversity
3. **Part 2: Tokenization & Text Processing** - Word and sentence tokenization
4. **Part 3: Collocations** - Finding bigrams with PMI (Pointwise Mutual Information)
5. **Part 4: Parts of Speech Tagging** - Extracting and visualizing nouns
6. **Part 5: Named Entity Recognition** - Extracting people, places, and organizations
7. **Part 6: Text Classification** - Gender classification and criminal conviction categorization

---

# 1. Introduction & Setup

First, we'll import all necessary libraries and download the NLTK data packages we need.

In [None]:
# Standard library imports
import csv
import random

# Third-party imports
import requests

# NLTK imports
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import names

# Visualization imports
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
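# Download all required NLTK data
# Note: newer NLTK releases look for the '_tab' / '_eng' resource names and
# older releases for the original ones, so we request both.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
nltk.download('names')
nltk.download('book')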

---
# Part 1: Text Exploration

NLTK ships with several built-in text corpora. Let's load them and explore basic text analysis tools.

In [None]:
from nltk.book import *

The `nltk.book` module loads several classic texts:
- text1: Moby Dick by Herman Melville
- text2: Sense and Sensibility by Jane Austen
- text3: The Book of Genesis
- text4: Inaugural Address Corpus
- text5: Chat Corpus
- And more...

## Concordance

A concordance shows every occurrence of a word along with its surrounding context. This helps understand how a word is used in different situations.

In [None]:
text1.concordance("monstrous")
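
To make the idea concrete, here is a minimal plain-Python sketch of a keyword-in-context listing. It is not how NLTK implements `concordance()`; the function name `simple_concordance`, the `width` parameter, and the toy sentence are all assumptions made for illustration.

```python
def simple_concordance(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens of context per side."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Hypothetical toy sentence, not a quotation from Moby Dick
toy_tokens = "the whale was a monstrous size and the most monstrous of fish".split()
for line in simple_concordance(toy_tokens, "monstrous"):
    print(line)
```

NLTK's version works on character widths rather than token counts and aligns the keyword in a column, which makes long listings easier to scan.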

## Similar Words

Find words that appear in similar contexts to a given word. This reveals semantic relationships.

In [None]:
text1.similar("monstrous")

In [None]:
text2.similar("monstrous")

Note how the similar words differ between Moby Dick (text1) and Sense and Sensibility (text2), reflecting different writing styles and periods.

## Common Contexts

Find contexts shared by two or more words.

In [None]:
text2.common_contexts(["monstrous", "very"])

## Dispersion Plots

Visualize where words appear throughout a text. This is useful for tracking themes or concepts across a document.

In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
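
A dispersion plot is, in essence, a picture of token offsets: one mark for every position at which a word occurs. The sketch below computes those offsets in plain Python for a toy token list; `word_offsets` is a hypothetical helper, not part of NLTK.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    return {
        target: [i for i, token in enumerate(tokens) if token == target]
        for target in targets
    }

# Hypothetical toy token list standing in for a real corpus
toy_tokens = "freedom and duties of citizens and freedom".split()
print(word_offsets(toy_tokens, ["freedom", "duties", "citizens"]))
```

Note that this sketch matches tokens exactly, so "America" and "america" would be counted separately.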

## Lexical Diversity

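Lexical diversity is the ratio of unique tokens to total tokens. A higher ratio suggests a more varied vocabulary, but the ratio tends to fall as a text gets longer, so comparisons between texts of very different lengths should be read with care.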

In [None]:
def lexical_diversity(text):
    """Calculate the ratio of unique words to total words."""
    return len(set(text)) / len(text)

def percentage(count, total):
    """Calculate percentage."""
    return 100 * count / total

In [None]:
print(f"Text 3 (Genesis) lexical diversity: {lexical_diversity(text3):.4f}")
print(f"Text 3 length: {len(text3)} tokens")
print(f"Text 3 unique words: {len(set(text3))}")
print()
print(f"Text 1 (Moby Dick) lexical diversity: {lexical_diversity(text1):.4f}")
print(f"Text 1 length: {len(text1)} tokens")
print(f"Text 1 unique words: {len(set(text1))}")
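
The counts above treat punctuation marks as tokens and distinguish "The" from "the". As a rough sketch of how much that matters, the hypothetical helper `normalized_diversity` below lowercases tokens and drops anything non-alphabetic before computing the ratio. The toy token list is invented for illustration.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens, exactly as given."""
    return len(set(tokens)) / len(tokens)

def normalized_diversity(tokens):
    """Same ratio, computed over lowercased alphabetic tokens only."""
    words = [token.lower() for token in tokens if token.isalpha()]
    return len(set(words)) / len(words)

# Hypothetical toy token list
toy = ["The", "cat", "saw", "the", "other", "cat", ".", "The", "end", "."]
print(f"Raw:        {lexical_diversity(toy):.3f}")
print(f"Normalized: {normalized_diversity(toy):.3f}")
```

Both functions assume a non-empty token list; an empty one would raise `ZeroDivisionError`.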

---
# Part 2: Tokenization & Text Processing

Tokenization is the process of dividing text into words or sentences. While Python has built-in string methods like `split()`, NLTK's tokenizers are linguistically aware and handle punctuation, abbreviations, and other edge cases intelligently.

## Loading Sample Text

We'll load a sample text from a URL. This is *Australian Essays* by Francis W. L. Adams (1886) from the AustLit corpus.

In [None]:
# Fetch sample text from URL
link = "https://etc.mikelynch.org/tmfh/adaessa-plain.txt"
r = requests.get(link)
sample_text = r.text

# Display first 100 characters
print(sample_text[:100])

## Comparing Simple vs. Smart Tokenization

Let's see the difference between Python's basic `split()` and NLTK's word tokenizer.

In [None]:
# Simple split - notice all the whitespace codes
simple_split = sample_text.split(" ")[:20]
print("Simple split result:")
print(simple_split)

Notice how `split(" ")` leaves empty strings and whitespace characters like `\n` and `\t` in the result, because it splits only on single spaces. Now let's use NLTK's word tokenizer.

## Word Tokenization

NLTK's word tokenizer understands punctuation and handles contractions, abbreviations, and other linguistic features properly.

In [None]:
sample_words = word_tokenize(sample_text)
print(f"Total tokens (words and punctuation): {len(sample_words)}")
print(f"First 20 tokens: {sample_words[:20]}")
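
To see why a dedicated tokenizer helps, the sketch below contrasts `split(" ")` with a crude regular-expression tokenizer. `rough_tokenize` is a hypothetical stand-in written for this comparison; it is much simpler than `word_tokenize` and will treat abbreviations and contractions differently.

```python
import re

def rough_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical sample with a double space, a newline and a tab
sample = "Mr. Adams  wrote:\n\t'Culture matters.'"
print(sample.split(" "))       # keeps the empty string and the "\n\t"
print(rough_tokenize(sample))  # whitespace gone, punctuation separated
```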

## Sentence Tokenization

NLTK can also tokenize text into sentences. Its pre-trained Punkt model usually recognizes that periods in abbreviations like "W." don't mark sentence boundaries, although no sentence splitter is right every time.

In [None]:
sentences = sent_tokenize(sample_text)
print(f"Total sentences: {len(sentences)}")
print(f"\nFirst 5 sentences:")
for i, sent in enumerate(sentences[:5], 1):
    print(f"\n{i}. {sent[:100]}...")
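
For contrast, here is a naive splitter that breaks after every `.`, `!` or `?` followed by whitespace. It is a deliberately simplistic sketch (the name `naive_sentences` and the sample sentence are made up) that shows the kind of error an abbreviation-aware tokenizer is designed to avoid.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text) if part]

# Hypothetical sample containing initials
sample = "Essays by Francis W. L. Adams appeared in 1886. They discuss culture."
for sentence in naive_sentences(sample):
    print(repr(sentence))
```

The naive rule cuts the author's initials into separate "sentences", which is why the tutorial relies on `sent_tokenize` instead.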

---
# Part 3: Collocations

Collocations are words that frequently occur together. Finding collocations can reveal phrases and concepts that are important in a text. We'll use PMI (Pointwise Mutual Information) to rank bigrams (pairs of words).

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(sample_words)
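
PMI compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below estimates it from raw counts as log2(count(x, y) * N / (count(x) * count(y))), with N the number of tokens. This is a simplified hand calculation on a made-up token list, and NLTK's own counting conventions may differ in detail.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score each adjacent pair with log2(count(x, y) * N / (count(x) * count(y)))."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: math.log2(count * n / (word_counts[pair[0]] * word_counts[pair[1]]))
        for pair, count in pair_counts.items()
    }

# Hypothetical toy text: a frequent pair ("of the") and a rare one ("kangaroo court")
toy = "of the sea of the sky of the land of the air kangaroo court rules ok".split()
scores = pmi_scores(toy)
print(scores[("of", "the")])
print(scores[("kangaroo", "court")])
```

In this toy text the pair that occurs only once scores higher than the pair that occurs four times, because its component words never appear apart. That bias toward rare pairs is the reason for the frequency filter applied later.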

## Top Bigrams (All Frequencies)

First, let's look at the top bigrams using PMI scoring:

In [None]:
print("Top 20 bigrams by PMI (all frequencies):")
for bigram in finder.nbest(bigram_measures.pmi, 20):
    print(f"  {bigram[0]} {bigram[1]}")

## Filtering by Frequency

PMI favors rare pairs, so many of these bigrams occur only once or twice. Let's filter to show only bigrams that appear at least 3 times:

In [None]:
finder.apply_freq_filter(3)
print("Top 20 bigrams by PMI (frequency >= 3):")
for bigram in finder.nbest(bigram_measures.pmi, 20):
    print(f"  {bigram[0]} {bigram[1]}")

## Wider Context Window

We can also look for words that co-occur within a larger window (not necessarily adjacent):

In [None]:
finder_wide = BigramCollocationFinder.from_words(sample_words, window_size=15)
finder_wide.apply_freq_filter(3)
print("Top 20 bigrams with window size 15:")
for bigram in finder_wide.nbest(bigram_measures.pmi, 20):
    print(f"  {bigram[0]} {bigram[1]}")
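
As a rough mental model of the wider window, the sketch below pairs each token with the tokens that follow it inside the window, so a window of 2 yields ordinary adjacent bigrams. `window_pairs` is a hypothetical helper written for this illustration, and it is only assumed to approximate how NLTK builds its windowed counts.

```python
from collections import Counter

def window_pairs(tokens, window_size=3):
    """Count ordered pairs (w1, w2) where w2 follows w1 within window_size - 1 positions."""
    pairs = Counter()
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1:i + window_size]:
            pairs[(w1, w2)] += 1
    return pairs

toy = "the six toed giant".split()
print(sorted(window_pairs(toy, window_size=2)))  # adjacent pairs only
print(sorted(window_pairs(toy, window_size=3)))  # also pairs one word apart
```

Larger windows produce many more candidate pairs, so frequency filtering becomes even more important.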

### Cultural Note

One interesting collocation in this text is "six-toed giant." This phrase comes from Matthew Arnold, who popularized the pejorative use of "philistine" to describe people with disdain for culture. Adams dedicated his book to Arnold and used this phrase repeatedly. This shows how data analysis can reveal culturally significant patterns!

---
# Part 4: Parts of Speech Tagging

Part-of-speech (POS) tagging assigns grammatical categories (noun, verb, adjective, etc.) to each word. This is challenging because many words can serve different roles depending on context. We'll use NLTK's Perceptron tagger to extract nouns from our text.

## Tagging Words

The `pos_tag()` function takes a list of words and returns tuples of (word, tag):

In [None]:
tagged_words = pos_tag(sample_words)
print("First 30 tagged words:")
print(tagged_words[:30])

The tags come from the Penn Treebank tagset:
- NN: singular or mass noun
- NNS: plural noun
- NNP: proper noun, singular
- JJ: adjective
- VB: verb, base form
- And many more...

[Full Penn Treebank tag reference](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

## Extracting Nouns

Let's filter the tagged words to get only nouns (NN and NNS tags):

In [None]:
sample_nouns = [word for (word, pos) in tagged_words if pos in ['NN', 'NNS']]
print(f"Total nouns extracted: {len(sample_nouns)}")
print(f"First 30 nouns: {sample_nouns[:30]}")
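
The filter above keeps common nouns only. If proper nouns are wanted too, one option is to match every tag that starts with `NN`, which in the Penn Treebank tagset also covers `NNP` and `NNPS`. The sketch below uses a small hand-tagged list (a hypothetical sample, not real tagger output) so it runs without any NLTK data.

```python
# Hypothetical hand-tagged sample in the (word, tag) shape that pos_tag() returns
toy_tagged = [
    ("Adams", "NNP"), ("wrote", "VBD"), ("essays", "NNS"),
    ("about", "IN"), ("Australian", "JJ"), ("culture", "NN"),
]

common_nouns = [word for word, tag in toy_tagged if tag in ("NN", "NNS")]
all_nouns = [word for word, tag in toy_tagged if tag.startswith("NN")]

print(common_nouns)
print(all_nouns)
```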

## Visualizing Noun Frequency

Now let's visualize the most common nouns in the text:

In [None]:
freq_dist = nltk.FreqDist(sample_nouns)
freq_dist.plot(20, title="Frequency distribution for 20 most common nouns")

## Word Cloud

We can also create a word cloud to visualize the noun frequencies:

In [None]:
cloud = WordCloud(max_font_size=60, colormap='hsv', background_color='white').generate(' '.join(sample_nouns))
plt.rcParams['figure.figsize'] = (16, 12)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.title('Noun Word Cloud', fontsize=20)
plt.show()

---
# Part 5: Named Entity Recognition

Named Entity Recognition (NER) extracts and classifies named entities in text, such as:
- PERSON: People's names
- GPE: Geo-political entities (countries, cities, states)
- ORGANIZATION: Companies, institutions, agencies
- And more...

In [None]:
def extract_entities(text):
    """Extract and display named entities from text."""
    sentences = sent_tokenize(text)
    
    all_entities = []
    
    for sent in sentences:
        tokens = word_tokenize(sent)
        tagged = pos_tag(tokens)
        tree = nltk.ne_chunk(tagged)
        
        for subtree in tree:
            if hasattr(subtree, 'label'):
                entity = ' '.join(word for word, tag in subtree.leaves())
                entity_type = subtree.label()
                all_entities.append((entity_type, entity))
    
    return all_entities

In [None]:
sample_ner = "Barack Obama was born in Hawaii. He served as President of the United States. He visited the United Nations in New York."

entities = extract_entities(sample_ner)
print("Extracted entities:")
for entity_type, entity in entities:
    print(f"  {entity_type}: {entity}")

Let's also extract entities from our Australian Essays text (just the first 10 sentences to keep output manageable):

In [None]:
# Get first 10 sentences
first_sentences = ' '.join(sentences[:10])
entities_essays = extract_entities(first_sentences)

print("Entities from Australian Essays (first 10 sentences):")
for entity_type, entity in entities_essays[:20]:  # Show first 20 entities
    print(f"  {entity_type}: {entity}")
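
A flat list of `(type, entity)` tuples gets hard to read as it grows. One possible follow-up step is to group distinct entities by type, as sketched below. `group_entities` is a hypothetical helper, and the sample tuples are invented to match the shape returned by `extract_entities`; the labels a real chunker assigns may differ.

```python
from collections import defaultdict

def group_entities(entities):
    """Map each entity type to a sorted list of its distinct entity strings."""
    grouped = defaultdict(set)
    for entity_type, entity in entities:
        grouped[entity_type].add(entity)
    return {etype: sorted(values) for etype, values in grouped.items()}

# Hypothetical (type, entity) tuples
toy_entities = [
    ("PERSON", "Barack Obama"), ("GPE", "Hawaii"),
    ("GPE", "New York"), ("GPE", "Hawaii"),
]
for etype, values in group_entities(toy_entities).items():
    print(f"{etype}: {values}")
```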

---
# Part 6: Text Classification

Text classification uses machine learning to categorize documents. We'll explore two classification examples:
1. **Gender Classification** - Predicting gender from names
2. **Criminal Conviction Classification** - Categorizing historical criminal records

## Example 1: Gender Classification from Names

We'll build a simple Naive Bayes classifier that predicts gender based on the last letter of a name.

In [None]:
# Load name corpus
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

print(f"Total names: {len(labeled_names)}")
print(f"Sample names: {labeled_names[:5]}")

### Define Feature Extractor

Our feature extractor is simple - it just looks at the last letter of the name:

In [None]:
def gender_features(word):
    return {'last_letter': word[-1]}

# Test it
print("Feature examples:")
print(f"  John: {gender_features('John')}")
print(f"  Mary: {gender_features('Mary')}")
print(f"  Alex: {gender_features('Alex')}")
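
A single letter is a very thin signal. As one possible variation, the sketch below adds the first letter and the last two letters and lowercases the name first. `gender_features_extended` is a hypothetical alternative, not the extractor used in the rest of this notebook, and whether it actually improves accuracy would have to be checked on the test set.

```python
def gender_features_extended(word):
    """Hypothetical variant: first letter, last letter and last two letters."""
    word = word.lower()
    return {
        "first_letter": word[0],
        "last_letter": word[-1],
        "last_two": word[-2:],
    }

print(gender_features_extended("Mary"))
print(gender_features_extended("Alex"))
```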

### Train the Classifier

We hold out the first 500 shuffled names as a test set and train on the rest:

In [None]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

gender_classifier = nltk.NaiveBayesClassifier.train(train_set)

print(f"Training set size: {len(train_set)}")
print(f"Test set size: {len(test_set)}")
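
Because the names are shuffled with the global random generator, each run of the notebook produces a different split and a slightly different accuracy. If repeatable results are wanted, a seeded split along the following lines is one option. `split_dataset`, its `seed` parameter, and the toy data are assumptions made for this sketch.

```python
import random

def split_dataset(items, test_size, seed=0):
    """Shuffle a copy of `items` with a fixed seed and return (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Hypothetical toy data in the same (name, label) shape as labeled_names
toy = [(f"name{i}", "male" if i % 2 else "female") for i in range(10)]
train, test = split_dataset(toy, test_size=3)
print(len(train), len(test))
```

Using a private `random.Random` instance leaves the global generator untouched, so other cells that call `random` are not affected.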

### Test the Classifier

In [None]:
test_names = ['Neo', 'Trinity', 'Alex', 'Sarah', 'Pat', 'Morgan', 'Taylor', 'Jordan']

print("Gender predictions:")
for name in test_names:
    prediction = gender_classifier.classify(gender_features(name))
    print(f"  {name}: {prediction}")

### Evaluate Accuracy

In [None]:
accuracy = nltk.classify.accuracy(gender_classifier, test_set)
print(f"Classifier accuracy: {accuracy:.2%}")
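
An accuracy figure is easier to interpret next to a baseline. One simple baseline always predicts whichever label is most common in the training data, which matters when the classes are unevenly sized. The sketch below computes that baseline for made-up label lists; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return hits / len(test_labels)

# Hypothetical label lists
train_labels = ["female"] * 6 + ["male"] * 4
test_labels = ["female", "male", "female", "female", "male"]
print(f"Baseline accuracy: {majority_baseline_accuracy(train_labels, test_labels):.2%}")
```

A classifier is only doing useful work to the extent that it beats this number.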

### Most Informative Features

In [None]:
gender_classifier.show_most_informative_features(10)

---
## Example 2: Criminal Conviction Classification

Now let's tackle a more complex classification problem. We'll use historical records of women's criminal convictions from 19th and early 20th century Australia (from Dr. Alana Piper's [Criminal Characters](https://criminalcharacters.com/) research project).

We'll train a classifier to categorize convictions into three types:
- **property** - Crimes involving theft or property damage
- **violent** - Violent crimes and assaults
- **nonviolent** - Other crimes like vagrancy, drunkenness, etc.

### Load Conviction Data

In [None]:
# Load the full corpus of convictions
r = requests.get("https://etc.mikelynch.org/tmfh/cc/convictions.txt")
corpus = r.text
convictions = corpus.splitlines()

print(f"Total convictions: {len(convictions)}")
print("\nFirst 5 convictions:")
for i, conv in enumerate(convictions[:5], 1):
    print(f"{i}. {conv}")

### Define Feature Extractor

For this classifier, we'll use the presence/absence of the top 100 most common words as features:

In [None]:
# Get top 100 words from entire corpus
all_words = nltk.FreqDist(word_tokenize(corpus))
top_words = [word for word, _ in all_words.most_common(100)]

print("Top 20 words in corpus:")
print(top_words[:20])

In [None]:
def conviction_features(conviction):
    """Extract features from a conviction record."""
    conviction_words = set(word_tokenize(conviction))
    features = {}
    for word in top_words:
        features[f'contains({word})'] = (word in conviction_words)
    return features

In [None]:
# Test the feature extractor
sample_conviction = '03-MAY-1875 ABBOT, NORAH: 3 MONTHS IMP VAGRANCY MARYBOROUGH PETTY SESSIONS'
features = conviction_features(sample_conviction)

# Show some of the features
print("Sample features (showing True values):")
for key, value in list(features.items())[:20]:
    if value:
        print(f"  {key}: {value}")

### Load Training Data

The training data is a CSV file with 200 manually classified convictions:

In [None]:
training_data = []

with requests.get("https://etc.mikelynch.org/tmfh/cc/training.csv", stream=True) as r:
    lines = (line.decode('utf-8-sig') for line in r.iter_lines())
    for row in csv.reader(lines):
        features = conviction_features(row[0])
        label = row[1]
        training_data.append((features, label))

print(f"Training data loaded: {len(training_data)} records")

### Split into Training and Test Sets

We'll use the first 100 records to test accuracy and the remaining 100 to train:

In [None]:
checking_data = training_data[:100]
training_data = training_data[100:]

print(f"Training set: {len(training_data)} records")
print(f"Test set: {len(checking_data)} records")

### Train the Classifier

In [None]:
conviction_classifier = nltk.NaiveBayesClassifier.train(training_data)
print("Classifier trained successfully!")

### Most Informative Features

In [None]:
conviction_classifier.show_most_informative_features(20)

### Evaluate Accuracy

In [None]:
count = 0
for features, label in checking_data:
    bayes_result = conviction_classifier.classify(features)
    if bayes_result == label:
        count += 1

accuracy = count / len(checking_data)
print(f"Classifier accuracy: {accuracy:.2%}")
print(f"Correct predictions: {count} out of {len(checking_data)}")
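
A single accuracy number hides which categories get confused with each other. A small confusion tally, sketched below with invented gold and predicted labels, shows where the errors fall. `confusion_counts` is a hypothetical helper; NLTK also provides a `ConfusionMatrix` class that could be used for the same purpose.

```python
from collections import Counter

def confusion_counts(gold_labels, predicted_labels):
    """Count how often each (gold, predicted) label pair occurs."""
    return Counter(zip(gold_labels, predicted_labels))

# Hypothetical labels, not real classifier output
gold = ["property", "violent", "nonviolent", "property", "violent"]
pred = ["property", "nonviolent", "nonviolent", "property", "violent"]

counts = confusion_counts(gold, pred)
for (gold_label, pred_label), count in sorted(counts.items()):
    print(f"gold={gold_label:<10} predicted={pred_label:<10} count={count}")
```

With the notebook's own data, the two lists would come from the labels in `checking_data` and from calling the classifier on each feature set.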

### Test on Sample Convictions

Let's see how the classifier performs on real convictions:

In [None]:
print("Classification results for first 15 convictions:\n")
for i, conv in enumerate(convictions[:15], 1):
    label = conviction_classifier.classify(conviction_features(conv))
    print(f"{i}. [{label.upper()}]")
    print(f"   {conv}")
    print()

### Classify Entire Corpus

Finally, let's classify all convictions and see the breakdown:

In [None]:
breakdown = {}

for conv in convictions:
    label = conviction_classifier.classify(conviction_features(conv))
    if label not in breakdown:
        breakdown[label] = [conv]
    else:
        breakdown[label].append(conv)

print("Classification breakdown:\n")
for label, convs in sorted(breakdown.items()):
    print(f"{label.upper()}: {len(convs)} convictions ({len(convs)/len(convictions)*100:.1f}%)")
    print(f"Sample convictions:")
    for conv in convs[:3]:
        print(f"  - {conv}")
    print()

### Improving the Classifier

The 85% accuracy is reasonable for such a small training set, but could be improved by:

1. **Larger training set** - More examples would help the classifier learn better patterns
2. **Better features** - We could filter out common names, or use bigrams/trigrams
3. **Feature engineering** - Extract specific crime-related keywords rather than just top words
4. **Try different algorithms** - Other classifiers like Maximum Entropy might perform better

The exploratory process of refining features and testing accuracy is key to building effective classifiers.
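
As an illustration of the bigram idea from the list above, the sketch below builds presence/absence features for word pairs instead of single words. The function `bigram_features` and the three-pair vocabulary are assumptions made for this example. In practice the vocabulary would be chosen from the corpus, and tokenization would use `word_tokenize` rather than a plain `split()`.

```python
def bigram_features(text, vocabulary):
    """Presence/absence features for the word bigrams listed in `vocabulary`."""
    words = text.lower().split()
    present = set(zip(words, words[1:]))
    return {f"contains({w1} {w2})": ((w1, w2) in present) for w1, w2 in vocabulary}

# Hypothetical bigram vocabulary
vocab = [("petty", "sessions"), ("months", "imp"), ("common", "assault")]
record = "03-MAY-1875 ABBOT, NORAH: 3 MONTHS IMP VAGRANCY MARYBOROUGH PETTY SESSIONS"
print(bigram_features(record, vocab))
```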

---
# Conclusion

This notebook has covered the core capabilities of NLTK:

1. **Text Exploration** - Understanding how words are used in context
2. **Tokenization** - Breaking text into meaningful units
3. **Collocations** - Finding phrases and word associations
4. **POS Tagging** - Identifying grammatical roles
5. **Named Entity Recognition** - Extracting people, places, and organizations
6. **Text Classification** - Building machine learning models to categorize text

## Additional Resources

- [NLTK Book - Natural Language Processing with Python](https://www.nltk.org/book/)
- [NLTK Documentation](https://www.nltk.org/)
- [Library Carpentry: Text & Data Mining](http://librarycarpentry.org/lc-tdm/)
- [Criminal Characters Research Project](https://criminalcharacters.com/)

## Credits

This notebook combines material from:
- Mike Lynch's "Text Mining for the Humanities" workshop (UTS eResearch)
- NLTK Book examples
- Library Carpentry resources

Sample data:
- *Australian Essays* by Francis W. L. Adams (1886) from [AustLit corpus](https://www.ausnc.org.au/corpora/austlit)
- Criminal conviction records from Dr. Alana Piper's [Criminal Characters](https://criminalcharacters.com/) project