# Introduction to NLTK

This notebook walks through core NLTK concepts from the [NLTK Book](https://www.nltk.org/book/), covering:

1. **Exploring Text** - Concordance, similar words, dispersion plots, lexical diversity
2. **Tagging & Classification** - Tokenization, POS tagging, Naive Bayes classification
3. **Named Entity Recognition** - Extracting entities from text

## Setup

In [None]:
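import nltk

# Corpora and models used throughout the NLTK Book.
nltk.download('book', quiet=True)
# Tokenizer, POS tagger, and NE chunker models, plus supporting word lists.
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)
nltk.download('names', quiet=True)
# Newer NLTK releases (3.9+) look up these renamed resources instead;
# requesting both sets keeps the notebook working on either version.
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('maxent_ne_chunker_tab', quiet=True)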

---
# Part 1: Exploring Text (NLTK Book Ch. 1)

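NLTK ships with several built-in text corpora. Importing `nltk.book` loads nine of them as `text1` through `text9`; for example, `text1` is *Moby Dick*, `text2` is *Sense and Sensibility*, `text3` is the Book of Genesis, and `text4` is the Inaugural Address Corpus. Let's load them and explore basic text analysis tools.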

In [None]:
from nltk.book import *

### Concordance
Show every occurrence of a word along with its surrounding context.

In [None]:
text1.concordance("monstrous")
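
To make the idea concrete, here is a minimal pure-Python sketch of a keyword-in-context lookup. It is not NLTK's implementation: the function name `kwic`, its `width` parameter, and the token list are all hypothetical and chosen only for illustration.

```python
def kwic(tokens, target, width=3):
    """Return (left, match, right) context windows for each case-insensitive hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            # Take up to `width` tokens on each side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Hand-made token list, not taken from any NLTK corpus.
tokens = "the whale was a monstrous size and a most monstrous fish".split()
for left, match, right in kwic(tokens, "monstrous"):
    print(f"{left:>20} [{match}] {right}")
```

Each hit is shown with a fixed number of neighbouring tokens on either side, which is the essence of a concordance view.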

### Similar Words
Find words that appear in a similar context to the given word. Running the same query on `text1` and `text2` shows how usage of a word differs between texts.

In [None]:
text1.similar("monstrous")

In [None]:
text2.similar("monstrous")

### Common Contexts
Find contexts shared by two or more words.

In [None]:
text2.common_contexts(["monstrous", "very"])
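
Both `similar` and `common_contexts` rest on the same idea: a word's context is the pair of words around it. The sketch below illustrates that idea in plain Python. It assumes a simple "count shared contexts" score, which is not NLTK's own scoring, and the helper names and token list are hypothetical.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each word to the set of (previous, next) word pairs it occurs between."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    return contexts

def similar_words(tokens, target):
    """Rank other words by how many contexts they share with `target`."""
    contexts = contexts_by_word(tokens)
    target_ctx = contexts.get(target, set())
    scores = {w: len(ctx & target_ctx) for w, ctx in contexts.items() if w != target}
    return sorted((w for w, s in scores.items() if s > 0),
                  key=lambda w: (-scores[w], w))

def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs (needs >= 1 word)."""
    contexts = contexts_by_word(tokens)
    return sorted(set.intersection(*(contexts.get(w, set()) for w in words)))

# Hand-made token list, not taken from any NLTK corpus.
tokens = "a monstrous whale and a huge whale and a tiny boat".split()
print(similar_words(tokens, "monstrous"))
print(shared_contexts(tokens, ["monstrous", "huge"]))
```

Here "monstrous" and "huge" both occur between "a" and "whale", so they count as similar and share that one context.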

### Dispersion Plot
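Visualize where words appear across the length of a text. Each mark is one occurrence, positioned by its token offset from the start. This plot requires matplotlib to be installed.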

In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
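
The data behind such a plot is just the list of token offsets for each word. The following sketch computes those offsets and prints a crude text rendering; `word_offsets` and the token list are hypothetical, and this is not how NLTK draws its plot.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {t: [] for t in targets}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets

# Hand-made token list, not taken from any NLTK corpus.
tokens = "we the citizens value freedom and the citizens defend freedom".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "duties"])

# One row per word; '|' marks an occurrence, '.' marks any other token.
for word, positions in offsets.items():
    row = "".join("|" if i in positions else "." for i in range(len(tokens)))
    print(f"{word:>10} {row}")
```

A word that never occurs, such as "duties" here, produces an empty row, just as it would leave an empty band in the plot.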

### Lexical Diversity
Measure the ratio of distinct tokens to total tokens. NLTK texts treat punctuation marks as tokens, so they are included in both counts.

In [None]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

def percentage(count, total):
    return 100 * count / total

In [None]:
print(f"Text 3 (Genesis) lexical diversity: {lexical_diversity(text3):.4f}")
print(f"Text 3 length: {len(text3)} tokens")
print(f"Text 3 unique words: {len(set(text3))}")
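
A small worked example shows how both helpers behave and why case matters. The functions are repeated so the snippet runs on its own; the token list is hand-made and the lowercasing step is an optional normalization, not something the cells above apply.

```python
def lexical_diversity(text):
    return len(set(text)) / len(text)

def percentage(count, total):
    return 100 * count / total

# Hand-made token list: 11 tokens, with "The" and "the" differing only by case.
tokens = "The cat sat on the mat and the dog sat too".split()
lowered = [t.lower() for t in tokens]

print(f"Raw diversity:        {lexical_diversity(tokens):.3f}")   # 9 distinct / 11
print(f"Lowercased diversity: {lexical_diversity(lowered):.3f}")  # 8 distinct / 11
print(f"'the' share: {percentage(lowered.count('the'), len(lowered)):.2f}%")
```

Because `set` compares strings exactly, "The" and "the" count as two distinct tokens unless the text is normalized first.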

---
# Part 2: Tagging & Classification (NLTK Book Ch. 5-6)

Moving beyond text exploration, NLTK provides tools for tokenization, part-of-speech tagging, and building text classifiers.

### Tokenization & POS Tagging
Split text into words and label each word with its part of speech.

In [None]:
from nltk import word_tokenize

sample_text = "The quick brown fox jumps over the lazy dog near the river bank."
tokens = word_tokenize(sample_text)
tagged = nltk.pos_tag(tokens)
tagged
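
`pos_tag` returns a list of `(word, tag)` pairs, using Penn Treebank tags by default. The sketch below shows one way to work with that shape. The tagged list is hand-written for illustration and is not copied from an actual tagger run, and `words_by_tag_prefix` is a hypothetical helper.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape pos_tag returns.
tagged = [("The", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("sleeps", "VBZ"),
          ("near", "IN"), ("the", "DT"), ("river", "NN"), (".", ".")]

def words_by_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with `prefix`, e.g. 'NN' for nouns."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_by_tag_prefix(tagged, "NN")
verbs = words_by_tag_prefix(tagged, "VB")
tag_counts = Counter(tag for _, tag in tagged)

print("Nouns:", nouns)
print("Verbs:", verbs)
print("Most common tags:", tag_counts.most_common(2))
```

Matching on a prefix groups related tags together: `NN` also catches `NNS` and `NNP`, and `VB` catches `VBZ`, `VBD`, and the other verb forms.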

### Naive Bayes Gender Classifier
Train a simple classifier that predicts gender from the last letter of a name.

In [None]:
import random
from nltk.corpus import names

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.seed(0)  # fixed seed so the train/test split is reproducible
random.shuffle(labeled_names)

def gender_features(word):
    return {'last_letter': word[-1]}

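featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
# Hold out the first 500 names for testing; train on the rest.
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
# Measure accuracy on the held-out names the classifier has not seen.
print(f"Accuracy: {nltk.classify.accuracy(classifier, test_set):.3f}")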

In [None]:
for name in ['Neo', 'Trinity', 'Alex', 'Sarah', 'Pat']:
    print(f"{name}: {classifier.classify(gender_features(name))}")
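
To show what a Naive Bayes classifier does with a single feature, here is a from-scratch sketch in plain Python. It assumes add-one smoothing over a 26-letter alphabet, which may differ from the estimator NLTK uses, and the `train` and `classify` helpers and the six-name dataset are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labeled_names):
    """Count labels, and last letters per label, from (name, label) pairs."""
    label_counts = Counter()
    letter_counts = defaultdict(Counter)
    for name, label in labeled_names:
        label_counts[label] += 1
        letter_counts[label][name[-1]] += 1
    return label_counts, letter_counts

def classify(name, label_counts, letter_counts, alphabet_size=26):
    """Pick the label maximising log P(label) + log P(last_letter | label)."""
    total = sum(label_counts.values())
    letter = name[-1]
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Add-one smoothing keeps an unseen letter from zeroing out the score.
        likelihood = (letter_counts[label][letter] + 1) / (count + alphabet_size)
        score = math.log(count / total) + math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny hand-made training set, for illustration only.
toy = [("Anna", "female"), ("Maria", "female"), ("Emma", "female"),
       ("John", "male"), ("Ethan", "male"), ("Mark", "male")]
model = train(toy)
for name in ["Sophia", "Ryan"]:
    print(f"{name}: {classify(name, *model)}")
```

With equal priors, the decision comes down to which label saw the name's last letter more often in training.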

---
# Part 3: Named Entity Recognition (NLTK Book Ch. 7)

Extract named entities (people, places, organizations) from text using NLTK's chunker.

In [None]:
def extract_entities(text):
    """Print the label and text of each named entity found in `text`."""
    sentences = nltk.sent_tokenize(text)
    for sent in sentences:
        tokens = nltk.word_tokenize(sent)
        tagged = nltk.pos_tag(tokens)
        tree = nltk.ne_chunk(tagged)
        for subtree in tree:
            # Entities are nested Tree nodes; plain tokens are (word, tag) tuples.
            if hasattr(subtree, 'label'):
                entity = ' '.join(word for word, tag in subtree.leaves())
                print(f"  {subtree.label()}: {entity}")

In [None]:
sample = "Barack Obama was born in Hawaii. He served as President of the United States."
extract_entities(sample)
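
Chunk trees can also be flattened into `(word, pos, iob)` triples, for example with `nltk.chunk.tree2conlltags`. The `B-` prefix opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The sketch below groups such triples back into entities in plain Python. The triples are hand-written for illustration, not real chunker output, and `entities_from_iob` is a hypothetical helper.

```python
def entities_from_iob(triples):
    """Group (word, pos, iob) triples into (label, text) entity pairs."""
    entities = []
    label, words = None, []
    for word, _pos, iob in triples:
        continues = iob.startswith("I-") and iob[2:] == label
        if continues:
            words.append(word)
            continue
        # Anything else ends the entity in progress, if there is one.
        if words:
            entities.append((label, " ".join(words)))
        if iob == "O":
            label, words = None, []
        else:
            # "B-X" (or a stray "I-X") starts a new entity with label X.
            label, words = iob[2:], [word]
    if words:
        entities.append((label, " ".join(words)))
    return entities

# Hand-written triples in the IOB shape described above.
triples = [("Ada", "NNP", "B-PERSON"), ("Lovelace", "NNP", "I-PERSON"),
           ("lived", "VBD", "O"), ("in", "IN", "O"),
           ("London", "NNP", "B-GPE"), (".", ".", "O")]
print(entities_from_iob(triples))
```

Returning a list of pairs instead of printing makes the entities easy to count, filter by label, or store.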