# Classifying documents

Today we'll define probability distributions from sets of documents, and make predictions using those probability distributions.

We will encounter several problems:

1. Multiplying many small probabilities underflows floating-point numbers

2. Some words don't appear in every class

3. Classification can be "too easy" if we are comparing documents to themselves
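The first problem is easy to see in isolation. The sketch below uses a made-up per-word probability `p` and a made-up document length `n_words` (both are assumptions for illustration, not values from the plays) and compares a running product with a sum of logs.

```python
import math

# Hypothetical values: a typical per-word probability and a short document length.
p = 1e-4
n_words = 100

# Multiply the probability of each word together, one word at a time.
product = 1.0
for _ in range(n_words):
    product *= p

# Add log probabilities instead of multiplying probabilities.
log_sum = n_words * math.log(p)

print(product)  # the true value, 1e-400, is too small for a float
print(log_sum)
```

The product comes out as exactly `0.0` because the true value is smaller than any float can represent, while the log sum stays an ordinary negative number that we can still compare across genres.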

In [None]:
import re, sys, glob, math
from collections import Counter
import numpy

We will use the same files and data structures as last week: Shakespeare plays in UTF-16 and Counters for each genre and play.

1. Write a function that takes a word and a genre and returns the probability of that word in that genre. Calculate this for "Brutus", "Romeo", "death", and "duke".

2. Write a function that takes a Counter of words and a genre and returns the probability of the full sequence. Calculate the probability of "to be or not to be that is the question" and "romeo romeo wherefore art thou romeo" in all three genres.

3. Add a smoothing parameter to the previous functions, with default 0. Add this parameter to the count for the word in the numerator, and add the parameter times the size of the vocabulary to the denominator. Recalculate the probabilities of the previous sentences. Find words that increase in probability and words that decrease in probability when you change the smoothing parameter.

4. Calculate the probability of the Counter for *Romeo and Juliet* in all three genres. Is this informative? Why or why not?

5. Create a new function that calculates the *log* probability of a Counter given a genre and a smoothing parameter. Calculate the log probability of *Romeo and Juliet*.

6. Is this a fair comparison? Why or why not? If not, what can we do to make it more fair? Write a function that calculates the log probability of a play as if that play had never been previously seen. Recalculate the log probability for *Romeo and Juliet*.
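As a starting point for exercises 1, 3, and 5, here is one possible shape for the functions, run on toy data so it stands alone. The names `toy_genre_counts`, `toy_vocabulary_size`, `word_probability`, and `log_probability` are hypothetical, and the counts are invented. In the notebook you would use `genre_counts` and `vocabulary_size` instead. The sketch treats a document as a bag of words, so word order is ignored.

```python
import math
from collections import Counter

# Toy stand-ins for genre_counts and vocabulary_size (invented counts).
toy_genre_counts = {
    "tragedy": Counter({"romeo": 3, "death": 2, "the": 5}),
    "comedy": Counter({"duke": 2, "the": 6}),
}
# The vocabulary is every word type seen in any genre.
toy_vocabulary_size = len(set().union(*toy_genre_counts.values()))

def word_probability(word, genre, smoothing=0.0):
    """Smoothed probability of one word in one genre."""
    counts = toy_genre_counts[genre]
    total = sum(counts.values())
    # Smoothing is added once in the numerator and once per vocabulary word
    # in the denominator, so the probabilities still sum to 1.
    return (counts[word] + smoothing) / (total + smoothing * toy_vocabulary_size)

def log_probability(word_counts, genre, smoothing=0.0):
    """Log probability of a Counter of words in one genre."""
    total = 0.0
    for word, n in word_counts.items():
        p = word_probability(word, genre, smoothing)
        if p == 0.0:
            # An unseen word with no smoothing makes the whole document impossible.
            return -math.inf
        total += n * math.log(p)
    return total

sentence = Counter("romeo romeo the death".split())
for genre in toy_genre_counts:
    print(genre, log_probability(sentence, genre), log_probability(sentence, genre, smoothing=1.0))
```

With `smoothing=0` the comedy score is negative infinity because "romeo" never appears in the toy comedy counts. Any positive smoothing value gives it a finite score, at the cost of taking a little probability away from the words that were actually seen.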

In [None]:
# Use the files from last week
genre_directories = { "tragedy" : "../week3/shakespeare/tragedies", "comedy" : "../week3/shakespeare/comedies", "history" : "../week3/shakespeare/historical" }

# Raw string so that \w, \- and \' reach the regex engine unchanged
word_pattern = re.compile(r"\w[\w\-\']*\w|\w")

# This counter will store the total frequency of each word type across all plays
all_counts = Counter()

# This dictionary will have one counter for each genre
genre_counts = {}

# This dictionary will have one dictionary for each genre, each containing one Counter for each play in that genre
genre_play_counts = {}

In [None]:

for genre in genre_directories.keys():
    
    genre_play_counts[genre] = {}
    genre_counts[genre] = Counter()
    
    for filename in glob.glob("{}/*.txt".format(genre_directories[genre])):
        
        play_counter = Counter()
        
        genre_play_counts[genre][filename] = play_counter
        
        with open(filename, encoding="utf-16") as file: ## What encoding?
            
            ## This block reads a file line by line.
            for line in file:
                line = line.rstrip()
                if not line.startswith("\t"):
                    continue
                
                line = line.lower()
                
                tokens = word_pattern.findall(line)
                
                play_counter.update(tokens)
        
        genre_counts[genre] += play_counter
        all_counts += play_counter

In [None]:
genre_counts.keys()

In [None]:
# Since this is long, here's the dict key for R&J:
romeo_title = "../week3/shakespeare/tragedies/Romeo And Juliet.txt"

genre_play_counts["tragedy"][romeo_title].most_common(10)

In [None]:
vocabulary = [w for w, c in all_counts.most_common()]
vocabulary_size = len(vocabulary)

total_word_counts = numpy.array([all_counts[w] for w in vocabulary])
log_counts = numpy.log(total_word_counts)

word_ranks = numpy.arange(len(vocabulary)) + 1
log_ranks = numpy.log(word_ranks)

genres = genre_play_counts.keys()


In [None]:
# Write functions below
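For exercise 6, one way to score a play "as if it had never been seen" is to leave its own counts out of the genre total before computing probabilities. The sketch below is a minimal version on toy data, not the only fair comparison. `toy_play_counts`, the play names, and `held_out_log_probability` are hypothetical stand-ins for `genre_play_counts` and your own function, and it assumes `smoothing` is greater than 0 so that no probability is zero.

```python
import math
from collections import Counter

# Toy stand-in for genre_play_counts; play names and counts are invented.
toy_play_counts = {
    "tragedy": {
        "play_a": Counter({"romeo": 3, "death": 1}),
        "play_b": Counter({"death": 2, "king": 1}),
    },
    "comedy": {
        "play_c": Counter({"duke": 2, "love": 2}),
    },
}

# Vocabulary: every word type in any play.
toy_vocabulary = set()
for plays in toy_play_counts.values():
    for counter in plays.values():
        toy_vocabulary.update(counter)

def held_out_log_probability(play_name, play_genre, target_genre,
                             smoothing=1.0, hold_out=True):
    """Log probability of a play under target_genre, optionally excluding the play itself."""
    play = toy_play_counts[play_genre][play_name]

    # Rebuild the genre counts from the individual plays, skipping the held-out one.
    genre_total = Counter()
    for name, counter in toy_play_counts[target_genre].items():
        if hold_out and target_genre == play_genre and name == play_name:
            continue
        genre_total.update(counter)

    denominator = sum(genre_total.values()) + smoothing * len(toy_vocabulary)
    return sum(n * math.log((genre_total[word] + smoothing) / denominator)
               for word, n in play.items())

in_sample = held_out_log_probability("play_a", "tragedy", "tragedy", hold_out=False)
held_out = held_out_log_probability("play_a", "tragedy", "tragedy")
print(in_sample, held_out)
```

On the toy data the held-out score is lower than the in-sample score, because the play no longer gets credit for words that only it contributed to the genre.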
