# Clustering, normalization, weighting functions

We have been working on ways to represent a collection of documents as a matrix. So far we have looked at classification and two-dimensional projections. What else can we do with such a matrix?

In today's class we will try a new tool, clustering, and measure how properties of our input matrix affect the output of clustering algorithms.

We will consider three modifications:

1. Normalization for length of documents
2. Inverse Document Frequency weighting of words
3. Changes in vocabulary

In every case, we should look for possible errors. Is the algorithm finding patterns in the data, or artifacts of our curation process?

### Response:

Describe the effect of length, word weighting, and vocabulary choices on clustering. Provide specific examples of output. Compare your results to the results of others at your table. Do you see consistent results? Describe any similarities or differences.






In [None]:
import csv, sys, os, re
from collections import Counter
import numpy

from matplotlib import pyplot
from sklearn.cluster import KMeans, AgglomerativeClustering

# Raw string so that \w, \- and \' reach the regex engine unchanged
word_pattern = re.compile(r"\w[\w\-\']*\w|\w")
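To see what this pattern treats as a word, here is a small check on a made-up sentence (the sentence is an illustration only, not taken from the corpus). Hyphens and apostrophes inside a word stay in the token, while surrounding punctuation is dropped. Note that nothing here lowercases the text, so "The" and "the" would count as different words.

```python
import re

# Same pattern as in the cell above, written as a raw string
word_pattern = re.compile(r"\w[\w\-\']*\w|\w")

# Hypothetical sample sentence, for illustration only
sample = "Don't stop-motion, I say!"
tokens = word_pattern.findall(sample)
print(tokens)
```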

In [None]:
documents = []

with open("../data/Gutenberg-2019-10-21/metadata.csv", encoding="utf-8") as metadata_file:
    csv_reader = csv.DictReader(metadata_file)
    for document in csv_reader:
        try:
            # Use a separate name for the text file so it does not shadow the metadata file handle
            with open("../data/Gutenberg-2019-10-21/{}".format(document["Filename"]), encoding="utf-8") as text_file:
                print(document["Author"] + " / " + document["Title"])

                lines = []
                for line in text_file:
                    lines.append(line.rstrip())

                text = " ".join(lines)
                document["Text"] = text
                document["Tokens"] = word_pattern.findall(text)
                
                documents.append(document)
        except Exception as e:
            print("! Problem with {}: {}".format(document["Filename"], e))

In [None]:
all_counts = Counter()

for document in documents:
    doc_counter = Counter(document["Tokens"])
    all_counts += doc_counter   
    document["TokenCounts"] = doc_counter

In [None]:
Counter([doc["Author"] for doc in documents])

In [None]:
# Construct a fixed vocabulary

vocabulary = [w for w, c in all_counts.most_common()]

### This might be a good place to select subsets of the vocabulary

vocabulary_size = len(vocabulary)
reverse_vocab = { w: i for i, w in enumerate(vocabulary) }
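One way to select a subset of the vocabulary is to drop the most frequent words and the rarest words before building `reverse_vocab`. The sketch below uses toy counts in place of `all_counts`; the names `num_top_words_to_skip` and `min_count` and their values are assumptions chosen for illustration, not settings from this notebook.

```python
from collections import Counter

# Toy counts standing in for all_counts (made-up numbers)
toy_counts = Counter({"the": 1000, "of": 600, "whale": 40,
                      "Elizabeth": 35, "harpoon": 3, "zyx": 1})

num_top_words_to_skip = 2   # hypothetical: drop the 2 most frequent words
min_count = 2               # hypothetical: drop words seen fewer than 2 times

# Words ordered from most to least frequent
ranked = [w for w, c in toy_counts.most_common()]

# Skip the head of the list, then filter out rare words
toy_vocabulary = [w for w in ranked[num_top_words_to_skip:]
                  if toy_counts[w] >= min_count]
print(toy_vocabulary)
```

Because `counter_to_vector` ignores any word that is missing from `reverse_vocab`, a filtered vocabulary like this should flow through the later cells without other changes.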

In [None]:
def counter_to_vector(counter):
    vector = numpy.zeros(vocabulary_size)
    for word in counter.keys():
        ## look up the integer ID for the string *if* it has one
        if word in reverse_vocab:
            word_id = reverse_vocab[word]
            vector[word_id] = counter[word]
    
    return vector

In [None]:
# Convert counters to vectors
doc_word_matrix = numpy.zeros( (len(documents), len(vocabulary)) )

for doc_id, document in enumerate(documents):
    doc_word_matrix[doc_id,:] = counter_to_vector(document["TokenCounts"])

In [None]:
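# IDF weight for each word: log((1 + N) / df), where N is the number of
# documents and df is the number of documents that contain the word
idf_weights = numpy.zeros(len(vocabulary))

for word_id, word in enumerate(vocabulary):
    docs_with_word = numpy.count_nonzero(doc_word_matrix[:,word_id])
    idf_weights[word_id] = numpy.log( (1 + len(documents)) / docs_with_word )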
    
for word_id in range(20):
    print(vocabulary[word_id], idf_weights[word_id])

In [None]:
# Multiply each column by the IDF weight for that word
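A minimal sketch of this step on a toy matrix, assuming the same IDF formula as the cell above. The names `toy_matrix`, `toy_idf`, and `weighted` are made up for the example. NumPy broadcasting multiplies every row elementwise by the weight vector, which has the effect of scaling column `j` by the weight for word `j`.

```python
import numpy

# Toy document-word matrix: 3 documents (rows) by 3 words (columns)
toy_matrix = numpy.array([
    [4.0, 1.0, 0.0],
    [2.0, 0.0, 3.0],
    [6.0, 0.0, 0.0],
])

num_docs = toy_matrix.shape[0]

# Number of documents containing each word: 3, 1, 1
docs_with_word = numpy.count_nonzero(toy_matrix, axis=0)
toy_idf = numpy.log((1 + num_docs) / docs_with_word)

# A length-3 vector broadcasts across rows, scaling column j by toy_idf[j]
weighted = toy_matrix * toy_idf
print(toy_idf)
print(weighted)
```

In this toy example the first word appears in every document and still gets a small positive weight, log(4/3), because of the `1 +` in the numerator. Words that appear in only one document get the largest weight.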


In [None]:
doc_norms = numpy.linalg.norm(doc_word_matrix, axis=1)
print(sorted(zip(doc_norms, [doc["Title"] for doc in documents]), reverse=True))

In [None]:
# Divide each row by the norm of that document
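A minimal sketch of row normalization on a toy matrix, showing why document length matters for distance-based clustering. The two toy rows have identical word proportions but differ in length by a factor of ten. The names `toy_matrix`, `toy_norms`, and `normalized` are made up for the example, and the sketch assumes no document has a norm of zero.

```python
import numpy

toy_matrix = numpy.array([
    [1.0, 2.0, 2.0],      # short document
    [10.0, 20.0, 20.0],   # same proportions, ten times longer
])

# keepdims=True gives shape (2, 1), so each row is divided by its own norm
toy_norms = numpy.linalg.norm(toy_matrix, axis=1, keepdims=True)
normalized = toy_matrix / toy_norms

# Euclidean distance between the two documents, before and after
distance_before = numpy.linalg.norm(toy_matrix[0] - toy_matrix[1])
distance_after = numpy.linalg.norm(normalized[0] - normalized[1])
print(distance_before, distance_after)
```

Before normalization the two documents are far apart only because one is longer. After normalization each row has length 1 and the two rows coincide.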


In [None]:
num_clusters = 12

kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(doc_word_matrix)
clusters = kmeans.labels_

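# One bin per cluster ID; the default of 10 bins would merge some of the 12 clusters
pyplot.hist(clusters, bins=numpy.arange(num_clusters + 1) - 0.5)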
pyplot.show()

In [None]:
authors = numpy.array(["{}".format(doc["Author"]) for doc in documents])
short_names = numpy.array(["{} / {}".format(doc["Author"], doc["Title"]) for doc in documents])

for cluster in range(num_clusters):
    print(Counter(authors[ clusters == cluster ]))
    print(short_names[clusters == cluster])
    print()
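When comparing runs, a single number can be easier to discuss than the full printout. One possible summary, which is not part of the original notebook, is purity: the fraction of documents whose author matches the most common author in their cluster. The sketch below uses made-up labels in place of `authors` and `clusters`.

```python
from collections import Counter

# Made-up labels for six documents, standing in for authors and clusters
toy_authors = ["Austen", "Austen", "Melville", "Melville", "Austen", "Dickens"]
toy_clusters = [0, 0, 1, 1, 1, 0]

# Count each (cluster, author) pair and the size of each cluster
pair_counts = Counter(zip(toy_clusters, toy_authors))
cluster_sizes = Counter(toy_clusters)

# Add up the size of the largest author group within each cluster
majority_total = 0
for cluster in sorted(cluster_sizes):
    by_author = {a: n for (c, a), n in pair_counts.items() if c == cluster}
    majority_total += max(by_author.values())
    print(cluster, by_author)

purity = majority_total / len(toy_authors)
print(purity)
```

Purity rises toward 1 as the number of clusters grows, so it is most useful for comparing runs that use the same `num_clusters`.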

In [None]:
years = numpy.array([int(doc["Year"]) for doc in documents])
pyplot.scatter(years, clusters)
pyplot.show()

In [None]:
kmeans.cluster_centers_.shape

This section shows information about the "mean" of each of the $K$ clusters. Use this output to get ideas about how to modify the vocabulary.

In [None]:
for cluster in range(num_clusters):
    ## get the vector for the cluster mean
    word_weights = kmeans.cluster_centers_[cluster,:] # row for cluster, all columns
    ## sort the vocabulary by those mean values
    sorted_words = sorted(zip(word_weights, vocabulary), reverse=True)
    ## print the cluster number and then top twenty words, showing word and mean
    print(cluster, " ".join(["{} ({:.2f})".format(w, s) for s, w in sorted_words[:20]]))