# Document Clustering

This notebook demonstrates a relatively simple approach to automated record clustering based on keyword content.

The purpose of the notebook is to:
 * demonstrate use of record management APIs
 * show that it is practical to use machine learning algorithms to automatically discover document clusters
 * provide a reference implementation that could be enhanced
 * generate a model that can subsequently be used to demonstrate automatic record classification

In a production system, unsupervised clustering techniques such as this would probably be used in an ensemble with ontology-based classification, as well as supervised learning from labels that human record managers have applied to a reference set.

In [1]:
# Import all the packages to be used
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import pandas as pd

### Fetching data from the API

The first step in this recipe is to fetch data from the search API. (Due to a small current bug, we are actually using the raw Elasticsearch API behind the Platform Search API.)

This code is wrapped in a class with two methods -- `get_total()` and `get_records()`.

In [2]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import nltk
import re

class ESInterface(object):
    """ Interface to ElasticSearch instance """
    
    def __init__(self, url, username, password):
        """ Initializes object, creates ES connection """
        # Create the ES connection
        self.conn = Elasticsearch(url, verify_certs=True, http_auth=(username, password))
        self.index_name = 'digitalrecords-search'
        self.doc_type = 'modelresult'
        # Query to get records that have keywords
        self.query = {
           'query': {
              'exists': {
                 'field': 'keywords'
              }
           },
           '_source': ['id', 'keywords'],
           'size': 1000
        }
        # Get the total number of matching records
        self.get_total()
        
    def get_total(self):
        """ Gets the total number of records that match the query """
        # Copy the query so that the size override does not leak into self.query
        query = dict(self.query, size=1)
        search = self.conn.search(body=query, index=self.index_name)
        self.total = search['hits']['total']
        
    def get_records(self, test_run=False):
        """ Fetches matching records; a test run stops after 50,000 records """
        print('Total matches: %s' % self.total)
        results = []
        records = scan(self.conn, index=self.index_name, query=self.query, doc_type=self.doc_type)
        if test_run:
            records_num = 50000
            print('Total records to fetch: %s (TEST RUN)' % records_num)
        else:
            records_num = self.total
            print('Total records to fetch: %s' % records_num)
        print('\nNow fetching records from ES...')
        for counter, record in enumerate(records, 1):
            if counter > records_num:
                break
            results.append(record['_source'])
            if counter % 10000 == 0:
                print(' - Fetched %s records out of %s' % (counter, records_num))
        print('Finished fetching records from ES\n')
        return results

This class can now be used to access metadata from the API.
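
The capped fetch in `get_records()` can also be expressed with `itertools.islice`, which stops consuming a lazy iterator after a fixed number of items. The following is a minimal sketch of that idea only; `fake_scan` and `take_sources` are hypothetical names, and `fake_scan` merely stands in for the Elasticsearch scan helper so the example runs without a cluster.

```python
from itertools import islice


def fake_scan(total):
    """Hypothetical stand-in for a scan helper: yields hits lazily."""
    for i in range(total):
        yield {'_source': {'id': 'record-%d' % i, 'keywords': 'alpha, beta'}}


def take_sources(hits, limit):
    """Collect the `_source` of at most `limit` hits without reading further."""
    return [hit['_source'] for hit in islice(hits, limit)]


sample = take_sources(fake_scan(1000), 50)
print(len(sample))
print(sample[0]['id'])
```

Because the iterator is lazy, only the first 50 hits are ever produced, which is the same effect as the `break` in the loop above.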

In [3]:
# Enter the ES credentials
ES_url = 'https://b8cecd38087b50c322e84733eb56a855.ap-southeast-2.aws.found.io:9243/'
ES_username = 'elastic'
ES_password = 'EwbDDTN54MnWni7E6Du0l8hG'

# Use the credentials to instantiate an ESInterface instance
elasticsearch = ESInterface(ES_url, ES_username, ES_password)

# Fetch the records in test_run mode, which means it will only get 50,000 matching records
records = elasticsearch.get_records(test_run=True)

Total matches: 16899580
Total records to fetch: 50000 (TEST RUN)

Now fetching records from ES...
 - Fetched 10000 records out of 50000
 - Fetched 20000 records out of 50000
 - Fetched 30000 records out of 50000
 - Fetched 40000 records out of 50000
 - Fetched 50000 records out of 50000
Finished fetching records from ES



As usual, the data needs to be cleaned up a little bit before it can be used.

In [4]:
# Build a dataframe out of the records
df = pd.DataFrame(records)

# Replace NaN's with empty string
df.fillna('', inplace=True)

# Make sure the keywords column is of string data type
df['keywords'] = df['keywords'].astype(str)

# Show a sample of the dataframe
df.head()

| | id | keywords |
|---|---|---|
| 0 | records.recordname.ab8b1dce-d563-4e5d-b9ba-a9f... | Faculty, Chapters, Contact, Location, Conferen... |
| 1 | records.recordname.ba200f45-6c0d-479c-9f25-d60... | patience, Comments, Archives, courage |
| 2 | records.recordname.dbc97c7a-e526-4e5f-a1f5-8fc... | Attributes, Documents, filters, Resources, Mic... |
| 3 | records.recordname.8d1f4d87-3cd4-4bf0-a129-be0... | Otwarcia, Centrum, Mochnackiego, Obsługi, Cent... |
| 4 | records.recordname.a3377b20-c7e7-4d7c-bb44-f0f... | Centrum, Informacje, Obsługi, Aktualności, Cen... |


Now we need to create a tokenizer function that extracts individual words (removing punctuation, stopwords, etc.).

The `tokenize()` function that we create below is passed to the built-in vectorization classes from the scikit-learn library. It relies on the stopword list and tokenizers from the venerable NLTK (the Natural Language Toolkit).

In [5]:
# Fetch the NLTK resources once, rather than on every call to tokenize()
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
stopwords = set(nltk.corpus.stopwords.words('english'))

def tokenize(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.wordpunct_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        # include only tokens that contain letters (drops numeric tokens and raw punctuation)
        if re.search('[a-zA-Z]', token):
            # exclude stop words, tokens shorter than 3 characters, and tokens that
            # start with non-alphanumeric characters
            if token not in stopwords and len(token) > 2 and token[0].isalnum():
                filtered_tokens.append(token)
    return filtered_tokens
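
To see the effect of this kind of filtering without installing NLTK, here is a simplified, standard-library-only sketch. It is not the `tokenize()` function above: `simple_tokenize` and `SAMPLE_STOPWORDS` are hypothetical names, the stopword set is a tiny illustrative list rather than NLTK's English list, and tokens containing digits are dropped entirely.

```python
import re

# Assumption: a tiny illustrative stopword set, not NLTK's English list
SAMPLE_STOPWORDS = {'the', 'and', 'for', 'with'}


def simple_tokenize(text):
    """Lowercase, keep runs of letters, drop stopwords and tokens under 3 characters."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones such as "ł"
    words = re.findall(r'[^\W\d_]+', text.lower())
    return [w for w in words if len(w) > 2 and w not in SAMPLE_STOPWORDS]


tokens = simple_tokenize('Contact, Location, and the Archives 2017')
print(tokens)  # ['contact', 'location', 'archives']
```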

### Feature Extraction

We will use a vectorizer called TF-IDF (Term-Frequency times Inverse Document-Frequency).

The tokenization function above converts the extracted text into a list of tokens, which the vectorizer turns into a feature matrix. The TF-IDF weighting highlights salient features by comparing how often a term occurs in one document with how many documents contain it.

Read more here: http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
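
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand for three tiny, made-up token lists. It assumes the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization; the documents and the names `docs`, `doc_freq` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents
docs = [
    ['contact', 'newsletter', 'subscribe'],
    ['contact', 'archives', 'comments'],
    ['contact', 'newsletter', 'categories'],
]

n_docs = len(docs)
# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


weights = tfidf(docs[1])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print('%-10s %.3f' % (term, weight))
```

The term `contact` appears in every document, so it receives the lowest weight, while `archives` and `comments` are unique to the second document and are weighted higher.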

In [6]:
vectorizer = TfidfVectorizer(
    max_features=200000,
    stop_words='english',
    use_idf=True,
    tokenizer=tokenize)

feature_matrix = vectorizer.fit_transform(df['keywords'])
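# Note: on scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
# (get_feature_names() was removed in version 1.2)
feature_names = vectorizer.get_feature_names()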

### Clustering

For a simple demonstration, we use one of the fastest clustering algorithms available -- namely, k-means clustering.
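
For readers unfamiliar with the algorithm, the following is a minimal pure-Python sketch of the two alternating steps of k-means (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is an illustration only: the function name `kmeans_sketch`, the 2-D sample points, and the choice to seed the centroids with the first `k` points are all assumptions, and scikit-learn's `KMeans` uses a more careful initialization.

```python
def kmeans_sketch(points, k, iterations=10):
    """Cluster `points` (tuples of floats) into k groups with plain Lloyd iterations."""
    # Assumption: seed the centroids with the first k points, for determinism
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid
        labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans_sketch(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```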

In [7]:
# Execute the k-means clustering
nclusters = 10
km = KMeans(n_clusters=nclusters)
km.fit(feature_matrix)

# Create a dataframe out of the formed clusters
clusters = km.labels_.tolist()
docs = { 'id': df['id'], 'cluster': clusters }
clusters_df = pd.DataFrame(docs, columns = ['id', 'cluster'])

# Show a sample of the result of the clustering
clusters_df.head()

| | id | cluster |
|---|---|---|
| 0 | records.recordname.ab8b1dce-d563-4e5d-b9ba-a9f... | 7 |
| 1 | records.recordname.ba200f45-6c0d-479c-9f25-d60... | 8 |
| 2 | records.recordname.dbc97c7a-e526-4e5f-a1f5-8fc... | 0 |
| 3 | records.recordname.8d1f4d87-3cd4-4bf0-a129-be0... | 0 |
| 4 | records.recordname.a3377b20-c7e7-4d7c-bb44-f0f... | 0 |


To get the keywords/terms per cluster, we need to extract from the k-means clustering object the indices of the terms that make up each cluster center. To get the top defining terms, we reverse the ordering so that the term with the largest weight in the cluster center is the first element in the list. Both steps are accomplished in the single line of code below.

In [8]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
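
The same ranking can be illustrated on a single, made-up centroid without NumPy. In this sketch the names `demo_feature_names` and `demo_centroid` and their values are hypothetical; the point is only that sorting the term indices by descending weight puts the most characteristic terms first.

```python
# Hypothetical vocabulary and centroid weights for one cluster
demo_feature_names = ['archives', 'contact', 'newsletter', 'twitter']
demo_centroid = [0.05, 0.40, 0.30, 0.10]

# Indices of the terms, ordered from the largest centroid weight to the smallest
order = sorted(range(len(demo_centroid)), key=lambda j: demo_centroid[j], reverse=True)

# Keep the top two terms for this cluster
top_terms = [demo_feature_names[j] for j in order[:2]]
print(top_terms)  # ['contact', 'newsletter']
```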

In [9]:
# We proceed further to get only the top 20 terms per cluster
clusters = range(nclusters)
cluster_terms = {k: [] for k in clusters}
for i in clusters:
    cluster_top_terms = [feature_names[x] for x in order_centroids[i, :20]]
    cluster_terms[i] = cluster_top_terms

# We join the raw dataframe with the clusters dataframe to get
# a mapping of the record ID, keywords, and cluster in one dataframe
df.set_index('id', inplace=True)
clusters_df.set_index('id', inplace=True)
combined_df = df.join(clusters_df, how='outer')

# Then from the combined dataframe we can generate a good summary
# of the clusters with cluster number, the count of records, and their defining terms
cdf = combined_df.groupby('cluster').agg('count')
cdf.rename(columns={'keywords': 'count'}, inplace=True)
cdf['keywords'] = [', '.join(cluster_terms[x]) for x in cdf.index]
print(cdf)

         count                                           keywords
cluster                                                          
0        39207  resources, related, services, information, con...
1          871  request, password, information, completed, con...
2          296  twitter, welcome, chapter, followers, membersh...
3          309  resource, looking, temporarily, removed, unava...
4         1639  navigation, primary, sidebar, contact, related...
5         1222  newsletter, contact, subscribe, categories, si...
6         1043  facebook, twitter, popular, connect, subscribe...
7         2256  contact, information, kennesaw, marietta, init...
8         1946  categories, archives, comments, subscribe, pop...
9         1211  service, address, protection, customer, compan...


We now have ten clusters defined by a collection of characteristic keywords. The keywords associated with these clusters can be used to demonstrate record auto-classification.
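
As one possible illustration of that idea, the sketch below assigns a new record to the cluster whose characteristic terms overlap most with the record's keywords. This is an assumption about how such a demonstration might look, not part of the notebook: `classify` and `demo_terms` are hypothetical names, and the term lists are the first few terms shown for clusters 5 and 6 in the truncated output above. A model-based alternative would be to call `km.predict()` on `vectorizer.transform()` of the new keywords.

```python
def classify(keywords, cluster_terms):
    """Return the cluster whose terms overlap most with `keywords`, or None if none overlap."""
    tokens = {k.strip().lower() for k in keywords.split(',') if k.strip()}
    scores = {c: len(tokens & set(terms)) for c, terms in cluster_terms.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Leading terms of clusters 5 and 6, copied from the summary output above
demo_terms = {
    5: ['newsletter', 'contact', 'subscribe', 'categories'],
    6: ['facebook', 'twitter', 'popular', 'connect', 'subscribe'],
}

result = classify('Twitter, Facebook, Contact', demo_terms)
print(result)  # 6
```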