# Document Clustering

This notebook demonstrates a relatively simple approach to automated record clustering based on keyword content.

The purpose of the notebook is to:
 * demonstrate use of record management APIs
 * show that it is practical to use machine learning algorithms to automatically discover document clusters
 * provide a reference implementation that could be enhanced
 * generate a model that can subsequently be used to demonstrate automatic record classification

In a production system, unsupervised clustering techniques such as this would probably be used in an ensemble with ontology-based classification as well as supervised learning of labels that have been applied by human record managers to a reference set.

In [1]:
# Import all the packages to be used
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import pandas as pd

### Fetching data from the API

The first step in this recipe is to fetch data from the search API (due to a small current bug, we are actually using the raw Elasticsearch API behind the Platform Search API).

This code is wrapped in a class with two main methods -- `get_total()` and `get_records()`.

In [2]:
import requests

class SearchInterface(object):
    
    def __init__(self):
        self.base_url = 'https://digitalrecords.showthething.com/api/search/v0/records/'
        self.get_total()
        
    def get_total(self):
        """ Gets the total number of records that match the query """
        resp = self._fetch_batch(0, limit=1)
        data = resp.json()
        self.total = data['hits']['count']
        
    def _fetch_batch(self, offset, limit=1000):
        # The `exists`: `keywords` param makes sure that only records that
        # have keywords are fetched from the database
        params = {'limit': limit, 'offset': offset, 'exists': 'keywords'}
        resp = requests.get(self.base_url, params=params)
        return resp
    
    def get_records(self, test_run=False):
        print('Total matches: %s' % self.total)
        if test_run:
            # Limit the test run to at most 10,000 records
            records_num = min(10000, self.total)
            print('Total records to fetch: %s (TEST RUN)' % records_num)
        else:
            records_num = self.total
            print('Total records to fetch: %s' % records_num)
        results = []
        print('\nNow fetching records from ES...')
        # Step through the result set in batches of 1,000, including the
        # final partial batch (previously dropped when the total was not
        # a multiple of 1,000)
        for batch_offset in range(0, records_num, 1000):
            limit = min(1000, records_num - batch_offset)
            resp = self._fetch_batch(batch_offset, limit=limit)
            if resp.status_code != 200:
                raise Exception('Error fetching records from the server!')
            results += resp.json()['hits']['results']
            print(' - Fetched %s records out of %s' % (len(results), records_num))
        print('Finished fetching records from ES\n')
        return results

This class can now be used to access metadata from the API. This is currently the slowest part of the process, even with the `test_run=True` parameter limiting it to 10,000 records. The approach of pulling keyword metadata down to a script works for hundreds of thousands or millions of records, but becomes impractical for tens or hundreds of millions of records. Future implementations should use a "send the code to the data (not the data to the code)" approach, such as map-reduce.
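
The offset/limit pagination that `get_records()` relies on can be sketched in isolation. `batch_offsets` is a hypothetical helper introduced here for illustration only, not part of the notebook or of the search API. It assumes the API accepts any `limit` up to the batch size.

```python
def batch_offsets(total, records_num, batch_size=1000):
    """Yield (offset, limit) pairs covering the first `records_num` of `total` matches.

    Hypothetical helper: the last pair may ask for fewer than `batch_size`
    records so that a partial final batch is not skipped.
    """
    target = min(total, records_num)
    for offset in range(0, target, batch_size):
        yield offset, min(batch_size, target - offset)


# Test-run sized request against the match count reported by the API
pairs = list(batch_offsets(total=16899620, records_num=10000))
print(len(pairs), pairs[0], pairs[-1])

# A small result set that is not a multiple of the batch size
print(list(batch_offsets(total=2500, records_num=10000)))
```

Each pair could be passed to a method like `_fetch_batch(offset, limit=limit)`. Keeping the arithmetic in one place makes it easier to check that no records are skipped or requested twice.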

In [3]:
# Fetch the records in test_run mode, which means it will only get 10,000 matching records
search = SearchInterface()
records = search.get_records(test_run=True)

Total matches: 16899620
Total records to fetch: 10000 (TEST RUN)

Now fetching records from ES...
 - Fetched 1000 records out of 10000
 - Fetched 2000 records out of 10000
 - Fetched 3000 records out of 10000
 - Fetched 4000 records out of 10000
 - Fetched 5000 records out of 10000
 - Fetched 6000 records out of 10000
 - Fetched 7000 records out of 10000
 - Fetched 8000 records out of 10000
 - Fetched 9000 records out of 10000
 - Fetched 10000 records out of 10000
Finished fetching records from ES



As usual, the data needs to be cleaned up a little bit before it can be used.

In [4]:
import pandas as pd

# Build a dataframe out of the records, taking only the uuid and keywords per record
df = pd.DataFrame(records, columns=['uuid', 'keywords'])

# Let's just rename the `uuid` column to `id`
df.rename(columns={'uuid': 'id'}, inplace=True)

# Replace NaN's with empty string
df.fillna('', inplace=True)

# Make sure the keywords column is of string data type
df['keywords'] = df['keywords'].astype(str)

# Show a sample of the dataframe
df.sample(10)

Unnamed: 0,id,keywords
6672,485705aa-35b8-492b-8978-37284955927e,"['Clothing', 'SERVICES', 'OPENING', 'LOCATION'..."
1670,b8a9361d-a59d-4f57-8282-035aba73c3ea,"['Equipment', '19/11/2016', 'ARTICLESMORE', 'C..."
4445,8fc4d25e-bbef-4a14-82c7-e79e73ea0dfd,"['Turbines', 'County', 'Independent', 'Natural..."
141,60f435ea-6827-4b22-9340-c70a8bf1cda4,"['charge total amount', 'payable alternative a..."
8552,69ff1ffa-5b26-4cea-a0a9-9c771f484452,['September']
6578,18e07718-002c-4729-958d-fca5e973c15b,"['followers', 'Twitter?', 'Twitter']"
5460,09ee9da0-89be-4adf-9f05-11d6907363f5,"['Archives', 'Display', 'Related', 'navigation..."
9366,87be0c6a-d3a0-4416-bc1e-c599a1818bf3,"['Encourage', 'November', 'Results', '/portkem..."
9745,d134fe16-5f37-4962-87d6-37166a73df4e,"['Telephone:', 'Angling', 'Yachting']"
7032,8166111e-bc5b-4997-99a6-2623730323e1,"['Newsletter', 'Description:', 'T-Shirt', 'Rev..."


Now we need to create a tokenize function to extract individual words (removing punctuation, stopwords, etc).

The `tokenize()` function that we create below is used by the built-in vectorization classes from the scikit-learn library. It relies on the stopword list and tokenizers from the venerable NLTK (the Natural Language Toolkit).

In [5]:
import re
import nltk

# Download the NLTK resources once, rather than on every call to tokenize()
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
# get the english stopwords (a set makes the membership test below fast)
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def tokenize(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.wordpunct_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        # include only those that contain letters
        if re.search('[a-zA-Z]', token):
            # exclude stop words, those shorter than 3 characters, and those that
            # start with non-alphanumeric characters
            if token not in STOPWORDS and len(token) > 2 and token[0].isalnum():
                filtered_tokens.append(token)
    return filtered_tokens

### Feature Extraction

We will use a vectorizer called TF-IDF (Term-Frequency times Inverse Document-Frequency).

The `tokenize()` function above converts the keyword text of each record into a list of tokens. The vectorizer uses these tokens as features and weights each one by comparing its frequency across documents, so that terms concentrated in a few documents score higher than terms that appear everywhere.

Read more here: http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
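
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up token lists. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. To the best of our knowledge this is what `TfidfVectorizer` does with its default settings, but treat that as an assumption and check the scikit-learn documentation linked above. The function name `tfidf` and the sample documents are ours, for illustration only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {term: tf[term] * idf[term] for term in tf}
        # L2-normalise so that long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


docs = [
    ['angling', 'yachting', 'newsletter'],
    ['coupons', 'wholesale', 'newsletter'],
    ['angling', 'newsletter'],
]
weights = tfidf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print('%-12s %.3f' % (term, weight))
```

In the first document, `yachting` (which occurs in one document) ends up with a higher weight than `angling` (two documents), which in turn outranks `newsletter` (all three documents).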

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import re

# Note: We have limited the max_features to 100000 here for faster processing
vectorizer = TfidfVectorizer(
    max_features=100000,
    stop_words='english',
    use_idf=True,
    tokenizer=tokenize)

feature_matrix = vectorizer.fit_transform(df['keywords'])
# Note: on scikit-learn older than 1.0, use vectorizer.get_feature_names() instead
feature_names = vectorizer.get_feature_names_out().tolist()

### Clustering

For a simple demonstration, we use one of the fastest clustering algorithms available -- namely, k-means clustering.
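
For readers unfamiliar with the algorithm, the following is a minimal pure-Python sketch of the two alternating steps of k-means (Lloyd's algorithm) on a handful of made-up 2-D points. It is not how scikit-learn implements `KMeans`. In particular, seeding the centroids with the first `k` points is a simplification we chose to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Simplification: seed the centroids with the first k points
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated groups of made-up points
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

In the notebook, the "points" are the rows of the TF-IDF feature matrix, so each centroid is a vector of term weights.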

In [7]:
from sklearn.cluster import KMeans

# Execute the k-means clustering
nclusters = 3 # Split into 3 clusters
km = KMeans(n_clusters=nclusters)
km.fit(feature_matrix)

# Create a dataframe out of the formed clusters
clusters = km.labels_.tolist()
docs = { 'id': df['id'], 'cluster': clusters }
clusters_df = pd.DataFrame(docs, columns = ['id', 'cluster'])

# Show a sample of the result of the clustering
clusters_df.sample(10)

Unnamed: 0,id,cluster
9393,9ec802e2-1c0b-41ac-ac34-5201ca1e62c9,2
5075,27df15ae-9486-4bc4-8b5f-beed9db7a9b6,2
6554,cd9fcd10-224b-402f-8e91-23609d25c626,2
8682,a25216cb-eba7-47cd-b05f-28cefda547b9,2
9897,5cb595ad-439a-4cef-9b96-b00fd8ed8121,2
2703,fc78bdcc-cffb-4a61-9715-521d32b6b076,2
6125,460ebaa0-a30a-4bae-8052-8dc9486c14a9,2
4088,b0332f54-513d-4828-8f73-0c8e6a2c817d,2
4702,bd23f2ae-8184-486f-9a2a-7b61dd6a39a0,2
4957,a903167c-ec64-43c4-875d-4bc91a3e3e18,2


To get the keywords/terms per cluster, we need to extract from the k-means clustering object the indices of the terms that make up each cluster center. To get the top defining terms, we sort those indices and reverse the ordering, so that the term with the highest weight in the cluster center comes first in the list. Both steps are accomplished in the single line of code below.
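
For a single cluster center, the ranking can be sketched in plain Python. The vocabulary and weights below are made up for illustration. For distinct weights, the resulting order is the same as sorting the indices ascending and then reversing them.

```python
# Hypothetical vocabulary and one cluster-center row (weights are made up)
vocab = ['angling', 'coupons', 'newsletter', 'yachting']
centroid = [0.40, 0.05, 0.10, 0.35]

# Indices of the terms, ordered from highest to lowest weight
order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)

# The first entries identify the cluster's defining terms
top_terms = [vocab[j] for j in order[:2]]
print(order)
print(top_terms)
```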

In [8]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

In [9]:
# We proceed further to get only the top 20 terms per cluster
cluster_terms = {k: [] for k in clusters}
for i in range(nclusters):
    cluster_top_terms = [feature_names[x] for x in order_centroids[i, :20]]
    cluster_terms[i] = cluster_top_terms

df.reset_index(inplace=True)
clusters_df.reset_index(inplace=True)

# We fuse the raw dataframe with the clusters dataframe to get
# a mapping of the record ID, keywords, and cluster in one dataframe
df.set_index('id', inplace=True)
clusters_df.set_index('id', inplace=True)
#combined_df = df.join(clusters_df, how='outer', on='id')
combined_df = df.merge(clusters_df)

# Then from the combined dataframe we can generate a good summary
# of the clusters with cluster number, the count of records, and their defining terms
cdf = combined_df.groupby('cluster').agg('count')
cdf.rename(columns={'keywords': 'count'}, inplace=True)
cdf['keywords'] = [', '.join(cluster_terms[x]) for x in cdf.index]

for i,row in cdf.iterrows():
    print('Cluster #%s' % row.name)
    print(row.keywords, '\n')

Cluster #0
easy, nameclassifier, udt, author, phase, status, class, proposed, advanced, project, notes, modified, connection, appears, version, type, direction, generalization, complexity, priority 

Cluster #1
september, coupons, coupon, similar, comments, essendon, warehouse, magazine, watches, business, wholesale, vistaprint, vintage, writers, accessories, furniture, wedding, verizon, wireless, country 

Cluster #2
related, contact, information, angling, yachting, telephone, categories, navigation, partners, newsletter, description, service, products, upcoming, archives, judonotes, twitter, education, welcome, comment 



We now have three clusters defined by a collection of characteristic keywords. The keywords associated with these clusters can be used to demonstrate record auto-classification.

### Saving the Clustering Results

For the auto-classification step, at least the generated clusters and the vectorizer object are needed, so we need to dump those into a file. In Python, we do that using the `pickle` package. It serializes most Python objects and data structures so that they can be loaded later for reuse.

In [10]:
import pickle
import os

results = {
    'vectorizer': vectorizer,
    'features': feature_matrix,
    'clusters': combined_df,
    'terms': cluster_terms
}

# Create the output directory if needed, then write the pickle and close the file
os.makedirs('temp', exist_ok=True)
with open('temp/clusters.pkl', 'wb') as fh:
    pickle.dump(results, fh)

The above step (saving the cluster model to disk) is necessary for the next stage, building an auto-classifier based on keywords. 
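
As a rough sketch of how the next stage might read the file back, the round trip below writes and reloads a small stand-in dictionary. The helper names `save_results` and `load_results` and the temporary path are assumptions made for this example, not part of the notebook.

```python
import os
import pickle
import tempfile


def save_results(results, path):
    """Serialize `results` to `path`, closing the file afterwards."""
    with open(path, 'wb') as fh:
        pickle.dump(results, fh)


def load_results(path):
    """Load a previously saved results dictionary."""
    # Only unpickle files you created yourself: loading a pickle can run arbitrary code
    with open(path, 'rb') as fh:
        return pickle.load(fh)


# Round trip with a stand-in dictionary instead of the real clustering results
demo = {'terms': {0: ['easy', 'author'], 1: ['coupons', 'september']}}
demo_path = os.path.join(tempfile.mkdtemp(), 'clusters.pkl')
save_results(demo, demo_path)
restored = load_results(demo_path)
print(restored == demo)
```

One caveat when loading the real `temp/clusters.pkl`: pickle stores functions by reference, not by value, so the `tokenize()` function used by the vectorizer has to be defined under the same name (or be importable) in the session that loads the file.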