## Document classification using natural language processing

### Motivation

Our goal at Sourceress is to improve the recruiting process so that people can find meaningful positions. Everyone knows about the opportunities available at large tech corporations and popular unicorns, but there are a fair number of young and growing companies with compelling missions. We want to be sure that mission-driven individuals can find and join teams where they share a common goal.

I work on backend, frontend, and devops as a software engineer at Sourceress, but my most recent project was to develop a data science tool to predict candidate fit at a startup using natural language processing tools. I'd like to share how I approached this challenge below.

### Introduction

I have a dataset of candidates for a role at a Bay Area startup, with a profile for each candidate. Along with those profiles, I have a record of which candidates were judged to be a fit for the role, denoted by being "APPROVED" or "REJECTED". The question is: can I write a classifier that takes unstructured profile data and estimates the probability that a candidate is approved or rejected?

### Getting started

Jupyter Notebook works well with Django. Follow the installation instructions on the Jupyter website, install django-extensions using pip (and add django_extensions to INSTALLED_APPS), and then start a notebook from the root of your project with:

```
python manage.py shell_plus --notebook
```

Once in the notebook, I'm going to initialize my project with:

In [1]:
import django
django.setup()

## Accessing and cleaning data

We need to access and filter internal data prior to analyses, but I'm going to gloss over this section because interactions with our internal API aren't that interesting. The general steps are:
- loading Django models,
- querying the database via object filters,
- removing questionable data, and
- aggregating text content with convenience functions.

The critical result is that we will have an object called "documents" with data IDs and text content for analysis.
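To make the shape of "documents" concrete, here is a minimal sketch with invented stand-in data. The field names (headline, summary), the IDs, the aggregate_text helper, and the lowercasing step are all assumptions for illustration; the real version reads from our internal Django models.

```python
# Hypothetical stand-in for records pulled from the database; all values are invented.
raw_profiles = [
    {'person_id': 101, 'headline': 'Data Scientist', 'summary': 'Built ML pipelines in Python.'},
    {'person_id': 102, 'headline': '', 'summary': ''},  # no usable text
    {'person_id': 103, 'headline': 'Backend Engineer', 'summary': 'ASP.NET and C# services.'},
]


def aggregate_text(profile):
    # Join the non-empty text fields into one lowercase string
    # (lowercasing is an assumption, made so later regexes can match "c#" and ".net")
    parts = [profile['headline'], profile['summary']]
    return ' '.join(part for part in parts if part).lower()


# Map each data ID to its text content, dropping profiles with no text
documents = {}
for profile in raw_profiles:
    text = aggregate_text(profile)
    if text:
        documents[profile['person_id']] = text

for person_id, text in documents.items():
    print(person_id, text)
```

Everything downstream only relies on "documents" being a dictionary keyed by data ID with one string of text per entry.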

## Generating features using NLTK

With these documents, we can generate features for analyses. spaCy and NLTK are two Python libraries that are commonly recommended for NLP. I explored spaCy first because it was recommended for being fast and lightweight, but I found that it lacked functionality I needed and consumed ~2GB of memory once loaded. I switched to NLTK because it had the out-of-the-box tools I needed and allowed me to be selective about which modules I loaded.

We'll download and import NLTK resources for tokenization, stopwords, and lemmatization. Then we'll create a set of stopwords to remove low-information tokens and a lemmatizer to "normalize" tokens, with the ultimate goal of reducing our language size for more effective analyses.

In [8]:
import nltk

# Tokenization
assert nltk.download('punkt') is True, 'Download of NLTK resource "punkt" has failed.'
# Stopwords
assert nltk.download('stopwords') is True, 'Download of NLTK resource "stopwords" has failed.'
# Lemmatization
assert nltk.download('wordnet') is True, 'Download of NLTK resource "wordnet" has failed.'

from nltk.corpus import stopwords


STOPSET = set(stopwords.words('english'))
LEMMATIZER = nltk.WordNetLemmatizer()

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Next, we want to clean our documents so that the NLTK tokenization works as expected, and to further reduce language size. Comments in the code below explain the rationale behind each cleaning step, and it's worth reminding ourselves that the text content has many technical terms that need special treatment.

In [12]:
import re

# I don't want to rely on the NLTK tokenizer to correctly parse phrases like
# ".NET" or "ASP.NET", and I also want to clean web addresses, so I'm just going 
# to change those phrases such that the "." is spelled out. Same thing with
# "node.js", "handlebars.js", and other variants.

def _clean_dot_net(text):
    # Note: re's look-behind's need to be fixed width
    options = [r'((?<=\b(ado|asp))\.)', r'((?<=^)\.)', r'((?<=\s)\.)']
    regex = r'(' + r'|'.join(options) + r')(?=net\b)(?!/)'
    replace = 'dot'
    return re.sub(regex, replace, text)


def _clean_dot_js(text):
    # Necessary because of link cleaning
    regex = r'\.(?=js\b)'
    replace = 'dot'
    return re.sub(regex, replace, text)

# Go figure, there are a lot of websites associated with the candidates in
# our dataset. It doesn't make sense to keep these as unique tokens right now,
# so I'm going to turn them into generic website names and see if a model can
# do anything with them before digging deeper.

def _clean_web_addresses(text):
    regex = r'\b(http|www|\.\w{3})\S*\b'
    replace = 'defaultwebaddr'
    return re.sub(regex, replace, text)

# The NLTK tokenizer doesn't break on slashes, so someone who might be a
# "data scientist/software engineer" will have tokens for "data",
# "scientist/software", and "engineer". Thus, I'm going to convert slashes to
# spaces to coerce tokenization in these instances.

def _clean_slash(text):
    regex = r'/'
    replace = ' '
    return re.sub(regex, replace, text)

# Two cleaning functions for special characters that are important, but are
# not handled correctly by NLTK tokenization or str.isalpha() filters.

def _clean_plus(text):
    # Examples:  google+, c++
    regex = r'\+'
    replace = 'plus'
    return re.sub(regex, replace, text)


def _clean_c_sharp(text):
    regex = r'\bc#\B'
    replace = 'defaultcsharp'
    return re.sub(regex, replace, text)

# Finally, there are an infinite number of numeric tokens that may appear.
# I'm going to clean up all of the common cases so that our language is
# reduced even further, starting with specific types of numeric information
# and ending with generic numbers.

def _clean_years(text):
    regex = r'\b(19|20)\d{2}\b'
    replace = 'defaultyear'
    return re.sub(regex, replace, text)


def _clean_currency(text):
    regex = r'\$(\+|-)?(?:0|[1-9]\d{0,2}(?:,?\d{3})*)(?:\.\d+)?'
    replace = 'defaultcurrency'
    return re.sub(regex, replace, text)


def _clean_percent(text):
    regex = r'(\+|-)?(?:0|[1-9]\d{0,2}(?:,?\d{3})*)(?:\.\d+)?%'
    replace = 'defaultpercent'
    return re.sub(regex, replace, text)


def _clean_rank(text):
    regex = r'#\d+'
    replace = 'defaultrank'
    return re.sub(regex, replace, text)


def _clean_numbers(text):
    regex = r'\b\d+\b'
    replace = 'defaultnumber'
    return re.sub(regex, replace, text)
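To see what a few of these functions do in practice, here is a small sketch that applies five of them to one invented sentence. The regexes are copied from the functions above; the sample text and the unprefixed function names are made up for illustration, and the input is assumed to be lowercase already.

```python
import re


def clean_plus(text):
    return re.sub(r'\+', 'plus', text)


def clean_c_sharp(text):
    return re.sub(r'\bc#\B', 'defaultcsharp', text)


def clean_years(text):
    return re.sub(r'\b(19|20)\d{2}\b', 'defaultyear', text)


def clean_percent(text):
    regex = r'(\+|-)?(?:0|[1-9]\d{0,2}(?:,?\d{3})*)(?:\.\d+)?%'
    return re.sub(regex, 'defaultpercent', text)


def clean_numbers(text):
    return re.sub(r'\b\d+\b', 'defaultnumber', text)


# Invented sample text
sample = 'c# and c++ since 2012, cut latency 35% across 12 services'
for fxn in [clean_plus, clean_c_sharp, clean_years, clean_percent, clean_numbers]:
    sample = fxn(sample)

print(sample)
```

Each special case collapses into a single alphabetic placeholder, which is what lets these tokens survive a later isalpha() filter.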

You've probably noticed that there are potential collisions between these cleaning functions - that our results are dependent on the order in which the functions are applied to text. For instance, if we replaced slashes with spaces before replacing web addresses, we would be generating a soup of nonsense tokens from longer links. Thus, we will be careful to apply cleaning functions in a particular order.
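The slash and web address collision can be sketched with the two regexes involved. The URL and the unprefixed function names here are invented for illustration.

```python
import re


def clean_web_addresses(text):
    return re.sub(r'\b(http|www|\.\w{3})\S*\b', 'defaultwebaddr', text)


def clean_slash(text):
    return re.sub(r'/', ' ', text)


# Invented sample text
text = 'see http://github.com/user/repo for code'

# Intended order: collapse the link first, then split on slashes
links_first = clean_slash(clean_web_addresses(text))
# Wrong order: the link is broken into fragments before it can be recognized
slashes_first = clean_web_addresses(clean_slash(text))

print(links_first.split())
print(slashes_first.split())
```

With the intended order the link becomes one placeholder token; with the wrong order, pieces of the path such as "user" and "repo" leak into the token list.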

Below, we'll create a function that cleans documents, tokenizes the resulting text, removes stopwords and non-alphabetic tokens, and lemmatizes the remaining tokens.

In [22]:
def parse_tokens_from_document(document):
    cleaning_functions = [
        _clean_dot_net, _clean_dot_js, _clean_web_addresses, _clean_slash, _clean_plus,
        _clean_c_sharp, _clean_years, _clean_currency, _clean_percent, _clean_rank,
        _clean_numbers
        ]
    for fxn in cleaning_functions:
        document = fxn(document)
    # Return lemmatized tokens if alphabetic and not in the stopset
    return [LEMMATIZER.lemmatize(token) for token in nltk.word_tokenize(document)
            if token not in STOPSET and token.isalpha()]

tokens = {person_id: parse_tokens_from_document(document)
          for person_id, document in documents.items()}

print('Sample tokens from a single document')
for token in list(tokens.values())[0][:10]:
    print('  ' + token)

Great! We have a list of tokens for each document. We can use this to generate a list of common unigrams to use as features, and we'll require that included unigrams occur in at least 100 documents.

In [23]:
import collections

# Minimum number of document occurrences needed for inclusion
MIN_TOKEN_COUNT = 100


def get_unigrams_from_tokens(tokens):
    # Count the number of times a token occurs at least once in a document
    counter = collections.Counter()
    for tkns in tokens.values():
        counter.update(set(tkns))
    # Return a list of unigrams that meet the inclusion criteria
    return [token for token, count in counter.items() if count >= MIN_TOKEN_COUNT]


unigrams = get_unigrams_from_tokens(tokens)

print('Sample unigrams')
for unigram in unigrams[:10]:
    print('  ' + unigram)

The first ten unigrams returned:
 ['top', 'flex', 'opengl', 'decision', 'implementing', 'look', 'gateway', 'idea', 'capability', 'including']
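The counting rule is easier to see on a toy corpus. This sketch uses invented tokens and a threshold of 2 rather than 100; the point is that each document contributes at most one count per token, no matter how often the token repeats within it.

```python
import collections

# Invented toy corpus keyed by made-up IDs
toy_tokens = {
    1: ['python', 'python', 'sql'],
    2: ['python', 'java'],
    3: ['sql', 'python'],
}
toy_min_count = 2

# Count documents containing each token, not total occurrences
doc_counts = collections.Counter()
for tkns in toy_tokens.values():
    doc_counts.update(set(tkns))

# Keep tokens that appear in at least toy_min_count documents
toy_unigrams = sorted(token for token, count in doc_counts.items()
                      if count >= toy_min_count)

print(doc_counts['python'])
print(toy_unigrams)
```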


We also want a list of bigrams. More specifically, we want a list of bigrams that occur more often than expected by chance, also known as collocations.

In [99]:
# Token window for collocation identification
COLLOCATION_WINDOW = 5  
# Minimum likelihood ratio for collocation inclusion
MIN_LIKELIHOOD_RATIO = 10  


def get_collocated_bigrams_from_tokens(tokens): 
    # Get all bigrams that occur within a rolling window of text
    finder_module = nltk.collocations.BigramCollocationFinder
    finder = finder_module.from_words([tkn for tkns in tokens.values() for tkn in tkns],
                                      COLLOCATION_WINDOW)
    # Only include bigrams that occur a certain number of times
    finder.apply_freq_filter(MIN_TOKEN_COUNT)
    # Return a list of unique bigrams that meet the likelihood criteria
    measures = nltk.collocations.BigramAssocMeasures()
    bigrams = set([' '.join(sorted([first_word, second_word]))
                   for (first_word, second_word), score
                   in finder.score_ngrams(measures.likelihood_ratio)
                   if first_word != second_word and score > MIN_LIKELIHOOD_RATIO])
    return [re.split(' ', bigram) for bigram in bigrams]


bigrams = get_collocated_bigrams_from_tokens(tokens)

print('Sample bigrams')
for bigram in bigrams[:10]:
    # Each bigram is a two-item list, so join it before printing
    print('  ' + ' '.join(bigram))

The first ten bigrams returned:
 [['day', 'per'], ['page', 'single'], ['cs', 'mysql'], ['computing', 'distributed'], ['facebook', 'social'], ['many', 'project'], ['defaultwebaddr', 'defaultyear'], ['build', 'design'], ['javascript', 'using'], ['design', 'pattern']]
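For intuition about the score being thresholded, the likelihood ratio for a word pair can be sketched from a 2x2 table of co-occurrence counts. This is a simplified, pure-Python version of the G² statistic; I'm assuming it matches what NLTK computes up to smoothing details, and the function name and counts are invented for illustration.

```python
import math


def g_squared(n_both, n_first_only, n_second_only, n_neither):
    # Observed counts for the 2x2 table: rows are "first word present / absent",
    # columns are "second word present / absent"
    observed = [n_both, n_first_only, n_second_only, n_neither]
    total = sum(observed)
    row = [n_both + n_first_only, n_second_only + n_neither]
    col = [n_both + n_second_only, n_first_only + n_neither]
    # Expected counts if the two words occurred independently
    expected = [row[0] * col[0] / total, row[0] * col[1] / total,
                row[1] * col[0] / total, row[1] * col[1] / total]
    # Cells with zero observations contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Invented counts: the first pair co-occurs exactly as often as chance predicts,
# the second pair co-occurs far more often than chance predicts
independent = g_squared(10, 90, 90, 810)
associated = g_squared(90, 10, 10, 890)

print('independent: {:.2f}'.format(independent))
print('associated:  {:.2f}'.format(associated))
```

A pair that co-occurs only at the chance rate scores near zero, while a strongly associated pair scores well above a threshold of 10.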


Now that we have a list of common unigrams and bigrams, we can prune each document's tokens to generate features.

In [100]:
def get_features(tokens, unigrams, bigrams):
    # Create a dictionary to easily reference bigrams using either word in each pair
    bigram_hash = {}
    for tkn_x, tkn_y in bigrams:
        bigram_hash.setdefault(tkn_x, set()).update([tkn_y])
        bigram_hash.setdefault(tkn_y, set()).update([tkn_x])
    # Step through each document's tokens, locating all unigram and bigram occurrences
    features = {}
    for person_id, tkns in tokens.items():
        # Get unigrams that occur, building the set once for quick lookups
        tkn_set = set(tkns)
        tkn_feats = set(unigram for unigram in unigrams if unigram in tkn_set)
        # Add bigrams that occur by stepping through tokens
        window = COLLOCATION_WINDOW - 1
        for idx_x, tkn_x in enumerate(tkns):
            # Continue to the next token if the focal token does not occur in any bigrams
            if tkn_x not in bigram_hash:  
                continue
            # Check the collocation window for tokens found in bigrams with the focal token
            idx_start = max(0, idx_x - window)
            idx_end = min(len(tkns), idx_x + window)
            for tkn_y in tkns[idx_start:idx_end]:
                if tkn_y in bigram_hash[tkn_x]:
                    # Store bigram with words alphabetically ordered
                    tkn_feats.update([' '.join(sorted([tkn_x, tkn_y]))])  
        features[person_id] = tkn_feats
    return features


features = get_features(tokens, unigrams, bigrams)

print('Sample features')
# Each value is a set of feature strings, so show a few from the first document
for feature in sorted(list(features.values())[0])[:10]:
    print('  ' + feature)
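The window logic can be checked on a short invented token list. This is a simplified sketch, not the function above: it assumes a window of N tokens means two positions differ by at most N - 1, and the helper name is made up.

```python
def cooccur_within_window(tokens, word_a, word_b, window_size):
    # Positions of each word in the token list
    positions_a = [i for i, tkn in enumerate(tokens) if tkn == word_a]
    positions_b = [i for i, tkn in enumerate(tokens) if tkn == word_b]
    # A window of N tokens spans positions that differ by at most N - 1
    return any(0 < abs(i - j) <= window_size - 1
               for i in positions_a for j in positions_b)


# Invented sample tokens
sample = ['distributed', 'system', 'for', 'large', 'scale', 'computing']

near = cooccur_within_window(sample, 'system', 'computing', 5)      # 4 positions apart
far = cooccur_within_window(sample, 'distributed', 'computing', 5)  # 5 positions apart

print(near, far)
```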

We can store these features in a SciPy sparse matrix, which works well with scikit-learn models. We choose a sparse matrix because the number of features will increase as our corpus grows, and we don't want to run into memory issues too quickly. Unfortunately, sparse matrices do not store row or column labels, so we track these explicitly.

In [102]:
import scipy.sparse

# Column names are feature strings
feature_columns = {unigram: idx_col for idx_col, unigram in enumerate(unigrams)}
feature_columns.update({' '.join(sorted(bigram)): idx_col + len(unigrams)
                        for idx_col, bigram in enumerate(bigrams)})

# We need to generate a new data format for the sparse matrix API
data = []  # Track data values
data_row = []  # Track row indices for data values
data_col = []  # Track column indices for data values

# Row names are data IDs
feature_rows = {}

# Iterate through rows and generate new data objects
for idx_row, (person_id, ftrs) in enumerate(features.items()):
    feature_rows[person_id] = idx_row  # Update row labels
    for ftr in ftrs:  # Update data objects
        idx_col = feature_columns[ftr]
        data_row.append(idx_row)
        data_col.append(idx_col)
        data.append(1)
        
# Create a sparse matrix
feature_matrix = scipy.sparse.csr_matrix(
    (data, (data_row, data_col)), shape=(len(feature_rows), len(feature_columns)))
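The (value, row, column) triplet format is easier to picture on a toy example. This sketch uses invented IDs and features, and expands the triplets into a plain nested list instead of calling SciPy, purely to show what the sparse matrix represents.

```python
# Invented toy features keyed by made-up IDs, with explicit column labels
toy_features = {'a': {'python', 'data science'}, 'b': {'sql'}}
toy_columns = {'python': 0, 'sql': 1, 'data science': 2}

values, row_idx, col_idx = [], [], []
toy_rows = {}
for idx_row, (person_id, ftrs) in enumerate(toy_features.items()):
    toy_rows[person_id] = idx_row  # Track the row label
    for ftr in sorted(ftrs):
        values.append(1)
        row_idx.append(idx_row)
        col_idx.append(toy_columns[ftr])

# Expand the triplets into a dense grid; only the triplets are stored in sparse form
dense = [[0] * len(toy_columns) for _ in toy_rows]
for value, i, j in zip(values, row_idx, col_idx):
    dense[i][j] = value

for person_id, idx_row in toy_rows.items():
    print(person_id, dense[idx_row])
```

Only three triplets are stored for a grid of six cells here; with thousands of features per corpus and a few dozen per document, the savings are much larger.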

## Generating labels

We need labels for training and testing, but this step is much simpler -- we just need to be certain that our label ordering matches the row ordering in our feature matrix. We're already using NumPy, SciPy, and scikit-learn, so we'll use a pandas Series so that no scientific libraries feel left out.

Here, "ratings" is a lookup from data ID to label that was created when accessing data during the preliminary steps that I glossed over.

In [120]:
import pandas as pd

labels = pd.Series([ratings[person_id] for person_id, _
                    in sorted(feature_rows.items(), key=lambda x: x[1])])
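The alignment step can be sketched with plain Python. The IDs and ratings below are invented; the point is that sorting the row lookup by row index recovers the matrix's row order regardless of how the lookup itself is ordered.

```python
# Invented row lookup (data ID -> row index) and ratings
toy_feature_rows = {'p7': 1, 'p3': 0, 'p9': 2}
toy_ratings = {'p3': 'APPROVED', 'p7': 'REJECTED', 'p9': 'APPROVED'}

# Sort IDs by their row index so labels line up with matrix rows
ordered_ids = [pid for pid, _ in sorted(toy_feature_rows.items(), key=lambda x: x[1])]
toy_labels = [toy_ratings[pid] for pid in ordered_ids]

print(ordered_ids)
print(toy_labels)
```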

## Generating a classification model

After all of this data engineering, how well does a naive Bayes model predict our labels?

In [131]:
import sklearn.metrics
import sklearn.model_selection
import sklearn.naive_bayes


def calculate_model_performance(feature_matrix, labels):
    # Divide the features and labels into training and testing sets
    # (sklearn.cross_validation was removed in scikit-learn 0.20)
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
        feature_matrix, labels)
    # Fit a model
    model = sklearn.naive_bayes.BernoulliNB()
    model.fit(x_train, y_train)
    # Predict test set labels
    predictions = model.predict(x_test)
    # Note the return order: precision, recall, F-score, support
    p, r, f, _ = sklearn.metrics.precision_recall_fscore_support(
        y_test, predictions, pos_label='APPROVED', average='binary')
    return r, p, f


r, p, f = calculate_model_performance(feature_matrix, labels)
print('Recall:     {:.2f} \nPrecision:  {:.2f} \nF-Score:    {:.2f}'.format(r, p, f))

Ick.

That's not what we were hoping for. Maybe we just got unlucky? Let's cross-validate the model and visualize performance.

In [164]:
%matplotlib inline
import matplotlib.pyplot as plt


NUM_ITER = 100


def cross_validate_model_performance(feature_matrix, labels):
    # Repeat training and testing across many data partitions
    metrics = [calculate_model_performance(feature_matrix, labels)
               for n in range(NUM_ITER)]
    # Format data for plotting
    performance = [[metrics[idx_iter][idx_metric] for idx_iter in range(len(metrics))]
                   for idx_metric in range(3)]
    # Plot
    plt.figure(figsize=(8, 6))
    plt.boxplot(performance)
    plt.title('Model performance')
    plt.xticks(range(1, len(performance)+1), ['Recall', 'Precision', 'F-Score'])
    plt.ylabel('Score')
    plt.ylim([0, 1])
    plt.show()
    return performance
    

performance = cross_validate_model_performance(feature_matrix, labels)

Well, it looks like this model isn't fantastic. What if we do feature selection?

In [150]:
import random

import sklearn.feature_selection

SUPPORT_THRESHOLD = 0.25


def get_supported_features(features, outcomes, prop_train, k_best):
    # Create a selector object to calculate feature support
    selector = sklearn.feature_selection.SelectKBest(k=k_best)
    # Get feature support for different sets of training data
    supports = [_get_support(features, outcomes, selector, prop_train)
                for _ in range(NUM_ITER)]
    # Get features that were supported across at least X% of training sets
    return _get_support_summary(supports)


def _get_support(features, outcomes, selector, prop_train):
    num = features.shape[0]
    idx_train = random.sample(range(num), int(num * prop_train))
    selector.fit(features[idx_train, :], outcomes[idx_train])
    return selector.get_support()


def _get_support_summary(supports):
    total_support = supports[0].astype(int)
    for support in supports[1:]:
        total_support += support.astype(int)
    return total_support >= SUPPORT_THRESHOLD * len(supports)


prop_train = 0.8
k = int(0.1 * feature_matrix.shape[1])

support = get_supported_features(feature_matrix, labels, prop_train, k)
# The support mask is over features, so it selects columns rather than rows
performance_selected = cross_validate_model_performance(feature_matrix[:, support], labels)
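The voting step in the support summary can be sketched without scikit-learn. The per-run selections and feature names below are invented, and this helper works on plain lists rather than NumPy arrays; it keeps a feature if it was selected in at least 25% of the runs.

```python
SUPPORT_THRESHOLD = 0.25


def summarize_support(supports, threshold=SUPPORT_THRESHOLD):
    # Count how many runs selected each feature
    totals = [sum(run[idx] for run in supports) for idx in range(len(supports[0]))]
    # Keep features selected in at least `threshold` of the runs
    return [total >= threshold * len(supports) for total in totals]


# Invented selections: one row per training subset, one column per feature
toy_names = ['python', 'look', 'sql', 'idea']
toy_supports = [
    [True, False, True, False],
    [True, False, False, False],
    [True, True, True, False],
    [True, False, False, False],
    [True, False, False, False],
]

mask = summarize_support(toy_supports)
kept = [name for name, keep in zip(toy_names, mask) if keep]

print(mask)
print(kept)
```

With five runs the cutoff is 1.25 selections, so a feature picked once is dropped and a feature picked twice is kept.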

In [None]:
import numpy as np

plt.figure(figsize=(10, 6))
plt.boxplot(performance, positions=np.arange(1, 8, 2.5))
plt.boxplot(performance_selected, positions=np.arange(2, 9, 2.5))
plt.title('Model performance')
plt.xticks(np.concatenate([np.arange(1, 8, 2.5),np.arange(2, 9, 2.5)]),
           ['Recall', 'Precision', 'F-Score'] + ['Pruned'] * 3)
plt.xlim([0, 8])
plt.ylabel('Score')
plt.ylim([0, 1])
plt.show()

Hmm, it looks like feature selection doesn't make much of a difference and our model still isn't meeting expectations. We have several options for moving forward: explore alternative feature selection methods, clean or normalize features, reconsider the labels, or choose a different model type.

Taking a step back, the issue here is that our labels are determined by several variables, and our feature dataset is not capturing enough of that complexity. We can greatly improve model performance by adding just three additional features, achieving recall and precision scores above 0.9 and F-scores above 0.8, and we can get modest improvements by adding another small set of carefully selected features.
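As a reminder of how these three scores relate, here is a small sketch that computes them from confusion counts. The counts are invented and are not results from our model.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    # Precision: of the candidates predicted APPROVED, how many really were
    precision = true_pos / (true_pos + false_pos)
    # Recall: of the candidates really APPROVED, how many were predicted
    recall = true_pos / (true_pos + false_neg)
    # F-score: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts for illustration
precision, recall, f1 = precision_recall_f1(true_pos=30, false_pos=10, false_neg=20)

print('Recall:     {:.2f}'.format(recall))
print('Precision:  {:.2f}'.format(precision))
print('F-Score:    {:.2f}'.format(f1))
```

Because the F-score is a harmonic mean, it always falls between precision and recall and sits closer to the lower of the two.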

The final model is a random forest ensemble that combines three kinds of inputs: a handful of quantitative features parsed from the raw data, probability estimates from the above naive Bayes model that predicts labels using common keywords, and probability estimates from three naive Bayes models that predict labels using three different subsets of technical keywords.