In [1]:
import os

# All notebooks need to run from the repository root.
# If the current working directory is the notebooks folder, move up one level;
# if we are already in the root directory, do nothing.
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")

from src.utility.get_data import get_non_profit_text_df, get_non_profit_df

In [2]:
df_text = get_non_profit_text_df()
df_text.head(3)

Downloading nonprofit.txt...
Downloading nonprofit_text.txt...


Unnamed: 0,nonprofit_text_id,reporting_year,nonprofit_id,grouptype,description
0,10,2020,4553,charitablegroup,MAINTAIN AND BEAUTIFY THE DEGREGORIE PARK MAIN...
1,11,2019,4978,charitablegroup,PROVIDING HOUSING AND RESIDENTIAL SERVICES FOR...
2,12,2017,37,charitablegroup,PROVIDING SCHOLARSHIPS AND EDUCATIONS ASSISTAN...


In [3]:
df_np = get_non_profit_df()
df_np.head(3)

Unnamed: 0,nonprofit_id,reporting_year,ein,businessname,phone,address1,address2,city,stabbrv,zip
0,10,2021,10274998,MOUNT ST JOSEPH,2078730705,7 HIGHWOOD STREET,7 HIGHWOOD STREET,WATERVILLE,ME,4901
1,11,2020,10275026,BELFAST CURLING CLUB,2073389851,PO BOX 281 BELMONT AVE,PO BOX 281 BELMONT AVE,BELFAST,ME,4915
2,12,2021,10275130,Unity College,2075097100,90 Quaker Hill Road,90 Quaker Hill Road,Unity,ME,4988


# Problem Statement
Given the data set of non-profit organizations above, we will use natural language processing to categorize the organizations based on their tax descriptions. We do not have labels for them, so this is an unsupervised learning problem. Let's start with the simplest possible approach: get our words into an embedding space and run some simple clustering to see what comes out.

Since we don't have labels, we won't be able to evaluate the results quantitatively, but we can take a qualitative look at them and then a quick look at possible next steps.

We'll be using some of the tools from [this article](https://maxhalford.github.io/blog/unsupervised-text-classification/) and filling in the rest with scikit-learn.

In [4]:
import string

def clean_text(text):
    """
    Preprocess text by lowercasing, removing punctuation and newline characters
    """
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.replace('\n', ' ')
    text = ' '.join(text.split())  # remove multiple whitespaces
    return text
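
As a quick sanity check of what this preprocessing does, here is a minimal sketch that applies the function to a made-up description. The function body is repeated (slightly condensed) so the snippet runs on its own, and the `sample` string is an assumption for illustration, not a row from the data set.

```python
import string

def clean_text(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    # str.split() with no arguments also splits on newlines and tabs
    return ' '.join(text.split())

# Made-up example input, not taken from the data set
sample = "PROVIDING  Scholarships,\nand Educational Assistance!"
cleaned = clean_text(sample)
print(cleaned)
```

Note that stripping punctuation also joins hyphenated words (for example, "non-profit" becomes "nonprofit"), which may or may not be what you want before a vocabulary lookup.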

In [5]:
# Download spaCy's large English model, which includes pretrained word vectors
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl (777.4 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.2.0
You should consider upgrading via the '/Users/rharrigan/.pyenv/versions/3.10.2/envs/npc/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [6]:
# Now load our embeddings and setup an embed function
import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')

def embed(tokens, nlp):
    """Return the centroid of the embeddings for the given tokens.

    Out-of-vocabulary tokens are cast aside. Stop words are also
    discarded. An array of 0s is returned if none of the tokens
    are valid.

    """

    lexemes = (nlp.vocab[token] for token in tokens)

    vectors = np.asarray([
        lexeme.vector
        for lexeme in lexemes
        if lexeme.has_vector
        and not lexeme.is_stop
        and len(lexeme.text) > 1
    ])

    if len(vectors) > 0:
        centroid = vectors.mean(axis=0)
    else:
        width = nlp.meta['vectors']['width']  # typically 300
        centroid = np.zeros(width)

    return centroid

doc = df_text.loc[0, "description"]
tokens = doc.split(' ')
centroid = embed(tokens, nlp)

print("Embedded document:")
print(doc)
print()
print(f"output embedding is {centroid.shape}")
print(centroid[:10])


Embedded document:
MAINTAIN AND BEAUTIFY THE DEGREGORIE PARK MAINTAIN AND BEAUTIFY THE SHORE PATH INSTALLED HISTORICAL SIGNAGE MAINTAIN AND BEAUTIFY THE HOWE MEMORIAL PARK

output embedding is (300,)
[ 0.38682842 -0.075748   -0.05325746 -0.16872934  0.06439866 -0.06584267
 -0.15967049  0.08441201 -0.38315043  1.7192727 ]
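
The centroid step in `embed` can be illustrated without loading the spaCy model. The sketch below uses a hypothetical three-dimensional vocabulary; `toy_vectors`, `toy_stop_words`, and `toy_embed` are made up for illustration and are not part of spaCy. It mirrors the same filtering rules: skip tokens without a vector, skip stop words, skip single-character tokens, and fall back to zeros when nothing is left.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "park": np.array([1.0, 0.0, 0.0]),
    "path": np.array([0.0, 1.0, 0.0]),
    "shore": np.array([1.0, 1.0, 0.0]),
}
toy_stop_words = {"the", "and"}

def toy_embed(tokens, vectors, stop_words, width=3):
    """Average the vectors of the tokens that pass the filters."""
    kept = [
        vectors[token]
        for token in tokens
        if token in vectors            # out-of-vocabulary tokens are dropped
        and token not in stop_words    # stop words are dropped
        and len(token) > 1             # single characters are dropped
    ]
    if kept:
        return np.mean(kept, axis=0)
    return np.zeros(width)

centroid = toy_embed("the shore path and park".split(), toy_vectors, toy_stop_words)
empty = toy_embed(["unknownword"], toy_vectors, toy_stop_words)
print(centroid)
print(empty)
```

Because the centroid is a plain average, word order is ignored and every kept token contributes equally, which is one reason long and short descriptions can end up close together.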


# So what do we do with embeddings?
Now that we have embeddings, we need to use them to classify our documents. But we don't have any classes, so we'll need to create some. We could just make some up, but we can do a bit better: what if we look at the natural groupings that emerge from the embeddings?

We can run a nearest-neighbor search over the embeddings and use the resulting neighborhoods to suggest classes. We'll use scikit-learn's [neighbors](https://scikit-learn.org/stable/modules/neighbors.html#unsupervised-nearest-neighbors) module with a Ball Tree, since it [tends to perform better than a KD tree in higher-dimensional spaces](https://towardsdatascience.com/tree-algorithms-explained-ball-tree-algorithm-vs-kd-tree-vs-brute-force-9746debcd940) and we have 300 dimensions!
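
Before building the tree on the full data, it may help to see the query API on a handful of points. This is a minimal sketch on made-up 2-D data (the `points` array is an assumption for illustration). It shows that querying a point that is already in the tree returns that point first, at distance 0, which is why we ask for one extra neighbor and skip the first result.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Toy 2-D points: three near the origin, two near (5, 5)
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.0, 5.2],
    [0.0, 0.3],
])

tree = BallTree(points, leaf_size=2, metric="euclidean")

# query expects a 2-D array of shape (n_queries, n_features)
dist, ind = tree.query(points[[0]], k=3)

print(ind[0])   # row positions in `points`, nearest first
print(dist[0])  # matching Euclidean distances
```

The returned indices are row positions in the array the tree was built from, not labels from any DataFrame the array came from.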

In [32]:
# Do some cleanup: some descriptions are missing (NaN, which pandas stores as a float) rather than strings
mask = df_text['description'].apply(lambda x: isinstance(x, str))
df_text = df_text[mask]
df_text.shape

(1835559, 5)
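
One thing to keep in mind after a boolean filter like this: the DataFrame keeps its original index labels, so row positions (what a NumPy array built from `.values` uses) and index labels (what `.loc` uses) can drift apart. A minimal sketch on a made-up four-row frame, with all names here being illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing description
df = pd.DataFrame({"description": ["parks", np.nan, "housing", "scholarships"]})
mask = df["description"].apply(lambda x: isinstance(x, str))
filtered = df[mask]

labels = filtered.index.tolist()           # original labels survive the filter
values = filtered["description"].values    # positional array, like X in the notebook

by_label = filtered.loc[2, "description"]          # label 2
by_position = filtered["description"].iloc[2]      # position 2

# Resetting the index makes labels and positions agree again
realigned = filtered.reset_index(drop=True)

print(labels)
print(by_label, "vs", by_position, "vs", values[2])
```

Here label 2 and position 2 refer to different rows, and nothing raises an error, so a mismatch like this can go unnoticed. Looking rows up with `.iloc`, or calling `reset_index(drop=True)` after filtering, keeps positional indices and the frame in step.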

In [33]:
from tqdm.notebook import tqdm
# Note that this takes ~4 minutes on my machine
X = []
for doc in tqdm(df_text["description"].values):
    vals = embed(doc.split(' '), nlp)
    X.append(vals)
X = np.array(X)
X.shape, df_text.shape

  0%|          | 0/1835559 [00:00<?, ?it/s]

((1835559, 300), (1835559, 5))

In [35]:
from sklearn.neighbors import BallTree
# This took ~3 minutes on my machine
tree = BallTree(X, leaf_size=30, metric='euclidean')

In [55]:
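# Now get the distance and indices of the 3 nearest neighbors
def print_neighbors(tgt_ind, tree, n_neighbors=3):
    # tgt_ind and the indices returned by the tree are row positions in X,
    # so look descriptions up by position (.iloc); df_text kept its original
    # index labels after the NaN filter above, so .loc could point at other rows
    descriptions = df_text["description"]
    tgt_vec = X[tgt_ind, :]
    tgt_vec = tgt_vec[np.newaxis, ...]
    dist, ind = tree.query(tgt_vec, k=n_neighbors + 1)
    print(f"Input document: {descriptions.iloc[tgt_ind]}")
    print()
    # Skip the first result, which is normally the query document itself (distance 0)
    for i, d in zip(ind[0].tolist()[1:], dist[0].tolist()[1:]):
        print(f"Neighbor {i:02d} is {d:0.2f}: {descriptions.iloc[i]}")

print_neighbors(54624, tree)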

Input document: TO ESTABLISH ORGANIZED AMATEUR VOLLEYBALL WITH ULTIMATE OBJECTIVES OF SOCIAL, PHYSICAL, MENTAL, AND MORAL DEVELOPMENT OF GIRLS AGED 9 TO 17 YEARS. A PROGRAM OF FRIENDLY COMPETITION WITH THE GOAL OF EDUCATING PLAYERS ABOUT SPORTSMANSHIP, TEAMWORK, FELLOWSHIP, COURTESY, DISCIPLINE, AND INTEGRITY WILL BE ESTABLISHED.

Neighbor 351103 is 1.30: Tournaments and games for junior hockey players
Neighbor 261062 is 1.31: THE ORGANIZATION PRODUCED 7 SHOWS AND EVENTS DURING THE YEAR, INCLUDING 2 MAINSTAGE PRODUCTIONS, 4 STAGED READINGS, AND 1 FESTIVAL, WITH AN APPROXIMATE TOTAL ATTENDANCE OF 5230.
Neighbor 1668434 is 1.36: THE ORGANIZATION WORKS ON ENVIRONMENT AND EDUCATION PROGRAMS.


# Discussion
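These results show some promise. The closest neighbor (tournaments for junior hockey players) shares the youth-sports theme of the volleyball query, but the other two (theater productions and environment and education programs) are unrelated to it, so there is definitely a lot of work left to do!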

# Conclusions
We've seen how we can start to classify documents using NLP and unsupervised learning methods. Some next steps might include:
* Identifying the classes we should be using by examining documents that don't fit the classes we have so far, coming up with more classes, and repeating this process.
* Improving the word embedding space (for example, with alternative word vectors such as GloVe) or using a more sophisticated model such as BERT-based sentence embeddings
* Using more sophisticated clustering methods like Hierarchical Clustering


For some more sophistication, take a look at [this article](https://towardsdatascience.com/unsupervised-text-classification-with-lbl2vec-6c5e040354de).
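
As a starting point for the hierarchical clustering idea listed above, here is a minimal sketch using scikit-learn's `AgglomerativeClustering` on made-up 2-D points. The data, the choice of two clusters, and Ward linkage are all assumptions for illustration; this was not run on the non-profit embeddings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy stand-in for document embeddings: two well-separated groups
X_toy = np.array([
    [0.0, 0.0],
    [0.1, 0.1],
    [0.0, 0.2],
    [5.0, 5.0],
    [5.1, 4.9],
    [4.9, 5.2],
])

# Ward linkage merges the pair of clusters that least increases within-cluster variance
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X_toy)

# The label numbers are arbitrary; only the grouping matters
print(labels)
```

Agglomerative clustering without a connectivity constraint generally scales poorly with the number of samples, so on roughly 1.8 million documents it would likely need to be run on a sample or combined with a cheaper first pass.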