## Summary

This is a straightforward notebook: apply HDBSCAN to a 100,000-row subset and save the model. I've provided the data, along with the vectorized sample, on Kaggle.

Note that algorithmic clusters do not necessarily match up perfectly with human intuitions of 'clusters'. A more precise way of thinking about the purpose of this notebook and the next couple is that we are trying to separate comments similar enough to be from the same campaign from comments that are not. Thus a campaign may be broken up into several smaller clusters.

### Additional Notes

* I did not re-run this on all 100,000 data points, as `fit_predict()` takes a few hours to finish.
* Like most clustering algorithms, HDBSCAN does not scale well with dimensionality or sample size. I tried PCA on the document vectors, but it gave me very muddy results.
* My intuition is that if one had trained the word vectors on the corpus using word2vec, or used another encoding method with fewer dimensions, HDBSCAN might scale up better at this step. The issue with other encoding methods, however, is that many require loading the entire corpus into memory, which is not feasible on my desktop.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import normalize
import hdbscan

In [2]:
%%time
## load the vectorized sample of comments that we didn't cull from the original data
X = pd.read_csv('proc_17_108_copy_uniques_level0_unclustered_sample_vectorized.csv', index_col=0)

CPU times: user 6.2 s, sys: 188 ms, total: 6.38 s
Wall time: 6.4 s


In [3]:
X_sample = X.sample(20000) #try a small subset to see it working
# X_sample = X

In [4]:
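### hdbscan doesn't support cosine distance, so let's use Euclidean distance on L2-normalized vectors instead (it grows monotonically with the angle between vectors, so it orders pairs the same way cosine distance does)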

In [5]:
data = X_sample.iloc[:,-300:]
norm_data = normalize(data, norm='l2')
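
To see why Euclidean distance on L2-normalized vectors can stand in for cosine distance, note that for unit vectors the squared Euclidean distance equals 2 × (1 − cosine similarity). The minimal sketch below checks that identity on random data. It uses plain NumPy rather than `sklearn.preprocessing.normalize`, and the helper name `l2_normalize` and the random 300-dimensional vectors are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def l2_normalize(rows):
    """Scale each row to unit L2 norm (assumes no all-zero rows)."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    return rows / norms

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 300))  # stand-in for five 300-dimensional document vectors
unit = l2_normalize(vectors)

a, b = unit[0], unit[1]
cosine_similarity = float(a @ b)
squared_euclidean = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
identity_holds = bool(np.isclose(squared_euclidean, 2.0 * (1.0 - cosine_similarity)))
print(identity_holds)
```

Because the relationship is monotonic, nearest neighbours under one distance are nearest neighbours under the other, which is what a density-based method like HDBSCAN relies on.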

In [6]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True) # would want to scale min_cluster_size with your sample size
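
One simple way to scale `min_cluster_size` with the sample size is to hold it at a fixed fraction of the rows. The sketch below is a hypothetical heuristic, not something provided by the hdbscan library: the helper name `scaled_min_cluster_size`, the linear rule, and the floor of 5 are all assumptions, anchored to the setting of 10 points per 20,000 rows used in this notebook.

```python
def scaled_min_cluster_size(n_rows, base_size=10, base_rows=20_000, floor=5):
    """Hypothetical heuristic: grow min_cluster_size linearly with the sample size."""
    scaled = round(base_size * n_rows / base_rows)
    # Never go below a small floor, so tiny samples still require a few points per cluster
    return max(floor, scaled)

print(scaled_min_cluster_size(20_000))
print(scaled_min_cluster_size(100_000))
```

The result would be passed as the `min_cluster_size` argument to `hdbscan.HDBSCAN`. Whether linear scaling suits this data is an open question; treat it as a starting point for experimentation.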

In [7]:
%%time
# takes a few hours for all 100,000 data points
dbscan = clusterer.fit_predict(norm_data)

CPU times: user 6min 51s, sys: 208 ms, total: 6min 51s
Wall time: 6min 51s


In [9]:
dbscan.max() # make sure it's clustering: labels start at 0 and -1 marks noise, so a max of 9 means 10 clusters; can do some more checking by joining the labels back onto the dataframe

9
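
For a slightly fuller check than the maximum label, it can help to count how many points landed in each cluster and how many were marked as noise. The sketch below assumes HDBSCAN-style labels (non-negative integers for clusters, -1 for noise). The helper name `summarize_labels` and the example labels are made up for illustration.

```python
import numpy as np

def summarize_labels(labels):
    """Count clusters, noise points, and per-cluster sizes from HDBSCAN-style labels."""
    labels = np.asarray(labels)
    cluster_ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    return {
        "n_clusters": int(cluster_ids.size),
        "n_noise": int(np.sum(labels == -1)),
        "sizes": {int(c): int(n) for c, n in zip(cluster_ids, counts)},
    }

# Made-up labels; in the notebook this would be the array returned by fit_predict()
example_labels = [0, 0, 1, -1, 2, 2, 2, -1, 1, 0]
summary = summarize_labels(example_labels)
print(summary)
```

To join the labels back onto the dataframe, something like `X_sample.assign(cluster=dbscan)` should work, since `fit_predict()` returns one label per input row in the same order.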

In [None]:
import joblib

joblib.dump(clusterer, 'level1-clusterer.pkl')
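
Since the clusterer was built with `prediction_data=True`, the saved model can later assign new points to the existing clusters without refitting. The sketch below shows one way that might look, using `joblib.load` and `hdbscan.approximate_predict`. The helper names `l2_normalize` and `predict_with_saved_model` are assumptions for illustration, and new vectors must be normalized the same way as the training data.

```python
import numpy as np

def l2_normalize(rows):
    """Scale each row to unit L2 norm, leaving all-zero rows unchanged."""
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero
    return rows / norms

def predict_with_saved_model(model_path, raw_vectors):
    """Load a pickled HDBSCAN model and label new vectors (hypothetical helper)."""
    # Imported lazily so l2_normalize stays usable without these packages installed
    import joblib
    import hdbscan

    clusterer = joblib.load(model_path)
    # approximate_predict requires a model fitted with prediction_data=True
    labels, strengths = hdbscan.approximate_predict(clusterer, l2_normalize(raw_vectors))
    return labels, strengths
```

A call such as `predict_with_saved_model('level1-clusterer.pkl', X.iloc[:, -300:])` would return one label and one membership strength per row, with -1 again marking points that fit no existing cluster.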