## Summary

This notebook is also very short: it applies the saved HDBSCAN model to the remaining unclustered items using ```approximate_predict()```. This lets us identify groups of comments similar enough to have come from campaigns, as well as comments that are outliers.

### Additional Notes

* ```approximate_predict()``` can be imprecise: clusters may drift, smaller clusters may be missed, etc. To proceed, we have to assume that the 100,000-entry random sample is sufficiently representative of the entire population. This may need further study, but the fact that 22 million of the 23 million comments fell into the top 300 clusters/duplicates makes me comfortable with the assumption at a high level. That still needs to be quantified as well.
* Warning: I cleaned up the cells but did not re-run them, as ```approximate_predict()``` took 12+ hours to finish. Unlike the other notebooks I upload (which I have re-run), this one carries no guarantee that it runs cleanly the first time, but it should be fine.
* I will, however, publish the labeled post-cluster dataset in conjunction with the visualizations notebook, in case you don't feel like waiting overnight for the results.
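
One rough way to start quantifying the representativeness assumption above is to compare the share of points per cluster label in the fitted sample with the shares that ```approximate_predict()``` produces on new data. The sketch below is only an illustration: the helper name `label_share_gap` and the toy label arrays are hypothetical, and in this notebook the inputs would presumably be `clusterer.labels_` and the predicted `results`. Some gap is expected, since the two sets are not drawn the same way, so treat the number as a prompt for a closer look rather than a test.

```python
import numpy as np

def label_share_gap(sample_labels, predicted_labels):
    """Total variation distance between two label distributions.

    0.0 means identical label shares, 1.0 means no overlap at all.
    """
    sample_labels = np.asarray(sample_labels)
    predicted_labels = np.asarray(predicted_labels)
    all_labels = np.union1d(sample_labels, predicted_labels)
    sample_share = np.array([(sample_labels == lab).mean() for lab in all_labels])
    predicted_share = np.array([(predicted_labels == lab).mean() for lab in all_labels])
    return 0.5 * np.abs(sample_share - predicted_share).sum()

# Toy labels (hypothetical); -1 is the HDBSCAN noise label.
sample = np.array([0, 0, 1, 1, -1, -1, -1, -1])
predicted = np.array([0, 0, 0, 0, 1, 1, -1, -1])
gap = label_share_gap(sample, predicted)
print(gap)
```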

In [1]:
import pandas as pd
import numpy as np
import hdbscan

In [2]:
%%time
X = pd.read_csv('proc_17_108_copy_uniques_level0_unclustered_vectorized.csv', index_col=0)

CPU times: user 54.6 s, sys: 2.25 s, total: 56.8 s
Wall time: 57.3 s


In [3]:
X_sample = X
#X_sample = X.sample(1000)
len(X_sample)

864032

In [5]:
### hdbscan doesn't support cosine distance, so let's use euclidean distance on l2-normalized vectors instead (for unit vectors it ranks pairs the same way as cosine/angular distance)

In [6]:
from sklearn.preprocessing import normalize

In [7]:
data = X_sample.iloc[:,-300:]
norm_data = normalize(data, norm='l2')
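
The reasoning behind the normalization step can be checked numerically. For unit-length vectors, squared Euclidean distance equals `2 * (1 - cosine similarity)`, so ordering pairs by Euclidean distance gives the same ordering as cosine distance. A minimal sketch with random 300-dimensional vectors (the vectors are made up; only the identity matters):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 300))

# L2-normalize each vector, as normalize(..., norm='l2') does row by row.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine_similarity = float(a_unit @ b_unit)
squared_euclidean = float(np.sum((a_unit - b_unit) ** 2))

# Identity for unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
identity_holds = bool(np.isclose(squared_euclidean, 2.0 * (1.0 - cosine_similarity)))
print(identity_holds)
```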

In [9]:
### load our pickled clusterer
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib
clusterer = joblib.load('level1-clusterer.pkl')

In [10]:
%%time
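### use our pickled clusterer to predict on new data
### (the clusterer must have been fit with prediction_data=True, or had generate_prediction_data() run on it)
results, strengths = hdbscan.approximate_predict(clusterer, norm_data)
# results = clusterer.fit_predict(norm_data)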

CPU times: user 12h 8min 14s, sys: 14.2 s, total: 12h 8min 28s
Wall time: 12h 11min 19s
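
Given the 12-hour runtime, one option is to predict in chunks and write each chunk to disk as it finishes, so an interrupted run can resume instead of starting over. The sketch below is not what this notebook did. It assumes each row's prediction is independent of the other rows in the batch, which I believe holds for ```approximate_predict()``` but have not verified here. `predict_in_chunks`, `fake_predict`, the chunk size and the file naming are all hypothetical.

```python
import os
import tempfile
import numpy as np

def predict_in_chunks(predict_fn, data, chunk_size, out_dir):
    """Run predict_fn on row chunks of data, checkpointing each chunk to out_dir.

    Chunks already on disk are loaded instead of recomputed.
    """
    os.makedirs(out_dir, exist_ok=True)
    label_parts, strength_parts = [], []
    for start in range(0, len(data), chunk_size):
        path = os.path.join(out_dir, f"chunk_{start:09d}.npz")
        if os.path.exists(path):
            with np.load(path) as saved:
                labels, strengths = saved["labels"], saved["strengths"]
        else:
            labels, strengths = predict_fn(data[start:start + chunk_size])
            # Write to a temporary name first so a crash never leaves a half-written chunk.
            tmp_path = path + ".tmp.npz"
            np.savez(tmp_path, labels=labels, strengths=strengths)
            os.replace(tmp_path, path)
        label_parts.append(labels)
        strength_parts.append(strengths)
    return np.concatenate(label_parts), np.concatenate(strength_parts)

# Stand-in for approximate_predict so the sketch runs without the pickled model.
def fake_predict(chunk):
    labels = (chunk[:, 0] > 0).astype(int)
    strengths = np.abs(chunk[:, 0])
    return labels, strengths

demo = np.random.default_rng(0).normal(size=(10, 3))
with tempfile.TemporaryDirectory() as tmp_dir:
    labels, strengths = predict_in_chunks(fake_predict, demo, chunk_size=4, out_dir=tmp_dir)
    n_chunk_files = len(os.listdir(tmp_dir))
    # A second call finds every chunk on disk and recomputes nothing.
    labels_again, strengths_again = predict_in_chunks(fake_predict, demo, chunk_size=4, out_dir=tmp_dir)
```

In this notebook the call would look something like `predict_in_chunks(lambda chunk: hdbscan.approximate_predict(clusterer, chunk), norm_data, 50000, 'predict_chunks')`, with the chunk size and directory name picked arbitrarily.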


In [None]:
# immediately save results and strengths to their own files so we can glom them back onto the dataframe separately (don't want to lose 12 hours of computing time!)

In [14]:
%%time
joblib.dump(results, 'results-11-19.pkl')

CPU times: user 4 ms, sys: 8 ms, total: 12 ms
Wall time: 18.4 ms


['results-11-19.pkl']

In [15]:
%%time
joblib.dump(strengths, 'strengths-11-19.pkl')

CPU times: user 0 ns, sys: 8 ms, total: 8 ms
Wall time: 10.7 ms


['strengths-11-19.pkl']
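
For reference, gluing the saved arrays back onto the dataframe could look roughly like the sketch below. It uses toy stand-ins for `X_sample`, `results` and `strengths`, and the column names `cluster`, `cluster_strength` and `is_outlier` are hypothetical. It relies on the arrays being in the same row order as the dataframe that was fed to ```approximate_predict()```, and on HDBSCAN's convention that label -1 means noise.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_sample, results and strengths (hypothetical values).
frame = pd.DataFrame({"text": ["a", "b", "c", "d"]}, index=[10, 11, 12, 13])
results = np.array([3, -1, 3, 7])
strengths = np.array([0.9, 0.0, 0.4, 1.0])

if len(results) != len(frame) or len(strengths) != len(frame):
    raise ValueError("predictions and dataframe must have the same number of rows")

# Arrays are attached positionally, so the original index is preserved.
labeled = frame.assign(
    cluster=results,
    cluster_strength=strengths,
    is_outlier=results == -1,  # -1 is the HDBSCAN noise label
)

# Cluster sizes, ignoring outliers.
cluster_sizes = labeled.loc[~labeled["is_outlier"], "cluster"].value_counts()
print(cluster_sizes)
```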