# Part 10 -- KMeans Clustering

Letting the machine identify clusters of trends in the data.

**Load the shared library code**

In [1]:
!pwd

/home/jovyan/work/Github/Analyzing_Unstructured_Data_for_Finance/ipynb


In [2]:
from os import chdir
chdir('/home/jovyan/work/Github/Analyzing_Unstructured_Data_for_Finance/')

from lib import *
# suppress_warnings()
%matplotlib inline

In [3]:
TSNE_SVD_tfidf = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/8.TSNE_SVD_tfidf.pickle')

**Try different components for KMeans Cluster Model**
K-means is a popular clustering algorithm. It places a predefined number of centroids (K) so that each one sits at the mean of the points assigned to it, and it assigns every point to its nearest centroid by Euclidean distance.

We're going to create 10 clusters using MiniBatchKMeans from scikit-learn, a faster variant of k-means that updates the centroids from small random batches of the data instead of the full dataset at each iteration.

If K=10, we're partitioning our data into 10 different clusters, each with its own mean (centroid). K is our guess at how many groups there are in the data, and KMeans finds the mean of each of those groups.
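
To make the assign-then-average loop concrete, here is a minimal NumPy sketch of plain k-means (Lloyd's algorithm). It is not what scikit-learn runs internally, and the function name `lloyd_kmeans` and the synthetic `toy` points are assumptions made up for this illustration, not part of the tweet pipeline.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated synthetic blobs (not the tweets).
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                 rng.normal(10.0, 0.5, size=(50, 2))])
toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_centroids)
```

Each returned centroid is simply the mean of the points labeled with it, which is where the "means" in the name comes from.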

In [4]:
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score

In [None]:
km = MiniBatchKMeans(n_clusters=10, max_iter=100)
km_tsne_tfidf = km.fit(TSNE_SVD_tfidf)
km_clusters = km.predict(TSNE_SVD_tfidf)
km_distances = km.transform(TSNE_SVD_tfidf)

In [None]:
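# Compute (and time) the average silhouette score for the dataset
start = datetime.now()

score = silhouette_score(TSNE_SVD_tfidf, km.labels_)

end = datetime.now()
print(score)
print(end - start)
# The silhouette score ranges from -1 to 1. A score of about .45 means the
# clusters are moderately well separated: values near 1 indicate tight,
# well-separated clusters, and values near 0 indicate overlapping clusters.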
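
One way to try different numbers of clusters is to compute the silhouette score for several values of K and keep the best one. The sketch below does this on synthetic blobs from `make_blobs` rather than on `TSNE_SVD_tfidf`; the names `demo_points`, `scores` and `best_k`, and the range of K values, are assumptions for illustration only.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data: 300 2-D points around 3 centers.
demo_points, _ = make_blobs(n_samples=300,
                            centers=[[0, 0], [10, 10], [-10, 10]],
                            cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    model = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0)
    labels = model.fit_predict(demo_points)
    # Mean silhouette over all points, always between -1 and 1.
    scores[k] = silhouette_score(demo_points, labels)

# Keep the K with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On the full tweet dataset this loop can be slow, because the silhouette score compares every point with every other point; `silhouette_score` accepts a `sample_size` argument to score a random subset instead.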

**This is KMeans on vectorized data (tweets)**

In [None]:
kmeans_model = MiniBatchKMeans(n_clusters=10, init='k-means++', n_init=1, 
                         init_size=1000, batch_size=1000, verbose=False, max_iter=1000, random_state=42)
kmeans = kmeans_model.fit(TSNE_SVD_tfidf)
kmeans_clusters = kmeans.predict(TSNE_SVD_tfidf)
kmeans_distances = kmeans.transform(TSNE_SVD_tfidf)

In [None]:
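# Note: fit_transform re-fits kmeans_model (the same object as `kmeans`) on the
# distance matrix, then returns each tweet's distances to the new centroids.
# Despite the variable name, no t-SNE step is applied here.
tsne_kmeans = kmeans_model.fit_transform(kmeans_distances)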

**This is KMeans on the vectorized data (tweets) fitted to KMeans distances**

In [None]:
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

plt.scatter(tsne_kmeans[:,0], tsne_kmeans[:,1], c=colormap[kmeans_clusters])

In [None]:
X = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/4.X.pickle')

In [None]:
X = X['cleaned_text']

In [None]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

output_notebook()
# Note: Bokeh 3.x renamed plot_width/plot_height to width/height.
plot_tfidf = bp.figure(plot_width=900, plot_height=700, title="2008-2017 tweets (tf-idf)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,save",
    x_axis_type=None, y_axis_type=None, min_border=1)

# Keep the coordinates in the data source so every column the glyph uses comes from it
plot_tfidf.scatter(x="x", y="y",
                    source=bp.ColumnDataSource({
                        "x": TSNE_SVD_tfidf[:,0],
                        "y": TSNE_SVD_tfidf[:,1],
                        "tweet": X,
                    }))

hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet"}
show(plot_tfidf)

The k-means algorithm iterates until the centroids stop moving appreciably or `max_iter` is reached. Then, for each tweet, `predict` gives us the closest centroid and `transform` gives us the distance to each cluster centroid.
Let's see which cluster the first five tweets have ended up in:

In [None]:
tfidf = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/6.tfidf_transformer.pickle')

In [None]:
# Print the cluster and the distance to its centroid for the first five tweets
for i, tweet in zip(range(5), X):
    print("Cluster " + str(kmeans_clusters[i]) + ": " + str(tweet) + " (distance: " + str(kmeans_distances[i][kmeans_clusters[i]]) + ")\n")
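
The relationship between `predict` and `transform` used above can be checked on a small example: the predicted cluster is the column of the distance matrix with the smallest value, and that value is the Euclidean distance to the fitted centroid. This is a sketch on synthetic points; all `demo_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Two synthetic blobs of 40 points each (not the tweets).
rng = np.random.default_rng(0)
demo_points = np.vstack([rng.normal(0.0, 0.5, size=(40, 2)),
                         rng.normal(8.0, 0.5, size=(40, 2))])

demo_model = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0).fit(demo_points)
demo_labels = demo_model.predict(demo_points)        # one cluster index per point
demo_distances = demo_model.transform(demo_points)   # one column per centroid

# The predicted cluster is the column with the smallest distance.
nearest = demo_distances.argmin(axis=1)

# Distance from each point to its own centroid, read from the matrix...
own_distance = demo_distances[np.arange(len(demo_points)), demo_labels]
# ...and recomputed by hand from the fitted centroids.
manual = np.linalg.norm(
    demo_points - demo_model.cluster_centers_[demo_labels], axis=1)

print(own_distance[:5])
```

This is the same lookup as `kmeans_distances[i][kmeans_clusters[i]]` in the loop above, just done for every row at once.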

To better understand what's in each cluster, let's get the top 10 features (words) for each of our 10 clusters:

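In [None]:
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
# Caveat: the column indices of cluster_centers_ only correspond to terms when the model
# was fit on the tf-idf matrix itself, not on a reduced (SVD/t-SNE) or distance representation.
sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = tfidf.get_feature_names()
for i in range(10):
    print("Cluster %d:" % i, end='')
    for j in sorted_centroids[i, :10]:
        print(' %s' % terms[j], end='')
    print()

In [None]:
terms[0]

In [None]:
len(sorted_centroids)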

In [None]:
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

plot_kmeans = bp.figure(plot_width=900, plot_height=700, title="2008-2017 tweets (k-means)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,save",
    x_axis_type=None, y_axis_type=None, min_border=1)

# Keep coordinates and colors in the data source alongside the hover columns
plot_kmeans.scatter(x="x", y="y", color="color",
                    source=bp.ColumnDataSource({
                        "x": tsne_kmeans[:,0],
                        "y": tsne_kmeans[:,1],
                        "color": colormap[kmeans_clusters],
                        "tweet": X,
                        "cluster": kmeans_clusters
                    }))
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet (cluster: @cluster)"}
show(plot_kmeans)