# 1 Introduction

Our dataset is a set of questions from  the online forum [Stack Exchange](https://stackexchange.com/). Every row has three columns:
 
 1. title of the question
 2. content of the question in HTML format
 3. some tags related to the question
 
We will focus only on the content column. We will try to cluster the documents in a meaningful way so that we can identify similar documents and find groups of related subjects.

The data was found on [Kaggle](https://www.kaggle.com/akshatpathak/text-data-clustering/data). Don't go looking now, because on the webpage it is already split into categories. In this exercise, the goal is to cluster the documents and find meaningful clusters ourselves, in an unsupervised way. The data has been merged and mixed.
 

# 2 Exploring and preparing the data

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
import pandas as pd
import time
import numpy as np

In [0]:
questions = pd.read_csv("/content/drive/My Drive/xylosai/clustering/questionData.csv")

In [0]:
len(questions)

In [0]:
questions.head(10)

In [0]:
questions.drop(columns=["title","tags"],inplace=True)
questions.head()

In [0]:
for question in questions["content"][0:10]:
  print(question)
  print("***"*30)

The content is in HTML format with a lot of tags like `<p></p>` or `<a></a>`. We need to extract only the actual text from the content body. For this, we use the get_text() method of the BeautifulSoup library.

In [0]:
! pip install beautifulsoup4

In [0]:
from bs4 import BeautifulSoup

Let's test this on two examples.

In [0]:
print(questions["content"][0])
print("***"*30)
print(questions["content"][6])


In [0]:
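# Name the parser explicitly so the result does not depend on which parsers are installed
print(BeautifulSoup(questions["content"][0], "html.parser").get_text())
print("***"*30)
print(BeautifulSoup(questions["content"][6], "html.parser").get_text())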

Let's now apply this on all rows. 

In [0]:
start = time.time()
# Strip the HTML tags from every question, keeping only the visible text
questions["content_clean"] = questions["content"].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
end = time.time()
print("operation took {0} seconds".format(end-start))

In [0]:
questions.head()  


# 3 Building the TF-IDF vectors

Sklearn has built-in functionality for calculating TF-IDF vectors. [The documentation is found here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Read the documentation. Look at the different parameters and decide which ones are relevant and which ones you don't really care about. Are we going to accept the defaults or do we need to change some parameter values?


Sklearn builds the vocabulary (the list of all available words) automatically if no vocabulary is given by the user. For this exercise, we will let Sklearn do the job for us.

Scroll down in the documentation and look at the methods that we can call on the *TfidfVectorizer* object. Follow the links to get more details about the usage of a method.

The *TfidfVectorizer* object is created with the parameters and fitted with the *fit()* method. When calling *fit()*, the object internally builds the vocabulary and the IDF vector. This is the vector we need to transform a word count vector into a TF-IDF vector by elementwise multiplication (with the default settings, Sklearn then also L2-normalizes each row). After fitting, any word count vector can be transformed into its corresponding TF-IDF vector, using the *transform()* method.

The creation of a vocabulary, getting the word counts and calculating the IDF vector are completely abstracted away and handled by Sklearn in the background.


When calling *transform()*, a list of text objects is transformed into a TF-IDF weighted matrix, with a row vector for each document. These vectors are then used for the clustering.
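To make the description above concrete, here is a minimal sketch on a made-up toy corpus (the documents and the `toy_` names are illustrative assumptions, not part of the dataset). It rebuilds the TF-IDF matrix by hand from raw word counts and the fitted IDF vector, assuming the default *TfidfVectorizer* settings (raw counts as term frequency, L2 normalization), and checks that it matches what Sklearn returns.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

# Hypothetical toy corpus, only for illustration
toy_docs = [
    "how do I boil an egg",
    "how long should I boil pasta",
    "which visa do I need for travel",
]

# Fit the vectorizer (builds vocabulary and IDF vector) and transform in one go
toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# Raw word counts, using the same vocabulary so that the columns line up
toy_counts = CountVectorizer(vocabulary=toy_vectorizer.vocabulary_).transform(toy_docs)

# Multiply every count row elementwise by the IDF vector, then L2-normalize each row
manual = normalize(toy_counts.toarray() * toy_vectorizer.idf_, norm="l2")

print(np.allclose(manual, toy_tfidf.toarray()))
```

If a non-default setting such as `sublinear_tf=True` or `norm=None` were used, the manual computation would have to be adapted accordingly.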


In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

We use all default values, except for min_df and max_df. With this setting, we ignore all words that appear in fewer than 0.1% of the documents or in more than 99% of the documents.

In [0]:
vectorizer = TfidfVectorizer(min_df=0.001, max_df=0.99) 

In [0]:
vectorizer.fit(questions["content_clean"])

We can view our vocabulary with *get_feature_names_out()* (it replaces *get_feature_names()*, which was removed in scikit-learn 1.2). The first few features are just numbers, after that we get words. If we wanted to, we could tweak the vectorizer further so that we have no numbers as features.

In [0]:
# Fetch the vocabulary once and reuse it
feature_names = vectorizer.get_feature_names_out().tolist()
print(feature_names)
print("size of vocabulary: {0}".format(len(feature_names)))

In [0]:
feature_matrix = vectorizer.transform(questions["content_clean"]) 

The result is a SciPy sparse matrix object. It behaves much like a NumPy array, but it only stores the non-zero entries, which makes it far more memory-efficient for sparse data.

Does the shape of the matrix make sense?

In [0]:
print(type(feature_matrix))
print(feature_matrix.shape)
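As a small illustration of what such a sparse matrix looks like, the sketch below builds a hand-written 3 x 4 matrix (the values are made up) and inspects it. In the real feature matrix, rows correspond to documents and columns to vocabulary words, so the same attributes can be used there.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical tiny "document x word" matrix, mostly zeros
toy_dense = np.array([
    [0.0, 0.7, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
])
toy_sparse = csr_matrix(toy_dense)

n_rows, n_cols = toy_sparse.shape
# nnz is the number of explicitly stored (non-zero) entries
density = toy_sparse.nnz / (n_rows * n_cols)

print(toy_sparse.shape)
print("stored entries: {0}".format(toy_sparse.nnz))
print("density: {0:.2f}".format(density))

# A single row can be converted to a dense array for inspection
print(toy_sparse[1].toarray())
```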

# 4 Clustering

Sklearn supports various clustering algorithms [(see documentation here)](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster)

## 4.1 K-means

First, we will try K-means. For this algorithm, we must decide in advance how many clusters we want to find. We use the MiniBatchKMeans class. This is basically the same as KMeans, except that it updates the center positions incrementally using small batches of the data, instead of using all data examples in one large batch at every iteration.

For a large number of samples, this is much faster. Check for yourself by replacing MiniBatchKMeans with KMeans. With KMeans, fitting takes much longer.
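A rough way to compare the two is to time them on the same data. The sketch below does this on synthetic blobs generated with `make_blobs` (an assumption for illustration, not the question data). The measured times depend on the machine and the data, and on small dense data like this the gap may be small or even reversed, so treat it as a template rather than a benchmark.

```python
import time

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 20000 points in 20 dimensions around 4 centers
toy_X, _ = make_blobs(n_samples=20000, n_features=20, centers=4, random_state=0)

toy_models = [
    KMeans(n_clusters=4, n_init=3, random_state=0),
    MiniBatchKMeans(n_clusters=4, n_init=3, random_state=0),
]

toy_results = {}
for model in toy_models:
    start = time.time()
    toy_labels = model.fit_predict(toy_X)
    elapsed = time.time() - start
    toy_results[type(model).__name__] = toy_labels
    print("{0}: {1:.2f} seconds".format(type(model).__name__, elapsed))
```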

In [0]:
from sklearn.cluster import MiniBatchKMeans, KMeans

In [0]:
kmeans = MiniBatchKMeans(n_clusters = 4,random_state=0)

This could take a while...

In [0]:
cluster_labels = kmeans.fit_predict(feature_matrix)

In [0]:
cluster_labels.shape

In [0]:
print(cluster_labels)
print("unique labels: {0}".format(np.unique(cluster_labels)))

We now have a list of cluster labels that maps every original question to a cluster. Since we have not named the clusters, they get the labels 0 to 3.

We add the cluster labels as a column to our original questions dataframe. Then we filter the dataframe in order to display the four clusters.

In [0]:
label_series = pd.Series(cluster_labels)
questions["label"] = label_series

In [0]:
questions["label"].unique()

In [0]:
questions.head(10)

In [0]:
for label in questions["label"].unique():
  print("First 10 samples of cluster {0}".format(label))
  samples = questions[questions["label"] == label]["content_clean"][0:10]
  print(samples)
  
  

Do these clusters seem accurate? Let's try a different number of clusters.

In [0]:
kmeans = MiniBatchKMeans(n_clusters = 6,random_state=0)
cluster_labels = kmeans.fit_predict(feature_matrix)
label_series = pd.Series(cluster_labels)
questions["label_6clusters"] = label_series

In [0]:
for label in questions["label_6clusters"].unique():
  print("First 10 samples of cluster {0}".format(label))
  samples = questions[questions["label_6clusters"] == label]["content_clean"][0:10]
  print(samples)

The original data came from 6 distinct classes: biology, cooking, crypto, DIY, robotics and travel. So it should be no surprise that clustering with 6 clusters yields better results than with 4 clusters.

It looks like...
* Cluster 0 is about DIY.
* Cluster 1 is about cooking.
* Cluster 2 is about travelling.
* Cluster 3 has only one sample.
* Clusters 4 and 5 are not clear; they appear to be a mix of many topics.





## 4.2 Mean-shift

Instead of first creating a clustering object (an instance of the MeanShift class) and then calling *fit()* on it, we will directly use a function called *mean_shift()*.

This would also have been possible for K-means, using the *k_means()* function. The options are clear when [reading the docs, where an overview of both classes and functions is given](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster).
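For comparison, this is roughly what the function-style interface looks like for K-means. The sketch runs *k_means()* on synthetic blobs (illustrative data, not the questions); the function returns the centers, the labels and the inertia directly, without a separate fitted object.

```python
from sklearn.cluster import k_means
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 300 two-dimensional points around 3 centers
toy_X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# One call fits the model and returns the results as a tuple
toy_centers, toy_labels, toy_inertia = k_means(
    toy_X, n_clusters=3, n_init=10, random_state=0
)

print(toy_centers.shape)
print(toy_labels[:10])
print("inertia: {0:.2f}".format(toy_inertia))
```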

By the way...

[read the docs for mean_shift()!](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.mean_shift.html#sklearn.cluster.mean_shift)

In [0]:
from sklearn.cluster import mean_shift

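Running the cell below takes a very long time. Mean-shift is computationally much more expensive than K-means, and *toarray()* also converts the sparse matrix into a dense array, which needs far more memory.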

In [0]:
cluster_centers, labels = mean_shift(feature_matrix.toarray(),n_jobs=-1)
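To see what *mean_shift()* returns without waiting for the full run, the sketch below applies it to a small synthetic dataset (an assumption for illustration). It also uses two documented options, an explicit `bandwidth` from *estimate_bandwidth()* and `bin_seeding=True`, which usually reduce the work. Whether they make the run on the full TF-IDF matrix practical has not been tested here.

```python
from sklearn.cluster import estimate_bandwidth, mean_shift
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 500 two-dimensional points around 3 centers
toy_X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

# Estimate a kernel bandwidth from the data instead of letting mean_shift do it
toy_bandwidth = estimate_bandwidth(toy_X, quantile=0.2, random_state=0)

# bin_seeding starts from a coarse grid of points rather than from every sample
toy_centers, toy_labels = mean_shift(toy_X, bandwidth=toy_bandwidth, bin_seeding=True)

# Unlike K-means, the number of clusters is found by the algorithm itself
print("bandwidth: {0:.2f}".format(toy_bandwidth))
print("clusters found: {0}".format(len(toy_centers)))
```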