This notebook illustrates text clustering, along with a few techniques for visualizing the resulting clusters. We will use a dataset of COVID-19 tweets.

In [None]:
import pandas as pd
import numpy as np

# load datasets
train_data_file = "./Datasets/Corona_NLP/Tweets_preprocessed_train_data.csv"
test_data_file = "./Datasets/Corona_NLP/Tweets_preprocessed_test_data.csv"

# import train and test datasets into data frames and print out their original lengths
train_data_df = pd.read_csv(train_data_file)
test_data_df = pd.read_csv(test_data_file)
print ("Original train set: ",len(train_data_df))
print ("Original test set: ",len(test_data_df))

# remove rows with null labels
train_data_df = train_data_df[~train_data_df["Sentiment"].isnull()]
test_data_df = test_data_df[~test_data_df["Sentiment"].isnull()]
print ("After removing instances with no labels, train set size: ", len(train_data_df))
print ("After removing instances with no labels, test set size: ", len(test_data_df))

# remove empty rows from both datasets and print out their new lengths
train_data_df = train_data_df[~train_data_df["CleanedTweet"].isnull()]
test_data_df = test_data_df[~test_data_df["CleanedTweet"].isnull()]
print ("After removing empty tweets, train set size: ",len(train_data_df))
print ("After removing empty tweets, test set size: ",len(test_data_df))

# print out top 5 rows of the training set
display(train_data_df.head(5))

In [None]:
# combine the training and test data for later tasks
# (ignore_index=True gives the combined frame a fresh, unique index)
frames = [train_data_df, test_data_df]
all_dataset = pd.concat(frames, ignore_index=True)

## Text Clustering

Text clustering can be helpful when we do not have labels and when our goal is to get a better understanding of the dataset.

In [None]:
# preprocess a subset of data for use (using the entire dataset might be prohibitive in some cases)
subset = all_dataset[0:1000].copy()

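# stopwords are excluded for clustering, so we start from the "StopwordRemovedTweet" column;
# the parsing below assumes each entry is a stringified list of quoted tokens
subset['ProcessedTweet'] = ""
for index, row in subset.iterrows():
    # strip the surrounding brackets, then split on commas
    words = row["StopwordRemovedTweet"][1:-1].split(',')
    # remove the whitespace and quotes around each token and join the tokens with spaces
    altered_text = " ".join(word.strip()[1:-1] for word in words)
    subset.at[index, "ProcessedTweet"] = altered_text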

x_text = subset['ProcessedTweet']
display(x_text.head(10))

In [None]:
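from sklearn.feature_extraction.text import CountVectorizer

# bag-of-words representation using unigrams only
vectorizer = CountVectorizer(ngram_range = (1,1))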

vectorizer.fit(x_text)

X = vectorizer.transform(x_text)

final_df = pd.DataFrame(data = X.toarray(), columns=vectorizer.get_feature_names_out())
final_df.head()

### K-Means Clustering

K-Means clustering partitions data points into k clusters, where each observation belongs to the cluster with the nearest mean (the cluster center, also called the centroid). Keywords can be extracted from each cluster to get an intuitive sense of what the cluster is about.
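
To make the "nearest mean" idea concrete, here is a minimal sketch of the two alternating steps behind K-Means (assign each point to its nearest center, then move each center to the mean of its points). The function `kmeans_sketch` and the toy points are hypothetical and only for illustration; this is not how scikit-learn implements `KMeans`, which adds smarter initialization and convergence checks.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Toy K-Means: alternate an assignment step and a centroid update step."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random (an assumption of this sketch)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: each center moves to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# hypothetical toy data: two small, well-separated groups of 2-D points
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centers = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centers)
```

On real data the result depends on the initial centers, which is one reason to prefer the library implementation used in the next cell.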

In [None]:
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# create 3 clusters (random_state is fixed so that the results are reproducible across runs)
k=3
kmeans = KMeans(n_clusters=k, random_state=0).fit(final_df)

# the label of the cluster that each instance was assigned to 
labels = kmeans.predict(final_df)

# find center/centroid of clusters 
centers = kmeans.cluster_centers_

### Visualizing clusters
We will use a simple bar chart to show the most dominant words in each cluster (visualization code from https://nbviewer.org/github/LucasTurtle/national-anthems-clustering/blob/master/Cluster_Anthems.ipynb). Experiment with different values of k and see which one gives the most coherent clusters.

In [None]:
#!pip install seaborn wordcloud

In [None]:
import seaborn as sns
from wordcloud import WordCloud

def get_top_features_cluster(tf_idf_array, prediction, n_feats):
    labels = np.unique(prediction)
    dfs = []
    for label in labels:
        id_temp = np.where(prediction==label) # indices for each cluster
        x_means = np.mean(tf_idf_array[id_temp], axis = 0) # returns average score across cluster
        sorted_means = np.argsort(x_means)[::-1][:n_feats] # indices of the n_feats highest scores
        features = vectorizer.get_feature_names_out()
        best_features = [(features[i], x_means[i]) for i in sorted_means]
        df = pd.DataFrame(best_features, columns = ['features', 'score'])
        dfs.append(df)
    return dfs

def plotWords(dfs, n_feats):
    for i in range(0, len(dfs)):
        # create a new figure per cluster so that every plot gets the same size
        plt.figure(figsize=(8, 4))
        plt.title(("Most Common Words in Cluster {}".format(i)), fontsize=10, fontweight='bold')
        sns.barplot(x = 'score' , y = 'features', orient = 'h' , data = dfs[i][:n_feats])
        plt.show()

In [None]:
final_df_array = final_df.to_numpy()
prediction = kmeans.predict(final_df)
n_feats = 20  # number of top features to extract per cluster
dfs = get_top_features_cluster(final_df_array, prediction, n_feats)
plotWords(dfs, 13)  # plot only the top 13 of those features

What do you observe in the plots above?

### Wordclouds

Now that we have generated graphs above and observed the most common words in each cluster, we can also generate word clouds from the clusters.

In [None]:
# transform a centroids dataframe into a dictionary to be used for a WordCloud
def centroidsDict(centroids, index):
    a = centroids.T[index].sort_values(ascending = False).reset_index().values
    centroid_dict = dict()

    for i in range(0, len(a)):
        centroid_dict.update( {a[i,0] : a[i,1]} )

    return centroid_dict

def generateWordClouds(centroids):
    wordcloud = WordCloud(max_font_size=100, background_color = 'white')
    for i in range(0, len(centroids)):
        centroid_dict = centroidsDict(centroids, i)        
        wordcloud.generate_from_frequencies(centroid_dict)

        plt.figure()
        plt.title('Cluster {}'.format(i))
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.show()

In [None]:
centroids = pd.DataFrame(kmeans.cluster_centers_)
centroids.columns = final_df.columns
generateWordClouds(centroids)

To observe the clusters graphically and more intuitively, we will use PCA to reduce the dimensionality of our feature matrix and produce a two-dimensional plot.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(2)
 
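# project the data onto the first two principal components
df = pca.fit_transform(final_df)
 
print(df.shape)

# note: K-Means is re-fitted here on the 2-D projection, so these clusters
# may differ from the ones found above on the full feature matrix
kmeans = KMeans(n_clusters= k, random_state=0)
 
# fit the model and predict the cluster label of each instance
label = kmeans.fit_predict(df)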

u_labels = np.unique(label)
centroids = kmeans.cluster_centers_

# plot the results 
for i in u_labels:
    plt.scatter(df[label == i , 0] , df[label == i , 1] , label = i)

plt.scatter(centroids[:,0] , centroids[:,1] , s = 80, color = 'k')
plt.legend()
plt.show()

Dimensionality reduction makes the clusters easier to inspect visually. In our case, the clusters are not very distinct (i.e., some words overlap between clusters), since most of the tweets are about the panic in the early days of COVID-19. Improving the text representation (better cleaning, better weighting, etc.) could lead to better clusters. You can also experiment with different k values.

We can also calculate the silhouette score for the clusters. For details on how to do this, see https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html.
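
A minimal sketch of that idea, assuming scikit-learn's `silhouette_score` and a hypothetical toy dataset (`toy_X`, three synthetic 2-D blobs) rather than the tweet features: fit K-Means for several values of k and compare the average silhouette score, where values closer to 1 indicate tighter, better separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# hypothetical toy data: three well-separated 2-D blobs of 30 points each
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

# fit K-Means for several k values and record the average silhouette score of each
scores = {}
for n_clusters in range(2, 6):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    cluster_labels = model.fit_predict(toy_X)
    scores[n_clusters] = silhouette_score(toy_X, cluster_labels)

# pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

To apply this to the notebook's data, `toy_X` would be replaced with `final_df`. On sparse, high-dimensional text features the scores are often much lower than on toy blobs, so they are more useful for comparing k values than as an absolute measure of quality.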