# HW 5: Clustering and Topic Modeling

<div class="alert alert-block alert-warning">Each assignment needs to be completed independently. Never ever copy others' work (even with minor modification, e.g. changing variable names). Anti-Plagiarism software will be used to check all submissions. </div>

<div class="alert alert-block alert-warning">If you use GPT to generate code, make sure you understand and customize the generated code. Keep in mind that there is no guarantee that the generated code satisfies all requirements, or that it even runs. Also keep in mind that similar submissions will be suspected of plagiarism.</div>

In this assignment, you'll practice different text clustering methods. Because clustering is unsupervised, the training set contains text only, while the testing set comes with labels that are used for evaluation.

Sample outputs have been provided to you. Due to randomness, you may not get the same result as shown here. Your target is to achieve about 70% F1 on the test dataset.

## Q1: K-Means Clustering

Define a function `cluster_kmean(train_text, test_text, test_label)` as follows:
- Take three inputs:
    - `train_text` is a list of documents for training
    - `test_text` is a list of documents for testing
    - `test_label` is the list of labels corresponding to the documents in `test_text`
- First generate `TFIDF` weights. You need to decide appropriate values for parameters such as `stop_words` and `min_df`:
    - Keep or remove stopwords? Customized stop words?
    - Set appropriate `min_df` to filter infrequent words
- Use `KMeans` to cluster documents in `train_text` into 4 clusters. Here you need to decide the following parameters:
    - Distance measure: `cosine similarity` or `Euclidean distance`? Pick the one that gives you better performance.
    - When clustering, be sure to use sufficient iterations with different initial centroids so that the clustering converges.
- Test the clustering model performance using `test_label` as follows:
  - Predict the cluster ID for each document in `test_text`.
  - Apply `majority vote` rule to dynamically map the predicted cluster IDs to `test_label`. Note, you'd better not hardcode the mapping, because cluster IDs may be assigned differently in each run. (hint: if you use pandas, look for `idxmax` function).
  - Print out the classification report for the test subset.
- This function returns nothing; its only output is the printed classification report.


- Briefly discuss the following questions.
    - Which preprocessing parameters work better, and why?
    - Which distance measure works better, and why?
    - Could you assign a meaningful name to each cluster? Discuss how you interpret each cluster.
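The steps above can be sketched on toy data as follows. This is a minimal illustration, not a reference solution: the documents and labels (`toy_train`, `toy_test`, `toy_label`) are hypothetical, and the parameter values are placeholders. Note that scikit-learn's `KMeans` only supports Euclidean distance; since `TfidfVectorizer` L2-normalizes each row by default, Euclidean distance between documents is a monotone function of their cosine similarity, which is one common way to approximate cosine-based clustering.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy stand-ins for train_text / test_text / test_label.
toy_train = [
    "the senate passed the budget plan",
    "congress debates the new tax plan",
    "the film won the best picture award",
    "the actor stars in a new film",
]
toy_test = ["senate votes on a tax plan", "new film award for the actor"]
toy_label = np.array(["POLITICS", "ENTERTAINMENT"])

# Rows are L2-normalized by default (norm="l2"), so for any two documents
# squared Euclidean distance = 2 - 2 * cosine similarity.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(toy_train)
X_test = vectorizer.transform(toy_test)

# n_init restarts KMeans from several initial centroids and keeps the best run.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
pred = kmeans.predict(X_test)

# Majority vote: rows are cluster IDs, columns are true labels;
# idxmax picks the most frequent label within each cluster.
votes = pd.crosstab(pred, toy_label)
cluster_to_label = votes.idxmax(axis=1)
mapped = cluster_to_label.loc[pred].to_numpy()
print(cluster_to_label.to_dict(), mapped)
```

Because the mapping is derived from the cross-tabulation on every run, it stays valid even when cluster IDs are assigned differently. Two clusters can map to the same label, in which case some label receives no predictions at all.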


In [1]:
# Add your import statement

import pandas as pd
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline

warnings.filterwarnings('ignore')
# add import statement

In [4]:
train = pd.read_csv("train5.csv")
train_text=train["body"]

test = pd.read_csv("test5.csv")

test_label = test["category"]
test_text = test["body"]

test_text

0      The Academy of Motion Picture Arts and Science...
1      Jim Carrey’s latest portrait is a haunting tri...
2      Actress Ali Wentworth knows she bears a striki...
3       "Film Festivals are still an important outlet...
4      In her 1940 “Self Portrait with Cropped Hair,”...
                             ...                        
746    As Tracey Scott Wilson's Buzzer gets underway ...
747     The Museum. Close to celebrating its 40th ann...
748    Becca “Do the Damn Thing” Kufrin’s season of “...
749    Why do people leave organizations? Reasons oft...
750    WASHINGTON ― House Republicans say they’re mak...
Name: body, Length: 751, dtype: object

In [5]:
from collections import Counter

# Tokenize the text into words
tokens = " ".join(train_text).split()

# Calculate word frequencies
word_frequencies = Counter(tokens)

# Display the top N words by frequency
top_words = word_frequencies.most_common(20)
print(top_words)

# Choose a threshold frequency to identify less informative words
threshold_frequency = 500

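# Identify words with frequency at or below the threshold for exclusion.
# Note: this only scans the 20 most frequent tokens in top_words, all of which
# occur far more than 500 times, so the resulting list is empty.
less_informative_words = [word for word, frequency in top_words if frequency <= threshold_frequency]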

common_stopwords = set(ENGLISH_STOP_WORDS)
# Add less informative words to your custom stopwords
custom_stopwords = common_stopwords.union(less_informative_words)
less_informative_words

[('the', 58409), ('to', 34058), ('of', 32158), ('and', 31154), ('a', 29508), ('in', 21868), ('that', 14991), ('is', 11713), ('for', 11085), ('on', 9798), ('with', 8841), ('I', 7802), ('as', 7440), ('The', 7429), ('was', 7308), ('it', 6030), ('are', 5736), ('by', 5683), ('at', 5591), ('be', 5271)]


[]

In [None]:
def cluster_kmean(train_text, test_text, test_label):


    # Add your code here



In [6]:
def cluster_kmean(train_text, test_text, test_label, stop_words=None):
    # TF-IDF vectorizer
    vectorizer = TfidfVectorizer(stop_words = stop_words ,min_df=5)

    # KMeans clustering
    kmeans = KMeans(n_clusters=4, random_state=42, n_init=50, max_iter=1000,init='k-means++')

    # Create a pipeline
    pipeline = Pipeline([
        ('tfidf', vectorizer),
        ('kmeans', kmeans)
    ])

    # Fit the pipeline on the training data
    pipeline.fit(train_text)

    # Predict cluster labels for test data
    predicted_labels = pipeline.predict(test_text)

    # Map cluster IDs to labels using majority vote
    cluster_to_label = {}
    for cluster_id in range(4):
        majority_label = test_label[predicted_labels == cluster_id].mode().values[0]
        cluster_to_label[cluster_id] = majority_label

    # Map predicted cluster IDs to test labels
    mapped_labels = [cluster_to_label[cluster_id] for cluster_id in predicted_labels]

    # Print classification report
    print(classification_report(test_label, mapped_labels))

    # Interpretation of clusters (you can customize these based on your data)
    print("\nCluster Interpretation:")
    for cluster_id in range(4):
        cluster_text = [test_text[i] for i in range(len(test_text)) if predicted_labels[i] == cluster_id]
        print(f"Cluster {cluster_id + 1}: {len(cluster_text)} documents")
        # Additional analysis or interpretation for each cluster can be added here.

In [7]:
result = cluster_kmean(train_text, test_text, test_label, 'english')

                precision    recall  f1-score   support

ARTS & CULTURE       0.59      0.95      0.73       297
      BUSINESS       0.51      0.72      0.60       142
 ENTERTAINMENT       0.00      0.00      0.00       168
      POLITICS       0.68      0.35      0.47       144

      accuracy                           0.58       751
     macro avg       0.45      0.50      0.45       751
  weighted avg       0.46      0.58      0.49       751


Cluster Interpretation:
Cluster 1: 15 documents
Cluster 2: 476 documents
Cluster 3: 75 documents
Cluster 4: 185 documents


In [8]:
result = cluster_kmean(train_text, test_text, test_label, None)

                precision    recall  f1-score   support

ARTS & CULTURE       0.82      0.66      0.73       297
      BUSINESS       0.52      0.24      0.33       142
 ENTERTAINMENT       0.59      0.81      0.69       168
      POLITICS       0.57      0.87      0.69       144

      accuracy                           0.65       751
     macro avg       0.63      0.64      0.61       751
  weighted avg       0.67      0.65      0.64       751


Cluster Interpretation:
Cluster 1: 65 documents
Cluster 2: 229 documents
Cluster 3: 238 documents
Cluster 4: 219 documents


In [None]:
result = cluster_kmean(train_text, test_text, test_label)

In [9]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the categorical labels to numerical labels
numerical_labels = label_encoder.fit_transform(test_label)

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
import numpy as np

def cluster_gmm(train_text, test_text, test_label):

    # TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    tfidf_train = vectorizer.fit_transform(train_text)

    # Choosing the number of clusters for GMM
    num_clusters_gmm = len(np.unique(test_label))  # Use the number of unique labels in the test set

    # Choosing the covariance type for GMM
    covariance_type = 'full'  # Experiment with other types if needed

    # GMM clustering
    gmm = GaussianMixture(n_components=num_clusters_gmm, covariance_type=covariance_type, n_init=10, random_state=42)
    gmm.fit(tfidf_train.toarray())  # GMM requires dense input

    # Predict cluster labels for test data
    tfidf_test = vectorizer.transform(test_text)
    predicted_labels_gmm = gmm.predict(tfidf_test.toarray())

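    # Map each cluster ID to its majority test label (cluster IDs are arbitrary), then evaluate
    true_labels = np.asarray(test_label)
    cluster_to_label = pd.crosstab(predicted_labels_gmm, true_labels).idxmax(axis=1)
    mapped_labels = cluster_to_label.loc[predicted_labels_gmm].to_numpy()
    print("GMM Classification Report:")
    print(classification_report(true_labels, mapped_labels))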

## Q2: Clustering by Gaussian Mixture Model

In this task, you'll redo the clustering using a Gaussian Mixture Model. Call this function `cluster_gmm(train_text, test_text, test_label)`.

You may take a subset from the data to do GMM because it can take a lot of time.

Write your analysis on the following:
- How did you pick the parameters such as the number of clusters, covariance type, etc.?
- Compared to KMeans in Q1, do you achieve better performance with GMM?

- Note: be sure to use different initial means (i.e. the `n_init` parameter) when fitting the model, so that the result is stable.
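One possible way to approach the parameter question is to compare candidate settings by the Bayesian Information Criterion (BIC), which `GaussianMixture` exposes through its `bic` method. The sketch below uses synthetic two-dimensional data from `make_blobs` as a hypothetical stand-in for document features; the candidate ranges are assumptions, and on real TF-IDF vectors a `full` covariance matrix may be too expensive to fit.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic, well-separated data (hypothetical stand-in for text features).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# Fit one model per (number of components, covariance type) pair.
# n_init reruns EM from several initializations and keeps the best fit.
bic = {}
for k in range(1, 6):
    for cov in ("spherical", "diag", "full"):
        gmm = GaussianMixture(
            n_components=k, covariance_type=cov, n_init=5, random_state=0
        )
        gmm.fit(X)
        bic[(k, cov)] = gmm.bic(X)  # lower BIC is better

best = min(bic, key=bic.get)
print("best (n_components, covariance_type):", best)
```

BIC trades goodness of fit against the number of free parameters, so it tends to penalize `full` covariances and large component counts unless the data supports them.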

In [None]:
def cluster_gmm(train_text, test_text, test_label):

    # Add your code here


In [None]:
import pandas as pd

# Randomly sample 30% of the training data and 15% of the test data
subset_df = train.sample(frac=0.3, random_state=42)
subset_df_test = test.sample(frac=0.15, random_state=42)

# Extract the columns used for clustering ('body' for text, 'category' for labels)
train_text_subset = subset_df['body'].tolist()
test_text_subset = subset_df_test['body'].tolist()
test_label_subset = subset_df_test['category']

numerical_labels = label_encoder.fit_transform(test_label_subset)

# Now you can use the subset for clustering

results = cluster_gmm(train_text_subset, test_text_subset, numerical_labels)

In [None]:
results = cluster_gmm(train_text, test_text, test_label)

## Q3: Clustering by LDA

In this task, you'll re-do the clustering using LDA. Call this function `cluster_lda(train_text, test_text, test_label)`.

However, since LDA returns a topic mixture for each document, you need to `assign the topic with the highest probability to each test document`, and then measure the performance as in Q1.

In addition, within the function, please print out the top 30 words for each topic.

Finally, please analyze the following:
- Based on the top words of each topic, could you assign a meaningful name to each topic? In other words, do you think your result can achieve intratopic coherence and intertopic separation?
- Although the test subset shows there are 4 clusters, without this information, how do you choose the number of topics?
- Among the three models, KMeans, GMM, and LDA, which model performs the best? Can you explain why this model can outperform the others?
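For the question of choosing the number of topics without labels, one option is to compare held-out perplexity across candidate topic counts and then assign each document its highest-probability topic. The sketch below assumes a tiny hypothetical corpus (`toy_train`, `toy_heldout`) and an arbitrary candidate range; perplexity is only one signal and does not always agree with human judgments of topic quality.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for train_text and a held-out split.
toy_train = [
    "senate votes on the budget and tax plan",
    "congress passes a tax and budget deal",
    "the film festival honors a young director",
    "the director casts a famous actor in the film",
    "the museum opens a new painting exhibit",
    "the gallery shows a painting by a local artist",
]
toy_heldout = ["the senate debates a tax plan", "a new film by the director"]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(toy_train)
X_heldout = vectorizer.transform(toy_heldout)

# Fit one model per candidate topic count and record held-out perplexity
# (lower is better).
perplexity = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    perplexity[k] = lda.perplexity(X_heldout)

best_k = min(perplexity, key=perplexity.get)

# Refit with the chosen count; each row of doc_topic is a topic mixture,
# and argmax picks the highest-probability topic per document.
lda = LatentDirichletAllocation(n_components=best_k, random_state=0).fit(X_train)
doc_topic = lda.transform(X_heldout)
top_topic = doc_topic.argmax(axis=1)
print(best_k, top_topic)
```

The resulting `top_topic` array plays the same role as the cluster IDs in Q1, so the same majority-vote mapping can be applied before computing the classification report.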

In [None]:
def cluster_lda(train_text, test_text, test_label):

    # add your code here


In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the categorical labels to numerical labels
numerical_labels = label_encoder.fit_transform(test_label)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import classification_report

def cluster_lda(train_text, test_text, test_label, num_topics=4):


    # Count vectorizer (Bag of Words)
    vectorizer = CountVectorizer(min_df=4,stop_words='english')
    X_train = vectorizer.fit_transform(train_text)

    # LDA model
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda.fit(X_train)

    # Transform the test data to topic space
    X_test = vectorizer.transform(test_text)
    topic_probabilities = lda.transform(X_test)

    # Assign the topic with the highest probability to each test document
    predicted_topics = topic_probabilities.argmax(axis=1)

    # Print top 30 words for each topic
    feature_names = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(lda.components_):
        top_words_idx = topic.argsort()[:-31:-1]
        top_words = [feature_names[i] for i in top_words_idx]
        print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")

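    # Map each topic ID to its majority test label (topic IDs are arbitrary), then evaluate
    true_labels = np.asarray(test_label)
    topic_to_label = pd.crosstab(predicted_topics, true_labels).idxmax(axis=1)
    mapped_labels = topic_to_label.loc[predicted_topics].to_numpy()
    print("\nLDA Classification Report:")
    print(classification_report(true_labels, mapped_labels))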

In [None]:
cluster_lda(train_text, test_text, numerical_labels)

## Q5. Bonus:

Can you measure the coherence and separation of the clustering results from the three models? Which model performs the best in terms of the coherence and separation?

Explain your idea and implement it.
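One candidate measure, offered here as an assumption rather than the intended answer, is the silhouette coefficient: for each document it compares the mean distance to its own cluster (coherence) with the mean distance to the nearest other cluster (separation). The sketch below computes it with cosine distance on a hypothetical toy corpus clustered by `KMeans`; the same call could be applied to the labels produced by GMM or LDA over the same feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus with two rough themes.
toy_docs = [
    "senate votes on the budget and tax plan",
    "congress passes a tax and budget deal",
    "the senate debates the tax deal",
    "the film festival honors a young director",
    "the director casts a famous actor in the film",
    "the actor wins an award at the film festival",
]

X = TfidfVectorizer().fit_transform(toy_docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette ranges from -1 to 1; values near 1 indicate tight,
# well-separated clusters, values near 0 indicate overlapping clusters.
score = silhouette_score(X, labels, metric="cosine")
print(f"silhouette (cosine): {score:.3f}")
```

Using the same feature matrix and distance metric for all three models keeps the scores comparable, since the silhouette depends on both.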

In [None]:
test_text.shape