# HW 6: Clustering and Topic Modeling

In this assignment, you'll practice different text clustering methods. A dataset has been prepared for you:
- `hw6_train.csv`: This file contains a list of documents. It's used for training models.
- `hw6_test.csv`: This file contains a list of documents and their ground-truth labels (4 labels: 1, 2, 3, 7). It's used for external evaluation.

|Text| Label|
|----|-------|
|paraglider collides with hot air balloon ... | 1|
|faa issues fire warning for lithium ... | 2|
| .... |...|

Sample outputs have been provided to you. Due to randomness, you may not get the same results as shown here. Your target is to achieve about 70% F1 on the test dataset.

## Q1: K-Means Clustering

Define a function `cluster_kmean(train_text, test_text, test_label)` as follows:
- Take three inputs:
    - `train_text` is a list of documents for training
    - `test_text` is a list of documents for testing
    - `test_label` is the labels corresponding to documents in `test_text`
- First generate `TFIDF` weights. You need to decide appropriate values for parameters such as `stop_words` and `min_df`:
    - Keep or remove stop words? Use customized stop words?
    - Set an appropriate `min_df` to filter infrequent words.
- Use `KMeans` to cluster documents in `train_text` into 4 clusters. Here you need to decide the following parameters:
    - Distance measure: `cosine similarity` or `Euclidean distance`? Pick the one that gives you better performance.
    - When clustering, be sure to use sufficient iterations with different initial centroids to make sure the clustering converges.
- Test the clustering model performance using `test_label` as follows: 
  - Predict the cluster ID for each document in `test_text`.
  - Apply the `majority vote` rule to dynamically map the predicted cluster IDs to `test_label`. Do not hardcode the mapping, because cluster IDs may be assigned differently in each run. (Hint: if you use pandas, look for the `idxmax` function.)
  - Print out the classification report for the test subset.
- This function has no return value; it only prints the classification report.


- Briefly discuss:
    - Which distance measure is better and why it is better. 
    - Could you assign a meaningful name to each cluster? Discuss how you interpret each cluster.
- Write your analysis in a pdf file.
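
The majority-vote mapping described above can be sketched on toy data. This is only an illustration: the cluster IDs and labels below are made up, and `pd.crosstab` is one of several ways to count label frequencies per cluster.

```python
import pandas as pd

# Toy data (made up): cluster IDs from a model and the matching true labels
predicted = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
true_label = [7, 7, 1, 2, 2, 3, 3, 1, 1, 1]

# Rows are cluster IDs, columns are true labels, cells are counts
counts = pd.crosstab(index=pd.Series(predicted, name="cluster"),
                     columns=pd.Series(true_label, name="label"))

# For each cluster, idxmax(axis=1) returns the label with the highest count
cluster_to_label = counts.idxmax(axis=1).to_dict()

# Translate every predicted cluster ID into its majority label
mapped = [cluster_to_label[c] for c in predicted]
print(cluster_to_label)
```

Because the mapping is rebuilt from the counts on every run, it stays valid even when the clustering algorithm numbers its clusters differently.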

In [1]:
# Add your import statement
import gensim
from gensim import corpora
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from nltk.corpus import stopwords
from nltk.cluster import KMeansClusterer,cosine_distance,euclidean_distance
from sklearn import mixture
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans


In [2]:
train = pd.read_csv("hw6_train.csv")
train_text=train["text"]

test = pd.read_csv("hw6_test.csv")
test_label = test["label"]
test_text = test["text"]

train.head()

|Unnamed: 0|text|
|----|-------|
|0|Would you rather get a gift that you knew what...|
|1|Is the internet ruining people's ability to co...|
|2|Permanganate?\nSuppose permanganate was used t...|
|3|If Rock-n-Roll is really the work of the devil...|
|4|Has anyone purchased software to watch TV on y...|


In [3]:
def cluster_kmean(train_text, test_text, test_label):
    # TF-IDF weights: drop English stop words and words found in fewer than 3 documents
    tfidf_vect = TfidfVectorizer(stop_words="english", min_df=3)
    dtm = tfidf_vect.fit_transform(train_text)

    # K-means with cosine distance, repeated 20 times with different initial centroids
    num_clusters = 4
    clusterer = KMeansClusterer(num_clusters, cosine_distance, repeats=20)
    clusters = clusterer.cluster(dtm.toarray(), assign_clusters=True)

    # Rank the vocabulary by centroid weight (useful when interpreting each cluster)
    centroids = np.array(clusterer.means())
    sorted_centroids = centroids.argsort()[:, ::-1]
    voc_lookup = tfidf_vect.get_feature_names_out()

    # Predict a cluster ID for each test document
    test_dtm = tfidf_vect.transform(test_text)
    predicted = [clusterer.classify(v) for v in test_dtm.toarray()]

    # Majority vote: map each cluster ID to its most frequent true label
    new_p = pd.DataFrame({'Original': test_label, 'Predicted': predicted})
    new_p1 = new_p.groupby('Predicted')['Original'].apply(lambda x: x.value_counts().idxmax())
    cluster_d = new_p1.to_dict()  # keyed by cluster ID, so missing IDs cannot break the lookup
    predicted_target = [cluster_d[i] for i in predicted]
    print(metrics.classification_report(test_label, predicted_target))


    

In [4]:
cluster_kmean(train_text, test_text, test_label)

              precision    recall  f1-score   support

           1       0.83      0.70      0.76       332
           2       0.90      0.68      0.77       314
           3       0.68      0.88      0.77       355
           7       0.69      0.77      0.73       273

    accuracy                           0.76      1274
   macro avg       0.78      0.75      0.76      1274
weighted avg       0.78      0.76      0.76      1274





In [6]:
cluster_kmean(train_text, test_text, test_label)

              precision    recall  f1-score   support

           1       0.84      0.69      0.76       332
           2       0.91      0.68      0.78       314
           3       0.67      0.88      0.76       355
           7       0.70      0.77      0.73       273

    accuracy                           0.76      1274
   macro avg       0.78      0.76      0.76      1274
weighted avg       0.78      0.76      0.76      1274

TfidfVectorizer(min_df=3, stop_words='english')
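
On the distance-measure question in Q1, it helps to know how the two measures relate. `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`), and for unit-length vectors the squared Euclidean distance equals `2 * (1 - cosine similarity)`. The sketch below checks this identity on a made-up corpus; it does not say which measure scores better on this dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Made-up documents, for illustration only
docs = [
    "stock markets fell sharply today",
    "the markets rallied after the report",
    "the team won the final game",
    "fans cheered as the team scored",
]

# Rows come out L2-normalized because norm="l2" is the default
X = TfidfVectorizer().fit_transform(docs)

cos_sim = cosine_similarity(X)
sq_euclid = euclidean_distances(X, squared=True)

# For unit vectors u and v: ||u - v||^2 = 2 - 2 * (u . v)
identity_holds = np.allclose(sq_euclid, 2 * (1 - cos_sim))
print(identity_holds)
```

Even so, the two K-means variants can converge to different clusterings, since averaging unit vectors produces centroids that are no longer unit length.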


## Q2: Clustering by Gaussian Mixture Model

In this task, you'll redo the clustering using a Gaussian Mixture Model. Call this function `cluster_gmm(train_text, test_text, test_label)`.

You may fit the GMM on a subset of the data, because fitting can take a long time.

Write your analysis on the following:
- How did you pick the parameters such as the number of clusters, covariance type, etc.?
- Compared to KMeans in Q1, do you achieve better performance with GMM?
- Note: as with KMeans, be sure to use different initial means (i.e. the `n_init` parameter) when fitting the model to achieve model stability.
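
The subset and `n_init` hints can be sketched as follows. Synthetic 2-D blobs stand in for the document vectors here, and the subset size, number of components, and seeds are arbitrary assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for document vectors: four 2-D blobs of 200 points each
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(200, 2)) for c in centers])

# Fit on a random subset to cut the fitting time
subset_idx = rng.choice(len(X), size=400, replace=False)
X_sub = X[subset_idx]

# One initialization versus the best of ten initializations
single = GaussianMixture(n_components=4, covariance_type="diag",
                         n_init=1, random_state=42).fit(X_sub)
multi = GaussianMixture(n_components=4, covariance_type="diag",
                        n_init=10, random_state=42).fit(X_sub)

# lower_bound_ is the log-likelihood lower bound of the best EM run
print(single.lower_bound_, multi.lower_bound_)
```

With `n_init > 1`, scikit-learn keeps the run with the highest lower bound, so extra initializations trade fitting time for a more stable result.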

In [32]:
def best_cluster(train_text):
    tfidf_vect = TfidfVectorizer(stop_words="english", min_df=3)
    # Densify once and reuse; GaussianMixture expects a dense array
    X = tfidf_vect.fit_transform(train_text).toarray()
    lowest_bic = np.inf   # initial BIC is set to infinity
    best_gmm = None
    n_components_range = range(2, 5)
    cv_types = ['spherical', 'tied', 'diag']
    for cvtype in cv_types:
        for n_components in n_components_range:
            gmm = mixture.GaussianMixture(n_components=n_components,
                                          covariance_type=cvtype, random_state=42)
            gmm.fit(X)
            bic = gmm.bic(X)
            if bic < lowest_bic:  # save the model with the lowest BIC so far
                lowest_bic = bic
                best_gmm = gmm

    print(lowest_bic, best_gmm)
    

In [91]:
def cluster_gmm(train_text, test_text, test_label):
    tfidf_vect = TfidfVectorizer(stop_words="english", min_df=6)
    dtm = tfidf_vect.fit_transform(train_text)
    test_dtm = tfidf_vect.transform(test_text)

    # 4 components with diagonal covariance; n_init=20 keeps the best of 20 initializations
    gmm = mixture.GaussianMixture(covariance_type='diag', n_init=20,
                                  n_components=4, random_state=42)
    gmm.fit(dtm.toarray())
    predicted = gmm.predict(test_dtm.toarray())

    # Majority vote: map each component ID to its most frequent true label
    new_p = pd.DataFrame({'Original': test_label, 'Predicted': predicted})
    new_p1 = new_p.groupby('Predicted')['Original'].apply(lambda x: x.value_counts().idxmax())
    cluster_d = new_p1.to_dict()  # keyed by component ID, so missing IDs cannot break the lookup
    predicted_target = [cluster_d[i] for i in predicted]
    print(metrics.classification_report(test_label, predicted_target))
    
    

In [92]:
cluster_gmm(train_text, test_text, test_label)

              precision    recall  f1-score   support

           1       0.73      0.69      0.71       332
           2       0.56      0.86      0.68       314
           3       0.79      0.73      0.76       355
           7       0.83      0.48      0.61       273

    accuracy                           0.70      1274
   macro avg       0.73      0.69      0.69      1274
weighted avg       0.73      0.70      0.69      1274



## Q3: Clustering by LDA 

In this task, you'll redo the clustering using LDA. Call this function `cluster_lda(train_text, test_text, test_label)`.

Since LDA returns a topic mixture for each document, assign the topic with the highest probability to each test document, and then measure the performance as in Q1.

In addition, within the function, please print out the top 30 words for each topic.

Finally, please analyze the following:
- Based on the top words of each topic, could you assign a meaningful name to each topic?
- Although the test subset shows there are 4 clusters, without this information, how do you choose the number of topics? 
- Does your LDA model achieve better performance than KMeans or GMM?

In [14]:
def cluster_lda(train_text, test_text, test_label):
    # LDA models raw term counts, so use CountVectorizer rather than TF-IDF
    count_vect = CountVectorizer(stop_words="english", min_df=4)
    dtm = count_vect.fit_transform(train_text)
    test_dtm = count_vect.transform(test_text)
    tf_feature_names = count_vect.get_feature_names_out()
    num_cluster = 4

    # scikit-learn LDA, used here to print the top words of each topic
    lda = LatentDirichletAllocation(n_components=num_cluster,
                                    max_iter=40, verbose=1,
                                    evaluate_every=1, n_jobs=1,
                                    random_state=1).fit(dtm)
    top_words = 30  # the assignment asks for the top 30 words per topic
    for topic_idx, topic in enumerate(lda.components_):
        print('Topic %d :' % topic_idx)
        words = [(tf_feature_names[i], '%.2f' % topic[i])
                 for i in topic.argsort()[::-1][0:top_words]]
        print(words)
        print("\n")

    # gensim LDA, trained separately on the same counts and used for prediction
    id2word = dict(enumerate(tf_feature_names))
    corpus = gensim.matutils.Sparse2Corpus(dtm, documents_columns=False)
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_cluster,
                                               id2word=id2word, iterations=28)
    test_corpus = gensim.matutils.Sparse2Corpus(test_dtm, documents_columns=False)

    # Assign each test document the topic with the highest probability
    predicted = [max(doc_topics, key=lambda item: item[1])[0]
                 for doc_topics in ldamodel.get_document_topics(test_corpus)]

    # Majority vote: map each topic ID to its most frequent true label
    new_p = pd.DataFrame({'Original': test_label, 'Predicted': predicted})
    new_p1 = new_p.groupby('Predicted')['Original'].apply(lambda x: x.value_counts().idxmax())
    cluster_d = new_p1.to_dict()  # keyed by topic ID, so missing IDs cannot break the lookup
    predicted_target = [cluster_d[i] for i in predicted]
    print(metrics.classification_report(test_label, predicted_target))

In [15]:
cluster_lda(train_text, test_text, test_label)

iteration: 1 of max_iter: 40, perplexity: 3744.8238
iteration: 2 of max_iter: 40, perplexity: 3495.1870
iteration: 3 of max_iter: 40, perplexity: 3306.9531
iteration: 4 of max_iter: 40, perplexity: 3160.2251
iteration: 5 of max_iter: 40, perplexity: 3050.6756
iteration: 6 of max_iter: 40, perplexity: 2964.8248
iteration: 7 of max_iter: 40, perplexity: 2904.6078
iteration: 8 of max_iter: 40, perplexity: 2862.7297
iteration: 9 of max_iter: 40, perplexity: 2834.3298
iteration: 10 of max_iter: 40, perplexity: 2814.5958
iteration: 11 of max_iter: 40, perplexity: 2800.6554
iteration: 12 of max_iter: 40, perplexity: 2789.3890
iteration: 13 of max_iter: 40, perplexity: 2780.1794
iteration: 14 of max_iter: 40, perplexity: 2772.7972
iteration: 15 of max_iter: 40, perplexity: 2766.3616
iteration: 16 of max_iter: 40, perplexity: 2761.0165
iteration: 17 of max_iter: 40, perplexity: 2756.3925
iteration: 18 of max_iter: 40, perplexity: 2752.1355
iteration: 19 of max_iter: 40, perplexity: 2748.0342
it



              precision    recall  f1-score   support

           1       0.45      0.58      0.51       332
           2       0.53      0.32      0.40       314
           3       0.54      0.40      0.46       355
           7       0.53      0.76      0.62       273

    accuracy                           0.50      1274
   macro avg       0.51      0.51      0.50      1274
weighted avg       0.51      0.50      0.49      1274



In [5]:
cluster_lda(train_text, test_text, test_label)



iteration: 1 of max_iter: 40, perplexity: 3747.0206
iteration: 2 of max_iter: 40, perplexity: 3492.8285
iteration: 3 of max_iter: 40, perplexity: 3294.2049
iteration: 4 of max_iter: 40, perplexity: 3142.3757
iteration: 5 of max_iter: 40, perplexity: 3035.2926
iteration: 6 of max_iter: 40, perplexity: 2960.7164
iteration: 7 of max_iter: 40, perplexity: 2905.9848
iteration: 8 of max_iter: 40, perplexity: 2861.7826
iteration: 9 of max_iter: 40, perplexity: 2826.3852
iteration: 10 of max_iter: 40, perplexity: 2798.3178
iteration: 11 of max_iter: 40, perplexity: 2776.2680
iteration: 12 of max_iter: 40, perplexity: 2758.8788
iteration: 13 of max_iter: 40, perplexity: 2745.1875
iteration: 14 of max_iter: 40, perplexity: 2734.3637
iteration: 15 of max_iter: 40, perplexity: 2725.2595
iteration: 16 of max_iter: 40, perplexity: 2717.2603
iteration: 17 of max_iter: 40, perplexity: 2710.9113
iteration: 18 of max_iter: 40, perplexity: 2705.7915
iteration: 19 of max_iter: 40, perplexity: 2701.2028
it