# Assignment 5: Clustering and Topic Modeling

In this assignment, you'll need to use the following dataset:
- text_train.json: This file contains a list of documents. It's used for training models
- text_test.json: This file contains a list of documents and their ground-truth labels. It's used for testing performance. This file is in the format shown below. Note, each document has a list of labels.
You can load these files using json.load()

|Text| Labels|
|----|-------|
|paraglider collides with hot air balloon ... | ['Disaster and Accident', 'Travel & Transportation']|
|faa issues fire warning for lithium ... | ['Travel & Transportation'] |
| .... |...|

## Q1: K-Mean Clustering

Define a function **cluster_kmean()** as follows: 
- Take two file name strings as inputs: $train\_file$ is the file path of text_train.json, and $test\_file$ is the file path of text_test.json
- When generating tfidf weights, set the min_df to 5.
- Use **KMeans** to cluster documents in $train\_file$ into 3 clusters by **cosine similarity**  and **Euclidean distance** separately. Use sufficient iterations with different initial centroids to make sure clustering converge 
- Test the clustering model performance using $test\_file$: 
  * Predict the cluster ID for each document in $test\_file$.
  * Let's only use the **first label** in the ground-truth label list of each test document, e.g. for the first document in the table above, you set the ground_truth label to "Disaster and Accident" only.
  * Apply **majority vote** rule to dynamically map the predicted cluster IDs to the ground-truth labels in $test\_file$. **Be sure not to hardcode the mapping** (e.g. write code like {0: "Disaster and Accident"}), because a  cluster may corrspond to a different topic in each run. (hint: if you use pandas, look for "idxmax" function) 
  * Calculate **precision/recall/f-score** for each label, compare the results from the two clustering models, and write your analysis in a pdf file 
- This function has no return. Print out confusion matrix, precision/recall/f-score. 

## Q2: LDA Clustering 

Q2.1. Define a function **cluster_lda()** as follows: 
1. Take two file name strings as inputs: $train\_file$ is the file path of text_train.json, and $test\_file$ is the file path of text_test.json
2. Use **LDA** to train a topic model with documents in $train\_file$ and the number of topics $K$ = 3. Keep min_df to 5 when generating tfidf weights, as in Q1.  
3. Predict the topic distribution of each document in  $test\_file$ and select the topic with highest probability. Similar to Q1, apply **majority vote rule** to map the topics to the labels and show the classification report. 
4. Return the array of topic proportion array

Q2.2. Find similar documents
- Define a function **find_similar_doc(doc_id, topic_mix)** to find **top 3 documents** that are the most similar to a selected one with index **doc_id** using the topic proportion array **topic_mix**. 
- You can calculate the cosine or Euclidean distance between two documents using the topic proportion array
- Return the IDs of these similar documents.

Q2.3. Provide a pdf document which contains: 
  - performance comparison between Q1 and Q2.1
  - describe how you tune the model parameters, e.g. alpha, max_iter etc. in Q2.1.
  - discuss how effective the method in Q2.2 is to find similar documents, compared with the tfidf weight cosine similarity we used before.

## Q3 (Bonus): Biterm Topic Model (BTM)
- There are many variants of LDA model. BTM is one designed for short text, while lDA in general expects documents with rich content.
- Read this paper carefully http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4032&rep=rep1&type=pdf and try to understand the design
- Try the following experiments:
    - Script a few thousand tweets by different hastags
    - Run LDA and BTM respectively to discover topics among the collected tweets. BTM package can be found at https://pypi.org/project/biterm/
    - Compare the performance of each model. If one model works better, explain why it works better,
- Summarize your experiment in a pdf document.
- Note there is no absolute right or wrong answer in this experiment. All you need is to give a try and understand how BTM works and differences between BTM and LDA

**Note: Due to randomness involved in these alogorithms, you may get the same result as what I showed below. However, your result should be close after you tune parameters carefully.**

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.cluster import KMeansClusterer, cosine_distance
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

import pandas as pd
from sklearn import metrics
import numpy as np
import json, time
from matplotlib import pyplot as plt
from sklearn.preprocessing import MultiLabelBinarizer

In [2]:
#Q1
def cluster_kmean(train_file, test_file):
    train=json.load(open(train_file,'r'))
    test=json.load(open(test_file,'r'))
    test_text, labels = zip(*test)
    first_label=[item[0] for item in labels]
    
    tfidf_vect = TfidfVectorizer(stop_words="english",\
                             min_df=5) 
    dtm_train= tfidf_vect.fit_transform(train)
    dtm_test= tfidf_vect.transform(test_text)
    
    num_clusters=3

    clusterer = KMeansClusterer(num_clusters, \
                            cosine_distance, \
                            repeats=20)

    clusters = clusterer.cluster(dtm_train.toarray(), \
                             assign_clusters=True)
    
    predict = [clusterer.classify(v) for v in dtm_test.toarray()]
    
    df=pd.DataFrame(list(zip(first_label, predict)), \
                columns=['actual_class','cluster'])
 
    confusion = pd.crosstab( index=df.cluster, columns=df.actual_class)
    print(confusion)
    
    mapping = confusion.idxmax(axis=1)
    for idx, t in enumerate(mapping):
        print("Cluster {}: Topic {}".format(idx, t))
    
    predicted_target=[mapping[i] for i in predict]

    print(metrics.classification_report(first_label, predicted_target))
    
    ###Euclidean
    
    num_clusters=3

    Euclidean_km = KMeans(n_clusters=num_clusters, n_init=20).fit(dtm_test)
    

    clusters = Euclidean_km.labels_.tolist()
    
    predicted2 = Euclidean_km.predict(dtm_test)

    confusion_df2 = pd.DataFrame(list(zip(first_label, predicted2)), \
                columns=['actual_class','cluster'])
    #confusion_df2.head()

# generate crosstab between clusters and true labels
    confusion_df3 = pd.crosstab( index=confusion_df2.cluster, columns=confusion_df2.actual_class)
    
    
    mapping = confusion_df3.idxmax(axis=1)
    for idx, t in enumerate(mapping):
        print("Cluster {}: Topic {}".format(idx, t))
    
    predicted_target=[mapping[i] for i in predicted2]

    print(metrics.classification_report(first_label, predicted_target))

In [3]:
if __name__ == "__main__":  
    
    # Due to randomness, you won't get the exact result
    # as shown here, but your result should be close
    # if you tune the parameters carefully
    
    # Q1
    cluster_kmean('train_text.json', \
                  'test_text.json')
            

actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster                                                                       
0                                87                 1                      132
1                                83                 9                       36
2                                40               196                       16
Cluster 0: Topic Travel & Transportation
Cluster 1: Topic Disaster and Accident
Cluster 2: Topic News and Economy
                         precision    recall  f1-score   support

  Disaster and Accident       0.65      0.40      0.49       210
       News and Economy       0.78      0.95      0.86       206
Travel & Transportation       0.60      0.72      0.65       184

               accuracy                           0.69       600
              macro avg       0.68      0.69      0.67       600
           weighted avg       0.68      0.69      0.67       600

Cluster 0: Topic Disaster and Acc

In [4]:
#Q2
def cluster_lda(train_file, test_file):
    train=json.load(open(train_file,'r'))
    test=json.load(open(test_file,'r'))
    test_text, labels=zip(*test)
    first_label=[item[0] for item in labels]
    
    tfidf_vect = CountVectorizer(min_df=5, stop_words='english')
    
    dtm_train= tfidf_vect.fit_transform(train)
    dtm_test= tfidf_vect.transform(test_text)
 
    num_clusters=3

    lda = LatentDirichletAllocation(n_components=num_clusters, learning_method='batch',\
                                max_iter=25,verbose=1, n_jobs=1,
                                random_state=0).fit(dtm_train)
    
    topic_assign=lda.transform(dtm_test)
    
    predict=topic_assign.argmax(axis=1)
    
    df=pd.DataFrame(list(zip(first_label, predict)), \
                columns=['actual_class','cluster'])

    confusion = pd.crosstab( index=df.cluster, \
                            columns=df.actual_class)
    print(confusion.head())
    mapping = confusion.idxmax(axis=1)
    for idx, t in enumerate(mapping):
        print("Cluster {}: Topic {}".format(idx, t))
    
    predicted_target=[mapping[i] for i in predict]

    print(metrics.classification_report(first_label, \
                                        predicted_target))

    return topic_assign, labels

In [5]:
if __name__ == "__main__":  
    
    # Due to randomness, you won't get the exact result
    # as shown here, but your result should be close
    # if you tune the parameters carefully
    
            
    # Q2
    print("\nQ2")
    topic_assign =cluster_lda('train_text.json', \
        'test_text.json')


Q2
iteration: 1 of max_iter: 25
iteration: 2 of max_iter: 25
iteration: 3 of max_iter: 25
iteration: 4 of max_iter: 25
iteration: 5 of max_iter: 25
iteration: 6 of max_iter: 25
iteration: 7 of max_iter: 25
iteration: 8 of max_iter: 25
iteration: 9 of max_iter: 25
iteration: 10 of max_iter: 25
iteration: 11 of max_iter: 25
iteration: 12 of max_iter: 25
iteration: 13 of max_iter: 25
iteration: 14 of max_iter: 25
iteration: 15 of max_iter: 25
iteration: 16 of max_iter: 25
iteration: 17 of max_iter: 25
iteration: 18 of max_iter: 25
iteration: 19 of max_iter: 25
iteration: 20 of max_iter: 25
iteration: 21 of max_iter: 25
iteration: 22 of max_iter: 25
iteration: 23 of max_iter: 25
iteration: 24 of max_iter: 25
iteration: 25 of max_iter: 25
actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster                                                                       
0                                30                18                      138
1                 

In [None]:
def cluster_lda(train_file, test_file):
    train=json.load(open(train_file,'r'))
    test=json.load(open(test_file,'r'))
    test_text, labels=zip(*test)
    first_label=[item[0] for item in labels]
    
    tfidf_vect = CountVectorizer(min_df=5, stop_words='english')
    
    dtm_train= tfidf_vect.fit_transform(train)
    dtm_test= tfidf_vect.transform(test_text)
 
    num_clusters=3

    lda = LatentDirichletAllocation(n_components=num_clusters, learning_method='batch',\
                                max_iter=25,verbose=1, n_jobs=1,
                                random_state=0).fit(dtm_train)
    
    topic_assign=lda.transform(dtm_test)
    
    predict=topic_assign.argmax(axis=1)
    
    df=pd.DataFrame(list(zip(first_label, predict)), \
                columns=['actual_class','cluster'])

    confusion = pd.crosstab( index=df.cluster, \
                            columns=df.actual_class)
    print(confusion.head())
    mapping = confusion.idxmax(axis=1)
    for idx, t in enumerate(mapping):
        print("Cluster {}: Topic {}".format(idx, t))
    
    predicted_target=[mapping[i] for i in predict]

    print(metrics.classification_report(first_label, \
                                        predicted_target))

    return topic_assign, labels





def find_similar_doc(doc_id,topic_assign):
    
    
    docs_tokens={idx:tokenize(doc) \
             for idx,doc in enumerate(docs)}

    # step 3. get document-term matrix
    dtm=pd.DataFrame.from_dict(docs_tokens, orient="index" )
    dtm=dtm.fillna(0)
    
    # step 4. get normalized term frequency (tf) matrix        
    tf=dtm.values
    doc_len=tf.sum(axis=1)
    print(doc_len)
    tf=np.divide(tf.T, doc_len).T
    
    # step 5. get idf
    df=np.where(tf>0,1,0)
    #idf=np.log(np.divide(len(docs), \
    #    np.sum(df, axis=0)))+1

    smoothed_idf=np.log(np.divide(len(docs)+1, np.sum(df, axis=0)+1))+1    
    smoothed_tf_idf=tf*smoothed_idf
    
   
    Simalarity =1-spatial.distance.cosine(topic_assign[10],topic_assign[i])
    
    # the best score is always at the diagnal. a doc is similar to itself with score =1
    # return the second largest
    top_sim_index=np.argsort(Simalarity[doc_id])[::-1][1]
    
    return  top_sim_index, Simalarity[doc_id,top_sim_index]

In [11]:
# Q2
def cluster_lda(train_file, test_file):
    
    topic_assign = None
    
    # add your code here
    
    return topic_assign

def find_similar(doc_id, topic_assign):
    
    docs = None
    
    # add your code here
    
    return docs

In [12]:
if __name__ == "__main__":  
    
    # Due to randomness, you won't get the exact result
    # as shown here, but your result should be close
    # if you tune the parameters carefully
    
    # Q1
    print("Q1")
    cluster_kmean('../../dataset/train_text.json', \
                  '../../dataset/test_text.json')
            
    # Q2
    print("\nQ2")
    topic_assign =cluster_lda('../../dataset/train_text.json', \
        '../../dataset/test_text.json')
    doc_ids = find_similar(10, topic_assign)
    print ("docs similar to {0}: {1}".format(10, doc_ids))

Q1
cosine
actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster                                                                       
0                                61                 2                      152
1                               109                 7                       25
2                                40               197                        7
Cluster 0: Topic Travel & Transportation
Cluster 1: Topic Disaster and Accident
Cluster 2: Topic News and Economy
                         precision    recall  f1-score   support

  Disaster and Accident       0.77      0.52      0.62       210
       News and Economy       0.81      0.96      0.88       206
Travel & Transportation       0.71      0.83      0.76       184

              micro avg       0.76      0.76      0.76       600
              macro avg       0.76      0.77      0.75       600
           weighted avg       0.76      0.76      0.75       600

L2
actual_class  Disast

  'precision', 'predicted', average, warn_for)


iteration: 1 of max_iter: 25
iteration: 2 of max_iter: 25
iteration: 3 of max_iter: 25
iteration: 4 of max_iter: 25
iteration: 5 of max_iter: 25, perplexity: 3494.8408
iteration: 6 of max_iter: 25
iteration: 7 of max_iter: 25
iteration: 8 of max_iter: 25
iteration: 9 of max_iter: 25
iteration: 10 of max_iter: 25, perplexity: 3416.5917
iteration: 11 of max_iter: 25
iteration: 12 of max_iter: 25
iteration: 13 of max_iter: 25
iteration: 14 of max_iter: 25
iteration: 15 of max_iter: 25, perplexity: 3382.7160
iteration: 16 of max_iter: 25
iteration: 17 of max_iter: 25
iteration: 18 of max_iter: 25
iteration: 19 of max_iter: 25
iteration: 20 of max_iter: 25, perplexity: 3377.7126
iteration: 21 of max_iter: 25
iteration: 22 of max_iter: 25
iteration: 23 of max_iter: 25
iteration: 24 of max_iter: 25
iteration: 25 of max_iter: 25, perplexity: 3375.9923
actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster                                                          