# Clustering themes approach

We'll try to cluster the themes and then fine-tune a QA model for each of the clusters.
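
A minimal sketch of the routing this implies, assuming a theme-to-cluster mapping already exists and that cluster id `-1` (or an unseen theme) falls back to a global model. The checkpoint names and the `pick_qa_model` helper are hypothetical placeholders, not part of this notebook.

```python
# Hypothetical routing table: theme -> cluster id (-1 means "no cluster").
theme_to_cluster = {"Beyoncé": 0, "Kanye_West": 0, "Dell": 3}

# Hypothetical checkpoint names, one per cluster, plus a global fallback.
cluster_to_model = {0: "qa-cluster-0", 3: "qa-cluster-3"}
GLOBAL_MODEL = "qa-global"


def pick_qa_model(theme: str) -> str:
    """Return the checkpoint name to use for a theme."""
    cluster_id = theme_to_cluster.get(theme, -1)  # unseen themes behave like -1
    return cluster_to_model.get(cluster_id, GLOBAL_MODEL)


print(pick_qa_model("Dell"))
print(pick_qa_model("Unseen_theme"))
```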

## Results
> **Birch Clustering performs the best of all the models tested during experimentation**

- We read through the following [relevant survey](https://link.springer.com/article/10.1007/s40745-015-0040-1).
- We select and experiment with only those algorithms that are suitable for our use case.
- Quite a few (particularly the older ones) do not have Python implementations that are compatible with BERTopic.
- The lack of a predict() method in many clustering algorithms immediately disqualifies them for our task. Without the ability to predict which cluster a new theme belongs to, the clustering cannot be used for new themes during the testing phase.
- For instance, Agglomerative and MST Clustering are simple but cannot predict the cluster of new datapoints (without recomputing the whole clustering) and are hence not practical for our task.
- KMeans is a simple partitioning algorithm, but it falls behind BIRCH in the quality of the clusters.
- HDBSCAN was tried with a variety of parameters, but it always left 75+ outliers, which is far too many to be effective.
- BIRCH ends up being the best choice: its tuning hyperparameters are very intuitive, which makes it quite easy to get favourable results.
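
The predict() criterion above can be checked mechanically. The snippet below is a small sketch, limited to three scikit-learn estimators, that tests whether each one exposes a `predict` method; the `candidates` dictionary and its parameter values are illustrative assumptions.

```python
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans

# Illustrative candidates; the parameter values here are arbitrary.
candidates = {
    "BIRCH": Birch(n_clusters=None),
    "KMeans": KMeans(n_clusters=2),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
}

# An estimator can label unseen points only if it exposes predict().
supports_predict = {
    name: hasattr(model, "predict") for name, model in candidates.items()
}

for name, ok in supports_predict.items():
    print(f"{name}: predict() available = {ok}")
```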

## Loading dataset


In [None]:
import gdown

def download_test_data(round = 1):
    """Download the test data (4 csv files)"""
    assert round in [1,2], "round can be 1 or 2"
    ids = [
        [
            "15WPYOD3ZLShFq_NRtiBHbpz3RTvc8ZWR",
            "15yxIF27NvEa3l12yNy6F5h8lGCJ2n7rf",
            "1Ilpxyj_0T-1KzQMdVSEbSmc1ybxOv69G",
            "1nkEDQZJY6_cAEVw3JlaKCgz0C6mDSYiv"
        ],
        [
            "1-3fMldkBVsTAX3W5JewdAdlUG_agexG0",
            "1-59pQe8TH7UaORF1RSqzFWybMJShdf1U",
            "1-AbnJRRHQiTU5zyUdDC2gUwbIGkEF5l6",
            "1-Px6FFj043L7lbAEBOAMSy2bdoPiVNhy"
        ]
    ]
    for id in ids[round-1]:
        url = f"https://drive.google.com/u/1/uc?id={id}&export=download"
        gdown.download(url, quiet=True)

In [None]:
download_test_data(round=2)

In [None]:
import pandas as pd
paragraphs = pd.read_csv('input_paragraph.csv')
print(type(paragraphs))
paragraphs.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,id,paragraph,theme
0,1,In The New Yorker music critic Jody Rosen desc...,Beyoncé
1,2,Beyoncé's second solo album B'Day was released...,Beyoncé
2,3,"In July 2002, Beyoncé continued her acting car...",Beyoncé
3,4,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,Beyoncé
4,5,Forbes magazine began reporting on Beyoncé's e...,Beyoncé


In [None]:
uniq_theme_list = list(paragraphs['theme'].unique())
para_list = list(paragraphs['paragraph'])

print(f"We have {len(para_list)} paragraphs in total & those belong to {len(uniq_theme_list)} unique themes")

We have 13481 paragraphs in total & those belong to 361 unique themes


#### Concatenate all the paragraphs of a theme into one document

In [None]:
para_list = [' '.join(paragraphs[paragraphs['theme']==theme]['paragraph']) for theme in uniq_theme_list]
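
The same per-theme concatenation can also be written with a pandas `groupby`. This is a sketch on a toy frame (the `toy` data is made up); `sort=False` keeps themes in order of first appearance, which matches the order returned by `unique()`.

```python
import pandas as pd

# Toy stand-in for the paragraphs DataFrame.
toy = pd.DataFrame(
    {
        "theme": ["A", "B", "A", "B"],
        "paragraph": ["a1", "b1", "a2", "b2"],
    }
)

# One space-joined document per theme, in order of first appearance.
docs = toy.groupby("theme", sort=False)["paragraph"].agg(" ".join)

theme_docs = docs.tolist()
print(theme_docs)  # ['a1 a2', 'b1 b2']
```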

## BERTopic Model

### Setting up helper functions

In [None]:
def generate_embeddings(embedding_model, para_list):
  print("Generating the document embeddings...")
  para_embeddings = embedding_model(para_list)
  para_embeddings = para_embeddings.numpy()
  return para_embeddings

def fit_bertopic_model(topic_model, para_embeddings):
  """Fit BERTopic on the module-level para_list using precomputed embeddings."""
  print("Fitting the model using the paras & their embeddings")
  # para_embeddings must already be a numpy array (see generate_embeddings)
  topic_model.fit(documents = para_list, embeddings = para_embeddings) # Can use pre-trained embeddings directly

def perform_cluster_prediction(topic_model, para_list, para_embeddings):
  """Predict a cluster for each document. Any error from transform() propagates to the caller."""
  # para_embeddings is expected to be a numpy array already
  topics, scores = topic_model.transform(para_list, para_embeddings) # Predicting the documents clusters
  return topics, scores

def get_max_cluster_id(theme_cluster_count_dict):
  """
  Input: theme_cluster_count_dict : A dictionary with cluster_id as keys and their # of occurences as the value
  """
  max_count = 0
  cid = -1
  for key in theme_cluster_count_dict:
    if (theme_cluster_count_dict[key] > max_count):
      max_count = theme_cluster_count_dict[key]
      cid = key
  return cid

def get_cid_to_list_topics(mydic):
  """
  Input: mydic: Dict mapping theme --> cluster id
  Returns a dict with cluster id as key and the value is a list of themes associated to it.
  """
  cluster_id_to_topic = {}
  for topic, cluster_id in mydic.items():
      if cluster_id not in cluster_id_to_topic:
          cluster_id_to_topic[cluster_id] = [topic]
      else:
          cluster_id_to_topic[cluster_id].append(topic)
  return cluster_id_to_topic

def get_avg_themes_per_cluster(topic_model):
  data = topic_model.get_topic_info()
  avg = data[data['Topic']!=-1]['Count'].mean()
  return avg

def generate_theme_to_cluster_mapping(paragraphs, topics):
  """
  Input: paragraphs: DataFrame with a 'theme' column
         topics: one cluster id per theme, in the order of paragraphs['theme'].unique()
  Returns a dict mapping theme --> cluster id.
  """
  global_idx = 0 # Index of the current theme in topics

  theme_to_cluster_mapping = {} # This will hold the final theme to cluster mapping

  uniq_theme_list_df = paragraphs['theme'].unique()

  for theme in uniq_theme_list_df:

    theme_df = paragraphs[paragraphs['theme']==theme] # Getting the part of df with the curr theme
    theme_cluster_count_dict = {} # To store the count of each cluster_id the theme was mapped to. '-1' cluster_id indicates that we'll be using the global model.

    for i in range(len(theme_df)):
      curr_cid = topics[global_idx]
      if theme_cluster_count_dict.get(curr_cid) is None:
        theme_cluster_count_dict[curr_cid] = 0
      theme_cluster_count_dict[curr_cid] += 1 # Incrementing the count
    global_idx += 1 # topics holds one entry per theme, so advance once per theme
    mode_cid = get_max_cluster_id(theme_cluster_count_dict)
    theme_to_cluster_mapping[theme] = mode_cid

  return theme_to_cluster_mapping
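
For comparison, the mode lookup and the mapping inversion above can be expressed with the standard library. This is only a sketch: `most_common_cluster` and `invert_mapping` are hypothetical names, and ties resolve to the first cluster id encountered, as with the strict `>` comparison in `get_max_cluster_id`.

```python
from collections import Counter, defaultdict


def most_common_cluster(cluster_ids):
    """Return the most frequent cluster id, or -1 for an empty input."""
    counts = Counter(cluster_ids)
    return counts.most_common(1)[0][0] if counts else -1


def invert_mapping(theme_to_cluster):
    """Turn theme -> cluster id into cluster id -> list of themes."""
    cluster_to_themes = defaultdict(list)
    for theme, cid in theme_to_cluster.items():
        cluster_to_themes[cid].append(theme)
    return dict(cluster_to_themes)


print(most_common_cluster([2, 0, 2, 5]))
print(invert_mapping({"Dell": 3, "IBM": 3, "Boston": 4}))
```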

### Setting up experimentation algorithms

In [None]:
%%capture
!pip install bertopic

In [None]:
!pip install pyclustering
!pip install mst_clustering

In [None]:
#@title KMeans Clustering Parameters
from sklearn.cluster import KMeans
n_clusters=35                                  #@param {type:'number'}
kmeans_model = KMeans(n_clusters=n_clusters)

In [None]:
#@title MST Clustering Parameters
from mst_clustering import MSTClustering
cutoff_s = 0.5                                  #@param {type:"number"}
approx = False                                 #@param ["False", "True"] {type:"raw"}
mst_model = MSTClustering(cutoff_scale=cutoff_s, approximate=approx)

In [None]:
#@title Agglomerative Clustering Parameters
from sklearn.cluster import AgglomerativeClustering
n_clusters = 40                       #@param {type:"number"}
agglo_model = AgglomerativeClustering(n_clusters=n_clusters)

In [None]:
#@title HDBSCAN
from hdbscan import HDBSCAN

min_cluster_size = 3                  #@param {type:"slider",min:1,max:20,step:1}
metric = 'euclidean'                  #@param ['euclidean','l2','p','wminkowski']
csm = 'leaf'                          #@param ["eom", "leaf"]
min_samples = 10                      #@param {type:"slider", min:5, max:30, step:1}
prediction_data=True                  #@param  ["False", "True"] {type:"raw"}
hdbc_model = HDBSCAN(min_cluster_size=min_cluster_size,
                     metric=metric,
                     cluster_selection_method=csm,
                     min_samples=min_samples,
                     prediction_data=prediction_data)

In [None]:
#@title Birch Clustering
from sklearn.cluster import Birch

branching_f = 50                      #@param {type:'number'}
thresh=0.4                            #@param {type:'number'}

brc_model = Birch(branching_factor=branching_f, n_clusters=None, threshold=thresh)
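
To illustrate how `threshold` steers BIRCH, the sketch below fits two models on synthetic blobs, which are an assumption standing in for the reduced embeddings. A smaller threshold should give more, tighter subclusters. It also shows that a fitted `Birch` model can label points it has not seen.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic stand-in for the reduced embeddings (illustration only).
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.5, random_state=0)

n_subclusters = {}
for threshold in (0.2, 5.0):
    model = Birch(branching_factor=50, n_clusters=None, threshold=threshold).fit(X)
    n_subclusters[threshold] = len(model.subcluster_centers_)
    print(f"threshold={threshold}: {n_subclusters[threshold]} subclusters")

# Label slightly perturbed points with the last fitted model.
new_labels = model.predict(X[:3] + 0.01)
print(new_labels)
```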

### Main Loop

In [None]:
from bertopic import BERTopic
import tensorflow_hub
from umap import UMAP
import copy

In [None]:
#@title Hyperparameter Tuning
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english")
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
para_embeddings = generate_embeddings(embedding_model, para_list) 


umap_model = UMAP(n_neighbors=15, n_components=6, min_dist=0.0, metric='cosine')

cluster_dict = {'BIRCH': brc_model,'HDBSCAN': hdbc_model,'Agglomerative':agglo_model,'MST':mst_model,'KMeans':kmeans_model}
cluster_model_name = 'BIRCH' #@param ['BIRCH','HDBSCAN','Agglomerative','MST','KMeans']
cluster_model = cluster_dict[cluster_model_name]

topic_model = BERTopic(        
        low_memory = True,
        umap_model = umap_model,
        hdbscan_model = cluster_model,
        embedding_model=embedding_model,
        vectorizer_model = vectorizer_model, 
        calculate_probabilities=False, 
        verbose=True, 
        # nr_topics = 'auto'
        nr_topics=int(0.3*len(uniq_theme_list))
        )  
fit_bertopic_model(topic_model, para_embeddings)
try:
  topics, scores = perform_cluster_prediction(topic_model, para_list, para_embeddings)
  theme_to_cluster_mapping = generate_theme_to_cluster_mapping(paragraphs, topics)
  cid_to_themes_mapping = get_cid_to_list_topics(theme_to_cluster_mapping)
except Exception:
  print('No predict() method available, skipping generation of cid_to_theme map')
avg_themes_per_cluster = get_avg_themes_per_cluster(topic_model)
topic_info = topic_model.get_topic_info()
unclustered = topic_info[topic_info['Topic']==-1]['Count']
if unclustered.empty:
  unclustered = 0
  num_clusters = len(topic_info['Topic'])
else:
  unclustered = int(unclustered.iloc[0]) # Single -1 row, so take its count as a scalar
  num_clusters = len(topic_info['Topic'])-1
print('--------------------------------------------------------------------------------------------------------------------------------------------')
print(f"{cluster_model_name} Clustering Algorithm Results:")
print(f"#Themes unclustered = {unclustered}\nAvg. #Themes / cluster = {avg_themes_per_cluster}\n# clusters = {num_clusters}")
  

Generating the document embeddings...
Fitting the model using the paras & their embeddings


2023-02-02 17:34:02,822 - BERTopic - Reduced dimensionality
2023-02-02 17:34:02,847 - BERTopic - Clustered reduced embeddings
Instructions for updating:
Use tf.identity instead.
2023-02-02 17:34:06,955 - BERTopic - Reduced number of topics from 36 to 36
2023-02-02 17:34:09,039 - BERTopic - Reduced dimensionality
2023-02-02 17:34:09,043 - BERTopic - Predicted clusters


--------------------------------------------------------------------------------------------------------------------------------------------
BIRCH Clustering Algorithm Results:
#Themes unclustered = 0
Avg. #Themes / cluster = 10.027777777777779
# clusters = 36
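
With no outliers, the average is simply the number of themes divided by the number of clusters (361 / 36 ≈ 10.03 in the run above). The sketch below reproduces the three summary statistics on a toy frame shaped like `get_topic_info()` output; the counts in it are made up for illustration.

```python
import pandas as pd

# Toy stand-in for topic_model.get_topic_info(); the counts are made up.
topic_info = pd.DataFrame({"Topic": [-1, 0, 1, 2], "Count": [4, 10, 6, 8]})

is_outlier = topic_info["Topic"] == -1
# Sum is 0 when there is no -1 row, so no special case is needed.
unclustered = int(topic_info.loc[is_outlier, "Count"].sum())
clustered = topic_info.loc[~is_outlier]
num_clusters = len(clustered)
avg_per_cluster = clustered["Count"].mean()

print(unclustered, num_clusters, avg_per_cluster)
```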


### Peek Cluster Titles

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,0,11,0_film_music_album_madonna
1,1,9,1_architecture_mosaic_mosaics_gothic
2,2,11,2_chinese_china_tibet_ming
3,3,12,3_windows_dell_game_apple
4,4,26,4_city_new_york_area
5,5,7,5_john_burke_victoria_assent
6,6,17,6_uranium_light_energy_copper
7,7,14,7_philosophy_quran_hayek_whitehead
8,8,18,8_species_bacteria_plants_genes
9,9,5,9_education_universities_university_schools


## Visualizing the clustered paragraphs

### As a raw dictionary 
WARNING: Requires a clustering model that provides the predict() method; otherwise `cid_to_themes_mapping` is never created.

In [None]:
cid_to_themes_mapping

{0: ['Beyoncé',
  'Spectre_(2015_film)',
  'Kanye_West',
  'American_Idol',
  'Sony_Music_Entertainment',
  'Universal_Studios',
  'House_music',
  'Queen_(band)',
  'Madonna_(entertainer)',
  'Turner_Classic_Movies',
  'Steven_Spielberg'],
 24: ['Frédéric_Chopin', 'Classical_music', 'A_cappella', 'Mandolin'],
 2: ['Sino-Tibetan_relations_during_the_Ming_dynasty',
  '2008_Summer_Olympics_torch_relay',
  'Zhejiang',
  'Umayyad_Caliphate',
  'Southeast_Asia',
  'Myanmar',
  'Sichuan',
  'History_of_India',
  'Iran',
  'Qing_dynasty',
  'Tajikistan'],
 3: ['The_Legend_of_Zelda:_Twilight_Princess',
  'Computer_security',
  'Videoconferencing',
  'Xbox_360',
  'Macintosh',
  'Dell',
  'Nintendo_Entertainment_System',
  'Digimon',
  'PlayStation_3',
  'IBM',
  'Windows_8',
  'Super_Nintendo_Entertainment_System'],
 4: ['New_York_City',
  'Plymouth',
  'Oklahoma_City',
  'Boston',
  'National_Archives_and_Records_Administration',
  'List_of_numbered_streets_in_Manhattan',
  'Atlantic_City,_Ne

### As a barchart

In [None]:
topic_model.visualize_barchart(top_n_topics=10)

### Intertopic Distance Map

In [None]:
topic_model.visualize_topics()

### Hierarchical View

In [None]:
hierarchical_topics = topic_model.hierarchical_topics(para_list)
tree = topic_model.get_topic_tree(hierarchical_topics)
print(tree)

100%|██████████| 35/35 [00:00<00:00, 96.21it/s] 


.
├─city_war_new_state_states
│    ├─city_new_state_area_population
│    │    ├─■──cotton_ice_antarctica_antarctic_glacier ── Topic: 30
│    │    └─city_new_state_population_area
│    │         ├─city_new_area_state_population
│    │         │    ├─■──island_islands_alaska_tuvalu_ireland ── Topic: 19
│    │         │    └─city_new_area_population_largest
│    │         │         ├─city_new_area_york_population
│    │         │         │    ├─■──mexico_city_state_valencia_spanish ── Topic: 27
│    │         │         │    └─■──city_new_york_area_street ── Topic: 4
│    │         │         └─■──paris_switzerland_swiss_strasbourg_thuringia ── Topic: 29
│    │         └─chinese_india_china_tibet_dynasty
│    │              ├─■──kathmandu_hyderabad_delhi_india_portugal ── Topic: 23
│    │              └─■──chinese_china_tibet_ming_dynasty ── Topic: 2
│    └─war_army_states_government_law
│         ├─war_army_greek_soviet_empire
│         │    ├─■──nasser_soviet_tito_estonia_union ── Topic: 