# Clustering
We used clustering to create user and item clusters as a way to further experiment with classical recommendation systems such as ALS, SVD++, and NMF. Clustering improved computational efficiency by reducing our search space. Clustering was done with uniform manifold approximation and projection (UMAP) and scikit-learn. For users, we calculated feature scores and the average time of day of interaction, then applied the same dimension-reduction and clustering procedure used for items.

## Clustering News
For news clustering, we first vectorized the titles and abstracts with scikit-learn's TF-IDF and bag-of-words (BoW) vectorizers. We then reduced the vectors to two components with UMAP under both the Hellinger and Euclidean distance metrics, and clustered the resulting embeddings with HDBSCAN and k-means.
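
The vectorization step described above can be sketched with scikit-learn directly. This is a minimal illustration, not the contents of `cm.vectorize_items`: the toy documents, the choice to concatenate title and abstract, and the `stop_words` setting are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for the title and abstract columns (hypothetical data).
titles = ["50 Worst Habits For Belly Fat", "Dispose of unwanted prescription drugs"]
abstracts = ["These seemingly harmless habits are holding you back", ""]

# Assumption: one document per article, built from title plus abstract.
docs = [f"{t} {a}".strip() for t, a in zip(titles, abstracts)]

# Bag-of-words: raw term counts per document.
bow_matrix = CountVectorizer(stop_words="english").fit_transform(docs)

# TF-IDF: counts reweighted by inverse document frequency, L2-normalized per row.
tf_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

print(bow_matrix.shape, tf_matrix.shape)
```

Both matrices are sparse and non-negative, which matters for the Hellinger metric because it is only defined for non-negative inputs.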

In [2]:
import clustering_modules as cm
import pandas as pd
# Loading in the data for tf-idf and bag of words vectorization methods.
news_text = pd.read_csv('../MIND_large/csv/news.csv', index_col=0).set_index('news_id').drop(columns=['url','title_entities','abstract_entities'])
news_text.head()

| news_id | category | sub_category | title | abstract |
|---|---|---|---|---|
| N88753 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... |
| N23144 | health | weightloss | 50 Worst Habits For Belly Fat | These seemingly harmless habits are holding yo... |
| N86255 | health | medical | Dispose of unwanted prescription drugs during ... | |
| N93187 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... |
| N75236 | health | voices | I Was An NBA Wife. Here's How It Affected My M... | I felt like I was a fraud, and being an NBA wi... |


In [3]:
# Create our UMAP_embeddings for our vectorization types and distance metrics.
bow_matrix, tf_matrix = cm.vectorize_items(news_text)
bow_embeddings = [cm.create_UMAP_embeddings(2, bow_matrix, 'euclidean'), cm.create_UMAP_embeddings(2, bow_matrix)]
tf_embeddings = [cm.create_UMAP_embeddings(2, tf_matrix, 'euclidean'), cm.create_UMAP_embeddings(2, tf_matrix)]

Disconnection_distance = 1 has removed 198 edges.
It has only fully disconnected 2 vertices.
Use umap.utils.disconnected_vertices() to identify them.
  warn(


In [4]:
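# Apply the k-means and HDBSCAN clustering algorithms to our embeddings.
# embeddings is ordered [bow[0], bow[1], tf[0], tf[1]]; the index order
# [0, 2, 1, 3] groups the results by distance metric rather than by vectorizer.
embeddings = bow_embeddings + tf_embeddings
kmeans_labels = [cm.create_kmeans_labels(embeddings[index]) for index in [0, 2, 1, 3]]
hdbscan_labels = [cm.create_hdbscan_labels(embeddings[index]) for index in [0, 2, 1, 3]]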

ValueError: Input X contains NaN.
KMeans does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
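
The `ValueError` above is consistent with the UMAP warning printed in the previous cell: vertices that UMAP fully disconnects are, to our understanding, returned with NaN coordinates, which k-means rejects. One possible workaround, sketched below on synthetic data, is to cluster only the rows with finite coordinates and mark the rest as unassigned. The `-1` placeholder and the variable names are assumptions for illustration, not part of `clustering_modules`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a 2-D UMAP embedding with two well-separated groups.
rng = np.random.default_rng(42)
embedding = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])
embedding[[3, 27]] = np.nan  # simulate two rows UMAP could not place

# Keep only rows whose coordinates are all finite.
valid = np.isfinite(embedding).all(axis=1)

# Fit k-means on the valid rows and mark the dropped rows with -1.
labels = np.full(len(embedding), -1)
labels[valid] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(embedding[valid])
```

Using `-1` for the dropped rows mirrors the noise label HDBSCAN uses, so downstream code can treat both cases the same way.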

In [None]:
# Plot clustering results
cm.visualize_all_item_clusters(bow_embeddings, tf_embeddings, ['Euclidean', 'Hellinger'], hdbscan_labels, kmeans_labels, cmap='viridis')

The plots show that text vectorized with TF-IDF has a much wider spread and more clearly defined clusters after being projected into two dimensions. Because of this structure, we use the TF-IDF embeddings for clustering.

### Creating a user-item matrix
With the reduced-dimension embeddings created, we can move on to creating the user-item matrix.
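
As a rough illustration of where this is heading, the sketch below builds a user by item-cluster count matrix with pandas. It assumes an interaction log with `user_id` and `news_id` columns and one cluster label per article; the frame, the column names, and the choice of counts as matrix values are all hypothetical.

```python
import pandas as pd

# Hypothetical interaction log: one row per (user, clicked article).
interactions = pd.DataFrame({
    "user_id": ["U1", "U1", "U2", "U2", "U2"],
    "news_id": ["N1", "N2", "N2", "N3", "N3"],
})

# Hypothetical cluster label for each article (e.g. from HDBSCAN or k-means).
item_clusters = pd.Series({"N1": 0, "N2": 1, "N3": 1}, name="cluster")

# Attach each article's cluster to the interaction rows.
interactions = interactions.join(item_clusters, on="news_id")

# Count interactions per user and item cluster.
user_cluster_matrix = pd.crosstab(interactions["user_id"], interactions["cluster"])
print(user_cluster_matrix)
```

Replacing individual articles with their clusters shrinks the number of columns, which is the search-space reduction that motivates clustering in the first place.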

In [None]:
# Build 50-dimensional UMAP embeddings for both vectorizations.
higher_dim_bow = cm.create_UMAP_embeddings(50, bow_matrix)
higher_dim_tf = cm.create_UMAP_embeddings(50, tf_matrix)



In [None]:
# Cluster the 50-dimensional embeddings with HDBSCAN.
bow_hdbscan_labels = cm.create_hdbscan_labels(higher_dim_bow)
tf_hdbscan_labels = cm.create_hdbscan_labels(higher_dim_tf)

In [None]:
import matplotlib.pyplot as plt

# Assumption: index 1 of each list holds the embedding built with the default
# (Hellinger) metric, matching the order in which the lists were created.
hellinger_bow_embeddings, hellinger_tf_embeddings = bow_embeddings[1], tf_embeddings[1]

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
axs = axs.flatten()
# Color the 2-D embeddings by the HDBSCAN labels found on the 50-D embeddings.
axs[0].scatter(hellinger_bow_embeddings[:, 0], hellinger_bow_embeddings[:, 1], alpha=0.5, s=1, c=bow_hdbscan_labels)
axs[1].scatter(hellinger_tf_embeddings[:, 0], hellinger_tf_embeddings[:, 1], alpha=0.5, s=1, c=tf_hdbscan_labels)

fig.suptitle("Exploration of HDBSCAN on higher dimensional data")
axs[0].set_title('BoW Embeddings - hdbscan')
axs[1].set_title('TF-IDF Embeddings - hdbscan')

plt.tight_layout()
plt.show()

In [None]:
import hdbscan
import umap
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def tune_clustering(matrix, n_components=50, metric='euclidean', min_cluster_size=500, n_neighbors=30):
    # Reduce dimensionality with UMAP, then cluster the embedding with HDBSCAN.
    umap_embeddings = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=0.0,
        n_components=n_components,
        random_state=42,
        metric=metric
    ).fit_transform(matrix)
    hdbscan_labels = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=min_cluster_size).fit_predict(umap_embeddings)
    # HDBSCAN labels noise points as -1; score the clustered points separately.
    clustered = (hdbscan_labels >= 0)
    print(f"CLUSTERED: Adjusted Rand score is: {adjusted_rand_score(news_text['category'][clustered], hdbscan_labels[clustered])},\n Adjusted mutual info is: {adjusted_mutual_info_score(news_text['category'][clustered], hdbscan_labels[clustered])}")
    print(f"Adjusted Rand score is: {adjusted_rand_score(news_text['category'], hdbscan_labels)},\n Adjusted mutual info is: {adjusted_mutual_info_score(news_text['category'], hdbscan_labels)}")

In [None]:
def tune_clustering(matrix, labels, mat_name, n_components=50, metric='euclidean', min_cluster_size=500, n_neighbors=30, min_dist=0.0, min_samples=5, random_state=42, counter=0):
    """
    Tune clustering parameters and evaluate clustering performance.

    Parameters:
    - matrix: Data matrix to cluster.
    - labels: True labels for evaluation.
    - mat_name: Name of the vectorization, recorded with the scores.
    - n_components: Number of dimensions for UMAP.
    - metric: Distance metric for UMAP.
    - min_cluster_size: Minimum cluster size for HDBSCAN.
    - n_neighbors: Number of neighbors for UMAP.
    - min_dist: Minimum distance between points in UMAP space.
    - min_samples: Minimum samples for HDBSCAN.
    - random_state: Random state for reproducibility.
    - counter: Run index for this parameter combination.

    Returns:
    - None. Appends the evaluation scores to cluster_tuning.csv.
    """
    try:
        umap_embeddings = umap.UMAP(
            n_neighbors=n_neighbors,
            min_dist=min_dist,
            n_components=n_components,
            random_state=random_state,
            metric=metric,
        ).fit_transform(matrix)
        
        hdbscan_labels = hdbscan.HDBSCAN(
            min_samples=min_samples,
            min_cluster_size=min_cluster_size
        ).fit_predict(umap_embeddings)
        
        clustered = (hdbscan_labels >= 0)
        if clustered.any():
            # One-row record of the parameters and scores for this run. A dict of
            # scalars needs an explicit index, so the run counter is used.
            data = pd.DataFrame(data={'mat': mat_name, 'n_neighbors': n_neighbors, 'min_dist': min_dist, 'n_components': n_components, 'metric': metric,
                                      'min_samples': min_samples, 'min_cluster_size': min_cluster_size,
                                      'clustered_rand': adjusted_rand_score(labels[clustered], hdbscan_labels[clustered]),
                                      'clustered_info': adjusted_mutual_info_score(labels[clustered], hdbscan_labels[clustered]),
                                      'overall_rand': adjusted_rand_score(labels, hdbscan_labels),
                                      'overall_info': adjusted_mutual_info_score(labels, hdbscan_labels)},
                                index=[counter])

            # Append to the results file; write the header only for the first run.
            data.to_csv('cluster_tuning.csv', mode='a', header=(counter == 0))

        else:
            print("No clusters formed with the given parameters.")
    except Exception as e:
        print(f"An error occurred: {e}")
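
The function above records two pairs of scores that differ only in how HDBSCAN's noise label (`-1`) is treated. The toy example below shows the difference on hand-made labels. The arrays are made up, and the `coverage` value is an extra we add here for illustration; it is not part of the function above.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Toy ground-truth categories and HDBSCAN-style labels (-1 marks noise).
true_labels = np.array(["news", "news", "news", "health", "health", "health"])
cluster_labels = np.array([0, 0, -1, 1, 1, -1])

# Score only the points that were assigned to a cluster.
clustered = cluster_labels >= 0
clustered_rand = adjusted_rand_score(true_labels[clustered], cluster_labels[clustered])
clustered_info = adjusted_mutual_info_score(true_labels[clustered], cluster_labels[clustered])

# Score all points, which treats noise (-1) as one more cluster.
overall_rand = adjusted_rand_score(true_labels, cluster_labels)
overall_info = adjusted_mutual_info_score(true_labels, cluster_labels)

# Fraction of points that received a cluster label.
coverage = clustered.mean()
print(clustered_rand, overall_rand, coverage)
```

A high clustered score paired with low coverage means the clusters are clean but most articles were left as noise, so the two numbers are best read together.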

In [None]:
n_components_range = [10, 50, 100]
min_cluster_size_range = [100, 500, 1000]
n_neighbors_range = [10, 30, 50]
min_dist_range = [0.0, 0.1, 0.5]
min_samples_range = [5, 10, 20]

# Iterate over each parameter range
counter = 0
for n_components in n_components_range:
    for min_cluster_size in min_cluster_size_range:
        for n_neighbors in n_neighbors_range:
            for min_dist in min_dist_range:
                for min_samples in min_samples_range:
                    print(f"Testing parameters: n_components={n_components}, min_cluster_size={min_cluster_size}, "
                          f"n_neighbors={n_neighbors}, min_dist={min_dist}, min_samples={min_samples}")
                    
                    tune_clustering(tf_matrix, news_text['category'],
                                    'tfidf',
                                    n_components=n_components, 
                                    metric='euclidean', 
                                    min_cluster_size=min_cluster_size, 
                                    n_neighbors=n_neighbors, 
                                    min_dist=min_dist, 
                                    min_samples=min_samples,
                                    counter=counter, 
                                    random_state=42)
                    counter += 1
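
The five nested loops above can also be written as a single loop over `itertools.product`. The sketch below only builds and counts the combinations; the call to `tune_clustering` is left as a comment because it depends on objects defined elsewhere in this notebook. The `param_grid` name is an assumption for the example.

```python
from itertools import product

# Same parameter ranges as the grid above.
param_grid = {
    "n_components": [10, 50, 100],
    "min_cluster_size": [100, 500, 1000],
    "n_neighbors": [10, 30, 50],
    "min_dist": [0.0, 0.1, 0.5],
    "min_samples": [5, 10, 20],
}

# Build one keyword-argument dict per combination, in the same order as the nested loops.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(combinations))  # 3**5 = 243 UMAP + HDBSCAN fits
for counter, params in enumerate(combinations[:2]):
    print(counter, params)
    # tune_clustering(tf_matrix, news_text['category'], 'tfidf', counter=counter, **params)
```

Counting the combinations up front makes the cost of the search explicit before any UMAP fit is started.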
                     


In [None]:
components = [5, 25, 50]
for component in components:
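    # The redefined tune_clustering also requires the true labels and a matrix name.
    tune_clustering(tf_matrix, news_text['category'], 'tfidf', n_components=component)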

In [None]:
umap_embeddings_norm = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=50,
    random_state=42
).fit_transform(tf_normalized)

In [None]:
umap_embeddings = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=50,
    random_state=42
).fit_transform(tf_matrix)

In [None]:
umap_embeddings_dim_2 = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=2,
    random_state=42
).fit_transform(tf_matrix)

In [None]:
hdbscan_labels = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=500).fit_predict(umap_embeddings)

In [None]:
hdbscan_labels_norm = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=500).fit_predict(umap_embeddings_norm)

In [None]:
hdbscan_labels_dim_2 = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=500).fit_predict(umap_embeddings_dim_2)

In [None]:
clustered = (hdbscan_labels_dim_2 >= 0)
plt.scatter(umap_embeddings_dim_2[~clustered, 0],
            umap_embeddings_dim_2[~clustered, 1],
            color=(0.5, 0.5, 0.5),
            s=0.1,
            alpha=0.5)
plt.scatter(umap_embeddings_dim_2[clustered, 0],
            umap_embeddings_dim_2[clustered, 1],
            c=hdbscan_labels_dim_2[clustered],
            s=0.1,
            cmap='Spectral');

In [None]:
len(hdbscan_labels)

In [None]:
len(hdbscan_labels_dim_2)

In [None]:
clustered = (hdbscan_labels_norm >= 0)
print(
    adjusted_rand_score(news_text['category'][clustered], hdbscan_labels_norm[clustered]),
    adjusted_mutual_info_score(news_text['category'][clustered], hdbscan_labels_norm[clustered])
)

In [None]:
print(adjusted_rand_score(news_text['category'], hdbscan_labels_norm), adjusted_mutual_info_score(news_text['category'], hdbscan_labels_norm))

In [None]:
clustered = (hdbscan_labels >= 0)
print(
    adjusted_rand_score(news_text['category'][clustered], hdbscan_labels[clustered]),
    adjusted_mutual_info_score(news_text['category'][clustered], hdbscan_labels[clustered])
)

In [None]:
print(adjusted_rand_score(news_text['category'], hdbscan_labels), adjusted_mutual_info_score(news_text['category'], hdbscan_labels))

In [None]:
clustered = (hdbscan_labels >= 0)
plt.scatter(umap_embeddings[~clustered, 0],
            umap_embeddings[~clustered, 1],
            color=(0.5, 0.5, 0.5),
            s=0.1,
            alpha=0.5)
plt.scatter(umap_embeddings[clustered, 0],
            umap_embeddings[clustered, 1],
            c=hdbscan_labels[clustered],
            s=0.1,
            cmap='Spectral');

In [None]:
# Explore the embeddings generated by BERT and how they behave under the same pipeline.
embeddings=pd.read_csv('pure_embeddings.csv').set_index('news_id')
news = pd.read_csv('MIND_small/csv/news_big_embeddings.csv').drop(columns=['Unnamed: 0', 'abstract_entities', 'title_entities', 'url'])
# Drop articles whose abstract embedding is a placeholder or missing.
news = news[news['abstract_embeddings'] != '[0]']
news = news[news['abstract_embeddings'].notna()].set_index('news_id')
news.drop(columns = ['abstract_embeddings', 'title_embeddings'], inplace=True)
bert_df = pd.concat([news, embeddings], axis=1).drop(columns=['Unnamed: 0.1', 'title', 'abstract'])

In [None]:
bert_umap_embeddings = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=50,
    random_state=42).fit(bert_df[bert_df.columns.to_list()[2:]])

In [None]:
import umap.plot as uplot

uplot.points(bert_umap_embeddings, labels=bert_df['category'], cmap='viridis')