<div align="center">

# Cluster Filtering

</div>
<br>

While the anomaly detector in Muzlin works well for telling the user whether new data belongs to the fitted dataset as a whole, it does not say whether the new data falls within a single sub-group of the training dataset or is spread across many.

Why is this useful? 

Say that you have a vector index and want to provide the top 10 retrieved contexts to an LLM to answer the user's question.
While the user's question may belong to the training dataset (e.g. the vector index), the retrieved context may be significantly separated from the question within the vector space and provide little meaningful context.

A simple approach might be to use the cosine similarity and set a passing threshold.
Muzlin provides another, more automated approach.

That is where clustering filters come in and can be used as a second layer filter after anomaly detection.
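
For comparison, the simple cosine-similarity threshold mentioned above can be sketched as follows. This is a minimal illustration on toy 2D vectors, not Muzlin's method; `cosine_filter` and the `threshold=0.5` value are hypothetical choices, and the sketch assumes no vector has zero norm.

```python
import numpy as np

def cosine_filter(query_vec, doc_vecs, threshold=0.5):
    """Return (mask, sims): mask is True where a document's cosine
    similarity to the query is at least `threshold`."""
    q = np.asarray(query_vec, dtype=float).ravel()
    docs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    return sims >= threshold, sims

# Toy vectors purely for illustration (hypothetical data)
query = [1.0, 0.0]
docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
keep, sims = cosine_filter(query, docs, threshold=0.5)
print(keep)
```

The drawback is that the threshold has to be picked by hand and may not transfer between embedding models or datasets.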

# Let's get started!

To begin, it is recommended to first install the libraries needed to work with the notebooks.



In [None]:
!pip install -q muzlin[notebook]

Now that we have everything installed, let's import the precomputed encoded textual vectors.

In [1]:
import numpy as np
vectors = np.load('vectors.npy')

<br>
Now we can build our clustering filter

In [None]:
from muzlin.anomaly import OutlierCluster
from sklearn.cluster import KMeans

# Let's initialize a clustering method. Don't worry about the n_clusters, this will be dynamically reset
clust = KMeans(n_clusters=2, random_state=1234)

# Since this is linked to the number of retrieved contexts, a useful parameter is the top-k retrieval amount
n_retrieve = 10  # Retrieve 10 documents from the vector index


# Set mlflow to true to log the experiment
#mlflow.set_experiment('outlier_model')
clf = OutlierCluster(mlflow=False, method=clust, n_retrieve=n_retrieve)
clf.fit(vectors)
#mlflow.end_run()

<br>
Perhaps a quick look at the cluster stats will be helpful before we continue.

In [3]:
n_col = len(np.unique(clf.labels_))
_, n_counts = np.unique(clf.labels_, return_counts=True)
print('Number of clusters:', n_col)
print('Mean number of vectors per cluster:', np.mean(n_counts)) 
print('Median number of vectors per cluster:', np.median(n_counts))
print('Standard deviation between the number of vectors per cluster:',np.std(n_counts))

Number of clusters: 40
Mean number of vectors per cluster: 20.225
Median number of vectors per cluster: 19.0
Standard deviation between the number of vectors per cluster: 6.897417995163118


<br>
A nice way to visualize this is to decompose the vectors and inspect the 3D plot of all the clusters

In [None]:
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.decomposition import PCA as PCA_decomp

# Sample one color per cluster from the 'tab20' colormap.
# Note: 'tab20' only has 20 distinct colors, so with more clusters some colors repeat.
cmap = plt.get_cmap('tab20')  # Other colormaps such as 'viridis' also work
colors = [cmap(i / n_col) for i in range(n_col)]  # One RGBA tuple per cluster label


# Create a decomposition model and transform the data
#decomp = TSNE(n_components=3, perplexity=5, random_state=42, init='pca', learning_rate='auto', metric='cosine')
decomp = PCA_decomp(n_components=3)
vis_dims = decomp.fit_transform(vectors)

x = vis_dims[:, 0]
y = vis_dims[:, 1]
z = vis_dims[:, 2]

labels = clf.labels_

# Initialize an empty list to hold the scatter plots
scatter_list = []

# Plot each label with a unique color
for i, label in enumerate(np.unique(labels)):
    scatter_list.append(go.Scatter3d(
        x=x[labels == label],
        y=y[labels == label],
        z=z[labels == label],
        mode='markers',
        marker=dict(size=1.5, color=f'rgb({colors[i][0] * 255},{colors[i][1] * 255},{colors[i][2] * 255})'),
        name=f'Label {label}'
    ))

# Create the figure
fig = go.Figure(data=scatter_list)


# Set the title
fig.update_layout(title_text='Clusters',
                 width=600, height=600)

# Show the plot
fig.show()
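
A 3D projection of high-dimensional embeddings can discard much of the variance, so clusters that overlap in the plot may still be separated in the original space. A minimal sketch of checking how much variance the first three principal components retain, using NumPy's SVD on random stand-in data rather than the real `vectors` array; `explained_variance_ratio` is a hypothetical helper written for this illustration.

```python
import numpy as np

def explained_variance_ratio(X, n_components=3):
    """Fraction of total variance captured by each of the leading principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                   # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, in descending order
    var = s ** 2                              # proportional to per-component variance
    return var[:n_components] / var.sum()

# Random stand-in data (hypothetical); swap in the real embedding matrix
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 16))
ratios = explained_variance_ratio(demo)
print('Variance kept by 3 components:', ratios.sum())
```

With the fitted scikit-learn model from the cell above, `decomp.explained_variance_ratio_.sum()` reports the same quantity directly.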

<br>
Now that we have a cluster filter, the next step is to test how it performs with retrieved documents.
<br>
However, to do this we first need to build a vector index.

In [5]:
import pandas as pd
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

df = pd.read_csv('bigbio_scifact.csv')

texts = df['data'].values.tolist()

embeddings = HuggingFaceEmbeddings(model_name='BAAI/bge-small-en-v1.5')

db = FAISS.from_texts(texts, embeddings)

#db.save_local("faiss_index")
#db = FAISS.load_local("faiss_index", embeddings)

Any document retriever will work. 

Muzlin also has wrappers for LangChain and LlamaIndex vector indexes. If you want to keep everything consistent within Muzlin, this vector index can be loaded into a local class for handling LangChain indexes.

In [6]:
from muzlin.index import LangchainIndex

db = LangchainIndex(index=db, top_k=n_retrieve)

Let's now create a function for retrieving the stored documents based on the user's query

In [7]:
from muzlin.encoders import HuggingFaceEncoder

def get_doc_vectors(index, encoder, query):

    documents = index(query)
    
    doc_vectors = []
    for doc in documents:
        print(doc)
        doc_embed = encoder([doc])
        doc_array = np.array(doc_embed).reshape(1, -1)
        doc_vectors.append(doc_array.ravel())

    print('\n')
    return doc_vectors

encoder = HuggingFaceEncoder()

The two queries below were shown to pass the anomaly threshold in the last notebook example

What will the clustering filter tests say?

In [8]:
query1 = 'What treatment raises endoplasmic reticulum stress?'
query2 = 'If I take too much folic acid will a side effect be kidney disease?'

q1_doc_vecs = get_doc_vectors(db, encoder, query1)
q2_doc_vecs = get_doc_vectors(db, encoder, query2)

4-PBA treatment raises endoplasmic reticulum stress in response to general endoplasmic reticulum stress markers.
4-PBA treatment decreases endoplasmic reticulum stress in response to general endoplasmic reticulum stress markers.
ATF4 is a general endoplasmic reticulum stress marker.
BiP is a general endoplasmic reticulum stress marker.
CHOP is a general endoplasmic reticulum stress marker.
Treatment with a protein named FN impairs regenerative abilities of aged muscles.
Treatment with a protein named FN restores regenerative abilities of aged muscles.
Cholesterol loading induces KLF4 expression in VSMCs, resulting in the expression of pro-inflammatory cytokines.
PCSK9 inhibitors decrease plasma Lp(a) levels.
Chenodeoxycholic acid treatment decreases brown adipose tissue activity.


40mg/day dosage of folic acid and 2mg/day dosage of vitamin B12 does not affect chronic kidney disease (CKD) progression.
Intake of folic acid (FA) and vitamin B6 (VB6) increases levels of homocysteine.
A de

Just by visual inspection of the retrieved documents above, we can see that the first query can be fully answered by the context provided.
For the second query, however, the context appears related, but none of it fully answers the question.

There are three tests that are applied during clustering filtering:

- Is the retrieved context from an optimal number of clusters (e.g. not too dense or too sparse in detail)?
- Do the query and the retrieved context really constitute a realistic cluster with respect to the entire fitted data? (This checks whether this pseudo-cluster is similar in size to the general cluster size within the vector index.)
- Factoring in the density of the retrieved documents' cluster, does the query really belong to this cluster?

In [9]:
query1_vec = np.array(encoder([query1])).reshape(1,-1)
query2_vec = np.array(encoder([query2])).reshape(1,-1)

clust_class1, topk_class1, sep_class1 = clf.predict(query1_vec, q1_doc_vecs)
clust_class2, topk_class2, sep_class2 = clf.predict(query2_vec, q2_doc_vecs)

print('For query 1 the three tests are as follows: ', clust_class1, topk_class1, sep_class1) # inlier
print('For query 2 the three tests are as follows: ', clust_class2, topk_class2, sep_class2) # outlier

For query 1 the three tests are as follows:  0 0 0
For query 2 the three tests are as follows:  0 1 1
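
Judging by the inline comments in the cell above, a flag of 0 marks a passed test (inlier) and 1 a failed one (outlier). A minimal sketch of collapsing the three flags into a single decision, assuming that convention holds; `passes_cluster_filter` is a hypothetical helper and not part of Muzlin.

```python
def passes_cluster_filter(clust_flag, topk_flag, sep_flag):
    """Return True only if all three tests report 0 (assumed here to mean inlier)."""
    return all(int(flag) == 0 for flag in (clust_flag, topk_flag, sep_flag))

print(passes_cluster_filter(0, 0, 0))  # flags reported for query 1
print(passes_cluster_filter(0, 1, 1))  # flags reported for query 2
```

Requiring all three tests to pass is the strictest policy; a more lenient pipeline could, for example, only reject when two or more tests fail.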
