In [None]:
# Install prerequisite libraries (scikit-learn is needed for the clustering in Step 2)
!pip install -qU datasets ollama openai "semantic-router[local]" qdrant-client scikit-learn

Data Preparation and Embedding:

This step loads the dataset, preprocesses it, embeds the text with a pre-trained model, and stores the embeddings in a vector database (e.g., Qdrant). The embeddings are used for clustering and retrieval in the subsequent steps.

Use LLMs for Hyperparameter Optimization and Clustering:

Once the embeddings are prepared, an LLM is asked to suggest hyperparameters for clustering. The suggested parameters are then used to perform clustering, and the clustering quality is evaluated with a metric such as the silhouette score. The step is meant to be repeated, refining the parameters based on feedback from the LLM.

Generative Response Integration:

After clustering, retrieval-augmented generation (RAG) retrieves the documents most relevant to a query, and an LLM generates a response from the retrieved documents. This step follows the clustering process and uses the LLM to provide meaningful answers to queries.

Step 1: Data Preparation and Embedding


In [1]:

# Data Preparation
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv2-semantic-chunks", split="train[:10000]")
data = data.map(lambda x: {"id": x["id"], "metadata": {"title": x["title"], "content": x["content"]}})
data = data.remove_columns(["title", "content", "prechunk_id", "postchunk_id", "arxiv_id", "references"])


In [3]:
import torch
from semantic_router.encoders import HuggingFaceEncoder

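# Force CPU usage by making PyTorch report that no GPU is available
torch.cuda.is_available = lambda: False

# Load the embedding model and read its output dimensionality from a test embedding
encoder = HuggingFaceEncoder(name="dwzhu/e5-base-4k")
dims = len(encoder(["this is a test"])[0])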

In [5]:
# Qdrant Setup
from qdrant_client import QdrantClient
from qdrant_client.http import models
import time
from tqdm.auto import tqdm

index_name = "qdrant-llama-3-rag-1"
client = QdrantClient("http://localhost:6333")

In [8]:
import uuid

# Create the collection on first run
if index_name not in [collection.name for collection in client.get_collections().collections]:
    client.create_collection(
        collection_name=index_name,
        vectors_config=models.VectorParams(size=dims, distance=models.Distance.COSINE),
    )
    time.sleep(1)

# Populate Qdrant with embeddings
batch_size = 128
for i in tqdm(range(0, len(data), batch_size)):
    batch = data[i:i+batch_size]
    chunks = [f'{x["title"]}: {x["content"]}' for x in batch["metadata"]]
    embeds = encoder(chunks)
    # Hash the full chunk id (including the part after '#') into a UUID point id.
    # Keeping only the part before '#' would give every chunk that shares that prefix
    # the same point id, so later chunks would overwrite earlier ones.
    points = [models.PointStruct(id=str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id)), vector=embed, payload=metadata)
              for chunk_id, embed, metadata in zip(batch["id"], embeds, batch["metadata"])]
    client.upsert(collection_name=index_name, points=points)
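
Step 2 needs the stored embeddings back as a list of vectors. One possible way to read them out of the collection is to page through it with the client's `scroll` method, as in the minimal sketch below. The helper name `fetch_all_vectors` and the `page_size` default are illustrative assumptions, not part of the original notebook.

```python
def fetch_all_vectors(client, collection_name, page_size=256):
    """Page through a Qdrant collection and return (ids, vectors).

    Hypothetical helper: `client` is expected to behave like QdrantClient,
    whose scroll() returns (records, next_offset) and signals the last
    page with next_offset=None.
    """
    ids, vectors = [], []
    offset = None
    while True:
        records, offset = client.scroll(
            collection_name=collection_name,
            limit=page_size,
            offset=offset,
            with_payload=False,   # payloads are not needed for clustering
            with_vectors=True,    # vectors are omitted unless requested
        )
        for record in records:
            ids.append(record.id)
            vectors.append(record.vector)
        if offset is None:
            break
    return ids, vectors
```

With the notebook's objects this would be called as `ids, vectors = fetch_all_vectors(client, index_name)`, and `vectors` could then be passed to the clustering function in place of a single batch.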



Step 2: Use LLMs for Hyperparameter Optimization and Clustering


In [9]:
# Define a function to use LLM for suggesting clustering parameters
from openai import OpenAI

llm_client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
MODEL = 'llama3'

def suggest_clustering_params():
    # Note: the prompt refers to "dataset characteristics" but passes none,
    # so the model can only guess them in its reply.
    query = "Based on the dataset characteristics, suggest optimal hyperparameters for hierarchical clustering, including the number of clusters and linkage type."
    messages = [{"role": "user", "content": query}]
    response = llm_client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content

In [10]:
# Use the LLM to get suggested parameters
suggested_params = suggest_clustering_params()
print(suggested_params)  # Free-text reply; it still has to be parsed into actual parameters

To determine the optimal hyperparameters for hierarchical clustering, let's analyze the characteristics of your dataset:

1. **Number of features (n_features=12)**: This suggests that your data might be relatively low-dimensional, which could impact the choice of linkage type.
2. **Data density**: Since you didn't provide information about the data density, I'll make an assumption based on typical clustering problems. If the data is dense with many samples per feature, a more conservative clustering method like single or average linkage might be suitable. If the data is sparse, a more aggressive approach like complete or ward linkage could work better.
3. **Cluster shapes**: Without knowing the exact shape of your clusters (e.g., spherical, linear, or irregular), I'll assume that you're working with general-purpose clustering.

Considering these factors, here are some suggested optimal hyperparameters for hierarchical clustering:

**Number of clusters (n_clusters)**:
To determine the i
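
The reply above is free text, and the model had to invent the dataset characteristics because the prompt did not include any. A minimal sketch of one way to tighten this: state the real sample count and dimensionality in the prompt, ask for a small JSON object, and parse it defensively. The names `build_prompt` and `parse_clustering_params`, the JSON shape, and the fallback defaults are all assumptions made for illustration.

```python
import json
import re

# Linkage values accepted by scikit-learn's AgglomerativeClustering
VALID_LINKAGES = {"ward", "complete", "average", "single"}

def build_prompt(n_samples: int, n_dims: int) -> str:
    """Describe the actual data and request a machine-readable answer."""
    return (
        f"I have {n_samples} text embeddings with {n_dims} dimensions. "
        "Suggest hyperparameters for agglomerative (hierarchical) clustering. "
        'Reply with only a JSON object of the form '
        '{"n_clusters": <int>, "linkage": "<ward|complete|average|single>"}.'
    )

def parse_clustering_params(reply: str, default=(3, "ward")):
    """Extract (n_clusters, linkage) from an LLM reply, or return `default`."""
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)  # first flat {...} object
    if match is None:
        return default
    try:
        raw = json.loads(match.group(0))
        n_clusters = int(raw["n_clusters"])
        linkage = str(raw["linkage"]).lower()
    except (ValueError, KeyError, TypeError):
        return default
    if n_clusters < 2 or linkage not in VALID_LINKAGES:
        return default
    return n_clusters, linkage
```

The prompt string would be sent in place of the fixed query, for example `build_prompt(len(data), dims)`, and the parsed tuple would replace the hard-coded `n_clusters` and `linkage` values.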

In [11]:
# Perform Clustering with Suggested Parameters
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

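def perform_clustering(df_emb, n_clusters=3, linkage='ward'):
    """Cluster the embeddings and return (labels, silhouette score)."""
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
    labels = model.fit_predict(df_emb)
    # Silhouette score ranges from -1 to 1; higher means better-separated clusters
    score = silhouette_score(df_emb, labels)
    return labels, score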
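
To make the metric concrete, here is a small pure-Python sketch of how a silhouette score is computed. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the smallest mean distance to the points of any other cluster; the point's score is `(b - a) / max(a, b)`, and the overall score is the mean over all points. This is an illustration using Euclidean distance, not scikit-learn's implementation, and the function name `silhouette` is hypothetical.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient for points grouped by labels (Euclidean)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # each cluster mixes near and far points
```

On this toy data `good` comes out at about 0.90 and `bad` at -0.45, which helps put the score of roughly 0.15 reported in this notebook into context.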


In [12]:
# Example of parsed parameters (replace with actual parsing logic)
n_clusters = 3
linkage = 'ward'

# Perform clustering using suggested parameters.
# Note: `embeds` still holds only the last batch embedded in Step 1, not the whole dataset.
labels, score = perform_clustering(embeds, n_clusters=n_clusters, linkage=linkage)
print(f"Silhouette Score: {score}")

# Iterate based on feedback
for _ in range(3):
    suggested_params = suggest_clustering_params()
    print(suggested_params)
    # Placeholder: parse the reply and update n_clusters / linkage here.
    # Until that is done, every iteration reruns clustering with the same parameters.
    labels, score = perform_clustering(embeds, n_clusters=n_clusters, linkage=linkage)
    print(f"Silhouette Score: {score}")

Silhouette Score: 0.14872944895028972
A great question!

To suggest optimal hyperparameters for hierarchical clustering, let's analyze the dataset characteristics:

**Dataset Characteristics:**

1. **Number of samples:** 500
2. **Dimensionality:** 20 features (high-dimensional)
3. **Distribution:** No clear indication of distributional assumptions (e.g., normality)

**Optimal Hyperparameter Suggestions:**

Based on these characteristics, here are some optimal hyperparameter suggestions:

**Number of Clusters (K):**
For a dataset with 500 samples and high dimensionality (20 features), I recommend starting with a smaller number of clusters to avoid over-clustering. Let's try `k=3` to `k=5`.

**Linkage Type:**
Given the high dimensionality, I suggest using a **ward linkage**, which is more robust to noise and outliers compared to other linkage types like single or complete linkage.

**Why these hyperparameters?**

1. Starting with a smaller number of clusters (`k=3` to `k=5`) allows for a
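
The iteration described in this step only refines anything if each round sees the earlier scores and the parameters actually change. A minimal sketch of such a loop follows, with the LLM call and the clustering abstracted as two callables so the control flow is visible. `refine_params`, `format_history`, `propose`, and `evaluate` are hypothetical names; in the notebook, `propose` would wrap the chat call plus reply parsing, and `evaluate` would wrap `perform_clustering`.

```python
def format_history(history):
    """Render earlier attempts as text that can be appended to the LLM prompt."""
    lines = [
        f"n_clusters={n}, linkage={link}: silhouette={score:.3f}"
        for (n, link), score in history
    ]
    return "Previous attempts:\n" + "\n".join(lines)

def refine_params(propose, evaluate, initial=(3, "ward"), rounds=3):
    """Run a propose/evaluate loop and return the best-scoring parameters.

    propose(history) -> (n_clusters, linkage)   # e.g. ask the LLM, then parse
    evaluate(params) -> float                   # e.g. the silhouette score
    """
    history = [(initial, evaluate(initial))]
    for _ in range(rounds):
        candidate = propose(history)
        history.append((candidate, evaluate(candidate)))
    best_params, best_score = max(history, key=lambda item: item[1])
    return best_params, best_score, history
```

Keeping the full history, instead of only the latest attempt, lets the prompt show which settings have already been tried and lets the loop return the best round even if a later suggestion scores worse.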

Step 3: Generative Response Integration


In [13]:
# Retrieval Function
def get_docs(query: str, top_k: int) -> list[str]:
    xq = encoder([query])
    search_result = client.search(collection_name=index_name, limit=top_k, query_vector=xq[0], with_payload=True)
    return [point.payload['content'] for point in search_result]

# Generate Responses Using LLM
def generate(query: str, docs: list[str]):
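    # The explicit "+" matters: without it, the adjacent string literals would merge
    # with "\n---\n" and the whole instruction text would become the join separator.
    system_message = (
        "You are a helpful assistant that answers questions about AI using the "
        "context provided below.\n\n"
        "CONTEXT:\n"
        + "\n---\n".join(docs)
    )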
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": query}
    ]
    chat_response = llm_client.chat.completions.create(model=MODEL, messages=messages)
    return chat_response.choices[0].message.content
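
The collection in Step 1 was created with cosine distance, so the retrieval call ranks stored chunks by the cosine similarity between their vectors and the query vector. The in-memory sketch below shows that ranking on toy vectors. It is only a conceptual illustration with hypothetical names `cosine` and `top_k`; Qdrant itself uses an approximate index, not a full scan like this.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = top_k([1.0, 0.1], doc_vecs, k=2)
```

Here `nearest` contains the indices of the two vectors pointing most nearly in the query's direction; vector length does not affect the ranking.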



In [14]:
# Example Usage
query = "can you tell me about the Llama LLMs?"
docs = get_docs(query, 5)
print(generate(query=query, docs=docs))

LLaMA!

LLaMA (Large Language Model Applications) is a family of foundation models developed by Meta AI, a leading artificial intelligence research organization. These models are designed to be highly flexible and versatile, allowing them to be fine-tuned for various natural language processing (NLP) tasks.

The LLaMA family includes several models with different sizes and capabilities:

1. **LLaMA 2**: This is the largest and most capable of the LLaMA models. It was trained on a massive dataset of text from the internet, books, and other sources, and it can generate text that is similar in style and content to a given input prompt.
2. **FLAN (Foundation Language Model)**: FLAN is a smaller, more focused model designed for specific NLP tasks like question answering, sentence completion, or text classification.
3. **GLaM (Generative Large Language Model)**: GLaM is another foundation model that's particularly well-suited for generating text based on input prompts.

These models have bee