[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/anthropic_mongodb_pam_ai_stack.ipynb)

[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/developer/products/atlas/rag_with_claude_opus_mongodb/)


In [None]:
!pip install --quiet pymongo datasets pandas anthropic voyageai

# Set Environment Variables


In [2]:
import os

os.environ["ANTHROPIC_API_KEY"] = ""
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY")

os.environ["VOYAGE_API_KEY"] = ""
VOYAGE_API_KEY = os.environ.get("VOYAGE_API_KEY")

os.environ["HF_TOKEN"] = ""
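
The cell above leaves each key as an empty string for the reader to fill in. As a minimal sketch (the helper name `require_env` is hypothetical and not part of the original notebook), a small check can fail fast when a key is missing instead of sending an empty credential to an API:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; set it before running this notebook")
    return value


# Hypothetical usage (left commented out so the sketch runs without real keys):
# ANTHROPIC_API_KEY = require_env("ANTHROPIC_API_KEY")
# VOYAGE_API_KEY = require_env("VOYAGE_API_KEY")
```

Raising here is a design choice: an empty key would otherwise only show up later as an authentication error from the remote service.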

In [3]:
import pandas as pd
from datasets import load_dataset

# Make sure you have a Hugging Face token (HF_TOKEN) in your development environment before running the code below
# How to get a token: https://huggingface.co/docs/hub/en/security-tokens

# https://huggingface.co/datasets/MongoDB/tech-news-embeddings
dataset = load_dataset("MongoDB/tech-news-embeddings", split="train", streaming=True)
combined_df = dataset.take(500)

# Convert the dataset to a pandas dataframe
combined_df = pd.DataFrame(combined_df)


In [4]:
# Remove the _id column from the initial dataset.
# MongoDB generates a unique _id for each inserted document, so dropping the
# existing values avoids conflicts with pre-existing IDs.
combined_df = combined_df.drop(columns=["_id"])

# Remove the original embedding column, since we will create new embeddings with the Voyage AI embedding model
combined_df = combined_df.drop(columns=["embedding"])

combined_df.head()

Unnamed: 0,companyName,companyUrl,published_at,url,title,main_image,description
0,01Synergy,https://hackernoon.com/company/01synergy,2023-05-16 02:09:00,https://www.businesswire.com/news/home/2023051...,onsemi and Sineng Electric Spearhead the Devel...,https://firebasestorage.googleapis.com/v0/b/ha...,(Nasdaq: ON) a leader in intelligent power and...
1,01Synergy,https://hackernoon.com/company/01synergy,2023-05-02 00:07:00,https://elkodaily.com/news/local/adobe-student...,Adobe student receives national Information an...,https://firebasestorage.googleapis.com/v0/b/ha...,ELKO — An eighth grader at Adobe Middle School...
2,01Synergy,https://hackernoon.com/company/01synergy,2023-05-01 22:22:00,https://www.aei.org/technology-and-innovation/...,Modernizing State Services: Harnessing Technol...,https://firebasestorage.googleapis.com/v0/b/ha...,To deliver 21st-century government services Go...
3,01Synergy,https://hackernoon.com/company/01synergy,2023-05-02 13:12:00,https://www.crn.com/news/managed-services/terr...,Terry Richardson On Why He Left AMD GreenPages...,https://firebasestorage.googleapis.com/v0/b/ha...,In February GreenPages acquired Toronto-based ...
4,01Synergy,https://hackernoon.com/company/01synergy,2023-05-15 20:01:00,https://www.benzinga.com/pressreleases/23/05/3...,Synex Renewable Energy Corporation (Formerly S...,https://firebasestorage.googleapis.com/v0/b/ha...,The conference will bring together growth orie...


In [5]:
import voyageai

vo = voyageai.Client(api_key=VOYAGE_API_KEY)


def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = vo.embed(text, model="voyage-large-2", input_type="document")

    return embedding.embeddings[0]


combined_df["embedding"] = combined_df["description"].apply(get_embedding)

combined_df.head()

Unnamed: 0,companyName,companyUrl,published_at,url,title,main_image,description,embedding
0,01Synergy,https://hackernoon.com/company/01synergy,2023-05-16 02:09:00,https://www.businesswire.com/news/home/2023051...,onsemi and Sineng Electric Spearhead the Devel...,https://firebasestorage.googleapis.com/v0/b/ha...,(Nasdaq: ON) a leader in intelligent power and...,"[0.01778620295226574, 0.010809645056724548, 0...."
1,01Synergy,https://hackernoon.com/company/01synergy,2023-05-02 00:07:00,https://elkodaily.com/news/local/adobe-student...,Adobe student receives national Information an...,https://firebasestorage.googleapis.com/v0/b/ha...,ELKO — An eighth grader at Adobe Middle School...,"[0.013895523734390736, 0.004011738579720259, 0..."
2,01Synergy,https://hackernoon.com/company/01synergy,2023-05-01 22:22:00,https://www.aei.org/technology-and-innovation/...,Modernizing State Services: Harnessing Technol...,https://firebasestorage.googleapis.com/v0/b/ha...,To deliver 21st-century government services Go...,"[-0.01352707389742136, -0.0019957569893449545,..."
3,01Synergy,https://hackernoon.com/company/01synergy,2023-05-02 13:12:00,https://www.crn.com/news/managed-services/terr...,Terry Richardson On Why He Left AMD GreenPages...,https://firebasestorage.googleapis.com/v0/b/ha...,In February GreenPages acquired Toronto-based ...,"[0.008943582884967327, -0.016207382082939148, ..."
4,01Synergy,https://hackernoon.com/company/01synergy,2023-05-15 20:01:00,https://www.benzinga.com/pressreleases/23/05/3...,Synex Renewable Energy Corporation (Formerly S...,https://firebasestorage.googleapis.com/v0/b/ha...,The conference will bring together growth orie...,"[0.02538326568901539, 0.0015419272240251303, 0..."
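
The cell above calls the embedding API once per row. A possible alternative, sketched here under assumptions, is to send several descriptions per request. The helper names `batched` and `embed_in_batches` and the batch size of 64 are illustrative choices, not part of the original notebook, and the embedding call is passed in as a function so the sketch does not depend on any client:

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,  # assumed value; check the provider's per-request limits
) -> list[list[float]]:
    """Embed `texts` batch by batch, preserving input order.

    Empty or whitespace-only texts get an empty list, mirroring get_embedding.
    """
    results: list[list[float]] = [[] for _ in texts]
    # Keep only non-empty texts, remembering the position each one came from
    indexed = [(i, t) for i, t in enumerate(texts) if t.strip()]
    for chunk in batched(indexed, batch_size):
        vectors = embed_batch([t for _, t in chunk])
        for (i, _), vector in zip(chunk, vectors):
            results[i] = vector
    return results


# Hypothetical wiring to the Voyage client defined above (not executed here):
# def voyage_batch(batch):
#     return vo.embed(batch, model="voyage-large-2", input_type="document").embeddings
# combined_df["embedding"] = embed_in_batches(combined_df["description"].tolist(), voyage_batch)
```

Fewer requests usually means less time spent on network round trips, at the cost of a failed request affecting a whole batch rather than a single row.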


Create Database and Collection
Create Vector Search Index

In [6]:
os.environ["MONGO_URI"] = ""

In [7]:
import pymongo


def get_mongo_client(mongo_uri):
    """Establish and validate connection to the MongoDB."""

    client = pymongo.MongoClient(
        mongo_uri, appname="devrel.showcase.anthropic_rag.python"
    )

    # Validate the connection
    ping_result = client.admin.command("ping")
    if ping_result.get("ok") == 1.0:
        # Connection successful
        print("Connection to MongoDB successful")
        return client
    print("Connection to MongoDB failed")
    return None


mongo_uri = os.environ.get("MONGO_URI")

if not mongo_uri:
    raise ValueError("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

DB_NAME = "knowledge"
COLLECTION_NAME = "research_papers"

db = mongo_client.get_database(DB_NAME)
collection = db.get_collection(COLLECTION_NAME)

Connection to MongoDB successful


In [8]:
# To ensure we are working with a fresh collection
# delete any existing records in the collection
collection.delete_many({})

DeleteResult({'n': 4000, 'electionId': ObjectId('7fffffff000000000000002b'), 'opTime': {'ts': Timestamp(1721843443, 1788), 't': 43}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1721843443, 1788), 'signature': {'hash': b'3\xc9e\r\xf3\x0f\xc1\x9d\xee\x04J(\xa2\xd0\xabZW\x19l\x88', 'keyId': 7353740577831124994}}, 'operationTime': Timestamp(1721843443, 1788)}, acknowledged=True)

In [None]:
# Data Ingestion
combined_df_json = combined_df.to_dict(orient="records")
collection.insert_many(combined_df_json)
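
The notebook names a "Create Vector Search Index" step, and the search function later refers to an index called `vector_index` on the `embedding` field, but no code for creating it is shown. The sketch below is one way this might be done programmatically, assuming PyMongo 4.7 or later and an Atlas cluster. The helper names are hypothetical, and the value 1536 is an assumption about the output dimension of `voyage-large-2` that should be checked against `len(combined_df["embedding"][0])`; cosine similarity is likewise an assumed choice:

```python
def build_vector_index_definition(
    path: str = "embedding",
    num_dimensions: int = 1536,  # assumed; verify against the stored vectors
    similarity: str = "cosine",  # assumed similarity function
) -> dict:
    """Return an Atlas Vector Search index definition for a single vector field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the index on `collection` and return its name.

    Assumes PyMongo 4.7+ and an Atlas deployment; not executed in this sketch.
    """
    # Imported inside the function so the definition helper works without PyMongo
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)
```

Index builds are asynchronous, so queries issued immediately after creation may return no results until the index is ready. The index can also be created through the Atlas UI with the same definition.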

In [10]:
def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (pymongo.collection.Collection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    # get_embedding returns an empty list for empty input, so test for
    # emptiness and return an empty result list that callers can iterate over
    if not query_embedding:
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5,  # Return top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "embedding": 0,  # Exclude the embedding field
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                },
            }
        },
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
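
The `score` field projected above comes from `vectorSearchScore`. Assuming the index uses cosine similarity (the notebook does not show the index definition), Atlas Vector Search reports a normalized score of (1 + cosine) / 2, so values fall between 0 and 1. A small pure-Python sketch of that relationship, with hypothetical helper names:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(a: list[float], b: list[float]) -> float:
    """Map cosine similarity from [-1, 1] onto [0, 1]."""
    return (1 + cosine_similarity(a, b)) / 2
```

Under this mapping, identical directions score 1.0, orthogonal vectors 0.5, and opposite directions 0.0, which helps when choosing a cutoff for filtering weak matches.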

In [12]:
import anthropic

client = anthropic.Client(api_key=ANTHROPIC_API_KEY)

In [13]:
def handle_user_query(query, collection):
    get_knowledge = vector_search(query, collection)

    search_result = ""
    for result in get_knowledge:
        search_result += (
            f"Title: {result.get('title', 'N/A')}, "
            f"Company Name: {result.get('companyName', 'N/A')}, "
            f"Company URL: {result.get('companyUrl', 'N/A')}, "
            f"Date Published: {result.get('published_at', 'N/A')}, "
            f"Article URL: {result.get('url', 'N/A')}, "
            f"Description: {result.get('description', 'N/A')}, \n"
        )

    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        system="You are a Venture Capital Tech Analyst with access to some tech company articles and information. You use the information you are given to provide advice.",
        messages=[
            {
                "role": "user",
                "content": "Answer this user query: "
                + query
                + " with the following context: "
                + search_result,
            }
        ],
    )

    return (response.content[0].text), search_result
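
The function above builds the context string inline by joining fields with commas. As a sketch of one alternative (the name `format_context` and the `[Source n]` labels are illustrative and not part of the original notebook), the formatting can be pulled into a pure function that puts each field on its own line and numbers the sources, which makes the context easier to inspect and test without calling the model:

```python
# (label shown to the model, key in the MongoDB document)
CONTEXT_FIELDS = [
    ("Title", "title"),
    ("Company Name", "companyName"),
    ("Company URL", "companyUrl"),
    ("Date Published", "published_at"),
    ("Article URL", "url"),
    ("Description", "description"),
]


def format_context(results: list[dict]) -> str:
    """Render search results as numbered blocks, one field per line."""
    blocks = []
    for n, doc in enumerate(results, start=1):
        lines = [f"[Source {n}]"]
        lines += [f"{label}: {doc.get(key, 'N/A')}" for label, key in CONTEXT_FIELDS]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Hypothetical usage inside handle_user_query:
# search_result = format_context(vector_search(query, collection))
```

Numbered sources also give the model something concrete to cite in its answer, although whether it does so depends on the prompt.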

In [15]:
# Conduct query with retrieval of sources
query = "Give me the best tech stock to invest in and tell me why"
response, source_information = handle_user_query(query, collection)

print(f"Response: {response}")
print(f"\nSource Information: \n{source_information}")

Response: Based on the limited information provided in the article titles and descriptions, it's difficult to recommend a single "best" tech stock to invest in. Investing always carries risk, and it's important to do thorough research before making any investment decisions. That said, here are a few thoughts on the companies mentioned:

01Synergy is listed in a couple of the articles as a top information technology services stock. The IT services sector can be promising, as many businesses rely on these companies for critical tech infrastructure and support. However, without more details on 01Synergy's financials, competitive advantages, growth prospects etc., it's hard to say if it's the best investment.

The article on Seeking Alpha about ranking IT stocks by value could be insightful - finding stocks that are undervalued by the market can lead to good returns. But the description doesn't specify which stocks scored well.

The roundup article mentions larger, well-known tech companie