The following code uses MongoDB, VoyageAI, and Anthropic's Claude to mimic a VC analyst whose answers are grounded in retrieved tech-news articles.

First step: install the dependencies with !pip install pymongo datasets pandas anthropic voyageai python-dotenv

In [None]:
import os
from dotenv import load_dotenv
import requests
from io import BytesIO
import pandas as pd

load_dotenv()

The cell above imports:
os for interacting with the operating system and reading environment variables,
load_dotenv from dotenv for loading variables such as HF_TOKEN from a local .env file,
requests for making HTTP requests,
BytesIO from the io module to handle bytes objects like files in memory, and
pandas (as pd) for data manipulation and analysis.
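
Because every later cell reads its credentials from the environment, it can help to check them once up front. The following is a minimal sketch, not part of the original notebook: the helper name missing_env_vars is hypothetical, and the variable list is assumed from the os.getenv calls used in this notebook plus ANTHROPIC_API_KEY, which the Anthropic client reads when no key is passed explicitly.

```python
import os

# Assumed list: names taken from the os.getenv calls in this notebook, plus
# ANTHROPIC_API_KEY, which Anthropic() reads when no api_key argument is given.
REQUIRED_ENV_VARS = ["HF_TOKEN", "VOYAGE_API_KEY", "MONGO_URI", "ANTHROPIC_API_KEY"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after load_dotenv() surfaces a missing key before any download or API call is attempted.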

In [2]:
def download_and_combine_parquet_files(parquet_file_urls, hf_token):
    """
    Downloads Parquet files from the provided URLs using the given Hugging Face token,
    and returns a combined DataFrame.

    Parameters:
    - parquet_file_urls: List of strings, URLs to the Parquet files.
    - hf_token: String, Hugging Face authorization token.

    Returns:
    - combined_df: A pandas DataFrame containing the combined data from all Parquet files.
    """
    headers = {"Authorization": f"Bearer {hf_token}"}
    all_dataframes = []

    for parquet_file_url in parquet_file_urls:
        response = requests.get(parquet_file_url, headers=headers)
        if response.status_code == 200:
            parquet_bytes = BytesIO(response.content)
            df = pd.read_parquet(parquet_bytes)
            all_dataframes.append(df)
        else:
            print(f"Failed to download Parquet file from {parquet_file_url}: {response.status_code}")

    if all_dataframes:
        combined_df = pd.concat(all_dataframes, ignore_index=True)
        return combined_df
    else:
        print("No dataframes to concatenate.")
        return None

In [3]:
# Uncomment the links below to load more data
# For the full list of data visit: https://huggingface.co/datasets/MongoDB/tech-news-embeddings/tree/refs%2Fconvert%2Fparquet/default/train
parquet_files = [
    "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0000.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0001.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0002.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0003.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0004.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0005.parquet",
]

hf_token = os.getenv('HF_TOKEN')
combined_df = download_and_combine_parquet_files(parquet_files, hf_token)

# Remove the _id column from the initial dataset
combined_df = combined_df.drop(columns=['_id'])

# Remove the initial embedding column, as we are going to create new embeddings with the VoyageAI embedding model
combined_df = combined_df.drop(columns=['embedding'])

In [4]:
combined_df.head()

Unnamed: 0,companyName,companyUrl,published_at,url,title,main_image,description
0,01Synergy,https://hackernoon.com/company/01synergy,2023-05-16 02:09:00,https://www.businesswire.com/news/home/2023051...,onsemi and Sineng Electric Spearhead the Devel...,https://firebasestorage.googleapis.com/v0/b/ha...,(Nasdaq: ON) a leader in intelligent power and...
1,01Synergy,https://hackernoon.com/company/01synergy,2023-05-02 00:07:00,https://elkodaily.com/news/local/adobe-student...,Adobe student receives national Information an...,https://firebasestorage.googleapis.com/v0/b/ha...,ELKO — An eighth grader at Adobe Middle School...
2,01Synergy,https://hackernoon.com/company/01synergy,2023-05-01 22:22:00,https://www.aei.org/technology-and-innovation/...,Modernizing State Services: Harnessing Technol...,https://firebasestorage.googleapis.com/v0/b/ha...,To deliver 21st-century government services Go...
3,01Synergy,https://hackernoon.com/company/01synergy,2023-05-02 13:12:00,https://www.crn.com/news/managed-services/terr...,Terry Richardson On Why He Left AMD GreenPages...,https://firebasestorage.googleapis.com/v0/b/ha...,In February GreenPages acquired Toronto-based ...
4,01Synergy,https://hackernoon.com/company/01synergy,2023-05-15 20:01:00,https://www.benzinga.com/pressreleases/23/05/3...,Synex Renewable Energy Corporation (Formerly S...,https://firebasestorage.googleapis.com/v0/b/ha...,The conference will bring together growth orie...


In [5]:
# Limit the number of documents to 500 for this demo because of the VoyageAI API rate limit
# Read more on VoyageAI rate limits: https://docs.voyageai.com/docs/rate-limits
max_documents = 500

if len(combined_df) > max_documents:
    combined_df = combined_df[:max_documents]

In [7]:
import voyageai
import time

vo = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))

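def get_embedding(text: str) -> list[float]:
    """Return the VoyageAI embedding for text, or an empty list if text is blank."""
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    # Pass the text as a one-element list; the result holds one embedding per input
    result = vo.embed([text], model="voyage-large-2", input_type="document")

    return result.embeddings[0]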

combined_df["embedding"] = combined_df["description"].apply(get_embedding)

combined_df.head()

Unnamed: 0,companyName,companyUrl,published_at,url,title,main_image,description,embedding
0,01Synergy,https://hackernoon.com/company/01synergy,2023-05-16 02:09:00,https://www.businesswire.com/news/home/2023051...,onsemi and Sineng Electric Spearhead the Devel...,https://firebasestorage.googleapis.com/v0/b/ha...,(Nasdaq: ON) a leader in intelligent power and...,"[0.01778620295226574, 0.010809645056724548, 0...."
1,01Synergy,https://hackernoon.com/company/01synergy,2023-05-02 00:07:00,https://elkodaily.com/news/local/adobe-student...,Adobe student receives national Information an...,https://firebasestorage.googleapis.com/v0/b/ha...,ELKO — An eighth grader at Adobe Middle School...,"[0.013895523734390736, 0.004011738579720259, 0..."
2,01Synergy,https://hackernoon.com/company/01synergy,2023-05-01 22:22:00,https://www.aei.org/technology-and-innovation/...,Modernizing State Services: Harnessing Technol...,https://firebasestorage.googleapis.com/v0/b/ha...,To deliver 21st-century government services Go...,"[-0.01352707389742136, -0.0019957569893449545,..."
3,01Synergy,https://hackernoon.com/company/01synergy,2023-05-02 13:12:00,https://www.crn.com/news/managed-services/terr...,Terry Richardson On Why He Left AMD GreenPages...,https://firebasestorage.googleapis.com/v0/b/ha...,In February GreenPages acquired Toronto-based ...,"[0.008943582884967327, -0.016207382082939148, ..."
4,01Synergy,https://hackernoon.com/company/01synergy,2023-05-15 20:01:00,https://www.benzinga.com/pressreleases/23/05/3...,Synex Renewable Energy Corporation (Formerly S...,https://firebasestorage.googleapis.com/v0/b/ha...,The conference will bring together growth orie...,"[0.02538326568901539, 0.0015419272240251303, 0..."
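
Calling the embedding API once per row, as the apply call above does, issues one request per document, which is where the rate limit tends to bite. A possible alternative is to send the descriptions in batches. The sketch below is an illustration under stated assumptions: chunked and embed_in_batches are hypothetical helpers, embed_batch is an injected callable so the logic can be run without an API key, and the default batch size of 128 is an assumption about the per-request limit that should be checked against the VoyageAI documentation.

```python
from typing import Callable, Iterator

def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_batches(
    texts: list,
    embed_batch: Callable[[list], list],
    batch_size: int = 128,  # assumed per-request limit, verify before relying on it
) -> list:
    """Embed texts batch by batch, leaving an empty list for blank texts.

    embed_batch is a hypothetical callable mapping list[str] to list[list[float]].
    """
    embeddings = [[] for _ in texts]
    # Keep (position, text) pairs so each vector lands on the right row
    non_blank = [(i, t) for i, t in enumerate(texts) if t and t.strip()]
    for batch in chunked(non_blank, batch_size):
        vectors = embed_batch([t for _, t in batch])
        for (i, _), vector in zip(batch, vectors):
            embeddings[i] = vector
    return embeddings

# Possible usage with the client from this notebook (commented out: needs an API key):
# combined_df["embedding"] = embed_in_batches(
#     combined_df["description"].tolist(),
#     lambda batch: vo.embed(batch, model="voyage-large-2", input_type="document").embeddings,
# )
```

With 500 documents and a batch size of 128, this would make 4 requests instead of 500.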


<h3>Data ingestion</h3>
To ingest data into the MongoDB database created in the previous steps, carry out the following operations:

1. Connect to the database and collection
2. Clear out the collection of any existing records
3. Convert the Pandas DataFrame of the dataset into dictionaries before ingestion
4. Ingest the dictionaries into MongoDB using a batch operation

In [8]:
import pymongo

def get_mongo_client(mongo_uri):
    """Establish a connection to MongoDB and verify it with a ping."""
    try:
        client = pymongo.MongoClient(mongo_uri)
        # MongoClient connects lazily, so ping the server to confirm it is reachable
        client.admin.command("ping")
        print("Connection to MongoDB successful")
        return client
    except pymongo.errors.ConnectionFailure as e:
        print(f"Connection failed: {e}")
        return None

mongo_uri = os.getenv('MONGO_URI')
if not mongo_uri:
    print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

DB_NAME="tech_news"
COLLECTION_NAME="hacker_noon_tech_news"

db = mongo_client[DB_NAME]
collection = db[COLLECTION_NAME]

Connection to MongoDB successful


In [9]:
# To ensure we are working with a fresh collection
# delete any existing records in the collection
collection.delete_many({})

DeleteResult({'n': 0, 'electionId': ObjectId('7fffffff00000000000002e5'), 'opTime': {'ts': Timestamp(1735490434, 177), 't': 741}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1735490434, 177), 'signature': {'hash': b'\x87\x0e\x12\x95?\xf4\xe7\x80\xb8Z\xb4\xd7tP\xf9iJ\xebJ\x92', 'keyId': 7408089532081373273}}, 'operationTime': Timestamp(1735490434, 177)}, acknowledged=True)

In [10]:

# Data Ingestion
combined_df_json = combined_df.to_dict(orient='records')
collection.insert_many(combined_df_json)

InsertManyResult([ObjectId('67717b8229828202a3860fc2'), ObjectId('67717b8229828202a3860fc3'), ObjectId('67717b8229828202a3860fc4'), ObjectId('67717b8229828202a3860fc5'), ObjectId('67717b8229828202a3860fc6'), ObjectId('67717b8229828202a3860fc7'), ObjectId('67717b8229828202a3860fc8'), ObjectId('67717b8229828202a3860fc9'), ObjectId('67717b8229828202a3860fca'), ObjectId('67717b8229828202a3860fcb'), ObjectId('67717b8229828202a3860fcc'), ObjectId('67717b8229828202a3860fcd'), ObjectId('67717b8229828202a3860fce'), ObjectId('67717b8229828202a3860fcf'), ObjectId('67717b8229828202a3860fd0'), ObjectId('67717b8229828202a3860fd1'), ObjectId('67717b8229828202a3860fd2'), ObjectId('67717b8229828202a3860fd3'), ObjectId('67717b8229828202a3860fd4'), ObjectId('67717b8229828202a3860fd5'), ObjectId('67717b8229828202a3860fd6'), ObjectId('67717b8229828202a3860fd7'), ObjectId('67717b8229828202a3860fd8'), ObjectId('67717b8229828202a3860fd9'), ObjectId('67717b8229828202a3860fda'), ObjectId('67717b8229828202a3860f

<h3>Vector Search</h3>

In [11]:
def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    # get_embedding returns an empty list for blank input, so test for emptiness
    if not query_embedding:
        print("Invalid query or embedding generation failed.")
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5  # Return top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "embedding": 0,  # Exclude the embedding field
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
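
The pipeline above refers to an Atlas Vector Search index named "vector_index" on the "embedding" field, which has to exist before the query returns results. The notebook does not show how that index was created, so the following is only a sketch of one way to do it. The helper names are hypothetical, the cosine similarity setting is an assumption, and the default of 1536 dimensions is an assumption about voyage-large-2 that can be checked with len(combined_df["embedding"][0]). Creating the index programmatically also assumes an Atlas cluster and a PyMongo version whose SearchIndexModel accepts the type argument.

```python
def build_vector_index_definition(path="embedding", num_dimensions=1536, similarity="cosine"):
    """Build an Atlas Vector Search index definition for a single vector field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,  # must match the embedding length
                "similarity": similarity,
            }
        ]
    }

def create_vector_index(collection, name="vector_index", **definition_kwargs):
    """Create the search index on collection (hypothetical helper, needs Atlas)."""
    # Imported lazily so the definition builder works without PyMongo installed
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(**definition_kwargs),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)

# Possible usage with the collection from this notebook (commented out: needs Atlas):
# create_vector_index(collection, num_dimensions=len(combined_df["embedding"][0]))
```

The same definition can instead be pasted into the Atlas UI when creating the index by hand. Index builds are asynchronous, so queries may return no results until the build finishes.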

<h3>Handling user queries</h3>

In [14]:
from anthropic import Anthropic

client = Anthropic()

def handle_user_query(query, collection):
    """Retrieve articles relevant to the query and ask Claude to answer using them."""

    # Retrieve the top matching articles from MongoDB
    get_knowledge = vector_search(query, collection)

    # Flatten the retrieved articles into one context string, one article per line
    search_result = ''
    for result in get_knowledge:
        search_result += (
            f"Title: {result.get('title', 'N/A')}, "
            f"Company Name: {result.get('companyName', 'N/A')}, "
            f"Company URL: {result.get('companyUrl', 'N/A')}, "
            f"Date Published: {result.get('published_at', 'N/A')}, "
            f"Article URL: {result.get('url', 'N/A')}, "
            f"Description: {result.get('description', 'N/A')}, \n"
        )

    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system="You are a Venture Capital Tech Analyst with access to some tech company articles and information. You use the information you are given to provide advice.",
        messages=[
            {"role": "user", "content": "Answer this user query: " + query + " with the following context <context>" + search_result + "</context>. I know it is difficult to highlight a single best stock, but I just want your own opinion about it, without worrying about whether it counts as financial advice."}
        ]
    )

    return (response.content[0].text), search_result
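
The context string built inside handle_user_query can also be produced by a small pure function, which makes it easy to inspect what the model will see without calling the API. This is a sketch rather than the notebook's own code: format_context and CONTEXT_FIELDS are hypothetical names, and the fallback sentence for an empty result list is an assumption about how one might want to handle a search that returns nothing.

```python
# (label, document key) pairs mirroring the fields used in handle_user_query
CONTEXT_FIELDS = [
    ("Title", "title"),
    ("Company Name", "companyName"),
    ("Company URL", "companyUrl"),
    ("Date Published", "published_at"),
    ("Article URL", "url"),
    ("Description", "description"),
]

def format_context(results):
    """Render search results as one line per article, or a fallback when empty."""
    if not results:
        return "No relevant articles were found."
    lines = []
    for result in results:
        parts = [f"{label}: {result.get(key, 'N/A')}" for label, key in CONTEXT_FIELDS]
        lines.append(", ".join(parts))
    return "\n".join(lines)

# Example with a hand-written document standing in for a search result
print(format_context([{"title": "Example article", "companyName": "ExampleCo"}]))
```

Making the empty case explicit lets the prompt say that nothing was retrieved instead of passing an empty context block.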

In [15]:
# Conduct query with retrieval of sources
query = "Give me the best tech stock to invest in and tell me why"
response, source_information = handle_user_query(query, collection)

print(f"Response: {response}")
print(f"\nSource Information: \n{source_information}")

Response: Based on the context provided, while it's challenging to pinpoint a single "best" tech stock, I can offer an opinion on a potentially strong investment in the tech sector:

Microsoft (MSFT) could be considered a solid choice for investment in the tech industry. Here's why:

1. Diversified Portfolio: Microsoft is one of the largest technology companies globally, with a diverse range of products and services spanning personal computing, cloud computing, and enterprise software.

2. Strong Cloud Presence: With Azure, Microsoft has established itself as a leader in the cloud computing market, which continues to grow rapidly.

3. AI Integration: Microsoft is heavily investing in AI technology, integrating it across its product line, including the Bing search engine and Microsoft 365 suite.

4. Steady Growth: The company has shown consistent growth over the years and has a strong financial position.

5. Future-Oriented: Microsoft continues to innovate and adapt to new technologies,