In [3]:
from datasets import load_dataset
import time

### Load Datasets

Load a collection of AI research arXiv papers from Hugging Face. The papers have been split into chunks, and each chunk is provided as a dictionary containing the chunk ID (which includes the arXiv ID), the paper title, the chunk text, the ID of the chunk that follows it (post-chunk ID), and the arXiv IDs of the papers it references.

In [4]:
dataset = load_dataset("jamescalam/ai-arxiv2-semantic-chunks", split="train")

### Semantic Router

Semantic Router is a superfast decision-making layer for your LLMs and agents. 

https://github.com/aurelio-labs/semantic-router


In [5]:
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder
from settings import config

pinecone_api_key = config["pinecone_api_key"]
serpapi_key = config["serpapi_key"]

In [6]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

In [7]:
encoder = OpenAIEncoder(name="text-embedding-3-small")

In [8]:
from pinecone import Pinecone
from pinecone import ServerlessSpec

In [9]:
pc = Pinecone(api_key=pinecone_api_key)

In [10]:
spec = ServerlessSpec(cloud="aws", region="us-east-1")

In [11]:
index_name = "gpt-4o-research-agent"
if index_name not in pc.list_indexes().names():
    # if it doesn't exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # must match the encoder's embedding size (1536 for text-embedding-3-small)
        metric='dotproduct',
        spec=spec
    )
    # wait for the index to be ready
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)
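
The polling loop above waits indefinitely if the index never reports ready. A minimal sketch of a bounded wait is shown here; `wait_until` and its parameters are hypothetical names introduced for illustration and are not part of the Pinecone client. The `is_ready` callable stands in for `pc.describe_index(index_name).status["ready"]`.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Hypothetical helper: returns True if the condition was met in time,
    False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In this notebook it could be called as `wait_until(lambda: pc.describe_index(index_name).status["ready"])`, raising or retrying if it returns `False`.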

In [12]:
index = pc.Index(index_name)
time.sleep(1)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1000}},
 'total_vector_count': 1000}

In [13]:
from tqdm.auto import tqdm

In [14]:
data = dataset.to_pandas().iloc[:1000]

In [15]:
list(data['references'][0])

['1905.07830']

In [16]:
batch_size = 128
for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    batch = data[i:i_end].to_dict(orient="records")
    # build the metadata stored alongside each vector
    metadata = [{
        "title": r["title"],
        "content": r["content"],
        "arxiv_id": r["arxiv_id"],
        "references": list(r["references"])
    } for r in batch]
    # reuse the chunk ids that ship with the dataset
    ids = [r["id"] for r in batch]
    content = [r["content"] for r in batch]
    # embed the chunk text
    embeds = encoder(content)
    # upsert (id, vector, metadata) tuples to Pinecone
    index.upsert(vectors=list(zip(ids, embeds, metadata)))

  0%|          | 0/8 [00:00<?, ?it/s]
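
The upsert loop walks the DataFrame in fixed-size slices. With 1,000 rows and a batch size of 128 this gives eight batches, which matches the `0/8` in the progress bar, and the last batch is shorter than the others. A small sketch of the same slicing arithmetic, where `batch_bounds` is a hypothetical helper used only for illustration:

```python
from typing import Iterator, Tuple


def batch_bounds(n_items: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs covering n_items in steps of batch_size."""
    for start in range(0, n_items, batch_size):
        # clamp the end index so the final batch does not run past the data
        yield start, min(n_items, start + batch_size)


bounds = list(batch_bounds(1000, 128))
print(len(bounds))   # 8
print(bounds[-1])    # (896, 1000)
```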

In [19]:
import requests
arxiv_id = "2401.04088"
res = requests.get(f"https://export.arxiv.org/abs/{arxiv_id}")

In [20]:
import re
abstract_pattern = re.compile(
    r'<blockquote class="abstract mathjax">\s*<span class="descriptor">Abstract:</span>\s*(.*?)\s*</blockquote>',
    re.DOTALL
)

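re_match = abstract_pattern.search(res.text)

# search() returns None when the page markup does not match the pattern
print(re_match.group(1) if re_match else "Abstract not found")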


We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both 
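
The pattern depends on the markup of the arXiv abstract page, so it helps to wrap it in a function that returns `None` instead of raising when nothing matches. The sketch below reuses the same regular expression. `extract_abstract` is a hypothetical helper name, and `sample_html` is made-up markup for illustration, not a real arXiv response.

```python
import re
from typing import Optional

ABSTRACT_PATTERN = re.compile(
    r'<blockquote class="abstract mathjax">\s*'
    r'<span class="descriptor">Abstract:</span>\s*(.*?)\s*</blockquote>',
    re.DOTALL,  # let "." span newlines, since abstracts cover several lines
)


def extract_abstract(html: str) -> Optional[str]:
    """Return the abstract text from an arXiv abstract page, or None if absent."""
    match = ABSTRACT_PATTERN.search(html)
    return match.group(1) if match else None


# Made-up markup for illustration only
sample_html = """
<blockquote class="abstract mathjax">
  <span class="descriptor">Abstract:</span>
  An example abstract
  spanning two lines.
</blockquote>
"""

abstract = extract_abstract(sample_html)
missing = extract_abstract("<html><body>No abstract here</body></html>")
```

Here `abstract` holds the text between the descriptor and the closing tag, and `missing` is `None`, which the caller can check before using the result.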

In [21]:
from serpapi import GoogleSearch
# https://serpapi.com/manage-api-key
serpapi_params = {
    "engine": "google",
    "api_key": serpapi_key
}

search = GoogleSearch({
    **serpapi_params,
    "q": "coffee"
})

# not every response or result carries every field, so fall back to empty values
results = search.get_dict().get("organic_results", [])
contexts = "\n---\n".join(
    ["\n".join([x["title"], x.get("snippet", ""), x["link"]]) for x in results]
)
print(contexts)

Coffee
Coffee is a beverage brewed from roasted, ground coffee beans. Darkly colored, bitter, and slightly acidic, coffee has a stimulating effect on humans
https://en.wikipedia.org/wiki/Coffee
---
Starbucks Coffee Company
More than just great coffee. Explore the menu, sign up for Starbucks® Rewards, manage your gift card and more.
https://www.starbucks.com/
---
r/Coffee
thread where you can share what you are brewing or ask for bean recommendations. This is a place to share and talk about your favorite coffee roasters or beans.
https://www.reddit.com/r/Coffee/
---
Coffee
Shop Dunkin'® coffee, Folgers® coffee, Café Bustelo® coffee, and more. No matter how you like your coffee, we've got you covered! Choose from subtle to bold ...
https://shop.smucker.com/collections/coffee?srsltid=AfmBOor5gqPO7OO9zc7mlXq6vddgEaioHlD9fp3TExGhPxKExpjsYFxq
---
Catalina Coffee: Order Online - Houston
Catalina Coffee 2201 Washington Ave Houston, Texas 77007 (713) 861-8448 info@catalinacoffeeshop.com Get dir