## Init

Get your API token, then run:
```
! nomic login
```

Then run it again with your generated API token:
```
! nomic login <token>
```

In [None]:
! nomic login
! nomic login token

## Document Loading

Let's test 3 interesting blog posts.

In [19]:
from langchain_community.document_loaders import WebBaseLoader

urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]

docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

## Splitting 

### Larger Context Models

There are many interesting considerations around [text splitting](https://www.youtube.com/watch?v=8OJC21T2SL4). 

Many approaches to date have focused on very granular splitting by semantic groups or sub-sections, which is challenging to do well.

The intuition was to retrieve only the minimal context needed to address the question, driven by:

(1) Embedding models with smaller context size

(2) LLMs with smaller context size

This means we need high `precision` in retrieval: 

> We reject as many irrelevant chunks (false positives) as possible.

Thus, all chunks we send to the model are relevant, but:

(1) We can suffer lower `recall` (leave out important details) 

(2) We incur higher splitting complexity
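To make the trade-off concrete, here is a minimal sketch that computes precision and recall for two retrieval results. The chunk ids and the set of relevant chunks are made up for illustration; nothing here comes from the blog posts loaded above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved chunk ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical chunk ids: chunks 1, 2 and 3 are all needed to answer the question.
relevant = {1, 2, 3}

# Granular retrieval returns only chunk 1: nothing irrelevant, but details are missed.
narrow = precision_recall({1}, relevant)

# Coarser retrieval returns more chunks: everything relevant, plus some noise.
broad = precision_recall({1, 2, 3, 4, 5}, relevant)

print(narrow)  # (1.0, 0.3333333333333333)
print(broad)   # (0.6, 1.0)
```

The narrow result has perfect precision but misses two of the three relevant chunks, while the broad result recovers all of them at the cost of two irrelevant ones.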

--- 

Embedding models are starting to support larger context, as discussed [here](https://hazyresearch.stanford.edu/blog/2024-01-11-m2-bert-retrieval).

Nomic's release supports a token limit above 8k locally (GPU today, CPU soon) and via API (soon).

And LLMs are seeing context window expansion, as seen with [GPT-4 128k](https://openai.com/blog/new-models-and-developer-products-announced-at-devday) or Yarn LLaMA2 [here](https://x.com/mattshumer_/status/1720115354884514042?s=20), [here](https://ollama.ai/library/yarn-mistral). 

Here, we can try a workflow that is less concerned with `precision`:

(1) We use larger context chunks and embeddings to promote `recall` 

(2) We use larger context LLMs that can "sift" through less relevant information to get our answer

Let's pick a few interesting blog posts and see how long each document is, counting tokens with [TikToken](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).

In [21]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=10000, 
                                                            chunk_overlap=100)
doc_splits = text_splitter.split_documents(docs_list)
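As a rough illustration of how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window splitter over a token list. This is a sketch under assumptions, not how `CharacterTextSplitter` is implemented: the real splitter first breaks text on a separator and then merges pieces up to the size limit. The function name `split_with_overlap` and the toy tokens are hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks

# Toy "tokens" standing in for real tokenizer output
tokens = [f"t{i}" for i in range(10)]
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last token of one chunk reappears as the first token of the next.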

In [22]:
import tiktoken

# gpt-3.5-turbo maps to the cl100k_base encoding
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
for d in doc_splits:
    print("The document is %s tokens" % len(encoding.encode(d.page_content)))

The document is 8759 tokens
The document is 811 tokens
The document is 7083 tokens
The document is 9029 tokens
The document is 3488 tokens


## Index 

We index the splits with Nomic embeddings, documented [here](https://docs.nomic.ai/reference/endpoints/nomic-embed-text). 

In [36]:
import os
from langchain_community.vectorstores import Chroma
from langchain_nomic.embeddings import NomicEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

In [42]:
# Add to vectorDB
api_key = os.getenv("NOMIC_API_KEY")
# api_key = "xxx2"
vectorstore = Chroma.from_documents(
    documents=doc_splits,  # the splits produced in the Splitting section
    collection_name="rag-chroma",
    embedding=NomicEmbeddings(model='nomic-embed-text-v1',
                              nomic_api_key=api_key), # TO FIX 
)
retriever = vectorstore.as_retriever()
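Conceptually, a retriever embeds the query and returns the indexed chunks whose vectors are closest to it. The sketch below shows that idea with plain Python. The 3-dimensional vectors are made up, and cosine similarity is an assumption chosen for illustration; the distance metric actually used depends on how the vector store is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up vectors standing in for real embeddings of three chunks
indexed = [
    ("agent memory", [0.9, 0.1, 0.0]),
    ("prompt engineering", [0.1, 0.9, 0.0]),
    ("adversarial attacks", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.2, 0.0]  # hypothetical embedding of a question about agents

results = top_k(query, indexed, k=2)
print([text for text, _ in results])
```

With larger chunks, each returned item carries more surrounding context, which is what the workflow above relies on to improve recall.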


## RAG Chain

To test locally, we can use Ollama ([here](https://x.com/mattshumer_/status/1720115354884514042?s=20), [here](https://ollama.ai/library/yarn-mistral)):
```
ollama pull yarn-mistral
```

Of course, we can also run [GPT-4 128k](https://openai.com/blog/new-models-and-developer-products-announced-at-devday). 

In [33]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatOllama

# Prompt 
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

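# Option 1: local LLM served by Ollama
ollama_llm = "yarn-mistral"
model = ChatOllama(model=ollama_llm)

# Option 2: LLM API. This assignment overrides the local model above;
# comment it out to run locally.
model = ChatOpenAI(temperature=0, model="gpt-4-1106-preview")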

# Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [34]:
# Question
chain.invoke("What are the types of agent memory?")

'In the context provided, the types of agent memory mentioned are:\n\n1. Sensory Memory: This is the earliest stage of memory, providing the ability to retain impressions of sensory information after the original stimuli have ended. It typically only lasts for a few seconds.\n\n2. Short-Term Memory (STM) or Working Memory: It stores information that is currently being used to carry out complex cognitive tasks such as learning and reasoning. Short-term memory has a limited capacity and duration.\n\n3. Long-Term Memory (LTM): This type of memory can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. It is divided into two subtypes:\n   - Explicit / Declarative Memory: Memory of facts and events that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).\n   - Implicit / Procedural Memory: Unconscious memory involving skills and routines perform

Some considerations are noted in the [needle in a haystack analysis](https://twitter.com/GregKamradt/status/1722386725635580292?lang=en):

* LLMs may struggle to retrieve information from a large context, depending on where that information is placed.
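A minimal sketch of how such a test can be set up: a known fact (the "needle") is inserted at different relative depths of a long filler context. The helper `insert_needle`, the filler sentences, and the needle text are all hypothetical, and no LLM is called here; in a real run each context would be sent to the model with a question about the needle.

```python
def insert_needle(filler_sentences, needle, depth):
    """Place needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)

# Made-up filler and needle for illustration
filler = [f"Filler sentence number {i}." for i in range(8)]
needle = "The secret code is 1234."

# One context per placement depth
contexts = {depth: insert_needle(filler, needle, depth) for depth in (0.0, 0.5, 1.0)}

for depth, context in contexts.items():
    # Report the character offset of the needle in each context
    print(depth, context.index(needle))
```

Comparing answer quality across depths and context lengths shows whether a given model is sensitive to where the relevant passage sits.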