# **Setup of the environment**
---
## **Installation**
First, install the required packages. The examples below also assume that a local Ollama server is running and that the `tinyllama` model has already been pulled.


In [12]:
%pip install --quiet langchain
%pip install --quiet langchain-community
%pip install --quiet beautifulsoup4
%pip install --quiet langchain-ollama
%pip install --quiet chromadb

Note: you may need to restart the kernel to use updated packages.




# **Basic steps**
---
Import the required libraries.



In [None]:
# Import from the split packages (`langchain_core`, `langchain_community`);
# the old top-level `langchain` import paths are deprecated.
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
import shutil

### **Read and parse the website data**

To read the website data as a document, we will use the `WebBaseLoader` from LangChain. 




In [None]:
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Firefly_Aerospace")
docs = loader.load()


- We use Python's `str.split()` method to extract the relevant portion of the text.
- The extracted text is then converted back to LangChain's `Document` format.

In [None]:
# Extract the text from the website data document
text_content = docs[0].page_content

# Only the text between the substrings "is an American private aerospace firm
# based in" and "Firefly headquarters and factory are located in" is relevant
# for this tutorial. Python's `split()` selects that portion. Note that the
# first split raises an IndexError if the opening substring is not on the page.
text_content_1 = text_content.split("is an American private aerospace firm based in", 1)[1]
final_text = text_content_1.split("Firefly headquarters and factory are located in", 1)[0]

# Convert the text to LangChain's `Document` format
docs = [Document(page_content=final_text, metadata={"source": "local"})]
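
The `split()` approach above fails with an `IndexError` when the opening substring is missing, and silently keeps the rest of the page when the closing substring is missing. As a minimal sketch of a stricter alternative, the helper below reports both cases explicitly. The name `extract_between` and the sample text are hypothetical and are not part of LangChain or of the original notebook.

```python
def extract_between(text: str, start_marker: str, end_marker: str) -> str:
    """Return the text between the first start_marker and the next end_marker."""
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        raise ValueError(f"start marker not found: {start_marker!r}")
    middle, found_end, _ = rest.partition(end_marker)
    if not found_end:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return middle


# Toy input standing in for the downloaded page text.
sample = "Header. BEGIN relevant part END footer."
print(extract_between(sample, "BEGIN", "END").strip())
```

With the real page, the two substrings used in the cell above would be passed as `start_marker` and `end_marker`.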

### **Initialize Ollama's embedding model**
To create embeddings from the website data, we use an embedding model served by Ollama.

To use this model, import `OllamaEmbeddings` from the `langchain_ollama` package. More information about embedding models is available in the Ollama documentation.

In [None]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="tinyllama")
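
To make the idea of an embedding more concrete, here is a small sketch of how two embedding vectors can be compared. It uses made-up three-dimensional vectors rather than real Ollama output, and cosine similarity is only one common choice of metric; the metric a given vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration; real embeddings have many more dimensions.
rocket = [0.9, 0.1, 0.0]
launcher = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

# Texts with related meaning should end up with more similar vectors.
print(cosine_similarity(rocket, launcher) > cosine_similarity(rocket, banana))
```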

### **Store the data using Chroma**
Next, create a Chroma vector database from the website data.

In [None]:
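# Clear any existing database. `ignore_errors=True` avoids a failure on the
# first run, when the directory does not exist yet.
shutil.rmtree("./chroma_db", ignore_errors=True)

# Save to disk
vectorstore = Chroma.from_documents(
    documents=docs,                   # Data
    embedding=embeddings,             # Embedding model
    persist_directory="./chroma_db",  # Directory to save data
)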

### **Create a retriever using Chroma**

- A retriever is created to fetch relevant website data from the newly created Chroma vector store.
- This retriever is later used to pass the retrieved documents to the LLM as additional context for answering the user's queries.


In [None]:
# Load from disk
vectorstore_disk = Chroma(
    persist_directory="./chroma_db",  # Directory of db
    embedding_function=embeddings,    # Embedding model
)
# Get the Retriever interface for the store to use later.
# When an unstructured query is given to a retriever it will return documents.
# Read more about retrievers in the following link.
# https://python.langchain.com/docs/modules/data_connection/retrievers/
#
# Since only 1 document is stored in the Chroma vector store, search_kwargs `k`
# is set to 1 to decrease the `k` value of chroma's similarity search from 4 to
# 1. If you don't pass this value, you will get a warning.
retriever = vectorstore_disk.as_retriever(search_kwargs={"k": 1})

# Check that the retriever works by fetching the documents related to the
# phrase 'Firefly Aerospace'. If the length is greater than zero, the
# retriever is functioning.
print(len(retriever.invoke("Firefly Aerospace")))
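
The role of `k` can be illustrated without a vector database. The sketch below is a toy in-memory retriever: it ranks a few strings against a query and returns the best `k`. The word-count "embedding" and the names `embed`, `similarity`, and `top_k` are assumptions made for this illustration; they are not how Ollama or Chroma work internally.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (not what Ollama produces)."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: similarity(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


documents = [
    "the rocket reached orbit",
    "the recipe needs two eggs",
    "the lander touched down on the moon",
]
print(top_k("which rocket reached orbit", documents, k=1))
```

Raising `k` returns more documents, which gives the LLM more context at the cost of a longer prompt.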

## **Generator**
- The generator prompts the LLM for an answer when the user asks a question.
- The retriever created from the Chroma vector database is used to
  add relevant context to the user's query.


In [None]:
from langchain_ollama import OllamaLLM
# The model name can be changed to any model available in your Ollama installation.
llm = OllamaLLM(model="tinyllama")

### **Create prompt templates**
LangChain's `PromptTemplate` is used to build the prompt that asks the LLM to answer a question.

In [None]:
# Prompt template to query the LLM
llm_prompt_template = """You are an assistant for question-answering tasks.
Use the following context to answer the question.
If you don't know the answer, just say that you don't know.
Use eight sentences maximum and keep the answer concise.\n
Question: {question} \nContext: {context} \nAnswer:"""

llm_prompt = PromptTemplate.from_template(llm_prompt_template)

print(llm_prompt)
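
The placeholders `{question}` and `{context}` are filled in when the chain runs. As a rough analogy only, plain Python string formatting shows the same substitution. The shortened template and the sample values below are assumptions made for this sketch and are not taken from the notebook.

```python
# Shortened stand-in for the prompt template defined above.
template = "Question: {question}\nContext: {context}\nAnswer:"

# Hypothetical values; in the real chain the context comes from the retriever.
filled = template.format(
    question="Where is the company based?",
    context="(retrieved website text goes here)",
)
print(filled)
```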

### **Creating a stuff documents chain**
- LangChain provides chains for linking LLMs together with each other or with other components to build complex applications.

In [None]:
# Combine data from documents to readable string format.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Create stuff documents chain using LCEL.
#
# This is called a chain because you are chaining together different elements
# with the LLM. In the following example, to create the stuff chain, you will
# combine the relevant context from the website data matching the question, the
# LLM model, and the output parser together like a chain using LCEL.
#
# The chain implements the following pipeline:
# 1. Extract the website data relevant to the question from the Chroma
#    vector store and save it to the variable `context`.
# 2. `RunnablePassthrough` forwards the `question` unchanged when the chain
#    is invoked.
# 3. The `context` and `question` are then passed to the prompt where they
#    are populated in the respective variables.
# 4. This prompt is then passed to the LLM (`tinyllama`).
# 5. Output from the LLM is passed through an output parser
#    to structure the model's response.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | llm_prompt
    | llm
    | StrOutputParser()
)
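
The five steps listed in the comments above can be mimicked with ordinary function composition. The sketch below is not LCEL and does not call LangChain or Ollama: `pipe`, `fake_retrieve`, and `fake_llm` are hypothetical stand-ins that only show how data flows from one stage to the next.

```python
from functools import reduce


def pipe(*steps):
    """Compose steps left to right, similar in spirit to LCEL's `|`."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def fake_retrieve(question):
    """Stand-in for the retriever: returns fixed text chunks."""
    return ["Doc one.", "Doc two."]


def format_docs(docs):
    """Join the retrieved chunks into a single context string."""
    return "\n\n".join(docs)


def gather_inputs(question):
    """Steps 1 and 2: build the context and pass the question through."""
    return {"context": format_docs(fake_retrieve(question)), "question": question}


def build_prompt(inputs):
    """Step 3: fill the template with the context and the question."""
    return "Question: {question}\nContext: {context}\nAnswer:".format(**inputs)


def fake_llm(prompt):
    """Step 4: stand-in for the model call."""
    return f"(model output for a {len(prompt)}-character prompt)"


def parse_output(raw):
    """Step 5: stand-in for the output parser; returns a plain string."""
    return str(raw)


toy_chain = pipe(gather_inputs, build_prompt, fake_llm, parse_output)
print(toy_chain("What does the company build?"))
```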

## **Prompt the model**
Now we can query the LLM by passing any question to the `invoke()` method of the stuff documents chain created previously.

In [None]:
response = rag_chain.invoke("Please give me a summary?")
print(response)

# **Conclusion**
We have successfully created an LLM application that answers questions using data from a website with the help of Ollama, LangChain, and Chroma.