## CA 3, LLMs Spring 2024

- **Name:** Majid Faridfar
- **Student ID:** 810199569

---

# RAG (50 points)

## Install Requirements

In [None]:
%pip install -q langchain
%pip install -q ctransformers
%pip install -q sentence_transformers
%pip install -q datasets
%pip install -q rank_bm25
%pip install -q faiss-gpu
%pip install -q arxiv
%pip install -q pymupdf


## 1. An Overview of LangChain (10 pt)

LangChain is an open-source framework designed to simplify the creation of applications using LLMs. It provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.

In this overview, we will provide a step-by-step guide on how to construct a basic application using LangChain. This application will fetch country-related information from a Large Language Model. For this purpose, we will be utilizing the LLaMa 2 chat 7B as our base model.

In [None]:
from langchain_community.llms import CTransformers

model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50
  }
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### 1.1 GGUF Format (3 pt)

Write a brief paragraph discussing the GGUF format and its benefits. Compare it with transformers library.

> GGUF (commonly expanded as "GPT-Generated Unified Format") is a binary file format introduced by the llama.cpp project as the successor to the older GGML format. It packs the model weights together with the metadata needed to run the model (architecture, tokenizer, hyperparameters) into a single file. Its main benefits are fast loading, support for quantized weights (for example the 8-bit `Q8_0` variant used above), which reduces memory usage considerably, and the ability to run on the CPU while offloading a chosen number of layers to the GPU (the `gpu_layers` option above).
>
> The Transformers library, on the other hand, is a Python framework rather than a file format. It provides a comprehensive suite of tools for working with pre-trained transformer models, including fine-tuning, inference, and deployment, and usually loads full- or half-precision checkpoints through a deep-learning backend such as PyTorch. GGUF is narrower in scope: it targets inference only and is consumed by lightweight runtimes such as llama.cpp or ctransformers. That makes it a practical choice for running LLMs on limited hardware, while Transformers remains the better fit when training or broader model support is needed.
>
> Note: AI helped me answer this question.
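The memory benefit of quantization can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes that `Q8_0` stores weights in blocks of 32 int8 values plus one fp16 scale per block, that every tensor is quantized this way, and that Llama 2 7B has roughly 6.74 billion parameters. Under these assumptions the estimate comes out at about 7.16 GB, which is in line with the size of the `llama-2-7b-chat.Q8_0.gguf` file downloaded above.

```python
# Rough size estimate for a Q8_0-quantized model (assumed block layout).
BLOCK_WEIGHTS = 32      # weights per Q8_0 block (assumption)
BLOCK_BYTES = 2 + 32    # one fp16 scale (2 bytes) + 32 int8 weights


def q8_0_size_bytes(n_params: int) -> float:
    """Approximate bytes needed to store n_params weights in Q8_0."""
    return n_params * BLOCK_BYTES / BLOCK_WEIGHTS


def fp16_size_bytes(n_params: int) -> int:
    """Bytes needed for the same weights in half precision."""
    return n_params * 2


N_PARAMS = 6_738_415_616  # approximate parameter count of Llama 2 7B (assumption)

q8_gb = q8_0_size_bytes(N_PARAMS) / 1e9
fp16_gb = fp16_size_bytes(N_PARAMS) / 1e9
print(f"Q8_0: {q8_gb:.2f} GB")
print(f"fp16: {fp16_gb:.2f} GB")
```

In practice a few small tensors are usually kept in higher precision and the file also carries metadata, so the real size can differ slightly from this estimate.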

### 1.2 Simple Chain (2 pt)

Complete the next cell to create a simple chain that takes the name of a country as input and outputs its capital. To accomplish this, you should utilize the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes to formulate an effective prompt.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
  HumanMessagePromptTemplate.from_template("What is the capital of {country}?"),
  AIMessagePromptTemplate.from_template("")
])

output_parser = StrOutputParser()

simple_chain = prompt | model | output_parser

answer = simple_chain.invoke({"country": "Iran"})

print(answer)


 The capital of Iran is Tehran.


Write about the objectives behind the creation of the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes. What do they actually do? Write a brief description.

> **HumanMessagePromptTemplate**: This class is a template for a message with the human (user) role. It holds a string with placeholders such as `{country}` and, when formatted, produces a `HumanMessage`. It is the natural place to put the user's input in a chat interaction.
>
> **AIMessagePromptTemplate**: This class is a template for a message with the AI (assistant) role and produces an `AIMessage` when formatted. It is typically used to include earlier or example assistant replies in the prompt (for instance in few-shot prompting), or to mark where the assistant's turn begins.

What is the purpose of adding an empty `AIMessagePromptTemplate` at the end of the prompt? What are the consequences of omitting it?

> Adding an empty `AIMessagePromptTemplate` at the end of the prompt marks the start of the assistant's turn. `CTransformers` is a plain text-completion LLM, so the chat prompt is flattened into a single string before it reaches the model. The empty AI message makes that string end with an `AI:` cue, so the model continues by writing the answer itself. It also makes it clear where the human input ends and the AI-generated output begins.
>
> If the empty `AIMessagePromptTemplate` is omitted, the model still generates a response, but the prompt ends right after the human message. The model may then continue the human's turn, repeat the question, or prefix its answer with a role label, which makes the output less predictable and harder to parse.
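The effect of the trailing empty AI message can be illustrated without LangChain. The sketch below is a simplified stand-in for how a list of chat messages is flattened into a single string for a text-completion model; the `Role: content` layout mirrors what I understand LangChain to do, but the function `render_as_text` itself is hypothetical and not part of the library.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)


def render_as_text(messages: List[Message]) -> str:
    """Flatten chat messages into 'Role: content' lines (simplified stand-in)."""
    return "\n".join(f"{role}: {content}" for role, content in messages)


with_cue = render_as_text([
    ("Human", "What is the capital of Iran?"),
    ("AI", ""),  # empty assistant message: the prompt now ends with "AI: "
])
without_cue = render_as_text([
    ("Human", "What is the capital of Iran?"),
])

print(repr(with_cue))
print(repr(without_cue))
```

With the cue, the text handed to the model ends exactly where the assistant's reply is expected to start; without it, the model has to guess whose turn comes next.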

### 1.3 JSON Chain (5 pt)

Now we want to improve the chain to extract data from the model response. Modify the existing prompt to request information about a country's name, population, and major cities in addition to the capital. Additionally, incorporate a `SystemMessagePromptTemplate` to ensure the model's response is structured in JSON format. Keep in mind that a distinct parser is required to parse the JSON output.

In [None]:
from langchain_core.prompts import SystemMessagePromptTemplate
from langchain_core.output_parsers.json import JsonOutputParser

# prompt = # WRITE YOUR ANSWER HERE
prompt = ChatPromptTemplate.from_messages([
    HumanMessagePromptTemplate.from_template('Please complete this form about {country}.\n\n"country" = {country} (No need to change this field)\n"capital" = (Write name of capital city of {country} in this field)\n"population" = (Write number of population of {country} in this field)\n"cities" = (Write name of some major cities of {country} in this field)'),
    AIMessagePromptTemplate.from_template(""),
    SystemMessagePromptTemplate.from_template(JsonOutputParser().get_format_instructions())
])

# output_parser = # WRITE YOUR ANSWER HERE
output_parser = JsonOutputParser()

json_chain = prompt | model | output_parser

answers = json_chain.batch([
  {"country": "Iran"},
  {"country": "USA"},
  {"country": "Japon"},
  {"country": "Nigeria"}
])

for ans in answers:
  # print(ans)
  print(f"{ans['country']}:")
  print(f"  capital: {ans['capital']}")
  print(f"  population: {ans['population']}")
  print(f"  important cities: {ans['cities']}")


Iran:
  capital: Tehran
  population: 830752641
  important cities: ['Tehran', 'Mashhad', 'Isfahan', 'Karaj', 'Qom']
USA:
  capital: Washington D.C.
  population: 3278975176
  important cities: ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego']
Japon:
  capital: Tokyo
  population: 1270880494
  important cities: ['Tokyo', 'Osaka', 'Nagoya', 'Yokohama', 'Kobe']
Nigeria:
  capital: Abuja
  population: 21078954
  important cities: ['Lagos', 'Kano', 'Ibadan', 'Benin City', 'Port Harcourt']
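
Chat models often wrap their JSON in extra text, so it can help to see what a tolerant parsing step might look like. The following is a minimal sketch using only the standard library, not the implementation of `JsonOutputParser`. The helper names `extract_json` and `validate`, the required keys, and the sample reply are all assumptions made for illustration.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = {"country", "capital", "population", "cities"}


def extract_json(text: str) -> Dict[str, Any]:
    """Parse the outermost {...} span found in a model reply."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])


def validate(record: Dict[str, Any]) -> Dict[str, Any]:
    """Fail loudly if the model skipped one of the requested fields."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise KeyError(f"missing fields: {sorted(missing)}")
    return record


# Hypothetical model reply with chatter around the JSON object.
reply = (
    "Sure! Here is the completed form:\n"
    '{"country": "Iran", "capital": "Tehran", '
    '"population": null, "cities": ["Tehran", "Mashhad"]}'
)
record = validate(extract_json(reply))
print(record["capital"], record["cities"])
```

Note that a parser of this kind only checks structure. It cannot tell whether a value such as the population is factually right, so numeric fields returned by the model still need to be verified against a trusted source.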


## 2. Different Types of Retrievers (15 pt)

In this section, we use the mini-bioasq dataset to evaluate different types of retrievers.

In [None]:
import json
from datasets import load_dataset

corpus = load_dataset("rag-datasets/mini-bioasq", "text-corpus", split="passages")
qa_dataset = load_dataset("rag-datasets/mini-bioasq", "question-answer-passages", split="test[:100]")

qa_dataset = qa_dataset.map(lambda data: {
  "relevant_passage_ids": json.loads(data["relevant_passage_ids"])
})

print(corpus)
print(qa_dataset)


Dataset({
    features: ['passage', 'id'],
    num_rows: 40221
})
Dataset({
    features: ['question', 'answer', 'relevant_passage_ids', 'id'],
    num_rows: 100
})


### 2.1 Evaluate Retriever (4 pt)

To effectively compare various retrieval systems, we must define a metric. Complete the `evaluate_retriever` function to measure the accuracy of the retrieved documents. Consider the `relevant_passage_ids` column as the expected documents to be retrieved.

In [None]:
def evaluate_retriever(retriever):
    # WRITE YOUR ANSWER HERE
    """Return the fraction of retrieved passages that are relevant (precision@k)."""
    total_retrieved = 0
    correct_predictions = 0

    for example in qa_dataset:
        # A set gives constant-time membership checks.
        relevant_passage_ids = set(example["relevant_passage_ids"])
        question = example["question"]

        retrieved_passages = retriever.invoke(question)
        retrieved_passage_ids = [rp.metadata['id'] for rp in retrieved_passages]

        # Count every retrieved passage; a hit is one that is marked relevant.
        for passage_id in retrieved_passage_ids:
            if passage_id in relevant_passage_ids:
                correct_predictions += 1
            total_retrieved += 1

    # Guard against an empty dataset or a retriever that returns nothing.
    accuracy = correct_predictions / total_retrieved if total_retrieved else 0.0

    return accuracy
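
To make the metric concrete: the function above divides the number of relevant retrieved passages by the total number of retrieved passages, which is usually called precision@k. The toy example below reproduces that computation on made-up data; `precision_at_k`, the questions, and the passage IDs are all hypothetical.

```python
from typing import Callable, Dict, List


def precision_at_k(
    retrieve: Callable[[str], List[int]],
    examples: List[Dict],
) -> float:
    """Fraction of retrieved passage IDs that are marked relevant."""
    hits = 0
    total = 0
    for ex in examples:
        retrieved_ids = retrieve(ex["question"])
        relevant = set(ex["relevant_passage_ids"])
        hits += sum(1 for pid in retrieved_ids if pid in relevant)
        total += len(retrieved_ids)
    return hits / total if total else 0.0


# Made-up questions and passage IDs, five retrieved passages per question.
toy_examples = [
    {"question": "q1", "relevant_passage_ids": [1, 2, 3]},
    {"question": "q2", "relevant_passage_ids": [7]},
]
toy_index = {"q1": [1, 2, 9, 10, 11], "q2": [7, 8, 9, 10, 11]}

score = precision_at_k(lambda q: toy_index[q], toy_examples)
print(score)  # 3 hits out of 10 retrieved passages
```

One consequence of this choice is that a question with a single relevant passage can never score above 1/5 when five passages are retrieved, so the absolute numbers are bounded by the dataset and are mainly useful for comparing retrievers with the same `k`.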

### 2.2 TF-IDF Retriever (3 pt)

Create a TF-IDF retriever and configure it to return the top 5 relevant documents.

In [None]:
from langchain.schema import Document
from langchain.retrievers import TFIDFRetriever

docs = []
for example in corpus:
    doc = Document(page_content=example['passage'], metadata={"id": example["id"]})
    docs.append(doc)

In [None]:
tfidf_retriever = TFIDFRetriever.from_documents(documents=docs, k=5)
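
For intuition about what a TF-IDF retriever computes, here is a minimal from-scratch sketch using only the standard library. It is not how `TFIDFRetriever` is implemented (that class relies on scikit-learn, as far as I know); the whitespace tokenizer, the smoothed IDF formula, and the toy documents are assumptions chosen to keep the example small.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def build_idf(docs: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf_vector(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse vector: term count times IDF, ignoring unseen terms."""
    tf = Counter(tok for tok in tokenize(text) if tok in idf)
    return {tok: count * idf[tok] for tok, count in tf.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, docs: List[str], k: int = 5) -> List[int]:
    """Indices of the k documents most similar to the query."""
    idf = build_idf(docs)
    q = tfidf_vector(query, idf)
    scores = [(cosine(q, tfidf_vector(d, idf)), i) for i, d in enumerate(docs)]
    ranked = sorted(scores, key=lambda s: s[0], reverse=True)
    return [i for _, i in ranked[:k]]


toy_docs = [
    "insulin regulates blood glucose",
    "aspirin reduces fever and pain",
    "glucose levels rise in diabetes",
]
print(top_k("insulin glucose", toy_docs, k=2))
```

The first document shares both query terms and ranks first; the third shares only the more common term `glucose` and ranks second; the second shares nothing and gets a score of zero.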

### 2.3 Semantic Retriever (5 pt)

Semantic retrievers operate by retrieving documents through embeddings. These systems require an embedding model to convert documents into a vector space, and a vector database to find the closest documents to a query. Construct a semantic retriever that utilizes [`intfloat/e5-base`](https://huggingface.co/intfloat/e5-base) as the embedding model and FAISS for the vector database.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

device = 'cuda'

# embedding_model = # WRITE YOUR ANSWER HERE
embedding_model = HuggingFaceEmbeddings(
    model_name="intfloat/e5-base",
    model_kwargs={'device': device},
    encode_kwargs={'normalize_embeddings': True},
)


In [None]:
import random

random.shuffle(docs)
split_index = 20000
splitted_docs = docs[:split_index]

In [None]:
from langchain_community.vectorstores import FAISS
# from langchain_community.vectorstores import Chroma

vectordb = FAISS.from_documents(documents=docs, embedding=embedding_model)

In [None]:
# semantic_retriever = # WRITE YOUR ANSWER HERE
# semantic_retriever = FAISS.from_documents(docs, embedding_model).as_retriever(search_kwargs={"k":5})
semantic_retriever = vectordb.as_retriever(search_kwargs={'k':5})
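
The nearest-neighbour search behind a semantic retriever can be sketched with plain Python lists. The example below assumes unit-normalized embeddings (as requested by `normalize_embeddings=True` above) and uses tiny hand-made 2-D vectors instead of real model outputs. It also checks that, for unit vectors, ranking by Euclidean distance gives the same order as ranking by cosine similarity, which is why normalizing matters if the index uses L2 distance (my understanding is that this is the default for the LangChain FAISS wrapper, but treat that as an assumption).

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank_by_cosine(query: Vector, docs: List[Vector], k: int) -> List[int]:
    """Top-k document indices by cosine similarity (higher is better)."""
    q = normalize(query)
    units = [normalize(d) for d in docs]
    order = sorted(range(len(units)), key=lambda i: dot(q, units[i]), reverse=True)
    return order[:k]


def rank_by_l2(query: Vector, docs: List[Vector], k: int) -> List[int]:
    """Top-k document indices by squared Euclidean distance (lower is better)."""
    q = normalize(query)
    units = [normalize(d) for d in docs]
    order = sorted(
        range(len(units)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(q, units[i])),
    )
    return order[:k]


# Hand-made 2-D "embeddings"; real e5-base vectors have many more dimensions.
toy_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [0.9, 0.1]

print(rank_by_cosine(toy_query, toy_vecs, k=2))
print(rank_by_l2(toy_query, toy_vecs, k=2))
```

Both rankings agree because, for unit vectors, the squared distance equals 2 minus twice the cosine similarity.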

### 2.4 Compare Retrievers (3 pt)

Calculate the score for each retriever using the `evaluate_retriever` function you previously wrote. In this question, which one outperforms the other? Illustrate a scenario for each retriever in which it outperforms the other.

In [None]:
tfidf_acc = evaluate_retriever(tfidf_retriever)

print(f"TF-IDF accuracy: {tfidf_acc:.2f}")

TF-IDF accuracy: 0.46


In [None]:
semantic_acc = evaluate_retriever(semantic_retriever)

print(f"semantic accuracy: {semantic_acc:.2f}")

semantic accuracy: 0.54


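> As you can see, the semantic retriever outperforms the TF-IDF retriever by 8 percentage points (0.54 vs. 0.46). This is because the semantic retriever embeds the meaning of the question and the documents, so its similarity search still works when the two use different wording, whereas TF-IDF only considers term frequencies and needs exact word overlap. A semantic retriever therefore tends to win when the question paraphrases the passage (for example, synonyms or a different phrasing of the same concept). TF-IDF tends to win when the query hinges on rare, exact terms such as gene names, drug names, or identifiers, which an embedding model may not represent well.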
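
The dependence of lexical methods on exact word overlap can be shown with a tiny experiment. The sketch below uses Jaccard overlap of token sets as a crude proxy for what a term-based retriever can "see"; the function name and both query/passage pairs are hypothetical examples, not items from mini-bioasq.

```python
def token_overlap(query: str, passage: str) -> float:
    """Jaccard overlap between the token sets of a query and a passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    union = q | p
    return len(q & p) / len(union) if union else 0.0


# Same meaning, different words: a term-based retriever has nothing to match.
paraphrase = token_overlap(
    "treatment for heart attack",
    "therapy of myocardial infarction",
)

# Rare exact term: overlap is high, so term matching works well here.
exact = token_overlap("BRCA1 mutation", "BRCA1 mutation carriers")

print(paraphrase, exact)
```

The first pair has zero shared tokens, so only a model that captures meaning can connect the two; the second pair is a case where plain term matching is already sufficient.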

## 3. RAG (25 pt)

In this section, you should use all the concepts you've learned until now to create a complete RAG chain.

### 3.1 Load Documents (2 pt)

Load [RAFT](https://arxiv.org/abs/2403.10131) and [DSPy](https://arxiv.org/abs/2401.12178) papers. You can use `ArxivLoader` to get documents from arXiv.


In [None]:
from langchain_community.document_loaders import ArxivLoader

# docs = # WRITE YOUR ANSWER HERE
raft_paper_id = "2403.10131"
dspy_paper_id = "2310.03714"

docs = [
    ArxivLoader(raft_paper_id).load()[0],
    ArxivLoader(dspy_paper_id).load()[0],
]

### 3.2 Split Documents into Chunks (4 pt)

Usually, each document is constructed from multiple sections, each with a separate topic. It is better to split each document into smaller parts named chunks and search among them instead of actual documents. Write a splitter to create chunks from loaded documents.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# text_splitter = # WRITE YOUR ANSWER HERE
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

chunks = text_splitter.split_documents(docs)

In [None]:
print(len(chunks))

946
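
The roles of `chunk_size` and `chunk_overlap` are easiest to see in a fixed-size sliding window. The sketch below is a deliberate simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, whereas this hypothetical `sliding_chunks` helper cuts purely by character position.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Cut text into windows of chunk_size characters sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(500))  # 500 dummy characters
pieces = sliding_chunks(sample)

print(len(pieces), [len(p) for p in pieces])
print(pieces[0][-20:] == pieces[1][:20])
```

The overlap means that a sentence falling on a chunk boundary still appears whole, or nearly whole, in at least one chunk, at the cost of some duplicated text in the index.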


### 3.3 Retriever (3 pt)

Create a retriever of your choice.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

device = 'cuda'

# embedding_model = # WRITE YOUR ANSWER HERE
embedding_model = HuggingFaceEmbeddings(
    model_name="intfloat/e5-base",
    model_kwargs={'device': device},
    encode_kwargs={'normalize_embeddings': True},
)



In [None]:
# from langchain.retrievers import TFIDFRetriever

# retriever = TFIDFRetriever.from_documents(documents=chunks, k=1)

from langchain_community.vectorstores import FAISS

vectordb = FAISS.from_documents(documents=chunks, embedding=embedding_model)

In [None]:
retriever = vectordb.as_retriever(search_kwargs={'k':2})

### 3.4 Design Prompt (2 pt)

Design a suitable prompt for RAG.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser

# prompt = # WRITE YOUR ANSWER HERE
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template('You are a helpful AI assistant that answers a question based on a given context. You will be given a context and a question that you have to answer based on the provided context. Assume that you only have the knowledge of the provided context.'),
    HumanMessagePromptTemplate.from_template('Necessary context: {context}\nQuestion to be answered: {question}'),
    AIMessagePromptTemplate.from_template(""),
])
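
One detail worth sketching for a RAG prompt is how the retrieved documents end up in `{context}`. If a list of document objects is substituted directly, the template receives their full string representation, including metadata such as titles and abstracts. A common remedy is a small helper that joins only the text of each chunk. The sketch below illustrates the difference with a hypothetical `Doc` dataclass standing in for a LangChain `Document`; `format_docs` is an illustrative helper, not part of the notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_docs(docs: List[Doc]) -> str:
    """Keep only the chunk text, separated by blank lines."""
    return "\n\n".join(d.page_content for d in docs)


TEMPLATE = "Necessary context: {context}\nQuestion to be answered: {question}"

retrieved = [
    Doc("chunk one", {"Title": "Some paper", "Summary": "A long abstract ..."}),
    Doc("chunk two", {"Title": "Some paper", "Summary": "A long abstract ..."}),
]

raw_prompt = TEMPLATE.format(context=retrieved, question="What is DSPy?")
clean_prompt = TEMPLATE.format(context=format_docs(retrieved), question="What is DSPy?")

print(len(raw_prompt), len(clean_prompt))
print(clean_prompt)
```

With a 2048-token context window, keeping metadata out of the prompt leaves more room for actual chunk text and avoids distracting the model with repeated abstracts.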

### 3.5 RAG Chain (3 pt)

Design a question from the documents and get the retriever and RAG output for that question.

In [None]:
from langchain_community.llms import CTransformers

model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50,
    "context_length" : 2048,
  }
)


In [None]:
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

# question = # ANSWER HERE
question = "What does it mean to be understandable in the context of consensus algorithms?"

# retrieved_doc = # ANSWER HERE
retrieved_doc = retriever.invoke(question)

# answer = # ANSWER HERE
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
print(f"answer:\n{answer}")

retrieved document:
[Document(page_content='and in a thread-safe fashion throughout execution. The program’s metric is used to filter for multi-', metadata={'Published': '2023-10-05', 'Title': 'DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines', 'Authors': 'Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts', 'Summary': 'The ML community is rapidly exploring techniques for prompting language\nmodels (LMs) and for stacking them into pipelines that solve complex tasks.\nUnfortunately, existing LM pipelines are typically implemented using hard-coded\n"prompt templates", i.e. lengthy strings discovered via trial and error. Toward\na more systematic approach for developing and optimizing LM pipelines, we\nintroduce DSPy, a programming model that abstracts LM pipelines as text\ntransformation graphs, i.e. imp

### 3.6 Out of Domain Question (4 pt)

Ask a question that is not related to the documents. Does the model answer it? Change your prompt to force the model to say "I don't know" when someone asks an out-of-domain question.

> As you can see, the model answers the question correctly, even though nothing about the color of the sky is mentioned in the context. So we change the prompt to make it say "I don't know".

In [None]:
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

# question = # ANSWER HERE
question = "What is the color of sky?"

# retrieved_doc = # ANSWER HERE
retrieved_doc = retriever.invoke(question)

# answer = # ANSWER HERE
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
print(f"answer:\n{answer}")

retrieved document:
[Document(page_content='3\n4 # As we assumes no answer , we use ‘answer_passage_match ‘ to filter ungrounded answers.\n5 finetuning_teleprompter = BootstrapFinetune(metric=dspy.evaluate.answer_passage_match)\n6', metadata={'Published': '2023-10-05', 'Title': 'DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines', 'Authors': 'Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts', 'Summary': 'The ML community is rapidly exploring techniques for prompting language\nmodels (LMs) and for stacking them into pipelines that solve complex tasks.\nUnfortunately, existing LM pipelines are typically implemented using hard-coded\n"prompt templates", i.e. lengthy strings discovered via trial and error. Toward\na more systematic approach for developing and optimizing LM pipelines, we\nintroduce DSPy, a 

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser

# prompt = # WRITE YOUR ANSWER HERE
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template('You are a helpful AI assistant that answers a question based on a given context. You will be given a context and a question that you have to answer based on the provided context. Assume that you only have the knowledge of the provided context and if the necessary information is not provided, say \"I don\'t know\".'),
    HumanMessagePromptTemplate.from_template('Necessary context: {context}\nQuestion to be answered: {question}'),
    AIMessagePromptTemplate.from_template(""),
])
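
Besides instructing the model through the prompt, an out-of-domain question can also be caught before the model is called, by checking how similar the best retrieved chunk is to the question. The sketch below shows that idea as a complement to the prompt above, not as what this notebook does. The `guarded_answer` function, the `min_score` threshold, the toy search function, and its scores are all hypothetical, and a sensible threshold would have to be tuned on real data.

```python
from typing import Callable, List, Tuple

FALLBACK = "I don't know"

SearchFn = Callable[[str], List[Tuple[str, float]]]  # returns (chunk text, similarity)
GenerateFn = Callable[[str, str], str]               # takes (context, question)


def guarded_answer(
    question: str,
    search: SearchFn,
    generate: GenerateFn,
    min_score: float = 0.5,
) -> str:
    """Answer only when at least one chunk is similar enough to the question."""
    hits = [(text, score) for text, score in search(question) if score >= min_score]
    if not hits:
        return FALLBACK
    context = "\n\n".join(text for text, _ in hits)
    return generate(context, question)


def toy_search(question: str) -> List[Tuple[str, float]]:
    # Made-up similarity scores; higher means more similar.
    if "DSPy" in question:
        return [("DSPy is a programming model for LM pipelines.", 0.82)]
    return [("finetuning_teleprompter = BootstrapFinetune(...)", 0.21)]


def toy_generate(context: str, question: str) -> str:
    return f"Based on the context: {context}"


print(guarded_answer("What is the color of sky?", toy_search, toy_generate))
print(guarded_answer("What is DSPy?", toy_search, toy_generate))
```

A guard of this kind does not depend on the model following instructions, but it only works if the similarity scores separate in-domain from out-of-domain questions reasonably well.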

In [None]:
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

# question = # ANSWER HERE
question = "What is the color of sky?"

# retrieved_doc = # ANSWER HERE
retrieved_doc = retriever.invoke(question)

# answer = # ANSWER HERE
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
print(f"answer:\n{answer}")

retrieved document:
[Document(page_content='3\n4 # As we assumes no answer , we use ‘answer_passage_match ‘ to filter ungrounded answers.\n5 finetuning_teleprompter = BootstrapFinetune(metric=dspy.evaluate.answer_passage_match)\n6', metadata={'Published': '2023-10-05', 'Title': 'DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines', 'Authors': 'Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts', 'Summary': 'The ML community is rapidly exploring techniques for prompting language\nmodels (LMs) and for stacking them into pipelines that solve complex tasks.\nUnfortunately, existing LM pipelines are typically implemented using hard-coded\n"prompt templates", i.e. lengthy strings discovered via trial and error. Toward\na more systematic approach for developing and optimizing LM pipelines, we\nintroduce DSPy, a 

### 3.7 The Effect of Temperature (7 pt)

RAG performance is highly dependent on model temperature. Explain whether a low or a high temperature is better. For the same prompt, compare the output of the model with low and high temperature.

> Low Temperature: A lower temperature (closer to 0) makes the model's output more deterministic. It makes the model more likely to choose the most probable tokens and less likely to sample from less probable ones. This is beneficial when you want the model to produce more predictable and less varied results, which is important for tasks requiring high precision.
>
> High Temperature: A higher temperature makes the model's output more diverse. It gives less probable tokens a higher chance of being chosen. This can be useful when you want the model to generate more varied and creative outputs, which is important for tasks where diversity is valued over precision.
>
> For RAG, a low temperature is usually the better choice: the answer should stay close to the retrieved context, and sampling less probable tokens increases the risk of drifting away from it.
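
The effect described above comes from dividing the model's logits by the temperature before applying the softmax. The sketch below shows this on three made-up logits, using the same two temperature values as the cells that follow (0.1 and 2); the logits are hypothetical and do not come from the model.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into sampling probabilities at the given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

low = softmax_with_temperature(logits, 0.1)
high = softmax_with_temperature(logits, 2.0)

print([round(p, 4) for p in low])
print([round(p, 4) for p in high])
```

At temperature 0.1 almost all probability mass sits on the top token, so sampling is close to greedy decoding; at temperature 2 the distribution is much flatter and the less likely tokens are picked far more often.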

In [None]:
from langchain_community.llms import CTransformers

model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50,
    "context_length" : 2048,
    "temperature" : 0.1 # low
  }
)


In [None]:
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

# question = # ANSWER HERE
question = "What does it mean to be understandable in the context of consensus algorithms?"

# retrieved_doc = # ANSWER HERE
retrieved_doc = retriever.invoke(question)

# answer = # ANSWER HERE
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
print(f"answer:\n{answer}")

retrieved document:
[Document(page_content='and in a thread-safe fashion throughout execution. The program’s metric is used to filter for multi-', metadata={'Published': '2023-10-05', 'Title': 'DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines', 'Authors': 'Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts', 'Summary': 'The ML community is rapidly exploring techniques for prompting language\nmodels (LMs) and for stacking them into pipelines that solve complex tasks.\nUnfortunately, existing LM pipelines are typically implemented using hard-coded\n"prompt templates", i.e. lengthy strings discovered via trial and error. Toward\na more systematic approach for developing and optimizing LM pipelines, we\nintroduce DSPy, a programming model that abstracts LM pipelines as text\ntransformation graphs, i.e. imp

In [None]:
from langchain_community.llms import CTransformers

model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50,
    "context_length" : 2048,
    "temperature" : 2 # high
  }
)


In [None]:
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

# question = # ANSWER HERE
question = "What does it mean to be understandable in the context of consensus algorithms?"

# retrieved_doc = # ANSWER HERE
retrieved_doc = retriever.invoke(question)

# answer = # ANSWER HERE
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
print(f"answer:\n{answer}")

retrieved document:
[Document(page_content='and in a thread-safe fashion throughout execution. The program’s metric is used to filter for multi-', metadata={'Published': '2023-10-05', 'Title': 'DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines', 'Authors': 'Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts', 'Summary': 'The ML community is rapidly exploring techniques for prompting language\nmodels (LMs) and for stacking them into pipelines that solve complex tasks.\nUnfortunately, existing LM pipelines are typically implemented using hard-coded\n"prompt templates", i.e. lengthy strings discovered via trial and error. Toward\na more systematic approach for developing and optimizing LM pipelines, we\nintroduce DSPy, a programming model that abstracts LM pipelines as text\ntransformation graphs, i.e. imp

> As you can see, the second response is much more deterministic and shorter, while the first one is longer and more diverse, discussing several aspects.