## Prototype for Document Question Answering Using Llama and LangChain

Importing the packages and setting the model path

In [1]:
from llama_cpp import Llama
from langchain import PromptTemplate
from langchain.llms import LlamaCpp
from langchain.chains import LLMChain
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings


MODEL_PATH = "./models/llama-7b.ggmlv3.q4_K_M.bin"


Loading the document on which question answering is to be performed. `TextLoader` is used for `.txt` files.

In [2]:
loader = TextLoader("./docs/sample_input2.txt")
docs = loader.load()

Transforming the document by breaking it into smaller chunks with a text splitter. By default, `CharacterTextSplitter` splits the document at the `'\n\n'` separator. If the separator is instead set to an empty string and a chunk size is specified, each chunk (except possibly the last) has that specified length. With no overlap, the resulting list length is therefore roughly: list length ≈ length of doc / chunk size, rounded up. The cell below uses `RecursiveCharacterTextSplitter`, which tries each separator in its list in turn so that chunks stay within the chunk size.
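The chunk-count estimate above can be checked with a small, library-free sketch. It assumes fixed-size cuts with no overlap and no separator handling; `split_fixed` and `sample` are hypothetical names used only for illustration and are not part of LangChain.

```python
import math


def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive chunks of at most chunk_size characters (no overlap)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Stand-in for the document text (assumption: 1000 characters long).
sample = "x" * 1000
chunks = split_fixed(sample, 256)

# The number of chunks equals len(text) / chunk_size, rounded up.
print(len(chunks), math.ceil(len(sample) / 256))
```

A real splitter may also strip whitespace or honor separators, so its chunk count can differ slightly from this estimate.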


In [3]:
# text_splitter = CharacterTextSplitter(chunk_size=256, chunk_overlap=100)
# texts = text_splitter.split_documents(docs)
# texts
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256, chunk_overlap=0, separators=[" ", ",", "\n", "."]
    )
texts = text_splitter.split_documents(docs)
texts

[Document(page_content='India, officially known as the Republic of India, is a diverse and vibrant country located in South Asia. It is the seventh-largest country by land area and the second-most populous country in the world, with over 1.3 billion people. India shares its', metadata={'source': './docs/sample_input2.txt'}),
 Document(page_content='borders with several countries, including Pakistan, China, Nepal, Bhutan, Bangladesh, and Myanmar.\n\nGeographically, India is known for its diverse landscape, which ranges from the towering Himalayan mountain range in the north to the coastal plains in the', metadata={'source': './docs/sample_input2.txt'}),
 Document(page_content="south, and from the arid desert regions in the west to the fertile Gangetic plains in the east. The country is also home to several major rivers, including the Ganges and Brahmaputra, which have played a significant role in shaping India's history and", metadata={'source': './docs/sample_input2.txt'}),
 Document(p

Embedding documents and embedding queries

In [4]:
# from langchain.embeddings import LlamaCppEmbeddings
# embeddings = LlamaCppEmbeddings(model_path=MODEL_PATH)
# Collect the raw text of every chunk for embedding.
_texts = [doc.page_content for doc in texts]
texts[0]

Document(page_content='India, officially known as the Republic of India, is a diverse and vibrant country located in South Asia. It is the seventh-largest country by land area and the second-most populous country in the world, with over 1.3 billion people. India shares its', metadata={'source': './docs/sample_input2.txt'})

In [5]:
# embedded_texts = embeddings.embed_documents(_texts)
# len(embedded_texts), len(embedded_texts[0])
embeddings = HuggingFaceEmbeddings()
query = "Who is the current President of India?"

embedded_query = embeddings.embed_query(query)
embedded_texts = embeddings.embed_documents(_texts)
len(embedded_texts), len(embedded_texts[0]), len(embedded_query)


(14, 768, 768)

In [6]:
embedded_texts[0][:4]

[0.048259492963552475,
 0.010048022493720055,
 -0.027376944199204445,
 -0.02173936553299427]

In [7]:
# query = "Who is the current President of India?"
# embedded_query = embeddings.embed_query(query)
# len(embedded_query)

Creating a vector store using ChromaDB to store embedded data and to perform vector search operations.

In [8]:
db = Chroma.from_documents(texts, embeddings)
query_vector = embeddings.embed_query(query)
docs = db.similarity_search_by_vector(query_vector, k=1)
docs

[Document(page_content='governance structure. The President of India is the head of state, while the Prime Minister is the head of government. The current President of India is Droupadi Murmu while the current prime minister is Narendra Modi. The country follows a parliamentary', metadata={'source': './docs/sample_input2.txt'})]
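Conceptually, the vector search above ranks the stored chunk embeddings by how close they are to the query embedding. The following is a minimal brute-force sketch of that idea, assuming cosine similarity as the measure; Chroma itself uses its own index and distance settings, so this is a stand-in and not its implementation. The names `cosine_similarity`, `top_k`, `toy_docs`, and `toy_query` are hypothetical, and the 3-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy embeddings (assumption: real ones here have 768 dimensions).
toy_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [0.9, 0.1, 0.0]

print(top_k(toy_query, toy_docs, k=2))  # -> [0, 2]
```

With `k=1`, as in the cell above, only the single closest chunk would be returned.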

Defining the prompt template that will be given to the Llama model.

In [16]:
from langchain.prompts import PromptTemplate

template = """You are an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. Below is some information. 
{context}

Based on the above information only, answer the below question. 

{question} Be concise."""

prompt = PromptTemplate.from_template(template)
prompt.input_variables

['context', 'question']
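To see what the template does once its variables are supplied, the substitution can be sketched with plain Python string formatting. This assumes the default brace-style placeholders and uses a shortened template; the context and question strings below are made-up examples, not taken from the sample document.

```python
from string import Formatter

# Shortened version of the template, with the same two placeholders.
template = (
    "Below is some information.\n"
    "{context}\n\n"
    "Based on the above information only, answer the below question.\n\n"
    "{question} Be concise."
)

# Discover the placeholder names, similar in spirit to prompt.input_variables.
variables = sorted({name for _, name, _, _ in Formatter().parse(template) if name})
print(variables)  # -> ['context', 'question']

# Fill the placeholders with illustrative values.
filled = template.format(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
print(filled)
```

The filled string is what ultimately reaches the model, so printing it is a quick way to check that the retrieved context and the question appear where expected.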

In [17]:
query = "Who is the current President of India?"
similar_doc = db.similarity_search(query, k=1)
context = similar_doc[0].page_content
print(context)

governance structure. The President of India is the head of state, while the Prime Minister is the head of government. The current President of India is Droupadi Murmu while the current prime minister is Narendra Modi. The country follows a parliamentary


In [19]:
llm = LlamaCpp(model_path=MODEL_PATH)
query_llm = LLMChain(llm=llm, prompt=prompt)
response = query_llm.run({"context": context, "question": query})
print(response)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 



The correct answer to this question has 4 words as shown in the box below (highlighted).

```
    A) Narendra Modi
    B) Droupadi Murmu
    C) Mukesh Ambani
    D) Akshay Kumar
```

The following are possible answers to this question. 

* The correct answer is shown in the box below with respect to the highlighted words above and can be selected in a multiple choice answer. Alternatively, if you would like to provide an answer in your own words, please do so by typing the text into the box. If you're unsure of any details or have more than one answer which could fit the bill, feel free to add both answers in separate boxes.
* The incorrect responses and possible reasons will be provided below for reference. 

1) Incorrect Answer Reason (A): Narendra Modi is correct. Narendra Modi has been the prime minister of India since June 2014 but he was elected president on 20 July 2017, so he is currently president. Mukesh Ambani was a candidate in this election and was nominated by Bharatiya
