# Understanding LlamaIndex

Initially known as GPT Index, LlamaIndex has evolved into an indispensable ally for developers. It works like a multi-tool that helps at several stages of working with data and large language models:


1. Firstly, it helps in **'ingesting'** data, which means getting the data from its original source into the system.
2. Secondly, it helps in **'structuring'** that data, which means organizing it in a way that the language models can easily understand.
3. Thirdly, it aids in **'retrieval'**, which means finding and fetching the right pieces of data when needed.
4. Lastly, it simplifies **'integration'**, making it easier to meld your data with various application frameworks.



When we dive a little deeper into the mechanics of LlamaIndex, we find three main heroes doing the heavy lifting.


*   The *'data connectors'* are the diligent gatherers, fetching your data from wherever it resides, be it APIs, PDFs, databases, or external apps like Gmail, Notion, Airtable.
*   The *'data indexes'* are the organized librarians, arranging your data neatly so that it's easily accessible.
*   And the *'engines'* are the translators (LLMs), making it possible to interact with your data using natural language and ultimately create applications and workflows.

Credits: https://nanonets.com/blog/llamaindex/#understanding-llamaindex


# Let's configure LlamaIndex
<img src="https://www.llamaindex.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2FhomepageHeroProcess.f9904fd2.png&w=1920&q=75" width=500px>

In [None]:
!pip install llama-index-llms-huggingface
!pip install llama-index-embeddings-huggingface
!pip install llama-index ipywidgets

In [None]:
!pip install bitsandbytes
!pip install accelerate

The cells above install the Hugging Face integrations together with the core package. If you only need the core LlamaIndex package and are familiar with Python, a single command is enough:

In [None]:
!pip install llama-index

In [4]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Download our Documents!

In [5]:
!wget https://github.com/karan-nanonets/llamaindex-guide/raw/main/bcg-2022-annual-sustainability-report-apr-2023.pdf

--2024-04-30 11:12:36--  https://github.com/karan-nanonets/llamaindex-guide/raw/main/bcg-2022-annual-sustainability-report-apr-2023.pdf
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/karan-nanonets/llamaindex-guide/main/bcg-2022-annual-sustainability-report-apr-2023.pdf [following]
--2024-04-30 11:12:37--  https://raw.githubusercontent.com/karan-nanonets/llamaindex-guide/main/bcg-2022-annual-sustainability-report-apr-2023.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19069924 (18M) [application/octet-stream]
Saving to: ‘bcg-2022-annual-sustainability-report-apr-2023.pdf’


2024-04-30 11:1

## Creating LlamaIndex Documents
Data connectors, also referred to as Readers, are essential components in LlamaIndex that facilitate the ingestion of data from various sources and formats, converting them into a simplified Document representation consisting of text and basic metadata.

LlamaHub is an open-source repository of data connectors that can be integrated into any LlamaIndex application. The connectors are used as follows:

In [1]:
# Load a single PDF: BCG's 2022 annual sustainability report (published April 2023).
# Note that it contains many sections, tables, and images.
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_files=["bcg-2022-annual-sustainability-report-apr-2023.pdf"]
)

pdf_documents = reader.load_data()

The example below loads the Wikipedia pages for a few countries from around the globe. Each element of the list is used as a search query, and the top page returned for that query is ingested.

In [2]:
# Enrich the collection with a few Wikipedia pages.
# Note: download_loader is deprecated in recent llama-index releases
# (hence the warning echoed after this cell); it is kept here to match the original run.
!pip install wikipedia
from llama_index.core import download_loader
WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()
wikipedia_documents = loader.load_data(pages=['Iceland Country', 'Kenya Country', 'Germany Country'])



  WikipediaReader = download_loader("WikipediaReader")


The variety of data connectors available is extensive. Some of them are:

*  **SimpleDirectoryReader**: Supports a broad range of file types (.pdf, .jpg, .png, .docx, etc.) from a local file directory.
*  NotionPageReader: Ingests data from Notion.
*  SlackReader: Imports data from Slack.
*  AirtableReader: Imports data from Airtable.
*  ApifyActor: Capable of web crawling, scraping, text extraction, and file downloading.

<img src="https://nanonets.com/blog/content/images/size/w1600/2023/10/image-14.png" width=500px>

## Creating LlamaIndex Nodes
In LlamaIndex, once the data has been ingested and represented as Documents, there's an option to further process these Documents into Nodes. Nodes are more granular data entities that represent "chunks" of source Documents, which could be text chunks, images, or other types of data. They also carry metadata and relationship information with other nodes, which can be instrumental in building a more structured and relational index.
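
To make the idea of Nodes concrete, here is a minimal sketch of chunks that carry metadata and links to their neighbours. The class `NodeSketch`, its field names, and the helper `link_chunks` are hypothetical and chosen only for illustration; LlamaIndex's own node classes are richer and differ in detail.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeSketch:
    """A chunk of a source document plus metadata and neighbour links (illustrative only)."""
    node_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def link_chunks(chunks: List[str], source: str) -> List[NodeSketch]:
    """Wrap text chunks as nodes and connect each node to its predecessor and successor."""
    nodes = [
        NodeSketch(node_id=f"{source}-{i}", text=chunk, metadata={"source": source})
        for i, chunk in enumerate(chunks)
    ]
    for left, right in zip(nodes, nodes[1:]):
        left.next_id = right.node_id
        right.prev_id = left.node_id
    return nodes


example_nodes = link_chunks(["Intro text", "Emissions table", "Outlook"], "report.pdf")
for node in example_nodes:
    print(node.node_id, node.prev_id, node.next_id)
```

Keeping the neighbour ids on each node is what lets an index walk from a retrieved chunk to the text around it.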

**Basic**
To parse Documents into Nodes, LlamaIndex provides NodeParser classes. These classes help in automatically transforming the content of Documents into Nodes, adhering to a specific structure that can be utilized further in index construction and querying.

Here's how you can use a SimpleNodeParser to parse your Documents into Nodes:

In [3]:
from llama_index.core.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)

pdf_nodes = parser.get_nodes_from_documents(pdf_documents)

In this snippet, SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20) initializes a parser that produces chunks of up to 1024 tokens with a 20-token overlap between consecutive chunks, and get_nodes_from_documents(pdf_documents) parses the loaded Documents into Nodes.
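
The effect of chunk_size and chunk_overlap can be sketched in plain Python. This is a simplified illustration that splits on whitespace-separated words; it is not the library's splitter, which works on tokens and tries to respect sentence boundaries. The function name `chunk_with_overlap` is an assumption made for this example.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = chunk_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

The overlap means the last word of one chunk reappears as the first word of the next, so a sentence cut at a chunk boundary is still partly visible from both sides.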

## Creating LlamaIndex Index
The core of LlamaIndex is its ability to build structured indices over ingested data, represented as either Documents or Nodes. This indexing enables efficient querying over the data. Let's look at how to build indices from both Document and Node objects, and at what happens under the hood during this process. By default, LlamaIndex uses an OpenAI model as its LLM.

In the following piece of code we instead configure a **LLaMA-3**-based model hosted on the Hugging Face Hub.

In [4]:
import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import PromptTemplate

# Model name on the Hugging Face Hub (make sure you have access to it).
# Llama 3 has no 7B variant, so the constant is named generically.
LLAMA3_MODEL = "m-polignano-uniba/LLaMAntino-3-ANITA_test"
selected_model = LLAMA3_MODEL

query_wrapper_prompt = PromptTemplate(
    "{query_str}"
)

llm = HuggingFaceLLM(
    context_window=2048,
    max_new_tokens=1024,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="cuda:0",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16, "load_in_4bit": True},
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v1")

from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model

In [6]:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(pdf_nodes)

Different types of indices in LlamaIndex handle data in distinct ways:

*  Vector Store Index: Stores each Node and a corresponding embedding in a Vector Store, and queries involve fetching the top-k most similar Nodes.
*  Tree Index: Builds a hierarchical tree from a set of Nodes, and queries involve traversing from root nodes down to leaf nodes.
*  Keyword Table Index: Extracts keywords from each Node to build a mapping, and queries extract relevant keywords to fetch corresponding Nodes.
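
As a rough illustration of the keyword table idea from the list above, the sketch below builds a keyword-to-node mapping and uses it to answer a query. It is a toy version under stated assumptions: keywords are extracted with a simple stopword filter, whereas the library can use other extractors, and the names `extract_keywords`, `build_keyword_table`, and `query_keyword_table` are hypothetical.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "for"}


def extract_keywords(text):
    """Lowercase each word, strip punctuation, and drop stopwords."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def build_keyword_table(nodes):
    """Map each keyword to the ids of the nodes that contain it."""
    table = defaultdict(set)
    for node_id, text in nodes.items():
        for keyword in extract_keywords(text):
            table[keyword].add(node_id)
    return table


def query_keyword_table(table, question):
    """Return ids of nodes sharing keywords with the question, most overlaps first."""
    hits = defaultdict(int)
    for keyword in extract_keywords(question):
        for node_id in table.get(keyword, ()):
            hits[node_id] += 1
    return sorted(hits, key=lambda node_id: (-hits[node_id], node_id))


toy_nodes = {
    "n1": "Iceland is a Nordic island country.",
    "n2": "Kenya is a country in East Africa.",
    "n3": "Germany is a country in Central Europe.",
}
keyword_table = build_keyword_table(toy_nodes)
print(query_keyword_table(keyword_table, "Which country is in East Africa?"))
```

A vector store index replaces the exact keyword match with a similarity score between embeddings, which is why it can also match paraphrases.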

**Under the Hood:**

*  The Documents are parsed into Node objects, which are lightweight abstractions over text strings that also keep track of metadata and relationships.
*  Index-specific computations are then performed to add each Node to the index data structure.

For a vector store index, an embedding model is called (either via an API or locally) to compute embeddings for the Node objects. For a document summary index, an LLM is called to generate a summary of each document.

Let us now create a summary index for the **Wikipedia nodes**. We find the relevant index from the list of supported indices, and settle on the Document Summary Index.

In [7]:
from llama_index.core.indices.document_summary import DocumentSummaryIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import get_response_synthesizer

splitter = SentenceSplitter(chunk_size=512)

# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="simple_summarize", use_async=False
)

doc_summary_index = DocumentSummaryIndex.from_documents(
    wikipedia_documents,
    llm=llm,
    transformations=[splitter],
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

Parsing nodes:   0%|          | 0/3 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/3 [00:00<?, ?it/s]

current doc id: 14531


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


current doc id: 188171


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


current doc id: 11867


Generating embeddings:   0%|          | 0/3 [00:00<?, ?it/s]

## Storing an Index
LlamaIndex's storage capability is built for adaptability, especially when dealing with evolving data sources. This section outlines the functionalities provided for managing data storage, including customization and persistence features.

**Persistence (Basic)**
You may want to save an index for future use, and LlamaIndex makes this straightforward. The persist() method writes the index data to disk, and load_index_from_storage() reads it back.



In [8]:
index.storage_context.persist(persist_dir="BCG Report")
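
Reading the index back in a later session can be sketched as follows. The helper name `reload_index` is hypothetical, and the sketch assumes the directory was written by the persist() call above and that Settings.embed_model is set to the same embedding model before querying.

```python
def reload_index(persist_dir="BCG Report"):
    """Rebuild an index from the files written by storage_context.persist()."""
    # Imported inside the function so the helper can be defined
    # even before llama-index is installed in the environment.
    from llama_index.core import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context)


# Usage in a fresh session (after configuring Settings.llm and Settings.embed_model):
# index = reload_index()
```

Reloading avoids recomputing the embeddings, which is usually the slowest part of building a vector store index.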

## Using Index to Query Data
After having established a well-structured index using LlamaIndex, the next pivotal step is querying this index to extract meaningful insights or answers to specific inquiries. This segment elucidates the process and methods available for querying the data indexed in LlamaIndex.

**High-Level Query API**

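LlamaIndex provides a high-level API, index.as_query_engine(), that covers common use cases in a single call. The cell below assembles the same components explicitly (a retriever, a response synthesizer, and a query engine), which makes each setting visible.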

In [9]:
from llama_index.core import VectorStoreIndex, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)
# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="simple_summarize",
)
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

query = 'In what context is Morocco mentioned in the report?'

response = query_engine.query(query)
print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 Morocco is mentioned in the context of a countrywide project with ramifications affecting the lives of millions of people, including expanding universal health care coverage, restructuring the reform agenda, and supporting social reforms. 


In [10]:
response = query_engine.query('List measures taken to address diseases occuring in developing industries')
print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


1. Pfizer announced “An Accord for a Healthier World” to provide access to innovative medicines for people living in 45 lower-income countries. 2. Pfizer sought BCG’s assistance in developing a partnership model that mitigates risks, meets regulatory requirements, and ensures high distribution security. 3. BCG worked on distribution, enabled sustainable prices, and conducted a regulatory analysis to initiate drug approval processes on time while also leveraging existing registrations as much as possible. 4. Pfizer also engages in a broader effort to ensure the effective distribution of the provided medicines. 5. Pfizer announced plans to further expand its Accord for a Healthier World to nextend access to the full portfolio of medicines and vaccines to all eligible individuals. 6. BCG’s Internal Sustainability Strategic Committee oversees the development, implementation, and progress of the firm’s sustainability strategy and net-zero target, including oversight of climate-related risks

# LangChain!
A common use case for AI chatbots is ingesting PDF documents and letting users ask questions about them, inspect them, and learn from them. In this part of the tutorial we build an end-to-end chat application that lets users chat about the sustainability report PDF.

**LangChain** streamlines this process and guides users through each stage systematically. With support for multiple services, including embedding models, chat models, and vector databases, LangChain makes it easier to create chatbots tailored for PDF interactions. The same workflow extends to integrating with Streamlit, handling multiple PDFs, and using RAG for semantic search.

In [None]:
!pip install huggingface-hub==0.20.3
#!pip -q install git+https://github.com/huggingface/transformers # need to install from github
!pip install -q datasets loralib sentencepiece
#!pip -q install bitsandbytes accelerate xformers
!pip -q install langchain
!pip -q install peft chromadb
!pip -q install unstructured
!pip install -q sentence_transformers
!pip -q install pypdf

In [None]:
!pip install bitsandbytes
!pip install accelerate
!pip install git+https://github.com/huggingface/transformers

In [11]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

## What is Langchain?
LangChain is an open-source framework for building applications around chat models such as GPT-4 or LLaMA-2. It connects models to external data, making them more agentic and data-aware, and offers several prebuilt chains that simplify interaction with language models. Alongside LangChain, models that create vector embeddings from text play a crucial role; they are covered in the Text Embeddings section. When working with PDF files, the ability to render page images is also worth noting.

## Let's start loading the LLM model with quantization
We will use a LLaMA-3-based chat model (LLaMAntino-3-ANITA), loaded with 4-bit quantization, to allow a natural interaction between the user and the system.

In [12]:
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False)

In [13]:
MODEL_ID = "m-polignano-uniba/LLaMAntino-3-ANITA_test"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map={"": 0},  # place the whole model on GPU 0
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

## **Text Embeddings**
Text embeddings are at the heart of LLM applications. Technically, we can work with language models in natural language, but storing and retrieving raw text is highly inefficient. For example, in this project we need to perform high-speed search operations over large chunks of data, which is impractical on raw natural-language text.

To make it more efficient, we need to transform text data into vector forms. There are dedicated ML models for creating embeddings from texts. The texts are converted into multidimensional vectors. Once embedded, we can group, sort, search, and more over these data. We can calculate the distance between two sentences to know how closely they are related. And the best part of it is these operations are not just limited to keywords like the traditional database searches but rather capture the semantic closeness of two sentences. This makes it a lot more powerful, thanks to Machine Learning.
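
The distance calculation mentioned above is commonly done with cosine similarity. The sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings; the numbers and sentences are made up for illustration, and a real embedding model would produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: the first two sentences share meaning but few keywords.
toy_embeddings = {
    "The firm cut its carbon emissions.": [0.9, 0.1, 0.2],
    "Greenhouse gas output was reduced.": [0.8, 0.2, 0.3],
    "The cafeteria menu changed on Monday.": [0.1, 0.9, 0.1],
}

# Pretend embedding of the question "How did emissions change?"
query_vector = [0.85, 0.15, 0.25]

ranked = sorted(
    toy_embeddings,
    key=lambda sentence: cosine_similarity(query_vector, toy_embeddings[sentence]),
    reverse=True,
)
for sentence in ranked:
    score = cosine_similarity(query_vector, toy_embeddings[sentence])
    print(f"{score:.3f}  {sentence}")
```

Both emissions-related sentences score close to the query even though they use different words, while the unrelated sentence ranks last.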

In [14]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader

In [15]:
#!wget https://github.com/karan-nanonets/llamaindex-guide/raw/main/bcg-2022-annual-sustainability-report-apr-2023.pdf
loader = PyPDFLoader("bcg-2022-annual-sustainability-report-apr-2023.pdf")

In [16]:
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 500,
    chunk_overlap  = 20,
    length_function = len,
)

In [17]:
pages = loader.load_and_split(text_splitter)

In this tutorial, we will use a **Hugging Face sentence embeddings model (SBERT)** for creating embeddings via *HuggingFaceEmbeddings()*. If you want to deploy an AI app for end users, consider other embedding models as well, such as OpenAI models or Google's Universal Sentence Encoder.

To store vectors, we will use **Chroma DB**, an open-source vector store database. Feel free to explore other databases like Alpine, Pinecone, and Redis. Langchain has wrappers for all of these vector stores.

In [18]:
db = Chroma.from_documents(pages, HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"), persist_directory = '/content/db')

## Setup the conversational agent

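We can now define the interaction template. The code below uses the Llama 3 chat format, built from the special tokens <|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, and <|eot_id|>. For comparison, the earlier LLaMA-2 chat template is described at https://gpus.llm-utils.org/llama-2-prompt-template/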

In [19]:
from langchain import HuggingFacePipeline
from langchain import PromptTemplate,  LLMChain
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory, ConversationBufferWindowMemory

In [20]:
import json
import textwrap

llama3_template= """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{} <|eot_id|> <|start_header_id|>user<|end_header_id|>

{} <|eot_id|> <|start_header_id|>assistant<|end_header_id|>


"""

DEFAULT_SYSTEM_PROMPT = "You are an intelligent assistant named LLaMAntino-3 ANITA (Advanced Natural-based interaction for the ITAlian language), kind and respectful. Answer in the language used for the question in a clear, simple and comprehensive manner."

def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT ):

    prompt_template =  llama3_template.format(new_system_prompt,instruction)

    return prompt_template


In [21]:
instruction = "Given the context that has been provided. \n {context}, Answer the following question - \n{question}"

system_prompt = """You are an expert in sustainability.
You will be given a context to answer from. Be precise in your answers wherever possible.
In case you are sure you don't know the answer then you say that based on the context you don't know the answer.
In all other instances you provide an answer to the best of your capability. Cite urls when you can access them related to the context."""


In [22]:
template = get_prompt(instruction, system_prompt)
print(template)

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in sustainability.
You will be given a context to answer from. Be precise in your answers wherever possible.
In case you are sure you don't know the answer then you say that based on the context you don't know the answer.
In all other instances you provide an answer to the best of your capability. Cite urls when you can access them related to the context. <|eot_id|> <|start_header_id|>user<|end_header_id|>

Given the context that has been provided. 
 {context}, Answer the following question - 
{question} <|eot_id|> <|start_header_id|>assistant<|end_header_id|>





ConversationBufferWindowMemory keeps a list of the interactions of the conversation over time. It only uses the last K interactions. This can be useful for keeping a sliding window of the most recent interactions, so the buffer does not get too large.
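
The sliding-window behaviour can be sketched with a bounded deque. This is an illustration of the idea only, not LangChain's implementation; the class `WindowMemory` and its methods are hypothetical names.

```python
from collections import deque


class WindowMemory:
    """Keep only the k most recent (user, assistant) exchanges."""

    def __init__(self, k):
        # A deque with maxlen drops the oldest item automatically when full.
        self.turns = deque(maxlen=k)

    def save(self, user_message, assistant_message):
        self.turns.append((user_message, assistant_message))

    def history(self):
        return list(self.turns)


memory_sketch = WindowMemory(k=2)
memory_sketch.save("Q1", "A1")
memory_sketch.save("Q2", "A2")
memory_sketch.save("Q3", "A3")
print(memory_sketch.history())
```

With k=2 the first exchange is forgotten as soon as the third one is saved, which keeps the prompt size bounded no matter how long the conversation runs.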

In [23]:
memory = ConversationBufferWindowMemory(
    memory_key="chat_history", k=5,
    return_messages=True
)

## Create Chain
This is the most important step. The texts have been extracted, embedded, and stored in a vector store; LangChain provides wrappers for multiple services that make this easier. What remains is to expose the store through a retriever.

A **retriever** is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
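
To illustrate that a retriever is only an interface, the sketch below defines one that works without any embeddings or vector store. The protocol `Retriever`, its method name `retrieve`, and the class `SubstringRetriever` are assumptions made for this example and do not mirror LangChain's actual class hierarchy.

```python
from typing import List, Protocol


class Retriever(Protocol):
    """Anything that can turn a query string into a list of documents."""

    def retrieve(self, query: str) -> List[str]: ...


class SubstringRetriever:
    """Returns documents that contain any query word; no vectors involved."""

    def __init__(self, documents):
        self.documents = list(documents)

    def retrieve(self, query):
        words = [w.lower() for w in query.split()]
        return [d for d in self.documents if any(w in d.lower() for w in words)]


def answer_context(retriever: Retriever, query: str) -> str:
    """Join the retrieved documents into the context string handed to the LLM."""
    return "\n".join(retriever.retrieve(query))


keyword_retriever = SubstringRetriever([
    "Morocco expanded health care coverage.",
    "Emissions intensity fell since 2018.",
])
print(answer_context(keyword_retriever, "morocco"))
```

Because `answer_context` depends only on the interface, the substring retriever could be swapped for a vector-store-backed one without changing the calling code.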

In [24]:
retriever = db.as_retriever()

## Create LLM generation Pipeline
The code below uses the pipeline function from the transformers library to create a text generation pipeline. The "text-generation" argument specifies the task the pipeline is created for.

The pipeline function creates a high-level interface for working with pre-trained models from the Hugging Face Transformers library. It lets you perform various NLP tasks, including text generation, with a few lines of code, without having to write the underlying model architecture.

Note that when pipeline is given a model name, it automatically downloads the pre-trained model if it has not been downloaded to your system before. Here we pass the model and tokenizer objects loaded earlier, so nothing is downloaded again.

In [25]:
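from transformers import pipeline

def create_pipeline(max_new_tokens=512):
    # Reuse the quantized model and tokenizer loaded earlier.
    pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                max_new_tokens=max_new_tokens,
                do_sample=False)  # greedy decoding, the deterministic behavior temperature = 0 was aiming for
    return pipe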

## Conversation Chain
To hold a conversation with the LLM, we wrap a *ConversationalRetrievalChain* from LangChain in a ChatBot class:

In [72]:
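class ChatBot:
  # Note: `task` is accepted but currently unused; the pipeline task is fixed in create_pipeline.
  def __init__(self, memory, prompt, task:str = "text-generation", retriever = retriever):
    self.memory = memory
    self.prompt = prompt
    self.retriever = retriever

  def create_chat_bot(self, max_new_tokens = 1024):
    # Wrap the transformers pipeline so LangChain can call it as an LLM.
    hf_pipe = create_pipeline(max_new_tokens)
    llm = HuggingFacePipeline(pipeline=hf_pipe)
    # Combine retrieval, memory, and the custom prompt used for answer generation.
    qa = ConversationalRetrievalChain.from_llm(
      llm=llm,
      retriever=self.retriever,
      memory=self.memory,
      combine_docs_chain_kwargs={"prompt": self.prompt}
    )
    return qa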

In [73]:
chat_bot = ChatBot(memory = memory, prompt = prompt)

In [74]:
from transformers import pipeline

In [75]:
bot = chat_bot.create_chat_bot()

In [76]:
bot_message = bot({"question": "In what context is Morocco mentioned in the report?"})['answer']
print(bot_message)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in sustainability.
You will be given a context to answer from. Be precise in your answers wherever possible.
In case you are sure you don't know the answer then you say that based on the context you don't know the answer.
In all other instances you provide an answer to the best of your capability. Cite urls when you can access them related to the context. <|eot_id|> <|start_header_id|>user<|end_header_id|>

Given the context that has been provided. 
 firm, to further strengthen our climate and 
sustainability expertise and help us lead the global 
transformation toward a new planetary economy 
in which business gives nature a seat at the table.2022 in Numbers
 
 
90+ thought leadership publications focused on 
various climate and sustainability topics 
 
$16/tCO2e our blended carbon price in 
2022 per metric ton 
 
94% reduction in Scope 1 and 2 emissions 
intensity since 2018 (tCO2e per FTE)
 
60% reduction