<a href="https://colab.research.google.com/github/kfahn22/Colab_notebooks/blob/main/Mistral_7b_instruct_example_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Documentation from LlamaIndex on [using LLMs](https://docs.llamaindex.ai/en/stable/module_guides/models/llms.html)

Source notebook: [Colab notebook](https://colab.research.google.com/drive/1ZAdrabTJmZ_etDp10rjij_zME2Q3umAQ?usp=sharing)

Note: Responses from local models can be quite slow, especially with 8-bit quantization.

With 4-bit quantization, `mistralai/Mistral-7B-Instruct-v0.1` uses about 12 GB of VRAM and 8.5 GB of RAM. I used a T4 High-RAM instance for this notebook.
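To check these figures on your own runtime, a minimal sketch using PyTorch's CUDA memory queries could look like the following. `gpu_memory_report` is a hypothetical helper name, not part of the original notebook; it returns `None` when PyTorch or a CUDA device is unavailable.

```python
def gpu_memory_report():
    """Return GPU memory figures in GB, or None if no CUDA device is usable."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # mem_get_info() reports free and total device memory in bytes
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    gb = 1024 ** 3
    return {
        "total_gb": total_bytes / gb,
        "used_gb": (total_bytes - free_bytes) / gb,
        "allocated_by_torch_gb": torch.cuda.memory_allocated() / gb,
    }


report = gpu_memory_report()
print(report if report is not None else "No CUDA device available")
```

Running it once before and once after loading the model gives a rough idea of how much VRAM the quantized weights take.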

In [1]:
!pip install huggingface_hub



In [2]:
from huggingface_hub import notebook_login

notebook_login()


In [3]:
!pip install git+https://github.com/run-llama/llama_index

Collecting git+https://github.com/run-llama/llama_index
  Cloning https://github.com/run-llama/llama_index to /tmp/pip-req-build-9agoo11r
  Running command git clone --filter=blob:none --quiet https://github.com/run-llama/llama_index /tmp/pip-req-build-9agoo11r
  Resolved https://github.com/run-llama/llama_index to commit cc739d10069a7f2ac653d6d019fbeb18a891fea2
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting dataclasses-json (from llama-index==0.9.44)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting deprecated>=1.2.9.3 (from llama-index==0.9.44)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting dirtyjson<2.0.0,>=1.0.8 (from llama-index==0.9.44)
  Downloading dirtyjson-1.0.8-py3-none-any.whl (25 kB)
Collecting httpx (from llama-index==0.9.44)
  Dow

In [4]:
!pip install transformers accelerate bitsandbytes

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Installing collected packages: bitsandbytes, accelerate
Successfully installed accelerate-0.26.1 bitsandbytes-0.42.0


## Setup

### Data

You can load a source document from a directory or from a URL.

First, import the necessary modules.

In [None]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

If you want to upload a document, use the following code cell to create the data directory and then add the document to the directory.


In [None]:
# import os

# # Create data directory if it doesn't exist
# os.makedirs("./data", exist_ok=True)

In [None]:
#documents = SimpleDirectoryReader("./data").load_data()
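If the file is already on the runtime's disk, one way to place it in the data directory is sketched below. `stage_document` is a hypothetical helper introduced here for illustration; the default `./data` path matches the commented-out cells above.

```python
import shutil
from pathlib import Path


def stage_document(source, data_dir="./data"):
    """Copy one file into data_dir (created if missing) and return the new path."""
    source_path = Path(source)
    if not source_path.is_file():
        raise FileNotFoundError(f"No such file: {source_path}")
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps file metadata such as the modification time
    return Path(shutil.copy2(source_path, target_dir / source_path.name))
```

After staging one or more files, `SimpleDirectoryReader("./data").load_data()` reads everything in that directory.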

The essay from Paul Graham is the example used in the LlamaIndex docs.

In [6]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-02-07 16:19:55--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-02-07 16:19:56 (5.90 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [7]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

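You can also load a URL. `BeautifulSoupWebReader` fetches only the pages you list: when the URL was "thecodingtrain.com", only the homepage was used for the response. Note that running this cell replaces the `documents` loaded from the essay above.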

In [31]:
from llama_index.readers import BeautifulSoupWebReader

url = "https://www.gutenberg.org/ebooks/98.txt.utf-8"
#url = "https://thecodingtrain.com"

documents = BeautifulSoupWebReader().load_data([url])
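Because `load_data` takes a list of URLs, covering more than one page of a site means listing each page explicitly. A minimal sketch of building such a list follows; `page_urls` is a hypothetical helper, and the choice of paths is only an example.

```python
from urllib.parse import urljoin


def page_urls(base_url, paths):
    """Join a site root with relative paths to get one absolute URL per page."""
    base = base_url if base_url.endswith("/") else base_url + "/"
    return [urljoin(base, path.lstrip("/")) for path in paths]


# The homepage plus the challenges page, as an example
urls = page_urls("https://thecodingtrain.com", ["", "challenges"])
print(urls)

# The list can then be passed to the reader used above:
# documents = BeautifulSoupWebReader().load_data(urls)
```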

### LLM

This should run on a T4 instance on the free tier.

In [8]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)


llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.1",
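    # Mistral instruct format: the model writes its answer after [/INST], so the prompt carries no closing </s>.
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST]"),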
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
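    # do_sample=True is needed for temperature, top_k, and top_p to take effect; without it generation is greedy.
    generate_kwargs={"do_sample": True, "temperature": 0.2, "top_k": 5, "top_p": 0.95},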
    device_map="auto",
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
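The `context_window` and `max_new_tokens` values together bound how much text can be sent to the model in one call. As a rough sketch (the helper name `prompt_budget` and the `template_tokens` parameter are illustrative assumptions, and LlamaIndex's own accounting may differ in detail):

```python
def prompt_budget(context_window, max_new_tokens, template_tokens=0):
    """Tokens left for the question and retrieved context after reserving the output."""
    budget = context_window - max_new_tokens - template_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens and the template leave no room for a prompt")
    return budget


# Values from the HuggingFaceLLM call above
print(prompt_budget(context_window=3900, max_new_tokens=256))  # 3644
```

A larger `max_new_tokens` allows longer answers but leaves less room for retrieved context.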



In [9]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en-v1.5")


### Index Setup

In [32]:
from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

In [33]:
from llama_index import SummaryIndex

summary_index = SummaryIndex.from_documents(documents, service_context=service_context)
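The two indexes differ mainly in what they hand to the LLM at query time: a vector index retrieves the chunks most similar to the query, while a summary index passes along every chunk. The toy sketch below illustrates that difference only. It uses word counts as a stand-in for the `BAAI/bge-small-en-v1.5` embeddings, and `vector_retrieve`, `summary_retrieve`, and `top_k` are hypothetical names, not LlamaIndex APIs.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def vector_retrieve(nodes, query, top_k=2):
    """Return the top_k chunks ranked by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(nodes, key=lambda node: cosine(embed(node), query_vec), reverse=True)
    return ranked[:top_k]


def summary_retrieve(nodes, query):
    """Return every chunk, regardless of the query."""
    return list(nodes)


nodes = [
    "Sydney Carton is a lawyer",
    "Paris and London are the two cities",
    "Lucie Manette marries Charles Darnay",
]
print(vector_retrieve(nodes, "Who is Sydney Carton?", top_k=1))
print(len(summary_retrieve(nodes, "Who is Sydney Carton?")))
```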

### Helpful Imports / Logging

In [26]:
from llama_index.response.notebook_utils import display_response

In [27]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Basic Query Engine

### Compact (default)

In [34]:
query_engine = vector_index.as_query_engine(response_mode="compact")

#response = query_engine.query("What is the featured Coding Challenge")

response = query_engine.query("Who is Sydney Carton?")
display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** Sydney Carton is a character in Charles Dickens' novel "A Tale of Two Cities." He is a lawyer and a man of great integrity who is willing to sacrifice himself for the greater good. He is in love with Lucie Manette, but he knows that he cannot marry her because of his low social status. He is also a victim of the corrupt and unjust legal system of the time, and he is willing to risk his life to help others.

### Refine

In [35]:
query_engine = vector_index.as_query_engine(response_mode="refine")

response = query_engine.query("What is the plot of a Tale of Two Cities?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** A Tale of Two Cities is a novel by Charles Dickens that tells the story of Charles Darnay, a French aristocrat, and Sydney Carton, a drunken lawyer, during the French Revolution. The novel is set in both London and Paris and follows the lives of these two characters as they navigate the political and social upheaval of the time.

The plot of the novel revolves around the themes of love, sacrifice, and the struggle between the forces of revolution and the forces of the established order. Charles Darnay falls in love with Lucie Manette, the daughter of a wealthy French merchant, and marries her. However, he is later captured by the revolutionaries and imprisoned in the Bastille. Sydney Carton, who is deeply in love with Lucie himself, offers to take Darnay's place in the guillotine in order to save him from certain death.

Throughout the novel, Dickens explores the moral ambiguities of the revolution and the dangers of extremism. He also delves into the personal struggles of the characters, as they grapple with their own beliefs and desires in the face of the tumultuous events around them. Ultimately

### Tree Summarize

This cell was run when `documents` held the page "thecodingtrain.com/challenges", so the question and response below refer to that page rather than to the novel.

In [22]:
query_engine = vector_index.as_query_engine(response_mode="tree_summarize")

response = query_engine.query("What is the climate spiral?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** The climate spiral is a visual representation of the changing temperatures over time, illustrating the urgent need to address climate change. It was originally designed by the climate scientist Ed Hawkins. The climate spiral is created by plotting temperature data on a two-dimensional graph, with time on the x-axis and temperature on the y-axis. The resulting spiral shows how temperatures have increased over time, with the spiral becoming tighter and more densely packed as temperatures continue to rise. The climate spiral is often used to illustrate the severity of climate change and the need for urgent action to address it.
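The three response modes differ in how the retrieved chunks are fed to the LLM. The toy sketch below, with a counting stub in place of the model, shows the general idea; it is not LlamaIndex's implementation, and all names in it (`refine`, `pack`, `compact`, `tree_summarize`, `CountingLLM`, `max_chars`, `fanout`) are illustrative. The two `pad_token_id` lines in the Refine output above are consistent with two model calls.

```python
from typing import Callable, List


class CountingLLM:
    """Stub model that records how many times it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer#{self.calls}"


def refine(chunks: List[str], question: str, llm: Callable[[str], str]) -> str:
    """One call per chunk; each call revises the previous answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(f"Question: {question}\nContext: {chunk}\nExisting answer: {answer}")
    return answer


def pack(chunks: List[str], max_chars: int) -> List[str]:
    """Greedily merge chunks so each merged block stays within max_chars."""
    packed, current = [], ""
    for chunk in chunks:
        if current and len(current) + len(chunk) + 1 > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = f"{current}\n{chunk}" if current else chunk
    if current:
        packed.append(current)
    return packed


def compact(chunks: List[str], question: str, llm: Callable[[str], str], max_chars: int) -> str:
    """Pack chunks first, then refine over the (fewer) packed blocks."""
    return refine(pack(chunks, max_chars), question, llm)


def tree_summarize(chunks: List[str], question: str, llm: Callable[[str], str], fanout: int = 2) -> str:
    """Answer over groups of chunks, then over those answers, until one remains."""
    level = list(chunks)
    while True:
        level = [
            llm(f"Question: {question}\nContext: " + "\n".join(level[i:i + fanout]))
            for i in range(0, len(level), fanout)
        ]
        if len(level) <= 1:
            break
    return level[0] if level else ""


chunks = ["a" * 10, "b" * 10, "c" * 10, "d" * 10]
for name, run in [
    ("refine", lambda llm: refine(chunks, "q", llm)),
    ("compact", lambda llm: compact(chunks, "q", llm, max_chars=25)),
    ("tree_summarize", lambda llm: tree_summarize(chunks, "q", llm)),
]:
    stub = CountingLLM()
    run(stub)
    print(name, "made", stub.calls, "model calls")
```

With a slow local model, the number of calls per query is the main cost, which is why the packed variant is a sensible default.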

## Router Query Engine

In [36]:
from llama_index.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts."
    )
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document."
    )
)

### Single Selector

In [37]:
from llama_index.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    service_context=service_context,
    select_multi=False
)

response = query_engine.query("Why is The Tale of Two Cities considered a classic novel?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** The Tale of Two Cities is considered a classic novel for several reasons. Firstly, it is a masterpiece of storytelling, with its intricate plot, vivid characters, and powerful themes. The novel explores complex issues such as revolution, love, and sacrifice, and it does so in a way that is both engaging and thought-provoking.

Secondly, the novel is a historical document that provides a unique insight into the events and people of the French Revolution. Charles Dickens was a master of historical detail, and he brings the period to life in a way that is both accurate and compelling.

Thirdly, the novel is a work of art that is open to interpretation. It has been studied and analyzed by scholars and critics for generations, and it continues to inspire new generations of readers and writers.

Finally, the novel is a testament to the power of storytelling. It is a timeless tale that speaks to the human experience, and it will continue to captivate and inspire readers for generations to come.

### Multi Selector

In [None]:
from llama_index.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    service_context=service_context,
    select_multi=True,
)

# Placeholder: replace the empty string with your question before running this cell.
response = query_engine.query("")

display_response(response)
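The router chooses among tools based on their `description` text: with `select_multi=False` it picks one tool, with `select_multi=True` it may pick several. The real router asks the LLM to make that choice; the toy sketch below swaps in simple word overlap as a stand-in, only to illustrate single versus multi selection. `select_single`, `select_multi`, and `min_overlap` are hypothetical names.

```python
import re

# Tool names and descriptions copied from the cells above
TOOLS = {
    "vector_search": "Useful for searching for specific facts.",
    "summary": "Useful for summarizing an entire document.",
}


def words(text):
    """Lowercased set of alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(query, description):
    """Number of distinct words shared by the query and a tool description."""
    return len(words(query) & words(description))


def select_single(query, tools):
    """Pick the one tool whose description best matches the query."""
    return max(tools, key=lambda name: overlap(query, tools[name]))


def select_multi(query, tools, min_overlap=1):
    """Pick every tool whose description shares at least min_overlap words."""
    return [name for name in tools if overlap(query, tools[name]) >= min_overlap]


print(select_single("summarizing the entire document", TOOLS))
print(select_multi("searching for specific facts and summarizing an entire document", TOOLS))
```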

## SubQuestion Query Engine

In [None]:
from llama_index.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts."
    )
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document."
    )
)

In [None]:
import nest_asyncio
nest_asyncio.apply()
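`nest_asyncio.apply()` is there because a notebook already runs an event loop, and plain `asyncio` refuses to start a second one inside it. The sketch below reproduces that restriction with the standard library only; the function names are illustrative.

```python
import asyncio


async def inner():
    return 42


async def outer():
    """Try to start a nested event loop from inside a running one."""
    coro = inner()
    try:
        asyncio.run(coro)  # raises: a loop is already running in this thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "nested run succeeded"


message = asyncio.run(outer())
print(message)
```

Patching the loop with `nest_asyncio` lifts this restriction, which lets query engines that run sub-queries asynchronously work inside a notebook cell.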

In [None]:
from llama_index.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    service_context=service_context,
    verbose=True,
)

# Placeholder: replace the empty string with your question before running this cell.
response = query_engine.query("")

display_response(response)