# Answer Machine Learning questions using Gemma and LlamaIndex
- LLMs can be fine-tuned for many common tasks, but they still have no knowledge of specific content they were not trained on.
- **RAG (Retrieval Augmented Generation)** is a popular approach to such knowledge-intensive tasks (legal, medical, technical, ...).
- This notebook demonstrates how to quickly build a RAG pipeline that answers **machine learning questions** using additional knowledge from a machine learning book.
- Let's see how Gemma answers questions with and without this additional information.

## Before we start
- RAG takes an input and retrieves a set of relevant documents given a source
- The process to build a RAG with LlamaIndex generally consists of two stages:
    + **Indexing stage:** LlamaIndex prepares the knowledge base by ingesting data and converting it into Documents.
    + **Querying stage:** Relevant context is retrieved from the knowledge base to assist the model in responding to queries.
 
![RAG pipeline with LlamaIndex](https://global.discourse-cdn.com/business7/uploads/streamlit/original/3X/a/e/ae647f8c23a1b60fbcf59fd7c4f0a33aee9ce255.png)
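
To make the two stages concrete before using LlamaIndex, here is a minimal sketch in plain Python. It is a toy, not how LlamaIndex works internally: the helper names (`tokenize`, `build_index`, `retrieve`) and the three sample sentences are made up for illustration, and documents are ranked by the number of words they share with the query instead of by embedding similarity.

```python
def tokenize(text):
    """Toy representation: the set of lowercase words in the text."""
    return set(text.lower().split())

def build_index(docs):
    # Indexing stage: store each document next to its searchable representation
    return [(doc, tokenize(doc)) for doc in docs]

def retrieve(toy_index, query, top_k=1):
    # Querying stage: rank stored documents by how many words they share with the query
    query_words = tokenize(query)
    ranked = sorted(toy_index, key=lambda item: len(query_words & item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

toy_index = build_index([
    "pandas provides the DataFrame object for tabular data",
    "NumPy arrays support fast vectorized arithmetic",
    "matplotlib is used for plotting and visualization",
])
context = retrieve(toy_index, "which library is used for plotting")
print(context)
```

In the rest of the notebook, the word sets are replaced by vectors from an embedding model and the list is replaced by a ChromaDB collection.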

## 1. Indexing Stage
In this stage, we collect data from a machine learning book and store it in a database.

### 1.1. Data Collection
First, we extract information from the book [Python for Data Analysis](https://wesmckinney.com/book/) and store each chapter as an **HTML file** in the `data` folder.

In [1]:
from bs4 import BeautifulSoup
import requests
import os
from tqdm import tqdm

url = "https://wesmckinney.com/book/"

# Collect the URLs of the book's chapter pages from the sidebar
def extract_pages_url():
    response = requests.get(url)
    html = response.text

    soup = BeautifulSoup(html, 'html.parser')
    chapter_links = []
    
    for li_tag in soup.find_all('li', class_="sidebar-item"):
        link = li_tag.find('a')
        try:
            chapter_links.append("https://wesmckinney.com" + link['href'])
        except (TypeError, KeyError):
            # Skip sidebar items that have no link or no href attribute
            pass

    return chapter_links[2:]  # skip the first 2 sections, as they are just introductions

def extract_page(page_url):
    response = requests.get(page_url)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    
    # Remove script and style tags
    for script in soup(['script', 'style']):
        script.extract()

    # Get rid of empty tags
    for tag in soup.find_all():
        if not tag.text.strip():
            tag.extract()
    
    content = soup.find("main", class_="content")
    return content.prettify()

In [2]:
pages_url = extract_pages_url()
pages_url

['https://wesmckinney.com/book/preliminaries',
 'https://wesmckinney.com/book/python-basics',
 'https://wesmckinney.com/book/python-builtin',
 'https://wesmckinney.com/book/numpy-basics',
 'https://wesmckinney.com/book/pandas-basics',
 'https://wesmckinney.com/book/accessing-data',
 'https://wesmckinney.com/book/data-cleaning',
 'https://wesmckinney.com/book/data-wrangling',
 'https://wesmckinney.com/book/plotting-and-visualization',
 'https://wesmckinney.com/book/data-aggregation',
 'https://wesmckinney.com/book/time-series',
 'https://wesmckinney.com/book/modeling',
 'https://wesmckinney.com/book/data-analysis-examples',
 'https://wesmckinney.com/book/advanced-numpy',
 'https://wesmckinney.com/book/ipython']

In [3]:
# Create the output folder once, before the download loop
os.makedirs("./data", exist_ok=True)

for page_url in tqdm(pages_url, desc="Collecting data"):
    section_name = page_url.split("/")[-1]
    cleaned_html = extract_page(page_url)

    # Write cleaned HTML content to a new file
    with open(f'./data/wesmckinney_{section_name}.html', 'w', encoding='utf-8') as f:
        f.write(cleaned_html)

Collecting data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:13<00:00,  1.11it/s]


In [4]:
!ls ./data

wesmckinney_accessing-data.html
wesmckinney_advanced-numpy.html
wesmckinney_data-aggregation.html
wesmckinney_data-analysis-examples.html
wesmckinney_data-cleaning.html
wesmckinney_data-wrangling.html
wesmckinney_ipython.html
wesmckinney_modeling.html
wesmckinney_numpy-basics.html
wesmckinney_pandas-basics.html
wesmckinney_plotting-and-visualization.html
wesmckinney_preliminaries.html
wesmckinney_python-basics.html
wesmckinney_python-builtin.html
wesmckinney_time-series.html


### 1.2. Data Ingestion 
In this step, we create a vector database with LlamaIndex.
- First, we load the documents from the book chapters extracted above.
- Second, we use an embedding model to convert the documents into vector embeddings.
- Third, we store these document embeddings in a database. In this notebook we use ChromaDB, a local database.

In [5]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.readers.file import HTMLTagReader, PDFReader

In [6]:
# read documents from folder data
documents = SimpleDirectoryReader("./data", 
                                  file_extractor={".pdf": PDFReader(), 
                                                  ".html": HTMLTagReader(tag='main', ignore_no_id=True)
                                                 }
                                 ).load_data()

In [7]:
len(documents[0].text), len(documents)

(56923, 15)

A document is a chapter in our reference book, so it is quite long.
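
Because a whole chapter is too long to embed as a single vector, long documents are split into smaller pieces (nodes) before embedding, which is why the indexing cell further down reports many more embeddings than documents. The sketch below illustrates the idea with a simple character-based splitter. It is an assumption-laden illustration only: `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names, and LlamaIndex's own splitter works on sentences and tokens rather than raw characters.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# A stand-in "chapter" of 500 characters (digits repeated) to show the windowing
sample_chapter = "".join(str(i % 10) for i in range(500))
sample_chunks = split_into_chunks(sample_chapter, chunk_size=200, overlap=50)
print(len(sample_chunks), [len(c) for c in sample_chunks])
```

Smaller chunks make retrieval more precise, while the overlap reduces the chance that the answer to a question is split across two chunks.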

In [8]:
documents[0].metadata

{'tag': 'main',
 'tag_id': 'quarto-document-content',
 'file_path': '/workspace/GemmaRAG/data/wesmckinney_accessing-data.html',
 'file_name': 'wesmckinney_accessing-data.html',
 'file_type': 'text/html',
 'file_size': 143688,
 'creation_date': '2024-03-12',
 'last_modified_date': '2024-03-12'}

In [9]:
print(documents[0].text[:1000])

6
     

      Data Loading, Storage, and File Formats
This Open Access web version of
     
      Python for Data Analysis 3rd Edition
     
     is now available as a companion to the
     
      print and digital editions
     
     . If you encounter any errata,
     
      please report them here
     
     . Please note that some aspects of this site as produced by Quarto will differ from the formatting of the print and eBook versions from O’Reilly.
    

     If you find the online edition of the book useful, please consider
     
      ordering a paper copy
     
     or a
     
      DRM-free eBook
     
     to support the author. The content from this website may not be copied or reproduced. The code examples are MIT licensed and can be found on GitHub or Gitee.
Reading data and making it accessible (often called
  
   data loading
  
  ) is a necessary first step for using most of the tools in this book. The term
  
   parsing
  
  is also sometimes used to describe loading

In [10]:
from llama_index.embeddings.instructor import InstructorEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.core import Settings
import chromadb
import torch

In [11]:
# Load document embedding model
device_type = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embed_model = InstructorEmbedding(model_name="intfloat/multilingual-e5-large", cache_folder="./models", device=device_type)
embed_model

load INSTRUCTOR_Transformer
max_seq_length  512


InstructorEmbedding(model_name='intfloat/multilingual-e5-large', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7fa653a38dc0>, query_instruction=None, text_instruction=None, cache_folder='./models')

In [12]:
# Init chroma db
# Creates a persistent instance of Chroma that saves to disk
chroma_client = chromadb.PersistentClient(path="./DB")
# Get or create a collection with the given name and metadata.
chroma_collection = chroma_client.get_or_create_collection("chroma_db")
chroma_collection, chroma_collection.count()

(Collection(name=chroma_db), 1023)

Embeddings are stored within a ChromaDB collection.
During query time, the index uses ChromaDB to query for the top k most similar nodes.
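
The top-k lookup can be pictured as scoring every stored vector against the query vector and keeping the best k. The following sketch does this by brute force with cosine similarity. This is an illustration under assumptions: the node ids and 3-dimensional vectors are invented, cosine similarity is only one common choice (the metric ChromaDB actually applies depends on how the collection is configured), and a real vector database uses an approximate index instead of scanning every vector.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_nodes(query_vec, node_vecs, k=2):
    """Return the k (score, node_id) pairs with the highest similarity to the query."""
    scored = ((cosine_similarity(query_vec, vec), node_id) for node_id, vec in node_vecs.items())
    return heapq.nlargest(k, scored)

# Hypothetical stored embeddings, keyed by node id
node_vecs = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.9, 0.1, 0.0],
    "node-c": [0.0, 1.0, 0.0],
}
results = top_k_nodes([1.0, 0.0, 0.0], node_vecs, k=2)
for score, node_id in results:
    print(node_id, round(score, 3))
```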

In [13]:
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [14]:
# Create index from documents.
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model, show_progress=True
)

Parsing nodes:   0%|          | 0/15 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/341 [00:00<?, ?it/s]

## 2. Querying Stage

In [15]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import PromptTemplate
from getpass import getpass

In [16]:
# To load the Gemma model, we need a Hugging Face access token and approved access to Gemma
HF_ACCESS_TOKEN = getpass('Huggingface access token')

Huggingface access token ········


In [37]:
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it",
                                          token=HF_ACCESS_TOKEN)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", 
                                             device_map=device,
                                             # torch_dtype=torch.float16,
                                             token=HF_ACCESS_TOKEN
                                            )

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [38]:
system_prompt = "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided."
llm_hf = HuggingFaceLLM(
    context_window=8192,
    max_new_tokens=256,
    generate_kwargs={
        "temperature": 0.7,
        "do_sample": True
    },
    device_map=device,
    system_prompt=system_prompt,
    model=model,
    tokenizer=tokenizer
)

In [39]:
query_engine_hf = index.as_query_engine(llm=llm_hf)
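
Conceptually, a query engine like the one created above retrieves the most relevant chunks for a question and places them in the prompt ahead of the question, so the model can answer from that context. The sketch below shows this assembly step in isolation. The function name `build_rag_prompt` and the template wording are assumptions made for illustration, not the exact template LlamaIndex sends to the model.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved text chunks and the user question into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer: "
    )

# Placeholder chunks stand in for text retrieved from the vector store
example_prompt = build_rag_prompt(
    "what is linear regression",
    ["<first retrieved chunk>", "<second retrieved chunk>"],
)
print(example_prompt)
```

Without retrieval, the model sees only the question, which is the situation compared at the end of this notebook.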

In [40]:
%%time
response_hf = query_engine_hf.query("what is linear regression")

CPU times: user 2.21 s, sys: 24.5 ms, total: 2.23 s
Wall time: 2.52 s


In [41]:
print(response_hf.response)


Linear regression is a statistical method that models a relationship between a continuous dependent variable and one or more independent variables.


In [43]:
response_hf.metadata

{'8c7abda3-9ea9-4567-8a60-e3e61e3af804': {'tag': 'main',
  'tag_id': 'quarto-document-content',
  'file_path': '/workspace/GemmaRAG/data/wesmckinney_modeling.html',
  'file_name': 'wesmckinney_modeling.html',
  'file_type': 'text/html',
  'file_size': 90881,
  'creation_date': '2024-03-10',
  'last_modified_date': '2024-03-10'}}

Let's see how Gemma works without RAG.

In [80]:
from transformers import pipeline

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.7,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=1000
)

In [81]:
question = "what is linear regression?"
prompt = f"""
<start_of_turn>user
You are a Q&A assistant. Your goal is to answer questions as accurately as possible
Question: {question}
<end_of_turn>
<start_of_turn>model
 """
print(prompt)


<start_of_turn>user
You are a Q&A assistant. Your goal is to answer questions as accurately as possible
Question: what is linear regression?
<end_of_turn>
<start_of_turn>model
 


In [84]:
%%time
respones_without_rag = text_generation_pipeline(prompt)[0]

CPU times: user 2.1 s, sys: 8.82 ms, total: 2.11 s
Wall time: 2.13 s


In [85]:
print(respones_without_rag['generated_text'])

Linear regression is a method of data analysis that is used to find a relationship between two variables. It is a regression analysis that uses a least squares approach to find a linear relationship between two variables. It is also a method of visualization that uses a scatter plot to display a linear relationship between two variables.


## References
1. [Python for Data Analysis](https://wesmckinney.com/book/)
2. [Prompt Engineering Guide - RAG](https://www.promptingguide.ai/techniques/rag)