In [1]:
import torch
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer, AutoModelForCausalLM
from dotenv import load_dotenv
from huggingface_hub import login
import os
import pandas as pd



# Retrieval-Augmented Generation (RAG) Using Google Gemma-7b-it

In this notebook we will implement the following RAG architecture to create a customized LLM using text from my textbook **Handbook of Regression Modeling** and Google's open-source Gemma-7b-it LLM:

![RAG Architecture](rag_architecture.jpg) 

We will be building two components, an information-retrieval (IR) component and a generation component:
* The IR component acts as a knowledge database of text files.  This database will be used to identify documents or text passages most relevant to the intent of the user query. Embeddings will be used to search this database and return the most relevant passages from the textbook.
* The generation component will feed the results of the IR component into the Gemma LLM as context in order to generate an attempt at a comprehensive natural language response to the user prompt.

## Preparing a dataset from my textbook to work with Gemma-7b-it LLM

My textbook is available open source, and its codebase is in an open GitHub repository.  We will connect to the repository, download the text of the 14 chapters and sections of the textbook, and create a pandas DataFrame with 14 rows and two columns: an ID number for each chapter/section and the text of that chapter/section.

In [2]:
import requests

# chapters are Rmd files with the following names
chapter_list = [
    "01-intro",
    "02-basic_r",
    "03-primer_stats",
    "04-linear_regression",
    "05-binomial_logistic_regression",
    "06-multinomial_regression",
    "07-ordinal_regression",
    "08-hierarchical_data",
    "09-survival_analysis",
    "10-tidy_modeling",
    "11-power_tests",
    "12-further",
    "13-solutions",
    "14-bibliography"
]

# create a function to obtain the text of each chapter
def get_text(chapter: str) -> str:
    # URL on GitHub where the Rmd files are stored
    github_url = f"https://raw.githubusercontent.com/keithmcnulty/peopleanalytics-regression-book/master/r/{chapter}.Rmd"

    result = requests.get(github_url, timeout=30)
    # fail loudly rather than storing an error page as chapter text
    result.raise_for_status()
    return result.text

# iterate over the chapter URLs and pull down the text content    
book_text = []
for chapter in chapter_list:
    chapter_text = get_text(chapter)
    book_text.append(chapter_text)

In [3]:
# write to a dataframe
book_data = dict(chapter = list(range(len(chapter_list))), text = book_text)
book_data = pd.DataFrame.from_dict(book_data)

Most of these text documents are too long to fit into Gemma's context window, so we need to split them into smaller documents in a way that preserves as much semantic sense as possible.

We will use LangChain's `RecursiveCharacterTextSplitter`, with a chunk size of 1000 characters and a chunk overlap of 150 characters.  It tries to break on paragraph and line boundaries before falling back to words, so chunks tend to keep related text together.
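To make the roles of chunk size and chunk overlap concrete, here is a minimal sketch of a fixed-width splitter. It is a simplification for illustration only, not LangChain's implementation: it cuts at exact character offsets and ignores paragraph or line boundaries. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# small numbers make the overlap easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, which is the same idea as the 150-character overlap used in this notebook: text that straddles a boundary still appears whole in at least one chunk.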

In [4]:
# split chapters into chunks of at most 1000 characters
loader = DataFrameLoader(book_data, page_content_column="text")
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)

# examine a document to ensure it looks as we expect
docs[0]

Document(page_content="`r if (knitr::is_latex_output()) '\\\\mainmatter'`\n\n# The Importance of Regression in People Analytics {#inf-model}\n\nIn the 19th century, when Francis Galton first used the term 'regression' to describe a statistical phenomenon (see Chapter \\@ref(linear-reg-ols)), little did he know how important that term would be today.  Many of the most powerful tools of statistical inference that we now have at our disposal can be traced back to the types of early analysis that Galton and his contemporaries were engaged in.  The sheer number of different regression-related methodologies and variants that are available to researchers and practitioners today is mind-boggling, and there are still rich veins of ongoing research that are focused on defining and refining new forms of regression to tackle new problems.", metadata={'chapter': 0})

## Generating embeddings and storing in a vector DB

![Embeddings](embeddings.png)

Since we will use embeddings for the IR component, we need to generate embeddings for our split dataset and then write them to a vector database so that they can be searched.  We will use the *all-MiniLM-L6-v2* model to generate the embeddings and the ChromaDB vector database to store them.  Vector databases limit the number of documents that can be added in a single call, so we will write the documents in batches in case there are too many for one call.

In [5]:
import chromadb
from chromadb.utils import embedding_functions
from chromadb.utils.batch_utils import create_batches
import uuid

In [8]:
# set up the ChromaDB
CHROMA_DATA_PATH = "./chroma_data_regression_book/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "regression_book_docs"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

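# in case docs have already been written; on a first run the collection
# does not exist yet and ChromaDB raises an error, which we can ignore
try:
    client.delete_collection(COLLECTION_NAME)
except Exception:
    pass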

In [9]:
# enable the DB using Cosine Similarity as the distance metric
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)

collection = client.create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)
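With `hnsw:space` set to `cosine`, documents are ranked by cosine distance, which is one minus the cosine similarity between the query embedding and each document embedding. A minimal sketch of that ranking, using made-up two-dimensional vectors in place of real embeddings (the `cosine_distance` helper is hypothetical and written in plain Python for clarity):

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# toy vectors standing in for a query embedding and three document embeddings
query = [1.0, 0.0]
documents = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# smallest distance first, which is the order a nearest-neighbour query returns
ranked = sorted(documents, key=lambda name: cosine_distance(query, documents[name]))
print(ranked)
```

Only the direction of the vectors matters here, not their length, which is why the vector `[2.0, 0.0]` is at distance zero from the query.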

In [10]:
# write text chunks to DB in batches 
batches = create_batches(
    api=client,
    ids=[str(uuid.uuid4()) for _ in range(len(docs))], 
    documents=[doc.page_content for doc in docs], 
    metadatas=[{'source': './handbook_of_regression_modeling', 'row': k} for k in range(len(docs))]
)

for batch in batches:
    print(f"Adding batch of size {len(batch[0])}")
    collection.add(ids=batch[0],
                   documents=batch[3],
                   metadatas=batch[2])

Adding batch of size 578
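Here all 578 chunks fit in a single batch. To illustrate what happens with a larger dataset, the sketch below splits parallel lists into slices of a maximum size, producing tuples in the same position order that the loop above reads (IDs at index 0, metadata at index 2, documents at index 3). It is an illustration of the idea rather than ChromaDB's own code, and `make_batches` and `max_batch_size` are hypothetical names.

```python
def make_batches(ids, documents, metadatas, max_batch_size):
    """Split parallel lists into (ids, embeddings, metadatas, documents) tuples."""
    batches = []
    for start in range(0, len(ids), max_batch_size):
        end = start + max_batch_size
        # index 1 is left as None because no precomputed embeddings are supplied
        batches.append((ids[start:end], None, metadatas[start:end], documents[start:end]))
    return batches


# five toy records with a limit of two per batch
ids = [f"id-{i}" for i in range(5)]
documents = [f"text {i}" for i in range(5)]
metadatas = [{"row": i} for i in range(5)]

batches = make_batches(ids, documents, metadatas, max_batch_size=2)
sizes = [len(batch[0]) for batch in batches]
print(sizes)
```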


Now that our documents are persisted in our vector database, we can try running a query against them.
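`collection.query` returns a dictionary of nested lists, with one inner list per query text, so the passages for a single query sit at index 0. A minimal sketch of unpacking that structure, using a hand-written dictionary with placeholder IDs and passages in place of real query output (`top_passages` is a hypothetical helper, not part of ChromaDB):

```python
# placeholder result with the same shape as a query for one text and two results
example_results = {
    "ids": [["id-a", "id-b"]],
    "distances": None,  # fields not requested via `include` come back as None
    "documents": [["first passage", "second passage"]],
}


def top_passages(results: dict, query_index: int = 0) -> list[tuple[str, str]]:
    """Pair each returned ID with its passage for one of the query texts."""
    return list(zip(results["ids"][query_index], results["documents"][query_index]))


pairs = top_passages(example_results)
for doc_id, passage in pairs:
    print(doc_id, "->", passage)
```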

In [11]:
results = collection.query(
    query_texts=["Which method would you recommend for ordered category outcomes?"],
    n_results=3, 
    include=['documents']
)

results

{'ids': [['371eeda2-c01f-420b-80c6-2b61894d0069',
   '94fc3891-4b37-49b9-a797-e73a11eec739',
   '3360431f-6857-47ac-a329-af10b6528c61']],
 'distances': None,
 'metadatas': None,
 'embeddings': None,
 'documents': [['# Proportional Odds Logistic Regression for Ordered Category Outcomes {#ord-reg}',
   "# Multinomial Logistic Regression for Nominal Category Outcomes\n\n`r if (knitr::is_latex_output()) '\\\\index{multinomial logistic regression|(}'`\nIn the previous chapter we looked at how to model a binary or dichotomous outcome using a logistic function.  In this chapter we look at how to extend this to the case when the outcome has a number of categories that do not have any order to them.  When an outcome has this nominal categorical form, it does not have a sense of direction.  There is no 'better' or 'worse'&zwj;, no 'higher' or 'lower'&zwj;, there is only 'different'&zwj;.\n\n## When to use it\n\n### Intuition for multinomial logistic regression \n\nA binary or dichotomous outcome

## Pipelining the RAG using ChromaDB and Gemma-7b-it

We now have our vector DB in place, so our IR layer is complete.  Next we load the Gemma-7b-it LLM via Hugging Face.  An access token is needed for this.  The model is large and will take some time to load.

In [12]:
load_dotenv()
login(token=os.getenv("HF_TOKEN"))

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/Keith_McNulty/.cache/huggingface/token
Login successful


In [13]:
if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)
    print(x)
else:
    print("MPS not available")

tensor([1.], device='mps:0')


In [14]:
# load the model and tokenizer (with no device argument, the weights are loaded onto the CPU)
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it", padding=True, truncation=True, max_length=512)

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Now that we have loaded the Gemma-7b-it model, we can set up an LLM pipeline.

First, we run the model on its own and ask it a general question, with no retrieved context:

In [15]:
prompt = """
<start_of_turn>user
What food should I try in New Mexico?<end_of_turn>
<start_of_turn>model
"""

# tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt")

# generate the answer
outputs = model.generate(**input_ids, max_new_tokens=512)

# decode the answer
tokenizer.decode(outputs[0], skip_special_tokens=True).split('model\n', 1)[1]

"New Mexico is known for its unique cuisine, blending Native American, Spanish, and Mexican influences. Here are some must-try foods in the Land of Enchantment:\n\n**Traditional Native American Foods:**\n\n* **Indian tacos:** Corn tortillas filled with meat (often mutton or beef), beans, cheese, lettuce, tomato, and red chili powder.\n* **Chiles en nogada:** Layers of red and green chiles, potatoes, and vegetables in a savory sauce.\n* **Sopaipillas:** Puffy fried dough balls often served with honey or jam.\n* **Posole:** Stewed pork in a flavorful broth, served with corn tortillas and red chili powder.\n\n**Spanish and Mexican Influences:**\n\n* **Hatch chiles:** Green and red chiles grown in Hatch, New Mexico, known for their unique flavor and heat.\n* **Red and green chile stew:** A hearty stew made with red and green chiles, vegetables, and meat.\n* **Carne adovada:** Slow-roasted beef marinated in red chile powder.\n* **Biscochitos:** Crispy fried dough cookies dusted with cinnamo

Finally, we define a function that executes our IR layer and our LLM summarization layer.  The function accepts a question and queries it against our ChromaDB, retrieving a defined number of documents based on the smallest distance from the query.  The results are joined together and sent to the LLM as context along with the original question, to generate a summarized result.

In [16]:
def ask_question(question: str, model: AutoModelForCausalLM = model, tokenizer: AutoTokenizer = tokenizer, collection: str = COLLECTION_NAME, n_docs: int = 3) -> str:
    
    # Find close documents in chromadb, using the same embedding function as at write time
    collection = client.get_collection(collection, embedding_function=embedding_func)
    results = collection.query(
       query_texts=[question],
       n_results=n_docs
    )

    # Join the retrieved passages into a single context string
    context = "\n".join(results['documents'][0])

    prompt = f"""
    <start_of_turn>user
    You are an expert on statistics and its applications to People Analytics.  
    Here is a question: {question}\n\n Answer it with reference to the following information and only using the following information: {context}.<end_of_turn>
    <start_of_turn>model
    """

    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt")

    # Generate the answer using the LLM
    outputs = model.generate(**input_ids, max_new_tokens=512)

    # Decode the output and return only the text after the model turn marker
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split('model\n', 1)[1]
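The prompt assembly and answer extraction steps in `ask_question` can be checked without loading the model. The sketch below restates them as two small functions; `build_prompt` and `extract_answer` are hypothetical names, and the decoded string is simulated by removing the turn markers by hand, on the assumption that decoding with `skip_special_tokens=True` strips them.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Wrap the question and retrieved passages in Gemma's turn markers."""
    context = "\n".join(passages)
    return (
        "<start_of_turn>user\n"
        "You are an expert on statistics and its applications to People Analytics.\n"
        f"Here is a question: {question}\n\n"
        "Answer it with reference to the following information and only using "
        f"the following information: {context}.<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


def extract_answer(decoded: str) -> str:
    """Keep only the text after the first 'model' line, as ask_question does."""
    return decoded.split("model\n", 1)[1]


prompt = build_prompt(
    "Which method suits ordered category outcomes?",
    ["Ordinal outcomes have ordered categories", "Proportional odds models are easy to interpret"],
)

# simulate a decoded generation: prompt without turn markers, followed by an answer
simulated = prompt.replace("<start_of_turn>", "").replace("<end_of_turn>", "")
simulated += "Use proportional odds logistic regression."
answer = extract_answer(simulated)
print(answer)
```

Splitting on the first occurrence of `model\n` relies on that exact text not appearing earlier in the question or the retrieved passages. If a passage happened to contain a line ending in "model", the returned text would start from that point instead of from the model's turn.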

## Testing our RAG agent

We test our agent by asking some questions where relevant information is known to exist in the IR layer.

In [17]:
ask_question("What method would you recommend I use to model ordered category outcomes and why?")

'    Sure, here is the answer to the question:\n\nThe text suggests that the recommended method for modeling ordered category outcomes is proportional odds logistic regression. This is because proportional odds models are easy to interpret and are commonly adopted in the field. However, it is important to note that proportional odds models have some underlying assumptions that should be checked before using them.\n\nTherefore, the recommended method for modeling ordered category outcomes is proportional odds logistic regression, but it is important to check the underlying assumptions of this model before using it.'

In [18]:
ask_question('What should I look out for when using Proportional Odds regression?')

'    When using Proportional Odds regression, you should look out for the following:\n\n    * **Test of Proportional Odds Assumption:** Before running the model, you should test the proportional odds assumption to see if the model is appropriate for your data. If the assumption fails, you should consider alternative models for ordinal outcomes.\n    * **Variable Removal:** If the assumption fails, you may consider removing variables that do not impact the outcome. However, whether or not you are comfortable doing this will depend very much on the impact on overall model fit.\n    * **Alternative Models:** If you are not comfortable removing variables, you should consider alternative models for ordinal outcomes. The most common alternatives include models such as cumulative logit models, rank-based models, and threshold-based models.'

In [19]:
ask_question('What type of modeling is most likely to add value in People Analytics?')

'    The text suggests that the most common type of modeling in people analytics is regression modeling.'

In [20]:
ask_question('Can you please explain what is meant by the term "inferential modeling"?')

'    Sure, here is the explanation of the term "inferential modeling" as per the text:\n\n**Inferential modeling** is the process of learning about a relationship (or lack of relationship) between the data in $X$ and $y$ and using that to *describe* a relationship (or lack of relationship) between our constructs $\\mathscr{C}$ and our outcome $\\mathscr{O}$ that is valid to a high degree of statistical certainty on the population $\\mathscr{P}$. This process includes testing a proposed mathematical relationship, comparing that relationship against other proposed relationships, describing the relationship statistically, and determining whether the relationship (or certain elements of it) can be generalized from the sample set $S$ to the population $\\mathscr{P}$.'

In [21]:
ask_question('Where did the term regression originate from?')

"    Sure, here is the answer to the question:\n\nThe term 'regression' originated from the Latin term 'regressio', which approximately means 'go back'. This term was first used by Francis Galton in the late 1800s to describe a statistical phenomenon that he discovered while researching the relationship between the heights of a population of almost 1000 children and the average height of their parents."

In [22]:
ask_question('How do I get started using R for regression modeling?')

'    Sure, here is the answer to the question:\n\nTo get started using R for regression modeling, follow these steps:\n\n1. Download and install the latest version of R from https://www.r-project.org/. Ensure that the version suits your operating system.\n2. Download the latest version of the RStudio IDE from https://rstudio.com/products/rstudio/ and view the video on that page to familiarize yourself with its features.\n3. Open RStudio and play around.\n\nThe initial stages of using R can be challenging, mostly due to the need to become familiar with how R understands, stores and processes data. Extensive trial and error is a learning necessity. Perseverance is important in these early stages, as well as an openness to seek help from others either in person or via online forums.'

In [23]:
ask_question('What factors determine statistical power?')

'    Sure, here is the answer to the question:\n\nThe factors that determine statistical power are the effect size, the sample size, the level of significance, and the standard deviation of the population.\n\nIn order to achieve a statistically significant result, the observed effect size must be greater than a certain multiple of the standard error above the null hypothesis mean. This multiple is called $z_{\\alpha}$.'

We also test to ensure that questions are only answered based on content in the book.

In [24]:
ask_question('What is the standard model of Physics?')

'    The text does not provide information about the standard model of Physics, therefore I cannot answer this question.'