# Building a RAG Answer Generator with LangChain

In this example, we'll build an AI answer generator from start to finish. We will use LangChain, OpenAI, and the Pinecone vector database to build an engine that can draw on external documents using **R**etrieval **A**ugmented **G**eneration (RAG).

The inputs are a set of information files from a specified folder and a separate file with a list of questions. Each question is answered independently, in its own context. The answers are written to a CSV file together with a reference to where they were found in the provided files. The use case assumed in this example, and reflected in some of the prompts, is answering a company security assessment questionnaire based on that company's policy and procedure documents.

The example assumes that the documents are organized into blocks, each preceded by a line that starts with `##` followed by the block name. The blocks themselves are broken into chunks to make embedding and processing possible.

By the end of the example, we'll have a functioning answer generator built on a RAG pipeline.

### Before you begin

You'll need to get an [OpenAI API key](https://platform.openai.com/account/api-keys) and [Pinecone API key](https://app.pinecone.io).

### Prerequisites

Before we start building the answer generator, we need to install some Python libraries. Here's a brief overview of what each library does:

- **langchain**: A framework for building applications on top of large language models. We'll use it to connect the chat model, the embedding model, and the vector store.
- **openai**: The official OpenAI Python client. LangChain uses it to call the OpenAI API for embeddings and chat completions.
- **pinecone-client**: The official Pinecone Python client. We'll use it to store the knowledge base in a vector database and query it.
- **tiktoken**: OpenAI's tokenizer library, which LangChain's OpenAI embedding wrapper uses to count tokens.
- **python-dotenv**: Reads the `.env` file that contains our environment variables.
- **pandas** and **tqdm**: Used later for batching the sections and for showing progress bars.

You can install these libraries using pip like so:

In [53]:
!pip install -qU \
    langchain \
    openai \
    pinecone-client \
    tiktoken \
    python-dotenv \
    pandas \
    tqdm

### Set Up OpenAI

We will rely heavily on the LangChain library to bring together the components needed for our answer generator. To begin, we initialize a `ChatOpenAI` object, the chat model that will later generate the answers. For this we need an [OpenAI API key](https://platform.openai.com/account/api-keys), which is read from the `.env` file.

In [54]:
import os
from dotenv import load_dotenv
from langchain.chat_models import ChatOpenAI

# Load the environment variables from the .env file
load_dotenv()

if not os.getenv("OPENAI_API_KEY"):
    print("OPENAI_API_KEY Not Defined")

open_api_key = os.getenv("OPENAI_API_KEY", "OPENAI_API_KEY Not Defined")

# Chat model used later to answer the questions
chat = ChatOpenAI(
    openai_api_key=open_api_key,
    model='gpt-3.5-turbo'
)
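
The check above only prints a warning, so the notebook keeps running with a placeholder key and fails later at the first API call. A stricter variant could stop immediately instead. The sketch below uses a hypothetical helper, `require_env`, and a placeholder variable name, `EXAMPLE_API_KEY`; neither is part of the original notebook.

```python
import os


def require_env(*names):
    """Return the values of the named environment variables, or raise if any is missing."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return [os.environ[name] for name in names]


# Demo with a placeholder variable so the sketch runs anywhere.
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
(example_key,) = require_env("EXAMPLE_API_KEY")
print(example_key)
```

In the notebook, the same helper could be called with `"OPENAI_API_KEY"` and `"PINECONE_API_KEY"` right after `load_dotenv()`.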

In [55]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

### Task 3.0 Read Files

In [57]:
def read_files_from_folder(folder_path):
    files_data = []

    # List all files in the given folder
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)

        # Check if it's a file and not a directory
        if os.path.isfile(file_path):
            # Open and read the file
            with open(file_path, 'r') as file:
                file_text = file.read()
                files_data.append({'file_name': file_name, 'file_text': file_text})

    return files_data

# Example usage
folder_path = "/Users/leo/text"
files = read_files_from_folder(folder_path)

print(len(files))
#for file in files:
#    print(file['file_name'])

15


### Task 3.1 Split Text Into Sections

Split each file into sections. A section starts with a header line of the form `## <section name>`. Text before the first header is dropped when `IGNORE_HEADER` is `True`; otherwise it is kept as a section named `FILE HEADER`.

In [58]:
import re

IGNORE_HEADER = True

def split_into_blocks(text):
    # Regular expression pattern to find the block markers, assuming they end with a newline
    pattern = r'##\s*(.*?)\n'

    text = text.strip()
    if not IGNORE_HEADER and not text.startswith('##'):
        text = '## FILE HEADER\n' + text
    
    # Split the text based on the pattern
    parts = re.split(pattern, text)

    # re.split returns the text before the first marker as the first element; drop it
    parts = parts[1:]

    # Create a list of dictionaries from the split parts.
    # After dropping the first element, even-indexed elements are section names
    # and odd-indexed elements are section texts.
    sections = [{'section_name': name, 'section_text': body} for name, body in zip(parts[0::2], parts[1::2])]

    return sections

# Example usage
test_text = """
Header of the file that may or may not be ignored depending on IGNORE_HEADER flag
## Introduction
This is the introduction section.
It has multiple lines, etc.
## Methodology
Here we describe our methodology.
## Results
Here are the results.
"""
#test_result = split_into_blocks(test_text)
#test_result
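
The function relies on a detail of `re.split`: when the pattern contains a capturing group, the captured text is returned in the result list alongside the pieces between matches. A minimal illustration with a made-up two-block sample:

```python
import re

sample = "## Intro\nFirst block.\n## Scope\nSecond block.\n"

# The capturing group (.*?) makes re.split keep each section name in the result.
parts = re.split(r'##\s*(.*?)\n', sample)
print(parts)
# ['', 'Intro', 'First block.\n', 'Scope', 'Second block.\n']

# Element 0 is the text before the first marker; after that, names and texts alternate.
pairs = list(zip(parts[1::2], parts[2::2]))
print(pairs)
# [('Intro', 'First block.\n'), ('Scope', 'Second block.\n')]
```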

### Split Sections Into Chunks

Blocks that are longer than the chunk size are split into overlapping chunks. Each chunk becomes its own section, named after its block with a `#<n>` suffix.

In [62]:
from langchain.text_splitter import CharacterTextSplitter

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=5000,
        chunk_overlap=400,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks

def process_files(files):
    processed_data = []
    for file in files:
        blocks = split_into_blocks(file['file_text'])
        for block in blocks:
            # Split the block text into chunks
            chunks = get_text_chunks(block['section_text'])
            # Number the chunks from 1 so each section name is unique within its block
            for chunk_index, chunk in enumerate(chunks, start=1):
                processed_data.append({
                    'file_name': file['file_name'],
                    'section_name': block['section_name'] + ' #' + str(chunk_index),
                    'section_text': chunk
                })
    return processed_data

def sort_files_by_name(files):
    # Sort by the numeric prefix before the first '-' in the file name.
    # This assumes every file name starts with a number followed by '-'.
    return sorted(files, key=lambda x: float(x['file_name'].split('-')[0]))

files = sort_files_by_name(files)

sections = process_files(files)

print(len(sections))
#for section in sections:
#    print(section['file_name'], section['section_name'])

96
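
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sliding-window chunker. It is not LangChain's algorithm: `CharacterTextSplitter` splits on the separator and merges the pieces, whereas this sketch cuts at fixed character offsets. The function name `sliding_chunks` is hypothetical and only illustrates what overlap means.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.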


### Task 4: Building the Knowledge Base

We now have a set of file chunks that can serve as the knowledge base for our answer engine. Our next task is to transform the chunks into a form the engine can search. To do this we need an embedding model and a vector database.

We begin by initializing our connection to Pinecone, which requires a [free API key](https://app.pinecone.io).

In [67]:
from pinecone import Pinecone

if not os.getenv("PINECONE_API_KEY"):
    print("PINECONE_API_KEY Not Defined")
    
# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv("PINECONE_API_KEY", "PINECONE_API_KEY Not Defined")

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [68]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

Then we initialize the index. We will be using OpenAI's `text-embedding-ada-002` model for creating the embeddings, so we set the `dimension` to `1536`.

In [69]:
import time

index_name = 'llama-2-rag'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
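
A note on the `dotproduct` metric chosen above: for vectors of unit length, the dot product equals the cosine similarity. OpenAI documents its embeddings as normalized to length 1; assuming that holds for `text-embedding-ada-002`, the two metrics rank results identically. The sketch below checks the identity on two made-up 2-dimensional vectors.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Toy vectors, invented for illustration only.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

# For unit-length vectors the two similarity measures agree.
print(math.isclose(dot(a, b), cosine(a, b)))
```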

Our index is now ready, but it's empty. It is a vector index, so it needs vectors. As mentioned, we will use OpenAI's `text-embedding-ada-002` model to create these vector embeddings. We can access it via LangChain like so:

In [70]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")


We're now ready to embed and index all our data! We do this by looping through our document sections, embedding them, and inserting everything in batches.

In [71]:
from tqdm.auto import tqdm  # for progress bar
import pandas as pd
import uuid

data = pd.DataFrame(sections)

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate a random unique id for each chunk
    ids = [str(uuid.uuid4()) for _ in range(len(batch))]

    # get text to embed
    texts = [x['section_text'] for _, x in batch.iterrows()]
    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['section_text'],
         'source': x['file_name'],
         'title': x['section_name']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))


NOTE: **The upserted vectors may not be visible immediately.** It can take a moment before they are counted, so we check that the vector index has been populated using `describe_index_stats`, as before:

In [75]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 96}},
 'total_vector_count': 96}
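
Instead of re-running the cell by hand until the count is right, the check can be wrapped in a small polling loop. This is a sketch only: `wait_for_vector_count` is a hypothetical helper, and the demo feeds it made-up counts rather than calling Pinecone. Against the real index, the callable would read `total_vector_count` from `index.describe_index_stats()`, assuming the response supports key access as the printed output suggests.

```python
import time


def wait_for_vector_count(get_count, expected, timeout=60.0, interval=1.0):
    """Poll get_count() until it reports at least `expected` vectors or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        count = get_count()
        if count >= expected:
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Only {count} of {expected} vectors visible after {timeout}s")
        time.sleep(interval)


# Stand-in for the index: reports a growing count on each call (made-up numbers).
fake_counts = iter([0, 40, 96])
final = wait_for_vector_count(lambda: next(fake_counts), expected=96, interval=0.01)
print(final)
```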

#### Retrieval Augmented Generation

We've built the knowledge base. Now it's time to connect it to our engine. To do that, we return to LangChain and build an augmented prompt that combines each question with the retrieved context.

To use LangChain here we need to load the LangChain abstraction for a vector index, called a `vectorstore`. We pass in our vector `index` to initialize the object.

In [76]:
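# Import under an alias so it does not shadow the Pinecone client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index, embed_model.embed_query, text_field
)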
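
Conceptually, a similarity search embeds the query and returns the stored vectors that score highest against it. The brute-force sketch below shows that idea on invented 2-dimensional vectors. It is not how Pinecone works internally, and the `top_k` helper and sample texts are hypothetical.

```python
def top_k(query_vec, items, k):
    """Return the k (score, text) pairs with the highest dot-product score."""
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in items
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "index": each entry is (text, embedding). Values are made up.
items = [
    ("password policy", [0.9, 0.1]),
    ("backup schedule", [0.1, 0.9]),
    ("access reviews", [0.7, 0.3]),
]

hits = top_k([1.0, 0.0], items, k=2)
print([text for _, text in hits])
# ['password policy', 'access reviews']
```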



### Read Questions

In [77]:
def read_questions(file_path):
    lines = []
    with open(file_path, 'r') as file:
        for line in file:
            stripped_line = line.strip()
            if stripped_line and not stripped_line.startswith('##'):
                lines.append(stripped_line)
    return lines

file_path = "/Users/leo/question/questions.txt"
questions = read_questions(file_path)
#print(questions)
print(len(questions))

137


### Answer All Questions One by One

In [78]:
def augment_prompt(query: str):
    # get the top 2 results from the knowledge base
    results = vectorstore.similarity_search(query, k=2)
    
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    metadata_ref = " ### ".join([x.metadata['source'] + '; ' + x.metadata['title'] for x in results])
    #print(metadata_ref)
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query. If there is no answer, answer "No answer in the provided files".

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt, metadata_ref
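
To see what `augment_prompt` produces without calling Pinecone, the prompt assembly can be exercised with stand-in results. In this sketch, `FakeDoc` mimics only the two attributes the function reads (`page_content` and `metadata`), `build_prompt` is a hypothetical variant that takes the results as an argument, and the sample documents are invented.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document with the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_prompt(query, results):
    """Join retrieved texts into a context block and list their sources."""
    source_knowledge = "\n".join(doc.page_content for doc in results)
    reference = " ### ".join(
        f"{doc.metadata['source']}; {doc.metadata['title']}" for doc in results
    )
    prompt = (
        "Using the contexts below, answer the query. "
        'If there is no answer, answer "No answer in the provided files".\n\n'
        f"Contexts:\n{source_knowledge}\n\n"
        f"Query: {query}"
    )
    return prompt, reference


# Invented sample results, for illustration only.
results = [
    FakeDoc("Passwords rotate every 90 days.", {"source": "1-policy.txt", "title": "Passwords #1"}),
    FakeDoc("Backups run nightly.", {"source": "2-backup.txt", "title": "Backups #1"}),
]
prompt, reference = build_prompt("How often do passwords rotate?", results)
print(prompt)
print(reference)
```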

In [79]:
import re

def split_yes_no_string(text):
    # Check whether the string starts with "Yes" or "No" (any case) followed by ',' or '.'
    # re.DOTALL lets the second group capture answers that span several lines
    match = re.match(r'^(yes|no)[,.]\s*(.*)', text, re.IGNORECASE | re.DOTALL)

    if match:
        # Normalize the first group to "Yes" or "No"; the rest is the answer text
        part1 = match.group(1).capitalize()
        part2 = match.group(2).strip()
    else:
        # No yes/no prefix: mark the first part as 'N/A' and keep the whole string
        part1 = 'N/A'
        part2 = text.strip()

    # Capitalize only the first letter of the answer text, if it exists
    if part2:
        part2 = part2[0].upper() + part2[1:]

    return part1, part2

# Example usage
#text = "yes,    this is a sample text. All is well.        "
#part1, part2 = split_yes_no_string(text)
#print(f"Part 1: '{part1}'")
#print(f"Part 2: '{part2}'")


In [80]:
import csv
csv_data = []

for question in tqdm(questions):
    messages = [
        SystemMessage(content="You are a compliance officer at a company answering a vendor assessment questionnaire."),
    ]

    content, reference = augment_prompt(question)
    
    # create a new user prompt
    prompt = HumanMessage(
        content=content
    )
    # add to messages
    messages.append(prompt)
    
    res = chat(messages)

    response_content = res.content

    yesNo, text_answer = split_yes_no_string(response_content)
    # Append the question, response, and reference to csv_data
    
    csv_data.append([question, yesNo, text_answer, reference])
    
with open('/Users/leo/question/answers.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    # Write the header
    writer.writerow(['Question', 'Yes/No', 'Content', 'Reference'])
    # Write the data
    writer.writerows(csv_data)
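
Once the file is written, a quick tally of the `Yes/No` column shows how many questions got a direct answer. The sketch below reads from an in-memory sample with the same header as the file above; the rows are invented, and for the real file you would open `answers.csv` instead of using `io.StringIO`.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for answers.csv with made-up rows.
sample_csv = io.StringIO(
    "Question,Yes/No,Content,Reference\r\n"
    "Is MFA enforced?,Yes,MFA is required for all staff.,1-policy.txt; Access #1\r\n"
    "Is data sold?,No,Data is never sold.,2-privacy.txt; Sharing #1\r\n"
    "Describe backups.,N/A,Backups run nightly.,3-backup.txt; Schedule #1\r\n"
    "Is access reviewed?,Yes,Reviews happen quarterly.,1-policy.txt; Access #2\r\n"
)

rows = list(csv.DictReader(sample_csv))
counts = Counter(row["Yes/No"] for row in rows)
print(counts["Yes"], counts["No"], counts["N/A"])
# 2 1 1
```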



In [66]:
# Clean up: delete the index once it is no longer needed
pc.delete_index(index_name)

---