### LangChain local LLM RAG example with self chunking and reranking (Cohere)
### For LangSmith users (requires API key)
Utilising LangChain v0.1

This notebook demonstrates the use of LangChain for Retrieval Augmented Generation (RAG) on Linux with Nvidia's CUDA. The LLMs are run locally using Ollama.

It introduces self-chunking (where we split our documents into chunks ourselves) and then reranking the retrieved results before passing them to the LLM.

Models tested:
- Llama 2
- Mistral 7B
- Mixtral 8x7B
- Neural Chat 7B
- Orca 2
- Phi-2
- Solar 10.7B
- Yi 34B


See the [README.md](README.md) file for help on how to set up your environment to run this.

In [25]:
# Select your model here, put the name of the model in the ollama_model_name variable
# Ensure you have pulled them or run them so Ollama has downloaded them and can load them (which it will do automatically)

# Ollama installation (if you haven't done it yet): $ curl https://ollama.ai/install.sh | sh
# Models need to be running in Ollama for LangChain to use them, to test if it can be run: $ ollama run mistral:7b-instruct-q6_K

ollama_model_name = "mistral:7b-instruct-q6_K"
# "llama2:7b-chat-q6_K"
# "mistral:7b-instruct-q6_K"
# "mixtral:8x7b-instruct-v0.1-q4_K_M"
# "neural-chat:7b-v3.3-q6_K"
# "orca2:13b-q5_K_S"
# "phi" or try "phi:chat"
# "solar:10.7b-instruct-v1-q5_K_M"
# Can't run "yi:34b-chat-q3_K_M" or "yi:34b-chat-q4_K_M" - never stopped with inference

In [26]:
# Our LangSmith API key is stored in apikeys.py
# Store your LangSmith key in a variable called LangSmith_API

from apikeys import LangSmith_API
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = LangSmith_API

# Project Name
os.environ["LANGCHAIN_PROJECT"] = "LangChain RAG Linux Chunking"

In [27]:
# Load the LLM with Ollama. The temperature is left at its default here;
# pass temperature=0.1 (the commented-out argument below) to make the output less creative

from langchain_community.llms import Ollama
llm = Ollama(model=ollama_model_name) #, temperature=0.1)

In [28]:
# Quick test of the LLM with a general question before we start doing RAG
llm.invoke("why is the sky blue?")

# Note: This line would not complete for Yi-34B. Need to work out why inference never finishes (the same prompt works fine when run directly in Ollama).

"\nThe sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when light from the sun passes through Earth's atmosphere. The light is made up of various colors, with each color having a different wavelength. Blue light has a shorter wavelength than other colors, such as red and orange light, so it is more easily scattered in all directions by the molecules and particles in the atmosphere. This scattering effect causes the blue light to spread out evenly across the sky, giving it a uniform appearance."

In [29]:
# Embeddings will be based on the Ollama loaded model

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model=ollama_model_name)

In [30]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('Data', glob="**/*.docx")

In [31]:
# Load documents

docs = loader.load()

In [32]:
docs

 Document(page_content="Thundertooth\n\nEmbraced by the futuristic city and its inhabitants, Thundertooth found a sense of purpose beyond merely satisfying his hunger. Inspired by the advanced technology surrounding him, he decided to channel his creativity into something extraordinary. With the help of the city's brilliant engineers, Thundertooth founded a one-of-a-kind toy factory that produced amazing widgets – magical, interactive toys that captivated the hearts of both children and adults alike.\n\nThundertooth's toy factory became a sensation, and its creations were highly sought after. The widgets incorporated cutting-edge holographic displays, levitation technology, and even the ability to change shapes and colors with a mere thought. Children across the city rejoiced as they played with these incredible toys that seemed to bring their wildest fantasies to life.\n\nAs the years passed, Thundertooth's life took a heartwarming turn. He met a kind and intelligent dinosaur named Se

In [33]:
# Ensure we have the right number of Word documents loaded

len(docs)

4

We create a function to split text into paragraphs while keeping numbered sections and bullet points together with the paragraph that precedes them. This suits these documents because they contain numbered and bulleted points; the rule would need to be adapted for documents with a different structure.

In [34]:
import re

# Define the regular expression pattern for splitting paragraphs
para_split_pattern = re.compile(r'\n\n')

# Splits a document's text into paragraphs but if it has numbered or bulleted points, they will be included with the paragraph before it.
def split_text_into_paragraphs(text):


    # Use the pattern to split the text into paragraphs
    paragraphs = para_split_pattern.split(text)

    # Combine paragraphs that should not be split
    combined_paragraphs = [paragraphs[0]]

    for p in paragraphs[1:]:
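        # Check if the paragraph starts with a digit or a dash, or if its first word ends with a colon (e.g. "Lumina:"),
        # and, if so, concatenate it to the previous paragraph so we keep them all in one chunk

        # Strip out any leading new lines
        p = p.lstrip('\n')

        # p.strip() guards against whitespace-only paragraphs, for which p.split()[0] would raise an IndexError
        if p.strip() and (p[0].isdigit() or p[0] == '-' or p.split()[0].endswith(':')):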
            combined_paragraphs[-1] += '\n\n\n' + p
        else:
            combined_paragraphs.append(p)

    # Remove empty strings from the result
    combined_paragraphs = [p.strip() for p in combined_paragraphs if p.strip()]

    return combined_paragraphs
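
To make the merge rule easier to reason about, here is a minimal sketch that isolates the test deciding whether a paragraph is attached to the one before it. The helper name `should_merge` and the sample strings are hypothetical and only for illustration; the condition itself mirrors the one used in `split_text_into_paragraphs`.

```python
def should_merge(paragraph: str) -> bool:
    """Hypothetical helper: True if the paragraph would be attached to the previous one."""
    p = paragraph.lstrip('\n')
    if not p.strip():
        # Empty or whitespace-only paragraphs are never merged
        return False
    # Same three checks as the notebook: leading digit, leading dash, or first word ending in ':'
    return p[0].isdigit() or p[0] == '-' or p.split()[0].endswith(':')


samples = [
    "1. Gather the family",       # numbered point
    "- Build the force field",    # bulleted point
    "Lumina: The eldest child",   # label ending in a colon
    "One fateful day, a meteor",  # ordinary paragraph
]

for s in samples:
    print(f"{should_merge(s)!s:5}  {s}")
```

One consequence of this rule is that any paragraph that merely begins with a digit, such as one starting with a year, is also merged into the previous chunk, which is part of why the rule needs tuning per document.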

Create nodes from the paragraphs that we've carefully split up, printing how many paragraphs each document produced. The token counting that would tell us what kind of token length we're working with is commented out (see the note below).

Note: LangChain does not have a function like TokenCounter that LlamaIndex has to count the tokens in a string. We won't consider third-party ones at the moment, like tiktoken, so we'll go without the token count. I've left the code relevant to LlamaIndex so you can see how we would count the number of tokens in a paragraph.
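
If a rough size check is still useful, the sketch below estimates token counts with the standard library only. It is a heuristic (it counts words and punctuation marks), not the model's tokenizer, so real token counts will differ; the function name `estimate_tokens` and the sample paragraphs are assumptions made for illustration.

```python
import re


def estimate_tokens(text: str) -> int:
    """Rough estimate: count words and standalone punctuation marks.

    This is NOT the LLM's tokenizer; treat the result as an approximation only.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


# Hypothetical sample paragraphs standing in for the split document text
paragraphs = [
    "Thundertooth",
    "He met a kind and intelligent dinosaur named Seraphina.",
]

counts = [estimate_tokens(p) for p in paragraphs]
max_tokens = max(counts)
average_tokens = sum(counts) // len(counts)

print(f"Estimated tokens per paragraph: {counts}")
print(f"Maximum: {max_tokens}, average: {average_tokens}")
```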

In [35]:
# from llama_index.utilities.token_counting import TokenCounter
# from llama_index.schema import TextNode
# import uuid
from langchain.docstore.document import Document

# token_counter = TokenCounter() # Uses the global tokenizer set above, which should match the LLM

paragraph_separator = "\n\n\n"

# Stores the maximum length of a paragraph, in tokens
# max_paragraph_tokens = 0

# Total tokens, used to determine average
# total_paragraph_tokens = 0

# Nodes
paragraph_nodes = []

# Loop through the documents, splitting each into paragraphs (the per-paragraph token counting is commented out)
for document in docs:

    paragraph_token_lens = []
    # paragraphs = document.text.split(paragraph_separator)
    paragraphs = split_text_into_paragraphs(document.page_content)
    print(f"Document {document.metadata['source']} has {len(paragraphs)} paragraphs")
    for paragraph in paragraphs:
        # token_count = token_counter.get_string_tokens(paragraph)
        # paragraph_token_lens.append(token_count)
        # print(f"Paragraph tokens: {token_count}")

        # if token_count > max_paragraph_tokens:
            # max_paragraph_tokens = token_count

        # total_paragraph_tokens = total_paragraph_tokens + token_count

        # Create and add the node from the paragraph
        # include metadata we can use for citations
        node = Document(page_content=paragraph, metadata=document.metadata) # Copy the metadata from the Word document into here
        # node.metadata["token_count"] = token_count
        paragraph_nodes.append(node)

    # print(paragraph_token_lens)

# print(f"\n** The maximum paragraph tokens is {max_paragraph_tokens} **")

# average_paragraph_tokens = int(total_paragraph_tokens / len(paragraph_nodes))
# print(f"\n** The average paragraph's token count is {average_paragraph_tokens} **")

print(f"\n** Created {len(paragraph_nodes)} nodes **")
# paragraph_nodes


Document Data/Thundertooth Part 3.docx has 10 paragraphs
Document Data/Thundertooth Part 2.docx has 6 paragraphs
Document Data/Thundertooth Part 1.docx has 13 paragraphs
Document Data/Thundertooth Part 4.docx has 14 paragraphs

** Created 43 nodes **


Let's see the split data - now neatly in paragraphs and the bullet points and numbered lists are with their respective paragraph.

In [36]:
paragraph_nodes

[Document(page_content='Thundertooth', metadata={'source': 'Data/Thundertooth Part 3.docx'}),
 Document(page_content="One fateful day, as the citizens of the futuristic city went about their daily lives, a collective gasp echoed through the streets as a massive meteor hurtled towards Earth. Panic spread like wildfire as people looked to the sky in horror, realizing the impending catastrophe. The city's advanced technology detected the threat, and an emergency broadcast echoed through the streets, urging everyone to seek shelter.", metadata={'source': 'Data/Thundertooth Part 3.docx'}),
 Document(page_content="Thundertooth, ever the protector of his newfound home, wasted no time. With a determined gleam in his eyes, he gathered his family and hurried to the city's command center, where Mayor Grace and the leading scientists were coordinating the evacuation efforts.", metadata={'source': 'Data/Thundertooth Part 3.docx'}),
 Document(page_content='The mayor, recognizing Thundertooth\'s inte

We no longer need the LangChain text splitter because we've already done the splitting ourselves.

In [37]:
# Split them up into chunks using a Text Splitter

# from langchain.text_splitter import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter()
# documents = text_splitter.split_documents(docs)

In [38]:
# Create the embeddings from our split up chunks

from langchain_community.vectorstores import FAISS

vector = FAISS.from_documents(paragraph_nodes, embeddings)

In [39]:
# Prepare the prompt and then the chain

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

if ollama_model_name == "phi" or ollama_model_name == "phi:chat":
    # Phi-2 prompt is less flexible
    prompt_template = """Instruct: With this context\n\n{context}\n\nQuestion: {input}\nOutput:"""

else:
    prompt_template = """You are a story teller, answering questions in an excited, insightful, and empathetic way. Answer the question based only on the provided context:

    <context>
    {context}
    </context>

    Question: {input}"""

prompt = ChatPromptTemplate.from_template(prompt_template)
# document_chain = create_stuff_documents_chain(llm, prompt)

Now that we have broken the documents down into paragraph-sized chunks, we need to retrieve more of them so the LLM has a decent amount of context to use. Without the "search_kwargs" parameter the answers to the questions were worse. For example, when asked if they had any children, no relevant context was provided.

Note: To get the chunk with the children's names included (and then reranked to the top) I needed to set the number of retrieved chunks to 20. The section with the children's names was the 11th result from the retriever! This suggests you will often need to retrieve more chunks than you expect.
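
The effect of the cutoff can be sketched with a toy similarity search. Everything here is synthetic: the 2-D vectors, the chunk ids, and the use of cosine similarity are assumptions chosen so that the target chunk lands at rank 11, and they are not the embeddings or the exact metric used by the FAISS store in this notebook.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Synthetic chunks: chunk-i sits 4*i degrees away from the query, so chunk-1 is the closest
chunk_vecs = {
    f"chunk-{i}": (math.cos(math.radians(4 * i)), math.sin(math.radians(4 * i)))
    for i in range(1, 21)
}
query_vec = (1.0, 0.0)
target = "chunk-11"  # stands in for the paragraph with the children's names

for k in (10, 20):
    hits = retrieve(query_vec, chunk_vecs, k)
    print(f"k={k}: target retrieved = {target in hits}")
```

With a cutoff of 10 the target never reaches the next stage, no matter how good that stage is; with 20 it is available as a candidate.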

In [40]:
# Create the retriever and set it to return a good amount of chunks

from langchain.chains import create_retrieval_chain

retriever = vector.as_retriever(search_kwargs={"k": 20})

Let's implement the Cohere reranking, utilising our retriever (which now returns more results to work with) and our LLM.

Note: You'll need a Cohere API key. A trial key is free for non-commercial purposes. I've stored it in apikeys.py as Cohere_API = "your key in here"

https://cohere.com/

In [41]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

from apikeys import Cohere_API

# Create the Cohere reranker (the "compressor") and wrap our base retriever with it
compressor = CohereRerank(cohere_api_key=Cohere_API, top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

In [42]:
# Let's test that it includes the paragraph starting with "As the years passed..." when asked about their children.

test_retrieval = compression_retriever.get_relevant_documents("Did they have any children? If so, what were their names?")

test_retrieval

[Document(page_content="As the years passed, Thundertooth's life took a heartwarming turn. He met a kind and intelligent dinosaur named Seraphina, and together they started a family. Thundertooth and Seraphina were blessed with four children, each with unique characteristics that mirrored the diversity of their modern world.\n\n\nLumina: The eldest of Thundertooth's children, Lumina inherited her mother's intelligence and her father's sense of wonder. With sparkling scales that emitted a soft glow, Lumina had the ability to generate light at will. She became fascinated with technology, often spending hours tinkering with gadgets and inventing new ways to enhance the widgets produced in the family's factory.\n\n\nEcho: The second-born, Echo, had a gift for mimicry. He could perfectly replicate any sound or voice he heard, providing entertainment to the entire city. His playful nature and ability to bring joy to those around him made him a favorite among the neighborhood children.\n\n\nS

The above shows that, indeed, we are able to get that paragraph and it is the highest ranked.

Importantly, if we had not brought enough chunks back with the retriever (referring to the vector store retriever) then we would not have had the right chunks to run through Cohere for reranking.

So if this line:
```
retriever = vector.as_retriever(search_kwargs={"k": 20})
```

was:
```
retriever = vector.as_retriever(search_kwargs={"k": 10})
```

We would not have been able to get that "As the years passed..." chunk for reranking.

Additionally, reranking lets us cut the number of chunks passed to the LLM from the 11+ we had to retrieve to reach the right chunk down to 5, because we keep only the best 5 of that bunch. This reduces the number of tokens the LLM needs to process.
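
The retrieve-then-rerank pattern described above can be sketched in plain Python. The two scoring functions are stand-ins only: `coarse_score` plays the role of the vector store similarity and `fine_score` the role of the Cohere reranker, and both are hypothetical toy functions rather than the real models.

```python
def retrieve_then_rerank(query, chunks, coarse_score, fine_score, k, top_n):
    """Stage 1: keep the k best chunks by coarse_score.
    Stage 2: reorder those candidates by fine_score and keep the top_n."""
    candidates = sorted(chunks, key=lambda c: coarse_score(query, c), reverse=True)[:k]
    return sorted(candidates, key=lambda c: fine_score(query, c), reverse=True)[:top_n]


# Toy setup: chunks are numbered 1..20 in the order the first stage would return them
chunks = list(range(1, 21))
RELEVANT = 11  # the chunk that actually answers the question


def coarse_score(query, chunk):
    # Stand-in for vector similarity: lower-numbered chunks score higher
    return -chunk


def fine_score(query, chunk):
    # Stand-in for the reranker: it recognises the truly relevant chunk
    return 100 if chunk == RELEVANT else -chunk


query = "Did they have any children?"
print(retrieve_then_rerank(query, chunks, coarse_score, fine_score, k=20, top_n=5))
print(retrieve_then_rerank(query, chunks, coarse_score, fine_score, k=10, top_n=5))
```

In both calls the LLM would receive only 5 chunks, but only the `k=20` call gives the second stage a chance to promote the relevant chunk.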

Now, we create a LangChain chain with the Cohere reranker retriever

In [48]:
from langchain.chains import RetrievalQA

rerank_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
)

ValidationError: 1 validation error for RetrievalQA
prompt_template
  extra fields not permitted (type=value_error.extra)

In [44]:
# Here are our test questions

TestQuestions = [
    "Summarise the story for me",
    "Who was the main protagonist?",
    "Did they have any children? If so, what were their names?",
    "Did anything eventful happen?",
    "Who are the main characters?",
    "What do you think happens next in the story?"
]

Ask our questions with our reranking chain

In [45]:
qa_pairs = []

for index, question in enumerate(TestQuestions, start=1):
    question = question.strip() # Clean up

    print(f"\n{index}/{len(TestQuestions)}: {question}")

    response = rerank_chain.invoke({"query": question})

    qa_pairs.append((question.strip(), response["result"])) # Add to our output array

    # Uncomment the following line if you want to test just the first question
    # break 


1/6: Summarise the story for me

2/6: Who was the main protagonist?

3/6: Did they have any children? If so, what were their names?

4/6: Did anything eventful happen?

5/6: Who are the main characters?

6/6: What do you think happens next in the story?


In [46]:
# Print out the questions and answers

for index, (question, answer) in enumerate(qa_pairs, start=1):
    print(f"{index}/{len(qa_pairs)} {question}\n\n{answer}\n\n--------\n")

1/6 Summarise the story for me

Thundertooth is an intelligent and resourceful dinosaur who helps the mayor of a futuristic city devise a plan to divert or neutralize a meteor that is heading towards their city. Lumina, Echo, Sapphire, and Ignis are enlisted to assist in the crisis, utilizing their unique abilities to create a protective force field, amplify emergency signals, calm the population, and alter the meteor's trajectory. After successfully averting the disaster, Thundertooth is hailed as a hero and his family's legacy is celebrated in the city's history. Mayor Grace invites Thundertooth and his family to stay in the city, promising to find a way to provide for them without causing harm to anyone.

--------

2/6 Who was the main protagonist?

The main protagonist in the given context is Thundertooth.

--------

3/6 Did they have any children? If so, what were their names?

Thundertooth and Seraphina had four children. Their names were Lumina, Echo, Sapphire, and Ignis.

-----