Task
1. Create a RAG pipeline that can take the following text and answer the following questions
2. Try different types of chunking to get better answers
3. Does asking questions differently give better answers? Why?
4. Try a different similarity search instead of cosine similarity - do the answers improve?



In [1]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances
import re



### 1. Create a RAG pipeline that can take the following text and answer the following questions

In [2]:
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(MODEL_NAME)


sample_text = """
The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.

Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and industrial activities.

Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents. The Amazon River, which flows through the rainforest, is the second longest river in the world and carries more water than any other river.

Efforts to protect the Amazon include international agreements, conservation programs, and sustainable development projects that aim to balance economic growth with environmental protection. Many organizations and governments are working to reduce illegal logging and promote reforestation initiatives.
"""

In [3]:
questions = [
    "What is the Amazon rainforest?",
    "Which countries does the Amazon span across?",
    "Why is deforestation a problem in the Amazon?",
    "How does the Amazon rainforest affect global weather patterns?",
    "What role do indigenous tribes play in the Amazon?",
    "What is the importance of the Amazon River?",
    "What types of wildlife can be found in the Amazon?",
    "How does deforestation contribute to climate change?",
    "What efforts are being made to protect the Amazon?",
    "Why is the Amazon considered a major carbon sink?"
]

Next, define the retrieval helpers and a function to split the text. Let's start with paragraph-level chunking and see how it goes.

In [4]:
# Function to generate embeddings
def get_transformer_embeddings(texts):
    return model.encode(texts, convert_to_numpy=True)

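# Function to retrieve the most relevant passage.
# Relies on the globals stored_texts and stored_embeddings, which are set in later cells.
def retrieve_passage(query):
    query_embedding = get_transformer_embeddings([query])  # shape: (1, dim)
    # cosine_similarity returns a (1, n_chunks) matrix; [0] takes the single row
    similarities = cosine_similarity(query_embedding, stored_embeddings)[0]
    best_match_idx = np.argmax(similarities)
    return stored_texts[best_match_idx]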

# Function to answer questions based on stored content
def answer_question(query):
    relevant_passage = retrieve_passage(query)
    return relevant_passage

In [5]:

def split_text_paragraph(text):
    return [para.strip() for para in re.split("\n+", text) if para.strip()]

# Store document embeddings
stored_texts = split_text_paragraph(sample_text)
stored_embeddings = model.encode(stored_texts, convert_to_numpy=True)  

In [6]:
stored_texts

['The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.',
 'Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.',
 'Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and industrial activities.',
 'Scientists believe that the Amazon plays a cruci

In [7]:
print("\nSample Questions and Answers:\n")
for question in questions:
    response = answer_question(question)
    print(f"Q: {question}\nA: {response}\n")



Sample Questions and Answers:

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Q: Which countries does the Amazon span across?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. T

The results show that paragraph chunking returns rather large chunks. For example, the answers to "What is the Amazon rainforest?" and "Which countries does the Amazon span across?" are identical, because both questions retrieve the same paragraph, which covers several details at once. So while paragraph-level chunks preserve context, they can be too broad and include more information than a very specific question needs.
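To put a rough number on "rather large", the sketch below compares the average chunk length for paragraph and sentence splitting. It runs on a short stand-in text, and `split_paragraphs`, `split_sentences`, and `mean_words` are illustrative helpers, not the notebook's functions; the sentence splitter is a simplified one.

```python
import re

# Short stand-in text (two paragraphs); swap in sample_text from the notebook.
text = """
The Amazon rainforest is the largest tropical rainforest in the world. It spans across nine countries.

Deforestation is a significant threat to the Amazon. It contributes to climate change.
"""

def split_paragraphs(text):
    # Break on runs of newlines and drop empty pieces.
    return [p.strip() for p in re.split(r"\n+", text) if p.strip()]

def split_sentences(text):
    # Simplified splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def mean_words(chunks):
    # Average number of whitespace-separated words per chunk.
    return sum(len(c.split()) for c in chunks) / len(chunks)

paragraphs = split_paragraphs(text)
sentences = split_sentences(text)
print("paragraph chunks:", len(paragraphs), "mean words:", mean_words(paragraphs))
print("sentence chunks:", len(sentences), "mean words:", mean_words(sentences))
```

The larger the average chunk, the more unrelated detail comes back with each answer, which matches the behaviour seen in the output above.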

### 2. Try different types of chunking to get better answers

Let's try sentence-level chunking.

In [8]:
# Function to split text into meaningful chunks (sentences)
def split_text_sentences(text):
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

In [9]:
stored_texts = split_text_sentences(sample_text)
stored_embeddings = model.encode(stored_texts, convert_to_numpy=True)

In [10]:
# Print all stored chunks
stored_texts


['The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.',
 'It spans across nine countries, including Brazil, Peru, and Colombia.',
 'The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.',
 'Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.',
 'This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.',
 'Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter.',
 'These tribes have unique languages, traditions, and knowledge of the ecosystem.',
 'However, many face threats from illegal land encroachment and industrial activities.',
 'Scientists believe that the 

In [11]:
print("\nSentence-Level Chunking Results:\n")
for question in questions:
    response = answer_question(question)
    print(f"Q: {question}\nA: {response}\n")


Sentence-Level Chunking Results:

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest affect global weather patterns?
A: Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents.

Q: What role do indigenous tribes play in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter.

Q: What is t

The sentence-level chunking results look much more targeted. Instead of returning an entire paragraph, the pipeline now returns a single sentence that directly answers each question. For example, for "What is the Amazon rainforest?" the answer is now just:

"The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers."

This more focused output can be preferable when you need concise answers.
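One possible way to keep answers concise without depending on a single sentence is to return the top k sentences together with their scores. The sketch below is a hypothetical helper (`top_k_cosine` is not part of the notebook) and uses toy 3-dimensional vectors in place of real embeddings.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=2):
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Indices of the k highest scores, best first.
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for sentence embeddings (an assumption for illustration).
doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([0.9, 0.1, 0.0])

results = top_k_cosine(query_vec, doc_vecs, k=2)
for idx, score in results:
    print(idx, round(score, 3))
```

In the notebook this could be called as `top_k_cosine(model.encode([query])[0], stored_embeddings, k=2)`, with the returned indices used to look up `stored_texts`. Seeing the runner-up score also helps judge how close a wrong answer was to the right one.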

In [12]:
def split_text_sliding(text, window_size=3, step=1):
    # Split text into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    # Create overlapping chunks using a sliding window
    for i in range(0, len(sentences) - window_size + 1, step):
        chunk = " ".join(sentences[i:i + window_size])
        chunks.append(chunk)
    return chunks



In [13]:
stored_texts = split_text_sliding(sample_text, window_size=3, step=1)
stored_embeddings = model.encode(stored_texts, convert_to_numpy=True)


In [14]:
stored_texts

['The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.',
 'It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds. Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.',
 'The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds. Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanizatio

In [15]:
print("\nSliding Window Chunking Results:\n")
for question in questions:
    response = answer_question(question)
    print(f"Q: {question}\nA: {response}\n")


Sliding Window Chunking Results:

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds. Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture,

The sliding window results indicate that while overlapping chunks preserve context across sentence boundaries, they can also include extra or even unrelated details. For instance, for the question “How does the Amazon rainforest affect global weather patterns?”, the retrieved chunk begins in one context (indigenous tribes) and then includes information about the rainforest’s role in weather. This happens because the window (here 3 sentences, moved forward one sentence at a time) slides across the whole text without regard to paragraph boundaries, so some chunks straddle two topics. The larger the window, the more likely a chunk is to mix several topics.

In summary:

Pros:

- Sliding window chunking preserves context between adjacent sentences.
- It can help capture information that spans multiple sentences.

Cons:

- It may include extra details that are not directly relevant to the question.
- If the window size is too large, the retrieved chunk may become less focused.
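The trade-off between window size and step is easier to see on placeholder sentences. The sketch below is a variant of the sliding-window idea, not the notebook's exact function: `sliding_chunks` is a hypothetical helper that takes a list of sentences and, unlike the version above, keeps short inputs as a single chunk instead of returning nothing.

```python
def sliding_chunks(sentences, window_size=3, step=1):
    # If there are no more sentences than the window holds, keep them as one
    # chunk rather than returning an empty list.
    if len(sentences) <= window_size:
        return [" ".join(sentences)] if sentences else []
    return [
        " ".join(sentences[i:i + window_size])
        for i in range(0, len(sentences) - window_size + 1, step)
    ]

sentences = [f"S{i}." for i in range(1, 7)]  # six placeholder sentences

for window_size, step in [(3, 1), (2, 1), (2, 2)]:
    chunks = sliding_chunks(sentences, window_size, step)
    print(f"window={window_size} step={step} -> {len(chunks)} chunks: {chunks}")
```

A smaller window gives more focused chunks, and a step equal to the window size removes the overlap entirely. With a step larger than 1, trailing sentences can be left out when they do not fill a complete window, so that setting needs a check against the actual text.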


### 3. Does asking questions differently give better answers? Why?

Let's try this with sentence-level chunking and the question
Q: What types of wildlife can be found in the Amazon?
which returned the answer:
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter.


In [16]:
stored_texts = split_text_sentences(sample_text)
stored_embeddings = model.encode(stored_texts, convert_to_numpy=True)

In [17]:
# Define the original question and one rephrasing
question_original = "What types of wildlife can be found in the Amazon?"
question_alt1 = "Which animals and insects inhabit the Amazon rainforest?"


# Retrieve answers using your existing answer_question function
answer_original = answer_question(question_original)
answer_alt1 = answer_question(question_alt1)

print("Original Question:")
print("Q:", question_original)
print("A:", answer_original, "\n")

print("Alternative Phrasing 1:")
print("Q:", question_alt1)
print("A:", answer_alt1, "\n")


Original Question:
Q: What types of wildlife can be found in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. 

Alternative Phrasing 1:
Q: Which animals and insects inhabit the Amazon rainforest?
A: The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds. 



I noticed that changing the way a question is asked can really change the answer the system returns. For example, when I asked "What types of wildlife can be found in the Amazon?" I got an answer about indigenous tribes. But when I rephrased it to "Which animals and insects inhabit the Amazon rainforest?" I got the sentence that lists jaguars, sloths, insects, and birds. Even small changes in wording change the query embedding, which in turn changes which chunk scores as most similar. The rephrased question also shares words such as "insects" and "rainforest" with the target sentence, which likely pulls the two embeddings closer. So rephrasing a question, especially with terms close to those in the source text, can lead to a more relevant answer.
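Since a single phrasing can miss, one option is to embed several phrasings of the same question and score each chunk by its best match across them. The sketch below is a hypothetical illustration: `cosine_matrix` and `retrieve_multi_query` are not part of the notebook, and the 2-dimensional vectors are toy stand-ins chosen so that the first phrasing alone picks the wrong chunk.

```python
import numpy as np

def cosine_matrix(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def retrieve_multi_query(query_vecs, doc_vecs):
    # Score each chunk by its best similarity across all phrasings,
    # then return the index of the highest-scoring chunk.
    sims = cosine_matrix(query_vecs, doc_vecs)  # shape: (n_phrasings, n_chunks)
    best_per_chunk = sims.max(axis=0)
    return int(np.argmax(best_per_chunk)), best_per_chunk

# Toy vectors (assumed for illustration): chunk 0 ~ "tribes", chunk 1 ~ "wildlife".
doc_vecs = np.array([[1.0, 0.2], [0.1, 1.0]])
# Two phrasings of one question; the first leans toward the wrong chunk.
query_vecs = np.array([[0.7, 0.6], [0.1, 0.9]])

single_idx = int(np.argmax(cosine_matrix(query_vecs[:1], doc_vecs)[0]))
best_idx, scores = retrieve_multi_query(query_vecs, doc_vecs)
print("first phrasing alone picks chunk", single_idx)
print("both phrasings together pick chunk", best_idx)
```

Taking the maximum is only one way to combine phrasings, and it is not guaranteed to help: a poorly worded alternative can just as easily pull in an unrelated chunk.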

### 4. Try a different similarity search instead of cosine similarity - do the answers improve?
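Before swapping the metric, one thing is worth checking. If the embeddings are unit-length, Euclidean distance and cosine similarity rank chunks identically, because ‖a − b‖² = 2 − 2·cos(a, b). Whether the embeddings in this notebook are unit-length is an assumption here; it can be checked with `np.linalg.norm(stored_embeddings, axis=1)`. The sketch below uses random unit vectors as stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_rows(x):
    # Scale each row to length 1.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

docs = unit_rows(rng.normal(size=(20, 8)))      # stand-in chunk embeddings
query = unit_rows(rng.normal(size=(1, 8)))[0]   # stand-in query embedding

cosine = docs @ query                            # cosine similarity per chunk
euclidean = np.linalg.norm(docs - query, axis=1) # Euclidean distance per chunk

# For unit vectors: squared distance equals 2 - 2 * cosine similarity.
identity_holds = np.allclose(euclidean ** 2, 2 - 2 * cosine)
same_best = int(np.argmax(cosine)) == int(np.argmin(euclidean))
print("identity holds:", identity_holds)
print("same best chunk:", same_best)
```

If the norms do come out as 1, identical answers from the two metrics are expected rather than a coincidence, and a metric such as Manhattan distance or an unnormalized dot product would be a more informative comparison.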

In [18]:


# Alternative retrieval function using Euclidean distance
def retrieve_passage_euclidean(query):
    # Get the embedding for the query
    query_embedding = model.encode([query], convert_to_numpy=True)
    # Compute Euclidean distances between the query and each stored embedding
    distances = pairwise_distances(query_embedding, stored_embeddings, metric='euclidean')[0]
    # Lower distance indicates higher similarity; find the index with the smallest distance
    best_match_idx = np.argmin(distances)
    return stored_texts[best_match_idx]

# Define a new function to answer questions using Euclidean similarity
def answer_question_euclidean(query):
    return retrieve_passage_euclidean(query)

# Now compare answers using Euclidean distance for our list of questions
print("\nSentence-Level Chunking with Euclidean Distance Results:\n")
for question in questions:
    response = answer_question_euclidean(question)
    print(f"Q: {question}\nA: {response}\n")



Sentence-Level Chunking with Euclidean Distance Results:

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest affect global weather patterns?
A: Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents.

Q: What role do indigenous tribes play in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, an