# Generative AI
## From "rag_task.ipynb", March 5, 2025
(use tensor_requirements.txt)


The hand-in exercise for this topic is in the notebook named ‘rag_task.ipynb’. Do all 4
tasks within this notebook. For task 2, you should try at least 3 types of chunking, such as
chunking by paragraph, by sentence, or even by punctuation mark; you are welcome to
choose your own chunking strategy. For task 4, you should try at least one other type of
similarity or distance function to calculate the similarity.

Task
1. Create a RAG pipeline that can take the following text and answer the following questions.
2. Try different types of chunking to get better answers.
3. Does asking questions differently give better answers? Why?
4. Try a different similarity search instead of cosine similarity. Do the answers improve?

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import sent_tokenize
import nltk
import re

In [22]:
questions = [
    "What is the Amazon rainforest?",
    "Which countries does the Amazon span across?",
    "Why is deforestation a problem in the Amazon?",
    "How does the Amazon rainforest affect global weather patterns?",
    "What role do indigenous tribes play in the Amazon?",
    "What is the importance of the Amazon River?",
    "What types of wildlife can be found in the Amazon?",
    "How does deforestation contribute to climate change?",
    "What efforts are being made to protect the Amazon?",
    "Why is the Amazon considered a major carbon sink?"
]
sample_text = """
The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.

Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and industrial activities.

Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents. The Amazon River, which flows through the rainforest, is the second longest river in the world and carries more water than any other river.

Efforts to protect the Amazon include international agreements, conservation programs, and sustainable development projects that aim to balance economic growth with environmental protection. Many organizations and governments are working to reduce illegal logging and promote reforestation initiatives.
"""

In [None]:
questions_2 = [
    "What is the Amazon rainforest?",
    "Which countries does the Amazon span across?",
    "Why is deforestation a problem in the Amazon?",
    "How does the Amazon rainforest affect global weather patterns?",
    "What role do indigenous tribes play in the Amazon?",
    "What is the importance of the Amazon River?",
    "What types of wildlife can be found in the Amazon?",
    "How does deforestation contribute to climate change?",
    "What efforts are being made to protect the Amazon?",
    "Why is the Amazon considered a major carbon sink?",
    "What are the main causes of deforestation in the Amazon?",
    "How does the Amazon rainforest support biodiversity?",
    "What are the economic benefits of the Amazon rainforest?",
    "How do indigenous tribes in the Amazon use the rainforest's resources?",
    "What are the challenges faced by indigenous tribes in the Amazon?",
    "How does the Amazon River support local ecosystems?",
    "What are the impacts of climate change on the Amazon rainforest?",
    "What international agreements aim to protect the Amazon?",
    "How do conservation programs help the Amazon rainforest?",
    "What are some sustainable development projects in the Amazon?",
    "How does illegal logging affect the Amazon rainforest?",
    "What are the benefits of reforestation in the Amazon?",
    "How does the Amazon rainforest influence rainfall patterns?",
    "What are the cultural significances of the Amazon rainforest?",
    "How does urbanization impact the Amazon rainforest?",
    "What are the long-term effects of deforestation on the Amazon?",
    "How do scientists study the Amazon rainforest?",
    "What role does the Amazon play in the global carbon cycle?",
    "How do local communities benefit from the Amazon rainforest?",
    "What are the threats to wildlife in the Amazon rainforest?"
]
sample_text_2 = """
The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, Colombia, Venezuela, Ecuador, Bolivia, Guyana, Suriname, and French Guiana. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, pink river dolphins, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, mining, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually. The main causes of deforestation include cattle ranching, soybean farming, and illegal logging.

Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment, industrial activities, and diseases brought by outsiders. Indigenous tribes use the rainforest's resources sustainably, practicing traditional agriculture and hunting methods that preserve the ecosystem.

Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents. The Amazon River, which flows through the rainforest, is the second longest river in the world and carries more water than any other river. It supports local ecosystems by providing habitats for numerous aquatic species and facilitating nutrient cycling.

Efforts to protect the Amazon include international agreements, conservation programs, and sustainable development projects that aim to balance economic growth with environmental protection. Many organizations and governments are working to reduce illegal logging, promote reforestation initiatives, and support indigenous communities. Conservation programs focus on creating protected areas, monitoring biodiversity, and implementing sustainable practices.

The Amazon rainforest supports biodiversity by providing a wide range of habitats for various species. It also offers economic benefits, such as ecotourism, sustainable forestry, and the discovery of new medicines from its plant species. However, the rainforest faces challenges from climate change, which can alter rainfall patterns, increase temperatures, and exacerbate droughts.

International agreements, such as the Paris Agreement, aim to protect the Amazon by setting targets for reducing greenhouse gas emissions and promoting sustainable land use. Sustainable development projects in the Amazon focus on improving the livelihoods of local communities while preserving the environment. These projects include agroforestry, community-based tourism, and renewable energy initiatives.

Illegal logging in the Amazon rainforest leads to habitat destruction, soil erosion, and the loss of biodiversity. Reforestation efforts help restore degraded areas, improve soil quality, and sequester carbon dioxide. The Amazon rainforest influences rainfall patterns by releasing large amounts of water vapor, which can affect precipitation in distant regions.

The cultural significance of the Amazon rainforest is immense, as it is home to numerous indigenous cultures with rich traditions and knowledge. Urbanization impacts the Amazon rainforest by increasing demand for resources, leading to further deforestation and pollution. The long-term effects of deforestation include the loss of biodiversity, disruption of ecosystem services, and exacerbation of climate change.

Scientists study the Amazon rainforest using various methods, including satellite imagery, field research, and remote sensing. The Amazon plays a vital role in the global carbon cycle by absorbing and storing carbon dioxide, which helps mitigate climate change. Local communities benefit from the Amazon rainforest through access to natural resources, traditional medicines, and cultural heritage.

Threats to wildlife in the Amazon rainforest include habitat loss, poaching, and climate change. Conservation efforts aim to protect endangered species, restore habitats, and promote sustainable use of resources. The Amazon rainforest is a complex and dynamic ecosystem that requires ongoing research and protection to ensure its survival for future generations.
"""


#### 1. Create a RAG pipeline that can take the following text and answer the following questions
#### 2. Try different types of chunking to get better answers
I tried the methods below:
* Paragraph chunks: split the text into paragraphs using a regex.
* Sentence chunks (labelled 'semantic' in the code): split the text at sentence boundaries using NLTK's Punkt tokenizer. Punkt detects sentence boundaries; it does not analyse meaning.
* Period-based chunks: split the text from period to period using a regex.

Some of them work "better" than others, but that depends on the sample text and how it is constructed.
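
The three strategies can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the function names are made up for this example, the sentence splitter is a regex stand-in for NLTK's tokenizer, and `demo` is a shortened excerpt of the sample text.

```python
import re

def chunk_paragraphs(text):
    """Split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def chunk_sentences(text):
    """Split after '.', '!' or '?' when whitespace follows (regex stand-in for a sentence tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_punctuation(text):
    """Split at , ; : . ! ? when whitespace follows; the punctuation mark itself is dropped."""
    return [c.strip() for c in re.split(r"[,;:.!?]\s+", text) if c.strip()]

# Shortened excerpt of the sample text (illustrative only)
demo = """The Amazon covers approximately 5.5 million square kilometers. It spans nine countries, including Brazil, Peru, and Colombia.

Deforestation is a significant threat to the Amazon."""

paragraphs = chunk_paragraphs(demo)
sentences = chunk_sentences(demo)
fragments = chunk_punctuation(demo)

for name, chunks in [("paragraphs", paragraphs), ("sentences", sentences), ("fragments", fragments)]:
    print(name, len(chunks))
```

Requiring whitespace after the punctuation mark keeps a decimal such as "5.5" inside one chunk. Smaller chunks give more focused answers but carry less context, which is one reason the "best" strategy depends on the text.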

#### 3. Does asking questions differently give better answers? Why?
Yes. You get more "accurate" responses when the question includes more of the words used in the text, because its embedding is then more similar to the embedding of the chunk that contains the answer.
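
A toy illustration of the wording effect, under a loud assumption: it scores plain word overlap with a bag-of-words cosine, which is only a crude stand-in for the sentence embeddings used in the notebook. Real embeddings also capture paraphrases, so the effect there is softer. The helper `bow_cosine` and the question "Where is the jungle located?" are made up for this example.

```python
import math
import re
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between word-count vectors of two strings."""
    va = Counter(re.findall(r"[a-z]+", a.lower()))
    vb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

chunk = "It spans across nine countries, including Brazil, Peru, and Colombia."

# Shares "countries" and "across" with the chunk
overlapping = bow_cosine("Which countries does the Amazon span across?", chunk)
# Same intent, but no word in common with the chunk
non_overlapping = bow_cosine("Where is the jungle located?", chunk)

print(f"overlapping wording:     {overlapping:.3f}")
print(f"non-overlapping wording: {non_overlapping:.3f}")
```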
    
#### 4. Try a different similarity search instead of cosine similarity. Do the answers improve?


I tried cosine similarity, Euclidean distance and Manhattan distance. Unfortunately, I saw no improvement with any combination of chunking type and similarity search.
However, analysing the similarity scores, I can see that the values differ, each in line with how the respective method represents distance. The Euclidean and Manhattan values are distances converted to similarities with 1 / (1 + distance).
I also tried a larger sample text and more questions. Unfortunately, this did not change the answers returned.
See below for the similarity scores for the first question:

        
Similarity scores for question 'What is the Amazon rainforest?':  
cosine:
[0.7151465  0.36805463 0.5594398  0.60209775 0.47062945 0.57009375
 0.3113671  0.20859346 0.59229565 0.6159775  0.544066   0.31913093]

Euclidean:
[0.56986892 0.47075905 0.51581486 0.52851919 0.4928654  0.51887113
 0.46007264 0.44285039 0.52548611 0.53294059 0.51153048 0.46148122]

Manhattan:
[0.07821383 0.0537261  0.06547236 0.06768317 0.05906415 0.06518811
 0.05220144 0.04851812 0.06615595 0.06781829 0.063402   0.05290067]
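
One possible explanation for the unchanged answers, offered as a sketch rather than a definitive diagnosis: if the embeddings have unit length (an assumption here, though the scores above are consistent with it), then the Euclidean distance is d = sqrt(2 - 2·cos). In that case 1 / (1 + d) rises and falls with the cosine score, so both methods always pick the same best chunk. The snippet below checks this against the scores listed above.

```python
import math

# Scores copied from the output above (first question, 12 sentence chunks)
cosine = [0.7151465, 0.36805463, 0.5594398, 0.60209775, 0.47062945, 0.57009375,
          0.3113671, 0.20859346, 0.59229565, 0.6159775, 0.544066, 0.31913093]
reported_euclidean = [0.56986892, 0.47075905, 0.51581486, 0.52851919, 0.4928654, 0.51887113,
                      0.46007264, 0.44285039, 0.52548611, 0.53294059, 0.51153048, 0.46148122]
reported_manhattan = [0.07821383, 0.0537261, 0.06547236, 0.06768317, 0.05906415, 0.06518811,
                      0.05220144, 0.04851812, 0.06615595, 0.06781829, 0.063402, 0.05290067]

# Assumption: unit-length embeddings, so ||a - b||^2 = 2 - 2 * cos(a, b)
predicted_euclidean = [1 / (1 + math.sqrt(2 - 2 * c)) for c in cosine]
max_gap = max(abs(p - r) for p, r in zip(predicted_euclidean, reported_euclidean))
print(f"largest gap between predicted and reported Euclidean scores: {max_gap:.1e}")

def best_index(scores):
    """Index of the highest score, i.e. the chunk returned as the answer."""
    return scores.index(max(scores))

best = {name: best_index(scores) for name, scores in
        [("cosine", cosine), ("euclidean", reported_euclidean), ("manhattan", reported_manhattan)]}
print(best)
```

Manhattan distance has no such exact relationship to cosine similarity, and its ordering of the lower-ranked chunks differs slightly here, but for this question it still selects the same top chunk.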


In [27]:
from scipy.spatial.distance import cdist

# Punkt is NLTK's pretrained sentence tokenizer; sent_tokenize needs it to find sentence boundaries.
# Note: newer NLTK releases may require the 'punkt_tab' resource instead.
nltk.download('punkt')

def rag_pipe(model_name, chunk_method, sample_text, sim_search_method, question_list=None):
    # Fall back to the notebook-level `questions` list when no questions are passed in
    if question_list is None:
        question_list = questions
    model = SentenceTransformer(model_name)

    # Split the text into chunks
    if chunk_method == 'para':  # paragraphs: split on line breaks
        stored_texts = [para.strip() for para in re.split(r"\n+", sample_text) if para.strip()]
    elif chunk_method == 'period':  # period followed by whitespace, so decimals like "5.5" stay intact
        stored_texts = [segment.strip() for segment in re.split(r'\.\s+', sample_text) if segment.strip()]
    elif chunk_method == 'semantic':  # sentences, via NLTK's Punkt tokenizer
        stored_texts = [sentence.strip() for sentence in sent_tokenize(sample_text) if sentence.strip()]
    else:
        raise ValueError(f"Unknown chunk_method: {chunk_method!r}")

    # Embed the chunks once; they are reused for every question
    stored_embeddings = model.encode(stored_texts, convert_to_numpy=True)

    # Embed each question and return the chunk with the highest similarity
    print("\nSample Questions and Answers:\n")
    for question in question_list:
        question_embedding = model.encode([question], convert_to_numpy=True)
        if sim_search_method == 'cosine':
            similarities = cosine_similarity(question_embedding, stored_embeddings)[0]
        elif sim_search_method == 'euclidian':
            # Convert Euclidean distance to a similarity in (0, 1]
            distances = cdist(question_embedding, stored_embeddings, metric='euclidean')[0]
            similarities = 1 / (1 + distances)
        elif sim_search_method == 'manhattan':
            # Convert Manhattan (city block) distance to a similarity in (0, 1]
            distances = cdist(question_embedding, stored_embeddings, metric='cityblock')[0]
            similarities = 1 / (1 + distances)
        else:
            raise ValueError(f"Unknown sim_search_method: {sim_search_method!r}")
        best_match_idx = np.argmax(similarities)

        response = stored_texts[best_match_idx]
        print(f"Q: {question}\nA: {response}\n")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hassan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [28]:
rag_pipe('sentence-transformers/all-MiniLM-L6-v2', 'semantic', sample_text, 'cosine')
# rag_pipe('sentence-transformers/all-MiniLM-L6-v2', 'semantic', sample_text, 'euclidian')
# rag_pipe('sentence-transformers/all-MiniLM-L6-v2', 'semantic', sample_text, 'manhattan')


Sample Questions and Answers:

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest affect global weather patterns?
A: Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents.

Q: What role do indigenous tribes play in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter.

Q: What is the 