#### Notebook objective
Build a first, basic RAG chain and evaluate its performance on the PP Mahakam project and the 'Dépôt sur site' AAP (call for projects)

In [23]:
! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [None]:
import dotenv
from langchain import hub
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

dotenv.load_dotenv("/home/chougar/Documents/GitHub/Formation_datascientest/DL-NLP/.env")

True

### Split the document into sections

In [None]:
import pdfplumber
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document  # Import the Document class
from typing import List, Dict


# Read the PDF and extract the text of each page
def extract_text_with_page_numbers(pdf_path: str) -> List[Dict]:
    """
    Extract text from a PDF, keeping track of page numbers.
    Returns a list of {'page': int, 'page_content': str} dictionaries.
    """
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # extract_text() may return None or an empty string for a page without a text layer
            text = page.extract_text() or ""
            if text.strip():  # Skip empty pages
                pages.append({
                    'page': page_num,
                    'page_content': text
                })
    return pages



# Extract text with page numbers
pdf_path = "./data/PROJECT DOCUMENT MAHAKAM 2023-2025_balise.pdf"
pages = extract_text_with_page_numbers(pdf_path)



# Convert the pages to LangChain Document objects
docs = [
    Document(page_content=page['page_content'], metadata={'page': page['page']})
    for page in pages
]

# Init the text splitter (split into "logical" sections using the separators below, capped at 4,000 characters per chunk:
# roughly one page, or about 1,000 tokens at the usual ~4 characters per token for English text)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=800,
    length_function=len,
    is_separator_regex=False,
    separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""]  # Split by paragraphs, sentences, words
)

# Split the documents
splits = text_splitter.split_documents(docs)

# Inspection
print(f"Created {len(splits)} chunks")
print("\nSample chunk:")
print(splits[0].page_content)
print(f"\nMetadata: {splits[0].metadata}")

Created 56 chunks

Sample chunk:
MANGROVE RESTORATION AND
LOCAL DEVELOPMENT PROJECT
FOREST PROGRAMME
Project title: MAHAKAM
Country: Indonesia
Implementing partners: Yayasan
Mangrove Lestari & Pokja Pesisir

Metadata: {'page': 1}
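To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of an overlapping window over a toy string. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to cut on the listed separators, which this sketch ignores. The helper name `sliding_window_chunks` and the toy values are illustrative assumptions, not part of the notebook.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters.

    Simplified illustration only: no separator handling, unlike the real splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the window already reached the end
            break
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(toy_text, chunk_size=10, chunk_overlap=4)
for chunk in toy_chunks:
    print(chunk)
```

Each window repeats the last 4 characters of the previous one, which is what lets a sentence cut at a chunk boundary still appear whole in one of the two neighbouring chunks.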


In [76]:
splits[1].page_content

"Planète Urgence | FOREST Programme\nBrief project description\nEast Kalimantan Province in 2021/2022 received great attention nationally\nbecause of the moving of the state capital city (Jakarta) to a location near\nthe city of Balikpapan and Penajam Paser Utara in East Kalimantan Province.\nThe development of the new capital will start in 2022. Although the\nIndonesia President commit to develop the new capital as Forest and Smart\nCity, the surrounding area particularly the coastal area such as Delta\nMahakam and Adang Bay might get high pressure as the consequence of the\nnew development and the movement of 1.5 million people to the new\ncapital.\nDelta Mahakam, in the eastern part of East Kalimantan, is an area that is\nrelatively close to the prospective center of the State capital (about 100 km).\nMahakam Delta is naturally a mangrove habitat, but due to excessive land\nclearing for extensive aquaculture about 47.5 % of the mangrove ecosystem\nis degraded to be converted into aq

### Embedding:
* Convert the documents into vectors (a numerical representation of the meaning of each text)
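As a rough illustration of why vectors are useful for retrieval, the sketch below ranks a few chunks by cosine similarity to a query vector. The 3-dimensional vectors and the chunk names are made up for readability (the real `text-embedding-ada-002` vectors have 1,536 dimensions), and cosine similarity is only one common choice of measure; this is not a description of how the vector store is configured internally.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings"; real ones come from the embedding model
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "chunk_a": [1.0, 0.0, 1.0],  # same direction as the query
    "chunk_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "chunk_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank chunk names from most to least similar to the query
ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)
```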

In [26]:
# Instantiating the vectorstore
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=OpenAIEmbeddings())

# Instantiating the retriever
retriever = vectorstore.as_retriever()



In [27]:
embedder = OpenAIEmbeddings()
print(f"Default model: {embedder.model}")

Default model: text-embedding-ada-002


### Choice of prompt template and LLM

In [59]:
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Provide as much information as possible, but only from the given context"    
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "Question: {question} \nContext: {context} \nAnswer:"),
    ]
)
# Prompt
# prompt = hub.pull("rlm/rag-prompt")

# LLM
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0.2)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
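The data flow of the chain above can be sketched in plain Python, without LangChain, to show what each stage receives and returns. Everything prefixed with `fake_` is a hypothetical stand-in (no retrieval or model call happens here); only `format_docs` and the human-message template mirror the notebook.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Minimal stand-in for a LangChain Document (only page_content is used)."""
    page_content: str


def fake_retriever(question: str) -> list[FakeDoc]:
    # Hypothetical retriever: returns fixed chunks regardless of the question
    return [FakeDoc("Country: Indonesia"), FakeDoc("Project title: MAHAKAM")]


def format_docs(docs: list[FakeDoc]) -> str:
    # Same post-processing as in the notebook: join chunks with blank lines
    return "\n\n".join(doc.page_content for doc in docs)


def fake_prompt(inputs: dict) -> str:
    # Mirrors the human message template used in the notebook
    return f"Question: {inputs['question']} \nContext: {inputs['context']} \nAnswer:"


def fake_llm(prompt_text: str) -> str:
    # Hypothetical model: just reports how much text it was given
    return f"(stub answer for a prompt of {len(prompt_text)} characters)"


def run_chain(question: str) -> str:
    # 1. the question goes to the retriever and is also passed through unchanged
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    # 2. the dict fills the prompt template, 3. the prompt goes to the model
    return fake_llm(fake_prompt(inputs))


print(run_chain("Country and city"))
```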



#### Test several questions taken from the "Dépôt sur site.docx" questionnaire
#### Protocol:
* Two types of user prompts are used:
    * Direct prompt, which repeats exactly the information requested in the questionnaire
    * "Enhanced" prompt, which takes the information requested in the questionnaire, rephrases it as a question, and adds clarifying details

1. Submit the two prompts in turn to the LLM through the RAG chain
2. Evaluate the LLM's answer against the reference answer for each question (copied by hand), using the LLM in a second pass, and produce a score from 1 to 10 (see the prompt in `score_reference_vs_rag_with_gpt`)
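The protocol above could be wrapped in one helper so that each question set runs through the same loop. This is a sketch under assumptions: `evaluate_question_set`, `stub_answer` and `stub_score` are hypothetical names, and the stubs replace the real calls (in the notebook, the two callables would be `get_rag_reply` and `score_reference_vs_rag_with_gpt`, defined in the next cell).

```python
from typing import Callable


def evaluate_question_set(
    question_id: int,
    questions: list[dict],
    reference_answer: str,
    answer_fn: Callable[[str], str],
    score_fn: Callable[[str, str, str], object],
) -> list[dict]:
    """Run every prompt variant of one question and collect one record per variant."""
    records = []
    for entry in questions:
        answer = answer_fn(entry["question"])
        records.append({
            "question_#": question_id,
            "question": entry["question"],
            "rag_answer": answer,
            "prompt_enhanced": entry["prompt_enhanced"],
            "eval_score_llm": score_fn(entry["question"], reference_answer, answer),
        })
    return records


# Stubs so the sketch runs without any API call (assumptions, not the real functions)
def stub_answer(question: str) -> str:
    return f"stub answer to: {question}"


def stub_score(question: str, reference: str, answer: str) -> int:
    return 5


demo_questions = [
    {"question": "Country and city", "prompt_enhanced": False},
    {"question": "In which country and city the MAHAKAM project takes place", "prompt_enhanced": True},
]
demo_records = evaluate_question_set(2, demo_questions, "reference text", stub_answer, stub_score)
print(demo_records)
```

Passing the question list and the reference answer as arguments, rather than reading them from notebook globals, keeps each record tied to the inputs that actually produced it.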


In [29]:
# Wrappers for LLM / RAG Q&A and cosine similarity
evaluations = []
embedder = OpenAIEmbeddings()

def get_cosine_scores(reference_answer, vanilla_answer, rag_answer, embedder=embedder):
    """Compare the vanilla and RAG answers to the reference answer in embedding space."""
    emb_ref_answer = embedder.embed_query(reference_answer)
    emb_vanilla_answer = embedder.embed_query(vanilla_answer)
    emb_rag_answer = embedder.embed_query(rag_answer)

    # cosine_similarity expects 2D inputs and returns a 1x1 matrix here
    vanilla_similarity = cosine_similarity([emb_ref_answer], [emb_vanilla_answer])
    rag_similarity = cosine_similarity([emb_ref_answer], [emb_rag_answer])

    vanilla_similarity = np.round(vanilla_similarity[0][0], 2)
    rag_similarity = np.round(rag_similarity[0][0], 2)

    print("--------------------\nCosine similarity scores:")
    print(f"Vanilla score: {vanilla_similarity}")
    print(f"RAG score: {rag_similarity}")
    print("--------------------\n")

    return vanilla_similarity, rag_similarity

def get_rag_reply(question, llm=llm, rag_chain=rag_chain):
    """Send the question through the RAG chain and return the answer (`llm` is unused here)."""
    rag_qa = rag_chain.invoke(question)

    print(f"LLM answer:\n{rag_qa}")

    return rag_qa

def score_reference_vs_rag_with_gpt(question, reference_text, ragReply, llm=llm):
    """Ask the LLM to grade the RAG answer against the reference text, from 1 to 10."""
    prompt=f"""
        Help me to compare a RAG answer against a reference text to a given question

        Question:\n{question}

        Reference text:\n{reference_text} 

        RAG answer:\n{ragReply}

        Provide a score from 1 to 10 to evaluate the quality of the RAG answer against the reference text (coherence, coverage, clarity ..)
        Respond only with the score (1,2 ... 10)
    """

    resp = llm.invoke(prompt)

    print(f"LLM eval score: {resp.content} (higher is better)")

    # Return an int when the model replied with a bare number, otherwise the raw reply
    try:
        return int(resp.content)
    except (TypeError, ValueError):
        return resp.content
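Because the model may not always reply with a bare number, a slightly more defensive parser can help keep `eval_score_llm` numeric. The sketch below is one possible approach, not what the notebook does: `parse_score` is a hypothetical helper that takes the first integer found in the reply and rejects values outside the 1 to 10 range.

```python
import re
from typing import Optional


def parse_score(raw: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Return the first integer found in `raw` if it lies in [low, high], else None."""
    match = re.search(r"\d+", raw)
    if match is None:
        return None  # no digits at all in the reply
    value = int(match.group())
    return value if low <= value <= high else None


print(parse_score("9"))
print(parse_score("Score: 10/10"))
print(parse_score("I cannot tell"))
```

Taking the first integer is a heuristic: a reply such as "On a 1 to 10 scale: 7" would be read as 1, so the "respond only with the score" instruction in the prompt still matters.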





#### Question 1: Describe the MAHAKAM project

In [62]:
questions=[
    {"question": "Description of the project", "prompt_enhanced": False},
    {"question": "Give me a description of the MAHAKAM project", "prompt_enhanced": True}
]

# copied manually from the document
reference_answer="""
    Brief project description
    East Kalimantan Province in 2021/2022 received great attention nationally
    because of the moving of the state capital city (Jakarta) to a location near
    the city of Balikpapan and Penajam Paser Utara in East Kalimantan Province.
    The development of the new capital will start in 2022. Although the
    Indonesia President commit to develop the new capital as Forest and Smart
    City, the surrounding area particularly the coastal area such as Delta
    Mahakam and Adang Bay might get high pressure as the consequence of the
    new development and the movement of 1.5 million people to the new
    capital.
    Delta Mahakam, in the eastern part of East Kalimantan, is an area that is
    relatively close to the prospective center of the State capital (about 100 km).
    Mahakam Delta is naturally a mangrove habitat, but due to excessive land
    clearing for extensive aquaculture about 47.5 % of the mangrove ecosystem
    is degraded to be converted into aquaculture (2017). Despite various
    conservation efforts by different parties and the government, land clearing
    still continues. Delta Mahakam land ownership is government land that has
    designated as a production forest, but this area has been inhabited by
    residents from generation to generation.
    Adang Bay is one of the coastal villages in Adang Bay, Paser Regency, on the
    southern part of East Kalimantan Province (about 100 km from the new
    capital). This area is also experiencing land conversion to increase
    aquaculture, besides there are several locations in coastal areas that are
    affected by abrasion. Restoration activities in East Kalimantan Province are
    needed to restore a degraded environment, as well as to support the vision
    of the nation's capital as a green city.
    The ecosystem in Delta Mahakam and Adang Bay 1 are also home to
    critically endangered species, such as the nasal monkey (proboscis
    monkey), endemic to the island of Borneo. On a global scale, the mangrove
    is a key ecosystem to answer the challenge of carbon sequestration and
    fight against climate change.
    The objective of the project is therefore to contribute to restore the
    degraded mangrove forest in East Kalimantan (Delta Mahakam and Adang
    Bay) as home of endemic and endangered species including proboscis
    monkey and key ecosystem to mitigate and to adapt the impact of climate
    change; and this, through four main actions: raising awareness of the
    stakeholders, rehabilitating degraded mangrove forest, supporting the
"""



In [74]:

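# NOTE: this loop reads the globals `questions` and `reference_answer`.
# Re-run the cell that defines them just before this one; otherwise the loop
# evaluates whichever question set was defined last in the session.
for e in questions: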
    question=e["question"]
    prompt_enhanced=e["prompt_enhanced"]
    
    print(f"------------\nEnhanced prompt: {prompt_enhanced}")
    
    rag_qa=get_rag_reply(question)

    # cosine_scores=get_cosine_scores(reference_answer, vanilla_qa, rag_qa)
    # vanilla_similarity, rag_similarity=cosine_scores

    eval_score=score_reference_vs_rag_with_gpt(question, reference_answer, rag_qa)

    evaluations.append(
        {
            "question_#":1, "question": question,  "rag_answer": rag_qa, "prompt_enhanced":prompt_enhanced,             
            "eval_score_llm": eval_score, 
        }
    )



------------
Enhanced prompt: False
LLM answer:
The country is Indonesia, and the relevant cities are in East Kalimantan Province, specifically Paser District (Adang Bay village) and Kutai Kartanegara district (Delta Mahakam). The new political capital of Indonesia, Nusantara, is also being established nearby.
LLM eval score: 9 (higher is better)
------------
Enhanced prompt: True
LLM answer:
The MAHAKAM project takes place in Indonesia, specifically in Paser District (Adang Bay village) and Kutai Kartanegara district (Delta Mahakam) in East Kalimantan Province.
LLM eval score: 10 (higher is better)


#### Question 2: City and country

In [73]:
questions=[
    {"question": "Country and city", "prompt_enhanced": False},
    {"question": "In which country and city the MAHAKAM project takes place", "prompt_enhanced": True}
]

# copied manually from the document
reference_answer="""
    The location of the project is in Paser District (Adang Bay village) and Kutai
    Kartanegara district (Delta Mahakam) East Kalimantan Province. The location
    of project is nearby the new capital of Indonesia which is in the Penajam
    Paser Utara (around 130-160 km)
"""

for e in questions:
    question=e["question"]
    prompt_enhanced=e["prompt_enhanced"]
    
    print(f"------------\nEnhanced prompt: {prompt_enhanced}")
    rag_qa=get_rag_reply(question)

    eval_score=score_reference_vs_rag_with_gpt(question, reference_answer, rag_qa)


    evaluations.append(
        {
            "question_#": 2, "question": question,  "rag_answer": rag_qa, "prompt_enhanced":prompt_enhanced,
             
            "eval_score_llm": eval_score,
        }
    )



------------
Enhanced prompt: False
LLM answer:
The country is Indonesia, and the relevant cities are Paser District and Kutai Kartanegara, located in East Kalimantan Province. The new capital of Indonesia, Nusantara, is also being developed nearby.
LLM eval score: 9 (higher is better)
------------
Enhanced prompt: True
LLM answer:
The MAHAKAM project takes place in Indonesia, specifically in the Paser District (Adang Bay village) and Kutai Kartanegara district (Delta Mahakam) in East Kalimantan Province.
LLM eval score: 10 (higher is better)


#### Question 3: 
Who are the project's target beneficiaries (age, activity, socio-professional category, gender)?

In [65]:
# basic question
# question="""target beneficiaries(age, activity, socio-professional category, gender). """

# enhanced question
questions=[
    {"question": "target beneficiaries", "prompt_enhanced": False},
    {
    "question": """who are the project's target beneficiaries of the Mahakam project (age, activity, socio-professional category, gender). 
        Give specific details and figures if any""", 
    "prompt_enhanced": True},

]



# reference answer copied manually from the document
reference_answer="""
    Number of direct beneficiaries of the pilot project: 3245 people with the
    proportion of 30% women and 70% men.
    Number of indirect beneficiaries: 3000 people by assuming at least the
    project will give benefit indirectly to 1500 people per location including in
    East Kalimantan and Indonesia.
    The target groups include:
    - School children (primary schools and secondary schools)
    - Teachers (primary school teachers)
    - Community members (villagers, consists of fish farmers, women group,
    and youth)
    - Village officials
    - Stakeholders from various institutions (government institutions,
    universities, and non-government organizations)
    - Public audience in general (reached by Media)
    Other potential groups:
    - High school and university students
    - Environmental activists 
"""


for e in questions:
    question=e["question"]
    prompt_enhanced=e["prompt_enhanced"]

    print(f"------------\nEnhanced prompt: {prompt_enhanced}")
    rag_qa=get_rag_reply(question)

    
    eval_score=score_reference_vs_rag_with_gpt(question, reference_answer, rag_qa)

    evaluations.append(
        {
            "question_#": 3, "question": question,  "rag_answer": rag_qa, "prompt_enhanced":prompt_enhanced,
             
            "eval_score_llm": eval_score,
        }
    )



------------
Enhanced prompt: False
LLM answer:
The target beneficiaries of the Planète Urgence FOREST Programme include 3,245 direct beneficiaries, comprising 30% women and 70% men, and 3,000 indirect beneficiaries. Specific target groups are school children (both primary and secondary) and primary school teachers. Additionally, there is a focus on training 40 women in financial management and product marketing in the Delta Mahakam and Adang Bay areas.
LLM eval score: 8 (higher is better)
------------
Enhanced prompt: True
LLM answer:
The target beneficiaries of the Mahakam project include community members such as fish farmers, women's groups, youth, and village officials. The project also engages stakeholders from various institutions, including government bodies and NGOs, as well as high school and university students. Specific demographic details such as age and gender are not provided, but the project emphasizes gender equity and social inclusion.
LLM eval score: 7 (higher is better)

#### Question 4: Number of people concerned

In [66]:

questions=[
    # basic question
    {"question":"""Number of people concerned""", "prompt_enhanced": False},
    # enhanced question
    {"question":"""What is the number of people concerned by the Mahakam project ?""", "prompt_enhanced": True},
]

# reference answer copied manually from the document
reference_answer="""
    Number of direct beneficiaries of the pilot project: 3245 people.
    Number of indirect beneficiaries: 3000 people.
"""

for e in questions:
    question=e["question"]
    prompt_enhanced=e["prompt_enhanced"]

    print(f"------------\nEnhanced prompt: {prompt_enhanced}")

    rag_qa=get_rag_reply(question)
    

    eval_score=score_reference_vs_rag_with_gpt(question, reference_answer, rag_qa)

    evaluations.append(
        {
            "question_#": 4, "question": question,  "rag_answer": rag_qa, "prompt_enhanced":prompt_enhanced,
             
            "eval_score_llm": eval_score,
        }
    )



------------
Enhanced prompt: False
LLM answer:
The pilot project has 3,245 direct beneficiaries, comprising 30% women and 70% men. Additionally, there are 3,000 indirect beneficiaries, assuming at least 1,500 people benefit per location. The target groups include school children and primary school teachers.
LLM eval score: 9 (higher is better)
------------
Enhanced prompt: True
LLM answer:
The number of people concerned by the Mahakam project includes community members (villagers, fish farmers, women groups, and youth), village officials, stakeholders from various institutions, and the general public reached by media. Additionally, high school and university students, as well as environmental activists, are also potential groups involved. However, a specific numerical figure is not provided in the context.
LLM eval score: 4 (higher is better)
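Once a few questions have been scored, the direct and enhanced prompts can be compared by averaging their scores. The sketch below assumes records shaped like those appended to `evaluations`; the sample scores are copied from the outputs printed above for questions 2 to 4, and `mean_score_by_prompt_type` is a hypothetical helper. With so few questions the averages are only indicative.

```python
from collections import defaultdict
from statistics import mean

# Scores copied from the printed outputs for questions 2 to 4
sample_evaluations = [
    {"question_#": 2, "prompt_enhanced": False, "eval_score_llm": 9},
    {"question_#": 2, "prompt_enhanced": True, "eval_score_llm": 10},
    {"question_#": 3, "prompt_enhanced": False, "eval_score_llm": 8},
    {"question_#": 3, "prompt_enhanced": True, "eval_score_llm": 7},
    {"question_#": 4, "prompt_enhanced": False, "eval_score_llm": 9},
    {"question_#": 4, "prompt_enhanced": True, "eval_score_llm": 4},
]


def mean_score_by_prompt_type(records: list[dict]) -> dict:
    """Average the LLM evaluation score separately for direct and enhanced prompts."""
    grouped = defaultdict(list)
    for record in records:
        score = record["eval_score_llm"]
        if isinstance(score, int):  # skip replies that could not be parsed as a number
            key = "enhanced" if record["prompt_enhanced"] else "direct"
            grouped[key].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


means = mean_score_by_prompt_type(sample_evaluations)
print(means)
```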


#### Question 5: 
Context, environment, rationale and challenges of the project 

In [67]:


questions=[
    # basic question
    {"question":"""Context, environment, project rationale and challenges""", "prompt_enhanced": False},
    # enhanced question
    {"question":"""What is the context of the Mahakam project, including environment, project rationale and challenges ?""", "prompt_enhanced": True},
    # # enhanced question 2
    # {"question":"""What is the context behind the Mahakam project, with a focus on the theme of environment, project rationale and several challenges ?""", "prompt_enhanced": True},
]

# reference answer copied manually from the document
reference_answer="""
    Context & environment and development challenges
    Geographic and socio-economic context
    East Kalimantan is one of the richest provinces in Indonesia and the main
    contribution to the national GDP. Before palm oil and mining coal booming in
    early 2000, forestry, mining and gas sectors are the backbone of economic
    development in East Kalimantan. Because too much depending on the
    unrenewable natural resources, the economic growth of East Kalimantan
    gradually declines and, in 2016, reached the minus point because of the
    lowest price of coal at the global level. In 2019, Indonesia government has
    decided to move the capital of Indonesia from Jakarta to East Kalimantan.
    Currently, the government accelerate the infrastructure development of new
    capital.

    The project will be implemented in several regions of Mahakam Delta and
    Adang Bay. Mahakam Delta is located on the eastern coast of the island of
    Borneo, in East Kalimantan province, which is one of the five provinces that
    has the lowest population density in Indonesia. This province is also the main
    contributor to the national GDP, mainly for its wealth in oil and gas. It is
    nevertheless aquaculture activities which constitute the main source of
    income for the local population. About 90% of the population depend on it
    for their livelihood. As a result, 54.19% of the Mahakam Delta has been
    converted to shrimp ponds. The majority of exports from the area are made
    up of tiger shrimp and white shrimp that are farmed in the delta ponds and
    along the Paser District's shore.

    Paser District is located on the east coast of East Kalimantan Province. The
    village of Adang Bay is located in the coastal area of this district, in Adang
    Bay. This area is a conservation area managed by the Ministry of Forestry
    (KPHP). Therefore, limited economic activities are allowed in this area.
    However, since the late 1990s, massive clearing for the construction of
    aquaculture ponds has destroyed the mangrove forest in the area. According    
    to the District Pastor's Investment Agency, about 1,506 people live in Adang
    Bay. Most of them work as fishermen, fish farmers and swallowers. The
    village government and the community have made a strong commitment to
    conserve the area by adhering to the jurisdictional REDD+ approach, funded
    by the World Bank's FCPF or Forest Carbon Partnership Facility Project.
    Environmental context
    Largest archipelago in the world (more than 13,000 islands), Indonesia has
    an area of 1,905,000 km2 of which less than 50% is still covered by forests
    today, while the country is part of the 3rd largest planetary tropical
    forest zone (after the Amazon and the Congo Basin). More than half of
    Indonesian forests have disappeared since 1960. However, they are home to
    a large part of the world's biodiversity (more than 10% respectively of plant,
    mammal, reptile and bird species). Today, the country counts for 3 to 5% of
    annual global greenhouse gas emissions (among the 10 most emitting
    countries) including more than 50% due to land use, their change of land use
    and the exploitation of forests.
    Indonesia is home to almost 1/4 of the world's mangroves (20%). This
    maritime ecosystem, made up of a set of mainly woody plants (the most
    notable species being the mangrove), develops in the swinging area of the
    tides of the low coasts and in marshes at the mouth of certain rivers. Of the
    nearly 3.2 million hectares of mangrove forest 2 in the country today, more
    than 50,000 ha are lost each year.
    The mangrove is one of the most productive ecosystems on the planet,
    home to a particularly abundant biomass. The mangroves' root system is
    notably a biotope where a variety of fish and crabs live and reproduce. The
    mangrove thus provides important resources (forestry and fishery) to coastal
    populations, a natural “buffer” zone adapted to salinity, filtering sediment
    and pollution carried by rivers and the sea, and preserving the fresh water
    resources of the land. They are a food security and livelihood issue, in
    particular providing income to fishing communities. This ecosystem is also
    an important natural fount of carbon, with Indonesian mangroves storing
    around 5 times more carbon per hectare than terrestrial forests. The
    government of Indonesia has taken into account this ecosystem in its REDD+
    strategy, implemented in the only pilot province of East-Kalimantan, with the
    Provincial Council on Climate Change (DDPI) with the support of the World
    Bank in the framework of the “Forest Carbon Partnership Facility Project”.
    Finally, the mangrove plays a key role in natural defense. The complex
    network of mangrove roots can help reduce wave power, which limits erosion
    and protects coastal communities from the destructive forces of tropical
    storms. Mangroves provide protection against extreme weather events and
    tsunamis, and can adapt to rising sea levels and subsidence. They therefore    
    contribute to reducing the risk of disasters, to the resilience of communities
    and ecosystems and to their adaptation to climate change.
    In Mahakam Delta, results from a study conducted in 2018 and 2019 by the
    Kutai Kartanegara District has shown that 47.8% of mangrove forests are
    deteriorated.
    Table 1. Critical Criteria of Mahakam Delta Mangrove3
    Critical Criteria Land Area (ha) Percentage
    Damaged 7,034 5.6
    Severe 52,945 42.2
    Undamaged 65,522 52.2
    Total 125,502 100.0
    
    Source: The Result of Spatial Analysis of Mangrove Damage Level (2018)

    With the plan to move the state capital to Penajam Paser Utara District
    (PPU), the development activities to create this new big city will take place
    massively. The central government has planned to create a green city for the
    new capital, which construction will start in 2022, but various problems still
    pose challenges in locations outside the new capital. On the one hand, a
    close government center can control the surrounding environment to keep it
    conserved, but the gap in the quality of human resources and plans to move
    a large number of people from Jakarta to this area will certainly cause
    pressure on the environment.

    Biodiversity issues
    The mangrove of Mahakam Delta conceals a rich marine and arboreal
    biodiversity, characterized by a large variety of fish, arthropods, reptiles
    such as the marine crocodile (Crocodylus porosus), aquatic mammals such
    as the Irrawaddy dolphin (Orcaella brevirostris) or terrestrial like the nasal
    monkey (Nasalis larvatus), these last 2 species being considered as being
    “endangered” by the IUCN.
    The deforestation of Mahakam Delta’s mangrove hampers the effort to
    conserve this type of species, for example by fragmenting the habitat of the
    nasal monkey, whose interaction between populations strongly depends on
    the continuity of the canopy. The isolation of these populations makes them
    more vulnerable to poaching. The long-nosed monkey, endemic of Borneo
    Island is listed as “Endangered” by the IUCN as it has undergone extensive    
    population reductions across its range, and ongoing hunting and habitat
    destruction continue to threaten most populations. Numbers have
    declined by more than 50% (but probably less than 80%) over the past 3
    generations (approximately 36-40 years).4 At the scale of Mahakam Delta,
    only 2 censuses have been conducted to monitor this specie, respectively in
    1997 and 2005 which reflects the lack of resources of local institutions to
    conserve and protect this biodiversity.
    In addition, the degradation of this ecosystem leads to a decrease in fish
    stocks in the delta, threatening both fishermen and species such as sea
    crocodiles and dolphins. The situation is currently might threatening to
    exacerbate human-animal conflicts and therefore to further decrease the
    populations of the above-mentioned species, even threatening them with
    extinction. Hence, beside preventing the mangrove forest conversion into
    palm oil plantation, aquaculture ponds, and other usages, the reforestation
    activity is necessary to improve the degraded mangrove ecosystem in the
    coastal area.
    Paradoxically, the considerable modification of delta habitats resulted in a
    very substantial increase in populations of birds associated with open
    wet areas, such as egrets (100 individuals in 1987 to nearly 15,000
    individuals in 2013). Likewise, some species of heron have seen their
    population sizes increase considerably, such as the purple heron or the Javan
    pond-heron, the lesser adjutant, ducks, Sunda teal and the wandering
    whistling-duck also seem to have used the habitats created by the clearings
    to considerably increase their populations.
    The populations of these species have benefited of new feeding areas when
    the shrimp ponds were developed. Indeed, egrets, ducks, and waders use
    the shrimp ponds in high numbers on cyclical basis when shrimp ponds are
    emptied for shrimp harvesting. The presence of pristine areas, with large
    trees or dense copses of smaller species (Nypa) removed from human
    presence, is also favourable for the reproduction of these species. Here they
    find quiet conditions for reproduction or gatherings (dormitories). Amongst
    the species observed in 2013 and those not observed in 1987, eight dwell in
    an aquatic environment and directly depend on the shrimp ponds: darter,
    stilts, grey heron, black-crowned night heron, intermediate egret, western
    marsh-harrier and the Garganey. The opening of shrimp ponds was the
    obvious factor leading to the growth of all these bird populations.
    
    Institutional Context
    The key players in coastal region in East Kalimantan including in Delta
    Mahakam Ulu (Delta Mahakam) and Delta Mahakam (Adang Bay) are
    relatively similar. Since the area located or nearby the conservation area and
    forest production area, the Ministry of Forestry via Nature Conservancy
    Agency in East Kalimantan and Forest Management Unit (provincial
    government agency) are the most influence actor. They have authority to
    determine the activities which allowed and not allowed in the area. However,
    they cannot control the vast area of conservation area since 50 percent of
    mangrove forest in the region have been degraded. Besides government, the
    others key actors are fishermen, fish farmers, swallow workers and investors
    in aquaculture sectors. Those actors have shaped the landscape of coastal
    area in Delta Mahakam and Adang Bay over the past 20 years. In their hand
    the future of sustainable aquaculture is determined. Environmental and
    development NGOs, oil and gas company and other parties has programme
    in their area. Most of the programme focus on improving the livelihood of the
    local people and restoring the mangrove forest.
    
    The Movement of Indonesian New Capital
    Paser District (East Kalimantan Province) will soon be the site of Indonesia's
    new political capital, Nusantara, as part of the plan to move the country's
    capital from the island of Java to the island of Borneo, which is home to one
    of the world’s largest rainforests.
    Jakarta, the current political capital which will become the country's
    economic capital by 2045, is currently facing several environmental, climatic
    and demographic problems and challenges: overpopulation, heavy pollution,
    rising water levels, frequent flooding, etc. In order to deal with the inevitable
    future security issue, the Indonesian government has decided to build a new
    capital 2,000 km away from Jakarta, in the province of East Kalimantan,
    more precisely between the towns of Balikpapan and Samarinda. With the
    legislation for the relocation of the new capital published, the physical
    development of the new capital will begin in 2022. In August 2024, the
    President plans to celebrate Indonesia's Independence Day in the new
    capital.
    The government plans to make the new capital a "forest city" by strongly
    preserving forest areas and using sustainable energy. However, many argue
    that the development of the new capital could lead to environmental
    degradation and loss of essential biodiversity, especially in the mangrove
    forest. The majority of the Indonesian population, including the local
    population, supports the new capital movement by echoing the effect of
    equitable development. Indeed, for decades, the natural resources of
    Kalimantan Island have been exploited to support Indonesian development,
    especially that of Java Island.
    The location of the project (Delta Mahakam and Adang Bay -Adang Bay) is an
    area relatively close to the potential centre of the state capital (about 100-
    200 km).

    Environment and development challenges
    a) Aquaculture industry
    Mahakam Delta area is under pressure from both the industrial and
    agricultural sectors, including aquaculture facing a national high dynamic.
    From 2015 to 2035 it is expected a destruction of 600,000 ha of mangrove
    for shrimp farm at the national scale. The World Bank (2013) estimates a
    pressure to double cultivated shrimp production from currently 300,000 t
    (produced by 600,000 ha of ponds) to 600,000 t/1,000,000 t by 2030 to fulfil
    the demand. However, with improvements in brackish water aquaculture
    productivity, halting palm oil concession to use mangroves, along with
    maintaining other mangrove use pressures at moderate levels, the net loss
    of mangroves in the next two decades could be reduced to around 23,000 ha
    at this same scale.
    The East-Kalimantan Province is the new area to develop aquaculture ponds
    as Java, Sumatra and Sulawesi islands are facing a decrease of the
    production and the destruction of their environment due to unsustainable
    practices.
    Feature 1: Forecasted mangrove loss at six mangrove regions in Indonesia
    in the next two decades due to land use change under pessimistic scenario.
    Circle size indicates potential loss areas in Sumatra, Kalimantan, and Papua;
    as for Java, Sulawesi and Maluku potential loss areas are represented by the
    smaller circles.
    Scientific studies also show that the percentage of mangrove natural
    recovery is higher in East-Kalimantan with 1.4%/year against 0.7%/year in
    other islands in inactive ponds. This suggests to consider conservation
    activities in specific areas of Mahakam Delta. At the scale of Mahakam Delta,
    the table below for which the percentage (43.7%) is as higher as the
    remaining mangrove forest (48.5%) highlights the dominance of aquaculture.

    
    b) Demography
    The demographic issue must also be considered. Indeed, the announcement
    in 2019 of the relocation of the political and administrative capital of Jakarta
    to the province of East Kalimantan, between the cities of Balikpapan and
    Samarinda, suggests strong migrations, the development of infrastructures
    but also a growing demand for aquaculture products. By 2024, the
    Indonesian Minister of Planning hopes to transfer nearly 1.5 million public
    officials and political representatives in East Kalimantan.
    Delta Mahakam Ulu village, which belongs to Delta Mahakam district, is
    located in the northern part of the Mahakam Delta. The location of Delta
    Mahakam sub-district is close to the state-owned oil company (Pertamina),
    formerly VICO. Due to the proximity of a fairly large company, the
    community's economy is quite dynamic and the area offers a variety of jobs.
    However, the number of people who still carry out the traditional work of
    fishermen and fish farmers is still quite high, especially in the coastal areas.
    Working as a fish farmer has become one of the choices of the community as
    land is available for opening ponds. The conversion of mangrove forests into
    ponds has been going on for decades, but the production of fish and shrimp
    has decreased from time to time. Based on various studies and research,
    planting a number of mangroves in ponds can improve the soil and water
    quality in the ponds so that they can provide sustainable production. The
    farmer groups in Delta Mahakam Ulu ponds are beginning to realise the
    importance of planting mangroves in the ponds, and therefore need support
    from various parties.
    c) Other issues
    The table below represents a summary of estimation of potential loss and
    gain of mangroves in six major regions by 2035. The Kalimantan Island is the
    one to analyze in order to justify Planète Urgence and partners’ information.
    The analysis does not yet consider the movement of new capital issue which
    very likely affect the mangrove forest in East Kalimantan as well.
    
    This table highlight the multiple and complex context in which mangrove loss
    depends and confirms challenges faced in Mahakam Delta area. The lack of
    resources (financial, human resources, material) of local authorities coupled
    with a lack of transparency, coordination and communication around
    responsibilities of each actors impacts the management of mangrove forests,
    natural resources and territorial development.    
    
    Another issue that has also had a major impact on life in Indonesia, including
    East Kalimantan, is the Covid 19 global pandemic that has attacked the
    entire world since early 2020. The Covid 19 pandemic has had a major
    impact on life in Indonesia. East Kalimantan is a province outside Java Island
    with the highest rate of exposure to Covid, which has resulted in the
    government imposing a lockdown and restrictions on community activities.
    At the beginning of 2022, community activities began to return to normal,
    but a new variant emerged, namely Omicron, which spread very quickly.
    Facing a pandemic situation that has not ended, of course, the project must
    continue but still pay attention to security, safety, and practice health
    protocols.
    
    3. Strategy & theory of change
    The three years project aims to contribute to restore the degraded of
    mangrove ecosystem in Production Forest (Mahakamm Delta) and
    Conservation area (Adang Bay). In doing so, the project will address the key
    problems in those regions:
    a. Lack of awareness of local people on mangrove ecosystem,
    biodiversity issue and waste
    b. Huge area of degraded mangrove forest which affect the resilience of
    local people in facing climate change, the habitat of endangered
    species and local economy;
    c. Lack of alternative sustainable livelihood in coastal area;
    d. Poor governance particularly on mangrove ecosystem and its
    environmental and economy issue.
    To overcome those problems, Planet Urgence and its partners will work by
    implementing the PU FORET strategy which rely on three components:
    1. Restore degraded forest;
    2. Environmental awareness;
    3. Strengthening livelihood of local people.
    In addition, the involvement of local NGOs, local community and volunteer is
    key for the successful of the project and the sustainability the impact of the
    project. Therefore, PU will reinforce the capacity of those local stakeholders
    to ensure they can carry out the project activities and together achieve the
    long-term goal of the project.    
    """

# Run both question variants through the RAG chain and score each answer
# against the reference answer with the LLM judge.
for e in questions:
    question = e["question"]
    prompt_enhanced = e["prompt_enhanced"]

    print(f"------------\nEnhanced prompt: {prompt_enhanced}")
    rag_qa = get_rag_reply(question)

    eval_score = score_reference_vs_rag_with_gpt(question, reference_answer, rag_qa)

    evaluations.append(
        {
            "question_#": 5,
            "question": question,
            "rag_answer": rag_qa,
            "prompt_enhanced": prompt_enhanced,
            "eval_score_llm": eval_score,
        }
    )



------------
Enhanced prompt: False
LLM answer:
The project focuses on restoring degraded mangrove forests in Delta Mahakam and Adang Bay, addressing environmental and development challenges through community engagement and sustainable practices. Key activities include raising awareness, restoring 300 hectares of mangrove, and promoting alternative livelihoods. The project aims to strengthen coastal governance and contribute to several Sustainable Development Goals (SDGs) over a three-year period.
LLM eval score: 7 (higher is better)
------------
Enhanced prompt: True
LLM answer:
The Mahakam project focuses on mangrove restoration and local development in Indonesia, aiming to enhance environmental awareness and sustainable livelihoods in the Delta Mahakam and Adang Bay regions. Key challenges include restoring 300 hectares of degraded mangrove forest and promoting alternative livelihoods while addressing biodiversity and environmental issues. The project involves community engagement, 

#### Question 6: Project dates

In [68]:

questions=[
    # baseline question
    {"question":"""Project start date / end date""", "prompt_enhanced": False},
    # enhanced question
    {"question":"""What are the start date / end date of the Mahakam project ?""", "prompt_enhanced": True},
]

# reference answer copied manually from the document
reference_answer="""
    March 2023 – February 2026
"""

# Run both question variants through the RAG chain and score each answer
# against the reference answer with the LLM judge.
for e in questions:
    question = e["question"]
    prompt_enhanced = e["prompt_enhanced"]

    print(f"------------\nEnhanced prompt: {prompt_enhanced}")
    rag_qa = get_rag_reply(question)

    eval_score = score_reference_vs_rag_with_gpt(question, reference_answer, rag_qa)

    evaluations.append(
        {
            "question_#": 6,
            "question": question,
            "rag_answer": rag_qa,
            "prompt_enhanced": prompt_enhanced,
            "eval_score_llm": eval_score,
        }
    )



------------
Enhanced prompt: False
LLM answer:
The project start date is March 2023, and the end date is February 2026. The duration of the project is three years.
LLM eval score: 10 (higher is better)
------------
Enhanced prompt: True
LLM answer:
I don't know.
LLM eval score: 1 (higher is better)


#### Question 7:
* Provisional project budget
* Total amount of the project (in Euros)

In [69]:

questions=[
    # baseline question
    {"question":"""
        Project budget 
        Total amount of the project (in Euros) 
        
        Amount of donation requested from the Foundation (in Euros) 
        
        Detailed provisional project budget 
        
        Detailed project budget for current year
     """, "prompt_enhanced": False},
    # enhanced question
    {"question":"""
        What is the Project Budget Overview?

        What is the Total Project Cost in Euros?
        How much is the Donation Requested from the Foundation in Euros?
        Can you provide a Comprehensive Provisional Project Budget?
        What is the Detailed Project Budget for the Current Year?
    """, "prompt_enhanced": True},
]

# reference answer copied manually from the document
reference_answer="""
     The total required resources is 818 341 € for the period 2023-2026
"""

# Run both question variants through the RAG chain and score each answer
# against the reference answer with the LLM judge.
for e in questions:
    question = e["question"]
    prompt_enhanced = e["prompt_enhanced"]

    print(f"------------\nEnhanced prompt: {prompt_enhanced}")
    rag_qa = get_rag_reply(question)

    eval_score = score_reference_vs_rag_with_gpt(question, reference_answer, rag_qa)

    evaluations.append(
        {
            "question_#": 7,
            "question": question,
            "rag_answer": rag_qa,
            "prompt_enhanced": prompt_enhanced,
            "eval_score_llm": eval_score,
        }
    )



------------
Enhanced prompt: False
LLM answer:
The total required resources for the project are €818,341. The specific amount of donation requested from the Foundation is not provided in the context. Detailed budgets for the current year and provisional project budget are also not specified.
LLM eval score: 7 (higher is better)
------------
Enhanced prompt: True
LLM answer:
The Total Project Cost is €818,341. The context does not specify the Donation Requested from the Foundation. A Comprehensive Provisional Project Budget and Detailed Project Budget for the Current Year are not provided in the retrieved context.
LLM eval score: 7 (higher is better)


#### Results for the rlm/rag-prompt prompt template

In [45]:
import pandas as pd

df_evaluations=pd.DataFrame(evaluations)
display(df_evaluations)

print("Mean score, simple prompt:")
display(df_evaluations[df_evaluations["prompt_enhanced"]==False]["eval_score_llm"].mean())

print("Mean score, enhanced prompt:")
display(df_evaluations[df_evaluations["prompt_enhanced"]==True]["eval_score_llm"].mean())


Unnamed: 0,question_#,question,rag_answer,prompt_enhanced,eval_score_llm
0,1,Description of the project,The Planète Urgence | FOREST Programme aims to...,False,6
1,1,Give me a description of the MAHAKAM project,The MAHAKAM project is focused on mangrove res...,True,9
2,2,Country and city,"The country is Indonesia, and the relevant cit...",False,8
3,2,In which country and city the MAHAKAM project ...,"The MAHAKAM project takes place in Indonesia, ...",True,10
4,3,target beneficiaries,The target beneficiaries of the Planète Urgenc...,False,8
5,3,who are the project's target beneficiaries of ...,The target beneficiaries of the Mahakam projec...,True,7
6,4,Number of people concerned,"The pilot project has 3,245 direct beneficiari...",False,9
7,4,What is the number of people concerned by the ...,The number of people concerned by the Mahakam ...,True,4
8,5,"Context, environment, project rationale and ch...",The project aims to address environmental and ...,False,7
9,5,"What is the context of the Mahakam project, in...",The Mahakam project focuses on mangrove restor...,True,7


Mean score, simple prompt:


7.857142857142857

Mean score, enhanced prompt:


6.714285714285714

#### Comments:
> * On average, the direct prompts score higher than the enhanced prompts (7.86 vs. 6.71), with some nuances:
>     * The enhanced prompt performs markedly better on questions 1 and 2
>     * The enhanced prompt performs markedly worse on questions 4 and 6
> * If only one can be kept, the direct prompt is the better choice, but ideally a compromise between the two should be found
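
The per-question comparison behind these comments can be made explicit by computing, for each question, the enhanced score minus the simple score. The following is a minimal sketch, assuming rows shaped like the entries of the `evaluations` list; `evaluations_sample` and `score_deltas` are illustrative names that do not exist in the notebook, and the scores are copied from the outputs above for questions 1, 4 and 6 only.

```python
from collections import defaultdict

# Illustrative rows mirroring the structure of the `evaluations` list
# (scores copied from the results table and cell outputs above).
evaluations_sample = [
    {"question_#": 1, "prompt_enhanced": False, "eval_score_llm": 6},
    {"question_#": 1, "prompt_enhanced": True, "eval_score_llm": 9},
    {"question_#": 4, "prompt_enhanced": False, "eval_score_llm": 9},
    {"question_#": 4, "prompt_enhanced": True, "eval_score_llm": 4},
    {"question_#": 6, "prompt_enhanced": False, "eval_score_llm": 10},
    {"question_#": 6, "prompt_enhanced": True, "eval_score_llm": 1},
]


def score_deltas(rows):
    """Return {question_number: enhanced_score - simple_score}."""
    by_question = defaultdict(dict)
    for row in rows:
        # Key each score by the prompt variant (False = simple, True = enhanced).
        by_question[row["question_#"]][row["prompt_enhanced"]] = row["eval_score_llm"]
    return {
        q: scores[True] - scores[False]
        for q, scores in sorted(by_question.items())
        if True in scores and False in scores  # skip questions missing a variant
    }


deltas = score_deltas(evaluations_sample)
for q, delta in deltas.items():
    print(f"Question {q}: enhanced - simple = {delta:+d}")
```

A positive delta means the enhanced prompt helped and a negative one means it hurt, which makes the outliers (questions 4 and 6 here) easy to spot before looking at the averages.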

* **The case of question 5**:<br>
This is the most complex question; Kristin reported being dissatisfied with the answers, which were judged too vague and shallow.<br>
When the reference answer is injected as context, bypassing the RAG chain, the LLM's answer is much better (structure, completeness).<br>
The answer to this question spans pages 10 to 18 of the PP Mahakam project, so the retriever may be returning too few chunks.
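
One way to check the "too few chunks" hypothesis is to measure how many of the expected pages (10 to 18) are represented among the retrieved chunks. The sketch below assumes each retrieved chunk still carries the `page` metadata set when the documents were built; `page_coverage` is a hypothetical helper and the page numbers are made-up placeholders, not actual retriever output.

```python
def page_coverage(retrieved_pages, first_page, last_page):
    """Return (fraction of expected pages retrieved, sorted list of missing pages)."""
    expected = set(range(first_page, last_page + 1))
    hit = expected & set(retrieved_pages)
    return len(hit) / len(expected), sorted(expected - hit)


# Hypothetical page numbers for the chunks returned for question 5;
# in the notebook they would be read from each chunk's metadata["page"].
retrieved_pages = [10, 11, 11, 14]

coverage, missing = page_coverage(retrieved_pages, first_page=10, last_page=18)
print(f"coverage={coverage:.2f}, missing pages={missing}")
```

A low coverage value on the real retriever output would support increasing the number of retrieved chunks or changing the chunking strategy for this kind of broad question.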

#### Next steps:
* Use a smaller chunk size (currently up to 4000 characters, roughly 1000 tokens; test with 256 and 512)
* Improve the retriever by combining the semantic retrieval used here with a sparse retriever and lexical indexing
* Use a reranker
* Chunk the PP with generated metadata and summaries
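
As an illustration of the hybrid retrieval step, the rankings from a semantic retriever and a lexical one can be merged with reciprocal rank fusion. This is only a sketch of that one fusion technique, not the method chosen for the project: the chunk ids and both rankings are hypothetical, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties on the chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_ranking = ["c3", "c1", "c7"]   # hypothetical semantic (embedding) results
sparse_ranking = ["c1", "c9", "c3"]  # hypothetical lexical (e.g. BM25) results

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Chunks that appear in both rankings rise to the top, which is the behaviour wanted when a question mixes exact terms (dates, amounts) with broader semantic content.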