# Information Retrieval

Hi! This sub-task is about information retrieval. We will try to find the chunks of text in a database that are most relevant to a given prompt. The chunks are stored in a PostgreSQL database, and we will compare several retrieval methods to find the most relevant ones.

🎯 Goals of this notebook:
 1. Select chunks with exact and similar keyword matches with the prompt
 2. Select chunks with similar semantic meaning to prompt based on Word2Vec and/or ClimateBERT embeddings
 3. Perform hybrid search with combined scores of BM25 ranking and the ClimateBERT scores


## 1. Preparations

### 1.1 Import libraries and functions

In [1]:
# Import necessary modules
import sys
import os
from pathlib import Path

# Get the absolute path of the project root directory
notebook_dir = Path(os.getcwd())  
project_root = notebook_dir.parent.parent  # Go up TWO levels instead of one

# Add project root to Python path
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
    print(f"Added {project_root} to sys.path")

Added c:\Users\User\Documents\DS205\group-6-final-project to sys.path


In [None]:

from dotenv import load_dotenv
import os
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors, Word2Vec
from gensim.utils import simple_preprocess
from sqlalchemy import create_engine, text

from scripts.retrieval.retrieval_support import boolean_search, bm25_search, fuzzy_search, vector_search, df_with_similarity_score, hybrid_scoring
from scripts.retrieval.functions import generate_word2vec_embedding_for_text, generate_embeddings_for_text



Loading the ClimateBERT model and tokenizer from the local model directory

In [None]:
# Load variables from the .env file before reading them with os.getenv
load_dotenv()
EMBEDDING_MODEL_LOCAL_DIR = os.getenv('EMBEDDING_MODEL_LOCAL_DIR')
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")

climatebert_tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_LOCAL_DIR)
climatebert_model = AutoModel.from_pretrained(EMBEDDING_MODEL_LOCAL_DIR)

Loading the Word2Vec model

In [None]:
custom_w2v = Word2Vec.load("./local_model/custom_word2vec_768.model")

### 1.2 Introduce a prompt



#### 1.2.1 Define a prompt and exact keywords

In [7]:
# Introducing a prompt based on ASCOR CP1.a
prompt = "Does the country have a decarbonisation strategy to meet Paris Agreement that they are implementing or in the national legislation?"
keywords = prompt.split(" ")

#### 1.2.2 Generate similar words to keywords

We will use the fine-tuned Word2Vec model that we built when generating embeddings to expand the search query. For each keyword in the prompt, we look up the top k words with the highest similarity score and add them to the search terms. This helps us find relevant chunks that do not contain the exact keywords but are still related to the topic.

In [8]:
# Generate similar words using word2vec model to prompt's keywords and store them for keyword search
keywords = simple_preprocess(prompt)
similar_words = []

# For each keyword, try to find similar words
for keyword in keywords:
    try:
        # Only get similar words if keyword exists in vocabulary
        if keyword in custom_w2v.wv:
            similar = custom_w2v.wv.most_similar(keyword, topn=5)  # Get top 5 similar words
            similar_words.extend([word for word, score in similar])
    except KeyError:
        # Skip words not in vocabulary
        continue

# Combine original keywords with similar words
all_search_terms = list(set(keywords + similar_words))

print("Original keywords:", keywords)
print("\nExpanded keywords:", all_search_terms)

Original keywords: ['does', 'the', 'country', 'have', 'decarbonisation', 'strategy', 'to', 'meet', 'paris', 'agreement', 'that', 'they', 'are', 'implementing', 'or', 'in', 'the', 'national', 'legislation']

Expanded keywords: ['to', 'regulation', 'reason', 'contract', 'once', 'national', 'meet', 'farms', 'yet', 'followed', 'packages', 'does', 'action', 'strategy', 'legislation', 'met', 'positive', 'after', 'body', 'starting', 'task', 'draft', 'social', 'or', 'page', 'feedstock', 'subject', 'built', 'limited', 'happens', 'decarbonisation', 'paris', 'achieved', 'intensity', 'developed', 'procurements', 'reduced', 'given', 'prediction', 'the', 'non', 'realized', 'rise', 'revised', 'implemented', 'strong', 'kw', 'prepared', 'agreement', 'plan', 'fees', 'determination', 'used', 'have', 'carriers', 'well', 'implementing', 'results', 'that', 'able', 'are', 'another', 'shares', 'still', 'continue', 'encourage', 'currently', 'ministerial', 'development', 'against', 'expected', 'stipulated', 'po

Generate embeddings for the prompt

In [None]:
# Convert prompt into embeddings
prompt_w2v_embeddings = generate_word2vec_embedding_for_text(prompt, custom_w2v)
prompt_climatebert_embeddings = generate_embeddings_for_text(prompt, climatebert_model, climatebert_tokenizer)

### 1.3 Load the dataframe from the database

In [None]:

from sqlalchemy import create_engine, text
import os
engine = create_engine(os.getenv("DB_URL"))

df = pd.read_sql("SELECT * FROM document_embeddings", engine)

When viewing 'df', I noticed something peculiar about the 'word2vec_embedding' column. Some rows contain embeddings made up entirely of zeros, which is not expected. Counting them shows that 2981 of our 5000 tested rows have all-zero embeddings. This may stem from how the embeddings were generated. Caution is therefore warranted when interpreting these embeddings, even though some of them 'seem to' retrieve relevant chunks.

In [None]:
# Count how many rows have the same embedding as the first row (stored in 'zeroes'), which is all zeros
zeroes = df['word2vec_embedding'][0]
count_zeroes = df['word2vec_embedding'].apply(lambda x: np.array_equal(x, zeroes)).sum()
count_zeroes

2981
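The count above compares every row with the first row. A minimal sketch of a check that does not depend on row order is shown below. It assumes each embedding is either a list/array of floats or a JSON-style string such as `"[0,0,0]"`; the helper name `is_zero_vector` and the toy dataframe are hypothetical and only illustrate the idea.

```python
import json

import numpy as np
import pandas as pd


def is_zero_vector(value):
    """Return True when an embedding contains only zeros.

    Accepts a list/array of numbers or a JSON-style string like "[0,0,0]"
    (assumed storage formats, not confirmed by the notebook).
    """
    if isinstance(value, str):
        value = json.loads(value)
    return bool(np.all(np.asarray(value, dtype=float) == 0.0))


# Toy data standing in for df['word2vec_embedding'] (hypothetical values)
toy = pd.DataFrame(
    {"word2vec_embedding": [[0.0, 0.0, 0.0], "[0,0,0]", [0.1, 0.0, -0.2]]}
)
zero_mask = toy["word2vec_embedding"].apply(is_zero_vector)
zero_count = int(zero_mask.sum())
print(zero_count)
```

The same mask could be used to exclude all-zero rows before running a Word2Vec-based search, if that turns out to be desirable.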

## 2. Retrieving Relevant Chunks

### 2.1 Keyword-based Retrieval

Goal 1: Retrieve chunks with exact and similar keyword matches with the prompt

We will start with boolean search, which retrieves all exact matches of our expanded search terms. The score is calculated manually as the number of search terms a chunk contains divided by the total number of search terms.
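To make the scoring concrete, here is a minimal sketch of such a score. It is not the project's `boolean_search` (whose implementation is not shown in this notebook); the function name `boolean_score` and the simple regex tokenisation are assumptions made for illustration.

```python
import re


def boolean_score(chunk, search_terms):
    """Share of distinct search terms that appear as whole words in the chunk."""
    tokens = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    terms = {term.lower() for term in search_terms}
    if not terms:
        return 0.0
    return len(terms & tokens) / len(terms)


example_score = boolean_score(
    "The Paris Agreement strategy",
    ["paris", "strategy", "legislation", "the"],
)
print(example_score)  # 3 of the 4 terms are present
```

Because every term counts equally, common words such as "the" contribute as much as "decarbonisation", which may partly explain why this method ranks loosely related chunks highly.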

In [19]:
top_k_boolean_chunks = boolean_search(all_search_terms, df, k=25)
relevant_boolean = top_k_boolean_chunks[['original_text', 'boolean_score']]
print('Relevant chunks based on boolean search:')
relevant_boolean.head(5)


Relevant chunks based on boolean search:


| | original_text | boolean_score |
|---|---|---|
| 4666 | - Environmental impacts of hydropower developm... | 0.171717 |
| 4667 | - Including sustainability principles in hydro... | 0.161616 |
| 4782 | Climate change models predict that there will ... | 0.161616 |
| 2995 | This section specifically provides an assessme... | 0.151515 |
| 2997 | Achieving a substantial decarbonization of the... | 0.151515 |


In [33]:
relevant_boolean['original_text'].iloc[4]

'Achieving a substantial decarbonization of the energy sector will require major efforts in the building sectors. At EU level, greenhouse gas emissions in the building sector represent more than a third of total emissions. Residential buildings account for 75 % of the European building stock, from which more than 40 % was built before 1960 and more than 90 % before 1990. Low income households represent about 17 % of households in the EU (Eurostat, 2014), while estimates of EU-inhabitants suffering from fuel poverty ranging between 50-160 million inhabitants, corresponding to roughly 6-21 % of the total EU-population (Bouzarovski (2014); BPIE (2015); Bird et al. (2010)). Energy efficiency policies in the residential sector bear great potential to improve the disposable income of households. Disposable household incomes can be increased by improved EE in space heating, hot water generation or energy- using products like fridges or televisions, given that the overwhelming share of all imp

Then, we will move to BM25 ranking, which is an extended version of TF-IDF that takes into account the length of each document and the average document length in the corpus. This lets us rank the chunks by their relevance to the search terms, instead of just counting the number of matches.
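The length normalisation is easier to see in code. Below is a minimal sketch of Okapi BM25 over whitespace-tokenised chunks. It is not the project's `bm25_search`; the function name, the parameter defaults (`k1=1.5`, `b=0.75`) and the non-negative IDF variant are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenised document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        # Longer-than-average documents get a larger penalty
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in set(query_terms):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "paris agreement strategy".split(),
    "the paris agreement is one of many documents the country refers to in a long report".split(),
    "hydropower development plan".split(),
]
scores = bm25_scores(["paris", "agreement"], docs)
print(scores)
```

Both of the first two chunks contain each query term once, but the short chunk scores higher because of the length normalisation. This is consistent with very short chunks such as headings appearing near the top of a BM25 ranking.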

In [24]:
top_k_bm25_chunks = bm25_search(all_search_terms, df, k=25)
relevant_bm25 = top_k_bm25_chunks[['original_text', 'bm25_score']]
print('Relevant chunks based on BM25 search:')
relevant_bm25.head(5)

Relevant chunks based on BM25 search:


| | original_text | bm25_score |
|---|---|---|
| 3377 | Business Investment Development Strategy (BIDS) | 1.0 |
| 3848 | ADAPTATION STRATEGY AND ACTION PLAN | 0.973747 |
| 4775 | c. 'Rearrange' disturbed forest ecosystems so ... | 0.915038 |
| 4726 | a. Extend or renew native species that are exp... | 0.881522 |
| 4155 | BiH is working on the project 'Advance the Nat... | 0.853265 |


In [31]:
relevant_bm25['original_text'].iloc[4]

"BiH is working on the project 'Advance the National Adaptation Plan (NAP) process for medium-term investment planning in climate sensitive sectors in Bosnia-Herzegovina'. The project will support BiH to advance the National Adaptation Plan (NAP) process and reach goals outlined in the Paris Agreement and 2030 Agenda for Sustainable Development."

Finally, we will perform fuzzy string matching, which catches chunks whose wording is close to, but not exactly identical to, our expanded search terms or core keywords, for example because of typos or spelling variations.

This is useful if the document contains inconsistencies in how keywords are written, which is common in real-world documents.
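As a small illustration of the idea, the sketch below scores spelling variants with `difflib.SequenceMatcher` from the Python standard library. This is a swapped-in technique for illustration only: the notebook imports `fuzzywuzzy`, and the project's `fuzzy_search` may compute its score differently. The helper name `best_fuzzy_match` is hypothetical.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(term, chunk):
    """Highest similarity (0 to 1) between `term` and any single word in `chunk`."""
    words = chunk.lower().split()
    return max(
        (SequenceMatcher(None, term.lower(), word).ratio() for word in words),
        default=0.0,
    )


# British vs. American spelling still matches closely
score_variant = best_fuzzy_match(
    "decarbonisation",
    "Achieving a substantial decarbonization of the energy sector",
)
# An unrelated chunk scores low
score_unrelated = best_fuzzy_match("decarbonisation", "Project board")
print(round(score_variant, 3), round(score_unrelated, 3))
```

An exact boolean match would miss "decarbonization" entirely, whereas the fuzzy score stays high for it.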

In [28]:
top_k_fuzzy_chunks = fuzzy_search(prompt, df, k=50)
relevant_fuzzy = top_k_fuzzy_chunks[['original_text', 'fuzzy_score']]
print('Relevant chunks based on fuzzy search:')
relevant_fuzzy.head(5)

Relevant chunks based on fuzzy search:


| | original_text | fuzzy_score |
|---|---|---|
| 3434 | the | 1.0 |
| 3159 | Decarbonisation /\nremovals | 0.77 |
| 3847 | 5 THE | 0.75 |
| 3613 | for the | 0.6 |
| 4138 | BiH has demonstrated its commitment to partici... | 0.56 |


In [30]:
relevant_fuzzy['original_text'].iloc[4]

"BiH has demonstrated its commitment to participate in global efforts aimed at mitigating and adapting to climate change by signing the Paris Agreement. As a contribution to the fulfilment of the Paris Agreement, it adopted the document 'Intended Nationally Determined Contributions (INDCs)' for the period until 2030. The document is based on previously adopted strategic documents, such as the Adaptation to Climate Change and Low Carbon Development Strategy of BiH, and the documents Second National Communication on Climate Change under the UNFCCC and the First Biennial Report on Greenhouse Gas Emissions under the UNFCCC. According to the scenarios developed within the INDC, the highest level of GHG emissions is reached in 2030, when according to the baseline scenario, 20% higher emissions are expected than the 1990 level of emissions. As an unconditional target of reducing GHG emissions, BiH has set a goal of a 2% reduction in 2030 in relation to emissions according to the baseline scen

Sample results are as follows:

1. From Boolean search: 'Achieving a substantial decarbonization of the energy sector will require major efforts in the building sectors. At EU level, greenhouse gas emissions in the building sector represent more than a third of total emissions. Residential buildings account for 75 % of the European building stock, from which more than 40 % was built before 1960 and more than 90 % before 1990. Low income households represent about 17 % of households in the EU (Eurostat, 2014), while estimates of EU-inhabitants suffering from fuel poverty ranging between 50-160 million inhabitants, corresponding to roughly 6-21 % of the total EU-population (Bouzarovski (2014); BPIE (2015); Bird et al. (2010)). Energy efficiency policies in the residential sector bear great potential to improve the disposable income of households. Disposable household incomes can be increased by improved EE in space heating, hot water generation or energy- using products like fridges or televisions, given that the overwhelming share of all implemented measures are cost-effective (Yushchenko and Patel 2017; Dodoo et al. 2017). Derived from this, EE bears a great potential for the alleviation of energy poverty, but additionally induce the multiple benefits of EE, such as improving human health, lowering energy subsidies through social policies, increased the value of properties, local spending and employment, reduced emissions, etc. Initial investments in EE for renovation of buildings usually pay off in terms of heating cost reduction, which enables consumers to spend their money elsewhere in the long run. However, as the evaluation of the German KfW Energy-efficient Refurbishment Programme emphasizes, it must be noted that these investments are profitable after a period of several decades (KfW Group (2018)). Disregarding investment costs is hence a simplification and likewise the neglect of rebound and spill-over effects.'

2. From BM25: "BiH is working on the project 'Advance the National Adaptation Plan (NAP) process for medium-term investment planning in climate sensitive sectors in Bosnia-Herzegovina'. The project will support BiH to advance the National Adaptation Plan (NAP) process and reach goals outlined in the Paris Agreement and 2030 Agenda for Sustainable Development."

3. From Fuzzy string: "BiH has demonstrated its commitment to participate in global efforts aimed at mitigating and adapting to climate change by signing the Paris Agreement. As a contribution to the fulfilment of the Paris Agreement, it adopted the document 'Intended Nationally Determined Contributions (INDCs)' for the period until 2030. The document is based on previously adopted strategic documents, such as the Adaptation to Climate Change and Low Carbon Development Strategy of BiH, and the documents Second National Communication on Climate Change under the UNFCCC and the First Biennial Report on Greenhouse Gas Emissions under the UNFCCC. According to the scenarios developed within the INDC, the highest level of GHG emissions is reached in 2030, when according to the baseline scenario, 20% higher emissions are expected than the 1990 level of emissions. As an unconditional target of reducing GHG emissions, BiH has set a goal of a 2% reduction in 2030 in relation to emissions according to the baseline scenario. The conditional target (with more international assistance) is to reduce emissions by 3% compared to 1990 emissions."

On manual inspection, the fuzzy string matching result appears to be the most relevant: it explicitly mentions the Paris Agreement more than once, references the INDCs and the "Adaptation to Climate Change and Low Carbon Development Strategy", and states concrete targets (an unconditional 2% reduction in GHG emissions in 2030 relative to the baseline scenario, and a conditional 3% reduction compared to 1990 emissions). The BM25 result is also relevant: it mentions the Paris Agreement, references the National Adaptation Plan (NAP) and describes ongoing project work. Boolean search, on the other hand, produces the poorest retrieval, though its result is not entirely out of context.

### 2.2 Semantic Retrieval

Goal 2: Retrieve chunks with similar semantic meaning to the prompt based on Word2Vec and/or ClimateBERT embeddings

We can run the similarity search with the pgvector extension that is already set up in the database.
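For reference, pgvector's `<=>` operator returns cosine distance, which is one minus cosine similarity. The sketch below reproduces that ranking in NumPy on toy vectors. It is only an illustration under stated assumptions: the function name `cosine_top_k` is hypothetical, and the project's `vector_search` may use a different operator or rescale its scores.

```python
import numpy as np


def cosine_top_k(query, matrix, k=3):
    """Return indices and cosine similarities of the k rows most similar to `query`."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # All-zero vectors have no direction, so their similarity is set to 0
    sims = np.divide(matrix @ query, norms, out=np.zeros(len(matrix)), where=norms > 0)
    order = np.argsort(-sims)[:k]
    return order, sims[order]


# Toy 2-d "embeddings"; the third row mimics an all-zero embedding
chunks = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 0.0], [-1.0, 0.0]])
top_idx, top_sims = cosine_top_k([1.0, 0.0], chunks, k=2)
print(top_idx, top_sims)
```

The explicit guard for zero-norm rows matters here because cosine similarity is undefined for all-zero embeddings.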



In [44]:
climatebert_results = vector_search(
    prompt_embeddings=np.array(prompt_climatebert_embeddings),
    embedding_type='climatebert',
    top_k=25
)

print("Top 25 results using ClimateBERT:")
print(climatebert_results[['original_text', 'similarity_score']].head(25))

Top 25 results using ClimateBERT:
                                          original_text  similarity_score
3478  Table 11: Overview table of key policies affec...          1.000000
3997           Initial National Determined Contribution          0.959088
3759                                     Project board:          0.952326
3285               No budget calculated for the moment.          0.950437
3995  Initial National Communication Report under th...          0.946192
3410                          Overall policy documents:          0.945382
3973                      Designated National Authority          0.942615
3633                                        Secretariat          0.939853
4572  21UNCC - Article: How Hydropower Can Help Clim...          0.938768
1925                              Public administration          0.938397
4025  Second National Communication Report under the...          0.938183
3782                    Existing policies and measures.          0.935045
3476

In [None]:
# Change the number in iloc if you want to access other rows
climatebert_results['original_text'].iloc[5]

'Overall policy documents:'

In [37]:

w2v_results = vector_search(
    prompt_embeddings=np.array(prompt_w2v_embeddings),
    embedding_type='word2vec',
    top_k=25
)

print("\nTop 25 results using Word2Vec:")
print(w2v_results[['original_text', 'similarity_score']].head(25))


Top 25 results using Word2Vec:
                                          original_text  similarity_score
4845  According to the EEA analysis, Northern Europe...          1.000000
4329  10The results of the regional climate models a...          0.999901
4045  The first effects of climate change are alread...          0.999671
3116  Regarding employment effects PaMs triggering b...          0.999644
4658  - Hydropower development should be part of a b...          0.999629
4556  The forecasted changes in precipitation and ai...          0.999532
4819  Climate change indirectly affects water availa...          0.999504
3723  In addition to the government departments and ...          0.999488
4721  Adaptation to climate change in the field of f...          0.999407
4135  In terms of international obligations on clima...          0.999364
4835  At the global level, major changes are expecte...          0.999313
4700  country. Approaches to adaptation to climate c...          0.999289
4172  

In [None]:
# Change the number in iloc if you want to access other rows
w2v_results['original_text'].iloc[5]

'The forecasted changes in precipitation and air temperature will negatively affect the current water resources management system in Bosnia and Herzegovina, or both its entities and BD. Changes can be expected in terms of time of occurrence, frequency and intensity of extreme events - floods and droughts. The largest increase in air temperature is predicted in the vegetation period (June, July and August), and a slightly milder increase during March, April and May, which will result in increased evapotranspiration and more pronounced extreme minimums of water levels in watercourses. This will result in a general reduction in the availability of water resources in the vegetation period when the needs are greatest, in terms of water quantity, but also quality, because in low water periods the potential and real danger of significant water quality degradation increases (in which untreated municipal and other wastewaters dominate, such as in the summer in the Miljacka River in Sarajevo). G

Interestingly, after inspecting multiple rows of the results, the Word2Vec model seems to retrieve more useful information than the ClimateBERT model, contrary to our expectations. This may reflect weaknesses in how the embeddings were generated or in the model itself.

However, we have yet to establish whether the answers actually make sense, which will be tested in the LLM evaluation phase.

### Goal 3: Rank the chunks based on their relevance to the prompt

We can rerank the chunks based on their relevance to the prompt. As in the widely known hybrid search, we combine the sparse score (from the chosen keyword technique) and the dense score (from the embeddings) in a weighted sum controlled by the parameter alpha.
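A minimal sketch of this weighted sum is given below. It assumes the form `alpha * sparse + (1 - alpha) * dense` with BM25 as the sparse score and the ClimateBERT similarity as the dense score. The function name `hybrid_score` is hypothetical and this is not the project's `hybrid_scoring`. With `alpha=0.5` the formula matches the scores printed in this notebook for rows 3377 and 4155, but at that value it cannot show which of the two scores alpha is applied to.

```python
import pandas as pd


def hybrid_score(sparse, dense, alpha=0.5):
    """Weighted sum of a sparse (keyword) score and a dense (embedding) score."""
    return alpha * sparse + (1 - alpha) * dense


# Scores for two chunks, copied from the notebook output
rows = pd.DataFrame(
    {
        "bm25_score": [0.841708, 0.892933],
        "climatebert_score": [0.868619, 0.725626],
    },
    index=[3377, 4155],
)
rows["hybrid_score"] = hybrid_score(
    rows["bm25_score"], rows["climatebert_score"], alpha=0.5
)
print(rows)
```

Both input scores should be on a comparable scale (here both lie between 0 and 1) for the weighting to be meaningful.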


In [46]:
df_similarity_score = df_with_similarity_score(
    prompt_embeddings_w2v=np.array(prompt_w2v_embeddings),
    prompt_embeddings_climatebert=np.array(prompt_climatebert_embeddings),
    top_k=None
)
df_similarity_score.head(5)

bm25_df = bm25_search(all_search_terms, df_similarity_score, k=None)
bm25_df.head(5)

| | document_id | country_code | document_title | original_text | source_hyperlink | w2v_score | climatebert_score | avg_score | bm25_score |
|---|---|---|---|---|---|---|---|---|---|
| 4092 | CCLW.document.i00000004.n0000 | BIH | Climate Change Adaptation and Low Emissions Gr... | The Paris Agreement on Climate Change is based... | https://unfccc.int/sites/default/files/resourc... | 0.993667 | 0.550285 | 0.771976 | 1.0 |
| 4155 | CCLW.document.i00000004.n0000 | BIH | Climate Change Adaptation and Low Emissions Gr... | BiH is working on the project 'Advance the Nat... | https://unfccc.int/sites/default/files/resourc... | 0.986941 | 0.725626 | 0.856283 | 0.892933 |
| 4614 | CCLW.document.i00000004.n0000 | BIH | Climate Change Adaptation and Low Emissions Gr... | The risks associated with climate change have ... | https://unfccc.int/sites/default/files/resourc... | 0.994369 | 0.605252 | 0.79981 | 0.874398 |
| 3705 | CCLW.document.i00000002.n0000 | ALB | National Energy and Climate Plan 2019 Draft | Energy Regulatory Authority (ERE): The Energy ... | https://www.energy-community.org/dam/jcr:a0c2b... | 0.985716 | 0.755189 | 0.870453 | 0.859087 |
| 3377 | CCLW.document.i00000002.n0000 | ALB | National Energy and Climate Plan 2019 Draft | Business Investment Development Strategy (BIDS) | https://www.energy-community.org/dam/jcr:a0c2b... | 0.937189 | 0.868619 | 0.902904 | 0.841708 |


In [47]:

hybrid_results = hybrid_scoring(bm25_df, alpha=0.5)
print("Top results using hybrid scoring:")
print(hybrid_results[['original_text', 'hybrid_score']].head(50))

Top results using hybrid scoring:
                                          original_text  hybrid_score
3377    Business Investment Development Strategy (BIDS)      0.855164
4658  - Hydropower development should be part of a b...      0.847423
3848                ADAPTATION STRATEGY AND ACTION PLAN      0.824155
4155  BiH is working on the project 'Advance the Nat...      0.809280
3705  Energy Regulatory Authority (ERE): The Energy ...      0.807138
4136  Under the UNFCCC, Bosnia and Herzegovina is co...      0.800100
4154  At the meeting of the Ministerial Council of t...      0.783631
4092  The Paris Agreement on Climate Change is based...      0.775142
3533  Figure 6: Energy intensity (Source: National S...      0.773565
4775  c. 'Rearrange' disturbed forest ecosystems so ...      0.761036
4614  The risks associated with climate change have ...      0.739825
3004  Energy Efficiency Fund: The EE Law mandates th...      0.734070
3423  . National Energy Efficiency Action Plan 2010-... 

On review, the hybrid results seem to contain the most relevant chunks compared to keyword search and vector search alone. This is expected, as hybrid scoring combines the strengths of both: it balances the keyword-based BM25 scores against the vector similarity scores, giving a more nuanced ranking of the chunks.

All in all, fuzzy string matching produced the most relevant chunk among the keyword methods, ahead of the widely used BM25 and the simple boolean search. Further, Word2Vec embeddings retrieved more useful information than ClimateBERT, contrary to our expectations and despite around 60% of the Word2Vec rows being all zeros.

Overall, hybrid search is the most effective method for retrieving relevant chunks. The key remaining issue is tuning alpha to find the optimal balance between the two scores.
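One possible way to approach the tuning is a small grid sweep over alpha, scored against manually labelled chunks. The sketch below is purely illustrative: the scores, the relevance labels and the choice of precision@k as the metric are all hypothetical, and no such labelled set exists in this notebook yet.

```python
import numpy as np


def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring chunks that are labelled relevant."""
    top = np.argsort(-np.asarray(scores))[:k]
    return float(np.mean([int(i) in relevant for i in top]))


# Toy scores for five chunks (hypothetical values)
bm25 = np.array([0.9, 0.2, 0.8, 0.1, 0.4])
dense = np.array([0.3, 0.9, 0.7, 0.8, 0.1])
relevant = {0, 2}  # hypothetical manual relevance labels

results = {}
for alpha in np.linspace(0.0, 1.0, 5):
    hybrid = alpha * bm25 + (1 - alpha) * dense
    results[float(alpha)] = precision_at_k(hybrid, relevant, k=2)

# max() returns the first alpha that reaches the best precision
best_alpha = max(results, key=results.get)
print(results, best_alpha)
```

In practice the labels could come from the LLM evaluation phase, and the sweep would be repeated over several prompts rather than a single one.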