# Information Retrieval Systems  

### Query expansion with synonym terms to improve retrieval results
by Miltos Tsolkas

## IR2025 Preprocessing
First, we will preprocess the IR2025 text collection (stored in `corpus.jsonl`) to transform it into a form that can be used by the Elasticsearch search engine.

### Exploration and Loading of Data 
* We need the data of `corpus.jsonl` in a format that is easier to handle, such as a `pandas` dataframe.
* To achieve that, we will create a function that reads the entire file and loads it into a dataframe.
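
As a minimal sketch of the JSON Lines format that such a loader relies on, the following uses only the standard library. The two sample records and the helper name `parse_jsonl` are hypothetical; they only mimic the fields of `corpus.jsonl`.

```python
import io
import json

# Hypothetical two-record sample in JSON Lines format (one JSON object per line)
sample = io.StringIO(
    '{"_id": "a1", "title": "First", "text": "Some text", "metadata": {"year": 2004}}\n'
    '{"_id": "b2", "title": "Second", "text": "More text", "metadata": {"year": 1998}}\n'
)

def parse_jsonl(handle):
    """Parse every non-empty line of a JSON Lines stream into a dictionary."""
    return [json.loads(line) for line in handle if line.strip()]

records = parse_jsonl(sample)
print(len(records), records[0]["title"])
```

A list of dictionaries like `records` can be handed to a dataframe constructor in one step, which avoids building one small dataframe per line.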

In [54]:
import json
import pandas as pd

def load_ir2025_data(file_path):
    documents = []

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Skip blank lines, then parse each JSON line into a dictionary
            if line.strip():
                documents.append(json.loads(line))

    # Build the dataframe once from the list of dictionaries
    df = pd.DataFrame(documents)
    return df

* Now, let's call this function on our data.

In [55]:
file_path = './scidocs/corpus.jsonl'
data_df = load_ir2025_data(file_path)


* Let's see if our data was loaded.

In [56]:
data_df.head(10)

Unnamed: 0,_id,title,text,metadata
0,632589828c8b9fca2c3a59e97451fde8fa7d188d,A hybrid of genetic algorithm and particle swa...,An evolutionary recurrent network which automa...,"{'authors': ['1725986'], 'year': 2004, 'cited_..."
1,86e87db2dab958f1bd5877dc7d5b8105d6e31e46,A Hybrid EP and SQP for Dynamic Economic Dispa...,Dynamic economic dispatch (DED) is one of the ...,"{'authors': ['30728239', '49115828', '1857220'..."
2,2a047d8c4c2a4825e0f0305294e7da14f8de6fd3,Genetic Fuzzy Systems - Evolutionary Tuning an...,It's not surprisingly when entering this site ...,"{'authors': ['1685850', '1699069', '34695695',..."
3,506172b0e0dd4269bdcfe96dda9ea9d8602bbfb6,A modified particle swarm optimizer,"In this paper, we introduce a new parameter, c...","{'authors': ['8385459', '4298485'], 'year': 19..."
4,51317b6082322a96b4570818b7a5ec8b2e330f2f,Identification and control of dynamic systems ...,This paper proposes a recurrent fuzzy neural n...,"{'authors': ['34448377', '2062864'], 'year': 2..."
5,857a8c6c46b0a85ed6019f5830294872f2f1dcf5,Separate face and body selectivity on the fusi...,Recent reports of a high response to bodies in...,"{'authors': ['2981413', '2074160', '1931482'],..."
6,12f107016fd3d062dff88a00d6b0f5f81f00522d,Scheduling for Reduced CPU Energy,The energy usage of computer systems is becomi...,"{'authors': ['1800362', '9036495', '1686255', ..."
7,1ae0ac5e13134df7a0d670fc08c2b404f1e3803c,A data mining approach for location prediction...,Mobility prediction is one of the most essenti...,"{'authors': ['2108906', '22789555', '1801322',..."
8,7d3c9c4064b588d5d8c7c0cb398118aac239c71b,$\mathsf {pSCAN}$ : Fast and Exact Structural ...,We study the problem of structural graph clust...,"{'authors': ['38736958', '35660624', '36838704..."
9,305c45fb798afdad9e6d34505b4195fa37c2ee4f,"Synthesis, properties, and applications of iro...","Iron, the most ubiquitous of the transition me...","{'authors': ['5701357'], 'year': 2005, 'cited_..."


In [57]:
data_df.columns

Index(['_id', 'title', 'text', 'metadata'], dtype='object')

In [58]:
print(data_df.isnull().any())

_id         False
title       False
text        False
metadata    False
dtype: bool


* It looks like everything was loaded without any missing values.
* Let's examine in detail the content of one row of our dataframe, e.g. the first row.

In [59]:
print(data_df.iloc[0]["_id"])

632589828c8b9fca2c3a59e97451fde8fa7d188d


In [60]:
print(data_df.iloc[0]["title"])

A hybrid of genetic algorithm and particle swarm optimization for recurrent network design


In [61]:
print(data_df.iloc[0]["text"])

An evolutionary recurrent network which automates the design of recurrent neural/fuzzy networks using a new evolutionary learning algorithm is proposed in this paper. This new evolutionary learning algorithm is based on a hybrid of genetic algorithm (GA) and particle swarm optimization (PSO), and is thus called HGAPSO. In HGAPSO, individuals in a new generation are created, not only by crossover and mutation operation as in GA, but also by PSO. The concept of elite strategy is adopted in HGAPSO, where the upper-half of the best-performing individuals in a population are regarded as elites. However, instead of being reproduced directly to the next generation, these elites are first enhanced. The group constituted by the elites is regarded as a swarm, and each elite corresponds to a particle within it. In this regard, the elites are enhanced by PSO, an operation which mimics the maturing phenomenon in nature. These enhanced elites constitute half of the population in the new generation, 

In [62]:
print(data_df.iloc[0]["metadata"])

{'authors': ['1725986'], 'year': 2004, 'cited_by': ['93e1026dd5244e45f6f9ec9e35e9de327b48e4b0', '870cb11115c8679c7e34f4f2ed5f469badedee37', '7ee0b2517cbda449d73bacf83c9bb2c96e816da7', '97ca96b2a60b097bc8e331e526a62c6ce3bb001c', 'f7d4fcd561eda6ce19df70e02b506e3201aa4aa7', '772f83c311649ad3ca2baf1c7c4de4610315a077', '0719495764d98886d2436c5f5a6f992104887160', 'a1aa248db86001ea5b68fcf22fa4dc01016442f8', 'a1877adad3b8ca7ca1d4d2344578235754b365b8', '8aedb834e973a3b69d9dae951cb47227f9296503', '1e5048d87fd4c34f121433e1183d3715217f4ab4', 'b1c411363aded4f1098572f8d15941337310ca15', '05bd67f3c33d711f5e8e1f95b0b82bab45a34095', 'f59f50a53d81f418359205c814f098be5fa7655a', '8cc9fa42cb88f0307da562bb7a8104cb2ed4474c', 'c26229b43496b2fe0fa6a81da69928b378092d4d', 'fe49526fef68e26217022fc56e043b278aee8446', 'c471da1875ad3e038469880b5f8321fb15364502', 'a2f65aae36fee93adf4e32589816b386bd0121cf', '97d58db3c8d08ba6b28fcb7b87031222b077669a', '3bb96f380b213d3b597722bf6ce184ff01299e14', '2450a56cfa19bb75fdca9bb

* It appears that the `metadata` column holds values that could become columns of our dataframe in their own right.
* Let's write a function that expands these values into separate columns.
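
To illustrate the idea on a single record, here is a small standard-library sketch. The record and the helper name `flatten_metadata` are hypothetical; the notebook's own function works on the whole dataframe instead.

```python
# Hypothetical record shaped like one row of the dataframe
record = {
    "_id": "a1",
    "title": "First",
    "metadata": {"authors": ["1725986"], "year": 2004, "cited_by": [], "references": []},
}

def flatten_metadata(rec):
    """Return a copy of rec with the keys of 'metadata' lifted to the top level."""
    flat = {key: value for key, value in rec.items() if key != "metadata"}
    flat.update(rec.get("metadata", {}))
    return flat

flat_record = flatten_metadata(record)
print(sorted(flat_record))
```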

In [63]:
import json

def update_metadata(df):
    df['metadata'] = df['metadata'].apply(lambda x: json.loads(x) if isinstance(x, str) else x)

    metadata_df = pd.json_normalize(df['metadata'])
    
    updated_df = pd.concat([df.drop(columns=['metadata']), metadata_df], axis=1)
    
    return updated_df

In [64]:
updated_df = update_metadata(data_df)

In [65]:
updated_df.columns

Index(['_id', 'title', 'text', 'authors', 'year', 'cited_by', 'references'], dtype='object')

In [66]:
updated_df.head(10)

Unnamed: 0,_id,title,text,authors,year,cited_by,references
0,632589828c8b9fca2c3a59e97451fde8fa7d188d,A hybrid of genetic algorithm and particle swa...,An evolutionary recurrent network which automa...,[1725986],2004.0,"[93e1026dd5244e45f6f9ec9e35e9de327b48e4b0, 870...","[57fdc130c1b1c3dd1fd11845fe86c60e2d3b7193, 513..."
1,86e87db2dab958f1bd5877dc7d5b8105d6e31e46,A Hybrid EP and SQP for Dynamic Economic Dispa...,Dynamic economic dispatch (DED) is one of the ...,"[30728239, 49115828, 1857220, 47952931]",2002.0,"[8c6e8ac20aa8507879820a09ed4529d8e903e431, 6b7...",[]
2,2a047d8c4c2a4825e0f0305294e7da14f8de6fd3,Genetic Fuzzy Systems - Evolutionary Tuning an...,It's not surprisingly when entering this site ...,"[1685850, 1699069, 34695695, 1841941]",2001.0,"[ac1611bbe12f2dc91dad1d1ded3e618b0b848f21, 333...",[]
3,506172b0e0dd4269bdcfe96dda9ea9d8602bbfb6,A modified particle swarm optimizer,"In this paper, we introduce a new parameter, c...","[8385459, 4298485]",1998.0,"[019d49506e8fac0e964dbc52d1afc495c47df384, f51...",[54acdb67ca083326c34eabdeb59bfdc01c748df0]
4,51317b6082322a96b4570818b7a5ec8b2e330f2f,Identification and control of dynamic systems ...,This paper proposes a recurrent fuzzy neural n...,"[34448377, 2062864]",2000.0,"[4de9e6412d59169e624df02fc8c4e377a1f8be5d, ac7...","[7e1216cac1ec99b056f4d14f0ca088f3cbb9b120, 8ad..."
5,857a8c6c46b0a85ed6019f5830294872f2f1dcf5,Separate face and body selectivity on the fusi...,Recent reports of a high response to bodies in...,"[2981413, 2074160, 1931482]",2005.0,"[34bf37eb7a34ac4efc57254303f65429a3ccdd85, fde...","[5a06e4c072c85afe71490498e718bf3424faf08c, 3bc..."
6,12f107016fd3d062dff88a00d6b0f5f81f00522d,Scheduling for Reduced CPU Energy,The energy usage of computer systems is becomi...,"[1800362, 9036495, 1686255, 1753148]",1994.0,"[c3ce0da75953dd041152c1757d18647fe05872b2, 33e...",[]
7,1ae0ac5e13134df7a0d670fc08c2b404f1e3803c,A data mining approach for location prediction...,Mobility prediction is one of the most essenti...,"[2108906, 22789555, 1801322, 1796253]",2005.0,"[f1f25228e0285e615b84a150dec2279785bc7dc6, 8e8...","[073bc173609570544a63770d0ce51ce17dd079e5, ac3..."
8,7d3c9c4064b588d5d8c7c0cb398118aac239c71b,$\mathsf {pSCAN}$ : Fast and Exact Structural ...,We study the problem of structural graph clust...,"[38736958, 35660624, 36838704, 19262604, 47569...",2017.0,[],"[dd31b94077f656630348f810607308204d5fe013, 680..."
9,305c45fb798afdad9e6d34505b4195fa37c2ee4f,"Synthesis, properties, and applications of iro...","Iron, the most ubiquitous of the transition me...",[5701357],2005.0,"[82b17ab50e8d80c81f28c22e43631fa7ec6cbef2, 649...",[]


In [67]:
print(updated_df.isnull().any())

_id           False
title         False
text          False
authors       False
year           True
cited_by      False
references    False
dtype: bool


In [68]:
print(updated_df.isnull().sum())

_id            0
title          0
text           0
authors        0
year          72
cited_by       0
references     0
dtype: int64


* It seems that years are stored as floats and that 72 years are missing.
* We will replace the missing years with zeros and convert all year values to integers.
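
The same two steps can be sketched on a plain list, assuming missing years appear as NaN floats (the sample values are hypothetical):

```python
import math

# Hypothetical year values as floats, with NaN marking a missing year
years = [2004.0, float("nan"), 1998.0]

# Replace missing values with the sentinel 0, then convert to int
clean_years = [0 if math.isnan(year) else int(year) for year in years]
print(clean_years)
```

Note that 0 acts as a sentinel for "unknown year", so any later analysis by year should probably exclude it.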

In [69]:
updated_df['year'] = updated_df['year'].fillna(0)


In [70]:
updated_df['year'] = updated_df['year'].astype(int)

In [71]:
print(updated_df.isnull().sum())

_id           0
title         0
text          0
authors       0
year          0
cited_by      0
references    0
dtype: int64


In [72]:
updated_df.dtypes

_id           object
title         object
text          object
authors       object
year           int32
cited_by      object
references    object
dtype: object

* Some columns don't offer interesting information by themselves; they just contain identifier codes, for example the `authors` column.
* Instead of lists of codes, we will store just the length of each list.
* For example, a value of 2 in the `authors` column means there were 2 authors.
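
A minimal sketch of this replacement on one hypothetical row, using a plain dictionary in place of the dataframe:

```python
# Hypothetical row: lists of identifier codes, plus one malformed value
row = {"authors": ["8385459", "4298485"], "cited_by": [], "references": None}

# Keep only the list length; anything that is not a list counts as 0
counts = {
    column: len(value) if isinstance(value, list) else 0
    for column, value in row.items()
}
print(counts)
```

With this rule a 0 can mean either an empty list or a value that was not a list at all, which is worth remembering when reading the zero counts.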

In [73]:
columns_to_count = ['authors', 'cited_by', 'references']

for col in columns_to_count:
    updated_df[col] = updated_df[col].apply(lambda x: len(x) if isinstance(x, list) else 0)

print("Updated DataFrame with list lengths:")
print(updated_df.head())

Updated DataFrame with list lengths:
                                        _id  \
0  632589828c8b9fca2c3a59e97451fde8fa7d188d   
1  86e87db2dab958f1bd5877dc7d5b8105d6e31e46   
2  2a047d8c4c2a4825e0f0305294e7da14f8de6fd3   
3  506172b0e0dd4269bdcfe96dda9ea9d8602bbfb6   
4  51317b6082322a96b4570818b7a5ec8b2e330f2f   

                                               title  \
0  A hybrid of genetic algorithm and particle swa...   
1  A Hybrid EP and SQP for Dynamic Economic Dispa...   
2  Genetic Fuzzy Systems - Evolutionary Tuning an...   
3                A modified particle swarm optimizer   
4  Identification and control of dynamic systems ...   

                                                text  authors  year  cited_by  \
0  An evolutionary recurrent network which automa...        1  2004       432   
1  Dynamic economic dispatch (DED) is one of the ...        4  2002       169   
2  It's not surprisingly when entering this site ...        4  2001       521   
3  In this paper, w

In [74]:
print((updated_df == 0).sum())

_id              0
title            0
text             0
authors        567
year            72
cited_by      2649
references    1153
dtype: int64


### Text Cleaning
We will transform our dataframe into a form suitable for Elasticsearch (that is, JSON). We want to apply the following transformations to our text values:
1. **Tokenization**: Split each text into tokens.

2. **Lowercasing**: Turn all words into lowercase.

3. **Removal of Stop Words**: Remove common words that don't offer much information.

4. **Stemming**: Trim words into their core form.

5. **Save**: Save updated data into a json file.  

In more detail:  

* We'll use the `NLTK (Natural Language Toolkit)` library, which is used for processing and analyzing natural language in Python.

* Tokenization is done with the `word_tokenize` function.

* Punctuation marks and tokens containing non-alphabetic characters are removed with the `isalpha()` check.

* Stop words are removed with `stopwords` from `nltk.corpus`.

* We'll apply **stemming** with the `PorterStemmer` algorithm (e.g. "running" → "run"), followed by **lemmatization** with `WordNetLemmatizer`, which maps a word to its dictionary form (e.g. "better" → "good" when the word is tagged as an adjective; the code below uses the default noun setting).
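
The pipeline described above can be sketched without NLTK. This is a simplified stand-in, not the notebook's implementation: the regex tokenizer, the tiny stop-word set and the `naive_stem` helper are all assumptions made for illustration, and they are much cruder than `word_tokenize`, NLTK's stop-word list and `PorterStemmer`.

```python
import re

# Tiny illustrative stop-word list; NLTK's English list is much larger
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the"}

def naive_stem(word):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def simple_clean(text):
    # Tokenization: split into word runs and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lowercasing and removal of non-alphabetic tokens
    tokens = [token.lower() for token in tokens if token.isalpha()]
    # Stop-word removal
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # Stemming
    return " ".join(naive_stem(token) for token in tokens)

cleaned = simple_clean("The networks are learning, and the design is proposed.")
print(cleaned)
```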

In [75]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import os

current_dir = os.getcwd()

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Build these helpers once instead of on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):

    if pd.isna(text):
        return ""
    
    if not isinstance(text, str):
        text = str(text)

    # tokenization
    words = word_tokenize(text)
    # lowercasing and removing non-alphabetic words
    words = [word.lower() for word in words if word.isalpha()]
    # stop words removal
    words = [word for word in words if word not in stop_words]
    # Stemming/Lemmatization
    words = [lemmatizer.lemmatize(stemmer.stem(word)) for word in words]
    # joining the words back into a cleaned text
    return " ".join(words)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [76]:
cleaned_df = updated_df.copy(deep=False)

cleaned_df = cleaned_df.assign(
    title=lambda x: x['title'].apply(clean_text),
    text=lambda x: x['text'].apply(clean_text)
)

cleaned_df.to_json('cleaned_data.json', orient='records', lines=True)

print("Text cleaning and normalization complete. Data saved as 'cleaned_data.json'.")

Text cleaning and normalization complete. Data saved as 'cleaned_data.json'.


* Let's check how our data was transformed.

In [77]:
cleaned_df.head(3)

Unnamed: 0,_id,title,text,authors,year,cited_by,references
0,632589828c8b9fca2c3a59e97451fde8fa7d188d,hybrid genet algorithm particl swarm optim rec...,evolutionari recurr network autom design recur...,1,2004,432,10
1,86e87db2dab958f1bd5877dc7d5b8105d6e31e46,hybrid ep sqp dynam econom dispatch nonsmooth ...,dynam econom dispatch ded one main function po...,4,2002,169,0
2,2a047d8c4c2a4825e0f0305294e7da14f8de6fd3,genet fuzzi system evolutionari tune learn fuz...,surprisingli enter site get book one popular b...,4,2001,521,0


In [78]:
cleaned_df.dtypes

_id           object
title         object
text          object
authors        int64
year           int32
cited_by       int64
references     int64
dtype: object

In [79]:
updated_df['title'] = updated_df['title'].astype(str)
updated_df['text'] = updated_df['text'].astype(str)
cleaned_df['title'] = cleaned_df['title'].astype(str)
cleaned_df['text'] = cleaned_df['text'].astype(str)

In [80]:
# Single quotes inside the braces keep these f-strings valid before Python 3.12
print(f"Title before: \"{updated_df.iloc[0]['title']}\".")
print(f"Title after: \"{cleaned_df.iloc[0]['title']}\".")
print("-----------------------------------------------")
print(f"Text before: \"{updated_df.iloc[0]['text']}\".")
print(f"Text after: \"{cleaned_df.iloc[0]['text']}\".")

Title before: "A hybrid of genetic algorithm and particle swarm optimization for recurrent network design".
Title after: "hybrid genet algorithm particl swarm optim recurr network design".
-----------------------------------------------
Text before: "An evolutionary recurrent network which automates the design of recurrent neural/fuzzy networks using a new evolutionary learning algorithm is proposed in this paper. This new evolutionary learning algorithm is based on a hybrid of genetic algorithm (GA) and particle swarm optimization (PSO), and is thus called HGAPSO. In HGAPSO, individuals in a new generation are created, not only by crossover and mutation operation as in GA, but also by PSO. The concept of elite strategy is adopted in HGAPSO, where the upper-half of the best-performing individuals in a population are regarded as elites. However, instead of being reproduced directly to the next generation, these elites are first enhanced. The group constituted by the elites is regarded a

* We can see that our goal was achieved.

## Index Creation with Elasticsearch  

### Connection with Elasticsearch  

* Elasticsearch is a distributed search engine and real-time data analysis tool, used for storing, searching and analyzing large amounts of data quickly and efficiently.
* Let's install its Python client if it isn't installed already.

In [81]:
!pip install elasticsearch



* Now, we'll import the modules we need.

In [82]:
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

* We'll use our connection credentials (they differ for each installation) to check whether Elasticsearch is running.

In [83]:
!curl -u elastic:9uAH3wr_ZKMvFXJOCTSc http://localhost:9200/

{
  "name" : "DESKTOP-T3AQS4C",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "83WbcL3bR6O512OaxxoHvQ",
  "version" : {
    "number" : "9.0.1",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "73f7594ea00db50aa7e941e151a5b3985f01e364",
    "build_date" : "2025-04-30T10:07:41.393025990Z",
    "build_snapshot" : false,
    "lucene_version" : "10.1.0",
    "minimum_wire_compatibility_version" : "8.18.0",
    "minimum_index_compatibility_version" : "8.0.0"
  },
  "tagline" : "You Know, for Search"
}




* And, finally, we'll create the client instance.

In [84]:
client = Elasticsearch(
    "http://localhost:9200",
    basic_auth=("elastic", "9uAH3wr_ZKMvFXJOCTSc"),
    request_timeout=1000000  
)

### Index Creation

We will need to choose an appropriate **Analyzer** and a **similarity function**. Let's look at what each of these is:
  
   
`Analyzer`
* An analyzer is like a text cleaner and organizer.

* It maps variations of a word (like "run", "runs", "running") to the same term and ignores common words.

* Even though our texts are already preprocessed (with `clean_text`), we still need an analyzer.

* This is because the indexed texts have been normalized, while the queries arrive as raw text; the analyzer normalizes the queries at search time so that they can match the indexed terms.
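
The point about queries can be made concrete with a toy example. The `analyze` function below is a hypothetical, heavily simplified analyzer (lowercasing plus stripping a plural "s"); it is not the Elasticsearch `english` analyzer, only an illustration of why index-time and query-time processing have to agree.

```python
def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, strip a plural 's'."""
    terms = []
    for token in text.lower().split():
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms

# Terms stored in the index for one hypothetical document
indexed_terms = set(analyze("Recurrent Networks"))

# The same query, once matched raw and once after analysis
raw_query = "recurrent NETWORKS".split()
raw_matches = [term for term in raw_query if term in indexed_terms]
analyzed_matches = [term for term in analyze("recurrent NETWORKS") if term in indexed_terms]

print(raw_matches, analyzed_matches)
```

Without analysis only one of the two query words finds its indexed counterpart; with analysis both do.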
   
`Similarity Function`

* The similarity function is the way the system measures how similar/relevant a sentence/text is to a query.

* With the appropriate similarity function, we can rank the results based on how well they match.

* We will use the **Vector Space Model (VSM)**, which represents documents and queries as vectors in a multidimensional space.

* In this model, similarity is typically computed based on **TF-IDF (term frequency-inverse document frequency)** weighting, which gives more importance to words that appear frequently in a document but rarely across the collection.

* The vector space model is simple, efficient, and widely used for information retrieval tasks. It lets us compute similarity mathematically (e.g. as an inner product of term-weight vectors), which makes ranking straightforward to implement and to scale.   

More Specifically:

`Vector Space Model (TF-IDF)`

* Calculates the **importance (weight) of the terms** in the documents.
* Assigns **a higher weight to terms** that are **less common** across the whole collection.
* The contribution of each matching term to the similarity is calculated as:

$$
\text{Similarity} = \text{tf} \times \text{idf} \times \text{norm}
$$

- **tf**: Square root of the term frequency (how many times the term appears in the document).
- **idf**: Logarithmic measure of how rare the term is across all documents.
- **norm**: Length normalization factor, which prevents longer texts from having a disproportionate impact.
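
As a worked example, the three factors can be computed in plain Python. The function below mirrors the scripted similarity defined in the index settings of this notebook (square-root tf, smoothed logarithmic idf, inverse square-root length norm); the function name and the sample numbers are hypothetical.

```python
import math

def scripted_tfidf(freq, doc_length, doc_count, doc_freq, boost=1.0):
    """Score one term in one document as boost * tf * idf * norm."""
    tf = math.sqrt(freq)                                        # term frequency factor
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0  # rarity factor
    norm = 1.0 / math.sqrt(doc_length)                          # length normalization
    return boost * tf * idf * norm

# Same frequency and document length, but different rarity in the collection
rare = scripted_tfidf(freq=4, doc_length=100, doc_count=25657, doc_freq=10)
common = scripted_tfidf(freq=4, doc_length=100, doc_count=25657, doc_freq=5000)

print(rare > common)
```

A term that appears in few documents receives the larger score, which is the behaviour the idf factor is meant to produce.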

We'll first create an empty index with a defined structure.

In [85]:
client.indices.delete(index="ir_2025", ignore_unavailable=True) # delete the index if it already exists

ObjectApiResponse({'acknowledged': True})

In [86]:
from elasticsearch import Elasticsearch, helpers
import pandas as pd

vsm_optimized_mapping = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0, 
        "similarity": {
            "scripted_tfidf": {
                "type": "scripted",
                "script": {
                    "source": """
                        double tf = Math.sqrt(doc.freq);
                        double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0;
                        double norm = 1 / Math.sqrt(doc.length);
                        return query.boost * tf * idf * norm;
                    """
                }
            }
        },
        "analysis": {
            "analyzer": {
                "custom_english": {  
                    "type": "english"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "copy_to": "allContent",
                "analyzer": "custom_english",
                "similarity": "scripted_tfidf"
            },
            "text": {
                "type": "text",
                "copy_to": "allContent",
                "analyzer": "custom_english",
                "similarity": "scripted_tfidf"
            },
            "authors": {
                "type": "integer",
                "copy_to": "allContent"
            },
            "cited_by": {
                "type": "integer",
                "copy_to": "allContent"
            },
            "references": {
                "type": "integer",
                "copy_to": "allContent"
            },
            "year": {
                "type": "integer",
                "copy_to": "allContent"
            },
            "allContent": {
                "type": "text",
                "analyzer": "custom_english",
                "similarity": "scripted_tfidf"
            }
        }
    }
}

try:
    client.indices.create(index='ir_2025', body=vsm_optimized_mapping, master_timeout='2m')
    print("Index created successfully!")
except Exception as e:
    print("Error creating index:", e)

Index created successfully!


In [87]:
def generate_data(df):
    for index, row in df.iterrows():
        yield {
            "_index": "ir_2025",
            "_id": row['_id'],  
            "_source": {
                "title": row['title'],
                "text": row['text'],
                "authors": row['authors'],
                "cited_by": row['cited_by'],
                "references": row['references'],
                "year": row['year'],
                "allContent": f"{row['title']} {row['text']} {row['authors']} {row['cited_by']} {row['references']} {row['year']}"
            }
        }

def bulk_index_data(df, batch_size=500):
    try:
        for i in range(0, len(df), batch_size):
            chunk = df.iloc[i:i+batch_size]
            bulk(client.options(request_timeout=300), generate_data(chunk))
            print(f"Indexed batch {i//batch_size + 1} successfully")
    except Exception as e:
        print(f"Error during bulk indexing at batch {i//batch_size + 1}:", e)

bulk_index_data(cleaned_df, batch_size=500)

Indexed batch 1 successfully
Indexed batch 2 successfully
Indexed batch 3 successfully
Indexed batch 4 successfully
Indexed batch 5 successfully
Indexed batch 6 successfully
Indexed batch 7 successfully
Indexed batch 8 successfully
Indexed batch 9 successfully
Indexed batch 10 successfully
Indexed batch 11 successfully
Indexed batch 12 successfully
Indexed batch 13 successfully
Indexed batch 14 successfully
Indexed batch 15 successfully
Indexed batch 16 successfully
Indexed batch 17 successfully
Indexed batch 18 successfully
Indexed batch 19 successfully
Indexed batch 20 successfully
Indexed batch 21 successfully
Indexed batch 22 successfully
Indexed batch 23 successfully
Indexed batch 24 successfully
Indexed batch 25 successfully
Indexed batch 26 successfully
Indexed batch 27 successfully
Indexed batch 28 successfully
Indexed batch 29 successfully
Indexed batch 30 successfully
Indexed batch 31 successfully
Indexed batch 32 successfully
Indexed batch 33 successfully
Indexed batch 34 su

`Results`
* In general, the **number of batches** is given by:
$$
\text{Number of batches} = \left\lceil \frac{\text{Total rows}}{\text{Batch size}} \right\rceil
$$
* Our data gives:
$$
\text{Total rows: } 25{,}657 \quad | \quad \text{Batch size: } 500
$$
* So we expect:
$$
\text{Number of batches} = \left\lceil \frac{25657}{500} \right\rceil
$$
$$
\text{Number of batches} = \lceil 51.314 \rceil = 52
$$

* So, **our index was created successfully**.
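
The same arithmetic can be checked in a few lines, which also shows how many rows end up in the final, partial batch. The variable names are illustrative.

```python
import math

total_rows = 25657  # rows in the dataframe, as stated above
batch_size = 500

# Ceiling division gives the number of batches
num_batches = math.ceil(total_rows / batch_size)

# Every batch is full except possibly the last one
last_batch_rows = total_rows - (num_batches - 1) * batch_size

print(num_batches, last_batch_rows)
```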

`Concerning allContent`
* `"copy_to": "allContent"` copies the values of several fields into the single field `allContent`, so a query can search one field instead of many.
* This keeps results consistent when a search has to cover several fields simultaneously.

`Concerning Shards`  
* Each index in Elasticsearch is split into smaller units called shards.
* Each shard is a self-contained Lucene index and can be hosted on any node in the cluster.
* Our dataframe has 25,657 rows.
* If each document takes up roughly 1 KB, the whole collection is about 25 MB.
* A shard can handle up to about 50 GB effectively. By comparison, 25 MB is tiny, so one shard is enough.

* Let's check a few things.

In [88]:
if client.indices.exists(index="ir_2025"):
    print("Index exists!")
else:
    print("Index not found!")

Index exists!


In [89]:
response = client.indices.get(index="ir_2025")
print(response)

{'ir_2025': {'aliases': {}, 'mappings': {'properties': {'allContent': {'type': 'text', 'analyzer': 'custom_english', 'similarity': 'scripted_tfidf'}, 'authors': {'type': 'integer', 'copy_to': ['allContent']}, 'cited_by': {'type': 'integer', 'copy_to': ['allContent']}, 'references': {'type': 'integer', 'copy_to': ['allContent']}, 'text': {'type': 'text', 'copy_to': ['allContent'], 'analyzer': 'custom_english', 'similarity': 'scripted_tfidf'}, 'title': {'type': 'text', 'copy_to': ['allContent'], 'analyzer': 'custom_english', 'similarity': 'scripted_tfidf'}, 'year': {'type': 'integer', 'copy_to': ['allContent']}}}, 'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}}, 'number_of_shards': '1', 'provided_name': 'ir_2025', 'similarity': {'scripted_tfidf': {'type': 'scripted', 'script': {'source': '\n                        double tf = Math.sqrt(doc.freq);\n                        double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0

In [90]:
mapping = client.indices.get_mapping(index="ir_2025")
print(mapping)

{'ir_2025': {'mappings': {'properties': {'allContent': {'type': 'text', 'analyzer': 'custom_english', 'similarity': 'scripted_tfidf'}, 'authors': {'type': 'integer', 'copy_to': ['allContent']}, 'cited_by': {'type': 'integer', 'copy_to': ['allContent']}, 'references': {'type': 'integer', 'copy_to': ['allContent']}, 'text': {'type': 'text', 'copy_to': ['allContent'], 'analyzer': 'custom_english', 'similarity': 'scripted_tfidf'}, 'title': {'type': 'text', 'copy_to': ['allContent'], 'analyzer': 'custom_english', 'similarity': 'scripted_tfidf'}, 'year': {'type': 'integer', 'copy_to': ['allContent']}}}}}


## Implementation of the Queries  
At this point we will run queries against our index and collect the responses of our search engine. We will use the queries stored in `queries.jsonl` in the `scidocs` directory. We'll keep the first k retrieved texts, with `k = 20, 30, 50`.
*  Let's first load the queries and create a function that runs a single query.

In [91]:
queries = []
with open('./scidocs/queries.jsonl', "r", encoding="utf-8") as file:
    for line in file:
        queries.append(json.loads(line))

In [92]:
def search_document(query_text, size=20):
    response = client.search(
        index="ir_2025",
        body={
            "query": {
                "match": {
                    "allContent": query_text
                }
            },
            "size": size
        }
    )
    return response

* Time to create structures in which we will store the answers for all queries, as well as for the various k.  
* We will also include the cases k = 5, 10, and 15 for calculations that we will do in part 4.
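
Since hits come back in ranked order, the top-k list for a smaller k should be a prefix of the list for a larger k, assuming that scoring and tie-breaking are deterministic between requests. Under that assumption, a possible alternative to one search per k is to slice a single ranked list, as in this sketch with hypothetical hits:

```python
# Hypothetical ranked hits for one query: (doc_id, score), best first
ranked_hits = [(f"doc{i}", 1.0 / (i + 1)) for i in range(50)]

k_values = [5, 10, 15, 20, 30, 50]

# Slice the single ranked list once per k instead of searching again
top_k = {k: ranked_hits[:k] for k in k_values}

print({k: len(hits) for k, hits in top_k.items()})
```

This trades repeated requests for one larger request per query; whether that is preferable depends on how the results are used later.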

In [93]:
data = []

for q in queries:
    query_text = q.get("text", "")
    query_id = q.get("_id", "")
    if not query_text:
        continue
    
    response = search_document(query_text, size=1000)  
    hits = response['hits']['hits']
    
    if not hits:
        print("  No results found.\n")
        continue
    
    results_list = []
    for hit in hits:
        id = hit['_id']
        source = hit['_source']
        score = hit['_score']
        title = source.get('title', 'N/A')
        text = source.get('text', 'N/A')
        results_list.append((id, title, text, score))
    
    data.append({
        "query_id": query_id,
        "query": query_text,
        "results": results_list
    })
dfs_all = pd.DataFrame(data)

In [None]:
k_values = [5, 10, 15, 20, 30, 50]
results = {}
dfs = {}
for k in k_values:
    data = []
    
    for q in queries:
        query_text = q.get("text", "")
        query_id = q.get("_id", "")
        if not query_text:
            continue
        
        response = search_document(query_text, size=k)
        hits = response['hits']['hits']
        
        if not hits:
            print("No results found.\n")
            continue
        
        results_list = []
        for hit in hits:
            id = hit['_id']
            source = hit['_source']
            score = hit['_score']
            title = source.get('title', 'N/A')
            text = source.get('text', 'N/A')
            results_list.append((id, title, text, score))
        
        data.append({
            "query_id": query_id,
            "query": query_text,
            "results": results_list
        })
    
    dfs[k] = pd.DataFrame(data)
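
Because the hits of a search come back sorted by descending score, the top-k lists could also be derived from one larger result list instead of re-running every query for each k. The following is only a sketch of that idea in plain Python. `top_k_views` is a hypothetical helper that is not part of this notebook, it assumes the `(id, title, text, score)` tuple layout built in the loop above, and documents with tied scores may still come back in a different order between separate searches.

```python
def top_k_views(results, k_values):
    """Slice one ranked result list into several top-k prefixes.

    `results` is assumed to be sorted by descending score already,
    as the hits of a search are; nothing is re-sorted here.
    """
    return {k: results[:k] for k in k_values}


# Toy ranked list with the (id, title, text, score) layout.
ranked = [
    ("d1", "Title 1", "text 1", 3.2),
    ("d2", "Title 2", "text 2", 2.9),
    ("d3", "Title 3", "text 3", 1.4),
]

views = top_k_views(ranked, k_values=[1, 2, 5])
for k, view in views.items():
    print(k, [doc_id for doc_id, *_ in view])
```

A list shorter than k is simply returned whole, which mirrors a query that matches fewer than k documents.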


* We see that there were answers to our queries.  
* `dfs` is a dictionary where each key is one of the k values (5, 10, 15, 20, 30, 50) and each value is a pandas DataFrame.  
* Each DataFrame has three columns:  
  * **query_id**: the id of the query  
  * **query**: the text of the query as a string  
  * **results**: a list of tuples `(id, title, text, score)`, where each tuple represents a retrieved document and its relevance score.  
* Each row corresponds to a single query, so each row stores all the top-k results of that query as a list in the `results` column.  
* Let's examine the contents of this dictionary.

In [95]:
dfs[20].shape

(1000, 3)

In [96]:
dfs[30].shape

(1000, 3)

In [97]:
dfs[50].shape

(1000, 3)

In [98]:
dfs[20]['results'].apply(len).describe()

count    1000.000000
mean       19.991000
std         0.284605
min        11.000000
25%        20.000000
50%        20.000000
75%        20.000000
max        20.000000
Name: results, dtype: float64

* **count**: 1000 -> There are 1000 queries in the DataFrame.

* **mean**: 19.991 -> On average, each query returned about 19.99 results.

* **std**: 0.284605 -> The standard deviation is about 0.28, meaning there is little variation in the number of results per query.

* **min**: 11; **25%, 50%, 75%, max** all equal 20 -> Most queries returned the full 20 results, but some returned fewer, with a minimum of 11.

In [99]:
dfs[30]['results'].apply(len).describe()

count    1000.000000
mean       29.981000
std         0.600833
min        11.000000
25%        30.000000
50%        30.000000
75%        30.000000
max        30.000000
Name: results, dtype: float64

* **count**: 1000 -> There are 1000 queries in the DataFrame.

* **mean**: 29.981 -> On average, each query returned about 29.98 results.

* **std**: 0.600833 -> The standard deviation is about 0.6, indicating little variation in the number of results per query, though slightly more than for k=20.

* **min**: 11; **25%, 50%, 75%, max** all equal 30 -> Most queries returned the full 30 results, but some returned fewer, with a minimum of 11.

In [100]:
dfs[50]['results'].apply(len).describe()

count    1000.000000
mean       49.961000
std         1.233288
min        11.000000
25%        50.000000
50%        50.000000
75%        50.000000
max        50.000000
Name: results, dtype: float64

* **count**: 1000 -> There are 1000 queries in the DataFrame.

* **mean**: 49.961 -> On average, each query returned about 49.96 results.

* **std**: 1.233288 -> The standard deviation is about 1.2, indicating little variation in the number of results per query, though slightly more than for k=20 and k=30.

* **min**: 11; **25%, 50%, 75%, max** all equal 50 -> Most queries returned the full 50 results, but some returned fewer, with a minimum of 11.

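We observe that as the value of k increases, the variation in the number of results per query also increases (std of about 0.28, 0.60 and 1.23 for k = 20, 30, 50). This is expected: a query that matches fewer than k documents falls further short of a larger k, while the minimum stays at 11 in all three cases.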
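
The `describe()` summaries show that some queries fall short of k, but not how many. A small sketch of how this could be counted is given below. `short_result_counts` is a hypothetical helper introduced only for illustration, and it accepts any iterable of result lists (for example a `results` column).

```python
from collections import Counter


def short_result_counts(result_lists, k):
    """Return {length: number of queries} for result lists shorter than k."""
    lengths = Counter(len(results) for results in result_lists)
    return {n: count for n, count in sorted(lengths.items()) if n < k}


# Toy data: four queries, two of which return fewer than k = 3 documents.
toy_results = [["d1", "d2", "d3"], ["d1"], ["d1", "d2", "d3"], ["d1", "d2"]]
short = short_result_counts(toy_results, k=3)
print(short)
```

In the notebook this could be called as `short_result_counts(dfs[20]['results'], 20)`.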

In [101]:
import numpy as np
all_scores = [result[3] for row in dfs[20]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  2.795202
1              Median  2.599902
2  Standard Deviation  0.939617


* **Average score**: approximately 2.80  
* **Median score**: approximately 2.60  
* **Std of score**: approximately 0.94 (indicates that there is some small variability in the scores)

In [102]:
all_scores = [result[3] for row in dfs[30]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  2.625898
1              Median  2.439002
2  Standard Deviation  0.891223


* **Average score:** approximately 2.63

* **Median score:** approximately 2.44

* **Std of score:** approximately 0.89 (there is small variability in the scores)

In [103]:
all_scores = [result[3] for row in dfs[50]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  2.412677
1              Median  2.237310
2  Standard Deviation  0.831417


* **Average score:** approximately 2.41

* **Median score:** approximately 2.24

* **Std of score:** approximately 0.83 (there is relatively small variability in the scores)
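
The three cells above repeat the same computation for different k. As a sketch, the shared logic could be wrapped in one function. `score_stats` is a hypothetical name, and the standard library's `statistics` module is used here in place of NumPy only to keep the example self-contained. `statistics.pstdev` is the population standard deviation, which is also what `np.std` computes by default.

```python
import statistics


def score_stats(result_lists, score_index=3):
    """Mean, median and population std of all scores in the result lists.

    Each result is assumed to be a tuple whose element at `score_index`
    is the relevance score, as in (id, title, text, score).
    """
    scores = [result[score_index] for row in result_lists for result in row]
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "std": statistics.pstdev(scores),
    }


# Toy data: two queries with three results in total.
toy_results = [
    [("a", "t", "x", 1.0), ("b", "t", "x", 3.0)],
    [("c", "t", "x", 2.0)],
]
stats = score_stats(toy_results)
print(stats)
```

In the notebook it could be applied as `score_stats(dfs[k]['results'])` for each k.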

## 4. Evaluation of Results  
At this point, we will evaluate our answers by comparing them with the correct answers using the evaluation tool `trec_eval`. `trec_eval` is a standard tool used by the `TREC (Text REtrieval Conference)` community for retrieval evaluation.

We will examine the evaluation metrics:  
* **MAP (mean average precision)** and  
* **avgPre@k (average precision at the top k retrieved documents)** for k = 5, 10, 15, 20.

We have the following files:

* **queries.jsonl**: the file with the queries.

* **corpus.jsonl**: the file with the document collection that the queries are run against.

* **test.tsv**: contains the mapping of queries to correct answers.

To use `trec_eval`, we will need the following files:

* A -> **Qrel File**: The correct answers in a .qrel file.

* B -> **Results File**: File with our results (all or for various k).

Specifically, A, to be recognized by `trec_eval`, must be in the format:

| query_id   | iteration  | docno      | relevance    |
|------------|------------|------------|--------------|  

Where:  
* **query_id:** the id of the query.  
* **iteration:** we can simply set 0 for all; it has no role.  
* **docno:** the id of the corresponding corpus document.  
* **relevance:** the score found in the test.tsv of the qrels.  

Now, B, to be recognized by `trec_eval`, must be in the format:

| query_id   | iteration  | docno      | rank    | sim    | run_id    |
|------------|------------|------------|---------|--------|-----------|

Where:  
* **query_id:** the id of the query.  
* **iteration:** we can simply set 0 for all; it has no role.  
* **docno:** the id of the corresponding corpus document.  
* **rank:** the position of the answer in the result list, i.e., if k=5, the answers are numbered from 1 to 5. `trec_eval` itself orders the documents by `sim`, so this field carries no priority of its own.  
* **sim:** the score of the answer.  
* **run_id:** any single-word string as a label, can be common for all.
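
To make the two formats concrete, here is a minimal sketch that builds one line of each kind. The helper names `qrel_line` and `run_line` and the ids `q1` and `d7` are made up for illustration. The field order follows the two tables above.

```python
def qrel_line(query_id, docno, relevance, iteration=0):
    """One line of the qrels file: query_id iteration docno relevance."""
    return f"{query_id} {iteration} {docno} {relevance}"


def run_line(query_id, docno, rank, sim, run_id, iteration=0):
    """One line of the results file: query_id iteration docno rank sim run_id."""
    return f"{query_id} {iteration} {docno} {rank} {sim:.4f} {run_id}"


example_qrel = qrel_line("q1", "d7", 1)
example_run = run_line("q1", "d7", rank=1, sim=2.7952, run_id="results_all")
print(example_qrel)
print(example_run)
```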

Creation of A:


In [104]:
output_path_trec = r"C://Users//user//Downloads//trec_eval//qrels.qrel"
output_path_local = "qrels.qrel"  

df = pd.read_csv("./scidocs/qrels/test.tsv", sep="\t", header=None, names=["query_id", "docno", "relevance"], skiprows=1)
df['iteration'] = 0
df = df[['query_id', 'iteration', 'docno', 'relevance']]

df.to_csv(output_path_trec, sep=' ', index=False, header=False)
df.to_csv(output_path_local, sep=' ', index=False, header=False)

print("Qrels saved")

Qrels saved


Creation of B:

In [105]:
trec_dir = r"C://Users//user//Downloads//trec_eval"
filename = "results.txt"
full_path_trec = f"{trec_dir}\\{filename}"
full_path_local = filename

with open(full_path_trec, "w") as f_trec, open(full_path_local, "w") as f_local:
    for _, row in dfs_all.iterrows():
        query_id = row["query_id"]
        results = row["results"]

        for rank, (docno, title, text, score) in enumerate(results, start=1):
            line = f"{query_id} 0 {docno} {rank} {score:.4f} results_all\n"
            f_trec.write(line)
            f_local.write(line)

print(f"Created:\n- {full_path_trec}\n- {full_path_local}")

Created:
- C://Users//user//Downloads//trec_eval\results.txt
- results.txt


In [106]:
trec_dir = r"C://Users//user//Downloads//trec_eval"

for k in dfs:
    if k < 30:  # only k = 5, 10, 15, 20 are needed for the evaluation
        df = dfs[k]
        filename = f"results_{k}.txt"
        full_path_trec = f"{trec_dir}\\{filename}"
        full_path_local = filename

        with open(full_path_trec, "w") as f_trec, open(full_path_local, "w") as f_local:
            for _, row in df.iterrows():
                query_id = row["query_id"]
                results = row["results"]

                for rank, (docno, title, text, score) in enumerate(results, start=1):
                    line = f"{query_id} 0 {docno} {rank} {score:.4f} results_{k}\n"
                    f_trec.write(line)
                    f_local.write(line)

        print(f"Created:\n- {full_path_trec}\n- {full_path_local}")

Created:
- C://Users//user//Downloads//trec_eval\results_5.txt
- results_5.txt
Created:
- C://Users//user//Downloads//trec_eval\results_10.txt
- results_10.txt
Created:
- C://Users//user//Downloads//trec_eval\results_15.txt
- results_15.txt
Created:
- C://Users//user//Downloads//trec_eval\results_20.txt
- results_20.txt


* Now that we have created the txt files, we will execute the corresponding commands in the cmd.  
* Below is the excerpt of the commands I ran in the cmd as a copy-paste:


```bash
C:\Users\user>cd C:\Users\user\Downloads\trec_eval

C:\Users\user\Downloads\trec_eval>trec_eval -m map qrels.qrel results.txt
      1 [main] trec_eval 10636 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map                     all     0.0949

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.5 qrels.qrel results_5.txt
      1 [main] trec_eval 9176 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_5               all     0.0662

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.10 qrels.qrel results_10.txt
      1 [main] trec_eval 7444 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_10              all     0.0770

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.15 qrels.qrel results_15.txt
      1 [main] trec_eval 11012 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_15              all     0.0822

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.20 qrels.qrel results_20.txt
      1 [main] trec_eval 6036 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_20              all     0.0847
```

### Evaluation Results with trec_eval

| Evaluation Metric       | Results   | Value  |
|-------------------------|-----------|--------|
| MAP (Mean Avg Precision) | All       | 0.0949 |
| avgPre@5 (map_cut.5)     | k = 5    | 0.0662 |
| avgPre@10 (map_cut.10)   | k = 10   | 0.0770 |
| avgPre@15 (map_cut.15)   | k = 15   | 0.0822 |
| avgPre@20 (map_cut.20)   | k = 20   | 0.0847 |
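
The values in the table were copied by hand from the console. As a sketch, the summary lines could instead be parsed from the captured console text. `parse_trec_eval_output` is a hypothetical helper. It assumes the `name  all  value` layout visible in the output above and skips every other line, such as the Cygwin warning.

```python
def parse_trec_eval_output(output):
    """Extract {metric_name: value} from lines of the form 'name all value'."""
    metrics = {}
    for line in output.splitlines():
        parts = line.split()
        # Summary lines have exactly three fields, the second being 'all'.
        if len(parts) != 3 or parts[1] != "all":
            continue
        try:
            metrics[parts[0]] = float(parts[2])
        except ValueError:
            continue  # third field is not a number, so not a metric line
    return metrics


sample = """\
      1 [main] trec_eval 10636 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.
the public mailing list cygwin@cygwin.com
map                     all     0.0949
map_cut_5               all     0.0662
"""
parsed = parse_trec_eval_output(sample)
print(parsed)
```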

## 5. Analysis of Evaluation Results

* The evaluation results show that, while the overall MAP is 0.0949, the avgPre@k values gradually increase from 0.0662 for k=5 to 0.0847 for k=20.  
* This suggests that the retrieval system is indeed capable of identifying relevant documents, but many of them are not ranked at the top positions.  
* The relatively low precision for small values of k indicates that the most relevant documents often appear further down the list.  
* Although increasing k improves the mean precision, the rate of improvement decreases, showing diminishing returns.  
* **Therefore, the larger our k, the more of the relevant documents for each query we manage to retrieve.**
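
The growth of avgPre@k with k follows from how a cut-off average precision is accumulated. The sketch below is a hand-written illustration, not the `trec_eval` source. It assumes that the precision values at the relevant ranks within the top k are summed and divided by the total number of relevant documents for the query, which is my understanding of `map_cut`. The function name and the toy ids are hypothetical.

```python
def average_precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Average precision over the top k results of one query.

    Assumption: the sum of precisions at relevant ranks is divided by the
    total number of relevant documents, not only those retrieved.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_ids)


ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
ap_at_2 = average_precision_at_k(ranked, relevant, k=2)
ap_at_5 = average_precision_at_k(ranked, relevant, k=5)
print(ap_at_2, ap_at_5)
```

Under this definition a larger cut-off can only add non-negative terms to the sum, so the value never decreases as k grows.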

# Information Retrieval Systems Assignment - Phase 2  

* In the second phase, we will expand the queries of the `IR2025` collection with synonymous terms that we will extract from `WordNet`.  
* Let's take a closer look at what it provides:

### Semantic Relations of WordNet

WordNet is a lexical database (in English) that organizes words into **synonym sets (synsets)** and maps various semantic relations between them. The main types of relations that WordNet can identify, using the word **"car"** as an example where applicable, are:

| Relation      | Description                        | Example with "car"                     |
|---------------|------------------------------------|----------------------------------------|
| Synonym       | Same/similar meaning               | Car ↔ Auto, Motorcar                   |
| Hypernym      | More general term                  | Vehicle ← Car                          |
| Hyponym       | More specific term                 | Car → Sedan, SUV                       |
| Meronym       | Part of the whole                  | Car → Engine, Wheel                     |
| Holonym       | Whole containing the part          | Fleet ← Car                             |
| Entailment    | Logical consequence of action      | Buy → Pay                               |
| Troponym      | Specific manner of action          | Drive → Skid, Race                      |
| Antonym       | Opposite meaning                   | Buy ↔ Sell                              |


## 1. Defining Tools for Finding Synonyms
* We will use the `NLTK` library, which we have already imported in previous steps.
* We will create appropriate methods to further expand our queries using various synonyms from `WordNet`.
* Let's download `WordNet`.

In [107]:
nltk.download('wordnet')

from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


* We'll choose WordNet synonyms for only certain parts of speech. We will check **not to include common words**, i.e., stopwords, and **we will focus on nouns**.  
* We will use the stopwords provided by NLTK as a criterion.

In [108]:
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def get_noun_synonyms(word, max_synonyms=4):
    synonyms = set()
    if word.lower() in stop_words:
        return []
    for synset in wn.synsets(word, pos=wn.NOUN):  # only noun synsets
        for lemma in synset.lemmas():
            name = lemma.name().replace('_', ' ')
            if name.lower() not in stop_words:
                synonyms.add(name)
                if len(synonyms) >= max_synonyms:
                    break
        if len(synonyms) >= max_synonyms:
            break
    return list(synonyms)

def get_noun_hypernyms(word, max_hypernyms=4):
    hypernyms = set()
    if word.lower() in stop_words:
        return hypernyms
    for syn in wn.synsets(word, pos=wn.NOUN):
        for hypernym in syn.hypernyms():
            for lemma in hypernym.lemmas():
                name = lemma.name().replace('_', ' ')
                if name.lower() not in stop_words:
                    hypernyms.add(name)
                    if len(hypernyms) >= max_hypernyms:
                        break
            if len(hypernyms) >= max_hypernyms:
                break
        if len(hypernyms) >= max_hypernyms:
            break
    return hypernyms

def get_noun_hyponyms(word, max_hyponyms=4):
    hyponyms = set()
    if word.lower() in stop_words:
        return hyponyms
    for syn in wn.synsets(word, pos=wn.NOUN):
        for hyponym in syn.hyponyms():
            for lemma in hyponym.lemmas():
                name = lemma.name().replace('_', ' ')
                if name.lower() not in stop_words:
                    hyponyms.add(name)
                    if len(hyponyms) >= max_hyponyms:
                        break
            if len(hyponyms) >= max_hyponyms:
                break
        if len(hyponyms) >= max_hyponyms:
            break
    return hyponyms


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


* Let's check that our functions work.

In [109]:
print(get_noun_synonyms('tree', 5))

['Sir Herbert Beerbohm Tree', 'tree diagram', 'tree', 'Tree']


In [110]:
print(get_noun_hypernyms("apple"))

{'edible fruit', 'false fruit', 'apple tree', 'pome'}


In [111]:
print(get_noun_hyponyms('computer'))

{'client', 'number cruncher', 'guest', 'node'}


* Our functions seem to work as expected.

## 2. Expanding IR2025 Queries with Synonymous Terms from WordNet
For completeness, we will expand the queries in two separate ways:
* 1st Method: Expansion by adding **Synonymous Terms**
* 2nd Method: Using **Hypernyms** and **Hyponyms**

We will use the `queries` list.

In [112]:
from nltk.tokenize import word_tokenize

queries_synonyms = []
for q in queries:
    query_text = q.get("text", "")
    id = q.get("_id", "")
    words = word_tokenize(query_text)
    new_query = set()
    for word in words:
        synonyms = get_noun_synonyms(word)
        new_query.update(synonyms)  
    query_str = ' '.join(new_query)  
    queries_synonyms.append((id,query_str))

In [113]:
print(f"Query 1 before adding synonyms:\n{queries[0]}\n")
print(f"Query 1 after adding synonyms:\n{queries_synonyms[0]}\n")

Query 1 before adding synonyms:
{'_id': '78495383450e02c5fe817e408726134b3084905d', 'text': 'A Direct Search Method to solve Economic Dispatch Problem with Valve-Point Effect', 'metadata': {'authors': ['50306438', '15303316', '1976596'], 'year': 2014, 'cited_by': ['38e78343cfd5c013decf49e8cf008ddf6458200f'], 'references': ['632589828c8b9fca2c3a59e97451fde8fa7d188d', '4cf296b9d4ef79b838dc565e6e84ab9b089613de', '86e87db2dab958f1bd5877dc7d5b8105d6e31e46', '4b031fa8bf63e17e2100cf31ba6e11d8f80ff2a8', 'a718c6ca7a1db49bb2328d43f775783e8ec6f985', 'cf51cfb5b221500b882efee60b794bc11635267e', '6329874126a4e753f98c40eaa74b666d0f14eaba', 'a27b6025d147febb54761345eafdd73954467aca']}}

Query 1 after adding synonyms:
('78495383450e02c5fe817e408726134b3084905d', 'trouble search despatch dispatch lookup hunt hunting job consequence outcome method method acting effect result problem communique shipment')



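* We see that the query has been rewritten according to the first method (expansion with synonyms).  
* Note that the new query consists only of the lemmas returned by `get_noun_synonyms`. A query word such as `search`, `method` or `effect` is kept because it is among the returned noun lemmas, while words for which no noun lemma is returned, such as `solve` or `Economic`, no longer appear.  
* We also see that the stopword A is ignored, meaning no synonym is returned for it.  
* Let's follow the same logic for implementing the 2nd method (expansion with hypernyms and hyponyms).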

In [114]:
queries_hyper_hypo = []
for q in queries:
    query_text = q.get("text", "")
    id = q.get("_id", "")
    words = word_tokenize(query_text)
    new_query = set()
    for word in words:
        #getting hypernyms:
        hypernyms = get_noun_hypernyms(word)
        hypernyms.add(word) 
        new_query.update(hypernyms) 
        #getting hyponyms:
        hyponyms = get_noun_hyponyms(word)
        hyponyms.add(word) 
        new_query.update(hyponyms) 
    query_str = ' '.join(new_query)  
    queries_hyper_hypo.append((id,query_str))

In [115]:
print(f"Query {queries[0].get('_id', '')} before adding hypernyms and hyponyms:\n{queries[0].get('text', '')}\n")
print(f"Query {queries_hyper_hypo[0][0]} after adding hypernyms and hyponyms:\n{queries_hyper_hypo[0][1]}\n")

Query 78495383450e02c5fe817e408726134b3084905d before adding hypernyms and hyponyms:
A Direct Search Method to solve Economic Dispatch Problem with Valve-Point Effect

Query 78495383450e02c5fe817e408726134b3084905d after adding hypernyms and hyponyms:
acting playing investigating exploration question impression to difficulty looking solution activity phenomenon balance-of-payments problem Dispatch Economic report Method shakedown riddle news report solve bandwagon effect technique system of rules with account change appearance system manhunt pons asinorum playacting A Problem visual aspect impact Search operation wallop Valve-Point story investigation Direct head know-how race problem Effect reshipment
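
The two expansion loops above collect terms in a `set`, so the order of the terms in the final string is arbitrary and the first method does not add the original words explicitly. As an alternative sketch, and not the method used in this notebook, the expansion could keep every original word, append its related terms right after it, and drop duplicates while preserving order. `expand_query`, `toy_lookup` and the toy synonym table are hypothetical, and `str.split` stands in for `word_tokenize` only to keep the example self-contained.

```python
def expand_query(query_text, related_terms, max_per_word=4):
    """Keep each original word and append up to max_per_word related terms."""
    expanded = []
    seen = set()
    for word in query_text.split():
        candidates = [word] + list(related_terms(word))[:max_per_word]
        for term in candidates:
            key = term.lower()
            if key not in seen:  # case-insensitive de-duplication
                seen.add(key)
                expanded.append(term)
    return " ".join(expanded)


# Toy lookup standing in for a WordNet-based function.
toy_synonyms = {"search": ["hunt", "lookup"], "method": ["technique"]}


def toy_lookup(word):
    return toy_synonyms.get(word.lower(), [])


expanded_text = expand_query("Direct Search Method", toy_lookup)
print(expanded_text)
```

In the notebook, a function such as `get_noun_synonyms` could be passed as `related_terms`.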



## 3. Executing Queries  
In this step, we will run queries on the index and collect the search engine's responses. We will use the queries that were already available in the `scidocs` folder, now in their expanded form. We will keep the top k retrieved documents, for `k = 20, 30, 50`.  
* We will use the `search_document` function from Phase 1, which performs a single query. 
* We will first implement the queries that have been expanded with their synonyms.

In [116]:
data = []

for query_id, query_text in queries_synonyms:
    if not query_text:
        continue

    response = search_document(query_text, size=1000)
    hits = response['hits']['hits']

    if not hits:
        print(f"No results found for query ID {query_id}.\n")
        continue

    results_list = []
    for hit in hits:
        id = hit['_id']
        source = hit['_source']
        score = hit['_score']
        title = source.get('title', 'N/A')
        text = source.get('text', 'N/A')
        results_list.append((id, title, text, score))

    data.append({
        "query_id": query_id,
        "query": query_text,
        "results": results_list
    })

dfs_all_synonyms = pd.DataFrame(data)

No results found for query ID 0c8a029180e8ee5a7a8c886738576b12d3f6530d.



In [117]:
k_values = [5, 10, 15, 20, 30, 50]
dfs_synonyms = {}

for k in k_values:
    data = []
    
    for query_id, query_text in queries_synonyms:
        if not query_text:
            continue
        
        response = search_document(query_text, size=k)
        hits = response['hits']['hits']
        
        if not hits:
            print(f"No results found for query ID {query_id}.\n")
            continue
        
        results_list = []
        for hit in hits:
            id = hit['_id']
            source = hit['_source']
            score = hit['_score']
            title = source.get('title', 'N/A')
            text = source.get('text', 'N/A')
            results_list.append((id, title, text, score))
        
        data.append({
            "query_id": query_id,
            "query": query_text,
            "results": results_list
        })
    
    dfs_synonyms[k] = pd.DataFrame(data)

No results found for query ID 0c8a029180e8ee5a7a8c886738576b12d3f6530d.

No results found for query ID 0c8a029180e8ee5a7a8c886738576b12d3f6530d.

No results found for query ID 0c8a029180e8ee5a7a8c886738576b12d3f6530d.

No results found for query ID 0c8a029180e8ee5a7a8c886738576b12d3f6530d.

No results found for query ID 0c8a029180e8ee5a7a8c886738576b12d3f6530d.

No results found for query ID 0c8a029180e8ee5a7a8c886738576b12d3f6530d.



* We see that there were **answers for almost all of our queries** (the ID of the query that returned no hits is printed once per run).  
* `dfs_synonyms` is a dictionary where each key is one of the k values (5, 10, 15, 20, 30, 50) and each value is a pandas DataFrame (as in Part 3 of Phase 1).  
* Let's examine the contents of this dictionary.

In [118]:
dfs_synonyms[20].shape

(998, 3)

In [119]:
dfs_synonyms[30].shape

(998, 3)

In [120]:
dfs_synonyms[50].shape

(998, 3)

In [121]:
dfs_synonyms[20]['results'].apply(len).describe()

count    998.000000
mean      19.994990
std        0.158272
min       15.000000
25%       20.000000
50%       20.000000
75%       20.000000
max       20.000000
Name: results, dtype: float64

* **count:** 998 → Of the 1000 queries, 998 have a row in the DataFrame. One query returned no hits (its ID was printed above); the other was presumably skipped by the `if not query_text` check because its expanded text was empty.

* **mean:** 19.994990 → On average, each query returned about 19.99 results.

* **std:** 0.158272 → The standard deviation is almost 0.16, indicating small deviation.

* **min:** 15; **25%, 50%, 75%, max** all equal 20 → The number of results is consistently 20 except for a few queries.

In [122]:
dfs_synonyms[30]['results'].apply(len).describe()

count    998.000000
mean      29.984970
std        0.474817
min       15.000000
25%       30.000000
50%       30.000000
75%       30.000000
max       30.000000
Name: results, dtype: float64

* **count:** 998 → As for k=20, 998 of the 1000 queries have a row in the DataFrame.

* **mean:** 29.984970 → On average, each query returned almost 30 results.

* **std:** 0.474817 → The standard deviation is about 0.47, indicating small deviation.

* **min:** 15; **25%, 50%, 75%, max** all equal 30 → The number of results is consistently 30 except for a few queries.

In [123]:
dfs_synonyms[50]['results'].apply(len).describe()

count    998.000000
mean      49.964930
std        1.107906
min       15.000000
25%       50.000000
50%       50.000000
75%       50.000000
max       50.000000
Name: results, dtype: float64

* **count:** 998 → As for k=20 and k=30, 998 of the 1000 queries have a row in the DataFrame.

* **mean:** 49.964930 → On average, each query returned almost 50 results.

* **std:** 1.107906 → The standard deviation is about 1.1, indicating small deviation, though larger than for the previous k values.

* **min:** 15; **25%, 50%, 75%, max** all equal 50 → The number of results is consistently 50 except for a few queries.

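* We observe that as the value of k increases, the variation in the number of results per query also increases.  
* **We notice that, in this case, with the query expansion, fewer queries return results (998 instead of 1000), although the minimum number of results per answered query rises from 11 to 15.**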
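
To see exactly which queries are missing from an expanded run, the ids of all queries can be compared with the ids that ended up in the results. A minimal sketch follows. `missing_query_ids` is a hypothetical helper and the ids are toy values.

```python
def missing_query_ids(all_query_ids, answered_query_ids):
    """Return the ids from all_query_ids that have no results, in order."""
    answered = set(answered_query_ids)
    return [qid for qid in all_query_ids if qid not in answered]


all_ids = ["q1", "q2", "q3", "q4"]
answered_ids = ["q1", "q3"]
missing = missing_query_ids(all_ids, answered_ids)
print(missing)
```

In the notebook this could be called as `missing_query_ids([q["_id"] for q in queries], dfs_synonyms[20]["query_id"])`.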

In [124]:
all_scores = [result[3] for row in dfs_synonyms[20]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  4.110391
1              Median  3.937426
2  Standard Deviation  1.535066


* **Average score:** approximately 4.1

* **Median score:** approximately 3.9

* **Std of score:** approximately 1.5 (there is variability in the scores)

In [125]:
all_scores = [result[3] for row in dfs_synonyms[30]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  3.898443
1              Median  3.727660
2  Standard Deviation  1.462083


* **Average score:** approximately 3.9

* **Median score:** approximately 3.7

* **Std of score:** approximately 1.5 (there is variability in the scores)

In [126]:
all_scores = [result[3] for row in dfs_synonyms[50]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  3.626761
1              Median  3.474216
2  Standard Deviation  1.372217


* **Average score:** approximately 3.6

* **Median score:** approximately 3.5

* **Std of score:** approximately 1.4 (there is variability in the scores)

* Now we will implement the queries that have been expanded with their hypernyms and hyponyms.  
* We will also keep the cases k=5, 10, 15 because they will be useful in Part 5.

In [127]:
data = []

for query_id, query_text in queries_hyper_hypo:
    if not query_text:
        continue
    
    response = search_document(query_text, size=1000)  
    hits = response['hits']['hits']
    
    if not hits:
        print(f"No results found for query ID {query_id}.\n")
        continue
    
    results_list = []
    for hit in hits:
        id = hit['_id']
        source = hit['_source']
        score = hit['_score']
        title = source.get('title', 'N/A')
        text = source.get('text', 'N/A')
        results_list.append((id, title, text, score))
    
    data.append({
        "query_id": query_id,
        "query": query_text,
        "results": results_list
    })

dfs_all_hyper_hypo = pd.DataFrame(data)

In [128]:
k_values = [5, 10, 15, 20, 30, 50]
dfs_hyper_hypo = {}

for k in k_values:
    data = []
    
    for query_id, query_text in queries_hyper_hypo:
        if not query_text:
            continue
        
        response = search_document(query_text, size=k)
        hits = response['hits']['hits']
        
        if not hits:
            print("  No results found.\n")
            continue
        
        results_list = []
        for hit in hits:
            id = hit['_id']
            source = hit['_source']
            score = hit['_score']
            title = source.get('title', 'N/A')
            text = source.get('text', 'N/A')
            results_list.append((id, title, text, score))
        
        data.append({
            "query_id": query_id,
            "query": query_text,
            "results": results_list
        })
    
    dfs_hyper_hypo[k] = pd.DataFrame(data)

In [129]:
dfs_hyper_hypo[20]['results'].apply(len).describe()

count    1000.0
mean       20.0
std         0.0
min        20.0
25%        20.0
50%        20.0
75%        20.0
max        20.0
Name: results, dtype: float64

In [130]:
dfs_hyper_hypo[30]['results'].apply(len).describe()

count    1000.0
mean       30.0
std         0.0
min        30.0
25%        30.0
50%        30.0
75%        30.0
max        30.0
Name: results, dtype: float64

In [131]:
dfs_hyper_hypo[50]['results'].apply(len).describe()

count    1000.0
mean       50.0
std         0.0
min        50.0
25%        50.0
50%        50.0
75%        50.0
max        50.0
Name: results, dtype: float64

* The results appear to be complete and show consistent behavior.  
* The results resemble the case of Phase 1, before query expansion.  
* **We notice that, in this case, with the query expansion, we do not retrieve fewer documents.**  
* Let's, however, check the scores of the answers.

In [132]:
all_scores = [result[3] for row in dfs_hyper_hypo[20]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  6.294461
1              Median  5.906011
2  Standard Deviation  2.328033


* **Average score:** approximately 6.3

* **Median score:** approximately 5.9

* **Std of score:** approximately 2.3 (there is high variability in the scores)

In [133]:
all_scores = [result[3] for row in dfs_hyper_hypo[30]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  6.002471
1              Median  5.631777
2  Standard Deviation  2.234365


* **Average score:** approximately 6.0

* **Median score:** approximately 5.6

* **Std of score:** approximately 2.2 (high variability in the scores)

In [134]:
all_scores = [result[3] for row in dfs_hyper_hypo[50]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  5.625489
1              Median  5.289726
2  Standard Deviation  2.115904


* **Average score:** approximately 5.6

* **Median score:** approximately 5.3

* **Std of score:** approximately 2.1 (high variability in the scores)

Let's compare the relevance scores across all values of k and all query categories.

| Case                       | k  | Mean      | Median    | Std         |
|-----------------------------|----|-----------|-----------|-------------|
| Simple                      | 20 | 2.795202  | 2.599902  | 0.939617    |
| Simple                      | 30 | 2.625898  | 2.439002  | 0.891223    |
| Simple                      | 50 | 2.412677  | 2.237310  | 0.831417    |
| Synonyms                    | 20 | 4.110391  | 3.937426  | 1.535066    |
| Synonyms                    | 30 | 3.898443  | 3.727660  | 1.462083    |
| Synonyms                    | 50 | 3.626761  | 3.474216  | 1.372217    |
| Hypernyms and Hyponyms      | 20 | 6.294461  | 5.906011  | 2.328033    |
| Hypernyms and Hyponyms      | 30 | 6.002471  | 5.631777  | 2.234365    |
| Hypernyms and Hyponyms      | 50 | 5.625489  | 5.289726  | 2.115904    |

**Conclusions:**
* In the simple case, as k increases, the mean and median relevance scores gradually decrease, which is expected: the more answers we collect, the more likely it is that less relevant ones are included.  
* The same trend appears in the synonyms case, although the values are considerably higher, suggesting that **synonyms lead to denser results of high relevance**. The standard deviation is also higher than in the simple case.  
* With hypernyms and hyponyms, **the relevance values are the highest but also the most unstable (highest standard deviation)**, indicating high heterogeneity in the answers; that is, results are not always highly relevant.  

The synonym approach appears to offer the **best balance between high scores and a reasonable standard deviation**, unlike the hypernyms and hyponyms approach. Note, however, that higher scores may reflect artificially inflated relevance: an expanded query contains more terms that can match a document, and marginally synonymous words introduce a risk of semantic ambiguity. From this perspective, the simple case is safer.
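
The statistics cells above differ only in the value of k, so the same numbers can be produced by a single helper. The following is a minimal sketch rather than the notebook's original code: `summarize_scores` is a hypothetical name, and it assumes that each entry of the `results` column is a list of `(id, title, text, score)` tuples, as built above. It uses the population standard deviation, which is what `np.std` computes by default.

```python
import statistics

def summarize_scores(dfs_by_k, label):
    """Return one summary row per k for a dict such as dfs_hyper_hypo.

    Each value only needs a "results" column (or key) holding, per query,
    a list of (id, title, text, score) tuples.
    """
    rows = []
    for k in sorted(dfs_by_k):
        # Flatten the scores of all results of all queries for this k
        scores = [result[3] for results in dfs_by_k[k]["results"] for result in results]
        rows.append({
            "Case": label,
            "k": k,
            "Mean": statistics.fmean(scores),
            "Median": statistics.median(scores),
            # pstdev is the population standard deviation, like np.std(scores)
            "Std": statistics.pstdev(scores),
        })
    return rows

# Toy stand-in for the real DataFrames (hypothetical scores)
toy = {2: {"results": [[("d1", "t", "x", 3.0), ("d2", "t", "x", 1.0)]]}}
summary = summarize_scores(toy, "Toy")
print(summary)
```

With the real data, something like `pd.DataFrame(summarize_scores(dfs_hyper_hypo, "Hypernyms and Hyponyms"))` should give the rows for one query category.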

## 4. Evaluation of Results  
At this point, we will evaluate our answers by comparing them with the correct answers using the evaluation tool `trec_eval`. `trec_eval` is a standard tool used by the `TREC (Text REtrieval Conference)` community for retrieval evaluation.

We will examine the evaluation metrics:  
* **MAP (mean average precision)** and  
* **avgPre@k (average precision at the top k retrieved documents)** for k = 5, 10, 15, 20, for both cases of our query expansions.
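
To make the two metrics concrete, here is a small sketch of how average precision and its cut-off variant can be computed by hand. It is an illustration under stated assumptions, not a reimplementation of `trec_eval`: the function names and the toy document IDs are hypothetical, and the sum is divided by the total number of relevant documents of the query, which is our understanding of how `map` and `map_cut` are normalised.

```python
def average_precision(ranked_ids, relevant_ids, k=None):
    """Average precision of one ranked list, optionally cut at rank k.

    Assumption: the sum is divided by the total number of relevant
    documents for the query, not only by those that were retrieved.
    """
    if not relevant_ids:
        return 0.0
    if k is not None:
        ranked_ids = ranked_ids[:k]
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant_ids)

def mean_average_precision(run, qrels, k=None):
    """Mean of the per-query average precision over all queries in qrels."""
    values = [average_precision(run.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(values) / len(values) if values else 0.0

# Hypothetical toy data: two queries with made-up document IDs
qrels = {"q1": {"d1", "d3"}, "q2": {"d9"}}
run = {"q1": ["d1", "d2", "d3"], "q2": ["d7", "d8"]}
ap_q1 = average_precision(run["q1"], qrels["q1"])      # (1/1 + 2/3) / 2
map_all = mean_average_precision(run, qrels)            # mean of q1 and q2
map_at_2 = mean_average_precision(run, qrels, k=2)      # only the top 2 per query
print(ap_q1, map_all, map_at_2)
```

In the toy run, query `q2` retrieves no relevant document, so it contributes 0 and pulls the mean down, which is the same effect that keeps the MAP values low when many queries miss their relevant documents.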

We will modify our data so that it is stored in a format suitable for `trec_eval`, as in Phase 1 of the assignment.

In [135]:
trec_dir = r"C://Users//user//Downloads//trec_eval"

filename = "results_synonyms.txt"
full_path_trec = f"{trec_dir}\\{filename}"
full_path_local = filename

with open(full_path_trec, "w") as f_trec, open(full_path_local, "w") as f_local:
    for _, row in dfs_all_synonyms.iterrows():
        query_id = row["query_id"]
        results = row["results"]

        for rank, (docno, title, text, score) in enumerate(results, start=1):
            line = f"{query_id} 0 {docno} {rank} {score:.4f} results_synonyms\n"
            f_trec.write(line)
            f_local.write(line)

print(f"Created Synonyms Results")

for k in dfs_synonyms:
    if(k<30):
        df = dfs_synonyms[k]
        filename = f"results_synonyms_{k}.txt"
        full_path_trec = f"{trec_dir}\\{filename}"
        full_path_local = filename

        with open(full_path_trec, "w") as f_trec, open(full_path_local, "w") as f_local:
            for _, row in df.iterrows():
                query_id = row["query_id"]
                results = row["results"]

                for rank, (docno, title, text, score) in enumerate(results, start=1):
                    line = f"{query_id} 0 {docno} {rank} {score:.4f} results_synonyms_{k}\n"
                    f_trec.write(line)
                    f_local.write(line)

        print(f"Created:\n- {full_path_trec}\n- {full_path_local}")

filename = "results_hyper_hypo.txt"
full_path_trec = f"{trec_dir}\\{filename}"
full_path_local = filename

with open(full_path_trec, "w") as f_trec, open(full_path_local, "w") as f_local:
    for _, row in dfs_all_hyper_hypo.iterrows():
        query_id = row["query_id"]
        results = row["results"]

        for rank, (docno, title, text, score) in enumerate(results, start=1):
            line = f"{query_id} 0 {docno} {rank} {score:.4f} results_hyper_hypo\n"
            f_trec.write(line)
            f_local.write(line)

print(f"Created Hypernyms/Hyponyms Results")

for k in dfs_hyper_hypo:
    if(k<30):
        df = dfs_hyper_hypo[k]
        filename = f"results_hyper_hypo_{k}.txt"
        full_path_trec = f"{trec_dir}\\{filename}"
        full_path_local = filename

        with open(full_path_trec, "w") as f_trec, open(full_path_local, "w") as f_local:
            for _, row in df.iterrows():
                query_id = row["query_id"]
                results = row["results"]

                for rank, (docno, title, text, score) in enumerate(results, start=1):
                    line = f"{query_id} 0 {docno} {rank} {score:.4f} results_hyper_hypo_{k}\n"
                    f_trec.write(line)
                    f_local.write(line)

        print(f"Created:\n- {full_path_trec}\n- {full_path_local}")

Created Synonyms Results
Created:
- C://Users//user//Downloads//trec_eval\results_synonyms_5.txt
- results_synonyms_5.txt
Created:
- C://Users//user//Downloads//trec_eval\results_synonyms_10.txt
- results_synonyms_10.txt
Created:
- C://Users//user//Downloads//trec_eval\results_synonyms_15.txt
- results_synonyms_15.txt
Created:
- C://Users//user//Downloads//trec_eval\results_synonyms_20.txt
- results_synonyms_20.txt
Created Hypernyms/Hyponyms Results
Created:
- C://Users//user//Downloads//trec_eval\results_hyper_hypo_5.txt
- results_hyper_hypo_5.txt
Created:
- C://Users//user//Downloads//trec_eval\results_hyper_hypo_10.txt
- results_hyper_hypo_10.txt
Created:
- C://Users//user//Downloads//trec_eval\results_hyper_hypo_15.txt
- results_hyper_hypo_15.txt
Created:
- C://Users//user//Downloads//trec_eval\results_hyper_hypo_20.txt
- results_hyper_hypo_20.txt
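
The four writing loops above follow the same pattern, so they could be folded into one helper. The sketch below is only a suggestion: `write_trec_run` and the demo file name are hypothetical, and the function assumes the same row layout as our DataFrames (`query_id` plus a ranked list of `(docno, title, text, score)` tuples).

```python
import os
import tempfile

def write_trec_run(rows, path, run_tag):
    """Write results in the six-column run format used above.

    rows: iterable of dicts with "query_id" and "results", where results is
    a list of (docno, title, text, score) tuples in ranked order.
    """
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            for rank, (docno, _title, _text, score) in enumerate(row["results"], start=1):
                f.write(f"{row['query_id']} 0 {docno} {rank} {score:.4f} {run_tag}\n")
    return path

# Hypothetical toy row written to a temporary location
demo_rows = [{"query_id": "q1", "results": [("d1", "T", "x", 2.5), ("d2", "T", "x", 1.25)]}]
demo_path = write_trec_run(demo_rows, os.path.join(tempfile.gettempdir(), "demo_run.txt"), "demo")
with open(demo_path, encoding="utf-8") as f:
    demo_lines = f.read().splitlines()
print(demo_lines)
```

With a DataFrame, the rows can be passed as `df.to_dict("records")`, and `os.path.join(trec_dir, filename)` avoids mixing `//` and `\\` separators in the target path.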


* Now that we have created the txt files, we will run the corresponding commands in the Windows command prompt (cmd).  
* We will first examine the case of queries expanded with synonyms.  
* Below is a copy of the commands I ran in cmd, together with their output:


```bash
C:\Users\user\Downloads\trec_eval>trec_eval -m map qrels.qrel results_synonyms.txt
      1 [main] trec_eval 8136 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map                     all     0.0454

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.5 qrels.qrel results_synonyms_5.txt
      1 [main] trec_eval 1680 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_5               all     0.0304

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.10 qrels.qrel results_synonyms_10.txt
      1 [main] trec_eval 11916 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_10              all     0.0352

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.15 qrels.qrel results_synonyms_15.txt
      1 [main] trec_eval 9584 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_15              all     0.0371

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.20 qrels.qrel results_synonyms_20.txt
      1 [main] trec_eval 10804 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_20              all     0.0384
```

In summary, our results:

| Evaluation Metric       | Results   | Value  |
|-------------------------|-----------|--------|
| MAP (Mean Avg Precision) | All       | 0.0454 |
| avgPre@5 (map_cut.5)     | k = 5    | 0.0304 |
| avgPre@10 (map_cut.10)   | k = 10   | 0.0352 |
| avgPre@15 (map_cut.15)   | k = 15   | 0.0371 |
| avgPre@20 (map_cut.20)   | k = 20   | 0.0384 |
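
Copying values from the terminal into a table by hand is error-prone, so the same commands could also be run from the notebook. This is a hedged sketch: `run_trec_eval` and `parse_trec_eval` are hypothetical helpers, the `-m <measure> <qrels> <run>` argument order is the one used in the commands above, and the executable is assumed to be reachable as `trec_eval`.

```python
import subprocess

def parse_trec_eval(output):
    """Parse 'measure  query  value' lines, skipping anything else (e.g. warnings)."""
    metrics = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3:
            measure, _query, value = parts
            try:
                metrics[measure] = float(value)
            except ValueError:
                continue  # not a metric line
    return metrics

def run_trec_eval(measure, qrels_path, run_path, executable="trec_eval"):
    """Call trec_eval with the same arguments as the commands above."""
    completed = subprocess.run(
        [executable, "-m", measure, qrels_path, run_path],
        capture_output=True, text=True, check=True,
    )
    return parse_trec_eval(completed.stdout)

# Parsing a line copied from the output above
sample = "map_cut_5               all     0.0304\n"
parsed = parse_trec_eval(sample)
print(parsed)
```

A call such as `run_trec_eval("map_cut.5", "qrels.qrel", "results_synonyms_5.txt")` would then return the value directly as a dictionary entry.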


* Now we will examine the case of queries expanded with hypernyms and hyponyms.  
* Below is a copy of the commands I ran in cmd, together with their output:


```bash
C:\Users\user\Downloads\trec_eval>trec_eval -m map qrels.qrel results_hyper_hypo.txt
      1 [main] trec_eval 6588 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map                     all     0.0433

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.5 qrels.qrel results_hyper_hypo_5.txt
      1 [main] trec_eval 2084 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_5               all     0.0278

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.10 qrels.qrel results_hyper_hypo_10.txt
      1 [main] trec_eval 14060 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_10              all     0.0320

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.15 qrels.qrel results_hyper_hypo_15.txt
      1 [main] trec_eval 1576 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_15              all     0.0347

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.20 qrels.qrel results_hyper_hypo_20.txt
      1 [main] trec_eval 12700 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_20              all     0.0361
```

In summary, our results:

| Evaluation Metric       | Results   | Value  |
|-------------------------|-----------|--------|
| MAP (Mean Avg Precision) | All       | 0.0433 |
| avgPre@5 (map_cut.5)     | k = 5    | 0.0278 |
| avgPre@10 (map_cut.10)   | k = 10   | 0.0320 |
| avgPre@15 (map_cut.15)   | k = 15   | 0.0347 |
| avgPre@20 (map_cut.20)   | k = 20   | 0.0361 |


## 5. Analysis of Evaluation Results

A table summarizing all our data:

| Evaluation Metric       | Results   | Original Answers | Expanded with Synonyms | Expanded with Hypernyms/Hyponyms |
|-------------------------|-----------|-----------------|-----------------------|----------------------------------|
| MAP (Mean Avg Precision) | All       | 0.0949          | 0.0454                | 0.0433                           |
| avgPre@5 (map_cut.5)     | k = 5    | 0.0662          | 0.0304                | 0.0278                           |
| avgPre@10 (map_cut.10)   | k = 10   | 0.0770          | 0.0352                | 0.0320                           |
| avgPre@15 (map_cut.15)   | k = 15   | 0.0822          | 0.0371                | 0.0347                           |
| avgPre@20 (map_cut.20)   | k = 20   | 0.0847          | 0.0384                | 0.0361                           |


* The original (unexpanded) queries performed best across all evaluation metrics.  
* Expanding the queries with synonyms roughly halved performance (MAP fell from 0.0949 to 0.0454), most likely because the added terms pull in less relevant documents.  
* Expanding with hypernyms and hyponyms reduced precision even further, indicating that this approach introduces more noise into the search.  
* Overall, query expansion does not guarantee improvement and requires a more selective approach: the added words are not always relevant or specific to the original query, so noise and ambiguity reduce precision.
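
The size of the drop can be quantified as a relative change with respect to the original queries. A minimal sketch, using the MAP values of the original and synonym-expanded queries from the table above (`relative_change` is a hypothetical helper name):

```python
def relative_change(baseline, value):
    """Relative change of value with respect to baseline (negative = worse)."""
    return (value - baseline) / baseline

# MAP values taken from the table above: original vs. synonym expansion
map_original = 0.0949
map_synonyms = 0.0454
change = relative_change(map_original, map_synonyms)
print(f"MAP change with synonyms: {change:.1%}")  # prints -52.2%
```

The same function can be applied to every row and column of the table to compare the expansions at each cutoff.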

# Information Retrieval Systems - Phase 3  

* In Phase 3, we will expand the queries of the `IR2025` collection with synonymous terms extracted from `word2vec`.  
* Word2vec is an algorithm based on feedforward neural networks for learning vector representations of words that can be used to find words with similar meaning or words that appear in similar contexts.  
* The model estimates the probability of selecting a word (output) based on its context (input).  
* It extracts the nearest neighbors of a word by examining its contexts and determines when two words are semantically related (when they appear in the same or similar context).  
* From this perspective, we can use the model to discover synonymous words.
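
In practice, "nearest neighbors" means the words whose vectors have the highest cosine similarity to the vector of the given word. The sketch below illustrates the idea on made-up data: the three-dimensional vectors and the helper names are hypothetical and chosen only so that the result is easy to follow; a trained model uses vectors with many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scored = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_vectors = {
    "method":    [0.9, 0.1, 0.0],
    "technique": [0.8, 0.2, 0.1],
    "valve":     [0.0, 0.1, 0.9],
}
neighbours = nearest_neighbours("method", toy_vectors, topn=2)
print(neighbours)
```

Here `technique` ranks first because its vector points in almost the same direction as that of `method`, while `valve` is nearly orthogonal to it.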


## 1. Training a word2vec Model  
At this point, we will train a word2vec model using the `IR2025` collection as input and the `gensim` library, which provides implemented neural network models, such as word2vec, for Python.  

For the model we will train, we need to decide the values of the following parameters:

* `sentences` = The texts on which the model will be trained (a list of lists of words).  
* `vector_size` = Number of dimensions of the embedding vector for each word (e.g., 100, 300).  
* `window` = How many words before/after the current word will be considered as context.  
* `min_count` = Words with frequency lower than this number are ignored (default: 5).  
* `workers` = Number of threads to use for faster training (depending on CPU).  
* `sg` = Choice of algorithm: 1 for Skip-gram, 0 for CBOW.  

`Skip-gram`  
* Predicts surrounding words based on the target word.  
* More accurate with rare words.  
* Requires more time and resources.  
* Suitable for larger datasets and detailed representations.  

`CBOW (Continuous Bag of Words)`  
* Predicts the target word based on surrounding words (context).  
* Faster to train.  
* Works better with frequent words.  
* Suitable for small datasets.  
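
The difference between the two algorithms is easiest to see in the training examples they derive from a sentence. The following simplified sketch shows only that step; the function names are hypothetical, and the sampling and optimisation details of real implementations are left out.

```python
def cbow_examples(tokens, window):
    """(context words -> target word) examples, as used by CBOW."""
    examples = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of position i
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(tokens, window):
    """(target word -> one context word) pairs, as used by Skip-gram."""
    return [(target, ctx)
            for context, target in cbow_examples(tokens, window)
            for ctx in context]

tokens = ["economic", "dispatch", "problem"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
print(cbow)
print(skipgram)
```

CBOW produces one example per word (the whole context predicts the target), whereas Skip-gram produces one example per (target, context word) pair, which is one reason it needs more training time.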

For starters, let's check the CPU configuration to select the number of workers.

In [136]:
import multiprocessing
print(multiprocessing.cpu_count())

4


In our implementation:  

* `sentences` = List of tokenized sentences from the IR2025 texts.  
* `vector_size` = 50–100, to avoid overloading the model.  
* `window` = 5, since our texts are long and context plays an important role.  
* `min_count` = 1, because we will apply the model on a limited number of texts (1000), so we do not gain much by ignoring many words.  
* `workers` = 4, based on the above `multiprocessing.cpu_count`.  
* `sg` = 0, i.e., CBOW, which we prefer because it trains quickly and stably on a small dataset such as our collection.  

Additionally, since we want to train the model on the texts of the `IR2025` collection, we will use the `updated_df` dataframe, specifically its `text` column. Since we already have the texts in a dataframe, we do not need to read them directly from the jsonl file for processing.

In [137]:
updated_df.head(3)

| Unnamed: 0 | _id | title | text | authors | year | cited_by | references |
|------------|-----|-------|------|---------|------|----------|------------|
| 0 | 632589828c8b9fca2c3a59e97451fde8fa7d188d | A hybrid of genetic algorithm and particle swa... | An evolutionary recurrent network which automa... | 1 | 2004 | 432 | 10 |
| 1 | 86e87db2dab958f1bd5877dc7d5b8105d6e31e46 | A Hybrid EP and SQP for Dynamic Economic Dispa... | Dynamic economic dispatch (DED) is one of the ... | 4 | 2002 | 169 | 0 |
| 2 | 2a047d8c4c2a4825e0f0305294e7da14f8de6fd3 | Genetic Fuzzy Systems - Evolutionary Tuning an... | It's not surprisingly when entering this site ... | 4 | 2001 | 521 | 0 |


In [138]:
!pip install nltk
!pip install gensim



In [139]:
import json
import gensim 
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

* Below is the creation and invocation of a function that will be used to tokenize the sentences in the `text` column of `updated_df`, which will be used for training the `word2vec` model.

In [140]:
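def sentence_generator(dataframe, column_name):
    """Tokenize every text in `column_name`; each text becomes one token list."""
    sentences = []
    for text in dataframe[column_name]:
        # Skip missing or non-string values (e.g. NaN) instead of failing
        if not isinstance(text, str):
            continue
        tokens = simple_preprocess(text, deacc=True)
        if tokens:
            sentences.append(tokens)
    return sentences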

In [141]:
sentences = sentence_generator(updated_df, 'text')

* Now we will train the model using the list of tokenized texts we created.

In [142]:
def train_word2vec(sentences):
    model = Word2Vec(
        sentences=sentences,   # list of tokenized texts from the collection
        vector_size=100,       # size of the embedding vectors
        window=5,              # size of the context window
        min_count=1,           # keep every word (nothing has frequency < 1)
        workers=4,             # number of worker threads
        sg=0                   # 0 for CBOW (1 would select Skip-gram)
    )
    return model

In [143]:
model = train_word2vec(sentences)
model.save("my_word2vec.model")
print("Model saved as 'my_word2vec.model'.")

Model saved as 'my_word2vec.model'.


In [144]:
for word in ["learning", "economic"]:
    print(f"\nExample: words similar to '{word}':")
    if word in model.wv:
        similar_words = model.wv.most_similar(word, topn=5)
        print(f"Top 5 words similar to '{word}':")
        for w, score in similar_words:
            print(f"{w}: {score:.4f}")
    else:
        print(f"Word '{word}' not found in the model's vocabulary.")


Example: words similar to 'learning':
Top 5 words similar to 'learning':
learningmodels: 0.6712
translation: 0.6494
adaptation: 0.6166
learnings: 0.6110
training: 0.5544

Example: words similar to 'economic':
Top 5 words similar to 'economic':
sustainability: 0.8057
capital: 0.7998
socio: 0.7995
financial: 0.7805
corporate: 0.7730


* Our model appears to be ready.

## 2. Expanding IR2025 Queries with Synonymous Terms from the Model

* We first look at the structure of a single query and then define a function that creates the corresponding list of tokens for each query.

In [145]:
queries[0]

{'_id': '78495383450e02c5fe817e408726134b3084905d',
 'text': 'A Direct Search Method to solve Economic Dispatch Problem with Valve-Point Effect',
 'metadata': {'authors': ['50306438', '15303316', '1976596'],
  'year': 2014,
  'cited_by': ['38e78343cfd5c013decf49e8cf008ddf6458200f'],
  'references': ['632589828c8b9fca2c3a59e97451fde8fa7d188d',
   '4cf296b9d4ef79b838dc565e6e84ab9b089613de',
   '86e87db2dab958f1bd5877dc7d5b8105d6e31e46',
   '4b031fa8bf63e17e2100cf31ba6e11d8f80ff2a8',
   'a718c6ca7a1db49bb2328d43f775783e8ec6f985',
   'cf51cfb5b221500b882efee60b794bc11635267e',
   '6329874126a4e753f98c40eaa74b666d0f14eaba',
   'a27b6025d147febb54761345eafdd73954467aca']}}

In [146]:
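def query_tokenization(query_list):
    """Tokenize the text of each query, keeping exactly one entry per query."""
    queries_tokenized = []
    for query in query_list:
        tokens = simple_preprocess(query['text'], deacc=True)
        # Always append (even an empty list) so that positions stay aligned
        # with `query_list`; later cells pair queries and expansions by index.
        queries_tokenized.append(tokens)
    return queries_tokenized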

In [147]:
queries_tokenized = query_tokenization(queries)

In [148]:
print(len(queries_tokenized))

1000


In [149]:
queries_tokenized[0]

['direct',
 'search',
 'method',
 'to',
 'solve',
 'economic',
 'dispatch',
 'problem',
 'with',
 'valve',
 'point',
 'effect']

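* We will expand the queries by adding, for each word, the 4 most similar words suggested by our model.  
* We limit ourselves to 4 to stay consistent with the previous phase of the assignment, where we also kept 4 synonyms.  
* We will not handle stopwords at this point because our index already has a built-in analyzer. Note, however, that the analyzer can only drop the stopwords themselves: the words the model returns as neighbours of a stopword such as `with` are ordinary terms and remain in the expanded query.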

In [150]:
queries_extended = []
for query in queries_tokenized:
    new_query = list(query)
    for token in query: 
        if token in model.wv:
            similar_words = model.wv.most_similar(token, topn=4)
            for similar_word in similar_words:
                new_query.append(similar_word[0])
    new_query = list( dict.fromkeys(new_query) ) # to remove duplicates
    query_str = ' '.join(new_query)
    queries_extended.append(query_str)

In [151]:
print(len(queries_extended))

1000


In [152]:
print(f"Query 1 before adding synonyms:\n{queries[0]['text']}\n")
print(f"Query 1 after adding synonyms:\n{queries_extended[0]}\n")

Query 1 before adding synonyms:
A Direct Search Method to solve Economic Dispatch Problem with Valve-Point Effect

Query 1 after adding synonyms:
direct search method to solve economic dispatch problem with valve point effect indirect feedback hookup haptic searching query searches recommendation technique approach algorithm scheme able quickly ability effectively tackle resolve address overcome sustainability capital socio financial auction eventual syllogism intertemporal problems issue challenge task sanctuary parasubiculum commonplaces roadsides electrostatic solenoid hbt mcu points clouds starting view effects impact dependence influence



* It appears that we have successfully expanded our queries with synonyms generated by the model we trained.
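
The expanded example also contains terms that are clearly unrelated to the query (for instance the words added for `with`). One possible refinement, shown here only as a sketch, is to skip stopwords and to keep a neighbour only if its similarity exceeds a threshold. The function name, the stopword set, the threshold of 0.5 and the toy similarity values are all assumptions made for illustration; they are not part of the experiment above.

```python
def expand_query(tokens, neighbours_fn, stopwords=frozenset(), topn=4, min_similarity=0.0):
    """Append up to `topn` neighbours per non-stopword token, without duplicates."""
    expanded = list(tokens)
    for token in tokens:
        if token in stopwords:
            continue  # do not expand stopwords such as "to" or "with"
        for neighbour, similarity in neighbours_fn(token)[:topn]:
            if similarity >= min_similarity:
                expanded.append(neighbour)
    # dict.fromkeys keeps the first occurrence of each term, in order
    return " ".join(dict.fromkeys(expanded))

# Hypothetical neighbour table standing in for the trained model
toy_neighbours = {
    "solve": [("tackle", 0.81), ("resolve", 0.78)],
    "with": [("sanctuary", 0.35)],
    "problem": [("issue", 0.74), ("task", 0.41)],
}

def lookup(token):
    """Return the (word, similarity) neighbours of a token, or an empty list."""
    return toy_neighbours.get(token, [])

expanded_query = expand_query(["solve", "problem", "with"], lookup,
                              stopwords={"with"}, min_similarity=0.5)
print(expanded_query)
```

With the trained model, `neighbours_fn` could be something like `lambda t: model.wv.most_similar(t, topn=4) if t in model.wv else []`, which returns `(word, similarity)` pairs in the same shape as the toy table.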

## 3. Executing Queries  
In this step, we will run the queries, now expanded with word2vec synonyms, on the index and collect the search engine's responses. We will use the queries that are available in the `scidocs` folder, the same folder whose data we used for building the indexes. We will keep the top k retrieved documents, for `k = 20, 30, 50`.  
* We will use the `search_document` function from Phase 1, which performs a single query.

In [153]:
print(len(queries))
print(len(queries_extended))

1000
1000


In [154]:
data = []

for query, query_text in zip(queries, queries_extended):
    query_id = query['_id']
    if not query_text:
        continue

    # Retrieve up to 1000 documents per query for the MAP evaluation
    response = search_document(query_text, size=1000)
    hits = response['hits']['hits']

    if not hits:
        print(f"No results found for query ID {query_id}.\n")
        continue

    results_list = []
    for hit in hits:
        doc_id = hit['_id']  # named doc_id to avoid shadowing the built-in id()
        source = hit['_source']
        score = hit['_score']
        title = source.get('title', 'N/A')
        text = source.get('text', 'N/A')
        results_list.append((doc_id, title, text, score))

    data.append({
        "query_id": query_id,
        "query": query_text,
        "results": results_list
    })
dfs_all_word2vec = pd.DataFrame(data)

In [155]:
k_values = [5, 10, 15, 20, 30, 50]
dfs_synonyms = {}

for k in k_values:
    data = []

    for query, query_text in zip(queries, queries_extended):
        query_id = query['_id']
        if not query_text:
            continue

        # Retrieve only the top k documents for this query
        response = search_document(query_text, size=k)
        hits = response['hits']['hits']

        if not hits:
            print(f"No results found for query ID {query_id}.\n")
            continue

        results_list = []
        for hit in hits:
            doc_id = hit['_id']  # named doc_id to avoid shadowing the built-in id()
            source = hit['_source']
            score = hit['_score']
            title = source.get('title', 'N/A')
            text = source.get('text', 'N/A')
            results_list.append((doc_id, title, text, score))

        data.append({
            "query_id": query_id,
            "query": query_text,
            "results": results_list
        })

    dfs_synonyms[k] = pd.DataFrame(data)

* We see that there were **responses to all our queries**.  
* `dfs_synonyms` is a dictionary where each key is one of the k values (5, 10, 15, 20, 30, 50) and each value is a pandas DataFrame (as in Phase 1 and 2, Part 3).  
* Let's examine the content of this dictionary.

In [156]:
dfs_synonyms[20].shape

(1000, 3)

In [157]:
dfs_synonyms[30].shape

(1000, 3)

In [158]:
dfs_synonyms[50].shape

(1000, 3)

In [159]:
dfs_synonyms[20]['results'].apply(len).describe()

count    1000.0
mean       20.0
std         0.0
min        20.0
25%        20.0
50%        20.0
75%        20.0
max        20.0
Name: results, dtype: float64

In [160]:
dfs_synonyms[30]['results'].apply(len).describe()

count    1000.0
mean       30.0
std         0.0
min        30.0
25%        30.0
50%        30.0
75%        30.0
max        30.0
Name: results, dtype: float64

In [161]:
dfs_synonyms[50]['results'].apply(len).describe()

count    1000.0
mean       50.0
std         0.0
min        50.0
25%        50.0
50%        50.0
75%        50.0
max        50.0
Name: results, dtype: float64

In general, for each k = 20, 30, 50 we observe:  

* **count**: 1000 → There are 1000 queries in the DataFrame, so 1000 searches were performed.  

* **mean**: k → Each query returned exactly k results on average.  

* **std**: 0 → The standard deviation is 0, meaning all queries returned exactly k results each time with no deviation.  

* **min**, **25%**, **50%**, **75%**, **max**: k → The min, max, and percentile values are all k, so there is no query with fewer or more results.
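
Since every query returns a full ranked list, the per-k tables could also be derived from the single large search per query instead of querying the index again for each k. The sketch below assumes that `search_document` returns hits sorted by descending score, so that the first k entries of the long list are the top-k results; `top_k_from_full` is a hypothetical helper and the toy rows are invented.

```python
def top_k_from_full(rows, k_values):
    """Build {k: rows} by truncating each query's full ranked result list."""
    return {
        k: [{**row, "results": row["results"][:k]} for row in rows]
        for k in k_values
    }

# Hypothetical full result list for one query, already sorted by score
full_rows = [{
    "query_id": "q1",
    "query": "toy query",
    "results": [("d1", "T", "x", 3.0), ("d2", "T", "x", 2.0), ("d3", "T", "x", 1.0)],
}]
by_k = top_k_from_full(full_rows, k_values=[1, 2])
print(by_k[2][0]["results"])
```

With our data, the input rows would be `dfs_all_word2vec.to_dict("records")`, and each list in the result can be turned back into a DataFrame with `pd.DataFrame(...)`.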

In [162]:
import numpy as np
all_scores = [result[3] for row in dfs_synonyms[20]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  5.709842
1              Median  5.501593
2  Standard Deviation  1.693263


* **Average score:** approximately 5.71  
* **Median score:** approximately 5.50  
* **Std of score:** approximately 1.69 (indicating some variability in the scores)

In [163]:
import numpy as np
all_scores = [result[3] for row in dfs_synonyms[30]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  5.457026
1              Median  5.256280
2  Standard Deviation  1.639950


* **Average score:** approximately 5.46  
* **Median score:** approximately 5.26  
* **Std of score:** approximately 1.64 (again, indicating variability in the scores)

In [164]:
import numpy as np
all_scores = [result[3] for row in dfs_synonyms[50]['results'] for result in row]

mean_score = np.mean(all_scores)
median_score = np.median(all_scores)
std_score = np.std(all_scores)

stats_df = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Standard Deviation"],
    "Value": [mean_score, median_score, std_score]
})

print(stats_df)

            Statistic     Value
0                Mean  5.128986
1              Median  4.935660
2  Standard Deviation  1.573381


* **Average score:** approximately 5.13  
* **Median score:** approximately 4.94  
* **Std of score:** approximately 1.57 (indicating variability in the scores)

Let's examine the relevance score performance across all k and for all query categories:

| Case                        | k  | Mean      | Median    | Std         |
|------------------------------|----|-----------|-----------|-------------|
| Simple                       | 20 | 2.795202  | 2.599902  | 0.939617    |
| Simple                       | 30 | 2.625898  | 2.439002  | 0.891223    |
| Simple                       | 50 | 2.412677  | 2.237310  | 0.831417    |
| Synonyms WordNet             | 20 | 4.110391  | 3.937426  | 1.535066    |
| Synonyms WordNet             | 30 | 3.898443  | 3.727660  | 1.462083    |
| Synonyms WordNet             | 50 | 3.626761  | 3.474216  | 1.372217    |
| Hypernyms & Hyponyms WordNet | 20 | 6.294461  | 5.906011  | 2.328033    |
| Hypernyms & Hyponyms WordNet | 30 | 6.002471  | 5.631777  | 2.234365    |
| Hypernyms & Hyponyms WordNet | 50 | 5.625489  | 5.289726  | 2.115904    |
| Synonyms word2vec            | 20 | 5.709842  | 5.501593  | 1.693263    |
| Synonyms word2vec            | 30 | 5.457026  | 5.256280  | 1.639950    |
| Synonyms word2vec            | 50 | 5.128986  | 4.935660  | 1.573381    |

The **Hypernyms & Hyponyms WordNet** approach has:  
* The highest mean score for all k.  
* The highest median score for all k.  
* The highest variability (Std); in terms of raw score alone, however, it still ranks first.  

Comparing the **synonym approaches**, the word2vec-based synonyms we implemented in this phase have **consistently higher scores** than the WordNet synonyms from Phase 2 for all k. Their standard deviation is slightly higher in absolute terms (about 1.7 versus 1.5 at k = 20) but lower relative to the mean, indicating **comparatively stable performance**. Relevance scores alone do not measure retrieval quality, which we evaluate in the next section.  

**Thus, the ranking by relevance score is:**  
1. **Hypernyms & Hyponyms WordNet**  
2. **Synonyms word2vec**  
3. **Synonyms WordNet**  
4. **Simple method**

## 4. Evaluation of Results  
At this point, we will once again evaluate our responses by comparing them to the correct answers using the `trec_eval` evaluation tool.  

We will check the evaluation metrics:  
* **MAP (mean average precision)**  
* **avgPre@k (average precision at the top k retrieved documents)** for k = 5, 10, 15, 20, for the word2vec-based query expansion.  

We will modify our data so that it is stored in a format suitable for `trec_eval`, as in Phase 1 of the assignment.

In [166]:
trec_dir = r"C://Users//user//Downloads//trec_eval"
filename = "results_word2vec.txt"
full_path_trec = f"{trec_dir}\\{filename}"
full_path_local = filename

with open(full_path_trec, "w") as f_trec, open(full_path_local, "w") as f_local:
    for _, row in dfs_all_word2vec.iterrows():
        query_id = row["query_id"]
        results = row["results"]

        for rank, (docno, title, text, score) in enumerate(results, start=1):
            line = f"{query_id} 0 {docno} {rank} {score:.4f} results_word2vec\n"
            f_trec.write(line)
            f_local.write(line)

print(f"Created:\n- {full_path_trec}\n- {full_path_local}")

Created:
- C://Users//user//Downloads//trec_eval\results_word2vec.txt
- results_word2vec.txt


In [167]:
trec_dir = r"C://Users//user//Downloads//trec_eval"

for k in dfs_synonyms:
    if(k<30):
        df = dfs_synonyms[k]
        filename = f"results_word2vec_{k}.txt"
        full_path_trec = f"{trec_dir}\\{filename}"
        full_path_local = filename

        with open(full_path_trec, "w") as f_trec, open(full_path_local, "w") as f_local:
            for _, row in df.iterrows():
                query_id = row["query_id"]
                results = row["results"]

                for rank, (docno, title, text, score) in enumerate(results, start=1):
                    line = f"{query_id} 0 {docno} {rank} {score:.4f} results_word2vec_{k}\n"
                    f_trec.write(line)
                    f_local.write(line)

        print(f"Created:\n- {full_path_trec}\n- {full_path_local}")

Created:
- C://Users//user//Downloads//trec_eval\results_word2vec_5.txt
- results_word2vec_5.txt
Created:
- C://Users//user//Downloads//trec_eval\results_word2vec_10.txt
- results_word2vec_10.txt
Created:
- C://Users//user//Downloads//trec_eval\results_word2vec_15.txt
- results_word2vec_15.txt
Created:
- C://Users//user//Downloads//trec_eval\results_word2vec_20.txt
- results_word2vec_20.txt
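
The two cells above repeat the same writing logic. As a sketch only, it could be factored into a small helper. The names `trec_run_lines` and `write_trec_run` are hypothetical; the line layout (`query_id 0 docno rank score run_tag`) and the `(docno, title, text, score)` tuples are taken from the cells above.

```python
def trec_run_lines(query_results, run_tag):
    """Yield one run-file line per retrieved document.

    query_results: iterable of (query_id, results) pairs, where results is a
    ranked list of (docno, title, text, score) tuples.
    """
    for query_id, results in query_results:
        for rank, (docno, _title, _text, score) in enumerate(results, start=1):
            yield f"{query_id} 0 {docno} {rank} {score:.4f} {run_tag}\n"


def write_trec_run(query_results, path, run_tag):
    """Write all run-file lines for the given queries to path."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(trec_run_lines(query_results, run_tag))


# Toy data in the same shape as the "results" column
example = [("q1", [("d7", "title", "text", 1.25), ("d2", "title", "text", 0.5)])]
example_lines = list(trec_run_lines(example, "results_word2vec"))
```

For a dataframe such as `dfs_all_word2vec`, the pairs could be supplied as `zip(df["query_id"], df["results"])`.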


* Now that the run files exist, we execute the corresponding `trec_eval` commands in cmd.  
* This section covers the queries expanded with word2vec synonyms.  
* Below are the commands I ran, copied from the cmd window together with their output:


```bash
C:\Users\user>cd C:\Users\user\Downloads\trec_eval

C:\Users\user\Downloads\trec_eval>trec_eval -m map qrels.qrel results_word2vec.txt
      1 [main] trec_eval 10248 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map                     all     0.0525

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.5 qrels.qrel results_word2vec_5.txt
      1 [main] trec_eval 2776 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_5               all     0.0324

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.10 qrels.qrel results_word2vec_10.txt
      1 [main] trec_eval 7880 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_10              all     0.0393

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.15 qrels.qrel results_word2vec_15.txt
      1 [main] trec_eval 12520 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_15              all     0.0424

C:\Users\user\Downloads\trec_eval>trec_eval -m map_cut.20 qrels.qrel results_word2vec_20.txt
      1 [main] trec_eval 7452 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.  Please report this problem to
the public mailing list cygwin@cygwin.com
map_cut_20              all     0.0442
```

In summary, our results:

| Evaluation Metric         | Cutoff  | Value  |
|---------------------------|---------|--------|
| MAP (Mean Avg Precision)  | All     | 0.0525 |
| avgPre@5 (map_cut.5)      | k = 5   | 0.0324 |
| avgPre@10 (map_cut.10)    | k = 10  | 0.0393 |
| avgPre@15 (map_cut.15)    | k = 15  | 0.0424 |
| avgPre@20 (map_cut.20)    | k = 20  | 0.0442 |
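
Copying values from the terminal by hand invites transcription slips, so the scores could also be collected programmatically. The following is a minimal sketch, assuming `trec_eval` is on the `PATH` and prints result lines of the form `measure  all  value` as in the transcript above. The helper names `parse_trec_eval_output` and `run_trec_eval` are hypothetical.

```python
import subprocess


def parse_trec_eval_output(text):
    """Return {measure: value} from lines shaped like 'map  all  0.0525'."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        # Warning lines (e.g. from Cygwin) do not have this three-field shape
        if len(parts) == 3 and parts[1] == "all":
            try:
                scores[parts[0]] = float(parts[2])
            except ValueError:
                continue  # skip non-numeric entries
    return scores


def run_trec_eval(measure, qrels_path, run_path, executable="trec_eval"):
    """Run `trec_eval -m <measure> <qrels> <run>` and parse its stdout."""
    completed = subprocess.run(
        [executable, "-m", measure, qrels_path, run_path],
        capture_output=True, text=True, check=True,
    )
    return parse_trec_eval_output(completed.stdout)


# Parsing a captured snippet, without invoking trec_eval
sample = (
    "      1 [main] trec_eval 2776 find_fast_cwd: WARNING: Couldn't compute FAST_CWD pointer.\n"
    "the public mailing list cygwin@cygwin.com\n"
    "map_cut_5               all     0.0324\n"
)
parsed = parse_trec_eval_output(sample)
```

A call such as `run_trec_eval("map_cut.5", "qrels.qrel", "results_word2vec_5.txt")` would then return the parsed score for that run file.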


## 5. Analysis of Evaluation Results

A summary table of all our results:

| Evaluation Metric         | Cutoff  | Original Queries | Expanded with WordNet Synonyms | Expanded with WordNet Hypernyms/Hyponyms | Expanded with word2vec Synonyms |
|---------------------------|---------|------------------|-------------------------------|-----------------------------------------|--------------------------------|
| MAP (Mean Avg Precision)  | All     | 0.0949           | 0.0454                        | 0.0420                                  | 0.0525                         |
| avgPre@5 (map_cut.5)      | k = 5   | 0.0662           | 0.0304                        | 0.0266                                  | 0.0324                         |
| avgPre@10 (map_cut.10)    | k = 10  | 0.0770           | 0.0352                        | 0.0308                                  | 0.0393                         |
| avgPre@15 (map_cut.15)    | k = 15  | 0.0822           | 0.0371                        | 0.0333                                  | 0.0424                         |
| avgPre@20 (map_cut.20)    | k = 20  | 0.0847           | 0.0384                        | 0.0346                                  | 0.0442                         |


* The original approach, i.e., without query expansion, achieved the best performance across all evaluation metrics.  
* Expanding queries with synonyms, hypernyms, or hyponyms was counterproductive with every method: MAP fell to roughly half of the baseline value of 0.0949.  
* A likely explanation is that expansion adds terms that are not always relevant or specific to the original query, and the resulting noise and ambiguity lower precision.  

**However, among the expansion methods, queries expanded with word2vec synonyms performed best, although they still fell short of the original queries.**  
**Thus, the final ranking of the methods is:**  
1. **Simple method, without query expansion**  
2. **Queries expanded with word2vec synonyms**  
3. **Queries expanded with WordNet synonyms, followed by WordNet hypernyms/hyponyms**
