## NQ - Metadata Indexing
This notebook indexes the NQ (Natural Questions) passages, together with their extracted metadata, into Elasticsearch. The workflow includes:

- **Data Preparation:** Loading and merging metadata from multiple sources.
- **Elasticsearch Indexing:** Creating and configuring BM25 and KNN indices with custom analyzers and mappings.
- **Embedding Generation:** Using Sentence Transformers to generate dense vector representations for KNN search.
- **Indexing Pipeline:** Bulk indexing documents into Elasticsearch for both BM25 and KNN indices.

Follow the code cells and markdown explanations for a step-by-step guide through the indexing process.

In [1]:
import os
import warnings
from datetime import date, datetime
from pathlib import Path

import numpy as np
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError

warnings.filterwarnings('ignore')

In [2]:
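def get_all_files(folder_name):
    # List the folder directly instead of calling os.chdir: changing the working
    # directory would break the relative paths used by later calls
    file_path_list = []
    # Iterate through all files in the folder
    for file in os.listdir(folder_name):
        print(file)
        file_path = os.path.join(folder_name, file)
        file_path_list.append(file_path)
    return file_path_list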

In [None]:
files = get_all_files('../../output/nq/meta-data/')
files1 = get_all_files('../../output/nq/meta-data-process/')
files2 = get_all_files('../../output/nq/meta-data-lost/')

# Read every Excel file from the three folders into one list of DataFrames
dfs = []
for file in files + files1 + files2:
    print(file)
    df = pd.read_excel(file)
    print("length of file-----", len(df))
    dfs.append(df)

# Merge all dataframes
merged_df = pd.concat(dfs, ignore_index=True)

In [25]:
len(merged_df)

2681468

In [27]:
merged_df.head()

| | _id | title | text | metadata | keybert_title | yake_key_idea | extracted_entities |
|---|---|---|---|---|---|---|---|
| 0 | doc1370000 | Puppy Bowl | Also beginning in 2010, the American Animal Ho... | {} | puppy bowl | Animal Hospital Association | 2010, the American Animal Hospital Association... |
| 1 | doc1370001 | Puppy Bowl | A new element for 2011 was a parody of the pop... | {} | kiss cam | Kiss Cam | 2011 |
| 2 | doc1370002 | Puppy Bowl | Two other new elements were added in 2012 as w... | {} | cockatiel | elements were added | 2012, Twitter, Jill Rappaport |
| 3 | doc1370003 | Puppy Bowl | The hamsters in the blimp and Meep the "tweeti... | {} | puppy cam | blimp and Meep | 2013 |
| 4 | doc1370004 | Puppy Bowl | For the 2014 edition of the Puppy Bowl, the te... | {} | puppy bowl | Puppy Bowl parties | the Puppy Bowl, 2014, Michelle Obama |


In [28]:
merged_df.columns

Index(['_id', 'title', 'text', 'metadata', 'keybert_title', 'yake_key_idea',
       'extracted_entities'],
      dtype='object')


## 1. Indexing 

In [3]:
from dotenv import load_dotenv

# Path to your .env file
env_path = "../.env"  # Change path if needed

# Load environment variables from .env
load_dotenv(dotenv_path=env_path)
# Access the environment variables
ES_URL = os.getenv("ES_URL")
ES_USER = os.getenv("ES_USER")
ES_PASS = os.getenv("ES_PASS")
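
`os.getenv` returns `None` for a variable that is not set, so a wrong `.env` path would only show up later as a confusing connection error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper name, not part of the notebook or of `python-dotenv`.

```python
import os

def require_env(names):
    """Return the named environment variables as a dict, failing fast if any are unset."""
    # Treat unset and empty variables the same way: both are unusable here
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

Called as `require_env(["ES_URL", "ES_USER", "ES_PASS"])` right after `load_dotenv`, this would stop the notebook before the client is created if any credential is absent.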


In [None]:
# Create a global client connection to elastic search
es_client = Elasticsearch(
    ES_URL,
    basic_auth=(ES_USER, ES_PASS),
    verify_certs=False,
    request_timeout=10000
)

In [14]:
print(es_client.info())

{'name': 'es-sample-es-data-master-2', 'cluster_name': 'es-sample', 'cluster_uuid': 'lxgst327RICarIi1P0c6TQ', 'version': {'number': '8.12.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '1665f706fd9354802c02146c1e6b5c0fbcddfbc9', 'build_date': '2024-01-11T10:05:27.953830042Z', 'build_snapshot': False, 'lucene_version': '9.9.1', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}


In [20]:
# BM25 index for NQ
index_name = "nq_bm25_metadata"
index_mapping = {
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "refresh_interval": "1m",
        "analysis": {
            "filter": {
                "possessive_english_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "light_english_stemmer": {
                    "type": "stemmer",
                    "language": "light_english"
                },
                "english_stop": {
                    "ignore_case": "true",
                    "type": "stop",
                    "stopwords": ["a", "about", "all", "also", "am", "an", "and", "any", "are", "as", "at",
                                  "be", "been", "but", "by", "can", "de", "did", "do", "does", "for", "from",
                                  "had", "has", "have", "he", "her", "him", "his", "how", "if", "in", "into",
                                  "is", "it", "its", "more", "my", "nbsp", "new", "no", "non", "not", "of",
                                  "on", "one", "or", "other", "our", "she", "so", "some", "such", "than",
                                  "that", "the", "their", "then", "there", "these", "they", "this", "those",
                                  "thus", "to", "up", "us", "use", "was", "we", "were", "what", "when", "where",
                                  "which", "while", "why", "will", "with", "would", "you", "your", "yours"]
                }
            },
            "analyzer": {
                "text_en_no_stop": {
                    "filter": [
                        "lowercase",
                        "possessive_english_stemmer",
                        "light_english_stemmer"
                    ],
                    "tokenizer": "standard"
                },
                "text_en_stop": {
                    "filter": [
                        "lowercase",
                        "possessive_english_stemmer",
                        "english_stop",
                        "light_english_stemmer"
                    ],
                    "tokenizer": "standard"
                },
                "whitespace_lowercase": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase"
                    ]
                }
            },
            "normalizer": {
                "keyword_lowercase": {
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {"type": "text"},
            "nqid": {"type": "text"},
            "title":{"type": "text"},
            "context": {"type": "text"},
            "metadata": {"type": "text"},
            "keybert_title":{"type": "text"},
            "yake_key_idea":{"type": "text"},
            "extracted_entities":{"type": "text"}
        }
    }
}


In [16]:
def create_index(index_name, mapping):
    # Create the index; report, rather than fail, if it already exists
    try:
        es_client.indices.create(
            index=index_name,
            settings=mapping["settings"],
            mappings=mapping["mappings"],
        )
        print(f"Index '{index_name}' created successfully.")
    except RequestError as e:
        if e.error == 'resource_already_exists_exception':
            print(f"Index '{index_name}' already exists.")
        else:
            print(f"An error occurred while creating index '{index_name}': {e}")

In [21]:
create_index(index_name,index_mapping)

Index 'nq_bm25_metadata' created successfully.
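
In Elasticsearch, a `text` field is analyzed with the `standard` analyzer unless its mapping names another one, so the custom analyzers defined under `settings` are not applied to any field in the mapping as written. The sketch below shows how a mapping could reference one of them. The field names and the analyzer name `text_en_stop` come from the notebook; the choice of which fields to analyze, and the helper `build_mappings`, are assumptions for illustration only.

```python
# Field names taken from the notebook's mapping
TEXT_FIELDS = ["title", "context", "metadata", "keybert_title",
               "yake_key_idea", "extracted_entities"]

def build_mappings(analyzed_fields, analyzer="text_en_stop"):
    """Build a 'mappings' section, attaching `analyzer` to the chosen text fields."""
    properties = {"id": {"type": "text"}, "nqid": {"type": "text"}}
    for field in TEXT_FIELDS:
        spec = {"type": "text"}
        if field in analyzed_fields:
            # Without this key the field falls back to the standard analyzer
            spec["analyzer"] = analyzer
        properties[field] = spec
    return {"properties": properties}

# Assumption: stemming and stop-word removal are wanted on title and context only
mappings = build_mappings(analyzed_fields={"title", "context"})
```

The resulting dictionary could replace the `"mappings"` entry of `index_mapping` before the index is created; an analyzer cannot be changed on an existing field without reindexing.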


In [22]:
index_name_knn = 'nq_knn_metadata'
index_mapping = {
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "refresh_interval": "1m",
        "analysis": {
            "filter": {
                "possessive_english_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "light_english_stemmer": {
                    "type": "stemmer",
                    "language": "light_english"
                },
                "english_stop": {
                    "ignore_case": "true",
                    "type": "stop",
                    "stopwords": ["a", "about", "all", "also", "am", "an", "and", "any", "are", "as", "at",
                                  "be", "been", "but", "by", "can", "de", "did", "do", "does", "for", "from",
                                  "had", "has", "have", "he", "her", "him", "his", "how", "if", "in", "into",
                                  "is", "it", "its", "more", "my", "nbsp", "new", "no", "non", "not", "of",
                                  "on", "one", "or", "other", "our", "she", "so", "some", "such", "than",
                                  "that", "the", "their", "then", "there", "these", "they", "this", "those",
                                  "thus", "to", "up", "us", "use", "was", "we", "were", "what", "when", "where",
                                  "which", "while", "why", "will", "with", "would", "you", "your", "yours"]
                }
            },
            "analyzer": {
                "text_en_no_stop": {
                    "filter": [
                        "lowercase",
                        "possessive_english_stemmer",
                        "light_english_stemmer"
                    ],
                    "tokenizer": "standard"
                },
                "text_en_stop": {
                    "filter": [
                        "lowercase",
                        "possessive_english_stemmer",
                        "english_stop",
                        "light_english_stemmer"
                    ],
                    "tokenizer": "standard"
                },
                "whitespace_lowercase": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase"
                    ]
                }
            },
            "normalizer": {
                "keyword_lowercase": {
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {"type": "text"},
            "nqid": {"type": "text"},
            "title": {"type": "text"},
            "context": {"type": "text"},
            "metadata": {"type": "text"},
            "keybert_title": {"type": "text"},
            "yake_key_idea": {"type": "text"},
            "extracted_entities": {"type": "text"},
            # 384 dimensions matches the output size of all-MiniLM-L6-v2
            "contexts_embedding": {
                "type": "dense_vector",
                "dims": 384,
                "similarity": "cosine",
                "index": True
            }
        }
    }
}

create_index(index_name_knn,index_mapping)

Index 'nq_knn_metadata' created successfully.
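
The `contexts_embedding` field is declared with `"similarity": "cosine"`. For intuition, here is a pure-Python sketch of cosine similarity and of the kNN `_score` that Elasticsearch documents for this setting, `(1 + cosine) / 2`. The function names are illustrative; Elasticsearch computes these values internally and the notebook never calls them.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_score_from_cosine(cosine):
    """Map a cosine in [-1, 1] to the non-negative score range [0, 1]."""
    return (1.0 + cosine) / 2.0
```

Identical directions give a cosine of 1 and a score of 1, orthogonal vectors give a score of 0.5, and opposite directions give a score of 0.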


## Indexing Pipeline

In [None]:
from sentence_transformers import SentenceTransformer
import ast
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
index_doc = []
index_doc_knn = []
for index, row in merged_df.iterrows():
    # Log progress occasionally instead of printing every row and passage
    if index % 10000 == 0:
        print('index.....', index)
    nq_id = row['_id']
    doc_id = "nq" + str(nq_id)  # avoid shadowing the built-in id()
    context = row['text']
    # Encode the passage and convert it to a plain list so it serializes as JSON
    passage_embedding = model.encode(context).tolist()
    doc = {
        "id": doc_id,
        "nqid": nq_id,
        "title": row['title'],
        "context": context,
        "metadata": row['metadata'],
        "keybert_title": row['keybert_title'],
        "yake_key_idea": row['yake_key_idea'],
        "extracted_entities": row['extracted_entities'],
    }
    # The KNN document is the BM25 document plus the embedding
    doc_knn = {**doc, "contexts_embedding": passage_embedding}
    index_doc.append(doc)
    index_doc_knn.append(doc_knn)
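
Encoding one passage per call is slow for a corpus of this size. `SentenceTransformer.encode` also accepts a list of sentences, so the texts can be embedded in batches. The sketch below is one way to do that; `encode_in_batches` and its `batch_size` default are assumptions, and `encode_fn` stands in for `model.encode`.

```python
def encode_in_batches(texts, encode_fn, batch_size=256):
    """Yield one embedding (as a plain list) per text, encoding batch_size texts per call."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for vector in encode_fn(batch):
            # NumPy rows expose tolist(); fall back to list() for other sequences
            yield vector.tolist() if hasattr(vector, "tolist") else list(vector)
```

With the notebook's objects this could be called as `encode_in_batches(merged_df['text'].tolist(), model.encode)`. Because it is a generator, only one batch of embeddings is held in memory at a time.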
    


In [17]:
len(index_doc)

62249

In [18]:
len(index_doc_knn)

62249
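
If any metadata column contains empty cells, pandas represents them as float `NaN`, which is not a valid JSON value and may cause the affected documents to be rejected at bulk time. Whether this dataset has such cells is not shown in the notebook, so the following is a precautionary sketch; `clean_value` and `clean_doc` are hypothetical helpers.

```python
import math

def clean_value(value):
    """Map float NaN (pandas' marker for a missing cell) to None; pass other values through."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

def clean_doc(doc):
    """Return a copy of a document with NaN field values replaced by None."""
    return {key: clean_value(value) for key, value in doc.items()}
```

Applying `clean_doc` to each entry of `index_doc` before bulk indexing would send `null` for missing fields, which Elasticsearch treats as an absent value.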

In [19]:
index_doc[0]


## BM25 indexing

In [22]:
import time
documents = []
for doc in index_doc:
    documents.append(
        {
            "_index": index_name, ## CHANGE INDEX NAME
            "_source": doc,
        }
    )
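
`helpers.bulk` accepts any iterable of actions, so the actions do not have to be materialized in a second list. The sketch below builds them lazily and can optionally set `_id`, which makes a re-run overwrite existing documents instead of adding duplicates under auto-generated IDs. `generate_actions` and `id_field` are illustrative names, not part of the notebook.

```python
def generate_actions(docs, index_name, id_field=None):
    """Yield one bulk action per document, optionally using a document field as _id."""
    for doc in docs:
        action = {"_index": index_name, "_source": doc}
        if id_field is not None:
            # A stable _id makes repeated indexing runs idempotent
            action["_id"] = doc[id_field]
        yield action
```

For example, `helpers.bulk(es_client, generate_actions(index_doc, index_name, id_field="id"))` would index the same documents keyed by the `id` value built earlier.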

In [None]:
from elasticsearch import helpers,exceptions, RequestError
def chunk_documents(documents, num_chunks):
    chunk_size = len(documents) // num_chunks
    remainder = len(documents) % num_chunks

    start = 0
    for i in range(num_chunks):
        chunk_end = start + chunk_size + (1 if i < remainder else 0)
        yield documents[start:chunk_end]
        start = chunk_end

# Index the BM25 documents in 50 chunks
total_docs = len(documents)
num_chunks = 50

start_time = time.time()
for i, chunk in enumerate(chunk_documents(documents, num_chunks)):
    #clear_output(wait=True)
    print(f"Chunk {i+1}: {len(chunk)} documents")
    try:
        helpers.bulk(es_client, chunk)
        print(f"Done indexing {len(chunk)} documents into the {index_name} index!") ## CHANGE INDEX NAME
    except Exception as e:
        # Report the failure and continue with the next chunk
        print("An error occurred:", e)
print(f"Processed {total_docs} documents in {time.time() - start_time:.1f} seconds")
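
To make the splitting behavior of `chunk_documents` concrete, here is a small worked example on toy data. The function is repeated unchanged so the block runs on its own; the inputs are made up for illustration.

```python
def chunk_documents(documents, num_chunks):
    chunk_size = len(documents) // num_chunks
    remainder = len(documents) % num_chunks

    start = 0
    for i in range(num_chunks):
        # The first `remainder` chunks each take one extra document
        chunk_end = start + chunk_size + (1 if i < remainder else 0)
        yield documents[start:chunk_end]
        start = chunk_end

# 10 documents in 3 chunks: sizes differ by at most one
sizes = [len(chunk) for chunk in chunk_documents(list(range(10)), 3)]
print(sizes)  # [4, 3, 3]

# Fewer documents than chunks: the trailing chunks are empty
small = [len(chunk) for chunk in chunk_documents(["a", "b"], 3)]
print(small)  # [1, 1, 0]
```

Note that `helpers.bulk` also batches internally (its `chunk_size` parameter defaults to 500 actions per request), so the outer chunks mainly serve progress reporting and error isolation.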

## KNN Indexing

In [24]:
import time
documents = []
for doc in index_doc_knn:
    documents.append(
        {
            "_index": index_name_knn, ## CHANGE INDEX NAME
            "_source": doc,
        }
    )

In [None]:
index_doc_knn[0]

In [None]:
# Index the KNN documents in 50 chunks
total_docs = len(documents)
num_chunks = 50
start_time = time.time()
for i, chunk in enumerate(chunk_documents(documents, num_chunks)):
    #clear_output(wait=True)
    print(f"Chunk {i+1}: {len(chunk)} documents")
    try:
        helpers.bulk(es_client, chunk)
        print(f"Done indexing {len(chunk)} documents into the {index_name_knn} index!") ## CHANGE INDEX NAME
    except Exception as e:
        # Report the failure and continue with the next chunk
        print("An error occurred:", e)
print(f"Processed {total_docs} documents in {time.time() - start_time:.1f} seconds")
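
Both indices are created with `"refresh_interval": "1m"`, so newly indexed documents may not be searchable or counted immediately. A minimal sketch of a post-indexing check follows: it refreshes the index explicitly and then reads the document count. `indexed_count` is a hypothetical helper; `client` is assumed to be an `Elasticsearch` client such as `es_client`.

```python
def indexed_count(client, index_name):
    """Refresh the index, then return the number of searchable documents in it."""
    # Force a refresh so documents from the last bulk request become visible
    client.indices.refresh(index=index_name)
    return client.count(index=index_name)["count"]
```

Comparing `indexed_count(es_client, index_name_knn)` with `len(index_doc_knn)` is a quick way to spot chunks that failed during bulk indexing.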