###### BM25 ranks documents by how relevant they are to the terms in a query, weighing term frequency against document length. Because it matches exact tokens only, a document that shares no terms with the query scores zero, so you may need to adjust your approach to surface related matches. The following strategies can make BM25 more effective at finding relevant matches:
###### Implemented in this new script
###### Synonyms and Stemming: Use techniques like stemming or lemmatization to reduce words to their base forms, and consider expanding your query with synonyms to capture more variations of the terms. 
###### Query Expansion: Manually or automatically expand your query with related terms. This can be done using a thesaurus or word embeddings like Word2Vec or GloVe to find semantically similar words.
###### Preprocessing Enhancements: Improve your preprocessing steps by removing noise, handling typos, and ensuring consistent formatting across your dataset.
###### Custom Scoring: Consider implementing a custom scoring function that combines BM25 with other metrics, such as semantic similarity using embeddings.
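
###### The custom scoring idea above can be sketched as a weighted blend of two per-document score lists. This is a minimal illustration, not part of the script below: the helper names `min_max_normalize` and `combine_scores`, the weight `alpha`, and the sample scores are all assumptions made for the example, and the semantic scores are presumed to come from some embedding model that is not shown here.

```python
def min_max_normalize(scores):
    """Rescale scores to the range [0, 1]; a constant list maps to all zeros."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        return [0.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


def combine_scores(bm25_scores, semantic_scores, alpha=0.5):
    """Blend two per-document score lists; alpha is the weight given to BM25."""
    if len(bm25_scores) != len(semantic_scores):
        raise ValueError("Both score lists must cover the same documents")
    bm25_norm = min_max_normalize(bm25_scores)
    semantic_norm = min_max_normalize(semantic_scores)
    return [alpha * b + (1 - alpha) * s for b, s in zip(bm25_norm, semantic_norm)]


# Made-up scores for three documents, for illustration only
bm25_scores = [0.0, 2.0, 4.0]
semantic_scores = [0.9, 0.1, 0.5]

combined = combine_scores(bm25_scores, semantic_scores, alpha=0.5)

# Document indices ordered from best to worst combined score
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

###### Both lists are normalized first because BM25 scores are unbounded while similarity scores usually sit in a fixed range; without that step one signal would likely dominate the blend. The value of `alpha` is a tuning choice.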

In [5]:
import pandas as pd
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
import nltk

# Ensure NLTK resources are available
nltk.download('wordnet')

# Load the CSV file
file_path = 'C:/Users/oscarahe/OneDrive - Intel Corporation/Desktop/Exceles/query2.csv'  # Replace with your file path
df = pd.read_csv(file_path)

# Select the column to compare against
column_to_compare = 'description'  # Replace with your column name

# Initialize the stemmer, which reduces words to their base forms (new feature)
stemmer = PorterStemmer()

# Preprocess the text data
def preprocess_text(text):
    # Convert to lowercase
    text = str(text).lower()
    # Tokenize, stem, and remove stop words
    tokens = [stemmer.stem(word) for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return tokens

# Expand the query with WordNet synonyms, stemmed to match the corpus (new feature)
def expand_query(query):
    expanded_query = set(query)
    for word in query:
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                expanded_query.add(stemmer.stem(lemma.name()))
    return list(expanded_query)

# Preprocess the data in the selected column
documents = df[column_to_compare].apply(preprocess_text).tolist()

# Initialize BM25
bm25 = BM25Okapi(documents)

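# Statement to compare
statement = "BIOS"  # Replace with your statement
query = preprocess_text(statement)

# Expand the query. WordNet is looked up with the unstemmed words, because
# Porter stems (e.g. "batteri") are often not dictionary words; the stemmed
# query terms are then merged back in so they still match the stemmed corpus.
raw_tokens = [word for word in statement.lower().split() if word not in ENGLISH_STOP_WORDS]
expanded_query = list(set(query) | set(expand_query(raw_tokens)))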

# Get BM25 scores
scores = bm25.get_scores(expanded_query)

# Find the indices of the top_n highest scores, best match first
top_n = 5
top_indices = scores.argsort()[-top_n:][::-1]

# Get the top_n most similar sightings
top_sightings = df.iloc[top_indices]

print(f"Top {top_n} most similar sightings:")
print(top_sightings)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\oscarahe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Top 5 most similar sightings:
              id  rev  is_current      updated_date system_updated_date  \
365  15014523589   23           1  10/24/2023 17:37     4/22/2024 22:33   
291  14018113855   29           1  10/10/2023 21:54     4/22/2024 21:27   
114  14019877013   27           1   10/5/2023 21:11     4/22/2024 22:17   
128  14017900157   14           1    1/25/2023 0:53     4/22/2024 21:22   
123  14017766856   15           1  11/24/2022 15:04     4/22/2024 21:20   

                                             read_grps  read_grps_id  \
365  sys_admin,central_firmware_proj_admin,server_p...   22019963307   
291  sys_admin,sighting_central_proj_admin,svr_proj...   22019960771   
114  sys_admin,sighting_central_proj_admin,sighting...   22019954995   
128  sys_admin,central_firmware_proj_admin,sighting...   22019960178   
123  sys_admin,sighting_central_proj_admin,sighting...   22019958903   

      subject            tenant    submitted_date  ...  local_updated_date  \
365  sig