###### BM25 ranks documents by how relevant they are to the terms in a query, weighing term frequency against document length. Because it only scores exact token matches, a query worded differently from the documents can score poorly even when the content is related. The following strategies can help BM25 surface more relevant matches:
######  Implemented in this new script
###### Synonyms and Stemming: Use techniques like stemming or lemmatization to reduce words to their base forms, and consider expanding your query with synonyms to capture more variations of the terms. 
###### Query Expansion: Manually or automatically expand your query with related terms. This can be done using a thesaurus or word embeddings like Word2Vec or GloVe to find semantically similar words.
###### Preprocessing Enhancements: Improve your preprocessing steps by removing noise, handling typos, and ensuring consistent formatting across your dataset.
###### Custom Scoring: Consider implementing a custom scoring function that combines BM25 with other metrics, such as semantic similarity using embeddings.
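
The custom scoring idea above is not part of the script that follows. As a rough sketch of what such a blend could look like, the code below rescales BM25 scores to [0, 1] and mixes them with a second similarity using a weight. The helper names (`min_max`, `jaccard`, `blend_scores`), the weight `alpha`, and the toy scores are all illustrative assumptions. Token-set Jaccard overlap stands in for an embedding-based similarity only to keep the example free of extra dependencies.

```python
def min_max(values):
    """Rescale a non-empty list of numbers to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def jaccard(query_tokens, doc_tokens):
    """Token-set overlap; a placeholder for a real semantic similarity."""
    q, d = set(query_tokens), set(doc_tokens)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def blend_scores(bm25_scores, other_scores, alpha=0.7):
    """Weighted sum of min-max normalized BM25 scores and a second similarity."""
    if len(bm25_scores) != len(other_scores):
        raise ValueError("score lists must have the same length")
    bm25_norm = min_max(list(bm25_scores))
    other_norm = min_max(list(other_scores))
    return [alpha * b + (1 - alpha) * o for b, o in zip(bm25_norm, other_norm)]


# Toy data: three tokenized documents and made-up BM25 scores (hypothetical values)
docs = [["wrong", "turbo", "bin"], ["fuse", "valu"], ["turbo", "bin", "bucket", "7"]]
query = ["turbo", "bin", "bucket", "7"]
bm25_scores = [2.0, 0.0, 4.0]

overlap = [jaccard(query, d) for d in docs]
combined = blend_scores(bm25_scores, overlap, alpha=0.7)
best = max(range(len(docs)), key=lambda i: combined[i])
print(best, combined)
```

In the script, `bm25_scores` would be the array returned by `bm25.get_scores(expanded_query)`, and the second list would come from whichever similarity model is chosen. The value of `alpha` is a tuning knob, not a recommendation.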

In [5]:
import pandas as pd
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
import nltk

# Ensure NLTK resources are available
nltk.download('wordnet')

# Load the Excel file
file_path = 'C:/Users/fsalasb/OneDrive - Intel Corporation/Documents/AI Workshop/EMR Sightings Valgpt.xlsx'  # Replace with your file path
df = pd.read_excel(file_path)

# Select the column to compare against
column_to_compare = 'Failure Description'  # Replace with your column name

# Initialize the stemmer, which reduces words to their base forms (new in this script)
stemmer = PorterStemmer()

# Preprocess the text data
def preprocess_text(text):
    # Convert to lowercase
    text = str(text).lower()
    # Tokenize, stem, and remove stop words
    tokens = [stemmer.stem(word) for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return tokens

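# Expand query with synonyms. This is a new feature.
def expand_query(tokens):
    # Look up synonyms on the unstemmed tokens: WordNet indexes dictionary words,
    # so Porter stems such as "valu" would return no synsets.
    expanded_query = set()
    for word in tokens:
        expanded_query.add(stemmer.stem(word))
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                # Multi-word lemmas are joined with underscores; split them into tokens
                for part in lemma.name().lower().split('_'):
                    if part not in ENGLISH_STOP_WORDS:
                        expanded_query.add(stemmer.stem(part))
    return list(expanded_query)

# Preprocess the data in the selected column
documents = df[column_to_compare].apply(preprocess_text).tolist()

# Initialize BM25
bm25 = BM25Okapi(documents)

# Statement to compare
statement = "Incorrect values may be observed in Turbo Bin Bucket 7"  # Replace with your statement
# Keep the query tokens unstemmed here; expand_query stems them after the synonym lookup
query = [word for word in statement.lower().split() if word not in ENGLISH_STOP_WORDS]

# Expand the query
expanded_query = expand_query(query)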

# Get BM25 scores
scores = bm25.get_scores(expanded_query)

# Find the indices of the top 5 scores
top_n = 5
top_indices = scores.argsort()[-top_n:][::-1]

# Get the top_n most similar sightings
top_sightings = df.iloc[top_indices]

print(f"Top {top_n} most similar sightings:")
print(top_sightings)

Top 5 most similar sightings:
             ID                                Failure Description    Status  \
18  14018000516  Wrong Turbo Bin bucket 7 impacting PO, VIS-1 S...  Complete   
19  14018000574  Wrong pcode_config_tdp_level_en_mask fuse valu...  Rejected   
17  14018000504  Wrong mapping on SST-TF mailbox for Cdyn level...  Complete   
23  14018183474  Wrong mapping on SST-TF mailbox for Cdyn level...  Complete   
35  14018874246  During VV for EMR A0 we observed a failure wit...  Complete   

                                               Theory  \
18    The issue was related to incorrect fuse values.   
19  The issue was identified as a fusing issue, no...   
17        The issue was due to an outdated CRIF file.   
23        The issue was due to an outdated CRIF file.   
35  Known bug from ICX and SPR A0. Was not documen...   

                                      Conducted Tests  
18   The problem was resolved by cloning to Fuse CCB.  
19  The problem was resolved by co

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fsalasb\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
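
The preprocessing strategy listed at the top (removing noise and handling typos) is not covered by `preprocess_text`, which splits on whitespace and so keeps punctuation attached to words. One possible sketch, using only the standard library, strips punctuation with a regular expression and maps misspelled tokens to the closest word in the corpus vocabulary with `difflib.get_close_matches`. The function names, the `cutoff` of 0.8, the `min_length` of 4, and the sample vocabulary are assumptions chosen for illustration. Short tokens are skipped because they match unrelated words too easily (for example, "in" is within the cutoff of "bin").

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and underscores."""
    return re.findall(r"[a-z0-9_]+", str(text).lower())


def correct_tokens(tokens, vocabulary, cutoff=0.8, min_length=4):
    """Replace out-of-vocabulary tokens with their closest vocabulary entry, if any."""
    vocab = set(vocabulary)
    candidates = sorted(vocab)  # stable candidate order for get_close_matches
    corrected = []
    for token in tokens:
        if token in vocab or len(token) < min_length:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected


# Hypothetical vocabulary; in the script it could be built from the unstemmed
# tokens of the 'Failure Description' column.
vocabulary = {"wrong", "turbo", "bin", "bucket", "values"}

tokens = tokenize("Wrong valeus in Turbo Bin Buckett 7!")
corrected = correct_tokens(tokens, vocabulary)
print(corrected)
```

A step like this would run before stop-word removal and stemming, since the vocabulary and the query need to be compared in the same unstemmed form.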
