###### This version adds tuning of the two main BM25 parameters, k1 and b. They control term-frequency saturation and document-length normalization, respectively:

###### k1: Controls term-frequency saturation. With a higher k1, repeated occurrences of a term keep raising the score for longer before the contribution levels off; with a lower k1, the contribution saturates after only a few occurrences. Typical values range from 1.2 to 2.0.
###### b: Controls the degree of document-length normalization. A value close to 1 normalizes heavily for document length, while a value close to 0 applies little normalization. Typical values range from 0.75 to 0.85.
###### Tuning the parameters k1 and b in the BM25 algorithm can significantly impact the effectiveness of your search results. Here are some recommendations to help you tweak these parameters effectively:
###### Understand the Parameters:
###### k1: A higher k1 makes frequently repeated terms more impactful. If repeated mentions of a query term are a good sign of relevance in your documents, raise k1; if a single mention is usually enough, lower it.
###### b: A higher b applies more length normalization, which is useful when document lengths vary significantly. If your documents are of similar length, a lower b might be appropriate.
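To make the roles of k1 and b concrete, here is a minimal sketch of the term-frequency part of the BM25 score in its common Okapi form, with the IDF factor left out. The function name `bm25_tf_weight` and the sample numbers are illustrative assumptions and do not come from the notebook.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Term-frequency component of Okapi BM25 (IDF factor omitted)."""
    # b blends between no length normalization (b=0) and full normalization (b=1)
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 sets how quickly extra occurrences of a term stop adding to the score
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)

# Saturation: for an average-length document the weight approaches k1 + 1 as tf grows
for k1 in (1.2, 2.0):
    weights = [round(bm25_tf_weight(tf, 100, 100, k1=k1), 3) for tf in (1, 2, 5, 20)]
    print(f"k1={k1}: {weights}")

# Length normalization: same tf, but the document is twice the average length
for b in (0.0, 0.75, 1.0):
    print(f"b={b}: {round(bm25_tf_weight(3, 200, 100, b=b), 3)}")
```

With b = 0 the long document is not penalized at all; as b moves toward 1 the same term count is worth less in the longer document.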
###### Experiment with Values:
###### Start with typical values: k1 around 1.2 to 2.0 and b around 0.75 to 0.85.
###### Test different combinations to see how they affect the ranking of your documents. You might want to use a grid search approach to systematically explore different values.
###### Evaluate with a Test Set:
###### Use a set of queries and known relevant documents to evaluate the effectiveness of different parameter settings.
###### Measure performance using metrics like precision, recall, or F1-score, computed over the top-ranked results for each query, to determine which settings yield the best results.
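As a rough illustration of this kind of evaluation, the sketch below computes precision, recall, and F1 over the top k results of one query. The helper names, the ranked row indices, and the set of relevant rows are all hypothetical; in practice the relevance judgments would come from your own test set.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / len(relevant_ids)

# Hypothetical ranking returned for one query, and hand-labeled relevant rows
ranked = [3, 2, 125, 422, 39]
relevant = {3, 422, 7}

p = precision_at_k(ranked, relevant, k=5)
r = recall_at_k(ranked, relevant, k=5)
f1 = 2 * p * r / (p + r) if (p + r) else 0.0
print(f"precision@5={p:.2f} recall@5={r:.2f} F1={f1:.2f}")
```

Averaging these numbers over many labeled queries gives a single figure per (k1, b) setting that can be compared across settings.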
###### Consider the Nature of Your Data:
###### If your dataset has documents with highly variable lengths, consider increasing b to account for this.
###### If term frequency is a strong indicator of relevance in your dataset, consider increasing k1.
###### Iterative Testing:
###### Adjust parameters incrementally and observe changes in results.
###### Use feedback from domain experts or end-users to guide adjustments.
###### Automated Tuning:
###### Consider using automated hyperparameter tuning techniques, such as grid search or random search, to explore a wide range of parameter combinations efficiently.
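One way to sketch a grid search that selects parameters by a relevance metric instead of by raw score is shown below. To stay self-contained it uses a small pure-Python BM25 scorer, which is a simplified stand-in and not the `rank_bm25` implementation; the toy corpus, the labeled queries, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

def bm25_scores(corpus, query, k1, b):
    """Score every tokenized document against the query (simplified Okapi BM25)."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        norm = 1.0 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # A commonly used IDF variant that never goes negative
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

def precision_at_k(scores, relevant, k):
    """Precision of the k highest-scoring document indices."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in ranked if i in relevant) / k

def grid_search(corpus, labeled_queries, k1_values, b_values, k=2):
    """Return (k1, b, mean precision@k) for the best-performing setting."""
    best = None
    for k1, b in product(k1_values, b_values):
        mean_p = sum(
            precision_at_k(bm25_scores(corpus, query, k1, b), relevant, k)
            for query, relevant in labeled_queries
        ) / len(labeled_queries)
        if best is None or mean_p > best[2]:
            best = (k1, b, mean_p)
    return best

# Made-up tokenized corpus and relevance judgments (document indices)
corpus = [
    ["crosstalk", "pcie", "lane", "eye", "closed"],
    ["fuse", "disable", "boot", "fail"],
    ["pcie", "gen5", "jitter", "crosstalk", "crosstalk"],
    ["power", "limit", "indicator"],
]
labeled_queries = [
    (["crosstalk", "pcie"], {0, 2}),
    (["fuse", "boot"], {1}),
]
k1_values = [1.2, 1.5, 1.8, 2.0]
b_values = [0.5, 0.75, 0.85]

best = grid_search(corpus, labeled_queries, k1_values, b_values)
print(f"Best k1: {best[0]}, best b: {best[1]}, mean precision@2: {best[2]:.2f}")
```

When several settings tie, this sketch keeps the first one it encounters. A corpus this small cannot separate the settings, so a realistic set of labeled queries is needed before the chosen values mean anything.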
###### Cross-Validation:
###### Use cross-validation to ensure that your parameter settings generalize well across different subsets of your data.
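A minimal sketch of cross-validating over labeled queries follows: parameters are chosen on the training folds and then scored on the held-out fold. The `toy_evaluate` function is a deliberately artificial stand-in for a real metric such as mean precision, and the query list with its "preferred b" values is invented purely so the example runs.

```python
def k_fold_splits(items, n_folds):
    """Yield (train, held_out) pairs, each item held out exactly once."""
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def cross_validate(queries, param_grid, evaluate, n_folds=3):
    """Pick the best parameters on each training split, score them on the held-out split."""
    held_out_scores = []
    for train, held_out in k_fold_splits(queries, n_folds):
        best_params = max(param_grid, key=lambda params: evaluate(params, train))
        held_out_scores.append(evaluate(best_params, held_out))
    return sum(held_out_scores) / len(held_out_scores)

# Hypothetical data: each query is (name, the b value under which it ranks best)
queries = [("q1", 0.75), ("q2", 0.85), ("q3", 0.5),
           ("q4", 0.75), ("q5", 0.85), ("q6", 0.75)]
param_grid = [(k1, b) for k1 in (1.2, 2.0) for b in (0.5, 0.75, 0.85)]

def toy_evaluate(params, query_subset):
    """Artificial metric in [0, 1]; replace with a real relevance metric."""
    _, b = params
    return sum(1.0 - abs(b - preferred_b) for _, preferred_b in query_subset) / len(query_subset)

cv_score = cross_validate(queries, param_grid, toy_evaluate)
print(f"Mean held-out score: {cv_score:.3f}")
```

If the held-out score is much lower than the score on the training folds, the chosen k1 and b are probably fitted to a handful of queries and may not generalize.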
###### By systematically experimenting with these parameters and evaluating their impact on your search results, you can find settings that work well for your specific application. The best values for k1 and b depend on the characteristics of your dataset and the nature of your queries.



In [1]:
import pandas as pd
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
import nltk

# Ensure NLTK resources are available
nltk.download('wordnet')

# Load the CSV file
file_path = 'C:/Users/oscarahe/OneDrive - Intel Corporation/Desktop/Exceles/query2.csv'  # Replace with your file path
df = pd.read_csv(file_path)

# Select the column to compare against
column_to_compare = 'description'  # Replace with your column name

# Initialize stemmer
stemmer = PorterStemmer()

# Preprocess the text data
def preprocess_text(text):
    text = str(text).lower()
    tokens = [stemmer.stem(word) for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return tokens

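# Expand query with WordNet synonyms
def expand_query(query):
    """Return the query tokens plus the stemmed WordNet lemmas of each token.

    Note: `query` is already stemmed, so tokens whose stem is not a WordNet
    entry get no synonyms; multi-word lemmas keep their underscores.
    """
    expanded_query = set(query)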
    for word in query:
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                expanded_query.add(stemmer.stem(lemma.name()))
    return list(expanded_query)

# Preprocess the data in the selected column
documents = df[column_to_compare].apply(preprocess_text).tolist()

# Statement to compare
statement = "Crosstalk in PCIe lanes"  # Replace with your statement
query = preprocess_text(statement)

# Expand the query
expanded_query = expand_query(query)

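# Define a function to perform grid search for k1 and b
def grid_search_bm25(documents, query, k1_values, b_values, top_n=5):
    """Try every (k1, b) pair and keep the one with the highest mean top-n score.

    Caveat: raw BM25 scores are not directly comparable across different k1/b
    settings, so this criterion is a heuristic rather than a relevance metric.
    Rows are looked up in the module-level `df`.
    """
    best_k1 = None
    best_b = None
    best_score = float('-inf')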
    best_sightings = None

    for k1 in k1_values:
        for b in b_values:
            bm25 = BM25Okapi(documents, k1=k1, b=b)
            scores = bm25.get_scores(query)
            top_indices = scores.argsort()[-top_n:][::-1]
            top_sightings = df.iloc[top_indices]
            avg_score = scores[top_indices].mean()

            if avg_score > best_score:
                best_score = avg_score
                best_k1 = k1
                best_b = b
                best_sightings = top_sightings

    return best_k1, best_b, best_score, best_sightings

# Define ranges for k1 and b
k1_values = [1.2, 1.5, 1.8, 2.0]
b_values = [0.5, 0.75, 0.85]

# Perform grid search
best_k1, best_b, best_score, best_sightings = grid_search_bm25(documents, expanded_query, k1_values, b_values)

print(f"Best k1: {best_k1}, Best b: {best_b}, Best Average Score: {best_score}")
print("Top 5 most similar sightings:")
print(best_sightings)
print(best_sightings[column_to_compare])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\oscarahe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Best k1: 2.0, Best b: 0.85, Best Average Score: 11.73482654569936
Top 5 most similar sightings:
              id rev is_current     updated_date system_updated_date  \
419  14018925189  15          1  7/18/2023 22:17     4/22/2024 21:50   
170  14018331720  41          1  2/27/2023 16:01     4/22/2024 21:33   
135  14019177987  18          1  5/26/2023 15:23     4/22/2024 21:58   
294  14019569295  18          1  6/30/2023 19:40     4/22/2024 22:10   
330  14019101636  23          1  3/15/2024 20:22     4/22/2024 21:56   

                                             read_grps  read_grps_id  \
419  sys_admin,sighting_central_proj_admin,sighting...  2.201996e+10   
170  sys_admin,sighting_central_proj_admin,sighting...  2.201995e+10   
135  sys_admin,sighting_central_proj_admin,sighting...  2.201995e+10   
294  sys_admin,sighting_central_proj_admin,soc_conf...  2.201996e+10   
330  sys_admin,sighting_central_proj_admin,sighting...  2.201996e+10   

      subject            tenant   subm

In [1]:
import pandas as pd
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
import nltk
from bs4 import BeautifulSoup

# Ensure NLTK resources are available
nltk.download('wordnet')

# Load the CSV file
file_path = 'C:/Users/oscarahe/OneDrive - Intel Corporation/Desktop/Exceles/query2.csv'  # Replace with your file path
df = pd.read_csv(file_path)

# Select the column to compare against
column_to_compare = 'description'  # Replace with your column name

# Function to remove HTML tags
def remove_html_malformed(text):
    """Remove HTML tags from a malformed HTML string using BeautifulSoup."""
    soup = BeautifulSoup(str(text), "html.parser")
    return soup.get_text()

# Clean the description column
df['description_clean'] = df[column_to_compare].apply(remove_html_malformed)
column_to_compare = 'description_clean'  # Use the cleaned column for further processing

# Initialize stemmer
stemmer = PorterStemmer()

# Preprocess the text data
def preprocess_text(text):
    text = str(text).lower()
    tokens = [stemmer.stem(word) for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return tokens

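# Expand query with WordNet synonyms
def expand_query(query):
    """Return the query tokens plus the stemmed WordNet lemmas of each token.

    Note: `query` is already stemmed, so tokens whose stem is not a WordNet
    entry get no synonyms; multi-word lemmas keep their underscores.
    """
    expanded_query = set(query)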
    for word in query:
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                expanded_query.add(stemmer.stem(lemma.name()))
    return list(expanded_query)

# Preprocess the data in the selected column
documents = df[column_to_compare].apply(preprocess_text).tolist()

# Statement to compare
statement = "Xtalk PCIE Express"  # Replace with your statement
query = preprocess_text(statement)

# Expand the query
expanded_query = expand_query(query)

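# Define a function to perform grid search for k1 and b
def grid_search_bm25(documents, query, k1_values, b_values, top_n=10):
    """Try every (k1, b) pair and keep the one with the highest mean top-n score.

    Caveat: raw BM25 scores are not directly comparable across different k1/b
    settings, so this criterion is a heuristic rather than a relevance metric.
    Rows are looked up in the module-level `df`.
    """
    best_k1 = None
    best_b = None
    best_score = float('-inf')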
    best_sightings = None

    for k1 in k1_values:
        for b in b_values:
            bm25 = BM25Okapi(documents, k1=k1, b=b)
            scores = bm25.get_scores(query)
            top_indices = scores.argsort()[-top_n:][::-1]
            top_sightings = df.iloc[top_indices]
            avg_score = scores[top_indices].mean()

            if avg_score > best_score:
                best_score = avg_score
                best_k1 = k1
                best_b = b
                best_sightings = top_sightings

    return best_k1, best_b, best_score, best_sightings

# Define ranges for k1 and b
k1_values = [1.2, 1.5, 1.8, 2.0]
b_values = [0.5, 0.75, 0.85]

# Perform grid search
best_k1, best_b, best_score, best_sightings = grid_search_bm25(documents, expanded_query, k1_values, b_values)

print(f"Best k1: {best_k1}, Best b: {best_b}, Best Average Score: {best_score}")
print(f"Top {len(best_sightings)} most similar sightings:")
print(best_sightings[column_to_compare])  # This prints only the cleaned descriptions

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\oscarahe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Best k1: 2.0, Best b: 0.85, Best Average Score: 9.042068019835407
Top 10 most similar sightings:
3      Observing completely closed eye once xtalk is ...
2      Observing failing Rx JTOL results at 85C .92V ...
125    As reported in a customer IPS and shown by our...
422    PCIE Gen5 Tx Jitter base measurements from EMR...
39     ** Cloned new sighting from [EMR A0 PO][2S]: P...
389    Description:System is not booting to OS with w...
419    EMR A0 PCIE Gen4 RxCEM test is having errors a...
136    RxLEQ data from XCC and MCC is showing that fo...
266    There is a shortfall in write BW observed when...
427    Actual Behavior:Socket Power Limit Indicator v...
Name: description_clean, dtype: object


In [10]:
import pandas as pd
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer
from transformers import BertTokenizer, BertModel
import torch
import nltk
from sklearn.metrics.pairwise import cosine_similarity
from bs4 import BeautifulSoup

# Ensure NLTK resources are available
nltk.download('wordnet')

# Load the CSV file
file_path = 'C:/Users/oscarahe/OneDrive - Intel Corporation/Desktop/Exceles/valgpt.csv'  # Replace with your file path
df = pd.read_csv(file_path)

# Columns to compare against
columns_to_compare = ['Failure Description', 'Conducted Tests', 'Theory']  # Add more columns as needed

# Initialize stemmer
stemmer = PorterStemmer()

# Preprocess the text data
def preprocess_text(text):
    text = str(text).lower()
    # Tokenize, stem, and remove stop words
    tokens = [stemmer.stem(word) for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return tokens

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

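# Generate embeddings using BERT (mean-pooled last hidden state)
def generate_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    # Inference only: disable gradient tracking to save memory
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)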

# Generate embeddings for all unique terms in the dataset
unique_terms = set()
for col in columns_to_compare:
    for text in df[col]:
        unique_terms.update(preprocess_text(text))

term_embeddings = {term: generate_embeddings(term) for term in unique_terms}

# Expand query using BERT embeddings
def expand_query_with_bert(query):
    query_embedding = generate_embeddings(query)
    similarities = {}

    for term, embedding in term_embeddings.items():
        similarity = cosine_similarity(query_embedding.detach().numpy(), embedding.detach().numpy())[0][0]
        similarities[term] = similarity

    # Select top N related terms based on similarity
    top_n = 5
    related_terms = sorted(similarities, key=similarities.get, reverse=True)[:top_n]
    
    # Combine the preprocessed query tokens with the related terms so that
    # both match the lowercased, stemmed document tokens
    expanded_query = preprocess_text(query) + related_terms
    return expanded_query

# Preprocess the data in the selected columns
documents_per_column = {col: df[col].apply(preprocess_text).tolist() for col in columns_to_compare}

# Statement to compare
statement = "Crostalk"  # Replace with your statement
query = preprocess_text(statement)

# Expand the query using BERT
expanded_query = expand_query_with_bert(statement)

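# Define a function to perform grid search for k1 and b
def grid_search_bm25(documents_per_column, query, k1_values, b_values, top_n=5):
    """Try every (k1, b) pair, summing BM25 scores across columns, and keep
    the pair with the highest mean top-n combined score.

    Caveat: raw BM25 scores are not directly comparable across different k1/b
    settings, so this criterion is a heuristic rather than a relevance metric.
    Rows are looked up in the module-level `df`.
    """
    best_k1 = None
    best_b = None
    best_score = float('-inf')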
    best_sightings = None

    for k1 in k1_values:
        for b in b_values:
            combined_scores = None

            for col, documents in documents_per_column.items():
                bm25 = BM25Okapi(documents, k1=k1, b=b)
                scores = bm25.get_scores(query)

                if combined_scores is None:
                    combined_scores = scores
                else:
                    combined_scores += scores

            top_indices = combined_scores.argsort()[-top_n:][::-1]
            top_sightings = df.iloc[top_indices]
            avg_score = combined_scores[top_indices].mean()

            if avg_score > best_score:
                best_score = avg_score
                best_k1 = k1
                best_b = b
                best_sightings = top_sightings

    return best_k1, best_b, best_score, best_sightings

# Define ranges for k1 and b
k1_values = [1.2, 1.5, 1.8, 2.0]
b_values = [0.5, 0.75, 0.85]

# Perform grid search
best_k1, best_b, best_score, best_sightings = grid_search_bm25(documents_per_column, expanded_query, k1_values, b_values)

print(f"Best k1: {best_k1}, Best b: {best_b}, Best Average Score: {best_score}")
print("Top 5 most similar sightings:")
print(best_sightings)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\oscarahe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Best k1: 2.0, Best b: 0.85, Best Average Score: 3.0754855373393717
Top 5 most similar sightings:
             ID                                Failure Description    Status  \
12  14017827596  PI5 pcode_sa fuse disable incorrectly programmed.  Complete   
29  14018570336  Host 2971 reports width degradation or links g...  Complete   
8   14020175013  When tuning the EMR EPP settings for the impro...  Rejected   
10  14017677523  The first EMR fused part provisioned with debu...  Complete   
7   14019985093  From the A1 VV 100 Boot Training Repeatability...  Rejected   

                                               Theory  \
12  Wrong fuses provided by HVM, root caused by An...   
29  Test card issue, need to follow up with that t...   
8   Tuning changes at this point in the project ne...   
10  Root Cause: lockout bit for S3M was not fused....   
7                                       Not a defect.   

                                      Conducted Tests  
12  Booted both parts w