Impact of k1 and b on Relevance
Parameter k1:
Term frequency: k1 controls how strongly a term's frequency in a document influences the relevance score. With a high k1, the score keeps growing as a term is repeated, so documents that mention the term many times rank higher.
Saturation: A low k1 makes the term-frequency contribution level off quickly, which is useful when you don't want documents that repeat a term many times to dominate the results.
Parameter b:
Document length: b controls how much document length affects relevance. A high b (close to 1) penalizes documents that are longer than the corpus average, on the assumption that extra length dilutes the concentration of relevant terms.
Normalization: A low b (close to 0) reduces the effect of length, which is useful when your documents are of similar length or when length is not an important factor in relevance.
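
To make the roles of k1 and b concrete, the sketch below evaluates the term-frequency part of the Okapi BM25 formula on its own, leaving out the IDF factor. The function name `bm25_tf_weight`, the document lengths, and the default parameter values are assumptions chosen for illustration; this is not the code path of any particular library.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Term-frequency part of the Okapi BM25 score for one term (IDF omitted)."""
    # b blends between no length normalization (b=0) and full normalization (b=1)
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Saturation: same average-length document, increasing term counts
for k1 in (0.4, 2.8):
    weights = [round(bm25_tf_weight(tf, 100, 100, k1=k1), 2) for tf in (1, 2, 5, 20)]
    print(f"k1={k1}: {weights}")

# Length normalization: same term count, document twice the average length
for b in (0.0, 0.2, 1.0):
    print(f"b={b}: {bm25_tf_weight(3, 200, 100, b=b):.2f}")
```

For an average-length document the weight is bounded by k1 + 1, so k1 = 0.4 levels off near 1.4 while k1 = 2.8 keeps rewarding repetition. With b = 0 the length has no effect, and with b = 1 the document that is twice the average length is penalized most.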

Term preprocessing is a crucial step when preparing text for analysis and search, especially with an algorithm such as BM25. Each part of the preprocessing is explained below:
Lowercasing:
Why: It makes text comparisons consistent. For example, "File" and "file" are treated as the same term.
How: Python's .lower() string method converts all text to lowercase.
Stop-Word Removal:
Why: Stop words are common terms that add little meaning to a search, such as "the", "of", "and", etc. Removing them focuses the analysis on more meaningful terms.
How: A stop-word list, such as sklearn.feature_extraction.text.ENGLISH_STOP_WORDS, is used to filter these terms out of the text. Note that this list covers English only.
Stemming:
Why: Stemming reduces words to a root form, which groups different forms of a word under a single term. For example, "updated", "updating", and "updates" are reduced to a shared stem such as "updat", which need not be a dictionary word.
How: A stemming algorithm, such as NLTK's PorterStemmer, transforms each word into its stem.
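
The three steps can be sketched end to end as follows. To keep the example free of third-party dependencies, it uses a tiny hypothetical stop-word set in place of ENGLISH_STOP_WORDS and a naive suffix stripper in place of PorterStemmer; both stand-ins are assumptions for illustration and are far cruder than the real components.

```python
# Hypothetical stand-ins so the sketch runs with the standard library only
TOY_STOP_WORDS = {"the", "a", "is", "was", "and", "of", "to"}
TOY_SUFFIXES = ("ing", "ed")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    tokens = text.lower().split()                             # 1. lowercase and split
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]   # 2. drop stop words
    return [toy_stem(t) for t in tokens]                      # 3. stem what is left


print(preprocess("The Server was updating and the server updated"))
# ['server', 'updat', 'server', 'updat']
```

After preprocessing, "Server" and "server" compare equal and both verb forms collapse to one token, so a query containing either form matches documents containing the other.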
Importance of Preprocessing
Preprocessing makes search algorithms more effective by:
Reducing variability in the text.
Focusing the search on relevant terms.
Improving matching by grouping different forms of a word.  -->  

In [9]:
import pandas as pd
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
import nltk
from bs4 import BeautifulSoup

# Ensure NLTK resources are available
nltk.download('wordnet')
nltk.download('omw-1.4')

# Load the CSV file
file_path = 'C:/Users/oscarahe/OneDrive - Intel Corporation/Desktop/Exceles/query2.csv'  # Update path as needed
df = pd.read_csv(file_path)

# Select only the columns you want to analyze
columns_to_keep = [
    "id","rev","submitted_date","closed_date","owner","priority","release","component","family","status",
    "sighting.conclusion","title","status_reason","release_affected","component_affected","sighting.test_found",
    "from_subject","description","comments","sighting_central.sighting.root_cause_description"
]
df_new = df[columns_to_keep].copy()

# Remove HTML tags from the "description" column
def remove_html(text):
    if not isinstance(text, str):
        return ""
    return BeautifulSoup(text, "html.parser").get_text()

df_new['description'] = df_new['description'].apply(remove_html)

# Columns to score the query against. 'id' is numeric, so it only contributes
# when a query token is itself an id.
columns_to_compare = ['id', 'title', 'description', 'status_reason']

# Initialize stemmer
stemmer = PorterStemmer()

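# Preprocess the text data: lowercase, drop English stop words, then stem.
# Tokens are split on whitespace only, so punctuation stays attached (e.g. "sockets.").
def preprocess_text(text):
    text = str(text).lower()
    tokens = [stemmer.stem(word) for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return tokens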

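# Expand query with WordNet synonyms.
# The incoming tokens are already stemmed, so the lookup only finds synsets
# for stems that are still valid words (e.g. "system" but not "throttl").
def expand_query(query):
    expanded_query = set(query)
    for word in query:
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                # Multi-word lemma names are joined with underscores
                expanded_query.add(stemmer.stem(lemma.name()))
    return list(expanded_query)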

# Preprocess the data in the selected columns
documents_per_column = {col: df_new[col].apply(preprocess_text).tolist() for col in columns_to_compare}

# Statement to compare
statement = "The problem arises with EMR systems experiencing throttling due to external prochot when configured with 2 sockets. This issue is resolved when either the second socket is removed or the UPI link is disabled, converting the setup to a single-socket configuration. During the occurrence of this issue, there is a notable decrease in core frequency."
query = preprocess_text(statement)

# Expand the query
expanded_query = expand_query(query)

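# Define a function to perform grid search for k1 and b.
# Caveat: the selection criterion is the mean raw BM25 score of the top_n rows.
# Raw scores are not comparable across (k1, b) settings and generally grow with
# k1 for repeated terms, so this favors large k1 rather than ranking quality.
def grid_search_bm25(documents_per_column, query, k1_values, b_values, top_n=5):
    best_k1 = None
    best_b = None
    best_score = -1
    best_sightings = None

    for k1 in k1_values:
        for b in b_values:
            combined_scores = None

            # Sum the per-column BM25 scores so each row gets one combined score
            for col, documents in documents_per_column.items():
                bm25 = BM25Okapi(documents, k1=k1, b=b)
                scores = bm25.get_scores(query)
                # Empty documents can yield 0/0 = NaN when b == 1.0; treat them as
                # no match, otherwise argsort() would rank the NaN rows first
                scores[pd.isna(scores)] = 0.0

                if combined_scores is None:
                    combined_scores = scores
                else:
                    combined_scores += scores

            # Indices of the top_n highest combined scores, best first
            top_indices = combined_scores.argsort()[-top_n:][::-1]
            top_sightings = df_new.iloc[top_indices]  # df_new comes from the enclosing scope
            avg_score = combined_scores[top_indices].mean()

            if avg_score > best_score:
                best_score = avg_score
                best_k1 = k1
                best_b = b
                best_sightings = top_sightings

    return best_k1, best_b, best_score, best_sightings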

# Define ranges for k1 and b
k1_values = [0.4, 0.6, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
b_values = [0.20, 0.35, 0.5, 0.75, 0.85, 1.0]

# Perform grid search
best_k1, best_b, best_score, best_sightings = grid_search_bm25(documents_per_column, expanded_query, k1_values, b_values)

print(f"Best k1: {best_k1}, Best b: {best_b}, Best Average Score: {best_score}")
print("Top 5 most similar sightings:")
print(best_sightings[columns_to_compare])  # Only show the columns you compared

# Optionally, save results
best_sightings
#best_sightings.to_csv("top5_sightings.csv", index=False)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\oscarahe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\oscarahe\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
  score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /


Best k1: 2.8, Best b: 0.2, Best Average Score: 47.92918812056532
Top 5 most similar sightings:
              id                                              title  \
0    14018151333  [EMR] EMR system  Throttles due to External pr...   
38   14018527664  [EMR_A0][SVOS][Interrupts] Logical MSIs from P...   
136  14018423542  [EMR A0][VV][Coherency] Legacy APIC Mode is no...   
476  14018467522  [EMR-XCC][A0 VIS][Incoming DPM][PM] Units with...   
245  14019359111  [EMR MCC VV][SAE22A]  Counted power throttle c...   

                                           description  \
0    We booted 2 separate EMR systems (64C350W and ...   
38   The conclusion about Legacy APIC in EMR is tha...   
136  The Coherence/Concurrency team considers that ...   
476  Promoting parent presighting to sysdebug per E...   
245  System was 4539 a SAE22A - HW config of 1Dimm ...   

                 status_reason  
0        rejected.not_a_defect  
38          complete.validated  
136      rejected.not_a_defect  
476  rejected.filed_by_mistake  
245                   complete  

Unnamed: 0,id,rev,submitted_date,closed_date,owner,priority,release,component,family,status,sighting.conclusion,title,status_reason,release_affected,component_affected,sighting.test_found,from_subject,description,comments,sighting_central.sighting.root_cause_description
0,14018151333,23,12/6/2022 9:04,12/19/2022 3:09,adabney,2-high,emrsp-xcc-a0,hw.power,Emerald Rapids-SP-XNC Die,rejected,hw.bug,[EMR] EMR system Throttles due to External pr...,rejected.not_a_defect,emrsp-xcc-a0,hw.power,focus test,,We booted 2 separate EMR systems (64C350W and ...,++++1468929289 srotich\nperf limit reason on p...,Problem resolved after re-flashing the MAIN FP...
38,14018527664,77,1/26/2023 5:48,10/10/2023 21:56,avalcara,3-medium,emrsp-xcc-a0,bios,Emerald Rapids-SP-XNC Die,complete,env.bug,[EMR_A0][SVOS][Interrupts] Logical MSIs from P...,complete.validated,emrsp-xcc-a0,bios,rocket,,The conclusion about Legacy APIC in EMR is tha...,"++++1469135325 jhtran2\n@Victoria Alcaraz, Ale...",Known limitation of APIC for threads above 255
136,14018423542,58,1/13/2023 17:46,1/24/2023 23:01,everasan,3-medium,emrsp-xcc-a0,hw.cha,Emerald Rapids-SP-XNC Die,rejected,not_a_bug,[EMR A0][VV][Coherency] Legacy APIC Mode is no...,rejected.not_a_defect,emrsp-xcc-a0,hw.cha,supercollider,,The Coherence/Concurrency team considers that ...,++++1469084436 avalcara\nOnly to highlight the...,
476,14018467522,6,1/19/2023 18:15,1/19/2023 19:45,scerdasr,1-showstopper,emrsp-xcc-a0,hw.pma,Emerald Rapids-SP-XNC Die,rejected,no_root_cause.rejected,[EMR-XCC][A0 VIS][Incoming DPM][PM] Units with...,rejected.filed_by_mistake,emrsp-xcc-a0,hw.pma,baremetal,,Promoting parent presighting to sysdebug per E...,++++1469100116 daalonso\nNew record was filled...,
245,14019359111,93,5/15/2023 20:39,7/31/2023 18:02,jaimeihe,3-medium,emrsp-xcc-a0,hw.memory,Emerald Rapids-SP-XNC Die,complete,hw.bug,[EMR MCC VV][SAE22A] Counted power throttle c...,complete,emrsp-xcc-a0,hw.memory,svos,sighting,System was 4539 a SAE22A - HW config of 1Dimm ...,++++2269520427 jaimeihe\nThere isn't other tes...,Sub-channel 1 counts power throttling too much.
