# Information Retrieval

Information retrieval (IR) is a long-standing area of computer science concerned with finding relevant documents in large collections in response to user queries. The most common and time-tested method is `keyword search`, where documents are indexed by the words they contain. When a user submits a query, the system retrieves the documents that match the query's keywords.
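As a rough illustration of keyword search, the sketch below builds a tiny inverted index (word to document ids) and answers a query by intersecting the posting sets. The function names, the naive tokenizer, and the AND semantics are assumptions made for this example, not something the course prescribes.

```python
from collections import defaultdict

PUNCT = ".,!?;:"

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents containing every query keyword (AND semantics)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are great pets.",
]
index = build_inverted_index(docs)
print(sorted(keyword_search(index, "sat cat")))  # prints [0]
```

Note that this exact-match approach treats "cat" and "cats" as different words, which is one of the limitations that motivates the scoring and semantic techniques discussed in the rest of these notes.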

In the age of large language models (LLMs), the landscape of IR is evolving. LLMs can understand and generate human-like text, allowing for more sophisticated interactions. Instead of just matching keywords, LLMs can comprehend the `context and semantics` of a query, leading to more relevant and nuanced results.

## Search Techniques




![Search Techniques](./resource/search_technique.png)

*Screenshot from the course showing different search techniques including keyword search and semantic search with metadata filtering*


## TF-IDF
TF-IDF, or Term Frequency-Inverse Document Frequency, is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It combines two components:

- `Term Frequency (TF)`: Measures how frequently a term appears in a document. The more often a term appears, the more relevant it is considered to be for that document.

- `Inverse Document Frequency (IDF)`: Measures how important a term is across the entire corpus. A term that appears in many documents is less informative than a term that appears in only a few. IDF is calculated as the logarithm of the total number of documents divided by the number of documents containing the term.

The final TF-IDF score is the product of these two components, providing a balanced measure of a term's relevance in a specific document relative to the entire corpus.

### How to calculate TF-IDF:
```math
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
```
Where:
- `t` is the term,

- `d` is the document,

- `TF(t, d)` is the term frequency of term `t` in document `d`, calculated as:
```math
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
```
- `IDF(t)` is the inverse document frequency of term `t`, calculated as:
```math
\text{IDF}(t) = \log\left(\frac{\text{Total number of documents in corpus}}{\text{Number of documents containing term } t}\right)
```

The TF-IDF score is higher for terms that are frequent in a document but rare across the corpus, making it a powerful tool for identifying relevant documents in information retrieval tasks.
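The formulas above can be sketched directly in Python. This is a minimal illustration, not a reference implementation: the function names and the toy corpus are made up for the example, documents are assumed to be pre-tokenized lists of words, and the base-10 logarithm is an assumption chosen to match the worked example in this section.

```python
import math
from collections import Counter

def term_frequency(term, doc_tokens):
    """Occurrences of the term divided by the total number of terms in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

def inverse_document_frequency(term, corpus_tokens):
    """log10(N / n_t); returns 0.0 when no document contains the term (assumed convention)."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    if n_containing == 0:
        return 0.0
    return math.log10(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    """Product of TF and IDF for one term in one document."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)

# Hypothetical toy corpus, already tokenized.
corpus = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "dog", "pets"],
    ["bird", "sings"],
]

# "sat" and "mat" each occur once in the first document, but "mat" is rarer
# in the corpus, so it receives the higher score.
for term in ("sat", "mat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```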

For example, suppose a document contains the term "AI" 5 times out of 100 total terms, and "AI" appears in 10 out of 1000 documents in the corpus. Using a base-10 logarithm, the TF-IDF score for "AI" in that document is:
```math
\text{TF-IDF}(\text{"AI"}, d) = \left(\frac{5}{100}\right) \times \log_{10}\left(\frac{1000}{10}\right) = 0.05 \times \log_{10}(100) = 0.05 \times 2 = 0.1
```
In comparison, if another term "ML" appears 2 times in the same document and is present in 100 out of 1000 documents, its TF-IDF score is:
```math
\text{TF-IDF}(\text{"ML"}, d) = \left(\frac{2}{100}\right) \times \log_{10}\left(\frac{1000}{100}\right) = 0.02 \times \log_{10}(10) = 0.02 \times 1 = 0.02
```
"AI" scores five times higher than "ML" because it is both more frequent in the document and rarer across the corpus, so TF-IDF treats it as the more characteristic term for this document.
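The arithmetic in this example can be checked with a few lines of Python. The helper name `tf_idf_from_counts` is made up for this sketch. It also illustrates a point the example leaves implicit: switching the logarithm base (here to the natural log) rescales every score by the same constant, so the relative ordering of terms does not change.

```python
import math

def tf_idf_from_counts(term_count, doc_length, num_docs, docs_with_term, log=math.log10):
    """TF-IDF computed from raw counts; `log` selects the logarithm base."""
    tf = term_count / doc_length
    idf = log(num_docs / docs_with_term)
    return tf * idf

# Numbers from the worked example, base-10 logarithm.
ai_score = tf_idf_from_counts(5, 100, 1000, 10)
ml_score = tf_idf_from_counts(2, 100, 1000, 100)
print(f"AI: {ai_score:.2f}")  # prints AI: 0.10
print(f"ML: {ml_score:.2f}")  # prints ML: 0.02

# Same counts with the natural logarithm: different magnitudes, same ratio.
ai_ln = tf_idf_from_counts(5, 100, 1000, 10, log=math.log)
ml_ln = tf_idf_from_counts(2, 100, 1000, 100, log=math.log)
print(round(ai_score / ml_score, 6), round(ai_ln / ml_ln, 6))
```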



## BM25
BM25, or Best Matching 25, is an advanced ranking function used in information retrieval that builds upon the TF-IDF model. It incorporates several enhancements to improve the relevance of search results:
- `Term Frequency Saturation`: Unlike TF-IDF, BM25 applies a saturation effect to term frequency, meaning that the relevance of a term increases with its frequency but at a diminishing rate. This prevents overly frequent terms from disproportionately influencing the score.

- `Document Length Normalization`: BM25 normalizes the term frequency by considering the length of the document. This helps to ensure that longer documents do not unfairly receive higher scores simply due to their length.

- `IDF Component`: BM25 uses a smoothed variant of the IDF component that adds 0.5 to both the numerator and the denominator. This keeps the value well-defined for terms that appear in no documents or in every document, and it adjusts the calculation to better reflect the rarity of a term across the corpus.
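The first two effects can be seen by evaluating only the term-frequency factor of the BM25 formula given later in this section. The sketch below is illustrative: the function name `bm25_tf_factor` is made up, and the defaults `k1=1.2` and `b=0.75` are the common values quoted in these notes.

```python
def bm25_tf_factor(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """TF part of BM25: tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))."""
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + norm)

# Saturation: each extra occurrence adds less, and the factor never exceeds k1 + 1.
for tf in (1, 2, 5, 10, 100):
    print(tf, round(bm25_tf_factor(tf, doc_len=100, avg_len=100), 3))

# Length normalization: the same count is worth more in a shorter document.
short_doc = bm25_tf_factor(3, doc_len=50, avg_len=100)
long_doc = bm25_tf_factor(3, doc_len=200, avg_len=100)
print(short_doc > long_doc)  # prints True
```

Setting `b=0` switches length normalization off entirely, while a larger `k1` makes the curve saturate more slowly.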

### Advantages of BM25:

- `Improved Relevance`: By combining term frequency saturation and document length normalization, BM25 provides more accurate relevance scores compared to traditional TF-IDF.

- `Flexibility`: BM25 allows for tuning parameters (`k_1` and `b`) that can be adjusted based on the specific characteristics of the corpus, making it adaptable to different types of documents and queries.

- `Widely Used`: BM25 is a standard in information retrieval and is implemented in many search engines and libraries, making it a reliable choice for building search systems.


![BM25](./resource/BM25.jpg)
*Screenshot from the course showing the BM25 formula and its components*

### How to calculate BM25:
```math
\text{BM25}(t, d) = \text{IDF}(t)\times \frac{\text{TF}(t, d) \times (k_1 + 1)}{\text{TF}(t, d) + k_1 \times (1 - b + b \times \frac{\text{Length}(d)}{\text{Avg.Length}})}
```

Where:
- `t` is the term,

- `d` is the document,

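- `TF(t, d)` is the term frequency of term `t` in document `d` (in BM25 this is typically the raw occurrence count, since document length is handled separately through `b`),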

- `IDF(t)` is the inverse document frequency of term `t`, calculated as:
```math
\text{IDF}(t) = \log\left(\frac{\text{Total number of documents in corpus} - \text{Number of documents containing term } t + 0.5}{\text{Number of documents containing term } t + 0.5}\right)
```
- `k_1` and `b` are tuning parameters that control the term frequency saturation and document length normalization, respectively. Common values are `k_1 = 1.2` and `b = 0.75`.
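Putting the pieces together, the formula can be written out in plain Python. This is a from-scratch sketch for illustration only: the function names and toy corpus are made up, documents are assumed to be pre-tokenized, `TF` is taken as the raw count, and the natural logarithm is an assumption. The score for a multi-term query is taken as the sum of the per-term scores.

```python
import math
from collections import Counter

def bm25_idf(term, corpus_tokens):
    """log((N - n_t + 0.5) / (n_t + 0.5)), as in the formula above."""
    n_docs = len(corpus_tokens)
    n_t = sum(1 for doc in corpus_tokens if term in doc)
    return math.log((n_docs - n_t + 0.5) / (n_t + 0.5))

def bm25_score(query_tokens, doc_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Sum of per-term BM25 scores for one document (corpus must be non-empty)."""
    avg_len = sum(len(d) for d in corpus_tokens) / len(corpus_tokens)
    counts = Counter(doc_tokens)
    norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
    score = 0.0
    for term in query_tokens:
        tf = counts[term]  # raw count of the term in this document
        if tf == 0:
            continue
        score += bm25_idf(term, corpus_tokens) * tf * (k1 + 1) / (tf + norm)
    return score

# Hypothetical toy corpus, already tokenized and stopword-filtered.
corpus = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cats", "dogs", "great", "pets"],
    ["dogs", "loyal", "friendly"],
    ["cats", "independent", "curious"],
]
query = ["cat", "dog"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
for i, s in enumerate(scores, start=1):
    print(f"Document {i}: {s:.3f}")
```

One caveat with this IDF form: it becomes negative for a term that appears in more than half of the documents. Practical implementations commonly adjust for this, for example by flooring the value or adding a constant inside the logarithm.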


BM25 is particularly effective when relevance depends on both how often query terms occur and how long each document is. Because it accounts for both, it generally ranks documents more accurately than plain TF-IDF, which is why it is widely used in search engines and information retrieval systems.



### Python Implementation of BM25


```bash
pip install rank_bm25 nltk
```



```python
import string

from nltk import download
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from rank_bm25 import BM25Okapi

download('punkt')      # tokenizer models used by word_tokenize
download('punkt_tab')  # newer NLTK releases look for this resource instead of 'punkt'
download('stopwords')  # common words to filter out

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are great pets.",
    "Dogs are loyal and friendly.",
    "Cats are independent and curious."
]

stop_words = set(stopwords.words('english'))


def preprocess(text):
    """Lowercase, tokenize, and drop stopwords and punctuation."""
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t not in stop_words and t not in string.punctuation]


# Preprocess documents and initialize BM25
tokenized_docs = [preprocess(doc) for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Sample query. No stemming is applied, so "cat" does not match "cats".
query = "cat and dog"
query_tokens = preprocess(query)

# Get and print the BM25 score of every document for the query
scores = bm25.get_scores(query_tokens)
for i, score in enumerate(scores):
    print(f"Document {i+1}: {score:.2f}")
```
