# CS533 HW1
In this assignment, you will implement an information retrieval system using a combination of traditional
retrieval methods and word embeddings. Specifically, you will be integrating TF-IDF and BM25 ranking
methods with FastText embeddings. 

- Your task involves three different cases for using FastText: 

    (1) training a FastText model from scratch on the CISI dataset, 

    (2) using a pre-trained FastText model, and 

    (3) fine-tuning a pre-trained FastText model on the CISI dataset. 
    
- You will rank documents for the given queries based on a combination of their traditional retrieval scores and the similarity between their embedding vectors. The performance of your retrieval system will be evaluated using the Mean Average Precision (MAP) metric.

## Part 1: Dataset Preprocessing

* Load the CISI dataset from the provided CSV files. The dataset consists of:
    
    Documents: Represented by document id, title, and text.

    Queries: List of natural language questions represented by query ids and texts.
    
    Ground Truth: Contains relevance judgments for query-document pairs.

* Preprocess the dataset. This may include tokenizing the text, removing stopwords and punctuation marks, and lowercasing the text. Explain your preprocessing steps.
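As one possible starting point, the preprocessing steps listed above could be sketched as follows. The `preprocess` helper and the tiny `STOPWORDS` set are illustrative assumptions, not part of the assignment; a real run would likely use a fuller stopword list and possibly a more careful tokenizer.

```python
import re

# Tiny illustrative stopword list (an assumption); a fuller list such as
# NLTK's or scikit-learn's English stopwords would normally be used instead.
STOPWORDS = {"a", "an", "the", "of", "and", "is", "in", "to", "for", "on"}

def preprocess(text):
    """Lowercase, replace punctuation with spaces, split, drop stopwords."""
    if not isinstance(text, str):  # e.g. NaN titles in documents.csv
        return []
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep only letters, digits, whitespace
    return [tok for tok in text.split() if tok not in STOPWORDS]

# One of the titles shown in the documents preview
tokens = preprocess("Systems Analysis of a University Library;")
print(tokens)
```

Returning an empty list for non-string input is a design choice made here so that missing titles or authors do not raise errors when the function is applied to a whole column.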

In [6]:
import pandas as pd
import os

In [None]:
# Define the paths
base_path = os.getcwd()
cisi_path = os.path.join(base_path, "CISI")

# Load the datasets
documents = pd.read_csv(os.path.join(cisi_path, "documents.csv"))
queries = pd.read_csv(os.path.join(cisi_path, "queries.csv"))
ground_truth = pd.read_csv(os.path.join(cisi_path, "ground_truth.csv"))

# Print data summaries
print("Documents:")
print(documents.head())
print("\nQueries:")
print(queries.head())
print("\nGround Truth:")
print(ground_truth.head())

Documents:
   Unnamed: 0  doc_id                                             title  \
0           0       1  18 Editions of the Dewey Decimal Classifications   
1           1       2                                               NaN   
2           2       3                                Two Kinds of Power   
3           3       4        Systems Analysis of a University Library;    
4           4       5                        A Library Management Game:   

           author                                               text  
0  Comaromi, J.P.     The present study is a history of the DEWEY...  
1             NaN  This report is an analysis of 6300 acts of use...  
2      Wilson, P.      The relationships between the organization...  
3  Buckland, M.K.      The establishment of nine new universities...  
4      Brophy, P.      Although the use of games in professional ...  

Queries:
   Unnamed: 0  query_id                                               text
0           0         1  Wh


## Part 2: Word Embedding
1.  Training FastText: 
    Use the gensim library to train a FastText model on the CISI dataset. Ensure that both the text from the queries and the documents are used for the training process. This approach will help the model learn the specific vocabulary and context present in the dataset.
2. Using a Pre-trained FastText Model: 
    Instead of starting from scratch, you can leverage pre-existing knowledge by using a pre-trained FastText model. Download a pre-trained FastText model, such as “cc.en.300.bin” (English), from the FastText website or using the fasttext library. The pre-trained model can be loaded using the gensim library for direct use in your retrieval system.
3. Fine-tuning a Pre-trained FastText Model: 
    Further improve a pre-trained FastText model by fine-tuning it on the CISI dataset. This involves continuing the training process using the text from both the queries and the documents, allowing the model to adapt and better capture the characteristics and domain-specific vocabulary of the dataset.

## Part 3: Retrieval Task
In this part, you will focus on the core of information retrieval: fetching relevant documents based on a user's
query. The challenge lies in effectively combining traditional retrieval methods with modern word embeddings
to enhance the accuracy of the results.
1. Embedding Computation: 
    Implement a method to derive the embedding for an input (a query or a document) by averaging the embeddings of its constituent words.
2. Retrieval System: Construct a retrieval system that:

    - Retrieves the top 10 documents for a given query.
    - Combines scores from TF-IDF, BM25, and word embeddings to rank documents. You can use TfidfVectorizer from sklearn.feature_extraction.text and BM25Okapi from rank_bm25 to obtain the TF-IDF and BM25 scores. Use cosine similarity to calculate the embedding scores. The overall score for a query-document pair is the weighted average of the TF-IDF, BM25, and embedding scores.
    - Allows for the utilization of three FastText models: (1) the FastText model you trained from scratch on the CISI dataset, (2) a pre-trained FastText model, and (3) the fine-tuned version of the pre-trained model using the CISI dataset.
    - Provides the flexibility to set weights for TF-IDF, BM25, and embedding scores to compute a combined score.
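The embedding computation and the weighted combination described above could be sketched as follows. This is a minimal illustration under several assumptions: a plain dictionary stands in for the FastText lookup, the TF-IDF and BM25 scores are made-up numbers rather than output from `TfidfVectorizer` or `BM25Okapi`, and the min-max normalization step is an added design choice (not required by the assignment) so that scores on different scales contribute comparably. All function names are hypothetical.

```python
import numpy as np

def average_embedding(tokens, lookup, dim):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    vectors = [lookup[t] for t in tokens if t in lookup]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def min_max(scores):
    """Rescale scores to [0, 1]; all-equal scores map to zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def combine_and_rank(tfidf, bm25, emb, weights=(1.0, 1.0, 1.0), k=10):
    """Weighted average of normalized scores; weights must not all be zero."""
    w = np.asarray(weights, dtype=float)
    stacked = np.vstack([min_max(tfidf), min_max(bm25), min_max(emb)])
    combined = w @ stacked / w.sum()
    top = np.argsort(-combined)[:k]  # indices of the k highest scores
    return top.tolist(), combined

# Toy 2-dimensional "embeddings" (assumed values, not real FastText vectors)
toy = {"library": np.array([1.0, 0.0]), "systems": np.array([0.0, 1.0])}
docs = [["library", "systems"], ["library"], ["unknown"]]

query_vec = average_embedding(["library"], toy, dim=2)
emb = [cosine_similarity(query_vec, average_embedding(d, toy, dim=2)) for d in docs]

tfidf = [0.10, 0.40, 0.00]  # made-up TF-IDF scores for the three documents
bm25 = [2.0, 5.0, 1.0]      # made-up BM25 scores
ranking, combined = combine_and_rank(tfidf, bm25, emb, weights=(0.3, 0.3, 0.4), k=2)
print(ranking, combined)
```

Setting a weight to zero removes that signal entirely, so weights such as `(1, 0, 0)` or `(0, 0, 1)` give the isolated TF-IDF or embedding rankings needed for the evaluation.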

## Part 4: Evaluation
Evaluation is crucial in information retrieval to ensure that the system meets user expectations. In this part, you
will assess the performance of your retrieval system under various configurations. This includes experimenting
with different possibilities of combining FastText embeddings with traditional TF-IDF and BM25 methods, as
well as using each of them in isolation.
- Performance Metrics: Evaluate the efficacy of your retrieval system using the Mean Average Precision
(MAP) metric.
- Comparative Analysis: Contrast the performance metrics when employing the three FastText models:

    (1) the FastText model trained from scratch on the CISI dataset, 

    (2) the pre-trained FastText model, and

    (3) the fine-tuned FastText model.

    Analyze the potential reasons for observed differences or similarities in the results.

- Experimentation: Experiment with different combinations of weights for TF-IDF, BM25, and embedding scores. Also, test the performance of each method in isolation. Report your observations and the MAP scores for each scenario.
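For reference, MAP could be computed along the lines of the sketch below. The function names and the dictionary layout (query id mapped to a ranked list of document ids, and query id mapped to its relevant document ids) are assumptions for illustration. Average precision is normalized here by the total number of relevant documents for the query; some setups instead normalize by the smaller of that number and the cutoff, so the convention used should be stated in the report.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank over the ranks at which relevant docs appear."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(rankings, ground_truth):
    """Mean of per-query average precision over all ranked queries."""
    aps = [average_precision(ranked, ground_truth.get(q, []))
           for q, ranked in rankings.items()]
    return sum(aps) / len(aps) if aps else 0.0

# Made-up rankings and relevance judgments for two queries
rankings = {1: [3, 7, 5], 2: [2, 9]}
ground_truth = {1: [3, 5], 2: [9, 4]}
map_score = mean_average_precision(rankings, ground_truth)
print(round(map_score, 4))
```

In this toy case, query 1 has relevant documents at ranks 1 and 3, and query 2 has one of its two relevant documents at rank 2, so the two average precisions are 5/6 and 1/4.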