# CS533 HW1
In this assignment, you will implement an information retrieval system using a combination of traditional
retrieval methods and word embeddings. Specifically, you will be integrating TF-IDF and BM25 ranking
methods with FastText embeddings. 

- Your task involves three different cases for using FastText: 

    (1) training a FastText model from scratch on the CISI dataset, 

    (2) using a pre-trained FastText model, and 

    (3) fine-tuning a pre-trained FastText model on the CISI dataset. 
    
- You will rank documents for the given queries based on a combination of their traditional retrieval scores and the similarity between their embedding vectors. The performance of your retrieval system will be evaluated using the Mean Average Precision (MAP) metric.

## Part 1: Dataset Preprocessing

* Load the CISI dataset from the provided CSV files. The dataset consists of:
    
    Documents: Represented by document id, title, and text.

    Queries: List of natural language questions represented by query ids and texts.
    
    Ground Truth: Contains relevance judgments for query-document pairs.

* Preprocess the dataset. This may include tokenizing the text, removing stopwords and punctuation marks, and lowercasing the text. Explain your preprocessing steps.

In [66]:
import numpy as np
import pandas as pd
import os
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from gensim.models import FastText, KeyedVectors
import fasttext

In [45]:
# Define the paths
base_path = os.getcwd()
cisi_path = os.path.join(base_path, "CISI")

# Load the datasets
documents = pd.read_csv(os.path.join(cisi_path, "documents.csv"))
queries = pd.read_csv(os.path.join(cisi_path, "queries.csv"))
ground_truth = pd.read_csv(os.path.join(cisi_path, "ground_truth.csv"))

# Print data summaries
print("Documents:")
print(documents.head())
print("\nQueries:")
print(queries.head())
print("\nGround Truth:")
print(ground_truth.head())

Documents:
   Unnamed: 0  doc_id                                             title  \
0           0       1  18 Editions of the Dewey Decimal Classifications   
1           1       2                                               NaN   
2           2       3                                Two Kinds of Power   
3           3       4        Systems Analysis of a University Library;    
4           4       5                        A Library Management Game:   

           author                                               text  
0  Comaromi, J.P.     The present study is a history of the DEWEY...  
1             NaN  This report is an analysis of 6300 acts of use...  
2      Wilson, P.      The relationships between the organization...  
3  Buckland, M.K.      The establishment of nine new universities...  
4      Brophy, P.      Although the use of games in professional ...  

Queries:
   Unnamed: 0  query_id                                               text
0           0         1  Wh

In [46]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mahmu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mahmu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mahmu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\mahmu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [47]:
# Build these once; recreating them for every document is slow
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    # Ensure the input is a string, else return an empty string
    if not isinstance(text, str):
        return ""
    # Lowercasing
    text = text.lower()
    # Removing punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenizing
    tokens = word_tokenize(text)
    # Removing stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Lemmatization
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]
    # Join tokens back into a single string
    return ' '.join(tokens)


In [48]:
# Apply preprocessing to the text column
documents['processed_text'] = documents['text'].apply(preprocess_text)

# View the processed data
print(documents.head())

   Unnamed: 0  doc_id                                             title  \
0           0       1  18 Editions of the Dewey Decimal Classifications   
1           1       2                                               NaN   
2           2       3                                Two Kinds of Power   
3           3       4        Systems Analysis of a University Library;    
4           4       5                        A Library Management Game:   

           author                                               text  \
0  Comaromi, J.P.     The present study is a history of the DEWEY...   
1             NaN  This report is an analysis of 6300 acts of use...   
2      Wilson, P.      The relationships between the organization...   
3  Buckland, M.K.      The establishment of nine new universities...   
4      Brophy, P.      Although the use of games in professional ...   

                                      processed_text  
0  present study history dewey decimal classifica...  
1  rep

In [49]:
print(documents[documents['processed_text'] == ""])

     Unnamed: 0  doc_id                      title             author text  \
790         790     791  Progress in Documentation  Fairthorne, R.A.   NaN   

    processed_text  
790                 



## Part 2: Word Embedding

In [50]:
def tokenize_text(text):
    return word_tokenize(text) if isinstance(text, str) else []

### 2.1 Training FastText:
Use the gensim library to train a FastText model on the CISI dataset. Ensure that both the text from the queries and the documents are used for the training process. This approach will help the model learn the specific vocabulary and context present in the dataset.

In [51]:
# Tokenize documents
document_sentences = documents['processed_text'].apply(tokenize_text).tolist()

# Tokenize queries
query_sentences = queries['text'].apply(preprocess_text).apply(tokenize_text).tolist()

# Combine sentences from both documents and queries
all_sentences = document_sentences + query_sentences

# Drop empty token lists and join each remaining one into a space-separated line
all_sentences = [" ".join(sentence) for sentence in all_sentences if sentence]

# Save tokenized sentences to a training text file
training_file_path = "fasttext_cisi_training.txt"
with open(training_file_path, "w", encoding="utf-8") as f:
    for sentence in all_sentences:
        f.write(sentence + "\n")


In [52]:
# Train FastText model using the fasttext library
fasttext_model = fasttext.train_unsupervised(
    input=training_file_path,  # Path to the training file
    model="skipgram",          # Skip-gram model (use 'cbow' for CBOW)
    dim=300,                   # Dimensionality of the embeddings
    ws=5,                      # Context window size
    minCount=2,                # Minimum word count threshold
    epoch=100                   # Number of training epochs
)

# Save the trained FastText model in binary format
model_save_path = "fasttext_cisi_model.bin"
fasttext_model.save_model(model_save_path)

print(f"FastText model trained and saved to {model_save_path}")

FastText model trained and saved to fasttext_cisi_model.bin


In [None]:
# Access vector representations
word_vector = fasttext_model.get_word_vector('example')  # Vector for the word "example"
print("Vector for 'example':", word_vector)

Vector for 'example': [ 0.3601462  -0.03513972 -0.2338145  -0.07723407  0.3736553  -0.16462587
 -0.07055941  0.05074128 -0.1485961  -0.08871047  0.2874292  -0.22264093
 -0.00766258  0.39132214  0.3407526   0.08849433 -0.47778815  0.13766488
 -0.38083586  0.01022877  0.06978649  0.25788116  0.22212768 -0.0748836
 -0.10919072 -0.04937812 -0.35594967  0.11519236  0.04441971  0.20399311
 -0.11428595 -0.01238429 -0.00183018 -0.10965278  0.02442933  0.14038679
  0.3982956   0.1756552  -0.02610425 -0.38572773 -0.11633113 -0.25150946
  0.44252944 -0.40143698  0.23332897  0.77471334  0.18574254  0.20379612
 -0.18797651  0.6479729  -0.46667814 -0.16886805  0.06523523  0.13803321
 -0.3055783   0.27377272 -0.4933838   0.28193444 -0.19928853  0.00812812
  0.09831779  0.19868183 -0.2364418  -0.15905543 -0.05473196 -0.46689537
 -0.25492218 -0.42084703  0.6647783  -0.07065783  0.36880407 -0.17253518
 -0.4677974  -0.3538534  -0.2766774  -0.2462121  -0.01505099 -0.03724478
  0.10972012 -0.01472945 -0.25

In [55]:
fasttext_model.get_nearest_neighbors('example', k=5)

[(0.5058606266975403, 'ample'),
 (0.45378610491752625, 'counterexample'),
 (0.3805534541606903, 'illustrated'),
 (0.3674928843975067, 'illustrate'),
 (0.3653709292411804, 'illustration')]

In [56]:
fasttext_model.get_nearest_neighbors('report', k=5)

[(0.4939620792865753, 'reporter'),
 (0.4387853443622589, 'reported'),
 (0.4237591624259949, 'reporting'),
 (0.33701103925704956, 'calbpc'),
 (0.32045799493789673, 'ugc')]

### 2.2 Using a Pre-trained FastText Model: 
Instead of starting from scratch, you can leverage pre-existing knowledge by using a pre-trained FastText model. Download a pre-trained FastText model, such as “cc.en.300.bin” (English), from the FastText website or using the fasttext library. The pre-trained model can be loaded using the gensim library for direct use in your retrieval system.

In [57]:
# Path to the pre-trained FastText model
pretrained_model_path = os.path.join(base_path, "cc.en.300.bin")

# Load the pre-trained FastText model
pretrained_fasttext_model = fasttext.load_model(pretrained_model_path)

print("Pre-trained FastText model loaded successfully!")

Pre-trained FastText model loaded successfully!


In [58]:
word_vector = pretrained_fasttext_model.get_word_vector('example')
print(f"Vector for 'example': {word_vector}")


Vector for 'example': [-3.01899910e-02  1.67307898e-03 -3.39188091e-02  1.29165754e-04
 -3.39024775e-02 -3.52627262e-02  5.44663481e-02 -2.15502288e-02
  1.57393347e-02 -5.50850853e-03 -9.77861509e-03  6.96822815e-03
  1.34404376e-02  4.04827148e-02 -5.77299595e-02  2.67399456e-02
  4.28873971e-02  1.72743984e-02  5.14067225e-02  4.15806361e-02
 -3.46253510e-03 -4.39561009e-02  4.55061607e-02 -4.61385176e-02
 -6.82864487e-02 -1.10961404e-02  1.33144371e-02  2.14999523e-02
  8.21126904e-03 -5.76011557e-03  1.62116960e-02  6.52960828e-03
  7.23410025e-03 -5.48320338e-02 -1.13268523e-02 -9.41580534e-03
  3.99618335e-02 -5.51603436e-02 -4.69672195e-05 -5.19470498e-02
 -3.15293521e-02 -4.06791782e-03 -5.40495440e-02 -1.99173968e-02
 -8.28304701e-03  4.20339815e-02  2.26341262e-02 -1.23577183e-02
  1.77250840e-02  2.66364366e-02  2.01242566e-02  1.41719412e-02
 -4.94768023e-02  3.80847923e-04  1.61610469e-02 -3.24339680e-02
 -5.72527312e-02 -1.43544767e-02 -1.18667241e-02 -3.18274871e-02
 -6

In [59]:
# Find similar words
pretrained_fasttext_model.get_nearest_neighbors('example', k=5)

[(0.8356781601905823, 'instance'),
 (0.7126652002334595, 'example.In'),
 (0.6859133839607239, 'exmaple'),
 (0.6804730296134949, 'example.The'),
 (0.6717150211334229, 'example.For')]

In [60]:
pretrained_fasttext_model.get_nearest_neighbors('report', k=5)

[(0.7258307337760925, 'reports'),
 (0.6820427775382996, 'report.It'),
 (0.6762813925743103, 'report.The'),
 (0.6462412476539612, 'report.In'),
 (0.627001941204071, 'report.But')]

### 2.3 Fine-tuning a Pre-trained FastText Model: 
Further improve a pre-trained FastText model by fine-tuning it on the CISI dataset. This involves continuing the training process using the text from both the queries and the documents, allowing the model to adapt and better capture the characteristics and domain-specific vocabulary of the dataset.

In [61]:
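# NOTE: as written, this call does not load the pre-trained weights. It trains a
# new skip-gram model on the CISI file, differing from the 2.1 model only in
# minCount (left at the library default here). To start from pre-trained vectors,
# fasttext accepts pretrainedVectors=<path to a text-format .vec file>.
finetuned_model = fasttext.train_unsupervised(
    training_file_path, model="skipgram",  # Use the Skip-gram approach
    dim=300,                               # Keep the same dimension as the pre-trained model
    ws=5,                                  # Window size
    epoch=100,                             # Number of training epochs
    lr=0.05                                # Learning rate
)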

# Save the fine-tuned model
finetuned_model.save_model("finetuned_fasttext_cisi.bin")
print("Fine-tuned FastText model saved successfully!")

Fine-tuned FastText model saved successfully!


In [62]:
# Path to the fine-tuned FastText model
finetuned_model_path = os.path.join(base_path, "finetuned_fasttext_cisi.bin")

# Load the fine-tuned FastText model
finetuned_fasttext_model = fasttext.load_model(finetuned_model_path)

print("Finetuned FastText model loaded successfully!")

Finetuned FastText model loaded successfully!


In [63]:
word_vector = finetuned_fasttext_model.get_word_vector('example')
print(f"Vector for 'example': {word_vector}")

Vector for 'example': [ 0.15368047 -0.02703001 -0.21425845 -0.11544652  0.25131127  0.06092939
  0.02180931  0.09665485  0.06949833  0.11939456 -0.03517433  0.33522272
  0.00855149  0.39514363  0.10516681  0.18342508  0.34751529 -0.09838767
  0.11537067  0.40186012  0.03177984 -0.10024697  0.03056737  0.11176201
  0.36433935 -0.20943132 -0.02506921  0.06632397 -0.25563937  0.05850577
 -0.19495898 -0.21565342  0.07011463  0.0396662  -0.13288555  0.5057059
  0.16477966  0.11597671  0.14403333 -0.3003128  -0.01222487  0.03619796
 -0.04004008 -0.00271747  0.11331759  0.3749598   0.06666801  0.29852435
  0.09872383  0.08984828 -0.20170486 -0.12968232  0.29882678  0.04643991
  0.3801889  -0.08765279 -0.18479271  0.30083615  0.15751004 -0.05213655
 -0.16943225  0.02581784  0.07663221  0.06882079 -0.46061513  0.18307348
 -0.00190935 -0.5215184   0.202297    0.08151408 -0.0895128  -0.46907824
 -0.04973831  0.17032252 -0.10496243  0.21327695  0.02070305  0.02749459
  0.05474547 -0.08357549  0.16

In [64]:
# Find similar words
finetuned_fasttext_model.get_nearest_neighbors('example', k=5)

[(0.363446980714798, 'illustration'),
 (0.3617458641529083, 'illustrated'),
 (0.35683703422546387, 'illustrate'),
 (0.3160126209259033, 'illustrative'),
 (0.2687340974807739, 'coordination')]

In [65]:
# Find similar words
finetuned_fasttext_model.get_nearest_neighbors('report', k=5)

[(0.4687941372394562, 'reported'),
 (0.4027373492717743, 'reporting'),
 (0.2799728512763977, 'study'),
 (0.2762048840522766, 'colorado'),
 (0.26580870151519775, 'federal')]

## Part 3: Retrieval Task
In this part, you will focus on the core of information retrieval: fetching relevant documents based on a user's
query. The challenge lies in effectively combining traditional retrieval methods with modern word embeddings
to enhance the accuracy of the results.


### 3.1 Embedding Computation: 
Implement a method to derive the embedding for an input (a query or a document) by averaging the embeddings of its constituent words.

In [68]:
def get_sentence_embedding(sentence, model):
    """Average the word vectors of a sentence; zero vector if no word is known."""
    # Tokenize the input sentence
    words = tokenize_text(sentence)
    
    # Keep only in-vocabulary words. FastText can also build vectors for
    # out-of-vocabulary words from character n-grams, so this filter is a
    # design choice rather than a requirement.
    word_vectors = []
    for word in words:
        if word in model:  # Check if the word is in the model's vocabulary
            word_vectors.append(model[word])
    
    if not word_vectors:
        # If no words are found in the model, return a zero vector
        return np.zeros(model.get_dimension())
    
    # Compute the average of the embeddings
    return np.mean(word_vectors, axis=0)

In [72]:
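# Note: the raw 'text' columns are used here, while the models were trained on
# preprocessed text; passing the text through preprocess_text first would keep
# the vocabulary consistent (capitalized tokens are otherwise skipped).
query_embedding = get_sentence_embedding(queries.iloc[0]['text'], finetuned_fasttext_model)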
document_embedding = get_sentence_embedding(documents.iloc[0]['text'], finetuned_fasttext_model)
print("Query Embedding:", query_embedding)
print("Document Embedding:", document_embedding)

Query Embedding: [ 0.14987512  0.06952221 -0.06957391 -0.05053607  0.31149802 -0.11628633
 -0.05789592  0.09997884  0.10218373 -0.1806168   0.00511498  0.29726592
 -0.1846541  -0.05025312 -0.06120212  0.22680569 -0.01971935  0.18904953
  0.12363468 -0.08691324  0.3065855   0.254293    0.07490369  0.07381604
  0.00150687 -0.07969659  0.01648275  0.00492999  0.22681278 -0.19567622
  0.2346856  -0.18983017 -0.01651494 -0.12024503 -0.11877542  0.02123858
  0.28430143  0.1665491   0.03157415  0.12296946  0.04768892  0.14323549
 -0.1192183   0.33908176  0.13214438 -0.00437745 -0.08516368 -0.09647523
 -0.17207664  0.2282944   0.27714694 -0.05932504  0.14826377  0.05357199
 -0.04046134 -0.2120988  -0.25026935  0.02721096  0.27870777  0.17933954
  0.1261993  -0.09355965 -0.03809638  0.2521343   0.16309899  0.24700512
  0.08869053  0.08387133  0.10193145 -0.20604278  0.09603649 -0.02020621
 -0.04760153  0.20926857 -0.2605013   0.19680268  0.2038908  -0.00865525
 -0.1350343   0.01701168 -0.050536

### 3.2 Retrieval System

Construct a retrieval system that:

- Retrieves the top 10 documents for a given query.
- Combines scores from TF-IDF, BM25, and word embeddings to rank documents. You can use TfidfVectorizer from sklearn.feature_extraction.text and BM25Okapi from rank_bm25 to obtain the TF-IDF and BM25 scores. Use cosine similarity to calculate the embedding scores. The overall score for a query-document pair is calculated as the weighted average of the TF-IDF, BM25, and embedding scores.
- Allows for the utilization of three FastText models: 
    - (1) the FastText model you trained from scratch on the CISI dataset, 
    - (2) a pre-trained FastText model, and 
    - (3) the fine-tuned version of the pre-trained model using the CISI dataset.
- Provides the flexibility to set weights for TF-IDF, BM25, and embedding scores to compute a combined score.
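
One possible way to combine the three signals is sketched below. It assumes the TF-IDF, BM25, and embedding scores for a query have already been computed as plain, non-empty lists in the same document order. The helper names (`cosine_similarity`, `min_max_normalize`, `combine_scores`, `top_k`) and the min-max rescaling step are illustrative assumptions, not part of the assignment. Rescaling is included because BM25 scores are unbounded while cosine similarities lie in [-1, 1], so a raw weighted average could be dominated by one method.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def min_max_normalize(scores):
    """Rescale scores to [0, 1] (assumption: one way to make methods comparable)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # A constant list carries no ranking information
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(tfidf, bm25, embedding, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of three per-document score lists in the same order."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [min_max_normalize(s) for s in (tfidf, bm25, embedding)]
    return [
        sum(w * column[i] for w, column in zip(weights, normalized)) / total
        for i in range(len(tfidf))
    ]


def top_k(doc_ids, scores, k=10):
    """Return the ids of the k highest-scoring documents, best first."""
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy scores for three documents (made-up numbers, for illustration only)
doc_ids = [1, 2, 3]
combined = combine_scores(
    tfidf=[0.1, 0.5, 0.3],
    bm25=[2.0, 8.0, 5.0],
    embedding=[0.9, 0.2, 0.55],
    weights=(0.4, 0.4, 0.2),
)
print(top_k(doc_ids, combined, k=10))
```

Setting a weight to zero removes that method from the ranking, so the same function also covers the single-method runs.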

## Part 4: Evaluation
Evaluation is crucial in information retrieval to ensure that the system meets user expectations. In this part, you
will assess the performance of your retrieval system under various configurations. This includes experimenting
with different possibilities of combining FastText embeddings with traditional TF-IDF and BM25 methods, as
well as using each of them in isolation.
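
The configurations to try can be enumerated systematically. The sketch below is one hypothetical way to do it: `weight_grid` is an illustrative helper (not part of the assignment) that yields every `(w_tfidf, w_bm25, w_embedding)` triple on a regular grid whose entries sum to 1. The corner points of the grid correspond to using a single method in isolation.

```python
from itertools import product


def weight_grid(step=0.25):
    """Yield (w_tfidf, w_bm25, w_embedding) triples on a grid that sum to 1.

    Assumes 1 / step is (close to) a whole number, e.g. 0.5, 0.25, 0.1.
    """
    n = round(1 / step)
    for i, j in product(range(n + 1), repeat=2):
        if i + j <= n:
            k = n - i - j  # the remaining share goes to the embedding score
            yield (i / n, j / n, k / n)


# With step=0.5 the grid contains the three single-method settings
# plus the three two-method 50/50 mixes.
for weights in weight_grid(step=0.5):
    print(weights)
```

Each triple can then be passed as the weights of the combined score, and the resulting MAP recorded per FastText model.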
- Performance Metrics: Evaluate the efficacy of your retrieval system using the Mean Average Precision (MAP) metric.
- Comparative Analysis: Contrast the performance metrics when employing the three FastText models:
    - (1) the FastText model trained from scratch on the CISI dataset,
    - (2) the pre-trained FastText model, and
    - (3) the fine-tuned FastText model.

    Analyze the potential reasons for observed differences or similarities in the results.

- Experimentation: Experiment with different combinations of weights for TF-IDF, BM25, and embedding scores. Also, test the performance of each method in isolation. Report your observations and the MAP scores for each scenario.
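
A minimal sketch of the MAP computation follows, under two stated assumptions: average precision is normalized by the total number of relevant documents for the query (some setups that retrieve only the top 10 normalize by `min(10, number of relevant documents)` instead), and queries with no relevance judgments are skipped. The function names and the dictionary-based inputs are hypothetical; the ground-truth CSV would first need to be grouped into a `{query_id: set_of_doc_ids}` mapping.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears
            precision_sum += hits / rank
    # Assumption: normalize by all relevant documents, retrieved or not
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of average precision over queries that have relevance judgments.

    rankings:  {query_id: [doc_id, ...]} in ranked order, best first
    relevance: {query_id: set of relevant doc_ids}
    """
    query_ids = [q for q in rankings if relevance.get(q)]
    if not query_ids:
        return 0.0
    total = sum(average_precision(rankings[q], relevance[q]) for q in query_ids)
    return total / len(query_ids)


# Toy example with made-up rankings and judgments
rankings = {1: [1, 2, 3, 4], 2: [5, 6]}
relevance = {1: {1, 3}, 2: {1}}
print(mean_average_precision(rankings, relevance))
```

In the toy example, query 1 finds relevant documents at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2, while query 2 retrieves nothing relevant and contributes 0.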