These are the queries we created from the Assignment 1 data set.

In [1]:
queries = [
    "What impact did the recent bombings in Baghdad have on local security measures?",
    "How has the Indonesian forest fire crisis affected neighboring countries like Singapore?",
    "What measures are in place to protect Shia pilgrims in Iraq following recent attacks?",
    "How did Alexander Van der Bellen's victory in the Austrian election impact European politics?",
    "What are the implications of the U.S. Supreme Court’s ruling in Salman v. U.S. for insider trading laws?",
    "What challenges remain after Colombia’s peace accord with FARC was signed?",
    "How are Egyptian authorities responding to the recent killing of a policeman in Cairo?",
    "What are the safety protocols in place at the New York City Metropolitan Opera after a recent disruption?",
    "What are the prospects for diplomatic relations between Iran and Canada following the release of Homa Hoodfar?",
    "What were the key issues in The Gambia’s recent elections?",
    "How is Singapore addressing the health and environmental impacts of Indonesian fires?",
    "What security precautions are being taken for future religious gatherings in Iraq after recent attacks?",
    "How does the Austrian election reflect broader political trends in Europe?",
    "What steps are being taken to prevent further incidents like the Metropolitan Opera disruption?",
    "What does the peace accord with FARC mean for Colombia’s future political landscape?"
]

Next, we load the documents and train a Word2Vec model on this data set.

In [2]:
import os
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

def load_documents(folder_path):
    """
    Load and preprocess text files in a given folder.
    
    Args:
        folder_path (str): Path to the folder containing .txt documents.
    
    Returns:
        list of list of str: List of tokenized documents, each document is represented as a list of words.
    """
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
                tokenized_content = simple_preprocess(content)  # Tokenize and preprocess
                documents.append(tokenized_content)
    return documents

In [3]:
def train_word2vec(documents, vector_size=100, window=5, min_count=1, workers=4):
    """
    Train a Word2Vec model on the given documents.
    
    Args:
        documents (list of list of str): Tokenized documents for training.
        vector_size (int): Dimensionality of word vectors.
        window (int): Maximum distance between the current and predicted word.
        min_count (int): Ignores words with total frequency lower than this.
        workers (int): Number of worker threads to train the model.
    
    Returns:
        Word2Vec: Trained Word2Vec model.
    """
    model = Word2Vec(sentences=documents, vector_size=vector_size, window=window, min_count=min_count, workers=workers)
    return model
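
The `min_count` argument above decides which tokens get a vector at all. The following is a minimal sketch of that filtering idea in plain Python, not gensim's internal implementation; `surviving_vocabulary` and `toy_documents` are hypothetical names used only for illustration.

```python
from collections import Counter

def surviving_vocabulary(documents, min_count=1):
    """Return the set of tokens whose corpus frequency is at least min_count."""
    counts = Counter(token for doc in documents for token in doc)
    return {token for token, freq in counts.items() if freq >= min_count}

# Hypothetical tokenized documents, in the same shape load_documents returns.
toy_documents = [
    ["baghdad", "security", "forces"],
    ["security", "measures", "baghdad"],
    ["opera", "security"],
]

vocab_all = surviving_vocabulary(toy_documents, min_count=1)
vocab_frequent = surviving_vocabulary(toy_documents, min_count=2)
print(sorted(vocab_all))       # every token is kept
print(sorted(vocab_frequent))  # only tokens seen at least twice
```

With `min_count=1`, as used in this notebook, every token is kept, including words that occur only once. Raising the threshold shrinks the vocabulary, and any query word that falls below it would simply be skipped during mean pooling.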

In [4]:
folder_path = "/kaggle/input/data-set/documents"  # Replace with your folder path

# Step 1: Load and preprocess documents
documents = load_documents(folder_path)

In [5]:
documents[1]

['this',
 'article',
 'is',
 'more',
 'than',
 'months',
 'old',
 'this',
 'article',
 'is',
 'more',
 'than',
 'months',
 'old',
 'tourist',
 'attractions',
 'and',
 'museums',
 'in',
 'central',
 'paris',
 'have',
 'said',
 'they',
 'will',
 'not',
 'open',
 'on',
 'saturday',
 'when',
 'fresh',
 'gilets',
 'jaunes',
 'yellow',
 'vests',
 'protests',
 'are',
 'planned',
 'as',
 'french',
 'authorities',
 'prepared',
 'to',
 'deploy',
 'security',
 'personnel',
 'across',
 'the',
 'country',
 'the',
 'demonstrations',
 'announced',
 'on',
 'saturday',
 'december',
 'in',
 'paris',
 'do',
 'not',
 'allow',
 'us',
 'to',
 'welcome',
 'visitors',
 'in',
 'safe',
 'conditions',
 'said',
 'the',
 'operator',
 'of',
 'the',
 'eiffel',
 'tower',
 'in',
 'statement',
 'on',
 'thursday',
 'police',
 'have',
 'also',
 'ordered',
 'about',
 'dozen',
 'museums',
 'including',
 'the',
 'louvre',
 'and',
 'the',
 'grand',
 'palais',
 'cultural',
 'sites',
 'such',
 'as',
 'the',
 'opera',
 'and',
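
The token list above has no numbers and no one-letter words ("more than months old", "in statement") because `simple_preprocess` lowercases the text, keeps alphabetic tokens only, and by default drops tokens shorter than 2 or longer than 15 characters. A rough approximation of that behaviour, assuming a simple regex is close enough for illustration (`approx_simple_preprocess` is a hypothetical helper, not gensim's code):

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Roughly mimic simple_preprocess: lowercase, letters only, length filter."""
    # [^\W\d]+ matches runs of word characters that are not digits.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "This article is more than 6 months old. In a statement on Thursday"
tokens = approx_simple_preprocess(sample)
print(tokens)
```

The digit "6" and the article "a" disappear, which matches the gaps visible in the output of `documents[1]`.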
 

In [6]:
# Step 2: Train Word2Vec model
word2vec_model = train_word2vec(documents)

<h2>TRAINED WORD2VEC MODEL</h2>

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def mean_pool_embedding(text, model):
    """
    Compute the mean-pooling of word embeddings for a given text.
    
    Args:
        text (str or list of str): Input text, or a list of tokens, to vectorize.
        model (Word2Vec): Trained Word2Vec model.
    
    Returns:
        numpy.ndarray: Mean-pooled vector for the input text.
    """
    # Check if text is a list or string
    if isinstance(text, list):
        text = ' '.join(text)  # Convert list of words to a single string if needed
    
    if not isinstance(text, str):
        raise TypeError(f"Expected text to be a string, got {type(text)} instead.")
    
    words = simple_preprocess(text)  # Tokenize and preprocess the string
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    
    if not word_vectors:
        return np.zeros(model.vector_size)  # Return a zero vector if no words are in the model
    return np.mean(word_vectors, axis=0)

def vectorize_documents(documents, model):
    """
    Vectorize a list of documents using mean-pooling.
    
    Args:
        documents (list of str or list of list of str): Documents as strings or token lists.
        model (Word2Vec): Trained Word2Vec model.
    
    Returns:
        list of numpy.ndarray: List of mean-pooled document vectors.
    """
    return [mean_pool_embedding(doc, model) for doc in documents]

def vectorize_queries(queries, model):
    """
    Vectorize a list of queries using mean-pooling.
    
    Args:
        queries (list of str): List of query strings.
        model (Word2Vec): Trained Word2Vec model.
    
    Returns:
        list of numpy.ndarray: List of mean-pooled query vectors.
    """
    return [mean_pool_embedding(query, model) for query in queries]

def rank_documents(queries, query_vectors, document_vectors, k=5):
    """
    Rank documents for each query based on cosine similarity.
    
    Args:
        queries (list of str): List of query strings.
        query_vectors (list of numpy.ndarray): List of query vectors.
        document_vectors (list of numpy.ndarray): List of document vectors.
        k (int): Number of top documents to retrieve.
    
    Returns:
        dict: A dictionary with queries as keys and lists of top-k document indices as values.
    """
    rankings = {}
    for query, query_vector in zip(queries, query_vectors):
        similarities = cosine_similarity([query_vector], document_vectors)[0]
        ranked_indices = np.argsort(similarities)[::-1][:k]
        rankings[query] = ranked_indices
    return rankings
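
The functions above can be traced by hand on a toy example. The sketch below re-implements mean pooling and cosine ranking in plain Python under stated assumptions: the 2-dimensional vectors in `toy_vectors` are made up, and `mean_pool` and `cosine` are hypothetical stand-ins for the notebook's functions, not the same code.

```python
import math

# Hypothetical 2-dimensional "embeddings" for a toy vocabulary.
toy_vectors = {
    "baghdad": [1.0, 0.0],
    "bombing": [0.8, 0.2],
    "opera": [0.0, 1.0],
    "singer": [0.1, 0.9],
}

def mean_pool(tokens):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return [0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(a, b):
    """Cosine similarity, defined here as 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

docs = [["opera", "singer"], ["baghdad", "bombing"], ["unknown"]]
query = ["bombing", "baghdad"]

query_vec = mean_pool(query)
scores = [cosine(query_vec, mean_pool(d)) for d in docs]
# Sort document indices by descending similarity, like argsort()[::-1].
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Mean pooling ignores word order, so the query and the second document pool to the same vector. The third document contains no known token, falls back to the zero vector, and ends up last.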

In [8]:
# Vectorize documents and queries
document_vectors = vectorize_documents(documents, word2vec_model)
query_vectors = vectorize_queries(queries, word2vec_model)

# Rank top-k documents for each query
k = 5  # Number of top documents to retrieve for each query
rankings = rank_documents(queries, query_vectors, document_vectors, k)

# Display rankings
for query, indices in rankings.items():
    print(f"Query: '{query}'")
    print("Top documents:")
    for idx in indices:
        # documents holds token lists here, so the slice shows the first 100 tokens
        print(f" - Document {idx}: {documents[idx][:100]}...")

Query: 'What impact did the recent bombings in Baghdad have on local security measures?'
Top documents:
 - Document 6500: ['mazar', 'sharif', 'afghanistan', 'at', 'least', 'members', 'of', 'security', 'forces', 'in', 'northern', 'afghanistan', 'were', 'killed', 'by', 'the', 'taliban', 'in', 'series', 'of', 'coordinated', 'attacks', 'on', 'tuesday', 'officials', 'said', 'and', 'dozens', 'of', 'others', 'were', 'wounded', 'the', 'deadliest', 'violence', 'took', 'place', 'in', 'sar', 'pul', 'province', 'where', 'the', 'taliban', 'attacked', 'afghan', 'security', 'forces', 'in', 'three', 'areas', 'killing', 'total', 'of', 'people', 'officials', 'said', 'the', 'officials', 'did', 'not', 'provide', 'breakdown', 'of', 'casualties', 'zabihullah', 'amani', 'the', 'spokesman', 'for', 'the', 'governor', 'of', 'sar', 'pul', 'said', 'the', 'taliban', 'had', 'simultaneously', 'attacked', 'the', 'center', 'of', 'sayad', 'district', 'security', 'outposts', 'along', 'the', 'highway', 'linking', 'sar', 

<h2>PRE-TRAINED GLOVE MODEL</h2>

In [9]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


We repeat the same procedure with a pre-trained GloVe model, again using cosine similarity to find the top 5 relevant documents for each query.

In [10]:
import spacy

# Load spaCy's pre-trained model with GloVe embeddings
# For example, 'en_core_web_md' includes GloVe embeddings with 300 dimensions
nlp = spacy.load("en_core_web_md")  # Ensure the model is downloaded with `python -m spacy download en_core_web_md`

def load_documents(folder_path):
    """
    Load text files in a given folder.
    
    Args:
        folder_path (str): Path to the folder containing .txt documents.
    
    Returns:
        list of str: List of document contents as strings.
    """
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
                documents.append(content)
    return documents

def mean_pool_embedding(text):
    """
    Compute the mean-pooling of word embeddings for a given text using Spacy.
    
    Args:
        text (str): Input text to vectorize.
    
    Returns:
        numpy.ndarray: Mean-pooled vector for the input text.
    """
    doc = nlp(text)
    word_vectors = [token.vector for token in doc if token.has_vector]
    
    if not word_vectors:
        return np.zeros(nlp.vocab.vectors_length)  # Return a zero vector if no words have embeddings
    return np.mean(word_vectors, axis=0)

def vectorize_documents(documents):
    """
    Vectorize a list of documents using mean-pooling with Spacy embeddings.
    
    Args:
        documents (list of str): List of document contents as strings.
    
    Returns:
        list of numpy.ndarray: List of mean-pooled document vectors.
    """
    return [mean_pool_embedding(doc) for doc in documents]

def vectorize_queries(queries):
    """
    Vectorize a list of queries using mean-pooling with Spacy embeddings.
    
    Args:
        queries (list of str): List of query strings.
    
    Returns:
        list of numpy.ndarray: List of mean-pooled query vectors.
    """
    return [mean_pool_embedding(query) for query in queries]

def rank_documents(queries, query_vectors, document_vectors, k=5):
    """
    Rank documents for each query based on cosine similarity.
    
    Args:
        queries (list of str): List of query strings.
        query_vectors (list of numpy.ndarray): List of query vectors.
        document_vectors (list of numpy.ndarray): List of document vectors.
        k (int): Number of top documents to retrieve.
    
    Returns:
        dict: A dictionary with queries as keys and lists of top-k document indices as values.
    """
    rankings = {}
    for query, query_vector in zip(queries, query_vectors):
        similarities = cosine_similarity([query_vector], document_vectors)[0]
        ranked_indices = np.argsort(similarities)[::-1][:k]
        rankings[query] = ranked_indices
    return rankings
    

# Load documents
documents = load_documents(folder_path)

# Vectorize documents and queries
document_vectors = vectorize_documents(documents)
query_vectors = vectorize_queries(queries)

# Rank top-k documents for each query
k = 5  # Number of top documents to retrieve for each query
rankings = rank_documents(queries, query_vectors, document_vectors, k)

# Display rankings
for query, indices in rankings.items():
    print(f"Query: '{query}'")
    print("Top documents:")
    for idx in indices:
        print(f" - Document {idx}: {documents[idx][:100]}...")  # Display first 100 chars of each document

Query: 'What impact did the recent bombings in Baghdad have on local security measures?'
Top documents:
 - Document 5267: Story highlights Nine police officers and seven residents killed in coordinated attack, police say

...
 - Document 7238: Armed bandits stormed a mining site in a remote village late Wednesday shooting at residents, Mohamm...
 - Document 6382: Bandits have on Wednesday attacked two villages in Anka Local Government Area of Zamfara, killing so...
 - Document 4982: A suicide attack at a voter registration centre in the Afghan capital Kabul has killed at least 31 p...
 - Document 3312: With the motives of the Champs-Elysées gunman considered terror-related, the timing just three days ...
Query: 'How has the Indonesian forest fire crisis affected neighboring countries like Singapore?'
Top documents:
 - Document 381: BEIJING (AP) — All 33 coal miners trapped underground in a gas explosion earlier this week have been...
 - Document 7044: A dust storm that engulfed parts o

<h2>DPR model</h2>
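
The DPR encoders used in this section were trained to score question and passage pairs by dot product, while the ranking code in this notebook uses cosine similarity for all three models. The two measures can order documents differently when embedding norms vary. A minimal sketch with made-up 2-dimensional vectors (`query`, `passages`, `dot`, and `cosine` are hypothetical and only illustrate the difference):

```python
import math

def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Inner product divided by the product of the two norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 1.0]
passages = {
    "short_aligned": [1.0, 1.0],  # same direction as the query, small norm
    "long_offaxis": [4.0, 1.0],   # different direction, large norm
}

by_dot = sorted(passages, key=lambda name: dot(query, passages[name]), reverse=True)
by_cosine = sorted(passages, key=lambda name: cosine(query, passages[name]), reverse=True)
print(by_dot)     # the large-norm passage wins under dot product
print(by_cosine)  # the aligned passage wins under cosine similarity
```

Cosine similarity keeps the comparison with the Word2Vec and GloVe runs uniform. Ranking by dot product instead may be closer to how the DPR checkpoints were trained, so it could be worth comparing both on this data set.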

In [11]:
!pip install transformers



In [13]:
import torch
from transformers import DPRContextEncoder, DPRQuestionEncoder, DPRContextEncoderTokenizer, DPRQuestionEncoderTokenizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Set up the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load DPR encoders and tokenizers, and move models to GPU if available
question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base").to(device)
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").to(device)
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

def encode_queries(queries):
    """
    Encode a list of queries using the DPR question encoder with truncation.
    
    Args:
        queries (list of str): List of query strings.
    
    Returns:
        numpy.ndarray: Encoded query vectors.
    """
    query_embeddings = []
    for query in queries:
        # Tokenize and encode each query individually with truncation
        inputs = question_tokenizer(query, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
        with torch.no_grad():
            embedding = question_encoder(**inputs).pooler_output
            query_embeddings.append(embedding.cpu().numpy())  # Move back to CPU for aggregation
    
    # Stack embeddings to form a 2D array for all query vectors
    return np.vstack(query_embeddings)

def encode_documents(documents):
    """
    Encode a list of documents using the DPR context encoder with truncation.
    
    Args:
        documents (list of str): List of document contents as strings.
    
    Returns:
        numpy.ndarray: Encoded document vectors.
    """
    document_embeddings = []
    for doc in documents:
        # Tokenize and encode each document individually with truncation
        inputs = context_tokenizer(doc, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
        with torch.no_grad():
            embedding = context_encoder(**inputs).pooler_output
            document_embeddings.append(embedding.cpu().numpy())  # Move back to CPU for aggregation
    
    # Stack embeddings to form a 2D array for all document vectors
    return np.vstack(document_embeddings)

def rank_documents(queries, query_vectors, document_vectors, k=5):
    """
    Rank documents for each query based on cosine similarity.
    
    Args:
        queries (list of str): List of query strings.
        query_vectors (numpy.ndarray): Array of query vectors.
        document_vectors (numpy.ndarray): Array of document vectors.
        k (int): Number of top documents to retrieve.
    
    Returns:
        dict: A dictionary with queries as keys and lists of top-k document indices as values.
    """
    rankings = {}
    for i, query_vector in enumerate(query_vectors):
        similarities = cosine_similarity([query_vector], document_vectors)[0]
        ranked_indices = np.argsort(similarities)[::-1][:k]
        rankings[queries[i]] = ranked_indices
    return rankings


# Encode documents and queries
document_vectors = encode_documents(documents)
query_vectors = encode_queries(queries)

# Rank top-k documents for each query
k = 5  # Number of top documents to retrieve for each query
rankings = rank_documents(queries, query_vectors, document_vectors, k)

# Display rankings
for query, indices in rankings.items():
    print(f"Query: '{query}'")
    print("Top documents:")
    for idx in indices:
        print(f" - Document {idx}: {documents[idx][:100]}...")  # Display first 100 chars of each document

Some weights of the model checkpoint at facebook/dpr-question_encoder-single-nq-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the

Query: 'What impact did the recent bombings in Baghdad have on local security measures?'
Top documents:
 - Document 1287: Widespread unrest is engulfing southern Iraq as Iraqis frustrated by shortages of electricity, water...
 - Document 5530: Abbo Hyder, AFP | This picture taken on October 24, 2018 shows fire near the wreckage of a car repor...
 - Document 6397: Protests that began last week in Iraq are continuing amid widespread anger over abysmal public servi...
 - Document 1800: The coordinated attack is another stinging blow to NATO-backed Afghan forces.

KANDAHAR, Afghanistan...
 - Document 330: BAGHDAD (AP) — A car bomb exploded outside a popular ice cream shop in central Baghdad just after mi...
Query: 'How has the Indonesian forest fire crisis affected neighboring countries like Singapore?'
Top documents:
 - Document 3937: Six Indonesian provinces have declared states of emergency as forest fires blanketed a swath of Sout...
 - Document 361: JAKARTA, Indonesia (AP) -- A volati