# Applied Natural Language Processing: Assignment 2 - Group Project


## Group members:

1) Haider Ali Lokhand | a1894658

2) Paridhi Awadheshpratap Singh | a1865487

3) Chahat Segan | a1855353

In [31]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertForQuestionAnswering, BertTokenizer
import chardet
import re
import warnings
import torch
import pprint
warnings.filterwarnings('ignore')

# Downloading NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('stopwords')

# Loading English language model for spaCy
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to /Users/haider/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/haider/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data Loading and Preprocessing:

A utility that performs data cleaning, tokenization, and filtering to prepare the news articles dataset for analysis.

In [32]:
# Define the path to the CSV file
path = 'news_dataset.csv'

# Open the CSV file in binary mode and detect the encoding
with open(path, 'rb') as f:
    encoding = chardet.detect(f.read())['encoding']

# Print the detected encoding
print(f"Detected encoding: {encoding}")

Detected encoding: MacRoman


This code block is used to determine the encoding of a CSV file named 'news_dataset.csv'. Here's a breakdown of the code:

The `chardet` module, imported at the top of the notebook, detects the encoding of byte strings. The file path to the CSV file ('news_dataset.csv') is defined, and the file is opened in binary mode (`'rb'`). The full contents of the file are read and passed to `chardet.detect()`, which returns a best guess for the encoding. That guess is stored in the `encoding` variable.

Finally, the detected encoding is printed to the console, providing information about how the file's characters are encoded. This information can be crucial for correctly reading and interpreting the contents of the CSV file.
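To illustrate why the encoding matters, the following minimal sketch decodes the same bytes with two different codecs using only the standard library. The sample string is hypothetical and is not taken from the dataset; the sketch makes no claim about the dataset's true encoding.

```python
# Illustrative sample text (hypothetical, not from news_dataset.csv)
text = "Muñiz family"

# Bytes as they would be stored in a UTF-8 encoded file
raw = text.encode("utf-8")

# Decoding with the wrong codec succeeds silently but garbles non-ASCII characters
garbled = raw.decode("mac_roman")

# Decoding with the matching codec recovers the original text
recovered = raw.decode("utf-8")

print(repr(garbled))
print(repr(recovered))
```

Because a wrong codec often raises no error, it is worth inspecting a few rows with non-ASCII characters after loading the file.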


In [33]:
# Read data from CSV file
data = pd.read_csv('news_dataset.csv', encoding=encoding)

# Preprocessing
stop_words = set(stopwords.words('english'))

# Function to preprocess text
def preprocess_text(text):
    # Tokenize text into words
    tokens = word_tokenize(text.lower())
    # Filter tokens to remove stopwords and non-alphanumeric characters
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    # Join filtered tokens back into a string
    return " ".join(filtered_tokens)

# Apply preprocessing to each article in the DataFrame
data['preprocessed_article'] = data['article'].apply(preprocess_text)

This code block performs data reading and preprocessing tasks on a CSV file named 'news_dataset.csv'. Here's a breakdown of the code:

This block relies on pandas and NLTK, which were imported at the top of the notebook.

The `pd.read_csv()` function from the pandas library is used to read the data from the CSV file into a DataFrame named `data`. The `encoding` parameter is set to ensure proper decoding of the file contents.

* **Preprocessing:**
    - **Stopwords Removal:** Stopwords are common words that do not carry much meaning, such as "the", "is", "and", etc. These stopwords are removed to focus on the more meaningful words in the text. The `stopwords.words('english')` function from NLTK returns the English stopwords, which are converted to a `set` for fast membership checks.
    - **Text Preprocessing Function:** The `preprocess_text` function is defined to perform text preprocessing tasks.
        - The text is converted to lowercase using the `lower()` method.
        - The `word_tokenize()` function from NLTK is used to tokenize the text into words.
        - Each token is checked to ensure it consists only of alphanumeric characters and is not a stopword. If both conditions are met, the token is retained.
        - The filtered tokens are joined back into a string using `" ".join(filtered_tokens)`.

The `preprocess_text` function is applied to each article in the DataFrame using the `apply()` method. The preprocessed text is stored in a new column named `'preprocessed_article'`.
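The preprocessing steps can be sketched without NLTK as follows. This is a simplified illustration, not the notebook's implementation: the regex tokenizer stands in for `word_tokenize`, and `STOP_WORDS` is a small hypothetical subset of the NLTK list, so results will differ on contractions and punctuation.

```python
import re

# Small illustrative stop-word list (assumption: a tiny subset, not NLTK's full list)
STOP_WORDS = {"the", "in", "is", "and", "a", "of", "to"}

def simple_preprocess(text):
    """Lowercase, keep alphanumeric tokens, and drop stop words."""
    # Regex tokenizer standing in for nltk.word_tokenize
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

# Hypothetical sentence used only for demonstration
example = simple_preprocess("The subway station opened in the city, and crowds arrived!")
print(example)
```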

In [34]:
data.head()

| Unnamed: 0 | id | author | date | year | month | topic | article | preprocessed_article |
|---|---|---|---|---|---|---|---|---|
| 0 | 17307 | Marlise Simons | 1/01/2017 | 2017 | 1 | architecture | PARIS ? When the Islamic State was about to... | paris islamic state driven ancient city palmyr... |
| 1 | 17292 | Andy Newman | 31/12/2016 | 2016 | 12 | art | Angels are everywhere in the Mu?iz family?s ap... | angels everywhere mu iz family apartment bronx... |
| 2 | 17298 | Emma G. Fitzsimmons | 2/01/2017 | 2017 | 1 | business | Finally. The Second Avenue subway opened in Ne... | finally second avenue subway opened new york c... |
| 3 | 17311 | Carl Hulse | 3/01/2017 | 2017 | 1 | business | WASHINGTON ? It?s or time for Republica... | washington time republicans tumultuous decade ... |
| 4 | 17339 | Jim Rutenberg | 5/01/2017 | 2017 | 1 | business | For Megyn Kelly, the shift from Fox News to NB... | megyn kelly shift fox news nbc host daily dayt... |


### Text Matching and Entity Linking:

Text Matching finds the most relevant sentence in an article based on a user question and calculates a confidence score for the match.

Entity Linking aims to recognize the same entity mentioned across different articles, ensuring consistency in entity identification.

In [35]:

def select_top_articles(question, articles, article_entities):
    """
    Selects the top articles relevant to a given question based on cosine similarity between the question and article content.

    Args:
        question (str): The question for which relevant articles are to be selected.
        articles (list): List of articles (strings).
        article_entities (dict): Nested dictionary mapping entity labels to dictionaries whose keys are entity names and whose values are lists of indices of articles containing those entities.

    Returns:
        tuple: Tuple containing the top 5 articles relevant to the question and their corresponding similarity scores.

    """
    # Reuse the global spaCy pipeline `nlp` loaded at the top of the notebook
    # instead of reloading the model on every call

    # Extract entities from the question
    question_doc = nlp(question)
    question_entities = [(ent.label_,ent.text) for ent in question_doc.ents]

    # Initialize TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # Fit transform the articles
    article_vectors = vectorizer.fit_transform(articles)

    # Initialize list to store relevant articles
    relevant_articles = []

    # Iterate over each entity in the question
    for entity in question_entities:
        # If the entity exists in the article_entities index, add relevant articles
        if entity[0] in article_entities:
            if entity[1].lower() in article_entities[entity[0]]:
                relevant_articles.extend(article_entities[entity[0]][entity[1].lower()])

    # Remove duplicate articles
    relevant_articles = list(set(relevant_articles))

    # Outputs the top 5 articles if there are 5 or more relevant articles
    if len(relevant_articles) >= 5:
        # Transform the question into a vector
        question_vector = vectorizer.transform([question])

        # Compute cosine similarity between question and relevant articles
        similarity_scores = cosine_similarity(question_vector, article_vectors[relevant_articles])

        # Select the indices of the top 5 ranked articles
        top_article_indices = np.argsort(similarity_scores[0])[::-1][:5]

        # Select the top 5 articles
        top_articles = [articles[relevant_articles[i]] for i in top_article_indices]

        # Extract similarity scores for the top articles
        top_similarity_scores = [similarity_scores[0][i] for i in top_article_indices]

        return top_articles, top_similarity_scores

    elif len(relevant_articles) > 0: # outputs all the articles if there are fewer than 5 relevant articles
        # Transform the question into a vector
        question_vector = vectorizer.transform([question])

        # Compute cosine similarity between question and relevant articles
        similarity_scores = cosine_similarity(question_vector, article_vectors[relevant_articles])

        # Select the indices of the top ranked articles
        top_article_indices = np.argsort(similarity_scores[0])[::-1]

        # Select all relevant articles, ordered by similarity
        top_articles = [articles[relevant_articles[i]] for i in top_article_indices]

        # Extract similarity scores for the top articles
        top_similarity_scores = [similarity_scores[0][i] for i in top_article_indices]

        return top_articles, top_similarity_scores

    else:
        return [], [] # outputs empty lists if the relevant articles are none.


This function `select_top_articles` selects the top articles relevant to a given question based on cosine similarity between the question and article content, making it useful for natural language processing tasks such as information retrieval, question answering and dialogue systems.

Here's the breakdown of the code:

* The function extracts named entities from the question using spaCy's NER model.

* It initializes a TF-IDF vectorizer and fits it to the articles, transforming them into TF-IDF vectors.

* It iterates over each named entity in the question and checks if it exists in the `article_entities` index. If found, it adds the relevant articles to the list.

* Duplicate articles are removed from the list of relevant articles.

* If there are 5 or more relevant articles, it computes the cosine similarity between the question and these articles and selects the top 5 articles with the highest similarity scores.

* If there are fewer than 5 relevant articles, it selects all relevant articles and computes their cosine similarity with the question.

* If there are no relevant articles, it returns an empty list for both articles and similarity scores.

* The function returns a tuple containing the top articles relevant to the question and their corresponding similarity scores.
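The ranking step can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: it uses raw term counts rather than TF-IDF weights, and `rank_candidates`, `cosine` and the sample articles are hypothetical names and data introduced for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_candidates(question, articles, candidate_ids, top_k=5):
    """Rank the candidate articles by similarity to the question."""
    q_vec = Counter(question.lower().split())
    scored = [
        (idx, cosine(q_vec, Counter(articles[idx].lower().split())))
        for idx in candidate_ids
    ]
    # Highest similarity first; slicing also handles fewer than top_k candidates
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical miniature corpus
articles = [
    "the subway opened in new york",
    "republicans return to washington",
    "a new subway line for the city",
]
ranked = rank_candidates("when was the subway opened", articles, candidate_ids=[0, 2])
print(ranked)
```

Only the candidates returned by the entity lookup are scored, which mirrors how the function restricts the similarity computation to `relevant_articles`.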


### Article Indexing:

Article Indexing provides an indexing method to make it easier and faster to retrieve relevant articles based on user questions.


In [36]:

def index_articles(df, column):
    """
    Indexes the articles based on named entities extracted from the specified column of the DataFrame.

    Args:
        df (DataFrame): DataFrame containing the articles.
        column (str): Name of the column containing the text of the articles.

    Returns:
        dict: Dictionary containing named entity index where keys are entity labels and values are dictionaries with entity names as keys and list of article indices as values.

    """

    # Initialize named entity index dictionary
    named_entity_index = {}

    # Load the spaCy NER model once, outside the loop, to avoid reloading it for every article
    nlp = spacy.load("en_core_web_sm")

    # Iterate over each row in the DataFrame
    for index, row in df.iterrows():

        # Extract article ID and text
        article_id = index
        article_text = row[column]

        # Process the article text with spaCy
        doc = nlp(article_text)

        # Extract named entities from the article
        entities = [(ent.text, ent.label_) for ent in doc.ents]

        # Update named entity index; entity text is lowercased so that it
        # matches the lowercased lookup performed in select_top_articles
        for entity, label in entities:
            entity = entity.lower()
            if label not in named_entity_index:
                named_entity_index[label] = {}
            if entity not in named_entity_index[label]:
                named_entity_index[label][entity] = []
            named_entity_index[label][entity].append(article_id)

    return named_entity_index


This function `index_articles` is designed to index the articles based on named entities extracted from a specified column of a DataFrame. This function efficiently indexes the articles based on named entities, enabling fast retrieval of articles related to specific entities.

Here's a breakdown of the code:

* An empty dictionary named `named_entity_index` is initialized to store the named entity index.

* The function iterates over each row in the DataFrame using `df.iterrows()`.

* The spaCy NER model (`en_core_web_sm`) is loaded for named entity recognition.

* The index and text of the current row are extracted.

* The article text is processed using the spaCy NER model to extract named entities.

* The named entities extracted from the article are added to the named entity index dictionary, where the keys are entity labels and the values are dictionaries with entity names as keys and lists of article indices as values.

* Finally, the function returns the named entity index containing the indexed articles based on named entities.
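The shape of the index can be sketched with pre-extracted entities, so that the example runs without spaCy. The `(text, label)` pairs below are hypothetical stand-ins for what `doc.ents` would provide, and `build_entity_index` is a name introduced for this illustration.

```python
from collections import defaultdict

def build_entity_index(entities_per_article):
    """Build a label -> entity text -> [article ids] index."""
    index = defaultdict(lambda: defaultdict(list))
    for article_id, entities in enumerate(entities_per_article):
        for text, label in entities:
            ids = index[label][text.lower()]
            # Avoid listing the same article twice for a repeated entity
            if article_id not in ids:
                ids.append(article_id)
    # Convert to plain dicts so the result can be serialised to JSON
    return {label: dict(names) for label, names in index.items()}

# Hypothetical (text, label) pairs, one list per article
entities_per_article = [
    [("Paris", "GPE"), ("Islamic State", "ORG")],
    [("Bronx", "GPE")],
    [("Paris", "GPE"), ("Paris", "GPE")],
]
index = build_entity_index(entities_per_article)
print(index["GPE"]["paris"])
```

Unlike `index_articles`, this sketch skips duplicate article ids at insertion time; the notebook instead removes duplicates later, inside `select_top_articles`.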




In [37]:
import json

# Open and load the named_entity_index.json file as a dictionary
with open('named_entity_index.json', 'r') as json_file:
    named_entity_index = json.load(json_file)

This code block loads the contents of the 'named_entity_index.json' file into a Python dictionary named `named_entity_index`. Because extracting named entities from the same dataset repeatedly is time-consuming, we saved the index as a JSON file once and load it here for use when selecting the top articles.


Here's a breakdown of the code:

* The `json` library is imported, which provides functions for reading and writing JSON data.

* The `open()` function is used to open the 'named_entity_index.json' file in read mode (`'r'`). The file is then loaded using `json.load()` function, which parses the JSON data from the file and returns it as a Python dictionary.

* The loaded dictionary is assigned to the variable `named_entity_index`, which can now be used to access the indexed articles based on named entities.
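A minimal sketch of the save-then-reload pattern is shown below. The miniature index and the temporary file location are assumptions made for illustration; the notebook does not show the cell that originally wrote 'named_entity_index.json'.

```python
import json
import os
import tempfile

# Hypothetical miniature index with the label -> entity -> [article ids] shape
index = {"GPE": {"paris": [0, 2], "bronx": [1]}}

# Write the index once...
path = os.path.join(tempfile.mkdtemp(), "named_entity_index.json")
with open(path, "w") as json_file:
    json.dump(index, json_file)

# ...and reload it in later sessions instead of re-running NER
with open(path, "r") as json_file:
    reloaded = json.load(json_file)

print(reloaded == index)
```

The round trip is lossless here because all dictionary keys are strings and the article indices are plain integers.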




### Model Selection:

Model selection plays a critical role in the performance of the system. We use a BERT model fine-tuned on SQuAD for extractive question answering.

In [38]:
# Load pre-trained BERT model for question answering
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


This code block loads a pre-trained BERT model and tokenizer for question answering tasks using the Hugging Face transformers library. Here's a breakdown of the code:

* The `BertForQuestionAnswering.from_pretrained()` function is used to load a pre-trained BERT model fine-tuned for question answering. The model is initialized with weights pre-trained on large text corpora and fine-tuned on the SQuAD (Stanford Question Answering Dataset) dataset.

* The `BertTokenizer.from_pretrained()` function is used to load a pre-trained BERT tokenizer corresponding to the model. The tokenizer is responsible for converting input text into tokens that the model can understand.

* The loaded model is assigned to the variable `model`, and the loaded tokenizer is assigned to the variable `tokenizer`. These variables can now be used to perform question answering tasks using the BERT model.



### Coreference Resolution:

Coreference Resolution is needed to resolve coreferences in the text, helping identify which entity a phrase in the sentence refers to.

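In our code we do not implement this utility separately. We rely on the BERT model's contextual representations, which implicitly handle many references within a passage; no explicit coreference resolution step is performed.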

### Responses in NLP:

This utility focuses on generating natural-language responses from the original question and the relevant articles containing the answer, so that the responses are understandable to the user.

In [39]:
def answer_question(question, dataset, max_chunk_size=500):
    '''
    Takes a `question` string and a DataFrame `dataset` containing articles,
    identifies the answer to the question within the articles, and returns it along with the confidence score.

    Args:
        question (str): The question to be answered.
        dataset (DataFrame): DataFrame containing articles.
        max_chunk_size (int, optional): Maximum chunk size for splitting the articles. Defaults to 500.

    Returns:
        dict: A dictionary containing the answer text and its confidence score.

    '''
    # Creates an empty dictionary
    output = {}

    # Select the top articles relevant to the question
    articles = list(dataset['article'])
    RelevantArticlesList, SimilarityScoresList = select_top_articles(question, articles, named_entity_index)
    print(f"{len(RelevantArticlesList)} articles are selected, the similarity scores for the selected articles: {SimilarityScoresList}")
    
    if not RelevantArticlesList:
        # If no relevant articles are found, store an appropriate message and a NaN score in the output dictionary
        output['text'] = "No suitable answer found"
        output['score'] = np.nan
        return output

    # Concatenate the selected articles into a single context string
    answer_text = " ".join(RelevantArticlesList)

    # Tokenize question
    question_tokens = tokenizer.tokenize(question)

    # Split answer_text into chunks
    answer_chunks = [answer_text[i:i+max_chunk_size] for i in range(0, len(answer_text), max_chunk_size)]

    # Initialize variables to store the best answer and its score
    best_answer = None
    best_score = float('-inf')

    # Iterate through answer chunks
    for chunk in answer_chunks:
        # Tokenize answer chunk
        chunk_tokens = tokenizer.tokenize(chunk)

        # Combine question and answer chunk tokens
        tokens = ['[CLS]'] + question_tokens + ['[SEP]'] + chunk_tokens + ['[SEP]']

        # Convert tokens to ids
        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # Set segment ids
        segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(chunk_tokens) + 1)

        # Ensure segment_ids and input_ids have the same length
        assert len(segment_ids) == len(input_ids)

        # Convert to tensors
        input_ids_tensor = torch.tensor([input_ids])
        segment_ids_tensor = torch.tensor([segment_ids])

        # Run the model without tracking gradients (inference only)
        with torch.no_grad():
            model_scores = model(input_ids_tensor, token_type_ids=segment_ids_tensor)
        start_scores = model_scores.start_logits
        end_scores = model_scores.end_logits

        # Get the best answer from this chunk
        answer_start = torch.argmax(start_scores).item()
        answer_end = torch.argmax(end_scores).item()
        answer_score = start_scores[0][answer_start] + end_scores[0][answer_end]

        # Skip invalid spans where the predicted end precedes the start
        if answer_end < answer_start:
            continue

        # If this answer is better than the previous best answer, update the best answer and its score
        if answer_score > best_score:
            best_answer = tokens[answer_start:answer_end+1]
            best_score = answer_score

    # No chunk produced a valid answer span
    if not best_answer:
        output['text'] = "No suitable answer found"
        output['score'] = np.nan
        return output

    # Combine tokens into a single string and reconstruct the answer
    answer = best_answer[0]
    for token in best_answer[1:]:
        # If it's a subword token, then recombine it with the previous token.
        if token.startswith('##'):
            answer += token[2:]
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + token

    # Store the answer text and its confidence score in the output dictionary
    output['text'] = answer
    output['score'] = best_score

    return output


This function `answer_question` is designed to answer a given question by identifying the relevant information from a dataset of articles and returning the answer along with its confidence score. This function efficiently identifies the answer to a given question within a dataset of articles using a pre-trained BERT model.

Here's a breakdown of the code:

* Relevant articles are selected based on the question using the `select_top_articles` function.

* The question is tokenized, and the answer text is split into chunks of 500 characters (not tokens). BERT accepts at most 512 tokens per input, and short character-based chunks keep the combined question and chunk within that limit in practice.

* The variables `best_answer` and `best_score` are initialized to `None` and negative infinity respectively, to track the best answer and its score across all chunks.

* The function iterates through each chunk of the answer text and tries to find the answer within the given chunk.

* For each chunk, the BERT model produces start and end logits, and the highest-scoring start and end positions define a candidate answer span.

* The score of a candidate span is the sum of its start and end logits. If a chunk achieves a higher score than the current best, the function updates the best score and best answer.

* The tokens representing the best answer span are combined to reconstruct the answer.

* The function returns a dictionary containing the answer text and its confidence score.
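Two of the steps above, character-based chunking and WordPiece recombination, can be sketched in isolation without loading the model. The helper names `chunk_text` and `join_wordpieces` and the sample tokens are hypothetical and introduced only for this illustration.

```python
def chunk_text(text, max_chunk_size=500):
    """Split text into consecutive chunks of at most max_chunk_size characters."""
    return [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]

def join_wordpieces(tokens):
    """Merge WordPiece tokens, attaching '##' continuation pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

chunks = chunk_text("a" * 1200, max_chunk_size=500)
print([len(c) for c in chunks])

# Hypothetical token sequence in WordPiece style
print(join_wordpieces(["ancient", "city", "of", "pal", "##my", "##ra"]))
```

One consequence of fixed-size character chunks is that a word or an answer span can be cut at a chunk boundary; overlapping chunks would be one possible mitigation, which the notebook does not implement.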




In [40]:
output = answer_question("Rahul Gandhi is the Prime Minister of which country.", data, max_chunk_size=500)

0 articles are selected, the similarity scores for the selected articles: []


In [41]:
output

{'text': 'No suitable answer found', 'score': nan}

### Named Entity Recognition:

The purpose of NER is to identify named entities in the text, such as people, organizations, and locations.

In [42]:
def get_named_entities(text):
    """
    Extracts named entities from the text using spaCy.

    Args:
        text (str): The input text from which named entities are to be extracted.

    Returns:
        list: A list of tuples containing named entity labels and their corresponding text.

    """
    # Process the input text with spaCy
    doc = nlp(text)
    
    # Extract named entities and their labels
    named_entities = [(ent.label_, ent.text) for ent in doc.ents]
    
    return named_entities

This function `get_named_entities` is designed to extract named entities from a given text using the spaCy library. This function efficiently identifies and extracts named entities from a given text using spaCy, making it useful for various natural language processing tasks.

Here's a breakdown of the code:

* The input text is processed using spaCy's NER (Named Entity Recognition) model, creating a `Doc` object.

* Named entities and their corresponding labels are extracted from the `Doc` object using list comprehension.

* The function returns a list of tuples, where each tuple contains a named entity label and its corresponding text extracted from the input text.

In [43]:

def cosine_similarity_score(text1, text2):
    """
    Calculates the cosine similarity between two texts using TF-IDF vectors.

    Args:
        text1 (str): The first text.
        text2 (str): The second text.

    Returns:
        float: The cosine similarity score between the two texts.
    """
    # Initialize TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer()

    # Fit the vectorizer and transform the texts into TF-IDF vectors
    tfidf_matrix = tfidf_vectorizer.fit_transform([text1, text2])

    # Compute cosine similarity between the two TF-IDF vectors
    similarity_score = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]

    return similarity_score


This function `cosine_similarity_score` calculates the cosine similarity between two texts using TF-IDF (Term Frequency-Inverse Document Frequency) vectors. Here's a breakdown of the code:

* The function initializes a TF-IDF vectorizer.

* The TF-IDF vectorizer is fitted on the provided texts and then used to transform them into TF-IDF vectors.

* The cosine similarity between the TF-IDF vectors of the two texts is computed using the `cosine_similarity` function.

* The computed cosine similarity score is returned as the output of the function.
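For reference, the cosine similarity between two TF-IDF vectors $\mathbf{a}$ and $\mathbf{b}$ is:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

Since TF-IDF weights are non-negative, the score lies between 0 (no shared terms) and 1 (identical direction). With the default settings of `TfidfVectorizer`, each row is L2-normalised, so the denominator equals 1 and the score reduces to the dot product. Note that the vectorizer here is fitted on only the two input texts, so the inverse document frequencies reflect just those two documents.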


In [44]:
def check_similar_named_entities(prev_question, question):
    """
    Checks if there are similar named entities between two questions.

    Args:
        prev_question (str): The previous question.
        question (str): The current question.

    Returns:
        bool: True if there are similar named entities, False otherwise.
    """
    # Extract named entities from the current question
    named_entities_question = get_named_entities(question)
    
    # Extract named entities from the previous question
    named_entities_prev_question = get_named_entities(prev_question)
    
    # Initialize a checklist to store similarity checks
    checklist = []
    
    # Compare named entities between the current and previous questions
    for entity in named_entities_question:
        for prev_entity in named_entities_prev_question:
            checklist.append(entity == prev_entity)
    
    # Return True if any similar named entities are found, False otherwise
    return any(checklist)

This function `check_similar_named_entities` is designed to check if there are similar named entities between two questions. This function efficiently compares named entities between two questions and determines if there are any similarities, making it useful for tasks such as identifying related questions in a question-answering system.

Here's a breakdown of the code:

* Named entities are extracted from both the current and previous questions using the `get_named_entities` function.

* A checklist is initialized to store similarity checks between named entities.

* The function iterates through each named entity in the current question and compares it with each named entity in the previous question. Entities match only if both the label and the text are exactly equal, and the result of each comparison (`True` or `False`) is appended to the checklist.

* The function returns True if any similar named entities are found in the checklist, indicating that there are similar named entities between the two questions. Otherwise, it returns False.
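The same check can be expressed as a set intersection. The sketch below works on pre-extracted `(label, text)` pairs so that it runs without spaCy; `share_named_entity` and the sample entities are hypothetical and not part of the notebook.

```python
def share_named_entity(entities_a, entities_b):
    """Return True if the two lists of (label, text) pairs have any pair in common."""
    return bool(set(entities_a) & set(entities_b))

# Hypothetical (label, text) pairs in the format returned by get_named_entities
prev_entities = [("GPE", "France")]
curr_entities = [("GPE", "France")]
other_entities = [("GPE", "Spain")]

print(share_named_entity(prev_entities, curr_entities))
print(share_named_entity(prev_entities, other_entities))
```

Because matching is exact, variants such as "US" and "United States" would not be treated as the same entity by either version.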



### New or Continuation Question by User:

Checking for repeated or continued questions is important because it helps maintain context and coherence in a question-answering system. 

<!-- By identifying when a new question is a repetition or continuation of a previous question, the system can:

- Provide more relevant and consistent responses by incorporating information from the previous related question(s).
- Avoid redundant processing and improve the overall efficiency of the system.
- Enhance the user experience by making the conversation more natural and contextual, rather than treating each question in isolation. -->

In [45]:
def check_repeated_question(question, question_history):
    """
    Checks if a question is repeated or in continuation with any other question
    in a question_history list based on cosine similarity and named entity recognition.
    If a match is found, returns a new question that is concatenated with the previous similar question.

    Args:
        question (str): The current question to be checked.
        question_history (list): A list of previous questions.

    Returns:
        str: The processed question (either concatenated or unchanged).
    """
    

    # Check if the question history is not empty
    if question_history:
        similarity_scores = []
        # Iterate through each previous question in the question history
        for prev_question in question_history:
            # Check if the current question shares similar named entities with any previous question
            if check_similar_named_entities(prev_question, question):
                # Calculate cosine similarity score between the current question and the previous similar question
                score = cosine_similarity_score(prev_question, question)
                # Append the previous similar question and its cosine similarity score to the list
                similarity_scores.append((prev_question, score))
        
        # If similarity scores exist
        if similarity_scores:
            print("The question is continuation of a previous question")
            # Find the index of the most similar question based on the cosine similarity score
            most_similar_question_index = np.argsort([score for ques, score in similarity_scores])[-1]
            # Extract the most similar question
            most_similar_question = similarity_scores[most_similar_question_index][0]
            # Concatenate the most similar question with the current question
            output = most_similar_question + " " + question
            # Append the current question to the question history
            question_history.append(question)
            return output
        else:
            # If no similar question is found, return the current question
            output = question
            # Append the current question to the question history
            question_history.append(question)
            return output
    else:
        # If question history is empty, return the current question
        output = question
        # Append the current question to the question history
        question_history.append(question)
        return output

This function `check_repeated_question` checks if a question is repeated or in continuation with any other question in a question history list based on cosine similarity and named entity recognition, making it useful for maintaining question coherence in dialogue systems.

Here's a breakdown of the code:

* If the question history is not empty, the function iterates through each previous question in the question history.

* A list `similarity_scores` is initialized to store tuples of previous similar questions and their cosine similarity scores.

* For each previous question, the function checks if the current question shares similar named entities using the `check_similar_named_entities` function.

* If similar named entities are found, the function calculates the cosine similarity score between the current question and the previous similar question using the `cosine_similarity_score` function.

* The function concatenates the most similar previous question with the current question and returns it as the output.

* The current question is appended to the question history list.

* If no similar question is found, the function returns the current question unchanged and appends it to the question history.


In [46]:
# Example usage:
question_history = [
    "What is the capital of France?",
    "Who is the president of the United States?",
    "What is the largest planet in the solar system?",
]

new_question = "Who leads France?"
result = check_repeated_question(new_question, question_history)
print(result)  # Output: "What is the capital of France? Who leads France?"

new_question = "What is the capital of Spain?"
result = check_repeated_question(new_question, question_history)
print(result)  # Output: "What is the capital of Spain?"


The question is continuation of a previous question
What is the capital of France? Who leads France?
What is the capital of Spain?


### Continuous Question Answering:

Continuous Question Answering enables users to interact with the question-answering system by entering questions and allows them to decide when to end a conversation, thereby enhancing the user experience. This utility improves the efficiency and effectiveness of information retrieval and also helps the system demonstrate a more advanced conversational and reasoning ability.

In [48]:
def continuous_question_answering(data):
    """
    Performs continuous question-answering in a loop, handling repeated or continued questions.

    Args:
        dataset (DataFrame): The dataset containing the articles.

    """
    # Initialize the question history and the question counter once, before the loop
    question_history = []
    count = 0
    while True:
        # Prompt user for a question
        user_input = input("Ask a question (type 'quit' to exit): ").strip()
        count += 1

        # Print the question
        print("\n User Question", count, ":")
        print(user_input)
        
        # Check if user wants to quit
        if user_input.lower() == 'quit':
            print("Exiting...")
            break

        # Check if the question is a continuation
        result = check_repeated_question(user_input, question_history)
        print(f" Similar Question by User: {result}")
        
        # Answer the question
        answer = answer_question(result, data)

        # Print the answer
        print("Answer for Question:", count, ":", answer["text"])
        if isinstance(answer["score"], torch.Tensor):
            print("Score:", answer["score"].item())
        else:
            print("Score:", answer["score"])

The `continuous_question_answering` function runs the question-answering system in an interactive loop: the user enters questions one at a time and decides when to end the conversation. It works as follows:

* The question history and the question counter are initialized once, before the loop starts.

* The function enters an infinite loop using `while True`.

* Each iteration prompts the user to input a question, with an option to quit by typing 'quit'.

* *Question Handling:*
    - The question counter is incremented and the user's question is printed.
    - If the user inputs 'quit' (in any letter case), the loop is exited and the function terminates.
    - The function checks if the question is a continuation or repetition using the `check_repeated_question` function.
    - The function then answers the resulting question using the `answer_question` function.
    - The answer along with its score is printed; a `torch.Tensor` score is converted to a plain number with `.item()` first.
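
For illustration, the same loop can be written so that it runs without a keyboard or a model. The sketch below is a variant under stated assumptions, not the notebook's function: `qa_loop`, `read_question`, `rewrite`, `answer`, `keep_question` and `canned_answer` are hypothetical names, the callables stand in for `input`, `check_repeated_question` and `answer_question`, and, unlike the original, it checks for 'quit' before counting and printing the question.

```python
def qa_loop(read_question, rewrite, answer):
    """Run the question loop with injected callables and return the collected answers."""
    question_history = []
    results = []
    count = 0
    while True:
        user_input = read_question().strip()
        # Design choice in this sketch: stop before counting or echoing 'quit'
        if user_input.lower() == "quit":
            print("Exiting...")
            break
        count += 1
        print(f"\n User Question {count} :")
        print(user_input)

        # Stand-in for check_repeated_question(user_input, question_history)
        result = rewrite(user_input, question_history)
        # Stand-in for answer_question(result, data)
        reply = answer(result)

        score = reply["score"]
        # Objects with .item() (for example tensors) are unwrapped to plain numbers
        if hasattr(score, "item"):
            score = score.item()
        print(f"Answer for Question: {count} : {reply['text']}")
        print("Score:", score)
        results.append((result, reply["text"], score))
    return results


def keep_question(question, history):
    """Toy rewrite step: record the question and return it unchanged."""
    history.append(question)
    return question


def canned_answer(question):
    """Toy answer step with a fixed reply."""
    return {"text": "no model attached", "score": 0.0}


scripted = iter(["How old was he?", "QUIT"])
collected = qa_loop(lambda: next(scripted), keep_question, canned_answer)
```

Passing the input and answer steps in as arguments keeps the control flow identical while letting a fixed list of questions drive the loop.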

In [49]:
continuous_question_answering(data)


 User Question 1 :
Narendra Modi is the prime minister of which country?
 Similar Question by User: Narendra Modi is the prime minister of which country?
4 articles are selected, the similarity scores for the selected articles: [0.3899812538838734, 0.15021582739263104, 0.1433876741095703, 0.11314473022005117]
Answer for Question: 1 : india
Score: 13.876739501953125

 User Question 2 :
Rahul Gandhi is the prime minister of which country.
 Similar Question by User: Rahul Gandhi is the prime minister of which country.
0 articles are selected, the similarity scores for the selected articles: []
Answer for Question: 2 : No suitable answer found
Score: nan

 User Question 3 :
Name the muslim countries that were suspended immigration for at least 30 days by the united states.
 Similar Question by User: Name the muslim countries that were suspended immigration for at least 30 days by the united states.
5 articles are selected, the similarity scores for the selected articles: [0.26895413509999

### Testing:

To evaluate the performance of the system, we devised a series of test cases to assess its capabilities thoroughly.
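
One way to make such test cases repeatable is to script them instead of typing each question. The sketch below is a hypothetical harness, not part of the original notebook: `run_test_case` and `fake_answer` are assumed names, `fake_answer` stands in for a real `answer_question(question, data)` call, and the expected string "seesawed" is an illustrative choice taken from the article text rather than a recorded result.

```python
def run_test_case(cases, answer_fn):
    """Return (number passed, list of failed questions) for (question, expected) pairs."""
    failures = []
    for question, expected in cases:
        text = answer_fn(question)["text"]
        # A case passes when the expected string occurs in the answer, ignoring case
        if expected.lower() not in text.lower():
            failures.append(question)
    return len(cases) - len(failures), failures


def fake_answer(question):
    """Stand-in for answer_question(question, data); replies are hard-coded."""
    canned = {
        "rafsanjani died at what age?": "82",
        "How was rafsanjani's career?": "No suitable answer found",
    }
    return {"text": canned.get(question, "No suitable answer found"), "score": float("nan")}


cases = [
    ("rafsanjani died at what age?", "82"),
    ("How was rafsanjani's career?", "seesawed"),
]
passed, failed = run_test_case(cases, fake_answer)
print(f"{passed}/{len(cases)} passed; failed: {failed}")
```

Substring matching is a deliberately loose criterion here; a stricter comparison may be more appropriate for short factual answers.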

#### Test Case 1: All questions from the SAME article

In [None]:
# Test case 1: All questions from the same article 
actual_articles = list(data['article'])
pprint.pprint(actual_articles[159])

('Ayatollah Ali Akbar Hashemi Rafsanjani, a former president of Iran and a '
 'founder of the Islamic republic, who navigated the opaque shoals of his '
 'country?s theocracy as one of its most enduring, wiliest and wealthiest '
 'leaders, died on Sunday in Tehran. He was 82. His death was announced by '
 'Iranian state television. As his career seesawed through periods of '
 'revolutionary zeal and confrontation with powerful conservative rivals, he '
 'was portrayed as a Machiavellian and often ruthless player in the power '
 'struggles among Iran?s elite factions, protected by his close association '
 'with Ayatollah Ruhollah Khomeini, the revolutionary leader who overthrew the '
 'shah in 1979. Known as a pragmatist and centrist inclined toward economic '
 'liberalism and political authoritarianism, Mr. Rafsanjani was accused by '
 'critics of corruption in amassing his fortune and of a readiness for harsh '
 'tactics to deal with dissent at home and abroad. Argentina has accused M

In [None]:
continuous_question_answering(data)


 User Question 1 :
Who is the former President of Iran and a founder of Islamic Republic?
 Similar Question by User: Who is the former President of Iran and a founder of Islamic Republic?
5 articles are selected, the similarity scores for the selected articles: [0.4024834508115584, 0.3025372676455846, 0.30137646436334037, 0.28907113182223587, 0.2772868993879221]
Answer for Question: 1 : rafsanjani
Score: 15.655486106872559

 User Question 2 :
rafsanjani died at what age?
 Similar Question by User: rafsanjani died at what age?
2 articles are selected, the similarity scores for the selected articles: [0.4952567445571156, 0.32895307773167204]
Answer for Question: 2 : 82
Score: 14.898737907409668

 User Question 3 :
How was rafsanjani's career?
 Similar Question by User: How was rafsanjani's career?
0 articles are selected, the similarity scores for the selected articles: []
Answer for Question: 3 : No suitable answer found
Score: nan

 User Question 4 :
Close associations of rajsanjani
 

#### Test Case 2: All questions from DIFFERENT articles

In [None]:
# Test Case 2: All questions from different articles
pprint.pprint(actual_articles[6])
pprint.pprint(actual_articles[65])
pprint.pprint(actual_articles[165])
pprint.pprint(actual_articles[655])
pprint.pprint(actual_articles[265])

('In the technology industry, the sharks have never long been safe from the '
 'minnows. Over much of the last 40 years, the biggest players in tech  ?   '
 'from IBM to   to Cisco to Yahoo  ?   were eventually outmaneuvered by   that '
 'came out of nowhere. The dynamic is so dependable that it is often taken to '
 'be a kind of axiom. To grow large in this business is also to grow slow, '
 'blind and dumb, to become closed off from the very sources of innovation '
 'that turned you into a shark in the first place. Then, in the last half '
 'decade, something strange happened: The sharks began to get bigger and '
 'smarter. Nearly a year ago, I argued that we were witnessing a new era in '
 'the tech business, one that is typified less by the storied   in a garage '
 'than by a posse I like to call the Frightful Five: Amazon, Apple, Facebook, '
 'Microsoft and Alphabet, Google?s parent company. Together the Five compose a '
 'new superclass of American corporate might. For much of las

In [None]:
continuous_question_answering(data)


 User Question 1 :
Who are the Frightful Five, biggest players in tech?
 Similar Question by User: Who are the Frightful Five, biggest players in tech?
5 articles are selected, the similarity scores for the selected articles: [0.3519142623182059, 0.123156368864995, 0.12267018726679658, 0.09561593723325566, 0.08485832689487187]
Answer for Question: 1 : amazon , apple , facebook , microsoft and alphabet , google ? s parent company
Score: 12.376176834106445

 User Question 2 :
What happened to Violeta Lagunes of Mexico on May 18?
 Similar Question by User: What happened to Violeta Lagunes of Mexico on May 18?
5 articles are selected, the similarity scores for the selected articles: [0.16759382702178313, 0.15093595239163696, 0.1415393444634669, 0.13213901567441072, 0.1135108554228075]
Answer for Question: 2 : her email login represented a point of vulnerability
Score: 9.74312973022461

 User Question 3 :
Why did the Israeli ambassador, Mark Regev apologized for?
 Similar Question by User:

#### Test Case 3: Mix of related and unrelated questions

In [None]:
# Test Case 3: Mix of related and unrelated questions.
pprint.pprint(actual_articles[567])
pprint.pprint(actual_articles[568])
pprint.pprint(actual_articles[574])
pprint.pprint(actual_articles[683])
pprint.pprint(actual_articles[953])

('When you are traveling solo, it?s not always a breeze to strike up a '
 'conversation with a stranger. In fact, how do you meet other single '
 'travelers or locals in the first place? And if you?re looking for '
 'friendship  ?   or even something more  ?   how do you ensure that amid all '
 'the fun you don?t neglect to take safety precautions? Before we get to '
 'tactics, it?s helpful to know that you are likely to be rewarded for '
 'overcoming apprehensions about approaching someone new when you?re on the '
 'road. ?Its easy to imagine all the ways things will go badly or believe that '
 'this person doesn?t want to connect,? said Nicholas Epley, a professor of '
 'behavioral science at the University of Chicago Booth School of Business. '
 'But if you reach out, he continued, ?almost everybody reaches back. ? Social '
 'scientists have found that making such connections, whether traveling or '
 'not, boosts happiness, and yet strangers in proximity ?routinely ignore each '
 'o

In [None]:
continuous_question_answering(data)


 User Question 1 :
What was the research of professor Nicholas Epley of University of Chicago about?
 Similar Question by User: What was the research of professor Nicholas Epley of University of Chicago about?
1 articles are selected, the similarity scores for the selected articles: [0.24549158722431869]
Answer for Question: 1 : we underestimate other people ? s interest in connecting
Score: 8.584796905517578

 User Question 2 :
What did the professor Epley and Juliana Schroeder wrote in their journal?
 Similar Question by User: What did the professor Epley and Juliana Schroeder wrote in their journal?
1 articles are selected, the similarity scores for the selected articles: [0.22766644668061364]
Answer for Question: 2 : strangers in proximity ? routinely ignore each other
Score: 5.894264221191406

 User Question 3 :
What does Dr. Epley suggest about breaking the ice?
 Similar Question by User: What does Dr. Epley suggest about breaking the ice?
0 articles are selected, the similarity