# Context Retrieval Introduction

**Introduction to Contextual Question Answering**

In the development of question-answering systems, having a context for each question is often crucial. Typically, models are trained to understand and retrieve answers within a given context. However, our dataset contains only questions and answers, without any additional context. This presents a challenge: how can we train an effective question-answering model without inherent contextual information?

**Importance of Context**

Context is often vital because it provides the background information needed to understand a question and answer it accurately. Without context, the model may struggle to grasp the nuances and specifics of the questions, leading to lower accuracy and reliability.

**Challenges without context**



1.   **Ambiguity**: questions can be ambiguous without context.
2.   **Relevance**: without context, it is challenging to determine which information is relevant to the question.
3.   **Depth of understanding**: context allows the model to provide more comprehensive and detailed answers.

**Proposed Solutions**


1. **Using External Sources for Context Retrieval**

 To address the absence of context in our dataset, we propose a method that involves retrieving relevant context from external sources, specifically leveraging the PubMed API to access medical and biomedical papers and their abstracts. Here's an overview of our approach:


 -  **Keyword Extraction**: for each question, we employ KeyBERT, a keyword extraction tool built on a pre-trained model, to extract a set of keywords that capture the essence of the question. These keywords serve as the query terms for retrieving relevant papers and their contexts.

 - **Context Retrieval**: using the extracted keywords as queries, we utilize the PubMed API to search through a vast repository of medical literature. This step aims to retrieve potential contexts from papers and their abstracts, ensuring that the context is domain-specific and directly relevant to the question at hand.

 - **Preprocessing and Cleaning**: the retrieved contexts undergo preprocessing and cleaning to ensure they are in the appropriate format for further analysis. This preprocessing involves removing irrelevant information, normalizing text, and ensuring consistency across retrieved contexts.

 - **Context Ranking**: to prioritize the retrieved contexts, we employ various ranking techniques:

    - **TF-IDF and BM25 Index Search**: These methods assess the importance of terms based on their frequencies within the retrieved contexts and the question itself. This allows us to build a document search engine that matches contexts to a given query (question).

    - **Document vectorization**: utilizing a TF-IDF vectorizer, we convert the textual documents into numerical vectors based on the relevance of the words and how often they appear in the documents, enabling quantitative comparison.

    - **Sentence Embedding and Cosine Similarity**: We compute sentence embeddings for both the question and the retrieved contexts, then evaluate the cosine similarities between them to obtain similarity scores. This process allows us to rank the contexts based on their relevance to the question. Unlike the previous methods that only consider term frequency, this approach captures the semantic meaning and relationships between words within the sentences.

The last method yielded the best results, as it preserves the semantic relationships between words. Because it considers the meaning of terms and not only their frequency, it aligns the question with the retrieved contexts more accurately. The selected contexts are therefore relevant both in topic and in underlying meaning, which leads to more precise and informative answers.
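
To make the ranking step concrete, here is a minimal sketch of cosine-similarity ranking. It assumes the question and each context have already been turned into embedding vectors by some sentence-embedding model; the three-dimensional vectors below are made-up values for illustration only, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_contexts(question_vec, context_vecs):
    """Return (context_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(question_vec, c)) for i, c in enumerate(context_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy "embeddings" (hypothetical values, not produced by a real model).
question_embedding = [1.0, 0.0, 1.0]
context_embeddings = [
    [0.0, 1.0, 0.0],  # orthogonal to the question
    [2.0, 0.0, 2.0],  # same direction as the question
    [1.0, 1.0, 0.0],  # partially aligned
]
ranking = rank_contexts(question_embedding, context_embeddings)
```

Cosine similarity depends only on the direction of the vectors, so the second context ranks first even though its vector is longer than the question's.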



2.  **Fine-Tuning Pre-Trained Models**
Another solution is to leverage pre-trained models and fine-tune them for our question-answering task. This approach takes advantage of the knowledge these models have already acquired during pre-training.


3. **Leveraging Open-Domain Pre-Trained Models**
In addition to domain-specific models, we can also consider open-domain pre-trained models. Literature shows that models trained on vast, open-domain knowledge have performed well on question-answering tasks after fine-tuning.

**Our Work**

In this notebook, we retrieve external context for each question in our dataset. The goal is to pair every question with a context, which is a crucial asset when training models that rely on contextual information.

## Import and Install

Let's install some useful libraries

In [None]:
!pip install -q python-terrier

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

In [None]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.83-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: biopython
Successfully installed biopython-1.83


In [None]:
!pip install keybert

Collecting keybert
  Downloading keybert-0.8.4.tar.gz (29 kB)
  Preparing metadata (setup.py) ... done
Collecting sentence-transformers>=0.3.8 (from keybert)
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers>=0.3.8->keybert)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers>=0.3.8->keybert)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers>=0.3.8->keybert)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->s

In [None]:
!pip install gensim



In [None]:
!git clone https://github.com/epfml/sent2vec.git
%cd sent2vec
!make
!pip install .

Cloning into 'sent2vec'...
remote: Enumerating objects: 425, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 425 (delta 9), reused 4 (delta 1), pack-reused 403
Receiving objects: 100% (425/425), 447.46 KiB | 1.75 MiB/s, done.
Resolving deltas: 100% (261/261), done.
/content/sent2vec
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/shmem_matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/qmatrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/fasttext.cc
c++ -pthread -std=

Let's import some useful libraries

In [None]:
from google.colab import drive
import os

from datasets import load_dataset
import numpy as np
import pandas as pd
import csv

from collections import Counter

import matplotlib.pyplot as plt

import re
import string
from string import punctuation

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

import pyterrier as pt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

from keybert import KeyBERT

from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


from Bio import Entrez

from sentence_transformers import SentenceTransformer

import sent2vec

## Setup

In [None]:
drive.mount('/content/drive')
os.chdir(f'/content/drive/MyDrive/Colab Notebooks/NLP/Assignment/datasets') #Teka's path
#os.chdir(f'/content/drive/MyDrive/Colab Notebooks/NLP/NLP Project/Datasets')# Alessandro's path
os.getcwd()

Mounted at /content/drive


'/content/drive/MyDrive/Colab Notebooks/NLP/Assignment/datasets'

The following constants act as flags that switch specific parts of the notebook on or off.

In [None]:
HUGGINGFACE = False
SAVE_DATASET = True
SET_RANDOM_INDEX = False

Load models and packages

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
kw_model = KeyBERT()


In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

terrier-assemblies 5.9 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.10.1 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8



## Utils

The function **preprocess_document** processes a tokenized document (a list of words) and applies optional text preprocessing steps according to the specified parameters. All steps are disabled by default. The parameters are:

* noPunctuation: If set to True, the function removes punctuation from each token using a regular expression.
* isSorted: If set to True, the function sorts the processed document alphabetically.
* isSet: If set to True, the function converts the processed document into a set, removing any duplicate elements.
* noStopWords: If set to True, the function removes common English stopwords from the processed document.
* lemmatization: If set to True, the function lemmatizes the words in the processed document using WordNet.

When several options are enabled, the steps run in this order: punctuation removal, conversion to a set, stopword removal, sorting, and lemmatization. Finally, the function returns the processed document.

In [None]:
def preprocess_document(document, noPunctuation=False, isSorted=False, isSet=False, noStopWords=False, lemmatization=False):
  # `document` is a list of tokens; each enabled step is applied in the order below
  newDocument = document
  if noPunctuation:
    # re.escape makes every punctuation character, including the backslash, match literally
    regex = '[' + re.escape(string.punctuation) + ']'
    newDocument = [re.sub(regex, '', item) for item in newDocument]
  if isSet:
    newDocument = set(newDocument)
  if noStopWords:
    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    newDocument = [w for w in newDocument if w not in stop_words]
  if isSorted:
    newDocument = sorted(newDocument)
  if lemmatization:
    lemmatizer = WordNetLemmatizer()
    newDocument = [lemmatizer.lemmatize(w) for w in newDocument]
  return newDocument
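
As a quick illustration of the punctuation step on its own, here is a small standalone sketch that needs no NLTK data. The helper name `strip_punctuation` is hypothetical. The sketch escapes the punctuation characters with `re.escape`, a precaution because `string.punctuation` contains characters such as `\`, `]` and `^` that have a special meaning inside a regular-expression character class.

```python
import re
import string

def strip_punctuation(tokens):
    """Remove every ASCII punctuation character from each token."""
    pattern = re.compile('[' + re.escape(string.punctuation) + ']')
    return [pattern.sub('', token) for token in tokens]

# Tokenize on whitespace, as the notebook does before calling the preprocessing function.
tokens = "What causes amoebic brain disease, and why?".split()
cleaned = strip_punctuation(tokens)
print(' '.join(cleaned))
```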

The upcoming snippet of code defines three functions for interacting with the PubMed database:


1.   **search_pubmed(query, max_results=10)**: Searches PubMed using a specified query and returns a list of PubMed IDs (PMIDs) for the retrieved results, with a default maximum of 10 results.

2.   **fetch_abstracts(pubmed_ids)**: Fetches the abstracts corresponding to a list of PubMed IDs and returns them as text.

3.   **fetch_abstracts_and_summaries(pubmed_ids)**: Retrieves summaries for the PubMed records identified by a list of PubMed IDs and returns them as structured data.


In [None]:
def search_pubmed(query, max_results=10):
    Entrez.email = "SemanticSurgeons@gmail.com"
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    results = Entrez.read(handle)
    handle.close()
    return results["IdList"]

def fetch_abstracts(pubmed_ids):
    Entrez.email = "SemanticSurgeons@gmail.com"
    handle = Entrez.efetch(db="pubmed", id=pubmed_ids, rettype="abstract", retmode="text")
    abstracts = handle.read()
    handle.close()
    return abstracts

def fetch_abstracts_and_summaries(pubmed_ids):
    Entrez.email = "SemanticSurgeons@gmail.com"
    summaries = Entrez.esummary(db="pubmed", id=','.join(pubmed_ids))
    summary_records = Entrez.read(summaries)
    summaries.close()
    return summary_records

The following function, **create_query**, generates a query from the input question. If the question contains more than queryLength words, the function extracts up to queryLength keywords with the keyword extraction model and joins them into a query. Otherwise, it returns the original question unchanged.

In [None]:
def create_query(question, queryLength):
  # Count words rather than characters, so that short questions are passed through unchanged
  if len(question.split()) > queryLength:
    keywords = kw_model.extract_keywords(question, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=queryLength)
    # extract_keywords returns (keyword, score) pairs; keep only the keywords
    return ' '.join(keyword for keyword, _ in keywords)
  return question
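
KeyBERT's `extract_keywords` returns a list of `(keyword, score)` pairs. The step that turns such pairs into a query string can be sketched in isolation, without loading any model. The helper name `keywords_to_query` and the scores below are hypothetical and only stand in for real extractor output.

```python
def keywords_to_query(keywords, max_terms):
    """Join the highest-scoring keywords into a space-separated query string."""
    ranked = sorted(keywords, key=lambda pair: pair[1], reverse=True)
    return ' '.join(word for word, _ in ranked[:max_terms])

# Made-up (keyword, score) pairs standing in for a keyword extractor's output.
example_keywords = [('brain', 0.41), ('amoebic', 0.62), ('freshwater', 0.55), ('cause', 0.20)]
example_query = keywords_to_query(example_keywords, max_terms=3)
print(example_query)
```

Sorting explicitly by score keeps the most informative terms at the front of the query, which matters if later steps drop trailing keywords.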

In [None]:
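def most_similar(doc_id, similarity_matrix, matrix):
    # Prints the documents most similar to `doc_id`, best match first.
    # Expects a global list `question_documents_train_set` holding the document texts.
    print(f'Document: {question_documents_train_set[doc_id]}')
    print('\n')
    print(f'Similar Documents using {matrix}:')
    if matrix == 'Cosine Similarity':
        # Higher cosine similarity means more similar, so sort in descending order
        similar_ix = np.argsort(similarity_matrix[doc_id])[::-1]
        print(similar_ix)
    elif matrix == 'Euclidean Distance':
        # Smaller distance means more similar, so sort in ascending order
        similar_ix = np.argsort(similarity_matrix[doc_id])
    else:
        raise ValueError(f'Unsupported matrix type: {matrix}')
    for ix in similar_ix:
        if ix == doc_id:
            continue
        print('\n')
        print(f'Document: {question_documents_train_set[ix]}')
        print(f'{matrix} : {similarity_matrix[doc_id][ix]}')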

# Dataset Loading and preprocessing

We'll begin by loading and preprocessing the dataset to ensure it's in the appropriate format for the context retrieval phase.

In [None]:
medFlashCards_df = pd.read_csv('MedFlashCards.csv')
medFlashCards_df.dropna(inplace=True)
medFlashCards_df

Unnamed: 0,input,output,instruction
0,What is the relationship between very low Mg2+...,Very low Mg2+ levels correspond to low PTH lev...,Answer this question truthfully
1,What leads to genitourinary syndrome of menopa...,Low estradiol production leads to genitourinar...,Answer this question truthfully
2,What does low REM sleep latency and experienci...,Low REM sleep latency and experiencing halluci...,Answer this question truthfully
3,What are some possible causes of low PTH and h...,"PTH-independent hypercalcemia, which can be ca...",Answer this question truthfully
4,How does the level of anti-müllerian hormone r...,The level of anti-müllerian hormone is directl...,Answer this question truthfully
...,...,...,...
33946,"What is Opsoclonus-Myoclonus Ataxia Syndrome, ...",Opsoclonus-Myoclonus Ataxia Syndrome is a para...,Answer this question truthfully
33947,"What is Opsoclonus-Myoclonus Ataxia Syndrome, ...",Opsoclonus-Myoclonus Ataxia Syndrome is a para...,Answer this question truthfully
33948,Is A part of B in a proportion of A/B?,"Yes, A is part of B in a proportion of A/B.",Answer this question truthfully
33949,"What is the mnemonic ""Microtubules Get Constru...","The mnemonic ""Microtubules Get Constructed Ver...",Answer this question truthfully


In [None]:
questions_df = pd.DataFrame(medFlashCards_df['input'])
answers_df = pd.DataFrame(medFlashCards_df['output'])

In [None]:
questions_df['Cleaned Questions'] = questions_df["input"].apply(lambda question: ' '.join(preprocess_document(question.split(), noPunctuation=True,lemmatization=True )) )

# Context Retrieval

In [None]:
if SET_RANDOM_INDEX == True:
  random_question_index = np.random.randint(0, len(medFlashCards_df)-1)
else:
  random_question_index = 1460

In [None]:
question = questions_df['Cleaned Questions'][random_question_index]
answer = answers_df['output'][random_question_index]

In [None]:
print(f'Question: {question}')
print(f'Answers: {answer}')

Question: What cause amoebic brain disease and is associated with freshwater source
Answers: Naegleria fowleri causes amoebic brain disease and is associated with freshwater sources.


We'll now generate the query for retrieving documents from the PubMed API. Considering a query length of 6, we'll use the create_query function to construct the query based on the provided question.

In [None]:
query_length = 6
query = create_query(question, query_length)

The upcoming code segment sets the maximum number of results to retrieve from a PubMed search and runs the search with the generated query. If the search yields no results, it retries after dropping the last keyword of the query, which is the least important one.

In [None]:
max_results = 30  # Maximum number of results to retrieve
pubmed_ids = search_pubmed(query, max_results)
if not pubmed_ids:
    # Drop the last (least important) keyword; slicing the string itself would only remove one character
    pubmed_ids = search_pubmed(' '.join(query.split()[:-1]), max_results)
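
The retry idea can be taken one step further by dropping trailing keywords repeatedly until something is found. The notebook itself retries only once, so the loop below is an assumption about how the fallback could be extended, not the method used here. `search_with_fallback` and `fake_search` are hypothetical names; `fake_search` merely stands in for a real search call so that the sketch runs offline.

```python
def search_with_fallback(query, search_fn, min_terms=1):
    """Drop trailing keywords one at a time until search_fn returns results."""
    terms = query.split()
    while len(terms) >= min_terms:
        candidate = ' '.join(terms)
        results = search_fn(candidate)
        if results:
            return candidate, results
        terms = terms[:-1]  # remove the last, least important keyword
    return '', []

# Stand-in for a real search call: only queries of at most two words "match".
def fake_search(q):
    return ['12345'] if len(q.split()) <= 2 else []

used_query, ids = search_with_fallback('amoebic freshwater disease brain', fake_search)
print(used_query, ids)
```

Shorter queries are broader, so each retry trades precision for a better chance of retrieving at least one record.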

This code fetches the abstracts for the PubMed IDs and creates a DataFrame called "documents" with two columns, "docno" and "text". The returned text is split on blank lines, so each row holds one block of a record (citation line, title, author list, affiliations, or abstract body) with an incrementing ID.

In [None]:
abstracts = fetch_abstracts(pubmed_ids)
# Split the plain-text response on blank lines and number the blocks starting from 1
blocks = abstracts.split("\n\n")
documents = pd.DataFrame({"docno": range(1, len(blocks) + 1), "text": blocks})
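
Splitting on blank lines produces many short blocks (citation lines, titles, author lists) alongside the abstract bodies. One possible refinement, shown here only as a sketch, is to filter blocks by length. The helper `split_blocks`, the `min_chars` threshold, and the sample text are all hypothetical; the sample is invented and is not real PubMed output.

```python
def split_blocks(raw_text, min_chars=0):
    """Split text on blank lines and keep non-empty blocks of at least min_chars characters."""
    blocks = [block.strip() for block in raw_text.split('\n\n')]
    return [block for block in blocks if block and len(block) >= min_chars]

# Invented sample that imitates the citation / title / authors / abstract layout.
sample = (
    "1. Example Journal. 2023;1(1):1-2.\n\n"
    "A made-up title for illustration.\n\n"
    "Doe J(1), Roe R(1).\n\n"
    "This made-up abstract body is longer than the other blocks, so a length "
    "threshold keeps it while dropping the citation, title, and author lines."
)
all_blocks = split_blocks(sample)
long_blocks = split_blocks(sample, min_chars=80)
print(len(all_blocks), len(long_blocks))
```

A length filter is a blunt heuristic: it would also discard genuinely short abstracts, so the threshold needs to be chosen with the data in mind.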

Let's preprocess the retrieved contexts to ensure they are in the optimal format for subsequent operations.

In [None]:
documents['Cleaned Context'] = documents['text'].apply(lambda text: " ".join(preprocess_document(text.split(), noPunctuation=True, lemmatization=True)))

In [None]:
documents

Unnamed: 0,docno,text,Cleaned Context
0,1,1. ACS ES T Water. 2023 Mar 15;3(4):1126-1133....,1 ACS ES T Water 2023 Mar 153411261133 doi 101...
1,2,A Case of Primary Amebic Meningoencephalitis A...,A Case of Primary Amebic Meningoencephalitis A...
2,3,"Miko S(1), Cope JR(1), Hlavsa MC(1), Ali IKM(1...",Miko S1 Cope JR1 Hlavsa MC1 Ali IKM1 Brown TW1...
3,4,Author information:\n(1)U.S. Centers for Disea...,Author information 1US Centers for Disease Con...
4,5,Naegleria fowleri is a thermophilic ameba foun...,Naegleria fowleri is a thermophilic ameba foun...
...,...,...,...
88,89,The possibility of congenital infection with A...,The possibility of congenital infection with A...
89,90,"Awadalla HN(1), Sadaka HA.",Awadalla HN1 Sadaka HA
90,91,Author information:\n(1)Department of Parasito...,Author information 1Department of Parasitology...
91,92,Acanthamoeba culbertsoni is one of the free-li...,Acanthamoeba culbertsoni is one of the freeliv...


## Document Search Engine using Indexing

These methods assess the importance of terms based on their frequencies within the retrieved contexts and the question itself.

Before running any kind of search, we first need to index the whole set of retrieved contexts.

In [None]:
indexer = pt.DFIndexer("./index_contexts", overwrite=True)
documents['docno'] = documents['docno'].astype(str)
index_ref = indexer.index(documents["Cleaned Context"],documents['docno'])
index_ref.toString()

'./index_contexts/data.properties'

In [None]:
!ls -lh index_contexts/

total 134K
-rw------- 1 root root 1.6K May 25 08:43 data.direct.bf
-rw------- 1 root root 1.6K May 25 08:43 data.document.fsarrayfile
-rw------- 1 root root 2.9K May 25 08:43 data.inverted.bf
-rw------- 1 root root 112K May 25 08:43 data.lexicon.fsomapfile
-rw------- 1 root root  993 May 25 08:43 data.lexicon.fsomaphash
-rw------- 1 root root 5.2K May 25 08:43 data.lexicon.fsomapid
-rw------- 1 root root 1023 May 25 08:43 data.meta-0.fsomapfile
-rw------- 1 root root  744 May 25 08:43 data.meta.idx
-rw------- 1 root root 1.5K May 25 08:43 data.meta.zdata
-rw------- 1 root root 4.2K May 25 08:43 data.properties


Let's now load the index and print some information about our collection.

In [None]:
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 93
Number of terms: 1328
Number of postings: 2459
Number of fields: 0
Number of tokens: 3205
Field names: []
Positions:   false
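
To clarify what these statistics count, here is a toy inverted index in plain Python. It is only a sketch of the general data structure: it tokenizes on whitespace, whereas Terrier applies its own tokenisation and term pipeline, so its counts will generally differ. The function name and the two sample documents are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to {docno: term frequency} for a dict of docno -> text."""
    index = defaultdict(dict)
    for docno, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][docno] = freq
    return index

docs = {
    '1': 'naegleria fowleri lives in warm freshwater',
    '2': 'freshwater lakes and warm freshwater rivers',
}
inverted = build_inverted_index(docs)

n_terms = len(inverted)                                            # distinct terms
n_postings = sum(len(p) for p in inverted.values())                # (term, document) pairs
n_tokens = sum(f for p in inverted.values() for f in p.values())   # total word occurrences
print(n_terms, n_postings, n_tokens)
```

A posting records that a term occurs in a document, so a term repeated within one document adds to the token count but not to the posting count.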



### TF-IDF Index Search


This code segment initializes a TF-IDF-based batch retrieval system with a specified index and then executes a search using the provided query.

In [None]:
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")
ranking_td_idf = tf_idf.search(query)
ranking_td_idf

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,52,53,0,6.013538,amoebic freshwater disease brain cause associated
1,1,91,92,1,5.589591,amoebic freshwater disease brain cause associated
2,1,66,67,2,5.270139,amoebic freshwater disease brain cause associated
3,1,63,64,3,4.64091,amoebic freshwater disease brain cause associated
4,1,31,32,4,4.603874,amoebic freshwater disease brain cause associated
5,1,4,5,5,4.081031,amoebic freshwater disease brain cause associated
6,1,78,79,6,3.699922,amoebic freshwater disease brain cause associated
7,1,58,59,7,3.276339,amoebic freshwater disease brain cause associated
8,1,18,19,8,3.266189,amoebic freshwater disease brain cause associated
9,1,49,50,9,3.216381,amoebic freshwater disease brain cause associated


This code snippet checks if the ranking list is not empty. If it's not empty, it retrieves the most relevant result and, if its length is less than 300 characters, also retrieves the second relevant result and concatenates them. If the ranking list is empty, it assigns 'NAN' to the most relevant result.

In [None]:
ranking_length = len(ranking_td_idf)
if ranking_length != 0:
    # Text of the top-ranked document
    most_relevant_result_docno = ranking_td_idf.loc[ranking_td_idf['rank'] == 0, 'docno'].values[0]
    most_relevant_result = documents.loc[documents['docno'] == most_relevant_result_docno, 'text'].values[0]

    # A very short top result (e.g. a title) is extended with the second-ranked document
    if len(most_relevant_result) < 300 and ranking_length > 1:
        second_relevant_result_docno = ranking_td_idf.loc[ranking_td_idf['rank'] == 1, 'docno'].values[0]
        second_relevant_result = documents.loc[documents['docno'] == second_relevant_result_docno, 'text'].values[0]
        most_relevant_result = most_relevant_result + ' ' + second_relevant_result
else:
    most_relevant_result = 'NAN'
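
The same selection logic is needed for every ranker, so one option is to factor it into a small helper. The sketch below assumes the ranking is available as a plain list of docno strings (best first) and the texts as a dictionary; both are hypothetical simplifications of the DataFrames used in this notebook, and `pick_context` is not a function defined elsewhere in it.

```python
def pick_context(ranked_docnos, texts_by_docno, min_chars=300, empty_value='NAN'):
    """Return the top-ranked text, appending the runner-up when the top one is short."""
    if not ranked_docnos:
        return empty_value
    context = texts_by_docno[ranked_docnos[0]]
    if len(context) < min_chars and len(ranked_docnos) > 1:
        context = context + ' ' + texts_by_docno[ranked_docnos[1]]
    return context

# Made-up texts keyed by docno.
texts = {'53': 'Short title.', '92': 'A longer passage about free-living amoebae.'}
combined = pick_context(['53', '92'], texts)
print(combined)
```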

In [None]:
context_tfIdf = most_relevant_result
print(context_tfIdf)

INTRODUCTION: Primary amoebic meningoencephalitis (PAM) is a rare disease caused 
by the free-living amoeba Naegleria fowleri. Infection occurs by insufflation of 
water containing amoebae into the nasal cavity, and is usually associated with 
bathing in freshwater. Nasal irrigation is a more rarely reported route of 
infection.
CASE PRESENTATION: A fatal case of PAM in a previously healthy Norwegian woman, 
acquired during a holiday trip to Thailand, is described. Clinical findings were 
consistent with rapidly progressing meningoencephalitis. The cause of infection 
was discovered by chance, owing to the unexpected detection of N. fowleri DNA by 
a PCR assay targeting fungi. A conclusive diagnosis was established based on 
sequencing of N. fowleri DNA from brain biopsies, supported by histopathological 
findings. Nasal irrigation using contaminated tap water is suspected as the 
source of infection.
CONCLUSION: The clinical presentation of PAM is very similar to severe bacterial 
men

### BM25 Index Search

This code segment initializes a BM25-based batch retrieval system with a specified index and then executes a search using the provided query.

In [None]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
ranking_bm25 = bm25.search(query)
ranking_bm25

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,52,53,0,9.418227,amoebic freshwater disease brain cause associated
1,1,91,92,1,8.867455,amoebic freshwater disease brain cause associated
2,1,66,67,2,8.481014,amoebic freshwater disease brain cause associated
3,1,63,64,3,7.273235,amoebic freshwater disease brain cause associated
4,1,31,32,4,6.767356,amoebic freshwater disease brain cause associated
5,1,4,5,5,6.395992,amoebic freshwater disease brain cause associated
6,1,78,79,6,5.642075,amoebic freshwater disease brain cause associated
7,1,49,50,7,5.468598,amoebic freshwater disease brain cause associated
8,1,58,59,8,5.131413,amoebic freshwater disease brain cause associated
9,1,44,45,9,5.052329,amoebic freshwater disease brain cause associated
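
For intuition about what a BM25 score measures, here is a compact sketch of a common textbook formulation of Okapi BM25. It is not Terrier's exact implementation: the parameter defaults `k1=1.2` and `b=0.75`, the idf variant, and the whitespace tokenisation are assumptions chosen for illustration, and the three documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger idf weight
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates, and long documents are penalised through b
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    'naegleria fowleri causes amoebic brain disease',
    'freshwater lakes are popular in summer',
    'amoebic infection after bathing in freshwater',
]
scores = bm25_scores('amoebic freshwater', docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The third document contains both query terms and therefore receives the highest score, while the other two match one term each and tie.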


In [None]:
ranking_length = len(ranking_bm25)
if ranking_length != 0:
    # Text of the top-ranked document
    most_relevant_result_docno = ranking_bm25.loc[ranking_bm25['rank'] == 0, 'docno'].values[0]
    most_relevant_result = documents.loc[documents['docno'] == most_relevant_result_docno, 'text'].values[0]

    # A very short top result (e.g. a title) is extended with the second-ranked document
    if len(most_relevant_result) < 300 and ranking_length > 1:
        second_relevant_result_docno = ranking_bm25.loc[ranking_bm25['rank'] == 1, 'docno'].values[0]
        second_relevant_result = documents.loc[documents['docno'] == second_relevant_result_docno, 'text'].values[0]
        most_relevant_result = most_relevant_result + ' ' + second_relevant_result
else:
    most_relevant_result = 'NAN'

In [None]:
context_bm25 = most_relevant_result
print(context_bm25)

INTRODUCTION: Primary amoebic meningoencephalitis (PAM) is a rare disease caused 
by the free-living amoeba Naegleria fowleri. Infection occurs by insufflation of 
water containing amoebae into the nasal cavity, and is usually associated with 
bathing in freshwater. Nasal irrigation is a more rarely reported route of 
infection.
CASE PRESENTATION: A fatal case of PAM in a previously healthy Norwegian woman, 
acquired during a holiday trip to Thailand, is described. Clinical findings were 
consistent with rapidly progressing meningoencephalitis. The cause of infection 
was discovered by chance, owing to the unexpected detection of N. fowleri DNA by 
a PCR assay targeting fungi. A conclusive diagnosis was established based on 
sequencing of N. fowleri DNA from brain biopsies, supported by histopathological 
findings. Nasal irrigation using contaminated tap water is suspected as the 
source of infection.
CONCLUSION: The clinical presentation of PAM is very similar to severe bacterial 
men

### Composition

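The `>>` operator lets us re-rank the output of one retriever with a second retriever: here, the documents retrieved by TF-IDF are re-scored by BM25. For this query, the composed ranking shown below matches the BM25 ranking above.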

In [None]:
composition = tf_idf >> bm25
composition_ranking = composition.search(query)
composition_ranking

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,52,53,0,9.418227,amoebic freshwater disease brain cause associated
1,1,91,92,1,8.867455,amoebic freshwater disease brain cause associated
2,1,66,67,2,8.481014,amoebic freshwater disease brain cause associated
3,1,63,64,3,7.273235,amoebic freshwater disease brain cause associated
4,1,31,32,4,6.767356,amoebic freshwater disease brain cause associated
5,1,4,5,5,6.395992,amoebic freshwater disease brain cause associated
6,1,78,79,6,5.642075,amoebic freshwater disease brain cause associated
7,1,49,50,7,5.468598,amoebic freshwater disease brain cause associated
8,1,58,59,8,5.131413,amoebic freshwater disease brain cause associated
9,1,44,45,9,5.052329,amoebic freshwater disease brain cause associated


In [None]:
ranking_length = len(composition_ranking)
if ranking_length != 0:
    # Text of the top-ranked document
    most_relevant_result_docno = composition_ranking.loc[composition_ranking['rank'] == 0, 'docno'].values[0]
    most_relevant_result = documents.loc[documents['docno'] == most_relevant_result_docno, 'text'].values[0]
    # A very short top result (e.g. a title) is extended with the second-ranked document
    if len(most_relevant_result) < 300 and ranking_length > 1:
        second_relevant_result_docno = composition_ranking.loc[composition_ranking['rank'] == 1, 'docno'].values[0]
        second_relevant_result = documents.loc[documents['docno'] == second_relevant_result_docno, 'text'].values[0]
        most_relevant_result = most_relevant_result + ' ' + second_relevant_result
else:
    most_relevant_result = 'NAN'

In [None]:
context_composition_tfIdf_bm25 = most_relevant_result
print(context_composition_tfIdf_bm25)

INTRODUCTION: Primary amoebic meningoencephalitis (PAM) is a rare disease caused 
by the free-living amoeba Naegleria fowleri. Infection occurs by insufflation of 
water containing amoebae into the nasal cavity, and is usually associated with 
bathing in freshwater. Nasal irrigation is a more rarely reported route of 
infection.
CASE PRESENTATION: A fatal case of PAM in a previously healthy Norwegian woman, 
acquired during a holiday trip to Thailand, is described. Clinical findings were 
consistent with rapidly progressing meningoencephalitis. The cause of infection 
was discovered by chance, owing to the unexpected detection of N. fowleri DNA by 
a PCR assay targeting fungi. A conclusive diagnosis was established based on 
sequencing of N. fowleri DNA from brain biopsies, supported by histopathological 
findings. Nasal irrigation using contaminated tap water is suspected as the 
source of infection.
CONCLUSION: The clinical presentation of PAM is very similar to severe bacterial 
men

## Document Search Engine using Document Vectorization

Here we convert the textual documents into numerical vectors based on the relevance of the words and how often they appear in the documents.

We first create lists of contexts and questions from the corresponding dataframes to build the train set.

In [None]:
contexts_list = documents['text'].tolist()
questions_list  = questions_df['input'].tolist()

In [None]:
train_set = questions_list + contexts_list

The next two cells create the vectorizer and fit it to the training set. Fitting learns the vocabulary and the IDF (inverse document frequency) values from the training data.

In [None]:
vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, stop_words='english')

In [None]:
vectorizer.fit(train_set)

In [None]:
learned_vocabulary = vectorizer.get_feature_names_out()
print(f'Length of Vocabulary: {len(learned_vocabulary)}')

Length of Vocabulary: 6239
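The vocabulary size reported above depends heavily on the `max_df` and `min_df` thresholds. The sketch below mimics that pruning on a toy corpus in plain Python, assuming (as in scikit-learn's documented behavior) that a float `max_df` is a proportion of documents and an integer `min_df` is an absolute document count. The corpus and the `prune_vocabulary` helper are hypothetical, and stop-word removal is left out for brevity.

```python
from collections import Counter

def prune_vocabulary(docs, max_df=0.8, min_df=2):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    df = Counter(term for doc in tokenized for term in doc)
    max_count = max_df * n_docs  # a float max_df is a proportion of documents
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)

# Made-up documents for illustration
docs = [
    "the amoeba lives in warm freshwater",
    "the infection reaches the brain",
    "the amoeba causes a brain infection",
    "the patient swam in freshwater",
    "the diagnosis was confirmed by pcr",
]
vocab = prune_vocabulary(docs, max_df=0.8, min_df=2)
print(vocab)
```

In this toy run, "the" is dropped because it appears in every document, and words that appear in only one document are dropped as too rare.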


The upcoming lines of code transform the question and the list of contexts into their vector representations using the fitted vectorizer. This converts the text into numerical vectors based on the learned vocabulary and IDF values.

In [None]:
vector_question = vectorizer.transform([question])
vector_contexts = vectorizer.transform(contexts_list)

Let's compute the similarities

In [None]:
question_contexts_similarities = np.dot(vector_question, vector_contexts.T)
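A plain dot product works as a similarity score here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot product of two rows equals their cosine similarity. The sketch below checks that identity on two made-up vectors using only the standard library. The helper names are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up term weights for a question and a context
question_vec = [1.7, 0.0, 1.3, 1.0]
context_vec = [0.9, 2.1, 1.3, 0.0]

unit_dot = dot(l2_normalize(question_vec), l2_normalize(context_vec))
print(round(unit_dot, 6), round(cosine(question_vec, context_vec), 6))
```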

In [None]:
best_matches = np.argmax(question_contexts_similarities, axis=1)
value = best_matches[0][0, 0]
print(f"The most similar context to question 0 is {best_matches[0]}")
print('CONTEXT:', contexts_list[value])

The most similar context to question 0 is [[49]]
CONTEXT: Fatal primary amoebic meningoencephalitis in a Norwegian tourist returning from 
Thailand.


## Document Search Engine using Document Embedding

A document search engine using embeddings transforms how search results are generated by focusing on the semantic meaning of text rather than just matching keywords. Choosing embeddings for this purpose offers significant advantages. Embeddings capture the semantic meaning of words, enabling more accurate and contextually relevant search results by handling synonyms and polysemous words. They enhance contextual understanding, improve disambiguation, support scalable and efficient searches, and, with multilingual models, enable cross-lingual capabilities.
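To make the limitation of keyword matching concrete, the sketch below compares the tokens of the example question with a title that uses the alternative spelling "amebic", a spelling that does occur in the retrieved corpus. The two strings share no tokens, so an exact-match score is zero even though both concern the same disease. The title is shortened for the example, and `token_overlap` is a hypothetical helper.

```python
def token_overlap(a, b):
    """Return the set of lowercase tokens shared by two strings."""
    return set(a.lower().split()) & set(b.lower().split())

example_question = "What cause amoebic brain disease and is associated with freshwater source"
example_title = "Primary amebic meningoencephalitis death"  # shortened, illustrative

shared = token_overlap(example_question, example_title)
print(shared)  # empty set: exact matching sees no connection between the two
```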

**Doc2Vec**

Doc2Vec is an extension of the Word2Vec model that learns continuous representations for pieces of text, such as sentences, paragraphs, or documents, while also capturing their semantic meanings. In Doc2Vec, each document is represented as a fixed-length vector, just like words are represented as vectors in Word2Vec.

The following code prepares the data for training a Doc2Vec model by creating a list of tagged documents. Each document is tokenized into words, and a unique tag is assigned to it. The resulting tagged_data list contains TaggedDocument objects, where each object represents a document (the question or a context) with its corresponding tag.

In [None]:
question_documents_train_set = [question] + documents['Cleaned Context'].tolist()
tagged_data = [TaggedDocument(words=word_tokenize(doc), tags=[i]) for i, doc in enumerate(question_documents_train_set)]
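For readers without gensim or NLTK at hand, the sketch below mirrors the shape of `tagged_data` with standard-library stand-ins: a namedtuple with `words` and `tags` fields (gensim's `TaggedDocument` is itself such a namedtuple) and `str.split` in place of `word_tokenize`. The two context strings are shortened examples. The point is only that tag 0 is the question and the remaining tags are the contexts.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (same field names)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

example_question = "What cause amoebic brain disease and is associated with freshwater source"
example_contexts = [
    "Fatal primary amoebic meningoencephalitis in a Norwegian tourist",
    "Conflict of interest statement",
]

example_train_set = [example_question] + example_contexts
# str.split is a simplification of nltk's word_tokenize
tagged = [
    TaggedDoc(words=doc.split(), tags=[i])
    for i, doc in enumerate(example_train_set)
]

for doc in tagged:
    print(doc.tags, doc.words[:4])
```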

This code initializes a Doc2Vec model with a vector size of 100 and trains it for 100 epochs using the provided tagged data. During training, the model learns to generate document vectors that represent the semantic meaning of the input documents.

In [None]:
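# Set the number of epochs once; gensim handles the learning-rate decay across them
model_d2v = Doc2Vec(vector_size=100, alpha=0.025, min_count=1, epochs=100)

model_d2v.build_vocab(tagged_data)

# A single train() call runs all 100 epochs; calling train() inside a manual loop
# would repeat model_d2v.epochs passes on every iteration
model_d2v.train(tagged_data,
                total_examples=model_d2v.corpus_count,
                epochs=model_d2v.epochs)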



Let's initialize an array to store the document embeddings, with one row per entry in the training set (the question plus every context) and a vector size of 100. Then we iterate over the entries, retrieve each Doc2Vec embedding from the trained model, and assign it to the corresponding row of the document_embeddings array.

In [None]:
# One row per tagged document: index 0 is the question, the rest are the contexts
document_embeddings = np.zeros((len(question_documents_train_set), 100))

for i in range(len(document_embeddings)):
    # `docvecs` is named `dv` in gensim 4 and later
    document_embeddings[i] = model_d2v.docvecs[i]


Let's calculate the pairwise similarities between document embeddings using cosine similarity

In [None]:
pairwise_similarities=cosine_similarity(document_embeddings)
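The `most_similar` helper called in the next cell is not shown in this section. As a hedged sketch of the idea (not the notebook's implementation), the hypothetical `rank_by_similarity` function below sorts the documents by one row of a similarity matrix, highest first. The query itself comes out at position 0 because its self-similarity is 1, which is consistent with index 0 leading the ranked lists in the recorded outputs.

```python
def rank_by_similarity(doc_id, similarity_matrix):
    """Return document indices sorted by similarity to doc_id, highest first."""
    row = similarity_matrix[doc_id]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)

# Made-up symmetric similarity matrix: one question (index 0) and three contexts
similarities = [
    [1.00, 0.20, 0.70, 0.40],
    [0.20, 1.00, 0.10, 0.30],
    [0.70, 0.10, 1.00, 0.50],
    [0.40, 0.30, 0.50, 1.00],
]

ranking = rank_by_similarity(0, similarities)
best_context = ranking[1]  # skip position 0, which is the query itself
print(ranking, best_context)
```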

Let's retrieve the most similar documents

In [None]:
most_similar(0,pairwise_similarities,'Cosine Similarity')

Document: What cause amoebic brain disease and is associated with freshwater source


Similar Documents using Cosine Similarity:
[ 0 78 48 50 76 43  7  8 12 16 18 89 42 17 87 33 58 64 82 60 41  9 46 91
 74 47 83 68 84 62 29 22 71  1 40 35 39 67 61 26  2 66 57 53  3 13 85 69
 65 24 20 49 31 14 77 72 28 36 59 79 37 70 30 11 55  6 51 10 21 92 52 75
 80 19 90 44 23 63  4 56 25 27 54 34  5 32 15 88 81 86 38 45 73]


Document: Author information 1National Center for Emerging and Zoonotic Infectious Diseases Centers for Disease Control and Prevention1600 Clifton Road Atlanta GA 30329 USA jey9cdcgov
Cosine Similarity : 0.7261215865224676


Document: Conflict of interest statement Potential conflict of interest All author No reported conflict
Cosine Similarity : 0.7080370555308303


Document: Fatal primary amoebic meningoencephalitis in a Norwegian tourist returning from Thailand
Cosine Similarity : 0.6972952042767578


Document: Primary amebic meningoencephalitis death associated with sinus ir

**Sentence Transformer**


Sentence Transformers is a versatile toolkit designed for creating fixed-length numerical representations, or embeddings, of sentences and short texts. It harnesses the power of pre-trained transformer models like BERT, RoBERTa, or DistilBERT to encode input sentences into dense vector representations. These embeddings effectively capture the semantic meaning of the sentences, enabling a wide range of downstream tasks such as semantic similarity assessment, text classification, or clustering. In our case, we're utilizing the **all-mpnet-base-v2** variant, which is built on the MPNet architecture.

The upcoming code initializes a Sentence Transformers model using the **all-mpnet-base-v2** variant, which is based on the MPNet architecture. The model is capable of generating high-quality sentence embeddings for a wide range of natural language processing tasks.

In [None]:
sbert_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

We'll use the previously loaded model to create embeddings for the training set, which contains the question followed by the candidate contexts

In [None]:
document_embeddings = sbert_model.encode(question_documents_train_set)

Let's calculate the pairwise similarities (cosine similarity) and pairwise differences (Euclidean distance) between document embeddings

In [None]:
pairwise_similarities=cosine_similarity(document_embeddings)
pairwise_differences=euclidean_distances(document_embeddings)
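Cosine similarity and Euclidean distance are closely related: for unit-length vectors, the squared distance equals 2 - 2 * cos. The sketch below verifies this on two made-up vectors. Whether a given model returns unit-length embeddings is an assumption to check (for example by inspecting the vector norms). When it holds, both measures produce the same ranking.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_sim(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up embeddings, normalized to unit length
u = normalize([0.3, 0.8, 0.5])
v = normalize([0.6, 0.1, 0.9])

cos_uv = cosine_sim(u, v)
dist_uv = euclidean(u, v)
print(round(dist_uv ** 2, 6), round(2 - 2 * cos_uv, 6))
```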

Let's retrieve the most similar documents

In [None]:
most_similar(0,pairwise_similarities,'Cosine Similarity')

Document: What cause amoebic brain disease and is associated with freshwater source


Similar Documents using Cosine Similarity:
[ 0 67 73 53 92 76 64 42  5 86 45  2 50 32 59 79 19 56 82 29 11 71 38 68
 66 18 63 44  6 25 89 85 22 75 55 41 52 10 28 20 33 69 61  8 78 74 91 47
 40 80 26 58 37 31  1 60 46  4 24 72  7 87 54 93 16 13 21 34 88 14 49 35
 15 27 81 39 90 83 62 48 84 57 36 70  3  9 77 23 51 65 43 17 30 12]


Document: Balamuthia mandrillaris is an amoeba found in fresh water and soil that cause granulomatous amoebic encephalitis We report herein an autopsy case of B mandrillaris amoebic encephalitis which wa definitely diagnosed by PCR An 81yearold man who had Sjögrens syndrome manifested drowsiness 2 month before his death with progressive deterioration Neuroimaging demonstrated focus of T2 and fluidattenuated inversion recovery high and T1 lowintensity with irregular postcontrast ring enhancement in the cerebral hemisphere thalamus and midbrain Pathologically multiple hemorrhag

**BioSentVec**

BioSentVec, a pre-trained model, transforms sentences into numerical forms, effectively capturing the intricate connections between words within the realm of biology and medicine. Its training dataset includes a vast collection of over 30 million documents sourced from scholarly articles found in PubMed and clinical notes extracted from the MIMIC-III Clinical Database. This extensive training ensures its proficiency in understanding the language and concepts specific to these fields.

In [None]:
model_BioSentVec_path = '/content/drive/MyDrive/Colab Notebooks/NLP/project/Models/Datasets/BioSentVec_PubMed_MIMICIII-bigram_d700.bin'
testing_model = sent2vec.Sent2vecModel()

Let's load the model

In [None]:
try:
    testing_model.load_model(model_BioSentVec_path)
except Exception as e:
    # Report the failure instead of claiming success
    print(e)
else:
    print('model successfully loaded')

model successfully loaded


We'll use the previously loaded model to create embeddings for the training set, which contains the question followed by the candidate contexts

In [None]:
document_embeddings = testing_model.embed_sentences(question_documents_train_set)

Let's calculate the pairwise similarities between document embeddings using cosine similarity

In [None]:
pairwise_similarities = cosine_similarity(document_embeddings)

Let's retrieve the most similar documents

In [None]:
most_similar(0,pairwise_similarities,'Cosine Similarity')

Document: What cause amoebic brain disease and is associated with freshwater source


Similar Documents using Cosine Similarity:
[ 0 53 79 67 59 32 25 45 86 19 38 73  5 92 64 56 11 50 82 76 71 27 22 35
 14  8 89 46 16 48 60 62 37 75  4 63  1  7 21 55 41 39 81 28 15 49 10 85
 34 88 44 78 70 18 31 68 91 52 66 77 57  2 58 42 24 83 72 12 80 87 20 93
 74 26 61 69 33 40 47 90  3 54 51  6 84  9 23 43 13 36 30 17 65 29]


Document: INTRODUCTION Primary amoebic meningoencephalitis PAM is a rare disease caused by the freeliving amoeba Naegleria fowleri Infection occurs by insufflation of water containing amoeba into the nasal cavity and is usually associated with bathing in freshwater Nasal irrigation is a more rarely reported route of infection CASE PRESENTATION A fatal case of PAM in a previously healthy Norwegian woman acquired during a holiday trip to Thailand is described Clinical finding were consistent with rapidly progressing meningoencephalitis The cause of infection wa discovered by ch

# Evaluation

The code cells above showcase three methods for document retrieval based on a given query. Each method ranks the retrieved documents from most relevant to least relevant. For this example question, approaches relying solely on term frequency were surpassed by those using sentence embeddings, which capture semantic relationships between the words of a sentence.
Additionally, models pre-trained within our specific domain, such as BioSentVec, demonstrated superior accuracy.


In [None]:
# Adjacent string literals inside parentheses are concatenated; each piece ends
# with a space so that words are not fused across line breaks.
tf_idf_bm_25 = (
    "INTRODUCTION: Primary amoebic meningoencephalitis (PAM) is a rare disease caused "
    "by the free-living amoeba Naegleria fowleri. Infection occurs by insufflation of "
    "water containing amoebae into the nasal cavity, and is usually associated with "
    "bathing in freshwater. Nasal irrigation is a more rarely reported route of "
    "infection. "
    "CASE PRESENTATION: A fatal case of PAM in a previously healthy Norwegian woman, "
    "acquired during a holiday trip to Thailand, is described. Clinical findings were "
    "consistent with rapidly progressing meningoencephalitis. The cause of infection "
    "was discovered by chance, owing to the unexpected detection of N. fowleri DNA by "
    "a PCR assay targeting fungi. A conclusive diagnosis was established based on "
    "sequencing of N. fowleri DNA from brain biopsies, supported by histopathological "
    "findings. Nasal irrigation using contaminated tap water is suspected as the "
    "source of infection. "
    "CONCLUSION: The clinical presentation of PAM is very similar to severe bacterial "
    "meningitis. This case is a reminder that when standard investigations fail to "
    "identify a cause of infection in severe meningoencephalitis, it is of crucial "
    "importance to continue a broad search for a conclusive diagnosis. PAM should be "
    "considered as a diagnosis in patients with symptoms of severe "
    "meningoencephalitis returning from endemic areas."
)

tf_idf_vectorizer = (
    "Fatal primary amoebic meningoencephalitis in a Norwegian tourist returning from "
    "Thailand."
)

doc2vec = (
    "Author information 1National Center for Emerging and Zoonotic Infectious Diseases Centers "
    "for Disease Control and Prevention1600 Clifton Road Atlanta GA 30329 USA jey9cdcgov"
)

sentence_transformer = (
    "Balamuthia mandrillaris is an amoeba found in fresh water and soil that cause "
    "granulomatous amoebic encephalitis We report herein an autopsy case of B mandrillaris amoebic "
    "encephalitis which wa definitely diagnosed by PCR An 81yearold man who had Sjögrens syndrome "
    "manifested drowsiness 2 month before his death with progressive deterioration Neuroimaging "
    "demonstrated focus of T2 and fluidattenuated inversion recovery high and T1 lowintensity with "
    "irregular postcontrast ring enhancement in the cerebral hemisphere thalamus and midbrain "
    "Pathologically multiple hemorrhagic and necrotic lesion were found in the cerebrum thalamus "
    "midbrain pons medulla and cerebellum which were characterized by liquefactive necrosis marked "
    "edema hemorrhage and necrotizing vasculitis associated with the perivascular accumulation of amoebic "
    "trophozoite a few cyst and the infiltration of numerous neutrophil and microgliamacrophages The trophozoite "
    "were ovoid or round 1060 μm in diameter and they showed foamy cytoplasm and a round nucleus with small "
    "karyosome in the center The PCR and immunohistochemistry from paraffinembedded brain specimen revealed "
    "angioinvasive encephalitis due to B mandrillaris Human case of B mandrillaris brain infection are rare "
    "in Japan with only a few brief report in the literature"
)

bioSentVec = (
    "INTRODUCTION Primary amoebic meningoencephalitis PAM is a rare disease caused by the freeliving "
    "amoeba Naegleria fowleri Infection occurs by insufflation of water containing amoeba into the nasal cavity "
    "and is usually associated with bathing in freshwater Nasal irrigation is a more rarely reported route of infection "
    "CASE PRESENTATION A fatal case of PAM in a previously healthy Norwegian woman acquired during a holiday trip "
    "to Thailand is described Clinical finding were consistent with rapidly progressing meningoencephalitis "
    "The cause of infection wa discovered by chance owing to the unexpected detection of N fowleri DNA by a "
    "PCR assay targeting fungi A conclusive diagnosis wa established based on sequencing of N fowleri DNA "
    "from brain biopsy supported by histopathological finding Nasal irrigation using contaminated tap water "
    "is suspected a the source of infection CONCLUSION The clinical presentation of PAM is very similar to "
    "severe bacterial meningitis This case is a reminder that when standard investigation fail to identify "
    "a cause of infection in severe meningoencephalitis it is of crucial importance to continue a broad search "
    "for a conclusive diagnosis PAM should be considered a a diagnosis in patient with symptom of severe "
    "meningoencephalitis returning from endemic area"
)

In [None]:
print(f'Question: {question}\n')
print(f'TF-IDF>>BM25: {tf_idf_bm_25 }\n')
print(f'TF-IDF Vectorizer: {tf_idf_vectorizer}\n')
print(f'Doc2Vec: {doc2vec}\n')
print(f'sentence transformers: {sentence_transformer}\n')
print(f'BioSentVec: {bioSentVec}\n')

Question: What cause amoebic brain disease and is associated with freshwater source

TF-IDF>>BM25: INTRODUCTION: Primary amoebic meningoencephalitis (PAM) is a rare disease caused by the free-living amoeba Naegleria fowleri. Infection occurs by insufflation of water containing amoebae into the nasal cavity, and is usually associated with bathing in freshwater. Nasal irrigation is a more rarely reported route of infection. CASE PRESENTATION: A fatal case of PAM in a previously healthy Norwegian woman, acquired during a holiday trip to Thailand, is described. Clinical findings were consistent with rapidly progressing meningoencephalitis. The cause of infection was discovered by chance, owing to the unexpected detection of N. fowleri DNA by a PCR assay targeting fungi. A conclusive diagnosis was established based on sequencing of N. fowleri DNA from brain biopsies, supported by histopathological findings. Nasal irrigation using contaminated tap water is suspected as the source of infection. CONCLUSION