# Information Retrieval from PDF with Sentence Transformers

This notebook is built to run on the Google Colab platform.




---
## 1. Runtime Performance

To increase computational performance, change the runtime type of the Google Colab notebook. In the menu, go to "Runtime" and set "Runtime Type" to "T4 GPU". (GPU usage is limited; full GPU performance is only available with 'Pay As You Go' registration.)
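To confirm that the GPU runtime is active, a quick check along the following lines can help. This is a minimal sketch: the helper name `detect_device` is hypothetical, and the fallback to `"cpu"` when PyTorch is missing is an assumption made so the snippet also runs outside Colab.

```python
def detect_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # available on Colab; may be missing elsewhere
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = detect_device()
print(f"Running on: {device}")
```

If this prints `cpu` on Colab, the runtime type has most likely not been switched yet. The resulting value could also be passed to the model later via the `device` argument of `SentenceTransformer`.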




---
## 2. Install SentenceTransformer Library

This library provides an easy method to compute dense vector representations for sentences, paragraphs, and images.

In [None]:
# Install sentence transformer library
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.3.1


---
## 3. Install PyPDF2 Library

PyPDF2 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.


In [2]:
# Install PyPDF2 library
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


---
## 4. Import Dependencies

In [3]:
import os # used for accessing source folder path on Google Drive
import PyPDF2 # used for extracting text from PDF files
from sentence_transformers import SentenceTransformer, util # used for text encoding and sentence embedding
import torch # used for identifying index with highest similarity score
import spacy # used for processing and analyzing text data, and retrieving most relevant text passages

---
## 5. Prepare Source Data in a Google Drive Folder

Make sure that your source data is available on Google Drive.

Otherwise, prepare the data as follows:


1.   On your Google Drive, create a source folder, e.g. named 'pdf_folder'.
2.   Access the source folder and upload PDF files, e.g. scientific papers.



---
## 6. Mount Google Drive

Mount your Google Drive in the Colab notebook to access your data (PDF files).

Run the code, follow the link that appears to obtain an authorization code, and confirm access to your Google Drive account:

In [4]:
# Mount Google Drive:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


---
## 7. Access PDF Files

Once your Google Drive is mounted, you can access data from a specified source folder ('pdf_folder') in your Google Drive.



In [5]:
# Define the path to the source folder containing PDF files
folder_path = '/content/drive/My Drive/pdf_folder'
pdf_files = os.listdir(folder_path)
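Note that `os.listdir` returns every entry in the folder, not only PDF files. As a minimal sketch (the helper name `list_pdf_files` is hypothetical), the listing could be restricted to PDFs and made robust against a missing folder like this:

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return the sorted names of PDF files in `folder` (empty list if it is missing)."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Example with the folder used in this notebook
print(list_pdf_files('/content/drive/My Drive/pdf_folder'))
```

An empty list here usually means that the drive is not mounted or that the folder name does not match.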

---
## 8. Load SentenceTransformer Models

We already imported specific modules and classes from the sentence-transformers library, using:

```
from sentence_transformers import SentenceTransformer, util
```

*   The SentenceTransformer class is used to create models for encoding sentences.
*   The util module provides utility functions for working with sentence embeddings.




Now we load a SentenceTransformer model. Four pre-trained models are listed below; only one is active at a time (here, 'all-MiniLM-L6-v2'), and the other three are commented out in the code:

*   'paraphrase-MiniLM-L6-v2'
*   'multi-qa-MiniLM-L6-cos-v1'
*   'all-MiniLM-L6-v2'
*   'sentence-transformers/paraphrase-distilroberta-base-v1'

In [6]:
# Load pre-trained models
#model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
#model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
model = SentenceTransformer('all-MiniLM-L6-v2')
#model = SentenceTransformer('sentence-transformers/paraphrase-distilroberta-base-v1')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Later, you can switch between the models by commenting the corresponding lines in or out.

For more details on the used models, visit: https://huggingface.co/sentence-transformers
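As an alternative to editing comments, the model name could be kept in a single variable. The sketch below is only an illustration: the short keys and the helper `pick_model_name` are hypothetical, while the model names are the four listed in this section.

```python
# Short keys are hypothetical labels; the values are the model names from this section.
MODEL_NAMES = {
    "paraphrase": "paraphrase-MiniLM-L6-v2",
    "multi-qa": "multi-qa-MiniLM-L6-cos-v1",
    "all": "all-MiniLM-L6-v2",
    "distilroberta": "sentence-transformers/paraphrase-distilroberta-base-v1",
}


def pick_model_name(key="all"):
    """Look up a model name by its short key and fail loudly on typos."""
    if key not in MODEL_NAMES:
        raise KeyError(f"Unknown model key {key!r}; choose from {sorted(MODEL_NAMES)}")
    return MODEL_NAMES[key]


MODEL_NAME = pick_model_name("all")
print(MODEL_NAME)

# In the notebook, the model would then be loaded with:
# model = SentenceTransformer(MODEL_NAME)
```

Keeping the name in one place makes it easier to use the same model everywhere embeddings are computed.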

---
## 9. The Main Program Code

**The main program performs the following steps:**



1.   Retrieves text data from PDF files.
2.   Obtains all related phrases based on the initial question.
3.   Filters the output to include only the phrases that are most relevant to the terms in the initial question.
4.   Uses a threshold to filter the related phrases based on their similarity scores to the initial question.
5.   Prints only the phrases with similarity scores above the threshold, ensuring that the output is focused on the specific terms and concepts mentioned in the initial question.



### 9.1 Extracting Text from PDF Files

The first function, `def extract_text_from_pdfs(folder_path)`, retrieves text data from PDF files.

It takes the previously defined `folder_path` as input path, iterates through multiple PDF files in the specified folder `pdf_folder`, and extracts the text from each PDF file using PyPDF2.

The extracted text from each PDF is then appended to a list, and the function returns this list of text data.

In [7]:
# Step 1: Retrieve Text Data from PDF Files
def extract_text_from_pdfs(folder_path):
    # Use the folder passed in by the caller (no hard-coded path)
    text_data = []
    # Keep only PDF files, in a reproducible (sorted) order
    pdf_files = sorted(file for file in os.listdir(folder_path) if file.endswith(".pdf"))
    for file in pdf_files:
        pdf_path = os.path.join(folder_path, file)
        with open(pdf_path, 'rb') as f:
            pdf = PyPDF2.PdfReader(f)
            # Concatenate the text of all pages into one string per PDF
            text = ''
            for page in pdf.pages:
                text += page.extract_text()
            text_data.append(text)
    return text_data
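Scanned PDFs without a text layer typically yield empty text, which then produces meaningless embeddings. As a hedged sketch, a small check could flag such files before encoding. The helper `report_empty_documents` is hypothetical, and it assumes a list of file names in the same order as the extracted texts, which the function above does not return by itself.

```python
def report_empty_documents(file_names, text_data):
    """Return the names of files whose extracted text is empty or whitespace only."""
    return [name for name, text in zip(file_names, text_data) if not text.strip()]


# Toy data standing in for real extraction results
names = ["paper_a.pdf", "scan_b.pdf"]
texts = ["Cilia are microtubule-based organelles.", "   "]
print(report_empty_documents(names, texts))  # ['scan_b.pdf']
```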

### 9.2 Encoding Extracted Text Data

The second function, `def encode_text_data(text_data)`, takes `text_data` as input.

Inside the function, it initializes a SentenceTransformer model, here `'all-MiniLM-L6-v2'`.

The chosen model encodes the `text_data` into text embeddings, which are then returned by the function.

The text embeddings capture the semantic information of the input text and provide a dense vector representation for each input sentence or paragraph.

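To switch between the models, comment the corresponding lines in or out, then re-run all subsequent code cells. Use the same model here as in Section 8: the question is later encoded with the global `model`, and embeddings produced by different models are not comparable.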

In [8]:
# Step 2: Encode the Retrieved Text Data
def encode_text_data(text_data):
    #model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    #model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
    model = SentenceTransformer('all-MiniLM-L6-v2')
    #model = SentenceTransformer('sentence-transformers/paraphrase-distilroberta-base-v1')
    text_embeddings = model.encode(text_data)
    return text_embeddings

### 9.3 Calculating Similarity Scores

The third function, `def calculate_similarity_scores(question, text_embeddings)`, takes a `question` and `text_embeddings` as input.

Inside the function, it uses the global `model` loaded in Section 8 to encode the `question` into a `question_embedding`.

Then, it calculates the cosine similarity scores between the question embedding and the `text_embeddings` using `util.pytorch_cos_sim` from the sentence-transformers library.

Finally, it returns the similarity scores.


In [9]:
# Step 3: Calculate Similarity Scores
def calculate_similarity_scores(question, text_embeddings):
    question_embedding = model.encode(question)
    similarity_scores = util.pytorch_cos_sim(question_embedding, text_embeddings)
    return similarity_scores
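To make the score itself concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths. The following pure-Python sketch illustrates the computation on toy vectors. It is not the library's implementation; the function name `cosine_similarity` and the example vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings": one question vector and three text vectors
question_vec = [1.0, 0.0, 1.0]
text_vecs = [[2.0, 0.0, 2.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
scores = [cosine_similarity(question_vec, v) for v in text_vecs]
print([round(s, 3) for s in scores])  # [1.0, 0.0, 0.5]
```

The first text vector points in the same direction as the question (score 1.0), the second is orthogonal to it (0.0), and the third overlaps partially (0.5). For a single question, `util.pytorch_cos_sim` returns these scores as a 2D tensor with one row.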

### 9.4 Identifying Most Similar Text

The fourth function, `def identify_most_similar_text(similarity_scores, text_data)`, finds the most similar text based on the calculated similarity scores.

It takes the `similarity_scores` and the `text_data` as input, then identifies the index of the most similar text by using the torch library. The specific function `torch.argmax` is used to identify the index with the highest similarity score within the `similarity_scores` tensor. This index is then utilized to retrieve the most similar text from the `text_data`.


Finally, it returns the text from the `text_data` that corresponds to the index of the most similar text.

In [10]:
# Step 4: Identify the Most Similar Text
def identify_most_similar_text(similarity_scores, text_data):
    # torch.argmax returns a 0-d tensor; convert it to a plain Python int
    most_similar_index = torch.argmax(similarity_scores).item()
    return text_data[most_similar_index]
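The selection logic can be illustrated without tensors. The sketch below uses a plain list of made-up scores and a hypothetical helper `argmax_index` to show how the highest score maps back to the corresponding text.

```python
def argmax_index(scores):
    """Return the position of the largest score (first one on ties)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: one similarity score per document
text_data = ["text of paper A", "text of paper B", "text of paper C"]
similarity_scores = [0.21, 0.67, 0.45]

best = argmax_index(similarity_scores)
print(best, text_data[best])  # 1 text of paper B
```

In the notebook, the scores form a tensor with a single row, and `torch.argmax` without a `dim` argument returns the index into the flattened tensor, so the result is the document index in the same way.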

### 9.5 Retrieving Most Relevant Text Passages

The fifth function, `def retrieve_information(question, threshold)`, performs the following steps:

- It applies the previously defined functions to extract text from the PDFs, encode the text data, calculate similarity scores between the encoded text and the input question, and identify the most similar text.

- It splits this most similar text into sentences (phrases) using spaCy.

- It then calculates the similarity at the phrase level and prints the phrases that have a similarity score above the specified threshold.

- In this way, it presents the most relevant text passages for the input question and the specified similarity score threshold.

The line `nlp = spacy.load("en_core_web_sm")` loads the English language model `"en_core_web_sm"` from the spaCy library. This model provides access to a wide range of linguistic annotations, such as part-of-speech tags, syntactic dependencies, and named entities. Once loaded, the `nlp` object can be used to process and analyze text data, including tasks such as tokenization, lemmatization, and entity recognition. Here, it is used to split the text into sentences.
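To give a feel for what sentence splitting produces, here is a rough standard-library stand-in. It is not how spaCy works (spaCy derives sentence boundaries from its language model); the regex and the helper name `split_sentences` are assumptions for illustration only, and the approach will fail on abbreviations such as "Fig. 1a".

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = ("IFT moves cargo along microtubules. "
          "It is required to build cilia! Is it conserved?")
for sentence in split_sentences(sample):
    print(sentence)
```

Each resulting sentence plays the role of one entry in `tokenized_phrases` in the function below.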

In [11]:
# Step 5: Retrieve Most Relevant Text Passages
def retrieve_information(question, threshold):

    # Apply previously defined functions (step 1, step 2, step 3, step 4)
    text_data = extract_text_from_pdfs(folder_path)
    text_embeddings = encode_text_data(text_data)
    similarity_scores = calculate_similarity_scores(question, text_embeddings)
    relevant_text = identify_most_similar_text(similarity_scores, text_data)

    # Split the most similar text into sentences (phrases)
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(relevant_text)
    tokenized_phrases = [sent.text for sent in doc.sents]

    # Calculate similarity at the phrase level
    question_embedding = model.encode(question)
    phrase_embeddings = model.encode(tokenized_phrases)
    similarity_scores = util.pytorch_cos_sim(question_embedding, phrase_embeddings)

    # pytorch_cos_sim returns a 2D tensor of shape (1, n_phrases); flatten it to 1D
    similarity_scores = similarity_scores.flatten()

    # Output the matching phrases; pair each phrase with its own score
    # (looking scores up via list.index() would fail for duplicate phrases)
    print('"' + question + '":', "- Matching Phrases:")
    print("-------------------------------------------------------------")
    for phrase, score in zip(tokenized_phrases, similarity_scores):
        if score > threshold:
            print(phrase)
    print("----------------------------------------------------------END")
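The function prints matching phrases in document order. If the strongest matches should appear first, the phrases could additionally be sorted by score. The sketch below shows this variation on toy data; the helper `rank_phrases`, the example phrases, and the scores are hypothetical.

```python
def rank_phrases(phrases, scores, threshold):
    """Keep phrases scoring above the threshold, best match first."""
    pairs = [(float(score), phrase)
             for phrase, score in zip(phrases, scores)
             if float(score) > threshold]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs


# Toy data standing in for tokenized_phrases and their similarity scores
phrases = ["IFT trains carry cargo.", "See Fig. 1a.", "Cilia contain microtubules."]
scores = [0.65, 0.41, 0.72]

for score, phrase in rank_phrases(phrases, scores, 0.6):
    print(f"{score:.2f}  {phrase}")
```

Converting each score with `float()` means the same helper should also accept the individual elements of a 1D score tensor.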


**Function Call:**

Placing the function call at the end of the program makes it easy to change the question and the threshold. It takes two arguments:

- `question`: "Your question..." - the question must be provided as a text string. Ask simple questions or provide keyword phrases.

- `threshold`: a number between 0 and 1. It is recommended to start with 0.6 and adjust the threshold as needed.

The function call `retrieve_information("Your question...", 0.6)` retrieves the text most relevant to the input question `"Your question..."` and prints those of its phrases whose similarity score exceeds the threshold of `0.6`.
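When tuning the threshold, it can help to see how many phrases would pass at several candidate values. The following sketch does this for a made-up list of scores; the helper `count_matches` is hypothetical and not part of the notebook's code.

```python
def count_matches(scores, thresholds):
    """Map each threshold to the number of scores strictly above it."""
    return {t: sum(1 for s in scores if s > t) for t in thresholds}


# Toy phrase-level similarity scores
scores = [0.82, 0.64, 0.58, 0.31]
print(count_matches(scores, [0.4, 0.6, 0.8]))  # {0.4: 3, 0.6: 2, 0.8: 1}
```

A lower threshold returns more, but less specific, phrases; a higher threshold returns fewer phrases that match the question more closely.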

In [12]:
retrieve_information("cilium, microtubules, IFT?", 0.6)

"cilium, microtubules, IFT?": - Matching Phrases:
-------------------------------------------------------------
These trains are formed of 22 IFT-A and IFT-B proteins that link structural and signaling cargos to microtubule motors for import into cilia.
At their core is a ring of nine interconnected microtubule doublets 
in a structure known as the axoneme (Fig. 1a).
A diffusion barrier exists 
at the base of the cilium, meaning that the vast quantities of structural 
proteins required to build the axoneme need to be delivered by micro -
tubule motors in a process called intraflagellar transport (IFT).
IFT also 
transports membrane-associated proteins into and out of the cilium 
to regulate key developmental signaling pathways1.
b , The new subtomogram averages lowpass filtered and 
colored by complex (yellow, IFT-A; blue, IFT-B1; green, IFT-B2; purple, dynein), 
docked onto a cryo-ET average of the microtubule doublets found in motile  
cilia.
Molecular basis of tubulin transport with