# Simple PDF NLQ Using Vector Databases
**Author :** Dhruv Parthasarathy

**Date :** 03/25/2023


The goal of this notebook is to explore similarity search using text embeddings
and vector databases.

Here we will use a simple text-based PDF file as the document to process.
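
To make the idea of similarity search concrete before touching the PDF, here is a minimal sketch that ranks a few documents against a query by cosine similarity. The three-dimensional vectors and the document labels are made-up stand-ins, not real embeddings; an actual embedding model would produce vectors with hundreds of dimensions, and a vector database would handle the ranking at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (illustrative values only)
documents = {
    "nutrition advice": [0.9, 0.1, 0.0],
    "exercise routine": [0.1, 0.9, 0.0],
    "prompt engineering": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]

# Rank document labels from most to least similar to the query
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

The query vector points mostly along the first axis, so the document whose vector does the same is ranked first.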

In [41]:
# Install pypdf - this library helps with PDF text extraction
!pip install pypdf



In [42]:
from pypdf import PdfReader
from google.colab import userdata

In [43]:
fileName = userdata.get('filePath')

# Read the PDF File
reader = PdfReader(fileName)
number_of_pages = len(reader.pages)


# Combine the text of every page into a single string
content = ""
for page in reader.pages:
  content += page.extract_text()

print(content)

Animesh Giri 002743464  Assignment 3: Practical Implementation of Advanced Prompt Engineering Techniques 1) Root Prompts: Create a Root Prompt that serves as a foundational query for complex interactions. Answer:  Prompt : "Guide me toward a personalized health and wellness plan based on my current lifestyle, preferences, and health goals."  This Root Prompt lays the foundation for a multifaceted interaction, allowing the AI to delve into the user's lifestyle, preferences, and health objectives. It serves as a starting point for generating tailored advice on nutrition, exercise routines, and overall well-being.  Response from ChatGPT :  Creating a personalized health and wellness plan involves considering your current lifestyle, preferences, and health goals. Keep in mind that it's important to consult with a healthcare professional before making significant changes to your diet or exercise routine. Here's a general guide to help you get started: 1. Assess Your Current Lifestyle: a. Da

Now we need to preprocess the text for vector embedding.

In [44]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

# Download necessary NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

When preprocessing text extracted from a PDF for a task like this, we can follow these common steps:

1. **Text Cleaning**: Start by removing any irrelevant characters such as special characters, numbers, or extra whitespace. This helps in reducing noise in your data.

2. **Lowercasing**: Convert all your text to lowercase. This ensures that the algorithm treats words like "The" and "the" as the same word.

3. **Removing Stopwords**: Stopwords are common words like "is", "an", "the", etc., that are usually irrelevant in the context of text analysis. Removing them can help focus on important words.

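4. **Stemming/Lemmatization**: This step reduces words to their base or root form. For example, "running" becomes "run" when it is lemmatized as a verb. Lemmatization is more sophisticated than stemming because it maps each word to a dictionary form using its part of speech rather than just chopping off suffixes. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is passed, which is why "serves" shows up as "serf" in the token output further down.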

5. **Handling Punctuation**: Depending on your application, you might want to remove punctuation or treat it in a specific way.

6. **Tokenization**: Once the text is cleaned, you can convert sentences into tokens (usually words). Tokenization is crucial for vectorization later on.

7. **Optional - Advanced Processing**: Depending on your needs, you might want to apply more advanced processing like Named Entity Recognition (NER), Part of Speech (POS) tagging, or chunking to extract more structured information from the text.

After these preprocessing steps, your text will be in a clean and structured form, ready for embedding and storage in your vector database. Each step refines the text so that the subsequent vectorization captures the most relevant semantic information. The cell that follows applies steps 2 to 6 in a slightly different order (punctuation is stripped before tokenizing) and skips step 1, so numeric tokens remain in its output.
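
The text cleaning described in step 1 can be sketched with the standard `re` module. This is only an illustration: the helper name `clean_text` and the choice to keep ASCII letters and apostrophes are assumptions, and the pattern would also drop accented or non-Latin letters, so it should be adapted to the document at hand. It is not part of the notebook's own cells, and applying it would change the token list and count shown later.

```python
import re

def clean_text(raw):
    """Remove digits and special characters, then collapse whitespace."""
    # Replace runs of digits with a space (assumption: numbers carry no meaning here)
    no_digits = re.sub(r"\d+", " ", raw)
    # Keep only ASCII letters, whitespace, and apostrophes
    no_special = re.sub(r"[^A-Za-z\s']", " ", no_digits)
    # Collapse repeated whitespace into single spaces
    return re.sub(r"\s+", " ", no_special).strip()

sample = "Animesh Giri 002743464  Assignment 3: Practical   Implementation"
print(clean_text(sample))
# Animesh Giri Assignment Practical Implementation
```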

In [45]:
# Convert text to lowercase
text = content.lower()

# Remove punctuation (this also strips apostrophes, so "it's" becomes "its")
text = text.translate(str.maketrans('', '', string.punctuation))

# Tokenization
tokens = word_tokenize(text)

# Remove stop words (load the list once into a set instead of once per token)
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(word) for word in tokens]

print(tokens)

['animesh', 'giri', '002743464', 'assignment', '3', 'practical', 'implementation', 'advanced', 'prompt', 'engineering', 'technique', '1', 'root', 'prompt', 'create', 'root', 'prompt', 'serf', 'foundational', 'query', 'complex', 'interaction', 'answer', 'prompt', 'guide', 'toward', 'personalized', 'health', 'wellness', 'plan', 'based', 'current', 'lifestyle', 'preference', 'health', 'goal', 'root', 'prompt', 'lay', 'foundation', 'multifaceted', 'interaction', 'allowing', 'ai', 'delve', 'user', 'lifestyle', 'preference', 'health', 'objective', 'serf', 'starting', 'point', 'generating', 'tailored', 'advice', 'nutrition', 'exercise', 'routine', 'overall', 'wellbeing', 'response', 'chatgpt', 'creating', 'personalized', 'health', 'wellness', 'plan', 'involves', 'considering', 'current', 'lifestyle', 'preference', 'health', 'goal', 'keep', 'mind', 'important', 'consult', 'healthcare', 'professional', 'making', 'significant', 'change', 'diet', 'exercise', 'routine', 'here', 'general', 'guide',

In [46]:
print(len(tokens))

2550
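
Beyond the raw count, a quick frequency check helps confirm that preprocessing kept the meaningful words. The sketch below uses `collections.Counter`; the helper name `top_tokens` and the short `sample_tokens` list are illustrative stand-ins, and in the notebook the `tokens` list from the previous cell would be passed instead.

```python
from collections import Counter

def top_tokens(tokens, n=5):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

# Small stand-in for the notebook's `tokens` list
sample_tokens = ['root', 'prompt', 'create', 'root', 'prompt', 'health', 'prompt']
print(top_tokens(sample_tokens, n=2))
# [('prompt', 3), ('root', 2)]
```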
