In [2]:
import os
import faiss
from sentence_transformers import SentenceTransformer
import pdfplumber
import numpy as np


Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md






In [3]:
# Step 1: Extract text from a PDF
def extract_text_from_pdf(pdf_path):
    extracted_text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
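            # extract_text() can return None for pages without extractable text
            # in some pdfplumber versions, so fall back to an empty string
            extracted_text += (page.extract_text() or "") + "\n"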
    return extracted_text

In [4]:
# Step 2: Split text into chunks
def split_into_chunks(text, chunk_size=100):
    sentences = text.split(". ")
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk.split()) + len(sentence.split()) <= chunk_size:
            current_chunk += sentence + ". "
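        else:
            # current_chunk is still empty when the very first sentence is
            # longer than chunk_size; skip the append to avoid an empty chunk
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "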

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks
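
Splitting on ". " depends on the extracted text containing sentence-ending periods. Slide-style PDFs often have few of them, which can leave a single very long "sentence" that exceeds chunk_size on its own. A minimal sketch of a fallback, assuming a fixed word window with some overlap is acceptable; `split_by_words` and its `overlap` parameter are hypothetical names, not part of the notebook above.

```python
def split_by_words(text, chunk_size=100, overlap=20):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words, so text near a chunk
    boundary appears in both neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Because the window counts words rather than sentences, every chunk stays within chunk_size regardless of punctuation, at the cost of sometimes cutting a sentence in two.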


In [5]:
# Step 3: Create embeddings for the chunks
def create_embeddings(chunks, model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    embeddings = model.encode(chunks)
    return embeddings, model


In [6]:
# Step 4: Store embeddings in FAISS

def store_embeddings(embeddings):
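    # FAISS expects a C-contiguous float32 array of shape (n_vectors, dimension)
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 search
    index.add(embeddings)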
    return index

In [7]:
# Step 5: Search the most relevant chunks
def search_query(query, index, chunks, model):
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, k=1)

    results = []
    for i in indices[0]:
        # FAISS pads with -1 when the index holds fewer than k vectors
        if i != -1:
            results.append(chunks[i])

    return results
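
`index.search` returns the distances alongside the indices, and FAISS pads a result row with index -1 when fewer than k vectors are available. A minimal sketch of turning one result row into (chunk, distance) pairs; `pair_results` is a hypothetical helper, and the inputs below are made-up values standing in for `indices[0]` and `distances[0]`.

```python
def pair_results(row_indices, row_distances, chunks):
    """Return (chunk, distance) pairs for one query, skipping -1 padding."""
    pairs = []
    for idx, dist in zip(row_indices, row_distances):
        if idx == -1:
            continue  # no vector was found for this slot
        pairs.append((chunks[int(idx)], float(dist)))
    return pairs


# Made-up stand-ins for indices[0] and distances[0] with k=3 and 2 stored vectors.
example_chunks = ["first chunk", "second chunk"]
print(pair_results([1, 0, -1], [0.42, 0.97, 3.4e38], example_chunks))
```

Keeping the distance next to each chunk makes it possible to apply a cutoff later, since with `IndexFlatL2` a smaller distance means a closer match.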

In [8]:
# Step 6: Format the retrieved chunks as a response (mock stand-in for an LLM call)
def generate_response(query, results):
    response = "Here are the relevant results for your query: \n\n"
    for result in results:
        response += f"- {result}\n\n"
    return response
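
`generate_response` only formats the retrieved chunks. If a real language model were swapped in later, the chunks would typically be placed in a prompt together with the question. The sketch below covers that step only and makes no model call; `build_prompt`, its `max_chars` budget, and the prompt wording are all hypothetical.

```python
def build_prompt(query, results, max_chars=4000):
    """Assemble a plain-text prompt from the retrieved chunks and the query."""
    context_parts = []
    used = 0
    for number, chunk in enumerate(results, start=1):
        entry = f"[{number}] {chunk}"
        if used + len(entry) > max_chars:
            break  # keep the context within a rough character budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Numbering the chunks lets a downstream model, or a reader, point back to the passage an answer came from.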

In [10]:
# Main Script
def main():
    # 1. Path to the PDF
    pdf_path = "test.pdf"  # Replace with your PDF path

    # 2. Extract and preprocess text
    text = extract_text_from_pdf(pdf_path)
    chunks = split_into_chunks(text)

    # 3. Create embeddings
    embeddings, model = create_embeddings(chunks)

    # 4. Store embeddings in FAISS index
    index = store_embeddings(np.array(embeddings))

    # 5. Query the system
    query = "from page 6 get the tabular data"  # Replace with your query
    results = search_query(query, index, chunks, model)

    # 6. Generate and print response
    response = generate_response(query, results)
    print(response)

if __name__ == "__main__":
    main()

Here are the relevant results for your query: 

- Tables, Charts, and
Graphs
with Examples from History, Economics,
Education, Psychology, Urban Affairs and
Everyday Life
REVISED: MICHAEL LOLKUS 2018

Tables, Charts, and
Graphs Basics
 We use charts and graphs to visualize data.
 This data can either be generated data, data gathered from
an experiment, or data collected from some source.
 A picture tells a thousand words so it is not a surprise that
many people use charts and graphs when explaining data.
Types of Visual
Representations of Data
Table of Yearly U.S. GDP by
Industry (in millions of dollars)
Source: U.S.


