## Semi-Structured and Multimodal RAG

- We will use Unstructured to parse both text and tables from documents (PDFs).
- We will use the multi-vector retriever to store raw tables and text alongside table summaries, which are better suited for retrieval.
- We will use LCEL to implement the chains used.
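
The multi-vector idea in the second bullet can be sketched without any framework: the summary is what gets searched, while the raw table is what gets returned. This is only a minimal illustration. A toy word-overlap score stands in for vector similarity, and `ToyMultiVectorStore` with its methods is a hypothetical name, not LangChain's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyMultiVectorStore:
    """Search over short summaries, but return the raw documents they point to."""
    summaries: List[Tuple[str, str]] = field(default_factory=list)  # (doc_id, summary)
    docstore: Dict[str, str] = field(default_factory=dict)          # doc_id -> raw content

    def add(self, doc_id: str, summary: str, raw: str) -> None:
        self.summaries.append((doc_id, summary))
        self.docstore[doc_id] = raw

    def retrieve(self, query: str, top_k: int = 1) -> List[str]:
        query_terms = set(query.lower().split())
        # Toy relevance score: number of words shared by the query and the summary
        scored = sorted(
            self.summaries,
            key=lambda item: len(query_terms & set(item[1].lower().split())),
            reverse=True,
        )
        return [self.docstore[doc_id] for doc_id, _ in scored[:top_k]]


store = ToyMultiVectorStore()
store.add(
    "t1",
    "table of laptop boot troubleshooting steps",
    "<table><tr><td>Battery dead</td></tr></table>",
)
store.add(
    "t2",
    "table of monitor test results",
    "<table><tr><td>Laptop monitor not working</td></tr></table>",
)
result = store.retrieve("why does the laptop not boot")
```

In a real pipeline the summaries would be embedded and stored in the vector index, with the document ID kept in the metadata so the raw table can be looked up after retrieval.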

Notebook for reference: https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_and_multi_modal_RAG.ipynb


In [9]:
import json
import os
import sys
import time
import warnings
from typing import Any, Dict, List

import numpy as np
import pandas as pd
import requests
import yaml
from dotenv import load_dotenv  # Loads API keys from a .env file
from groq import Groq
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from pinecone import Pinecone
from pinecone_text.sparse import BM25Encoder
from tqdm import trange
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import convert_to_dict, elements_to_json

warnings.filterwarnings("ignore")

# Need to import groq from langchain
# Can try paddle OCR instead of tesseract


In [10]:
load_dotenv()
groq_api_key = os.getenv('GROQ_API_KEY')
hf_key = os.getenv('HUGGINGFACE_API_KEY')
pinecone_api_key = os.getenv('PINECONE_API_KEY')
pc = Pinecone(api_key=os.environ['PINECONE_API_KEY'])
openai_api_key = os.getenv('OPENAI_API_KEY')
groq_client = Groq(api_key = groq_api_key)
model = "llama3-8b-8192"

## Data Loading

- We use `partition_pdf`, which segments a PDF document using a layout model.
- This layout model makes it possible to extract elements, such as tables, from PDFs.
- We will also use Unstructured's chunking, which:
  - tries to identify document sections
  - builds text blocks that maintain sections while also honoring user-defined chunk sizes
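
The section-aware chunking described in the last bullet can be illustrated with a small pure-Python sketch. It is not Unstructured's implementation: `chunk_by_section` is a hypothetical helper, the sample elements are invented for illustration, and a single element longer than `max_chars` is kept whole rather than split.

```python
from typing import Dict, List


def chunk_by_section(elements: List[Dict], max_chars: int = 300) -> List[str]:
    """Group element texts into chunks that never cross a Title boundary."""
    chunks: List[str] = []
    current = ""
    for el in elements:
        starts_section = el["type"] == "Title"
        too_long = len(current) + len(el["text"]) + 1 > max_chars
        # Close the running chunk at a new section or when the size limit would be exceeded
        if current and (starts_section or too_long):
            chunks.append(current)
            current = ""
        current = f"{current}\n{el['text']}" if current else el["text"]
    if current:
        chunks.append(current)
    return chunks


sample = [
    {"type": "Title", "text": "Lecture 1"},
    {"type": "NarrativeText", "text": "Science relies on observation."},
    {"type": "Title", "text": "Lecture 2"},
    {"type": "NarrativeText", "text": "Anomalies drive new explanations."},
]
section_chunks = chunk_by_section(sample, max_chars=300)
```

Each chunk starts at a section title, so a retrieved chunk carries its own heading as context.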


In [3]:
# Code taken from unstructured website and stack overflow
path_to_hsi = "../data/HSI1000_1to9.pdf"
raw_pdf_elements = partition_pdf(
    path_to_hsi,
    strategy="hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
)

# Save output to json file (Future use mongodb maybe)
element_output_file = "../data/element_entities.json"
elements_to_json(raw_pdf_elements, filename=element_output_file)
with open(element_output_file, "r", encoding="utf-8") as fin:
    read_elements = json.load(fin)
print(f"length before filtering: {len(read_elements)}")

# Drop the element types we do not want to index
unwanted_types = ['Footer', 'Image', 'FigureCaption', 'UncategorizedText']
filtered_el = [el for el in read_elements if el['type'] not in unwanted_types]
print(f"length after filtering: {len(filtered_el)}")
filtered_el[0]

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


length before filtering: 130
length after filtering: 109


{'type': 'Title',
 'element_id': 'cc9971d7967ab7ce6a3ac73cc065832e',
 'metadata': {'coordinates': {'points': [[195.6, 158.1],
    [195.6, 187.6],
    [313.3, 187.6],
    [313.3, 158.1]],
   'system': 'PixelSpace',
   'layout_width': 1653,
   'layout_height': 2339},
  'filename': 'HSI1000_1to9.pdf',
  'file_directory': '../data',
  'last_modified': '2024-06-12T13:15:53',
  'filetype': 'application/pdf',
  'page_number': 1},
 'text': 'Lecture 1'}

In [4]:
table_elements =  [
    {'type': el['type'], 
     'Page': el['metadata']['page_number'],
     "text": el['metadata']['text_as_html']
     } for el in filtered_el if el['type'] == 'Table']
print(f"Number of tables identified: {len(table_elements)}")
text_elements =  [{'type': el['type'], 
     'Page': el['metadata']['page_number'],
     "text": el['text']
     } for el in filtered_el if el['type'] != 'Table']
print(f"Number of text elements identified: {len(text_elements)}")


Number of tables identified: 3
Number of text elements identified: 106


In [5]:
text_elements[0]

{'type': 'Title', 'Page': 1, 'text': 'Lecture 1'}

In [6]:
table_elements[0]

{'type': 'Table',
 'Page': 2,
 'text': '<table><thead><th></th><th>Laptop doesn’t boot</th></thead><tr><td></td><td>Battery dead</td></tr><tr><td>the explanation |</td><td>Plug in external power</td></tr><tr><td>of test</td><td>Laptop seems to boot, so the battery must have been dead</td></tr></table>'}
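
The `text_as_html` string shown above is hard to read or summarize directly. As a rough sketch using only the standard library, the cells can be pulled out row by row. `TableRowParser` and `html_table_to_rows` are hypothetical helper names, and the parser ignores `rowspan`/`colspan`, so merged cells are not expanded.

```python
from html.parser import HTMLParser
from typing import List


class TableRowParser(HTMLParser):
    """Collect the cell texts of each row in an HTML table."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # <thead> is treated as a row start because header cells may sit directly inside it
        if tag in ("tr", "thead"):
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(html: str) -> List[List[str]]:
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    # Drop empty rows (e.g. a <thead> that wraps its cells in a <tr>)
    return [[cell.strip() for cell in row] for row in parser.rows if row]


html = (
    "<table><thead><th></th><th>Laptop doesn't boot</th></thead>"
    "<tr><td></td><td>Battery dead</td></tr></table>"
)
rows = html_table_to_rows(html)
```

The resulting list of rows can be joined into plain text for a table summary, while the original HTML is kept as the raw table.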

In [7]:

def get_file_docs(elements: List[Dict]) -> List[Dict]:
    """Join the text elements of each page and split them into overlapping chunks."""
    def get_num_pages(elements):
        # Highest page number seen across all elements
        return max((el['Page'] for el in elements), default=0)

    def generate_chunks(text: str, page_num: int) -> List[Dict]:
        separator_ls = ["\n\n", "\n", ". ", "!", "?", ",", " ", ""]
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=300,
            chunk_overlap=30,
            length_function=len,
            separators=separator_ls
        )
        separated_list = text_splitter.split_text(text)
        # Add page number to each chunk
        return [{'Page': page_num, 'text': chunk} for chunk in separated_list]

    file_chunks = []
    # Use the elements passed in, not the global text_elements
    num_pages = get_num_pages(elements)

    # Page numbers are 1-indexed, so iterate from 1 to num_pages inclusive
    for page_num in range(1, num_pages + 1):
        page_ls = [el['text'] for el in elements if el['Page'] == page_num]
        page_text = "\n".join(page_ls)
        file_chunks.extend(generate_chunks(page_text, page_num))

    return file_chunks

text_documents = get_file_docs(text_elements)
print(text_documents)

[{'Page': 1, 'text': 'Lecture 1\nHSI1000\n1 The Founding of Modern Science\nIntended Learning Outcomes for Lecture 01 You should be able to do the following after this lecture.\n(1) Describe what is science and explain the scientific method “in a nutshell”, illustrating your explanation with a straightforward example.'}, {'Page': 1, 'text': '(2) Describe the roles scientific observations play in the scientific method. (3) Explain what are the main concerns that should be addressed when making scientific observations. (4) Explain why anomalous phenomena are important for science, illustrating your explanation with some'}, {'Page': 1, 'text': 'examples from the scientific revolution.\n(5) In the context of the scientific revolution, discuss the difference between an evidence-based understanding of the natural world versus one based on authority.'}, {'Page': 1, 'text': '(6) Discuss the steam engine’s contribution to the Industrial Revolution and its impact on population growth in industri

In [27]:
# Initialize the BM25Encoder and SentenceTransformer model
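def initialise_models():
    bm25 = BM25Encoder()

    # Load embeddings. Need to change from ...co/models/ to ...co/pipeline/feature-extraction/...
    HF_API_URL = "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-mpnet-base-v2"
    headers = {"Authorization": f"Bearer {hf_key}"}

    def dense_embed(payload):
        # payload is a string or a list of strings; the parsed JSON response is returned as-is
        response = requests.post(HF_API_URL, headers=headers, json=payload)
        return response.json()

    # Return both encoders so they can be used outside this function
    return bm25, dense_embed

# Later cells call dense_embed directly, so expose it at module level
bm25, dense_embed = initialise_models()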

In [30]:
def upsert_pinecone(text_documents):
    # Convert text_documents to DataFrame
    df = pd.DataFrame(text_documents)

    pinecone_upserts = []
    db_dense_embeddings = []
    db_text_chunks = []

    batch_size = 32

    # Create something to check the status of the pinecone index before upserting

    # Loop through the DataFrame 'df' in batches of size 'batch_size'
    for i in trange(0, len(df), batch_size):
        i_end = min(i+batch_size, len(df)) # Determine the end index of the current batch
        df_batch = df.iloc[i:i_end] # Extract the current batch from the DataFrame
        df_dict = df_batch.to_dict(orient="records") # Convert the batch to a list of dictionaries
        
        meta_batch = [
            f"Page {row['Page']}: {row['text']}" for _, row in df_batch.iterrows()
        ]
        
        # bm25.fit(meta_batch)

        text_chunks = df_batch['text'].tolist()
        db_text_chunks.extend(text_chunks)
        
        # Encode combined metadata and text using BM25Encoder to create sparse embeddings
        # sparse_embeddings = bm25.encode_documents([combined for combined in meta_batch])

        # Encode text using SentenceTransformer to create dense embeddings
        dense_embeddings = dense_embed(text_chunks)
        db_dense_embeddings.extend(dense_embeddings)
        
        # Anything other than a list means the embedding call failed, so stop before upserting
        if not isinstance(dense_embeddings, list):
            print("Embedding model not connected properly. Dense embeddings not generated.")
            return

        # Generate a list of IDs for the current batch
        ids = ['vec' + str(x) for x in range(i, i_end)]
        time.sleep(2)  # Brief pause between batches

        pinecone_batch_upserts = [
            {'id': _id, 'values': dense, 'metadata': meta}
            for _id, dense, meta in zip(ids, dense_embeddings, df_dict)
        ]

        index = pc.Index('hsi-notes')
        index.upsert(vectors=pinecone_batch_upserts, namespace='page-1to9-texts')
        print(f"Batch starting with index {i} upserted")
        pinecone_upserts.append(pinecone_batch_upserts)
    return

# RUN ONLY WHEN WANT TO UPSERT NEW BATCH
upsert_pinecone(text_documents)

 33%|███▎      | 1/3 [00:17<00:35, 17.61s/it]

Batch starting with index 0 upserted


 67%|██████▋   | 2/3 [00:34<00:17, 17.24s/it]

Batch starting with index 32 upserted


100%|██████████| 3/3 [00:50<00:00, 17.00s/it]

Batch starting with index 64 upserted





### Using the Pinecone index to test the retrieval


In [35]:
index_stats = pc.describe_index(os.environ['PINECONE_INDEX_NAME'])
print(index_stats)

{'dimension': 768,
 'host': 'hsi-notes-ugakozt.svc.aped-4627-b74a.pinecone.io',
 'metric': 'dotproduct',
 'name': 'hsi-notes',
 'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
 'status': {'ready': True, 'state': 'Ready'}}
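
The index above uses the `dotproduct` metric over 768-dimensional vectors. As a rough illustration of what a top-k query computes, here is a pure-Python sketch with toy 2-dimensional vectors. `top_k_by_dot_product` is a hypothetical helper and says nothing about how Pinecone implements search internally.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (vector index, score) pairs for the k highest dot-product scores."""
    scored = [(i, dot(query, vec)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stored vectors standing in for chunk embeddings
stored = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
ranked = top_k_by_dot_product([1.0, 0.0], stored, k=2)
```

Unlike cosine similarity, the dot product also grows with vector length, so scores are only comparable across chunks when the embeddings have similar norms.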


In [56]:
# Handle to the index that the chunks were upserted into
index = pc.Index(os.environ['PINECONE_INDEX_NAME'])

def get_relevant_chunks(query, top_k):
    # Create dense vector of user query
    dense_query = dense_embed(query)
    matches = index.query(
        namespace='page-1to9-texts',
        top_k=top_k,
        vector=dense_query,
        include_metadata=True
    )
    return matches

def pretty_print_matches(result):
    print(f"Namespace searched: {result['namespace']}\n")
    num_results = len(result['matches'])
    print(f"Top {num_results} relevant chunks found:\n")
    for i in range(num_results):
        print(f"Found on page {int(result['matches'][i]['metadata']['Page'])}:")
        print(f"{result['matches'][i]['metadata']['text']}")
        print(f"Dotproduct score: {result['matches'][i]['score']}")
        print("-" * 80)

def get_llm_context(query, top_k):
    index_stats = pc.describe_index(os.environ['PINECONE_INDEX_NAME'])
    if not (index_stats['status']['ready'] and index_stats['status']['state'] == "Ready"):
        # Fail loudly instead of hitting an undefined variable further down
        raise RuntimeError("Pinecone index is not ready")
    relevant_matches = get_relevant_chunks(query, top_k)
    # Ideally, combine only the first 2 matches, or select by dot-product score and the gap between scores
    context = ""
    for match in relevant_matches['matches']:
        # Keep the page number on its own line so it does not run into the chunk text
        context += f"Page number: {int(match['metadata']['Page'])}\n{match['metadata']['text']}\n\n"
    return context
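
The comment in `get_llm_context` suggests selecting matches by dot-product score and the difference between scores rather than always using all `top_k` results. One possible way to do that is sketched here. `select_by_score_gap`, the `max_gap` and `min_keep` parameters, and the sample scores are all assumptions for illustration, not values taken from the index.

```python
from typing import Dict, List


def select_by_score_gap(
    matches: List[Dict], max_gap: float = 0.1, min_keep: int = 2
) -> List[Dict]:
    """Keep the best matches until the score drops by more than max_gap."""
    min_keep = max(1, min_keep)
    ordered = sorted(matches, key=lambda m: m["score"], reverse=True)
    kept = ordered[:min_keep]
    # Compare each remaining match with the one ranked just above it
    for prev, cur in zip(ordered[min_keep - 1:], ordered[min_keep:]):
        if prev["score"] - cur["score"] > max_gap:
            break
        kept.append(cur)
    return kept


# Invented scores: three close matches followed by a clear drop
sample_matches = [
    {"id": "a", "score": 0.82},
    {"id": "b", "score": 0.80},
    {"id": "c", "score": 0.78},
    {"id": "d", "score": 0.51},
    {"id": "e", "score": 0.50},
]
kept = select_by_score_gap(sample_matches, max_gap=0.1, min_keep=2)
```

A suitable `max_gap` depends on the embedding model and the metric, so it would need tuning against real queries.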

In [57]:
from langchain_groq import ChatGroq

def llama_chat(user_question, k):
    context = get_llm_context(user_question, k)
    chat = ChatGroq(temperature=0, model_name="llama3-8b-8192")
    # Build the system prompt from adjacent string literals to avoid stray indentation in the prompt
    system = (
        "You are a science professor in a university. "
        "You are given the user's question and relevant sections from a set of school notes "
        "about scientific methodology and the history of science. "
        "Answer the question by including direct quotes from the notes, "
        "along with the page number where the answer or answers can be found."
    )
    human = "{text}"
    prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])
    chain = prompt | chat
    return chain.invoke({"text": "User Question: " + user_question + "\n\nRelevant section in textbook:\n\n" + context})

answer = llama_chat("What is Cadaverous Poisoning?", 5)
print(answer.content)

A fascinating topic! According to our notes, Cadaverous Poisoning refers to the transmission of a mysterious, invisible substance, also known as "cadaver matter," from corpses to medical professionals, particularly doctors and students, who handled the bodies during autopsies and dissections. This substance was believed to be responsible for the spread of diseases, including childbed fever, which was a significant cause of maternal mortality at the time.

As noted on page 7, this substance was found to be particularly prevalent in the anatomical pathology lab, where doctors and students would often come into contact with infected corpses. The same solution used to remove the putrid smell of infected autopsy tissue was also found to be effective in removing the cadaver matter from the skin.

It's worth noting that the concept of cadaverous poisoning was first proposed by a doctor who observed that medical professionals who had handled corpses were more likely to contract diseases, such 

In [58]:
# Table section
table_elements

[{'type': 'Table',
  'Page': 2,
  'text': '<table><thead><th></th><th>Laptop doesn’t boot</th></thead><tr><td></td><td>Battery dead</td></tr><tr><td>the explanation |</td><td>Plug in external power</td></tr><tr><td>of test</td><td>Laptop seems to boot, so the battery must have been dead</td></tr></table>'},
 {'type': 'Table',
  'Page': 2,
  'text': '<table><tr><td rowspan="2">Explanation Test the explanation |</td><td>Laptop monitor not working</td></tr><tr><td></td><td>Try connecting the external monitor with HDMI cable 1</td></tr><tr><td rowspan="2">Result of test</td><td>Laptop seems to boot, but there’s nothing on the screen. (a) Either the graphics card or motherboard has issues,</td></tr><tr><td></td><td>or (b) Something was wrong with our test.</td></tr></table>'},
 {'type': 'Table',
  'Page': 3,
  'text': '<table><tr><td>Observation</td><td>Laptop seems boot, but there’s nothing on the screen</td></tr><tr><td>Explanation</td><td>Laptop monitor not working</td></tr><tr><td>Test 