## Semi-structured and multimodal RAG
- We use Unstructured to parse both text and tables from documents (PDFs).
- We use the multi-vector retriever to store raw tables and text alongside table summaries, which are better suited for retrieval.
- We use LCEL to implement the chains.
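
The multi-vector idea in the second bullet can be illustrated with a toy sketch: summaries are what gets searched, and a shared ID maps each hit back to the raw table or text. This is not LangChain's `MultiVectorRetriever`; the helper names (`build_stores`, `score`, `retrieve`), the word-overlap score, and the sample data are all hypothetical stand-ins for a vector store and a document store.

```python
import uuid


def build_stores(pairs):
    """Build a toy search index and docstore from (summary, raw) pairs."""
    index = []      # stands in for the vector store: (doc_id, summary)
    docstore = {}   # stands in for the document store: doc_id -> raw content
    for summary, raw in pairs:
        doc_id = str(uuid.uuid4())
        index.append((doc_id, summary))
        docstore[doc_id] = raw
    return index, docstore


def score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def retrieve(query, index, docstore, k=1):
    """Search over the summaries, but return the raw content behind the top k hits."""
    ranked = sorted(index, key=lambda item: score(query, item[1]), reverse=True)
    return [docstore[doc_id] for doc_id, _ in ranked[:k]]


# Hypothetical sample data: one table and one text passage
index, docstore = build_stores([
    ("population growth by country table", "<table>population rows</table>"),
    ("history of the steam engine", "Raw lecture text about the steam engine"),
])
print(retrieve("steam engine history", index, docstore))
```

In the real pipeline the word-overlap score would be replaced by embedding similarity, while the ID-based lookup of the raw content stays the same.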

Notebook for reference: https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_and_multi_modal_RAG.ipynb

In [1]:
import json
import os
import uuid
from typing import Any

import numpy as np
import pandas as pd
import pinecone
import requests
import yaml
from dotenv import load_dotenv
from groq import Groq
from IPython.display import HTML, display
from pydantic import BaseModel
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer

# Unstructured: PDF partitioning, element types, and (de)serialisation helpers
from unstructured.documents.elements import NarrativeText, Table, Title
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import (
    convert_to_csv,
    convert_to_dict,
    elements_from_json,
    elements_to_json,
)

# LangChain
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.text_splitter import TokenTextSplitter
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# TODO: import the Groq chat model from LangChain
# Note: PaddleOCR could be tried instead of Tesseract


In [2]:
load_dotenv()
groq_api_key = os.getenv('GROQ_API_KEY')
hf_key = os.getenv('HUGGINGFACE_API_KEY')
pinecone_api_key = os.getenv('PINECONE_API_KEY')
openai_api_key = os.getenv('OPENAI_API_KEY')
client = Groq(api_key = groq_api_key)
model = "llama3-8b-8192"

## Data Loading
- We use `partition_pdf`, which segments a PDF document using a layout model.
- The layout model makes it possible to extract elements, such as tables, from PDFs.
- We also use Unstructured's chunking, which:
  - tries to identify document sections
  - builds text blocks that preserve sections while honoring user-defined chunk sizes
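
The chunking behaviour described in the last two bullets can be sketched in plain Python. This is a simplified illustration, not Unstructured's implementation (the library exposes its own `chunk_by_title`); the function name `toy_chunk_by_title`, the `(type, text)` tuples, and the sample elements are hypothetical. Unlike the library, this toy never splits a single oversized element.

```python
def toy_chunk_by_title(elements, max_characters=500):
    """Group (type, text) elements into chunks.

    A new chunk starts at every Title element, or when adding the next
    element would push the current chunk past max_characters.
    """
    chunks = []
    current = ""
    for el_type, text in elements:
        starts_section = el_type == "Title"
        # +1 accounts for the newline that joins elements inside a chunk
        too_long = bool(current) and len(current) + 1 + len(text) > max_characters
        if current and (starts_section or too_long):
            chunks.append(current)
            current = ""
        current = f"{current}\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks


# Hypothetical elements resembling partition output
elements = [
    ("Title", "1.1 What is Science?"),
    ("NarrativeText", "Science is a way of understanding the natural world."),
    ("Title", "1.2 Observations"),
    ("NarrativeText", "Observations anchor the scientific method."),
]
for chunk in toy_chunk_by_title(elements, max_characters=200):
    print(repr(chunk))
```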

In [17]:
# Code adapted from the Unstructured documentation and Stack Overflow
path_to_hsi = "../data/HSI1000_1to9.pdf"
raw_pdf_elements = partition_pdf(
    path_to_hsi,
    strategy="hi_res",           # layout-model based partitioning
    infer_table_structure=True,  # keep each table's structure as HTML in its metadata
)

# Count how many elements of each type were extracted
category_counts = {}
for element in raw_pdf_elements:
    category = str(type(element))
    category_counts[category] = category_counts.get(category, 0) + 1

# The set of element types found in the document
unique_categories = set(category_counts.keys())
print(category_counts)

# Save the elements to a JSON file (possibly MongoDB in future)
element_output_file = "../data/element_entities.json"
elements_to_json(raw_pdf_elements, filename=element_output_file)

In [15]:
with open("../data/element_entities.json", "r", encoding="utf-8") as fin:
    read_elements = json.load(fin)
print(f"length before filtering: {len(read_elements)}")

# Drop element types that are not useful for retrieval
unwanted_types = {'Footer', 'Image', 'FigureCaption', 'UncategorizedText'}
filtered_el = [el for el in read_elements if el['type'] not in unwanted_types]
print(f"length after filtering: {len(filtered_el)}")

length before filtering: 130
length after filtering: 109


In [24]:
class Element(BaseModel):
    type: str
    text: Any

# Tables keep their HTML rendering; all other elements keep their plain text
table_elements = [Element(type='Table', text=el['metadata']['text_as_html']) for el in filtered_el if el['type'] == 'Table']
print(len(table_elements))
text_elements = [Element(type=el['type'], text=el['text']) for el in filtered_el if el['type'] != 'Table']
print(len(text_elements))

3
106
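
Since each table element carries its content as an HTML string (`text_as_html`), it can help to turn that string into rows before summarising or inspecting it. The sketch below uses only the standard library; the class `TableRows`, the helper `html_table_to_rows`, and the sample HTML are hypothetical and assume a simple table without nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of a simple HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:  # tolerate cells that appear without a <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_to_rows(html):
    """Parse an HTML table string into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample, shaped like a small text_as_html value
sample_html = "<table><tr><th>Year</th><th>Event</th></tr><tr><td>1712</td><td>Newcomen engine</td></tr></table>"
print(html_table_to_rows(sample_html))
```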


In [135]:
# Preview the first eight text elements
for el in text_elements[:8]:
    print(el.text)

Lecture 1

HSI1000

1 The Founding of Modern Science

Intended Learning Outcomes for Lecture 01 You should be able to do the following after this lecture.

(1) Describe what is science and explain the scientific method “in a nutshell”, illustrating your explanation with a straightforward example.
(2) Describe the roles scientific observations play in the scientific method. (3) Explain what are the main concerns that should be addressed when making scientific observations. (4) Explain why anomalous phenomena are important for science, illustrating your explanation with some

examples from the scientific revolution.

(5) In the context of the scientific revolution, discuss the difference between an evidence-based understanding of the natural world versus one based on authority.
(6) Discuss the steam engine’s contribution to the Industrial Revolution and its impact on population growth in industrialized nations.
1.1 What is Science? Hi all, welcome to the first video in this series. This 

In [137]:
# Get embeddings from the Hugging Face Inference API
API_URL = "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-mpnet-base-v2"
headers = {"Authorization": f"Bearer {hf_key}"}

def query(payload):
    """POST the payload to the feature-extraction endpoint and return the parsed JSON."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "this is a sentence",
})
output


[0.045405540615320206,
 -0.0042373742908239365,
 -0.012394663877785206,
 0.02608884684741497,
 -0.05298587679862976,
 ...]
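
Embedding vectors like the one above are typically compared with cosine similarity at retrieval time. Below is a minimal pure-Python sketch of that comparison; the function names `cosine` and `rank` and the short example vectors are hypothetical, and in the notebook itself `sklearn.metrics.pairwise.cosine_similarity` would be the natural choice.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    return sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )


# Hypothetical 2-d vectors standing in for real embeddings
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(rank(query_vec, doc_vecs))
```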


In [None]:
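# Initialise the Pinecone DB and upload the text embeddings
import os
import time

from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
from langchain_pinecone import PineconeVectorStore

pc = Pinecone(api_key=pinecone_api_key)
index_name = "hsi-rag"  # the index is assumed to exist already

# Assumption: text_documents and text_embeddings were not defined in the original cell.
# They are built here from text_elements and the same model that was queried above.
text_documents = [Document(page_content=el.text, metadata={"type": el.type}) for el in text_elements]
text_embeddings = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Two namespaces are planned: one for table embeddings and one for text embeddings
text_namespace = "text-embeddings"
docsearch = PineconeVectorStore.from_documents(
    documents=text_documents,
    index_name=index_name,
    embedding=text_embeddings,
    namespace=text_namespace,
)

time.sleep(1)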