<a href="https://colab.research.google.com/github/meet5398/MY_work_toward_data_science/blob/main/Table_Extraction_RAG_Local.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End-to-End Implementation of Retrieval-Augmented Generation (RAG) with Unstructured, LangChain, and ChromaDB

In this Python notebook, we go over how to extract tables from quarterly earnings reports using Unstructured's Python library. We then chunk, embed, and store the tables in a vector database for retrieval.




In [2]:
!pip install chromadb



In [4]:
!pip install unstructured unstructured-inference

ERROR: Could not find a version that satisfies the requirement unstructured-u (from versions: none)
ERROR: No matching distribution found for unstructured-u

In [5]:
!pip install openai langchain



In [6]:
!sudo apt-get install -y poppler-utils
!sudo apt-get install -y tesseract-ocr


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [7]:
!pip install tiktoken



In [8]:
!pip install pillow_heif



In [9]:
import os
import json
import pprint
import openai
import chromadb

from chromadb.utils import embedding_functions
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings

In [10]:
filename = "/content/NVIDIAAn.pdf"  # NVIDIA's earnings report, uploaded to Google Colab's "/content" directory
output_dir = "/content/Output"  # Output directory, also under "/content"

In [11]:
# Define parameters for Unstructured's library
strategy = "hi_res"  # High-resolution strategy, used for analyzing PDFs and extracting table structure
model_name = "yolox"  # Model chosen here for table extraction; other options are detectron2_onnx and chipper, depending on file layout

In [12]:
!pip install pikepdf



In [16]:
!pip install pytesseract



In [14]:
elements = partition_pdf(filename=filename, strategy=strategy, infer_table_structure=True, model_name=model_name)

yolox_l0.05.onnx:   0%|          | 0.00/217M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/115M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/46.8M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
elements_to_json(elements, filename=f"{filename}.json")  # The file can take a while to appear in Colab's file browser

In [44]:
def process_json_file(input_filename, output_filename="/content/nvidia-yolox.txt"):
    """Write the HTML of every Table element in the JSON file to a text file."""
    # Read the JSON file produced by elements_to_json
    with open(input_filename, "r", encoding="utf-8") as file:
        data = json.load(file)

    # Collect the HTML rendering of each table element, skipping tables without one
    extracted_elements = []
    for entry in data:
        if entry["type"] == "Table":
            table_html = entry.get("metadata", {}).get("text_as_html")
            if table_html:
                extracted_elements.append(table_html)

    # Write the extracted tables to the output file
    with open(output_filename, "w", encoding="utf-8") as output_file:
        for element in extracted_elements:
            output_file.write(element + "\n\n")  # Blank line between tables


In [45]:
process_json_file(f"{filename}.json") # Takes a while for the .txt file to show up in Colab
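
The text file written above holds one HTML string per table. To check what was actually extracted, it can help to turn such a string back into rows of cells. The following is a minimal sketch using only the standard library; the class `TableRows`, the helper `html_table_to_rows`, and the sample HTML are hypothetical names and data made up for illustration, not part of Unstructured's API or of the NVIDIA report.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample, shaped like a small "text_as_html" value
sample_html = (
    "<table><tr><th>Item</th><th>Q4</th></tr>"
    "<tr><td>Revenue</td><td>100</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
for row in rows:
    print(row)
```

This sketch assumes simple, non-nested tables; merged cells (`rowspan`/`colspan`) are not expanded.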

In [46]:
text_file = "/content/nvidia-yolox.txt"

In [47]:
loader = TextLoader(text_file)
documents = loader.load()

In [49]:
# Split the loaded document into chunks
text_splitter = CharacterTextSplitter(chunk_size=3304, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
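
Because each table was written to the text file followed by a blank line, a splitter that cuts on blank lines will tend to keep whole tables together. The sketch below illustrates that idea in plain Python: split on a separator, then greedily pack the pieces into chunks up to a size limit. It is a simplified illustration, not LangChain's implementation; the function `greedy_chunks` and the sample text are hypothetical, and `chunk_overlap` is deliberately ignored.

```python
def greedy_chunks(text, chunk_size, separator="\n\n"):
    """Split on `separator`, then pack pieces into chunks of at most `chunk_size` characters.

    A single piece longer than `chunk_size` is kept whole as its own chunk.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit: close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: three tiny "tables", each followed by a blank line
sample = "<table>A</table>\n\n<table>B</table>\n\n<table>C</table>\n\n"
chunks = greedy_chunks(sample, chunk_size=35)
for chunk in chunks:
    print(repr(chunk))
```

With a limit of 35 characters, the first two 16-character tables fit in one chunk and the third starts a new one. A table larger than the limit would still end up as a single oversized chunk in this sketch.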

In [30]:
!pip install -U langchain-openai

Collecting langchain-openai
  Downloading langchain_openai-0.0.8-py3-none-any.whl (32 kB)
Installing collected packages: langchain-openai
Successfully installed langchain-openai-0.0.8


In [50]:
from langchain_openai import OpenAIEmbeddings

In [51]:
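from getpass import getpass

# Read the OpenAI API key at runtime instead of hardcoding it in the notebook
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
embeddings = OpenAIEmbeddings()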

In [71]:
db = Chroma.from_documents(docs, embeddings)
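
Once the chunks are embedded and stored, retrieval amounts to finding the stored vectors closest to the query vector. The toy sketch below shows that lookup with cosine similarity, which is one common choice. The three-dimensional vectors, the chunk names, and the helpers `cosine_similarity` and `top_k` are all hypothetical; real embeddings have far more dimensions, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hypothetical toy "embeddings" for three table chunks
toy_vectors = {
    "income_statement": [0.9, 0.1, 0.0],
    "balance_sheet": [0.1, 0.9, 0.0],
    "cash_flow": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # hypothetical embedding of a question about revenue

best = top_k(query, toy_vectors, k=2)
print(best)
```

Only the retrieved chunks are passed to the language model, so a question whose embedding is not close to any table chunk may get an answer with little table content behind it.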

In [72]:
# Initialize your model and retriever
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever())

# List of questions
questions = [
    "tell me about all table  name",
    "data inside each table",
]

# Store responses in output_list
output_list = []

for query in questions:
    response = qa_chain({"query": query})
    output_list.append(response)

In [73]:
# Use pprint to pretty print the output list
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(output_list)

[   {   'query': 'tell me about all table  name',
        'result': 'The tables you provided contain financial information for '
                  'the periods ending January 30, 2022, January 31, 2021, and '
                  'some details about revenue, cost of revenue, gross profit, '
                  'operating expenses, net income, and earnings per share.'},
    {   'query': 'data inside each table',
        'result': "I'm sorry, but I cannot see the data inside the tables you "
                  'provided. If you have a specific question or need '
                  "information from the tables, please let me know, and I'll "
                  'do my best to help you.'}]
