# Building RAG with Custom Unstructured Data

Whether we are building our own RAG-based personal assistant, a pet project, or an enterprise RAG system, we will quickly discover that a lot of important knowledge is stored in various formats such as PDFs, emails, Markdown files, PPTs, HTML pages, Word documents, and so on.

In this example, we will build a RAG system that incorporates data from multiple data types. We will use the [`unstructured`](https://github.com/Unstructured-IO/unstructured) library for data preprocessing, open-source models from HuggingFace Hub for embeddings and text generation, ChromaDB as a vector store, and LangChain for bringing everything together.

## Setup

In [None]:
!pip install -qU torch transformers accelerate bitsandbytes sentence-transformers unstructured[all-docs] langchain chromadb langchain_community

Suppose we want to build a RAG system that helps us manage pests in our garden. We will use a diverse set of documents covering integrated pest management (IPM), in PDF, PPTX, EPUB, and HTML formats.

In [None]:
!mkdir -p "./documents"
!wget https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf -O "./documents/env-protection-pesticides-business-manuals-applic-chapter7.pdf"
!wget https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx -O "./documents/Citrus_IPM_090913.pptx"
!wget https://www.gutenberg.org/ebooks/45957.epub3.images -O "./documents/45957.epub"
!wget https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html -O "./documents/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html"

## Unstructured data preprocessing

We could use the `unstructured` library to preprocess documents one by one and write our own script to walk through a directory, but it is easier to use a local source connector that ingests all documents in a given directory.
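For comparison, the manual approach could start from a small directory walker like the sketch below. It uses only the standard library; `find_documents` and `SUPPORTED_SUFFIXES` are hypothetical names introduced for illustration, and the demo files are empty placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path

# Hypothetical set of extensions we care about; adjust to the files at hand.
SUPPORTED_SUFFIXES = {".pdf", ".pptx", ".epub", ".html"}


def find_documents(root, recursive=False):
    """Return sorted paths of supported documents under `root`."""
    root = Path(root)
    # rglob descends into subdirectories, glob stays at the top level
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        path
        for path in candidates
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Demo on empty placeholder files in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "guide.pdf").touch()
    (demo_root / "notes.txt").touch()  # ignored: unsupported suffix
    (demo_root / "nested").mkdir()
    (demo_root / "nested" / "page.html").touch()

    flat_names = [p.name for p in find_documents(demo_root)]
    nested_names = [p.name for p in find_documents(demo_root, recursive=True)]

print(flat_names)
print(nested_names)
```

Each returned path would then still need to be partitioned individually, which is the bookkeeping that a source connector takes care of.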

The `unstructured` library can ingest documents from local directories, S3 buckets, blob storage, SFTP, and so on. In this example, we will use a local source connector. Optionally, we can also choose a destination connector for the processed documents - this could be MongoDB, Pinecone, Weaviate, etc. Here, we will keep everything local.

In [None]:
import logging

logger = logging.getLogger('unstructured.ingest')
# Detach the first handler registered on the root logger
logger.root.removeHandler(logger.root.handlers[0])

In [None]:
from google.colab import userdata
UNSTRUCTURED_API_KEY = userdata.get('UNSTRUCTURED_API_KEY')

In [None]:
import os

from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
from unstructured.ingest.runner import LocalRunner

output_path = './local-ingest-output'

runner = LocalRunner(
    processor_config=ProcessorConfig(
        verbose=True, # logs verbosity
        output_dir=output_path, # local directory to store outputs
        num_processes=2
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_by_api=True,
        api_key=UNSTRUCTURED_API_KEY
    ),
    connector_config=SimpleLocalConfig(
        input_path='./documents',
        recursive=False, # set to True to also walk subdirectories
    )
)

runner.run()

* `ProcessorConfig` controls various aspects of the processing pipeline, including output locations, number of workers, error handling behavior, logging verbosity and more.
* `ReadConfig` customizes the data reading process for different scenarios, such as re-downloading data, preserving downloaded files, or limiting the number of documents processed.
* `PartitionConfig` determines whether the documents are partitioned locally or via the API. Here we use the API, which requires `UNSTRUCTURED_API_KEY`. If we remove the two parameters `partition_by_api` and `api_key`, the documents will be processed locally. In that case, we may need to install `poppler` and `tesseract` if the documents require OCR and/or document understanding models.
* `SimpleLocalConfig` specifies where our original documents reside and whether we want to walk through the directory recursively.

In [None]:
from unstructured.staging.base import elements_from_json

elements = []

for filename in os.listdir(output_path):
    filepath = os.path.join(output_path, filename)
    elements.extend(elements_from_json(filepath))
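Before chunking, it can help to check which element types the partitioning step produced. The sketch below assumes each output file holds a JSON list of objects with `type` and `text` keys; the sample records are made up for illustration, and `count_element_types` is a hypothetical helper, not part of `unstructured`.

```python
import json
from collections import Counter


def count_element_types(records):
    """Count how many elements of each type appear in a list of element dicts."""
    return Counter(record.get("type", "Unknown") for record in records)


# Made-up sample standing in for the contents of one output file
sample_json = """[
  {"type": "Title", "text": "Aphids"},
  {"type": "NarrativeText", "text": "Aphids feed on plant sap."},
  {"type": "ListItem", "text": "Inspect leaves weekly."},
  {"type": "ListItem", "text": "Encourage ladybugs."}
]"""

records = json.loads(sample_json)
type_counts = count_element_types(records)
print(type_counts.most_common())
```

On real output, the same helper could be applied to `json.load(...)` of each file in `output_path`.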

## Chunking

The chunking methods in `unstructured` differ slightly from the chunking methods we may be used to, because the partitioning step has already divided each document into its structural elements: titles, list items, tables, text, etc. Partitioning documents this way avoids a situation where unrelated pieces of text end up in the same element, and then in the same chunk.

When we chunk the document elements with `unstructured`, individual elements are already small, so an element is split only if it exceeds the desired maximum chunk size. We can also optionally combine consecutive text elements, such as list items, that together fit within the chunk size limit.
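The combine-then-split idea can be illustrated with a toy greedy chunker over plain strings. This is a simplified sketch, not the library's implementation: the hypothetical `chunk_texts` ignores section boundaries and cuts oversized texts at a fixed character offset.

```python
def chunk_texts(texts, max_characters=512, separator="\n\n"):
    """Greedy illustration: merge neighbours that fit, split oversized texts."""
    chunks = []
    current = ""
    for text in texts:
        if not text:
            continue  # nothing to add for empty elements
        if len(text) > max_characters:
            # Flush what we have, then cut the long text into fixed-size pieces
            if current:
                chunks.append(current)
                current = ""
            for start in range(0, len(text), max_characters):
                chunks.append(text[start:start + max_characters])
            continue
        candidate = text if not current else current + separator + text
        if len(candidate) <= max_characters:
            current = candidate  # still fits: keep combining
        else:
            chunks.append(current)  # would overflow: start a new chunk
            current = text
    if current:
        chunks.append(current)
    return chunks


toy_texts = ["Aphids", "Small, soft-bodied insects.", "x" * 30, "y" * 90]
toy_chunks = chunk_texts(toy_texts, max_characters=40)
for chunk in toy_chunks:
    print(len(chunk), repr(chunk[:20]))
```

The two short texts are merged into one chunk, the 30-character text stays on its own because adding it would exceed the limit, and the 90-character text is the only one that gets split.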

In [None]:
from unstructured.chunking.title import chunk_by_title

chunked_elements = chunk_by_title(
    elements,
    max_characters=512, # max chunk size
    combine_text_under_n_chars=200, # combine consecutive elements that are too small
)

Now the chunks are ready for RAG. To use them with LangChain, we can convert `unstructured` elements to LangChain documents.

In [None]:
from langchain_core.documents import Document

documents = []

for chunked_element in chunked_elements:
    metadata = chunked_element.metadata.to_dict()
    metadata['source'] = metadata['filename']

    # 'languages' holds a list; pop() drops it without failing if the key is absent
    metadata.pop('languages', None)

    documents.append(
        Document(
            page_content=chunked_element.text,
            metadata=metadata
        )
    )

## Setting up the retriever

We will use ChromaDB as a vector store and [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) as an embedding model.

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import utils as chromautils

embedding_model = HuggingFaceBgeEmbeddings(model_name='BAAI/bge-base-en-v1.5')

# ChromaDB does not support complex metadata, e.g., lists, so we drop it here.
docs = chromautils.filter_complex_metadata(documents)
vectorstore = Chroma.from_documents(docs, embedding_model)
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 3}
)
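Conceptually, a similarity retriever with `k=3` embeds the query and returns the three stored chunks whose vectors are closest to it. The sketch below illustrates this with hand-written three-dimensional toy vectors and cosine similarity; the metric is an assumption chosen for illustration, and the vector store's actual distance function and index structure may differ. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, indexed_vectors, k=3):
    """Return the ids of the k vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in indexed_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings"; real ones come from the embedding model
toy_index = {
    "aphids": [0.9, 0.1, 0.0],
    "compost": [0.0, 0.2, 0.9],
    "ladybugs": [0.8, 0.3, 0.1],
    "mulch": [0.1, 0.9, 0.2],
}
toy_query = [1.0, 0.2, 0.0]

nearest_ids = top_k(toy_query, toy_index, k=3)
print(nearest_ids)
```

A brute-force scan like this is fine for a handful of vectors; a vector store exists to make the same lookup efficient over many chunks.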

## RAG with LangChain

In this example, we will use [`Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with quantization.
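To get a feel for why quantization matters here, the following back-of-the-envelope sketch compares the memory needed for the model weights at 16 and 4 bits per parameter. The round figure of 8 billion parameters is an assumption, and the estimate covers weights only: it ignores quantization constants, activations, and the KV cache.

```python
def weight_memory_gib(num_parameters, bits_per_parameter):
    """Approximate memory for the weights alone, in GiB."""
    return num_parameters * bits_per_parameter / 8 / 1024**3


NUM_PARAMETERS = 8e9  # assumption: roughly 8 billion parameters

bf16_gib = weight_memory_gib(NUM_PARAMETERS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMETERS, 4)

print(f"16-bit weights: ~{bf16_gib:.1f} GiB")
print(f" 4-bit weights: ~{nf4_gib:.1f} GiB")
```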

In [None]:
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

In [None]:
model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config
)

# Stop on either the tokenizer's EOS token or Llama 3's end-of-turn token
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

In [None]:
text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=256,
    eos_token_id=terminators
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

In [None]:
prompt_template = """
<|start_header_id|>user<|end_header_id|>
You are an assistant for answering questions using provided context.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
Question: {question}
Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=['context', 'question'],
    template=prompt_template
)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
    chain_type_kwargs={'prompt': prompt}
)
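At query time, the chain roughly does the following: it retrieves the top chunks, joins them into a single context string, and fills the two placeholders of the prompt before calling the LLM. The sketch below imitates that step with plain string formatting. It is a simplified illustration rather than LangChain's implementation; `build_prompt`, the toy template, and the sample chunks are all made up.

```python
def build_prompt(template, question, retrieved_chunks, separator="\n\n"):
    """Join retrieved chunks into one context string and fill the template."""
    context = separator.join(retrieved_chunks)
    return template.format(question=question, context=context)


# Shortened stand-in for the real prompt template
toy_template = "Question: {question}\nContext: {context}\nAnswer:"

# Made-up chunks standing in for retriever output
toy_retrieved = [
    "Aphids are small sap-sucking insects.",
    "Ladybugs are natural predators of aphids.",
]

filled_prompt = build_prompt(toy_template, "Are aphids a pest?", toy_retrieved)
print(filled_prompt)
```

Because every retrieved chunk is pasted into the prompt, the number of chunks (`k`) and the chunk size together determine how much of the model's context window the retrieved text occupies.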

## Results

In [None]:
question = 'Are aphids a pest?'

qa_chain.invoke(question)['result']