# Build a document-based question answering system by using Docling with Granite 3.1

**Authors:** Ash Minhas, Anna Gutowska

In this tutorial, you will use IBM® [Docling](https://github.com/DS4SD/docling) and open-source [Granite™ 3.1](https://www.ibm.com/granite) to perform document visual question answering for various file types. 

## What is Docling? 

Docling is an IBM open-source toolkit for parsing documents and exporting them to preferred formats. Input file formats include PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc and Markdown. These documents can be exported to Markdown or JSON. Docling also provides [OCR (optical character recognition)](https://www.ibm.com/think/topics/optical-character-recognition) support for scanned documents. Use cases include scanning medical records, banking documents and even travel documents for quicker processing.

## RAG and large context windows

[Retrieval augmented generation (RAG)](https://www.ibm.com/think/topics/retrieval-augmented-generation) is an architecture for connecting [large language models (LLMs)](https://www.ibm.com/topics/large-language-models) with external knowledge bases without [fine-tuning](https://www.ibm.com/topics/fine-tuning) or retraining. Text from the knowledge base is embedded and stored in a vector database. At query time, the most relevant passages are retrieved and passed to the pre-trained model, which uses them to return relevant information for [natural language processing (NLP)](https://www.ibm.com/think/topics/natural-language-processing) and [machine learning](https://www.ibm.com/topics/machine-learning) tasks.
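
To make the embed, store and retrieve steps concrete, here is a minimal, dependency-free sketch. It is not how the pipeline in this tutorial works internally: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list, k: int = 2) -> list:
    """Return the k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


chunks = [
    "Docling converts PDF and DOCX files to markdown.",
    "FAISS stores embedding vectors for similarity search.",
    "Granite 3.1 supports a context window of 128K tokens.",
]
store = [(c, embed(c)) for c in chunks]  # the toy "vector database"

question = "How large is the Granite context window?"
context = retrieve(question, store, k=1)

# The retrieved text is placed in the prompt that would be sent to the LLM.
prompt = f"Answer using this context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

The retrieved chunk is simply prepended to the question, which is the essence of RAG: the model itself is unchanged, and only the prompt carries the external knowledge.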

When an LLM has a larger [context window](<https://www.ibm.com/think/topics/context-window#:~:text=The%20context%20window%20(or%20%E2%80%9Ccontext,of%20information%20into%20each%20output.>), the generative AI model can process more information at once. Combining RAG with a large-context model therefore lets us pass more retrieved passages to the model in a single prompt. The LLM we use in this tutorial is the IBM `Granite-3.1-8B-Instruct` model, which supports a context window of 128K tokens. We will run the model locally by using [Ollama](https://ollama.com/), without relying on a remote [API](https://www.ibm.com/topics/api). This model is also available on [Hugging Face](https://huggingface.co/ibm-granite).
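
As a rough illustration of what a 128K-token window means for RAG, the following sketch estimates how much of the window the retrieved context would occupy with the settings used later in this tutorial (10 retrieved chunks of up to 500 characters each). The figure of 4 characters per token is an assumption, a common rule of thumb for English text rather than a property of the Granite tokenizer, and the function name is hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 128_000  # approximate size from the model description
CHARS_PER_TOKEN = 4              # assumption: rough rule of thumb, not tokenizer-exact


def estimate_tokens(num_chunks: int, chunk_chars: int) -> int:
    """Estimate the token count of num_chunks chunks of chunk_chars characters."""
    return (num_chunks * chunk_chars) // CHARS_PER_TOKEN


retrieved = estimate_tokens(num_chunks=10, chunk_chars=500)
share = retrieved / CONTEXT_WINDOW_TOKENS
print(f"~{retrieved} tokens of retrieved context ({share:.1%} of the window)")
```

Under these assumptions the retrieved context uses only about 1% of the window, which leaves ample room for conversation history or for retrieving more or larger chunks.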


In [1]:
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

>_Note_:  This is a bad workaround for the following issue:
>`OMP: Error #15`: Initializing `libomp140.x86_64.dll`, but found `libiomp5md.dll` already initialized.
>OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/


## Steps

This tutorial can be found on our GitHub in the form of a Jupyter Notebook. Jupyter Notebooks are widely used within [data science](https://www.ibm.com/topics/data-science) to combine code, text, images and [data visualizations](https://www.ibm.com/topics/data-visualization) into a well-formed analysis.

### Step 1. Set up your environment
We first need to set up our environment by fulfilling some prerequisites.

1. Install the latest version of [Ollama](https://ollama.com/) to run locally.

2. Pull the Granite 3.1 model and the `nomic-embed-text` embedding model by running the `ollama pull` commands that follow this list.

3. Install and import the necessary libraries and modules.

In [2]:
!ollama pull granite3.1-dense:8b
!ollama pull nomic-embed-text

pulling manifest
pulling 0a922eb99317... 100% 4.9 GB
pulling f7b956e70ca3... 100%   69 B
pulling f76a906816c4... 100% 1.4 KB
pulling 49206
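
Before continuing, it can be useful to confirm that both models were pulled. The sketch below parses the output of `ollama list`. It assumes that the first line of that output is a header row and that the first column of each following row is the model name; the helper names are hypothetical.

```python
import shutil
import subprocess

REQUIRED_MODELS = ["granite3.1-dense:8b", "nomic-embed-text"]


def missing_models(list_output: str, required: list) -> list:
    """Return the required models that do not appear in `ollama list` output."""
    # Assumption: skip the header row; the first column of each row is the name.
    rows = [line for line in list_output.splitlines()[1:] if line.strip()]
    names = [row.split()[0] for row in rows]
    return [
        model for model in required
        if not any(n == model or n.startswith(model + ":") for n in names)
    ]


def check_ollama() -> list:
    """Run `ollama list` and report which required models are missing."""
    if shutil.which("ollama") is None:
        raise RuntimeError("The ollama executable was not found on PATH")
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    )
    return missing_models(result.stdout, REQUIRED_MODELS)
```

Calling `check_ollama()` on a machine with Ollama installed should return an empty list once both pulls have completed.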

In [3]:
# Install required packages
!pip install -q "langchain>=0.1.0" "langchain-community>=0.0.13" "langchain-core>=0.1.17" \
    "langchain-ollama>=0.0.1" "pdfminer.six>=20221105" "markdown>=3.5.2" "docling>=2.0.0" \
    "beautifulsoup4>=4.12.0" "unstructured>=0.12.0" "chromadb>=0.4.22" "faiss-cpu>=1.7.4" \
    "requests>=2.32.0"

In [4]:
# Required imports
import os
import tempfile
import shutil
from pathlib import Path
from IPython.display import Markdown, display

# Docling imports
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption, WordFormatOption, SimplePipeline

# LangChain imports
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_community.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
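
If one of the imports above fails, a quick look at the installed package versions can help narrow down the cause. The following sketch uses only the standard library; the helper name `installed_versions` is hypothetical, and the package list is a subset of the `pip install` command shown earlier.

```python
from importlib import metadata
from typing import Dict, List, Optional


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


packages = ["langchain", "langchain-community", "langchain-ollama", "docling", "faiss-cpu"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```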


### Step 2. Document format detection

We will work with various document formats in this tutorial. Let's create a helper function to detect document formats by using the file extension.

In [5]:
def get_document_format(file_path) -> InputFormat:
    """Determine the document format based on file extension.

    Returns None if the extension is not supported.
    """
    try:
        file_path = str(file_path)
        extension = os.path.splitext(file_path)[1].lower()

        format_map = {
            '.pdf': InputFormat.PDF,
            '.docx': InputFormat.DOCX,
            '.doc': InputFormat.DOCX,
            '.pptx': InputFormat.PPTX,
            '.html': InputFormat.HTML,
            '.htm': InputFormat.HTML
        }
        return format_map.get(extension, None)
    except Exception as e:
        # Report the problem and signal "unsupported" rather than returning a string
        print(f"Error in get_document_format: {e}")
        return None
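
To see how an extension lookup of this kind behaves on edge cases, here is a standalone sketch that runs without Docling installed. `DocFormat` is a hypothetical stand-in for Docling's `InputFormat`, and the mapping covers only a subset of the extensions above.

```python
import os
from enum import Enum
from typing import Optional


class DocFormat(Enum):
    """Hypothetical stand-in for docling's InputFormat, for illustration only."""
    PDF = "pdf"
    DOCX = "docx"
    PPTX = "pptx"
    HTML = "html"


FORMAT_MAP = {
    ".pdf": DocFormat.PDF,
    ".docx": DocFormat.DOCX,
    ".pptx": DocFormat.PPTX,
    ".html": DocFormat.HTML,
    ".htm": DocFormat.HTML,
}


def detect_format(file_path) -> Optional[DocFormat]:
    """Look up the format from the lowercased extension; None if unsupported."""
    extension = os.path.splitext(str(file_path))[1].lower()
    return FORMAT_MAP.get(extension)


for name in ["report.PDF", "slides.pptx", "notes.txt", "archive.tar.gz", "README"]:
    print(f"{name!r:18} -> {detect_format(name)}")
```

Uppercase extensions are matched because the extension is lowercased first, while unknown extensions and files without an extension yield `None`. Note that the lookup trusts the file name only and does not inspect the file contents.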

### Step 3. Document conversion

Next, we can use the `DocumentConverter` class to create a function that converts any supported document to markdown. This function identifies text, data tables, document images and captions by using Docling. The function takes a file as input, processes it using Docling's advanced document handling, converts it to markdown and saves the results in a Markdown file. Document structure is preserved. Docling supports both scanned and text-based documents, but note that the function in this step sets `do_ocr = False`, so OCR must be enabled in the pipeline options before scanned documents can be processed. Key components of this function are:
- `PdfPipelineOptions`: Configures how PDFs are processed.
- `TesseractCliOcrOptions`: Sets up OCR for scanned documents (imported in this tutorial, but unused while OCR is disabled).
- `DocumentConverter`: Handles the actual conversion process.

In [6]:
def convert_document_to_markdown(doc_path) -> str:
    """Convert document to markdown using simplified pipeline"""
    try:
        # Convert to absolute path string
        input_path = os.path.abspath(str(doc_path))
        print(f"Converting document: {doc_path}")

        # Create temporary directory for processing
        with tempfile.TemporaryDirectory() as temp_dir:
            # Copy input file to temp directory
            temp_input = os.path.join(temp_dir, os.path.basename(input_path))
            shutil.copy2(input_path, temp_input)

            # Configure pipeline options
            pipeline_options = PdfPipelineOptions()
            pipeline_options.do_ocr = False  # Disable OCR temporarily
            pipeline_options.do_table_structure = True

            # Create converter with minimal options
            converter = DocumentConverter(
                allowed_formats=[
                    InputFormat.PDF,
                    InputFormat.DOCX,
                    InputFormat.HTML,
                    InputFormat.PPTX,
                ],
                format_options={
                    InputFormat.PDF: PdfFormatOption(
                        pipeline_options=pipeline_options,
                    ),
                    InputFormat.DOCX: WordFormatOption(
                        pipeline_cls=SimplePipeline
                    )
                }
            )

            # Convert document
            print("Starting conversion...")
            conv_result = converter.convert(temp_input)

            if not conv_result or not conv_result.document:
                raise ValueError(f"Failed to convert document: {doc_path}")

            # Export to markdown
            print("Exporting to markdown...")
            md = conv_result.document.export_to_markdown()

            # Create output path
            output_dir = os.path.dirname(input_path)
            base_name = os.path.splitext(os.path.basename(input_path))[0]
            md_path = os.path.join(output_dir, f"{base_name}_converted.md")

            # Write markdown file
            print(f"Writing markdown to: {base_name}_converted.md")
            with open(md_path, "w", encoding="utf-8") as fp:
                fp.write(md)

            return md_path
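    except Exception as e:
        # Re-raise with context so callers never mistake an error message for a file path
        raise RuntimeError(f"Error converting document: {doc_path}") from e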
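
The conversion function in this step runs with `do_ocr = False`. For scanned documents, OCR could be enabled along the following lines. This is a sketch rather than part of the original tutorial: it assumes Docling 2.x and a Tesseract command-line installation available on your PATH, the function name `build_ocr_converter` is hypothetical, and the English-only language list is an assumption.

```python
def build_ocr_converter():
    """Build a DocumentConverter whose PDF pipeline runs Tesseract OCR.

    Assumes docling 2.x and a Tesseract CLI installation available on PATH.
    """
    # Imports are local so the function can be defined before docling is loaded.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True               # run OCR on page images
    pipeline_options.do_table_structure = True   # keep table detection enabled
    # Assumption: English-only documents; adjust the language list as needed.
    pipeline_options.ocr_options = TesseractCliOcrOptions(lang=["eng"])

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

A converter built this way can be used like the one in the function above, for example `build_ocr_converter().convert("scanned.pdf").document.export_to_markdown()`, where `scanned.pdf` is a placeholder file name. OCR adds noticeable processing time, so it is reasonable to enable it only for documents that need it.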

### Step 4. QA chain setup

The QA chain is the heart of our system. It combines several components:

1. Document loading:
   - Loads the markdown file that we created.
   - Splits it into manageable chunks for processing.

2. Text splitting:
   - Breaks down the document into smaller pieces.
   - Maintains context with overlap between chunks.
   - Ensures efficient processing by the language model.

3. Vector store:
   - Creates embeddings for each text chunk.
   - Stores them in a FAISS index for fast retrieval.
   - Enables semantic search capabilities.

4. Language model:
   - Uses Ollama for both embeddings and text generation.
   - Maintains conversation history.
   - Generates contextual responses.
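
To illustrate what chunk size and overlap mean, here is a simplified sketch that splits text into fixed-size character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs, lines and spaces; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(1200))  # 1,200 characters of dummy text
chunks = split_with_overlap(text)
print([len(c) for c in chunks])            # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])   # True
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen in full by at least one chunk.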

The following `setup_qa_chain` function sets up this entire pipeline.

In [7]:
def setup_qa_chain(markdown_path: Path, embeddings_model_name: str = "nomic-embed-text:latest", model_name: str = "granite3.1-dense:8b"):
    """Set up the QA chain for document processing"""
    # Load and split the document
    loader = UnstructuredMarkdownLoader(str(markdown_path)) 
    documents = loader.load()
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        length_function=len
    )
    texts = text_splitter.split_documents(documents)
    # texts= documents
    
    # Create embeddings and vector store
    embeddings = OllamaEmbeddings(
        model=embeddings_model_name
        )
    vectorstore = FAISS.from_documents(texts, embeddings)
    
    # Initialize LLM
    llm = OllamaLLM(
        model=model_name,
        temperature=0
    )
    
    # Set up conversation memory
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        output_key="answer",
        return_messages=True
    )
    
    # Create the chain
    qa_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(
            search_kwargs={"k": 10}
            ),
        memory=memory,
        return_source_documents=True
    )
    
    return qa_chain

### Step 5. Set up question-answering interface

Finally, let's create a simple interface for asking questions. This function takes in the chain and user query as parameters. 

In [8]:
def ask_question(qa_chain, question: str):
    """Ask a question and display the answer"""
    result = qa_chain.invoke({"question": question})
    display(Markdown(f"**Question:** {question}\n\n**Answer:** {result['answer']}"))
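
Because the chain is created with `return_source_documents=True`, each result also carries the retrieved chunks under the `source_documents` key, which `ask_question` does not display. The sketch below shows one way to preview them. The helper name `summarize_sources` is hypothetical, and the stand-in result only mimics the `page_content` and `metadata` attributes of LangChain documents.

```python
from types import SimpleNamespace


def summarize_sources(result: dict, max_chars: int = 80) -> list:
    """Return a short one-line preview of each retrieved chunk in a chain result."""
    previews = []
    for i, doc in enumerate(result.get("source_documents", []), start=1):
        # Collapse whitespace so each preview fits on a single line
        snippet = " ".join(doc.page_content.split())[:max_chars]
        source = doc.metadata.get("source", "unknown")
        previews.append(f"[{i}] {source}: {snippet}")
    return previews


# Stand-in result for illustration; a real one comes from qa_chain.invoke(...).
fake_result = {
    "answer": "An example answer.",
    "source_documents": [
        SimpleNamespace(page_content="First   retrieved\nchunk.",
                        metadata={"source": "doc_converted.md"}),
        SimpleNamespace(page_content="Second chunk.", metadata={}),
    ],
}
for line in summarize_sources(fake_result):
    print(line)
```

Inspecting the retrieved chunks is a quick way to check whether an unexpected answer stems from retrieval or from generation.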

### Step 6. Perform question-answering

Let's put it all together and iterate over our questions for a specific document. The path to this document is stored in `doc_path` and can be replaced with any supported document you want to test. For our sample document, check out our GitHub. Because the chain keeps conversation history, it can also handle follow-up questions.

In [9]:
# Process a document
doc_path = Path("Stochastic_Parrots.pdf")  # Replace with your document path

# Check format and process
doc_format = get_document_format(doc_path)
if doc_format:
    md_path = convert_document_to_markdown(doc_path)
    qa_chain = setup_qa_chain(md_path)
    
    # Example questions
    questions = [
        "What is the main topic of this document?",
        "What are the key points discussed?",
        "Can you summarize the conclusions?",
    ]
    
    for question in questions:
        ask_question(qa_chain, question)
else:
    print(f"Unsupported document format: {doc_path.suffix}")

Converting document: Stochastic_Parrots.pdf
Starting conversion...
Exporting to markdown...
Writing markdown to: Stochastic_Parrots_converted.md




**Question:** What is the main topic of this document?

**Answer:** The main topic of this document revolves around responsible practices in the research and development of language technology. It emphasizes the importance of considering human impacts, environmental consequences, data curation, documentation, stakeholder engagement, and ethical considerations when creating such technologies. The authors advocate for a broad view on potential effects of technology on people and communities, particularly those that may be adversely affected. They also discuss the challenges associated with large datasets based on Internet texts, including overrepresentation of hegemonic viewpoints and encoding biases harmful to marginalized populations. The document encourages researchers to budget for curation and documentation at the beginning of a project and explore alternative research directions beyond ever-larger language models.

**Question:** What are the key points discussed?

**Answer:** 1. The document emphasizes the need for careful planning in language technology research and development, considering various dimensions to mitigate risks associated with increasingly large language models (LMs).

2. It advocates for a mindset that centers on the people who may be adversely affected by the resulting technology, taking into account environmental impacts, data curation, documentation, and stakeholder engagement early in the design process.

3. The authors warn about the risks of ingesting everything from the web, as it can lead to overrepresentation of hegemonic viewpoints and encoding biases potentially damaging to marginalized populations. They recommend budgeting for curation and documentation at the start of a project.

4. The document discusses how large datasets based on texts from the Internet may incur "documentation debt," making it difficult to understand what is in the training data as model size increases.

5. It highlights the unequal distribution of risks and benefits, with marginalized communities often bearing the brunt of negative consequences, such as environmental racism.

6. The authors explore real-world risks associated with deploying language technologies that encode hegemonic worldviews, amplify biases, and are mistaken for actual natural language understanding.

7. They suggest reevaluating the primary driver of increased performance in language technology, moving away from relying solely on ever-increasing size of LMs towards approaches that avoid some of these risks while still benefiting from improvements to language technology.

8. The document references a critical survey on 'bias' in NLP by Su Lin Blodgett et al., published in the Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020).

**Question:** Can you summarize the conclusions?

**Answer:** The main conclusions drawn from this document regarding responsible practices in language technology research and development are as follows:

1. Hegemonic viewpoints and biases: Large datasets based on texts from the Internet tend to overrepresent hegemonic worldviews and encode biases that can be potentially damaging to marginalized populations (§4).

2. Documentation debt: The risk of incurring documentation debt increases with the size of models, making it difficult to understand what is in the training data (§4).

3. Careful planning: Researchers should adopt a mindset of careful planning before starting to build datasets or systems trained on them, considering various dimensions such as environmental impact, human impacts, and ethical implications (§7).

4. Environmental consequences: The environmental impact scales with model size, and the document encourages weighing the environmental costs first when developing language technology (§6).

5. Data curation and documentation: Investing resources into curating and carefully documenting datasets is recommended instead of ingesting everything on the web (§7).

6. Pre-development exercises: Conducting pre-development exercises to evaluate how the planned approach fits into research and development goals and supports stakeholder values is encouraged (§7).

7. Stakeholder engagement: Engaging with stakeholders, including marginalized communities, is essential to ensure that the benefits and risks of language technology are distributed fairly (§6).

8. Ethical implications: The document highlights the need to consider ethical implications, such as the potential for amplifying biases in training data, when developing language technology (§5).

9. Risk/benefit analysis: When performing risk/benefit analyses of language technology, it is crucial to keep in mind how risks and benefits are distributed, as they do not accrue to the same people (§6).

10. Avoiding over-reliance on large models: The document suggests that relying solely on ever-increasing size of LMs as the primary driver of increased performance may lead to unintended consequences and risks, and encourages exploring alternative research directions beyond larger language models (§7).

Great! The system was able to retrieve relevant information from the document to answer questions. Feel free to test this system with any of your own files and questions!
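
Since the chain keeps conversation history, follow-up questions can be asked in a simple loop. The following is a minimal sketch, not part of the original notebook: `chat_loop` is a hypothetical helper, and the `read` and `write` parameters exist only so that the loop can be driven by something other than the keyboard.

```python
def chat_loop(qa_chain, read=input, write=print) -> int:
    """Answer questions until a blank line or 'quit'; return how many were answered."""
    answered = 0
    while True:
        question = read("Question (blank to quit): ").strip()
        if not question or question.lower() == "quit":
            break
        # The chain expects the same input shape used in ask_question
        result = qa_chain.invoke({"question": question})
        write(f"Answer: {result['answer']}\n")
        answered += 1
    return answered
```

With the chain created in this step, calling `chat_loop(qa_chain)` in a notebook cell or terminal lets you ask questions such as "Can you expand on the second point?" that rely on earlier turns.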

## Conclusion

Using Docling and Granite 3.1, you built a document question answering system compatible with various file types. As a next step, this methodology can be applied to a [chatbot](https://www.ibm.com/topics/chatbots) with an interactive UI. The same pipeline can also be adapted to more specific use cases.