# Example testing notebook

This notebook shows one way to test LLM-assisted document conversion against conversion without an LLM. Before running it, set up your Python environment using the code in `initial-setup.ipynb` and configure the `.ini` file as described below.

The notebook begins by loading credentials and configuration from an `.ini` file at `~/.hbai/ai-workflows.ini`. The `~` in the path refers to the current user's home directory. The file contents should follow this format (substituting your own keys, models, and paths):

    [openai]
    openai-api-key=keyhere-with-sk-on-front
    openai-model=gpt-4o
    azure-api-key=keyhere-or-blank
    azure-api-base=azure-base-url-here
    azure-api-engine=gpt-4o
    azure-api-version=2024-02-01

    [langsmith]
    langsmith-api-key=leave-blank-unless-you're-using-langsmith

    [files]
    input-dir=~/Files/ai-workflows/inputs
    output-dir=~/Files/ai-workflows/outputs

You can leave the Azure settings blank if you're using OpenAI (or vice versa), and you only need to supply a LangSmith API key if you're using LangSmith. The `input-dir` and `output-dir` settings specify the directories where input and output files are stored, respectively.
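Before running the cells below, it can help to confirm that the file contains every section and key the notebook reads. The following is a minimal sketch and not part of `ai_workflows`: `REQUIRED_KEYS` and `find_missing_settings` are hypothetical names, and the sample text is parsed from a string (with one key deliberately left out) so the check runs without touching your real file.

```python
import configparser

# Sections and keys read by this notebook (taken from the format shown above).
REQUIRED_KEYS = {
    "openai": ["openai-api-key", "openai-model", "azure-api-key",
               "azure-api-base", "azure-api-engine", "azure-api-version"],
    "langsmith": ["langsmith-api-key"],
    "files": ["input-dir", "output-dir"],
}

def find_missing_settings(parser: configparser.RawConfigParser) -> list:
    """Return "section/key" for every expected setting that is absent (hypothetical helper)."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        for key in keys:
            # has_option() is False when either the section or the key is absent
            if not parser.has_option(section, key):
                missing.append(f"{section}/{key}")
    return missing

# Sample configuration with blank Azure settings and no output-dir key.
sample = """
[openai]
openai-api-key=sk-placeholder
openai-model=gpt-4o
azure-api-key=
azure-api-base=
azure-api-engine=
azure-api-version=

[langsmith]
langsmith-api-key=

[files]
input-dir=~/Files/ai-workflows/inputs
"""

parser = configparser.RawConfigParser()
parser.read_string(sample)
missing_settings = find_missing_settings(parser)
print(missing_settings)
```

A key that is present but blank counts as present and reads back as an empty string, which is why the unused Azure or LangSmith settings can be left empty. A key that is left out altogether would instead make `inifile.get(...)` raise an error in the initialization cell.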

## Initializing

This next code block initializes the notebook, reading parameters from the configuration file and initializing an LLM interface.

In [1]:
# for convenience, auto-reload modules when they've changed
%load_ext autoreload
%autoreload 2

import logging
import configparser
import os
from ai_workflows.llm_utilities import LLMInterface 

# set log level to INFO so that conversion progress is reported
logging.basicConfig(level=logging.INFO)

# load credentials and other configuration from a local ini file
inifile_location = os.path.expanduser("~/.hbai/ai-workflows.ini")
inifile = configparser.RawConfigParser()
inifile.read(inifile_location)

# load configuration
openai_api_key = inifile.get("openai", "openai-api-key")
openai_model = inifile.get("openai", "openai-model")
azure_api_key = inifile.get("openai", "azure-api-key")
azure_api_base = inifile.get("openai", "azure-api-base")
azure_api_engine = inifile.get("openai", "azure-api-engine")
azure_api_version = inifile.get("openai", "azure-api-version")
input_dir = os.path.expanduser(inifile.get("files", "input-dir"))
output_dir = os.path.expanduser(inifile.get("files", "output-dir"))
langsmith_api_key = inifile.get("langsmith", "langsmith-api-key")

# initialize LangSmith API (if key specified)
if langsmith_api_key:
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = "local"
    os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
    os.environ["LANGCHAIN_API_KEY"] = langsmith_api_key

# initialize the LLM
llm = LLMInterface(
    openai_api_key=openai_api_key,
    openai_model=openai_model,
    azure_api_key=azure_api_key,
    azure_api_base=azure_api_base,
    azure_api_engine=azure_api_engine,
    azure_api_version=azure_api_version,
    langsmith_api_key=langsmith_api_key
)

# report results
print("Local configuration loaded, LLM initialized.")

Local configuration loaded, LLM initialized.


## Converting input documents to Markdown

This next code block runs through all the files in the configured input directory and converts them to Markdown, saving the Markdown files in the output directory. It also converts each file without the LLM (adding a "-no-llm" suffix to the base name of the output file) so that you can see the difference between LLM-assisted and regular conversion.

In [2]:
# use document_utilities to convert all files in the input directory
from ai_workflows.document_utilities import DocumentInterface

# initialize the document interface
doc = DocumentInterface(llm_interface=llm)
doc_no_llm = DocumentInterface()

# convert all files in the input directory
for filename in os.listdir(input_dir):
    if os.path.isfile(os.path.join(input_dir, filename)) and not filename.startswith('.') and not filename.endswith('.md'):
        print(f"Converting {filename} to markdown...")
        input_path = os.path.join(input_dir, filename)
        markdown = doc.convert_to_markdown(input_path)

        # write the markdown to the output directory
        output_path = os.path.join(output_dir, os.path.splitext(os.path.basename(filename))[0] + '.md')
        with open(output_path, 'w') as f:
            f.write(markdown)

        # now convert again without the LLM
        markdown_no_llm = doc_no_llm.convert_to_markdown(input_path)
        output_path = os.path.join(output_dir, os.path.splitext(os.path.basename(filename))[0] + '-no-llm.md')
        with open(output_path, 'w') as f:
            f.write(markdown_no_llm)        

        print(f"Conversion complete. Markdown saved to {output_path}")

Converting test2.pdf to markdown...


INFO:root:Processing PDF /Users/crobert/Files/ai-workflows/inputs/test2.pdf from 17 images
INFO:root:Processing PDF page 1: Size=(2481, 3508), Mode=RGB
INFO:httpx:HTTP Request: POST https://hbai-openai-useast2.openai.azure.com//openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01 "HTTP/1.1 200 OK"
INFO:root:Extracted JSON for page 1: {
  "elements": [
    {
      "type": "image",
      "content": "The image shows a stylized illustration of a group of people holding a large sign. The sign features a logo with an eye design and the text 'CARBON MARKET WATCH'. Below the group, the text 'ANNUAL REPORT 2023' is prominently displayed. The background is a solid teal color, and the overall design is minimalist with silhouettes of people."
    }
  ]
}
INFO:root:Processing PDF page 2: Size=(2480, 3508), Mode=RGB
INFO:httpx:HTTP Request: POST https://hbai-openai-useast2.openai.azure.com//openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01 "HTTP/1.1 200 OK"
INFO:root:Ext

Processing /Users/crobert/Files/ai-workflows/inputs/test2.pdf...
Conversion complete. Markdown saved to /Users/crobert/Files/ai-workflows/outputs/test2-no-llm.md
Converting test1.pdf to markdown...


INFO:root:Processing PDF /Users/crobert/Files/ai-workflows/inputs/test1.pdf from 21 images
INFO:root:Processing PDF page 1: Size=(2550, 3300), Mode=RGB
INFO:httpx:HTTP Request: POST https://hbai-openai-useast2.openai.azure.com//openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01 "HTTP/1.1 200 OK"
INFO:root:Extracted JSON for page 1: {
  "elements": [
    {
      "type": "body_text_section",
      "content": "# Wealth and Well-being: Lessons from Indian Debt Relief\n\nChristopher Robert  \n*John F. Kennedy School of Government, Harvard University*  \n*Mailbox 27, 79 JFK Street, Cambridge, MA 02138*  \n*617-807-0794*  \n*chris_robert@hksphd.harvard.edu*"
    },
    {
      "type": "body_text_section",
      "content": "## Abstract\n\nThis paper uses a natural experiment to estimate the causal effect of income on subjective well-being. Among a population of indebted farmers in rural India, the marginal effect of income on self-reported life satisfaction is found to be positi

Processing /Users/crobert/Files/ai-workflows/inputs/test1.pdf...
Conversion complete. Markdown saved to /Users/crobert/Files/ai-workflows/outputs/test1-no-llm.md
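The output naming used in the conversion loop above can be isolated into a small helper, which makes the convention easier to check. This is a sketch only: `markdown_output_paths` is a hypothetical name, and `pathlib` is swapped in for the `os.path` calls used in the cell.

```python
from pathlib import Path

def markdown_output_paths(input_path, output_dir):
    """Return (LLM output path, no-LLM output path) for one input file (hypothetical helper)."""
    stem = Path(input_path).stem          # file name without directory or extension
    out = Path(output_dir)
    return out / f"{stem}.md", out / f"{stem}-no-llm.md"

llm_path, no_llm_path = markdown_output_paths("inputs/test1.pdf", "outputs")
print(llm_path.name, no_llm_path.name)

# Two inputs that differ only by extension map to the same output files.
same_outputs = (markdown_output_paths("inputs/report.pdf", "outputs")
                == markdown_output_paths("inputs/report.docx", "outputs"))
print(same_outputs)
```

Because the output name is derived from the base name alone, inputs that share a base name but differ in extension would overwrite each other's Markdown, so it is safest to keep base names unique within the input directory.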


## Comparing Markdown documents

The next several sections contain code for comparing the LLM and no-LLM versions of the Markdown outputs. They are a work in progress.

In [3]:
import re
from difflib import SequenceMatcher
from typing import Tuple, List, Dict

class MarkdownDiffAnalyzer:
    """A class to analyze differences between Markdown documents while ignoring formatting."""
    
    @staticmethod
    def strip_markdown(text: str) -> str:
        """
        Remove markdown formatting while preserving the actual text content.
        
        Args:
            text (str): The markdown text to process
            
        Returns:
            str: Text with markdown formatting removed
        """
        
        # Remove code blocks and their content
        text = re.sub(r'```[\s\S]*?```', '', text)
        
        # Remove inline code
        text = re.sub(r'`[^`]*`', '', text)
        
        # Remove headers
        text = re.sub(r'^#{1,6}\s*', '', text, flags=re.MULTILINE)
        
        # Remove bold and italic
        text = re.sub(r'\*\*.*?\*\*', lambda m: m.group()[2:-2], text)
        text = re.sub(r'\*.*?\*', lambda m: m.group()[1:-1], text)
        text = re.sub(r'__.*?__', lambda m: m.group()[2:-2], text)
        text = re.sub(r'_.*?_', lambda m: m.group()[1:-1], text)
        
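        # Remove images first; otherwise the link pattern below would match the
        # bracketed part of an image and leave a stray "!" plus the alt text
        text = re.sub(r'!\[([^\]]*)\]\([^\)]+\)', '', text)
        
        # Remove links but keep text
        text = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', text)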
        
        # Remove horizontal rules
        text = re.sub(r'^[-*_]{3,}$', '', text, flags=re.MULTILINE)
        
        # Remove blockquotes
        text = re.sub(r'^\s*>\s*', '', text, flags=re.MULTILINE)
        
        # Remove list markers
        text = re.sub(r'^\s*[-*+]\s+', '', text, flags=re.MULTILINE)
        text = re.sub(r'^\s*\d+\.\s+', '', text, flags=re.MULTILINE)
        
        # Remove tables
        text = re.sub(r'\|.*\|', '', text)
        text = re.sub(r'^\s*[-:|\s]+$', '', text, flags=re.MULTILINE)
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text)
        
        return text.strip()
    
    @staticmethod
    def find_missing_chunks(text1: str, text2: str, min_length: int = 10) -> Tuple[List[str], List[str]]:
        """
        Find chunks of text that exist in one document but not the other.
        
        Args:
            text1 (str): First text to compare
            text2 (str): Second text to compare
            min_length (int): Minimum length of chunks to consider
            
        Returns:
            Tuple[List[str], List[str]]: Lists of chunks unique to text1 and text2
        """
        # Initialize sequence matcher
        matcher = SequenceMatcher(None, text1, text2)
        
        # Get matching blocks
        matches = matcher.get_matching_blocks()
        
        # Find chunks unique to text1
        unique_to_1 = []
        last_a = 0
        for match in matches:
            i, j, n = match
            if i - last_a >= min_length:
                unique_to_1.append(text1[last_a:i].strip())
            last_a = i + n
            
        # Find chunks unique to text2
        unique_to_2 = []
        last_b = 0
        for match in matches:
            i, j, n = match
            if j - last_b >= min_length:
                unique_to_2.append(text2[last_b:j].strip())
            last_b = j + n
            
        return unique_to_1, unique_to_2
    
    def compare_markdown_docs(self, md1: str, md2: str, min_length: int = 10) -> Dict:
        """
        Compare two Markdown documents and find their differences.
        
        Args:
            md1 (str): First Markdown document
            md2 (str): Second Markdown document
            min_length (int): Minimum length of different chunks to report
            
        Returns:
            Dict: Dictionary containing analysis results
        """
        # Strip Markdown formatting
        clean1 = self.strip_markdown(md1)
        clean2 = self.strip_markdown(md2)
        
        # Find differences
        missing_from_2, missing_from_1 = self.find_missing_chunks(clean1, clean2, min_length)
        
        # Calculate similarity ratio
        similarity = SequenceMatcher(None, clean1, clean2).ratio()
        
        return {
            'similarity_ratio': similarity,
            'missing_from_doc1': missing_from_1,
            'missing_from_doc2': missing_from_2,
            'clean_text1': clean1,
            'clean_text2': clean2
        }
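The order in which `strip_markdown` applies its link and image patterns affects the result, because the link pattern also matches the `[alt](url)` part of an image. The sketch below applies the same two regular expressions in both orders to a made-up sample string; `LINK`, `IMAGE`, and `sample` are names introduced here for illustration.

```python
import re

LINK = r'\[([^\]]+)\]\([^\)]+\)'      # replaced by the link text
IMAGE = r'!\[([^\]]*)\]\([^\)]+\)'    # removed entirely

sample = "See ![logo](logo.png) and [docs](https://example.com)."

# Links first: the image's bracketed part is consumed as a link,
# so the image pattern no longer has anything to match.
links_first = re.sub(IMAGE, '', re.sub(LINK, r'\1', sample))

# Images first: the image is removed before the link pattern runs.
images_first = re.sub(LINK, r'\1', re.sub(IMAGE, '', sample))

print(repr(links_first))
print(repr(images_first))
```

Only the images-first order drops the image completely. The double space it leaves behind is collapsed by the whitespace step at the end of `strip_markdown`.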

In [7]:
import difflib

# loop through input files and compare the -no-llm version to the LLM version of the output files
for filename in os.listdir(input_dir):
    if os.path.isfile(os.path.join(input_dir, filename)) and not filename.startswith('.') and not filename.endswith('.md'):
        output_path1 = os.path.join(output_dir, os.path.splitext(os.path.basename(filename))[0] + '.md')
        output_path2 = os.path.join(output_dir, os.path.splitext(os.path.basename(filename))[0] + '-no-llm.md')

        print(f"\n\nComparing {output_path1} to {output_path2}...")
        with open(output_path1, 'r') as f:
            md1 = f.read()
        with open(output_path2, 'r') as f:
            md2 = f.read()
        
        results = MarkdownDiffAnalyzer().compare_markdown_docs(md1, md2)
        
        print(f"Similarity ratio: {results['similarity_ratio']:.2%}")
        print("\nMissing from document 1 (LLM version):")
        for chunk in results['missing_from_doc1']:
            print(f"- {chunk}")

        print("\nMissing from document 2 (no-LLM version):")
        for chunk in results['missing_from_doc2']:
            print(f"- {chunk}")
        
        print("\nDifferences:")
        for line in difflib.unified_diff(md1.splitlines(), md2.splitlines(), fromfile='With LLM', tofile='Without LLM', lineterm=''):
            print(line)

/Users/crobert/Files/ai-workflows/inputs
.DS_Store
test2.pdf
Comparing test2.pdf...


Comparing /Users/crobert/Files/ai-workflows/outputs/test2.md to /Users/crobert/Files/ai-workflows/outputs/test2-no-llm.md...
Similarity ratio: 37.52%

Missing from document 1 (LLM version):
- Report by Khaled Diab, Communications Director Graphic design by Léa Moisan (moisan
- arkets bring must
- 6 mounth ago January 21 5 e
- nts organised 9 videos released on You
- 117 articles and publications Media outlets in 101 countries CARBON MARKET WATCH The human side of emissions trading The Emissions Trading System is one of the most important policy instruments in the EU’s climate armory, yet public awareness of and interest in the EU ETS is low
- Few non-specialists know what this instrument is, how it works, and how it affects people both directly and indirectly. In a bid to raise public awareness of the Emissions Trading System, both its strengths and its weaknesses, we, in collaboration with our partne