<a href="https://colab.research.google.com/github/m3wzz/very_fake/blob/main/Kim_LLMs_Post_OCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Language Models and Post-OCR Correction


Created by Sarah Oberbichler [ORCID](https://orcid.org/0000-0002-1031-2759)

Post-OCR correction addresses errors introduced when optical character recognition (OCR) converts scanned images into digital text. Common errors include character substitutions (e.g., "rn" misread as "m"), deletions, and formatting issues caused by poor image quality, unusual fonts, or degraded historical documents. Correction techniques range from dictionary-based validation and context-aware language models to large language models (LLMs), which can often reconstruct the intended wording from garbled text. This step matters for downstream applications such as text mining, digital archiving, and information retrieval, because even small error rates can noticeably degrade analysis quality and user experience.
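To make the dictionary-based end of that range concrete, here is a minimal sketch of lexicon validation. It is not part of the course pipeline: the tiny `LEXICON`, the function name `flag_suspect_tokens`, and the `cutoff` value are assumptions chosen for illustration, and a real workflow would load a full word list for the target language.

```python
import difflib
import re

# Toy lexicon (assumption): stands in for a real word list of the target language.
LEXICON = sorted({"the", "modern", "press", "reported", "on", "government"})

def flag_suspect_tokens(text, lexicon=LEXICON, cutoff=0.7):
    """Map each token that is missing from the lexicon to its closest lexicon entries."""
    suspects = {}
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones such as umlauts.
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        if token not in lexicon:
            suspects[token] = difflib.get_close_matches(token, lexicon, n=3, cutoff=cutoff)
    return suspects

# Hypothetical OCR output in which "rn" was misread as "m" twice.
sample = "The govemment reported on the modem press"
print(flag_suspect_tokens(sample))
```

On this toy input the sketch flags `govemment` and `modem` and suggests `government` and `modern`. It also shows the limit of the approach: "modem" is a valid word in any complete dictionary, so a pure lexicon check would miss that error, which is where context-aware models come in.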

In [None]:
!git clone https://github.com/soberbichler/NLP-Course4Humanities_2025.github.io.git

In [None]:
import pandas as pd

# Load the dataset from the cloned course repository
df = pd.read_excel('/content/NLP-Course4Humanities_2025.github.io/datasets/lügenpresse_dataset (1).xlsx')

# Preview the first rows
df.head()

In [None]:
# Keep only the first four rows so the test run makes few API calls
df = df[:4]

In [None]:
import pandas as pd
import requests
from google.colab import userdata

# API Config
api_url = "https://ki-chat.uni-mainz.de/api"
api_key = userdata.get('UNI-MAINZ')

def call_mainz_api(system_prompt, user_prompt, temperature=0.0, max_tokens=20000):
    """
    Call the University of Mainz chat-completions API with a system and a
    user prompt and return the text of the model's reply.
    """
    payload = {
        "model": "Qwen3 235B VL",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "temperature": temperature,
        "max_tokens": max_tokens
    }

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    response = requests.post(
        f"{api_url}/chat/completions",
        headers=headers,
        json=payload,
        timeout=600  # give up instead of hanging forever if the server never answers
    )

    if response.status_code != 200:
        raise RuntimeError(f"API Error: HTTP {response.status_code} - {response.text}")

    result = response.json()
    return result['choices'][0]['message']['content']


# Process the DataFrame
all_articles = []
for index, row in df.iterrows():
    try:
        # Make API call
        content = call_mainz_api(
            system_prompt="""You are an expert in post-OCR correction.""",
            user_prompt=f"""Please correct the OCR errors in the following newspaper text. If a word is not readable, mark it with a ?. Only make a correction if you are certain.
Text to analyze:
{row['context_small']}""",
            temperature=0.0,
            max_tokens=20000
        )

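        # Keep the row unless the reply is empty or contains the German sentinel
        # sentence below ("No articles found on the specified topic.")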
        if content and "Keine Artikel mit dem angegebenen Thema gefunden." not in content:
            new_row = row.to_dict()
            new_row['article_corrected'] = content.strip()
            all_articles.append(new_row)

        print(f"Processed row {index + 1}/{len(df)}")

    except Exception as e:
        print(f"Error processing row {index}: {e}")
        continue

# Create final DataFrame
result_2_df = pd.DataFrame(all_articles)

# Save to Excel
result_2_df.to_excel('test_2.xlsx', index=False)

# Display results
print(f"\nProcessed {len(result_2_df)} articles successfully")
print(result_2_df.head())

In [None]:
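# Inspect the first corrected article; .iloc[0] selects by position, so it
# still works if the row labelled 0 was skipped because of an API error
result_2_df['article_corrected'].iloc[0]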

In [None]:
# Save a second copy under a different file name (this one also writes the index column)
result_2_df.to_excel('test.xlsx')
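Because an LLM can also change words that were already correct, it helps to look at exactly what it altered. The following is a minimal sketch using Python's standard `difflib`. The function name `word_level_changes` and the two example strings are hypothetical; in the notebook they would correspond to a row's `context_small` and `article_corrected` values.

```python
import difflib

def word_level_changes(original, corrected):
    """Return (operation, original_words, corrected_words) for every span that differs."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is one of 'replace', 'delete', 'insert'
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

# Hypothetical strings standing in for row['context_small'] and row['article_corrected'].
ocr_text = "Die Regiernng hat die Prcsse kritisiert"
llm_text = "Die Regierung hat die Presse kritisiert"

for change in word_level_changes(ocr_text, llm_text):
    print(change)
```

A long list of changes, or changes in places where the OCR text looked fine, is a signal to read that article manually before using the corrected version.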