# LLM Spelling Corrector

This notebook takes preformatted text and sends it to the DeepSeek API, with prompts that instruct the model to perform contextual, word-candidate-aware spelling correction.
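
As a rough illustration of the word-candidate format, the sketch below parses slots of the form `[ORI: word, CAND: c1, c2]`, the notation used in the prompt's example input in the main cell. The names `SLOT_PATTERN` and `parse_slots` are hypothetical helpers that are not part of the notebook, and the pattern assumes that the original word contains no comma or closing bracket.

```python
import re

# Hypothetical helper, not part of the notebook.
# Matches one slot such as "[ORI: jang, CAND: jang, mang, bang, kang]".
# Assumption: the ORI word itself never contains "," or "]".
SLOT_PATTERN = re.compile(
    r"\[ORI:\s*(?P<ori>[^,\]]+),\s*CAND:\s*(?P<cands>[^\]]*)\]"
)


def parse_slots(block_text):
    """Return a list of (original_word, [candidates]) pairs found in block_text."""
    slots = []
    for match in SLOT_PATTERN.finditer(block_text):
        original = match.group("ori").strip()
        candidates = [
            c.strip() for c in match.group("cands").split(",") if c.strip()
        ]
        slots.append((original, candidates))
    return slots


sample = "[ORI: OLITIEK, CAND: politiek, politek] [ORI: jang, CAND: jang, mang, bang, kang]"
parsed = parse_slots(sample)
for original, candidates in parsed:
    print(original, candidates)
```

Text outside the brackets (punctuation, unbracketed words) is ignored by this sketch; the model itself receives the full block text, including those symbols.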

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
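
The main cell below reads, for every partition folder, the files whose names start with `.partition` (full text used for the summary) and the files named `block_<n>_….txt` (chunks that are corrected individually and re-joined by block number). The sketch below shows, under the assumption of shortened and purely hypothetical file names, why the block number is parsed as an integer before sorting: a plain string sort would place `block_10` before `block_2`. `ordered_block_files` is an illustrative helper, not a function from the notebook.

```python
import re

# Same pattern as in the main cell: block_<number>_<anything>.txt
BLOCK_PATTERN = re.compile(r"^block_(\d+)_.*\.txt$")


def ordered_block_files(file_names):
    """Return block file names sorted by their numeric block index."""
    numbered = []
    for name in file_names:
        match = BLOCK_PATTERN.match(name)
        if match:
            # Convert to int so that 10 sorts after 2.
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


# Hypothetical, shortened file names for illustration only.
names = [
    "block_10_page.jpg.txt",
    "block_2_page.jpg.txt",
    ".partition_1_page.jpg.txt",
    "block_1_page.jpg.txt",
]
ordered = ordered_block_files(names)
print(ordered)
```

Files that do not match the pattern, such as the `.partition` file, are simply skipped here; in the notebook they are handled by a separate loop.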


In [None]:
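import os
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from openai import OpenAI

# Assumption: the DeepSeek API key comes from the DEEPSEEK_API_KEY environment
# variable unless API_KEY was already defined in an earlier cell.
API_KEY = globals().get("API_KEY") or os.environ.get("DEEPSEEK_API_KEY", "")

MAX_WORKERS = 3          # concurrent block requests per partition
FOLDER_PARALLELISM = 3   # image folders processed concurrently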

def get_system_content_summary():
    return """
        make a summary of this text
    """

def get_user_content_summary(partition_text):
    return f"""
        {partition_text}
    """

def get_system_content_block(summary_result):
    return f"""
### CONTEXT

OCR was performed using Tesseract on old Indonesian newspapers from 1946-1947. The text mainly describes the latest events of the era and comments on them.

The text contains some errors introduced by the OCR process. A spelling correction attempt was made by suggesting correction candidates for each word using modern Indonesian, Dutch, and English dictionaries, and an orthographically reversed dictionary obtained by scraping Indonesian Wikipedia. Due to the nature of Wikipedia, some candidate words are misspelled, and some ground-truth words were not suggested as candidates because of limitations of the suggestion algorithm.

The corrected text needs to be as similar as possible to how the text was originally written, to preserve authenticity.

Later, the text will be used as a resource for NLP-based research on Indonesian and world history.


### ROLE


Your role is to perform spelling correction, evaluating by the criteria described later, given the correction candidates for each word and the summary of the article.



### CAPABILITIES



First, try to understand what the given summary of the larger text is about and how this text fits into that larger text, so that the correction is done with the general theme in mind.


Generally, you need to pick one word from each slot, which consists of the word as detected by the OCR engine (labeled ORI) and up to 10 correction candidates (labeled CAND).


As an exception, you are allowed to use your own candidate in two cases: when two adjacent words make more sense semantically if each is retained or modified and then combined rather than corrected individually, or when you are very certain that a word must be some other word. This holds even if it means omitting non-Latin symbols outside the brackets. Note that 'al' is very often a misspelling of 'di'.


Generally, you are not allowed to change symbols outside the brackets/slots unless you are very certain they are errors. If a question mark appears, you are allowed to judge whether it is a misread '2', which often represents repetition. For example, "orang2" means orang-orang in old Indonesian. An exception to this rule applies when considering the merging and modification of two adjacent words.


As for casing, whichever word is chosen, match the casing style of the original word, whether it is all lowercase, all uppercase, uppercase only on the first letter, or something else.


Do not favor formality over informality or abbreviations over full words. Also note that sk, jl, and jad were common abbreviations meaning "soerat kabar", "jang laloe", and "jang akan datang" respectively.


You can search during correction, but do not search to check whether a word is a proper old Indonesian word. Other search purposes are permissible.


IMPORTANT: Always recheck that you have correctly compiled all answers when outputting the final text. Answer only with the final corrected text and do not give any other explanation or correction notes of any kind.


### EVALUATION


Evaluation should consider each level of language analysis. Some levels have further details, described below.


1. Lexical level:

- prioritize the word in brackets (original/candidate) that most likely came from the era 1946-1947

- never assume that a casing style makes a word more likely to be a named entity

- when considering merging two adjacent words, you are allowed to omit all non-Latin entities and to consider slight modifications to each word so that the merged result makes the most sense

- place less priority on obscure words

- for names of entities, among options that already satisfy the rules above, prioritize the more common one

2. Syntactic level

3. Semantic level

- note that a word may have had a slightly different meaning in the past compared with modern usage

4. Discourse Integration

5. Sensibility Checking

- The text must represent either descriptive news or commentary on a plausible event of the past (around 1946-1947). Match the writing style to the theme, such as the persona of the person speaking.

- Candidate corrections should be ranked by how strongly they support reading the text as descriptive news or commentary on a plausible event of the past


### INPUT

[ORI: OLITIEK, CAND: politiek, politek] [ORI: jang, CAND: jang, mang, bang, kang] [ORI: dilaksanakan, CAND: dilaksanakan, dilaksanaman] [ORI: orang, CAND: orang, okang]? Be-'landa [ORI: jang, CAND: jang, mang, bang, kang] [ORI: bertempat, CAND: bertempat, berempat, berempat] [ORI: al, CAND: hal, mal, als] [ORI: Djakaria, CAND: djakarta, djakaria, djakarra].


### OUTPUT

POLITIEK jang dilaksanakan orang2 Belanda jang bertempat di Djakarta.

### SUMMARY

{summary_result}
    """

def get_user_content_block(block_text):
    return f"""
### TEXT

{block_text}
    """

def log(message):
    print(message)

def send_summary_request(partition_text) -> str:
    log("Sending summary request...")
    client = OpenAI(api_key=API_KEY, base_url="https://api.deepseek.com")

    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {"role": "system", "content": get_system_content_summary()},
            {"role": "user", "content": get_user_content_summary(partition_text)},
        ],
        stream=False
    )
    summary = response.choices[0].message.content
    log("Received summary.")
    return summary

def send_chunking_request(summary_result, block_text) -> str:
    log("Sending block request...")
    client = OpenAI(api_key=API_KEY, base_url="https://api.deepseek.com")

    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {"role": "system", "content": get_system_content_block(summary_result)},
            {"role": "user", "content": get_user_content_block(block_text)},
        ],
        stream=False
    )
    block_result = response.choices[0].message.content
    log("Received block result.")
    return block_result

def read_file(file_path):
    try:
        with open(file_path, 'r') as file:
            return file.read()
    except FileNotFoundError:
        log(f"[ERROR] File not found: {file_path}")
    except PermissionError:
        log(f"[ERROR] Permission denied: {file_path}")
    except IOError as e:
        log(f"[ERROR] Failed to read {file_path}: {e}")
    return ""  # Return empty string to prevent crash

def write_file(file_path, content):
    try:
        with open(file_path, 'w') as file:
            file.write(content)
    except PermissionError:
        log(f"[ERROR] Permission denied when writing: {file_path}")
    except IOError as e:
        log(f"[ERROR] Failed to write {file_path}: {e}")

def process_image_folder(image_folder_path, output_image_folder_path):
    all_partition_text_results = []

    for partition_folder in os.listdir(image_folder_path):
        partition_folder_path = os.path.join(image_folder_path, partition_folder)
        output_partition_folder_path = os.path.join(output_image_folder_path, partition_folder)
        os.makedirs(output_partition_folder_path, exist_ok=True)

        log(f"Processing partition: {partition_folder}")

        partition_text = ""
        summary_result = ""

        # Read .partition text
        for file_name in os.listdir(partition_folder_path):
            if file_name.startswith(".partition"):
                file_path = os.path.join(partition_folder_path, file_name)
                partition_content = read_file(file_path)
                partition_text += partition_content
                log(f"Processing .partition file: {file_name}")

        if not partition_text.strip():
            log("No partition text found. Skipping summary.")
            continue

        summary_result = send_summary_request(partition_text)
        write_file(os.path.join(output_partition_folder_path, "partition_summary.txt"), summary_result)

        # Process blocks in parallel
        block_pattern = re.compile(r'^block_(\d+)_.*\.txt$')
        block_files = [
            (int(re.match(block_pattern, f).group(1)), f)
            for f in os.listdir(partition_folder_path)
            if re.match(block_pattern, f)
        ]
        block_files.sort()

        results_by_block_number = {}

        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
            future_to_block = {}
            for block_number, block_file_name in block_files:
                block_file_path = os.path.join(partition_folder_path, block_file_name)
                block_content = read_file(block_file_path)
                if not block_content.strip():
                    log(f"[WARNING] Empty or unreadable block file: {block_file_path}")
                    continue
                future = executor.submit(send_chunking_request, summary_result, block_content)
                future_to_block[future] = (block_number, block_file_name)

            for future in as_completed(future_to_block):
                block_number, block_file_name = future_to_block[future]
                try:
                    api_result = future.result()
                except Exception as e:
                    log(f"[ERROR] Error processing block {block_file_name}: {e}")
                    continue
                block_result_path = os.path.join(output_partition_folder_path, f"block_result_{block_number}.txt")
                write_file(block_result_path, api_result)
                results_by_block_number[block_number] = api_result
                log(f"Processed block: {block_file_name}")

        # Append results in sorted order
        partition_api_results = [
            results_by_block_number[bn]
            for bn in sorted(results_by_block_number)
        ]

        partition_result_content = "\n\n".join(partition_api_results)
        write_file(os.path.join(output_partition_folder_path, "partition_final.txt"), partition_result_content)
        all_partition_text_results.append(partition_result_content)

        log("Partition processing complete.")

    # Write the combined result of all partitions for this image folder
    chunking_results_content = "\n\n".join(all_partition_text_results)
    write_file(os.path.join(output_image_folder_path, "result_final.txt"), chunking_results_content)

def main():
    INPUT_DIR = '/content/drive/MyDrive/tugas-akhir/korpus-teks/korpus-pre-llm/korpus-pre-llm-omdta-5-symspell-indoengdutch-not-processed'
    OUTPUT_DIR = '/content/drive/MyDrive/tugas-akhir/korpus-teks/korpus-terproses/korpus-terproses-omdta-5-symspell-indoengdutch-stopped'

    os.makedirs(OUTPUT_DIR, exist_ok=True)

    with ThreadPoolExecutor(max_workers=FOLDER_PARALLELISM) as executor:
        # Map each future to its folder name so the log reports the right folder.
        future_to_folder = {}
        for image_folder in os.listdir(INPUT_DIR):
            image_folder_path = os.path.join(INPUT_DIR, image_folder)
            output_image_folder_path = os.path.join(OUTPUT_DIR, image_folder)
            os.makedirs(output_image_folder_path, exist_ok=True)
            future = executor.submit(process_image_folder, image_folder_path, output_image_folder_path)
            future_to_folder[future] = image_folder

        for future in as_completed(future_to_folder):
            image_folder = future_to_folder[future]
            try:
                future.result()
                log(f"Folder {image_folder} processed successfully.")
            except Exception as e:
                log(f"[ERROR] Exception while processing image folder {image_folder}: {e}")

    """
    for image_folder in os.listdir(INPUT_DIR):
        image_folder_path = os.path.join(INPUT_DIR, image_folder)
        output_image_folder_path = os.path.join(OUTPUT_DIR, image_folder)
        os.makedirs(output_image_folder_path, exist_ok=True)

        log(f"Processing image folder: {image_folder}")
        process_image_folder(image_folder_path, output_image_folder_path)
    """

if __name__ == "__main__":
    main()

Processing partition: 1
Processing partition: 1
Processing partition: 1
Processing .partition file: .partition_1_4b5ab62a-SOEARA_OEMOEM_1947_01_09_001_page-0002.jpg.txt
Sending summary request...
Processing .partition file: .partition_1_6ed84bbb-SOEARA_OEMOEM_1947_01_06_001_page-0001.jpg.txt
Sending summary request...
Processing .partition file: .partition_1_5ea7d699-SOEARA_OEMOEM_1947_01_03_001_page-0001.jpg.txt
Sending summary request...
Received summary.
Sending block request...
Received summary.
Sending block request...
Sending block request...
Received summary.
Sending block request...
Sending block request...
Sending block request...
Received block result.
Processed block: block_1_6ed84bbb-SOEARA_OEMOEM_1947_01_06_001_page-0001.jpg.txt
Partition processing complete.
Processing partition: 2
Processing .partition file: .partition_2_6ed84bbb-SOEARA_OEMOEM_1947_01_06_001_page-0001.jpg.txt
Sending summary request...
Received block result.
Sending block request...
Processed block: bloc