# Experiment 2: OCR Pipeline (PaddleOCR + LLM)

## Objective
Implement an OCR + LLM pipeline with standardized output, inference-time measurement, and computation of an evaluation metric (CER).

- **Stage 1 (OCR):** PaddleOCR untuk ekstraksi teks.
- **Stage 2 (LLM):** Ollama (LLaVA) untuk structuring/refining.
- **Metrics:** Inference Time, Character Error Rate (CER).
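
For reference, CER is conventionally the character-level edit distance between the OCR output and the ground truth, normalized by the ground-truth length. The symbols below are introduced here for illustration only: $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted characters in a minimal edit script, and $N$ is the number of characters in the reference.

$$
\mathrm{CER} = \frac{S + D + I}{N}
$$

Because insertions are counted in the numerator but not in $N$, the value can exceed 100% when the OCR output is much longer than the reference.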

## Output Format
```python
{
    'time': float,        # Seconds
    'text': str,          # Final result (JSON/text)
    'image': np.ndarray   # Input image (optional, for visualization)
}
```
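
One way to produce a record of this shape is sketched below. The names `PipelineOutput` and `run_timed` are hypothetical and do not appear in the notebook; the image is typed loosely as `Any` so that the sketch does not depend on NumPy.

```python
import time
from typing import Any, Callable, Optional, TypedDict


class PipelineOutput(TypedDict):
    time: float            # elapsed seconds
    text: str              # final text or JSON string
    image: Optional[Any]   # input image, kept only when visualization is wanted


def run_timed(pipeline: Callable[[Any], str], image: Any,
              keep_image: bool = False) -> PipelineOutput:
    """Run `pipeline` on `image` and wrap the result in the standardized record."""
    start = time.perf_counter()
    text = pipeline(image)
    elapsed = time.perf_counter() - start
    return {"time": elapsed, "text": text, "image": image if keep_image else None}


# Usage with a stand-in pipeline that ignores its input
record = run_timed(lambda img: "recognized text", image=[[0, 0], [0, 0]])
print(sorted(record))
```

Dropping the image by default keeps the result list small when many files are processed.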

In [3]:
import os
import cv2
import time
import numpy as np
import ollama
from paddleocr import PaddleOCR
from pathlib import Path

# --- Utility: Character Error Rate (CER) Calculation ---
def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

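def calculate_cer(reference, hypothesis):
    """Return the character error rate of `hypothesis` against `reference`."""
    if not reference:
        # No ground truth available: CER is undefined, so 0.0 is only a placeholder
        return 0.0
    # Collapse runs of whitespace for a fairer comparison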
    ref = " ".join(reference.split())
    hyp = " ".join(hypothesis.split())
    dist = levenshtein_distance(ref, hyp)
    return dist / len(ref)



!pip show paddleocr
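
As a sanity check of the CER utility defined above, the sketch below re-implements the same computation in compact form and runs it on inputs whose edit distance is easy to verify by hand. The names `edit_distance` and `cer` are hypothetical. Unlike the notebook's `calculate_cer`, this version raises on an empty reference instead of returning 0.0, which is a deliberate difference.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length, after collapsing whitespace."""
    ref = " ".join(reference.split())
    hyp = " ".join(hypothesis.split())
    if not ref:
        raise ValueError("reference is empty; CER is undefined")
    return edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))             # 3 edits / 6 chars -> 0.5
print(cer("hello   world", "hello world"))  # whitespace is normalized -> 0.0
print(cer("ab", "abcdef"))                  # 4 insertions / 2 chars -> 2.0
```

The last case shows that CER is not bounded by 1.0 when the hypothesis is longer than the reference.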
### 1. Setup & Initialization

In [4]:
# Init OCR Engine (Load Model)
print("Initializing PaddleOCR...")
# enable_mkldnn=False prevents Windows-specific AVX crashes
ocr = PaddleOCR(lang='en', enable_mkldnn=False, use_angle_cls=True)

# Setup Path
import glob
DATASET_DIR = r'c:\projekdosen\tutoring\Agentic Multimodal Tutor - SLL\playwithOCR\dataset\test'
IMAGES_DIR = os.path.join(DATASET_DIR, 'images')
GT_DIR = os.path.join(DATASET_DIR, 'gt')
print(f"Dataset Dir: {DATASET_DIR}")

# --- GROUND TRUTH UTILITY ---
def read_ground_truth(filename_base):
    gt_path = os.path.join(GT_DIR, f"{filename_base}.txt")
    if os.path.exists(gt_path):
        with open(gt_path, 'r', encoding='utf-8') as f:
            return f.read().strip()
    return ""

Initializing PaddleOCR...


Creating model: ('PP-LCNet_x1_0_doc_ori', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `C:\Users\NUC\.paddlex\official_models\PP-LCNet_x1_0_doc_ori`.
Creating model: ('UVDoc', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `C:\Users\NUC\.paddlex\official_models\UVDoc`.
Creating model: ('PP-LCNet_x1_0_textline_ori', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `C:\Users\NUC\.paddlex\official_models\PP-LCNet_x1_0_textline_ori`.
Creating model: ('PP-OCRv5_server_det', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `C:\Users\NUC\.paddlex\official_models\PP-OCRv5_server_det`.
Creating model: ('en_PP-OCRv5_mobil

Dataset Dir: c:\projekdosen\tutoring\Agentic Multimodal Tutor - SLL\playwithOCR\dataset\test
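
Because `read_ground_truth` returns an empty string when a `.txt` file is missing, and `calculate_cer` then reports 0.0, a missing ground-truth file would silently lower the average CER. A minimal pre-flight check is sketched below, assuming the same `images/` and `gt/` layout used above. The helper name `find_missing_ground_truth` and the suffix set are hypothetical.

```python
from pathlib import Path

# Assumed set of image extensions, matching the glob patterns used in the pipeline
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_missing_ground_truth(images_dir, gt_dir):
    """Return the names of images that have no matching <stem>.txt in gt_dir."""
    images_dir, gt_dir = Path(images_dir), Path(gt_dir)
    missing = []
    for image_path in sorted(images_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        if not (gt_dir / f"{image_path.stem}.txt").is_file():
            missing.append(image_path.name)
    return missing
```

Calling `find_missing_ground_truth(IMAGES_DIR, GT_DIR)` before the main loop and stopping when the list is non-empty would keep placeholder scores out of the summary.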


### 2. Pipeline Execution (Timed)

**Prompting Technique:**  
We use **zero-shot instruction prompting**. The prompt gives the model (LLaVA) a direct instruction to convert the raw OCR text into a specific JSON format, without providing any worked examples (shots).

**Prompt (abridged, translated from Indonesian):**
> "The following is raw text... Please tidy this data into a valid JSON format..."
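
The zero-shot approach described above can be sketched as two small helpers: one that fills the instruction template, and one that tries to recover a JSON object from the model's reply, since chat models often wrap JSON in extra prose or code fences. Both `build_prompt` and `parse_llm_json` are hypothetical names, and the template mirrors the wording of the notebook's Indonesian prompt rather than reproducing it exactly.

```python
import json
import re
import textwrap

# Instruction template; {raw_text} is filled with the OCR output
PROMPT_TEMPLATE = textwrap.dedent("""\
    Berikut adalah teks mentah hasil OCR:

    {raw_text}

    Strukturkan teks di atas menjadi format JSON.
    Expected Key: 'tasks' (list of objects with 'task_name', 'status', 'notes').
    OUTPUT JSON ONLY.
    """)


def build_prompt(raw_text: str) -> str:
    """Insert the OCR text into the zero-shot instruction template."""
    return PROMPT_TEMPLATE.format(raw_text=raw_text)


def parse_llm_json(reply: str):
    """Parse the span from the first '{' to the last '}' as JSON, or return None."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


print(build_prompt("Task A: done"))
print(parse_llm_json('Here you go: {"tasks": []} Hope this helps.'))
```

Returning `None` on failure lets the caller count how often the model produced unparseable output, a figure the notebook's current metrics do not capture.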

In [5]:
image_files = glob.glob(os.path.join(IMAGES_DIR, "*.jpg")) + glob.glob(os.path.join(IMAGES_DIR, "*.png")) + glob.glob(os.path.join(IMAGES_DIR, "*.jpeg"))
print(f"Found {len(image_files)} images.")

results = []

# Ensure OCR is initialized correctly
if 'ocr' not in locals():
    ocr = PaddleOCR(lang='en', enable_mkldnn=False, use_angle_cls=True)

for image_path in image_files:
    filename = os.path.basename(image_path)
    filename_base = os.path.splitext(filename)[0]
    ground_truth_text = read_ground_truth(filename_base)
    print(f"\nProcessing: {filename}...")
    
    start_time = time.time()
    
    # --- STAGE 1: OCR ---
    # Pass path directly to avoid CV2 nuances
    ocr_result = ocr.predict(image_path)

    extracted_lines = []

    if ocr_result and len(ocr_result) > 0:
        extracted_lines = ocr_result[0].get("rec_texts", [])

    raw_text = "\n".join(extracted_lines)
    
    # --- STAGE 2: LLM ---
    final_text_output = ""
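    if raw_text.strip():
        # The prompt is written in Indonesian. In English: "The following is raw
        # OCR text: ... Structure the text above as JSON."
        prompt_content = f"""
        Berikut adalah teks mentah hasil OCR:\n
        {raw_text}\n
        \n
        Strukturkan teks di atas menjadi format JSON.\n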
        Expected Key: 'tasks' (list of objects with 'task_name', 'status', 'notes').\n
        OUTPUT JSON ONLY.\n
        """
        try:
            response = ollama.chat(
                model='llava',
                messages=[{'role': 'user', 'content': prompt_content}]
            )
            # Subscript access works for plain dict responses and for the
            # response objects returned by newer ollama clients, so an
            # isinstance(dict) check would wrongly fall through to str(response).
            final_text_output = response['message']['content']
        except Exception as e:
            print(f"LLM Error: {e}")
    
    end_time = time.time()
    inference_time = end_time - start_time
    
    cer_score = calculate_cer(ground_truth_text, raw_text)
    print(f"  OCR Length: {len(raw_text)} chars | CER: {cer_score:.2%} | Time: {inference_time:.2f}s")
    
    results.append({
        'filename': filename,
        'time': inference_time,
        'cer': cer_score,
        'raw_text': raw_text,
        'final_json': final_text_output
    })

Found 11 images.

Processing: if4908_103012500097_nomor1.jpg...


[2026-01-14 16:32:00,799] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 146 chars | CER: 34.95% | Time: 81.93s

Processing: if4908_103012500098_nomor1.jpg...


[2026-01-14 16:32:58,924] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 139 chars | CER: 47.31% | Time: 58.07s

Processing: if4908_103012500281_nomor1.jpg...


[2026-01-14 16:34:59,181] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 232 chars | CER: 93.71% | Time: 120.21s

Processing: if4908_103012500305_nomor1.jpg...


[2026-01-14 16:35:28,301] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 132 chars | CER: 58.70% | Time: 29.03s

Processing: if4908_103012500322_nomor1.jpg...


[2026-01-14 16:37:29,990] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 214 chars | CER: 38.33% | Time: 121.67s

Processing: if4908_103012530052_nomor1.jpg...


[2026-01-14 16:38:39,224] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 242 chars | CER: 78.29% | Time: 69.21s

Processing: if4910_103012500004_nomor1.jpg...


[2026-01-14 16:39:20,455] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 200 chars | CER: 23.08% | Time: 41.17s

Processing: if4911_103012500384_nomor1.jpg...


[2026-01-14 16:40:00,648] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 197 chars | CER: 30.60% | Time: 40.18s

Processing: if4910_103012500367_nomor1.png...


[2026-01-14 16:41:08,405] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 309 chars | CER: 25.85% | Time: 67.74s

Processing: if4909_103012500132_nomor1.jpeg...


[2026-01-14 16:43:20,156] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 185 chars | CER: 40.85% | Time: 131.53s

Processing: if4909_103012530074_nomor1.jpeg...


[2026-01-14 16:44:48,952] [    INFO] _client.py:1025 - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


  OCR Length: 94 chars | CER: 69.89% | Time: 88.71s
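
The times logged above span both stages, so they do not show how much of each run is spent in OCR and how much in the LLM call. A minimal sketch of per-stage timing with a context manager follows. `stage_timer` is a hypothetical helper, and the two timed bodies are stand-ins for `ocr.predict(image_path)` and `ollama.chat(...)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timings, stage):
    """Record the wall-clock duration of the enclosed block under timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


timings = {}
with stage_timer(timings, "ocr"):
    raw_text = "placeholder OCR text"  # stand-in for ocr.predict(image_path)
with stage_timer(timings, "llm"):
    reply = raw_text.upper()           # stand-in for ollama.chat(...)
timings["total"] = timings["ocr"] + timings["llm"]

print({stage: f"{seconds:.6f}s" for stage, seconds in timings.items()})
```

Storing `timings["ocr"]` and `timings["llm"]` as separate columns in `results` would allow the summary to report each stage independently.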


### 3. Output & Evaluation

In [6]:
print("\n=== SUMMARY ===")
import pandas as pd
df = pd.DataFrame(results)
if not df.empty:
    print(f"Average Time: {df['time'].mean():.4f}s")
    print(f"Average CER: {df['cer'].mean():.2%}")
    # Write the file first so the message is only printed after a successful export
    df.to_csv('exp2_results.csv', index=False)
    print("\nDetailed Results exported to 'exp2_results.csv'")
else:
    print("No results to show.")


=== SUMMARY ===
Average Time: 77.2224s
Average CER: 49.23%

Detailed Results exported to 'exp2_results.csv'
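
The per-image values vary widely, so the mean alone says little about the spread. The sketch below recomputes the summary from the per-image figures printed in the log above and adds the median and range. It uses only the standard library, and the `runs` list is transcribed by hand from that log rather than read from `exp2_results.csv`.

```python
import statistics

# (time in seconds, CER in percent), transcribed from the per-image log lines
runs = [
    (81.93, 34.95), (58.07, 47.31), (120.21, 93.71), (29.03, 58.70),
    (121.67, 38.33), (69.21, 78.29), (41.17, 23.08), (40.18, 30.60),
    (67.74, 25.85), (131.53, 40.85), (88.71, 69.89),
]
times = [t for t, _ in runs]
cers = [c for _, c in runs]

summary = {
    "n": len(runs),
    "mean_time_s": round(statistics.mean(times), 2),
    "median_time_s": statistics.median(times),
    "min_time_s": min(times),
    "max_time_s": max(times),
    "mean_cer_pct": round(statistics.mean(cers), 2),
    "median_cer_pct": statistics.median(cers),
    "min_cer_pct": min(cers),
    "max_cer_pct": max(cers),
}
for key, value in summary.items():
    print(f"{key}: {value}")
```

The recomputed means agree with the reported averages to two decimal places. The median time (69.21 s) and median CER (40.85%) both fall below their means, which indicates that a few slow or poorly recognized images pull the averages up.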
