## Step 1 - Environment Setup

Before running the OCR workflow, we need to install all required packages.

This includes:
- PaddleOCR-VL for visual-language text extraction  
- PaddlePaddle GPU runtime (CUDA 12.6)  
- Utilities for PDF-to-image conversion and file handling

Once complete, restart the Colab runtime if prompted.

In [1]:
# (Re)start runtime after installing, if Colab prompts.

# System deps for PDF -> image conversion
!apt-get -y update && apt-get -y install poppler-utils

# Core OCR stack (doc-parser flavor includes PaddleOCR-VL pipeline)
# These installs follow the official docs; a Colab T4 works but may be slow.
!pip -q install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
!pip -q install -U "paddleocr[doc-parser]"

# Some builds require a nightly safetensors wheel per docs; install if needed on your GPU:
!pip -q install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl

# PDF + image utils
!pip -q install pdf2image PyPDF2 pillow tqdm


Hit:1 https://cli.github.com/packages stable InRelease
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:10 https://ppa.launchpadcontent.net/graph

## Step 1.1 — Clean Environment & Install Dependencies

We start by preparing a fresh Colab environment for the PaddleOCR-VL pipeline:

1. Check GPU and CUDA version using `nvidia-smi`  
2. Remove any existing PyTorch / NVIDIA CUDA packages to prevent library conflicts  
3. Install `poppler-utils` for PDF → image conversion  
4. Install PaddlePaddle GPU v3.2.0 (choose the index that matches your CUDA version)  
5. Install PaddleOCR (doc-parser flavor, which includes PaddleOCR-VL)  
6. Add helper libraries – `pdf2image`, `PyPDF2`, `pillow`, `tqdm`

After installation, Colab may prompt to restart the runtime — accept it before continuing.
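
As a rough illustration of step 4, the sketch below reads the `CUDA Version` field from `nvidia-smi` output and looks it up in a small table. The helper names `cuda_version_from_smi` and `paddle_index_for` are hypothetical, and the table holds only the two index URLs that appear in this notebook.

```python
import re

# Assumption: only the two wheel indexes named in this notebook are listed here.
PADDLE_INDEXES = {
    "12.1": "https://www.paddlepaddle.org.cn/packages/stable/cu121/",
    "12.6": "https://www.paddlepaddle.org.cn/packages/stable/cu126/",
}

def cuda_version_from_smi(smi_text):
    """Return the 'CUDA Version: X.Y' value from nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_text)
    return match.group(1) if match else None

def paddle_index_for(cuda_version):
    """Return the matching index URL, or None if this table has no entry."""
    return PADDLE_INDEXES.get(cuda_version)

# Header line in the shape nvidia-smi prints it.
header = "| NVIDIA-SMI 550.54.15   Driver Version: 550.54.15   CUDA Version: 12.4 |"
version = cuda_version_from_smi(header)
print(version, paddle_index_for(version))
```

When the reported version has no exact entry, the lookup returns `None` and the index has to be chosen by hand; this notebook installs from the `cu126` index.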

In [2]:
# STEP 1 (clean + install) — Colab
# 0) Optional: see your GPU + CUDA version
!nvidia-smi

# 1) Remove PyTorch (and friends) so CUDA libs don't conflict with Paddle
!pip -q uninstall -y torch torchvision torchaudio xformers
!pip -q uninstall -y nvidia-cublas-cu12 nvidia-cuda-cupti-cu12 nvidia-cuda-runtime-cu12 \
  nvidia-cudnn-cu12 nvidia-cufft-cu12 nvidia-curand-cu12 nvidia-cusolver-cu12 \
  nvidia-cusparselt-cu12 nvidia-nccl-cu12 nvidia-nvjitlink-cu12

# 2) System deps for PDF→image
!apt-get -y update && apt-get -y install poppler-utils

# 3) Install Paddle GPU (choose the index that matches your CUDA shown by nvidia-smi)
#    If nvidia-smi says CUDA Version 12.1 → use cu121; if it says 12.6 → use cu126.
#    (Run ONLY ONE of the following two lines.)

# For CUDA 12.1:
# !pip -q install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu121/

# For CUDA 12.6:
!pip -q install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

# 4) PaddleOCR doc-parser (includes PaddleOCR-VL pipeline)
!pip -q install -U "paddleocr[doc-parser]"

# 5) PDF + image utils
!pip -q install pdf2image PyPDF2 pillow tqdm


Fri Oct 24 02:34:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   60C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Step 2 — Verify Installation and Environment

Before running OCR, let’s confirm that everything installed correctly:

- PaddlePaddle is available and compiled with CUDA  
- cuDNN version is detected  
- PaddleOCR-VL imports successfully  
- The model initializes without running inference

If `compiled_with_cuda` is `True` and both the import and init checks print “OK,” you’re ready to proceed to the next step.

In [3]:
# Sanity check: Paddle + GPU + PaddleOCR-VL import
import paddle
print("paddle:", paddle.__version__)
print("compiled_with_cuda:", paddle.device.is_compiled_with_cuda())
try:
    print("cudnn_version:", paddle.get_cudnn_version())
except Exception as e:
    print("cudnn_version: <unavailable>", e)

from paddleocr import PaddleOCRVL
print("PaddleOCRVL import: OK")

# quick init (no inference yet)
_ = PaddleOCRVL()
print("PaddleOCRVL init: OK")




paddle: 3.2.0
compiled_with_cuda: True
cudnn_version: 90501


Creating model: ('PP-DocLayoutV2', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/root/.paddlex/official_models/PP-DocLayoutV2`.


PaddleOCRVL import: OK


Creating model: ('PaddleOCR-VL-0.9B', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/root/.paddlex/official_models/PaddleOCR-VL`.
Loading configuration file /root/.paddlex/official_models/PaddleOCR-VL/config.json
Loading weights file /root/.paddlex/official_models/PaddleOCR-VL/model.safetensors
use GQA - num_heads: 16- num_key_value_heads: 2

PaddleOCRVL init: OK


## Step 3 — Import libraries

In [4]:
from pathlib import Path
import json, os, shutil
from tqdm import tqdm
from pdf2image import convert_from_path
from PyPDF2 import PdfReader
from paddleocr import PaddleOCRVL


## Step 4 - Upload the PDF to Colab and get its filename

In [5]:
from google.colab import files

print("Upload your TDF Cricket Rule Book PDF...")
uploaded = files.upload()  # Choose your PDF
pdf_path = Path(list(uploaded.keys())[0]).resolve()
print(f"Uploaded: {pdf_path.name}")


Upload your TDF Cricket Rule Book PDF...


Saving TDF Cricket - 2025 Rule Book_v2.pdf to TDF Cricket - 2025 Rule Book_v2.pdf
Uploaded: TDF Cricket - 2025 Rule Book_v2.pdf


## Step 5 — Count and show number of pages in the PDF

In [6]:
reader = PdfReader(str(pdf_path))
num_pages = len(reader.pages)
print(f"Page count: {num_pages}")


Page count: 19


## Step 6 — Initialize PaddleOCR-VL in Low-VRAM Mode

This step sets conservative GPU memory flags and initializes the PaddleOCR-VL pipeline
without loading the heavy layout-detection model.  

Key settings:
- Uses **`float16`** to reduce model load size  
- Disables layout detection at init (`use_layout_detection=False`)  
- Frees VRAM for the later OCR inference stage  

When you see “PaddleOCR-VL pipeline initialized (layout disabled at init)” the model is ready.
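
One detail worth checking is ordering: environment flags generally need to be in place before the library that reads them is first imported. The sketch below is a hypothetical helper (`set_env_flags` is not part of Paddle) that sets the flags and reports whether the target module had already been imported. It assumes Paddle picks up `FLAGS_*` variables at import time.

```python
import os
import sys

def set_env_flags(flags, module_name="paddle"):
    """Set environment flags; return True if module_name is not yet imported.

    A False return means the module was imported before the flags were set,
    so it may not see them until the runtime is restarted.
    """
    already_imported = module_name in sys.modules
    os.environ.update(flags)
    return not already_imported

in_time = set_env_flags({
    "FLAGS_allocator_strategy": "naive_best_fit",
    "NVIDIA_TF32_OVERRIDE": "0",
})
if not in_time:
    print("paddle is already imported; restart the runtime for the flags to apply.")
```

In a session where Step 2 has already run, `paddle` is in `sys.modules`, so the helper would print the restart hint.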

In [7]:
# Step 6 (low-VRAM init)
import os
# Set the flags before importing Paddle. If paddle was already imported in this
# session (e.g. in Step 2), they may not apply until the runtime is restarted.
os.environ["FLAGS_allocator_strategy"] = "naive_best_fit"  # conservative GPU allocation
os.environ["NVIDIA_TF32_OVERRIDE"] = "0"                   # disable TF32 math

import paddle
paddle.set_default_dtype('float16')  # steer loads toward fp16 where supported

from paddleocr import PaddleOCRVL

# ⚠️ Key: DO NOT spin up the heavy layout model at init.
# Some builds load PP-DocLayoutV2 by default; force-disable it here:
pipeline = PaddleOCRVL(
    use_layout_detection=False  # keep stage-1 off during initialization
)

print("PaddleOCR-VL pipeline initialized (layout disabled at init).")


Creating model: ('PaddleOCR-VL-0.9B', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/root/.paddlex/official_models/PaddleOCR-VL`.
Loading configuration file /root/.paddlex/official_models/PaddleOCR-VL/config.json
Loading weights file /root/.paddlex/official_models/PaddleOCR-VL/model.safetensors
use GQA - num_heads: 16- num_key_value_heads: 2

PaddleOCR-VL pipeline initialized (layout disabled at init).


## Step 7 - Workspace Cleanup & PDF-to-Image Rendering

We’ll start fresh by cleaning previous outputs and converting the uploaded
TDF Cricket Rule Book PDF into 19 PNG page images.  
A manifest file is also created so the next OCR step knows exactly which pages to read.

In [9]:
# CLEAN old outputs from prior runs
!rm -rf tdf_cricket_rules_work/pages/*
!rm -rf tdf_cricket_rules_work/page_json/*
!rm -rf tdf_cricket_rules_work/page_md/*

# Re-render ONLY the current PDF to images and save a manifest
from pathlib import Path
from pdf2image import convert_from_path

work_dir = Path("tdf_cricket_rules_work")
img_dir = work_dir / "pages"
img_dir.mkdir(parents=True, exist_ok=True)

images = convert_from_path(str(pdf_path), dpi=150, fmt="png",
                           output_folder=str(img_dir), paths_only=True)
print("Rendered", len(images), "images")

# Save a manifest for Step 8 to read (one path per line)
man_path = work_dir / "images_manifest.txt"
man_path.write_text("\n".join(images), encoding="utf-8")
print("Wrote manifest to:", man_path)

Rendered 19 images
Wrote manifest to: tdf_cricket_rules_work/images_manifest.txt


## Step 8 — Load Image Manifest for OCR

Now that we’ve re-rendered the PDF, this step reads the `images_manifest.txt`
to make sure the OCR pipeline processes only the **current 19 page images**  
and ignores any old or leftover files from previous runs.
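
The manifest round trip can be sketched in isolation. The helpers `write_manifest` and `read_manifest` below are hypothetical and use only the standard library; the reader also reports listed paths that no longer exist, which the notebook cell does not do.

```python
from pathlib import Path
import tempfile

def write_manifest(paths, manifest_path):
    """Write one image path per line."""
    Path(manifest_path).write_text("\n".join(str(p) for p in paths), encoding="utf-8")

def read_manifest(manifest_path):
    """Return (existing, missing) lists of Paths listed in the manifest."""
    lines = Path(manifest_path).read_text(encoding="utf-8").splitlines()
    paths = [Path(line.strip()) for line in lines if line.strip()]
    existing = [p for p in paths if p.exists()]
    missing = [p for p in paths if not p.exists()]
    return existing, missing

with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page-01.png"
    page.write_bytes(b"")                  # stand-in for a rendered page
    ghost = Path(tmp) / "page-02.png"      # listed but never created
    manifest = Path(tmp) / "images_manifest.txt"
    write_manifest([page, ghost], manifest)
    existing, missing = read_manifest(manifest)
    print(len(existing), "found,", len(missing), "missing")
```

Checking for missing files at this point surfaces a stale manifest before the slower OCR pass starts.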

In [10]:
# Use the manifest so we only process images from THIS render
from pathlib import Path
work_dir = Path("tdf_cricket_rules_work")
man_path = work_dir / "images_manifest.txt"
images = [Path(p) for p in man_path.read_text(encoding="utf-8").splitlines() if p.strip()]
num_pages = len(images)
print(f"Found {num_pages} page images from manifest.")

Found 19 page images from manifest.


## Step 9 - Resilient OCR Pass

We now process the 19 rendered pages with safeguards:
- Prevent GPU out-of-memory by resizing large pages
- Skip already-completed pages for fast reruns
- Write structured outputs per page (Markdown + JSON)

After this step, all pages should have corresponding files for aggregation.
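
The resizing safeguard comes down to one piece of arithmetic: to bring a `w x h` image under a pixel cap while keeping its aspect ratio, scale both sides by `sqrt(cap / (w * h))`. The sketch below shows only that calculation; `capped_size` is a hypothetical name, and the 1275 x 1650 page size is an assumed example (US Letter at 150 dpi), not a value read from the PDF.

```python
import math

def capped_size(width, height, cap_pixels):
    """Return (w, h) scaled so w * h <= cap_pixels, keeping the aspect ratio."""
    if width * height <= cap_pixels:
        return width, height
    scale = math.sqrt(cap_pixels / (width * height))
    # int() rounds down, so the product cannot exceed the cap
    return max(1, int(width * scale)), max(1, int(height * scale))

# Assumed example page: US Letter at 150 dpi is 1275 x 1650 pixels.
for cap in (700_000, 500_000, 350_000, 250_000):
    w, h = capped_size(1275, 1650, cap)
    print(cap, (w, h), w * h)
```

The helper in the notebook cell additionally keeps each side at 400 pixels or more, which this sketch leaves out.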

In [12]:
# --- RESUMABLE + DEFENSIVE STEP 9 ---
import os, gc, time, signal, json, shutil
from pathlib import Path
from PIL import Image
import paddle

# ensure unbuffered prints
os.environ["PYTHONUNBUFFERED"] = "1"

# dirs from earlier steps
work_dir = Path("tdf_cricket_rules_work")
img_dir = work_dir / "pages"
page_json_dir = work_dir / "page_json"
page_md_dir = work_dir / "page_md"
aggregate_dir = Path("tdf_ocr_aggregate")
for d in [img_dir, page_json_dir, page_md_dir, aggregate_dir]:
    d.mkdir(parents=True, exist_ok=True)

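# use the page list from the manifest so leftover or temporary PNGs in img_dir are ignored
man_path = work_dir / "images_manifest.txt"
images = [Path(p) for p in man_path.read_text(encoding="utf-8").splitlines() if p.strip()]
num_pages = len(images)
print(f"Found {num_pages} page images.")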

def stem_for(i): return f"page-{i:04d}"

# pages already processed (have MD+JSON)
done_pages = {int(p.stem.split("-")[-1]) for p in page_md_dir.glob("page-*.md")}
print(f"Already processed pages (md present): {sorted(done_pages)[:10]}{' ...' if len(done_pages)>10 else ''}")

# --- Hard-resize helper (actual downscale) ---
def resize_to_cap(src_path, cap_pixels):
    img = Image.open(src_path).convert("RGB")
    w, h = img.size
    cur = w * h
    tmp_path = str(src_path) + f".tmp_{cap_pixels}.png"
    if cur <= cap_pixels:
        img.save(tmp_path, format="PNG", optimize=True)
        return tmp_path
    scale = (cap_pixels / float(cur)) ** 0.5
    new_w = max(400, int(w * scale))
    new_h = max(400, int(h * scale))
    img = img.resize((new_w, new_h), Image.BICUBIC)
    img.save(tmp_path, format="PNG", optimize=True)
    return tmp_path

# --- Timeout alarm (Unix) ---
class Timeout(Exception): pass
def _handler(signum, frame): raise Timeout()
signal.signal(signal.SIGALRM, _handler)

# caps + timeout
PIXEL_CAPS = (700_000, 500_000, 350_000, 250_000)   # progressively smaller
PAGE_TIMEOUT_S = 150                                 # per attempt
GEN_TOKENS = 128                                     # smaller decode window → less VRAM; raise it if pages come back cut off mid-sentence

def infer_one_image(img_path, pipeline):
    last_err = None
    for cap in PIXEL_CAPS:
        tmp = None
        try:
            tmp = resize_to_cap(img_path, cap)
            signal.alarm(PAGE_TIMEOUT_S)
            out = pipeline.predict(
                tmp,
                use_layout_detection=False,
                use_chart_recognition=False,
                max_new_tokens=GEN_TOKENS,
                # you can also set: temperature=0.2, top_p=0.9, repetition_penalty=1.1
            )
            signal.alarm(0)
            return out
        except Timeout:
            last_err = Timeout(f"timeout>{PAGE_TIMEOUT_S}s at cap={cap}")
        except Exception as e:
            last_err = e
            msg = str(e).lower()
            if not ("out of memory" in msg or "resourceexhausted" in msg or ("cuda" in msg and "alloc" in msg)):
                # non-OOM error → stop retrying caps; raise
                raise
        finally:
            signal.alarm(0)  # always cancel a pending alarm so it cannot fire later
            try:
                if tmp and os.path.exists(tmp): os.remove(tmp)
            except OSError:
                pass
            paddle.device.cuda.empty_cache()
            gc.collect()
    # all caps failed
    raise last_err if last_err else RuntimeError("unknown error")

# process remaining pages
start = time.time()
for idx, img_path in enumerate(images, start=1):
    stem = stem_for(idx)
    if (idx in done_pages) and (page_json_dir / f"{stem}.json").exists():
        print(f"[{idx}/{num_pages}] skip (already done)")
        continue

    t0 = time.time()
    print(f"[{idx}/{num_pages}] OCR: {img_path}", flush=True)

    try:
        res_list = infer_one_image(str(img_path), pipeline)
    except Exception as e:
        # write placeholders on failure
        with open(page_json_dir / f"{stem}.json", "w", encoding="utf-8") as f:
            json.dump({"page": idx, "error": f"{type(e).__name__}: {e}"}, f, ensure_ascii=False)
        with open(page_md_dir / f"{stem}.md", "w", encoding="utf-8") as f:
            f.write(f"--- Page {idx} ---\n\n[OCR error: {type(e).__name__}: {e}]")
        print(f"  -> ERROR: {type(e).__name__}: {e}  ({time.time()-t0:.1f}s; elapsed {time.time()-start:.1f}s)", flush=True)
        continue

    if not res_list:
        with open(page_json_dir / f"{stem}.json", "w", encoding="utf-8") as f:
            json.dump({"page": idx, "error": "no_result"}, f, ensure_ascii=False)
        with open(page_md_dir / f"{stem}.md", "w", encoding="utf-8") as f:
            f.write(f"--- Page {idx} ---\n\n[No OCR result]")
        print(f"  -> no result  ({time.time()-t0:.1f}s; elapsed {time.time()-start:.1f}s)", flush=True)
        continue

    # save via built-ins, rename deterministically
    for res in res_list:
        before_json = set(page_json_dir.glob("*.json"))
        before_md   = set(page_md_dir.glob("*.md"))
        res.save_to_json(save_path=str(page_json_dir))
        res.save_to_markdown(save_path=str(page_md_dir))
        after_json = list(set(page_json_dir.glob("*.json")) - before_json)
        after_md   = list(set(page_md_dir.glob("*.md")) - before_md)
        if after_json:
            newest_json = max(after_json, key=lambda p: p.stat().st_mtime)
            shutil.move(str(newest_json), str(page_json_dir / f"{stem}.json"))
        if after_md:
            newest_md = max(after_md, key=lambda p: p.stat().st_mtime)
            shutil.move(str(newest_md), str(page_md_dir / f"{stem}.md"))

    paddle.device.cuda.empty_cache()
    gc.collect()

    print(f"  -> done in {time.time()-t0:.1f}s  (elapsed {time.time()-start:.1f}s)", flush=True)

print("Per-page OCR complete.")





Found 19 page images.
Already processed pages (md present): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] ...
[1/19] skip (already done)
[2/19] skip (already done)
[3/19] skip (already done)
[4/19] skip (already done)
[5/19] skip (already done)
[6/19] skip (already done)
[7/19] skip (already done)
[8/19] skip (already done)
[9/19] skip (already done)
[10/19] skip (already done)
[11/19] skip (already done)
[12/19] skip (already done)
[13/19] skip (already done)
[14/19] skip (already done)
[15/19] skip (already done)
[16/19] skip (already done)
[17/19] skip (already done)
[18/19] skip (already done)
[19/19] skip (already done)
Per-page OCR complete.
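
The retry logic inside `infer_one_image` can be illustrated without a GPU. The sketch below is a simplified stand-in, not the notebook's function: `run_with_fallback` and `fake_predict` are hypothetical names, and the fake predictor simply fails above an arbitrary pixel cap.

```python
def run_with_fallback(run, caps, is_retryable):
    """Call run(cap) for each cap in order; return (cap, result) on first success.

    Retryable errors move on to the next (smaller) cap; any other
    error is raised immediately.
    """
    last_err = None
    for cap in caps:
        try:
            return cap, run(cap)
        except Exception as err:
            if not is_retryable(err):
                raise
            last_err = err
    raise last_err if last_err else RuntimeError("no caps given")

# Hypothetical stand-in for pipeline.predict: fails above 400k pixels.
def fake_predict(cap):
    if cap > 400_000:
        raise MemoryError("out of memory")
    return f"ok at {cap}"

used_cap, result = run_with_fallback(
    fake_predict,
    caps=(700_000, 500_000, 350_000, 250_000),
    is_retryable=lambda err: "out of memory" in str(err).lower(),
)
print(used_cap, result)
```

Separating the "is this retryable" test from the loop keeps the out-of-memory string matching in one place.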


## Step 10: Build Aggregated Output Files

Here we consolidate the page-level OCR results into:
1. `tdf_rules_full.md` — full Markdown text (all 19 pages, labeled)  
2. `tdf_rules_full.jsonl` — JSON Lines file with per-page data and error tags  
3. `tdf_rules_log.json` — processing summary and metadata  

This ensures all recognized content is organized and ready for structured parsing.
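
Once the JSONL file exists, a quick pass over it shows which pages carry an `error` key. The sketch below assumes each line is a JSON object with a `page` field, as written by the aggregation cell; `summarize_jsonl` is a hypothetical helper and the two demo records are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

def summarize_jsonl(jsonl_path):
    """Return (ok_pages, error_pages) from a per-page JSONL file."""
    ok_pages, error_pages = [], []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            target = error_pages if "error" in record else ok_pages
            target.append(record.get("page"))
    return ok_pages, error_pages

# Made-up two-page file: one good page, one error placeholder.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text(
        json.dumps({"page": 1, "parsing_res_list": []}) + "\n"
        + json.dumps({"page": 2, "error": "missing_json"}) + "\n",
        encoding="utf-8",
    )
    ok_pages, error_pages = summarize_jsonl(demo)
print("ok:", ok_pages, "errors:", error_pages)
```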

In [14]:
aggregate_dir = Path("tdf_ocr_aggregate")
aggregate_dir.mkdir(exist_ok=True, parents=True)

agg_md_path = aggregate_dir / "tdf_rules_full.md"
agg_jsonl_path = aggregate_dir / "tdf_rules_full.jsonl"
log_path = aggregate_dir / "tdf_rules_log.json"

# Per-page records for the run log
page_records = []

# 10a) Aggregate Markdown in correct page order with page markers
with open(agg_md_path, "w", encoding="utf-8") as out_md:
    out_md.write(f"# TDF Cricket Rule Book (OCR)\n\n_Total pages: {num_pages}_\n")
    for p in range(1, num_pages + 1):
        stem = f"page-{p:04d}"
        md_file = page_md_dir / f"{stem}.md"
        out_md.write(f"\n\n--- Page {p} ---\n\n")
        if md_file.exists():
            out_md.write(md_file.read_text(encoding="utf-8"))
        else:
            out_md.write("[Missing Markdown for this page]")

# 10b) Aggregate JSON into JSONL (one line per page) and record per-page status
with open(agg_jsonl_path, "w", encoding="utf-8") as out_jsonl:
    for p in range(1, num_pages + 1):
        stem = f"page-{p:04d}"
        jf = page_json_dir / f"{stem}.json"
        if jf.exists():
            try:
                j = json.loads(jf.read_text(encoding="utf-8"))
                if isinstance(j, dict) and "page" not in j:
                    j["page"] = p
            except Exception as e:
                j = {"page": p, "error": f"json_parse: {e}", "file": str(jf)}
        else:
            j = {"page": p, "error": "missing_json", "file": str(jf)}
        out_jsonl.write(json.dumps(j, ensure_ascii=False) + "\n")
        # keep a short status entry so the run log's per_page_records is not empty
        has_error = isinstance(j, dict) and "error" in j
        page_records.append({"page": p, "status": "error" if has_error else "ok"})

# 10c) Write a small run log
with open(log_path, "w", encoding="utf-8") as f:
    json.dump({
        "pdf_name": pdf_path.name,
        "pages_detected": num_pages,
        "per_page_records": page_records,
        "outputs": {
            "markdown": str(agg_md_path),
            "jsonl": str(agg_jsonl_path)
        }
    }, f, ensure_ascii=False, indent=2)

print("Aggregation complete.")
print(f"Markdown: {agg_md_path}")
print(f"JSONL:    {agg_jsonl_path}")
print(f"Log:      {log_path}")

Aggregation complete.
Markdown: tdf_ocr_aggregate/tdf_rules_full.md
JSONL:    tdf_ocr_aggregate/tdf_rules_full.jsonl
Log:      tdf_ocr_aggregate/tdf_rules_log.json


## Step 11 — Download Final OCR Outputs

Now that the OCR aggregation is complete, you can download the generated files
directly from Colab for offline analysis or sharing:

- `tdf_rules_full.md` → Complete rulebook text in Markdown  
- `tdf_rules_full.jsonl` → Structured JSON output (one record per page)  
- `tdf_rules_log.json` → Run metadata and file paths

In [29]:
from google.colab import files
files.download("tdf_ocr_aggregate/tdf_rules_full.md")
files.download("tdf_ocr_aggregate/tdf_rules_full.jsonl")
files.download("tdf_ocr_aggregate/tdf_rules_log.json")

## Step 12 - Final Output — TDF Cricket Rules (Extracted Text)

In [19]:
from IPython.display import Markdown, display

# show the markdown nicely rendered in Colab
display(Markdown(open("tdf_ocr_aggregate/tdf_rules_full.md", encoding="utf-8").read()))


# TDF Cricket Rule Book (OCR)

_Total pages: 19_


--- Page 1 ---

Telangana Development Forum (TDF) Detroit Chapter

Prof. Jayashankar Cricket Tournament

Cricket Rule Book-2025

&

Code of Conduct

--- Page 2 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

Table of Contents

TDF Cricket Spirit - Above All..... 4

Player Registration & Waiver Forms..... 4

Tournament Specific Rules and Guidelines..... 5

Tournament Format..... 6

Allotted Overs..... 6

Core Rule..... 7

Fielding Restrictions..... 7

Power Play..... 7

Game Timings..... 7

Match Interruption..... 8

Extras..... 8

Team Roster..... 8

--- Page 3 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

Bowling Guard.....12

Free hit.....12

FAIR AND UNFAIR PLAY.....12

NRR Calculation.....12

Scoring.....13

Team Ranking.....13

Tie-Deciders.....13

Semi-Final Qualifier.....14

Wide Ball Rules.....14

Wide ball Markings.....14

No Ball Rules.....15

Ground Setup.....18

TDF Cricket

--- Page 4 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

TDF Cricket Spirit - Above All

Cricket is a game that owes much of its unique appeal to the fact that it should be played not only within its Laws but also within the TDF Spirit of the Game. Any action, which is seen to abuse this spirit, causes injury to the game itself. The major responsibility for ensuring the spirit of fair play rests with the captains and players.

TDF  Cricket  main  objective  is  to  inculcate  Culture  of  Inclus

--- Page 5 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

Tournament Specific Rules and Guidelines

1. Unless rules are defined in this document, all other issues/rules to be followed according to ICC T20 Rules. Subject to the regulations outlined in this document, all other matters and rules governing the tournament shall be adhered to in accordance with the ICC T20 Rules. The ICC T20 Rules shall prevail in all respects, and any provisions not specifically addressed  in  this  document  shall  be  governed  by  the  relevant  rules  and

--- Page 6 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

b) If a player sustains an injury during the game, and both captains mutually agree, a 'by runner' provision may be allowed. The committee will not intervene in the decision unless requested by both captains

c) In the "by runner" provision, the opposition captain selects the runner. A change may be requested, but the final decision rests with the opposition captain.

Tournament Format

The tournament consists of four rounds:

Round 1: League Stage

Round 2: Quarterfinals and Contender

--- Page 7 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

· In a reduced overs game, the committee will provide allotted overs based on total overs.

· Umpires, Scorer, and captains will ensure these rules are enforced. If somehow a batsman faced more than the restrictive limit, he is immediately asked to retire when discovered of such instance. If a bowler bowls more than the allocation limit, he is immediately asked to discontinue  when  discovered  of  such  an  instance.  Balls  bowled  by  the  bow

--- Page 8 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

· Batting powerplay overs shall be chosen by the batting team after bowling powerplay completed during the innings.

· During the Power Play, there cannot be more than two fielders in the Grass Field.

· If a team is playing with less than 11 players on the field, fielders not at the ground are counted as if they were in the grass field.

· Example:  if  a  team  is  fielding  with  8  players;  During  power  play  

--- Page 9 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

Main & Leg Umpire

Request the main umpire to consult leg umpire in situations where the second bounce, keeper catches etc. Main umpire can discuss with leg umpire for full toss no-balls and make the final call. It is always advised for two umpires to consult and not rush to decision. If a leg umpire is from a playing team, a neutral main umpire can override leg umpire's call. A decision made by either of the ump

--- Page 10 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

Keeper & Field Changes

· Umpire  should  be  notified  of  all  the  fielder  moments  in  and  out  of  the  field.  If  such notification is not provided, either umpire shall call or signal as a NO BALL '

· Anytime  the  Wicketkeeper  makes  a  change  in  position  during  the  Over  towards  the Wickets,  it  is  the  Keeper's

--- Page 11 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

A replacement player can only be requested after the day's play has concluded and may be provided (depending on reserve list availability and circumstances) for the next day's games.

Late to Ground

Players arriving late should let the umpire know and wait for that over completion to be in the field. To bowl, he must wait one more over after coming to the field. As such no restriction for batsman.

Retired Hurt

Retired hurt players should come for batting at the end. Retired

--- Page 12 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

Ball becomes 'not in play', when

· A serious injury to a player or Umpire occurs.

· One or both bails fall from the Striker's wicket before he receives delivery.

· The Umpire leaves his normal position for consultation.

Ball is 'in play', when

· An unsuccessful appeal is made.

· The Umpire has called 'no ball' or 'wide'.

· It strikes an Umpire-minor Injury (unless it lodges in his

--- Page 13 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

NRR Calculation

· A team's net run rate is calculated by deducting from the average runs per over scored by that team throughout the competition, the average runs per over scored against that team throughout the tournament.

· NRR Formula: (Runs Scored/Balls Faced) -(Runs Scored Against/ Balls Bowled).

· In the event of a team being all out in less than its full quota of overs, the calculation of its net run rate shall be based on the full quota of overs

--- Page 14 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

Semi-Final Qualifier

Top 2 teams based on the number of points from each pool will advance to Semi-finals. If teams have the same number of points in league, the qualification of semifinals will be decided based on following

· Team with better net run-rate will go into semifinals.

· Head-Head results will be taken from league games for winner to go into semifinals.

· If there is still a tie, then the team with maximum wins will move to semifinals.

Wide Ball Rules

--- Page 15 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

NOTE: All ICC rules for scoring, counting the number balls bowled in an over and getting out to a wide ball are valid.

No Ball Rules

All No Ball calls result in a penalty of 1 run in addition to the runs scored of the ball and yield a freehit. Please refer to free hit rules.

As per TDF Cricket, the following rules override the ICC no-ball rule

· 1 ball per over is allowed by the bowler to bowl between the shoulder and head height but not above head

--- Page 16 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament

Code of Conduct and Rule Book

The umpire shall call and signal No ball, if

· If the ball bowled reaches the batsman at waist height without bouncing (Full Toss) unless it hits wickets, see the below figure.

· Above shoulder (good bouncer) one per over allowed with warning. The second one in the same over is no-ball. If a leg-umpire is not neutral, the main umpire can discuss with leg umpire and override leg-umpire decision.

· If

--- Page 17 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

· Umpire shall call and signal no-ball, if the bowler breaks the wicket at any time by any part of the body during the ball delivery including action and after delivery, except to run out the non-striker

· In addition to the instances above, no ball is to be called and signaled as required by the following

· Position of wicketkeeper

· Limitation fielders on each side.

· Limitations of fielders in the grass Area.

· Dangerous and unfair bow

--- Page 18 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

· Wilson A1228 Level 5 Soft T-Ball will be used for all matches. Match balls will be provided for each innings.

· Unless restricted by the shape of the ground, the boundary limit should be 195 feet (~63 yards) measured from the center of the pitch.

· Declared runs and boundaries on the offside and fine leg positions are to be followed.

o Runs Over the Fence Rule Anytime the ball in play goes over/under the fence outside the field

--- Page 19 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

· Also note that in the case of runouts and stumping, a batsman is considered to be out of his ground unless a part of his bat or body is grounded inside the crease. Thus, if the body or the bat is grounded on the crease line, but not inside, the batsman is 'OUT.

FINAL NOTE

All teams and their players must abide by the rules and regulations of the park (Parking Field) they are playing in, like no alcoholic drinks, no public indecent exposure

## Step 13 - JSON Results — Page-wise OCR Data

In [17]:
import json, itertools
with open("tdf_ocr_aggregate/tdf_rules_full.jsonl", encoding="utf-8") as f:
    for line in itertools.islice(f, 19):   # show first 19 pages
        print(json.loads(line))


{'input_path': 'tdf_cricket_rules_work/pages/e3ada67c-4aec-4ba6-a358-7848d236a232-01.png.tmp_700000.png', 'page_index': None, 'model_settings': {'use_doc_preprocessor': False, 'use_layout_detection': False, 'use_chart_recognition': False, 'format_block_content': False}, 'parsing_res_list': [{'block_label': 'ocr', 'block_content': 'Telangana Development Forum (TDF) Detroit Chapter\nProf. Jayashankar Cricket Tournament\nCricket Rule Book-2025\n&\nCode of Conduct', 'block_bbox': [0, 0, 735, 952], 'block_id': 0, 'block_order': 1}], 'page': 1}
{'input_path': 'tdf_cricket_rules_work/pages/e3ada67c-4aec-4ba6-a358-7848d236a232-02.png.tmp_700000.png', 'page_index': None, 'model_settings': {'use_doc_preprocessor': False, 'use_layout_detection': False, 'use_chart_recognition': False, 'format_block_content': False}, 'parsing_res_list': [{'block_label': 'ocr', 'block_content': 'TDF\nDetroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book\nTable of Contents\nTDF Cricket Spirit - A

## Step 14: Validate Final OCR Output

Before analyzing or structuring the text, let’s quickly inspect the combined Markdown result.
We’ll check:
- Total characters extracted  
- Number of lines  
- A preview of the first 250 lines of the content
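
Beyond total counts, a per-page breakdown makes empty or suspiciously short pages easy to spot. The sketch below splits the combined Markdown on the `--- Page N ---` markers written during aggregation; `page_lengths` is a hypothetical helper and the sample text is made up.

```python
import re

PAGE_MARKER = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)

def page_lengths(markdown_text):
    """Map page number -> character count of the text under that marker."""
    parts = PAGE_MARKER.split(markdown_text)
    # parts = [preamble, "1", text1, "2", text2, ...]
    return {
        int(parts[i]): len(parts[i + 1].strip())
        for i in range(1, len(parts) - 1, 2)
    }

sample = (
    "# Title\n\n--- Page 1 ---\n\nHello\n\n"
    "--- Page 2 ---\n\n\n--- Page 3 ---\n\nWorld!\n"
)
lengths = page_lengths(sample)
empty = [page for page, count in lengths.items() if count == 0]
print(lengths, "empty pages:", empty)
```

On the real file, the same function could be called on the `text` variable read in the cell below.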

In [18]:
with open("tdf_ocr_aggregate/tdf_rules_full.md", encoding="utf-8") as f:
    text = f.read()
print("Characters:", len(text))
print("Total lines:", len(text.splitlines()))
print("\nSample excerpt:\n")
print("\n".join(text.splitlines()[:250]))  # first 250 lines


Characters: 9654
Total lines: 312

Sample excerpt:

# TDF Cricket Rule Book (OCR)

_Total pages: 19_


--- Page 1 ---

Telangana Development Forum (TDF) Detroit Chapter

Prof. Jayashankar Cricket Tournament

Cricket Rule Book-2025

&

Code of Conduct

--- Page 2 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

Table of Contents

TDF Cricket Spirit - Above All..... 4

Player Registration & Waiver Forms..... 4

Tournament Specific Rules and Guidelines..... 5

Tournament Format..... 6

Allotted Overs..... 6

Core Rule..... 7

Fielding Restrictions..... 7

Power Play..... 7

Game Timings..... 7

Match Interruption..... 8

Extras..... 8

Team Roster..... 8

--- Page 3 ---

TDF

Detroit Prof. Jayashankar Cricket Tournament Code of Conduct and Rule Book

Bowling Guard.....12

Free hit.....12

FAIR AND UNFAIR PLAY.....12

NRR Calculation.....12

Scoring.....13

Team Ranking.....13

Tie-Deciders.....13

Semi-Final Qualifier.....14

Wide Ball Rules.....14

Wi