# <h2 style='text-align:center;'>OCR: OpenCV Text Recognition</h2>

This notebook demonstrates a robust OCR pipeline using **OpenCV** for preprocessing and **Tesseract OCR** (via `pytesseract`) for text extraction. The companion script `ocr_text_detection.py` (VSCode-friendly) performs the actual extraction; use this notebook to explore preprocessing steps and visualize intermediate results.

## ⚙️ Setup

Install Python packages (run in your environment):

```bash
pip install opencv-python numpy pytesseract pillow matplotlib
```

**Important:** Tesseract OCR is a system program and must be installed separately (not via pip). Instructions below.

### 🔧 Install Tesseract (quick guide)

- **Windows (recommended):** Download installer from UB-Mannheim builds: https://github.com/UB-Mannheim/tesseract/wiki and run the `.exe`. During install, check **Add to PATH**. Typical path: `C:\Program Files\Tesseract-OCR\tesseract.exe`.
- **macOS:** `brew install tesseract`
- **Ubuntu/Debian:** `sudo apt update && sudo apt install tesseract-ocr`

After installing, verify in a terminal:

```bash
tesseract --version
```

In [None]:
import shutil
print('tesseract on PATH ->', bool(shutil.which('tesseract')))
# If False, set pytesseract.pytesseract.tesseract_cmd to the full path to your tesseract executable.
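
If the check above prints `False`, one option is to point `pytesseract` at the executable directly. The following is a minimal sketch, assuming the typical Windows install path mentioned earlier; `find_tesseract` is a hypothetical helper written for this notebook, not part of `pytesseract`.

```python
import shutil
from pathlib import Path


def find_tesseract(candidates=()):
    """Return a path to the tesseract executable, or None if it cannot be found."""
    on_path = shutil.which('tesseract')
    if on_path:
        return on_path
    # Fall back to explicit locations (assumed; adjust these for your machine).
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


cmd = find_tesseract([r'C:\Program Files\Tesseract-OCR\tesseract.exe'])
if cmd is None:
    print('Tesseract not found; install it or add its location to the candidates list.')
else:
    try:
        import pytesseract
        # Tell pytesseract which executable to launch.
        pytesseract.pytesseract.tesseract_cmd = cmd
        print('Using tesseract at:', cmd)
    except ImportError:
        print('pytesseract is not installed; run the pip command from Setup.')
```

Checking PATH first keeps the notebook portable; the explicit candidates are only consulted when the PATH lookup fails.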

## 📷 Load sample image

The cells below use the first file found in `sample_images/`. If the folder is empty, add images to it or point `img_path` at your own image file.

In [None]:
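import cv2
from pathlib import Path
from matplotlib import pyplot as plt

sample = Path('sample_images')
# Sort so the "first" image is the same on every run; skip sub-directories.
imgs = sorted(p for p in sample.glob('*') if p.is_file())
if not imgs:
    print('No sample images found in sample_images/. Please add images (e.g. images.jpg) and re-run.')
else:
    img_path = imgs[0]
    print('Using:', img_path)
    img = cv2.imread(str(img_path))
    if img is None:
        # cv2.imread returns None (it does not raise) when the file cannot be decoded.
        print('Could not read', img_path, 'as an image.')
    else:
        # OpenCV loads images as BGR; matplotlib expects RGB.
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        plt.figure(figsize=(10, 6)); plt.imshow(img_rgb); plt.axis('off')
        plt.show()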

## 🔍 Preprocessing recipes

We will demonstrate three common preprocessing approaches before passing images to Tesseract:

- **Otsu thresholding** (`thresh`): good for high-contrast text
- **Adaptive thresholding** (`adaptive`): robust to varying illumination
- **Bilateral smoothing** (`smooth`): denoising while preserving edges

Use whichever works best for your image; try all three if unsure.

In [None]:
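import cv2
from matplotlib import pyplot as plt

def preprocess_and_show(path, mode='thresh'):
    """Preprocess the image at `path` with the given mode and display the result."""
    img = cv2.imread(str(path))
    if img is None:
        raise FileNotFoundError(f'Could not read image: {path}')
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    if mode == 'thresh':
        # Otsu picks a single global threshold from the histogram.
        _, proc = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    elif mode == 'adaptive':
        # Per-pixel threshold: Gaussian-weighted mean of an 11x11 neighborhood, minus 2.
        proc = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    elif mode == 'smooth':
        # Edge-preserving denoise; the output stays grayscale (it is not binarized).
        proc = cv2.bilateralFilter(gray, 9, 75, 75)
    else:
        raise ValueError(f'Unknown mode: {mode!r}')
    plt.figure(figsize=(10, 6)); plt.imshow(proc, cmap='gray'); plt.title(mode); plt.axis('off')
    plt.show()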

# show all three for the first sample image if present
from pathlib import Path
s = Path('sample_images')
imgs = list(s.glob('*'))
if imgs:
    for m in ['thresh','adaptive','smooth']:
        preprocess_and_show(imgs[0], m)
else:
    print('Add sample images to sample_images/')

## ✍️ Extract text using pytesseract (example)
Below we call `pytesseract` on the Otsu-thresholded image. If `tesseract` is not installed or not on PATH, `pytesseract` raises an error, which the cell catches and prints; refer to the installation steps above.

In [None]:
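import pytesseract
import cv2
from pathlib import Path

s = Path('sample_images')
imgs = list(s.glob('*'))
# cv2.imread returns None when the file cannot be decoded as an image.
img = cv2.imread(str(imgs[0])) if imgs else None
if img is not None:
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, proc = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    try:
        # --psm 6: treat the image as a single uniform block of text.
        text = pytesseract.image_to_string(proc, config='--psm 6')
        print('----- Recognized Text -----')
        print(text)
    except pytesseract.TesseractNotFoundError:
        print('Tesseract executable not found; see the installation steps above.')
    except Exception as e:
        print('pytesseract error:', e)
else:
    print('No readable sample image available.')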

## 🧰 Use the backend script
We created a companion script `ocr_text_detection.py` that you can run from the command line. Example:

```bash
python ocr_text_detection.py --image sample_images/images.jpg --mode thresh --save
```

The script locates Tesseract when it is on PATH, shows the processed image, prints the extracted text to the console, and optionally saves it as a `.txt` file.
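
The source of `ocr_text_detection.py` is not shown in this notebook, so the following is only a sketch of how a command line with the three flags from the example (`--image`, `--mode`, `--save`) could be parsed with `argparse`. The defaults, help strings, and the `build_parser` name are assumptions; the actual script may differ.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in the example command (assumed layout)."""
    parser = argparse.ArgumentParser(description='OCR an image with OpenCV and Tesseract.')
    parser.add_argument('--image', required=True, help='Path to the input image.')
    parser.add_argument(
        '--mode',
        choices=['thresh', 'adaptive', 'smooth'],
        default='thresh',  # assumed default
        help='Preprocessing recipe to apply before OCR.',
    )
    parser.add_argument('--save', action='store_true', help='Also write the text to a .txt file.')
    return parser


# Parse the same arguments as the example command, without touching sys.argv.
args = build_parser().parse_args(
    ['--image', 'sample_images/images.jpg', '--mode', 'thresh', '--save']
)
print(args.image, args.mode, args.save)
```

Restricting `--mode` with `choices` means a typo such as `--mode tresh` is rejected at the command line instead of silently falling through to another recipe.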

## 📦 Next steps & enhancements
- Use `pytesseract.image_to_data()` to get bounding boxes and confidence scores.
- Add morphological operations (dilate/erode) for noisy documents.
- Extend to multi-page PDF OCR using `pdf2image`.
- Integrate with a Streamlit front end for easy uploads and downloads.
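
As a starting point for the first item, the sketch below filters the word-level output of `pytesseract.image_to_data()` by confidence. `confident_words` and the 60-point cutoff are hypothetical choices for illustration, and the sample dictionary is hand-written so that the filter can be tried without Tesseract installed.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, confidence, (left, top, width, height)) for confident words.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    words = []
    for i, text in enumerate(data['text']):
        # Confidence may arrive as int, float, or str; non-word rows carry -1.
        conf = float(data['conf'][i])
        if text.strip() and conf >= min_conf:
            box = (data['left'][i], data['top'][i], data['width'][i], data['height'][i])
            words.append((text, conf, box))
    return words


# Hand-written stand-in for real OCR output (not produced by Tesseract).
fake = {
    'text': ['', 'Invoice', 'N0', '2024'],
    'conf': ['-1', '96', '31', '88'],
    'left': [0, 12, 90, 130],
    'top': [0, 10, 10, 10],
    'width': [200, 70, 30, 48],
    'height': [50, 18, 18, 18],
}
for word in confident_words(fake):
    print(word)

# With a real image (requires Tesseract), `proc` being a preprocessed image:
# import pytesseract
# data = pytesseract.image_to_data(proc, config='--psm 6', output_type=pytesseract.Output.DICT)
# for text, conf, (x, y, w, h) in confident_words(data):
#     print(text, conf, x, y, w, h)
```

The returned boxes can be passed to `cv2.rectangle` to draw the detected words on the original image.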

## ✅ Troubleshooting
- If you see `pytesseract` errors: ensure **Tesseract OCR** is installed and on PATH.
- If recognized text is poor: try different preprocessing modes, upscale small images, or restrict the allowed characters with `tessedit_char_whitelist` in the Tesseract config.
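
The last tip can be sketched as follows. `tesseract_config` is a hypothetical helper that assembles the `config` string passed to `pytesseract`; the digits-only whitelist and the 2x upscale factor are example values, not recommendations from the original script.

```python
def tesseract_config(psm=6, whitelist=None):
    """Build a config string for pytesseract, optionally restricting the character set."""
    parts = [f'--psm {psm}']
    if whitelist:
        # -c sets a Tesseract config variable; this one limits recognized characters.
        parts.append(f'-c tessedit_char_whitelist={whitelist}')
    return ' '.join(parts)


digits_only = tesseract_config(psm=6, whitelist='0123456789')
print(digits_only)

# With a real image (requires OpenCV and Tesseract), `gray` being a grayscale image:
# import cv2
# import pytesseract
# big = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)  # upscale small text
# text = pytesseract.image_to_string(big, config=digits_only)
```

A whitelist helps most when the expected content is narrow, such as numeric fields, because it prevents look-alike substitutions like `O` for `0`.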