In [None]:
# Mount Google Drive to save the notebook
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## OCR Engine Selection and Trial Documentation



### Investigation Report

### 1. Project Goal and Constraints

The primary goal of the initial phase of investigation was to identify and set up suitable Optical Character Recognition (OCR) engines for potential use in a News Fact-Checker/Analyzer project. Key constraints included:
* **Environment:** Google Colab (initially targeting T4 GPU, later adapting to CPU due to potential free-tier limits).
* **Cost:** Must utilize free and open-source tools.
* **Input Data:** Primarily designed for processing screenshots of online news articles (typically multi-line, standard fonts, varying layouts).
* **Usability:** Libraries should be reasonably easy to install and integrate within a Python environment.
* **Performance:** Aim for a balance of accuracy (correct text extraction) and speed suitable for an interactive demo.

### 2. OCR Candidates Investigated

Several open-source OCR libraries and models were considered:

* **Tesseract:** A mature, widely-used OCR engine, primarily CPU-based. ([GitHub](https://github.com/tesseract-ocr/tesseract))
* **EasyOCR:** A popular Python library using deep learning (PyTorch backend by default), supporting GPU acceleration. ([GitHub](https://github.com/JaidedAI/EasyOCR))
* **PaddleOCR:** A comprehensive OCR toolkit based on the PaddlePaddle deep learning framework, supporting GPU acceleration. ([GitHub](https://github.com/PaddlePaddle/PaddleOCR))
* **TrOCR:** A Transformer-based model from Microsoft, available via Hugging Face. ([GitHub](https://github.com/microsoft/unilm/tree/master/trocr))
* **Keras-OCR:** A Python library using TensorFlow/Keras backend (CRAFT/CRNN models). ([GitHub](https://github.com/faustomorales/keras-ocr))
* **MMOCR:** An extensive OCR toolbox from the OpenMMLab ecosystem (PyTorch-based). ([GitHub](https://github.com/open-mmlab/mmocr))

### 3. Initial Testing Findings & Rationale for Selection/Exclusion

Basic installation and execution tests were performed within the Colab environment on sample images (primarily `pbIdS.png`, a standard multi-line OCR test image).

* **Tesseract (v4/v5):**
    * Installed successfully via `apt-get` and `pip install pytesseract`.
    * Ran reliably on **CPU**.
    * Produced **accurate text extraction** with correct line breaks on the test image.
    * Execution time was moderate for CPU.
    * **Selected for further testing** as a reliable CPU baseline.

* **EasyOCR:**
    * Installed successfully via `pip install easyocr`.
    * Ran successfully on **GPU** (`gpu=True`) with very fast execution (~0.15s). Text content was accurate, but with `paragraph=True` adjacent lines were merged into blocks and the original line breaks were lost. Using `paragraph=False` (the library default) preserved line breaks on the test image.
    * Ran successfully on **CPU** (`gpu=False`). Text content and formatting (with `paragraph=False`) were accurate, but execution was significantly slower.
    * **Selected for further testing (CPU)** due to its functional accuracy and the need for CPU comparison. Its GPU potential is noted separately.

* **PaddleOCR:**
    * Initial install (`pip install paddlepaddle-gpu paddleocr`, v2.6.2) resulted in runtime `CUDNN_STATUS_SUBLIBRARY_VERSION_MISMATCH` errors on GPU execution in the target Colab environment, despite restarts and reinstalls.
    * Explicitly installing `paddlepaddle-gpu==2.6.1` (in a clean environment test) **resolved the CUDNN error**.
    * However, subsequent execution attempts with v2.6.1 on GPU **failed at the text detection stage** (`dt_boxes num : 0`) for simple test images where other engines succeeded.
    * Testing the **CPU version** (`pip install paddlepaddle`, `use_gpu=False`) with the underlying v2.6.2 framework **worked correctly**: no CUDNN error, successful text detection, accurate text output, and correct formatting. CPU speed was moderate, slower than Tesseract but faster than EasyOCR CPU.
    * **Selected for further testing (CPU)** as it became functional in this mode, providing a third distinct implementation for comparison.

* **TrOCR (`microsoft/trocr-small-stage1`):**
    * Installed successfully via `transformers`. Ran very fast on GPU (~0.1s).
    * Produced **completely incorrect text output** on the multi-line test image (`pbIdS.png`).
    * Further investigation confirmed TrOCR models are primarily designed for **single text-line images** and require complex line segmentation pre-processing for multi-line documents.
    * **Excluded** as unsuitable for direct use on news screenshots without significant added complexity.

* **Keras-OCR:**
    * Importing the library failed with an `AttributeError` related to `np.sctypes`, indicating incompatibility with modern NumPy versions (2.0+).
    * The import only succeeded after downgrading NumPy to <2.0 (e.g., 1.26.4).
    * **Excluded** due to reliance on outdated dependencies and the undesirability of forcing an older NumPy version in the environment.

* **MMOCR:**
    * Briefly investigated. Identified as a powerful but potentially complex research toolbox requiring multiple dependencies from the OpenMMLab ecosystem (MMCV, MMDetection).
    * **Excluded** in favor of more straightforward, self-contained libraries to maintain focus and manage project complexity within the given timeframe.

### 4. Plan for Testing (CPU Comparison) (Postponed)

Based on the findings above, the detailed comparison will run the following three candidates in **CPU mode**. Running on CPU keeps the multi-image evaluation from being interrupted by Colab free-tier GPU limits:

1.  **Tesseract (CPU)**
2.  **EasyOCR (CPU)**
3.  **PaddleOCR (CPU)**

Evaluation will focus on **Execution Time** and **Accuracy** (text content and formatting) across a diverse set of US news article screenshots. The potential speed advantage of EasyOCR on GPU will be noted separately based on initial tests.
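
The plan above does not say how accuracy would be scored. One common option is the character error rate (CER): the Levenshtein edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below is an assumption about how that could be measured, not the project's chosen metric. `levenshtein` and `character_error_rate` are hypothetical helper names, and the sample strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop a character from a
                current[j - 1] + 1,      # add a character to a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by reference length (0.0 is a perfect match)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


# Made-up reference text and an OCR-style misread ("mn" instead of "in").
reference = "war in a century"
ocr_output = "war mn a century"
print(f"CER: {character_error_rate(reference, ocr_output):.4f}")
```

A lower CER is better. Because line breaks count as characters, the same function also penalises merged or missing line breaks, which covers the "formatting" part of the accuracy criterion.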

### 5. Plan for Pipeline Demonstration

For demonstration purposes in this notebook, the OCR components are integrated with a downstream **Sentiment Analysis** task performed by a pre-trained **DistilBERT** model (`distilbert-base-uncased-finetuned-sst-2-english`), chosen for its balance of accuracy and efficiency.

The subsequent sections will provide the finalized, cleaned code for setting up each of these three selected OCR models (Tesseract CPU, EasyOCR CPU, PaddleOCR CPU) and demonstrate connecting their output to the DistilBERT model to showcase a complete, working **OCR -> Sentiment Analysis pipeline prototype** for each option.

---

## OCR Installation and Implementations

This section contains the setup and execution code for the three selected OCR engines.

### 1. Google Tesseract (CPU)

Tesseract is a widely used open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google. It runs primarily on the CPU.


**1.1. Installation**

Installs the Tesseract engine on the Colab instance and the necessary Python wrapper.

* **Steps:**
    1.  Update Linux package manager lists.
    2.  Install the `tesseract-ocr` engine package.
    3.  Install the `pytesseract` Python library via pip.

**1.2. Execution Example**

This code block performs the following:
1.  Imports necessary libraries (`files`, `pytesseract`, `PIL`, `io`, `time`).
2.  Prompts the user to upload an image file.
3.  Loads the uploaded image using PIL.
4.  Runs Tesseract OCR (`image_to_string`) on the image (CPU-based).
5.  Measures and prints the execution time.
6.  Prints the extracted text.
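
Step 5 above (timing the OCR call) repeats for every engine, so it can be factored into a small helper. The sketch below is one possible way to do that, not the notebook's own code. `timed` is a hypothetical name, and it uses `time.perf_counter()`, a monotonic clock intended for measuring short intervals.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs anywhere. In the notebook the call
# would wrap the OCR step instead, e.g.:
#   text, seconds = timed(pytesseract.image_to_string, img, lang="eng")
text, seconds = timed(str.upper, "sample text")
print(f"{text!r} took {seconds:.6f} s")
```

Keeping the timer around only the OCR call, and not around upload or image decoding, makes the numbers comparable across engines.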

In [None]:
# --- Tesseract Installation ---


# Update package list and install Tesseract OCR engine
!sudo apt-get update
!sudo apt-get install tesseract-ocr -y

# Install the Python wrapper library for Tesseract
!pip install pytesseract -q # Use -q for quieter output
print("\n\nTesseract engine and pytesseract wrapper installed.")


# -----------------------------


In [None]:
# --- Tesseract Execution ---

print("--- Running Tesseract OCR ---")

from google.colab import files
import pytesseract
from PIL import Image
import io
import time

# Prompt for image upload
print("Please upload an image file for Tesseract:")
uploaded_tess = files.upload() # Use a unique variable name for upload dict

# Process if file uploaded
if uploaded_tess:
    file_name_tess = list(uploaded_tess.keys())[0]
    image_data_tess = uploaded_tess[file_name_tess]
    print(f"\nProcessing {file_name_tess} with Tesseract...")

    try:
        # Load image data
        img_tess = Image.open(io.BytesIO(image_data_tess))

        # Start timer
        start_time_tess = time.time()

        # Perform OCR
        extracted_text_tess = pytesseract.image_to_string(img_tess, lang='eng')

        # Stop timer
        end_time_tess = time.time()
        duration_tess = end_time_tess - start_time_tess

        # Print results
        print("\n--- Tesseract OCR Result ---")
        print("----------------------------")
        print(extracted_text_tess)
        print("----------------------------")
        print(f"Tesseract Execution Time: {duration_tess:.4f} seconds")

    except Exception as e:
        print(f"An error occurred during Tesseract processing: {e}")
else:
    print("No file uploaded for Tesseract.")


# -----------------------------

### 2. EasyOCR (CPU)

**2.1. Installation**

Installs the `easyocr` Python library.

* **Steps:**
    1. Install the `easyocr` library using pip.
    2. Note: The first time the `easyocr.Reader` is initialized (in the execution step), it will automatically download the required pre-trained language models (e.g., for English detection and recognition).

**2.2. Execution Example (CPU)**

This code block performs the following:
1. Imports necessary libraries (`easyocr`, `files`, `io`, `time`).
2. Initializes the `easyocr.Reader` for English, explicitly configuring it for **CPU execution** (`gpu=False`).
3. Prompts the user to upload an image file.
4. Loads the uploaded image data (bytes).
5. Runs EasyOCR (`readtext`) on the image data with settings optimized for separate lines (`paragraph=False`).
6. Measures and prints the execution time (expected to be slower than EasyOCR on GPU and, per the initial tests, slower than PaddleOCR on CPU).
7. Prints the extracted text.

*Note: GPU execution can be enabled by initializing the Reader with `gpu=True` (requires a suitable CUDA environment and dependencies like PyTorch compiled for GPU). The same `easyocr` library installation supports both CPU and GPU modes, offering flexibility.*
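
Since one installation serves both modes, the `gpu` flag can be chosen at runtime. The following is a minimal sketch under that assumption. `cuda_available` and `make_reader` are hypothetical helper names; only `easyocr.Reader(..., gpu=...)` and `torch.cuda.is_available()` come from the libraries themselves.

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is importable and reports a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def make_reader(languages=("en",)):
    """Build an EasyOCR reader on GPU when possible, otherwise on CPU."""
    import easyocr  # imported lazily so this cell loads without the package
    return easyocr.Reader(list(languages), gpu=cuda_available())


print(f"CUDA available: {cuda_available()}")
```

Calling `make_reader()` on a CPU-only Colab runtime would then behave like the `gpu=False` initialization used in this section.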

In [None]:
# --- EasyOCR Installation ---


# Install the easyocr library
!pip install easyocr -q # Use -q for quieter output
print("\n\nEasyOCR library installation command executed.")


# -----------------------------

In [None]:
# --- EasyOCR Execution (CPU) ---

print("--- Running EasyOCR (CPU) ---")

import easyocr
from google.colab import files
import io
import time
from PIL import Image
import numpy as np

# Initialize EasyOCR Reader for CPU
# Downloads models on first execution in a session if not cached.
reader_easy_cpu = None # Define outside try block
try:
    # Initialize Reader for English, explicitly using CPU
    print("Initializing EasyOCR Reader for CPU...")
    reader_easy_cpu = easyocr.Reader(['en'], gpu=False) # gpu=False is key
    print("EasyOCR Reader initialized for CPU.")
except Exception as e:
    print(f"Error initializing EasyOCR Reader: {e}")

# Prompt for image upload
print("\nPlease upload an image file for EasyOCR:")
uploaded_easy = files.upload()

# Process if file uploaded and reader was initialized
if uploaded_easy and reader_easy_cpu:
    file_name_easy = list(uploaded_easy.keys())[0]
    image_data_easy = uploaded_easy[file_name_easy]
    print(f"\nProcessing {file_name_easy} with EasyOCR (CPU)...")

    try:
        # Load image data (EasyOCR can often handle bytes directly)
        img_bytes_easy = image_data_easy
        print("Image loaded successfully. Starting EasyOCR (CPU)...")

        # Start timer
        start_time_easy = time.time()

        # Perform OCR using readtext() on the image bytes
        # Using detail=0 to get only text, paragraph=False to keep each detected line separate
        results_easy = reader_easy_cpu.readtext(img_bytes_easy, detail=0, paragraph=False)

        # Stop timer
        end_time_easy = time.time()
        duration_easy = end_time_easy - start_time_easy

        # Process results - join detected text blocks with newlines
        extracted_text_easy = "\n".join(results_easy)

        # Print results
        print("\n--- EasyOCR Result (CPU) ---")
        print("------------------------------")
        print(extracted_text_easy)
        print("----------------------------")
        print(f"EasyOCR (CPU) Execution Time: {duration_easy:.4f} seconds")

    except Exception as e:
        print(f"An error occurred during EasyOCR processing: {e}")

elif not reader_easy_cpu:
    print("EasyOCR Reader failed to initialize. Cannot process image.")
else:
    print("No file uploaded for EasyOCR.")


# ------------------------------------

--- Running EasyOCR (CPU) ---




Initializing EasyOCR Reader for CPU...
Progress: |██████████████████████████████████████████████████| 100.0% Complete




Saving WhatsApp Image 2025-05-02 at 11.34.26 AM.jpeg to WhatsApp Image 2025-05-02 at 11.34.26 AM.jpeg

Processing WhatsApp Image 2025-05-02 at 11.34.26 AM.jpeg with EasyOCR (CPU)...
Image loaded successfully. Starting EasyOCR (CPU)...

--- EasyOCR Result (CPU) ---
------------------------------
Both Sides Wondering How Next Six Weeks Will Play Out
By PATRICK KINGSLEY JERUSALEM As truce took hold on Sunday in Gaza, pO- tentially ending the longest and deadliest war mn a century of Is- raeli-Palestinian conflict; two men used the same metaphor to de- scribe how they felt. The weight on  my chest has lifted;" said Ziad Obeid, Gazan civil servant displaced several times during the war: We have survived" 'The rock lying on my heart has been removed;" said Dov Weiss- glas, former Israeli politician_ "We want sec the hostages home, period" Both men also had a "but" Mr: Obeid has not seen his dam- aged house in northern Gaza for more than year: How bad, he wondered, is the damage? Who will  re

### 3. PaddleOCR (CPU)

PaddleOCR is a comprehensive toolkit from Baidu based on the PaddlePaddle framework. We test the CPU version here, using the specific framework version found compatible during earlier tests.

**3.1. Installation**

Installs the CPU build of the PaddlePaddle framework (`2.6.2`, the version that ran correctly on CPU in the earlier tests) and the `paddleocr` library.

* **Steps:**
    1. Install `paddlepaddle==2.6.2` (CPU version) using pip.
    2. Install the `paddleocr` library using pip.
    3. Note: The `PaddleOCR` engine will download required detection, classification, and recognition models on first initialization.

**3.2. Execution Example (CPU)**

Demonstrates running PaddleOCR on the CPU using an uploaded image.

* **Steps:**
    1. Import required libraries (`PaddleOCR`, `files`, `PIL`, `numpy`, `io`, `time`).
    2. Initialize the `PaddleOCR` engine, explicitly setting `use_gpu=False`.
    3. Prompt user for image upload.
    4. Load uploaded image into a NumPy array (RGB format).
    5. Execute `ocr_engine.ocr()` (CPU-bound).
    6. Measure and display execution time.
    7. Process the nested result list to extract and display text.

*Note: GPU acceleration with PaddleOCR requires installing the specific `paddlepaddle-gpu` package instead of the `paddlepaddle` (CPU) package used here. The CPU version was implemented due to runtime compatibility errors encountered with the GPU package in the Colab environment during initial testing.*
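
Step 7 depends on the shape of the value returned by `ocr()`. The sketch below assumes the structure used by the execution code in this section: one entry per image, each entry either `None` (nothing detected) or a list of `[box_points, (text, confidence)]` items. `extract_lines` and its `min_confidence` parameter are hypothetical additions for illustration, and the sample data is made up.

```python
def extract_lines(result, min_confidence=0.0):
    """Flatten a PaddleOCR-style result into a list of text lines."""
    lines = []
    for page in result or []:
        if page is None:  # no text detected on this image
            continue
        for _box, (text, confidence) in page:
            if confidence >= min_confidence:
                lines.append(text)
    return lines


# Made-up sample in the assumed shape: one image, two detected lines.
sample = [[
    [[[10, 10], [200, 10], [200, 40], [10, 40]], ("Breaking news", 0.98)],
    [[[10, 50], [200, 50], [200, 80], [10, 80]], ("Second line", 0.42)],
]]

print(extract_lines(sample))
print(extract_lines(sample, min_confidence=0.5))
print(extract_lines([None]))
```

Handling the `None` entry explicitly avoids indexing into a missing result when the detector finds no text boxes.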

In [None]:
# --- PaddleOCR Installation (CPU) ---


# Install the CPU build of PaddlePaddle, pinned to 2.6.2, plus paddleocr
print("\nInstalling paddlepaddle==2.6.2 (CPU) and paddleocr...")
!pip install paddlepaddle==2.6.2 paddleocr -q # Use -q for quieter output
print("\nInstallation command for specific CPU versions executed.")


# ----------------------------------


Installing paddlepaddle==2.6.2 (CPU) and paddleocr...

In [None]:
# --- PaddleOCR Execution (CPU) ---

print("--- Running PaddleOCR (CPU) ---")

from paddleocr import PaddleOCR
from google.colab import files
from PIL import Image
import numpy as np
import io
import time

# Initialize PaddleOCR engine for CPU
# Downloads models on first execution in a session if not cached.
ocr_engine_cpu_paddle = None # Define outside try block
try:
    # Initialize the engine for English, explicitly using CPU
    print("Initializing PaddleOCR engine for CPU...")
    # use_angle_cls=True enables the text-angle classifier; use_gpu=False forces CPU execution
    ocr_engine_cpu_paddle = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=False)
    print("PaddleOCR CPU engine initialized.")
except Exception as e:
    print(f"Error initializing PaddleOCR: {e}")
    ocr_engine_cpu_paddle = None # Ensure engine is None if init fails

# Prompt for image upload
print("\nPlease upload an image file for PaddleOCR:")
uploaded_paddle = files.upload()

# Process if file uploaded and engine was initialized
if uploaded_paddle and ocr_engine_cpu_paddle:
    # Get filename and image data
    file_name_paddle = list(uploaded_paddle.keys())[0]
    image_data_paddle = uploaded_paddle[file_name_paddle]
    print(f"\nProcessing {file_name_paddle} with PaddleOCR (CPU)...")

    try:
        # Load image bytes into PIL, convert to RGB, then to NumPy array
        img_pil_paddle = Image.open(io.BytesIO(image_data_paddle)).convert('RGB')
        img_np_paddle = np.array(img_pil_paddle)
        print("Image loaded successfully. Starting PaddleOCR (CPU)...")

        # --- Time the OCR operation ---
        start_time_paddle = time.time()
        # Perform OCR using the ocr method
        result_paddle = ocr_engine_cpu_paddle.ocr(img_np_paddle, cls=True)
        end_time_paddle = time.time()
        # --- End Timing ---

        duration_paddle = end_time_paddle - start_time_paddle
        print("PaddleOCR (CPU) execution finished.")

        # Process results: one entry per image, None when no text was detected
        extracted_lines_paddle = []
        if result_paddle and result_paddle[0] is not None:
            for line_info in result_paddle[0]:
                # line_info is [[box_points], (text, confidence)]
                extracted_lines_paddle.append(line_info[1][0])  # Extract text

        # Print results
        print("\n--- PaddleOCR Result (CPU) ---")
        print("------------------------------")
        print("\n".join(extracted_lines_paddle))
        print("------------------------------")
        print(f"PaddleOCR (CPU) Execution Time: {duration_paddle:.4f} seconds")

    except Exception as e:
        print(f"An error occurred during PaddleOCR processing: {e}")

elif not ocr_engine_cpu_paddle:
    print("PaddleOCR engine failed to initialize earlier. Cannot process image.")
else:
    print("No file uploaded for PaddleOCR.")


# ----------------------------------

--- Running PaddleOCR (CPU) ---




Initializing PaddleOCR engine for CPU...
download https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_det_infer.tar to /root/.paddleocr/whl/det/en/en_PP-OCRv3_det_infer/en_PP-OCRv3_det_infer.tar


100%|██████████| 3910/3910 [00:15<00:00, 251.42it/s] 


download https://paddleocr.bj.bcebos.com/PP-OCRv4/english/en_PP-OCRv4_rec_infer.tar to /root/.paddleocr/whl/rec/en/en_PP-OCRv4_rec_infer/en_PP-OCRv4_rec_infer.tar


100%|██████████| 10000/10000 [00:16<00:00, 614.81it/s]


download https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar to /root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer/ch_ppocr_mobile_v2.0_cls_infer.tar


100%|██████████| 2138/2138 [00:14<00:00, 151.30it/s]

[2025/05/03 01:31:28] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=False, use_xpu=False, use_npu=False, use_mlu=False, use_gcu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, precision='fp32', gpu_mem=500, gpu_id=0, image_dir=None, page_num=0, det_algorithm='DB', det_model_dir='/root/.paddleocr/whl/det/en/en_PP-OCRv3_det_infer', det_limit_side_len=960, det_limit_type='max', det_box_type='quad', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, rec_algorithm='SVTR_LCNet', rec_model_dir='/root/.paddleocr/whl/rec/en/en_PP-OCRv4_rec_infer', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_batch_num=6, max_text_l




PaddleOCR CPU engine initialized.

Please upload an image file for PaddleOCR:


TypeError: 'NoneType' object is not subscriptable

## Sentiment Analysis Model Setup and OCR Integration

This section installs the necessary libraries for the Sentiment Analysis model and demonstrates integrating each OCR output with it.

### 1. Environment Setup

Make sure to install the `transformers` library from Hugging Face, which provides access to pre-trained models like DistilBERT, and `torch` (PyTorch) as the backend framework.

  
*Note: `torch` might have been installed as a dependency by `easyocr`, but this ensures it's present.*

In [None]:
# Install Hugging Face transformers library and PyTorch
# Using -q for quieter output

!pip install transformers torch -q
print("\n \n Transformers and PyTorch installation command executed.")

### 2. DistilBERT Setup

Loads the pre-trained DistilBERT model (fine-tuned on the SST-2 dataset for sentiment analysis) and its associated tokenizer from the Hugging Face Hub.

* **Steps:**
    1. Import necessary classes from `transformers` and `torch`.
    2. Define the specific model identifier string.
    3. Load the tokenizer using `AutoTokenizer.from_pretrained()`.
    4. Load the sequence classification model using `AutoModelForSequenceClassification.from_pretrained()` (model weights, approx. 250 MB, download on the first run).
    5. Check if a CUDA GPU is available using `torch.cuda.is_available()`.
    6. Move the loaded model to the appropriate device ('cuda' or 'cpu').

In [None]:
# --- Load DistilBERT Model and Tokenizer ---

print("--- Loading Sentiment Analysis Model (DistilBERT) ---")

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Determine device (use GPU if available, else CPU)
device_sa = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Attempting to load DistilBERT on: {device_sa}")

try:
    # Define model identifier
    model_name_sa = "distilbert-base-uncased-finetuned-sst-2-english"

    # Load tokenizer
    # Downloads tokenizer files on first run
    tokenizer_sa = AutoTokenizer.from_pretrained(model_name_sa)
    print("Tokenizer loaded.")

    # Load model
    # Downloads model weights (~250MB) on first run
    model_sa = AutoModelForSequenceClassification.from_pretrained(model_name_sa)
    print("Model loaded.")

    # Move model to the determined device
    model_sa.to(device_sa)
    print(f"Model moved to {device_sa}.")

    # Print confirmation
    print("\nDistilBERT Sentiment Analysis model and tokenizer are ready.")

except Exception as e:
    print(f"An error occurred loading the model/tokenizer: {e}")
    # Ensure variables don't exist if loading failed to prevent later errors
    if 'tokenizer_sa' in locals(): del tokenizer_sa
    if 'model_sa' in locals(): del model_sa


# ------------------------------------

### 3. Sentiment Analysis Function Setup

Defines a function to perform sentiment analysis on a given text using the loaded DistilBERT model and tokenizer.

* **Steps:**
    1. Define function `get_sentiment`, taking the text plus the tokenizer, model, and device as input.
    2. Tokenize the input text with the supplied tokenizer, truncating to 512 tokens and returning PyTorch tensors. Move tensors to the correct device (GPU or CPU).
    3. Perform inference with the supplied model within a `torch.no_grad()` context (for efficiency).
    4. Apply Softmax to get probabilities and find the predicted label ID using `argmax`.
    5. Convert the label ID back to a human-readable label (e.g., 'POSITIVE', 'NEGATIVE') using the model's config.
    6. Return the predicted label and the confidence score.
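
Steps 4 and 5 can be illustrated without loading the model. The sketch below applies softmax and argmax to a pair of made-up logits in plain Python. The label order (0 for NEGATIVE, 1 for POSITIVE) is assumed here to mirror the checkpoint's `id2label` mapping, and the logit values are invented for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a single input text.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-1.2, 2.3]

probabilities = softmax(logits)
# argmax: index of the largest probability
predicted_id = max(range(len(probabilities)), key=probabilities.__getitem__)

print(id2label[predicted_id], round(probabilities[predicted_id], 4))
```

The returned "confidence score" is simply the softmax probability of the winning class, so with two classes it is always at least 0.5.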

In [None]:
# --- Define Sentiment Analysis Function ---

print("--- Defining Sentiment Analysis Function ---")

import torch # Ensure torch is imported

def get_sentiment(text_input, sa_tokenizer, sa_model, sa_device):
    """
    Performs sentiment analysis on the input text using the provided
    model, tokenizer and device.

    Args:
        text_input (str): The text to analyze.
        sa_tokenizer: The loaded Hugging Face tokenizer.
        sa_model: The loaded Hugging Face model.
        sa_device: The torch device ('cuda' or 'cpu') the model is on.

    Returns:
        tuple: A tuple containing the predicted label (str) and
               the confidence score (float), or ("Error", 0.0) if error.
    """
    if not text_input or not isinstance(text_input, str):
        print("  (SA Error: Invalid input text)")
        return "Error", 0.0
    if not sa_tokenizer or not sa_model:
        print("  (SA Error: Model or Tokenizer not provided to function)")
        return "Error", 0.0

    try:
        inputs = sa_tokenizer(text_input, return_tensors="pt", truncation=True, padding=True, max_length=512)
        inputs = {k: v.to(sa_device) for k, v in inputs.items()} # Move inputs to specified device
        with torch.no_grad():
            outputs = sa_model(**inputs)
        probabilities = torch.softmax(outputs.logits, dim=-1)
        prediction = torch.argmax(probabilities, dim=-1)
        score = probabilities[0, prediction.item()].item()
        label = sa_model.config.id2label[prediction.item()]
        return label, score
    except Exception as e:
        print(f"  (SA Error: {e})")
        return "Error", 0.0

print("\n\nSentiment analysis function 'get_sentiment' defined.")


# --------------------------------------

### 4. OCR DistilBERT Integration Function Setup

These functions encapsulate the process for each OCR method: taking raw image data, performing OCR, and then performing sentiment analysis on the extracted text using the previously defined `get_sentiment` function.

* **Input:** Image data (bytes).
* **Output:** A tuple of extracted text, sentiment label, confidence score, and OCR duration in seconds, or error indicators in the same positions.
* **Note:** These functions assume the necessary OCR engines (`reader_easy_cpu`, `ocr_engine_cpu_paddle`) and the SA components (`get_sentiment`, `model_sa`, `tokenizer_sa`, `device_sa`) have been initialized/defined in previous cells. Only the OCR step is timed; sentiment analysis time is not measured.
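
The three pipelines share one control flow: run OCR, stop early on an error or empty text, otherwise pass the text to sentiment analysis. The sketch below shows that flow once, with stand-in callables so it runs without any OCR engine or model. `PipelineResult`, `run_pipeline`, `fake_ocr`, and `fake_sentiment` are hypothetical names used only for illustration and are not part of the notebook.

```python
from typing import Callable, NamedTuple, Tuple


class PipelineResult(NamedTuple):
    text: str
    label: str
    score: float
    ocr_seconds: float


def run_pipeline(image_bytes: bytes,
                 ocr_fn: Callable[[bytes], Tuple[str, float]],
                 sentiment_fn: Callable[[str], Tuple[str, float]]) -> PipelineResult:
    """Run OCR, then sentiment analysis, always returning the same shape."""
    try:
        text, ocr_seconds = ocr_fn(image_bytes)
    except Exception as exc:
        print(f"  Error in OCR step: {exc}")
        return PipelineResult("", "OCR Error", 0.0, 0.0)

    text = text.strip()
    if not text:
        return PipelineResult("", "No Text", 0.0, ocr_seconds)

    label, score = sentiment_fn(text)
    return PipelineResult(text, label, score, ocr_seconds)


# Stand-ins: "OCR" just decodes the bytes, "sentiment" is a keyword check.
def fake_ocr(image_bytes):
    return image_bytes.decode("utf-8"), 0.01


def fake_sentiment(text):
    return ("POSITIVE" if "good" in text else "NEGATIVE"), 0.9


print(run_pipeline(b"good news today", fake_ocr, fake_sentiment))
```

Returning the same four fields on every path lets calling code unpack the result without checking which branch was taken.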

In [None]:
# --- OCR DistilBERT Integration Functions ---

print("--- Defining OCR-BERT Pipeline Functions ---")

from PIL import Image
import numpy as np
import io
import time  # used to time the OCR step in each pipeline
import pytesseract

# Assumes 'reader_easy_cpu', 'ocr_engine_cpu_paddle' exist
# Assumes 'model_sa', 'tokenizer_sa', 'device_sa' exist
# Assumes 'get_sentiment' is defined

def run_tesseract_pipeline(image_data_bytes, sa_tokenizer, sa_model, sa_device):
    """Runs Tesseract OCR -> Sentiment Analysis.

    Returns a 4-tuple: (ocr_text, label, score, ocr_duration_seconds).
    """
    print("\n\nRunning Tesseract Pipeline...")
    ocr_text = ""
    try:
        img = Image.open(io.BytesIO(image_data_bytes))
        start_ocr_time = time.time()
        ocr_text = pytesseract.image_to_string(img, lang='eng')
        end_ocr_time = time.time()
        ocr_duration = end_ocr_time - start_ocr_time
        print("  Tesseract OCR complete.")
    except Exception as e:
        print(f"  Error in Tesseract OCR step: {e}")
        # Same 4-tuple shape as the success path so callers can always unpack
        return "OCR Error", "Error", 0.0, 0.0

    if ocr_text and ocr_text.strip():
        label, score = get_sentiment(ocr_text.strip(), sa_tokenizer, sa_model, sa_device)
        print("  Tesseract Sentiment Analysis complete.")
        return ocr_text.strip(), label, score, ocr_duration
    else:
        print("  Tesseract returned empty text.")
        return "", "No Text", 0.0, ocr_duration

def run_easyocr_pipeline(image_data_bytes, reader, sa_tokenizer, sa_model, sa_device):
    """Runs EasyOCR (CPU) -> Sentiment Analysis.

    Returns (ocr_text, label, score, ocr_duration) on every path.
    """
    print("\n\nRunning EasyOCR (CPU) Pipeline...")
    ocr_text = ""
    ocr_duration = 0.0
    if not reader:
        return "Error: Reader not init.", "Error", 0.0, ocr_duration
    try:
        start_ocr_time = time.time()
        # detail=0 returns plain strings; paragraph=False keeps one entry per detected line
        results = reader.readtext(image_data_bytes, detail=0, paragraph=False)
        ocr_duration = time.time() - start_ocr_time
        ocr_text = "\n".join(results)
        print("  EasyOCR complete.")
    except Exception as e:
        print(f"  Error in EasyOCR step: {e}")
        return "OCR Error", "Error", 0.0, ocr_duration

    if ocr_text and ocr_text.strip():
        label, score = get_sentiment(ocr_text.strip(), sa_tokenizer, sa_model, sa_device)
        print("  EasyOCR Sentiment Analysis complete.")
        return ocr_text.strip(), label, score, ocr_duration
    else:
        print("  EasyOCR returned empty text.")
        return "", "No Text", 0.0, ocr_duration

def run_paddleocr_pipeline(image_data_bytes, engine, sa_tokenizer, sa_model, sa_device):
    """Runs PaddleOCR (CPU) -> Sentiment Analysis.

    Returns (ocr_text, label, score, ocr_duration) on every path.
    """
    print("\n\nRunning PaddleOCR (CPU) Pipeline...")
    ocr_text = ""
    ocr_duration = 0.0
    if not engine:
        return "Error: Engine not init.", "Error", 0.0, ocr_duration
    try:
        # PaddleOCR is given a NumPy array here, so decode the bytes first
        img_pil = Image.open(io.BytesIO(image_data_bytes)).convert('RGB')
        img_np = np.array(img_pil)
        start_ocr_time = time.time()
        result = engine.ocr(img_np, cls=True)
        ocr_duration = time.time() - start_ocr_time
        print("  PaddleOCR complete.")
        extracted_lines = []
        # result[0] holds the detections for the single input image (None if nothing was found)
        if result and result[0] is not None:
            for line_info in result[0]:
                extracted_lines.append(line_info[1][0])
        ocr_text = "\n".join(extracted_lines)
    except Exception as e:
        print(f"  Error in PaddleOCR step: {e}")
        return "OCR Error", "Error", 0.0, ocr_duration

    if ocr_text and ocr_text.strip():
        label, score = get_sentiment(ocr_text.strip(), sa_tokenizer, sa_model, sa_device)
        print("  PaddleOCR Sentiment Analysis complete.")
        return ocr_text.strip(), label, score, ocr_duration
    else:
        print("  PaddleOCR returned empty text.")
        return "", "No Text", 0.0, ocr_duration

print("\n\nPipeline functions defined: \nrun_tesseract_pipeline, \nrun_easyocr_pipeline, \nrun_paddleocr_pipeline")


# --------------------------------------
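The PaddleOCR branch above indexes each detection as `line_info[1][0]`. That implies a result shaped as one list per image, where each entry pairs a bounding box with a `(text, confidence)` tuple. The sketch below illustrates the flattening step on a hand-built sample; the shape is inferred from that indexing and may differ between PaddleOCR versions, and `extract_text_lines` is a hypothetical helper name.

```python
from typing import Any, List, Optional

def extract_text_lines(result: Optional[List[Any]]) -> List[str]:
    """Collect recognized strings from a PaddleOCR-style result for one image."""
    # An empty result, or a None entry for the image, means no text was detected
    if not result or result[0] is None:
        return []
    # Each entry is assumed to look like [box_points, (text, confidence)]
    return [entry[1][0] for entry in result[0]]

# Hand-built sample mimicking the assumed shape (not real OCR output)
sample_result = [[
    [[[0, 0], [100, 0], [100, 20], [0, 20]], ("Breaking news", 0.98)],
    [[[0, 25], [100, 25], [100, 45], [0, 45]], ("Markets rally", 0.95)],
]]

ocr_text = "\n".join(extract_text_lines(sample_result))
print(ocr_text)
```

Isolating this step in a small function makes it easier to adjust if a different PaddleOCR release returns a different structure.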

## Complete Project Pipeline Demonstration


This final section provides a single point of execution to test and compare the end-to-end performance of the three selected OCR pipelines (Tesseract CPU, EasyOCR CPU, PaddleOCR CPU) integrated with the DistilBERT sentiment analysis model.

* **Steps:**
    1. Ensure necessary OCR engines (EasyOCR, PaddleOCR) and the SA model/tokenizer are initialized (code includes checks to initialize them if they haven't been run yet in the session).
    2. Prompt the user to upload a single image file.
    3. Load the uploaded image data.
    4. Call the respective pipeline function for each OCR method (`run_tesseract_pipeline`, `run_easyocr_pipeline`, `run_paddleocr_pipeline`) defined in the previous section.
    5. Store the results returned by each pipeline function (extracted text, sentiment label, sentiment score, OCR execution time).
    6. Print a formatted summary comparing the outputs from all three pipelines for the uploaded image.
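Steps 4 to 6 can be sketched as a loop over a dictionary of pipeline callables. This is an illustrative sketch only: `run_all`, `format_summary`, and the stub pipelines are hypothetical, and each stub returns a fixed `(text, label, score, ocr_time)` tuple in place of a real OCR engine and model.

```python
from typing import Callable, Dict, List, Tuple

PipelineResult = Tuple[str, str, float, float]

def run_all(image_bytes: bytes,
            pipelines: Dict[str, Callable[[bytes], PipelineResult]]) -> Dict[str, dict]:
    """Call every pipeline on the same image and collect the results by name."""
    results = {}
    for name, fn in pipelines.items():
        text, label, score, ocr_time = fn(image_bytes)
        results[name] = {"ocr_text": text, "label": label,
                         "score": score, "ocr_time": ocr_time}
    return results

def format_summary(results: Dict[str, dict], preview_len: int = 150) -> List[str]:
    """Build one summary line per method, truncating long OCR text."""
    lines = []
    for method, r in results.items():
        text = r["ocr_text"]
        preview = text[:preview_len] + "..." if len(text) > preview_len else text
        lines.append(f"{method}: {r['label']} (score {r['score']:.4f}), "
                     f"OCR {r['ocr_time']:.4f}s, text: {preview!r}")
    return lines

# Stub pipelines with canned outputs (assumptions for the demo)
stub_pipelines = {
    "Engine A": lambda b: ("Good news today", "POSITIVE", 0.97, 0.12),
    "Engine B": lambda b: ("", "No Text", 0.0, 0.05),
}

summary = format_summary(run_all(b"", stub_pipelines))
for line in summary:
    print(line)
```

Passing the pipelines in as a dictionary keeps the comparison loop independent of how many OCR engines are available in a given session.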

In [None]:
# --- Complete Pipeline Demonstration ---

print("--- Running All OCR + BERT Pipelines ---")

# Ensure necessary libraries are imported (files, io, Image, np)
# These might be implicitly imported if previous cells ran, but good to have
from google.colab import files
import io
from PIL import Image
import numpy as np
import time # Re-import time if measuring overall time here

# --- Ensure OCR Engines and SA Model/Tokenizer are Initialized ---
# These blocks will skip initialization if the variables already exist
# from running previous cells in this session.

# Tesseract needs no explicit initialization in this cell
print("Tesseract is already initialized.")

# Initialize EasyOCR Reader for CPU if needed
try:
    if 'reader_easy_cpu' not in globals():
        import easyocr  # Imported here in case the earlier setup cell was skipped
        print("Initializing EasyOCR Reader for CPU (first time)...")
        reader_easy_cpu = easyocr.Reader(['en'], gpu=False)
        print("EasyOCR Reader initialized.")
    else:
        print("EasyOCR Reader already initialized.")  # Confirm it exists
except Exception as e:
    print(f"Error initializing EasyOCR Reader: {e}")
    reader_easy_cpu = None

# Initialize PaddleOCR engine for CPU if needed
try:
    if 'ocr_engine_cpu_paddle' not in globals():
        from paddleocr import PaddleOCR  # Imported here in case the earlier setup cell was skipped
        print("Initializing PaddleOCR engine for CPU (first time)...")
        ocr_engine_cpu_paddle = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=False)
        print("PaddleOCR initialized.")
    else:
        print("PaddleOCR engine already initialized.")  # Confirm it exists
except Exception as e:
    print(f"Error initializing PaddleOCR: {e}")
    ocr_engine_cpu_paddle = None

# Ensure Sentiment Analysis function is defined properly
sa_ready = False
if 'model_sa' in locals() and model_sa is not None and \
   'tokenizer_sa' in locals() and tokenizer_sa is not None and \
   'device_sa' in locals() and \
   'get_sentiment' in globals():
    print("Sentiment Analysis components and function verified.")
    sa_ready = True
else:
    print("ERROR: SA model/tokenizer/device or the get_sentiment function is not ready.")
    print("Please ensure the sentiment analysis setup and pipeline definition cells ran successfully.")

# --- Upload Single Image ---
if sa_ready:
    print("\nPlease upload a single image file to test all pipelines:")
    uploaded_final = files.upload()

    if uploaded_final:
        file_name_final = list(uploaded_final.keys())[0]
        image_data_final = uploaded_final[file_name_final]
        print(f"\nProcessing {file_name_final} with all pipelines...\n\n")
        print("="*50)

        pipeline_results_summary = {}
        overall_start_time = time.time()

        # --- Run Pipelines by passing SA components ---
        # Each check uses the actual function name defined in the previous section.
        # Tesseract
        if 'run_tesseract_pipeline' in globals():
            ocr_text_t, label_t, score_t, ocr_time_t = run_tesseract_pipeline(
                image_data_final, tokenizer_sa, model_sa, device_sa # Pass SA args
            )
            pipeline_results_summary['Tesseract (CPU)'] = {'ocr_text': ocr_text_t, 'label': label_t, 'score': score_t, 'ocr_time': ocr_time_t}
        else:
            print("Tesseract pipeline function not defined.")

        # EasyOCR
        if 'run_easyocr_pipeline' in globals() and 'reader_easy_cpu' in globals() and reader_easy_cpu:
            ocr_text_e, label_e, score_e, ocr_time_e = run_easyocr_pipeline(
                image_data_final, reader_easy_cpu, tokenizer_sa, model_sa, device_sa # Pass SA args
            )
            pipeline_results_summary['EasyOCR (CPU)'] = {'ocr_text': ocr_text_e, 'label': label_e, 'score': score_e, 'ocr_time': ocr_time_e}
        else:
            print("EasyOCR pipeline function or reader not ready.")

        # PaddleOCR
        if 'run_paddleocr_pipeline' in globals() and 'ocr_engine_cpu_paddle' in globals() and ocr_engine_cpu_paddle:
            ocr_text_p, label_p, score_p, ocr_time_p = run_paddleocr_pipeline(
                image_data_final, ocr_engine_cpu_paddle, tokenizer_sa, model_sa, device_sa # Pass SA args
            )
            pipeline_results_summary['PaddleOCR (CPU)'] = {'ocr_text': ocr_text_p, 'label': label_p, 'score': score_p, 'ocr_time': ocr_time_p}
        else:
            print("PaddleOCR pipeline function or engine not ready.")

        overall_end_time = time.time()
        print(f"\n\nAll pipelines processed in ~{overall_end_time - overall_start_time:.2f} seconds.\n\n")
        print("="*50)

        # --- Print Comparative Summary ---
        print("--- Comparative Pipeline Results ---")
        print("="*50)
        for method, results in pipeline_results_summary.items():
            print(f"\n\nOCR Method: {method}")
            # Show at most the first 150 characters of the extracted text
            ocr_display_text = results.get('ocr_text', 'N/A')
            if len(ocr_display_text) > 150:
                ocr_display_text = ocr_display_text[:150] + "..."
            print(f"  OCR Text (preview): {ocr_display_text}")
            print(f"  OCR Execution Time: {results.get('ocr_time', 0.0):.4f} seconds")
            score_val = results.get('score')
            score_str = f"{score_val:.4f}" if isinstance(score_val, (int, float)) else "N/A"
            print(f"  Sentiment Result: {results.get('label', 'N/A')} (Score: {score_str})\n\n")
            print("-"*50)
        print("\n\n--- End of Comparison ---")

    else:
        print("No file uploaded.")
else:
    print("Cannot run comparison: Sentiment Analysis setup is not ready.")


# ---------------------------------------------

In [None]:
!python --version

Python 3.11.12
