# Lewis Rincon Castano
## Project: Extracting Text from Images with Tesseract and Hugging Face

### **Image and Output Files:** https://github.com/lericas/portfolio/tree/main/fall%202024
### **Hugging Face Pipelines Information**: https://huggingface.co/docs/transformers/en/main_classes/pipelines

### **Pros:**

1. **Automated Data Extraction**: Saves time by extracting text from invoices/receipts without manual entry.
2. **Supports Multiple Image Formats**: Compatible with various formats (PNG, JPEG, etc.).
3. **Flexible Parsing**: Uses regex to extract common fields like phone numbers, dates, and amounts.
4. **Structured Output**: Exports data to Excel for easy analysis and integration.
5. **Batch Processing**: Handles multiple images in a directory efficiently.
6. **Easily Customizable**: Modular design allows for easy modifications and field additions.
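
As a rough illustration of the regex-based parsing in item 3, the sketch below applies phone, date, and amount patterns to a made-up block of OCR text. The sample string and the names `PATTERNS` and `find_fields` are assumptions for illustration; the patterns follow the ones used in the notebook's parsing function.

```python
import re

# Hypothetical OCR output; real receipts will vary widely.
sample_text = "Acme Cafe\nTel: (555) 123-4567\nDate: 03/14/2024\nTotal: $1,234.50"

PATTERNS = {
    "phone": r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",
    "date": r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
    "amount": r"[$€£]\s*\d+(?:,\d{3})*(?:\.\d{2})?",
}

def find_fields(text):
    """Return the first match for each pattern, or None when nothing matches."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        found[name] = match.group() if match else None
    return found

fields = find_fields(sample_text)
print(fields)
```

Adding a field under this design would mean adding one entry to `PATTERNS`, which is the kind of modification item 6 refers to.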

### **Cons:**

1. **Accuracy Depends on Image Quality**: Poorly scanned or low-resolution images result in inaccurate text extraction.
2. **Limited Context Understanding**: Relies on simple rules, lacking advanced interpretation of document layouts.
3. **Weak OCR on Complex Text**: Struggles with non-standard fonts or handwritten text.
4. **Rigid Field Extraction**: Fixed logic may not work with varying formats (e.g., international layouts).
5. **Lack of Adaptability**: Assumes static positions for fields, requiring manual adjustments for different layouts.
6. **Security Risks When Loading Models with Hugging Face Transformers**: During testing, we received a warning that calling `torch.load` with `weights_only=False` is a potential security risk, because unpickling can execute arbitrary code. Loading untrusted model files this way could compromise the system.
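
To make the unpickling risk in item 6 concrete, here is a small sketch that uses only Python's standard `pickle` module, the mechanism `torch.load` relies on by default. The `Payload` class is hypothetical and deliberately harmless: it only evaluates an arithmetic expression, where a malicious file could run anything.

```python
import pickle

class Payload:
    """Hypothetical object whose pickle tells the loader to call eval()."""
    def __reduce__(self):
        # pickle stores "call eval with this argument"; a real attack would
        # substitute something harmful for this harmless expression.
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())

# Loading does not return a Payload: it runs the embedded call instead.
result = pickle.loads(blob)
print(result)
```

Passing `weights_only=True` to `torch.load` restricts unpickling to tensors and other basic types, which is why the warning recommends it. Whether a given checkpoint loads under that setting depends on how it was saved.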

### **Limitations:**

1. **Dependent on Image Quality**: Blurry or poorly lit images affect OCR performance.
2. **No Advanced NLP**: Can't handle complex invoice formats or variations in terminology.
3. **Locale-Specific Formats**: May not handle international formats (dates, currency, addresses) without customization.
4. **No Handwriting Support**: Unable to process handwritten content effectively.
5. **Lack of Data Validation**: No error handling or post-processing to verify extracted data.
6. **Static Rule-Based Parsing**: Needs frequent adjustments for non-standard or varying document layouts.
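
One way to start addressing item 5 is to validate each extracted field before it is written to Excel. The sketch below is a minimal example under stated assumptions: the helper names `validate_date` and `validate_amount` and the list of accepted date formats are illustrative and are not part of the notebook.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed candidate formats; extend for the locales you expect.
DATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y", "%m-%d-%Y", "%m/%d/%y")

def validate_date(raw):
    """Return a date if raw parses under any assumed format, else None."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None

def validate_amount(raw):
    """Strip currency symbols and separators; return a Decimal or None."""
    if not raw:
        return None
    cleaned = raw.strip().lstrip("$€£").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject special values such as NaN or Infinity.
    return value if value.is_finite() else None

print(validate_date("03/14/2024"))
print(validate_amount("$1,234.50"))
print(validate_date("31/31/2024"))
```

Because the formats are tried in order, an ambiguous value such as 03/04/2024 is read as month/day first. That ordering is itself a locale assumption (see item 3) and should be changed for non-US receipts.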

### **Conclusions:**
The code below demonstrates two methods for extracting text from images: Tesseract OCR with regex-based parsing, and EasyOCR combined with a Hugging Face question-answering model. The two methods produced different results; for example, Tesseract performed slightly better at extracting total amounts, while the Hugging Face model provided more accurate address information and item descriptions. These results are not conclusive because each receipt is unique in its content and format. The code serves as a guide for this practice.



In [1]:
import os
os.getcwd()



'C:\\Users\\lewis\\Desktop\\Deep Learning Project'

In [2]:
# To install missing packages:
# !pip install [package name]

In [3]:
import os
import re
import pytesseract
from PIL import Image
import pandas as pd





In [4]:
# Set the Tesseract executable path (adjust for your OS)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Update this for your system

def extract_text_from_image(image_path):
    """Extract text from an image using Tesseract OCR."""
    with Image.open(image_path) as img:  # Close the file handle when done
        return pytesseract.image_to_string(img)  # Perform OCR to extract text

def parse_extracted_text(text):
    """Parse the extracted text and structure it into the desired fields."""
    
    # Use regular expressions or keyword searches to extract specific details
    company_name = address = phone_number = payment_amount = date = item_description = other = None

    # Extract company name (assuming it is the first non-empty line of text)
    lines = text.split('\n')
    company_name = next((line.strip() for line in lines if line.strip()), None)

    # Extract phone number (using regex to match phone number patterns)
    phone_match = re.search(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)
    if phone_match:
        phone_number = phone_match.group()

    # Extract address (format-dependent; this is a simple heuristic)
    # Assume the address is on the line after the first line containing "address"
    for i, line in enumerate(lines):
        if 'address' in line.lower() and i + 1 < len(lines):  # Avoid IndexError on the last line
            address = lines[i + 1].strip()
            break

    # Extract payment amount (first amount preceded by $, € or £; this may be a line item rather than the total)
    payment_match = re.search(r'[$€£]\s*\d+(?:,\d{3})*(?:\.\d{2})?', text)
    if payment_match:
        payment_amount = payment_match.group()

    # Extract date (numeric formats such as MM/DD/YYYY or DD/MM/YYYY; the pattern cannot tell the two apart)
    date_match = re.search(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', text)
    if date_match:
        date = date_match.group()

    # Extract item description (assumed to be on the line after the first line containing "description")
    for i, line in enumerate(lines):
        if 'description' in line.lower() and i + 1 < len(lines):  # Avoid IndexError on the last line
            item_description = lines[i + 1].strip()
            break

    # Keep the full OCR text for reference (unclassified text is not separated out yet)
    other = "\n".join(lines)

    # Return structured data as a dictionary
    return {
        'Company Name': company_name,
        'Address': address,
        'Phone Number': phone_number,
        'Payment Amount': payment_amount,
        'Date': date,
        'Item Description': item_description,
        'Other': other
    }
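
`re.search` in `parse_extracted_text` returns the first currency amount in the text, which on many receipts is a line item rather than the total. One possible refinement, sketched below under the assumption that the total sits on a line containing the word "total", prefers that line and falls back to the first match otherwise. The function name `find_total_amount` and the sample text are hypothetical.

```python
import re

AMOUNT_RE = re.compile(r"[$€£]\s*\d+(?:,\d{3})*(?:\.\d{2})?")
# Word boundaries keep "Subtotal" from matching.
TOTAL_RE = re.compile(r"\btotal\b", re.IGNORECASE)

def find_total_amount(text):
    """Prefer an amount on a 'total' line; otherwise use the first amount found."""
    total_line_amount = None
    for line in text.splitlines():
        if TOTAL_RE.search(line):
            match = AMOUNT_RE.search(line)
            if match:
                # Keep the last such line, since totals usually come at the end.
                total_line_amount = match.group()
    if total_line_amount:
        return total_line_amount
    first = AMOUNT_RE.search(text)
    return first.group() if first else None

sample = "Coffee  $3.50\nBagel  $2.25\nSubtotal  $5.75\nTax  $0.46\nTotal  $6.21"
print(find_total_amount(sample))
```

This remains a keyword heuristic, so receipts that label the total differently (for example "Amount Due") would still need their own rule.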



In [5]:
def process_images_in_directory(directory):
    """Process all images in the directory and save the structured data to an Excel file."""
    data = []  # List to store parsed data for each image

    # Loop through all files in the directory
    for filename in os.listdir(directory):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')):
            image_path = os.path.join(directory, filename)
            print(f"Processing image: {image_path}")

            # Step 1: Extract text from the image
            extracted_text = extract_text_from_image(image_path)

            # Step 2: Parse the extracted text into fields
            parsed_data = parse_extracted_text(extracted_text)
            parsed_data['Image Name'] = filename  # Add image file name for reference

            # Append the structured data to the list
            data.append(parsed_data)

    # Convert the list of dictionaries to a pandas DataFrame
    df = pd.DataFrame(data)

    # Save the DataFrame to an Excel file
    excel_output_path = os.path.join(directory, 'extracted_data.xlsx')
    df.to_excel(excel_output_path, index=False)

    print(f"Extracted data has been saved to: {excel_output_path}")

if __name__ == "__main__":
    # Set your working directory path (where your images are located)
    working_directory = os.getcwd()  # This gets the current working directory
    
    # Process all images in the directory and save results to Excel
    process_images_in_directory(working_directory)


Processing image: C:\Users\lewis\Desktop\Deep Learning Project\cab receipt.png
Processing image: C:\Users\lewis\Desktop\Deep Learning Project\IC-Basic-Receipt-Template.png
Processing image: C:\Users\lewis\Desktop\Deep Learning Project\magicpay.jpeg
Processing image: C:\Users\lewis\Desktop\Deep Learning Project\receipt-template-us-classic-white-750px.png
Processing image: C:\Users\lewis\Desktop\Deep Learning Project\restaurant.jpg
Extracted data has been saved to: C:\Users\lewis\Desktop\Deep Learning Project\extracted_data.xlsx


In [6]:
# !pip install easyocr pandas openpyxl transformers torch  # add any other libraries you need


In [7]:
import os
import easyocr
import pandas as pd
from transformers import pipeline
import warnings
import torch

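# Suppress all warnings (note: this also hides the torch.load security warning)
warnings.filterwarnings("ignore")

# Suppress Hugging Face cache symlink warning
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'

# Helper that calls torch.load with weights_only=True, as the warning recommends.
# Note: nothing in this notebook calls it, and EasyOCR and Transformers load their
# models internally, so defining it does not change how those libraries load weights.
def safe_torch_load(path, map_location):
    return torch.load(path, map_location=map_location, weights_only=True)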

# Initialize EasyOCR reader
reader = easyocr.Reader(['en'])


Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


In [8]:

# Initialize the Hugging Face question-answering pipeline used to pull fields out of the OCR text
nlp_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

def extract_text_from_image(image_path):
    """Extract text from an image using EasyOCR."""
    result = reader.readtext(image_path)
    text = ' '.join([res[1] for res in result])
    return text

def extract_field(text, question):
    """Use the question-answering model to extract one field from the text."""
    if not text.strip():
        return None  # The pipeline rejects an empty context
    result = nlp_pipeline(question=question, context=text)
    return result['answer']
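
The question-answering pipeline returns a dictionary that includes a confidence `score` alongside the `answer`, and `extract_field` discards it. By default an extractive model returns some span of the context even when the requested field is absent, so filtering on the score may reduce spurious values. The sketch below works on plain dictionaries shaped like the pipeline's output; the helper name `answer_or_none` and the 0.3 threshold are assumptions to tune, not recommended values.

```python
def answer_or_none(result, min_score=0.3):
    """Return the answer text only when the model's score clears min_score."""
    answer = result.get("answer", "").strip()
    if not answer or result.get("score", 0.0) < min_score:
        return None
    return answer

# Hand-written dictionaries standing in for pipeline results (not real model output).
confident = {"score": 0.91, "start": 0, "end": 9, "answer": "Acme Cafe"}
doubtful = {"score": 0.04, "start": 10, "end": 14, "answer": "Cash"}

print(answer_or_none(confident))
print(answer_or_none(doubtful))
```

A suitable threshold would have to be chosen by checking scores against receipts with known values, since the scores are not calibrated for OCR text.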


In [9]:
def process_images_in_directory(directory):
    """Process images, extract text with EasyOCR, query NLP model, and save to Excel."""
    data = []
    
    for filename in os.listdir(directory):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(directory, filename)
            extracted_text = extract_text_from_image(image_path)
            
            # Extract specific fields using NLP
            company_name = extract_field(extracted_text, "What is the company name?")
            address = extract_field(extracted_text, "What is the address?")
            phone_number = extract_field(extracted_text, "What is the phone number?")
            payment_amount = extract_field(extracted_text, "What is the payment amount?")
            date = extract_field(extracted_text, "What is the date?")
            item_description = extract_field(extracted_text, "What is the item description?")
            other = extract_field(extracted_text, "What other information is available?")

            # Append results to the list
            data.append({
                'Image Name': filename,
                'Company Name': company_name,
                'Address': address,
                'Phone Number': phone_number,
                'Payment Amount': payment_amount,
                'Date': date,
                'Item Description': item_description,
                'Other': other
            })

    # Convert to DataFrame and save to Excel
    df = pd.DataFrame(data)
    excel_output_path = os.path.join(directory, 'extract_data_with_deep_learning.xlsx')
    df.to_excel(excel_output_path, index=False)
    print(f"Extracted data has been saved to: {excel_output_path}")

if __name__ == "__main__":
    working_directory = os.getcwd()  # Set the working directory
    process_images_in_directory(working_directory)


Extracted data has been saved to: C:\Users\lewis\Desktop\Deep Learning Project\extract_data_with_deep_learning.xlsx
