This project demonstrates how to use Python with the Tesseract OCR engine to extract text from images. The script processes images in the input_images directory and saves the extracted text to the output_texts directory.
Optical Character Recognition (OCR) is a technology used to convert different types of documents, such as scanned paper documents, PDFs, or images taken by a digital camera, into editable and searchable data. This project utilizes Tesseract, an open-source OCR engine, to perform text extraction from images using Python.
- Install Tesseract-OCR:
  - Download and install Tesseract from here.
- Clone the Repository:
  git clone https://github.com/your-username/ocr-project.git
  cd ocr-project
- Install Python Dependencies:
  pip install pytesseract pillow
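To confirm that the Python dependencies installed correctly, a small check along the following lines may help. This is only a sketch: the helper name `missing_packages` and the `REQUIRED` mapping are illustrative and not part of the project.

```python
import importlib.util

# The pip name and the import name can differ: "pillow" is imported as "PIL".
# This mapping is an illustrative assumption, not something the project ships.
REQUIRED = {"pytesseract": "pytesseract", "pillow": "PIL"}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("pytesseract and Pillow are importable.")
```

This only checks the Python packages. The Tesseract engine itself is a separate program and is not covered by this check.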
Usage
- Place your images in the input_images directory.
- Run the OCR script:
  python ocr_script.py
- The extracted text will be saved in the output_texts directory.
We begin by providing the path to the Tesseract executable:
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
NOTE: Make sure to give the correct path to your Tesseract executable file.
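If the script needs to run on machines where Tesseract is installed elsewhere, one option is to look for the executable on the system PATH first and fall back to the Windows path above. The following is a minimal sketch under that assumption; `find_tesseract` and `WINDOWS_DEFAULT` are hypothetical names, not part of pytesseract or of this project.

```python
import shutil
from pathlib import Path

# The path quoted above, used here only as a fallback.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback: str = WINDOWS_DEFAULT) -> str:
    """Return the tesseract executable found on PATH, else `fallback`."""
    found = shutil.which("tesseract")
    return found if found else fallback


tesseract_path = find_tesseract()
if not Path(tesseract_path).is_file():
    print(f"Warning: no Tesseract executable at {tesseract_path}")

# The result can then be assigned as shown above:
# pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

The fallback keeps the original Windows behavior, while Linux and macOS installs that put `tesseract` on PATH should be picked up automatically.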
For this tutorial, we use the images directly, without additional preprocessing. For better accuracy, you can add preprocessing steps such as grayscale conversion, noise reduction, and thresholding.
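The preprocessing steps mentioned above could be sketched with Pillow as follows. This is not part of the project's script. The function name `preprocess`, the median filter size, and the threshold of 128 are assumptions chosen for illustration, and suitable values depend on the images.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess(image: Image.Image, threshold: int = 128) -> Image.Image:
    """Grayscale, denoise, and binarize an image before OCR."""
    # Grayscale conversion: collapse color channels into one luminance channel.
    gray = ImageOps.grayscale(image)
    # Noise reduction: a 3x3 median filter removes isolated speckles.
    denoised = gray.filter(ImageFilter.MedianFilter(size=3))
    # Thresholding: map every pixel to pure black (0) or pure white (255).
    return denoised.point(lambda p: 255 if p > threshold else 0)
```

The result can be passed to `pytesseract.image_to_string` in place of the original image. Whether this improves accuracy depends on the input, so it is worth comparing the output with and without preprocessing.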
The script reads each image file, performs OCR using Tesseract, and writes the extracted text to a text file:
import os

import pytesseract
from PIL import Image

input_dir = "input_images"
output_dir = "output_texts"
os.makedirs(output_dir, exist_ok=True)  # create the output directory if it is missing

for filename in os.listdir(input_dir):
    try:
        # Only process JPEG and PNG files
        if filename.endswith((".jpg", ".png")):
            image_path = os.path.join(input_dir, filename)
            image = Image.open(image_path)
            text = pytesseract.image_to_string(image)
            # Output file keeps the full image name, e.g. scan.png -> scan.png.txt
            output_path = os.path.join(output_dir, filename + ".txt")
            with open(output_path, "w", encoding="utf-8") as f:
                f.write(text)
            print(f"Extracted text from {filename} and wrote to {output_path}")
    except FileNotFoundError:
        print(f"Error: Image file not found: {image_path}")
    except Exception as e:
        print(f"An unexpected error occurred for {filename}: {e}")
The script includes exception handling to manage common errors:
- FileNotFoundError: Raised if the image file is not found.
- General Exception: Catches any other unexpected errors.
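As a variation on the loop and error handling described above, the batch logic can be separated from the OCR call so that failures are collected instead of only printed. This is a sketch, not the project's implementation: `run_batch`, `IMAGE_SUFFIXES`, and the injected `ocr` callable are hypothetical names, and also accepting `.jpeg` and upper-case extensions is an added assumption.

```python
from pathlib import Path
from typing import Callable

# Assumption: also accept .jpeg and upper-case extensions such as .JPG.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def run_batch(input_dir, output_dir, ocr: Callable[[Path], str]):
    """Apply `ocr` to every image in input_dir; return (written, failed)."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for image_path in sorted(in_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip files that are not images
        try:
            text = ocr(image_path)
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            failed.append((image_path.name, str(exc)))
            continue
        # Same naming as the original loop: scan.png -> scan.png.txt
        out_path = out_dir / (image_path.name + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written, failed
```

With pytesseract and Pillow installed, the OCR step could be supplied as `lambda p: pytesseract.image_to_string(Image.open(p))`. Passing the OCR function in as an argument also makes the batch logic easy to exercise with a stand-in function when Tesseract is not available.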
Place your image files in the input_images directory. Here’s an example image:
After running the script, the extracted text will be saved in the output_texts directory. For the example image, the output text might look like this:
This is a sample text extracted from the image.
This project is licensed under the MIT License. See the LICENSE file for more details.
- Replace Placeholders:
  - Replace https://github.com/your-username/ocr-project.git with your actual GitHub repository URL.
- Include Example Images:
  - Add a sample image in the input_images directory.
  - Adjust the example image link in the README if needed.
- Update License:
  - If you choose to include licensing information, ensure the LICENSE file is added to your repository.
This README guides users through installing the project, running it, and understanding the code, and it covers licensing.