<a href="https://colab.research.google.com/github/kaustubhverma01/Projects/blob/Python-Project/Project_OCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Code for Extracting Data from **Image**

In [None]:
# Step 1: Install Required Libraries
!pip install pytesseract
# -y answers apt's confirmation prompt so the cell runs unattended
!apt-get install -y tesseract-ocr
!pip install Pillow
!pip install opencv-python

# Step 2: Import Libraries
import cv2
import pytesseract
from PIL import Image
from google.colab import files
import re  # Regular expression library for pattern matching

# Step 3: Upload Image
uploaded = files.upload()

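# Step 4: Pick the Uploaded File (uses the first one if several were uploaded)
filename = next(iter(uploaded))

# Step 5: Load the Image with OpenCV and Convert BGR to RGB
img_bgr = cv2.imread(filename)
if img_bgr is None:  # cv2.imread returns None instead of raising on unreadable files
    raise ValueError(f"Could not read image: {filename}")
img_cv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)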

# Step 6: Use OCR to Extract Text
extracted_text = pytesseract.image_to_string(img_cv)

# Step 7: Display Extracted Text
print("\nExtracted Text from the Image:")
print(extracted_text)


Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 
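
The upload step above returns a dict that maps each file name to its raw bytes, so anything the user selects ends up in it, image or not. The sketch below shows one way to keep only the entries that look like images before passing them to OpenCV. The helper `pick_image_files` and the `IMAGE_SUFFIXES` set are hypothetical names introduced for illustration, and the list of extensions is an assumption rather than something the notebook specifies.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the formats your invoices use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def pick_image_files(uploaded):
    """Return the uploaded file names that look like images, in upload order."""
    return [name for name in uploaded if Path(name).suffix.lower() in IMAGE_SUFFIXES]


# Stand-in for the dict returned by files.upload() (file name -> raw bytes).
demo_upload = {"invoice.PNG": b"", "notes.txt": b"", "receipt.jpg": b""}
print(pick_image_files(demo_upload))
```

In the notebook, `pick_image_files(uploaded)` could replace the direct iteration over `uploaded.keys()`, with an explicit error raised when the returned list is empty.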

### **Extract Date, Name & Amount (Flipkart)**

In [None]:
# Step 8: Clean and Extract Specific Information using Regex

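# Clean extracted text: drop non-printable characters but keep tab, newline, and
# carriage return (so words on adjacent lines are not fused together), then
# collapse every whitespace run into a single space
cleaned_text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', extracted_text)
cleaned_text = re.sub(r'\s+', ' ', cleaned_text)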

# Extract Client Name (line immediately after "Ship To")
name_pattern = r'Ship To\s*\n(.*)'
name_match = re.search(name_pattern, extracted_text)
client_name = name_match.group(1).strip() if name_match else "Client name not found"

# Extract Invoice Date or Order Date (Date of Purchase)
date_pattern = r'(Invoice Date|Order Date)\s*:\s*(\d{2}-\d{2}-\d{4})'
date_match = re.search(date_pattern, cleaned_text)
purchase_date = date_match.group(2) if date_match else "Date not found"

# Extract Total Amount Paid
amount_pattern = r'Grand Total\s*=\s*₹?([\d,]+\.\d{2})'
amount_match = re.search(amount_pattern, cleaned_text)
amount_paid = f"₹{amount_match.group(1)}" if amount_match else "Amount not found"

# Step 9: Display Corrected Extracted Information
print(f"\nClient Name: {client_name}")
print(f"Date of Purchase: {purchase_date}")
print(f"Total Amount Paid: {amount_paid}")


Client Name: Kaustubh Verma
Date of Purchase: 11-10-2022
Total Amount Paid: ₹1199.00
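
The three regex lookups above can also be wrapped in a single function, which makes them easier to try on text from other invoices without re-running OCR. The following is a minimal sketch that reuses the same patterns; the function name `parse_invoice`, the choice to return `None` for missing fields, and the sample text are all assumptions made for illustration, not real Tesseract output.

```python
import re


def parse_invoice(text):
    """Extract name, date, and amount from raw OCR text; a field is None when not found."""
    # The date and amount patterns are single-line, so match them on flattened text.
    flat = re.sub(r"\s+", " ", text)

    # The name pattern needs the original newlines: it takes the line after "Ship To".
    name = re.search(r"Ship To\s*\n(.*)", text)
    date = re.search(r"(?:Invoice Date|Order Date)\s*:\s*(\d{2}-\d{2}-\d{4})", flat)
    amount = re.search(r"Grand Total\s*=\s*₹?([\d,]+\.\d{2})", flat)

    return {
        "name": name.group(1).strip() if name else None,
        "date": date.group(1) if date else None,
        "amount": amount.group(1) if amount else None,
    }


# Made-up sample standing in for OCR output.
sample = "Order Date: 11-10-2022\nShip To\nKaustubh Verma\nGrand Total = 1199.00\n"
print(parse_invoice(sample))
```

Returning `None` instead of a "not found" string lets the caller distinguish a missing field from a real value when the results are later stored or filtered.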


### Pandas **DataFrame** with columns for Date, Name, and Amount

In [None]:
# Step 10: Import Pandas for DataFrame creation
import pandas as pd

# Step 11: Create a Pandas DataFrame with columns for Name, Date, and Amount
data = {
    'Name': [client_name],
    'Date of Purchase': [purchase_date],
    'Total Amount Paid': [amount_paid]
}

invoice_df = pd.DataFrame(data)

# Step 12: Display the DataFrame
print("\nDataFrame with Extracted Information:")
print(invoice_df)

# Step 13: Save the DataFrame to an Excel file (optional)
invoice_df.to_excel("invoice_data.xlsx", index=False)

# Step 14: Download the Excel file (optional, if running in Colab)
from google.colab import files
files.download("invoice_data.xlsx")


DataFrame with Extracted Information:
             Name Date of Purchase Total Amount Paid
0  Kaustubh Verma       11-10-2022          ₹1199.00
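
Both the date and the amount in the DataFrame above are stored as strings, which prevents sorting by date or summing amounts. The sketch below shows one possible way to convert them to typed values before building the DataFrame. The helpers `to_date` and `to_amount` are hypothetical, and the day-first format (`DD-MM-YYYY`) is an assumption about how the invoice prints dates.

```python
from datetime import datetime
from decimal import Decimal


def to_date(text, fmt="%d-%m-%Y"):
    """Parse a date string; the day-first default format is an assumption."""
    return datetime.strptime(text, fmt).date()


def to_amount(text):
    """Strip the rupee sign and thousands separators, then parse as a Decimal."""
    return Decimal(text.replace("₹", "").replace(",", "").strip())


print(to_date("11-10-2022"))
print(to_amount("₹1199.00"))
```

Note that `to_date` raises `ValueError` on placeholder strings such as "Date not found", so the conversion should only be applied to rows where the extraction succeeded.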

