**Let's see how pytesseract works for text extraction**

In [119]:
!pip install -q langchain pdf2image pytesseract pillow

In [120]:
# pdf2image needs poppler to render PDF pages (pytesseract itself relies on the Tesseract binary)
!apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.10).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [3]:
import os
from pdf2image import convert_from_path
import pytesseract
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

In [4]:
pdf_path = '/content/Chapter1CinderellaCWnotes.pdf'

In [10]:
def extract_text_from_scanned_pdf(pdf_path: str) -> str:
    """Extract text from scanned PDF using OCR."""
    # Convert PDF pages to images
    pages = convert_from_path(pdf_path)

    extracted_text = ""
    for page_number, page in enumerate(pages, start=0):
        text = pytesseract.image_to_string(page, lang="eng")
        extracted_text += f"\n--- Page {page_number} ---\n{text}"
    return extracted_text

Let's check the output

In [11]:
extract_text_from_scanned_pdf(pdf_path)

"\n--- Page 0 ---\n \n\n \n\n \n\n \n\n \n\n \n\nI\nit vor\n\n \n\x0c\n--- Page 1 ---\n     \n  \n    \n  \n  \n  \n \n \n\nZ '\n\n  \n\nCae\n“sat\n\ni\n=e rs\n\nDATE:\n\nae\n\n   \n\nHA” Mess QLoe bo people:\n\n3 |] R oye a ae o tek\n\na 4-1 Batt oe rage [0 mal party “wr\n\ndancing =\n\n=3-{Me :\n—— 0 ER Ly opie ty Ca\n\n \n\n \n\n \n\ni a\nee aid ap nig tea clocksot ign\n\nHF rion—and Ye\niL eSLION rYYIS¥V AY\n\n \n\n \n\n \n\n \n\n—t!: ‘Why A tne Tung s Messe! ger\n\nEe at -Cinde rer) ASC Avors pep\n\nA :\nA ne Kinz BS Mhesseng er_cam MH ce\n\nPer aa\n\n—\n\n¥.\n\x0c\n--- Page 2 ---\n \n\n \n\x0c\n--- Page 3 ---\n \n     \n   \n    \n  \n \n  \n    \n\n  \n\nD)\n\n: sa 6) eae\ntep MONEY aia NOT\n\n~ —Gnderella to attend the bak\n\na\n\nYoo CNET S =o a\n\n \n\n \n\nN = ee\nMASCuNe Vr enminine\nPUN eran ter\na. a) ane\n\nae\neT TOOL\n\n   \n\x0c\n--- Page 4 ---\ni) | an\n2a\n\n \n\neM\n\nat\n\nTTT IO ee\npO are\n\n \n\n \n\n \n\n \n\n \n\x0c"

The output from pytesseract is quite bad. Apart from a few fragments such as "DATE:" and "dancing", almost nothing is recognizable.

Let's check the quality of the pages.

In [12]:
pages = convert_from_path(pdf_path)

In [166]:
# from IPython.display import display

# for idx, page in enumerate(pages):
#   print(f"Page number {idx}")
#   display(page)
#   break

Now let's see how pytesseract reads the first page (pages[0]).

In [14]:
pytesseract.image_to_string(pages[0])

' \n\n \n\n \n\n \n\n \n\n \n\nI\nit vor\n\n \n\x0c'

**As we can see, pytesseract did a poor job. Instead of tuning it, let's quickly try a few other approaches.**

Let's try easyocr

In [118]:
!pip install -q easyocr

In [19]:
import easyocr

reader = easyocr.Reader(['en'])  # supports multiple languages



Let's see how well easyocr reads the first page. We save the PDF page as a JPEG first and pass that file to easyocr.

In [138]:
pages[0].save('/content/Chapter1CinderellaCWnotes.jpeg')
result = reader.readtext('/content/Chapter1CinderellaCWnotes.jpeg', detail=0)  # detail=0 returns only text
result



['DATE:',
 'PAGE:',
 'hatEer]',
 'Cin derella',
 'TEuzl',
 'GormoHeT',
 'essenger',
 'TmUd',
 'IiLalion',
 '(Iass',
 'Slipper',
 'OOI',
 'Cinderea',
 'meLTgE',
 '€rul',
 'S0pORe',
 'KeS',
 'PAII',
 'LO',
 'oLers',
 'exenger',
 'PeESon',
 'Who',
 'NOLd',
 'uf',
 'gieS']

easyocr reads a few words, **but mostly incorrectly**.
It also adds many junk tokens such as TEuzl and meLTgE.
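
To put a rough number on "mostly incorrectly", the tokens can be compared against a list of words that are known to be on the page. The sketch below is only an illustration: `best_similarity` and `recognized_fraction` are hypothetical helpers, the 0.8 threshold is an arbitrary choice, and the reference list is an assumption typed from the Textract output shown later in this notebook.

```python
from difflib import SequenceMatcher


def best_similarity(token: str, reference_words: list[str]) -> float:
    """Highest case-insensitive similarity (0.0 to 1.0) between token and any reference word."""
    return max(
        SequenceMatcher(None, token.lower(), ref.lower()).ratio()
        for ref in reference_words
    )


def recognized_fraction(tokens: list[str], reference_words: list[str], threshold: float = 0.8):
    """Return (fraction of tokens that closely match a reference word, the matching tokens)."""
    matched = [t for t in tokens if best_similarity(t, reference_words) >= threshold]
    return len(matched) / len(tokens), matched


# A few easyocr tokens copied from the output above.
easyocr_tokens = ["DATE:", "Cin derella", "TEuzl", "Slipper", "meLTgE", "PeESon"]
# Assumed reference words (not produced by easyocr; typed by hand for this example).
reference_words = ["DATE:", "PAGE:", "Cinderella", "Cruel", "Messenger", "Slipper", "person"]

fraction, matched = recognized_fraction(easyocr_tokens, reference_words)
print(f"{fraction:.0%} of tokens matched:", matched)
```

A fuzzy match is used instead of exact equality so that near misses such as "Cin derella" still count, while junk tokens score low.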

Now let's try a **Transformer vision model** for the OCR task.

In [139]:
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")


Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-base-handwritten and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [141]:
image = Image.open('/content/Chapter1CinderellaCWnotes.jpeg')
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(text)

['Print export']


The Transformer model returns completely unrelated text.

Let's try another page.

In [144]:
pages[1].save('/content/Chapter1CinderellaCWnotes1.jpeg')
image = Image.open('/content/Chapter1CinderellaCWnotes1.jpeg')
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(text)

['You must be able to provide a close connection to the']


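Even on this page, the output is completely unrelated to what is written.

This is a clear case of hallucination. One likely contributing factor: this TrOCR checkpoint is trained on images of a single text line, while here it receives a whole page.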
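
The handwritten TrOCR checkpoint is trained on single-line images, so one thing that could be tried is cropping each detected text line and decoding the crops one by one. The sketch below is not something this notebook ran. `box_to_crop_rect` is a hypothetical helper that turns a four-point box (the shape easyocr returns with its default `detail=1`) into a PIL-style crop rectangle, and the coordinates are made up.

```python
def box_to_crop_rect(box, pad=4, image_size=None):
    """Convert a 4-point box [[x, y], ...] into a (left, upper, right, lower) tuple."""
    xs = [point[0] for point in box]
    ys = [point[1] for point in box]
    # Grow the box slightly so ascenders and descenders are not cut off.
    left, upper = int(min(xs)) - pad, int(min(ys)) - pad
    right, lower = int(max(xs)) + pad, int(max(ys)) + pad
    if image_size is not None:
        # Clamp to the image bounds when the size is known.
        width, height = image_size
        left, upper = max(left, 0), max(upper, 0)
        right, lower = min(right, width), min(lower, height)
    return left, upper, right, lower


# Made-up box for illustration only.
example_box = [[10, 20], [210, 22], [209, 60], [9, 58]]
print(box_to_crop_rect(example_box, pad=4, image_size=(200, 300)))

# Untested usage idea with the objects defined earlier in this notebook:
# for box, _text, _conf in reader.readtext('/content/Chapter1CinderellaCWnotes.jpeg'):
#     line_image = image.crop(box_to_crop_rect(box, image_size=image.size)).convert("RGB")
#     pixel_values = processor(images=line_image, return_tensors="pt").pixel_values
#     print(processor.batch_decode(model.generate(pixel_values), skip_special_tokens=True))
```

Whether this helps on these scans is an open question, since it depends on easyocr finding sensible line boxes in the first place.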

Let's verify the image we passed again.

In [164]:
# display(image)

Since all the open-source approaches failed, let's see how **AWS Textract** works.

In [148]:
!pip install -q boto3 python-dotenv

In [149]:
import boto3
from dotenv import load_dotenv

load_dotenv('/content/.env')
# boto3 picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment by itself,
# so only the region is read here (falling back to us-east-1 if AWS_REGION is not set).
region_name = os.getenv("AWS_REGION", "us-east-1")

client = boto3.client("textract", region_name=region_name)

Now let's pass page 1 to AWS Textract and check the output.

In [152]:
# Read the image bytes (the first page, saved as JPEG above)
with open('/content/Chapter1CinderellaCWnotes.jpeg', "rb") as f:
    image_bytes = f.read()

# Call Textract
response = client.analyze_document(
    Document={"Bytes": image_bytes},
    FeatureTypes=["TABLES", "FORMS"]
)
# Extract text from the LINE blocks only
text = []
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        text.append(block["Text"])

print('\n'.join(text))

DATE:
PAGE:
Chapter-1
Cinderella
I
Word wall :-
Cruel
Fairy Godmother
Messenger Warned
Invitation Glass slipper
Royal ball Cinderella
11
Word meaning:
1.
Cruel- - someone who causes pain
to others
2.
Messenger - a person who gives


Quite good. Let's try the next page.

In [154]:
# Read the image bytes (the second page, saved as JPEG above)
with open('/content/Chapter1CinderellaCWnotes1.jpeg', "rb") as f:
    image_bytes = f.read()

# Call Textract
response = client.analyze_document(
    Document={"Bytes": image_bytes},
    FeatureTypes=["TABLES", "FORMS"]
)
# Extract text from the LINE blocks only
text = []
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        text.append(block["Text"])

print('\n'.join(text))

DATE:
PAGE:
a message to people
3.
Royal - related to the king or queen
4.
Ball - a large formal party with
dancing
5.
Meekly - quietly
6.
Midnight - 12:0'clock at night
III
Question and Answer
Q1.
Why did the King's messenger come
at Cinderella's doorstep?
A1.
The King's messenger came at
Cinderella's doorstep with an


In the end, AWS Textract gives the best result. There are a few minor problems here and there, but an LLM can fix those.
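
The block-filtering loop used in the two cells above can be pulled into one small function. The sketch below is a hypothetical helper, `lines_from_textract`, with an optional filter on the `Confidence` score that Textract attaches to each block. The sample response is hand-written with made-up values, not real Textract output.

```python
def lines_from_textract(response: dict, min_confidence: float = 0.0) -> list[str]:
    """Collect the text of LINE blocks whose confidence (0-100) is at least min_confidence."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Hand-written stand-in for a Textract response (all values are made up).
fake_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Chapter-1", "Confidence": 99.1},
        {"BlockType": "LINE", "Text": "11", "Confidence": 41.0},
        {"BlockType": "WORD", "Text": "Chapter-1", "Confidence": 99.1},
    ]
}

print(lines_from_textract(fake_response))                      # every LINE block
print(lines_from_textract(fake_response, min_confidence=80))   # only confident lines
```

A confidence threshold is one possible way to flag doubtful lines, but the right cut-off for these handwritten pages would have to be found by inspection.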

Let's extract the text from the full PDF and pass it to Llama to correct the mistakes.

In [155]:
import io

def extract_text_from_scanned_pdf(pdf_path: str) -> list[list[str]]:
    """Extract text from a scanned PDF with AWS Textract; returns one list of lines per page."""
    all_text = []
    # Convert PDF pages to images
    pages = convert_from_path(pdf_path)
    for idx, page in enumerate(pages):
        print(f"Processing page no {idx+1}")
        # Serialize the page image to PNG bytes in memory
        img_bytes_io = io.BytesIO()
        page.save(img_bytes_io, format="PNG")
        img_bytes = img_bytes_io.getvalue()

        # Pass to Textract
        response = client.detect_document_text(
            Document={"Bytes": img_bytes}
        )

        # Keep only the text of LINE blocks
        text = []
        for block in response["Blocks"]:
            if block["BlockType"] == "LINE":
                text.append(block["Text"])
        all_text.append(text)
    return all_text

output = extract_text_from_scanned_pdf(pdf_path)

Processing page no 1
Processing page no 2
Processing page no 3
Processing page no 4
Processing page no 5


In [156]:
output

[['DATE:',
  'PAGE:',
  'Chapter-1',
  'Cinderella',
  'I',
  'Word wall :-',
  'Cruel',
  'Fairy Godmother',
  'Messenger Warned',
  'Invitation Glass slipper',
  'Royal ball Cinderella',
  '11',
  'Word meaning:',
  '1.',
  'Cruel- - someone who causes pain',
  'to others',
  '2.',
  'Messenger - a person who gives'],
 ['DATE:',
  'PAGE:',
  '1',
  'a message to people',
  '3.',
  'Royal - related to the king or queen',
  '4.',
  'Ball - a large formal party with',
  'dancing',
  '5.',
  'Meekly - quietly',
  '6.',
  "Midnight - 12:0'clock at night",
  'III',
  'Question and Answer',
  'Q1.',
  "Why did the King's messenger come",
  "at Cinderella's doorstep?",
  'A1.',
  "The King's messenger came at",
  "Cinderella's doorstep with an"],
 ['DATE:',
  'PAGE:',
  'invitation to the royal ball.',
  'Q2.',
  'Why was the King organising the',
  'royal ball?',
  'A2.',
  'The King was organising the royal',
  'ball as he wanted his son to',
  'choose a bride.',
  'Q3.',
  "Why did Cinder

Using Llama through Groq to make the text coherent.
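
The printed "DATE:" and "PAGE:" headers repeat on every page, so they could also be removed deterministically before the text reaches the LLM, instead of relying only on the prompt. A minimal sketch, assuming the per-page list-of-lines structure shown above; `drop_page_headers` and `HEADER_LINES` are hypothetical names.

```python
HEADER_LINES = {"DATE:", "PAGE:"}


def drop_page_headers(pages: list[list[str]]) -> list[str]:
    """Flatten per-page line lists into one list, skipping the printed page headers."""
    return [
        line
        for page in pages
        for line in page
        if line.strip() not in HEADER_LINES
    ]


# Two shortened pages in the same shape as the Textract output above.
sample_pages = [
    ["DATE:", "PAGE:", "Chapter-1", "Messenger - a person who gives"],
    ["DATE:", "PAGE:", "1", "a message to people"],
]

print("\n".join(drop_page_headers(sample_pages)))
```

The stray "1" after "PAGE:" is deliberately left alone here, because a bare number could just as well be a list marker that belongs to the notes.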

In [157]:
!pip install -q langchain_groq

In [159]:
from langchain_groq import ChatGroq

# The environment variable is named GROK_KEY, but it holds a Groq API key.
groq_key = os.getenv("GROK_KEY")
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    api_key=groq_key,
    temperature=0.1
)
# Prompt to instruct LLM to fix issues.
ocr_prompt = """
You are an OCR and language expert.
You have been provided text extracted from a student's notes.
Some text may be split across multiple lines.
Your job is to combine the text and give a clean output.
Only combine text based on the order given.
Ignore these words: ["DATE:","PAGE:"]

text: \n"""
extracted_text = '\n'.join(['\n'.join(x) for x in output])
ocr_extracted_input = ocr_prompt + extracted_text

llm_response = llm.invoke(ocr_extracted_input).content


In [160]:
#Input we passed to LLM
print(extracted_text)

DATE:
PAGE:
Chapter-1
Cinderella
I
Word wall :-
Cruel
Fairy Godmother
Messenger Warned
Invitation Glass slipper
Royal ball Cinderella
11
Word meaning:
1.
Cruel- - someone who causes pain
to others
2.
Messenger - a person who gives
DATE:
PAGE:
1
a message to people
3.
Royal - related to the king or queen
4.
Ball - a large formal party with
dancing
5.
Meekly - quietly
6.
Midnight - 12:0'clock at night
III
Question and Answer
Q1.
Why did the King's messenger come
at Cinderella's doorstep?
A1.
The King's messenger came at
Cinderella's doorstep with an
DATE:
PAGE:
invitation to the royal ball.
Q2.
Why was the King organising the
royal ball?
A2.
The King was organising the royal
ball as he wanted his son to
choose a bride.
Q3.
Why did Cinderella's stepmother
not want her to attend the
royal ball ?
DATE:
PAGE:
1
A3,
The stepmother did not want
Cinderella to attend the ball as
she was jealous of her beauty.
IV
Activity Genders
Masculine LT: Feminine
1.
Son Datry Daughter
2.
Milkman M - Milkmai

In [161]:
# Output from LLM
print(llm_response)

Here's the cleaned and combined text based on the order given:

Chapter-1
Cinderella
I
Word wall :-
Cruel - someone who causes pain to others
Fairy Godmother
Messenger - a person who gives a message to people
Invitation
Glass slipper
Royal ball Cinderella
11
Word meaning:
1. Cruel - someone who causes pain to others
2. Messenger - a person who gives a message to people
3. Royal - related to the king or queen
4. Ball - a large formal party with dancing
5. Meekly - quietly
6. Midnight - 12:0'clock at night
III
Question and Answer
Q1. Why did the King's messenger come at Cinderella's doorstep?
A1. The King's messenger came at Cinderella's doorstep with an invitation to the royal ball.
Q2. Why was the King organising the royal ball?
A2. The King was organising the royal ball as he wanted his son to choose a bride.
Q3. Why did Cinderella's stepmother not want her to attend the royal ball?
A3. The stepmother did not want Cinderella to attend the ball as she was jealous of her beauty.
IV
Acti

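So, with AWS Textract and an LLM, we are able to read the whole text quite well. The LLM did rearrange the word wall (for example, "Warned" was dropped), so the output still deserves a quick check.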
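
One cheap way to check that the LLM only rearranged the OCR text and did not invent anything is to list the words in its reply that never occur in the OCR input. This is a rough sketch with hypothetical helper names; it ignores word order and repetition, so it catches added words but not dropped or reshuffled ones.

```python
import re


def word_set(text: str) -> set[str]:
    """Lower-cased set of word-like tokens (letters, digits, apostrophes)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def words_not_in_source(llm_text: str, ocr_text: str) -> set[str]:
    """Words that appear in the LLM reply but nowhere in the OCR text."""
    return word_set(llm_text) - word_set(ocr_text)


# Short excerpt in the style of the texts above.
ocr_excerpt = "1.\nCruel- - someone who causes pain\nto others"
llm_excerpt = "1. Cruel - someone who causes pain to others"

print(words_not_in_source(llm_excerpt, ocr_excerpt))   # empty set: nothing was added
print(words_not_in_source("Cruel - someone who causes harm", ocr_excerpt))
```

With the variables from this notebook, the call would be `words_not_in_source(llm_response, extracted_text)`. The reply's introductory sentence ("Here's the cleaned and combined text...") would be flagged too, which is expected.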

I could have used AWS Bedrock for the LLM as well, but chose Groq since it's free.

Now let's extract the questions and answers from the whole text in JSON format.

In [162]:
extract_question_answer_prompt = """
You are a language expert.
You are given a list of questions and answers.
Do not return any extra information.
Return the question-answer pairs as a list of dictionaries in the format below:
[{"question":..., "answer":...},
{"question":..., "answer":...},
]

text:
"""

# The prompt now ends with a "text:" marker so the instructions and the data do not run together
extract_question_answer_with_data = extract_question_answer_prompt + llm_response

question_answer = llm.invoke(extract_question_answer_with_data).content

In [163]:
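import ast

# literal_eval parses the list without executing arbitrary code, unlike eval
ast.literal_eval(question_answer)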

[{'question': "Why did the King's messenger come at Cinderella's doorstep?",
  'answer': "The King's messenger came at Cinderella's doorstep with an invitation to the royal ball."},
 {'question': 'Why was the King organising the royal ball?',
  'answer': 'The King was organising the royal ball as he wanted his son to choose a bride.'},
 {'question': "Why did Cinderella's stepmother not want her to attend the royal ball?",
  'answer': 'The stepmother did not want Cinderella to attend the ball as she was jealous of her beauty.'}]

**This is pretty good.**
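
LLM replies are not guaranteed to contain only the list, so parsing can be made a little more defensive. The sketch below uses a hypothetical helper, `parse_qa_pairs`, and a made-up reply. It pulls out the outermost bracketed list, tries strict JSON first, and falls back to `ast.literal_eval`, which accepts Python-style quotes and trailing commas without executing code.

```python
import ast
import json
import re


def parse_qa_pairs(raw: str) -> list[dict]:
    """Parse a list of {"question", "answer"} dictionaries out of an LLM reply."""
    # Grab everything from the first "[" to the last "]".
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no list found in the reply")
    payload = match.group(0)
    try:
        pairs = json.loads(payload)
    except json.JSONDecodeError:
        # Python literal syntax (single quotes, trailing commas); never runs code.
        pairs = ast.literal_eval(payload)
    for pair in pairs:
        if not isinstance(pair, dict) or set(pair) != {"question", "answer"}:
            raise ValueError(f"unexpected item: {pair!r}")
    return pairs


# Made-up reply with a preamble and a trailing comma, as an LLM might produce.
reply = 'Sure, here is the list:\n[{"question": "Q?", "answer": "A."},\n]'
print(parse_qa_pairs(reply))
```

The regular expression is deliberately simple and would misfire if the preamble itself contained square brackets, so this is a starting point and not a complete parser.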

**We can use this, along with more PDFs, in our RAG application later.**