<a href="https://colab.research.google.com/github/henryj18/AI4Math/blob/main/PDF_Extraction/GOT_OCR_2_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing packages

In [None]:
pip install tiktoken

In [None]:
pip install verovio

In [None]:
pip install accelerate

In [None]:
pip install pdf2image


In [None]:
!apt-get install poppler-utils

## Steps involved in converting PDF to text + LaTeX

1. Convert the PDF to images (one image per page).
2. Use the OCR model to extract the text and the mathematical equations (in LaTeX).
3. Preprocess the extracted text into our format.

In Python, the backslash (`\`) is the escape character inside string literals. A string holding LaTeX such as `\alpha` contains a single backslash. When that string is displayed as part of a data structure (for example, when a list is printed), Python shows the `repr` of each element, and the `repr` writes every backslash as `\\`. The stored text is unchanged; only the display differs. Printing the string variable directly shows the single backslash.

This explains the apparent ambiguity: the same text looks different depending on whether the variable itself or the data structure containing it is printed. See this
[link](https://stackoverflow.com/questions/17327202/python-replace-single-backslash-with-double-backslash#:~:text=The%20backslash%20indicates%20a%20special,quotation%20as%20an%20escape%20character.)
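
The difference between what a string contains and how a list displays it can be checked directly. The snippet below is a minimal sketch; the sample string `\alpha` is only an illustration and is not taken from a specific PDF.

```python
# A LaTeX fragment written as a raw string: it contains ONE backslash.
text = r'\alpha'

print(len(text))   # 6 characters: the backslash plus "alpha"
print(text)        # print() shows the string itself
print([text])      # a list shows repr() of each element, which escapes the backslash

# The element stored in the list is identical to the original string.
items = [text]
assert items[0] == text
assert items[0].count('\\') == 1
```

The doubled backslash therefore appears only in the displayed representation, not in the data that gets written to the CSV later.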

### STEP-1:- PDF to Image

link:- [pdf2image](https://pypi.org/project/pdf2image/)

Note:- Make sure poppler is installed in your environment, using the command

```
!apt-get install poppler-utils
```




In [None]:
from pdf2image import convert_from_path
import os


pdf_path = '<path_to_the_pdf>'

# Convert the PDF to a list of images (one image per page)
images = convert_from_path(pdf_path)

# Folder for the page images; exist_ok avoids an error when the cell is re-run
os.makedirs('output_folder', exist_ok=True)

for i, image in enumerate(images):
    image_path = os.path.join('output_folder', f'page_{i + 1}.png')
    image.save(image_path, 'PNG')

### STEP-2:- Using OCR Model to extract Text

Hugging Face model link:- [GOT-OCR-2.0](https://huggingface.co/stepfun-ai/GOT-OCR2_0)

Requirements:-


torch==2.0.1

torchvision==0.15.2

transformers==4.37.2

tiktoken==0.6.0

verovio==4.3.1

accelerate==0.28.0





Note:- Load the model to GPU ('cuda') for faster processing

In [None]:
from transformers import AutoModel, AutoTokenizer

# Loading the tokenizer for the pre-trained OCR model
tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
# Loading the pre-trained OCR model
model = AutoModel.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True, low_cpu_mem_usage=True, device_map='cuda', use_safetensors=True, pad_token_id=tokenizer.eos_token_id)

model = model.eval().cuda()

### STEP 3:- Preprocessing the output text



In [None]:
# Function to change the delimiters of the LaTeX parts of the text.
# For mathematical equations the model output looks like
# 'Let \(\alpha\) and \(\beta\) be real numbers' or
# \[\left(\frac{\sin \alpha}{\cos \beta}+\frac{\cos \beta}{\sin \alpha}+\frac{\cos \alpha}{\sin \beta}+\frac{\sin \beta}{\cos \alpha}\right)^{2}\]
# i.e. the LaTeX is enclosed in \( ... \) or \[ ... \].
# This function rewrites both forms so that the equation sits between $ ... $.
import re

def Transform_Latex_Delimiters(text):
  result = re.sub(r'\\\(', '$', text)
  result = re.sub(r'\\\[', '$', result)
  result = re.sub(r'\\\)', '$', result)
  result = re.sub(r'\\\]', '$', result)
  return result
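
The function above maps both inline `\( \)` and display `\[ \]` delimiters to a single `$`. If the target format needs to tell the two apart, one option is to map display math to `$$` instead. The following is only a sketch of that idea; `transform_latex_delimiters_keep_display` is a hypothetical name and not part of the original pipeline.

```python
import re

def transform_latex_delimiters_keep_display(text):
    # Hypothetical variant: inline \( ... \) becomes $ ... $,
    # display \[ ... \] becomes $$ ... $$.
    text = re.sub(r'\\[()]', '$', text)
    text = re.sub(r'\\[\[\]]', '$$', text)
    return text

sample = r'Let \(\alpha\) be real. Then \[\frac{1}{2}\]'
converted = transform_latex_delimiters_keep_display(sample)
print(converted)
```

Commands such as `\alpha` or `\frac` are left alone because the patterns only match a backslash that is immediately followed by a parenthesis or a square bracket.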

Note:- The pattern used in the function below changes from PDF to PDF; adjust it after inspecting how questions are numbered in your PDF.

In [None]:
# Function to split a single block of text into separate questions

def SplitQuestions(text):
  pattern = r'Q\.\s*\d+'
  questions = re.split(pattern, text)
  questions = [q.strip() for q in questions if q.strip()]
  return questions
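
`re.split` discards the matched `Q. N` markers, and any text before the first marker (for example the section instructions) comes back as the first element. If the question numbers are needed later, a capturing group in the pattern keeps them. The sketch below assumes the same `Q. N` numbering; `split_questions_with_numbers` and the sample text are hypothetical.

```python
import re

def split_questions_with_numbers(text):
    # With a capturing group, re.split also returns the captured numbers:
    # [preamble, '1', body1, '2', body2, ...]
    parts = re.split(r'Q\.\s*(\d+)', text)
    preamble = parts[0].strip()
    # parts[1::2] are the question numbers, parts[2::2] the question bodies
    questions = {int(num): body.strip() for num, body in zip(parts[1::2], parts[2::2])}
    return preamble, questions

sample = 'Instructions here. Q. 1 Find $x$. Q.2 Find $y$.'
preamble, questions = split_questions_with_numbers(sample)
print(preamble)
print(questions)
```

Returning the preamble separately also makes it possible to drop the instructions without relying on how they start.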

In [None]:
def ExtractQuestionsFromImage(img_path):
  text_in_image = model.chat(tokenizer, img_path, ocr_type='format')
  transformed_text = Transform_Latex_Delimiters(text_in_image)
  questions = SplitQuestions(transformed_text)
  return questions

In [None]:
import pandas as pd
def ConvertToCSV(questions, csv_file_path):
  questions_df = pd.DataFrame(questions, columns=['Question'])
  questions_df.to_csv(csv_file_path, index=False)
  return
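
Extracted questions usually contain commas, backslashes and line breaks. CSV writers quote such fields, so the text should survive a write and read cycle unchanged. The check below is a sketch that uses the standard-library `csv` module instead of pandas so that it runs without extra packages; the sample questions are made up for illustration.

```python
import csv
import io

# Sample questions with a comma, LaTeX backslashes and an embedded line break
questions = [r'Find $\frac{a}{b}$, given $a=1$.', 'Line one\nline two']

# Write a one-column CSV into an in-memory buffer
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Question'])
writer.writerows([q] for q in questions)

# Read it back and compare with the original list
buffer.seek(0)
rows = list(csv.reader(buffer))
restored = [row[0] for row in rows[1:]]
print(restored == questions)
```

The same round trip can be tried on the real file by reading it back after `ConvertToCSV` and comparing a few rows against `all_questions`.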

In [None]:
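folder_path = '<path_to_the_image_folder>'
all_questions = []
# os.listdir returns names in arbitrary order, so sort by the page number in
# the filename (assumes names like page_1.png, as saved in STEP-1)
filenames = sorted(os.listdir(folder_path), key=lambda name: int(re.search(r'\d+', name).group()))
for filename in filenames:
  img_path = os.path.join(folder_path, filename)
  print(filename)
  questions = ExtractQuestionsFromImage(img_path)
  # Drop the instruction block (it starts with '\title' in the OCR output).
  # This is PDF dependent and may not be needed for every PDF.
  questions = [question for question in questions if not question.startswith('\\title')]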
  all_questions.extend(questions)

ConvertToCSV(all_questions, '<your_csv_file_path>')


## Testing on a single image

In [None]:
import time

def TestingWithImage(img_path):
  start = time.time()
  #extracting text from image
  text_in_image = model.chat(tokenizer, img_path, ocr_type='format')
  end = time.time()
  # Inference time
  print(f"Inference Time:- {end-start}")


  print("************************* text in the image from ocr ***********************")
  print(text_in_image)

  #preprocessing the output text to convert into required format

  print("************************* text in the image after transformation ***********************")
  transformed_text = Transform_Latex_Delimiters(text_in_image)
  print(transformed_text)

  print("************************* questions ***********************")
  questions = SplitQuestions(transformed_text)
  print(questions)

  print("************************* questions after removing Instructions ***********************")
  questions = [question for question in questions if not question.startswith('\\title')]
  print(questions)

TestingWithImage('<image path>')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


17.631876230239868
************************* text in the image from ocr ***********************
\title{
Section 1 (Maximum marks: 24)
}
- This section contains EIGHT (08) questions.
- The answer to each question is a SINGLE DIGIT INTEGER ranging from 0 TO 9, BOTH INCLUSIVE.
- For each question, enter the correct integer corresponding to the answer using the mouse and the onscreen virtual numeric keypad in the place designated to enter the answer.
- Answer to each question will be evaluated according to the following marking scheme:
Full Marks : +3 If ONLY the correct integer is entered;
Zero Marks : 0 If the question is unanswered;
Negative Marks : -1 In all other cases.
Q. 1 Let \(\alpha\) and \(\beta\) be real numbers such that \(-\frac{\pi}{4}<\beta<0<\alpha<\frac{\pi}{4}\). If \(\sin (\alpha+\beta)=\frac{1}{3}\) and \(\cos (\alpha-\beta)=\frac{2}{3}\), then the greatest integer less than or equal to
\[
\left(\frac{\sin \alpha}{\cos \beta}+\frac{\cos \beta}{\sin \alpha}+\frac{\cos \