A framework made for converting an image file to PDF file.
Usage of just 4 packages of Python.
For latest code on how to convert multiple image files to pdf, check multiple-files branch.
- First, this will seacrh for the text inside the image file(most probably .jpg/.PNG file).
- Then, it'll fetch the text & store it in a variable.
- And then, writes it into a docx file(So that if it extracts wrong text, you can manually type the correct text).
- And finally, convert the word file hence created into a pdf file.
- You need to download tesseract.
- You need following packages of python:
- PIL
- pytesseract
- docx
- docx2pdf
- IDLE (PyCharm Community Edition preferrable)
- Go to main file.
- Just check the following lines in the code:
text = extractTextFromImg(imgPath) # - extract text from image file
appendDataToWord(wordPath, text) # - append data into word file
convertDocxToPdf(wordPath, pdfPath) # - convert it into pdf
- You don't need to change anything in this file. The changes which you need to make is in variables file.
- Just change the value of following 4 variables:
tesseractPath # the tesseract path you just downloaded. The one written in the file is default path
wordPath # the word file path. Put the word file path under wordFiles directory for better readability
pdfPath # the pdf file path. Put the pdf file path under pdfFiles directory for better readability
Nitin Kumar |