### Install libraries with pip

* **Pytesseract** (Python-tesseract) is an optical character recognition (OCR) tool for Python: it recognizes and “reads” the text embedded in images. It is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script for tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract prints the recognized text instead of writing it to a file.

* **pdf2image** is a Python (3.5+) module that wraps pdftoppm and pdftocairo to convert PDF pages to PIL Image objects.

In [None]:
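#pip install pytesseract
#pip install Pillow
#pip install pdf2image
# Note: pytesseract needs the Tesseract OCR engine installed separately
# (`pip install tesseract` does not provide it), and pdf2image needs Poppler.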

### Import the libraries installed above

In [166]:
import re
import pytesseract
import argparse
import os
import pdf2image
import time
import numpy as np
import pandas as pd

from PIL import Image
from pdf2image import convert_from_path

### Set the path to the PDF file

In [29]:
PDF_file = "D:/External Data/Basic-Salary-Slip-Example/PaySlip_and_Employee_account/payslip.pdf"

### Convert the pdf to images

Here we convert every page of the PDF into an image. pdf2image first renders each page as a PIL image; these in-memory images then need to be saved to disk as JPEG or PNG files.

Once the pages have been saved as images, the images are converted to text.

In [31]:
# Declare constants
PDF_PATH = PDF_file  # reuse the path defined in the previous cell
POPPLER_PATH = r'D:\External Data\Basic-Salary-Slip-Example\poppler-0.68.0\bin'
DPI = 200
OUTPUT_FOLDER = None
FIRST_PAGE = None
LAST_PAGE = None
FORMAT = 'jpg'
THREAD_COUNT = 1
USERPWD = None
USE_CROPBOX = False
STRICT = False

def pdftopil():
    # Convert every page of the PDF into a PIL image and report how long it took
    start_time = time.time()
    pil_images = pdf2image.convert_from_path(
        PDF_PATH, dpi=DPI, output_folder=OUTPUT_FOLDER,
        first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT,
        thread_count=THREAD_COUNT, userpw=USERPWD,
        use_cropbox=USE_CROPBOX, strict=STRICT, poppler_path=POPPLER_PATH)
    print("Time taken : " + str(time.time() - start_time))
    return pil_images
    
def save_images(pil_images):
    # Save each PIL image to disk as page_<n>.jpg, numbering pages from 1
    for index, image in enumerate(pil_images, start=1):
        image.save("page_" + str(index) + ".jpg")

if __name__ == "__main__":
    pil_images = pdftopil()
    save_images(pil_images)

Time taken : 0.5772304534912109


### Point pytesseract to the Tesseract executable, otherwise it throws an error

In [32]:
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
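
Rather than hard-coding the path, one option is to look the executable up on the `PATH` first. This is a minimal sketch using the standard-library `shutil.which`; the helper name `find_tesseract` is hypothetical, and the fallback is simply the Windows path from the cell above.

```python
import shutil

# Fallback taken from the cell above; adjust it for your own installation.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def find_tesseract(fallback: str = DEFAULT_TESSERACT) -> str:
    """Return the tesseract executable found on PATH, else the fallback path."""
    found = shutil.which("tesseract")
    return found if found is not None else fallback

tesseract_path = find_tesseract()
print(tesseract_path)
# In the notebook this would then be assigned as:
#   pytesseract.pytesseract.tesseract_cmd = tesseract_path
```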

### Convert the images to string or text format

In [33]:
text = pytesseract.image_to_string(Image.open('page_1.jpg'))

In [34]:
text

'Pay Slip Sample\n\n \n\nGKW English School\n\nJayanagar, Bangalore\n\nPay slip for the month of JAN, 2008\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nEmp Name SAPNA MISHRA Date of Joining 05/04/2005\nEmp No 6 Date of Birth 22/09/1979\nFather MR AMIT MISHRA Mother MRS PAYAL MISHRA\nBank UBI Bank A/c No 2149237429742\nPAN No JAIPS73519L PF No OR/ 9383/38\nLOP / LWP 6 Pay Days 25\nCurrent Basic 8000 Post / Designation Asst Manager\nIncome Amount! Arrears| |Deduction Amount| Arrears\nBASIC 6452 O| |PF 774| oO\nDA 5291 Oo) [I TAX o| oO\nADVANCE o| 0\n|\n|\nGross 11743 Q| |Deductions 774 0\nNet Pay: 10969 (Rs, ONE ZERO NINE SIX NINE and paise ZERO only)\nLeave Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec Availed Cl Bal\ncL 4 1 5 9\nML | 5 Zt 6 39\n\n \n\n \n\n \n\n \n\n \n\n 

### Text parsing and NLP
We will use text parsing and NLP to turn the unstructured text into meaningful arrays and tables.
The objective is to get all the details of a person from their payslip.

### Remove the newline characters (\n)

In [37]:
text_no_space = text.replace('\n', '')

In [40]:
text_no_space

'Pay Slip Sample GKW English SchoolJayanagar, BangalorePay slip for the month of JAN, 2008                                               Emp Name SAPNA MISHRA Date of Joining 05/04/2005Emp No 6 Date of Birth 22/09/1979Father MR AMIT MISHRA Mother MRS PAYAL MISHRABank UBI Bank A/c No 2149237429742PAN No JAIPS73519L PF No OR/ 9383/38LOP / LWP 6 Pay Days 25Current Basic 8000 Post / Designation Asst ManagerIncome Amount! Arrears| |Deduction Amount| ArrearsBASIC 6452 O| |PF 774| oODA 5291 Oo) [I TAX o| oOADVANCE o| 0||Gross 11743 Q| |Deductions 774 0Net Pay: 10969 (Rs, ONE ZERO NINE SIX NINE and paise ZERO only)Leave Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec Availed Cl BalcL 4 1 5 9ML | 5 Zt 6 39      Employee Account of SAPNA MISHRA for Jan 2008 to Apr 2008'
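
Deleting every `\n` outright is what fuses neighbouring words in the output above (`SchoolJayanagar`, `MISHRABank`). A possible alternative, sketched here on a short excerpt of the OCR output, is to collapse each run of whitespace into a single space so that line boundaries survive as word boundaries. The helper name `normalize_whitespace` is hypothetical, and adopting it would shift the character offsets used in the cells that follow.

```python
import re

def normalize_whitespace(raw: str) -> str:
    """Collapse every run of spaces/newlines into one space and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()

# Excerpt of the raw OCR text shown earlier
sample = ("Pay Slip Sample\n\n \n\nGKW English School\n\n"
          "Jayanagar, Bangalore\n\nPay slip for the month of JAN, 2008")

normalized = normalize_whitespace(sample)
print(normalized)
```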

### Type of PDF

In [68]:
## TYPE
type_    = text_no_space[0:8]
print(type_)

Pay Slip


In [66]:
pattern  = re.compile(r'Pay.Slip', re.IGNORECASE)
matches  = pattern.finditer(text_no_space)

In [67]:
for match in matches:
    print(match)

<re.Match object; span=(0, 8), match='Pay Slip'>
<re.Match object; span=(54, 62), match='Pay slip'>
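
The span printed above can also be read from the match object directly, which avoids copying indices by hand. A small sketch on an excerpt of `text_no_space` (the variable names are illustrative):

```python
import re

excerpt = "Pay Slip Sample GKW English SchoolJayanagar, BangalorePay slip for the month of JAN, 2008"

pattern = re.compile(r"Pay.Slip", re.IGNORECASE)
first = pattern.search(excerpt)   # first occurrence only, or None if absent

if first is not None:
    # Slice with the match's own offsets instead of hard-coded numbers
    doc_type = excerpt[first.start():first.end()]
    print(doc_type, first.span())
```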


### Organization & City Name

Generally, the organization name follows the first occurrence of "Pay Slip". Since we found that first occurrence above, we can slice the text that comes right after it to get the organization name.

In [69]:
Org_name   = text_no_space[9:100]
print(Org_name)

Sample GKW English SchoolJayanagar, BangalorePay slip for the month of JAN, 2008           


In [71]:
pattern  = re.compile(r',', re.IGNORECASE)
matches  = pattern.finditer(Org_name)

In [72]:
for match in matches:
    print(match)

<re.Match object; span=(34, 35), match=','>
<re.Match object; span=(74, 75), match=','>


In [77]:
Org_  = Org_name[6:34]
print(Org_)
City  = Org_name[36:45]
print(City)

 GKW English SchoolJayanagar
Bangalore
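
Because removing the newlines glues `School` and `Jayanagar` together, the slice above cannot separate the organization from the locality. One alternative, sketched below, works on the raw OCR text instead. It assumes this particular layout: title on the first non-empty line, organization on the second, and `locality, city` on the third. Other payslips may well differ, and the variable names are illustrative.

```python
# Excerpt of the raw OCR text (before the newlines were removed)
raw = ("Pay Slip Sample\n\n \n\nGKW English School\n\n"
       "Jayanagar, Bangalore\n\nPay slip for the month of JAN, 2008")

# Keep only lines that contain something other than whitespace
lines = [line.strip() for line in raw.split("\n") if line.strip()]

organization = lines[1]                                  # assumed position
locality, city = [part.strip() for part in lines[2].split(",", 1)]

print(organization)
print(locality, "|", city)
```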


### Salary Month & Year

In [84]:
pattern  = re.compile(r'[2]\d\d\d', re.IGNORECASE)
matches  = pattern.finditer(text_no_space)

In [85]:
for match in matches:
    print(match)

<re.Match object; span=(85, 89), match='2008'>
<re.Match object; span=(180, 184), match='2005'>
<re.Match object; span=(283, 287), match='2149'>
<re.Match object; span=(287, 291), match='2374'>
<re.Match object; span=(291, 295), match='2974'>
<re.Match object; span=(774, 778), match='2008'>
<re.Match object; span=(786, 790), match='2008'>


In [96]:
### The salary year is the first match, near the top of the payslip, so it should be 2008
Year_    = text_no_space[85:89]
print(Year_)

2008
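
The pattern `[2]\d\d\d` also fires inside the bank account number (`2149`, `2374`, `2974`). A hedged refinement is to require word boundaries and a plausible century prefix. The choice of `19` and `20` as prefixes is an assumption about the data.

```python
import re

excerpt = ("Pay slip for the month of JAN, 2008 Emp Name SAPNA MISHRA "
           "Date of Joining 05/04/2005 Bank A/c No 2149237429742")

# \b keeps the match from starting or ending inside a longer run of digits
year_pattern = re.compile(r"\b(?:19|20)\d{2}\b")
years = year_pattern.findall(excerpt)
print(years)
```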


In [92]:
pattern  = re.compile(r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)', re.IGNORECASE)
matches  = pattern.finditer(text_no_space)

In [93]:
for match in matches:
    print(match)

<re.Match object; span=(80, 83), match='JAN'>
<re.Match object; span=(619, 622), match='Jan'>
<re.Match object; span=(625, 628), match='Feb'>
<re.Match object; span=(631, 634), match='Mar'>
<re.Match object; span=(637, 640), match='Apr'>
<re.Match object; span=(643, 646), match='May'>
<re.Match object; span=(649, 652), match='Jun'>
<re.Match object; span=(655, 658), match='Jul'>
<re.Match object; span=(661, 664), match='Aug'>
<re.Match object; span=(667, 670), match='Sep'>
<re.Match object; span=(673, 676), match='Oct'>
<re.Match object; span=(679, 682), match='Nov'>
<re.Match object; span=(685, 688), match='Dec'>
<re.Match object; span=(770, 773), match='Jan'>
<re.Match object; span=(782, 785), match='Apr'>


In [97]:
### Here, too, the salary month is the first occurrence.
### The later matches could belong to other parts of the payslip, such as a cumulative salary description

month_   = text_no_space[80:83]
print(month_)

JAN
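
Since the salary period appears in the phrase `for the month of JAN, 2008`, both values can be captured in one step rather than by position. A sketch, assuming the payslip always uses that wording:

```python
import re

excerpt = "BangalorePay slip for the month of JAN, 2008      Emp Name SAPNA MISHRA"

# Group 1 is the month name, group 2 the four-digit year
period = re.search(r"for the month of\s+([A-Za-z]+),?\s*(\d{4})", excerpt, re.IGNORECASE)

if period is not None:
    month, year = period.group(1), period.group(2)
    print(month, year)
```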


### Finding the details of the employee: Name, Date of Joining, Employee Number, Date of Birth, Parents' Names

In [98]:
pattern  = re.compile(r'\d\d[/]\d\d[/]\d\d\d\d', re.IGNORECASE)
matches  = pattern.finditer(text_no_space)

In [99]:
for match in matches:
    print(match)

<re.Match object; span=(174, 184), match='05/04/2005'>
<re.Match object; span=(207, 217), match='22/09/1979'>


In [100]:
### Logically, the 2005 date cannot be the DOB, since the payslip is from 2008 and a person under 18 can't work
DoB_      = text_no_space[207:217]
print(DoB_)
DoJ_      = text_no_space[174:184]
print(DoJ_)

22/09/1979
05/04/2005
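
The same reasoning can be expressed in code by parsing both strings and taking the earlier one as the date of birth. This sketch assumes exactly two dates were found and that they are day-first (`22/09/1979` cannot be month-first).

```python
from datetime import datetime

date_strings = ["05/04/2005", "22/09/1979"]   # in the order the regex found them

# Parse as day/month/year and sort chronologically
parsed = sorted(datetime.strptime(s, "%d/%m/%Y").date() for s in date_strings)

# The earlier date is taken as the birth date, the later one as the joining date
date_of_birth, date_of_joining = parsed
print(date_of_birth, date_of_joining)
```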


In [119]:
pattern  = re.compile(r'(Employee|Emp) Name', re.IGNORECASE)
matches  = pattern.finditer(text_no_space)

In [120]:
for match in matches:
    print(match)

<re.Match object; span=(136, 144), match='Emp Name'>


In [135]:
emp_name_    = text_no_space[144:157]
print(emp_name_)

 SAPNA MISHRA


In [138]:
pattern  = re.compile(r'(Mr|Mrs|Miss|Ms)\.?\s[a-zA-Z]\w*\s[a-zA-Z]\w*', re.IGNORECASE)
matches  = pattern.finditer(text_no_space)

In [139]:
for match in matches:
    print(match)

<re.Match object; span=(224, 238), match='MR AMIT MISHRA'>
<re.Match object; span=(246, 266), match='MRS PAYAL MISHRABank'>


In [140]:
father_name   = text_no_space[224:238]
print(father_name)
mother_name   = text_no_space[246:262]
print(mother_name)

MR AMIT MISHRA
MRS PAYAL MISHRA
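
The employee and parent names above were cut out with hand-picked offsets, and the mother's name needed a manual correction because `Bank` was glued to it. An alternative is to capture whatever sits between two known labels. The helper `extract_between` below is a hypothetical name, and the sketch assumes the labels appear in this order on the payslip.

```python
import re
from typing import Optional

def extract_between(text: str, start_label: str, end_label: str) -> Optional[str]:
    """Return the text between two labels (whitespace trimmed), or None if not found."""
    pattern = re.escape(start_label) + r"\s*(.*?)\s*" + re.escape(end_label)
    found = re.search(pattern, text, re.IGNORECASE)
    return found.group(1) if found else None

excerpt = ("Emp Name SAPNA MISHRA Date of Joining 05/04/2005Emp No 6 Date of Birth 22/09/1979"
           "Father MR AMIT MISHRA Mother MRS PAYAL MISHRABank UBI Bank")

employee = extract_between(excerpt, "Emp Name", "Date of Joining")
father = extract_between(excerpt, "Father", "Mother")
mother = extract_between(excerpt, "Mother", "Bank")
print(employee, father, mother, sep="\n")
```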


### Designation

In [147]:
pattern  = re.compile(r'Designation\s[a-zA-Z]\w*\s[a-zA-Z]\w*', re.IGNORECASE)
matches  = pattern.finditer(text_no_space)

In [148]:
for match in matches:
    print(match)

<re.Match object; span=(381, 411), match='Designation Asst ManagerIncome'>


In [152]:
designation  = text_no_space[392:405]
print(designation)

 Asst Manager


### Salary

In [156]:
pattern  = re.compile(r'(Salary|Pay|Sal):\s[0-9]\d*', re.IGNORECASE)
matches  = pattern.finditer(text_no_space)

In [157]:
for match in matches:
    print(match)

<re.Match object; span=(554, 564), match='Pay: 10969'>


In [160]:
Salary_  =  text_no_space[559:564]
print(Salary_)

10969
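
As a sanity check on the OCR result, the net pay can be compared with gross minus deductions, both of which appear in the same line of text. A sketch on an excerpt of `text_no_space`; the helper name `first_int` is hypothetical.

```python
import re

excerpt = ("Gross 11743 Q| |Deductions 774 0Net Pay: 10969 "
           "(Rs, ONE ZERO NINE SIX NINE and paise ZERO only)")

def first_int(pattern: str, text: str) -> int:
    """Return the first captured group of `pattern` in `text` as an int."""
    found = re.search(pattern, text, re.IGNORECASE)
    if found is None:
        raise ValueError(f"pattern not found: {pattern}")
    return int(found.group(1))

gross = first_int(r"Gross\s+(\d+)", excerpt)
deductions = first_int(r"Deductions\s+(\d+)", excerpt)
net_pay = first_int(r"Net Pay:\s*(\d+)", excerpt)

# True when the three OCR'd figures agree with each other
print(gross, deductions, net_pay, gross - deductions == net_pay)
```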


### Creating a DataFrame

In [178]:
dict_ = {
    'Type': type_,
    'Organization': Org_,
    'City': City,
    'Year': Year_,
    'Month': month_,
    'Employee Name': emp_name_,
    'Date of Birth': DoB_,
    'Date of Joining': DoJ_,
    'Designation': designation,
    'Father Name': father_name,
    'Mother Name': mother_name,
    'Salary After Deductions': Salary_,
}

In [179]:
dict_

{'Type': 'Pay Slip',
 'Organization': ' GKW English SchoolJayanagar',
 'City': 'Bangalore',
 'Year': '2008',
 'Month': 'JAN',
 'Employee Name': ' SAPNA MISHRA',
 'Date of Birth': '22/09/1979',
 'Date of Joining': '05/04/2005',
 'Designation': ' Asst Manager',
 'Father Name': 'MR AMIT MISHRA',
 'Mother Name': 'MRS PAYAL MISHRA',
 'Salary After Deductions': '10969'}
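
The section heading mentions a DataFrame, but the cells above stop at the dictionary. A minimal sketch of the remaining step follows, with the values copied from the output above. Stripping the stray leading spaces is an added cleanup, not something the original cells do.

```python
import pandas as pd

# Values copied from the dictionary output above
record = {
    'Type': 'Pay Slip',
    'Organization': ' GKW English SchoolJayanagar',
    'City': 'Bangalore',
    'Year': '2008',
    'Month': 'JAN',
    'Employee Name': ' SAPNA MISHRA',
    'Date of Birth': '22/09/1979',
    'Date of Joining': '05/04/2005',
    'Designation': ' Asst Manager',
    'Father Name': 'MR AMIT MISHRA',
    'Mother Name': 'MRS PAYAL MISHRA',
    'Salary After Deductions': '10969',
}

# Trim whitespace left over from slicing, then build a one-row DataFrame
clean_record = {key: value.strip() for key, value in record.items()}
df = pd.DataFrame([clean_record])
print(df.T)   # transposed so each field prints on its own line
```

Wrapping the dictionary in a list makes each payslip one row, so records from further payslips could be appended to the same list before building the frame.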