Work-in-progress code for VADR

Based on an idea formulated by: https://github.com/ahmetozlu

This method is helpful for:
a) Extracting overlapping signatures from a document
b) Counting how many people have signed a document when multiple signatures are required

The entire process is broken down into:

a) Converting a PDF page into an image

b) Reading the image

c) Extracting blobs (connected regions of ink in the document, including the signature) and identifying their locations

d) Measuring the area of each blob with regionprops, a function in skimage.measure, to estimate how much text is present in the image

e) Creating a threshold value. A signature usually forms a larger blob than printed characters do. The threshold is defined by the average blob area, which approximates the size of the printed text. If a blob is larger than the average value, it is an area of interest in the document; if it is smaller than the average value, it is treated as noise.

average value = total area / number of blobs
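
As a small worked example of the averaging rule above, here is a sketch in plain Python. The blob areas are made-up numbers, and the helper names `average_blob_area` and `split_by_average` are hypothetical; they are not part of the notebook or of skimage.

```python
# Hypothetical blob areas in pixels: five printed glyphs and two long pen strokes.
blob_areas = [40, 55, 60, 48, 52, 900, 1200]


def average_blob_area(areas):
    """Return total area divided by the number of blobs."""
    if not areas:
        raise ValueError("no blobs to average")
    return sum(areas) / len(areas)


def split_by_average(areas):
    """Split blobs into candidates (larger than average) and noise (the rest)."""
    avg = average_blob_area(areas)
    candidates = [a for a in areas if a > avg]
    noise = [a for a in areas if a <= avg]
    return avg, candidates, noise


avg, candidates, noise = split_by_average(blob_areas)
print("average:", avg)
print("areas of interest:", candidates)
print("noise:", noise)
```

With these numbers the two long strokes pull the average well above the glyph sizes, so only they are kept as areas of interest.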

In [18]:
import cv2
import matplotlib.pyplot as plt
import skimage
from skimage import measure, morphology
from skimage.color import label2rgb
from skimage.measure import regionprops
from pdf2image import convert_from_path
from PIL import Image

In [19]:
#Read in a PDF file and convert each page to an image
#Note: every page is saved under the same name, so for a multi-page PDF
#only the last page remains in Sample.jpeg
images = convert_from_path("C:/Users/preet/Desktop/dunnhumby_The-Complete-Journey/Sample.pdf")
for image in images:
    image.save("Sample.jpeg", "jpeg")
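
If the PDF has more than one page, one possible variation is to give each page its own file name. The sketch below assumes pdf2image and Poppler are installed; `page_filename` and `save_pdf_pages` are hypothetical helper names, and the naming pattern is only an illustration.

```python
from pathlib import Path


def page_filename(stem, page_number, suffix=".jpeg"):
    """Build a per-page file name such as 'Sample_page1.jpeg' (hypothetical pattern)."""
    return f"{stem}_page{page_number}{suffix}"


def save_pdf_pages(pdf_path, out_dir="."):
    """Save every page of a PDF as its own JPEG and return the saved paths."""
    # Imported here so the naming helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    saved = []
    for number, page in enumerate(convert_from_path(pdf_path), start=1):
        target = out_dir / page_filename(Path(pdf_path).stem, number)
        page.save(target, "JPEG")
        saved.append(target)
    return saved


print(page_filename("Sample", 1))
```

Each saved page could then be passed through the remaining steps separately.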

In [20]:
# Read in the converted image
#Use OpenCV's imread to read the sample image; flag 0 loads it as grayscale
img = cv2.imread('Sample.jpeg', 0)
if img is None:
    raise FileNotFoundError("Could not read Sample.jpeg")

#The code below converts the grayscale image into a binary image
#We are interested in the dark (ink) regions of the document
#cv2.threshold(source, thresholdValue, maxVal, thresholdingTechnique)
#127 is the global threshold and 255 is the maximum value; both are pixel intensities
#With cv2.THRESH_BINARY, each pixel value is compared with the threshold:
#a pixel greater than the threshold is set to maxVal (255), otherwise it is set to 0
#Thresholding is a common segmentation technique for separating
#a foreground object from its background
#cv2.threshold returns (threshold used, thresholded image), so [1] selects the image

img = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)[1]

#Print it out to check that you are getting a two-dimensional array of pixel values
#The pixel values range from 0 to 255 because each pixel is a single 8-bit integer
#0 represents black and 255 represents white; after thresholding only these two values remain
#The image was read with flag 0 (grayscale), so there is one channel and one value per pixel
#Without that flag, OpenCV reads a colour image in Blue Green Red (BGR) order,
#and each pixel is a three-element array holding the B, G and R channels
#Danger: cv2.imread returns None, without raising an error, if the file cannot be read
print(img)

[[255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]
 ...
 [255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]
 [255 255 255 ... 255 255 255]]
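
To make the thresholding rule concrete, here is a dependency-free sketch of what `cv2.THRESH_BINARY` does to a grayscale image. The function `threshold_binary` and the tiny `sample` grid are hypothetical stand-ins for illustration; the notebook itself uses `cv2.threshold`.

```python
def threshold_binary(gray, thresh=127, max_val=255):
    """Set pixels above thresh to max_val and all other pixels to 0.

    gray is a grayscale image given as a list of rows of 0-255 integers.
    """
    return [[max_val if px > thresh else 0 for px in row] for row in gray]


# Hypothetical 2x3 grayscale patch: light paper on the left, dark ink on the right.
sample = [
    [250, 245, 30],
    [128, 127, 10],
]

binary = threshold_binary(sample)
for row in binary:
    print(row)
```

Note that a pixel exactly equal to the threshold (127 here) falls on the dark side, because the comparison is strictly greater than.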


In [21]:
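# Extract Blobs
#We need the blobs so that each one can later be tested against the size threshold
#True marks bright (paper) pixels, False marks dark (ink) pixels
blobs = img > img.mean()
#Label connected ink regions; background=1 treats the bright pixels as background
blobs_labels = measure.label(blobs, background=1)
#Colour each labelled blob over the original image for visual inspection
image_label_overlay = label2rgb(blobs_labels, image=img)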
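
The labelling step can be pictured with a small pure-Python sketch. It is not skimage's implementation: `label_dark_components` is a hypothetical helper that flood-fills 8-connected groups of dark pixels (8-connectivity is, as far as I know, what `measure.label` uses by default for 2-D input) and records each group's area, which is the quantity `regionprops` reports as `region.area`.

```python
from collections import deque


def label_dark_components(binary):
    """Label 8-connected groups of dark (0) pixels; anything else is background.

    Returns (labels, areas): a grid of integer labels (0 = background) and a
    dict mapping each label to its pixel count.
    """
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    areas = {}
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 0 or labels[r][c] != 0:
                continue  # background pixel, or already part of a blob
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 0
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
            areas[next_label] = area
    return labels, areas


# Hypothetical binary page: 0 is ink, 255 is paper.
page = [
    [255,   0,   0, 255, 255],
    [255,   0, 255, 255,   0],
    [255, 255, 255, 255,   0],
    [  0, 255, 255, 255,   0],
]

labels, areas = label_dark_components(page)
print(areas)
```

The label numbers follow scan order in this sketch; the numbering skimage assigns may differ, but the grouping idea is the same.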

  


In [22]:
# Area of Text
the_biggest_component = 0
total_area = 0

#counter denotes the number of blobs
#Extract the area of each blob from the image
#The loop below accumulates the total blob area to estimate how much text is present in the image
#This information is critical for creating a threshold value
#We assume that a person's signature forms a larger blob than the printed characters do
counter = 0
average = 0
for region in regionprops(blobs_labels):
    #Ignore specks of 10 pixels or fewer when computing the average
    if region.area > 10:
        total_area = total_area + region.area
        counter = counter + 1
    #Track the largest blob among those of at least 250 pixels
    if region.area >= 250:
        if region.area > the_biggest_component:
            the_biggest_component = region.area


# Threshold
#The average value approximates the size of the printed text
average = (total_area/counter)
print("the_biggest_component: " + str(the_biggest_component))
print("average: " + str(average))

#Use the average value as part of the formula to identify the location of the signature
#The formula below pertains to A4-size documents (a single A4 page)
#It was compiled by: https://github.com/ahmetozlu
#a4_constant scales the average printed-text size into the size cutoff used for noise removal

a4_constant = ((average/84.0)*250.0)+100
print("a4_constant: " + str(a4_constant))

#Call the remove_small_objects function from skimage.morphology
#Remove noise: all blobs smaller than the constant defined for A4-size documents
#are removed, which discards the printed text and keeps the larger blobs

b = morphology.remove_small_objects(blobs_labels, a4_constant)

#Save the image with the noise removed

plt.imsave('pre_version.jpeg', b)

# Read in the saved image, which no longer contains the noise

img2 = cv2.imread('pre_version.jpeg', 0)
#Otsu's method picks the threshold automatically; THRESH_BINARY_INV inverts the result
img2 = cv2.threshold(img2, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

# Save the result
cv2.imwrite("output.jpeg", img2)

the_biggest_component: 6147
average: 198.21671826625388
a4_constant: 689.9307091257556


True
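
The printed values above can be checked by hand with the same formula. The sketch below recomputes the cutoff from the reported average and applies it to a list of areas; apart from 6147 (the biggest component reported above), the areas are invented, and `a4_constant_for` and `keep_large` are hypothetical helper names. The keep rule (area at least the cutoff) mirrors how I understand the `min_size` argument of `remove_small_objects` to behave.

```python
def a4_constant_for(average_area):
    """Scale the average blob area into the size cutoff used in the notebook."""
    return (average_area / 84.0) * 250.0 + 100


def keep_large(areas, min_size):
    """Keep blobs whose area is at least min_size; smaller ones are dropped as noise."""
    return [a for a in areas if a >= min_size]


cutoff = a4_constant_for(198.21671826625388)  # average printed by the cell above
print("cutoff:", cutoff)

# Invented areas except 6147, which is the biggest component reported above.
areas = [12, 198, 420, 689, 690, 6147]
survivors = keep_large(areas, cutoff)
print("kept:", survivors)
print("number of candidate signature blobs:", len(survivors))
```

Counting the blobs that survive the cutoff is one rough way to estimate how many signatures are present, although a single signature may split into several blobs.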

In [23]:
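#Open the result; cv2.imwrite above saved output.jpeg to the working directory,
#so this absolute path is valid only when the notebook runs from that folder
im = Image.open(r"C:\Users\preet\Desktop\output.jpeg")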
  
im.show()