## Address extraction algorithm
### (0) PDF pages are converted to images.
### (1) Regions of interest are located in the invoice image through object detection with OpenCV. These regions of interest are the blocks of text in the invoice.
### (2) The text in each block is extracted through OCR.
### (3) Each block of text is checked to decide whether it could be an address, using spaCy and some heuristics: text with no comma delimiter is rejected, and so is text containing a non-alphabetic character such as @, which may indicate an email address.
### (4) Because addresses are recognised with machine learning (spaCy named entity recognition), detection is not limited to Indian addresses.
### (5) Addresses are not labelled as "from address" / "to address" or "source" / "destination", so the first two addresses found in an invoice are taken as the source and destination addresses.
### (6) The address text is preprocessed to remove discrepancies (e.g. a time).

In [8]:
import pdfplumber #convert pdf to image
import numpy as np 
import cv2 # for text block detection
import os
import pytesseract #image to text (OCR)
import spacy 
# Load spacy NER model
nlp = spacy.load("en_core_web_md")

In [9]:
def check_address(text):
    # Assumption: an address uses commas as separators.
    if ',' not in text:
        return False
    # Assumption: an address contains no email, hence no '@' symbol.
    # spaCy labels text containing a Gmail address as a location, so reject it here.
    if '@' in text:
        return False
    # Edge case found in invoices: spaCy NER also labelled text containing
    # "thanks" as a location. Compare case-insensitively so "Thanks" is caught too.
    if 'thank' in text.lower():
        return False
    doc = nlp(text)
    # Accept the text if NER finds at least one location entity.
    return any(entity.label_ in ('GPE', 'LOC') for entity in doc.ents)
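
The string-level checks in `check_address` run before the spaCy model, so they can be tried on their own. The sketch below isolates them in a hypothetical helper, `passes_cheap_filters` (not part of the notebook). It lower-cases the text before the `thank` check so that a capitalised "Thanks" is also rejected. The sample strings are made up for illustration.

```python
def passes_cheap_filters(text):
    """Return True if text survives the comma, '@' and 'thank' checks.

    Hypothetical helper: it mirrors only the string checks of
    check_address and does not run the spaCy NER step.
    """
    if ',' not in text:            # addresses are assumed to be comma-separated
        return False
    if '@' in text:                # '@' suggests an email address
        return False
    if 'thank' in text.lower():    # e.g. "Thanks for riding with us"
        return False
    return True


samples = [
    "RK Puram, Gandhi Nagar, Bengaluru, Karnataka 560009, India",
    "Write to us at support@example.com, Bengaluru",
    "Thanks for travelling with us, Ramesh",
    "Total fare 250",
]
for s in samples:
    print(passes_cheap_filters(s), '|', s)
```

Only the first sample passes. A block that passes these checks would still need the NER step before being treated as an address.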

In [10]:
def image_to_rect(img):
    # Binarize with Otsu's threshold, inverted so that text becomes white on black.
    _, thresh1 = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)
    # Dilate with a 16x16 rectangular kernel so nearby characters merge into blocks.
    rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (16, 16))
    dilation = cv2.dilate(thresh1, rect_kernel, iterations=3)
    # Each external contour of the dilated image is one candidate text block.
    contours, _ = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    index = 1
    for cnt in contours[::-1]:
        x, y, w, h = cv2.boundingRect(cnt)
        cropped = img[y:y + h, x:x + w]
        text = pytesseract.image_to_string(cropped)  # OCR on the cropped block
        if check_address(text):
            text = text.replace('\n', ' ')
            print(f"({index}) {text}")
            index += 1


def generate_address_box(file):
    print('file_name:', file)
    with pdfplumber.open(file) as pdf:  # the PDF is closed when the block ends
        for page in pdf.pages:
            # Render the page at 300 DPI and convert it to a grayscale array.
            img = np.array(page.to_image(resolution=300).original.convert('L'))
            print(f'page {page.page_number} contains address:')
            image_to_rect(img)
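
`image_to_rect` visits the contours in reverse (`contours[::-1]`), and the algorithm later treats the first two addresses found as source and destination, so the order in which blocks are visited matters. The sketch below shows one way to make that order explicit by sorting bounding boxes top to bottom, then left to right. `sort_reading_order` and `row_tolerance` are hypothetical names that do not appear in the notebook, and the boxes are assumed to be `(x, y, w, h)` tuples like those returned by `cv2.boundingRect`.

```python
def sort_reading_order(boxes, row_tolerance=20):
    """Sort (x, y, w, h) boxes top-to-bottom, then left-to-right.

    A box joins the current row when its top edge is within
    row_tolerance pixels of the top edge of the row's first box.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):       # order by top edge
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b[0]))  # order by left edge
    return ordered


boxes = [(600, 105, 200, 40), (50, 400, 300, 60), (40, 100, 250, 40)]
print(sort_reading_order(boxes))
# [(40, 100, 250, 40), (600, 105, 200, 40), (50, 400, 300, 60)]
```

The tolerance value is a guess for 300 DPI renders and would need tuning against real invoices.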

In [11]:
path = os.path.join(os.getcwd(), 'pdf_dataset')
for file in os.listdir(path):
    generate_address_box(os.path.join(path, file))

file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_25.pdf
page 1 contains address:
(1) 55, Gopalpura Road, Chhayadip Nagar, Shri Gopal Nagar, Gopal Pura Mode, Jaipur 
(2) Unnamed Road, Bhakrota, Jaipur 
page 2 contains address:
(1) Pickup Address 55, Gopalpura Road, Chhayadip Nagar, Shri Gopal Nagar, Gopal Pura Mode, Jaipur 
page 3 contains address:
(1) ANI Technologies Pvt. Ltd.  Office unit 405 - 406, fourth floor, southend squire, mansarover industrial area, new sanganer road, Jaipur South 302020 
(2) Supply Address Office unit 405 - 406, fourth floor, southend squire, mansarover industrial area, new sanganer road, Jaipur South 302020 
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_3.pdf
page 1 contains address:
(1) e Road Number 23, Andheri East, Mumbai, Maharashtra, India 
(2) Karjat - Murbad Rd, Kotwal Nagar, Deulwadi, Karjat 
page 2 contains address:
(1) Road Number 23, Andheri East, Mumbai, Maharashtra, India 
page 3

(1) India Expo Mart Cir, Knowledge Park II, Greater Noida, Uttar Pradesh 201310, India 
(2) © 2C, near IIMT College, Knowledge Park III, Greater Noida, Uttar Pradesh 201310, India 
page 2 contains address:
(1) Customer Pick Up Address India Expo Mart Cir, Knowledge Park II, Greater Noida, Uttar Pradesh 201310, India 
page 3 contains address:
(1) Roppen Transportation Services Private Limited 1st, No 179-D, Khasra no 78 Khirki Village, New Delhi, South Delhi, Delhi, 110017 
(2) Vijayguna India Expo Mart Cir, Knowledge Park II, Greater Noida, Uttar Pradesh 201310, India 
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/UBER_22.pdf
page 1 contains address:
(1) 8:31 | 1431/1, 4th Cross Rd, Rashad Nagar, Govindapura, HBR Layout, Bengaluru, Karnataka 560045, India 
(2) 9:14 | RK Puram, Gandhi Nagar, Bengaluru, Karnataka 560009, India 
page 2 contains address:
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/UBER_15.pdf
page 1 contains addr

(6) 700156, India 
page 3 contains address:
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_6.pdf
page 1 contains address:
(1) « B-74, Mehuwala Mafi, Dehradun, Uttarakhand 248171, India 
(2) 824X+39V, Model Colony, Araghar, Dharampur, Dehradun 
(3) In case of any complaint/grievance against this invoice, write to us at  Grievance officer, ANI Technologies Private Limited, Ola Campus, Prestige RMZ star tech, C wing, Koramangala Industrial layout, Koramangala, Hosur road, Bengaluru, Karnataka, 560095 
page 2 contains address:
(1) B-74, Mehuwala Mafi, Dehradun, Uttarakhand 248171, India 
(2) Please note the following terms: This invoice is issued by ANI Technologies Private Limited in the capacity of an Electronic Commerce Operator as per Section 9(5) of the Central Goods and Tax Act, 2017 and corresponding provision(s) of the State/ UT GST laws. This invoice has been issued and signed by the Authorized signatory of ANI Technologies Private Limited only fo

page 1 contains address:
(1) Manapakkam, Chennai, Tamil Nadu 600118, India 
(2) © 2-161, Phase-3, AGS Colony, Mugalivakkam, Chennai, Tamil Nadu 600116, India 
page 2 contains address:
(1) Customer Pick Up Address  1/352, Manapakkam Main Rd, near Miot Hospital, Parthasarathy Nagar, Manapakkam, Chennai, Tamil Nadu 600118, India 
page 3 contains address:
(1) Roppen Transportation Services Private Limited 78, Old door no 34/a, Mount Road, Guindy, Chennai, Tamil Nadu 600032 
(2) Ramesh 1/352, Manapakkam Main Rd, near Miot  Hospital, Parthasarathy Nagar, Manapakkam, Chennai, Tamil Nadu 600118, India 
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_23.pdf
page 1 contains address:
(1) 2722+5RC, Kasturba Nagar 3rd Cross St, Venkata Rathnam Nagar Extension, Venkata Rathinam Nagar, Adyar, Chennai 
(2) Meenambakkam, Chennai - Theni Hwy, National Airports Authority Colony, Meenambakkam, Nandanvakkam, Chennai 
(3) Grievance officer, ANI Technologies Private Limited

page 2 contains address:
(1) SRA-SY, Crash Rd, Vazhakkala, Kakkanad, Kerala 682021, India 
(2) Roshan Ln, Kacheripady, Ernakulam, Kerala 682018, 
page 3 contains address:
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/RAPIDO_5.pdf
page 1 contains address:
(1) Kondapur oe,  Porirh 
(2) PF 10, Railway Officer Colony, Botiguda, Bhoiguda, Secunderabad, Telangana 
(3) 500025, India 
(4) © 17-1-178/A/15/A, near Raheem Function Hall, Salala Nagar, Santosh Nagar, Hyderabad, Telangana 500059, India 
page 2 contains address:
(1) Customer Pick Up Address  PF10, Railway Officer Colony, Botiguda, Bhoiguda, Secunderabad, Telangana 500025, India 
page 3 contains address:
(1) Roppen Transportation Services Private Limited 3rd Floor, Sai Prithvi Acrade, Megha Hills, Sri  Rama Colony, Madhapur, Hyderabad, Telangana, 500081 
(2) Mastanvali Shaik  PF10, Railway Officer Colony, Botiguda, Bhoiguda, Secunderabad, Telangana 500025, India 
file_name: /home/panwar2001/Desktop/T

In [15]:
# extracting from UBER_2.pdf as example
def get_address():
    pdf=pdfplumber.open(os.getcwd()+'/pdf_dataset/UBER_2.pdf')
    address=[]
    for page in pdf.pages:
        img=np.array(page.to_image(resolution=300).original.convert('L'))
        ret, thresh1 = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)
        rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (16,16))   
        dilation = cv2.dilate(thresh1, rect_kernel, iterations = 3) 
        contours, _ = cv2.findContours(dilation,  cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) 
        for cnt in contours[::-1]: 
            x, y, w, h = cv2.boundingRect(cnt) 
            cropped = img[y:y + h, x:x + w]
            text=pytesseract.image_to_string(cropped)
            if check_address(text):
                address.append(text)
            if len(address)==2:
                return address
    return address    

for address in get_address():
    print(address.replace('\n',' ')+'\n')

6/6, 3rd St, Shanthi Nagar, Adambakkam Shanthi Nagar, Adambakkam Chennai, Tamil Nadu 600088, Santhi Nagar, Adambakkam, Chennai, Tamil Nadu 600088, India 

12:33pm Unnamed Road, Kovilancheri,  Tamil Nadu 600048, India 
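
Since the invoices do not label their addresses, the first two results are interpreted as source and destination. A minimal sketch of that labelling step, assuming a list such as the one returned by `get_address()`, could look like this. `label_addresses` is a hypothetical helper, and returning `None` for a missing address is an assumption rather than something the notebook does.

```python
def label_addresses(addresses):
    """Map the first two detected addresses to source and destination.

    Hypothetical helper: missing entries are reported as None.
    """
    first_two = list(addresses[:2])
    while len(first_two) < 2:      # pad when fewer than two were found
        first_two.append(None)
    return {"source": first_two[0], "destination": first_two[1]}


print(label_addresses(["pickup block", "drop block", "office block"]))
# {'source': 'pickup block', 'destination': 'drop block'}
print(label_addresses(["pickup block"]))
# {'source': 'pickup block', 'destination': None}
```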



### Preprocess the address text to remove the time and any character that is not a letter, digit, comma, whitespace, `/` or `-`.

In [28]:
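import re

def preprocess_address_text(text):
    # Remove a time such as "12:33pm" or "17:07". The \b after am/pm keeps the
    # pattern from eating the start of a following word (e.g. "amritsar").
    text = re.sub(r"\d{1,2}:\d{1,2}\s*(?:(?:am|pm)\b)?", '', text, flags=re.IGNORECASE)
    # Keep only letters, digits, commas, whitespace, '/' and '-'.
    text = re.sub(r"[^a-zA-Z0-9,\s/-]", "", text)
    return text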

print(preprocess_address_text('12:33pm Unnamed Road, Kovilancheri,  Tamil Nadu 600048, India '))
print(preprocess_address_text('© F-35, 2nd Ave, G Block, Anna Nagar, Sector B, Annanagar East, Chennai, Tamil Nadu 600040, India '))
print(preprocess_address_text(' 17:07  Haware Infotech Park, Sector 30A, Vashi, Navi Mumbai, Maharashtra 400703, India'))


 Unnamed Road, Kovilancheri,  Tamil Nadu 600048, India 
 F-35, 2nd Ave, G Block, Anna Nagar, Sector B, Annanagar East, Chennai, Tamil Nadu 600040, India 
 Haware Infotech Park, Sector 30A, Vashi, Navi Mumbai, Maharashtra 400703, India
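
The cleaned strings above still carry a leading space and, in one case, a double space where the time or symbol used to be. A small follow-up step could tidy that. The sketch below uses a hypothetical helper, `normalize_whitespace`, that is not part of the notebook.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace(" Unnamed Road, Kovilancheri,  Tamil Nadu 600048, India "))
# Unnamed Road, Kovilancheri, Tamil Nadu 600048, India
```

It could be applied to the return value of `preprocess_address_text`, which also removes the newlines that OCR leaves inside a block.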


In [29]:
# check for addresses again, this time applying the preprocessing step
def image_to_rect(img):
    ret, thresh1 = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)
    rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (16,16))   
    dilation = cv2.dilate(thresh1, rect_kernel, iterations = 3) 
    contours, _ = cv2.findContours(dilation,  cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) 
    index=1
    for cnt in contours[::-1]: 
        x, y, w, h = cv2.boundingRect(cnt) 
        cropped = img[y:y + h, x:x + w]
        text=pytesseract.image_to_string(cropped).lower()
        if check_address(text):
            text=text.replace('\n',' ')
            print(f"({index}) {preprocess_address_text(text)}")
            index+=1
            
def generate_address_box(file):
    print('file_name:', file)
    with pdfplumber.open(file) as pdf:  # the PDF is closed when the block ends
        for page in pdf.pages:
            img = np.array(page.to_image(resolution=300).original.convert('L'))
            print(f'page {page.page_number} contains address:')
            image_to_rect(img)

path = os.path.join(os.getcwd(), 'pdf_dataset')
for file in os.listdir(path):
    generate_address_box(os.path.join(path, file))

file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_25.pdf
page 1 contains address:
(1) 55, gopalpura road, chhayadip nagar, shri gopal nagar, gopal pura mode, jaipur 
page 2 contains address:
(1) pickup address 55, gopalpura road, chhayadip nagar, shri gopal nagar, gopal pura mode, jaipur 
page 3 contains address:
(1) ani technologies pvt ltd  office unit 405 - 406, fourth floor, southend squire, mansarover industrial area, new sanganer road, jaipur south 302020 
(2) supply address office unit 405 - 406, fourth floor, southend squire, mansarover industrial area, new sanganer road, jaipur south 302020 
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_3.pdf
page 1 contains address:
(1) e road number 23, andheri east, mumbai, maharashtra, india 
(2) karjat - murbad rd, kotwal nagar, deulwadi, karjat 
page 2 contains address:
(1) road number 23, andheri east, mumbai, maharashtra, india 
page 3 contains address:
file_name: /home/pan

(2)  rk puram, gandhi nagar, bengaluru, karnataka 560009, india 
page 2 contains address:
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/UBER_15.pdf
page 1 contains address:
page 2 contains address:
(1) dlf phase 5, sector 54, gurugram, haryana 122002,  india 
(2)  delhi, 110037, india 
page 3 contains address:
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_18.pdf
page 1 contains address:
(1)   road number ce124  salt lake, sector  1, kolkata india  19 ns road, hmp house  bbd bagh, dalhousie, kolkata india 
page 2 contains address:
(1) road number 23, andheri east, mumbai, maharashtra, india 
page 3 contains address:
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/RAPIDO_4.pdf
page 1 contains address:
(1) unnamed road, kavuri hills phase 1, kavuri hills, jubilee hills, hyderabad, telangana 500033, india  e block ysr residency, barkatpura rd, chitrapuri colony, lingampally, kachiguda, hyderabad, telangana

page 3 contains address:
(1) roppen transportation services private limited 1st, no 179-d, khasra no 78 khirki village, new delhi, south delhi, delhi, 110017 
(2) vijayguna 281c, near iimt, knowledge park iii, greater noida, uttar pradesh 201310, india 
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/UBER_5.pdf
page 1 contains address:
page 2 contains address:
(1) dlf phase 5, sector 54, gurugram, haryana 122002,  india 
(2) delhi, 110037, india 
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_7.pdf
page 1 contains address:
(1) bashettihalli          oases, halli 3  paar sulibel  or 648 makali ava 75 whitefield istzeert koramangala boedebortes fas  google map data 2018 google 
(2) 786, 1st main rd, 1st block, dubasi palya, kengeri satellite town, bengaluru, karnataka 560059, india 
(3) atc tower, bial road, bengaluru 
page 2 contains address:
(1) kengeri satellite town, bengaluru, karnataka 560059,  india 
page 3 contains addr

page 2 contains address:
(1)   189, udyog vihar phase 1, udyog vihar, sector 20, gurugram, haryana 122008,  india 
(2)   6/82, jnarera, jharera village, delhi cantonment, new delhi, delhi 110010, 
page 3 contains address:
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_10.pdf
page 1 contains address:
page 2 contains address:
(1) vimlesh kumar ola micro, xcent hr55ae9928 operator state/ut delhi    
page 3 contains address:
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_22.pdf
page 1 contains address:
(1) aravakkam,  a 
(2) base fare of 80 for the first 4 km and 10/km thereafter ride time at 10 per min after first 5 min, includes waiting time during the trip  additional service tax is applicable on your fare total fare includes this additional service tax toll and parking charges are extra  we levy peak pricing charges when the demand is high, so that we can make more cabs available to you and continue to serve you efficiently 

(3) udupi bus stand udupi, karnataka 
page 2 contains address:
(1) 916366161696 mangalore international airport bajpe main rd, kenjar hc, mangalore, karnataka 
page 3 contains address:
(1)    o ola  ani technologies pvt ltd  5th floor, maruthi infotech center, 100 feet rd, embassy golf links business park, domlur, bengaluru, karnataka 560071           cidvswww0324488 
file_name: /home/panwar2001/Desktop/TEST/2023-10-09-ESS1/AI/pdf_dataset/OLA_12.pdf
page 1 contains address:
(1)  survey no 169, old mahabalipuram rd, kumaran nagar, chemmanchery, chennai, tamil nadu 600119, india 
page 2 contains address:
(1) survey no 169, old mahabalipuram rd, kumaran nagar, chemmanchery, chennai, tamil nadu 600119, india 
page 3 contains address:
(1) o ola  ani technologies pvt ltd  ani technologies pvtltd, 18-a, sidco industrial area, mmda bus terminal, arumbakkam, chennai 
(2) ani technologies pvtltd, 18-a, sidco industrial area, mmda bus terminal, arumbakkam, chennai 
file_name: /home/panwar2