## 1. Information extraction from PDF documents

### Question 

Information extraction from PDF documents. The zip file contains 4 PDF documents, which are stock contracts used in banks. You need to build an NLP-based information extraction (IE) system that extracts the following information from each document. The 'Paragraph 11: Elections and Variables' section has a subsection called 'Rounding'. From it, you need to extract the Currency, the Amount, and the Rounding (up / down / nearest; if not mentioned, take it as 'nearest' by default) for 'Delivery Amount' and 'Return Amount'.

E.g., in the file '16017sec.pdf' the values are:

Delivery Amount: Currency: USD Amount: 10,000 Rounding: up

Return Amount: Currency: USD Amount: 10,000 Rounding: down
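
The 'nearest' default can be made explicit in code. The following is a minimal sketch, not the approach used in the notebook code further down: it swaps POS chunking for plain regular expressions. The function name `parse_rounding_clause`, the currency list, and the two sample clauses are assumptions made for illustration only.

```python
import re

# Hypothetical helper: the currency list and patterns are illustrative assumptions.
CURRENCIES = ("USD", "EUR", "GBP", "INR")

def parse_rounding_clause(clause):
    """Extract currency, amount and rounding direction from one clause."""
    money = re.search(r"\b(%s)\s*(\d+(?:,\d{3})*)" % "|".join(CURRENCIES), clause)
    direction = re.search(r"\brounded\s+(up|down)\b", clause, flags=re.IGNORECASE)
    return {
        "currency": money.group(1) if money else None,
        "amount": money.group(2) if money else None,
        # Fall back to 'nearest' when no direction is stated.
        "rounding": direction.group(1).lower() if direction else "nearest",
    }

# Sample clauses are invented for the demonstration, not taken from the contracts.
print(parse_rounding_clause("The Delivery Amount will be rounded up to a multiple of USD 10,000"))
print(parse_rounding_clause("The Return Amount will be rounded to a multiple of EUR 10,000"))
```

Keeping the default inside the parser means every caller gets a complete record, even for clauses that state no direction.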

A few pointers on this problem:

- Do not process the PDF directly. Use a third-party tool to convert it to text.
- Some strings can be uniquely identified in the text. Use them to narrow down the search space.
- Look into POS tagging and Named Entity Recognition. These NLP techniques will be helpful in cracking this problem.
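
The second pointer, narrowing the search space with uniquely identifiable strings, can be sketched as follows. This is an illustration under assumptions: the helper name `slice_between` and the sample text are hypothetical, while the anchors 'Rounding' and 'Valuation and Timing' are the ones used by the notebook code further down.

```python
import re

def slice_between(text, start_anchor, end_anchor):
    """Return the text between the first start_anchor and the next end_anchor, or None."""
    pattern = re.escape(start_anchor) + r"(.*?)" + re.escape(end_anchor)
    # DOTALL lets '.' span line breaks, so the text need not be flattened first.
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Invented sample text standing in for the converted PDF content.
sample = (
    "(iii) Rounding. The Delivery Amount will be rounded up.\n"
    "(c) Valuation and Timing."
)
print(slice_between(sample, "Rounding", "Valuation and Timing"))
print(slice_between(sample, "Rounding", "No Such Heading"))
```

Returning `None` instead of raising lets the caller decide how to handle a document in which the anchors are missing.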

### Instructions on execution

Follow these steps to execute the code below:
1. Make sure all the required packages listed in the "requirements.txt" file are installed.
2. Place the unzipped PDF documents in a folder.
3. Set "base_path" in "conf.json" to the path of the folder in which the PDFs are placed.
4. Make sure the "conf.json" file is always in the same location as this ipynb file.
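
Steps 3 and 4 imply a configuration file with a single "base_path" key. Below is a minimal sketch of loading it and listing the PDFs, assuming that layout. The helper name `list_pdfs` is hypothetical, and the temporary folder only stands in for a real documents folder.

```python
import json
import tempfile
from pathlib import Path

def list_pdfs(conf_path):
    """Read base_path from a JSON config and return the PDF files found there."""
    with open(conf_path) as conf_file:
        base_path = Path(json.load(conf_file)["base_path"])
    # Compare suffixes case-insensitively so both .pdf and .PDF are picked up.
    return sorted(p for p in base_path.iterdir() if p.suffix.lower() == ".pdf")

# Demonstration with a throwaway folder instead of the real contracts.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs"
    docs.mkdir()
    for name in ("a.pdf", "b.PDF", "notes.txt"):
        (docs / name).touch()
    conf = Path(tmp) / "conf.json"
    conf.write_text(json.dumps({"base_path": str(docs)}))
    found = [p.name for p in list_pdfs(conf)]
    print(found)
```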

### Execute the code below for results

In [5]:
# importing required packages
import re
import os
import nltk
import json
from pathlib import Path
from tika import parser

# fetching the path to the documents folder from the configuration file
with open("conf.json") as json_conf:
    CONF = json.load(json_conf)

directory = CONF["base_path"]

# function to extract the phrases matching a chunk label from the parse tree
# note: reads the module-level variable `tree`, which is set in the loop below
def grammerPatternExtraction(chunkName):
    attribute_list = []
    # for each phrase subtree in the parse tree
    for subtree in tree.subtrees(filter=lambda t: t.label() == chunkName):
        # joining the words of the part-of-speech tagged phrase into one string
        attribute_list.append(" ".join([word for word, pos in subtree]))
    return attribute_list

# traversing through all the docs in the directory
for filename in os.listdir(directory):
    # keeping only the PDF documents (.pdf or .PDF)
    if filename.lower().endswith(".pdf"):
        # building the path to each file
        file = Path(directory) / filename
        print(filename)
        #print(file)
        print("===============================================================")
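        # using tika's parser to extract text from the pdf
        text = parser.from_file(str(file))
        # extracting the text content from tika's output (None if nothing was extracted)
        text = text["content"] or ""
        text = text.replace("\n","")
        #print(text)
        # extracting the part of the text lying between the "Rounding" and "Valuation and Timing" keywords
        match = re.search("%s(.*)%s" % ("Rounding", "Valuation and Timing"), text)
        if match is None:
            # skipping documents in which the section could not be located
            print("Rounding section not found")
            print()
            continue
        result = match.group(1)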
        # sentences tokenizer
        sentences = nltk.sent_tokenize(str(result))
        # word tokenize each sentence
        sentences = [nltk.word_tokenize(sent) for sent in sentences]
        # pos tagging each sentence
        sentences = [nltk.pos_tag(sent) for sent in sentences]
        #print(sentences)
        # defining a grammar with a regular-expression rule 
        grammar = r"""
        NP: {<DT><NNP><NNP>} 
        VP: {<MD><VB><VBN>} 
        PP: {<R.*>}
        ND: {<NNP><CD>}
            {<NNP><,><NNP>}
        """
        # creating a chunk parser
        cp = nltk.RegexpParser(grammar)
        # looping over all the sentences from extracted text
        for sent in sentences:
            tree = cp.parse(sent)
            currency_pattern = "USD|INR|EUR|GBP"
            # calling grammerPatternExtraction to extract the phrases that match each chunk grammar pattern
            amount_type = grammerPatternExtraction("NP")
            relation = grammerPatternExtraction("VP")
            indicator = grammerPatternExtraction("PP")
            amount = grammerPatternExtraction("ND")
            # printing the required keywords and relations from the phrases extracted above
            if len(amount_type)>0:
                print(amount_type[0])
            # skipping the sentences which don't have the required grammar patterns
            else:
                continue
            if len(relation)>0 and len(indicator)>0:
                print(relation[0]+" : "+indicator[0])
            # skipping the sentences which don't have the required grammar patterns
            else:
                continue
            if len(amount)>0:
                # separating currency and amount from the extracted amount
                currency = "".join(re.findall(currency_pattern,amount[0]))
                print("currency : "+currency)
                print("amount :"+amount[0].replace(currency,""))
            # skipping the sentences which don't have the required grammar patterns
            else:
                continue
            print("---------------------------")
            if len(amount_type)>1:
                print(amount_type[1])
            if len(relation)>0:
                if len(indicator)>1:
                    print(relation[0]+" : "+indicator[1])
                elif len(indicator)>0:
                    # falling back to the first indicator when only one was found
                    print(relation[0]+" : "+indicator[0])
            # separating currency and amount from the extracted amount
            if len(amount)>1:
                currency = "".join(re.findall(currency_pattern,amount[1]))
                print("currency : "+currency)
                print("amount :"+amount[1].replace(currency,""))
            elif len(amount)>0:
                currency = "".join(re.findall(currency_pattern,amount[0]))
                print("currency : "+currency)
                print("amount :"+amount[0].replace(currency,""))
            #print(tree)
            #tree.draw()
            print("---------------------------")
        print()

16017sec.pdf
The Delivery Amount
will be rounded : up
currency : USD
amount : 10,000
---------------------------
the Return Amount
will be rounded : down
currency : USD
amount : 10,000
---------------------------

20197sec.pdf
The Delivery Amount
will be rounded : up
currency : EUR
amount : 10,000
---------------------------
the Return Amount
will be rounded : down
currency : EUR
amount : 10,000
---------------------------

English law VM CSA EXECUTED.PDF
the Delivery Amount
will be rounded : up
currency : USD
amount : 10,000
---------------------------
the Return Amount
will be rounded : down
currency : USD
amount : 10,000
---------------------------

HBUS00002134-00002150.pdf
The Delivery Amount
shall be rounded : up
currency : USD
amount :lOO , OOO
---------------------------
the Return Amount
shall be rounded : down
currency : USD
amount :lOO , OOO
---------------------------
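
In the last document the amount is printed as `lOO , OOO`, which suggests that the extracted text contains the letters `l` and `O` where the digits `1` and `0` were intended. A possible post-processing step is sketched below. The helper name `normalize_amount` and the character mapping are assumptions, and the mapping should only be applied to strings that have already been identified as amounts.

```python
import re

# Assumed look-alike substitutions; extend only after inspecting real output.
LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})

def normalize_amount(raw):
    """Map look-alike letters to digits and tidy spacing around separators."""
    cleaned = raw.translate(LOOKALIKES)
    # Remove whitespace around thousands separators, e.g. '100 , 000' -> '100,000'.
    cleaned = re.sub(r"\s*,\s*", ",", cleaned)
    return cleaned.strip()

print(normalize_amount("lOO , OOO"))
print(normalize_amount(" 10,000"))
```

Amounts that are already clean pass through unchanged apart from the trimmed whitespace.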

