## 2. Identify noun phrases from given input 

### Instructions on execution

Steps to follow to execute the below code:
1. Make sure all the required packages mentioned in the "requirements.txt" file are installed
2. Place the unzipped pdf documents in a folder
3. Configure the "base_path" in the "conf.json" to the path to folder in which pdfs are placed
4. Make sure the "conf.json" file is always in the same location as this ipynb file

### Execute the below code for results

In [2]:
# importing required packages
import re
import os
import nltk
import json
from pathlib import Path
from tika import parser

# fetching_path_to_documents_location_from_configuration_file
with open("conf.json") as json_conf : 
    CONF = json.load(json_conf)

directory = format(CONF["base_path"])
#print(format(CONF["base_path"]))

# function to extract grammer pattern from text
def grammerPatternExtraction(chunkName,attribute_list):
    # for each phrase sub tree in the parse tree
    for subtree in tree.subtrees(filter=lambda t: t.label() == chunkName):
        # appending the phrase from a list of part-of-speech tagged words
        attribute_list.append(" ".join([word for word, pos in subtree]))
    return attribute_list

# traversing through all the docs in the directory
for filename in os.listdir(directory):
    # filtering the docs ending with .pdf's and .PDF's
    if filename.endswith(".pdf") or filename.endswith(".PDF"): 
        # building the path to each file
        file = os.path.join(directory, filename)
        print(filename)
        print("===============================================================")
        # using tika's parser to extract text from pdf
        text = parser.from_file(file)
        # extracting required content from tika's output
        text=text["content"]
        text = text.replace("\n","")
        #print(text)
        # sentences tokenizer
        sentences = nltk.sent_tokenize(str(text))
        # word tokenize each sentence
        sentences = [nltk.word_tokenize(sent) for sent in sentences]
        # pos tagging each sentence
        sentences = [nltk.pos_tag(sent) for sent in sentences]
        #print(sentences)
        # defining a grammar with a regular-expression rule 
        grammar = r"""
        NP: {<NN.*>}
        """
        # creating a chunk parser
        cp = nltk.RegexpParser(grammar)
        # looping over all the sentences from extracted text
        noun_phrases = []
        for sent in sentences:
            tree = cp.parse(sent)
            # calling grammerPatternExtraction method to extract the words which has required chunk grammer patterns
            noun_phrases = grammerPatternExtraction("NP",noun_phrases)
        # filterng list elements with character length one
        noun_phrases = [word for word in noun_phrases if len(word)>1]
        # filtering alphanumeric and numeric elements from list
        noun_phrases = [word for word in noun_phrases if word.isalpha()]
        # printing the length of list
        print("No. of word phrases in the documnet : "+str(len(list(set(noun_phrases)))))
        # printing the final list of noun phrases from documents
        print("Noun phrases in the document : ")
        print(list(set(noun_phrases)))
        print("---------------------------")

16017sec.pdf
No. of word phrases in the documnet : 393
Noun phrases in the document : 
['restriction', 'consideration', 'Transaction', 'Rate', 'measure', 'Anything', 'property', 'cither', 'accordance', 'use', 'valuation', 'YES', 'Minimum', 'multiple', 'currency', 'sum', 'ExchangeDate', 'Rounding', 'Alternative', 'Performance', 'Independent', 'items', 'Page', 'Termination', 'Part', 'banks', 'Time', 'Disputing', 'minus', 'Thresholds', 'Definition', 'form', 'credit', 'Original', 'Final', 'effect', 'meaning', 'New', 'Dealers', 'liens', 'Event', 'demands', 'terms', 'Recalculation', 'part', 'calculations', 'rise', 'purpose', 'types', 'Value', 'Transfer', 'Address', 'times', 'rest', 'DefaultIf', 'English', 'reference', 'users', 'clearance', 'date', 'location', 'execution', 'favour', 'Timing', 'Daily', 'stamp', 'charges', 'Definitions', 'December', 'instruments', 'Reuters', 'Association', 'Am', 'valuations', 'pledge', 'Base', 'market', 'appetite', 'WHEREOF', 'USD', 'considerations', 'Transfero

No. of word phrases in the documnet : 447
Noun phrases in the document : 
['restriction', 'Transaction', 'Rate', 'Anything', 'agreement', 'agent', 'property', 'accordance', 'option', 'use', 'Sachs', 'precedent', 'Minimum', 'multiple', 'currency', 'sum', 'Rounding', 'Peterborough', 'Alternative', 'Performance', 'items', 'Termination', 'Offsets', 'banks', 'Time', 'Disputing', 'None', 'Thresholds', 'intermediaries', 'form', 'protocol', 'deliveries', 'credit', 'Original', 'definitions', 'effect', 'meaning', 'New', 'September', 'liens', 'Event', 'demands', 'terms', 'Recalculation', 'Legal', 'States', 'duly', 'part', 'calculations', 'rise', 'purpose', 'types', 'Value', 'Transfer', 'times', 'rest', 'determination', 'English', 'CREDIT', 'reference', 'users', 'Type', 'clearance', 'vii', 'date', 'execution', 'Document', 'favour', 'Timing', 'stamp', 'charges', 'Definitions', 'instruments', 'Netting', 'Association', 'valuations', 'pledge', 'Base', 'ANNEX', 'market', 'WHEREOF', 'reduction', 'USD', 