## I. Setup, Installations, and Imports

#### (a) Installations

Run these if the packages are not already installed on your computer.

In [1]:
# ! pip install spacy
# ! pip install nltk
# ! python -m spacy download en_core_web_sm
# ! pip install svgling

#### (b) Imports

In [2]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
import glob
from tqdm import tqdm
from bs4 import BeautifulSoup

#### (c) Downloads

Run these if the NLTK data is not already on your computer.

In [3]:
# nltk.download('words')
# nltk.download('punkt')
# nltk.download('maxent_ne_chunker')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('state_union')

## II. Defined Functions Used In Program


#### (a) Named Entity Recognizer Function
**Input:** A sentence <br>
**Output:** A dict whose `entities` key holds the list of named entities recognized in that sentence

In [4]:
# Temporary code: trying to figure out how to find and save the verb
# sentence = '13.7 In the event that a court of competent jurisdiction holds that\nparticular provisions or requirements of this Agreement are in violation of any\nlaw, such provisions or requirements shall be enforced and shall remain in full\nforce and effect to the extent they are not in violation of any such law or not\notherwise unenforceable, and all other provisions and requirements of this\nAgreement shall remain in full force and effect.'
# sentence = '13.6 This Agreement is deemed to have been entered into in the State of\nIllinois and its interpretations, construction, and the remedies for its\nenforcement of breach are to be applied pursuant to and in accordance with the\nlaws of the State of Illinois. 	'
sentence = '\nEX-10\n2\nex10-11.txt\nCFC INTERNATIONAL, INC. - CONTRACT ADDENDUM\n\n ADDENDUM TO\n PURCHASE AGREEMENT - DATED MARCH 1, 2001\n\nThis Agreement (the "Addendum"), effective October 15, 2002, between CFC\nInternational, a Delaware corporation, ("CFC"), and Baxter Healthcare\nCorporation, a Delaware corporation, and its successors, affiliates and assigns\n("Baxter"), amends the Purchase Agreement ("Agreement") between the two\ncompanies dated March 1, 2001.'
words = nltk.word_tokenize(sentence)         # break the sentence down into words
tagged = nltk.pos_tag(words)                 # tag each word with its part of speech
chunks = nltk.ne_chunk(tagged, binary=False) # binary=False: named entities are classified (e.g. PERSON, ORGANIZATION)

# TODO: experiment here to get the verb and subject; once done, implement in the function below
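
One possible starting point for the verb and subject experiment is to scan the `(word, tag)` pairs that `nltk.pos_tag` returns. The sketch below is a naive heuristic, not a parser: it treats every token whose Penn Treebank tag starts with `VB` as a verb and takes the closest noun before the first verb or modal as the subject. The helper name `find_verbs_and_subject` and the hand-written tagged list are assumptions for illustration; real `pos_tag` output on contract text may be tagged differently.

```python
def find_verbs_and_subject(tagged):
    """Naive heuristic over (word, tag) pairs with Penn Treebank tags."""
    # Every token tagged VB, VBD, VBG, VBN, VBP or VBZ counts as a verb
    verbs = [word for word, tag in tagged if tag.startswith('VB')]

    subject = None
    for i, (word, tag) in enumerate(tagged):
        if tag.startswith('VB') or tag == 'MD':      # first verb or modal
            # walk backwards to the closest noun before it
            for prev_word, prev_tag in reversed(tagged[:i]):
                if prev_tag.startswith('NN'):
                    subject = prev_word
                    break
            break
    return {'verbs': verbs, 'subject': subject}

# Hand-written tags for illustration (assumed, not actual pos_tag output)
example_tagged = [('Seller', 'NNP'), ('shall', 'MD'), ('manufacture', 'VB'),
                  ('and', 'CC'), ('sell', 'VB'), ('Products', 'NNS'),
                  ('to', 'TO'), ('Buyer', 'NNP')]
result = find_verbs_and_subject(example_tagged)
print(result)
```

A heuristic like this will likely miss passive constructions and multi-word subjects; a dependency parse (for example with the spaCy model installed above) may be a better fit if that matters.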

In [5]:
def ner(sentence):

    words  = nltk.word_tokenize(sentence)        # break the sentence down into words
    tagged = nltk.pos_tag(words)                 # tag each word with its part of speech
    chunks = nltk.ne_chunk(tagged, binary=False) # binary=False: named entities are classified (e.g. PERSON, ORGANIZATION)

    entities = []

    for chunk in chunks:
        if hasattr(chunk, 'label'): # only named-entity subtrees have a label; plain (word, tag) tuples do not
            entities.append(' '.join(c[0] for c in chunk)) # join the entity's words and append to the list

    # TODO: add code here as needed to get the verb and subject (if you get them by looping over chunks, do it within the for loop above)

    return {'entities': entities,
            'random_out': np.random.uniform()}  # TODO: update the output dictionary to output the verb and subject (and delete the placeholder "random" output)
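
The `hasattr(chunk, 'label')` check works because iterating over the tree from `nltk.ne_chunk` yields plain `(word, tag)` tuples for ordinary tokens and `nltk.Tree` subtrees for named entities. Below is a minimal sketch of that behaviour on a hand-built tree, so no model download is needed. The tree contents and the helper name `labelled_entities` are assumptions for illustration; this variant also keeps each entity's label, which `ner` discards.

```python
from nltk.tree import Tree

# Hand-built stand-in for ne_chunk output (assumed, not real model output)
example_chunks = Tree('S', [
    Tree('ORGANIZATION', [('Baxter', 'NNP'), ('Healthcare', 'NNP')]),
    ('shall', 'MD'),
    ('purchase', 'VB'),
    ('Products', 'NNS'),
    ('in', 'IN'),
    Tree('GPE', [('Illinois', 'NNP')]),
])

def labelled_entities(chunks):
    """Return (entity text, entity label) pairs from a chunk tree."""
    pairs = []
    for chunk in chunks:
        if isinstance(chunk, Tree):               # named-entity subtree
            text = ' '.join(word for word, tag in chunk.leaves())
            pairs.append((text, chunk.label()))
    return pairs

entity_pairs = labelled_entities(example_chunks)
print(entity_pairs)
```

Keeping the label alongside the text could make it easier to filter later, for instance to look only at organizations.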

#### (b) Filename Traversal Function
**Input:** A filename **(e.g. inputs/ex10.txt)** <br>
**Output:** A dict of dicts: <br>
&emsp;&emsp;&emsp;&emsp;**{sentence: {sentence_level_outputs}}** <br>
&emsp;&emsp;where sentence_level_outputs is <br>
&emsp;&emsp;&emsp;&emsp;**{'analysis type/function' : output thereof}**


In [6]:
def doc_trawl(filename):

    file_output = {}
    
    with open(filename, "r") as fp:
        raw = BeautifulSoup(fp.read(), 'html.parser').get_text()
        raw_sentences = nltk.sent_tokenize(raw)
    
    for sentence in raw_sentences:
        
        # put all output of this sentence here 
        # key=analysis type/function, value=output thereof
        sentence_level_outputs = {} 
        
        # use ner function  
        sentence_level_outputs.update(ner(sentence))
        
        # any other output we want to add that doesn't rely on the ner tokenization
        # should be done here
        # to show that the plumbing works correctly, let's add variable 2:
        sentence_level_outputs['random_num'] = np.random.uniform()
        
        # Add to output dictionary
        file_output.update({sentence:sentence_level_outputs})
        
    return file_output
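
To make the return shape concrete, here is a small hand-written dictionary in the same `{sentence: {analysis: output}}` layout, with made-up values, and one way it might be queried. The helper name `sentences_mentioning` is hypothetical.

```python
# Hand-written example in the layout doc_trawl returns (values are made up)
example_output = {
    'Buyer requires foil.': {'entities': ['Buyer'], 'random_num': 0.5},
    'Seller shall sell Products to Buyer.': {'entities': ['Seller', 'Buyer'], 'random_num': 0.25},
    'The term is five years.': {'entities': [], 'random_num': 0.75},
}

def sentences_mentioning(file_output, entity):
    """Return the sentences whose 'entities' list contains the given entity."""
    return [sentence for sentence, outputs in file_output.items()
            if entity in outputs['entities']]

buyer_sentences = sentences_mentioning(example_output, 'Buyer')
print(buyer_sentences)
```

Because sentences are used as dictionary keys, a sentence that appears more than once in a file keeps only the outputs from its last occurrence.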

## III. Automation

In [7]:
file_sentence_dict = {}
files = glob.glob("inputs/*")  # get all the files in the inputs folder

for file in tqdm(files, total=len(files)):
    file_sentence_dict.update({file: doc_trawl(file)})  # add this file's sentence-level outputs

100%|█████████████████████████████████████████████| 2/2 [00:00<00:00,  2.53it/s]
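
`glob.glob("inputs/*")` returns every non-hidden entry in the folder, including subdirectories, which `doc_trawl` cannot open as text, and the order of the results is not guaranteed. A hedged variant that keeps only regular files and sorts them, assuming the same flat `inputs` folder; the helper name `list_input_files` is hypothetical:

```python
from pathlib import Path

def list_input_files(folder):
    """Return sorted paths (as strings) of the regular, non-hidden files directly inside folder."""
    return sorted(
        str(path)
        for path in Path(folder).glob('*')
        if path.is_file() and not path.name.startswith('.')  # skip subfolders and dotfiles
    )
```

The result can be passed to the loop above in place of `files`; sorting makes repeated runs process the documents in the same order.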


## IV. Unpacking the Results into a DataFrame

The DataFrame has:
- an index of (filename, sentence) pairs
- one column per sentence-level variable

Now we can run diagnostics, examine the output, and work with it more quickly.

In [8]:
def unpack_tri_level_dict(a_dict):
    df = pd.concat(map(lambda x: pd.DataFrame.from_dict(x).T, a_dict.values()), keys=a_dict.keys())
    df.index = df.index.rename(['file','sentence'])
    return df

unpack_tri_level_dict(file_sentence_dict)


file,sentence,entities,random_out,random_num
inputs/ex10.txt,"\nEX-10\n4\nex10.txt\nCFC INTERNATIONAL, INC.-BAXTER PURCHASE AGREEMENT\n\nExhibit 10.9\n\n\n PURCHASE AGREEMENT\n\n This Agreement, effective March 1, 2001 is between CFC International, a\nDelaware corporation, with offices at 500 State Street, Chicago Heights,\nIllinois 60411 (""Seller"") and Baxter Healthcare Corporation, a Delaware\ncorporation, with offices at One Baxter Parkway, Deerfield, Illinois 60015 on\nbehalf or its self and its affiliates (entities controlling, controlled by, or\nunder common control with Baxter)(""Buyer"").","[CFC, PURCHASE, CFC International, Delaware, S...",0.346138,0.181434
inputs/ex10.txt,"1.0 Background\n\n\n 1.1 Seller produces hot stamping foil which conforms and meets the\nSpecification Requirements submitted, accepted and in Seller's possession for\nthe Specification numbers listed attached in the Exhibit A., hereafter referred\nto as ""Products"".","[Seller, Requirements, Seller, Exhibit]",0.597307,0.524241
inputs/ex10.txt,Product Specifications may be revised from time to time and\nnew Specifications and numbers added by mutual agreement between parties.,"[Product, Specifications]",0.537171,0.431521
inputs/ex10.txt,Buyer\nrequires foil for use in printing flexible packaging.,[Buyer],0.424741,0.496088
inputs/ex10.txt,"2.0 Distribution\n\n\n 2.1 Subject to the terms and conditions of this Agreement, Seller shall\nmanufacture and sell Products to Buyer, and Buyer shall purchase Products for\nmanufacture into goods for use or resale in any country in the world.","[Seller, Buyer, Buyer]",0.197163,0.499701
...,...,...,...,...
inputs/ex10-11.txt,4.0 Price for Products\n-----------------------\n\nCFC will reduce the cost structure of black B10EK and black B5603AB as stated in\nExhibit 1 (new price list).,"[CFC, B10EK, B5603AB]",0.742139,0.827807
inputs/ex10-11.txt,Products shipped from Chicago Heights will be billed\nin U.S. dollars; products shipped from Germany will be billed in Euros.,"[Chicago, U.S., Germany, Euros]",0.706312,0.411017
inputs/ex10-11.txt,"10.0 Term\n----------\n\nThe addendum will extend the contract until February 28, 2007.",[],0.289073,0.106714
inputs/ex10-11.txt,13.0 Other Provisions\n----------------------\n\nSection 13.2 is deleted for B10EK and B5603AB.,"[B10EK, B5603AB]",0.01304,0.428711
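
The unpacking done by `unpack_tri_level_dict` can also be pictured without pandas: each (file, sentence) pair becomes one flat row holding that sentence's variables. A minimal sketch of this idea, where the helper name `flatten_rows`, the file names, and the values are all made up for illustration:

```python
def flatten_rows(a_dict):
    """Turn {file: {sentence: {variable: value}}} into a list of flat row dicts."""
    rows = []
    for file, sentences in a_dict.items():
        for sentence, outputs in sentences.items():
            rows.append({'file': file, 'sentence': sentence, **outputs})
    return rows

# Hypothetical three-level dictionary in the same layout as file_sentence_dict
example = {
    'inputs/a.txt': {'First sentence.': {'entities': ['CFC'], 'random_num': 0.1}},
    'inputs/b.txt': {'Second sentence.': {'entities': [], 'random_num': 0.2}},
}
rows = flatten_rows(example)
print(rows)
```

Passing such rows to `pd.DataFrame(rows).set_index(['file', 'sentence'])` should give a comparable layout to the table above.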
