## I. Setup, Installations, and Imports

#### (a) Prereqs

1. JupyterLab is running in your base conda environment.
1. Create the `nlp_stanza` environment by following the README's instructions.
1. Add that environment's Python kernel to JupyterLab with the commands below. They also download the spaCy `en_core_web_sm` model.

    ```terminal
    $ conda activate nlp_stanza          
    (nlp_stanza)$ conda install ipykernel
    (nlp_stanza)$ python -m ipykernel install --user --name=nlp_stanza --display-name "NLP Stanza Env"
    (nlp_stanza)$ python -m spacy download en_core_web_sm
    (nlp_stanza)$ conda deactivate
    $ jupyter lab          
    ```
    
**Make sure this notebook is run with the nlp_stanza conda environment's kernel. In the upper right corner, click the kernel name and change it to "NLP Stanza Env".**
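
One way to confirm the active kernel from inside the notebook is to look at the interpreter path. This is a minimal sketch: it assumes the environment folder is named `nlp_stanza`, as in the prerequisites, and the helper name `kernel_matches` is illustrative only.

```python
import sys

def kernel_matches(env_name: str, executable: str = sys.executable) -> bool:
    """Return True if the interpreter path mentions the given conda environment name."""
    return env_name.lower() in executable.lower()

print(sys.executable)                # path of the Python interpreter behind this kernel
print(kernel_matches("nlp_stanza"))  # should be True when the "NLP Stanza Env" kernel is selected
```

If the second line prints `False`, the notebook is probably still attached to the base environment's kernel.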


#### (b) Installations

Run these only if the packages are not already installed on your computer.

In [1]:
### NOTE: THESE ARE REDUNDANT IF THE PREREQS ABOVE ARE MET
# ! pip install spacy
# ! pip install nltk
# ! python -m spacy download en_core_web_sm
# ! pip install svgling
# ! python -m pip install textacy
# ! pip install stanza
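
To decide whether any of the installs above are needed, a small check like the following may help. It is a sketch rather than part of the original notebook: `missing_packages` is a hypothetical helper, and the list uses import names, which happen to match the pip names here.

```python
from importlib.util import find_spec

# Import names of the packages listed in the installation cell.
REQUIRED = ["spacy", "nltk", "svgling", "textacy", "stanza"]

def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

print(missing_packages(REQUIRED))  # an empty list means nothing needs installing
```

This only checks that the packages are importable; it does not verify that the spaCy or Stanza models have been downloaded.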

#### (c) Imports

In [2]:
import os
import stanza
# stanza.download('en') # This downloads the English models for the neural pipeline
stanza_nlp = stanza.Pipeline('en', download_method=None) # This sets up a default neural pipeline in English

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

import glob
from tqdm import tqdm
from bs4 import BeautifulSoup
import spacy

spacy_nlp = spacy.load("en_core_web_sm")  # load the small English spaCy model

nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('state_union')

  from .autonotebook import tqdm as notebook_tqdm
2023-03-30 14:14:02 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| constituency | wsj       |
| depparse     | combined  |
| sentiment    | sstplus   |
| ner          | ontonotes |

2023-03-30 14:14:02 INFO: Using device: cpu
2023-03-30 14:14:02 INFO: Loading: tokenize
2023-03-30 14:14:02 INFO: Loading: pos
2023-03-30 14:14:03 INFO: Loading: lemma
2023-03-30 14:14:03 INFO: Loading: constituency
2023-03-30 14:14:04 INFO: Loading: depparse
2023-03-30 14:14:04 INFO: Loading: sentiment
2023-03-30 14:14:04 INFO: Loading: ner
2023-03-30 14:14:05 INFO: Done loading processors!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\DonsLaptop\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\U

True

## II. Defined Functions Used In Program


#### (a) Named Entity Recognizer Function
**Input:** A sentence <br> 
**Output:** A dictionary of the form: <br>
&emsp;**{'entities':** [list of the named entities in that sentence], **'verbs':** [list of the verbs], **'subjects':** [list of the subjects], **'subj_verb_linkages':** [root subject-verb pair, or an empty list if none is found]**}**

In [6]:
def ner(sentence): 
    
    entities = []
    verbs = []
    subjects = []
    subj_verb_linkages = []
    
    # Find the named entities in the sentence
    words  = nltk.word_tokenize(sentence)        # split the sentence into word tokens
    tagged = nltk.pos_tag(words)                 # tag each token with its part of speech
    chunks = nltk.ne_chunk(tagged, binary=False) # binary=False keeps the entity types (e.g. PERSON, ORGANIZATION)
    
    for chunk in chunks:
        if hasattr(chunk, 'label'):              # only named-entity subtrees have a label; plain tokens are tuples
            entities.append(' '.join(c[0] for c in chunk)) # join the entity's tokens and store the result
    
    
    # Find the verbs and subjects in the sentence with spaCy
    doc = spacy_nlp(sentence)                          # create the spaCy Doc object (model loaded in the imports cell)
    
    verbs = [token.text for token in doc if token.pos_ == "VERB"]      # tokens tagged as verbs
    subjects = [token.text for token in doc if token.dep_ == "nsubj"]  # tokens that are nominal subjects
    
    
    # Find the root subject-verb linkage in the sentence using Stanza
    doc = stanza_nlp(sentence)
     
    subject, root_verb = None, None
    
    for word in doc.sentences[0].words:    # only the first Stanza sentence is inspected
        if word.deprel == 'root':          # the root word carries the dependency relation 'root'

            root_verb = word.text          # save the root word's text
            root_id = word.id              # save the root word's id
            
            for w in doc.sentences[0].words:                      # loop over the words in the sentence
                if w.head == root_id and w.deprel == 'nsubj':     # w is a direct dependent (child) of the root and is its subject
                    subject = w.text 
    
    if subject and root_verb: # both were found
        subj_verb_linkages = [subject, root_verb]   # [subject, root] pair
    
        
    return {'entities':entities, 
            'verbs':verbs,
            'subjects':subjects,
            'subj_verb_linkages':subj_verb_linkages} 
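
The root and subject search at the end of `ner` can be tried without loading Stanza. The sketch below uses a hypothetical `Word` tuple that carries the same four fields the function reads (`id`, `text`, `head`, `deprel`), and the parse is written by hand for illustration, not produced by a model. Unlike the loop above, it keeps the first `nsubj` child of the root rather than the last.

```python
from typing import List, NamedTuple

class Word(NamedTuple):
    """Hypothetical stand-in for a Stanza word: 1-based id, text, head id (0 = root), relation."""
    id: int
    text: str
    head: int
    deprel: str

def root_subject_linkage(words: List[Word]) -> List[str]:
    """Return [subject, root] if the root word has an nsubj child, else []."""
    root = next((w for w in words if w.deprel == "root"), None)
    if root is None:
        return []
    subject = next((w for w in words if w.head == root.id and w.deprel == "nsubj"), None)
    return [subject.text, root.text] if subject else []

# Hand-written dependency parse of "This addendum applies to CFC products."
toy = [
    Word(1, "This", 2, "det"),
    Word(2, "addendum", 3, "nsubj"),
    Word(3, "applies", 0, "root"),
    Word(4, "to", 6, "case"),
    Word(5, "CFC", 6, "compound"),
    Word(6, "products", 3, "obl"),
    Word(7, ".", 3, "punct"),
]
print(root_subject_linkage(toy))  # ['addendum', 'applies']
```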

#### (b) Filename Traversal Function
**Input:** A filename **(e.g. inputs/ex10.txt)** <br>
**Output:** A dict of dicts: <br>
&emsp;&emsp;&emsp;&emsp;**{sentence: {sentence_level_outputs}}** <br>
&emsp;&emsp;where sentence_level_outputs is <br>
&emsp;&emsp;&emsp;&emsp;**{'analysis type/function' : output thereof}**


In [7]:
def doc_trawl(filename):

    file_output = {}
    
    with open(filename, "r") as fp:
        raw = BeautifulSoup(fp.read(), 'html.parser').get_text()
        raw_sentences = nltk.sent_tokenize(raw)
    
    for sentence in raw_sentences:
        
        # put all output of this sentence here 
        # key=analysis type/function, value=output thereof
        sentence_level_outputs = {} 
        
        # use ner function  
        sentence_level_outputs.update(ner(sentence))
        
        # any other output we want to add that doesn't rely on the ner tokenization
        # should be added here
        # to show that the plumbing works correctly, let's just add a random variable:
        sentence_level_outputs['random_num'] = np.random.uniform()
        
        # Add to output dictionary
        file_output.update({sentence:sentence_level_outputs})
        
    return file_output
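
The shape of the returned dictionary can be seen with a toy version of the same loop. This is only a sketch: `toy_analyze` is a hypothetical stand-in for `ner`, and the regex splitter is a naive substitute for `nltk.sent_tokenize`. It also shows one consequence of keying on the sentence text: a sentence that appears twice in a file ends up as a single entry.

```python
import re

def toy_analyze(sentence):
    """Hypothetical stand-in for ner(): returns a small dict of per-sentence outputs."""
    return {"n_words": len(sentence.split())}

def toy_trawl(text):
    """Build {sentence: {analysis_name: output}} from raw text."""
    # Naive splitter for illustration only; the notebook uses nltk.sent_tokenize.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    output = {}
    for sentence in sentences:
        sentence_level_outputs = {}
        sentence_level_outputs.update(toy_analyze(sentence))
        output[sentence] = sentence_level_outputs  # a repeated sentence overwrites its earlier entry
    return output

result = toy_trawl("Buyer agrees to purchase foil. Seller ships it. Seller ships it.")
print(result)
# {'Buyer agrees to purchase foil.': {'n_words': 5}, 'Seller ships it.': {'n_words': 3}}
```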

## III. Automation

In [8]:
file_sentence_dict = {}
files = glob.glob("inputs/*") #get all the files in the inputs folder

for file in tqdm(files,total=len(files)):
    file_sentence_dict.update({file: doc_trawl(file)}) #update the dictionary 

100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [04:43<00:00, 141.51s/it]
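
The pattern `inputs/*` also matches subfolders, and the order in which `glob` returns paths is not guaranteed. A possible variant that keeps only regular files and sorts them is sketched below; `list_input_files` and `process_folder` are hypothetical names, and the `analyze` argument stands in for `doc_trawl`.

```python
import tempfile
from pathlib import Path

def list_input_files(folder):
    """Return the regular files directly inside `folder`, sorted by name for a stable order."""
    return sorted(p for p in Path(folder).glob("*") if p.is_file())

def process_folder(folder, analyze):
    """Map each file path (as a string) to analyze(path)."""
    return {str(path): analyze(path) for path in list_input_files(folder)}

# Demonstration on a throwaway folder with two files and one subfolder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_text("two words")
    (Path(tmp) / "a.txt").write_text("one")
    (Path(tmp) / "subdir").mkdir()
    counts = process_folder(tmp, lambda p: len(p.read_text().split()))
    print([Path(k).name for k in counts])  # ['a.txt', 'b.txt']
```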


## IV. Unpacking the Results into a DataFrame

The resulting DataFrame has:
- an index of (file, sentence) pairs
- one column per sentence-level variable

Now we can run diagnostics, examine the output, and work with it faster!

In [12]:
def unpack_tri_level_dict(a_dict):
    # one DataFrame per file (rows = sentences), stacked under a (file, sentence) MultiIndex
    df = pd.concat(map(lambda x: pd.DataFrame.from_dict(x).T, a_dict.values()), keys=a_dict.keys())
    df.index = df.index.rename(['file','sentence'])
    return df

results_df = unpack_tri_level_dict(file_sentence_dict)  # build once, reuse for display and export
display(results_df)

os.makedirs('outputs',exist_ok=True)

results_df.to_csv('outputs/test.csv',index=True)

file,sentence,entities,verbs,subjects,subj_verb_linkages,random_num
inputs\ex10-11.txt,"\nEX-10\n2\nex10-11.txt\nCFC INTERNATIONAL, INC. - CONTRACT ADDENDUM\n\n ADDENDUM TO\n PURCHASE AGREEMENT - DATED MARCH 1, 2001\n\nThis Agreement (the ""Addendum""), effective October 15, 2002, between CFC\nInternational, a Delaware corporation, (""CFC""), and Baxter Healthcare\nCorporation, a Delaware corporation, and its successors, affiliates and assigns\n(""Baxter""), amends the Purchase Agreement (""Agreement"") between the two\ncompanies dated March 1, 2001.","[CFC, INC., CONTRACT, AGREEMENT, DATED, CFC In...","[DATED, amends, dated]",[INTERNATIONAL],[],0.49866
inputs\ex10-11.txt,1.,[],[],[],[],0.151303
inputs\ex10-11.txt,"General Provisions\n----------------------\n\nAll ""terms and conditions"" of the original Agreement will remain effective as\nstated in the Agreement with only the specific revisions as stated below.",[],"[remain, stated, stated]",[Provisions],[],0.314378
inputs\ex10-11.txt,This\naddendum applies to CFC products B10EK black and B5603AB black.,"[CFC, B10EK, B5603AB]",[applies],[addendum],"[addendum, applies]",0.126645
inputs\ex10-11.txt,"2.0 Distribution\n-----------------\n\nBuyer agrees to purchase foil requirements for current users, which are wholly\nowned subsidiaries of Baxter Healthcare Corporation.","[Buyer, Baxter Healthcare Corporation]","[agrees, purchase, owned]","[Buyer, which]",[],0.217432
...,...,...,...,...,...,...
inputs\ex10.txt,"13.6 This Agreement is deemed to have been entered into in the State of\nIllinois and its interpretations, construction, and the remedies for its\nenforcement of breach are to be applied pursuant to and in accordance with the\nlaws of the State of Illinois.","[Illinois, Illinois]","[deemed, entered, applied]",[],[],0.774526
inputs\ex10.txt,"13.7 In the event that a court of competent jurisdiction holds that\nparticular provisions or requirements of this Agreement are in violation of any\nlaw, such provisions or requirements shall be enforced and shall remain in full\nforce and effect to the extent they are not in violation of any such law or not\notherwise unenforceable, and all other provisions and requirements of this\nAgreement shall remain in full force and effect.",[],"[holds, enforced, remain, remain]","[court, provisions, they, provisions]",[],0.130741
inputs\ex10.txt,"In Witness Whereof, the parties have caused this Agreement to be executed by\ntheir authorized representatives.",[Witness Whereof],"[caused, executed]",[parties],"[parties, caused]",0.036323
inputs\ex10.txt,BAXTER HEALTHCARE CORP CFC INTERNATIONAL\n\n\nBy:_____________________________ By:_____________________________\n Dave Valentini Robert E. Jurgens\nTitle: V.P.,"[BAXTER, HEALTHCARE, CORP, CFC, Dave Valentini...",[],[],[],0.713706
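
For readers who want to see what the unpacking does without pandas, the same file → sentence → variable nesting can be flattened into rows with plain Python. This is an illustrative alternative, not the notebook's method; `flatten_tri_level` and the toy data are made up for the example.

```python
def flatten_tri_level(nested):
    """Turn {file: {sentence: {column: value}}} into a list of flat row dicts."""
    rows = []
    for file, sentences in nested.items():
        for sentence, outputs in sentences.items():
            rows.append({"file": file, "sentence": sentence, **outputs})
    return rows

# Toy data with the same three-level shape as file_sentence_dict.
toy_nested = {
    "inputs/a.txt": {
        "Buyer agrees.": {"verbs": ["agrees"], "subjects": ["Buyer"]},
        "1.": {"verbs": [], "subjects": []},
    },
    "inputs/b.txt": {
        "Seller ships.": {"verbs": ["ships"], "subjects": ["Seller"]},
    },
}
rows = flatten_tri_level(toy_nested)
print(len(rows))        # 3
print(rows[0]["file"])  # inputs/a.txt
```

Each row corresponds to one (file, sentence) index entry in the DataFrame, with the remaining keys playing the role of columns.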
